Data Mining Ontologies

e-LICO Data Mining Ontology (DMO)

The e-LICO Data Mining Ontology (DMO) serves a number of different objectives; it has been modularized in order to ensure quick, goal-oriented development.

One objective is to support planning of the knowledge discovery process and building of workflows for a user task. This is being pursued through the e-LICO Protégé-based planning and Data Mining Work Flow (eProPlan-DMWF) Ontology.

The second objective is to support algorithm and model selection for data mining tasks that require search in the space of possible methods and models, e.g., feature selection or induction (modelling).  The third objective is to support meta-mining, or learning from data mining experimentation records to improve algorithm and model selection for search-intensive tasks. The ontology for Data Mining Optimization (DMOP) is being developed in pursuit of these last two objectives.

 

Data Mining Work Flow Ontology

eProPlan page moved to: http://www.e-lico.eu/?q=node/323.

A major challenge for third-generation data mining and knowledge discovery systems is the integration of different data mining tools and services for data understanding, data integration, data preprocessing, data mining, evaluation and deployment, which are distributed across a network of computer systems. In e-LICO WP6 we are building an intelligent discovery assistant (IDA) that is intended to support end-users in the difficult and time-consuming task of designing KDD workflows out of these distributed services. The assistant will support the user in checking the correctness of workflows, understanding the goals behind given workflows, enumerating AI-planner-generated workflow completions, and storing, retrieving, adapting and repairing previous workflows. It should also be an open, easily extensible system. This is achieved by basing the system on a data mining ontology (DMO) in which all the services (operators) are described together with their inputs/outputs, conditions and effects.

This approach is described in:

The DMO for planning is divided into several parts:

These ontologies are developed using Protégé 4.0 (build 111 or higher).

To use Protégé 4.0 (build 111 or higher) for planning we are developing the eProPlan plug-in. (To use a Protégé 4.0 plug-in, you first need to install Protégé and then simply place the jar file you get via the links below into the folder named “Plugin” inside the Protégé directory.)

eProPlan movies:

Comments, enhancement proposals and bug reports can be submitted in our bug-tracker.

eProPlanI

Updated 22 March 2010. eProPlan-I: a new reasoner plug-in for Protégé 4.0 combining FaCT++ for DL reasoning with the XSB-based F-logic system Flora2 for instance reasoning (incl. SWRL rules). To use this reasoner (by selecting it from the Reasoners menu in Protégé):

  1. You must install XSB version 3.1 (Sources and Binary for Windows) or 3.2 (Sources and Binary for MacOSX).
  2. XSB (version 3.1 as well as 3.2) has a problem converting negative reals to strings; if you want to use negative real values in data properties you have to store this patched string.P in $XSB/syslib/ and run make inside this directory (or run makexsb from $XSB/build after replacing string.P with our version).
  3. You must install Flora2 version 0.95 (Androcymbium). (Make sure that the shell script runflora is working; we are using it.)
  4. Finally, go to Protégé Preferences -> eProPlan-I and set the path to the Flora2 directory and the path to a temporary directory with r/w/x rights.

The plugin uses FaCT++ as the reasoner for the TBox inferences, i.e., to reason about concept subsumption from the ontology. Flora2 does the instance reasoning. It can infer concept membership based on: sub-/super-concept relations, concept definitions, domain and range restrictions of properties, and SWRL rules that conclude on concepts.
It can infer properties based on: sub-properties, property characteristics (e.g., transitive), and SWRL rules that conclude on properties. It does not (not even in the simplest form) propagate constraints along properties, e.g., from C :< P only R, C(a), P(a,b) it does not infer R(b); nor does it reason by case. It always treats differently named individuals as distinct. Due to XSB/Flora2's tabling it can evaluate many rule sets on which a "normal" Prolog interpretation would get lost in infinite recursion. In principle it reasons about both negation as "complement"-membership of concepts and properties and negation as failure; however, the Protégé 4.0 SWRL editor doesn't allow you to enter such rules.
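For instance, a purely illustrative rule that concludes on a concept (the data property numberOfMissingValues is a hypothetical name, not taken from the ontology) could look schematically like:

    DataTable(?D), numberOfMissingValues(?D, 0) -> MissingValueFreeDataTable(?D)

Given this rule and matching assertions, Flora2 would infer MissingValueFreeDataTable membership for the matching individuals.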

 

We provide a view with a console that displays all Flora2 commands and results. The user can also type commands directly into the console.
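For example, one might test the loaded knowledge base with a simple F-logic membership query typed into the console (the class name is only illustrative; the exact names available depend on how the ontology has been compiled into Flora2):

    ?X : DataTable.

which lists all individuals currently classified under that concept.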

Figure 1: The eProPlanI Console Window to inspect and test-call the reasoner/planner

 

The user can choose either to install XSB and Flora2 locally or to install them on a server. Before using the eProPlanI reasoner, the user should set the paths to Flora2 and XSB in the eProPlanI Preferences tab. The planner and compiler settings can also be changed there.

Figure 2: eProPlan Preferences for the local version

Figure 3: eProPlan Preferences for the server version

eProPlanO

Updated 22 March 2010. eProPlan-O: the basic plug-in to edit operator conditions and effects (short description).

This Protégé 4.0 (build 111+) plug-in provides a special Class View named "eProPlan: Operator Conditions & Effects" which allows the user to edit the "condition" and the "effect" annotations of subclasses of the Operator class. These are the basis of the STRIPS-like planning done by eProPlan (with the difference that we do not change the world, we only extend it with new objects; this is not really a restriction, since you can make a new possible world for every operator application).
Operator Conditions and Effects are like the normal Protégé 4.0 SWRL rules, with some syntax extensions and some restrictions resulting from the purpose of these rules.

Figure 1: The eProPlanO Tab to Model Operators

Use of concept expressions: In premisses of Conditions and Effects you can use not only concept names as one-place predicates, but also concept expressions enclosed in "[]", i.e., an atom using a concept expression looks like [concept-expression](?Var), e.g.
[DataTable and
(targetAttribute exactly 1 Attribute) and
(inputAttribute min 1 Attribute) and
(targetColumn only (DataColumn and columnHasType only (Scalar or Categorial))) and
(inputColumn only (DataColumn and columnHasType only (Scalar or Categorial)))
](?D)

Use of negation-as-failure not(atom-conjunction): In premisses of Conditions and Effects you can use negation-as-failure, e.g. not(MissingValueFreeDataTable(?D), ScaledTable(?D)), i.e., the atoms inside must be enclosed in "()". (A not used inside concept expressions is the normal DL complement reasoning, not negation-as-failure.)

In the conclusion of a condition, new(?this), OperatorName(?this) and all required inputs (i.e. uses and sub-properties, parameter and sub-properties, and simpleParameter and sub-properties) of this operator must appear. The first argument of these inputs must be ?this; the second argument must be a variable bound by the premisses (i.e., it occurs in an atom there, but not only inside a negation-as-failure), e.g. uses(?this,?D).
new(?New) is a built-in that generates a new unique individual and returns it in ?New.
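Putting these pieces together, a complete condition might look roughly as follows (a purely illustrative sketch: the operator name ReplaceMissingValues is hypothetical, while not(...), new(?this) and uses(?this,?D) follow the description above):

    DataTable(?D), not(MissingValueFreeDataTable(?D))
      -> new(?this), ReplaceMissingValues(?this), uses(?this, ?D)

Read as a condition: the operator is applicable to any data table ?D that is not known to be free of missing values.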

In the premisses of an effect OperatorName(?this) must occur. In the conclusion of an effect not only new(?New), but also newFor(?New,?Old), copy(?New,?Old,atom-conjunction), copyComplex(?New,?Old,<?V1,...,?Vn>) may occur. All variables in an atom in the conclusion must be bound by the premisses or as ?New-variable in any previous new/copy-built-in. An atom without ?this or any ?New-variable from any previous new/copy-built-in is NOT allowed, i.e. we do not allow changing the previous-world, but you can re-use (MetaData) parts of the previous-world.

newFor(?New,?Old) is a built-in that generates a new instance ?New for EACH different binding of ?Old (within ?this Operator).

copy(?New,?Old,atom-conjunction) is a built-in that, for EACH different binding of ?Old (within ?this Operator), generates a new instance ?New; everything that is STORED (not inferred) for ?Old and does NOT match any atom in the conjunction is copied (?Old should be one argument in each atom; the other could be ?_ or bound before).
copyComplex(?New,?Old,<?I1,...,?In>) is a built-in that, for EACH different binding of ?Old (within ?this Operator), generates a new instance ?New, and for each object ?I related to ?Old via a subproperty of complexObjectPart(?Old,?I) a new ?NI is generated as well, except for those ?I that are members of <?I1,...,?In> (for every binding of <?I1,...,?In> for ?Old in ?this). Everything STORED (not inferred) for ?Old or a copied ?I is copied to ?New or the corresponding new ?NI. Everything that is stored for any of the <?I1,...,?In> is not copied (whether the other argument in the property is ?Old, something different, or an ?I not in <?I1,...,?In>).
?_ is the always-different anonymous variable (which you may know from Prolog as _), and can therefore only be used where unbound variables are allowed, i.e., inside premisses and as the 2nd argument of a property in copy.
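For illustration only, an effect for the hypothetical operator sketched above could be written roughly as follows (ReplaceMissingValues and hasMissingValues are invented names; copy, produces and the binding rules follow the description above):

    ReplaceMissingValues(?this), uses(?this, ?D)
      -> copy(?New, ?D, hasMissingValues(?D, ?_)),
         produces(?this, ?New), MissingValueFreeDataTable(?New)

Here a copy of the input table is created, everything stored for ?D is carried over except the hasMissingValues assertions, and the copy is asserted to be the operator's output.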
We extended the SWRL rule checkers and implemented two new ones: one for the conditions and one for the effects. The editors also have support for autocompletion suggesting to the user what to type next.


Figure 2: The eProPlanO-Edit-Dialog for Operator Conditions & Effects

eProPlan-O tab layout for download here.

eProPlanM

Updated 22 March 2010. eProPlan-M: a plug-in to edit the task/method decompositions that are used by our HTN planning approach.

To do planning with HTNs we have to model a set of tasks: the end-user chooses one for planning (indirectly, via choosing a goal) when he presses the plan button in eProPlan-P.
Each task has a set of methods that can solve the task. In the ontology this is modelled via the object property solvedBy. Each method has a condition (editable via the condition/contribution View, same syntax as operator condition premisses, no conclusion) that has to be satisfied for the method to be chosen. Applying a method means decomposing it into a sequence of subtasks or operator(-application)s. This is modelled with the step1, ..., stepn subproperties of decomposedTo. This plugin offers the Task/Method decomposition view to allow easy editing. This is easiest understood as a very powerful grammar (we have first-order logic conditions on grammar rules and parameter passing, editable in the Method Bindings View, so we have a Turing-machine-equivalent grammar formalism); a small sketch follows the list below:

  • tasks are non-terminal symbols,
  • operators are terminal symbols,
  • methods are the grammar rules,
  • the plans (workflows) are the words in the language specified by this grammar,
  • planning is enumerating all words in the language
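As a purely illustrative sketch in the assertion style used elsewhere on this page (all task, method and subtask names are hypothetical; only solvedBy, decomposedTo and its stepN subproperties come from the ontology), a decomposition could be modelled roughly as:

    solvedBy(PreprocessingTask, CleanAndScaleMethod),
    step1(CleanAndScaleMethod, MissingValueReplacementTask),
    step2(CleanAndScaleMethod, NormalizationTask)

Read as a grammar rule: the non-terminal PreprocessingTask can be rewritten into the sequence MissingValueReplacementTask NormalizationTask via the method CleanAndScaleMethod, provided the method's condition holds.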


Figure 1: The eProPlanM Tab to Model Tasks and Methods for HTN planning

eProPlan-M tab layout for download here.

eProPlanP

eProPlan-P: The view consists of two Protégé 4.0 Individual Views, one showing all applicable Operators per IO-Object (Data, Model, Report), the other showing the current DM Workflow. It also allows the user to select one of the applicable Operators and apply it. This plugin depends on the eProPlanI plugin since it makes several calls to the Flora2 knowledge base (getting the applicable operators, applying an operator, planning).

Applicable operators
An operator has certain conditions and effects. Both can be inherited from superclasses. Each operator also has a type. The conditions together with the type specify whether an operator is applicable or not: an operator is applicable if its type is basic (not further decomposable) and its conditions can be fulfilled. In the background, the applicable operators view uses a compiler which compiles the operators' conditions and effects into Flora2, producing the file op-defs.flr. This compilation is done every time a change is made to the conditions, the effects or the operator's type and the user calls the reasoner, the "Infer" button or the "Plan" button. The file is loaded into Flora2 and then the applicableOp predicate is called. The view displays a tree consisting of three levels:

  • The first level of the tree contains all the individuals whose types are subclasses of IOObject, more precisely all the individuals which can be used by an Operator.
  • The second level of the tree consists of all applicable operators, more precisely only basic operators whose conditions are satisfied (there is a set of individuals which satisfy all the conditions).
  • The third level of the tree represents a parameter of the operator - the object properties: uses or sub-properties of uses, produces and sub-properties of produces, or parameter and its sub-properties, and the data property simpleParameter and its sub-properties. Which parameters exactly an operator has is inferred from the condition of the operator. Each solution of the premisses of the condition produces one parameter list corresponding to the conclusion of the condition.

At the top of the view there is a toolbar with three buttons used to fill the information into the view as follows:

  • The "Infer" button compiles the operators information or the ontology if any change was made the last compilation into Flora2 and refreshes the information from the tree. The button is always enabled since if the ontology is changed it will be recompiled when "Infer" is called. This only works, if eProPlan-I is selected as the current reasoner.
  • The "Apply" button is used to apply an operator with a certain parameter. In order to be able to apply an operator the user has to select its parameter list (which is on the third level of the tree), otherwise the button is disabled. When the button is pressed the operator is applied and it adds the new produced individuals in the ontology and their applicable operators.
  • The "Plan" button is enabled when eProPlan-I reasoner is selected and the ontology is classified. When clicking on the "Plan" button a dialog is displayed containing a tree with the available task instances. The tree contains only those individuals which are connected to an individual goal through the object property useTask. The user needs to select one of the individuals from the tree (on the second level of the tree) and the "Ok" button is enabled. If "Ok" is pressed the HTN AI-planner is called and the plan is displayed in the Plan Graph view.


Plan Graph View
This view displays the plan as a workflow graph. It consists of nodes (with labels and icons) – either Operator or IO-Object individuals – connected by edges – the properties that connect those nodes. The top of the view has a toolbar with buttons that can be used to zoom in/out or to delete a node/edge from the workflow. Figure 7 displays the eProPlan-P tab which contains both of the views described before.

eProPlan-P tab layout for download here.

eProPlanG

Updated 22 March 2010. eProPlan-G: a plug-in that allows the user to specify the Goal of the DM Workflow and the Data Tables to be used. At the moment goals have to be asserted with the normal Protégé methods for asserting individuals, e.g. by generating an individual my_goal and asserting that it is of type PredictiveModeling and that the planner should use the task my_task of type Demo, i.e. the following facts:

PredictiveModeling(my_goal), useTask(my_goal, my_task), Demo(my_task).

 

The plugin consists of two Protégé 4.0 Individual Views: the Data Table View and the Select Table View.

Data Table View

The Data Table View consists of two parts; the top part displays the available data sets from the RapidI repository and calls the Data Table service. The user should first choose an individual goal to which the new data set will belong. From the first drop-down list he can choose an object property that relates goals to data sets; he must also select one of the available individuals from the second drop-down list. There are three ways to acquire the data set descriptors (metadata): choosing a data set from the repository, specifying a link to a data set, or applying an operator on the initial data set. The last one is not available yet in our implementation. By pressing the URL button the user is asked for a username and password (at the moment the RapidI service for repository browsing needs authentication). If the authentication succeeds, a new dialog displays the structure of the repository.

 


Figure 1: The authentication window & the repository structure

If the user doesn't press the URL button and just writes a valid URL in the text field, then the data from this URL will be analyzed. When the Analyze button is pressed, a table is displayed with all the information necessary to describe the data table. The data table obtained from calling the web service is displayed at the bottom of the view. The data table is in fact an individual with several property characteristics. The user can edit some of the columns, such as the type of the columns and the role of the attributes, and can also replace the current attribute with others belonging to the same data table format. When the user presses "Store", the obtained set of individuals with their characteristics (with the modifications made after editing) is added to the current active ontology. Axioms asserting the format of the table and the attributes of this format are also added.


Figure 2: The eProPlan Data Table Analyser Interface

Select Table View
It displays the contents of each table. It consists of a Protégé Individual View which is available each time the user selects an individual of type Data Table. It therefore helps the user to explore the contents of several tables and compare them. As opposed to the previous view, the user can only view the table, not edit it.

 

eProPlan-G tab layout for download here.

InstanceGraph

InstanceGraph Protégé plug-in. This is just a small experiment for selecting a graph library for our eProPlan-P Workflow view, but it turned out to be quite useful for individual inspection. It uses jGraph - a Java open source graph drawing component - but also jGraph LayoutPro, which is not open source (it is commercial for commercial use and free for research). Therefore this plug-in only contains a dummy layout (random placement), but you automatically get jGraph LayoutPro's hierarchical layout if you replace the jgraphlayout.jar inside our ch.uzh.ifi.ddis.instgraph_1.0.0.jar with the jgraphlayout.jar demo version or a licensed version you acquired from jGraph.

Data Mining Optimization (DMOP) Ontology

 

→ DMOP Browser and Download Page

→ e-LICO Collaborative Ontology Development Platform

 

During the first year of the project, we focus on the use of the data mining ontology to guide algorithm selection. This is a central issue in data mining, for it is a well-known fact that no learning algorithm is inherently superior to all others. However, the algorithm selection problem is not specific to the induction task; as early as 1976, J. Rice [1] gave the following generic problem statement: given a problem x ∈ X characterized by features f(x) ∈ F, find an algorithm α ∈ A via the selection mapping S(f(x)) such that the performance mapping p(α(x)) ∈ P is maximized. Note that selection is based solely on features f(x) ∈ F describing the data. Machine learning researchers have also followed the same basic strategy, i.e., they have conditioned algorithm selection on data characteristics while treating algorithms essentially as black boxes. So far no attempt has been made to correlate dataset and algorithm characteristics, in other words to understand which intrinsic features of an algorithm explain its expected performance on the given data. As a consequence, current meta-learners cannot generalize over algorithms as they do over data sets.
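In symbols, the Rice formulation above amounts to choosing (a restatement in standard notation, not Rice's original typography):

    S(f(x)) \;=\; \operatorname*{arg\,max}_{\alpha \in A} \; p(\alpha(x)), \qquad x \in X,\; f(x) \in F,\; p(\alpha(x)) \in P,

i.e., the selection mapping S sees only the data features f(x), never any description of the algorithms themselves.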

To overcome this difficulty, we propose to extend the Rice framework and pry open the black box of algorithms. Thus our model includes an additional feature space G representing the set of features used to describe algorithms; the selection mapping S(f(x), g(α)) is now a function of both problem (data) and algorithm features. To support this approach to algorithm selection, the Data Mining Ontology (DMO) relies on the 4-tuple of concepts (Task, DataSet, Algorithm, Model). The figure below shows an extract of the Task hierarchy limited to the learning (Induction) phase of the data mining process:

 

The full task hierarchy spans the whole knowledge discovery process -- from data preprocessing through modelling to model evaluation. However, we focus our initial efforts on the induction or modelling phase, and in particular on classification. The following figure illustrates the three major families of algorithms for classification: generative, discriminative, and discriminant function methods. These three families give rise to more specialized methods (e.g., Naive Bayes, Recursive Partitioning) which are then formally specified as algorithms (e.g., NaiveBayesNormal, C4.5) and finally materialized as executable programs (Weka-NaivebayesSimple, RM-DT).

 

Beyond task and method taxonomies, an ontology for algorithm selection must reflect an understanding of each learning algorithm's inductive bias, defined as the sum of mechanisms and basic options that allow it to generalize beyond the given data.  We illustrate this on a specific algorithm in the ontology excerpt below; most nodes have been hidden to highlight the object properties that link support vector machines to the components that explain their intrinsic behavior.

The most recognizable element of an algorithm's inductive bias is the basic structure of the models it produces. To be able to adapt to the data, this model structure necessarily involves a set of free parameters. As shown in the figure below, SVMs produce a common model structure: a linear combination of kernels. Depending on the type of structure and the number of free parameters, the space of potential models remains relatively vast. For this reason, algorithms provide hyperparameters that allow users to complement algorithmic bias with other constraints, possibly informed by prior knowledge. In the case of support vector classifiers (SVC), it is the user's task to select the type of kernel, which will ultimately determine the geometry of the decision boundaries drawn in instance space. In addition, the C parameter sets the bounds within which the kernel coefficients will be allowed to evolve. Thereafter, learning is simply the process of adjusting these coefficients in order to ensure optimum performance. This is done by minimizing some objective function, typically an aggregate of error and model complexity. This objective or cost function encapsulates other ingredients of learning bias: the metrics used to quantify performance loss and model complexity, as well as the regularization parameter that governs the trade-off between them. Still another component of bias is the optimization strategy adopted to minimize this cost function. In our example, error is measured by hinge loss, complexity by the L2 norm of the coefficients, and the regularization parameter is none other than the value of the C hyperparameter. The optimization strategy used by the SVC algorithm illustrated below is Sequential Minimal Optimization, an exact (as opposed to heuristic) approach to a continuous optimization problem.
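For concreteness, one standard way of writing the soft-margin SVC objective described in this paragraph is (a hedged sketch; the exact parametrization used by a particular implementation may differ):

    \min_{w,\,b}\; \tfrac{1}{2}\,\lVert w \rVert_2^2 \;+\; C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(\langle w, \phi(x_i)\rangle + b)\bigr),

where the first term is the L2 model complexity, the sum is the hinge loss over the training examples, the feature map \phi is induced by the user-selected kernel, and the hyperparameter C governs the trade-off between the two terms. In the dual formulation this same C appears as the upper bound on the kernel expansion coefficients, which is the view taken in the paragraph above.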

 

 

 

 

 

[1] References and further details can be found in Hilario et al., SoKD 2009. A more recent description of the ontology and its use in meta-mining can be found in Hilario et al., 2011.

 

DMOP Browser and Download Page

This page is your entry point to the e-LICO Ontology for Data Mining Optimization (DMOP).  If this is your first visit, please read the brief description below before going to the browser.

 

Run the DMOP Browser


Alternatively, you can download the DMOP files if you prefer to browse the ontology offline using an ontology editor like Protégé:

The DMOP ontology is a collection of four files:

DMOP.owl contains all the DMOP classes and axioms, and defines the major concepts of the data mining domain, such as Task, Algorithm, AlgorithmAssumption, Data, Model, Operator, OptimizationProblem, CostFunction, etc., as well as axioms defining their properties and relationships. As such, it can be viewed as the terminological core of the ontology. You can use DMOP alone or you can import it through DMKB.owl.

DMKB.owl contains descriptions of instances of concepts defined in DMOP, such as individual algorithms and their implementations (Operators) in popular data mining software such as RapidMiner and Weka. It stands for Data Mining Knowledge Base and aims to be a compendium of current knowledge on algorithms and models for knowledge discovery. It imports two other files, RMOperators.owl and WekaOperators.owl; the operator files are typically accessed by importing them from DMKB. If you find that DMKB loads too slowly, it is because the reasoner must plod through more than 1800 individual assertions concerning the different algorithms and their corresponding implementations (called Operators) in RapidMiner and Weka. You don't need to import the operator files if you just want to browse the algorithm descriptions in DMKB.

DMEX-DB.owl is not part of the DM ontology and knowledge base, but is an illustrative file that gives an idea of what the Data Mining Experiments database will be.  The schema of this database will be based on DMOP and DMKB. It will contain records of all data mining experiments conducted in the e-LICO project. Each experiment represents the execution of a specific workflow and all its components: the user goal specification, the dataset used, and a description of each workflow step -- the specific task addressed (e.g., discretization, feature selection, classification), the operator selected to do the task, and the parameters, input and output of the operator execution.

As a sneak preview of the true DMEX database, DMEX-DB.owl contains a description of the characteristics of the Iris dataset as well as individual assertions concerning mock executions of operators such as Weka_NaiveBayes. Based on an execution's parameter settings,  the underlying ontology reasoner infers the specific NaiveBayes variant (e.g., NaiveBayesNormal, NaiveBayesKernel) implemented by the operator and thus gives the user access to an in-depth characterization of the underlying algorithm.
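As a purely illustrative sketch of the kind of inference involved (the execution and property names below are invented; only Weka_NaiveBayes, NaiveBayesNormal and NaiveBayesKernel appear in the text above), assertions along the lines of

    OperatorExecution(ex1), executes(ex1, Weka_NaiveBayes), useKernelEstimator(ex1, false)

could let the reasoner classify the algorithm underlying ex1 as NaiveBayesNormal rather than NaiveBayesKernel.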

The DMEX database, grounded on DMOP and DMKB, will be the source of training metadata for the e-LICO Meta-Miner, whose goal is to optimize workflows and thereby improve the performance of the e-LICO DM lab's Intelligent Discovery Assistant. As its name indicates, the specific goal of DMOP is workflow optimization through meta-learning, in addition to its more theoretical goal of providing a unified conceptual framework for the study of data mining and knowledge discovery.

DMOP is in its early stages of development and many parts of the ontology are currently placeholders awaiting volunteer developers. We suggest that you explore the more developed ontological regions concerning classification algorithms and models, e.g.,  Support Vector Classifiers and their underlying assumptions, optimization problems, objective functions, constraints, and optimization strategies.

We would appreciate all comments and suggestions on this initial version, from data mining as well as ontology engineering specialists. We would also gratefully consider all collaboration offers to participate in the development of the DM Ontology by annotating algorithms that you have authored or of which you have extensive knowledge. Please send your comments and suggestions to Melanie[dot]Hilario[at]unige[dot]ch.