Data Mining Optimization (DMOP) Ontology


→ DMOP Browser and Download Page

→ e-LICO Collaborative Ontology Development Platform


During the first year of the project, we focus on the use of the data mining ontology to guide algorithm selection. This is a central issue in data mining, for it is a well-known fact that no learning algorithm is inherently superior to all others. However, the algorithm selection problem is not specific to the induction task; as early as 1976, J. Rice [1] gave the following generic problem statement: given a problem x ∈ X characterized by features f(x) ∈ F, find an algorithm α ∈ A via the selection mapping S(f(x)) such that the performance mapping p(α(x)) ∈ P is maximized. Note that selection is based solely on the features f(x) ∈ F describing the data. Machine learning researchers have followed the same basic strategy, i.e., they have conditioned algorithm selection on data characteristics while treating algorithms essentially as black boxes. So far no attempt has been made to correlate dataset and algorithm characteristics, in other words to understand which intrinsic features of an algorithm explain its expected performance on the given data. As a consequence, current meta-learners cannot generalize over algorithms as they do over data sets.

To overcome this difficulty, we propose to extend the Rice framework and pry open the black box of algorithms. Our model thus includes an additional feature space G representing the set of features used to describe algorithms; the selection mapping S(f(x), g(α)) is now a function of both problem (data) and algorithm features. To support this approach to algorithm selection, the Data Mining Ontology (DMO) relies on the 4-tuple of concepts (Task, DataSet, Algorithm, Model). The figure below shows an extract of the Task hierarchy limited to the learning (Induction) phase of the data mining process:
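The extended selection mapping S(f(x), g(α)) can be sketched in a few lines of code. Everything in this sketch — the feature names, their values, and the toy performance predictor — is invented for illustration and is not part of DMOP; a real meta-miner would learn the performance model from experiment metadata.

```python
# Toy sketch of the extended Rice framework: selection is conditioned on
# both dataset features f(x) and algorithm features g(a), not on data
# features alone. All names and numbers below are illustrative.

def select_algorithm(data_features, algorithms, predict_performance):
    """Return the algorithm maximizing predicted performance p(a(x)),
    estimated via the extended selection mapping S(f(x), g(a))."""
    return max(algorithms,
               key=lambda a: predict_performance(data_features, a["features"]))

# Hypothetical dataset features f(x).
f_x = {"n_instances": 150, "n_features": 4, "class_entropy": 1.58}

# Hypothetical algorithm features g(a) -- the "opened black box".
algorithms = [
    {"name": "NaiveBayesNormal",
     "features": {"bias_strength": 0.9, "handles_interactions": 0.1}},
    {"name": "C4.5",
     "features": {"bias_strength": 0.4, "handles_interactions": 0.8}},
]

def predict_performance(f, g):
    # Mock meta-learned model: reward interaction handling when
    # the class distribution carries more entropy.
    return g["handles_interactions"] * f["class_entropy"] + g["bias_strength"]

best = select_algorithm(f_x, algorithms, predict_performance)
print(best["name"])
```

The point of the sketch is structural: because g(α) is an explicit argument, the selector can generalize over algorithms in the same way meta-learners already generalize over datasets.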


The full task hierarchy spans the whole knowledge discovery process -- from data preprocessing through modelling to model evaluation. However, we focus our initial efforts on the induction or modelling phase, and in particular on classification. The following figure illustrates the three major families of algorithms for classification: generative, discriminative, and discriminant function methods. These three families give rise to more specialized methods (e.g., Naive Bayes, Recursive Partitioning) which are then formally specified as algorithms (e.g., NaiveBayesNormal, C4.5) and finally materialized as executable programs (Weka-NaiveBayesSimple, RM-DT).
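The family → method → algorithm → operator refinement chain can be encoded as a simple nested structure. The dict below is only a sketch of that chain, not the ontology itself, and the placement of individual methods under families is a simplified illustration:

```python
# Illustrative encoding of the refinement chain described above:
# family -> method -> algorithm -> executable operator.
# The nesting mirrors the ontology's levels; the assignments of methods
# to families are simplified for illustration.
classification_taxonomy = {
    "GenerativeMethod": {
        "NaiveBayes": {
            "NaiveBayesNormal": ["Weka-NaiveBayesSimple"],
        },
    },
    "DiscriminantFunctionMethod": {
        "RecursivePartitioning": {
            "C4.5": ["RM-DT"],
        },
    },
}

def operators_for_family(taxonomy, family):
    """Collect every executable operator that materializes some
    algorithm of some method in the given family."""
    return [op
            for algorithms in taxonomy[family].values()
            for operators in algorithms.values()
            for op in operators]

print(operators_for_family(classification_taxonomy, "GenerativeMethod"))
```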


Beyond task and method taxonomies, an ontology for algorithm selection must reflect an understanding of each learning algorithm's inductive bias, defined as the sum of mechanisms and basic options that allow it to generalize beyond the given data.  We illustrate this on a specific algorithm in the ontology excerpt below; most nodes have been hidden to highlight the object properties that link support vector machines to the components that explain their intrinsic behavior.

The most recognizable element of an algorithm's inductive bias is the basic structure of the models it produces. To be able to adapt to the data, this model structure necessarily involves a set of free parameters. As shown in the figure below, SVMs produce a common model structure: a linear combination of kernels. Even so, depending on the type of structure and the number of free parameters, the space of potential models remains vast. For this reason, algorithms provide hyperparameters that allow users to complement algorithmic bias with other constraints, possibly informed by prior knowledge. In the case of support vector classifiers (SVC), it is the user's task to select the type of kernel, which will ultimately determine the geometry of the decision boundaries drawn in instance space. In addition, the C parameter sets the bounds within which the kernel coefficients will be allowed to evolve.

Thereafter, learning is simply the process of adjusting these coefficients to ensure optimal performance. This is done by minimizing some objective function, typically an aggregate of error and model complexity. This objective or cost function encapsulates other ingredients of learning bias: the metrics used to quantify performance loss and model complexity, as well as the regularization parameter that governs the trade-off between them. Still another component of bias is the optimization strategy adopted to minimize this cost function. In our example, error is measured by hinge loss, complexity by the L2 norm of the coefficients, and the regularization parameter is none other than the value of the C hyperparameter. The optimization strategy used by the SVC algorithm illustrated below is Sequential Minimal Optimization, an exact (as opposed to heuristic) approach to a continuous optimization problem.
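This cost function can be written down concretely. The sketch below evaluates the soft-margin SVC objective for a linear kernel — hinge loss for error, the squared L2 norm of the coefficients for model complexity, and C as the regularization parameter. The tiny dataset and weight vector are invented, and the sketch only evaluates the objective; it does not run SMO, which is what would actually minimize it.

```python
# Soft-margin SVC objective (linear kernel, primal form):
#   (1/2) * ||w||^2  +  C * sum_i hinge(y_i, f(x_i))
# The weights and data below are made up for illustration; a real SVC
# would obtain the coefficients by minimizing this objective (e.g. via
# Sequential Minimal Optimization on the dual problem).

def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a real-valued score."""
    return max(0.0, 1.0 - y * score)

def svc_objective(w, b, X, y, C):
    """L2 model complexity plus C-weighted total hinge loss."""
    complexity = 0.5 * sum(wj * wj for wj in w)

    def score(x):
        return sum(wj * xj for wj, xj in zip(w, x)) + b

    error = sum(hinge_loss(yi, score(xi)) for xi, yi in zip(X, y))
    return complexity + C * error

X = [(1.0, 2.0), (2.0, 0.5), (-1.0, -1.5)]
y = [+1, +1, -1]
w, b = (0.5, 0.5), 0.0
print(svc_objective(w, b, X, y, C=1.0))  # all points beyond margin: error term is 0
```

Raising C makes violations of the margin more expensive relative to complexity, which is exactly the bias trade-off described above.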






[1] J.R. Rice. The algorithm selection problem. Advances in Computers, vol. 15, 1976.

References and further details can be found in Hilario et al., SoKD 2009. A more recent description of the ontology and its use in meta-mining can be found in Hilario et al., 2011.


DMOP Browser and Download Page

This page is your entry point to the e-LICO Ontology for Data Mining Optimization (DMOP).  If this is your first visit, please read the brief description below before going to the browser.


Run the DMOP Browser

Alternatively, you can download the DMOP files if you prefer to browse the ontology offline using an ontology editor like Protégé:

The DMOP ontology is a collection of four files:

DMOP.owl contains all the DMOP classes and axioms, and defines the major concepts of the data mining domain, such as Task, Algorithm, AlgorithmAssumption, Data, Model, Operator, OptimizationProblem, CostFunction, etc., as well as axioms defining their properties and relationships. As such, it can be viewed as the terminological core of the ontology. You can use DMOP alone or you can import it through DMKB.owl.

DMKB.owl contains descriptions of instances of the concepts defined in DMOP, such as individual algorithms and their implementations (Operators) in popular data mining software such as RapidMiner and Weka. It stands for Data Mining Knowledge Base and aims to be a compendium of current knowledge on algorithms and models for knowledge discovery. It imports two other files, RMOperators.owl and WekaOperators.owl; these operator files are typically accessed by importing them from DMKB, and you don't need to import them yourself if you just want to browse the algorithm descriptions in DMKB. If you find that DMKB loads slowly, it is because the reasoner must plod through more than 1800 individual assertions concerning the different algorithms and their corresponding implementations (called Operators) in RapidMiner and Weka.
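For reference, pulling DMKB (and, transitively, the operator files it imports) into an OWL file of your own is a single imports declaration. The IRIs below are placeholders, not the actual published locations of the DMOP files; substitute the location of your downloaded copies.

```xml
<!-- Minimal OWL/RDF-XML sketch; both IRIs are placeholders. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Ontology rdf:about="http://example.org/my-ontology">
    <owl:imports rdf:resource="http://example.org/DMKB.owl"/>
  </owl:Ontology>
</rdf:RDF>
```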

DMEX-DB.owl is not part of the DM ontology and knowledge base, but is an illustrative file that gives an idea of what the Data Mining Experiments database will look like. The schema of this database will be based on DMOP and DMKB. It will contain records of all data mining experiments conducted in the e-LICO project. Each experiment records the execution of a specific workflow and all its components: the user goal specification, the dataset used, and a description of each workflow step -- the specific task addressed (e.g., discretization, feature selection, classification), the operator selected to do the task, and the parameters, input and output of the operator execution.

As a sneak preview of the true DMEX database, DMEX-DB.owl contains a description of the characteristics of the Iris dataset as well as individual assertions concerning mock executions of operators such as Weka_NaiveBayes. Based on an execution's parameter settings, the underlying ontology reasoner infers the specific NaiveBayes variant (e.g., NaiveBayesNormal, NaiveBayesKernel) implemented by the operator and thus gives the user access to an in-depth characterization of the underlying algorithm.
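The flavor of this inference can be approximated in plain code. The rule below is only a stand-in for the ontology reasoner's classification (which works from DMOP/DMKB axioms, not hand-written conditionals), and the parameter name mimics Weka's useKernelEstimator option:

```python
# Mock of the DMEX-DB inference: classify a NaiveBayes operator execution
# into a specific algorithm variant from its parameter settings.
# "useKernelEstimator" mimics the corresponding Weka option; the if/else
# is only a stand-in for the reasoner's axiom-based classification.

def infer_naive_bayes_variant(params):
    """Map an execution's parameter settings to a NaiveBayes variant."""
    if params.get("useKernelEstimator", False):
        return "NaiveBayesKernel"   # per-attribute kernel density estimates
    return "NaiveBayesNormal"       # per-attribute normal distributions

execution = {"operator": "Weka_NaiveBayes",
             "params": {"useKernelEstimator": True}}
print(infer_naive_bayes_variant(execution["params"]))
```

Once the variant is identified, the execution record links back to that algorithm's full characterization in DMKB.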

The DMEX database, grounded on DMOP and DMKB, will be the source of training metadata for the e-LICO Meta-Miner, whose goal is to optimize workflows and thereby improve the performance of the e-LICO DM lab's Intelligent Discovery Assistant. As its name indicates, the specific goal of DMOP is workflow optimization through meta-learning, in addition to its more theoretical goal of providing a unified conceptual framework for the study of data mining and knowledge discovery.

DMOP is in its early stages of development and many parts of the ontology are currently placeholders awaiting volunteer developers. We suggest that you explore the more developed ontological regions concerning classification algorithms and models, e.g.,  Support Vector Classifiers and their underlying assumptions, optimization problems, objective functions, constraints, and optimization strategies.

We would appreciate all comments and suggestions on this initial version, from data mining as well as ontology engineering specialists. We would also gratefully consider all collaboration offers to participate in the development of the DM Ontology by annotating algorithms that you have authored or of which you have extensive knowledge. Please send your comments and suggestions to Melanie[dot]Hilario[at]unige[dot]ch.