Data Mining Optimization (DMOP) Ontology

Last updated by hilario on Tue, 07/26/2011 - 17:38


→ DMOP Browser and Download Page

→ e-LICO Collaborative Ontology Development Platform


During the first year of the project, we focus on using the data mining ontology to guide algorithm selection. This is a central issue in data mining, for it is well known that no learning algorithm is inherently superior to all others. The algorithm selection problem is not specific to the induction task, however; as early as 1976, J. Rice [1] gave the following generic problem statement: given a problem x ∈ X characterized by features f(x) ∈ F, find an algorithm α ∈ A via the selection mapping S(f(x)) such that the performance mapping p(α(x)) ∈ P is maximized. Note that selection is based solely on the features f(x) ∈ F describing the data. Machine learning researchers have followed the same basic strategy, i.e., they have conditioned algorithm selection on data characteristics while treating algorithms essentially as black boxes. So far no attempt has been made to correlate dataset and algorithm characteristics, in other words to understand which intrinsic features of an algorithm explain its expected performance on the given data. As a consequence, current meta-learners cannot generalize over algorithms as they do over data sets.
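To make the role of f(x) concrete, the sketch below computes a few simple dataset meta-features of the kind classical meta-learning conditions selection on. The specific features (instance count, attribute count, class entropy) are common examples, not DMOP vocabulary.

```python
# Illustrative sketch: computing a small data-characteristics vector f(x).
# Feature names here are conventional meta-learning examples, not DMOP terms.
from math import log2

def dataset_features(instances, labels):
    """Return a toy f(x): size, dimensionality, and class entropy in bits."""
    n = len(instances)
    d = len(instances[0]) if instances else 0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    return {"n_instances": n, "n_attributes": d, "class_entropy": entropy}

# A balanced two-class toy dataset has class entropy of exactly 1 bit.
X = [[0.1, 1.0], [0.3, 0.5], [0.9, 0.2], [0.7, 0.8]]
y = [0, 0, 1, 1]
fx = dataset_features(X, y)
print(fx)  # n_instances=4, n_attributes=2, class_entropy=1.0
```

In the black-box regime criticized above, a vector like this is the only input to the selection mapping.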

To overcome this difficulty, we propose to extend the Rice framework and pry open the black box of algorithms. Our model thus includes an additional feature space G representing the set of features used to describe algorithms; the selection mapping S(f(x), g(α)) is now a function of both problem (data) and algorithm features. To support this approach to algorithm selection, the Data Mining Ontology (DMO) relies on the 4-tuple of concepts (Task, DataSet, Algorithm, Model). The figure below shows an extract of the Task hierarchy limited to the learning (Induction) phase of the data mining process:
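The extended selection mapping S(f(x), g(α)) can be sketched as a scoring loop over candidate algorithms, where a meta-model predicts performance from the joint data and algorithm features. The candidate names match examples used later in this page; the features, weights, and toy meta-model are invented for illustration.

```python
# Illustrative sketch of S(f(x), g(α)): score each candidate algorithm from
# joint data and algorithm features, then return the argmax.

def selection_mapping(fx, candidates, meta_model):
    """Pick the algorithm maximizing the predicted performance on f(x) ∪ g(α)."""
    best, best_score = None, float("-inf")
    for name, g_alpha in candidates.items():
        score = meta_model({**fx, **g_alpha})  # joint feature vector
        if score > best_score:
            best, best_score = name, score
    return best

# Toy meta-model encoding one hypothetical rule: prefer strongly biased
# learners on small datasets. This rule is an assumption, not a DMOP claim.
def toy_meta_model(features):
    small = features["n_instances"] < 100
    return features["bias_strength"] if small else -features["bias_strength"]

candidates = {
    "NaiveBayesNormal": {"bias_strength": 0.9},
    "C4.5": {"bias_strength": 0.4},
}
print(selection_mapping({"n_instances": 40}, candidates, toy_meta_model))
# → NaiveBayesNormal for this small toy dataset
```

Because g(α) enters the mapping, a meta-learner of this shape can in principle generalize to algorithms it has never observed, which a purely data-conditioned mapping cannot.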


The full task hierarchy spans the whole knowledge discovery process -- from data preprocessing through modelling to model evaluation. However, we focus our initial efforts on the induction or modelling phase, and in particular on classification. The following figure illustrates the three major families of algorithms for classification: generative, discriminative, and discriminant function methods. These three families give rise to more specialized methods (e.g., Naive Bayes, Recursive Partitioning), which are then formally specified as algorithms (e.g., NaiveBayesNormal, C4.5) and finally materialized as executable programs (Weka-NaivebayesSimple, RM-DT).
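The specialization chain just described (family → method → algorithm → executable program) can be sketched as a plain nested mapping. The leaf entries mirror the examples in the text; the assignment of each method to a family and the data structure itself are illustrative only.

```python
# Illustrative sketch of the family → method → algorithm → program chain.
# Method-to-family placement below is an assumption for the example's sake.
taxonomy = {
    "GenerativeMethod": {
        "NaiveBayes": {"NaiveBayesNormal": "Weka-NaivebayesSimple"},
    },
    "DiscriminantFunctionMethod": {
        "RecursivePartitioning": {"C4.5": "RM-DT"},
    },
}

def implementation_of(algorithm, tax):
    """Walk the taxonomy to find the executable program realizing an algorithm."""
    for family in tax.values():
        for method in family.values():
            if algorithm in method:
                return method[algorithm]
    return None

print(implementation_of("C4.5", taxonomy))  # → RM-DT
```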


Beyond task and method taxonomies, an ontology for algorithm selection must reflect an understanding of each learning algorithm's inductive bias, defined as the sum of mechanisms and basic options that allow it to generalize beyond the given data. We illustrate this with a specific algorithm in the ontology excerpt below; most nodes have been hidden to highlight the object properties that link support vector machines to the components that explain their intrinsic behavior.
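Links of this kind can be pictured as subject–property–object triples, the basic currency of OWL/RDF ontologies. The property and class names below only approximate those in the figure; they are not verbatim DMOP identifiers.

```python
# Illustrative triples linking an SVC algorithm to its bias components.
# Property and class names are approximations, not exact DMOP IRIs.
triples = [
    ("SVC", "hasHyperparameter", "ComplexityParameterC"),
    ("SVC", "hasHyperparameter", "KernelFunction"),
    ("SVC", "hasObjectiveFunction", "RegularizedHingeLoss"),
    ("SVC", "hasOptimizationStrategy", "SequentialMinimalOptimization"),
    ("SVC", "producesModelStructure", "LinearCombinationOfKernels"),
]

def objects_of(subject, prop, kb):
    """All objects linked to a subject by a given object property."""
    return [o for s, p, o in kb if s == subject and p == prop]

print(objects_of("SVC", "hasHyperparameter", triples))
# → ['ComplexityParameterC', 'KernelFunction']
```

Querying over such properties is what lets a meta-learner reason about the components of an algorithm rather than treating it as an opaque label.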

The most recognizable element of an algorithm's inductive bias is the basic structure of the models it produces. To adapt to the data, this model structure necessarily involves a set of free parameters. As shown in the figure below, SVMs produce a common model structure: a linear combination of kernels. Depending on the type of structure and the number of free parameters, the space of potential models can remain vast. For this reason, algorithms provide hyperparameters that allow users to complement algorithmic bias with other constraints, possibly informed by prior knowledge. In the case of support vector classifiers (SVC), it is the user's task to select the type of kernel, which ultimately determines the geometry of the decision boundaries drawn in instance space. In addition, the C parameter sets the bounds within which the kernel coefficients are allowed to evolve.

Thereafter, learning is simply the process of adjusting these coefficients to ensure optimal performance. This is done by minimizing some objective function, typically an aggregate of error and model complexity. This objective or cost function encapsulates further ingredients of learning bias: the metrics used to quantify performance loss and model complexity, as well as the regularization parameter that governs the trade-off between them. Still another component of bias is the optimization strategy adopted to minimize this cost function. In our example, error is measured by the hinge loss, complexity by the L2 norm of the coefficients, and the regularization parameter is none other than the value of the C hyperparameter. The optimization strategy used by the SVC algorithm illustrated below is Sequential Minimal Optimization, an exact (as opposed to heuristic) approach to a continuous optimization problem.
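The cost function just described can be written down directly: squared L2 norm of the coefficients for complexity, hinge loss for error, and C trading the two off. The sketch below assumes a linear kernel, so the coefficients form a plain weight vector; it evaluates the objective rather than minimizing it (the minimization is what SMO does).

```python
# Illustrative sketch of the SVC objective: 0.5*||w||^2 + C * hinge loss.
# A linear kernel is assumed so the model is w·x + b; labels are in {-1, +1}.

def hinge_loss(w, b, X, y):
    """Sum of max(0, 1 - y_i * (w·x_i + b)) over the training set."""
    total = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total += max(0.0, 1.0 - margin)
    return total

def svc_objective(w, b, X, y, C):
    """Complexity (squared L2 norm) plus C times the error (hinge loss)."""
    complexity = 0.5 * sum(wj * wj for wj in w)
    return complexity + C * hinge_loss(w, b, X, y)

# Two well-separated points classified with margin exactly 1:
# the hinge loss vanishes and only the complexity term remains.
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
print(svc_objective([0.5, 0.5], 0.0, X, y, C=1.0))  # → 0.25
```

Raising C shifts the balance toward fitting the data at the expense of a larger-norm (more complex) model, which is exactly the trade-off the regularization parameter governs.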






[1] References and further details can be found in Hilario et al., SoKD 2009. A more recent description of the ontology and its use in meta-mining can be found in Hilario et al., 2011.