Objectives

The overall project goal is to build a virtual laboratory for interdisciplinary collaboration between researchers in data mining and data-intensive sciences. To achieve this goal, the project will focus on the following major objectives:

1. Upgrade an e-science infrastructure to support collaborative, data-mining-enabled experimental research.

The project will build on myGrid, an open-source e-science middleware platform developed by one of the consortium members. The platform will be upgraded to support multi-level, multi-mode collaboration. Each level of collaboration is defined by its scale (from micro-teams to large global communities) and a corresponding degree of commitment (from iron-clad contracts to voluntary, constraint-free participation in dynamic virtual communities). Along a different dimension, collaboration modes can range from simple resource sharing, as in MySpace or Flickr, to interactive collaborative authoring, as in Wikipedia. e-LICO will gather these different levels and modes of collaboration in a single framework for scientific research. The infrastructure will be extended with mechanisms for content creation, whether by humans (e.g., ontology engineering) or by machines (e.g., multimedia data mining).

2. Develop a knowledge-driven data mining assistant to support researchers in data-intensive, knowledge-rich domains.

An intelligent data mining (DM) assistant will take in user specifications of the learning task and available data, plan a methodologically correct learning process, and suggest ranked workflows that the user can enact to achieve the prespecified data-analytical objectives. To plan the workflow and determine the algorithm to apply for a given data mining step, the assistant will harness prior knowledge stored in data mining and domain ontologies and knowledge bases. In addition, information retrieval and extraction mechanisms will allow scientists to gather background domain knowledge (e.g., from document collections) and bring it to bear in the knowledge discovery process.
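The planning step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each DM operator declares the data properties it requires and the properties it adds, and that a knowledge base supplies a prior quality score per operator. The `Operator` class, the `plan_workflows` function, the property names, and the scores are all hypothetical and are not part of e-LICO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    requires: frozenset  # data properties that must hold before the step
    adds: frozenset      # data properties that hold after the step
    prior: float         # hypothetical prior quality score from a knowledge base


def plan_workflows(operators, initial, goal, max_steps=4):
    """Enumerate operator sequences that reach the goal, ranked by mean prior."""
    goal = frozenset(goal)
    results = []
    frontier = [(frozenset(initial), [])]
    for _ in range(max_steps):
        next_frontier = []
        for state, steps in frontier:
            for op in operators:
                # An operator is applicable only if its preconditions hold.
                if op in steps or not op.requires <= state:
                    continue
                new_state = state | op.adds
                if new_state == state:
                    continue  # skip steps that change nothing
                new_steps = steps + [op]
                if goal <= new_state:
                    results.append(new_steps)
                else:
                    next_frontier.append((new_state, new_steps))
        frontier = next_frontier
    # Rank candidate workflows by the mean prior score of their operators.
    return sorted(results, key=lambda w: -sum(op.prior for op in w) / len(w))


# Illustrative operators; names and scores are made up for this sketch.
OPERATORS = [
    Operator("impute_missing", frozenset({"table"}), frozenset({"no_missing"}), 0.8),
    Operator("discretize", frozenset({"table"}), frozenset({"nominal"}), 0.6),
    Operator("decision_tree", frozenset({"table", "no_missing"}), frozenset({"model"}), 0.7),
    Operator("naive_bayes", frozenset({"table", "no_missing", "nominal"}), frozenset({"model"}), 0.6),
]

ranked = plan_workflows(OPERATORS, initial={"table"}, goal={"model"})
for workflow in ranked:
    print(" -> ".join(op.name for op in workflow))
```

Encoding preconditions in this way is one simple means of ruling out methodologically invalid sequences, such as training a model before missing values are handled. A real assistant would draw operator descriptions from the DM ontology rather than from a hard-coded list.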

3. Design and implement mechanisms for meta-mining the knowledge discovery process.

All experiments performed in e-LICO will be recorded in detail in a repository of data mining experiments to allow for replication and comparison of experiments. These meta-data can be leveraged to improve the data mining process itself, for instance by incrementally refining the DM planner's search in the design space of candidate DM operators (and workflows). A kernel-based, probabilistic meta-learner will dynamically adjust transition probabilities between DM operators, conditioned on the current application task and data, user-specified performance criteria, quality scores of workflows applied in the past to similar tasks and data, and the user's profile (based on quantified results from, and qualitative feedback on, her past DM experiments). The proposed meta-learning method will be evaluated against the baseline of a case-based DM planner, which retrieves and adapts workflows from the most similar past experiments. By comparing the DM planner's evolution over time based on these two approaches, e-LICO data miners hope to gain insights into the patterns that govern the efficacy of data mining workflows, operators and parameters.
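As a rough sketch of the two approaches contrasted above, the following code estimates operator transition probabilities from past experiments, weighting each experiment by its kernel similarity to the current task and by its quality score, and also implements the retrieval part of a case-based baseline. It rests on several assumptions: tasks are summarized as numeric meta-feature vectors, similarity is measured with an RBF kernel, and conditioning on user profiles and performance criteria is omitted. The function names and the record layout are hypothetical, and this is not the project's actual meta-learner.

```python
import math
from collections import defaultdict


def rbf_kernel(x, y, gamma=1.0):
    """Similarity between two meta-feature vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def transition_probabilities(experiments, query, gamma=1.0):
    """Estimate P(next operator | current operator) for the query task.

    experiments: list of (meta_features, operator_sequence, quality) tuples.
    Each past transition is weighted by kernel similarity times quality.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for features, workflow, quality in experiments:
        w = rbf_kernel(query, features, gamma) * quality
        for src, dst in zip(workflow, workflow[1:]):
            weights[src][dst] += w
    probs = {}
    for src, outgoing in weights.items():
        total = sum(outgoing.values())
        if total > 0:
            probs[src] = {dst: w / total for dst, w in outgoing.items()}
    return probs


def nearest_workflow(experiments, query, gamma=1.0):
    """Case-based baseline (retrieval only): workflow of the most similar task."""
    best = max(experiments, key=lambda e: rbf_kernel(query, e[0], gamma))
    return best[1]


# Made-up experiment records for illustration only.
past = [
    ((0.0, 0.0), ["impute", "tree"], 0.9),
    ((0.1, 0.0), ["impute", "svm"], 0.6),
    ((3.0, 3.0), ["impute", "svm"], 0.95),
]
print(transition_probabilities(past, query=(0.0, 0.0)))
print(nearest_workflow(past, query=(0.0, 0.0)))
```

In this sketch the probabilities shift as the query task moves through meta-feature space, whereas the baseline returns a single stored workflow. The adaptation step of the case-based planner is not modeled here.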

4. Demonstrate e-LICO on a systems biology approach to disease studies.

The goal of systems biology is to provide system-level understanding of complex biological processes. Many processes relating to human health are complex, and understanding them will promote the development of treatment, prognostics, and prevention. The pilot application will focus on diseases of the kidney and urinary pathways (KUP). Domain specialists will undertake the collaborative construction of specialized knowledge sources (e.g., ontologies, databases), to be used as reference tools by the global community of researchers in the field. They will also conduct data mining experiments on diagnostic and prognostic biomarker discovery for a specific renal or urological disease, a task that requires the integration of heterogeneous (multi-omics and multimedia) data. Differentially expressed genes, proteins, or metabolites that discriminate between the different pathological states will be used to generate hypotheses and build mathematical models that elucidate the molecular pathways implicated in disease onset and progression.
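To make the notion of differentially expressed features concrete, the sketch below ranks candidate markers by the absolute value of Welch's t-statistic between two groups of samples. This is one common screening heuristic, chosen here only for illustration; the text does not specify which statistical test the project will use. The function names, marker names, and measurements are made up.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Degenerate case: both samples are constant.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def rank_features(cases, controls):
    """Order features by how strongly they separate cases from controls.

    cases, controls: dicts mapping feature name -> list of measurements.
    """
    scores = {name: welch_t(cases[name], controls[name]) for name in cases}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)


# Toy measurements, invented for this sketch.
cases = {"marker_a": [5.1, 5.3, 4.9, 5.2], "marker_b": [2.0, 2.4, 1.8, 2.2]}
controls = {"marker_a": [2.0, 2.2, 1.9, 2.1], "marker_b": [2.1, 2.3, 1.9, 2.0]}
print(rank_features(cases, controls))
```

In practice, a screening of this kind over thousands of features would also need a correction for multiple testing, which the sketch leaves out.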