e-LICO Project

An e-Laboratory for Interdisciplinary Collaborative Research
in Data Mining and Data-Intensive Science

An EU-FP7 Collaborative Project (2009-2012)
Theme ICT-4.4: Intelligent Content and Semantics

co-funded by the European Union

The goal of the e-LICO project is to build a virtual laboratory for interdisciplinary collaborative research in data mining and data-intensive sciences. The proposed e-lab will comprise three layers: the e-science and data mining layers will form a generic research environment that can be adapted to different scientific domains by customizing the application layer.

The e-science layer, built on an open-source e-science infrastructure developed by one of the partners, will support content creation through collaboration at multiple scales and degrees of commitment — ranging from small, contract-bound teams to voluntary, constraint-free participation in dynamic virtual communities.

The data mining layer will be the distinctive core of e-LICO; it will provide comprehensive data mining tools for multimedia data (structured records, text, images, signals). Standard tools will be augmented with preprocessing or learning algorithms developed specifically to meet the challenges of data-intensive, knowledge-rich sciences, such as ultra-high dimensionality or undersampled data. Methodologically sound use of these tools will be ensured by a knowledge-driven data mining assistant, which will rely on a data mining ontology and knowledge base to plan the mining process and propose ranked workflows for a given application problem. Extensive e-lab monitoring facilities will automate the accumulation of experimental meta-data to support replication and comparison of data mining experiments. These meta-data will be used by a meta-miner, which will combine probabilistic reasoning with kernel-based learning from complex structures to incrementally improve the assistant's workflow recommendations.

e-LICO will be showcased in two pilot application areas: systems biology and digital multimedia repositories.

Download the e-LICO flyer

Download the 2009 Public Report

Download the 2010 Public Report

Partners

University of Geneva - Co-ordinator (Switzerland)

Institut National de la Santé et de la Recherche Médicale (France)

Jožef Stefan Institute (Slovenia)

National Hellenic Research Foundation (Greece)

Poznan University of Technology (Poland)

Rapid-I GmbH (Germany)

Ruđer Bošković Institute (Croatia)

University of Manchester (UK)

University of Zurich (Switzerland)

Scientific Advisory Board

Data Mining Advisory Board

Ontology Engineering Advisory Board

Kidney and Urinary Pathways Advisory Board

Objectives

The overall project goal is to build a virtual laboratory for interdisciplinary collaboration between researchers in data mining and data-intensive sciences. To achieve this goal, the project will focus on the following major objectives:

1. Upgrade an e-science infrastructure to support collaborative, data-mining-enabled experimental research.

The project will build on open-source e-science middleware developed by one of the consortium members. The myGrid platform will be upgraded to allow for multi-level, multi-mode collaboration. Each level of collaboration is defined by its scale (from micro-teams to large global communities) and a corresponding degree of commitment (from iron-clad contracts to voluntary, constraint-free participation in dynamic virtual communities). Along a different dimension, collaboration modes can range from simple resource sharing, as in MySpace or Flickr, to interactive collaborative authoring, as in Wikipedia. e-LICO will gather these different levels and modes of collaboration in a single framework for scientific research. The infrastructure will be extended with mechanisms for content creation, whether by humans (e.g., ontology engineering) or by machines (e.g., multimedia data mining).

2. Develop a knowledge-driven data mining assistant to support researchers in data-intensive, knowledge-rich domains.

An intelligent data mining (DM) assistant will take in user specifications of the learning task and available data, plan a methodologically correct learning process, and suggest ranked workflows that the user can enact to achieve the prespecified data-analytical objectives. To plan the workflow and determine the algorithm to apply for a given data mining step, the assistant will harness prior knowledge stored in data mining and domain ontologies and knowledge bases. In addition, information retrieval and extraction mechanisms will allow scientists to gather background domain knowledge (e.g., from document collections) and bring it to bear in the knowledge discovery process.
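The planning step described above can be illustrated with a minimal sketch. It is not the e-LICO assistant: the `Operator` class, the property names, and the numeric scores are all hypothetical, and a simple exhaustive search stands in for ontology-driven planning. Each operator declares the data properties it requires and provides, and the planner enumerates the operator sequences that reach the goal and ranks them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical DM operator: what it needs, what it yields, how much it is preferred."""
    name: str
    requires: frozenset  # properties the data must have before this step
    provides: frozenset  # properties established by this step
    score: float         # illustrative preference, standing in for knowledge-base advice


def plan_workflows(operators, initial, goal, max_steps=4):
    """Return (score, [operator names]) pairs that reach the goal, best first."""
    goal = frozenset(goal)
    found = []

    def extend(state, steps, score):
        if goal <= state:
            found.append((score, [op.name for op in steps]))
            return
        if len(steps) == max_steps:
            return
        for op in operators:
            applicable = op.requires <= state   # preconditions are met
            useful = not op.provides <= state   # the step adds something new
            if applicable and useful:
                extend(state | op.provides, steps + [op], score * op.score)

    extend(frozenset(initial), [], 1.0)
    return sorted(found, key=lambda item: -item[0])


# Toy operator set (assumed for illustration only).
operators = [
    Operator("impute_missing", frozenset({"table"}), frozenset({"no_missing"}), 0.9),
    Operator("svm", frozenset({"table", "no_missing"}), frozenset({"model"}), 0.7),
    Operator("decision_tree", frozenset({"table"}), frozenset({"model"}), 0.5),
]
ranked = plan_workflows(operators, initial={"table"}, goal={"model"})
for score, workflow in ranked:
    print(f"{score:.2f}  {' -> '.join(workflow)}")
```

With these toy scores, imputing and then training the SVM ranks above the decision tree alone. In a real assistant the preconditions, effects, and preferences would come from the data mining ontology and knowledge base rather than from hard-coded values.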

3. Design and implement mechanisms for meta-mining the knowledge discovery process.

All experiments performed in e-LICO will be recorded in detail in a repository of data mining experiments to allow for replication and comparison of experiments. These meta-data can be leveraged to improve the data mining process itself, for instance by incrementally refining the DM planner's search in the design space of candidate DM operators (and workflows). A kernel-based, probabilistic meta-learner will dynamically adjust transition probabilities between DM operators, conditioned on the current application task and data, user-specified performance criteria, quality scores of workflows applied in the past to similar tasks and data, and the user's profile (based on quantified results from, and qualitative feedback on, her past DM experiments). The proposed meta-learning method will be evaluated against the baseline of a case-based DM planner, which retrieves and adapts workflows from the most similar past experiments. By comparing the DM planner's evolution over time based on these two approaches, e-LICO data miners hope to gain insights into the patterns that govern the efficacy of data mining workflows, operators and parameters.
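As a rough illustration of how recorded experiments could adjust transition probabilities between operators, the sketch below weights each observed transition by the quality of the workflow it came from and by the similarity of that experiment's task to the current one. The Gaussian kernel over numeric task features, the additive smoothing, and all names and values are assumptions made for this example. The sketch is a simplified stand-in for the kernel-based, probabilistic meta-learner and ignores performance criteria and user profiles.

```python
import math
from collections import defaultdict


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian similarity between two tasks described by numeric meta-features."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def transition_probabilities(experiments, task, operators, smoothing=0.1):
    """Estimate P(next operator | current operator) for the given task.

    Each past experiment is a (task_features, workflow, quality) triple; its
    transitions count in proportion to quality times similarity to `task`.
    """
    weights = defaultdict(float)
    for features, workflow, quality in experiments:
        w = quality * rbf_kernel(features, task)
        for current, nxt in zip(workflow, workflow[1:]):
            weights[(current, nxt)] += w

    probs = {}
    for current in operators:
        # Additive smoothing keeps unseen transitions possible.
        total = sum(weights[(current, nxt)] for nxt in operators) + smoothing * len(operators)
        probs[current] = {
            nxt: (weights[(current, nxt)] + smoothing) / total for nxt in operators
        }
    return probs


# Toy experiment repository (made-up values, for illustration only).
operators = ["impute_missing", "svm", "decision_tree"]
experiments = [
    ((0.0, 0.0), ["impute_missing", "svm"], 0.9),            # similar task
    ((3.0, 3.0), ["impute_missing", "decision_tree"], 0.9),  # dissimilar task
]
probs = transition_probabilities(experiments, task=(0.0, 0.0), operators=operators)
print(probs["impute_missing"])
```

Because the first experiment was run on a task identical to the current one, the transition from imputation to the SVM receives most of the probability mass, while the transition seen only on a distant task stays close to its smoothed prior.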

4. Demonstrate e-LICO on a systems biology approach to disease studies.

The goal of systems biology is to provide system-level understanding of complex biological processes. Many processes relating to human health are complex, and understanding them will promote the development of treatments and prognostic tools as well as prevention. The pilot application will focus on diseases of the kidney and urinary pathways (KUP). Domain specialists will undertake the collaborative construction of specialized knowledge sources (e.g., an ontology and a database), to be used as reference tools by the global community of researchers in the field. They will also conduct data mining experiments on diagnostic and prognostic biomarker discovery for a specific renal or urological disease, a task that requires the integration of heterogeneous (multi-omics and multimedia) data. Differentially expressed genes, proteins or metabolites that discriminate the different pathological states will be used to generate hypotheses and build mathematical models that elucidate the molecular pathways implicated in disease onset and progression.
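To make the notion of differentially expressed features concrete, the sketch below ranks candidate markers by the absolute value of Welch's t statistic between two groups of samples. The choice of statistic, the feature names, and the measurements are assumptions for illustration; the project text does not prescribe a particular test, and real biomarker discovery would also need multiple-testing correction and validation.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_markers(expression, labels):
    """Order features from most to least discriminative between 'case' and 'control'.

    expression maps a feature name to one measurement per sample;
    labels gives the group of each sample in the same order.
    """
    scored = []
    for feature, values in expression.items():
        case = [v for v, group in zip(values, labels) if group == "case"]
        control = [v for v, group in zip(values, labels) if group == "control"]
        scored.append((abs(welch_t(case, control)), feature))
    return [feature for _, feature in sorted(scored, reverse=True)]


# Toy measurements (made up): three case samples followed by three controls.
labels = ["case", "case", "case", "control", "control", "control"]
expression = {
    "marker_A": [5.1, 4.9, 5.3, 2.0, 2.2, 1.9],
    "marker_B": [3.0, 3.2, 2.9, 3.1, 3.0, 3.2],
}
print(rank_markers(expression, labels))
```

Here `marker_A` separates the two groups far more clearly than `marker_B` and is ranked first. The function assumes at least two samples per group and non-zero variance in at least one of them.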

E-lab architecture

The proposed e-lab comprises three layers: the e-science layer and the data mining layer form a generic knowledge discovery platform that can be adapted to different scientific domains by customizing the application layer. The project's overall research strategy can be summarized as the bottom-up construction of this three-tiered architecture.

The foundation of the e-science layer is a suite of open-source components developed by the University of Manchester (e.g., the myGrid e-science platform and the Taverna workflow editor). To build the e-LICO infrastructure (figure below), these components will be extended with tools for content creation (e.g., semantic annotation, ontology engineering) as well as mechanisms for multiple levels and modes of collaboration in experimental research.

The e-science layer


The data mining layer is the core of e-LICO; it will provide a comprehensive set of data mining tools for multimedia data (structured records, text, images, signals). Standard tools will be complemented with preprocessing or learning algorithms developed specifically to address the problems of data-intensive, knowledge-rich sciences, such as extremely high dimensionality and undersampling, learning from heterogeneous data, and incorporating prior knowledge into learning. Methodologically sound use of these tools will be ensured by a knowledge-driven, planner-based data mining assistant (WP6), which will rely on a data mining ontology and knowledge base to plan the data mining process and propose ranked workflows for a given application problem. Extensive e-lab monitoring facilities will support comparison and analysis of experiments by a meta-miner, whose role will be to ensure that the data mining assistant's workflow recommendations improve with experience.

The application layer is always domain-specific. In the generic e-lab, the application layer is an empty shell. It is built by the domain user who will use the tools available in the e-science and DM layers to:

1. customize the infrastructure to the needs of the domain, e.g., identify in the e-science layer all the services that the user team would like to access and use;

2. either access existing domain ontologies or create a domain ontology using the collaborative authoring tools provided in the e-science layer;

3. design, run and analyse data mining experiments using tools (algorithms, workflows, models, datasets) in the data mining layer;

4. semantically annotate DM experiments and input data using the semantic annotation tools in the e-science layer.

The data mining and application layers


In the e-LICO prototype, the application layer will be instantiated for a systems biology task: biomarker discovery and pathway modelling for diseases affecting the kidney and urinary pathways (KUP). Domain-specific knowledge sources, such as a specialized ontology and a database on kidney and urinary pathways, will be collaboratively authored by European specialists in the area. The data mining e-lab will be showcased on the discovery of molecular markers and pathways involved in the onset and progression of diseases affecting the KUP, in particular bladder cancer.

The final deliverable of the project will be a free, experimental prototype open to continuous collaborative expansion and refinement by the research community.