Software

RapidMiner Extensions

RapidMiner is the open source data mining solution used within e-Lico for executing data mining operators and workflows. Within e-Lico, we have developed various extensions for RapidMiner.

Using the RapidMiner Community Extension, the user can share data mining workflows on the myexperiment.org portal.

The Image Mining Extension uses the image mining Web service provided by NHRF to execute image mining methods within RapidMiner.

The Market Basket Analysis Extension provides RapidMiner operators that build on the association rule mining framework but provide additional analytic capabilities beyond simple associations.

IDA Wizard

RapidMiner IDA Extension

This RapidMiner extension uses the Intelligent Discovery Assistant (IDA) developed at UZH to create data mining workflow plans directly in RapidMiner. Input objects can be loaded from the repository, and the plans can be immediately loaded in RapidMiner. The extension is still in the beta stage.

Installation

To install, make sure you have a running and patched Flora and XSB (as described on the eProPlan page), download the Jar attached to this page, and copy it to your rapidminer/lib/plugins folder. In RapidMiner, open the preferences, navigate to the e-LICO tab, and specify the path to your Flora installation as well as an (arbitrary) temporary directory.

 

Usage

To use the extension, choose "Start IDA Wizard" from the Tools menu. You can then specify a Task and Goal and drag data sets onto the specific data requirements. When planning completes, you can open the generated process in RapidMiner and continue working with it as with any other process.

This video shows how you can use the IDA Wizard:

 

Attachment: rapidminer-IDA-5.0.000.jar (8.34 MB)

Community Workflow Sharing Tool

Using the RapidMiner Community Extension, you can share your RapidMiner workflows with data miners all over the world via the myexperiment.org portal. On the myExperiment web site you can discuss data mining processes, exchange workflows, and meet data miners working on similar problems. Based on workflows shared on myexperiment.org, we will develop tools to assist you in designing new processes.

Using the Community Extension

To show the myExperiment browser, go to the View menu in RapidMiner and enable the MyExperiment Browser tab. We recommend minimizing the tab and opening it whenever you need it. You can immediately start browsing the workflow repository, but in order to upload workflows, you need to create an account on myexperiment.org.

Browsing and Opening Workflows

Within RapidMiner, you can open all public RapidMiner workflows. You can recognize RapidMiner workflows by the "Workflow Type" entry in the upper right corner. To view a workflow on the Web, click the "Browse" button to open a Web browser showing the corresponding page.

 

Browsing and downloading workflows on myexperiment.org

Sharing and Uploading your Workflows

To upload your current RapidMiner process to myexperiment.org, click on upload and fill in the following dialog:

Uploading a workflow to myexperiment.org

The description is automatically extracted from the process comment if it exists. Make sure to set the sharing permissions such that people can actually see your workflow.

Installation

To install the Community Extension, download the jar file and place it in the lib/plugins directory of your RapidMiner installation. Alternatively, and more easily, use the RapidMiner update mechanism by choosing "Update RapidMiner" in the Help menu.

Attachment: rapidminer-Community-5.0.001.jar (61.77 KB)

R Package

The RapidMiner R extension allows integration of RapidMiner with the widely used open source statistics package R.

Eight R modelling methods are directly available as RapidMiner operators. RapidMiner ExampleSets can be used directly as input to R operators and are internally converted to R's table representation. Furthermore, the user can execute arbitrary R scripts as RapidMiner operators. To avoid redefining frequently used scripts for each process, users can save such re-usable scripts as custom RapidMiner operators.
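As a rough illustration of the row-to-column conversion involved, the sketch below (plain Python, not the extension's actual code) turns a row-oriented example table into the column-oriented layout that R's data.frame uses internally:

```python
# Illustrative sketch only: an ExampleSet is row-oriented, while an R
# data.frame stores each attribute as one column vector. The conversion
# regroups the values accordingly.

def examples_to_columns(examples):
    """Turn a list of row dicts into a dict of column lists (R-style)."""
    if not examples:
        return {}
    columns = {name: [] for name in examples[0]}
    for row in examples:
        for name, value in row.items():
            columns[name].append(value)
    return columns

rows = [
    {"age": 34, "label": "yes"},
    {"age": 51, "label": "no"},
]
print(examples_to_columns(rows))  # {'age': [34, 51], 'label': ['yes', 'no']}
```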

For the analysis of bio-data, the R Bioconductor packages are particularly relevant. These also work with this extension, and an example process has been posted to myExperiment.

Intro Video

Download

The extension is available from the RapidMiner update server. Furthermore, more up-to-date development builds are posted on this site.

Attachment: rapidminer-R Extension-5.1.000.jar (2.27 MB)

Image Mining Operators

With this RapidMiner extension, you can use image mining Web services provided by NHRF within RapidMiner.

Using the Image Mining Operators

The image mining plugin contains two groups of operators (Image Transformation and Feature Extraction) plus three helper operators (List Images, Upload Images, and Visualize Images):

List Images

Use this operator to specify a list of directories that will be scanned for images. The operator then creates an example set containing one example (row) per image and three attributes (columns): "directory" (the directory from which the image was loaded), "file" (the filename of the image), and "id" (a generated id). After that, you can modify this information or filter examples before you proceed with uploading. A typical step would be to use the "directory" attribute as the label.
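The behaviour of "List Images" can be sketched as follows. This is an illustrative reimplementation in Python, not the operator's source; the set of recognized file extensions is an assumption:

```python
# Hypothetical sketch of the "List Images" behaviour: one row per image,
# with "directory", "file" and "id" attributes. The extension filter is
# an assumption for illustration.
import os

def list_images(directories, extensions=(".png", ".jpg", ".gif")):
    examples = []
    next_id = 0
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            if name.lower().endswith(extensions):
                examples.append({
                    "directory": directory,  # usable as a label later
                    "file": name,
                    "id": next_id,
                })
                next_id += 1
    return examples
```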

Upload Images

This operator takes an example set as produced by the "List Images" operator and uploads the listed images to the server. An "image_reference" column is then appended to the example set. This reference can be used in all subsequent operators to reference or download the image.

Image Transformation

This group of operators performs image transformation methods on the server; the image itself is not downloaded. These operators require an "image_reference" column to exist, and each generates another image_reference column referencing the result.

Feature Extraction

These operators extract features from images, transforming them into example sets (tables) that can then be further processed by RapidMiner. In principle, a feature extraction method turns each image into a table, but most of the time this table will have only a single row. These operators therefore generate a collection of tables, which you can merge into one using the regular "Append" operator.
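Conceptually, the merge step looks like this; a minimal Python sketch of what "Append" does with a collection of (often single-row) per-image feature tables:

```python
# Illustrative sketch: "Append" concatenates a collection of tables that
# share the same columns into one table. Here a table is a list of row
# dicts; the column names are invented for the example.

def append_tables(tables):
    """Concatenate tables (lists of row dicts) sharing the same columns."""
    merged = []
    for table in tables:
        merged.extend(table)
    return merged

per_image = [
    [{"image": "a.png", "mean_brightness": 0.41}],
    [{"image": "b.png", "mean_brightness": 0.73}],
]
print(append_tables(per_image))
```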

Visualize Images

If you insert this operator, an image inspection dialog will be shown in the result view whenever you double-click an example in a plot. This dialog shows the different (intermediate) versions of the image generated so far; more precisely, all "image_reference" attributes in the example set can be displayed.

An image mining process in RapidMiner

The image above shows a typical RapidMiner image mining process using the above-mentioned operators.

Installation

To install the Image Mining Extension, download the jar file and place it in the lib/plugins directory in your RapidMiner installation directory. Since it is still in the beta stage, it is not yet available from the RapidMiner update server.

Attachment: rapidminer-ImageMining-1.0.jar (2.75 MB)

Market Basket Analysis Operators

This extension consists of operators provided by PUT that implement three pattern mining algorithms for extended market basket analysis. These models build upon the association rule mining framework but provide additional analytic capabilities beyond simple associations. The first model mines a transactional database for negative patterns represented as dissociation itemsets and dissociation rules. The second model, substitutive itemsets, filters items and itemsets that can be used interchangeably as substitutes, i.e., itemsets that appear in the transactional database in very similar contexts. Finally, the third model, recommendation rules, uses an additional itemset interestingness measure, namely coverage, to construct a set of recommended items using a greedy search procedure. All operators accept a collection of discovered frequent patterns as input data and produce itemsets and rules as their outputs. The figure below shows an example of using the proposed operators within a data mining workflow inside RapidMiner.
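As an illustration of the coverage-driven greedy idea behind the third model, the sketch below (not PUT's implementation) repeatedly picks the item that covers the most not-yet-covered transactions:

```python
# Illustrative sketch of greedy, coverage-based recommendation: pick up
# to k items, each time choosing the item present in the largest number
# of transactions not yet covered by an earlier pick.

def recommend_by_coverage(transactions, k):
    items = {i for t in transactions for i in t}
    uncovered = set(range(len(transactions)))
    recommended = []
    while uncovered and len(recommended) < k:
        candidates = items - set(recommended)
        if not candidates:
            break
        best = max(
            candidates,
            key=lambda item: sum(1 for t in uncovered if item in transactions[t]),
        )
        gain = sum(1 for t in uncovered if best in transactions[t])
        if gain == 0:
            break  # remaining items cover nothing new
        recommended.append(best)
        uncovered = {t for t in uncovered if best not in transactions[t]}
    return recommended
```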

Installation

To install the Market Basket Analysis Extension, download the jar file and place it in the lib/plugins directory in your RapidMiner installation directory.

 

Attachment: rapidminer-MarketBasketAnalysisOperators-1.0.000.jar (45.79 KB)

Subgroup Discovery Operator

Two subgroup discovery algorithms are available in this RapidMiner extension: the SD algorithm and the CN2-SD algorithm. Both implementations are fully compatible with other RapidMiner operators. This was achieved by taking the existing RapidMiner Subgroup Discovery operator as a guideline and using the white paper provided by RapidMiner support. Some further details are available here.

Installation: Put the jar files in "lib/plugins".

Attachments:
rapidminer-CN2-SD-1.0.0.jar (23.06 KB)
rapidminer-SD-1.0.0.jar (23.1 KB)

Covering Feature Selection Operator

The operator selects a small set of features from which a complete and consistent classifier for all examples can be constructed.

The operator supports the construction of redundant feature sets, so that more than one complete and consistent hypothesis can be generated and outliers (examples whose removal reduces the size of the minimal feature set) can be detected and eliminated from the set of examples. The included help file describes the theoretical basis of the implemented algorithm and specifies the available parameters.
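The covering idea can be sketched as follows. This is an illustrative greedy variant of the covering principle, not the operator's actual algorithm: features are selected until every pair of differently-labelled examples is separated by at least one selected feature.

```python
# Hedged sketch of covering feature selection: greedily pick features
# until every pair of examples with different labels differs on some
# selected feature (so a complete and consistent classifier exists).
from itertools import combinations

def covering_features(examples, labels):
    n_features = len(examples[0])
    uncovered = {
        (i, j)
        for i, j in combinations(range(len(examples)), 2)
        if labels[i] != labels[j]
    }
    selected = []
    while uncovered:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        best = max(
            candidates,
            key=lambda f: sum(1 for i, j in uncovered
                              if examples[i][f] != examples[j][f]),
        )
        covered = {(i, j) for i, j in uncovered
                   if examples[i][best] != examples[j][best]}
        if not covered:
            break  # remaining pairs are inseparable: inconsistent data
        selected.append(best)
        uncovered -= covered
    return selected
```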

Installation: Put the jar file in "lib/plugins".

Attachments:
rapidminer-CFSwOD.jar (28.19 KB)
CFSwOD-description.doc (35.5 KB)

Taverna RapidMiner plugin

We have developed a RapidMiner plugin for Taverna. This plugin exposes all of the data mining operators from RapidMiner as services that can be used in the Taverna workflow system. To use this plugin you will need Taverna 2.3 and access to a RapidAnalytics server.

 

Taverna 2.3 can be downloaded from http://www.taverna.org.uk

RapidAnalytics can be downloaded from Rapid-I at http://rapid-i.com

The Taverna plugin should be considered a beta release, and we would be happy to assist anyone who is interested in using the plugin or learning more about it. If you want to have a quick play with it, you can use our local RapidAnalytics server running at The University of Manchester. For more information please e-mail Simon Jupp at simon.jupp [at] manchester.ac.uk

The RapidAnalytics server can be accessed at http://rpc295.cs.man.ac.uk:8081. Login with username: guest and password: password.

The RapidMiner plugin can be installed in Taverna from the plugins menu. In the Advanced menu, choose "Updates and Plugins", then "Find new plugins". Select the Rapid Miner Service Type Plugin 1.1.0 and click Install; you will be prompted to restart Taverna to use the plugin. Once restarted, open File/Taverna > Preferences, and under the "e-LICO" section enter the RapidAnalytics server address (http://rpc295.cs.man.ac.uk:8081) and click Apply. After restarting Taverna once more, the Service Panel will show a new entry called Rapid Miner Services.

Example workflows and help packs can be found at myExperiment: http://www.myexperiment.org/groups/402.html

Additional information can be found at http://www.mygrid.org.uk/dev/wiki/display/elico/Rapid+Miner+Plugin

plugin list

 

 

Attachments:
rm1.png (60.03 KB)
plugins_list.png (51.2 KB)

eProPlan

eProPlan is an ontological modeling environment for planning Knowledge Discovery (KDD) workflows. We use ontological reasoning combined with AI planning techniques to automatically generate workflows for solving Data Mining (DM) problems. KDD researchers can easily model not only their DM and preprocessing operators but also their DM tasks, which are used to guide the workflow generation. eProPlan allows modeling new operators and producing a task-method decomposition grammar to solve DM problems. Designed as a plugin for the open-source ontology editor Protege, eProPlan exploits the advantages of the ontology as a formal model of the domain knowledge. Rather than overloading ontological inference for planning, we extend the ontological formalism with the main components of a plan, namely operator conditions and effects for classical planning, and a task-method decomposition grammar for HTN planning.

Download

Prerequisites

- XSB 3.2 (we are currently working on porting to XSB 3.3 as well)

- Flora 2 version 0.95

To use this reasoner (select it from the Reasoners menu in Protege):

  1. Install XSB version 3.2 (sources and binary for Mac OS X).
  2. XSB 3.2 has a problem converting negative reals to strings; if you want to use negative real values in data properties, you have to store this patched string.P in $XSB/syslib/ and run make inside this directory (or run makexsb from $XSB/build after replacing string.P with our version).
  3. Install Flora2 version 0.95 (Androcymbium). Make sure that the shell script runflora works; we use it.
  4. Go to Protege Preferences->eProPlan-I, select "Local" as the reasoner type, and set the path to the Flora 2 directory as well as the path to a temporary directory with read/write/execute rights.
  5. Also set the TBOX reasoner you want to use. For Protege 4.0, Pellet works fine; for 4.1, both Pellet and HermiT work.

The plugin uses the selected reasoner for TBOX inferences, i.e., to reason about concept subsumption in the ontology. Flora-2 does the instance reasoning. It can infer concept membership based on sub-/super-concept relations, concept definitions, domain and range restrictions of properties, and SWRL rules that conclude on concepts.
It can infer properties based on sub-properties, property characteristics (e.g., transitivity), and SWRL rules that conclude on properties. It does not (even in the simplest form) propagate constraints along properties (e.g., from C :< P only R, C(a), P(a,b) it does not infer R(b)), nor does it reason by case. It always treats differently named individuals as distinct. Due to XSB/Flora's tabling, it can evaluate many rules on which a "normal" Prolog interpretation would get lost in infinite recursion. In principle it reasons about both negation as "complement" membership of concepts and properties and negation as failure; however, the Protégé 4.0 SWRL editor does not allow you to enter such rules.
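The value of tabling can be illustrated outside Prolog: a naive top-down evaluation of a recursive rule such as path(X,Y) :- edge(X,Z), path(Z,Y) loops forever on a cyclic edge relation, while a tabled evaluation terminates. The bottom-up fixpoint below (plain Python, purely illustrative) achieves the same effect as tabling for this rule:

```python
# Illustrative sketch: computing the transitive closure of a cyclic edge
# relation. Naive top-down recursion on path(X,Y) :- edge(X,Z), path(Z,Y)
# would loop on the cycle a -> b -> a; the bottom-up fixpoint below (the
# effect tabling achieves) always terminates.

def transitive_closure(edges):
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure

edges = {("a", "b"), ("b", "a"), ("b", "c")}
print(sorted(transitive_closure(edges)))
```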

Populous

NEW: SWAT4LS tutorial http://www.e-lico.eu/populous_swat4ls_2011

Populous

Populous is a generic tool for building ontologies from simple spreadsheet-like templates. The Populous approach is useful when a repeating ontology design pattern emerges that needs to be populated en masse. The use of a simple interface, similar to that of a spreadsheet, means that the templates can be populated by users with little or no knowledge of ontology development. Once these templates are populated, Populous supports transforming the data into an OWL ontology using an expressive pattern language.

Spreadsheets are currently transformed into OWL/RDF using the Ontology Pre-Processing Language v2 (OPPL 2). OPPL 2 is a powerful scripting language for generating and manipulating OWL axioms. Populous provides a wizard-like interface, found in the "Tools" menu, to map spreadsheet data to variables in OPPL patterns.
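As a rough picture of the kind of transformation an OPPL pattern performs, each spreadsheet row is expanded into axioms following a fixed template. The sketch below is plain Python; the template text and names are invented for illustration and are not OPPL syntax:

```python
# Hypothetical illustration of pattern-based axiom generation: every
# workbook row fills the same design pattern. The Manchester-syntax-like
# template and the row values are invented for this example.

PATTERN = "Class: {cell} SubClassOf: {parent}, part_of some {anatomy}"

def rows_to_axioms(rows):
    """Expand each row dict into one axiom string via the pattern."""
    return [PATTERN.format(**row) for row in rows]

rows = [
    {"cell": "UrothelialCell", "parent": "EpithelialCell", "anatomy": "Bladder"},
]
print(rows_to_axioms(rows)[0])
```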

Populous is built on top of RightField. RightField can be used to create Excel spreadsheets that have ontology based restrictions on allowable values in selected cells. RightField spreadsheets allow scientists to annotate their data using standard terminology from ontologies rather than using free text annotations.

Populous and RightField are both open-source, cross-platform Java applications. They use Apache POI for interacting with Microsoft documents and manipulating Excel spreadsheets.


Availability

The alpha release of the Populous extension (v0.9) is available here for download.

 

Installation

Populous requires Java 1.6.

1. Unzip the file

2. Windows users: execute run.bat

3. Mac/Unix users: execute run.command

Documentation

Documentation is currently provided by a screencast demo of Populous in action. There is a set of slides on Nature Precedings from a recent presentation about Populous given at SWAT4LS 2010 here. The accompanying files used for the demo are provided in the example folder of the downloaded zip file. If you are interested in this project and its development, please use the contact details below.


 

Contact

Simon Jupp (simon.jupp [at] manchester.ac.uk) and Robert Stevens (robert.stevens [at] manchester.ac.uk)

 

Background

Ontologies are used to generate terminologies that describe the kinds (classes) of things (instances) we like to talk about within a particular domain. In the life sciences, for example, there are lots of kinds of things we like to talk about, and ontologies give us a mechanism to ensure we are talking about the same kinds of things. Standardising the way we annotate (or talk about) data makes it easier to integrate, process, and analyse the data. For all of this to work, we need to develop many ontologies to describe all the different kinds of things we are interested in. Developing such ontologies is no small undertaking, so we are constantly looking for new ways to reduce the ontology development bottleneck. One observation is that we often develop patterns to describe similar kinds of things; once these patterns have been identified, they can be left to domain experts to populate. Whilst ontology development environments provide support for template population, they often have steep learning curves, especially for users new to ontologies. We developed Populous as a lightweight tool with a familiar spreadsheet-style interface for domain experts to populate these ontology templates. The use of a transformation language like OPPL means we can separate the knowledge from the underlying ontological representation. This is particularly advantageous in situations where we want to radically change the modelling or offer different representations of the same data.

Bugs

 

 

SWAT4LS Populous tutorial 2011

Populous tutorial - SWAT4LS, London, UK 2011

 

In order to follow this tutorial you will need a copy of the latest version of Populous and the associated tutorial material.

These can be downloaded from the downloads page at http://code.google.com/p/owlpopulous

You will need Populous_v1.1-beta.zip and the material in Populous_tutorial_SWAT4LS_2011.zip

We will refer to the root folder of the tutorial material as $FILES from now on.

 

Task 1 – Start Populous and load the initial workbook.

 

  • Start Populous using the appropriate script
  • Open the basic cell type workbook from $FILES/Workbook/cell_types.xls

 

You need to load some ontologies into Populous before you can begin to apply ontological restrictions to areas of the workbook. You can load ontologies from your local file system or directly from BioPortal.

NOTE: When working with large ontologies you may need to increase the amount of memory allocated to Populous. You can do this in the Populous run scripts by increasing the value of the -Xmx1000M JVM parameter.

Task 2 – Load some ontologies

 

  • Load the cell type ontology from $FILES/Ontology/Input_ontology/cl-redux.owl
  • Select/highlight the cells in column A from rows 2 to 51. We want to restrict these to terms from below the "cell" term in the cell type ontology. Select the "cell" class from the CTO and select “subclasses” from the “type of allowed values” list. Select the “Apply” button to apply this restriction to the selected range in the workbook.
  • The selected cells in column A should change to a green background indicating that a validation has been set. Any terms in column A that match terms in the cell type ontology will be highlighted in green text. Any unmatched terms will appear in red.
  • At this point save the workbook with a new name.
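The green/red validation described above can be sketched as a simple membership test. This is illustrative Python, not Populous source code; the case-insensitive matching is an assumption:

```python
# Illustrative sketch of cell validation: a cell value matching a term
# in the allowed set (derived from the ontology restriction) is shown in
# green, anything unmatched in red. Case-insensitive matching is an
# assumption for this example.

def validate_cells(values, allowed_terms):
    """Return 'green' for matched terms and 'red' for unmatched ones."""
    allowed = {t.lower() for t in allowed_terms}
    return ["green" if v.lower() in allowed else "red" for v in values]

print(validate_cells(["neuron", "astrocyte", "blarg"],
                     ["Neuron", "Astrocyte", "Hepatocyte"]))
# ['green', 'green', 'red']
```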

 



 

 

Task 3 – Create additional ontological restrictions on the workbook.

 

We want to create some additional restrictions on these cell types. Column D is for asserting superclass information and also contains only cell type terms. Column E is for capturing 'part of' relationships to anatomical terms; we use the UBERON ontology to capture the anatomical terms that will be used to restrict the valid terms in this column. Columns F and G capture germ line and nucleation information about the cells; we can use the Phenotype and Trait Ontology (PATO) to restrict the values in these columns. Column H captures biological process terms from the Gene Ontology, which are used to describe the function of these cell types. Columns I and J capture further information about the cells, including the cell lineage and potentiality.

 

  • We can use a collection of ontologies to restrict the columns to the appropriate ontological terms.
  • Apply the "subclasses of cell" restriction from the CTO to column D, rows 2 to 51.
  • Load the UBERON ontology from $FILES/Ontology/Input_ontology/uberon_redux.owl
  • Select column B, rows 2 to 51.
  • Apply a restriction to all subclasses of 'anatomical entity'.
  • Open the phenotype ontology from $FILES/Ontology/Input_ontology/PATO.owl
  • Search the ontologies for "mononucleate".
  • Select the "nucleate quality" class and create a restriction on column G to all subclasses of nucleate quality.
  • Load the gene ontology from $FILES/Ontology/Input_ontology/go_daily-termdb.owl and apply a restriction on column H to all subclasses of the 'biological_process' term.

 

With the ontology validation loaded we can begin to modify and add new content to the workbook. We can use the auto-complete function on cells in the workbook to assist us in selecting the right terms. For example, for the bladder cells we can begin to add values in Column E to assert "part of" relations between bladder cells and the bladder anatomical region.

 

Task 4 – Working in Excel

 

Populous-generated templates can be exported to the MS Excel .xls format, so users can populate the template using their favourite spreadsheet tool, such as MS Excel or OpenOffice.

 

  • Save the Populous workbook.
  • Open the saved file in either MS Excel or OpenOffice
  • The workbook can be modified like any normal spreadsheet. Drop-down lists of terms are provided as validations on cells to assist the user.

 

 

Task 5 – Converting workbook to OWL

Once users have populated a workbook, we want to return to Populous to A) validate the content, and B) convert the content to an OWL ontology.

 

  • Return to Populous and open the modified cell_type.xls workbook. We will use the OPPL wizard to transform the content into an OWL ontology.
  • Start the OPPL wizard from the Populous "Tools" menu; when prompted about opening a previous workflow, say no.
  • Select the columns from the workbook that you want to transform. For this demo we will select columns A, B, D, E, G and H.
  • Select the rows to convert: select start row 2 and end row 35. Select Continue.
  • On the next panel we choose the ontologies that will be used to create the new ontology. Any ontology already loaded into Populous will be shown. You will need to add an additional ontology containing the skeleton ontology to which we will be adding new terms.
  • Select “Load from file…” and choose $FILES/Ontology/Input_ontology/properties_populous_tutorial_SWAT4LS2011.owl
  • You can specify an Ontology IRI for the newly created ontology, e.g. http://www.populous.org.uk/swat4lstutorial/cell_types.owl
  • Set the physical URI for where the new ontology will be saved, e.g. $FILES/Ontology/Output_ontology/cell_types.owl

 

The next panel is for adding the OPPL patterns that will be executed to generate the new ontology. The OPPL patterns for this tutorial are in $FILES/oppl_script/

 

  • Load the following OPPL patterns:
  1. $FILES/oppl_script/cell_label.oppl
  2. $FILES/oppl_script/subClassOfCell.oppl
  3. $FILES/oppl_script/part_of_Anatomy.oppl
  4. $FILES/oppl_script/cell_label.oppl
  5. $FILES/oppl_script/phenotypic_quality.oppl
  6. $FILES/oppl_script/go_process.oppl
  • Assuming all patterns are validated in green, select Continue. If a pattern is invalid, you need to check that the OPPL pattern syntax is correct. NOTE: At this stage, any OWL object referred to by a valid pattern, such as a class or object property, must already be loaded into Populous.

The next stage is to map variables in the OPPL patterns to columns in the workbook. Map the columns as follows:

The next panel deals with new entities. Where possible, Populous will use the correct URI from imported ontologies when referring to terms from ontologies already loaded. However, when new/unknown terms are encountered (i.e., ones highlighted in red), Populous will create a new term. The “New Entities” panel allows you to specify how a new URI will be created. You can specify the base URI, hash or slash URIs, and an auto-numbering/increment scheme. You can also specify whether a label should be added using the value in the workbook.
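A minimal sketch of such URI minting, assuming a base URI, a prefix, and a zero-padded auto-incremented counter; all names and the padding width below are illustrative, not Populous defaults:

```python
# Hedged sketch of the "New Entities" behaviour: mint a URI for each
# unknown term from a base URI, a prefix and an incrementing number,
# and keep the workbook value as the label. All specifics are assumed.

class UriMinter:
    def __init__(self, base, prefix, width=7):
        self.base, self.prefix, self.width = base, prefix, width
        self.counter = 0

    def mint(self, label):
        """Return a (uri, label) pair for a newly created term."""
        self.counter += 1
        uri = f"{self.base}{self.prefix}{self.counter:0{self.width}d}"
        return uri, label

minter = UriMinter("http://example.org/onto#", "CTODEV_")
print(minter.mint("urothelial cell"))
# ('http://example.org/onto#CTODEV_0000001', 'urothelial cell')
```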

 

  • Choose to auto-generate the id. Check "create label" and set the new URI value prefix to CTODEV_
  • Select Continue. The OPPL scripts will now be executed against the workbook. Once complete, the newly generated ontology will be printed out in Manchester OWL syntax.
  • At this point you can save all the settings used in the ontology generation workflow. This will generate an XML file that can be used again when you want to re-run this workflow.
  • Select Finish to close the OPPL wizard.

 

Task 6 – Viewing the generated ontology in Protégé

 

Once the OPPL wizard has run, the newly generated ontology will be in $FILES/Ontology/Output_ontology/cell_types.owl. This ontology can now be opened in Protégé for manual inspection. To view the newly generated cells in context, it is advised to import all the ontologies from $FILES/Ontology/Input_ontology/*.owl. Once imported, the ontologies can be classified (HermiT is recommended) and we can perform a DL query such as “cell that participates_in some 'cytokine production'”.

 
