PROTON Topics (from Inspec Thesaurus) ordered by algorithm X
learning, model, data, semantic, networks, graph, cluster, kernels, algorithms, systems
semantic, web, ontology, search, queries, semantic_web, user, documents, knowledge, services
solar, cell, solar_cell, abstraction, radiation, electron, reconstruct, image, depth, resolution
atomic, particles, mass, light, orbit, electron, physical, energy, accelerated, ray
crystals, liquid, defects, liquid_crystals, optical, nematic, field, electron, particles, electrical
film, thinning, thinning_film, sol, material, magnetism, morphology, oxide, preparative, ceramics
energy, charge, electrical, magnetism, fields, surfaces, laws, force, field, electron
european, political, europe, dr, union, european_union, media, policy, slovenia, belief
vehicle, safety, projects, road, systems, transport, design, research, route, car
road, transport, surfaces, testing, pavement, resisting, projects, research, skidding, measures
language, words, learning, grammar, structural, entailment, recognition, parsing, sequences, feature
cluster, graph, algorithms, data, patterns, learning, labelling, mining, trees, distance
ontology, semantic, web, semantic_web, owl, rdf, knowledge, queries, documents, services
feature, svms, training, classification, data, resource, selected, classifiers, feature_selected, mining
learning, supervised, algorithms, reinforcement_learning, reinforcement, semi_supervised, semi, supervised_learning, labelling, data
robotics, controlled, brain, interactions, video, activities, interfaces, human, systems, music
robotics, controlled, video, transcripts, systems, lecturers, activities, stanford, cognition, game
brain, interfaces, interactions, user, memory, human, controlled, bci, music, activities
networks, social, research, innovation, technologies, conferences, services, business, management, collaboration
climate, climate_change, change, building, global, regional, capacity, cooperation, gender, financial
networks, social, graph, nodes, dynamically, internet, part, social_networks, communication, market
mobile, knowledge, internet, translated, market, world, social, mining, data_mining, phone
social, internet, communication, user, media, online, people, individual, market, facebook
search, semantic, queries, documents, user, web, tagging, information, retrieval, knowledge
sciences, scientific, women, research, collaboration, management, seminar, organism, vo, innovation
networks, graph, social, nodes, dynamically, part, model, social_networks, structural, link
model, genes, systems, dynamically, learning, data, inference, time, protein, biology
bayesian, gaussian, model, gaussian_process, process, prior, inference, topics, latent, mixtures
model, systems, dynamically, genes, time, inference, learning, data, states, simulation
protein, model, learning, genes, predicting, machining, machining_learning, structural, data, detection
dr, slovenia, rk, policy, president, europe, republic, supplied, department, media
european, political, union, european_union, europe, dreams, media, western, history, conversions
ontology, semantic, alignment, user, mapping, web, evaluation, media, multimedia, measures
services, web, semantic, web_services, semantic_web, ontology, semantic_web_services, descriptive, owl, wsmo
web, semantic, semantic_web, information, rdf, knowledge, ontology, queries, research, data
business, semantic, process, business_process, ontology, web, semantic_web, events, web_technologies, management
tagging, semantic, annotation, user, search, concept, metadata, information, queries, video
search, queries, semantic, documents, user, web, information, knowledge, retrieval, text
learning, reinforcement_learning, reinforcement, algorithms, model, activities_learning, unsupervised, deep, activities, robotics
learning, supervised, semi_supervised, semi, supervised_learning, labelling, semi_supervised_learning, classification, data, algorithms
discovery, subgroup, subgroup_discovery, rule, graph, descriptive, distance, classifiers, facial, rule_learning
cluster, data, algorithms, trees, distance, learning, streams, graph, similar, predicting
graph, patterns, algorithms, labelling, mining, rank, learning, gradient, matching, set
semantic, desktop, semantic_desktop, social_semantic, collaboration, semantic_collaboration, desktop_social, social_semantic_collaboration, desktop_social_semantic, annotation
charge, electrical, surfaces, potential, energy, field, electrical_field, conductor, electron, fields
energy, magnetism, fields, laws, power, force, electrical, conservatively, electron, density
film, character, eve, movie, lady, philosophical, bicycle, story, realistic, myth
brain, bci, controlled, music, interfaces, eeg, brain_computer, cognition, patients, signal
interactions, user, human, interfaces, video, user_interfaces, multimodal, adaptation, systems, computer
memory, working_memory, activities, neuronal, visual, cortex, persistence, cognition, working, stimuli
transport, surfaces, bridge, projects, load, testing, research, logistic, design, sustainability
testing, design, logistic, bridge, profile, pavement, surfaces, load, measures, resisting
learning, kernels, cluster, graph, feature, algorithms, data, model, image, methods
image, object, detection, recognition, segmentation, visual, abstraction, feature, scenes, shape
road, transport, vehicle, testing, projects, safety, research, surfaces, design, systems
energy, strings, electron, surfaces, atomic, cell, field, crystals, charge, electrical
root
solar, cell, solar_cell
atomic, particles, mass
crystals, liquid, defects
film, thinning, thinning_film
energy, charge, electrical
european, political, europe
transport, vehicle, projects
road, testing, pavement
language, words, learning
cluster, graph, algorithms
semantic, web, ontology
feature, svms, training
learning, supervised, algorithms
robotics, controlled, brain
robotics, controlled, video
brain, interfaces, interactions
research, innovation, technologies
climate, climate_change, change
social, internet, research
mobile, knowledge, internet
social, internet, communication
ontology, rdf, owl
sciences, scientific, women
networks, graph, social
model, genes, systems
bayesian, gaussian, model
model, systems, dynamically
protein, model, learning
dr, slovenia, rk
european, political, union
ontology, semantic, alignment
search, queries, user
web, semantic, semantic_web
business, semantic, process
annotation
semantic, queries, search
learning, reinforcement_learning, reinforcement
learning, supervised, semi_supervised
subgroup
cluster, data, algorithms
graph, patterns, algorithms
services, web, semantic
charge, electrical, surfaces
energy, magnetism, fields
Film making
brain, bci, controlled
interactions, user, human
memory, working_memory, activities
transport, surfaces, bridge
testing, design, logistic
annotation, semantic_collaboration, semantic_desktop
learning, kernels, cluster
image, object, detection
road, transport, vehicle
energy, strings, electron
Information Society 2002 - Ljubljana The Information Society multiconference deals with information technologies, which are of major importance for the development of Europe and of Slovenia as a part of it. The United States is ahead of the Old Continent in this field (some indicators: market share 4:3, business on the internet 2:1; other spheres even range from 3:1 to 6:1). Europe and Slovenia need to catch up and develop into information societies. For these reasons we host a scientific meeting in the form of a multiconference, consisting of several conferences on specific themes essential for the development of the information society.
NERO 2.0: Neuro Evolving Robotic Operatives This demonstrates the NERO real-time strategy game and the capabilities of its agents. The technology involves neuroevolution.
Interactive derivation viewer This describes the IDV, a tool for graphically rendering derivations that are written in the Thousands of Problems for Theorem Provers (TPTP) language.
Artificial intelligence: An instance of Aibo ingenuity This describes research related to using RL for, among other tasks, learning behaviors for an Aibo robot.
K-nearest neighbor classification In this short animated video the k-nearest neighbor classifier is introduced with simple 3D visuals. A real-world application, word pronunciation, is used to exemplify how the classifier learns and classifies. The video features a synthesized voice-over.
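A minimal k-nearest-neighbor classifier of the kind the video introduces can be sketched in a few lines; the toy data and function name below are illustrative, not taken from the video.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; distances are Euclidean.
    """
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2D example: two well-separated clusters.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_classify(train, (0.5, 0.5), k=3))  # -> a
```

Larger k smooths the decision boundary at the cost of blurring small classes, which is the bias/variance trade-off the classifier exposes through its single hyperparameter.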
A service robot named Markovito This shows the Peoplebot Markovito as it delivers messages and objects between offices. It performs speech communication, face recognition, and global localization, uses a probabilistic grid map, and is controlled by a factored MDP.
Color-based object recognition This demonstrates a robodog that recognizes objects in a "fetch" task. The software runs on a world-wide computing grid, distributing the computational load over several Beowulf clusters.
Power agents at the Mars Desert Research Station A comprehensive demonstration of the agents being used at the MDRS, scripted with inspiration from the HAL 9000. Permission granted for additional video length, although the main video ends at 5min.
Motion planning of multiple agents in virtual environments Describes and demonstrates in simulation the use of coordination graphs to avoid collisions in tasks requiring the motion of multiple agents.
Humanoids for autonomous operations The video describes a Humanoid robotics project at JPL, claiming a first practical application of humanoid robotics.
Multimodal Interactive Robot Agent (MIRA robot head) A short video demonstrating the speech communication, reasoning abilities, and humour of MIRA.
Convergence of MDL and Bayesian Methods We introduce a complexity measure which we call KL-complexity. Based on this concept, we present a general information exponential inequality that measures the statistical complexity of some deterministic and randomized estimators. We show that simple and clean finite-sample convergence bounds can be obtained from this approach. In particular, we are able to improve some classical results concerning the convergence of MDL density estimation and Bayesian posterior distributions.
MARQS: Media album retrieval by query sketch An advertisement-like short demo of a tool for retrieving photos from an album by sketching.
Robot Swarm localization using trilateration A description and demonstration of a robust approach for ground robot formation movement behaviors.
Morphogenesis: Shaping swarms of intelligent robots Describes, simulates, and demonstrates in hardware the utility of (rule-based) morphogenesis for shaping robot swarms. More videos, pictures, and information on Morphogenesis and Morphology Control: http://iridia.ulb.ac.be/supp/IridiaSupp2007-003/index.html. A high-quality version of this video in various formats: http://iridia.ulb.ac.be/%7Ealyhne/aaai-07/index.html. More about the robots, swarm robotics, and swarm intelligence: the Swarmanoid project (http://www.swarmanoid.com) and the Swarm-bots project (http://www.swarm-bots.org). Authors: Anders Lyhne Christensen (http://iridia.ulb.ac.be/%7Ealyhne), Rehan O'Grady (http://iridia.ulb.ac.be/%7Erogrady), Marco Dorigo (http://iridia.ulb.ac.be/%7Emdorigo/).
Autonomous robot cleaning crew De-centralized collaborative planning and simulation demo for coordinating agent/robot tasks.
Dance evolution The only submission from undergraduates, this unique video challenges AI to learn how to dance by demonstrating how neuroevolution can be used to (interactively) evolve dancing techniques.
Learning without overlearning This course covers feature selection fundamentals and applications. The students will first be reminded of the basics of machine learning algorithms and the problem of overfitting avoidance. In the wrapper setting, feature selection will be introduced as a special case of the model selection problem. Methods for deriving principled feature selection algorithms will be reviewed, as well as heuristic methods that work well in practice. One class will be devoted to feature construction techniques. Finally, a lecture will be devoted to the connections between feature selection and causal discovery. The class will be accompanied by several lab sessions. The course will be attractive to students who like playing with data and want to learn practical data analysis techniques. The instructor has ten years of experience consulting for startup companies in the US in pattern recognition and machine learning. Datasets from a variety of application domains will be made available: handwriting recognition, medical diagnosis, drug discovery, text classification, ecology, marketing. Lecture 1: Learning without Over-learning (Isabelle Guyon, isabelle@clopinet.com). Slide outline: machine learning as tuning of parameters (weights w, threshold b) and hyperparameters (basis functions, kernels, number of units); training as minimizing a risk functional R[f(x,w)], typically by gradient descent w_j <- w_j - eta dR/dw_j; example risk functionals (error rate for classification, mean square error for regression); fit/robustness tradeoff and overfitting, illustrated by polynomial regression on a noisy 10th-degree target; Ockham's razor ("of two theories providing similarly good predictions, prefer the simplest one"); artificial neurons (McCulloch and Pitts, 1943), Hebb's rule w_j <- w_j + y_i x_ij, and weight decay w_j <- (1 - gamma) w_j + y_i x_ij; theoretical foundations: risk minimization with various loss functions (0/1, square, perceptron, hinge and squared hinge, logistic, Adaboost exponential loss); empirical vs. guaranteed risk, with R[f] <= R_train[f] + epsilon(delta, C) holding with high probability 1 - delta; structural risk minimization (Vapnik, 1974) over nested model subsets of increasing capacity, with weight decay recovered as the Lagrangian R_reg[f] = R_train[f] + lambda ||w||^2; hyperparameter selection by K-fold cross-validation; Bayesian MAP interpreted as SRM (a Gaussian prior yields the ||w||^2 regularizer); minimum description length with a two-part code (model plus residual); the bias-variance tradeoff, where SRM reduces variance at the expense of some bias; ensemble methods as an alternative way to reduce variance by voting committees. References: V. Vapnik, Statistical Learning Theory, ISBN 0471030031; I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S. A. Solla, Structural risk minimization for character recognition, NIPS 4, pages 471-479, Morgan Kaufmann, 1992, http://clopinet.com/isabelle/Papers/srm.ps.Z; I. Guyon, Kernel Ridge Regression Tutorial, http://clopinet.com/isabelle/Projects/ETH/KernelRidge.pdf; I. Guyon et al., Eds., Feature Extraction: Foundations and Applications, http://clopinet.com/fextract-book.
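The weight-decay update described in the lecture above can be sketched as gradient descent on a ridge-regularized least-squares risk; this is a minimal illustration (all names and data are made up here, not the course's lab material), where the decay factor corresponds to gamma = 2 * eta * lam.

```python
import numpy as np

def ridge_gd(X, y, lam=0.01, eta=0.01, steps=1000):
    """Gradient descent on R_reg[w] = ||Xw - y||^2 / n + lam * ||w||^2.

    Each step first shrinks the weights ("weight decay"), then moves
    downhill on the empirical risk: w <- (1 - 2*eta*lam) w - eta * grad.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad_emp = 2 * X.T @ (X @ w - y) / n   # gradient of the data-fit term
        w = (1 - 2 * eta * lam) * w - eta * grad_emp
    return w

# Synthetic regression problem: recover a sparse weight vector from noisy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ true_w + 0.01 * rng.normal(size=100)
w = ridge_gd(X, y)
```

With small lam the estimate stays close to the target weights; increasing lam shrinks them toward zero, trading variance for bias exactly as the lecture's SRM discussion describes.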
Introduction to feature selection Lecture 2 of the feature selection course (Isabelle Guyon, isabelle@clopinet.com): Introduction to Feature Selection. The goal is to select, out of thousands to millions of low-level features, the most relevant ones to build better, faster, and easier-to-understand learning machines. Slide outline: notations and motivating examples — leukemia diagnosis (Golub et al., Science Vol 286, Oct. 1999), prostate cancer genes (RFE SVM, Guyon-Weston, 2000, US patent 7,117,188; application to prostate cancer, Elisseeff-Weston, 2001), RFE SVM for cancer diagnosis by differentiation of 14 tumors (Ramaswamy et al., PNAS, 2001), QSAR drug screening on thrombin binding (2543 compounds, 192 active, 139,351 binary three-dimensional features; Weston et al., Bioinformatics, 2002), text filtering with bag-of-words features (Reuters-21578, 114 categories; 20 Newsgroups; WebKB; Bekkerman et al., JMLR, 2003), and male/female face recognition (1450 images, 5100 features; Navot-Bachrach-Tishby, ICML 2004); nomenclature: univariate vs. multivariate methods, filters (rank features independently of the predictor) vs. wrappers (use a classifier to assess feature subsets); univariate filter methods: feature irrelevance as independence P(Xi, Y) = P(Xi) P(Y), ROC curves and AUC, the signal-to-noise criterion S2N = |mu+ - mu-| / (sigma+ + sigma-), mutual information MI(X, Y) = KL(P(X,Y) || P(X)P(Y)) and its relation to correlation (MI = -(1/2) log(1 - R^2) in the Gaussian case), the T-test, and statistical testing issues (p-values, the multiple-testing problem and Bonferroni correction, false discovery rate, the probe method); multivariate methods: why univariate selection may fail (Guyon-Elisseeff, JMLR 2004), filters vs. wrappers and the danger of overfitting with intensive search, search strategies (forward selection, backward elimination, beam search, generalized sequential forward selection, PTA(l,r), floating search SFFS/SBFS), and the complexity of the problem (N features yield 2^N subsets; Kohavi-John, 1997); embedded methods such as Recursive Feature Elimination (RFE) SVM; feature subset assessment with training/validation/test splits and cross-validation, keeping the number of subsets tried of the order of the number of validation examples. In practice, no method is universally better; match the method's complexity to the examples-to-features ratio, and note that feature selection is not always necessary to achieve good performance. NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges. Book of the NIPS 2003 challenge: Feature Extraction, Foundations and Applications, I. Guyon et al., Eds., Springer, 2006, http://clopinet.com/fextract-book.
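The univariate S2N filter mentioned in the lecture can be sketched in a few lines; a minimal illustration on synthetic two-class data (the function name and data are made up, not the course's lab code).

```python
import numpy as np

def s2n_ranking(X, y):
    """Rank features by the signal-to-noise criterion of Golub et al.:
    S2N_i = |mu+_i - mu-_i| / (sigma+_i + sigma-_i), computed per feature.

    X is an (m, n) data matrix, y holds labels in {-1, +1}.
    Returns feature indices sorted best-first.
    """
    pos, neg = X[y == 1], X[y == -1]
    s2n = np.abs(pos.mean(0) - neg.mean(0)) / (pos.std(0) + neg.std(0) + 1e-12)
    return np.argsort(-s2n)

# Ten noise features; feature 3 is shifted by the class label.
rng = np.random.default_rng(1)
y = np.repeat([1, -1], 50)
X = rng.normal(size=(100, 10))
X[:, 3] += 2.0 * y
print(s2n_ranking(X, y)[0])  # -> 3
```

Being a univariate filter, this criterion scores each feature in isolation and is cheap, but it can miss features that are only jointly informative — the failure mode the lecture's multivariate section addresses.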
Graph-based Methods for Retinal Mosaicing and Vascular Characterization In this paper, we propose a highly robust point-matching method, Graph Transformation Matching (GTM), which relies on finding the consensus graph emerging from putative matches. The method is two-phased: after finding the consensus graph, it tries to complete it as much as possible. We successfully apply GTM to image registration in the context of building mosaics from retinal images; feature points are obtained after properly segmenting such images. In addition, we introduce a novel topological descriptor, relying on diffusion kernels on graphs, for quantifying disease by characterizing the arterial/venular trees. Our experiments showed statistical significance only for the case of arterial trees, which is consistent with previous findings. Slide outline: image feature extraction and blood vessel detection; the GTM algorithm, its recovery phase, and optimization results; retinal mosaicing; spectral vascular characterization (diffusion kernels, probability distributions, descriptor construction, information-theoretic refinement); conclusions.
Stereo Vision for Obstacle Detection: a Graph-Based Approach We propose a new approach to stereo matching for obstacle detection in the autonomous navigation framework. An accurate but slow reconstruction of the 3D scene is not needed; rather, it is more important to have a fast localization of the obstacles to avoid them. Methods in the literature based on point-wise stereo matching are ineffective in realistic contexts because they are either computationally too expensive, or unable to deal with the presence of uniform patterns or of perturbations between the left and right images. Our idea is to treat the stereo matching problem as a matching between homologous regions. The stereo images are represented as graphs and a graph matching is computed to find homologous regions. Our method is strongly robust in a realistic environment, requires little parameter tuning, and is adequately fast, as experimentally demonstrated in a comparison with the best algorithms in the literature. Slide outline: obstacle detection; related works, a comparison, and open problems; our approach (rationale, algorithm, results); conclusions and references.
A Continuous-Based Approach for Partial Clique Enumeration In many applications of computer vision and pattern recognition which use graph-based knowledge representation, it is of great interest to be able to extract the K largest cliques in a graph, but most methods are geared either towards extracting the single clique of maximum size, or enumerating all cliques, without following any particular order. In this paper we present a novel approach for partial clique enumeration, that is, the extraction of the K largest cliques of a graph. Our approach is based on a continuous formulation of the clique problem developed by Motzkin and Straus, and is able to avoid extracting the same clique multiple times. This is done by casting the problem into a game-theoretic framework and iteratively rendering unstable the solutions that have already been extracted.
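The continuous formulation above can be made concrete with a minimal sketch of the Motzkin-Straus program solved by replicator dynamics (the game-theoretic update). This is an illustration only, not the paper's enumeration scheme: the iterative destabilization of already-extracted cliques is omitted, and the ½I regularization (due to Bomze) is an added assumption that makes maximal cliques correspond to strict local maxima.

```python
import numpy as np

def replicator_clique(A, iters=2000, tol=1e-12):
    """Find a maximal clique via the Motzkin-Straus program max x'Ax over
    the probability simplex, solved with replicator dynamics."""
    n = len(A)
    # Bomze's 1/2-regularization: local maxima <-> maximal cliques
    B = np.asarray(A, dtype=float) + 0.5 * np.eye(n)
    x = np.full(n, 1.0 / n)                       # start at the barycenter
    for _ in range(iters):
        Bx = B @ x
        new = x * Bx / (x @ Bx)                   # replicator update
        if np.abs(new - x).max() < tol:
            break
        x = new
    return {i for i in range(n) if x[i] > 1e-6}   # support = clique
```

On a graph consisting of a triangle {0, 1, 2} plus a pendant vertex attached to node 0, the dynamics started from the barycenter converge to the triangle.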
Fast Learning Rates for Support Vector Machines We establish learning rates to the Bayes risk for support vector machines with hinge loss (L1-SVMs). Since a theorem of Devroye states that no learning algorithm can learn with a uniform rate to the Bayes risk for all probability distributions, we have to restrict the class of considered distributions: in order to obtain fast rates we assume a noise condition recently proposed by Tsybakov and an approximation condition in terms of the distribution and the reproducing kernel Hilbert space used by the L1-SVM. For Gaussian RBF kernels with varying widths we propose a geometric noise assumption on the distribution which ensures the approximation condition. This geometric assumption is not in terms of smoothness but describes the concentration of the marginal distribution near the decision boundary. In particular, we are able to describe nontrivial classes of distributions for which L1-SVMs using a Gaussian kernel can learn with almost linear rate.
A Correspondence Measure for Graph Matching using the Discrete Quantum Walk In this paper we consider how coined quantum walks can be applied to graph matching problems. The matching problem is abstracted using an auxiliary graph that connects pairs of vertices from the graphs to be matched by way of auxiliary vertices. A coined quantum walk is simulated on this auxiliary graph and the quantum interference on the auxiliary vertices indicates possible matches. When dealing with graphs for which there is no exact match, the interference amplitudes together with edge consistencies are used to define a consistency measure. We have tested the algorithm on graphs derived from the NCI molecule database and found it to significantly reduce the space of possible matchings thereby allowing the graphs to be matched directly. An analysis of the quantum walk in the presence of structural errors between graphs is used as the basis of the consistency measure. We test the performance of this measure on graphs derived from images in the COIL-100 database.
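A minimal simulation of a Grover-coined discrete quantum walk, the kind of walk simulated on the auxiliary graph above, can be sketched as follows. The auxiliary-graph construction and the consistency measure are omitted; the uniform initial state and the assumption of an undirected graph (so every arc has a reverse) are ours.

```python
import numpy as np

def grover_walk(adj, steps):
    """Discrete-time coined quantum walk with the Grover coin.
    The state holds one complex amplitude per directed arc (u, v)."""
    arcs = [(u, v) for u in range(len(adj)) for v in adj[u]]
    index = {a: k for k, a in enumerate(arcs)}
    out = {}
    for k, (u, _) in enumerate(arcs):
        out.setdefault(u, []).append(k)
    psi = np.ones(len(arcs), dtype=complex) / np.sqrt(len(arcs))
    for _ in range(steps):
        coined = np.empty_like(psi)
        for u, idx in out.items():        # Grover coin at each vertex:
            a = psi[idx]                  # reflect amplitudes about their mean
            coined[idx] = 2 * a.mean() - a
        # shift: amplitude on arc (u, v) moves to the reversed arc (v, u)
        psi = np.array([coined[index[(v, u)]] for (u, v) in arcs])
    return psi, arcs
```

Both the coin (a reflection) and the shift (a permutation) are unitary, so the state norm is preserved at every step; interference then shows up as non-uniform amplitudes across arcs.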
Feature construction This course covers feature selection fundamentals and applications. The students will first be reminded of the basics of machine learning algorithms and the problem of overfitting avoidance. In the wrapper setting, feature selection will be introduced as a special case of the model selection problem. Methods to derive principled feature selection algorithms will be reviewed, as well as heuristic methods that work well in practice. One class will be devoted to feature construction techniques. Finally, a lecture will be devoted to the connections between feature selection and causal discovery. The class will be accompanied by several lab sessions. The course will be attractive to students who like playing with data and want to learn practical data analysis techniques. The instructor has ten years of experience consulting for startup companies in the US in pattern recognition and machine learning. Datasets from a variety of application domains will be made available: handwriting recognition, medical diagnosis, drug discovery, text classification, ecology, marketing. Play with the Gisette dataset of the feature selection challenge. See how, with simple feature extraction methods, performance can be improved over the pure "agnostic" approach.
Causality and feature selection Lecture 5: Causality and Feature Selection Variable/Feature Selection What Can Go Wrong? - 1 What Can Go Wrong? - 2 What Can Go Wrong? - 3 Causal Feature Selection Lecture 5: Causality and Feature Selection Isabelle Guyon isabelle@clopinet.com Variable/feature selection: remove features Xi to improve (or least degrade) the prediction of Y. What can go wrong? (scatter-plot example over features X1 and X2; Guyon-Aliferis-Elisseeff, 2007). Causal feature selection: uncover causal relationships between the Xi and Y. Causal feature relevance (lung cancer example). Markov Blanket: strongly relevant features (Kohavi-John, 1997); Markov Blanket (Tsamardinos-Aliferis, 2003). Feature relevance: Surely irrelevant feature Xi: P(Xi, Y | S\i) = P(Xi | S\i) P(Y | S\i) for all S\i ⊆ X\i and all assignments of values to S\i. Strongly relevant feature Xi: P(Xi, Y | X\i) ≠ P(Xi | X\i) P(Y | X\i) for some assignment of values to X\i. Weakly relevant feature Xi: P(Xi, Y | S\i) ≠ P(Xi | S\i) P(Y | S\i) for some assignment of values to some S\i ⊂ X\i. Markov Blanket of the target (lung cancer): the PARENTS, CHILDREN and SPOUSES are the strongly relevant features (Kohavi-John, 1997; Tsamardinos-Aliferis, 2003). Causal relevance: Surely irrelevant feature Xi: P(Xi, Y | S\i) = P(Xi | S\i) P(Y | S\i) for all S\i ⊆ X\i and all assignments of values to S\i. Causally relevant feature Xi: P(Xi, Y | do(S\i)) ≠ P(Xi | do(S\i)) P(Y | do(S\i)) for some assignment of values to S\i. Weak/strong causal relevance: weak = ancestors, indirect causes; strong = parents, direct causes. Examples (lung cancer): Immediate causes (parents): smoking, genetic factor 1. Non-immediate causes (other ancestors): anxiety → smoking → lung cancer (CHAIN: X → C → Y gives X ⊥ Y | C). Non-causes (e.g. siblings): genetic factor 1 → other cancers (FORK: X ← C → Y also gives X ⊥ Y | C). Hidden more direct cause: anxiety → smoking → tar in lungs → lung cancer. Confounder: genetic factor 2 causing both smoking and lung cancer. Immediate consequences (children): coughing, metastasis, biomarker 1 (COLLIDER: X → C ← Y gives X ⊥ Y but not X ⊥ Y | C). Non-relevant spouse (artifact): biomarker 2 driven by systematic noise. Another case of confounder: systematic noise affecting both biomarker 1 and biomarker 2. Truly relevant spouse: allergy, with coughing as a common child. Sampling bias: hormonal factor and metastasis. Causal feature relevance summary graph: smoking, anxiety, genetic factor 2, allergy, tar in lungs, genetic factor 1, other cancers, lung cancer, biomarker 1, biomarker 2, systematic noise, metastasis, coughing, hormonal factor. Formalism: causal Bayesian networks. Bayesian network: a graph with random variables X1, X2, ..., Xn as nodes; dependencies are represented by edges; allows us to compute P(X1, X2, ..., Xn) as ∏i P(Xi | Parents(Xi)); edge directions have no causal meaning.
Causal Bayesian network: edge directions indicate causality. Example of a causal discovery algorithm. Algorithm: PC (Peter Spirtes and Clark Glymour, 1999). Let A, B, C ∈ X and V ⊆ X. Initialize with a fully connected un-oriented graph. 1. Find un-oriented edges by using the criterion that variable A shares a direct edge with variable B iff no subset of other variables V can render them conditionally independent (A ⊥ B | V). 2. Orient edges in "collider" triplets (i.e., of the type A → C ← B) using the criterion that if there are direct edges between A and C and between C and B, but not between A and B, then A → C ← B, iff there is no subset V containing C such that A ⊥ B | V. 3. Further orient edges with a constraint-propagation method by adding orientations until no further orientation can be produced, using the two following criteria: (i) if A → B → ... → C and A — C (i.e., there is an undirected edge between A and C), then A → C; (ii) if A → B — C, then B → C. Computational and statistical complexity. Computing the full causal graph poses computational challenges (intractable for large numbers of variables) and statistical challenges (difficulty of estimating conditional probabilities for many variables with few samples). Compromises: develop algorithms with good average-case performance, tractable for many real-life datasets; abandon learning the full causal graph and instead develop methods that learn a local neighborhood; abandon learning the fully oriented causal graph and instead develop methods that learn unoriented graphs. A prototypical Markov-blanket algorithm: HITON (Aliferis-Tsamardinos-Statnikov, 2003). 1. Identify variables with direct edges to the target Y (parents/children): iteration 1: add A; iteration 2: add B; iteration 3: remove B because A ⊥ Y | B; etc. 2. Repeat the algorithm for the parents and children of Y (get depth-two relatives). 3. Remove non-members of the MB: a member A of PC(PC) that is not in PC is a member of the Markov Blanket if there is some member B of PC such that A becomes conditionally dependent with Y conditioned on some subset of the remaining variables together with B. Conclusion: Feature selection focuses on uncovering subsets of variables X1, X2, ... predictive of the target Y. Multivariate feature selection is in principle more powerful than univariate feature selection, but not always in practice. Taking a closer look at the type of dependencies in terms of causal relationships may help refine the notion of variable relevance. Acknowledgements and references: 1) Feature Extraction, Foundations and Applications, I. Guyon et al., Eds., Springer, 2006. http://clopinet.com/fextract-book 2) Causal feature selection, I. Guyon, C. Aliferis, A. Elisseeff, to appear in "Computational Methods of Feature Selection", Huan Liu and Hiroshi Motoda, Eds., Chapman and Hall/CRC Press, 2007. http://clopinet.com/isabelle/Papers/causalFS.pdf
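Step 1 of the PC algorithm described above (skeleton discovery) can be sketched with a conditional-independence oracle standing in for the statistical tests; `indep` is a caller-supplied predicate, and the collider-orientation and propagation steps are omitted.

```python
from itertools import combinations

def pc_skeleton(variables, indep):
    """PC step 1: start fully connected; drop edge A-B if some conditioning
    set S drawn from A's other neighbours renders A and B independent."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    k = 0  # size of the conditioning set, grown one at a time
    while any(len(adj[a] - {b}) >= k for a in variables for b in adj[a]):
        for a in variables:
            for b in list(adj[a]):
                others = adj[a] - {b}
                if len(others) < k:
                    continue
                for S in combinations(sorted(others), k):
                    if indep(a, b, S):            # CI test / oracle
                        adj[a].discard(b)
                        adj[b].discard(a)
                        sepset[frozenset((a, b))] = set(S)
                        break
        k += 1
    return adj, sepset
```

With an oracle encoding the chain X → Y → Z (so X ⊥ Z | Y and nothing else), the recovered skeleton is X — Y — Z with Y as the separating set for (X, Z).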
Image Classification Using Marginalized Kernels for Graphs We propose in this article an image classification technique based on kernel methods and graphs. Our work explores the possibility of applying marginalized kernels to image processing. In machine learning, high-performance algorithms have been developed for data organized as real-valued arrays; these algorithms are used for various purposes like classification or regression. However, they are inappropriate for direct use on complex data sets. Our work consists of two distinct parts. In the first one we model the images by graphs to be able to represent their structural properties and inherent attributes. In the second one, we use kernel functions to project the graphs into a mathematical space that allows the use of powerful classification algorithms. Experiments are performed on medical images acquired with various modalities and concerning different parts of the body.
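For a concrete feel of the kernel family involved, here is a sketch of the closely related random-walk kernel on the direct product graph of two labelled graphs; marginalized graph kernels (Kashima et al.) take this closed form under uniform start/stop distributions over label-compatible node pairs, which, together with the decay λ (assumed small enough for the series to converge), are assumptions of this sketch.

```python
import numpy as np

def walk_kernel(A1, labels1, A2, labels2, lam=0.1):
    """Random-walk kernel: sum over pairs of label-matching walks in both
    graphs, i.e. walks in the direct product graph, weighted by lam^length."""
    n1, n2 = len(labels1), len(labels2)
    # m[i*n2 + j] = 1 iff node i of G1 and node j of G2 carry the same label
    m = np.array([[float(a == b) for b in labels2] for a in labels1]).ravel()
    W = np.kron(np.asarray(A1, float), np.asarray(A2, float)) * np.outer(m, m)
    p = m / m.sum()                      # uniform over compatible pairs
    # k = p' (I - lam W)^{-1} p  sums the geometric series of walk counts
    return p @ np.linalg.solve(np.eye(n1 * n2) - lam * W, p)
```

The kernel is symmetric in its two arguments and invariant to relabelling the nodes of either graph, which is what makes it usable as a graph similarity for SVM classification.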
Universal Coding/Prediction and Statistical (In)consistency of Bayesian inference Part of this talk is based on results of A. Barron (1986) and recent joint work with J. Langford (2004). We introduce the information-theoretic concepts of universal coding and prediction. Under weak conditions on the prior, Bayesian sequential prediction is universal. This means that a code based on the Bayesian predictive distribution allows one to substantially compress data. We give a simple proof of the fact that universality implies consistency of the Bayesian posterior. It follows that Bayesian inconsistency in nonparametric settings (a la Diaconis & Freedman) can only occur if priors are used that do not allow for data compression. This gives a frequentist rationale for Rissanen's Minimum Description Length Principle. We also show that under misspecification, the Bayesian predictions can substantially outperform the predictions of the best distribution in the model. Ironically, this implies that the Bayesian posterior can become *inconsistent*: in some sense good predictive performance implies inconsistency!
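The universality claim above can be made concrete for Bernoulli sources: the Bayes mixture under a Beta(½, ½) prior (the Krichevsky-Trofimov predictor) compresses every binary sequence to within about ½ log₂ n + 1 bits of the best Bernoulli model chosen in hindsight. A minimal sketch (the specific sequence in the test is an arbitrary choice):

```python
import math

def kt_codelength(bits):
    """Cumulative log-loss (bits) of the Krichevsky-Trofimov predictor,
    i.e. the Bayes mixture over Bernoulli sources with a Beta(1/2,1/2) prior."""
    ones = zeros = 0
    total = 0.0
    for b in bits:
        p1 = (ones + 0.5) / (ones + zeros + 1.0)  # posterior predictive
        total += -math.log2(p1 if b else 1.0 - p1)
        ones += b
        zeros += 1 - b
    return total

def best_bernoulli_codelength(bits):
    """Code length of the maximum-likelihood Bernoulli model in hindsight."""
    n, k = len(bits), sum(bits)
    if k in (0, n):
        return 0.0
    p = k / n
    return -(k * math.log2(p) + (n - k) * math.log2(1 - p))
```

The regret (mixture code length minus hindsight-optimal code length) is always non-negative, since a mixture cannot beat the maximum, and is bounded by ½ log₂ n + 1 bits: universality in the sense used in the talk.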
Deducing Local Influence Neighbourhoods With Application to Edge-Preserving Image Denoising Traditional image models enforce global smoothness or, more recently, Markov random field priors. Unfortunately, global models are inadequate to represent the spatially varying nature of most images, which are much better modeled as piecewise smooth. This paper advocates the concept of local influence neighbourhoods (LINs). The influence neighbourhood of a pixel is defined as the set of neighbouring pixels which have a causal influence on it. LINs can therefore be used as part of the prior model for Bayesian denoising, deblurring and restoration. Using LINs in prior models can be superior to pixel-based statistical models since they provide higher-order information about the local image statistics. LINs are also useful as a tool for higher-level tasks like image segmentation. We propose a fast graph-cut-based algorithm for obtaining optimal influence neighbourhoods, and show how to use them for local filtering operations. Then we present a new expectation-maximization algorithm to perform locally optimal Bayesian denoising. Our results compare favourably with existing denoising methods. Deducing Local Influence Neighbourhoods in Images Using Graph Cuts San Francisco, CA Overview Local neighbourhoods as intermediate image structures Outline Local Influence Neighbourhoods Example: Binary image denoising Problem Constraints Example of box vs. smoothness (1) Example of box vs. smoothness (2) A better neighbourhood criterion A) Closeness criterion in action B) Contiguity and smoothness Markov Random Field Priors Bottomline Graph Cut based Energy Minimization How to minimize E?
Minimum cut problem Graph construction Table1: Edge costs of induced graph Graph Algorithm Examples of Detected LINs Results: Most Popular LINs Filtering with LINs Maximum filter using LINs Median filter using LINs EM-style Denoising algorithm Bayesian (Maximum a Posteriori) Estimate EM-style image denoising Results: LIN-based Image Denoising Results: Bike image Table1: Denoising Results Other Applications of LINs Hierarchical segmentation How to measure Fractal Dimension using LINs? FD using LINs Possible advantages of LIN over current techniques Possible Discriminators of Neurodegeneration Summary Contact
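The graph-cut energy minimization named in the outline can be sketched for the binary case: the MAP labelling under an Ising-style smoothness prior is computed exactly as a minimum s-t cut (the classical Greig-Porteous-Seheult construction). The LIN-specific edge costs of Table 1 are replaced here by unit data costs and a uniform smoothness weight λ, which are assumptions of this sketch.

```python
from collections import deque

def add_edge(cap, u, v, c):
    cap.setdefault(u, {})
    cap.setdefault(v, {})
    cap[u][v] = cap[u].get(v, 0.0) + c
    cap[v].setdefault(u, 0.0)          # ensure the residual arc exists

def source_side(cap, s, t):
    """Edmonds-Karp max-flow; return the source side of a minimum cut."""
    flow = {}
    def bfs(stop_at_t):
        parent = {s: None}
        q = deque([s])
        while q:
            u = q.popleft()
            for v, c in cap[u].items():
                if v not in parent and c - flow.get((u, v), 0.0) > 1e-9:
                    parent[v] = u
                    if stop_at_t and v == t:
                        return parent
                    q.append(v)
        return None if stop_at_t else parent
    while True:
        parent = bfs(True)
        if parent is None:
            break
        aug, v = float("inf"), t
        while parent[v] is not None:            # bottleneck capacity
            u = parent[v]
            aug = min(aug, cap[u][v] - flow.get((u, v), 0.0))
            v = u
        v = t
        while parent[v] is not None:            # augment along the path
            u = parent[v]
            flow[(u, v)] = flow.get((u, v), 0.0) + aug
            flow[(v, u)] = flow.get((v, u), 0.0) - aug
            v = u
    return set(bfs(False))

def denoise_binary(img, lam):
    """MAP binary denoising: unit cost for disagreeing with the observed
    pixel, plus lam per 4-neighbour label discontinuity."""
    h, w = len(img), len(img[0])
    cap, S, T = {}, "s", "t"
    for i in range(h):
        for j in range(w):
            add_edge(cap, S, (i, j), float(img[i][j]))      # ties label 1
            add_edge(cap, (i, j), T, float(1 - img[i][j]))  # ties label 0
            for q in ((i + 1, j), (i, j + 1)):
                if q[0] < h and q[1] < w:
                    add_edge(cap, (i, j), q, lam)
                    add_edge(cap, q, (i, j), lam)
    side = source_side(cap, S, T)
    return [[1 if (i, j) in side else 0 for j in range(w)] for i in range(h)]
```

With a strong smoothness weight an isolated flipped pixel is repaired; with a weak one the data term wins and the pixel is kept, which is exactly the energy trade-off the slides discuss.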
An Introduction to Ensemble and Boosting Methods Amir Saffari, Institute for Computer Graphics and Vision (ICG), Graz University of Technology, Austria http://www.ymer.org/amir/ saffari@icg.tugraz.at, amir@ymer.org PASCAL Bootcamp 2007, Vilanova i la Geltrú, Spain Outline: Ensemble Methods (Introduction, Model Averaging, Bagging), Stagewise Additive Modeling (Stagewise Additive Modeling, Boosting), Practical Example. Choosing your operating system: a majority voting scheme. Ask experts for their opinion and choose the option with the majority vote. Say we have a set of M experts: H = {f1, f2, ..., fM}, fm(budget) ∈ {Linux, Windows}. Assume Linux = +1 and Windows = −1; then the majority-vote decision is F(budget) = sign((1/M) Σ_{m=1}^{M} fm(budget)). This is the main concept behind ensemble methods. Diversity is more than great. Notation: D = {(x1, t1), (x2, t2), ..., (xN, tN)}, xn ∈ R^d, tn ∈ {−1, +1}; H = {f1(x), f2(x), ..., fM(x)}, ym = fm(x) ∈ {−1, +1}; F(x) = Σ_{m=1}^{M} αm fm(x), αm ∈ R+, Σ_{m=1}^{M} αm = 1. Why use ensemble methods? Better performance: assume that ∀m: p(ym = t) ≥ η > 1/2 and that the decisions of the different models are independent; then the chance of a wrong decision by the ensemble is p(F ≠ t) = 1 − Pr(k ≥ M/2), where k is the number of correct models and Pr(k ≤ K) is the cumulative distribution function of a binomial distribution. This bound is much better than the individual error rate. Performance of an ensemble of classifiers: for an individual error rate of 0.3 and M = 21, the chance of misclassification is around 0.026 (T. G. Dietterich, 2000). Why use ensemble methods? Statistical reason. From: T. G. Dietterich, Ensemble Methods in Machine Learning, Lecture Notes in Computer Science, Vol. 1857, pages 1-15, 2000.
Why use ensemble methods? Computational reason. Representational reason. (From: T. G. Dietterich, Ensemble Methods in Machine Learning, Lecture Notes in Computer Science, Vol. 1857, pages 1-15, 2000.) Computational efficiency: we are looking for a set of weak learners (classifiers, or hypotheses) with p(y = t) > 1/2. Different classes of base models; choices could be: trees (stumps, small, large), Naive Bayes, k-Nearest Neighbors, Neural Networks, Linear SVM, YOUR-MAGICAL-MODEL, ... How to find the base models? Train a diverse set of models on the same dataset, or train a set of models from a specific class of learners by using diversity in the datasets, parameters, or initial conditions: cross-validated committees, bagging, boosting. Bagging: create subsets of the training samples, called bootstrap replicates, each containing examples drawn randomly with replacement from the original training dataset, and train learning algorithms over them. The method is called bootstrap aggregation. Originally developed to reduce the variance of the learning algorithms. L. Breiman, Bagging Predictors, Machine Learning, Vol. 24, pages 123-140, 1996. Stagewise additive modeling: F(x) = Σ_{m=1}^{M} βm fm(x). General forward stagewise additive modeling: set F^(0)(x) = 0; for m = 1 to M, do {fm(x), βm} = argmin_{f,β} Σ_{n=1}^{N} L(tn, F^(m−1)(xn) + β f(xn)); F^(m)(x) = F^(m−1)(x) + βm fm(x). J. Friedman, T. Hastie, R. Tibshirani, Additive Logistic Regression: a Statistical View of Boosting, Annals of Statistics, Vol. 28, pages 337-407, 2000. AdaBoost: F(x) = Σ_{m=1}^{M} αm fm(x), with the misclassification indicator I(t, y) = [t ≠ y]. Discrete AdaBoost: set W = {w1, w2, ..., wN} with ∀n: wn = 1/N; for m = 1 to M, do: fm(x) = argmin_f Σ_{n=1}^{N} wn (tn − f(xn))²; em = Σ_{n=1}^{N} wn I(tn, fm(xn)) / Σ_{n=1}^{N} wn; αm = log((1 − em)/em); wn ← wn exp(αm I(tn, fm(xn))). Y. Freund, R. Schapire, Experiments with a New Boosting Algorithm, Proceedings of ICML, pages 148-156, 1996. Practical example: tracking visual objects. H. Grabner, M. Grabner, H. Bischof, Real-Time Tracking via On-line Boosting, BMVC, 2006.
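The Discrete AdaBoost loop above can be sketched with decision stumps as the weak learners; the stump hypothesis class and the tiny 1-D dataset in the usage example are assumptions for illustration, not part of the tutorial.

```python
import numpy as np

def fit_stump(X, t, w):
    """Weighted-error-minimizing decision stump (feature, threshold, polarity)."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, j] <= thr, 1, -1)
                err = w[pred != t].sum()
                if err < best_err:
                    best, best_err = (j, thr, pol), err
    return best, best_err

def adaboost(X, t, M=10):
    """Discrete AdaBoost: upweight mistakes, combine stumps by weighted vote."""
    N = len(t)
    w = np.full(N, 1.0 / N)
    ensemble = []
    for _ in range(M):
        (j, thr, pol), err = fit_stump(X, t, w)
        err = min(max(err, 1e-12), 1 - 1e-12)     # guard the log
        alpha = np.log((1 - err) / err)           # the slide's alpha_m
        pred = pol * np.where(X[:, j] <= thr, 1, -1)
        w = w * np.exp(alpha * (pred != t))       # reweight misclassified points
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    def F(Xq):
        s = sum(a * p * np.where(Xq[:, j] <= thr, 1, -1)
                for a, j, thr, p in ensemble)
        return np.sign(s)
    return F
```

On a 1-D threshold problem a single stump already separates the data, and the boosted vote reproduces it exactly.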
Graph Spectral Image Smoothing A new method for smoothing both gray-scale and color images is presented that relies on the heat diffusion equation on a graph. We represent the image pixel lattice using a weighted undirected graph. The edge weights of the graph are determined by the Gaussian weighted distances between local neighbouring windows. We then compute the associated Laplacian matrix (the degree matrix minus the adjacency matrix). Anisotropic diffusion across this weighted graph-structure with time is captured by the heat equation, and the solution, i.e. the heat kernel, is found by exponentiating the Laplacian eigen-system with time. Image smoothing is accomplished by convolving the heat kernel with the image, and its numerical implementation is realized by using the Krylov subspace technique. The method has the effect of smoothing within regions, but does not blur region boundaries. We also demonstrate the relationship between our method, standard diffusion-based PDEs, Fourier domain signal processing and spectral clustering. Experiments and comparisons on standard images illustrate the effectiveness of the method. Graph Spectral Image Smoothing Overview Literature Motivation: Why Graph Diffusion? 
Aim in this paper Steps Graph Representation of Images Graph Edge Weight Laplacian of a graph Laplacian spectrum Graph Heat Kernel (1) Graph Heat Kernel (2) Graph Heat Kernel (3) Graph Heat Kernel (4) Lazy random walk on graph (1) Lazy random walk on graph (2) Continuous time random walk (1) Continuous time random walk (2) Continuous time random walk (3) Anisotropic diffusion as heat flow on a graph Graph spectral image smoothing (1) Graph spectral image smoothing (2) Graph spectral image smoothing (3) Meaning Numerical Implementation Relation to Anisotropic Diffusion Relation to Signal Processing Relation to Spectral Clustering Results (1) Results (2) Results (3) Root-Mean-Square Error comparison (1) Root-Mean-Square Error comparison (2) Root-Mean-Square Error comparison (3) Root-Mean-Square Error comparison (4) Conclusion
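The pipeline in the outline (weighted graph → Laplacian → heat kernel exp(−tL) → convolution with the image) can be sketched on a 1-D signal; a dense eigendecomposition stands in for the paper's Krylov-subspace implementation, and the diffusion time and weight bandwidth are assumed parameter values.

```python
import numpy as np

def heat_kernel_smooth(signal, t=2.0, sigma=0.2):
    """Anisotropic smoothing as heat diffusion on a weighted path graph:
    strong edges inside smooth regions, weak edges across intensity jumps."""
    f = np.asarray(signal, dtype=float)
    n = len(f)
    W = np.zeros((n, n))
    for i in range(n - 1):
        # Gaussian-weighted distance between neighbouring samples
        W[i, i + 1] = W[i + 1, i] = np.exp(-(f[i] - f[i + 1]) ** 2 / sigma ** 2)
    L = np.diag(W.sum(axis=1)) - W                   # degree minus adjacency
    lam, V = np.linalg.eigh(L)
    K = V @ np.diag(np.exp(-t * lam)) @ V.T          # heat kernel exp(-tL)
    return K @ f
```

Because L annihilates the constant vector, the total intensity is preserved, and a sharp step gets a near-zero edge weight, so diffusion smooths within regions without blurring the boundary, the key property claimed in the abstract.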
Probabilistic Relaxation Labeling by Fokker-Planck Diffusion on a Graph In this paper we develop a new formulation of probabilistic relaxation labeling for the task of data classification using the theory of diffusion processes on graphs. The state space of our process is the set of nodes of a support graph, which represent potential object-label assignments. The edge weights of the support graph encode data-proximity and label-consistency information. The state vector of the diffusion process represents the object-label probabilities and evolves with time according to the Fokker-Planck equation. We show how the solution state vector can be estimated using the spectrum of the Laplacian matrix of the weighted support graph. Experiments on various data clustering tasks show the effectiveness of our new algorithm. Probabilistic Relaxation Labelling by Fokker-Planck Diffusion on a Graph Outline Overview Probabilistic Relaxation Labelling Relaxation Labelling (1) Relaxation Labelling (2) Graph theoretical setting for relaxation labelling Graph spectral relaxation labeling Support graph Example Fokker-Planck Diffusion Diffusion Processes Probabilistic Relaxation Labelling by Diffusion Spectral Graph Theory (1) Spectral Graph Theory (2) Graph-spectral solution of FP Eqn. Experiments (1) Experiments (2) Experimental Results (1) Experimental Results (2) Discussion
A general purpose segmentation algorithm using analytically evaluated random walks An ideal segmentation algorithm could be applied equally to the problem of isolating organs in a medical volume or to editing a digital photograph without modifying the algorithm, changing parameters, or sacrificing segmentation quality. However, a general-purpose, multiway segmentation of objects in an image/volume remains a challenging problem. In this talk, I will describe a recently developed approach to this problem that inputs a few training points from a user (e.g., from mouse clicks) and produces a segmentation by computing the probabilities that a random walker leaving unlabeled pixels/voxels will first strike the training set. By exact mathematical equivalence with a problem from potential theory, these probabilities may be computed analytically and deterministically. The algorithm is developed on an arbitrary, weighted, graph/mesh in order to maximize the broadness of application. I will illustrate the use of this approach with examples from several segmentation problems (without modifying the algorithm or the single free parameter), compare this algorithm to other approaches and discuss the theoretical properties that describe its behavior. 
General Purpose Image Segmentation with Random Walks Outline Overview of SCR (1) Overview of SCR (2) Overview of SCR (3) Overview of SCR (4) Outline General Purpose Segmentation (1) General Purpose Segmentation (2) General Purpose Segmentation (3) General Purpose Segmentation (4) General Purpose Segmentation (5) Outline Random Walker - Concept (1) Random Walker - Concept (2) Random Walker - Concept (3) Outline Random Walker - Properties (1) Random Walker - Properties (2) Random Walker - Properties (3) Random Walker - Properties (4) Random Walker - Properties (5) Random Walker - Properties (6) Outline Random Walker - Theory (1) Random Walker - Theory (2) Random Walker - Theory (3) Random Walker - Theory (4) Random Walker - Theory (5) Random Walker - Theory (6) Random Walker - Concept (4) Random Walker - Theory (7) Random Walker - Theory (8) Random Walker - Theory (9) Random Walker - Theory (10) Random Walker - Theory (11) Outline Random Walker - Numerics (1) Random Walker - Numerics (2) Outline Random Walker - Results (1) Random Walker - Results (2) Random Walker - Results (3) Random Walker - Results (4) Random Walker - Results (5) Random Walker - Results (6) Outline Random Walker - New (1) Random Walker - New (2) Random Walker - New (3) Random Walker - New (4) Random Walker - New (5) Outline Conclusion Conclusion ? More Information
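The analytic computation described in the talk, the probability that a random walker first strikes each labelled seed, reduces to a Dirichlet problem on the graph Laplacian and is a few lines of linear algebra. A minimal sketch (the uniformly weighted path graph in the example is an assumption):

```python
import numpy as np

def random_walker_probs(W, seeds):
    """For each node, the probability that a random walker started there
    first reaches a seed of label 1 (vs label 0): solve L_U x = -B m."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    labeled = sorted(seeds)
    unlabeled = [i for i in range(n) if i not in seeds]
    m = np.array([seeds[i] for i in labeled], dtype=float)
    LU = L[np.ix_(unlabeled, unlabeled)]       # Laplacian on unlabeled nodes
    B = L[np.ix_(unlabeled, labeled)]          # coupling to the seeds
    x = np.linalg.solve(LU, -B @ m)            # harmonic (Dirichlet) solution
    probs = np.empty(n)
    probs[labeled] = m
    probs[unlabeled] = x
    return probs
```

On a uniformly weighted path with seeds at the two ends, the harmonic solution interpolates linearly, matching the intuition that a walker's first-strike probability decays with distance from the seed.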
Qualitative Spatial Relationships for Image Interpretation by using Semantic Graph In this paper, a new way to express complex spatial relations is proposed in order to integrate them into a Constraint Satisfaction Problem with bilevel constraints. These constraints allow us to build semantic graphs, which can describe more precisely the spatial relations between subparts of a composite object that we look for in an image. For example, they allow us to express complex spatial relations such as "is surrounded by". This approach can be applied to image interpretation, and some examples on real images are presented. Image interpretation by using conceptual graphs: introducing complex spatial relations Spatial relations are a key point of image understanding Image interpretation Matching of heterogeneous graphs Graph matching Image interpretation AC4 algorithm: solves the graph matching problem by solving a constraint satisfaction problem Limits Image interpretation Solution: checking arc-consistency with bilevel constraints Constraint satisfaction problem with two levels of constraints How to match? (1) How to match? (2) How to match? (3) How to match? (4) How to match? (5) How to match? (6) How to match? (7) How to match? (8) How to match? (9) Intra-node constraints: compatibility between two values (Cmpi) (1) Intra-node constraints: compatibility between two values (Cmpi) (2) Image interpretation Complex spatial relations (1) Complex spatial relations (2) Complex spatial relations (3) Complex spatial relations (4) Connectivity-Direction-Metric Formalism (1) Connectivity-Direction-Metric Formalism (2) Connectivity-Direction-Metric Formalism (3) Connectivity-Direction-Metric Formalism (4) How to combine these relations? (1) How to combine these relations? (2) We want more! (1) We want more! (2) We want more! (3) We can link one interface with different nodes of the shell and describe any logical combination An example with the relation "is around of" to describe a flower: (1) An example with the relation "is around of" to describe a flower: (2) Connectivity-Direction-Metric Formalism (5) Connectivity-Direction-Metric Formalism (6) Experiments (1) Experiments (2) Conclusion Thank you for your attention
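The arc-consistency machinery behind the matcher can be illustrated with the simpler AC-3 scheme; this is a stand-in for illustration (the paper uses AC4 with bilevel constraints, which is not reproduced here).

```python
from collections import deque

def ac3(domains, constraints):
    """AC-3 arc consistency: prune values of x with no supporting value in y.
    constraints[(x, y)] is a predicate on (value_x, value_y)."""
    queue = deque(constraints)
    while queue:
        x, y = queue.popleft()
        pred = constraints[(x, y)]
        pruned = [vx for vx in domains[x]
                  if not any(pred(vx, vy) for vy in domains[y])]
        if pruned:
            domains[x] = [v for v in domains[x] if v not in pruned]
            if not domains[x]:
                return False               # some variable has an empty domain
            for (a, b) in constraints:     # revisit arcs pointing at x
                if b == x:
                    queue.append((a, b))
    return True
```

For the toy constraint X < Y over domains {1, 2, 3}, propagation prunes 3 from X and 1 from Y; the same revise-and-requeue loop drives the label propagation over graph nodes in the paper's setting.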
Separation of the Retinal Vascular Graph in Arteries and Veins The vascular structure of the retina consists of two kinds of vessels: arteries and veins. Together these vessels form the vascular graph. In this paper we present an approach to separating arteries and veins based on a pre-segmentation and a few hand-labelled vessel segments. We use a rule-based method to propagate the vessel labels through the vascular graph. We embed this task as a double-layered constrained search problem steered by a heuristic AC-3 algorithm to overcome the NP-hard computational complexity. Results are presented on vascular graphs generated from manual as well as automatic segmentations. Separation of the Retinal Vascular Graph in Arteries and Veins Outline Medical Purpose Vessel segmentation Graph-based representation of the vasculature SAT-Problem Specification (vessel labelling) The labelling process (AC-3*) (1) The labelling process (AC-3*) (2) Operations for graph manipulation (edge labelling) Steering the labelling process (Belief propagation) Initial edge labelling Solving Conflicts Interactive labelling tool Results on manually segmented images Discussion of results on manual segmentations Results on automatic segmentations Summary and Conclusions Final slide
Theory and Applications of Kernel Space Basics of kernel definitions and theory are given first. Then three algorithms are described with an explicit reference to the representer theorem: Support Vector Machines, Support Vector Regression, and Kernel Principal Component Analysis. The last course is devoted to examples of kernel design (Mahalanobis kernels and Fisher kernels).
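Of the three representer-theorem algorithms listed, Kernel PCA is the most compact to sketch: center the Gram matrix in feature space, eigendecompose, and scale the eigenvectors. The RBF kernel and its width are assumptions of this sketch.

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    """Kernel PCA with an RBF kernel: eigendecompose the doubly centred
    Gram matrix and return the training points' projections."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                      # RBF Gram matrix
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                               # centre in feature space
    lam, V = np.linalg.eigh(Kc)
    idx = np.argsort(lam)[::-1][:n_components]   # largest eigenvalues first
    lam, V = lam[idx], V[:, idx]
    return V * np.sqrt(np.maximum(lam, 0.0))     # projections onto the axes
```

Since the centred Gram matrix annihilates the all-ones vector, the returned projections have zero mean, exactly the feature-space centering that ordinary PCA performs in input space.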
Learning the topology of a data set Learning the topology of a data set Introduction A question without an answer? A subjective answer Why learning topology: (semi-)supervised applications Why learning topology: unsupervised applications Generative manifold learning Computational topology - 1 Computational topology - 2 Computational topology - 3 Computational topology - 4 Computational topology - 5 Computational topology - 6 Application: known manifold Approximation: manifold known through a data set Topology representing network - 1 Topology representing network - 2 Topology representing network - 3 Topology representing network - 4 Topology representing network - 5 Topology representing network - 6 Topology representing network - 7 Topology representing network - 8 Topology representing network: some drawbacks - 1 Topology representing network: some drawbacks - 2 Topology representing network: some drawbacks - 3 General assumptions on data generation - 1 General assumptions on data generation - 2 General assumptions on data generation - 3 General assumptions on data generation - 4 3 assumptions, 1 generative model - 1 3 assumptions, 1 generative model - 2 3 assumptions, 1 generative model - 3 A Gaussian-point and a Gaussian-segment Hola! Proposed approach: 3 steps - 1 Number of prototypes Proposed approach: 3 steps - 2 EM updates - 1 EM updates - 2 Proposed approach: 3 steps - 3 Threshold setting Toy experiment - 1 Toy experiment - 2 Toy experiment - 3 Other applications Comments Key points Open questions Related works Thanks PASCAL BOOTCAMP, Vilanova, July 2007 Learning the topology of a data set Michaël Aupetit - Researcher Engineer (CEA) Pierre Gaillard - Ph.D. Student Gérard Govaert -
Professor (University of Technology of Compiègne) Introduction: given a set of M data points in R^D, estimating the density allows solving various problems: classification, clustering, regression. BootCamp PASCAL A question without an answer? Generative models cannot answer this question: what is the "shape" of this data set? A subjective answer: the expected answer is 1 point and 1 curve, not connected to each other. The problem: what is the topology of the principal manifolds? Why learning topology: (semi-)supervised applications: estimate the complexity of the classification task [Lallich02, Aupetit05Neurocomputing]; add a topological a priori to design a classifier [Belkin05Nips]; add topological features to statistical features; classify through the connected components or the intrinsic dimension [Belkin]. Why learning topology: unsupervised applications: clusters defined by the connected components; data exploration (e.g. shortest paths); robotics (optimal path, inverse kinematics) [Zeller, Schulten - IEEE ISIC 1996]. Generative manifold learning: Gaussian mixture, MPPCA [Bishop], GTM [Bishop], revisited principal curves [Hastie, Stuetzle]. Problems: fixed or incomplete topology. Computational topology: all the previous work about topology learning has been grounded on the result of Edelsbrunner and Shah (1997), who proved that given a manifold M and a set of N prototypes near M, there exists a subgraph of the Delaunay graph of the prototypes (more exactly, a subcomplex of the Delaunay complex) which has the same topology as M. Extractible topology: O(DN³). Application: known manifold, e.g. the topology of molecules [Edelsbrunner 1994]. Approximation: manifold known through a data set. Topology Representing Network [Martinetz, Schulten 1994]: connect the 1st and 2nd nearest prototypes of each data point. Good points: 1. O(DNM); 2. if there are enough prototypes and they are well located, then the resulting graph is "good" in practice.
Topology Representing Network, some drawbacks from the machine-learning point of view: noise sensitivity; not self-consistent [Hastie]; no quality measure. How can we measure the quality of the TRN when D ≥ 3 rules out visual inspection? How can we compare two models? For all these reasons, we propose a generative model. General assumptions on data generation: there are unknown principal manifolds, from which data are drawn with an unknown pdf and corrupted by some unknown noise, leading to the observations. The goal is to learn from the observed data the principal manifolds, so that their topological features can be extracted. 3 assumptions, 1 generative model: the manifold is assumed close to the Delaunay graph (DG) of some prototypes; we associate to each component j of the graph a weighted uniform distribution, giving the mixture p(x) = Σ_{j∈J} p(j) p(x | j, σ); and we convolve the components with isotropic Gaussian noise. A Gaussian-point and a Gaussian-segment: how can a generative model be based on points and segments? A Gaussian-point is p_0(x | A, σ) = (2πσ²)^(−D/2) exp(−(x − A)² / (2σ²)); a Gaussian-segment is p_1(x | [AB], σ) = ∫_[AB] p_0(x | v, σ) dv, which can be expressed in terms of erf. Proposed approach, step 1 (initialization): locate the prototypes with a "classical" isotropic Gaussian mixture, then build the Delaunay graph and initialize the generative model with equiprobable components. The number of prototypes is chosen by minimizing BIC ≈ −likelihood + complexity of the model. Step 2 (learning): with p(x) = Σ_{j∈J} p(j) p(x | j, σ), update the variance of the Gaussian noise, the weights of the components and the locations of the prototypes with the EM algorithm, so as to maximize the likelihood of the model w.r.t. the N observed data: L(θ, σ; x, DG) = Π_{i=1}^N p(x_i; DG, θ, σ). Step 3
(after the learning): some components have a (quasi-)null probability (weight); they do not explain the data and can be pruned from the initial graph. Threshold setting: a Cattell scree test on the cumulative sum of the priors, plotted against the number of simplices (threshold ≈ 0.0022 in the experiment). Toy experiments: thresholding on the number of witnesses. Other applications. Comments: there is "no free lunch": time complexity O(DN^3) for the initial Delaunay graph, slow convergence of EM, and local optima. Key points: statistical learning of the topology of a data set; assumption: the initial Delaunay graph is rich enough to contain a subgraph having the same topology as the principal manifolds; based on a statistical criterion (the likelihood) available in any dimension; a "generalized" Gaussian mixture, which can be seen both as a generalization of the Gaussian mixture (the no-edges case) and as a finite mixture (over the Gaussian-segments) of infinite mixtures (each Gaussian-segment). This preliminary work is an attempt to bridge the gap between statistical learning theory and computational topology. Open questions: validity of the assumption that a "good" penalized likelihood yields a "good" topology; is there a theorem of "universal approximation" of manifolds? Related works: publications at NIPS 2005 (unsupervised) and ESANN 2007 (supervised: analysis of the iris and oil-flow data sets); a workshop submission at NIPS on this topic in collaboration with F. Chazal (INRIA Futurs), D. Cohen-Steiner (INRIA Sophia), S. Canu and G. Gasso (INSA Rouen). Thanks. Backup: topology, intrinsic dimension (d_i = 0, 1, 2), number of holes (Betti numbers), connectedness; homeomorphism: topological equivalence.
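The Gaussian-point and Gaussian-segment densities above can be sketched numerically. The paper evaluates the segment integral in closed form with erf; this sketch approximates it by quadrature instead, purely for illustration:

```python
import numpy as np

def gaussian_point(x, a, sigma):
    """Isotropic Gaussian centred on a prototype (a 'Gaussian-point'):
    p0(x | a, sigma) = (2*pi*sigma^2)^(-D/2) exp(-|x - a|^2 / (2 sigma^2))."""
    x, a = np.asarray(x, float), np.asarray(a, float)
    sq = np.sum((x - a) ** 2)
    return (2 * np.pi * sigma**2) ** (-len(x) / 2) * np.exp(-sq / (2 * sigma**2))

def gaussian_segment(x, a, b, sigma, n=2000):
    """Uniform density on segment [a, b] convolved with the same Gaussian
    noise (a 'Gaussian-segment'), approximated by trapezoidal quadrature
    instead of the closed-form erf expression used in the paper."""
    ts = np.linspace(0.0, 1.0, n)
    vs = np.outer(1 - ts, a) + np.outer(ts, b)   # points along the segment
    vals = [gaussian_point(x, v, sigma) for v in vs]
    return np.trapz(vals, ts)

# Mass concentrates near the segment: the density at its midpoint exceeds
# the density half a unit beyond an endpoint.
a, b = np.array([0.0, 0.0]), np.array([1.0, 0.0])
print(gaussian_segment([0.5, 0.0], a, b, 0.1) > gaussian_segment([1.5, 0.0], a, b, 0.1))  # True
```

A mixture of such point and segment components, with weights and σ fitted by EM as in step 2, is exactly the kind of model the talk describes.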
Learning to Reconstruct 3D Human Pose and Motion from Silhouettes We describe our ongoing work on learning-based methods for recovering 3D human body pose and motion from single images and from monocular image sequences. The methods work directly with raw image observations and require neither an explicit 3D body model nor a prior labelling of body parts in the image. Instead, they recover the body pose or motion by direct nonlinear regression against shape descriptors extracted automatically from image silhouettes or contours. Learning to Reconstruct 3D Human Pose and Motion from Silhouettes Goal 2 Broad Classes of Approaches "Model-Free" Learning-Based Approach The Basic Idea Silhouette Descriptors Why Use Silhouettes? Ambiguities Shape Context Histograms Shape Context Histograms Encode Locality Nonlinear Regression Regression Model Regularized Least Squares Relevance Vector Machine: a brief introduction Contd. Pose from Static Images Training & Test Data Methods Tested Synthetic Spiral Walk Test Sequence Spiral Walk Test Sequence Some statistics Glitches Real Image Example Understanding the Problem Pose from Video Sequences Tracking Framework Joint Regression Equations Results with Joint Regression Spiral Walk Test Sequence Real Images Test Sequence Conclusion Real Images Test Sequence
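The "Regularized Least Squares" step mentioned in the outline (regressing pose vectors against silhouette shape descriptors) can be sketched as ridge regression. The dimensions and data below are purely hypothetical toy stand-ins, not the paper's descriptors or pose parameterization:

```python
import numpy as np

def ridge_fit(X, Y, lam=1e-3):
    """Regularized least squares: W = (X^T X + lam*I)^(-1) X^T Y,
    mapping descriptor vectors (rows of X) to pose vectors (rows of Y)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Hypothetical toy problem: 100-dim "shape descriptors" -> 3 "joint angles".
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
W_true = rng.normal(size=(100, 3))
Y = X @ W_true + 0.01 * rng.normal(size=(200, 3))

W = ridge_fit(X, Y)
err = np.linalg.norm(W - W_true) / np.linalg.norm(W_true)
print(err < 0.1)  # the linear map is recovered to within a few percent
```

The Relevance Vector Machine used in the talk replaces the single ridge penalty with per-weight priors, which drives most weights to zero; the least-squares core is the same.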
Graph Based Shapes Representation and Recognition In this paper, we propose to represent shapes by graphs. Based on graphic primitives extracted from the binary images, attributed relational graphs are generated. The nodes of the graph represent shape primitives such as vectors and quadrilaterals, while the arcs describe the mutual relations between primitives. To be invariant to transformations such as rotation and scaling, relative geometric features extracted from the primitives are associated with the nodes and edges as attributes. Concerning graph matching, due to the NP-completeness of graph-subgraph isomorphism, considerable attention is given to different strategies of inexact graph matching. We also present a new scoring function to compute a similarity score between two graphs, using the numerical values associated with the nodes and edges of the graphs. The adaptation of a greedy graph-matching algorithm with the new scoring function demonstrates significant performance improvements over traditional exhaustive graph-matching searches. Graph Based Shapes Representation and Recognition Graph Based Shapes Representation Graph Based Shapes Recognition, Greedy Algorithm Graph Based Shapes Recognition, Results Poster Session
System for Extracting Data (Facts) from Large Amounts of Unstructured Documents (INTERNET) What are we solving Extracting Nobel prize winners (1-3) Assessing the candidates Bootstrapping and feature selection Pseudo code Hit limit in search engines Difficulties Features problem Features solution Precision/recall Conclusion Further work. Luka Bradeško. Modelled on the paper by Oren Etzioni et al.: Web-Scale Information Extraction in KnowItAll. What are we solving? Problem: manually searching for factual information in a large amount of data is a tedious, error-prone process. Solution: the user describes the wanted class using a simple search query and gets a list of instances as a result. Examples: the search query "Nobel prize winner/winners" yields instances Albert Einstein, Kris Cardenas, Yasunori Kawabata, ...; the search query "City/Cities" yields instances Ljubljana, New York, Berlin, ...

Extracting Nobel prize winners. Input: the search queries "nobel prize winner; laureate" and "nobel prize winners; laureates", plus the extraction rules ">class2< such as >NPList<", ">NP< is a >class<", ">class2< including >NPList<". Generate extraction queries for a search engine: "nobel prize winners such as *", "* is a nobel prize winner", "nobel prize winners including *". Send the extraction queries to the search engine, collect the query results, and extract candidates from the snippets using the extraction rules: the snippet "Nobel Prize winners such as Toni Morrison, Wole Soyinka, Derek Walcott..." matches ">class2< such as >NPList<" and yields the candidates Toni Morrison, Wole Soyinka, Derek Walcott.

Assessing the candidates. Generate discriminators from the rules and the user input: "nobel prize winners such as >Candidate<", ">Candidate< is a nobel prize winner", ">Candidate< is a nobel prize laureate",
"nobel prize winners, including >Candidate<". Generate discriminator queries from the discriminators and the candidates: "nobel prize winners such as Toni Morrison", "Toni Morrison is a nobel prize winner", "Toni Morrison is a nobel prize laureate", "nobel prize winners, including Toni Morrison". Evaluate the candidates by calculating the PMI for each discriminator query:

PMI = HITS(D + C) / HITS(C) = HITS("nobel prize winner Toni Morrison") / HITS("Toni Morrison") = 1,760 / 1,640,000 ≈ 0.001073

Use the PMI numbers as features for learning algorithms.

Bootstrapping and feature selection. We need a training set of positive and negative instances of the target class. Instead of evaluating all candidates, which is time-consuming, we select n seed candidates and use their average PMI as the evaluation: the m candidates with the highest average PMI are taken as positive seeds, the m candidates with the lowest average PMI as negative seeds. Then select the k best discriminators, tested on the m positive and m negative seeds (these two steps can be iterated). Finally, evaluate all candidates on the k discriminators and use the PMIs as features. Result: Toni Morrison → (0.001073, 0.000, 0.3221, ...).

Pseudo code:

KNOWITALL(information focus I, rules R) {
    Q = make queries for each I from R
    candidates C = {}
    for each q in Q {
        w = send q to search engine(s)
        C += extract from w, using R
    }
    D = make discriminators from R
    assess random n candidates
    take the m candidates with the best PMI as positive seeds
    take m candidates with 0 PMI as negative seeds
    select the k best discriminators
    assess all candidates
}

Hit limit in search engines. Problem: search engines return at most the top 1000 results per query. Solution: recursive query expansion. Let Q be the original query (say 1,900 results); Q1 = Q + w, where w comes from a pre-specified list of words, usually the most frequent ones (1,200 results); Q2 = Q − w (700 results). Example: Q = "nobel prize winners such as *"; Q1 = "nobel prize winners such as *" + "what"; Q2 = "nobel prize winners such as *" − "what".

Difficulties. Features:
PMI does not always give the needed information; the best 5 PMI results are usually not correct, which makes it hard to choose positive and negative seeds. Solution: use more, or other, features: PMI, hit counts, page rank, redundancy counts, the number of independent positive PMIs. Precision/recall: hard to calculate, or even approximate, for some classes. Time complexity: fetching web pages and assessing candidates is time-consuming. Solution: extracting just from snippets is as good as extracting from full web pages; for fast results, use redundancy counts instead of PMI.

Features problem. The PMI features used in the original KnowItAll were not always useful. The problem arises when there are a few occurrences of a candidate written the wrong way and they all come from one extractor: in that case the PMI measure is very high, and can come close to 1/m, where m is the number of assessors, whereas a normal PMI should be in the range of 1/100m. Top PMI results are usually wrong: "Amartya Zen", "Ahmed Zuweil", "GROPED A WOMAN", "George Hush", but also Albert Einstein, Andrew Jackson, "American presidents", "Nobel prize winners", Pope John Paul II, "Middle Powers".

Features solution. For bootstrapping, and even if we just use a threshold to divide positive classes from negative ones, the redundancy of candidates seems a better evaluation than PMI: the top-most hits are in most cases correct (e.g. Dalai Lama, Mikhail Gorbachev, Bill Clinton, Ronald Reagan, Jose Saramago, President Bush, Frank Sinatra, Ludwig von Bertalanffy). A good solution would also be weighted redundancy, weighted PMI, or a combination. The number of positive PMIs gives similar results to the redundancy count, but is not as fast.

Precision/recall, tested on searches for Nobel prize winners / American presidents; results only from the first 1000 hits; results edited (e.g. "A. Einstein" → "Albert Einstein"):

                    PMI-based             Redundancy-based (first 100 / 35)
                    Precision   Recall    Precision   Recall
  Nobel winners     83.7        53.4      100         12.7
  Presidents        66.0        81.4      90          65

Conclusion. The redundancy count is a better evaluation method than PMI.
From some threshold point on, we can be almost 100% sure that candidates are positive (this is not true for PMI). When used on all hits, recall increases without affecting precision. For automatic ontology population, precision is more important than recall. If you don't need a probabilistic evaluation, the redundancy method is really good for getting fast, precise results. The best way to evaluate candidates is to use all possible features, weighted (PMI, redundancy, independence, hits, ...).

Further work. What is working: the backbone of the KnowItAll system (structures, main algorithms, principles); a working KnowItAll prototype; some new features for machine-learning algorithms; a playground for testing different learning methods. What is in the development stage: recursive query expansion; multithreading (faster fetching of web results); learning methods (the authors of the original KnowItAll found that a simple threshold works better than the Naive Bayes classifier they used); multiple search engines; a database (caching, storing previous searches, tracking of errors).
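The PMI statistic used to assess candidates reduces to a ratio of hit counts; the worked example from the slides can be reproduced directly. The hit counts here are the ones quoted in the talk; fetching them live from a search engine is of course the real system's job:

```python
def pmi(hits_discriminator_with_candidate, hits_candidate):
    """KnowItAll-style PMI: the fraction of a candidate's hits that also
    match a discriminator phrase, PMI = HITS(D + C) / HITS(C)."""
    if hits_candidate == 0:
        return 0.0
    return hits_discriminator_with_candidate / hits_candidate

# Slide example: HITS("nobel prize winner Toni Morrison") = 1,760,
# HITS("Toni Morrison") = 1,640,000.
print(round(pmi(1760, 1640000), 6))  # 0.001073
```

A vector of such ratios, one per discriminator, is exactly the feature vector (0.001073, 0.000, 0.3221, ...) shown for Toni Morrison.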
Comparing Sets of 3D Digital Shapes through Topological Structures New technologies for shape acquisition and rendering of digital shapes have simplified the process of creating virtual scenes; nonetheless, the annotation, recognition and manipulation of complete virtual scenes, and even of their subparts, are still open problems. Once the main components of a virtual scene are represented by structural descriptions, this paper deals with the problem of comparing two (or more) sets of 3D objects, where each model is represented by an attributed graph. We define a new distance to estimate the possible similarities among the sets of graphs and validate our work using a shape graph. Comparing sets of 3D digital shapes through topological graphs Goal: 3D scene comparison Approach
A Quadratic Programming Approach to the Graph Edit Distance Problem In this paper we propose a quadratic programming approach to computing the edit distance of graphs. Whereas the standard edit distance is defined with respect to a minimum-cost edit path between graphs, we introduce the notion of fuzzy edit paths between graphs and provide a quadratic programming formulation for the minimization of fuzzy edit costs. Experiments on real-world graph data demonstrate that our proposed method is able to outperform the standard edit distance method in terms of recognition accuracy on two out of three data sets. A QUADRATIC PROGRAMMING APPROACH TO THE GRAPH EDIT DISTANCE PROBLEM Graph Edit Distance Quadratic Programming for GED 6TH WORKSHOP ON GBR, ALICANTE, 2007 Michel Neuhaus and Horst Bunke mneuhaus@iam.unibe.ch Institute of Computer Science and Applied Mathematics University of Bern, Switzerland Graph Edit Distance: measuring the distance (or similarity) of graphs is an important task in pattern recognition and related areas. Graph edit distance (GED) is one of the most general graph distance measures: it measures the distance of a pair of graphs in terms of the minimum number of edit operations required to transform one graph into the other. June 2007. Quadratic Programming for GED: several approaches to GED computation have been proposed. In this work we introduce an alternative formulation of the GED problem which is amenable to solution by means of quadratic programming. The basic idea is to simultaneously consider fuzzy correspondences of all nodes of g and g' when computing d(g, g'). The idea has been experimentally evaluated and resulted in an improvement of the classification performance on two out of three datasets.
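For intuition about what the quadratic program is approximating, exact graph edit distance can be brute-forced on tiny graphs. This sketch restricts itself, for simplicity, to same-size labelled graphs with unit costs (node substitutions plus edge insertions/deletions over all bijections); the general problem also allows node insertion/deletion:

```python
from itertools import permutations

def ged_unit(labels1, edges1, labels2, edges2):
    """Brute-force unit-cost graph edit distance for two graphs with the
    same number of labelled nodes: minimise label substitutions plus
    edge insertions/deletions over all node bijections."""
    assert len(labels1) == len(labels2)
    idx = range(len(labels1))
    e2 = {frozenset(e) for e in edges2}
    best = float("inf")
    for perm in permutations(idx):
        subs = sum(labels1[i] != labels2[perm[i]] for i in idx)
        e1_mapped = {frozenset((perm[u], perm[v])) for u, v in edges1}
        best = min(best, subs + len(e1_mapped ^ e2))  # symmetric difference = edge edits
    return best

# Triangle vs. path on the same labels: one edge deletion suffices.
d = ged_unit(["a", "b", "c"], [(0, 1), (1, 2), (2, 0)],
             ["a", "b", "c"], [(0, 1), (1, 2)])
print(d)  # 1
```

The factorial search over permutations is what makes exact GED intractable and motivates continuous relaxations such as the fuzzy-edit-path QP of the paper.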
Graph-Based Perceptual Segmentation of Stereo Vision 3D Images at Multiple Abstraction Levels This paper presents a new technique based on perceptual information for the robust segmentation of noisy 3D scenes acquired by stereo vision. A low-pass geometric filter is first applied to the given cloud of 3D points to remove noise. The tensor voting algorithm is then applied in order to extract perceptual geometric information. Finally, a graph-based segmenter is utilized for extracting the different geometric structures present in the scene through a region-growing procedure that is applied hierarchically. The proposed algorithm is evaluated on real 3D scenes acquired with a trinocular camera. Graph-Based Perceptual Segmentation of Stereo Vision 3D Images at Multiple Abstraction Levels Segmentation of 3D Images Results of Segmentation in 3D Rodrigo Moreno (*), Miguel Angel Garcia (**) and Domenec Puig (*); (*) Intelligent Robotics and Computer Vision Group, Rovira i Virgili University, Tarragona, Spain; (**) Department of Informatics Engineering, Autonomous University of Madrid, Madrid, Spain. June 11, 2007. Segmentation of 3D images, the problem: input, a cloud of very noisy 3D points acquired through stereo vision; output, a set of non-overlapping homogeneous regions at multiple abstraction levels; restriction, minimize the number of parameters. Our approach: use geometric filtering to generate the multiple abstraction levels; use robust perceptual techniques (tensor voting) to estimate homogeneity; use fast graph-based segmenters (Felzenszwalb & Huttenlocher). Results of segmentation in 3D (original registered points; proposed tensor-based method; vector-based method, regions greater than 5 points): the tensor-based approach outperforms the vector-based one; performance for outdoor and indoor scenes was similar; performance highly depends on the distance;
the new method produces very good results for near regions (centered at 3 m or less), while results in farther regions were driven by the high amount of noise present in the 3D images.
Morphological Operators for Flooding, Leveling and Filtering Images Using Graphs We define morphological operators on weighted graphs in order to speed up image transformations such as floodings, levelings and waterfall hierarchies. The image is represented by its region adjacency graph, in which the nodes represent the catchment basins of the image and the edges link neighboring regions. The weights of the nodes represent the level of flooding in each catchment basin; the weights of the edges represent the altitudes of the pass points between adjacent regions. Morphological operators for flooding, leveling and filtering images using graphs Floodings, razings, levelings simplify images Modeling for a graph representation Graph-based algorithm for flooding (1) Graph-based algorithm for flooding (2)
Graph Based Multilevel Temporal Segmentation of Scripted Content Videos This paper concentrates on a graph-based multilevel temporal segmentation method for scripted content videos. At each level of the segmentation, a similarity matrix of frame strings, which are series of consecutive video frames, is constructed using the temporal and spatial contents of the frame strings. A strength factor is estimated for each frame string using a priori information about the scripted content. From the similarity matrix, reevaluated with a strength function derived from the strength factors, a weighted undirected graph structure is built. The graph is partitioned into clusters, which represent the segments of the video. The resulting structure defines a hierarchically segmented video tree. Comparative performance results on different types of scripted content videos are demonstrated. Graph-Based Multilevel Temporal Segmentation of Scripted Content Videos (1) Graph-Based Multilevel Temporal Segmentation of Scripted Content Videos (2) Graph-Based Multilevel Temporal Segmentation of Scripted Content Videos (3)
Assessing the Performance of a Graph-based Clustering Algorithm Graph-based clustering algorithms are particularly suited for dealing with data that do not come from a Gaussian or a spherical distribution. They can be used for detecting clusters of any size and shape without the need to specify the actual number of clusters; moreover, they can be profitably used in cluster detection problems. In this paper, we propose a detailed performance evaluation of four different graph-based clustering approaches. Three of the algorithms selected for comparison have been chosen from the literature. While these algorithms do not require the number of clusters to be set, they do need some parameters to be provided by the user. As the fourth algorithm under comparison, we therefore propose an approach that overcomes this limitation, proving to be an effective solution in real applications where a completely unsupervised method is desirable. Assessing the performance of a Graph-based Clustering Algorithm Outline
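The core property claimed above (clusters of any size and shape, with the cluster count emerging from the data) is often illustrated with the simplest graph-based clusterer: connect points closer than a threshold and take connected components as clusters. A minimal sketch, where the threshold is exactly the kind of user parameter the paper's fourth algorithm seeks to eliminate:

```python
import numpy as np

def cluster_by_components(points, threshold):
    """Connect points closer than `threshold` and label each connected
    component of the resulting graph as one cluster (DFS traversal)."""
    n = len(points)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) < threshold:
                adj[i].append(j)
                adj[j].append(i)
    labels, comp = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            u = stack.pop()
            if labels[u] == -1:
                labels[u] = comp
                stack.extend(adj[u])
        comp += 1
    return labels

pts = np.array([[0, 0], [0.1, 0], [5, 5], [5.1, 5]])
print(cluster_by_components(pts, 1.0))  # [0, 0, 1, 1] -- two clusters found, none requested
```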
A Fast Construction of the Distance Graph Used for the Classification It has been demonstrated that the difficult problem of classifying heterogeneous projection images, similar to those found in 3D electron microscopy (3D-EM) of macromolecules, can be successfully solved by finding an approximate Max k-Cut of an appropriately constructed weighted graph. Despite the large size of the graph (thousands of nodes) and the theoretical computational complexity of finding even an approximate Max k-Cut, an algorithm has been proposed that finds a good (from the classification perspective) approximate solution within several minutes, running on a standard PC. However, the task of constructing the complete weighted graph that represents an instance of the projection image classification problem is computationally expensive: due to the large number of edges, the computation of edge weights can take tens of hours for graphs containing several thousand nodes. We propose a method, which utilizes an early-termination technique, to significantly reduce the computational cost of constructing such graphs. We compare, on synthetic data sets that resemble projection sets encountered in 3D-EM, the performance of our method with that of a brute-force approach and a method based on nearest-neighbor search. 3D Reconstruction from Heterogeneous Sets Graph Based Classification Procedure 3D Reconstruction from Heterogeneous Sets: the homogeneous reconstruction procedure, preceded by a classification step, splits a heterogeneous projection set into homogeneous projection sets and their 3D models. Projections of simian virus 40 large T-antigen; left column: bent conformation; right column: straight conformation. A classification-based approach to reconstructing from heterogeneous projection sets. Graph Based Classification Procedure [figure: unclassified projection images are turned into a complete weighted graph, and a graph cut yields the
classified projection images]. Images are classified by finding a cut of (nearly) maximum capacity in a complete weighted graph. Properties of the projection images obtained from the same 3D object are used to construct the dissimilarity measure for projection images. The weight assigned to an edge is related to the dissimilarity of the images that the edge connects. For a typical graph of 5,000 nodes, 12,497,500 weights need to be calculated; this is by far the most time-consuming part of the classification procedure!
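The early-termination idea named in the abstract can be sketched generically: accumulate a dissimilarity term by term and stop as soon as it exceeds a cutoff, since the exact weight of a clearly dissimilar pair is never needed. This is an illustrative sketch on plain vectors, not the paper's image dissimilarity:

```python
import numpy as np

def dissimilarity_early_stop(a, b, cutoff):
    """Accumulate squared differences term by term; stop once the running
    total exceeds `cutoff`.  Returns (value, truncated): if truncated,
    only a lower bound (the cutoff) on the true weight is known."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > cutoff:
            return cutoff, True
    return total, False

a, b = np.zeros(1000), np.ones(1000)
val, truncated = dissimilarity_early_stop(a, b, cutoff=10.0)
print(truncated)  # True: stopped after 11 of 1000 terms
```

Applied to all ~12.5 million edge weights of a 5,000-node graph, skipping the tail of every large-weight computation is where the claimed speed-up comes from.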
On the Relation Between the Median and the Maximum Common Subgraph of a Set of Graphs Given a set of elements, the median can be a useful concept to get a representative that captures the global information of the set. In the domain of structural pattern recognition, the median of a set of graphs has also been defined and some properties have been derived. In addition, the maximum common subgraph of a set of graphs is a well-known concept that has various applications in pattern recognition. The computation of both the median and the maximum common subgraph are highly complex tasks. Therefore, for practical reasons, some strategies are used to reduce the search space and obtain approximate solutions for the median graph. The bounds on the sum of distances of the median graph to all the graphs in the set turn out to be useful in the definition of such strategies. In this paper, we reduce the upper bound of the sum of distances of the median graph and we relate it to the maximum common subgraph. On the Relation Between the Median and the Maximum Common Subgraph of a Set of Graphs Motivation Contribution Miquel Ferrer (1), Francesc Serratosa (2) and Ernest Valveny (1); (1) Computer Vision Center, Dep. Ciències de la Computació, Universitat Autònoma de Barcelona, Bellaterra (Spain), {mferrer,ernest}@cvc.uab.es; (2) Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Tarragona (Spain), francesc.serratosa@urv.cat. GbR 2007, Alicante, Spain. Motivation: given a set of graphs S = {g1, g2, ..., gn} ⊆ Z, with each gi = (Vi, Ei, LV, LE), the generalized median graph is ḡ = argmin_{g ∈ Z} Σ_{gi ∈ S} d(g, gi). The known bounds on its sum of distances SOD(ḡ) are max_P {SOD(P)} ≤ SOD(ḡ) ≤ min{SOD(ge), SOD(gu)}, where ge and gu are the empty and union graphs of S respectively and P is a partition of S.
Contribution: based on a particular cost function, on the MCS-based distance d(g1, g2) = |g1| + |g2| − 2|gM| (with gM the maximum common subgraph of g1 and g2), and on the MCS of the whole set of graphs (gMS), we obtain a new upper bound for SOD(ḡ): max_P {SOD(P)} ≤ SOD(ḡ) ≤ SOD(gMS) ≤ min{SOD(ge), SOD(gu)}.
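The MCS-based distance d(g1, g2) = |g1| + |g2| − 2|MCS(g1, g2)| is easy to compute in the special case of uniquely labelled nodes, where the maximum common subgraph reduces to the label intersection. A toy sketch of that easy case (it ignores edges, which general MCS computation must also match):

```python
def mcs_distance(g1_nodes, g2_nodes):
    """d(g1, g2) = |g1| + |g2| - 2|MCS|, here for the easy special case of
    uniquely labelled nodes, where the MCS is just the label intersection."""
    mcs = len(set(g1_nodes) & set(g2_nodes))
    return len(g1_nodes) + len(g2_nodes) - 2 * mcs

# Two graphs sharing the nodes {b, c}: distance 3 + 3 - 2*2 = 2.
print(mcs_distance(["a", "b", "c"], ["b", "c", "d"]))  # 2
```

Identical graphs are at distance 0 and disjoint ones at |g1| + |g2|, which is what makes the sum of distances to the set-wide MCS, SOD(gMS), a natural upper bound for the median's SOD.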
Object categorization with SVM: kernels for local features We will focus on object categorization. The basic idea is to combine the nice invariance properties of local features with the robustness of SVMs and the ability to control generalization in this framework.
Generalized vs Set Median String for Histogram-Based Distances: Algorithms and Classification Results in the Image Domain We compare different statistical characterizations of a set of strings, for three different histogram-based distances. Given a distance, a set of strings may be characterized by its generalized median, i.e., the string, over the set of all possible strings, that minimizes the sum of distances to every string of the set, or by its set median, i.e., the string of the set that minimizes the sum of distances to every other string of the set. For the first two histogram-based distances, we show that the generalized median string can be computed efficiently; for the third one, which is based on histograms with individual substitution costs, we conjecture that this is an NP-hard problem, and we introduce two different heuristic algorithms for approximating it. We experimentally compare the relevance of the three histogram-based distances, and of the different statistical characterizations of sets of strings, for classifying images that are represented by strings. Generalized vs set median strings for histogram-based distances: Algorithms and classification results in the image domain Motivations Christine Solnon, Jean-Michel Jolion, LIRIS / M2DisCo, Université de Lyon, France. Motivations: comparison of three statistical characterizations of a set of strings; approximation of the generalized median string (GM) when exact computation is NP-hard; GM(S) is better than SM(S) for classification purposes in the image domain; exact computation of GM(S) is possible for two of the three distances; a new distance which allows order-based strings, is less sensitive to the length of the strings, and allows application-based tuning through learning of costs.
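The set median is straightforward to compute for any distance, since it only searches within the set. A minimal sketch using one simple histogram-based string distance (L1 between character histograms; the paper's three distances are more refined):

```python
from collections import Counter

def histogram_distance(s, t):
    """L1 distance between the character histograms of two strings:
    a simple instance of a histogram-based string distance."""
    cs, ct = Counter(s), Counter(t)
    return sum(abs(cs[c] - ct[c]) for c in set(cs) | set(ct))

def set_median(strings, dist):
    """Set median: the string OF THE SET minimising the sum of distances to
    all strings of the set.  (The generalized median instead searches over
    ALL possible strings, which is what makes it hard in general.)"""
    return min(strings, key=lambda s: sum(dist(s, t) for t in strings))

S = ["abc", "abd", "abe", "xyz"]
print(set_median(S, histogram_distance))  # "abc" (ties broken by order)
```

The O(n^2) distance evaluations here are the whole cost of the set median; the generalized median's unbounded search space is why the paper needs closed forms or heuristics.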
Constellations and the Graph Embedding using Quantum Commute Times In this paper, we explore analytically and experimentally the commute time of the continuous-time quantum walk. For the classical random walk, the commute time has been shown to be robust to errors in edge weight structure and to lead to spectral clustering algorithms with improved performance. Our analysis shows that the commute time of the continuous-time quantum walk can be determined via integrals of the Laplacian spectrum, calculated using Gauss-Laguerre quadrature. We analyse the quantum commute times with reference to their classical counterpart. Experimentally, we show that the quantum commute times can be used to emphasise cluster-structure.
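The quantum commute time of the paper is obtained from integrals of the Laplacian spectrum; as a point of reference, the classical commute time it generalizes can be computed directly from the pseudoinverse of the same Laplacian. A sketch of the classical quantity only (not the quantum analogue):

```python
import numpy as np

def commute_times(A):
    """Classical commute time from the graph Laplacian L = D - A:
    C(u, v) = vol(G) * (L+_uu + L+_vv - 2 L+_uv), where L+ is the
    Moore-Penrose pseudoinverse and vol(G) the sum of degrees."""
    deg = A.sum(axis=1)
    L = np.diag(deg) - A
    Lp = np.linalg.pinv(L)
    vol = deg.sum()
    n = len(A)
    C = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            C[u, v] = vol * (Lp[u, u] + Lp[v, v] - 2 * Lp[u, v])
    return C

# Path graph 0-1-2: commute time equals 2m times the effective resistance,
# so C(0,1) = 4 and C(0,2) = 8.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
C = commute_times(A)
print(C[0, 1], C[0, 2])  # 4.0 8.0
```

Embedding nodes by rows of a matrix whose squared distances equal C is what makes commute times usable for the spectral clustering mentioned in the abstract.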
Constellations and the Unsupervised Learning of Graphs In this paper, we propose a novel method for the unsupervised clustering of graphs in the context of the constellation approach to object recognition. This method is an EM central clustering algorithm which builds prototypical graphs on the basis of fast matching with graph transformations. Our experiments, both with random graphs and in realistic situations (visual localization), show that our prototypes improve on the set median graphs and also on the prototypes derived from our previous incremental method. We also discuss how the method scales with a growing number of images. Constellations and the Unsupervised Learning of Graphs Contents Constellations & Recognition (1) Constellations & Recognition (2) Our goal? (1) Our goal? (2) Our goal? (3) Mapping graphs to prototypes (Algorithm) (1) Mapping graphs to prototypes (intuition) Mapping graphs to prototypes (Partitions) (1) Mapping graphs to prototypes (Partitions) (2) Mapping graphs to prototypes (Algorithm) (2) Building the prototypes GTM and EM Clustering (Matching) (1) GTM and EM Clustering (Matching) (2) GTM and EM Clustering (Features) GTM and EM Clustering (Algorithm) (1) GTM and EM Clustering (Algorithm) (2) From the prototype: inverse map implicit Experiments: Random generated graphs Experiments: Visual Localization (1) Experiments: Visual Localization (2) Experiments: Visual Localization (3) Experiments: Visual Localization (4) Conclusions & Future Work
Grouping Using Factor Graphs: an Approach for Finding Text with a Camera Phone We introduce a new framework for feature grouping based on factor graphs, which are graphical models that encode interactions among arbitrary numbers of random variables. The ability of factor graphs to express interactions of higher than pairwise order (the highest order encountered in most graphical models used in computer vision) is useful for modeling a variety of pattern recognition problems. In particular, we show how this property makes factor graphs a natural framework for performing grouping and segmentation, which we apply to the problem of finding text in natural scenes. We demonstrate an implementation of our factor-graph-based algorithm for finding text on a Nokia camera phone, which is intended for eventual use in a camera phone system that finds and reads text (such as street signs) in natural environments for blind users. Grouping Using Factor Graphs: An Approach for Finding Text with a Camera Phone Motivation Motivation More Examples Why a Cell Phone (1) Why a Cell Phone (2) Why a Cell Phone (3) Why a Cell Phone (4) Why a Cell Phone (5) Our Algorithm: Feature Selection and Factor Graph (1) Our Algorithm: Feature Selection and Factor Graph (2) Our Algorithm: Feature Selection and Factor Graph (3) Feature Selection (1) Feature Selection (2) Feature Selection: bottom up Matched Vertical/Horizontal Edgelets Anchored Vertical Edgelets A Factor Graph Model The Equations Our Simplifications (1) Our Simplifications (2) Our Simplifications (3) An Example Factor Graph (1) An Example Factor Graph (2) Our Non-Iterative Algorithm (1) Our Non-Iterative Algorithm (2) An Example More Examples (1) More Examples (2) More Examples (3) More Examples (4) More Examples (5) Summary / Discussion (1) Summary / Discussion (2) Summary / Discussion (3) Summary / Discussion (4) Summary / Discussion (5) Summary / Discussion (6)
An Efficient Ontology-Based Expert Peering System This paper proposes a novel expert peering system for information exchange. Our objective is to develop a real-time search engine for an online community where users can query experts, who are simply other participating users knowledgeable in that area, for help on various topics.We consider a graph-based scheme consisting of an ontology tree where each node represents a (sub)topic. Consequently, the fields of expertise or profiles of the participating experts correspond to subtrees of this ontology. Since user queries can also be mapped to similar tree structures, assigning queries to relevant experts becomes a problem of graph matching. A serialization of the ontology tree allows us to use simple dot products on the ontology vector space effectively to address this problem. As a demonstrative example, we conduct extensive experiments with different parameterizations. We observe that our approach is efficient and yields promising results. An Efficient Ontology-Based Expert Peering System Outline Overview The Ontology Mapping to Ontology-space (1) Mapping to Ontology-space (2) Methods Properties Experiment Setup (1) Experiment Setup (2) Numerical Results (1) Numerical Results (2) Observations Application Architecture Conclusion Thank you! An Ef?cient Ontology-Based Expert Peering System Tansu Alpcan Introduction An Ef?cient Ontology-Based Expert Peering System Tansu Alpcan C. Bauckhage S. Agarwal Approach Simulations Application Demo Conclusion Deutsche Telekom Laboratories GBR 2007 Outline Introduction Approach Simulations Application Demo Conclusion An Ef?cient Ontology-Based Expert Peering System Tansu Alpcan Introduction Approach Simulations Application Demo Conclusion Overview An Ef?cient Ontology-Based Expert Peering System Tansu Alpcan A novel expert peering system for community-based information exchange A graph-based scheme consisting of a taxonomy where each node represents a (sub)topic. 
User queries and profiles of the participating experts are mapped to subtrees of this ontology. Assigning queries to relevant experts becomes a problem of graph matching. A serialization of the taxonomy allows use of simple dot products on the ontology vector space to address this problem effectively. The Ontology Mapping to Ontology-space The ontology tree is serialized to obtain an associated vector representation v(T) ∈ R^N. Represent entities as subtrees of the ontology and, equivalently, as vectors on the so-called ontology-space S(T) ⊂ R^N. Define a similarity measure r(i, j) between two entities i and j: r(i, j) := |B(i) ∩ B(j)| / (|B(i)| |B(j)|). (a) An expert on two topics represented by leaves of the ontology can be modeled as a subtree. (b) A query on a topic represented by a leaf of the ontology can be modeled as a subtree. Methods Properties The ontology (vector) space has a much smaller dimension than the commonly used term-by-document spaces. It also avoids the need for maintaining large, inefficient, and static dictionaries. Each dimension of the ontology-space, which corresponds to a node (subject), has inherent semantic relations with other nodes. One such relation is hierarchical and follows immediately from the tree structure of the ontology. It is also possible to define other graph-theoretic relations, for example by defining overlay graphs.
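As read from the slide, the similarity of two entities is the size of the overlap of their node sets B(i) and B(j), normalized by the product of their sizes. A small sketch under that assumption (the paper may normalize differently, e.g. with a square root); the example bags are hypothetical:

```python
def similarity(b_i: set, b_j: set) -> float:
    """r(i, j) = |B(i) ∩ B(j)| / (|B(i)| |B(j)|), as read from the slide."""
    if not b_i or not b_j:
        return 0.0
    return len(b_i & b_j) / (len(b_i) * len(b_j))

# Hypothetical bags of ontology nodes for an expert profile and a query
expert = {"ml", "clustering", "kernels"}
query = {"ml", "kernels"}
print(similarity(expert, query))  # overlap 2, sizes 3 and 2
```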
Experiment Setup We choose a 245-node subset of an ontology prepared by a UK education agency. The procedure for populating the ontology: 1. The node's name is used to find the 10 top-ranked documents through Yahoo! search web services. 2. The obtained HTML documents are converted to a single text document. 3. This resulting document is further processed using the Natural Language Toolkit (NLTK) by (a) tokenizing, (b) stop word removal, and (c) stemming with Porter's stemmer. Random expert and query generation. The expert peering: the set of experts R(q) with the highest matching score m(q, e) := v(q) · v(e), given query q. The "ground truth" vectors are used to calculate the set of "correct" experts A(q). The recall and precision measures are calculated as averages over N = 1000 such queries: recall = (1/N) Σ_{i=1}^{N} |A(q_i) ∩ R(q_i)| / |A(q_i)|, precision = (1/N) Σ_{i=1}^{N} |A(q_i) ∩ R(q_i)| / |R(q_i)|. Numerical Results [Figure: precision and recall against the number of keywords (40-80), for two settings of the algorithm parameter (0.0 and 1.0) and for 50 and 100 experts; one plot assigns the set of experts with the top three rankings to each query, the other the top six.] Observations 1. Choosing the larger parameter value 1.0 for the algorithm leads to improved results, as it restricts unnecessary branching and hence noise. 2. The precision remains high regardless of the number of experts and the parameter.
We attribute this result to the hierarchical structure and robustness of our system. 3. With the correct set of parameters, we observe that both precision and recall are relatively insensitive to the number of experts, which indicates scalability. 4. The precision and recall curves are rather flat, demonstrating that our system performs well in peering the experts even when given limited information. Application Architecture Conclusion Presented an ontology-based approach for an expert peering and search system. Described a graph-based representation scheme consisting of an ontology tree where each node corresponds to a (sub)topic and is associated with a bag of words. Addressed the graph matching problem of assigning queries to relevant experts on a vector space, which follows from a serialization of the ontology tree. Preliminary experiments demonstrate the efficiency, robustness, and high performance of our algorithm over a range of parameters. A prototype system is under development and user testing. Thank you! This publication and more information about the Spree project are available on my website: http://decision.csl.uiuc.edu/~alpcan/ or http://deutsche-telekom-laboratories.de/~alpcan/
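The recall and precision averages from the experiment section can be sketched as follows; the query data here is hypothetical:

```python
def avg_recall_precision(queries):
    """Average recall and precision over queries, following the slides:
    recall_i = |A ∩ R| / |A|, precision_i = |A ∩ R| / |R|,
    where A is the "correct" expert set and R the returned set."""
    n = len(queries)
    rec = sum(len(A & R) / len(A) for A, R in queries) / n
    prec = sum(len(A & R) / len(R) for A, R in queries) / n
    return rec, prec

# Two hypothetical queries: (correct experts A(q), returned experts R(q))
queries = [({"e1", "e2"}, {"e1", "e3"}), ({"e2"}, {"e2", "e4"})]
rec, prec = avg_recall_precision(queries)
print(rec, prec)  # 0.75 0.5
```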
Computing Homology Group Generators of Images Using Irregular Graph Pyramids We introduce a method for computing homology groups and their generators of a 2D image, using a hierarchical structure, i.e. an irregular graph pyramid. Starting from an image, a hierarchy of the image is built by two operations that preserve the homology of each region. Instead of computing homology generators in the base, where the number of entities (cells) is large, we first reduce the number of cells by a graph pyramid. Homology generators are then computed efficiently on the top level of the pyramid, since the number of cells is small, and a top-down process is then used to deduce homology generators at any level of the pyramid, including the base level, i.e. the initial image. We show that the new method produces valid homology generators and present some experimental results. Computing Homology Group Generators of Images Using Irregular Graph Pyramids Why does my cup of tea look like my donut... Outline Handling subdivided objects (1) Handling subdivided objects (2) Topological Properties (1) Topological Properties (2) Why homology? (1) Why homology? (2) Why homology? (3) Homology Characterizes "Holes" (1) Homology Characterizes "Holes" (2) Homology Characterizes "Holes" (3) Homology Characterizes "Holes" (4) Homology Characterizes "Holes"
(5) Computing homology generators (1) Computing homology generators (2) Computing homology generators (3) Computing homology generators (4) Computing homology generators (5) Computing homology generators (6) Computing homology generators (7) Computing homology generators (8) Computing homology generators (9) Agoston's Method (1) Agoston's Method (2) Agoston's Method (3) Agoston's Method (4) Homology for Images (1) Homology for Images (2) Homology for Images (3) Homology for Images (4) Homology for Images (5) Homology for Images (6) Related Works Homology Computation Using Pyramids (HCP) (1) Homology Computation Using Pyramids (HCP) (2) Homology Computation Using Pyramids (HCP) (3) Homology Computation Using Pyramids (HCP) (4) Homology Computation Using Pyramids (HCP) (5) Building the Pyramid Delineating Generators (1) Delineating Generators (2) Delineating Generators (3) Delineating Generators (4) Experiments (1) Experiments (2) Conclusion - Future Research (1) Conclusion - Future Research (2) Computing Homology within Image Context (CHIC) Thank You Computing Homology Group Generators of Images Using Irregular Graph Pyramids S. Peltier, A. Ion, Y. Haxhimusa, W. Kropatsch, G. Damiand. Vienna University of Technology, Faculty of Informatics, Pattern Recognition and Image Processing Group, Austria; University of Poitiers, SIC, FRE CNRS 2731, France. Why does my cup of tea look like my donut... Topological Background Topological properties Computation of homology generators - drawbacks Homology for computer vision Homology Computation with Pyramids (HCP) HCP overcomes drawbacks Validation for 2D images Conclusion - Future Research Handling subdivided objects rendering computational geometry... image representation grids... CAD image structuration... rendering computational geometry...
Common problem: computing structural (topological) properties. Topological Properties Homeomorphic objects Non-homeomorphic objects Why homology? Classically studied in algebraic topology: Elements of Algebraic Topology, J.R. Munkres, 1984; Algebraic Topology, A. Hatcher, 2002. A powerful topological invariant: homology groups are defined in any dimension, contain the Euler characteristic, give a complete classification of closed surfaces and the orientability of a closed manifold, and are directly linked to the structure of an object (homotopy groups are not linked to the structure). Homology Characterizes "Holes" Dimension 0: connected components. Dimension 1: tunnels. Dimension 2: cavities. Examples: 3 tunnels; 3 tunnels; 1 tunnel; 0 tunnels, 1 cavity; 2 tunnels, 1 cavity. Computing homology generators Aim: characterizing and locating the holes (closed curves). Agoston's algorithm [1976]: based on the classical Smith method, computes generators. Drawbacks: huge incidence matrices, time complexity O(n^3), no control of the geometry of the resulting generators. Agoston's Method Incidence matrices: E^p describes the boundary of each (p+1)-cell. [Figure: a cell complex with vertices v1-v5, edges a1-a7 and faces f1, f2, together with its incidence matrix E^1, whose rows a1-a7 and columns f1, f2 contain entries in {-1, 0, 1}, e.g. row a1 = (1, -1).] Smith Normal Form => generators. Homology for Images Homology describes classes of images. Examples: medical image analysis, metallurgy, object categorisation, moving objects, tracking people [ACV]. Related Works Computing the number of generators: Delfinado and Edelsbrunner, An Incremental Algorithm for Betti Numbers of Simplicial Complexes, 1993; Kaczynski et al.,
Computational Homology, 2004; Storjohann, Near Optimal Algorithms for Computing Smith Normal Forms of Integer Matrices, 1996; Munkres, Elements of Algebraic Topology, 1984; ... Computing generators: Agoston, Algebraic Topology, a First Course, 1976; Dey and Guha, Computing homology groups of simplicial complexes in R^3, 1996; Peltier et al., Computation of homology groups and generators, 2005; ... Computing "nice" generators: Zomorodian, Localized Homology, preliminary draft, 2006. HCP overcomes drawbacks Validation for 2D images Homology Computation Using Pyramids (HCP) Algorithm Input: a segmented image. 1) Build a pyramid. 2) Agoston's method on the top level. 3) Down-projection. Output: generators on each level. Advantages: reduces the number of cells; the resulting generators fit on borders. Building the Pyramid Removal operation Contraction operation Delineating Generators Level k Level k-1 refinement Experiments [Table residue: cell counts in the experiments comparing Agoston's method and the HCP method; a complex with 29 vertices, 43 edges and 1 face, and complexes with 15636 vertices, 30325 edges and 14676 faces.] Conclusion - Future Research Conclusion A new method for computing homology generators; HCP generators always fit on borders for 2D images. Future Research Towards higher dimensions (nD images); towards a notion of "canonical"
homology generators. Computing Homology within Image Context (CHIC project) Partners: PRIP (Vienna), SIC (Poitiers), LAIC (Clermont), LMA (Poitiers). Topics: stability of generators under image operations; homology classification of images; specificity of different combinatorial structures; efficient computation. Thank You
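Agoston's method extracts homology from the incidence (boundary) matrices via the Smith Normal Form; over a field such as Z/2 the Betti numbers reduce to rank computations, b_p = dim ker ∂_p − rank ∂_{p+1}. A minimal sketch under that field assumption (not the authors' code, which works in the integer domain), for a 4-vertex cycle with one tunnel:

```python
def rank_mod2(rows):
    """Rank of a 0/1 matrix over Z/2 by Gaussian elimination."""
    rows = [r[:] for r in rows]
    rank = 0
    for col in range(len(rows[0]) if rows else 0):
        piv = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

# Boundary matrix d1 of a cycle with 4 vertices and 4 edges
# (rows: vertices, columns: edges); the complex has no faces, so d2 = 0.
d1 = [[1, 0, 0, 1],
      [1, 1, 0, 0],
      [0, 1, 1, 0],
      [0, 0, 1, 1]]
n_vertices, n_edges = 4, 4
b0 = n_vertices - rank_mod2(d1)       # connected components
b1 = (n_edges - rank_mod2(d1)) - 0    # tunnels (rank d2 = 0)
print(b0, b1)  # 1 1: one component, one tunnel
```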
The Construction of Bounded Irregular Pyramids with a Union-Find Decimation Process The Bounded Irregular Pyramid (BIP) is a mixture of regular and irregular pyramids whose goal is to combine their advantages. Thus, its data structure combines a regular decimation process with a union-find strategy to build the successive levels of the structure. The irregular part of the BIP allows solving the main problems of regular structures: their inability to preserve connectivity or to represent elongated objects. On the other hand, the BIP is computationally efficient because its height is constrained by its regular part. In this paper the features of the Bounded Irregular Pyramid are discussed, presenting a comparison with the main pyramids in the literature when applied to a colour segmentation task. The construction of Bounded Irregular Pyramids with a union-find decimation process Index Index - Introduction Introduction (1) Introduction (2) Introduction (3) Introduction (4) Index Index - Construction of the BIP Construction of the BIP (1) Construction of the BIP (2) Construction of the BIP (3) Construction of the BIP (4) Construction of the BIP (5) Construction of the BIP (6) Construction of the BIP (7) Construction of the BIP (8) Construction of the BIP (9) Index Index - Preservation of connectivity Preservation of connectivity Index Index - Representation of elongated objects Representation of elongated objects Index Index - Results Results (1) Results (2) Results (3) Results (4) Index Index - Conclusions and future work Conclusions and future work Thanks for your attention!!
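The union-find strategy used to build the irregular levels can be sketched generically; this is a standard disjoint-set structure with path compression and union by size, not the BIP's exact data structure:

```python
class UnionFind:
    """Disjoint sets with path compression and union by size."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra          # attach the smaller root to the larger
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra

# Merging similarly-coloured neighbouring pixels 0-1 and 1-2 of a toy image
uf = UnionFind(4)
uf.union(0, 1)
uf.union(1, 2)
print(uf.find(2) == uf.find(0), uf.find(3) == uf.find(0))  # True False
```

In the BIP, such merges decimate the cells of one level to produce the receptive fields that become the nodes of the next level.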
Probabilistic account for multi-view stereo This paper describes a method for dense depth reconstruction from wide-baseline images. In a wide-baseline setting an inherent difficulty which complicates the stereo correspondence problem is self-occlusion. Also, we have to consider the possibility that image pixels in different images, which are projections of the same point in the scene, will have different colour values due to non-Lambertian effects or discretization errors. We propose a Bayesian approach to tackle these problems. In this framework, the images are regarded as noisy measurements of an underlying 'true' image-function. Also, the image data is considered incomplete, in the sense that we do not know which pixels from a particular image are occluded in the other images. We describe an EM-algorithm, which iterates between estimating values for all hidden quantities, and optimising the current depth estimates. The algorithm has few free parameters, displays a stable convergence behaviour and generates accurate depth estimates.
Extending the Notion of AT-Models for Integer Homology Computation When the ground ring is a field, the notion of algebraic topological model (AT-model) is a useful tool for computing (co)homology, representative (co)cycles of (co)homology generators and the cup product on cohomology of nD digital images as well as for controlling topological information when the image suffers local changes. In this paper, we formalize the notion of lambda-AT-model (lambda being an integer) which extends the one of AT-model and allows the computation of homological information in the integer domain without computing the Smith Normal Form of the boundary matrices. We present an algorithm for computing such a model, obtaining Betti numbers, the prime numbers p involved in the invariant factors (corresponding to the torsion subgroup of the homology), the amount of invariant factors that are a power of p and a set of representative cycles of the generators of homology mod p, for such p.
ML in Bioinformatics After a brief introduction to the use of machine learning in computational biology, we focus on the problem of biological network inference. We define the problem as one of kernel learning, using prediction in kernelized output spaces. Methods based on output kernel trees are presented to solve the problem. Results on two benchmarks are shown.
Mining Frequent Closed Unordered Trees Through Natural Representations Mining Frequent Closed Unordered Trees Through Natural Representations Introduction (1) Introduction (2) Introduction (3) Introduction (4) Introduction (5) Introduction (6) Related Work Natural Representation (1) Natural Representation (2) Mining frequent subtrees in the ordered case (1) Mining frequent subtrees in the ordered case (2) Mining frequent subtrees in the ordered case (3) Canonical Forms Mining frequent subtrees in the unordered case (1) Mining frequent subtrees in the unordered case (2) Closure-based mining (1) Closure-based mining (2) Example: Ordered Case Example: Unordered Case Experiments: Gazelle Unordered Trees Conclusions and Future Work Future Work Tree Kernels Tree Kernels Mining Frequent Closed Unordered Trees Through Natural Representations José L. Balcázar, Albert Bifet and Antoni Lozano, Universitat Politècnica de Catalunya. Pascal Workshop: learning from and with graphs, 2007, Alicante. Mining frequent trees is becoming an important task. Applications: chemical informatics, computer vision, text retrieval, bioinformatics, Web analysis. "Trees are sanctuaries. Whoever knows how to listen to them, can learn the truth." Hermann Hesse. Many link-based structures may be studied formally by means of unordered trees. Unordered Trees One unordered tree with two different drawings, each of which corresponds to a different ordered tree. Induced subtrees: obtained by repeatedly removing leaf nodes. Embedded subtrees: obtained by contracting some of the edges. Introduction What Is Tree Pattern Mining? Given a dataset of trees, find the complete set of frequent subtrees. Frequent Tree Pattern (FT): includes all the trees whose support is no less than min_sup. Closed Frequent Tree Pattern (CT): includes no tree which has a super-tree with the same support. CT ⊆
FT. Closed frequent tree mining provides a compact representation of frequent trees without loss of information. Ordered Subtree Mining D = {A, B}, min_sup = 2. # Closed subtrees: 2. # Frequent subtrees: 8. Closed subtrees: X, Y. Frequent subtrees: Unordered Subtree Mining A: B: X: Y: Related Work Yun Chi, Richard Muntz, Siegfried Nijssen, Joost Kok, Frequent Subtree Mining - An Overview, 2005. FREQUENT, labelled and rooted trees, unordered, induced: Unot [Asai 2003], UFreqT [Nijssen 2003], HybridTreeMiner [Chi 2004], PathJoin [Xiao 2003]. CLOSED, labelled and induced trees: CMTreeMiner [Chi, Yang, Xia, Muntz 2004]; labelled and relaxed included trees: DRYADE [Termier, Rousset, Sebag 2004]; labelled and attribute trees: CLOATT [Arimura, Uno 2005]. Definition Given two sequences of natural numbers x, y: x · y is the concatenation of x and y; x + i is the addition of i to each component of x; x⁺ = x + 1. Definition A natural sequence is a sequence (x1, ..., xn) of natural numbers such that x1 = 0 and each subsequent number x_{i+1} belongs to the range 1 ≤ x_{i+1} ≤ x_i + 1. Example: x = (0, 1, 2, 3, 1, 2) = (0) · (0, 1, 2)⁺ · (0, 1)⁺. Definition Let t be an ordered tree. If t is a single node, then ⟨t⟩ = (0). Otherwise, if t is composed of the trees t1, ..., tk joined to a common root r (where the ordering t1, ..., tk is the same as that of the children of r), then ⟨t⟩ = (0) · ⟨t1⟩⁺ · ⟨t2⟩⁺ · ... · ⟨tk⟩⁺. ⟨t⟩ is the natural representation of t. Example: x = (0, 1, 2, 2, 3, 1) = (0) · (0, 1, 1, 2)⁺ · (0)⁺. Mining frequent subtrees in the ordered case Definition y is a one-step extension of x (in symbols, x ⊢ y) if x is a prefix of y and |y| = |x| + 1. A series of one-step extensions from (0) to a natural sequence x, (0) ⊢ x1 ⊢ ... ⊢ x_{k-1} ⊢ x, always exists and must be unique, since the x_i's can only be the prefixes of x. FREQUENT_SUBTREE_MINING(t, D, min_sup, T) Input: a tree t, a tree dataset D, and min_sup. Output: the frequent tree set T. insert t into T; for every t' that can be extended from t in one step do: if support(t') ≥
min_sup then FREQUENT_SUBTREE_MINING(t', D, min_sup, T); return T. FREQUENT_SUBTREE_MINING(t, D, min_sup, T) Input: a tree t, a tree dataset D, and min_sup. Output: the frequent tree set T. insert t into T; C ← ∅; for every t' that can be extended from t in one step do: if support(t') ≥ min_sup then insert t' into C; for each t' in C do: T ← FREQUENT_SUBTREE_MINING(t', D, min_sup, T); return T. Canonical Forms Definition Let t be an unordered tree, and let t1, ..., tn be all the ordered trees obtained from t by ordering in all possible ways all the sets of siblings of t. The canonical representative of t is the ordered tree t0 whose natural representation is maximal (according to lexicographic ordering) among the natural representations of the trees ti, that is, such that ⟨t0⟩ = max{⟨ti⟩ | 1 ≤ i ≤ n}. Mining frequent subtrees in the unordered case FREQUENT_SUBTREE_MINING(t, D, min_sup, T): as above, but with an initial check: if not CANONICAL_REPRESENTATIVE(t) then return T. Closure-based mining CLOSED_SUBTREE_MINING(t, D, min_sup, T) if not CANONICAL_REPRESENTATIVE(t) then return T; C ← ∅;
for every t' that can be extended from t in one step do: if support(t') ≥ min_sup then insert t' into C; if support(t') = support(t) then t is not closed; if t is closed then insert t into T; for each t' in C do: T ← CLOSED_SUBTREE_MINING(t', D, min_sup, T); return T. Example: Ordered Case min_sup = 2, A: (0, 1, 2, 3, 2, 1), B: (0, 1, 2, 3, 1, 2, 2). Example: Unordered Case A: B: X: Y: Experiments: Gazelle Unordered Trees [Figure: running time (0-10 s) against support (0-25, x1000) for CMTreeMiner and our method.] Conclusions and Future Work Through our proposed representation of ordered trees, we have presented efficient algorithms for mining ordered and unordered frequent closed trees. The sequential form of our representation, where the number-encoded depth furnishes the two-dimensional information, is key to the fast processing of the data. Future work: consider labelled subtrees; consider embedded subtrees. Future Work Tree Kernels Definition (Subset Trees) A set of connected nodes of a tree T. Definition (Collins and Duffy 2001) Denote by T, T' trees and by t ⊑ T a subset tree of T; then k(T, T') = Σ_{t ⊑ T, t' ⊑ T'} w_t δ_{t,t'}. Definition (Vishwanathan and Smola 2002) In the case where we count matching subtrees, t ⊑ T denotes that t is a subtree of T and k(T, T') = Σ_{t ⊑ T, t' ⊑ T'} w_t δ_{t,t'}. S. V. N. Vishwanathan and Alexander J. Smola, Fast Kernels for String and Tree Matching, 2002. We can compute tree kernels by converting trees to strings and computing string kernels. Advantages: simple storage and simple implementation (dynamic arrays, suffixes); all speedups for strings work for tree kernels, too (XML documents, etc.).
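The natural representation defined above (a single node is (0); a tree is (0) followed by each child's sequence shifted by one) can be sketched directly; the nested-tuple encoding of trees is an assumption made here for illustration:

```python
def natural_seq(tree):
    """Natural representation of an ordered tree given as a nested tuple
    of children, e.g. a leaf is () and a root with two leaves is ((), ()).
    Implements <t> = (0) · <t1>+ · ... · <tk>+ from the slides."""
    seq = [0]
    for child in tree:
        seq.extend(x + 1 for x in natural_seq(child))  # shift child by +1
    return seq

# The slides' example: a root with a depth-3 path child and a depth-2 child
t = ((((),),), ((),))
print(natural_seq(t))  # [0, 1, 2, 3, 1, 2]
```

The one-step extensions used by the mining algorithm then correspond to appending one more number to such a sequence while keeping it a valid natural sequence.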
Learning with spectral representations and use of MDL principles Recent Progress on Learning with Graph Representations Outline Motivation Problem Problem (2) Problem (3) Measuring similarity of graphs Viewed from the perspective of learning Learning with graphs (circa 2000) Why is structural learning difficult? Structural Variations Contributions Spectral Methods Graph (structural) representations of shape Delaunay Graph MOVI Sequence Shock graphs Graph characteristics Pairwise clustering Embeddings Generative model Spectral Generative Model Algebraic graph theory (PAMI 2005) ...joint work with Richard Wilson Spectral Representation Properties of the Laplacian Eigenvalue spectrum Eigenvalues are invariant to permutations of the Laplacian. Why Symmetric polynomials Power symmetric polynomials Symmetric polynomials on spectral matrix Spectral Feature Vector ...extend to weighted attributed graphs. Complex Representation Spectral analysis Pattern Spaces Manifold learning methods Separation under structural error Variation under structural error (MDS) CMU Sequence MOVI Sequence YORK Sequence Visualisation (LLP+Laplacian Polynomials) Cospectrality problem for trees Cospectral trees Overcome using quantum random walk The positive support of a matrix Cospectral Trees Strongly regular graphs Generative Tree Union Model ...work with Andrea Torsello Ingredients Illustration Cluster structure Model Union as tree distribution Generative Model Max-likelihood parameters Description length Expectation on observation density Tree Union Simplified Description Cost Description Length Gain Unattributed Future
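The claim that Laplacian eigenvalues are invariant to vertex permutations can be checked in a few lines; this generic sketch (not the PAMI 2005 feature construction, which builds symmetric polynomials on the spectral matrix) sorts the spectrum of L = D − A:

```python
import numpy as np

def laplacian_spectrum(adj):
    """Sorted eigenvalues of L = D - A: a permutation-invariant
    feature vector for a graph (L is symmetric, so eigenvalues are real)."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.sort(np.linalg.eigvalsh(lap))

# A path graph on 3 nodes and a relabelled (permuted) copy of it
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
P = np.eye(3)[[2, 0, 1]]            # permutation matrix
A_perm = P @ A @ P.T
print(np.allclose(laplacian_spectrum(A), laplacian_spectrum(A_perm)))
```

The cospectrality slides point at the limitation: distinct trees or strongly regular graphs can share a spectrum, which motivates the quantum-walk refinement mentioned in the talk.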
Genetic Approximate Matching of Attributed Relational Graphs Genetic Approximate Matching of Attributed Relational Graphs Motivation 1/2 Motivation 2/2 Outline EC (Sub-)Graph Isomorphism GA - Encoding GA - Crossover Strict position-based crossover GA - Local Search GA - other parameters Combining GA with A* Outline Evolution Process Diversity Precision - Crossover 1/2 Precision - Crossover 2/2 Results - Runtime Combined results Conclusions Future Work
Molecular Graph Kernels for Drug Discovery Molecular Graph Kernels for Drug Discovery Overview (1) Overview (2) Drug Discovery Current Approaches to QSAR Overview Our Approach Representing Molecules as Graphs (1) Representing Molecules as Graphs (2) Kernel Methods Kernels for Graphs Overview p-Length Graph Kernel Dynamic Programming Extensions - Non-tottering and Soft-match Extensions - Gaps Overview Results Overview One-class vs. Two-Class Motivation for using only One Class Experiment: Compare One-Class and Two-Class (1) Experiment: Compare One-Class and Two-Class (2) Experiment: Vary amount of inactives Discussion Overview Review References Questions? Molecular Graph Kernels for Drug Discovery Anthony Demco, Craig Saunders, Alex Dolia, ISIS Group, Electronics and Computer Science, University of Southampton, Southampton, UK. Outline: Virtual Screening for Drug Discovery, Our Approach, p-Length Graph Kernels, Results, One vs. Two Class, Conclusions. 14 June, 2007. Drug Discovery Many stages (Target Identification, Screening, Hits, Lead Identification, Clinical Trials). QSAR analysis → Virtual Screening. Activity: a numeric measure of ligand binding to a target. Must also consider other properties (ADME) to be more drug-like. Focus on Virtual Screening: learn patterns between a molecular structure and biological activity in order to predict the activity of unseen molecules. Current Approaches to QSAR Define a fingerprint.
[Figure: a fingerprint built from an example structure with atoms C, OH, N, O.] Apply an ML algorithm (SVM, decision tree, etc.). Classify as drug/non-drug. Our Approach Use kernel methods for graphs. Decisions for selecting a model: 1. Which graph-based representation of a molecule? 2. Which graph kernel? Representing Molecules as Graphs We consider 3 representations of molecules as graphs: molecular graphs, topological pharmacophore graphs and reduced graphs. Molecular Graph Vertices are atom types, edges are bond types. [Figure: a molecular graph whose vertices carry atom-type labels (C, O, H, S, ...) and whose edges carry bond-type labels (S = single, D = double).] Topological Pharmacophore (TP) Graph Each vertex carries a unique label indicating its function (aromatic, bond donor, etc.); no edge labels. [Figure: a TP graph with function labels Mo, Ni, Sc, Zn.] Reduced Graph Only several functions are kept as vertex labels; no edge labels. Kernel Methods Kernels A kernel function κ(x, z) is a positive definite function. It corresponds to an inner product in a feature space: κ(x, z) = ⟨φ(x), φ(z)⟩, where φ : X → F. A number of algorithms include inner products; these can be replaced by any positive definite kernel. Kernels for Graphs Positive definite
kernels between structured data. Conceptually, the feature vector consists of all possible sub-parts of the structured data; in practice, only the inner product is needed. Graph kernels: inner products using different features. Sub-graphs: very difficult to use [Gärtner et al., 2003], which motivates other approaches. Walks: sequences of vertices (e.g. C-C-C) [Gärtner et al., 2003, Kashima et al., 2003]. Label distances [Gärtner et al., 2003]. Trees [Ramon and Gärtner, 2003]. Cycles [Horvath et al., 2004]. We focus on walk kernels: they achieved the best performance and allow for interesting extensions. p-Length Graph Kernel Count walks of length p using dynamic programming. Product Graph [Figure: two labelled graphs G1 and G2 and their product graph; matching vertex labels give a product vertex, and edges are added if they exist in both original graphs.] Counting all walks of length p in the product graph is the same as counting all pairs of walks in G1 and G2 that match. Dynamic Programming Finite walks of length p are calculated using the following DP equations: D_0(v_i) = 1 (1); D_n(v_i) = Σ_{v_j : (v_j, v_i) ∈ E_×} D_{n-1}(v_j) (2), for each vertex v_i ∈ V_×. The kernel can then be calculated as k_p(G1, G2) = Σ_{v_i ∈ V_×} D_p(v_i). Extensions - Non-tottering and Soft-match Non-tottering Tottering paths can be removed [Mahé et al., 2004]. For each index of D_i, store the contribution of each connected vertex.
Sum the D_{i-1} terms whose contribution is not from the current vertex. Soft-matching An extension of matching walks: build the product graph, but allow non-matching vertices; keep a substitution matrix, a value between 0 and 1 indicating how close two labels are; the contribution from non-matching vertices is down-weighted by this value. Extensions - Gaps Gaps Motivation: allow small local changes in structure. Augment the original graphs G1 and G2 so that there is a new "wildcard" vertex where each edge exists. Single-gaps: don't allow wildcards to match. Multiple-gaps: allow wildcard matches (increases computational complexity). Results NCI-HIV dataset. Optimized C and positive ratio with 10%; 5-fold cross-validation on the remaining 90% with inactives. [Table: AUC per kernel - C: 0.906, SM: 0.916, 1G: 0.899.] One-class vs. Two-Class Given a set of examples S = {(x_1, y_1), ..., (x_ℓ, y_ℓ)}. Binary classification: y_i ∈ {1, -1} for 1 ≤ i ≤ ℓ; the model learns from both positive and negative examples. One-class: y_i ∈ {1} for 1 ≤ i ≤ ℓ; the model learns from only positive examples and finds the support of their distribution. Test sets include both positives and negatives; if an example lies outside the distribution of positive examples, it is an "outlier", i.e. a negative example. Motivation for using only One Class Datasets are often very large (thousands to millions of compounds), although highly unbalanced: NCI-HIV has 42,689 examples, 96.4% inactive, while only 1.0% are active (411 examples). Time to build a model usually scales with the number of training examples: with 90% of the data for training, binary classification uses 38,000 examples versus 370 for one-class.
Second motivation: it is difficult to construct a dataset that represents "non-drugs", as this class is extremely large. Experiment: compare one-class and two-class on a subset of the NCI-HIV dataset. 10% of the actives are used for parameter tuning; the remaining 90% of actives (370 examples) are split into 5 folds, and 1000 inactives are paired with each fold to compare with binary classification. 1-SVM: optimized parameters, but built a poor classifier (0.56 AUC average over 5 folds). SVDD: optimized parameters, best model achieved (0.756 AUC average over 5 folds). Experiment: vary the amount of inactives. How many inactives are needed for binary classification? Inactives: 1, 10, 100, 500, 1000; AUC: 0.6570, 0.8479, 0.919, 0.9291, 0.9363. With just 10 negative examples, binary outperforms one-class (0.756). Discussion: binary classification outperforms one-class, even when using a small inactive set. One-class is much faster, although model training is usually off-line in the drug discovery process, so the extra cost of building a two-class model is worthwhile. Review: p-length graph kernels; graph kernels using alternate graph representations; addition of soft-matching and gaps through a DP framework; comparison of one-class and two-class algorithms. Gärtner, T., Flach, P. A., and Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, 16th Annual Conference on Learning Theory and 7th Kernel Workshop, Proceedings, volume 2843, pages 129-143. Springer Verlag. Horvath, T., Gärtner, T., and Wrobel, S. (2004). Cyclic pattern kernels for predictive graph mining. In Proceedings of the International Conference on Knowledge Discovery and Data Mining. Kashima, H., Tsuda, K., and Inokuchi, A. (2003). Marginalized kernels between labeled graphs. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, USA.
Mahé, P., Ueda, N., Akutsu, T., and Vert, J. (2004). Extensions of marginalized graph kernels. In Proceedings of the 21st International Conference on Machine Learning (ICML-2004), Banff, Alberta, Canada. Ramon, J. and Gärtner, T. (2003). Expressivity versus efficiency of graph kernels. In First International Workshop on Mining Graphs, Trees and Sequences (held with ECML/PKDD03). Questions?
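The p-length walk-counting DP described in the slides above can be sketched in code. This is a minimal illustration under my own graph encoding (a label dict plus a set of undirected edges); only the product-graph construction and the recursions D_0(v) = 1, D_n(v) = Σ_{u∼v} D_{n−1}(u), k_p = Σ_v D_p(v) come from the slides, and all function names are mine.

```python
from itertools import product

def product_graph(g1, g2):
    """Build the product graph of two labeled graphs.
    A graph is (labels: dict node -> str, edges: set of frozenset pairs).
    Vertices pair up label-matching nodes; a (directed) product edge
    exists when the corresponding edge exists in both input graphs."""
    l1, e1 = g1
    l2, e2 = g2
    verts = [(u, v) for u, v in product(l1, l2) if l1[u] == l2[v]]
    edges = {(a, b) for a in verts for b in verts
             if a != b and frozenset((a[0], b[0])) in e1
             and frozenset((a[1], b[1])) in e2}
    return verts, edges

def p_length_kernel(g1, g2, p):
    """k_p(G1, G2): number of matching walk pairs of length p."""
    verts, edges = product_graph(g1, g2)
    # D_0(v) = 1 for every product vertex
    D = {v: 1 for v in verts}
    for _ in range(p):
        # D_n(v) = sum of D_{n-1}(u) over neighbours u of v
        D = {v: sum(Dprev for u, Dprev in D.items() if (u, v) in edges)
             for v in verts}
    return sum(D.values())
```

For a small molecular-style path graph C-C-O compared with itself, the kernel counts 6 matching walk pairs of length 1 and 8 of length 2, which can be checked by enumerating walks by hand.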
Graphs Regularization for Data Sets and Images: Filtering and Semi-Supervised Classification. Outline: What are the Main Ideas? (1-4); Graphs and Regularization Framework; What is a Weighted Graph? (1-5); Why Use Graph Representation?; Operators? (1-4); Weighted Graph Based Regularization? (1-4); Graph Based Regularization is Not New...; Applications; Filtering by Regularization (1-2); Image Filtering: Classical Example (1-2); Data Set Filtering: A Toy Example (1-3); Data Set Filtering: UCI Data Bases (1-2); Semi Supervised Classification by Regularization (1-2); The Two Moons Example (1-3); Image Semi Supervised Segmentation (1-2); Conclusion (1-3). Vinh Thong Ta, Olivier Lezoray, Abderrahim Elmoataz. Computer Science PhD Thesis, Vision and Image Analysis Group, University of Caen, Lower Normandy (France). GBR Pascal Workshop 2007, Alicante (Spain), June 14th, 2007. Outline: 1. Weighted Graph Based Regularization Framework; 2. Applications in Image and Data Set
Filtering; Semi-supervised classification. What are the Main Ideas? From image processing: filtering, denoising. From data set processing: semi-supervised classification. Therefore: why not use image filtering methods on data sets? Why not use semi-supervised classification methods on images? How can we solve these two apparently dissimilar tasks within a single framework? How? A weighted graph structure and functional regularization based on diffusion processes. Graphs and Regularization Framework: some basics and notation on graphs. What is a Weighted Graph? G = (V, E, w): a finite set of vertices (nodes) V, a subset of edges E ⊆ V × V, and a weight function w(u, v): E → R+. [Figure: an example weighted graph with vertices 1-5 and edge weights w12, w13, w23, w24, w25, w43.] G is undirected and connected, with no self-loops. Why Use Graph Representation? It applies to both images and data sets. Operators? Given G = (V, E, w), f: V → R+, u, v ∈ V, and (u, v) ∈ E. Edge derivative and difference operator: ∂f/∂(u, v) = (df)(u, v) = w(u, v)(f(v) − f(u)). Gradient operator: ∇f(v) = ((df)(u, v) : (u, v) ∈ E, u ∼ v)ᵀ, with norm ‖∇f(v)‖. Weighted Graph Based Regularization? Optimization problem: min_f E_p(f, f⁰, λ) = Σ_{v∈V} ‖∇f(v)‖^p + λ‖f − f⁰‖². Solutions for p = 1 and p = 2: ∂E_p/∂f(v) = (Δ_p f)(v) + 2λ(f(v) − f⁰(v)) = 0, for all v ∈ V. Gauss-Jacobi iteration for p = 2 (Δ₂f(v) is the graph Laplacian operator): initialize f = f⁰, then f^{t+1}(v) = (λ f⁰(v) + Σ_{u∼v} w(u, v) f^t(u)) / (λ + Σ_{u∼v} w(u, v)), for all v ∈ V. Graph Based Regularization is Not New... M. Belkin et al.
Manifold Regularization: A Geometric Framework for Learning from Examples. Journal of Machine Learning Research, 2007, to appear. D. Zhou and B. Schölkopf. Semi-Supervised Learning. Discrete Regularization, MIT Press, 221-232, 2006. O. Lezoray, S. Bougleux, and A. Elmoataz. Graph Regularization For Color Image Processing. Computer Vision and Image Understanding 107(1-2): 38-55, 2007. Applications: application in filtering. Filtering by Regularization. G = (V, E, w); vertices = data points; each vertex is described by a vector of K features. K independent regularizations, for all i ∈ [1, K]: initialize f_i = f_i⁰, then f_i^{t+1}(v) = (λ f_i⁰(v) + Σ_{u∼v} w(u, v) f_i^t(u)) / (λ + Σ_{u∼v} w(u, v)), for all v ∈ V. Image Filtering: Classical Example. Corrupted images; filtered images by regularization, G = 8-connectivity grid graph. Data Set Filtering: A Toy Example. Original data, corrupted data, filtering result; G = fully connected graph. Data Set Filtering: UCI Data Bases. Iris and Wine: filtering results by regularization. Applications: application in semi-supervised classification. Semi Supervised Classification by Regularization (1). G = (V, E, w); vertices = data points; a classification problem with K classes; initial labels C = {c_i, i ∈ [1, K]}. For all i ∈ [1, K]: f_i⁰(v) = +1 if v ∈ c_i with i ∈ [1, K], f_i⁰(v) = −1 otherwise, and f_i⁰(v) = 0 for all v ∈ {V \ C}, while the ±1 labels hold for all v ∈
C. Semi Supervised Classification by Regularization (2). Classification by regularization: label propagation. K independent regularizations, for all i ∈ [1, K]: f_i^{t+1}(v) = (λ f_i⁰(v) + Σ_{u∼v} w(u, v) f_i^t(u)) / (λ + Σ_{u∼v} w(u, v)), for all v ∈ V. Decision function: C(v) = argmax_i f_i(v), for all v ∈ V. The Two Moons Example: original data, initial labels, classification; G = fully connected graph. Image Semi Supervised Segmentation: user-marked images, segmentation results. Conclusion. Summary: the weighted graph based regularization framework solves filtering and semi-supervised classification in the same way; apply image processing methods to data sets, and data set processing methods to images. Future work: demonstrate the benefits of data set filtering on classification accuracy (a new machine learning pre-processing method); extend the semi-supervised classification concept to image or object categorization.
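The Gauss-Jacobi update used throughout this talk is easy to sketch with NumPy. This is an illustrative implementation under my own naming, assuming a symmetric weight matrix W; its fixed point solves (λI + L)f = λf⁰, with L = D − W the graph Laplacian, which is exactly the p = 2 discrete regularization the slides describe.

```python
import numpy as np

def graph_regularize(W, f0, lam, iters=100):
    """Gauss-Jacobi iteration for p = 2 graph regularization:
    f^{t+1}(v) = (lam * f0(v) + sum_u w(u,v) f^t(u)) / (lam + sum_u w(u,v)).
    W: symmetric (N, N) weight matrix; f0: (N,) initial signal."""
    f = f0.astype(float).copy()
    deg = W.sum(axis=1)              # sum_u w(u, v) for each vertex v
    for _ in range(iters):
        f = (lam * f0 + W @ f) / (lam + deg)
    return f
```

On a 3-vertex path graph with unit weights, f⁰ = (0, 1, 0) and λ = 1, the iteration converges to (0.25, 0.5, 0.25), the solution of (I + L)f = f⁰: the spike is diffused to its neighbours while a fidelity term keeps f close to f⁰.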
Graph Signature: A Simple Approach for Clustering Similar Graphs, Applied to Graphic Symbols Recognition. Plan: Introduction (1-2); Graph Based Symbol's Representation (1-6); Graph Matching; Greedy Algorithm, Score of Mappings; Greedy Algorithm, SimGraph; SimGraph Continued (1-2); Graph Signature (G-Signature) (1-5); Results; Improvements Suggested; Conclusions; That's it!
Special Session: Projects in Multimodal Interaction - AMI AMI Project Overview AMI in a slide AMI partners Multimodal processing and meetings A Typical project meeting... ...deserves a better record than this Typical AMI questions AMI Vision AMI research Schematic data flow Instrumented meeting rooms Capture AMI data collection Scenario meetings Annotation Signal labelling Dialog act labelling Summarization Meeting browser Audio-video processing Structure & content extraction AMIDA - remote meetings
Information Technology Solutions for Diabetes Management and Prevention: Current Challenges and Future Research Directions
Ambulatory blood pressure monitoring is highly sensitive for detection of early cardiovascular risk factors in young adults
The Education and Training of the Medical Physicist in Europe. The European Federation of Organisations for Medical Physics - EFOMP Policy Statements and Efforts
Use of rapid prototyping technology in comprehensive rehabilitation of a patient with congenital facial deformity or partial finger or hand amputation
Experimental evaluation of training device for upper extremities sensory-motor ability augmentation
New Experimental Results in Assessing and Rehabilitating the Upper Limb Function by Means of the Grip Force Tracking Method
Verification of planned relative dose distribution for irradiation treatment technique using half-beams in the area of field abutment
Laminar Axially Directed Blood Flow Promotes Blood Clot Dissolution: Mathematical Modeling Verified by MR Microscopy
SRL: The Next Decade. Outline: Instructions; Past: Focus on Representations; Present: Focus on Tasks - 1; The Entity Resolution Problem; InfoVis Co-Author Network Fragment; Present: Focus on Tasks - 2; ILIADS; Future: Focus on Integrated Tasks; Research Agenda; D-Dupe: An Interactive Tool for Entity Resolution; The End. Lise Getoor, University of Maryland, College Park. ILP07, June 21, 2007. Instructions. Dear panelists: Thank you for agreeing to be on the panel on the long-term research agenda for inductive logic programming (ILP) and relational learning. With many different and interesting approaches to relational and structured machine learning being developed, we feel that this is an appropriate time to think about long-term strategic directions for the field. The panel is from 10:40 to 11:30 AM on the 21st of June in Austin Auditorium of LaSells Stewart Center at Oregon State University. It is going to be open to both ICML and ILP audiences. To emphasize the long view, and the breadth of the research area, the title of the panel is "Structured Machine Learning: The Next Ten Years". Each of you will be given 5 minutes to make opening remarks. The rest of the time will be used for comments and questions from the audience and open discussion. In your opening comments, we ask you to outline a 10-year research program in structured machine learning. Some relevant questions you might address are as follows: - What are the important open problems in this area from an AI/Computer Science point of view? - What application problems have the most potential to make us confront these problems? - How does solving these problems advance the state of the art and impact the world? - How does it facilitate interaction between different "subareas" of this field, including ILP, Statistical Relational Learning, Relational Reinforcement Learning, and others?
- What might be one or two focused Ph.D. thesis topics in this research program? Please feel free to deviate from these if you think it is appropriate. We look forward to an insightful and thought-provoking discussion. Thanks again for your time and effort in doing this. Hendrik Blockeel, Jude Shavlik, Prasad Tadepalli, ILP'07 Program Co-chairs. Past: Focus on Representations. Present: Focus on Tasks. Collective Classification: datasets and code available at http://www.cs.umd.edu/linqs/projects/lbc. Information Diffusion. Entity Resolution: datasets and code available at http://www.cs.umd.edu/linqs/projects/er. Link Prediction. Community Discovery/Group Detection. Ontology Alignment. The Entity Resolution Problem. John Smith: "John Smith", "Jim Smith", "J Smith", "James Smith"; James Smith; Jonathan Smith: "Jon Smith", "J Smith", "Jonthan Smith"; Jon Smith. Issues: 1. Identification; 2. Disambiguation. InfoVis Co-Author Network Fragment: before and after. Present: Focus on Tasks. Collective Classification: datasets and data generator available at http://www.cs.umd.edu/linqs/projects/lbc. Information Diffusion. Entity Resolution: datasets and code available at http://www.cs.umd.edu/linqs/projects/er. Link Prediction. Community Discovery/Group Detection. Ontology Alignment. ILIADS. Goal: produce high-quality integration via a flexible method able to adapt to a wide variety of ontology sizes and structures. Method: combining statistical and logical inference; use schema (structure) and data (instances) effectively. Solution: Integrated Learning In Alignment of Data and Schema (ILIADS). Datasets and code available at: http://www.cs.umd.edu/linqs/projects/iliads. Future: Focus on Integrated Tasks. Putting it all together?
Bioinformatics, Computer Vision, Natural Language Processing, Personal Information Management. Research Agenda. Visual analytics: the complexity of the integrated SRL tasks requires sophisticated user interfaces which allow user feedback and support explanation. Query-time adaptive information gathering: the complexity of the integrated SRL tasks requires flexible, adaptive algorithms which retrieve relevant information in real time. Some related areas to keep in mind: the resurgence of work in probabilistic databases (DB), social network analysis (social science), and network science (physicists). D-Dupe: An Interactive Tool for Entity Resolution. http://www.cs.umd.edu/projects/linqs/ddupe. A novel combination of network visualization and statistical relational models, well-suited to the visual analytic task at hand. Thanks! http://www.cs.umd.edu/~getoor. Work sponsored by the National Science Foundation, Google, the KDD program, and the National Geospatial Agency.
Linear Projections and Gaussian Process Reconstructions. Outline: Acknowledgements; Linear Dimensionality Reduction; Linear Reconstructions; A Poor Reconstruction vs a Cool Reconstruction; Reconstruction as a Regression Problem; Bayesian Regression with Gaussian Process Priors; Gaussian Processes as Smooth Priors Over Functions; Evidence and Predictive Distribution; Gaussian Process Latent Variable Model (GP-LVM) pt 1-2; The GP-LVM in Action; Limitations of the GP-LVM; Symbiosis; Digits Revisited pt 1-2; Swiss Roll; Discussion. Joaquin Quiñonero-Candela (1), Neil D. Lawrence (2), Carl E. Rasmussen (3). (1) Technical University of Berlin and Fraunhofer FIRST.IDA (Sept-Dec 2007 visiting Universidad Carlos III de Madrid; from January 2007, Microsoft Research Cambridge). (2) University of Sheffield (from January 2007, University of Manchester). (3) Max Planck Institute for Biological Cybernetics (from April 2007, Cambridge University). Learning06, Vilanova i la Geltrú, Tuesday October 3rd, 2006. Acknowledgements: thanks to PASCAL for funding the visit of JQC to NL in Sheffield in the summer of 2005, where the back-constraints idea was cooked up. Linear Dimensionality Reduction. Dimensionality reduction D → q: consider high-dimensional data Y = [y_1, ..., y_N] in R^D and a low-dimensional latent representation X = [x_1, ..., x_N] in R^q. Linear projection: find a matrix P of size q × D and project x_i = P y_i. The standard choice is the principal components of the data (PCA): the rows of P are the first q eigenvectors of YYᵀ, giving minimum mean squared reconstruction error (up to scaling). Linear Reconstructions. With a linear map from latent to data space, the reconstruction of the y_i from the x_i is also linear: the reconstructed hyperplane is spanned by the principal eigenvectors. This is often a poor reconstruction!
But most dimensionality reduction methods don't even offer a map between latent and data space. Example: hand-written digits. 16×16 gray-scale images of 2s, 3s, 4s and 5s; 2-dimensional PCA projection; linear reconstruction from PCA. A Poor Reconstruction vs a Cool Reconstruction: linear vs GP. Reconstruction as a Regression Problem. Once we have linearly projected, we have a set of input-output pairs {x_i, y_i}: learn a mapping through non-linear regression! Bayesian Regression with Gaussian Process Priors. Left: samples from our prior, a Gaussian process. Middle: samples from the posterior, with observed data (crosses) and a uniform noise model (horizontal bars). Right: the predictive distribution, empirically computed from the posterior samples (mean and 2 standard deviations shown). The parameters of the prior? Either specify a hyperprior, or learn the parameters of the prior by maximizing the evidence. Gaussian Processes as Smooth Priors Over Functions. Smoothness-enforcing priors: if x_i and x_j are similar, then f(x_i) and f(x_j) are similar: p(f(x_i), f(x_j) | x_i, x_j, θ) = N(0, [K_ii, K_ij; K_ij, K_jj]). The covariance function determines the kind of smoothness, for example K_ij = Cov{f(x_i), f(x_j)} = k(x_i, x_j, θ) = v² exp(−‖x_i − x_j‖² / (2σ²)). Evidence and Predictive Distribution. Assuming an independent Gaussian noise model, y_i = f(x_i) + ε_i with ε_i ~ N(0, σ²) and p(y|f) = N(f, σ²I), the evidence is a Gaussian process as well: p(y|X, θ) = ∫ p(y|f) p(f|X, θ) df = N(0, K + σ²I). The predictive distribution at a new input x* is Gaussian too: p(f(x*)|x*, X, y, θ) = N(m*, v*), with m* = K_{*,N}[K_{N,N} + σ²I]⁻¹ y and v* = K_{*,*} − K_{*,N}[K_{N,N} + σ²I]⁻¹ K_{N,*}. Gaussian Process Latent Variable Model (GP-LVM). Until now I have been given the embedding X; in addition to reconstructing, can I also learn the embedding? A product of GPs model (Lawrence, NIPS 16, 2004): predict each dimension of Y with an independent GP, take X to be the common inputs to all D regression models, p(Y|X, θ) = ∏_{d=1}^D p(y_d|X, θ), and learn the inputs X (and the hyperparameters θ).
The GP-LVM in Action. Motion capture data: a subject breaking into a run from standing. Data dimension: 102, the 3D positions of 34 markers. Data from the Ohio State University Advanced Computing Center for the Arts and Design, http://accad.osu.edu/research/mocap/mocap_data.htm. Strength of the GP-LVM: a powerful, probabilistic reconstruction mapping from latent to data space. Limitations of the GP-LVM. Optimization in a large space (dimension at least N × q); there are extremely many local optima (initialize carefully); no explicit mapping from data to latent space. The GP-LVM is not similarity preserving; it is dissimilarity preserving (a limitation?), because it is a smooth mapping from X to Y. Advantage: it avoids the overlapping effect (LLE, Isomap, etc.) and is less sensitive to noise than local similarity-preserving embeddings. Inability to preserve local structure in the data? Lawrence initializes with PCA! Symbiosis. Linear projections need GP reconstructions, and the GP-LVM needs linear projections: learn an optimal projection for a GP reconstruction. Instead of initializing with PCA, why not directly learn the optimal linear projection for GP reconstruction? Replace X by X = P Y and learn P by maximizing the GP evidence. This gives a smaller q × D optimization space (one can initialize at random). What kind of linear projections do we get? More dissimilarity preserving than PCA! Examples: motion capture, digits, and Swiss roll. Digits Revisited. Swiss Roll. Discussion. A powerful, probabilistic generative GP model from latent to data: computer-animated graphics, imitation learning, a prior over poses (tracking, pose recovery) (Grochow et al., SIGGRAPH'03) (Urtasun et al., ICCV'05). A linear map from data to latent optimized for GP reconstruction heals the GP-LVM from some of its curses; it is a particular case of the back-constrained GP-LVM (Lawrence and Quiñonero-Candela, ICML 2006). Is this still a proper probabilistic model?
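The predictive equations from the slides above can be sketched directly. This is a minimal NumPy illustration using the squared-exponential covariance k(x, x') = v² exp(−‖x − x'‖²/(2ℓ²)) from the slides; the function names and toy hyperparameter values are mine, and a practical implementation would use a Cholesky factorization rather than repeated solves.

```python
import numpy as np

def sq_exp_kernel(A, B, v2=1.0, ell=1.0):
    """k(x, x') = v^2 exp(-||x - x'||^2 / (2 ell^2)), as on the slides."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return v2 * np.exp(-d2 / (2 * ell ** 2))

def gp_predict(X, y, Xstar, noise=0.1, v2=1.0, ell=1.0):
    """Predictive mean and variance:
    m* = K_{*,N} (K_{N,N} + s^2 I)^{-1} y
    v* = K_{*,*} - K_{*,N} (K_{N,N} + s^2 I)^{-1} K_{N,*}"""
    K = sq_exp_kernel(X, X, v2, ell) + noise ** 2 * np.eye(len(X))
    Ks = sq_exp_kernel(Xstar, X, v2, ell)
    mean = Ks @ np.linalg.solve(K, y)
    cov = (sq_exp_kernel(Xstar, Xstar, v2, ell)
           - Ks @ np.linalg.solve(K, Ks.T))
    return mean, np.diag(cov)
```

Near a training input the predictive mean approaches the observed target and the variance shrinks; far from the data the mean reverts to the zero prior mean and the variance to v², which is the behaviour the middle and right panels of the regression slide illustrate.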
Adaptive Dimension Reduction Using Discriminant Analysis and K-means Clustering Regularized Kernel Discriminant Analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. The performance of RKDA depends on the selection of kernels. In this paper, we consider the problem of learning an optimal kernel over a convex set of kernels. We show that the kernel learning problem can be formulated as a semidefinite program (SDP) in the binary-class case. We further extend the SDP formulation to the multi-class case. It is based on a key result established in this paper, that is, the multi-class kernel learning problem can be decomposed into a set of binary-class kernel learning problems. In addition, we propose an approximation scheme to reduce the computational complexity of the multi-class SDP formulation. The performance of RKDA also depends on the value of the regularization parameter. We show that this value can be learned automatically in the framework. Experimental results on benchmark data sets demonstrate the efficacy of the proposed SDP formulations.
Optimal Dimensionality of Metric Space for Classification. For large-scale classification problems, the training samples can be clustered beforehand as a downsampling pre-process, and then only the obtained clusters are used for training. Motivated by this assumption, we proposed a classification algorithm, Support Cluster Machine (SCM), within the learning framework introduced by Vapnik. For the SCM, a compatible kernel is adopted such that a similarity measure can be handled not only between clusters in the training phase but also between a cluster and a vector in the testing phase. We also proved that the SCM is a general extension of the SVM with the RBF kernel. The experimental results confirm that the SCM is very effective for large-scale classification problems due to significantly reduced computational costs for both training and testing and comparable classification accuracies. As a by-product, it provides a promising approach to dealing with privacy-preserving data mining problems. Optimal Dimensionality of Metric Space for kNN Classification. Outline: Related Work; Main Idea; Setup; Objective Function; How to Compute P; What Does the Positive/Negative Eigenvalue Mean?; Choosing the Leading Negative Eigenvalues; Learned Mahalanobis Distance; Three Classes of Well Clustered Data; Two Classes of Data with Multimodal Distribution; Three Classes of Data; Five Classes of Non-Separable Data; UCI Sonar Dataset; Comparisons with the State-of-the-Art - 1; UMIST Face Database; Comparisons with the State-of-the-Art - 2; Conclusions. Wei Zhang, Xiangyang Xue, Zichen Sun, Yuefei Guo, and Hong Lu, Dept.
of Computer Science & Engineering, Fudan University, Shanghai, China. Outline: Motivation (Related Work, Main Idea); Proposed Algorithm (Discriminant Neighborhood Embedding, Dimensionality Selection Criterion); Experimental Results (Toy Datasets, Real-world Datasets); Conclusions. Related Work. Many recent techniques have been proposed to learn a more appropriate metric space for better performance of many learning and data mining algorithms, for example: Relevant Component Analysis, Bar-Hillel, A., et al., ICML 2003; Locality Preserving Projections, He, X., et al., NIPS 2003; Neighborhood Components Analysis, Goldberger, J., et al., NIPS 2004; Marginal Fisher Analysis, Yan, S., et al., CVPR 2005; Local Discriminant Embedding, Chen, H.-T., et al., CVPR 2005; Local Fisher Discriminant Analysis, Sugiyama, M., ICML 2006. However, the target dimensionality of the new space is selected empirically in the above-mentioned approaches. Main Idea. Given finite labeled multi-class samples, what can we do for better performance of kNN classification? Original space (D = 2), new space (d = 1). Can we learn a low-dimensional embedding such that kNN points in the same class have smaller distances to each other than to points in different classes? Can we estimate the optimal dimensionality of the new metric space at the same time? Setup. N labeled multi-class points (x_i, y_i), x_i ∈ R^D, y_i ∈ {1, 2, ..., c}. Neig_I(i): the k nearest neighbors of x_i in the same class; Neig_E(i): the k nearest neighbors of x_i in the other classes. Discriminant adjacency matrix F: F_ij = +1 if x_i ∈ Neig_I(j) or x_j ∈ Neig_I(i); F_ij = −1 if x_i ∈ Neig_E(j) or x_j ∈ Neig_E(i); F_ij = 0 otherwise. Objective Function. Φ
(P) = θ(P) − ϑ(P) = Σ_{i,j} ‖Pᵀx_i − Pᵀx_j‖² F_ij = 2 trace(Pᵀ X (S − F) Xᵀ P), where S is a diagonal matrix whose entries are the column sums of F. Intra-class compactness in the new space: θ(P) = Σ_{i,j} ‖Pᵀx_i − Pᵀx_j‖² over pairs with x_i ∈ Neig_I(j) or x_j ∈ Neig_I(i). Inter-class separability in the new space: ϑ(P) = Σ_{i,j} ‖Pᵀx_i − Pᵀx_j‖² over pairs with x_i ∈ Neig_E(j) or x_j ∈ Neig_E(i). How to Compute P. argmin_P Σ_{i=1}^m P_iᵀ X(S − F)Xᵀ P_i, s.t. P_iᵀP_i = 1 and P_iᵀP_j = 0 (i ≠ j). Note: the matrix X(S − F)Xᵀ is symmetric, but not positive definite; it might have negative, zero, or positive eigenvalues. The optimal transformation P is obtained from the eigenvectors of X(S − F)Xᵀ corresponding to all d of its negative eigenvalues. What Does the Positive/Negative Eigenvalue Mean? For the i-th eigenvector P_i corresponding to the i-th eigenvalue λ_i: θ(P_i) is the total kNN pairwise distance within the same class, ϑ(P_i) the total kNN pairwise distance across different classes, and Φ(P_i) = θ(P_i) − ϑ(P_i) = 2 P_iᵀX(S − F)XᵀP_i = 2λ_i. If λ_i < 0, most points might be correctly classified; if λ_i ≥ 0, most might be misclassified. Choosing the Leading Negative Eigenvalues. Among all d negative eigenvalues, some might have much larger absolute values, while the others, with small absolute values, can be ignored. We can then choose the t (t ≤ d) negative eigenvalues with the largest absolute values such that Σ_{i=1}^t |λ_i| ≥ η Σ_{i=1}^d |λ_i|. Learned Mahalanobis Distance. In the original space, the distance between any pair of points can be obtained by dist(x_i, x_j) = ‖Pᵀx_i − Pᵀx_j‖² = (x_i − x_j)ᵀPPᵀ(x_i − x_j) = (x_i − x_j)ᵀM(x_i − x_j) = ‖x_i −
x_j‖²_M. Outline: Motivation; Proposed Algorithm; Experimental Results; Conclusions. Three Classes of Well Clustered Data: both eigenvalues are negative and comparable; there is no need to perform dimensionality reduction (k = 1). Two Classes of Data with Multimodal Distribution: a big difference between the two negative eigenvalues, λ_1 << λ_2; the leading eigenvector P_1 corresponding to λ_1 will be kept. Three Classes of Data: two eigenvectors corresponding to a positive and a negative eigenvalue, respectively; the eigenvector with the positive eigenvalue should be discarded from the point of view of kNN classification. Five Classes of Non-Separable Data: both eigenvalues are positive, which means that we could not perform kNN classification well in either the original or the new space. UCI Sonar Dataset: when the eigenvalues are > 0, more dimensions give higher accuracy; when the eigenvalues are near 0, the optimum is achieved; when the eigenvalues are < 0, performance decreases. Cumulative eigenvalue curve. Comparisons with the State-of-the-Art - 1. UMIST Face Database; Comparisons with the State-of-the-Art - 2 (k = 1). Summary: a low-dimensional embedding can be LEARNED for better accuracy in kNN classification given finite training samples, and the optimal dimensionality can be estimated in the meantime. Future work: for large-scale datasets, how to reduce the computational complexity? Thanks for your attention! Any questions?
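The eigen-step of the proposed algorithm can be sketched as follows. This is a minimal illustration under my own naming: it takes a precomputed discriminant matrix F, forms X(S − F)Xᵀ, and keeps the eigenvectors with negative eigenvalues (most negative first), as the slides prescribe; building F from the k nearest neighbors is omitted here.

```python
import numpy as np

def dne_projection(X, F):
    """Discriminant Neighborhood Embedding sketch.
    X: (D, N) data matrix with samples as columns.
    F: (N, N) symmetric discriminant adjacency matrix (+1 / -1 / 0).
    Returns (P, lams): eigenvectors of X (S - F) X^T with negative
    eigenvalues, sorted most negative first."""
    S = np.diag(F.sum(axis=0))          # diagonal of column sums of F
    M = X @ (S - F) @ X.T               # symmetric, possibly indefinite
    lams, vecs = np.linalg.eigh(M)
    neg = lams < 0
    order = np.argsort(lams[neg])       # most negative first
    return vecs[:, neg][:, order], lams[neg][order]
```

On a tiny two-class example where the classes are separated along the x-axis, the single negative eigenvalue's eigenvector aligns with the x-axis, i.e. the direction in which inter-class kNN distances exceed intra-class ones; the learned metric is then M = PPᵀ, as on the Mahalanobis slide.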
Learning for Efficient Retrieval of Structured Data with Noisy Queries. Increasingly large collections of structured data necessitate the development of efficient, noise-tolerant retrieval tools. In this work, we consider this issue and describe an approach to learn a similarity function that is not only accurate, but that also increases the effectiveness of retrieval data structures. We present an algorithm that uses functional gradient boosting to maximize both retrieval accuracy and the retrieval efficiency of vantage point trees. We demonstrate the effectiveness of our approach on two datasets, including a moderately sized real-world dataset of folk music. Outline: Structured Data Retrieval: The Problem; Sequence Alignment: Introduction; Obligatory Overview Slide; Sequence Alignment: Basics; Sequence Alignment: Alignment Costs; The Dynamic Time Warping (Smith-Waterman) Algorithm; Gradient Boosting: Learning Distance Functions; Metric Access Methods: Overview; The Triangular Inequality; The Triangular Inequality: Concave Function Application; Vantage Point Trees: Overview; Vantage Point Trees: Demonstration - 1 through - 7; Vantage Point Trees: Searching for Nearest Neighbors Within "t" - 1 through - 7,
- 8 Vantage Point Trees: Searching for Nearest Neighbors Within t - 9 Vantage Point Trees: Optimizing Boosting for Efficiency: Summary Boosting for Efficiency: VP-Tree-Based Loss Function Boosting for Efficiency: Gradient Expression Synthetic Domain: Summary Query-by-Humming: Summary Query-by-Humming: Basic Techniques for Query Processing Application to Query-by-Humming: Our Data Set Results: Experimental Setup Results: Query-by-Humming Domain Results: Synthetic Domain Conclusion Future Work Fin. Learning for Efficient Retrieval of Structured Data with Noisy Queries. Charles Parker, Alan Fern, Prasad Tadepalli, Oregon State University. Structured Data Retrieval: The Problem. Vast collections of structured data (images, music, video) require the development of noise-tolerant retrieval tools: query-by-content with accurate as well as efficient retrieval. Sequence Alignment: Introduction. Given a query sequence and a set of targets, choose the best-matching (correct) target for the given query. This is useful in many applications: protein secondary structure prediction, speech recognition, plant identification (Sinha et al.), and query-by-humming. Obligatory Overview Slide. Sequence Alignment Basics; Metric Access Methods; The Triangular Inequality; VP-Trees; Boosting for Efficiency; Results and Conclusion. Sequence Alignment: Basics. Given a query sequence and a target sequence, we can align the two sequences. Matching (or close) characters should align; characters only present in one or the other should not. Suppose we have query = "DRAIN" and target = "CABIN" . . .
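The best alignment score for a query/target pair like this can be computed by dynamic programming. A minimal sketch (the function name is ours), assuming the edit costs used in this talk: 3 for an exact match, 0 for a mismatch, and 1 per gap:

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 3, 0, 1  # illustrative costs from the slides' example

def align_score(t, q):
    """Best-alignment score between target t and query q: at each cell,
    take the max of consuming a query symbol (gap in the target),
    consuming a target symbol (gap in the query), or aligning both."""
    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(t) and j == len(q):
            return 0
        options = []
        if j < len(q):                    # gap in target, consume q[j]
            options.append(GAP + best(i, j + 1))
        if i < len(t):                    # gap in query, consume t[i]
            options.append(GAP + best(i + 1, j))
        if i < len(t) and j < len(q):     # align t[i] with q[j]
            c = MATCH if t[i] == q[j] else MISMATCH
            options.append(c + best(i + 1, j + 1))
        return max(options)
    return best(0, 0)

align_score("CABIN", "DRAIN")  # → 13 (matches A, I, N; gaps for C, B, D, R)
```

Under these costs, the optimum for CABIN/DRAIN is 13, consistent with the alignment cost of 13 quoted in the talk.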
Sequence Alignment: Alignment Costs. Scoring the alignment for evaluation: suppose we have a function c that gives costs for edit operations, with c(a, b) = 3 if a = b and 0 for any other pair of non-null characters, and c(-, b) = c(a, -) = 1 for gaps. The alignment below then has a cost of 13. The Dynamic Time Warping (Smith-Waterman) Algorithm. Find the best-reward path from any point in target and query. Fill in the values of the matrix (indexed by the characters of the target CABIN and the query DRAIN), starting from (0, 0), using the recurrence align(i, j, t, q) = max{ c(-, q_j) + align(i, j+1, t, q); c(t_i, -) + align(i+1, j, t, q); c(t_i, q_j) + align(i+1, j+1, t, q) }. Gradient Boosting: Learning Distance Functions. Define the margin to be the score of the correct target minus the score of the highest-scoring incorrect target. Formulate a loss function according to this definition of margin. Take the gradient of this function at each possible "replacement" that occurs in the training data, and iteratively move in this direction. Metric Access Methods: Overview. Accuracy is not enough! We want avoidance of linear search, and the ability to cull subsets with the computation of a single distance, using the triangular inequality; we use previous work to assure some level of satisfaction. The Triangular Inequality (Skopal, 2006). We need to increase small distances while large ones stay the same; applying a concave function does the job: d'(x, y) = d(x, y)^(1/(1+w)). The function moves distances within the same range, but could create a useless metric space; a line search is used to assure optimality. The Triangular Inequality: Concave Function Application. Vantage Point Trees: Overview. Given a set S, select a "vantage point" v in S.
Split S into S_l and S_r according to distance from v, and call recursively on S_l and S_r; this builds a balanced binary tree. Vantage Point Trees: Demonstration (a sequence of figures showing the points A-H recursively split around chosen vantage points). Vantage Point Trees: Searching for Nearest Neighbors Within t. Let A be the vantage point and m its median distance. Assume S_l contains a sequence within distance t of the query Q, i.e. d(Q, S_l) <= t, while d(Q, A) > m + t. Every sequence in S_l satisfies d(A, S_l) <= m, so by the triangular inequality d(Q, A) <= d(A, S_l) + d(Q, S_l) <= m + t. Thus we have a contradiction, and we can eliminate S_l from consideration. Vantage Point Trees: Optimizing. A similar proof eliminates S_r when d(Q, A) < m - t. However, if m - t < d(Q, A) < m + t, we can do nothing and must search both sides. If there are no nearest neighbors within t, there are no guarantees. So t should be as small as possible . . . or target/query distances should be as far as possible to the correct side of the median distance. Boosting for Efficiency: Summary. Create an instance of a metric access data structure given a target set; define a loss function particular to that structure; take the gradient of this loss function; use the gradient to tune the distance metric to the structure. Boosting for Efficiency: VP-Tree-Based Loss Function. Sum the loss for each training query and each target along the path to the correct target, scaling the loss according to the "cost" of a mistake. Here i indexes training queries, j indexes targets along the path, v_ij is "left or right", m_ij is the median distance, and f_ij(a, b) is the replacement count for the jth target in the ith training query's path: L = sum_i sum_j log(1 + exp(v_ij [m_ij - sum_a sum_b c(a, b) f_ij(a, b)])). Boosting for Efficiency: Gradient Expression. dL/dc_k(a, b) = - sum_i sum_j v_ij f_ij(a, b) / (1 + exp(-v_ij (m_ij - d(t_ij, q_ij)))). This retains the properties of the accuracy gradient (easy computation, ability to approximate) and expresses the desired notion of loss and margin. Synthetic Domain: Summary. Targets are sequences of five tuples (x, y) with domains (0, 9) and (0, 29) respectively. Queries are generated by moving sequentially through a target: with p = 0.3, generate a random query event with y <= 15 (insert); else, if the target y is > 15, generate a match; if the target y is <= 15, skip to the next target element (delete). Matches are (x1, y1) -> (x1, (y1 + 1) % 30). The domain is engineered to have structure. Query-by-Humming: Summary. A database of songs (targets) that can be queried aurally. Applications: commercial, entertainment, legal. This enables music to be queried on its own terms. Query-by-Humming: Basic Techniques for Query Processing. Queries require preprocessing: filtering, pitch detection, note segmentation. Targets, too: polyphonic transcription and melody spotting. We end up with a sequence of tuples (pitch, time). Application to Query-by-Humming: Our Data Set. 587 queries, 50 subjects (college and church choirs), 12 query songs; queries are split for training and test. Results: Experimental Setup. First 30 iterations: accuracy boosting only; then construct the VP-tree. Compare two methods for the second 30 iterations: accuracy only, and accuracy + efficiency (which may require a rebuild of the tree). Efficiency is measured by plotting the % of the target set culled vs. error while varying t. We would like low error and a high % culled; in reality, lowering t introduces error. Results: Query-by-Humming Domain. Results: Synthetic Domain. Conclusion. We designed a way to specialize a metric to a metric data structure, showed empirically that accuracy is not enough, and showed successful "twisting"
of the metric space. Future Work. Extending to other types of structured data; extending to other metric access methods (some are better, e.g. metric trees; some are worse, e.g. cover trees); use as a general solution to the structured prediction problem; use in automated planning and reinforcement learning. Fin.
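The vantage-point tree construction and pruning rule described in this talk can be sketched compactly. This is an illustrative implementation, not the authors' code; the class name and the choice of the first element as vantage point are ours:

```python
import statistics

class VPTree:
    """Vantage-point tree: split points around the median distance to a
    chosen vantage point, then recurse on each half."""

    def __init__(self, points, dist):
        self.dist = dist
        self.point = points[0]            # vantage point v
        rest = points[1:]
        self.median, self.left, self.right = None, None, None
        if rest:
            ds = [dist(self.point, p) for p in rest]
            self.median = statistics.median(ds)
            inner = [p for p, d in zip(rest, ds) if d <= self.median]
            outer = [p for p, d in zip(rest, ds) if d > self.median]
            self.left = VPTree(inner, dist) if inner else None
            self.right = VPTree(outer, dist) if outer else None

    def within(self, q, t):
        """All stored points within distance t of query q."""
        out = []
        d = self.dist(q, self.point)
        if d <= t:
            out.append(self.point)
        if self.median is not None:
            # The left child holds points s with d(v, s) <= m, so by the
            # triangular inequality it can hold a point within t of q only
            # if d(q, v) <= m + t; symmetrically for the right child.
            if self.left and d <= self.median + t:
                out.extend(self.left.within(q, t))
            if self.right and d >= self.median - t:
                out.extend(self.right.within(q, t))
        return out

points = [0, 5, 1, 9, 3, 7, 2, 8, 4, 6]
tree = VPTree(points, lambda a, b: abs(a - b))
# tree.within(4.2, 1.0) returns the stored points within distance 1 of 4.2
```

When d(q, v) lies strictly between m - t and m + t, both recursive calls fire, which is exactly the "must search both sides" case from the talk.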
Introduction to the panel Structured Machine Learning: The Next 10 Years. Motivation. Questions to the Panel. Structured Machine Learning: The Next 10 Years. Tom Dietterich, Pedro Domingos, Lise Getoor, Bernhard Pfahringer, Stephen Muggleton. Moderator: Prasad Tadepalli. Motivation: Structured machine learning, ILP, and relational learning have made a lot of progress over the last 20 years. Many new approaches and research directions are emerging, e.g., statistical relational learning, relational reinforcement learning, etc. There have been a large number of recent workshops on related topics, e.g., David Jensen's list. This seems like a good time to do some long-term strategic planning. Questions to the Panel: What are the important open scientific problems in this area or these areas? What application problems have the most potential? How does solving these problems advance the state of the art and the world? How does it facilitate interaction between different "subareas" of this field, including ILP, Statistical Relational Learning, Relational Reinforcement Learning, and others? What are one or two focused Ph.D. thesis topics in your research program?
Robust Non-linear Dimensionality Reduction using Successive 1-Dimensional Laplacian Eigenmaps Non-linear dimensionality reduction of noisy data is a challenging problem encountered in a variety of data analysis applications. Recent results in the literature show that spectral decomposition, as used for example by the Laplacian Eigenmaps algorithm, provides a powerful tool for non-linear dimensionality reduction and manifold learning. In this paper, we discuss a significant shortcoming of these approaches, which we refer to as the repeated eigendirections problem. We propose a novel approach that combines successive 1-dimensional spectral embeddings with a data advection scheme, which allows us to address this problem. The proposed method does not depend on a non-linear optimization scheme; hence, it is not prone to local minima. Experiments with artificial and real data illustrate the advantages of the proposed method over existing approaches. We also demonstrate that the approach is capable of correctly learning manifolds corrupted by significant amounts of noise.
Stability and Resampling Methods for Clustering

Model assessment is one of the most crucial aspects of statistical data analysis problems. In particular, in data clustering it is difficult to devise reasonable tools for this purpose; the most prominent example is the problem of choosing the number k of clusters one wants to construct. Stability-based methods and resampling methods have become a popular choice for model selection in clustering, which is documented by the wealth of literature on this topic. The basic rationale of those approaches is that valid models should be reproducible under perturbation or resampling of the data. If high instability of models is observed, the inferred solution does not seem to be a generally valid model, or at least seems to have missed some important aspects of the data.

Many scientists report that stability and resampling methods work well for clustering model selection. Moreover, for supervised learning there is a wealth of literature proving that stable classification algorithms have good generalization performance. On the other hand, it has recently been claimed that stability methods for clustering can be misleading and do not necessarily work the way people believe they do. There is still an ongoing debate on how those results should be interpreted, but many researchers working on clustering stability methods agree that there is a lack of theoretical understanding of stability methods in clustering. In particular, it seems unclear in which situations stability works and what the mechanism is that makes it a successful tool in those situations.

This lack of understanding is the motivation for holding a workshop on stability and resampling methods for clustering. We plan to hold a rather small workshop for specialists working on stability questions for clustering, or on stability-related questions in other areas of computer science or mathematics.
We want to have a small number of invited talks, but want to dedicate a considerable amount of time to discussions. Hopefully, combining the expertise of people working on different aspects of stability and resampling will lead to a deeper understanding of this tool and its role with respect to clustering.
A formal analysis of stability - lessons and open questions. A formal analysis of stability - lessons and challenges. What is a good clustering? (1). What is a good clustering? (2). In what sense is the leftmost clustering better than the middle one? Even if we commit to a fixed cost function. Even harder questions. Quest for a general theory. A more modest approach. Stability - the basic idea. Stability - the formal definition. (In)Stability detects non-clusterability. Stability distinguishes relevant from irrelevant clustering paradigms. Stability detects correct k (1). Stability detects correct k (2). Conclusions (as of Dec. 2005). Have we found a good answer? Some bothersome examples. The bottom line of a formal analysis. The formal results. Proof Idea 1: Uniqueness implies stability. Proof idea (2): Multiple solutions imply instability. Proof idea (continued) (1). Proof idea (continued) (2). Some Examples (1). Some Examples (2). Some Examples (3). Some Examples (4). Some Examples (5). Some Examples (6). Some Examples (7). Some Examples (8). The bottom line. Other notions of stability. Two different topics for discussion. Some thoughts on the "finite samples" issue. Alternative notions of clusterability. A formal analysis of stability - lessons and challenges. Shai Ben-David. PASCAL Workshop on Stability and Resampling Methods for Clustering. What is a good clustering? "Clustering" is an ill-defined problem: there are many different clustering tasks, leading to different clustering paradigms. In what sense is the leftmost clustering better than the middle one? (2-d data set; compact partitioning into two strata; unsupervised learning.) Even if we commit to a fixed cost function: you get a data set, run your 5-means clustering algorithm, and get a clustering C. You compute its 5-means cost, and it is 0.7. Can you conclude that C is a good clustering? How can we verify that the structure described by C is not just "noise"?
Even harder questions. How can we tell if a given data set has a good k-clustering solution (for a given k)? Can we have an efficient algorithm for this task (say, running in time sub-linear in the size of the input data)? Note that even approximating the cost of the optimal k-clustering is NP-hard. Quest for a general theory: can we find answers that are independent of any particular algorithm, particular objective function, or specific generative data model? A more modest approach: formulate conditions that should be satisfied by any conceivably good clustering function (sidestepping the issue of "what is a good clustering?"). In other words, find necessary conditions for good clustering. Stability - the basic idea. Cluster independent samples of the data and compare the resulting clusterings: meaningful clusterings should not change much from one independent sample to another. This idea has been employed as a tool for choosing the number of clusters in several empirical studies ([Ben-Hur et al. '02], [Lange, Braun, Roth, Buhmann '03] and many more); however, there is currently very limited theoretical support. Stability - the formal definition. Given a probability distribution P over some domain X, a clustering function A defined on {S : S ⊆ X}, a similarity measure d over clusterings, and a sample size m, InStab_m(A, P) = E_{S, S' ~ P^m} [d(A(S), A(S'))], namely, the expected distance between the clusterings generated by two P-random i.i.d. samples of size m. (In)Stability detects non-clusterability: there is no distribution-free stability guarantee. Example 1: the uniform distribution over a circle. InStab(C, P) will be large for any non-trivial clustering function.
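The instability measure just defined can be estimated by Monte Carlo: repeatedly draw two samples of size m, cluster each, and average a distance between the resulting clusterings. A minimal sketch, assuming 1-D data, a tiny Lloyd's k-means as the clustering function A, and a pair-counting (label-permutation-invariant) clustering distance as d; all function names are ours:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain 1-D Lloyd's algorithm; returns the cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in points:
            groups[min(range(k), key=lambda j: (x - centers[j]) ** 2)].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def assign(centers, x):
    return min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)

def pair_disagreement(centers_a, centers_b, eval_points):
    """Fraction of point pairs co-clustered under one solution but not the
    other; invariant to relabeling of the clusters."""
    la = [assign(centers_a, x) for x in eval_points]
    lb = [assign(centers_b, x) for x in eval_points]
    n_pairs, bad = 0, 0
    for i in range(len(eval_points)):
        for j in range(i + 1, len(eval_points)):
            n_pairs += 1
            if (la[i] == la[j]) != (lb[i] == lb[j]):
                bad += 1
    return bad / n_pairs

def instability(sample, k, m, trials=20, seed=1):
    """Monte-Carlo estimate of E d(A(S), A(S')) over independent
    size-m samples S, S'."""
    rng = random.Random(seed)
    total = 0.0
    for t in range(trials):
        s1 = rng.sample(sample, m)
        s2 = rng.sample(sample, m)
        total += pair_disagreement(kmeans(s1, k, seed=t),
                                   kmeans(s2, k, seed=trials + t),
                                   sample)
    return total / trials
```

On well-separated data with the correct k, the estimate is close to 0; on symmetric data such as Example 1 it stays large, which is exactly the behavior the talk exploits.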
Stability distinguishes relevant from irrelevant clustering paradigms. Example 2: a mixture of two uniform distributions over circles. InStab(C, P) will be large for any center-based clustering function, but single linkage turns out to be stable (for some choice of parameters). Stability detects the correct k. Example 3: a mismatch of the number of clusters. InStab(C, P) will be large for any center-based 2-cluster function. Conclusions (as of Dec. 2005): We formally define a measure of statistical generalization for sampling-based clustering stability. Stability is a necessary property for any clustering method to be considered "meaningful". Stability is viewed as a measure of the fit between a clustering function and the input data. We show that this measure can be reliably estimated from finite samples. Have we found a good answer? This is what we thought in January 2006, when we (with Ulrike von Luxburg and David Pal) set out to prove that "stability is a reliable model-selection tool". It is another interesting question how one can provide a mathematical formulation of such a statement. Some bothersome examples: a perfectly "circular" data set is unstable for every k > 1; once the symmetry is broken, it becomes stable for every choice of k. The bottom line of a formal analysis: stability does a nice model-selection job on simple synthetic distributions (as well as in many practical applications). We characterize it for the k-means optimization algorithms (BD-Luxburg-Pal, COLT '06; BD-Pal-Simon, COLT '07). We conclude that such success should be considered a lucky coincidence rather than a reliable rule. The formal results. We consider cost-minimizing clustering algorithms (e.g., k-means, minimizing the sum of squared distances of points to their respective centers).
We say that an algorithm A is stable on a data set D if lim_{m -> infinity} InStab_m(A, D) = 0. Theorem (BD-Pal-Luxburg '06, BD-Pal-Simon '07): a cost-minimizing algorithm A is stable on a data set D if and only if there is a unique clustering solution to the cost-minimization problem. Proof Idea 1: Uniqueness implies stability. Proof idea (2): Multiple solutions imply instability. Proof idea (continued). Some Examples. The bottom line: in practice, since no real data set is nicely symmetric, there will always be a unique cost-minimizing solution. Consequently, any choice of clustering parameters, on any real data set, will always end up being stable. Stability does not do the job we thought it did! Other notions of stability: there are several different reasonable notions of clustering stability; for example, one could consider data perturbations rather than random re-sampling. While our proof applies only to re-sampling stability, we believe that our results apply to such other notions as well. Two different topics for discussion: Is uniqueness of the optimal solution a reasonable measure of data clusterability? Can the asymptotic nature of our results explain the "theory-practice gap" concerning clustering stability? Some thoughts on the "finite samples" issue: clearly, the claim that in practice any data set has a unique clustering cost minimizer may fail if relaxed to "almost minimal" solutions. It follows that sample sizes that are not large enough may "view" the data as having multiple minima, and show the expected instability. But how large will "not large enough" be? To make a practically useful contribution, we would rather have a sample-size estimate that can be derived from random samples (without any prior information about the structure of the data); I doubt whether this is possible. Alternative notions of clusterability: in forthcoming work, we investigate a variety of notions of clusterability and of clustering quality. Clustering separability:
the ratio between the k-means cost and the (k-1)-means cost. (Variance within clusters)/(variance between clusters). Clustering robustness to data perturbations.
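The first of these quantities, the separability ratio, is easy to compute once a cost minimizer is available. An illustrative sketch (not from the talk), assuming a minimal 1-D Lloyd's k-means as the cost minimizer and function names of our choosing:

```python
import random

def kmeans_cost(points, k, iters=50, seed=0):
    """Cost found by plain 1-D Lloyd's k-means:
    sum of squared distances of points to their nearest center."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in points:
            groups[min(range(k), key=lambda j: (x - centers[j]) ** 2)].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sum(min((x - c) ** 2 for c in centers) for x in points)

def separability(points, k):
    """Ratio of the k-means cost to the (k-1)-means cost: a value well
    below 1 suggests the data genuinely supports k clusters."""
    return kmeans_cost(points, k) / kmeans_cost(points, k - 1)
```

For two well-separated groups, `separability(points, 2)` is tiny, because moving from one center to two removes almost all of the cost.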
On a statistical model of cluster stability. Concept. Motivating works. Clustering. Clustering (cont.). Example: a three-cluster set partitioned into 2 and 4 clusters. Implication. Concept. Concept (cont. 1). Concept (cont. 2). Some probability metrics. Examples. Ky Fan metrics. Concentration measure index. Simple and Compound Metrics. Geometrical Algorithm. General algorithm, given a probability metric dis(·, ·). Klebanov's N-distances. Simple distances (cont.). Graphical illustration. Graphical illustration (cont. 1). Distances between points belonging to different samples. Graphical illustration (cont. 2). Remark. Euclidean Minimal Spanning Tree. An EMST of 60 random points. How can an EMST be used in the cluster validation problem? Graphical illustration: stable clustering. Graphical illustration: non-stable clustering. The two-sample MST-test (cont. 2). The two-sample MST-test (cont. 3). Theorem's application. Theorem's application (cont. 1). Example: calculation of Rn(S1, S2). Distances from normality. The Kolmogorov-Smirnov Distance. Example: synthetic data. Example: synthetic data (cont. 1). Membership Stability Algorithm. A family of clustering algorithms. Clusters correspondence problem. Correspondence between two labelings obtained for a sample S. Example: the Iris Flower Dataset. Graph of the normalized mean value. Graph of the normalized quartile value. Histograms of the distances' values.
Auxiliary Variational Information Maximization for Dimensionality Reduction Mutual Information (MI) is a long-studied measure of information content, and many attempts to apply it to feature extraction and stochastic coding have been made. However, in general MI is computationally intractable to compute, and most previous studies redefine the criterion in forms of approximations. Recently we described properties of a simple lower bound on MI [2], and discussed its links to some of the popular dimensionality reduction techniques. Here we introduce a richer family of auxiliary variational bounds on MI, which generalize our previous approximations. Our specific focus is then on applying the bound to extracting informative lower-dimensional projections in the presence of irreducible Gaussian noise. We show that our method produces significantly tighter bounds on MI compared with the as-if Gaussian approximation [7]. We also show that learning projections to multinomial auxiliary spaces may facilitate reconstruction of the sources from noisy lower-dimensional representations.
Stability and convergence. Some Questions on Stability and Finite Samples. Setup. What Does Stability Tell Us? Stability ⇒ Convergence. J(A(S), D) vs. d(A(S), A(D)). Stability ⇒ Convergence. Stability, but far from converging. Stability ⇒ Convergence, Correction. Stability of Local Search. Are there local maxima in the infinite-sample likelihood of Gaussian mixture estimation?
Graph mincut, transductive inference and spectral clustering: some new elements. Clustering and Transductive Inference for Undirected Graphs. Overview of this talk. Clustering Generalities. Clustering Generalities 01. Clustering and Transductive Inference. Clustering and Transductive Inference 01. Clustering and Transductive Inference 02. Clustering and Transductive Inference 03. Transductive Inference for Weighted Graphs. Transductive Inference for Weighted Graphs 01. Transductive Inference for Weighted Graphs (Cont'd). Transductive Inference for Weighted Graphs (Cont'd) 01. Transductive Inference for Weighted Graphs (Cont'd) 02. Generalization Bound. Generalization Bound 01. Plausible Labeling Classes H'. Plausible Labeling Classes H' (Cont'd). Plausible Labeling Classes H' (Cont'd) 01. Plausible Labeling Classes H' (Cont'd) 02. Plausible Labeling Classes H' (Cont'd) 03. Plausible Labeling Classes H' (Cont'd) 04. Plausible Labeling Classes H' (Cont'd) 05. Plausible Labeling Classes H' (Cont'd) 06. Plausible Labeling Classes H' (Cont'd) 07. Plausible Labeling Classes H' (Cont'd) 08. Plausible Labeling Classes H' (Cont'd) 09. Plausible Labeling Classes H' (Cont'd) 10. Stability in Learning. Stability in Learning (Cont'd). Stability in Learning (Cont'd) 01. Stability in Clustering. Stability in Clustering 01. Stability in Clustering 02. Graph Algorithms, MINCUT, and Regularization. Conclusion. Clustering and Transductive Inference for Undirected Graphs. Kristiaan Pelckmans. Stability and Resampling Methods for Clustering. ESAT - SCD/SISTA, KULeuven, Leuven, Belgium. July 2007. Overview of this talk: (0) Clustering generalities; (1) Clustering and Transductive Inference; (2) Plausible Labeling Classes; (3) Stability in Learning and Clustering; (4) Graph Algorithms, MINCUT, and Regularization. Clustering Generalities. Definition (an attempt): for a set of n observations S and a hypothesis set H, max_{f in H} J_gamma(f|S) = Falsifiable(f) + gamma * Reproducible_{S_m ⊆ S}(f|S_m), reading (f|S) as "f projected on the data S". Reproducibility: stability, robust evidence, ...
Falsifiability: surprisingness, entropy, ... Very general... and thus to be specified: VQ and compression; generative models; (image) segmentation; multiple-view prediction (2 views: associative clustering; many views: (Krupka, 2005)); organization and retrieval... What is a good taxonomy? Clustering and Transductive Inference. A particular class of clustering: clustering as finding apparent structure, or a "fast learnable hypothesis class". Clustering for deriving a plausible hypothesis space, but interpretable: clusterwise constant (independence). Clustering as a stage before prediction... or as developing regularization (a prior). Different from VQ (auto-regression). No stochastic assumption so far... Suppose Alice and Bob both have a fair knowledge of the political structure in the world, and Alice knows a specific law in Europe/the USA/...; it would then be easy to communicate this piece of information from Alice to Bob. Transductive Inference for Weighted Graphs. So what can be learned from results in transductive inference? Deterministic weighted undirected graphs: a fixed number n in N_0 of nodes (objects) V = {v_1, ..., v_n}, organized in a deterministic graph G = (V, E) with edges E = {x_ij >= 0}_{i != j} (symmetrical, x_ij = x_ji; no loops, x_ii = 0). A fixed label y_i in {-1, 1} for every node i = 1, ..., n, but only partially observed: S ⊂ {1, ..., n}, y_S = {y_i}_{i in S}. Predict the remaining labels y_{-S} = {y_i}_{i not in S}. Transductive Inference for Weighted Graphs (Cont'd). Example: gene coexpression. Risk. Hypothesis set H = {q in {-1, 1}^n}, with |H| = 2^n. Given a restricted hypothesis set H' ⊆ H with |H'| << |H|, and a few observations y_S, where S is drawn uniformly without replacement. Actual risk: R(q|G) = E[I(y_J q_J <= 0)] = (1/n) sum_{i=1}^n I(y_i q_i <= 0), with the expectation over a uniform choice of J. Empirical risk: R_S(q|G) = (1/m) sum_{i in S} I(y_i q_i <= 0). Transductive risk: R^_{-S}(q|G) = (1/(n-m)) sum_{i not in S} I(y_i q_i <= 0). Generalization Bound: why empirical risk minimization works. Theorem (Generalization Bound). Let S ⊂ {1, . . .
, n} be uniformly sampled without replacement. Consider a set of hypothetical labelings H' ⊆ H_n having cardinality |H'| <= N. Then the following inequality holds with probability higher than 1 - delta: sup_{q in H'} [R(q|G) - R_S(q|G)] <= sqrt((2(n - m + 1)/(mn)) (ln|H'| - ln delta)). (1) For the transductive risk, for all q in H': R^_{-S}(q|G) <= R_S(q|G) + (n/(n - m)) sqrt((2(n - m + 1)/(mn)) (ln|H'| + ln(1/delta))). Plausible Labeling Classes H'. Which labelings H' = {q} are supported by a graph G? This can be formalized as, e.g.: which labelings H' can be reconstructed ("predicted") with a rule? Which labelings H' can be compressed with respect to a rule? Which labelings H' correlate with the graph structure? Which labelings H' are stable with respect to subsampling of the graph? Which labelings H' have a large "between/within cluster" ratio? Plausible Labeling Classes H' (Cont'd). Measuring the richness of H': cardinality |H'| (a finite setting!); covering balls (if many hypotheses q in H' are similar); the Kingdom dimension (VC dimension) of H': max_S |S| s.t. for all p in {-1, 1}^{|S|} there exists q in H' with p = q|_S, where q|_S denotes q restricted to S; the compression coefficient; and the Rademacher complexity R(H') = E[sup_{q in H'} (2/n) |sum_{i=1}^n sigma_i q_i|]. Clusterwise constant hypotheses. Definition (clusterwise constant hypothesis): assume a clustering C^k = {C_1, ..., C_k} such that C_i ∩ C_j = ∅, and let H_{C^k} = {q in {-1, 1}^n : q_{C_i} = 1_{|C_i|} or q_{C_i} = -1_{|C_i|} for each i, and q_j = -1 elsewhere}. Then VCdim(H_{C^k}) = k, and the Rademacher complexity R(H_{C^k}) depends on the sizes of the clusters; it is related to a normalization factor as in (Lange, 2004). The VC dimension of G: "For any vertex v of a graph G, the closed neighborhood N(v) of v is the set of all vertices of G adjacent to v. We say that a set S of vertices is shattered if every subset R ⊆ S can be realized as R = S ∩ N(v) for some vertex v of G. The VC dimension of G is defined to be the largest cardinality of a shattered set of vertices." [Haussler and Welzl, 1987] Let the neighbor-based labeling q corresponding to v in
V be defined as q_v with q_{v,j} = 1 if there is an edge e(v, v_j) and q_{v,j} = -1 otherwise. Hypothesis set H' = {q_v : v in V}. The VC dimension (Kingdom dimension) of H': max_S |S| s.t. for all p in {-1, 1}^{|S|} there exists v in V with p = q_v|_S, where q|_S denotes q restricted to S. Consistent 1NN Rule. The 1NN predictor rule: f_q(v_i) = q_{(i)}, where (i) = arg max_{j != i} x_ij. A consistent predictor rule serves as a restriction mechanism: 0 < q_i f_q(v_i) = q_i q_{(i)} for all i = 1, ..., n. Via Kruskal's MSP algorithm, VCdim(1NN) equals the number of disconnected components. K. Pelckmans (KULeuven - ESAT - SCD), Clustering and TI, Tubingen 2007. Consider the hypothesis set (average nearest neighbors) H_k = {q in {-1, 1}^n : q^T L q <= k}. Theorem (Cardinality of H_k; Pelckmans, Shawe-Taylor, 2007). Let {lambda_i}_{i=1}^n denote the eigenvalues of the graph Laplacian L = D - W, where D = diag(d_1, ..., d_n) in R^{n x n}. The cardinality of the set H_k can then be bounded as |H_k| <= sum_{d=0}^{n*(k)} (n choose d) <= (en/n*(k))^{n*(k)}, (2) where n*(k) is defined as n*(k) = |{lambda_i : lambda_i <= k}|. (3) Stability in Learning. Lemma (Permutation stability; R. El-Yaniv, D. Pechyony, 2007). Let Z in Z be a random permutation vector, and let f : Z -> R be an (m, n)-symmetric permutation function satisfying |f(Z) - f(Z^{ij})| <= beta for all (i, j) exchanging entries in S_m = {1, ..., m} and S_n = {m + 1, ..., n}. Then P(f(Z) - E[f(Z)] >= epsilon) <= exp(-epsilon^2 / (2 beta^2 K(m, n))), with K(m, n) = (n - m)^2 (H(n) - H(n - m)) and H(k) = sum_{i=1}^k 1/i (and hence 1/K(m, n) ~ m). Stability in Learning (Cont'd). Rademacher complexity for transductive inference. Definition (Rademacher complexity for TI): given a hypothesis class H', one has R(H'|G) = E[sup_{q in H'} (2/n) |sum_{i=1}^n sigma_i q_i|], with {sigma_i}_{i=1}^n Bernoulli random variables with P(sigma_i = -1) = P(sigma_i = 1) = 1/2. Assume further H' = -H'. Theorem (Rademacher bound). With probability exceeding 1 - delta for delta > 0, one has for all q in H' that R^_{-S}(q|G) <= R_S(q|G) + (n^2/(4m(n - m))) R(H'|G) + sqrt((n/(2m(n - m))) log(1/delta)) + 2 sqrt(n/(m(n - m))). Stability in Learning and Clustering: back to the clustering story... Stability in Clustering. Definition (stability for clustering): consider an algorithm A : X -> C. If there is a beta such that d(A(S_m), A(S'_m)) <= beta for all S_m, S'_m ⊆ {1, ..., n}, then for any epsilon > 0 one has P(d(A(S_m), E[A(S_m)]) >= epsilon) <= delta(epsilon) for an appropriate delta : R+ -> ]0, 1]. Idea: encode d(A(S_m), A(S'_m)) as |c(A(S_m)) - c(A(S'_m))|_1 for an appropriate encoding function c : C -> {0, 1}^{n_c}, with S_m = D(Z|m), or shortly Z|m. Then: Corollary (stability for clustering). Consider an algorithm A : X -> C. If there is a beta(A, m) such that (1/n_c) |c(A(Z|m)) - c(A(Z'|m))|_1 <= beta(A, m) for all S_m, S'_m ⊆ {1, ..., n}, then for any epsilon > 0 one has P((1/n_c) |c(A(Z|m)) - E_Z[c(A(Z|m))]|_1 >= epsilon) <= 2 exp(-epsilon^2 / (2 beta^2(A, m) K(m, n))), with K(m, n) = (n - m)^2 (H(n) - H(n - m)) and H(k) = sum_{i=1}^k 1/i (and hence 1/K(m, n) ~ m). Remarks: a natural encoding maps points onto a canonical "cluster identity"; this is a McDiarmid-type inequality; there are notions of weak stability; on the choice of norm, better choices may use a sup over the n_c coordinates; does k enter only via beta(A, m)? Graph Algorithms, MINCUT, and Regularization. Application: this can be recast as a convex graph-flow algorithm! Prior knowledge can be incorporated by relaxing to an LP/QP, e.g. sum_{i=1}^n (1 - q_i) <= B for B ~ n. Conclusion. Learning in finite universes! Clustering as setting the stage for prediction (in a particular sense). Clusterwise constant hypothesis classes. Stability results in learning. Cluster-ability and learnability? Falsifiable vs. reproducible? Need for a taxonomy?
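The three risks used in this talk (empirical, transductive, actual) are straightforward to compute for a candidate labeling. A minimal sketch, using the usual misclassification indicator I(y_i q_i <= 0); the function names are ours:

```python
def empirical_risk(y, q, S):
    """Fraction of observed nodes (indices in S) where the hypothesis
    labeling q disagrees with the true labels y."""
    return sum(y[i] * q[i] <= 0 for i in S) / len(S)

def transductive_risk(y, q, S):
    """Same disagreement rate, but over the unobserved nodes."""
    observed = set(S)
    rest = [i for i in range(len(y)) if i not in observed]
    return sum(y[i] * q[i] <= 0 for i in rest) / len(rest)

def actual_risk(y, q):
    """Disagreement rate over all n nodes."""
    return sum(yi * qi <= 0 for yi, qi in zip(y, q)) / len(y)
```

The three quantities satisfy the decomposition n * R = m * R_S + (n - m) * R^_{-S}, which is what lets the generalization bound on R be turned into the transductive-risk bound above.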
Cluster Stability and Robust Optimization - An Idea

Outline: Grouping/Segmentation Principles; Overview of this Talk; What is Data Clustering?; Example: k-means Clustering; The Validation Problem in Clustering (1), (2); Two Instance Scenario; Information Theoretic Idea to Control Approximation; Relation to Gibbs Sampling; Scales in Data Analysis and Vision; Conclusion & Open Issues.

Cluster Stability and Robust Optimization - An Idea. Joachim M. Buhmann, Institute for Computational Science, Department of Computer Science, ETH Zurich. Wednesday, 18 July 2007.

Grouping/Segmentation Principles. Compactness criterion: k-means clustering; pairwise clustering, average association; max-cut, average cut; normalized cut. Connectedness criterion: single linkage; path-based clustering.

Overview of this Talk. My view on clustering. The stability approach to cluster validation and an analogy to source-channel coding. Empirical Risk Approximation and its connection to annealing.

What is Data Clustering? Given are measurements/data X ∈ X to characterize objects o ∈ O. Clusterings partition objects into groups, i.e., c : O → {1, ..., k}, o ↦ c(o), with c from the hypothesis class C. Clustering quality: cost function R : X × C → R⁺, (c, X) ↦ R(c, X) = Σ_{o∈O} R_o(c, X).

Example: k-means clustering. Cost per object: R_o(c, X) = ‖x(o) − y_{c(o)}‖², where the y_ν are centroids. Optimal clustering solution: c_opt(o) = arg min_{c∈C} E_X{‖x(o) − y_{c(o)}‖²}. Hypothesis classes: vector quantization, C^VQ = {c : c(o) = arg min_ν ‖x(o) − y_ν‖}, and mixture models, C^MM = {c : all partitions of O}, with dim_VC(C^MM) = ∞.

The Validation Problem in Clustering. Modelling problem: does the cluster model describe the data? Selection of the costs/hypothesis class! Model order selection problem: is the number of clusters and/or features correct?

Overfitting in Clustering. Requirement: structures in two different data sets from the same data source should have approximately the same quality (costs)!

Two Instance Scenario. Instance spaces X(1), X(2); solution spaces C_γ(1), C_γ(2); ψ(c) maps between them; ×: ERM solutions, +: transferred ERM solution.

Information Theoretic Idea to Control Approximation. Use the data partition as a k-ary code. Communication is achieved via instances, since test instances are perceived as perturbed training instances. Determine how well a partition can be approximated when you see a test instance. A space-filling argument yields k^(entropy of partition type) / Card(approximation set).

Size of the Approximation Set? Optimality condition: if the sets are too small, the intersection C_γ(1) ∩ C_γ(2) is empty or nearly empty, |C_γ(1) ∩ C_γ(2)| ≪ max{|C_γ(1)|, |C_γ(2)|}, and the training solution has little to do with the test solution: overfitting. If they are too large, the approximation is not precise enough. Randomly sample from C_γ(2) and ψ(C_γ(1)). "Optimal" precision: find the smallest γ for which both sets are maximally overlapping.

Stochastic Approximation. Learning procedure: sample typical solutions from an approximation set, c_γ ∈ C_γ(1) = {c : R(c, X(1)) ≤ min_c R(c, X(1)) + γ}. Generalization performance: E_{X(2)}[R(ψ(c_γ), X(2))] versus R(c⊥, X(2)), with c⊥ := arg min_c R(c, X(2)). ψ(c) maps solutions from the training instance X(1) to solutions of the test instance X(2) by prediction.

Vapnik-Chervonenkis Inequality. Bound the test performance of the training solution, R(ψ(c_γ), X(2)) − R(c⊥, X(2)), by taking expectations w.r.t. the test data X(2) and comparing against the training quantities R(ψ⁻¹(c⊥), X(1)) and R(c_γ, X(1)).

Bound on Expected Performance. With c⊥ := arg min_c R(c, X(2)): E_{X(2)}[R(ψ(c_γ), X(2)) − R(c⊥, X(2))] ≤ γ + 2 max{ E[R(ψ(c_γ), X(2))] − E[R(c_γ, X(1))], E[R(ψ⁻¹(c⊥), X(1))] − E[R(c⊥, X(2))] }.

Probability of Large Deviation. Estimate the probability of large deviations, P{ER(2)(ψ(c_γ)) − ER(2)(c⊥) < γ + ε}. The first term, P{|ER(1)(ψ⁻¹(c⊥)) − ER(2)(c⊥)| < ε/2}, can be bounded in simple cases by a Hoeffding or Bernstein inequality, since c⊥ does not depend on the training data. The second term, P{|R(1)(c_γ) − ER(2)(ψ(c_γ))| < ε/2}, requires uniform convergence, since c_γ is data dependent.

Union Bound / Uniform Convergence. P{|ER(1)(ψ⁻¹(c⊥)) − ER(2)(c⊥)| ≥ ε/2} ≲ 2 exp(−κnε²), and P{|R(1)(c_γ) − ER(2)(ψ(c_γ))| ≥ ε/2} ≲ 2 (|C|/|C_γ|) exp(−κnε²). Bound on the expected risk: ER(2)(ψ(c_γ)) ≲ E[min_{c∈C} R(2)(c)] + γ + √( (log(1 + |C|/|C_γ|) + log 2) / (κn) ).

Relation to Gibbs Sampling. Relation to the statistical mechanics of learning: determine γ at the minimum of the bound, d(bound)/dγ = 0; then d(entropy)/d(energy) = d log|C_γ|/dγ = T⁻¹, which defines a stopping temperature T_stop. Gibbs sampling with temperature T < T_stop.

Estimate of Stopping Temperature. Experiment: data are drawn from a model with k = 5 groups; the inference algorithm assumes k_max = 10 groups. Important: we do not infer more than 5 groups! The inferred parameters are similar to the true parameter values.

Scales in Data Analysis and Vision. From fine to coarse: coarsening of variable space (increment the level of the resolution pyramid), coarsening of the optimization criterion (increase regularization), coarsening of model order (reduce the number of segments).

Conclusion & Open Issues. Stability provides a convincing framework to adjust model complexity! What are the components of a theory which optimally trades stability against informativity? Empirical Risk Approximation requires a thorough information-theoretic basis! What can we learn from clustering for other combinatorial optimization problems?
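The two-instance scenario above can be made concrete: run k-means on one sample, transfer its centroids to a second sample from the same source, and compare the transferred cost with the second sample's own ERM cost. The toy k-means, the deterministic initialization, and the synthetic two-cluster data below are illustrative assumptions, not Buhmann's actual experimental setup:

```python
# Two-instance stability sketch (illustrative, not the talk's procedure):
# a stable model order gives a small gap between the transferred training
# solution and the test instance's own ERM solution.
import numpy as np

def kmeans(X, k, iters=50):
    # deterministic toy initialization: first and last data points
    Y = X[[0, -1]].astype(float).copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - Y[None]) ** 2).sum(-1), axis=1)
        Y = np.array([X[labels == j].mean(0) if np.any(labels == j) else Y[j]
                      for j in range(k)])
    return Y

def cost(X, Y):
    # mean squared distance to the nearest centroid
    return ((X[:, None] - Y[None]) ** 2).sum(-1).min(1).mean()

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
X1 = np.concatenate([rng.normal(c, 0.3, (50, 2)) for c in centers])
X2 = np.concatenate([rng.normal(c, 0.3, (50, 2)) for c in centers])

Y1 = kmeans(X1, 2)                            # ERM solution on instance 1
gap = cost(X2, Y1) - cost(X2, kmeans(X2, 2))  # transferred vs. test ERM cost
```

For a well-separated two-cluster source and k = 2 the gap stays near zero; choosing a wrong model order would inflate it.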
Machine Learning and Cognitive Science of Language Acquisition Language acquisition and processing has been one of the central research issues in cognitive science. It is also an area in which the use of cognitive computational modelling has been especially intense. Language, and especially language acquisition, has been the key battleground between nativists and empiricists, and between advocates of rule-based, probabilistic, and connectionist models of thought. Yet the computational models proposed by CogSci researchers are often far behind, in scale and accuracy, the non-cognitively motivated models proposed by computational linguists, which are heavily based on machine learning techniques. This workshop asks how far these techniques, and their theoretical underpinnings, provide tools for building richer theories of cognitive processes. For example, can powerful machine learning techniques (e.g. kernel methods) help build models of the cognitive operations involved in human language acquisition? Conversely, can insights from cognitive science help inform and focus computational linguistics and machine learning? Can evidence concerning the spectacular computational performance of the human language processor help inspire new generations of computational linguistic and machine learning tools? This workshop will bring together participants from all of the disciplines that address this problem to discuss a range of related topics, from methodological issues in computational modelling of language acquisition, including evaluation of empirical learning models, to technical problems in machine learning and grammatical inference. The workshop includes invited talks by some of the leading researchers in these fields.
Music on Videolectures.Net The academic world is rich not only with knowledge but also with music. Enjoy the compositions of some of the musicians that inspired and were inspired by the world of science. More music coming soon...
Druga Godba Festival - performance of the tango quintet Astorpia In recent times, Slovene musicians have been busy proving that they can successfully test themselves against virtually any style of music, traditional or classical. For several years now, tango has been a part of everyday life here, and its popularity keeps on growing. This new, tight and well-honed ensemble, most of whose members have classical music training, can undoubtedly contribute greatly to increasing the popularity of tango in Slovenia still further, but they also offer much more than just reinterpretations of the old Argentinian standards. Since the name Astorpia itself alludes to one of the most important *new tango* artists, Astor Piazzolla, it should come as no surprise that we find five of his compositions on their first album MAR DEL PLATA, among them Libertango and Invierno Porteno, which have already become classics and make up part of a cycle dedicated to the four seasons. In addition to these pieces, which Astorpia manage to imbue with renewed vigour and freshness, their repertoire includes compositions by a number of other artists. They are not averse to playing wittily with the established rhythms of tango, nor even to placing them in a Balkan context, in a similar way to composer Milos Simic. In one of their songs, the group take a different direction entirely with a fiery Czardas. Viva el Tango!
The Centibots 100 Robot Project The Centibots system was a multi-robot system developed in part by SRI. Its team of 100 small robots was built from off-the-shelf components. This video describes the distributed robot control software and a subsequent demonstration.
How to say "No" to a robot This talk describes an integrated robotic system for spatial understanding and situated interaction in indoor environments. Robot communication is performed using only natural language, but sometimes it needs more than a "natural" language to understand.
The 13th International Conference on Knowledge Discovery and Data Mining During the past years, the [[http://www.acm.org/sigs/sigkdd/|**ACM SIGKDD**]] conference has established itself as the premier international conference on knowledge discovery and data mining, with an attendance of 600-900 people. To continue this tradition, the thirteenth ACM SIGKDD conference will provide a forum for researchers from academia, industry, and government, developers, practitioners, and the data mining user community to share their research and experience. The SIGKDD conference will feature keynote presentations, oral paper presentations, poster presentations, workshops, tutorials, and panels, as well as the **KDD Cup competition.** KDD-2007 will also award scholarships to selected students to help defray the cost of participating in the conference. **[[http://www.kdd2007.com/|The Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining]]**, in cooperation with AAAI, will continue the tradition of featured keynote speakers, paper and poster presentations, workshops, tutorials, and panels. [[http://www.acm.org/sigs/sigkdd/]] [[http://www.kdd2007.com/]] [[http://www.acm.org/]]
Statistical Modeling of Relational Data KDD has traditionally been concerned with mining data from a single relation. However, most applications involve multiple interacting relations, either explicitly (in relational databases) or implicitly (in semi-structured and multimodal data). Examples include link analysis, social networks, bioinformatics, information extraction, security, ubiquitous computing, etc. Mining such data has become a topic of keen interest in the KDD community in recent years. The key difficulty is that data in relational domains is no longer i.i.d. (independent and identically distributed), greatly complicating statistical modeling. However, research has now advanced to the point where robust, easy-to-use, general-purpose techniques and languages for mining non-i.i.d. data are available. The goal of this tutorial is to add a sufficient subset of these concepts and techniques to the toolkits of both researchers and practitioners. MAP/MPE Inference pt 1 MAP/MPE Inference pt 2 MAP/MPE Inference pt 3 MAP/MPE Inference pt 4 The MaxWalkSAT Algorithm But... Memory Explosion Computing Probabilities Ground Network Construction But... Insufficient for Logic Learning Weight Learning Structure Learning pt 1 Structure Learning pt 2 Alchemy pt 1 Alchemy pt 2 Overview Applications Running Alchemy Uniform Distribn.: Empty MLN Binomial Distribn.: Unit Clause Multinomial Distribution Multinomial Distrib.: !
Notation Multinomial Distrib.: + Notation Logistic Regression Text Classification pt 1 Text Classification pt 2 Hypertext Classification Information Retrieval Entity Resolution pt 1 Entity Resolution pt 2 Hidden Markov Models Information Extraction pt 1 Information Extraction pt 2 Information Extraction pt 3 Statistical Parsing pt 1 Statistical Parsing pt 2 Semantic Processing Bayesian Networks Relational Models pt 1 Relational Models pt 2 Practical Tips Summary Markov Logic: Intuition Markov Logic: Definition Example: Friends & Smokers pt 1 Example: Friends & Smokers pt 2 Example: Friends & Smokers pt 1 (a) Example: Friends & Smokers pt 2 (a) Example: Friends & Smokers pt 3 Example: Friends & Smokers pt 4 Example: Friends & Smokers pt 5 Example: Friends & Smokers pt 6 Example: Friends & Smokers pt 7 Example: Friends & Smokers pt 8 Markov Logic Networks Relation to Statistical Models Relation to First-Order Logic Statistical Modeling Of Relational Data Overview pt 1 Motivation Examples Costs and Benefits of Multi-Relational Data Mining Goal and Progress Plan Disclaimers Overview pt 2 Markov Networks pt 1 Markov Networks pt 2 Markov Nets vs. Bayes Nets Inference in Markov Networks MCMC: Gibbs Sampling Other Inference Methods MAP/MPE Inference MAP Inference Algorithms Overview pt 3 Learning Markov Networks Generative Weight Learning Pseudo-Likelihood Discriminative Weight Learning Other Weight Learning Approaches Structure Learning Overview pt 4 First-Order Logic Inference in First-Order Logic Satisfiability Stochastic Local Search The WalkSAT Algorithm Overview pt 5 Rule Induction Learning a Single Rule Learning a Set of Rules First-Order Rule Induction Overview pt 5 Plethora of Approaches Key Dimensions Knowledge-Based Model Construction Stochastic Logic Programs Probabilistic Relational Models Relational Markov Networks Bayesian Logic Markov Logic pt 1 Markov Logic pt 2
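As a concrete illustration of the "MCMC: Gibbs Sampling" step in the outline above, the sketch below runs a Gibbs sampler on a two-node pairwise Markov network with a single agreement potential; the model, the weight, and the sample sizes are toy assumptions, not the tutorial's code:

```python
# Gibbs sampling sketch for a toy pairwise Markov network with two binary
# nodes and potential exp(w) when they agree, 1 otherwise (illustrative).
import math
import random

random.seed(0)
w = 1.0  # agreement weight

def gibbs(n_samples, burn=1000):
    x = [0, 0]
    agree = 0
    for t in range(burn + n_samples):
        for i in (0, 1):
            # conditional: P(x_i equals the other node | rest) = e^w / (e^w + 1)
            p_same = math.exp(w) / (math.exp(w) + 1.0)
            x[i] = x[1 - i] if random.random() < p_same else 1 - x[1 - i]
        if t >= burn:
            agree += (x[0] == x[1])
    return agree / n_samples

p_hat = gibbs(20000)
```

For this tiny model the exact agreement probability is 2e^w / (2e^w + 2) = e^w / (e^w + 1), about 0.731 for w = 1, so the estimate can be checked directly.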
Latent Semantic Variable Models In the context of information retrieval and natural language processing, latent variable models are quite useful in modeling and discovering hidden structure that often leads to "semantic" data representations. This talk will provide an overview of the most popular approaches and discuss the range of possible applications for such models, including language modeling, ad hoc retrieval, text categorization and collaborative filtering. Latent Semantic Variable Models Introduction Information Retrieval & Latent Semantic Indexing Probabilistic Latent Semantic Indexing Semantic Features for Text Categorization Probabilistic HITS Collaborative Filtering Conclusion Latent Structure Matrix Decomposition Searching & Finding Ad Hoc Retrieval Document-Term Matrix A 100 Millionths of a Typical Document-Term Matrix Robust Information Retrieval - Beyond Keyword-based Search Challenges Latent Semantic Analysis Singular Value Decomposition Low-rank Approximation LSA Decomposition Latent Semantic Analysis Search as Statistical Inference Language Model Paradigm in IR Language Model Paradigm Naive Approach Estimation Problem Probabilistic Latent Semantic Analysis pLSA -
Latent Variable Model pLSA: Matrix Decomposition pLSA: Graphical Model pLSA via Likelihood Maximization Expectation Maximization Algorithm EM Algorithm: Derivation Tempered EM Algorithm Example (1) Example (2) Experimental Evaluation Live Implementation Latent Dirichlet Allocation Concept-based Text Categorization Terms & Concepts as Features Improvements on Reuters-21578 Improvements on OHSUMED87 Literature & Related Work Probabilistic HITS Finding Latent Web Communities Decomposing the Web Graph Linking Hyperlinks and Content Example: Ulysses Literature & Related Work Predictions & Recommendations
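The "Singular Value Decomposition / Low-rank Approximation" steps in the LSA part of the outline can be sketched in a few lines of numpy; the 4x4 document-term matrix is an illustrative toy:

```python
# LSA sketch: rank-k SVD approximation of a toy document-term matrix
# (rows = documents, columns = terms); the corpus is illustrative.
import numpy as np

A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 2, 1],
              [0, 0, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep the two largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # rank-k "semantic" approximation

err = np.linalg.norm(A - A_k)                # Frobenius norm of the residual
```

Here the singular values are {3, 3, 1, 1}, the rank-2 approximation keeps one dominant direction per topic block, and the discarded singular values account for a residual Frobenius norm of sqrt(2).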
Parallel Manipulators in Biomechanics and Robotics The lecture will present the newest findings in the field of parallel manipulators. Parallel manipulators are mechanisms that contain several closed parallel kinematic loops, of which only a few are driven by motors. Attention will be focused on the influence of different parameters: play (clearance) at the joints, constructional defects, kinematic singularities and isotropy, and the behaviour of the manipulator, especially from the biomechanics and robotics point of view.
Text Mining and Link Analysis for Web and Semantic Web The tutorial on Text Mining and Link Analysis for Web Data will focus on two main analytical approaches when analyzing web data: text mining and link analysis for the purpose of analyzing web documents and their linkage. First, the tutorial will cover some basic steps and problems when dealing with textual and network (graph) data, showing what it is possible to achieve without very sophisticated technology. The idea of this first part is to present the nature of unstructured and semi-structured data. Next, in the second part, more sophisticated methods for solving more difficult and challenging problems will be shown. In the last part, some of the current open research issues will be presented, and some practical pointers to the available tools for solving the previously mentioned problems will be provided. OntoGen - Main Screen Ontology Construction from Content Visualization Semantic Web - Cyc System (Example of Deep Reasoning) Cyc - a Little Bit of Historical Context The Cyc Ontology Part of Cyc Ontology on Human Beings Structure of Cyc Ontology pt 1 Structure of Cyc Ontology pt 2 Structure of Cyc Ontology pt 3 Structure of Cyc Ontology pt 4 Structure of Cyc Ontology pt 5 Cyc KB Extended w/Domain Knowledge pt 1 Cyc KB Extended w/Domain Knowledge pt 2 An Example of Psychoanalyst's Cyc Taxonomic Context Example Vocabulary: Senses of 'In' Relation (1/3) Example Vocabulary: Senses of 'In' Relation (2/3) Example Vocabulary: Senses of 'In' Relation (3/3) Cyc's Front-End: 'Cyc Analytic Environment' - Querying (1/2) Cyc's Front-End: 'Cyc Analytic Environment' -
Justification (2/2) Wrap-Up References to Some Text-Mining & Link Analysis Books References to Some Semantic Web Books References to the Main Conferences References to Some of the Text-Mining & Link Analysis Workshops References to Video Content Some of the Products Major Databases & Text-Mining Final Remarks Example Learning Algorithm: Perceptron Measuring Success - Model Quality Estimation Reuters Dataset - Categorization to Flat Categories Distribution of Documents SVM, Perceptron & Winnow Text Categorization Performance on Reuters Text Categorization into Hierarchy of Categories Considering Promising Categories Only (Classification by Naive Bayes) Summary of Experimental Results Active Learning pt 1 Active Learning pt 2 Some Approaches to Active Learning Experiment Illustration of Active Learning Uncertainty Sampling of Unlabeled Example pt 1 Uncertainty Sampling of Unlabeled Example pt 2 Uncertainty Sampling of Unlabeled Example pt 3 Uncertainty Sampling of Unlabeled Example pt 4 Uncertainty Sampling of Unlabeled Example pt 5 Uncertainty Sampling of Unlabeled Example pt 6 Uncertainty Sampling of Unlabeled Example pt 7 Uncertainty Sampling of Unlabeled Example pt 8 Uncertainty Sampling of Unlabeled Example pt 9 Uncertainty Sampling of Unlabeled Example pt 10 Uncertainty Sampling of Unlabeled Example pt 11 Uncertainty Sampling of Unlabeled Example pt 12 Uncertainty Sampling of Unlabeled Example pt 13 Uncertainty Sampling of Unlabeled Example pt 14 Uncertainty Sampling of Unlabeled Example pt 15 Uncertainty Sampling of Unlabeled Example pt 16 Uncertainty Sampling of Unlabeled Example pt 17 Unsupervised Learning Document Clustering K-Means Clustering Algorithm Example of Hierarchical Clustering (Bisecting K-Means) Latent Semantic Indexing LSI Example Visualization Why Visualizing Text?
Example: Visualization of PASCAL Project Research Topics Typical Way of Doing Text Visualization Graph Based Visualization of 1700 IST Project Descriptions into 2 Groups Graph Based Visualization of 1700 IST Project Descriptions into 3 Groups Graph Based Visualization of 1700 IST Project Descriptions into 2 Groups (a) Graph Based Visualization of 1700 IST Project Descriptions into 3 Groups (a) Graph Based Visualization of 1700 IST Project Descriptions into 10 Groups Graph Based Visualization of 1700 IST Project Descriptions into 20 Groups Tiling Based Visualization WebSOM WebSOM Visualization ThemeScape ThemeScape Document Visualization ThemeRiver - Topic Stream Visualization Kartoo.com - Visualization of Search Results SearchPoint - Re-Ranking of Search Results TextArc - Visualization of Word Occurrences NewsMap - Visualization of News Articles Document Atlas - Visualization of Document Collections and Their Structure Information Extraction Example: Extracting Job Openings from the Web Example: IE from Research Papers What Is 'Information Extraction'? pt 1 What Is 'Information Extraction'? pt 2 What Is 'Information Extraction'? pt 3 What Is 'Information Extraction'? pt 4 What Is 'Information Extraction'? pt 5 What Is 'Information Extraction'? pt 6 IE in Context Typical Approaches to IE Link-Analysis What Is Link Analysis? What Is Power Law? Power-Law on the Web pt 1 Power-Law on the Web pt 2 Power-Law on the Web pt 3 Structure of the Web - 'Bow Tie' Model SCC - Strongly Connected Component Modeling the Web Growth - 'Preferential Attachment Model' Algorithm Semantic-Web What Is Semantic Web? (Informal) What Is Semantic Web? (Formal) What Is the Link between Text-Mining, Link Analysis and Semantic Web? Semantic Web - Ontologies (Formalization of Semantics) Ontologies - Central Objects in SW What Is an Ontology? Which Elements Represent an Ontology? Semantic Web - Semantic Web Languages (XML, RDF, OWL) Which Levels Semantic Web Is Dealing With?
Stack of Semantic Web Languages Bluffer's Guide to RDF (1/2) Bluffer's Guide to RDF (2/2) OWL Layers Semantic Web - OntoGen System (Example of Ontology Learning) Ontology Learning OntoGen - Main Scenarios Using OntoGen - Main Scenario OntoGen - Main Screen Tutorial on Text Mining and Link Analysis for Web and Semantic Web Outline Text-Mining Why Do We Analyze Text? What Is Text-Mining? Why Dealing with Text Is Tough? Why Dealing with Text Is Easy? Who Is in the Text Analysis Arena? What Dimensions Are in Text Analytics? How Dimensions Fit to Research Areas? Broader Context: Web Science Text-Mining - How Do We Represent Text? Levels of Text Representations pt 1 Levels of Text Representations pt 2 Character Level Good and Bad Sides Levels of Text Representations pt 3 Word Level Words Properties Stop-Words Word Character Level Normalization Stemming (1/2) Stemming (2/2) Levels of Text Representations pt 4 Phrase Level Google N-Gram Corpus Example: Google N-Grams Levels of Text Representations pt 5 Part-of-Speech Level Part-of-Speech Table Part-of-Speech Examples Levels of Text Representations pt 6 Taxonomies/Thesaurus Level WordNet - Database of Lexical Relations WordNet - Excerpt from the Graph WordNet Relations WordNet -
Excerpt from the Graph (a) WordNet Relations (a) Levels of Text Representations pt 7 Vector-Space Model Level Bag-of-Words Document Representation Word Weighting Example Document and Its Vector Representation Similarity between Document Vectors Levels of Text Representations pt 8 Language Model Level Levels of Text Representations pt 9 Full-Parsing Level Levels of Text Representations pt 10 Cross-Modality Level Example: Aligning Text with Audio, Images and Video Levels of Text Representations pt 11 Collaborative Tagging Example: flickr.com Tagging Example: del.icio.us Tagging Example: flickr.com Tagging (a) Example: del.icio.us Tagging (a) Levels of Text Representations pt 12 Template / Frames Level Examples of Templates of KnowItAll System Levels of Text Representations pt 13 Ontologies Level Example: Text Represented in the First Order Logic Text-Mining - Typical Tasks on Text Document Summarization pt 1 Document Summarization pt 2 Selection Based Summarization Example of Selection Based Approach from MS Word Knowledge Rich Summarization Knowledge Rich Summarization Example Training of Summarization Model Example of Summarization Automatically Generated Graph of Summary Triples Text Segmentation pt 1 Text Segmentation pt 2 Hearst Algorithm for Text Segmentation Supervised Learning Document Categorization Task Document Categorization Algorithms for Learning Document Classifiers Example Learning Algorithm: Perceptron
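The "Vector-Space Model Level" items above (bag-of-words representation, word weighting, similarity between document vectors) can be sketched with plain TF-IDF and cosine similarity; the three-document corpus is an illustrative assumption:

```python
# Bag-of-words sketch: TF-IDF word weighting plus cosine similarity between
# document vectors, over an illustrative three-document corpus.
import math
from collections import Counter

docs = ["web mining mining", "web search", "graph mining"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for t in tokenized for w in t})
n = len(docs)
df = {w: sum(w in t for t in tokenized) for w in vocab}  # document frequency

def tfidf(tokens):
    tf = Counter(tokens)
    return [tf[w] * math.log(n / df[w]) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(t) for t in tokenized]
sim01 = cosine(vecs[0], vecs[1])   # docs 0 and 1 share only "web"
sim02 = cosine(vecs[0], vecs[2])   # docs 0 and 2 share only "mining"
```

Because "mining" occurs twice in document 0, the pair sharing it ends up more similar than the pair sharing a single-occurrence term, which is the intended effect of term-frequency weighting.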
Learning Bayesian Networks Bayesian networks are graphical structures for representing the probabilistic relationships among a large number of variables and doing probabilistic inference with those variables. The 1990s saw the emergence of excellent algorithms for learning Bayesian networks from passive data. I will discuss the constraint-based learning method using an intuitive approach that concentrates on causal learning. Then I will discuss the Bayesian approach with some simple examples. I will show how, using the Bayesian approach, we can even learn something about causal influences from passive data on two variables. Finally, I will show some applications to finance and marketing. slide 1 Some Caveats slide 3 References (1) References (2) Conflicting Empirical Results Relaxing the Assumption of No Hidden Common Causes slide 3 slide 4 Example 6 How could we end up with this contradiction? Causal Embedded Faithfulness Assumption slide 8 Learning Causal Influences Under the Causal Embedded Faithfulness Assumption Example 3 revisited Example 5 revisited Example 6 revisited Example 7 Example 8 slide 15 Example 9 slide 17 The actual causal relationships may be as follows slide 19 The Causal Embedded Faithfulness Assumption with Selection Bias Example 5 (revisited) slide 22 Causal Learning with Temporal Information Example 4 (revisited) Learning Causes From Data on Two Variables slide 26 Example 1 Example 2 Example 3 slide 30 slide 31 Application to Causal Learning Suppose the entire population is distributed as follows slide 34 Statistical Causality slide 2 A common way to learn (perhaps define) causation is via manipulation experiments (non-passive data) slide 4 slide 5 slide 6 slide 7 slide 8 Causal Graphs The Causal Markov Assumption Examples slide 12 slide 13 slide 14 Experimental evidence for the Causal Markov Assumption Exceptions to the Causal Markov Assumption 1. Hidden common causes 2. Causal feedback 3. Selection bias 4.
The entities in the population are units of time slide 21 slide 22 Causal Faithfulness Assumption Exceptions to the Causal Faithfulness Assumption Learning Causal Influences Under the Causal Faithfulness Assumption slide 26 Example slide 28 Example 1 Example 2 Example 3 Example 4 Theorem Example 5 How much data do we need? Empirical Results Conflicting Empirical Results
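The constraint-based method discussed above starts from (in)dependence judgments made on passive data. Below is a minimal sketch for two binary variables, using synthetic data in which X influences Y; the fixed deviation threshold is an illustrative stand-in for a proper significance test:

```python
# Sketch of the passive-data starting point of constraint-based learning:
# decide whether two binary variables are dependent from observational
# samples. Synthetic data; the threshold is illustrative, not a real test.
import random

random.seed(0)

def sample(n):
    data = []
    for _ in range(n):
        x = random.random() < 0.5
        y = random.random() < (0.8 if x else 0.2)   # X influences Y
        data.append((x, y))
    return data

def dependent(data, eps=0.05):
    n = len(data)
    px = sum(x for x, _ in data) / n
    py = sum(y for _, y in data) / n
    pxy = sum(x and y for x, y in data) / n
    # under independence P(X,Y) = P(X)P(Y); flag large deviations
    return abs(pxy - px * py) > eps

dep = dependent(sample(5000))
```

For this generating model the true deviation is 0.5*0.8 - 0.5*0.5 = 0.15, so with a few thousand samples the dependency is detected reliably; a constraint-based learner would combine many such (conditional) independence judgments.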
From Mining the Web to Inventing the New Sciences Underlying the Internet As the Internet continues to change the way we live, find information, communicate, and do business, it has also been taking on a dramatically increasing role in marketing and advertising. Unlike any prior mass medium, the Internet is unique when it comes to interactivity and offers the ability to target and program messaging at the individual level. This uniqueness in the richness of the data available for measurement, in the variety of ways to utilize the data, and in the great dependence of effective marketing on applications that are heavily data-driven makes data mining and statistical data analysis, modeling, and reporting an essential, mission-critical part of running an on-line business.
Calculating Latent Demand in the Long Tail An analytical framework for using power-law theory to estimate market size for niche products and consumer groups. He is the author of the New York Times bestselling book The Long Tail: Why the Future of Business is Selling Less of More, which was published in 2006, and runs a blog on the subject at longtail.com. In 2007 he was named one of the "Time 100", the newsmagazine's list of the 100 men and women whose power, talent or moral example is transforming the world. [[http://en.wikipedia.org/wiki/Chris_Anderson_%28writer%29|Chris Anderson - Wikipedia article]]
Information Genealogy: Uncovering the Flow of Ideas in Non-Hyperlinked Document Databases We now have incrementally grown databases of text documents reaching back over a decade, in areas ranging from personal email to news articles and conference proceedings. While accessing individual documents is easy, methods for overviewing and understanding these collections as a whole are lacking in number and in scope. In this paper, we address one such global analysis task, namely the problem of automatically uncovering how ideas spread through the collection over time. We refer to this problem as Information Genealogy. In contrast to bibliometric methods that are limited to collections with explicit citation structure, we investigate content-based methods requiring only the text and timestamps of the documents. In particular, we propose a language-modeling approach and a likelihood ratio test to detect influence between documents in a statistically well-founded way. Furthermore, we show how this method can be used to infer citation graphs and to identify the most influential documents in the collection. Experiments on the NIPS conference proceedings and the Physics ArXiv show that our method is more effective than methods based on document similarity.
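A minimal sketch of the language-modeling idea: score whether a later document is better explained by an earlier document's smoothed unigram model than by a background model, in the spirit of a likelihood ratio. The toy documents, the shared vocabulary, and the add-one smoothing are illustrative assumptions, not the authors' exact formulation:

```python
# Likelihood-ratio-style influence score (illustrative sketch): compare the
# log-likelihood of a later document under an earlier document's unigram
# model against a background model.
import math
from collections import Counter

def unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram language model over a fixed vocabulary."""
    c = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (c[w] + alpha) / total for w in vocab}

def log_lik(tokens, model):
    return sum(math.log(model[w]) for w in tokens)

earlier = "latent variable models for text".split()
later = "latent variable models for retrieval".split()
background = "the of and a in graph mining".split()

vocab = set(earlier) | set(later) | set(background)
score = (log_lik(later, unigram(earlier, vocab))
         - log_lik(later, unigram(background, vocab)))
# score > 0 suggests the later document is better explained by the earlier one
```

Thresholding such scores across all timestamp-ordered document pairs is one way to sketch the inferred "influence" edges the paper aggregates into citation graphs.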
Upping the Baseline for High-Precision Text Classifiers Many important application areas of text classifiers demand high precision and it is common to compare prospective solutions to the performance of Naive Bayes. This baseline is usually easy to improve upon, but in this work we demonstrate that appropriate document representation can make outperforming this classifier much more challenging. Most importantly, we provide a link between Naive Bayes and the logarithmic opinion pooling of the mixture-of-experts framework, which dictates a particular type of document length normalization. Motivated by document-specific feature selection we propose monotonic constraints on document term weighting, which is shown as an effective method of fine-tuning document representation. The discussion is supported by experiments using three large email corpora corresponding to the problem of spam detection, where high precision is of particular importance.
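The document-length normalization dictated by the logarithmic-opinion-pooling view can be sketched as follows: each document's per-term log-odds contributions are averaged over its length, so a document that merely repeats itself does not accumulate evidence. The toy vocabulary and weights are illustrative assumptions, not the paper's tuned representation:

```python
# Length-normalized Naive-Bayes-style log score (illustrative sketch):
# per-term log-odds contributions are divided by document length, the
# normalization implied by logarithmic opinion pooling.
from collections import Counter

# toy per-term log P(term|spam)/P(term|ham) weights (assumed, not learned)
log_ratio = {"free": 1.2, "offer": 0.9, "meeting": -1.1, "report": -0.8}
prior = 0.0  # log-odds prior, assumed balanced classes

def score(tokens):
    tf = Counter(t for t in tokens if t in log_ratio)
    length = sum(tf.values()) or 1
    return prior + sum(log_ratio[w] * c / length for w, c in tf.items())

short = score("free offer".split())
long_doc = score(("free offer " * 20).split())
# both scores are identical: repetition does not inflate the evidence
```

Without the 1/length factor the long document's score would be twenty times larger, which is exactly the behavior that hurts unnormalized Naive Bayes in high-precision settings.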
Fast Direction-Aware Proximity for Graph Mining In this paper we study asymmetric proximity measures on directed graphs, which quantify the relationships between two nodes or two groups of nodes. The measures are useful in several graph mining tasks, including clustering, link prediction and connection subgraph discovery. Our proximity measure is based on the concept of escape probability. This way, we strive to summarize the multiple facets of nodes-proximity, while avoiding some of the pitfalls to which alternative proximity measures are susceptible. A unique feature of the measures is accounting for the underlying directional information. We put a special emphasis on computational efficiency, and develop fast solutions that are applicable in several settings. Our experimental study shows the usefulness of our proposed direction-aware proximity method for several applications, and that our algorithms achieve a significant speedup (up to 50,000x) over straightforward implementations. Fast Direction-Aware Proximity for Graph Mining Proximity on Graph Edge Direction w/ Proximity Motivating Questions (Fast DAP) Roadmap pt 1 Defining DAP: Escape Probability Escape Probability: Example Escape Probability is Good, but... Issue 1: "Degree-1 Node" Effect Universal Absorbing Boundary Introducing Universal-Absorbing-Boundary Issue 2: Weakly Connected Pair Practical Modifications: Partial Symmetry Roadmap pt 2 Solving Escape Probability: [Doyle+] Solving DAP: [Doyle+] Solving vk (j, i) [Doyle+] Transition Matrix Solving DAP (Straight-Forward Way) Challenges FastAllDAP FastAllDAP: Observation FastAllDAP: Rescue FastAllDAP: Theorem FastAllDAP: Algorithm FastOneDAP FastOneDAP: Observation pt 1 FastOneDAP: Observation pt 2 FastOneDAP: Observation pt 3 FastOneDAP: Iterative Alg. FastOneDAP: Property Roadmap pt 3 Datasets (All Real) We Want to Check...
Link Prediction: Existence pt 1 Link Prediction: Existence pt 3 Link Prediction: Direction Efficiency: FastAllDAP Efficiency: FastOneDAP Roadmap pt 4 Conclusion (Fast DAP) More in the Paper? Cupid Uses Arrows, So Does Graph Mining! Fast Direction-Aware Proximity for Graph Mining (a)
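The escape probability underlying DAP, ep(i, j), is the chance that a random walk starting at node i reaches node j before returning to i. The paper computes it by solving linear systems (FastAllDAP/FastOneDAP); the Monte Carlo sketch below only illustrates the quantity itself on an unweighted directed graph, and all names are mine:

```python
import random

def escape_probability(adj, i, j, walks=20000, max_steps=1000, seed=0):
    """Estimate ep(i, j): probability that a random walk from i reaches j
    before returning to i, on a directed graph given as an adjacency list
    {node: [out-neighbors]}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        node = i
        for _ in range(max_steps):      # cap walk length to avoid cycling forever
            nbrs = adj.get(node)
            if not nbrs:                # dead end: walk is absorbed, no escape
                break
            node = rng.choice(nbrs)
            if node == j:               # escaped to j first
                hits += 1
                break
            if node == i:               # returned to i first
                break
    return hits / walks
```

On a graph where node 0 steps to the target half the time and otherwise bounces straight back, the estimate converges to 0.5.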
Correlation Search in Graph Databases Correlation mining has achieved great success in many application domains for its ability to capture the underlying dependency between objects. However, research on correlation mining from graph databases is still lacking, despite the fact that graph data, especially in various scientific domains, have proliferated in recent years. In this paper, we propose a new problem of correlation mining from graph databases, called Correlated Graph Search (CGS). CGS adopts Pearson's correlation coefficient as a correlation measure to take into consideration the occurrence distributions of graphs. However, the problem poses significant challenges, since every subgraph of a graph in the database is a candidate but the number of subgraphs is exponential. We derive two necessary conditions which set bounds on the occurrence probability of a candidate in the database. With this result, we design an efficient algorithm that operates on a much smaller projected database and thus we are able to obtain a significantly smaller set of candidates. To further improve the efficiency, we develop three heuristic rules and apply them on the candidate set to further reduce the search space. Our extensive experiments demonstrate the effectiveness of our method on candidate reduction. The results also justify the efficiency of our algorithm in mining correlations from large real and synthetic datasets.
Correlation Search in Graph Databases Outline Introduction pt 1 Introduction pt 2 Introduction - Motivation pt 1 Introduction - Motivation pt 2 Introduction - Motivation pt 3 Introduction - Motivation pt 4 Introduction - Motivation pt 5 Introduction - Correlation Search in Graph Databases pt 1 Introduction - Correlation Search in Graph Databases pt 2 Introduction - Correlation Search in Graph Databases pt 3 Introduction - Correlation Search in Graph Databases pt 4 Introduction - Contributions pt 1 Introduction - Contributions pt 2 Problem Definition - Correlation Measure pt 1 Problem Definition - Correlation Measure pt 2 Problem Definition - Correlation Measure pt 3 Problem Definition Solution - Candidate Generation pt 1 Solution - Candidate Generation pt 2 Solution - Candidate Generation pt 3 Solution - Candidate Generation (cont'd) pt 1 Solution - Candidate Generation (cont'd) pt 2 Solution - Candidate Generation (cont'd) pt 3 Solution - Candidate Generation (cont'd) pt 4 Solution - Heuristic Rules pt 1 Solution - Heuristic Rules pt 2 Solution - CGSearch Algorithm Performance Evaluation Effect of Candidate Generation when Varying Query Support Effect of Graph Size Conclusions Thank You Correlation Search in Graph Databases Yiping Ke James Cheng Wilfred Ng Department of Computer Science and Engineering The Hong Kong University of Science and Technology Outline Introduction Problem Definition Solution Performance Evaluation Conclusions Introduction Graphs Model objects and their relationships Everywhere in various scientific domains Bioinformatics: protein interaction networks Chemistry: chemical compound structures Social science: social networks Many more: workflows, Web site structures, etc Existing Research on Graph Search Focus on structural similarity search: find the graphs structurally the same as or similar to a given query graph Introduction - Motivation (a) Graph A (b) Graph B (c) Graph C (d) Query Application Scenario Structural similarity search Need: find
co-occurrent molecular structure of a given molecule Structural similarity search fails to find such results Co-occurrent structures may decide some chemical properties Y. Ke et al (CSE, HKUST) KDD'07 4 / 17 Introduction - Correlation Search in Graph Databases Correlation Capture the underlying dependence between objects Well-studied in boolean databases, quantitative databases, multimedia databases, data streams, and many more New Challenges in Graph Databases Large search space Each subgraph of a graph in the database is a candidate Exponentially many subgraphs Expensive graph operation Subgraph isomorphism testing (NP-complete) Introduction - Contributions New Problem of Correlated Graph Search (CGS) Correlation measure: Pearson's correlation coefficient Effective and Efficient Solution: CGSearch Theoretical bounds for the support (occurrence probability) of a candidate Candidate generation from the projected database of the query graph Three heuristic rules to further reduce the number of candidates Problem Definition - Correlation Measure Pearson's Correlation Coefficient Popularly used as a correlation measure in many other contexts: stream data, transaction databases Definition: φ(g1, g2) = (supp(g1, g2) − supp(g1)supp(g2)) / sqrt(supp(g1)supp(g2)(1 − supp(g1))(1 − supp(g2))) Measures the departure of two variables from independence Falls within [−1, 1]: 0 indicates independence; positive indicates positive correlation; negative indicates negative correlation Our work: focus on positive correlation Problem Definition CGS Problem Given a graph database D, a correlation query graph q and a minimum correlation threshold θ (0 < θ ≤ 1), find the set of all graphs whose Pearson's correlation coefficient with q is no less than θ Solution - Candidate Generation Bounds of supp(g): supp(q) / (θ⁻²(1 − supp(q)) + supp(q)) ≤ supp(g) ≤ supp(q) / (θ²(1 − supp(q)) + supp(q)) Range: Candidate Generation from D Mine the set of Frequent subGraphs (FGs) from D using the above two bounds as thresholds Drawback All existing FG mining algorithms generate graphs with higher support before those with lower support Not efficient and scalable, especially when D is large or the lower bound is low Solution - Candidate Generation (cont'd) Bound of supp(q, g; Dq): supp(q, g; Dq) ≥ 1 / (θ⁻²(1 − supp(q)) + supp(q)) Candidate Generation from Dq Mine the set of FGs from Dq using the above threshold Compared with Range Dq is much smaller than D The minimum support threshold is higher Advantages Efficient candidate generation Significant reduction in search space Solution - Heuristic Rules Heuristic 1: identify graphs that are guaranteed to be answers All supergraphs of q in the candidate set are in the answer set Heuristics 2 and 3: get rid of false positives If a graph g is not in the answer set, prune all its subgraphs that have the same support as g or have support less than θ·sqrt((1 − supp(q))supp(g)(1 − supp(g)) / supp(q)) + supp(g) in Dq Solution - CGSearch Algorithm Input: Graph database D, query q, correlation threshold θ Output: The answer set Aq Obtain Dq Mine the set of candidate graphs C from Dq, using 1 / (θ⁻²(1 − supp(q)) + supp(q)) as the minimum support threshold Check whether φ(q, g) ≥ θ for each graph g ∈ C; refine C by the three heuristic rules Performance Evaluation Datasets Real dataset: 100K compound structures of cancer and AIDS data, on average 21 nodes and 23 edges per graph, 88 distinct labels Synthetic dataset: four datasets of 100K graphs by varying the average number of edges from 40 to 100, 30 distinct labels and 0.15 average graph density Other Algorithms Used Obtain projected database: FG-index [SIGMOD'07] Mine FGs: gSpan [Yan and Han, ICDM'02] Baseline Range: candidate generation from D with a support range Effect of Candidate Generation when Varying Query Support Summary CGSearch is two orders of magnitude faster than Range The candidate set produced by CGSearch is much closer to the answer set and is over an order of magnitude smaller than Range's (a) Running Time (b) Size of Candidate Set Effect of Graph Size CGSearch is up to four orders of magnitude faster and consumes 41 times less memory than Range CGSearch is much more stable on resource usage than Range (a) Running Time (b) Memory Consumption Conclusions Correlated Graph Search Takes into account the occurrence distributions of graphs using Pearson's correlation coefficient Mining Algorithm: CGSearch Theoretical bounds for the support of candidates Candidate generation from a projected database Three heuristic rules Experiments Candidate generation from the projected database is efficient The three heuristic rules are effective Compared with Range, CGSearch is orders of magnitude faster CGSearch achieves very stable performance for various query supports, minimum correlation thresholds, as well as graph sizes Thank you Q&A Poster: Board 2 on Aug 13
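CGS measures correlation by Pearson's φ over supports, and its candidate support bounds follow from requiring φ(q, g) ≥ θ while supp(q, g) ≤ min(supp(q), supp(g)). A numeric sanity check of that relationship (function names are mine): at the bounds, φ equals θ exactly when the joint support attains its maximum.

```python
import math

def phi(supp_q, supp_g, supp_qg):
    """Pearson's correlation coefficient between two (sub)graphs, computed
    from their supports (occurrence probabilities) in the database."""
    denom = math.sqrt(supp_q * supp_g * (1 - supp_q) * (1 - supp_g))
    return (supp_qg - supp_q * supp_g) / denom

def supp_bounds(supp_q, theta):
    """Lower/upper bounds on supp(g) for any graph g with phi(q, g) >= theta,
    derived from supp(q, g) <= min(supp(q), supp(g))."""
    lower = supp_q / ((1 - supp_q) / theta**2 + supp_q)
    upper = supp_q / (theta**2 * (1 - supp_q) + supp_q)
    return lower, upper
```

At the lower bound, setting supp(q, g) = supp(g) gives φ = θ; at the upper bound, setting supp(q, g) = supp(q) does the same, so any answer's support must lie between the two.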
Real-Time Collaborative Environments Real-Time Collaborative Environments Talk Outline Real-Time Collaborative Environments The AccessGrid The AccessGrid: Venue Client The AccessGrid: Typical Media Tools The AccessGrid: User Experience History and Standards Development Current Deployment Status Underlying Technologies Control Protocols: Concepts Control Protocols: Implementations Data Transfer Protocols RTP: Real-time Transport Protocol Capture and Analysis of Media Sessions Underlying Technologies Research Issues and Future Directions Summary
A Framework For Community Identification in Dynamic Social Networks We propose frameworks and algorithms for identifying communities in social networks that change over time. Communities are intuitively characterized as "unusually densely knit" subsets of a social network. This notion becomes more problematic if the social interactions change over time. Aggregating social networks over time can radically misrepresent the existing and changing community structure. Instead, we propose an optimization-based approach for modeling dynamic community structure. We prove that finding the most explanatory community structure is NP-hard and APX-hard, and propose algorithms based on dynamic programming, exhaustive search, maximum matching, and greedy heuristics. We demonstrate empirically that the heuristics trace developments of community structure accurately for several synthetic and real-world examples. A Framework For Community Identification in Dynamic Social Networks Social Networks History of Interactions Community Identification The Question: What is Dynamic Community? Approach: Graph Model Approach: Assumptions pt 1 Approach: Color = Community Approach: Assumptions pt 1 (a) Approach: Color = Community (a) Approach: Assumptions pt 2 Costs pt 1 Approach: Assumptions pt 3 Costs pt 2 Approach: Assumptions pt 4 Costs pt 3 Problem Definition Model Validation and Algorithms Southern Women Data Set Ethnography An Optimal Coloring: (?,?1,?2,?)=(1,1,3,1) An Optimal Coloring: (?,?1,?2,?)=(1,1,1,1) Conclusions Thank You Computational Population Biology Lab UIC
Fast Best-Effort Pattern Matching in Large Attributed Graphs We focus on large graphs where nodes have attributes, such as a social network where the nodes are labelled with each person's job title. In such a setting, we want to find subgraphs that match a user query pattern. For example, a "star" query would be, "find a CEO who has strong interactions with a Manager, a Lawyer, and an Accountant, or another structure as close to that as possible". Similarly, a "loop" query could help spot a money laundering ring. Traditional SQL-based methods, as well as more recent graph indexing methods, will return no answer when an exact match does not exist. Our method can find exact- as well as near-matches, and it will present them to the user in our proposed "goodness" order. For example, our method tolerates indirect paths between, say, the "CEO" and the "Accountant" of the above sample query, when direct paths do not exist. Its second feature is scalability. In general, if the query has nq nodes and the data graph has n nodes, the problem needs polynomial time complexity O(n^nq), which is prohibitive. Our G-Ray ("Graph X-Ray") method finds high-quality subgraphs in time linear in the size of the data graph. Experimental results on the DBLP author-publication graph (with 356K nodes and 1.9M edges) illustrate both the effectiveness and scalability of our approach. The results agree with our intuition, and the speed is excellent. It takes 4 seconds on average for a 4-node query on the DBLP graph.
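G-Ray's actual machinery (random-walk proximity and fast approximate subgraph extraction) is not reproduced here. As a toy illustration of "best-effort" matching, the sketch below answers a star query by BFS distance, so an indirect path between the center and a leaf is tolerated but penalized in the goodness score (total path length). All names are illustrative:

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src over an adjacency-list graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def best_star_match(adj, labels, center_label, leaf_labels):
    """Best-effort star match: choose the center node whose cheapest
    connections to one node of each required leaf label minimize total
    path length. Returns (cost, center, leaves) or None."""
    best = None
    for c, lab in labels.items():
        if lab != center_label:
            continue
        dist = bfs_dist(adj, c)
        cost, leaves, ok = 0, [], True
        for need in leaf_labels:
            cands = [(dist[v], v) for v, l in labels.items()
                     if l == need and v in dist and v != c]
            if not cands:
                ok = False
                break
            d, v = min(cands)
            cost += d
            leaves.append(v)
        if ok and (best is None or cost < best[0]):
            best = (cost, c, leaves)
    return best
```

In the test graph below, the Accountant is only reachable from the CEO through the Lawyer, yet the match is still returned, just with a higher cost.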
Support Feature Machine for Classification of Abnormal Brain Activity In this study, a novel multidimensional time series classification technique, namely the support feature machine (SFM), is proposed. SFM is inspired by the optimization model of the support vector machine and the nearest neighbor rule to incorporate both spatial and temporal properties of multi-dimensional time series data. This paper also describes an application of SFM for detecting abnormal brain activity. Epilepsy is a case in point in this study. In epilepsy studies, electroencephalograms (EEGs), acquired in multidimensional time series format, have traditionally been used as a gold-standard tool for capturing the electrical changes in the brain. From multi-dimensional EEG time series data, SFM was used to identify seizure precursors and detect seizure susceptibility (pre-seizure) periods. The empirical results showed that SFM achieved over 80% correct classification of pre-seizure EEG on average in 10 patients using 5-fold cross validation. The proposed optimization model of SFM is very compact and scalable, and can be implemented as an online algorithm. The outcome of this study suggests that it is possible to construct a computerized algorithm to detect seizure precursors and warn of impending seizures through EEG classification. Support Feature Machine for Classification of Abnormal Brain Activity Agenda Objectives How Many People Have Epilepsy?
Epilepsy and Seizures Intracranial EEG Acquisition Electroencephalogram (EEG) 10-Second EEGs: Seizure Evolution Open Problems Data Transformation Using Chaos Theory Measure of Chaos Classification of Physiological States Support Vector Machine vs Support Feature Machine Nearest Neighbor for Time Series Similarity Measures Support Feature Machine Decision Rule: Basic Ideas Optimization Model I: Averaging Model I: Averaging Formulation Optimization Model II: Voting Decision Rule: Basic Ideas Model II: Voting Formulation Data Selection and Sampling Sensitivity and Specificity 5-Fold Cross Validation Result DTW pt 1 DTW pt 2 Automated Seizure Prediction Paradigm Concluding Remarks Reference Acknowledgements
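The talk's Model II combines per-channel nearest-neighbor decisions by voting. A minimal sketch of that decision rule (the actual SFM selects an optimal subset of channels via an optimization model; here every channel votes, and all names are mine):

```python
import math

def one_nn(train, labels, x):
    """1-nearest-neighbor by Euclidean distance over one channel's features."""
    return min(zip(train, labels), key=lambda tc: math.dist(tc[0], x))[1]

def vote_classify(train_by_channel, labels, sample_by_channel):
    """Voting rule in the spirit of SFM's Model II: each EEG channel casts a
    1-NN vote and the majority label wins (ties broken by label order)."""
    votes = [one_nn(ch_train, labels, ch_x)
             for ch_train, ch_x in zip(train_by_channel, sample_by_channel)]
    return max(sorted(set(votes)), key=votes.count)
```

With three channels and prototypes for "pre-seizure" and "normal" states, two channels near the pre-seizure prototype outvote the third.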
Automatic Labeling of Multinomial Topic Models Multinomial distributions over words are frequently used to model topics in text collections. A common, major challenge in applying all such topic models to any text mining problem is to label a multinomial topic model accurately so that a user can interpret the discovered topic. So far, such labels have been generated manually in a subjective way. In this paper, we propose probabilistic approaches to automatically labeling multinomial topic models in an objective way. We cast this labeling problem as an optimization problem involving minimizing Kullback-Leibler divergence between word distributions and maximizing mutual information between a label and a topic model. Experiments with a user study have been done on two text data sets with different genres. The results show that the proposed labeling methods are quite effective at generating labels that are meaningful and useful for interpreting the discovered topic models. Our methods are general and can be applied to labeling topics learned through all kinds of topic models such as PLSA, LDA, and their variations. Automatic Labeling of Multinomial Topic Models Outline Statistical Topic Models for Text Mining Topic Models: Hard to Interpret What is a Good Label? Our Method Relevance (Task 2): the Zero-Order Score Relevance (Task 2): the First-Order Score Discrimination and Coverage (Tasks 3 & 4) Variations and Applications Experiments Result Summary Results: Sample Topic Labels Results: Context-Sensitive Labeling Summary Thanks
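The zero-order relevance score mentioned in the outline can be sketched as scoring a candidate label phrase by how probable its words are under the topic's word distribution p(w|θ). This is a simplification of the paper's full scoring (the first-order score additionally uses context statistics); function names and the unseen-word floor are mine:

```python
import math

def zero_order_relevance(topic, label):
    """Zero-order score of a candidate label (a phrase) against a topic,
    given as a dict p(w|theta): the log-probability of the label's words
    under the topic, with a small floor for unseen words."""
    return sum(math.log(topic.get(w, 1e-12)) for w in label.split())

def best_label(topic, candidates):
    """Pick the candidate label with the highest zero-order relevance."""
    return max(candidates, key=lambda l: zero_order_relevance(topic, l))
```

A topic dominated by "clustering" and "data" should prefer "data clustering" over an unrelated phrase: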
Mining Statistically Important Equivalence Classes The support-confidence framework is the most common measure used in itemset mining algorithms, for its antimonotonicity that effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results, as observed by many previous studies. In this paper, we introduce a novel algorithm that produces itemsets with ranked statistical merits under sophisticated test statistics such as chi-square, risk ratio, odds ratio, etc. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, itemsets within an equivalence class all share the same level of statistical significance regardless of the variety of test statistics. As an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we just mine closed patterns and generators, taking a simultaneous depth-first search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm on two aspects. In general, we compare to LCM and FPclose, which are the best algorithms tailored for mining only closed patterns. In particular, we compare to epMiner, which is the most recent algorithm for mining a type of relative risk patterns, known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes even multiple orders of magnitude faster. These statistically ranked patterns and the efficiency have a high potential for real-life applications, especially in biomedical and financial fields where classical test statistics are of dominant interest.
Mining Statistically Important Equivalence Classes and Delta Discriminative Emerging Patterns The Research Problem Objectives New Problem Contribution A Data Set Frequent Itemsets (Patterns) Equivalence Classes Closed Patterns and Generators An Example Observation 1 Observation 2 Computational Steps Revised FP-Tree for Pruning Non-Generators To Identify Closed Patterns in Parallel An Option to Find Performance Comparison pt 1 Performance Comparison pt 2 Performance Comparison pt 3 Conclusion
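The structure the talk relies on, that all itemsets occurring in exactly the same transactions form one equivalence class, represented by its closed pattern (the unique maximal member) and its generators (the minimal members), can be made concrete with a brute-force sketch. The paper mines closed patterns and generators simultaneously in one depth-first pass; this exhaustive enumeration is only for illustration:

```python
from itertools import combinations

def tidset(itemset, transactions):
    """Transaction ids containing the itemset."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)

def equivalence_classes(transactions, items):
    """Group every occurring itemset by its tidset. Within each class the
    closed pattern is the maximal itemset, the generators the minimal ones;
    all members share the same supports, hence the same test statistics."""
    classes = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(sorted(items), r):
            s = frozenset(combo)
            tids = tidset(s, transactions)
            if tids:
                classes.setdefault(tids, []).append(s)
    result = {}
    for tids, sets in classes.items():
        closed = max(sets, key=len)
        minlen = min(len(s) for s in sets)
        result[tids] = (closed, [s for s in sets if len(s) == minlen])
    return result
```

On transactions {a,b}, {a,b,c}, {c}: the itemsets {a}, {b}, {a,b} all occur in exactly transactions 0 and 1, so {a,b} is the closed pattern and {a}, {b} are its generators.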
Local Decomposition for Rare Class Analysis Given its importance, the problem of predicting rare classes in large-scale multi-labeled data sets has attracted great attention in the literature. However, the rare-class problem remains a critical challenge, because no natural way has been developed for handling imbalanced class distributions. This paper thus fills this crucial void by developing a method for Classification using lOcal clusterinG (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class and produce sub-classes with relatively balanced sizes. Then, we apply traditional supervised learning algorithms, such as Support Vector Machines (SVMs), for classification. Indeed, our experimental results on various real-world data sets show that our method produces significantly higher prediction accuracies on rare classes than state-of-the-art methods. Furthermore, we show that COG can also improve the performance of traditional supervised learning algorithms on data sets with balanced class distributions. Local Decomposition for Rare Class Analysis Outline pt 1 Rare Class Analysis Research Motivation - Problems Problem Formulation Our Contributions Outline pt 2 Directions and Objectives of Our Method Algorithm Description An Example Effect of COG&COG-OS on Rare Class pt 1 Effect of COG&COG-OS on Rare Class pt 2 Outline pt 3 Experimental Design The Experimental Setup 1-1: COG on Imbalanced 2-class Data pt 1 1-1: COG on Imbalanced 2-Class Data pt 2 1-2: COG on Imbalanced Multi-Class Data 1-3: COG vs. Resampling 1-4: COG on KDDCUP99 Data 2-1: COG on Balanced Data pt 1 2-1: COG on Balanced Data pt 2 2-2: COG vs. Random Partitioning 3: Discussion on Feature Selection Outline pt 4 Related Work Outline pt 5 Concluding Remarks Thank You!
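COG's core step, clustering within each large class and treating the clusters as sub-classes, can be sketched as follows. Plain k-means stands in for whichever clustering is used; an SVM or another supervised learner would then be trained on the relabeled data. Function names are mine:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny Lloyd's k-means over tuples; returns the final centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[j].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def cog_relabel(X, y, majority, k):
    """COG's local decomposition: split the majority class into k
    cluster-based sub-classes so class sizes become comparable; rare-class
    labels are left untouched."""
    maj_pts = [x for x, c in zip(X, y) if c == majority]
    centers = kmeans(maj_pts, k)
    new_y = []
    for x, c in zip(X, y):
        if c == majority:
            j = min(range(k), key=lambda i: math.dist(x, centers[i]))
            new_y.append(f"{majority}_{j}")
        else:
            new_y.append(c)
    return new_y
```

On a majority class made of two blobs of two points each, the relabeling yields two sub-classes of size two, much closer in size to a one-point rare class.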
Trajectory Pattern Mining The increasing pervasiveness of location-acquisition technologies (GPS, GSM networks, etc.) is leading to the collection of large spatio-temporal datasets and to the opportunity of discovering usable knowledge about movement behaviour, which fosters novel applications and services. In this paper, we move towards this direction and develop an extension of the sequential pattern mining paradigm that analyzes the trajectories of moving objects. We introduce trajectory patterns as concise descriptions of frequent behaviours, in terms of both space (i.e., the regions of space visited during movements) and time (i.e., the duration of movements). In this setting, we provide a general formal statement of the novel mining problem and then study several different instantiations of different complexity. The various approaches are then empirically evaluated over real data and synthetic benchmarks, comparing their strengths and weaknesses. Trajectory Pattern Mining Plan of the Talk Motivations Motivations (2) Motivations (3) Sequential Patterns for Trajectories pt 1 Sequential Patterns for Trajectories pt 2 T-Patterns for Trajectories Continuity Issues (Space & Time) T-Pattern: Approximate Occurrence pt 1 T-Pattern: Approximate Occurrence pt 2 T-Pattern: Approximate Occurrence pt 3 T-Pattern: Approximate Occurrence pt 4 Computing General T-Patterns Simple Forms of T-Pattern Static Neighborhoods From ST-Sequences to Sequences Translating ST-Sequences Static Neighborhoods: Issue Static Neighborhoods pt 1 Static Neighborhoods pt 2 Multi-Step Refinement RoI Step-Wise Dynamic RoI: Example pt 1 Step-Wise Dynamic RoI: Example pt 2 Step-Wise Dynamic RoI: Example pt 3 Step-Wise Dynamic RoI Sample T-Patterns Performances Ongoing Work End of the Talk Trajectory Pattern Mining Fosca Giannotti, Mirco Nanni, Dino Pedreschi, Fabio Pinelli Knowledge Discovery and Delivery Lab (ISTI-CNR & Univ. Pisa) www-kdd.isti.cnr.it 2007 ACM SIGKDD San Jose, CA, August 12-15, 2007
Plan of the talk Motivations T-Patterns: definition T-Patterns: the approach(es) Experiments Conclusions Regions-of-Interest approach RoI extraction Step-wise refinement of RoI KDD 2007 Trajectory Pattern Mining (2/30) Motivations Large diffusion of mobile devices, mobile services and location-based services Motivations (2) Such devices leave digital traces that can be collected to form trajectories describing the mobility behavior of their owner From this large amount of data, high-level information should be extracted, e.g., patterns describing mobility behaviors Sequential patterns for trajectories Question: what should a sequential pattern about moving objects look like? Answer: it should describe their movements in space and in time Temporal information Area A Δt = 5 minutes Δt = 35 minutes Area B Area C Spatial information Trajectories are usually given as spatio-temporal (ST) sequences: <(x1,y1,t1), ..., (xn,yn,tn)> [Figure: a trajectory plotted in space (X, Y) and in space-time (X, Y, Time)] T-Patterns for trajectories A Trajectory Pattern (T-pattern) is a couple (s, α): s = <(x0,y0), ..., (xk,yk)> is a sequence of k+1 locations, α = <α1, ..., αk> are the transition times (annotations) A T-pattern Tp occurs in a trajectory if it contains a sub-sequence S such that: each (xi,yi) in Tp matches a point (xi',yi') in S, and the transition times in Tp are similar to those in S Continuity issues (space & time) The same exact spatial location (x,y) usually never occurs twice yet, close locations essentially represent the same place, so they should match The same exact transition times usually do not occur often same as above Solution: allow approximation a notion of spatial neighborhood a notion of temporal tolerance T-Pattern: approximate occurrence Two points match if one falls within a spatial neighborhood N() of the other Two transition times match if their temporal difference is within a tolerance τ
Example: Computing general T-Patterns T-pattern mining can be mapped to a density estimation problem over R^(3n-1): 2 dimensions for each (x,y) in the pattern (2n), 1 dimension for each transition (n-1), mapping each sub-sequence of n points of each input trajectory to R^(3n-1), drawing an influence area for each point (composition of N()s and τ's) that sums up with all the others Density computed by summing the influence areas Too expensive!!! Simple forms of T-Pattern Spatial neighborhood is a parameter of the definition Some neighborhood functions yield tractable versions of the T-Pattern mining problem "Static neighborhoods": Regions of Interest Static Neighborhoods Regions-of-Interest (RoI) Given a set of Regions of Interest R, define the neighborhood of (x,y) as: N_R(x,y) = A if A ∈ R and (x,y) ∈ A, ∅ otherwise Neighbors = points belonging to the same region Points in no region have no neighbors From ST-sequences to sequences With static neighborhoods N_R(), ST-sequences are replaced by corresponding sequences of regions: a T-pattern (s,α) is contained in an ST-sequence S = <(x1,y1,t1), ..., (xn,yn,tn)> iff the TAS (s',α) is contained in the sequence S', where s' (resp. S') is obtained by mapping each element (x,y) of s (resp. S) to N_R(x,y) TAS = Temporally Annotated Sequence of labels Mining TAS = previous work => efficient algorithms Translating ST-sequences Example: [Figure: trajectory points falling into regions R1-R4] S = <(x1,y1,t1), ..., (x5,y5,t5)> => <(R4,t1), (R3,t3), (R3,t4), (R1,t5)> Static Neighborhoods: issue What if RoI are not known a priori? Solution: define heuristics for automatic RoI extraction from data Wide range of heuristics: Geography-based (e.g., crossroads) Usage-based (e.g., popular places) Mixed (e.g., popular squares) Static Neighborhoods A usage-based heuristic Impose a regular grid over space Find dense cells (i.e., touched by many trajectories)
Coalesce cells into rectangles of bounded size: start from the densest cell, consider any direction that (i) adds a dense cell, (ii) keeps avg density high, (iii) avoids overlap of regions, select the locally best direction Multi-step refinement RoI Static RoI Cells approximate single points, regions group points that are likely to form similar patterns Yet, they should regard only trajectories that support the discovered pattern, not the whole database Towards general T-patterns Check & update dense cells and regions of each pattern against the trajectories that support it Approximation: perform the update as step-wise refinement as patterns grow Step-wise dynamic RoI Example Start computing regions as in the basic RoI approach Regions describe interesting places of everybody Focusing on A, we consider only the subset of relevant trajectories Regions can change (usually shrink/split) They are interesting only for who passes through A Focusing on A -> F (with some transition time), we further restrict the set of trajectories involved The process is repeated as far as possible Extract frequent transition times Compute up-to-date RoI Extend patterns w.r.t. new RoI Focus on patterns found Sample T-patterns (Data source: trucks in Athens, 273 trajectories) Performances Linear scalability w.r.t. number of trajectories Quickly growing cost around (left & right) critical support thresholds Dynamic approach prunes better Ongoing work Application-oriented tests on large, real datasets Study relations with geographic background knowledge Simplification of output transition times, the most complex info for end users Privacy issues Reasoning on trajectories and patterns End of the talk Thanks for your attention Questions and remarks are welcome Have a look at our poster: this evening (Monday, 13th August), board 27 Contact me at: mirco.nanni @ isti.cnr.it Software available: download page and user manuals under construction
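The approximate-occurrence definition (spatial neighborhood N() plus temporal tolerance τ) can be written down directly as a containment check. This sketch uses a fixed-radius Euclidean neighborhood for N() and compares each annotated transition time within tolerance tau; it is a simplification of the talk's general setting, and all names are mine:

```python
import math

def contains_tpattern(traj, pattern, transitions, eps, tau):
    """Check whether a trajectory [(x, y, t), ...] approximately contains a
    T-pattern: locations `pattern` with annotated transition times
    `transitions`, matching points within spatial radius eps and transition
    times within temporal tolerance tau."""
    def search(pi, ti, last_t):
        if pi == len(pattern):
            return True                     # all pattern points matched
        px, py = pattern[pi]
        for j in range(ti, len(traj)):
            x, y, t = traj[j]
            if math.hypot(x - px, y - py) <= eps:
                # first point has no transition-time constraint
                if pi == 0 or abs((t - last_t) - transitions[pi - 1]) <= tau:
                    if search(pi + 1, j + 1, t):
                        return True
        return False
    return search(0, 0, None)
```

A trajectory visiting (0,0) at time 0 and (10,10) at time 8 contains the pattern <(0,0), (10,10)> with annotation 8, but not with annotation 3; a slightly offset pattern point still matches thanks to eps.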
Introduction to the KDD07 Conference KDD 2007 Welcome Technology Capital Pacific Coast Thanks!
Interview with Gregory Piatetsky-Shapiro **Gregory Piatetsky-Shapiro, Ph.D.** is the President of [[http://www.kdnuggets.com/|KDnuggets]], which provides [[http://www.kdnuggets.com/consulting.html|research and consulting]] services in the areas of data mining, knowledge discovery, bioinformatics, and business analytics. Previously, he led data mining and consulting groups at GTE Laboratories, Knowledge Stream Partners, and Xchange. He has extensive experience developing CRM, customer attrition, cross-sell, segmentation and other models for some of the leading banks, insurance companies, and telcos. He also worked on clinical trial, microarray, and proteomic data analysis for several leading biotech and pharmaceutical companies.
A Data Miner's Story - Getting to Know the Grand Challenges KDD-07 Invited Innovation Talk Thanks and Gratitude A Data Miner's Story - Getting to Know the Grand Challenges Overview The data gap - What is Data Mining? Beyond Data Analysis Data Mining and Databases Data Mining Grand Vision The myths... The truths... Current state of Databases Researcher view Practitioner view Business view A Data Miner's Story Example: Cataloging Sky Objects Data Mining Based Solution (1) Data Mining Based Solution (2) A Data Miner's Story Business Results Gap (1) Business Results Gap (2) Sample Customers A Data Miner's Story DMX Group Mission Data Strategy Modeling Process Map Segments to Actions Pragmatic Grand Challenge 3 Cross-Sell / Up-Sell Example Pragmatic Grand Challenge 4 Technical Challenges (1) Technical Challenges (2) Technical Challenges (3) Summary Challenges Data Strategy A Data Miner's Story Yahoo! Data - A league of its own... To be continued... Thank You! & Questions?
Speaker Localization: introduction to system evaluation Speaker Localization: introduction to system evaluation Outline Speaker localization: general issues Example of very near-field propagation T-shaped arrays in the CHIL room at IRST Speaker localization: general issues CHIL: Speaker localization in lecture scenarios Evaluation Criteria Evaluation Criteria: fine and gross errors Evaluation Criteria SLOC error computation in a time interval Evaluation Criteria Evaluation software Evaluation software NIST evaluation '05 of SLOC systems Experimental Results x-coordinate output examples x-coordinate output examples IRST speaker localization and tracking systems System description Global Coherence Field TDOA estimate based on microphone pairs and CSP (GCC-PHAT) analysis IRST T-shaped microphone array Recent results on UKA lectures CSP analysis of a segment with two speakers (from Seminar_2003-11-25_A_4) Conclusions
Extracting Relevant Named Entities for Automated Expense Reimbursement Expense reimbursement is a time-consuming and labor-intensive process across organizations. In this talk, we present an automated expense reimbursement system developed at IBM Almaden Research Center. Our complete solution involves (1) an electronic document management infrastructure that provides multi-channel image capture, transport and storage of paper documents, such as receipts; (2) an unconstrained data mining approach to extracting relevant named entities from unstructured document images; (3) automation of manual auditing procedures using extracted metadata. The main focus of this presentation is our approach to automatically extracting important metadata, once we aggregate documents through such a scalable infrastructure. Extracting relevant named entities robustly from document images with unconstrained layouts and diverse formatting is a fundamental technical challenge to image-based data mining, question answering, and other information retrieval tasks. In many applications that require such capability, applying traditional language modeling techniques to the stream of OCR text does not give satisfactory results due to the absence of linguistic contexts, such as language constructs and punctuation. We present a novel approach for extracting relevant named entities from document images by learning the statistical dependencies between page layout and language features collectively from the sequence of geometrically decomposed regions on a document using a discriminative conditional random fields (CRFs) framework. We integrate this named entity extraction engine into our expense reimbursement solution and evaluate the system performance on large collections of real-world receipt images provided by the IBM World Wide Reimbursement Center.
Cleaning Disguised Missing Data: A Heuristic Approach In some applications such as filling in a customer information form on the web, some missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, which may severely impair the quality of data analysis, for example by causing significant biases and misleading results in hypothesis tests, correlation analysis and regressions. The very limited previous studies on cleaning disguised missing data use outlier mining and distribution anomaly detection. They rely heavily on domain background knowledge in specific applications and may not work well for cases where the disguise values are inliers. To tackle the problem of cleaning disguised missing data, in this paper we first model the distribution of disguised missing data and propose the embedded unbiased sample heuristic. Then, we develop an effective and efficient method to identify the frequently used disguise values which capture the major body of the disguised missing data. Our method does not require any domain background knowledge to find the suspicious disguise values. We report an empirical evaluation using real data sets, which shows that our method is effective: the frequently used disguise values found by our method match the values identified by the domain experts nicely. Our method is also efficient and scalable for processing large data sets.
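The core intuition (a frequently used disguise value, such as a default birth year, occurs far more often than ordinary values in its attribute) can be sketched as a simple frequency-outlier check. The function name, the ratio threshold, and the median baseline below are all illustrative assumptions; this is not the paper's embedded-unbiased-sample heuristic:

```python
from collections import Counter

def suspicious_disguise_values(values, ratio=5.0):
    """Flag values whose frequency is anomalously high relative to the
    median frequency of distinct values in the column.  Illustrative
    sketch only, not the method proposed in the paper."""
    counts = Counter(values)
    freqs = sorted(counts.values())
    median = freqs[len(freqs) // 2]
    return {v for v, c in counts.items() if c >= ratio * max(median, 1)}

# A birth-year column where 1900 is a commonly entered disguise for "unknown":
col = [1900] * 40 + list(range(1950, 1990))
print(suspicious_disguise_values(col))  # -> {1900}
```

Note that such a check flags any dominant value, including legitimately common ones, which is exactly why the paper's heuristic additionally models the distribution of the disguised missing data.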
Distributed Classification in Peer-to-Peer Networks This work studies the problem of distributed classification in peer-to-peer (P2P) networks. While there has been a significant amount of work in distributed classification, most of the existing algorithms are not designed for P2P networks. Indeed, as server-less and router-less systems, P2P networks impose several challenges for distributed classification: (1) it is not practical to have global synchronization in large-scale P2P networks; (2) there are frequent topology changes caused by frequent failure and recovery of peers; and (3) there are frequent on-the-fly data updates on each peer. In this paper, we propose an ensemble paradigm for distributed classification in P2P networks. Under this paradigm, each peer builds its local classifiers on the local data and the results from all local classifiers are then combined by plurality voting. To build local classifiers, we adopt the learning algorithm of pasting bites to generate multiple local classifiers on each peer based on the local data. To combine local results, we propose a general form of Distributed Plurality Voting (DPV) protocol in dynamic P2P networks. This protocol keeps the single-site validity for dynamic networks, and supports the computing modes of both one-shot query and continuous monitoring. We theoretically prove that the condition C0 for sending messages used in DPV0 is locally communication-optimal to achieve the above properties. Finally, experimental results on real-world P2P networks show that: (1) the proposed ensemble paradigm is effective even if there are thousands of local classifiers; (2) in most cases, the DPV0 algorithm is local in the sense that voting is processed using information gathered from a very small vicinity, whose size is independent of the network size; (3) DPV0 is significantly more communication-efficient than existing algorithms for distributed plurality voting.
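The combination step at the heart of the paradigm, plurality voting over the local classifiers' predictions, can be sketched in a few lines. This shows only the voting rule itself; the paper's DPV protocol computes the same result in-network, without ever collecting all votes at a single site, and the function below is an illustrative assumption rather than the protocol:

```python
from collections import Counter

def plurality_vote(local_predictions):
    """Return the class label predicted by the most local classifiers.
    Sketch of the combination rule only; DPV reaches this answer with
    locally communication-optimal messaging instead of central tallying."""
    tally = Counter(local_predictions)
    winner, _ = tally.most_common(1)[0]
    return winner

# Predictions gathered from five peers for one test instance:
peer_votes = ["spam", "ham", "spam", "spam", "ham"]
print(plurality_vote(peer_votes))  # -> spam
```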
Distributed Classification in Peer-to-Peer Networks Overview-part01 Research Motivation Research Motivation (2) Problem Formulation Our Contributions Overview-part02 Building Local Classifiers Overview-part03 Problem Formulation Of DPV An Example Of DPV Comparison Between DPV and Distributed Majority Voting (DMV, by Wolff et al. [TSMC'04]) Comparison Between DPV and DMV (2) Challenges for DPV DPV Protocol Overview The Condition for Sending Messages The Condition for Sending Messages (2) The Correctness of DPV Protocol The Optimality of DPV Protocol The Extension of DPV Protocol Overview-part04 Accuracy of P2P Classification The Performance of DPV Protocol The Performance of DPV Protocol (2) The Performance of DPV Protocol (3) The Performance of DPV Protocol (4) Overview-part05 Related Work - Ensemble Classifiers Related Work - P2P Data Mining Overview-part06 Summary Q. & A.
Detecting Motifs Under Uniform Scaling Time series motifs are approximately repeated patterns found within the data. Such motifs have utility for many data mining algorithms, including rule-discovery, novelty-detection, summarization and clustering. Since the formalization of the problem and the introduction of efficient linear time algorithms, motif discovery has been successfully applied to many domains, including medicine, motion capture, robotics and meteorology. In this work we show that most previous applications of time series motifs have been severely limited by the definition's brittleness to even slight changes of uniform scaling, the speed at which the patterns develop. We introduce a new algorithm that allows discovery of time series motifs with invariance to uniform scaling, and show that it produces objectively superior results in several important domains. Apart from being more general than all other motif discovery algorithms, a further contribution of our work is that it is simpler than previous approaches, in particular we have drastically reduced the number of parameters that need to be specified. Detecting Time Series Motifs Under Uniform Scaling Outline Problem definition Motivation Motivation (cont) Formalization Approach Approach (cont) Experimental evaluation pt 1 Experimental evaluation pt 2 Experimental evaluation (cont) pt 1 Experimental evaluation (cont) pt 2 Conclusion Thank you
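The notion of uniform scaling invariance can be illustrated by comparing one subsequence against rescaled versions of another and keeping the smallest distance. The resampling scheme, the scale grid, and the function names below are illustrative assumptions; the paper's algorithm searches this space far more efficiently than brute force:

```python
def stretch(seq, n):
    """Uniformly rescale seq to length n by nearest-neighbour resampling."""
    m = len(seq)
    return [seq[int(i * m / n)] for i in range(n)]

def us_distance(a, b, scales=(0.8, 0.9, 1.0, 1.1, 1.2)):
    """Minimum Euclidean distance between a and b over a grid of uniform
    scaling factors applied to b.  A brute-force sketch of the idea,
    not the paper's pruned motif search."""
    best = float("inf")
    for s in scales:
        bs = stretch(b, max(1, int(round(len(b) * s))))
        n = min(len(a), len(bs))
        d = sum((x - y) ** 2 for x, y in zip(a[:n], bs[:n])) ** 0.5
        best = min(best, d)
    return best

pattern = [0, 1, 2, 3, 4, 5, 6, 7]
slower = stretch(pattern, 10)  # the same shape, developing more slowly
print(us_distance(pattern, pattern))  # -> 0.0
print(us_distance(pattern, slower) < us_distance(pattern, [5] * 8))  # -> True
```

Under a fixed-scale distance the slowed-down pattern would look dissimilar; allowing uniform scaling recovers the match, which is the brittleness the abstract describes.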
iLink: Search and Routing in Social Networks - Part 1 The growth of Web 2.0 and fundamental theoretical breakthroughs have led to an avalanche of interest in social networks. This paper focuses on the problem of modeling how social networks accomplish tasks through peer production style collaboration. We propose a general interaction model for the underlying social networks and then a specific model (iLink) for social search and message routing. A key contribution here is the development of a general learning framework for making such online peer production systems work at scale. The iLink model has been used to develop a system for FAQ generation in a social network (FAQtory), and experience with its application in the context of a full-scale learning-driven workflow application (CALO) is reported. We also discuss methods of adapting iLink technology for use in military knowledge sharing portals and other message routing systems. Finally, the paper shows the connection of iLink to SQM, a theoretical model for social search that is a generalization of Markov Decision Processes and the popular Pagerank model.
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), split tests, Control/Treatment tests, and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person's Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses.
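The statistical power and sample size considerations mentioned here follow from a standard two-proportion z-test approximation (alpha = 0.05 two-sided, 80% power), which can be sketched as follows. The formula is textbook material, not taken from the paper, and the function name and defaults are illustrative:

```python
def ab_sample_size(p, rel_effect, alpha_z=1.96, power_z=0.84):
    """Approximate users needed per variant to detect a relative change
    rel_effect in a baseline conversion rate p, using the standard
    two-proportion z-test approximation (not a result from the paper)."""
    delta = p * rel_effect                  # absolute difference to detect
    variance = 2 * p * (1 - p)              # variance of the difference of rates
    n = (alpha_z + power_z) ** 2 * variance / delta ** 2
    return int(n) + 1

# Detecting a 5% relative lift on a 2% conversion rate needs roughly
# three hundred thousand users per variant:
print(ab_sample_size(0.02, 0.05))
```

The quadratic dependence on the effect size is why small improvements require very large samples, and why the variance-reduction techniques the paper discusses matter in practice.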
Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.
Relational Data Pre-Processing Techniques for Improved Securities Fraud Detection Commercial datasets are often large, relational, and dynamic. They contain many records of people, places, things, events and their interactions over time. Such datasets are rarely structured appropriately for knowledge discovery, and they often contain variables whose meanings change across different subsets of the data. We describe how these challenges were addressed in a collaborative analysis project undertaken by the University of Massachusetts Amherst and the National Association of Securities Dealers (NASD). We describe several methods for data preprocessing that we applied to transform a large, dynamic, and relational dataset describing nearly the entirety of the U.S. securities industry, and we show how these methods made the dataset suitable for learning statistical relational models. To better utilize social structure, we first applied known consolidation and link formation techniques to associate individuals with branch office locations. In addition, we developed an innovative technique to infer professional associations by exploiting dynamic employment histories. Finally, we applied normalization techniques to create a suitable class label that adjusts for spatial, temporal, and other heterogeneity within the data. We show how these pre-processing techniques combine to provide the necessary foundation for learning high-performing statistical models of fraudulent activity.
An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs Interaction graphs are ubiquitous in many fields such as bioinformatics, sociology and physical sciences. There have been many studies in the literature targeted at studying and mining these graphs. However, almost all of them have studied these graphs from a static point of view. The study of the evolution of these graphs over time can provide tremendous insight on the behavior of entities, communities and the flow of information among them. In this work, we present an event-based characterization of critical behavioral patterns for temporally varying interaction graphs. We use non-overlapping snapshots of interaction graphs and develop a framework for capturing and identifying interesting events from them. We use these events to characterize complex behavioral patterns of individuals and communities over time. We demonstrate the application of behavioral patterns for the purposes of modeling evolution, link prediction and influence maximization. Finally, we present a diffusion model for evolving networks, based on our framework. An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs Motivation-part01 Motivation-part02 Motivation-part03 Motivation-part04 Workflow Temporal Snapshots Clustering Community-based Event Detection Entity-based Event Detection Event Detection Temporal Analysis Behavioral Analysis Case Study 1 : DBLP Collaboration network Case Study 2 : Clinical Trials Network Stability Index Stability for Clinical Trials data Sociability Index Sociability Index for Community Prediction Experimental Results Popularity Index Application of Popularity Index Influence Index Top Influential authors - DBLP dataset Diffusion Models Diffusion Models - Influence Maximization Conclusions Future Directions Thanks!
Domain-Constrained Semi-Supervised Mining of Tracking Models in Sensor Networks Accurate localization of mobile objects is a major research problem in sensor networks and an important data mining application. Specifically, the localization problem is to determine the location of a client device accurately given the radio signal strength values received at the client device from multiple beacon sensors or access points. Conventional data mining and machine learning methods can be applied to solve this problem. However, all of them require large amounts of labeled training data, which can be quite expensive. In this paper, we propose a probabilistic semi-supervised learning approach to reduce the calibration effort and increase the tracking accuracy. Our method is based on semi-supervised conditional random fields which can enhance the learned model from a small set of training data with abundant unlabeled data effectively. To make our method more efficient, we exploit a Generalized EM algorithm coupled with domain constraints. We validate our method through extensive experiments in a real sensor network using Crossbow MICA2 sensors. The results demonstrate the advantages of our methods compared to other state-of-the-art object-tracking algorithms. Domain-Constrained Semi-Supervised Mining of Tracking Models in Sensor Networks Signal-Strength-Based Tracking Application Scenario Calibration - Labeling Data Related Works Conditional Random Fields I Conditional Random Fields II Partially labeled Conditional Random Fields Some Details Test-bed Setup Convergence of Semi-CRF Semi-CRF vs. Baselines Impact of Grid Sizes Conclusion & Future Works
Framework for Classification and Segmentation of Massive Audio Data Streams In recent years, the proliferation of VOIP data has created a number of applications in which it is desirable to perform quick online classification and recognition of massive voice streams. Typically such applications are encountered in real time intelligence and surveillance. In many cases, the data streams can be in compressed format, and the rate of data processing can often run at the rate of Gigabits per second. All known techniques for speaker voice analysis require the use of an offline training phase in which the system is trained with known segments of speech. The state-of-the-art method for text-independent speaker recognition is known as Gaussian Mixture Modeling (GMM), and it requires an iterative Expectation Maximization procedure for training, which cannot be implemented in real time. In this paper, we discuss the details of such an online voice recognition system. For this purpose, we use our micro-clustering algorithms to design concise signatures of the target speakers. One of the surprising and insightful observations from our experiences with such a system is that while it was originally designed only for efficiency, we later discovered that it was also more accurate than the widely used Gaussian Mixture Model (GMM). This was because of the conciseness of the micro-cluster model, which made it less prone to over-training. This is evidence of the fact that it is often possible to get the best of both worlds and do better than complex models both from an efficiency and accuracy perspective.
Searching Speech: A Research Agenda Searching Speech: A Research Agenda Some Grid Use at Maryland Expanding the Search Space Indexable Speech A Web of Speech? The Need for Scalable Solutions Some Spoken Word Collections Indexing Options Supporting "Intellectual Access" Some Technical Challenges Start Time Error Cost Shoah Foundation Collection Interview Excerpt MALACH Languages Observational Studies Relevance Criteria Topicality Test Collection Design Test Collection Design CLEF-2005 CL-SR Track Additional Resources English ASR Warsaw (Poland) Segment duration (s) Keywords vs. Segment duration Nodes descending from parents of leaves Spoken dates in release ASR Current classifier performance: An Example English Topic 5-level Relevance Judgments Comparing Index Terms Searching Manual Transcripts Category Expansion Rethinking the Problem Activation Matrix Training Data: 196,000 Segments Preprocessing Training Data Characteristics of the Problem Modeling Location A Class Model for People Search Some Open Issues Non-English ASR Systems Planning for the Future The CLEF CL-SR Team More Things to Think About Final Thoughts For More Information
LungCAD: A Clinically Approved, Machine Learning System for Lung Cancer Detection We present LungCAD, a computer aided diagnosis (CAD) system that employs a classification algorithm for detecting solid pulmonary nodules from CT thorax studies. We briefly describe some of the machine learning techniques developed to overcome the real world challenges in this medical domain. The most significant hurdle in transitioning from a machine learning research prototype that performs well on an in-house dataset into a clinically deployable system, is the requirement that the CAD system be tested in a clinical trial. We describe the clinical trial in which LungCAD was tested: a large scale multi-reader, multi-case (MRMC) retrospective observational study to evaluate the effect of CAD in clinical practice for detecting solid pulmonary nodules from CT thorax studies. The clinical trial demonstrates that every radiologist that participated in the trial had a significantly greater accuracy with LungCAD, both for detecting nodules and identifying potentially actionable nodules; this, along with other findings from the trial, has resulted in FDA approval for LungCAD in late 2006.
Truth Discovery with Multiple Conflicting Information Providers on the Web The world-wide web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the web. Moreover, different web sites often provide conflicting information on a subject, such as different specifications for the same product. In this paper we propose a new problem called Veracity, i.e., conformity to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various web sites. We design a general framework for the Veracity problem, and invent an algorithm called TruthFinder, which utilizes the relationships between web sites and their information, i.e., a web site is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy web sites. Our experiments show that TruthFinder successfully finds true facts among conflicting information, and identifies trustworthy web sites better than the popular search engines. Truth Discovery with Multiple Conflicting Information Providers Trustworthiness of the Web Conflicting Information on the Web Our Problem Setting Basic Heuristics for Problem Solving Overview of Our Method Analogy to Authority-Hub Analysis An Example Computation Model (1): t(w) and s(f) Computation Model (2): Fact Influence Computation Model (3): Influence Function Experiments: Finding Truth of Facts Experiments: Trustable Info Providers Conclusions Thank you!
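The mutually reinforcing heuristic described above (trustworthy sites provide true facts; facts from trustworthy sites are likely true) can be sketched as a fixed-point iteration. This is a deliberately simplified version; the actual TruthFinder additionally dampens the scores and models influence between similar facts, and the data below is invented for illustration:

```python
def truthfinder(claims, iters=10):
    """Simplified TruthFinder-style iteration.  claims maps each website
    to the set of facts it asserts.  Fact confidence = 1 - prod(1 - trust)
    over its providers; site trust = mean confidence of its facts."""
    sites = list(claims)
    facts = {f for fs in claims.values() for f in fs}
    trust = {w: 0.5 for w in sites}          # uninformative initial trust
    conf = {}
    for _ in range(iters):
        conf = {}
        for f in facts:
            p = 1.0
            for w in sites:
                if f in claims[w]:
                    p *= (1 - trust[w])      # independent-provider assumption
            conf[f] = 1 - p
        trust = {w: sum(conf[f] for f in claims[w]) / len(claims[w])
                 for w in sites}
    return trust, conf

# Two sites agree on one value of a product attribute; a third conflicts:
claims = {"siteA": {"x=1"}, "siteB": {"x=1"}, "siteC": {"x=2"}}
trust, conf = truthfinder(claims)
print(conf["x=1"] > conf["x=2"])  # -> True
```

Even from a flat prior, corroboration drives the shared fact's confidence (and its providers' trust) above the lone conflicting claim, mirroring the Authority-Hub analogy in the talk outline.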
Detecting Changes in Large Data Sets of Payments Cards Data: A Case Study An important problem in data mining is detecting changes in large data sets. Although there are a variety of change detection algorithms that have been developed, in practice it can be a problem to scale these algorithms to large data sets due to the heterogeneity of the data. In this paper, we describe a case study involving payment card data in which we built and monitored a separate change detection model for each cell in a multi-dimensional data cube. We describe a system that has been in operation for the past two years that builds and monitors over 15,000 separate baseline models and the process that is used for generating and investigating alerts using these baselines. Data Quality Models for High Volume Transaction Streams: A Case Study The Problem: Detect Significant Changes in Visa's Payments Network Visa Payment Network The Challenge: Payments Data is Highly Heterogeneous Observe: If Data Were Homogeneous, Could Use Change Detection Model Key Idea: Build 10^4+ Models, One for Each Cell in Data Cube Augustus Some Results to Date Summary
Event Summarization for System Management In system management applications, an overwhelming amount of data is generated and collected in the form of temporal events. While mining temporal event data to discover interesting and frequent patterns has attracted rapidly increasing research effort, users of the applications are overwhelmed by the mining results. The extracted patterns are generally voluminous and hard to interpret; they may lack emphasis and appear intricate and meaningless to non-experts, and sometimes even to domain experts. While traditional research efforts focus on finding interesting patterns, in this paper we take a novel approach called event summarization towards the understanding of the seemingly chaotic temporal data. Event summarization aims at providing a concise interpretation of the seemingly chaotic data, so that domain experts may take actions upon the summarized models. Event summarization decomposes the temporal information into many independent subsets and finds well-fitted models to describe each subset. Event Summarization for System Management Introduction A Motivating Example Steps for Event Summarization Preprocess Log Data and Generate events Discover Temporal Correlation between Events (Dependency) Rank Dependencies Event Relationship Networks (ERNs) Derive Action Rules from Event Summary A Case Study Decomposition Process in the Case Study ERN in the Case Study Thank You !
Machine Learning for Stock Selection In this paper, we propose a new method called Prototype Ranking (PR) designed for the stock selection problem. PR takes into account the huge size of real-world stock data and applies a modified competitive learning technique to predict the ranks of stocks. The primary target of PR is to select the top performing stocks among many ordinary stocks. PR is designed to perform the learning and testing in a noisy stocks sample set where the top performing stocks are usually the minority. The performance of PR is evaluated by a trading simulation of the real stock data. Each week the stocks with the highest predicted ranks are chosen to construct a portfolio. In the period of 1978-2004, PR's portfolio earns a much higher average return as well as a higher risk-adjusted return than Cooper's method, which shows that the PR method leads to a clear profit improvement. Machine Learning for Stock Selection Robert J. Yan Charles X. Ling University of Western Ontario, Canada {jyan, cling}@csd.uwo.ca Outline-part01 Introduction Outline-part02 Stock Selection Task Outline-part03 Prototype Ranking Step 1: Finding Prototypes Finding prototypes using competitive learning Modifications for Stock data Step 2: Predicting Test Data Outline-part04 Data Testing PR Results of Experiment 1 Experiment 2: Comparison to Cooper's method Results of Experiment 2 Outline-part05 Conclusions
IMDS: Intelligent Malware Detection System The proliferation of malware has presented a serious threat to the security of computer systems. Traditional signature-based antivirus systems fail to detect polymorphic and new, previously unseen malicious executables. In this paper, resting on the analysis of Windows API execution sequences called by PE files, we develop the Intelligent Malware Detection System (IMDS) using Objective Oriented Association (OOA) mining based classification. IMDS is an integrated system consisting of three major modules: PE parser, OOA rule generator, and rule based classifier. An OOA Fast FPGrowth algorithm is adapted to efficiently generate OOA rules for classification. A comprehensive experimental study on a large collection of PE files obtained from the anti-virus laboratory of King Soft Corporation is performed to compare various malware detection approaches. Promising experimental results demonstrate that the accuracy and efficiency of our IMDS system outperform popular anti-virus software such as Norton AntiVirus and McAfee VirusScan, as well as previous data mining based detection systems which employed Naive Bayes, Support Vector Machine (SVM) and Decision Tree techniques. IMDS: Intelligent Malware Detection System Motivation Data Collection and Preprocessing System Architecture Objective Oriented Association Mining Experimental results (1) Experimental results (2) Experimental results (3) Conclusion Selected References
iLink: Search and Routing in Social Networks - Part 2 The growth of Web 2.0 and fundamental theoretical breakthroughs have led to an avalanche of interest in social networks. This paper focuses on the problem of modeling how social networks accomplish tasks through peer production style collaboration. We propose a general interaction model for the underlying social networks and then a specific model (iLink) for social search and message routing. A key contribution here is the development of a general learning framework for making such online peer production systems work at scale. The iLink model has been used to develop a system for FAQ generation in a social network (FAQtory), and experience with its application in the context of a full-scale learning-driven workflow application (CALO) is reported. We also discuss methods of adapting iLink technology for use in military knowledge sharing portals and other message routing systems. Finally, the paper shows the connection of iLink to SQM, a theoretical model for social search that is a generalization of Markov Decision Processes and the popular Pagerank model.
Interview with Jon Kleinberg This interview was made at the KDD 2007 Conference, where we caught up with Jon Kleinberg, who was one of the **invited speakers**. We discussed **his popularity at Cornell University** and how students affectionately call him "Rebel King", how he **published his recent textbook on algorithms** with coauthor Eva Tardos, and his **future plans**...
Interview with Pavel Berkhin What does Pavel Berkhin have to say about his beginnings as a researcher, his first and last algorithm, the industry, and the KDD Conference in general, now that he is this year's chairman of the event?
Hierarchical Mixture Models: a Probabilistic Analysis Mixture models form one of the most widely used classes of generative models for describing structured and clustered data. In this paper we develop a new approach for the analysis of hierarchical mixture models. More specifically, using a text clustering problem as a motivation, we describe a natural generative process that creates a hierarchical mixture model for the data. In this process, an adversary starts with an arbitrary base distribution and then builds a topic hierarchy via some evolutionary process, where he controls the parameters of the process. We prove that under our assumptions, given a subset of topics that represent generalizations of one another (such as baseball - sports - base), for any document which was produced via some topic in this hierarchy, we can efficiently determine the most specialized topic in this subset to which it still belongs. The quality of the classification is independent of the total number of topics in the hierarchy and our algorithm does not need to know the total number of topics in advance. Our approach also yields an algorithm for clustering and unsupervised topical tree reconstruction. We validate our model by showing that properties predicted by our theoretical results carry over to real data. We then apply our clustering algorithm to two different datasets: (i) "20 newsgroups" [19] and (ii) a snapshot of abstracts of arXiv [2] (15 categories, 240,000 abstracts). In both cases our algorithm performs extremely well. Hierarchical Mixture Models Mixture models: quick overview Topical Hierarchy Our results Generative model for the topical hierarchy Generative model for the hierarchy Distributions satisfy a few conditions Related work: reconstructing hierarchies Our results Classification along the path in the tree Algorithm when a path in the hierarchy is known Why does it work? Part 1: back to plain mixtures Generalized pseudoinverse Part II: still, why does it work?
Reconstructing hierarchy from unlabeled data Overview of the rest of the talk Experiments: abstracts from ArXiV Experiments: ArXiV Newsgroup 20 Experiments: Newsgroups Conclusions Pseudoinverse, independence coefficient and such Thanks! An example
The "FAME" Interactive Space This paper describes the "FAME" multi-modal demonstrator, which integrates multiple communication modes (vision, speech and object manipulation) by combining the physical and virtual worlds to provide support for multi-cultural or multi-lingual communication and problem solving. The major challenges are automatic perception of human actions and understanding of dialogs between people from different cultural or linguistic backgrounds. The system acts as an information butler, which demonstrates context awareness using computer vision, speech and dialog modeling. The integrated computer-enhanced human-to-human communication has been publicly demonstrated at the FORUM2004 in Barcelona and at IST2004 in The Hague. Specifically, the "Interactive Space" described features an "Augmented Table" for multi-cultural interaction, which allows several users at the same time to perform multi-modal, cross-lingual document retrieval of audio-visual documents previously recorded by an "Intelligent Cameraman" during a week-long seminar. The "FAME" Interactive Space Project Goals The "FAME" Multi-modal Room Multi-modal Demonstrator FAME Database Video Acquisition System Fast Lecture Transcription Multi-surface interaction Tracking on Augmented Table Context Management FAME Topic Spotting Results of User Study User Ratings Conclusions from User Study Thank You!
Information distance from a question to an answer We provide three key missing pieces of a general theory of information distance [3, 23, 24]. We take bold steps in formulating a revised theory to avoid some pitfalls in practical applications. The new theory is then used to construct a question answering system. Extensive experiments are conducted to justify the new theory.
Statistical Change Detection for Multi-Dimensional Data This paper deals with detecting change of distribution in multi-dimensional data sets. For a given baseline data set and a set of newly observed data points, we define a statistical test called the density test for deciding if the observed data points are sampled from the underlying distribution that produced the baseline data set. We define a test statistic that is strictly distribution-free under the null hypothesis. Our experimental results show that the density test has substantially more power than the two existing methods for multi-dimensional change detection. Statistical Change Detection for Multi-Dimensional Data Motivation Example: Antibiotic Resistance Pattern Problem Definition Related Work Hypothesis Test Framework Density Test High-Level Overview Step 1: Kernel Density Estimate (KDE) Choose Bandwidth by MLE/EM Effectiveness of EM Bandwidth Step 2: Define and Calculate Step 3: Derive the Null Distribution Estimating Step 4: Calculate Critical Value and Make a Decision Density Test - All 4 Steps Run Density Test in 2 Directions False Positive False Negative on Low-D Group False Negative on High-D Group Scalability Conclusion Thanks
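The flavor of a KDE-based density test can be sketched in one dimension: fit a kernel density estimate to the baseline, score the observed sample by its mean log-density, and compare that score against a null distribution. The resampling-based null below is an illustrative assumption; the paper derives a strictly distribution-free statistic instead, and all names and parameters here are invented for the sketch:

```python
import math
import random

def kde_loglik(baseline, points, bw=0.5):
    """Mean log-density of points under a Gaussian KDE fit to baseline."""
    n = len(baseline)
    const = 1.0 / (n * bw * math.sqrt(2 * math.pi))
    total = 0.0
    for x in points:
        dens = const * sum(math.exp(-((x - b) / bw) ** 2 / 2) for b in baseline)
        total += math.log(dens + 1e-300)   # guard against log(0)
    return total / len(points)

def density_test(baseline, observed, trials=100, seed=0):
    """Fraction of baseline resamples scoring as low as the observed sample:
    a small value suggests the observed points were not drawn from the
    baseline distribution.  Sketch only; not the paper's statistic."""
    rng = random.Random(seed)
    stat = kde_loglik(baseline, observed)
    null = [kde_loglik(baseline, rng.sample(baseline, len(observed)))
            for _ in range(trials)]
    return sum(s <= stat for s in null) / trials

random.seed(1)
base = [random.gauss(0, 1) for _ in range(200)]
shifted = [random.gauss(3, 1) for _ in range(30)]
print(density_test(base, shifted) < 0.05)  # -> True
```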
Learning the Kernel Matrix in Discriminant Analysis via Quadratically Constrained Quadratic Programming The kernel function plays a central role in kernel methods. In this paper, we consider the automated learning of the kernel matrix over a convex combination of pre-specified kernel matrices in Regularized Kernel Discriminant Analysis (RKDA), which performs linear discriminant analysis in the feature space via the kernel trick. Previous studies have shown that this kernel learning problem can be formulated as a semidefinite program (SDP), which is however computationally expensive, even with the recent advances in interior point methods. Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a Quadratically Constrained Quadratic Programming (QCQP) formulation for the kernel learning problem, which can be solved more efficiently than SDP. While most existing work on kernel learning deals with binary-class problems only, we show that our QCQP formulation can be extended naturally to the multi-class case. Experimental results on both binary-class and multi-class benchmark data sets show the efficacy of the proposed QCQP formulations.
Scalable Look-Ahead Linear Regression Trees The motivation behind Look-ahead Linear Regression Trees (LLRT) is that, among the methods proposed to date, none offers a scalable approach to exhaustively evaluate all possible models in the leaf nodes in order to obtain an optimal split. Using several optimizations, LLRT is able to generate and evaluate thousands of linear regression models per second. This allows for a near-exhaustive evaluation of all possible splits in a node, based on the quality of fit of linear regression models in the resulting branches. We decompose the calculation of the Residual Sum of Squares in such a way that a large part of it is pre-computed. The resulting method is highly scalable. We observe that it obtains high predictive accuracy for problems with strong mutual dependencies between attributes. We report on experiments with two simulated and seven real data sets.
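One way to make near-exhaustive split evaluation affordable, in the spirit of the RSS decomposition described above, is to keep running sufficient statistics (X'X, X'y, y'y) for each side of a candidate split and update them incrementally as points move across the threshold, so each candidate costs O(d²) instead of a full refit. This is a minimal sketch of that idea, not LLRT itself; the helper names and the tiny ridge term (for degenerate sides) are my assumptions.

```python
import numpy as np

def rss_from_stats(XtX, Xty, yty, ridge=1e-8):
    """Residual sum of squares of the least-squares fit, computed from
    sufficient statistics only: RSS = y'y - b'X'y, b = (X'X)^{-1} X'y."""
    d = XtX.shape[0]
    b = np.linalg.solve(XtX + ridge * np.eye(d), Xty)
    return yty - b @ Xty

def best_split(X, y, feature):
    """Scan every split on one feature, updating the left/right
    sufficient statistics incrementally instead of refitting."""
    order = np.argsort(X[:, feature])
    Xs = np.c_[np.ones(len(y)), X[order]]      # add intercept column
    ys = y[order]
    d = Xs.shape[1]
    XtX_r, Xty_r, yty_r = Xs.T @ Xs, Xs.T @ ys, ys @ ys   # all on the right
    XtX_l, Xty_l, yty_l = np.zeros((d, d)), np.zeros(d), 0.0
    best = (np.inf, None)
    for i in range(len(ys) - 1):
        xi, yi = Xs[i], ys[i]                  # move point i to the left
        XtX_l += np.outer(xi, xi); Xty_l += yi * xi; yty_l += yi * yi
        XtX_r -= np.outer(xi, xi); Xty_r -= yi * xi; yty_r -= yi * yi
        total = (rss_from_stats(XtX_l, Xty_l, yty_l)
                 + rss_from_stats(XtX_r, Xty_r, yty_r))
        if total < best[0]:
            best = (total, X[order[i], feature])
    return best
```

On piecewise-linear data the scan recovers the true breakpoint, since only that split lets both branch regressions fit exactly.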
Estimating Rates of Rare Events at Multiple Resolutions We consider the problem of estimating occurrence rates of rare events for extremely sparse data, using pre-existing hierarchies to perform inference at multiple resolutions. In particular, we focus on the problem of estimating click rates for (webpage, advertisement) pairs (called impressions) where both the pages and the ads are classified into hierarchies that capture broad contextual information at different levels of granularity. Typically the click rates are low and the coverage of the hierarchies is sparse. To overcome these difficulties we devise a sampling method whereby we analyze a specially chosen sample of pages in the training set, and then estimate click rates using a two-stage model. The first stage imputes the number of (webpage, ad) pairs at all resolutions of the hierarchy to adjust for the sampling bias. The second stage estimates click rates at all resolutions after incorporating correlations among sibling nodes through a tree-structured Markov model. Both models are scalable and suited to large scale data mining applications. On a real-world dataset consisting of 1/2 billion impressions, we demonstrate that even with 95% negative (non-clicked) events in the training set, our method can effectively discriminate extremely rare events in terms of their click propensity. Estimating Rates of Rare Events at Multiple Resolutions Estimation in the "Tail" pt 1 Estimation in the "Tail" pt 2 System Overview Sampling of Webpages Imputation of Impression Volume pt 1 Imputation of Impression Volume pt 2 Imputation of Impression Volume pt 3 Imputing Xij Imputation: Summary System Overview Rare Rate Modeling pt 1 Rare Rate Modeling pt 2 Rare Rate Modeling pt 3 Experiments pt 1 Experiments pt 2 Experiments pt 3 Experiments pt 4 Experiments pt 5 Related Work Conclusions Experiments pt 5 (a)
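The paper's two-stage model is not specified in this summary; the sketch below shows only the generic multi-resolution idea it builds on — a node with little data borrows strength from its ancestors, with its rate shrunk toward the parent's smoothed rate. The `strength` prior weight and the dictionary layout are illustrative assumptions, not the paper's tree-structured Markov model.

```python
def smoothed_rates(clicks, views, parent, strength=100.0, global_rate=0.001):
    """Hierarchical rate smoothing sketch.

    `parent` maps node -> parent node (roots map to None).  A node's
    estimate is (clicks + strength * parent_rate) / (views + strength),
    so sparse nodes fall back to their parent's rate while data-rich
    nodes stay close to their empirical rate.
    """
    rates = {}

    def rate(node):
        if node in rates:
            return rates[node]
        prior = global_rate if parent[node] is None else rate(parent[node])
        c, v = clicks.get(node, 0), views.get(node, 0)
        rates[node] = (c + strength * prior) / (v + strength)
        return rates[node]

    for n in parent:
        rate(n)
    return rates
```

A node with zero impressions inherits its parent's rate exactly; a node with thousands of impressions is dominated by its own click data.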
Predictive Discrete Latent Factor Models for Large Scale Dyadic Data We propose a novel statistical method to predict large scale dyadic response variables in the presence of covariate information. Our approach simultaneously incorporates the effect of covariates and estimates local structure that is induced by interactions among the dyads through a discrete latent factor model. The discovered latent factors provide a predictive model that is both accurate and interpretable. We illustrate our method by working in a framework of generalized linear models, which include commonly used regression techniques like linear regression, logistic regression and Poisson regression as special cases. We also provide scalable generalized EM-based algorithms for model fitting using both "hard" and "soft" cluster assignments. We demonstrate the generality and efficacy of our approach through large scale simulation studies and analysis of datasets obtained from certain real-world movie recommendation and internet advertising applications. Predictive Discrete Latent Factor Models Internet Advertising: Billion Dollar Industry Recommender Systems Click Fraud Data Problem Definition Agenda Existing Approaches Non-Parametric Function Estimation Random Effects Model Generalized Linear Models Unsupervised Approach Clustering Animation PDLF: High Level Overview Model Fitting Algorithm: Generalized EM EM Algorithm Hard Clustering Simulation Study on Movie Lens Logistic Regression on Movie Lens Experiments: Click Count Data Co-Cluster Interactions: Plain Co-Clustering Co-Cluster Interactions: PDLF Prediction Results Summary Ongoing Work Prediction Results (a)
A Scalable Modular Convex Solver for Regularized Risk Minimization A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a highly scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as ℓ1 and ℓ2 penalties. At present, our solver implements 20 different estimation problems, can be easily extended, scales to millions of observations, and is up to 10 times faster than specialized solvers for many applications. The open source code is freely available as part of the ELEFANT toolbox. A Scalable Modular Convex Solver for Regularized Risk Minimization (BMRM) Statistical Machine Learning Program, NICTA RSISE, Australian National University (Joint work with Quoc V. Le, Alex Smola and S.V.N. Vishwanathan) Regularized Risk Minimization Many machine learning problems can be cast in the form: minimize J(w) := λΩ(w) + R(w), where R(w) := (1/m) Σ_{i=1..m} l(x_i, y_i, w). Here w is the weight vector, {(x_i, y_i)}_{i=1..m} is the training data, l(x, y, w) is a convex and non-negative loss function, Ω(w) is a convex and non-negative regularizer, and λ is the regularization constant. Examples (method, risk R(w), regularizer λΩ(w)): linear SVMs use (1/m) Σ_i max{0, 1 − y_i⟨w, x_i⟩}; logistic regression uses (1/m) Σ_i log(1 + exp(−y_i⟨w, x_i⟩)); ε-insensitive regression uses (1/m) Σ_i max{0, |y_i − ⟨w, x_i⟩| − ε}; each with regularizer (λ/2)‖w‖². How to solve these problems?
Newton and quasi-Newton Methods: applicable when the (convex) function is differentiable. Cutting Plane based Methods: applicable when the (convex) function is merely continuous, and come with a meaningful termination criterion. Cutting Plane Methods (CPM) Given: a convex (and non-negative) function R(w). Idea: a first-order Taylor approximation lower-bounds R(w). [Slides: the red curve is a convex non-negative function; the black dashed line is the first-order Taylor approximation at w = 0, the green dot the minimum of the lower bound, and the blue dashed line the current approximation gap.] Choon Hui Teo Scalable Modular Solver for Regularised Risk Minimization Fact: more approximations give a better and better lower bound. Summary: iteratively improve the piecewise-linear lower bound and minimize it: min_{w,ξ} ξ subject to ⟨∂_w R(w_i), w − w_i⟩ + R(w_i) ≤ ξ for all i. Note: take any subgradient when R(w_i) is not differentiable. Bundle Methods (BM) are basically CPM stabilized with a (Moreau-Yosida) regularizer, i.e., min_{w,ξ} (λ/2)‖w − ŵ‖² + ξ subject to ⟨∂_w R(w_i), w − w_i⟩ + R(w_i) ≤ ξ for all i, where ŵ is the current minimizer. Point: prevent the new minimizer from moving "too" far away from the current one. A Variant of BM: our (machine learning) problem comes with a regularizer Ω(w), giving min_{w,ξ} λΩ(w) + ξ subject to the same constraints. Examples of Ω(w): Ω(w) = ‖w‖₁ gives a Linear Program; Ω(w) = ‖w‖² gives a Quadratic Program. Rate of Convergence. Question: how fast does the approximate minimizer ŵ approach the actual minimizer w*? Answer: O(1/ε) iterations, where ε := R(ŵ) − R(w*).
ε is the meaningful termination criterion. Architecture of BMRM. For serial computation: a Data module manages the dataset, a Loss module computes the loss and (sub)gradient, and a Solver module solves the optimization problem (Ω(w)-specific); the modules are loosely coupled. Architecture of BMRM (cont'd). For parallel/distributed computation with a decomposable loss function: split the dataset into sub-datasets, let each node compute the loss w.r.t. its sub-dataset, and let a multiplexer aggregate the losses and (sub)gradients and broadcast the new w. Experiment 1: training time comparison. Task: binary classification. Solvers: our method BMRM and SVMPERF [Joachims, KDD'06] (in particular, norm and soft-margin loss). Datasets: kdd99 (m=4898431, dim.=127, den.=12.86%) and reuters-c11 (m=23149, dim.=47236, den.=0.16%). Setting: ε = 1e-5, λ ∈ {1, 0.3, 0.1, ..., 3e-6}. Result: BMRM is comparable to SVMPERF. Figure: log-log plot of linear SVM training time vs. regularization constant λ on kdd99. Figure: log-log plot of linear SVM training time vs. regularization constant λ on reuters-c11. Experiment 2: convergence rate. Task: binary classification. Solver: BMRM. Datasets: kdd99 and reuters-c11. Setting: ε = 1e-5, λ = 3e-6. Result: BMRM converged in O(1/ε) steps. Figure: semilog-y plot of approximation gap vs. iterations. Experiment 3: parallelization of BMRM. Task: ranking. Methods: Normalized Discounted Cumulative Gain (NDCG) and ordinal regression. Dataset: MSN. Setting: ε = 1e-5, λ ∈ {10, 100}, number of computers n ∈ {1, 2, 4, ..., 512}. Result: BMRM runtime ∝ 1/n. Figure: plot of NDCG training time vs. the inverse number of computers. Figure: plot of ordinal regression training time vs.
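The cutting-plane loop described in these slides can be condensed into a toy BMRM-style solver: each iteration adds a first-order cut to a piecewise-linear lower bound on the empirical risk, then minimizes regularizer plus lower bound. The sketch below uses the hinge loss, an ℓ2 regularizer, and a generic SLSQP call for the small inner problem — a stand-in for the specialized LP/QP solver modules the architecture describes, not the BMRM implementation itself.

```python
import numpy as np
from scipy.optimize import minimize

def bmrm(X, y, lam=0.1, iters=10):
    """BMRM-style cutting-plane loop (toy sketch).

    Minimizes J(w) = lam/2 ||w||^2 + R(w), where R is the average
    hinge loss.  Each iteration adds the cut
    R(v) >= <g_i, v - w_i> + R(w_i) and minimizes the regularizer
    plus the current piecewise-linear lower bound on R.
    """
    m, d = X.shape
    w = np.zeros(d)
    cuts = []                                    # (subgradient a_i, offset b_i)
    for _ in range(iters):
        margins = y * (X @ w)
        R = np.mean(np.maximum(0.0, 1.0 - margins))
        g = -(X * y[:, None])[margins < 1].sum(axis=0) / m  # subgradient of R
        cuts.append((g, R - g @ w))              # cut evaluates as a.v + b

        def obj(z):                              # z = [w, xi]
            return 0.5 * lam * z[:d] @ z[:d] + z[d]
        cons = [{'type': 'ineq',
                 'fun': (lambda z, a=a, b=b: z[d] - (a @ z[:d] + b))}
                for a, b in cuts]
        cons.append({'type': 'ineq', 'fun': lambda z: z[d]})  # xi >= 0
        w = minimize(obj, np.r_[w, 0.0], constraints=cons).x[:d]
    return w
```

On well-separated data the lower bound tightens within a handful of cuts and the returned weight vector separates the training set.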
the inverse number of computers. Choon Hui Teo Scalable Modular Solver for Regularised Risk Minimization Conclusion: the unconstrained formulation leads to an easy, modular and scalable solver design; "job specialization": optimization, loss, parallelization scheme. Thank you! (Poster 23, Tuesday 14th August 07)
Introduction to the Panel Since the 1989 workshop on knowledge discovery in databases, the field has seen sustained growth and interest and has attained significant maturity. The main objectives of this panel will be to reflect on the successes and failures in the field of data mining over the last eighteen years and to examine what insights we can take with us as we move forward. Data mining at the crossroads Introduction Success stories Mistakes & failures Future outlook Panelists SIGKDD 2007 Panel Data Mining at the Crossroads: Successes, Failures and Learning From Them Moderator: Srinivasan Parthasarathy (OSU) Panelists: Pavel Berkhin (Yahoo!), Christos Faloutsos (CMU), Jiawei Han (UIUC), Haym Hirsh (NSF and Rutgers), John Elder (Elder Research) Introduction • KDD as a field is concerned with extracting actionable and interpretable knowledge from data as efficiently as possible. • Panel objectives are to reflect on: how successful have we been? What have been the mis-steps and failures? And what can we learn from them as we turn toward the future? Success Stories • What have been the major successes and breakthroughs in the field? • Reflect on the above in the context of: progress in core KDD sub-fields; breakthrough algorithms; impact on real applications and deployment; educational and organizational successes; other examples. Mistakes & Failures • What have been our critical mistakes or failures? • Failure to progress in line with an expectation, e.g. the Illusion of Progress in Classification [Hand '06] • Failure to adapt to application and end-user requirements, e.g. interpretability, interestingness • Classic mistakes, e.g. use of non-representative training data • Other issues, e.g. reproducibility of results, benchmarks. Future Outlook • How can we learn from our successes and ensure sustained growth in the field? • How can we learn from our failures, and what can we do to avoid repetitions?
Where does emerging technology play a role in all this? Panelists: Pavel Berkhin (Yahoo!), Christos Faloutsos (CMU), Jiawei Han (UIUC), Haym Hirsh (NSF and Rutgers), John Elder (Elder Research)
Successes, Failures and Learning From Them At an abstract level, the theme of the field is concerned with extracting actionable and interpretable knowledge from data in as efficient a manner as possible. The primary purpose of this panel, in the context of this underlying theme, is to consider the following questions. What have been the major successes and breakthroughs that we as a field can point to with pride? What have been the critical mistakes or mis-steps that have been taken along the way? And finally, what can we hope to learn from both our successes and mistakes, and how can this knowledge be used to determine how to focus our efforts in the future? Mining at the crossroads Applications Enabling Technologies The I.I.D. assumption is not realistic Feature construction is still an art Off-the-shelf (robust) clustering Industry-strength DM environment Data Mining Operations Thanks Pavel Berkhin Mining at the Crossroads: Successes, Failures and Learning From Them Mining at the Crossroads: Successes Applications • Pre-data-mining apps: speech recognition, medical diagnostics, financial time series analysis • Behavioral Targeting: Advertising.com, Yahoo! • Recommendation Systems: Amazon, Netflix • Fraud Detection / Risk Modeling: Fair Isaac • Search Relevance: Google, MSN Enabling Technologies • Data Mining / Machine Learning: constrained and stabilized regression, gradient boosting, fast SVMs, graphical and probabilistic modeling, collaborative filtering • Information Retrieval: Web graph construction, information extraction from unstructured data • Grid Computing Mining at the Crossroads: Challenges and Gaps The I.I.D. assumption is not realistic • Medical Data: patient relations, family genes • Web Graphs: hyperlinks • Social Networks: friendship / co-authorship graphs • News Events: streams, news updates, multiple sources • Commercial products: manufacturers, distributors, transporters, agents, retailers, etc.
Research addressing non-i.i.d. data • Conditional Random Fields (Lafferty, McCallum, Pereira) • Relational Markov Networks (Taskar, Abbeel, Wong, Koller) Feature Construction is still an Art • Incorporating domain knowledge • Integrating time dependency: weighted decay of values over time • Processing different feature types: text, image, audio / video streams • Capturing language semantics • Processing semi-structured / unstructured data Off-the-shelf (Robust) Clustering • Handling categorical and numeric features • Practical constraints: non-overlapping segments, interpretability • Even k-means requires attribute selection and scaling, case scaling, identifying the number of clusters • Exceptions: graph clustering and spatial clustering Industry-Strength DM Environment • Robust / highly scalable platform: handle wide and sparse data, efficient data transformations • Rapid model building: rich library of algorithms • Quick evaluation: key metrics for model selection • Build thousands of models with little or no human intervention Data Mining Operations • Transition from R&D to Production • Online evaluation: A/B testing framework, model selection criteria • Online scoring: cost of deployment, complexity of computed features, graceful degradation (missing features) • Model Deployment: smooth deployment of thousands of models, careful monitoring and tracking of changes, effective roll-back of models • Model Retraining: when and how to retrain Thanks • Rajesh Parekh • Padhraic Smyth • John Canny
Successes, Failures and Learning From Them Over the last eighteen years, the field of knowledge discovery and data mining has matured considerably. Although the field has evolved as a result of synergistic co-operation among researchers in databases, artificial intelligence, statistics and systems, it has maintained its own identity. From a single workshop in 1989, the field can now lay claim to at least 5 major conferences and numerous symposia devoted to its central theme. Data mining research 18 years of KDD research Major achievements Scalable mining methods Fast expansion of applications Major lessons Look into the near future Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign August 20, 2007
EC update: Information on IST Call 5 and FP7 Multi-modal Interfaces Mats Ljungqvist INFSO E.1 – Interfaces mats.ljungqvist@cec.eu.int "The focus of IST in FP6 is on the future generation of technologies in which computers and networks will b Our Mission What do we mean? Focus so far? FP6 portfolio The Facts The story so far! Main challenges An R&D agenda Future orientations? Integrated Projects Instruments The Evaluation Criteria We welcome? Remember? Further information
Successes, Failures and Learning From Them Data mining: Successes and failures Failures Successes Next steps IMHO Scalability Example DM for Tera- and Peta-bytes Conclusion Data Mining: Successes and Failures Christos Faloutsos CMU Failures • non-technical: only "Public Relations" • we are too modest to "brag": SAS, SPSS, +: (SAS on TV); [Heckerman, KDD '04]: better cancer data analysis • PLUS: companies often keep successes silent, to maintain an edge • colleagues at search engines achieve $Ms in revenue increases KDD 07 Faloutsos Successes • merging of DB, ML, Stat • excellent outreach: bio-informatics, social networks, text / IR, game theory / economics, etc. Next steps, IMHO • keep up the out-reach • large scale data mining (Tera and Peta bytes): simple algorithms may give stunning results when applied on massive data • scalability [in this KDD: Usama; Jon; ++] • parallelism Scalability • Google: < 450,000 processors in clusters of ~2000 processors each (Barroso, Dean, Hölzle, "Web Search for a Planet: The Google Cluster Architecture", IEEE Micro 2003) • target: hundreds of Tb, to several Peta-bytes (Netflix sample: 2Gb uncompressed) • Yahoo: ~5Pb [Usama's keynote] E.g.: self-* system @ CMU • <200 nodes • 40 racks of computing equipment • 774kW of power • target: 1 PetaByte • goal: self-correcting, self-securing, self-monitoring, self-... DM for Tera- and Peta-bytes Two-way street: DM can use such infrastructures to find patterns; DM can help such infrastructures become self-healing, self-adjusting, "self-*" Conclusion • Failures: lack of "bragging" • Successes: stunning out-reach + cross-disciplinarity • Next steps: scalability; emphasis on Systems ↔ DM collaboration
Successes, Failures and Learning From Them Another topic of interest here is to highlight some of the classic mistakes made in the field. Topics of interest here could range from the use of non-representative training data to ignorance of population drift when modeling time-varying data, from not accounting for errors in data or labels in the model to an over-reliance on a single technique for the task at hand, and from asking the wrong question in the context of the application driver to sampling without care. A related topic here might be to think about the role of benchmark datasets and algorithms, and reflect on the general importance of and requirement for repeatable and reproducible results. Data mining at the crossroads What is data mining? Some of my favorite data mining successes (parts 01–08) Some data mining failures (parts 01–09; also slides: Technology, Msnbc, Wired, 9/11) Data Mining At the Crossroads: Successes, Failures, and Learning from Them Haym Hirsh Department of Computer Science Rutgers University Division of Information and Intelligent Systems U.S. National Science Foundation What is Data Mining? For the purposes of my presentation: Data Mining = the extraction of useful information from data (i.e., Data Mining broadly construed) Copyright © 2007 Haym Hirsh Some of My Favorite Data Mining Successes
• Web search • Spam filtering • Recommender systems • Machine translation • Massive data clusters • Conferences like this one: participation by people in diverse, previously disjoint subfields (databases, machine learning, statistics, etc.) • Benchmark datasets Some Data Mining "Failures" • Socio-Political Section 13.3: UNITY OF EFFORT IN SHARING INFORMATION The U.S. government has access to a vast amount of information. When databases not usually thought of as "intelligence," such as customs or immigration information, are included, the storehouse is immense. • In interviews around the government, official after official urged us to call attention to frustrations with the unglamorous "back office" side of government operations. • Recommendation: The president should lead the government-wide effort to bring the major national security institutions into the information revolution. He should coordinate the resolution of the legal, policy, and technical issues across agencies to create a "trusted information network." Some Data Mining "Failures" • Socio-Political Copyright © 2007 Haym Hirsh • Bad data mining • Misused data mining • Ignorant decision-making • Ramifications of data mining • Presuming fixed technology • Data Mining is about Real Data: benchmark data sets are a means to an end
• Data sets are supposed to be representative of the sorts of problems our algorithms will see in practice • Data sets must stay timely as technological and scientific advances allow our ambitions to grow • A data set from some domain is not an application: who do you personally know that cares about your results? • How do we ensure reproducible results? • Many of the applications of data mining are in the commercial sector: how do we handle research results that reflect proprietary or otherwise restricted data? How do we make sure academic research results address problems that are important in practice? • How do we handle inherent resource differentials between industry and academic research? Access to data; massive data centers • What new models of publication are particularly suited to data mining? "Executable articles" (Mark Liberman)
Successes, Failures and Learning From Them Over the last eighteen years, while there have clearly been successful deployments of knowledge discovery and data mining solutions, there have also undoubtedly been mistakes and failures. This aspect of the panel discussion will examine the low-lights (important mistakes and failures) with the end goal of trying to learn from them. Worst, Best Can Machines Think? Essentially every bundling method improves performance Future (1) Future (2) Visualization as Model: Pharmaceutical application Visualization counter-examples (1)–(4) Worst John Elder elder@datamininglab.com • Short-cuts on sampling / cross-validation: using all the data more than we realize • Focus on algorithms to the detriment of the full ecosystem in which our craft thrives • Privacy and Security: failure here endangers our whole field! Best • Scoring: e.g., credit scoring and fraud detection • Software tools getting better and better • Recommendation systems: even early versions are well-accepted • Ensembles "Of course machines can think. After all, humans are just machines made of meat." - MIT CS professor Human and computer strengths are more complementary than alike. Data Mining Products: Predictive Dynamix, Model 1, Decision Tree, Nearest Neighbor, Delaunay Triangles (or Hinging Hyperplanes), Kernel, Neural Network (or Polynomial Network) Relative Performance Examples: 5 algorithms on 6 datasets, error relative to peer techniques (lower is better) (with Stephen Lee, U. Idaho, 1997); algorithms: Neural Network, Logistic Regression, Linear Vector Quantization, Projection Pursuit Regression, Decision Tree; datasets: Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, Investment. Essentially every bundling method improves performance: Advisor Perceptron, AP weighted average, Vote, Average. Future • Visualization • Medical treatment becomes data- and results-driven
Hope: the Data Mining process becomes more scientific and repeatable - KDD Conference better for industry Hildebrandts Visualization as Model: Pharmaceutical application Placebo Drug Density surfaces enclose ascending quartiles of data Visualization counter-examples? John F. Elder IV Chief Scientist, Elder Research, Inc. Dr. John Elder heads a data mining consulting team with offices in Charlottesville, Virginia and Washington DC (www.datamininglab.com). Founded in 1995, Elder Research, Inc. focuses on investment, commercial and security applications of pattern discovery and optimization, including stock selection, image recognition, text mining, process optimization, cross-selling, biometrics, drug efficacy, credit scoring, market timing, and fraud detection. John obtained a BS and MEE in Electrical Engineering from Rice University, and a PhD in Systems Engineering from the University of Virginia, where he's recently been an adjunct professor, teaching Optimization. Prior to 12 years leading ERI, he spent 5 years in aerospace defense consulting, 4 heading research at an investment management firm, and 2 in Rice's Computational & Applied Mathematics department. Dr. Elder has authored innovative data mining tools, is active on Statistics, Engineering, and Finance conferences and boards, is a frequent keynote conference speaker, and will chair the 2009 Knowledge Discovery and Data Mining conference. John's courses on data analysis techniques -- taught at dozens of universities, companies, and government labs -- are noted for their clarity and effectiveness. For five years, Dr. Elder was honored to serve on a panel appointed by the President to guide technology for National Security. John is a follower of Christ and the proud father of 5.
Debate The terms "success" and "failure" often convey fuzzy semantics that are open to interpretation. As part of the discussion, it is expected that panelists will offer their thoughts on the aforementioned questions while defining their interpretation of these terms in the context of particular domains. After an initial round of discussions by the panelists, the floor will then be opened to an interactive session with the audience. Finally, panelists will be asked to conclude their presentations with their outlook on how one can learn from the successes and failures of the past 15+ years and what in their opinion are the critical opportunities for the field in the future.
San Jose jazz festival During the day the KDD 2007 attendees were in the conference rooms, but at night everyone was outside on the streets to listen to jazz at the San Jose Jazz Festival, which celebrates: bringing jazz legends to Silicon Valley; bringing music to schools; supporting local musicians; and promoting emerging musicians and new jazz forms.
Privacy-Preserving Data Mining The rapid growth of the Internet over the last decade has been startling. However, efforts to track its growth have often fallen afoul of bad data --- for instance, how much traffic does the Internet now carry? The problem is not that the data is technically hard to obtain, or that it does not exist, but rather that the data is not shared. Obtaining an overall picture requires data from multiple sources, few of whom are open to sharing such data, either because sharing violates privacy legislation or exposes business secrets. The approaches used so far on the Internet, e.g., trusted third parties or data anonymization, have been only partially successful and are not widely adopted. The paper presents a method for performing computations on shared data without any participant revealing its secret data. For example, one can compute the sum of traffic over a set of service providers without any service provider learning the traffic of another. The method is simple, scalable, and flexible enough to perform a wide range of valuable operations on Internet data.
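The abstract's example (summing traffic without any provider revealing its own figure) matches the classic additive secret-sharing sum protocol; the sketch below simulates that protocol in a single process. The paper's actual construction may differ; the modulus and the share-passing layout are illustrative assumptions.

```python
import random

def secure_sum(values, modulus=2**61 - 1, seed=0):
    """Simulated additive secret-sharing sum.

    Each party splits its private value into random shares (mod p)
    that sum to the value and sends one share to every party,
    including itself.  Each party publishes only the sum of the
    shares it received; the published partial sums reveal nothing
    about any individual input, yet together they reconstruct the
    total.
    """
    rng = random.Random(seed)
    n = len(values)
    shares = []                     # shares[i][j]: party i's share for party j
    for v in values:
        s = [rng.randrange(modulus) for _ in range(n - 1)]
        s.append((v - sum(s)) % modulus)   # last share makes them sum to v
        shares.append(s)
    # each party j publishes only the sum of the shares it received
    partial = [sum(shares[i][j] for i in range(n)) % modulus for j in range(n)]
    return sum(partial) % modulus
```

Because every individual share is uniformly random mod p, no single party (and no coalition missing at least one share) learns another party's input, while the published partial sums still add up to the true total.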
Winning The DARPA Grand Challenge The DARPA grand challenge, technical details enabling Sebastian Thrun's win, and an introduction to the next phase, called "The Urban Grand Challenge"
Learning and Recognizing Visual Object Categories Over the past few years there has been substantial progress in the development of techniques for recognizing generic categories of objects in images, such as automobiles, bicycles, airplanes, and human faces. Much of this progress can be traced to two underlying technical advances: (1) detectors for locally invariant features of an image, and (2) the application of techniques from machine learning. Despite recent successes, however, there are some fundamental concerns about methods that rely heavily on feature detection, because the local image evidence used in detection decisions is often highly ambiguous due to the absence of contextual information. We are taking a different approach to learning and recognizing visual object categories, in which there is no separate feature detection stage. In our approach, objects are modeled as local image patches with spring-like connections that constrain the spatial relations between patches. Such models are intuitively natural, and their use dates back over 30 years. Until recently such models were largely abandoned due to computational challenges that are addressed by our work. Our approach can be used to learn models from weakly labeled training data, without any specification of the location of objects or their parts. The recognition accuracy for such models is better than when using techniques based on feature detection that encode similar forms of spatial constraint.
Everything is Miscellaneous David Weinberger's new book covers the breakdown of the established order of ordering. He explains how methods of categorization designed for physical objects fail when we can instead put things in multiple categories at once, and search them in many ways. This is no dry book on taxonomy, but has the insight and wit you'd expect from the author of The Cluetrain Manifesto, Small Pieces Loosely Joined, and a former writer for Woody Allen.
Human Computation Tasks like image recognition are trivial for humans, but continue to challenge even the most sophisticated computer programs. This talk introduces a paradigm for utilizing human processing power to solve problems that computers cannot yet solve. Traditional approaches to solving such problems focus on improving software. I advocate a novel approach: constructively channel human brainpower using computer games. For example, the ESP Game, described in this talk, is an enjoyable online game -- many people play over 40 hours a week -- and when people play, they help label images on the Web with descriptive keywords. These keywords can be used to significantly improve the accuracy of image search. People play the game not because they want to help, but because they enjoy it. I describe other examples of "games with a purpose": Peekaboom, which helps determine the location of objects in images, and Verbosity, which collects common-sense knowledge. I also explain a general approach for constructing games with a purpose.
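The ESP Game's core mechanic can be sketched in a few lines: a label is attached to an image only when two players independently propose it, and previously agreed labels become "taboo" to push players toward new keywords. The labels and the helper name below are illustrative, not the game's actual implementation.

```python
# Toy sketch of the ESP Game's agreement rule (illustrative only).

def esp_round(guesses_a, guesses_b, taboo=()):
    """Return the first label player A typed that player B also typed,
    skipping taboo words; None means no agreement this round."""
    seen_b = set(guesses_b)
    for g in guesses_a:
        if g in seen_b and g not in taboo:
            return g  # both players agreed: the label is attached to the image
    return None

print(esp_round(["dog", "grass", "park"], ["puppy", "grass"]))  # grass
print(esp_round(["dog"], ["cat"]))                              # None
```

The agreement requirement is what makes the harvested keywords reliable enough to improve image search.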
TNO Submission: NIST RT05s Speaker Diarization (SAD/SPKR) David van Leeuwen (TNO Defence, Security and Safety) presents TNO's submission to the NIST RT05s speaker diarization (SPKR) evaluation. Topics covered: reasons for TNO to join RT05s/spkr (TNO has run a broadcast-news speaker segmentation/clustering and speech recognition system for Dutch since 2001, where segmentation is needed for on-line processing, feature-stream time reversal in the Abbot acoustic NN, and low latency, but clustering was poor; TNO has been active in NIST speaker recognition evaluations since 2003 and takes part in the AMI EU meeting project); the Speaker Diarization Error rate (SDE), defined as missed, false-alarm, and misclassified speaker time divided by spoken time; why Speech Activity Detection (SAD) is a necessity (without it, all non-speech time counts as false-alarm speaker time; SAD was important enough in RT05s that ICSI offered its SAD output as a contrastive SPKR condition); SAD approaches, namely energy thresholding (e.g., a threshold 20 dB below the meeting maximum, which works fairly well for telephone speech and speaker recognition but not with distant microphones, giving roughly 50% SAD error), a two-phone speech/non-speech recognizer with 3-state left-to-right phone models in the Sonic decoder (which produced no output), and a two-state Viterbi GMM decoder (16 mixtures per model, maximum-likelihood state sequence with smoothing), which seems to work; SAD results (GMMs trained on 12 PLP + energy + delta features from 5 "train" AMI development meetings, with speech/non-speech labels from the SPKR reference files, thanks to Xavier Anguera of ICSI; decoder parameters tuned on 5 "test" AMI development meetings; 2.8% SAD error on the AMI dev test set); speaker segmentation based on the Bayesian Information Criterion (Chen & Gopalakrishnan, Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998), storing aggregated sufficient statistics for covariances over the current window; agglomerative speaker clustering using the Gish distance measure to find the closest segments and a BIC-based condition for merging clusters (inefficient for a large number of initial segments, but preferred over the "online" version of the BN system; tuning on an AMI development split gave λseg = 1.5 and λclust = 14); NIST RT05s speaker diarization results, with "multiple distant microphones" meaning a single distant mic and no overlap (on the RT05s test set, SDE of 35.1% with TNO's SAD, 37.1% with ICSI's SAD, 32.3% with perfect SAD, and 19.0% with optimized parameters; misses were 13/53 = 24.5% of speakers but only 0.4% of speaker time, false alarms 5/53 = 9.4% of speakers and 6.6% of speaker time); a discussion of the SDE measure (harsh on false alarms, and weighting long-duration speakers more, so it is advantageous to ignore short-duration speakers via a high λclust) and of BIC segmentation/clustering (a nice idea based on first principles, but with tunable parameters remaining; why full-covariance single-mixture GMMs, which allow cancellation of the exponent in the likelihood calculation, and how about diagonal covariance with multiple mixtures?); and plans for the next evaluation (using the decoder for the clustering process with diagonal-covariance GMM speaker models, including overlap between speakers in the network, using multiple-distant-microphone data, and investigating "absolute speaker ID", i.e. speaker spotting/tracking with speaker priors and a matching evaluation measure; cf. Jin et al., Proc. NIST RT04s, ICASSP, 2004).
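The BIC change-detection criterion at the heart of the segmentation step can be sketched as follows. This uses 1-D features for brevity, whereas the system described above uses full-covariance Gaussians over PLP features; the penalty form and the λ value here are the standard textbook ones, not TNO's exact tuning.

```python
import numpy as np

# Sketch of BIC-based speaker change detection (after Chen & Gopalakrishnan, 1998),
# shown on 1-D features; log-determinants reduce to log-variances in 1-D.

def delta_bic(window, t, lam=1.5, d=1):
    """Positive delta-BIC at frame t suggests a speaker change inside `window`."""
    a, b = window[:t], window[t:]
    n = len(window)
    # generalized likelihood ratio: one Gaussian vs. one Gaussian per side
    glr = (n * np.log(np.var(window))
           - len(a) * np.log(np.var(a))
           - len(b) * np.log(np.var(b)))
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * glr - lam * penalty

rng = np.random.default_rng(0)
change = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])
same = rng.normal(0, 1, 400)
print(delta_bic(change, 200))  # large and positive: a change is detected
print(delta_bic(same, 200))    # the homogeneous window scores much lower
```

In the full system the candidate boundary is slid across the window and the aggregated sufficient statistics mentioned above make recomputing the variances cheap.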
Reverse Engineering Techniques to Find Security Bugs: A Case Study of the ANI Bug Alex Sotirov is a vulnerability engineer at Determina. He will discuss some of the latest techniques for reverse engineering software to find vulnerabilities. In particular, he will discuss the technique that led him to find the ANI bug (a critical bug in Windows XP and Vista). Alex will describe the tools he uses for reverse engineering and show how he reverse engineered the ANI bug. He will go on to discuss Windows security mechanisms (ASLR, /GS) and describe how the ANI exploit bypasses them.
Statistical Aspects of Data Mining (Stats 202) This is the Google campus version of Stats 202, which is being taught at Stanford this summer. I will follow the material from the Stanford class very closely. That material can be found at [[http://www.stats202.com/|www.stats202.com]]. The main topics are exploring and visualizing **data**, association analysis, classification, and clustering. The textbook is Introduction to **Data** **Mining** by Tan, Steinbach and Kumar. Googlers are welcome to attend any classes which they think might be of interest to them.
Debunking third-world myths with the best stats you've ever seen You've never seen data presented like this. With the drama and urgency of a sportscaster, [[http://www.ted.com/index.php/speakers/view/id/90|Hans Rosling]] debunks myths about the so-called "developing world" using extraordinary animation software developed by his Gapminder Foundation. The Trendalyzer software (recently acquired by Google) turns complex global trends into lively animations, making decades of data pop. Asian countries, as colorful bubbles, float across the grid -- toward better national health and wealth. Animated bell curves representing national income distribution squish and flatten. In Rosling's hands, global trends -- life expectancy, child mortality, poverty rates -- become clear, intuitive and even playful. (Recorded February 2006 in Monterey, CA. Duration: 20:35) - More TEDTalks at [[http://www.ted.com/]] ;//"Rosling believes that making information more accessible has the potential to change the quality of the information itself." ://Business Week Online//
The Implications of OpenID OpenID is an emerging standard that provides simple, decentralised authentication for the Web. OpenID follows the Unix philosophy, solving one small problem rather than attempting to tackle the many larger challenges posed by online identity. This talk will explore the implications of OpenID, and explore the best practices required to take advantage of this new technology while avoiding the potential pitfalls. Speaker:\\\\ Simon Willison is a consultant on OpenID and client- and server-side Web development, and a co-creator of the Django Web framework. Before going freelance, Simon worked on Yahoo!'s Technology Development team, and prior to that at the Lawrence Journal-World, an award-winning local newspaper in Kansas. \\\\ Simon maintains a popular Web development weblog at [[http://simonwillison.net/]]
The Next Fifty Years of Science The scientific method which provides us with so many technological goodies does not resemble the science of 1600. Ever since Bacon, science has undergone a slow evolution. Landmarks in the history of the scientific method are the invention of libraries, indexes, citations, controlled experiments, peer review, placebos, double blind experiments, randomization, and search among others. At the core of the scientific method is the structuring of information. In the next 50 years, as the technologies of information and knowledge accelerate, the nature of the scientific process will change even more than it has in the last 400 years. We can't predict what specific inventions will arise in the next 50 years, but based on long-term trends in epistemic tools, I believe we can speculate on how the scientific method itself -- that is, how we know -- will change in the next five decades.
Faith, Evolution, and Programming Languages Faith and evolution provide complementary--and sometimes conflicting--models of the world, and they also can model the adoption of programming languages. Adherents of competing paradigms, such as functional and object-oriented programming, often appear motivated by faith. Families of related languages, such as C, C++, Java, and C#, may arise from pressures of evolution. As designers of languages, adoption rates provide us with scientific data, but the belief that elegant designs are better is a matter of faith. This talk traces one concept, second-order quantification, from its inception in the symbolic logic of Frege through to the generic features introduced in Java 5, touching on features of faith and evolution. The remarkable correspondence between natural deduction and functional programming informed the design of type classes in Haskell. Generics in Java evolved directly from Haskell type classes, and are designed to support evolution from legacy code to generic code. Links, a successor to Haskell aimed at AJAX-style three-tier web applications, aims to reconcile some of the conflict between dynamic and static approaches to typing.
Machine Learning Summer School 2007 - Tuebingen Machine Learning is a foundational discipline of the Information Sciences. It combines theory from areas as diverse as Statistics, Mathematics, Engineering, and Information Technology with many practical and relevant real life applications. The aim of the summer school is to cover the entire spectrum from theory to practice. It is mainly targeted at research students, academics, and IT professionals from all over the world. The program will feature introductory courses at the beginning to provide basic working knowledge of Machine Learning. Building on this introductory material, advanced topics will be covered progressively over the duration of the school. Subjects will be covered both in lectures (4-6 per topic) and in practical courses (where students will have the chance to implement methods for themselves); and are taught by world experts in their fields.\\\\ **Do you have a question for the lecturers on the MLSS 2007?** **We encourage you to start a debate, comment on each lecturer's video, or send us an email and we will ask them for you!**
Describing and Discovering Language Resources Topics: goals; language resources (LRs) on the WWW; Web Services (WS); Service-Oriented Architecture (SOA); from the WWW to Web Services; Web Service key ideas and the appeal of Web Services; NLP services and building NLP applications; issues in the component approach; sequencing compatible NLP services; the WSDL file processor; input and output types; metadata for LRs and what's missing (tool metadata); discovering resources; service description and discovery; some versions of the BNC corpus; a request scenario; service description; implementation; summary; and thanks to collaborators.
Interview with Usama Fayyad [[http://videolectures.net/usama_fayyad/|Dr. Usama Fayyad]] is **[[http://yhoo.client.shareholder.com/press/management.cfm|Yahoo!'s Chief Data Officer and Executive Vice President, Research & Strategic Data Solutions.]]** Dr. Usama Fayyad is responsible for Yahoo!'s overall data strategy, architecting Yahoo!'s data policies and systems, prioritizing data investments, and managing the Company's data analytics and data processing infrastructure. Here he discusses some contemporary topics such as cooperation between academia and industry, privacy policy, and his beginnings as a young researcher and his first algorithms. ; In this interview the VideoLectures.Net team spoke to him about his start as a **young researcher**, the distinction between researchers and engineers in data mining, and whether he feels like a **scientific businessman**. We were also interested to hear his opinion on **privacy policy**, whether he remembers his **first algorithm**, and his **message to the community**. ://"It is my fundamental belief that the most elegant theoretical problems are typically embedded in the most mundane real applications, so you really don't have to think too fancy or hard; all you have to do is embed yourself in a real problem and try to solve the real problem with real constraints."//
Welcome from the Program Chairs This proceedings volume is the published record of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-07) held in San Jose, California on August 12-15, 2007. The KDD-07 conference provides a forum for novel research results and important applications in the area of data mining and knowledge discovery. The vibrancy, excitement and breadth of the field are reflected by the strong lineup of research papers, invited talks, tutorials and workshops at the conference. The conference and the proceedings represent the efforts of a large number of people. We would like to thank the Industrial and Government Applications Track Chairs, the members of the Organizing Committee, the members of the Program Committee (including the Research Track Senior Program Committee), the external reviewers, and the student volunteers who helped out at the conference. These individuals contributed many hours of their time to serve their scientific community and help make the conference as successful as it is. We would also like to thank the ACM staff and the conference sponsors for their support. Chairs of Program Committees Research Track SPC Research Track PC Paper Review and Selection for the Research Track
Temporal Causal Modeling with Graphical Granger Methods The need for mining causality, beyond mere statistical correlations, for real world problems has been recognized widely. Many of these applications naturally involve temporal data, which raises the challenge of how best to leverage the temporal information for causal modeling. Recently graphical modeling with the concept of "Granger causality", based on the intuition that a cause helps predict its effects in the future, has gained attention in many domains involving time series data analysis. With the surge of interest in model selection methodologies for regression, such as the Lasso, as practical alternatives to solving structural learning of graphical models, the question arises whether and how to combine these two notions into a practically viable approach for temporal causal modeling. In this paper, we examine a host of related algorithms that, loosely speaking, fall under the category of graphical Granger methods, and characterize their relative performance from multiple viewpoints. Our experiments show, for instance, that the Lasso algorithm exhibits consistent gain over the canonical pairwise graphical Granger method. We also characterize conditions under which these variants of graphical Granger methods perform well in comparison to other benchmark methods. Finally, we apply these methods to a real world data set involving key performance indicators of corporations, and present some concrete results. Talk outline: a motivating example, key performance indicator (KPI) data in corporate index management [S&P]; a KPI case study on temporal causal modeling for identifying levers of corporate performance; Granger causality; variable space expansion and feature space mapping; graphical Granger methods; exhaustive Granger vs. Lasso Granger; baseline methods (SIN and VAR); empirical evaluation of the competing methods (experiments 1A-1C on performance vs. factors and efficiency, experiment 2 on learned graphs, experiment 3 on real world data); and output graphs on the corporate KPI data.
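The "Lasso Granger" idea — regress a target series on lagged copies of every series and read candidate causes off the nonzero coefficients — can be sketched as follows. The series names, lag order, and regularization value are illustrative; this is not the authors' exact implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative Lasso-Granger sketch: nonzero lagged coefficients mark
# candidate Granger causes of `target`.

def lasso_granger(series, target, maxlag=2, alpha=0.05):
    """series: dict name -> equal-length 1-D array. Returns the set of names
    whose lagged values receive a nonzero Lasso coefficient for `target`."""
    names = sorted(series)
    T = len(series[target])
    # design matrix: for each series, columns for lags 1..maxlag
    X = np.column_stack([series[n][maxlag - l: T - l]
                         for n in names for l in range(1, maxlag + 1)])
    y = series[target][maxlag:]
    coef = Lasso(alpha=alpha).fit(X, y).coef_.reshape(len(names), maxlag)
    return {n for n, c in zip(names, coef) if np.abs(c).max() > 1e-8}

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = np.roll(x, 1) * 0.9 + rng.normal(scale=0.1, size=500)  # y driven by lagged x
z = rng.normal(size=500)                                    # independent noise
print(lasso_granger({"x": x, "y": y, "z": z}, "y"))
```

Repeating this regression with each series as the target yields the adjacency structure of the inferred causal graph, which is what makes the Lasso a practical alternative to exhaustive pairwise Granger tests.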
Hierarchical Multi-Stream Posterior Based Speech Recognition System In this paper, we present initial results towards boosting posterior based speech recognition systems by estimating more informative posteriors using multiple streams of features and taking into account acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). These posteriors are estimated based on the "state gamma posterior" definition (typically used in standard HMM training) extended to the case of multi-stream HMMs. This approach provides a new, principled, theoretical framework for hierarchical estimation/use of posteriors, multi-stream feature combination, and integrating appropriate context and prior knowledge in posterior estimates. In the present work, we used the resulting gamma posteriors as features for a standard HMM/GMM layer. On the OGI Digits database and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task, this resulted in significant performance improvement, compared to state-of-the-art Tandem systems. Talk outline: main idea; posterior based speech recognition systems; prior knowledge and contextual information; "gamma" posterior estimation, with an example of introducing prior knowledge; multi-stream "gamma" posterior estimation; experiments with multi-stream posteriors and feature streams; the hierarchical multi-stream posterior based speech recognition system; results; conclusions.
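For reference, the plain single-stream state gamma posterior that the paper generalizes, γ_t(i) = P(q_t = i | x_1..x_T), is computed with the standard forward-backward recursions. The toy discrete HMM below is illustrative only; the paper works with multi-stream continuous-density HMMs.

```python
import numpy as np

# Standard forward-backward computation of state gamma posteriors
# for a small discrete-observation HMM (illustrative parameters).

def gammas(pi, A, B, obs):
    """pi: initial probs (S,); A: transition matrix (S, S); B: emission
    probabilities (S, num_symbols); obs: list of symbol indices.
    Returns a (T, S) array of per-frame state posteriors."""
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                       # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):              # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    g = alpha * beta
    return g / g.sum(axis=1, keepdims=True)     # normalize per frame

pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
g = gammas(pi, A, B, [0, 0, 1, 1])
print(g.shape)  # (4, 2)
```

The paper's contribution is to extend this per-state quantity to multiple feature streams and to inject prior and contextual knowledge into the estimate, then feed the resulting posteriors to an HMM/GMM layer.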
The 5th International Workshop on Mining and Learning with Graphs Data Mining and Machine Learning are in the midst of a "structured revolution". After many decades of focusing on independent and identically-distributed (iid) examples, many researchers are now studying problems in which examples consist of collections of inter-related entities or are linked together into complex graphs. A major driving force is the explosive growth in the amount of heterogeneous data that is being collected in the business and scientific world. Example domains include bioinformatics, chemoinformatics, transportation systems, communication networks, social network analysis, link analysis, and robotics, among others. The structures encountered can be as simple as sequences and trees (such as those arising in protein secondary structure prediction and natural language parsing) or as complex as citation graphs, the World Wide Web, and even relational databases. In all these cases, structured representations can give a more informative view of the problem at hand, which is often crucial for the development of successful mining and learning algorithms. We believe this is an ideal time for a workshop that allows active researchers in this area to discuss and debate the unique challenges of mining and learning from structured data. The MLG 2007 workshop will thus concentrate on mining and learning with structured data in general and its many appearances and facets such as interpretations, graphs, trees, sequences. Specifically, we seek to invite researchers in Statistical Relational Learning, Kernel Methods for Structured Inputs/Outputs, Graph Mining, (Multi-) Relational Data Mining, Inductive Logic Programming, among others.
Opening - The 5th International Workshop on Mining and Learning with Graphs There have been several workshops on mining and learning from graphs in recent years, such as last year's MLG and its forerunner, the MGTS workshop series on Mining Graphs, Trees and Sequences. These were successful, but were tied to the conference of one research community. Nowadays there seems to be a surge of interest in mining and learning from structured data across several communities. Most researchers, however, only have exposure to one or two communities, and no clear understanding of the relative advantages and limitations of different approaches has yet emerged. We believe this is an ideal time for a workshop that allows active researchers in this area to discuss and debate the unique challenges of mining and learning from structured data. The MLG 2007 workshop will thus concentrate on mining and learning with structured data in general and its many appearances and facets such as interpretations, graphs, trees, sequences. Specifically, we seek to invite researchers in Statistical Relational Learning, Kernel Methods for Structured Inputs/Outputs, Graph Mining, (Multi-) Relational Data Mining, Inductive Logic Programming, among others. Welcome to the 5th International Workshop on Mining and Learning with Graphs, Università degli Studi di Firenze, August 1-3, 2007. Previous editions: MGTS'03 (Cavtat-Dubrovnik, Croatia), MGTS'04 (Pisa, Italy), MLG'05 (Porto, Portugal), MLG'06 (Berlin, Germany), all organized as ECML/PKDD workshops. MLG'07 as an independent event: not a publication venue but a true workshop, with participation driven by interest in the field and a strict review process with 3-4 reviews for each extended abstract; a big thanks to the PC and the other referees. Facts: 49 submissions from 16 countries, 17 oral presentations, 26 posters (5 spotlights), and 74 participants from 22 countries. Schedule highlights: 5 prominent scientists in the field as invited speakers (with thanks to our generous sponsors); a poster session this afternoon; on-site lunch today and tomorrow; the business meeting today at 18:00; the banquet tomorrow at 20:00. Enjoy the workshop! A special topic of the Journal of Machine Learning Research is forthcoming; stay tuned for details and the Call for Papers.
Learning and Charting Chemical Space with Strings and Graphs: Challenges and Opportunities for AI and Machine Learning Informatics methods and computers have not yet become as pervasive in chemistry as they have in physics and biology. Drawing analogies from bioinformatics, key ingredients for progress in chemoinformatics are the availability of large, annotated databases of compounds and reactions, data structures and algorithms to efficiently search these databases, and computational methods to predict the physical, chemical, and biological properties of new compounds and reactions. We will describe how graph-based methods play a key role in the development of: (1) a large public database of compounds and reactions (ChemDB) and the underlying algorithms and representations; (2) machine learning kernel methods to predict molecular properties; and (3) the applications of these methods to drug screening/design problems and the identification of new drug leads against a major disease. Talk outline: charting chemical space with computers and discovering new drug leads; the bioinformatics/chemoinformatics analogy ("A mathematician is a machine that converts coffee into theorems" - P. Erdős; a computer scientist ...?); ChemDB (UCI's 5M-compound database) and its architecture; molecular representations and kernels (1D SMILES kernels, 2D labeled graphs, 2D fingerprints and MinMax similarity for binary fingerprints, 2D Venn similarity, 2.5D surface kernels, 3D coordinate kernels, and the conformer problem, i.e. 2.5D + conformers = 3.5D); fingerprint compression, power-law distribution models, and lossless compression algorithms; finding a good similarity/kernel; linear classifiers and non-linear classification with kernel methods; datasets and results (Mutag, PTC, NCI, the HIV competition; regression on aqueous solubility with 30-fold cross-validation on the Delaney dataset of 1440 examples, XLogP with 40-fold cross-validation on 1991 examples, boiling points of alkanes, melting points of benzodiazepines and the Bergström melting-point set, and mutagenicity classification on 188 examples); structure-based drug design against tuberculosis, an old foe and still a real threat (the cell wall as key to pathogen survival, the structure of AccD5, in silico screening and docking with ICM and DOCK, and identified AccD5 inhibitors such as NCI-65828); polyketide and fatty acid biosynthesis and ACCase specificity (PccB and AccB hexamers, residue 422 as the "Achilles heel" whose single mutation interchanges ACC and PCC specificity, crystal structures of the mutants, and a novel extender unit for combinatorial biosynthesis); reaction discovery (knowledge-based reactions and their limitations, reaction favorability scoring, pseudo-mechanistic reactions, with azide + alkyne and Diels-Alder examples); docking against ChemDB for drug rescue of p53 mutants; chemical toxicity prediction; and acknowledgements.
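The fingerprint similarities listed in the outline are straightforward to sketch: Tanimoto (Jaccard) similarity on binary fingerprints, and its MinMax generalization on count fingerprints. The bit patterns below are made up, not real molecular fingerprints.

```python
# Sketch of two standard chemoinformatics similarity measures (illustrative data).

def tanimoto(a, b):
    """Tanimoto similarity on binary fingerprints given as sets of 'on' bits."""
    return len(a & b) / len(a | b) if a | b else 1.0

def minmax(c1, c2):
    """MinMax similarity on count fingerprints given as dicts bit -> count."""
    keys = set(c1) | set(c2)
    mins = sum(min(c1.get(k, 0), c2.get(k, 0)) for k in keys)
    maxs = sum(max(c1.get(k, 0), c2.get(k, 0)) for k in keys)
    return mins / maxs

mol1 = {1, 4, 7, 9}
mol2 = {1, 4, 8}
print(tanimoto(mol1, mol2))  # 2 shared bits / 5 total bits = 0.4
print(tanimoto(mol1, mol1))  # identical fingerprints -> 1.0
print(minmax({1: 2, 2: 1}, {1: 1, 2: 1}))  # (1 + 1) / (2 + 1)
```

Both measures are positive-definite and can therefore be plugged into kernel methods such as the classifiers and regressors in the results above.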
Graph Identification Within the machine learning community, there has been a growing interest in learning structured models from input data that is itself structured. Graph identification refers to methods that transform an observed input graph into an inferred output graph. Examples include inferring organizational hierarchies from social network data and identifying gene regulatory networks from protein-protein interactions. The key processes in graph identification are entity resolution, link prediction, and collective classification. I will overview algorithms for these tasks and discuss the need for integrating the results to solve the overall problem collectively.
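Of the three processes listed, collective classification is the easiest to sketch: labels propagate through the graph so that each inference supports its neighbors. Below is a toy relational-neighbor baseline with made-up nodes and labels; real systems use learned classifiers over node and neighbor features rather than simple majority voting.

```python
from collections import Counter

# Toy collective classification: unlabeled nodes repeatedly adopt the
# majority label of their already-labeled neighbors (illustrative only).

def collective_classify(adj, labels, iters=10):
    """adj: node -> list of neighbors; labels: partial dict node -> label."""
    labels = dict(labels)
    for _ in range(iters):
        for node in adj:
            if node in labels:
                continue
            votes = Counter(labels[n] for n in adj[node] if n in labels)
            if votes:
                labels[node] = votes.most_common(1)[0][0]
    return labels

# A chain a-b-c-d with only "a" labeled: the label propagates hop by hop.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(collective_classify(adj, {"a": "spam"}))
```

Entity resolution and link prediction change the graph itself (merging nodes, adding edges), which is why the talk stresses solving the three tasks jointly rather than in a fixed pipeline.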
Department of Knowledge Technologies The core business of the JSI Department of Knowledge Technologies is research, technology development, and the deployment of state-of-the-art analytic solutions in large real-life scenarios. Jožef Stefan Institute, Department of Knowledge Technologies: core technology; the department and other members of the group; European projects from FP5 and FP6; key competencies; and some images from our systems, including a visualization of FP6 IST projects, structure analysis and summarization of news documents, and semantic question answering with the Cyc analytic environment.
Dynamic Bayesian Networks for Multimodal Interaction Dynamic Bayesian networks (DBNs) offer a natural upgrade path beyond classical hidden Markov models and become especially relevant when temporal data contains higher order structure, multiple modalities or multi-person interaction. We describe several instantiations of dynamic Bayesian networks that are useful for modeling temporal phenomena spanning audio, video and haptic channels in single, two-person and multi-person activity. These models include input-output hidden Markov models, switched Kalman filters and, most generally, dynamical systems trees (DSTs). These models are used to learn audio-video interaction in social activities, video interaction in multi-person game playing and haptic-video interaction in robotic laparoscopy. Model parameters are estimated from data in an unsupervised setting using generalized expectation maximization methods. Subsequently, these models can predict, synthesize and classify various types of rich multimodal human activity. Experiments in gesture interaction, audio-video conversation, football game playing and surgical drill evaluation are shown. 
Dynamic Bayesian Networks for Multimodal Interaction Outline Introduction Bayesian Networks Bayes Nets to Junction Trees Junction Tree Algorithm Junction Tree Algorithm Maximum Likelihood with EM Dynamic Bayes Nets Two-Person Interaction DBN: Hidden ARMA Model DBN: Hidden ARMA Model Hidden ARMA Features: Conditional EM for hidden ARMA Conditional EM Conditional EM Hidden ARMA on Gesture DBN: Input-Output HMM DBN: Input-Output HMM Input-Output HMM Data Video Representation Video Representation Input-Output HMM Input-Output HMM with CEM Input-Output HMM with CEM Input-Output HMM Results Intractable Dynamic Bayes Nets Intractable DBNs: Generalized EM Intractable DBNs Variational EM Dynamical System Trees Dynamical System Trees DSTs and Generalized EM DSTs for American Football DSTs for American Football DSTs for Gene Networks Robotic Surgery, Haptics & Video Robotic Surgery, Haptics & Video Robotic Surgery, Haptics & Video Robotic Surgical Drills Results Conclusion
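As a minimal concrete member of the model family the talk builds on (a Kalman filter is the linear-Gaussian special case of a dynamic Bayesian network, and the switched Kalman filters above generalize it), here is a 1-D filtering sketch. The noise variances and the toy measurement stream are illustrative assumptions, not the talk's models.

```python
def kalman_1d(zs, q=1e-3, r=0.1, x0=0.0, p0=1.0):
    """Filter a scalar random walk x_t = x_{t-1} + w_t observed through
    z_t = x_t + v_t; q and r are process/measurement noise variances."""
    x, p, means = x0, p0, []
    for z in zs:
        p = p + q                 # predict: variance grows by process noise
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # correct the mean toward the measurement
        p = (1.0 - k) * p         # shrink the posterior variance
        means.append(x)
    return means

est = kalman_1d([1.1, 0.9, 1.05, 0.95])
```

EM for such models alternates this forward inference (plus a smoothing pass) with re-estimation of q and r from the inferred states, which is the unsupervised parameter-learning setting described in the abstract.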
Department of Communication Systems The Department of Communication Systems is concerned mainly with the research, development and design of next generation networks and wireless access systems, and the development of new algorithms for parallel and distributed computing and computer simulations. Other research activities include the development of software tools for testing, modeling and simulation of communication systems, provision of security services in communication networks, digital signal processing in medicine, etc. With its research and development activities the department has been actively involved in national and international projects, including Framework Programme and Structural Funds projects. Jožef Stefan Institute Department of Communication Systems Digital telecommunication systems Computer networks and distributed systems Parallel computing Selected international RTD projects Selected international projects - CAPANINA Selected international projects - SatNEx Selected national projects - TETRA Examples of past cooperation with industry Industrial cooperation with Telsima Wireless Regular interconnection topologies Modelling and simulations in medicine Biomedical measuring devices
Computer Systems Department The Computer Systems Department at the Jozef Stefan Institute is concerned primarily with the design automation of computing structures and systems. Within this broad area, we concentrate particularly on metaheuristic approaches to engineering design and logistics problems, as well as on system design and test. Jozef Stefan Institute - Computer Systems Department Security extension for IEEE Std 1149.1 Secure data storage unit
Department of Intelligent Systems The department develops methods and techniques for intelligent computer systems, with applications in the areas of event-driven security and supervisory systems for near-real-time and mission-critical operation, and network communication systems. "Jozef Stefan" Institute - Department of Intelligent Systems Core Technologies (Research Areas) Applications / projects CiVaBiS Partnership
Welcome and Presentation of Center for Knowledge Transfer in Information Technologies * The Centre for Knowledge Transfer in Information Technologies performs educational, promotional and infrastructural activities and provides a direct exchange of information and experience between researchers and the users of their research results. * We develop and prepare carefully designed educational events such as seminars, workshops, conferences and summer schools. * Drawing on our experience in European projects, we offer industry and other organizations consulting, pre-evaluation and help in preparing EU project proposals, as well as support during project implementation. * We have prepared a number of training web portals with more than 2700 hours of recorded tutorials from different domains of knowledge, available at **[[http://videolectures.net/]]**
IRC IRENE, AREA Science Park * Assistance to industry, in particular SMEs, in defining technology needs * Promotion, through the IRC network, of innovative ideas at the European level * Identification of potential technology partners * Facilitation of transnational technology transfer agreements * Support for R&D projects under the European Research Framework Programme AREA Science Park - Transnational ICT and Security Technology Opportunities AREA Science Park connecting and adding value to the Friuli Venezia Giulia regional ICT companies AREA Science Park Focusing on ICT Cluster Focusing on ICT Technologies Thank you
active-media-group.com srl We specialize in EDI (electronic data interchange) online solutions: with our product DERWID we are present on the Italian, Austrian, German and Liechtenstein markets. Extensive know-how in cross-platform software development, system integration, EDI and the growing Linux market is the basis of our competence. We offer solutions in dedicated projects as well as standard tools for different requirements. Active-media-group.com srl Core Technology
EIDON S.p.A. EIDON is an engineering and contract research company, working in partnership with enterprises to provide cutting-edge technological support for product and process innovation. Established in 1979, EIDON was one of the first centers in Italy to provide technological support services of this nature. The company designs and develops ICT solutions for the global, automated management and control of production processes, including those distributed throughout the country. EIDON is recognised by MUR (Italian Ministry of University and Research) as an excellence laboratory in the fields of computer science and electronics. EIDON S.p.A. Core Technology
ELIMOS srl System integration, real-time software development, digital CCTV (TVCC) systems and centralization, ANPR systems, domotics ELIMOS srl Core Technology
Emaze Networks S.p.A. Emaze Networks is an innovative provider of services and products in the information **security field**. Emaze's customers include financial and insurance institutions, industrial and service companies, R&D departments, and telecommunication and utility companies. Emaze Networks S.p.A. Core Technology
Sicom test S.r.l. The company's mission is to perform measurements and tests on telecommunication mobile terminal products. Sicom's services include functional tests, quality of service (QoS), SAR measurements, network operator acceptance, Global Certification Forum tests and railway R-GSM tests. Sicom test S.r.l. Core Technology
Gender issues in user interfaces Most areas of the computing sciences consider themselves either gender-neutral or aim at acknowledging physiologically based differences between women and men. While the first standpoint draws on the traditional ideal of science as a rational, objective and value-free project, the second refers not to gender but to sex differences, a general tendency that is supported by recent interpretations of brain research in the popular media. Gender Issues in User Interfaces Overview Gender Issues: Common Assumptions The Gender-Technology Relationship: Theoretical Approaches Liberal Feminism Radical Feminism Radical Feminism Equality vs. Difference? Constructionist Approaches Constructionist Approaches The Co-Production of Gender and Technology 2. How is gender inscribed into software and user interfaces? I. Inscription of the gendered division of labor I. Inscription of the gendered division of labor II. Inscription of the alleged absence of gender Gendered inclusion/exclusion of knowledge III. Inscription of the developers' (male) dreams Smart House Smart Houses as "toys for the boys" IV. Inscription of assumptions about users The inscription of gender in word processing systems V. The Representation of Gender The Inscription of Alleged Human Differences The Inscription of Alleged Human Differences Summary: Inscriptions of Gender
Monodispersed particles in technologies and medicine In recent decades we have witnessed the development of various techniques for manufacturing monodispersed systems with particle sizes from a few micrometers down to a few nanometers. These particles come in different sizes and shapes, and their chemical structure can be simple or complex. Initially, research on these systems focused on their distinctive physical and chemical properties, which can depend strongly on particle size and morphology. Later, interest shifted to their use in making materials with specific, reproducible properties, a use that is being accelerated by today's trend towards miniaturization in technology and medicine. They enable considerable progress in transferring laboratory techniques for preparing monodispersed colloid systems into industry. The lecturer will focus on the role of monodispersed colloid systems in the manufacture of ceramics and pigments, in photography and chemical polishing, and especially in the making of monodispersed medicines. He will show how colloid particles can be used for delayed-release drug delivery targeted at a pre-selected part of the human body. They are also used in diagnostics, especially in X-ray examinations. Their uses differ because of the differences in the properties of nano- and micro-dispersions.
T-CONNECT srl T-Connect is engaged in R&D of wireless applications on 3G devices integrated with Location Based Services for the Mobile Business. The company works with a 3G mobile carrier for consulting services on functional and inter-operability (IOT) testing on wireless systems (3G mobile networks, Wi-Fi and DVB-H). T-CONNECT srl Core Technology
Testability Snc Development, deployment, installation and startup of: * Production testing lines for electronic devices and equipment * Automated test benches * Mechanical tooling for electronics, i.e. fixtures * Replicas of existing production/testing lines * Technology transfer Testability Snc Core Technology
Wego s.r.l. Wego works in the e-government sector. Its activities consist of consulting and software development for public administration entities, and especially of: * business process and administrative proceedings engineering; * front end solutions development; * electronic document management (EDM). Wego s.r.l. Core Technology
IRC Slovenia, Jožef Stefan Institute * Helping local industry specify its new technological needs (technological audits) and, with the help of the IRC network, trying to identify partners to provide these new technologies. * Helping local industry identify which of its technologies are suitable for transfer to other regions or industries and promoting these innovative ideas across Europe through the Innovation Relay Centres network. * Providing assistance in the negotiation process between the provider and the receiver of the technology. * Advising on related aspects of research exploitation, such as patenting and licensing. * Informing about relevant Community and national financial support schemes for innovation. Participants from Slovenia IRC Slovenia, Jožef Stefan Institute
andEuros d.o.o. The andEuros company is active in electronics and software research and development, providing complete solutions covering system architecture, sensors, sensor networks, mixed-signal hardware designs, distributed software and hardware platforms, embedded systems, FPGA/VHDL system-on-chip design, powerline communications and sensor protocols. Service Event Architecture Sensor Standard General Language (1) Sensor Standard General Language (2) Sensor General Instrumentation Bus
IKS d.o.o. * Implementing computer vision methods in various fields (video surveillance, traffic control, industrial quality inspection, medical diagnostics) * Development of database applications * Algorithm development and testing IKS d.o.o. Core Technology
INDATA d.o.o. Research and development of applied electronics: * Microprocessor applications for building energy management * Building management system components based on the EN 14908/ANSI 709.3 standard * Hardware implementation of the SSGL protocol INDATA d.o.o. Core Technology
ISKRA ZAŠČITE d.o.o. Research, production and implementation of products for: * Surge protection devices in low-voltage power distribution systems, for power supply systems, telephone exchanges and terminals, base stations, and oil, gas and water pipelines * Data transmission systems * External lightning protection for base stations and for oil, gas and water pipeline stations; medium-voltage surge arresters * Integration and engineering of complete solutions in telecommunication networks, surge and overvoltage protection solutions * Research, production and implementation of telecommunication access and sensor systems Iskra Zaščite Director's Message, Quality, Key Facts Our Mission and Goals Business Experiences Lightning Strikes in the Julian Alps (Triglav, 2864 m) DIRECTOR'S MESSAGE "From its inception in 1989, ISKRA Zaščite has sought to strive towards excellence: excellence in the customer service it offers and excellence in the product designs and technological innovations it makes. It is pleased to stand today as one of Europe's leading suppliers of surge protection products for power, data and telecommunications. The company realizes that in the increasingly globalized marketplace in which it now operates, such dedication alone cannot guarantee long-term success. It has to remain strategically focused and prepared to be nimble and willing to adjust its business plans as situations and competitive competencies determine. Perhaps one of the more evident signs that it is indeed succeeding in this endeavour is its growing list of customers and OEM partners spread throughout the world. This is indeed a rewarding testimony to its willingness to engineer customized solutions for local market requirements and remain flexible to its partners' needs and industry trends." QUALITY The company is ISO 9001:2000 certified. It is also certified to the EN 13980 (94/9/EC ATEX) directives for intrinsic safety.
These two international standards ensure that quality is part of each step from conceptual design to fitting. As an ISO 9001 accredited company, we are committed to the work of international standardization, both in efforts to make the development, manufacturing and supply of our products more efficient, safer and cleaner, and in its ability to make trade between countries easier and fairer. Attention to quality at Iskra Zaščite is ingrained in all employees. We recognize that in the competitive environment we now find ourselves in, quality must be fundamental to our corporate culture if we are to succeed. We realize that the synergies that come from a quality product and a strong partnership with our customers are the core of our continued growth. A number of our employees serve as technical experts on various committees developing such standards, including IEC SC37A on surge protection, the UL 1449 Standards Technical Panel on surge protection and IEC TC81 on lightning protection. We were proud to host the IEC SC37A working group meetings responsible for the development of the IEC 61643 series of standards in Slovenia last year. Such involvement at the standards development level ensures that our products are always at the cutting edge of design and comply with relevant certifications such as VDE, ÖVE, IEC and UL. Vladimir Murko, MSc., Managing Director KEY FACTS The company was established in Ljubljana, Slovenia, in 1989 as a limited liability company. Its expertise is in the research and development of surge protective components, efficient and cost-effective production, and the marketing and sales of its products to meet the demands of industry and customers alike. In 2005 the company achieved a turnover of six million EUR, of which 90% came from export sales, and employed 90 skilled and semi-skilled personnel.
OUR MISSION AND GOALS To become a leading producer of surge protective devices and equipment for the telecommunications, low-voltage power distribution and information technology market sectors. To research and develop new technologies in specific strategic fields with the aim of transferring this technology to foreign markets where manufacturing competitive advantages can best be achieved. To provide users with complete solutions. To strive for the mutual satisfaction of all parties: buyers, employees and business partners alike. To constantly invest in our employees' know-how and acquisition of new skills. To maintain investment in research and development. To strive towards establishing long-term customer relationships. To maintain our company's competitiveness at the level of our business projections and planning. PRODUCTS AND SERVICES Surge protective devices in low-voltage power distribution systems: Category IEC/VDE I/B+C; Category IEC/VDE II/C; for photovoltaic systems; Category IEC/VDE III/D; Category IEC/VDE II/A; for NH fuse holders; for equipotential bonding. For telephone exchanges and terminals: main distribution frame (MDF) connection strips; equipment for high-band communication; protection modules; independent line protection. For data transmission systems: single-pair data transmission; standard BUS data transmission; computer networks; DC power supplies; coaxial transmission; ADSL and VDSL lines; combined protection for end users; protection on PCBs; applications for explosive areas (Ex).
Elements: gas discharge tubes. External lightning protection. Medium-voltage metal-oxide surge arresters. Lightning and overvoltage planning: preparing complete technical solutions for lightning and overvoltage protection according to national and international standards and regulations; mounting of entire systems and supervision; instruction seminars on an integrated approach to effective lightning, earthing and surge protection. BUSINESS EXPERIENCES The high percentage of export sales requires that the company be globally focused. Most export sales are to the European market, but the company also enjoys a strong presence in South-East Asia and South Africa. It is currently working to penetrate Central and South America. In 2003 the company established its first off-shore manufacturing facility, located in Suzhou, China. Iskra enjoys a number of strong OEM partnerships with various large US companies under private labeling agreements. To better serve such customers, the company established its North American office in 2004. The office is based in Cleveland, Ohio, which, being part of the US Midwest, is well positioned to provide logistical sales and technical service support throughout the country. This office also coordinates testing to UL and other mandatory US standards. ISKRA ZAŠČITE d.o.o. Surge Voltage Protection Systems Engineering and Cooperation, Stegne 35, 1521 Ljubljana, Slovenia, EU. Telephone: +386 1 5003 100, Fax: +386 1 5003 236, E-mail: info@iskrazascite.si Published by: Iskra Zaščite, 08.06, www.iskrazascite.si
NOMEN d.o.o. * CORE BUSINESS: BIOMETRIC SECURITY * Other security fields: video & audio surveillance, close protection, education * Training for security personnel, border security guards, anti-terror seminars NOMEN d.o.o. Core Technology
Physical Substrates All life forms rely on information processing to maintain their highly organised state. Macromolecules and supramolecular structures are key to the special properties that set living systems apart from dead matter. The course will adopt an engineering perspective to introduce the molecular biology (proteins, RNA, DNA) and the physics (thermodynamics, kinetics, dynamics) required for understanding the operation of the molecular machinery at work in living cells. On this basis the role and the processing of information at the molecular level will be discussed, covering topics such as noise, molecular motors, conformational switching and intracellular networks, leading to decision making in cells (chemotaxis, development). Throughout the course the potential transfer of concepts from nature to artificial systems will be explored (robustness, self-repair, nano-engineering, molecular computing). The Organisation of Biota Outline Concepts Concepts Thermodynamic Entropy Macroscopic Func. Distinct Diversity Macroscopic Func. Equivalent Diversity Behavioral Uncertainties Entropy of organisation Structural Diversity Context Sensitivity versus Modularity Outline Outline E. coli Lactose Metabolism lac Operon lac Operon lac Operon lac Operon Summary Outline Protein Domains Protein Domains Outline Coupling of Function and Interaction Coupling of Function and Interaction Coupling of Function and Interaction Protein Networks Protein Networks Protein Networks Robustness under random errors Testability Testability
SETCCE - Security Technology Competence Centre E-commerce products and services (globally used and W3C-compliant digital signature components; products for electronic invoicing process management and outsourced services; products and services for trusted electronic archiving; CAs). IT consulting services (company-wide business process de-materialization and optimization; PKI-related security policies; application of digital signatures; ambient intelligence; pervasive systems, etc.) Jožef Stefan Institute - Laboratory for Open Systems and Networks Projects from the 4th, 5th and 6th EU Framework Programmes Projects from the 4th, 5th and 6th EU Framework Programmes Diadem Firewall FAIN Active Node Security Technology Competence Centre (SETCCE)
Sequential Monte Carlo methods Parts 4 and 5 of this lecture are presented in [[mlss07_davy_smcmc|//Manuel Davy's// \"%title\"]]
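The canonical sequential Monte Carlo method is the bootstrap particle filter: propagate particles through the dynamics, weight them by the measurement likelihood, then resample. The scalar random-walk model and the noise levels below are illustrative assumptions, not content from the referenced lecture.

```python
import math
import random

def particle_filter(zs, n=500, q=0.5, r=0.5, seed=0):
    """Track x_t = x_{t-1} + w_t from z_t = x_t + v_t (Gaussian noise with
    standard deviations q and r); return the posterior mean at each step."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]        # initial particle cloud
    means = []
    for z in zs:
        xs = [x + rng.gauss(0.0, q) for x in xs]        # propagate through dynamics
        ws = [math.exp(-0.5 * ((z - x) / r) ** 2) for x in xs]  # likelihood weights
        total = sum(ws)
        ws = [w / total for w in ws]
        means.append(sum(w * x for w, x in zip(ws, xs)))
        xs = rng.choices(xs, weights=ws, k=n)           # multinomial resampling
    return means

est = particle_filter([0.2, 0.4, 0.6, 0.8])
```

Resampling at every step is the simplest schedule; practical SMC implementations resample only when the effective sample size drops, which this sketch omits.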
Stochastic Information Processing in Sensor Networks: Challenges, Some Solutions, and Open Problems ISAS Prof. Dr.-Ing. Uwe D. Hanebeck Stochastic Information Processing in Sensor Networks: Challenges, Some Solutions, and Open Problems Motivation What is a Sensor-Actuator-Network? The Big Picture Research Training Group 1194: Self-Organizing Sensor-Actuator-Networks Stochastic Information Processing Simple Example: Heat Conductor Heat Conductor Individual Sensor Nodes Model-based Reconstruction of Temperature Distribution Reconstruction Model: Partial Differential Equation Spatial and Time Discretization Resulting Network Simulation - Estimation Flow of Information Challenges & Some Solutions Flow of Information Decentralized Reconstruction Problem I: Unknown Correlation (1) Problem I: Unknown Correlation (2) Problem I: Unknown Correlation (3) Problem I: Unknown Correlation (4) Problem I: Unknown Correlation (5) Decentralized Reconstruction - Simulations (1) Decentralized Reconstruction - Simulations (2) Decentralized Reconstruction - Simulations (3) Decentralized Reconstruction - Simulations (4) Problem II: Increasing Complexity (1) Problem II: Increasing Complexity (2) Problem II: Increasing Complexity (3) Problem II: Increasing Complexity (4) So Far ... Framework for Information Processing in Distributed Systems Open Problems Open Problems and Approaches Real-World Examples Calibration of Machine Tools (1) Calibration of Machine Tools (2) Calibration of Machine Tools (3) Estimation of Organ Movement (1) Estimation of Organ Movement (2) Collaborative Control of Robot Teams (1) Collaborative Control of Robot Teams (2) Collaborative Control of Robot Teams (3) Snow Monitoring
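Decentralized fusion with unknown correlation ("Problem I" in the outline above) is commonly handled by covariance intersection, which mixes the two information forms with a convex weight chosen to minimise the fused variance. This scalar sketch, including the fixed weight grid and toy numbers, is an illustrative assumption, not the talk's algorithm.

```python
def covariance_intersection(x1, p1, x2, p2, steps=99):
    """Fuse two estimates (mean, variance) whose cross-correlation is
    unknown, scanning the convex weight w for the smallest fused variance."""
    best = None
    for i in range(1, steps + 1):
        w = i / (steps + 1)
        info = w / p1 + (1.0 - w) / p2          # fused information (1/variance)
        p = 1.0 / info
        x = p * (w * x1 / p1 + (1.0 - w) * x2 / p2)
        if best is None or p < best[1]:
            best = (x, p)
    return best

x, p = covariance_intersection(1.0, 2.0, 3.0, 1.0)
```

In the scalar case the optimum degenerates toward the better estimate; the method earns its keep in the matrix case, where it guarantees a consistent fused covariance for any unknown correlation.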
Interview about past, present, future of MLSS In this interview the VideoLectures.Net team spoke to Bernhard Schölkopf at MLSS 2007 in Tübingen. We were interested in how he sees the **social part of the school**, whether he **still attends the school with the same enthusiasm**, whether the talks were too narrow and specialised or widely comprehensible, and whether **they should invite speakers from different fields such as psychology**...
Efficient Closed Pattern Mining in Strongly Accessible Set Systems Many problems in data mining can be viewed as a special case of the problem of enumerating the closed elements of an independence system w.r.t. some specific closure operator. We consider a generalization of this problem to strongly accessible set systems and arbitrary closure operators. For this more general problem setting, the closed sets can be enumerated with polynomial delay if deciding membership in the set system and computing the closure operator can be solved in polynomial time. We discuss potential applications in graph mining. Efficient Closed Pattern Mining in Strongly Accessible Set Systems Closed Frequent Patterns The Closed Set Mining (CSM) Problem Results on Mining Closed Sets Example: Track Mining Example Generators and Inductive Generators The Closed Set Mining (CSM) Problem Main Result for Strongly Accessible Set Systems Appl. 1: Closed Frequent Itemset Mining Appl. 2: Closed Frequent Connected Subgraph Mining Closed Frequent Connected Subgraph Mining Appl. 3: Closed Frequent Subpath Mining - part 1 Appl. 3: Closed Frequent Subpath Mining - part 2 An Open Problem
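In the frequent-itemset special case mentioned above, the closure operator maps an itemset to the intersection of all transactions containing it, and an itemset is closed iff it equals its closure. The brute-force enumerator below (tiny data, no polynomial-delay machinery) is an illustrative sketch, not the paper's algorithm.

```python
from itertools import combinations

def closure(items, transactions):
    """Intersection of all transactions that contain `items`."""
    covering = [t for t in transactions if items <= t]
    out = covering[0]
    for t in covering[1:]:
        out = out & t
    return out

def closed_frequent(transactions, minsup):
    """Enumerate closed frequent itemsets by closing every frequent candidate."""
    universe = set().union(*transactions)
    closed = set()
    for k in range(1, len(universe) + 1):
        for cand in combinations(sorted(universe), k):
            s = frozenset(cand)
            support = sum(1 for t in transactions if s <= t)
            if support >= minsup:
                closed.add(frozenset(closure(s, transactions)))
    return closed

T = [frozenset("abc"), frozenset("ab"), frozenset("ac")]
cf = closed_frequent(T, 2)
```

The paper's point is that this enumeration can be done with polynomial delay in far more general (strongly accessible) set systems, provided membership tests and the closure operator are polynomial-time.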
Support Vector Machines for Collective Inference Interdependent training instances violate the common assumption of independently drawn examples and render classical learning algorithms an inappropriate choice. Collective inference approaches explicitly incorporate these dependencies by translating the examples into a graph where two training instances are connected if their values depend on each other. We present a support vector approach for collective inference allowing for arbitrary dependencies in the data and report on empirical results. Since exact inference for large graphs is infeasible, we integrate an approximate decoding technique based on loopy belief propagation into the optimization problem. We empirically compare versions of the procedure that are based on exact (using the Hugin algorithm) and approximate decoding (loopy belief propagation and others) in terms of accuracy and execution time. Support Vector Machines for Collective Inference Motivation - part 1 Motivation - part 2 Motivation - part 3 Overview Problem Setting - part 1 Problem Setting - part 2 Markov Random Fields Markov Property - part 1 Markov Property - part 2 Factorization - part 1 Factorization - part 2 Factorization - part 3 Node Features - part 1 Node Features - part 2 Node Features - part 3 Transition Features Inference in MRFs - part 1 Inference in MRFs - part 2 Inference in MRFs - part 3 Inference in MRFs - part 4 Exact Inference The Junction Tree Algorithm - part 1 The Junction Tree Algorithm - part 2 The Junction Tree Algorithm - part 3 Approximate Inference Loopy Belief Propagation - part 1 Loopy Belief Propagation - part 2 LBP in Parallel Parameter Learning in MRFs with SVMs SVM Optimization Criterion - part 1 SVM Optimization Criterion - part 2 SVM Optimization Criterion - part 3 Experiments Empirical Results - part 1 Empirical Results - part 2 Empirical Results - part 3 Empirical Results - part 4 Efficiency Conclusions
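The approximate decoding step above, loopy belief propagation, can be sketched on a tiny binary pairwise MRF with one loop (a triangle). The potentials and the undamped parallel message schedule are illustrative assumptions, not the paper's setup.

```python
def loopy_bp(phi, psi, edges, iters=30):
    """Parallel sum-product LBP on a binary pairwise MRF; phi[i] are node
    potentials, psi is a shared edge potential. Returns node beliefs."""
    nb = {i: set() for i in range(len(phi))}
    for a, b in edges:
        nb[a].add(b); nb[b].add(a)
    # m[(i, j)][x]: node i's current message to neighbour j about state x
    m = {(i, j): [1.0, 1.0] for i in nb for j in nb[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in m:
            vals = []
            for xj in (0, 1):
                s = 0.0
                for xi in (0, 1):
                    prod = phi[i][xi] * psi[xi][xj]
                    for k in nb[i]:
                        if k != j:           # all incoming messages except j's
                            prod *= m[(k, i)][xi]
                    s += prod
                vals.append(s)
            z = sum(vals)
            new[(i, j)] = [v / z for v in vals]   # normalize for stability
        m = new
    beliefs = []
    for i in nb:
        b = [phi[i][x] for x in (0, 1)]
        for k in nb[i]:
            b = [b[x] * m[(k, i)][x] for x in (0, 1)]
        z = sum(b)
        beliefs.append([v / z for v in b])
    return beliefs

phi = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]    # node 0 carries evidence for state 0
psi = [[0.8, 0.2], [0.2, 0.8]]                 # neighbours prefer to agree
bel = loopy_bp(phi, psi, [(0, 1), (1, 2), (0, 2)])
```

On trees this reduces to exact sum-product; on loopy graphs like this triangle the fixed point is only approximate, which is exactly the trade-off the paper measures against junction-tree (Hugin) decoding.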
Probabilistic Modelling of Networks and Pathways The main aim of this workshop is to bring together researchers working on the many faces of these problems, providing a forum for discussion and giving focus to future directions of research. We also aim to involve some experimental biologists in order to foster collaborations between computational and experimental researchers.
The Cost of Learning Directed Cuts Classifying vertices in digraphs is an important machine learning setting with many applications. We consider learning problems on digraphs with three characteristic properties: (i) the target concept corresponds to a directed cut; (ii) the total cost of finding the cut has to be bounded a priori; and (iii) the target concept may change due to a hidden context. As one motivating example, consider classifying intermediate products in some process, e.g. manufacturing cars or the control flow in software, as faulty or correct. The process can be represented by a digraph and the concept is monotone: typical faults that appear in an intermediate product will also be present in later stages of the product. The concept may depend on a hidden variable, as some pre-assembled parts may vary and the fault may occur only for some batches and not for others. In order to trade off the cost of having a faulty product against the cost of finding the cause of the fault, tight performance guarantees for finding the bug are needed.
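On the simplest digraph, a directed path, a monotone target concept is a threshold, so the directed cut can be located with logarithmically many membership queries via binary search. The oracle and stage indices below are hypothetical illustrations of the bounded-query-cost idea, not the paper's algorithm.

```python
def first_faulty(n, is_faulty):
    """Return the index of the first faulty stage on a directed path of n
    stages, assuming the fault is monotone (persists once introduced)."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if is_faulty(mid):
            hi = mid          # the cut is at mid or earlier
        else:
            lo = mid + 1      # stages up to mid are still correct
    return lo                 # returns n if no stage is faulty

calls = []
def oracle(stage):            # hypothetical test oracle: fault enters at stage 6
    calls.append(stage)
    return stage >= 6

idx = first_faulty(10, oracle)
```

On general DAGs the cut is no longer a single threshold and each query may be expensive, which is why the paper needs a priori bounds on the total query cost rather than a simple log-n argument.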
Speeding up Graph Edit Distance Computation with a Bipartite Heuristic In the present paper we aim at speeding up the computation of the exact graph edit distance. We propose to combine the standard tree search approach to graph edit distance computation with a suboptimal procedure. The idea is to use a fast but suboptimal bipartite graph matching algorithm as a heuristic function that estimates the future costs. The overhead of computing this heuristic function is small and easily compensated by the speed-up achieved in tree traversal. Since the heuristic function provides a lower bound on the future costs, the method is guaranteed to return the exact graph edit distance of two given graphs. SPEEDING UP GRAPH EDIT DISTANCE COMPUTATION WITH A BIPARTITE HEURISTIC Outline - part 1 Outline - part 2 Graph Based Representation Graph Edit Distance 1/2 - part 1 Graph Edit Distance 1/2 - part 2 Graph Edit Distance 1/2 - part 3 Graph Edit Distance 1/2 - part 4 Graph Edit Distance 1/2 - part 5 Graph Edit Distance 1/2 - part 6 Graph Edit Distance 2/2 Applications of Graph Edit Distance Complexity of Graph Edit Distance Tree Search Tree Search Heuristics - part 1 Tree Search Heuristics - part 2 The Assignment Problem 1/2 The Assignment Problem 2/2 Munkres' Algorithm Munkres' Algorithm as a Heuristic Node Cost Matrix Bipartite Heuristic Experimental Setup Letter Dataset Image Dataset Molecule Dataset Fingerprint Dataset Summary - part 1 Summary - part 2 Fast Suboptimal Edit Distance 1/2 Fast Suboptimal Edit Distance 2/2 - part 1 Fast Suboptimal Edit Distance 2/2 - part 2 Conclusions 5TH INT. WORKSHOP ON MLG, FIRENZE, 2007 Kaspar Riesen and Horst Bunke, riesen@iam.unibe.ch, Institute of Computer Science and Applied Mathematics, University of Bern, Switzerland Outline: graph edit distance; Munkres' algorithm; tree search for graph edit distance; Munkres' algorithm as a heuristic for graph edit distance;
experimental results; conclusions. Speeding up Graph Edit Distance Computation with a Bipartite Heuristic, August 1, 2007. Main contribution: we provide a new heuristic for speeding up graph edit distance computation. Graph Based Representation: a graph g is defined by the 4-tuple g = (V, E, mu, nu), where V is the finite set of nodes, E ⊆ V × V is the set of edges, mu : V → L is the node labeling function, and nu : E → L is the edge labeling function; L may be {1, 2, 3, ...}, R^n, or a set of symbolic labels. Graph Edit Distance 1/2: define the dissimilarity of graphs by the minimum amount of distortion that is needed to transform one graph into another; the edit operations e_i consist of deletions, insertions, and substitutions of nodes and edges. Let g1 = (V1, E1, mu1, nu1) be the source graph and g2 = (V2, E2, mu2, nu2) be the target graph. The graph edit distance between g1 and g2 is defined by d(g1, g2) = min over (e1, ..., ek) in gamma(g1, g2) of sum_{i=1..k} c(e_i), where gamma(g1, g2) denotes the set of edit paths transforming g1 into g2, and c denotes the edit cost function measuring the strength c(e_i) of edit operation e_i. Graph edit distance provides us with a general dissimilarity model for graphs. Applications of Graph Edit Distance: classifiers applicable in the graph domain (k-NN classifier); edit-distance-based graph kernels, e.g. trivial graph kernels in conjunction with SVM such as kappa(g, g') = exp(-d(g, g')), and graph kernels based on graph edit distance such as the Random Walk Edit Kernel [Neuhaus, 2006]; graph embedding in real vector spaces by means of prototype selection [Riesen and Bunke, 2007]; graph clustering. Complexity of Graph Edit Distance: in contrast with exact graph matching algorithms, the nodes of the source graph can potentially be mapped to any node of the target graph; the computational complexity of edit distance is exponential in the number of nodes of the involved graphs (for graphs with unique node labels the complexity is linear). Graph edit distance is usually computed by a tree search algorithm which explores the space of all possible mappings of the nodes and edges of g1 to the nodes and edges of g2. Note that edit operations on edges are implied by edit operations on their adjacent nodes. Tree Search: the underlying search space is a tree; the search tree is constructed dynamically at runtime by creating successor nodes linked by edges to the currently considered node; a heuristic function is usually used to determine the node p used for further expansion. Tree Search Heuristics: for each node p in the search tree, g(p) + h(p) is computed, where g(p) is the cost of the partial edit path accumulated so far and h(p) is an estimated lower bound on the costs from p to a leaf node; h(p) = 0 is efficient but inaccurate, while h(p) = exact GED to a leaf node is accurate but inefficient. How do we estimate a lower bound on the future cost both efficiently and accurately? The Assignment Problem: find an optimal assignment of the n elements of a set S1 = {u1, ..., un} to the n elements of a set S2 = {v1, ..., vn}; let c_ij be the cost of the assignment u_i to v_j, giving an n × n matrix (c_ij) of assignment costs; the optimal assignment is a permutation p = (p1, ..., pn) of the integers 1, ..., n that minimizes sum_{i=1..n} c_{i,p_i}, i.e. the problem is to find a set of n independent elements of (c_ij) whose sum is minimum. Munkres' Algorithm: Munkres' algorithm finds the best, i.e. the minimum-cost, assignment in O(n^3) time; it finds an n × n matrix (b_ij), equivalent to the initial one (a_ij), having n independent zero elements. Munkres' Algorithm as a Heuristic: the problem of estimating a lower bound h(p) on the costs from the current node p to a leaf node can be seen as an assignment problem: how can one assign the unprocessed nodes of graph g1 to the unprocessed nodes of g2 such that the resulting edit costs are minimal? Node Cost Matrix: let V1 = {u1, ..., un} and V2 = {v1, ..., vm} be the unprocessed nodes of g1 and g2; define an (n + m) × (n + m) node cost matrix C_n whose upper-left corner represents the costs of all possible node substitutions and whose upper-right and lower-left diagonals represent the costs of all possible node deletions and insertions. Bipartite Heuristic: we construct an edge cost matrix C_e analogously; for each open node p in the search tree we run Munkres' algorithm twice, once with C_n and once with C_e; the accumulated minimum cost of both assignments serves as a lower bound on the future costs to reach a leaf node: h(p) = Munkres(C_n) + Munkres(C_e). Experimental Setup: we use four different graph datasets (Letter, Image, Fingerprint, and Molecule), compute the edit distance between graphs with and without the bipartite heuristic, and measure the mean computation time and the mean number of open paths in the search tree during the graph matching process. Letter Dataset: graphs representing capital letter line drawings, 15 classes, 562,500 matchings, mean |V| = 4.6, mean |E| = 4.5; Plain-A*: 465 ms, 478 open paths; BP-A*: 14 ms, 72 open paths. Image Dataset: graphs representing images, 5 classes (city, countryside, people, snowy, streets), 26,244 matchings, mean |V| = 2.7, mean |E| = 2.4. Molecule Dataset: graphs representing molecules, 2 classes (active and inactive), 21,300 matchings, mean |V| = 5.5, mean |E| = 4.7. Fingerprint Dataset:
Graphs representing ?ngerprint images, 4 classes (arch, left loop, right loop, whorl), 65 025 matchings, ?|V | = 5.4, ?|E| = 4.4 Summary ? Thanks to the bipartite heuristic we can achieve signi?cant speed-ups for exact graph edit distance. ? Further speed-ups can be achieved if we resort to suboptimal algorithms. ? Thanks to the bipartite heuristic we can achieve signi?cant speed-ups for exact graph edit distance. ? Further speed-ups can be achieved if we resort to suboptimal algorithms. ? Transform the bipartite heuristic h(p) into a suboptimal graph matching procedure. Fast Suboptimal Edit Distance 1/2 ? De?ne node cost matrix for whole graphs g1 and g2 . ? Munkres? algorithm ?nds the optimal node assignment by considering node operations or the local structure only. ? The implied edge operations are added at the end of the computation. ? Consequently, the edit distance found by Munkres? algorithm need not necessarily correspond to the exact edit distance. ? However, a signi?cant speed-up can be expected. ? Future Work: Find out whether or not the suboptimal distance remains suf?ciently accurate for pattern recognition and machine learning applications. Conclusions ? We propose a new heuristic based on Munkres? algorithm for speeding up graph edit distance. ? Our heuristic ?nds an optimal node and an optimal edge assignment for the unprocessed nodes and edges of both graphs in polynomial time. ? Our heuristic helps in speeding up exact graph edit distance substantially. ? The proposed heuristic can also be used for fast suboptimal graph matching.
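To make the node-cost construction concrete, here is a small stdlib-only Python sketch (not the authors' code): it builds the (n + m) × (n + m) matrix described above for simple label-equality costs and, for transparency, solves the assignment by brute force over permutations where Munkres' algorithm would be used in practice. The unit costs are illustrative assumptions.

```python
import itertools

BIG = 10**6  # stands in for the "forbidden" entries of the cost matrix

def node_cost_matrix(labels1, labels2, sub=1.0, dele=1.0, ins=1.0):
    """(n+m) x (n+m) matrix: substitutions in the upper-left block,
    deletions on the diagonal of the upper-right block, insertions on the
    diagonal of the lower-left block, zero cost for dummy-to-dummy."""
    n, m = len(labels1), len(labels2)
    size = n + m
    C = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            C[i][j] = 0.0 if labels1[i] == labels2[j] else sub
        for j in range(m, size):
            C[i][j] = dele if j - m == i else BIG
    for i in range(n, size):
        for j in range(m):
            C[i][j] = ins if i - n == j else BIG
    return C

def assignment_bound(labels1, labels2):
    """Minimum-cost assignment over the node cost matrix. Munkres runs in
    O(size^3); here we brute-force permutations (tiny inputs only)."""
    C = node_cost_matrix(labels1, labels2)
    idx = range(len(C))
    return min(sum(C[i][p[i]] for i in idx)
               for p in itertools.permutations(idx))
```

Because the assignment ignores how edges constrain the node mapping, its cost never exceeds the true edit cost, which is exactly what makes it usable as an admissible h(p).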
Random Walks

All life forms rely on information processing to maintain their highly organised state. Macromolecules and supramolecular structures are key to the special properties that set living systems apart from dead matter. The course will adopt an engineering perspective to introduce the molecular biology (proteins, RNA, DNA) and the physics (thermodynamics, kinetics, dynamics) required for understanding the operation of the molecular machinery at work in living cells. On this basis the role and the processing of information at the molecular level will be discussed, covering topics such as noise, molecular motors, conformational switching and intracellular networks, leading to decision making in cells (chemotaxis, development). Throughout the course the potential transfer of concepts from nature to artificial systems will be explored (robustness, self-repair, nano-engineering, molecular computing).

Robustness and Adaptation in Biochemical Networks of Bacterial Chemotaxis. Slides: Outline; Motion of E. coli; E. coli motion and the structure of the flagellar apparatus; How fast do bacteria have to swim? (three slides); Diffusion times in water; Sensory apparatus that triggers motion; E. coli chemotactic pathway; Event sequence for triggering of flagellar motion; Phosphorylation and dephosphorylation events; Methylation events at the receptor; State transitions for signal transduction; Adaptation in bacterial chemotaxis; Role of methylation in adaptation; Response characteristics to be captured in the model (parts 1-2); Simplified model of Barkai-Leibler (several slides); Barkai-Leibler model; Simplified Barkai-Leibler analysis; Barkai-Leibler analysis - steady state; Perfect adaptation; Modules, networks, robustness and all that ...
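The point of the Barkai-Leibler steady-state analysis, that adapted activity is set by rate constants alone and is independent of ligand level, can be illustrated with a deliberately minimal Euler simulation. The activity function a = m / (1 + L) and the rate constants below are illustrative assumptions, not the model from the lecture.

```python
def simulate(kR=1.0, kB=2.0, L_step=10.0, t_step=5.0, t_end=50.0, dt=0.01):
    """Toy perfect adaptation: methylation m is added at constant rate kR
    and removed at rate kB * activity, so the only steady state has
    activity = kR / kB, whatever the ligand level L."""
    m = kR / kB                      # start at the L = 0 steady state
    trace = []
    t = 0.0
    while t < t_end:
        L = L_step if t >= t_step else 0.0
        a = m / (1.0 + L)            # ligand binding lowers activity
        m += dt * (kR - kB * a)      # methylation slowly restores it
        trace.append(a)
        t += dt
    return trace

activity = simulate()
```

After the ligand step at t = 5 the activity drops sharply and then relaxes back to kR / kB = 0.5, reproducing the transient-response-plus-exact-return behaviour the model is designed to capture.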
Least squares estimation of a transcription regulation model

The way transcription factors regulate the activity of their target genes is of much interest. Several authors have recently used model-based computational approaches to infer concentrations of transcription factor proteins from high-throughput gene expression data [1-3,6-7]. Here, I present an approach explicitly formulated to model periodic biological phenomena, and a least squares framework for parameter estimation of such a model. Such a computational strategy can be used to infer levels of transcription factor activities at the protein level using genes that are regulated by single transcription factors, and then to decipher the 'transcriptional logic' of genes under regulation by the combined actions of multiple transcription factors.

Least squares estimation of a transcription regulation model. Mahesan Niranjan, The University of Sheffield. PMNP, Sheffield, Sept 2007.

Motivation. To join the community-wide goal: "There exists no machine learning / statistical inference algorithm that has not been applied to microarray gene expression data." Subject to the constraint: Comparison of computational methods for the identification of cell cycle-regulated genes, U. De Lichtenberg, L. Jensen, A. Fausboll, T. Jensen, P. Bork & S. Brunak, Bioinformatics, 21(7), 1164-1171 (2005).

Overview. Modelling transcription factor - target gene interactions; a short review of some models (Luscombe, Sanguinetti, Barenco, Rogers, Lawrence, Khanin, ... in no particular order) and some lessons we may learn from them; my simple model for cell cycle data; how to estimate its parameters; how to do useful things with it.

Inferring subnetworks from perturbed expression profiles. D. Pe'er, A. Regev, G. Elidan & N. Friedman, Bioinformatics, Vol. 17:S, p. 215-224, 2001. Microarray expression profiles; Bayesian belief network.

Genomic analysis of regulatory network dynamics reveals large topological changes. N. Luscombe, M. Babu, H. Yu, M. Snyder, S. Teichmann & M. Gerstein, Nature, Vol. 431, 16 September 2004. Known regulations plus microarray co-expression. BUT mRNA and protein levels do not correlate: Ideker et al., Science 292 (2001); Washburn et al., PNAS 100 (2003); Griffin et al., Mol. Cell. Proteomics 1.4 (2002).

Ranked prediction of p53 targets using hidden variable dynamic modeling. M. Barenco, D. Tomescu, D. Brewer, R. Callard, J. Stark & M. Hubank, Genome Biology 2006, 7:R25.

dx_j(t)/dt = B_j + S_j f(t) - D_j x_j(t)

(B_j: baseline transcription, S_j: sensitivity, f(t): transcription factor protein level, D_j: mRNA decay rate.) Learn the p53 protein level from the model (and known targets); use it to predict novel targets of p53. Non-linear interactions:

dx_j(t)/dt = B_j + S_j f(t) / (f(t) + γ) - D_j x_j(t)

Statistical reconstruction of transcription factor activity using Michaelis-Menten kinetics. R. Khanin, V. Vinciotti, M. Mersinias, C. Smith, E. Wit, Biometrics, Volume 63, Issue 3, pages 816-823, September 2007. Maximum likelihood.

Bayesian model-based inference of transcription factor activity. S. Rogers, R. Khanin & M. Girolami, BMC Bioinformatics 2007; 8(Suppl 2): S2. Fully Bayesian.

Gaussian process approximation: assume a Gaussian process prior on the protein dynamics; the target gene response is then also a Gaussian process. Modelling transcriptional regulation using Gaussian processes. N. Lawrence, G. Sanguinetti & M. Rattray, NIPS 2006. Leads to efficient computation, and to a continuous process in protein space.

Two problems with Barenco et al. (2006). A. Targets of p53 used in building the model have other regulators (DDB2: C/EBP-beta, E2F1, E2F3, ...; p21WAF1/CIP1: SREBP-1a, Sp1, Sp3, ...; also SESN1/hPA26, BIK, TNFRSF10b). B. Not much overlap with (independent) experimentally determined targets: of 50 predictions and 120 experimental targets (A Global Map of p53 Transcription-Factor Binding Sites in the Human Genome, C. Wei et al., Cell 124, 207-219, January 13, 2006), only 10 overlap. BUT, synonyms were not checked for!

Network component analysis: reconstruction of regulatory signals in biological systems. J. Liao, R. Boscolo, Y. Yang, L. Tran, C. Sabatti & V. Roychowdhury, PNAS, December 23, 2003, vol. 100, no. 26, 15522-15527. Expressions = connectivity/sensitivity × transcription factor activities: X = A P. PCA: columns of P orthogonal. ICA: columns of P statistically independent (joint distribution is the product of the marginal distributions).

Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. A. Boulesteix, K. Strimmer, Theor Biol Med Model 2005, 2(23).

A probabilistic dynamical model for quantitative inference of the regulatory mechanism of transcription. G. Sanguinetti, M. Rattray & N. Lawrence, Bioinformatics 2006, 22(14):1753-1759. Protein dynamics is a first-order autoregression, and gene expression is a linear function of all transcription factors that bind; maximum likelihood estimation of the parameters:

f_m(t) = γ_m f_m(t-1) + v_m(t)
x_n(t) = Σ_{m=1}^{q} w_nm b_nm f_m(t) + B_n + ε_n(t)

A model for regulation in the cell cycle. Basic ideas: regulating transcription factor proteins are taken to be periodic; a simple observation model (linear plus time delay); stochastic excitation of a vector-quantized protein oscillator; noise in the system. Protein dynamics:

f_P(t) = a1 f_P(t-1) + a2 f_P(t-2) + stochastic excitation

Target gene (one regulator):

x_j(t) = α_j f_P(t - T_j) + v_j(t),  j = 1, 2, ..., J

Code Excited Linear Prediction: a codebook of stochastic excitation sequences drives a linear filter (terms a1 z^-1 + a2 z^-2) modelling the protein dynamics; the output, scaled and delayed, plus a noise term, gives the mRNA expression of the target gene. The stochastic process is being modelled, but we have seen just one realization of it; the codebook approximates the noise sequence.

Estimation. Use target genes regulated by one and only one transcription factor. The filter is fixed (we know it is the cell cycle). Fill the stochastic codebook with random numbers. For each sequence in the codebook (128 × time): time-align the target gene expression profiles; solve a 2 × 2 linear system for the mRNA and noise amplitudes; search for the excitation giving minimum error and save the TF protein profile.

It kind of works as expected. (Plots: inferred ACE2 protein profile against ACE2 mRNA and target mRNAs SCW11 and OXA1; SWI4 protein against SWI4 mRNA and YGR151C mRNA.)

Validation. Statistical methods for identifying yeast cell cycle transcription factors. H. Tsai, H. Lu & W. Li, PNAS, 2005 September 20; 102(38): 13532-13537. 19 transcription factors are known to be cell-cycle regulated (clear literature evidence); 31 others are plausible (50 in total). 77 transcription factors regulate genes regulated by one and only one factor. Rank according to how periodic the mRNA and protein profiles are. True positives: ACE2, DIG1, FKH1, FKH2, HIR1, HIR2, MBP1, MCM1, MET31, MET4, NDD1, STB1, STE12, SWI4, SWI5, SWI6, TEC1, YAP5, YOX1. (Plot: true positives versus false positives for mRNA- and protein-based rankings.)

Use to infer cooperative regulation. With M regulating transcription factors:

x_j(t) = Σ_{m=1}^{M} w_m f_m(t - T_j)

Of interest are the signs of the weights (activators / repressors); mRNA co-expression cannot tell the whole story.

Multiple regulators. Genes regulated by two (and exactly two) transcription factors:

x_j(t) = w1 f1(t - T) + w2 f2(t - T)

(Diagram: mRNA correlation among genes G1, G2, G3, ..., G60 sharing regulators TF1 and TF2; similarly for 3, 4, 5, ... regulators.)

Summary. A simple model for cell cycle regulation; an estimation algorithm; validation via enhanced periodicity at the regulating protein level; potential application to cooperative regulation.
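The observation model x_j(t) = α_j f_P(t - T_j) + v_j(t) admits a simple estimation step that can be sketched independently of the codebook search: given a candidate protein profile, recover the gain and delay of a single-regulator target by closed-form least squares over a grid of delays. This is a toy illustration under those assumptions, not Niranjan's estimation code.

```python
import math

def fit_delay_and_gain(protein, target, max_delay):
    """Grid-search the delay T; for each T solve the one-dimensional least
    squares problem x(t) ~ alpha * f(t - T) in closed form; keep the
    (T, alpha) pair with the smallest residual."""
    best = None
    for T in range(max_delay + 1):
        f = protein[:len(protein) - T] if T else protein
        x = target[T:]
        alpha = sum(a * b for a, b in zip(f, x)) / sum(a * a for a in f)
        err = sum((b - alpha * a) ** 2 for a, b in zip(f, x))
        if best is None or err < best[2]:
            best = (T, alpha, err)
    return best

# Synthetic check: a sinusoidal "protein" and a target delayed by 3 steps
# with gain 2 (period 20, chosen arbitrarily for the illustration).
protein = [math.sin(2 * math.pi * t / 20) for t in range(50)]
target = [2.0 * math.sin(2 * math.pi * (t - 3) / 20) for t in range(50)]
T, alpha, err = fit_delay_and_gain(protein, target, 6)
```

On this noiseless example the search recovers T = 3 and alpha = 2 exactly; with observation noise the residual err would be nonzero but the minimum still sits at the true delay.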
Reverse engineering gene and protein regulatory networks using graphical models: a comparative evaluation study

One of the major goals in systems biology is to infer the architecture of biochemical pathways and regulatory networks from postgenomic data, such as microarray gene expression and cytometric protein expression data. Various reverse engineering machine learning methods have been proposed in the literature, and it is important to understand their relative merits and shortcomings. In the talk the learning performances of three graphical-model-based machine learning methods, namely relevance networks, Gaussian graphical models, and Bayesian networks, are cross-compared on real cytometric protein data and simulated data from the RAF signalling pathway. Relevance networks are based on pairwise association scores and are straightforward to implement, but the inference is not done in the context of the whole system, and there is no way to distinguish between direct and indirect associations. Both shortcomings are addressed by Gaussian graphical models, where the partial correlation between two variables, conditional on all the other domain variables, is employed as the association score. Bayesian networks are more flexible probabilistic graphical models for conditional dependence and independence relations; they are based on directed acyclic graphs and can be exploited to analyse interventional data for identifying putative causal interactions. The empirical results were obtained by applying the shrinkage estimator of Schäfer and Strimmer (2005) to compute the inverse covariance matrix for the Gaussian graphical models, while Bayesian network inference was done by sampling networks from the posterior distribution with order Markov chain Monte Carlo (MCMC), as proposed by Friedman and Koller (2003). The experimental results were obtained by analysing data from the RAF protein signalling network reported in Sachs et al. (2005), which describes the interaction of eleven phosphorylated proteins and phospholipids in human immune system cells. A distinction was made between real cytometric protein activity measurements reported in Sachs et al. (2005) and synthetically generated data, as well as between purely observational and interventional data. Observational data are obtained by passively monitoring the system without any interference, while interventional data are obtained by actively manipulating variables, e.g. using gene knock-out experiments. Detailed results of this empirical study have been published in Werhli et al. (2006) and Grzegorczyk (2007). The three main findings can be summarized as follows. First, on Gaussian observational data, Bayesian networks and Gaussian graphical models were found to outperform relevance networks. Second, for observational data no significant difference between Bayesian networks and Gaussian graphical models was observed. Third, only for interventional data did Bayesian networks clearly outperform the other two approaches.

Reverse engineering gene and protein regulatory networks using graphical models: a comparative evaluation study. Slides: Systems biology - learning signalling pathways and regulatory networks from postgenomic data (parts 1-7); Reverse engineering of regulatory networks (parts 1-3); Three widely applied methodologies. Relevance networks (Butte and Kohane, 2000) (parts 1-4); Pairwise associations without taking the context of the system into consideration (parts 1-2). Graphical Gaussian models; Shrinkage estimation of the covariance matrix (Schäfer and Strimmer, 2005) (parts 1-2); Further drawbacks. Bayesian networks; Bayesian networks versus causal networks (parts 1-2); Bayesian networks (parts 1-3); Learning the network structure; MCMC sampling of Bayesian networks; Order MCMC (Friedman and Koller, 2003); Equivalence classes of BNs; CPDAG representations (parts 1-2); Interventional data; Evaluation of performance; Probabilistic inference - DGE; Probabilistic inference - UGE (parts 1-2); Probabilistic inference (parts 1-2); Evaluation 1: AUC scores; Area under the Receiver Operating Characteristic (ROC) curve; Evaluation 2: TP scores (parts 1-3); Evaluation (parts 1-3); Evaluation: RAF signalling pathway; "Gold standard RAF pathway" according to Sachs et al. (2004); RAF pathway; Data; Expression data; Two types of experiments (parts 1-2); Gaussian simulated data; Netbuilder simulated data (parts 1-3); Experimental results: synthetic data, observations; synthetic data, interventions; cytometry data, observations; cytometry data, interventions; Area under the ROC curve; Number of TPs for FP = 5 fixed; How can we explain the difference between synthetic and real data?; Disputed structure of the gold-standard network; Complications with real data; Stabilisation through negative feedback loops; Conclusions 1-2; Additional analysis I: RAF pathway (parts 1-3); CPDAGs of networks; Some additional analysis II; Thank you; References.
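The contrast between relevance networks and Gaussian graphical models comes down to marginal versus partial correlation. A stdlib-only sketch on synthetic data (illustrative, not the study's data): in a chain X → Y → Z, the marginal correlation between X and Z is large, but the first-order partial correlation given Y nearly vanishes, which is exactly the distinction that lets a GGM separate direct from indirect associations.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def partial_corr(a, b, given):
    """First-order partial correlation r_ab.c from pairwise correlations."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, given), pearson(b, given)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))

rng = random.Random(0)
N = 5000
X = [rng.gauss(0, 1) for _ in range(N)]
Y = [x + 0.5 * rng.gauss(0, 1) for x in X]   # Y depends on X
Z = [y + 0.5 * rng.gauss(0, 1) for y in Y]   # Z depends on Y only

marginal = pearson(X, Z)          # large: indirect association via Y
direct = partial_corr(X, Z, Y)    # near zero: no direct X-Z edge
```

In the full GGM setting the partial correlations condition on all remaining variables at once, via the (shrinkage-estimated) inverse covariance matrix, rather than on a single variable as in this two-step illustration.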
Improving frequent subgraph mining in the presence of symmetry

The difficulty of the frequent subgraph mining problem arises from the tasks of enumerating the subgraphs and calculating their support in the dataset. If the dataset graphs carry additional information in the form of labels, these problems can be solved quite easily. However, if the dataset graphs are unlabeled or have only a few labels, then the complexity of these problems greatly reduces the number and sizes of the dataset graphs that can be managed. Thus far, researchers working on the frequent subgraph mining problem have given little attention to such datasets, and current algorithms tend to do poorly on them. Yet, there are many applications which deal with this type of data, mainly in the field of computer vision, where the data is structured as 2D or 3D meshes [8], or communication/transportation networks, where the information is mostly topological.
DIGDAG, a first algorithm to mine closed frequent embedded sub-DAGs

Although tree and graph mining have attracted a lot of attention, there are nearly no algorithms devoted to DAG mining, whereas many applications are in dire need of such algorithms. We present in this paper DIGDAG, the first algorithm capable of mining closed frequent embedded sub-DAGs. This algorithm combines efficient closed frequent itemset algorithms with novel techniques in order to scale up to complex input data.
Abductive Stochastic Logic Programs for Metabolic Network Inhibition Learning

We revisit an application developed originally using Inductive Logic Programming (ILP) by replacing the underlying Logic Program (LP) description with Stochastic Logic Programs (SLPs), one of the underlying Probabilistic ILP (PILP) frameworks. In both the ILP and PILP cases a mixture of abduction and induction is used. The abductive ILP approach used a variant of ILP for modelling inhibition in metabolic networks. The example data were derived from studies of the effects of toxins on rats using Nuclear Magnetic Resonance (NMR) time-trace analysis of their biofluids, together with background knowledge representing a subset of the Kyoto Encyclopedia of Genes and Genomes (KEGG). The ILP approach learned logic models from non-probabilistic examples. The PILP approach applied in this paper is based on a general approach to introducing probability labels within a standard scientific experimental setting involving control and treatment data. Our results demonstrate that the PILP approach not only leads to a significant decrease in error accompanied by improved insight from the learned result, but also provides a way of learning probabilistic logic models from probabilistic examples.

Learning Metabolic Network Inhibition using Abductive Stochastic Logic Programming. Slides: Summary; Metabolic network; Excerpt of the rat metabolic network; Introducing ILP; Induction and abduction; Problem; Experiment setting; Important point; Background knowledge; Background knowledge - partial network; Observations after 8 hours of injection; Discovered abducibles; Introducing SLPs; Reformulating the problem; Significance of this approach; Novelty of our work; Extracting probabilistic examples from scientific data; Prediction accuracy of CSLP vs PSLP; Abductive SLP model; Learned metabolic network with probabilistic SLP; Conclusions and discussion.
An Efficient Sampling Scheme For Comparison of Large Graphs

As new graph-structured data is being generated, graph comparison has become an important and challenging problem in application areas such as molecular biology, telecommunications, chemoinformatics, and social networks. Graph kernels have recently been proposed as a theoretically sound approach to this problem, and have been shown to achieve high accuracies on benchmark datasets. Different graph kernels compare different types of subgraphs in the input graphs. So far, the choice of subgraphs to compare is rather ad hoc and is often motivated by runtime considerations. There is no clear indication that certain types of subgraphs are better than others. On the other hand, comparing all possible subgraphs has been shown to be NP-hard, thus making it practically infeasible. These difficulties seriously limit the practical applicability of graph kernels. In this article, we attempt to rectify the situation and make graph kernels applicable for data mining on large graphs and large datasets. Our starting point is the matrix reconstruction theorem, which states that any matrix of size 5 or above can be reconstructed given all its principal minors. By applying this to the adjacency matrix of a graph, we recursively define a graph kernel and show that it can be efficiently computed by using the distribution of all size-4 subgraphs of a graph. This distribution, we argue, is similar to a sufficient statistic of the graph, especially when the graph is large. Exhaustive enumeration of these subgraphs is prohibitively expensive, scaling as O(n^4). But, by bounding the deviation of the empirical estimates of the distribution from the true distribution, it suffices to sample a fixed number of subgraphs. Incidentally, our bounds are stronger than those found in the bioinformatics literature for similar techniques. In our experimental evaluation, our graph kernel outperforms state-of-the-art graph kernels both in terms of runtime and classification accuracy.
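A minimal sketch of the size-4 subgraph distribution the abstract refers to (illustrative, not the authors' implementation): enumerate 4-node subsets, label each induced subgraph by its sorted degree sequence (which happens to distinguish all 11 isomorphism types of 4-node simple graphs), and compare two graphs by the inner product of the resulting distributions. For large graphs one would replace the exhaustive `itertools.combinations` enumeration with a fixed number of random 4-node samples, which is what the paper's deviation bounds justify.

```python
import itertools
from collections import Counter

def graphlet4_distribution(adj):
    """Distribution over induced 4-node subgraph types, each type keyed by
    the sorted degree sequence of the induced subgraph. `adj` maps each
    node to its set of neighbours."""
    counts = Counter()
    total = 0
    for quad in itertools.combinations(adj, 4):
        degs = sorted(sum(1 for v in quad if v in adj[u]) for u in quad)
        counts[tuple(degs)] += 1
        total += 1
    return {t: c / total for t, c in counts.items()}

def kernel(p, q):
    """Inner product of two graphlet distributions."""
    return sum(pv * q.get(t, 0.0) for t, pv in p.items())

def ring(n):     # cycle graph as an adjacency-set dict
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def clique(n):   # complete graph
    return {i: set(range(n)) - {i} for i in range(n)}
```

For example, every 4-subset of a clique induces K4 (degree sequence (3,3,3,3)), so its distribution is a point mass, while a cycle spreads mass over path-like graphlet types; the kernel value between the two is correspondingly small.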
Fast Inference in Infinite Hidden Relational Models

Relational learning is an area of growing interest in machine learning (Dzeroski & Lavrac, 2001; Friedman et al., 1999; Raedt & Kersting, 2003). Xu et al. (2006) introduced the infinite hidden relational model (IHRM), which views relational learning in the context of the entity-relationship database model with entities, attributes and relations (compare also Kemp et al., 2006). In the IHRM, a latent variable is introduced for each entity. The latent variable is the only parent of the other entity attributes and is a parent of relationship attributes. The number of states in each latent variable is entity-class specific. Therefore it is sensible to work with Dirichlet process (DP) mixture models, in which each entity class can optimize its own representational complexity in a self-organized way. For our discussion it is sufficient to say that we integrate a DP mixture model into the IHRM by simply letting the number of hidden states for each entity class approach infinity. Thus, a natural outcome of the IHRM is a clustering of the entities, providing interesting insight into the structure of the domain.
Inferring vertex properties from topology in large networks Network topology not only tells about tightly-connected ?communities,? but also gives cues on more subtle properties of the vertices. We introduce a simple probabilistic latent-variable model which finds either latent blocks or more graded structures, depending on hyperparameters. With collapsed Gibbs sampling it can be estimated for networks of 106 vertices or more, and the number of latent components adapts to data through a Dirichlet process prior. Applied to the social network of a music recommendation site (Last.fm), reasonable combinations of musical genres appear from the network topology, as revealed by subsequent matching of the latent structure with listening habits of the participants. The advantages of the generative nature of the model are explicit handling of uncertainty in the sparse data, and easy interpretability, extensibility, and adaptation to applications with incomplete data. Inferring vertex properties from topology in large networks Contents Interactions as networks Problem setting Example of structure Generative modeling Latent component model Illustration of the model - part 1 Illustration of the model - part 2 Illustration of the model - part 3 Illustration of the model - part 4 Illustration of the model - part 5 Parameter inference Infinite mixture Generative process Inferring components Joint distribution Conditional probability Example 1: Football network Football result Example 2: Last.fm Last.fm result Conclusion Future work Inferring vertex properties from topology in large networks Janne Aukia (Xtract Ltd) with: Janne Sinkkonen (Xtract Ltd) (Helsinki University of Technology) Samuel Kaski Contents Overview of the problem Background Generative modeling In?nite components Sampling Latent component model Results Interactions as networks Many types of interactions can be represented as large networks Friendships between people, protein interactions, web pages... 
Missing data and imprecise relationships Nodes and edges are often unlabeled Dense groups of nodes Number of links between nodes (degree) varies Networks have often some type of structure Problem setting How to ?nd the underlying factors which can explain network structure for a single, unlabeled, large graph? Some previous approaches Community detection (Newman & Girvan 2004) Machine learning (Airoldi et al. 2006, Handcock et al. 2007) Our approach A latent component model Generative model for constructing edges in graphs Optimized with collapsed Gibbs sampling Usable on networks with millions of nodes [1]?Airoldi E. M., Blei D. M., Fienberg S. E., Xing E. P. (2006). Mixed-membership stochastic block models for relational data with application to protein-protein interaction. [2]?Handcock M. S. and Raftery A. E. (2007). Model-based clustering for social networks. J. R. Statist. Soc. A 170, 1?22. [3] Newman, M. E.J. and Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69:026113. Example of structure A collaboration network of jazz musicians? has community structure Components found with the latent component algorithm [1] P. Gleiser and L. Danon, Adv. Complex Syst. 6, 565 (2003). Data at: http://deim.urv.cat/~aarenas/data/welcome.htm Generative modeling A generative model can generate samples of the data it represents from a set of parameters ?Cooking recipe? 
Models are often hierarchical Bayesian methods can be used to infer model parameters from a sample Latent component model Each node belongs to a number of latent components Mixture of components A component is selected based on the component probabilities Edge endpoints are selected based on the probability of the endpoint in the component Generative model, for each edge: Probabilities for components and nodes in components are drawn from Dirichlet distributions Illustration of the model West company North company East company Parameter inference In?nite mixture A crucial feature in latent component models is to learn the number of components required Can be achieved by using a Dirichlet process (DP) DP corresponds to Dirichlet distribution with in?nite components In practice, leads to a ?nite number of components Estimates the amount of components from data ? However, hyperparameter (?) remains Generative process Full generative process for the in?nite component model: mz 1. Draw ? from DP (?) 2. For each component z in C components: (a) Draw mz from Dir(?) 3. For each of L edges: (a) Draw a latent component z from ? (b) Draw ?rst end point ni from mz (c) Draw second end point nj form mz C ni z nj L Inferring components From the full model and its joint distribution, latent components can be found using Bayesian inference A form of unsupervised learning Because of the Dirichlet priors, the inference is tractable and can be easy to compute Components can be found with EM optimization or full MCMC inference EM seems to converge to bad local minima Gibbs sampling, a form of MCMC, gives better results An effective implementation with collapsed Gibbs sampling Latent variables marginalized away, only counts remain! Joint distribution The joint probability distribution for the in?nite mixture model: pDP (L, Z, m|?, ?) = p(L|Z, m) ? p(m|?) ? p(Z|?) = iz mkzi zi m??1 2E!?C ? iz zi C ? D(E, ?) C!?2N z nz ?2N = ?(? + 1) . . . (? + 2N ? 1). 
Conditional probability. Sampling is implemented with a Gibbs sampler: the conditional probability for each edge is conditioned on all the other edges, with the unknown parameters marginalized away. Component probabilities for the left-out edge: p(z | i, j) ∝ (k_zi + β)/(2n_z + Mβ) · (k_zj + β)/(2n_z + 1 + Mβ) · C(n_z, α), where k_zi is the number of endpoints of node i in component z, n_z is the number of edges in component z, M is the number of nodes, and C(n_z, α) = n_z if n_z ≠ 0 and C(0, α) = α (the case of a new component). In every iteration, a component is sampled for each edge based on the conditional probabilities. Example 1: Football network. The football network [1] depicts American college football games during the fall 2000 season: 115 nodes (teams) and 613 edges (games). A standard test data set for clustering networks, with a known community structure (clustering): teams belong to different conferences. [1] M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002). Data at: http://www-personal.umich.edu/~mejn/netdata/ Football result: Colors represent clusters; the blue background represents the correct clustering into conferences. Example 2: Last.fm. A large friendship network of 675,681 Last.fm users, crawled via Last.fm web services during March and April 2007, with mutual links between all users. Subset: 147,610 users claiming to be from the US. For each user: demographics (age, country, sex) and music taste (artists). In addition, tags for over 188,565 artists were crawled. [Figure: component-by-tag matrix, components A-H against music tags (folk, singer-songwriter, indie, pop, experimental, classic rock, post-rock, jazz, alt-country, electronica, electronic, post-punk, ambient, 80s, new wave, britpop, female vocalists, country, Canadian, comedy, soundtrack, grunge, industrial, progressive metal, metal, j-pop, japanese, hard rock, progressive rock, ska, punk rock, rap, hip hop, piano rock, emo, punk, screamo, pop punk, post-hardcore, hardcore, metalcore, [NA], acoustic, christian), with cells ranging from likely to unlikely.] Last.fm result: Eight components found (columns A-H); the music tags often occur in specific components (rows). Inference took slightly less than 4 hours. Conclusion. The algorithm performs well at clustering networks. It can find both local
structure (clusters) and diffuse global traits (latent dimensions). The method is computationally efficient; however, suboptimal hierarchical clustering methods are even faster. It provides information on the confidence of the clustering results. The choice of constant parameters for the model (hyperparameters) may be hard. Future work: 1. Further validation of the algorithm: perform comparisons with machine learning methods and community extraction algorithms; more detailed analysis of the algorithm as a predictor for node traits. 2. Method development: include more information about network structure into the model, such as weights, user traits, directed links; model architecture; distributional assumptions. 3. Improvement of performance: parallel implementation of sampling.
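The collapsed Gibbs update described in the talk can be sketched as follows, with a finite number of components K as a surrogate for the Dirichlet process term C(n_z, α); the function names, the normalisation, and this finite-K surrogate are assumptions of the sketch, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_sweep(edges, z, K, N, alpha=1.0, beta=0.1):
    """One collapsed Gibbs sweep: resample each edge's component from counts.
    k[z, i] = number of endpoints of node i in component z; n[z] = edges in z."""
    k = np.zeros((K, N))
    n = np.zeros(K)
    for (i, j), zz in zip(edges, z):
        k[zz, i] += 1; k[zz, j] += 1; n[zz] += 1
    for e, (i, j) in enumerate(edges):
        zz = z[e]
        k[zz, i] -= 1; k[zz, j] -= 1; n[zz] -= 1     # remove edge from counts
        p = ((k[:, i] + beta) / (2 * n + N * beta)
             * (k[:, j] + beta) / (2 * n + 1 + N * beta)
             * (n + alpha / K))                      # finite-K stand-in for C(n_z, alpha)
        z[e] = rng.choice(K, p=p / p.sum())          # sample new component
        k[z[e], i] += 1; k[z[e], j] += 1; n[z[e]] += 1
    return z
```

Only the count arrays are touched per edge, which is why the collapsed sampler scales to networks with millions of nodes.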
Genomic Repeat Visualisation Using Suffix Arrays Repeat analysis is an important technique for understanding the structure of genomic sequences. Here we present a visualisation for describing the repeat character of a sequence, the repeat-score plot. This visualisation allows the identification of all repeats within a sequence. Genomic Repeat Visualisation Using Suffix Arrays Repeat Visualisation Using Suffix Arrays The repeat-score plot (slides 1-15) Repeat-score plots of artificial sequences Random sequences DNA sequences Small genomic sequences (slides 1-2) E. coli (slides 1-4) Repeats in genomic sequences A linear-time algorithm The suffix array (slides 1-2) Generating the repeat-score plot (slides 1-2) Whole human genome (slides 1-3) Human chromosome 18 Arabidopsis thaliana chromosome 1, coding region Fibonacci-derived sequences Gallus gallus chromosome 20 Application to other sequences Shakespeare Text document containing the text "The quick brown fox jumped over the lazy dog" 16 times "On the Economy of Machinery and Manufactures" by Charles Babbage with an artificial repeat inserted 16 times (slides 1-2) Conclusion
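A small sketch of the suffix-array machinery behind such a plot. This uses a naive sort-based construction rather than the linear-time algorithm of the talk, and the per-position "repeat score" definition here (length of the longest repeated substring starting at each position, read off the LCP array) is an assumption for illustration.

```python
def suffix_array(s):
    """Naive suffix array: sort all suffix start positions lexicographically.
    O(n^2 log n) -- fine for short strings; the talk uses a linear-time method."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """lcp[k] = longest common prefix length of suffixes sa[k-1] and sa[k]."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        while (a + lcp[k] < len(s) and b + lcp[k] < len(s)
               and s[a + lcp[k]] == s[b + lcp[k]]):
            lcp[k] += 1
    return lcp

def repeat_score(s):
    """Per-position score: length of the longest repeated substring starting
    there, i.e. the max LCP with a lexicographically adjacent suffix."""
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    score = [0] * len(s)
    for k, i in enumerate(sa):
        left = lcp[k] if k > 0 else 0
        right = lcp[k + 1] if k + 1 < len(sa) else 0
        score[i] = max(left, right)
    return score
```

Plotting `repeat_score(s)` against position gives a repeat-score-plot-like curve: for "abcabc" the scores are [3, 2, 1, 3, 2, 1], the twin peaks marking the two copies of "abc".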
A Polynomial-time Metric for Outerplanar Graphs (Extended Abstract) In the chemoinformatics context, graphs have become very popular for the representation of molecules. However, a lot of algorithms handling graphs are computationally very expensive. In this paper we focus on outerplanar graphs, a class of graphs that is able to represent the majority of molecules. We define a metric on outerplanar graphs that is based on finding a maximum common subgraph and we present an algorithm that runs in polynomial time. Having an efficiently computable metric on molecules can improve the virtual screening of molecular databases significantly. A Polynomial-time Metric for Outerplanar Graphs Introduction Examples of molecules Graphs Related work The problem Maximum Common Subgraph (MCS) However... Planar and outerplanar graphs A molecule The subgraph isomorphism revisited The maximum common subgraph revisited Sketch of the algorithm Non-block-splitting subgraphs Half-graphs Finding the size of the MCS of two outerplanar graphs Datasets kNN-classification Preliminary results Time complexity - part 1 Time complexity - part 2 Time complexity - part 3 Conclusions Further work Questions? Mining and Learning with Graphs, August 1-3 2007, Florence. Motivation Example Introduction Drug discovery: find new drug molecules that are active against some disease; need for automatic techniques that select interesting molecules. How to find interesting molecules: similarity measure: which molecules are close to known drug molecules? Observation: molecules with the same structure tend to have the same activity. Problem: how to represent molecules? Examples of molecules Graphs Related work Problem description Complexity Special classes of graphs Graphs: very suitable to represent (binary) relational data: vertices are entities, edges are relationships between entities; molecules: vertices are atoms, edges are bonds. Graphs can be labeled: atoms: C, O, Cu, Cl, H, ...; bonds: single, double, aromatic, ...
Problem: operations on graphs are computationally expensive! Hence, algorithms that handle graphs directly are avoided. Related work. Feature-based distances (fingerprints): defining of some features; the molecule is represented by a vector. Advantages: efficiently computable, use of existing machine learning techniques. Disadvantages: loss of information, feature selection. Cost-based distances, aka graph edit distances: approximation algorithms; exact algorithms. Advantage: original graph structure preserved. Disadvantage: efficiency. The problem. Goal of this work: to develop an efficiently computable metric on graphs representing molecules. Bunke & Shearer (1998) proposed a distance function on graphs based on the maximum common subgraph (MCS): d_bs(G, H) = 1 − |MCS(G, H)| / max(|G|, |H|), with |G| equal to the number of vertices in G. d_bs is a metric; other size functions can be used too. Maximum Common Subgraph (MCS). Given two graphs G and H, the MCS is the graph I which is subgraph isomorphic to G and H such that there exists no other graph J which is also subgraph isomorphic to G and H with |J| > |I|. In the example shown, d_bs(G, H) = 1 − 12/max(26, 18) ≈ 0.54. However... Problem: the computation of the MCS is not easy; the subgraph isomorphism problem is NP-hard for general graphs (unless P = NP). Previous work on graphs has shown that the complexity of some problems can be reduced by imposing constraints on the graph structure: sequences; trees; planar graphs; graphs of bounded degree; graphs with treewidth at most k; k-connected graphs. Task: find an 'easier' class of graphs to represent molecules.
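The Bunke & Shearer distance itself is trivial once the MCS size is known; all of the work is in computing |MCS(G, H)|. A minimal sketch, taking the MCS size as a given number (the function name is ours):

```python
def d_bs(size_mcs, size_g, size_h):
    """Bunke-Shearer distance: d_bs(G, H) = 1 - |MCS(G, H)| / max(|G|, |H|).
    Sizes are vertex counts; the hard part, computing |MCS|, is assumed done."""
    return 1.0 - size_mcs / max(size_g, size_h)

# Example from the slides: |MCS| = 12, |G| = 26, |H| = 18
d = d_bs(12, 26, 18)   # 1 - 12/26, about 0.54
```

Identical graphs give distance 0 and disjoint ones give 1, which is why restricting to a graph class where |MCS| is computable in polynomial time yields an efficiently computable metric.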
Leander Schietgat, Jan Ramon, Maurice Bruynooghe: A Polynomial-time Metric for Outerplanar Graphs. Planar and outerplanar graphs. Planar graph: can be drawn in the plane in such a way that no two edges intersect except at a vertex in common. Outerplanar graph: planar graph with all the vertices adjacent to the outer face. A molecule: 95% of the molecules in the NCI database can be represented by outerplanar graphs [Horváth et al. 2006]. Problem: the subgraph isomorphism problem for outerplanar graphs is still NP-hard [Syslo 1982]. The subgraph isomorphism revisited. New terminology: block: maximal subgraph for which every pair of vertices is involved in a cycle; bridge: edge that does not belong to a block. Block-and-bridge preserving (BBP) subgraph isomorphism: a variant of the general subgraph isomorphism in which blocks are mapped onto blocks and bridges are mapped onto bridges. Motivation: the BBP subgraph isomorphism for outerplanar graphs is computable in polynomial time [Horváth et al. 2006]; chemist viewpoint: ring structures and linear fragments usually behave differently. The maximum common subgraph revisited: d_bs(G, H) = 1 − |MCS(G, H)| / max(|G|, |H|); in the example shown, d_bs(G, H) = 0.54. Preliminaries The algorithm Sketch of the algorithm: dynamic programming approach. Generate subgraphs: non-block-splitting subgraphs; half-graphs. Order the subgraphs by ascending 'size'.
Solve them bottom-up: simple subgraphs (1 node): trivial solution; difficult subgraphs (multiple nodes): combine the earlier computed solutions of parts of the subgraphs. Results in polynomial time complexity. Non-block-splitting subgraphs: based on block-bridge trees. Example of a non-block-splitting subgraph. Half-graphs. Half-graph G|_{o[u,v]}: the maximal connected subgraph of G containing all vertices of o[u, v], none of the vertices V(B) \ o[u, v], and none of the edges adjacent to v that do not belong to the block B. Example of a half-graph. Finding the size of the MCS of two outerplanar graphs Datasets Method Results Datasets: NCI cancer dataset, publicly available (National Cancer Institute): screening results for the ability of more than 70,000 compounds to suppress or inhibit the growth of a panel of 60 human tumour cell lines. 60 datasets from Swamidass et al. (2006); for each cell line: a two-class classification problem; more or less balanced datasets; ~3500 examples, ~90% outerplanar. kNN-classification: find the nearest neighbour(s) according to the defined distance measure. Parameters: k = 5; distance measure: d_bs(G, H) = 1 − |MCS(G, H)| / max(|G|, |H|), with |G| the number of nodes in G. Prediction for molecule m: majority voting; weighted voting: e.g., |MCS(G, H)| · class(H). Preliminary results. Evaluation method: leave-one-out crossvalidation.

Dataset        1     2     3     4     5     6     7     8     9    10
#examples   3085  3047  3278  3105  2426  3136  3049  3191  1053  1072
#positives  1572  1520  1624  1545  1190  1607  1903  1648   701   768
#negatives  1513  1527  1654  1560  1236  1529  1146  1543   352   304
Acc           69    70    70    70    70    70    69    68    70    74
AUROC       0.75  0.76  0.76  0.76  0.76  0.76  0.73  0.75  0.72  0.72

Time complexity. [Figures: running time (s) against number of nodes for molecule NCI 76026 (#nodes = 30, #halfgraphs = 1104), and ln(time) against ln(HG_1 · HG_2), with fitted line y = 1.3379·x − 17.1147.] From the fit, O(HG) = (HG_G · HG_H)^1.34; since the number of half-graphs grows roughly quadratically with the number of vertices, O(V) ≈ (V_G² · V_H²)^1.34 = V_G^2.68 · V_H^2.68, i.e. about O(V^5.36) overall. Conclusions Further work. Conclusions: We introduced a polynomial algorithm to find the size of the maximum connected common subgraph between two outerplanar graphs under the block-and-bridge preserving subgraph isomorphism, which can be used to construct a metric on outerplanar graphs and hence a similarity measure between molecules. Preliminary results: predictive performance; running time. Further work: full-scale experiments; investigating other distance measures, size functions, ...; comparison with similar algorithms and metrics: Swamidass et al. (2006), Ceroni et al. (2007), ...; investigation of other subclasses of graphs: 10% of molecules in this dataset are not outerplanar; look for other graph properties for which we can develop polynomial algorithms, e.g., graphs with bounded treewidth. Questions?
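The kNN classification scheme used in the experiments can be sketched as follows. The `mcs_size` callback stands in for the paper's polynomial BBP-MCS algorithm and is an assumption of this sketch, as is representing a "graph" by anything with a length (its vertex count).

```python
from collections import Counter

def knn_predict(query, training, k=5, mcs_size=None):
    """k-nearest-neighbour prediction under the Bunke-Shearer distance.
    `training` is a list of (graph, label) pairs; `mcs_size(g, h)` returns
    |MCS(g, h)| (here an assumed black box); len(g) is the vertex count."""
    def d_bs(g, h):
        return 1.0 - mcs_size(g, h) / max(len(g), len(h))
    neighbours = sorted(training, key=lambda gh: d_bs(query, gh[0]))[:k]
    votes = Counter(label for _, label in neighbours)   # majority voting
    return votes.most_common(1)[0][0]
```

With leave-one-out cross-validation one would call this once per molecule, holding that molecule out of `training`; weighted voting would replace the Counter tally with per-neighbour weights such as |MCS(G, H)|.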
Weighted Substructure Mining for Image Analysis In web-related applications of image categorization, it is desirable to derive an interpretable classification rule with high accuracy. Using the bag-of-words representation and the linear support vector machine, one can partly fulfill this goal, but the accuracy of linear classifiers is not high and the obtained features are not informative for users. We propose to combine item set mining and large margin classifiers to select features from the power set of all visual words. The resulting classification rule is easier to browse and simpler to understand, because each feature has richer information. As a next step, each image is represented as a graph where nodes correspond to local image features and edges encode geometric relations between features. Combining graph mining and boosting, we can obtain a classification rule based on subgraph features that contain more information than the set features. We evaluate our algorithm in a web-retrieval ranking task where the goal is to reject outliers from a set of images returned for a keyword query. Furthermore, it is evaluated on supervised classification tasks with the challenging VOC2005 data set. Our approach yields excellent accuracy in the unsupervised ranking task and competitive results in the supervised classification task.
A Universal Kernel for Learning Regular Languages We give a universal kernel that renders all the regular languages linearly separable. We are not able to compute this kernel efficiently and conjecture that it is intractable, but we do have an efficient ε-approximation.
Mining, Indexing, and Searching Graphs in Large Data Sets Recent research on pattern discovery has progressed from mining frequent itemsets and sequences to mining structured patterns including trees, lattices, and graphs. As a general data structure, graphs can model complicated relations among data, with wide applications in the Web, social network analysis, and bioinformatics. However, mining and searching large graphs in graph databases is challenging due to the presence of an exponential number of frequent subgraphs. In this talk, we present our recent progress on developing efficient and scalable methods for mining and searching of graphs in large databases. We introduce gSpan and CloseGraph, two efficient methods for mining frequent graph patterns in graph databases. Then we introduce constraint-based graph mining methods. Further, we introduce a graph indexing method, gIndex, and a graph approximate searching method, Grafil, both taking advantage of frequent graph mining to construct a compact but highly effective graph index and perform similarity search with such indexing structures. These methods not only facilitate mining and querying graph patterns in massive datasets but also have broad applications in other fields, including DB/OS systems and software engineering. Mining, Indexing & Searching Graphs in Large Data Sets Research Papers Covered in this Talk Graph, Graph, Everywhere Why Graph Mining and Searching? Outline Graph Pattern Mining Example: Frequent Subgraphs Frequent Subgraph Mining Approaches Properties of Graph Mining Algorithms Apriori-Based Approach Pattern Growth-Based Span and Pruning gSpan (Yan and Han, ICDM '02) DFS Code Graph Pattern Explosion Problem Closed Frequent Graphs CLOSEGRAPH (Yan & Han, KDD '03) Experimental Result Discovered Patterns Number of Patterns: Frequent vs. Closed Runtime: Frequent vs. Closed Outline Constraint-Based Graph Pattern Mining Pattern Pruning vs.
Data Pruning Pruning Properties Overview Pruning Pattern Search Space Pruning Data Space (I): Pattern-Separable D-Antimonotonicity Pruning Data Space (II): Pattern-Inseparable D-Antimonotonicity Graph Constraints: A General Picture Outline Graph Search: Querying Graph Databases Scalability Issue Indexing Strategy Framework Cost Analysis Path-Based Approach Problems of Path-Based Approach gIndex: Indexing Graphs by Data Mining IDEAS: Indexing with Two Constraints Why Discriminative Subgraphs? Discriminative Structures Why Frequent Structures? Experimental Setting Experiments: Index Size Experiments: Answer Set Size Experiments: Incremental Maintenance Outline Structure Similarity Search Some 'Straightforward' Methods Index: Precise vs. Approximate Search Substructure Similarity Measure Substructure Similarity Measure Intuition: Feature-Based Similarity Search Feature-Graph Matrix Edge Relaxation - Feature Misses Query Processing Framework Performance Study Comparison of the Three Algorithms Outline Graph Search vs. Graph Containment Search Example: Graph Search vs. Graph Containment Search Different Philosophies in Two Searches Contrast Features for C-Search Pruning The Basic Framework Cost Analysis Feature Selection Feature-Graph Matrix Contrast Graph Matrix Training by the Query Log Maximum Coverage with Cost The Basic Containment Search Index The Bottom-Up Hierarchical Index The Top-Down Hierarchical Index Experiment Setting Chemical Descriptor Search Hierarchical Indices Object Recognition Search Conclusions Pictures for data mining lab
Pascal Workshop on Graph Theory and Machine Learning The focus of the workshop is on the fundamentals of graph theory relevant to learning, with emphasis on the applications of spectral clustering, visualisation and transductive learning. Methods from graph theory have made an impact in Machine Learning recently through two avenues. The first arises when we view the data samples as the vertices of the graph with the similarity between the examples encoded by the weights on the edges. This view of the data can be used to motivate a number of techniques, including spectral clustering, nonlinear dimensionality reduction, visualisation, transductive and semi-supervised classification. The second reason for involving graph theory is through the representation of complex objects by graphs. This could be for objects that have a natural graph structure such as molecules or gene networks, or for cases where a feature extraction phase constructs a graph, as for example in natural language processing or computer vision. A key development in this area has been the realisation that feature spaces involving exponentially many features can be used implicitly via kernels that compute in polynomial time inner products between projections into the feature space. This use of graph representations is becoming common in many applications of machine learning making a focus on this topic relevant to a number of application areas, particularly bioinformatics and natural language processing. [[http://conferences.imfm.si/conferenceDisplay.py?confId=2|more on conference page]]
Honeycomb tori and Cayley graphs on generalized dihedral groups We investigate a family of graphs known to some people as honeycomb tori. We establish that they all are Cayley graphs on generalized dihedral groups. We then look at hamiltonicity properties for this family of graphs.
Graphs with extremal energy tend to have a small number of distinct eigenvalues The sum of the absolute values of the eigenvalues of a graph is called the energy of the graph. We study the problem of finding graphs with extremal energy within specified sets of graphs. We develop some tools for treating such problems and obtain some partial results. In particular, we show that in many cases the expected extremal graphs with a small number of distinct eigenvalues do not exist and that actual extremal graphs could have a large number of distinct eigenvalues. Graphs with extremal energy tend to have a small number of distinct eigenvalues Conclusion
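The energy defined above is straightforward to compute numerically from the adjacency matrix; a minimal sketch (the function name is ours):

```python
import numpy as np

def graph_energy(adj):
    """Energy of a graph: the sum of the absolute values of the
    eigenvalues of its (symmetric) adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    return float(np.abs(np.linalg.eigvalsh(adj)).sum())

# The complete graph K_n has eigenvalues n-1 (once) and -1 (n-1 times),
# so its energy is 2(n-1); for K_4 that is 6 -- and K_4 has only
# two distinct eigenvalues, in line with the talk's theme.
k4 = np.ones((4, 4)) - np.eye(4)
```

`eigvalsh` is used rather than `eig` because adjacency matrices of undirected graphs are symmetric, guaranteeing real eigenvalues.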
Modelling Intra-Speaker Variability for Improved Speaker Recognition In this paper we present a speaker recognition algorithm that explicitly models intra-speaker inter-session variability. Such variability, which is caused by channel, noise and temporary speaker characteristics (mood, fatigue, etc.), is not modeled explicitly by state-of-the-art speaker recognition algorithms. We define a session-space in which each session (either a train or test spoken utterance) is a vector. We then calculate a rotation of the session-space for which the estimated intra-speaker subspace is trivially isolated and can be modeled explicitly. Due to the high dimensionality of the session-space, it is impossible to use standard orthogonalization methods. We therefore used QR factorization based on Givens rotations to calculate the projection. On the NIST-2004 evaluation corpus, the recognition error rate was reduced by 23% compared to the classic GMM state-of-the-art algorithm.
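QR factorization via Givens rotations, as mentioned in the abstract, zeroes sub-diagonal entries one at a time with 2x2 plane rotations. A minimal dense sketch for illustration only (the paper's point is precisely that a sparse, rotation-by-rotation scheme scales where dense orthogonalization does not):

```python
import numpy as np

def givens_qr(A):
    """QR factorisation via Givens rotations: each 2x2 plane rotation
    zeroes one sub-diagonal entry; accumulating them gives Q."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.eye(m)
    R = A.copy()
    for j in range(n):
        for i in range(m - 1, j, -1):          # zero column j from the bottom up
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])    # rotation acting on rows i-1, i
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T
    return Q, R

A = np.array([[4.0, 1.0], [2.0, 3.0], [1.0, 2.0]])
Q, R = givens_qr(A)
```

Each rotation touches only two rows, so the work per rotation is O(n) regardless of the matrix height, which is what makes Givens-based schemes attractive in very high-dimensional session-spaces.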
Erdős-Ko-Rado theorems I will show that the Erdős-Ko-Rado theorem has a natural proof using linear algebra, and that this approach also applies to situations where sets are replaced by objects such as subspaces, permutations or partitions.
Small polyhedral models of the torus, the projective plane and the Klein bottle Models of these manifolds have been studied at least since the work of Möbius, with increasing depth and many results in more recent times. The models range from purely combinatorial to various types of geometric representations, such as by topological complexes, by planar-faced polyhedra (convex or not necessarily convex), or by smooth manifolds. The talk will give a survey of available results, and then concentrate on what seems to be a new direction: models that admit as faces self-intersecting polygons. One of the unexpected results is that in some cases such models are simpler and more readily visualized than the more traditional ones, and that in other cases they are the only possible ones. The understanding of the role of self-intersecting polygons as faces sheds light, among other things, on the relations between the Platonic solids and the Kepler-Poinsot regular polyhedra. Many open problems remain, both in the traditional framework and in the new one.
Small polyhedral models of the torus, the projective plane and the Klein bottle Small polyhedral models of the torus (slides 1-4) Triangulations (slides 1, 2, 4) Overarching faces (slides 1-2) Quadrangulations (slides 1-8) Quintangulations (slides 1-3) Hexangulations (slides 1-7, with example) Generalization of constructions Polyhedral realizations of non-orientable manifolds Projective plane, the Klein bottle, and the Möbius band Polyhedra (slides 1-2) Usual ways of representing the real projective plane (slides 1-2) Polyhedral models of the projective plane (slides 1-2) Can a polyhedral map of the projective plane be realized by an actual polyhedron (slides 1-6) There are many possibilities for realizations of hemi-polyhedra Hemi-dodecahedron Problem (slides 1-5) Establishing a homeomorphism (slides 1-3) Möbius band Euler characteristic Infolding (slides 1-3) Dodecahedron is homeomorphic to the Platonic dodecahedron From isomorphic to homeomorphic (with examples, slides 1-15) Various interesting isogonal polyhedra (with examples, slides 1-8) The boundary of a Möbius band (slides 1-4) The classification of 2-manifolds (slides 1-2) Polyhedral Möbius From the traditional Möbius band to the polyhedral three-bow-ties cross-cap A variety of cross-caps Another kind of cross-caps Conclusion
Geometric intersection graphs Geometric intersection graphs are intensively studied both for their practical motivations and for their interesting theoretical properties. Many classes allow elegant characterizations, and for many of them optimization problems that are NP-hard in general can be solved in polynomial time. We will present a survey of recent results and old problems in this area, including questions related to colorability, maximum clique, and representations of planar and co-planar graphs. The computational complexity of recognizing many intersection-defined classes of graphs will be one of the main topics.
On Hurwitz theory: enumerating branched surface coverings With a chronological review of Hurwitz theory, we survey some known results on the enumeration of the equivalence classes of several types of branched coverings of a surface. In particular, relations with the enumeration of the equivalence classes of several types of graph coverings and enumerating the isomorphism classes of branched orientable surface coverings of a nonorientable surface will be mentioned. Also we discuss a similar problem for branched coverings having prescribed branched types.
Famous and lesser known problems in 'elementary' combinatorial geometry and number theory Which problems attain great notoriety and which are relegated to collect dust on a shelf? 'Elementary' problems tend to attract attention because they are very easy to understand and look 'solvable'. It is a mystery to me why some attract a lot of attention while others lie hibernating, waiting for some new fresh ideas. In their recent interesting book Research Problems in Discrete Geometry (Springer, New York 2005), P. Brass, W. Moser and J. Pach wrote: 'Although Discrete Geometry has a rich history extending more than 150 years, it abounds in open problems that even a high-school student can understand and appreciate. Some of these problems are notoriously difficult and are intimately related to deep questions in other fields of mathematics. But many problems, even old ones, can be solved by a clever undergraduate or a high-school student equipped with an ingenious idea and the kinds of skills used in a mathematical olympiad.'
Distance-regular graphs and the quantum affine algebra U_q(ŝl_2) Combinatorial objects, such as graphs, can often be used to construct representations of abstract algebras. In this talk we will consider a graph possessing a high degree of regularity, known as distance-regularity. For this graph we define an algebra generated by the adjacency matrix and a certain diagonal matrix. There exists a set of elements in this algebra that, under a minor assumption, satisfy some attractive relations. Using these relations we obtain a representation of the quantum affine algebra U_q(ŝl_2).
Infinite planar tessellations We survey problems involving planar embeddings of locally finite, infinite graphs, especially graphs that have only one infinite component when any finite subgraph is deleted. Problems considered concern separating double rays, geodetic double rays, facial walks, rate of growth, and vertex-, edge- and face-homogeneity.
On Numerical Characterization of DNA, Proteins, Proteomics Maps and Proteome from their Graphical Representations We will outline the calculation of a selection of mathematical invariants that can be extracted from matrices associated with various graphical representations of DNA, proteins, proteomics maps and the proteome. In the case of proteome maps, one can construct a zigzag line connecting ordered spots in a map, the partial order graph, the cluster graph, the nearest neighbor graph or the sequential neighbor graph, and the minimal spanning tree.
Graph methods and geometry of data In recent years graph-based methods have seen success in different machine learning applications, including clustering, dimensionality reduction and semi-supervised learning. In these methods a graph is associated to a data set, after which certain aspects of the graph are used for various machine learning tasks. It is, however, important to observe that such graphs are empirical objects corresponding to a randomly chosen set of data points. In my talk I will discuss some of our work on using spectral graph methods for dimensionality reduction and semi-supervised learning and certain theoretical aspects of these methods, in particular, when data is sampled from a low-dimensional manifold.
Learning gene regulatory networks in Arabidopsis thaliana Gene regulatory networks govern the functional development and biological processes of cells in all organisms. Genes regulate each other as part of a complex system, of which it is vitally important to gain an understanding. For example, discovery of the complete gene regulatory networks in humans would allow the identification of genes which cause disease, and could be used for drug discovery to identify genes interacting with compounds of interest. Similarly in plants, knowledge of the gene regulatory networks would allow the development of stress (drought/salt/temperature) resistant crops. Learning large gene regulatory networks with thousands of genes with any certainty from microarray data is extremely challenging. This research aims to build around known networks from the literature on gene regulation, and assesses which other genes are likely to play a regulatory role or be in the same regulatory pathways. The gene regulatory networks are modelled with a Bayesian network. The gene expression levels are quantised and a greedy hill climbing search method is used within a network structure learning algorithm. The inclusion of extra genes with the best explanatory power into the model has been demonstrated to be robust. Large sets of microarray experiments are used in this analysis, specifically 2466 NASC Arabidopsis thaliana microarrays containing gene expression levels of over twenty thousand genes in a number of experimental conditions. Initial investigation of this data is very promising. The use of large scale public microarray data appears to be a very useful starting point for informing future experiments in order to determine gene regulatory networks. We have learned gene transcription sub-networks (see Figure 1) regulated by the plant's circadian clock.
The network shown was generated from microarray data without the use of any prior information, and yet the method managed to identify the strong causal relationships between clock components (TOC1, LHY, ELF3, ELF4, CCA1) and to link these to further key regulators of important processes (e.g. ZAT, myb and GATA transcription factors).
A theory of similarity functions for learning and clustering Kernel methods have proven to be very powerful tools in machine learning. In addition, there is a well-developed theory of sufficient conditions for a kernel to be useful for a given learning problem. However, while a kernel function can be thought of as just a pairwise similarity function that satisfies additional mathematical properties, this theory requires viewing kernels as implicit (and often difficult to characterize) maps into high-dimensional spaces. In this talk I will describe a theory that applies to more general similarity functions (not just legal kernels) and, furthermore, describes the usefulness of a given similarity function in terms of more intuitive, direct properties of the induced weighted graph. An interesting feature of the proposed framework is that it can also be applied to learning from purely unlabeled data, i.e., clustering. In particular, one can ask how much stronger the properties of a similarity function should be (in terms of its relation to the unknown desired clustering) so that it can be used to *cluster* well. Investigating this question leads to a number of interesting graph-theoretic properties, and their analysis in the inductive setting uses regularity-lemma type results of [FK99, AFKK03]. This work is joint with Maria-Florina Balcan and Santosh Vempala.
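A toy rendering of the central idea: use a pairwise similarity function directly (it need not be a legal kernel) in a similarity-weighted voting rule. The formal conditions under which such simple rules succeed are what the talk's theory characterises; the function and sample below are invented.

```python
def similarity(a, b):
    # Any pairwise similarity will do; this one is not a positive
    # semidefinite kernel, which is exactly the point of the framework.
    return 1.0 / (1.0 + abs(a - b))

sample = [(0.0, +1), (0.5, +1), (5.0, -1), (5.5, -1)]  # labelled points

def predict(x):
    """Classify x by similarity-weighted voting over the labelled sample."""
    score = sum(y * similarity(x, xi) for xi, y in sample)
    return 1 if score > 0 else -1

print(predict(0.2), predict(5.2))  # → 1 -1
```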
Convergence of the graph Laplacian: application to dimensionality estimation and image segmentation Given a sample from a probability measure with support on a submanifold in Euclidean space, one can construct a neighborhood graph which can be seen as an approximation of the submanifold. The graph Laplacian of such a graph is used in several machine learning methods, such as semi-supervised learning, dimensionality reduction and clustering. We will present the pointwise limit of three different graph Laplacians used in the literature as the sample size increases and the neighborhood size approaches zero. We show that for a uniform measure on the submanifold all graph Laplacians have the same limit up to constants. However, in the case of a nonuniform measure on the submanifold, only the so-called random walk graph Laplacian converges to the weighted Laplace-Beltrami operator. We will give two applications of these theoretical results.
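The three graph Laplacians compared in the talk can be sketched as follows: the unnormalised, symmetric and random-walk variants of one Gaussian-weighted neighborhood graph. The weighting and parameter values are illustrative assumptions.

```python
import numpy as np

def graph_laplacians(X, sigma=0.5):
    """Return the unnormalised, symmetric and random-walk graph Laplacians."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))      # Gaussian edge weights
    np.fill_diagonal(W, 0.0)
    d = W.sum(1)                            # vertex degrees
    L_un = np.diag(d) - W                   # unnormalised: D - W
    D_is = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(d)) - D_is @ W @ D_is      # symmetric: I - D^-1/2 W D^-1/2
    L_rw = np.eye(len(d)) - np.diag(1.0 / d) @ W  # random walk: I - D^-1 W
    return L_un, L_sym, L_rw

X = np.random.default_rng(2).normal(size=(20, 3))
L_un, L_sym, L_rw = graph_laplacians(X)
ones = np.ones(20)
# Constant functions lie in the kernel of the unnormalised and random-walk
# Laplacians, as expected of a (weighted) Laplace operator.
print(np.allclose(L_un @ ones, 0), np.allclose(L_rw @ ones, 0))  # → True True
```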
ProBic: identification of overlapping biclusters using Probabilistic Relational Models, applied to simulated gene expression data Biclustering is an increasingly popular technique to identify regulatory modules that are linked to biological processes. In the context of gene expression data, a bicluster is defined as a subset of genes which have a similar expression profile for a subset of conditions. We describe a novel method, called ProBic, to simultaneously identify a set of overlapping biclusters in gene expression data within the framework of Probabilistic Relational Models (PRMs) [1;2]. PRMs are a relational extension of Bayesian networks and allow for the integration of relational data within a unified probabilistic framework. A PRM model describes a joint probability as in Bayesian networks, but with additional constraints on the conditional probability functions. We propose a novel PRM-based biclustering model, in which gene expression data are treated as relational data. The classes are Gene, Condition and Expression. Both the Gene and Condition classes have a vector attribute Bicluster containing a series of bicluster IDs. These vectors represent which biclusters exist for a gene or condition and are initially unknown. Condition has an extra attribute ID, which is a unique number for each condition. Expression has an attribute Level containing the expression value and two reference slots which point to the gene and condition for which the level was measured. Expression.Level is conditionally dependent on Gene.Bicluster, Condition.Bicluster and Condition.ID. The conditional dependency is modeled as a set of Gaussian distributions with conjugate priors. The ProBic model naturally deals with missing values (in fact, there are no 'missing' values in this model) and robust sets of biclusters are obtained due to explicit modeling of noise. The maximum likelihood solution is approximated using an Expectation-Maximization strategy.
ProBic was applied to simulated gene expression data sets and all the biclusters were successfully identified. Various noise settings and different overlap models (average, sum, product) have been explored. Our results show that PRM models can be used to identify overlapping biclusters in an efficient and robust manner, naturally dealing with missing values and noise.
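A deliberately simplified sketch in the spirit of the approach: one bicluster, Gaussian likelihoods, and an EM-style alternation over gene and condition memberships. The actual ProBic model is far richer (vector-valued memberships, conjugate priors, several overlapping biclusters), so every modelling choice below is an invented stand-in.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(0.0, 1.0, (20, 10))
X[:8, :4] += 3.0                      # planted bicluster: genes 0-7, conds 0-3

genes = np.ones(20, dtype=bool)       # start with everything in the bicluster
conds = np.ones(10, dtype=bool)
for _ in range(20):
    mu = X[np.ix_(genes, conds)].mean()          # current bicluster mean
    # Keep a gene (resp. condition) if the bicluster mean fits it better
    # than the background mean 0 -- a crude stand-in for the E-step.
    genes = ((X[:, conds] - mu) ** 2).mean(1) < (X[:, conds] ** 2).mean(1)
    conds = ((X[genes] - mu) ** 2).mean(0) < (X[genes] ** 2).mean(0)

print(genes.sum(), conds.sum())       # sizes of the recovered bicluster
```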
Probabilistic graph partitioning We consider the problem of Graph Partitioning for applications in Web Mining and Collaborative Filtering. Our approach is based on predicting the presence/absence of a directed link based on a form of probabilistic mixture model. Being based on a generative model of directed graphs, we are able to apply an approximate Bayesian treatment to automatically select an appropriate number of partitions. We will discuss an application in Collaborative Filtering and comment on relations to mixed membership models, Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis.
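The link-prediction view can be sketched with a simple mixture model fitted by EM: each node's row of the adjacency matrix is explained by one of k partition-specific Bernoulli profiles. This is an illustrative stand-in, not the talk's exact generative model or its approximate Bayesian treatment.

```python
import numpy as np

def em_partition(A, k, n_iter=50, seed=0):
    """Cluster the rows (out-link patterns) of adjacency A with EM."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    pi = np.full(k, 1.0 / k)                      # partition weights
    theta = rng.uniform(0.25, 0.75, size=(k, n))  # per-partition link probs
    for _ in range(n_iter):
        # E-step: posterior responsibility of each partition for each node.
        log_r = np.log(pi) + A @ np.log(theta).T + (1 - A) @ np.log(1 - theta).T
        log_r -= log_r.max(1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(1, keepdims=True)
        # M-step: re-estimate weights and link probabilities.
        pi = r.mean(0)
        theta = np.clip((r.T @ A) / r.sum(0)[:, None], 1e-6, 1 - 1e-6)
    return r.argmax(1)

# Two planted blocks linking mostly within themselves.
rng = np.random.default_rng(3)
A = (rng.random((20, 20)) < 0.1).astype(float)
A[:10, :10] = (rng.random((10, 10)) < 0.8).astype(float)
A[10:, 10:] = (rng.random((10, 10)) < 0.8).astype(float)
z = em_partition(A, 2)
print(z)
```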
Estimating parameters and hidden states in biological networks with particle filters Identifying biological networks requires developing models able to capture their dynamics, together with statistical learning methods to estimate their parameters from time-series measurements. In particular, ordinary differential equations (ODEs) are a rich family of quantitative models, but their estimation remains a bottleneck for reverse engineering, especially when the biological processes are nonlinear and partially observed. In recent work [5], we proposed a state-space model, derived from ODEs used in systems biology, that can encompass regulatory networks, metabolic networks or signaling pathways. For this model, we derived a Bayesian estimation procedure for both parameters and hidden variables based on nonlinear filtering and a particular approximation scheme: the Unscented Kalman Filter (UKF). Despite satisfactory results, the UKF approximation has some limitations, such as few theoretical results and a limited range of applications. We therefore propose to use Sequential Monte Carlo (SMC) methods, also known as particle filters [2], which are now standard methods for filtering nonlinear and non-Gaussian processes. SMC methods provide a nonparametric approximation of the filtering probability, discretely supported by the so-called particles, whose convergence properties have been intensively studied [1, 2]. In this work, we develop an SMC approach for the Bayesian estimation of the (kinetic) parameters and hidden states, by considering the parameters as additional hidden states with no evolution. Despite the generality of SMC methods, the deterministic evolution of the hidden variables implies a fast degeneracy of standard algorithms, e.g. the bootstrap filter. To overcome this problem, we use a solution proposed by Liu and West [4], which relies on an adapted kernel smoothing of the particle approximation.
The method is illustrated on the Repressilator, an ODE model proposed for a gene regulatory network [3], and on an ODE model of the JAK-STAT signaling pathway [6]. Experimental results show that particle filters provide results similar to the UKF for parameter estimation and a lower mean square error for state estimation, while offering greater versatility. (Nicolas Brunel, Minh Quach and Florence d'Alché-Buc, IBISC FRE CNRS 2873, University of Evry and Genopole, France; 4 September 2007, PMNP workshop.) The accompanying slides set the problem up as follows. The continuous-time model is dx(t)/dt = f(x(t); θ) with observations y(t) = H(x(t); θ) + ε(t), where x(t) collects the state variables at time t (protein, mRNA and metabolite concentrations), f is a nonlinear function derived from the biochemical reactions, θ is the parameter set (kinetic parameters, rate constants, ...), H is a nonlinear observation function and ε(t) is i.i.d. measurement noise; given observations y_{1:K} = {y_1, ..., y_K} at times t_1, ..., t_K, the goal is to estimate θ and the states x(t). The corresponding discrete-time augmented state-space model propagates (x(t_{k+1}), θ_{k+1}) = (F(x(t_k); θ_k), θ_k), with F(x(t_k); θ) = x(t_k) + ∫_{t_k}^{t_{k+1}} f(x(τ); θ) dτ, and observation model y(t_k) = H(x(t_k); θ_k) + ε(t_k). Given a prior p(x_1, θ_1), a transition model p(x_k | x_{k-1}, θ_{k-1}) and an observation model p(y_k | x_k, θ_k), filtering estimates the posteriors p(x_k, θ_k | y_{1:k}) recursively in two steps: prediction, p(x_{k+1} | y_{1:k}) = ∫ p(x_{k+1} | x_k) p(x_k | y_{1:k}) dx_k, and update, p(x_{k+1} | y_{1:k+1}) = p(y_{k+1} | x_{k+1}) p(x_{k+1} | y_{1:k}) / p(y_{k+1} | y_{1:k}). An analytical solution exists only when F and H are linear and the prior and noise are Gaussian (the Kalman filter); otherwise approximate solutions are needed, either Gaussian approximations (EKF, UKF) or SMC/particle filters [Gordon 1993, Doucet 1998]. Particle filters approximate the posterior by samples, p(x_k | y_{1:k}) ≈ (1/N) Σ_i δ(x_k − x_k^{(i)}), so that expectations E[g(x_k)] become sample averages. Since one cannot usually sample directly from p(x_k | y_{1:k}), samples are drawn from a proposal q(x_k | y_{1:k}) and corrected by importance weights, updated recursively as w_k = w_{k-1} p(y_k | x_k) p(x_k | x_{k-1}) / q(x_k | x_{k-1}, y_{1:k}); the most popular choice of proposal is the transition prior, giving w_k = w_{k-1} p(y_k | x_k). Because particles degenerate over time (only a few retain significant weight), sampling-importance resampling (SIR) maps the N unequally weighted particles into N equally weighted ones, yielding the bootstrap filter [Gordon 1993]; if the importance weight is upper bounded, the mean squared error of the particle estimate decreases as c_k/N. The deterministic evolution θ_{k+1} = θ_k, x(t_{k+1}) = F(x(t_k); θ_k) makes the transition kernel a Dirac delta, so parameter and state particles degenerate after a few time steps. Following Liu and West [2001], a small amount of Gaussian noise is therefore added to each resampled particle, which is equivalent to smoothing the filtering distribution with a Gaussian kernel, p(θ_k | y_{1:k}) ≈ (1/N) Σ_i N(θ_k | m_k^{(i)}, h²V_k), with the kernel locations shrunk toward the ensemble mean, m_k^{(i)} = a θ_k^{(i)} + (1 − a) θ̄_k with a² = 1 − h², so as to retain the mean θ̄_k and variance V_k (preventing the loss of information caused by the artificial noise). In the Repressilator [Elowitz, Nature 2000], three mRNA/protein pairs form a cyclic repression loop, with Hill-type repression terms for mRNA production, dr_i/dt = v_i^max k^n/(k^n + p_j^n) − k_i r_i, and linear protein dynamics, dp_i/dt = k'_i r_i − k''_i p_i; the mRNAs are observed, the proteins are hidden, nine parameters are estimated and the mRNA and protein degradation rate constants are assumed known. Results are shown with 20% Gaussian noise, Gaussian priors on parameters and initial states, and 5000 particles. Future work includes better proposal distributions (e.g. EKF or UKF proposals), adaptive resampling schemes and larger networks.
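The scheme above can be sketched on a toy problem: a one-dimensional decay ODE dx/dt = -theta*x with deterministic state evolution, a bootstrap particle filter over the augmented state (x, theta), and Liu-West shrinkage to keep the parameter particles from degenerating. All rates, noise levels and particle counts below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
dt, K, N, true_theta = 0.1, 60, 2000, 1.5

# Simulate noisy observations of x(t), with dx/dt = -theta * x (Euler steps).
x, ys = 5.0, []
for _ in range(K):
    x += dt * (-true_theta * x)
    ys.append(x + rng.normal(0.0, 0.2))

# Particles over the augmented state (x, theta).
xs = rng.normal(5.0, 1.0, N)
thetas = rng.uniform(0.1, 3.0, N)
h = 0.1
a = np.sqrt(1 - h ** 2)                 # Liu-West shrinkage factor
for y in ys:
    # Liu-West: shrink parameter particles toward their mean, then jitter,
    # approximately preserving the ensemble mean and variance.
    thetas = (a * thetas + (1 - a) * thetas.mean()
              + h * thetas.std() * rng.normal(size=N))
    # Deterministic propagation of the hidden state (one Euler step).
    xs = xs + dt * (-thetas * xs)
    # Weight by the Gaussian observation likelihood, then resample.
    w = np.exp(-0.5 * ((y - xs) / 0.2) ** 2)
    idx = rng.choice(N, size=N, p=w / w.sum())
    xs, thetas = xs[idx], thetas[idx]

print(thetas.mean())                    # posterior mean of theta, near 1.5
```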
Evolution of protein complexes and protein interaction networks There is an abundance of data on protein interactions and protein complexes, both from conventional small-scale experiments collected over the decades, including three-dimensional structures, and more recently from large-scale functional genomics experiments. We can now draw on the information available about protein interactions in order to study the evolution of interactions. We have shown that interactions, just like individual proteins, frequently emerge by duplication and divergence. The duplication of a protein that engages in protein-protein interactions raises issues about the stoichiometry and equilibrium of protein complexes when the quantity of one component increases. Nevertheless, our results indicate that most interactions and complexes have evolved by stepwise duplications of individual proteins engaged in interactions. We show that duplicated complexes retain the same overall function, but have different binding specificities and regulation, revealing that duplication is associated with functional specialization [1,2]. From analysis of crystal structures of proteins as well as the domain architectures of multidomain proteins, it is clear that physical interactions between identical or homologous domains and protein chains are extremely common [3,4]. How has this particular class of interactions evolved?
Prediction on a graph We will discuss the problem of robust online learning over a graph. Consider the following game for predicting the labeling of a graph. "Nature" presents a vertex v1; the "learner" predicts the label of the vertex, ŷ1; nature presents the true label y1; nature presents a vertex v2; the learner predicts ŷ2; and so forth. The learner's goal is to minimize the total number of mistakes. If nature is adversarial, the learner will always mispredict; but if nature is regular or simple, there is hope that a learner may make only a few mispredictions. Thus, a methodological goal is to give learners whose total mispredictions can be bounded relative to the "complexity" of nature's labeling. In this talk, we consider the "label cut size" as a measure of the complexity of a graph's labeling, where the size of the cut is the number of edges between disagreeing labels. We will give bounds which depend on the cut size and the (resistance) diameter of the graph.
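The game is easy to simulate. The sketch below uses a naive learner (majority vote over already-revealed neighbours) on a labelled path graph whose cut size is 1, and makes exactly one mistake; the mistake-bounded algorithms of the talk are more sophisticated, so this only illustrates the protocol.

```python
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}  # a path graph
labels = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}                  # cut size 1 (edge 2-3)

mistakes, seen = 0, {}
for v in range(5):                     # nature's presentation order
    votes = [seen[u] for u in adj[v] if u in seen]
    # Predict by majority over revealed neighbours; default to 0 if none.
    pred = max(set(votes), key=votes.count) if votes else 0
    if pred != labels[v]:
        mistakes += 1
    seen[v] = labels[v]                # nature reveals the true label
print(mistakes)  # → 1
```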
Inferring ancestral states of the bZIP transcription factor interaction network As whole-genome protein interaction network datasets become available for a wide range of species, evolutionary biologists have the opportunity to address some of the unanswered questions surrounding the evolution of these complex systems. Protein interaction networks from divergent organisms may be compared to investigate how gene duplication, deletion and 're-wiring' processes may have shaped the evolution of their contemporary structures [1,2]. However, current approaches to aligning observed networks from multiple species generally lack the phylogenetic context necessary for meaningful conclusions to be drawn regarding network evolution. Here we show how probabilistic modeling can provide a platform for the quantitative analysis of multiple protein interaction networks. We apply this technique to the reconstruction of ancestral networks for the bZIP family of transcription factors [3] and find that excellent agreement is obtained with an alternative, sequence-based method for the prediction of leucine zipper interactions [4]. Further analysis shows our probabilistic method to be significantly more robust to the presence of noise in the observed network data than a simple parsimony-based approach [5]. In addition, the integration of evidence over multiple species means that the same method may be used to improve the quality of noisy interaction data for extant species. This is the first time that ancestral states of a protein interaction network have been reconstructed using an explicit probabilistic model of network evolution. We anticipate that it will form the basis of more general methods for probing the evolutionary history of biochemical networks.
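The core computation can be sketched per edge: treat the presence or absence of an interaction as a two-state Markov process evolving along the species tree, and combine the leaf observations into a posterior over the ancestral state by pruning. The tree, rates and observations below are all invented for illustration.

```python
import numpy as np

def trans(t, gain=0.3, loss=0.3):
    """2x2 transition matrix of a two-state (absent/present) Markov process."""
    s = gain + loss
    e = np.exp(-s * t)
    return np.array([[(loss + gain * e) / s, (gain - gain * e) / s],
                     [(loss - loss * e) / s, (gain + loss * e) / s]])

# Toy tree: one ancestral node with two extant descendants, branch length 1.
obs = [1, 1]                          # the interaction is seen in both leaves
L_leaf = [np.eye(2)[o] for o in obs]  # leaf likelihoods as indicator vectors
P = trans(1.0)
# Pruning step: the likelihood of each ancestral state is the product, over
# children, of the probability of evolving into the observed leaf state.
L_root = np.array([(P[s] @ L_leaf[0]) * (P[s] @ L_leaf[1]) for s in range(2)])
prior = np.array([0.5, 0.5])
post = prior * L_root / (prior * L_root).sum()
print(post.argmax())  # → 1: the edge was most likely present ancestrally
```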
(The Victoria University of Manchester)
Mixture models on graphs One of the most fundamental challenges in the analysis of 'omics data sets is clustering the relevant quantities (gene transcripts, protein levels, etc.) into distinct groups. One of the simplest instances occurs when comparing data obtained from two different conditions, where the basic task is to assess whether a quantity is upregulated, downregulated or unregulated. This task has traditionally been addressed using t-statistics or, from a probabilistic point of view, mixture models, with each mixture component representing one of the three states of regulation. This approach tacitly assumes the various measurements to be independently drawn from the same mixture distribution. However, it is well known that biological quantities (genes, enzymes, etc.) are not independent, but are linked in an often very complex network of interactions at various levels. It is therefore reasonable to use available network structure (and weighting) information in order to obtain a more accurate inference of the expression state. This can also prove useful for finding suitable subnetworks that exhibit coherent behaviours, giving rise to testable biological predictions. In this contribution, we introduce a probabilistic model that implements mixture models on a graph. The graph structure is encoded in a set of conditional prior distributions over the latent class memberships. This formulation leads naturally to a Gibbs sampling approach. We present preliminary results on synthetic and real data where gene expression is modelled as a mixture of a Gaussian and two exponential distributions.
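A minimal sketch of such a sampler: each node's latent regulation state gets a Potts-style prior rewarding agreement with its graph neighbours, plus a class-conditional likelihood. For brevity the sketch uses three Gaussian components rather than the Gaussian-plus-exponentials mixture of the abstract, and all values are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
means, sd, beta = np.array([-2.0, 0.0, 2.0]), 1.0, 1.0   # class means, coupling
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}             # a small chain of genes
x = np.array([-2.1, -1.9, 1.8, 2.2])                     # observed log-ratios
z = rng.integers(0, 3, 4)                                # latent classes

for _ in range(200):                                     # Gibbs sweeps
    for i in range(4):
        neigh = [z[j] for j in adj[i]]
        # Log posterior of each class: Gaussian likelihood plus a Potts
        # bonus for each neighbour currently in that class.
        log_p = (-0.5 * ((x[i] - means) / sd) ** 2
                 + beta * np.array([neigh.count(c) for c in range(3)]))
        p = np.exp(log_p - log_p.max())
        z[i] = rng.choice(3, p=p / p.sum())
print(z)   # typically [0 0 2 2]: down, down, up, up
```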
Bayesian inference of transcription factor activity: an application to the fission yeast cell cycle When modeling genetic regulatory interactions, it is often assumed that the mRNA expression of a transcription factor is a reliable proxy for the regulatory activity of that transcription factor. There are many examples where this assumption does not hold, due to post-transcriptional and translational modifications of the transcription factor protein. As true transcription factor activity is very difficult to measure, methods to infer it are becoming increasingly common, and it is likely that they will become increasingly important when building models of regulatory interactions. Previously, we have shown how Bayesian techniques, particularly Markov chain Monte Carlo sampling, enable us to make inferences regarding the activity of transcription factors based on the transcript levels of their targets. However, in that work we only looked at simple regulatory interactions where one transcription factor acted individually on a set of target genes. In this work, we investigate extending this model to the more general (and common) case of multiple transcription factors working together. As an example application, we use data from a small regulatory network from the fission yeast cell cycle in which several transcription factors are known to work together to produce the desired response, and for which plentiful experimental data are available. (Simon Rogers and Mark Girolami, Bioinformatics Research Centre, Department of Computing Science, University of Glasgow; 5 September 2007.) The accompanying slides develop the model as follows. In a single-input motif (SIM), since the mRNA expression of the TF is not a reliable proxy for its true activity, the expression of the target genes and a nonlinear model of transcription are used to infer the activity over time, via a Michaelis-Menten-type ODE of the form dx_g(t)/dt = b_g + c_g τ(t)/(K_g + τ(t)) − δ_g x_g(t), where τ(t) is the TF activity. For the G2/M transition in fission yeast there are at least two TFs, both of which regulate themselves; as a starting point the feedback is removed and two TFs are assumed, thought to operate competitively by binding to the same site on the promoters, so a continuous model of the form dx_g(t)/dt = b_g + f(τ_1(t), τ_2(t), θ_g) − δ_g x_g(t) is sought, together with conditions under which it is a good approximation. Returning to the SIM with one TF, the reaction set D + P ⇌ DP (rates k_1, k_{−1}) and DP → DP + M (rate k_m) can be viewed as a two-state continuous-time Markov chain whose inter-transition times are Exp(P k_1) and Exp(k_{−1}). For large enough T, a stochastic quasi-steady-state assumption computes the quantity of mRNA produced from the stationary distribution: p(DP = 1) = P k_1/(P k_1 + k_{−1}) and p(M_T | ...) = Poiss(M_T | p(DP = 1) T k_m), which is equivalent to removing the protein binding/unbinding reactions and modifying the mRNA production rate to k'_m = k_m P k_1/(P k_1 + k_{−1}), the expected quantity of mRNA produced per unit time. Using this production term in a continuous representation recovers exactly the Michaelis-Menten expression dx_g(t)/dt = b_g + k_{m,g} P/(P + k_{−1,g}/k_{1,g}) − δ_g x_g(t); the approximation is reasonable as long as the state-occupancy probabilities can be considered stationary, i.e. binding and dissociation of the protein are fast relative to changes in P. The same computation for two competitive TFs yields dx_g(t)/dt = b_g + k_{m,g} P_2/(P_2 + K_g P_1 + γ_g) − δ_g x_g(t). With P_2 fixed, P_1 can be inferred with a Metropolis-Hastings algorithm (20000 burn-in samples, log-normal likelihood). In conclusion, a deterministic Michaelis-Menten-type function has been derived for the specific biological model and the inference of the activator protein verified in a simple example; future work includes more realistic examples, more sophisticated sampling schemes for the multi-modal posterior, and incorporating ChIP binding data.
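The resulting competitive-TF expression model can be simulated directly; the sketch below integrates dx/dt = b + km*P2/(P2 + K*P1 + g) - delta*x by forward Euler, with made-up rate constants and assumed sinusoidal TF activity profiles.

```python
import numpy as np

b, km, K, g, delta, dt = 0.1, 2.0, 1.5, 0.2, 0.5, 0.01   # invented rates
x, traj = 0.0, []
for step in range(2000):
    t = step * dt
    P1 = 1.0 + np.sin(t)        # assumed activity of the competing TF
    P2 = 1.0 + np.cos(t)        # assumed activity of the activating TF
    # Michaelis-Menten-type production with competitive inhibition by P1,
    # plus first-order degradation of the target mRNA x.
    dx = b + km * P2 / (P2 + K * P1 + g) - delta * x
    x += dt * dx                # forward Euler step
    traj.append(x)
print(round(x, 3))              # target mRNA level after 20 time units
```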
Stochastic estimation of fluxes in metabolic networks The qualitative and quantitative information conveyed by metabolic networks is important for regulating the metabolism of an organism to achieve desired targets. One approach to quantification is the 13C tracer experiment, which aims to provide information on metabolic fluxes. In this presentation the flux estimation problem is addressed in steady-state and dynamic conditions. The problem formulation in the steady state leads to a latent variable model structure, which is utilised in applying the stochastic estimation framework to solve the flux quantification problem. A natural algorithm to solve this problem is the expectation-maximisation algorithm, which is applied first. This is extended to a Markov chain Monte Carlo algorithm to account for non-Gaussian measurement noise. Finally, a sequential Monte Carlo filter is used to determine the fluxes under dynamic conditions. Results are presented for the central metabolism of Corynebacterium glutamicum in the steady state, and using a simulated metabolic network for the dynamic case.
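In the steady state the estimation problem reduces, in its simplest form, to least squares under the stoichiometric balance S v = 0 with some fluxes measured; the tiny network below is invented to illustrate that structure (the talk's EM, MCMC and SMC machinery addresses the realistic, noisy, partially observed versions).

```python
import numpy as np

# Toy network: A -> B (v1), B -> C (v2), B -> D (v3); balance on metabolite B.
S = np.array([[1.0, -1.0, -1.0]])            # v1 - v2 - v3 = 0 at steady state
measured = np.array([10.0, 6.1])             # noisy measurements of v1 and v2
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])              # which fluxes were measured
# Stack the (heavily weighted) balance equation with the measurements
# and solve the combined least-squares problem for all three fluxes.
A = np.vstack([100.0 * S, M])
b = np.concatenate([[0.0], measured])
v, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(v, 2))                        # v3 is recovered as v1 - v2
```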
Frequent graph mining - what is the question? The objective of data mining is to find regularities, or interesting patterns, in large data sets, such as business transactions. More recently, there has been great interest in extending this work to structured data, such as graphs. The domain could be a database of molecular graphs, or the web graph, and the question could be to find subgraphs which occur frequently in the data. Algorithms usually list frequent subgraphs or other patterns. There are many different formulations of this problem. At this stage of the field's development, it appears worthwhile to put together a general picture of the different variants. In this talk we present an attempt in this direction.
Transductive Rademacher complexities for learning over a graph Recent investigations advocate a probabilistic "learning" perspective on tasks defined on a single graph, as opposed to the traditional algorithmic "computational" point of view. This note discusses the use of Rademacher complexities in this setting, and illustrates the use of Kruskal's algorithm for transductive inference based on a nearest neighbor rule.
Strings, graphs, invariants Strings play an important role in various sciences, from computer science, linguistics and the social sciences to various natural sciences, including bioinformatics. Strings, words, or finite sequences are mainly studied in formal language theory and form the basis of logic, mathematics and theoretical computer science. Although strings have a simple linear structure, we may associate a number of invariants to them, in particular via various graphs. This non-technical talk will explain some of these features and survey some of the recent work of the present authors.
On graphical representation of proteins We will review a selection of graphical representations of proteins and explore their mathematical properties. In particular, we will consider highly condensed representations of proteins by the "magic circle", representation by star-like graphs and spectral-like representations, and will consider calculations of some of the accompanying invariants. Finally we will outline graphical approaches to protein alignment.
Graph complexity for structure and learning The talk will consider ways of bounding the complexity of a graph as measured by the number of partitions satisfying certain properties. The approach adopted uses Vapnik-Chervonenkis dimension techniques. An example of such a bound was given by Kleinberg et al. (2004) with an application to network failure detection. We describe a new bound in the same vein that depends on the eigenvalues of the graph Laplacian. We show an application of the result to transductive learning of a graph labelling from examples.
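The Laplacian eigenvalues such a bound depends on are straightforward to compute; a minimal sketch (the path graph below is an arbitrary example, not one from the talk):

```python
import numpy as np

def laplacian_eigenvalues(adj):
    """Eigenvalues of the combinatorial graph Laplacian L = D - A,
    where D is the diagonal degree matrix and A the adjacency matrix."""
    A = np.asarray(adj, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return np.sort(np.linalg.eigvalsh(L))  # symmetric, so eigvalsh applies

# Path graph on 3 vertices: 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
vals = laplacian_eigenvalues(A)
# The smallest eigenvalue is always 0; its multiplicity counts the
# connected components of the graph.
```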
Semidefinite ranking on graphs We consider the problem of ranking the vertices of an undirected graph given some preference relation. This ranking-on-graphs problem has been tackled before using spectral relaxations in [1]. That approach is strongly related to the spectral relaxation made in spectral clustering algorithms. One problem with spectral relaxations, observed in clustering, is that even on simple toy graphs the spectral solution can be arbitrarily far from the optimal one [2]. It has recently been shown that semidefinite relaxations offer in many cases better solutions than spectral ones for clustering [3] and transductive classification [4]. We therefore investigate semidefinite relaxations of ranking on graphs.
Random walk graph kernels and rational kernels Random walk graph kernels (Gartner et al., 2003 [5]; Borgwardt et al., 2005 [1]) count matching random walks, and are defined using the tensor product graph. Loosely speaking, rational kernels (Cortes et al., 2004, 2003, 2002 [4, 3, 2]) use the weight assigned by a transducer to define a kernel. The kernel is shown to be positive semi-definite when the transducer can be written as a composition of two identical transducers. In our talk we will establish explicit connections between random walk graph kernels and rational kernels. More concretely, we show that composition of transducers is analogous to computing product graphs, and that rational kernels on weighted transducers may be viewed as generalizations of random walk kernels to weighted automata. In order to make these connections explicit we adopt slightly non-standard notation for weighted transducers, extensively using matrices and tensors wherever possible. We prove that under certain conditions rational kernels are positive semi-definite. Our proof only uses basic linear algebra and is simpler than the one presented in Cortes et al., 2004 [4].
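A geometric random walk kernel over the tensor product graph can be written in a few lines; the decay parameter `lam` and the toy single-edge graph are illustrative choices, not values from the cited papers:

```python
import numpy as np

def random_walk_kernel(A1, A2, lam=0.1):
    """Geometric random walk graph kernel computed on the tensor (direct)
    product graph: k(G1, G2) = 1^T (I - lam * A1 (x) A2)^{-1} 1, a weighted
    count of matching walks of all lengths. Convergence requires
    lam < 1 / spectral_radius(A1 (x) A2)."""
    Ax = np.kron(A1, A2)                 # adjacency of the product graph
    n = Ax.shape[0]
    return float(np.linalg.solve(np.eye(n) - lam * Ax, np.ones(n)).sum())

A = np.array([[0, 1], [1, 0]])           # a single-edge graph
k = random_walk_kernel(A, A, lam=0.1)    # kernel of the graph with itself
```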
Extracting Semantic Relations from Query Logs In this paper we study a large query log of more than twenty million queries with the goal of extracting the semantic relations that are implicitly captured in the actions of users submitting queries and clicking answers. Previous query log analyses were mostly done with just the queries and not the actions that followed after them. We first propose a novel way to represent queries in a vector space based on a graph derived from the query-click bipartite graph. We then analyze the graph produced by our query log, showing that it is less sparse than previous results suggested, and that almost all the measures of these graphs follow power laws, shedding some light on the searching user behavior as well as on the distribution of topics that people want in the Web. The representation we introduce allows us to infer interesting semantic relationships between queries. Second, we provide an experimental analysis on the quality of these relations, showing that most of them are relevant. Finally we sketch an application that detects multitopical URLs.
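The basic idea of representing queries via their clicks can be illustrated roughly as follows; the click counts and URLs are made up, and the paper's actual representation uses a graph derived from the bipartite graph, so this shows only the simplest version:

```python
import numpy as np

# Hypothetical query-click data: query -> {clicked URL: click count}
clicks = {
    "jaguar speed":  {"wiki/Jaguar": 5, "cars.example/jaguar": 2},
    "jaguar car":    {"cars.example/jaguar": 7},
    "cheetah speed": {"wiki/Cheetah": 4, "wiki/Jaguar": 1},
}
urls = sorted({u for d in clicks.values() for u in d})

def query_vector(q):
    """Represent a query as its click-count vector over the URL side
    of the query-click bipartite graph."""
    return np.array([clicks[q].get(u, 0.0) for u in urls])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Queries whose users clicked the same documents end up similar,
# exposing an implicit semantic relation between them.
sim = cosine(query_vector("jaguar speed"), query_vector("jaguar car"))
```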
Multiscale Topic Tomography Modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. In this work, we propose a new probabilistic graphical model to address this issue. The new model, which we call the Multiscale Topic Tomography Model (MTTM), employs non-homogeneous Poisson processes to model the generation of word-counts. The evolution of topics is modeled through a multi-scale analysis using Haar wavelets. One of the new features of the model is that it captures the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. Our experiments on Science data using the new model uncover some interesting patterns in topics. The new model is also comparable to LDA in predicting unseen data, as demonstrated by our perplexity experiments.
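The Haar multiscale view of word counts can be sketched without the full Poisson machinery: each level pairwise-averages the finer level, producing the coarser time-scales the user can zoom out to (a deliberate simplification of the model's wavelet analysis):

```python
def haar_multiscale(counts):
    """Haar-style multiscale decomposition of a word-count time series
    (length must be a power of two): each coarser level halves the time
    resolution by pairwise averaging, giving coarse-to-fine views of a
    topic's word intensity over time."""
    levels = [list(counts)]
    cur = list(counts)
    while len(cur) > 1:
        cur = [(cur[i] + cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        levels.append(cur)
    return levels

# Hypothetical counts of one word in four consecutive epochs:
levels = haar_multiscale([4, 2, 6, 8])
```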
A Concept-based Model for Enhancing Text Categorization Most text categorization techniques are based on word and/or phrase analysis of the text. Statistical analysis of a term's frequency captures the importance of the term within a document only. However, two terms can have the same frequency in their documents while one term contributes more to the meaning of its sentences than the other. Thus, the underlying model should identify terms that capture the semantics of the text. In this case, the model can capture terms that present the concepts of a sentence, which leads to discovering the topic of the document. A new concept-based model that analyzes terms on the sentence and document levels, rather than the traditional analysis of the document only, is introduced. The concept-based model can effectively discriminate between non-important terms with respect to sentence semantics and terms which hold the concepts that represent the sentence meaning. The proposed model consists of a concept-based statistical analyzer, a conceptual ontological graph representation, and a concept extractor. A term which contributes to the sentence semantics is assigned two different weights by the concept-based statistical analyzer and the conceptual ontological graph representation. These two weights are combined into a new weight. The concepts that have the maximum combined weights are selected by the concept extractor. A set of experiments using the proposed concept-based model on different datasets in text categorization is conducted. The experiments compare traditional weighting with the concept-based weighting obtained by the combined approach of the concept-based statistical analyzer and the conceptual ontological graph. The evaluation of results relies on two quality measures, the macro-averaged F1 and the error rate, both of which improve when the newly developed concept-based model is used to enhance the quality of text categorization.
Expertise modeling for matching papers with reviewers An essential part of an expert-finding task, such as matching reviewers to submitted papers, is the ability to model the expertise of a person based on documents. We evaluate several measures of the association between a document to be reviewed and an author, represented by their previous papers. We compare language-model-based approaches with a novel topic model, Author-Persona-Topic (APT). In this model, each author can write under one or more "personas," which are represented as independent distributions over hidden topics. Examples of previous papers written by prospective reviewers are gathered from the Rexa database, which extracts and disambiguates author mentions from documents gathered from the web. We evaluate the models using a reviewer matching task based on human relevance judgments determining how well the expertise of proposed reviewers matches a submission. We find that the APT topic model outperforms the other models.
SCAN: A Structural Clustering Algorithm for Networks Network clustering (or graph partitioning) is an important task for the discovery of underlying structures in networks. Many algorithms find clusters by maximizing the number of intra-cluster edges. While such algorithms find useful and interesting structures, they tend to fail to identify and isolate two kinds of vertices that play special roles - vertices that bridge clusters (hubs) and vertices that are marginally connected to clusters (outliers). Identifying hubs is useful for applications such as viral marketing and epidemiology since hubs are responsible for spreading ideas or disease. In contrast, outliers have little or no influence, and may be isolated as noise in the data. In this paper, we propose a novel algorithm called SCAN (Structural Clustering Algorithm for Networks), which detects clusters, hubs and outliers in networks. It clusters vertices based on a structural similarity measure. The algorithm is fast and efficient, visiting each vertex only once. An empirical evaluation of the method using both synthetic and real datasets demonstrates superior performance over other methods such as the modularity-based algorithms.
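One standard formulation of the structural similarity such an algorithm builds on is the normalized overlap of structure sets (neighborhoods including the vertex itself); a minimal sketch on an invented four-vertex graph:

```python
def structural_similarity(adj, u, v):
    """Structural similarity of two vertices: overlap of their structure
    sets (neighborhood plus the vertex itself), normalized by the geometric
    mean of the set sizes. Vertices inside a tight cluster score high."""
    su = adj[u] | {u}
    sv = adj[v] | {v}
    return len(su & sv) / (len(su) * len(sv)) ** 0.5

# Tiny graph: triangle 0-1-2 with a pendant vertex 3 attached to 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
s_in = structural_similarity(adj, 0, 1)   # inside the triangle: maximal
s_out = structural_similarity(adj, 2, 3)  # cluster member vs. pendant: lower
```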
Development of NeuroElectroMagnetic Ontologies (NEMO): A Framework for Mining Brain Wave Ontologies Event-related potentials (ERP) are brain electrophysiological patterns created by averaging electroencephalographic (EEG) data time-locked to events of interest (e.g., stimulus or response onset). In this paper, we propose a generic framework for mining and developing domain ontologies and apply it to mine brainwave (ERP) ontologies. The concepts and relationships in ERP ontologies can be mined according to the following steps: pattern decomposition, extraction of summary metrics for concept candidates, hierarchical clustering of patterns for classes and class taxonomies, and clustering-based classification and association rules mining for relationships (axioms) of concepts. We have applied this process to several dense-array (128-channel) ERP datasets. Results suggest good correspondence between mined concepts and rules, on the one hand, and patterns and rules that were independently formulated by domain experts, on the other. Data mining results also suggest ways in which expert-defined rules might be refined to improve ontology representation and classification results. The next goal of our ERP ontology mining framework is to address some long-standing challenges in conducting large-scale comparison and integration of results across ERP paradigms and laboratories. In a more general context, this work illustrates the promise of an interdisciplinary research program, which combines data mining, neuroinformatics and ontology engineering to address real-world problems.
Exploiting Duality in Summarization with Deterministic Guarantees Summarization is an important task in data mining. A major challenge over the past years has been the efficient construction of fixed-space synopses that provide a deterministic quality guarantee, often expressed in terms of a maximum-error metric. Histograms and several hierarchical techniques have been proposed for this problem. However, their time and/or space complexities remain impractically high and depend not only on the data set size n, but also on the space budget B. These handicaps stem from a requirement to tabulate all allocations of synopsis space to different regions of the data. In this paper we develop an alternative methodology that dispels these deficiencies, thanks to a fruitful application of the solution to the dual problem: given a maximum allowed error, determine the minimum-space synopsis that achieves it. These complexity advantages offer both a space-efficiency and a scalability that previous approaches lacked. We verify the benefits of our approach in practice by experimentation.
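The dual problem, minimum space for a given error bound, admits a simple greedy illustration for maximum-error histograms: grow each bucket as long as approximating it by its midrange value stays within the allowed error. This is a sketch of the dual viewpoint under invented data, not the paper's algorithm:

```python
def min_buckets(data, eps):
    """Dual-problem sketch: given a maximum allowed (L-infinity) error eps,
    greedily extend each bucket while approximating its values by the
    midrange keeps the error at most eps; returns index ranges of buckets."""
    buckets, lo, hi, start = [], data[0], data[0], 0
    for i, x in enumerate(data[1:], 1):
        lo, hi = min(lo, x), max(hi, x)
        if (hi - lo) / 2 > eps:            # bucket can no longer absorb x
            buckets.append((start, i - 1))
            lo = hi = x
            start = i
    buckets.append((start, len(data) - 1))
    return buckets

b = min_buckets([1, 2, 1, 9, 10, 9, 2], eps=1.0)
```

For greedy one-dimensional segmentation this left-to-right sweep is optimal in the number of buckets, and it runs in a single pass independent of any space budget B.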
Webpage Understanding: an Integrated Approach Recent work has shown the effectiveness of leveraging layout and tag-tree structure for segmenting webpages and labeling HTML elements. However, how to effectively segment and label the text contents inside HTML elements is still an open problem. Since many text contents on a webpage are often text fragments and not strictly grammatical, traditional natural language processing techniques, which typically expect grammatical sentences, are no longer directly applicable. In this paper, we examine how to use layout and tag-tree structure in a principled way to help understand text contents on webpages. We propose to segment and label the page structure and the text content of a webpage in a joint discriminative probabilistic model. In this model, semantic labels of page structure can be leveraged to help text content understanding, and semantic labels of the text phrases can be used in page structure understanding tasks such as data record detection. Thus, integration of both page structure and text content understanding leads to an integrated solution of webpage understanding. Experimental results on research homepage extraction show the feasibility and promise of our approach.
Knowledge Discovery of Multiple-topic Document using Parametric Mixture Model with Dirichlet Prior Documents, such as those seen on Wikipedia and in folksonomies, have tended to be assigned multiple topics as meta-data. Therefore, it is increasingly important to analyze the relationship between a document and the topics assigned to it. In this paper, we propose a novel probabilistic generative model of documents with multiple topics as meta-data. By focusing on modeling the generation process of a document with multiple topics, we can extract specific properties of documents with multiple topics. The proposed model is an extension of an existing probabilistic generative model, the Parametric Mixture Model (PMM). PMM models documents with multiple topics by mixing the model parameters of each single topic. However, since PMM assigns the same mixture ratio to each single topic, it cannot take into account the bias of each topic within a document. To deal with this problem, we propose a model that places a Dirichlet distribution as a prior on the mixture ratio. We adopt the variational Bayes method to infer the bias of each topic within a document. We evaluate the proposed model and PMM using the MEDLINE corpus. F-measure, precision and recall results show that the proposed model is more effective than PMM on multiple-topic classification. Moreover, we indicate the potential of the proposed model to extract topics and document-specific keywords using information about the assigned topics.
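The PMM mixing step the model extends can be sketched directly: a multi-topic document's word distribution is a convex combination of single-topic parameter vectors, and the Dirichlet-prior extension lets the mixture ratio be biased per document instead of uniform. The topic distributions and ratios below are toy numbers, not MEDLINE estimates:

```python
def multi_topic_word_dist(topic_dists, mixture_ratio):
    """PMM-style generation sketch: the word distribution of a multi-topic
    document is the mixture of single-topic word distributions, weighted by
    the per-document mixture ratio."""
    vocab = topic_dists[0].keys()
    return {w: sum(r * d[w] for r, d in zip(mixture_ratio, topic_dists))
            for w in vocab}

topics = [{"gene": 0.7, "web": 0.1, "model": 0.2},   # hypothetical topic 1
          {"gene": 0.1, "web": 0.6, "model": 0.3}]   # hypothetical topic 2
uniform = multi_topic_word_dist(topics, [0.5, 0.5])  # plain PMM: equal ratio
biased = multi_topic_word_dist(topics, [0.8, 0.2])   # Dirichlet-biased ratio
```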
Tracking Multiple Topics for Finding Interesting Articles We introduce multiple topic tracking (MTT) for iScore to better recommend news articles for users with multiple interests and to address changes in user interests over time. As an extension of the basic Rocchio algorithm, traditional topic detection and tracking, and single-pass clustering, MTT maintains multiple interest profiles to identify interesting articles for a specific user given user feedback. Focusing on only interesting topics enables iScore to discard useless profiles to address changes in user interests and to achieve a balance between resource consumption and classification accuracy. Also, by relating a topic's interestingness to an article's interestingness, iScore is able to achieve higher quality results than traditional methods such as the Rocchio algorithm. We identify several operating parameters that work well for MTT. Using the same parameters, we show that MTT alone yields high quality results for recommending interesting articles from several corpora. The inclusion of MTT improves iScore's performance by 9% in recommending news articles from the Yahoo! News RSS feeds and the TREC11 adaptive filter article collection. And through a small user study, we show that iScore can still perform well when provided with only little user feedback.
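The profile-per-interest idea behind MTT can be sketched with a Rocchio-style update: feedback either pulls the nearest profile toward the article or spawns a new profile for a new interest. The threshold and learning rate below are arbitrary illustrative choices, and the real system's scoring is richer:

```python
import numpy as np

class MultiTopicTracker:
    """Sketch of multiple-topic tracking: one Rocchio-style profile per
    interesting topic, updated online from positive user feedback."""
    def __init__(self, threshold=0.5, lr=0.3):
        self.profiles = []
        self.threshold, self.lr = threshold, lr

    @staticmethod
    def _cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def feedback(self, article):
        """User marked `article` (a term vector) interesting; returns the
        index of the profile it was assigned to."""
        if self.profiles:
            sims = [self._cos(p, article) for p in self.profiles]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:       # update existing interest
                self.profiles[best] += self.lr * (article - self.profiles[best])
                return best
        self.profiles.append(article.astype(float))  # new interest profile
        return len(self.profiles) - 1

tracker = MultiTopicTracker()
t0 = tracker.feedback(np.array([1.0, 0.0, 0.0]))  # first interest
t1 = tracker.feedback(np.array([0.0, 1.0, 0.0]))  # dissimilar: new topic
t2 = tracker.feedback(np.array([0.9, 0.1, 0.0]))  # close to the first topic
```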
Content-based Document Routing and Index Partitioning for Scalable Similarity-based Searches in a Large Corpus We present a document routing and index partitioning scheme for scalable similarity-based search of documents in a large corpus. We consider the case when similarity-based search is performed by finding documents that have features in common with the query document. While it is possible to store all the features of all the documents in one index, this suffers from obvious scalability problems. Our approach is to partition the feature index into multiple smaller partitions that can be hosted on separate servers, enabling scalable and parallel search execution. When a document is ingested into the repository, a small number of partitions are chosen to store the features of the document. Likewise, only a small number of partitions are queried to perform similarity-based search. Our approach is stateless and incremental. The decision as to which partitions the features of the document should be routed to (for storing at ingestion time, and for similarity-based search at query time) is based solely on the features of the document. Our approach scales very well. We show that executing similarity-based searches over such a partitioned search space has minimal impact on the precision and recall of search results, even though every search consults less than 3% of the total number of partitions.
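Stateless, feature-based routing can be illustrated with a plain hash: every feature votes for one partition, and the document goes to the few partitions with the most votes, identically at ingestion and query time. The partition counts and shingle features below are invented, not the paper's configuration:

```python
import hashlib

NUM_PARTITIONS = 100      # hypothetical number of index partitions
PARTITIONS_PER_DOC = 3    # far fewer than NUM_PARTITIONS, so search stays cheap

def route(features):
    """Stateless routing: each feature hashes to one partition, and the
    document's features are stored in (and queried from) the partitions
    its features hit most often. Ties break deterministically by id."""
    votes = {}
    for f in features:
        p = int(hashlib.md5(f.encode()).hexdigest(), 16) % NUM_PARTITIONS
        votes[p] = votes.get(p, 0) + 1
    ranked = sorted(votes, key=lambda q: (-votes[q], q))
    return ranked[:PARTITIONS_PER_DOC]

# Similar documents share features, so their votes hit overlapping
# partitions; the same rule serves ingestion and query with no global state.
doc = ["shingle:abc", "shingle:def", "shingle:ghi", "shingle:jkl"]
```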
Mining Favorable Facets The importance of dominance and skyline analysis has been well recognized in multi-criteria decision making applications. Most previous studies assume a fixed order on the attributes. In practice, different customers may have different preferences on nominal attributes. In this paper, we identify an interesting data mining problem, finding favorable facets, which has not been studied before. Given a set of points in a multidimensional space, for a specific target point p we want to discover with respect to which combinations of orders (e.g., customer preferences) on the nominal attributes p is not dominated by any other points. Such combinations are called the favorable facets of p. We consider both the effectiveness and the efficiency of the mining. A given point may have many favorable facets. We propose the notion of minimal disqualifying condition (MDC) which is effective in summarizing favorable facets. We develop efficient algorithms for favorable facet mining for different application scenarios. The first method computes favorable facets on the fly. The second method pre-computes all minimal disqualifying conditions so that the favorable facets can be looked up in constant time. An extensive performance study using both synthetic and real data sets is reported to verify their effectiveness and efficiency.
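Dominance under customer preferences on nominal attributes can be checked directly once each preference is expressed as an order over values; a toy sketch (the attributes and preference orders are invented for illustration):

```python
def dominates(a, b, orders):
    """Check whether point `a` dominates `b` given per-attribute preference
    orders over nominal values (lower index = more preferred): `a` must be
    at least as preferred on every attribute and strictly better on one."""
    ranks_a = [orders[i].index(v) for i, v in enumerate(a)]
    ranks_b = [orders[i].index(v) for i, v in enumerate(b)]
    return all(x <= y for x, y in zip(ranks_a, ranks_b)) and ranks_a != ranks_b

# Hypothetical customer preference: red over blue, leather over fabric.
orders = [["red", "blue"], ["leather", "fabric"]]
d1 = dominates(("red", "leather"), ("blue", "fabric"), orders)   # dominated
d2 = dominates(("red", "fabric"), ("blue", "leather"), orders)   # incomparable
```

A favorable facet of p is then a combination of such orders under which no other point makes `dominates(other, p, orders)` true.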
Weighting versus Pruning in Rule Validation for Detecting Network and Host Anomalies For intrusion detection, the LERAD algorithm learns a succinct set of comprehensible rules for detecting anomalies, which could be novel attacks. LERAD validates the learned rules on a separate held-out validation set and removes rules that cause false alarms. However, removing rules with possible high coverage can lead to missed detections. We propose to retain these rules and associate weights to them. We present three weighting schemes and our empirical results indicate that, for LERAD, rule weighting can detect more attacks than pruning with minimal computational overhead.
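The weighting alternative to pruning can be sketched with a Winnow-style multiplicative update: a rule that causes a false alarm during validation is demoted rather than removed, and a record's anomaly score sums the weights of the rules it violates. The demotion factor and scoring below are illustrative, not LERAD's exact scheme:

```python
def winnow_update(weights, violated, alpha=2.0):
    """Winnow-specialist-style sketch: demote the weight of each rule that
    fired a false alarm (violated on validation data) and promote the rest,
    so high-coverage rules survive validation with reduced influence."""
    return [w / alpha if v else w * alpha for w, v in zip(weights, violated)]

def anomaly_score(weights, rule_violated):
    """Weighted anomaly score: sum the weights of the rules a test record
    violates; a pruned rule would contribute nothing at all."""
    return sum(w for w, v in zip(weights, rule_violated) if v)

weights = [1.0, 1.0, 1.0]
weights = winnow_update(weights, [False, True, False])  # rule 1 false-alarmed
score = anomaly_score(weights, [True, True, False])     # test record
```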
Cost-effective Outbreak Detection in Networks Given a water distribution network, where should we place sensors to quickly detect contaminants? Or, which blogs should we read to avoid missing important stories? These seemingly different problems share common structure: Outbreak detection can be modeled as selecting nodes (sensor locations, blogs) in a network, in order to detect the spreading of a virus or information as quickly as possible. We present a general methodology for near optimal sensor placement in these and related problems. We demonstrate that many realistic outbreak detection objectives (e.g., detection likelihood, population affected) exhibit the property of "submodularity". We exploit submodularity to develop an efficient algorithm that scales to large problems, achieving near optimal placements, while being 700 times faster than a simple greedy algorithm. We also derive online bounds on the quality of the placements obtained by any algorithm. Our algorithms and bounds also handle cases where nodes (sensor locations, blogs) have different costs. We evaluate our approach on several large real-world problems, including a model of a water distribution network from the EPA, and real blog data. The obtained sensor placements are provably near optimal, providing a constant fraction of the optimal solution. We show that the approach scales, achieving speedups and savings in storage of several orders of magnitude. We also show how the approach leads to deeper insights in both applications, answering multicriteria trade-off, cost-sensitivity and generalization questions.
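The submodularity the paper exploits is what makes plain greedy selection near-optimal; the authors' algorithm is a much faster lazy variant of it, but the underlying greedy step looks like this (the outbreak scenarios below are invented, and the objective is a simple detection count):

```python
def greedy_placement(nodes, scenarios, budget):
    """Greedy maximization of a submodular detection objective: each scenario
    is given as the set of nodes that would detect it, and each step picks
    the node with the largest marginal gain in scenarios detected. For
    monotone submodular objectives this attains a (1 - 1/e) guarantee."""
    chosen, detected = [], set()
    for _ in range(budget):
        best, best_gain = None, 0
        for n in nodes:
            if n in chosen:
                continue
            gain = sum(1 for i, s in enumerate(scenarios)
                       if i not in detected and n in s)
            if gain > best_gain:
                best, best_gain = n, gain
        if best is None:          # no node adds anything: stop early
            break
        chosen.append(best)
        detected |= {i for i, s in enumerate(scenarios) if best in s}
    return chosen

# Hypothetical scenarios: each set lists the sensor locations detecting it.
scenarios = [{"a", "b"}, {"b"}, {"c"}, {"a", "c"}]
placement = greedy_placement(["a", "b", "c"], scenarios, budget=2)
```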
Joint Optimization of Wrapper Generation and Template Detection Many websites have large collections of pages generated dynamically from an underlying structured source like a database. The data of a category are typically encoded into similar pages by a common script or template. In recent years, some value-added services, such as comparison shopping and vertical search in a specific domain, have motivated the research of extraction technologies with high accuracy. Almost all previous works assume that input pages of a wrapper induction system conform to a common template and they can be easily identified in terms of a common schema of URL. However, we observed that it is hard to distinguish different templates using dynamic URLs today. Moreover, since extraction accuracy heavily depends on how consistent input pages are, we argue that it is risky to determine whether pages share a common template solely based on URLs. Instead, we propose a new approach that utilizes similarity between pages to detect templates. Our approach separates pages with notable inner differences and then generates wrappers, respectively. Experimental results show that our proposed approach is feasible and effective for improving extraction accuracy.
Detecting Anomalous Records in Categorical Datasets We consider the problem of detecting anomalies in high-arity categorical datasets. In most applications, anomalies are defined as data points that are "abnormal". Quite often we have access to data which consists mostly of normal records, along with a small percentage of unlabelled anomalous records. We are interested in the problem of unsupervised anomaly detection, where we use the unlabelled data for training, and detect records that do not follow the definition of normality. A standard approach is to create a model of normal data, and compare test records against it. A probabilistic approach builds a likelihood model from the training data. Records are tested for anomalousness based on the complete record likelihood given the probability model. For categorical attributes, Bayes nets give a standard representation of the likelihood. While this approach is good at finding outliers in the dataset, it often tends to detect records with attribute values that are rare. Sometimes, just detecting rare values of an attribute is not desired and such outliers are not considered as anomalies in that context. We present an alternative definition of anomalies, and propose an approach of comparing against marginal distributions of attribute subsets. We show that this is a more meaningful way of detecting anomalies, and has a better performance over semi-synthetic as well as real world datasets.
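The comparison against marginal distributions of attribute subsets can be sketched by scoring each record with the frequency of its rarest value pair: the flagged record below has only common individual values but a rare combination (the shipping-style data is invented for illustration):

```python
from itertools import combinations
from collections import Counter

def marginal_scores(records, subset_size=2):
    """Score each record by its rarest attribute-subset marginal: the record
    is suspicious when some *pair* of its values is rare together, even when
    every individual value is common on its own."""
    attrs = range(len(records[0]))
    counts = {c: Counter(tuple(r[i] for i in c) for r in records)
              for c in combinations(attrs, subset_size)}
    n = len(records)
    return [min(counts[c][tuple(r[i] for i in c)] / n for c in counts)
            for r in records]

records = ([("cargo", "china", "toys")] * 5 +
           [("cargo", "china", "fruit")] * 5 +
           [("cargo", "chile", "fruit")] * 5 +
           [("cargo", "chile", "toys")])   # common values, rare combination
scores = marginal_scores(records)          # lowest score = most anomalous
```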
Constraint-Driven Clustering Clustering methods can be either data-driven or need-driven. Data-driven methods intend to discover the true structure of the underlying data, while need-driven methods aim at organizing the true structure to meet certain application requirements. Thus, need-driven (e.g., constrained) clustering is able to find more useful and actionable clusters in applications such as energy-aware sensor networks, privacy preservation, and market segmentation. However, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. In this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. For this purpose, we introduce a novel cluster model, Constraint-Driven Clustering (CDC), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. Two general types of constraints are considered, i.e., minimum significance constraints and minimum variance constraints, as well as combinations of these two types. We prove the NP-hardness of the CDC problem with different constraints. We propose a novel dynamic data structure, the CD-Tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the CDC constraints and minimizes the objective function. Based on CD-Trees, we develop an efficient algorithm to solve the new clustering problem. Our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm.
CF-Tree and R*-Tree Algorithm Experimental Results Results on Synthetic data set Results on Letter data set Conclusion & Future Work References Thanks
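The abstract above leaves the CD-Tree machinery to the paper; as a toy illustration of the core idea, that a minimum-significance (cluster-size) constraint rather than a user-supplied k determines the number of clusters, here is a greedy merge sketch. This is not the authors' algorithm; `constraint_driven_merge` and its nearest-centroid merge rule are our own simplification.

```python
import numpy as np

def constraint_driven_merge(points, min_size):
    """Start from singleton clusters and repeatedly merge the smallest
    cluster violating the minimum-significance (size) constraint into
    its nearest cluster by centroid distance. The constraint, not a
    user-supplied k, decides how many clusters remain."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]
    centroid = lambda c: points[c].mean(axis=0)
    while len(clusters) > 1:
        violators = [c for c in clusters if len(c) < min_size]
        if not violators:
            break
        c = min(violators, key=len)
        rest = [d for d in clusters if d is not c]
        target = min(rest, key=lambda d: np.linalg.norm(centroid(c) - centroid(d)))
        clusters.remove(c)
        target.extend(c)
    return clusters
```

On well-separated data the cluster count simply falls out of `min_size`; the real CDC algorithm additionally minimizes a compactness objective via the CD-Tree and supports variance constraints.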
A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network We address the issue of clustering numerical vectors with a network. The problem setting is basically equivalent to constrained clustering by Wagstaff and Cardie [20] and semi-supervised clustering by Basu et al. [2], but our focus is more on the optimal combination of two heterogeneous data sources. An application of this setting is web pages, which can be numerically vectorized by their contents, e.g. term frequencies, and which are hyperlinked to each other, forming a network. Another typical application is genes, whose behavior can be numerically measured and whose network can be given from another data source. We first define a new graph clustering measure, which we call normalized network modularity, by balancing the cluster size of the original modularity. We then propose a new clustering method which integrates the cost of clustering numerical vectors with the cost of maximizing the normalized network modularity into a spectral relaxation problem. Our learning algorithm is based on spectral clustering, which turns our problem into an eigenvalue problem, and uses k-means for the final cluster assignments. A significant advantage of our method is that we can optimize the weight parameter balancing the two costs from the given data by choosing the minimum total cost. We evaluated the performance of our proposed method using a variety of datasets, including synthetic data as well as real-world data from molecular biology. Experimental results showed that our method is effective for clustering with both numerical vectors and a network. 
A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network. Table of Contents: Heterogeneous Data Clustering; Related Work; Spectral Clustering; Cost Combining Numerical Vectors with a Network; Complex Networks; Network Modularity; Our Proposed Spectral Clustering; Synthetic Data; Results for Synthetic Data (1) and (2); Synthetic Data (Numerical Vector) + Real Data (Gene Network); Summary. Motoki Shiga, Ichigaku Takigawa, Hiroshi Mamitsuka, Bioinformatics Center, ICR, Kyoto University, Japan. KDD 2007, San Jose, California, USA, August 12-15, 2007.

Outline: 1. Motivation: clustering for heterogeneous data (numerical + network). 2. Proposed method: spectral clustering of numerical vectors + a network. 3. Experiments: synthetic data and real data. 4. Summary.

Heterogeneous Data Clustering. Heterogeneous data: various information related to one object of interest, e.g. gene analysis (gene expression, metabolic pathways, etc.) or web page analysis (word frequencies, hyperlinks, etc.). Numerical vectors (e.g. gene expression values over S experiments) can be clustered with k-means, SOM, etc.; networks (e.g. metabolic pathways) with minimum edge cut, ratio cut, etc. To improve clustering accuracy, combine the numerical vectors with the network (M. Shiga, I. Takigawa and H. Mamitsuka, ISMB/ECCB 2007).

Related work: semi-supervised clustering. It uses a local property, neighborhood relations (must-link and cannot-link edges), either as hard constraints (K. Wagstaff and C. Cardie, 2000) or as soft constraints via a probabilistic model, the hidden Markov random field (S. Basu et al., 2004). The proposed method instead uses a global property (network modularity), soft constraints, and spectral clustering.

Spectral Clustering (L. Hagen et al., IEEE TCAD, 1992; J. Shi and J. Malik, IEEE PAMI, 2000). 1. Compute an affinity (dissimilarity) matrix M from the data. 2. Optimize the cost J(Z) = tr{Z^T M Z} subject to Z^T Z = I, where Z(i,k) = 1 if node i belongs to cluster k and 0 otherwise; relaxing Z(i,k) to real values turns this trace optimization into an eigenvalue problem for M, so each node is represented by one or more computed eigenvectors. 3. Assign a cluster label to each node (by k-means).

Cost combining numerical vectors with a network. The cost for the numerical vectors is cosine dissimilarity (N: number of nodes; Y: inner products of the normalized numerical vectors). What cost for the network? To define one, use a property of complex networks.

Complex Networks (e.g. gene networks, the WWW, social networks). Properties: small-world phenomena, power laws, hierarchical structure, and network modularity (Ravasz et al., Science, 2002; Guimera et al., Nature, 2005).

Normalized Network Modularity. Modularity is the density of intra-cluster edges (#intra-cluster edges / #total edges); it is normalized by cluster size (Guimera et al., Nature, 2005; Newman et al., Phys. Rev. E, 2004). Here Z denotes the set of all nodes, Z_k the set of nodes in cluster k, and L(A,B) the number of edges between node sets A and B.

Cost Combining Numerical Vectors with a Network. Combine the cosine dissimilarity of the numerical vectors with the (negative) normalized modularity of the network into a single matrix M_omega, for a weight omega ranging from 0 to 1.

Our Proposed Spectral Clustering. 1. Compute the matrix M_omega. 2. Optimize the cost J(Z) = tr{Z^T M_omega Z} subject to Z^T Z = I by relaxing the elements of Z to real values and computing the eigenvalues and eigenvectors of M_omega; each node is represented by K-1 eigenvectors. 3. Assign a cluster label to each node by k-means in the spectral space. Finally, optimize the weight omega: Cost_spectral is the sum of dissimilarities between the cluster centers and the data points output by k-means in the spectral space.

Synthetic Data. Numerical vectors are drawn from von Mises-Fisher distributions with concentration kappa = 1, 5, 50; the network is a random graph with #nodes = 400, #edges = 1600, and modularity = 0.375.

Results for Synthetic Data. Against the baselines of numerical vectors only (k-means) and network only (maximum modularity), the best NMI (Normalized Mutual Information) is achieved for 0 < omega < 1, and omega can be optimized using Cost_spectral.

Synthetic Data (Numerical Vector) + Real Data (Gene Network). True clusters (#clusters = 10) versus resultant clusters (omega = 0.5, kappa = 10; kappa = 10, 10^2, 10^3); the gene network is derived from KEGG metabolic pathways. Again the best NMI lies in 0 < omega < 1 and can be optimized using the cost.

Summary. A new spectral clustering method is proposed that combines numerical vectors with a network through a global network property (normalized network modularity); the clustering can be optimized via the weight omega; its performance is confirmed experimentally to be better than using numerical vectors only or the network only, and the weight can be optimized on synthetic and semi-real datasets. Thank you for your attention! (See poster #16.)

Spectral Representation of M_omega (concentration kappa = 5, modularity = 0.375): spectral embeddings for omega = 0, 0.3, and 1, with reported costs J of 0.0538, 0.0809, and 0.0932; omega is selected by minimizing Cost_spectral, i.e. where the clusters are divided most separately.

Results for Real Genomic Data. Numerical vectors: Hughes' expression data (Hughes et al., Cell, 2000). Gene network: constructed using KEGG metabolic pathways (M. Kanehisa et al., NAR, 2006). Our method attains higher NMI than ratio cut and normalized cut, with omega chosen by Cost_spectral.

Evaluation Measure. Normalized Mutual Information (NMI) between the estimated clusters C and the standard (ground-truth) clusters G, where H(C) is the entropy of the random variable C; the more similar the clusters C and G are, the larger the NMI.

Web Page Clustering. Pages are described by numerical vectors of word frequencies (n(A,1), n(A,2), ... for page A) and by the hyperlink network; to improve accuracy, combine the heterogeneous data.

Spectral Clustering for Graph Partitioning: ratio cut and normalized cut (L. Hagen et al., IEEE TCAD, 1992; J. Shi and J. Malik, IEEE PAMI, 2000).
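The combination step at the heart of the talk can be sketched in a few lines. This is a simplification, using the standard Newman modularity matrix and plain cosine similarity rather than the paper's exact normalized-modularity cost, and `combined_spectral_embed` is our own name:

```python
import numpy as np

def combined_spectral_embed(X, A, omega, K):
    """Blend a cosine-similarity matrix built from numerical vectors X
    with the Newman modularity matrix of adjacency A, weighted by
    omega, then embed each node with the top K-1 eigenvectors."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                                   # cosine similarity (vectors)
    deg = A.sum(axis=1)
    two_m = deg.sum()
    B = (A - np.outer(deg, deg) / two_m) / two_m    # modularity matrix (network)
    M = omega * S + (1.0 - omega) * B
    vals, vecs = np.linalg.eigh(M)                  # eigenvalues in ascending order
    return vecs[:, -(K - 1):]                       # leading eigenvectors as features
```

Running k-means on the rows of the embedding gives the cluster labels; following the talk, omega can be scanned over [0, 1] and picked by minimizing the k-means cost in the spectral space.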
Unifying Divergence Minimization and Statistical Inference via Convex Duality We unify divergence minimization and statistical inference by means of convex duality. In the process of doing so, we prove that the dual of approximate maximum entropy estimation is maximum a posteriori estimation. Moreover, our treatment leads to stability and convergence bounds for many statistical learning problems.

Kernel Methods, Lecture 3: Inference and Convex Duality. Course Overview; Inverse Problems; Example; Maximum Entropy Principle; Proof; Proof (Part II); Approximate Moment Matching; Previous Work; Questions; Fenchel Duality; Key Theorem; Application: Csiszar Divergence; Application: KL-Divergence; Application: Conditional Models; Concentration of Empirical Means; Risk Bounds; Optimization; Shameless Plugs. Thanks to Yasemin Altun, Markus Hegland. Alexander J. Smola, Statistical Machine Learning Program, Canberra, ACT 0200, Australia, Alex.Smola@nicta.com.au. Machine Learning Summer School, Taiwan 2006.

Course Overview. Estimation in exponential families: maximum likelihood and priors, the Clifford-Hammersley decomposition, applications. Conditional distributions and kernels: classification, regression, conditional random fields. Inference and convex duality: maximum entropy inference, approximate moment matching. Maximum mean discrepancy: means in feature space, covariate shift correction. Hilbert-Schmidt independence criterion: covariance in feature space, ICA, feature selection.

Inverse Problems. Observations: data x_1, ..., x_m drawn from some distribution p(x); measurements y_1, ..., y_m observed at x_1, ..., x_m; indirect measurements y_1, ..., y_m generated by some measurement process A. Formal definition: solve the problem Ax = b for unknown x. Ill-posed problem: we do not have enough data to find x exactly, i.e. A does not have full rank. Solution: solve a regularized risk minimization problem, minimize f(x) subject to Ax - b = 0.

Example: Density Estimation. We want a density matching the empirical means E[x] = (1/m) sum_i x_i and E[x^2] = (1/m) sum_i x_i^2. Ill-posed problem: many distributions are possible, e.g. p(x) = (1/m) sum_i delta_{x_i}(x); we need a regularity condition. Regularizers: a small squared norm of the density, int p^2(x) dx; a smooth density, i.e. small int |d p(x)/dx|^2 dx; a non-informative density, i.e. small negative entropy int p(x) log p(x) dx.

Maximum Entropy Principle. Motivation: find the least informative consistent distribution. Moment matching: given phi(x), find p such that E[phi(x)] = mu := (1/m) sum_i phi(x_i). Theorem (MaxEnt is dual to Maximum Likelihood): minimize int p(x) log p(x) dx subject to E[phi(x)] = mu has as its dual the maximum likelihood problem: minimize g(theta) - <theta, mu>, where g(theta) = log int exp(<phi(x), theta>) dx.

Proof. Lagrangian: we need to ensure that p is nonnegative and normalized: L(p, theta, lambda, xi) = int p(x) log p(x) dx + <theta, mu - int phi(x) p(x) dx> + lambda (1 - int p(x) dx) - int xi(x) p(x) dx. Variational derivative (informally, we can pull the derivative into the integral): log p(x) + 1 - <theta, phi(x)> - lambda - xi(x) = 0. Solution: p(x) = exp(<phi(x), theta> - g(theta)), where g(theta) absorbs the normalization constants.

Proof (Part II). Wolfe's dual: plugging the expansion p(x) = exp(<phi(x), theta> - g(theta)) into the Lagrangian yields: maximize <theta, mu> - g(theta), where g(theta) = log int exp(<phi(x), theta>) dx. Maximum likelihood: multiplying the dual objective by m and using mu = (1/m) sum_i phi(x_i) proves the claim. Caveat: we ignored feasibility and constraint qualification of the problem.

Approximate Moment Matching. Exact moment matching requires that the distribution has exactly the same moments as the empirical mean. Example: estimating a normal distribution with mean 0 and variance 1, where the empirical average of x is 0.03 and that of x^2 is 1.07. Clearly exact moment matching is unrealistic! Solution: maximum entropy under approximate moment matching: minimize int p(x) log p(x) dx subject to ||E[phi(x)] - mu|| <= epsilon.

Previous Work. General problem: minimize f(x) subject to ||Ax - b|| <= epsilon. AdaBoost (Lafferty 1999, Kivinen et al. 1999, Collins et al. 2000): f(x) is the Bregman divergence corresponding to the unnormalized entropy, epsilon = 0, b = 0, and A takes care of deviations from the empirical averages (more later). Regularized MaxEnt (Dudik et al., 2004, 2006): f(x) is the (normalized) entropy, b is the empirical average of the moments, A is the expectation operator, and the norms are l1 and l2. Regularization theory (Tikhonov and Arsenin, 1977): f(x) is the squared l2 norm ||x||^2, for the general ill-posed problem Ax = b.

Questions. General case: a unified treatment of approximate solutions of ill-posed problems. Algorithm: an efficient algorithm to solve all the problems. Feasibility: when is the problem feasible, bounded, etc.? When can we compute the dual? Interpretation: what is the meaning of the dual problem?

Fenchel Duality. Definition (convex conjugate): denote by f : X -> R a convex function on some convex domain X of a Banach space B. Then the dual f* : B* -> R is defined as f*(x*) := sup_{x in X} <x, x*> - f(x). Properties: self-duality, f** = f; linear offset, {f(x) + <a, x>}* = f*(x* - a); linear functions, f(x) = <a, x> with X the unit ball of B implies f*(x*) = ||x* - a||.

Key Theorem (Fenchel's duality with constraints): t := inf_{x in X} {f(x) subject to Ax - b in epsilon B} and d := sup_{x* in B*} {-f*(A* x*) + <b, x*> - epsilon ||x*||_{B*}}. If core(A dom f) intersects (b + epsilon int(B)), then t = d. (Here s in core(S) if the union over lambda > 0 of lambda(S - s) covers X, with S a subset of X.) This is the price we pay for infinite dimensionality; it allows us to rewrite optimization problems in the dual domain.

Application: Csiszar Divergence. Denote by q a reference density and let h be convex: f(p) := int q(t) h(p(t)/q(t)) dt. Special cases are the Tsallis, Burg, Amari, and KL divergences. Primal problem: minimize f(p) subject to ||E[phi(x)] - mu||_B <= epsilon and int dp = 1. Dual problem: maximize - int q(t) h*(<theta, phi(t)> - Lambda) dt + <theta, mu> - epsilon ||theta||, with the density given by p(t) = q(t) (h*)'(<theta, phi(t)> - Lambda).

Application: KL-Divergence. h(xi) = xi log xi yields the Kullback-Leibler divergence. Dual problem: maximize - log int q(t) exp(<theta, phi(t)>) dt + <theta, mu> - epsilon ||theta||, with the density given by p(t) = q(t) exp(<theta, phi(t)> - g(theta)). Examples: for B = l_infinity we get l1 penalization (Dudik et al. 2004); for B = H, a Hilbert space, we get a kernel method (Nemenman and Bialek, 1998).

Application: Conditional Models. AdaBoost (Collins et al., 2000): f is a sum over unnormalized entropies for p(y|x_i), A is a sum over evaluations of features at locations (x_i, y_i), epsilon = 0. Gaussian process classification: f is a sum over normalized entropies for p(y_i|x_i), A is a sum over evaluations of features at locations (x_i, y_i), B is a Hilbert space. Gaussian process regression: same as classification, only with different sufficient statistics. Conditional random fields: the sufficient statistics phi(x, y) decompose into cliques.

Concentration of Empirical Means. Problem: we need to determine epsilon for the constraint ||E[phi(x)] - (1/m) sum_i phi(x_i)|| <= epsilon. Theorem (uniform convergence to empirical means): with probability at least 1 - exp(-epsilon^2 m / R^2), ||E[phi(x)] - mu|| <= 2 R_m(F, p) + epsilon, where R_m is the Rademacher average and F is the class of linear functions of bounded norm. Advantage: a principled regularization scheme with epsilon = O(m^{-1/2}).

Risk Bounds. Loss: L(theta, mu) := f*(<theta, phi(.)>) - <theta, mu> + epsilon ||theta||. True statistics: let mu* be the true mean. Theorem: with probability at least 1 - exp(-epsilon^2 m / R^2), L(theta*, mu*) - L(theta, mu) <= ||theta*|| [2 R_m(F, p) + epsilon]. Proof: use the relation L(theta, mu) - L(theta, mu*) <= <theta, mu* - mu> together with the concentration of empirical means. We also want to bound the deviation between the loss at the actual solution theta and at the optimal solution theta*: with the same probability, L(theta, mu*) - L(theta*, mu*) is bounded by a constant multiple, depending on the norm exponent k, of [2 R_m(F, p) + epsilon]. In the case of an RKHS we can get slightly tighter bounds instead of using Rademacher averages.

Optimization. We can use Zhang's algorithm (2003) for minimization. 1: input: a sample of size m, statistics phi, base function class B*, approximation epsilon, number of iterations K, and radius R of the base space of solutions. 2: set theta = 0. 3: for k = 1, ..., K do 4: find (e, alpha) such that for e_i in the base class and alpha in [0, 1] the following is approximately minimized: L((1 - alpha) theta + R alpha e_i, b). 5: update theta <- (1 - alpha) theta + R alpha e. end for. This gives an O(1/K) rate of convergence.

Shameless Plugs. Looking for a job ... talk to me! Alex.Smola@nicta.com.au (http://www.nicta.com.au). Positions: PhD scholarships; postdoctoral positions, senior researchers; long-term visitors (sabbaticals etc.). More details on kernels: http://sml.nicta.com.au, http://www.kernel-machines.org, http://www.learning-with-kernels.org. Schölkopf and Smola: Learning with Kernels.
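The maximum-entropy duality at the core of the lecture can be written compactly (our transcription, with theta as the dual parameter and mu-hat as the empirical moments):

```latex
% Primal: maximum entropy subject to moment matching
\min_{p}\ \int p(x)\log p(x)\,dx
\quad \text{s.t.}\quad \mathbf{E}_{p}[\phi(x)]=\hat\mu,\quad \int p(x)\,dx=1 .
% Dual: maximum likelihood in the exponential family
\min_{\theta}\ g(\theta)-\langle\theta,\hat\mu\rangle ,
\qquad g(\theta)=\log\int \exp\bigl(\langle\phi(x),\theta\rangle\bigr)\,dx ,
\qquad p(x)=\exp\bigl(\langle\phi(x),\theta\rangle-g(\theta)\bigr).
```

Approximate moment matching replaces the equality constraint by the ball constraint ||E_p[phi(x)] - mu-hat|| <= epsilon, which after Fenchel duality adds the regularizer epsilon ||theta|| to the dual objective, i.e. regularized maximum likelihood.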
Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis To unravel the concept structure and dynamics of the bioinformatics field, we analyze a set of 7401 publications from the Web of Science and MEDLINE databases, publication years 1981-2004. For delineating this complex, interdisciplinary field, a novel bibliometric retrieval strategy is used. Given that the performance of unsupervised clustering and classification of scientific publications is significantly improved by deeply merging textual contents with the structure of the citation graph, we proceed with a hybrid clustering method based on Fisher's inverse chi-square. The optimal number of clusters is determined by a compound semiautomatic strategy comprising a combination of distance-based and stability-based methods. We also investigate the relationship between the number of Latent Semantic Indexing factors, the number of clusters, and clustering performance. The HITS and PageRank algorithms are used to determine representative publications in each cluster. Next, we develop a methodology for dynamic hybrid clustering of evolving bibliographic data sets. The same clustering methodology is applied to consecutive periods defined by time windows on the set, and in a subsequent phase chains are formed by matching and tracking clusters through time. Term networks for the eleven resulting cluster chains present the cognitive structure of the field. Finally, we provide a view on how much attention the bioinformatics community has devoted to the different subfields through time. 
Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis Overview of the presentation General context Agglomerative hierarchical clustering Indexing in Vector Space Model Bibliometrics and network analysis Hybrid (integrated) clustering Hybrid clustering: intermediate integration Weighted linear combination (linco) Fisher's inverse chi-square method (1) Fisher's inverse chi-square method (2) Fisher's inverse chi-square method (3) Conclusions from previous research Dynamic hybrid mapping of bioinformatics Number of clusters and LSI factors Number of clusters: stability diagram Number of clusters: link-based Silhouette values Dendrogram Dynamics Dynamic term networks Conclusions (1) Conclusions (2)
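Fisher's inverse chi-square method named above is the classical p-value combination rule; a minimal sketch of the generic statistic follows (how the paper maps textual and citation similarities to p-values is its own contribution and is not reproduced here):

```python
import math

def fisher_combine(p_values):
    """Fisher's method: under the null hypothesis, the statistic
    X = -2 * sum(ln p_i) follows a chi-square distribution with 2n
    degrees of freedom, so a large X (small combined p-value) signals
    agreement across the n evidence sources being merged."""
    x = -2.0 * sum(math.log(p) for p in p_values)
    return x, 2 * len(p_values)
```

Comparing X against the chi-square tail with the returned degrees of freedom yields the combined p-value used to fuse the two similarity channels.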
Enhancing Semi-Supervised Clustering: A Feature Projection Perspective Semi-supervised clustering employs limited supervision in the form of labeled instances or pairwise instance constraints to aid unsupervised clustering and often significantly improves the clustering performance. Despite the vast amount of expert knowledge spent on this problem, most existing work is not designed for handling high-dimensional sparse data. This paper thus fills this crucial void by developing a Semi-supervised Clustering method based on spheRical KmEans via fEature projectioN (SCREEN). Specifically, we formulate the problem of constraint-guided feature projection, which can be nicely integrated with semi-supervised clustering algorithms and has the ability to effectively reduce data dimension. Indeed, our experimental results on several real-world data sets show that the SCREEN method can effectively deal with high-dimensional data and provides an appealing clustering performance.
45th Annual Meeting of the Association for Computational Linguistics The conference was organized by the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague ("Univerzita Karlova v Praze"), Czech Republic, the oldest university in Europe to the north of the Alps (founded in 1348).
BoostCluster: Boosting Clustering by Pairwise Constraints Data clustering is an important task in many disciplines. A large number of studies have attempted to improve clustering by using side information that is often encoded as pairwise constraints. However, these studies focus on designing special clustering algorithms that can effectively exploit the pairwise constraints. We present a boosting framework for data clustering, termed BoostCluster, that is able to iteratively improve the accuracy of any given clustering algorithm by exploiting the pairwise constraints. The key challenge in designing a boosting framework for data clustering is how to influence an arbitrary clustering algorithm with the side information, since clustering algorithms by definition are unsupervised. The proposed framework addresses this problem by dynamically generating new data representations at each iteration that are, on the one hand, adapted to the clustering results at previous iterations by the given algorithm, and on the other hand consistent with the given side information. Our empirical study shows that the proposed boosting framework is effective in improving the performance of a number of popular clustering algorithms (K-means, partitional SingleLink, spectral clustering), and its performance is comparable to the state-of-the-art algorithms for data clustering with side information.
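To make "generating a new data representation consistent with pairwise constraints" concrete, here is a toy linear version of that idea (not BoostCluster itself, which re-derives the representation at every boosting iteration; `constraint_projection` and its scatter-matrix construction are our own illustration):

```python
import numpy as np

def constraint_projection(X, must, cannot, dim):
    """Build a scatter matrix that is positive along cannot-link
    difference vectors and negative along must-link ones, then project
    onto its top eigenvectors, so that constrained pairs are pushed
    apart or pulled together in the new representation."""
    X = np.asarray(X, dtype=float)
    C = np.zeros((X.shape[1], X.shape[1]))
    for i, j in cannot:
        d = X[i] - X[j]
        C += np.outer(d, d)          # spread cannot-link pairs
    for i, j in must:
        d = X[i] - X[j]
        C -= np.outer(d, d)          # compress must-link pairs
    vals, vecs = np.linalg.eigh(C)   # ascending eigenvalues
    return X @ vecs[:, -dim:]
```

Any off-the-shelf clustering algorithm can then be run on the projected data, which is the spirit of the framework's interface to unsupervised methods.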
Assisting Translators in Indirect Lexical Transfer We present the design and evaluation of a translator's amanuensis that uses comparable corpora to propose and rank non-literal solutions to the translation of expressions from the general lexicon. Using distributional similarity and bilingual dictionaries, the method outperforms established techniques for extracting translation equivalents from parallel corpora.
Nonlinear Adaptive Distance Metric Learning for Clustering A good distance metric is crucial for many data mining tasks. To learn a metric in the unsupervised setting, most metric learning algorithms project observed data to a low-dimensional manifold, where geometric relationships such as pairwise distances are preserved. This can be extended to the nonlinear case by applying the kernel trick, which embeds the data into a feature space by specifying the kernel function that computes the dot products between data points in the feature space. In this paper, we propose a novel unsupervised Nonlinear Adaptive Metric Learning algorithm, called NAML, which performs clustering and distance metric learning simultaneously. NAML first maps the data to a high-dimensional space through a kernel function; then applies a linear projection to find a low-dimensional manifold where the separability of the data is maximized; and finally performs clustering in the low-dimensional space. The performance of NAML depends on the selection of the kernel function and the projection. We show that the joint kernel learning, dimensionality reduction, and clustering can be formulated as a trace maximization problem, which can be solved via an iterative procedure in the EM framework. Experimental results demonstrate the efficacy of the proposed algorithm.
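As a rough, sequential stand-in for the pipeline described above (kernel map, then linear projection, then clustering), one can chain kernel PCA with Lloyd's k-means. This is not the paper's joint EM optimization; the function names, the RBF kernel choice, and the farthest-first seeding are all our own:

```python
import numpy as np

def rbf_kernel(X, gamma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_project_cluster(X, gamma, dim, k, iters=20):
    """RBF kernel map -> centred kernel PCA projection to `dim`
    dimensions -> Lloyd's k-means with farthest-first seeding."""
    K = rbf_kernel(np.asarray(X, dtype=float), gamma)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n              # centring matrix
    vals, vecs = np.linalg.eigh(H @ K @ H)
    Z = vecs[:, -dim:] * np.sqrt(np.maximum(vals[-dim:], 1e-12))
    C = [Z[0]]                                       # farthest-first seeds
    while len(C) < k:
        d = np.min([((Z - c) ** 2).sum(-1) for c in C], axis=0)
        C.append(Z[int(np.argmax(d))])
    C = np.array(C)
    for _ in range(iters):                           # Lloyd iterations
        lab = np.argmin(((Z[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([Z[lab == j].mean(axis=0) if np.any(lab == j) else C[j]
                      for j in range(k)])
    return lab
```

NAML's contribution is precisely that it does not fix the kernel and projection up front as this sketch does, but optimizes them jointly with the clustering via trace maximization.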
A Framework for Simultaneous Co-clustering and Learning from Complex Data For difficult classification or regression problems, practitioners often segment the data into relatively homogeneous groups and then build a model for each group. This two-step procedure usually results in simpler, more interpretable and actionable models without any loss in accuracy. We consider problems such as predicting customer behavior across products, where the independent variables can be naturally partitioned into two groups. A pivoting operation can now result in the dependent variable showing up as entries in a "customer by product" data matrix. We present a model-based co-clustering (meta)-algorithm that interleaves clustering and the construction of prediction models to iteratively improve both cluster assignment and model fit. This algorithm provably converges to a local minimum of a suitable cost function. The framework not only generalizes co-clustering and collaborative filtering to model-based co-clustering, but can also be viewed as simultaneous co-segmentation and classification or regression, which is better than independently clustering the data first and then building models. Moreover, it applies to a wide range of bi-modal or multimodal data, and can be easily specialized to address classification and regression problems. We demonstrate the effectiveness of our approach on both of these problems through experimentation on real and synthetic data.
Joint Cluster Analysis of Attribute and Relationship Data Without A Priori Specification of the Number of Clusters In many applications, attribute and relationship data are available, carrying complementary information about real world entities. In such cases, a joint analysis of both types of data can yield more accurate results than classical clustering algorithms that either use only attribute data or only relationship (graph) data. The Connected k-Center (CkC) has been proposed as the first joint cluster analysis model to discover k clusters which are cohesive on both attribute and relationship data. However, it is well known that prior knowledge on the number of clusters is often unavailable in applications such as community identification and hotspot analysis. In this paper, we introduce and formalize the problem of discovering an a priori unspecified number of clusters in the context of joint cluster analysis of attribute and relationship data, called the Connected X Clusters (CXC) problem. True clusters are assumed to be compact and distinctive from their neighboring clusters in terms of attribute data and internally connected in terms of relationship data. Different from classical attribute-based clustering methods, the neighborhood of clusters is not defined in terms of attribute data but in terms of relationship data. To efficiently solve the CXC problem, we present JointClust, an algorithm which adopts a dynamic two-phase approach. In the first phase, we find so-called cluster atoms. We provide a probability analysis for this phase, which gives us a probabilistic guarantee that each true cluster is represented by at least one of the initial cluster atoms. In the second phase, these cluster atoms are merged in a bottom-up manner resulting in a dendrogram. The final clustering is determined by our objective function. 
Our experimental evaluation on several real datasets demonstrates that JointClust indeed discovers meaningful and accurate clusterings without requiring the user to specify the number of clusters.
NATO Advanced Study Institute on Mining Massive Data Sets for Security This Workshop brings together scientists and engineers interested in recent developments in exploiting Massive Data Sets. Emphasis is placed on available techniques and their application to security-critical applications. Today our world is awash in data and we live in an Information Society where every action leaves a trace, generating massive amounts of data. Recent scientific developments provide technologies to exploit these huge amounts of data and extract critical information from them. Used today in many commercial applications (marketing campaigns, user profiling and recommendations on e-commerce sites, web search, user communities, ...), these technologies can also be used for security-critical applications (fraud detection and money laundering, intrusion detection, intelligence gathering, terrorist network detection, Web surveillance, ...). It is the purpose of this workshop to review the various technologies available (data mining algorithms, social networks, crawling and indexing, text mining, search engines, data streams) in the context of very large data sets. The workshop will provide survey presentations and posters to help build a scientific community aware of security issues and of the techniques to solve them.
Identifying Temporal Patterns and Key Players in Document Collections We consider the problem of analyzing the development of a document collection over time without requiring meaningful citation data. Given a collection of timestamped documents, we formulate and explore the following two questions. First, what are the main topics and how do these topics develop over time? Second, to gain insight into the dynamics driving this development, what are the documents and who are the authors that are most influential in this process? Unlike prior work in citation analysis, we propose methods addressing these questions without requiring the availability of citation data. The methods use only the text of the documents as input. Consequently, they are applicable to a much wider range of document collections (email, blogs, etc.), most of which lack meaningful citation data. We evaluate our methods on the proceedings of the Neural Information Processing Systems (NIPS) conference. Even with the preliminary methods that we implemented, the results show that the methods are effective and that addressing the questions based on the text alone is feasible. In fact, the text-based methods sometimes even identify influential papers that are missed by citation analysis. 
Identifying Temporal Patterns and Key Players in Document Collections Introduction and Goals Key Contribution Temporal Cluster Histograms: Goals Temporal Cluster Histograms: Method Temporal Cluster Histograms: Results Related Work (Topics) Influential Documents: Goals Related Work (Influence) Influential Documents: Method Influential Documents: Results Influential Authors: Goals Influential Authors: Method Influential Authors: Results Conclusions Future Directions MLSS Taipei 2006 Workshop Identifying Temporal Patterns and Key Players in Document Collections Rich Caruana, Thorsten Joachims, Johannes Gehrke, Benyah Shaparenko Cornell University {caruana,tj,johannes,benyah}@cs.cornell.edu Introduction and Goals Identify Development of Topics What are the key topics in a collection of documents and how did their popularity change over time? Identify Influential Documents Which documents introduced new ideas that had a large impact? Identify Influential Authors Which authors have the largest influence on the development of topics? Key Contribution Find influential documents and authors without using citation analysis Methods based on document text Only metadata is the author name for who wrote which document Wider applicability: news, blogs, email, etc. Temporal Cluster Histograms: Goals What are the main topics in a collection? Identify key topics. What proportion of documents is in each topic? How do topics develop? What are new emerging topics? Which topics are fading? When did particular topics peak in popularity? Temporal Cluster Histograms: Method Document Assumptions Text, Time-stamped, Document dependencies Testbed 1955 NIPS documents from 1987 ? 
2000 Vector space TFIDF representation K-means Clustering Cosine distance metric on TFIDF text vectors K = 7, 13, 30 10 runs, select run with least squared error Temporal Cluster Histograms: Results NIPS k-means clusters Percentage Distribution (k=13) ?86 ?87 ?88 ?89 ?90 ?91 ?92 ?93 ?94 ?95 ?96 ?97 ?98 ?99 ?00 Year 12: chip, circuit, analog, voltage, vlsi 11: kernel, margin, svm, vc, xi 10: bayesian, mixture, posterior, likelihood, em 9: spike, spikes, firing, neuron, neurons 8: neurons, neuron, synaptic, memory, firing 7: david, michael, john, richard, chair 6: policy, reinforcement, action, state, agent 5: visual, eye, cells, motion, orientation 4: units, node, training, nodes, tree 3: code, codes, decoding, message, hints 2: image, images, object, face, video 1: recurrent, hidden, training, units, error 0: speech, word, hmm, recognition, mlp P e r c e n ta g e o f P a p e r s Related Work (Topics) Temporal Topic/Trend Detection Topic Detection and Tracking (TDT) Studies (Allan et al. ?98) ICA (Kolenda et al. ?01) Burst detection (Kleinberg ?02) Timelines (Swan ?00) Efficient, formal models (Guha et al. ?05) Thread life cycle (Mei et al. ?05) Influential Documents: Goals Identify ?leading? papers Which documents introduce new ideas? Which documents most influence future work? Constraints Analysis not limited to scientific papers Must work without citation data Only document text and timestamp may be used Related Work (Influence) Influential Documents and Authors Bibliometrics (McGovern et al. ?03, Osareh ?96, White ?04) Hubs/Authorities (Kleinberg ?99) PageRank (Page et al. ?99) Impact factors (Garfield ?03) Influential Documents: Method Document Lead/Lag Index Basic assumption: Document dependencies are based on the terminology of documents. Find the k nearest neighbors (cosine distance) Raw lead/lag score: LLraw(d) = # later ? 
#earlier Scaled lead/lag score (avoid edge effects): LLnorm(d) = LLraw(d) - AVG_{d' in year(d)}(LLraw(d')) [Figure: timeline 1997-2003 showing document d and its neighbors d1-d4] Influential Documents: Results
Score  Year  Cites      Paper Title and Authors
1.167  1996  128        "improving the accuracy and speed of support vector machines" by chris j.c. burges, b. schoelkopf
1.128  1999  17 (466)   "using analytic qp and sparseness to speed training of support vector machines" by john c. platt
0.986  1999  18         "regularizing adaboost" by gunnar raetsch, takashi onoda, klaus-robert mueller
0.953  1996  41 (3711)  "support vector method for function approximation, regression, and signal processing" by v. vapnik, s. golowich, a. smola
0.945  1998  27         "training methods for adaptive boosting of neural networks" by holger schwenk, yoshua bengio
0.945  1997  3          "modeling complex cells in an awake macaque during natural image viewing" by william e. vinje, jack l. gallant
0.934  1998  17         "em optimization of latent-variable density models" by chris bishop, markus svensen, chris williams
0.934  1995  584        "a new learning algorithm for blind signal separation" by s. amari, a. cichocki, h. h. yang
Influential Authors: Goals Identify "leading" authors Who are the major players on the scene? Which authors most influence future work? Constraints Analysis not limited to scientific papers Must work without citation data Only document text and timestamp may be used Influential Authors: Method Author Lead/Lag Index Assume author a has documents d1, ..., dn Compute the scaled lead/lag score for each document and average these scores: LLnorm(a) = (1/n)(LLnorm(d1) + ... + LLnorm(dn)) Compute the variance v of LLnorm(a) and rank by LLnorm(a) - 2 * sqrt(v / n) Use smoothing to avoid small-sample artifacts Influential Authors: Results Conclusions Propose problem: Analyze temporal trends without using citation data Temporal cluster histograms concisely depict how popularity changes over time.
Document (author) lead/lag index identifies important, influential documents (authors) without using citation analysis. Future Directions Better methods for identifying influential documents and authors Simple methods work More principled methods probably do better Associate influential documents with cluster formation Clustering algorithms that directly capture the splitting and merging of clusters
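The document lead/lag index described above is easy to prototype. A minimal sketch, assuming documents arrive as TF-IDF row vectors with publication years (function and variable names here are illustrative, not from the authors' code):

```python
import numpy as np

def cosine_knn(vectors, i, k):
    """Indices of the k nearest neighbors of document i by cosine similarity."""
    v = vectors[i]
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(v)
    sims = vectors @ v / np.maximum(norms, 1e-12)
    sims[i] = -np.inf  # exclude the document itself
    return np.argsort(sims)[-k:]

def lead_lag_raw(vectors, years, i, k=3):
    """Raw lead/lag score: #neighbors published later - #neighbors published earlier."""
    nn = cosine_knn(vectors, i, k)
    later = int(np.sum(years[nn] > years[i]))
    earlier = int(np.sum(years[nn] < years[i]))
    return later - earlier

def lead_lag_norm(vectors, years, i, k=3):
    """Scaled score: subtract the average raw score of documents from the same year."""
    raw = np.array([lead_lag_raw(vectors, years, j, k) for j in range(len(years))])
    same_year = raw[years == years[i]]
    return raw[i] - same_year.mean()
```

Ranking documents by the normalized score then surfaces candidates that "lead" their nearest textual neighbors in time.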
Surface plasmon resonance imagery and single molecule approaches to proteo-nucleic complex self-assembly The presentation will illustrate the use of surface plasmon resonance imagery and of single molecule approaches based on quantum dot labelling for the analysis of the dynamics and structure of proteo-nucleic assemblies. Semi-synthetic complexes (PDAs), encompassing a protein part embedding electro-active redox cofactors and a nucleic acid part allowing highly controlled self-organisation of the protein domains, were designed and constructed. This approach allows the building of large and fully defined chains of electron transport centres with tuneable spatial organization. Association with quantum dots, which offer single molecule monitoring, opens the route to the use of self-assembled biomolecules as molecular wires for bioelectronics.
Information Theoretic and Algebraic Methods for Network Anomaly Detection The tutorial will discuss two central issues: (i) information theoretic principles and algorithms for extracting predictive statistics in distributed networks and (ii) algebraic and spectral methods for network anomaly detection. The first part will deal with the concept of predictive information - the mutual information between the past and future of a process - its sub-extensive properties, and algorithms for estimating it from data. We will argue that information theoretic predictability quantifies the complexity of a process and provides effective ways for detecting anomalies and surprises in the process. Using the Information Bottleneck algorithms one can extract approximate sufficient statistics from the past to the future of the process and use them as anomaly detectors on multiple time scales. In the second part we will discuss ways of analyzing network activity using spectral methods (distributed PCA and network Laplacian analysis) for identifying regular temporal patterns of connected network components. By combining the two approaches, we will suggest new techniques for network anomaly detection for security.
Algebraic and Information Theoretic Methods for Network Anomaly Detection Outline Statement of the problem "...drowning in data but starving for knowledge" Biological neural networks Biochemical interactions Gene expression analysis Example: Wireless Sensor Networks An Object Moving Through the Network Graph Theoretical Formulation Undirected graph - Symmetric matrix Security Issues Algebraic Methods - Static Networks (1) Algebraic Methods - Static Networks (2) Algebraic Methods - Static Networks (3) Laplacian eigenvector decomposition Application: Using Spectral Embedding for Novelty Detection in communication networks Reordering the nodes based on Spectral decomposition Simple illustration Distances between graphs (1) Distances between graphs (2) Distances between graphs (3) Distances between graphs (4) Example of the anomaly detection Diffusion on Graphs Computational comment Diffusion on time dependent graphs Predictive Information Why Predictability? (1) Why Predictability? (2) Predictive Information (with Bialek and Nemenman, 2001) Predictive Information Logarithmic growth for finite dimensional processes Power law growth Entropy of words in a Spin Chain Entropy of 3 Generated Chains Predictive Information - Subextensive Component of the Entropy But WHAT - in the past - is predictive? But WHAT - in the past - is predictive? The Information Bottleneck Method The IB algorithm - generalized Arimoto-Blahut for RDT Why is the predictive information interesting? Variable Memory Markov Models and Prediction Suffix Tree Learning Complexity-Accuracy Tradeoff Complexity-Accuracy Tradeoff Can we understand it? Many Thanks to...
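As a toy illustration of the spectral part of the tutorial, one can embed nodes with the low eigenvectors of the graph Laplacian and flag a snapshot as anomalous when its spectrum drifts from a baseline. This is a minimal sketch under those assumptions, not the tutorial's own code:

```python
import numpy as np

def laplacian(adj):
    """Unnormalized graph Laplacian L = D - A of a symmetric adjacency matrix."""
    return np.diag(adj.sum(axis=1)) - adj

def spectral_embedding(adj, dim=2):
    """Embed nodes using the Laplacian eigenvectors with the smallest nonzero eigenvalues."""
    vals, vecs = np.linalg.eigh(laplacian(adj))
    return vecs[:, 1:dim + 1]  # skip the constant eigenvector

def spectral_distance(adj_a, adj_b):
    """A simple distance between two graphs on the same nodes: gap between Laplacian spectra."""
    sa = np.linalg.eigvalsh(laplacian(adj_a))
    sb = np.linalg.eigvalsh(laplacian(adj_b))
    return float(np.linalg.norm(sa - sb))
```

An anomaly detector in this spirit would track `spectral_distance` between successive snapshots of the network and alarm when it exceeds a threshold learned from normal traffic.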
Data stream management and mining The course provides an introduction to the data stream management and mining field. The following points are treated: (1) applications which motivated these new developments (telecommunications, computer networks, stock market, security, ...), (2) new concepts related to data streams (structure of a stream, timestamps, time windows, ...), (3) main features of data stream management systems, (4) adaptations of data mining algorithms to the case of streams, (5) solutions to summarize data streams. Data stream management and mining Outline What is a data stream? (1) What is a data stream? (2) Outline - Applications of data stream processing Applications of data stream processing - Data stream processing Applications of data stream processing - Applications Applications of data stream processing - Standard data processing versus data stream processing Applications of data stream processing - Let's go deeper into some examples Applications of data stream processing - Network management (1) Applications of data stream processing - Network management (2) Applications of data stream processing - Stock monitoring Applications of data stream processing - Linear Road Benchmark (1) Applications of data stream processing - Linear Road Benchmark (2) Applications of data stream processing - Linear Road Benchmark (3) Applications of data stream processing - Where is the problem?
Outline - Models for data streams Models for data streams - Structure of a stream (1) Models for data streams - Structure of a stream (2) Models for data streams - Model of a stream Models for data streams - Contents of a stream Models for data streams - Modeling the stream Models for data streams - Some canonical models of streams Models for data streams - Examples: Models for data streams - Windowing (1) Models for data streams - Windowing (2) Models for data streams - Sliding window Outline - Data stream management systems DSMS outline - Definition of a DSMS DSMS: definition DSMS outline - DSMS data model DSMS: data model (1) DSMS: data model (2) DSMS outline - Queries in a DSMS DSMS: queries (1) DSMS: queries (2) DSMS: queries (3) DSMS: queries (4) DSMS: queries (5) DSMS: queries (6) DSMS outline - STREAM example with "Linear Road" DSMS: STREAM (1) DSMS: STREAM (2) DSMS: STREAM (3) DSMS: STREAM (4) DSMS outline - Main architecture of DSMS Main architecture of DSMS (1) Main architecture of DSMS (2) Main architecture of DSMS (3) DSMS outline - Approximate answers to queries Approximate answers to queries (1) Approximate answers to queries (2) Approximate answers to queries (3) Approximate answers to queries (4) DSMS outline - Main existing DSMS Main existing DSMS (1) Main existing DSMS (2) Main existing DSMS (3) Outline - Data stream mining Data stream mining: outline - Definition Data stream mining: definition (1) Data stream mining: definition (2) Data stream mining: definition (3) Data stream mining: definition (4) Data stream mining: outline - Decision tree Data stream mining: decision tree (1) Data stream mining: decision tree (2) Data stream mining: decision tree (3) Data stream mining: outline - PCA Data stream mining: additive methods (1) Data stream mining: additive methods (2) Data stream mining: outline - Clustering Data stream mining: clustering (1) Data stream mining: clustering (2) Data stream mining: clustering (3) Data stream mining: clustering
(4) Data stream mining: clustering (5) Data stream mining: clustering (6) Data stream mining: clustering (7) Data stream mining: clustering (8) Outline - Synopses structures Synopses structures Synopses structures: random samples Synopses structures Synopses structures: sketches (1) Synopses structures: sketches (2) Synopses structures: sketches (3) Synopses structures: sketches (4) Synopses structures: sketches (5) Synopses structures: sketches (6) Synopses structures: sketches (7) Synopses structures: sketches (8) Outline - Conclusion Conclusion References: general References: DSMS References: data stream mining QUESTIONS?
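The "random samples" synopsis in the outline is typically maintained with reservoir sampling (Vitter's Algorithm R). A minimal sketch, not taken from the course material:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(n + 1)    # item n survives with probability k/(n+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

The point of such a synopsis is that memory stays bounded at k items no matter how long the stream runs, while the sample remains uniform over everything seen so far.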
User logs processing using machine learning techniques User modeling is progressively becoming an important and generic component of many applications and services. The main reasons that explain this phenomenon are the increasing complexity of tasks and the wide variety of users. Information systems, hypermedia, websites, and application software are becoming more and more complex, hence difficult to use efficiently. Also, the amount of on-line information available to a user through the Internet is huge and still increasing every day, so that retrieving information is becoming harder and harder. Finally, together with the huge development of the Internet, more and more on-line commercial websites and services are offered to Internet users. In these situations, the aim and interest of user modeling consist in helping the user to use the systems he is offered efficiently and to retrieve the information he is looking for by filtering the information according to his will and needs. Furthermore, while many software systems, hypermedia, websites and services are potentially used by a variety of users, these systems have been traditionally developed in a "one size fits all" manner. Consequently, they are often not adapted to most of the users, with their various knowledge, preferences, and needs. In this context user modeling allows personalizing such systems, their content or presentation, in order to fit the individual.
User Modeling and Machine Learning: A Survey User Modeling and Machine Learning: A Survey Outline Main Applications Intelligent User Interfaces Intelligent User Interfaces Adaptive Hypermedia (1) Adaptive Hypermedia (2) Adaptive Hypermedia (3) Adaptive Hypermedia (4) Educational Systems Recommender Systems (1) Recommender Systems (2) WebSite log analysis - Web analytics solutions Anaweb project Personalized Information Retrieval (1) Personalized Information Retrieval (2) Example (1) Example (2) Prediction (of Next Action) Navigation Help Systems (1) Navigation Help Systems (2) Navigation Help Systems (3) Desktop User Help System Office Activity Help Systems User models (1) User models (2) User models (3) User models (4) User models (5) Some more details about... Web log preprocessing (1) Web log preprocessing (2) Web log preprocessing (3) Web log preprocessing (4) Standard Web Usage Mining technique: Association rules Dealing with sequences Dealing with sequences () Dealing with sequences () Dealing with sequences () User navigation behaviour detection and tracking (1) User navigation behaviour detection and tracking (2) User navigation behaviour detection and tracking (3) User navigation behaviour detection and tracking (4) User navigation behaviour detection and tracking (5) User navigation behaviour detection and tracking (6) References
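The association-rule step of standard Web usage mining mentioned above can be sketched on page-visit sessions; the thresholds and helper names below are illustrative assumptions:

```python
from itertools import combinations
from collections import Counter

def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Mine rules A -> B over page pairs, filtered by support and confidence."""
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for s in sessions:
        pages = set(s)                     # one count per session, not per hit
        page_count.update(pages)
        pair_count.update(combinations(sorted(pages), 2))
    rules = []
    for (a, b), c in pair_count.items():
        if c / n >= min_support:           # support of the pair {a, b}
            for x, y in ((a, b), (b, a)):
                conf = c / page_count[x]   # confidence of rule x -> y
                if conf >= min_confidence:
                    rules.append((x, y, c / n, conf))
    return rules
```

Rules such as "visitors of x also visit y" then feed recommendation or navigation-help components like those listed in the outline.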
Summarizing Data Stream's History This article presents data mining algorithms that build summaries of the whole history of one or several data streams, so that selected parts of that history may be studied later. Summarizing Data Stream's History Plan Definition of a data stream Definition of a Data Stream Summary Motivation Existing Summary Methods StreamSamp: Single Stream Summary Algorithm (1/3) (Online component) Algorithm (2/3) (Online component) Algorithm (3/3) (Online component) Method Evaluation Artificial Dataset (1) Artificial Dataset (2) KDD 98 Charitable Donation Dataset (1) KDD 98 Charitable Donation Dataset (2) KDD 99 Network Intrusion Dataset Result Scores for Various Chunks of the Stream Result Curve for an Increasing Quantity of Stream Data Starting with the First Element Result Curve for an Increasing Quantity of Stream Data Starting with the Last Element Result Curve for Stream Processing Speed CrossStream: Relational Stream Summary Motivation Problem Goal Useful Tools Cluster Feature Vector (CFV) (BIRCH, Zhang 1996) (Aggarwal 2003) CluStream (on-line part) (1) CluStream (on-line part) (2) CluStream (on-line part) (3) SnapShot System Snapshot System: Distribution example: 2o CluStream (off-line part) Bloom Filters (Bloom 1970) (1/3) Bloom Filters (Bloom 1970) (2/3) Bloom Filters (Bloom 1970) (3/3) Method Presentation System Overview Entity Summary Relation Summary (1) Relation Summary (2) Storage Management Method Evaluation Performance Evaluation Simple Static Dataset Simple Relational Data Structure Numerical Values (1) Numerical Values (2) Numerical Values (3) Complex Static Dataset Complex Relational Data Structure (1) Complex Relational Data Structure (2) Conclusion and Perspective
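The Bloom filter (Bloom 1970) summarized on the slides admits a compact toy implementation; deriving the k hash positions from a single SHA-256 digest is an assumption of this sketch, not the talk's design:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per item in an m-bit array."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        # derive k positions from one SHA-256 digest of the item
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # no false negatives; false positives possible
        return all(self.bits[p] for p in self._positions(item))
```

This is why Bloom filters suit stream summaries: membership of every element ever seen is kept in constant space, at the price of a tunable false-positive rate.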
Evolving Networks Most real networks evolve through time: changes of topology can occur as nodes and/or edges appear and/or disappear, and even if the topology stays static, the types or weights of nodes and edges can also change. Mobile devices with wireless capabilities (mobile phones, laptops, etc.) are a typical example of evolving networks, where nodes or users are spread around the environment and connections between users can only occur if they are near each other. This who-is-near-who network evolves every time users move, and communication services such as the spread of any information will deeply rely on the mobility and on the characteristics of the underlying network. We will present here some results focusing on three key problems - measuring, describing and modeling evolving networks - using a typical evolving network where 41 sensors had been distributed to participants of a conference, who were asked to keep the sensor with them at all times. Each sensor was able to detect and record the presence of other sensors within its radio range, which gives some information on the proximity of participants.
Facial expression recognition and emotion recognition from speech The presentation tackles the problem of recognizing emotions based on video and audio data analysis. A fully automatic facial expression recognition system is based on three components: face detection, facial characteristic point extraction and classification. Face detection is performed by boosting simple rectangular Haar-like features that give a decent representation of the face. These features also allow the differentiation between a face and a non-face. The boosting algorithm is combined with an Evolutionary Search to speed up the overall search time. Facial characteristic points (FCP) are extracted from the detected faces. The same technique applied to faces is utilized for this purpose. Additionally, FCP extraction using corner detection methods and brightness distribution has also been considered. Finally, after retrieving the required FCPs the emotion of the facial expression can be determined. Using a sparse learning Relevance Vector Machine in Facial Expression Recognition Introduction Problem definition Facial expression recognition system I. Face detection Viola&Jones features The RVM-based weak classifier Face detection III. FCP model III.1. FCP detection using corner detectors III.2.a. The FCPs to be extracted with RVM based E.A. classifier Using a sparse learning Relevance Vector Machine in Facial Expression Recognition Dragoş Datcu, April 14, 2006 Introduction The utility of facial expression recognition technology: - Human computer interfaces - Safety, surveillance - terrorist identification - safe driving, somnolence detection - access control - Psychology Problem definition How to realize a fully automatic facial expression recognition system using a sparse learning Relevance Vector Machine?
The automatic facial expression recognition system includes: face detector, facial feature extractor for mouth, left and right eye, Facial Characteristic Point (FCP) extractor, facial expression recognizer. Facial expression recognition system I. Face detection The method makes use of: - Viola&Jones features (24x24 size samples, 162336 features/sample) - Evolutionary AdaBoost (150 size population) - Relevance Vector Machine (RVM) weak classifier - a training data set of 4916 faces and 10000 non-faces. The detector consists of a 32-layer cascade of classifiers using 4297 features. Viola&Jones features The basic types: Applied on an image: The value is: Haar value = (sum of pixel intensities in dark areas) - (sum of pixel intensities in light areas). For a 24x24 image, there exist more than 160,000 such features. Euromedia 2006, Dragoş Datcu AdaBoost algorithm Discrete AdaBoost [Freund and Schapire (1996b)]
1. Start with weights w_i = 1/N, i = 1, ..., N.
2. Repeat for m = 1, 2, ..., M:
(a) Fit the classifier f_m(x) in {-1, 1} using weights w_i on the training data.
(b) Compute err_m = E_w[1(y != f_m(x))], c_m = log((1 - err_m)/err_m).
(c) Set w_i <- w_i exp[c_m 1(y_i != f_m(x_i))], i = 1, 2, ..., N, and renormalize so that sum_i w_i = 1.
3. Output the classifier sign[sum_{m=1}^{M} c_m f_m(x)].
The RVM-based weak classifier Evolutionary AdaBoost E.A. performs an efficient search for the representative Viola&Jones features for classification Face detection Cascaded classifier with T layers Example: choosing the proper weak classifier for three different V&J features. 2-fold cross validation results on three weak classifiers for face detection based on Haar-like features ROC curves of three kernels, obtained by adjusting each classifier's threshold Face detection results RVM test results, both training and testing are performed on the MIT CBCL database RVM test results, the training is done using MIT CBCL, the testing is done on the CMU database II.
Facial feature extraction The facial features to be extracted are: left/right eye and mouth areas. III. FCP model Kobayashi and Hara model the face through 30 FCPs There are three steps involved in the FCP detection: 1. FCP detection using corner detectors 2.a. FCP detection using RVM classifier 2.b. FCP detection using integral projection method III.1. FCP detection using corner detectors There are two corner detectors that are used as a first stage for FCP detection. The hybrid corner detector stands for a combination of two corner detectors: - Harris - Sojka III.2.a. FCP detection using RVM classifier The method makes use of: - Viola&Jones features (13x13 size samples, 14140 features/sample) - Evolutionary AdaBoost - Relevance Vector Machine (RVM) weak classifier Note: the DCT (Discrete Cosine Transform) method has been found to be too sensitive to illumination III.2.a. The FCPs to be extracted with RVM based E.A. classifier III.2.a. The stages of training the FCP detector III.2.a. FCP data set BioID dataset and the Carnegie Mellon dataset III.2.a. EA characteristics III.2.a. FCP detection results III.2.b. FCP detection, integral projection method It is used to extract the rest of the points. - projects the image onto the vertical and horizontal axes - obtain the boundaries - the boundaries of the features have relatively high contrast - the image is represented by two 1D orthogonal projection functions: IPF_v, IPF_h, MIPF_v, MIPF_h, VIPF_v, VIPF_h, GPF_v, GPF_h
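The Discrete AdaBoost procedure quoted above translates almost line for line into code. A minimal sketch with one-feature threshold stumps standing in for the RVM-based weak classifier (an assumption made here for brevity):

```python
import numpy as np

def stump_fit(X, y, w):
    """Pick the (feature, threshold, polarity) stump with least weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(X[:, f] >= t, pol, -pol)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, f, t, pol)
    return best

def adaboost(X, y, M=5):
    """Discrete AdaBoost (Freund & Schapire): returns a list of weighted stumps."""
    N = len(y)
    w = np.full(N, 1.0 / N)                 # step 1: uniform weights
    model = []
    for _ in range(M):                      # step 2: M boosting rounds
        err, f, t, pol = stump_fit(X, y, w)
        err = max(err, 1e-12)               # guard against log(0)
        c = np.log((1 - err) / err)
        pred = np.where(X[:, f] >= t, pol, -pol)
        w = w * np.exp(c * (pred != y))     # up-weight the mistakes
        w /= w.sum()                        # renormalize
        model.append((c, f, t, pol))
    return model

def predict(model, X):
    """Step 3: sign of the weighted vote of all weak classifiers."""
    score = sum(c * np.where(X[:, f] >= t, pol, -pol) for c, f, t, pol in model)
    return np.sign(score)
```

In the system above the stumps over single Haar-like features play this weak-learner role, with the evolutionary search narrowing down which of the 160,000+ features each round even considers.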
Mining Networks through Visual Analytics: Analysts are faced with massive collections gathering documents, events and actors from which they try to make sense, searching data to locate patterns and discover evidence. Visual and interactive exploration of data has now been established as a fruitful strategy to tackle the problem posed by this abundance of information. The Visual Analytics initiative promotes the use of Information Visualization to support analytical reasoning through a sense-making loop based on which the analyst incrementally builds hypotheses. Mining Networks through Visual Analytics - Incremental Hypothesis Building and Validation peacokmaps.com InfoVis CyberInfraStructure - Pajek Tulip - BubbleTree Graph Viz Framework Tulip Internet traffic Voronoï Treemaps Cushion Treemaps Munzner's Hyperbolic Browser Tulip - Sugiyama Layout Visualize? (1) Visualize? (2) Visual graph mining related to security issues Example from NCTC data (1) Example from NCTC data (2) Example from NCTC data (3) Massive data (1) Massive data (2) Visualization and Moore's law (1) Visualization and Moore's law (2) Added value of visual and interactive mining "Sense making loop" "Visualization mantras" Visualization "pipeline" Visualize? Organize data prior to visualization Case study: ITA 2000 passenger air traffic Case study: ITA 2000 passenger air traffic TopoLayout - (Topological) Feature-based Hierarchization (1) TopoLayout - (Topological) Feature-based Hierarchization (2) TopoLayout - (Topological) Feature-based Hierarchization (3) TopoLayout - (Topological) Feature-based Hierarchization (4) TopoLayout - (Topological) Feature-based Hierarchization (5) TopoLayout -
(Topological) Feature-based Hierarchization (6) TopoLayout TopoLayout + interaction: Grouse (1) TopoLayout + interaction: Grouse (2) TopoLayout + interaction: Grouse (3) TopoLayout + interaction: Grouse (4) Multilevel navigation of small world networks Small world networks (1) Small world networks (2) Small world networks (3) Small world networks (4) Small world networks (5) Community structure of small world networks (1) Community structure of small world networks (2) Community structure of small world networks (2) "Quality" criteria MQ MQ / Nice properties (1) MQ / Nice properties (2) Challenge: find the best possible clustering (according to MQ) Filter / Threshold (1) Filter / Threshold (2) Filter / Threshold (3) Hierarchical organization of the network MQ / Extension (1) MQ / Extension (2) Conclusion - Future work MQ / Extension to graph hierarchies Conclusion - Future work Conclusion (1) Conclusion (2) Conclusion (3) Credits
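The MQ quality criterion used to score clusterings can be sketched in its classic form (mean intra-cluster edge density minus mean inter-cluster density); the talk's extension to graph hierarchies may differ in detail, so treat this as an assumption-laden illustration:

```python
import numpy as np

def mq(adj, clusters):
    """Modularization Quality: mean intra-cluster density minus mean inter-cluster density."""
    k = len(clusters)
    intra, inter = [], []
    for i, ci in enumerate(clusters):
        ni = len(ci)
        mu = adj[np.ix_(ci, ci)].sum() / 2.0               # edges inside cluster i
        intra.append(mu / (ni * (ni - 1) / 2.0) if ni > 1 else 0.0)
        for cj in clusters[i + 1:]:
            eps = adj[np.ix_(ci, cj)].sum()                # edges between clusters i and j
            inter.append(eps / (len(ci) * len(cj)))
    a = sum(intra) / k
    e = sum(inter) / len(inter) if inter else 0.0
    return a - e
```

A search procedure (the "Challenge" slide) would then try clusterings and keep the one maximizing this score.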
Inference and Learning with Networked Data In many applications we would like to draw inferences about entities that are interconnected in complex networks. For example, calls, emails, IM, and web pointers link people into huge social networks. However, traditional statistical and machine learning classification methods assume that entities are independent of each other. I start by discussing various applications of "classification" (scoring) in networked data, from fraud detection to counterterrorism to network-based marketing. I then discuss four characteristics of networked data that allow improvements - sometimes substantial - over traditional classification: (i) models can take into account "guilt by association," (ii) inference can be performed "collectively," whereby inferences on linked entities mutually reinforce each other, (iii) characteristics of linked entities can be incorporated in models, and (iv) models can incorporate specific identifiers, such as the identities of particular individuals, to improve inference. I present results demonstrating the effectiveness of these techniques. Inference and Learning with Networked Data Modeling for prediction using networked data Prediction in networked data Prediction tasks in networked data (cf. Getoor Tutorial 2005) Modeling for prediction The problem: Prediction in Networked Data (1) The problem: Prediction in Networked Data (2) The problem: Prediction in Networked Data (3) The problem: Prediction in Networked Data (4) The problem: Prediction in Networked Data (5) The problem: Prediction in Networked Data (6) Example social network application: Target consumers for new product Sales rates are substantially higher for "network neighbors" More-sophisticated network-based attributes?
Cumulative % of Consumers Targeted (Ranked by Predicted Sales) Example social network application: Ecommerce firms increasingly are collecting data on explicit social networks of consumers (1) Example social network application: Ecommerce firms increasingly are collecting data on explicit social networks of consumers (2) So, what's different about networked data? Unique Characteristics of Networked Data (for predictive inference) (1) Unique Characteristics of Networked Data (for predictive inference) (2) Guilt by association: autocorrelation relationship between labels* of neighboring nodes How can predictive models incorporate network autocorrelation? (Part 0) How can predictive models incorporate network autocorrelation? (Part 1) Some univariate network classification techniques (see Macskassy & P. JMLR 2007) How can predictive models incorporate network autocorrelation? (Part 2) How can predictive models incorporate network autocorrelation? (Part 2, cont.) How can predictive models incorporate network autocorrelation? (Part 2, cont.) Is guilt-by-association justified theoretically? (1) Is guilt-by-association justified theoretically? (2) Is guilt-by-association justified theoretically? (3) Is guilt-by-association justified theoretically? (4) Is guilt-by-association justified theoretically? (5) Unique Characteristics of Networked Data (for predictive inference) (1) Unique Characteristics of Networked Data (for predictive inference) (2) Various techniques for collective inference (see also Jensen et al. KDD 2004) Collective inference cartoon: (1) Collective inference cartoon: (2) Collective inference cartoon: (3) Collective inference cartoon: (4) Collective inference cartoon: (5) Collective inference cartoon: (6) Collective inference cartoon: (7) recall network-based marketing example? Collective inference gives additional improvement, especially for non-network neighbors So, how much "information" is in the network structure alone?
Network Classification Case Study How much information is in the network structure? (1) How much information is in the network structure? (2) Univariate network classification techniques (see Macskassy & Provost 2007) (1) Univariate network classification techniques (see Macskassy & Provost 2007) (2) RBN vs wvRN Classifying linked documents (CoRA data) Machine Learning Research Papers (from CoRA data) (1) Machine Learning Research Papers (from CoRA data) (2) Unique Characteristics of Networked Data (for predictive inference) Networks ≠ Graphs? (1) Networks ≠ Graphs? (2) Detecting "bad brokers" (NASD) (Neville et al. KDD 2005) Data on brokers, branches, disclosures (Neville et al. KDD 2005) Relational Learning Traditional Learning and Classification Network Learning and Classification Logic modeling Network data in first-order logic Probabilistic graphical models Example: A Bayesian network modeling consumer reaction to new service Probabilistic relational models Relational prob. model of broker variables (Neville & Jensen, JMLR to appear) Important concept! Recall: broker dependency network Broker data network (Neville et al. 2005) Putting it all together: Relational dependency networks (Neville & Jensen, JMLR 2007) Model unrolled on (tiny) data network Combining first-order logic and probabilistic graphical models (1) Combining first-order logic and probabilistic graphical models (2) A snippet from an actual network including "bad guys" Side note: not just for "networked data" - ids important for any data in a multi-table RDB How to incorporate identifiers of related objects (in a nutshell) Density Estimation for Aggregation Classify buyers of most-common title from a Korean E-Book retailer Machine Learning Research Papers (from CoRA data) (recall CoRA from discussion of univariate network models) Using identifiers on CoRA Summary: Unique Characteristics of Networked Data (for predictive inference) http://pages.stern.nyu.edu/~fprovost/
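The weighted-vote relational neighbor classifier (wvRN) with relaxation labeling, the simple univariate baseline highlighted in the case study, can be sketched as follows (array layout and iteration count are illustrative assumptions):

```python
import numpy as np

def wvrn_relaxation(adj, labels, iters=20):
    """wvRN: unknown nodes take the weighted mean of neighbors' class probabilities.

    adj: symmetric edge-weight matrix; labels: +1/-1 for known nodes, 0 for unknown.
    Returns the estimated P(class = +1) per node.
    """
    known = labels != 0
    p = np.where(labels == 1, 1.0, 0.0)    # known labels clamped to 0/1
    p[~known] = 0.5                        # uninformative prior for unknown nodes
    for _ in range(iters):
        weights = adj.sum(axis=1)
        new_p = adj @ p / np.maximum(weights, 1e-12)
        p = np.where(known, p, new_p)      # keep known nodes clamped
    return p
```

Its strength, as the case study's "how much information is in the network structure?" question suggests, is that it uses only the labels of neighbors: no attributes at all.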
Ontologies and Machine Learning We address the problem of constructing a light-weight ontology from social network data. As an example we use the social network of a mid-size research institution, obtained from e-mail communication. The main contribution is an architecture consisting of five major steps that enable the transformation of the data from given e-mail transaction records to an ontology estimating the structure of the organization. Once we have a set of sparse vectors, we apply an approach to semi-automated ontology construction as implemented in the OntoGen tool. The experiments and illustrative evaluation show that our approach is useful and applicable in real life situations where the goal is to model social structures based on communication records. Ontologies & Machine Learning Aim of the talk What areas of research are we trying to target? Ontologies What is an Ontology? (1) What is an Ontology? (2) Which elements represent an ontology? Levels Semantic-Web formalisms Top-down modeling of knowledge Cyc system Cyc - a little bit of historical context The Cyc Ontology - part of Cyc Ontology on Human Beings Structure of Cyc Ontology (1) Structure of Cyc Ontology (2) Structure of Cyc Ontology (3) Structure of Cyc Ontology (4) Structure of Cyc Ontology (5) Cyc KB Extended w/Domain Knowledge (1) Cyc KB Extended w/Domain Knowledge (2) An example of Psychoanalyst's Cyc taxonomic context Example Vocabulary: Senses of "In" relation (1/3) Example Vocabulary: Senses of "In" relation (2/3) Example Vocabulary: Senses of "In" relation (3/3) Cyc's front-end: "Cyc Analytic Environment" - querying (1/2) Cyc's front-end: "Cyc Analytic Environment" -
querying (2/2) Document Tagging (1) Document Tagging (2) Annotating the document with CycKB Probabilistic Concept Tagging Knowledge Template Induction (1) Knowledge Template Induction (2) Learning Facts by Search (1) Learning Facts by Search (2) Parsing Results KB Consistency Check Initial Results Microtheory (context) Suggestion Automatic Ontology Placement MT Suggestor Approach Results Induction of new rules with ILP Learning Higher-Order Knowledge Performing Induction in Cyc Sample Rules Produced (1) Sample Rules Produced (2) Bottom-up modeling of knowledge OntoGen system Underlying concepts Main Features Ontology management Concept management Active Learning for concept learning Multiple views of the same data Concept's instances visualization Ontology population
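The "set of sparse vectors" the abstract starts from is a standard bag-of-words TF-IDF representation of the communication records; a toy construction is sketched below (dict-based sparse vectors are an assumption of this sketch, not OntoGen's actual data structures):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Bag-of-words TF-IDF sparse vectors (dicts) for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # document frequency per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        vectors.append(vec)
    return vectors
```

Clustering such vectors is what lets a tool like OntoGen suggest candidate concepts for the analyst to accept, rename, or refine.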
Ontogen Software Demo Semi-Automatic Data-Driven Ontology Construction System Underlying concepts Main Features
PASCAL Workshop on Methods of Data Analysis in Computational Neuroscience and Brain Computer Interfaces This workshop shall cover three main topics: First, a general outline of problems occurring in computational neuroscience shall be given. Here, the connection between microscopic measurement and modeling to macroscopic observation shall be outlined. Second, we discuss issues of decomposition techniques applied to fMRI and EEG/MEG data. A present trend in this area is to increase the tensorial order of the data representation, which, at least in principle, allows for unique decomposition under fairly mild conditions unless the data are 'pathological'. A specific question here is whether real data are so close to being 'pathological' that the decomposition lacks robustness. More generally, decomposition methods like PCA, ICA, Parafac or the construction of general \"dictionaries\" make different kinds of assumptions. The question is which of these assumptions are met in real data and whether or not some assumptions are useful to make even if they are not met. Third, a specific application of data analysis methods is the brain computer interface. In practice it appears that the simplest methods are surprisingly successful. One reason could be that uninteresting background noise is so complicated and diverse that ignoring the background as much as possible should have priority over interpreting details of the signal of interest. The respective priorities set the range of promising methods. We shall discuss in this workshop the present experience with various methods and the most promising directions of research to improve the results.
Multimodal Imaging: EEG-fMRI integration EEG-fMRI Integration fMRI Group Black Box and Surrogates for Cognitive Neuroscience Surprise vs Expectancy Pattern Learning Oddball pt 1 Pattern Learning Oddball - AEPs Pattern Learning Oddball - OEPs Pattern Learning Oddball pt 2 Single Trial EEG-fMRI Auditory Oddball fMRI How to Answer a Where-and-When Question pt 1 How to Answer a Where-and-When Question pt 2 How to Answer a Where-and-When Question pt 3 How to Answer a Where-and-When Question pt 4 Why Didn't the Auditory Onset Response Show Up What's the Problem pt 1 What's the Problem pt 2 How Would the Brain Frame Family Pictures pt 1 How Would the Brain Frame Family Pictures pt 2 What's the Problem pt 3 What's the Situation Aims Single Trial EEG-fMRI Revisited Joint ICA Simulation - Sources Joint ICA Simulation - Mixed Joint ICA Simulation - Unmixed What Group ICA Works Previously Lost in Translation If This Is What Happens in EEG-fMRI When People Make Errors Group ICA Model Map Conclusion Thanks
Symbolic Dynamics of Neurophysiological Data Symbolic Dynamics of Neurophysiological Data Content Ion Channels I Ion Channels II Action Potentials I Action Potentials II Event-Related Potentials I Event-Related Potentials II Event-Related Potentials III The Dynamical Approach Phase Space Portrait Coarse-Graining Symbolic Dynamics of Noisy Data Cylinder Sets of ERP Measures of Complexity Entropy Signal-to-Noise Ratios One-Threshold Encodings Signal-to-Noise Ratios (a) One-Threshold Encodings (a) Two-Threshold Encoding Symbolic Resonance Analysis pt 1 Symbolic Resonance Analysis pt 2 Mean-Field Transformation pt 1 Mean-Field Transformation pt 2 Mean-Field Transformation pt 3 Mean-Field Transformation pt 4 Mean-Field Transformation pt 5 Mean-Field Transformation pt 6 Mean-Field Transformation pt 7 Mean-Field Transformation pt 8 Mean-Field Transformation pt 9 Mean-Field Transformation pt 10 Mean-Field Transformation pt 11 Signal Dissociation I Signal Dissociation II Applications Oddball Experiment Baseline Encoding I Baseline Encoding II Baseline Encoding III Median Encoding I Median Encoding II Median Encoding III Symbolic Resonance Analysis pt 3 Symbolic Resonance Analysis pt 4 Time-Threshold-Analysis Negative Polarity Processing Acknowledgements Time-Threshold-Analysis (a) Symbolic Resonance Analysis pt 4 (a)
Visualization of text document corpus From the automated text processing point of view, natural language is very redundant in the sense that many different words share a common or similar meaning. For a computer this can be hard to understand without some background knowledge. Latent Semantic Indexing (LSI) is a technique that helps in extracting some of this background knowledge from a corpus of text documents. Visualization of Document Corpus Motivation Document representation Problem The Big Picture Latent Semantics Indexing The Big Picture Multidimensional scaling The Big Picture Landscape generation Keywords Keywords Demo on two document collections Trip into the third dimension Thank you for listening! The Big Picture
Amplitude and phase patterns in encephalographic signals - multivariate approaches for movement-related brain activity. Andreas Daffertshofer, Research Institute MOVE, Faculty of Human Movement Sciences, VU University Amsterdam, The Netherlands. Thanks to: T.W. Boonstra, S. Houweling, A.N. Vardy (VU-FBW), B.W. van Dijk, C.J. Stam (VUmc), G. Nolte (FIRST, Berlin). Neural activity - KNAW-MEG Centre Amsterdam. MEG / electromyography in studying (rhythmic) movements: established behavioral characteristics, reproducible, quantifiable; movement-evoked fields; cortico-cortical and cortico-spinal entrainment. Electric potentials versus magnetic fields: dendritic currents; dendrites of excitatory neurons are aligned, dendrites of inhibitory neurons are not; magneto-encephalography; event-related potentials / event-related fields. How to proceed - what to look for: brain functioning. The brain shows local specialization; complicated tasks require cooperation between multiple brain areas; synchronization is a key mechanism for functional integration and results in the formation of functional networks with temporal and spatial structure. How can information be transferred? Amplitude modulation (~ AM radio): event-related activity, co-varying (spectral) power. Frequency modulation (~ FM radio): coherence, frequency locking, phase locking, synchronization. Spectral analysis: Fourier transform, power spectrum, power spectral density (~ periodogram), time and frequency domain, cross-correlation; coherence, defined as the cross-spectrum normalized by the two power spectra, Gamma_xy(w) = |p_xy(w)| / sqrt(p_x(w) p_y(w)), with an analytical example and the effect of increasing window size/resolution on the coherence function estimate. Frequency transforms: Fourier, Gabor, wavelet, Hilbert. Hilbert amplitude |H(t)| (raw EMG example) and Hilbert phase phi(t) = arctan(Im H(t) / Re H(t)), with an analytical example. Paced tapping (unimanual). Coherence: an example - MEG-EMG comparison (Kilner et al. (2000), Journal of Neuroscience 20, p. 8841). Cross-phase spectrum; volume conduction / phase differences; phase distributions (for phase statistics see textbooks by, e.g., Mardia (1974, 1991), Fisher (1993), Batschelet (1993)). Phase locking index, phase coherence R = |(1/N) sum_t exp(i dphi_t)|, and phase lag index PLI based on sign(sin dphi_t) (Stam, Nolte and Daffertshofer, Human Brain Mapping, Epub). Stability of coordination: acoustically paced 'tapping' on/off the beat (Daffertshofer et al., Phys. Lett. A, 2000). Relative Hilbert phase dphi(t) = phi_1(t) - phi_2(t) for force-tone, EMG-tone, and MEG-tone comparisons, quantified via circular variance and/or uniformity. Encephalographic & electromyographic recordings: 151-channel MEG (CTF), force transducer, multi-channel EMG (m. adductor pollicis). Movement-related fields. Coherence fails as a robust estimator for cortico-cortical and cortico-spinal synchronization in rapid (rhythmic) movement. Analysis scheme: record X channels for M different trials/conditions/subjects; filter signals with N different bands; Hilbert transform for amplitude & phase; event-related fields (ARF & MRF), not necessarily averaging; principal component analysis for mode reduction; reconstruction of X x M x N signals; statistical testing for effects of condition, frequency, etc. Comparison of listening & (un-)paced 'tapping': Hilbert amplitude (beta-power and ERF amplitude) (Boonstra et al., Brain Res., 2006). Comparison of low force / high force (fatigue): Hilbert amplitude shows tempo dependence in the beta-band; event-related fields - overall power is 'constant' (Boonstra et al., 2006). Comparison of movement tempo: Hilbert phase, auditory-related; ARF phase uniformity depends on movement tempo and is 'inverted' when compared to MRF uniformity. Evoked vs. induced responses, via cross-covariance of Hilbert phase vs. amplitude: lag-zero cross-covariance depicts a simultaneous increase of phase locking and amplitude (evoked); lagging cross-covariance depicts a delayed increase of phase locking and amplitude (induced beta-response) (Boonstra et al., Brain Res., 2006). Polyrhythmic bimanual tapping: relative Hilbert phase in polyrhythmic performance (Daffertshofer et al. (2000)). Slow changes - learning movement coordination (adopted from Mechsner et al., Nature 2001): pre-test (control, left/right/bimanual), learning the polyrhythm as intervention, post-test (control, left/right/bimanual); timing in the adaptive hand; frequency locking tone-left; phase uniformity of left EMG, pre vs. post. Event-related synthetic aperture magnetometry (Cheyne et al. (2006), Human Brain Mapping 27:213). Motor areas (left/right M1): left pre vs. left post shows a contralateral decrease in beta-band [20-30 Hz] activity. Cerebellum (left/right CB): left pre vs. left post shows an ipsilateral increase in gamma-band [40-70 Hz] activity. Changes in event-related amplitudes ('power' and 'timing'): unimanual left/right and bimanual, over left/right M1, left/right CB, and SMA. Mathematics - stochastic phase-locked loop. Models: phases in premotor areas quickly adapt to changes in the phases of primary motor areas. Phase clustering map; mean-field approximation leading to nonlinear Fokker-Planck equations (Frank et al., Physica D (2000)). Thanks for your attention. Berlin, June 29, 2007. Thanks to: P.J. Beek, T.W. Boonstra, S. Houweling, C.E. Peper, A.N. Vardy (FBW), T.D. Frank (Storrs, UConnecticut), B.W. van Dijk, C.J. Stam, J. Verbunt (VUmc Amsterdam), G. Nolte (FIRST, Berlin), A. Hutt (Humboldt, Berlin), A. Longtin (UOttawa), M. Breakspear (USydney).
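Several of the analyses above rest on the Hilbert transform: the analytic signal yields an instantaneous phase, the relative Hilbert phase dphi(t) = phi_1(t) - phi_2(t) is formed between two channels, and circular statistics summarize its stability. A small sketch on synthetic signals (the 10 Hz frequency, noise level, and pi/4 lag are arbitrary choices for illustration):

```python
import numpy as np
from scipy.signal import hilbert

fs = 500.0                       # sampling rate [Hz]
t = np.arange(0, 4, 1 / fs)
rng = np.random.default_rng(0)

# Two 10 Hz signals with a fixed phase lag, plus a little noise.
x = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)
y = np.sin(2 * np.pi * 10 * t - np.pi / 4) + 0.1 * rng.standard_normal(t.size)

# Hilbert transform -> analytic signal -> instantaneous phase.
phi_x = np.angle(hilbert(x))
phi_y = np.angle(hilbert(y))
dphi = phi_x - phi_y             # relative Hilbert phase

# Circular statistics: R near 1 means strong phase locking.
R = np.abs(np.mean(np.exp(1j * dphi)))
mean_lag = np.angle(np.mean(np.exp(1j * dphi)))
```

Here R is the phase coherence from the slides, R = |(1/N) sum_t exp(i dphi_t)|, and mean_lag recovers (approximately) the imposed pi/4 lag.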
Multimodal Imaging: MEG-NIRS integration Multimodal Imaging - MEG-NIRS Integration Neuro-Vascular Coupling pt 1 Neuro-Vascular Coupling pt 2 Neuro-Vascular Coupling pt 3 Near-Infrared Spectroscopy (NIRS) pt 1 Near-Infrared Spectroscopy (NIRS) pt 2 Near-Infrared Spectroscopy (NIRS) pt 3 Time Resolved NIRS pt 1 Time Resolved NIRS pt 2 Time Resolved NIRS - Setup pt 1 Time Resolved NIRS - Setup pt 2 Magnetoencephalography and DC-MEG Magnetoencephalography Typical MEG Result - N20m Neuro-Vascular Coupling - How To Quantify DC-Magnetoencephalography Modulation DC-Magnetoencephalography Direct DC-MEG of Motor Activity Combined DC-MEG and NIRS Signal Processing - Independent Component Analysis Independent Component Analysis pt 1 Independent Component Analysis pt 2 Independent Component Analysis - TDSEP Variables for Statistical Analysis - NIRS Classification of ICA Result by Demixing Results Unaveraged DC-MEG and NIRS Data Averaged mDC-MEG and NIRS Data Neuro-Vascular Loop Neuro-Vascular Loop - Direct DC-MEG Feasibility Study - Patient Data Feasibility Study - Comparison Summary
Exploiting temporal delays in interpreting EEG/MEG data in terms of brain connectivity Problem of volume conduction Cross-spectrum EEG-simulation of ERD (two sources) Rest coherence EEG-simulation of ERD (one source) Change in coherence pt 1 Change in coherence pt 2 Observation Explicit derivation Coherence Self-paced movement - C3-C4 relationships Significance - False Discovery Rate (FDR) Simulated non-interacting sources Results Difference between cross-spectrum pt 1 Difference between cross-spectrum pt 2 Imaginary part - 5 dipoles 'Philosophy' pt 1 'Philosophy' pt 2 'Philosophy' pt 3 Pairwise Interacting Source Analysis (PISA) EEG - imagined foot movement Music pt 1 Music pt 2 Example 1 Example 2 Result ISA-pattern Conclusion
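The core idea behind exploiting temporal delays is that volume conduction mixes sources into the sensors with zero lag, which makes the coherency (the normalized cross-spectrum) essentially real, whereas a genuine interaction with a temporal delay leaves a nonzero imaginary part. A sketch of that contrast (the simulated signals and Welch parameters are illustrative assumptions, not the talk's actual data):

```python
import numpy as np
from scipy.signal import csd, welch

fs, n = 200.0, 8000
rng = np.random.default_rng(1)
s = rng.standard_normal(n)               # one common source

# "Volume conduction": the same source appears in both channels at zero
# lag, with different gains, plus independent sensor noise.
x = 1.0 * s + 0.3 * rng.standard_normal(n)
y = 0.7 * s + 0.3 * rng.standard_normal(n)

# A genuinely lagged interaction: y2 sees the source 5 samples later.
y2 = 0.7 * np.roll(s, 5) + 0.3 * rng.standard_normal(n)

def coherency(a, b):
    """Cross-spectrum normalized by the two power spectra (Welch estimates)."""
    f, Pab = csd(a, b, fs=fs, nperseg=256)
    _, Paa = welch(a, fs=fs, nperseg=256)
    _, Pbb = welch(b, fs=fs, nperseg=256)
    return f, Pab / np.sqrt(Paa * Pbb)

_, c_zero_lag = coherency(x, y)
_, c_lagged = coherency(x, y2)

# Zero-lag mixing yields a nearly real coherency; a true delay does not.
imag_zero_lag = np.mean(np.abs(np.imag(c_zero_lag)))
imag_lagged = np.mean(np.abs(np.imag(c_lagged)))
```

Ordinary coherence (the magnitude) is high in both cases, so it cannot distinguish the two; the imaginary part can.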
2nd ECOLEAD Summer School The Summer School will briefly review the current state of the art in the theoretical foundations of Collaborative Networked Organizations (CNO), Professional Virtual Communities (PVC), Virtual Enterprises (VE) and their performance measurement. The speakers will highlight those problems arising in this context which can be handled using novel IT tools equipped with interesting AI features. Such tools and approaches will be introduced in several separate talks, and the participants will be offered a chance to get acquainted with the selected tools during the tutorials. An important part of the program will be devoted to talks reviewing experience gained in some real-life case studies or in related EU projects. Among the speakers there are a number of renowned experts. They will pay attention to different aspects of this recently established and very active scientific domain, which holds high promise for the business world.
The Long Road from Text to Meaning Computers have given us a new way of thinking about language. Given a large sample of language, or corpus, and computational tools to process it, we can approach language as physicists approach forces and chemists approach chemicals. This approach is noteworthy for missing out what, from a language-user's point of view, is important about a piece of language: its meaning. I shall present this empiricist approach to the study of language and show how, as we develop accurate tools for lemmatisation, part-of-speech tagging and parsing, we move from the raw input -- a character stream -- to an analysis of that stream in increasingly rich terms: words, lemmas, grammatical structures, Fillmore-style frames. Each step on the journey builds on a large corpus accurately analysed at the previous levels. A distributional thesaurus provides generalisations about lexical behaviour which can then feed into an analysis at the 'frames' level. The talk will be illustrated with work done within the 'Sketch Engine' tool. For much NLP and linguistic theory, meaning is a given. Thus formal semantics assumes meanings for words, in order to address questions of how they combine, and WSD (word sense disambiguation) typically takes a set of meanings (as found in a dictionary) as a starting point and sets itself the challenge of identifying which meaning applies. But, since the birth of philosophy, meaning has been problematic. In our approach meaning is an eventual output of the research programme, not an input.
Hidden Topic Markov Models Algorithms such as Latent Dirichlet Allocation (LDA) have achieved significant progress in modeling word-document relationships. These algorithms assume each word in the document was generated by a hidden topic and explicitly model the word distribution of each topic as well as the prior distribution over topics in the document. Given these parameters, the topics of all words in the same document are assumed to be independent. In this work, we propose modeling the topics of words in the document as a Markov chain. Specifically, we assume that all words in the same sentence have the same topic, and successive sentences are more likely to have the same topics. Since the topics are hidden, this leads to using the well-known tools of Hidden Markov Models for learning and inference. We show that incorporating this dependency allows us to learn better topics and to disambiguate words that can belong to different topics. Quantitatively, we show that we obtain better perplexity in modeling documents with only a modest increase in learning and inference complexity. //Joint work with Michal Rosen-Zvi and Yair Weiss.//
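The generative assumption - one topic per sentence, with successive sentences likely to keep the same topic - can be sketched as a sampler. The vocabulary, topic-word distributions, and stay-probability below are invented for illustration; the paper's actual contribution is fitting such a model with HMM learning and inference, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 2                                # number of topics
vocab = ["goal", "match", "vote", "party"]
# Hypothetical word distribution per topic.
phi = np.array([[0.45, 0.45, 0.05, 0.05],    # topic 0: "sports"
                [0.05, 0.05, 0.45, 0.45]])   # topic 1: "politics"
theta = np.array([0.5, 0.5])         # document's prior over topics
stay = 0.9                           # P(sentence keeps the previous topic)

def sample_document(n_sentences=6, words_per_sentence=5):
    """All words in a sentence share one topic; topics form a Markov chain."""
    sentences, topics = [], []
    z = rng.choice(K, p=theta)                   # topic of the first sentence
    for _ in range(n_sentences):
        if topics and rng.random() > stay:       # occasionally switch topic
            z = rng.choice(K, p=theta)
        topics.append(int(z))
        words = rng.choice(len(vocab), size=words_per_sentence, p=phi[z])
        sentences.append([vocab[w] for w in words])
    return sentences, topics

doc, z = sample_document()
```

In LDA every word draws its topic independently; here whole runs of sentences share a topic, which is what lets the model disambiguate a word like "party" from its sentence's context.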
Text Visualisation Tutorial Text Visualization Tutorial Contents Why visualizing text? Some basic text preliminaries Quick Example: Visualization of PASCAL Project PASCAL project Visualization of PASCAL research topics Similarity between document vectors - typical way of doing visualization Graph based visualization Example of visualizing EU IST projects corpora Graph based visualization of 1700 IST project descriptions into 2 groups Graph based visualization of 1700 IST project descriptions into 3 groups Graph based visualization of 1700 IST project descriptions into 10 groups Graph based visualization of 1700 IST project descriptions into 20 groups Tiling based visualization Tiling based visualization of 1700 IST project descriptions into 2 groups Tiling based visualization of 1700 IST project descriptions into 3 groups Tiling based visualization of 1700 IST project descriptions into 4 groups Tiling based visualization of 1700 IST project descriptions into 5 groups Tiling visualization (up to 50 documents per group) of 1700 IST project descriptions (60 groups) WebSOM Visualizing text using a lot of structure What is structure in the text? Deep Linguistic Parsing (Microsoft's NLPWin Parser) Extraction of semantic graphs from text Example article represented as text Example article as semantic graph Example Article on Earthquake Example Article on Clinton's speech Conclusions PASCAL organizes Text Visualization Challenge
Gears and the Mashup Problem Mashups are the most interesting innovation in software development in decades. Unfortunately, the browser's security model did not anticipate this development, so mashups are not safe if there is any confidential information in the page. Since virtually every page has at least some confidential information in it, this is a big problem. Google Gears may lead to the solution. Speaker: Douglas Crockford Douglas Crockford is the world's foremost living authority on JavaScript. He is an architect with Yahoo's Ajax Strike Force. He is the founder of two startups, and was Director of Technology at Lucasfilm Ltd., Director of New Media at Paramount, and a researcher at Atari and SRI.
The ?Last Lecture? of Randy Pausch Almost all of us have childhood dreams: for example, being an astronaut, or making movies or video games for a living. \\\\ Sadly, most people don't achieve theirs, and I think that's a shame. I had several specific childhood dreams, and I've actually achieved most of them. More importantly, I have found ways, in particular the creation (with Don Marinelli), of CMU's ([[http://etc.cmu.edu/|Entertainment Technology Center]]), of helping many young people actually **achieve** their childhood dreams. This talk will discuss how I achieved my childhood dreams (being in zero gravity, designing theme park rides for Disney, and a few others), and will contain realistic advice on how **you** can live your life so that you can make your childhood dreams come true, too.
Three Beautiful Quicksorts This talk describes three of the most beautiful pieces of code that I have ever written: three different implementations of Hoare's classic Quicksort algorithm. \\\\ # The first implementation is a bare-bones function in about a dozen lines of C. \\\\ # The second implementation starts by instrumenting the first program to measure its run time; a dozen systematic code transformations proceed to make it more and more powerful yet more and more simple, until it finally disappears in a puff of mathematical smoke. It therefore becomes the most beautiful program I never wrote. \\\\ # The third program is an industrial-strength C library Qsort function that I built with Doug McIlroy. A theme running through all three implementations is the power of elegance and simplicity. \\\\ (This talk expands my Chapter 3 in Beautiful Code, edited by Oram and Wilson and published by O'Reilly in July, 2007.)
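The talk's three implementations are in C; as a rough stand-in for the first, bare-bones version, an in-place quicksort of about the same size can be sketched in Python. This is a generic Lomuto-partition rendering, not Bentley's actual code:

```python
def quicksort(a, lo=0, hi=None):
    """In-place quicksort, partitioning around the last element (Lomuto)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot, m = a[hi], lo
    for i in range(lo, hi):          # move smaller elements to the left
        if a[i] < pivot:
            a[i], a[m] = a[m], a[i]
            m += 1
    a[m], a[hi] = a[hi], a[m]        # put the pivot in its final place
    quicksort(a, lo, m - 1)
    quicksort(a, m + 1, hi)

data = [3, 1, 4, 1, 5, 9, 2, 6]
quicksort(data)
```

The second and third implementations in the talk then instrument and harden exactly this kind of skeleton.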
The 7th International Symposium on Intelligent Data Analysis Our aim for the 7th IDA conference is to bring together a wide variety of researchers - academic, industrial, and otherwise who are concerned with extracting knowledge from data, including researchers from statistics, machine learning, neural networks, computer science, pattern recognition, database management, and other areas. The strategies adopted by people working in these areas are often different, and a synergy results if this is recognised. IDA-2007 is intended to stimulate interaction between these different areas, so that more powerful techniques and tools emerge for extracting knowledge from data and a better understanding is developed for the process of intelligent data analysis. ...more on IDA 2007 at [[http://www.ida2007.org/]]
Tutorials At the IDA-2007 conference we propose an interesting agenda of events that include several tutorial tracks, open panel discussions, and keynote talks, all based on the following topics of interest: # Algorithms and Techniques (Machine Learning, Data Mining, Statistics) # Theoretical Contributions (Data Analysis Principles, KDD, Data Modeling) # Application Fields (Practical, Applied and Industrial Data Analysis)
Evolving Systems One of the important research challenges today is to develop new theoretical methods, algorithms, and implementations of systems with a higher level of flexibility and autonomy - that is, with a higher level of intelligence. These systems have to be able to evolve their structure and their knowledge of the environment and, ultimately, evolve their intelligence. To address the problems of modelling, control, prediction, classification and data processing in a dynamically changing and evolving environment, a system must be able to fully adapt its structure and adjust its parameters, rather than use a pre-trained and fixed structure. That is, the system must be able to evolve, to self-develop, to self-organize, to self-evaluate and to self-improve. The talk will concentrate on the problems and results the author encountered during the last several years of research in this emerging area, as well as on an approach to on-line identification of a particular type of fuzzy models, the so-called Takagi-Sugeno fuzzy models, including some applications, in particular to mobile robots, mobile communications, process modelling and control, and on-line evolving classification intelligent (inferential) sensors.
Evolving Systems from Streaming Data Lancaster University InfoLab21 Tutorial Objective Outline Methodology Algorithms EFS Applications The Challenge pt 1 Streaming Data vs Batch Data The Challenge pt 2 The Challenge pt 3 The Challenge pt 4 The Challenge pt 5 Example 1: Current UAVs Example 2: Mobile Robots Example 3: Intruder Detection Data The Challenge pt 6 The Proposed Approach pt 1 The Proposed Approach pt 2 System Modeling Fermentation Process Black-Box Models Fuzzy Rule-Based Models Black-Box Models (a) Fuzzy Rule-Based Models (a) Black-Box Models (b) Fuzzy Model Types TSK Models TSK Fuzzy Model (Concept) TSK in 2D Feature Space Clusters in the Feature Space On-Line Identification TSK in 2D Feature Space (a) Outlier or a New Info Granule (Cluster/Rule) Adaptive vs Evolving Data-Driven Learning Evolving Systems pt 1 Evolving Systems pt 2 Evolving Systems pt 3 Evolving Systems pt 4 Evolving Fuzzy Systems Evolving Systems Basic Principle Rule-Base Evolution Data Space Partitioning Outlier or a New Info Granule (Cluster/Rule) (a) Data Space Partitioning (a) Equal Partitioning Data Space Partitioning (b)
Compact and Understandable Descriptions of Mixtures of Bernoulli Distributions Finite mixture models can be used in estimating complex, unknown probability distributions and also in clustering data. The parameters of the models form a complex representation and are not suitable for interpretation purposes as such. In this paper, we present a methodology to describe the finite mixture of multivariate Bernoulli distributions with a compact and understandable description. First, we cluster the data with the mixture model and subsequently extract the maximal frequent itemsets from the cluster-specific data sets. The mixture model is used to model the data set globally and the frequent itemsets model the marginal distributions of the partitioned data locally. We present the results in understandable terms that reflect the domain properties of the data. In our application of analyzing DNA copy number amplifications, the descriptions of amplification patterns are represented in nomenclature used in literature to report amplification patterns and generally used by domain experts in biology and medicine. Compact and Understandable Descriptions of Mixtures of Bernoulli Distributions Background on the Problem Example on the Data Collection Chromosomal Regions: Names DNA Copy Number Amplification Data as 0-1 Data Mixture Models for 0-1 Data Model Selection: How Many Components in a Mixture? Mixture Model: Chromosome 1 Mixture Model in Clustering Solution Creates a Problem Compact and Understandable Descriptions Describe the Model Parameters Describe the Clustered Data Descriptions - Chromosome 1 Amplification Models and Patterns Summary and Conclusions
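The first stage of the methodology - fitting a finite mixture of multivariate Bernoulli distributions to 0-1 data and using the responsibilities to cluster it - is typically done with EM. A minimal sketch (the toy data, random initialization, and clipping constant are illustrative assumptions; the paper's subsequent frequent-itemset extraction is not shown):

```python
import numpy as np

def em_bernoulli_mixture(X, K, iters=50, seed=0):
    """EM for a mixture of multivariate Bernoullis on 0-1 data X (n x d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                  # mixing weights
    theta = rng.uniform(0.25, 0.75, (K, d))   # per-component success probs
    for _ in range(iters):
        # E-step: responsibilities from Bernoulli log-likelihoods.
        log_p = (X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted means, clipped away from 0/1 for stability.
        nk = r.sum(axis=0)
        pi = nk / n
        theta = np.clip((r.T @ X) / nk[:, None], 1e-3, 1 - 1e-3)
    return pi, theta, r

# Two clearly separated binary patterns (e.g., two amplification profiles).
X = np.array([[1, 1, 0, 0]] * 10 + [[0, 0, 1, 1]] * 10, dtype=float)
pi, theta, r = em_bernoulli_mixture(X, K=2)
```

Assigning each row to its highest-responsibility component yields the cluster-specific data sets from which the paper then mines maximal frequent itemsets.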
Multiplicative Updates for L1-Regularized Linear and Logistic Regression Multiplicative update rules have proven useful in many areas of machine learning. Simple to implement, guaranteed to converge, they account in part for the widespread popularity of algorithms such as nonnegative matrix factorization and Expectation-Maximization. In this paper, we show how to derive multiplicative updates for problems in L1-regularized linear and logistic regression. For L1-regularized linear regression, the updates are derived by reformulating the required optimization as a problem in nonnegative quadratic programming (NQP). The dual of this problem, itself an instance of NQP, can also be solved using multiplicative updates; moreover, the observed duality gap can be used to bound the error of intermediate solutions. For L1-regularized logistic regression, we derive similar updates using an iteratively reweighted least squares approach. We present illustrative experimental results and describe efficient implementations for large-scale problems of interest (e.g., with tens of thousands of examples and over one million features). Multiplicative updates for L1-regularized regression Trends in data analysis How do we scale? Searching for sparse models An unexpected connection This talk Part I. Multiplicative updates Nonnegative quadratic programming (NQP) Matrix decomposition Multiplicative update Matrix decomposition (a) Multiplicative update (a) Fixed points Attractive properties for NQP Part II. Sparse regression Linear regression Regularization L2 versus L1 Reformulation as NQP L2 versus L1 (a) Reformulation as NQP (a) L1 norm as NQP Why reformulate? Logistic regression Part III. Experimental results Convergence to sparse solution Primal-dual convergence Large-scale implementation Discussion Large-scale implementation (a)
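The multiplicative update for NQP splits the matrix A into its positive and negative parts and rescales each coordinate of the nonnegative iterate, so no step size is needed. A sketch on a tiny NQP, min_{v >= 0} (1/2) v'Av + b'v (the example matrix and starting point are invented, and no care is taken here about zero denominators or the duality-gap bound from the paper):

```python
import numpy as np

def nqp_multiplicative(A, b, v0, iters=100):
    """Multiplicative updates for min over v >= 0 of 0.5 v'Av + b'v."""
    Ap = np.maximum(A, 0)           # positive part of A
    Am = np.maximum(-A, 0)          # negative part of A (so A = Ap - Am)
    v = v0.astype(float).copy()
    for _ in range(iters):
        a, c = Ap @ v, Am @ v
        # Each factor is nonnegative, so the iterate stays nonnegative.
        v *= (-b + np.sqrt(b * b + 4 * a * c)) / (2 * a)
    return v

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # positive definite
b = np.array([-1.0, -1.0])
v = nqp_multiplicative(A, b, np.array([0.5, 2.0]))
# Here the unconstrained minimizer, A^{-1}(-b) = [1, 1], is nonnegative,
# so the updates should settle there.
```

Fixed points of this update satisfy the KKT conditions of the NQP, which is what makes it usable for the L1-regularized regression reformulation.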
Discrete PCA Methods for analysis of principal components in discrete data have existed for some time under various names such as grade of membership modelling, probabilistic latent semantic indexing, genotype inference with admixture, non-negative matrix factorization, latent Dirichlet allocation, multinomial PCA, and Gamma-Poisson models. Statistical methodologies for developing algorithms are equally varied, although this talk will focus on the Bayesian framework. The most widely published application is genotype inference, but text analysis is increasingly seeing use because the algorithms cope with very large sparse matrices. This talk will present the general model, a discrete version of both PCA and ICA, present alternative representations, and several algorithms (mean field and Gibbs).
Learning to align: a statistical approach We present a new machine learning approach to the inverse parametric sequence alignment problem: given as training examples a set of correct pairwise global alignments, find the parameter values that make these alignments optimal. We consider the distribution of the scores of all incorrect alignments, then we search for those parameters for which the score of the given alignments is as far as possible from this mean, measured in number of standard deviations. This normalized distance is called the 'Z-score' in statistics. We show that the Z-score is a function of the parameters and can be computed with efficient dynamic programs similar to the Needleman-Wunsch algorithm. We also show that maximizing the Z-score boils down to a simple quadratic program. Experimental results demonstrate the effectiveness of the proposed approach. Learning to Align: a Statistical Approach Outline Sequence Alignment pt 1 Sequence Alignment pt 2 Sequence Alignment pt 3 Sequence Alignment pt 4 Moments of the Scores The Z-Score Computing the Z-Score pt 1 Computing the Z-Score pt 2 Computing the Z-Score pt 3 IPSAP Z-Score Maximization pt 1 Z-Score Maximization pt 2 Iterative Algorithm pt 1 Z-Score Maximization pt 2 (a) Iterative Algorithm pt 1 (a) Z-Score Maximization pt 2 (b) Iterative Algorithm pt 2 Experimental Results pt 1 Experimental Results pt 2 Experimental Results pt 3 Summary
Transductive Reliability Estimation for Kernel Based Classifiers Estimating the reliability of individual classifications is very important in several applications such as medical diagnosis. Recently, the transductive approach to reliability estimation has been shown to be very efficient when used with several machine learning classifiers, such as Naive Bayes and decision trees. However, the efficiency of the transductive approach for state-of-the-art kernel-based classifiers had not been considered. In this work we deal with this problem and apply the transductive reliability methodology to sparse kernel classifiers, specifically the Support Vector Machine and the Relevance Vector Machine. Experiments with medical and bioinformatics datasets demonstrate better performance of the transductive approach for reliability estimation compared to reliability measures obtained directly from the output of the classifiers. Furthermore, we apply the methodology to the problem of reliable diagnostics of coronary artery disease, outperforming the expert physicians' standard approach. Transductive Reliability Estimation for Kernel Based Classifiers Introduction Kernel Classifiers Support Vector Machine (SVM) Reliability Measure for SVM Relevance Vector Machine pt 1 Relevance Vector Machine pt 2 RVM Reliability Measure Transductive Reliability Estimation pt 1 Transductive Reliability Estimation pt 2 Transductive Reliability Estimation pt 3 Selecting the Threshold Evaluation of Reliability Measures pt 1 Evaluation of Reliability Measures pt 2 Evaluation on UCI Datasets Application on CAD pt 1 Application on CAD pt 2 Conclusions Future Work
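The transductive idea can be illustrated generically: add the new example with its predicted label to the training set, retrain, and measure how much the class distribution for that example changes; a small change suggests a reliable prediction. The sketch below uses a toy nearest-mean classifier rather than the SVM/RVM of the talk, and the inverse-distance "posteriors" and the 1 minus total-variation reliability score are our own illustrative choices.

```python
import math

def nearest_mean_fit(X, y):
    """Fit class means for a toy nearest-mean classifier."""
    means = {}
    for label in set(y):
        pts = [x for x, t in zip(X, y) if t == label]
        means[label] = [sum(c) / len(pts) for c in zip(*pts)]
    return means

def nm_posteriors(means, x):
    """Soft class scores from inverse distances, normalized to sum to 1."""
    inv = {k: 1.0 / (math.dist(x, m) + 1e-9) for k, m in means.items()}
    s = sum(inv.values())
    return {k: v / s for k, v in inv.items()}

def transductive_reliability(X, y, x_new):
    """Transductive reliability sketch: compare the class distribution for
    x_new before and after adding (x_new, predicted label) to the training
    set; reliability = 1 - total variation distance between the two."""
    p1 = nm_posteriors(nearest_mean_fit(X, y), x_new)
    pred = max(p1, key=p1.get)
    p2 = nm_posteriors(nearest_mean_fit(X + [x_new], y + [pred]), x_new)
    tv = 0.5 * sum(abs(p1[k] - p2[k]) for k in p1)
    return pred, 1.0 - tv
```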
Fast Clustering based on Kernel Density Estimation The Denclue algorithm employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function. Data points are assigned to clusters by hill climbing, i.e. points going to the same local maximum are put into the same cluster. A disadvantage of Denclue 1.0 is that the hill climbing it uses may make unnecessarily small steps in the beginning and never converges exactly to the maximum; it only comes close. We introduce a new hill climbing procedure for Gaussian kernels, which adjusts the step size automatically at no extra cost. We prove that the procedure converges exactly towards a local maximum by reducing it to a special case of the expectation maximization algorithm. We show experimentally that the new procedure needs far fewer iterations and can be accelerated by sampling-based methods while sacrificing only a small amount of accuracy. Fast Clustering Based on Kernel Density Estimation Overview Density-Based Clustering Kernel Density Estimation Denclue 1.0 Framework Problem of Constant Step Size New Hill Climbing Approach New Denclue 2.0 Hill Climbing New Hill Climbing Approach (a) New Denclue 2.0 Hill Climbing (a) Proof of Convergence pt 1 Proof of Convergence pt 2 Identification of Local Maxima Acceleration Experiments pt 1 Experiments pt 2 Experiments pt 3 Experiments pt 4 Conclusion Thank You for Your Attention! Experiments pt 4 (a)
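For Gaussian kernels, the step-size-adjusting hill climbing amounts to iterating the kernel-weighted mean of the data, an EM-style fixed-point update that lands on the next estimate with no step-size parameter. A sketch under that reading; the bandwidth `h`, the tolerance `eps` and the convergence test are our own choices, not the paper's code.

```python
import math

def denclue_hill_climb(x0, data, h=1.0, eps=1e-6, max_iter=200):
    """Hill climbing on a Gaussian kernel density estimate in the spirit of
    Denclue 2.0: repeatedly move to the kernel-weighted mean of the data
    until the iterate stops moving (a local density maximum)."""
    x = list(x0)
    for it in range(max_iter):
        # Gaussian kernel weight of every data point relative to x
        ws = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * h * h))
              for p in data]
        s = sum(ws)
        nxt = [sum(w * p[d] for w, p in zip(ws, data)) / s for d in range(len(x))]
        if math.dist(x, nxt) < eps:
            return nxt, it + 1
        x = nxt
    return x, max_iter
```

Points whose climbs end at the same maximum are then assigned to the same cluster.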
Visualising the Cluster Structure of Data Streams The increasing availability of streaming data is a consequence of the continuing advancement of data acquisition technology. Such data provides new challenges to the various data analysis communities. Clustering has long been a fundamental procedure for acquiring knowledge from data, and new tools are emerging that allow the clustering of data streams. However, the dynamic, temporal components of streaming data provide extra challenges to the development of stream clustering and associated visualisation techniques. In this work we combine a streaming clustering framework with an extension of a static cluster visualisation method, in order to construct a surface that graphically represents the clustering structure of the data stream. The proposed method, OpticsStream, provides intuitive representations of the clustering structure as well as the manner in which this structure changes through time. Visualising the Cluster Structure of Data Streams Evolving Data Streams pt 1 Evolving Data Streams pt 2 Evolving Data Streams pt 3 Clustering and Density Estimation pt 1 Density Estimation Clustering and Density Estimation pt 2 Density Based Clustering Examining Neighbours pt 1 The Micro-Clustering Framework The DenStream Algorithm Density Based Clustering Examining Neighbours pt 2 Visualizing Clusters in Static Datasets Stream Cluster Visualization Stream Cluster Visualization: StreamOptics StreamOptics: Spawning Clusters StreamOptics: Disappearing Clusters StreamOptics: The Forest CoverType Data Set pt 1 StreamOptics: The Forest CoverType Data Set pt 2 Concluding Remarks Visualising the Cluster Structure of Data Streams Dimitris K. Tasoulis (1), Gordon Ross (2), Niall M. Adams (2); (1) Institute for Mathematical Sciences, (2) Department of Mathematics, Imperial College London, South Kensington Campus, London SW7 2PG, United Kingdom, {d.tasoulis,gordon.ross,n.adams}@imperial.ac.uk Evolving Data Streams Data nature: the data of interest comprises multiple sequences that evolve over time. Algorithms must have the capacity to adapt rapidly to changing dynamics of the sequences. The results of analysis are useful only if they are available immediately. Scalability in the number of sequences is becoming increasingly desirable. Time series analysis: inference using the complete information. Data streams: inference using only partial information; it is not possible to store the complete information, and we are only interested in information relevant to the current time. In this work we assume that data are arriving sequentially in time from some mixture distribution, that the mixture components gradually change over time, and that components may vanish and new ones can appear. A forgetting procedure is usually employed that attaches decreasing weight to historical data, so as to gradually diminish their effect. Clustering and Density Estimation Definition: clustering refers to the partitioning of a set of objects into groups (clusters) such that objects within the same group are more similar to each other than objects in different groups. The data space can be regarded as the empirical probability density function (pdf) of the data; in this sense local maxima of the pdf can be thought of as corresponding to centres of clusters. References to clustering date back to antiquity, but one of the first comprehensive foundations of clustering methods was published by Tryon in 1939 (R.C. Tryon, Cluster Analysis, Ann Arbor, MI, Edward Brothers, 1939). Density Estimation The rationale behind density estimation is that the data space can be regarded as the empirical probability density function (pdf) of the data. Let X = {x_1, . . .
, x_n} be a data set. The multivariate kernel density estimate f_H(x) is computed at a point x as f_H(x) = (C_{K,H} / n) * sum_{i=1..n} K(H^{-1}(x - x_i)), where H is the bandwidth matrix, K: R^d -> R is the kernel function, and C_{K,H} is a normalization constant dependent on the kernel function and the dimension of the data, so that f_H(x) approximates the true density f(x). Clustering and Density Estimation Mean Shift is one of the most successful density clustering methods: each point is mean-shifted towards the local gradient estimate of the density function. Density Based Clustering Examining Neighbours Core points: in a neighbourhood of a given radius (Eps), each point in a cluster should contain at least a minimum number of objects (MinPts). Each point in its neighbourhood is considered "Directly Density-Reachable"; chains of "Directly Density-Reachable" points form clusters. The Micro-Clustering Framework A micro-cluster is defined by the quantities w, c, r, which summarize the information about the data density in a particular area. Definition (core-micro-cluster): a micro-cluster MC_t(w, c, r) is defined as a core-micro-cluster CMC_t(w, c, r) at time t, for a group of streaming points {x_i, t_i}, i = 1, ..., n, and parameters eps and mu, when w >= mu and r <= eps, where w = sum_{i=1..n} T(t_i) is the micro-cluster's weight, c = sum_{i=1..n} x_i T(t_i) / w is its center, and r = sum_{i=1..n} T(t_i) ||c - x_i|| / w is its radius. Two types of micro-clusters: potential core-micro-clusters, when w >= beta*mu, and outlier-micro-clusters, when w < beta*mu. The DenStream Algorithm Procedure ListMaintain 1. Initialize two lists PL and OL, one for the potential core-micro-clusters and the other for the outlier-micro-clusters. 2. Each time a new point p = {x, t} arrives, do one of the following: (a) attempt to merge p into its nearest potential core-micro-cluster c_p; if the resultant micro-cluster has a radius exceeding eps, the merge is omitted; (b) attempt to merge p into its nearest outlier-micro-cluster o_p; if the resultant radius exceeds eps, the merge is omitted.
Otherwise, if the subsequent weight w of o_p exceeds beta*mu, move o_p to PL. (c) Otherwise, a new outlier-micro-cluster is created, centered at p. 3. Periodically prune from PL and OL the micro-clusters whose weight has fallen below beta*mu and below the outlier threshold, respectively. Density Based Clustering Examining Neighbours Core points: in a neighbourhood of a given radius (Eps), each point in a cluster should contain at least a minimum number of objects (MinPts). Each point in its neighbourhood is considered "Directly Density-Reachable"; chains of "Directly Density-Reachable" points form clusters. Visualizing Clusters in Static Datasets "Core-level": the distance from p to its MinPts-nearest neighbour. "Reachability-distance": for two objects p, q in the database, the reachability of p with respect to q is defined as RDist(p, q) = max{core-level of p, dist(p, q)}. Each object is positioned so that all objects before it have the minimum reachability distance to it. The cluster-ordering of a data set can then be represented and understood graphically. Stream Cluster Visualization A time-changing mapping of the clustering structure to a user-understandable format, operating in a real-time environment. Micro-cluster neighbourhood: let eps be a user-defined parameter and PL a potential core-micro-cluster list. Then for a potential core-micro-cluster c_p, we define the micro-cluster neighbourhood of c_p as N(c_p) = {c_q in PL | dist(c_p, c_q) <= 3*eps}, where dist(c_p, c_q) returns the Euclidean distance between the centers of c_p and c_q. Micro-cluster core-level: the core-level of c_p, CLev(c_p), is defined from the radius of c_p. Stream Cluster Visualization: StreamOptics Procedure StreamOptics 1. While there is still a micro-cluster c_p in PL that has a neighbourhood size |N(c_p)| larger than one, initialize a list S of all the micro-clusters in N(c_p). 2. Remove c_p from PL and add it to OL. 3. Remove all micro-clusters in N(c_p) from PL. 4. For each c_l in S, compute RDist(c_l, c_p).
5. For each c_l in S, insert into S all the micro-clusters in N(c_l). 6. Remove from PL all the micro-clusters in S. 7. Insert into OL the object with the smallest RDist(c_l, c_p) from S, until S is empty. StreamOptics: Spawning Clusters StreamOptics: Disappearing Clusters StreamOptics: The Forest CoverType data set Concluding Remarks Methods that can visualise the change of the clustering structure through time have so far only been investigated in lower-dimensional situations or via projection. We hybridise a stream clustering framework with an extension of OPTICS. The method aims to provide insight into both the clustering structure and its evolution in time. We can identify changes in the cluster structure (spawning clusters, fading clusters). The abilities of the method are demonstrated on a real-world dataset. Future work: incorporating other ideas, such as projected clustering, to deal with very high dimensional spaces.
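The core-level/reachability-distance machinery behind the OPTICS-style plot can be sketched on plain points (in the stream setting it is applied to micro-cluster centers). The greedy ordering below is a deliberate simplification: real OPTICS expands from a priority queue over all processed points rather than only the last-placed one, and all names here are our own.

```python
import math

def core_level(p, others, min_pts):
    """Distance from p to its min_pts-nearest neighbour."""
    d = sorted(math.dist(p, q) for q in others if q != p)
    return d[min_pts - 1]

def rdist(p, q, others, min_pts):
    """Reachability distance of q w.r.t. p: max(core-level of p, dist(p, q))."""
    return max(core_level(p, others, min_pts), math.dist(p, q))

def optics_order(points, min_pts=2):
    """Order points so each is placed after the point with minimum
    reachability to it, yielding a 1-D reachability profile: valleys are
    clusters, peaks are the gaps between them (simplified OPTICS)."""
    remaining = list(points)
    order, reach = [remaining.pop(0)], [float("inf")]
    while remaining:
        best = min(remaining,
                   key=lambda q: rdist(order[-1], q, points, min_pts))
        reach.append(rdist(order[-1], best, points, min_pts))
        order.append(best)
        remaining.remove(best)
    return order, reach
```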
Relational Topographic Maps We introduce relational variants of neural topographic maps, including the self-organizing map and neural gas, which allow clustering and visualization of data given as pairwise similarities or dissimilarities, with continuous prototype updates. It is assumed that the (dis-)similarity matrix originates from Euclidean distances; however, the underlying embedding of points is unknown. Batch optimization schemes for topographic map formation are formulated in terms of the given (dis-)similarities and convergence is guaranteed, thus providing a way to transfer batch optimization to relational data. Relational Topographic Maps Outline Prototype-Based Methods - A Brief Introduction Prototype-Based Methods Vector Quantization Neural Gas Median Batch Neural Gas/SOM - General Proximity Data Median Variants Relational Methods - Continuous Prototype Updates Relational Methods pt 1 Relational Methods pt 2 Experimental Results - Does It Really Work?! Experiments pt 1 Experiments pt 2 Experiments pt 3 Experiments pt 4 Experiments pt 5 Summary - Collecting the Pieces Summary
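A key identity behind such relational methods is that distances to prototypes expressed as convex combinations of the (unknown) embedded points can be computed from the pairwise squared-distance matrix alone: for w = sum_i alpha_i x_i with sum_i alpha_i = 1, one has ||x_k - w||^2 = [D alpha]_k - (1/2) alpha^T D alpha, where D holds the squared Euclidean distances. The sketch below just implements and checks this identity (the function name is ours, not from the talk):

```python
def rel_dist_sq(D, alpha, k):
    """Squared distance from point k to the implicit prototype
    w = sum_i alpha[i] * x_i (with sum(alpha) == 1), computed purely from
    the squared-distance matrix D: (D @ alpha)[k] - 0.5 * alpha^T D alpha."""
    n = len(D)
    Da_k = sum(D[k][i] * alpha[i] for i in range(n))
    aDa = sum(alpha[i] * D[i][j] * alpha[j]
              for i in range(n) for j in range(n))
    return Da_k - 0.5 * aDa
```

Batch relational SOM/neural gas updates keep prototypes in this coefficient form and update the alpha vectors, which is what makes continuous prototype updates possible without an embedding.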
A Support Vector Machine Approach to Dutch Part-of-Speech Tagging Part-of-Speech tagging, the assignment of Parts-of-Speech to the words in a given context of use, is a basic technique in many systems that handle natural language. This paper describes a method for supervised training of a Part-of-Speech tagger using a committee of Support Vector Machines on a large corpus of annotated transcriptions of spoken Dutch. Special attention is paid to the decomposition of the large data set into parts for common, uncommon and unknown words. This not only solves the space problems caused by the amount of data, it also improves the tagging time. The performance of the resulting tagger in terms of accuracy is 97.54%, which is quite good, while the speed of the tagger is reasonably good. A Support Vector Machine Approach to Dutch Part of Speech Tagging Outline CGN: Corpus Spoken Dutch Part of Speech Tags Tag Set Part-of-Speech Tagging Goal & Challenge Design of the SVM Tagger Decomposing the SVM Initial Committee of SVMs Training and Test Data Initial Evaluation on Validation Set Compound Analysis pt 1 Compound Analysis pt 2 The Final Committee of SVMs Overall Performance More Detailed Performance Analysis Conclusions pt 1 Conclusions pt 2 Design of the SVM Tagger (a) A Support Vector Machine Approach to Dutch Part of Speech Tagging Mannes Poel, Luite Stegeman & Rieks op den Akker, Dept. Computer Science, Human Media Interaction Group, University of Twente, September 2007 Outline CGN & Part of Speech Tagging Design of the SVM tagger Final Evaluation Conclusions CGN: Corpus Spoken Dutch Large corpus: about 9 million transcribed words.
15 different categories, morpho-syntactically annotated. Sizes in words of some categories: face-to-face conversations 2,626,172; interviews with teachers of Dutch 565,433; phone dialogues (recorded with mini disc) 853,371; business conversations 136,461; political debates, discussions and meetings 360,328; sport comments 208,399; masses and ceremonies 18,075; lectures and discourses 140,901. Part of Speech Tags The full part-of-speech tag set for the CGN consists of 316 different tags. It is very fine-grained; many tags occur only a few times. Main classes: 12 tags such as NOUN, VERB, LET (punctuation mark), SPEC (special), much smaller than commonly used. We considered a tag set consisting of 72 tags: tags 1-8 Noun (N1-N8), 9-21 Verb (WW1-WW13), 22 Article (LID), 23-49 Pronoun (VNW1-VNW27), 50-51 Conjunction (VG1, VG2), 52 Adverb (BW), 53 Interjection (TSW), 54-65 Adjective (ADJ1-ADJ12), 66-68 Preposition (VZ1-VZ3), 69-70 Numeral (TW1, TW2), 71 Punctuation (LET), 72 Special (SPEC). Part-of-Speech Tagging Part-of-Speech Tagging (PoST) is the process of determining the right tag for (ambiguous) words. [Figure: percentage of ambiguous words versus word frequency.] Goal & Challenge The goal is to design an SVM for tagging the ambiguous words; the tag (class) depends on the context. Different approaches: Gimenez and Marquez constructed accurate SVM PoS taggers for English and Spanish; in their approach a linear kernel was used. Nakagawa, Kudo and Matsumoto constructed a polynomial-kernel SVM PoS tagger for English. However, both of the above approaches are applied to written text only, and to corpora of much smaller size than the CGN corpus. The main challenge is to construct an SVM PoS tagger based on the large CGN corpus.
Design of the SVM tagger Single-pass left-to-right tagger, using a sliding window of 7 words w1 w2 w3 w4 w5 w6 w7, where w4 is the word to be tagged. Input coding: PoST ("1-out-of-N") for w1, w2, w3; relative tag frequencies (vector of relative tag frequencies) for w4, w5, w6, w7; word ("1-out-of-N") for w1, ..., w7; capitalization (0 for no capitals, 1 for the first letter, 2 for more than one letter) for w1, ..., w7; length (single number) for w1, ..., w7; number (0 if the word does not contain a number, 1 if it contains at least one number, 2 if the first character is a number, 3 if all characters are numbers) for w1, ..., w7; suffix for w4; PoST bigrams and trigrams for w1, w2, w3; reduced PoST bigrams and trigrams for w1, w2, w3; word bigrams for w1, w2, w3, w5, w6, w7; the suffix, bigram and trigram features are all coded "1-out-of-N". Decomposing the SVM Reason for decomposition: the huge amount of data. If w4 is a common word (frequency of at least 50 in the training set): one SVM for each word, resulting in 795 SVMs; the relative tag frequency of w4 is constant and can be discarded from the input coding. If w4 is an uncommon word: one SVM for each reduced w3 tag; 12 reduced tags, hence 12 different SVMs. If w4 is an unknown word: one SVM for each reduced w3 tag; 12 reduced tags, hence 12 different SVMs; the relative frequency of w4 is set to zero and can hence be discarded. Initial Committee of SVMs if w4 = common word then select SVM-common(w4) else-if w4 = uncommon word then select SVM-uncommon(reduced_tag(w3)) else (w4 is an unknown word) select SVM-unknown(reduced_tag(w3)) Training and test data From the CGN 11 sets were constructed: the first sentence of the CGN corpus was put in set0, the second sentence in set1, . .
., the eleventh sentence in set10, the twelfth sentence in set0, etc. Sets set1 up to and including set10 are used for training and validation; set0 is used for the ultimate performance test. Initial evaluation on validation set (accuracy in %): rbf kernel: common 97.65, uncommon 87.42, unknown 54.14, overall 97.81; 2nd-order polynomial: common 97.67, uncommon 87.14, unknown 53.47, overall 97.82; 3rd-order polynomial: common 97.66, uncommon 87.64, unknown 53.47, overall 97.82; linear: common 97.65, uncommon 87.25, unknown 52.90, overall 97.81. Average unknown-word performance: about 53%. Compound analysis of the unknown word can be used to improve the performance rate on unknown words. Compound analysis In order to improve the unknown-word performance we use so-called compound analysis. Many words in Dutch are compounds, i.e. they consist of two or more words glued together. For instance: schoenveter (shoestring), fietsband (tire of a bicycle), fietsventieldopje (cap of a bicycle tire valve). Method used: decompose unknown words into compounds and use the second compound as an indication for the PoST. Compound analysis pt 2 Strict: both compounds must be in the lexicon; coverage 38.15%, performance on compounds 83.04%, overall unknown-word performance 65.72%. Relaxed: only the second part must be in the lexicon; coverage 64.99%, performance on compounds 72.33%, overall unknown-word performance 68.32%. (Coverage and performance on the validation set.) The Final Committee of SVMs if w4 = common word then select SVM-common(w4) else-if w4 = uncommon word then select SVM-uncommon(reduced_tag(w3)) else-if compound analysis of w4 succeeds % w4 is an unknown word select SVM-uncommon(reduced_tag(w3)) else % w4 is an unknown word select SVM-unknown(reduced_tag(w3)) Overall performance (in %): common 97.28, uncommon 88.40, unknown 70.00, overall (all words) 97.52. Tagging performance on the test set of the final committee of taggers; the overall performance also includes the non-ambiguous words.
Memory-based PoST of Canisius and van den Bosch: 95.96%. Neural-network-based approach: 97.35% (97.88% on known words and 41.67% on unknown words). More detailed performance analysis Best scoring category, phone dialogues: common 97.94, uncommon 88.06, unknown 65.97, overall (all words) 98.15. Worst scoring category, masses & ceremonies: common 96.28, uncommon 75.00, unknown 62.07, overall (all words) 96.03. Conclusions Design of a committee of SVMs to tackle PoST for large corpora: if w4 = common word then select SVM-common(w4) else-if w4 = uncommon word then select SVM-uncommon(reduced_tag(w3)) else-if compound analysis of w4 succeeds % w4 is an unknown word select SVM-uncommon(reduced_tag(w3)) else % w4 is an unknown word select SVM-unknown(reduced_tag(w3)) Performance: common 97.28, uncommon 88.40, unknown 70.00, overall (all words) 97.52. Compound analysis improves unknown-word performance from 53% to 70%. Reasonable tagging speed: 1000 words/sec. Future work: combine the SVM and NN based approaches.
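The committee dispatch plus relaxed compound analysis described above can be sketched as follows. All names are hypothetical, the lexicon stands in for the training-set word lists, and the suffix search is only one simple way to find a compound head; this is not the authors' code.

```python
def relaxed_decompose(word, lexicon, min_len=3):
    """Relaxed compound analysis: return the longest proper suffix of `word`
    (the head of a Dutch compound) found in the lexicon, else None.
    Both parts must have at least min_len characters."""
    for i in range(min_len, len(word) - min_len + 1):
        if word[i:] in lexicon:
            return word[i:]
    return None

def select_svm(w4, lexicon, common_words, reduced_tag_w3, decompose):
    """Dispatch a token to a committee member, mirroring the final committee
    logic: common -> per-word SVM; known-but-uncommon -> SVM-uncommon;
    unknown with a recognizable compound head -> SVM-uncommon;
    otherwise -> SVM-unknown (all keyed by the reduced tag of w3)."""
    if w4 in common_words:
        return ("SVM-common", w4)
    if w4 in lexicon:  # uncommon but known word
        return ("SVM-uncommon", reduced_tag_w3)
    if decompose(w4, lexicon) is not None:  # compound analysis succeeds
        return ("SVM-uncommon", reduced_tag_w3)
    return ("SVM-unknown", reduced_tag_w3)
```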
Towards Adaptive Web Mining: Histograms and Contexts in Text Data Clustering We present a novel approach to growing neural gas (GNG) based clustering of high-dimensional text data. We enhance our Contextual GNG models (proposed previously to shift the majority of calculations to context-sensitive, local sub-graphs and local sub-spaces, and thus to reduce computational complexity) by developing a new, histogram-based method for incremental model adaptation and evaluation of its stability. Towards Adaptive Web Mining: Histograms and Contexts in Text Data Clustering Outline - BEATCA Overview BEATCA Overview Outline - Contextual Approach Contextual Approach: Vector Space Model Contextual Approach: Contextual Maps Contextual Approach: Contextual Term Weights Contextual Approach: Advantages Outline - Histograms Histograms: Concept Overview Histograms: Contextual Term/Document Importance Histograms: Some Applications Outline - Experimental Results Experimental Results: Experimental Setting pt 1 Experimental Results: Experimental Setting pt 2 Experimental Results: Reclassification Measure Experimental Results: Reclassification Results pt 1 Experimental Results: Reclassification Results pt 2 Experimental Results: Reclassification Results pt 3 Experimental Results: Reclassification Results pt 4 Outline - Conclusions Conclusions: Summary Conclusions: Future Research Towards Adaptive Web Mining Krzysztof Ciesielski and Mieczyslaw Klopotek, Institute of Computer Science, Polish Academy of Sciences, ul.
Ordona 21, 01-237 Warszawa, Poland, kciesiel,klopotek@ipipan.waw.pl. The 7th International Symposium on Intelligent Data Analysis, Ljubljana, Slovenia, 6-8 September 2007. [Figure: BEATCA map-based search engine GUI.] Outline BEATCA overview; contextual approach (vector space model, contextual maps, contextual term weights, advantages); histograms; experimental results; conclusions. Contextual Approach: Vector Space Model In the so-called vector model a document is considered as a vector in the space spanned by the words it contains. The affinity of a document to another document or to a query is measured as the cosine of the angle between the two vectors. tf-idf weights: w_{t,d} = f_{t,d} * log(N(D) / f_t). Contextual Maps [Figure: phases of contextual maps creation.] Contextual Term Weights Clusters are represented in different subspaces (contexts). The attribute (term, phrase) specificity for context C, s_{t,C}, is computed from the term frequencies f_{t,d} and the document importances m_{d,C} in context C (e.g. the document's fuzzy cluster membership level). The contextual term weighting function (which replaces tf-idf) is w_{t,d,C} = w_C(t, d) = s_{t,C} * f_{t,d} *
log(|C| / f_t(C)). Contextual Approach: Advantages Scalability: significant reduction of model creation time (faster convergence to a stable set of nodes/antibodies), reduced space complexity, possibility of parallel and distributed processing. Model quality improvements: clustering model evaluation measures, supervised measures (cluster purity, cluster entropy, mutual information between clusters and labels), graph/network structure evaluation (meta-clustering). Histograms: Concept Overview On the basis of histograms, one can measure contextual term importance (its "typicality" for documents within a group) and document similarity to a group. Term contextual importance is defined as m_{t,C} = sum_{q in Omega} [q * log(h_{t,C}(q))] / Q_{t,C}, and document similarity to a context as m_{d,C} = m_w(d, C) = sum_{t in d} m_{t,C} * h_{t,C}(q_{t,d}) / sum_{t in d} m_{t,C}, where h_{t,C}(q) is the normalized histogram (empirical density function) of term t's weights in context C.
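The normalized histogram h_{t,C} above is just an empirical density of a term's weights over a cluster's documents. A minimal sketch; the bin count, weight range and function name are our own choices, not BEATCA code:

```python
from collections import Counter

def term_histogram(weights, n_bins=5, w_max=1.0):
    """Normalized histogram of a term's weights across a cluster's documents
    (an empirical density); bins are equal-width on [0, w_max], values at or
    above w_max fall into the last bin."""
    counts = Counter(min(int(w / w_max * n_bins), n_bins - 1) for w in weights)
    total = len(weights)
    return [counts.get(b, 0) / total for b in range(n_bins)]
```

Such per-term histograms, rather than a single centroid coordinate, are what the cluster description stores.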
Histograms: Some Applications Many possible applications of the histogram-based approach: robust dimensionality reduction of the vector space (outperforms entropy-based term selection); cluster space labeling with descriptive terms and phrases; new histogram-based measures: a dissimilarity measure (based on Hellinger divergences of the term weight distributions), evaluation of a cluster's distributional homogeneity, and clustering structure stability (reclassification); a new fuzzy subspace clustering algorithm (Fuzzy C-Histograms) with histogram-based cluster representatives and dynamic adaptation of the local (contextual) similarity measure during the clustering process; possibilities of personalization of the clustering process and of clustering structure visualization. Experimental Results: Experimental Setting Categorized document collections: 20 News: 20000 documents in 20 groups of equal size; 12 News: 8094 documents in 12 groups of varied size; Reuters: 21578 documents in 90 groups of varied size; WebKb: 8282 web pages in 7 categories; 100K www: 96908 pages from 266 Polish hosts. The reclassification procedure was applied to evaluate: 1. the stability of category-based group descriptions, via group centroids and via sets of histograms; 2. the influence of vector space representations: contextual weights vs. tf-idf weights, and extension of the representation with contextual phrases. Reclassification Measure The reclassification measure evaluates the consistency of the model-derived clustering with the histogram-based clustering space description: for each cluster (or document group) C, build its histogram representation {h_{t,C} : t in C}; for each document, calculate its similarity (membership) m_{d,C} to every cluster C in
S, described by a set of histograms, and choose the most similar cluster C = argmax_{C in S} m_{d,C}. A document is said to be correctly reclassified if its original cluster C* is the same as its most similar cluster C. Reclassification Results Reclassification of contextual groups with respect to cluster representations and vector space weights (correct-reclassification rates; weights in order tf-idf, wtdC-phrases, wtdC+phrases). Centroid representation: 12News 0.665, 0.743, 0.871; 20News 0.666, 0.938, 0.946; Reuters 0.247, 0.608, 0.612; WebKb 0.704, 0.752, 0.768; 100K www 0.255, 0.503, 0.557. Histogram representation: 12News 0.878, 0.926, 0.965; 20News 0.898, 0.988, 0.997; Reuters 0.54, 0.849, 0.861; WebKb 0.697, 0.98, 0.982; 100K www 0.67, 0.829, 0.969. Many groups of varied size lead to an unstable centroid-based representation. Precision and recall for tf-idf weights and centroid cluster representatives of the 12 Newsgroups data (group: precision, recall, size, number reclassified): "null": -, -, 0, 55; comp.windows.x: 0.629, 0.883, 408, 291; rec.antiques.radio+photo: 0.584, 0.888, 612, 403; rec.models.rockets: 0.762, 0.598, 999, 1274; rec.sport.baseball: 0.806, 0.609, 999, 1322; rec.sport.hockey: 0.358, 0.85, 365, 154; sci.math: 0.692, 0.707, 624, 611; sci.med: 0.226, 0.961, 326, 77; sci.physics: 0.774, 0.545, 861, 1222; soc.culture.israel: 0.597, 0.929, 666, 428; talk.politics.mideast: 0.849, 0.548, 1000, 1548; talk.politics.misc: 0.692, 0.945, 872, 639; talk.religion.misc: 0.131, 0.723, 357, 65. Histograms / wtdC: "null",
comp.windows.x, rec.antiques.radio+photo, rec.models.rockets, rec.sport.baseball, rec.sport.hockey, sci.math, sci.med, sci.physics, soc.culture.israel, talk.politics.mideast, talk.politics.misc, talk.religion.misc. Table: precision and recall for contextual weights and histogram cluster representatives of the 12 Newsgroups data. Conclusions: Summary A new concept of document cluster characterization via term weight distribution histograms, augmented with a contextual term weighting approach: individual clusters are represented in different subspaces of the term-vector space, with locally induced term importance and dimensionality reduction; the histogram-based cluster description is richer and more adequate than the centroid representation; contextually localized document representations and similarity measures are dynamically adapted during the clustering process; external (e.g. semi-supervised or user-personalized) information can be exploited in clustering and visualization. Conclusions: Future Research Adaptive, histogram-based text mining: a fuzzy subspace clustering model with dynamically adapted feature sets; extension of the histogram-based approach with data quality assessment measures (including identification of spam documents and spam web sites); exploitation of semi-supervised information in clustering; analysis of within-context term similarities (contextual thesauri); user-profile-sensitive document/context recommendation and clustering model visualization; overlapping contexts and extension of the histogram approach to content+links data clustering. Thank you!
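The reclassification measure used in the experiments reduces to: assign each document to its argmax-similarity cluster and count agreements with its original group. A generic sketch; the similarity function m_{d,C} is supplied by the caller (in the paper it comes from the histogram representation), and the toy similarity in the usage below is our own.

```python
def reclassification_rate(docs, labels, similarity):
    """Fraction of documents whose most similar cluster
    (argmax over clusters of similarity(doc, cluster)) equals their
    original cluster label."""
    clusters = sorted(set(labels))
    correct = sum(
        1 for d, lab in zip(docs, labels)
        if max(clusters, key=lambda c: similarity(d, c)) == lab)
    return correct / len(docs)
```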
Does SVM Really Scale Up to Large Bag of Words Feature Spaces? We are concerned with the problem of learning classification rules in text categorization, where many authors have presented Support Vector Machines (SVM) as the leading classification method. A number of studies, however, have repeatedly pointed out that in some situations SVM is outperformed by simpler methods such as naive Bayes or the nearest-neighbor rule. In this paper, we aim at developing a better understanding of SVM behaviour in typical text categorization problems represented by sparse bag-of-words feature spaces. We study in detail the performance and the number of support vectors when varying the training set size, the number of features and, unlike existing studies, also the SVM free parameter C, which is the upper bound on the Lagrange multipliers in the SVM dual. We show that SVM solutions with small C are high performers. However, most training documents are then bounded support vectors sharing the same weight C; thus SVM reduces to a nearest-mean classifier, which raises an interesting question about the merits of SVM in sparse bag-of-words feature spaces. Additionally, SVM suffers from performance deterioration for particular training set size/number of features combinations. Does SVM really scale up to large bag of words feature spaces? Motivation Comparison of classification algorithms pt 1 Comparison of classification algorithms pt 2 Text classification and SVM A performance dip for SVM Characterization of the performance dip The nature of the SVM solution varies largely Three different areas are identified Area (1) - limit condition for the use of SVM Area (2) - uncommon experimental settings Area (3) - displays the best performing SVM solutions Partial explanation of the performance dip Concluding remarks Comparison of classification algorithms pt 2 (a) September 6th, 2007 Motivation SVM is often presented as the leading classification method. In some situations, naive Bayes or the nearest-neighbor rule outperforms SVM. Why? In which settings?
Unavoidable situations? Our aim is to develop a better understanding of SVM behaviour in text classification Comparison of classification algorithms Usually, many learning curves are drawn. An interesting phenomenon was observed. Text classification and SVM A performance dip for SVM Characterization of the performance dip Two steps to characterize the dip: vary the size of the feature vector, and the C parameter of SVM Three types of weight for a feature vector: inactive (α_i = 0), unbounded support vector (0 < α_i < C), bounded support vector (α_i = C) The nature of the SVM solution varies widely Three different areas are identified Area (1), limit condition for the use of SVM One should not apply SVM when C is set too small Why so? 100% of the feature vectors are bounded with α_i = C, thus every feature vector has equal weight and SVM behaves as a nearest mean classifier. Why a limit condition? w = Σ_{i=1}^{N} y_i α_i x_i with 0 ≤ α_i ≤ C, so lim_{C→0} ||w||_2 = 0; since the margin ρ = 1/||w||_2, lim_{C→0} ρ = ∞. Area (2), uncommon experimental settings Large proportion of bounded SV because the feature vectors overlap Uncommon conditions for text classification Large feature vectors are expected in a bag of words representation Area (3) displays the best performing SVM solutions Normal application domain for SVM Which C setting performs best? Depends on the input feature space for the x_i Ongoing research Partial explanation of the performance dip A performance dip Between areas (2) and (3), the proportion of bounded SV reduces A partial transfer from bounded to unbounded SV occurs The performance dip matches the dip in the total number of SV! Concluding remarks Advice for people who use SVM for text classification One should not use SVM in area (1) with too small C An SVM solution with 100% bounded SV is a nearest mean classifier, which is unwanted. So monitor carefully the number of bounded SV; maybe some erroneous feature vectors overlap and have distinct classes, e.g. area (2). In that case, they are inseparable!
Incremental Learning with Multiple Classifier Systems Using Correction Filters for Classification Classification is a highly relevant task within the data mining area. This task is not trivial, and some difficulties can arise depending on the nature of the problem. Multiple classifier systems have been used to construct ensembles of base classifiers in order to solve or alleviate some of those problems. One problem widely studied in recent years is how to learn when the datasets are too large or when new information can arrive at any time. In that case, incremental learning is an approach that can be used. Some works have used multiple classifier systems to learn in an incremental way, and the results are very promising. The aim of this paper is to propose a method for improving the classification (or prediction) accuracy reached by multiple classifier systems in this context. Incremental Learning with Multiple Classifier Systems Using Correction Filters for Classification Outline Introduction Incremental Learning with MCS pt 1 Incremental Learning with MCS pt 2 Incremental Learning with MCS pt 3 Incremental MCS with Correction Filters pt 1 Incremental MCS with Correction Filters pt 2 Incremental MCS with Correction Filters pt 3 Incremental MCS with Correction Filters pt 4 Experiments and Results pt 1 Experiments and Results pt 2 Experiments and Results pt 3 Conclusions pt 1 Conclusions pt 2 Future Work Thank You Experiments and Results pt 3 (a)
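A minimal sketch of the generic incremental multiple classifier system the abstract refers to (not the paper's correction-filter method): one base classifier is trained per arriving data chunk, and predictions are combined by majority vote. The class name and the chunked synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class IncrementalMCS:
    """Ensemble that grows by one base classifier per arriving data chunk."""

    def __init__(self):
        self.members = []

    def learn_chunk(self, X_chunk, y_chunk):
        # Train a new base learner on the newly arrived chunk only.
        tree = DecisionTreeClassifier(random_state=0).fit(X_chunk, y_chunk)
        self.members.append(tree)

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.members])
        # Majority vote over ensemble members, per example.
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

mcs = IncrementalMCS()
for start in range(0, 300, 100):      # data arrives in three chunks
    mcs.learn_chunk(X[start:start + 100], y[start:start + 100])

acc = np.mean(mcs.predict(X) == y)
```

The paper's contribution sits on top of a scheme like this: a correction mechanism adjusts the combined prediction, which this sketch deliberately omits.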
Combining Bagging and Random Subspaces to Create Better Ensembles Random forests are one of the best performing methods for constructing ensembles. They derive their strength from two aspects: using random subsamples of the training data (as in bagging) and randomizing the algorithm for learning base-level classifiers (decision trees). The base-level algorithm randomly selects a subset of the features at each step of tree construction and chooses the best among these. We propose to use a combination of concepts used in bagging and random subspaces to achieve a similar effect. The latter randomly selects a subset of the features at the start and uses a deterministic version of the base-level algorithm (and is thus somewhat similar to the randomized version of the algorithm). The results of our experiments show that the proposed approach has a comparable performance to that of random forests, with the added advantage of being applicable to any base-level algorithm without the need to randomize the latter. Combining Bagging and Random Subspaces to Create Better Ensembles Outline Motivation Randomization Methods for Constructing Ensembles Bagging Random Subspace Method Random Forest Combining Bagging and Random Subspaces Training Set S pt 1 Training Set S pt 2 Training Set S pt 3 Training Set S pt 4 Training Set S pt 5 Training Set S pt 6 Experiments Results pt 1 Results pt 2 Results pt 3 Results: Wilcoxon test Summary Further work
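A hedged sketch of the combination described above: each ensemble member sees a bootstrap sample of the examples (bagging) and a random subset of the features (random subspaces), while the base-level learner itself stays deterministic. scikit-learn's BaggingClassifier supports this directly; the dataset and parameter values are illustrative, not the paper's experimental setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

ensemble = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),  # deterministic base-level learner
    n_estimators=50,
    bootstrap=True,       # bagging: bootstrap the training examples
    max_features=0.5,     # random subspaces: half of the features per member
    random_state=0,
)
score = cross_val_score(ensemble, X, y, cv=5).mean()
```

Unlike random forests, nothing here requires the base-level algorithm to be a randomized tree: any deterministic learner can be plugged in unchanged, which is exactly the advantage the abstract claims.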
Machine Learning Reductions There are several different classification problems commonly encountered in real world applications, such as 'importance weighted classification', 'cost sensitive classification', 'reinforcement learning', 'regression' and others. Many of these problems can be related to each other by simple machines (reductions) that transform problems of one type into problems of another type. Finding a reduction from your problem to a more common problem allows the reuse of simple learning algorithms to solve relatively complex problems. It also induces an organization on learning problems: problems that can easily be reduced to each other are 'nearby', and problems which cannot be so reduced are not close. Machine Learning Reductions Tutorial Scenario 1 Scenario 2 Where did the Hollywood ending go? Characteristics of Learning Reductions It's reductionist (= good research direction) Elemental It's easy (= you can use it too) Given a Binary classifier, how can we solve Classification Definition Importance Weighted Classification The core theorem: folklore How do we change distributions? 
Distribution Transform: Rejection Sampling Costing (Sw, A) Costing+classifier applied to the KDD-98 dataset Costing+classifier applied to the DMEF2 dataset Given a Binary classifier, how can we solve Square Error Regression Reasons for the Regression Problem The Probing Method: Observations The Probing Algorithm The Probing Method: Details Comparison for Probing with Squared Error The one classifier trick Probing Theory The proof, pictorially The proof, mathematically Proof, continued Proof II: Properties of most efficient error inducing method A Modification: Quantile Regression normalized performance Some Caveats Given a Binary classifier, how can we solve Given a Binary classifier, how can we solve Multiclass Reductions Multiclass Reductions 01 Multiclass Classification ECOC Transformation ECOC Transformation 01 ECOC Prediction ECOC Analysis Two little problems PECOC: Probabilistic ECOC Probabilistic Error Correcting Output Code Probabilistic Error Correcting Output Code 01 PECOC Analysis Proof Proof 01 Proof: Analyzing errors An Extension Given a Binary classifier, how can we solve Cost Sensitive Classification Sensitive Error Correcting Output Code SECOC, the training algorithm SECOC, the prediction algorithm SECOC Analysis Proof (sketch only, much like PECOC) Another Bonus Some Things to Think About Final Thoughts Regret Transform Reductions

Binary classification: a classifier is a map c : X → {0, 1}; given a distribution D over X × {0, 1} and a sample S ∈ (X × {0, 1})*, the error rate is e(D, c) = Pr_{(x,y)~D}[c(x) ≠ y].

Importance weighted classification: examples (x, y, i) carry an importance i ≥ 0, and the importance weighted error is e_w(D, c) = E_{(x,y,i)~D}[i I(c(x) ≠ y)].

The core theorem (folklore): define D'(x, y, i) = i D(x, y, i) / E_{(x,y,i)~D}[i]. Then e_w(D, c) = E_{(x,y,i)~D}[i] e(D', c), since e_w(D, c) = Σ_{(x,y,i)} i D(x, y, i) I(c(x) ≠ y) = E_{(x,y,i)~D}[i] Σ_{(x,y,i)} D'(x, y, i) I(c(x) ≠ y) = E_{(x,y,i)~D}[i] Pr_{(x,y,i)~D'}[c(x) ≠ y] = E_{(x,y,i)~D}[i] e(D', c).

[Diagram: rejection sampling — examples with importances i_1, ..., i_10 are kept with probability proportional to their importance; the accepted examples are distributed according to D'.]

Costing(Sw, A): for t = 1, ..., 10, draw an importance weighted sample S_t from Sw by rejection sampling and train c_t = A(S_t); the output classifier is the majority vote c(x) = majority({c_1(x), ..., c_10(x)}).

[Figure: "profit" of Costing+classifier on the KDD-98 dataset (0 to 20000) for the base classification algorithms Naive B., B.N.B., C4.5 and SVM, with the championship winner's profit marked for reference.]

[Figure: "profit" of Costing+classifier on the DMEF2 dataset (0 to 40000) for the base classification algorithms Naive B., B.N.B., C4.5 and SVM.]

Square error regression: a regressor is a map h : X → [0, 1]; given a distribution D over X × [0, 1] and a sample S ∈ (X × [0, 1])*, the error is er(D, h) = E_{(x,y)~D}[(h(x) − y)²].

The Probing transform: for a threshold t ∈ [0, 1], map (x, y) → (x, I(y < t), |y − t|). The optimal importance weighted classifier predicts c(x) = 1 ⟺ E_{y~D|x}[y] < t, because c(x) = 1 ⟺ E_{y~D|x} I(y ≥ t)(t − y) > E_{y~D|x} I(y < t)(y − t) ⟺ 0 > E_{y~D|x} I(y < t)(y − t) + E_{y~D|x} I(y ≥ t)(y − t) ⟺ 0 > E_{y~D|x}[y − t].

[Diagram: the Probing algorithm — for thresholds p ∈ {0.01, 0.1, 0.5, 0.7, 0.9, 0.99}, an importance weighted sample S_p is formed, an importance weighted classifier c_p = A(S_p) is trained, and the mean of the binary predictions c_p(x) is the final prediction.]

[Figure: comparison with Probing for squared error (0 to 1) over the datasets adult, Austra., biology, breast, COIL, diabetes, echo, Hepatitis, ion, KDD98, Krvskp, liver, shroom, physics and sick, for NB, NB+Sig, NB+Prob; C45, C45+Bag, C45+Prob; SVM+Sig, SVM+Prob.]

The one classifier trick: pool the transformed samples, S' = ∪_t {((x, t), y, i) : (x, y, i) ∈ S_t}, and train a single classifier c = A(S') with c_t(x) = c(x, t), drawing t ~ U(0, 1).

Probing Theory: for c : X × [0, 1] → {0, 1}, define the regressor h_c(x) = Pr_{t~U(0,1)}(c(x, t) = 1). Then the squared error regret of h_c is bounded by the importance weighted binary regret: E_{x,y~D}(y − h_c(x))² − E_{x,y~D}(y − E[y|x])² ≤ e(Probe(D), c) − min_{c'} e(Probe(D), c').

The proof, mathematically: e(Probe(D), c) = 2 E_{t,(x,y)~D} |y − t| I(c(x, t) ≠ y), and min_{c'} e(Probe(D), c') = 2 E_{x,t} min{E_{y~D|x} I(y ≥ t)(y − t), E_{y~D|x} I(y < t)(t − y)}. For fixed x and t, the regret incurred by an error is |E_{y~D|x} I(y ≥ t)(y − t) − E_{y~D|x} I(y < t)(t − y)| = |E_{y~D|x}[y − t]| = |t − E[y|x]|, hence with the factor 2 the cost of an error at (x, t) is 2|t − E[y|x]|. If h_c(x) = E[y|x] + ε, then c(x, t) must err on an extra ε-measure of thresholds t; the cheapest (most efficient error inducing) choice is the interval adjacent to E[y|x], contributing ∫_{E[y|x]}^{E[y|x]+ε} 2|t − E[y|x]| dt = ε², which is exactly the squared error regret of h_c at x.

A Modification: Quantile Regression: transform (x, y) → (x, I(y < t), q I(y < t) + (1 − q) I(y > t)); the optimal prediction becomes the conditional q-quantile, i.e. the q(x) with Pr_{y~D|x}(y ≤ q(x)) = q, and the quantile (pinball) loss regret of the derived predictor is bounded by the binary regret.

[Figure: normalized quantile prediction performance (0 to 1) of linear kernel, quanting/log. reg. and quanting/J48 on adult, KDD98, Calif. and Boston at quantiles 0.1, 0.5 and 0.9.]

Multiclass classification: labels come from {1, ..., k}; a multiclass classifier is c_m : X → {1, ..., k}, samples are S ∈ (X × {1, ..., k})*, and e(D, c_m) = Pr_{(x,y)~D}(c_m(x) ≠ y).

ECOC transformation: each label is assigned a binary codeword over a collection of label subsets s; for each s a binary problem (x, y_m) → (x, I(y_m ∈ s)) is created, and one classifier c with c(x, s) = c_s(x) is trained over all of them. ECOC prediction: output the label whose codeword is closest to the vector of binary predictions. ECOC analysis: e(D_m, ECOC(c)) ≤ 4 e(ECOC(D_m), c).

[Table: worked ECOC decoding example, with per-label sums of binary predictions 1.08, 1.10, 1.82 and 2.00.]

Two little problems motivate PECOC: Probabilistic ECOC: the binary learner is asked for probability estimates, realized through classifiers c(x, s, p) = c_sp(x) at thresholds p. PECOC analysis: the multiclass regret is bounded by the square root of the binary regret, e(D_m, PECOC(c)) − min_{c_m} e(D_m, c_m) ≤ 4 √(e(PECOC(D_m), c) − min_{c'} e(PECOC(D_m), c')). Proof: averaging the binary probability estimates over random label subsets isolates D_m(l|x), so errors of the binary estimates (measured as E_x |D(1|x) − ĉ(x)|) translate into errors on the recovered conditional probabilities D_m(·|x).

[Figure: PECOC vs ECOC test error rates (0 to 0.8) across problems for SMO, J48 and LR; points below the diagonal favour PECOC.]

Cost sensitive classification: examples (x, ℓ) carry a cost vector ℓ ∈ [0, ∞)^k; samples are S ∈ (X × [0, ∞)^k)*, and the loss is e_cs(D_cs, c_m) = E_{(x,ℓ)~D_cs}[ℓ_{c_m(x)}].

SECOC, the training algorithm: for each label subset s and threshold t, create importance weighted binary examples (x, I(ℓ_s ≤ t|ℓ|), |ℓ_s − t|ℓ||), where ℓ_s aggregates the costs of the labels in s and |ℓ| is the total cost; pool them as S_st = {((x, s, t), y, i) : (x, y, i) ∈ S_st} and train one classifier c = A(∪_st S_st) with c_st(x) = c(x, s, t). SECOC, the prediction algorithm: output argmin_y E_s E_t [I(y ∈ s) c_st(x) + I(y ∉ s)(1 − c_st(x))]. SECOC analysis: e_cs(D_cs, SECOC(c)) − min_{c_m} e_cs(D_cs, c_m) ≤ 4 √(e(SECOC(D_cs), c) − min_{c'} e(SECOC(D_cs), c')) · E_{(x,ℓ)~D_cs}|ℓ|. Proof (sketch only, much like PECOC). An extension achieves similar guarantees with O(log k) binary problems.

[Diagram: the tree of Regret Transform Reductions rooted at binary Classification (regret r), with nodes Outlier Detection, Quantile Regression (quantile loss, via Quanting/Probing; regret r · E(importance)), Importance Weighted Classification (via Costing, etc.), Mean Regression (squared error, via Probing), Multiclass Classification (via PECOC), Cost Sensitive Classification (via SECOC; regret 4√(r · E(sum of costs))), Ranking, Regression, Reinforcement Learning with a generative model, and all bounded loss classification problems; a further edge is labelled 4√(r · E(max cost)).]
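A minimal sketch of the Costing reduction outlined above (helper names and the synthetic task are assumptions): importance weighted classification is reduced to ordinary classification by rejection-sampling the training set several times and voting the resulting classifiers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rejection_sample(X, y, w, rng):
    # Keep example j with probability w_j / max(w); the accepted examples are
    # distributed according to the importance-reweighted distribution D'.
    keep = rng.random(len(w)) < w / w.max()
    return X[keep], y[keep]

def costing(X, y, w, learn, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    models = [learn(*rejection_sample(X, y, w, rng)) for _ in range(n_models)]

    def predict(Xq):
        # Majority vote over the classifiers trained on rejection samples.
        votes = np.mean([m.predict(Xq) for m in models], axis=0)
        return (votes >= 0.5).astype(int)

    return predict

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)
w = np.where(y == 1, 5.0, 1.0)  # errors on class 1 are five times as costly

learn = lambda Xs, ys: DecisionTreeClassifier(random_state=0).fit(Xs, ys)
predict = costing(X, y, w, learn)
acc = np.mean(predict(X) == y)
```

The base learner never sees the importances: the reduction pushes them entirely into the sampling distribution, which is the point of the folklore theorem.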
Two Bagging Algorithms with Coupled Learners to Encourage Diversity In this paper, we present two ensemble learning algorithms which make use of bootstrapping and out-of-bag estimation in an attempt to inherit the robustness of bagging to overfitting. Unlike in bagging, with these algorithms learners have visibility of the other learners and cooperate to achieve diversity, a characteristic that has proved to be an issue of major concern to ensemble models. Experiments are provided using two regression problems obtained from UCI. Two bagging algorithms with coupled learners to encourage diversity Ensemble approach Diversity in ensembles NC algorithm Algorithm (I) Resampling in ensembles Algorithm (II) Training with residuals to compute diversity pt 1 Training with residuals to compute diversity pt 2 Algorithm (III) pt 1 Algorithm (III) pt 2 Experiments pt 1 Experiments pt 2 Concluding Remarks Two bagging algorithms with coupled learners to encourage diversity Carlos Valle (1), Ricardo Ñanculef (1), Héctor Allende (1), Claudio Moraga (2) (1) Universidad Técnica Federico Santa María, Chile (2) Dortmund University, Germany Ljubljana, September 2007

Ensemble Approach Take a space of functions H. An ensemble strategy chooses from H a set of functions which are then combined in some way to produce the final hypothesis: F = {f_1, f_2, ..., f_n}. (1) In regression it is typically a convex combination F(x) = Σ_{i=1}^{n} w_i f_i(x). How to choose f_i and w_i?

Diversity in Ensembles A central issue in the generation of the ensemble is the diversity of its members. Several algorithms encourage diversity. What exactly is diversity? In regression we can define it as the second term in the so-called "ambiguity decomposition": with Σ_i w_i = 1, it can be proved that e = (F − y)² = Σ_i w_i (y − f_i)² − Σ_i w_i (f_i − F)². (3) So, local biases can be "compensated" with diversity.

NC algorithm The Negative Correlation Learning algorithm (NC) trains the i-th learner using e_i = (y − f_i)² − λ (f_i − F)², (4) where λ weights the importance of the ambiguity component versus the individual performance. A theoretical argument is presented to choose λ according to λ = 2γ(1 − 1/n), (5) where n is the size of the ensemble and the value of γ ∈ [0, 1] is problem dependent.

Algorithm (I) After some steps of algebra we can note that equation (3) can alternatively be stated as (F − y)² = Σ_i w_i² (y − f_i)² + Σ_i Σ_{j≠i} w_i w_j (f_i − y)(f_j − y). So, we proposed algorithm (I), which trains the learners of the ensemble synchronously with the loss function (6).

Resampling in Ensembles Bagging stabilizes prediction by equalizing the influence of training examples; bootstrapping selectively reduces the influence of leverage points, in particular badly influential points; bootstrapping the base learner has an effect similar to robust M-estimators. So, we can combine the mutual cooperation of algorithm (I) with the robustness of bagging; the result is algorithm (II).

Algorithm 1 Algorithm (II)
1: Let S = {(x_k, y_k); k = 1, ..., m} be a training set.
2: Let {f_i; i = 1, ..., n} be a set of n learners and f_i^t the function implemented by the learner f_i at time t = 0, ..., T.
3: Generate n bootstrap samples S_i, i = 1, ..., n from S.
4: Make one epoch on each learner f_i with the training set S_i to obtain the initial functions f_i^0.
5: for t = 1 to T do
6:   for i = 1 to n do
7:     Make one epoch on the learner f_i with S_i and the loss function NegCorrErr_t(f_i(x), y) = (y − f_i(x))² + λ (f_i(x) − y) Σ_{j≠i} (f_j^{t−1}(x) − y)
8:   end for
9:   Set the ensemble predictor at time t to be F^t(x) = (1/n) Σ_{i=1}^{n} f_i^t(x)
10: end for

Training with residuals to compute diversity For a single fixed example (x_k, y_k), optimality implies −2(y_k − f_i^t(x_k)) + λ Σ_{j≠i} (f_j^{t−1}(x_k) − y_k) = 0, (6) that is, f_i^t(x_k) = ỹ_k, where ỹ_k = y_k + β r_i^{t−1}(k), (7) where in turn β = −λ/2 and r_i^{t−1}(k) = Σ_{j≠i} (f_j^{t−1}(x_k) − y_k).
In general, R̂ = (1/m) Σ_k l(f(x_k), y_k) tends to be a downward biased estimator of the true generalization error R = E[l(f(x), y)], and ∂R̂/∂f is not a good estimator of ∂R/∂f. So, a more realistic estimate of the residual of a single example is the residual error after training with a sample that excludes that example. In bagging, about 37% of the examples do not appear in each bootstrap sample. These left-out examples can be used to form accurate estimates of quantities of interest.

Algorithm (III) We propose to compute the residuals r_i^{t−1}(k) of equation (7) as r̃_i^{t−1}(k) = Σ_{j∈C_i^k} (f_j^{t−1}(x_k) − y_k), where C_i^k contains the learners for which example k is out-of-bag. This is equivalent to optimizing at each iteration the objective function oobErr_t(f_i(x), y) = (y − f_i(x))² + λ (f_i(x) − y) Σ_{j∈C_i(x,y)} (f_j^{t−1}(x) − y). (10) Algorithm (III) can be considered a self-iterated bagging with coupled learners, that is, learners that cooperate to achieve diversity.

Algorithm 2 Algorithm (III)
1: Let S = {(x_i, y_i); i = 1, ..., m} be a training set.
2: Let f_i, i = 0, ..., n − 1, be a set of n learners and f_i^t the function implemented by the learner f_i at time t = 0, ..., T.
3: Generate n bootstrap samples S_i, i = 1, ..., n from S.
4: for t = 1 to T do
5:   Make one epoch on each learner f_i with the learning function oobErr_t(f_i(x), y) as defined in equation (10) and the set of examples S_i.
6:   Set the ensemble predictor at time t to be F^t(x) = (1/n) Σ_{i=0}^{n−1} f_i^t(x)
7: end for

Experiments Table: training and testing set errors for NC, Bagging, Alg. I, Alg. II and Alg. III (three experimental settings, pairs of values each):
Training Set — Bagging: 10.66 10.79, 10.58 10.72, 10.53 10.65; Alg. I: 10.98 11.23, 11.11 11.46, 11.95 11.19; Alg. II: 9.22 9.33, 9.27 9.36, 8.89 8.97; Alg. III: 8.76 8.88, 9.12 9.36, 8.65 8.75.
Testing Set — Bagging: 13.97 14.93, 13.89 14.86, 13.73 14.19; Alg. I: 14.89 15.38, 15.17 15.76, 14.86 15.36; Alg. II: 12.50 12.90, 13.02 13.46, 12.81 13.25; Alg. III: 12.75 13.22, 12.69 13.13, 12.89 13.36.

Concluding Remarks In this paper we have shown two algorithms that are constructed using ideas from the negative correlation algorithm and bagging. 
The results support the hypothesis that bootstrapping can help to obtain estimators more robust to overfitting. The use of out-of-bag estimates of the residuals, however, does not improve on the performance of algorithm (II). Future work has then to include a more exhaustive experimental analysis.
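A simplified, hedged sketch of the coupled training loop in the spirit of algorithm (II) (assumed details: decision-tree learners refit from scratch rather than taking one epoch): each learner is refit on its bootstrap sample to the pseudo-target ỹ_k = y_k − (λ/2) r_i^{t−1}(k) of equation (7), where r_i^{t−1}(k) sums the other learners' residuals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

n, T, lam = 5, 10, 0.4
boot = [rng.integers(0, len(X), len(X)) for _ in range(n)]  # bootstrap indices
learners = [DecisionTreeRegressor(max_depth=4, random_state=0).fit(X[b], y[b])
            for b in boot]

for t in range(T):
    preds = np.column_stack([f.predict(X) for f in learners])
    for i, b in enumerate(boot):
        others = [j for j in range(n) if j != i]
        # r_i(k) = sum over j != i of (f_j(x_k) - y_k): the coupling term.
        r = preds[:, others].sum(axis=1) - (n - 1) * y
        y_tilde = y - (lam / 2.0) * r          # pseudo-target of equation (7)
        learners[i] = DecisionTreeRegressor(max_depth=4, random_state=0).fit(
            X[b], y_tilde[b])

F = np.mean([f.predict(X) for f in learners], axis=0)  # ensemble prediction
mse = np.mean((F - y) ** 2)
```

Switching the residual sum to only the learners for which an example is out-of-bag would turn this into an algorithm (III)-style variant.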
Relational Algebra for Ranked Tables with Similarities: Properties and Implementation The paper presents new developments in an extension of Codd's relational model of data. The extension consists in equipping domains of attribute values with a similarity relation and adding ranks to rows of a database table. This way, the concept of a table over domains (i.e., a relation over a relation scheme) of the classical Codd model extends to the concept of a ranked table over domains with similarities. When all similarities are ordinary identity relations and all ranks are set to 1, our extension becomes the ordinary Codd model. The main contribution of our paper is twofold. First, we present an outline of a relational algebra for our extension. Second, we deal with implementation issues of our extension. In addition to that, we also comment on related approaches presented in the literature. Relational model of data over domains with similarities Outline Problem setting pt 1 Problem setting pt 2 Problem setting pt 3 Problem setting pt 4 Preliminaries from fuzzy logic Preliminaries: structures of truth degrees Problem setting pt 2 (a) Preliminaries: structures of truth degrees (a) Our extension of Codd's model Functional dependencies Recalling functional dependencies (FDs) Fuzzy functional dependencies: syntax Semantics of FFDs Semantics of FFD: models, entailment Relational algebra and calculus Example I: select power production of countries with large population Implementation of ranked table in ORDBMS pt 1 Implementation of ranked table in ORDBMS pt 2 Implementation of ranked table in ORDBMS pt 3 Implementation of ranked table in ORDBMS pt 4 Future research
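An illustrative sketch (assumed semantics, not the paper's ORDBMS implementation) of how similarity-based selection over a ranked table could work: each row carries a rank in [0, 1], and selecting rows whose attribute value is similar to a queried value combines the domain similarity with the row's rank via a t-norm (minimum here, one common choice in fuzzy logic).

```python
def similarity(a, b, scale):
    # A hypothetical domain similarity on numbers: 1 for equal values,
    # decaying linearly to 0 at distance `scale`.
    return max(0.0, 1.0 - abs(a - b) / scale)

def select_similar(table, attr, value, scale):
    # table: list of (rank, row) pairs; combine each row's rank with the
    # similarity of row[attr] to the queried value using the minimum t-norm.
    ranked = [(min(rank, similarity(row[attr], value, scale)), row)
              for rank, row in table]
    # Keep rows with nonzero rank, best matches first.
    return sorted((p for p in ranked if p[0] > 0.0),
                  key=lambda p: p[0], reverse=True)

countries = [
    (1.0, {"name": "A", "population": 82.0, "power": 600}),
    (1.0, {"name": "B", "population": 60.0, "power": 540}),
    (1.0, {"name": "C", "population": 10.0, "power": 80}),
]
# "countries with population similar to 80 (million)", similarity scale 50
result = select_similar(countries, "population", 80.0, scale=50.0)
```

Setting the similarity to exact equality and all ranks to 1 recovers ordinary relational selection, mirroring how the extension collapses to the classical Codd model.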