
New release of MOA 11.10

We’ve made a new release of MOA 11.10.

The new features of this release are:

  • new active classification method: ActiveClassifier
  • Cluster Mapping Measure (CMM)
  • cleanup of Clustering Setup Panel
  • export fix for FileStream based clusterings
  • screenshot button: filename option
  • wrapper for Weka Clustering algorithms

You can find the download link for this release on the MOA homepage:

https://moa.cs.waikato.ac.nz

Cheers,

The MOA Team 

Pocket Data Mining Project using MOA

Pocket Data Mining (PDM) is a new term, coined by researchers Frederic Stahl, Mohamed Medhat Gaber, Max Bramer, and Philip S. Yu, that describes collaborative mining of streaming data in mobile and distributed computing environments. With large volumes of data streams now available for subscription on smart mobile phones, using this data for decision making with data stream mining techniques has become achievable, owing to the increasing power of these handheld devices. Wireless communication among these devices using Bluetooth and WiFi technologies has opened the door wide for collaborative mining among mobile devices within the same range that are running data mining techniques targeting the same application.

 


Related publications:

Stahl F., Gaber M. M., Bramer M., and Yu P. S., Distributed Hoeffding Trees for Pocket Data Mining, Proceedings of the 2011 International Conference on High Performance Computing & Simulation (HPCS 2011), Special Session on High Performance Parallel and Distributed Data Mining (HPPD-DM 2011), July 4-8, 2011, Istanbul, Turkey, IEEE Press.
https://eprints.port.ac.uk/id/eprint/3523

Stahl F., Gaber M. M., Bramer M., Liu H., and Yu P. S., Distributed Classification for Pocket Data Mining, Proceedings of the 19th International Symposium on Methodologies for Intelligent Systems (ISMIS 2011), Warsaw, Poland, 28-30 June, 2011, Lecture Notes in Artificial Intelligence LNAI, Springer Verlag.
https://eprints.port.ac.uk/3524/

Stahl F., Gaber M. M., Bramer M., and Yu P. S., Pocket Data Mining: Towards Collaborative Data Mining in Mobile Computing Environments, Proceedings of the IEEE 22nd International Conference on Tools with Artificial Intelligence (ICTAI 2010), Arras, France, 27-29 October, 2010.
https://eprints.port.ac.uk/3248/

CFP – Data Streams Track – ACM SAC 2012

DATA STREAMS TRACK
https://www.cs.waikato.ac.nz/~abifet/SAC2012/

========================================================================
ACM SAC 2012
The 27th Annual ACM Symposium on Applied Computing
in Trento University, Italy, March 20-23, 2012.
https://www.acm.org/conferences/sac/sac2012/
========================================================================

CALL FOR PAPERS

The rapid development in information science and technology in general
and in the growing complexity and volume of data in particular has
introduced new challenges for the research community. Many sources
produce data continuously. Examples include sensor networks, wireless
networks, radio frequency identification (RFID), health-care devices
and information systems, customer click streams, telephone records,
multimedia data, scientific data, sets of retail chain transactions,
etc. These sources are called data streams.  A data stream is an
ordered sequence of instances that can be read only once or a small
number of times using limited computing and storage capabilities.
These sources of data are characterized by being open-ended, flowing
at high speed, and generated by non-stationary distributions.
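
As a rough illustration of the single-pass, limited-memory constraint
described above, the following Python sketch keeps a running mean and
variance with Welford's update. The names `RunningStats` and `summarize`
are hypothetical and are not part of the call for papers.

```python
class RunningStats:
    """Single-pass mean and variance in O(1) memory (Welford's update)."""

    def __init__(self):
        self.n = 0        # number of instances seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # sum of squared deviations from the running mean

    def update(self, x):
        # Each instance is read once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance of everything seen so far.
        return self.m2 / self.n if self.n else 0.0


def summarize(stream):
    """Consume an iterable exactly once and return its running statistics."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

The memory used does not depend on the length of the stream, which is
the property the definition above asks for. A summary like this still
assumes a stationary distribution, so it would need a window or decay
to follow a drifting one.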

TOPICS OF INTEREST
We are looking for original, unpublished work related to algorithms,
methods and applications on data streams. Topics include (but are not
restricted to):

– Data Stream Models
– Data Stream Management Systems
– Data Stream Query Languages
– Continuous queries and Summarization from Data Streams
– Sampling Data Streams
– Single-Pass Algorithms
– Scalable Algorithms
– Change Detection Algorithms
– Clustering on Data Streams
– Classification and Regression on Data Streams
– Association Rules on Data Streams
– Feature Selection on Data Streams
– Visualization Techniques for Data Streams
– Evaluation of Data Streams Models
– Data Stream applications
– Sensor Networks
– Real-Time Applications

IMPORTANT DATES (strict)
– Paper Submission: 31 August, 2011
– Author Notification: 12 October, 2011
– Camera-ready Copy: 2 November, 2011

PAPER SUBMISSION GUIDELINES
Papers should be submitted in PDF using the SAC 2012 conference
management system: https://www.softconf.com/c/sac2012/. Authors are
invited to submit original papers in all topics related to data
streams. All papers should be submitted in ACM 2-column camera ready
format for publication in the symposium proceedings. ACM SAC follows a
double blind review process. Consequently, the author(s) name(s) and
address(es) must NOT appear in the body of the submitted paper, and
self-references should be in the third person. This is to facilitate
double blind review required by ACM. All submitted papers must include
the paper identification number provided by the eCMS system when the
paper is first registered. The number must appear on the front page,
above the title of the paper. Each submitted paper will be fully
refereed and undergo a blind review process by at least three
referees. The conference proceedings will be published by ACM. The
maximum number of pages allowed for the final papers is 6 pages. There
is a set of templates to support the required paper format for a
number of document preparation systems at:
https://www.acm.org/sigs/pubs/proceed/template.html

HaCDAIS @ IEEE ICDM 2011

HaCDAIS 2011: The 2nd International Workshop on Handling Concept Drift in Adaptive Information Systems

https://wwwis.win.tue.nl/hacdais2011/

CALL FOR PAPERS 

In the real world, data is often non-stationary. In predictive analytics, machine learning and data mining, the phenomenon of unexpected change in underlying data over time is known as concept drift. Changes in underlying data might occur due to changing personal interests, changes in population, or adversarial activities, or they can be attributed to the complex nature of the environment.

When there is a shift in data, the predictions might become less accurate as time passes, or opportunities to improve the accuracy might be missed. Thus the learning models need to be adaptive to the changes.

The problem of concept drift is of increasing importance to machine learning and data mining as more and more data is organized in the form of data streams rather than static databases, and it is rather unusual that concepts and data distributions stay stable over a long period of time. It is not surprising that the problem of concept drift has been studied in several research communities including but not limited to machine learning and data mining, data streams, information retrieval, and recommender systems. Different approaches for detecting and handling concept drift have been proposed in the literature, and many of them have already proved their potential in a wide range of application domains, e.g. fraud detection, adaptive system control, user modeling, information retrieval, text mining, biomedicine.

TOPICS OF INTEREST

In this workshop, we aim to attract researchers with an interest in handling concept drift and recurring contexts in adaptive information systems. Although we have emphasized the application aspects of handling concept drift we are open to any original work in this area.

A non-exhaustive list of topics includes:

  • Classification and clustering on data streams and evolving data
  • Change and novelty detection in online, semi-online and offline settings
  • Adaptive ensembles
  • Adaptive sampling and instance selection
  • Incremental learning and model adaptivity
  • Delayed labeling in data streams
  • Dynamic feature selection
  • Handling local and complex concept drift
  • Qualitative and quantitative evaluation of concept drift handling performance
  • Reoccurring contexts and context-aware approaches
  • Application-specific and domain driven approaches within the areas of information retrieval, recommender systems, pattern recognition, user modeling, decision support and adaptive (information) systems

We invite submissions in the following categories:

  • New approaches advancing the current state of the art
  • Generic frameworks for handling concept drift and reoccurring contexts
  • Taxonomies and categorizations of the approaches for handling concept drift and reoccurring contexts
  • Case studies and application examples dealing with drifting data

Please note that we encourage prospective contributors to submit full papers (10 pages) as well as short papers (5 pages).

IMPORTANT DATES

July 23, 2011 Submission due (for both full and short papers)

September 20, 2011 Notification of acceptance

October 11, 2011 Final papers due

December 10, 2011 Workshop day

SUBMISSION PROCEDURE

Paper submissions are limited to a maximum of 10 pages in the IEEE 2-column format, which is the same as the camera-ready format (see the IEEE Computer Society Press Proceedings Author Guidelines). All papers will be reviewed by the Program Committee based on technical quality, relevance to data mining, originality, significance, and clarity. A double blind reviewing process will be adopted. Authors should therefore avoid using identifying information in the text of the paper. All papers should be submitted through the ICDM Workshop Submission Site. At the time of submission, the papers must not be under review or accepted for publication elsewhere except the main IEEE ICDM conference.

All accepted workshop papers will be published in separate ICDM workshop proceedings published by the IEEE Computer Society Press. In addition, authors of accepted workshop papers may be invited to publish extended versions in a special issue of a journal.

RELATED EVENTS

HaCDAIS 2011 is the 2nd workshop focusing on handling concept drift and reoccurring contexts in adaptive information systems. The 1st HaCDAIS workshop was held in conjunction with ECML/PKDD 2010 in Barcelona, Catalonia, Spain. Several other events address the problem of changing data and are thus related to HaCDAIS: International Workshop on Knowledge Discovery from Sensor Data (SensorKDD), Novel Data Stream Pattern Mining Techniques (StreamKDD), Data Streams Track at ACM Symposium on Applied Computing (SAC10), Symposium on Computational Intelligence in Dynamic and Uncertain Environments (CIDUE 2011), Concept Drift and Learning in Nonstationary Environments at IEEE World Congress on Computational Intelligence.

WORKSHOP CHAIRS

Latifur Khan University of Texas, Dallas, USA

Mykola Pechenizkiy Eindhoven University of Technology, the Netherlands

Indrė Žliobaitė Bournemouth University, UK

 

 

New release of MOA 11.5

We’ve made a new release of MOA 11.5.

The new features of this release are:

 

  • new classification methods for text and sparse data: NaiveBayesMultinomial, SGD (Stochastic Gradient Descent), and SPegasos
  • new classification methods: LimAttClassifier, LimAttHoeffdingTree, LimAttHoeffdingTreeNB, LimAttHoeffdingTreeNBAdaptive, Perceptron
  • new chunk classification and evaluation methods: EvaluateInterleavedChunks, AccuracyUpdatedEnsemble, AccuracyWeightedEnsemble
  • new regression evaluation methods; it is now possible to run regression experiments in MOA
  • a reader for ARFF files for clustering
  • multi-label stream generators
  • new simplified memory management

 

You can find the download link for this release on the MOA homepage:

https://moa.cs.waikato.ac.nz

Please note that the documentation has also been updated. 

Cheers,

The MOA Team 

 

ADMIRE project

ADMIRE project Website

MOA is used in the Advanced Data Mining and Integration Research for Europe (ADMIRE) project. The aim of the project is to create an advanced, distributed data analysis platform; one of its major goals is to provide data stream processing capabilities. MOA has been very helpful during the development of one of the use cases (churn prediction in analytical CRM), where the Hoeffding tree algorithm is used to classify customers stored in large datasets. The algorithm implementation has been wrapped as a Process Element (a step in a workflow) named “BuildIterationalClassifier”.

Book “Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams”

 

This book addresses the design of learning algorithms for mining time-changing data streams. It introduces new contributions on several different aspects of the problem, identifying research opportunities and increasing the scope for applications. It also includes an in-depth study of stream mining and a theoretical analysis of proposed methods and algorithms.

The first section is concerned with the use of an adaptive sliding window algorithm (ADWIN). Since ADWIN has rigorous performance guarantees, using it in place of counters or accumulators offers the possibility of extending such guarantees to learning and mining algorithms not initially designed for drifting data. Testing with several methods, including Naïve Bayes, clustering, decision trees and ensemble methods, is discussed as well.
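
To give a feel for what an adaptive sliding window does, here is a minimal Python sketch in the spirit of ADWIN. It is not the book's implementation: it checks every split point directly instead of using ADWIN's compressed bucket structure, it drops the whole older part of the window at the first split that fails the test, and the class name `SimpleAdwin`, the default `delta` and the `max_window` cap are assumptions made for illustration. The cut threshold uses the Hoeffding-style bound commonly quoted for ADWIN.

```python
import math
from collections import deque


class SimpleAdwin:
    """Simplified ADWIN-style window: tests every split, no bucket compression."""

    def __init__(self, delta=0.002, max_window=1000):
        self.delta = delta                       # confidence parameter (assumed default)
        self.window = deque(maxlen=max_window)   # hard cap, for this sketch only

    def _cut_threshold(self, n0, n1):
        n = n0 + n1
        m = 1.0 / (1.0 / n0 + 1.0 / n1)          # harmonic-mean style term
        delta_prime = self.delta / n
        return math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))

    def update(self, x):
        """Add a value in [0, 1]; return True if older data was dropped."""
        self.window.append(x)
        shrunk = False
        changed = True
        while changed and len(self.window) > 1:
            changed = False
            values = list(self.window)
            total = sum(values)
            head_sum = 0.0
            for i in range(1, len(values)):
                head_sum += values[i - 1]
                n0, n1 = i, len(values) - i
                mean0 = head_sum / n0            # mean of the older part
                mean1 = (total - head_sum) / n1  # mean of the newer part
                if abs(mean0 - mean1) >= self._cut_threshold(n0, n1):
                    # The two parts differ too much: forget the older part.
                    for _ in range(i):
                        self.window.popleft()
                    shrunk = changed = True
                    break
        return shrunk

    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0
```

Used in place of a plain counter, for example fed with the 0/1 errors of a classifier, `mean()` tracks the recent error rate, and a `True` return from `update` signals that the window has just been cut because the data appears to have changed.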

The second part of the book describes a formal study of connected acyclic graphs, or ‘trees’, from the point of view of closure-based mining, presenting efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees.

Lastly, a general methodology to identify closed patterns in a data stream is outlined. This is applied to develop an incremental method, a sliding-window based method, and a method that mines closed trees adaptively from data streams. These are used to introduce classification methods for tree data streams.

pHMM4weka: Profile Hidden Markov Models (PHMMs) for binary protein classification for WEKA

This Java software implements Profile Hidden Markov Models (PHMMs) for binary protein classification for the WEKA workbench. Standard PHMMs and newly introduced binary PHMMs are used. In addition the software allows propositionalisation of PHMMs.

This software was developed by Stefan Mutter during his PhD at the Machine Learning Group at the University of Waikato. His thesis investigated similarity amongst proteins. In this area of research there are two important and closely related classification tasks: the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied in the detection task as they model sequence similarity very well. From a machine learning point of view these HMMs are essentially one-class classifiers trained solely on a small number of similar proteins, neglecting the vast number of dissimilar ones. His basic assumption is that integrating this neglected information will be highly beneficial to the classification task. Thus, he transformed the problem representation from a one-class to a binary one. He also suggested a new way to significantly improve discriminative power and runtime by terminating the time-intensive training of HMMs early, subsequently applying propositionalisation and classifying with a discriminative, binary learner. More information.

Website

 

The ClusTree: indexing micro-clusters for anytime stream mining

Knowledge and Information Systems, 2010

by Philipp Kranen, Ira Assent, Corinna Baldauf, and Thomas Seidl.

Abstract: Clustering streaming data requires algorithms that are capable of updating clustering results for the incoming data. As data is constantly arriving, time for processing is limited. Clustering has to be performed in a single pass over the incoming data and within the possibly varying inter-arrival times of the stream. Likewise, memory is limited, making it impossible to store all data. For clustering, we are faced with the challenge of maintaining a current result that can be presented to the user at any given time. In this work, we propose a parameter-free algorithm that automatically adapts to the speed of the data stream. It makes best use of the time available under the current constraints to provide a clustering of the objects seen up to that point. Our approach incorporates the age of the objects to reflect the greater importance of more recent data. For efficient and effective handling, we introduce the ClusTree, a compact and self-adaptive index structure for maintaining stream summaries. Additionally we present solutions to handle very fast streams through aggregation mechanisms and propose novel descent strategies that improve the clustering result on slower streams as long as time permits. Our experiments show that our approach is capable of handling a multitude of different stream characteristics for accurate and scalable anytime stream clustering.    
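
The stream summaries mentioned in the abstract can be pictured as micro-clusters that store only a few aggregates per cluster. The Python sketch below is an illustration under assumptions, not the ClusTree code: the class name `MicroCluster`, the `decay_rate` parameter and the base-2 exponential decay are choices made here to show how the age of objects can down-weight older data. The index structure and the anytime descent strategies are not modelled.

```python
class MicroCluster:
    """Decayed cluster feature: weight, linear sum and squared sum per dimension."""

    def __init__(self, dim, decay_rate=0.01):
        self.n = 0.0                  # decayed number of points
        self.ls = [0.0] * dim         # decayed linear sum
        self.ss = [0.0] * dim         # decayed squared sum
        self.decay_rate = decay_rate  # assumed decay parameter
        self.last_update = 0.0        # timestamp of the last update

    def _decay(self, now):
        # Older contributions lose weight exponentially with their age.
        factor = 2.0 ** (-self.decay_rate * (now - self.last_update))
        self.n *= factor
        self.ls = [v * factor for v in self.ls]
        self.ss = [v * factor for v in self.ss]
        self.last_update = now

    def insert(self, point, now):
        # Absorb one point; the point itself is not stored.
        self._decay(now)
        self.n += 1.0
        for i, v in enumerate(point):
            self.ls[i] += v
            self.ss[i] += v * v

    def center(self):
        return [v / self.n for v in self.ls] if self.n > 0 else list(self.ls)
```

Because only the aggregates are kept, memory per cluster is constant, and the center drifts toward recent points as older ones decay.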

KeplerWeka: a module for Kepler providing the functionality of WEKA

KeplerWeka is a module for the open-source scientific workflow system Kepler, providing the full functionality of the WEKA Machine Learning workbench. It is developed by Peter Reutemann at the Machine Learning Group of the University of Waikato.

The latest release of KeplerWeka is integrated into the new Kepler 2.x build framework.

Kepler is designed to help scientists, analysts, and computer programmers create, execute, and share models and analyses across a broad range of scientific and engineering disciplines. Kepler can operate on data stored in a variety of formats, locally and over the internet, and is an effective environment for integrating disparate software components, such as merging “R” scripts with compiled “C” code, or facilitating remote, distributed execution of models. Using Kepler’s graphical user interface, users simply select and then connect pertinent analytical components and data sources to create a “scientific workflow”, an executable representation of the steps required to generate results. The Kepler software helps users share and reuse data, workflows, and components developed by the scientific community to address common needs.

Project Blog

Website

Third Edition “Data Mining: Practical Machine Learning Tools and Techniques”

By Ian H. Witten, Eibe Frank and Mark A. Hall

https://www.elsevierdirect.com/product.jsp?isbn=9780123748560

Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.

Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the Machine Learning Group at the University of Waikato. They include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. The book includes a section in Chapter 9 on data stream mining. 

 

S4: Distributed Stream Computing Platform from Yahoo!

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. S4 was initially developed to personalize search advertising products at Yahoo!, which operate at a rate of thousands of events per second. MapReduce excels at batch jobs, but is hard to apply to stream computation tasks.

Website

Data Streams Track on ACM Symposium on Applied Computing 2011

The goal of the Data Streams Track is to promote a meeting point and a discussion forum for researchers interested in any aspect of Data Stream processing. For the past twenty years, the ACM Symposium on Applied Computing has been a primary gathering forum for applied computer scientists, computer engineers, software engineers, and application developers from around the world.  

The 26th Annual ACM Symposium on Applied Computing will be held at Tunghai University, TaiChung, Taiwan, March 21 – 25, 2011.

Website

MEKA Software: A Multi-label Extension to the WEKA Framework

This software provides an open source implementation of the ‘pruned sets’ and ‘classifier chains’ methods for multi-label classification. These methods were developed during the PhD thesis of Jesse Read at the Machine Learning Group at the University of Waikato. See these publications:

Jesse Read. Scalable Multi-label Classification. PhD Thesis, University of Waikato, Hamilton, New Zealand. (2010)

Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank. Classifier Chains for Multi-label Classification. In Proc. of 20th European Conference on Machine Learning (ECML 2009). Bled, Slovenia, September 2009.

Jesse Read, Bernhard Pfahringer, Geoff Holmes. Multi-label Classification using Ensembles of Pruned Sets. Proc. of IEEE International Conference on Data Mining (ICDM 2008). Pisa, Italy, December 2008.

Website

Book “Knowledge Discovery from Data Streams” by João Gama

This book covers the fundamentals of data stream mining and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they also are valid for different areas of machine learning and data mining.

https://www.crcpress.com/product/isbn/9781439826119

PAKDD 2011 Tutorial: Handling Concept Drift: Importance, Challenges and Solutions

Tutorial at PAKDD discussing concept drift, and MOA as open source software for dealing with concept drift.

Abstract: In the real world, data often arrives in streams and evolves over time. Concept drift in supervised learning means that the underlying distribution of the data is changing. As a result, the predictions might become less accurate as time passes, or opportunities to improve the accuracy might be missed. Therefore, the learning models need to adapt to changes quickly and accurately. The proposed tutorial aims to provide a unifying view on the basic and applied concept drift research in data mining and related areas. In the first part we will introduce the problem of concept drift, and discuss why changes appear in supervised learning and the motivation for handling them. We will give an overview of the types of application tasks available. In the second part we will present available approaches and techniques to handle concept drift, and discuss evaluation issues and open source software. In the third part we will reflect on the past, present and future of concept drift research and outline future research directions. We will focus on the link between research scenarios and application needs.

Presenters:

  • Albert Bifet, University of Waikato, New Zealand
  • João Gama, University of Porto, Portugal
  • Mykola Pechenizkiy, Eindhoven University of Technology, the Netherlands
  • Indrė Žliobaitė, Eindhoven University of Technology, the Netherlands

Tutorial website

The moa, NZ national symbol

The fame of the moa and the fact that its size made it a world-beater gave it the status of national symbol briefly in the 19th century. In the 1890s, New Zealand was ‘the land of the moa’, and of 103 entries for a new national coat of arms in 1906–8, 28 included moa. Moa also featured on commercial logos, and in cartoons to represent New Zealand. Its iconic status did not last, however, and it was soon replaced by the kiwi.

The moa and the lion.  The fame of the moa’s size briefly turned it into a national symbol. This postcard was issued in 1905 to represent the extraordinary success of the New Zealand All Black rugby team during its tour of England that year.

More information.

 

Cooperative Cars

MOA is used as a data stream mining framework in the Cooperative Cars (CoCar) Project, a joint project between Ericsson in Aachen and Fraunhofer FIT. The CoCar project aims at basic research on C2C and C2I communication for future cooperative vehicle applications using cellular mobile communication technologies. Five partners from the telecommunications and automotive industries develop platform-independent communication protocols and innovative system components. These will be prototyped, implemented and validated in selected applications. Innovation perspectives and potential future network enhancements of cellular systems for supporting cooperative, intelligent vehicles will be identified and demonstrated.

https://dbis.rwth-aachen.de/cms/projects/CoCar