moa

pHMM4weka: Profile Hidden Markov Models (PHMMs) for binary protein classification for WEKA

This Java software implements Profile Hidden Markov Models (PHMMs) for binary protein classification for the WEKA workbench. It uses both standard PHMMs and newly introduced binary PHMMs. In addition, the software allows propositionalisation of PHMMs.

This software was developed by Stefan Mutter during his PhD at the Machine Learning Group at the University of Waikato. His thesis investigated similarity amongst proteins. In this area of research there are two important and closely related classification tasks – the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied in the detection task as they model sequence similarity very well. From a machine learning point of view, these HMMs are essentially one-class classifiers trained solely on a small number of similar proteins, neglecting the vast number of dissimilar ones. His basic assumption is that integrating this neglected information will be highly beneficial to the classification task. Thus, he transformed the problem representation from a one-class to a binary one. He also suggested a new way to significantly improve discriminative power and runtime: terminating the time-intensive training of HMMs early, subsequently applying propositionalisation, and classifying with a discriminative, binary learner. More information.

Website

Posted by moa

The ClusTree: indexing micro-clusters for anytime stream mining

Knowledge and Information Systems, 2010

by Philipp Kranen, Ira Assent, Corinna Baldauf, and Thomas Seidl.

Abstract: Clustering streaming data requires algorithms that are capable of updating clustering results for the incoming data. As data is constantly arriving, time for processing is limited. Clustering has to be performed in a single pass over the incoming data and within the possibly varying inter-arrival times of the stream. Likewise, memory is limited, making it impossible to store all data. For clustering, we are faced with the challenge of maintaining a current result that can be presented to the user at any given time. In this work, we propose a parameter-free algorithm that automatically adapts to the speed of the data stream. It makes best use of the time available under the current constraints to provide a clustering of the objects seen up to that point. Our approach incorporates the age of the objects to reflect the greater importance of more recent data. For efficient and effective handling, we introduce the ClusTree, a compact and self-adaptive index structure for maintaining stream summaries. Additionally we present solutions to handle very fast streams through aggregation mechanisms and propose novel descent strategies that improve the clustering result on slower streams as long as time permits. Our experiments show that our approach is capable of handling a multitude of different stream characteristics for accurate and scalable anytime stream clustering.    
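The abstract mentions stream summaries that weight recent objects more heavily. A minimal sketch of that idea follows, assuming a cluster feature (weight, linear sum, squared sum) aged by an exponential decay with base 2. The class name `DecayedMicroCluster`, the `decay_rate` parameter, and the choice of base are illustrative assumptions, not the authors' implementation, and the index structure itself is not shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DecayedMicroCluster:
    """Hypothetical stream summary: a cluster feature with exponential aging."""
    dim: int
    decay_rate: float = 0.1  # assumed decay rate; higher values forget faster
    weight: float = 0.0
    linear_sum: List[float] = field(default_factory=list)
    squared_sum: List[float] = field(default_factory=list)
    last_update: float = 0.0

    def __post_init__(self):
        # Start with zero vectors of the requested dimensionality.
        if not self.linear_sum:
            self.linear_sum = [0.0] * self.dim
        if not self.squared_sum:
            self.squared_sum = [0.0] * self.dim

    def _age(self, now: float) -> None:
        # Scale all components by 2^(-decay_rate * elapsed time).
        # Timestamps are assumed to be non-decreasing.
        factor = 2.0 ** (-self.decay_rate * (now - self.last_update))
        self.weight *= factor
        self.linear_sum = [v * factor for v in self.linear_sum]
        self.squared_sum = [v * factor for v in self.squared_sum]
        self.last_update = now

    def insert(self, point: List[float], now: float) -> None:
        # Age the existing summary first, then add the new object with weight 1.
        self._age(now)
        self.weight += 1.0
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def centre(self) -> List[float]:
        # Weighted mean of the summarised objects.
        return [v / self.weight for v in self.linear_sum]


mc = DecayedMicroCluster(dim=1, decay_rate=1.0)
mc.insert([0.0], now=0.0)
mc.insert([3.0], now=1.0)
print(mc.weight, mc.centre())
```

With a decay rate of 1.0, the first object counts only half as much one time unit later, so the centre moves towards the newer object.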

Posted by moa

KeplerWeka: a module for Kepler providing the functionality of WEKA

KeplerWeka is a module for the open-source scientific workflow system Kepler that provides the full functionality of the WEKA Machine Learning workbench. It is developed by Peter Reutemann at the Machine Learning Group of the University of Waikato.

The latest release of KeplerWeka is integrated into the new Kepler 2.x build framework.

Kepler is designed to help scientists, analysts, and computer programmers create, execute, and share models and analyses across a broad range of scientific and engineering disciplines. Kepler can operate on data stored in a variety of formats, locally and over the internet, and is an effective environment for integrating disparate software components, such as merging “R” scripts with compiled “C” code, or facilitating remote, distributed execution of models. Using Kepler’s graphical user interface, users simply select and then connect pertinent analytical components and data sources to create a “scientific workflow”: an executable representation of the steps required to generate results. The Kepler software helps users share and reuse data, workflows, and components developed by the scientific community to address common needs.

Project Blog

Website

Posted by moa

Third Edition “Data Mining: Practical Machine Learning Tools and Techniques”

By Ian H. Witten, Eibe Frank and Mark A. Hall

https://www.elsevierdirect.com/product.jsp?isbn=9780123748560

Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.

Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the Machine Learning Group at the University of Waikato. They include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. The book includes a section in Chapter 9 on data stream mining. 

Posted by moa

S4: Distributed Stream Computing Platform from Yahoo!

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. S4 was initially developed to personalize search advertising products at Yahoo!, which operate at a rate of thousands of events per second. MapReduce excels at batch jobs, but is hard to apply to stream computation tasks.

Website

Posted by moa

Data Streams Track at the ACM Symposium on Applied Computing 2011

The goal of the Data Streams Track is to promote a meeting point and a discussion forum for researchers interested in any aspect of Data Stream processing. For the past twenty years, the ACM Symposium on Applied Computing has been a primary gathering forum for applied computer scientists, computer engineers, software engineers, and application developers from around the world.  

The 26th Annual ACM Symposium on Applied Computing will be held at Tunghai University, TaiChung, Taiwan, March 21 – 25, 2011.

Website

Posted by moa

MEKA Software: A Multi-label Extension to the WEKA Framework

This software provides an open source implementation of the ‘pruned sets’ and ‘classifier chains’ methods for multi-label classification. These methods were developed during the PhD thesis of Jesse Read at the Machine Learning Group at the University of Waikato. See these publications:

Jesse Read. Scalable Multi-label Classification. PhD Thesis, University of Waikato, Hamilton, New Zealand. (2010)

Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank. Classifier Chains for Multi-label Classification. In Proc. of 20th European Conference on Machine Learning (ECML 2009). Bled, Slovenia, September 2009.

Jesse Read, Bernhard Pfahringer, Geoff Holmes. Multi-label Classification using Ensembles of Pruned Sets. Proc. of IEEE International Conference on Data Mining (ICDM 2008). Pisa, Italy, December 2008.
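For readers unfamiliar with classifier chains, here is a minimal sketch of the general idea: one binary classifier per label, where each classifier also sees the labels that come earlier in the chain (true labels during training, predicted labels at prediction time). This is not MEKA's API. The class names and the tiny nearest-centroid base learner are hypothetical stand-ins chosen to keep the example self-contained.

```python
class CentroidClassifier:
    """Hypothetical stand-in base learner: predicts the class of the nearest centroid."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            # Column-wise mean of all rows belonging to this class.
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


class ClassifierChain:
    """One binary model per label; link j sees the attributes plus labels 0..j-1."""

    def __init__(self, base_factory=CentroidClassifier):
        self.base_factory = base_factory
        self.models = []

    def fit(self, X, Y):
        self.models = []
        for j in range(len(Y[0])):
            # Training uses the true values of the earlier labels as extra features.
            Xj = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
            yj = [y[j] for y in Y]
            self.models.append(self.base_factory().fit(Xj, yj))
        return self

    def predict_one(self, x):
        preds = []
        for model in self.models:
            # Prediction feeds the earlier predictions down the chain.
            preds.append(model.predict_one(list(x) + preds))
        return preds


X = [[0.0], [0.1], [1.0], [1.1]]
Y = [[0, 0], [0, 0], [1, 1], [1, 1]]
chain = ClassifierChain().fit(X, Y)
print(chain.predict_one([0.05]), chain.predict_one([1.0]))
```

Any binary learner with `fit` and `predict_one` methods could be swapped in through `base_factory`. The chain order here is simply the column order of `Y`.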

Website

Posted by moa

Book “Knowledge Discovery from Data Streams” by João Gama

This book covers the fundamentals of data stream mining and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they also are valid for different areas of machine learning and data mining.

https://www.crcpress.com/product/isbn/9781439826119

Posted by moa

PAKDD 2011 Tutorial: Handling Concept Drift: Importance, Challenges and Solutions

A tutorial at PAKDD 2011 on concept drift, presenting MOA as open source software for dealing with it.

Abstract: In the real world data often arrives in streams and is evolving over time. Concept drift in supervised learning means that the underlying distribution of the data is changing. As a result the predictions might become less accurate as the time passes, or opportunities to improve the accuracy might be missed. Therefore, the learning models need to adapt to changes quickly and accurately. The proposed tutorial aims to provide a unifying view on the basic and applied concept drift research in data mining and related areas. In the first part we will introduce the problem of concept drift, discuss why changes appear in supervised learning and motivation to handle them. We will overview what types of application tasks are available. In the second part we will present available approaches and techniques to handle concept drift, discuss evaluation issues and open source software. In the third part we will reflect on the past, present and future of concept drift research and outline future research directions. We will focus on the link between research scenarios and application needs.

Presenters:

  • Albert Bifet, University of Waikato, New Zealand
  • João Gama, University of Porto, Portugal
  • Mykola Pechenizkiy, Eindhoven University of Technology, the Netherlands
  • Indrė Žliobaitė, Eindhoven University of Technology, the Netherlands

Tutorial website

Posted by moa

The moa, NZ national symbol

The fame of the moa, and the fact that its size made it a world-beater, briefly gave it the status of national symbol in the 19th century. In the 1890s, New Zealand was ‘the land of the moa’, and of 103 entries for a new national coat of arms in 1906–8, 28 included moa. Moa also featured on commercial logos, and in cartoons to represent New Zealand. This iconic status did not last, however, and the moa was soon replaced by the kiwi.

The moa and the lion.  The fame of the moa’s size briefly turned it into a national symbol. This postcard was issued in 1905 to represent the extraordinary success of the New Zealand All Black rugby team during its tour of England that year.

More information.

Posted by moa in MOA Developers, MOA Users

Cooperative Cars

MOA is used as a data stream mining framework in the Cooperative Cars (CoCar) Project, a joint project between Ericsson in Aachen and Fraunhofer FIT. The CoCar project aims at basic research on car-to-car (C2C) and car-to-infrastructure (C2I) communication for future cooperative vehicle applications using cellular mobile communication technologies. Five partners from the telecommunications and automotive industries develop platform-independent communication protocols and innovative system components. These will be prototyped, implemented and validated in selected applications. Innovation perspectives and potential future network enhancements of cellular systems for supporting cooperative, intelligent vehicles will be identified and demonstrated.

https://dbis.rwth-aachen.de/cms/projects/CoCar

Posted by moa in MOA Developers, MOA Users