
Big Data Mining (BigMine-12)

Call for Papers

Big Data Mining (BigMine-12)
1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine-12) – a KDD2012 Workshop

KDD2012 Conference Dates: August 12-16, 2012
BigMine-12 Workshop Date: Aug 12, 2012
Beijing, China

https://www.bigdata-mining.org

Key dates:
Papers due: May 9, 2012
Acceptance notification: May 23, 2012
Workshop Final Paper Due: June 8, 2012
Workshop Proceedings Due: June 15, 2012

Paper submission and reviewing will be handled electronically. Authors should consult the submission site (https://bigdata-mining.org/submission/) for full details regarding paper preparation and submission guidelines.

Papers submitted to BigMine-12 should be original work and substantively different from papers that have been previously published or are under review in a journal or another conference/workshop.

Following KDD main conference tradition, reviews are not double-blind, and author names and affiliations should be listed.

We invite submission of papers describing innovative research on all aspects of big data mining.

Examples of topics of interest include:

  • Scalable, Distributed and Parallel Algorithms
  • New Programming Models for Large Data beyond Hadoop/MapReduce, STORM, streaming languages
  • Mining Algorithms of Data in non-traditional formats (unstructured, semi-structured)
  • Applications: social media, Internet of Things, Smart Grid, Smart Transportation Systems
  • Streaming Data Processing
  • Heterogeneous Sources and Format Mining
  • Systems Issues related to large datasets: clouds, streaming systems, architecture, and issues beyond cloud and streams.
  • Interfaces to database systems and analytics.
  • Evaluation Technologies
  • Visualization for Big Data
  • Applications: Large scale recommendation systems, social media systems, social network systems, scientific data mining, environmental, urban and other large data mining applications.

Papers emphasizing theoretical foundations, algorithms, systems, applications, language issues, data storage and access, architecture are particularly encouraged.

We welcome submissions by authors who are new to the data mining research community.

Submitted papers will be assessed based on their novelty, technical quality, potential impact, and clarity of writing. For papers that rely heavily on empirical evaluations, the experimental methods and results should be clear, well executed, and repeatable. Authors are strongly encouraged to make data and code publicly available whenever possible.

Top-quality papers accepted and presented at the workshop will, after careful revision by the authors and review by the original PC members and chairs, be recommended to ACM TIST, ACM TKDD, IEEE Intelligent Systems or IEEE Computer for fast publication, depending on the relevance of the topic.

 

New release of MOA 12.03

We’ve made a new release of MOA 12.03.

The new features of this release are:

  • new measure graphic visualization for classification
  • Classifiers are now in subpackages: classifiers.tree, classifiers.bayes, classifiers.functions, classifiers.meta, classifiers.drift,…
  • HoeffdingTree, HoeffdingTreeNB, and HoeffdingTreeNBAdaptive are now only one classifier: HoeffdingTree with an option to select how to do the classification at leaves. By default, the option will be NBAdaptive.
  • HoeffdingOptionTree, HoeffdingOptionTreeNB, and HoeffdingOptionTreeNBAdaptive are now only one classifier: HoeffdingOptionTree with an option to select how to do the classification at leaves. By default, the option will be NBAdaptive.
  • all different methods of Leveraging Bagging are now unified in only one classifier: LeveragingBag.
  • drift detection methods are in a specific subpackage: moa.classifiers.core.driftdetection
  • in the classification tab, now it is possible to store the tasks in a file, and copy and paste the task commands using the right button of the mouse
  • in the classification tab, the classifiers show a short description

You can find the download link for this release on the MOA homepage:

https://moa.cs.waikato.ac.nz

Cheers,

The MOA Team

 

PRICAI 2012 Special Session on Scalable Big Data Mining

https://cs.waikato.ac.nz/~abifet/pricai2012/
September 3 – 7, 2012
Kuching, Sarawak, Malaysia
============================================== 

CALL FOR PAPERS

Data have become a torrent flowing in many important areas. Big data
refers to datasets whose size is beyond the ability of current
state-of-the-art analytic tools. Streaming data is a specific approach
to deal with big data that is evolving and changing. Parallelization is
another scalable approach to mine large quantities of data.

This special session aims to provide an opportunity for researchers to
share their ideas and efforts in building new algorithms, methods and
software for dealing with big data, static or evolving streams.

Topics of interest include but are not limited to:

* Evolving Stream Mining
* Large-scale Machine Learning
* Distributed System Platforms: Hadoop, S4, Storm
* Evaluation Methodologies
* Visualization Techniques for Big Data
* Social Media Mining

IMPORTANT DATES

* Paper Submission Deadline: March 31, 2012
* Paper Acceptance Notification: May 15, 2012
* Camera-Ready Paper + Copyright Submission: May 21, 2012
* Conference: Sept 5 – 7, 2012

PAPER SUBMISSION GUIDELINES

The papers should report original and unpublished research on topics of
interest for this special session. Accepted papers are expected to be presented
at the conference, and will be published in the conference proceedings.
Papers for this special session should be prepared in accordance with the
PRICAI 2012 requirements (https://ktw.mimos.my/pricai2012/submission.html).

Submitted papers should not exceed 12 pages. Submitted papers should not be
under review or submitted for publication elsewhere during the review period.
Papers are to be submitted via EasyChair at:

https://www.easychair.org/conferences/?conf=pricai2012

ORGANIZERS

* Wei Fan (IBM T.J.Watson Research, USA)
* Xiatian Zhang (Business Intelligence Center Tencent, China)
* Albert Bifet (University of Waikato, New Zealand)
* Geoff Holmes (University of Waikato, New Zealand)

Upcoming Conference: “Machine-Learning with Real-time & Streaming Applications”

FIRST CONFERENCE ANNOUNCEMENT:

From Data to Knowledge: Machine-Learning with Real-time & Streaming Applications
May 7-11 2012
On the Campus of the University of California, Berkeley

https://lyra.berkeley.edu/CDIConf/

 * * CONFIRMED INVITED SPEAKERS * *

Olfa Nasraoui (Louisville), Petros Drineas (RPI), Muthu Muthukrishnan (Rutgers),
Alex Szalay (Johns Hopkins), David Bader (Georgia Tech),
Eamonn Keogh (UC Riverside), Joao Gama (Univ. of Porto, Portugal),
Michael Franklin (UC Berkeley), Ziv Bar-Joseph (Carnegie Mellon University)

 * * AIMS OF THE CONFERENCE * *

We are experiencing a revolution in the capacity to quickly collect
and transport large amounts of data. Not only has this revolution
changed the means by which we store and access this data, but it has also
caused a fundamental transformation in the methods and algorithms that
we use to extract knowledge from data. In scientific fields as diverse
as climatology, medical science, astrophysics, particle physics,
computer vision, and computational finance, massive streaming data
sets have sparked innovation in methodologies for knowledge discovery
in data streams. Cutting-edge methodology for streaming data has come
from a number of diverse directions, from on-line learning, randomized
linear algebra and approximate methods, to distributed optimization
methodology for cloud computing, to multi-class classification
problems in the presence of noisy and spurious data.

This conference will bring together researchers from applied
mathematics and several diverse scientific fields to discuss the
current state of the art and open research questions in streaming data
and real-time machine learning. The conference will be domain driven,
with talks focusing on well-defined areas of application and
describing the techniques and algorithms necessary to address the
current and future challenges in the field.

Sessions will be accessible to a broad audience and will have a single
track format with additional rooms for breakout sessions and posters.
There will be no formal conference proceedings, but conference
applicants are encouraged to submit an abstract and present a talk
and/or poster.

 * * IMPORTANT DATES * *

Feb 29     : Initial registration ends, participants announced.
May 7 – 11 : Conference.

 * * SESSIONS * *

Stochastic Data Streams
   Muthu Muthukrishnan: (Dept. of Computer Science, Rutgers University)

Real-Time Machine Learning in Astrophysics
   Alex Szalay:      (Dept. of Physics and Astronomy, Johns Hopkins University)

Real-Time Analytics with Streaming Databases
   Michael Franklin: (Computer Science Dept., UC Berkeley)

Classification of Sensor Network Data Streams
   Joao Gama:    (Lab. of A.I. & Decision Support, Economics at Univ. of Porto)

Randomized and Approximation Algorithms
   Petros Drineas:   (Computer Science Dept., Rensselaer Polytechnic Institute)

Time-Series Clustering and Classification
   Eamonn Keogh:     (Computer Science and Engineering Dept., UC Riverside)

Time Series in the Biological and Medical Sciences
   Ziv Bar-Joseph:   (Computer Science Dept., Carnegie Mellon University)

Streaming Graph/Network Data & Architectures
   David Bader:      (College of Computing, Georgia Tech)

Data Mining of Data Streams
   Olfa Nasraoui:    (Dept. of CS & Computer Engineering, Univ. of Louisville)

 * * Local Organizing Committee * *

Joshua Bloom: (Dept. of Astronomy, UC Berkeley)
Damian Eads:  (Dept. of CS, UC Santa Cruz; Dept. of Eng, Univ. of Cambridge)
Berian James: (Dept. of Astr, UC Berkeley; Dark Cosmology Centre, U Copenhagen)
Peter Nugent: (Comp. Cosmology, Lawrence Berkeley National Lab.)
John Rice:    (Dept. of Statistics, UC Berkeley)
Joseph Richards: (Dept. of Astronomy & Dept. of Statistics, UC Berkeley)
Dan Starr:    (Dept. of Astronomy, UC Berkeley)

 * * Scientific Organizing Committee * *

Leon Bottou:     (NEC Labs)
Emmanuel Candes: (Stanford)
Brad Efron:      (Stanford)
Alex Gray:       (Georgia Tech)
Michael Jordan:  (Berkeley)
John Langford:   (Yahoo)
Fernando Perez:  (Berkeley)
Ricardo Vilalta: (Houston)
Larry Wasserman: (CMU)

IBLStreams (Instance Based Learner on Streams for Regression and Classification)

IBLStreams (Instance Based Learner on Streams) is an instance-based learning algorithm for classification and regression problems on data streams by Ammar Shaker, Eyke Hüllermeier and Jürgen Beringer. The method is able to handle large streams with low requirements in terms of memory and computational power. Moreover, it provides mechanisms for adapting to concept drift and concept shift.

In instance-based learning, a prediction for the query instance is obtained by combining, in one way or the other, the outputs of the neighbors of this instance in the training data. The type of aggregation depends on the type of problem to be solved. We offer four different prediction schemes, namely the WeightedMode for classification, the WeightedMedian for ordinal classification, and the WeightedMean and LocalLinearRegression for regression problems.

Regression

In regression, the target attribute is numerical, and loss is typically measured in terms of the absolute or squared difference between predicted and true output. Corresponding prediction problems can be solved in two ways. First, the target value can be estimated by the weighted mean of the target values of the k neighbor instances; this prediction is obtained by using the option “-s WMeanReg”, which sets the PredictionStrategy parameter to WeightedMean(Regression). Second, a prediction can be derived by means of locally weighted linear regression. In this case, a (local) linear regression model is fitted to the k nearest neighbors, and this model is used to make a prediction for the query instance. For this approach, the PredictionStrategy parameter must be set to LocalLinearRegression by using “-s LocLinReg”.
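The two regression strategies described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the IBLStreams implementation: the inverse-distance weighting, the function names, and the restriction of the local linear fit to a single feature are all assumptions made to keep the example short.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, instances, k):
    """Return the k (distance, features, target) triples closest to query."""
    scored = [(euclidean(query, x), x, y) for x, y in instances]
    scored.sort(key=lambda item: item[0])
    return scored[:k]


def weighted_mean_prediction(query, instances, k=3, eps=1e-9):
    """Weighted mean of the neighbors' targets (inverse-distance weights assumed)."""
    neighbors = nearest_neighbors(query, instances, k)
    weights = [1.0 / (d + eps) for d, _, _ in neighbors]
    total = sum(weights)
    return sum(w * y for w, (_, _, y) in zip(weights, neighbors)) / total


def local_linear_prediction(query, instances, k=3, eps=1e-9):
    """Fit a weighted least-squares line to the k neighbors (one feature only)."""
    neighbors = nearest_neighbors(query, instances, k)
    weights = [1.0 / (d + eps) for d, _, _ in neighbors]
    xs = [x[0] for _, x, _ in neighbors]
    ys = [y for _, _, y in neighbors]
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    if sxx == 0.0:
        return y_bar  # all neighbors share one x value: fall back to the mean
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return y_bar + slope * (query[0] - x_bar)


# Toy data on the line y = 2x + 1 (made up for this example).
data = [((float(x),), 2.0 * x + 1.0) for x in range(5)]
print(weighted_mean_prediction((2.5,), data, k=4))
print(local_linear_prediction((2.5,), data, k=4))
```

On this toy data both strategies land close to 6, the true value at x = 2.5. They differ on queries near the edge of the data, where the weighted mean is pulled toward the interior while the local line can follow the trend.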

Classification

In conventional classification, the target attribute has a nominal scale, i.e., the set of classes is simply a finite set. In ordinal (aka ordered) classification, the set of classes is finite, too, but equipped with a total order relation; that is, the class labels can be put in a natural order (e.g., hotel categories *, **,***, ****, *****). Ordered classification can be enabled by using the option “-s WMedianOClass”; in this case, the WeightedMedian prediction is used, which is suitable for minimizing the absolute error loss function (predicting the i-th class although the j-th class is correct yields an error of abs(i-j)). Leaving this option empty is equivalent to using the default value “-s WModeClass”, in which case the WeightedMode is returned; this prediction is a proper risk minimizer for the standard 0/1 loss (i.e., the loss is 0 if the predicted class is correct and 1 otherwise).
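To make the difference between the two classification schemes concrete, here is a small Python sketch of a weighted mode and a weighted median over neighbor labels. The function names and the example weights are hypothetical and are not taken from IBLStreams; only the definitions of the two aggregates follow the description above.

```python
def weighted_mode(labels, weights):
    """Return the label with the largest total weight (suits 0/1 loss)."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def weighted_median(ordered_classes, labels, weights):
    """Return the first class, in the given order, at which the cumulative
    weight reaches half of the total weight (suits absolute error loss)."""
    totals = {c: 0.0 for c in ordered_classes}
    for label, weight in zip(labels, weights):
        totals[label] += weight
    half = sum(weights) / 2.0
    cumulative = 0.0
    for c in ordered_classes:
        cumulative += totals[c]
        if cumulative >= half:
            return c
    return ordered_classes[-1]


# Hotel categories as an ordered class set; neighbor weights are made up.
stars = ["*", "**", "***", "****", "*****"]
neighbor_labels = ["*", "**", "*****"]
neighbor_weights = [0.4, 0.25, 0.35]

print(weighted_mode(neighbor_labels, neighbor_weights))            # prints *
print(weighted_median(stars, neighbor_labels, neighbor_weights))   # prints **
```

The mode picks the single heaviest class, whereas the median takes the order into account: with one heavy neighbor at five stars, predicting two stars gives a smaller weighted absolute error than predicting one star.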

Website

New release of MOA 11.10

We’ve made a new release of MOA 11.10.

The new features of this release are:

  • new active classification methods: ActiveClassifier
  • Cluster Mapping Measure CMM
  • cleanup of Clustering Setup Panel
  • export fix for FileStream based clusterings
  • screenshot button: filename option
  • wrapper for Weka Clustering algorithms

You can find the download link for this release on the MOA homepage:

https://moa.cs.waikato.ac.nz

Cheers,

The MOA Team 

Pocket Data Mining Project using MOA

Pocket Data Mining (PDM) is a new term describing the collaborative mining of streaming data in mobile and distributed computing environments, introduced by researchers Frederic Stahl, Mohamed Medhat Gaber, Max Bramer, and Philip S. Yu. With large numbers of data streams now available for subscription on our smart mobile phones, using this data for decision making through data stream mining techniques has become achievable, owing to the increasing power of these handheld devices. Wireless communication among these devices using Bluetooth and WiFi has opened the door to collaborative mining among mobile devices that are within range of each other and run data mining techniques targeting the same application.

 


Related publications:

Stahl F., Gaber M. M., Bramer M., and Yu P. S., Distributed Hoeffding Trees for Pocket Data Mining, Proceedings of the 2011 International Conference on High Performance Computing & Simulation (HPCS 2011), Special Session on High Performance Parallel and Distributed Data Mining (HPPD-DM 2011), July 4-8, 2011, Istanbul, Turkey, IEEE Press.
https://eprints.port.ac.uk/id/eprint/3523

Stahl F., Gaber M. M., Bramer M., Liu H., and Yu P. S., Distributed Classification for Pocket Data Mining, Proceedings of the 19th International Symposium on Methodologies for Intelligent Systems (ISMIS 2011), Warsaw, Poland, 28-30 June, 2011, Lecture Notes in Artificial Intelligence LNAI, Springer Verlag.
https://eprints.port.ac.uk/3524/

Stahl F., Gaber M. M., Bramer M., and Yu P. S., Pocket Data Mining: Towards Collaborative Data Mining in Mobile Computing Environments, Proceedings of the IEEE 22nd International Conference on Tools with Artificial Intelligence (ICTAI 2010), Arras, France, 27-29 October, 2010.
https://eprints.port.ac.uk/3248/

CFP – Data Streams Track – ACM SAC 2012

DATA STREAMS TRACK
https://www.cs.waikato.ac.nz/~abifet/SAC2012/

========================================================================
ACM SAC 2012
The 27th Annual ACM Symposium on Applied Computing
in Trento University, Italy, March 20-23, 2012.
https://www.acm.org/conferences/sac/sac2012/
========================================================================

CALL FOR PAPERS

The rapid development of information science and technology in general,
and the growth in complexity and volume of data in particular, has
introduced new challenges for the research community. Many sources
produce data continuously. Examples include sensor networks, wireless
networks, radio frequency identification (RFID), health-care devices
and information systems, customer click streams, telephone records,
multimedia data, scientific data, sets of retail chain transactions,
etc. These sources are called data streams.  A data stream is an
ordered sequence of instances that can be read only once or a small
number of times using limited computing and storage capabilities.
These sources of data are characterized by being open-ended, flowing
at high speed, and generated by non-stationary distributions.
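
As a small illustration of the "read only once, limited storage" constraint,
the sketch below uses reservoir sampling (the classic Algorithm R) to keep a
uniform random sample of fixed size from a stream of unknown length. It is a
generic textbook technique chosen as an example for this post, not something
prescribed by the track; the function name is hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a single pass over stream."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# The generator is consumed once; memory use stays at k items.
readings = (x * x for x in range(100000))
print(reservoir_sample(readings, 5, random.Random(7)))
```

The stream is never stored or rewound, so the same code works for an
open-ended source as long as the loop is allowed to keep running.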

TOPICS OF INTEREST
We are looking for original, unpublished work related to algorithms,
methods and applications on data streams. Topics include (but are not
restricted to):

– Data Stream Models
– Data Stream Management Systems
– Data Stream Query Languages
– Continuous queries and Summarization from Data Streams
– Sampling Data Streams
– Single-Pass Algorithms
– Scalable Algorithms
– Change Detection Algorithms
– Clustering on Data Streams
– Classification and Regression on Data Streams
– Association Rules on Data Streams
– Feature Selection on Data Streams
– Visualization Techniques for Data Streams
– Evaluation of Data Streams Models
– Data Stream applications
– Sensor Networks
– Real-Time Applications

IMPORTANT DATES (strict)
– Paper Submission: 31 August, 2011
– Author Notification: 12 October, 2011
– Camera-ready Copy: 2 November, 2011

PAPER SUBMISSION GUIDELINES
Papers should be submitted in PDF using the SAC 2012 conference
management system: https://www.softconf.com/c/sac2012/. Authors are
invited to submit original papers in all topics related to data
streams. All papers should be submitted in ACM 2-column camera ready
format for publication in the symposium proceedings. ACM SAC follows a
double blind review process. Consequently, the author(s) name(s) and
address(es) must NOT appear in the body of the submitted paper, and
self-references should be in the third person. This is to facilitate
double blind review required by ACM. All submitted papers must include
the paper identification number provided by the eCMS system when the
paper is first registered. The number must appear on the front page,
above the title of the paper. Each submitted paper will be fully
refereed and undergo a blind review process by at least three
referees. The conference proceedings will be published by ACM. The
maximum number of pages allowed for the final papers is 6 pages. There
is a set of templates to support the required paper format for a
number of document preparation systems at:
https://www.acm.org/sigs/pubs/proceed/template.html

HaCDAIS @ IEEE ICDM 2011

HaCDAIS 2011: The 2nd International Workshop on Handling Concept Drift in Adaptive Information Systems

https://wwwis.win.tue.nl/hacdais2011/

CALL FOR PAPERS 

In the real world, data is often non-stationary. In predictive analytics, machine learning and data mining, the phenomenon of unexpected change in underlying data over time is known as concept drift. Changes in underlying data might occur due to changing personal interests, changes in population, or adversarial activities, or they can be attributed to the complex nature of the environment.

When there is a shift in data, the predictions might become less accurate as time passes, or opportunities to improve the accuracy might be missed. Thus, the learning models need to adapt to the changes.

The problem of concept drift is of increasing importance to machine learning and data mining as more and more data is organized in the form of data streams rather than static databases, and it is rather unusual that concepts and data distributions stay stable over a long period of time. It is not surprising that the problem of concept drift has been studied in several research communities including but not limited to machine learning and data mining, data streams, information retrieval, and recommender systems. Different approaches for detecting and handling concept drift have been proposed in the literature, and many of them have already proved their potential in a wide range of application domains, e.g. fraud detection, adaptive system control, user modeling, information retrieval, text mining, biomedicine.

TOPICS OF INTEREST

In this workshop, we aim to attract researchers with an interest in handling concept drift and recurring contexts in adaptive information systems. Although we have emphasized the application aspects of handling concept drift we are open to any original work in this area.

A non-exhaustive list of topics includes:

  • Classification and clustering on data streams and evolving data
  • Change and novelty detection in online, semi-online and offline settings
  • Adaptive ensembles
  • Adaptive sampling and instance selection
  • Incremental learning and model adaptivity
  • Delayed labeling in data streams
  • Dynamic feature selection
  • Handling local and complex concept drift
  • Qualitative and quantitative evaluation of concept drift handling performance
  • Reoccurring contexts and context-aware approaches
  • Application-specific and domain driven approaches within the areas of information retrieval, recommender systems, pattern recognition, user modeling, decision support and adaptive (information) systems

We invite submissions in the following categories:

  • New approaches advancing the current state of the art
  • Generic frameworks for handling concept drift and reoccurring contexts
  • Taxonomies and categorizations of the approaches for handling concept drift and reoccurring contexts
  • Case studies and application examples dealing with drifting data

Please note that we encourage prospective contributors to submit full papers (10 pages) as well as short papers (5 pages).

IMPORTANT DATES

July 23, 2011 Submission due (for both full and short papers)

September 20, 2011 Notification of acceptance

October 11, 2011 Final papers due

December 10, 2011 Workshop day

SUBMISSION PROCEDURE

Paper submissions are limited to a maximum of 10 pages in the IEEE 2-column format, which is the same as the camera-ready format (see the IEEE Computer Society Press Proceedings Author Guidelines). All papers will be reviewed by the Program Committee based on technical quality, relevance to data mining, originality, significance, and clarity. A double blind reviewing process will be adopted. Authors should therefore avoid using identifying information in the text of the paper. All papers should be submitted through the ICDM Workshop Submission Site. At the time of submission, the papers must not be under review or accepted for publication elsewhere except the main IEEE ICDM conference.

All accepted workshop papers will be published in separate ICDM workshop proceedings by the IEEE Computer Society Press. In addition, authors of accepted workshop papers may be invited to publish extended versions in a special issue of a journal.

RELATED EVENTS

HaCDAIS 2011 is the 2nd workshop focusing on handling concept drift and reoccurring contexts in adaptive information systems. The 1st HaCDAIS workshop was held in conjunction with ECML/PKDD 2010 in Barcelona, Catalonia, Spain. Several other events address the problem of changing data and are thus related to HaCDAIS: International Workshop on Knowledge Discovery from Sensor Data (SensorKDD), Novel Data Stream Pattern Mining Techniques (StreamKDD), Data Streams Track at ACM Symposium on Applied Computing (SAC10), Symposium on Computational Intelligence in Dynamic and Uncertain Environments (CIDUE 2011), Concept Drift and Learning in Nonstationary Environments at IEEE World Congress on Computational Intelligence.

WORKSHOP CHAIRS

Latifur Khan, University of Texas, Dallas, USA

Mykola Pechenizkiy, Eindhoven University of Technology, the Netherlands

Indrė Žliobaitė, Bournemouth University, UK

 

 

New release of MOA 11.5

We’ve made a new release of MOA 11.5.

The new features of this release are:

 

  •  new classification methods for text and sparse data: NaiveBayesMultinomial, SGD Stochastic Gradient Descent, and SPegasos.
  •  new classification methods: LimAttClassifier, LimAttHoeffdingTree, LimAttHoeffdingTreeNB, LimAttHoeffdingTreeNBAdaptive, Perceptron. 
  •  new chunk classification and evaluation methods: EvaluateInterleavedChunks, AccuracyUpdatedEnsemble, AccuracyWeightedEnsemble.
  •  new regression evaluation methods. Now it is possible to run regression experiments on MOA.
  •  a reader of arff files for clustering
  •  multi-label stream generators
  •  new simplified memory management

 

You can find the download link for this release on the MOA homepage:

https://moa.cs.waikato.ac.nz

Please note that the documentation has also been updated. 

Cheers,

The MOA Team 

 

ADMIRE project

ADMIRE project Website

MOA is used in the Advanced Data Mining and Integration Research for Europe (ADMIRE) project. The aim of the project is to create an advanced, distributed data analysis platform, and one of its major goals is to support data stream processing. MOA has been very helpful during the development of one of the use cases (churn prediction in analytical CRM), where the Hoeffding tree algorithm is used to classify customers stored in large datasets. The algorithm implementation has been wrapped as a Process Element (a step in a workflow) named “BuildIterationalClassifier”.

Book “Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams”

 

This book addresses the design of learning algorithms for mining time-changing data streams. It introduces new contributions on several different aspects of the problem, identifying research opportunities and increasing the scope for applications. It also includes an in-depth study of stream mining and a theoretical analysis of proposed methods and algorithms.

The first section is concerned with the use of an adaptive sliding window algorithm (ADWIN). Because ADWIN has rigorous performance guarantees, using it in place of counters or accumulators offers the possibility of extending such guarantees to learning and mining algorithms not initially designed for drifting data. Testing with several methods, including Naïve Bayes, clustering, decision trees and ensemble methods, is discussed as well.
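
For readers who have not met adaptive windowing before, the following Python sketch shows the basic idea in a deliberately naive form: keep a window of recent values in [0, 1], and drop the oldest values whenever some split of the window into an older and a newer part shows a significantly different mean. The class name is hypothetical, the threshold is a Hoeffding-style bound in the form commonly quoted for ADWIN (please check it against the book before relying on it), and the sketch stores every value and rescans the window on each update, whereas the real algorithm uses a compressed bucket structure.

```python
import math


class SimpleAdaptiveWindow:
    """Naive adaptive window over values in [0, 1] (illustration only)."""

    def __init__(self, delta=0.002):
        self.delta = delta    # confidence parameter (assumed default)
        self.window = []      # oldest value first

    def add(self, value):
        """Append a value; return True if old values were dropped."""
        self.window.append(value)
        shrunk = False
        while self._has_significant_split():
            del self.window[0]    # drop the oldest value and re-check
            shrunk = True
        return shrunk

    def mean(self):
        return sum(self.window) / len(self.window)

    def _has_significant_split(self):
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        delta_prime = self.delta / n
        head_sum = 0.0
        for split in range(1, n):
            head_sum += self.window[split - 1]
            n0, n1 = split, n - split
            mean0 = head_sum / n0
            mean1 = (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)   # harmonic-style sample size
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(mean0 - mean1) > eps_cut:
                return True
        return False


# A stream whose mean jumps from 0 to 1 halfway through.
detector = SimpleAdaptiveWindow()
for value in [0] * 100 + [1] * 100:
    if detector.add(value):
        pass  # a learner could react to the detected change here
print(len(detector.window), detector.mean())
```

While the stream is stable the window simply grows; after the jump it shrinks until it consists almost entirely of post-change values, which is what lets a counter or accumulator built on it track the current distribution.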

The second part of the book describes a formal study of connected acyclic graphs, or ‘trees’, from the point of view of closure-based mining, presenting efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees.

Lastly, a general methodology to identify closed patterns in a data stream is outlined. This is applied to develop an incremental method, a sliding-window based method, and a method that mines closed trees adaptively from data streams. These are used to introduce classification methods for tree data streams.

pHMM4weka: Profile Hidden Markov Models (PHMMs) for binary protein classification for WEKA

This Java software implements Profile Hidden Markov Models (PHMMs) for binary protein classification for the WEKA workbench. Standard PHMMs and newly introduced binary PHMMs are used. In addition the software allows propositionalisation of PHMMs.

This software was developed by Stefan Mutter during his PhD at the Machine Learning Group at the University of Waikato. His thesis investigated similarity amongst proteins. In this area of research there are two important and closely related classification tasks: the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied in the detection task as they model sequence similarity very well. From a machine learning point of view these HMMs are essentially one-class classifiers trained solely on a small number of similar proteins, neglecting the vast number of dissimilar ones. His basic assumption was that integrating this neglected information would be highly beneficial to the classification task. Thus, he transformed the problem representation from a one-class to a binary one. He also suggested a new way to significantly improve discriminative power and runtime: terminating the time-intensive training of HMMs early, then applying propositionalisation and classifying with a discriminative, binary learner.

Website

 

The ClusTree: indexing micro-clusters for anytime stream mining

Knowledge and Information Systems, 2010

by Philipp Kranen, Ira Assent, Corinna Baldauf, and Thomas Seidl.

Abstract: Clustering streaming data requires algorithms that are capable of updating clustering results for the incoming data. As data is constantly arriving, time for processing is limited. Clustering has to be performed in a single pass over the incoming data and within the possibly varying inter-arrival times of the stream. Likewise, memory is limited, making it impossible to store all data. For clustering, we are faced with the challenge of maintaining a current result that can be presented to the user at any given time. In this work, we propose a parameter-free algorithm that automatically adapts to the speed of the data stream. It makes best use of the time available under the current constraints to provide a clustering of the objects seen up to that point. Our approach incorporates the age of the objects to reflect the greater importance of more recent data. For efficient and effective handling, we introduce the ClusTree, a compact and self-adaptive index structure for maintaining stream summaries. Additionally we present solutions to handle very fast streams through aggregation mechanisms and propose novel descent strategies that improve the clustering result on slower streams as long as time permits. Our experiments show that our approach is capable of handling a multitude of different stream characteristics for accurate and scalable anytime stream clustering.    
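
To give a feel for what a stream summary that "incorporates the age of the objects" can look like, here is a minimal Python sketch of a micro-cluster kept as a decayed cluster feature: a weighted count, a linear sum and a squared sum, all faded exponentially with elapsed time. The abstract does not spell out the data structure, so the (n, ls, ss) triple, the base-2 decay, the `decay_rate` parameter and the class name are assumptions made for illustration; the ClusTree index built on top of such summaries is not shown.

```python
class DecayedClusterFeature:
    """Micro-cluster summary whose statistics fade with age (illustration)."""

    def __init__(self, dim, decay_rate=0.1):
        self.n = 0.0               # decayed number of points
        self.ls = [0.0] * dim      # decayed linear sum per dimension
        self.ss = [0.0] * dim      # decayed squared sum per dimension
        self.last_update = 0.0
        self.decay_rate = decay_rate   # hypothetical parameter

    def _fade(self, now):
        """Down-weight the stored statistics by the time since the last update."""
        factor = 2.0 ** (-self.decay_rate * (now - self.last_update))
        self.n *= factor
        self.ls = [v * factor for v in self.ls]
        self.ss = [v * factor for v in self.ss]
        self.last_update = now

    def insert(self, point, now):
        """Fade the old content, then absorb the new point with weight 1."""
        self._fade(now)
        self.n += 1.0
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def mean(self):
        """Decayed mean; call only after at least one insert."""
        return [v / self.n for v in self.ls]

    def variance(self):
        """Decayed per-dimension variance; call only after at least one insert."""
        return [max(0.0, s / self.n - (l / self.n) ** 2)
                for l, s in zip(self.ls, self.ss)]


cf = DecayedClusterFeature(dim=1, decay_rate=1.0)
cf.insert([0.0], now=0.0)
cf.insert([10.0], now=1.0)
print(cf.mean())   # closer to 10 than the plain average of 5
```

Because the summary has constant size and can be updated in one step, many such micro-clusters can be maintained within the time available between arriving objects.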

KeplerWeka: a module for Kepler providing the functionality of WEKA

KeplerWeka is a module for the open-source scientific workflow system Kepler providing the full functionality of the WEKA Machine Learning workbench. It is developed by Peter Reutemann at the Machine Learning Group of the University of Waikato.

The latest release of KeplerWeka is integrated into the new Kepler 2.x build framework.

Kepler is designed to help scientists, analysts, and computer programmers create, execute, and share models and analyses across a broad range of scientific and engineering disciplines. Kepler can operate on data stored in a variety of formats, locally and over the internet, and is an effective environment for integrating disparate software components, such as merging “R” scripts with compiled “C” code, or facilitating remote, distributed execution of models. Using Kepler’s graphical user interface, users simply select and then connect pertinent analytical components and data sources to create a “scientific workflow”, an executable representation of the steps required to generate results. The Kepler software helps users share and reuse data, workflows, and components developed by the scientific community to address common needs.

Project Blog

Website

Third Edition “Data Mining: Practical Machine Learning Tools and Techniques”

By Ian H. Witten, Eibe Frank and Mark A. Hall

https://www.elsevierdirect.com/product.jsp?isbn=9780123748560

Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.

Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the Machine Learning Group at the University of Waikato. They include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. The book includes a section in Chapter 9 on data stream mining. 

 

S4: Distributed Stream Computing Platform from Yahoo!

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. S4 was initially developed to personalize search advertising products at Yahoo!, which operate at a rate of thousands of events per second. MapReduce excels at batch jobs, but is hard to apply to stream computation tasks.

Website

Data Streams Track on ACM Symposium on Applied Computing 2011

The goal of the Data Streams Track is to promote a meeting point and a discussion forum for researchers interested in any aspect of Data Stream processing. For the past twenty years, the ACM Symposium on Applied Computing has been a primary gathering forum for applied computer scientists, computer engineers, software engineers, and application developers from around the world.  

The 26th Annual ACM Symposium on Applied Computing will be held at Tunghai University, TaiChung, Taiwan, March 21 – 25, 2011.

Website

MEKA Software: A Multi-label Extension to the WEKA Framework

This software provides an open source implementation of the ‘pruned sets’ and ‘classifier chains’ methods for multi-label classification. These methods were developed during the PhD thesis of Jesse Read at the Machine Learning Group at the University of Waikato. See these publications:

Jesse Read. Scalable Multi-label Classification. PhD Thesis, University of Waikato, Hamilton, New Zealand. (2010)

Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank. Classifier Chains for Multi-label Classification. In Proc. of 20th European Conference on Machine Learning (ECML 2009). Bled, Slovenia, September 2009.

Jesse Read, Bernhard Pfahringer, Geoff Holmes. Multi-label Classification using Ensembles of Pruned Sets. Proc. of IEEE International Conference on Data Mining (ICDM 2008). Pisa, Italy, December 2008.

Website

Book “Knowledge Discovery from Data Streams” by João Gama

This book covers the fundamentals of data stream mining and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they also are valid for different areas of machine learning and data mining.

https://www.crcpress.com/product/isbn/9781439826119