
Using MOA’s API with Scala


As Scala runs on the Java Virtual Machine, it is very easy to use MOA objects from Scala.

Consider the Java code of the first example in Tutorial 2. Since Scala calls Java classes directly, that code can be translated to Scala almost line for line, and the result reads much the same. The section below walks through the same kind of code interactively.

Using MOA with Scala and its Interactive Shell

Scala is a powerful language that has functional programming capabilities. As it runs on the Java Virtual Machine, it is very easy to use MOA objects inside Scala.

Let’s see an example, using the Scala Interactive Interpreter. First we need to start it, telling it where the MOA library is:

scala -cp moa.jar

Welcome to Scala version 2.9.2.
Type in expressions to have them evaluated.
Type :help for more information.

Let’s run a very simple experiment: using a decision tree (Hoeffding Tree) with data generated from an artificial stream generator (RandomRBFGenerator).

We start by importing the classes that we need, and by defining the stream and the learner.

scala> import moa.classifiers.trees.HoeffdingTree
import moa.classifiers.trees.HoeffdingTree

scala> import moa.streams.generators.RandomRBFGenerator
import moa.streams.generators.RandomRBFGenerator

scala> val learner = new HoeffdingTree();
learner: moa.classifiers.trees.HoeffdingTree =
Model type: moa.classifiers.trees.HoeffdingTree
model training instances = 0
model serialized size (bytes) = -1
tree size (nodes) = 0
tree size (leaves) = 0
active learning leaves = 0
tree depth = 0
active leaf byte size estimate = 0
inactive leaf byte size estimate = 0
byte size estimate overhead = 0
Model description:
Model has not been trained.

scala> val stream = new RandomRBFGenerator();
stream: moa.streams.generators.RandomRBFGenerator =

Now, we need to initialize the stream and the classifier:

scala> stream.prepareForUse()
scala> learner.setModelContext(stream.getHeader())
scala> learner.prepareForUse()

Now, let’s load an instance from the stream, and use it to train the decision tree:

scala> import com.yahoo.labs.samoa.instances.Instance
import com.yahoo.labs.samoa.instances.Instance

scala> val instance = stream.nextInstance().getData()
instance: com.yahoo.labs.samoa.instances.Instance = 0.210372,1.009586,0.0919,0.272071,
0.450117,0.226098,0.212286,0.37267,0.583146,0.297007,class2

scala> learner.trainOnInstance(instance)

And finally, let’s use the tree to make a prediction on the same instance.

scala> learner.getVotesForInstance(instance)
res9: Array[Double] = Array(0.0, 0.0)

scala> learner.correctlyClassifies(instance)
res7: Boolean = false

As shown in this example, it is very easy to use the Scala interpreter to run MOA interactively.

OpenML: exploring machine learning better, together.

www.openml.org

Now you can use MOA classifiers inside OpenML. OpenML is a website where researchers can share their datasets, implementations and experiments in such a way that they can easily be found and reused by others.

OpenML engenders a novel, collaborative approach to experimentation with important benefits. First, many questions about machine learning algorithms won’t require the laborious setup of new experiments: they can be answered on the fly by querying the combined results of thousands of studies on all available datasets. OpenML also keeps track of experimentation details, ensuring that we can easily reproduce experiments later on, and confidently build upon earlier work. Reusing experiments also allows us to run large-scale machine learning studies, yielding more generalizable results with less effort. Finally, beyond the traditional publication of algorithms in journals, often in a highly summarized form, OpenML allows researchers to share all code and results that are possibly of interest to others, which may boost their visibility, speed up further research and applications, and engender new collaborations.

SAMOA: Scalable Advanced Massive Online Analysis

https://www.samoa-project.net/

SAMOA is a distributed streaming machine learning (ML) framework that provides a programming abstraction for distributed streaming ML algorithms. It is a project started at Yahoo Labs Barcelona.

SAMOA enables the development of new ML algorithms without dealing with the complexity of the underlying stream processing engines (SPEs, such as Apache Storm and Apache S4). SAMOA users can code distributed streaming ML algorithms once and execute them on multiple SPEs.


To use MOA methods inside SAMOA, take a look at

https://github.com/yahoo/samoa/wiki/SAMOA-for-MOA-users

RMOA: Massive online data stream classifications with R & MOA

https://bnosac.be/index.php/blog/32-rmoa-massive-online-data-stream-classifications-with-r-a-moa

For R users who work with a lot of data or run into RAM issues when building models on large datasets, MOA, and data streams in general, have some nice properties. Namely, a stream learner:
  1. Uses a limited amount of memory, so there are no RAM issues when building models.
  2. Processes one example at a time, and runs over it only once.
  3. Works incrementally, so that the model is ready to be used for prediction at any point.

Unfortunately, MOA is written in Java and is not easily accessible to R users. For users mostly interested in clustering, the stream package already facilitates this (this blog item gave an example using ff alongside the stream package). In our day-to-day use cases, classification is a more common request, and the stream package only supports clustering. Hence the decision to make the classification algorithms of MOA easily available to R users as well. For this, the RMOA package was created; it is available on GitHub (https://github.com/jwijffels/RMOA).

The streams Framework

https://www.jwall.org/streams/

The streams framework is a Java implementation of a simple stream processing
environment by Christian Bockermann and Hendrik Blom at TU Dortmund University. It aims at providing a clean and easy-to-use Java-based platform to process streaming data.

The core module of the streams library is a thin API layer of interfaces and
classes that reflect a high-level view of streaming processes. This API serves
as a basis for implementing custom processors and providing services with the
streams library.

Figure 1: Components of the streams library.

Figure 1 shows the components of the streams library. The binding glue element
is a thin API layer that attaches to a runtime provided as a separate module, or
can be embedded into existing code.

Process Design with JavaBeans

The streams library promotes simple software design patterns such as JavaBean
conventions and dependency injection to allow for a quick setup of streaming
processes using simple XML files.

As shown in Figure 2, the idea of the streams library is to provide a simple
runtime environment that lets users define streaming processes in XML files,
with a close relation to the implementing Java classes.
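
A process definition of this kind might look like the following sketch (the element and attribute names here are illustrative assumptions, not taken verbatim from the streams documentation):

```xml
<container>
  <!-- a stream source: where the data items come from -->
  <stream id="input" class="stream.io.CsvStream"
          url="file:/tmp/data.csv" />

  <!-- a process consuming that stream; nested elements are processors,
       instantiated as JavaBeans with their attributes injected -->
  <process input="input">
    <example.MyProcessor threshold="0.5" />
  </process>
</container>
```

The XML element names map to Java classes, and the attributes are injected into the corresponding JavaBean properties, which is what keeps custom processors so easy to plug in.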

Figure 2: XML process definitions mapped to a runtime environment, using
stream-api components and other libraries.

Based on the conventions and patterns used, components of the
streams library are simple Java classes. Following the basic design
patterns of the streams library allows for quickly adding custom
classes to the streaming processes without much trouble.

New Release of MOA 14.04

We’ve made a new release of MOA 14.04.

The new features of this release are:

  • Change detection Tab
    • Albert Bifet, Jesse Read, Bernhard Pfahringer, Geoff Holmes, Indre Zliobaite: CD-MOA: Change Detection Framework for Massive Online Analysis. IDA 2013: 92-103
  • New Tutorial on Clustering by Frederic Stahl.
  • New version of Adaptive Model Rules for regression
    • Ezilda Almeida, Carlos Abreu Ferreira, João Gama: Adaptive Model Rules from Data Streams. ECML/PKDD (1) 2013: 480-492
  • AnyOut Outlier Detector
    • Ira Assent, Philipp Kranen, Corinna Baldauf, Thomas Seidl: AnyOut: Anytime Outlier Detection on Streaming Data. DASFAA (1) 2012: 228-242
  • ORTO Regression Tree with Options
    • Elena Ikonomovska, João Gama, Bernard Zenko, Saso Dzeroski: Speeding-Up Hoeffding-Based Regression Trees With Options. ICML 2011: 537-544
  • Online Accuracy Updated Ensemble
    • Dariusz Brzezinski, Jerzy Stefanowski: Combining block-based and online methods in learning ensembles from concept drifting data streams. Inf. Sci. 265: 50-67 (2014)
  • Anticipative and Dynamic Adaptation to Concept Changes Ensemble
    • Ghazal Jaber, Antoine Cornuéjols, Philippe Tarroux: A New On-Line Learning Method for Coping with Recurring Concepts: The ADACC System. ICONIP (2) 2013: 595-604

You can find the download link for this release on the MOA homepage:

MOA Machine Learning for Streams

Cheers,

The MOA Team

New release of MOA 13.11

We’ve made a new release of MOA 13.11.

The new feature of this release is:

  • Temporal dependency evaluation
    • Albert Bifet, Jesse Read, Indre Zliobaite, Bernhard Pfahringer, Geoff Holmes: Pitfalls in Benchmarking Data Stream Classification and How to Avoid Them. ECML/PKDD (1) 2013: 465-479

You can find the download link for this release on the MOA homepage:

MOA Machine Learning for Streams

Cheers,

The MOA Team

Temporal Dependency in Classification

The paper “Pitfalls in Benchmarking Data Stream Classification and How to Avoid Them”, presented at ECML-PKDD 2013, showed that data streams have an important temporal component that current evaluations of data-stream classifiers do not consider. A very simple classifier that exploits this temporal component, the No-Change classifier that simply predicts the last class seen, can outperform current state-of-the-art classifiers on some real-world datasets. MOA can now evaluate data streams taking this temporal component into account using:

  • NoChange classifier
  • TemporallyAugmentedClassifier classifier
  • new evaluation measure Kappa+ or Kappa Temp

which provides a more accurate gauge of classifier performance.
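
Concretely, Kappa Temporal normalizes the classifier’s prequential accuracy p0 against the accuracy pe of the No-Change baseline, as (p0 - pe) / (1 - pe). A minimal sketch of the computation (not the MOA implementation):

```python
def kappa_temporal(predictions, labels):
    """Kappa+ (Kappa Temporal): prequential accuracy normalized against
    the No-Change baseline, which always predicts the last class seen.
    Undefined when the baseline is already perfect (division by zero)."""
    n = len(labels)
    p0 = sum(p == y for p, y in zip(predictions, labels)) / n
    # accuracy of the No-Change classifier (the first instance has no previous label)
    pe = sum(labels[i] == labels[i - 1] for i in range(1, n)) / n
    return (p0 - pe) / (1 - pe)
```

A classifier that merely matches the No-Change baseline scores 0, a perfect classifier scores 1, and negative values mean the classifier does worse than simply repeating the previous label.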

New recommender algorithms and evaluation

MOA has been extended in order to provide an interface to develop and visualize online recommender algorithms.

This is a simple example in order to show the functionality of the EvaluateOnlineRecommender task in MOA.

This task takes a rating predictor and a dataset (each training instance being a [user, item, rating] triplet) and evaluates how well the model predicts the ratings, given the user and item, as more and more instances are processed. This is similar to an online scenario of a recommender system, where new ratings from users to items arrive constantly, and the system has to make predictions of unrated items for the user in order to know which ones to recommend.
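
The evaluation loop behind this is a simple test-then-train cycle: predict each rating before using it for training, and keep a running RMSE. A minimal sketch, with a toy global-mean predictor standing in for a MOA rating predictor (illustrative code, not the MOA implementation):

```python
import math

class GlobalMeanPredictor:
    """Toy stand-in for a rating predictor: predicts the running mean
    of all ratings seen so far, ignoring the user and the item."""
    def __init__(self):
        self.total = 0.0
        self.count = 0
    def predict(self, user, item):
        return self.total / self.count if self.count else 0.0
    def train(self, user, item, rating):
        self.total += rating
        self.count += 1

def evaluate_online(triplets, predictor):
    """Test-then-train: predict each rating before training on it,
    and record the RMSE over all instances processed so far."""
    sq_err, rmse = 0.0, []
    for n, (user, item, rating) in enumerate(triplets, start=1):
        sq_err += (predictor.predict(user, item) - rating) ** 2
        predictor.train(user, item, rating)
        rmse.append(math.sqrt(sq_err / n))
    return rmse
```

The curve of running RMSE values is exactly what the task plots as more and more [user, item, rating] triplets are processed.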

Let’s start by opening the MOA user interface. In the Classification tab, click on Configure task, and select from the list the ‘class moa.tasks.EvaluateOnlineRecommender’.

Now we need to select which dataset we want to process, so we click the corresponding button to edit that option.

On the list, we can choose different publicly available datasets. For this example, we will be using the Movielens 1M dataset. We can download it from https://grouplens.org/datasets/movielens/. Finally, we select the file where the input data is located.

Once the dataset is configured, the next step is to choose which ratingPredictor to evaluate.

For the moment, there are just two available: BaselinePredictor and BRISMFPredictor. The first is a very simple rating predictor, and the second is an implementation of a factorization algorithm described in Scalable Collaborative Filtering Approaches for Large Recommender Systems (Gábor Takács, István Pilászy, Bottyán Németh, and Domonkos Tikk). We choose the latter,

and find the following parameters:

  • features – the number of features to be trained for each user and item
  • learning rate – the learning rate of the gradient descent algorithm
  • regularization ratio – the regularization ratio to be used in the Tikhonov regularization
  • iterations – the number of iterations to be used when retraining user and item features (online training)

We can leave the default parameters for this dataset.
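
These four parameters map directly onto the stochastic gradient descent procedure used by factorization models of this family. A conceptual sketch (illustrative code under simplified assumptions, not the BRISMFPredictor implementation; the function names are made up):

```python
import random

def sgd_update(p_u, q_i, rating, lrate, reg):
    """One gradient step on the factor vectors of a (user, item, rating)
    triplet: reduce the Tikhonov-regularized squared prediction error."""
    err = rating - sum(p * q for p, q in zip(p_u, q_i))
    for f in range(len(p_u)):
        p, q = p_u[f], q_i[f]
        p_u[f] += lrate * (err * q - reg * p)
        q_i[f] += lrate * (err * p - reg * q)
    return err

def train_factors(ratings, features=4, lrate=0.05, reg=0.01, iterations=200):
    """Learn 'features' factors per user and item by sweeping the known
    ratings 'iterations' times with the SGD update above."""
    rng = random.Random(1)
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(features)] for u, _, _ in ratings}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(features)] for _, i, _ in ratings}
    for _ in range(iterations):
        for u, i, r in ratings:
            sgd_update(P[u], Q[i], r, lrate, reg)
    return P, Q
```

With a regularization ratio close to zero the known ratings can be fitted almost exactly; a larger ratio trades that fit against smaller factor values, which generalizes better to unseen ratings.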

Going back to the configuration of the task, we have the sampleFrequency parameter, which defines the frequency at which the precision measures are taken, and the taskResultFile parameter, which allows us to save the output of the task in a file. We can leave the default values for both.

Now the task is configured, and we only have to run it:

As the task progresses, we can see in the preview box the RMSE of the predictor, measured from the first instance up to the one processed so far.

When the task finishes, we can see the final results: the RMSE of the predictor at each measured point.

 

ADAMS – a different take on workflows

A fascinating new workflow engine for MOA and Weka is available. The Advanced
Data mining And Machine learning System (ADAMS) is a novel, flexible
workflow engine aimed at quickly building and maintaining real-world,
complex knowledge workflows. It is written in Java and uses Maven as
its build system. The framework was open-sourced in September 2012,
released under GPLv3.

The core of ADAMS is the workflow engine, which follows the philosophy
of less is more. Instead of letting the user place operators (or
actors in ADAMS terms) on a canvas and then manually connect inputs
and outputs, ADAMS uses a tree-like structure. This structure and the
control actors define how the data flows in the workflow, with no
explicit connections necessary. The tree-like structure stems from the
internal object representation and the nesting of sub-actors within
actor-handlers.

https://adams.cms.waikato.ac.nz/

The MOA team recommends ADAMS as the best workflow tool for MOA.

New release of MOA 12.08

We’ve made a new release of MOA 12.08.

The new features of this release are:

  • new rule classification methods: VFDR rules from Learning Decision Rules from Data Streams, IJCAI 2011, J. Gama, P. Kosina
  • migrated to a proper Maven project
  • NaiveBayesMultinomial and SGD updated with an adaptive DoubleVector for weights
  • new multilabel classifiers: Scalable and efficient multi-label classification for evolving data streams. Jesse Read, Albert Bifet, Geoff Holmes, Bernhard Pfahringer: Machine Learning 88(1-2): 243-272 (2012)
  • updated DDM with an option for the minimum number of instances needed to detect change

You can find the download link for this release on the MOA homepage:

https://moa.cs.waikato.ac.nz

Cheers,

The MOA Team 

 

CFP – Data Streams Track – ACM SAC 2013

============================================================
ACM SAC 2013
The 28th Annual ACM Symposium on Applied Computing
in Coimbra, Portugal, March 18-22, 2013.
https://www.acm.org/conferences/sac/sac2013/

DATA STREAMS TRACK
https://www.cs.waikato.ac.nz/~abifet/SAC2013/
============================================================

CALL FOR PAPERS

The rapid development of information science and technology in general,
and the growth in the complexity and volume of data in particular, have
introduced new challenges for the research community. Many sources
produce data continuously. Examples include sensor networks, wireless
networks, radio frequency identification (RFID), health-care devices
and information systems, customer click streams, telephone records,
multimedia data, scientific data, sets of retail chain transactions,
etc. These sources are called data streams. A data stream is an
ordered sequence of instances that can be read only once, or a small
number of times, using limited computing and storage capabilities.
These sources of data are characterized by being open-ended, flowing
at high speed, and generated by non-stationary distributions.

TOPICS OF INTEREST
We are looking for original, unpublished work related to algorithms,
methods and applications on data streams. Topics include (but are not
restricted to):

– Data Stream Models
– Languages for Stream Query
– Continuous Queries
– Clustering from Data Streams
– Decision Trees from Data Streams
– Association Rules from Data Streams
– Decision Rules from Data Streams
– Bayesian networks from Data Streams
– Feature Selection from Data Streams
– Visualization Techniques for Data Streams
– Incremental on-line Learning Algorithms
– Single-Pass Algorithms
– Temporal, spatial, and spatio-temporal data mining
– Scalable Algorithms
– Real-Time and Real-World Applications using Stream data
– Distributed and Social Stream Mining

IMPORTANT DATES (strict)
1. Paper Submission: September 21, 2012
2. Author Notification: November 10, 2012
3. Camera‐ready copies: November 30, 2012

PAPER SUBMISSION GUIDELINES
Papers should be submitted in PDF using the SAC 2013 conference
management system: https://www.softconf.com/c/sac2013/. Authors are
invited to submit original papers in all topics related to data
streams. All papers should be submitted in ACM 2-column camera ready
format for publication in the symposium proceedings. ACM SAC follows a
double blind review process. Consequently, the author(s) name(s) and
address(es) must NOT appear in the body of the submitted paper, and
self-references should be in the third person. This is to facilitate
double blind review required by ACM. All submitted papers must include
the paper identification number provided by the eCMS system when the
paper is first registered. The number must appear on the front page,
above the title of the paper. Each submitted paper will be fully
refereed and undergo a blind review process by at least three
referees. The conference proceedings will be published by ACM. The
maximum number of pages allowed for the final papers is 6 pages. There
is a set of templates to support the required paper format for a
number of document preparation systems at:
https://www.acm.org/sigs/pubs/proceed/template.html

For accepted papers, registration for the conference is required and
allows the paper to be printed in the conference proceedings.
An author or a proxy attending SAC MUST present the paper. This is a
requirement for the paper to be included in the ACM Digital Library.
No-show of scheduled papers will result in excluding the papers from
the ACM Digital Library.

STUDENT RESEARCH COMPETITION
Graduate students seeking feedback from the scientific community on
their research ideas are invited to submit abstracts of their original
unpublished, in-progress research work in areas of experimental
computing and application development related to SAC 2013 Tracks. The
Student Research Competition (SRC) program is designed to provide
graduate students the opportunity to meet and exchange ideas with
researchers and practitioners in their areas of interest. All research
abstract submissions will be reviewed by researchers and practitioners
with expertise in the track focus area to which they are submitted.
Authors of selected abstracts will have the opportunity to give poster
presentations of their work and compete for three top winning places.
The SRC committee will evaluate and select First, Second, and Third
place winners. The winners will receive cash awards and SIGAPP
recognition certificates during the conference banquet dinner. Authors
of selected abstracts are eligible to apply to the SIGAPP Student
Travel Award program for support. Graduate students are invited to
submit abstracts (minimum of two pages; maximum of four pages)
of their original unpublished, in-progress research work following
the instructions published at SAC 2013 website. The submissions must
address research work related to a SAC track, with emphasis on the
innovation behind the research idea, including the problem being
investigated, the proposed approach and research methodology, and
sample preliminary results of the work. In addition, the abstract
should reflect on the originality of the work, innovation of the
approach, and applicability of anticipated results to real-world
problems. All abstracts must be submitted through the START Submission
system. Submitting the same abstract to multiple tracks is not allowed.
If you encounter any problems with your submission, please contact
the Program Chairs.

Summer School on Massive Data Mining, August 8-10, 2012

August 8-10, 2012, IT University of Copenhagen, Denmark

The summer school is aimed at PhD students and young researchers both from the algorithms community and the data mining community. A typical participant will be working in a group that aims at publishing in algorithms conferences such as ESA and SODA, and/or in data mining conferences such as ICDM and KDD. Speakers:

Michael Mahoney, Stanford University
Toon Calders, Eindhoven University of Technology
Suresh Venkatasubramanian, University of Utah
Aris Gionis, Yahoo! Research

Early registration fee (before June 15) is €90.
Organizing chair: Rasmus Pagh

Website: www.itu.dk/people/pagh/mdm12/

Big Data Mining (BigMine-12)

Call for Papers

Big Data Mining (BigMine-12)
1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine-12) – a KDD2012 Workshop

KDD2012 Conference Dates: August 12-16, 2012
BigMine-12 Workshop Date: Aug 12, 2012
Beijing, China

https://www.bigdata-mining.org

Key dates:
Papers due: May 9, 2012
Acceptance notification: May 23, 2012
Workshop Final Paper Due: June 8, 2012
Workshop Proceedings Due: June 15, 2012

Paper submission and reviewing will be handled electronically. Authors should consult the submission site (https://bigdata-mining.org/submission/) for full details regarding paper preparation and submission guidelines.

Papers submitted to BigMine-12 should be original work and substantively different from papers that have been previously published or are under review in a journal or another conference/workshop.

Following KDD main conference tradition, reviews are not double-blind, and author names and affiliations should be listed.

We invite submission of papers describing innovative research on all aspects of big data mining.

Examples of topics of interest include:

  • Scalable, Distributed and Parallel Algorithms
  • New Programming Models for Large Data beyond Hadoop/MapReduce, Storm, and streaming languages
  • Mining Algorithms of Data in non-traditional formats (unstructured, semi-structured)
  • Applications: social media, Internet of Things, Smart Grid, Smart Transportation Systems
  • Streaming Data Processing
  • Heterogeneous Sources and Format Mining
  • Systems Issues related to large datasets: clouds, streaming systems, architecture, and issues beyond clouds and streams.
  • Interfaces to database systems and analytics.
  • Evaluation Technologies
  • Visualization for Big Data
  • Applications: Large scale recommendation systems, social media systems, social network systems, scientific data mining, environmental, urban and other large data mining applications.

Papers emphasizing theoretical foundations, algorithms, systems, applications, language issues, data storage and access, architecture are particularly encouraged.

We welcome submissions by authors who are new to the data mining research community.

Submitted papers will be assessed based on their novelty, technical quality, potential impact, and clarity of writing. For papers that rely heavily on empirical evaluations, the experimental methods and results should be clear, well executed, and repeatable. Authors are strongly encouraged to make data and code publicly available whenever possible.

Top-quality papers accepted and presented at the workshop will, after careful revision by the authors and review by the original PC members and chairs, be recommended to ACM TIST, ACM TKDD, IEEE Intelligent Systems or IEEE Computer for fast publication, depending on the relevance of the topic.

 

New release of MOA 12.03

We’ve made a new release of MOA 12.03.

The new features of this release are:

  • new graphical visualization of measures for classification
  • classifiers are now in subpackages: classifiers.trees, classifiers.bayes, classifiers.functions, classifiers.meta, classifiers.drift, …
  • HoeffdingTree, HoeffdingTreeNB, and HoeffdingTreeNBAdaptive are now a single classifier: HoeffdingTree, with an option to select how classification is done at the leaves. By default, the option is NBAdaptive.
  • HoeffdingOptionTree, HoeffdingOptionTreeNB, and HoeffdingOptionTreeNBAdaptive are now a single classifier: HoeffdingOptionTree, with an option to select how classification is done at the leaves. By default, the option is NBAdaptive.
  • all the different methods of Leveraging Bagging are now unified in a single classifier: LeveragingBag
  • drift detection methods are in a specific subpackage: moa.classifiers.core.driftdetection
  • in the classification tab, it is now possible to store the tasks in a file, and to copy and paste task commands using the right mouse button
  • in the classification tab, the classifiers show a short description

You can find the download link for this release on the MOA homepage:

https://moa.cs.waikato.ac.nz

Cheers,

The MOA Team

 

PRICAI 2012 Special Session on Scalable Big Data Mining

https://cs.waikato.ac.nz/~abifet/pricai2012/
September 3 – 7, 2012
Kuching, Sarawak, Malaysia
============================================== 

CALL FOR PAPERS

Data have become a torrent flowing in many important areas. Big data
refers to datasets whose size is beyond the ability of current
state-of-the-art analytic tools. Streaming data is a specific approach
to dealing with big data that is evolving and changing. Parallelization is
another scalable approach to mining large quantities of data.

This special session aims to provide an opportunity for researchers to
share their ideas and efforts in building new algorithms, methods and
software for dealing with big data, static or evolving streams.

Topics of interest include but are not limited to:

* Evolving Stream Mining
* Large-scale Machine Learning
* Distributed System Platforms: Hadoop, S4, Storm
* Evaluation Methodologies
* Visualization Techniques for Big Data
* Social Media Mining

IMPORTANT DATES

* Paper Submission Deadline: March 31, 2012
* Paper Acceptance Notification: May 15, 2012
* Camera-Ready Paper + Copyright Submission: May 21, 2012
* Conference: Sept 5 – 7, 2012

PAPER SUBMISSION GUIDELINES

The papers should report original and unpublished research on topics of
interest for this special session. Accepted papers are expected to be presented
at the conference, and will be published in the conference proceedings.
Papers for this special session should be prepared in accordance with the
PRICAI 2012 requirements (https://ktw.mimos.my/pricai2012/submission.html).

Submitted papers should not exceed 12 pages, and should not be
under review or submitted for publication elsewhere during the review period.
Papers should be submitted via EasyChair at:

https://www.easychair.org/conferences/?conf=pricai2012

ORGANIZERS

* Wei Fan (IBM T.J.Watson Research, USA)
* Xiatian Zhang (Business Intelligence Center Tencent, China)
* Albert Bifet (University of Waikato, New Zealand)
* Geoff Holmes (University of Waikato, New Zealand)

Upcoming Conference: “Machine-Learning with Real-time & Streaming Applications”

FIRST CONFERENCE ANNOUNCEMENT:

From Data to Knowledge: Machine-Learning with Real-time & Streaming Applications
May 7-11 2012
On the Campus of the University of California, Berkeley

https://lyra.berkeley.edu/CDIConf/

 * * CONFIRMED INVITED SPEAKERS * *

Olfa Nasraoui (Louisville), Petros Drineas (RPI), Muthu Muthukrishnan (Rutgers),
Alex Szalay (Johns Hopkins), David Bader (Georgia Tech),
Eamonn Keogh (UC Riverside), Joao Gama (Univ. of Porto, Portugal),
Michael Franklin (UC Berkeley), Ziv Bar-Joseph (Carnegie Mellon University)

 * * AIMS OF THE CONFERENCE * *

We are experiencing a revolution in the capacity to quickly collect
and transport large amounts of data. Not only has this revolution
changed the means by which we store and access this data, but has also
caused a fundamental transformation in the methods and algorithms that
we use to extract knowledge from data. In scientific fields as diverse
as climatology, medical science, astrophysics, particle physics,
computer vision, and computational finance, massive streaming data
sets have sparked innovation in methodologies for knowledge discovery
in data streams. Cutting-edge methodology for streaming data has come
from a number of diverse directions, from on-line learning, randomized
linear algebra and approximate methods, to distributed optimization
methodology for cloud computing, to multi-class classification
problems in the presence of noisy and spurious data.

This conference will bring together researchers from applied
mathematics and several diverse scientific fields to discuss the
current state of the art and open research questions in streaming data
and real-time machine learning. The conference will be domain driven,
with talks focusing on well-defined areas of application and
describing the techniques and algorithms necessary to address the
current and future challenges in the field.

Sessions will be accessible to a broad audience and will have a single
track format with additional rooms for breakout sessions and posters.
There will be no formal conference proceedings, but conference
applicants are encouraged to submit an abstract and present a talk
and/or poster.

 * * IMPORTANT DATES * *

Feb 29     : Initial registration ends, participants announced.
May 7 – 11 : Conference.

 * * SESSIONS * *

Stochastic Data Streams
   Muthu Muthukrishnan: (Dept. of Computer Science, Rutgers University)

Real-Time Machine Learning in Astrophysics
   Alex Szalay:      (Dept. of Physics and Astronomy, Johns Hopkins University)

Real-Time Analytics with Streaming Databases
   Michael Franklin: (Computer Science Dept., UC Berkeley)

Classification of Sensor Network Data Streams
   Joao Gama:    (Lab. of A.I. & Decision Support, Economics at Univ. of Porto)

Randomized and Approximation Algorithms
   Petros Drineas:   (Computer Science Dept., Rensselaer Polytechnic Institute)

Time-Series Clustering and Classification
   Eamonn Keogh:     (Computer Science and Engineering Dept., UC Riverside)

Time Series in the Biological and Medical Sciences
   Ziv Bar-Joseph:   (Computer Science Dept., Carnegie Mellon University)

Streaming Graph/Network Data & Architectures
   David Bader:      (College of Computing, Georgia Tech)

Data Mining of Data Streams
   Olfa Nasraoui:    (Dept. of CS & Computer Engineering, Univ. of Louisville)

 * * Local Organizing Committee * *

Joshua Bloom: (Dept. of Astronomy, UC Berkeley)
Damian Eads:  (Dept. of CS, UC Santa Cruz; Dept. of Eng, Univ. of Cambridge)
Berian James: (Dept. of Astr, UC Berkeley; Dark Cosmology Centre, U Copenhagen)
Peter Nugent: (Comp. Cosmology, Lawrence Berkeley National Lab.)
John Rice:    (Dept. of Statistics, UC Berkeley)
Joseph Richards: (Dept. of Astronomy & Dept. of Statistics, UC Berkeley)
Dan Starr:    (Dept. of Astronomy, UC Berkeley)

 * * Scientific Organizing Committee * *

Leon Bottou:     (NEC Labs)
Emmanuel Candes: (Stanford)
Brad Efron:      (Stanford)
Alex Gray:       (Georgia Tech)
Michael Jordan:  (Berkeley)
John Langford:   (Yahoo)
Fernando Perez:  (Berkeley)
Ricardo Vilalta: (Houston)
Larry Wasserman: (CMU)

IBLStreams (Instance Based Learner on Streams for Regression and Classification)

IBLStreams (Instance Based Learner on Streams) is an instance-based learning algorithm for classification and regression problems on data streams, by Ammar Shaker, Eyke Hüllermeier and Jürgen Beringer. The method is able to handle large streams with low requirements in terms of memory and computational power. Moreover, it is equipped with mechanisms for adapting to concept drift and concept shift.

In instance-based learning, a prediction for the query instance is obtained by combining, in one way or another, the outputs of the neighbors of this instance in the training data. The type of aggregation depends on the type of problem to be solved. We offer four different prediction schemes, namely the WeightedMode for classification, the WeightedMedian for ordinal classification, and the WeightedMean and LocalLinearRegression for regression problems.

Regression

In regression, the target attribute is numerical, and loss is typically measured in terms of the absolute or squared difference between predicted and true output. Corresponding prediction problems can be solved in two ways. First, the target value can be estimated by the weighted mean of the target values of the k neighbor instances; this prediction is obtained by using the option “-s WMeanReg”, which sets the PredictionStrategy parameter to WeightedMean(Regression). Second, a prediction can be derived by means of locally weighted linear regression. In this case, a (local) linear regression model is fitted to the k nearest neighbors, and this model is used to make a prediction for the query instance. For this approach, the PredictionStrategy parameter must be set to LocalLinearRegression by using “-s LocLinReg”.
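
The two regression schemes can be illustrated on a one-dimensional toy problem (a conceptual Python sketch, not IBLStreams code; the inverse-distance weighting shown is just one common choice):

```python
def weighted_mean_prediction(neighbors, query):
    """WeightedMean: inverse-distance-weighted average of the
    k neighbors' target values; neighbors are (x, y) pairs."""
    weights = [1.0 / (1e-9 + abs(x - query)) for x, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

def local_linear_prediction(neighbors, query):
    """LocalLinearRegression: fit y = a + b*x to the k neighbors by
    ordinary least squares and evaluate the fitted line at the query."""
    n = len(neighbors)
    mx = sum(x for x, _ in neighbors) / n
    my = sum(y for _, y in neighbors) / n
    b = sum((x - mx) * (y - my) for x, y in neighbors) / \
        sum((x - mx) ** 2 for x, _ in neighbors)
    return my + b * (query - mx)
```

The weighted mean always stays within the range of the neighbors' targets, while the local linear fit can extrapolate beyond it, which is why the two schemes behave differently near the edges of the data.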

Classification

In conventional classification, the target attribute has a nominal scale, i.e., the set of classes is simply a finite set. In ordinal (a.k.a. ordered) classification, the set of classes is finite, too, but equipped with a total order relation; that is, the class labels can be put in a natural order (e.g., hotel categories *, **, ***, ****, *****). Ordinal classification can be enabled by using the option “-s WMedianOClass”; in this case, the WeightedMedian prediction is used, which is suitable for minimizing the absolute error loss function (predicting the i-th class when the j-th class is correct yields an error of abs(i-j)). Leaving this option empty is equivalent to using the default value “-s WModeClass”, in which case the WeightedMode is returned; this prediction is a proper risk minimizer for the standard 0/1 loss (i.e., the loss is 0 if the predicted class is correct and 1 otherwise).
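
The difference between the two aggregation schemes can be sketched as follows (a conceptual Python sketch, not IBLStreams code; votes are (class, weight) pairs, with classes compared by their natural rank):

```python
def weighted_mode(votes):
    """WeightedMode: the class with the largest total weight;
    the risk minimizer for the 0/1 loss."""
    totals = {}
    for label, w in votes:
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

def weighted_median(votes):
    """WeightedMedian for ordinal classes: walk the classes in their
    natural order and return the first whose cumulative weight reaches
    half the total; the risk minimizer for the absolute error abs(i - j)."""
    votes = sorted(votes)  # sort by class rank
    half = sum(w for _, w in votes) / 2.0
    cum = 0.0
    for label, w in votes:
        cum += w
        if cum >= half:
            return label
```

On the votes [(1, 0.4), (3, 0.3), (5, 0.3)] the weighted mode returns 1 (the largest single weight), whereas the weighted median returns 3, the better choice when mispredicting by several ranks costs more than mispredicting by one.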

Website