“Machine Learning for Data Streams with Practical Examples in MOA” Book

https://moa.cms.waikato.ac.nz/book/

The new book “Machine Learning for Data Streams with Practical Examples in MOA”, published by MIT Press, presents algorithms and techniques used in data stream mining and real-time analytics. Taking a hands-on approach, the book demonstrates the techniques using MOA (Massive Online Analysis), allowing readers to try out the techniques after reading the explanations.

The book first offers a brief introduction to the topic, covering big data mining, basic methodologies for mining data streams, and a simple example of MOA. More detailed discussions follow, with chapters on sketching techniques, change, classification, ensemble methods, regression, clustering, and frequent pattern mining. Most of these chapters include exercises, an MOA-based lab session, or both. Finally, the book discusses the MOA software, covering the MOA graphical user interface, the command line, use of its API, and the development of new methods within MOA. The book will be an essential reference for readers who want to use data stream mining as a tool, researchers in innovation or data stream mining, and programmers who want to create new algorithms for MOA.

Posted by moa

MOA with Apache Flink

by Christophe Salperwyck on March 25, 2018 using MOA 2017.06.

Introduction

Apache Flink is a scalable stream processing engine, but it doesn’t support data stream mining (it only has a batch machine learning library: FlinkML). MOA provides many data stream mining algorithms, but it is not designed for distributed processing and has no stream processing engine of its own.

Apache SAMOA provides a generic way to perform distributed stream mining on top of stream processing engines (Storm, S4, Samza, Flink, etc.). For specific use cases, however, Flink can be used directly with MOA to take full advantage of Flink’s internal functions.

In this post we present two examples of how to use MOA with Flink:

  1. Split the data into train and test streams in Flink, push the learned model periodically, and use a Flink window for evaluation
  2. Implement the OzaBag logic (weighting the instances with a Poisson distribution) directly in Flink

Data used

In both scenarios the data is generated using the MOA RandomRBF generator.

// create the generator
DataStreamSource<Example<Instance>> rrbfSource = env.addSource(new RRBFSource());
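
The RRBFSource class comes from the example code on GitHub; below is a minimal sketch of what such a source could look like, wrapping MOA’s RandomRBFGenerator in a Flink SourceFunction (the details here are an assumption, not the repository code):

import com.yahoo.labs.samoa.instances.Instance;
import moa.core.Example;
import moa.streams.generators.RandomRBFGenerator;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class RRBFSource implements SourceFunction<Example<Instance>> {

  private static final long serialVersionUID = 1L;
  private volatile boolean running = true;

  @Override
  public void run(SourceContext<Example<Instance>> ctx) throws Exception {
    // create the MOA generator inside run() so it does not need to be serialized
    RandomRBFGenerator generator = new RandomRBFGenerator();
    generator.prepareForUse();
    while (running && generator.hasMoreInstances()) {
      ctx.collect(generator.nextInstance()); // emit one example at a time
    }
  }

  @Override
  public void cancel() {
    running = false;
  }
}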

Train/test, push Model and Evaluation within Flink

The full code for this example is available on GitHub.

Train/test split

The data stream is split into two streams using random sampling:

  • Train: to build an incremental Decision Tree (Hoeffding tree)
  • Test: to evaluate the performance of the classifier
 
// split the stream into train and test streams
SplitStream<Example<Instance>> trainAndTestStream = rrbfSource.split(new RandomSamplingSelector(0.02));
DataStream<Example<Instance>> testStream = trainAndTestStream.select(RandomSamplingSelector.TEST);
DataStream<Example<Instance>> trainStream = trainAndTestStream.select(RandomSamplingSelector.TRAIN);
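
A sketch of what the RandomSamplingSelector could look like (an assumption; the actual class is in the repository): a Flink OutputSelector that routes each example to the test output with the given probability, and to the train output otherwise.

import java.util.Collections;
import java.util.Random;

import com.yahoo.labs.samoa.instances.Instance;
import moa.core.Example;
import org.apache.flink.streaming.api.collector.selector.OutputSelector;

public class RandomSamplingSelector implements OutputSelector<Example<Instance>> {

  public static final String TEST = "test";
  public static final String TRAIN = "train";

  private static final long serialVersionUID = 1L;
  private final double testFraction;
  private final Random random = new Random();

  public RandomSamplingSelector(double testFraction) {
    this.testFraction = testFraction; // e.g. 0.02 routes ~2% of the examples to the test stream
  }

  @Override
  public Iterable<String> select(Example<Instance> value) {
    return Collections.singletonList(random.nextDouble() < testFraction ? TEST : TRAIN);
  }
}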

Model update

The model is updated periodically using a Flink CoFlatMapFunction. The Flink documentation describes exactly what we want to do:

An example for the use of connected streams would be to apply rules that change over time onto elements of a stream. One of the connected streams has the rules, the other stream the elements to apply the rules to. The operation on the connected stream maintains the current set of rules in the state. It may receive either a rule update (from the first stream) and update the state, or a data element (from the second stream) and apply the rules in the state to the element. The result of applying the rules would be emitted.

A decision tree can be seen as a set of rules, so it fits perfectly with their example :-).

import com.yahoo.labs.samoa.instances.Instance;
import moa.classifiers.Classifier;
import moa.classifiers.functions.NoChange;
import moa.core.Example;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class ClassifyAndUpdateClassifierFunction implements CoFlatMapFunction<Example<Instance>, Classifier, Boolean> {

  private static final long serialVersionUID = 1L;
  // default classifier, used until the first trained model arrives
  private Classifier classifier = new NoChange();

  @Override
  public void flatMap1(Example<Instance> value, Collector<Boolean> out) throws Exception {
    // predict on the test stream: emit whether the prediction was correct
    out.collect(classifier.correctlyClassifies(value.getData()));
  }

  @Override
  public void flatMap2(Classifier classifier, Collector<Boolean> out) throws Exception {
    // update the classifier when a new version is sent
    this.classifier = classifier;
  }
}
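
A sketch of how the two streams might be connected (the variable names are assumptions; modelStream carries the periodic Classifier updates emitted by the training side):

DataStream<Boolean> predictions = testStream
    .connect(modelStream)
    .flatMap(new ClassifyAndUpdateClassifierFunction());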

To avoid sending a new model after every training example, the updateSize parameter controls how often the updated model is emitted.

nbExampleSeen++;
classifier.trainOnInstance(record);
// emit the updated model only every updateSize examples
if (nbExampleSeen % updateSize == 0) {
  collector.collect(classifier);
}

Performance evaluation

The performance is evaluated using Flink’s aggregate window function, which computes the accuracy incrementally.

public class PerformanceFunction implements AggregateFunction<Boolean, AverageAccumulator, Double> {
...
  @Override
  public AverageAccumulator add(Boolean value, AverageAccumulator accumulator) {
    // accumulate 1 for a correct prediction and 0 otherwise; the average is the accuracy
    accumulator.add(value ? 1 : 0);
    return accumulator;
  }
...
}
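
A sketch of how the evaluation might be wired up (assumed names, consistent with the TriggerWindow operator visible in the log below): accuracy is aggregated over tumbling count windows of 10,000 predictions.

predictions
    .countWindowAll(10000)
    .aggregate(new PerformanceFunction())
    .print();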

Output

The performance of the model is output periodically, each time 10,000 examples have been tested.

Connected to JobManager at Actor[akka://flink/user/jobmanager_1#-511692578] with leader session id 2639af1e-4498-4bd9-a48b-673fa21529f5.
02/20/2018 16:53:28 Job execution switched to status RUNNING.
02/20/2018 16:53:28 Source: Custom Source -> Process(1/1) switched to SCHEDULED
02/20/2018 16:53:28 Co-Flat Map(1/1) switched to SCHEDULED
02/20/2018 16:53:28 TriggerWindow(GlobalWindows(), AggregatingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer@463338d7, aggFunction=moa.flink.traintest.PerformanceFunction@5f2050f6}, PurgingTrigger(CountTrigger(10000)), AllWindowedStream.aggregate(AllWindowedStream.java:475)) -> Sink: Unnamed(1/1) switched to SCHEDULED
02/20/2018 16:53:28 Source: Custom Source -> Process(1/1) switched to DEPLOYING
02/20/2018 16:53:28 Co-Flat Map(1/1) switched to DEPLOYING
02/20/2018 16:53:28 TriggerWindow(GlobalWindows(), AggregatingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer@463338d7, aggFunction=moa.flink.traintest.PerformanceFunction@5f2050f6}, PurgingTrigger(CountTrigger(10000)), AllWindowedStream.aggregate(AllWindowedStream.java:475)) -> Sink: Unnamed(1/1) switched to DEPLOYING
02/20/2018 16:53:28 Source: Custom Source -> Process(1/1) switched to RUNNING
02/20/2018 16:53:28 TriggerWindow(GlobalWindows(), AggregatingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer@463338d7, aggFunction=moa.flink.traintest.PerformanceFunction@5f2050f6}, PurgingTrigger(CountTrigger(10000)), AllWindowedStream.aggregate(AllWindowedStream.java:475)) -> Sink: Unnamed(1/1) switched to RUNNING
02/20/2018 16:53:28 Co-Flat Map(1/1) switched to RUNNING
0.8958
0.9244
0.9271
0.9321
0.9342
0.9345
0.9398
0.937
0.9386
0.9415
0.9396
0.9426
0.9429
0.9427
0.9454
...

The performance of our model increases over time, as expected for an incremental machine learning algorithm!

OzaBag in Flink

Data

In this example, the data is also generated using the MOA RandomRBF generator. Multiple generators with the same seed run in parallel so that each classifier receives the same examples. Then, to apply “online bagging” (OzaBag), each classifier puts a different weight on each instance.

// OzaBag: weight each instance by k ~ Poisson(1), skipping it when k == 0
int k = MiscUtils.poisson(1.0, random);
if (k > 0) {
  record.setWeight(record.weight() * k);
  ht.trainOnInstance(record);
}

Model update

The model is updated every minute using a Flink ProcessFunction. In our case we save the model to disk, but it could be pushed to a web service or some other serving layer.

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> collector) throws Exception {
  // serialize the model as text so that we can have a look at what we've got
  StringBuilder sb = new StringBuilder();
  synchronized (ht) {
    ht.getModelDescription(sb, 2);
  }
  // just collect the name of the file for logging purposes
  collector.collect("Saving model on disk, file " +
    // save the model in a file, but we could push to our online system!
    Files.write(
      Paths.get("model/FlinkMOA_" + timestamp + "_seed-" + seed + ".model"),
      sb.toString().getBytes(),
      StandardOpenOption.CREATE_NEW)
    .toString());
  // we can trigger the next event now
  timerOn = false;
}
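
A sketch of the matching processElement side (an assumption, not the exact repository code); note that Flink timers require a keyed stream, so the examples are presumably keyed, e.g. by generator seed, before this function:

@Override
public void processElement(Example<Instance> value, Context ctx, Collector<String> collector) throws Exception {
  synchronized (ht) {
    ht.trainOnInstance(value.getData()); // train the Hoeffding tree
  }
  if (!timerOn) {
    // schedule the next model dump one minute from now
    ctx.timerService().registerProcessingTimeTimer(
        ctx.timerService().currentProcessingTime() + 60000);
    timerOn = true;
  }
}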

Output with a parallelism of 4


Connected to JobManager at Actor[akka://flink/user/jobmanager_1#-425324055] with leader session id 5f617fa1-7c02-4288-b305-7d97413ee2d8.
02/21/2018 16:51:05 Job execution switched to status RUNNING.
02/21/2018 16:51:05 Source: Random RBF Source(1/4) switched to SCHEDULED
02/21/2018 16:51:05 Source: Random RBF Source(2/4) switched to SCHEDULED
...
1> HT with OzaBag seed 3 already processed 500000 examples
3> HT with OzaBag seed 2 already processed 500000 examples
...
2> HT with OzaBag seed 0 already processed 1500000 examples
4> HT with OzaBag seed 1 already processed 1500000 examples
1> HT with OzaBag seed 3 already processed 2000000 examples
3> HT with OzaBag seed 2 already processed 2000000 examples
2> Saving model on disk, file model\FlinkMOA_1519228325596_seed-0.model
4> Saving model on disk, file model\FlinkMOA_1519228325596_seed-1.model
1> Saving model on disk, file model\FlinkMOA_1519228325597_seed-3.model
3> Saving model on disk, file model\FlinkMOA_1519228325603_seed-2.model
2> HT with OzaBag seed 0 already processed 2000000 examples
1> HT with OzaBag seed 3 already processed 2500000 examples
4> HT with OzaBag seed 1 already processed 2000000 examples
...

Next step

The next step would be to assess the performance of our online bagging. It is a bit tricky here, as we need to collect all the classifiers into a bag, and we need to know which one to replace in the bag when an update arrives.

In the meantime you can use MOA to check the performance!

Posted by moa

Adaptive Random Forest

Adaptive Random Forest (ARF for short) [1] is an adaptation of the original Random Forest algorithm [2], which has been successfully applied to a multitude of machine learning tasks. In layman’s terms the original Random Forest algorithm is an ensemble of decision trees, which are trained using bagging and where the node splits are limited to a random subset of the original set of features. The “Adaptive” part of ARF comes from its mechanisms to adapt to different kinds of concept drifts, given the same hyper-parameters.

Specifically, the three most important aspects of the ARF algorithm are:

  1. It adds diversity through resampling (“bagging”);
  2. It adds diversity through randomly selecting subsets of features for node splits (See moa.classifiers.trees.ARFHoeffdingTree.java);
  3. It has one drift detector and one warning detector per base tree, which cause selective resets in response to drifts. It also allows training background trees, which start training when a warning is detected and replace the active tree if the warning escalates to a drift.

ARF was designed to be “embarrassingly” parallel, in other words, there are no dependencies between trees. This allowed an easy parallel implementation of the algorithm in MOA (see parameter numberOfJobs).
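
Since ARF is available as a regular MOA classifier (moa.classifiers.meta.AdaptiveRandomForest), it can also be configured programmatically; a minimal sketch, where the option string simply mirrors the CLI flags described below:

import moa.classifiers.meta.AdaptiveRandomForest;

public class ARFExample {
  public static void main(String[] args) {
    AdaptiveRandomForest arf = new AdaptiveRandomForest();
    // -s: ensemble size, -j: number of training threads (see the parameter list below)
    arf.getOptions().setViaCLIString("-s 100 -j 4");
    arf.prepareForUse();
    // arf can now be used in a test-then-train loop via
    // arf.correctlyClassifies(...) and arf.trainOnInstance(...)
  }
}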

Parameters

Currently, the ARF is configurable using the following parameters in MOA:

  • treeLearner (-l). ARFHoeffdingTree is the base tree learner; the important non-standard default is a grace period of 50, which causes trees to start splitting earlier.
  • ensembleSize (-s). The ensemble size, from 10 to 100 (default is 10, but the model tends to improve as more trees are added).
  • mFeaturesMode (-o). The mode that defines how many features are used per split subset. The default is sqrt(M)+1, the usual choice for Random Forest algorithms; still, it is a hyper-parameter and can be tuned for better results.
  • mFeaturesPerTreeSize (-m). This controls the size of the feature subsets, depending on the mFeaturesMode. For example, if mFeaturesMode = sqrt(M)+1 then this parameter is ignored, but if mFeaturesMode = “Specified m” then the value defined for -m is used.
  • lambda (-a). Defaults to the same as leveraging bagging, i.e., lambda = 6.
  • numberOfJobs (-j). Defines the number of threads used to train the ensemble, default is 1.
  • driftDetectionMethod (-x). Best results tend to be obtained with ADWINChangeDetector; the default deltaAdwin (the ADWIN parameter) is 0.00001. Still, other drift detection methods can easily be configured and used, such as PageHinkley, DDM, EDDM, etc.
  • warningDetectionMethod (-p). This parameter controls the warning detector: whenever it triggers, a background tree starts training along with the tree where the warning was triggered. Whatever parameters are used here must be consistent with the driftDetectionMethod, i.e., the warning should always trigger before the drift, to guarantee the overall algorithm’s expected behavior. The default is ADWINChangeDetector with deltaAdwin = 0.0001; notice that this is 10x greater than the default deltaAdwin of the driftDetectionMethod.
  • disableWeightedVote (-w). If this is set, then predictions will be based on majority vote instead of weighted votes. The default is to use weighted votes, thus this is not set.
  • disableDriftDetection (-u). This controls whether drift detection is used; when set, there is no drift or warning detection and background learners are not created, independently of any other parameters. The default is to use drift detection, thus this is not set.
  • disableBackgroundLearner (-q). When set, the background learner is not used and the warning detector is ignored (there are no checks for warnings). The default is to use background learners, thus this is not set.

Comparing against the Adaptive Random Forest

A practical and simple way to compare against Adaptive Random Forest is to compare classification performance in terms of accuracy, Kappa M and Kappa T, using either the traditional immediate setting, where the true label is presented right after the instance has been used for testing, or the delayed setting, where there is a delay between the time an instance is presented and the time its true label becomes available.

In this post we present some practical and simple ways to compare against ARF; for a thorough comparison against other state-of-the-art ensemble classifiers, and a discussion of other aspects of ARF (e.g. the weighted vote), please refer to publication [1].

Evaluator task Adaptive Random Forest

In [1] the evaluator tasks used were EvaluatePrequential, EvaluatePrequentialCV, EvaluatePrequentialDelayed and EvaluatePrequentialDelayedCV. CV stands for cross-validation; more details about it can be found in [3]. In this example we only use EvaluatePrequential, as the CV versions take longer to run; however, one can easily change EvaluatePrequential to EvaluatePrequentialCV to obtain the CV results, or to EvaluatePrequentialDelayed (and EvaluatePrequentialDelayedCV) to obtain delayed labeling results.
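
For example, cross-validated results for the first command shown further down can be obtained by just changing the task name (a sketch; the options mirror the EvaluatePrequential commands below):

EvaluatePrequentialCV -l (meta.AdaptiveRandomForest) -s (ArffFileStream -f airlines.arff) -f 54000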

ARF fast and moderate

In the original paper there are two main configurations that should be useful to compare against future algorithms, identified as ARF_{moderate} and ARF_{fast}. Basically, ARF_{moderate} is the standard configuration (the default parameter values described above), while ARF_{fast} increases the deltaAdwin of both the warning and drift detectors (the default detector for both -x and -p is ADWINChangeDetector), causing more detections.

To test AdaptiveRandomForest in either the delayed or the immediate setting, execute the latest MOA jar.
You can copy and paste the following commands into the interface (right-click the configuration text edit and select “Enter configuration”).

Airlines dataset, ARF_{moderate} (default configuration) with 10 trees.
EvaluatePrequential -l (meta.AdaptiveRandomForest) -s (ArffFileStream -f airlines.arff) -e BasicClassificationPerformanceEvaluator -f 54000

Airlines dataset, ARF_{fast} (detects drifts earlier, at the cost of some false positives) with 10 trees.
EvaluatePrequential -l (meta.AdaptiveRandomForest -x (ADWINChangeDetector -a 0.001) -p (ADWINChangeDetector -a 0.01)) -s (ArffFileStream -f airlines.arff) -e BasicClassificationPerformanceEvaluator -f 54000

Airlines dataset, LeveragingBagging with 10 trees
EvaluatePrequential -l meta.LeveragingBag -s (ArffFileStream -f airlines.arff) -e BasicClassificationPerformanceEvaluator -f 54000

Explanation: all these commands execute EvaluatePrequential with the BasicClassificationPerformanceEvaluator (this is equivalent to InterleavedTestThenTrain) on the airlines dataset (-f airlines.arff). The three commands use 10 trees (-s 10 is implicit in all three cases). ARF also uses m = sqrt(total features) + 1, and the difference between ARF_{moderate} and ARF_{fast} lies in the detector parameters. Make sure to extract the airlines.arff dataset, and to set -f to its location, before executing the command. The airlines and other datasets are available here: https://github.com/hmgomes/AdaptiveRandomForest.

[1] Heitor Murilo Gomes, Albert Bifet, Jesse Read, Jean Paul Barddal, Fabricio Enembreck, Bernhard Pfahringer, Geoff Holmes, Talel Abdessalem: Adaptive random forests for evolving data stream classification. Machine Learning, DOI: 10.1007/s10994-017-5642-8, Springer, 2017.

This can be downloaded from Springer: https://rdcu.be/tsdi
The pre-print is also available here: Adaptive Random Forest pre-print

[2] Leo Breiman: Random forests. Machine Learning 45(1): 5-32 (2001).

[3] Albert Bifet, Gianmarco De Francisci Morales, Jesse Read, Geoff Holmes, Bernhard Pfahringer: Efficient Online Evaluation of Big Data Stream Classifiers. KDD 2015: 59-68

Posted by moa

New Release of MOA 17.06

We’ve made a new release of MOA 17.06

The new features of this release are:

  • SAMKnn:
    • Viktor Losing, Barbara Hammer, Heiko Wersing: KNN Classifier with Self Adjusting Memory for Heterogeneous Concept Drift. ICDM 2016: 291-300
  • Adaptive Random Forest:
    • Heitor Murilo Gomes, Albert Bifet, Jesse Read, Jean Paul Barddal, Fabricio Enembreck, Bernhard Pfahringer, Geoff Holmes, Talel Abdessalem: Adaptive random forests for evolving data stream classification. Machine Learning, Springer, 2017.
  • Blast:
    • Jan N. van Rijn, Geoffrey Holmes, Bernhard Pfahringer, Joaquin Vanschoren: Having a Blast: Meta-Learning and Heterogeneous Ensembles for Data Streams. ICDM 2015: 1003-1008
  • Prequential AUC:
    • Dariusz Brzezinski, Jerzy Stefanowski: Prequential AUC: properties of the area under the ROC curve for data streams with concept drift. Knowl. Inf. Syst. 52(2): 531-562 (2017)
  • D-Stream:
    • Yixin Chen, Li Tu: Density-based clustering for real-time stream data. KDD 2007: 133-142
  • IADEM:
    • Isvani Inocencio Frías Blanco, José del Campo-Ávila, Gonzalo Ramos-Jiménez, André Carvalho, Agustín Alejandro Ortiz Díaz, Rafael Morales Bueno: Online adaptive decision trees based on concentration inequalities. Knowl.-Based Syst. 104: 179-194 (2016)
  • RCD Classifier:
    • Gonçalves Jr, Paulo Mauricio, and Roberto Souto Maior De Barros: RCD: A recurring concept drift framework. Pattern Recognition Letters 34.9 (2013): 1018-1025.
  • ADAGrad:
    • John C. Duchi, Elad Hazan, Yoram Singer: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12: 2121-2159 (2011)
  • Dynamic Weighted Majority:
    • J. Zico Kolter, Marcus A. Maloof: Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts. Journal of Machine Learning Research 8: 2755-2790 (2007)
  • StepD:
    • Kyosuke Nishida, Koichiro Yamauchi: Detecting Concept Drift Using Statistical Testing. Discovery Science 2007: 264-269
  • LearnNSE:
    • Ryan Elwell, Robi Polikar: Incremental Learning of Concept Drift in Nonstationary Environments. IEEE Trans. Neural Networks 22(10): 1517-1531 (2011)
  • ASSETS Generator:
    • Jean Paul Barddal, Heitor Murilo Gomes, Fabrício Enembreck, Bernhard Pfahringer, Albert Bifet: On Dynamic Feature Weighting for Feature Drifting Data Streams. ECML/PKDD (2) 2016: 129-144
  • SineGenerator and MixedGenerator:
    • Joao Gama, Pedro Medas, Gladys Castillo, Pedro Pereira Rodrigues: Learning with Drift Detection. SBIA 2004: 286-295
  • AMRulesMultiLabelLearnerSemiSuper (Semi-supervised Multi-target regressor):
    • Ricardo Sousa, Joao Gama: Online Semi-supervised Learning for Multi-target Regression in Data Streams Using AMRules. IDA 2016: 123-133
  • AMRulesMultiLabelClassifier (Multi-label classifier):
    • Ricardo Sousa, Joao Gama: Online Multi-label Classification with Adaptive Model Rules. CAEPIA 2016: 58-67
  • WeightedMajorityFeatureRanking, MeritFeatureRanking and BasicFeatureRanking (Feature ranking methods)
    • Joao Duarte, Joao Gama: Feature ranking in hoeffding algorithms for regression. SAC 2017: 836-841
  • AMRulesMultiTargetRegressor (Multi-target regressor):
    • Joao Duarte, Joao Gama, Albert Bifet: Adaptive Model Rules From High-Speed Data Streams. TKDD 10(3): 30:1-30:22 (2016)

You can find the download link for this release on the MOA homepage:

MOA Machine Learning for Streams

Cheers,

The MOA Team

Posted by moa

New Release of MOA 16.04

We’ve made a new release of MOA 16.04.

The new features of this release are:

  • BICO: BIRCH Meets Coresets for k-Means Clustering.
    • Hendrik Fichtenberger, Marc Gillé, Melanie Schmidt,
      Chris Schwiegelshohn, Christian Sohler: ESA 2013: 481-492 (2013)
      https://ls2-www.cs.tu-dortmund.de/bico/
  • Updates:
    • MultiLabel and MultiTarget methods

There are these important changes since the MOA 2015.11 release:

  • Use Examples instead of Instances to be able to deal easily with unstructured data
  • Use Apache Samoa instances instead of WEKA instances
  • Use the javacliparser library

You can find the download link for this release on the MOA homepage:

MOA Machine Learning for Streams

Cheers,

The MOA Team

Posted by moa

Prequential Cross-Validation Evaluation

In data stream classification, the most used evaluation is the prequential one, where instances are first used to test and then to train. However, a weakness of prequential evaluation compared to cross-validation is that it runs only one experiment.

We are proud to announce that MOA now contains a new prequential cross-validation evaluation, combining the advantages of prequential evaluation with those of cross-validation. The new task is called EvaluatePrequentialCV.
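
For example, a run could look like this (a sketch; the learner, stream and instance limit are just illustrative choices):

EvaluatePrequentialCV -l trees.HoeffdingTree -s (generators.RandomRBFGenerator) -i 1000000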

Other new techniques added are:

  • AdwinClassificationPerformanceEvaluator: a new performance evaluator that uses an adaptive-size sliding window to estimate accuracy in real time.
  • Kappa M measure: a new measure that compares against a majority-class classifier and is more appropriate in the streaming setting than the standard Kappa statistic.

Reference

Albert Bifet, Gianmarco De Francisci Morales, Jesse Read, Geoff Holmes, Bernhard Pfahringer: Efficient Online Evaluation of Big Data Stream Classifiers. KDD 2015: 59-68

Posted by moa in MOA Users

New Release of MOA 15.11

We’ve made a new release of MOA 15.11.

The new features of this release are:

  • iSOUPTree.
    • Aljaz Osojnik, Pance Panov, Saso Dzeroski: Multi-label Classification via Multi-target Regression on Data Streams. Discovery Science 2015: 170-185
  • SEEDChangeDetector.
    • David Tse Jung Huang, Yun Sing Koh, Gillian Dobbie, Russel Pears: Detecting Volatility Shift in Data Streams. ICDM 2014: 863-868
  • Paired Learners for Concept Drift, by Paulo Gonçalves
    • Stephen H. Bach, Marcus A. Maloof: Paired Learners for Concept Drift. ICDM 2008: 23-32
  • Updates:
    • MultiLabel and MultiTarget methods, FIMT-DD and ORTO

There are these important changes in this new release:

  • Use Examples instead of Instances to be able to deal easily with unstructured data
  • Use Apache Samoa instances instead of WEKA instances
  • Use the javacliparser library

You can find the download link for this release on the MOA homepage:

MOA Machine Learning for Streams

Cheers,

The MOA Team

Posted by moa

New Release of MOA 14.11

We’ve made a new release of MOA 14.11.

The new features of this release are:

  • Lazy kNN methods.
    • Albert Bifet, Bernhard Pfahringer, Jesse Read, Geoff Holmes: Efficient data stream classification via probabilistic adaptive windows. SAC 2013: 801-806
  • SGDMultiClass for multi-class SGD learning.
  • OnlineSmoothBoost
    • Shang-Tse Chen, Hsuan-Tien Lin, Chi-Jen Lu: An Online Boosting Algorithm with Theoretical Justifications. ICML 2012
  • ReplacingMissingValuesFilter: a filter to replace missing values by Manuel Martin Salvador.
  • HDDM Concept Drift detector
    • I. Frias-Blanco, J. del Campo-Avila, G. Ramos-Jimenez, R. Morales-Bueno, A. Ortiz-Diaz, and Y. Caballero-Mota, Online and non-parametric drift detection methods based on Hoeffding’s bound, IEEE Transactions on Knowledge and Data Engineering, 2014.
  • SeqDriftChangeDetector Concept Drift detector
    • Pears, R., Sakthithasan, S., & Koh, Y. (2014). Detecting concept change in dynamic data streams. Machine Learning, 97(3), 259-293.
  • Updates:
    • SGD, HoeffdingOptionTree, HAT, FIMTDD, Change Detectors, and DACC

You can find the download link for this release on the MOA homepage:

MOA Machine Learning for Streams

Cheers,

The MOA Team

Posted by admin

Using MOA from ADAMS workflow engine

MOA and WEKA are powerful tools for performing data mining analysis tasks. Usually, in real applications and professional settings, data mining processes are complex and consist of several steps, which can be seen as a workflow. Instead of implementing a program in Java, a professional data miner will build a solution using a workflow, which is much easier for non-programmer users to maintain.

The Advanced Data mining And Machine learning System (ADAMS) is a novel, flexible workflow engine aimed at quickly building and maintaining real-world, complex knowledge workflows.

The core of ADAMS is the workflow engine, which follows the philosophy of less is more. Instead of letting the user place operators (or actors, in ADAMS terms) on a canvas and then manually connect inputs and outputs, ADAMS uses a tree-like structure. This structure and the control actors define how the data flows in the workflow; no explicit connections are necessary. The tree-like structure stems from the internal object representation and the nesting of sub-actors within actor handlers.

This figure shows the ADAMS Flow editor and the adams-moa-classifier-evaluation flow.

For more information, take a look at the following tutorial: Tutorial 4.

Posted by moa

Using MOA’s API with Scala


As Scala runs in the Java Virtual Machine, it is very easy to use MOA objects from Scala.

Let’s see an example: the Java code of the first example in Tutorial 2.
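
That example is the classic test-then-train loop; here is a minimal sketch of it, assuming the weka.core.Instance-based MOA API used at the time (details may differ from the tutorial code):

import moa.classifiers.Classifier;
import moa.classifiers.trees.HoeffdingTree;
import moa.streams.generators.RandomRBFGenerator;
import weka.core.Instance;

public class Experiment {
  public static void main(String[] args) {
    // create the learner and the stream, and prepare them for use
    Classifier learner = new HoeffdingTree();
    RandomRBFGenerator stream = new RandomRBFGenerator();
    stream.prepareForUse();
    learner.setModelContext(stream.getHeader());
    learner.prepareForUse();

    // test-then-train loop: first predict on each instance, then learn from it
    int numInstances = 1000000;
    int numberSamplesCorrect = 0;
    int numberSamples = 0;
    while (stream.hasMoreInstances() && numberSamples < numInstances) {
      Instance instance = stream.nextInstance();
      if (learner.correctlyClassifies(instance)) {
        numberSamplesCorrect++;
      }
      learner.trainOnInstance(instance);
      numberSamples++;
    }
    double accuracy = 100.0 * numberSamplesCorrect / numberSamples;
    System.out.println(numberSamples + " instances processed with " + accuracy + "% accuracy");
  }
}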

In Scala, the same code is almost a line-for-line transliteration (val declarations, type inference, and no need for semicolons).

As you can see, it is very easy to use MOA objects from Scala.

Posted by moa

Using MOA with Scala and its Interactive Shell

Scala is a powerful language with functional programming capabilities. As it runs in the Java Virtual Machine, it is very easy to use MOA objects inside Scala.

Let’s see an example using the Scala interactive interpreter. First we need to start it, telling it where the MOA library is:

scala -cp moa.jar

Welcome to Scala version 2.9.2.
Type in expressions to have them evaluated.
Type :help for more information.

Let’s run a very simple experiment: using a decision tree (Hoeffding Tree) with data generated from an artificial stream generator (RandomRBFGenerator).

We start by importing the classes that we need, and defining the stream and the learner.

scala> import moa.classifiers.trees.HoeffdingTree
import moa.classifiers.trees.HoeffdingTree

scala> import moa.streams.generators.RandomRBFGenerator
import moa.streams.generators.RandomRBFGenerator

scala> val learner = new HoeffdingTree();
learner: moa.classifiers.trees.HoeffdingTree =
Model type: moa.classifiers.trees.HoeffdingTree
model training instances = 0
model serialized size (bytes) = -1
tree size (nodes) = 0
tree size (leaves) = 0
active learning leaves = 0
tree depth = 0
active leaf byte size estimate = 0
inactive leaf byte size estimate = 0
byte size estimate overhead = 0
Model description:
Model has not been trained.

scala> val stream = new RandomRBFGenerator();
stream: moa.streams.generators.RandomRBFGenerator =

Now, we need to initialize the stream and the classifier:

scala> stream.prepareForUse()
scala> learner.setModelContext(stream.getHeader())
scala> learner.prepareForUse()

Now, let’s load an instance from the stream, and use it to train the decision tree:

scala> import weka.core.Instance
import weka.core.Instance

scala> val instance = stream.nextInstance()
instance: weka.core.Instance = 0.210372,1.009586,0.0919,0.272071,
0.450117,0.226098,0.212286,0.37267,0.583146,0.297007,class2

scala> learner.trainOnInstance(instance)

And finally, let’s use it to do a prediction.

scala> learner.getVotesForInstance(instance)
res9: Array[Double] = Array(0.0, 0.0)

scala> learner.correctlyClassifies(instance)
res7: Boolean = false

As shown in this example, it is very easy to use the Scala interpreter to run MOA interactively.

Posted by moa

OpenML: exploring machine learning better, together.

www.openml.org

Now you can use MOA classifiers inside OpenML. OpenML is a website where researchers can share their datasets, implementations and experiments in such a way that they can easily be found and reused by others.

OpenML engenders a novel, collaborative approach to experimentation with important benefits. First, many questions about machine learning algorithms won’t require the laborious setup of new experiments: they can be answered on the fly by querying the combined results of thousands of studies on all available datasets. OpenML also keeps track of experimentation details, ensuring that we can easily reproduce experiments later on, and confidently build upon earlier work. Reusing experiments also allows us to run large-scale machine learning studies, yielding more generalizable results with less effort. Finally, beyond the traditional publication of algorithms in journals, often in a highly summarized form, OpenML allows researchers to share all code and results that are possibly of interest to others, which may boost their visibility, speed up further research and applications, and engender new collaborations.

Posted by moa

SAMOA: Scalable Advanced Massive Online Analysis

https://www.samoa-project.net/

SAMOA is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. It is a project that started at Yahoo Labs Barcelona.

SAMOA enables the development of new ML algorithms without dealing with the complexity of the underlying stream processing engines (SPEs, such as Apache Storm and Apache S4). SAMOA users can develop distributed streaming ML algorithms once and execute them in multiple SPEs: code the algorithms once, run them in multiple engines.

To use MOA methods inside SAMOA, take a look at

https://github.com/yahoo/samoa/wiki/SAMOA-for-MOA-users

Posted by moa

RMOA: Massive online data stream classifications with R & MOA

https://bnosac.be/index.php/blog/32-rmoa-massive-online-data-stream-classifications-with-r-a-moa

For R users who work with a lot of data or run into RAM issues when building models on large datasets, MOA, and data streams in general, have some nice features. Namely:
  1. It uses a limited amount of memory, which means no RAM issues when building models.
  2. It processes one example at a time, and runs over it only once.
  3. It works incrementally, so that a model is ready to be used for prediction at any time.

Unfortunately MOA is written in Java and is not easily accessible to R users. For users mostly interested in clustering, the stream package already facilitates this (this blog item gave an example of using ff alongside the stream package). In our day-to-day use cases, classification is a more common request, and the stream package only supports clustering. Hence the decision to make the classification algorithms of MOA easily available to R users as well. For this the RMOA package was created; it is available on GitHub (https://github.com/jwijffels/RMOA).

Posted by moa

The streams Framework

https://www.jwall.org/streams/

The streams framework is a Java implementation of a simple stream processing
environment by Christian Bockermann and Hendrik Blom at TU Dortmund University. It aims at providing a clean and easy-to-use Java-based platform to process streaming data.

The core module of the streams library is a thin API layer of interfaces and
classes that reflect a high-level view of streaming processes. This API serves
as a basis for implementing custom processors and providing services with the
streams library.

Figure 1: Components of the streams library.

Figure 1 shows the components of the streams library. The binding glue element is a thin API layer that attaches to a runtime provided as a separate module, or can be embedded into existing code.

Process Design with JavaBeans

The streams library promotes simple software design patterns such as JavaBean
conventions and dependency injection to allow for a quick setup of streaming
processes using simple XML files.

As shown in Figure 2, the idea of the streams library is to provide a simple
runtime environment that lets users define streaming processes in XML files,
with a close relation to the implementing Java classes.

Figure 2: XML process definitions mapped to a runtime environment, using
stream-api components and other libraries.

Based on the conventions and patterns used, components of the streams library are simple Java classes. Following the basic design patterns of the streams library allows custom classes to be added to streaming processes without much trouble.
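
For illustration, a custom processor is just a Java class implementing the streams Processor interface; a minimal sketch (the attribute names here are made up):

import stream.Data;
import stream.Processor;

public class SquareValue implements Processor {

  @Override
  public Data process(Data item) {
    // read the attribute "x" from the data item and add its square as "x2"
    Object value = item.get("x");
    if (value instanceof Number) {
      double x = ((Number) value).doubleValue();
      item.put("x2", x * x);
    }
    return item;
  }
}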

Posted by moa

New Release of MOA 14.04

We’ve made a new release of MOA 14.04.

The new features of this release are:

  • Change detection Tab
    • Albert Bifet, Jesse Read, Bernhard Pfahringer, Geoff Holmes, Indre Zliobaite: CD-MOA: Change Detection Framework for Massive Online Analysis. IDA 2013: 92-103
  • New Tutorial on Clustering by Frederic Stahl.
  • New version of Adaptive Model Rules for regression
    • Ezilda Almeida, Carlos Abreu Ferreira, João Gama: Adaptive Model Rules from Data Streams. ECML/PKDD (1) 2013: 480-492
  • AnyOut Outlier Detector
    • Ira Assent, Philipp Kranen, Corinna Baldauf, Thomas Seidl: AnyOut: Anytime Outlier Detection on Streaming Data. DASFAA (1) 2012: 228-242
  • ORTO Regression Tree with Options
    • Elena Ikonomovska, João Gama, Bernard Zenko, Saso Dzeroski: Speeding-Up Hoeffding-Based Regression Trees With Options. ICML 2011: 537-544
  • Online Accuracy Updated Ensemble
    • Dariusz Brzezinski, Jerzy Stefanowski: Combining block-based and online methods in learning ensembles from concept drifting data streams. Inf. Sci. 265: 50-67 (2014)
  • Anticipative and Dynamic Adaptation to Concept Changes Ensemble
    • Ghazal Jaber, Antoine Cornuéjols, Philippe Tarroux: A New On-Line Learning Method for Coping with Recurring Concepts: The ADACC System. ICONIP (2) 2013: 595-604

You can find the download link for this release on the MOA homepage:

MOA Machine Learning for Streams

Cheers,

The MOA Team

Posted by moa

New release of MOA 13.11

We’ve made a new release of MOA 13.11.

The new feature of this release is:

  • Temporal dependency evaluation
    • Albert Bifet, Jesse Read, Indre Zliobaite, Bernhard Pfahringer, Geoff Holmes: Pitfalls in Benchmarking Data Stream Classification and How to Avoid Them. ECML/PKDD (1) 2013: 465-479

You can find the download link for this release on the MOA homepage:

MOA Machine Learning for Streams

Cheers,

The MOA Team

Posted by moa

Temporal Dependency in Classification

The paper presented at ECML-PKDD 2013, titled “Pitfalls in benchmarking data stream classification and how to avoid them”, showed that classifying data streams has an important temporal component, which we are currently not considering in the evaluation of data-stream classifiers. A very simple classifier that exploits this temporal component, the no-change classifier that predicts using only the last class seen, can outperform current state-of-the-art classifiers on some real-world datasets. MOA can now evaluate data streams considering this temporal component using:

  • the NoChange classifier
  • the TemporallyAugmentedClassifier classifier
  • the new evaluation measure Kappa+ (Kappa Temporal), which provides a more accurate gauge of classifier performance
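
For example, the no-change baseline can be evaluated like any other classifier (a sketch; the electricity dataset is the one used in the paper, and the file path is an assumption):

EvaluatePrequential -l functions.NoChange -s (ArffFileStream -f elecNormNew.arff)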

Posted by moa

New recommender algorithms and evaluation

MOA has been extended in order to provide an interface to develop and visualize online recommender algorithms.

This is a simple example to show the functionality of the EvaluateOnlineRecommender task in MOA.

This task takes a rating predictor and a dataset (each training instance being a [user, item, rating] triplet) and evaluates how well the model predicts the ratings, given the user and item, as more and more instances are processed. This is similar to the online scenario of a recommender system, where new ratings from users for items arrive constantly, and the system has to predict unrated items for the user in order to know which ones to recommend.

Let’s start by opening the MOA user interface. In the Classification tab, click on Configure task, and select from the list the ‘class moa.tasks.EvaluateOnlineRecommender’.

Now we need to select which dataset we want to process, so we click the corresponding button to edit that option.

On the list, we can choose among several publicly available datasets. For this example, we will use the MovieLens 1M dataset, which can be downloaded from https://grouplens.org/datasets/movielens/. Finally, we select the file where the input data is located.

Once the dataset is configured, the next step is to choose which ratingPredictor to evaluate.

For the moment, there are just two available: BaselinePredictor and BRISMFPredictor. The first is a very simple rating predictor; the second is an implementation of a factorization algorithm described in Scalable Collaborative Filtering Approaches for Large Recommender Systems (Gábor Takács, István Pilászy, Bottyán Németh, and Domonkos Tikk). We choose the latter,

and find the following parameters:

  • features – the number of features to be trained for each user and item
  • learning rate – the learning rate of the gradient descent algorithm
  • regularization ratio – the regularization ratio to be used in the Tikhonov regularization (see the update sketch after this list)
  • iterations – the number of iterations to be used when retraining user and item features (online training).
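
For intuition, these parameters plug into the usual regularized stochastic gradient descent update for matrix factorization (a sketch of the BRISMF-style update, not MOA’s exact code), where p_u and q_i are the feature vectors of user u and item i:

e_ui = r_ui − p_u · q_i
p_u ← p_u + learning_rate · (e_ui · q_i − regularization_ratio · p_u)
q_i ← q_i + learning_rate · (e_ui · p_u − regularization_ratio · q_i)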

We can leave the default parameters for this dataset.

Going back to the configuration of the task,

we have the sampleFrequency parameter, which defines how often the precision measures are taken, and the taskResultFile parameter, which allows us to save the output of the task to a file. We can leave the default values for both.

Now the task is configured, and we only have to run it:

As the task progresses, we can see in the preview box the RMSE of the predictor, computed from the first instance up to the ones processed so far.

When the task finishes, we can see the final results: the RMSE of the predictor at each measured point.

Posted by moa

New release of MOA 13.08

We’ve made a new release of MOA 13.08.

The new features of this release are:

  • new outlier detection tab
    • Dimitrios Georgiadis, Maria Kontaki, Anastasios Gounaris, Apostolos N. Papadopoulos, Kostas Tsichlas, Yannis Manolopoulos: Continuous outlier detection in data streams: an extensible framework and state-of-the-art algorithms. SIGMOD Conference 2013: 1061-1064
  • new regression tab
  • FIMT-DD regression tree
    • Elena Ikonomovska, João Gama, Saso Dzeroski: Learning model trees from evolving data streams. Data Min. Knowl. Discov. 23(1): 128-168 (2011)
  • Adaptive Model Rules for regression
    • Ezilda Almeida, Carlos Abreu Ferreira, João Gama: Adaptive Model Rules from Data Streams. ECML/PKDD (1) 2013: 480-492
  • a recommender system based on BRISMFPredictor
    • Gábor Takács, István Pilászy, Bottyán Németh, Domonkos Tikk: Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656 (2009)
  • clustering updates

You can find the download link for this release on the MOA homepage:

MOA Machine Learning for Streams

Cheers,

The MOA Team

Posted by admin