Classifiers

The classifiers implemented in MOA are the following:

  • Bayesian classifiers
    • Naive Bayes
    • Naive Bayes Multinomial
  • Decision tree classifiers
    • Decision Stump
    • Hoeffding Tree
    • Hoeffding Option Tree
    • Hoeffding Adaptive Tree
  • Meta classifiers
    • Bagging
    • Boosting
    • Bagging using ADWIN
    • Bagging using Adaptive-Size Hoeffding Trees
    • Perceptron Stacking of Restricted Hoeffding Trees
    • Leveraging Bagging
  • Function classifiers
    • Perceptron
    • SGD: Stochastic Gradient Descent
    • SPegasos
  • Drift classifiers
    • SingleClassifierDrift

Classifiers for static streams

MajorityClass

Always predicts the class that has been observed most frequently in the training data.

Parameters:

  • -r : Seed for random behaviour of the classifier

NaiveBayes

Performs classic Bayesian prediction while making the naive assumption that all inputs are independent.
Naive Bayes is a classifier algorithm known for its simplicity and low computational cost. Given n different classes, the trained Naive Bayes classifier predicts, for every unlabelled instance I, the class C to which it most probably belongs according to the learned model.
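
Since the class-conditional counts can be updated one instance at a time, training is naturally incremental. The following is a minimal sketch of the idea for nominal attributes only; the class and method names are illustrative and do not correspond to MOA's implementation, and the Laplace smoothing is an assumption.

from collections import defaultdict
import math

class NaiveBayesSketch:
    """Incremental Naive Bayes over nominal attributes (illustrative sketch)."""

    def __init__(self):
        self.n = 0                            # total examples seen
        self.class_counts = defaultdict(int)  # N(class = c)
        self.attr_counts = defaultdict(int)   # N(attribute i = v, class = c)
        self.attr_values = defaultdict(set)   # distinct values seen per attribute

    def train(self, x, y):
        """x is a list of nominal attribute values, y the class label."""
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(i, v, y)] += 1
            self.attr_values[i].add(v)

    def predict(self, x):
        """Return the class maximising log P(c) + sum_i log P(x_i | c)."""
        best, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            score = math.log(nc / self.n)
            for i, v in enumerate(x):
                # Laplace smoothing; the extra +1 covers unseen values.
                num = self.attr_counts[(i, v, c)] + 1
                den = nc + len(self.attr_values[i]) + 1
                score += math.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best

Numeric attributes would additionally need a per-class estimator (for example a Gaussian approximation); the sketch above handles only nominal values.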

Parameters:

  • -r : Seed for random behaviour of the classifier

DecisionStump

Decision tree of one level.
Parameters:

  • -g : The number of instances to observe between model changes
  • -b : Only allow binary splits
  • -c : Split criterion to use. Example : InfoGainSplitCriterion
  • -r : Seed for random behaviour of the classifier

HoeffdingTree

Decision tree for streaming data.
A Hoeffding tree is an incremental, anytime decision tree induction algorithm that is capable of learning from massive data streams, assuming that the distribution generating examples does not change over time. Hoeffding trees exploit the fact that a small sample can often be enough to choose an optimal splitting attribute. This idea is supported mathematically by the Hoeffding bound, which quantifies the number of observations (in our case, examples) needed to estimate some statistics within a prescribed precision (in our case, the goodness of an attribute).

A theoretically appealing feature of Hoeffding Trees, not shared by other incremental decision tree learners, is that they have sound guarantees of performance. Using the Hoeffding bound one can show that their output is asymptotically nearly identical to that of a non-incremental learner using infinitely many examples. See for details:

G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In KDD’01, pages 97–106, San Francisco, CA, 2001. ACM Press.
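
To make the bound concrete: if each observation lies in a range of width R and δ is the allowed failure probability, then after n observations the true mean differs from the observed mean by at most ε = √(R² ln(1/δ) / (2n)). A leaf splits once the best attribute outscores the second best by more than ε, or once ε falls below a tie-breaking threshold. The snippet below is a sketch of that check, not MOA's implementation; the tie threshold corresponds loosely to the -t option and δ to the -c option listed below.

import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean lies within epsilon of the mean of
    n observations with probability at least 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_best_gain, value_range, delta, n, tie_threshold):
    """Split when the best attribute beats the runner-up by more than epsilon,
    or when epsilon is so small that the choice no longer matters (tie break)."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_best_gain > eps) or (eps < tie_threshold)

# Example: information gain on a two-class problem is at most 1 bit, so
# value_range = 1; with delta = 1e-7 and 1000 examples observed at the leaf,
# should_split(0.25, 0.12, 1.0, 1e-7, 1000, 0.05) returns True.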

Parameters:

  • -m : Maximum memory consumed by the tree
  • -n : Numeric estimator to use :
    • Gaussian approximation evaluating 10 splitpoints
    • Gaussian approximation evaluating 100 splitpoints
    • Greenwald-Khanna quantile summary with 10 tuples
    • Greenwald-Khanna quantile summary with 100 tuples
    • Greenwald-Khanna quantile summary with 1000 tuples
    • VFML method with 10 bins
    • VFML method with 100 bins
    • VFML method with 1000 bins
    • Exhaustive binary tree
  • -e : How many instances between memory consumption checks
  • -g : The number of instances a leaf should observe between split
    attempts
  • -s : Split criterion to use. Example : InfoGainSplitCriterion
  • -c : The allowable error in split decision, values closer to 0 will take
    longer to decide
  • -t : Threshold below which a split will be forced to break ties
  • -b : Only allow binary splits
  • -z : Stop growing as soon as memory limit is hit
  • -r : Disable poor attributes
  • -p : Disable pre-pruning
  • -l : Leaf classifier to use at the leaves: Majority class, Naive Bayes, Naive Bayes Adaptive. By default: Naive Bayes Adaptive.

In old versions of MOA, a HoeffdingTreeNB was a HoeffdingTree with Naive Bayes classification at leaves, and a HoeffdingTreeNBAdaptive was a HoeffdingTree with adaptive Naive Bayes classification at leaves. In the current version of MOA, there is an option to select which classifier to use at the leaves: Majority class, Naive Bayes, or Naive Bayes Adaptive. By default, Naive Bayes Adaptive is selected, since it generally gives the best results. This adaptive Naive Bayes prediction method monitors the error rate of majority class and Naive Bayes decisions in every leaf, and chooses to employ Naive Bayes decisions only where they have been more accurate in past cases.
To run experiments using the old default version of HoeffdingTree, with a majority class learner at leaves, use “HoeffdingTree -l MC”.
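
A sketch of that adaptive rule at a single leaf: both predictors are scored on each training example before the leaf learns from it, and Naive Bayes is used for prediction only while its error count is lower. The class names and the train/predict interface are illustrative, not MOA's API.

class AdaptiveNBLeafSketch:
    """Chooses between majority-class and Naive Bayes prediction in a leaf."""

    def __init__(self, majority_class_model, naive_bayes_model):
        self.mc = majority_class_model   # e.g. a simple class-count predictor
        self.nb = naive_bayes_model      # e.g. the NaiveBayesSketch above
        self.mc_errors = 0
        self.nb_errors = 0

    def train(self, x, y):
        # Test-then-train: score both predictors before learning from (x, y).
        if self.mc.predict(x) != y:
            self.mc_errors += 1
        if self.nb.predict(x) != y:
            self.nb_errors += 1
        self.mc.train(x, y)
        self.nb.train(x, y)

    def predict(self, x):
        # Employ Naive Bayes only where it has been more accurate so far.
        if self.nb_errors < self.mc_errors:
            return self.nb.predict(x)
        return self.mc.predict(x)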

HoeffdingOptionTree

Decision option tree for streaming data.

Hoeffding Option Trees are regular Hoeffding trees containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths. They consist of a single structure that efficiently represents multiple trees. A particular example can travel down multiple paths of the tree, contributing, in different ways, to different options.

See for details:

B. Pfahringer, G. Holmes, and R. Kirkby. New options for Hoeffding trees. In AI, pages 90–99, 2007.

Parameters:

  • -o : Maximum number of option paths per node
  • -m : Maximum memory consumed by the tree
  • -n : Numeric estimator to use :
    • Gaussian approximation evaluating 10 splitpoints
    • Gaussian approximation evaluating 100 splitpoints
    • Greenwald-Khanna quantile summary with 10 tuples
    • Greenwald-Khanna quantile summary with 100 tuples
    • Greenwald-Khanna quantile summary with 1000 tuples
    • VFML method with 10 bins
    • VFML method with 100 bins
    • VFML method with 1000 bins
    • Exhaustive binary tree
  • -e : How many instances between memory consumption checks
  • -g : The number of instances a leaf should observe between split attempts
  • -s : Split criterion to use. Example : InfoGainSplitCriterion
  • -c : The allowable error in split decision, values closer to 0 will take longer to decide
  • -w : The allowable error in secondary split decisions, values closer to 0 will take longer to decide
  • -t : Threshold below which a split will be forced to break ties
  • -b : Only allow binary splits
  • -z : Memory strategy to use
  • -r : Disable poor attributes
  • -p : Disable pre-pruning
  • -d : File to append option table to.
  • -l : Leaf classifier to use at the leaves: Majority class, Naive Bayes, Naive Bayes Adaptive. By default: Naive Bayes Adaptive.

In old versions of MOA, a HoeffdingOptionTreeNB was a HoeffdingOptionTree with Naive Bayes classification at leaves, and a HoeffdingOptionTreeNBAdaptive was a HoeffdingOptionTree with adaptive Naive Bayes classification at leaves. In the current version of MOA, there is an option to select which classifier to use at the leaves: Majority class, Naive Bayes, or Naive Bayes Adaptive. By default, Naive Bayes Adaptive is selected, since it generally gives the best results. This adaptive Naive Bayes prediction method monitors the error rate of majority class and Naive Bayes decisions in every leaf, and chooses to employ Naive Bayes decisions only where they have been more accurate in past cases.
To run experiments using the old default version of HoeffdingOptionTree, with a majority class learner at leaves, use “HoeffdingOptionTree -l MC”.

AdaHoeffdingOptionTree

Adaptive decision option tree for streaming data with adaptive Naive Bayes classification at leaves.

An Adaptive Hoeffding Option Tree is a Hoeffding Option Tree with the following improvement: each leaf stores an estimation of the current error. It uses an EWMA estimator with α = .2. The weight of each node in the voting process is proportional to the square of the inverse of the error.
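
A sketch of the error monitoring and voting weight just described, with α = 0.2 as stated above; the small floor added before inverting the error is an assumption to avoid division by zero and is not taken from MOA.

class EWMAErrorSketch:
    """Exponentially weighted moving average of a 0/1 error signal."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.estimate = 0.0

    def update(self, misclassified):
        # misclassified is 1.0 for an error and 0.0 for a correct prediction
        self.estimate = (1.0 - self.alpha) * self.estimate + self.alpha * misclassified
        return self.estimate

def voting_weight(error_estimate, floor=1e-6):
    """Weight proportional to the square of the inverse of the error."""
    return 1.0 / max(error_estimate, floor) ** 2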

Example:

AdaHoeffdingOptionTree -o 50

Parameters:

  • Same parameters as HoeffdingOptionTree

HoeffdingAdaptiveTree

This adaptive Hoeffding Tree uses ADWIN to monitor the performance of branches of the tree, and replaces them with new branches when their accuracy decreases, provided the new branches are more accurate. For more information, see:

Albert Bifet and Ricard Gavaldà. Adaptive Learning from Evolving Data Streams. In IDA 2009.

OzaBag

Incremental on-line bagging of Oza and Russell.

Oza and Russell developed online versions of bagging and boosting for data streams. They show how the process of sampling bootstrap replicates from training data can be simulated in a data stream context, observing that the number of times each example is included in a bootstrap replicate tends to a Poisson(1) distribution as the number of examples grows.

[OR] N. Oza and S. Russell. Online bagging and boosting. In Artificial Intelligence and Statistics 2001, pages 105–112. Morgan Kaufmann, 2001.
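
A minimal sketch of the resulting training loop: each ensemble member sees every incoming example k times, with k drawn from Poisson(1). The train/predict interface of the base learners is assumed for illustration and is not MOA's API.

import math
import random
from collections import Counter

def poisson_knuth(lam):
    """Draw a sample from a Poisson(lam) distribution (Knuth's method)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= threshold:
            return k - 1

def oza_bag_train(ensemble, x, y, lam=1.0):
    """Online bagging: weight the example for each model by Poisson(lam)."""
    for model in ensemble:
        for _ in range(poisson_knuth(lam)):
            model.train(x, y)

def oza_bag_predict(ensemble, x):
    """Unweighted majority vote over the ensemble members."""
    votes = Counter(model.predict(x) for model in ensemble)
    return votes.most_common(1)[0][0]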

Parameters:

  • -l : Classifier to train
  • -s : The number of models in the bag

OzaBoost

Incremental on-line boosting of Oza and Russell.

See details in:
[OR] N. Oza and S. Russell. Online bagging and boosting. In Artificial Intelligence and Statistics 2001, pages 105–112. Morgan Kaufmann, 2001.

For the boosting method, Oza and Russell note that the weighting procedure of AdaBoost actually divides the total example weight into two halves – half of the weight is assigned to the correctly classified examples, and the other half goes to the misclassified examples. They use the Poisson distribution for deciding the random probability that an example is used for training, only this time the parameter changes according to the boosting weight of the example as it is passed through each model in sequence.
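
A sketch of how this can be realised online, following the Oza and Russell formulation: the example enters with weight λ = 1 and, after each model in the sequence, λ is scaled up if the model misclassified it and down otherwise, so that correct and incorrect examples each receive half of the total weight. The bookkeeping below reuses poisson_knuth from the OzaBag sketch and is illustrative rather than a copy of MOA's code.

def oza_boost_train(models, correct_weight, wrong_weight, x, y):
    """One online boosting step (sketch).

    correct_weight[m] and wrong_weight[m] accumulate the weight of examples
    that model m has classified correctly and incorrectly so far.
    """
    lam = 1.0
    for m, model in enumerate(models):
        for _ in range(poisson_knuth(lam)):
            model.train(x, y)
        if model.predict(x) == y:
            correct_weight[m] += lam
            total = correct_weight[m] + wrong_weight[m]
            lam *= total / (2.0 * correct_weight[m])   # damp correctly classified
        else:
            wrong_weight[m] += lam
            total = correct_weight[m] + wrong_weight[m]
            lam *= total / (2.0 * wrong_weight[m])     # boost misclassified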

Parameters:

  • -l : Classifier to train
  • -s : The number of models to boost
  • -p : Boost with weights only; no Poisson

OCBoost

Online Coordinate Boosting.

Pelossof et al. presented Online Coordinate Boosting, a new online boosting algorithm for adapting the weights of a boosted classifier, which yields a closer approximation to Freund and Schapire’s AdaBoost algorithm. The weight update procedure is derived by minimizing AdaBoost’s loss when viewed in an incremental form. This boosting method may be reduced to a form similar to Oza and Russell’s algorithm.

See details in:
[PJ] Raphael Pelossof, Michael Jones, Ilia Vovsha, and Cynthia Rudin. Online coordinate boosting. 2008.

Example:

OCBoost -l HoeffdingTreeNBAdaptive -e 0.5

Parameters:

  • -l : Classifier to train
  • -s : The number of models to boost
  • -e : Smoothing parameter

Classifiers for evolving streams

OzaBagASHT

Bagging using trees of different size.

The Adaptive-Size Hoeffding Tree (ASHT) is derived from the Hoeffding Tree algorithm with the following differences:

  • it has a maximum number of split nodes, or size
  • after one node splits, if the number of split nodes of the ASHT tree is higher than the maximum value, then it deletes some nodes to reduce its size

The intuition behind this method is as follows: smaller trees adapt more quickly to changes, and larger trees do better during periods with no or little change, simply because they were built on more data. Trees limited to size s will be reset about twice as often as trees with a size limit of 2s. This creates a set of different reset-speeds for an ensemble of such trees, and therefore a subset of trees that are a good approximation for the current rate of change. It is important to note that resets will happen all the time, even for stationary datasets, but this behaviour should not have a negative impact on the ensemble’s predictive performance. When the tree size exceeds the maximum size value, there are two different delete options:

  • delete the oldest node, the root, and all of its children except the one where the split has been made. After that, the root of the child not deleted becomes the new root
  • delete all the nodes of the tree, i.e., restart from a new root.

The maximum allowed size for the n-th ASHT tree is twice the maximum allowed size for the (n − 1)-th tree. Moreover, each tree has a weight proportional to the inverse of the square of its error, and it monitors its error with an exponential weighted moving average (EWMA) with α = .01. The size of the first tree is 2.
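
The size schedule and voting weights described above can be sketched as follows (first size 2, sizes doubling, EWMA with α = 0.01, weight proportional to the inverse of the squared error); the helper names and the error floor are illustrative, not MOA's code.

def asht_size_limits(num_trees, first_size=2):
    """Maximum number of split nodes per tree: 2, 4, 8, 16, ..."""
    return [first_size * (2 ** i) for i in range(num_trees)]

class ASHTMemberSketch:
    """Tracks one tree's smoothed error and derives its voting weight."""

    def __init__(self, max_size, alpha=0.01):
        self.max_size = max_size
        self.alpha = alpha
        self.error = 0.0

    def record(self, misclassified):
        # misclassified is 1.0 for an error and 0.0 otherwise
        self.error = (1.0 - self.alpha) * self.error + self.alpha * misclassified

    def weight(self, floor=1e-6):
        return 1.0 / max(self.error, floor) ** 2

# Example: an ensemble of 5 ASHT trees receives size limits [2, 4, 8, 16, 32].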

This method attempts to improve bagging performance by increasing tree diversity. It has been observed that boosting tends to produce a more diverse set of classifiers than bagging, and this has been cited as a factor in its increased performance.

See more details in:
[BHPKG] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. New ensemble methods for evolving data streams. In 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.

The learner must be ASHoeffdingTree, a Hoeffding Tree with a maximum size value.

Example:

OzaBagASHT -l ASHoeffdingTree -s 10 -u -r

Parameters:

  • Same parameters as OzaBag
  • -f : The size of the first classifier in the bag
  • -u : Enable weighting of the classifiers
  • -r : Reset trees when size is higher than the max

OzaBagADWIN

Bagging using ADWIN.

ADWIN is a change detector and estimator that solves in a well-specified way the problem of tracking the average of a stream of bits or real-valued numbers. ADWIN keeps a variable-length window of recently seen items, with the property that the window has the maximal length statistically consistent with the hypothesis “there has been no change in the average value inside the window”.
More precisely, an older fragment of the window is dropped if and only if there is enough evidence that its average value differs from that of the rest of the window. This has two consequences: first, that change is reliably declared whenever the window shrinks; and second, that at any time the average over the existing window can be reliably taken as an estimate of the current average in the stream (barring a very small or very recent change that is still not statistically visible). A formal and quantitative statement of these two points (a theorem) appears in

[BG07c] Albert Bifet and Ricard Gavaldà. Learning from time-changing data with adaptive windowing. In SIAM International Conference on Data Mining, 2007.

ADWIN is parameter- and assumption-free in the sense that it automatically detects and adapts to the current rate of change. Its only parameter is a confidence bound δ, indicating how confident we want to be in the algorithm’s output, inherent to all algorithms dealing with random processes. Also important, ADWIN does not maintain the window explicitly, but compresses it using a variant of the exponential histogram technique. This means that it keeps a window of length W using only O(log W) memory and O(log W) processing time per item.
ADWIN Bagging is the online bagging method of Oza and Russell with the addition of the ADWIN algorithm as a change detector and as an estimator for the weights of the boosting method. When a change is detected, the worst classifier of the ensemble of classifiers is removed and a new classifier is added to the ensemble.
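
A sketch of one training step of that scheme, assuming an ADWIN-like change detector with an add(error) method returning True when a change is flagged and an estimation() method returning the current error estimate; these interfaces and the two factory callbacks are assumptions for illustration, not MOA's API. poisson_knuth is the helper from the OzaBag sketch.

def adwin_bagging_step(models, detectors, x, y, new_base_learner, new_detector):
    """Train the ensemble on one example and replace the worst member on change."""
    change_detected = False
    for model, detector in zip(models, detectors):
        error = 0.0 if model.predict(x) == y else 1.0
        if detector.add(error):            # ADWIN flags a distribution change
            change_detected = True
        for _ in range(poisson_knuth(1.0)):
            model.train(x, y)
    if change_detected:
        # Remove the member with the highest estimated error and add a fresh one.
        worst = max(range(len(models)), key=lambda i: detectors[i].estimation())
        models[worst] = new_base_learner()
        detectors[worst] = new_detector()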

See details in:
[BHPKG] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. New ensemble methods for evolving data streams. In 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.

Example:

OzaBagAdwin -l HoeffdingTreeNBAdaptive -s 10

Parameters:

  • -l : Classifier to train
  • -s : The number of models in the bag

LeveragingBag

Leveraging Bagging for evolving data streams using ADWIN, and Leveraging Bagging MC using Random Output Codes (-o option). These methods leverage the performance of bagging with two randomization improvements: increasing resampling and using output detection codes. For more information see:

Albert Bifet, Geoffrey Holmes, and Bernhard Pfahringer. Leveraging Bagging for Evolving Data Streams. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD, 2010.

Parameters:

  • -l : Classifier to train.
  • -s : The number of models in the bagging
  • -w : The number to use to compute the weight of new instances.
  • -a : Delta of Adwin change detection
  • -o : Use Random Output Codes so that binary classifiers can be used
  • -m : Leveraging Bagging to use:
    • Leveraging Bagging ME using weight 1 if misclassified, otherwise error/(1-error)
    • Leveraging Bagging Half using resampling without replacement half of the instances
    • Leveraging Bagging WT without taking out all instances.
    • Leveraging Subagging using resampling without replacement.

Perceptron

Single perceptron classifier. Performs classic perceptron multiclass learning incrementally.
Parameters:

  • -r : Learning ratio of the classifier
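
A minimal sketch of an incremental multiclass perceptron of the kind described above, keeping one weight vector per class and updating only on mistakes; the exact handling of the learning ratio in MOA's implementation may differ from this sketch.

class MulticlassPerceptronSketch:
    """One weight vector (plus bias) per class; winner-takes-all prediction."""

    def __init__(self, num_attributes, classes, learning_ratio=1.0):
        self.w = {c: [0.0] * (num_attributes + 1) for c in classes}  # last entry is the bias
        self.rate = learning_ratio

    def _score(self, c, x):
        w = self.w[c]
        return w[-1] + sum(wi * xi for wi, xi in zip(w, x))

    def predict(self, x):
        return max(self.w, key=lambda c: self._score(c, x))

    def train(self, x, y):
        predicted = self.predict(x)
        if predicted != y:
            # Move the true class towards x and the wrongly predicted class away.
            for i, xi in enumerate(x):
                self.w[y][i] += self.rate * xi
                self.w[predicted][i] -= self.rate * xi
            self.w[y][-1] += self.rate
            self.w[predicted][-1] -= self.rate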

SGD

Implements stochastic gradient descent for learning various linear models: binary class SVM, binary class logistic regression and linear regression.
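
As an illustration of the SVM case, one SGD step on the L2-regularised hinge loss can be sketched as follows; the learning rate and regularisation constants are placeholders and do not mirror MOA's SGD options exactly.

def sgd_hinge_step(w, bias, x, y, learning_rate=0.01, reg_lambda=1e-4):
    """One SGD step on L2-regularised hinge loss; y must be -1 or +1.

    Returns the updated (w, bias). Sketch only, not MOA's implementation.
    """
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + bias)
    # Shrink the weights: gradient of the L2 regulariser.
    w = [wi * (1.0 - learning_rate * reg_lambda) for wi in w]
    if margin < 1.0:
        # The hinge loss is active: move towards classifying (x, y) correctly.
        w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
        bias += learning_rate * y
    return w, bias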

SPegasos

Implements the stochastic variant of the Pegasos (Primal Estimated sub-GrAdient SOlver for SVM) method of Shalev-Shwartz et al. (2007). For more information, see:

S. Shalev-Shwartz, Y. Singer, N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In 24th International Conference on Machine Learning, 807–814, 2007.

SingleClassifierDrift

Class for handling concept drift datasets with a wrapper on a classifier.

The drift detection method (DDM) proposed by Gama et al. controls the number of errors produced by the learning model during prediction. It compares the statistics of two windows: the first one contains all the data, and the second one contains only the data from the beginning until the number of errors increases. The method does not store these windows in memory; it keeps only statistics and a window of recent errors. The number of errors in a sample of examples is modeled by a binomial distribution. A significant increase in the error of the algorithm suggests that the class distribution is changing and, hence, that the current decision model has become inappropriate. The method defines a warning level and a drift level; beyond these levels, a change of context is considered.

More precisely, for each point i in the sequence that is being sampled, the error rate is the probability of misclassifying, pi, with standard deviation si = √(pi (1 − pi) / i). The method stores the values of pi and si when pi + si reaches its minimum value during the process (obtaining pmin and smin), and checks when the following conditions trigger (a sketch of these checks is given after the list):

  • pi + si ≥ pmin + 2 · smin for the warning level. Beyond this level, the examples are stored in anticipation of a possible change of context.
  • pi + si ≥ pmin + 3 · smin for the drift level. Beyond this level, the model induced by the learning method is reset and a new model is learnt using the examples stored since the warning level triggered.
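
A sketch of these checks as they would be evaluated after every prediction; the state kept here is simplified, and the warm-up of 30 examples before testing reflects the usual requirement for the binomial approximation rather than MOA's exact code.

import math

class DDMSketch:
    """Tracks the online error rate and reports warning / drift levels (sketch)."""

    STABLE, WARNING, DRIFT = "stable", "warning", "drift"

    def __init__(self):
        self.i = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def add(self, misclassified):
        """misclassified is True or False for the latest prediction."""
        self.i += 1
        self.errors += int(misclassified)
        if self.i < 30:
            return self.STABLE    # warm-up period before the test is meaningful
        p = self.errors / self.i
        s = math.sqrt(p * (1.0 - p) / self.i)
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3.0 * self.s_min:
            return self.DRIFT     # reset the model, relearn from stored examples
        if p + s >= self.p_min + 2.0 * self.s_min:
            return self.WARNING   # start storing examples
        return self.STABLE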

Baena-García et al. proposed a new method, EDDM (Early Drift Detection Method), in order to improve DDM. It is based on the estimated distribution of the distances between classification errors. The window resize procedure is governed by the same heuristics.

See more details in:
[GMCR] J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection. In SBIA Brazilian Symposium on Artificial Intelligence, pages 286–295, 2004.

[BDF] Manuel Baena-García, José del Campo-Avila, Raul Fidalgo, Albert Bifet, Ricard Gavaldà, and Rafael Morales-Bueno. Early drift detection method. In Fourth International Workshop on Knowledge Discovery from Data Streams, 2006.

Example:

SingleClassifierDrift -d EDDM -l HoeffdingTreeNBAdaptive

Parameters:

  • -l : Classifier to train
  • -d : Drift detection method to use: DDM or EDDM