Stream Clustering

In today’s applications, evolving data streams are ubiquitous. Stream clustering algorithms were introduced to gain useful knowledge from these streams in real-time. The quality of the obtained clusterings, i.e. how good they reflect the data, can be assessed by evaluation measures. A multitude of stream clustering algorithms and evaluation measures for clusterings were introduced in the literature. The clustering tab in MOA allows to easily test and compare stream clustering algorithms as well as evaluation measures. Moreover it is easily extensible for new stream generators, stream clustering algorithms and evaluation measures. Watch the video to get an impression of the MOA stream clustering tab.

 

 

Data feeds and data generators

For stream clustering we added new data generators that support the simulation of cluster evolution events such as merging or disappearing of clusters. In the configuration dialog the dimensionality, number and size of clusters can be set as well as the drift speed, decay horizon (aging) and noise rate etc. Events constitute changes in the underlying data model such as growing of clusters, merging of clusters or creation of new clusters. Using the event frequency and the individual event weights, one can study the behavior and performance of different approaches on various settings. Finally, the settings for the data generators can be stored and loaded, which offers the opportunity of sharing settings and thereby providing benchmark streaming data sets for repeatability and comparison. New data feeds and generators can be added to the MOA framework by implementing the ClusteringStream.java interface.

Stream clustering algorithms

Currently MOA contains several stream clustering methods including:

  • StreamKM++: It computes a small weighted sample of the data stream and it uses the k-means++ algorithm as a randomized seeding technique to choose the first values for the clusters. To compute the small sample, it employs coreset constructions using a coreset tree for speed up.
  • CluStream: It maintains statistical information about the data using micro-clusters. These micro-clusters are temporal extensions of cluster feature vectors. The micro-clusters are stored at snapshots in time following a pyramidal pattern. This pattern allows to recall summary statistics from different time horizons.
  • ClusTree: It is a parameter free algorithm automatically adapting to the speed of the stream and it is capable of detecting concept drift, novelty, and outliers in the stream. It uses a compact and self-adaptive index structure for maintaining stream summaries.
  • DenStream: It uses dense micro-clusters (named core-micro-cluster) to summarize clusters. To maintain and distinguish the potential clusters and outliers, this method presents core-micro-cluster and outlier micro-cluster structures.
  • D-Stream: This method maps each input data record into a grid and it computes the grid density. The grids are clustered based on the density. This algorithm adopts a density decaying technique to capture the dynamic changes of a data stream.
  • CobWeb. One of the first incremental methods for clustering data. It uses a classification tree. Each node in a classification tree represents a class (concept) and is labeled by a probabilistic concept that summarizes the attribute-value distributions of objects classified under the node.

The set of algorithms is extensible through classes that implement the interface Clusterer.java. These are added to the framework via reections on start up. The three main methods of this interface are

  • void resetLearningImpl(): a method for initializing a clusterer learner
  • void trainOnInstanceImpl(Instance): a method to train a new instance
  • Clustering getClusteringResult(): a method to obtain the current clustering result for evaluation or visualization

Stream clustering evaluation measures

For cluster evaluation various measures have been developed and proposed over the last decades. A common classification of these measures is the separation in so called internal measures and external measures. Internal measures only consider the cluster properties, e.g. distances between points within one cluster or between two different clusters. External evaluation measures compare a given clusterings to a ground truth. MOA contains an extensible set of both internal and external measures that can be applied to both micro and macro clusterings. Moreover, specialized measures that take the peculiarities of an evolving data stream into account to fairly evaluate the performance of a stream clustering algorithm will be included in our set of measures. To extend the available collection with additional or novel evaluation measures one has to implement the Measure interface. The main methods are:

  • void evaluateClustering(Clustering clustering, Clustering trueClustering): uses the implemented measure to evaluate the given clustering w.r.t. to the provided ground truth.
  • double getLastValue(): a method that outputs the last result of the evaluation measure.
  • double getMaxValue(), getMinValue(), getMean(): methods that provide more statistics about the measure’s distribution.