StreamingKMeansModel

StreamingKMeansModel extends MLlib's KMeansModel for streaming use. It keeps a continuously updated weight for each cluster and can update the model by performing a single iteration of the standard k-means algorithm on a new batch of data.

The update algorithm uses the "mini-batch" k-means rule, generalized to incorporate forgetfulness (i.e. decay). The update rule (for each cluster) is:

$$ \begin{align} c_{t+1} &= [(c_t * n_t * a) + (x_t * m_t)] / [n_t * a + m_t] \\ n_{t+1} &= n_t * a + m_t \end{align} $$

Here c_t is the previously estimated centroid for that cluster, n_t is the (decayed) number of points assigned to it thus far, x_t is the centroid estimated on the current batch, m_t is the number of points assigned to that centroid in the current batch, and a is the decay factor.
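The rule can be sketched for a single cluster in plain Scala, without Spark. This is an illustration only: the names `Cluster` and `update` are hypothetical, and the sketch normalizes by the discounted weight n_t * a + m_t (that is, n_{t+1}), which is the form under which a = 0 makes the new centroid equal to x_t.

```scala
object MiniBatchUpdateSketch {
  /** State of one cluster: centroid c_t and accumulated weight n_t. */
  final case class Cluster(center: Array[Double], weight: Double)

  /**
   * One decayed mini-batch update for a single cluster:
   *   c_{t+1} = (c_t * n_t * a + x_t * m_t) / (n_t * a + m_t)
   *   n_{t+1} = n_t * a + m_t
   */
  def update(prev: Cluster, batchCenter: Array[Double], batchCount: Double, a: Double): Cluster = {
    require(prev.center.length == batchCenter.length, "dimension mismatch")
    val discounted = prev.weight * a          // n_t * a
    val newWeight = discounted + batchCount   // n_{t+1}
    require(newWeight > 0.0, "updated weight must be positive")
    val newCenter = prev.center.zip(batchCenter).map { case (c, x) =>
      (c * discounted + x * batchCount) / newWeight
    }
    Cluster(newCenter, newWeight)
  }

  def main(args: Array[String]): Unit = {
    val prev = Cluster(Array(0.0, 0.0), 3.0)
    // a = 1: the old centroid keeps its full weight of 3 points.
    val equal = update(prev, Array(4.0, 8.0), batchCount = 1.0, a = 1.0)
    // a = 0: the old centroid is forgotten entirely.
    val forget = update(prev, Array(4.0, 8.0), batchCount = 1.0, a = 0.0)
    println(equal.center.mkString("[", ", ", "]") + " weight=" + equal.weight)
    println(forget.center.mkString("[", ", ", "]") + " weight=" + forget.weight)
  }
}
```

With a = 1 the three old points and the one new point are averaged together; with a = 0 the result is simply the batch centroid.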

The decay factor a scales the contribution of the clusters as estimated thus far: the accumulated weight n_t is multiplied by a before the new batch is incorporated. If a=1, all batches are weighted equally. If a=0, new centroids are determined entirely by the most recent batch. Lower values correspond to more forgetting.

Decay can optionally be specified by a half life and an associated time unit. The time unit can be either a batch of data or a single data point. For data that arrived at time t, the half life h is defined such that at time t + h the discount applied to the data from t is 0.5. The definition is the same whether the time unit is given as batches or points.
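The half-life definition fixes the decay factor: if the discount after h time units is a^h = 0.5, then a = exp(ln(0.5) / h). A minimal sketch of that conversion follows; the object and function names are hypothetical and not part of this class.

```scala
object HalfLifeSketch {
  /** Decay factor a satisfying a^halfLife = 0.5. */
  def decayFactorFromHalfLife(halfLife: Double): Double = {
    require(halfLife > 0.0, "half life must be positive")
    math.exp(math.log(0.5) / halfLife)
  }

  /** Discount applied to data that is `elapsed` time units old. */
  def discount(decayFactor: Double, elapsed: Double): Double =
    math.pow(decayFactor, elapsed)

  def main(args: Array[String]): Unit = {
    val a = decayFactorFromHalfLife(5.0)
    // After exactly one half life the discount is (up to rounding) 0.5.
    println(s"a = $a, discount after 5 units = ${discount(a, 5.0)}")
  }
}
```

The same conversion applies to either time unit; only the meaning of `elapsed` changes (number of batches versus number of points).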

Annotations: @Since( "1.2.0" )
Source: StreamingKMeans.scala

Linear Supertypes

Logging, KMeansModel, PMMLExportable, Serializable, Serializable, Saveable, AnyRef, Any

Instance Constructors

new StreamingKMeansModel(clusterCenters: Array[Vector], clusterWeights: Array[Double])

Annotations
@Since( "1.2.0" )
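A model can be constructed directly from initial centers and weights. The sketch below assumes spark-mllib is on the classpath; the centers, weights, and the object name are made-up illustration values. It uses only the constructor above plus the inherited `k` and `predict(point: Vector)` members, neither of which needs a SparkContext.

```scala
import org.apache.spark.mllib.clustering.StreamingKMeansModel
import org.apache.spark.mllib.linalg.Vectors

object ConstructModelSketch {
  /** Builds a two-cluster model in 2-D; each cluster starts with weight 1.0. */
  def build(): StreamingKMeansModel =
    new StreamingKMeansModel(
      Array(Vectors.dense(0.0, 0.0), Vectors.dense(10.0, 10.0)),
      Array(1.0, 1.0))

  def main(args: Array[String]): Unit = {
    val model = build()
    println(s"k = ${model.k}")
    // Index of the center nearest to the query point.
    println(s"cluster = ${model.predict(Vectors.dense(9.0, 8.5))}")
  }
}
```

The two arrays are expected to have the same length, one weight per center.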

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
val clusterCenters: Array[Vector]

Definition Classes
StreamingKMeansModel → KMeansModel
Annotations
@Since( "1.2.0" )
val clusterWeights: Array[Double]

Annotations
@Since( "1.2.0" )
def computeCost(data: RDD[Vector]): Double

Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.

Definition Classes
KMeansModel
Annotations
@Since( "0.8.0" )
val distanceMeasure: String

Definition Classes
KMeansModel
Annotations
@Since( "2.4.0" )
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def formatVersion: String

Current version of model save/load format.

Attributes
protected
Definition Classes
KMeansModel → Saveable
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean = false): Boolean

Attributes
protected
Definition Classes
Logging
def initializeLogIfNecessary(isInterpreter: Boolean): Unit

Attributes
protected
Definition Classes
Logging
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def isTraceEnabled(): Boolean

Attributes
protected
Definition Classes
Logging
def k: Int

Total number of clusters.

Definition Classes
KMeansModel
Annotations
@Since( "0.8.0" )
def log: Logger

Attributes
protected
Definition Classes
Logging
def logDebug(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logDebug(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logError(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logError(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logInfo(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logInfo(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logName: String

Attributes
protected
Definition Classes
Logging
def logTrace(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logTrace(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logWarning(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logWarning(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def predict(points: JavaRDD[Vector]): JavaRDD[Integer]

Maps given points to their cluster indices.

Definition Classes
KMeansModel
Annotations
@Since( "1.0.0" )
def predict(points: RDD[Vector]): RDD[Int]

Maps given points to their cluster indices.

Definition Classes
KMeansModel
Annotations
@Since( "1.0.0" )
def predict(point: Vector): Int

Returns the cluster index that a given point belongs to.

Definition Classes
KMeansModel
Annotations
@Since( "0.8.0" )
def save(sc: SparkContext, path: String): Unit

Save this model to the given path.
This saves:
- human-readable (JSON) model metadata to path/metadata/
- Parquet formatted data to path/data/
The model may be loaded using Loader.load.
sc: Spark context used to save model data.
path: Path specifying the directory in which to save this model. If the directory already exists, this method throws an exception.

Definition Classes
KMeansModel → Saveable
Annotations
@Since( "1.4.0" )
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toPMML(): String

Export the model to a String in PMML format

Definition Classes
PMMLExportable
Annotations
@Since( "1.4.0" )
def toPMML(outputStream: OutputStream): Unit

Export the model to the OutputStream in PMML format

Definition Classes
PMMLExportable
Annotations
@Since( "1.4.0" )
def toPMML(sc: SparkContext, path: String): Unit

Export the model to a directory on a distributed file system in PMML format

Definition Classes
PMMLExportable
Annotations
@Since( "1.4.0" )
def toPMML(localPath: String): Unit

Export the model to a local file in PMML format

Definition Classes
PMMLExportable
Annotations
@Since( "1.4.0" )
def toString(): String

Definition Classes
AnyRef → Any
val trainingCost: Double

Definition Classes
KMeansModel
Annotations
@Since( "2.4.0" )
def update(data: RDD[Vector], decayFactor: Double, timeUnit: String): StreamingKMeansModel

Perform a k-means update on a batch of data.

Annotations
@Since( "1.2.0" )
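A hedged usage sketch of `update` on one batch, assuming spark-core and spark-mllib are on the classpath and a local master is acceptable. The data values and the object name are illustrative. The `timeUnit` strings "batches" and "points" are assumed here to match the constants defined on the companion StreamingKMeans object.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.StreamingKMeansModel
import org.apache.spark.mllib.linalg.Vectors

object UpdateSketch {
  /** Applies one batch update with no forgetting (decayFactor = 1.0). */
  def run(sc: SparkContext): StreamingKMeansModel = {
    val model = new StreamingKMeansModel(
      Array(Vectors.dense(0.0, 0.0), Vectors.dense(10.0, 10.0)),
      Array(1.0, 1.0))
    // One new point falls nearest to each existing center.
    val batch = sc.parallelize(Seq(Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 9.0)))
    model.update(batch, 1.0, "batches")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("update-sketch")
    val sc = new SparkContext(conf)
    try {
      val updated = run(sc)
      updated.clusterCenters.foreach(println)
      println(updated.clusterWeights.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

With a decay factor of 1.0 each center moves to the weighted mean of its old position and the new point assigned to it, and each cluster weight grows by the number of points it received.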
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

class StreamingKMeansModel extends KMeansModel with Logging
