StreamingKMeansModel
class pyspark.mllib.clustering.StreamingKMeansModel(clusterCenters, clusterWeights)
Clustering model which can perform an online update of the centroids.
The update formula for each centroid is given by
c_t+1 = ((c_t * n_t * a) + (x_t * m_t)) / (n_t * a + m_t)
n_t+1 = n_t * a + m_t
where
c_t: Centroid at the t-th iteration.
n_t: Number of samples (or weights) associated with the centroid at the t-th iteration.
x_t: Centroid of the new data closest to c_t.
m_t: Number of samples (or weights) of the new data closest to c_t.
c_t+1: New centroid.
n_t+1: New number of weights.
a: Decay factor, which gives the forgetfulness.
New in version 1.5.0.
Parameters
clusterCenters : list of pyspark.mllib.linalg.Vector or convertible
Initial cluster centers.
clusterWeights : pyspark.mllib.linalg.Vector or convertible
List of weights assigned to each cluster.
Notes
If the decay factor a is set to 1, the updated centroid is the weighted mean of the previous and new data. If it is set to zero, the old centroids are completely forgotten.
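As a rough illustration of the update formula, the sketch below applies it to a single centroid in plain NumPy. The values are made up for this illustration; update performs the equivalent computation for every centroid.

>>> import numpy as np
>>> c_t, n_t = np.array([0.0, 0.0]), 2.0  # current centroid and its weight
>>> x_t, m_t = np.array([1.0, 1.0]), 2.0  # centroid and weight of the new data closest to c_t
>>> a = 0.5                               # decay factor
>>> (c_t * n_t * a + x_t * m_t) / (n_t * a + m_t)  # c_t+1
array([ 0.66666667,  0.66666667])
>>> n_t * a + m_t                         # n_t+1
3.0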
Examples
>>> initCenters = [[0.0, 0.0], [1.0, 1.0]]
>>> initWeights = [1.0, 1.0]
>>> stkm = StreamingKMeansModel(initCenters, initWeights)
>>> data = sc.parallelize([[-0.1, -0.1], [0.1, 0.1],
...     [0.9, 0.9], [1.1, 1.1]])
>>> stkm = stkm.update(data, 1.0, "batches")
>>> stkm.centers
array([[ 0.,  0.],
       [ 1.,  1.]])
>>> stkm.predict([-0.1, -0.1])
0
>>> stkm.predict([0.9, 0.9])
1
>>> stkm.clusterWeights
[3.0, 3.0]
>>> decayFactor = 0.0
>>> data = sc.parallelize([DenseVector([1.5, 1.5]), DenseVector([0.2, 0.2])])
>>> stkm = stkm.update(data, 0.0, "batches")
>>> stkm.centers
array([[ 0.2,  0.2],
       [ 1.5,  1.5]])
>>> stkm.clusterWeights
[1.0, 1.0]
>>> stkm.predict([0.2, 0.2])
0
>>> stkm.predict([1.5, 1.5])
1
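In practice a StreamingKMeansModel is usually managed by a StreamingKMeans instance rather than constructed by hand. A minimal sketch, assuming trainingStream is a DStream of feature vectors built on a running StreamingContext (not defined here):

>>> from pyspark.mllib.clustering import StreamingKMeans
>>> skm = StreamingKMeans(k=2, decayFactor=1.0, timeUnit="batches")
>>> skm = skm.setInitialCenters([[0.0, 0.0], [1.0, 1.0]], [1.0, 1.0])
>>> skm.trainOn(trainingStream)  # each batch updates the underlying model
>>> latest = skm.latestModel()   # the current StreamingKMeansModel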
Methods
computeCost(rdd)
Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.
load(sc, path)
Load a model from the given path.
predict(x)
Find the cluster that each of the points belongs to in this model.
save(sc, path)
Save this model to the given path.
update(data, decayFactor, timeUnit)
Update the centroids according to data.
Attributes
clusterCenters
Get the cluster centers, represented as a list of NumPy arrays.
clusterWeights
Return the cluster weights.
k
Total number of clusters.
Methods Documentation
computeCost(rdd)
Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.
New in version 1.4.0.
Parameters
rdd : pyspark.RDD
The RDD of points to compute the cost on.
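For example, reusing the stkm model and the sc SparkContext from the examples above (a sketch; the exact cost depends on the model's current centers):

>>> pts = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
>>> stkm.computeCost(pts) >= 0.0
True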
classmethod load(sc, path)
Load a model from the given path.
New in version 1.4.0.
predict(x)
Find the cluster that each of the points belongs to in this model.
New in version 0.9.0.
Parameters
x : pyspark.mllib.linalg.Vector or pyspark.RDD
A data point (or an RDD of points) to determine cluster index. pyspark.mllib.linalg.Vector can be replaced with equivalent objects (list, tuple, numpy.ndarray).
Returns
int or pyspark.RDD of int
Predicted cluster index, or an RDD of predicted cluster indices if the input is an RDD.
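Continuing the examples above (where the final centers are [0.2, 0.2] and [1.5, 1.5]), predict accepts either a single point or an RDD of points:

>>> stkm.predict([1.4, 1.6])
1
>>> stkm.predict(sc.parallelize([[0.1, 0.3], [1.4, 1.6]])).collect()
[0, 1]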
save(sc, path)
Save this model to the given path.
New in version 1.4.0.
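A save/load round trip follows the usual MLlib pattern. A minimal sketch with a placeholder path; since save and load are inherited from KMeansModel, only the cluster centers (not the streaming weights) are expected to survive the round trip:

>>> import os, tempfile
>>> path = os.path.join(tempfile.mkdtemp(), "stkm_model")
>>> stkm.save(sc, path)
>>> sameModel = StreamingKMeansModel.load(sc, path)
>>> sameModel.predict([0.2, 0.2]) == stkm.predict([0.2, 0.2])
True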
update(data, decayFactor, timeUnit)
Update the centroids according to data.
New in version 1.5.0.
Parameters
data : pyspark.RDD
RDD with new data for the model update.
decayFactor : float
Forgetfulness of the previous centroids.
timeUnit : str
Can be "batches" or "points". If "points", the decay factor is raised to the power of the number of new points; if "batches", the decay factor is used as is.
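To see the effect of timeUnit, the sketch below (with made-up data) applies the same batch to two fresh models. With "points" the decay factor is raised to the power of the number of new points (here 0.5 ** 3), so the previous weights are discounted more strongly than with "batches":

>>> stkm_b = StreamingKMeansModel([[0.0, 0.0], [1.0, 1.0]], [1.0, 1.0])
>>> stkm_p = StreamingKMeansModel([[0.0, 0.0], [1.0, 1.0]], [1.0, 1.0])
>>> batch = sc.parallelize([[0.2, 0.2], [0.8, 0.8], [1.2, 1.2]])
>>> stkm_b = stkm_b.update(batch, 0.5, "batches")  # old weights scaled by 0.5
>>> stkm_p = stkm_p.update(batch, 0.5, "points")   # old weights scaled by 0.5 ** 3
>>> stkm_b.clusterWeights == stkm_p.clusterWeights
False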
Attributes Documentation
clusterCenters
Get the cluster centers, represented as a list of NumPy arrays.
New in version 1.0.0.
clusterWeights
Return the cluster weights.
New in version 1.5.0.
k
Total number of clusters.
New in version 1.4.0.