LDA (Spark 1.6.3 JavaDoc)

Object
- org.apache.spark.mllib.clustering.LDA

All Implemented Interfaces:

Logging
```
public class LDA
extends Object
implements Logging
```
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
- "word" = "term": an element of the vocabulary
- "token": an instance of a term appearing in a document
- "topic": a multinomial distribution over words representing some concept

References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.

See Also:
Latent Dirichlet allocation (Wikipedia): http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Constructor Summary

Constructors
Constructor and Description

LDA()
Constructs an LDA instance with default parameters.

Method Summary

Methods
Modifier and Type	Method and Description
`double`	`getAlpha()` Alias for `getDocConcentration`
`Vector`	`getAsymmetricAlpha()` Alias for `getAsymmetricDocConcentration`
`Vector`	`getAsymmetricDocConcentration()` Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
`double`	`getBeta()` Alias for `getTopicConcentration`
`int`	`getCheckpointInterval()` Period (in iterations) between checkpoints.
`double`	`getDocConcentration()` Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
`int`	`getK()` Number of topics to infer.
`int`	`getMaxIterations()` Maximum number of iterations for learning.
`LDAOptimizer`	`getOptimizer()` :: DeveloperApi :: LDAOptimizer used to perform the actual calculation.
`long`	`getSeed()` Random seed
`double`	`getTopicConcentration()` Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
`LDAModel`	`run(JavaPairRDD<Long,Vector> documents)` Java-friendly version of `run()`
`LDAModel`	`run(RDD<scala.Tuple2<Object,Vector>> documents)` Learn an LDA model using the given dataset.
`LDA`	`setAlpha(double alpha)` Alias for `setDocConcentration()`
`LDA`	`setAlpha(Vector alpha)` Alias for `setDocConcentration()`
`LDA`	`setBeta(double beta)` Alias for `setTopicConcentration()`
`LDA`	`setCheckpointInterval(int checkpointInterval)` Period (in iterations) between checkpoints (default = 10).
`LDA`	`setDocConcentration(double docConcentration)` Replicates a `Double` docConcentration to create a symmetric prior.
`LDA`	`setDocConcentration(Vector docConcentration)` Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
`LDA`	`setK(int k)` Number of topics to infer.
`LDA`	`setMaxIterations(int maxIterations)` Maximum number of iterations for learning.
`LDA`	`setOptimizer(LDAOptimizer optimizer)` :: DeveloperApi :: LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer).
`LDA`	`setOptimizer(String optimizerName)` Set the LDAOptimizer used to perform the actual calculation by algorithm name.
`LDA`	`setSeed(long seed)` Random seed
`LDA`	`setTopicConcentration(double topicConcentration)` Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning

- Constructor Detail
  - LDA
```
public LDA()
```
    Constructs an LDA instance with default parameters.
- Method Detail
  - getK
```
public int getK()
```
    Number of topics to infer. I.e., the number of soft cluster centers.
    
    Returns:
    (undocumented)
  - setK
```
public LDA setK(int k)
```
    Number of topics to infer. I.e., the number of soft cluster centers. (default = 10)
    
    Parameters:
    k - (undocumented)
    
    Returns:
    (undocumented)
  - getAsymmetricDocConcentration
```
public Vector getAsymmetricDocConcentration()
```
    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
    This is the parameter to a Dirichlet distribution.
    
    Returns:
    (undocumented)
  - getDocConcentration
```
public double getDocConcentration()
```
    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
    This method assumes the Dirichlet distribution is symmetric and can be described by a single Double parameter. It should fail if docConcentration is asymmetric.
    
    Returns:
    (undocumented)
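    
    The symmetric-prior assumption described for this getter can be pictured with a small sketch in plain Java (no Spark dependency). The class and method names are hypothetical, and the choice of `IllegalStateException` is an assumption of this sketch rather than Spark's documented behavior.

```java
import java.util.Arrays;

// Hypothetical illustration of a "symmetric only" accessor; not Spark's implementation.
final class SymmetricAlphaSketch {

    /** Returns the single shared value of a symmetric prior, or throws if the entries differ. */
    static double symmetricValue(double[] docConcentration) {
        if (docConcentration.length == 0) {
            throw new IllegalArgumentException("docConcentration must not be empty");
        }
        double first = docConcentration[0];
        for (double value : docConcentration) {
            if (value != first) {
                // Assumption: an asymmetric prior is reported with an exception.
                throw new IllegalStateException(
                    "docConcentration is asymmetric: " + Arrays.toString(docConcentration));
            }
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(symmetricValue(new double[] {0.5, 0.5, 0.5}));
    }
}
```

    With an asymmetric prior, a caller would read the full vector through `getAsymmetricDocConcentration()` instead.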
  - setDocConcentration
```
public LDA setDocConcentration(Vector docConcentration)
```
    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
    This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
    If set to a singleton vector Vector(-1), then docConcentration is set automatically. If set to singleton vector Vector(t) where t != -1, then t is replicated to a vector of length k during LDAOptimizer.initialize(). Otherwise, the docConcentration vector must be length k. (default = Vector(-1) = automatic)
    Optimizer-specific parameter settings:
    - EM
      - Currently only supports symmetric distributions, so all values in the vector should be the same.
      - Values should be > 1.0
      - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
    - Online
      - Values should be >= 0
      - default = uniformly (1.0 / k), following the implementation from https://github.com/Blei-Lab/onlineldavb.
    
    Parameters:
    docConcentration - (undocumented)
    
    Returns:
    (undocumented)
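    
    The rules for singleton and full-length vectors, together with the per-optimizer defaults, can be sketched in plain Java as follows. This is only an illustration of the documented behavior, not the code in `LDAOptimizer.initialize()`; the class `DocConcentrationSketch`, the method `resolve`, and the use of a plain `double[]` in place of `Vector` are assumptions made to keep the example self-contained.

```java
import java.util.Arrays;

// Hypothetical illustration of the documented docConcentration rules; not Spark's implementation.
final class DocConcentrationSketch {

    /** Expands a user-supplied docConcentration into a vector of length k. */
    static double[] resolve(double[] docConcentration, int k, String optimizer) {
        boolean em = optimizer.equals("em");
        if (!em && !optimizer.equals("online")) {
            throw new IllegalArgumentException("Unknown optimizer: " + optimizer);
        }
        if (docConcentration.length == 1) {
            double t = docConcentration[0];
            if (t == -1.0) {
                // Automatic setting: EM uses (50 / k) + 1, online uses 1.0 / k.
                t = em ? (50.0 / k) + 1.0 : 1.0 / k;
            }
            // A singleton value is replicated to a symmetric vector of length k.
            double[] replicated = new double[k];
            Arrays.fill(replicated, t);
            return replicated;
        }
        if (docConcentration.length != k) {
            throw new IllegalArgumentException(
                "docConcentration must have length 1 or k = " + k);
        }
        return docConcentration.clone();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resolve(new double[] {-1.0}, 10, "em")));
        System.out.println(Arrays.toString(resolve(new double[] {-1.0}, 4, "online")));
    }
}
```

    For k = 10 the automatic EM value is (50 / 10) + 1 = 6.0 per topic, and for k = 4 the automatic online value is 1.0 / 4 = 0.25 per topic.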
  - setDocConcentration
```
public LDA setDocConcentration(double docConcentration)
```
    Replicates a Double docConcentration to create a symmetric prior.
    
    Parameters:
    docConcentration - (undocumented)
    
    Returns:
    (undocumented)
  - getAsymmetricAlpha
```
public Vector getAsymmetricAlpha()
```
    Alias for getAsymmetricDocConcentration
    
    Returns:
    (undocumented)
  - getAlpha
```
public double getAlpha()
```
    Alias for getDocConcentration
    
    Returns:
    (undocumented)
  - setAlpha
```
public LDA setAlpha(Vector alpha)
```
    Alias for setDocConcentration()
    
    Parameters:
    alpha - (undocumented)
    
    Returns:
    (undocumented)
  - setAlpha
```
public LDA setAlpha(double alpha)
```
    Alias for setDocConcentration()
    
    Parameters:
    alpha - (undocumented)
    
    Returns:
    (undocumented)
  - getTopicConcentration
```
public double getTopicConcentration()
```
    Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
    This is the parameter to a symmetric Dirichlet distribution.
    Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
    
    Returns:
    (undocumented)
  - setTopicConcentration
```
public LDA setTopicConcentration(double topicConcentration)
```
    Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
    This is the parameter to a symmetric Dirichlet distribution.
    Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
    If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)
    Optimizer-specific parameter settings:
    - EM
      - Value should be > 1.0
      - default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
    - Online
      - Value should be >= 0
      - default = (1.0 / k), following the implementation from https://github.com/Blei-Lab/onlineldavb.
    
    Parameters:
    topicConcentration - (undocumented)
    
    Returns:
    (undocumented)
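    
    The automatic setting for topicConcentration can be illustrated in the same spirit. The sketch below is plain Java with hypothetical names (`TopicConcentrationSketch`, `resolve`); it only restates the documented defaults and is not taken from Spark's source.

```java
// Hypothetical illustration of the documented topicConcentration defaults; not Spark's implementation.
final class TopicConcentrationSketch {

    /** Returns the value to use, replacing -1 with the optimizer-specific default. */
    static double resolve(double topicConcentration, int k, String optimizer) {
        if (topicConcentration != -1.0) {
            return topicConcentration; // an explicit value is kept as given
        }
        switch (optimizer) {
            case "em":
                return 0.1 + 1.0; // small smoothing plus the +1 adjustment for EM
            case "online":
                return 1.0 / k;
            default:
                throw new IllegalArgumentException("Unknown optimizer: " + optimizer);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(-1.0, 10, "em"));
        System.out.println(resolve(-1.0, 10, "online"));
    }
}
```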
  - getBeta
```
public double getBeta()
```
    Alias for getTopicConcentration
    
    Returns:
    (undocumented)
  - setBeta
```
public LDA setBeta(double beta)
```
    Alias for setTopicConcentration()
    
    Parameters:
    beta - (undocumented)
    
    Returns:
    (undocumented)
  - getMaxIterations
```
public int getMaxIterations()
```
    Maximum number of iterations for learning.
    
    Returns:
    (undocumented)
  - setMaxIterations
```
public LDA setMaxIterations(int maxIterations)
```
    Maximum number of iterations for learning. (default = 20)
    
    Parameters:
    maxIterations - (undocumented)
    
    Returns:
    (undocumented)
  - getSeed
```
public long getSeed()
```
    Random seed
    
    Returns:
    (undocumented)
  - setSeed
```
public LDA setSeed(long seed)
```
    Random seed
    
    Parameters:
    seed - (undocumented)
    
    Returns:
    (undocumented)
  - getCheckpointInterval
```
public int getCheckpointInterval()
```
    Period (in iterations) between checkpoints.
    
    Returns:
    (undocumented)
  - setCheckpointInterval
```
public LDA setCheckpointInterval(int checkpointInterval)
```
    Period (in iterations) between checkpoints (default = 10). Checkpointing helps with recovery (when nodes fail). It also helps with eliminating temporary shuffle files on disk, which can be important when LDA is run for many iterations. If the checkpoint directory is not set in SparkContext, this setting is ignored.
    
    Parameters:
    checkpointInterval - (undocumented)
    
    Returns:
    (undocumented)
    See Also:
    SparkContext.setCheckpointDir(java.lang.String)
  - getOptimizer
```
public LDAOptimizer getOptimizer()
```
    :: DeveloperApi ::
    LDAOptimizer used to perform the actual calculation
    
    Returns:
    (undocumented)
  - setOptimizer
```
public LDA setOptimizer(LDAOptimizer optimizer)
```
    :: DeveloperApi ::
    LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)
    
    Parameters:
    optimizer - (undocumented)
    
    Returns:
    (undocumented)
  - setOptimizer
```
public LDA setOptimizer(String optimizerName)
```
    Set the LDAOptimizer used to perform the actual calculation by algorithm name. Currently "em" and "online" are supported.
    
    Parameters:
    optimizerName - (undocumented)
    
    Returns:
    (undocumented)
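    
    Selecting an optimizer by name amounts to a small lookup. The sketch below shows one way such a lookup might be written in plain Java; the class and method names are hypothetical, the optimizer class names are returned as strings to avoid a Spark dependency, and lower-casing the name and throwing `IllegalArgumentException` for unknown names are assumptions of this sketch.

```java
import java.util.Locale;

// Hypothetical illustration of a name-to-optimizer lookup; not Spark's implementation.
final class OptimizerNameSketch {

    /** Maps an algorithm name to the simple class name of the matching optimizer. */
    static String optimizerClassFor(String optimizerName) {
        // Assumption: names are matched case-insensitively.
        switch (optimizerName.toLowerCase(Locale.ROOT)) {
            case "em":
                return "EMLDAOptimizer";
            case "online":
                return "OnlineLDAOptimizer";
            default:
                throw new IllegalArgumentException(
                    "Only em, online are supported but got " + optimizerName);
        }
    }

    public static void main(String[] args) {
        System.out.println(optimizerClassFor("online"));
    }
}
```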
  - run
```
public LDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
```
    Learn an LDA model using the given dataset.
    
    Parameters:
    documents - RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
    
    Returns:
    Inferred LDA model
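    
    To make the "bag of words" input concrete, the following plain-Java sketch turns one tokenized document into a term count vector over a fixed vocabulary. The vocabulary, the tokens, and the names `TermCountSketch` and `termCounts` are made up for illustration, and silently dropping out-of-vocabulary tokens is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of building a term count vector; not part of Spark.
final class TermCountSketch {

    /** Counts how often each vocabulary term occurs among the document's tokens. */
    static double[] termCounts(List<String> vocabulary, List<String> tokens) {
        // The vector length equals the vocabulary size.
        double[] counts = new double[vocabulary.size()];
        for (String token : tokens) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                counts[index] += 1.0;
            }
            // Assumption: tokens outside the vocabulary are ignored.
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("spark", "topic", "model");
        List<String> tokens = Arrays.asList("topic", "model", "topic", "lda");
        System.out.println(Arrays.toString(termCounts(vocabulary, tokens))); // prints [0.0, 2.0, 1.0]
    }
}
```

    In a Spark program, each such array would typically be wrapped with `Vectors.dense` and paired with a unique, non-negative document ID before being passed to `run`.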
  - run
```
public LDAModel run(JavaPairRDD<Long,Vector> documents)
```
    Java-friendly version of run()
    
    Parameters:
    documents - (undocumented)
    
    Returns:
    (undocumented)
