public class LDA extends Object implements Logging
Terminology:
- "word" = "term": an element of the vocabulary
- "token": instance of a term appearing in a document
- "topic": multinomial distribution over words representing some concept
References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
See also: Latent Dirichlet allocation (Wikipedia), http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Constructor | Description |
---|---|
LDA() | Constructs an LDA instance with default parameters. |
Modifier and Type | Method | Description |
---|---|---|
double | getAlpha() | Alias for getDocConcentration |
Vector | getAsymmetricAlpha() | Alias for getAsymmetricDocConcentration |
Vector | getAsymmetricDocConcentration() | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
double | getBeta() | Alias for getTopicConcentration |
int | getCheckpointInterval() | Period (in iterations) between checkpoints. |
double | getDocConcentration() | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
int | getK() | Number of topics to infer. |
int | getMaxIterations() | Maximum number of iterations for learning. |
LDAOptimizer | getOptimizer() | :: DeveloperApi :: LDAOptimizer used to perform the actual calculation |
long | getSeed() | Random seed |
double | getTopicConcentration() | Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
LDAModel | run(JavaPairRDD<Long,Vector> documents) | Java-friendly version of run() |
LDAModel | run(RDD<scala.Tuple2<Object,Vector>> documents) | Learn an LDA model using the given dataset. |
LDA | setAlpha(double alpha) | Alias for setDocConcentration() |
LDA | setAlpha(Vector alpha) | Alias for setDocConcentration() |
LDA | setBeta(double beta) | Alias for setTopicConcentration() |
LDA | setCheckpointInterval(int checkpointInterval) | Period (in iterations) between checkpoints (default = 10). |
LDA | setDocConcentration(double docConcentration) | Replicates a Double docConcentration to create a symmetric prior. |
LDA | setDocConcentration(Vector docConcentration) | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
LDA | setK(int k) | Number of topics to infer. |
LDA | setMaxIterations(int maxIterations) | Maximum number of iterations for learning. |
LDA | setOptimizer(LDAOptimizer optimizer) | :: DeveloperApi :: LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer) |
LDA | setOptimizer(String optimizerName) | Set the LDAOptimizer used to perform the actual calculation by algorithm name. |
LDA | setSeed(long seed) | Random seed |
LDA | setTopicConcentration(double topicConcentration) | Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
Methods inherited from class Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Logging: initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
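public int getK()
Number of topics to infer.
public LDA setK(int k)
Number of topics to infer.
Parameters: k - (undocumented)
public Vector getAsymmetricDocConcentration()
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a Dirichlet distribution.
public double getDocConcentration()
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This method assumes the Dirichlet distribution is symmetric and can be described by a single Double parameter. It should fail if docConcentration is asymmetric.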
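As a rough illustration of the symmetric-prior check described for getDocConcentration(), the plain-Java sketch below collapses a concentration vector to a single value and throws when the entries differ. The class SymmetricAlphaSketch and its method are hypothetical names invented for this example and are not part of the Spark API. The exception type is also an assumption, since the text above only says the call should fail.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of the Spark API. */
final class SymmetricAlphaSketch {

    /**
     * Returns the single value shared by every entry of the vector.
     * Throws if the entries differ, i.e. if the prior is asymmetric.
     * The exception type is an assumption made for this sketch.
     */
    static double toSymmetricValue(double[] docConcentration) {
        if (docConcentration.length == 0) {
            throw new IllegalArgumentException("docConcentration must not be empty");
        }
        double first = docConcentration[0];
        for (double value : docConcentration) {
            if (value != first) {
                throw new IllegalArgumentException(
                    "docConcentration is asymmetric: " + Arrays.toString(docConcentration));
            }
        }
        return first;
    }
}
```

Under these assumptions, a vector such as {0.5, 0.5, 0.5} yields 0.5, while {0.5, 0.7} is rejected.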
public LDA setDocConcentration(Vector docConcentration)
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
If set to a singleton vector Vector(-1), then docConcentration is set automatically. If set to a singleton vector Vector(t) where t != -1, then t is replicated to a vector of length k during LDAOptimizer.initialize(). Otherwise, the docConcentration vector must be length k. (default = Vector(-1) = automatic)
Optimizer-specific parameter settings:
- EM
  - Currently only supports symmetric distributions, so all values in the vector should be the same.
  - Values should be > 1.0
  - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Values should be >= 0
  - default = uniformly (1.0 / k), following the implementation from https://github.com/Blei-Lab/onlineldavb.
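The replication and default rules described above for setDocConcentration(Vector) can be sketched in plain Java as follows. This is an illustrative sketch under stated assumptions, not the library's implementation: DocConcentrationSketch, its Optimizer enum, and resolve(...) are hypothetical names, and a double[] stands in for the Vector type.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of the Spark API. */
final class DocConcentrationSketch {

    /** Stand-in for the two optimizers named in the text above. */
    enum Optimizer { EM, ONLINE }

    /**
     * Expands a user-supplied docConcentration into a vector of length k.
     * A singleton {-1} selects the optimizer-specific default, any other
     * singleton {t} is replicated k times, and a longer vector must
     * already have length k.
     */
    static double[] resolve(double[] docConcentration, int k, Optimizer optimizer) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive, got " + k);
        }
        if (docConcentration.length == 1) {
            double t = docConcentration[0];
            double value;
            if (t == -1.0) {
                // Automatic setting: (50 / k) + 1 for EM, 1.0 / k for Online.
                value = (optimizer == Optimizer.EM) ? (50.0 / k) + 1.0 : 1.0 / k;
            } else {
                value = t;
            }
            double[] replicated = new double[k];
            Arrays.fill(replicated, value);
            return replicated;
        }
        if (docConcentration.length != k) {
            throw new IllegalArgumentException(
                "docConcentration must have length 1 or k=" + k
                    + ", got " + docConcentration.length);
        }
        return docConcentration.clone();
    }
}
```

For example, with k = 10 the automatic EM setting works out to (50 / 10) + 1 = 6.0 for every topic, and with k = 4 the automatic Online setting is 1.0 / 4 = 0.25. The sketch does not enforce the recommended value ranges (> 1.0 for EM, >= 0 for Online).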
Parameters: docConcentration - (undocumented)
public LDA setDocConcentration(double docConcentration)
Replicates a Double docConcentration to create a symmetric prior.
Parameters: docConcentration - (undocumented)
public Vector getAsymmetricAlpha()
Alias for getAsymmetricDocConcentration
public double getAlpha()
Alias for getDocConcentration
public LDA setAlpha(Vector alpha)
Alias for setDocConcentration()
Parameters: alpha - (undocumented)
public LDA setAlpha(double alpha)
Alias for setDocConcentration()
Parameters: alpha - (undocumented)
public double getTopicConcentration()
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
public LDA setTopicConcentration(double topicConcentration)
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)
Optimizer-specific parameter settings:
- EM
  - Value should be > 1.0
  - default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Value should be >= 0
  - default = (1.0 / k), following the implementation from https://github.com/Blei-Lab/onlineldavb.
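The automatic topicConcentration setting described above can be sketched in a few lines of plain Java. The names TopicConcentrationSketch, Optimizer, and resolve(...) are hypothetical and exist only for this example; the sketch mirrors the stated defaults and is not taken from the library.

```java
/** Hypothetical helper for illustration only; not part of the Spark API. */
final class TopicConcentrationSketch {

    /** Stand-in for the two optimizers named in the text above. */
    enum Optimizer { EM, ONLINE }

    /**
     * Returns the user-supplied value unless it is -1, in which case the
     * optimizer-specific default is used: 0.1 + 1 for EM, 1.0 / k for Online.
     */
    static double resolve(double topicConcentration, int k, Optimizer optimizer) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive, got " + k);
        }
        if (topicConcentration != -1.0) {
            return topicConcentration;
        }
        return (optimizer == Optimizer.EM) ? 0.1 + 1.0 : 1.0 / k;
    }
}
```

Unlike docConcentration, this parameter is a single value, so no replication step is needed.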
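Parameters: topicConcentration - (undocumented)
public double getBeta()
Alias for getTopicConcentration
public LDA setBeta(double beta)
Alias for setTopicConcentration()
Parameters: beta - (undocumented)
public int getMaxIterations()
Maximum number of iterations for learning.
public LDA setMaxIterations(int maxIterations)
Maximum number of iterations for learning.
Parameters: maxIterations - (undocumented)
public long getSeed()
Random seed
public LDA setSeed(long seed)
Random seed
Parameters: seed - (undocumented)
public int getCheckpointInterval()
Period (in iterations) between checkpoints.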
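public LDA setCheckpointInterval(int checkpointInterval)
Period (in iterations) between checkpoints (default = 10). If the checkpoint directory is not set in SparkContext, this setting is ignored.
Parameters: checkpointInterval - (undocumented)
See also: SparkContext.setCheckpointDir(java.lang.String)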
public LDAOptimizer getOptimizer()
:: DeveloperApi ::
LDAOptimizer used to perform the actual calculation
public LDA setOptimizer(LDAOptimizer optimizer)
:: DeveloperApi ::
LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)
Parameters: optimizer - (undocumented)
public LDA setOptimizer(String optimizerName)
Set the LDAOptimizer used to perform the actual calculation by algorithm name.
Parameters: optimizerName - (undocumented)
public LDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
Learn an LDA model using the given dataset.
Parameters: documents - RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
public LDAModel run(JavaPairRDD<Long,Vector> documents)
Java-friendly version of run()
Parameters: documents - (undocumented)
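To make the "bag of words" input format expected by run() more concrete, the plain-Java sketch below turns a tokenized document into a term count array over a fixed vocabulary. TermCountSketch and its methods are hypothetical names for this example and are not part of the Spark API. Dropping tokens that fall outside the vocabulary is an assumption of the sketch, not something the text above specifies.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the Spark API. */
final class TermCountSketch {

    /** Assigns each vocabulary term its position (0-based) as a vector index. */
    static Map<String, Integer> indexVocabulary(List<String> vocabulary) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < vocabulary.size(); i++) {
            index.put(vocabulary.get(i), i);
        }
        return index;
    }

    /**
     * Counts how often each vocabulary term occurs among the tokens.
     * The result has one entry per vocabulary term, so its length is the
     * vocabulary size. Tokens outside the vocabulary are skipped
     * (an assumption made for this sketch).
     */
    static double[] termCounts(List<String> tokens, Map<String, Integer> vocabIndex) {
        double[] counts = new double[vocabIndex.size()];
        for (String token : tokens) {
            Integer position = vocabIndex.get(token);
            if (position != null) {
                counts[position] += 1.0;
            }
        }
        return counts;
    }
}
```

In a Spark program, each such array would typically be wrapped with Vectors.dense(...) and paired with a unique, non-negative document ID before being passed to run().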