Word2Vec (Spark 2.3.3 JavaDoc)

Object
- org.apache.spark.mllib.feature.Word2Vec

All Implemented Interfaces:

java.io.Serializable, Logging
```
public class Word2Vec
extends Object
implements scala.Serializable, Logging
```
Word2Vec creates vector representations of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns a vector representation for each word in the vocabulary. These vectors can be used as features in natural language processing and machine learning algorithms.
The implementation uses the skip-gram model and trains it with hierarchical softmax. The variable names in the implementation match those of the original C implementation.
The original C implementation is available at https://code.google.com/p/word2vec/. For the research papers, see "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality".

See Also:

Serialized Form

Constructor Summary

Constructors
Constructor and Description

Word2Vec()


Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`<S extends Iterable<String>> Word2VecModel`	`fit(JavaRDD<S> dataset)` Computes the vector representation of each word in vocabulary (Java version).
`<S extends scala.collection.Iterable<String>> Word2VecModel`	`fit(RDD<S> dataset)` Computes the vector representation of each word in vocabulary.
`Word2Vec`	`setLearningRate(double learningRate)` Sets initial learning rate (default: 0.025).
`Word2Vec`	`setMaxSentenceLength(int maxSentenceLength)` Sets the maximum length (in words) of each sentence in the input data.
`Word2Vec`	`setMinCount(int minCount)` Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's vocabulary (default: 5).
`Word2Vec`	`setNumIterations(int numIterations)` Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.
`Word2Vec`	`setNumPartitions(int numPartitions)` Sets number of partitions (default: 1).
`Word2Vec`	`setSeed(long seed)` Sets random seed (default: a random long integer).
`Word2Vec`	`setVectorSize(int vectorSize)` Sets vector size (default: 100).
`Word2Vec`	`setWindowSize(int window)` Sets the window of words (default: 5).

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.internal.Logging
initializeLogging, initializeLogIfNecessary, initializeLogIfNecessary, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning

- Constructor Detail
  - Word2Vec
```
public Word2Vec()
```
- Method Detail
  - setMaxSentenceLength
```
public Word2Vec setMaxSentenceLength(int maxSentenceLength)
```
    Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold is divided into chunks of at most maxSentenceLength words (default: 1000).
    
    Parameters:
    
    maxSentenceLength - (undocumented)
    
    Returns:
    
    (undocumented)
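
    The chunking behavior described above can be illustrated with a minimal, Spark-independent sketch. `SentenceChunker` and `chunk` are hypothetical names introduced only for this illustration; this is not Spark's internal code, just one way to split a long sentence into pieces of at most maxSentenceLength words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative helper (hypothetical, not part of Spark). */
final class SentenceChunker {

    /** Splits words into consecutive chunks of at most maxSentenceLength words. */
    static List<List<String>> chunk(List<String> words, int maxSentenceLength) {
        if (maxSentenceLength <= 0) {
            throw new IllegalArgumentException("maxSentenceLength must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < words.size(); start += maxSentenceLength) {
            int end = Math.min(start + maxSentenceLength, words.size());
            // Copy the sub-list so each chunk is independent of the input list.
            chunks.add(new ArrayList<>(words.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("a", "b", "c", "d", "e");
        // With a limit of 2, the 5-word sentence becomes chunks of 2, 2, and 1 words.
        System.out.println(chunk(sentence, 2)); // prints [[a, b], [c, d], [e]]
    }
}
```

    The last chunk may be shorter than the limit, which is why the description says "up to" maxSentenceLength.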
  - setVectorSize
```
public Word2Vec setVectorSize(int vectorSize)
```
    Sets vector size (default: 100).
    
    Parameters:
    
    vectorSize - (undocumented)
    
    Returns:
    
    (undocumented)
  - setLearningRate
```
public Word2Vec setLearningRate(double learningRate)
```
    Sets initial learning rate (default: 0.025).
    
    Parameters:
    
    learningRate - (undocumented)
    
    Returns:
    
    (undocumented)
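
    The word "initial" implies that the rate changes during training. As a rough sketch of what such a schedule can look like, the code below follows the linear decay used by the original C tool: the rate shrinks in proportion to the share of words processed and is floored at 0.01% of the initial value. `LearningRateSchedule`, `currentRate`, and the parameter names are hypothetical, and Spark's exact schedule may differ in detail.

```java
/** Illustrative sketch (hypothetical helper, not Spark's code). */
final class LearningRateSchedule {

    /**
     * Linearly decays the rate from initialRate as training progresses,
     * never going below 0.01% of the initial value.
     *
     * @param totalWords total number of words to process over all iterations
     */
    static double currentRate(double initialRate, long wordsProcessed, long totalWords) {
        double progress = (double) wordsProcessed / (double) (totalWords + 1);
        double rate = initialRate * (1.0 - progress);
        double floor = initialRate * 0.0001;
        return Math.max(rate, floor);
    }

    public static void main(String[] args) {
        long totalWords = 1_000_000L;
        for (long seen = 0; seen <= totalWords; seen += 250_000L) {
            System.out.printf("%d words -> rate %.6f%n",
                    seen, currentRate(0.025, seen, totalWords));
        }
    }
}
```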
  - setNumPartitions
```
public Word2Vec setNumPartitions(int numPartitions)
```
    Sets number of partitions (default: 1). Use a small number for accuracy.
    
    Parameters:
    
    numPartitions - (undocumented)
    
    Returns:
    
    (undocumented)
  - setNumIterations
```
public Word2Vec setNumIterations(int numIterations)
```
    Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.
    
    Parameters:
    
    numIterations - (undocumented)
    
    Returns:
    
    (undocumented)
  - setSeed
```
public Word2Vec setSeed(long seed)
```
    Sets random seed (default: a random long integer).
    
    Parameters:
    
    seed - (undocumented)
    
    Returns:
    
    (undocumented)
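
    A fixed seed matters because the word vectors start from random values. The sketch below shows the general idea with `java.util.Random`: the same seed always yields the same starting weights. The initialization formula (a uniform value in [-0.5, 0.5) divided by the vector size) follows the convention of the original C tool and is an assumption here, as are the names `SeededInit` and `initialVectors`.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch (hypothetical helper, not Spark's code). */
final class SeededInit {

    /** Returns vocabSize * vectorSize small random weights derived from the seed. */
    static float[] initialVectors(int vocabSize, int vectorSize, long seed) {
        Random random = new Random(seed);
        float[] weights = new float[vocabSize * vectorSize];
        for (int i = 0; i < weights.length; i++) {
            // Uniform value in [-0.5, 0.5), scaled down by the vector size.
            weights[i] = (random.nextFloat() - 0.5f) / vectorSize;
        }
        return weights;
    }

    public static void main(String[] args) {
        float[] first = initialVectors(3, 100, 42L);
        float[] second = initialVectors(3, 100, 42L);
        // Identical seeds give identical starting weights.
        System.out.println(Arrays.equals(first, second)); // prints true
    }
}
```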
  - setWindowSize
```
public Word2Vec setWindowSize(int window)
```
    Sets the window of words (default: 5).
    
    Parameters:
    
    window - (undocumented)
    
    Returns:
    
    (undocumented)
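
    To make the role of the window concrete, the following minimal sketch enumerates the (center word, context word) pairs a skip-gram model would see for one sentence. `ContextWindow` and `pairs` are hypothetical names, and the sketch assumes a fixed window; the original C tool additionally shrinks the window by a random amount at each position, which is omitted here. It is not Spark's internal code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch (hypothetical helper, not Spark's code). */
final class ContextWindow {

    /** Returns {center, context} pairs for every word within `window` positions. */
    static List<String[]> pairs(List<String> sentence, int window) {
        List<String[]> result = new ArrayList<>();
        for (int center = 0; center < sentence.size(); center++) {
            // Clamp the window to the sentence boundaries.
            int from = Math.max(0, center - window);
            int to = Math.min(sentence.size() - 1, center + window);
            for (int ctx = from; ctx <= to; ctx++) {
                if (ctx != center) {
                    result.add(new String[] {sentence.get(center), sentence.get(ctx)});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("the", "quick", "brown", "fox");
        for (String[] pair : pairs(sentence, 1)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

    A larger window produces more pairs per word and captures broader context, at the cost of more training work.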
  - setMinCount
```
public Word2Vec setMinCount(int minCount)
```
    Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's vocabulary (default: 5).
    
    Parameters:
    
    minCount - (undocumented)
    
    Returns:
    
    (undocumented)
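
    The effect of minCount can be sketched without Spark: count every token, then drop those seen fewer than minCount times. `VocabularyFilter` and `buildVocabulary` are hypothetical names used only for this illustration, and the sketch runs on an in-memory list rather than an RDD.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch (hypothetical helper, not Spark's code). */
final class VocabularyFilter {

    /** Counts tokens and keeps only those that occur at least minCount times. */
    static Map<String, Integer> buildVocabulary(List<List<String>> sentences, int minCount) {
        // TreeMap keeps the words sorted, which makes the output deterministic.
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> sentence : sentences) {
            for (String token : sentence) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // Remove rare tokens that fall below the threshold.
        counts.values().removeIf(count -> count < minCount);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("a", "b", "a"),
                Arrays.asList("a", "c", "b"));
        // "c" occurs once and is dropped when minCount is 2.
        System.out.println(buildVocabulary(sentences, 2)); // prints {a=3, b=2}
    }
}
```

    With the default of 5, words that appear fewer than five times in the corpus receive no vector, so small test corpora may need a lower value.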
  - fit
```
public <S extends scala.collection.Iterable<String>> Word2VecModel fit(RDD<S> dataset)
```
    Computes the vector representation of each word in vocabulary.
    
    Parameters:
    
    dataset - an RDD of sentences, each expressed as an iterable collection of words
    
    Returns:
    
    a Word2VecModel
  - fit
```
public <S extends Iterable<String>> Word2VecModel fit(JavaRDD<S> dataset)
```
    Computes the vector representation of each word in vocabulary (Java version).
    
    Parameters:
    
    dataset - a JavaRDD of sentences, each expressed as an iterable collection of words
    
    Returns:
    
    a Word2VecModel
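
    Both fit variants expect each element of the dataset to be one sentence, given as an iterable of words. The sketch below shows one way to prepare data in that shape from raw text lines. `SentenceTokenizer` and `tokenize` are hypothetical names, and whitespace splitting and lower-casing are assumptions chosen for brevity, not requirements of the API. In a Spark program, the resulting list could be passed to `JavaSparkContext.parallelize` to obtain a `JavaRDD<List<String>>` suitable for fit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch (hypothetical helper, not part of Spark). */
final class SentenceTokenizer {

    /** Turns each non-blank line into a list of lower-cased, whitespace-separated words. */
    static List<List<String>> tokenize(List<String> lines) {
        List<List<String>> sentences = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // Skip blank lines so no empty sentence is produced.
            }
            sentences.add(Arrays.asList(trimmed.toLowerCase().split("\\s+")));
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello  World", "   ", "spark mllib");
        System.out.println(tokenize(lines)); // prints [[hello, world], [spark, mllib]]
    }
}
```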
