Binarize a column of continuous features given a threshold.
:: Experimental ::
This BucketedRandomProjectionLSH implements Locality Sensitive Hashing functions for Euclidean distance metrics.
The input is dense or sparse vectors, each of which represents a point in the Euclidean distance space. The output will be vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function.
References:
1. Wikipedia on Stable Distributions
2. Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).
:: Experimental ::
Model produced by BucketedRandomProjectionLSH, where multiple random vectors are stored. The vectors are normalized to be unit vectors and each vector is used in a hash function: h_i(x) = floor(r_i.dot(x) / bucketLength) where r_i is the i-th random unit vector. The number of buckets will be (max L2 norm of input vectors) / bucketLength.
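The hash function above can be sketched in plain Scala. This is a minimal illustration only, not the Spark implementation; the helper names `unit`, `bucketHash`, and `hashAll` are hypothetical, and vectors are modelled as `Seq[Double]`.

```scala
// Normalize a vector to unit L2 length.
def unit(v: Seq[Double]): Seq[Double] = {
  val norm = math.sqrt(v.map(x => x * x).sum)
  v.map(_ / norm)
}

// h_i(x) = floor(r_i.dot(x) / bucketLength) for a single unit vector r.
def bucketHash(r: Seq[Double], x: Seq[Double], bucketLength: Double): Double = {
  val dot = r.zip(x).map { case (a, b) => a * b }.sum
  math.floor(dot / bucketLength)
}

// One hash value per stored random vector.
def hashAll(rs: Seq[Seq[Double]], x: Seq[Double], bucketLength: Double): Seq[Double] =
  rs.map(r => bucketHash(r, x, bucketLength))
```

A larger bucketLength places more points in the same bucket, so nearby points are more likely to share a hash value.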
Bucketizer maps a column of continuous features to a column of feature buckets. Since 2.3.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
- numTopFeatures chooses a fixed number of top features according to a chi-squared test.
- percentile is similar but chooses a fraction of all features instead of a fixed number.
- fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
- fdr uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
- fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features set to 50.
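The selection methods listed above can be sketched as follows, assuming the chi-squared p-value of every feature has already been computed. The function names are hypothetical, and ranking the "top" features by ascending p-value is an assumption of this sketch rather than something stated above.

```scala
// Each feature is identified by its position in pValues.

// fpr: keep every feature whose p-value is below alpha.
def selectFpr(pValues: Seq[Double], alpha: Double): Seq[Int] =
  pValues.zipWithIndex.collect { case (p, i) if p < alpha => i }

// fwe: same test, with the threshold scaled by 1/numFeatures.
def selectFwe(pValues: Seq[Double], alpha: Double): Seq[Int] =
  selectFpr(pValues, alpha / pValues.length)

// numTopFeatures: keep a fixed number of features (assumed: smallest p-values first).
def selectTop(pValues: Seq[Double], numTopFeatures: Int): Seq[Int] =
  pValues.zipWithIndex.sortBy(_._1).take(numTopFeatures).map(_._2).sorted

// percentile: keep a fraction of all features instead of a fixed number.
def selectPercentile(pValues: Seq[Double], percentile: Double): Seq[Int] =
  selectTop(pValues, (pValues.length * percentile).toInt)

// fdr: Benjamini-Hochberg. Find the largest 1-based rank k with
// p_(k) <= alpha * k / n and keep the k features with the smallest p-values.
def selectFdr(pValues: Seq[Double], alpha: Double): Seq[Int] = {
  val n = pValues.length
  val ranked = pValues.zipWithIndex.sortBy(_._1)
  val cutoff = ranked.zipWithIndex.lastIndexWhere {
    case ((p, _), rank) => p <= alpha * (rank + 1) / n
  }
  ranked.take(cutoff + 1).map(_._2).sorted
}
```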
Model fitted by ChiSqSelector.
Extracts a vocabulary from document collections and generates a CountVectorizerModel.
Converts a text document to a sparse vector of token counts.
A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).
More information on DCT-II in Discrete cosine transform (Wikipedia).
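As a rough illustration of the scaled DCT-II described above, the following direct O(n^2) sketch applies the unitary scaling factors. The function name `dct` is hypothetical and this is not how the transformer is implemented internally.

```scala
// Unitary (scaled) DCT-II of a real vector, computed directly from the definition.
// No zero padding: the output has the same length as the input.
def dct(x: Seq[Double]): Seq[Double] = {
  val n = x.length
  (0 until n).map { k =>
    // Scaling that makes the transform matrix unitary.
    val scale = if (k == 0) math.sqrt(1.0 / n) else math.sqrt(2.0 / n)
    val sum = x.indices.map(i => x(i) * math.cos(math.Pi / n * (i + 0.5) * k)).sum
    scale * sum
  }
}
```

Because the transform is unitary, the sum of squares of the output equals the sum of squares of the input.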
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector. In other words, it scales each column of the dataset by a scalar multiplier.
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.
The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categoricalCols.
- String columns: For categorical features, the hash value of the string "column_name=value" is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are "one-hot" encoded (similarly to using OneHotEncoder with dropLast=false).
- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as "column_name=true" or "column_name=false", with an indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector).
The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.
val df = Seq(
  (2.0, true, "1", "foo"),
  (3.0, false, "2", "bar")
).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")

hasher.transform(df).show(false)

+----+-----+---------+------+------------------------------------------------------+
|real|bool |stringNum|string|features                                              |
+----+-----+---------+------+------------------------------------------------------+
|2.0 |true |1        |foo   |(262144,[51871,63643,174475,253195],[1.0,1.0,2.0,1.0])|
|3.0 |false|2        |bar   |(262144,[6031,80619,140467,174475],[1.0,1.0,1.0,3.0]) |
+----+-----+---------+------+------------------------------------------------------+
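The per-type handling described above can be sketched without Spark as follows. This is an illustration under assumptions: it uses `scala.util.hashing.MurmurHash3.stringHash` from the Scala standard library, so the indices it produces will not match those in the FeatureHasher output shown above, and the helper names are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

// Map a hash code to a non-negative index in [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val r = x % mod
  if (r < 0) r + mod else r
}

def featureIndex(key: String, numFeatures: Int): Int =
  nonNegativeMod(MurmurHash3.stringHash(key), numFeatures)

// A row maps column names to values. Numeric values are hashed by column name,
// strings and booleans by "column_name=value" with indicator 1.0, and nulls are skipped.
def hashRow(row: Map[String, Any], numFeatures: Int): Map[Int, Double] = {
  val entries = row.toSeq.collect {
    case (name, d: Double)  => featureIndex(name, numFeatures) -> d
    case (name, s: String)  => featureIndex(s"$name=$s", numFeatures) -> 1.0
    case (name, b: Boolean) => featureIndex(s"$name=$b", numFeatures) -> 1.0
  }
  // Values that collide on the same index are summed.
  entries.groupBy(_._1).map { case (i, vs) => i -> vs.map(_._2).sum }
}
```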
Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
Compute the Inverse Document Frequency (IDF) given a collection of documents.
Model fitted by IDF.
:: Experimental :: Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located. The input columns should be of DoubleType or FloatType. Currently Imputer does not support categorical features (SPARK-15041) and possibly creates incorrect values for a categorical feature.
Note that the mean/median value is computed after filtering out missing values. All Null values in the input columns are treated as missing, and so are also imputed. For computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.
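A minimal sketch of the imputation logic described above, with missing values modelled as `None`. It is an illustration only: the function name is hypothetical, and the median here is an exact lower median rather than the approximate quantile used by the estimator.

```scala
// Replace missing entries with the mean or median of the non-missing ones.
def impute(column: Seq[Option[Double]], strategy: String): Seq[Double] = {
  // The surrogate is computed after filtering out missing values.
  val present = column.flatten
  require(present.nonEmpty, "column has no non-missing values")
  val surrogate = strategy match {
    case "mean"   => present.sum / present.length
    case "median" => present.sorted.apply((present.length - 1) / 2)
  }
  column.map {
    case Some(v) => v
    case None    => surrogate
  }
}
```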
:: Experimental :: Model fitted by Imputer.
A Transformer that maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML attributes of the input column, or from user-supplied labels (which take precedence over ML attributes).
See also: StringIndexer for converting strings into indices
Implements the feature interaction transform. This transformer takes in Double and Vector type columns and outputs a flattened vector of their feature interactions. To handle interaction, we first one-hot encode any nominal features. Then, a vector of the feature cross-products is produced.
For example, given the input feature values Double(2) and Vector(3, 4), the output would be Vector(6, 8) if all input features were numeric. If the first feature was instead nominal with four categories, the output would then be Vector(0, 0, 0, 0, 3, 4, 0, 0).
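The example above can be reproduced with a small sketch that treats each input column as a sequence of doubles and takes the flattened cross-product. The helper names are hypothetical, and the nominal case assumes the first feature falls in the third of its four categories (index 2), which is the value consistent with the output shown.

```scala
// One-hot encode a nominal value, keeping all categories.
def oneHotFull(index: Int, numCategories: Int): Seq[Double] =
  Seq.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)

// Flattened cross-product of the given columns, taken left to right.
def interact(columns: Seq[Double]*): Seq[Double] =
  columns.foldLeft(Seq(1.0)) { (acc, col) =>
    for (a <- acc; b <- col) yield a * b
  }

val numeric = interact(Seq(2.0), Seq(3.0, 4.0))           // all inputs numeric
val nominal = interact(oneHotFull(2, 4), Seq(3.0, 4.0))   // first input nominal
```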
Class that represents the features and label of a data point.
Label for this data point.
List of features for this data point.
Rescale each feature individually to range [-1, 1] by dividing by the maximum absolute value of that feature. It does not shift/center the data, and thus does not destroy any sparsity.
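A minimal sketch of this rescaling, with the dataset modelled as rows of equal length. The function name is hypothetical, and leaving an all-zero feature at zero is an assumption of the sketch.

```scala
// Divide each feature (column) by its maximum absolute value.
def maxAbsScale(rows: Seq[Seq[Double]]): Seq[Seq[Double]] = {
  val maxAbs = rows.transpose.map(col => col.map(x => math.abs(x)).max)
  rows.map { row =>
    row.zip(maxAbs).map { case (x, m) => if (m == 0.0) 0.0 else x / m }
  }
}
```

Zero entries stay zero because no offset is subtracted, which is why sparsity is preserved.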
Model fitted by MaxAbsScaler.
:: Experimental ::
LSH class for Jaccard distance.
The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example,
Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0)))
means there are 10 elements in the space. This set contains elements 2, 3, and 5. Also, any
input vector must have at least 1 non-zero index, and all non-zero values are
treated as binary "1" values.
References: Wikipedia on MinHash
:: Experimental ::
Model produced by MinHashLSH, where multiple hash functions are stored. Each hash function is picked from the following family of hash functions, where a_i and b_i are randomly chosen integers less than prime:
$$ h_i(x) = ((x \cdot a_i + b_i) \mod prime) $$
This hash family is approximately min-wise independent according to the reference.
Reference: Tom Bohman, Colin Cooper, and Alan Frieze. "Min-wise independent linear permutations." Electronic Journal of Combinatorics 7 (2000): R26.
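A sketch of how such a hash function can be applied to a set, under assumptions: the input set is represented by its non-zero indices, the MinHash value is taken as the minimum of h_i over those indices, and the value of `prime` below is an assumed large prime chosen only for illustration. The names are hypothetical.

```scala
// Assumed large prime; the value used by the model is not stated above.
val prime = 2038074743L

// MinHash of a non-empty set of indices for one (a, b) pair:
// the minimum of (x * a + b) mod prime over all elements x.
def minHash(indices: Set[Int], a: Long, b: Long): Long = {
  require(indices.nonEmpty, "input must have at least 1 non-zero index")
  indices.map(x => (x.toLong * a + b) % prime).min
}

// Jaccard distance between two sets: 1 - |intersection| / |union|.
def jaccardDistance(x: Set[Int], y: Set[Int]): Double =
  1.0 - (x intersect y).size.toDouble / (x union y).size
```

Sets with a small Jaccard distance share most elements, so their minimum hash values coincide with high probability.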
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling. The rescaled value for feature E is calculated as:
$$ Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min $$
For the case \(E_{max} == E_{min}\), \(Rescaled(e_i) = 0.5 * (max + min)\).
Since zero values will probably be transformed to non-zero values, output of the transformer will be DenseVector even for sparse input.
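The rescaling formula above, including the special case for a constant feature, can be sketched for a single feature column as follows (the function name is hypothetical):

```scala
// Rescale one feature column E to [min, max].
def minMaxScale(e: Seq[Double], min: Double = 0.0, max: Double = 1.0): Seq[Double] = {
  val eMin = e.min
  val eMax = e.max
  e.map { x =>
    if (eMax == eMin) 0.5 * (max + min) // constant feature
    else (x - eMin) / (eMax - eMin) * (max - min) + min
  }
}
```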
Model fitted by MinMaxScaler.
A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.
When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
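The behavior described above, including the empty and too-short cases, can be sketched with a sliding window (the function name is hypothetical):

```scala
// Build space-separated n-grams from an array of words.
def ngrams(tokens: Seq[String], n: Int): Seq[String] = {
  val words = tokens.filter(_ != null) // null values are ignored
  if (words.length < n) Seq.empty      // also covers empty input
  else words.sliding(n).map(_.mkString(" ")).toList
}
```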
Normalize a vector to have unit norm using the given p-norm.
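As an illustration, a p-norm normalization can be sketched as below. The function name, the default p = 2, and returning an all-zero vector unchanged are assumptions of this sketch.

```scala
// Scale v so that its p-norm equals 1 (p >= 1).
def normalize(v: Seq[Double], p: Double = 2.0): Seq[Double] = {
  require(p >= 1.0, "p must be at least 1")
  val norm =
    if (p.isPosInfinity) v.map(x => math.abs(x)).max
    else math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
  if (norm == 0.0) v else v.map(_ / norm)
}
```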
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because including it would make the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
When encoding multi-column by using inputCols and outputCols params, input/output cols come in pairs, specified by the order in the arrays, and each pair is treated independently.
Note: This is different from scikit-learn's OneHotEncoder, which keeps all categories. The output vectors are sparse.
When handleInvalid is configured to 'keep', an extra "category" indicating invalid values is added as last category. So when dropLast is true, invalid values are encoded as an all-zeros vector.
See also: StringIndexer for converting categorical values into category indices
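The 5-category example above can be checked with a small sketch. It uses a dense `Seq[Double]` for readability, whereas the encoder outputs sparse vectors; the function name is hypothetical.

```scala
// One-hot encode a category index; with dropLast the last category maps to all zeros.
def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Seq[Double] = {
  val size = if (dropLast) numCategories - 1 else numCategories
  Seq.tabulate(size)(i => if (i == index) 1.0 else 0.0)
}
```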
PCA trains a model to project vectors to a lower dimensional space of the top k principal components.
Model fitted by PCA. Transforms vectors to a lower dimensional space.
Perform feature expansion in a polynomial space. As described in Polynomial expansion (Wikipedia), "In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition". Take a 2-variable feature vector as an example: (x, y). If we want to expand it with degree 2, then we get (x, x * x, y, x * y, y * y).
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.
Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both of the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience.
NaN handling: null and NaN values will be ignored from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
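The fitting and bucketing steps above can be sketched as follows. This is an illustration under assumptions: it picks exact quantiles by rank rather than using the approximate algorithm, it models only the "keep" behavior for NaN values, and the function names are hypothetical.

```scala
// Fit: choose split points at evenly spaced quantiles, ignoring NaN values.
// Duplicate quantiles are merged, so fewer buckets than requested may result.
def quantileSplits(values: Seq[Double], numBuckets: Int): Seq[Double] = {
  val sorted = values.filterNot(_.isNaN).sorted
  val inner = (1 until numBuckets).map { i =>
    sorted((i.toDouble / numBuckets * (sorted.length - 1)).toInt)
  }.distinct
  (Double.NegativeInfinity +: inner) :+ Double.PositiveInfinity
}

// Transform: bucket i covers [splits(i), splits(i + 1)).
// NaN values go to an extra bucket after the regular ones.
def bucket(x: Double, splits: Seq[Double]): Int =
  if (x.isNaN) splits.length - 1
  else math.min(splits.lastIndexWhere(_ <= x), splits.length - 2)
```

With 4 buckets the split array has 5 entries, so regular values land in buckets 0 to 3 and NaN values in bucket 4, matching the description above.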
:: Experimental :: Implements the transforms required for fitting a dataset against an R model formula. Currently we support a limited subset of the R operators, including '~', '.', ':', '+', and '-'. Also see the R formula docs here: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html
The basic operators are:
- '~' separate target and terms
- '+' concat terms, "+ 0" means removing intercept
- '-' remove a term, "- 1" means removing intercept
- ':' interaction (multiplication for numeric values, or binarized categorical values)
- '.' all columns except target
Suppose a and b are double columns, we use the following simple examples to illustrate the effect of RFormula:
- y ~ a + b means model y ~ w0 + w1 * a + w2 * b where w0 is the intercept and w1, w2 are coefficients.
- y ~ a + b + a:b - 1 means model y ~ w1 * a + w2 * b + w3 * a * b where w1, w2, w3 are coefficients.
RFormula produces a vector column of features and a double or string column of label. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If the label column is of type string, it will be first transformed to double with StringIndexer. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
:: Experimental :: Model fitted by RFormula. Fitting is required to determine the factor levels of formula terms.
A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.
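The two modes can be sketched with the Scala standard library regex API. The function and parameter names below are hypothetical stand-ins for the tokenizer's parameters.

```scala
// gaps = true: the pattern matches separators, so split on it.
// gaps = false: the pattern matches the tokens themselves.
def regexTokenize(
    text: String,
    pattern: String,
    gaps: Boolean = true,
    minTokenLength: Int = 1): Seq[String] = {
  val re = pattern.r
  val tokens = if (gaps) re.split(text).toList else re.findAllIn(text).toList
  tokens.filter(_.length >= minTokenLength)
}
```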
Implements the transformations which are defined by SQL statement. Currently we only support SQL syntax like 'SELECT ... FROM __THIS__ ...' where '__THIS__' represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output; it can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns. For example, SQLTransformer supports statements like:
SELECT a, a + b AS a_b FROM __THIS__
SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ where a > 5
SELECT a, b, SUM(c) AS c_sum FROM __THIS__ GROUP BY a, b
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
The "unit std" is computed using the corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.
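A minimal sketch of this standardization for a single feature column, using the corrected (n - 1) sample standard deviation. The function name is hypothetical, and mapping a constant feature to zeros is an assumption of the sketch.

```scala
// Remove the mean and scale to unit variance.
def standardize(xs: Seq[Double]): Seq[Double] = {
  val n = xs.length
  val mean = xs.sum / n
  // Unbiased sample variance divides by n - 1.
  val variance = xs.map(x => (x - mean) * (x - mean)).sum / (n - 1)
  val std = math.sqrt(variance)
  xs.map(x => if (std == 0.0) 0.0 else (x - mean) / std)
}
```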
Model fitted by StandardScaler.
A feature transformer that filters out stop words from input.
Null values in the input array are preserved unless null is explicitly added to stopWords.
A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType.
See also: IndexToString for the inverse transformation
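The default frequency ordering and the inverse mapping can be sketched as below. The function names are hypothetical, and breaking ties between equally frequent labels alphabetically is an assumption of this sketch.

```scala
// Fit: order labels by descending frequency (ties broken alphabetically here).
def fitLabels(values: Seq[String]): Seq[String] =
  values.groupBy(identity).toSeq
    .sortBy { case (label, occurrences) => (-occurrences.size, label) }
    .map(_._1)

// Transform: replace each label by its index in [0, numLabels).
def toIndices(values: Seq[String], labels: Seq[String]): Seq[Double] = {
  val index = labels.zipWithIndex.toMap
  values.map(v => index(v).toDouble)
}

// Inverse transformation: map indices back to the original strings.
def toStrings(indices: Seq[Double], labels: Seq[String]): Seq[String] =
  indices.map(i => labels(i.toInt))
```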
Model fitted by StringIndexer.
During transformation, if the input column does not exist, StringIndexerModel.transform would return the input dataset unmodified.
This is a temporary fix for the case when target labels do not exist during prediction.
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
A feature transformer that merges multiple columns into a vector column.
This requires one pass over the entire dataset. In case we need to infer column lengths from the data we require an additional call to the 'first' Dataset method, see 'handleInvalid' parameter.
Class for indexing categorical feature columns in a dataset of Vector.
This has 2 usage modes:
This returns a model which can transform categorical features to use 0-based indices.
Index stability:
TODO: Future extensions: The following functionality is planned for the future:
Model fitted by VectorIndexer. Transform categorical features to use 0-based indices instead of their original values.
This maintains vector sparsity.
:: Experimental :: A feature transformer that adds size information to the metadata of a vector column. VectorAssembler needs size information for its input columns and cannot be used on streaming dataframes without this metadata.
Note: VectorSizeHint modifies inputCol to include size metadata and does not have an outputCol.
This class takes a feature vector and outputs a new feature vector with a subarray of the original features.
The subset of features can be specified with either indices (setIndices()) or names (setNames()). At least one feature must be selected. Duplicate features are not allowed, so there can be no overlap between selected indices and names.
The output vector will order features with the selected indices first (in the order given), followed by the selected names (in the order given).
Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning process.
Model fitted by Word2Vec.
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via OneHotEncoder.dropLast), because including it would make the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
(Since version 2.3.0)
Note: This is different from scikit-learn's OneHotEncoder, which keeps all categories. The output vectors are sparse.
See also: StringIndexer for converting categorical values into category indices
The expansion is done via recursion. Given n features and degree d, the size after expansion is (n + d choose d) (including 1 and first-order values). For example, let f([a, b, c], 3) be the function that expands [a, b, c] to their monomials of degree 3. We have the following recursion:
$$ f([a, b, c], 3) = f([a, b], 3) ++ f([a, b], 2) * c ++ f([a, b], 1) * c^2 ++ [c^3] $$
To handle sparsity, if c is zero, we can skip all monomials that contain it. We remember the current index and increment it properly for sparse input.
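The recursion above can be sketched for dense input as follows. This is an illustration only: it does not implement the sparse-input index bookkeeping, the function names are hypothetical, and dropping the leading constant term from the final result is an assumption made so that the output matches the degree-2 example (x, x * x, y, x * y, y * y).

```scala
// f(features, degree): all monomials of total degree <= degree, including the constant 1.
// The last feature c contributes f(init, degree - i) * c^i for i = 0..degree.
def expandWithConstant(features: Seq[Double], degree: Int): Seq[Double] =
  if (features.isEmpty || degree == 0) Seq(1.0)
  else {
    val init = features.init
    val last = features.last
    (0 to degree).flatMap { i =>
      expandWithConstant(init, degree - i).map(_ * math.pow(last, i))
    }
  }

// Polynomial expansion without the leading constant term.
def polyExpand(features: Seq[Double], degree: Int): Seq[Double] =
  expandWithConstant(features, degree).tail
```

For n features and degree d, `expandWithConstant` returns (n + d choose d) values, for example 6 values for n = 2 and d = 2.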
Feature transformers
The ml.feature package provides common feature transformers that help convert raw data or features into more suitable forms for model fitting. Most feature transformers are implemented as Transformers, which transform one DataFrame into another, e.g., HashingTF. Some feature transformers are implemented as Estimators, because the transformation requires some aggregated information of the dataset, e.g., document frequencies in IDF. For those feature transformers, calling Estimator.fit is required to obtain the model first, e.g., IDFModel, in order to apply transformation. The transformation is usually done by appending new columns to the input DataFrame, so all input columns are carried over.
We try to make each transformer minimal, so it becomes flexible to assemble feature transformation pipelines. Pipeline can be used to chain feature transformers, and VectorAssembler can be used to combine multiple feature transformations, for example:
Some feature transformers implemented in MLlib are inspired by those implemented in scikit-learn. The major difference is that most scikit-learn feature transformers operate eagerly on the entire input dataset, while MLlib's feature transformers operate lazily on individual columns, which is more efficient and flexible to handle large and complex datasets.
See also: scikit-learn.preprocessing