public class MLUtils
extends Object
Constructor and Description |
---|
MLUtils() |
Modifier and Type | Method and Description |
---|---|
static Vector | appendBias(Vector vector) Returns a new vector with 1.0 (bias) appended to the input vector. |
static Dataset<Row> | convertMatrixColumnsFromML(Dataset<?> dataset, scala.collection.Seq<String> cols) Converts matrix columns in an input Dataset to the old Matrix type from the new Matrix type under the spark.ml package. |
static Dataset<Row> | convertMatrixColumnsFromML(Dataset<?> dataset, String... cols) Converts matrix columns in an input Dataset to the old Matrix type from the new Matrix type under the spark.ml package. |
static Dataset<Row> | convertMatrixColumnsToML(Dataset<?> dataset, scala.collection.Seq<String> cols) Converts matrix columns in an input Dataset from the old Matrix type to the new Matrix type under the spark.ml package. |
static Dataset<Row> | convertMatrixColumnsToML(Dataset<?> dataset, String... cols) Converts matrix columns in an input Dataset from the old Matrix type to the new Matrix type under the spark.ml package. |
static Dataset<Row> | convertVectorColumnsFromML(Dataset<?> dataset, scala.collection.Seq<String> cols) Converts vector columns in an input Dataset to the old Vector type from the new Vector type under the spark.ml package. |
static Dataset<Row> | convertVectorColumnsFromML(Dataset<?> dataset, String... cols) Converts vector columns in an input Dataset to the old Vector type from the new Vector type under the spark.ml package. |
static Dataset<Row> | convertVectorColumnsToML(Dataset<?> dataset, scala.collection.Seq<String> cols) Converts vector columns in an input Dataset from the old Vector type to the new Vector type under the spark.ml package. |
static Dataset<Row> | convertVectorColumnsToML(Dataset<?> dataset, String... cols) Converts vector columns in an input Dataset from the old Vector type to the new Vector type under the spark.ml package. |
static <T> scala.Tuple2<RDD<T>,RDD<T>>[] | kFold(RDD<T> rdd, int numFolds, int seed, scala.reflect.ClassTag<T> evidence$1) Returns a k-element array of pairs of RDDs. The first element of each pair contains the training data, which is the complement of the validation data; the second element contains the validation data, a unique 1/k-th of the data. |
static <T> scala.Tuple2<RDD<T>,RDD<T>>[] | kFold(RDD<T> rdd, int numFolds, long seed, scala.reflect.ClassTag<T> evidence$2) Version of kFold() taking a Long seed. |
static RDD<LabeledPoint> | loadLabeledPoints(SparkContext sc, String dir) Loads labeled points saved using RDD[LabeledPoint].saveAsTextFile with the default number of partitions. |
static RDD<LabeledPoint> | loadLabeledPoints(SparkContext sc, String path, int minPartitions) Loads labeled points saved using RDD[LabeledPoint].saveAsTextFile. |
static RDD<LabeledPoint> | loadLibSVMFile(SparkContext sc, String path) Loads binary labeled data in the LIBSVM format into an RDD[LabeledPoint], with the number of features determined automatically and the default number of partitions. |
static RDD<LabeledPoint> | loadLibSVMFile(SparkContext sc, String path, int numFeatures) Loads labeled data in the LIBSVM format into an RDD[LabeledPoint], with the default number of partitions. |
static RDD<LabeledPoint> | loadLibSVMFile(SparkContext sc, String path, int numFeatures, int minPartitions) Loads labeled data in the LIBSVM format into an RDD[LabeledPoint]. |
static RDD<Vector> | loadVectors(SparkContext sc, String path) Loads vectors saved using RDD[Vector].saveAsTextFile with the default number of partitions. |
static RDD<Vector> | loadVectors(SparkContext sc, String path, int minPartitions) Loads vectors saved using RDD[Vector].saveAsTextFile. |
static void | saveAsLibSVMFile(RDD<LabeledPoint> data, String dir) Saves labeled data in LIBSVM format. |
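The kFold entries in the summary above describe the result only in words. As a rough illustration of that shape, the following plain-Java sketch splits an in-memory list into k (training, validation) pairs. It is not Spark's implementation: the KFoldSketch class, its Fold type, and the per-element random fold assignment are assumptions made for this example, and each validation part holds only approximately 1/k of the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a k-fold split on an in-memory list (not part of MLUtils).
final class KFoldSketch {

    /** One fold: the validation elements and the complementary training elements. */
    static final class Fold<T> {
        final List<T> training = new ArrayList<>();
        final List<T> validation = new ArrayList<>();
    }

    /** Splits data into numFolds (training, validation) pairs, reproducibly for a given seed. */
    static <T> List<Fold<T>> kFold(List<T> data, int numFolds, long seed) {
        if (numFolds < 2) {
            throw new IllegalArgumentException("numFolds must be at least 2");
        }
        // Assumption: each element is assigned to exactly one validation fold at random.
        Random random = new Random(seed);
        int[] assignment = new int[data.size()];
        for (int i = 0; i < assignment.length; i++) {
            assignment[i] = random.nextInt(numFolds);
        }

        List<Fold<T>> folds = new ArrayList<>();
        for (int f = 0; f < numFolds; f++) {
            Fold<T> fold = new Fold<>();
            for (int i = 0; i < data.size(); i++) {
                // An element is validation data in its own fold and training data in all others.
                if (assignment[i] == f) {
                    fold.validation.add(data.get(i));
                } else {
                    fold.training.add(data.get(i));
                }
            }
            folds.add(fold);
        }
        return folds;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            data.add(i);
        }
        for (Fold<Integer> fold : kFold(data, 3, 42L)) {
            System.out.println("training=" + fold.training + " validation=" + fold.validation);
        }
    }
}
```

Every element appears in exactly one validation part, so the training part of a fold is always the complement of its validation part, which matches the pairing described for kFold.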
public static Dataset<Row> convertVectorColumnsToML(Dataset<?> dataset, String... cols)
Converts vector columns in an input Dataset from the old Vector type to the new Vector type under the spark.ml package.
Parameters:
dataset - input dataset
cols - a list of vector columns to be converted. New vector columns will be ignored. If unspecified, all old vector columns will be converted except nested ones.
Returns:
the input DataFrame with old vector columns converted to the new vector type

public static Dataset<Row> convertVectorColumnsFromML(Dataset<?> dataset, String... cols)
Converts vector columns in an input Dataset to the old Vector type from the new Vector type under the spark.ml package.
Parameters:
dataset - input dataset
cols - a list of vector columns to be converted. Old vector columns will be ignored. If unspecified, all new vector columns will be converted except nested ones.
Returns:
the input DataFrame with new vector columns converted to the old vector type

public static Dataset<Row> convertMatrixColumnsToML(Dataset<?> dataset, String... cols)
Converts matrix columns in an input Dataset from the old Matrix type to the new Matrix type under the spark.ml package.
Parameters:
dataset - input dataset
cols - a list of matrix columns to be converted. New matrix columns will be ignored. If unspecified, all old matrix columns will be converted except nested ones.
Returns:
the input DataFrame with old matrix columns converted to the new matrix type

public static Dataset<Row> convertMatrixColumnsFromML(Dataset<?> dataset, String... cols)
Converts matrix columns in an input Dataset to the old Matrix type from the new Matrix type under the spark.ml package.
Parameters:
dataset - input dataset
cols - a list of matrix columns to be converted. Old matrix columns will be ignored. If unspecified, all new matrix columns will be converted except nested ones.
Returns:
the input DataFrame with new matrix columns converted to the old matrix type

public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, String path, int numFeatures, int minPartitions)
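Loads labeled data in the LIBSVM format into an RDD[LabeledPoint]. Each line represents a labeled sparse feature vector in the following format:
label index1:value1 index2:value2 ...
where the indices are one-based and in ascending order. This method parses each line into an org.apache.spark.mllib.regression.LabeledPoint, where the feature indices are converted to zero-based.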
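A minimal plain-Java sketch of the line format described above, for illustration only. The LibSvmLineSketch class and its ParsedLine type are hypothetical and are not part of MLUtils or Spark's parser; the sketch assumes a non-empty, whitespace-separated line.

```java
import java.util.Arrays;

// Hypothetical illustration of parsing one LIBSVM-formatted line (not Spark's parser).
final class LibSvmLineSketch {

    /** A label plus sparse features with zero-based indices. */
    static final class ParsedLine {
        final double label;
        final int[] indices;
        final double[] values;

        ParsedLine(double label, int[] indices, double[] values) {
            this.label = label;
            this.indices = indices;
            this.values = values;
        }
    }

    /** Parses "label index1:value1 index2:value2 ..." with one-based, ascending indices. */
    static ParsedLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        int[] indices = new int[tokens.length - 1];
        double[] values = new double[tokens.length - 1];
        int previous = -1;
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":", 2);
            if (pair.length != 2) {
                throw new IllegalArgumentException("expected index:value but found: " + tokens[i]);
            }
            // Convert the one-based index in the file to a zero-based index.
            int index = Integer.parseInt(pair[0]) - 1;
            if (index <= previous) {
                throw new IllegalArgumentException(
                    "indices must be one-based and in ascending order: " + line);
            }
            indices[i - 1] = index;
            values[i - 1] = Double.parseDouble(pair[1]);
            previous = index;
        }
        return new ParsedLine(label, indices, values);
    }

    public static void main(String[] args) {
        ParsedLine p = parse("1 1:0.5 3:2.0");
        // prints: label=1.0 indices=[0, 2] values=[0.5, 2.0]
        System.out.println("label=" + p.label
            + " indices=" + Arrays.toString(p.indices)
            + " values=" + Arrays.toString(p.values));
    }
}
```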
Parameters:
sc - Spark context
path - file or directory path in any Hadoop-supported file system URI
numFeatures - number of features, which will be determined from the input data if a nonpositive value is given. This is useful when the dataset is already split into multiple files and you want to load them separately, because some features may not be present in certain files, which leads to inconsistent feature dimensions.
minPartitions - min number of partitions
Returns:
labeled data stored as an RDD[LabeledPoint]

public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, String path, int numFeatures)
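Loads labeled data in the LIBSVM format into an RDD[LabeledPoint], with the default number of partitions.
Parameters:
sc - Spark context
path - file or directory path in any Hadoop-supported file system URI
numFeatures - number of features, which will be determined from the input data if a nonpositive value is given

public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, String path)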
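Loads binary labeled data in the LIBSVM format into an RDD[LabeledPoint], with the number of features determined automatically and the default number of partitions.
Parameters:
sc - Spark context
path - file or directory path in any Hadoop-supported file system URI

public static void saveAsLibSVMFile(RDD<LabeledPoint> data, String dir)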
Saves labeled data in LIBSVM format.
Parameters:
data - an RDD of LabeledPoint to be saved
dir - directory to save the data
See Also:
loadLibSVMFile(org.apache.spark.SparkContext, java.lang.String, int, int)
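
For illustration, the text form of one saved record can be sketched in plain Java: a label plus zero-based feature indices written as one LIBSVM line with one-based indices. LibSvmFormatSketch is a hypothetical helper, not the code behind saveAsLibSVMFile, and the number formatting Spark produces may differ.

```java
// Hypothetical illustration of writing one record as a LIBSVM line (not Spark's writer).
final class LibSvmFormatSketch {

    /** Formats a label and sparse features (zero-based indices) as "label i1:v1 i2:v2 ...". */
    static String format(double label, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        StringBuilder sb = new StringBuilder().append(label);
        for (int i = 0; i < indices.length; i++) {
            // LIBSVM indices are one-based, so shift each zero-based index by one.
            sb.append(' ').append(indices[i] + 1).append(':').append(values[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: 1.0 1:0.5 3:2.0
        System.out.println(format(1.0, new int[] {0, 2}, new double[] {0.5, 2.0}));
    }
}
```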
public static RDD<Vector> loadVectors(SparkContext sc, String path, int minPartitions)
Loads vectors saved using RDD[Vector].saveAsTextFile.
Parameters:
sc - Spark context
path - file or directory path in any Hadoop-supported file system URI
minPartitions - min number of partitions

public static RDD<Vector> loadVectors(SparkContext sc, String path)
Loads vectors saved using RDD[Vector].saveAsTextFile with the default number of partitions.
Parameters:
sc - Spark context
path - file or directory path in any Hadoop-supported file system URI

public static RDD<LabeledPoint> loadLabeledPoints(SparkContext sc, String path, int minPartitions)
Loads labeled points saved using RDD[LabeledPoint].saveAsTextFile.
Parameters:
sc - Spark context
path - file or directory path in any Hadoop-supported file system URI
minPartitions - min number of partitions

public static RDD<LabeledPoint> loadLabeledPoints(SparkContext sc, String dir)
Loads labeled points saved using RDD[LabeledPoint].saveAsTextFile with the default number of partitions.
Parameters:
sc - Spark context
dir - file or directory path in any Hadoop-supported file system URI

public static <T> scala.Tuple2<RDD<T>,RDD<T>>[] kFold(RDD<T> rdd, int numFolds, int seed, scala.reflect.ClassTag<T> evidence$1)
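Returns a k-element array of pairs of RDDs. The first element of each pair contains the training data, which is the complement of the validation data; the second element contains the validation data, a unique 1/k-th of the data.
Parameters:
rdd - (undocumented)
numFolds - (undocumented)
seed - (undocumented)
evidence$1 - (undocumented)

public static <T> scala.Tuple2<RDD<T>,RDD<T>>[] kFold(RDD<T> rdd, int numFolds, long seed, scala.reflect.ClassTag<T> evidence$2)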
Version of kFold() taking a Long seed.
Parameters:
rdd - (undocumented)
numFolds - (undocumented)
seed - (undocumented)
evidence$2 - (undocumented)

public static Vector appendBias(Vector vector)
Returns a new vector with 1.0 (bias) appended to the input vector.
Parameters:
vector - (undocumented)

public static Dataset<Row> convertVectorColumnsToML(Dataset<?> dataset, scala.collection.Seq<String> cols)
Converts vector columns in an input Dataset from the old Vector type to the new Vector type under the spark.ml package.
Parameters:
dataset - input dataset
cols - a list of vector columns to be converted. New vector columns will be ignored. If unspecified, all old vector columns will be converted except nested ones.
Returns:
the input DataFrame with old vector columns converted to the new vector type

public static Dataset<Row> convertVectorColumnsFromML(Dataset<?> dataset, scala.collection.Seq<String> cols)
Converts vector columns in an input Dataset to the old Vector type from the new Vector type under the spark.ml package.
Parameters:
dataset - input dataset
cols - a list of vector columns to be converted. Old vector columns will be ignored. If unspecified, all new vector columns will be converted except nested ones.
Returns:
the input DataFrame with new vector columns converted to the old vector type

public static Dataset<Row> convertMatrixColumnsToML(Dataset<?> dataset, scala.collection.Seq<String> cols)
Converts matrix columns in an input Dataset from the old Matrix type to the new Matrix type under the spark.ml package.
Parameters:
dataset - input dataset
cols - a list of matrix columns to be converted. New matrix columns will be ignored. If unspecified, all old matrix columns will be converted except nested ones.
Returns:
the input DataFrame with old matrix columns converted to the new matrix type

public static Dataset<Row> convertMatrixColumnsFromML(Dataset<?> dataset, scala.collection.Seq<String> cols)
Converts matrix columns in an input Dataset to the old Matrix type from the new Matrix type under the spark.ml package.
Parameters:
dataset - input dataset
cols - a list of matrix columns to be converted. Old matrix columns will be ignored. If unspecified, all new matrix columns will be converted except nested ones.
Returns:
the input DataFrame with new matrix columns converted to the old matrix type