public class MLUtils
extends java.lang.Object
| Constructor and Description |
| --- |
| `MLUtils()` |

| Modifier and Type | Method and Description |
| --- | --- |
| `static Vector` | `appendBias(Vector vector)` Returns a new vector with 1.0 (bias) appended to the input vector. |
| `static double` | `EPSILON()` |
| `static <T> scala.Tuple2<RDD<T>,RDD<T>>[]` | `kFold(RDD<T> rdd, int numFolds, int seed, scala.reflect.ClassTag<T> evidence$1)` :: Experimental :: Return a k element array of pairs of RDDs with the first element of each pair containing the training data, a complement of the validation data, and the second element, the validation data, containing a unique 1/kth of the data. |
| `static RDD<LabeledPoint>` | `loadLabeledData(SparkContext sc, java.lang.String dir)` Deprecated. Should use `RDD.saveAsTextFile(java.lang.String)` for saving and `loadLabeledPoints(org.apache.spark.SparkContext, java.lang.String, int)` for loading. |
| `static RDD<LabeledPoint>` | `loadLabeledPoints(SparkContext sc, java.lang.String dir)` Loads labeled points saved using `RDD[LabeledPoint].saveAsTextFile` with the default number of partitions. |
| `static RDD<LabeledPoint>` | `loadLabeledPoints(SparkContext sc, java.lang.String path, int minPartitions)` Loads labeled points saved using `RDD[LabeledPoint].saveAsTextFile`. |
| `static RDD<LabeledPoint>` | `loadLibSVMFile(SparkContext sc, java.lang.String path)` Loads binary labeled data in the LIBSVM format into an RDD[LabeledPoint], with the number of features determined automatically and the default number of partitions. |
| `static RDD<LabeledPoint>` | `loadLibSVMFile(SparkContext sc, java.lang.String path, boolean multiclass)` |
| `static RDD<LabeledPoint>` | `loadLibSVMFile(SparkContext sc, java.lang.String path, boolean multiclass, int numFeatures)` |
| `static RDD<LabeledPoint>` | `loadLibSVMFile(SparkContext sc, java.lang.String path, boolean multiclass, int numFeatures, int minPartitions)` |
| `static RDD<LabeledPoint>` | `loadLibSVMFile(SparkContext sc, java.lang.String path, int numFeatures)` Loads labeled data in the LIBSVM format into an RDD[LabeledPoint], with the default number of partitions. |
| `static RDD<LabeledPoint>` | `loadLibSVMFile(SparkContext sc, java.lang.String path, int numFeatures, int minPartitions)` Loads labeled data in the LIBSVM format into an RDD[LabeledPoint]. |
| `static RDD<Vector>` | `loadVectors(SparkContext sc, java.lang.String path)` Loads vectors saved using `RDD[Vector].saveAsTextFile` with the default number of partitions. |
| `static RDD<Vector>` | `loadVectors(SparkContext sc, java.lang.String path, int minPartitions)` Loads vectors saved using `RDD[Vector].saveAsTextFile`. |
| `static void` | `saveAsLibSVMFile(RDD<LabeledPoint> data, java.lang.String dir)` Save labeled data in LIBSVM format. |
| `static void` | `saveLabeledData(RDD<LabeledPoint> data, java.lang.String dir)` Deprecated. Should use `RDD.saveAsTextFile(java.lang.String)` for saving and `loadLabeledPoints(org.apache.spark.SparkContext, java.lang.String, int)` for loading. |
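Before the per-method details below, here is a minimal sketch of how these utilities fit together, assuming a local Spark context and a placeholder LIBSVM input path `data/sample.txt`:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.MLUtils

object MLUtilsTour {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MLUtilsTour").setMaster("local[*]"))

    // Load LIBSVM-formatted data; the number of features is inferred from the input.
    val points = MLUtils.loadLibSVMFile(sc, "data/sample.txt")

    // Append a 1.0 (bias) entry to each feature vector.
    val withBias = points.map(p => MLUtils.appendBias(p.features))
    println(s"first vector with bias: ${withBias.first()}")

    // Split the data into 3 (training, validation) fold pairs.
    val folds = MLUtils.kFold(points, numFolds = 3, seed = 42)
    folds.zipWithIndex.foreach { case ((train, valid), i) =>
      println(s"fold $i: train=${train.count()}, validation=${valid.count()}")
    }

    sc.stop()
  }
}
```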
public static double EPSILON()
public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, java.lang.String path, int numFeatures, int minPartitions)

Loads labeled data in the LIBSVM format into an RDD[LabeledPoint]. The LIBSVM format is a text format in which each line represents a labeled sparse feature vector using the following format:

    label index1:value1 index2:value2 ...

where the indices are one-based and in ascending order. This method parses each line into a LabeledPoint, where the feature indices are converted to zero-based.

Parameters:
sc - Spark context
path - file or directory path in any Hadoop-supported file system URI
numFeatures - number of features, which will be determined from the input data if a nonpositive value is given. This is useful when the dataset is already split into multiple files and you want to load them separately, because some features may not be present in certain files, which leads to inconsistent feature dimensions.
minPartitions - min number of partitions

Returns:
labeled data stored as an RDD[LabeledPoint]
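As an illustration of the format conversion described above, a small sketch follows; the glob path `data/part-*`, the feature count 10, and the partition count 4 are all hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.util.MLUtils

// Hypothetical input files matching "data/part-*" contain lines such as:
//   1.0 1:0.5 3:2.0
// The one-based indices 1 and 3 become zero-based indices 0 and 2.
def loadSplitDataset(sc: SparkContext): Unit = {
  // Pass an explicit numFeatures (here 10) so that files missing the
  // highest-indexed features still yield vectors of the same dimension.
  val data = MLUtils.loadLibSVMFile(sc, "data/part-*", 10, 4)
  data.take(2).foreach { p =>
    println(s"label=${p.label}, features=${p.features}")
  }
}
```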
public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, java.lang.String path, boolean multiclass, int numFeatures, int minPartitions)
public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, java.lang.String path, int numFeatures)

Loads labeled data in the LIBSVM format into an RDD[LabeledPoint], with the default number of partitions.

public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, java.lang.String path, boolean multiclass, int numFeatures)
public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, java.lang.String path, boolean multiclass)
public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, java.lang.String path)

Loads binary labeled data in the LIBSVM format into an RDD[LabeledPoint], with the number of features determined automatically and the default number of partitions.
public static void saveAsLibSVMFile(RDD<LabeledPoint> data, java.lang.String dir)

Save labeled data in LIBSVM format.

Parameters:
data - an RDD of LabeledPoint to be saved
dir - directory to save the data

See Also:
loadLibSVMFile(org.apache.spark.SparkContext, java.lang.String, int, int)
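A sketch of a save-and-reload round trip, assuming a writable placeholder directory `out/libsvm` that does not already exist:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Write LabeledPoints in LIBSVM format, then read them back.
def libsvmRoundTrip(sc: SparkContext): Unit = {
  val data = sc.parallelize(Seq(
    LabeledPoint(1.0, Vectors.sparse(3, Array(0, 2), Array(0.5, 2.0))),
    LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 3.0))
  ))
  MLUtils.saveAsLibSVMFile(data, "out/libsvm")
  // Reload with an explicit feature count of 3.
  val reloaded = MLUtils.loadLibSVMFile(sc, "out/libsvm", 3)
  println(reloaded.collect().mkString("\n"))
}
```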
public static RDD<Vector> loadVectors(SparkContext sc, java.lang.String path, int minPartitions)

Loads vectors saved using RDD[Vector].saveAsTextFile.

Parameters:
sc - Spark context
path - file or directory path in any Hadoop-supported file system URI
minPartitions - min number of partitions

public static RDD<Vector> loadVectors(SparkContext sc, java.lang.String path)
Loads vectors saved using RDD[Vector].saveAsTextFile with the default number of partitions.
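A sketch of the intended pairing of `RDD.saveAsTextFile` with `loadVectors`, assuming a placeholder output directory `out/vectors`:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Vectors written with RDD.saveAsTextFile (one vector per line in its
// text form, e.g. [1.0,2.0]) can be read back with loadVectors.
def vectorsRoundTrip(sc: SparkContext): Unit = {
  val vectors: RDD[Vector] = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0),
    Vectors.dense(3.0, 4.0)
  ))
  vectors.saveAsTextFile("out/vectors")
  val reloaded = MLUtils.loadVectors(sc, "out/vectors", 4) // 4 = min number of partitions
  println(reloaded.collect().mkString(", "))
}
```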
public static RDD<LabeledPoint> loadLabeledPoints(SparkContext sc, java.lang.String path, int minPartitions)

Loads labeled points saved using RDD[LabeledPoint].saveAsTextFile.

Parameters:
sc - Spark context
path - file or directory path in any Hadoop-supported file system URI
minPartitions - min number of partitions

public static RDD<LabeledPoint> loadLabeledPoints(SparkContext sc, java.lang.String dir)
Loads labeled points saved using RDD[LabeledPoint].saveAsTextFile with the default number of partitions.
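Similarly, a hedged round-trip sketch for labeled points, with `out/points` as a placeholder directory:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// LabeledPoints written with RDD.saveAsTextFile (one point per line in its
// text form, e.g. (1.0,[0.5,2.0])) can be read back with loadLabeledPoints.
def labeledPointsRoundTrip(sc: SparkContext): Unit = {
  val points = sc.parallelize(Seq(
    LabeledPoint(1.0, Vectors.dense(0.5, 2.0)),
    LabeledPoint(0.0, Vectors.dense(1.5, 0.0))
  ))
  points.saveAsTextFile("out/points")
  val reloaded = MLUtils.loadLabeledPoints(sc, "out/points")
  println(reloaded.collect().mkString("\n"))
}
```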
public static RDD<LabeledPoint> loadLabeledData(SparkContext sc, java.lang.String dir)

Deprecated. Should use RDD.saveAsTextFile(java.lang.String) for saving and loadLabeledPoints(org.apache.spark.SparkContext, java.lang.String, int) for loading.

Parameters:
sc - SparkContext
dir - Directory to the input data files.

public static void saveLabeledData(RDD<LabeledPoint> data, java.lang.String dir)
Deprecated. Should use RDD.saveAsTextFile(java.lang.String) for saving and loadLabeledPoints(org.apache.spark.SparkContext, java.lang.String, int) for loading.

Parameters:
data - An RDD of LabeledPoints containing data to be saved.
dir - Directory to save the data.
public static <T> scala.Tuple2<RDD<T>,RDD<T>>[] kFold(RDD<T> rdd, int numFolds, int seed, scala.reflect.ClassTag<T> evidence$1)

:: Experimental ::
Return a k element array of pairs of RDDs with the first element of each pair containing the training data, a complement of the validation data, and the second element, the validation data, containing a unique 1/kth of the data.
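A minimal cross-validation sketch built on `kFold`; `evaluateModel` is a hypothetical stand-in for training and scoring a real model:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Each fold pair holds (training data, validation data); average the
// per-fold scores to get a cross-validated estimate.
def crossValidate(sc: SparkContext, data: RDD[LabeledPoint]): Double = {
  // Placeholder metric: a real implementation would fit a model on `train`
  // and score it on `valid`.
  def evaluateModel(train: RDD[LabeledPoint], valid: RDD[LabeledPoint]): Double =
    valid.count().toDouble

  val folds = MLUtils.kFold(data, numFolds = 5, seed = 11)
  val scores = folds.map { case (train, valid) => evaluateModel(train, valid) }
  scores.sum / scores.length
}
```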