org.apache.spark.mllib.tree.configuration
Java-friendly constructor for org.apache.spark.mllib.tree.configuration.Strategy
Learning goal. Supported: org.apache.spark.mllib.tree.configuration.Algo.Classification, org.apache.spark.mllib.tree.configuration.Algo.Regression.
Criterion used for information gain calculation. Supported for Classification: org.apache.spark.mllib.tree.impurity.Gini, org.apache.spark.mllib.tree.impurity.Entropy. Supported for Regression: org.apache.spark.mllib.tree.impurity.Variance.
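The three impurity criteria named above can be illustrated with their textbook formulas. The following is a minimal sketch, not Spark's implementation: the class name ImpuritySketch and its method names are hypothetical, and the choice of base-2 logarithm for entropy and of population (not sample) variance are assumptions made for the example.

```java
import java.util.Arrays;

/** Textbook impurity formulas for illustration only; not Spark's code. */
class ImpuritySketch {

    /** Gini impurity: 1 - sum(p_i^2) over the class proportions p_i. */
    static double gini(double[] classCounts) {
        double total = Arrays.stream(classCounts).sum();
        if (total == 0.0) {
            return 0.0; // an empty node is treated as pure
        }
        double sumSquares = 0.0;
        for (double count : classCounts) {
            double p = count / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    /** Entropy in bits (assumed base 2): -sum(p_i * log2(p_i)), skipping empty classes. */
    static double entropy(double[] classCounts) {
        double total = Arrays.stream(classCounts).sum();
        if (total == 0.0) {
            return 0.0;
        }
        double result = 0.0;
        for (double count : classCounts) {
            if (count > 0.0) {
                double p = count / total;
                result -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return result;
    }

    /** Population variance of the regression labels in a node. */
    static double variance(double[] labels) {
        if (labels.length == 0) {
            return 0.0;
        }
        double mean = Arrays.stream(labels).average().orElse(0.0);
        double sumSquaredDeviations = 0.0;
        for (double y : labels) {
            sumSquaredDeviations += (y - mean) * (y - mean);
        }
        return sumSquaredDeviations / labels.length;
    }

    public static void main(String[] args) {
        System.out.println("gini     = " + gini(new double[] {5, 5}));
        System.out.println("entropy  = " + entropy(new double[] {5, 5}));
        System.out.println("variance = " + variance(new double[] {1, 2, 3}));
    }
}
```

All three measures are zero for a pure node (a single class, or identical labels) and grow as the node becomes more mixed, which is why a good split is one that lowers them.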
Maximum depth of the tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes).
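The depth convention above implies an upper bound on tree size, since each internal node has two children. A small sketch of that arithmetic follows; the class TreeSizeSketch and its methods are hypothetical helpers written for this example, not part of Spark.

```java
/** Upper bounds on binary-tree size for a given maximum depth (illustration only). */
class TreeSizeSketch {

    /** At most 2^depth leaves: depth 0 gives 1 leaf, depth 1 gives 2 leaves. */
    static long maxLeaves(int depth) {
        if (depth < 0 || depth > 61) {
            throw new IllegalArgumentException("depth must be in [0, 61]");
        }
        return 1L << depth;
    }

    /** At most 2^(depth + 1) - 1 nodes in total (internal nodes plus leaves). */
    static long maxNodes(int depth) {
        return 2 * maxLeaves(depth) - 1;
    }

    public static void main(String[] args) {
        for (int depth = 0; depth <= 3; depth++) {
            System.out.println("depth " + depth + ": up to " + maxLeaves(depth)
                    + " leaves, " + maxNodes(depth) + " nodes");
        }
    }
}
```

Because the bound doubles with every extra level, raising the maximum depth by one can roughly double the work and memory needed to train the tree.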
Number of classes for classification. (Ignored for regression.) Default value is 2 (binary classification).
Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.
Algorithm for calculating quantiles. Supported: org.apache.spark.mllib.tree.configuration.QuantileStrategy.Sort.
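To make the relationship between the bin limit and a sort-based quantile calculation concrete, here is a simplified sketch of equal-frequency binning. It is an assumption-laden illustration, not Spark's algorithm: the class QuantileBinsSketch, its methods, and the exact way thresholds are picked are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified equal-frequency binning via sorting (illustration only). */
class QuantileBinsSketch {

    /** Returns at most maxBins - 1 increasing thresholds taken from the sorted values. */
    static double[] splitThresholds(double[] values, int maxBins) {
        if (maxBins < 2) {
            throw new IllegalArgumentException("maxBins must be at least 2");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        List<Double> thresholds = new ArrayList<>();
        for (int i = 1; i < maxBins; i++) {
            int index = (int) ((long) i * sorted.length / maxBins);
            if (index >= sorted.length) {
                break; // only reached when there are no values at all
            }
            double candidate = sorted[index];
            // Skip repeated values so the thresholds stay strictly increasing.
            if (thresholds.isEmpty() || candidate > thresholds.get(thresholds.size() - 1)) {
                thresholds.add(candidate);
            }
        }
        return thresholds.stream().mapToDouble(Double::doubleValue).toArray();
    }

    /** Bin index of a value: the number of thresholds that are <= value. */
    static int binIndex(double value, double[] thresholds) {
        int bin = 0;
        while (bin < thresholds.length && value >= thresholds[bin]) {
            bin++;
        }
        return bin;
    }

    public static void main(String[] args) {
        double[] thresholds = splitThresholds(new double[] {8, 1, 5, 3, 7, 2, 6, 4}, 4);
        System.out.println(Arrays.toString(thresholds));
        System.out.println(binIndex(4.0, thresholds));
    }
}
```

With more bins the thresholds sit closer together, which is the "higher granularity" trade-off: finer candidate splits at the cost of larger histograms per node.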
A map storing information about the categorical variables and the number of discrete values they take. An entry (n to k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
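As an illustration of the (n to k) convention, the sketch below builds such a map with plain Java collections and checks a feature vector against it. The class CategoricalInfoSketch and the isConsistent helper are hypothetical names for this example; Spark's own validation may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Checks feature values against a feature-index -> category-count map (illustration only). */
class CategoricalInfoSketch {

    /**
     * Returns true if every categorical feature n with k categories holds
     * an integer value in {0, 1, ..., k-1}. Features absent from the map
     * are treated as continuous and are not checked.
     */
    static boolean isConsistent(Map<Integer, Integer> categoricalFeaturesInfo, double[] features) {
        for (Map.Entry<Integer, Integer> entry : categoricalFeaturesInfo.entrySet()) {
            int featureIndex = entry.getKey();
            int numCategories = entry.getValue();
            if (featureIndex < 0 || featureIndex >= features.length) {
                return false;
            }
            double value = features[featureIndex];
            boolean isWholeNumber = value == Math.floor(value); // false for NaN
            if (!isWholeNumber || value < 0 || value >= numCategories) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> info = new HashMap<>();
        info.put(0, 3); // feature 0 is categorical with values {0, 1, 2}
        info.put(2, 2); // feature 2 is categorical with values {0, 1}
        System.out.println(isConsistent(info, new double[] {2.0, 0.37, 1.0}));
        System.out.println(isConsistent(info, new double[] {3.0, 0.37, 1.0}));
    }
}
```

In this example feature 1 is missing from the map, so it is handled as a continuous feature and may take any value.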
Minimum number of instances each child must have after a split. Default value is 1. If a split causes the left or right child to have fewer than minInstancesPerNode instances, the split is not considered valid.
Minimum information gain a split must get. Default value is 0.0. If a split has less information gain than minInfoGain, this split will not be considered as a valid split.
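The two thresholds above act together as a filter on candidate splits. The following sketch shows one way to express that filter, assuming the usual definition of information gain as parent impurity minus the size-weighted child impurities. The class SplitValiditySketch and its methods are hypothetical and are not taken from Spark's source.

```java
/** Filters candidate splits by child size and information gain (illustration only). */
class SplitValiditySketch {

    /** Information gain: parent impurity minus the size-weighted impurity of the children. */
    static double infoGain(double parentImpurity,
                           double leftImpurity, long leftCount,
                           double rightImpurity, long rightCount) {
        long total = leftCount + rightCount;
        if (total == 0) {
            return 0.0;
        }
        double leftWeight = (double) leftCount / total;
        double rightWeight = (double) rightCount / total;
        return parentImpurity - leftWeight * leftImpurity - rightWeight * rightImpurity;
    }

    /** A split is rejected if either child is too small or the gain is below the minimum. */
    static boolean isValidSplit(long leftCount, long rightCount, double gain,
                                int minInstancesPerNode, double minInfoGain) {
        if (leftCount < minInstancesPerNode || rightCount < minInstancesPerNode) {
            return false;
        }
        return gain >= minInfoGain;
    }

    public static void main(String[] args) {
        double gain = infoGain(0.5, 0.0, 5, 0.0, 5);
        System.out.println(isValidSplit(5, 5, gain, 1, 0.0));
        System.out.println(isValidSplit(0, 10, gain, 1, 0.0));
    }
}
```

Raising either threshold makes the tree more conservative: fewer splits pass the filter, so the tree stops growing earlier.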
Maximum memory in MB allocated to histogram aggregation. Default value is 256 MB. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size.
Fraction of the training data used for learning the decision tree.
If this is true, instead of passing trees to executors, the algorithm will maintain a separate RDD of node Id cache for each row.
How often to checkpoint when the node Id cache gets updated. E.g. 10 means that the cache will get checkpointed every 10 updates. If the checkpoint directory is not set in org.apache.spark.SparkContext, this setting is ignored.
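The interval rule above can be sketched as a simple counter check. This is an assumption about how such a rule might be written, not Spark's implementation: the class CheckpointIntervalSketch, the shouldCheckpoint method, and the checkpointDirSet flag are hypothetical stand-ins for state that Spark tracks internally.

```java
/** Decides whether a given cache update should trigger a checkpoint (illustration only). */
class CheckpointIntervalSketch {

    /**
     * @param updateCount        number of node Id cache updates so far (1-based)
     * @param checkpointInterval checkpoint every this many updates
     * @param checkpointDirSet   whether a checkpoint directory has been configured
     */
    static boolean shouldCheckpoint(int updateCount, int checkpointInterval, boolean checkpointDirSet) {
        if (!checkpointDirSet || checkpointInterval <= 0) {
            return false; // without a checkpoint directory the setting is ignored
        }
        return updateCount > 0 && updateCount % checkpointInterval == 0;
    }

    public static void main(String[] args) {
        for (int update = 1; update <= 20; update++) {
            if (shouldCheckpoint(update, 10, true)) {
                System.out.println("checkpoint at update " + update);
            }
        }
    }
}
```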
Learning goal. Supported: org.apache.spark.mllib.tree.configuration.Algo.Classification, org.apache.spark.mllib.tree.configuration.Algo.Regression.
A map storing information about the categorical variables and the number of discrete values they take. An entry (n to k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
How often to checkpoint when the node Id cache gets updated. E.g. 10 means that the cache will get checkpointed every 10 updates. If the checkpoint directory is not set in org.apache.spark.SparkContext, this setting is ignored.
Returns a shallow copy of this instance.
Criterion used for information gain calculation. Supported for Classification: org.apache.spark.mllib.tree.impurity.Gini, org.apache.spark.mllib.tree.impurity.Entropy. Supported for Regression: org.apache.spark.mllib.tree.impurity.Variance.
Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.
Maximum depth of the tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes).
Maximum memory in MB allocated to histogram aggregation. Default value is 256 MB. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size.
Minimum information gain a split must get. Default value is 0.0. If a split has less information gain than minInfoGain, this split will not be considered as a valid split.
Minimum number of instances each child must have after a split. Default value is 1. If a split causes the left or right child to have fewer than minInstancesPerNode instances, the split is not considered valid.
Number of classes for classification. (Ignored for regression.) Default value is 2 (binary classification).
Algorithm for calculating quantiles. Supported: org.apache.spark.mllib.tree.configuration.QuantileStrategy.Sort.
Sets Algorithm using a String.
Sets categoricalFeaturesInfo using a Java Map.
Fraction of the training data used for learning the decision tree.
If this is true, instead of passing trees to executors, the algorithm will maintain a separate RDD of node Id cache for each row.
Stores all the configuration options for tree construction.