public class DataFrame
extends java.lang.Object
implements org.apache.spark.sql.execution.Queryable, scala.Serializable
A DataFrame
is equivalent to a relational table in Spark SQL. The following example creates
a DataFrame
by pointing Spark SQL to a Parquet data set.
val people = sqlContext.read.parquet("...") // in Scala
DataFrame people = sqlContext.read().parquet("...") // in Java
Once created, it can be manipulated using the various domain-specific-language (DSL) functions
defined in: DataFrame
(this class), Column
, and functions
.
To select a column from the data frame, use apply
method in Scala and col
in Java.
val ageCol = people("age") // in Scala
Column ageCol = people.col("age") // in Java
Note that the Column
type can also be manipulated through its various functions.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10 // in Scala
people.col("age").plus(10); // in Java
A more concrete example in Scala:
// To create DataFrame using SQLContext
val people = sqlContext.read.parquet("...")
val department = sqlContext.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), "gender")
.agg(avg(people("salary")), max(people("age")))
and in Java:
// To create DataFrame using SQLContext
DataFrame people = sqlContext.read().parquet("...");
DataFrame department = sqlContext.read().parquet("...");
people.filter("age".gt(30))
.join(department, people.col("deptId").equalTo(department("id")))
.groupBy(department.col("name"), "gender")
.agg(avg(people.col("salary")), max(people.col("age")));
Constructor and Description |
---|
DataFrame(SQLContext sqlContext,
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan)
A constructor that automatically analyzes the logical plan.
|
Modifier and Type | Method and Description |
---|---|
DataFrame |
agg(Column expr,
Column... exprs)
Aggregates on the entire
DataFrame without groups. |
DataFrame |
agg(Column expr,
scala.collection.Seq<Column> exprs)
Aggregates on the entire
DataFrame without groups. |
DataFrame |
agg(scala.collection.immutable.Map<java.lang.String,java.lang.String> exprs)
(Scala-specific) Aggregates on the entire
DataFrame without groups. |
DataFrame |
agg(java.util.Map<java.lang.String,java.lang.String> exprs)
(Java-specific) Aggregates on the entire
DataFrame without groups. |
DataFrame |
agg(scala.Tuple2<java.lang.String,java.lang.String> aggExpr,
scala.collection.Seq<scala.Tuple2<java.lang.String,java.lang.String>> aggExprs)
(Scala-specific) Aggregates on the entire
DataFrame without groups. |
DataFrame |
alias(java.lang.String alias)
Returns a new
DataFrame with an alias set. |
DataFrame |
alias(scala.Symbol alias)
(Scala-specific) Returns a new
DataFrame with an alias set. |
Column |
apply(java.lang.String colName)
Selects column based on the column name and return it as a
Column . |
<U> Dataset<U> |
as(Encoder<U> evidence$1)
|
DataFrame |
as(java.lang.String alias)
Returns a new
DataFrame with an alias set. |
DataFrame |
as(scala.Symbol alias)
(Scala-specific) Returns a new
DataFrame with an alias set. |
DataFrame |
cache()
Persist this
DataFrame with the default storage level (MEMORY_AND_DISK ). |
DataFrame |
coalesce(int numPartitions)
Returns a new
DataFrame that has exactly numPartitions partitions. |
Column |
col(java.lang.String colName)
Selects column based on the column name and return it as a
Column . |
Row[] |
collect()
|
java.util.List<Row> |
collectAsList()
|
protected int |
collectToPython() |
java.lang.String[] |
columns()
Returns all column names as an array.
|
long |
count()
Returns the number of rows in the
DataFrame . |
void |
createJDBCTable(java.lang.String url,
java.lang.String table,
boolean allowExisting)
Deprecated.
As of 1.340, replaced by
write().jdbc() . This will be removed in Spark 2.0. |
GroupedData |
cube(Column... cols)
Create a multi-dimensional cube for the current
DataFrame using the specified columns,
so we can run aggregation on them. |
GroupedData |
cube(scala.collection.Seq<Column> cols)
Create a multi-dimensional cube for the current
DataFrame using the specified columns,
so we can run aggregation on them. |
GroupedData |
cube(java.lang.String col1,
scala.collection.Seq<java.lang.String> cols)
Create a multi-dimensional cube for the current
DataFrame using the specified columns,
so we can run aggregation on them. |
GroupedData |
cube(java.lang.String col1,
java.lang.String... cols)
Create a multi-dimensional cube for the current
DataFrame using the specified columns,
so we can run aggregation on them. |
DataFrame |
describe(scala.collection.Seq<java.lang.String> cols)
Computes statistics for numeric columns, including count, mean, stddev, min, and max.
|
DataFrame |
describe(java.lang.String... cols)
Computes statistics for numeric columns, including count, mean, stddev, min, and max.
|
DataFrame |
distinct()
|
DataFrame |
drop(Column col)
Returns a new
DataFrame with a column dropped. |
DataFrame |
drop(java.lang.String colName)
Returns a new
DataFrame with a column dropped. |
DataFrame |
dropDuplicates()
|
DataFrame |
dropDuplicates(scala.collection.Seq<java.lang.String> colNames)
(Scala-specific) Returns a new
DataFrame with duplicate rows removed, considering only
the subset of columns. |
DataFrame |
dropDuplicates(java.lang.String[] colNames)
Returns a new
DataFrame with duplicate rows removed, considering only
the subset of columns. |
scala.Tuple2<java.lang.String,java.lang.String>[] |
dtypes()
Returns all column names and their data types as an array.
|
DataFrame |
except(DataFrame other)
Returns a new
DataFrame containing rows in this frame but not in another frame. |
void |
explain()
Prints the physical plan to the console for debugging purposes.
|
void |
explain(boolean extended)
Prints the plans (logical and physical) to the console for debugging purposes.
|
<A extends scala.Product> |
explode(scala.collection.Seq<Column> input,
scala.Function1<Row,scala.collection.TraversableOnce<A>> f,
scala.reflect.api.TypeTags.TypeTag<A> evidence$2)
(Scala-specific) Returns a new
DataFrame where each row has been expanded to zero or more
rows by the provided function. |
<A,B> DataFrame |
explode(java.lang.String inputColumn,
java.lang.String outputColumn,
scala.Function1<A,scala.collection.TraversableOnce<B>> f,
scala.reflect.api.TypeTags.TypeTag<B> evidence$3)
(Scala-specific) Returns a new
DataFrame where a single column has been expanded to zero
or more rows by the provided function. |
DataFrame |
filter(Column condition)
Filters rows using the given condition.
|
DataFrame |
filter(java.lang.String conditionExpr)
Filters rows using the given SQL expression.
|
Row |
first()
Returns the first row.
|
<R> RDD<R> |
flatMap(scala.Function1<Row,scala.collection.TraversableOnce<R>> f,
scala.reflect.ClassTag<R> evidence$5)
Returns a new RDD by first applying a function to all rows of this
DataFrame ,
and then flattening the results. |
void |
foreach(scala.Function1<Row,scala.runtime.BoxedUnit> f)
Applies a function
f to all rows. |
void |
foreachPartition(scala.Function1<scala.collection.Iterator<Row>,scala.runtime.BoxedUnit> f)
Applies a function f to each partition of this
DataFrame . |
GroupedData |
groupBy(Column... cols)
Groups the
DataFrame using the specified columns, so we can run aggregation on them. |
GroupedData |
groupBy(scala.collection.Seq<Column> cols)
Groups the
DataFrame using the specified columns, so we can run aggregation on them. |
GroupedData |
groupBy(java.lang.String col1,
scala.collection.Seq<java.lang.String> cols)
Groups the
DataFrame using the specified columns, so we can run aggregation on them. |
GroupedData |
groupBy(java.lang.String col1,
java.lang.String... cols)
Groups the
DataFrame using the specified columns, so we can run aggregation on them. |
Row |
head()
Returns the first row.
|
Row[] |
head(int n)
Returns the first
n rows. |
java.lang.String[] |
inputFiles()
Returns a best-effort snapshot of the files that compose this DataFrame.
|
void |
insertInto(java.lang.String tableName)
Deprecated.
As of 1.4.0, replaced by
write().mode(SaveMode.Append).saveAsTable(tableName) .
This will be removed in Spark 2.0. |
void |
insertInto(java.lang.String tableName,
boolean overwrite)
Deprecated.
As of 1.4.0, replaced by
write().mode(SaveMode.Append|SaveMode.Overwrite).saveAsTable(tableName) .
This will be removed in Spark 2.0. |
void |
insertIntoJDBC(java.lang.String url,
java.lang.String table,
boolean overwrite)
Deprecated.
As of 1.4.0, replaced by
write().jdbc() . This will be removed in Spark 2.0. |
DataFrame |
intersect(DataFrame other)
Returns a new
DataFrame containing rows only in both this frame and another frame. |
boolean |
isLocal()
Returns true if the
collect and take methods can be run locally
(without any Spark executors). |
JavaRDD<Row> |
javaRDD()
|
protected JavaRDD<byte[]> |
javaToPython()
Converts a JavaRDD to a PythonRDD.
|
DataFrame |
join(DataFrame right)
Cartesian join with another
DataFrame . |
DataFrame |
join(DataFrame right,
Column joinExprs)
Inner join with another
DataFrame , using the given join expression. |
DataFrame |
join(DataFrame right,
Column joinExprs,
java.lang.String joinType)
Join with another
DataFrame , using the given join expression. |
DataFrame |
join(DataFrame right,
scala.collection.Seq<java.lang.String> usingColumns)
Inner equi-join with another
DataFrame using the given columns. |
DataFrame |
join(DataFrame right,
scala.collection.Seq<java.lang.String> usingColumns,
java.lang.String joinType)
Equi-join with another
DataFrame using the given columns. |
DataFrame |
join(DataFrame right,
java.lang.String usingColumn)
Inner equi-join with another
DataFrame using the given column. |
DataFrame |
limit(int n)
Returns a new
DataFrame by taking the first n rows. |
protected org.apache.spark.sql.catalyst.plans.logical.LogicalPlan |
logicalPlan() |
<R> RDD<R> |
map(scala.Function1<Row,R> f,
scala.reflect.ClassTag<R> evidence$4)
Returns a new RDD by applying a function to all rows of this DataFrame.
|
<R> RDD<R> |
mapPartitions(scala.Function1<scala.collection.Iterator<Row>,scala.collection.Iterator<R>> f,
scala.reflect.ClassTag<R> evidence$6)
Returns a new RDD by applying a function to each partition of this DataFrame.
|
DataFrameNaFunctions |
na()
Returns a
DataFrameNaFunctions for working with missing data. |
protected scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> |
numericColumns() |
DataFrame |
orderBy(Column... sortExprs)
Returns a new
DataFrame sorted by the given expressions. |
DataFrame |
orderBy(scala.collection.Seq<Column> sortExprs)
Returns a new
DataFrame sorted by the given expressions. |
DataFrame |
orderBy(java.lang.String sortCol,
scala.collection.Seq<java.lang.String> sortCols)
Returns a new
DataFrame sorted by the given expressions. |
DataFrame |
orderBy(java.lang.String sortCol,
java.lang.String... sortCols)
Returns a new
DataFrame sorted by the given expressions. |
DataFrame |
persist()
Persist this
DataFrame with the default storage level (MEMORY_AND_DISK ). |
DataFrame |
persist(StorageLevel newLevel)
Persist this
DataFrame with the given storage level. |
void |
printSchema()
Prints the schema to the console in a nice tree format.
|
org.apache.spark.sql.execution.QueryExecution |
queryExecution() |
DataFrame[] |
randomSplit(double[] weights)
Randomly splits this
DataFrame with the provided weights. |
DataFrame[] |
randomSplit(double[] weights,
long seed)
Randomly splits this
DataFrame with the provided weights. |
RDD<Row> |
rdd()
|
void |
registerTempTable(java.lang.String tableName)
Registers this
DataFrame as a temporary table using the given name. |
DataFrame |
repartition(Column... partitionExprs)
Returns a new
DataFrame partitioned by the given partitioning expressions preserving
the existing number of partitions. |
DataFrame |
repartition(int numPartitions)
Returns a new
DataFrame that has exactly numPartitions partitions. |
DataFrame |
repartition(int numPartitions,
Column... partitionExprs)
Returns a new
DataFrame partitioned by the given partitioning expressions into
numPartitions . |
DataFrame |
repartition(int numPartitions,
scala.collection.Seq<Column> partitionExprs)
Returns a new
DataFrame partitioned by the given partitioning expressions into
numPartitions . |
DataFrame |
repartition(scala.collection.Seq<Column> partitionExprs)
Returns a new
DataFrame partitioned by the given partitioning expressions preserving
the existing number of partitions. |
protected org.apache.spark.sql.catalyst.expressions.NamedExpression |
resolve(java.lang.String colName) |
GroupedData |
rollup(Column... cols)
Create a multi-dimensional rollup for the current
DataFrame using the specified columns,
so we can run aggregation on them. |
GroupedData |
rollup(scala.collection.Seq<Column> cols)
Create a multi-dimensional rollup for the current
DataFrame using the specified columns,
so we can run aggregation on them. |
GroupedData |
rollup(java.lang.String col1,
scala.collection.Seq<java.lang.String> cols)
Create a multi-dimensional rollup for the current
DataFrame using the specified columns,
so we can run aggregation on them. |
GroupedData |
rollup(java.lang.String col1,
java.lang.String... cols)
Create a multi-dimensional rollup for the current
DataFrame using the specified columns,
so we can run aggregation on them. |
DataFrame |
sample(boolean withReplacement,
double fraction)
Returns a new
DataFrame by sampling a fraction of rows, using a random seed. |
DataFrame |
sample(boolean withReplacement,
double fraction,
long seed)
Returns a new
DataFrame by sampling a fraction of rows. |
void |
save(java.lang.String path)
Deprecated.
As of 1.4.0, replaced by
write().save(path) . This will be removed in Spark 2.0. |
void |
save(java.lang.String path,
SaveMode mode)
Deprecated.
As of 1.4.0, replaced by
write().mode(mode).save(path) .
This will be removed in Spark 2.0. |
void |
save(java.lang.String source,
SaveMode mode,
java.util.Map<java.lang.String,java.lang.String> options)
Deprecated.
As of 1.4.0, replaced by
write().format(source).mode(mode).options(options).save(path) .
This will be removed in Spark 2.0. |
void |
save(java.lang.String source,
SaveMode mode,
scala.collection.immutable.Map<java.lang.String,java.lang.String> options)
Deprecated.
As of 1.4.0, replaced by
write().format(source).mode(mode).options(options).save(path) .
This will be removed in Spark 2.0. |
void |
save(java.lang.String path,
java.lang.String source)
Deprecated.
As of 1.4.0, replaced by
write().format(source).save(path) .
This will be removed in Spark 2.0. |
void |
save(java.lang.String path,
java.lang.String source,
SaveMode mode)
Deprecated.
As of 1.4.0, replaced by
write().format(source).mode(mode).save(path) .
This will be removed in Spark 2.0. |
void |
saveAsParquetFile(java.lang.String path)
Deprecated.
As of 1.4.0, replaced by
write().parquet() . This will be removed in Spark 2.0. |
void |
saveAsTable(java.lang.String tableName)
Deprecated.
As of 1.4.0, replaced by
write().saveAsTable(tableName) .
This will be removed in Spark 2.0. |
void |
saveAsTable(java.lang.String tableName,
SaveMode mode)
Deprecated.
As of 1.4.0, replaced by
write().mode(mode).saveAsTable(tableName) .
This will be removed in Spark 2.0. |
void |
saveAsTable(java.lang.String tableName,
java.lang.String source)
Deprecated.
As of 1.4.0, replaced by
write().format(source).saveAsTable(tableName) .
This will be removed in Spark 2.0. |
void |
saveAsTable(java.lang.String tableName,
java.lang.String source,
SaveMode mode)
Deprecated.
As of 1.4.0, replaced by
write().mode(mode).saveAsTable(tableName) .
This will be removed in Spark 2.0. |
void |
saveAsTable(java.lang.String tableName,
java.lang.String source,
SaveMode mode,
java.util.Map<java.lang.String,java.lang.String> options)
Deprecated.
As of 1.4.0, replaced by
write().format(source).mode(mode).options(options).saveAsTable(tableName) .
This will be removed in Spark 2.0. |
void |
saveAsTable(java.lang.String tableName,
java.lang.String source,
SaveMode mode,
scala.collection.immutable.Map<java.lang.String,java.lang.String> options)
Deprecated.
As of 1.4.0, replaced by
write().format(source).mode(mode).options(options).saveAsTable(tableName) .
This will be removed in Spark 2.0. |
StructType |
schema()
Returns the schema of this
DataFrame . |
DataFrame |
select(Column... cols)
Selects a set of column based expressions.
|
DataFrame |
select(scala.collection.Seq<Column> cols)
Selects a set of column based expressions.
|
DataFrame |
select(java.lang.String col,
scala.collection.Seq<java.lang.String> cols)
Selects a set of columns.
|
DataFrame |
select(java.lang.String col,
java.lang.String... cols)
Selects a set of columns.
|
DataFrame |
selectExpr(scala.collection.Seq<java.lang.String> exprs)
Selects a set of SQL expressions.
|
DataFrame |
selectExpr(java.lang.String... exprs)
Selects a set of SQL expressions.
|
void |
show()
Displays the top 20 rows of
DataFrame in a tabular form. |
void |
show(boolean truncate)
Displays the top 20 rows of
DataFrame in a tabular form. |
void |
show(int numRows)
Displays the
DataFrame in a tabular form. |
void |
show(int numRows,
boolean truncate)
Displays the
DataFrame in a tabular form. |
DataFrame |
sort(Column... sortExprs)
Returns a new
DataFrame sorted by the given expressions. |
DataFrame |
sort(scala.collection.Seq<Column> sortExprs)
Returns a new
DataFrame sorted by the given expressions. |
DataFrame |
sort(java.lang.String sortCol,
scala.collection.Seq<java.lang.String> sortCols)
Returns a new
DataFrame sorted by the specified column, all in ascending order. |
DataFrame |
sort(java.lang.String sortCol,
java.lang.String... sortCols)
Returns a new
DataFrame sorted by the specified column, all in ascending order. |
DataFrame |
sortWithinPartitions(Column... sortExprs)
Returns a new
DataFrame with each partition sorted by the given expressions. |
DataFrame |
sortWithinPartitions(scala.collection.Seq<Column> sortExprs)
Returns a new
DataFrame with each partition sorted by the given expressions. |
DataFrame |
sortWithinPartitions(java.lang.String sortCol,
scala.collection.Seq<java.lang.String> sortCols)
Returns a new
DataFrame with each partition sorted by the given expressions. |
DataFrame |
sortWithinPartitions(java.lang.String sortCol,
java.lang.String... sortCols)
Returns a new
DataFrame with each partition sorted by the given expressions. |
SQLContext |
sqlContext() |
DataFrameStatFunctions |
stat()
Returns a
DataFrameStatFunctions for working statistic functions support. |
Row[] |
take(int n)
Returns the first
n rows in the DataFrame . |
java.util.List<Row> |
takeAsList(int n)
Returns the first
n rows in the DataFrame as a list. |
DataFrame |
toDF()
Returns the object itself.
|
DataFrame |
toDF(scala.collection.Seq<java.lang.String> colNames)
Returns a new
DataFrame with columns renamed. |
DataFrame |
toDF(java.lang.String... colNames)
Returns a new
DataFrame with columns renamed. |
JavaRDD<Row> |
toJavaRDD()
|
RDD<java.lang.String> |
toJSON()
Returns the content of the
DataFrame as a RDD of JSON strings. |
DataFrame |
toSchemaRDD()
Deprecated.
As of 1.3.0, replaced by
toDF() . This will be removed in Spark 2.0. |
<U> DataFrame |
transform(scala.Function1<DataFrame,DataFrame> t)
Concise syntax for chaining custom transformations.
|
DataFrame |
unionAll(DataFrame other)
Returns a new
DataFrame containing union of rows in this frame and another frame. |
DataFrame |
unpersist()
Mark the
DataFrame as non-persistent, and remove all blocks for it from memory and disk. |
DataFrame |
unpersist(boolean blocking)
Mark the
DataFrame as non-persistent, and remove all blocks for it from memory and disk. |
DataFrame |
where(Column condition)
Filters rows using the given condition.
|
DataFrame |
where(java.lang.String conditionExpr)
Filters rows using the given SQL expression.
|
DataFrame |
withColumn(java.lang.String colName,
Column col)
Returns a new
DataFrame by adding a column or replacing the existing column that has
the same name. |
DataFrame |
withColumnRenamed(java.lang.String existingName,
java.lang.String newName)
Returns a new
DataFrame with a column renamed. |
DataFrameWriter |
write()
:: Experimental ::
Interface for saving the content of the
DataFrame out into external storage. |
public DataFrame(SQLContext sqlContext, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan)
This reports error eagerly as the DataFrame
is constructed, unless
SQLConf.dataFrameEagerAnalysis
is turned off.
sqlContext
- (undocumented)logicalPlan
- (undocumented)public DataFrame toDF(java.lang.String... colNames)
DataFrame
with columns renamed. This can be quite convenient in conversion
from a RDD of tuples into a DataFrame
with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame with column name _1 and _2
rdd.toDF("id", "name") // this creates a DataFrame with column name "id" and "name"
colNames
- (undocumented)public DataFrame sortWithinPartitions(java.lang.String sortCol, java.lang.String... sortCols)
DataFrame
with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortCol
- (undocumented)sortCols
- (undocumented)public DataFrame sortWithinPartitions(Column... sortExprs)
DataFrame
with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortExprs
- (undocumented)public DataFrame sort(java.lang.String sortCol, java.lang.String... sortCols)
DataFrame
sorted by the specified column, all in ascending order.
// The following 3 are equivalent
df.sort("sortcol")
df.sort($"sortcol")
df.sort($"sortcol".asc)
sortCol
- (undocumented)sortCols
- (undocumented)public DataFrame sort(Column... sortExprs)
DataFrame
sorted by the given expressions. For example:
df.sort($"col1", $"col2".desc)
sortExprs
- (undocumented)public DataFrame orderBy(java.lang.String sortCol, java.lang.String... sortCols)
DataFrame
sorted by the given expressions.
This is an alias of the sort
function.sortCol
- (undocumented)sortCols
- (undocumented)public DataFrame orderBy(Column... sortExprs)
DataFrame
sorted by the given expressions.
This is an alias of the sort
function.sortExprs
- (undocumented)public DataFrame select(Column... cols)
df.select($"colA", $"colB" + 1)
cols
- (undocumented)public DataFrame select(java.lang.String col, java.lang.String... cols)
select
that can only select
existing columns using column names (i.e. cannot construct expressions).
// The following two are equivalent:
df.select("colA", "colB")
df.select($"colA", $"colB")
col
- (undocumented)cols
- (undocumented)public DataFrame selectExpr(java.lang.String... exprs)
select
that accepts
SQL expressions.
// The following are equivalent:
df.selectExpr("colA", "colB as newName", "abs(colC)")
df.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
exprs
- (undocumented)public GroupedData groupBy(Column... cols)
DataFrame
using the specified columns, so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
// Compute the average for all numeric columns grouped by department.
df.groupBy($"department").avg()
// Compute the max age and average salary, grouped by department and gender.
df.groupBy($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public GroupedData rollup(Column... cols)
DataFrame
using the specified columns,
so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
// Compute the average for all numeric columns rolluped by department and group.
df.rollup($"department", $"group").avg()
// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public GroupedData cube(Column... cols)
DataFrame
using the specified columns,
so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group.
df.cube($"department", $"group").avg()
// Compute the max age and average salary, cubed by department and gender.
df.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public GroupedData groupBy(java.lang.String col1, java.lang.String... cols)
DataFrame
using the specified columns, so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns grouped by department.
df.groupBy("department").avg()
// Compute the max age and average salary, grouped by department and gender.
df.groupBy($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public GroupedData rollup(java.lang.String col1, java.lang.String... cols)
DataFrame
using the specified columns,
so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolluped by department and group.
df.rollup("department", "group").avg()
// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public GroupedData cube(java.lang.String col1, java.lang.String... cols)
DataFrame
using the specified columns,
so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group.
df.cube("department", "group").avg()
// Compute the max age and average salary, cubed by department and gender.
df.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public DataFrame agg(Column expr, Column... exprs)
DataFrame
without groups.
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg(max($"age"), avg($"salary"))
df.groupBy().agg(max($"age"), avg($"salary"))
expr
- (undocumented)exprs
- (undocumented)public DataFrame describe(java.lang.String... cols)
This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting DataFrame
. If you want to
programmatically compute summary statistics, use the agg
function instead.
df.describe("age", "height").show()
// output:
// summary age height
// count 10.0 10.0
// mean 53.3 178.05
// stddev 11.6 15.7
// min 18.0 163.0
// max 92.0 192.0
cols
- (undocumented)public DataFrame repartition(int numPartitions, Column... partitionExprs)
DataFrame
partitioned by the given partitioning expressions into
numPartitions
. The resulting DataFrame is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
numPartitions
- (undocumented)partitionExprs
- (undocumented)public DataFrame repartition(Column... partitionExprs)
DataFrame
partitioned by the given partitioning expressions preserving
the existing number of partitions. The resulting DataFrame is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
partitionExprs
- (undocumented)public SQLContext sqlContext()
sqlContext
in interface org.apache.spark.sql.execution.Queryable
public org.apache.spark.sql.execution.QueryExecution queryExecution()
queryExecution
in interface org.apache.spark.sql.execution.Queryable
protected org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan()
protected org.apache.spark.sql.catalyst.expressions.NamedExpression resolve(java.lang.String colName)
protected scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> numericColumns()
public DataFrame toDF()
public <U> Dataset<U> as(Encoder<U> evidence$1)
DataFrame
to a strongly-typed Dataset
containing objects of the
specified type, U
.evidence$1
- (undocumented)public DataFrame toDF(scala.collection.Seq<java.lang.String> colNames)
DataFrame
with columns renamed. This can be quite convenient in conversion
from a RDD of tuples into a DataFrame
with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame with column name _1 and _2
rdd.toDF("id", "name") // this creates a DataFrame with column name "id" and "name"
colNames
- (undocumented)public StructType schema()
DataFrame
.schema
in interface org.apache.spark.sql.execution.Queryable
public void printSchema()
printSchema
in interface org.apache.spark.sql.execution.Queryable
public void explain(boolean extended)
explain
in interface org.apache.spark.sql.execution.Queryable
extended
- (undocumented)public void explain()
explain
in interface org.apache.spark.sql.execution.Queryable
public scala.Tuple2<java.lang.String,java.lang.String>[] dtypes()
public java.lang.String[] columns()
public boolean isLocal()
collect
and take
methods can be run locally
(without any Spark executors).public void show(int numRows)
DataFrame
in a tabular form. Strings more than 20 characters will be
truncated, and all cells will be aligned right. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12 0.503218 0.595103
1981 01 0.523289 0.570307
1982 02 0.436504 0.475256
1983 03 0.410516 0.442194
1984 04 0.450090 0.483521
numRows
- Number of rows to show
public void show()
DataFrame
in a tabular form. Strings more than 20 characters
will be truncated, and all cells will be aligned right.public void show(boolean truncate)
DataFrame
in a tabular form.
truncate
- Whether truncate long strings. If true, strings more than 20 characters will
be truncated and all cells will be aligned right
public void show(int numRows, boolean truncate)
DataFrame
in a tabular form. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12 0.503218 0.595103
1981 01 0.523289 0.570307
1982 02 0.436504 0.475256
1983 03 0.410516 0.442194
1984 04 0.450090 0.483521
numRows
- Number of rows to showtruncate
- Whether truncate long strings. If true, strings more than 20 characters will
be truncated and all cells will be aligned right
public DataFrameNaFunctions na()
DataFrameNaFunctions
for working with missing data.
// Dropping rows containing any null values.
df.na.drop()
public DataFrameStatFunctions stat()
DataFrameStatFunctions
for working statistic functions support.
// Finding frequent items in column with name 'a'.
df.stat.freqItems(Seq("a"))
public DataFrame join(DataFrame right)
DataFrame
.
Note that cartesian joins are very expensive without an extra filter that can be pushed down.
right
- Right side of the join operation.public DataFrame join(DataFrame right, java.lang.String usingColumn)
DataFrame
using the given column.
Different from other join functions, the join column will only appear once in the output,
i.e. similar to SQL's JOIN USING
syntax.
// Joining df1 and df2 using the column "user_id"
df1.join(df2, "user_id")
Note that if you perform a self-join using this function without aliasing the input
DataFrame
s, you will NOT be able to reference any columns after the join, since
there is no way to disambiguate which side of the join you would like to reference.
right
- Right side of the join operation.usingColumn
- Name of the column to join on. This column must exist on both sides.public DataFrame join(DataFrame right, scala.collection.Seq<java.lang.String> usingColumns)
DataFrame
using the given columns.
Different from other join functions, the join columns will only appear once in the output,
i.e. similar to SQL's JOIN USING
syntax.
// Joining df1 and df2 using the columns "user_id" and "user_name"
df1.join(df2, Seq("user_id", "user_name"))
Note that if you perform a self-join using this function without aliasing the input
DataFrame
s, you will NOT be able to reference any columns after the join, since
there is no way to disambiguate which side of the join you would like to reference.
right
- Right side of the join operation.usingColumns
- Names of the columns to join on. This columns must exist on both sides.public DataFrame join(DataFrame right, scala.collection.Seq<java.lang.String> usingColumns, java.lang.String joinType)
DataFrame
using the given columns.
Different from other join functions, the join columns will only appear once in the output,
i.e. similar to SQL's JOIN USING
syntax.
Note that if you perform a self-join using this function without aliasing the input
DataFrame
s, you will NOT be able to reference any columns after the join, since
there is no way to disambiguate which side of the join you would like to reference.
right
- Right side of the join operation.usingColumns
- Names of the columns to join on. This columns must exist on both sides.joinType
- One of: inner
, outer
, left_outer
, right_outer
, leftsemi
.public DataFrame join(DataFrame right, Column joinExprs)
DataFrame
, using the given join expression.
// The following two are equivalent:
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")
right
- (undocumented)joinExprs
- (undocumented)public DataFrame join(DataFrame right, Column joinExprs, java.lang.String joinType)
DataFrame
, using the given join expression. The following performs
a full outer join between df1
and df2
.
// Scala:
import org.apache.spark.sql.functions._
df1.join(df2, $"df1Key" === $"df2Key", "outer")
// Java:
import static org.apache.spark.sql.functions.*;
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
right
- Right side of the join.joinExprs
- Join expression.joinType
- One of: inner
, outer
, left_outer
, right_outer
, leftsemi
.public DataFrame sortWithinPartitions(java.lang.String sortCol, scala.collection.Seq<java.lang.String> sortCols)
DataFrame
with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortCol
- (undocumented)sortCols
- (undocumented)public DataFrame sortWithinPartitions(scala.collection.Seq<Column> sortExprs)
DataFrame
with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortExprs
- (undocumented)public DataFrame sort(java.lang.String sortCol, scala.collection.Seq<java.lang.String> sortCols)
DataFrame
sorted by the specified column, all in ascending order.
// The following 3 are equivalent
df.sort("sortcol")
df.sort($"sortcol")
df.sort($"sortcol".asc)
sortCol
- (undocumented)sortCols
- (undocumented)public DataFrame sort(scala.collection.Seq<Column> sortExprs)
DataFrame
sorted by the given expressions. For example:
df.sort($"col1", $"col2".desc)
sortExprs
- (undocumented)public DataFrame orderBy(java.lang.String sortCol, scala.collection.Seq<java.lang.String> sortCols)
DataFrame
sorted by the given expressions.
This is an alias of the sort
function.sortCol
- (undocumented)sortCols
- (undocumented)public DataFrame orderBy(scala.collection.Seq<Column> sortExprs)
DataFrame
sorted by the given expressions.
This is an alias of the sort
function.sortExprs
- (undocumented)public Column apply(java.lang.String colName)
Column
.
Note that the column name can also reference to a nested column like a.b
.colName
- (undocumented)public Column col(java.lang.String colName)
Column
.
Note that the column name can also reference to a nested column like a.b
.colName
- (undocumented)public DataFrame as(java.lang.String alias)
DataFrame
with an alias set.alias
- (undocumented)public DataFrame as(scala.Symbol alias)
DataFrame
with an alias set.alias
- (undocumented)public DataFrame alias(java.lang.String alias)
DataFrame
with an alias set. Same as as
.alias
- (undocumented)public DataFrame alias(scala.Symbol alias)
DataFrame
with an alias set. Same as as
.alias
- (undocumented)public DataFrame select(scala.collection.Seq<Column> cols)
df.select($"colA", $"colB" + 1)
cols
- (undocumented)public DataFrame select(java.lang.String col, scala.collection.Seq<java.lang.String> cols)
select
that can only select
existing columns using column names (i.e. cannot construct expressions).
// The following two are equivalent:
df.select("colA", "colB")
df.select($"colA", $"colB")
col
- (undocumented)cols
- (undocumented)public DataFrame selectExpr(scala.collection.Seq<java.lang.String> exprs)
select
that accepts
SQL expressions.
// The following are equivalent:
df.selectExpr("colA", "colB as newName", "abs(colC)")
df.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
exprs
- (undocumented)public DataFrame filter(Column condition)
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
condition
- (undocumented)public DataFrame filter(java.lang.String conditionExpr)
peopleDf.filter("age > 15")
conditionExpr
- (undocumented)public DataFrame where(Column condition)
filter
.
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
condition
- (undocumented)public DataFrame where(java.lang.String conditionExpr)
peopleDf.where("age > 15")
conditionExpr
- (undocumented)public GroupedData groupBy(scala.collection.Seq<Column> cols)
DataFrame
using the specified columns, so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
// Compute the average for all numeric columns grouped by department.
df.groupBy($"department").avg()
// Compute the max age and average salary, grouped by department and gender.
df.groupBy($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public GroupedData rollup(scala.collection.Seq<Column> cols)
DataFrame
using the specified columns,
so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
// Compute the average for all numeric columns rolluped by department and group.
df.rollup($"department", $"group").avg()
// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public GroupedData cube(scala.collection.Seq<Column> cols)
DataFrame
using the specified columns,
so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group.
df.cube($"department", $"group").avg()
// Compute the max age and average salary, cubed by department and gender.
df.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public GroupedData groupBy(java.lang.String col1, scala.collection.Seq<java.lang.String> cols)
DataFrame
using the specified columns, so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns grouped by department.
df.groupBy("department").avg()
// Compute the max age and average salary, grouped by department and gender.
df.groupBy($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public GroupedData rollup(java.lang.String col1, scala.collection.Seq<java.lang.String> cols)
DataFrame
using the specified columns,
so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolluped by department and group.
df.rollup("department", "group").avg()
// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public GroupedData cube(java.lang.String col1, scala.collection.Seq<java.lang.String> cols)
DataFrame
using the specified columns,
so we can run aggregation on them.
See GroupedData
for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group.
df.cube("department", "group").avg()
// Compute the max age and average salary, cubed by department and gender.
df.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public DataFrame agg(scala.Tuple2<java.lang.String,java.lang.String> aggExpr, scala.collection.Seq<scala.Tuple2<java.lang.String,java.lang.String>> aggExprs)
DataFrame
without groups.
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg("age" -> "max", "salary" -> "avg")
df.groupBy().agg("age" -> "max", "salary" -> "avg")
aggExpr
- (undocumented)aggExprs
- (undocumented)public DataFrame agg(scala.collection.immutable.Map<java.lang.String,java.lang.String> exprs)
DataFrame
without groups.
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg(Map("age" -> "max", "salary" -> "avg"))
df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
exprs
- (undocumented)public DataFrame agg(java.util.Map<java.lang.String,java.lang.String> exprs)
DataFrame
without groups.
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg(Map("age" -> "max", "salary" -> "avg"))
df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
exprs
- (undocumented)public DataFrame agg(Column expr, scala.collection.Seq<Column> exprs)
DataFrame
without groups.
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg(max($"age"), avg($"salary"))
df.groupBy().agg(max($"age"), avg($"salary"))
expr
- (undocumented)exprs
- (undocumented)public DataFrame limit(int n)
DataFrame
by taking the first n
rows. The difference between this function
and head
is that head
returns an array while limit
returns a new DataFrame
.n
- (undocumented)public DataFrame unionAll(DataFrame other)
DataFrame
containing union of rows in this frame and another frame.
This is equivalent to UNION ALL
in SQL.other
- (undocumented)public DataFrame intersect(DataFrame other)
DataFrame
containing rows only in both this frame and another frame.
This is equivalent to INTERSECT
in SQL.other
- (undocumented)public DataFrame except(DataFrame other)
DataFrame
containing rows in this frame but not in another frame.
This is equivalent to EXCEPT
in SQL.other
- (undocumented)public DataFrame sample(boolean withReplacement, double fraction, long seed)
DataFrame
by sampling a fraction of rows.
withReplacement
- Sample with replacement or not.fraction
- Fraction of rows to generate.seed
- Seed for sampling.public DataFrame sample(boolean withReplacement, double fraction)
DataFrame
by sampling a fraction of rows, using a random seed.
withReplacement
- Sample with replacement or not.fraction
- Fraction of rows to generate.public DataFrame[] randomSplit(double[] weights, long seed)
DataFrame
with the provided weights.
weights
- weights for splits, will be normalized if they don't sum to 1.seed
- Seed for sampling.public DataFrame[] randomSplit(double[] weights)
DataFrame
with the provided weights.
weights
- weights for splits, will be normalized if they don't sum to 1.public <A extends scala.Product> DataFrame explode(scala.collection.Seq<Column> input, scala.Function1<Row,scala.collection.TraversableOnce<A>> f, scala.reflect.api.TypeTags.TypeTag<A> evidence$2)
DataFrame
where each row has been expanded to zero or more
rows by the provided function. This is similar to a LATERAL VIEW
in HiveQL. The columns of
the input row are implicitly joined with each row that is output by the function.
The following example uses this function to count the number of books which contain a given word:
case class Book(title: String, words: String)
val df: RDD[Book]
case class Word(word: String)
val allWords = df.explode('words) {
case Row(words: String) => words.split(" ").map(Word(_))
}
val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
input
- (undocumented)f
- (undocumented)evidence$2
- (undocumented)public <A,B> DataFrame explode(java.lang.String inputColumn, java.lang.String outputColumn, scala.Function1<A,scala.collection.TraversableOnce<B>> f, scala.reflect.api.TypeTags.TypeTag<B> evidence$3)
DataFrame
where a single column has been expanded to zero
or more rows by the provided function. This is similar to a LATERAL VIEW
in HiveQL. All
columns of the input row are implicitly joined with each value that is output by the function.
df.explode("words", "word"){words: String => words.split(" ")}
inputColumn
- (undocumented)outputColumn
- (undocumented)f
- (undocumented)evidence$3
- (undocumented)public DataFrame withColumn(java.lang.String colName, Column col)
DataFrame
by adding a column or replacing the existing column that has
the same name.colName
- (undocumented)col
- (undocumented)public DataFrame withColumnRenamed(java.lang.String existingName, java.lang.String newName)
DataFrame
with a column renamed.
This is a no-op if schema doesn't contain existingName.existingName
- (undocumented)newName
- (undocumented)public DataFrame drop(java.lang.String colName)
DataFrame
with a column dropped.
This is a no-op if schema doesn't contain column name.colName
- (undocumented)public DataFrame drop(Column col)
DataFrame
with a column dropped.
This version of drop accepts a Column rather than a name.
This is a no-op if the DataFrame doesn't have a column
with an equivalent expression.col
- (undocumented)public DataFrame dropDuplicates()
DataFrame
that contains only the unique rows from this DataFrame
.
This is an alias for distinct
.public DataFrame dropDuplicates(scala.collection.Seq<java.lang.String> colNames)
DataFrame
with duplicate rows removed, considering only
the subset of columns.
colNames
- (undocumented)public DataFrame dropDuplicates(java.lang.String[] colNames)
DataFrame
with duplicate rows removed, considering only
the subset of columns.
colNames
- (undocumented)public DataFrame describe(scala.collection.Seq<java.lang.String> cols)
This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting DataFrame
. If you want to
programmatically compute summary statistics, use the agg
function instead.
df.describe("age", "height").show()
// output:
// summary age height
// count 10.0 10.0
// mean 53.3 178.05
// stddev 11.6 15.7
// min 18.0 163.0
// max 92.0 192.0
cols
- (undocumented)public Row[] head(int n)
n
rows.n
- (undocumented)public Row head()
public Row first()
public <U> DataFrame transform(scala.Function1<DataFrame,DataFrame> t)
def featurize(ds: DataFrame) = ...
df
.transform(featurize)
.transform(...)
t
- (undocumented)public <R> RDD<R> map(scala.Function1<Row,R> f, scala.reflect.ClassTag<R> evidence$4)
f
- (undocumented)evidence$4
- (undocumented)public <R> RDD<R> flatMap(scala.Function1<Row,scala.collection.TraversableOnce<R>> f, scala.reflect.ClassTag<R> evidence$5)
DataFrame
,
and then flattening the results.f
- (undocumented)evidence$5
- (undocumented)public <R> RDD<R> mapPartitions(scala.Function1<scala.collection.Iterator<Row>,scala.collection.Iterator<R>> f, scala.reflect.ClassTag<R> evidence$6)
f
- (undocumented)evidence$6
- (undocumented)public void foreach(scala.Function1<Row,scala.runtime.BoxedUnit> f)
f
to all rows.f
- (undocumented)public void foreachPartition(scala.Function1<scala.collection.Iterator<Row>,scala.runtime.BoxedUnit> f)
DataFrame
.f
- (undocumented)public Row[] take(int n)
n
rows in the DataFrame
.
Running take requires moving data into the application's driver process, and doing so with
a very large n
can crash the driver process with OutOfMemoryError.
n
- (undocumented)public java.util.List<Row> takeAsList(int n)
n
rows in the DataFrame
as a list.
Running take requires moving data into the application's driver process, and doing so with
a very large n
can crash the driver process with OutOfMemoryError.
n
- (undocumented)public Row[] collect()
Row
s in this DataFrame
.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For Java API, use collectAsList
.
public java.util.List<Row> collectAsList()
Row
s in this DataFrame
.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
public long count()
DataFrame
.public DataFrame repartition(int numPartitions)
DataFrame
that has exactly numPartitions
partitions.numPartitions
- (undocumented)public DataFrame repartition(int numPartitions, scala.collection.Seq<Column> partitionExprs)
DataFrame
partitioned by the given partitioning expressions into
numPartitions
. The resulting DataFrame is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
numPartitions
- (undocumented)partitionExprs
- (undocumented)public DataFrame repartition(scala.collection.Seq<Column> partitionExprs)
DataFrame
partitioned by the given partitioning expressions preserving
the existing number of partitions. The resulting DataFrame is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
partitionExprs
- (undocumented)public DataFrame coalesce(int numPartitions)
DataFrame
that has exactly numPartitions
partitions.
Similar to coalesce defined on an RDD
, this operation results in a narrow dependency, e.g.
if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
the 100 new partitions will claim 10 of the current partitions.numPartitions
- (undocumented)public DataFrame distinct()
DataFrame
that contains only the unique rows from this DataFrame
.
This is an alias for dropDuplicates
.public DataFrame persist()
DataFrame
with the default storage level (MEMORY_AND_DISK
).public DataFrame cache()
DataFrame
with the default storage level (MEMORY_AND_DISK
).public DataFrame persist(StorageLevel newLevel)
DataFrame
with the given storage level.newLevel
- One of: MEMORY_ONLY
, MEMORY_AND_DISK
, MEMORY_ONLY_SER
,
MEMORY_AND_DISK_SER
, DISK_ONLY
, MEMORY_ONLY_2
,
MEMORY_AND_DISK_2
, etc.public DataFrame unpersist(boolean blocking)
DataFrame
as non-persistent, and remove all blocks for it from memory and disk.blocking
- Whether to block until all blocks are deleted.public DataFrame unpersist()
DataFrame
as non-persistent, and remove all blocks for it from memory and disk.public RDD<Row> rdd()
DataFrame
as an RDD
of Row
s. Note that the RDD is
memoized. Once called, it won't change even if you change any query planning related Spark SQL
configurations (e.g. spark.sql.shuffle.partitions
).public void registerTempTable(java.lang.String tableName)
DataFrame
as a temporary table using the given name. The lifetime of this
temporary table is tied to the SQLContext
that was used to create this DataFrame.
tableName
- (undocumented)public DataFrameWriter write()
DataFrame
out into external storage.
public RDD<java.lang.String> toJSON()
DataFrame
as a RDD of JSON strings.public java.lang.String[] inputFiles()
protected JavaRDD<byte[]> javaToPython()
protected int collectToPython()
public DataFrame toSchemaRDD()
toDF()
. This will be removed in Spark 2.0.public void createJDBCTable(java.lang.String url, java.lang.String table, boolean allowExisting)
write().jdbc()
. This will be removed in Spark 2.0.DataFrame
to a JDBC database at url
under the table name table
.
This will run a CREATE TABLE
and a bunch of INSERT INTO
statements.
If you pass true
for allowExisting
, it will drop any table with the
given name; if you pass false
, it will throw if the table already
exists.url
- (undocumented)table
- (undocumented)allowExisting
- (undocumented)public void insertIntoJDBC(java.lang.String url, java.lang.String table, boolean overwrite)
write().jdbc()
. This will be removed in Spark 2.0.DataFrame
to a JDBC database at url
under the table name table
.
Assumes the table already exists and has a compatible schema. If you
pass true
for overwrite
, it will TRUNCATE
the table before
performing the INSERT
s.
The table must already exist on the database. It must have a schema
that is compatible with the schema of this RDD; inserting the rows of
the RDD in order via the simple statement
INSERT INTO table VALUES (?, ?, ..., ?)
should not fail.
url
- (undocumented)table
- (undocumented)overwrite
- (undocumented)public void saveAsParquetFile(java.lang.String path)
write().parquet()
. This will be removed in Spark 2.0.DataFrame
as a parquet file, preserving the schema.
Files that are written out using this method can be read back in as a DataFrame
using the parquetFile
function in SQLContext
.path
- (undocumented)public void saveAsTable(java.lang.String tableName)
write().saveAsTable(tableName)
.
This will be removed in Spark 2.0.
Note that this currently only works with DataFrames that are created from a HiveContext as
there is no notion of a persisted catalog in a standard SQL context. Instead you can write
an RDD out to a parquet file, and then register that file as a table. This "table" can then
be the target of an insertInto
.
When the DataFrame is created from a non-partitioned HadoopFsRelation
with a single input
path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC
and Parquet), the table is persisted in a Hive compatible format, which means other systems
like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL
specific format.
tableName
- (undocumented)public void saveAsTable(java.lang.String tableName, SaveMode mode)
write().mode(mode).saveAsTable(tableName)
.
This will be removed in Spark 2.0.SaveMode.ErrorIfExists
as the save mode.
Note that this currently only works with DataFrames that are created from a HiveContext as
there is no notion of a persisted catalog in a standard SQL context. Instead you can write
an RDD out to a parquet file, and then register that file as a table. This "table" can then
be the target of an insertInto
.
When the DataFrame is created from a non-partitioned HadoopFsRelation
with a single input
path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC
and Parquet), the table is persisted in a Hive compatible format, which means other systems
like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL
specific format.
tableName
- (undocumented)mode
- (undocumented)public void saveAsTable(java.lang.String tableName, java.lang.String source)
write().format(source).saveAsTable(tableName)
.
This will be removed in Spark 2.0.SaveMode.ErrorIfExists
as the save mode.
Note that this currently only works with DataFrames that are created from a HiveContext as
there is no notion of a persisted catalog in a standard SQL context. Instead you can write
an RDD out to a parquet file, and then register that file as a table. This "table" can then
be the target of an insertInto
.
When the DataFrame is created from a non-partitioned HadoopFsRelation
with a single input
path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC
and Parquet), the table is persisted in a Hive compatible format, which means other systems
like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL
specific format.
tableName
- (undocumented)source
- (undocumented)public void saveAsTable(java.lang.String tableName, java.lang.String source, SaveMode mode)
write().mode(mode).saveAsTable(tableName)
.
This will be removed in Spark 2.0.SaveMode
specified by mode, and a set of options.
Note that this currently only works with DataFrames that are created from a HiveContext as
there is no notion of a persisted catalog in a standard SQL context. Instead you can write
an RDD out to a parquet file, and then register that file as a table. This "table" can then
be the target of an insertInto
.
When the DataFrame is created from a non-partitioned HadoopFsRelation
with a single input
path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC
and Parquet), the table is persisted in a Hive compatible format, which means other systems
like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL
specific format.
tableName
- (undocumented)source
- (undocumented)mode
- (undocumented)public void saveAsTable(java.lang.String tableName, java.lang.String source, SaveMode mode, java.util.Map<java.lang.String,java.lang.String> options)
write().format(source).mode(mode).options(options).saveAsTable(tableName)
.
This will be removed in Spark 2.0.SaveMode
specified by mode, and a set of options.
Note that this currently only works with DataFrames that are created from a HiveContext as
there is no notion of a persisted catalog in a standard SQL context. Instead you can write
an RDD out to a parquet file, and then register that file as a table. This "table" can then
be the target of an insertInto
.
When the DataFrame is created from a non-partitioned HadoopFsRelation
with a single input
path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC
and Parquet), the table is persisted in a Hive compatible format, which means other systems
like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL
specific format.
tableName
- (undocumented)source
- (undocumented)mode
- (undocumented)options
- (undocumented)public void saveAsTable(java.lang.String tableName, java.lang.String source, SaveMode mode, scala.collection.immutable.Map<java.lang.String,java.lang.String> options)
write().format(source).mode(mode).options(options).saveAsTable(tableName)
.
This will be removed in Spark 2.0.SaveMode
specified by mode, and a set of options.
Note that this currently only works with DataFrames that are created from a HiveContext as
there is no notion of a persisted catalog in a standard SQL context. Instead you can write
an RDD out to a parquet file, and then register that file as a table. This "table" can then
be the target of an insertInto
.
When the DataFrame is created from a non-partitioned HadoopFsRelation
with a single input
path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC
and Parquet), the table is persisted in a Hive compatible format, which means other systems
like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL
specific format.
tableName
- (undocumented)source
- (undocumented)mode
- (undocumented)options
- (undocumented)public void save(java.lang.String path)
write().save(path)
. This will be removed in Spark 2.0.SaveMode.ErrorIfExists
as the save mode.path
- (undocumented)public void save(java.lang.String path, SaveMode mode)
write().mode(mode).save(path)
.
This will be removed in Spark 2.0.SaveMode
specified by mode,
using the default data source configured by spark.sql.sources.default.path
- (undocumented)mode
- (undocumented)public void save(java.lang.String path, java.lang.String source)
write().format(source).save(path)
.
This will be removed in Spark 2.0.SaveMode.ErrorIfExists
as the save mode.path
- (undocumented)source
- (undocumented)public void save(java.lang.String path, java.lang.String source, SaveMode mode)
write().format(source).mode(mode).save(path)
.
This will be removed in Spark 2.0.SaveMode
specified by mode.path
- (undocumented)source
- (undocumented)mode
- (undocumented)public void save(java.lang.String source, SaveMode mode, java.util.Map<java.lang.String,java.lang.String> options)
write().format(source).mode(mode).options(options).save(path)
.
This will be removed in Spark 2.0.SaveMode
specified by mode, and a set of options.source
- (undocumented)mode
- (undocumented)options
- (undocumented)public void save(java.lang.String source, SaveMode mode, scala.collection.immutable.Map<java.lang.String,java.lang.String> options)
write().format(source).mode(mode).options(options).save(path)
.
This will be removed in Spark 2.0.SaveMode
specified by mode, and a set of optionssource
- (undocumented)mode
- (undocumented)options
- (undocumented)public void insertInto(java.lang.String tableName, boolean overwrite)
write().mode(SaveMode.Append|SaveMode.Overwrite).saveAsTable(tableName)
.
This will be removed in Spark 2.0.tableName
- (undocumented)overwrite
- (undocumented)public void insertInto(java.lang.String tableName)
write().mode(SaveMode.Append).saveAsTable(tableName)
.
This will be removed in Spark 2.0.tableName
- (undocumented)