pyspark.pandas.Series.apply

Series.apply(func: Callable, args: Sequence[Any] = (), **kwds: Any) → pyspark.pandas.series.Series

Invoke function on values of Series.

func can be a Python function that operates on the values of the Series.

Note

This API executes the function once to infer the return type, which can be expensive, for instance when the dataset is created after aggregations or sorting.

To avoid this, specify the return type in func with a type hint, as below:

>>> def square(x) -> np.int32:
...     return x ** 2

When a return type hint is present, pandas-on-Spark uses it and does not try to infer the type.
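A return type hint can be read from a function object without calling it, which is why declaring one avoids the inference run. The snippet below is a minimal sketch of that idea using only the standard library. The helper name `declared_return_type` is hypothetical and is not part of pandas-on-Spark, and the sketch does not reproduce how the library actually resolves hints.

```python
from typing import Callable, Optional, get_type_hints


def declared_return_type(func: Callable) -> Optional[type]:
    """Return the annotated return type of func, or None if it has none.

    Hypothetical helper for illustration only; not a pandas-on-Spark API.
    """
    try:
        hints = get_type_hints(func)
    except TypeError:
        # Some callables expose no readable annotations at all.
        return None
    return hints.get("return")


def square(x) -> int:
    return x ** 2


def square_untyped(x):
    return x ** 2


# The function bodies are never executed here; only annotations are inspected.
print(declared_return_type(square))          # <class 'int'>
print(declared_return_type(square_untyped))  # None
```

A function like `square_untyped` gives a caller nothing to inspect, so the only way to learn its result type is to run it on some data.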

Parameters

func (function): Python function to apply. A return type hint is recommended; without one, the function is executed once to infer the type.
args (tuple): Positional arguments passed to func after the series value.
**kwds: Additional keyword arguments passed to func.

Returns

Series

See also

Series.aggregate: Only perform aggregating type operations.
Series.transform: Only perform transforming type operations.
DataFrame.apply: The equivalent function for DataFrame.

Examples

Create a Series with typical summer temperatures for each city.

>>> import numpy as np
>>> import pyspark.pandas as ps
>>> s = ps.Series([20, 21, 12],
...               index=['London', 'New York', 'Helsinki'])
>>> s
London      20
New York    21
Helsinki    12
dtype: int64

Square the values by defining a function and passing it as an argument to apply().

>>> def square(x) -> np.int64:
...     return x ** 2
>>> s.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64

Define a custom function that needs additional positional arguments and pass them using the args keyword.

>>> def subtract_custom_value(x, custom_value) -> np.int64:
...     return x - custom_value

>>> s.apply(subtract_custom_value, args=(5,))
London      15
New York    16
Helsinki     7
dtype: int64

Define a custom function that takes keyword arguments and pass these arguments to apply().

>>> def add_custom_values(x, **kwargs) -> np.int64:
...     for month in kwargs:
...         x += kwargs[month]
...     return x

>>> s.apply(add_custom_values, june=30, july=20, august=25)
London      95
New York    96
Helsinki    87
dtype: int64
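The two examples above rely on the same calling convention: each value is passed as the first argument, followed by the entries of args and then the keyword arguments. A plain-Python model of that convention, offered as a sketch only, is shown here. The name `apply_model` is hypothetical, and the list comprehension stands in for the distributed execution that pandas-on-Spark actually performs.

```python
from typing import Any, Callable, Iterable, List


def apply_model(values: Iterable[Any], func: Callable,
                args: tuple = (), **kwds: Any) -> List[Any]:
    """Plain-Python model: call func(value, *args, **kwds) for every value.

    Hypothetical stand-in for illustration; not how pandas-on-Spark runs it.
    """
    return [func(value, *args, **kwds) for value in values]


def subtract_custom_value(x, custom_value) -> int:
    return x - custom_value


def add_custom_values(x, **kwargs) -> int:
    for month in kwargs:
        x += kwargs[month]
    return x


temperatures = [20, 21, 12]

# Positional extras travel through args, after the value itself.
print(apply_model(temperatures, subtract_custom_value, args=(5,)))
# [15, 16, 7]

# Keyword extras are forwarded unchanged to the function.
print(apply_model(temperatures, add_custom_values, june=30, july=20, august=25))
# [95, 96, 87]
```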

Use a function from the NumPy library.

>>> def numpy_log(col) -> np.float64:
...     return np.log(col)
>>> s.apply(numpy_log)
London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64

You can also omit the type hint and let pandas-on-Spark infer the return type, at the cost of executing the function once.

>>> s.apply(np.log)
London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64
