pyspark.streaming.DStream.reduceByWindow

DStream.reduceByWindow(reduceFunc: Callable[[T, T], T], invReduceFunc: Optional[Callable[[T, T], T]], windowDuration: int, slideDuration: int) → pyspark.streaming.dstream.DStream[T][source]

Return a new DStream in which each RDD has a single element generated by reducing all elements in a sliding window over this DStream.

If invReduceFunc is not None, the reduction is done incrementally using the old window’s reduced value:

  1. reduce the new values that entered the window (e.g., adding new counts)

  2. “inverse reduce” the old values that left the window (e.g., subtracting old counts)

This is more efficient than re-reducing the entire window on every slide, which is what happens when invReduceFunc is None.
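The two steps above can be sketched in plain Python (this is an illustrative simulation of the incremental update, not PySpark code; the function name and signature are hypothetical), using addition as the reduce function and subtraction as its inverse:

```python
from operator import add, sub

def incremental_window_reduce(old_total, departing_batch, arriving_batch,
                              reduce_func=add, inv_reduce_func=sub):
    """Update a windowed total when the window slides by one batch.

    Hypothetical helper mimicking the incremental path of reduceByWindow.
    """
    # Step 1: reduce in the values that entered the window.
    total = reduce_func(old_total, arriving_batch)
    # Step 2: "inverse reduce" out the values that left the window.
    return inv_reduce_func(total, departing_batch)

# Window currently covers batch counts [3, 5, 2] (total 10);
# it slides, dropping the count 3 and adding a new count 7.
print(incremental_window_reduce(3 + 5 + 2, 3, 7))  # -> 14
```

Only one reduce and one inverse-reduce are applied per slide, instead of re-reducing every batch still inside the window.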

Parameters
reduceFunc : function

associative and commutative reduce function

invReduceFunc : function

inverse reduce function of reduceFunc; such that for all y, and invertible x: invReduceFunc(reduceFunc(x, y), x) = y

windowDuration : int

width of the window; must be a multiple of this DStream’s batching interval

slideDuration : int

sliding interval of the window (i.e., the interval after which the new DStream will generate RDDs); must be a multiple of this DStream’s batching interval
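The inverse-function requirement on invReduceFunc can be checked directly with addition and subtraction as the reduce/inverse pair (a plain-Python sanity check, not PySpark code):

```python
from operator import add, sub

reduce_func = add        # associative and commutative
inv_reduce_func = sub    # inverse of addition

# Required property: for all y, and invertible x:
#   inv_reduce_func(reduce_func(x, y), x) == y
for x in (1, 5, -3):
    for y in (0, 2, 10):
        assert inv_reduce_func(reduce_func(x, y), x) == y
```

A pair like (add, sub) over counts, or (operator.mul, truediv) over nonzero values, satisfies this property; something like max has no inverse and cannot be used with the incremental path.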