Streaming Operators / Streaming Dataset API

The Dataset API has a set of operators that are of particular use in Apache Spark's Structured Streaming and that together constitute the so-called Streaming Dataset API.

Table 1. Streaming Operators

dropDuplicates

Drops duplicate records, optionally considering only a subset of columns (see the first example after the table)

dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]

explain

Prints the execution plans to the console (see the worked example after the template below)

explain(): Unit
explain(extended: Boolean): Unit

groupBy

Aggregates rows by an untyped grouping function (see the second example after the table)

groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset

groupByKey

Aggregates rows by a typed grouping function (see the second example after the table)

groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]

withWatermark

Defines a streaming watermark on an event-time column (see the first example after the table)

withWatermark(eventTime: String, delayThreshold: String): Dataset[T]
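
The following is a minimal sketch of the deduplication operators above, combining withWatermark and dropDuplicates on the built-in rate source (which produces timestamp and value columns); the 10-second delay threshold is an arbitrary example value.

// input: the built-in rate source (columns: timestamp, value)
val rates = spark.
  readStream.
  format("rate").
  option("rowsPerSecond", 1).
  load

// tolerate events that are at most 10 seconds late (arbitrary example threshold)
// and drop duplicates; including the event-time column in the subset
// lets Spark eventually evict old deduplication state
val deduped = rates.
  withWatermark("timestamp", "10 seconds").
  dropDuplicates("value", "timestamp")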
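
And a sketch contrasting the untyped and typed grouping operators, reusing the rates stream from the sketch above; counting per value is just an illustrative aggregation.

import spark.implicits._ // encoders for the typed API

// untyped grouping: column-based, returns RelationalGroupedDataset
val untypedCounts = rates.
  groupBy("value").
  count

// typed grouping: function-based, returns KeyValueGroupedDataset
// (the implicit Encoder for the Long key comes from spark.implicits._)
val typedCounts = rates.
  groupByKey(_.getAs[Long]("value")).
  count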
scala> spark.version
res0: String = 2.3.0-SNAPSHOT

// input stream
val rates = spark.
  readStream.
  format("rate").
  option("rowsPerSecond", 1).
  load

// stream processing
// replace [operator] with the operator of your choice
// and capture the result so it can feed the output stream
val processed = rates.[operator]

// output stream
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val sq = processed.
  writeStream.
  format("console").
  option("truncate", false).
  trigger(Trigger.ProcessingTime(10.seconds)).
  outputMode(OutputMode.Complete). // Complete requires a streaming aggregation, e.g. groupBy
  queryName("rate-console").
  start

// eventually...
sq.stop
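
For a concrete instance of the template, the following sketch replaces [operator] with a windowed groupBy aggregation (an arbitrary example, and the kind of query for which OutputMode.Complete is actually allowed) and uses explain to print the plans; it reuses the rates stream and the imports from the template above.

import spark.implicits._
import org.apache.spark.sql.functions.window

// concrete processing step: count rows per 10-second event-time window
val counts = rates.
  groupBy(window($"timestamp", "10 seconds")).
  count

// print the logical and physical plans of the streaming query
counts.explain(extended = true)

val countsQuery = counts.
  writeStream.
  format("console").
  option("truncate", false).
  trigger(Trigger.ProcessingTime(10.seconds)).
  outputMode(OutputMode.Complete). // allowed because the query aggregates
  queryName("rate-counts-console").
  start

// eventually...
countsQuery.stop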