The Dataset API has a set of operators that are of particular use in Apache Spark's Structured Streaming and that together constitute the so-called Streaming Dataset API.
Operator | Description
---|---
`dropDuplicates(): Dataset[T]`<br>`dropDuplicates(colNames: Seq[String]): Dataset[T]`<br>`dropDuplicates(col1: String, cols: String*): Dataset[T]` | Drops duplicate records (given a subset of columns)
`explain(): Unit`<br>`explain(extended: Boolean): Unit` | Explains execution plans
`groupBy(cols: Column*): RelationalGroupedDataset`<br>`groupBy(col1: String, cols: String*): RelationalGroupedDataset` | Aggregates rows by an untyped grouping function
`groupByKey(func: T => K): KeyValueGroupedDataset[K, T]` | Aggregates rows by a typed grouping function
`withWatermark(eventTime: String, delayThreshold: String): Dataset[T]` | Defines a streaming watermark on a column
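To make the per-key semantics of `dropDuplicates(colNames)` concrete, here is a minimal plain-Scala sketch (no Spark required) of keeping only the first record seen for each key; the `Record` type and `dropDuplicatesBy` helper are illustrative, not part of Spark's API:

```scala
// Hypothetical record type, used only to illustrate dedup-by-subset semantics.
case class Record(user: String, action: String)

// Keep the first record seen per key, dropping later records with the same key,
// analogous to Dataset.dropDuplicates on a subset of columns.
def dropDuplicatesBy[A, K](records: Seq[A])(key: A => K): Seq[A] =
  records.foldLeft((Vector.empty[A], Set.empty[K])) {
    case ((kept, seen), r) =>
      val k = key(r)
      if (seen(k)) (kept, seen)        // duplicate key: drop the record
      else (kept :+ r, seen + k)       // first occurrence: keep it
  }._1

val input = Seq(Record("a", "click"), Record("a", "scroll"), Record("b", "click"))
val deduped = dropDuplicatesBy(input)(_.user)
// deduped == Vector(Record("a", "click"), Record("b", "click"))
```

Note that in a streaming query, `dropDuplicates` keeps this "seen keys" state across micro-batches, which is why it is typically combined with `withWatermark` to bound the state.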
```scala
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
```

```scala
// input stream
val rates = spark.
  readStream.
  format("rate").
  option("rowsPerSecond", 1).
  load

// stream processing
// replace [operator] with the operator of your choice
rates.[operator]

// output stream
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val sq = rates.
  writeStream.
  format("console").
  option("truncate", false).
  trigger(Trigger.ProcessingTime(10.seconds)).
  outputMode(OutputMode.Complete). // Complete mode requires a streaming aggregation (e.g. groupBy with count)
  queryName("rate-console").
  start

// eventually...
sq.stop
```
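Of the operators above, `withWatermark` has the least obvious semantics. Conceptually, the engine tracks the maximum event time seen so far and treats anything older than that maximum minus the delay threshold as too late. The following plain-Scala sketch (the `Event` type and `withinWatermark` helper are illustrative, not Spark's implementation) shows that cut-off rule:

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

// Hypothetical event type, used only to illustrate watermark semantics.
case class Event(eventTime: Instant, value: String)

// Keep only events no older than (max event time seen - delay),
// mirroring how a watermark bounds how late data may arrive.
def withinWatermark(events: Seq[Event], delaySeconds: Long): Seq[Event] = {
  val maxEventTime =
    events.map(_.eventTime).max(Ordering.by[Instant, Long](_.toEpochMilli))
  val watermark = maxEventTime.minus(delaySeconds, ChronoUnit.SECONDS)
  events.filter(e => !e.eventTime.isBefore(watermark))
}

val base = Instant.parse("2018-01-01T00:00:00Z")
val events = Seq(
  Event(base, "late"),                                // 00:00:00
  Event(base.plus(30, ChronoUnit.SECONDS), "ok"),     // 00:00:30
  Event(base.plus(40, ChronoUnit.SECONDS), "newest")) // 00:00:40
// With a 10-second delay the watermark is 00:00:30, so "late" is dropped.
val onTime = withinWatermark(events, 10)
```

In an actual streaming query, the watermark also lets Spark discard aggregation and deduplication state for keys that can no longer receive on-time data.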