Skip to content

SP24-CS511-Final-Project/column-sketch

Repository files navigation

column-sketch

ci

This is the main repository of the final project of CS511 offered at Spring 2024 semester.

Project Goals

The goal of the project is to reproduce column sketch on top of apache parquet format, and to explore how and if it improves performance on the workload of scanning a parquet table with a predicate to evaluate and produce a position list of rows that satisfy the predicate.

Column Sketch

The essential idea of column sketch is that given an array of data (either numerical or string, but we will prototype numerical data first), we will output a compression map, which is an order-preserving map from the original data to 8-bit integer space.

Assume the original dataset is denoted by $D$, the compression map is a function denoted by $S$, then by order-preserving, we mean the following:

$\forall a, b \in D, a >= b \rightarrow S(a) \geq S(b)$

And the useful corollary is that $S(a) < S(b) \rightarrow a < b$, which will be used for evaluating range queries on the compressed column.

Concretely, we want to implement something like follows:

pub trait CompressionMap<T> {
  fn compress(value: T) -> u8
}

fn column_sketch<T>(data: Vec<T>) -> impl CompressionMap<T> {}

Transform Parquet Tables

We will also implement a tool on top of the core algorithm, which automatically implement the compression for a parquet table, which involves transforming the table and serializing the compression map.

  • We store compressed columns directly as additional UINT8 columns in the parquet.
  • As a first step, we will serialize the compression map into a separate file so that we need not change parquet internals. It is also possible to store the serialized compression map directly as a distinct parquet page.

About

Column sketch compression on apache parquet

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published