This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

User-Defined Functions #1510

Open
wants to merge 155 commits into
base: master

turingcompl33t
Contributor

@turingcompl33t turingcompl33t commented Mar 9, 2021

This PR adds support for user-defined functions.

Background

For his master's thesis work, Tanuj (@tanujnay112) implemented support for user-defined functions in NoisePage on a branch in his fork of the project, the most recent version of which is here. However, because his research focused primarily on evaluating different UDF performance enhancements (see the Froid paper for an introduction to UDF inlining), he also implemented some degree of support for other big-name features, namely common table expressions and lateral joins. These two features are now largely handled by a separate PR, so to avoid overlap, reduce the blast radius of the PRs, and (hopefully) integrate these features in a more timely manner, we are splitting the functionality implemented by Tanuj (and others) into distinct PRs.

Therefore, this PR is concerned with cherry-picking the UDF-relevant components from the existing fork, and preparing them to be integrated into master in a clean, controlled manner.

Starting Point

The basic statistics for the original PR for user-defined functions and common table expressions are as follows:

  • Commits: 482
  • Files Changed: 304

Excluding non-source files (e.g. Java, Python), we have:

  • 164 Source (.cpp) Files
  • 114 Header (.h) Files
  • 10 TPL (.tpl) Test Files

Clearly, we need a way to assess the scale of the PR as it relates to user-defined functions alone, and a way to track current progress, given the large number of system components that will be affected.

Current Status

The enumeration below lists all of the files in the original PR that pertain to UDF support. While the primary goal is simply to integrate these into the current master branch, I reserve the right to perform any refactoring I see fit while doing so.

Binder (5/5)

  • src/binder/bind_node_visitor.cpp
  • src/include/binder/bind_node_visitor.h
  • src/binder/binder_context.cpp
  • src/include/binder/binder_context.h
  • src/include/binder/binder_sherpa.h

Catalog (1/1)

  • src/catalog/database_catalog.cpp

Execution: AST (13/13)

  • src/execution/ast/ast.cpp
  • src/include/execution/ast/ast.h
  • src/execution/ast/ast_clone.cpp
  • src/include/execution/ast/ast_clone.h
  • src/execution/ast/ast_dump.cpp
  • src/execution/ast/ast_pretty_print.cpp
  • src/execution/ast/context.cpp
  • src/execution/ast/type.cpp
  • src/include/execution/ast/type.h
  • src/execution/ast/type_printer.cpp
  • src/include/execution/ast/ast_node_factory.h
  • src/include/execution/ast/builtins.h
  • src/include/execution/compiler/ast_fwd.h

Execution: Compiler (18/18)

  • src/execution/compiler/codegen.cpp
  • src/include/execution/compiler/codegen.h
  • src/execution/compiler/compilation_context.cpp
  • src/include/execution/compiler/compilation_context.h
  • src/execution/compiler/executable_query.cpp
  • src/include/execution/compiler/executable_query.h
  • src/execution/compiler/executable_query_builder.cpp
  • src/execution/compiler/expression/expression_translator.cpp
  • src/include/execution/compiler/expression/expression_translator.h
  • src/execution/compiler/expression/function_translator.cpp
  • src/include/execution/compiler/expression/function_translator.h
  • src/execution/compiler/function_builder.cpp
  • src/include/execution/compiler/function_builder.h
  • src/execution/compiler/operator/output_translator.cpp
  • src/execution/compiler/operator/operator_translator.cpp
  • src/include/execution/compiler/operator/operator_translator.h
  • src/execution/compiler/pipeline.cpp
  • src/include/execution/compiler/pipeline.h

Execution: Exec (4/4)

  • src/execution/exec/execution_context.cpp
  • src/execution/exec/output.cpp
  • src/include/execution/exec/execution_context.h
  • src/include/execution/exec/output.h

Execution: Functions (1/1)

  • src/include/execution/functions/function_context.h

Execution: Parsing (4/4)

  • src/execution/parsing/parser.cpp
  • src/include/execution/parsing/parser.h
  • src/execution/parsing/scanner.cpp
  • src/include/execution/parsing/token.h

Execution: SEMA (9/9)

  • src/execution/sema/scope.cpp
  • src/include/execution/sema/scope.h
  • src/execution/sema/sema_builtin.cpp
  • src/execution/sema/sema_checking.cpp
  • src/execution/sema/sema_decl.cpp
  • src/execution/sema/sema_expr.cpp
  • src/execution/sema/sema_stmt.cpp
  • src/execution/sema/sema_type.cpp
  • src/include/execution/sema/error_message.h

Execution: SQL (2/2)

  • src/execution/sql/ddl_executors.cpp
  • src/include/execution/sql/ddl_executors.h

Execution: VM (13/13)

  • src/execution/vm/bytecode_emitter.cpp
  • src/include/execution/vm/bytecode_emitter.h
  • src/execution/vm/bytecode_function_info.cpp
  • src/include/execution/vm/bytecode_function_info.h
  • src/execution/vm/bytecode_generator.cpp
  • src/include/execution/vm/bytecode_generator.h
  • src/execution/vm/bytecode_handlers.cpp
  • src/include/execution/vm/bytecode_handlers.h
  • src/execution/vm/bytecode_module.cpp
  • src/execution/vm/llvm_engine.cpp
  • src/execution/vm/module.cpp
  • src/execution/vm/vm.cpp
  • src/include/execution/vm/bytecodes.h

Network (1/1)

  • src/include/network/network_defs.h

Parser (6/6)

  • src/parser/postgresparser.cpp
  • src/include/parser/postgresparser.h
  • src/include/parser/create_function_statement.h
  • src/include/parser/expression/column_value_expression.h
  • src/parser/expression/constant_value_expression.cpp
  • src/include/parser/expression/constant_value_expression.h

Parser: UDF (10/10)

  • src/include/parser/udf/ast_node_visitor.h
  • src/parser/udf/ast_nodes.cpp
  • src/include/parser/udf/ast_nodes.h
  • src/include/parser/udf/udf_ast_context.h
  • src/parser/udf/udf_codegen.cpp
  • src/include/parser/udf/udf_codegen.h
  • src/parser/udf/udf_handler.cpp
  • src/include/parser/udf/udf_handler.h
  • src/parser/udf/udf_parser.cpp
  • src/include/parser/udf/udf_parser.h

Traffic Cop (2/2)

  • src/traffic_cop/traffic_cop.cpp
  • src/traffic_cop/traffic_cop_util.cpp

TPL Test Files (4/4)

  • sample_tpl/agg.tpl
  • sample_tpl/call.tpl
  • sample_tpl/param.tpl
  • sample_tpl/struct.tpl

Questions / Comments

  • In database_catalog.cpp, the original PR essentially hand-rolled the functionality that was already present in PgProcImpl. Why? Is there some shortcoming of the implementation within PgProcImpl that I have not encountered yet? Regardless, the update should be made in PgProcImpl rather than in DatabaseCatalog as it was in the original PR.
  • Instead of overwriting existing TPL tests with their lambda analogs, I brought in the new lambda TPL tests under new names; they are identified by the -lambda suffix.
  • While pulling in the UDF-specific files that were originally included in the src/parser/udf/ and src/include/parser/udf/ directories, I moved them to their respective parts of the system that I thought made sense. This involved creating new udf/ subdirectories in parser/, execution/compiler/ and execution/ast.
  • While pulling in the UDF-specific files that were originally included in the src/parser/udf/ and src/include/parser/udf/ directories, I omitted three files that were either empty, unreachable (not included anywhere) or entirely commented out: src/parser/udf/ast_nodes.cpp, src/parser/udf/udf_handler.cpp, and src/include/parser/udf/udf_handler.h.
  • I mistakenly identified binder_context.cpp and binder_context.h as files that were required as part of the integration. This is not the case; the changes to these files are only concerned with CTE implementation.
  • We define the ast_fwd.h header in both include/execution/ast/ and include/execution/compiler; the files are identical, except when you update one and neglect to update the other... I spent about half an hour trying to debug an issue related to this. We should probably remove the duplicate forward declaration files.
  • I mistakenly identified binder_test.cpp as having modifications related to UDFs. This is not the case; I have since removed this file from the TODO list.

@turingcompl33t added the feature and in-progress labels on Mar 9, 2021
@turingcompl33t
Contributor Author

I believe we are now at a point where we can begin the review process for user-defined functions. I have enumerated the files in the diff and grouped them into logical "sections" below (largely corresponding to the component of the system in which they reside). Assuming this PR will have two reviewers, I split the sections into two major categories: "frontend" and "backend". These identifiers refer to the layers of the system involved: the "frontend" files implement functionality above the execution engine layer, while the "backend" files comprise the execution engine and below. I believe the work is split nearly evenly between the two: there are more files in the "frontend" section, but the changes are more superficial and will not require the same level of scrutiny as those in the "backend".

Auxiliary + "Frontend"

Documentation

  • docs/design_closures.md
  • docs/design_codegen.md
  • docs/design_ctes.md
  • docs/design_udfs.md
  • script/testing/junit/README.md

Documentation primarily consists of notes that I made for myself when I first started working on this PR. They are things that helped me understand how our implementation works (e.g. how do we use lambdas to make queries within UDFs work?). Anyone can take a glance at these if they feel so inclined. If we are of the mind that these are not generally useful to have around, I can remove them from the repository and just keep them for myself.

Tests

  • sample_tpl/closure0.tpl
  • sample_tpl/closure1.tpl
  • sample_tpl/closure2.tpl
  • sample_tpl/closure3.tpl
  • sample_tpl/closure4.tpl
  • script/testing/junit/sql/udf.sql
  • script/testing/junit/traces/udf.test
  • script/testing/junit/src/GenerateTrace.java
  • script/testing/util/db_server.py
  • test/catalog/catalog_test.cpp
  • test/execution/ast_test.cpp
  • test/execution/atomics_test.cpp
  • test/execution/compiler_test.cpp
  • test/execution/execution_context_builder_test.cpp
  • test/execution/system_functions_test.cpp
  • test/include/execution/sql_test.h
  • test/optimizer/index_nested_loops_join_test.cpp
  • test/parser/plpgsql_parser_test.cpp
  • test/test_util/tpcc/workload_cached.cpp
  • test/test_util/tpch/workload.cpp

The important tests to look at will be the TPL tests for TPL closures. I covered some basic test cases but didn't actually get too crazy with the TPL unit tests. The C++ unit tests I left entirely untouched from Tanuj's original pull request. The primary tests for UDF functionality are the JUnit integration tests. The functions are defined in script/testing/junit/sql/udf.sql. I wrote a couple of integration tests for each of the constructs that I am particularly interested in for running the functions from SQL ProcBench. As mentioned above, there are plenty of features of UDFs that we don't currently support, but with all of this infrastructure in place, adding new PL/pgSQL language features should not be difficult. For instance, I didn't implement integer-style for-loops (i.e. FOR i IN 1..10 LOOP ...), even though we could simply desugar them to a while-loop, because we don't actually need them to run the ProcBench functions.
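To make the desugaring mentioned above concrete, here is a minimal sketch of rewriting an integer-style for-loop into a while-loop. It is not code from this PR: IntegerForLoop and DesugarToWhile are hypothetical names, and the sketch operates on source strings rather than on the UDF AST. It also assumes the loop variable is already declared, and it re-evaluates the upper bound on every iteration, whereas PL/pgSQL evaluates both bounds once on loop entry; a real implementation would likely need to capture the bound in a temporary first.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an integer-style FOR loop:
//   FOR var IN lower..upper LOOP body END LOOP;
struct IntegerForLoop {
  std::string var;    // loop variable name
  std::string lower;  // lower bound expression
  std::string upper;  // upper bound expression
  std::string body;   // loop body, already rendered as a statement
};

// Rewrites the FOR loop as an initialization followed by a WHILE loop
// that increments the loop variable at the end of each iteration.
// Simplification: `upper` is re-evaluated on every iteration.
std::string DesugarToWhile(const IntegerForLoop &loop) {
  return loop.var + " := " + loop.lower + ";\n" +
         "WHILE " + loop.var + " <= " + loop.upper + " LOOP\n" +
         "  " + loop.body + "\n" +
         "  " + loop.var + " := " + loop.var + " + 1;\n" +
         "END LOOP;";
}
```

With var = "i", lower = "1", upper = "10", this yields an assignment `i := 1;` followed by `WHILE i <= 10 LOOP ... i := i + 1; END LOOP;`.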

Network

  • src/include/network/network_defs.h
  • src/network/postgres/postgres_packet_writer.cpp

Network changes are trivial: they just add support for new SQL statements (i.e. DROP FUNCTION).

Traffic Cop

  • src/traffic_cop/traffic_cop.cpp
  • src/include/traffic_cop/traffic_cop.h
  • src/traffic_cop/traffic_cop_util.cpp

Nothing major is changed in the traffic cop. I added support for CREATE FUNCTION and DROP FUNCTION in the appropriate places, and updated this file as part of the refactor to ExecutionContextBuilder.

Parser

  • src/include/parser/nodes.h
  • src/include/parser/create_function_statement.h
  • src/include/parser/drop_statement.h
  • src/include/parser/expression/column_value_expression.h
  • src/parser/expression/constant_value_expression.cpp
  • src/include/parser/expression/constant_value_expression.h
  • src/include/parser/expression/subquery_expression.h
  • src/include/parser/parse_result.h
  • src/parser/postgresparser.cpp
  • src/include/parser/postgresparser.h
  • src/include/parser/select_statement.h
  • src/parser/udf/plpgsql_parser.cpp
  • src/include/parser/udf/plpgsql_parser.h
  • src/parser/udf/string_utils.cpp
  • src/include/parser/udf/string_utils.h
  • src/include/parser/udf/variable_ref.h
  • src/include/parser/udf/plpgsql_parse_result.h

Most of the changes in the parser are trivial and not worth much time. I updated the Postgres parser and, similarly, the DropStatement statement type to add support for DROP FUNCTION.

Beyond that, the PLpgSQLParser class implements the translation from the raw parse tree (obtained from libpg_query) to the AST that serves as input to UDF code generation. The most important functionality in this section is therefore located in plpgsql_parser.cpp and plpgsql_parser.h.

Binder

  • src/binder/bind_node_visitor.cpp
  • src/include/binder/bind_node_visitor.h
  • src/include/binder/binder_sherpa.h

We require changes to the binder because now, when we are binding a query, we may be doing so in the context of a user-defined function (i.e. a SQL query embedded in a UDF, either directly or in the form of a query-fed for-loop). Therefore, we may encounter names during binding that refer to PL/pgSQL variables, and we need to be able to recognize and resolve these.
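As a rough illustration of the resolution problem described above (not the actual binder code), the sketch below resolves a name against the columns visible to the query first and then falls back to the variables of the enclosing function. BindingScope, NameKind, and ResolveName are hypothetical names, and the column-first ordering is an assumption made for the example, not a statement about how this PR resolves conflicts. PostgreSQL itself reports an ambiguity error by default when a name could refer to either.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_set>

// What a name inside an embedded SQL query turned out to refer to.
enum class NameKind { kColumn, kUdfVariable };

// Hypothetical scope: names visible while binding a query inside a UDF.
struct BindingScope {
  std::unordered_set<std::string> columns;        // columns from the query's tables
  std::unordered_set<std::string> udf_variables;  // PL/pgSQL variables of the enclosing function
};

// Tries columns first, then UDF variables (ordering is an assumption of
// this sketch). Returns std::nullopt if the name is unknown.
std::optional<NameKind> ResolveName(const BindingScope &scope, const std::string &name) {
  if (scope.columns.count(name) > 0) return NameKind::kColumn;
  if (scope.udf_variables.count(name) > 0) return NameKind::kUdfVariable;
  return std::nullopt;
}
```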

Planner

  • src/planner/plannodes/plan_node_defs.cpp
  • src/include/planner/plannodes/plan_node_defs.h
  • src/include/planner/plannodes/plan_visitor.h
  • src/planner/plannodes/abstract_plan_node.cpp
  • src/include/planner/plannodes/abstract_plan_node.h
  • src/include/planner/plannodes/create_function_plan_node.h
  • src/planner/plannodes/drop_function_plan_node.cpp
  • src/include/planner/plannodes/drop_function_plan_node.h

Changes to the planner are made to support CREATE FUNCTION and DROP FUNCTION. These plan nodes should be relatively uninteresting as they look the same as many of the other plan node types.

Optimizer

  • src/optimizer/child_property_deriver.cpp
  • src/include/optimizer/child_property_deriver.h
  • src/optimizer/logical_operators.cpp
  • src/include/optimizer/logical_operators.h
  • src/include/optimizer/operator_visitor.h
  • src/optimizer/physical_operators.cpp
  • src/include/optimizer/physical_operators.h
  • src/optimizer/plan_generator.cpp
  • src/include/optimizer/plan_generator.h
  • src/optimizer/rule.cpp
  • src/include/optimizer/rule.h
  • src/optimizer/rules/implementation_rules.cpp
  • src/include/optimizer/rules/implementation_rules.h
  • src/optimizer/query_to_operator_transformer.cpp

It looks like the changes to the optimizer are non-trivial because so many files are touched, but all I did here is update the necessary files to add the DROP FUNCTION functionality. There is nothing really interesting going on here.

Catalog

  • src/catalog/catalog_accessor.cpp
  • src/include/catalog/catalog_accessor.h
  • src/catalog/database_catalog.cpp
  • src/include/catalog/database_catalog.h
  • src/catalog/postgres/pg_proc_impl.cpp
  • src/include/catalog/postgres/pg_proc_impl.h
  • src/include/catalog/postgres/pg_proc.h
  • src/include/catalog/postgres/pg_language.h

Changes to the catalog are relatively minor. I just cleaned up the API related to creation, manipulation (querying), and dropping of procedures.

"Backend"

Execution Engine (SQL)

  • src/execution/sql/ddl_executors.cpp
  • src/include/execution/sql/ddl_executors.h

The DDL executors "tie together" all of the functionality of UDFs. DDLExecutors::CreateFunctionExecutor is a great place to look to get a general overview of how the entire process is implemented.

Execution Engine (Parser)

  • src/include/execution/parsing/token.h
  • src/execution/parsing/parser.cpp
  • src/include/execution/parsing/parser.h
  • src/execution/parsing/scanner.cpp

Updates to the execution engine parser are made to add support for TPL closures, which manifest as lambda expressions. Updates in the parser are minor, and should be unsurprising for anyone familiar with parsers.

Execution Engine (AST + Semantic Analysis)

  • src/include/execution/ast/ast_fwd.h
  • src/include/execution/ast/ast_node_factory.h
  • src/include/execution/ast/ast_traversal_visitor.h
  • src/include/execution/ast/builtins.h
  • src/execution/ast/ast.cpp
  • src/include/execution/ast/ast.h
  • src/execution/ast/ast_clone.cpp
  • src/include/execution/ast/ast_clone.h
  • src/execution/ast/ast_dump.cpp
  • src/execution/ast/ast_pretty_print.cpp
  • src/execution/ast/context.cpp
  • src/execution/ast/type.cpp
  • src/include/execution/ast/type.h
  • src/execution/ast/type_printer.cpp
  • src/include/execution/sema/error_message.h
  • src/execution/sema/scope.cpp
  • src/include/execution/sema/scope.h
  • src/execution/sema/sema_builtin.cpp
  • src/execution/sema/sema_checking.cpp
  • src/execution/sema/sema_decl.cpp
  • src/execution/sema/sema_expr.cpp
  • src/execution/sema/sema_stmt.cpp
  • src/execution/sema/sema_type.cpp
  • src/include/execution/ast/udf/udf_ast_context.h
  • src/include/execution/ast/udf/udf_ast_node_visitor.h
  • src/include/execution/ast/udf/udf_ast_nodes.h

Most updates to the execution engine AST and semantic analysis components are made to add support for TPL closures. The largest part of the diff in this section, however, comes from new files ast_clone.cpp and ast_clone.h. These are part of the new functionality we need to clone the AST of an existing function into the definition of another to implement function calls.
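For reviewers who want a mental model of what AST cloning involves, here is a minimal deep-copy sketch. It is not the code in ast_clone.cpp: Node and Clone are hypothetical, and the sketch owns children through std::unique_ptr, which is an assumption made to keep the example self-contained rather than a description of how the engine allocates AST nodes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, heavily simplified AST node.
struct Node {
  std::string label;
  std::vector<std::unique_ptr<Node>> children;
};

// Recursively copies `node` and all of its descendants so that the copy
// shares no nodes with the original.
std::unique_ptr<Node> Clone(const Node &node) {
  auto copy = std::make_unique<Node>();
  copy->label = node.label;
  copy->children.reserve(node.children.size());
  for (const auto &child : node.children) {
    copy->children.push_back(Clone(*child));
  }
  return copy;
}
```

Because the copy is deep, mutating the cloned tree (for example, while splicing it into a caller's definition) leaves the original function's AST untouched.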

Execution Engine (Compiler)

  • src/include/execution/compiler/ast_fwd.h
  • src/execution/compiler/codegen.cpp
  • src/include/execution/compiler/codegen.h
  • src/execution/compiler/compilation_context.cpp
  • src/include/execution/compiler/compilation_context.h
  • src/execution/compiler/executable_query.cpp
  • src/include/execution/compiler/executable_query.h
  • src/execution/compiler/executable_query_builder.cpp
  • src/execution/compiler/expression/expression_translator.cpp
  • src/include/execution/compiler/expression/expression_translator.h
  • src/execution/compiler/expression/function_translator.cpp
  • src/include/execution/compiler/function_builder.h
  • src/include/execution/compiler/expression/function_translator.h
  • src/execution/compiler/function_builder.cpp
  • src/execution/compiler/operator/hash_join_translator.cpp
  • src/execution/compiler/operator/operator_translator.cpp
  • src/include/execution/compiler/operator/operator_translator.h
  • src/execution/compiler/operator/output_translator.cpp
  • src/include/execution/compiler/operator/output_translator.h
  • src/execution/compiler/operator/static_aggregation_translator.cpp
  • src/execution/compiler/pipeline.cpp
  • src/include/execution/compiler/pipeline.h
  • src/execution/exec/execution_context.cpp
  • src/include/execution/exec/execution_context.h
  • src/execution/exec/execution_context_builder.cpp
  • src/include/execution/exec/execution_context_builder.h
  • src/include/execution/exec/output.h
  • src/include/execution/functions/function_context.h
  • src/execution/compiler/udf/udf_codegen.cpp
  • src/include/execution/compiler/udf/udf_codegen.h

Naturally, updates to the compiler constitute the largest part of this pull request. I will call out two specific places to look in this section. First, and most importantly, all of the code generation we do for UDFs is implemented in udf_codegen.cpp and udf_codegen.h. This is where all of the nitty-gritty details of how we implement various imperative PL/pgSQL constructs in our engine come together. The code for some operations can be a bit byzantine, but I believe I have added comments at most points where the intent is less than obvious.

Second, I had to update some fundamental aspects of code generation at the intersection of operator translators and pipelines. To make a long story short, because we now (sometimes) execute SQL queries embedded in the context of a UDF, I had to make the function signature for some of the top-level pipeline functions more flexible. For instance, embedded queries must have access to the TPL closure that implements their output callback in the top-level output translator. For this reason, the pipeline Run function must accept an additional parameter - the output callback. I am not particularly thrilled about how much complexity this adds to the code generation infrastructure, and there might be a larger discussion here regarding how to accomplish this in a more principled way. However, for now, this implementation works and does not affect queries that do not make use of output callbacks.
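The shape of the signature change can be sketched as follows. This is plain C++ rather than generated TPL, and Row, OutputCallback, and RunPipeline are hypothetical stand-ins: the real Run function is produced by code generation and the real callback is a TPL closure. The point is only that a pipeline with no callback behaves as before, while an embedded query hands each output row to the closure supplied by the enclosing UDF.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Row = int;  // stand-in for an output tuple
using OutputCallback = std::function<void(const Row &)>;

// Hypothetical stand-in for a top-level pipeline Run function.
// With an empty callback, rows go to the regular output buffer; otherwise
// each row is passed to the callback and the buffer is left alone.
void RunPipeline(const std::vector<Row> &rows, std::vector<Row> *output_buffer,
                 const OutputCallback &callback) {
  for (const Row &row : rows) {
    if (callback) {
      callback(row);
    } else {
      output_buffer->push_back(row);
    }
  }
}
```

A regular query would call `RunPipeline(rows, &buffer, nullptr)`, while a query embedded in a UDF would pass a closure that, for instance, assigns the row to a PL/pgSQL variable.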

Execution Engine (VM)

  • src/include/execution/vm/bytecode_function_info.h
  • src/include/execution/vm/bytecode_handlers.h
  • src/include/execution/vm/bytecodes.h
  • src/execution/vm/bytecode_emitter.cpp
  • src/include/execution/vm/bytecode_emitter.h
  • src/execution/vm/bytecode_generator.cpp
  • src/include/execution/vm/bytecode_generator.h
  • src/execution/vm/bytecode_module.cpp
  • src/include/execution/vm/bytecode_module.h
  • src/execution/vm/control_flow_builders.cpp
  • src/include/execution/vm/control_flow_builders.h
  • src/execution/vm/llvm_engine.cpp
  • src/include/execution/vm/llvm_engine.h
  • src/execution/vm/module.cpp
  • src/include/execution/vm/module.h
  • src/execution/vm/vm.cpp

Despite the number of files that are touched in this section, changes to the VM are actually relatively minor. We add some new bytecode operations in order to "inject" parameter values from a PL/pgSQL function into an embedded SQL query. The bytecodes themselves are simple. The only updates made at the LLVM-level (i.e. llvm_engine.cpp and llvm_engine.h) are to add support for TPL closures.

Labels: feature, in-progress, ready-for-ci