[BUG] Misinterpretation of Parquet List schema with single GROUP child named "array" #13313

Open
mythrocks opened this issue May 9, 2023 · 0 comments
Labels: 0 - Backlog (In queue waiting for assignment); bug (Something isn't working); cuIO (cuIO issue); libcudf (Affects libcudf (C++/CUDA) code)

mythrocks commented May 9, 2023

This issue tracks a (possible) misinterpretation of Parquet list schemas that are stored in a legacy format. It is a follow-up to #13277.

This concerns rules #3 and #4 of the backward-compatibility rules for lists in the Parquet LogicalTypes spec, which state:

3. If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.
4. Otherwise, the repeated field's type is the element type with the repeated field's repetition.
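
The quoted rules can be read as a small decision procedure. The following is a minimal Python sketch of that reading, not libcudf's or Spark's actual implementation. The `Node` class and the `resolve_list_element` and `describe` helpers are hypothetical names introduced for illustration. Rules 1 and 2 are taken from the same section of the spec, and the rule 4 branch assumes the usual three-level encoding, where the repeated group's single child is the element.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified model of a Parquet schema node."""
    name: str
    repetition: str                       # "required", "optional", or "repeated"
    physical_type: Optional[str] = None   # None means the node is a group
    children: List["Node"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return self.physical_type is None


def resolve_list_element(list_node: Node) -> Node:
    """Return the node that acts as the element type of a LIST-annotated group."""
    (repeated,) = list_node.children      # a LIST group has exactly one child
    assert repeated.repetition == "repeated"
    if not repeated.is_group:
        return repeated                   # rule 1: repeated primitive is the element
    if len(repeated.children) > 1:
        return repeated                   # rule 2: multi-field group is the element
    if repeated.name in ("array", list_node.name + "_tuple"):
        return repeated                   # rule 3: legacy single-field group is the element
    return repeated.children[0]           # rule 4 (assumed reading): the child is the element


def describe(node: Node) -> str:
    """Render a node as a type string, treating groups as structs."""
    if node.is_group:
        return "Struct<" + ", ".join(describe(c) for c in node.children) + ">"
    return node.physical_type


# A LIST group whose repeated child is a single-field group named "array".
legacy = Node("my_list", "required", children=[
    Node("array", "repeated", children=[Node("item", "required", "int32")]),
])
# The standard three-level encoding, for comparison.
standard = Node("my_list", "optional", children=[
    Node("list", "repeated", children=[Node("element", "optional", "int32")]),
])

print("List<" + describe(resolve_list_element(legacy)) + ">")    # List<Struct<int32>>
print("List<" + describe(resolve_list_element(standard)) + ">")  # List<int32>
```

Under this reading, a repeated group named `array` with one field is kept as a struct element rather than being unwrapped.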

Consider the following schema, from the Parquet file attached herewith:

 <pyarrow._parquet.ParquetSchema object at 0x7fe1cc5849c0>
required group field_id=-1 spark_schema {
  required group field_id=-1 my_list (List) {
    repeated group field_id=-1 array {
      required int32 field_id=-1 item;
    }
  }
}

libcudf seems to interpret this as List<Int32>:

$ gtests/PARQUET_TEST --gtest_filter=ParquetReaderTest.Myth
...
cudf::list_view<int32_t>:
Length : 1
Offsets : 0, 2
   0, 1

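By my reading of the spec, rule #3 applies here: the repeated field is a group with one field and is named array, so the repeated group itself is the element type. That makes the column a List<Struct<Int32>> rather than a List<Int32>. Apache Spark seems to concur: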

scala> spark.read.parquet("pq_array.parquet").printSchema
root
 |-- my_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item: integer (nullable = true)
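
To make the difference concrete, here is a small Python sketch of the two row shapes. It assumes the same offsets and leaf values that libcudf printed above (offsets 0, 2 and values 0, 1). The `rows` helper is a hypothetical name, and the struct-shaped result is what the List<Struct<Int32>> reading would imply, not captured Spark output.

```python
offsets = [0, 2]   # list offsets as printed by the libcudf test
items = [0, 1]     # leaf int32 values as printed by the libcudf test


def rows(offsets, children):
    """Slice a flat child sequence into one list per row using the offsets."""
    return [children[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# List<Int32>: each element is a bare integer.
as_list_of_ints = rows(offsets, items)

# List<Struct<Int32>>: each element is a struct with a single field "item".
as_list_of_structs = rows(offsets, [{"item": v} for v in items])

print(as_list_of_ints)      # [[0, 1]]
print(as_list_of_structs)   # [[{'item': 0}, {'item': 1}]]
```
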

mythrocks added the bug and Needs Triage labels on May 9, 2023.
GregoryKimball added the 0 - Backlog, libcudf, and cuIO labels and removed the Needs Triage label on Jun 7, 2023.