Create table from plain Parquet files #445

Closed
Fokko opened this issue Feb 19, 2024 · 5 comments

Contributor

Fokko commented Feb 19, 2024

Feature Request / Improvement

Today we can write to tables. During the write process we make sure that the schema is correct, and we collect column statistics during the write. It would be cool if we could also import existing Parquet files into a table, by extracting the schema, and fetching the statistics from the footer.

Fokko added this to the PyIceberg 0.7.0 release milestone Feb 22, 2024
Collaborator

sungwy commented Feb 24, 2024

@Fokko would this process be similar to running the following steps?

  1. Calling create_table with a PyArrow schema
  2. Invoking the add_files migration procedure

Potentially related Issue: #354

Contributor

HonahX commented Feb 26, 2024

@syun64 I think the logic of importing Parquet files can be shared with the add_files migration procedure. But for this issue we may want to do the creation and the file import in a single transaction. That way, if anything goes wrong, we won't leave users with an empty new table.

I think we need something similar to CreateTableTransaction. WDYT?
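To make the single-transaction idea concrete, here is a toy model in plain Python. `ToyCatalog` and `StagedCreate` are invented for this illustration and are not PyIceberg classes. The point is only the control flow: the table is staged in memory and published to the catalog in one step, and only if the import succeeded.

```python
class ToyCatalog:
    """In-memory stand-in for a catalog (hypothetical, not the PyIceberg API)."""

    def __init__(self):
        self.tables = {}


class StagedCreate:
    """Stage a new table and publish it only if the whole block succeeds."""

    def __init__(self, catalog, name, schema):
        self.catalog = catalog
        self.name = name
        self.staged = {"schema": schema, "files": []}

    def add_file(self, path, file_schema):
        # Reject files whose schema does not match the staged table.
        if file_schema != self.staged["schema"]:
            raise ValueError(f"schema mismatch in {path}")
        self.staged["files"].append(path)

    def __enter__(self):
        if self.name in self.catalog.tables:
            raise ValueError(f"table {self.name} already exists")
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Single commit: the table and its files become visible together.
            self.catalog.tables[self.name] = self.staged
        return False  # let errors propagate; nothing was published


catalog = ToyCatalog()
schema = ("id", "name")

# Successful import: the table appears with its file already attached.
with StagedCreate(catalog, "db.good", schema) as txn:
    txn.add_file("a.parquet", ("id", "name"))

# Failed import: the schema mismatch aborts the block before any commit.
try:
    with StagedCreate(catalog, "db.bad", schema) as txn:
        txn.add_file("b.parquet", ("id",))
except ValueError:
    pass

print(sorted(catalog.tables))
```

The failed import leaves no `db.bad` entry behind, which is the behavior we want for users.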

Collaborator

sungwy commented Feb 27, 2024

Thank you for the explanation, @HonahX. Yes, that's a really great insight. I'm definitely in support of a CreateTableTransaction, because that's what we will need to support CREATE TABLE ... AS SELECT semantics as well, where we stage the creation of the table until we have written the data files and created the new snapshot that represents the table.

Contributor

HonahX commented Feb 29, 2024

> what we will need to support CREATE TABLE ... AS SELECT semantics as well...

Totally agree! I've created an issue to track this feature: #483

sungwy self-assigned this Mar 7, 2024
Contributor Author

Fokko commented Mar 26, 2024

This has been fixed in #506.

Fokko closed this as completed Mar 26, 2024