[CT-783] Spark with Iceberg tables: catalog.json is empty #376

Closed
zsvoboda opened this issue Jun 28, 2022 · 6 comments
Labels: enhancement (New feature or request), help_wanted (Extra attention is needed), Stale

Comments

zsvoboda commented Jun 28, 2022

Describe the bug

I'm using Spark 3.2.1 with Iceberg 0.13.2 (launched with spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2).

Steps To Reproduce

  1. Create a model (a minimal sketch is shown after these steps).
  2. Write a model.yml file.
  3. Run dbt docs generate.
  4. Check catalog.json - it is empty.
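
For illustration, step 1 might be a model like the minimal sketch below. The model name and columns are hypothetical, and the sketch assumes that file_format='iceberg' is passed through to a USING iceberg clause by dbt-spark; Iceberg is not an officially supported file format (see the maintainer comment further down).

  -- models/my_iceberg_model.sql (hypothetical name)
  {{ config(materialized='table', file_format='iceberg') }}

  -- Any Iceberg-backed table should reproduce the problem, since the
  -- catalog query (show table extended) fails for v2 tables.
  select 1 as id, 'example' as name

After building this model and adding a matching entry in model.yml, dbt docs generate still leaves catalog.json empty.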

Expected behavior

catalog.json is populated with the table schema and docs.

System information

The output of dbt --version:

Core:
  - installed: 1.1.1
  - latest:    1.1.1 - Up to date!

Plugins:
  - mysql5:  1.0.0 - Not compatible!
  - mariadb: 1.0.0 - Not compatible!
  - trino:   1.1.1 - Up to date!
  - mysql:   1.0.0 - Not compatible!
  - spark:   1.0.0 - Not compatible!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using:

macOS Monterey

The output of python --version:

Python 3.8.12

Additional context

This seems to be a similar problem to the one with Delta tables (#295).

show table extended in warehouse like '*';

SQL Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: SHOW TABLE EXTENDED is not supported for v2 tables.;
ShowTableExtended *, [namespace#21, tableName#22, isTemporary#23, information#24]
+- ResolvedNamespace org.apache.iceberg.spark.SparkCatalog@7929bdd7

at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:230)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:230)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:225)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:239)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)

Caused by: org.apache.spark.sql.AnalysisException: SHOW TABLE EXTENDED is not supported for v2 tables.;
ShowTableExtended *, [namespace#21, tableName#22, isTemporary#23, information#24]
+- ResolvedNamespace org.apache.iceberg.spark.SparkCatalog@7929bdd7

zsvoboda added the bug (Something isn't working) and triage labels on Jun 28, 2022
github-actions bot changed the title from "Spark with Iceberg tables: catalog.json is empty" to "[CT-783] Spark with Iceberg tables: catalog.json is empty" on Jun 28, 2022
@lostmygithubaccount

Hi @zsvoboda, thank you for opening this issue! Apache Iceberg is not yet an officially supported file format: https://docs.getdbt.com/reference/resource-configs/spark-configs#configuring-tables

Would you be interested in contributing this? If so, we can likely take some time to give guidance on how this could be implemented. I'll mark this as help wanted as it's likely not something we can prioritize in the near future.

lostmygithubaccount added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on Aug 11, 2022
@jtcohen6
Contributor

This seems to be a similar problem to the one with Delta tables (#295).

I'm not sure what the longer-term solution is here. If using OSS Delta, Iceberg, or other file formats, do we need to revert to the much older way of doing this (describe table extended once per model/seed/source/snapshot)? Or can we hope for the eventual addition of information_schema (currently Unity Catalog / Databricks only) to OSS Apache Spark?
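
For reference, that older approach issues one statement per relation instead of a single catalog-wide query; a rough sketch against the warehouse schema from the log output above (the table name is a placeholder):

  -- Fails for Iceberg (v2) tables, as shown in the log output above:
  show table extended in warehouse like '*';

  -- Should work per relation, including v2 tables, but has to be run
  -- once per model/seed/source/snapshot:
  describe table extended warehouse.my_iceberg_model;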

@kbendick

I think there will eventually be a migration to some sort of information_schema, but we'd need a generic API to support it (like merge into has) so that data sources could implement it.

That will probably be a while, so having a v1 vs v2 format flag for the provider in the table's general configuration would be a good idea. That's the distinction Spark draws between the two kinds of statements (and why the schema looks the way it does) and the SQL queries each needs.

But information_schema is not yet part of the Spark catalog API at all, so I wouldn't recommend relying on it if more formats are to be supported.

My 2 cents. Happy to help where I can once I'm back from my break, if there's interest!

@brandys11

Is there some workaround for this at the moment?

Fleid mentioned this issue on Dec 6, 2022

github-actions bot commented May 3, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

Fokko (Contributor) commented May 11, 2023

This issue has been fixed by #294.
