This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

Namespace name is set when undefined #255

Open
jung-kim opened this issue Nov 15, 2017 · 8 comments

Comments

@jung-kim

This is related to linkedin/goavro#96

Since 4.0.0, we are seeing a namespace set on records within nested structures even though we never set one explicitly. Is this behavior defined in the spec?

If we set Map("recordName" -> "usageData", "recordNamespace" -> "abc"), then the namespace becomes "abc.usageData".

root
 |-- usageData: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- app: string (nullable = true)
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    {
      "name": "usageData",
      "type": [
        {
          "type": "array",
          "items": [
            {
              "type": "record",
              "name": "usageData",
              "namespace": ".usageData",  <<---- ??????
              "fields": [
                {
                  "name": "app",
                  "type": [
                    "string",
                    "null"
                  ]
                }
              ]
            },
            "null"
          ]
        },
        "null"
      ]
    }
  ]
}
@gengliangwang
Contributor

gengliangwang commented Nov 19, 2017

Hi, the behavior you mentioned is from this commit:
25cab2a

From the Avro spec:

A namespace is a dot-separated sequence of such names. The empty string may also be used as a namespace to indicate the null namespace. 

So a namespace with a leading dot should be OK. Can you specify where the problem is?

@jung-kim
Author

jung-kim commented Jan 8, 2018

The problem is that when no namespace is set, the namespace value should be empty or null.

When the namespace is not specified, we expect it to be "" or null, but right now it is ".usageData" in the case above.
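
The expectation above can be sketched as a small guard applied when the namespace for a nested record is derived. This is only an illustration in plain Scala, not the spark-avro implementation: `NestedNamespace`, `naiveChild`, and `guardedChild` are hypothetical names, and the unconditional concatenation in `naiveChild` is an assumption inferred from the ".usageData" output reported in this thread.

```scala
object NestedNamespace {
  // Assumed behaviour, inferred from the report: the parent namespace and the
  // record name are always joined with a dot, so an unset parent namespace
  // produces a leading dot.
  def naiveChild(parentNamespace: String, recordName: String): String =
    s"$parentNamespace.$recordName"

  // Guarded variant matching the expectation in this comment: when no parent
  // namespace was set, the nested record gets the null namespace ("").
  def guardedChild(parentNamespace: String, recordName: String): String =
    if (parentNamespace == null || parentNamespace.isEmpty) ""
    else s"$parentNamespace.$recordName"
}
```

With a parent namespace of "abc" both variants give "abc.usageData"; with no parent namespace, `naiveChild` gives ".usageData" while `guardedChild` gives "".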

@matthew-fishkin

I am having the same issue as above, and it is coming back to bite us because we are trying to load the data into a BigQuery table.

We don't set the namespace, and one is automatically generated that begins with a dot. We then get the following error:

The Apache Avro library failed to parse the header with the follwing error: Invalid namespace: .topic_scores
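
This rejection is consistent with a strict reading of the Avro naming rules: the spec requires each name to start with [A-Za-z_] and continue with [A-Za-z0-9_], and defines a namespace as a dot-separated sequence of such names, so a leading dot leaves an empty first component. A minimal sketch of that check in plain Scala follows; `NamespaceCheck` and `isValidNamespace` are hypothetical names used for illustration, not part of Avro, spark-avro, or BigQuery, and real parsers may be more or less lenient.

```scala
object NamespaceCheck {
  // Name rule from the Avro spec: starts with [A-Za-z_], then [A-Za-z0-9_]*.
  private val NamePattern = "[A-Za-z_][A-Za-z0-9_]*"

  // Hypothetical helper: the empty string denotes the null namespace;
  // otherwise every dot-separated component must be a valid name.
  // The -1 limit keeps empty components so that a leading or trailing dot
  // is detected instead of being silently dropped by split.
  def isValidNamespace(ns: String): Boolean =
    ns.isEmpty || ns.split("\\.", -1).forall(_.matches(NamePattern))
}
```

Under this check "" and "a.b.c" pass, while ".topic_scores" and ".usageData" fail because their first component is empty.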

@tovbinm

tovbinm commented Jul 3, 2018

After this change was merged, loading a dataset, then saving it and loading it again with the same schema fails, since every nested record is prefixed with an invalid namespace (example below).
I think this change has to be reverted since it fixes one thing, but breaks a ton of others.

import org.apache.avro.Schema

val schema = new Schema.Parser().parse("""{
  "type": "record",
  "name": "TestRecord",
  "namespace": "a.b.c",
  "fields": [
    {
      "name": "key",
      "type": [
        {
          "type": "record",
          "name": "key",
          "fields": [
            {
              "name": "email",
              "type": [ "string", "null"]
            }
          ]
        },
        "null"
      ]
    }
  ]
}""")

val df = sql.read.format("com.databricks.spark.avro")
   .option("avroSchema", schema.toString)
   .load("/tmp/random.avro")

df.show(false) // so far so good

df.write.format("com.databricks.spark.avro")
  .option("recordName", "TestRecord")
  .option("recordNamespace", "a.b.c")
  .option("avroSchema", schema.toString)
  .save("/tmp/random.out")

val loaded = sql.read.format("com.databricks.spark.avro")
  .option("recordName", "TestRecord")
  .option("recordNamespace", "a.b.c")
  .option("avroSchema", schema.toString)
  .load("/tmp/random.out")

loaded.show(false) // Failure! AvroTypeException is thrown

Caused by: org.apache.avro.AvroTypeException: Found a.b.c.key.key, expecting union
  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
  at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
  at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
  at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
  at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
  at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1$$anon$1.next(DefaultSource.scala:228)
  at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1$$anon$1.next(DefaultSource.scala:205)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)

@tovbinm

tovbinm commented Jul 3, 2018

Even worse: each time you save, load, and save your dataset again, it prepends a field name to the namespace.

@tovbinm

tovbinm commented Jul 3, 2018

I made a patched release by undoing the #249 PR: https://jitpack.io/#relateiq/spark-avro

@gengliangwang
Contributor

Fixed here: apache/spark#21974
The Spark 2.4 release will include a built-in Avro package.

@tovbinm

tovbinm commented Aug 3, 2018

@gengliangwang try adding the test case I suggested above ^
