This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

Close #169 and add support for DataSet of Avro records #217

Open
wants to merge 27 commits into master

Conversation

themodernlife

@bdrillard, you did an amazing job getting this working, and I hope you'll have a look at the slight refactoring I've done to get it over the finish line:

  • Works with multiple versions of Spark (2.0.x and 2.1.x so far)
  • Travis test matrix should be updated appropriately
  • Works with multiple versions of Avro

The main challenge was getting past Spark version incompatibilities. I've done that through the addition of SBT modules, liberal use of reflection, and some straight-up copy/paste. I've tried to put proper fences around the hacks.
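To give a rough idea of the layout (a simplified sketch only; module names and version numbers here are illustrative, not the exact build definition):

  // build.sbt sketch: a shared core plus thin version-specific modules, each
  // compiled against its own Spark release as a provided dependency.
  lazy val core = (project in file("core"))
    .settings(
      libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided"
    )

  lazy val spark20Shim = (project in file("spark-2.0"))
    .dependsOn(core)
    .settings(
      libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.2" % "provided"
    )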

Let me know what you guys think! Again, all credit to @bdrillard since this is 90% his work.

Cheers!

JoshRosen and others added 18 commits November 27, 2016 17:51
To avoid confusion, we should remove the documentation / linking instructions for the 2.x line of releases since the current README describes features which don't apply there. Instead, we should link to the older docs.

Author: Josh Rosen <joshrosen@databricks.com>

Closes databricks#199 from JoshRosen/readme-fixes.

(cherry picked from commit b01a034)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
There is still an issue with ExternalMapToCatalyst to be resolved
This patch builds on databricks#206 in order to restore support for Spark 2.0.x. This means that a single binary artifact can be used with both Spark 2.0.x and 2.1.x, simplifying the builds of downstream projects which are compatible with both Spark versions.

Author: Josh Rosen <joshrosen@databricks.com>

Closes databricks#212 from JoshRosen/add-spark-2.1.
Author: Josh Rosen <joshrosen@databricks.com>

Closes databricks#213 from JoshRosen/prepare-for-3.2-release.
@codecov-io

codecov-io commented Feb 10, 2017

Codecov Report

Merging #217 into master will increase coverage by 2.45%.
The diff coverage is 95.56%.

@@            Coverage Diff             @@
##           master     #217      +/-   ##
==========================================
+ Coverage   90.71%   93.16%   +2.45%     
==========================================
  Files           5        7       +2     
  Lines         334      688     +354     
  Branches       50       73      +23     
==========================================
+ Hits          303      641     +338     
- Misses         31       47      +16

@bdrillard

@themodernlife thanks for putting this pull request together; I certainly didn't have the SBT expertise necessary to handle multiple builds of Spark, so this additional work is just what the encoder needed. I had a look at the inlining of additional object types that aren't common to both 2.0.0 and 2.1.x, as well as the serializing of schemas (when used in parsing complex unions); those changes look fine to me. I noticed just a few stray commented-out lines from what may have been refactoring experiments, and a few style points in doc comments. Otherwise this looks great; thanks for the additional legwork on this issue.

@gcpagano

gcpagano commented Feb 14, 2017

I'm trying to add the following test based on this PR, but I get an org.apache.spark.SparkException: Task not serializable, caused by java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema.
Here's the full stack trace: https://gist.github.com/gcpagano/e95a351dc75c4a6ecd2a50fdd7665e90

I'm not sure if this use case wasn't meant to be supported in this PR or maybe I'm doing something wrong in the test code.

  test("create Dataset from GenericRecord") {
    val sparkSession = spark
    import sparkSession.implicits._

    val schema: Schema =
      SchemaBuilder
        .record("GenericRecordTest")
        .namespace("com.databricks.spark.avro")
        .fields()
        .requiredString("field1")
        .endRecord()

    implicit val enc = AvroEncoder.of[GenericData.Record](schema)

    val genericRecords = (1 to 10) map { i =>
      new GenericRecordBuilder(schema)
        .set("field1", "field-" + i)
        .build()
    }

    val rdd: RDD[GenericData.Record] = sparkSession.sparkContext
      .parallelize(genericRecords)

    val ds = rdd.toDS()

    assert(ds.count() == genericRecords.size)
  }

@themodernlife
Author

Hmm... I ran into this with SpecificRecord as well. The problem is that toDS runs the encoder inside mapPartitions, and mapPartitions checks that the encoder's entire expression tree is serializable. Some of the nodes in the expression tree hold a reference to the Avro Schema, which is not serializable, so you get errors like the one above.

I'm looking into a way to solve this. I have a fix for your unit test, but looking at the code I'm worried there could be other runtime errors scattered about.
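The fix follows the usual workaround of carrying the schema as its JSON string (which is serializable) and re-parsing it lazily on the executors; a minimal sketch of the idea (the class name here is illustrative, not necessarily what lands in the patch):

  import org.apache.avro.Schema

  // Wraps a non-serializable Avro Schema as its JSON string so it can travel
  // inside a serialized expression tree; the Schema is re-parsed lazily on
  // first use after deserialization.
  class SerializableSchema(val json: String) extends Serializable {
    @transient private lazy val parsed: Schema = new Schema.Parser().parse(json)
    def value: Schema = parsed
  }

  object SerializableSchema {
    def apply(schema: Schema): SerializableSchema =
      new SerializableSchema(schema.toString)
  }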

@themodernlife
Author

@gcpagano I pushed a change for your unit test and added coverage for the fixed and enum cases. Let me know if you have any feedback.

Hi,
based on issue databricks#67,
I created this pull request.

Author: Nihed MBAREK <nihedmm@gmail.com>
Author: vlyubin <vlyubin@gmail.com>
Author: nihed <nihedmm@gmail.com>

Closes databricks#124 from nihed/master.

(cherry picked from commit c19f01a)
Signed-off-by: vlyubin <vlyubin@gmail.com>
@gcpagano

LGTM

@tkinz27

tkinz27 commented Mar 6, 2017

I've been watching this PR for a while. Is there any reason it's being held back?

@bdrillard

@themodernlife Could you check and resolve this last merge conflict? It appears to be pretty minor.
@JoshRosen Would it be possible for us to merge this PR soon?

@gengliangwang
Contributor

@bdrillard Do you have time to create a clean PR based on the latest master? We will keep reviewing the changes. Thanks!

@cloud-fan

@themodernlife I'm afraid you need to send a new PR to set the base branch to master...

@themodernlife themodernlife changed the base branch from branch-3.2 to master October 30, 2017 19:22
@themodernlife
Author

@cloud-fan I was able to update the base branch to master. Looking into why Travis is complaining.

@gengliangwang
Contributor

Hi @themodernlife, your changes contain commits from @JoshRosen and @liancheng, which are not related to this PR.

@rxin
Contributor

rxin commented Nov 1, 2017

I took a look at the implementation and it depends on a lot of the internals of Spark, with a lot of expressions that even do code generation. Basically this would break with every new release of Spark and isn't really maintainable.

We should really design and implement a public API in Spark for expressing encoders, and then implement that here.

@rbrush

rbrush commented Nov 3, 2017

I discussed this with @bdrillard, who did the bulk of the Encoder implementation, and we would welcome a stable public API in Spark for custom encoders like this. We will transition this logic to that when it's available. Should we log a JIRA to the Spark project to track that?

In the meantime, we realize this taps into some Spark classes that are subject to change, and are willing to update our internal usage of it as they evolve. We realize this isn't ideal, but it provides quite a bit of value to our existing workloads for reasons described earlier in this chain. Hopefully this state of affairs is temporary and we'll be able to migrate to the public API when available.

@rxin
Contributor

rxin commented Nov 3, 2017

Certainly you can use this if you are willing to pay the cost to upgrade it with each version of Spark. But if that's the case, perhaps just do it in your fork for now?

@rbrush

rbrush commented Nov 4, 2017

Sure, we're happy to continue with our internal fork for the time being. But of course we're only one of several interested parties on this thread, so I'm sure a public API to support this would be appreciated in general.

@rxin
Contributor

rxin commented Nov 4, 2017

Yup, designing a public API for encoders/decoders makes a lot of sense in Spark proper. We should do it!

@rxin
Contributor

rxin commented Nov 7, 2017

@rbrush @bdrillard Aside from that, is there a way to update the patch so it doesn't rely on custom code generation? We might need to add a couple of small expressions to Spark to do that.

If that works, perhaps we can create a release based on that first, and then use that as the basis to create the encoder API.

@bdrillard

bdrillard commented Nov 7, 2017

@rxin, to check my understanding of your question, it depends on how broadly you mean "to not rely on custom code generation".

The Serializer/Deserializer of the ExpressionEncoder API produces a custom Expression, and that Expression ultimately generates custom code for the Projection to UnsafeRow for a collection of typed objects conforming to a given Avro schema. I don't think the core task of this patch (getting typed Datasets from collections of arbitrary Avro objects) could be attained without an Encoder, its SerDe, and the "custom code" it would generate. As we discussed a bit above, Spark-Avro already has the ability to convert a collection of Avro objects to a Dataframe, but without the Encoder SerDe we lose static type information that makes authoring different functions in the Spark API so much easier and more reasonable.
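For illustration (reusing the schema and records from @gcpagano's test earlier in this thread), the kind of typed usage the encoder enables looks roughly like this:

  val sparkSession = spark
  import sparkSession.implicits._

  implicit val enc = AvroEncoder.of[GenericData.Record](schema)

  // The Dataset keeps its element type, so the lambda below sees a
  // GenericData.Record rather than an untyped Row.
  val ds = sparkSession.createDataset(genericRecords) // Dataset[GenericData.Record]
  ds.map(_.get("field1").toString).show()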

More specifically, though, the functions producing the SerDe rely on currently public functions in the existing Spark expressions API, e.g. Invoke, StaticInvoke, Literal, etc.

We do, however, have a few functions that are custom extensions of Expression, namely ObjectCast, InitializeAvroObject, and ExternalMapToCatalyst.

  • ExternalMapToCatalyst was in the Spark Expressions API at the time this patch was prepared, but the function was in private scope, so we copied it inline. At this time, I believe we could opt for the version in the Spark API, and so we could remove our copy.
  • ObjectCast creates Java code for performing an inline object cast, which is necessary for instances where (in Avro, for example) a getter returns only Object and, through traversal of the schema, we know what type we want to cast the object to. I think this could be safely and easily added to the Spark API if such functionality isn't already there (a rough sketch of such an expression follows this list).
  • InitializeAvroObject is exactly like InitializeJavaBean, but for creating instances of Avro objects. This function is necessary for the Deserializer of the Encoder, so it's not clear how we could take this function out/move it (I don't think it would make sense to add such an Avro-specific Expression to the Spark API).
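For illustration, a minimal ObjectCast-style expression could look like the sketch below (assuming Spark 2.1-era internal codegen APIs; this is not the exact code in this branch):

  import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
  import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
  import org.apache.spark.sql.types.DataType

  // Emits a plain Java cast around whatever value the child expression produced.
  case class ObjectCast(child: Expression, resultType: DataType) extends UnaryExpression {
    override def dataType: DataType = resultType
    override def nullable: Boolean = child.nullable

    // Only code-generated evaluation is sketched; interpreted eval falls back
    // to UnaryExpression's default, which raises an error.
    override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
      val javaType = ctx.javaType(resultType)
      val childGen = child.genCode(ctx)
      ev.copy(
        code = childGen.code +
          s"\nfinal $javaType ${ev.value} = ($javaType) ${childGen.value};",
        isNull = childGen.isNull)
    }
  }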

Let me know your thoughts.

@marmbrus
Contributor

marmbrus commented Nov 8, 2017

@bdrillard I think it is the custom extensions of Expression that are the scariest, as they build stringly typed code that relies on undocumented and unstable details of codegen. If we can eliminate these, I think it significantly reduces the concern. If we restrict ourselves to using the APIs of internal Expressions, then changes will at least give compiler errors.

Regarding the specific Expressions:

  • ExternalMapToCatalyst - catalyst is implicitly private (i.e., excluded from the Scaladoc), so there is no need to rely on visibility modifiers. I think we can just make this public and use it here.
  • ObjectCast - I don't think that we have this (but @cloud-fan might know better). If we were going to add this to Spark I might just make it an optional argument to Invoke.
  • InitializeAvroObject - This one does seem a little harder. Again maybe we could generalize Spark's InitializeX to take Seq[(String, Seq[Expression])] or something so that it could be used for both.

@cloud-fan

Spark doesn't have ObjectCast, but you can easily implement one.

@bdrillard

@marmbrus, @cloud-fan, I'd be happy to open a PR in Spark proper with those changes, so we can then refactor this PR to make use of those public functions.

@cloud-fan

Why not just implement it here? Otherwise you can only support Spark 2.3 or later.

@marmbrus
Contributor

@cloud-fan the fear is that having an external library depend on the specifics of code-gen is too brittle. I think it's okay if this requires a new version of Spark, if that reduces the tight coupling with the specifics of execution. Until that is released, we can update this PR by depending on a nightly snapshot, and then once 2.3 is out we can make a new spark-avro release.

@bdrillard, that sounds great. Feel free to message me on the Spark JIRA / PR and I can help shepherd them in. Thanks for working on this!

@kalvinnchau

kalvinnchau commented Nov 15, 2017

@themodernlife
I'm getting Cause: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: Couldn't find bytes on class java.nio.ByteBuffer with this schema:

{
  "type": "record",
  "name": "GenericRecordTest2",
  "namespace": "com.databricks.spark.avro",
  "fields": [
    {
      "name": "fixedUnionVal",
      "type": [
        {
          "type": "fixed",
          "name": "IPv4",
          "size": 4
        },
        "int"
      ]
    }
  ]
}

Here's the test I added:

  test("create Dataset from Fixed Generic Record") {
    val sparkSession = spark
    import sparkSession.implicits._

    val schema: Schema =
      SchemaBuilder
        .record("GenericRecordTest2")
        .namespace("com.databricks.spark.avro")
        .fields()
        .name("fixedUnionVal").`type`()
          .unionOf()
            .fixed("IPv4").size(4).and()
            .intType()
            .endUnion()
          .noDefault()
        .endRecord()


    implicit val enc = AvroEncoder.of[GenericData.Record](schema)

    val genericRecords = (1 to 10) map { i =>
      new GenericRecordBuilder(schema)
        .set("fixedUnionVal", i)
        .build()
    }

    val rdd: RDD[GenericData.Record] = sparkSession.sparkContext
      .parallelize(genericRecords)

    val ds = rdd.toDS()

    assert(ds.count() == genericRecords.size)
  }

Gist to the stacktrace

@bdrillard

@marmbrus I've logged SPARK-22739 to track the additions to the Expressions API.

@kalvinnchau Thanks for the additional test case! We'll be circling back to this once the Spark ticket I linked just above is resolved.

@bdrillard

An update for this issue: I've created a PR to Spark that includes the necessary expressions for an Avro encoder; see #21348.

cc: @marmbrus

@mtraynham

mtraynham commented Jul 16, 2018

@bdrillard @themodernlife I've attempted to use this but ran into an exception converting between java.lang.String and org.apache.avro.util.Utf8. I was hoping one of you could suggest a code change and potentially fix the underlying problem here.

The stack trace is:

Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.util.Utf8
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply1_11$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr3$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply3_1$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.ObjectOperator$$anonfun$serializeObjectToRow$1.apply(objects.scala:153)
	at org.apache.spark.sql.execution.ObjectOperator$$anonfun$serializeObjectToRow$1.apply(objects.scala:151)
	at org.apache.spark.sql.execution.AppendColumnsWithObjectExec$$anonfun$10$$anonfun$apply$3.apply(objects.scala:302)
	at org.apache.spark.sql.execution.AppendColumnsWithObjectExec$$anonfun$10$$anonfun$apply$3.apply(objects.scala:299)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

So this seems to take place during serialization of the Avro POJO to Row. I've tracked it down to a particular field of the Avro schema, a property called attributes, which is a Map<String, String>.

That schema field is defined like so:

{
    "name": "attributes",
    "type": [
        "null",
        {
            "type": "map",
            "avro.java.string": "String",
            "values": {
                "type": "string",
                "avro.java.string": "String"
            }
        }
    ],
    "doc": "A mapping of arbitrary event attributes, as key-value pairs.",
    "default": null
}

The "avro.java.string": "String" property tells Avro that the field should be serialized and deserialized with java.lang.String instead of org.apache.avro.util.Utf8 (which implements java.lang.CharSequence).

Interestingly enough, this error does not occur for fields that are not "type": "map". I have a different field that seems to work correctly:

{
    "name": "segment",
    "type": [
        "null",
        {
            "type": "string",
            "avro.java.string": "String"
        }
    ],
    "default": null
}

@mtraynham

Well, I think I resolved the issue; I just cast to CharSequence instead of Utf8:

@@ -604,7 +604,7 @@ private object AvroTypeInference {
 
       ExternalMapToCatalyst(
         inputObject,
-        ObjectType(classOf[org.apache.avro.util.Utf8]),
+        ObjectType(classOf[java.lang.CharSequence]),
         serializerFor(_, Schema.create(STRING)),
         valueType,
         serializerFor(_, valueSchema),

There is also a closely related Hive issue from which I more or less borrowed the idea for the fix: https://issues.apache.org/jira/browse/HIVE-5865
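For reference, here's a rough sketch of the regression test I'd add for this case (record and field names are illustrative, and it assumes the AvroEncoder test setup used earlier in this thread):

  test("create Dataset from record with a map of strings") {
    val sparkSession = spark
    import sparkSession.implicits._

    val schema: Schema =
      SchemaBuilder
        .record("GenericRecordTest3")
        .namespace("com.databricks.spark.avro")
        .fields()
        .name("attributes").`type`().map().values().stringType().noDefault()
        .endRecord()

    implicit val enc = AvroEncoder.of[GenericData.Record](schema)

    val genericRecords = (1 to 10) map { i =>
      new GenericRecordBuilder(schema)
        .set("attributes", java.util.Collections.singletonMap("key", "value-" + i))
        .build()
    }

    val rdd: RDD[GenericData.Record] = sparkSession.sparkContext
      .parallelize(genericRecords)

    val ds = rdd.toDS()

    assert(ds.count() == genericRecords.size)
  }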
