[SPARK-32094][PYTHON] Update cloudpickle to v1.4.1 #28950

Closed · wants to merge 1 commit

Conversation

@codesue commented Jun 29, 2020

What changes were proposed in this pull request?

Update cloudpickle in PySpark to cloudpickle v1.4.1 (See https://github.com/cloudpipe/cloudpickle/blob/v1.4.1/cloudpickle/cloudpickle.py)

This is currently the latest released version of cloudpickle, and it does not support Python 2.

Port cloudpickle_fast.py to leverage C pickle in Python 3.8+ (See https://github.com/cloudpipe/cloudpickle/blob/v1.4.1/cloudpickle/cloudpickle_fast.py)
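
For context, upstream cloudpickle enables the fast path only on Python 3.8+, because it relies on the C pickle.Pickler's reducer_override() hook introduced in that release. A minimal sketch of that version gate (module paths follow the upstream package layout and are illustrative, not necessarily the exact PySpark wiring):

import sys

if sys.version_info[:2] >= (3, 8):
    # cloudpickle_fast subclasses the C-accelerated pickle.Pickler and uses
    # its reducer_override() hook, available only from Python 3.8.
    from cloudpickle.cloudpickle_fast import CloudPickler, dumps, dump
else:
    # Older interpreters fall back to the pure-Python Pickler subclass.
    from cloudpickle.cloudpickle import CloudPickler, dumps, dump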

Why are the changes needed?

PySpark's cloudpickle.py and versions of cloudpickle below 1.3.0 interfere with dill unpickling because they define types.ClassType, which is undefined in dill. This results in the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/apache_beam/internal/pickler.py", line 279, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 317, in loads
    return load(file, ignore)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 577, in _load_type
    return _reverse_typemap[name]
KeyError: 'ClassType'

(See cloudpipe/cloudpickle#82)

This was fixed for cloudpickle 1.3.0+ (cloudpipe/cloudpickle#337), but PySpark's cloudpickle.py doesn't have this change yet.
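
As background on the mechanism (an annotated sketch based on the discussion in cloudpipe/cloudpickle#82, not the literal PySpark code), older cloudpickle shims types.ClassType on Python 3, and that monkey-patch leaks into any library, such as dill, that builds its type tables from the shared types module:

import sys
import types

# Older cloudpickle applied a shim roughly like this on Python 3, where
# old-style classes (and types.ClassType) no longer exist:
if sys.version_info[0] >= 3:
    types.ClassType = type  # mutates the shared `types` module for everyone

# dill builds its name <-> type tables by scanning `types`, so a process that
# imported the shim may pickle classes under the name 'ClassType'. A process
# that never imported the shim has no 'ClassType' entry in dill's
# _reverse_typemap, and unpickling fails with KeyError: 'ClassType' as in the
# traceback above.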

Does this PR introduce any user-facing change?

No

How was this patch tested?

python/run-tests

@viirya (Member) commented Jun 29, 2020

@HyukjinKwon Do we want to drop Python 2 in Spark 3.1.0?

Btw, if we want to backport this to the 3.0 branch, we also cannot drop Python 2.

@codesue (Author) commented Jun 29, 2020

In that case, cloudpickle v1.3.0 might be a better option.

@dongjoon-hyun (Member) commented:

Thank you for your first contribution, @codesue.

Also, since this is related to Python 2.x, ping @holdenk, @mengxr, @gatorsmile, @srowen, @BryanCutler.

@holdenk (Contributor) commented Jun 29, 2020

Thanks for the ping @dongjoon-hyun, and thanks for working on this PR @codesue; I've been meaning to take a look at cloudpickle's updates. @viirya I think backporting cloudpickle changes is a bit risky given how core it is. I'd rather target the latest cloudpickle for Spark 3.1, if that sounds good to folks?

@dongjoon-hyun (Member) commented:

+1 for @holdenk's advice.

@HyukjinKwon (Member) commented:

Yeah, to upgrade we should drop Python 2. I'm targeting Spark 3.1 for that. I will make a PR to officially drop it first.

@HyukjinKwon (Member) left a review comment:

@codesue, BTW, is it a clean match with upstream? We should also port https://github.com/cloudpipe/cloudpickle/blob/master/cloudpickle/cloudpickle_fast.py to leverage C pickle in Python 3.8+.

@HyukjinKwon (Member) commented:

ok to test

@viirya (Member) commented Jun 30, 2020

ok to test

@SparkQA commented Jun 30, 2020

Test build #124677 has finished for PR 28950 at commit 164764a.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 30, 2020

Test build #124680 has finished for PR 28950 at commit 5f137d3.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -0,0 +1,557 @@
"""

Review comment (Member):

You could add cloudpickle_fast.py to dev/.rat-excludes to avoid the RAT test error.
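
For reference, dev/.rat-excludes lists file-name patterns, one per line, that the Apache RAT license check skips. The suggested fix would be a one-line addition next to the existing entry for the previously vendored file (an illustrative excerpt; the cloudpickle.py entry is assumed to already be present):

cloudpickle.py
cloudpickle_fast.py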

@SparkQA commented Jun 30, 2020

Test build #124683 has finished for PR 28950 at commit 23933eb.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Jun 30, 2020

Do the docs build locally for you, @codesue? It's possible we need to file a Spark JIRA ticket to ask Shane to update the version of our doc tooling.

@HyukjinKwon (Member) commented:

Ah, yeah, if this is the Python version issue in Jenkins, let's file a JIRA under https://issues.apache.org/jira/browse/SPARK-29909. It seems like there is some more work blocking this PR ...

@codesue (Author) commented Jul 2, 2020

@holdenk I'm able to build the Python docs when I run jekyll build. I don't have R installed, so I get "Cannot find 'R_HOME'. Please specify 'R_HOME' or make sure R is properly installed." When I run SKIP_RDOC=1 jekyll build, I get:

[error] (streaming-kafka-0-10-assembly/*:projectDescriptors) Could not create directory /Users/codesue/oss/spark/external/kafka-0-10-assembly/target/streams/$global/projectDescriptors/$global/streams
[error] (sketch/*:projectDescriptors) Could not create file /Users/codesue/oss/spark/common/sketch/target/streams/$global/projectDescriptors/$global/streams/outjava.io.IOException: No such file or directory
[error] Total time: 1 s, completed Jul 1, 2020 11:15:12 PM

@HyukjinKwon (Member) commented:

I double-checked that we can build the documentation with Python 3.7 too. The R build is fine; it's not relevant.

Then it seems to be a problem with the Python version installed on the Jenkins machine for the Sphinx documentation build. I will file a separate ticket soon for that. If you want to at least test, though, you can disable the documentation build for now with the changes d74aa53 and https://github.com/apache/spark/pull/28957/files#diff-871d87c62d4e9228a47145a8894b6694R161.

I am pushing to drop Python 2, 3.4 and 3.5 at #28957 so this PR can be unblocked. Feel free to give more feedback on [DISCUSS] Drop Python 2, 3.4 and 3.5.

@codesue (Author) commented Jul 2, 2020

@HyukjinKwon, I'll wait for #28957 to be merged.

HyukjinKwon added a commit that referenced this pull request Jul 14, 2020
### What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark.

### Why are the changes needed?

 1. Unsupport EOL Python versions
 2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2.
 3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation.
 4. Users can use Python type hints with Pandas UDFs without thinking about Python version
 5. Users can leverage one latest cloudpickle, #28950. With Python 3.8+ it can also leverage C pickle.

### Does this PR introduce _any_ user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

### How was this patch tested?

Manually tested and also tested in Jenkins.

Closes #28957 from HyukjinKwon/SPARK-32138.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@holdenk (Contributor) commented Jul 14, 2020

It looks like it's accidentally picking up some unrelated changes. When this happens to me, I normally just re-create the PR from a fresh branch (I know there are other ways to clean it up, but my git magic is not strong enough).

@codesue (Author) commented Jul 14, 2020

Thanks, @holdenk! I'll do that.

@SparkQA commented Jul 14, 2020

Test build #125844 has finished for PR 28950 at commit d731f63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:

Looks like something went wrong during the rebase.

@HyukjinKwon (Member) commented:

I just resolved it and pushed it myself.

@@ -42,27 +42,42 @@
"""

@HyukjinKwon (Member) commented on this diff hunk, Jul 15, 2020:

@codesue, we should make it a proper package so cloudpickle_fast can be used properly as well. It seems v1.5 was released too ... we'd better match that as well.

Let me create a PR myself to upgrade to 1.5 and add you as a co-author.
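
For context, "a proper package" here means vendoring cloudpickle as a subpackage instead of flat modules, roughly mirroring the upstream layout. An illustrative sketch (the actual layout was settled in the follow-up PR, #29114, mentioned below):

python/pyspark/cloudpickle/
    __init__.py          # selects CloudPickler from cloudpickle_fast on Python 3.8+
    cloudpickle.py       # pure-Python implementation
    cloudpickle_fast.py  # C-pickle-based implementation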

@codesue (Author) replied:

That sounds great! Thank you!!

@SparkQA commented Jul 15, 2020

Test build #125862 has finished for PR 28950 at commit b90ea32.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • skeleton_class = types.new_class(
  • class CloudPickler(Pickler):
  • function/class builtin-saving method (save_global), to serialize dynamic
  • is_anyclass = issubclass(t, type)
  • except TypeError: # t is not a class (old Boost; see SF #502085)

@HyukjinKwon (Member) commented:

Let's track it in #29114.

@codesue deleted the SPARK-32094-update-cloudpickle branch July 15, 2020 01:33