
ENH: Optional dependencies for accelerating JSON serialization #2944

Closed
jonmmease opened this issue Nov 29, 2020 · 2 comments

@jonmmease
Contributor

On top of #2943, I investigated a couple of interesting libraries we could potentially use as optional dependencies to further accelerate JSON serialization.

pybase64

I played with pybase64 a little, and it looks like an easy way to get a decent speedup over the built-in Python base64 module for the numpy base64 encoding step being introduced in #2943. This wouldn't require any refactoring, and it can cut the base64 encoding time (which is a substantial portion of the total JSON encoding time for figures that contain large numpy arrays) by something like 20% to 40%.
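
For reference, a minimal sketch of what the substitution could look like, relying on pybase64 mirroring the stdlib base64 API (the array here is just a stand-in for a figure's data):

    import base64

    import numpy as np

    try:
        import pybase64 as b64  # optional accelerated implementation
    except ImportError:
        b64 = base64  # fall back to the stdlib module

    arr = np.random.rand(1_000_000)  # stand-in for a large figure array
    encoded = b64.b64encode(arr.tobytes()).decode("ascii")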

orjson

orjson is a really impressive alternative JSON encoder; in playing with it a little bit, I've seen it be 2x to 5x faster than the built-in Python json encoder.

orjson doesn't support custom JSON encoder classes (like PlotlyJSONEncoder), so supporting this as an optional dependency would require a total refactor of the current json encoding process.

Basically, we would need to switch to an architecture where we would preprocess the figure dictionary recursively to perform any conversions we need, and then feed that dictionary through the JSON encoder.
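
A rough sketch of that architecture, where clean() is a hypothetical preprocessing helper (not existing plotly code) standing in for whatever conversions we'd need:

    import numpy as np
    import orjson

    def clean(obj):
        # Recursively convert values orjson can't encode natively
        if isinstance(obj, dict):
            return {k: clean(v) for k, v in obj.items()}
        if isinstance(obj, (list, tuple)):
            return [clean(v) for v in obj]
        if isinstance(obj, np.ndarray):
            return obj.tolist()  # or a base64-encoded form, as in #2943
        return obj

    fig_dict = {"data": [{"y": np.arange(3)}], "layout": {}}
    json_bytes = orjson.dumps(clean(fig_dict))  # orjson returns bytes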

Another nice thing about orjson is that it automatically converts nan and infinity values to JSON null values, so the JSON re-encoding stuff we were working through in #2880 wouldn't be needed (cc @emmanuelle).
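
For example, with orjson:

    import orjson

    orjson.dumps([float("nan"), float("inf")])  # b'[null,null]'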

@sdementen

sdementen commented Nov 30, 2020

Some complementary data points on the performance of orjson: https://python-rapidjson.readthedocs.io/en/latest/benchmarks.html#tables

I have also been digging into JSON serialization performance in plotly. On a large plot px.line(df), with df a pandas DataFrame (17520 rows x 8 columns of random floats, index a DatetimeIndex) that takes 1.8s to generate (the px.line(df) call only), I noticed that:

  • there is an extra 0.4s due to the deepcopy in line https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/plotly/basedatatypes.py#L3289, which I guess is not necessary, as the JSON serialization will not change the data
  • for serializing one column of the DataFrame (a pandas Series), json.dumps(sr, cls=plotly.utils.PlotlyJSONEncoder) takes 10x longer than serializing via sr.to_json(orient="values") (see the sketch after this list)
  • for serializing the index (a pandas DatetimeIndex), json.dumps(df.index, cls=plotly.utils.PlotlyJSONEncoder) takes 32x longer than serializing via df.index.to_series().to_json(orient="values", date_format="iso", date_unit="s"). The output format is not identical, as there is a trailing 'Z' in the to_json output.
  • when multiple traces share the same index, the index is serialized for each trace independently (i.e. the same serialization is done multiple times). For complex objects like numpy/pandas.*, it may be worth having some "cache" of the JSONified string, keyed on the id of the object, to reuse what has already been done.
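
A sketch of how the Series timing comparison above could be reproduced, assuming a DataFrame of the shape described earlier (the setup code is illustrative, not the original benchmark):

    import json
    import timeit

    import numpy as np
    import pandas as pd
    import plotly.utils

    # 17520 hourly rows x 8 float columns with a DatetimeIndex, as above
    idx = pd.date_range("2020-01-01", periods=17520, freq="h")
    df = pd.DataFrame(np.random.rand(17520, 8), index=idx)
    sr = df.iloc[:, 0]

    t_plotly = timeit.timeit(
        lambda: json.dumps(sr, cls=plotly.utils.PlotlyJSONEncoder), number=10
    )
    t_pandas = timeit.timeit(lambda: sr.to_json(orient="values"), number=10)
    print(f"PlotlyJSONEncoder: {t_plotly:.2f}s  Series.to_json: {t_pandas:.2f}s")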

On the figure generation side (so not related to JSON), for this same large DataFrame it is more than 13x faster (from 1.8s to 0.4s) to:

  1. generate the figure with only the first row of the dataframe, px.line(df.iloc[:1,:])
  2. loop over each trace and replace the "x" and "y" data with the full data
    So the following code (vs. the simpler px.line(df)):
            import plotly.express as px

            # Build the figure from a single row, then swap in the full data
            fig = px.line(df.iloc[:1])
            data = fig["data"]
            traces = {trace["name"]: trace for trace in data}
            x = df.index
            for col, y in df.items():
                trace = traces[str(col)]
                trace["x"] = x
                trace["y"] = y

and in this case, we can also handle NaN values more efficiently by removing them from each trace:

            fig = px.line(df.iloc[:1])
            data = fig["data"]
            traces = {trace["name"]: trace for trace in data}
            x = df.index
            for col, y in df.items():
                trace = traces[str(col)]
                notnan = ~y.isna()  # drop NaN points instead of serializing them
                trace["x"] = x[notnan]
                trace["y"] = y[notnan]

I hope this information can help improve plotly's performance.
If I'm misusing plotly in some way and there is already a better way to get good performance today, I would be glad to apply it!

I haven't tested with the change from #2880.

@jonmmease
Contributor Author

Thanks for sharing your observations here @sdementen.
