fix(csv): Do not coerce persisted data integer columns to float #20760

Merged: 6 commits, Jul 19, 2022

john-bodley (Member) commented Jul 19, 2022

SUMMARY

Regrettably, #20151 wasn't sufficient if the result set was stored prior to downloading the CSV file. More specifically, Pandas coerces an integer array containing None to float, likely because of the NumPy coercion, i.e.,

>>> pd.DataFrame.from_records([{"foo": 1}, {"foo": None}])
   foo
0  1.0
1  NaN

The fix is to explicitly define the dtype, using the standard DataFrame constructor, i.e.,

>>> pd.DataFrame(data=[{"foo": 1}, {"foo": None}], dtype=object)
    foo
0     1
1  None
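
The difference is easiest to see in the CSV output itself. The following is a minimal sketch, not the Superset code path: the two-column `records` list is a made-up stand-in for a persisted result set, and only standard pandas calls are used.

```python
import pandas as pd

# Hypothetical persisted rows; "foo" is an integer column with a missing value.
records = [{"id": "a", "foo": 1}, {"id": "b", "foo": None}]

# from_records lets pandas infer the dtype, so "foo" becomes float64.
coerced = pd.DataFrame.from_records(records)

# The standard constructor with dtype=object keeps the Python int and None.
preserved = pd.DataFrame(data=records, dtype=object)

print(coerced["foo"].dtype)    # float64
print(preserved["foo"].dtype)  # object

coerced_csv = coerced.to_csv(index=False)
preserved_csv = preserved.to_csv(index=False)

print(coerced_csv.splitlines())    # ['id,foo', 'a,1.0', 'b,']
print(preserved_csv.splitlines())  # ['id,foo', 'a,1', 'b,']
```

In both cases the missing value is written as an empty field; only the integer is rendered differently (`1.0` versus `1`).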

Long term we should probably replace quirky Pandas with PyArrow globally.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

CI.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley changed the title from "John bodley fix 20151" to "fix(csv): Do not coerce persisted data integer columns to float" Jul 19, 2022
codecov bot commented Jul 19, 2022

Codecov Report

Merging #20760 (4d2f439) into master (e60083b) will decrease coverage by 11.48%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           master   #20760       +/-   ##
===========================================
- Coverage   66.35%   54.87%   -11.49%     
===========================================
  Files        1754     1754               
  Lines       66689    66688        -1     
  Branches     7049     7049               
===========================================
- Hits        44253    36595     -7658     
- Misses      20639    28296     +7657     
  Partials     1797     1797               
Flag      Coverage Δ
hive      53.23% <0.00%> (+<0.01%) ⬆️
mysql     ?
postgres  ?
presto    53.09% <0.00%> (+<0.01%) ⬆️
python    58.00% <0.00%> (-23.69%) ⬇️
sqlite    ?
unit      50.57% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                 Coverage Δ
superset/views/core.py                         34.46% <0.00%> (-43.43%) ⬇️
superset/utils/dashboard_import_export.py      0.00% <0.00%> (-100.00%) ⬇️
superset/key_value/commands/update.py          0.00% <0.00%> (-88.89%) ⬇️
superset/key_value/commands/delete.py          0.00% <0.00%> (-85.30%) ⬇️
superset/key_value/commands/delete_expired.py  0.00% <0.00%> (-80.77%) ⬇️
superset/dashboards/commands/importers/v0.py   15.62% <0.00%> (-76.25%) ⬇️
superset/datasets/commands/update.py           25.30% <0.00%> (-68.68%) ⬇️
superset/datasets/commands/create.py           29.41% <0.00%> (-68.63%) ⬇️
superset/datasets/commands/importers/v0.py     24.03% <0.00%> (-67.45%) ⬇️
superset/reports/commands/execute.py           24.45% <0.00%> (-67.16%) ⬇️
... and 275 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e60083b...4d2f439.

@john-bodley john-bodley merged commit e1fd906 into master Jul 19, 2022
john-bodley added a commit to airbnb/superset-fork that referenced this pull request Jul 19, 2022
…he#20760)

* Replace pd.DataFrame.from_records with pd.DataFrame

* Remove unused code

* Update core.py

* Update core.py

* Update csv.py

* Update core.py

(cherry picked from commit e1fd906)
mbcsa commented Jul 29, 2022

Hi @john-bodley

This fix introduces a new problem when a user exports a CSV file from a cached query.
I've created a new issue, #20919.

The thing is, when the DataFrame is created dynamically from cached data, it does not respect column formats.
This is a problem when the decimal separator is configured through the "sep" attribute of CSV_EXPORT.

I'm testing this, and it works well when changing:

df = pd.DataFrame(
    data=obj["data"],
    dtype=object,
    columns=[c["name"] for c in obj["columns"]],
)

to

df = pd.DataFrame(
    data=obj["data"],
    columns=[c["name"] for c in obj["columns"]],
)

Thank you
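
The behavior reported in this comment can be reproduced with a small sketch. The `obj` payload below is hypothetical, shaped only to match the snippet in the comment, and the keyword arguments stand in for whatever CSV_EXPORT contains; in pandas `to_csv`, `sep` is the field delimiter and `decimal` is the decimal separator, and `decimal` is applied to float-typed columns.

```python
import pandas as pd

# Hypothetical cached payload, shaped like `obj` in the comment above.
obj = {
    "columns": [{"name": "id"}, {"name": "price"}],
    "data": [{"id": 1, "price": 1.5}, {"id": 2, "price": 2.25}],
}
names = [c["name"] for c in obj["columns"]]

# With dtype=object the floats stay plain Python objects.
as_object = pd.DataFrame(data=obj["data"], dtype=object, columns=names)

# Without dtype, pandas infers int64 for "id" and float64 for "price".
inferred = pd.DataFrame(data=obj["data"], columns=names)

# Assumed export settings: semicolon delimiter, comma as decimal separator.
csv_kwargs = {"index": False, "sep": ";", "decimal": ","}
object_csv = as_object.to_csv(**csv_kwargs)
inferred_csv = inferred.to_csv(**csv_kwargs)

print(object_csv.splitlines())    # ['id;price', '1;1.5', '2;2.25']
print(inferred_csv.splitlines())  # ['id;price', '1;1,5', '2;2,25']
```

Note the trade-off: dropping `dtype=object` restores the decimal formatting, but, per the PR description, an integer column containing None would then be coerced to float again.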

@mistercrunch mistercrunch added the 🏷️ bot and 🚢 2.1.0 labels Mar 13, 2024
@mistercrunch mistercrunch deleted the john-bodley--fix-20151 branch March 26, 2024 16:13
Labels: 🏷️ bot, size/XS, 🚢 2.1.0