Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(#13734): Properly escape special characters in CSV output #13735

Merged
merged 5 commits into from
Mar 26, 2021

Conversation

benjreinhart
Copy link
Contributor

This fixes #13734 by escaping CSV values when downloading data as a CSV.

SUMMARY

Escape leading special characters in CSV output.

TEST PLAN

Some unit tests and manual QA

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

@benjreinhart
Copy link
Contributor Author

As an aside, the CsvResponse object is used in three places. In two of those, the mime type is set to application/csv, but I believe the expected csv mime type is text/csv. Can I make that change in this PR or would that be considered a breaking API change?

@codecov
Copy link

codecov bot commented Mar 22, 2021

Codecov Report

Merging #13735 (62dd99f) into master (aa92c1e) will decrease coverage by 6.89%.
The diff coverage is 96.15%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #13735      +/-   ##
==========================================
- Coverage   77.43%   70.53%   -6.90%     
==========================================
  Files         933      850      -83     
  Lines       47093    42158    -4935     
  Branches     5818     4431    -1387     
==========================================
- Hits        36465    29737    -6728     
- Misses      10485    12421    +1936     
+ Partials      143        0     -143     
Flag Coverage Δ
cypress 56.04% <ø> (-0.46%) ⬇️
hive 80.29% <96.15%> (+0.03%) ⬆️
javascript ?
mysql ?
postgres ?
presto ?
python 80.29% <96.15%> (-0.82%) ⬇️
sqlite ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/viz.py 56.07% <50.00%> (ø)
superset/charts/api.py 81.30% <100.00%> (ø)
superset/common/query_context.py 81.62% <100.00%> (-0.45%) ⬇️
superset/utils/csv.py 100.00% <100.00%> (ø)
superset/views/base.py 75.83% <100.00%> (+0.10%) ⬆️
superset/views/core.py 72.10% <100.00%> (-3.53%) ⬇️
...uperset-frontend/src/dashboard/util/dnd-reorder.js 0.00% <0.00%> (-100.00%) ⬇️
...rset-frontend/src/dashboard/util/getEmptyLayout.js 0.00% <0.00%> (-100.00%) ⬇️
...et-frontend/src/components/Pagination/Ellipsis.tsx 0.00% <0.00%> (-100.00%) ⬇️
...-frontend/src/components/OmniContainer/Omnibar.tsx 0.00% <0.00%> (-100.00%) ⬇️
... and 539 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa92c1e...62dd99f. Read the comment docs.

@villebro
Copy link
Member

As an aside, the CsvResponse object is used in three places. In two of those, the mime type is set to application/csv, but I believe the expected csv mime type is text/csv. Can I make that change in this PR or would that be considered a breaking API change?

I agree with the change, please make this change here if you don't mind!

@@ -479,6 +479,7 @@ class CsvResponse(Response): # pylint: disable=too-many-ancestors
"""

charset = conf["CSV_EXPORT"].get("encoding", "utf-8")
default_mimetype = "text/csv"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

superset/charts/api.py Show resolved Hide resolved
superset/views/core.py Show resolved Hide resolved


def test_df_to_escaped_csv():
csv_str = "col_a,=col_b\n-10,=cmd|' /C calc'!A0\na,=b"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add another row to assert that both the column headers and the data are escaped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is currently testing both the header and body rows. Do you mean break the assert up so that it's easier to read? Happy to do that as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be easier to parse visually, but not a blocker.

@benjreinhart
Copy link
Contributor Author

@robdiciuccio thanks for taking a look! I think you may have commented on an outdated version as it looks like I made those changes already. Can you take a look when you have a chance and let me know if those concerns were addressed?

@robdiciuccio
Copy link
Member

@benjreinhart a few minor questions still pending (PR seems to be current), but overall looks good. I'll create a test env for this PR.

@robdiciuccio
Copy link
Member

/testenv up

@github-actions
Copy link
Contributor

@robdiciuccio Ephemeral environment spinning up at http://35.162.70.159:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

assert result == "'-value"

result = csv.escape_value("=value")
assert result == "'=value"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing some testing in the ephemeral env, I'm curious why just the single quote at the beginning rather than wrapping the value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The defusedcsv package that you reference above states:

Excel will not display the ' character to the user

but this is not the case in my tests, where Excel displays the following: '=10+20

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR: it was what I saw recommended the most when googling around.

Longer: I saw a few different recommendations but they all seem to be intrusive in one form or another. The approach I saw most often recommended was the single preceding quote. There is also a tab approach, but that looks invasive to some degree as well.

However, now that I just went back to look this up again, I found a post that shows how even this is not sufficient:

One product in my organization fixed a similar issue with 2 layers of defense. For any CSV cell value that starts with +, -, @, = (as suggested in http://georgemauer.net/2017/10/07/csv-injection.html or OWASP) the fix added (1) a preceding TAB char, (2) single quotes around the cell value. But later the we found that adding a single / pair of double quote(s) before the attacker's payload can simply evade the filter to check for the chars +, -, @, =. e.g. if the payload attacker injects is =2+5+cmd|' /C calc'!A0 filter will catch it and mitigate the risks. But if attacker puts the payload ""=2+5+cmd|' /C calc'!A0 filter won't be able to catch it as it was checking for only values starting with +, -, @, =. The end result will be same because MS excel, while rendering the value, simply skips the leading double quotes in CSV values.

Their recommendation is:

prepend a SPACE or TAB or SINGLE QUOTE to ALL CSV values before exporting them to file. This mitigation leaves the CSV file human readable but not executable. DO NOT check for leading +, -, @, =, " and prepend to only those values.

Does that sound like a valid approach or do you have a preferred alternative?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatives that I see are:

  1. Alter every column with a preceding space. Intrusive, but not as noticeable as a single quote and will, according to that post, solve the issue in all cases
  2. Try to be more clever with a regex to account for the quote issue he mentions and hope this captures everything
  3. Keep current solution (maybe change to preceding space) and ignore the quote issue. Less invasive, but still problematic.

We can also wrap the whole column with single quotes, but then the UX is worse in some spreadsheet software that hides the preceding quote from user visually.

Mitigating this is going to be intrusive no matter which solution we choose. I'd recommend 1 above if we want to be covered in all cases, but also happy to go with 2 if we feel confident that the the quote issue is the only other case to account for.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think option 1 is too intrusive. Option 2 or shipping as-is are viable options. Curious what others think here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like 2 is doable. I agree with @robdiciuccio that 1 is intrusive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the implementation to be a bit more clever after some more testing in google sheets and excel. Best I can tell, this gets the job done without being overly intrusive.

Some interesting things I found while testing:

  • Many resources say that a leading space will work the same as a leading single quote, but I found that Google sheets will ignore a leading space and evaluate the command
  • Google Sheets seems to hide leading single quote while Excel did not
  • Pandas seems to remove the leading double quotes

@junlincc junlincc added v1.2 and removed v1.1 labels Mar 25, 2021
Copy link
Member

@craig-rueda craig-rueda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One Q, otherwise LGTM

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, Ben! That linked article is super interesting.

Comment on lines +24 to +30
# This regex will match if the string starts with:
#
# 1. one of -, @, +, |, =, %
# 2. two double quotes immediately followed by one of -, @, +, |, =, %
# 3. one or more spaces immediately followed by one of -, @, +, |, =, %
#
problematic_chars_re = re.compile(r'^(?:"{2}|\s{1,})(?=[\-@+|=%])|^[\-@+|=%]')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a FYI, in Python you can write verbose regular expressions with inline comments: https://docs.python.org/3/library/re.html#re.VERBOSE

You can also use + instead of {1,}, but no need to change this.

@betodealmeida betodealmeida added need:merge The PR is ready to be merged v1.1 and removed v1.2 labels Mar 26, 2021
@betodealmeida betodealmeida merged commit 55ba47e into apache:master Mar 26, 2021
@github-actions
Copy link
Contributor

Ephemeral environment shutdown and build artifacts deleted.

betodealmeida pushed a commit to betodealmeida/incubator-superset that referenced this pull request Mar 26, 2021
…pache#13735)

* fix: Escape csv content during downloads

* Reuse CsvResponse object

* Use correct mimetype for csv responses

* Ensure that headers are also escaped

* Update escaping logic
@benjreinhart benjreinhart deleted the benjreinhart/escaping branch March 29, 2021 23:30
amitmiran137 pushed a commit that referenced this pull request Mar 31, 2021
* master: (56 commits)
  test: Adds tests and storybook to CertifiedIcon component (#13457)
  chore: Moves CheckboxIcons to Checkbox folder (#13459)
  chore: Removes Popover duplication (#13462)
  build(deps): bump elliptic from 6.5.3 to 6.5.4 in /docs (#13527)
  fix: allow spaces in DB names (#13800)
  chore: Update PR template for SIP-59 DB migrations process (#13855)
  Add CODEOWNERS (#13759)
  feat(alerts & reports): Easier to read execution logs (#13752)
  fix: Disallows negative options remaining (#13749)
  Fix broken link (#13861)
  fix(native-filters): add global async query support to native filters (#13837)
  Displays row limit warning with Alert component (#13854)
  fix(errors): Downgrade error on stop query to a warning (#13826)
  fix(alerts and reports): Unify timestamp format on execution log view (#13718)
  fix(sqllab): warning message when rows limited (#13841)
  chore: add success log whenever a connection is working (#13811)
  fix(native-filters): improve loading styles for filter component (#13794)
  chore: update change log with cherry-picks for release 1.1 (#13824)
  feat: added support to configure the default explorer viz (#13610)
  fix(#13734): Properly escape special characters in CSV output  (#13735)
  ...
cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
…pache#13735)

* fix: Escape csv content during downloads

* Reuse CsvResponse object

* Use correct mimetype for csv responses

* Ensure that headers are also escaped

* Update escaping logic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need:merge The PR is ready to be merged size/L v1.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Downloading as CSV should properly escape the values
6 participants