
Add ability to parse Content-Type from content type contains charset #27301

Closed
wants to merge 3 commits

Conversation

@liketic (Contributor) commented Nov 7, 2017

Closes #27065

cc @javanna

@elasticmachine (Collaborator)

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@DaveCTurner (Contributor)

I don't think it closes that issue you mentioned, but it is related to #27065.

@javanna (Member) commented Nov 7, 2017

I don't think it closes that issue you mentioned

I wish it would though :)

@DaveCTurner (Contributor)

@elasticmachine please test this

@liketic (Contributor, Author) commented Nov 7, 2017

@DaveCTurner My mistake... 👍

final String lowercaseContentType = restRequest.header("Content-Type").toLowerCase(Locale.ROOT);
final String[] elements = lowercaseContentType.split(";");
final String lowercaseMediaType = elements[0].trim();

Inline review comment (Contributor):

I feel we should still have some form of warning if you claim to be using a charset that isn't UTF-{8,16,32}, as per #22769. This might not be the time or the place to ask for that, however.
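A minimal sketch of such a warning, assuming a standalone helper (the project's real deprecation-logging API is not shown in this thread, so stderr stands in for it; all names here are illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class CharsetCheck {
    // The UTF variants the underlying parser can actually decode (see the
    // UTF-8/16/32 discussion further down this thread).
    private static final Set<String> DECODABLE = new HashSet<>(Arrays.asList(
        "utf-8", "utf-16", "utf-16le", "utf-16be", "utf-32", "utf-32le", "utf-32be"));

    static void warnIfUnsupported(String charset) {
        if (DECODABLE.contains(charset.toLowerCase(Locale.ROOT)) == false) {
            System.err.println("warning: charset [" + charset + "] is not UTF-8/16/32");
        }
    }
}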

@DaveCTurner added the :Core/Infra/REST API (REST infrastructure and utilities) and v5.6.5 labels Nov 7, 2017
@jasontedor (Member) left a comment

This change is not enough for me. It certainly fixes the errant deprecation logger warning (and the fact that we were not obeying the content type), but now it leniently ignores any parameters on the content type.

@javanna (Member) commented Nov 7, 2017

@jasontedor are you suggesting that we parse the charset as part of this PR, i.e., address #22769?

@javanna removed their assignment Nov 7, 2017
@jasontedor (Member)

@javanna It doesn't have to be this PR, but I'm not willing to pull this PR in as-is because of the leniency. Ideally, yes, #22769 would be addressed as part of a complete solution. 😄

@javanna (Member) commented Nov 7, 2017

We currently don't do anything with the charset other than issuing a wrong deprecation warning because of it, right? I understand the reasons why leniency should be avoided, but why does fixing the wrong deprecation warning mean parsing the charset and acting on it? Is there some middle ground between the two, so that we can remove the deprecation warning (hopefully by getting this PR in) without addressing #22769 at the same time?

@DaveCTurner (Contributor)

Could it just accept the literal application/x-ndjson; charset=utf-8 and consider it to be equivalent to application/x-ndjson? That'd save any mucking around with parsing the header without introducing any undue leniency. Are there any other Content-Types that are problematic here?
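A minimal sketch of this literal-match idea, with a hypothetical helper name (this is not code from the PR):

// Treat the one known-problematic header value as an alias for the bare media
// type, avoiding general parameter parsing altogether.
private static String normalizeContentType(String header) {
    if ("application/x-ndjson; charset=utf-8".equalsIgnoreCase(header.trim())) {
        return "application/x-ndjson";
    }
    return header;
}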

@rjernst (Member) commented Nov 8, 2017

I think splitting as implemented in this PR is fine if the charset is checked. Since at the moment we don't actually support anything but UTF-8, we can just fail if a charset is present and is not utf-8. Actually supporting other charset values can be a follow-up, but at least we won't add leniency, and it avoids odd behavior (e.g., sending a charset of UTF-16 and then failing during decoding).
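A sketch of this split-then-check approach, under assumed names rather than the PR's actual diff:

// Strip parameters from the media type, but fail hard if a charset parameter
// names anything other than utf-8, the only encoding the tests exercise.
private static String mediaTypeRejectingOtherCharsets(String rawContentType) {
    final String contentType = rawContentType.toLowerCase(java.util.Locale.ROOT);
    final String[] elements = contentType.split(";");
    for (int i = 1; i < elements.length; i++) {
        final String parameter = elements[i].trim();
        if (parameter.startsWith("charset=") && parameter.equals("charset=utf-8") == false) {
            throw new IllegalArgumentException(
                "unsupported charset in Content-Type header [" + rawContentType + "]");
        }
    }
    return elements[0].trim();
}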

@jasontedor (Member)

I was going to suggest the same thing as @rjernst but now I can take a shortcut and simply register my agreement with his proposal. 😄

@DaveCTurner (Contributor) commented Nov 9, 2017

Since at the moment we don't actually support anything but utf8.

It seems that we also accept UTF-16 and UTF-32 (both little and big endian variants) via autodetection, but nothing weirder:

$ for i in ISO-8859-1 UTF-8 UTF-16LE UTF-16BE UTF-32LE UTF-32BE ISO-IR-165; do
    echo; echo; echo $i
    echo -n '{"query":{"match":{"message":"é"}}}' | iconv -f utf-8 -t $i > body.dat
    xxd body.dat
    curl http://127.0.0.1:9200/_search -H'Content-type: application/json' --data-binary @body.dat
  done


ISO-8859-1
00000000: 7b22 7175 6572 7922 3a7b 226d 6174 6368  {"query":{"match
00000010: 223a 7b22 6d65 7373 6167 6522 3a22 e922  ":{"message":"."
00000020: 7d7d 7d                                  }}}
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Invalid UTF-8 middle byte 0x22\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@45ee3f59; line: 1, column: 33]"}],"type":"json_parse_exception","reason":"Invalid UTF-8 middle byte 0x22\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@45ee3f59; line: 1, column: 33]"},"status":500}

UTF-8
00000000: 7b22 7175 6572 7922 3a7b 226d 6174 6368  {"query":{"match
00000010: 223a 7b22 6d65 7373 6167 6522 3a22 c3a9  ":{"message":"..
00000020: 227d 7d7d                                "}}}
{"took":1,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

UTF-16LE
00000000: 7b00 2200 7100 7500 6500 7200 7900 2200  {.".q.u.e.r.y.".
00000010: 3a00 7b00 2200 6d00 6100 7400 6300 6800  :.{.".m.a.t.c.h.
00000020: 2200 3a00 7b00 2200 6d00 6500 7300 7300  ".:.{.".m.e.s.s.
00000030: 6100 6700 6500 2200 3a00 2200 e900 2200  a.g.e.".:."...".
00000040: 7d00 7d00 7d00                           }.}.}.
{"took":0,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

UTF-16BE
00000000: 007b 0022 0071 0075 0065 0072 0079 0022  .{.".q.u.e.r.y."
00000010: 003a 007b 0022 006d 0061 0074 0063 0068  .:.{.".m.a.t.c.h
00000020: 0022 003a 007b 0022 006d 0065 0073 0073  .".:.{.".m.e.s.s
00000030: 0061 0067 0065 0022 003a 0022 00e9 0022  .a.g.e.".:."..."
00000040: 007d 007d 007d                           .}.}.}
{"took":0,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

UTF-32LE
00000000: 7b00 0000 2200 0000 7100 0000 7500 0000  {..."...q...u...
00000010: 6500 0000 7200 0000 7900 0000 2200 0000  e...r...y..."...
00000020: 3a00 0000 7b00 0000 2200 0000 6d00 0000  :...{..."...m...
00000030: 6100 0000 7400 0000 6300 0000 6800 0000  a...t...c...h...
00000040: 2200 0000 3a00 0000 7b00 0000 2200 0000  "...:...{..."...
00000050: 6d00 0000 6500 0000 7300 0000 7300 0000  m...e...s...s...
00000060: 6100 0000 6700 0000 6500 0000 2200 0000  a...g...e..."...
00000070: 3a00 0000 2200 0000 e900 0000 2200 0000  :..."......."...
00000080: 7d00 0000 7d00 0000 7d00 0000            }...}...}...
{"took":4,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

UTF-32BE
00000000: 0000 007b 0000 0022 0000 0071 0000 0075  ...{..."...q...u
00000010: 0000 0065 0000 0072 0000 0079 0000 0022  ...e...r...y..."
00000020: 0000 003a 0000 007b 0000 0022 0000 006d  ...:...{..."...m
00000030: 0000 0061 0000 0074 0000 0063 0000 0068  ...a...t...c...h
00000040: 0000 0022 0000 003a 0000 007b 0000 0022  ..."...:...{..."
00000050: 0000 006d 0000 0065 0000 0073 0000 0073  ...m...e...s...s
00000060: 0000 0061 0000 0067 0000 0065 0000 0022  ...a...g...e..."
00000070: 0000 003a 0000 0022 0000 00e9 0000 0022  ...:..."......."
00000080: 0000 007d 0000 007d 0000 007d            ...}...}...}
{"took":0,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

ISO-IR-165
00000000: 2a7b 2a22 2a71 2a75 2a65 2a72 2a79 2a22  *{*"*q*u*e*r*y*"
00000010: 2a3a 2a7b 2a22 2a6d 2a61 2a74 2a63 2a68  *:*{*"*m*a*t*c*h
00000020: 2a22 2a3a 2a7b 2a22 2a6d 2a65 2a73 2a73  *"*:*{*"*m*e*s*s
00000030: 2a61 2a67 2a65 2a22 2a3a 2a22 2b26 2a22  *a*g*e*"*:*"+&*"
00000040: 2a7d 2a7d 2a7d                           *}*}*}
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Unexpected character ('*' (code 42)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@4f45863b; line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Unexpected character ('*' (code 42)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@4f45863b; line: 1, column: 2]"},"status":500}

We may not support them, of course :)
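For context on why the UTF-16 and UTF-32 variants succeed: Jackson, the parser behind Elasticsearch's JSON handling, auto-detects UTF-8/16/32 from the leading bytes of the document. A standalone illustration, assuming only jackson-core on the classpath:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDetectionDemo {
    public static void main(String[] args) throws Exception {
        final JsonFactory factory = new JsonFactory();
        final Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16LE, StandardCharsets.UTF_16BE,
            Charset.forName("UTF-32LE"), Charset.forName("UTF-32BE")
        };
        for (final Charset cs : charsets) {
            final byte[] body = "{\"message\":\"é\"}".getBytes(cs);
            // No charset hint is given: Jackson infers the encoding from the bytes.
            try (JsonParser parser = factory.createParser(body)) {
                while (parser.nextToken() != null) {
                    // consume every token; an encoding mismatch would throw here
                }
            }
            System.out.println(cs + ": parsed OK");
        }
    }
}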

@liketic (Contributor, Author) commented Nov 9, 2017

I pushed bc424b3 to validate the charset. @DaveCTurner I'm OK with adding UTF-16/UTF-32 if you both think this way of fixing the issue makes sense. Thanks for everyone's comments.

@jaymode (Member) left a comment

I'd like to see if we can get this fix merged in and ported to 6.x so that requests do not fail if a charset is specified. @liketic are you able to update the PR (merge 5.6)?

@DaveCTurner @jasontedor can you take a look again at the changes?

@@ -308,6 +309,23 @@ private boolean hasContentTypeOrCanAutoDetect(final RestRequest restRequest, fin
return true;
}

private String parseMediaType(String rawContentType) {
final String contentType = rawContentType.toLowerCase(Locale.ROOT);
int firstSemiColonIndex = contentType.indexOf(';');
Inline review comment (Member):

make it final

Reply from @liketic (Contributor, Author):

Thanks for picking this up. I pushed 0fabbd0. The merge target is 5.6 now. Thank you so much!
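For readers skimming the diff above, a hedged reconstruction of what the complete method plausibly looks like with the reviewer's final suggestion applied (the PR's full method body is not shown in this thread):

private String parseMediaType(String rawContentType) {
    final String contentType = rawContentType.toLowerCase(Locale.ROOT);
    final int firstSemiColonIndex = contentType.indexOf(';');
    if (firstSemiColonIndex == -1) {
        return contentType; // no parameters to strip
    }
    // drop everything from the first ';' onwards, e.g. "; charset=utf-8"
    return contentType.substring(0, firstSemiColonIndex).trim();
}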

@elasticmachine (Collaborator)

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@jasontedor added v5.6.9 and removed v5.6.8 labels Feb 20, 2018
@mbarretta

To echo Jay, any hope of seeing this in a near-term 6.x release (e.g. 6.3)?

@DaveCTurner (Contributor)

I'm -½ on this change. It's a quick-and-dirty hack that means we accept a few more Content-type headers, but not all of the ones that we should.

I also note that we are inconsistent with our leniency in this area, comparing our handling of application/json and application/x-ndjson:

$ curl -s http://localhost:9200/ | jq .version.number
"6.2.2"
$ curl http://localhost:9200/_search -H'Content-type: application/json; charset=utf-8' --data-binary '{}'
{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}
$ curl http://localhost:9200/_search -H'Content-type: application/json; charset=xcvbxcvbxcv' --data-binary '{}'
{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}
$ curl http://localhost:9200/_search -H'Content-type: application/x-ndjson; charset=utf-8' --data-binary '{}'
{"error":"Content-Type header [application/x-ndjson; charset=utf-8] is not supported","status":406}
$ curl http://localhost:9200/_search -H'Content-type: application/x-ndjson; charset=xcvbxcvbxcv' --data-binary '{}'
{"error":"Content-Type header [application/x-ndjson; charset=xcvbxcvbxcv] is not supported","status":406}

I think this needs a deeper look.

@jaymode (Member) commented Mar 13, 2018

It's a quick-and-dirty hack that means we accept a few more Content-type headers, but not all of the ones that we should.

@DaveCTurner I agree, which is why I believe the validation of the charset should be separate from this PR. A portion of this PR is still important: the ability to accept an x-ndjson content type header that has the charset or other parameters specified. We already do this for all other content types we support; see RestRequest#parseContentType. @jasontedor @rjernst can you elaborate on why we could not pull in a small fix that extends existing behavior and makes things consistent? Elasticsearch has always ignored the charset and other parameters.

@mbarretta commented Mar 13, 2018

Again, I agree with Jay: the rejection of _bulk requests with a charset is a change in behavior and breaks compatibility with some third-party REST client libraries. The particular scenario I'm stuck on is similar to the third one in @DaveCTurner's example, though specifically using _bulk rather than _search:

$ curl http://localhost:9200/_search -H'Content-type: application/x-ndjson; charset=utf-8' --data-binary '{}'

It's a valid call, and while the charset would be ignored, accepting that request would be consistent with past behavior as well as with current behavior for the application/json content type.

@rjernst (Member) commented Mar 13, 2018

can you elaborate on why we could not pull a small fix in that extends existing behavior that makes things consistent?

I don't think we should add leniency because leniency is incredibly hard to remove later (people start depending on it).

Elasticsearch has always ignored the charset and other parameters.

But now we are parsing the content type, so we should parse the entire content type. If we ignore parts of it (i.e., the charset), it is no better than the state we were already in (when we "auto-detected" everything).

@DaveCTurner Your curl examples aren't showing the headers. This PR, as it currently stands, only emits deprecation warnings. I think we should actually be erroring if we are to parse the charset. Your earlier examples of supporting UTF-16 and UTF-32 only work because of the underlying parser we use. We could swap that out today and no tests would fail. Therefore, I claim we only support UTF-8 (the only encoding we use in tests, afaik).

I also do not think this is a "quick and dirty hack". It is parsing the charset. We do also need to fail on other parameters and on invalid charset values. I don't think this should be a deprecation warning. Bogus and unsupported charset values already won't work (aside from the ones that "work" above, which I'm torn on whether to fail hard on or just warn about).

@mbarretta

Just for clarity, the issue to which I'm referring, previously linked to this one, is #28123.

@jaymode (Member) commented Mar 13, 2018

I don't think we should add leniency because leniency is incredibly hard to remove later (people start depending on it).

I'd agree in most cases, but I view this one differently because we do not use the charset at all (#22769) and it causes issues with certain clients that send the charset in 6.x. The leniency introduced by simply dropping the charset for x-ndjson leaves us in a no-worse position than we are in today, where we ignore the charset entirely. I'd like to fix our use of charset everywhere, but that is a breaking change and should be handled separately.

The change to not allow the charset when using x-ndjson is also a breaking change that wasn't documented or intended in 6.0.

I also do not think this is a "quick and dirty hack". It is parsing the charset.

The current change in this PR is incomplete, as it only parses the charset in the case where we haven't already parsed the content type and mapped it to the appropriate XContentType.

@DaveCTurner (Contributor)

Your curl examples aren't showing the headers.

There were no interesting headers to report (unlike in 5.6.8):

$ curl -s http://localhost:9200/ | jq .version.number
"6.2.2"

$ curl -i http://localhost:9200/_search -H'Content-type: application/json; charset=utf-8' --data-binary '{}'
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 133

{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}
$ curl -i http://localhost:9200/_search -H'Content-type: application/json; charset=xcvbxcvbxcv' --data-binary '{}'
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 133

{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}
$ curl -i http://localhost:9200/_search -H'Content-type: application/x-ndjson; charset=utf-8' --data-binary '{}'
HTTP/1.1 406 Not Acceptable
content-type: application/json; charset=UTF-8
content-length: 99

{"error":"Content-Type header [application/x-ndjson; charset=utf-8] is not supported","status":406}
$ curl -i http://localhost:9200/_search -H'Content-type: application/x-ndjson; charset=xcvbxcvbxcv' --data-binary '{}'
HTTP/1.1 406 Not Acceptable
content-type: application/json; charset=UTF-8
content-length: 105

{"error":"Content-Type header [application/x-ndjson; charset=xcvbxcvbxcv] is not supported","status":406}

The point is that we already treat the charset leniently for application/json.

I also do not think this is a "quick and dirty hack". It is parsing the charset.

It's checking for the expected charset as a token but not as a quoted-string: charset="utf-8" is something else to consider, and weirder things like charset="utf\-8" are also legitimate.
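A hedged sketch of what handling the quoted-string form could involve, per RFC 7231 (illustrative names, not from the PR):

// Accept both the token form (charset=utf-8) and the quoted-string form
// (charset="utf-8"), resolving quoted-pair escapes such as charset="utf\-8".
private static String unquoteParameterValue(String value) {
    if (value.length() >= 2 && value.charAt(0) == '"' && value.charAt(value.length() - 1) == '"') {
        return value.substring(1, value.length() - 1).replaceAll("\\\\(.)", "$1");
    }
    return value; // a bare token needs no unquoting
}

Both forms then reduce to utf-8, and per the spec the value comparison should also be case-insensitive.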

@rjernst (Member) commented Mar 15, 2018

The point is that we already treat the charset leniently for application/json.

The existence of leniency does not mean we should propagate or increase the leniency.

It's checking for the expected charset as a token but not as a quoted-string

I wasn't aware of the quoted strings/case-insensitive side of the spec (I see it now). I think that should be fixed here (or in a followup).

@jaymode @DaveCTurner Can you help me understand where your reluctance is to parse the parameters from content-type and at least verify they match what we actually support today? Adding parsing (as in this PR) will fix the original "x-ndjson is erroneously rejected" problem. But adding it at the cost of furthering leniency (intentionally ignoring the media type when we can verify it) doesn't seem to help anyone.

@jaymode (Member) commented Mar 15, 2018

Can you help me understand where your reluctance is to parse the parameters from content-type and at least verify they match what we actually support today?

I am reluctant to have parsing the parameters tied to this change, as we are holding up a fix for something broken in 6.x while we get the validation logic correct. Parameter values should be validated to ensure they are in the correct format (token or quoted string per RFC 7231), and we need to add tests for that validation. Additionally, we need to ensure we do this in the correct places, not just in one place for x-ndjson. Are there other parameters that we should accept or validate? All of these are missing from this PR right now, and I believe this should not be tossed in as part of a small fix from a community member that was originally submitted four months ago.
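The validation tests being asked for might look roughly like this (hypothetical parseMediaType helper; expectThrows as provided by the Elasticsearch test framework):

public void testContentTypeParameterValidation() {
    // the bare media type and a utf-8 charset (token or quoted-string) should all parse
    assertEquals("application/x-ndjson", parseMediaType("application/x-ndjson"));
    assertEquals("application/x-ndjson", parseMediaType("application/x-ndjson; charset=utf-8"));
    assertEquals("application/x-ndjson", parseMediaType("application/x-ndjson; charset=\"utf-8\""));
    // an unsupported charset should be rejected outright rather than silently ignored
    expectThrows(IllegalArgumentException.class,
        () -> parseMediaType("application/x-ndjson; charset=iso-8859-1"));
}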

@clintongormley changed the title from "Add ability to parse Content-Type from content type contains charset (#27065)" to "Add ability to parse Content-Type from content type contains charset" Apr 18, 2018
@jpountz added v5.6.11 and removed v5.6.10 labels Jun 13, 2018
@javanna (Member) commented Aug 16, 2018

Looking at the discussion above, it seems we reached no agreement on getting this in, so I think there's no point in keeping it open. Feel free to reopen if you disagree.

@javanna closed this Aug 16, 2018