
Add ability to parse Content-Type from content type contains charset #27301

Closed
wants to merge 3 commits

Conversation

@liketic (Contributor) commented Nov 7, 2017

Closes #27065

cc @javanna

@elasticmachine (Collaborator)

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@DaveCTurner (Contributor)

I don't think it closes that issue you mentioned, but it is related to #27065.

@javanna (Member) commented Nov 7, 2017

I don't think it closes that issue you mentioned

I wish it would though :)

@DaveCTurner (Contributor)

@elasticmachine please test this

@liketic (Contributor, Author) commented Nov 7, 2017

@DaveCTurner My mistake... 👍

final String lowercaseContentType = restRequest.header("Content-Type").toLowerCase(Locale.ROOT);
final String[] elements = lowercaseContentType.split(";");
final String lowercaseMediaType = elements[0].trim();

Inline review comment (Contributor):

I feel we should still have some form of warning if you claim to be using a charset that isn't UTF-{8,16,32}, as per #22769. This might not be the time or the place to ask for that, however.
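A minimal sketch of such a warning, assuming a standalone helper (the project's real deprecation-logging API is not shown in this thread, so stderr stands in for it; all names here are illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class CharsetCheck {
    // The UTF variants the underlying parser can actually decode (see the
    // UTF-8/16/32 discussion further down this thread).
    private static final Set<String> DECODABLE = new HashSet<>(Arrays.asList(
        "utf-8", "utf-16", "utf-16le", "utf-16be", "utf-32", "utf-32le", "utf-32be"));

    static void warnIfUnsupported(String charset) {
        if (DECODABLE.contains(charset.toLowerCase(Locale.ROOT)) == false) {
            System.err.println("warning: charset [" + charset + "] is not UTF-8/16/32");
        }
    }
}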

@DaveCTurner added the :Core/Infra/REST API (REST infrastructure and utilities) and v5.6.5 labels Nov 7, 2017
@jasontedor (Member) left a comment

This change is not enough for me. It certainly fixes the errant deprecation logger warning (and the fact that we were not obeying the content type), but now it leniently ignores any parameters on the content type.

@javanna (Member) commented Nov 7, 2017

@jasontedor are you suggesting that we parse the charset as part of this PR, i.e., address #22769?

@javanna removed their assignment Nov 7, 2017
@jasontedor (Member)

@javanna It doesn't have to be this PR, but I'm not willing to pull this PR in as-is because of the leniency. Ideally, yes, #22769 would be addressed as part of a complete solution. 😄

@javanna (Member) commented Nov 7, 2017

We currently don't do anything with the charset other than issuing a wrong deprecation warning because of it, right? I understand the reasons why leniency should be avoided, but why does fixing the wrong deprecation warning mean parsing the charset and acting on it? Is there some middle ground between the two, so that we can remove the deprecation warning (hopefully by getting this PR in) without addressing #22769 at the same time?

@DaveCTurner (Contributor)

Could it just accept the literal application/x-ndjson; charset=utf-8 and consider it to be equivalent to application/x-ndjson? That'd save any mucking around with parsing the header without introducing any undue leniency. Are there any other Content-Types that are problematic here?
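A minimal sketch of this literal-match idea, with a hypothetical helper name (this is not code from the PR):

// Treat the one known-problematic header value as an alias for the bare media
// type, avoiding general parameter parsing altogether.
private static String normalizeContentType(String header) {
    if ("application/x-ndjson; charset=utf-8".equalsIgnoreCase(header.trim())) {
        return "application/x-ndjson";
    }
    return header;
}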

@rjernst (Member) commented Nov 8, 2017

I think splitting as implemented in this PR is fine if the charset is checked. Since at the moment we don't actually support anything but UTF-8, we can just fail if a charset is present and is not utf-8. Actually supporting other charset values can be a follow-up, but at least we won't add leniency, and it avoids odd behavior (e.g., sending a charset of UTF-16 and then failing during decoding).
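A sketch of this split-then-check approach, under assumed names rather than the PR's actual diff:

// Strip parameters from the media type, but fail hard if a charset parameter
// names anything other than utf-8, the only encoding the tests exercise.
private static String mediaTypeRejectingOtherCharsets(String rawContentType) {
    final String contentType = rawContentType.toLowerCase(java.util.Locale.ROOT);
    final String[] elements = contentType.split(";");
    for (int i = 1; i < elements.length; i++) {
        final String parameter = elements[i].trim();
        if (parameter.startsWith("charset=") && parameter.equals("charset=utf-8") == false) {
            throw new IllegalArgumentException(
                "unsupported charset in Content-Type header [" + rawContentType + "]");
        }
    }
    return elements[0].trim();
}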

@jasontedor (Member)

I was going to suggest the same thing as @rjernst but now I can take a shortcut and simply register my agreement with his proposal. 😄

@DaveCTurner (Contributor) commented Nov 9, 2017

Since at the moment we don't actually support anything but utf8.

It seems that we also accept UTF-16 and UTF-32 (both little and big endian variants) via autodetection, but nothing weirder:

$ for i in ISO-8859-1 UTF-8 UTF-16LE UTF-16BE UTF-32LE UTF-32BE ISO-IR-165; do
    echo; echo; echo $i
    echo -n '{"query":{"match":{"message":"é"}}}' | iconv -f utf-8 -t $i > body.dat
    xxd body.dat
    curl http://127.0.0.1:9200/_search -H'Content-type: application/json' --data-binary @body.dat
  done


ISO-8859-1
00000000: 7b22 7175 6572 7922 3a7b 226d 6174 6368  {"query":{"match
00000010: 223a 7b22 6d65 7373 6167 6522 3a22 e922  ":{"message":"."
00000020: 7d7d 7d                                  }}}
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Invalid UTF-8 middle byte 0x22\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@45ee3f59; line: 1, column: 33]"}],"type":"json_parse_exception","reason":"Invalid UTF-8 middle byte 0x22\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@45ee3f59; line: 1, column: 33]"},"status":500}

UTF-8
00000000: 7b22 7175 6572 7922 3a7b 226d 6174 6368  {"query":{"match
00000010: 223a 7b22 6d65 7373 6167 6522 3a22 c3a9  ":{"message":"..
00000020: 227d 7d7d                                "}}}
{"took":1,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

UTF-16LE
00000000: 7b00 2200 7100 7500 6500 7200 7900 2200  {.".q.u.e.r.y.".
00000010: 3a00 7b00 2200 6d00 6100 7400 6300 6800  :.{.".m.a.t.c.h.
00000020: 2200 3a00 7b00 2200 6d00 6500 7300 7300  ".:.{.".m.e.s.s.
00000030: 6100 6700 6500 2200 3a00 2200 e900 2200  a.g.e.".:."...".
00000040: 7d00 7d00 7d00                           }.}.}.
{"took":0,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

UTF-16BE
00000000: 007b 0022 0071 0075 0065 0072 0079 0022  .{.".q.u.e.r.y."
00000010: 003a 007b 0022 006d 0061 0074 0063 0068  .:.{.".m.a.t.c.h
00000020: 0022 003a 007b 0022 006d 0065 0073 0073  .".:.{.".m.e.s.s
00000030: 0061 0067 0065 0022 003a 0022 00e9 0022  .a.g.e.".:."..."
00000040: 007d 007d 007d                           .}.}.}
{"took":0,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

UTF-32LE
00000000: 7b00 0000 2200 0000 7100 0000 7500 0000  {..."...q...u...
00000010: 6500 0000 7200 0000 7900 0000 2200 0000  e...r...y..."...
00000020: 3a00 0000 7b00 0000 2200 0000 6d00 0000  :...{..."...m...
00000030: 6100 0000 7400 0000 6300 0000 6800 0000  a...t...c...h...
00000040: 2200 0000 3a00 0000 7b00 0000 2200 0000  "...:...{..."...
00000050: 6d00 0000 6500 0000 7300 0000 7300 0000  m...e...s...s...
00000060: 6100 0000 6700 0000 6500 0000 2200 0000  a...g...e..."...
00000070: 3a00 0000 2200 0000 e900 0000 2200 0000  :..."......."...
00000080: 7d00 0000 7d00 0000 7d00 0000            }...}...}...
{"took":4,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

UTF-32BE
00000000: 0000 007b 0000 0022 0000 0071 0000 0075  ...{..."...q...u
00000010: 0000 0065 0000 0072 0000 0079 0000 0022  ...e...r...y..."
00000020: 0000 003a 0000 007b 0000 0022 0000 006d  ...:...{..."...m
00000030: 0000 0061 0000 0074 0000 0063 0000 0068  ...a...t...c...h
00000040: 0000 0022 0000 003a 0000 007b 0000 0022  ..."...:...{..."
00000050: 0000 006d 0000 0065 0000 0073 0000 0073  ...m...e...s...s
00000060: 0000 0061 0000 0067 0000 0065 0000 0022  ...a...g...e..."
00000070: 0000 003a 0000 0022 0000 00e9 0000 0022  ...:..."......."
00000080: 0000 007d 0000 007d 0000 007d            ...}...}...}
{"took":0,"timed_out":false,"_shards":{"total":4,"successful":4,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

ISO-IR-165
00000000: 2a7b 2a22 2a71 2a75 2a65 2a72 2a79 2a22  *{*"*q*u*e*r*y*"
00000010: 2a3a 2a7b 2a22 2a6d 2a61 2a74 2a63 2a68  *:*{*"*m*a*t*c*h
00000020: 2a22 2a3a 2a7b 2a22 2a6d 2a65 2a73 2a73  *"*:*{*"*m*e*s*s
00000030: 2a61 2a67 2a65 2a22 2a3a 2a22 2b26 2a22  *a*g*e*"*:*"+&*"
00000040: 2a7d 2a7d 2a7d                           *}*}*}
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Unexpected character ('*' (code 42)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@4f45863b; line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Unexpected character ('*' (code 42)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@4f45863b; line: 1, column: 2]"},"status":500}

We may not support them, of course :)
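For context on why the UTF-16 and UTF-32 variants succeed: Jackson, the parser behind Elasticsearch's JSON handling, auto-detects UTF-8/16/32 from the leading bytes of the document. A standalone illustration, assuming only jackson-core on the classpath:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDetectionDemo {
    public static void main(String[] args) throws Exception {
        final JsonFactory factory = new JsonFactory();
        final Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16LE, StandardCharsets.UTF_16BE,
            Charset.forName("UTF-32LE"), Charset.forName("UTF-32BE")
        };
        for (final Charset cs : charsets) {
            final byte[] body = "{\"message\":\"é\"}".getBytes(cs);
            // No charset hint is given: Jackson infers the encoding from the bytes.
            try (JsonParser parser = factory.createParser(body)) {
                while (parser.nextToken() != null) {
                    // consume every token; an encoding mismatch would throw here
                }
            }
            System.out.println(cs + ": parsed OK");
        }
    }
}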

@liketic (Contributor, Author) commented Nov 9, 2017

I pushed bc424b3 to validate the charset. @DaveCTurner I'm OK with adding UTF-16/UTF-32 if you both think this way of fixing the issue makes sense. Thanks for everyone's comments.

@jaymode (Member) left a comment

I'd like to see if we can get this fix merged in and ported to 6.x so that requests do not fail if a charset is specified. @liketic are you able to update the PR (merge 5.6)?

@DaveCTurner @jasontedor can you take a look again at the changes?

@@ -308,6 +309,23 @@ private boolean hasContentTypeOrCanAutoDetect(final RestRequest restRequest, fin
return true;
}

private String parseMediaType(String rawContentType) {
final String contentType = rawContentType.toLowerCase(Locale.ROOT);
int firstSemiColonIndex = contentType.indexOf(';');
Inline review comment (Member):

make it final

Reply from @liketic (Contributor, Author):

Thanks for picking this up. I pushed 0fabbd0. The merge target is 5.6 now. Thank you so much!
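For readers skimming the diff above, a hedged reconstruction of what the complete method plausibly looks like with the reviewer's final suggestion applied (the PR's full method body is not shown in this thread):

private String parseMediaType(String rawContentType) {
    final String contentType = rawContentType.toLowerCase(Locale.ROOT);
    final int firstSemiColonIndex = contentType.indexOf(';');
    if (firstSemiColonIndex == -1) {
        return contentType; // no parameters to strip
    }
    // drop everything from the first ';' onwards, e.g. "; charset=utf-8"
    return contentType.substring(0, firstSemiColonIndex).trim();
}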

@elasticmachine (Collaborator)

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@jasontedor added v5.6.9 and removed v5.6.8 labels Feb 20, 2018
@mbarretta

To echo Jay, any hope of seeing this in a near-term 6.x release (e.g. 6.3)?

@DaveCTurner (Contributor)

I'm -½ on this change. It's a quick-and-dirty hack that means we accept a few more Content-type headers, but not all of the ones that we should.

I also note that we are inconsistent with our leniency in this area, comparing our handling of application/json and application/x-ndjson:

$ curl -s http://localhost:9200/ | jq .version.number
"6.2.2"
$ curl http://localhost:9200/_search -H'Content-type: application/json; charset=utf-8' --data-binary '{}'
{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}
$ curl http://localhost:9200/_search -H'Content-type: application/json; charset=xcvbxcvbxcv' --data-binary '{}'
{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}
$ curl http://localhost:9200/_search -H'Content-type: application/x-ndjson; charset=utf-8' --data-binary '{}'
{"error":"Content-Type header [application/x-ndjson; charset=utf-8] is not supported","status":406}
$ curl http://localhost:9200/_search -H'Content-type: application/x-ndjson; charset=xcvbxcvbxcv' --data-binary '{}'
{"error":"Content-Type header [application/x-ndjson; charset=xcvbxcvbxcv] is not supported","status":406}

I think this needs a deeper look.

@jaymode (Member) commented Mar 13, 2018

It's a quick-and-dirty hack that means we accept a few more Content-type headers, but not all of the ones that we should.

@DaveCTurner I agree, which is why I believe the validation of the charset should be separate from this PR. A portion of this PR is still important: the ability to accept an x-ndjson content type header that has the charset or other parameters specified. We already do this for all other content types we support; see RestRequest#parseContentType. @jasontedor @rjernst can you elaborate on why we could not pull in a small fix that extends existing behavior and makes things consistent? Elasticsearch has always ignored the charset and other parameters.

@mbarretta commented Mar 13, 2018

Again, I agree with Jay: the rejection of _bulk requests with a charset is a change in behavior and breaks compatibility with some third-party REST client libraries. The particular scenario I'm stuck on is similar to the third one in @DaveCTurner's example, though specifically using _bulk rather than _search:

$ curl http://localhost:9200/_search -H'Content-type: application/x-ndjson; charset=utf-8' --data-binary '{}'

It's a valid call, and while the charset would be ignored, accepting that request would be consistent with past behavior as well as with current behavior for the application/json content type.

@rjernst (Member) commented Mar 13, 2018

can you elaborate on why we could not pull a small fix in that extends existing behavior that makes things consistent?

I don't think we should add leniency because leniency is incredibly hard to remove later (people start depending on it).

Elasticsearch has always ignored the charset and other parameters.

But now we are parsing the content type, so we should parse the entire content type. If we ignore parts of it (i.e., the charset), it is no better than the state we were already in (when we "auto-detected" everything).

@DaveCTurner Your curl examples aren't showing the headers. This PR, as it currently stands, only emits deprecation warnings. I think we should actually be erroring if we are to parse the charset. Your earlier examples of supporting UTF-16 and UTF-32 only work because of the underlying parser we use. We could swap that out today and no tests would fail. Therefore, I claim we only support UTF-8 (the only encoding we use in tests, afaik).

I also do not think this is a "quick and dirty hack". It is parsing the charset. We do also need to fail on other parameters and on invalid charset values. I don't think this should be a deprecation warning. Bogus and unsupported charset values already won't work (aside from the ones that "work" above, which I'm torn on whether to fail hard on or just warn about).

@mbarretta

Just for clarity, the issue to which I'm referring, previously linked to this one, is #28123.

@jaymode (Member) commented Mar 13, 2018

I don't think we should add leniency because leniency is incredibly hard to remove later (people start depending on it).

I'd agree in most cases, but I view this one differently because we do not use the charset at all (#22769) and it causes issues with certain clients that send the charset in 6.x. The leniency introduced by simply dropping the charset for x-ndjson leaves us in a no-worse position than we are in today, where we ignore the charset entirely. I'd like to fix our use of charset everywhere, but that is a breaking change and should be handled separately.

The change to not allow the charset when using x-ndjson is also a breaking change that wasn't documented or intended in 6.0.

I also do not think this is a "quick and dirty hack". It is parsing the charset.

The current change in this PR is incomplete, as it only parses the charset in the case where we haven't already parsed the content type and mapped it to the appropriate XContentType.

@DaveCTurner (Contributor)

Your curl examples aren't showing the headers.

There were no interesting headers to report (unlike in 5.6.8):

$ curl -s http://localhost:9200/ | jq .version.number
"6.2.2"

$ curl -i http://localhost:9200/_search -H'Content-type: application/json; charset=utf-8' --data-binary '{}'
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 133

{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}
$ curl -i http://localhost:9200/_search -H'Content-type: application/json; charset=xcvbxcvbxcv' --data-binary '{}'
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 133

{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}
$ curl -i http://localhost:9200/_search -H'Content-type: application/x-ndjson; charset=utf-8' --data-binary '{}'
HTTP/1.1 406 Not Acceptable
content-type: application/json; charset=UTF-8
content-length: 99

{"error":"Content-Type header [application/x-ndjson; charset=utf-8] is not supported","status":406}
$ curl -i http://localhost:9200/_search -H'Content-type: application/x-ndjson; charset=xcvbxcvbxcv' --data-binary '{}'
HTTP/1.1 406 Not Acceptable
content-type: application/json; charset=UTF-8
content-length: 105

{"error":"Content-Type header [application/x-ndjson; charset=xcvbxcvbxcv] is not supported","status":406}

The point is that we already treat the charset leniently for application/json.

I also do not think this is a "quick and dirty hack". It is parsing the charset.

It's checking for the expected charset as a token but not as a quoted-string: charset="utf-8" is something else to consider, and weirder things like charset="utf\-8" are also legitimate.
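A hedged sketch of what handling the quoted-string form could involve, per RFC 7231 (illustrative names, not from the PR):

// Accept both the token form (charset=utf-8) and the quoted-string form
// (charset="utf-8"), resolving quoted-pair escapes such as charset="utf\-8".
private static String unquoteParameterValue(String value) {
    if (value.length() >= 2 && value.charAt(0) == '"' && value.charAt(value.length() - 1) == '"') {
        return value.substring(1, value.length() - 1).replaceAll("\\\\(.)", "$1");
    }
    return value; // a bare token needs no unquoting
}

Both forms then reduce to utf-8, and per the spec the value comparison should also be case-insensitive.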

@rjernst (Member) commented Mar 15, 2018

The point is that we already treat the charset leniently for application/json.

The existence of leniency does not mean we should propagate or increase the leniency.

It's checking for the expected charset as a token but not as a quoted-string

I wasn't aware of the quoted strings/case-insensitive side of the spec (I see it now). I think that should be fixed here (or in a followup).

@jaymode @DaveCTurner Can you help me understand where your reluctance is to parse the parameters from content-type and at least verify they match what we actually support today? Adding parsing (as in this PR) will fix the original "x-ndjson is erroneously rejected" problem. But adding it at the cost of furthering leniency (intentionally ignoring the media type when we can verify it) doesn't seem to help anyone.

@jaymode (Member) commented Mar 15, 2018

Can you help me understand where your reluctance is to parse the parameters from content-type and at least verify they match what we actually support today?

I am reluctant to have parsing the parameters tied to this change, as we are holding up a fix for something broken in 6.x while we get the validation logic correct. Parameter values should be validated to ensure they are in the correct format (token or quoted string per RFC 7231), and we need to add tests for that validation. Additionally, we need to ensure we do this in the correct places, not just in one place for x-ndjson. Are there other parameters that we should accept or validate? All of these are missing from this PR right now, and I believe this should not be tossed in as part of a small fix from a community member that was originally submitted four months ago.
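The validation tests being asked for might look roughly like this (hypothetical parseMediaType helper; expectThrows as provided by the Elasticsearch test framework):

public void testContentTypeParameterValidation() {
    // the bare media type and a utf-8 charset (token or quoted-string) should all parse
    assertEquals("application/x-ndjson", parseMediaType("application/x-ndjson"));
    assertEquals("application/x-ndjson", parseMediaType("application/x-ndjson; charset=utf-8"));
    assertEquals("application/x-ndjson", parseMediaType("application/x-ndjson; charset=\"utf-8\""));
    // an unsupported charset should be rejected outright rather than silently ignored
    expectThrows(IllegalArgumentException.class,
        () -> parseMediaType("application/x-ndjson; charset=iso-8859-1"));
}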

@clintongormley changed the title from "Add ability to parse Content-Type from content type contains charset (#27065)" to "Add ability to parse Content-Type from content type contains charset" Apr 18, 2018
@jpountz added v5.6.11 and removed v5.6.10 labels Jun 13, 2018
@javanna (Member) commented Aug 16, 2018

Looking at the discussion above, it seems we reached no agreement on getting this in, so I think there's no point in keeping it open. Feel free to reopen if you disagree.

@javanna closed this Aug 16, 2018