
Feature/bugfix: make the HTTP client able to return HTTP chunks when chunked transfer encoding is used #2150

Merged · 22 commits · Sep 1, 2017

Conversation

@jlacoline (Contributor) commented Jul 31, 2017

What do these changes do?

These changes make the StreamReader.readchunk method return HTTP chunks when Transfer-Encoding: chunked is used.
The ._buffer attribute of StreamReader now contains the actual HTTP chunks instead of the TCP chunks received from the server (only when chunked encoding is used; otherwise it behaves as before).

Are there changes in behavior for the user?

Previous behaviour: in all cases, StreamReader.readchunk returns the oldest chunk, the chunks being the payload as it was received from the server over TCP (i.e. TCP chunks).

New behaviour: if chunked encoding is used, StreamReader.readchunk returns the oldest HTTP chunk. Otherwise it returns the oldest TCP chunk.
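For illustration, a minimal sketch of the new behaviour. This is hedged: resp and handle_chunk are placeholder names, and the exact return shape of readchunk is refined later in this thread to a (data, end_of_chunk) tuple.

import asyncio

@asyncio.coroutine
def consume(resp):
    # With this change, each readchunk() call on a chunked response
    # returns one HTTP chunk as sent by the server, not one TCP read.
    while True:
        chunk = yield from resp.content.readchunk()
        if not chunk:  # b'' signals end of stream
            break
        handle_chunk(chunk)  # hypothetical user callback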

Why?

The current implementation of the HTTP client does not take into account how the payload is split when chunked encoding is used. The chunks that are stored and returned to the user correspond to TCP chunks, which are virtually meaningless in HTTP communication.
This leads to problems if someone needs to retrieve the HTTP chunks as sent by the server: see for example issue #1899 (note that I work in the same team as @Anvil and experience the same problems with Docker response parsing).
Judging by the documentation, the iter_chunks generator was intended to yield HTTP chunks, not TCP chunks (related issue: #1949).

As @asvetlov argued here, the way the payload is chunked is a transport concern and should be transparent to the user.
However, some HTTP servers treat the HTTP chunks they send as atomic blocks of information that are meaningful by themselves (at least Docker does, and I suppose it is not the only one to have made that choice for data streaming). More precisely, Docker's streamed responses are chunks of JSON objects, and JSON decoding is done once per chunk, so in this scenario preserving chunk integrity is mandatory.

From my understanding, being able to retrieve HTTP chunks is a fairly common feature of HTTP clients (Docker clients use the requests package in the Python version and the native client in the Go version).
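To make the failure mode concrete, here is a minimal, self-contained sketch (the payload is one of the Docker status messages captured later in this thread): decoding per TCP read breaks as soon as a chunk is split, while the reassembled HTTP chunk decodes fine.

import json

# One status message as Docker sends it inside a single HTTP chunk:
chunk = b'{"status":"Downloading","progressDetail":{"current":32264,"total":1990402}}'

# The same bytes split at an arbitrary TCP boundary:
first_read, second_read = chunk[:40], chunk[40:]

try:
    json.loads(first_read.decode())              # a partial chunk is not valid JSON
except json.JSONDecodeError:
    print("partial TCP chunk failed to decode")

json.loads((first_read + second_read).decode())  # the full HTTP chunk decodes fine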

How?

The C parser fires the signals on_chunk_header and on_chunk_complete at the right moments, but no handlers were defined to act on those signals, so I added two handlers in _http_parser.pyx. These handlers essentially tell the StreamReader payload when to start and stop storing chunks.
I modified the Python parser to match this behaviour.
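The idea can be illustrated with a simplified, standalone sketch (not the actual aiohttp source; the method names follow the end_http_chunk_receiving call discussed later in this review, and the begin counterpart is assumed):

class ChunkAwareStream:
    """Simplified model of the bookkeeping this PR adds to StreamReader."""

    def __init__(self):
        self._buffer = bytearray()
        self._http_chunk_splits = []  # byte offsets where HTTP chunks end

    def begin_http_chunk_receiving(self):
        # Driven by the parser's on_chunk_header signal.
        pass

    def feed_data(self, data):
        self._buffer.extend(data)

    def end_http_chunk_receiving(self):
        # Driven by on_chunk_complete: record where this chunk ended so
        # that readchunk() can later return exactly one HTTP chunk.
        self._http_chunk_splits.append(len(self._buffer))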

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new news fragment into the changes folder
    • name it <issue_id>.<type> for example (588.bug)
    • if you don't have an issue_id change it to the pr id after creating the pr
    • ensure type is one of the following:
      • .feature: Signifying a new feature.
      • .bugfix: Signifying a bug fix.
      • .doc: Signifying a documentation improvement.
      • .removal: Signifying a deprecation or removal of public API.
      • .misc: A ticket has been closed, but it is not of interest to users.
    • Make sure to use full sentences with correct case and punctuation, for example: "Fix issue with non-ascii contents in doctest text files."

@asvetlov (Member)

I still don't like the feature.
What servers besides Docker rely on chunks?
Which Docker API relies on it?
https://docs.docker.com/engine/api/v1.30/#operation/ContainerAttach sends the exact size of every message explicitly.

@fafhrd91 (Member)

This was already discussed. This is not going to happen, not in this way.

I see two possible solutions:

  1. Each chunk is a separate stream.
  2. The parser emits an end-of-chunk marker. It might be a callback or a special value.

The solution should not affect users who do not care about transfer encoding.

@codecov-io commented Jul 31, 2017

Codecov Report

Merging #2150 into master will increase coverage by 0.04%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##           master   #2150      +/-   ##
=========================================
+ Coverage   97.15%   97.2%   +0.04%     
=========================================
  Files          39      39              
  Lines        7913    7945      +32     
  Branches     1371    1378       +7     
=========================================
+ Hits         7688    7723      +35     
+ Misses        100      98       -2     
+ Partials      125     124       -1
Impacted Files Coverage Δ
aiohttp/http_parser.py 97.77% <100%> (+0.52%) ⬆️
aiohttp/streams.py 99.06% <100%> (+0.31%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42361c5...39381ba. Read the comment docs.

@jlacoline (Contributor, Author)

@asvetlov The "attach" endpoint triggers an HTTP Upgrade, so that one is not about chunked encoding.
/images/create is an endpoint that shows the behaviour I mentioned (the docker pull command). The progress of the image download is sent in several JSON chunks that look like this (a capture of my TCP packets during streaming):

T ::1:2376 -> ::1:52905 [AP]
b9.
{"status":"Downloading","progressDetail":{"current":32264,"total":1990402},"progress":"[\u003e                                                  ]  32.26kB/1.99MB","id":"88286f41530e"}.
.


T ::1:2376 -> ::1:52905 [AP]
ba.
{"status":"Downloading","progressDetail":{"current":130279,"total":1990402},"progress":"[===\u003e                                               ]  130.3kB/1.99MB","id":"88286f41530e"}.
.


T ::1:2376 -> ::1:52905 [AP]
ba.
{"status":"Downloading","progressDetail":{"current":228583,"total":1990402},"progress":"[=====\u003e                                             ]  228.6kB/1.99MB","id":"88286f41530e"}.
.

@fafhrd91 The solution I propose is not the same as the one @Anvil implemented earlier; the current version of the C parser uses your second solution.
As for user impact, I am not sure there is a problem with my implementation: basically, ._chunk_buffer is filled temporarily instead of ._buffer in the StreamReader class. I could add a size limit to avoid problems, if that is what you are suggesting.

@fafhrd91 (Member)

With your implementation I could write a very simple script that would flood the server's memory, and the application would not be able to react.

@fafhrd91 (Member)

As for impact, everyone would suddenly need to care about transfer encoding.

@jlacoline (Contributor, Author)

If the script you suggest consists of sending huge chunk sizes, isn't the problem exactly the same as clients sending huge Content-Length values? I guess it should be handled the same way?
Actually, I do not know what the current code does to prevent this, but I suppose I could apply the same logic.

@fafhrd91 (Member) commented Jul 31, 2017

With Content-Length, the application has full control over how to handle huge payloads: it can close the connection, pause it, etc. With this patch, the application cannot make any decision. I think this is the biggest problem.

@asvetlov (Member)

I like the suggested approach of emitting a special token at the end of each chunk if an opt-in parameter is passed.

@fafhrd91 (Member)

I don't think we can make it in a backward-compatible way; the payload object is a byte stream.

We could make the payload configurable and use something like DataQueue.

@fafhrd91 (Member)

Or we can just break readchunk.

@asvetlov (Member)

I don't care about backward compatibility of readchunk -- it is almost useless now and not widely used, I'm pretty sure.

@fafhrd91 (Member)

ok, good.

@jlacoline, would you like to rewrite your PR?

@asvetlov (Member) commented Jul 31, 2017

@fafhrd91, to be clear: the proposal is to rewrite readchunk to emit either bytes or an END_CHUNK token on reaching the end of a chunk. All other methods should skip END_CHUNK and silently concatenate the bytes from the stream.
Do I understand you properly?

@fafhrd91 (Member) commented Jul 31, 2017

Yes, that is right. Maybe it should return a tuple: (data, eof_chunk).
Otherwise users would need to check isinstance(data, END_OF_CHUNK).

@fafhrd91 (Member)

BTW, do we need a parameter at all if we do not care about backward compatibility?
Let's just change the method.

@asvetlov (Member)

I think not; changing the method's behavior is OK with me.

@fafhrd91 (Member)

ok, good. Then +1 for the new proposal.

@asvetlov (Member)

I'm not sure about returning a tuple.
A data is END_OF_CHUNK check is pretty fast; there is no need for an isinstance check.

@fafhrd91 (Member)

Changing the type of the returned data doesn't seem very good, but that is a matter of taste.
Here is the JSON case:

buffer = b''
while True:
    data, end_of_chunk = yield from payload.readchunk()
    buffer += data
    if end_of_chunk:
        decode_json_and_do_something(buffer)
        buffer = b''

as opposed to:

buffer = b''
while True:
    data = yield from payload.readchunk()
    if data is END_OF_CHUNK:
        decode_json_and_do_something(buffer)
        buffer = b''
    else:
        buffer += data

The documentation seems simpler with the tuple, as we would only need to describe each field of the tuple.
But that is up to you.

@jlacoline (Contributor, Author)

I am OK with this new logic; it seems to be a good compromise between usability and security.
About iter_chunks(): since it uses readchunk(), the generator will yield tuples of (data, end_of_chunk), and it will be up to the user to concatenate the different parts to get full chunks (maybe add a helper function later that does it with a size limit?). A consumption sketch follows below.
I will update the PR when I get some time (before the end of the week, I think).
As for the choice between a tuple and a special value, the two seem pretty equivalent, so tell me which one you prefer.
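For reference, a sketch of consuming the new (data, end_of_chunk) interface through iter_chunks(); handle_http_chunk is a hypothetical user callback, and the URL handling is only illustrative.

import aiohttp

async def read_http_chunks(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            buffer = b''
            # iter_chunks() yields (data, end_of_chunk) tuples after this PR.
            async for data, end_of_chunk in resp.content.iter_chunks():
                buffer += data
                if end_of_chunk:
                    handle_http_chunk(buffer)  # hypothetical user callback
                    buffer = b''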

@asvetlov (Member) commented Aug 1, 2017

After careful thinking, I agree that a (bytes, bool) tuple is better: it breaks existing code fast, so users will upgrade without a long period of pain.
And yes, let's leave the concatenation of partial chunks up to the user.
It is a low-level API anyway.
Moreover, a helper with a limit wouldn't help: the user would still need concatenation code whenever a chunk exceeds the limit.

@jlacoline (Contributor, Author)

I updated the code to make readchunk() behave as specified; ready for your feedback.
However, I do not understand what caused the CI job to fail; the failing test does not seem to be related to my changes.

@jlacoline (Contributor, Author)

ping @asvetlov @fafhrd91: are you ok with the changes?

@asvetlov (Member)

Will review in a few hours.

@asvetlov (Member) left a comment:

Please reflect your changes in docs/client_reference.rst.
Use the .. versionchanged:: 2.3 sphinx tag.

Also, please add the missing tests.

I highly recommend installing the Codecov browser extension (https://docs.codecov.io/v4.3.6/docs/browser-extension) -- it shows coverage results right in GitHub's "Files changed" tab.

class ChunkTupleAsyncStreamIterator(AsyncStreamIterator):
    @asyncio.coroutine
    def __anext__(self):
        try:
@asvetlov (Member): Please add tests covering these lines.

@jlacoline (Contributor, Author): done


        Python-3.5 available for Python 3.5+ only
        """
-       return AsyncStreamIterator(self.readchunk)
+       return ChunkTupleAsyncStreamIterator(self.readchunk)
@asvetlov (Member): Add a test for the call.

@jlacoline (Contributor, Author): done

elif self._http_chunk_splits is not None:
    if self._http_chunk_splits:
        pos = self._http_chunk_splits[0]
        if pos > self._cursor:
@asvetlov (Member): Add a test for the case where pos > self._cursor does not hold.

@jlacoline (Contributor, Author): FTR, this condition was here because I was afraid of what would happen if calls other than readchunk were made beforehand. So I made a small modification to the condition here and added tests for those situations, covering the cases where stream.unread_data and stream.read are used before stream.readchunk.

if self._exception is not None:
    raise self._exception

if not self._buffer and not self._eof:
    # end of http chunk without available data
    if self._http_chunk_splits and \
@asvetlov (Member): Replace the backslash with brackets.

@jlacoline (Contributor, Author): done

@@ -547,18 +548,17 @@ def feed_data(self, chunk, SEP=b'\r\n', CHUNK_EXT=b';'):
         required = self._chunk_size
         chunk_len = len(chunk)

-        if required >= chunk_len:
+        if required > chunk_len:
@asvetlov (Member): Please explain why you replaced >= with >.

@jlacoline (Contributor, Author): If required == chunk_len, then self._chunk_size == 0 and the "else" branch does what is needed.
So lines 552 and 553 looked like duplicated code, and deduplicating them allowed me to call self.payload.end_http_chunk_receiving() in only one place.

However, the code now continues past the "else", as opposed to the previous version where (False, None) was returned immediately. So the behaviour is not exactly the same, even though the same result is eventually returned.

IMO it looks better this way, but tell me if you want me to revert to the previous version.

Also, I'm going to remove the useless blank line (554).

@asvetlov (Member): ok, fine with the change

@jlacoline (Contributor, Author)

Thanks for the review. I updated my branch according to your feedback.
I will modify docs/client_reference.rst when I get some time, probably next week.

@asvetlov (Member)

Good. Please drop a message when you are ready for the next review.

@jlacoline (Contributor, Author)

@asvetlov ready for next review

@jlacoline (Contributor, Author)

@asvetlov pinging you again in case you didn't see the last message

@asvetlov (Member) left a comment:

lgtm, thanks


@lock (bot) commented Oct 28, 2019

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel there are important points made in this discussion, please include those excerpts in the new issue.
