
Feature/bugfix: make the HTTP client able to return HTTP chunks when chunked transfer encoding is used #2150

Merged · 22 commits · Sep 1, 2017

Conversation

@jlacoline (Contributor) commented Jul 31, 2017

What do these changes do?

These changes make the StreamReader.readchunk method return HTTP chunks when Transfer-Encoding: chunked is used.
The ._buffer attribute of StreamReader now contains the actual HTTP chunks instead of the TCP chunks received from the server (only when chunked encoding is used; otherwise it behaves as before).

Are there changes in behavior for the user?

Previous behaviour: in all cases, StreamReader.readchunk returns the oldest chunk, the chunks being the payload as it was received from the server over TCP (i.e. TCP chunks).

New behaviour: if chunked encoding is used, StreamReader.readchunk returns the oldest HTTP chunk. Otherwise it returns the oldest TCP chunk.
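For illustration, a minimal sketch of the new behaviour. This is hedged: resp and handle_chunk are placeholder names, and the exact return shape of readchunk is refined later in this thread to a (data, end_of_chunk) tuple.

import asyncio

@asyncio.coroutine
def consume(resp):
    # With this change, each readchunk() call on a chunked response
    # returns one HTTP chunk as sent by the server, not one TCP read.
    while True:
        chunk = yield from resp.content.readchunk()
        if not chunk:  # b'' signals end of stream
            break
        handle_chunk(chunk)  # hypothetical user callback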

Why?

The current implementation of the HTTP client does not take into account how the payload is split when chunked encoding is used. The chunks that are stored and returned to the user correspond to TCP chunks, which are virtually meaningless in HTTP communication.
This leads to problems if someone needs to retrieve the HTTP chunks as sent by the server: see for example issue #1899 (note that I work in the same team as @Anvil and experience the same problems with Docker response parsing).
Judging by the documentation, the iter_chunks generator was intended to yield HTTP chunks, not TCP chunks (related issue: #1949).

As @asvetlov argued here, the way the payload is chunked is a transport concern and should be transparent to the user.
However, some HTTP servers treat the HTTP chunks they send as atomic blocks of information that are meaningful by themselves (at least Docker does, and I suppose it is not the only one to have made that choice for data streaming). More precisely, Docker's streamed responses are chunks of JSON objects, and JSON decoding is done once per chunk, so in this scenario preserving chunk integrity is mandatory.

From my understanding, being able to retrieve HTTP chunks is a fairly common feature of HTTP clients (Docker clients use the requests package in the Python version and the native client in the Go version).
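To make the failure mode concrete, here is a minimal, self-contained sketch (the payload is one of the Docker status messages captured later in this thread): decoding per TCP read breaks as soon as a chunk is split, while the reassembled HTTP chunk decodes fine.

import json

# One status message as Docker sends it inside a single HTTP chunk:
chunk = b'{"status":"Downloading","progressDetail":{"current":32264,"total":1990402}}'

# The same bytes split at an arbitrary TCP boundary:
first_read, second_read = chunk[:40], chunk[40:]

try:
    json.loads(first_read.decode())              # a partial chunk is not valid JSON
except json.JSONDecodeError:
    print("partial TCP chunk failed to decode")

json.loads((first_read + second_read).decode())  # the full HTTP chunk decodes fine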

How?

The C parser fires the signals on_chunk_header and on_chunk_complete at the right moments, but no handlers were defined to act on those signals, so I added two handlers in _http_parser.pyx. These handlers essentially tell the StreamReader payload when to start and stop storing chunks.
I modified the Python parser to match this behaviour.
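The idea can be illustrated with a simplified, standalone sketch (not the actual aiohttp source; the method names follow the end_http_chunk_receiving call discussed later in this review, and the begin counterpart is assumed):

class ChunkAwareStream:
    """Simplified model of the bookkeeping this PR adds to StreamReader."""

    def __init__(self):
        self._buffer = bytearray()
        self._http_chunk_splits = []  # byte offsets where HTTP chunks end

    def begin_http_chunk_receiving(self):
        # Driven by the parser's on_chunk_header signal.
        pass

    def feed_data(self, data):
        self._buffer.extend(data)

    def end_http_chunk_receiving(self):
        # Driven by on_chunk_complete: record where this chunk ended so
        # that readchunk() can later return exactly one HTTP chunk.
        self._http_chunk_splits.append(len(self._buffer))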

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new news fragment into the changes folder
    • name it <issue_id>.<type> for example (588.bug)
    • if you don't have an issue_id change it to the pr id after creating the pr
    • ensure type is one of the following:
      • .feature: Signifying a new feature.
      • .bugfix: Signifying a bug fix.
      • .doc: Signifying a documentation improvement.
      • .removal: Signifying a deprecation or removal of public API.
      • .misc: A ticket has been closed, but it is not of interest to users.
    • Make sure to use full sentences with correct case and punctuation, for example: "Fix issue with non-ascii contents in doctest text files."

@asvetlov (Member)

I still don't like the feature.
What servers besides Docker rely on chunks?
Which Docker API relies on it?
https://docs.docker.com/engine/api/v1.30/#operation/ContainerAttach sends the exact size of every message explicitly.

@fafhrd91 (Member)

This was already discussed. This is not going to happen, not in this way.

I see two possible solutions:

  1. Each chunk is a separate stream.
  2. The parser emits an end-of-chunk marker. It might be a callback or a special value.

The solution should not affect users who do not care about transfer encoding.

@codecov-io commented Jul 31, 2017

Codecov Report

Merging #2150 into master will increase coverage by 0.04%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##           master   #2150      +/-   ##
=========================================
+ Coverage   97.15%   97.2%   +0.04%     
=========================================
  Files          39      39              
  Lines        7913    7945      +32     
  Branches     1371    1378       +7     
=========================================
+ Hits         7688    7723      +35     
+ Misses        100      98       -2     
+ Partials      125     124       -1
Impacted Files Coverage Δ
aiohttp/http_parser.py 97.77% <100%> (+0.52%) ⬆️
aiohttp/streams.py 99.06% <100%> (+0.31%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42361c5...39381ba. Read the comment docs.

@jlacoline (Contributor, Author)

@asvetlov The "attach" endpoint triggers an HTTP Upgrade, so that one is not about chunked encoding.
/images/create is an endpoint that shows the behaviour I mentioned (the docker pull command). The progress of the image download is sent in several JSON chunks that look like this (a capture of my TCP packets during streaming):

T ::1:2376 -> ::1:52905 [AP]
b9.
{"status":"Downloading","progressDetail":{"current":32264,"total":1990402},"progress":"[\u003e                                                  ]  32.26kB/1.99MB","id":"88286f41530e"}.
.


T ::1:2376 -> ::1:52905 [AP]
ba.
{"status":"Downloading","progressDetail":{"current":130279,"total":1990402},"progress":"[===\u003e                                               ]  130.3kB/1.99MB","id":"88286f41530e"}.
.


T ::1:2376 -> ::1:52905 [AP]
ba.
{"status":"Downloading","progressDetail":{"current":228583,"total":1990402},"progress":"[=====\u003e                                             ]  228.6kB/1.99MB","id":"88286f41530e"}.
.

@fafhrd91 The solution I propose is not the same as the one @Anvil implemented earlier; the current version of the C parser uses your second solution.
As for user impact, I am not sure there is a problem with my implementation: basically, ._chunk_buffer is filled temporarily instead of ._buffer in the StreamReader class. I could add a size limit to avoid problems, if that is what you are suggesting.

@fafhrd91 (Member)

With your implementation I could write a very simple script that would flood the server's memory, and the application would not be able to react.

@fafhrd91 (Member)

As for impact, everyone would suddenly need to care about transfer encoding.

@jlacoline (Contributor, Author)

If the script you suggest consists of sending huge chunk sizes, isn't the problem exactly the same as clients sending huge Content-Length values? I guess it should be handled the same way?
Actually, I do not know what the current code does to prevent this, but I suppose I could apply the same logic.

@fafhrd91 (Member) commented Jul 31, 2017

With Content-Length, the application has full control over how to handle huge payloads: it can close the connection, pause it, etc. With this patch, the application cannot make any decision. I think this is the biggest problem.

@asvetlov (Member)

I like the suggested approach of emitting a special token at the end of each chunk if an opt-in parameter is passed.

@fafhrd91 (Member)

I don't think we can make it in a backward-compatible way; the payload object is a byte stream.

We could make the payload configurable and use something like DataQueue.

@fafhrd91 (Member)

Or we can just break readchunk.

@asvetlov (Member)

I don't care about backward compatibility of readchunk -- it is almost useless now and not widely used, I'm pretty sure.

@fafhrd91 (Member)

ok, good.

@jlacoline, would you like to rewrite your PR?

@asvetlov (Member) commented Jul 31, 2017

@fafhrd91, to be clear: the proposal is to rewrite readchunk to emit either bytes or an END_CHUNK token on reaching the end of a chunk. All other methods should skip END_CHUNK and silently concatenate the bytes from the stream.
Do I understand you properly?

@fafhrd91 (Member) commented Jul 31, 2017

Yes, that is right. Maybe it should return a tuple: (data, eof_chunk).
Otherwise users would need to check isinstance(data, END_OF_CHUNK).

@fafhrd91 (Member)

BTW, do we need a parameter at all if we do not care about backward compatibility?
Let's just change the method.

@asvetlov (Member)

I think not; changing the method's behavior is OK with me.

@fafhrd91 (Member)

ok, good. Then +1 for the new proposal.

@asvetlov (Member)

I'm not sure about returning a tuple.
A data is END_OF_CHUNK check is pretty fast; there is no need for an isinstance check.

@fafhrd91 (Member)

Changing the type of the returned data doesn't seem very good, but that is a matter of taste.
Here is the JSON case:

buffer = b''
while True:
    data, end_of_chunk = yield from payload.readchunk()
    buffer += data
    if end_of_chunk:
        decode_json_and_do_something(buffer)
        buffer = b''

as opposed to:

buffer = b''
while True:
    data = yield from payload.readchunk()
    if data is END_OF_CHUNK:
        decode_json_and_do_something(buffer)
        buffer = b''
    else:
        buffer += data

The documentation seems simpler with the tuple, as we would only need to describe each field of the tuple.
But that is up to you.

@jlacoline (Contributor, Author)

I am OK with this new logic; it seems to be a good compromise between usability and security.
About iter_chunks(): since it uses readchunk(), the generator will yield tuples of (data, end_of_chunk), and it will be up to the user to concatenate the different parts to get full chunks (maybe add a helper function later that does it with a size limit?). A consumption sketch follows below.
I will update the PR when I get some time (before the end of the week, I think).
As for the choice between a tuple and a special value, the two seem pretty equivalent, so tell me which one you prefer.
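For reference, a sketch of consuming the new (data, end_of_chunk) interface through iter_chunks(); handle_http_chunk is a hypothetical user callback, and the URL handling is only illustrative.

import aiohttp

async def read_http_chunks(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            buffer = b''
            # iter_chunks() yields (data, end_of_chunk) tuples after this PR.
            async for data, end_of_chunk in resp.content.iter_chunks():
                buffer += data
                if end_of_chunk:
                    handle_http_chunk(buffer)  # hypothetical user callback
                    buffer = b''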

@asvetlov (Member) commented Aug 1, 2017

After careful thinking, I agree that a (bytes, bool) tuple is better: it breaks existing code fast, so users will upgrade without a long period of pain.
And yes, let's leave the concatenation of partial chunks up to the user.
It is a low-level API anyway.
Moreover, a helper with a limit wouldn't help: the user would still need concatenation code whenever a chunk exceeds the limit.

@jlacoline (Contributor, Author)

I updated the code to make readchunk() behave as specified; ready for your feedback.
However, I do not understand what caused the CI job to fail; the failing test does not seem to be related to my changes.

@jlacoline (Contributor, Author)

ping @asvetlov @fafhrd91: are you ok with the changes?

@asvetlov (Member)

Will review in a few hours.

@asvetlov (Member) left a comment:

Please reflect your changes in docs/client_reference.rst.
Use the .. versionchanged:: 2.3 sphinx tag.

Also, please add the missing tests.

I highly recommend installing the Codecov browser extension (https://docs.codecov.io/v4.3.6/docs/browser-extension) -- it shows coverage results right in GitHub's "Files changed" tab.

class ChunkTupleAsyncStreamIterator(AsyncStreamIterator):
    @asyncio.coroutine
    def __anext__(self):
        try:
@asvetlov (Member): Please add tests covering these lines.

@jlacoline (Contributor, Author): done


        Python-3.5 available for Python 3.5+ only
        """
-       return AsyncStreamIterator(self.readchunk)
+       return ChunkTupleAsyncStreamIterator(self.readchunk)
@asvetlov (Member): Add a test for the call.

@jlacoline (Contributor, Author): done

elif self._http_chunk_splits is not None:
    if self._http_chunk_splits:
        pos = self._http_chunk_splits[0]
        if pos > self._cursor:
@asvetlov (Member): Add a test for the case where pos > self._cursor does not hold.

@jlacoline (Contributor, Author): FTR, this condition was here because I was afraid of what would happen if calls other than readchunk were made beforehand. So I made a small modification to the condition here and added tests for those situations, covering the cases where stream.unread_data and stream.read are used before stream.readchunk.

if self._exception is not None:
    raise self._exception

if not self._buffer and not self._eof:
    # end of http chunk without available data
    if self._http_chunk_splits and \
@asvetlov (Member): Replace the backslash with brackets.

@jlacoline (Contributor, Author): done

@@ -547,18 +548,17 @@ def feed_data(self, chunk, SEP=b'\r\n', CHUNK_EXT=b';'):
         required = self._chunk_size
         chunk_len = len(chunk)

-        if required >= chunk_len:
+        if required > chunk_len:
@asvetlov (Member): Please explain why you replaced >= with >.

@jlacoline (Contributor, Author): If required == chunk_len, then self._chunk_size == 0 and the "else" branch does what is needed.
So lines 552 and 553 looked like duplicated code, and deduplicating them allowed me to call self.payload.end_http_chunk_receiving() in only one place.

However, the code now continues past the "else", as opposed to the previous version where (False, None) was returned immediately. So the behaviour is not exactly the same, even though the same result is eventually returned.

IMO it looks better this way, but tell me if you want me to revert to the previous version.

Also, I'm going to remove the useless blank line (554).

@asvetlov (Member): ok, fine with the change

@jlacoline (Contributor, Author)

Thanks for the review. I updated my branch according to your feedback.
I will modify docs/client_reference.rst when I get some time, probably next week.

@asvetlov (Member)

Good. Please drop a message when you are ready for the next review.

@jlacoline (Contributor, Author)

@asvetlov ready for next review

@jlacoline (Contributor, Author)

@asvetlov pinging you again in case you didn't see the last message

@asvetlov (Member) left a comment:

lgtm, thanks


@lock (bot) commented Oct 28, 2019

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel there are important points made in this discussion, please include those excerpts in the new issue.
