
Low RPS with lots of response data + chunking #572

Closed

rynowak opened this issue Jan 11, 2016 · 15 comments

@rynowak
Member

rynowak commented Jan 11, 2016

I'm only able to get about 500-550 rps using the following middleware over the loopback interface. I can get 1200+ with a content length set.

private readonly byte[] bytes = Encoding.UTF8.GetBytes(new string('a', 1024));

...

app.Run(async (context) =>
{
    for (var i = 0; i < 300; i++)
    {
        await context.Response.Body.WriteAsync(bytes, 0, bytes.Length);
    }
});

Note that this is intentionally exceeding the write-behind buffer and going through the chunked path. This only gets the CPU to ~15%.

This is a synthetic version of a Razor benchmark I've been trying to improve.
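
For reference, the 1200+ rps comparison presumably declares the total length up front so the response skips the chunked path. A rough sketch of that variant (the exact wiring here is an assumption, not code from this benchmark):

app.Run(async (context) =>
{
    // Declaring the full length lets the server send a Content-Length header
    // instead of using chunked transfer encoding.
    context.Response.ContentLength = bytes.Length * 300;
    for (var i = 0; i < 300; i++)
    {
        await context.Response.Body.WriteAsync(bytes, 0, bytes.Length);
    }
});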

@benaadams
Contributor

@rynowak while you're there (in loopback)... does this make any difference? #416

Though you may need to change your test end to match the changes in the tests.

@benaadams
Contributor

Also, could you share a repo/gist for the client - I'd be interested to try it.

@halter73
Member

After talking to @rynowak, there are two major suggestions. One is to allocate less. I have a commit that avoids an array allocation in WriteBeginChunkBytes, but when I tested that change by itself it made no significant difference in perf, even with hundreds of chunked writes in a response.

The second suggestion is to delay the actual chunking until right before we make the call to uv_write. The benefit would be that we could write fewer, larger chunks. This would require some architectural changes, since the code at that layer would need to know where the response body begins and ends, which it currently doesn't.

For these reasons, I think this should be done post-RTM.
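
To illustrate the "allocate less" idea (a sketch only; the name and shape here are assumptions, not Kestrel's actual BeginChunkBytes): the "<size-in-hex>\r\n" chunk prefix can be formatted into a reusable buffer instead of allocating a new byte[] per write.

// Formats the chunk-size prefix into a caller-supplied buffer and returns
// the number of prefix bytes written, so no array is allocated per chunk.
private static int WriteChunkPrefix(byte[] buffer, int dataLength)
{
    const string hex = "0123456789abcdef";
    var pos = 0;
    var started = false;

    // Emit the length in hex, most significant nibble first, skipping leading zeros.
    for (var shift = 28; shift >= 0; shift -= 4)
    {
        var nibble = (dataLength >> shift) & 0xF;
        if (nibble != 0 || started || shift == 0)
        {
            buffer[pos++] = (byte)hex[nibble];
            started = true;
        }
    }

    buffer[pos++] = (byte)'\r';
    buffer[pos++] = (byte)'\n';
    return pos;
}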

@rynowak
Member Author

rynowak commented Jan 14, 2016

I want to add to the discussion that Razor is always going to do chunking. We probably need to get a sense of how big a page has to be before performance degrades, and just how degraded it gets.

@benaadams
Contributor

Using a variation of the synthetic benchmark with a wrk -> Windows setup, I'm seeing a lot of blocking on the sync WriteChunkedResponseSuffix, which waits on an async write further down the stack; to a degree that I'd say it needs to be fixed pre-RC2.

Think it's resolvable though; will try to have a fix before the end of the weekend.
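
To show the pattern in question (hypothetical shape, not the actual Kestrel code): the suffix write is effectively sync-over-async, so a thread blocks per connection until the socket write completes; the fix is to return the task and await it further up.

using System.IO;
using System.Text;
using System.Threading.Tasks;

static readonly byte[] EndChunkedResponseBytes = Encoding.ASCII.GetBytes("0\r\n\r\n");

// Before: sync-over-async - the calling thread blocks until the write completes.
static void WriteChunkedResponseSuffix(Stream output)
{
    output.WriteAsync(EndChunkedResponseBytes, 0, EndChunkedResponseBytes.Length).Wait();
}

// After: let the asynchrony flow up so nothing sits blocked on the write.
static Task WriteChunkedResponseSuffixAsync(Stream output)
{
    return output.WriteAsync(EndChunkedResponseBytes, 0, EndChunkedResponseBytes.Length);
}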

@benaadams
Contributor

Think I've got a fix for this; will see if there are any other areas that I can tweak. But wow!

With the change the RPS is only up to 4,573, which I still didn't think was that impressive; but then I looked at the data rate and I've never seen Kestrel go so high!

11.6Gbps

It's outputting 11.6Gbps

$ wrk -c 1024 -t 32 http://.../plaintext
Running 10s test @ http://.../plaintext
  32 threads and 1024 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   234.79ms  213.33ms   1.95s    74.10%
    Req/Sec   150.19     51.13   350.00     72.38%
  46190 requests in 10.10s, 13.34GB read
  Socket errors: connect 35, read 0, write 0, timeout 40
Requests/sec:   4573.22
Transfer/sec:      1.32GB

Though it is at 32% CPU, so just over 5 cores of a 16-core machine to do it.
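
As a sanity check on those numbers: 46,190 requests x ~300 KiB of body each is roughly 13.3 GiB, which matches the 13.34GB wrk reports; and 13.34GB read over 10.10s is about 1.32GB/s, i.e. on the order of 11 Gbit/s of response data, so the RPS and the data rate are consistent with each other.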

@benaadams
Contributor

Testing other connection amounts

16 connections is 112 rps (33.23MB/s or 265.84Mbit/s)
32 connections is 437 rps (129.04MB/s or 1.03Gbit/s)

So 500 rps is over 1 Gbit/s..!
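
(For scale: each response body is 300 x 1024 B ≈ 307KB, so 437 rps works out to roughly 134MB/s ≈ 1.07Gbit/s before headers and chunk framing, which lines up with the measured 1.03Gbit/s; the original ~500 rps over loopback was therefore already pushing ~1.2Gbit/s.)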

@benaadams
Contributor

Next hotspot is CopyFrom; adding #585 gives higher peaks, but CPU drops are also at play (GC).
After that the hotspot is BeginChunkBytes, and adding @halter73's change on top resolves the GC.

The two changes give an additional 123 rps (4,573 -> 4,694), though that is another ~40MB/s (~0.3Gbit/s). I have a feeling the network is saturated at this point.

$ wrk -c 2048 -t 10 http://.../plaintext
Running 10s test @ http://.../plaintext
  32 threads and 1024 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   234.66ms  223.09ms   1.99s    79.57%
    Req/Sec   149.36     54.00   400.00     72.38%
  47420 requests in 10.10s, 13.70GB read
  Socket errors: connect 35, read 0, write 0, timeout 38
Requests/sec:   4694.73
Transfer/sec:      1.36GB

Then it looks like we are back to the usual suspects, which are post-RTM?

(Three combined are https://github.com/benaadams/KestrelHttpServer/tree/chunking )

@benaadams
Contributor

Running a longer 5m test to be sure; turns out better:

Peaking at 12.6Gbps, average range 11.5 - 12.4 Gbps; 2 very brief GC drops over 5 mins.
CPU average 36% (Azure G4 - 16 cores)

$ wrk -c 1024 -t 32 -d 320 http://.../plaintext
Running 5m test @ http://.../plaintext
  32 threads and 1024 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   233.42ms  216.38ms   2.00s    78.30%
    Req/Sec   150.67     49.75   410.00     70.71%
  1531898 requests in 5.33m, 441.46GB read
  Socket errors: connect 35, read 0, write 0, timeout 2037
Requests/sec:   4785.69
Transfer/sec:      1.38GB

@benaadams
Contributor

A single connection is interesting as its bandwidth is quite variable, 17Mbps - 62Mbps; there might be more optimal chunk sizes than 1024B. Though it does use 0% CPU, apparently.
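
If larger chunks do turn out to matter, one app-side way to get them (a hypothetical workaround, not something tried in this thread) is to batch the small writes through a bigger buffer, so each write that reaches the response, and therefore each chunk on the wire, is larger:

var buffered = new BufferedStream(context.Response.Body, 16 * 1024);
for (var i = 0; i < 300; i++)
{
    // 1KB writes accumulate in the 16KB buffer...
    await buffered.WriteAsync(bytes, 0, bytes.Length);
}
// ...and reach the response in larger writes, which should produce larger chunks.
await buffered.FlushAsync();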

@halter73
Member

@benaadams How did you modify the plaintext benchmark to do chunking? What were your wrk numbers before the change?

@benaadams
Contributor

See #589 (comment)

Basically before, it died... as it got caught up in the chunk writing going sync - I think that might have been the delay added by the network vs loopback?

Once the chunking also went async it was fine - that was the main change for this.

@benaadams
Contributor

Might have also been the effect of running 1024 connections and them all flipping to sync.

@halter73
Member

Yikes. I now see that this was the key change. I knew we weren't awaiting the suffix, but I didn't think we were blocking either. I guess you wouldn't notice until you surpassed the write-behind buffer. Good catch!

@halter73
Member

@benaadams' change has now been merged: d3d9c8d

I think this is the major cause of the low RPS with lots of response data + chunking.

@rynowak Reopen this if you think there is more that needs to be done.
