
Low RPS with lots of response data + chunking #572

Closed

rynowak opened this issue Jan 11, 2016 · 15 comments

@rynowak
Member

rynowak commented Jan 11, 2016

I'm only able to get about 500-550 rps using the following middleware over the loopback interface. I can get 1200+ with a content length set.

private readonly byte[] bytes = Encoding.UTF8.GetBytes(new string('a', 1024));

...

app.Run(async (context) =>
{
    for (var i = 0; i < 300; i++)
    {
        await context.Response.Body.WriteAsync(bytes, 0, bytes.Length);
    }
});

Note that this is intentionally exceeding the write-behind buffer and going through the chunked path. This only gets the CPU to ~15%.

This is a synthetic version of a Razor benchmark I've been trying to improve.
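
For reference, the 1200+ rps comparison presumably declares the total length up front so the response skips the chunked path. A rough sketch of that variant (the exact wiring here is an assumption, not code from this benchmark):

app.Run(async (context) =>
{
    // Declaring the full length lets the server send a Content-Length header
    // instead of using chunked transfer encoding.
    context.Response.ContentLength = bytes.Length * 300;
    for (var i = 0; i < 300; i++)
    {
        await context.Response.Body.WriteAsync(bytes, 0, bytes.Length);
    }
});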

@benaadams
Contributor

@rynowak while you're there (in loopback)... does this make any difference? #416

Though you may need to change your test end to match the changes in the tests.

@benaadams
Contributor

Also, could you share a repo/gist for the client - I'd be interested to try it.

@halter73
Member

After talking to @rynowak, there are two major suggestions. One is to allocate less. I have a commit that avoids an array allocation in WriteBeginChunkBytes, but when I tested that change by itself it made no significant difference in perf, even with hundreds of chunked writes in a response.

The second suggestion is to delay the actual chunking until right before we make the call to uv_write. The benefit would be that we could write fewer, larger chunks. This would require some architectural changes, since the code at that layer would need to know where the response body begins and ends, which it currently doesn't.

For these reasons, I think this should be done post-RTM.
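
To illustrate the "allocate less" idea (a sketch only; the name and shape here are assumptions, not Kestrel's actual BeginChunkBytes): the "<size-in-hex>\r\n" chunk prefix can be formatted into a reusable buffer instead of allocating a new byte[] per write.

// Formats the chunk-size prefix into a caller-supplied buffer and returns
// the number of prefix bytes written, so no array is allocated per chunk.
private static int WriteChunkPrefix(byte[] buffer, int dataLength)
{
    const string hex = "0123456789abcdef";
    var pos = 0;
    var started = false;

    // Emit the length in hex, most significant nibble first, skipping leading zeros.
    for (var shift = 28; shift >= 0; shift -= 4)
    {
        var nibble = (dataLength >> shift) & 0xF;
        if (nibble != 0 || started || shift == 0)
        {
            buffer[pos++] = (byte)hex[nibble];
            started = true;
        }
    }

    buffer[pos++] = (byte)'\r';
    buffer[pos++] = (byte)'\n';
    return pos;
}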

@rynowak
Member Author

rynowak commented Jan 14, 2016

I want to add to the discussion that Razor is always going to do chunking. We probably need to get a sense of how big a page has to be before performance degrades, and just how degraded it gets.

@benaadams
Contributor

Using a variation of the synthetic benchmark with a wrk -> Windows setup, I'm seeing a lot of blocking on the sync WriteChunkedResponseSuffix, which waits on an async write further down the stack; to a degree that I'd say it needs to be fixed pre-RC2.

Think it's resolvable though; will try to have a fix before the end of the weekend.
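
To show the pattern in question (hypothetical shape, not the actual Kestrel code): the suffix write is effectively sync-over-async, so a thread blocks per connection until the socket write completes; the fix is to return the task and await it further up.

using System.IO;
using System.Text;
using System.Threading.Tasks;

static readonly byte[] EndChunkedResponseBytes = Encoding.ASCII.GetBytes("0\r\n\r\n");

// Before: sync-over-async - the calling thread blocks until the write completes.
static void WriteChunkedResponseSuffix(Stream output)
{
    output.WriteAsync(EndChunkedResponseBytes, 0, EndChunkedResponseBytes.Length).Wait();
}

// After: let the asynchrony flow up so nothing sits blocked on the write.
static Task WriteChunkedResponseSuffixAsync(Stream output)
{
    return output.WriteAsync(EndChunkedResponseBytes, 0, EndChunkedResponseBytes.Length);
}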

@benaadams
Contributor

Think I've got a fix for this; will see if there are any other areas that I can tweak. But wow!

With the change the RPS is only up to 4,573, which I still didn't think was that impressive; but then I looked at the data rate and I've never seen Kestrel go so high!

11.6Gbps

It's outputting 11.6Gbps

$ wrk -c 1024 -t 32 http://.../plaintext
Running 10s test @ http://.../plaintext
  32 threads and 1024 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   234.79ms  213.33ms   1.95s    74.10%
    Req/Sec   150.19     51.13   350.00     72.38%
  46190 requests in 10.10s, 13.34GB read
  Socket errors: connect 35, read 0, write 0, timeout 40
Requests/sec:   4573.22
Transfer/sec:      1.32GB

Though it is at 32% CPU, so just over 5 cores of a 16-core machine to do it.
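
As a sanity check on those numbers: 46,190 requests x ~300 KiB of body each is roughly 13.3 GiB, which matches the 13.34GB wrk reports; and 13.34GB read over 10.10s is about 1.32GB/s, i.e. on the order of 11 Gbit/s of response data, so the RPS and the data rate are consistent with each other.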

@benaadams
Contributor

Testing other connection amounts

16 connections is 112 rps (33.23MB/s or 265.84Mbit/s)
32 connections is 437 rps (129.04MB/s or 1.03Gbit/s)

So 500 rps is over 1 Gbit/s..!
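
(For scale: each response body is 300 x 1024 B ≈ 307KB, so 437 rps works out to roughly 134MB/s ≈ 1.07Gbit/s before headers and chunk framing, which lines up with the measured 1.03Gbit/s; the original ~500 rps over loopback was therefore already pushing ~1.2Gbit/s.)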

@benaadams
Contributor

Next hotspot is CopyFrom; adding #585 gives higher peaks, but CPU drops are also at play (GC).
After that the hotspot is BeginChunkBytes, and adding @halter73's change on top resolves the GC.

The two changes give an additional 123 rps (4,573 -> 4,694), though that is another ~40MB/s (~0.3Gbit/s). I have a feeling the network is saturated at this point.

$ wrk -c 2048 -t 10 http://.../plaintext
Running 10s test @ http://.../plaintext
  32 threads and 1024 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   234.66ms  223.09ms   1.99s    79.57%
    Req/Sec   149.36     54.00   400.00     72.38%
  47420 requests in 10.10s, 13.70GB read
  Socket errors: connect 35, read 0, write 0, timeout 38
Requests/sec:   4694.73
Transfer/sec:      1.36GB

Then it looks like we are back to the usual suspects, which are post-RTM?

(Three combined are https://github.com/benaadams/KestrelHttpServer/tree/chunking )

@benaadams
Contributor

Running a longer 5m test to be sure; turns out better:

Peaking at 12.6Gbps, average range 11.5 - 12.4 Gbps; 2 very brief GC drops over 5 mins.
CPU average 36% (Azure G4 - 16 cores)

$ wrk -c 1024 -t 32 -d 320 http://.../plaintext
Running 5m test @ http://.../plaintext
  32 threads and 1024 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   233.42ms  216.38ms   2.00s    78.30%
    Req/Sec   150.67     49.75   410.00     70.71%
  1531898 requests in 5.33m, 441.46GB read
  Socket errors: connect 35, read 0, write 0, timeout 2037
Requests/sec:   4785.69
Transfer/sec:      1.38GB

@benaadams
Contributor

A single connection is interesting as its bandwidth is quite variable, 17Mbps - 62Mbps; there might be more optimal chunk sizes than 1024B. Though it does use 0% CPU, apparently.
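
If larger chunks do turn out to matter, one app-side way to get them (a hypothetical workaround, not something tried in this thread) is to batch the small writes through a bigger buffer, so each write that reaches the response, and therefore each chunk on the wire, is larger:

var buffered = new BufferedStream(context.Response.Body, 16 * 1024);
for (var i = 0; i < 300; i++)
{
    // 1KB writes accumulate in the 16KB buffer...
    await buffered.WriteAsync(bytes, 0, bytes.Length);
}
// ...and reach the response in larger writes, which should produce larger chunks.
await buffered.FlushAsync();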

@halter73
Member

@benaadams How did you modify the plaintext benchmark to do chunking? What were your wrk numbers before the change?

@benaadams
Contributor

See #589 (comment)

Basically before, it died... as it got caught up in the chunk writing going sync - I think that might have been the delay added by the network vs loopback?

Once the chunking also went async it was fine - that was the main change for this.

@benaadams
Contributor

Might have also been the effect of running 1024 connections and them all flipping to sync.

@halter73
Member

Yikes. I now see that this was the key change. I knew we weren't awaiting the suffix, but I didn't think we were blocking either. I guess you wouldn't notice until you surpassed the write-behind buffer. Good catch!

@halter73
Member

@benaadams' change has now been merged: d3d9c8d

I think this is the major cause of the low RPS with lots of response data + chunking.

@rynowak Reopen this if you think there is more that needs to be done.
