Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve L0_client_memory_growth test #303

Merged
merged 4 commits into from
Apr 29, 2023
Merged

Improve L0_client_memory_growth test #303

merged 4 commits into from
Apr 29, 2023

Conversation

rmccorm4
Copy link
Contributor

@rmccorm4 rmccorm4 commented Apr 27, 2023

  1. Refactor memory_leak_test.cc

    • Fix behavior of -R (reuse) flag to actually reuse client/connection. Previously subsequent calls to client::Create(reuse_client. ...) would just create new client objects under the hood.
    • Template helper functions since HTTP and GRPC APIs used are identical in this test
  2. Update C++ HTTP client to correctly propogate errors on client-side curl failures. Now the behavior matches other API call failures, rather than reporting an obscure Timer and Json parsing issue. See the before/after examples below:

Before:

root@rmccormick-dt:/workspace# memory_leak_test
Failed to update context stat: Timer not set correctly. Send time from 1682559474332558400 to 0.
error: unable to run model: failed to parse the request JSON buffer: The document is empty. at 0

root@rmccormick-dt:/workspace# simple_http_infer_client
Failed to update context stat: Timer not set correctly. Send time from 1682559478689285651 to 0.
error: unable to run model: failed to parse the request JSON buffer: The document is empty. at 0

root@rmccormick-dt:/workspace# simple_http_health_metadata
error: unable to get server liveness: HTTP client failed: Couldn't connect to server

root@rmccormick-dt:/workspace# simple_http_model_control
error: Failed to get repository index: HTTP client failed: Couldn't connect to server

After:

root@rmccormick-dt:/mnt/triton/jira/4524-client-growth/client/build# ./install/bin/memory_leak_test
error: HTTP client failed [400]: Couldn't connect to server

root@rmccormick-dt:/mnt/triton/jira/4524-client-growth/client/build# ./install/bin/simple_http_infer_client
error: unable to run model: HTTP client failed [400]: Couldn't connect to server

root@rmccormick-dt:/mnt/triton/jira/4524-client-growth/client/build# ./install/bin/simple_http_health_metadata
error: unable to get server liveness: HTTP client failed: Couldn't connect to server

root@rmccormick-dt:/mnt/triton/jira/4524-client-growth/client/build# ./install/bin/simple_http_model_control
error: Failed to get repository index: HTTP client failed: Couldn't connect to server

The root cause of the connection errors above in CI was running out of ephemeral ports on the system. Sockets were being opened too quickly / too many hanging sockets were being left in WAIT state, causing an inability to assign a port to new connections.

For future reference, this can be observed by running watch ss -s in a separate shell while running this script. If you set the -R flag to re-use the same client connection, you will see a consistent flat usage of sockets.

If you omit this flag and create new connections on each request, you will see a consistent rise / high value in sockets stuck in timewait:

$ ss -s

Total: 2031
TCP:   14135 (estab 47, closed 14068, orphaned 0, timewait 14065)

Possible Future Improvements:

  • Test AsyncInfer
  • Separate test to explicitly reuse client connection
  • Expose --max-retries rather than hard-coded value
  • Only retry on specific error messages like "couldn't connect to server"

Corresponding server PR that just adds notes about these findings to help future debuggers: triton-inference-server/server#5710

…unctions. (2) Update http_client to propogate error on client-side failure
…plify reuse logic, clarify HTTP millisecond timeout limitation
@rmccorm4 rmccorm4 requested a review from nnshah1 April 27, 2023 22:30
@rmccorm4 rmccorm4 requested a review from Tabrizian April 28, 2023 17:12
@rmccorm4 rmccorm4 merged commit b41f7ea into main Apr 29, 2023
@rmccorm4 rmccorm4 deleted the rmccormick-fix-http branch April 29, 2023 01:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants