From b37214b0b2e2b91193e3e5c06483d4ca3545091a Mon Sep 17 00:00:00 2001
From: kaspar-p
Date: Sat, 18 Nov 2023 20:04:21 -0500
Subject: [PATCH 1/2] Add Retries.md document

---
 Topics/Software_Engineering/Retries.md | 86 ++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 Topics/Software_Engineering/Retries.md

diff --git a/Topics/Software_Engineering/Retries.md b/Topics/Software_Engineering/Retries.md
new file mode 100644
index 000000000..a6a8a168d
--- /dev/null
+++ b/Topics/Software_Engineering/Retries.md
@@ -0,0 +1,86 @@
+# Retries in web services
+
+Especially relevant for webserver applications, but useful for others, retries are really tricky to get right.
+Retries and throttling are both terms used to talk about the _flow_ of traffic into a service. Often the
+operators/developers of that service want to make guarantees about the rate of that flow, or otherwise direct
+traffic.
+
+## Why retry at all?
+
+Intermittent failure can happen at any level. This can be within a single host if its on-host disk, memory, or
+CPU fails, or often in the communication between two hosts over a network. Networks, especially over the public
+internet using TCP/IP, are known to have periodic failures due to high load/congestion or network infrastructure
+hardware failure.
+
+Retries are a simple answer to these intermittent failures. If an error happens only rarely, then
+trying the task again is an effective way to ensure that the message goes through. In practice, this often means
+retrying API calls.
+
+Some services have built up language around these retries to control them. For example, calling AWS APIs returns
+metadata about the request itself: errors returned by version 2 of the AWS SDK for JavaScript, a widely used client
+for making AWS API calls, carry a `retryable` field. If this field is set to `true`, the hint is that the
+failure was intermittent and that the client should retry. If the field is set to `false`, the hint
+to the client is that the failure is likely to happen again.
+
+## What are the problems with retries?
+
+Since retries are so simple and elegant to implement, they are usually the first tool that developers reach for
+when a dependency of theirs fails intermittently. But how can this go wrong?
+
+Consider a case where 4 distinct software teams each build products that depend on one another, in a chain like:
+```
+A -> B -> C -> D
+```
+
+That is, service A is calling service B's APIs, and so on. Since B's APIs are known to fail occasionally, A has
+configured an automatic retry count of 3 (up to 3 attempts per call). Under the hood, B depends on C. Service A may
+or may not know this about B. But since C has a flaky API too, B also has a retry count of 3. The same goes for C.
+
+This usually works fine. If all services are sufficiently scaled up to handle the load they are
+given, there are no problems.
+
+However, imagine a case where service D is down. Even though D is at the end of the chain of dependencies, the other
+services should, in theory, be able to stay up despite a dependency being down. This type of engineering is called
+fault tolerance.
+
+The next time that service A takes a request, it forwards it to B, which forwards it to C, which tries to call D's
+API, which fails. C then tries 3 times in total before reporting a failure back to B, which triggers a retry of its
+own. That means C tries _another 3 times_.
+
+Retries deep in a chain of services grow multiplicatively, and a single API call to A has caused each service to make
+```
+A: 3
+B: 9
+C: 27
+```
+failed API calls to the next service in the chain. C is making 27x more calls than it is used to, and might start
+failing itself, further exacerbating the problem.
+
+That is, D has become a single point of failure for all other services, and even if they don't outright fail, the
+load on B, C, and D is needlessly increased.
+
+## What can we do about retries?
+
+Clients calling services will nearly always have retries configured. However, internal services should rarely
+implement retries while calling other internal services, for precisely this reason.
+
+Another technique to get around excessive retries is to utilize more caching. If service C had cached the responses
+from service D, it's possible that service D going down would not have affected the top-level services at all, and
+everything would have worked as normal. The downside to this approach is that caches are often tricky to get right,
+and sometimes introduce modal behavior in services [1], usually a bad thing.
+
+## So should I retry?
+
+As always in software engineering, it depends. A good rule of thumb is the external/internal, where external
+dependencies are wrapped in retries, but internal dependencies aren't. It's much easier to control the behavior of
+internal dependencies, either by directly contributing to their product, or speaking to the owners of that product
+itself. Retries are a rough bandaid, and more precise solutions are better. Fixing the root-cause of intermittent
+failures avoids the problems with retries in the first place, and produces a more stable product.
+
+Retries are also more acceptable when they aren't in the _critical path_ of a service. For an `AddTwoNumbers`
+service, having retries on dependencies within the main `AddTwoNumbers` API call might not be a good idea. However,
+for backup jobs, batch processing, or other non-performance-critical work, retries are often a simple,
+engineering-efficient way to ensure reliability.
+
+## References
+1. https://brooker.co.za/blog/2021/05/24/metastable.html
\ No newline at end of file
From c2dd01d935965fae06e00e6b2e7c7188d10fdc69 Mon Sep 17 00:00:00 2001
From: kaspar-p
Date: Fri, 24 Nov 2023 16:59:26 -0500
Subject: [PATCH 2/2] Update with language examples; clear wording

---
 Topics/Software_Engineering/Retries.md | 27 ++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/Topics/Software_Engineering/Retries.md b/Topics/Software_Engineering/Retries.md
index a6a8a168d..69e51a006 100644
--- a/Topics/Software_Engineering/Retries.md
+++ b/Topics/Software_Engineering/Retries.md
@@ -1,6 +1,9 @@
 # Retries in web services
 
-Especially relevant for webserver applications, but useful for others, retries are really tricky to get right.
+Especially relevant for webserver applications, but useful for others, retries are really tricky to get right. 
+Retries are the practice of _retrying_ a network request, usually over HTTP or HTTPS, when it fails. The practice relies
+on the assumption that most failures are intermittent, meaning they happen only rarely.
+
 Retries and throttling are both terms used to talk about the _flow_ of traffic into a service. Often the
 operators/developers of that service want to make guarantees about the rate of that flow, or otherwise direct
 traffic.
@@ -74,13 +77,29 @@ and sometimes introduce modal behavior in services [1], usually a bad thing.
 As always in software engineering, it depends. A good rule of thumb is the external/internal, where external
 dependencies are wrapped in retries, but internal dependencies aren't. It's much easier to control the behavior of
 internal dependencies, either by directly contributing to their product, or speaking to the owners of that product
-itself. Retries are a rough bandaid, and more precise solutions are better. Fixing the root-cause of intermittent
-failures avoids the problems with retries in the first place, and produces a more stable product.
+itself. Retries are a rough band-aid, and more precise solutions are often better. For example, it might be more work,
+but fixing the root cause of intermittent failures avoids the problems with retries in the first place, and also
+produces a more stable product.
 
 Retries are also more acceptable when they aren't in the _critical path_ of a service. For an `AddTwoNumbers`
 service, having retries on dependencies within the main `AddTwoNumbers` API call might not be a good idea. However,
 for backup jobs, batch processing, or other non-performance-critical work, retries are often a simple,
 engineering-efficient way to ensure reliability.
 
+## How should I retry?
+
+For most popular programming languages, retry support is available in common libraries. For example,
+1. Rust has `tower`, a generic service abstraction for networking clients and servers that offers automatic retries: https://github.com/tower-rs/tower [2],
+2. JavaScript and TypeScript have `retry`: https://www.npmjs.com/package/retry [3], and
+3. Go has `retry-go`: https://github.com/avast/retry-go [4]
+
+Each library works slightly differently, but each can be used in simple or complex ways. A retry policy could be as simple
+as immediately retrying the network request upon failure, or more complicated, including concepts like jitter (randomizing delays
+so that many concurrent clients don't all retry at the same time), exponential backoff (each client waiting longer and longer
+between retries), or other concepts [1].
+
 ## References
-1. https://brooker.co.za/blog/2021/05/24/metastable.html
\ No newline at end of file
+1. https://brooker.co.za/blog/2021/05/24/metastable.html
+2. https://github.com/tower-rs/tower
+3. https://www.npmjs.com/package/retry
+4. https://github.com/avast/retry-go
\ No newline at end of file
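
The jitter and exponential backoff ideas named under "How should I retry?" can be illustrated with a minimal Python sketch. This is only a sketch under stated assumptions, not how `tower`, `retry`, or `retry-go` implement retries: the names `backoff_delay`, `retry_with_backoff`, `max_attempts`, `base_delay`, and `max_delay` are hypothetical, and the wrapped operation is assumed to signal an intermittent failure by raising an exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_delay: float, max_delay: float) -> float:
    """Largest wait allowed after failed attempt `attempt` (0-based).

    The bound doubles with each failure (exponential backoff) and is
    capped at `max_delay` so waits do not grow without limit.
    """
    return min(max_delay, base_delay * (2 ** attempt))


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,          # hypothetical default, mirrors the "3" used in the article
    base_delay: float = 0.1,        # seconds; assumed value for illustration
    max_delay: float = 5.0,         # seconds; assumed value for illustration
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` calls have failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last failure to the caller.
            if attempt == max_attempts - 1:
                raise
            # Jitter: wait a random time between 0 and the current bound so
            # that many clients failing together do not retry in lockstep.
            bound = backoff_delay(attempt, base_delay, max_delay)
            sleep(random.uniform(0.0, bound))
    raise AssertionError("unreachable")
```

A call such as `retry_with_backoff(lambda: fetch_backup_manifest())` (where `fetch_backup_manifest` is a made-up function) would suit the non-critical-path work described in the article. Backoff and jitter only spread retries out in time; they do not remove the multiplicative growth across a chain of services, so the external/internal rule of thumb still applies.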