
2018 07 18 Aggregation and Analysis at Netflix


These are notes that Adrian and Nara put together while Nara was "on holiday" in Singapore.

Aggregation and Analysis

Common questions for immediate data (one level deep)

Default to 2 hours ago

List all service names *since a specific time*

  • CHAP (automation platform) might need to know the services used in the last day

List all the cluster names

List the immediate downstream

What percentage of requests had a Hystrix command for traffic type LOLOMO?

  • So questions like this could be used to tell how much a specific call pattern is used. Currently, this can be queried against metrics.

List services that use a service, grouped by Hystrix command and VIP

  • The key here is that there are multiple grouping conditions

  • For example, how many times a service is called with a given command on one IP

  • If we were able to do an ancestor query, we could ask which edge requests resulted in calls to the play service. This is impossible with data that can only be viewed one link at a time (see the sketch below)
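
For illustration only, here is a minimal Python sketch of such an ancestor query run over whole traces; the span fields (parentId, service, operation) and the function name are assumptions for the example, not an existing API:

from collections import defaultdict

def edge_requests_calling(traces, target_service):
    # Ancestor-style question: which edge (root) operations led to calls into
    # target_service? This needs whole traces; single links can't answer it.
    # Assumed span shape: {"parentId", "service", "operation", ...}
    counts = defaultdict(int)
    for trace in traces:
        if any(span["service"] == target_service for span in trace):
            root = next((s for s in trace if s["parentId"] is None), None)
            if root is not None:
                counts[(root["service"], root["operation"])] += 1
    return counts

# e.g. edge_requests_calling(traces, "play") counts, per edge operation, the
# traces that included a call to the play service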

Link based search vs Path based search

Link-based search seems helpful, and it is, for looking at the direct upstream. Secondarily, the service name isn't very useful by itself: for example, availability zone or cluster name is often a more meaningful query.

Questions that require the entire trace

Note: a lot of these questions might involve long periods of waiting, as they may need to map-reduce over possibly the whole repository. This raises the question of a trace query language. Using a query like this could also work around practical limitations on the number of tags you can place on metrics.

The questions below are from the edge team. Tracing isn't as useful for mid-tier and data-service folks because the sample rate and other factors limit the analytics; instead, they use normal metrics. If we could get to a point of an "around the service" perspective, it would be more useful to folks deeper in the call chain.

What percentage of edge requests have at least one mid-tier retry?

  • This means you need to look at the siblings (to satisfy mid-tier and also to know it was a retry).

  • Also, this is complex because you need to map-reduce per trace in order to compute the percentage (see the sketch below).

Break out response status at edge by number of mid-tier retries

For example, are there any requests that have 9 mid-tier retries? Do any of those ever return 200?

  • The tricky part here is that you need to count retries per link, not per trace
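
Both of the questions above amount to a per-trace map-reduce. A minimal Python sketch follows, assuming hypothetical span fields (traceId, id, parentId, service, status and a retry tag); it is illustrative only, not how spans are actually stored:

from collections import defaultdict

def group_by_trace(spans):
    traces = defaultdict(list)
    for span in spans:
        traces[span["traceId"]].append(span)
    return traces

def edge_retry_stats(spans, mid_tier_services):
    # 1) percentage of edge requests with at least one mid-tier retry
    # 2) edge response status broken out by the retry count on a single link
    traces = group_by_trace(spans)
    with_retry = 0
    status_by_retries = defaultdict(lambda: defaultdict(int))
    for trace in traces.values():
        root = next((s for s in trace if s["parentId"] is None), None)
        if root is None:
            continue  # incomplete trace, skip
        if any(s["service"] in mid_tier_services and s.get("retry") for s in trace):
            with_retry += 1
        # retries are counted per link (parent service -> child service), not per trace
        by_id = {s["id"]: s for s in trace}
        link_retries = defaultdict(int)
        for s in trace:
            if s.get("retry") and s["parentId"] in by_id:
                parent = by_id[s["parentId"]]
                link_retries[(parent["service"], s["service"])] += 1
        status_by_retries[max(link_retries.values(), default=0)][root["status"]] += 1
    pct = 100.0 * with_retry / len(traces) if traces else 0.0
    return pct, status_by_retries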

I’m the owner of the subscriber service, and I want to show metrics, logs and CloudTrail events. I want to understand why my service is slow, so I want to look a level down, possibly traversing further downwards.

Another scenario: I want to know my entire downstream topology, for topology purposes. Chaos injection is similar, except that I will want to know who is impacted, which means looking in the opposite direction (upstream).

Service aggregation, in the simplest terms, could be improved by partitioning by root span name. This will identify the “edge requests”. However, the actual request topologies are needed to ask which downstream calls were caused by a specific edge service. Links can’t answer that, unless you are literally the service directly underneath the edge.

Answering happens-before questions on a path basis needs aggregates computed either after the fact or as the traces occur.

Questions path-based search can answer, but link-based cannot

Assuming the above graph is directed, you can see links such as A → S and S → Y, but you can't tell if any calls from S → Y were caused by A or even B. For example, it could have been a mid-tier originated call. The converse is also true: you can't tell if a call from A → S will end up calling X, Y or Z, as it could terminate at S! So a full trace can answer certain questions that links cannot.

Modeling path aggregates in json

Assuming we have the above service graph, link-based aggregates are simpler. We have a list of each source/destination pair along with the data we want to aggregate, such as call count, latency histogram or error flags.

{"parent": "A", "child": "S", "callCount": 12 ...},
{"parent": "B", "child": "S", "callCount": 3 ...},
{"parent": "S", "child": "X", "callCount": 12 ...}
...

Path aggregates, by nature, differentiate on the ancestry of the call. A naive way is to consider just the path plus one contextual fact, such as the service name, and delimit it to include every path segment. Each full path is a separate aggregate. When modeled like this, LIKE or prefix queries can be used for cheap summaries. For example, "%A|S%" can summarize all traffic that went through the link A → S, and "%A%S%" would get all traffic including proxies or loopback.

{"path": "A|S|X", "callCount": 6 ...},
{"path": "A|S", "callCount": 6 ...},
{"path": "B|S", "callCount": 3 ...},
{"path": "S|X", "callCount": 6 ...},
...
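
To make the cheap-summary idea concrete, here is a small Python sketch over the path aggregates above; the helper is made up, and fnmatch wildcards stand in for SQL LIKE:

import fnmatch

# The same path aggregates as above, keyed by a delimited path string.
path_aggregates = [
    {"path": "A|S|X", "callCount": 6},
    {"path": "A|S", "callCount": 6},
    {"path": "B|S", "callCount": 3},
    {"path": "S|X", "callCount": 6},
]

def summarize(aggregates, pattern):
    # fnmatch wildcards stand in for SQL LIKE: '*' instead of '%'
    return sum(a["callCount"] for a in aggregates if fnmatch.fnmatch(a["path"], pattern))

print(summarize(path_aggregates, "*A|S*"))  # traffic through the link A -> S: 12
print(summarize(path_aggregates, "*A*S*"))  # also includes proxies or loopback in between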

Path aggregates become more voluminous when more context is needed to identify a specific service, for example the operation or cluster name. This makes a simple path string unusable; you instead need a considerably larger representation. A small change, such as one node's version, creates a new aggregate. For this reason, you'd prefer to carry the fewest facts possible, only those you'd query on, at least if using this approach to aggregation.

{"path": [
   {"cluster": "A-v32", "service": "A", "operation": "/users/{userId}"},
   {"cluster": "S-v31", "service": "S", "operation": "/profile/users/{userId}"},
   {"cluster": "X-v31", "service": "X", "operation": "/store/users/{userId}"}
 ], "callCount": 3 ...},
{"path": [
   {"cluster": "A-v32", "service": "A", "operation": "/admin/users/{userId}"},
   {"cluster": "S-v31", "service": "S", "operation": "/profile/users/{userId}"},
   {"cluster": "X-v31", "service": "X", "operation": "/store/users/{userId}"}
 ], "callCount": 2 ...},
{"path": [
   {"cluster": "A-v31", "service": "A", "operation": "/users/{userId}"},
   {"cluster": "S-v31", "service": "S", "operation": "/profile/users/{userId}"},
   {"cluster": "X-v31", "service": "X", "operation": "/store/users/{userId}"}
 ], "callCount": 1 ...},
...

Note that cheap wins are available as well. For example, even with link-based aggregates, you can partition by the edge operation, which can answer questions about the origin of traffic with less complexity than a full path aggregate:

{"root": "/users/{userId}", "parent": "A", "child": "S", "callCount": 4 ...},
{"root": NULL, "parent": "A", "child": "S", "callCount": 10 ...},
{"root": "/admin/users/{userId}", "parent": "A", "child": "S", "callCount": 2 ...},
...
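
For illustration, here is a sketch of how one trace could be reduced to such root-partitioned link rows, using the same assumed span fields as the earlier sketches:

from collections import Counter

def root_partitioned_links(trace):
    # Count parent -> child service links within one trace, partitioned by the
    # root (edge) operation. Assumed span shape: {"id", "parentId", "service", "operation"}.
    by_id = {s["id"]: s for s in trace}
    root = next((s for s in trace if s["parentId"] is None), None)
    root_op = root["operation"] if root else None
    links = Counter()
    for s in trace:
        parent = by_id.get(s["parentId"])
        if parent is not None:
            links[(root_op, parent["service"], s["service"])] += 1
    return links

# Summing these counters across traces yields rows like
# {"root": "/users/{userId}", "parent": "A", "child": "S", "callCount": ...}
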
Two-tier Sampling (completely different topic)

At Netflix, the edge tier currently uses a Zipkin-like service, but with 100% sampling; downstream is sampled at 1%. How should one handle different tiers which might have different policies?

Approach 1: don’t propagate sampling flag

In this model, the edge tier does 100% sampling, but strips the sampling flag when propagating downstream. This will cause the next hop to reuse the trace ID, but restart the sampling decision. This works in an edge scenario, but is less effective mid-tier, as you could end up with inconsistent data.

Approach 1a: edge ignores its own sampling decision

In this model, the edge makes a sampling decision for downstream, but doesn’t honor it locally. For example, it could use “firehose” mode to send all of its data to the system regardless of the 1% decision it propagates downstream.
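
As a sketch of how these two approaches could differ on the wire, assuming B3 single-header propagation (b3: {traceId}-{spanId}-{samplingState}); the function names and the 1% rate are illustrative, not a description of Netflix's actual setup:

import random

def approach_1_outbound(trace_id, span_id):
    # Approach 1: keep the trace ID but omit the sampled flag entirely, so the
    # next hop reuses the trace ID yet restarts its own sampling decision.
    return {"b3": f"{trace_id}-{span_id}"}

def approach_1a_outbound(trace_id, span_id, downstream_rate=0.01):
    # Approach 1a: the edge decides for downstream (e.g. 1%) and propagates that
    # decision, but ignores it locally and reports 100% of its own spans ("firehose").
    sampled = "1" if random.random() < downstream_rate else "0"
    return {"b3": f"{trace_id}-{span_id}-{sampled}"}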

How to deal with multiple systems with the same trace ID

If edge and downstream are on different sampling rates, this could affect the storage decisions including retention policy or even data store choice depending on scale.

Approach 1: merge the data at the analysis or UI layer.

In this case, the data goes to two different keyspaces or storage systems. Both systems are joined to get a specific trace, whether that’s a join in Spark or a fan-out query in the API, like zipkin-mux.
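
A rough sketch of the fan-out idea (not zipkin-mux's actual API): query both stores for the same trace ID and merge the spans, assuming each store exposes some get_trace lookup:

def get_merged_trace(trace_id, edge_store, downstream_store):
    # Assumed store interface: get_trace(trace_id) -> list of span dicts
    spans = edge_store.get_trace(trace_id) + downstream_store.get_trace(trace_id)
    seen, merged = set(), []
    for span in spans:
        key = (span["id"], span.get("parentId"), span.get("service"))
        if key not in seen:  # de-duplicate if both tiers reported the same span
            seen.add(key)
            merged.append(span)
    return merged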

Approach 2: just send them to the same place

If you aren’t indexing data, or are only doing lookups (e.g. look up a trace by device ID and request ID), you could just dump the data into the same repository. The 100% traffic will end up as “short traces”, which you’d only recognize from the depth of the trace.
