
Add metrics query API spec #2946

Merged: 8 commits merged into jaegertracing:master from the metrics-api-spec branch, May 3, 2021

Conversation

albertteoh (Contributor) commented Apr 20, 2021

Signed-off-by: albertteoh <albert.teoh@logz.io>

Which problem is this PR solving?

Short description of the changes

  • Defines a vendor-neutral API specification for querying a metrics backing store (a rough sketch of the service shape follows this list).
  • Adds a README explaining the motivation, the approach taken, and the rationale for that approach.
  • Updates the Makefile to generate client/server code from the new proto files.
  • Tested on a working prototype with both gRPC and HTTP clients.
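For orientation, here is a minimal proto3 sketch of the kind of vendor-neutral metrics query service being proposed. Only the MetricsQueryService, GetMinStepDuration, GetLatencies and GetMetricsResponse names appear in the spec quoted later in this thread; the empty message bodies and everything else are placeholders for illustration, not the PR's actual definitions.

```proto
syntax = "proto3";

package metrics.sketch;

// Sketch only: message bodies are empty placeholders, not the PR's actual spec.
service MetricsQueryService {
  // GetMinStepDuration gets the min step duration supported by the backing metrics store.
  rpc GetMinStepDuration(GetMinStepDurationRequest) returns (GetMinStepDurationResponse);

  // GetLatencies gets latency metrics (one of the RED metrics).
  rpc GetLatencies(GetLatenciesRequest) returns (GetMetricsResponse);
}

message GetMinStepDurationRequest {}
message GetMinStepDurationResponse {}
message GetLatenciesRequest {}
message GetMetricsResponse {}
```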

Signed-off-by: albertteoh <albert.teoh@logz.io>
@albertteoh albertteoh requested a review from a team as a code owner April 20, 2021 00:37
codecov bot commented Apr 20, 2021

Codecov Report

Merging #2946 (8a14cb7) into master (31a76d1) will decrease coverage by 0.07%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #2946      +/-   ##
==========================================
- Coverage   95.97%   95.90%   -0.08%     
==========================================
  Files         223      223              
  Lines        9712     9712              
==========================================
- Hits         9321     9314       -7     
- Misses        323      328       +5     
- Partials       68       70       +2     
Impacted Files Coverage Δ
cmd/collector/app/server/zipkin.go 61.53% <0.00%> (-15.39%) ⬇️
pkg/config/tlscfg/cert_watcher.go 92.20% <0.00%> (-2.60%) ⬇️
cmd/query/app/static_handler.go 95.16% <0.00%> (-1.62%) ⬇️
plugin/storage/integration/integration.go 77.90% <0.00%> (+0.55%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 31a76d1...8a14cb7.

jpkrohling (Contributor) previously approved these changes Apr 20, 2021 and left a comment:


I like the promise of this and how the service ended up looking, but I can't judge the protobuf details.

rpc GetMinStepDuration(GetMinStepDurationRequest) returns (GetMinStepDurationResponse);

// GetPerServiceLatencies gets latency metrics grouped by service.
rpc GetPerServiceLatencies(GetPerServiceLatenciesRequest) returns (GetMetricsResponse);
Member:

I still have a conceptual problem with this API.

First, it does not scale well. It will work fine for small shops with a dozen services. For 100 services it becomes iffy: is it useful for a user to see 100 charts? For orgs with 1000s of services this does not work at all; it will probably choke on the data volume, and even if it doesn't, it's completely useless to return 1000s of time series to the UI.

Second, I don't think this is the right user workflow. The notion of a "service" is pretty loose: a serverless function can be a service, or an instance of an ML model being served, so even small shops can easily get into a state with 100s or 1000s of "services". There's no user persona that needs to see all this data at once, or see it ranked in a flat list; because the latency requirements differ greatly between services, it doesn't make sense to just sort and return the top K.

Why not start with having this API serve a "default dashboard" for a single service? That's a clear use case and a well-understood user workflow. Extending this to multiple services requires some form of grouping, so that we don't pull the data for all services at once.

Contributor:

Why not start with having this API serve a "default dashboard" for a single service?

The UI would then first get a list of services, decide which ones to place on the screen, and run queries passing the service they are interested in?

Member:

No, I am thinking the user will pick a service; the UI won't decide by itself. It's essentially how the search works today as well, so it will be consistent behavior in the UI. I don't think it fits exactly the vision that Albert had, but per my comment above I don't think that vision scales well with the number of services.

Contributor:

Instead of requiring the user to select a service to get useful data, we could preemptively show data about some services, even if only the 10 services with the most recent activity.

albertteoh (Contributor Author) commented Apr 22, 2021:

Thanks for your feedback @jpkrohling & @yurishkuro.

First, it does not scale well.

I agree; this will not scale to 100s+ of services, especially for latency computations, which are at least an O(n^3) problem. A single service-level view would reduce this to an O(n^2) problem.
As such, I'll remove those higher-level cross-service endpoints from the API, and we'll rethink the UI design to accommodate this new workflow, where the user selects a service to display the metrics for its list of operations or, as @jpkrohling suggested, we return the k most recently active services.

There's no user persona that needs to see all this data at once, or see it ranked in a flat list; because the latency requirements differ greatly between services, it doesn't make sense to just sort and return the top K.

I think there are valid use cases for a higher-level view of services' metrics.

Some example use cases (feel free to refute if these examples seem spurious):

  • Post-deployment sanity checks: As a developer or devops engineer, I want to be assured that my deployment does not negatively impact other services (especially those that are not immediate dependencies) and therefore the business in general.
  • As an engineer, I want to quickly pinpoint the cause of a vague problem reported by customers ("the website is slow") and avoid the back-and-forth conversation needed to find the slow service.

I agree that different services have different latency, and perhaps even error, profiles. Ideally, surfacing a change in error rate, latency, etc., or some sort of anomaly detection, would provide the most value (especially for the use cases listed above), though it would be more complex, so this proposal is a more reachable stepping stone in that direction.

Yes, there's little value in returning all services, although I thought there was some value in sorting and returning the top K, especially if it's based on a delta of the metric. Again, this is compute-intensive and would not be a scalable solution in this iteration.

Given the above, we will continue thinking of better ways to compute and present this higher-level service view in a scalable manner (e.g. post-processing a spoon-fed "view" of metrics), as we still believe it is a valuable feature, and we welcome suggestions.

Member:

although I thought there was some value in sorting and returning the top K

My point was that comparing latencies across services is pretty meaningless. There can always be some kind of batch-processing service triggered by an RPC from a cron job whose latency will always be high, but that doesn't mean it's abnormal. Whereas a critical shared service (like a cache) whose latency went from 1ms to 2ms won't make the top K, even though it's clearly a drastic regression.

Contributor Author:

I agree. The intention of the first set of PRs was to use this as a stepping stone towards something more meaningful, like showing the change in the metric, or something more sophisticated, like anomaly detection, given that those are a bit more complex. Additionally, we included an "Impact" column in the proposed UI mockup that multiplies the latency by the call rate, which helps make it a bit more meaningful; though of course, I think a delta of the metrics would amplify the signal better.

Would you prefer that we tackle this now rather than later? That is, at least something simpler, such as having the response include a delta between the last data point of the current time window (now-lookback, now] and the last data point of the previous time window (now-2*lookback, now-lookback], or something similar?
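For illustration only, one way such a windowed delta could be represented in a response; all names here are hypothetical and not part of the PR.

```proto
syntax = "proto3";

// Hypothetical sketch of a per-metric delta between the last data point of the
// current window (now-lookback, now] and the last data point of the previous
// window (now-2*lookback, now-lookback].
message MetricWithDelta {
  double current_value = 1;   // last data point in (now-lookback, now]
  double previous_value = 2;  // last data point in (now-2*lookback, now-lookback]
  double delta = 3;           // current_value - previous_value
}
```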

Contributor Author:

Please let me know if there are any further questions/concerns that need addressing in this PR.

Contributor:

Given the positive reaction regarding the mockup, I would go with the minimal solution that works, release it, and iterate.

Contributor Author:

Thanks @jpkrohling :)

I've updated the API to support metrics aggregated at the service level and at the operation level, requiring the client to explicitly pass service names at all times, with the following goals in mind (a rough sketch of the resulting request shape follows this list):

  • Address the scalability concern by not returning metrics for all services.
  • Support the proposed "user preference" list of metrics grouped by service.
  • Support a per-service set of metrics grouped by operation, while giving clients the flexibility to fetch more than one service's metrics concurrently up front if they wish, though the more common use case is expected to be a single service (the single-service view with multiple RED metrics per operation).
  • Minimize the "surface area" of the API, with little scope for submitting custom queries.
  • Simplify the API by minimizing the number of RPC endpoints.
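For illustration, a rough sketch of a base request that always carries explicit service names, matching the goals above. The MetricsQueryBaseRequest name is quoted later in this thread; the field names and numbers are assumptions, not the PR's actual definitions.

```proto
syntax = "proto3";

// Sketch only: field names and numbers are illustrative assumptions.
message MetricsQueryBaseRequest {
  // Services to fetch metrics for; clients must always pass these explicitly,
  // so the backend never has to return metrics for all services.
  repeated string service_names = 1;

  // When true, results are grouped by operation within each service;
  // otherwise they are aggregated at the service level.
  bool group_by_operation = 2;
}
```

Sharing a single base request like this is one way to keep the API surface small while still supporting both the service-level and the operation-level groupings described above.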

Signed-off-by: albertteoh <albert.teoh@logz.io>
@mergify mergify bot dismissed jpkrohling’s stale review April 22, 2021 12:14

Pull request has been modified.

Signed-off-by: albertteoh <albert.teoh@logz.io>
albertteoh and others added 3 commits April 26, 2021 18:55
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
jpkrohling (Contributor):

@albertteoh, ping us when this is ready for re-review!

Signed-off-by: albertteoh <albert.teoh@logz.io>
albertteoh (Contributor Author):

Thanks @jpkrohling, ready for re-review!

albertteoh (Contributor Author):

I'd like to move forward with this; please let me know if there are any outstanding issues, and I'd be happy to address them.

yurishkuro (Member) previously approved these changes May 2, 2021 and left a comment:


LGTM. You may want to tag this as experimental somewhere until the actual backend functionality is built. I would recommend a master ticket with an execution plan checklist, like:

  • service IDL & data model
  • service impl that queries Prom
  • UI module that displays it
  • anything else?

// GetLatenciesRequest contains parameters for the GetLatencies RPC call.
message GetLatenciesRequest {
MetricsQueryBaseRequest baseRequest = 1;
// quantile is the quantile to compute from latency histogram metrics.
Member:

Could use an example. What are the units? If we want p99, should this be quantile=99? If so, why the double type?

Contributor Author:

Could use an example. What are the units?

+1 I'll add an example.

If we want p99, should this be quantile=99? If so, why the double type?

The possible values range from 0 to 1, so p99 would be 0.99. Note that p99 is a percentile; percentiles range from 0 to 100 and are 100-quantiles.

IIUC, quantiles are a broader concept that slices the population of data into any number of equal parts. For example, p999 is the 99.9th percentile and would have a quantile value of 0.999.

Moreover, most (if not all) metrics backends use quantiles rather than percentiles as parameters to their "histogram quantile" calculations; for instance, Prometheus's histogram_quantile() function takes a quantile value between 0 and 1.

Maintaining the 0-1 range of values for quantile means there is no need to perform any computation to map percentiles to quantiles before passing the value to the metrics backend. It also allows q-quantiles with q greater than 100 (e.g. 0.999 for p999), which would be a rare use case, but I feel there's no harm in supporting it, as long as it's well documented.
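For illustration, the example and documentation being discussed could look like this on the quantile field. The message names and the baseRequest field are quoted from the spec earlier in this thread; the field number 2, the stubbed MetricsQueryBaseRequest, and the comment wording are assumptions.

```proto
syntax = "proto3";

// Stubbed here only so the sketch is self-contained; defined elsewhere in the spec.
message MetricsQueryBaseRequest {}

// GetLatenciesRequest contains parameters for the GetLatencies RPC call.
message GetLatenciesRequest {
  MetricsQueryBaseRequest baseRequest = 1;

  // quantile is the quantile to compute from latency histogram metrics,
  // expressed as a value in the range (0, 1].
  // For example, 0.95 selects p95, 0.99 selects p99 and 0.999 selects p999.
  double quantile = 2;  // field number assumed for this sketch
}
```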

}

service MetricsQueryService {
// GetMinStepDuration gets the min step duration supported by the backing metrics store.
Member:

Suggested change
// GetMinStepDuration gets the min step duration supported by the backing metrics store.
// GetMinStepDuration gets the min time resolution supported by the backing metrics store,
// e.g. 10s means the backend can only return data points that are at least 10s apart, not closer.

Signed-off-by: albertteoh <albert.teoh@logz.io>
@mergify mergify bot dismissed yurishkuro’s stale review May 3, 2021 04:28

Pull request has been modified.

albertteoh (Contributor Author):

Thanks Yuri!

You may want to tag this experimental somewhere until the actual backend functionality is built.

Agreed.

I would recommend a master ticket with an execution plan checklist

Yup, I added a checklist in this master ticket: #2946 and referenced this PR.

If you're happy with the recent changes, I'd appreciate another stamp, since the mergify bot has dismissed your last approval.
Thanks again!

@albertteoh albertteoh merged commit 870fc90 into jaegertracing:master May 3, 2021
@albertteoh albertteoh deleted the metrics-api-spec branch May 3, 2021 05:11
jpkrohling (Contributor) left a comment:

Thanks for this one, @albertteoh!

// See the License for the specific language governing permissions and
// limitations under the License.

// Based on: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto
Contributor:

It would be good to have a reference to the version used. Perhaps you can use a recent tag instead?

Contributor Author:

Good point! I've addressed this in #2975; could you please review it when you've got time?

gogoproto.marshaler_all, gogoproto.unmarshaler_all, etc. enabled.

Moreover, if direct imports of other repositories were possible, it would mean importing and generating code for
transitive dependencies not required by Jaeger leading to longer build times, and potentially larger docker
Contributor:

s/docker/container
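For context, the gogoproto file-level options referred to in the quoted README text are typically enabled like this (an illustrative sketch; the package name and the exact set of options enabled in Jaeger's proto files are assumptions):

```proto
syntax = "proto3";

package jaeger.sketch;

// Assumed import path for gogo/protobuf's extension definitions.
import "gogoproto/gogo.proto";

// File-level options that make gogo's protoc plugin generate custom
// (un)marshalers and sizers for every message in the file.
option (gogoproto.marshaler_all) = true;
option (gogoproto.unmarshaler_all) = true;
option (gogoproto.sizer_all) = true;
```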

@albertteoh albertteoh mentioned this pull request May 3, 2021
@jpkrohling jpkrohling added this to the Release 1.23.0 milestone Jun 4, 2021