Report gRPC status code in client-computed stats #10805
  CharSequence httpMethod,
- CharSequence httpEndpoint) {
+ CharSequence httpEndpoint,
+ CharSequence grpcStatusCode) {
The implementation does not look correct. AFAIK our integrations put the enum name on grpc.status.code:
span.setTag("status.code", status.getCode().name());
On the other side, a numeric value is required here. The agent does the translation itself (see aggregation.go):
var grpcStatusMap = map[string]string{
"CANCELLED": "1",
"CANCELED": "1",
"INVALIDARGUMENT": "3",
"DEADLINEEXCEEDED": "4",
"NOTFOUND": "5",
"ALREADYEXISTS": "6",
"PERMISSIONDENIED": "7",
"RESOURCEEXHAUSTED": "8",
"FAILEDPRECONDITION": "9",
"OUTOFRANGE": "11",
"DATALOSS": "15",
}
We already have those numbers. I suggest first enriching the existing integrations by setting the status code under the OTel tag rpc.grpc.status_code, and then sourcing the stats field from that tag instead.
Please also double-check that the proto accepts a string and not an integer.
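To make the name-to-number translation discussed above concrete, here is a minimal Java sketch. `GrpcStatusTranslator` is a hypothetical helper, not code from dd-trace-java or the agent. The numeric values are the canonical gRPC status codes. The normalization rule (case-insensitive, underscores ignored) is an assumption made for illustration, and the agent's actual lookup may differ.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of dd-trace-java or the agent. */
final class GrpcStatusTranslator {

  // Canonical gRPC status code names, indexed by numeric value (OK = 0 ... UNAUTHENTICATED = 16).
  private static final String[] CODE_NAMES = {
    "OK", "CANCELLED", "UNKNOWN", "INVALID_ARGUMENT", "DEADLINE_EXCEEDED",
    "NOT_FOUND", "ALREADY_EXISTS", "PERMISSION_DENIED", "RESOURCE_EXHAUSTED",
    "FAILED_PRECONDITION", "ABORTED", "OUT_OF_RANGE", "UNIMPLEMENTED",
    "INTERNAL", "UNAVAILABLE", "DATA_LOSS", "UNAUTHENTICATED"
  };

  private GrpcStatusTranslator() {}

  /** Returns the numeric code as a string, or "" when the value is not recognized. */
  static String toNumericString(String tagValue) {
    if (tagValue == null || tagValue.trim().isEmpty()) {
      return "";
    }
    // A value that is already a non-negative integer is passed through.
    try {
      int parsed = Integer.parseInt(tagValue.trim());
      if (parsed >= 0) {
        return Integer.toString(parsed);
      }
    } catch (NumberFormatException ignored) {
      // Not numeric: fall through to the name lookup.
    }
    // Assumption: names are compared case-insensitively with underscores removed,
    // so "DeadlineExceeded" and "DEADLINE_EXCEEDED" both resolve to "4".
    String wanted = normalize(tagValue);
    if (wanted.equals("CANCELED")) { // single-L spelling, also present in the map above
      wanted = "CANCELLED";
    }
    for (int i = 0; i < CODE_NAMES.length; i++) {
      if (normalize(CODE_NAMES[i]).equals(wanted)) {
        return Integer.toString(i);
      }
    }
    return "";
  }

  private static String normalize(String name) {
    return name.trim().toUpperCase(Locale.ROOT).replace("_", "");
  }

  public static void main(String[] args) {
    System.out.println(toNumericString("DEADLINE_EXCEEDED"));
    System.out.println(toNumericString("CANCELED"));
  }
}
```

With the numeric value available directly on the span, as suggested in the comment, a lookup like this would not be needed on the tracer side.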
Benchmarks
Startup
Summary
Found 0 performance improvements and 0 performance regressions! Performance is the same for 63 metrics, 8 unstable metrics.
Startup time reports for petclinic
gantt
title petclinic - global startup overhead: candidate=1.61.0-SNAPSHOT~a3832a0fc0, baseline=1.61.0-SNAPSHOT~c1e9ac6389
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.059 s) : 0, 1058789
Total [baseline] (11.023 s) : 0, 11023165
Agent [candidate] (1.06 s) : 0, 1060448
Total [candidate] (11.084 s) : 0, 11084022
section appsec
Agent [baseline] (1.254 s) : 0, 1254371
Total [baseline] (11.205 s) : 0, 11204751
Agent [candidate] (1.25 s) : 0, 1249890
Total [candidate] (11.147 s) : 0, 11146810
section iast
Agent [baseline] (1.231 s) : 0, 1230731
Total [baseline] (11.388 s) : 0, 11387940
Agent [candidate] (1.225 s) : 0, 1224540
Total [candidate] (11.271 s) : 0, 11271139
section profiling
Agent [baseline] (1.187 s) : 0, 1187069
Total [baseline] (11.071 s) : 0, 11070563
Agent [candidate] (1.188 s) : 0, 1187793
Total [candidate] (10.994 s) : 0, 10994337
gantt
title petclinic - break down per module: candidate=1.61.0-SNAPSHOT~a3832a0fc0, baseline=1.61.0-SNAPSHOT~c1e9ac6389
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.211 ms) : 0, 1211
crashtracking [candidate] (1.198 ms) : 0, 1198
BytebuddyAgent [baseline] (627.546 ms) : 0, 627546
BytebuddyAgent [candidate] (628.762 ms) : 0, 628762
AgentMeter [baseline] (29.136 ms) : 0, 29136
AgentMeter [candidate] (29.296 ms) : 0, 29296
GlobalTracer [baseline] (256.69 ms) : 0, 256690
GlobalTracer [candidate] (257.352 ms) : 0, 257352
AppSec [baseline] (31.436 ms) : 0, 31436
AppSec [candidate] (31.607 ms) : 0, 31607
Debugger [baseline] (59.375 ms) : 0, 59375
Debugger [candidate] (59.587 ms) : 0, 59587
Remote Config [baseline] (593.37 µs) : 0, 593
Remote Config [candidate] (586.494 µs) : 0, 586
Telemetry [baseline] (8.693 ms) : 0, 8693
Telemetry [candidate] (8.647 ms) : 0, 8647
Flare Poller [baseline] (8.032 ms) : 0, 8032
Flare Poller [candidate] (7.343 ms) : 0, 7343
section appsec
crashtracking [baseline] (1.207 ms) : 0, 1207
crashtracking [candidate] (1.199 ms) : 0, 1199
BytebuddyAgent [baseline] (662.677 ms) : 0, 662677
BytebuddyAgent [candidate] (659.676 ms) : 0, 659676
AgentMeter [baseline] (12.128 ms) : 0, 12128
AgentMeter [candidate] (12.08 ms) : 0, 12080
GlobalTracer [baseline] (259.917 ms) : 0, 259917
GlobalTracer [candidate] (259.057 ms) : 0, 259057
AppSec [baseline] (178.32 ms) : 0, 178320
AppSec [candidate] (178.34 ms) : 0, 178340
Debugger [baseline] (66.115 ms) : 0, 66115
Debugger [candidate] (65.17 ms) : 0, 65170
Remote Config [baseline] (578.284 µs) : 0, 578
Remote Config [candidate] (562.886 µs) : 0, 563
Telemetry [baseline] (9.259 ms) : 0, 9259
Telemetry [candidate] (9.895 ms) : 0, 9895
Flare Poller [baseline] (3.637 ms) : 0, 3637
Flare Poller [candidate] (3.567 ms) : 0, 3567
IAST [baseline] (24.106 ms) : 0, 24106
IAST [candidate] (23.98 ms) : 0, 23980
section iast
crashtracking [baseline] (1.219 ms) : 0, 1219
crashtracking [candidate] (1.185 ms) : 0, 1185
BytebuddyAgent [baseline] (798.683 ms) : 0, 798683
BytebuddyAgent [candidate] (794.066 ms) : 0, 794066
AgentMeter [baseline] (11.357 ms) : 0, 11357
AgentMeter [candidate] (11.373 ms) : 0, 11373
GlobalTracer [baseline] (247.866 ms) : 0, 247866
GlobalTracer [candidate] (246.671 ms) : 0, 246671
AppSec [baseline] (26.455 ms) : 0, 26455
AppSec [candidate] (26.402 ms) : 0, 26402
Debugger [baseline] (64.802 ms) : 0, 64802
Debugger [candidate] (65.345 ms) : 0, 65345
Remote Config [baseline] (529.944 µs) : 0, 530
Remote Config [candidate] (520.709 µs) : 0, 521
Telemetry [baseline] (14.554 ms) : 0, 14554
Telemetry [candidate] (13.849 ms) : 0, 13849
Flare Poller [baseline] (4.023 ms) : 0, 4023
Flare Poller [candidate] (3.864 ms) : 0, 3864
IAST [baseline] (25.182 ms) : 0, 25182
IAST [candidate] (25.066 ms) : 0, 25066
section profiling
ProfilingAgent [baseline] (94.053 ms) : 0, 94053
ProfilingAgent [candidate] (94.202 ms) : 0, 94202
crashtracking [baseline] (1.172 ms) : 0, 1172
crashtracking [candidate] (1.181 ms) : 0, 1181
BytebuddyAgent [baseline] (685.71 ms) : 0, 685710
BytebuddyAgent [candidate] (685.694 ms) : 0, 685694
AgentMeter [baseline] (8.648 ms) : 0, 8648
AgentMeter [candidate] (8.685 ms) : 0, 8685
GlobalTracer [baseline] (216.515 ms) : 0, 216515
GlobalTracer [candidate] (216.515 ms) : 0, 216515
AppSec [baseline] (32.031 ms) : 0, 32031
AppSec [candidate] (32.233 ms) : 0, 32233
Debugger [baseline] (65.025 ms) : 0, 65025
Debugger [candidate] (63.4 ms) : 0, 63400
Remote Config [baseline] (594.169 µs) : 0, 594
Remote Config [candidate] (581.658 µs) : 0, 582
Telemetry [baseline] (8.992 ms) : 0, 8992
Telemetry [candidate] (10.632 ms) : 0, 10632
Flare Poller [baseline] (3.48 ms) : 0, 3480
Flare Poller [candidate] (3.539 ms) : 0, 3539
Profiling [baseline] (94.637 ms) : 0, 94637
Profiling [candidate] (94.774 ms) : 0, 94774
Startup time reports for insecure-bank
gantt
title insecure-bank - global startup overhead: candidate=1.61.0-SNAPSHOT~a3832a0fc0, baseline=1.61.0-SNAPSHOT~c1e9ac6389
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.058 s) : 0, 1058412
Total [baseline] (8.878 s) : 0, 8878412
Agent [candidate] (1.058 s) : 0, 1058102
Total [candidate] (8.842 s) : 0, 8841984
section iast
Agent [baseline] (1.237 s) : 0, 1237048
Total [baseline] (9.589 s) : 0, 9589228
Agent [candidate] (1.225 s) : 0, 1225140
Total [candidate] (9.548 s) : 0, 9548283
gantt
title insecure-bank - break down per module: candidate=1.61.0-SNAPSHOT~a3832a0fc0, baseline=1.61.0-SNAPSHOT~c1e9ac6389
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.221 ms) : 0, 1221
crashtracking [candidate] (1.191 ms) : 0, 1191
BytebuddyAgent [baseline] (628.006 ms) : 0, 628006
BytebuddyAgent [candidate] (628.283 ms) : 0, 628283
AgentMeter [baseline] (29.128 ms) : 0, 29128
AgentMeter [candidate] (29.173 ms) : 0, 29173
GlobalTracer [baseline] (257.141 ms) : 0, 257141
GlobalTracer [candidate] (256.656 ms) : 0, 256656
AppSec [baseline] (31.555 ms) : 0, 31555
AppSec [candidate] (31.592 ms) : 0, 31592
Debugger [baseline] (58.824 ms) : 0, 58824
Debugger [candidate] (58.69 ms) : 0, 58690
Remote Config [baseline] (596.841 µs) : 0, 597
Remote Config [candidate] (586.894 µs) : 0, 587
Telemetry [baseline] (8.656 ms) : 0, 8656
Telemetry [candidate] (8.677 ms) : 0, 8677
Flare Poller [baseline] (7.133 ms) : 0, 7133
Flare Poller [candidate] (7.116 ms) : 0, 7116
section iast
crashtracking [baseline] (1.234 ms) : 0, 1234
crashtracking [candidate] (1.197 ms) : 0, 1197
BytebuddyAgent [baseline] (803.492 ms) : 0, 803492
BytebuddyAgent [candidate] (795.67 ms) : 0, 795670
AgentMeter [baseline] (11.601 ms) : 0, 11601
AgentMeter [candidate] (11.295 ms) : 0, 11295
GlobalTracer [baseline] (249.24 ms) : 0, 249240
GlobalTracer [candidate] (247.062 ms) : 0, 247062
AppSec [baseline] (27.456 ms) : 0, 27456
AppSec [candidate] (26.245 ms) : 0, 26245
Debugger [baseline] (61.976 ms) : 0, 61976
Debugger [candidate] (62.343 ms) : 0, 62343
Remote Config [baseline] (521.429 µs) : 0, 521
Remote Config [candidate] (529.902 µs) : 0, 530
Telemetry [baseline] (14.984 ms) : 0, 14984
Telemetry [candidate] (14.961 ms) : 0, 14961
Flare Poller [baseline] (4.654 ms) : 0, 4654
Flare Poller [candidate] (4.652 ms) : 0, 4652
IAST [baseline] (25.326 ms) : 0, 25326
IAST [candidate] (25.094 ms) : 0, 25094
Load
Summary
Found 2 performance improvements and 3 performance regressions! Performance is the same for 15 metrics, 16 unstable metrics.
Request duration reports for insecure-bank
gantt
title insecure-bank - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~a3832a0fc0, baseline=1.61.0-SNAPSHOT~c1e9ac6389
dateFormat X
axisFormat %s
section baseline
no_agent (1.179 ms) : 1168, 1191
. : milestone, 1179,
iast (3.112 ms) : 3072, 3152
. : milestone, 3112,
iast_FULL (5.884 ms) : 5825, 5943
. : milestone, 5884,
iast_GLOBAL (3.618 ms) : 3562, 3673
. : milestone, 3618,
profiling (2.144 ms) : 2123, 2165
. : milestone, 2144,
tracing (1.787 ms) : 1773, 1802
. : milestone, 1787,
section candidate
no_agent (1.19 ms) : 1178, 1201
. : milestone, 1190,
iast (2.997 ms) : 2958, 3035
. : milestone, 2997,
iast_FULL (5.737 ms) : 5680, 5795
. : milestone, 5737,
iast_GLOBAL (3.317 ms) : 3267, 3367
. : milestone, 3317,
profiling (2.02 ms) : 2000, 2040
. : milestone, 2020,
tracing (1.741 ms) : 1727, 1754
. : milestone, 1741,
Request duration reports for petclinic
gantt
title petclinic - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~a3832a0fc0, baseline=1.61.0-SNAPSHOT~c1e9ac6389
dateFormat X
axisFormat %s
section baseline
no_agent (17.959 ms) : 17774, 18144
. : milestone, 17959,
appsec (18.849 ms) : 18656, 19041
. : milestone, 18849,
code_origins (17.832 ms) : 17653, 18010
. : milestone, 17832,
iast (17.635 ms) : 17459, 17811
. : milestone, 17635,
profiling (19.913 ms) : 19707, 20119
. : milestone, 19913,
tracing (18.097 ms) : 17916, 18277
. : milestone, 18097,
section candidate
no_agent (18.181 ms) : 17995, 18366
. : milestone, 18181,
appsec (18.718 ms) : 18531, 18906
. : milestone, 18718,
code_origins (17.778 ms) : 17602, 17953
. : milestone, 17778,
iast (18.331 ms) : 18147, 18515
. : milestone, 18331,
profiling (19.475 ms) : 19279, 19671
. : milestone, 19475,
tracing (18.973 ms) : 18781, 19165
. : milestone, 18973,
Dacapo
Summary
Found 0 performance improvements and 1 performance regressions! Performance is the same for 10 metrics, 1 unstable metrics.
Execution time for biojava
gantt
title biojava - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~a3832a0fc0, baseline=1.61.0-SNAPSHOT~c1e9ac6389
dateFormat X
axisFormat %s
section baseline
no_agent (15.487 s) : 15487000, 15487000
. : milestone, 15487000,
appsec (15.087 s) : 15087000, 15087000
. : milestone, 15087000,
iast (18.106 s) : 18106000, 18106000
. : milestone, 18106000,
iast_GLOBAL (17.882 s) : 17882000, 17882000
. : milestone, 17882000,
profiling (15.681 s) : 15681000, 15681000
. : milestone, 15681000,
tracing (15.258 s) : 15258000, 15258000
. : milestone, 15258000,
section candidate
no_agent (15.438 s) : 15438000, 15438000
. : milestone, 15438000,
appsec (15.156 s) : 15156000, 15156000
. : milestone, 15156000,
iast (18.182 s) : 18182000, 18182000
. : milestone, 18182000,
iast_GLOBAL (17.755 s) : 17755000, 17755000
. : milestone, 17755000,
profiling (14.599 s) : 14599000, 14599000
. : milestone, 14599000,
tracing (15.147 s) : 15147000, 15147000
. : milestone, 15147000,
Execution time for tomcat
gantt
title tomcat - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~a3832a0fc0, baseline=1.61.0-SNAPSHOT~c1e9ac6389
dateFormat X
axisFormat %s
section baseline
no_agent (1.448 ms) : 1436, 1459
. : milestone, 1448,
appsec (2.518 ms) : 2462, 2573
. : milestone, 2518,
iast (2.249 ms) : 2180, 2319
. : milestone, 2249,
iast_GLOBAL (2.293 ms) : 2223, 2363
. : milestone, 2293,
profiling (2.111 ms) : 2055, 2168
. : milestone, 2111,
tracing (1.971 ms) : 1923, 2019
. : milestone, 1971,
section candidate
no_agent (1.477 ms) : 1465, 1488
. : milestone, 1477,
appsec (3.757 ms) : 3540, 3975
. : milestone, 3757,
iast (2.262 ms) : 2193, 2332
. : milestone, 2262,
iast_GLOBAL (2.299 ms) : 2229, 2369
. : milestone, 2299,
profiling (2.112 ms) : 2056, 2169
. : milestone, 2112,
tracing (2.068 ms) : 2014, 2121
. : milestone, 2068,
# Conflicts: # dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricKey.java
When client-computed stats (CCS) are enabled, the agent **merges** stats it computes itself from raw spans with stats pre-computed by the tracer. For gRPC spans, without CCS the agent resolves the status code from the span's tags via [`getGRPCStatusCode()`](https://github.com/DataDog/datadog-agent/blob/47938ea8c9b9894dcb03dc3f81cf2c6e408f1b6c/pkg/trace/stats/aggregation.go#L167-L221), which always returns a numeric string (e.g. `4`) or an empty string. With CCS enabled, the agent uses the tracer-provided [`GRPCStatusCode`](https://github.com/DataDog/datadog-agent/blob/47938ea8c9b9894dcb03dc3f81cf2c6e408f1b6c/pkg/trace/stats/aggregation.go#L160) without translation. This change mimics the agent's own aggregation and sends what the agent expects in [`NewAggregationFromGroup`](https://github.com/DataDog/datadog-agent/blob/47938ea8c9b9894dcb03dc3f81cf2c6e408f1b6c/pkg/trace/stats/aggregation.go#L146-L165). Protocol-wise, [ClientGroupedStats.GRPC_status_code](https://github.com/DataDog/datadog-agent/blob/47938ea8c9b9894dcb03dc3f81cf2c6e408f1b6c/pkg/proto/datadog/trace/stats.proto#L103) is a `string`.
What Does This Do
Reports the gRPC status code via Client Computed Stats.
The agent has supported this field since 7.65.0 (DataDog/datadog-agent#34220), which is also the minimum agent version required for CCS.
The current gRPC instrumentations capture the status code name but not its numeric value, so this change adds a new span tag that the client-side aggregation reads:
  span.setTag("status.code", status.getCode().name());
  span.setTag("grpc.status.code", status.getCode().name());
+ span.setTag("rpc.grpc.status_code", status.getCode().value());
This affects the grpc and armeria instrumentations.
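As a rough illustration of how the new tag could feed the stats field, the sketch below sets the same three tags on a stand-in span and then renders the numeric tag as the string sent in the stats payload. `FakeSpan`, `StatusCode`, and `grpcStatusCodeForStats` are hypothetical names introduced only for this example. They are not tracer or grpc-java types, and the tracer's real aggregator is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative stand-ins only; none of these types exist in dd-trace-java or grpc-java. */
final class GrpcStatusTagSketch {

  /** Small subset of gRPC codes with their canonical numeric values. */
  enum StatusCode {
    OK(0), CANCELLED(1), DEADLINE_EXCEEDED(4), NOT_FOUND(5);

    private final int value;

    StatusCode(int value) {
      this.value = value;
    }

    int value() {
      return value;
    }
  }

  /** Stand-in for a span: just an ordered tag map. */
  static final class FakeSpan {
    final Map<String, Object> tags = new LinkedHashMap<>();

    void setTag(String key, Object value) {
      tags.put(key, value);
    }
  }

  private GrpcStatusTagSketch() {}

  /** Sets the two name-based tags plus the numeric tag, as in the diff above. */
  static void tagStatus(FakeSpan span, StatusCode code) {
    span.setTag("status.code", code.name());
    span.setTag("grpc.status.code", code.name());
    span.setTag("rpc.grpc.status_code", code.value());
  }

  /** Renders the numeric tag as a string, or "" when the tag is absent or not a number. */
  static String grpcStatusCodeForStats(FakeSpan span) {
    Object value = span.tags.get("rpc.grpc.status_code");
    return value instanceof Number ? Integer.toString(((Number) value).intValue()) : "";
  }
}
```

Reading the numeric tag keeps the name-based tags untouched for existing consumers while giving the aggregation a value that needs no translation.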
Motivation
Completeness of CCS.
Additional notes
When client-computed stats (CCS) are enabled, the agent merges stats it computes itself from raw spans with stats pre-computed by the tracer.
For gRPC spans, without CCS the agent resolves the status code from the span's tags via `getGRPCStatusCode()`, which always returns a numeric string (e.g. `4`) or an empty string. With CCS enabled, the code uses `GRPCStatusCode` without translation.
flowchart TB
subgraph tracer["dd-trace-java"]
span["gRPC span<br>grpc.status.code = 'DEADLINE_EXCEEDED'<br>rpc.grpc.status_code = 4"]
span -->|raw spans| v04["POST /v0.4/traces<br>msgpack"]
span --> agg["ConflatingMetricsAggregator<br>reads rpc.grpc.status_code<br>GRPCStatusCode = '4'"]
agg -->|pre-computed stats| v06["POST /v0.6/stats<br>msgpack · GRPCStatusCode: '4'"]
end
subgraph agent["datadog-agent"]
v04 --> agentPath["NewAggregationFromSpan<br>getGRPCStatusCode<br>meta[grpc.status.code]='DEADLINE_EXCEEDED' → '4'"]
v06 --> ccsPath["NewAggregationFromGroup<br>GRPCStatusCode → '4'"]
agentPath --> k1["key{GRPCStatusCode:'4',...}"]
ccsPath --> k2["key{GRPCStatusCode:'4',...}"]
end
This change mimics the aggregation of the agent, and what is expected from the agent, in `NewAggregationFromGroup`. Protocol-wise, ClientGroupedStats.GRPC_status_code is a `string`.
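The reason both paths must agree on the string form can be sketched with a reduced aggregation key. The `Key` class, the resource name, and `bucketsAfterMerge` below are hypothetical and far simpler than the real keys in either the tracer or the agent. The sketch only shows that stats land in the same bucket when both sides use the same code string, and split into two buckets when one side sends the enum name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical illustration; the real aggregation keys carry many more fields. */
final class StatsMergeSketch {

  /** Reduced aggregation key: resource name plus gRPC status code string. */
  static final class Key {
    final String resource;
    final String grpcStatusCode;

    Key(String resource, String grpcStatusCode) {
      this.resource = resource;
      this.grpcStatusCode = grpcStatusCode;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Key)) {
        return false;
      }
      Key other = (Key) o;
      return resource.equals(other.resource) && grpcStatusCode.equals(other.grpcStatusCode);
    }

    @Override
    public int hashCode() {
      return Objects.hash(resource, grpcStatusCode);
    }
  }

  private StatsMergeSketch() {}

  /** Records one hit per path and returns how many distinct buckets result. */
  static int bucketsAfterMerge(String agentSideCode, String tracerSideCode) {
    Map<Key, Integer> hits = new HashMap<>();
    hits.merge(new Key("Greeter/SayHello", agentSideCode), 1, Integer::sum);
    hits.merge(new Key("Greeter/SayHello", tracerSideCode), 1, Integer::sum);
    return hits.size();
  }

  public static void main(String[] args) {
    System.out.println(bucketsAfterMerge("4", "4"));
    System.out.println(bucketsAfterMerge("4", "DEADLINE_EXCEEDED"));
  }
}
```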