feat(data-pipeline)!: export client-computed span stats as OTLP trace metrics #2150
mabdinur wants to merge 11 commits into
Foundation pieces consumed by the OTLP trace-metrics exporter that follows.
These are pure additions with no breaking changes.
- libdd-ddsketch: `DDSketch::from_pb` rebuilds a sketch from its protobuf
representation (or `None` when the mapping is missing/invalid); a thin
`DDSketch::from_encoded` wraps protobuf decoding + `from_pb`. Lets callers
read back the ok/error sketches that the span concentrator publishes.
Includes a roundtrip test that goes `encode_to_vec` -> `from_encoded` and
asserts bin count + total weight survive the trip.
- libdd-trace-utils: extend `OtlpResourceInfo` with two new fields:
`hostname` (emitted as the `host.name` resource attribute when set) and
`process_tags` (comma-separated `key:value` pairs, each becoming a
`dd.<key>` resource attribute). The struct is `#[non_exhaustive]`, so
adding fields is forward-compatible.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
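To illustrate the `process_tags` mapping described in this commit, here is a minimal sketch in plain Rust. The function name `process_tags_to_attributes` and the handling of malformed pairs (silently skipped) are assumptions made for illustration; they are not taken from `libdd-trace-utils`.

```rust
/// Hypothetical helper (not the crate's API): turn a comma-separated
/// `key:value` string into `dd.<key>` resource attributes.
fn process_tags_to_attributes(process_tags: &str) -> Vec<(String, String)> {
    process_tags
        .split(',')
        .filter_map(|pair| {
            // Split on the first ':' only, so values may themselves contain ':'.
            let (key, value) = pair.trim().split_once(':')?;
            if key.is_empty() {
                // Assumption: pairs without a key are dropped.
                return None;
            }
            Some((format!("dd.{key}"), value.to_string()))
        })
        .collect()
}
```

Under these assumptions, `"entrypoint.name:app,svc.user:true"` yields the attributes `dd.entrypoint.name` and `dd.svc.user`, and an empty string yields no attributes.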
Make the span concentrator accumulate exact per-cell (ok/error) duration
totals and min/max in nanoseconds alongside the existing combined
`duration` that the /v0.6/stats agent payload uses, and publish them on a
new sidecar that the OTLP trace-metrics path can read.
- `GroupedStats` gains six pub(super) accumulators
(`ok_duration`/`ok_min`/`ok_max` + the error trio) updated inside
`insert`. They are seeded on the first span in each cell (count == 1) so
the natural `0` default cannot masquerade as a real minimum.
- New public types `OtlpExactCell`, `OtlpExactGroup`, `OtlpStatsBucket`
carry the exact scalars alongside an unmodified `pb::ClientStatsBucket`.
The `grpc_method` field on `OtlpExactGroup` is intentionally introduced
here but only ever populated with `String::new()`; a later commit fills
it in.
- `StatsBucket::flush` now delegates to a new `flush_with_otlp_exact` which
produces both the protobuf bucket (identical bytes) and the parallel
sidecar. `SpanConcentrator::flush` and `flush_with_otlp_exact` share a
generic `drain_due_buckets` helper so the bucket-window/buffer-len logic
stays in one place.
- A new concentrator test drives the full path through `add_span` for
3 ok + 2 error spans and asserts each cell's count/duration/min/max plus
`ok_duration + error_duration == group.duration` (the agent field).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
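The seeding rule for min/max can be sketched as follows. `ExactCellSketch` is a hypothetical stand-in for one (ok or error) cell and is not the `GroupedStats` layout from the commit; only the "seed on count == 1" behaviour is taken from the description above.

```rust
/// Hypothetical per-cell accumulator; all durations are in nanoseconds.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct ExactCellSketch {
    count: u64,
    duration: u64,
    min: u64,
    max: u64,
}

impl ExactCellSketch {
    fn insert(&mut self, duration_ns: u64) {
        self.count += 1;
        self.duration += duration_ns;
        if self.count == 1 {
            // First span in the cell: seed min/max so the default `0`
            // is never reported as a real minimum.
            self.min = duration_ns;
            self.max = duration_ns;
        } else {
            self.min = self.min.min(duration_ns);
            self.max = self.max.max(duration_ns);
        }
    }
}
```

Keeping one such accumulator per ok/error cell also makes the invariant from the test easy to state: the two `duration` totals add up to the combined duration sent to the agent.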
…nfig
Prepare the data-pipeline OTLP layer to host a second exporter (trace
metrics) without changing the existing trace path's behavior or public API.
- `otlp/exporter.rs`: factor the actual POST + retry plumbing into a new
crate-private `send_otlp_http(endpoint_url, headers, timeout, ...)`
helper. `send_otlp_traces_http` becomes a thin wrapper that pulls fields
out of `OtlpTraceConfig` and calls it; the existing public function
signature is unchanged, so external callers see no diff. Two new
pub(crate) constants (`OTLP_MAX_ATTEMPTS`, `OTLP_SHUTDOWN_MAX_ATTEMPTS`)
replace the previous `OTLP_MAX_RETRIES` literal so the trace-metrics
worker can use a single attempt on shutdown.
- `otlp/config.rs`: add `OtlpMetricsConfig` mirroring `OtlpTraceConfig`
plus an `otel_trace_semantics_enabled` flag for
`DD_TRACE_OTEL_SEMANTICS_ENABLED`. Annotated `#[allow(dead_code)]` until a
follow-up commit consumes it.
- `trace_exporter/builder.rs`: factor the inline OTLP header-map builder out
of `build_async` into a small `build_otlp_header_map` helper and refactor
the existing OTLP traces config building to use it. No behavior change;
this dedup makes the metrics-config branch trivial when it lands.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…metrics
Wire up the actual OTLP trace-metrics exporter on top of the foundation
pieces from earlier commits.
- New `libdd-data-pipeline/src/otlp/metrics.rs`:
- `map_stats_to_otlp_metrics` builds an `ExportMetricsServiceRequest`
JSON value from `&[OtlpStatsBucket]` (one histogram data point per
aggregation-key (ok|error) cell). `count`/`sum`/`min`/`max` come from
the sidecar's exact accumulators (ns -> s); `bucketCounts` is projected
from the per-cell DDSketch onto a fixed 17-bucket spanmetrics-style
layout. Empty cells are suppressed.
- `OtlpStatsExporter<C>` runs as a `libdd_shared_runtime::Worker`:
`trigger` waits one flush interval, `run` flushes + sends with
`OTLP_MAX_ATTEMPTS`, `shutdown` force-flushes with
`OTLP_SHUTDOWN_MAX_ATTEMPTS` (single attempt) so the final bucket is
delivered inside the bounded shutdown window.
- The mapper consumes `exact.grpc_method` (always empty here) so the
later breaking-change commit only has to fill it in.
- `otlp/mod.rs`: declare the new `metrics` module, re-export
`OtlpMetricsConfig` and `OtlpStatsExporter`, and extend the module-level
doc to describe the trace-metrics path.
- `trace_exporter/builder.rs`: add `otlp_metrics_endpoint`,
`otlp_metrics_headers` and `otel_trace_semantics_enabled` fields with
matching setters (`set_otlp_metrics_endpoint`, `set_otlp_metrics_headers`,
`enable_otel_trace_semantics`). When both an OTLP metrics endpoint and a
stats bucket size are configured, spawn an `OtlpStatsExporter` worker on
the shared runtime against an unconditionally-started
`SpanConcentrator`; set a new `otlp_stats_enabled` flag on `TraceExporter`
so the agent-info gate cannot later disable stats. The agent /v0.6/stats
payload bytes are unchanged when no OTLP metrics endpoint is set.
- `trace_exporter/mod.rs`: add the `otlp_stats_enabled` field on
`TraceExporter`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
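The `bucketCounts` projection and the ns -> s conversion can be sketched like this. It assumes each sketch bin has already been reduced to a representative value in seconds plus a count, and it takes the explicit bounds as a parameter because the 17-bucket layout itself is not spelled out in this commit message; `project_bins` and `ns_to_s` are hypothetical names.

```rust
/// Convert an exact nanosecond total to seconds for the OTLP data point.
fn ns_to_s(ns: u64) -> f64 {
    ns as f64 / 1e9
}

/// Project (value_in_seconds, count) bins onto explicit histogram bounds.
/// `bounds` must be sorted ascending; the result has `bounds.len() + 1`
/// entries, following the OTLP convention that bucket `i` covers
/// `(bounds[i - 1], bounds[i]]` and the last bucket is unbounded above.
fn project_bins(bins: &[(f64, u64)], bounds: &[f64]) -> Vec<u64> {
    let mut counts = vec![0u64; bounds.len() + 1];
    for &(value_s, count) in bins {
        // Index of the first bound that is >= value, i.e. the bucket whose
        // inclusive upper bound contains the value.
        let idx = bounds.partition_point(|b| *b < value_s);
        counts[idx] += count;
    }
    counts
}
```

With 16 bounds this produces the 17 counts mentioned above; since `count`/`sum`/`min`/`max` come from the exact accumulators, only the bucket placement depends on the sketch's approximation.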
Add the gRPC method name to the aggregation key so spans sharing the same
service/resource/etc. but different `grpc.method.name` aggregate into
distinct groups, and surface the value via the OTLP trace-metrics sidecar
introduced earlier on this branch.
- `aggregation.rs`:
- New `GRPC_METHOD_FIELD` lookup list (`grpc.method.name`, fallback
`rpc.method`) consumed by a new `get_grpc_method` helper.
- New `FixedAggregationKey<T>.grpc_method` field, appended at the END of
the struct so the `PartialOrd` derive's field order (and therefore the
ordering of any existing comparisons) is unaffected for the pre-existing
fields.
- `BorrowedAggregationKey::from_obfuscated_span` now picks up
`grpc_method`; `OwnedAggregationKey::From<pb::ClientGroupedStats>` sets
it to `""` (the agent stats protobuf does not carry it).
- `StatsBucket::flush_with_otlp_exact` does `std::mem::take` on the key's
`grpc_method` and moves it into `OtlpExactGroup.grpc_method` before
encoding the agent payload, so the OTLP path reads it from the sidecar
while the /v0.6/stats wire format stays byte-for-byte unchanged.
- Aggregation test gains a case asserting that `grpc.method.name` (and
by fallthrough, `rpc.method`) are extracted into the key.
- `datadog-ipc/src/shm_stats.rs`: the SHM concentrator's
`FixedAggregationKey` test fixture grows a `grpc_method: ""` field.
BREAKING CHANGE: `FixedAggregationKey<T>` (re-exported from
`libdd_trace_stats::span_concentrator`) gains a public `grpc_method: T`
field. External callers that construct it via a struct literal must add
the field; callers using `Default::default()` are unaffected. The /v0.6/stats
agent protobuf wire format and behavior are unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
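The lookup order (`grpc.method.name` first, then `rpc.method`) can be sketched as below. The span meta is modelled as a plain `HashMap<String, String>`, and `lookup_grpc_method` is a hypothetical name; the real `get_grpc_method` helper works on the crate's span types, which are not reproduced here.

```rust
use std::collections::HashMap;

/// Keys checked in priority order, as described in the commit message.
const GRPC_METHOD_KEYS: [&str; 2] = ["grpc.method.name", "rpc.method"];

/// Hypothetical lookup: the first matching key wins; an absent method is
/// represented as the empty string.
fn lookup_grpc_method(meta: &HashMap<String, String>) -> &str {
    GRPC_METHOD_KEYS
        .iter()
        .find_map(|key| meta.get(*key).map(String::as_str))
        .unwrap_or("")
}
```

Returning `""` for a missing method mirrors the value that `OwnedAggregationKey::From<pb::ClientGroupedStats>` uses when the agent protobuf carries no method.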
…n SDK computes stats
When otlp_stats_enabled, add _dd.stats_computed="true" to the OTLP
ResourceSpans resource attributes and Datadog-Client-Computed-Stats: yes to
the HTTP request headers. The Agent's OTLP receiver already checks both
signals (otlp.go:372, otlp.go:272) and skips its concentrator when either
is set, preventing double-counted APM metrics. The resource attribute
survives Collector hops (unlike HTTP headers); the header covers direct
SDK→Agent connections. Both are backwards compatible: older Agents and
non-Datadog OTLP receivers silently ignore unknown resource attributes and
headers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… negotiation with OTLP stats
grpc_method was part of FixedAggregationKey (Hash+PartialEq), splitting
same-service gRPC spans into separate buckets that encode_grouped_stats
then serialised with an empty method, producing duplicate
indistinguishable ClientGroupedStats rows on the /v0.6/stats path. Move it
to GroupedStats (value side), set on group creation, and surface it to
OtlpExactGroup from there. This also removes the one breaking change
introduced by the prior commit.
check_agent_info returned before refresh_v1_active when otlp_stats_enabled,
preventing V1 protocol negotiation for exporters that combine
enable_v1_protocol with OTLP metrics. Move the early return to after the V1
refresh so only stats enable/disable is skipped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Rename `OTLP_MAX_ATTEMPTS`/`OTLP_SHUTDOWN_MAX_ATTEMPTS` to
`OTLP_MAX_RETRIES`/`OTLP_SHUTDOWN_MAX_RETRIES` and rename the
`max_attempts` parameter to `max_retries` throughout, converging on the
retries convention used elsewhere.
- Add `TraceExporterBuilder::set_runtime_id` so callers can supply the
language tracer's existing runtime_id; falls back to a generated UUID when
not set, ensuring OTLP trace exports and OTLP trace-metrics share the same
runtime_id for backend correlation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spans with different gRPC methods were previously merged into the same stats group (only the first span's method was kept). Adding grpc_method to FixedAggregationKey ensures each method gets a separate bucket. The OtlpExactGroup.grpc_method field is now sourced from the key rather than a GroupedStats sidecar. The agent /v0.6/stats protobuf wire format is unchanged (no grpc_method field in ClientGroupedStats). SHM_VERSION bumped to 2 because FixedAggregationKey<StringRef> is #[repr(C)] and the new field changes the layout; mismatched sidecar/worker pairs will safely fail with a version-mismatch error. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
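The version-mismatch guard can be pictured with the following sketch. `ShmHeaderSketch`, `AttachError`, and `check_layout_version` are hypothetical; the actual header layout and error type in `datadog-ipc` are not shown in this PR text. Only the idea is taken from the commit: a `#[repr(C)]` layout change is paired with a version bump so that a reader built against the old layout refuses to attach.

```rust
/// Version the reader was compiled against (bumped to 2 in this commit).
const SHM_VERSION: u32 = 2;

/// Hypothetical shared-memory header; only the version field matters here.
#[repr(C)]
struct ShmHeaderSketch {
    version: u32,
}

#[derive(Debug, PartialEq)]
enum AttachError {
    VersionMismatch { expected: u32, found: u32 },
}

/// Refuse to interpret the segment unless the writer used the same layout.
fn check_layout_version(header: &ShmHeaderSketch) -> Result<(), AttachError> {
    if header.version == SHM_VERSION {
        Ok(())
    } else {
        Err(AttachError::VersionMismatch {
            expected: SHM_VERSION,
            found: header.version,
        })
    }
}
```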
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6f244c9e7d
    if self.otlp_stats_enabled {
        return;
Apply agent trace filters before skipping OTLP stats updates
When OTLP metrics are enabled and the agent /info response carries filter_tags, regex filters, or ignore_resources, this early return exits before installing the new TraceFilterer. process_traces_for_stats still runs with the OTLP concentrator enabled and uses self.trace_filterer.load(), so it keeps the empty filter config and exports metrics for traces that the agent config says should be rejected/ignored. Move this return below the trace-filter update (and state bookkeeping) and only skip the agent-driven stats enable/disable block.
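The suggested ordering can be sketched as follows. `ExporterSketch` and `AgentInfoSketch`, including their field names, are hypothetical simplifications and not the `TraceExporter` or agent-info types; the point is only that the filter update runs before the OTLP early return.

```rust
/// Hypothetical slice of the agent /info response.
#[derive(Default)]
struct AgentInfoSketch {
    filter_tags: Vec<String>,
    stats_supported: bool,
}

/// Hypothetical slice of the exporter state.
#[derive(Default)]
struct ExporterSketch {
    otlp_stats_enabled: bool,
    trace_filters: Vec<String>,
    agent_stats_enabled: bool,
}

impl ExporterSketch {
    fn check_agent_info(&mut self, info: &AgentInfoSketch) {
        // Always install the agent-provided filters first, so the OTLP
        // stats path sees the same reject/ignore rules as the agent path.
        self.trace_filters = info.filter_tags.clone();

        // Only the agent-driven stats enable/disable block is skipped.
        if self.otlp_stats_enabled {
            return;
        }
        self.agent_stats_enabled = info.stats_supported;
    }
}
```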
Artifact Size Benchmark Report
- aarch64-alpine-linux-musl
- aarch64-unknown-linux-gnu
- libdatadog-x64-windows
- libdatadog-x86-windows
- x86_64-alpine-linux-musl
- x86_64-unknown-linux-gnu
What does this PR do?
Adds a new OTLP trace-metrics export path to the data pipeline. When the
SDK computes stats client-side, the span concentrator now flushes them as
`traces.span.sdk.metrics.duration` OTLP histograms in addition to the
existing agent `/v0.6/stats` payload.
Includes a fix so that `grpc.method` (/`rpc.method`) is part of the
aggregation key rather than attached after the fact; spans with
different gRPC methods now get separate metric data points.
Motivation
The OTLP metrics path enables downstream consumers (e.g., OTel
spanmetrics-connector-compatible backends) to receive exact per-method
latency histograms without going through the Datadog agent.
Additional Notes
Breaking changes (this goes out in a major version):
- `FixedAggregationKey<T>` gains a `grpc_method: T` field. Any code that
constructs this struct by name (not `..Default::default()`) must add the
new field. The SHM concentrator's `SHM_VERSION` is bumped from 1 → 2 to
prevent layout mismatches between old workers and a new sidecar; they
will fail with a version-mismatch error rather than silently
misinterpreting memory.
- `StatsBucket::insert` no longer takes a `grpc_method` parameter.

The agent `/v0.6/stats` protobuf wire format (`ClientGroupedStats`) is
unchanged; there is no `grpc_method` field in the proto, so the new
key dimension is surfaced only through the OTLP path.
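A small sketch of what the new key dimension means for callers. `KeySketch` is a hypothetical three-field stand-in for `FixedAggregationKey<T>` (the real struct has more fields); it shows that keys differing only in `grpc_method` land in separate groups, and that construction via `..Default::default()` keeps compiling when a field is added.

```rust
use std::collections::HashMap;

/// Hypothetical, reduced aggregation key.
#[derive(Debug, Default, Clone, PartialEq, Eq, Hash)]
struct KeySketch {
    service_name: String,
    resource_name: String,
    grpc_method: String,
}

/// Count spans per key; each tuple is (service, resource, grpc method).
fn count_by_key(spans: &[(&str, &str, &str)]) -> HashMap<KeySketch, u64> {
    let mut groups = HashMap::new();
    for &(service, resource, method) in spans {
        let key = KeySketch {
            service_name: service.to_string(),
            resource_name: resource.to_string(),
            grpc_method: method.to_string(),
        };
        *groups.entry(key).or_insert(0) += 1;
    }
    groups
}

/// Struct-update syntax: unaffected by newly added fields.
fn key_for_service(service: &str) -> KeySketch {
    KeySketch {
        service_name: service.to_string(),
        ..Default::default()
    }
}
```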
How to test the change?
Unit tests in `libdd-trace-stats` cover aggregation key extraction
(including `grpc.method.name` and `rpc.method`) and the OTLP exact-cell
flush path. Integration tests in `libdd-data-pipeline` exercise the
full exporter pipeline.