feat(observability): plugin/ML dashboard row + standalone speech-gateway dashboard #547
staging-devin-ai-integration[bot] wants to merge 5 commits
Add Grafana rows for plugin/ML-inference, the speech-gateway metric contract, and a per-service oneshot split, plus documentation for the previously-undocumented plugin metrics and a guide for monitoring the hosted speech services. Signed-off-by: streamkit-devin <devin@streamkit.dev>
📝 Info: Dashboard JSON remains syntactically valid after the large insertion
The dashboard file is a large generated-style JSON artifact, so I validated it mechanically with python3 -m json.tool samples/grafana-dashboard.json; it parsed successfully. I therefore did not flag formatting or syntax issues in the Grafana dashboard changes.
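The same mechanical check can be scripted. The following is a minimal sketch, not the check that was actually run: `check_dashboard` is a hypothetical helper, and the assumption that a dashboard carries a top-level `panels` list is an illustration rather than a rule taken from this PR.

```python
import json


def check_dashboard(text):
    """Return a list of problems found in a dashboard JSON string.

    Hypothetical helper: the 'panels' check is an assumed sanity rule,
    not something the PR or the review bot enforces.
    """
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        # Same failure class that `python3 -m json.tool` would report.
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]

    problems = []
    if not isinstance(doc, dict):
        problems.append("top level is not a JSON object")
    elif not isinstance(doc.get("panels"), list):
        problems.append("missing top-level 'panels' list")
    return problems
```

An empty list means the text parsed and passed the one structural check; anything else is a human-readable reason to look closer.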
          "expr": "histogram_quantile(0.50, sum(rate(plugin_call_duration_seconds_bucket[5m])) by (le, plugin_kind))",
          "legendFormat": "p50 {{plugin_kind}}",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.95, sum(rate(plugin_call_duration_seconds_bucket[5m])) by (le, plugin_kind))",
          "legendFormat": "p95 {{plugin_kind}}",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.99, sum(rate(plugin_call_duration_seconds_bucket[5m])) by (le, plugin_kind))",
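For readers less familiar with how the p50/p95/p99 targets above turn bucket counts into a latency value, here is a rough sketch of the estimate `histogram_quantile` produces for a classic histogram: linear interpolation inside the bucket that contains the requested rank. It is a simplified approximation written for illustration, not Prometheus's actual code; the function and the sample bucket bounds used with it are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Simplified sketch of the classic-histogram estimate; `buckets` must
    include the +Inf bucket, whose count is the total observation count.
    """
    ordered = sorted(buckets)
    if len(ordered) < 2 or ordered[-1][0] != math.inf:
        return math.nan
    total = ordered[-1][1]
    if total == 0 or math.isnan(q):
        return math.nan
    if q < 0:
        return -math.inf
    if q > 1:
        return math.inf

    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(ordered) if count >= rank)
    if idx == len(ordered) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return ordered[-2][0]

    upper, count = ordered[idx]
    if idx == 0:
        if upper <= 0:
            return upper
        lower, below = 0.0, 0.0  # first bucket is assumed to start at 0
    else:
        lower, below = ordered[idx - 1]
    if count == below:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

One practical consequence for the panels: once the rank lands in the `+Inf` bucket, the estimate is capped at the highest finite bucket bound, so a flat p99 line can mean "slower than the largest bucket" rather than "stable".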
📝 Info: Plugin metric queries match the existing OTel instrument contract
I checked the new Plugins / ML inference panels against the native plugin instrumentation. The Rust code records plugin.call.duration with unit s, counters named plugin.calls, plugin.errors, plugin.panics, and plugin.timeouts, and labels plugin.kind plus op (crates/plugin-native/src/metrics.rs:39-61, crates/plugin-native/src/metrics.rs:100-102). That matches the dashboard’s Prometheus names like plugin_call_duration_seconds_bucket, plugin_calls_total, and label plugin_kind after the OTLP→Prometheus rewrite described in the same docs, so I did not flag these plugin panels as a bug.
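The name mapping described in this comment can be sketched as a small function. This is an illustrative approximation of the OTLP→Prometheus rewrite under stated assumptions, not the exporter's implementation: `prometheus_name` and the `UNIT_SUFFIXES` table are hypothetical, and the table only covers the three units that appear in this PR.

```python
# Assumed subset of unit-to-suffix mappings, limited to units seen in this PR.
UNIT_SUFFIXES = {"s": "seconds", "By": "bytes", "%": "percent"}


def prometheus_name(otel_name, unit="", is_counter=False):
    """Approximate the Prometheus name for an OTel instrument or label.

    Hypothetical helper: dots become underscores, a known unit is
    appended as a suffix, and counters gain a `_total` suffix.
    """
    name = otel_name.replace(".", "_")
    suffix = UNIT_SUFFIXES.get(unit)
    if suffix and not name.endswith("_" + suffix):
        name += "_" + suffix
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name
```

Under these assumptions, `plugin.call.duration` with unit `s` becomes `plugin_call_duration_seconds` (the histogram series then carries the `_bucket` suffix), `plugin.calls` as a counter becomes `plugin_calls_total`, and the label `plugin.kind` becomes `plugin_kind`.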
Per review: keep only the core Plugins / ML inference row in the official samples dashboard, and move the demo-service panels (Speech Gateway + per-service oneshot split) into a self-contained dashboard under examples/speech-gateway/. Revert the official observability docs; document the gateway metrics + dashboard in the service's own README instead. Signed-off-by: streamkit-devin <devin@streamkit.dev>
Keep #546's authoritative gateway metrics section; add the Grafana dashboard pointer and align panel descriptions to the real endpoint/reason label values. Signed-off-by: streamkit-devin <devin@streamkit.dev>
ebbf40f to bf5271f
@@ -0,0 +1,1390 @@
{
📝 Info: JSON dashboards are covered by REUSE annotations rather than inline SPDX headers
The new dashboard file does not include an inline SPDX header, but this is consistent with the repo’s existing pattern for JSON configuration files: REUSE.toml:40-56 annotates **/*.json with the StreamKit copyright and MPL-2.0 license. I therefore did not treat the missing inline header in this added JSON file as a CONTRIBUTING.md license-header violation.
…e label (#545) Signed-off-by: streamkit-devin <devin@streamkit.dev> #545 pivoted the oneshot service label from the X-StreamKit-Service header to pipeline `attributes: {service}`. Declare the attribute in the gateway's embedded STT/TTS pipelines and drop the now-dead header. Fix the dashboard error-rate denominator to cover all statuses (so 'incomplete' runs count toward the total) and surface the incomplete rate, and correct the README to explain skit pushes via OTLP (only the gateway is scraped) and that the service label requires operator allowlist config.
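The denominator fix is easiest to see with numbers. The sketch below assumes run counts keyed by status; only `incomplete` is named in this commit, so the `ok` and `error` keys, and the `run_rates` helper itself, are hypothetical.

```python
def run_rates(counts):
    """Return error and incomplete rates over ALL runs.

    `counts` maps a status string to a run count. The status names
    other than 'incomplete' are assumptions made for this example.
    """
    total = sum(counts.values())  # every status contributes to the denominator
    if total == 0:
        return {"error": 0.0, "incomplete": 0.0}
    return {
        "error": counts.get("error", 0) / total,
        "incomplete": counts.get("incomplete", 0) / total,
    }


# Hypothetical sample: 90 ok, 5 error, 5 incomplete.
sample = {"ok": 90, "error": 5, "incomplete": 5}
rates = run_rates(sample)
```

With that sample, a denominator of only `ok + error` would divide 5 by 95, while counting every status divides 5 by 100, and the incomplete share becomes visible as its own rate instead of silently disappearing.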
f0b651e to f36420f
attributes:
  service: stt
📝 Info: Service split now depends on pipeline attributes rather than a forwarded header
The gateway now declares attributes.service inside both embedded oneshot pipeline configs, and the backend resolves pipeline attributes after compiling the submitted config (apps/skit/src/server/oneshot.rs:518-520) before recording oneshot_pipeline.duration labels (apps/skit/src/server/oneshot.rs:461-467). That makes the removed X-StreamKit-Service header non-essential for the current backend path; operators still need the [server.metrics.attributes.service] allowlist for the dashboard's per-service rows to populate, which the README documents.
Summary
Part of the parallel speech-services observability effort. Scoped so the demo service stays decoupled from the official StreamKit dashboard/docs:
- `samples/grafana-dashboard.json` (official): adds only the Plugins / ML inference row, since it's built purely on core metrics (`plugin_call_duration_seconds`, `plugin_calls_total`, `plugins_loaded`, …) and is valuable on its own. Mirrors the existing styling / `${DS_PROMETHEUS}` templating / collapsible-row pattern; the only existing-panel change is a `+17` `y` shift of the 4 collapsed Advanced rows.
- `examples/speech-gateway/grafana-dashboard.json` (new, self-contained): the demo-service dashboard. It contains a Speech Gateway row (frozen `gateway_*` contract from the sibling Go PR), an Oneshot Speech Services row (per-service split of `oneshot_pipeline_duration` via the `service` label from the sibling Rust PR), plus a duplicated Plugins / ML inference row so it stands alone.
- `examples/speech-gateway/README.md`: documents the gateway `/metrics` contract and how to import the dashboard. Official `docs/.../observability.md` is intentionally left unchanged.

Metric-name note: `plugin.call.duration` (OTel unit `s`) is queried as `plugin_call_duration_seconds_bucket`; existing panels confirm the exporter appends the unit (`process_memory_usage` → `_bytes`, `process_cpu_utilization` → `_percent`); label `plugin.kind` → `plugin_kind`.

Review & Validation
- `jq` passes on both dashboards; the official diff is just the Plugins row + the collapsed-row `y` shift (collapsed Advanced rows remain intact).
- `reuse lint` passes (`.json` covered by `REUSE.toml`).
- `gateway_*` / `service`-label names match the sibling PRs.

Notes
The Speech Gateway and Oneshot Speech Services panels (in the standalone dashboard) describe metrics emitted by sibling PRs that may not be merged yet; they'll show no data until those land. Built against the frozen contract so the dashboard is ready immediately.
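The `y` shift of the collapsed Advanced rows mentioned in the summary amounts to moving every top-level panel at or below the insertion point down by a fixed offset. A minimal sketch, assuming the usual Grafana layout where each top-level panel has a `gridPos` object with a `y` field; `shift_panels` is a hypothetical helper, not the tool used to produce the diff, and it leaves panels nested inside collapsed rows untouched.

```python
import copy


def shift_panels(dashboard, from_y, delta):
    """Return a copy of `dashboard` with panels at y >= from_y moved down.

    Hypothetical helper: only top-level panels are shifted; panels
    nested inside a collapsed row are not handled by this sketch.
    """
    shifted = copy.deepcopy(dashboard)  # leave the caller's dict unchanged
    for panel in shifted.get("panels", []):
        pos = panel.get("gridPos")
        if pos is not None and pos.get("y", 0) >= from_y:
            pos["y"] = pos.get("y", 0) + delta
    return shifted
```

Calling `shift_panels(dashboard, from_y, 17)` with the `y` of the first Advanced row would open a 17-unit gap for the new Plugins / ML inference row while keeping the rows above it in place.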
Link to Devin session: https://staging.itsdev.in/sessions/1c90abbd437a4c3383e43eff412f0c2e
Requested by: @streamer45