87 changes: 87 additions & 0 deletions .agents/skills/observability-stack/SKILL.md
@@ -0,0 +1,87 @@
---
name: observability-stack
description: >-
Spin up StreamKit's local observability stack (skit + Prometheus + Grafana,
optional speech gateway) and validate the Grafana dashboards end-to-end. Use
when testing metrics/dashboards, debugging empty dashboard panels, or
reproducing the speech-gateway monitoring setup locally.
license: MPL-2.0
---
Comment on lines +1 to +9
📝 Info: License-header requirements are satisfied via REUSE annotations for frontmatter/config files

The new SKILL.md starts with YAML frontmatter instead of inline SPDX comments, but this is intentional and covered by REUSE.toml's .agents/skills/**/SKILL.md annotation. Likewise, the new .yml files are covered by the existing **/*.yml configuration-file annotation. I did not flag the missing inline SPDX headers on those files because adding them would either duplicate the configured REUSE coverage or break skill frontmatter parsing.


# Observability stack (local)

`samples/observability/` is a `docker compose` stack that runs skit + Prometheus
+ Grafana (and an optional speech gateway), auto-provisioning both bundled
dashboards. Use it to validate metrics and dashboards without any cloud setup.

## Run it

```bash
cd samples/observability
docker compose up -d
./generate-traffic.sh # direct-to-skit TTS+STT
# optional gateway row:
docker compose --profile gateway up -d --build
./generate-traffic.sh --gateway
```

Grafana: <http://localhost:3000> (anonymous admin). Prometheus:
<http://localhost:9090>. skit: <http://localhost:4545>.

## How metrics flow

- **skit → Prometheus via OTLP push.** Prometheus runs with
`--web.enable-otlp-receiver`; skit's `SK_TELEMETRY__OTLP_ENDPOINT` points at
`…/api/v1/otlp/v1/metrics`. There is **no scrape job** for skit.
- **gateway → Prometheus via scrape** of the gateway's `/metrics`.

## Validate dashboards (don't just eyeball)

Prometheus' OTLP receiver translates dotted OTLP metric names to underscores
and appends unit suffixes, so verify that the metric names and labels a panel
queries actually exist before trusting it:

```bash
# list all metric names Prometheus knows about
curl -s localhost:9090/api/v1/label/__name__/values | jq -r '.data[]' | sort
# run a panel's exact PromQL and count series (0 == panel will be "No data")
curl -s --data-urlencode 'query=<promql>' localhost:9090/api/v1/query \
| jq '.data.result | length'
# inspect a metric's labels (-g stops curl from treating [] as a URL glob)
curl -sg 'localhost:9090/api/v1/series?match[]=<metric>' | jq .
```
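
To run the series-count check over several panel queries at once, a loop like
the following may help. It is a minimal sketch: it assumes `curl` and `jq` are
installed, and `PROM_URL`, `count_series`, and `report_empty_panels` are
illustrative names that are not part of the stack.

```bash
#!/usr/bin/env bash
# Sketch only: PROM_URL, count_series and report_empty_panels are illustrative names.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print how many series an instant query returns (0 if the request or parsing fails).
count_series() {
  local n
  n="$(curl -s --data-urlencode "query=$1" "$PROM_URL/api/v1/query" \
        | jq '.data.result | length' 2>/dev/null)"
  printf '%s\n' "${n:-0}"
}

# Read one PromQL expression per line on stdin and flag those returning no series.
# Returns non-zero if at least one expression came back empty.
report_empty_panels() {
  local query n rc=0
  while IFS= read -r query; do
    [ -z "$query" ] && continue
    n="$(count_series "$query")"
    if [ "$n" -eq 0 ]; then
      printf 'NO DATA  %s\n' "$query"
      rc=1
    else
      printf 'ok (%s)  %s\n' "$n" "$query"
    fi
  done
  return "$rc"
}
```

For example, `printf '%s\n' 'plugin_calls_total' 'plugin_errors_total' | report_empty_panels`
prints one line per expression. A `NO DATA` line is only a bug if the panel is
not listed under the expected-empty cases.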

Key name/label facts:

- Plugin metrics: `plugin_call_duration_seconds_*` (unit suffix present),
`plugin_calls_total`; labels `plugin_kind`, `op`.
- `oneshot_pipeline_duration_*` has **no** `_seconds` suffix (no unit set);
labels `status`, and `service` only when an `X-StreamKit-Service` header is
forwarded by a service-label-aware skit.
- Gateway: `gateway_requests_total{endpoint,code}`,
`gateway_request_duration_seconds`, `gateway_rejected_total{reason}` (only
appears after a 413/415/502 actually occurs).
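
The naming facts above follow from how OTLP names are translated on ingest. The
function below is a simplified sketch of that rule, not Prometheus' full
implementation: it covers only dots-to-underscores, a few unit suffixes, and the
counter `_total` suffix. `otlp_to_prom_name` and the `plugin.calls` input are
hypothetical names used for illustration.

```bash
#!/usr/bin/env bash
# Simplified sketch of the OTLP -> Prometheus metric name translation.
# Usage: otlp_to_prom_name <otlp-name> [unit] [type]
otlp_to_prom_name() {
  local name="$1" unit="${2:-}" type="${3:-}" suffix=""
  # Dots and any other character outside [A-Za-z0-9_:] become underscores.
  name="$(printf '%s' "$name" | sed 's/[^A-Za-z0-9_:]/_/g')"
  # Map a few common OTLP units to their Prometheus suffix (partial list).
  case "$unit" in
    s)  suffix="seconds" ;;
    ms) suffix="milliseconds" ;;
    By) suffix="bytes" ;;
  esac
  # Append the unit suffix only when a unit is set and the name lacks it.
  if [ -n "$suffix" ] && [ "${name%_"$suffix"}" = "$name" ]; then
    name="${name}_${suffix}"
  fi
  # Monotonic counters get a trailing _total.
  if [ "$type" = "counter" ] && [ "${name%_total}" = "$name" ]; then
    name="${name}_total"
  fi
  printf '%s\n' "$name"
}
```

With a unit of `s`, `plugin.call.duration` becomes `plugin_call_duration_seconds`.
With no unit, a name such as `oneshot_pipeline_duration` passes through
unchanged, which is why that histogram has no `_seconds` suffix.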

## Expected "No data" (not bugs)

- Plugin failure panels (`plugin_errors_total` etc.) — counters don't exist
until a failure happens.
- Oneshot "by Service" panels — empty unless the skit build emits the `service`
label.
- Video / MoQ / codec panels — only populate when you run those pipelines.

## Gotchas (most-common causes of empty dashboards)

- **`latest-demo` is stale.** Pin a versioned `-demo` tag; `latest-demo` can
predate metrics like `plugin.call.duration`, leaving the Plugins row empty.
- **Demo-image plugin layout.** `-demo` images ship bare `.so` files but the
loader wants `plugins/native/<id>/` bundles; `skit/entrypoint.sh` reassembles
them. Symptom: "no plugins found" / "node kind not found in registry".
- **Model-name mismatch.** A pipeline's `model_path` must exist in the image's
`models/`. The stack's `pipelines/` use the names the `-demo` image ships.
- **Grafana datasource input.** Committed dashboards use `${DS_PROMETHEUS}`;
the `dashboard-prep` step rewrites it to the provisioned uid. In compose
command strings, escape it as `$${DS_PROMETHEUS}` so compose doesn't
interpolate it.
- **Local auth.** skit needs `SK_AUTH__MODE=disabled` +
`SK_PERMISSIONS__ALLOW_INSECURE_NO_AUTH=true` to start unauthenticated on a
non-loopback bind. Local only.
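
The datasource rewrite from the gotcha above can be reproduced outside compose
as a quick check. This is a sketch, assuming the provisioned uid is `prometheus`;
`resolve_datasource` is an illustrative name. Run directly in a shell, the
placeholder needs a single `$`; the doubled `$$` form is only for text that
compose interpolates.

```bash
#!/usr/bin/env bash
# Sketch: replace the ${DS_PROMETHEUS} input with a concrete datasource uid.
# Reads dashboard JSON on stdin and writes the rewritten JSON to stdout.
resolve_datasource() {
  local uid="${1:-prometheus}"   # assumed uid; must not contain '/' or '&'
  # Single quotes stop the shell from expanding ${DS_PROMETHEUS};
  # \$ makes sed match a literal dollar sign.
  sed 's/\${DS_PROMETHEUS}/'"$uid"'/g'
}
```

For example, `resolve_datasource < dashboard.json | grep -c DS_PROMETHEUS`
should print `0` once every placeholder has been replaced.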
1 change: 1 addition & 0 deletions .claude/skills/observability-stack
38 changes: 38 additions & 0 deletions docs/src/content/docs/guides/observability.md
@@ -66,6 +66,44 @@ Import [`samples/grafana-dashboard.json`](https://github.com/streamer45/streamki

![Grafana Dashboard](/screenshots/grafana_dashboard.png)

### What's measured

Beyond HTTP and engine/node throughput, a few metric families are especially
useful for speech and ML workloads:

- **Plugin / ML inference** — native plugins emit per-call metrics labelled by
`plugin_kind` (e.g. `whisper`, `kokoro`) and `op`: `plugin_call_duration_seconds`
(histogram), `plugin_calls_total`, and `plugin_errors_total` /
`plugin_timeouts_total` / `plugin_panics_total`. This is where inference
latency and failures show up — usually the dominant cost of a speech pipeline.
- **Oneshot pipelines** — `oneshot_pipeline_duration` (histogram) is labelled by
`status` (`ok`/`error`). Because every oneshot request hits the same
`POST /api/v1/process` endpoint, splitting TTS vs STT requires a trusted
`service` label (sent via the `X-StreamKit-Service` header); without it all
oneshot traffic collapses into one series.
- **Speech gateway** — the [speech gateway example](https://github.com/streamer45/streamkit/tree/main/examples/speech-gateway)
exposes Prometheus metrics for the front door it puts in front of skit:
per-endpoint request rate/latency (`gateway_requests_total`,
`gateway_request_duration_seconds`), in-flight gauge, upstream latency, and
rejections by reason (`gateway_rejected_total`).
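
As an illustration of querying these histograms, the helper below builds a
quantile expression for any of them. It is a sketch: it assumes the metric is
stored as a classic histogram with a `_bucket` series, and
`histogram_quantile_query` is an illustrative name, not something the
dashboards ship.

```bash
#!/usr/bin/env bash
# Sketch: build a PromQL quantile expression for a classic histogram.
# Usage: histogram_quantile_query <quantile> <metric> <group-label> [window]
histogram_quantile_query() {
  local quantile="$1" metric="$2" label="$3" window="${4:-5m}"
  # Rate the buckets over the window, sum per (le, label), then take the quantile.
  printf 'histogram_quantile(%s, sum by (le, %s) (rate(%s_bucket[%s])))\n' \
    "$quantile" "$label" "$metric" "$window"
}
```

`histogram_quantile_query 0.95 plugin_call_duration_seconds plugin_kind` yields
a p95 inference-latency query per plugin kind. Passing
`oneshot_pipeline_duration` and `status` does the same for oneshot requests.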

### Run the full stack locally

To see all of the above on the dashboards without any cloud setup, use the
[`samples/observability`](https://github.com/streamer45/streamkit/tree/main/samples/observability)
compose stack — it wires skit (OTLP push) + the gateway (scrape) into Prometheus
and auto-provisions both dashboards in Grafana:

```bash
cd samples/observability
docker compose up -d
./generate-traffic.sh
# Grafana: http://localhost:3000
```

See its README for the wiring details and known gotchas (demo-image tag/plugin
layout, model-name matching, the Prometheus OTLP receiver, and local auth).

## Traces (OTLP)

Tracing export is controlled by:
13 changes: 13 additions & 0 deletions examples/speech-gateway/Dockerfile
@@ -0,0 +1,13 @@
# SPDX-FileCopyrightText: © 2025 StreamKit Contributors
#
# SPDX-License-Identifier: MPL-2.0

FROM golang:1.24-bookworm AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /gateway ./cmd/gateway

FROM gcr.io/distroless/static-debian12
COPY --from=build /gateway /gateway
EXPOSE 8080
ENTRYPOINT ["/gateway"]
2 changes: 2 additions & 0 deletions examples/speech-gateway/README.md
@@ -88,3 +88,5 @@ curl http://127.0.0.1:8080/metrics
### Grafana dashboard

A ready-made dashboard lives at [`grafana-dashboard.json`](./grafana-dashboard.json). It is self-contained: import it and pick the Prometheus datasource scraping both the gateway and the StreamKit backend. Alongside the gateway metrics above, it includes a per-service split of the backend's `oneshot_pipeline_duration` (via the `service` label: `tts`/`stt`/`other`) and the StreamKit native-plugin inference metrics (`plugin_call_duration_seconds`, `plugin_calls_total`, …) that back the STT/TTS models.

To run the gateway, Prometheus, and Grafana together locally, see [`samples/observability`](../../samples/observability).
100 changes: 100 additions & 0 deletions samples/observability/README.md
@@ -0,0 +1,100 @@
<!--
SPDX-FileCopyrightText: © 2025 StreamKit Contributors

SPDX-License-Identifier: MPL-2.0
-->

# Local observability stack

A `docker compose` stack that runs **skit + Prometheus + Grafana** (and an
optional **speech gateway**) so you can see StreamKit's metrics on the bundled
Grafana dashboards locally — no cloud, no manual import.

## Quick start

```bash
cd samples/observability
docker compose up -d # skit + Prometheus + Grafana
./generate-traffic.sh # drive ~20 TTS + STT requests through skit
```

Then open Grafana at <http://localhost:3000> (anonymous admin, no login). Two
dashboards are auto-provisioned:

- **StreamKit Performance Dashboard** — the repo's main dashboard
([`samples/grafana-dashboard.json`](../grafana-dashboard.json)), including the
**Plugins / ML inference** row.
- **StreamKit Speech Gateway Dashboard** — the gateway/oneshot dashboard
([`examples/speech-gateway/grafana-dashboard.json`](../../examples/speech-gateway/grafana-dashboard.json)).

| Service | URL |
| ---------- | ----------------------- |
| Grafana | <http://localhost:3000> |
| Prometheus | <http://localhost:9090> |
| skit API | <http://localhost:4545> |
| gateway | <http://localhost:8080> (gateway profile only) |

## How metrics get to Prometheus

Two different paths, both visible on the dashboards:

- **skit → Prometheus (OTLP push).** skit exports OTLP metrics to Prometheus'
native OTLP receiver, which is enabled with `--web.enable-otlp-receiver`.
Configured via `SK_TELEMETRY__OTLP_ENDPOINT` pointing at
`http://prometheus:9090/api/v1/otlp/v1/metrics`. This feeds the HTTP, engine,
oneshot, and **plugin** metrics.
- **gateway → Prometheus (scrape).** The speech gateway exposes a classic
`/metrics` endpoint that Prometheus scrapes (see `prometheus.yml`). This feeds
the **Speech Gateway** row.
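
A wrong OTLP path is an easy way to end up with no skit metrics, so a preflight
check on the endpoint value can be useful. The sketch below only inspects the
URL shape described above; `is_prometheus_otlp_endpoint` is an illustrative
name and is not part of this stack.

```bash
#!/usr/bin/env bash
# Sketch: return 0 if the URL targets Prometheus' OTLP metrics path, 1 otherwise.
is_prometheus_otlp_endpoint() {
  case "$1" in
    http://*/api/v1/otlp/v1/metrics | https://*/api/v1/otlp/v1/metrics) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_prometheus_otlp_endpoint "$SK_TELEMETRY__OTLP_ENDPOINT" || echo "unexpected OTLP path"`
flags a value that points at `/metrics` or at the bare host.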

## Speech Gateway row

The gateway sits behind a compose profile because it is built from source and
needs the gateway **metrics** instrumentation:

```bash
docker compose --profile gateway up -d --build
./generate-traffic.sh --gateway # route traffic through the gateway
```

Notes:

- The gateway's `/metrics` endpoint and the `gateway_*` metrics require the
metrics-instrumented gateway. The Speech Gateway dashboard row stays empty
until those metrics are present and the gateway has served traffic.
- The gateway's default STT pipeline targets a Whisper model that must exist on
the skit it talks to. The bundled `-demo` image ships `ggml-tiny-q5_1.bin`; if
the gateway points at a different model, STT through the gateway will fail
while TTS still works. The direct-to-skit traffic path (the default
`generate-traffic.sh`) avoids this by shipping its own pipelines under
`pipelines/`.

## Known gotchas

These are the sharp edges worth knowing when wiring this up yourself:

- **Pin a versioned `-demo` tag.** `latest-demo` can lag behind released
versions and predate metrics like `plugin.call.duration`, which leaves the
Plugins / ML inference row empty. This stack pins `v0.5.0-demo`.
- **Demo image plugin layout.** Current `-demo` images ship native plugins as
bare `.so` files under `plugins/native/`, but the loader expects directory
bundles (`plugins/native/<id>/` with a `plugin.yml` + the `.so`). `skit serve`
otherwise logs "no plugins found" and pipelines fail with "node kind not
found". `skit/entrypoint.sh` reassembles the expected layout at startup from
the in-repo manifests (mounted at `/repo-manifests`).
- **Model names must match.** Pipelines reference model files by path; the file
must actually be present in the image/`models/` dir. The pipelines under
`pipelines/` use the model names the `-demo` image actually ships.
- **Local auth override.** skit refuses to start unauthenticated on a
non-loopback bind unless you opt in. This stack sets
`SK_AUTH__MODE=disabled` + `SK_PERMISSIONS__ALLOW_INSECURE_NO_AUTH=true`.
**Local testing only** — never do this on an exposed instance.
- **Grafana dashboard datasource.** The committed dashboards use a
`${DS_PROMETHEUS}` datasource input. The `dashboard-prep` step rewrites it to
the provisioned datasource uid so the dashboards load without a manual import.
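
To make the plugin-layout gotcha concrete, here is a rough sketch of the kind of
reassembly involved. It is not the contents of `skit/entrypoint.sh`: it assumes
each bare file is named `<id>.so` and that a manifest exists at
`<manifest_dir>/<id>/plugin.yml`, both of which are hypothetical and may differ
from the real image. `bundle_plugins` is an illustrative name.

```bash
#!/usr/bin/env bash
# Sketch: rebuild <out_dir>/<id>/ bundles (plugin.yml + .so) from bare <id>.so files.
# ASSUMPTION: file naming and manifest paths are hypothetical; the real
# entrypoint.sh may derive ids and locate manifests differently.
bundle_plugins() {
  local src_dir="$1" manifest_dir="$2" out_dir="$3" so id
  for so in "$src_dir"/*.so; do
    [ -e "$so" ] || continue   # no .so files: the glob stayed literal
    id="$(basename "$so" .so)"
    if [ ! -f "$manifest_dir/$id/plugin.yml" ]; then
      printf 'skip %s: no manifest\n' "$id" >&2
      continue
    fi
    mkdir -p "$out_dir/$id"
    cp "$so" "$manifest_dir/$id/plugin.yml" "$out_dir/$id/"
  done
  return 0
}
```

Plugins without a matching manifest are skipped with a message on stderr rather
than bundled half-complete.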

## Cleanup

```bash
docker compose --profile gateway down -v
```
96 changes: 96 additions & 0 deletions samples/observability/docker-compose.yml
@@ -0,0 +1,96 @@
# Local observability stack for StreamKit: skit + Prometheus + Grafana, with an
# optional speech gateway. See README.md for the walkthrough and known gotchas.
#
# Usage:
# docker compose up -d # skit + Prometheus + Grafana
# docker compose --profile gateway up -d # also build & run the speech gateway
#
# Grafana: http://localhost:3000 (anonymous admin, no login)
# Prometheus: http://localhost:9090
# skit API: http://localhost:4545
# gateway: http://localhost:8080 (gateway profile only)

services:
  skit:
    image: ghcr.io/streamer45/streamkit:v0.5.0-demo
    # Pinned to a versioned -demo tag on purpose: `latest-demo` can lag behind
    # and predate metrics like plugin.call.duration, leaving dashboard rows empty.
    entrypoint: ["/entrypoint.sh"]
    environment:
      SK_AUTH__MODE: disabled
      SK_PERMISSIONS__ALLOW_INSECURE_NO_AUTH: "true"
      SK_PLUGINS__DIRECTORY: /opt/streamkit/np
      SK_TELEMETRY__ENABLE: "true"
      SK_TELEMETRY__OTLP_ENDPOINT: http://prometheus:9090/api/v1/otlp/v1/metrics
    volumes:
      - ./skit/entrypoint.sh:/entrypoint.sh:ro
      - ../../plugins/native:/repo-manifests:ro
    ports:
      - "4545:4545"
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:4545/healthz"]
      interval: 5s
      timeout: 3s
      retries: 20

  prometheus:
    image: prom/prometheus:v3.1.0
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-otlp-receiver
      - --storage.tsdb.path=/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"

  dashboard-prep:
    image: alpine:3.21
    # Copies the in-repo dashboards into Grafana's provisioning dir, resolving
    # the ${DS_PROMETHEUS} template input to the provisioned datasource uid so
    # the dashboards load without manual import.
    command:
      - sh
      - -c
      - |
        set -e
        for f in /in/*.json; do
          sed 's/$${DS_PROMETHEUS}/prometheus/g' "$$f" > "/out/$$(basename "$$f")"
        done
Comment on lines +55 to +59
📝 Info: Compose interpolation escaping in dashboard-prep is intentional

The dashboard-prep command uses $${DS_PROMETHEUS} and $$f/$$(...) so Docker Compose leaves literal shell variables and command substitution for the container's /bin/sh. After Compose interpolation, the script rewrites dashboard datasource placeholders from ${DS_PROMETHEUS} to the provisioned Grafana datasource uid prometheus; this escaping matches the documented gotcha in the added README/skill and is not an accidental double-dollar typo.

        echo "prepared dashboards:"; ls -1 /out
    volumes:
      - ../../samples/grafana-dashboard.json:/in/streamkit.json:ro
      - ../../examples/speech-gateway/grafana-dashboard.json:/in/speech-gateway.json:ro
      - grafana-dashboards:/out

  grafana:
    image: grafana/grafana:11.4.0
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Admin
      GF_AUTH_DISABLE_LOGIN_FORM: "true"
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana-dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
      - dashboard-prep

  gateway:
    profiles: ["gateway"]
    build:
      context: ../../examples/speech-gateway
    environment:
      GATEWAY_LISTEN: ":8080"
      SKIT_URL: http://skit:4545
    ports:
      - "8080:8080"
    depends_on:
      skit:
        condition: service_healthy

volumes:
  grafana-dashboards: