feat(observability): add gateway OTLP traces and initial Kube monitoring surface #1270
Open
TaylorMutch wants to merge 4 commits into
Adds opt-in OpenTelemetry trace export and a Prometheus ServiceMonitor to
the gateway Helm chart. The exporter and chart toggles are independent
from the existing /metrics surface and the OCSF sandbox log fan-out.
- Gateway: append a tracing-opentelemetry layer to TracingLogBus when an
OTLP/gRPC endpoint is configured; flush spans on shutdown. CLI gains
--otlp-endpoint; standard OTEL_* env vars drive sampling and resource
attributes.
- Helm: monitoring.serviceMonitor.* renders a Prometheus-Operator
ServiceMonitor; monitoring.tracing.* projects OTEL_* env vars onto the
gateway container. Both default off.
- Tooling: observability:k8s:{setup,teardown,port-forward} mise tasks
install kube-prometheus-stack + Jaeger all-in-one for local dev.
- Docs: new docs/kubernetes/monitoring.mdx; cross-links from observability
overview and architecture/gateway.md; helm-dev-environment and
debug-openshell-cluster skills updated.
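A minimal sketch of how the endpoint precedence stated in the PR summary (`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`, then `OTEL_EXPORTER_OTLP_ENDPOINT`, then `--otlp-endpoint`) could be resolved. This is not the PR's `OtlpTracingConfig::resolve`. The function name `resolve_otlp_endpoint`, the `lookup` closure (standing in for `std::env::var(..).ok()` so the logic is testable), and the choice to treat empty values as unset are all assumptions made for illustration.

```rust
/// Hypothetical helper: returns the first configured OTLP endpoint, or `None`
/// when tracing export should stay disabled.
pub fn resolve_otlp_endpoint<F>(lookup: F, cli_endpoint: Option<&str>) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    // Assumption: an empty or whitespace-only value counts as "not set".
    fn non_empty(value: String) -> Option<String> {
        let trimmed = value.trim();
        if trimmed.is_empty() {
            None
        } else {
            Some(trimmed.to_string())
        }
    }

    // 1. Signal-specific env var, 2. generic env var, 3. CLI flag.
    lookup("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
        .and_then(non_empty)
        .or_else(|| lookup("OTEL_EXPORTER_OTLP_ENDPOINT").and_then(non_empty))
        .or_else(|| cli_endpoint.map(str::to_string).and_then(non_empty))
}
```

In a real binary the caller would likely pass `|key| std::env::var(key).ok()` as `lookup`; a `None` result then means no OTLP layer is appended.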
…files
The kube-prometheus-stack and Jaeger releases were configured via long chains of `--set` flags, which obscure the configuration and make the script hard to extend. Extract them into two checked-in values files the setup script consumes via `--values`.
- tasks/scripts/observability-prometheus-values.yaml: slim chart config plus Grafana auto-provisioning of a Jaeger datasource (stable uid so dashboards can reference it).
- tasks/scripts/observability-jaeger-values.yaml: all-in-one Jaeger.
- PROMSTACK_VALUES and JAEGER_VALUES env vars allow pointing at custom files for local experimentation.
Operator-facing /docs pages shouldn't surface mise tasks. Trim the `Local development` section out of docs/kubernetes/monitoring.mdx and move it into deploy/helm/openshell/README.md alongside the Monitoring opt-in block.
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
`tasks/scripts/` is for shell scripts, not third-party Helm values. The
kube-prometheus-stack and Jaeger values files belong with other K8s
deployment artifacts.
- Move observability-{prometheus,jaeger}-values.yaml to deploy/kube/observability/
and drop the `observability-` prefix (parent dir already scopes them).
- Update observability-k8s-setup.sh to resolve them via a REPO_ROOT-anchored
VALUES_DIR instead of SCRIPT_DIR. PROMSTACK_VALUES / JAEGER_VALUES
env-var overrides continue to work.
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Summary
Adds opt-in OpenTelemetry trace export to the gateway and a Prometheus `ServiceMonitor` to the Helm chart. Both surfaces are independent from the existing `/metrics` endpoint and the OCSF sandbox log fan-out, default off, and configured via standard `OTEL_*` env vars or chart values.
Changes
Gateway (`crates/openshell-server`)
- 0.29 / `tracing-opentelemetry 0.30` (the latest set compatible with the workspace's `tonic 0.12` + `prost 0.13`).
- `TracingLogBus::install_subscriber` now optionally appends a `tracing-opentelemetry` layer when an OTLP endpoint is configured. The existing `tower_http::trace::TraceLayer` per-request span automatically becomes the OTLP root; no `#[instrument]` rewrites required.
- `OtlpTracingConfig::resolve` honors `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` → `OTEL_EXPORTER_OTLP_ENDPOINT` → `--otlp-endpoint` precedence.
- `OTEL_TRACES_SAMPLER` / `OTEL_TRACES_SAMPLER_ARG`; default `parent_based_traceidratio(1.0)`.
- `shutdown()` flushes the `BatchSpanProcessor` from the gateway shutdown path on `SIGTERM`.
Helm chart
- `monitoring.serviceMonitor.*` and `monitoring.tracing.*` blocks in `values.yaml` (off by default).
- `templates/servicemonitor.yaml` (gated, scrapes the existing named `metrics` port).
- `OTEL_*` env vars when tracing is enabled, including merged `OTEL_RESOURCE_ATTRIBUTES`.
- `ci/values-monitoring.yaml` overlay and commented-in `kube-prometheus-stack` + `jaeger` Helm releases in `skaffold.yaml`.
- `deploy/helm/openshell/README.md`.
Tooling
- `tasks/observability.toml` exposing `observability:k8s:setup`, `observability:k8s:teardown`, and `observability:port-forward`.
- `tasks/scripts/` mirroring the existing `keycloak-k8s-setup.sh` shape: install slim `kube-prometheus-stack` + Jaeger all-in-one, idempotent re-runs.
Docs / agent skills
- `docs/kubernetes/monitoring.mdx` (operator + local-dev guide).
- `docs/observability/overview.mdx` and a new "Observability surface" subsection in `architecture/gateway.md`.
- `helm-dev-environment` and `debug-openshell-cluster` skills updated.
Testing
- `mise run pre-commit` passes (lint, format, license headers, clippy, helm-lint matrix, full workspace tests).
- `OtlpTracingConfig::resolve` and `sampler_from_env`.
- `observability:k8s:setup`, deployed gateway with `ci/values-monitoring.yaml`, drove 5 `ListSandboxes` + 3 `Health` gRPC calls. Verified:
  - `up{job="openshell"} == 1`; `openshell_server_grpc_requests_total` totals match driven traffic (8).
  - `openshell-gateway` service; 8 `request` spans with correct `method`, `path`, `request_id` attributes; resource attributes include `service.namespace=openshell`, `service.version=0.0.0`, `deployment.environment=dev`, `telemetry.sdk.version=0.29.0`.
Out of scope (follow-ups)
- `protocol: grpc`.
- `#[tracing::instrument]` annotations on gRPC handlers.
Checklist
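For the sampler settings listed under Gateway, a rough sketch of how `OTEL_TRACES_SAMPLER` and `OTEL_TRACES_SAMPLER_ARG` might map to a sampler choice. This is not the PR's `sampler_from_env`. The `SamplerChoice` enum and `sampler_from_values` function are hypothetical stand-ins that avoid depending on the OpenTelemetry SDK types. The accepted names follow the OpenTelemetry environment-variable specification; falling back to a parent-based ratio sampler at 1.0 for unset or unrecognized names mirrors the default stated in this PR, and falling back to a ratio of 1.0 for a missing or out-of-range argument is an assumption.

```rust
/// Hypothetical, SDK-independent description of the selected sampler.
#[derive(Debug, Clone, PartialEq)]
pub enum SamplerChoice {
    AlwaysOn,
    AlwaysOff,
    TraceIdRatio(f64),
    ParentBased(Box<SamplerChoice>),
}

/// `name` and `arg` are the raw values of OTEL_TRACES_SAMPLER and
/// OTEL_TRACES_SAMPLER_ARG (None when unset).
pub fn sampler_from_values(name: Option<&str>, arg: Option<&str>) -> SamplerChoice {
    use SamplerChoice::*;

    // Assumption: a missing, unparsable, or out-of-range ratio falls back to 1.0.
    let ratio = arg
        .and_then(|a| a.trim().parse::<f64>().ok())
        .filter(|r| (0.0..=1.0).contains(r))
        .unwrap_or(1.0);

    match name.map(str::trim) {
        Some("always_on") => AlwaysOn,
        Some("always_off") => AlwaysOff,
        Some("traceidratio") => TraceIdRatio(ratio),
        Some("parentbased_always_on") => ParentBased(Box::new(AlwaysOn)),
        Some("parentbased_always_off") => ParentBased(Box::new(AlwaysOff)),
        Some("parentbased_traceidratio") => ParentBased(Box::new(TraceIdRatio(ratio))),
        // Unset or unrecognized: the default described in this PR.
        _ => ParentBased(Box::new(TraceIdRatio(1.0))),
    }
}
```

A caller would likely feed it `std::env::var("OTEL_TRACES_SAMPLER").ok().as_deref()` and the matching `_ARG` value, then translate the result into the SDK's own sampler type.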