11 changes: 11 additions & 0 deletions .agents/skills/debug-openshell-cluster/SKILL.md
@@ -193,6 +193,17 @@ openshell status
openshell logs <sandbox-name>
```

## Telemetry Signals

Before drilling into logs, check whether the gateway is exporting telemetry — the pull-based metrics surface and the push-based trace export are the fastest signals that the control plane is alive and that requests are reaching it.

| Signal | Where it shows up | When to use it |
|---|---|---|
| Prometheus metrics on `/metrics` | A scrape target via the chart's `ServiceMonitor` (`monitoring.serviceMonitor.enabled`). Local: `kubectl -n openshell port-forward statefulset/openshell <metrics-port>:<metrics-port>`. | Confirm the gateway listener is up and gRPC requests are landing. `up{job="openshell"} == 1` in Prometheus is a quick liveness ping. |
| OTLP traces | Jaeger / Tempo / OTel backend (`monitoring.tracing.enabled`). Look for service `openshell-gateway`. | Confirm an inbound request reached the multiplex layer; spans carry `method`, `path`, `request_id`. Missing traces under load usually mean OTLP export is misconfigured or the endpoint is unreachable. |

If the chart's `monitoring.serviceMonitor.enabled` or `monitoring.tracing.enabled` are not set, those signals are unavailable; fall back to gateway logs. See [Monitoring the Gateway](../../../docs/kubernetes/monitoring.mdx) for setup.
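
As a first concrete check, the sketch below port-forwards the metrics listener and curls it. The namespace matches the chart default used above; the port `9100` is an assumption, so substitute the metrics port from your deployment.

```bash
# Assumes the gateway's metrics port is 9100 -- replace with your value.
kubectl -n openshell port-forward statefulset/openshell 9100:9100 &

# A live gateway answers with Prometheus text-format metrics;
# an immediate connection error means the listener never came up.
curl -s http://localhost:9100/metrics | head
```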

## Common Failure Patterns

| Symptom | Likely cause | Check |
33 changes: 33 additions & 0 deletions .agents/skills/helm-dev-environment/SKILL.md
@@ -169,6 +169,39 @@ To remove Keycloak:
mise run keycloak:k8s:teardown
```

### Monitoring (Prometheus + Grafana + Jaeger)

One-time setup — installs `kube-prometheus-stack` (slimmed: no Alertmanager,
node-exporter, or kube-state-metrics) and a Jaeger all-in-one Pod:

```bash
mise run observability:k8s:setup
```

Then activate monitoring on the gateway:

1. Uncomment `#- ci/values-monitoring.yaml` in `skaffold.yaml`
2. Redeploy: `mise run helm:skaffold:run`

Forward UIs to localhost:

```bash
mise run observability:port-forward
# Grafana http://localhost:3000 (admin / admin)
# Prometheus http://localhost:9090
# Jaeger UI http://localhost:16686
```

Teardown:

```bash
mise run observability:k8s:teardown
```

The chart's `monitoring.serviceMonitor.enabled` creates a `ServiceMonitor`
that Prometheus scrapes, and `monitoring.tracing.enabled` injects `OTEL_*`
env vars into the gateway Pod so it exports OTLP/gRPC traces to Jaeger.
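
Outside of Skaffold, the same two knobs can be toggled straight from Helm. A minimal sketch, assuming a release named `openshell` and a local chart path (both illustrative; adjust to your install):

```bash
# Flip the monitoring knobs on an existing release without touching
# the rest of its values. Release name and chart path are assumptions.
helm upgrade openshell ./charts/openshell \
  --namespace openshell \
  --reuse-values \
  --set monitoring.serviceMonitor.enabled=true \
  --set monitoring.tracing.enabled=true
```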

---

## Cluster Lifecycle (suspend/resume)
85 changes: 85 additions & 0 deletions Cargo.lock

Generated lockfile; diff not rendered by default.

6 changes: 6 additions & 0 deletions Cargo.toml
@@ -58,6 +58,12 @@ tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
tracing-appender = "0.2"

# OpenTelemetry — pinned to a tonic-0.12 / prost-0.13 compatible release set.
opentelemetry = "0.29"
opentelemetry_sdk = { version = "0.29", features = ["rt-tokio"] }
opentelemetry-otlp = { version = "0.29", default-features = false, features = ["grpc-tonic", "trace"] }
tracing-opentelemetry = "0.30"

# Metrics
metrics = "0.24"
metrics-exporter-prometheus = { version = "0.18", default-features = false, features = ["http-listener"] }
17 changes: 17 additions & 0 deletions architecture/gateway.md
@@ -55,6 +55,23 @@ Domain objects use shared metadata: stable server-generated IDs, human-readable
names, creation timestamps, and labels. Crate-level details live in
`crates/openshell-core/README.md`.

### Observability surface

The gateway exposes three independent telemetry surfaces, each with its own
configuration knob and consumer:

| Surface | Direction | Configured by | Consumers |
|---|---|---|---|
| Prometheus metrics on `/metrics` | Pull | `--metrics-port` (CLI), `monitoring.serviceMonitor.*` (Helm) | Prometheus / kube-prometheus-stack via `ServiceMonitor`. |
| OpenTelemetry traces over OTLP/gRPC | Push | `--otlp-endpoint` / `OTEL_EXPORTER_OTLP_*` env, `monitoring.tracing.*` (Helm) | Any OTLP backend (Jaeger, Tempo, OTel Collector). The per-request span set up by `TraceLayer` becomes the OTLP root. |
| Sandbox log fan-out | Push (gRPC stream) | Always on (per-sandbox subscription) | CLI / TUI / SDK consumers via `WatchSandbox` and `GetSandboxLogs`; OCSF JSONL when enabled inside the sandbox. |

Trace export is opt-in: the gateway only installs the OpenTelemetry layer
when an OTLP endpoint is supplied. Spans flush on `SIGTERM` via an explicit
`shutdown()` in the gateway shutdown path. See
[Monitoring the Gateway](../docs/kubernetes/monitoring.mdx) for the operator
guide.
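
For illustration, the sketch below starts the gateway with both the pull and push surfaces enabled. The flags come from the table above; the `openshell-server run` invocation, port, and endpoint are stand-ins for however the binary is launched in your environment.

```bash
# Metrics listener on an assumed port; traces pushed to a local
# OTLP/gRPC endpoint (4317 is the standard OTLP/gRPC port).
openshell-server run \
  --metrics-port 9100 \
  --otlp-endpoint http://localhost:4317

# The signal-specific env var, when set, wins over the flag.
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://jaeger-collector:4317 \
  openshell-server run --otlp-endpoint http://localhost:4317
```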

## Persistence

The gateway persistence layer is a protobuf object store. Domain services store
6 changes: 6 additions & 0 deletions crates/openshell-server/Cargo.toml
@@ -64,6 +64,12 @@ anyhow = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { workspace = true }

# OpenTelemetry tracing export (opt-in, configured via env)
opentelemetry = { workspace = true }
opentelemetry_sdk = { workspace = true }
opentelemetry-otlp = { workspace = true }
tracing-opentelemetry = { workspace = true }

# Metrics
metrics = { workspace = true }
metrics-exporter-prometheus = { workspace = true }
14 changes: 13 additions & 1 deletion crates/openshell-server/src/cli.rs
@@ -17,7 +17,10 @@ use tracing_subscriber::EnvFilter;

use crate::certgen;
use crate::compute::{DockerComputeConfig, VmComputeConfig};
use crate::{run_server, tracing_bus::TracingLogBus};
use crate::{
run_server,
tracing_bus::{OtlpTracingConfig, TracingLogBus},
};

/// `OpenShell` gateway process - gRPC and HTTP server with protocol multiplexing.
///
@@ -297,6 +300,13 @@ struct RunArgs {
/// Keycloak: "scope". Okta: "scp". Leave empty to disable scope enforcement.
#[arg(long, env = "OPENSHELL_OIDC_SCOPES_CLAIM", default_value = "")]
oidc_scopes_claim: String,

/// OTLP/gRPC endpoint for OpenTelemetry trace export (e.g.
/// `http://jaeger-collector.observability.svc:4317`). When unset, no
/// traces are exported. The signal-specific
/// `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` takes precedence over this flag.
#[arg(long, env = "OPENSHELL_OTLP_ENDPOINT")]
otlp_endpoint: Option<String>,
}

pub fn command() -> Command {
@@ -320,8 +330,10 @@ pub async fn run_cli() -> Result<()> {

async fn run_from_args(args: RunArgs) -> Result<()> {
let tracing_log_bus = TracingLogBus::new();
let otlp = OtlpTracingConfig::resolve(args.otlp_endpoint.clone());
tracing_log_bus.install_subscriber(
EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new(&args.log_level)),
otlp,
);

let bind = SocketAddr::new(args.bind_address, args.port);
3 changes: 3 additions & 0 deletions crates/openshell-server/src/lib.rs
@@ -324,6 +324,9 @@ pub async fn run_server(
.await
.map_err(|err| Error::execution(format!("gateway shutdown cleanup failed: {err}")))?;

// Flush any pending OTLP spans. No-op when OTLP export is not configured.
state.tracing_log_bus.shutdown();

Ok(())
}
