Merged
7 changes: 5 additions & 2 deletions .github/workflows/ci.yml
@@ -109,16 +109,19 @@ jobs:
   build-docs:
     name: Build docs site
     runs-on: ubuntu-latest
-    # GATED: docs-site/ directory does not exist yet. Uncomment + create docs-site/ before re-enabling.
-    if: false
+    permissions:
+      contents: write
     steps:
       - uses: actions/checkout@v4
       - name: Build static site
         working-directory: docs-site
         run: |
           npm ci
           npm run build
+      # Build runs on every push/PR (validates the site). Deploy to gh-pages
+      # only on push to the default branch — PRs build but don't publish.
       - uses: peaceiris/actions-gh-pages@v4
+        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
         with:
           github_token: ${{ secrets.GITHUB_TOKEN }}
           publish_dir: docs-site/out
19 changes: 11 additions & 8 deletions README.md
@@ -1,6 +1,6 @@
# Stackup

-**Kubernetes on your laptop. ArgoCD + Argo Rollouts + Prometheus + Loki + Tempo + Grafana. `make up` in 10 minutes. Free.**
+**Kubernetes on your laptop. ArgoCD + Argo Rollouts + Prometheus + Grafana. `make up` in 10 minutes. Free.**

[![CI](https://github.com/ykstorm/stackup/actions/workflows/ci.yml/badge.svg)](https://github.com/ykstorm/stackup/actions/workflows/ci.yml)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
@@ -29,11 +29,16 @@ The buyerchat workload deliberately runs degraded (no DB). That's intentional. T
| **TLS** | cert-manager | Self-signed ClusterIssuer (swap to ACME in one line for prod) |
| **Secrets** | Sealed Secrets | Encrypted secrets in git, decrypted in-cluster |
| **Metrics** | kube-prometheus-stack | Prometheus + Alertmanager + Grafana, RED dashboards pre-imported |
-| **Logs** | Loki + Promtail | Pod stdout → Loki → Grafana Explore |
-| **Traces** | Tempo (monolithic) | OTLP traces from workloads |
| **Workload demo** | buyerchat Helm chart | Next.js app — demonstrates the cluster, not a production app |
| **Hardening** | PSS `restricted` + NetworkPolicy `default-deny` | Zero-trust on workload namespaces |

+### Roadmap (not installed yet)
+
+| Layer | Component | What it would do |
+|---|---|---|
+| **Logs** | Loki + Promtail | Pod stdout → Loki → Grafana Explore |
+| **Traces** | Tempo | OTLP traces from workloads |
+
---

## 10-minute quickstart
@@ -55,7 +60,7 @@ Add to `/etc/hosts` (Windows: `C:\Windows\System32\drivers\etc\hosts`):
Then open:

- **[https://buyerchat.local.stackup.dev](https://buyerchat.local.stackup.dev)** — workload, returns 503 degraded (no DB — expected)
-- **[https://grafana.local.stackup.dev](https://grafana.local.stackup.dev)** — RED metrics + Loki logs + Tempo traces
+- **[https://grafana.local.stackup.dev](https://grafana.local.stackup.dev)** — RED metrics from Prometheus (logs/traces are roadmap)
- **[https://argocd.local.stackup.dev](https://argocd.local.stackup.dev)** — GitOps tree of 6 child apps

---
@@ -88,15 +93,13 @@ graph TD
Apps --> Rollout[Argo Rollouts CRD]
Rollout --> Pods[Canary pods]
Pods --> Prom[Prometheus]
-Pods --> LokiL[Loki]
-Pods --> TempoT[Tempo]
Prom --> Graf[Grafana]
-LokiL --> Graf
-TempoT --> Graf
```

For full topology + sequence diagrams, see [docs/architecture.md](docs/architecture.md).

+A static documentation site (overview, getting started, architecture, GitOps + canary) is built from `docs-site/` and published to GitHub Pages on merge to `main`.
+
---

## Makefile targets
5 changes: 5 additions & 0 deletions docs-site/.gitignore
@@ -0,0 +1,5 @@
/node_modules
/.next
/out
*.tsbuildinfo
.DS_Store
122 changes: 122 additions & 0 deletions docs-site/app/architecture/page.js
@@ -0,0 +1,122 @@
export const metadata = {
title: 'Architecture — Stackup',
};

export default function Architecture() {
return (
<>
<h1>Architecture</h1>
<p className="lede">
Cluster topology, the GitOps tree, the observability flow, and the
security posture — all of it reproducible from <code>make up</code>.
</p>

<h2>Cluster topology</h2>
<p>
kind launches the cluster as Docker containers, each node running
containerd and kubelet. Pods run inside the worker nodes as
containers-within-containers. The cluster declares{' '}
<code>disableDefaultCNI: true</code> and installs Calico, which enforces
ingress and egress NetworkPolicy rules in full. The whole thing fits in
roughly 3 GB of RAM.
</p>

<h2>GitOps tree (app-of-apps)</h2>
<p>
A single root ArgoCD Application is the only thing <code>make up</code>{' '}
applies. It manages six child applications, and ArgoCD syncs each of
them from the git repo:
</p>
<ul>
<li>cert-manager — TLS issuance</li>
<li>ingress-nginx — ingress and TLS termination</li>
<li>sealed-secrets — in-cluster secret decryption</li>
<li>kube-prometheus-stack — Prometheus, Alertmanager, Grafana</li>
<li>argo-rollouts — the canary controller</li>
<li>buyerchat — the demo workload</li>
</ul>
<p>
The discipline is that state lives in git, not in ad-hoc{' '}
<code>kubectl apply</code> commands. ArgoCD runs automated sync, prune,
and self-heal against what the repo declares.
</p>

<h2>Observability flow</h2>
<p>
kube-prometheus-stack installs Prometheus, Alertmanager, and Grafana.
Metrics flow into Prometheus and render as RED dashboards in Grafana:
</p>
<table>
<thead>
<tr>
<th>Signal</th>
<th>Path</th>
<th>Store</th>
</tr>
</thead>
<tbody>
<tr>
<td>Metrics</td>
<td>/api/metrics scraped every 30s</td>
<td>Prometheus</td>
</tr>
</tbody>
</table>

<h3>Roadmap</h3>
<p>
Logs and traces (Loki + Promtail, Tempo) are on the roadmap — not
installed yet. Once they are wired in, a Grafana panel would let you
drill from a metric into the matching logs, and from a log line jump to
the trace by its trace_id.
</p>

<h2>Security posture</h2>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Control</th>
</tr>
</thead>
<tbody>
<tr>
<td>Pod admission</td>
<td>
Pod Security Standards <code>restricted</code> on workload
namespaces
</td>
</tr>
<tr>
<td>Network</td>
<td>
NetworkPolicy <code>default-deny</code>, explicit allow rules per
service
</td>
</tr>
<tr>
<td>Secrets</td>
<td>Sealed Secrets — encrypted in git, decrypted in-cluster</td>
</tr>
<tr>
<td>TLS</td>
<td>cert-manager self-signed CA (swap to ACME for production)</td>
</tr>
<tr>
<td>RBAC</td>
<td>No cluster-admin bindings on workload namespaces</td>
</tr>
</tbody>
</table>

<h2>What changes for production</h2>
<p>
Taking the stack to EKS, GKE, or AKS means swapping kind for a managed
control plane, the self-signed issuer for ACME via DNS-01, hostPort
ingress for a real LoadBalancer, local volumes for a CSI driver, and
single-replica components for HA deployments. The RBAC posture carries
over unchanged; nothing is loosened.
</p>
</>
);
}
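For reference, the app-of-apps arrangement this page describes could look roughly like the manifest below. This is a sketch, not the repository's actual root Application: the `name`, the `argocd` namespace, `project`, `targetRevision`, and especially `path` are assumptions. Only the repo URL and the automated sync, prune, and self-heal behavior come from the page text.

```yaml
# Hypothetical root Application for the app-of-apps pattern.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root                  # assumed name
  namespace: argocd           # assumed: the namespace ArgoCD is installed in
spec:
  project: default            # assumed project
  source:
    repoURL: https://github.com/ykstorm/stackup
    targetRevision: HEAD      # assumed: track the default branch
    path: argocd/apps         # hypothetical directory holding the six child Application manifests
  destination:
    server: https://kubernetes.default.svc   # the cluster ArgoCD itself runs in
    namespace: argocd
  syncPolicy:
    automated:
      prune: true             # delete resources that were removed from git
      selfHeal: true          # revert manual drift back to what git declares
```

With a layout like this, adding a seventh platform component would mean committing one more child Application under the assumed `path`, rather than running `kubectl apply` by hand.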
88 changes: 88 additions & 0 deletions docs-site/app/getting-started/page.js
@@ -0,0 +1,88 @@
export const metadata = {
title: 'Getting Started — Stackup',
};

export default function GettingStarted() {
return (
<>
<h1>Getting Started</h1>
<p className="lede">
Clone the repo, run one command, and reach a working cluster in about
ten minutes. You need Docker and kubectl on the machine.
</p>

<h2>Bring it up</h2>
<pre>
<code>{`git clone https://github.com/ykstorm/stackup && cd stackup
make up`}</code>
</pre>
<p>
<code>make up</code> creates the kind cluster, installs the platform,
and deploys the buyerchat workload. The root ArgoCD Application is the
only thing applied directly; ArgoCD syncs everything else from the git
repo.
</p>

<h2>Map the hostnames</h2>
<p>
Add these entries to your hosts file (<code>/etc/hosts</code>, or{' '}
<code>C:\Windows\System32\drivers\etc\hosts</code> on Windows):
</p>
<pre>
<code>{`127.0.0.1 buyerchat.local.stackup.dev
127.0.0.1 grafana.local.stackup.dev
127.0.0.1 argocd.local.stackup.dev
127.0.0.1 prometheus.local.stackup.dev`}</code>
</pre>

<h2>Open the surfaces</h2>
<ul>
<li>
<strong>buyerchat.local.stackup.dev</strong> — the workload. It
returns 503 degraded because there is no database wired in. That
response is expected.
</li>
<li>
<strong>grafana.local.stackup.dev</strong> — RED metrics from
Prometheus. Logs and traces (Loki, Tempo) are on the roadmap, not
installed yet.
</li>
<li>
<strong>argocd.local.stackup.dev</strong> — the GitOps tree of six
child apps.
</li>
</ul>

<h2>Makefile targets</h2>
<pre>
<code>{`make help # Show all targets
make up # Create cluster + install platform + buyerchat
make down # Tear down the kind cluster
make smoke # Run smoke tests (requires cluster up)
make lint # Lint all YAML + Helm charts
make rollout-status # Watch the buyerchat canary progress`}</code>
</pre>

<h2>Known limits</h2>
<ul>
<li>
No real LoadBalancer service type — kind does not ship one, so the
stack uses hostPort. Deploy to a cloud cluster for a real load
balancer.
</li>
<li>
Storage is local-path PVs by default. Re-creating the cluster wipes
them. Add Longhorn or OpenEBS for persistence across teardowns.
</li>
<li>
Single-tenant workload namespace. Multi-tenant needs more
NetworkPolicy and RBAC work.
</li>
<li>
The buyerchat workload runs degraded with no database, on purpose. The
cluster is the demo, not the app.
</li>
</ul>
</>
);
}
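The topology and hostPort notes can be illustrated with a kind cluster config along these lines. Treat it as a hedged sketch: `disableDefaultCNI: true` is stated in the architecture page, while the node list and the 80/443 port mappings are assumptions about how hostPort ingress is exposed on the laptop.

```yaml
# Hypothetical kind cluster config; node roles and port mappings are assumptions.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true     # skip kindnet so Calico can be installed instead
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:        # forward host ports into the node container
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP
```

A file like this is passed to `kind create cluster --config <file>`. With the default CNI disabled, nodes stay `NotReady` until a CNI such as Calico is installed.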
62 changes: 62 additions & 0 deletions docs-site/app/gitops-canary/page.js
@@ -0,0 +1,62 @@
export const metadata = {
title: 'GitOps & Canary — Stackup',
};

export default function GitopsCanary() {
return (
<>
<h1>GitOps &amp; Canary</h1>
<p className="lede">
How a single commit turns into a canary rollout gated on Prometheus,
with automatic rollback when the analysis fails.
</p>

<h2>The trigger</h2>
<p>
Push a commit that bumps <code>image.tag</code> in{' '}
<code>helm/buyerchat/values.yaml</code>. ArgoCD notices the change and
syncs. Argo Rollouts picks up the new Rollout revision and starts the
canary. Watch it advance:
</p>
<pre>
<code>{`make rollout-status
# same as: kubectl argo rollouts get rollout buyerchat -n app --watch`}</code>
</pre>

<h2>The canary steps</h2>
<p>
Argo Rollouts shifts 25% of traffic to the new version and pauses, then
runs an analysis step. An <code>AnalysisTemplate</code> queries
Prometheus three times over 90 seconds. If the success condition holds,
the rollout advances to 50%, then 75%, then 100%. If the analysis fails,
Argo Rollouts aborts and rolls back to the previous revision. This is the
canary pattern teams run in production, reproduced on a laptop.
</p>

<ol>
<li>Scale the canary to 25% of traffic and pause.</li>
<li>Run the Prometheus analysis query three times over 90 seconds.</li>
<li>If the gate passes, advance to 50%, then 75%, then 100%.</li>
<li>If the gate fails, abort and revert to the previous revision.</li>
</ol>

<h2>The analysis query</h2>
<p>
The current analysis query is a conservative liveness check: it only
confirms that the canary is up and being scraped. Once the buyerchat
image exports request counters on <code>/api/metrics</code>, swap it for
a real success-rate ratio. The template carries a <code>TODO</code>{' '}
marking the one line to change.
</p>

<h2>Why GitOps for this</h2>
<p>
Because the image tag lives in git and ArgoCD reconciles against it, the
rollout has a single source of truth. There is no out-of-band{' '}
<code>kubectl set image</code>. A reviewer can read the diff that
triggered a deploy, and a revert is a git revert. The canary gate then
decides whether that change reaches all traffic, with Prometheus as the
judge.
</p>
</>
);
}
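The canary steps and the liveness gate described on this page could be expressed roughly as follows. This is a sketch under stated assumptions, not the chart's actual templates: the template name, Prometheus address, query labels, measurement interval, pause duration, replica count, and image are all placeholders. The `buyerchat` Rollout name, the `app` namespace, the 25/50/75/100 weights, and the three measurements come from the page text.

```yaml
# Hypothetical AnalysisTemplate: a liveness-style gate, measured three times.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: buyerchat-liveness          # assumed name
  namespace: app
spec:
  metrics:
    - name: canary-up
      count: 3                      # three measurements, as the page describes
      interval: 30s                 # assumed spacing; tune to fit the 90-second window
      failureLimit: 0               # a single failed measurement fails the analysis
      successCondition: result[0] >= 1
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring.svc:9090   # assumed address
          query: min(up{namespace="app"})                           # assumed label selector
---
# Hypothetical Rollout showing only the canary strategy in context.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: buyerchat
  namespace: app
spec:
  replicas: 4                       # assumed
  selector:
    matchLabels:
      app: buyerchat                # assumed label
  template:
    metadata:
      labels:
        app: buyerchat
    spec:
      containers:
        - name: buyerchat
          image: registry.example/buyerchat:placeholder   # placeholder image
  strategy:
    canary:
      steps:
        - setWeight: 25
        - pause: {duration: 30s}    # assumed duration
        - analysis:
            templates:
              - templateName: buyerchat-liveness
        - setWeight: 50             # pauses between later steps are not specified in the text
        - setWeight: 75
        - setWeight: 100
```

If the analysis fails, Argo Rollouts aborts the rollout and traffic returns to the stable revision. Swapping the gate for a success-rate ratio would mean changing only `query` and `successCondition`.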