feat(cluster): enable GKE managed ML Diagnostics for GKE version >= 1.35.0-gke.3065000 #1186

Merged
scaliby merged 1 commit into AI-Hypercomputer:main from rapatchi:mldia_fix on Jun 11, 2026
@rapatchi rapatchi commented Jun 9, 2026

Contributor

Description

This change introduces automated management and lifecycle optimization for GKE Managed Machine Learning Diagnostics on cluster creation.

Specifically:

  1. Standardizes PEP 440 version comparison via is_gke_version_at_least() with a central minimum requirement constant (MANAGED_MLDIAGNOSTICS_MIN_GKE_VERSION = "1.35.0-gke.3065000").
  2. Refactors run_gke_cluster_create_command() to dynamically append --enable-managed-mldiagnostics when --managed-mldiagnostics is requested and the planned GKE control plane version meets or exceeds the minimum threshold.
  3. Bypasses legacy user-space prerequisite installations (install_mldiagnostics_prerequisites()) when creating or updating clusters on modern GKE versions.
  4. Refines --dry-run simulation in get_gke_server_config() and get_gke_node_pool_version() to correctly evaluate user-supplied GKE versions (--gke-version=X) or default smoothly to "0", ensuring accurate simulation without breaking Golden Recipe verification.
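The version gate in items 1 and 4 can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: the constant name and value come from the description above, but the signature of `is_gke_version_at_least()`, the helper `parse_gke_version()`, the helper `resolve_dry_run_version()`, and the regex are assumptions. It also compares plain integer tuples instead of using a PEP 440 library.

```python
import re
from typing import Optional, Tuple

# Constant name and value are taken from the PR description.
MANAGED_MLDIAGNOSTICS_MIN_GKE_VERSION = "1.35.0-gke.3065000"

# Hypothetical pattern: MAJOR[.MINOR[.PATCH]][-gke.BUILD]
_GKE_VERSION_RE = re.compile(r"^(\d+)(?:\.(\d+))?(?:\.(\d+))?(?:-gke\.(\d+))?$")


def parse_gke_version(version: str) -> Tuple[int, ...]:
    """Hypothetical helper: turn a GKE version string into a comparable tuple."""
    match = _GKE_VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized GKE version: {version!r}")
    # Missing components (e.g. the bare "0" used in dry runs) count as zero.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def is_gke_version_at_least(version: str, minimum: str) -> bool:
    """Assumed signature: True when `version` is greater than or equal to `minimum`."""
    return parse_gke_version(version) >= parse_gke_version(minimum)


def resolve_dry_run_version(user_supplied: Optional[str]) -> str:
    """Hypothetical helper: use --gke-version if given, otherwise fall back to "0"."""
    return user_supplied or "0"


if __name__ == "__main__":
    for candidate in ("1.36.0-gke.2459000", "1.34.8-gke.1000000", resolve_dry_run_version(None)):
        ok = is_gke_version_at_least(candidate, MANAGED_MLDIAGNOSTICS_MIN_GKE_VERSION)
        print(candidate, ok)
```

With this sketch, the "0" fallback always compares below the minimum, so a dry run without `--gke-version` would take the legacy path.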

Issue

Corrects the behaviour of the --managed-mldiagnostics flag.

Testing

Have you performed any manual testing on your change? Yes

xpk cluster create with >= 1.35.0-gke.3065000:

(xpk_local_venv) rapatchi@rapatchi2:~/xpk_fork/xpk$ xpk cluster create --cluster=maxtest-cluster --tpu-type=v5litepod-8 --project=rapatchiconsumer --zone=us-central1-a --num-nodes=2 --managed-mldiagnostics --spot
[XPK] Starting xpk v0.1.dev902+g7c8ca0ae1
...
[XPK] Task: `GKE Cluster Create` is implemented by `gcloud beta container clusters create maxtest-cluster --project=rapatchiconsumer --region=us-central1 --node-locations=us-central1-a --cluster-version=1.36.0-gke.2459000 --machine-type=e2-standard-16 --enable-autoscaling --total-min-nodes 1 --total-max-nodes 1000 --num-nodes 6 --autoscaling-profile=optimize-utilization --labels=gke_product_type=xpk --release-channel=rapid --enable-ip-alias --enable-dataplane-v2 --enable-multi-networking --enable-dns-access --location-policy=BALANCED --scopes=storage-full,gke-default --enable-managed-mldiagnostics`, streaming output live.
...
[XPK] Task: `Updating Controller Manager resources` terminated with code `0`
[XPK] GKE commands done! Resources are created.
[XPK] See your GKE Cluster here: https://console.cloud.google.com/kubernetes/clusters/details/us-central1/maxtest-cluster/details?project=rapatchiconsumer
[XPK] Exiting XPK cleanly

xpk cluster create with < 1.35.0-gke.3065000:

(xpk_local_venv) rapatchi@rapatchi2:~/xpk_fork/xpk$ xpk cluster create --cluster=maxtest-cluster --tpu-type=v5litepod-8 --project=rapatchiconsumer --zone=us-central1-a --num-nodes=2 --managed-mldiagnostics --spot --gke-version=1.34.8-gke.1000000
[XPK] Starting xpk v0.1.dev902+g7c8ca0ae1
...
[XPK] Task: `GKE Cluster Create` is implemented by the following command not running since it is a dry run.
gcloud beta container clusters create maxtest-cluster --project=rapatchiconsumer --region=us-central1 --node-locations=us-central1-a --cluster-version=1.34.8-gke.1000000 --machine-type=e2-standard-16 --enable-autoscaling --total-min-nodes 1 --total-max-nodes 1000 --num-nodes 6 --autoscaling-profile=optimize-utilization --labels=gke_product_type=xpk --release-channel=regular --enable-ip-alias --enable-dataplane-v2 --enable-multi-networking --no-enable-autoupgrade --enable-dns-access --location-policy=BALANCED --scopes=storage-full,gke-default
...
[XPK] Task: `Applying cert-manager 1.13.0 manifest...` is implemented by `kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml`, streaming output live.
[XPK] Task: `Create gke-mldiagnostics namespace...` is implemented by `kubectl create namespace gke-mldiagnostics`
[XPK] Task: `Install /tmp/mldiagnostics-injection-webhook-v0.5.0.yaml...` is implemented by `kubectl apply -f /tmp/mldiagnostics-injection-webhook-v0.5.0.yaml -n gke-mldiagnostics`, streaming output live.
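The two runs above suggest a three-way decision: do nothing when the feature is not requested, append the native flag on new enough GKE versions, and otherwise fall back to the legacy cert-manager and webhook installation. A minimal sketch of that decision is shown below. `MlDiagnosticsPlan` and `plan_mldiagnostics()` are hypothetical names invented for this example; only the flag `--enable-managed-mldiagnostics` and the observed behaviour come from the logs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MlDiagnosticsPlan:
    """Hypothetical container describing what cluster creation should do."""

    extra_gcloud_flags: List[str] = field(default_factory=list)
    install_legacy_prerequisites: bool = False


def plan_mldiagnostics(requested: bool, version_meets_minimum: bool) -> MlDiagnosticsPlan:
    """Hypothetical helper mirroring the behaviour seen in the two test runs."""
    if not requested:
        # --managed-mldiagnostics was not passed: nothing to add or install.
        return MlDiagnosticsPlan()
    if version_meets_minimum:
        # New enough control plane: let GKE manage the feature natively.
        return MlDiagnosticsPlan(extra_gcloud_flags=["--enable-managed-mldiagnostics"])
    # Older control plane: keep the user-space installation path.
    return MlDiagnosticsPlan(install_legacy_prerequisites=True)


if __name__ == "__main__":
    print(plan_mldiagnostics(requested=True, version_meets_minimum=True))
    print(plan_mldiagnostics(requested=True, version_meets_minimum=False))
```

Keeping the version check outside this function means the same decision can be reused for both real and dry-run invocations, whatever version source they use.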

Have you verified use cases affected by goldens? Yes

@rapatchi rapatchi marked this pull request as ready for review June 9, 2026 11:57
Comment thread src/xpk/utils/versions.py Outdated
Comment thread src/xpk/utils/versions_test.py Outdated
Comment thread src/xpk/utils/versions.py Outdated
Comment thread src/xpk/core/telemetry_test.py Outdated

@scaliby scaliby left a comment

Member

Thanks for your changes! Two minor fixes to go.

Comment thread src/xpk/utils/versions_test.py Outdated
Comment thread src/xpk/core/gcloud_context.py Outdated

@scaliby scaliby left a comment

Member

Thanks for addressing my feedback! LGTM!

@scaliby scaliby added the release-features features label Jun 11, 2026
@scaliby scaliby added this pull request to the merge queue Jun 11, 2026
Merged via the queue into AI-Hypercomputer:main with commit 4baa812 Jun 11, 2026
16 of 17 checks passed