Skip to content

Fixes from 0.3.0 rc0#46

Open
tkatila wants to merge 6 commits into
intel:mainfrom
tkatila:fixes-from-0.3.0-rc0
Open

Fixes from 0.3.0 rc0#46
tkatila wants to merge 6 commits into
intel:mainfrom
tkatila:fixes-from-0.3.0-rc0

Conversation

@tkatila

@tkatila tkatila commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Important ones:

  • clusterpolicy: drop invalid "deviceTaint" option for DRA
  • xpumd: filter-out ecc states and use "critical" limits (not warnings)
  • helm/e2e: fixes to uninstall
  • docs updates

tkatila added 6 commits June 24, 2026 16:30
The deviceTaint option was added due to confusion. It does not
enable/disable device taints. The underlying cmdline argument
was meant for testing only.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Previous namespace creation caused a helm uninstall failure
which was not detected with e2e.
Also replace "--wait" with a namespace removal wait.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
As the chart is now handling the namespace creation, NFD's
post delete cleanup caused the helm uninstall to return non zero.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
memory "ecc_disabled" state was converted into a warning
use "critical" limit instead of "warning" for temps

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR applies a set of release-candidate fixes for v0.3.0-rc0, focusing on removing an invalid DRA option, adjusting XPUMD health handling (critical thresholds + ECC state filtering), and improving Helm/e2e uninstall reliability, along with corresponding chart/CRD/doc updates.

Changes:

  • Remove deviceTaints from the ClusterPolicy API/CRDs, samples, and Helm charts/templates.
  • Update XPUM OTel status rules to use critical temperature limits and filter out ECC states.
  • Harden Helm/e2e uninstall flows and extend workflow validation to release-* branches.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/e2e/e2e_utils.go Adds namespace deletion waiter used by Helm e2e cleanup.
test/e2e/e2e_test.go Adjusts Helm uninstall cleanup and adds namespace deletion wait.
README.md Updates DRA/XPUM docs (remove deviceTaints, adjust defaults/fields).
internal/controller/xpumanager_controller.go Applies health thresholds to the critical state instead of warning.
internal/controller/dra_controller.go Drops args derived from removed deviceTaints option.
internal/controller/dra_controller_test.go Updates DRA args test expectations after removing deviceTaints.
config/samples/dra/clusterpolicy.yaml Removes deviceTaints from sample.
config/deployments/xpum/otel-config.yaml Switches temperature states to critical + filters ECC states.
config/crd/bases/intel.com_clusterpolicies.yaml Removes deviceTaints from generated CRD schema.
charts/gpu-base-operator/values.yaml Adds nfd.postDeleteCleanup value.
charts/gpu-base-operator/crds/clusterpolicies.yaml Removes deviceTaints from Helm CRD copy.
charts/gpu-base-operator/Chart.yaml Adds alias nfd for the NFD dependency to match values/condition usage.
charts/gpu-base-operator-policy/values.yaml Removes deviceTaints from policy chart values.
charts/gpu-base-operator-policy/templates/clusterpolicy.yaml Removes deviceTaints rendering into ClusterPolicy.
charts/gpu-base-operator-policy/README.md Updates policy chart docs for removed/changed values.
api/v1alpha1/clusterpolicy_types.go Removes DeviceTaints field from API type.
.github/workflows/validate-public.yaml Runs validation workflow on release-* branches too.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/e2e/e2e_utils.go
Comment on lines +227 to +236
func waitUntilNamespaceGone(g Gomega) {
cmd := exec.Command("kubectl", "get", "namespace", "intel-gpu-base-operator", "--no-headers")
output, err := utils.Run(cmd)
if err != nil {
return
}

lines := utils.GetNonEmptyLines(output)
g.Expect(lines).To(ContainElement("No resources found"), "Expected namespace intel-gpu-base-operator to be gone")
}
Comment thread test/e2e/e2e_test.go
Comment on lines +522 to +526
cmd = exec.Command("helm", "uninstall", "-n", namespace, helmOperatorName)
_, err = utils.Run(cmd)
Expect(err).NotTo(HaveOccurred(), "Failed to uninstall operator")

Eventually(waitUntilNamespaceGone).Should(Succeed())
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants