Skip to content

[llm-d] Make more reliable#91

Open
kpouget wants to merge 25 commits into
openshift-psap:mainfrom
kpouget:llmd
Open

[llm-d] Make more reliable#91
kpouget wants to merge 25 commits into
openshift-psap:mainfrom
kpouget:llmd

Conversation

@kpouget

@kpouget kpouget commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • New Features

    • Added preflight checks before deployments and tests to verify required cluster components.
    • Introduced new operator bootstrap and cleanup flows with richer status capture and reporting.
    • Added improved diagnostics during model and service deployment, including more detailed failure summaries.
  • Bug Fixes

    • Improved install and readiness handling to better recover from partial deployments and avoid duplicate installs.
    • Updated pipeline sequencing and task naming so steps run in the intended order.
  • Configuration

    • Reduced default model deployment size for lighter resource usage.

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tosokin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The PR renumbers Tekton workflows and adds preflight steps, adds EarlyReturn support to the DSL runtime and retry logging, expands operator bootstrap/deployment/cleanup tooling, updates GPU and KServe bootstrap flows, and adjusts llm_d orchestration configs, phases, tests, and CI commands.

Changes

Tekton pipeline sequencing

Layer / File(s) Summary
Full pipeline order
fournos/gitops/base/workflows/pipeline-full.yaml
The full pipeline renumbers pre-cleanup, prepare, preflight, test, post-cleanup, and export-artifacts steps, and updates the export wait reference.
Prepare and test variants
fournos/gitops/base/workflows/pipeline-prepare-only.yaml, fournos/gitops/base/workflows/pipeline-prepare-test.yaml, fournos/gitops/base/workflows/pipeline-test-only.yaml
The prepare-only, prepare-test, and test-only pipelines add preflight tasks and renumber prepare, test, and export-artifacts task names.

DSL early-return and retry logging

Layer / File(s) Summary
EarlyReturn API
projects/core/dsl/control_flow.py, projects/core/dsl/__init__.py
EarlyReturn is defined, imported from the package root, and exported in the DSL package API.
Runtime early-return handling
projects/core/dsl/runtime.py
The runtime detects EarlyReturn, stops pending non-@always tasks, and continues pending @always tasks after an early return.
Retry header suffixes
projects/core/dsl/task.py
Retry logging accepts an optional artifact_dirname_suffix and includes it in retry headers for both retry paths.

Cluster operator lifecycle tooling

Layer / File(s) Summary
Bootstrap LWS operator
projects/cluster/toolbox/bootstrap_lws_operator/main.py, projects/cluster/toolbox/bootstrap_lws_operator/templates/leaderworkersetoperator.yaml.j2
The bootstrap entrypoint renders and applies the LeaderWorkerSetOperator manifest, waits for resource readiness, and captures state artifacts.
CSV gate and InstallPlan approval
projects/cluster/toolbox/cluster_deploy_operator/main.py, projects/cluster/toolbox/cluster_deploy_operator/on_failure_helpers.py
The deployment flow adds a succeeded-CSV gate, manual InstallPlan approval handling, safer CSV lookup, and an AGENT.md failure helper.
Cleanup operators
projects/cluster/toolbox/cleanup_operators/main.py
The cleanup script validates subscriptions, captures CSV/InstallPlan/OperatorGroup artifacts, deletes operator resources, and writes a cleanup summary.

GPU operator bootstrap and diagnostics

Layer / File(s) Summary
ClusterPolicy readiness and namespace
projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.py
The bootstrap flow resolves the GPU operator namespace from ClusterPolicy status, records readiness errors on the context, and captures runtime diagnostics.
Failure analysis
projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/on_failure_helpers.py
The failure helper writes AGENT.md when a ClusterPolicy error message is present and returns early when it is absent.

LLMInferenceService deployment diagnostics

Layer / File(s) Summary
Service context and readiness
projects/kserve/toolbox/deploy_llmisvc/main.py
The deployment flow derives service name and selector from context or args, updates readiness queries and return messages, and includes service description in the final result.
Diagnostics and failure report
projects/kserve/toolbox/deploy_llmisvc/main.py, projects/kserve/toolbox/deploy_llmisvc/on_failure_helpers.py
Always-run tasks capture oc describe output for the service and ReplicaSets, and a failure helper writes AGENT.md for pod-wait failures.

HF cache wait loop

Layer / File(s) Summary
Download wait behavior
projects/kserve/toolbox/prepare_hf_model_cache/main.py
wait_for_download now tracks job and namespace locals, raises on missing jobs, and runs oc get pods before retrying while the existing success and failure checks remain unchanged.

llm_d orchestration wiring

Layer / File(s) Summary
Platform config and profile inputs
projects/llm_d/orchestration/config.d/*, projects/llm_d/orchestration/config.yaml, projects/llm_d/orchestration/manifests/leaderworkersetoperator.yaml, projects/llm_d/tests/test_profiles.py
Deployment defaults shrink replicas, tensor parallelism, and max model length; platform config adds LWS bootstrap metadata and a required CRD; orchestration config adds agentic and cleanup settings; the manifest declares LeaderWorkerSetOperator cluster; the profile test expects single-GPU settings.
Preflight CI and CRD checks
projects/llm_d/orchestration/ci.py, projects/llm_d/orchestration/preflight_phase.py
CI adds a preflight_check command and async config review during test execution, and the preflight phase checks required CRDs from platform config.
LeaderWorkerSet prepare flow
projects/llm_d/orchestration/prepare_phase.py
The prepare phase imports and runs the LWS bootstrap and waits for the workload CRD after logging operator details.
Test finalization and cleanup diagnostics
projects/llm_d/orchestration/test_phase.py, projects/llm_d/toolbox/cleanup_test_resources/main.py
The test phase runs one deployment/test path with centralized finalizers, and the cleanup-test-resources loop logs pod details before deletion completes.
Cleanup orchestration
projects/llm_d/orchestration/ci.py, projects/llm_d/orchestration/cleanup_phase.py
Cleanup commands pass cleanup_subscriptions=true, and cleanup_phase can delete operators using preserve lists.

CI text and stubs

Layer / File(s) Summary
Status text updates
projects/core/agentic/on_failure/__init__.py, projects/core/ci_entrypoint/prepare_ci.py
FAILURE_REVIEW writes the artifacts path as bold text, and CI notification and status messages switch from “Test of” wording to “Executing of” and “Execution of” wording.
Preflight command stubs
projects/rhaiis/orchestration/ci.py, projects/skeleton/orchestration/ci.py
The rhaiis and skeleton CI command groups add warning-only preflight_check subcommands, and the rhaiis module initializes logging.

Sequence Diagram(s)

sequenceDiagram
  participant check_existing_csv
  participant wait_for_installplan
  participant oc
  participant handle_installplan_failure

  check_existing_csv->>oc: query CSVs by operator package and namespace
  oc-->>check_existing_csv: CSV names and phases
  check_existing_csv-->>wait_for_installplan: continue when no Succeeded CSV exists

  wait_for_installplan->>oc: fetch Subscription YAML
  oc-->>wait_for_installplan: installPlanRef and approval mode
  wait_for_installplan->>oc: fetch InstallPlan YAML
  wait_for_installplan->>oc: patch spec.approved=true
  wait_for_installplan-->>handle_installplan_failure: write AGENT.md on failure
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • openshift-psap/forge#17: Also changes projects/core/dsl/runtime.py task execution control flow, which overlaps with the new EarlyReturn handling.
  • openshift-psap/forge#26: Also updates execute_tasks() behavior in projects/core/dsl/runtime.py, a direct neighbor to the runtime changes here.
  • openshift-psap/forge#83: Also renumbers Tekton pipeline tasks and adjusts finally export ordering in the same workflow area.

Suggested reviewers

  • MML-coder

Poem

A bunny hopped through YAML light,
And nudged the steps from left to right.
With preflight checks and logs so neat,
The carrots of progress taste extra sweet. 🐇

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 78.75% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title is related to the changes, but it is too vague to convey the main update clearly. Use a more specific title that names the primary reliability changes, such as preflight checks, orchestration hardening, or operator bootstrap fixes.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@kpouget

kpouget commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-full

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'llm_d prepare' failed after 00 hours 07 minutes 16 seconds 🔴

• Link to the test results.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260626-094820
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

FAILURE REVIEW 001 cluster deploy operator openshift-cert-manager-operator:

001__cluster_deploy_operator__openshift-cert-manager-operator

The wait_for_installplan task failed after exhausting its retry budget, incorrectly halting the deployment of the openshift-cert-manager-operator due to a strict polling logic that required a non-null .status.installPlanRef. This was a false positive caused by a mismatch with OpenShift OLM's idempotent reconciliation state: because the operator was already installed and the subscription spec had not drifted, OLM intentionally left the installPlanRef as null while maintaining a satisfied status, which the task lacked the logic to recognize as a successful outcome.

Detailed failure analysis report

Raw failure details
Execution logs

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 10 minutes 26 seconds 🔴

• Link to the test results.

FOURNOS Job Status:

  • forge-llm-d-20260626-094820: Failed - Tasks Completed: 4 (Failed: 1, Cancelled 0), Skipped: 1

MLflow Tracking:

FOURNOS Job Details:

Task Logs:

Test configuration:

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-full

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-26 09:58:36
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:186
   RuntimeError: Job forge-llm-d-20260626-094820 failed: Tasks Completed: 4 (Failed: 1, Cancelled 0), Skipped: 1


[...]

Execution logs

@kpouget

kpouget commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-full

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Caution

Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted.

Error details
{}

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'llm_d prepare' failed after 00 hours 07 minutes 42 seconds 🔴

• Link to the test results.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260626-125151
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

FAILURE REVIEW 003 bootstrap lws operator:

003__bootstrap_lws_operator

The wait_for_lws_operator_ready task failed after exhausting 30 retries because the LeaderWorkerSetOperator/cluster resource never transitioned to a ready state. Detailed analysis of the captured artifact confirms the root cause is an environmental dependency failure: the operator controller encountered a SyncError and stalled during initialization because the required cert-manager-webhook service was missing, preventing the controller from populating the Available status condition.

Detailed failure analysis report

Raw failure details
Execution logs

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 09 minutes 05 seconds 🔴

• Link to the test results.

FOURNOS Job Status:

  • forge-llm-d-20260626-125151: Failed - Tasks Completed: 4 (Failed: 1, Cancelled 0), Skipped: 1

MLflow Tracking:

FOURNOS Job Details:

Task Logs:

Test configuration:

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-full

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-26 13:00:47
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:186
   RuntimeError: Job forge-llm-d-20260626-125151 failed: Tasks Completed: 4 (Failed: 1, Cancelled 0), Skipped: 1


[...]

Execution logs

@kpouget

kpouget commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-full
/var cleanup.preserve_operators.openshift-cert-manager-operator: false

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'llm_d prepare' failed after 00 hours 21 minutes 20 seconds 🔴

• Link to the test results.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260626-131630
ci_job.name: llm_d
ci_job.owner: kpouget
cleanup.preserve_operators.openshift-cert-manager-operator: 'false'
project.args: []
project.name: llm_d

FAILURE REVIEW 010 bootstrap gpu clusterpolicy:

010__bootstrap_gpu_clusterpolicy

The GPU operator bootstrap failed when the wait_for_clusterpolicy_ready task exhausted 60 retries because the gpu-cluster-policy resource stalled permanently at the state-operator-validation state and never reported ready. This reconciliation stall prevented the operator's validation controller from completing its readiness checks, halting the bootstrap process after all other infrastructure components had successfully resolved.

Detailed failure analysis report

Raw failure details
Execution logs

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 24 minutes 47 seconds 🔴

• Link to the test results.

FOURNOS Job Status:

  • forge-llm-d-20260626-131630: Failed - Tasks Completed: 4 (Failed: 1, Cancelled 0), Skipped: 1

MLflow Tracking:

FOURNOS Job Details:

Task Logs:

Test configuration:

/test fournos llm_d
/var cleanup.preserve_operators.openshift-cert-manager-operator: false
/cluster forge-smoke-testing
/pipeline forge-full

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-26 13:41:06
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:186
   RuntimeError: Job forge-llm-d-20260626-131630 failed: Tasks Completed: 4 (Failed: 1, Cancelled 0), Skipped: 1


[...]

Execution logs

@kpouget kpouget force-pushed the llmd branch 2 times, most recently from 7b83bd6 to 4919605 Compare June 26, 2026 14:02
@kpouget

kpouget commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-full

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'llm_d test' failed after 00 hours 16 minutes 26 seconds 🔴

• Link to the test results.

• Caliper postprocess completed but no reports generated.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260626-140314
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

TEST DESCRIPTION:

This test evaluates the llm_d project by applying the approximate-prefix-cache deployment profile and the Qwen/Qwen3-0.6B model configuration to verify the setup and functionality of the approximate prefix caching scheduler.

FAILURE REVIEW 002 run smoke request:

001__llmd_test/002__run_smoke_request

The execute_smoke_request task failed because the curl command targeting the forge-llm-d LLM endpoint timed out after 60 seconds (exit code 28), indicating a failure to establish a connection or receive an HTTP response. This timeout prevented validation of the service endpoint, blocking the smoke test and preventing further test execution.

Detailed failure analysis report

Config review report

Raw failure details
Execution logs

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 22 minutes 23 seconds 🔴

• Link to the test results.

FOURNOS Job Status:

  • forge-llm-d-20260626-140314: Failed - Tasks Completed: 5 (Failed: 1, Cancelled 0), Skipped: 0

MLflow Tracking:

FOURNOS Job Details:

Task Logs:

Test configuration:

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-full

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-26 14:25:27
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:186
   RuntimeError: Job forge-llm-d-20260626-140314 failed: Tasks Completed: 5 (Failed: 1, Cancelled 0), Skipped: 0


[...]

Execution logs

@kpouget

kpouget commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-full

@psap-forge-bot

Copy link
Copy Markdown

🔴 Executing of 'llm_d test' failed after 00 hours 05 minutes 46 seconds 🔴

• Link to the test results.

• Caliper postprocess completed but no reports generated.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260626-144540
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

TEST DESCRIPTION:

This test evaluates the llm_d project's deployment of Qwen/Qwen3-0.6B using the approximate-prefix-cache profile, specifically benchmarking vLLM's prefix caching feature. It executes a short concurrent workload via the Guidellm engine to measure inference throughput and latency under controlled token lengths and request rates.

FAILURE REVIEW 002 run smoke request:

001__llmd_test/002__run_smoke_request

The execute_smoke_request task failed with a curl timeout (exit code 28) after 60 seconds when attempting to reach the LLM service endpoint, indicating the backend was unreachable or unresponsive. This failure suggests the service may not be ready to serve traffic, is experiencing extreme latency due to model initialization, or is being silently dropped by a routing or network policy misconfiguration.

Detailed failure analysis report

Config review report

Raw failure details
Execution logs

@psap-forge-bot

Copy link
Copy Markdown

🔴 Executing of 'fournos_launcher submit' failed after 00 hours 10 minutes 22 seconds 🔴

• Link to the test results.

FOURNOS Job Status:

  • forge-llm-d-20260626-144540: Failed - Tasks Completed: 5 (Failed: 1, Cancelled 0), Skipped: 0

MLflow Tracking:

FOURNOS Job Details:

Task Logs:

Test configuration:

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-full

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-26 14:55:51
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:186
   RuntimeError: Job forge-llm-d-20260626-144540 failed: Tasks Completed: 5 (Failed: 1, Cancelled 0), Skipped: 0


[...]

Execution logs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
fournos/gitops/base/workflows/pipeline-full.yaml (1)

119-136: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

job-step-name does not match the 05-export-artifacts task name.

The task is named 05-export-artifacts but its job-step-name is 04__export-artifacts, so the artifact directory (derived from FOURNOS_STEP_NAME per task-forge-step.yaml) will be misnamed and inconsistent with the task. Should be 05__export-artifacts.

🐛 Proposed fix
         - name: job-step-name
-          value: 04__export-artifacts
+          value: 05__export-artifacts
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos/gitops/base/workflows/pipeline-full.yaml` around lines 119 - 136, The
`05-export-artifacts` task in `pipeline-full.yaml` has a mismatched
`job-step-name` value, which will cause the artifact directory naming used by
`forge-step` and `task-forge-step.yaml` to be inconsistent. Update the
`job-step-name` parameter in the `05-export-artifacts` task to match the task’s
step number so it follows the same `FOURNOS_STEP_NAME`-derived pattern as the
other tasks, using the `05__export-artifacts` naming convention.
🧹 Nitpick comments (6)
fournos/gitops/base/workflows/pipeline-prepare-test.yaml (1)

20-37: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Stale job-step-name numbering on 01-prepare.

The task 01-prepare still uses job-step-name: 001__prepare while the rest of this file moved to the two-digit 0N__ scheme (e.g. 02__preflight, 04__export-artifacts). Align it to 01__prepare for a consistent artifact-dir naming.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos/gitops/base/workflows/pipeline-prepare-test.yaml` around lines 20 -
37, The `01-prepare` task in the pipeline manifest still uses the old
`job-step-name` value `001__prepare`; update it to match the file’s current
two-digit naming convention used by other steps like `02__preflight` and
`04__export-artifacts`. Make the change in the `job-step-name` param for
`01-prepare` so artifact directory naming stays consistent across the workflow.
fournos/gitops/base/workflows/pipeline-full.yaml (1)

59-79: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Duplicate 02 prefix for two distinct tasks.

Both 02-preflight and 02-test share the 02 numbering, breaking the otherwise sequential ordering convention (compare pipeline-prepare-test.yaml, where preflight is 02 and test is 03). Consider renaming the test task to 03-test and its job-step-name to 03__test for consistency.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos/gitops/base/workflows/pipeline-full.yaml` around lines 59 - 79, The
pipeline task names are using the same 02 prefix for two different steps, which
breaks the sequential naming convention. Update the task identifier for the test
stage from 02-test to 03-test, and adjust the corresponding job-step-name in the
same task from 02__test to 03__test so it matches the ordering used by the
pipeline task definitions.
projects/kserve/toolbox/prepare_hf_model_cache/main.py (1)

340-343: 🔒 Security & Privacy | 🔵 Trivial

Prefer the list form for shell.run to avoid shell interpolation

shell.run accepts str | list[str] and defaults to shell=True. Passing an argument list with shell=False removes the shell interpolation risk and is more robust to unusual characters in namespace or job_name.

Both values are derived from cache_spec, which is built deterministically in this module and always includes namespace and download_job_name.

♻️ Proposed change
-    shell.run(
-        f"oc get pods -n {namespace} -l job-name={job_name}",
-        check=False,
-    )
+    shell.run(
+        ["oc", "get", "pods", "-n", namespace, "-l", f"job-name={job_name}"],
+        check=False,
+        shell=False,
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/kserve/toolbox/prepare_hf_model_cache/main.py` around lines 340 -
343, The `shell.run` call in `prepare_hf_model_cache/main.py` is using a
shell-form string, which leaves room for shell interpolation; update the `oc get
pods` invocation to use the argument-list form with `shell=False`. Keep the same
command semantics while passing `namespace` and `job_name` as separate arguments
so `shell.run` handles them safely and robustly. Use the existing `shell.run`
call site around the pod lookup logic to locate the change.
projects/llm_d/toolbox/cleanup_test_resources/main.py (1)

208-212: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Drop or persist the unused JSON pod fetch.

This command discards stdout and does not feed the deletion check, so it only adds an extra API-server call on every retry. Keep the plain-text diagnostic, or write the JSON output to an artifact if it is needed.

Proposed cleanup
-    shell.run(
-        f"oc get pods -n {args.namespace} -l app.kubernetes.io/name={args.inference_service_name} -o json",
-        check=False,
-        log_stdout=False,
-    )
-
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm_d/toolbox/cleanup_test_resources/main.py` around lines 208 -
212, The extra JSON pod fetch in the cleanup flow is unused and just adds
another API-server call on each retry. Update the retry logic around shell.run
in cleanup_test_resources/main.py so the diagnostic remains plain-text only, or
persist the JSON output to an artifact if it is actually needed by the deletion
check; remove the discarded json call from the retry path.
projects/llm_d/orchestration/test_phase.py (1)

57-61: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Align run_finalizers’ return type with its actual contract.

Line 104 returns a pair, so the annotation should describe (primary_exc, finalizer_exc) rather than one exception tuple.

Proposed cleanup
 def run_finalizers(
     endpoint_url: str | None,
     primary_exc: tuple[type[BaseException], BaseException, Any] | None,
     finalizer_exc: tuple[type[BaseException], BaseException, Any] | None,
-) -> tuple[type[BaseException], BaseException, Any] | None:
+) -> tuple[
+    tuple[type[BaseException], BaseException, Any] | None,
+    tuple[type[BaseException], BaseException, Any] | None,
+]:

Also applies to: 104-104

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm_d/orchestration/test_phase.py` around lines 57 - 61,
`run_finalizers`’ return annotation does not match what the function actually
returns: it returns both `primary_exc` and `finalizer_exc`, not a single
exception tuple. Update the `run_finalizers` signature and any related type
hints in `test_phase.py` to reflect a pair/tuple of the two exception tuples,
and make sure the return statement at the end of `run_finalizers` matches that
contract.
projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.py (1)

167-208: 🔒 Security & Privacy | 🔵 Trivial

Replace f-string shell commands with argv lists and shell=False

The shell.run() wrapper explicitly supports list-style arguments and the shell flag. Interpolating namespace into shell strings needlessly exposes command arguments to shell parsing. Convert all six oc invocations in this block to use the list form.

     shell.run(
-        f"oc get daemonsets -n {namespace} -oyaml",
+        ["oc", "get", "daemonsets", "-n", namespace, "-oyaml"],
+        shell=False,
         stdout_dest=args.artifact_dir / "artifacts" / "gpu-operator-daemonsets.yaml",
         check=False,
     )

Apply this pattern to all remaining calls in the block.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.py` around
lines 167 - 208, The `shell.run()` calls in the GPU operator artifact capture
block are still using interpolated shell strings, which should be replaced with
argv lists. Update each `oc` invocation in `main.py` to pass command arguments
as a list using the `shell.run` wrapper’s list-style form, keeping `namespace`
as a separate argument and not embedded in an f-string. Apply this change
consistently to the six capture calls in the block, including the `get`,
`describe`, and `-owide` commands, so they all run with `shell=False`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@fournos/gitops/base/workflows/pipeline-full.yaml`:
- Around line 79-97: The 02-test task in the pipeline is invoking the wrong job
step, so testing never runs and preflight is duplicated. Update the `job-step`
parameter in `pipeline-full.yaml` for the `02-test` task from preflight to test,
keeping the surrounding `forge-step` task configuration and `job-step-name`
as-is so the step name matches the intended test execution.

In `@fournos/gitops/base/workflows/pipeline-prepare-test.yaml`:
- Around line 59-60: The 03-test task in the pipeline workflow has an invalid
runAfter dependency because it points to a task name that does not exist. Update
the runAfter reference on 03-test to match the actual preflight task identifier,
02-preflight, so the Tekton DAG resolves correctly and the pipeline definition
validates.

In `@fournos/gitops/base/workflows/pipeline-test-only.yaml`:
- Line 39: The `01-test` task in the pipeline is missing a `runAfter` dependency
on `00-preflight`, so preflight can run in parallel instead of gating the test.
Update the Tekton workflow definition for `01-test` to depend on `00-preflight`,
following the same pattern used in `pipeline-prepare-only.yaml`, so the
preflight step completes before the test starts.

In `@projects/cluster/toolbox/bootstrap_lws_operator/main.py`:
- Around line 16-22: The `run` function’s `timeout_seconds` parameter is
currently unused because readiness waits are controlled by the fixed retry
decorators in `run` rather than by caller input. Either remove `timeout_seconds`
and its docstring from `run`, or thread it into the waiting logic by replacing
the fixed retries with timeout-based polling in the readiness checks so the
caller can actually influence the wait duration.

In `@projects/cluster/toolbox/cleanup_operators/main.py`:
- Around line 114-124: The per-operator artifact filenames in the cleanup flow
are missing the namespace, so repeated subscription names can overwrite each
other’s diagnostics. Update the filename construction in the `shell.run` calls
that write the CSV and InstallPlan artifacts to include both `subscription_name`
and `namespace`, and apply the same change in the other matching block
referenced by the comment. Keep the existing artifact destinations and command
behavior unchanged.
- Around line 161-175: The cleanup logic in the subscription check is failing
open because it treats an unsuccessful oc get subscription call as zero
subscriptions, which can trigger unsafe deletion of shared OperatorGroups.
Update the subscription-counting flow in cleanup_operators/main.py so the code
using shell.run and remaining_subscriptions first verifies the command succeeded
before inspecting result.stdout, and if it fails, skip deletion and log or raise
an error instead of proceeding to delete operator groups. Keep the deletion
guard around the existing logger.info and oc delete operatorgroup path so
OperatorGroups are only removed when the subscription query is known to be
valid.

In `@projects/cluster/toolbox/cluster_deploy_operator/main.py`:
- Around line 66-112: The early-return logic in check_existing_csv can bypass
validate_parameters, letting reused cluster state hide invalid inputs. Reorder
the task flow so validate_parameters runs before check_existing_csv, ensuring
source_name, channel, approval mode, and install mode are validated first while
keeping the existing EarlyReturn behavior in check_existing_csv unchanged.
- Around line 456-466: The Subscription failure check in the operator logic only
inspects the second entry in conditions, so it can miss a ResolutionFailed
condition when ordering changes. Update the condition handling in main.py to
iterate through all items in subscription_data["status"]["conditions"] and raise
the same ValueError as soon as any condition_type equals ResolutionFailed, using
the matching message and reason from that condition.

In `@projects/core/ci_entrypoint/prepare_ci.py`:
- Line 588: The user-visible notification text built in prepare_ci.py is
inconsistent and ungrammatical, using both “Executing of” and “Execution of”
around the notification_status formatting. Update the status message
construction at the affected notification points so they all use one consistent,
grammatical phrasing (for example, the same “Execution of '{project}
{operation}' ...” form) in the prepare_ci flow, including the logic around
notification_status.

In `@projects/core/dsl/runtime.py`:
- Around line 369-376: The early-return path in runtime.py drops any context
changes because EarlyReturnException is raised in the task handling block before
the shared_context write-back runs. Update the task execution flow around the
EarlyReturn check so the context persistence loop executes before raising
EarlyReturnException, or otherwise ensure context is saved as part of the
early-return path. Use the existing task_status, EarlyReturn, and shared_context
logic in the runtime task runner to keep values set by the task available to
downstream `@always` tasks like capture_csv.

In `@projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.py`:
- Around line 84-91: The namespace fallback in the ClusterPolicy readiness flow
is getting stuck on the default value instead of refreshing from the latest
status. Update the logic in the ClusterPolicy retry/readiness path, especially
the code that reads `result.stdout` and the loop that rechecks status on each
retry, so `ctx.gpu_operator_namespace` is reassigned whenever
`.status.namespace` becomes available. Keep the existing fallback behavior only
until a real namespace is observed, and make sure the final diagnostics use the
refreshed value from `ctx.gpu_operator_namespace`.

In `@projects/kserve/toolbox/deploy_llmisvc/main.py`:
- Around line 136-149: The pod-appearance wait in wait_pods_appear was shortened
too much, which can cause slow KServe/LLMInferenceService reconciliations to
fail before readiness checks run. Update wait_pods_appear in
deploy_llmisvc/main.py to restore a longer wait window or make the retry
duration configurable, keeping the existing service_name, selector, and
oc_get_json flow intact.
- Around line 166-178: The diagnostic artifact writing for the `oc` calls
currently saves only `stdout`, so failures from `describe` and `get replicaset`
can lose the real error details. Update the artifact generation around the
`oc(...)` result handling in `main.py` so the saved files include `stderr`
whenever `check=False` is used, and make sure this is applied to both the
`llminferenceservice` describe path and the replicaset diagnostic path. Use the
existing `result` object and artifact writes like `llmisv_desc_path` to locate
and append the error output alongside the normal output.

In `@projects/llm_d/orchestration/ci.py`:
- Around line 89-96: The pre_cleanup() step is deleting non-preserved operator
subscriptions before prepare(), which forces rhods-operator and related
operators to be torn down and reinstalled on every run. Update the cleanup flow
in pre_cleanup() so full cleanup with cleanup_subscriptions=True is moved to
post_cleanup(), or adjust runtime_config/config.yaml preserve settings to
include rhods-operator, rhcl-operator, and leader-worker-set if they must
survive pre-run cleanup. Make sure the fix is applied in the pre_cleanup and
cleanup_toolbox_run path so prepare() no longer triggers unnecessary operator
churn.

In `@projects/llm_d/orchestration/cleanup_phase.py`:
- Around line 58-74: The cleanup flow in cleanup_phase.py is currently
fail-open: when cleanup_config or preserve_operators is missing,
_identify_operators_to_delete can mark all platform operators for deletion
before cleanup_operators_command.run is called. Update the cleanup logic to
default to preserving operators unless an explicit allowlist/opt-in is present,
and make _identify_operators_to_delete (and any callers in this phase) return no
deletions when the preserve list is absent or incomplete. Ensure the guard
happens before invoking cleanup_operators_command.run so platform operators are
not removed accidentally.

In `@projects/llm_d/orchestration/config.yaml`:
- Line 71: The hardcoded model_key value is invalid and will break model
resolution; update the config entry in the orchestration config to a valid
registry key such as qwen3-0_6b/qwen3-0-6b, and make sure the default MODEL_KEY
in config_review/__init__.py matches the same valid identifier used by
runtime.yaml and the test suite.

---

Outside diff comments:
In `@fournos/gitops/base/workflows/pipeline-full.yaml`:
- Around line 119-136: The `05-export-artifacts` task in `pipeline-full.yaml`
has a mismatched `job-step-name` value, which will cause the artifact directory
naming used by `forge-step` and `task-forge-step.yaml` to be inconsistent.
Update the `job-step-name` parameter in the `05-export-artifacts` task to match
the task’s step number so it follows the same `FOURNOS_STEP_NAME`-derived
pattern as the other tasks, using the `05__export-artifacts` naming convention.

---

Nitpick comments:
In `@fournos/gitops/base/workflows/pipeline-full.yaml`:
- Around line 59-79: The pipeline task names are using the same 02 prefix for
two different steps, which breaks the sequential naming convention. Update the
task identifier for the test stage from 02-test to 03-test, and adjust the
corresponding job-step-name in the same task from 02__test to 03__test so it
matches the ordering used by the pipeline task definitions.

In `@fournos/gitops/base/workflows/pipeline-prepare-test.yaml`:
- Around line 20-37: The `01-prepare` task in the pipeline manifest still uses
the old `job-step-name` value `001__prepare`; update it to match the file’s
current two-digit naming convention used by other steps like `02__preflight` and
`04__export-artifacts`. Make the change in the `job-step-name` param for
`01-prepare` so artifact directory naming stays consistent across the workflow.

In `@projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.py`:
- Around line 167-208: The `shell.run()` calls in the GPU operator artifact
capture block are still using interpolated shell strings, which should be
replaced with argv lists. Update each `oc` invocation in `main.py` to pass
command arguments as a list using the `shell.run` wrapper’s list-style form,
keeping `namespace` as a separate argument and not embedded in an f-string.
Apply this change consistently to the six capture calls in the block, including
the `get`, `describe`, and `-owide` commands, so they all run with
`shell=False`.

In `@projects/kserve/toolbox/prepare_hf_model_cache/main.py`:
- Around line 340-343: The `shell.run` call in `prepare_hf_model_cache/main.py`
is using a shell-form string, which leaves room for shell interpolation; update
the `oc get pods` invocation to use the argument-list form with `shell=False`.
Keep the same command semantics while passing `namespace` and `job_name` as
separate arguments so `shell.run` handles them safely and robustly. Use the
existing `shell.run` call site around the pod lookup logic to locate the change.

In `@projects/llm_d/orchestration/test_phase.py`:
- Around line 57-61: `run_finalizers`’ return annotation does not match what the
function actually returns: it returns both `primary_exc` and `finalizer_exc`,
not a single exception tuple. Update the `run_finalizers` signature and any
related type hints in `test_phase.py` to reflect a pair/tuple of the two
exception tuples, and make sure the return statement at the end of
`run_finalizers` matches that contract.

In `@projects/llm_d/toolbox/cleanup_test_resources/main.py`:
- Around line 208-212: The extra JSON pod fetch in the cleanup flow is unused
and just adds another API-server call on each retry. Update the retry logic
around shell.run in cleanup_test_resources/main.py so the diagnostic remains
plain-text only, or persist the JSON output to an artifact if it is actually
needed by the deletion check; remove the discarded json call from the retry
path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8e4e2f9a-4e5a-4b96-ad58-9a10f65452d1

📥 Commits

Reviewing files that changed from the base of the PR and between 1659dde and 8d27be1.

📒 Files selected for processing (33)
  • fournos/gitops/base/workflows/pipeline-full.yaml
  • fournos/gitops/base/workflows/pipeline-prepare-only.yaml
  • fournos/gitops/base/workflows/pipeline-prepare-test.yaml
  • fournos/gitops/base/workflows/pipeline-test-only.yaml
  • projects/cluster/toolbox/bootstrap_lws_operator/main.py
  • projects/cluster/toolbox/bootstrap_lws_operator/templates/leaderworkersetoperator.yaml.j2
  • projects/cluster/toolbox/cleanup_operators/main.py
  • projects/cluster/toolbox/cluster_deploy_operator/main.py
  • projects/cluster/toolbox/cluster_deploy_operator/on_failure_helpers.py
  • projects/core/agentic/on_failure/__init__.py
  • projects/core/ci_entrypoint/prepare_ci.py
  • projects/core/dsl/__init__.py
  • projects/core/dsl/control_flow.py
  • projects/core/dsl/runtime.py
  • projects/core/dsl/task.py
  • projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.py
  • projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/on_failure_helpers.py
  • projects/kserve/toolbox/deploy_llmisvc/main.py
  • projects/kserve/toolbox/deploy_llmisvc/on_failure_helpers.py
  • projects/kserve/toolbox/prepare_hf_model_cache/main.py
  • projects/llm_d/orchestration/ci.py
  • projects/llm_d/orchestration/cleanup_phase.py
  • projects/llm_d/orchestration/config.d/deployments.yaml
  • projects/llm_d/orchestration/config.d/platform.yaml
  • projects/llm_d/orchestration/config.yaml
  • projects/llm_d/orchestration/manifests/leaderworkersetoperator.yaml
  • projects/llm_d/orchestration/preflight_phase.py
  • projects/llm_d/orchestration/prepare_phase.py
  • projects/llm_d/orchestration/test_phase.py
  • projects/llm_d/tests/test_profiles.py
  • projects/llm_d/toolbox/cleanup_test_resources/main.py
  • projects/rhaiis/orchestration/ci.py
  • projects/skeleton/orchestration/ci.py

Comment on lines +79 to +97
- name: 02-test
runAfter: [02-preflight]
taskRef:
name: forge-step
kind: Task
workspaces:
- name: artifacts
workspace: artifacts
params:
- name: fjob-name
value: "$(params.fjob-name)"
- name: fournos-workload-namespace
value: "$(params.fournos-workload-namespace)"
- name: kubeconfig-secret
value: "$(params.kubeconfig-secret)"
- name: job-step
value: preflight
- name: job-step-name
value: 02__test

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

02-test runs the preflight step instead of test.

The test task's job-step is set to preflight, so the actual test step never executes — preflight runs twice (once here, once in 02-preflight) while artifacts land under 02__test. This silently disables testing in the full pipeline. It should be test.

🐛 Proposed fix
         - name: job-step
-          value: preflight
+          value: test
         - name: job-step-name
           value: 02__test
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: 02-test
runAfter: [02-preflight]
taskRef:
name: forge-step
kind: Task
workspaces:
- name: artifacts
workspace: artifacts
params:
- name: fjob-name
value: "$(params.fjob-name)"
- name: fournos-workload-namespace
value: "$(params.fournos-workload-namespace)"
- name: kubeconfig-secret
value: "$(params.kubeconfig-secret)"
- name: job-step
value: preflight
- name: job-step-name
value: 02__test
- name: 02-test
runAfter: [02-preflight]
taskRef:
name: forge-step
kind: Task
workspaces:
- name: artifacts
workspace: artifacts
params:
- name: fjob-name
value: "$(params.fjob-name)"
- name: fournos-workload-namespace
value: "$(params.fournos-workload-namespace)"
- name: kubeconfig-secret
value: "$(params.kubeconfig-secret)"
- name: job-step
value: test
- name: job-step-name
value: 02__test
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos/gitops/base/workflows/pipeline-full.yaml` around lines 79 - 97, The
02-test task in the pipeline is invoking the wrong job step, so testing never
runs and preflight is duplicated. Update the `job-step` parameter in
`pipeline-full.yaml` for the `02-test` task from preflight to test, keeping the
surrounding `forge-step` task configuration and `job-step-name` as-is so the
step name matches the intended test execution.

Comment on lines +59 to +60
- name: 03-test
runAfter: [01-preflight]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

03-test references a nonexistent task in runAfter.

runAfter: [01-preflight] points to a task that does not exist — the preflight task is named 02-preflight (Line 39). Tekton will reject this pipeline with a DAG validation error ("task ... runAfter refers to nonexistent task"). Update the reference to 02-preflight.

🐛 Proposed fix
     - name: 03-test
-      runAfter: [01-preflight]
+      runAfter: [02-preflight]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: 03-test
runAfter: [01-preflight]
- name: 03-test
runAfter: [02-preflight]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos/gitops/base/workflows/pipeline-prepare-test.yaml` around lines 59 -
60, The 03-test task in the pipeline workflow has an invalid runAfter dependency
because it points to a task name that does not exist. Update the runAfter
reference on 03-test to match the actual preflight task identifier,
02-preflight, so the Tekton DAG resolves correctly and the pipeline definition
validates.

- name: job-step-name
value: 00__preflight

- name: 01-test

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

01-test has no runAfter on 00-preflight, so preflight does not gate the test.

Without a runAfter dependency, Tekton schedules 00-preflight and 01-test in parallel. If the intent of adding preflight is to validate prerequisites before running the test, the gate is ineffective here (contrast pipeline-prepare-only.yaml, where preflight runAfter: [00-prepare]). Add the dependency.

🐛 Proposed fix
     - name: 01-test
+      runAfter: [00-preflight]
       taskRef:
         name: forge-step
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: 01-test
- name: 01-test
runAfter: [00-preflight]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos/gitops/base/workflows/pipeline-test-only.yaml` at line 39, The
`01-test` task in the pipeline is missing a `runAfter` dependency on
`00-preflight`, so preflight can run in parallel instead of gating the test.
Update the Tekton workflow definition for `01-test` to depend on `00-preflight`,
following the same pattern used in `pipeline-prepare-only.yaml`, so the
preflight step completes before the test starts.

Comment on lines +16 to +22
def run(*, timeout_seconds: int = 600) -> int:
"""
Bootstrap the LeaderWorkerSetOperator instance and wait for it to be ready.

Args:
timeout_seconds: Maximum time to wait for the LWS operator configuration
"""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Wire or remove the unused timeout parameter.

Line 16 exposes timeout_seconds, but the waits are fixed by the retry decorators on Lines 57 and 67, so caller-provided values have no effect. Either remove the argument/docs or replace the fixed retries with timeout-driven polling.

Also applies to: 57-69

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/cluster/toolbox/bootstrap_lws_operator/main.py` around lines 16 -
22, The `run` function’s `timeout_seconds` parameter is currently unused because
readiness waits are controlled by the fixed retry decorators in `run` rather
than by caller input. Either remove `timeout_seconds` and its docstring from
`run`, or thread it into the waiting logic by replacing the fixed retries with
timeout-based polling in the readiness checks so the caller can actually
influence the wait duration.

Comment on lines +114 to +124
stdout_dest=artifact_dir / "artifacts" / f"{subscription_name}-csvs.yaml",
check=False,
log_stdout=False,
)

# Capture InstallPlans
shell.run(
f"oc get installplan -n {namespace} "
f"-l operators.coreos.com/{subscription_name}.{namespace} "
"-o yaml",
stdout_dest=artifact_dir / "artifacts" / f"{subscription_name}-installplans.yaml",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Include namespace in per-operator artifact filenames.

{subscription_name}-csvs.yaml and {subscription_name}-installplans.yaml collide when the same subscription name is cleaned from multiple namespaces, overwriting diagnostics.

🧾 Proposed fix
-        stdout_dest=artifact_dir / "artifacts" / f"{subscription_name}-csvs.yaml",
+        stdout_dest=artifact_dir / "artifacts" / f"{namespace}-{subscription_name}-csvs.yaml",
...
-        stdout_dest=artifact_dir / "artifacts" / f"{subscription_name}-installplans.yaml",
+        stdout_dest=artifact_dir / "artifacts" / f"{namespace}-{subscription_name}-installplans.yaml",
...
-            f.write(f"  - {operator_name}-csvs.yaml\n")
-            f.write(f"  - {operator_name}-installplans.yaml\n")
+            f.write(f"  - {namespace}-{operator_name}-csvs.yaml\n")
+            f.write(f"  - {namespace}-{operator_name}-installplans.yaml\n")

Also applies to: 212-216

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/cluster/toolbox/cleanup_operators/main.py` around lines 114 - 124,
The per-operator artifact filenames in the cleanup flow are missing the
namespace, so repeated subscription names can overwrite each other’s
diagnostics. Update the filename construction in the `shell.run` calls that
write the CSV and InstallPlan artifacts to include both `subscription_name` and
`namespace`, and apply the same change in the other matching block referenced by
the comment. Keep the existing artifact destinations and command behavior
unchanged.

Comment on lines +136 to 149
@on_failure(on_wait_pods_appear_failure)
@retry(attempts=12, delay=5, backoff=1.0)
@task
def wait_pods_appear(args, ctx):
"""Wait for llm-d pods to appear"""

pods = oc_get_json(
"pods", namespace=args.namespace, selector=ctx.selector, ignore_not_found=True
)
# Use service name and selector from context (if set) or create from args
service_name = getattr(ctx, "service_name", args.inference_service_name)
selector = getattr(ctx, "selector", f"app.kubernetes.io/name={service_name}")

pods = oc_get_json("pods", namespace=args.namespace, selector=selector, ignore_not_found=True)
if pods and pods.get("items"):
return f"Pods appeared for {ctx.service_name}"
return f"Pods appeared for {service_name}"
return False # Retry

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Restore a longer pod-appearance window.

Line 137 now caps wait_pods_appear at about 60 seconds (12 * 5s). Slow KServe/LLMInferenceService reconciliation can fail here before the later readiness wait runs, turning a valid slow deployment into a hard failure. Consider restoring the previous longer window or making this timeout configurable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/kserve/toolbox/deploy_llmisvc/main.py` around lines 136 - 149, The
pod-appearance wait in wait_pods_appear was shortened too much, which can cause
slow KServe/LLMInferenceService reconciliations to fail before readiness checks
run. Update wait_pods_appear in deploy_llmisvc/main.py to restore a longer wait
window or make the retry duration configurable, keeping the existing
service_name, selector, and oc_get_json flow intact.

Comment on lines +166 to +178
result = oc(
"describe",
"llminferenceservice",
service_name,
"-n",
args.namespace,
log_stdout=False,
check=False,
)

llmisv_desc_path = artifacts_dir / "llmisv_description.txt"
with open(llmisv_desc_path, "w", encoding="utf-8") as f:
f.write(result.stdout)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Include stderr when diagnostic oc calls fail.

These calls use check=False, but the artifacts only write stdout. If oc describe / oc get replicaset fails, the saved file can be empty or say “No replicasets found” while the real cause is in stderr.

Proposed diagnostic artifact fix
-        with open(llmisv_desc_path, "w", encoding="utf-8") as f:
-            f.write(result.stdout)
+        with open(llmisv_desc_path, "w", encoding="utf-8") as f:
+            f.write(result.stdout)
+            if result.stderr:
+                f.write("\n\n--- stderr ---\n")
+                f.write(result.stderr)
-            else:
+            elif rs_result.returncode == 0:
                 f.write("No replicasets found for the service")
+            else:
+                f.write("Failed to list replicasets for the service")
+                if rs_result.stderr:
+                    f.write("\n\n--- stderr ---\n")
+                    f.write(rs_result.stderr)

Also applies to: 201-236

🧰 Tools
🪛 ast-grep (0.44.0)

[warning] 176-176: File path is request-/variable-derived; validate and normalize to prevent path traversal.
Context: open(llmisv_desc_path, "w", encoding="utf-8")
Note: [CWE-22] Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal').

(open-filename-from-request)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/kserve/toolbox/deploy_llmisvc/main.py` around lines 166 - 178, The
diagnostic artifact writing for the `oc` calls currently saves only `stdout`, so
failures from `describe` and `get replicaset` can lose the real error details.
Update the artifact generation around the `oc(...)` result handling in `main.py`
so the saved files include `stderr` whenever `check=False` is used, and make
sure this is applied to both the `llminferenceservice` describe path and the
replicaset diagnostic path. Use the existing `result` object and artifact writes
like `llmisv_desc_path` to locate and append the error output alongside the
normal output.

Comment on lines 89 to +96
def pre_cleanup(ctx) -> int:
"""Cleanup phase - Clean up resources and finalize."""
# Cleanup doesn't typically need vaults, but initialize resolve-only for consistency
# no vault needed
from projects.llm_d.orchestration import runtime_config

for run_spec in runtime_config.get_run_specs():
with runtime_config.activate_run_spec(run_spec):
cleanup_toolbox_run(namespace=run_spec.namespace)
cleanup_toolbox_run(namespace=run_spec.namespace, cleanup_subscriptions=True)
return 0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect pipeline ordering and cleanup_operators to confirm pre-cleanup impact
fd -t f 'pipeline-full.yaml' --exec cat {}
rg -nP -C5 'def cleanup_operators' projects/llm_d/orchestration/cleanup_phase.py

Repository: openshift-psap/forge

Length of output: 4598


pre_cleanup forces a full uninstall/reinstall of non-preserved operators (e.g., rhods-operator) on every pipeline run.

Verification confirms that cleanup.preserve_operators in config.yaml lists only gpu-operator-certified, nfd, and openshift-cert-manager-operator. Consequently, rhods-operator, rhcl-operator, and leader-worker-set are not preserved.

Because the pipeline-full.yaml executes pre-cleanup immediately before prepare:

  1. pre_cleanup invokes cleanup_operators() with cleanup_subscriptions=True.
  2. This deletes the subscriptions/CSVs for rhods-operator and other non-preserved operators.
  3. The subsequent prepare step reinstalls them.

This introduces unnecessary churn, increases pipeline runtime, and risks reliability by forcing the teardown and re-provision of heavy operators like rhods-operator for every run.

Recommended Actions:

  • Relocate Cleanup: Move full subscription cleanup logic to post_cleanup to avoid destroying operators required during prepare and test.
  • Expand Preservation: If pre-run cleanup is strictly necessary, update cleanup.preserve_operators in config.yaml to include rhods-operator, rhcl-operator, and leader-worker-set.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm_d/orchestration/ci.py` around lines 89 - 96, The pre_cleanup()
step is deleting non-preserved operator subscriptions before prepare(), which
forces rhods-operator and related operators to be torn down and reinstalled on
every run. Update the cleanup flow in pre_cleanup() so full cleanup with
cleanup_subscriptions=True is moved to post_cleanup(), or adjust
runtime_config/config.yaml preserve settings to include rhods-operator,
rhcl-operator, and leader-worker-set if they must survive pre-run cleanup. Make
sure the fix is applied in the pre_cleanup and cleanup_toolbox_run path so
prepare() no longer triggers unnecessary operator churn.

Comment on lines +58 to +74
cleanup_config = config.project.get_config("cleanup", {})

logger.info("Starting operator cleanup (subscriptions, CSVs, InstallPlans, OperatorGroups)")

# Identify operators to delete based on configuration
operators_to_delete = _identify_operators_to_delete(platform, cleanup_config)

if not operators_to_delete:
logger.info("No operators to clean up")
return

logger.info(f"Found {len(operators_to_delete)} operators to clean up")

cleanup_operators_command.run(
operators=operators_to_delete,
dry_run=dry_run,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Fail closed before deleting platform operators.

With cleanup_config or preserve_operators missing, this treats every platform operator as deletable. Since CI now calls cleanup with cleanup_subscriptions=True, a missing/new preserve entry can uninstall shared cluster operators unexpectedly. Default to preserving unless deletion is explicit, or require an explicit cleanup allowlist before calling cleanup_operators_command.run.

Safer default option
-    preserve_operators = cleanup_config.get("preserve_operators", {})
+    preserve_operators = cleanup_config.get("preserve_operators")
+    if preserve_operators is None:
+        logger.warning("cleanup.preserve_operators is missing; preserving all operators")
+        return []
@@
-        if preserve_operators.get(package_name, False):
+        if preserve_operators.get(package_name, True):
             logger.info(f"Preserving operator: {package_name}")
             continue

Also applies to: 80-111

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm_d/orchestration/cleanup_phase.py` around lines 58 - 74, The
cleanup flow in cleanup_phase.py is currently fail-open: when cleanup_config or
preserve_operators is missing, _identify_operators_to_delete can mark all
platform operators for deletion before cleanup_operators_command.run is called.
Update the cleanup logic to default to preserving operators unless an explicit
allowlist/opt-in is present, and make _identify_operators_to_delete (and any
callers in this phase) return no deletions when the preserve list is absent or
incomplete. Ensure the guard happens before invoking
cleanup_operators_command.run so platform operators are not removed
accidentally.


agentic:
enabled: true
model_key: qwen-3-6-35b

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find where model keys are defined/consumed to confirm qwen-3-6-35b is valid
rg -nP --type=yaml 'qwen[\w-]*' 
rg -nP -C3 'model_key' projects/core/agentic projects/llm_d

Repository: openshift-psap/forge

Length of output: 158


🏁 Script executed:

#!/bin/bash
# Search for model definitions and registry files
rg -t yaml -t json -i 'model_key|model_id|model_registry' --type-add 'yaml:*.yaml' --type-add 'json:*.json' .
# Explicit search for qwen string
rg -i 'qwen' .
# Search for files named models or registry
find . -name "*model*" -o -name "*registry*" | head -20

Repository: openshift-psap/forge

Length of output: 15824


The model_key value qwen-3-6-35b is invalid and will fail model resolution.

This identifier does not match any model key defined in projects/rhaiis/orchestration/config.d/models.yaml (which lists qwen3-0_6b, qwen3-30b-a3b, qwen3-235b-instruct, etc.). The value appears to be a typo for qwen3-0-6b (corresponding to Qwen/Qwen3-0.6B), which is the model currently used in projects/llm_d/orchestration/config.d/runtime.yaml and the test suite.

Additionally, this value is hardcoded in projects/core/agentic/config_review/__init__.py as MODEL_KEY. Update this default to a valid key like qwen3-0-6b to prevent resolution failures in the agentic flow.

Valid Qwen model keys in registry
qwen3-0_6b:
  hf_model_id: Qwen/Qwen3-0.6B
qwen25-7b-instruct:
  hf_model_id: Qwen/Qwen2.5-7B-Instruct
...
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm_d/orchestration/config.yaml` at line 71, The hardcoded model_key
value is invalid and will break model resolution; update the config entry in the
orchestration config to a valid registry key such as qwen3-0_6b/qwen3-0-6b, and
make sure the default MODEL_KEY in config_review/__init__.py matches the same
valid identifier used by runtime.yaml and the test suite.

@kpouget

kpouget commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-full
/var platform.status_address_name: gateway-internal

@psap-forge-bot

Copy link
Copy Markdown

🔴 Executing of 'llm_d test' failed after 00 hours 06 minutes 16 seconds 🔴

• Link to the test results.

• Caliper postprocess completed but no reports generated.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260627-080240
ci_job.name: llm_d
ci_job.owner: kpouget
platform.status_address_name: gateway-internal
project.args: []
project.name: llm_d

TEST DESCRIPTION:

This test evaluates the llm_d project by applying the benchmark-short preset with the deployment-approximate-prefix-cache profile to validate baseline inference performance. It runs a lightweight workload (256 input/128 output tokens, 1 concurrent request) to test deployment stability and vLLM prefix caching functionality.

FAILURE REVIEW 002 run smoke request:

001__llmd_test/002__run_smoke_request

The execute_smoke_request task failed with a curl connection timeout (exit code 28) because the LLMInferenceService controller reported a reconciliation error due to a missing HTTPRoute resource (llm-d-kserve-route). This routing failure left the service in a Progressing state with RouterReady: False, causing the external LoadBalancer endpoint to be unable to forward traffic to the backend pods.

Detailed failure analysis report

Config review report

Raw failure details
Execution logs

@psap-forge-bot

Copy link
Copy Markdown

🔴 Executing of 'fournos_launcher submit' failed after 00 hours 10 minutes 38 seconds 🔴

• Link to the test results.

FOURNOS Job Status:

  • forge-llm-d-20260627-080240: Failed - Tasks Completed: 5 (Failed: 1, Cancelled 0), Skipped: 0

MLflow Tracking:

FOURNOS Job Details:

Task Logs:

Test configuration:

/test fournos llm_d
/var platform.status_address_name: gateway-internal
/cluster forge-smoke-testing
/pipeline forge-full

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-27 08:13:08
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:186
   RuntimeError: Job forge-llm-d-20260627-080240 failed: Tasks Completed: 5 (Failed: 1, Cancelled 0), Skipped: 0


[...]

Execution logs

@openshift-ci

openshift-ci Bot commented Jun 27, 2026

Copy link
Copy Markdown

@kpouget: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/fournos 8d27be1 link true /test fournos

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant