[llm-d] Make more reliable#91
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughThe PR renumbers Tekton workflows and adds preflight steps, adds ChangesTekton pipeline sequencing
DSL early-return and retry logging
Cluster operator lifecycle tooling
GPU operator bootstrap and diagnostics
LLMInferenceService deployment diagnostics
HF cache wait loop
llm_d orchestration wiring
CI text and stubs
Sequence Diagram(s)sequenceDiagram
participant check_existing_csv
participant wait_for_installplan
participant oc
participant handle_installplan_failure
check_existing_csv->>oc: query CSVs by operator package and namespace
oc-->>check_existing_csv: CSV names and phases
check_existing_csv-->>wait_for_installplan: continue when no Succeeded CSV exists
wait_for_installplan->>oc: fetch Subscription YAML
oc-->>wait_for_installplan: installPlanRef and approval mode
wait_for_installplan->>oc: fetch InstallPlan YAML
wait_for_installplan->>oc: patch spec.approved=true
wait_for_installplan-->>handle_installplan_failure: write AGENT.md on failure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/test fournos llm_d |
|
🔴 Test of 'llm_d prepare' failed after 00 hours 07 minutes 16 seconds 🔴 • Link to the test results. Test configuration: FAILURE REVIEW 001 cluster deploy operator openshift-cert-manager-operator:
|
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 10 minutes 26 seconds 🔴 • Link to the test results. • FOURNOS Job Status:
• MLflow Tracking: • FOURNOS Job Details: • Task Logs:
Test configuration: |
|
/test fournos llm_d |
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
|
🔴 Test of 'llm_d prepare' failed after 00 hours 07 minutes 42 seconds 🔴 • Link to the test results. Test configuration: FAILURE REVIEW 003 bootstrap lws operator:
|
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 09 minutes 05 seconds 🔴 • Link to the test results. • FOURNOS Job Status:
• MLflow Tracking: • FOURNOS Job Details: • Task Logs:
Test configuration: |
|
/test fournos llm_d |
|
🔴 Test of 'llm_d prepare' failed after 00 hours 21 minutes 20 seconds 🔴 • Link to the test results. Test configuration: FAILURE REVIEW 010 bootstrap gpu clusterpolicy:
|
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 24 minutes 47 seconds 🔴 • Link to the test results. • FOURNOS Job Status:
• MLflow Tracking: • FOURNOS Job Details: • Task Logs:
Test configuration: |
7b83bd6 to
4919605
Compare
|
/test fournos llm_d |
|
🔴 Test of 'llm_d test' failed after 00 hours 16 minutes 26 seconds 🔴 • Link to the test results. • Caliper postprocess completed but no reports generated. Test configuration: TEST DESCRIPTION:
FAILURE REVIEW 002 run smoke request:
|
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 22 minutes 23 seconds 🔴 • Link to the test results. • FOURNOS Job Status:
• MLflow Tracking: • FOURNOS Job Details: • Task Logs:
Test configuration: |
…rs in the test directory
|
/test fournos llm_d |
|
🔴 Executing of 'llm_d test' failed after 00 hours 05 minutes 46 seconds 🔴 • Link to the test results. • Caliper postprocess completed but no reports generated. Test configuration: TEST DESCRIPTION:
FAILURE REVIEW 002 run smoke request:
|
|
🔴 Executing of 'fournos_launcher submit' failed after 00 hours 10 minutes 22 seconds 🔴 • Link to the test results. • FOURNOS Job Status:
• MLflow Tracking: • FOURNOS Job Details: • Task Logs:
Test configuration: |
…the manual InstallPlan
…ait_for_installplan failure
There was a problem hiding this comment.
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
fournos/gitops/base/workflows/pipeline-full.yaml (1)
119-136: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
job-step-namedoes not match the05-export-artifactstask name.The task is named
05-export-artifactsbut itsjob-step-nameis04__export-artifacts, so the artifact directory (derived fromFOURNOS_STEP_NAMEpertask-forge-step.yaml) will be misnamed and inconsistent with the task. Should be05__export-artifacts.🐛 Proposed fix
- name: job-step-name - value: 04__export-artifacts + value: 05__export-artifacts🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fournos/gitops/base/workflows/pipeline-full.yaml` around lines 119 - 136, The `05-export-artifacts` task in `pipeline-full.yaml` has a mismatched `job-step-name` value, which will cause the artifact directory naming used by `forge-step` and `task-forge-step.yaml` to be inconsistent. Update the `job-step-name` parameter in the `05-export-artifacts` task to match the task’s step number so it follows the same `FOURNOS_STEP_NAME`-derived pattern as the other tasks, using the `05__export-artifacts` naming convention.
🧹 Nitpick comments (6)
fournos/gitops/base/workflows/pipeline-prepare-test.yaml (1)
20-37: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueStale
job-step-namenumbering on01-prepare.The task
01-preparestill usesjob-step-name: 001__preparewhile the rest of this file moved to the two-digit0N__scheme (e.g.02__preflight,04__export-artifacts). Align it to01__preparefor a consistent artifact-dir naming.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fournos/gitops/base/workflows/pipeline-prepare-test.yaml` around lines 20 - 37, The `01-prepare` task in the pipeline manifest still uses the old `job-step-name` value `001__prepare`; update it to match the file’s current two-digit naming convention used by other steps like `02__preflight` and `04__export-artifacts`. Make the change in the `job-step-name` param for `01-prepare` so artifact directory naming stays consistent across the workflow.fournos/gitops/base/workflows/pipeline-full.yaml (1)
59-79: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueDuplicate
02prefix for two distinct tasks.Both
02-preflightand02-testshare the02numbering, breaking the otherwise sequential ordering convention (comparepipeline-prepare-test.yaml, where preflight is02and test is03). Consider renaming the test task to03-testand itsjob-step-nameto03__testfor consistency.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fournos/gitops/base/workflows/pipeline-full.yaml` around lines 59 - 79, The pipeline task names are using the same 02 prefix for two different steps, which breaks the sequential naming convention. Update the task identifier for the test stage from 02-test to 03-test, and adjust the corresponding job-step-name in the same task from 02__test to 03__test so it matches the ordering used by the pipeline task definitions.projects/kserve/toolbox/prepare_hf_model_cache/main.py (1)
340-343: 🔒 Security & Privacy | 🔵 TrivialPrefer the list form for
shell.runto avoid shell interpolation
shell.runacceptsstr | list[str]and defaults toshell=True. Passing an argument list withshell=Falseremoves the shell interpolation risk and is more robust to unusual characters innamespaceorjob_name.Both values are derived from
cache_spec, which is built deterministically in this module and always includesnamespaceanddownload_job_name.♻️ Proposed change
- shell.run( - f"oc get pods -n {namespace} -l job-name={job_name}", - check=False, - ) + shell.run( + ["oc", "get", "pods", "-n", namespace, "-l", f"job-name={job_name}"], + check=False, + shell=False, + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/kserve/toolbox/prepare_hf_model_cache/main.py` around lines 340 - 343, The `shell.run` call in `prepare_hf_model_cache/main.py` is using a shell-form string, which leaves room for shell interpolation; update the `oc get pods` invocation to use the argument-list form with `shell=False`. Keep the same command semantics while passing `namespace` and `job_name` as separate arguments so `shell.run` handles them safely and robustly. Use the existing `shell.run` call site around the pod lookup logic to locate the change.projects/llm_d/toolbox/cleanup_test_resources/main.py (1)
208-212: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winDrop or persist the unused JSON pod fetch.
This command discards stdout and does not feed the deletion check, so it only adds an extra API-server call on every retry. Keep the plain-text diagnostic, or write the JSON output to an artifact if it is needed.
Proposed cleanup
- shell.run( - f"oc get pods -n {args.namespace} -l app.kubernetes.io/name={args.inference_service_name} -o json", - check=False, - log_stdout=False, - ) -🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm_d/toolbox/cleanup_test_resources/main.py` around lines 208 - 212, The extra JSON pod fetch in the cleanup flow is unused and just adds another API-server call on each retry. Update the retry logic around shell.run in cleanup_test_resources/main.py so the diagnostic remains plain-text only, or persist the JSON output to an artifact if it is actually needed by the deletion check; remove the discarded json call from the retry path.projects/llm_d/orchestration/test_phase.py (1)
57-61: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAlign
run_finalizers’ return type with its actual contract.Line 104 returns a pair, so the annotation should describe
(primary_exc, finalizer_exc)rather than one exception tuple.Proposed cleanup
def run_finalizers( endpoint_url: str | None, primary_exc: tuple[type[BaseException], BaseException, Any] | None, finalizer_exc: tuple[type[BaseException], BaseException, Any] | None, -) -> tuple[type[BaseException], BaseException, Any] | None: +) -> tuple[ + tuple[type[BaseException], BaseException, Any] | None, + tuple[type[BaseException], BaseException, Any] | None, +]:Also applies to: 104-104
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm_d/orchestration/test_phase.py` around lines 57 - 61, `run_finalizers`’ return annotation does not match what the function actually returns: it returns both `primary_exc` and `finalizer_exc`, not a single exception tuple. Update the `run_finalizers` signature and any related type hints in `test_phase.py` to reflect a pair/tuple of the two exception tuples, and make sure the return statement at the end of `run_finalizers` matches that contract.projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.py (1)
167-208: 🔒 Security & Privacy | 🔵 TrivialReplace f-string shell commands with argv lists and
shell=FalseThe
shell.run()wrapper explicitly supports list-style arguments and theshellflag. Interpolatingnamespaceinto shell strings needlessly exposes command arguments to shell parsing. Convert all sixocinvocations in this block to use the list form.shell.run( - f"oc get daemonsets -n {namespace} -oyaml", + ["oc", "get", "daemonsets", "-n", namespace, "-oyaml"], + shell=False, stdout_dest=args.artifact_dir / "artifacts" / "gpu-operator-daemonsets.yaml", check=False, )Apply this pattern to all remaining calls in the block.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.py` around lines 167 - 208, The `shell.run()` calls in the GPU operator artifact capture block are still using interpolated shell strings, which should be replaced with argv lists. Update each `oc` invocation in `main.py` to pass command arguments as a list using the `shell.run` wrapper’s list-style form, keeping `namespace` as a separate argument and not embedded in an f-string. Apply this change consistently to the six capture calls in the block, including the `get`, `describe`, and `-owide` commands, so they all run with `shell=False`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@fournos/gitops/base/workflows/pipeline-full.yaml`:
- Around line 79-97: The 02-test task in the pipeline is invoking the wrong job
step, so testing never runs and preflight is duplicated. Update the `job-step`
parameter in `pipeline-full.yaml` for the `02-test` task from preflight to test,
keeping the surrounding `forge-step` task configuration and `job-step-name`
as-is so the step name matches the intended test execution.
In `@fournos/gitops/base/workflows/pipeline-prepare-test.yaml`:
- Around line 59-60: The 03-test task in the pipeline workflow has an invalid
runAfter dependency because it points to a task name that does not exist. Update
the runAfter reference on 03-test to match the actual preflight task identifier,
02-preflight, so the Tekton DAG resolves correctly and the pipeline definition
validates.
In `@fournos/gitops/base/workflows/pipeline-test-only.yaml`:
- Line 39: The `01-test` task in the pipeline is missing a `runAfter` dependency
on `00-preflight`, so preflight can run in parallel instead of gating the test.
Update the Tekton workflow definition for `01-test` to depend on `00-preflight`,
following the same pattern used in `pipeline-prepare-only.yaml`, so the
preflight step completes before the test starts.
In `@projects/cluster/toolbox/bootstrap_lws_operator/main.py`:
- Around line 16-22: The `run` function’s `timeout_seconds` parameter is
currently unused because readiness waits are controlled by the fixed retry
decorators in `run` rather than by caller input. Either remove `timeout_seconds`
and its docstring from `run`, or thread it into the waiting logic by replacing
the fixed retries with timeout-based polling in the readiness checks so the
caller can actually influence the wait duration.
In `@projects/cluster/toolbox/cleanup_operators/main.py`:
- Around line 114-124: The per-operator artifact filenames in the cleanup flow
are missing the namespace, so repeated subscription names can overwrite each
other’s diagnostics. Update the filename construction in the `shell.run` calls
that write the CSV and InstallPlan artifacts to include both `subscription_name`
and `namespace`, and apply the same change in the other matching block
referenced by the comment. Keep the existing artifact destinations and command
behavior unchanged.
- Around line 161-175: The cleanup logic in the subscription check is failing
open because it treats an unsuccessful oc get subscription call as zero
subscriptions, which can trigger unsafe deletion of shared OperatorGroups.
Update the subscription-counting flow in cleanup_operators/main.py so the code
using shell.run and remaining_subscriptions first verifies the command succeeded
before inspecting result.stdout, and if it fails, skip deletion and log or raise
an error instead of proceeding to delete operator groups. Keep the deletion
guard around the existing logger.info and oc delete operatorgroup path so
OperatorGroups are only removed when the subscription query is known to be
valid.
In `@projects/cluster/toolbox/cluster_deploy_operator/main.py`:
- Around line 66-112: The early-return logic in check_existing_csv can bypass
validate_parameters, letting reused cluster state hide invalid inputs. Reorder
the task flow so validate_parameters runs before check_existing_csv, ensuring
source_name, channel, approval mode, and install mode are validated first while
keeping the existing EarlyReturn behavior in check_existing_csv unchanged.
- Around line 456-466: The Subscription failure check in the operator logic only
inspects the second entry in conditions, so it can miss a ResolutionFailed
condition when ordering changes. Update the condition handling in main.py to
iterate through all items in subscription_data["status"]["conditions"] and raise
the same ValueError as soon as any condition_type equals ResolutionFailed, using
the matching message and reason from that condition.
In `@projects/core/ci_entrypoint/prepare_ci.py`:
- Line 588: The user-visible notification text built in prepare_ci.py is
inconsistent and ungrammatical, using both “Executing of” and “Execution of”
around the notification_status formatting. Update the status message
construction at the affected notification points so they all use one consistent,
grammatical phrasing (for example, the same “Execution of '{project}
{operation}' ...” form) in the prepare_ci flow, including the logic around
notification_status.
In `@projects/core/dsl/runtime.py`:
- Around line 369-376: The early-return path in runtime.py drops any context
changes because EarlyReturnException is raised in the task handling block before
the shared_context write-back runs. Update the task execution flow around the
EarlyReturn check so the context persistence loop executes before raising
EarlyReturnException, or otherwise ensure context is saved as part of the
early-return path. Use the existing task_status, EarlyReturn, and shared_context
logic in the runtime task runner to keep values set by the task available to
downstream `@always` tasks like capture_csv.
In `@projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.py`:
- Around line 84-91: The namespace fallback in the ClusterPolicy readiness flow
is getting stuck on the default value instead of refreshing from the latest
status. Update the logic in the ClusterPolicy retry/readiness path, especially
the code that reads `result.stdout` and the loop that rechecks status on each
retry, so `ctx.gpu_operator_namespace` is reassigned whenever
`.status.namespace` becomes available. Keep the existing fallback behavior only
until a real namespace is observed, and make sure the final diagnostics use the
refreshed value from `ctx.gpu_operator_namespace`.
In `@projects/kserve/toolbox/deploy_llmisvc/main.py`:
- Around line 136-149: The pod-appearance wait in wait_pods_appear was shortened
too much, which can cause slow KServe/LLMInferenceService reconciliations to
fail before readiness checks run. Update wait_pods_appear in
deploy_llmisvc/main.py to restore a longer wait window or make the retry
duration configurable, keeping the existing service_name, selector, and
oc_get_json flow intact.
- Around line 166-178: The diagnostic artifact writing for the `oc` calls
currently saves only `stdout`, so failures from `describe` and `get replicaset`
can lose the real error details. Update the artifact generation around the
`oc(...)` result handling in `main.py` so the saved files include `stderr`
whenever `check=False` is used, and make sure this is applied to both the
`llminferenceservice` describe path and the replicaset diagnostic path. Use the
existing `result` object and artifact writes like `llmisv_desc_path` to locate
and append the error output alongside the normal output.
In `@projects/llm_d/orchestration/ci.py`:
- Around line 89-96: The pre_cleanup() step is deleting non-preserved operator
subscriptions before prepare(), which forces rhods-operator and related
operators to be torn down and reinstalled on every run. Update the cleanup flow
in pre_cleanup() so full cleanup with cleanup_subscriptions=True is moved to
post_cleanup(), or adjust runtime_config/config.yaml preserve settings to
include rhods-operator, rhcl-operator, and leader-worker-set if they must
survive pre-run cleanup. Make sure the fix is applied in the pre_cleanup and
cleanup_toolbox_run path so prepare() no longer triggers unnecessary operator
churn.
In `@projects/llm_d/orchestration/cleanup_phase.py`:
- Around line 58-74: The cleanup flow in cleanup_phase.py is currently
fail-open: when cleanup_config or preserve_operators is missing,
_identify_operators_to_delete can mark all platform operators for deletion
before cleanup_operators_command.run is called. Update the cleanup logic to
default to preserving operators unless an explicit allowlist/opt-in is present,
and make _identify_operators_to_delete (and any callers in this phase) return no
deletions when the preserve list is absent or incomplete. Ensure the guard
happens before invoking cleanup_operators_command.run so platform operators are
not removed accidentally.
In `@projects/llm_d/orchestration/config.yaml`:
- Line 71: The hardcoded model_key value is invalid and will break model
resolution; update the config entry in the orchestration config to a valid
registry key such as qwen3-0_6b/qwen3-0-6b, and make sure the default MODEL_KEY
in config_review/__init__.py matches the same valid identifier used by
runtime.yaml and the test suite.
---
Outside diff comments:
In `@fournos/gitops/base/workflows/pipeline-full.yaml`:
- Around line 119-136: The `05-export-artifacts` task in `pipeline-full.yaml`
has a mismatched `job-step-name` value, which will cause the artifact directory
naming used by `forge-step` and `task-forge-step.yaml` to be inconsistent.
Update the `job-step-name` parameter in the `05-export-artifacts` task to match
the task’s step number so it follows the same `FOURNOS_STEP_NAME`-derived
pattern as the other tasks, using the `05__export-artifacts` naming convention.
---
Nitpick comments:
In `@fournos/gitops/base/workflows/pipeline-full.yaml`:
- Around line 59-79: The pipeline task names are using the same 02 prefix for
two different steps, which breaks the sequential naming convention. Update the
task identifier for the test stage from 02-test to 03-test, and adjust the
corresponding job-step-name in the same task from 02__test to 03__test so it
matches the ordering used by the pipeline task definitions.
In `@fournos/gitops/base/workflows/pipeline-prepare-test.yaml`:
- Around line 20-37: The `01-prepare` task in the pipeline manifest still uses
the old `job-step-name` value `001__prepare`; update it to match the file’s
current two-digit naming convention used by other steps like `02__preflight` and
`04__export-artifacts`. Make the change in the `job-step-name` param for
`01-prepare` so artifact directory naming stays consistent across the workflow.
In `@projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.py`:
- Around line 167-208: The `shell.run()` calls in the GPU operator artifact
capture block are still using interpolated shell strings, which should be
replaced with argv lists. Update each `oc` invocation in `main.py` to pass
command arguments as a list using the `shell.run` wrapper’s list-style form,
keeping `namespace` as a separate argument and not embedded in an f-string.
Apply this change consistently to the six capture calls in the block, including
the `get`, `describe`, and `-owide` commands, so they all run with
`shell=False`.
In `@projects/kserve/toolbox/prepare_hf_model_cache/main.py`:
- Around line 340-343: The `shell.run` call in `prepare_hf_model_cache/main.py`
is using a shell-form string, which leaves room for shell interpolation; update
the `oc get pods` invocation to use the argument-list form with `shell=False`.
Keep the same command semantics while passing `namespace` and `job_name` as
separate arguments so `shell.run` handles them safely and robustly. Use the
existing `shell.run` call site around the pod lookup logic to locate the change.
In `@projects/llm_d/orchestration/test_phase.py`:
- Around line 57-61: `run_finalizers`’ return annotation does not match what the
function actually returns: it returns both `primary_exc` and `finalizer_exc`,
not a single exception tuple. Update the `run_finalizers` signature and any
related type hints in `test_phase.py` to reflect a pair/tuple of the two
exception tuples, and make sure the return statement at the end of
`run_finalizers` matches that contract.
In `@projects/llm_d/toolbox/cleanup_test_resources/main.py`:
- Around line 208-212: The extra JSON pod fetch in the cleanup flow is unused
and just adds another API-server call on each retry. Update the retry logic
around shell.run in cleanup_test_resources/main.py so the diagnostic remains
plain-text only, or persist the JSON output to an artifact if it is actually
needed by the deletion check; remove the discarded json call from the retry
path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8e4e2f9a-4e5a-4b96-ad58-9a10f65452d1
📒 Files selected for processing (33)
fournos/gitops/base/workflows/pipeline-full.yamlfournos/gitops/base/workflows/pipeline-prepare-only.yamlfournos/gitops/base/workflows/pipeline-prepare-test.yamlfournos/gitops/base/workflows/pipeline-test-only.yamlprojects/cluster/toolbox/bootstrap_lws_operator/main.pyprojects/cluster/toolbox/bootstrap_lws_operator/templates/leaderworkersetoperator.yaml.j2projects/cluster/toolbox/cleanup_operators/main.pyprojects/cluster/toolbox/cluster_deploy_operator/main.pyprojects/cluster/toolbox/cluster_deploy_operator/on_failure_helpers.pyprojects/core/agentic/on_failure/__init__.pyprojects/core/ci_entrypoint/prepare_ci.pyprojects/core/dsl/__init__.pyprojects/core/dsl/control_flow.pyprojects/core/dsl/runtime.pyprojects/core/dsl/task.pyprojects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.pyprojects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/on_failure_helpers.pyprojects/kserve/toolbox/deploy_llmisvc/main.pyprojects/kserve/toolbox/deploy_llmisvc/on_failure_helpers.pyprojects/kserve/toolbox/prepare_hf_model_cache/main.pyprojects/llm_d/orchestration/ci.pyprojects/llm_d/orchestration/cleanup_phase.pyprojects/llm_d/orchestration/config.d/deployments.yamlprojects/llm_d/orchestration/config.d/platform.yamlprojects/llm_d/orchestration/config.yamlprojects/llm_d/orchestration/manifests/leaderworkersetoperator.yamlprojects/llm_d/orchestration/preflight_phase.pyprojects/llm_d/orchestration/prepare_phase.pyprojects/llm_d/orchestration/test_phase.pyprojects/llm_d/tests/test_profiles.pyprojects/llm_d/toolbox/cleanup_test_resources/main.pyprojects/rhaiis/orchestration/ci.pyprojects/skeleton/orchestration/ci.py
| - name: 02-test | ||
| runAfter: [02-preflight] | ||
| taskRef: | ||
| name: forge-step | ||
| kind: Task | ||
| workspaces: | ||
| - name: artifacts | ||
| workspace: artifacts | ||
| params: | ||
| - name: fjob-name | ||
| value: "$(params.fjob-name)" | ||
| - name: fournos-workload-namespace | ||
| value: "$(params.fournos-workload-namespace)" | ||
| - name: kubeconfig-secret | ||
| value: "$(params.kubeconfig-secret)" | ||
| - name: job-step | ||
| value: preflight | ||
| - name: job-step-name | ||
| value: 02__test |
There was a problem hiding this comment.
🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win
02-test runs the preflight step instead of test.
The test task's job-step is set to preflight, so the actual test step never executes — preflight runs twice (once here, once in 02-preflight) while artifacts land under 02__test. This silently disables testing in the full pipeline. It should be test.
🐛 Proposed fix
- name: job-step
- value: preflight
+ value: test
- name: job-step-name
value: 02__test📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: 02-test | |
| runAfter: [02-preflight] | |
| taskRef: | |
| name: forge-step | |
| kind: Task | |
| workspaces: | |
| - name: artifacts | |
| workspace: artifacts | |
| params: | |
| - name: fjob-name | |
| value: "$(params.fjob-name)" | |
| - name: fournos-workload-namespace | |
| value: "$(params.fournos-workload-namespace)" | |
| - name: kubeconfig-secret | |
| value: "$(params.kubeconfig-secret)" | |
| - name: job-step | |
| value: preflight | |
| - name: job-step-name | |
| value: 02__test | |
| - name: 02-test | |
| runAfter: [02-preflight] | |
| taskRef: | |
| name: forge-step | |
| kind: Task | |
| workspaces: | |
| - name: artifacts | |
| workspace: artifacts | |
| params: | |
| - name: fjob-name | |
| value: "$(params.fjob-name)" | |
| - name: fournos-workload-namespace | |
| value: "$(params.fournos-workload-namespace)" | |
| - name: kubeconfig-secret | |
| value: "$(params.kubeconfig-secret)" | |
| - name: job-step | |
| value: test | |
| - name: job-step-name | |
| value: 02__test |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fournos/gitops/base/workflows/pipeline-full.yaml` around lines 79 - 97, The
02-test task in the pipeline is invoking the wrong job step, so testing never
runs and preflight is duplicated. Update the `job-step` parameter in
`pipeline-full.yaml` for the `02-test` task from preflight to test, keeping the
surrounding `forge-step` task configuration and `job-step-name` as-is so the
step name matches the intended test execution.
| - name: 03-test | ||
| runAfter: [01-preflight] |
There was a problem hiding this comment.
🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win
03-test references a nonexistent task in runAfter.
runAfter: [01-preflight] points to a task that does not exist — the preflight task is named 02-preflight (Line 39). Tekton will reject this pipeline with a DAG validation error ("task ... runAfter refers to nonexistent task"). Update the reference to 02-preflight.
🐛 Proposed fix
- name: 03-test
- runAfter: [01-preflight]
+ runAfter: [02-preflight]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: 03-test | |
| runAfter: [01-preflight] | |
| - name: 03-test | |
| runAfter: [02-preflight] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fournos/gitops/base/workflows/pipeline-prepare-test.yaml` around lines 59 -
60, The 03-test task in the pipeline workflow has an invalid runAfter dependency
because it points to a task name that does not exist. Update the runAfter
reference on 03-test to match the actual preflight task identifier,
02-preflight, so the Tekton DAG resolves correctly and the pipeline definition
validates.
| - name: job-step-name | ||
| value: 00__preflight | ||
|
|
||
| - name: 01-test |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
01-test has no runAfter on 00-preflight, so preflight does not gate the test.
Without a runAfter dependency, Tekton schedules 00-preflight and 01-test in parallel. If the intent of adding preflight is to validate prerequisites before running the test, the gate is ineffective here (contrast pipeline-prepare-only.yaml, where preflight runAfter: [00-prepare]). Add the dependency.
🐛 Proposed fix
- name: 01-test
+ runAfter: [00-preflight]
taskRef:
name: forge-step📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: 01-test | |
| - name: 01-test | |
| runAfter: [00-preflight] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fournos/gitops/base/workflows/pipeline-test-only.yaml` at line 39, The
`01-test` task in the pipeline is missing a `runAfter` dependency on
`00-preflight`, so preflight can run in parallel instead of gating the test.
Update the Tekton workflow definition for `01-test` to depend on `00-preflight`,
following the same pattern used in `pipeline-prepare-only.yaml`, so the
preflight step completes before the test starts.
| def run(*, timeout_seconds: int = 600) -> int: | ||
| """ | ||
| Bootstrap the LeaderWorkerSetOperator instance and wait for it to be ready. | ||
|
|
||
| Args: | ||
| timeout_seconds: Maximum time to wait for the LWS operator configuration | ||
| """ |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Wire or remove the unused timeout parameter.
Line 16 exposes timeout_seconds, but the waits are fixed by the retry decorators on Lines 57 and 67, so caller-provided values have no effect. Either remove the argument/docs or replace the fixed retries with timeout-driven polling.
Also applies to: 57-69
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/cluster/toolbox/bootstrap_lws_operator/main.py` around lines 16 -
22, The `run` function’s `timeout_seconds` parameter is currently unused because
readiness waits are controlled by the fixed retry decorators in `run` rather
than by caller input. Either remove `timeout_seconds` and its docstring from
`run`, or thread it into the waiting logic by replacing the fixed retries with
timeout-based polling in the readiness checks so the caller can actually
influence the wait duration.
| stdout_dest=artifact_dir / "artifacts" / f"{subscription_name}-csvs.yaml", | ||
| check=False, | ||
| log_stdout=False, | ||
| ) | ||
|
|
||
| # Capture InstallPlans | ||
| shell.run( | ||
| f"oc get installplan -n {namespace} " | ||
| f"-l operators.coreos.com/{subscription_name}.{namespace} " | ||
| "-o yaml", | ||
| stdout_dest=artifact_dir / "artifacts" / f"{subscription_name}-installplans.yaml", |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win
Include namespace in per-operator artifact filenames.
{subscription_name}-csvs.yaml and {subscription_name}-installplans.yaml collide when the same subscription name is cleaned from multiple namespaces, overwriting diagnostics.
🧾 Proposed fix
- stdout_dest=artifact_dir / "artifacts" / f"{subscription_name}-csvs.yaml",
+ stdout_dest=artifact_dir / "artifacts" / f"{namespace}-{subscription_name}-csvs.yaml",
...
- stdout_dest=artifact_dir / "artifacts" / f"{subscription_name}-installplans.yaml",
+ stdout_dest=artifact_dir / "artifacts" / f"{namespace}-{subscription_name}-installplans.yaml",
...
- f.write(f" - {operator_name}-csvs.yaml\n")
- f.write(f" - {operator_name}-installplans.yaml\n")
+ f.write(f" - {namespace}-{operator_name}-csvs.yaml\n")
+ f.write(f" - {namespace}-{operator_name}-installplans.yaml\n")Also applies to: 212-216
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/cluster/toolbox/cleanup_operators/main.py` around lines 114 - 124,
The per-operator artifact filenames in the cleanup flow are missing the
namespace, so repeated subscription names can overwrite each other’s
diagnostics. Update the filename construction in the `shell.run` calls that
write the CSV and InstallPlan artifacts to include both `subscription_name` and
`namespace`, and apply the same change in the other matching block referenced by
the comment. Keep the existing artifact destinations and command behavior
unchanged.
| @on_failure(on_wait_pods_appear_failure) | ||
| @retry(attempts=12, delay=5, backoff=1.0) | ||
| @task | ||
| def wait_pods_appear(args, ctx): | ||
| """Wait for llm-d pods to appear""" | ||
|
|
||
| pods = oc_get_json( | ||
| "pods", namespace=args.namespace, selector=ctx.selector, ignore_not_found=True | ||
| ) | ||
| # Use service name and selector from context (if set) or create from args | ||
| service_name = getattr(ctx, "service_name", args.inference_service_name) | ||
| selector = getattr(ctx, "selector", f"app.kubernetes.io/name={service_name}") | ||
|
|
||
| pods = oc_get_json("pods", namespace=args.namespace, selector=selector, ignore_not_found=True) | ||
| if pods and pods.get("items"): | ||
| return f"Pods appeared for {ctx.service_name}" | ||
| return f"Pods appeared for {service_name}" | ||
| return False # Retry |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Restore a longer pod-appearance window.
Line 137 now caps wait_pods_appear at about 60 seconds (12 * 5s). Slow KServe/LLMInferenceService reconciliation can fail here before the later readiness wait runs, turning a valid slow deployment into a hard failure. Consider restoring the previous longer window or making this timeout configurable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/kserve/toolbox/deploy_llmisvc/main.py` around lines 136 - 149, The
pod-appearance wait in wait_pods_appear was shortened too much, which can cause
slow KServe/LLMInferenceService reconciliations to fail before readiness checks
run. Update wait_pods_appear in deploy_llmisvc/main.py to restore a longer wait
window or make the retry duration configurable, keeping the existing
service_name, selector, and oc_get_json flow intact.
| result = oc( | ||
| "describe", | ||
| "llminferenceservice", | ||
| service_name, | ||
| "-n", | ||
| args.namespace, | ||
| log_stdout=False, | ||
| check=False, | ||
| ) | ||
|
|
||
| llmisv_desc_path = artifacts_dir / "llmisv_description.txt" | ||
| with open(llmisv_desc_path, "w", encoding="utf-8") as f: | ||
| f.write(result.stdout) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Include stderr when diagnostic oc calls fail.
These calls use check=False, but the artifacts only write stdout. If oc describe / oc get replicaset fails, the saved file can be empty or say “No replicasets found” while the real cause is in stderr.
Proposed diagnostic artifact fix
- with open(llmisv_desc_path, "w", encoding="utf-8") as f:
- f.write(result.stdout)
+ with open(llmisv_desc_path, "w", encoding="utf-8") as f:
+ f.write(result.stdout)
+ if result.stderr:
+ f.write("\n\n--- stderr ---\n")
+ f.write(result.stderr)- else:
+ elif rs_result.returncode == 0:
f.write("No replicasets found for the service")
+ else:
+ f.write("Failed to list replicasets for the service")
+ if rs_result.stderr:
+ f.write("\n\n--- stderr ---\n")
+ f.write(rs_result.stderr)Also applies to: 201-236
🧰 Tools
🪛 ast-grep (0.44.0)
[warning] 176-176: File path is request-/variable-derived; validate and normalize to prevent path traversal.
Context: open(llmisv_desc_path, "w", encoding="utf-8")
Note: [CWE-22] Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal').
(open-filename-from-request)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/kserve/toolbox/deploy_llmisvc/main.py` around lines 166 - 178, The
diagnostic artifact writing for the `oc` calls currently saves only `stdout`, so
failures from `describe` and `get replicaset` can lose the real error details.
Update the artifact generation around the `oc(...)` result handling in `main.py`
so the saved files include `stderr` whenever `check=False` is used, and make
sure this is applied to both the `llminferenceservice` describe path and the
replicaset diagnostic path. Use the existing `result` object and artifact writes
like `llmisv_desc_path` to locate and append the error output alongside the
normal output.
| def pre_cleanup(ctx) -> int: | ||
| """Cleanup phase - Clean up resources and finalize.""" | ||
| # Cleanup doesn't typically need vaults, but initialize resolve-only for consistency | ||
| # no vault needed | ||
| from projects.llm_d.orchestration import runtime_config | ||
|
|
||
| for run_spec in runtime_config.get_run_specs(): | ||
| with runtime_config.activate_run_spec(run_spec): | ||
| cleanup_toolbox_run(namespace=run_spec.namespace) | ||
| cleanup_toolbox_run(namespace=run_spec.namespace, cleanup_subscriptions=True) | ||
| return 0 |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect pipeline ordering and cleanup_operators to confirm pre-cleanup impact
fd -t f 'pipeline-full.yaml' --exec cat {}
rg -nP -C5 'def cleanup_operators' projects/llm_d/orchestration/cleanup_phase.pyRepository: openshift-psap/forge
Length of output: 4598
pre_cleanup forces a full uninstall/reinstall of non-preserved operators (e.g., rhods-operator) on every pipeline run.
Verification confirms that cleanup.preserve_operators in config.yaml lists only gpu-operator-certified, nfd, and openshift-cert-manager-operator. Consequently, rhods-operator, rhcl-operator, and leader-worker-set are not preserved.
Because the pipeline-full.yaml executes pre-cleanup immediately before prepare:
pre_cleanupinvokescleanup_operators()withcleanup_subscriptions=True.- This deletes the subscriptions/CSVs for
rhods-operatorand other non-preserved operators. - The subsequent
preparestep reinstalls them.
This introduces unnecessary churn, increases pipeline runtime, and risks reliability by forcing the teardown and re-provision of heavy operators like rhods-operator for every run.
Recommended Actions:
- Relocate Cleanup: Move full subscription cleanup logic to
post_cleanupto avoid destroying operators required duringprepareandtest. - Expand Preservation: If pre-run cleanup is strictly necessary, update
cleanup.preserve_operatorsinconfig.yamlto includerhods-operator,rhcl-operator, andleader-worker-set.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/llm_d/orchestration/ci.py` around lines 89 - 96, The pre_cleanup()
step is deleting non-preserved operator subscriptions before prepare(), which
forces rhods-operator and related operators to be torn down and reinstalled on
every run. Update the cleanup flow in pre_cleanup() so full cleanup with
cleanup_subscriptions=True is moved to post_cleanup(), or adjust
runtime_config/config.yaml preserve settings to include rhods-operator,
rhcl-operator, and leader-worker-set if they must survive pre-run cleanup. Make
sure the fix is applied in the pre_cleanup and cleanup_toolbox_run path so
prepare() no longer triggers unnecessary operator churn.
| cleanup_config = config.project.get_config("cleanup", {}) | ||
|
|
||
| logger.info("Starting operator cleanup (subscriptions, CSVs, InstallPlans, OperatorGroups)") | ||
|
|
||
| # Identify operators to delete based on configuration | ||
| operators_to_delete = _identify_operators_to_delete(platform, cleanup_config) | ||
|
|
||
| if not operators_to_delete: | ||
| logger.info("No operators to clean up") | ||
| return | ||
|
|
||
| logger.info(f"Found {len(operators_to_delete)} operators to clean up") | ||
|
|
||
| cleanup_operators_command.run( | ||
| operators=operators_to_delete, | ||
| dry_run=dry_run, | ||
| ) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Fail closed before deleting platform operators.
With cleanup_config or preserve_operators missing, this treats every platform operator as deletable. Since CI now calls cleanup with cleanup_subscriptions=True, a missing/new preserve entry can uninstall shared cluster operators unexpectedly. Default to preserving unless deletion is explicit, or require an explicit cleanup allowlist before calling cleanup_operators_command.run.
Safer default option
- preserve_operators = cleanup_config.get("preserve_operators", {})
+ preserve_operators = cleanup_config.get("preserve_operators")
+ if preserve_operators is None:
+ logger.warning("cleanup.preserve_operators is missing; preserving all operators")
+ return []
@@
- if preserve_operators.get(package_name, False):
+ if preserve_operators.get(package_name, True):
logger.info(f"Preserving operator: {package_name}")
continueAlso applies to: 80-111
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/llm_d/orchestration/cleanup_phase.py` around lines 58 - 74, The
cleanup flow in cleanup_phase.py is currently fail-open: when cleanup_config or
preserve_operators is missing, _identify_operators_to_delete can mark all
platform operators for deletion before cleanup_operators_command.run is called.
Update the cleanup logic to default to preserving operators unless an explicit
allowlist/opt-in is present, and make _identify_operators_to_delete (and any
callers in this phase) return no deletions when the preserve list is absent or
incomplete. Ensure the guard happens before invoking
cleanup_operators_command.run so platform operators are not removed
accidentally.
|
|
||
| agentic: | ||
| enabled: true | ||
| model_key: qwen-3-6-35b |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find where model keys are defined/consumed to confirm qwen-3-6-35b is valid
rg -nP --type=yaml 'qwen[\w-]*'
rg -nP -C3 'model_key' projects/core/agentic projects/llm_dRepository: openshift-psap/forge
Length of output: 158
🏁 Script executed:
#!/bin/bash
# Search for model definitions and registry files
rg -t yaml -t json -i 'model_key|model_id|model_registry' --type-add 'yaml:*.yaml' --type-add 'json:*.json' .
# Explicit search for qwen string
rg -i 'qwen' .
# Search for files named models or registry
find . -name "*model*" -o -name "*registry*" | head -20Repository: openshift-psap/forge
Length of output: 15824
The model_key value qwen-3-6-35b is invalid and will fail model resolution.
This identifier does not match any model key defined in projects/rhaiis/orchestration/config.d/models.yaml (which lists qwen3-0_6b, qwen3-30b-a3b, qwen3-235b-instruct, etc.). The value appears to be a typo for qwen3-0-6b (corresponding to Qwen/Qwen3-0.6B), which is the model currently used in projects/llm_d/orchestration/config.d/runtime.yaml and the test suite.
Additionally, this value is hardcoded in projects/core/agentic/config_review/__init__.py as MODEL_KEY. Update this default to a valid key like qwen3-0-6b to prevent resolution failures in the agentic flow.
Valid Qwen model keys in registry
qwen3-0_6b:
hf_model_id: Qwen/Qwen3-0.6B
qwen25-7b-instruct:
hf_model_id: Qwen/Qwen2.5-7B-Instruct
...🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/llm_d/orchestration/config.yaml` at line 71, The hardcoded model_key
value is invalid and will break model resolution; update the config entry in the
orchestration config to a valid registry key such as qwen3-0_6b/qwen3-0-6b, and
make sure the default MODEL_KEY in config_review/__init__.py matches the same
valid identifier used by runtime.yaml and the test suite.
|
/test fournos llm_d |
|
🔴 Executing of 'llm_d test' failed after 00 hours 06 minutes 16 seconds 🔴 • Link to the test results. • Caliper postprocess completed but no reports generated. Test configuration: TEST DESCRIPTION:
FAILURE REVIEW 002 run smoke request:
|
|
🔴 Executing of 'fournos_launcher submit' failed after 00 hours 10 minutes 38 seconds 🔴 • Link to the test results. • FOURNOS Job Status:
• MLflow Tracking: • FOURNOS Job Details: • Task Logs:
Test configuration: |
|
@kpouget: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary by CodeRabbit
New Features
Bug Fixes
Configuration