Add TNF two-node fencing diagnostics toolset#367
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds the TNF toolset, fencing diagnostics, STONITH status reporting, a troubleshooting prompt, and a fencing domain-knowledge MCP resource, plus docs and tests for the new flows. ChangesTNF toolset and MCP wiring
Estimated code review effort: 4 (Complex) | ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: dhensel-rh The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/toolsets/tnf/fencing/fencing_test.go`:
- Around line 22-32: The test is calling the private checkFencingHealth handler
directly, so it does not verify the exported tool wiring. Update
TestInvalidNamespaceType in FencingHandlerSuite to exercise the public
entrypoint for fencing health instead of invoking checkFencingHealth, using the
exported tool/API surface that routes ToolHandlerParams and the invalid
namespace args. Keep the same invalid namespace assertion, but ensure the test
covers the public registration/path rather than internal implementation details.
In `@pkg/toolsets/tnf/fencing/fencing.go`:
- Around line 153-155: The `unstructured.Nested*` calls in `fencing.go` are
discarding parse/type errors, so malformed CRs are being treated as missing data
instead of producing a real failure. Update the affected code paths in
`section`, `issues`, and any related helpers that use
`unstructured.NestedString`, `NestedMap`, `NestedSlice`, or similar to capture
and check the returned error, then propagate it into the section output or
append a diagnostic to `issues` rather than ignoring it. Use the existing
symbols like `platform`, `infraTopology`, `cpTopology`, and the surrounding
fencing/validation functions to keep the handling consistent across all listed
call sites.
- Around line 391-397: The node health/reporting path is using the BareMetalHost
name instead of the correlated Node name, which can produce false missing-node
messages for healthy hosts. Update the fencing flow in the code around
checkNodeHealth and the report generation to pass and use the Node name derived
from Machine.status.nodeRef.name (the real Node identifier) rather than hostName
whenever looking up the Node or formatting node-related status, including the
later report section that applies the same logic.
- Around line 146-150: The current Get/List error handling in the fencing checks
treats every failure as “not available” or “CRD not installed,” which hides real
client/apiserver failures. Update the affected branches around the
Infrastructure CR and the other listed Resource/List calls to explicitly detect
NotFound and NoMatch cases as the graceful optional-API path, but for any other
error (for example Forbidden or timeout) record it as a real failure in the
markdown summary instead of returning a generic unavailable message. Keep the
existing slog debug logging, and ensure the report.WriteString output
distinguishes optional absence from actual errors.
- Around line 596-600: The credential validation in fencing should reject empty
values as well as missing keys, because the current checks in the secret
validation logic only verify presence in the secret data map. Update the
validation around the secret.Data["username"] and secret.Data["password"] checks
so it also flags zero-length byte slices as invalid, and keep the issue messages
in the same validation path used by the fencing secret validation routine.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c0376530-b72a-4482-9345-24190bf67feb
📒 Files selected for processing (4)
pkg/mcp/modules.gopkg/toolsets/tnf/fencing/fencing.gopkg/toolsets/tnf/fencing/fencing_test.gopkg/toolsets/tnf/toolset.go
78fb74a to
ba9a3a6
Compare
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (6)
pkg/toolsets/tnf/fencing/api_check.go (1)
27-44: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueRedundant
*net.OpErrorcheck.
net.OpErrorimplements thenet.Errorinterface (Timeout()/Temporary()), so any error matchingerrors.As(err, &netErr)at line 29 already covers*net.OpErrorinstances. The block at lines 32-35 is dead code.♻️ Proposed simplification
func isConnectionError(err error) bool { var netErr net.Error if errors.As(err, &netErr) { return true } - var opErr *net.OpError - if errors.As(err, &opErr) { - return true - } msg := err.Error()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/tnf/fencing/api_check.go` around lines 27 - 44, The isConnectionError helper has a redundant *net.OpError branch because errors.As(err, &netErr) in isConnectionError already matches net.Error implementations, including net.OpError. Remove the separate opErr check and keep the existing net.Error handling plus the string fallback so the logic stays equivalent but simpler.pkg/toolsets/tnf/fencing/api_check_test.go (2)
92-106: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueMultiple assertions in a single test instead of one-assertion-per-subtest.
TestAPIUnreachableGuidebundles 11assert.Containschecks into one test rather than nesteds.Run()subtests with descriptive scenario names and one assertion each.As per path instructions, "Use nested subtests with
s.Run()to organize related test cases with descriptive scenario names" and "Eachs.Run()block should ideally test one specific behavior with one assertion per test case."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/tnf/fencing/api_check_test.go` around lines 92 - 106, TestAPIUnreachableGuide currently groups many assert.Contains checks into one test; split these into nested s.Run subtests with descriptive names and one assertion per subtest. Update the test around APIUnreachableGuide to organize the existing checks (API Unreachable, Scenario 1/2/3, BMC, ssh core@, journalctl -u agent, rpm-ostree status, pcs status, etcdctl, tnf_check_fencing_health) into separate subtests so each behavior is isolated and easier to read.Source: Path instructions
12-86: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoffTest accesses a private function and doesn't use
testify/suite.
isConnectionErroris unexported; testing it directly makes this a white-box test of an internal function. Guidelines require black-box testing of the public API and organizing tests withtestify/suite/suite.Run(). Consider testing this behavior indirectly throughIsAPIUnreachable(the public entry point) and wrapping the suite in asuite.Suite.As per path instructions, "Use
testify/suitefor organizing tests into suites with test suites that embedsuite.Suite, and execute them usingsuite.Run()" and "Test the public API only - tests should be black-box and not access internal/private functions."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/tnf/fencing/api_check_test.go` around lines 12 - 86, The test currently targets the internal helper isConnectionError directly, which should be covered only through the public API. Move these cases into a testify suite by creating a suite that embeds suite.Suite and running it with suite.Run(), then exercise the same error scenarios via IsAPIUnreachable instead of calling the private function. Keep the existing table-driven cases, but assert behavior through the exported entry point so the tests remain black-box.Source: Path instructions
pkg/toolsets/tnf/troubleshoot.go (2)
256-261: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winControl-plane count only matches the
masterrole label.Same concern as
DetectControlPlaneNodeinstonith.go: counting onlynode-role.kubernetes.io/masterwill under-report (or zero out) control-plane node counts on clusters that only carrynode-role.kubernetes.io/control-plane, skewing the "TNF Profile" detection.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/tnf/troubleshoot.go` around lines 256 - 261, The control-plane counting logic in DetectTNFProfile only checks the node-role.kubernetes.io/master label, so it misses clusters that use node-role.kubernetes.io/control-plane. Update the node label check in this loop to count either role label, consistent with DetectControlPlaneNode in stonith.go, so cpCount reflects all control-plane nodes regardless of label variant.
231-273: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winNode/control-plane counts are lost when the Infrastructure CR is unavailable.
fetchClusterTopologyreturns immediately at line 239 if the Infrastructure CRGetfails, skipping the (independent) node-listing block entirely. On a non-OpenShift cluster (no Infrastructure CR) the report would show no node-count data at all, even though it's obtainable viacoreClient.Nodes().List.♻️ Proposed fix
infra, err := dynamicClient.Resource(fencing.InfrastructureGVR).Get(ctx, "cluster", metav1.GetOptions{}) - if err != nil { - slog.Debug("could not get Infrastructure CR", "error", err) - result.WriteString("*Infrastructure CR not available*\n") - return result.String() - } - - platform, _, _ := unstructured.NestedString(infra.Object, "status", "platform") - infraTopology, _, _ := unstructured.NestedString(infra.Object, "status", "infrastructureTopology") - cpTopology, _, _ := unstructured.NestedString(infra.Object, "status", "controlPlaneTopology") - - fmt.Fprintf(&result, "- **Platform:** %s\n", fencing.ValueOrNA(platform)) - fmt.Fprintf(&result, "- **Infrastructure Topology:** %s\n", fencing.ValueOrNA(infraTopology)) - fmt.Fprintf(&result, "- **Control Plane Topology:** %s\n", fencing.ValueOrNA(cpTopology)) + var platform string + if err != nil { + slog.Debug("could not get Infrastructure CR", "error", err) + result.WriteString("*Infrastructure CR not available*\n") + } else { + var infraTopology, cpTopology string + platform, _, _ = unstructured.NestedString(infra.Object, "status", "platform") + infraTopology, _, _ = unstructured.NestedString(infra.Object, "status", "infrastructureTopology") + cpTopology, _, _ = unstructured.NestedString(infra.Object, "status", "controlPlaneTopology") + + fmt.Fprintf(&result, "- **Platform:** %s\n", fencing.ValueOrNA(platform)) + fmt.Fprintf(&result, "- **Infrastructure Topology:** %s\n", fencing.ValueOrNA(infraTopology)) + fmt.Fprintf(&result, "- **Control Plane Topology:** %s\n", fencing.ValueOrNA(cpTopology)) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/tnf/troubleshoot.go` around lines 231 - 273, fetchClusterTopology currently returns early when dynamicClient.Resource(fencing.InfrastructureGVR).Get fails, which prevents the independent node-count logic from running. Change the flow so the infrastructure lookup only gates the platform/topology fields, then always continue to coreClient.Nodes().List and the control-plane counting logic in fetchClusterTopology. Keep the existing logging for the Infrastructure CR error, but avoid returning before the node summary is appended to the result.pkg/toolsets/tnf/troubleshoot_test.go (1)
30-83: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winHand-rolled CoreV1 fakes instead of the real client-go fake clientset.
fakeCoreV1/fakeNodes/fakeSecretsreimplement a subset ofCoreV1Interfaceby hand (e.g.fakeNodes.ListignoresLabelSelector/FieldSelectorentirely).k8s.io/client-go/kubernetes/fakealready provides a fully-functional fakeCoreV1Interfacebacked by an object tracker (as is already done for the dynamic client viadynamic/fake), which avoids re-implementing matching logic that can silently diverge from real client behavior.As per coding guidelines: "Use real implementations and integration testing where possible - avoid mocks."
♻️ Suggested approach
import ( clientgofake "k8s.io/client-go/kubernetes/fake" ) func newFakeCoreV1(objects ...runtime.Object) corev1client.CoreV1Interface { return clientgofake.NewSimpleClientset(objects...).CoreV1() }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/tnf/troubleshoot_test.go` around lines 30 - 83, The CoreV1 test doubles in fakeCoreV1, fakeNodes, fakeSecrets, and newFakeCoreV1 are hand-rolling client-go behavior and can diverge from real selector and lookup semantics. Replace this custom CoreV1Interface implementation with the real kubernetes/fake clientset in newFakeCoreV1, returning its CoreV1() so the tests use the object tracker-backed fake instead of bespoke matching logic.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/toolsets/tnf/fencing/stonith_test.go`:
- Around line 1-180: The STONITH suite is testing package-private helpers
directly instead of the public API, so update the tests to exercise the exported
entrypoint from the fencing package (for example, InitSTONITH and its Handler)
rather than calling checkSTONITHStatus, parseSections, buildSTONITHReport, or
extractBracketedList. Refactor the table-like scenarios in STONITHHandlerSuite
to use nested s.Run subtests with descriptive names and keep one assertion focus
per scenario, while preserving the existing coverage through the public surface.
In `@pkg/toolsets/tnf/fencing/stonith.go`:
- Around line 100-103: The error path in checkSTONITHStatus is dropping the
partial diagnostic text returned by RunSTONITHDiagnostics. Update the error
handling in checkSTONITHStatus to pass the report value into
api.NewToolCallResult instead of an empty string when err is non-nil, so the
"Partial output" built by RunSTONITHDiagnostics is preserved for the caller.
Also ensure the execErr branch in RunSTONITHDiagnostics continues returning that
diagnostic string as the first result.
- Around line 66-77: The STONITH tool annotations are mislabeling a
side-effecting operation as read-only. Update the tool definition in the STONITH
status registration so the api.ToolAnnotations for checkSTONITHStatus sets
ReadOnlyHint to false while keeping the other hints as appropriate, since the
handler creates a privileged debug pod and executes chroot /host on a
control-plane node.
- Around line 158-169: DetectControlPlaneNode and the related control-plane
detection/counting logic in pkg/toolsets/tnf/troubleshoot.go and
pkg/toolsets/tnf/fencing/fencing.go only look for
node-role.kubernetes.io/master, so they miss clusters that use
node-role.kubernetes.io/control-plane. Update the node selection/counting paths
to recognize both labels, ideally by sharing the same label-matching logic used
by DetectControlPlaneNode so auto-detection and TNF/STONITH control-plane checks
work on either label.
In `@pkg/toolsets/tnf/mcp_resources_test.go`:
- Around line 1-37: The tests in TestInitMCPResources and
TestFencingDomainKnowledgeResource are using bare testing functions and calling
the unexported initMCPResources helper directly, which breaks the suite-based
black-box test pattern. Refactor this file to use a testify/suite test suite
like TNFTroubleshootSuite, run it with suite.Run(), and exercise the MCP
resource behavior only through the public API instead of reaching into internal
package functions. Keep the existing assertions, but move them into suite
methods and validate the same resource contract via exported entry points.
In `@pkg/toolsets/tnf/troubleshoot_test.go`:
- Around line 109-362: The troubleshooting tests are white-boxing internal
helpers instead of validating the public prompt surface. Update the suite to
exercise the registered tnf-troubleshoot prompt handler and assert its output,
rather than calling fetchClusterTopology, fetchNodeHealth, fetchOperatorHealth,
fetchBMHStatus, fetchRemediationStatus, and checkCredentialSecret directly. Keep
only public API usage in these tests, similar to how TestValueOrNA uses
fencing.ValueOrNA.
In `@pkg/toolsets/tnf/troubleshoot.go`:
- Around line 242-244: The unstructured field reads in fetchClusterTopology,
fetchOperatorHealth, fetchBMHStatus, and fetchRemediationStatus are discarding
errors from NestedString, NestedBool, and NestedSlice, so type mismatches are
hidden as “N/A” with no diagnostics. Update each call site to capture and handle
the err (and found where relevant) instead of using `_`, and emit a slog.Debug
message consistent with the existing logging patterns in this file when a field
cannot be read. Use the existing helper functions and symbols in these methods
to centralize the handling so every Nested* access reports malformed CRD data
instead of failing silently.
---
Nitpick comments:
In `@pkg/toolsets/tnf/fencing/api_check_test.go`:
- Around line 92-106: TestAPIUnreachableGuide currently groups many
assert.Contains checks into one test; split these into nested s.Run subtests
with descriptive names and one assertion per subtest. Update the test around
APIUnreachableGuide to organize the existing checks (API Unreachable, Scenario
1/2/3, BMC, ssh core@, journalctl -u agent, rpm-ostree status, pcs status,
etcdctl, tnf_check_fencing_health) into separate subtests so each behavior is
isolated and easier to read.
- Around line 12-86: The test currently targets the internal helper
isConnectionError directly, which should be covered only through the public API.
Move these cases into a testify suite by creating a suite that embeds
suite.Suite and running it with suite.Run(), then exercise the same error
scenarios via IsAPIUnreachable instead of calling the private function. Keep the
existing table-driven cases, but assert behavior through the exported entry
point so the tests remain black-box.
In `@pkg/toolsets/tnf/fencing/api_check.go`:
- Around line 27-44: The isConnectionError helper has a redundant *net.OpError
branch because errors.As(err, &netErr) in isConnectionError already matches
net.Error implementations, including net.OpError. Remove the separate opErr
check and keep the existing net.Error handling plus the string fallback so the
logic stays equivalent but simpler.
In `@pkg/toolsets/tnf/troubleshoot_test.go`:
- Around line 30-83: The CoreV1 test doubles in fakeCoreV1, fakeNodes,
fakeSecrets, and newFakeCoreV1 are hand-rolling client-go behavior and can
diverge from real selector and lookup semantics. Replace this custom
CoreV1Interface implementation with the real kubernetes/fake clientset in
newFakeCoreV1, returning its CoreV1() so the tests use the object tracker-backed
fake instead of bespoke matching logic.
In `@pkg/toolsets/tnf/troubleshoot.go`:
- Around line 256-261: The control-plane counting logic in DetectTNFProfile only
checks the node-role.kubernetes.io/master label, so it misses clusters that use
node-role.kubernetes.io/control-plane. Update the node label check in this loop
to count either role label, consistent with DetectControlPlaneNode in
stonith.go, so cpCount reflects all control-plane nodes regardless of label
variant.
- Around line 231-273: fetchClusterTopology currently returns early when
dynamicClient.Resource(fencing.InfrastructureGVR).Get fails, which prevents the
independent node-count logic from running. Change the flow so the infrastructure
lookup only gates the platform/topology fields, then always continue to
coreClient.Nodes().List and the control-plane counting logic in
fetchClusterTopology. Keep the existing logging for the Infrastructure CR error,
but avoid returning before the node summary is appended to the result.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 08c2306b-c417-45e4-ad39-798cd02a0d2d
📒 Files selected for processing (12)
pkg/mcp/modules.gopkg/toolsets/tnf/fencing/api_check.gopkg/toolsets/tnf/fencing/api_check_test.gopkg/toolsets/tnf/fencing/fencing.gopkg/toolsets/tnf/fencing/fencing_test.gopkg/toolsets/tnf/fencing/stonith.gopkg/toolsets/tnf/fencing/stonith_test.gopkg/toolsets/tnf/mcp_resources.gopkg/toolsets/tnf/mcp_resources_test.gopkg/toolsets/tnf/toolset.gopkg/toolsets/tnf/troubleshoot.gopkg/toolsets/tnf/troubleshoot_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/mcp/modules.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/toolsets/tnf/fencing/fencing_test.go
- pkg/toolsets/tnf/fencing/fencing.go
ba9a3a6 to
e0c110e
Compare
c5e222d to
456f800
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/TNF.md`:
- Around line 110-114: Update the “API must be reachable” limitation in TNF.md
to reflect the actual fallback behavior: when the Kubernetes API is unavailable
during a fencing event, the tools should be described as returning the
out-of-band recovery guide from the tnf-troubleshoot prompt rather than implying
a hard timeout. Keep the wording aligned with the existing limitations list and
reference the Kubernetes API and tnf-troubleshoot behavior explicitly.
- Line 1: The documentation file name uses uppercase letters, but new docs under
docs should use lowercase filenames. Rename the TNF markdown document to tnf.md
and update any references that point to the old TNF.md path so the docs match
the naming convention.
- Around line 7-13: The docs summary for the fencing tool is using the wrong
exported name and an incomplete description. Update the
`tnf_check_fencing_health` heading to the actual tool name
`tnf_check_fencing_config`, and revise the summary/arguments in `docs/TNF.md` to
reflect the full scope handled by that tool, including operator health,
machine/node/BMH correlation, FenceAgentsRemediation, and NodeHealthCheck
checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fcc12327-84ae-4185-becf-a821e0970e2e
📒 Files selected for processing (14)
docs/README.mddocs/TNF.mdpkg/mcp/modules.gopkg/toolsets/tnf/fencing/api_check.gopkg/toolsets/tnf/fencing/api_check_test.gopkg/toolsets/tnf/fencing/fencing.gopkg/toolsets/tnf/fencing/fencing_test.gopkg/toolsets/tnf/fencing/stonith.gopkg/toolsets/tnf/fencing/stonith_test.gopkg/toolsets/tnf/mcp_resources.gopkg/toolsets/tnf/mcp_resources_test.gopkg/toolsets/tnf/toolset.gopkg/toolsets/tnf/troubleshoot.gopkg/toolsets/tnf/troubleshoot_test.go
✅ Files skipped from review due to trivial changes (2)
- pkg/mcp/modules.go
- docs/README.md
🚧 Files skipped from review as they are similar to previous changes (11)
- pkg/toolsets/tnf/fencing/api_check_test.go
- pkg/toolsets/tnf/mcp_resources_test.go
- pkg/toolsets/tnf/fencing/fencing_test.go
- pkg/toolsets/tnf/mcp_resources.go
- pkg/toolsets/tnf/fencing/api_check.go
- pkg/toolsets/tnf/fencing/stonith_test.go
- pkg/toolsets/tnf/toolset.go
- pkg/toolsets/tnf/troubleshoot_test.go
- pkg/toolsets/tnf/troubleshoot.go
- pkg/toolsets/tnf/fencing/stonith.go
- pkg/toolsets/tnf/fencing/fencing.go
| @@ -0,0 +1,114 @@ | |||
| # TNF (Two-Node Fencing) Support | |||
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Rename this doc to lowercase.
docs/**/*.md requires lowercase filenames for new documentation, so docs/TNF.md should be docs/tnf.md. As per path instructions, docs/**/*.md: Use lowercase filenames for new documentation files (e.g., configuration.md, prompts.md).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/TNF.md` at line 1, The documentation file name uses uppercase letters,
but new docs under docs should use lowercase filenames. Rename the TNF markdown
document to tnf.md and update any references that point to the old TNF.md path
so the docs match the naming convention.
Source: Path instructions
Extends the OpenShift MCP server with a new `tnf` toolset for Two-Node Fencing (TNF) cluster diagnostics. Tools: - tnf_check_fencing_config: validates cluster topology, operator health, Machine/Node/BMH correlation, BMC credentials, FenceAgentsRemediation, and NodeHealthCheck configuration - tnf_check_stonith_status: runs pcs diagnostics via debug pod to check pacemaker cluster state, STONITH devices, quorum, and fencing history Prompt: - tnf-troubleshoot: structured troubleshooting workflow collecting all diagnostic data with API-down fallback to out-of-band recovery guide Resource: - tnf://domain-knowledge/fencing: domain knowledge covering quorum rules, split-brain risk matrix, and recovery procedures Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
7d07b0a to
3046a4b
Compare
|
@dhensel-rh: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
Extends the OpenShift MCP server with a new
tnftoolset that brings two-node fencing (TNF) diagnostics into the MCP tool/prompt/resource model. TNF clusters rely on out-of-band STONITH fencing to prevent split-brain, but diagnosing fencing failures today requires manually cross-referencing Kubernetes node health, BMC credentials, hardware state, and RHEL HA fencing agent configuration — this toolset correlates all of those into conversational diagnostics that surface split-brain risks before they become outages.Ref: OCPEDGE-2742
Scope
This toolset is diagnostic-only. It collects and reports cluster state — it does
not modify fencing configuration, trigger remediation, or alter any cluster resources.
The one caveat:
tnf_check_stonith_statuscreates a temporary privileged debug podon a control-plane node to run
pcs status. This follows the same pattern ascluster-diagnostics/nodesdebug— it's a read mechanism, not a write operationagainst fencing config. Cleanup is handled by the
nodesdebugpackage's deferredteardown; verify behavior on abnormal exit (process crash, client disconnect) if
cleanup guarantees are important for review.
Future work may add remediation actions (e.g., restarting a fencing agent, unfencing
a node), but those would be a separate toolset or an explicit opt-in — not added to
these diagnostic tools.
What's included
Tools
tnf_check_fencing_config— validates fencing configuration including BareMetalHost resources, BMC credentials, cluster operator health, machine/node/BMH correlation, FenceAgentsRemediation templates, and NodeHealthCheck configtnf_check_stonith_status— creates a temporary privileged debug pod on a control-plane node (vianodesdebug) to runpcsdiagnostic commands. Returns pacemaker cluster state, STONITH device configuration, quorum status, and fencing history. The debug pod is automatically cleaned up after execution.Prompts
tnf-troubleshoot— comprehensive fencing troubleshooting guide that collects cluster topology, node health, operator status, BMH state, remediation config, and STONITH diagnostics. Includes API-down detection: when the cluster API is unreachable, returns an out-of-band recovery guide with BMC/IPMI procedures instead of failing.Resources
tnf://domain-knowledge/fencing— static reference for two-node fencing domain knowledge: corosync/pacemaker quorum rules, split-brain risk assessment matrix, and common recovery procedures. Extracted from the prompt to keep responses focused while making domain knowledge available on demand.Design notes
tnf_check_stonith_statuscreates a temporary debug pod (same pattern ascluster-diagnostics/nodesdebug), but does not modify user-visible cluster state.CoreV1Interfaceimplementations (interface embedding) instead ofk8s.io/client-go/kubernetes/fake, avoiding ~280 vendored files that would otherwise bloat the PR.Test plan
go build ./...passesgo vet ./pkg/toolsets/tnf/...passesgo test ./pkg/toolsets/tnf/...passes (tnf + fencing packages)go test ./pkg/toolsets/...passes (includes global resource URI uniqueness checks)find vendor -path '*/kubernetes/fake*' | wc -lreturns 0🤖 Generated with Claude Code
Summary by CodeRabbit
Summary by CodeRabbit
New Features
tnf-troubleshootwalkthrough prompt.Bug Fixes
Documentation