Skip to content

Add TNF two-node fencing diagnostics toolset#367

Open
dhensel-rh wants to merge 1 commit into
openshift:mainfrom
dhensel-rh:OCPEDGE-2742shiftweek
Open

Add TNF two-node fencing diagnostics toolset#367
dhensel-rh wants to merge 1 commit into
openshift:mainfrom
dhensel-rh:OCPEDGE-2742shiftweek

Conversation

@dhensel-rh

@dhensel-rh dhensel-rh commented Jun 29, 2026

Copy link
Copy Markdown

Summary

Extends the OpenShift MCP server with a new tnf toolset that brings two-node fencing (TNF) diagnostics into the MCP tool/prompt/resource model. TNF clusters rely on out-of-band STONITH fencing to prevent split-brain, but diagnosing fencing failures today requires manually cross-referencing Kubernetes node health, BMC credentials, hardware state, and RHEL HA fencing agent configuration — this toolset correlates all of those into conversational diagnostics that surface split-brain risks before they become outages.

Ref: OCPEDGE-2742

Scope

This toolset is diagnostic-only. It collects and reports cluster state — it does
not modify fencing configuration, trigger remediation, or alter any cluster resources.

The one caveat: tnf_check_stonith_status creates a temporary privileged debug pod
on a control-plane node to run pcs status. This follows the same pattern as
cluster-diagnostics/nodesdebug — it's a read mechanism, not a write operation
against fencing config. Cleanup is handled by the nodesdebug package's deferred
teardown; verify behavior on abnormal exit (process crash, client disconnect) if
cleanup guarantees are important for review.

Future work may add remediation actions (e.g., restarting a fencing agent, unfencing
a node), but those would be a separate toolset or an explicit opt-in — not added to
these diagnostic tools.

What's included

Tools

  • tnf_check_fencing_config — validates fencing configuration including BareMetalHost resources, BMC credentials, cluster operator health, machine/node/BMH correlation, FenceAgentsRemediation templates, and NodeHealthCheck config
  • tnf_check_stonith_status — creates a temporary privileged debug pod on a control-plane node (via nodesdebug) to run pcs diagnostic commands. Returns pacemaker cluster state, STONITH device configuration, quorum status, and fencing history. The debug pod is automatically cleaned up after execution.

Prompts

  • tnf-troubleshoot — comprehensive fencing troubleshooting guide that collects cluster topology, node health, operator status, BMH state, remediation config, and STONITH diagnostics. Includes API-down detection: when the cluster API is unreachable, returns an out-of-band recovery guide with BMC/IPMI procedures instead of failing.

Resources

  • tnf://domain-knowledge/fencing — static reference for two-node fencing domain knowledge: corosync/pacemaker quorum rules, split-brain risk assessment matrix, and common recovery procedures. Extracted from the prompt to keep responses focused while making domain knowledge available on demand.

Design notes

  • See Scope above for the diagnostic-only contract.
  • All operations are read-only. tnf_check_stonith_status creates a temporary debug pod (same pattern as cluster-diagnostics/nodesdebug), but does not modify user-visible cluster state.
  • BMC credentials are validated for presence but never included in output — passwords are not leaked to the LLM.
  • Gracefully degrades when optional CRDs (FenceAgentsRemediation, NodeHealthCheck) are not installed.
  • Test fakes use hand-built CoreV1Interface implementations (interface embedding) instead of k8s.io/client-go/kubernetes/fake, avoiding ~280 vendored files that would otherwise bloat the PR.

Test plan

  • go build ./... passes
  • go vet ./pkg/toolsets/tnf/... passes
  • go test ./pkg/toolsets/tnf/... passes (tnf + fencing packages)
  • go test ./pkg/toolsets/... passes (includes global resource URI uniqueness checks)
  • Tested against a live 2-node bare metal TNF cluster
  • Verified API-down detection against an unreachable cluster (SSH tunnel to disconnected environment)
  • Verified graceful degradation when FenceAgentsRemediation/NodeHealthCheck CRDs are absent
  • No vendor files added — find vendor -path '*/kubernetes/fake*' | wc -l returns 0

🤖 Generated with Claude Code

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Added TNF (Two-Node Fencing) diagnostics for fencing health and STONITH status, plus a full tnf-troubleshoot walkthrough prompt.
    • Included TNF domain-knowledge content with step-by-step recovery guidance and structured Markdown output.
  • Bug Fixes

    • Enhanced behavior and messaging when the Kubernetes API is unreachable.
    • Added stronger input validation and clearer reporting when expected cluster data isn’t available.
  • Documentation

    • Added a TNF toolset guide page covering tools, arguments, outputs, and prerequisites.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds the TNF toolset, fencing diagnostics, STONITH status reporting, a troubleshooting prompt, and a fencing domain-knowledge MCP resource, plus docs and tests for the new flows.

Changes

TNF toolset and MCP wiring

Layer / File(s) Summary
Toolset wiring and docs
pkg/mcp/modules.go, pkg/toolsets/tnf/toolset.go, pkg/toolsets/tnf/mcp_resources.go, pkg/toolsets/tnf/mcp_resources_test.go, docs/README.md, docs/TNF.md
Registers the TNF toolset, exposes its prompt and domain-knowledge resource, and adds the TNF documentation and resource checks.
API reachability guidance
pkg/toolsets/tnf/fencing/api_check.go, pkg/toolsets/tnf/fencing/api_check_test.go
Adds API-unreachable detection, connection-error classification, and the unreachable-cluster guidance text with tests.
Fencing health report
pkg/toolsets/tnf/fencing/fencing.go, pkg/toolsets/tnf/fencing/fencing_test.go
Implements the fencing health tool with topology, operator, machine, BareMetalHost, remediation, and node-health reporting plus validation coverage.
STONITH diagnostics
pkg/toolsets/tnf/fencing/stonith.go, pkg/toolsets/tnf/fencing/stonith_test.go
Adds the STONITH status tool, diagnostics execution, PCS report parsing, section rendering, and helper tests.
Troubleshooting prompt
pkg/toolsets/tnf/troubleshoot.go, pkg/toolsets/tnf/troubleshoot_test.go
Builds the TNF troubleshooting prompt flow and validates the collected cluster data helpers and prompt registration.

Estimated code review effort: 4 (Complex) | ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding the TNF two-node fencing diagnostics toolset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot requested review from Kaustubh-pande and manusa June 29, 2026 18:08
@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dhensel-rh
Once this PR has been reviewed and has the lgtm label, please assign cali0707 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/toolsets/tnf/fencing/fencing_test.go`:
- Around line 22-32: The test is calling the private checkFencingHealth handler
directly, so it does not verify the exported tool wiring. Update
TestInvalidNamespaceType in FencingHandlerSuite to exercise the public
entrypoint for fencing health instead of invoking checkFencingHealth, using the
exported tool/API surface that routes ToolHandlerParams and the invalid
namespace args. Keep the same invalid namespace assertion, but ensure the test
covers the public registration/path rather than internal implementation details.

In `@pkg/toolsets/tnf/fencing/fencing.go`:
- Around line 153-155: The `unstructured.Nested*` calls in `fencing.go` are
discarding parse/type errors, so malformed CRs are being treated as missing data
instead of producing a real failure. Update the affected code paths in
`section`, `issues`, and any related helpers that use
`unstructured.NestedString`, `NestedMap`, `NestedSlice`, or similar to capture
and check the returned error, then propagate it into the section output or
append a diagnostic to `issues` rather than ignoring it. Use the existing
symbols like `platform`, `infraTopology`, `cpTopology`, and the surrounding
fencing/validation functions to keep the handling consistent across all listed
call sites.
- Around line 391-397: The node health/reporting path is using the BareMetalHost
name instead of the correlated Node name, which can produce false missing-node
messages for healthy hosts. Update the fencing flow in the code around
checkNodeHealth and the report generation to pass and use the Node name derived
from Machine.status.nodeRef.name (the real Node identifier) rather than hostName
whenever looking up the Node or formatting node-related status, including the
later report section that applies the same logic.
- Around line 146-150: The current Get/List error handling in the fencing checks
treats every failure as “not available” or “CRD not installed,” which hides real
client/apiserver failures. Update the affected branches around the
Infrastructure CR and the other listed Resource/List calls to explicitly detect
NotFound and NoMatch cases as the graceful optional-API path, but for any other
error (for example Forbidden or timeout) record it as a real failure in the
markdown summary instead of returning a generic unavailable message. Keep the
existing slog debug logging, and ensure the report.WriteString output
distinguishes optional absence from actual errors.
- Around line 596-600: The credential validation in fencing should reject empty
values as well as missing keys, because the current checks in the secret
validation logic only verify presence in the secret data map. Update the
validation around the secret.Data["username"] and secret.Data["password"] checks
so it also flags zero-length byte slices as invalid, and keep the issue messages
in the same validation path used by the fencing secret validation routine.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c0376530-b72a-4482-9345-24190bf67feb

📥 Commits

Reviewing files that changed from the base of the PR and between dc9c67b and 1c9c04e.

📒 Files selected for processing (4)
  • pkg/mcp/modules.go
  • pkg/toolsets/tnf/fencing/fencing.go
  • pkg/toolsets/tnf/fencing/fencing_test.go
  • pkg/toolsets/tnf/toolset.go

Comment thread pkg/toolsets/tnf/fencing/fencing_test.go
Comment thread pkg/toolsets/tnf/fencing/fencing.go
Comment thread pkg/toolsets/tnf/fencing/fencing.go Outdated
Comment thread pkg/toolsets/tnf/fencing/fencing.go Outdated
Comment thread pkg/toolsets/tnf/fencing/fencing.go Outdated
@dhensel-rh dhensel-rh force-pushed the OCPEDGE-2742shiftweek branch from 78fb74a to ba9a3a6 Compare June 30, 2026 20:31

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (6)
pkg/toolsets/tnf/fencing/api_check.go (1)

27-44: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Redundant *net.OpError check.

net.OpError implements the net.Error interface (Timeout()/Temporary()), so any error matching errors.As(err, &netErr) at line 29 already covers *net.OpError instances. The block at lines 32-35 is dead code.

♻️ Proposed simplification
 func isConnectionError(err error) bool {
 	var netErr net.Error
 	if errors.As(err, &netErr) {
 		return true
 	}
-	var opErr *net.OpError
-	if errors.As(err, &opErr) {
-		return true
-	}
 	msg := err.Error()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/toolsets/tnf/fencing/api_check.go` around lines 27 - 44, The
isConnectionError helper has a redundant *net.OpError branch because
errors.As(err, &netErr) in isConnectionError already matches net.Error
implementations, including net.OpError. Remove the separate opErr check and keep
the existing net.Error handling plus the string fallback so the logic stays
equivalent but simpler.
pkg/toolsets/tnf/fencing/api_check_test.go (2)

92-106: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Multiple assertions in a single test instead of one-assertion-per-subtest.

TestAPIUnreachableGuide bundles 11 assert.Contains checks into one test rather than nested s.Run() subtests with descriptive scenario names and one assertion each.

As per path instructions, "Use nested subtests with s.Run() to organize related test cases with descriptive scenario names" and "Each s.Run() block should ideally test one specific behavior with one assertion per test case."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/toolsets/tnf/fencing/api_check_test.go` around lines 92 - 106,
TestAPIUnreachableGuide currently groups many assert.Contains checks into one
test; split these into nested s.Run subtests with descriptive names and one
assertion per subtest. Update the test around APIUnreachableGuide to organize
the existing checks (API Unreachable, Scenario 1/2/3, BMC, ssh core@, journalctl
-u agent, rpm-ostree status, pcs status, etcdctl, tnf_check_fencing_health) into
separate subtests so each behavior is isolated and easier to read.

Source: Path instructions


12-86: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoff

Test accesses a private function and doesn't use testify/suite.

isConnectionError is unexported; testing it directly makes this a white-box test of an internal function. Guidelines require black-box testing of the public API and organizing tests with testify/suite/suite.Run(). Consider testing this behavior indirectly through IsAPIUnreachable (the public entry point) and wrapping the suite in a suite.Suite.

As per path instructions, "Use testify/suite for organizing tests into suites with test suites that embed suite.Suite, and execute them using suite.Run()" and "Test the public API only - tests should be black-box and not access internal/private functions."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/toolsets/tnf/fencing/api_check_test.go` around lines 12 - 86, The test
currently targets the internal helper isConnectionError directly, which should
be covered only through the public API. Move these cases into a testify suite by
creating a suite that embeds suite.Suite and running it with suite.Run(), then
exercise the same error scenarios via IsAPIUnreachable instead of calling the
private function. Keep the existing table-driven cases, but assert behavior
through the exported entry point so the tests remain black-box.

Source: Path instructions

pkg/toolsets/tnf/troubleshoot.go (2)

256-261: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Control-plane count only matches the master role label.

Same concern as DetectControlPlaneNode in stonith.go: counting only node-role.kubernetes.io/master will under-report (or zero out) control-plane node counts on clusters that only carry node-role.kubernetes.io/control-plane, skewing the "TNF Profile" detection.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/toolsets/tnf/troubleshoot.go` around lines 256 - 261, The control-plane
counting logic in DetectTNFProfile only checks the
node-role.kubernetes.io/master label, so it misses clusters that use
node-role.kubernetes.io/control-plane. Update the node label check in this loop
to count either role label, consistent with DetectControlPlaneNode in
stonith.go, so cpCount reflects all control-plane nodes regardless of label
variant.

231-273: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Node/control-plane counts are lost when the Infrastructure CR is unavailable.

fetchClusterTopology returns immediately at line 239 if the Infrastructure CR Get fails, skipping the (independent) node-listing block entirely. On a non-OpenShift cluster (no Infrastructure CR) the report would show no node-count data at all, even though it's obtainable via coreClient.Nodes().List.

♻️ Proposed fix
 	infra, err := dynamicClient.Resource(fencing.InfrastructureGVR).Get(ctx, "cluster", metav1.GetOptions{})
-	if err != nil {
-		slog.Debug("could not get Infrastructure CR", "error", err)
-		result.WriteString("*Infrastructure CR not available*\n")
-		return result.String()
-	}
-
-	platform, _, _ := unstructured.NestedString(infra.Object, "status", "platform")
-	infraTopology, _, _ := unstructured.NestedString(infra.Object, "status", "infrastructureTopology")
-	cpTopology, _, _ := unstructured.NestedString(infra.Object, "status", "controlPlaneTopology")
-
-	fmt.Fprintf(&result, "- **Platform:** %s\n", fencing.ValueOrNA(platform))
-	fmt.Fprintf(&result, "- **Infrastructure Topology:** %s\n", fencing.ValueOrNA(infraTopology))
-	fmt.Fprintf(&result, "- **Control Plane Topology:** %s\n", fencing.ValueOrNA(cpTopology))
+	var platform string
+	if err != nil {
+		slog.Debug("could not get Infrastructure CR", "error", err)
+		result.WriteString("*Infrastructure CR not available*\n")
+	} else {
+		var infraTopology, cpTopology string
+		platform, _, _ = unstructured.NestedString(infra.Object, "status", "platform")
+		infraTopology, _, _ = unstructured.NestedString(infra.Object, "status", "infrastructureTopology")
+		cpTopology, _, _ = unstructured.NestedString(infra.Object, "status", "controlPlaneTopology")
+
+		fmt.Fprintf(&result, "- **Platform:** %s\n", fencing.ValueOrNA(platform))
+		fmt.Fprintf(&result, "- **Infrastructure Topology:** %s\n", fencing.ValueOrNA(infraTopology))
+		fmt.Fprintf(&result, "- **Control Plane Topology:** %s\n", fencing.ValueOrNA(cpTopology))
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/toolsets/tnf/troubleshoot.go` around lines 231 - 273,
fetchClusterTopology currently returns early when
dynamicClient.Resource(fencing.InfrastructureGVR).Get fails, which prevents the
independent node-count logic from running. Change the flow so the infrastructure
lookup only gates the platform/topology fields, then always continue to
coreClient.Nodes().List and the control-plane counting logic in
fetchClusterTopology. Keep the existing logging for the Infrastructure CR error,
but avoid returning before the node summary is appended to the result.
pkg/toolsets/tnf/troubleshoot_test.go (1)

30-83: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Hand-rolled CoreV1 fakes instead of the real client-go fake clientset.

fakeCoreV1/fakeNodes/fakeSecrets reimplement a subset of CoreV1Interface by hand (e.g. fakeNodes.List ignores LabelSelector/FieldSelector entirely). k8s.io/client-go/kubernetes/fake already provides a fully-functional fake CoreV1Interface backed by an object tracker (as is already done for the dynamic client via dynamic/fake), which avoids re-implementing matching logic that can silently diverge from real client behavior.

As per coding guidelines: "Use real implementations and integration testing where possible - avoid mocks."

♻️ Suggested approach
import (
	clientgofake "k8s.io/client-go/kubernetes/fake"
)

func newFakeCoreV1(objects ...runtime.Object) corev1client.CoreV1Interface {
	return clientgofake.NewSimpleClientset(objects...).CoreV1()
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/toolsets/tnf/troubleshoot_test.go` around lines 30 - 83, The CoreV1 test
doubles in fakeCoreV1, fakeNodes, fakeSecrets, and newFakeCoreV1 are
hand-rolling client-go behavior and can diverge from real selector and lookup
semantics. Replace this custom CoreV1Interface implementation with the real
kubernetes/fake clientset in newFakeCoreV1, returning its CoreV1() so the tests
use the object tracker-backed fake instead of bespoke matching logic.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/toolsets/tnf/fencing/stonith_test.go`:
- Around line 1-180: The STONITH suite is testing package-private helpers
directly instead of the public API, so update the tests to exercise the exported
entrypoint from the fencing package (for example, InitSTONITH and its Handler)
rather than calling checkSTONITHStatus, parseSections, buildSTONITHReport, or
extractBracketedList. Refactor the table-like scenarios in STONITHHandlerSuite
to use nested s.Run subtests with descriptive names and keep one assertion focus
per scenario, while preserving the existing coverage through the public surface.

In `@pkg/toolsets/tnf/fencing/stonith.go`:
- Around line 100-103: The error path in checkSTONITHStatus is dropping the
partial diagnostic text returned by RunSTONITHDiagnostics. Update the error
handling in checkSTONITHStatus to pass the report value into
api.NewToolCallResult instead of an empty string when err is non-nil, so the
"Partial output" built by RunSTONITHDiagnostics is preserved for the caller.
Also ensure the execErr branch in RunSTONITHDiagnostics continues returning that
diagnostic string as the first result.
- Around line 66-77: The STONITH tool annotations are mislabeling a
side-effecting operation as read-only. Update the tool definition in the STONITH
status registration so the api.ToolAnnotations for checkSTONITHStatus sets
ReadOnlyHint to false while keeping the other hints as appropriate, since the
handler creates a privileged debug pod and executes chroot /host on a
control-plane node.
- Around line 158-169: DetectControlPlaneNode and the related control-plane
detection/counting logic in pkg/toolsets/tnf/troubleshoot.go and
pkg/toolsets/tnf/fencing/fencing.go only look for
node-role.kubernetes.io/master, so they miss clusters that use
node-role.kubernetes.io/control-plane. Update the node selection/counting paths
to recognize both labels, ideally by sharing the same label-matching logic used
by DetectControlPlaneNode so auto-detection and TNF/STONITH control-plane checks
work on either label.

In `@pkg/toolsets/tnf/mcp_resources_test.go`:
- Around line 1-37: The tests in TestInitMCPResources and
TestFencingDomainKnowledgeResource are using bare testing functions and calling
the unexported initMCPResources helper directly, which breaks the suite-based
black-box test pattern. Refactor this file to use a testify/suite test suite
like TNFTroubleshootSuite, run it with suite.Run(), and exercise the MCP
resource behavior only through the public API instead of reaching into internal
package functions. Keep the existing assertions, but move them into suite
methods and validate the same resource contract via exported entry points.

In `@pkg/toolsets/tnf/troubleshoot_test.go`:
- Around line 109-362: The troubleshooting tests are white-boxing internal
helpers instead of validating the public prompt surface. Update the suite to
exercise the registered tnf-troubleshoot prompt handler and assert its output,
rather than calling fetchClusterTopology, fetchNodeHealth, fetchOperatorHealth,
fetchBMHStatus, fetchRemediationStatus, and checkCredentialSecret directly. Keep
only public API usage in these tests, similar to how TestValueOrNA uses
fencing.ValueOrNA.

In `@pkg/toolsets/tnf/troubleshoot.go`:
- Around line 242-244: The unstructured field reads in fetchClusterTopology,
fetchOperatorHealth, fetchBMHStatus, and fetchRemediationStatus are discarding
errors from NestedString, NestedBool, and NestedSlice, so type mismatches are
hidden as “N/A” with no diagnostics. Update each call site to capture and handle
the err (and found where relevant) instead of using `_`, and emit a slog.Debug
message consistent with the existing logging patterns in this file when a field
cannot be read. Use the existing helper functions and symbols in these methods
to centralize the handling so every Nested* access reports malformed CRD data
instead of failing silently.

---

Nitpick comments:
In `@pkg/toolsets/tnf/fencing/api_check_test.go`:
- Around line 92-106: TestAPIUnreachableGuide currently groups many
assert.Contains checks into one test; split these into nested s.Run subtests
with descriptive names and one assertion per subtest. Update the test around
APIUnreachableGuide to organize the existing checks (API Unreachable, Scenario
1/2/3, BMC, ssh core@, journalctl -u agent, rpm-ostree status, pcs status,
etcdctl, tnf_check_fencing_health) into separate subtests so each behavior is
isolated and easier to read.
- Around line 12-86: The test currently targets the internal helper
isConnectionError directly, which should be covered only through the public API.
Move these cases into a testify suite by creating a suite that embeds
suite.Suite and running it with suite.Run(), then exercise the same error
scenarios via IsAPIUnreachable instead of calling the private function. Keep the
existing table-driven cases, but assert behavior through the exported entry
point so the tests remain black-box.

In `@pkg/toolsets/tnf/fencing/api_check.go`:
- Around line 27-44: The isConnectionError helper has a redundant *net.OpError
branch because errors.As(err, &netErr) in isConnectionError already matches
net.Error implementations, including net.OpError. Remove the separate opErr
check and keep the existing net.Error handling plus the string fallback so the
logic stays equivalent but simpler.

In `@pkg/toolsets/tnf/troubleshoot_test.go`:
- Around line 30-83: The CoreV1 test doubles in fakeCoreV1, fakeNodes,
fakeSecrets, and newFakeCoreV1 are hand-rolling client-go behavior and can
diverge from real selector and lookup semantics. Replace this custom
CoreV1Interface implementation with the real kubernetes/fake clientset in
newFakeCoreV1, returning its CoreV1() so the tests use the object tracker-backed
fake instead of bespoke matching logic.

In `@pkg/toolsets/tnf/troubleshoot.go`:
- Around line 256-261: The control-plane counting logic in DetectTNFProfile only
checks the node-role.kubernetes.io/master label, so it misses clusters that use
node-role.kubernetes.io/control-plane. Update the node label check in this loop
to count either role label, consistent with DetectControlPlaneNode in
stonith.go, so cpCount reflects all control-plane nodes regardless of label
variant.
- Around line 231-273: fetchClusterTopology currently returns early when
dynamicClient.Resource(fencing.InfrastructureGVR).Get fails, which prevents the
independent node-count logic from running. Change the flow so the infrastructure
lookup only gates the platform/topology fields, then always continue to
coreClient.Nodes().List and the control-plane counting logic in
fetchClusterTopology. Keep the existing logging for the Infrastructure CR error,
but avoid returning before the node summary is appended to the result.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 08c2306b-c417-45e4-ad39-798cd02a0d2d

📥 Commits

Reviewing files that changed from the base of the PR and between 1c9c04e and ba9a3a6.

📒 Files selected for processing (12)
  • pkg/mcp/modules.go
  • pkg/toolsets/tnf/fencing/api_check.go
  • pkg/toolsets/tnf/fencing/api_check_test.go
  • pkg/toolsets/tnf/fencing/fencing.go
  • pkg/toolsets/tnf/fencing/fencing_test.go
  • pkg/toolsets/tnf/fencing/stonith.go
  • pkg/toolsets/tnf/fencing/stonith_test.go
  • pkg/toolsets/tnf/mcp_resources.go
  • pkg/toolsets/tnf/mcp_resources_test.go
  • pkg/toolsets/tnf/toolset.go
  • pkg/toolsets/tnf/troubleshoot.go
  • pkg/toolsets/tnf/troubleshoot_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/mcp/modules.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/toolsets/tnf/fencing/fencing_test.go
  • pkg/toolsets/tnf/fencing/fencing.go

Comment thread pkg/toolsets/tnf/fencing/stonith_test.go
Comment thread pkg/toolsets/tnf/fencing/stonith.go
Comment thread pkg/toolsets/tnf/fencing/stonith.go
Comment thread pkg/toolsets/tnf/fencing/stonith.go
Comment thread pkg/toolsets/tnf/mcp_resources_test.go
Comment thread pkg/toolsets/tnf/troubleshoot_test.go
Comment thread pkg/toolsets/tnf/troubleshoot.go
@dhensel-rh dhensel-rh force-pushed the OCPEDGE-2742shiftweek branch from ba9a3a6 to e0c110e Compare July 1, 2026 01:26
@dhensel-rh dhensel-rh changed the title Add TNF fencing diagnostics toolset Add TNF two-node fencing diagnostics toolset Jul 1, 2026
@dhensel-rh dhensel-rh force-pushed the OCPEDGE-2742shiftweek branch 3 times, most recently from c5e222d to 456f800 Compare July 1, 2026 17:49

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/TNF.md`:
- Around line 110-114: Update the “API must be reachable” limitation in TNF.md
to reflect the actual fallback behavior: when the Kubernetes API is unavailable
during a fencing event, the tools should be described as returning the
out-of-band recovery guide from the tnf-troubleshoot prompt rather than implying
a hard timeout. Keep the wording aligned with the existing limitations list and
reference the Kubernetes API and tnf-troubleshoot behavior explicitly.
- Line 1: The documentation file name uses uppercase letters, but new docs under
docs should use lowercase filenames. Rename the TNF markdown document to tnf.md
and update any references that point to the old TNF.md path so the docs match
the naming convention.
- Around line 7-13: The docs summary for the fencing tool is using the wrong
exported name and an incomplete description. Update the
`tnf_check_fencing_health` heading to the actual tool name
`tnf_check_fencing_config`, and revise the summary/arguments in `docs/TNF.md` to
reflect the full scope handled by that tool, including operator health,
machine/node/BMH correlation, FenceAgentsRemediation, and NodeHealthCheck
checks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fcc12327-84ae-4185-becf-a821e0970e2e

📥 Commits

Reviewing files that changed from the base of the PR and between c5e222d and 7d07b0a.

📒 Files selected for processing (14)
  • docs/README.md
  • docs/TNF.md
  • pkg/mcp/modules.go
  • pkg/toolsets/tnf/fencing/api_check.go
  • pkg/toolsets/tnf/fencing/api_check_test.go
  • pkg/toolsets/tnf/fencing/fencing.go
  • pkg/toolsets/tnf/fencing/fencing_test.go
  • pkg/toolsets/tnf/fencing/stonith.go
  • pkg/toolsets/tnf/fencing/stonith_test.go
  • pkg/toolsets/tnf/mcp_resources.go
  • pkg/toolsets/tnf/mcp_resources_test.go
  • pkg/toolsets/tnf/toolset.go
  • pkg/toolsets/tnf/troubleshoot.go
  • pkg/toolsets/tnf/troubleshoot_test.go
✅ Files skipped from review due to trivial changes (2)
  • pkg/mcp/modules.go
  • docs/README.md
🚧 Files skipped from review as they are similar to previous changes (11)
  • pkg/toolsets/tnf/fencing/api_check_test.go
  • pkg/toolsets/tnf/mcp_resources_test.go
  • pkg/toolsets/tnf/fencing/fencing_test.go
  • pkg/toolsets/tnf/mcp_resources.go
  • pkg/toolsets/tnf/fencing/api_check.go
  • pkg/toolsets/tnf/fencing/stonith_test.go
  • pkg/toolsets/tnf/toolset.go
  • pkg/toolsets/tnf/troubleshoot_test.go
  • pkg/toolsets/tnf/troubleshoot.go
  • pkg/toolsets/tnf/fencing/stonith.go
  • pkg/toolsets/tnf/fencing/fencing.go

Comment thread docs/tnf.md
@@ -0,0 +1,114 @@
# TNF (Two-Node Fencing) Support

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Rename this doc to lowercase.

docs/**/*.md requires lowercase filenames for new documentation, so docs/TNF.md should be docs/tnf.md. As per path instructions, docs/**/*.md: Use lowercase filenames for new documentation files (e.g., configuration.md, prompts.md).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/TNF.md` at line 1, The documentation file name uses uppercase letters,
but new docs under docs should use lowercase filenames. Rename the TNF markdown
document to tnf.md and update any references that point to the old TNF.md path
so the docs match the naming convention.

Source: Path instructions

Comment thread docs/TNF.md Outdated
Comment thread docs/tnf.md
Extends the OpenShift MCP server with a new `tnf` toolset for
Two-Node Fencing (TNF) cluster diagnostics.

Tools:
- tnf_check_fencing_config: validates cluster topology, operator health,
  Machine/Node/BMH correlation, BMC credentials, FenceAgentsRemediation,
  and NodeHealthCheck configuration
- tnf_check_stonith_status: runs pcs diagnostics via debug pod to check
  pacemaker cluster state, STONITH devices, quorum, and fencing history

Prompt:
- tnf-troubleshoot: structured troubleshooting workflow collecting all
  diagnostic data with API-down fallback to out-of-band recovery guide

Resource:
- tnf://domain-knowledge/fencing: domain knowledge covering quorum rules,
  split-brain risk matrix, and recovery procedures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dhensel-rh dhensel-rh force-pushed the OCPEDGE-2742shiftweek branch from 7d07b0a to 3046a4b Compare July 2, 2026 01:42
@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Copy link
Copy Markdown

@dhensel-rh: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant