Skip to content

Conversation

@ropatil010
Copy link

Hi Team,

Can you PTAL on this.

PR about:
Implements comprehensive cluster health check tool that examines:

  • Cluster operators (OpenShift)
  • Nodes (readiness, schedulability, resource pressure)
  • Pods (failures, crash loops, image pull errors, high restarts)
  • Workload controllers (Deployments, StatefulSets, DaemonSets)
  • Storage (PVC status)
  • Recent warning events

Features:

  • Text and JSON output formats
  • Verbose mode for detailed diagnostics
  • Configurable event checking
  • Clear severity indicators (critical/warning/healthy)

@ropatil010
Copy link
Author

Output format:

===============================================
Cluster Health Check Report

Cluster Type: OpenShift
Cluster Version: 4.20.0
Check Time: 2025-11-03T17:33:34Z

Checking Cluster Operators...
✅ All cluster operators healthy (34/34)
Checking Node Health...
✅ All nodes healthy (6)
Checking Pod Health...
❌ CRITICAL: 1 pod(s) in failed/pending state

  • openshift-xxxx/nodename [Failed]
    Checking Workload Controllers...
    ✅ All deployments healthy
    ✅ All statefulsets healthy
    ✅ All daemonsets healthy
    Checking Storage...
    ✅ All PVCs bound
    Checking Recent Events...
    ✅ No recent warning events

===============================================
Summary

Critical Issues: 1
Warnings: 0

❌ Cluster has CRITICAL issues requiring immediate attention

@ropatil010 ropatil010 force-pushed the cluster-health-check branch 2 times, most recently from 2e02dc9 to 767c1db Compare November 4, 2025 05:23
@Cali0707 Cali0707 requested a review from manusa November 4, 2025 15:39
@manusa
Copy link
Member

manusa commented Nov 5, 2025

I'm not really sure about this one, it seems to be quite opinionated.
Given the raw tools, the LLMs should be able to perform the operations themselves.

In https://github.com/Flux159/mcp-server-kubernetes they're implementing similar functionality via a prompt. Which doesn't seem like a bad idea.

Similarly, in https://github.com/GoogleCloudPlatform/kubectl-ai (AFAIU) they're leveraging the system prompt for this purpose: https://github.com/GoogleCloudPlatform/kubectl-ai/blob/main/pkg/agent/systemprompt_template_default.txt

IMO if we want to proceed with this feature we should either:

  1. Reimplement it as a prompt for the core toolset.
  2. If we really want this as a tool, add it to a specific opt-in toolset diagnostics or something similar.

This is also a good case to test evals and see if they can be used to make a better decision on how to implement this feature.

@ropatil010 ropatil010 force-pushed the cluster-health-check branch 2 times, most recently from 5fd5987 to afb7d63 Compare November 5, 2025 16:03
@ropatil010 ropatil010 changed the title feat(core): add cluster health check tool prompt(core): add cluster health check Nov 7, 2025
@ropatil010
Copy link
Author

Hi @manusa/@matzew Updated as per suggestions.
Can you PTAL when ever get a chance. Thanks in adv.!

Copy link
Member

@manusa manusa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I managed to break down prompt support into:

This PR is blocked by #556 that should be taken care of quickly (since there's already a PR that partially takes care of it #510)

Once that's merged we can rebase this one and properly review.

@nader-ziada
Copy link
Collaborator

nader-ziada commented Dec 12, 2025

@ropatil010 would be you be able to rebase this PR? thanks

@ropatil010 ropatil010 force-pushed the cluster-health-check branch 2 times, most recently from 6a8f524 to 17bda66 Compare December 15, 2025 11:26
@ropatil010 ropatil010 changed the title prompt(core): add cluster health check [WIP] prompt(core): add cluster health check Dec 15, 2025
@ropatil010 ropatil010 changed the title [WIP] prompt(core): add cluster health check prompt(core): add cluster health check Dec 15, 2025
@ropatil010
Copy link
Author

Hi @nader-ziada , i have rebase the PR and updated the files accordingly by using the existing function definitions.
Can you PTAL?

@nader-ziada
Copy link
Collaborator

Hi @nader-ziada , i have rebase the PR and updated the files accordingly by using the existing function definitions. Can you PTAL?

sorry, you will have to rebase again, but should be a small conflict this time, please see comment, thanks

Signed-off-by: Rohit Patil <ropatil@redhat.com>
Signed-off-by: Rohit Patil <ropatil@redhat.com>
@nader-ziada
Copy link
Collaborator

thanks @ropatil010

@nader-ziada
Copy link
Collaborator

@ropatil010 The TestNotifiesToolsChangeMultipleTimes is failing, check for a fix for a similar issue in TestNotifiesToolsChange

@nader-ziada
Copy link
Collaborator

@manusa any comments about this PR?

Signed-off-by: Rohit Patil <ropatil@redhat.com>
@ropatil010
Copy link
Author

Hi @manusa/@nader-ziada Fixed as per suggestions. Can you PTAL when ever get a chance?
Thanks in advance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants