# DevOps MCP

> Unified MCP server for Kubernetes, ArgoCD, Prometheus, and PagerDuty

- **Type:** MCP server
- **Install:** `agentstack add mcp-notharshhaa-devops-mcp`
- **Verified:** Pending review
- **Seller:** [NotHarshhaa](https://agentstack.voostack.com/s/notharshhaa)
- **Installs:** 0
- **Latest version:** 1.0.1
- **License:** MIT
- **Upstream author:** [NotHarshhaa](https://github.com/NotHarshhaa)
- **Source:** https://github.com/NotHarshhaa/devops-mcp

## Install

```sh
agentstack add mcp-notharshhaa-devops-mcp
```

Requires the [AgentStack CLI](https://agentstack.voostack.com/docs/cli). Works with Claude Code, Cursor, and any MCP-compatible agent.

## About

# devops-mcp

> Unified MCP server for DevOps engineers — query and manage Kubernetes, ArgoCD, Prometheus, and PagerDuty from any MCP-compatible AI agent.

---

## What is this?

`devops-mcp` is an open-source [Model Context Protocol](https://modelcontextprotocol.io) server that gives AI agents such as Claude real-time read and write access to your infrastructure stack, all from a single install.

Instead of copy-pasting `kubectl` output into a chat window, you can ask:

> *"Why is the payments deployment in CrashLoopBackOff?"*
> *"What changed in the last ArgoCD sync for the auth app?"*
> *"Show me the p99 latency for the API gateway over the last hour."*
> *"Who's on call right now and what incidents are open?"*
> *"Debug the payments service - what's wrong with it?"*

...and get live answers, sourced directly from your cluster and tooling.

**Providers included:**

| Prefix | Provider | Transport |
|---|---|---|
| `k8s__*` | Kubernetes (via kubeconfig or in-cluster SA) | client-go |
| `argo__*` | ArgoCD | REST API |
| `prom__*` | Prometheus | HTTP API (PromQL) |
| `pd__*` | PagerDuty | REST API v2 |
| `helm__*` | Helm | CLI (helm binary) |
| `devops__*` | Cross-provider incident debugging | Aggregates all providers |
| `logs__*` | Loki | HTTP API (LogQL) |

---

## Quick start

### Claude Desktop (stdio — recommended)

Add this to `~/.config/claude/claude_desktop_config.json` (macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "devops": {
      "command": "npx",
      "args": ["-y", "@notharshhaa/devops-mcp@latest"],
      "env": {
        "KUBECONFIG": "/home/you/.kube/config",
        "ARGOCD_SERVER": "https://argocd.company.com",
        "ARGOCD_TOKEN": "your-argocd-token",
        "PROMETHEUS_URL": "http://prometheus.monitoring:9090",
        "PAGERDUTY_TOKEN": "your-pd-api-token",
        "LOKI_URL": "http://loki.monitoring:3100",
        "LOKI_TOKEN": "your-loki-token"
      }
    }
  }
}
```

Restart Claude Desktop. The `devops` server will appear in the tools list.

### Claude Code (CLI)

```bash
claude mcp add devops-mcp -e KUBECONFIG=$HOME/.kube/config \
  -e ARGOCD_SERVER=https://argocd.company.com \
  -e ARGOCD_TOKEN=... \
  -e PROMETHEUS_URL=http://prometheus:9090 \
  -e PAGERDUTY_TOKEN=... \
  -e LOKI_URL=http://loki.monitoring:3100 \
  -e LOKI_TOKEN=... \
  -- npx -y @notharshhaa/devops-mcp@latest
```

### Local dev / test

```bash
npx @notharshhaa/devops-mcp
# or clone and run:
git clone https://github.com/NotHarshhaa/devops-mcp
cd devops-mcp
npm install
cp .env.example .env   # fill in your values
npm run dev
```

---

## Configuration

All config is via environment variables. Only set the ones for providers you actually use — providers with missing config are silently skipped.

```env
# ── Kubernetes ────────────────────────────────────────────────
KUBECONFIG=/home/user/.kube/config       # omit to use in-cluster service account
K8S_CONTEXT=my-prod-context              # optional: pin a specific context
K8S_ALLOWED_NAMESPACES=default,backend   # optional: restrict namespace access

# ── ArgoCD ───────────────────────────────────────────────────
ARGOCD_SERVER=https://argocd.company.com
ARGOCD_TOKEN=eyJhbGci...                 # argocd account generate-token

# ── Prometheus ───────────────────────────────────────────────
PROMETHEUS_URL=http://prometheus:9090
PROMETHEUS_BEARER_TOKEN=                 # optional: for authenticated Prometheus

# ── PagerDuty ────────────────────────────────────────────────
PAGERDUTY_TOKEN=your-api-v2-token

# ── Loki ───────────────────────────────────────────────────
LOKI_URL=http://loki.monitoring:3100
LOKI_TOKEN=your-loki-token

# ── Transport ────────────────────────────────────────────────
# For stdio mode (default): no transport config needed
# For SSE mode: set these env vars
PORT=3000                                # SSE mode only
MCP_AUTH_TOKEN=shared-secret            # Bearer token for SSE authentication

# ── Safety ───────────────────────────────────────────────────
DEVOPS_MCP_DRY_RUN=false                # true = block all mutations globally
DEVOPS_MCP_AUDIT_LOG=/var/log/devops-mcp-audit.jsonl
```
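
To make the "providers with missing config are silently skipped" behavior concrete, here is a minimal TypeScript sketch of how a server could decide which providers to enable. It is not taken from the `devops-mcp` source: the function names and the mapping of required variables to providers are assumptions for illustration. Kubernetes is left out because omitting `KUBECONFIG` falls back to the in-cluster service account.

```typescript
// Hypothetical sketch: names and the required-variable mapping are assumptions.
type Env = Record<string, string | undefined>;

// Assumed: the variables each provider needs before it is enabled.
const REQUIRED_VARS: Record<string, string[]> = {
  argo: ["ARGOCD_SERVER", "ARGOCD_TOKEN"],
  prom: ["PROMETHEUS_URL"],
  pd: ["PAGERDUTY_TOKEN"],
  logs: ["LOKI_URL"],
};

// Return the provider prefixes whose required variables are all non-empty.
function enabledProviders(env: Env): string[] {
  return Object.keys(REQUIRED_VARS).filter((provider) =>
    REQUIRED_VARS[provider].every((name) => (env[name] ?? "").trim() !== "")
  );
}

// Parse K8S_ALLOWED_NAMESPACES; null means no restriction is configured.
function allowedNamespaces(env: Env): string[] | null {
  const raw = env["K8S_ALLOWED_NAMESPACES"];
  if (raw === undefined || raw.trim() === "") return null;
  return raw.split(",").map((ns) => ns.trim()).filter((ns) => ns !== "");
}
```

Passing the environment in as a plain object (for example `enabledProviders(process.env)` under Node.js) keeps the logic easy to test without touching real configuration.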

---

## Tool reference

All tools follow a three-tier safety model:

- **Read** — safe, no side effects, no confirmation needed
- **Mutate** — defaults to `dry_run: true`; set `dry_run: false` to execute
- **Destructive** — requires `confirm: true` as an explicit parameter
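
The three tiers above can be sketched as a single gate function. This is an illustrative TypeScript sketch under stated assumptions, not the project's implementation: the names `decide`, `SafetyParams`, and `Decision` are hypothetical, and it assumes that `DEVOPS_MCP_DRY_RUN=true` downgrades every mutation to a dry run and that a confirmed destructive call executes directly.

```typescript
// Hypothetical sketch of the three-tier gate; all names are illustrative.
type Tier = "read" | "mutate" | "destructive";

interface SafetyParams {
  dry_run?: boolean; // mutate tier: treated as true when omitted
  confirm?: boolean; // destructive tier: must be explicitly true
}

type Decision = "execute" | "dry-run" | "rejected";

function decide(tier: Tier, params: SafetyParams, globalDryRun: boolean): Decision {
  // Reads have no side effects, so they always run.
  if (tier === "read") return "execute";
  // Destructive calls without an explicit confirm are refused outright.
  if (tier === "destructive" && params.confirm !== true) return "rejected";
  // Assumption: the global switch overrides any per-call setting.
  if (globalDryRun) return "dry-run";
  // Mutations stay in dry-run mode unless the caller opts out.
  if (tier === "mutate" && (params.dry_run ?? true)) return "dry-run";
  return "execute";
}
```

For example, `decide("mutate", {}, false)` yields `"dry-run"`, while `decide("mutate", { dry_run: false }, false)` yields `"execute"`.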

### Kubernetes (`k8s__*`)

| Tool | Tier | Description |
|---|---|---|
| `k8s__list_pods` | read | List pods with status, restarts, node, age |
| `k8s__get_pod_logs` | read | Tail or stream logs from a pod container |
| `k8s__describe_resource` | read | Full describe for any resource type |
| `k8s__get_events` | read | Cluster or namespace events, filterable by reason |
| `k8s__list_deployments` | read | Deployments with replica counts and rollout health |
| `k8s__get_resource_usage` | read | CPU/mem usage per pod via metrics-server |
| `k8s__get_node_status` | read | Node health, conditions, capacity, allocatable resources, taints |
| `k8s__get_network_policies` | read | Network policies with pod selectors and ingress/egress rules |
| `k8s__get_ingresses` | read | Ingress resources with hosts, paths, backends, TLS config |
| `k8s__list_cronjobs` | read | CronJobs with schedule, last run, active jobs, suspend status |
| `k8s__get_cronjob_status` | read | Detailed CronJob status with recent job history |
| `k8s__diff_resource` | read | Compare current resource state vs last-applied-configuration |
| `k8s__get_hpa` | read | HorizontalPodAutoscaler with current/target metrics and scaling status |
| `k8s__list_pvcs` | read | PersistentVolumeClaims with status, capacity, storage class |
| `k8s__list_services` | read | Services with type, ports, selectors, clusterIP, endpoints |
| `k8s__list_contexts` | read | All kubeconfig contexts and the active one |
| `k8s__switch_context` | mutate | Switch active context (session-scoped) |
| `k8s__scale_deployment` | mutate | Scale replicas with dry-run diff preview |
| `k8s__apply_manifest` | mutate | Apply a manifest string with server-side dry-run |
| `k8s__rollout_restart` | mutate | Trigger rolling restart of a deployment or statefulset |
| `k8s__delete_resource` | destructive | Delete a named resource — requires `confirm: true` |

### ArgoCD (`argo__*`)

| Tool | Tier | Description |
|---|---|---|
| `argo__list_apps` | read | All apps with health, sync status, source repo |
| `argo__get_app` | read | Full spec and status for one application |
| `argo__get_app_diff` | read | Live diff between git and cluster state |
| `argo__get_app_history` | read | Deployment history with git SHAs and timestamps |
| `argo__get_resource_tree` | read | Full owned resource tree for an app |
| `argo__sync_app` | mutate | Trigger sync — supports dry-run, prune, force |
| `argo__rollback_app` | mutate | Roll back to a specific history revision |
| `argo__terminate_op` | mutate | Cancel an in-progress sync operation |

### Prometheus (`prom__*`)

| Tool | Tier | Description |
|---|---|---|
| `prom__query` | read | Instant PromQL query with label + value output |
| `prom__query_range` | read | Range query with step, returns time-series data |
| `prom__list_alerts` | read | All alert rules with state (firing / pending / inactive) |
| `prom__get_firing_alerts` | read | Only currently firing alerts with duration |
| `prom__list_targets` | read | All scrape targets with health and last scrape |
| `prom__label_values` | read | Enumerate values for a given label name |
| `prom__metric_metadata` | read | Type, help text, and unit for a metric |
| `prom__compare_periods` | read | 📈 **Compare metrics** between two time windows — detect before/after deployment changes |
| `prom__slo_status` | read | 🎯 **SLO compliance** — error budget remaining, burn rate, time to exhaustion |
| `prom__summarize_service_health` | read | 📊 **Smart summary** - human-readable service health metrics including latency changes, error rate vs SLO, and traffic patterns |

**Example usage:**
```bash
# Get a human-readable health summary
prom__summarize_service_health(service="payments", timeframeMinutes=30, sloThreshold=0.05)
```

**What it outputs:**
- **Latency**: "Latency increased: 120ms → 480ms (+300%)" or "Latency stable: 125ms"
- **Error rate**: "Error rate crossed SLO (5%): 7.2%" or "Error rate within SLO: 2.1%"
- **Traffic**: "Traffic dropped: 500 → 350 req/s (-30%)" or "Traffic spike detected (+150%)"
- **Overall assessment**: Summary of issues and positive indicators

**Why this matters:**
Instead of returning raw PromQL numbers that need interpretation, this tool returns plain-language findings that an AI agent can quote directly during incident investigation and communication.
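
As a rough illustration of how summary strings like the ones above could be produced, here is a small TypeScript sketch. The function names and the 10% "stable" band are assumptions, not the tool's actual logic, and `percentChange` assumes a non-zero starting value.

```typescript
// Hypothetical formatters; the 10% "stable" band is an assumed threshold.
function percentChange(before: number, after: number): number {
  // Assumes before > 0; a zero baseline would need separate handling.
  return ((after - before) / before) * 100;
}

function describeLatency(beforeMs: number, afterMs: number, stableBandPct = 10): string {
  const change = percentChange(beforeMs, afterMs);
  if (Math.abs(change) <= stableBandPct) return `Latency stable: ${afterMs}ms`;
  const verb = change > 0 ? "increased" : "decreased";
  const sign = change > 0 ? "+" : "";
  return `Latency ${verb}: ${beforeMs}ms → ${afterMs}ms (${sign}${change.toFixed(0)}%)`;
}

function describeErrorRate(rate: number, sloThreshold: number): string {
  // Rates are fractions (0.05 = 5%), matching the sloThreshold parameter.
  const pct = (x: number) => `${(x * 100).toFixed(1)}%`;
  return rate > sloThreshold
    ? `Error rate crossed SLO (${(sloThreshold * 100).toFixed(0)}%): ${pct(rate)}`
    : `Error rate within SLO: ${pct(rate)}`;
}
```

With these definitions, `describeLatency(120, 480)` reproduces the "Latency increased" string shown above, since (480 - 120) / 120 = 300%.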

### Loki (`logs__*`)

| Tool | Tier | Description |
|---|---|---|
| `logs__get_recent_errors` | read | Get recent error logs from Loki for debugging incidents |
| `logs__search` | read | Search logs in Loki with custom query for root cause analysis |

**Example usage:**
```bash
# Get recent error logs
logs__get_recent_errors(service="payments", namespace="default", minutes=30, limit=50)

# Search logs with custom query
logs__search(query='{service="payments"} |= "error"', limit=100)
```

**Why this matters:**
- **Metrics tell what**: Prometheus shows you that latency increased or error rate crossed SLO
- **Logs tell why**: Loki shows you the actual error messages, stack traces, and context around failures
- **Complete debugging**: Without logs, you can see that something is broken but not understand the root cause

**Output format:**
- Structured log entries with timestamp, message, service, namespace, and extracted log levels
- Error count summaries and filtering
- Raw LogQL results for detailed analysis

This makes incident investigation complete by combining the "what" (metrics) with the "why" (logs).
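
For readers curious what a tool like `logs__get_recent_errors` might send to Loki, the following TypeScript sketch builds a LogQL query and a URL for Loki's `query_range` HTTP endpoint. The helper names are hypothetical, the `service` and `namespace` label names are assumptions about how your logs are labeled, and the sketch does not escape quotes inside label values.

```typescript
// Hypothetical helpers; label names are assumptions about your log streams.
function buildErrorQuery(service: string, namespace: string): string {
  // Stream selector plus a case-insensitive regex line filter.
  return `{service="${service}", namespace="${namespace}"} |~ "(?i)error"`;
}

// Build a URL for Loki's query_range endpoint covering the last N minutes.
function buildQueryRangeUrl(
  lokiUrl: string,
  query: string,
  minutes: number,
  limit: number,
  nowMs: number
): string {
  // Loki accepts start/end as Unix epoch nanoseconds.
  const toNs = (ms: number) => `${ms}000000`;
  const params: Array<[string, string]> = [
    ["query", query],
    ["limit", String(limit)],
    ["start", toNs(nowMs - minutes * 60000)],
    ["end", toNs(nowMs)],
  ];
  const qs = params.map(([k, v]) => `${k}=${encodeURIComponent(v)}`).join("&");
  return `${lokiUrl.replace(/\/+$/, "")}/loki/api/v1/query_range?${qs}`;
}
```

Taking `nowMs` as a parameter instead of calling `Date.now()` inside the function keeps the output deterministic for testing.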

### PagerDuty (`pd__*`)

| Tool | Tier | Description |
|---|---|---|
| `pd__list_incidents` | read | Open incidents with severity, status, assignee |
| `pd__get_incident` | read | Full detail with alerts, notes, timeline |
| `pd__who_is_oncall` | read | Current on-call per schedule or escalation policy |
| `pd__list_services` | read | All services with integration keys and status |
| `pd__get_log_entries` | read | Audit log for an incident (all state changes) |
| `pd__acknowledge_incident` | mutate | Acknowledge — suppresses further notifications |
| `pd__add_note` | mutate | Append a note to an incident timeline |
| `pd__escalate_incident` | destructive | Escalate to a different policy — requires `confirm: true` |
| `pd__summarize_incident` | read | 🚨 **Incident auto-summary** - what happened, affected services, probable root cause, current status |

#### `pd__summarize_incident`

**Example usage:**
```bash
# Get an auto-summary of an incident
pd__summarize_incident(id="ABC123")
```

**What it outputs:**
- **What happened**: Incident title, description, severity, urgency, status, creation time, and duration
- **Affected services**: Service name, ID, and current status
- **Probable root cause**: Analysis of trigger alerts and log entries to identify likely causes
- **Current status**: Current incident state, assignees, acknowledgements, and notes count

**Output format:**
```json
{
  "what_happened": {
    "title": "API Gateway High Error Rate",
    "description": "5xx error rate exceeded 5% threshold",
    "severity": "high",
    "urgency": "high",
    "status": "acknowledged",
    "createdAt": "2025-01-15T10:30:00Z",
    "updatedAt": "2025-01-15T11:45:00Z",
    "duration": "1h 15m"
  },
  "affected_services": [
    {
      "id": "P123456",
      "name": "API Gateway",
      "status": "critical"
    }
  ],
  "probable_root_cause": "Triggered by: High 5xx error rate from API Gateway pods",
  "current_status": {
    "status": "acknowledged",
    "lastUpdated": "2025-01-15T11:45:00Z",
    "assignees": ["john.doe@company.com"],
    "acknowledgements": 2,
    "notes": 3
  }
}
```
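
In the sample above, `duration` is consistent with the gap between `createdAt` and `updatedAt` (10:30 to 11:45). Assuming that is how it is derived, a minimal TypeScript sketch of the calculation could look like this; `formatDuration` and `WhatHappened` are hypothetical names, and the interface mirrors only the fields shown in the sample.

```typescript
// Mirrors the "what_happened" fields from the sample output above.
interface WhatHappened {
  title: string;
  description: string;
  severity: string;
  urgency: string;
  status: string;
  createdAt: string; // ISO 8601
  updatedAt: string; // ISO 8601
  duration: string;
}

// Hypothetical helper: format the gap between two ISO timestamps as "1h 15m".
function formatDuration(startIso: string, endIso: string): string {
  const totalMinutes = Math.floor((Date.parse(endIso) - Date.parse(startIso)) / 60000);
  const hours = Math.floor(totalMinutes / 60);
  const minutes = totalMinutes % 60;
  return hours > 0 ? `${hours}h ${minutes}m` : `${minutes}m`;
}
```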

**Why this matters:**
Instead of manually piecing together incident details from multiple API calls, this tool provides a comprehensive, human-readable summary perfect for:
- **Demos**: Shows AI's ability to understand and summarize complex incident data
- **Real-world use**: Quickly understand incident impact without digging through raw data
- **Communication**: Share concise incident summaries with stakeholders

### Helm (`helm__*`)

| Tool | Tier | Description |
|---|---|---|
| `helm__list_releases` | read | List Helm releases with status, chart, app version |
| `helm__get_status` | read | Full status of a Helm release |
| `helm__get_values` | read | User-supplied or computed values for a release |
| `helm__get_history` | read | Revision history of a release |
| `helm__rollback` | mutate | Rollback to a previous revision (dry-run by default) |

**Requirements:** Helm CLI binary must be available in PATH.

**Example usage:**
```bash
# List all releases in a namespace
helm__list_releases(namespace="production")

# Check what values a release is using
helm__get_values(name="api-gateway", all_values=true)

# Rollback after a bad deploy
helm__rollback(name="api-gateway", revision=5, dry_run=false)
```
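
Because this provider shells out to the `helm` binary, each tool call presumably ends up as a list of CLI arguments. The TypeScript sketch below shows one way `helm__rollback` parameters could map onto `helm rollback`; the function and interface names are hypothetical, and the `namespace` parameter is an assumption (the example above does not show it).

```typescript
// Hypothetical mapping from tool parameters to helm CLI arguments.
interface RollbackParams {
  name: string;
  revision: number;
  namespace?: string; // assumed optional parameter
  dry_run?: boolean; // mutate tier: treated as true when omitted
}

function helmRollbackArgs(p: RollbackParams): string[] {
  const args = ["rollback", p.name, String(p.revision)];
  if (p.namespace) args.push("--namespace", p.namespace);
  // Keep the safe default: only drop --dry-run when explicitly disabled.
  if (p.dry_run ?? true) args.push("--dry-run");
  return args;
}
```

Building an argument array, rather than a single command string, lets the caller pass it to a process-spawning API without going through a shell, which avoids quoting and injection problems.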

### Cross-Provider Debugging (`devops__*`)

| Tool | Tier | Description |
|---|---|---|
| `devops__debug_service` | read | 🔥 **Cross-provider incident debugging** - aggregates Kubernetes, ArgoCD, Prometheus, and PagerDuty data to diagnose service issues in one command |
| `devops__explain_change` | read | 🧠 **Explain what changed** - combines ArgoCD history, Kubernetes rollout history, and Prometheus anomaly window to identify cause of issues |
| `devops__runbook` | read | 📋 **Automated runbook** - symptom-based diagnostic that runs targeted checks (crashloop, high-latency, oom, 5xx, pod-pending) |
| `devops__health_report` | read | 🏥 **Cluster health report** - one-shot assessment across all providers with overall status (healthy/degraded/critical) |
| `devops__incident_timeline` | read | 🕐 **Incident timeline** - unified event timeline across K8s, ArgoCD, Prometheus, and PagerDuty sorted chronologically |
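
A unified timeline like the one `devops__incident_timeline` describes comes down to merging per-provider event lists and sorting them by time. The TypeScript sketch below illustrates the idea; the `TimelineEvent` shape and the `mergeTimeline` name are assumptions, and the real tool's output fields may differ.

```typescript
// Hypothetical event shape; the real tool's output fields may differ.
interface TimelineEvent {
  timestamp: string; // ISO 8601
  source: "k8s" | "argo" | "prom" | "pd";
  message: string;
}

// Merge per-provider event lists into one list, oldest first.
function mergeTimeline(...lists: TimelineEvent[][]): TimelineEvent[] {
  const all: TimelineEvent[] = [];
  for (const list of lists) all.push(...list);
  return all.sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}
```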

#### `devops__debug_service`

**Example usage:**
```bash
# Debug a service across all providers
devops__debug_service(service="payments", namespace="default")
```

**What it checks:**
- **Kubernetes**: Pod status, restart counts, readiness, deployment health, recent events
- **ArgoCD**: Sync status, health status, Git diff detection, deployment history
- **Prometheus**: Error rate (5xx responses), latency (p95), firing alerts
- **PagerDuty**: Active incidents matching the service name

**Output format:**
- Human-readable diagnosis with emoji indicators (⚠️ warnings, ❌ errors)
- Per-provider status sections
- Summary highlighting critical issues
- Raw JSON data for detailed analysis

Of the tools listed here, this one has the widest scope for incident investigation: a single call gathers a service's state from all four providers.
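
The output format described above (per-provider lines with indicators, followed by a summary) could be rendered along these lines. This TypeScript sketch is illustrative only: the `Finding` shape, the `renderDiagnosis` name, and the ✅ indicator for healthy checks are assumptions.

```typescript
// Hypothetical finding shape and renderer; not the tool's actual output code.
type Severity = "ok" | "warning" | "error";

interface Finding {
  provider: string;
  severity: Severity;
  message: string;
}

function renderDiagnosis(findings: Finding[]): string {
  // The ✅ icon for healthy checks is an assumption; ⚠️ and ❌ follow the list above.
  const icon: Record<Severity, string> = { ok: "✅", warning: "⚠️", error: "❌" };
  const lines = findings.map((f) => `${icon[f.severity]} [${f.provider}] ${f.message}`);
  const errors = findings.filter((f) => f.severity === "error").length;
  const warnings = findings.filter((f) => f.severity === "warning").length;
  lines.push(`Summary: ${errors} error(s), ${warnings} warning(s)`);
  return lines.join("\n");
}
```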

#### `devops__explain_change`

**Example usage:**
```bash
# Explain what changed in the last hour
devops__explain_change(service="payments", namespace="default", timeframeMinutes=60)
```

**What it analyzes:**
- **ArgoCD**: Deployment history within the timeframe, including revision, author, repo, and chart
- **Kubernetes**: Current rollout status, replica counts, image tags, and deployment readiness
- **Prometheus**: Error rate trends, latency patterns, and traffic spikes over the time window

**Output format:**
- Timeline of recent deployments with full metadata
- Kubernetes rollout status and health
- Metric anomaly detection (error rate spikes, latency issues, traffic changes)
- **Correlation analysis** that links deployments to metric changes
- Summary with root cause hypothesis
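
One simple way to approximate the correlation step is to flag any metric anomaly that begins shortly after a deployment. The TypeScript sketch below shows that heuristic; it is an assumption about the general approach, not the tool's actual algorithm, and the `Deployment`, `Anomaly`, and `correlate` names are hypothetical.

```typescript
// Hypothetical shapes; the real tool may track different fields.
interface Deployment {
  revision: string;
  deployedAt: string; // ISO 8601
}

interface Anomaly {
  metric: string;
  startedAt: string; // ISO 8601
}

interface Correlation {
  revision: string;
  metric: string;
  delayMinutes: number;
}

// Link each anomaly to any deployment that preceded it within the window.
function correlate(deploys: Deployment[], anomalies: Anomaly[], windowMinutes: number): Correlation[] {
  const windowMs = windowMinutes * 60000;
  const links: Correlation[] = [];
  for (const d of deploys) {
    for (const a of anomalies) {
      const delayMs = Date.parse(a.startedAt) - Date.parse(d.deployedAt);
      if (delayMs >= 0 && delayMs <= windowMs) {
        links.push({ revision: d.revision, metric: a.metric, delayMinutes: Math.round(delayMs / 60000) });
      }
    }
  }
  return links;
}
```

A time-window match like this only suggests a candidate cause; it cannot confirm one, which is why the output is framed as a hypothesis.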

**Problem it solves:**
*"Ev

…

## Source & license

This open-source MCP server is cataloged on AgentStack and links to its original source — we do not rehost the code.

- **Author:** [NotHarshhaa](https://github.com/NotHarshhaa)
- **Source:** [NotHarshhaa/devops-mcp](https://github.com/NotHarshhaa/devops-mcp)
- **License:** MIT

Install and usage instructions live in the source repository linked above.

## Pricing

- **Free** — Free

## Versions

- **1.0.1** — security scan: pending review — Imported from the upstream source.

## Links

- Listing page: https://agentstack.voostack.com/l/mcp-notharshhaa-devops-mcp
- Seller: https://agentstack.voostack.com/s/notharshhaa
- Browse the marketplace: https://agentstack.voostack.com/browse

---
Listed on AgentStack — the marketplace for AI agent skills and MCP servers. Every listing is security-reviewed. Creators keep 70%.
