
NexSev: AI-Powered Incident Response
HackWeek-led build: LLMs, LangChain agents, and MCP tools to automate RCA, CANs, and knowledge retrieval—cutting post-Sev1 documentation time by ~40% for HashiCorp/IBM APJ.
How we used LLMs and agentic workflows to cut Sev1 documentation time by ~40%
The problem: incident response is a time sink
If you've been on-call for enterprise infrastructure, you know the pattern: an alert fires, you troubleshoot, you resolve—then the administrative work begins: detailed Root Cause Analysis (RCA), Customer Action Notices (CANs), searching past tickets, and updating the knowledge base.
At HashiCorp, our support team was spending 2+ hours per Sev1 incident on post-resolution work. Across dozens of critical incidents per month, that added up to weeks of manual documentation.
During IBM's AI HackWeek, I led a cross-functional team to build NexSev—an AI-powered incident response assistant that automates those workflows.
The vision: an AI teammate for support engineers
The goal was not to replace engineers—it was to handle the tedious parts so people can focus on solving customer problems.
What we wanted NexSev to do:
- Auto-generate RCA documentation from incident data and resolution notes
- Draft Customer Action Notices with the right technical depth and tone
- Surface relevant historical solutions from the knowledge base when similar issues arise
- Provide real-time troubleshooting guidance during active incidents
In short: turn collective institutional knowledge into an always-available assistant.
Architecture: agentic workflows + knowledge retrieval
Tech stack
- LLMs: Llama 3.1 and IBM Granite (via Ollama for local hosting)
- Orchestration: LangChain for agentic workflows
- Knowledge base: Vector embeddings of historical Zendesk tickets
- Integrations: Custom MCP (Model Context Protocol) tools for Slack and Zendesk
- Frontend: Next.js for document generation and review
- Backend: Python for LLM orchestration and API integrations
Why this stack
Local LLMs (Ollama): Customer data had to stay inside our infrastructure. Running Llama 3.1 and Granite locally meant sensitive context did not leave the environment.
MCP: Custom tools gave the model access to Zendesk (tickets, customer context, notes), internal knowledge (past RCAs, runbooks), and Slack (updates and commands).
Agentic design: Instead of one brittle mega-prompt, the assistant could decide what to fetch, call tools, synthesize drafts, and pause for human review before finalizing.
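The gather-then-synthesize loop above can be sketched in a few lines. This is an illustrative mock, not the production code: the tool functions stand in for the real MCP-backed Zendesk and knowledge-base tools, and the fixed step order stands in for the LLM's tool-choosing.

```python
def fetch_ticket(ticket_id: str) -> dict:
    # Stand-in for the Zendesk MCP tool.
    return {"id": ticket_id, "summary": "TFE outage: PostgreSQL pool exhausted"}

def search_kb(query: str) -> list[str]:
    # Stand-in for vector search over past RCAs and runbooks.
    return ["RCA-2023-11: pool exhaustion after failover",
            "Runbook: tune max_connections"]

TOOLS = {"fetch_ticket": fetch_ticket, "search_kb": search_kb}

def run_agent(ticket_id: str) -> dict:
    """Gather context tool by tool, then draft and pause for human review."""
    context: dict = {}
    # Step 1: the agent decides it needs the ticket first.
    context["ticket"] = TOOLS["fetch_ticket"](ticket_id)
    # Step 2: it uses what it learned to query the knowledge base.
    context["similar"] = TOOLS["search_kb"](context["ticket"]["summary"])
    # Step 3: synthesize a draft (the real system calls the LLM here)
    # and mark it for engineer review instead of auto-publishing.
    return {"draft": f"RCA draft for {ticket_id}",
            "sources": context["similar"],
            "status": "needs_review"}

result = run_agent("ZD-1234")
```

The point of the shape: each step can fail or be retried independently, and the final state is always "needs review", never "published".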
Implementation: from concept to production
Phase 1: Knowledge base retrieval
Historical incidents were unstructured and noisy. Our approach:
- Extract resolved Sev1 and Sev2 tickets from Zendesk
- Clean and chunk descriptions, resolution notes, and RCAs where available
- Embed with a lightweight model and store in a vector database
- At incident time, retrieve by error patterns, components (TFE, PostgreSQL, Redis, VCS), and environment (cloud, topology)
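The retrieval pipeline above can be sketched end to end. In production we embedded chunks with a lightweight model and stored them in Chroma; in this toy version a bag-of-words vector stands in for the embedding so the ranking logic stays visible, and the corpus entries are made-up examples.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts instead of a learned dense vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Chunked resolution notes from past tickets (illustrative data).
corpus = {
    "ZD-101": "TFE login failures after PostgreSQL failover exhausted connections",
    "ZD-102": "Redis eviction caused run queue stalls in TFE",
    "ZD-103": "VCS webhook delivery delays from GitHub rate limiting",
}
index = {tid: embed(text) for tid, text in corpus.items()}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank past tickets by similarity to the live incident description."""
    q = embed(query)
    ranked = sorted(index, key=lambda tid: cosine(q, index[tid]), reverse=True)
    return ranked[:k]

top = retrieve("PostgreSQL connections exhausted after failover")
```

Swapping the toy `embed` for a real embedding model and the dict for a vector store changes the quality, not the shape, of this code.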
Phase 2: RCA generation
A strong RCA needs accuracy, timeline, real root cause (not symptoms), remediation, and prevention. We combined ticket data with retrieved similar incidents, used structured outputs (function calling), and presented a draft for engineer refinement.
The key insight: don't chase perfect AI copy—ship a strong first draft an engineer can polish in minutes instead of writing from scratch for an hour.
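The structured-output contract can be sketched as a schema plus a completeness check: the model is asked (via function calling) to return JSON matching the schema, so a draft can never silently omit a section. Field names here are illustrative, not our exact schema.

```python
from dataclasses import dataclass, fields

@dataclass
class RCADraft:
    summary: str
    timeline: list[str]
    root_cause: str
    remediation: str
    prevention: str

def missing_sections(payload: dict) -> list[str]:
    """Return schema fields the model left empty or omitted entirely."""
    return [f.name for f in fields(RCADraft) if not payload.get(f.name)]

# A model response missing 'prevention' gets flagged before engineer review.
response = {
    "summary": "PostgreSQL pool exhaustion caused TFE outage",
    "timeline": ["14:02 alert fired", "14:40 failover completed"],
    "root_cause": "Connection pool sized below post-failover demand",
    "remediation": "Raised pool size and restarted TFE",
    "prevention": "",
}
gaps = missing_sections(response)
```

Gaps get surfaced in the review UI rather than rejected outright, since the engineer may have that context in their head anyway.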
Phase 3: Slack integration
Support engineers live in Slack during incidents. We shipped slash-style workflows such as:
- /nexsev analyze — suggest troubleshooting steps for the active incident
- /nexsev rca [ticket-id] — RCA draft for a resolved incident
- /nexsev can [ticket-id] — CAN draft
- /nexsev search [query] — knowledge base search
The bot used MCP-backed tools to pull Zendesk context, search knowledge, generate documents, and post results back to the channel.
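The routing behind the bot is simple. The real handler is a Bolt for Python slash-command listener; this stand-alone parser shows how the text after `/nexsev` becomes a workflow call. Handler bodies are placeholders for the real document-generation calls.

```python
HANDLERS = {
    "analyze": lambda arg: "troubleshooting steps for the active incident",
    "rca": lambda arg: f"RCA draft for {arg}",
    "can": lambda arg: f"CAN draft for {arg}",
    "search": lambda arg: f"knowledge base results for '{arg}'",
}

def handle_slash(text: str) -> str:
    """Route the text after '/nexsev' to the matching workflow."""
    parts = text.strip().split(maxsplit=1)
    if not parts or parts[0] not in HANDLERS:
        return "usage: /nexsev analyze|rca|can|search [arg]"
    cmd = parts[0]
    arg = parts[1] if len(parts) > 1 else ""
    return HANDLERS[cmd](arg)

reply = handle_slash("rca ZD-1234")
```

In the Bolt version, `handle_slash` becomes the body of an `@app.command("/nexsev")` listener that acks the request and posts the result back to the channel.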
Phase 4: Next.js review interface
Long-form review needed more than a thread. We built a Next.js app to review and edit RCAs and CANs, track approval state, and export finalized documents to Zendesk—closing the loop so engineer edits inform prompt improvements over time.
Impact: ~40% reduction in incident documentation time
After rollout to our APJ support team, we measured:
- RCA: ~90 minutes average → ~30 minutes to review and finalize AI-generated drafts (~67% reduction)
- CAN: ~30 minutes → ~10 minutes to finalize (~67% reduction)
- Knowledge retrieval: automated and surfaced during incidents instead of ad hoc search
Overall: roughly 40% less time on post-incident documentation, with better consistency, fewer missed sections (timelines, prevention), and less cognitive load on engineers during high-pressure events.
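As a rough sanity check on how ~67% task-level savings land at ~40% overall: the RCA and CAN times come from the measurements above, while the ~60 minutes of other post-incident work (ticket hygiene, KB updates) is an assumed figure for illustration only.

```python
# Minutes per Sev1; "other" is an assumed, unchanged bucket of post-incident work.
before = {"rca": 90, "can": 30, "other": 60}
after = {"rca": 30, "can": 10, "other": 60}

total_before = sum(before.values())  # 180 minutes
total_after = sum(after.values())    # 100 minutes
reduction = 1 - total_after / total_before  # ~0.44, i.e. roughly 40% overall
```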
Lessons learned
- Local LLMs are viable in production when privacy matters—fast enough, no per-token bill, and room to customize.
- Agentic flows beat one-shot prompts for reliability when context must be gathered from multiple systems.
- Human-in-the-loop is non-negotiable for customer-facing quality—draft, review, then publish.
- Adoption follows the workflow—Slack integration mattered as much as model quality.
What's next
NexSev is in production for APJ; the roadmap includes expansion to other regions, proactive signals, tighter feedback loops for fine-tuning, and deeper observability integrations (e.g. Datadog / Prometheus).
Technical details
Reference stack
Backend:
- Python 3.11
- LangChain (agent orchestration)
- Ollama (local LLMs)
- FastAPI (API)
- Vector DB (Chroma)
Frontend:
- Next.js (App Router)
- TypeScript
- Tailwind CSS
- shadcn/ui
Integrations:
- Slack SDK (Bolt for Python)
- Zendesk API
- Custom MCP tools
Models:
- Llama 3.1 (8B / 70B)
- IBM Granite 13B
Implementation tips: start with retrieval before generation; use structured outputs; bake review into v1; prioritize low-friction entry (Slack first).
Conclusion
NexSev began as a HackWeek experiment and became a production system that saves the team significant time each month while improving consistency. The direction that resonates most: use AI for the repetitive work so engineers can stay focused on hard technical problems and customer outcomes.
Syed Ibtihaj