Architecture Overview

Fault Intelligence as Code is a live demo system for turning a detected network fault into a grounded, auditable troubleshooting session. Splunk detects the event, the relay opens an OpenCode session, the network-troubleshooter agent loads the relevant fault intelligence, and RADKit MCP provides device access.

The architecture has two connected sides:

| Side | Purpose | Primary agents |
| --- | --- | --- |
| Author-time | Compile source knowledge into reviewed intelligence artifacts and operational KB content before any fault fires. | ia-curator, kb-curator |
| Runtime | Consume approved artifacts and KB context to diagnose, remediate, verify, or escalate an active fault. | network-troubleshooter, ia-reader, kb-reader |

For the runtime context assembled during an incident, see Data Flow. For the shared vocabulary behind the agentic terms, see Agentic Concepts.

Author-Time Architecture

Author-time work turns scattered expertise into durable knowledge sources. The IA curator workflow researches source material, drafts Remediation Guides, derives Fault Signatures and Repair Action Workflows, generates tests, and publishes through Git. The KB curator workflow ingests operational knowledge into the Markdown wiki.

graph LR
    SRC[Source material<br/>support data / docs / cases / lab findings] --> IAC[ia-curator]
    IAC --> IASKILLS[ia-* skills<br/>research / create / optimize / test / publish]
    IASKILLS --> DRAFTS[ia-drafts]
    DRAFTS --> PR[Git review]
    PR --> IA[intelligence-artifacts<br/>FS / RAW / RG]
    IA --> SPLUNK[Generated Splunk alerting]

    OPS[Operational knowledge<br/>policies / incidents / known issues] --> KBC[kb-curator]
    KBC --> WSKILLS[wiki skills<br/>ingest / query / lint / save]
    WSKILLS --> KB[kb/wiki]

The expensive synthesis happens here, before runtime. That keeps the live agent from rediscovering or reinterpreting raw documents during an incident.

Runtime Architecture

graph LR
    subgraph Detection
        SP[Splunk saved search]
    end

    subgraph Relay[Webhook relay container]
        API[FastAPI relay<br/>app/alert_pipeline.py]
        WB[Webex websocket bot<br/>fault_approval callbacks]
    end

    subgraph OpenCode[OpenCode runtime]
        NT[network-troubleshooter]
        IAR[ia-reader]
        KBR[kb-reader]
        FR[fault-remediation skill]
        WN[webex-notify skill]
    end

    subgraph Repo[Repository knowledge]
        IA[intelligence-artifacts<br/>FS / RAW / RG]
        KB[kb/wiki<br/>business rules / runbooks / incidents]
    end

    subgraph Lab[Lab network]
        RK[RADKit MCP server]
        DEV[IOS XR routers]
    end

    WX[Webex room]

    SP -- POST /fault-alert --> API
    API -- create session + prompt_async --> NT
    NT -- task --> IAR
    IAR -- read --> IA
    NT -- task --> KBR
    KBR -- wiki-query --> KB
    NT -- invoke --> FR
    FR -- radkit_* tools --> RK
    RK -- CLI --> DEV
    NT -- invoke --> WN
    WN -- messages + cards --> WX
    WX -- outbound websocket event --> WB
    WB -- prompt_async approval --> NT

At runtime, the troubleshooting agent consumes the compiled knowledge sources, verifies live network state through RADKit MCP, and pauses for approval before crossing a service-impacting action boundary.
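The approval pause can be pictured as a gate in front of each workflow step. The following is a minimal sketch under stated assumptions, not the fault-remediation skill's actual logic: the `Step` type, its `service_impacting` flag, and the `execute` and `request_approval` callables are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Step:
    """One workflow step. `service_impacting` is a hypothetical flag for this sketch."""
    command: str
    service_impacting: bool = False


def run_steps(
    steps: Iterable[Step],
    execute: Callable[[str], str],
    request_approval: Callable[[Step], bool],
) -> List[str]:
    """Run steps in order; stop at the first service-impacting step that is not approved."""
    log: List[str] = []
    for step in steps:
        # Diagnostic steps run freely; impacting steps need an explicit "yes".
        if step.service_impacting and not request_approval(step):
            log.append(f"ESCALATED before: {step.command}")
            break
        log.append(f"{step.command} -> {execute(step.command)}")
    return log


if __name__ == "__main__":
    demo = [
        Step("diagnostic-command"),
        Step("service-impacting-command", service_impacting=True),
    ]
    # Stand-ins: a fake device that always answers "ok" and an approver who declines.
    for line in run_steps(demo, execute=lambda cmd: "ok", request_approval=lambda s: False):
        print(line)
```

In the real system the approval request would presumably travel through the Webex card and websocket callback shown in the diagram; here it is collapsed into a synchronous callable to keep the boundary itself visible.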

Repository as Control Plane

The repository is not only a storage location. It is the control plane for the agent harness.

| Path | Control-plane role |
| --- | --- |
| .opencode/agents/ | Defines agent roles, allowed skills, sub-agent permissions, and behavioral constraints. |
| .opencode/skills/ | Defines reusable procedures for artifact authoring, RAW execution, Webex notification, and KB operations. |
| opencode.json | Selects model/provider settings, MCP servers, and agent-level tool allow-lists. |
| intelligence-artifacts/ | Stores approved FS/RAW/RG bundles that runtime agents treat as technical ground truth. |
| kb/ | Stores compiled operational memory in Markdown. |
| app/ | Bridges Splunk-style alerts and Webex approval events into OpenCode sessions. |
| scripts/ | Provides simulation, validation, testing, and deployment helpers. |

This is the practical payoff of the "as code" model: the same source of truth controls authoring, review, testing, deployment, and runtime behavior.
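Because the control plane is a set of paths, a checkout can be sanity-checked before the demo runs. The sketch below is illustrative only: the path list comes from the table above, while the function name `missing_control_plane_paths` is hypothetical and not one of the repository's own scripts.

```python
from pathlib import Path
from typing import List

# Paths taken from the control-plane table; entries ending in "/" are directories.
CONTROL_PLANE_PATHS = [
    ".opencode/agents/",
    ".opencode/skills/",
    "opencode.json",
    "intelligence-artifacts/",
    "kb/",
    "app/",
    "scripts/",
]


def missing_control_plane_paths(repo_root: str) -> List[str]:
    """Return the control-plane paths that are absent under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in CONTROL_PLANE_PATHS:
        target = root / rel
        present = target.is_dir() if rel.endswith("/") else target.is_file()
        if not present:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_control_plane_paths(".")
    print("control plane complete" if not gaps else f"missing: {gaps}")
```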

Components

| Component | Technology | Role |
| --- | --- | --- |
| Splunk | Splunk saved search / alert action | Detects a fault signature match and sends a Splunk-shaped webhook payload to the relay. |
| Webhook relay | FastAPI in app/alert_pipeline.py | Normalizes Splunk payloads, creates OpenCode sessions, sends prompts to the target agent, exposes health checks, and proxies Splunk REST requests when needed. |
| Webex websocket bot | webex-bot library inside the relay process | Maintains an outbound websocket to Webex and forwards approval card clicks back to the matching OpenCode session. |
| OpenCode | OpenCode server or TUI | Hosts the configured model, agent definitions, skills, sub-agent tasks, and MCP tool calls. |
| network-troubleshooter | OpenCode primary agent | Orchestrates live fault diagnosis, session logging, artifact loading, KB lookup, RAW execution, Webex notifications, and approval handling. |
| ia-reader | OpenCode read-only sub-agent | Finds and returns Fault Signature, Repair Action Workflow, and Remediation Guide artifacts from intelligence-artifacts/. |
| kb-reader | OpenCode read-only sub-agent | Queries the KB wiki vault at kb/wiki/ through the wiki-query skill. |
| fault-remediation | OpenCode skill | Interprets and executes the RAW against the device through RADKit MCP. |
| webex-notify | OpenCode skill | Renders notification templates and sends Webex messages or adaptive cards. |
| RADKit MCP | Remote MCP server | Provides network-device CLI access through tools prefixed with radkit_. |
| Cisco Support MCP | Local MCP server | Available for author-time intelligence-artifact work through the configured OpenCode environment. It is not part of the live remediation path. |
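The relay's normalization step can be sketched as follows. This is a hedged illustration, not the code in app/alert_pipeline.py: it assumes the payload uses the top-level keys of Splunk's standard webhook alert action (`search_name`, `sid`, `result`) and that the matching event names the device in a `host` field. `FaultAlert`, `normalize_alert`, and `build_prompt` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping


@dataclass(frozen=True)
class FaultAlert:
    """Hypothetical normalized form of one alert."""
    search_name: str
    sid: str
    device: str
    fields: Dict[str, Any]


def normalize_alert(payload: Mapping[str, Any]) -> FaultAlert:
    """Validate a Splunk-shaped payload and extract what the agent prompt needs."""
    result = payload.get("result")
    if not isinstance(result, Mapping):
        raise ValueError("payload has no 'result' object")
    # Assumption: the matching event carries the device name in 'host'.
    device = str(result.get("host", "")).strip()
    if not device:
        raise ValueError("result has no 'host' field")
    return FaultAlert(
        search_name=str(payload.get("search_name", "unknown")),
        sid=str(payload.get("sid", "")),
        device=device,
        fields=dict(result),
    )


def build_prompt(alert: FaultAlert) -> str:
    """Turn the normalized alert into the first prompt for the target agent."""
    return (
        f"Fault alert '{alert.search_name}' fired for device {alert.device}. "
        "Load the matching fault intelligence and begin diagnosis."
    )
```

Rejecting malformed payloads at this boundary keeps a bad alert from opening an agent session that has no device to investigate.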

Design Decisions

| Decision | Current choice | Why it matters |
| --- | --- | --- |
| Agent runtime | OpenCode with repository-defined agents | The demo is model-provider independent and the agent behavior is versioned with the repository. |
| Live troubleshooting authority | network-troubleshooter only | Runtime responsibilities are narrow: diagnose, execute the RAW, notify, and escalate. |
| Artifact access | ia-reader sub-agent | Live sessions can read FS/RAW/RG artifacts but cannot create or publish them. |
| KB access | kb-reader sub-agent plus wiki-query | Live sessions can retrieve business rules and incident context without gaining write access to the vault. |
| Device access | RADKit MCP | Network operations go through a controlled MCP tool surface rather than ad hoc SSH scripts. |
| Human approval | Webex adaptive card plus outbound websocket callback | Approval does not require exposing the relay on a public inbound URL. |
| Notification rendering | webex-notify templates | Message and card format lives in one skill and can be reviewed independently from the RAW interpreter. |
| Relay deployment | Docker Compose | The relay is a small FastAPI service; the OpenCode agent runtime stays on the host or another service. |
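For the human-approval row, the key bookkeeping is matching a card click to the session that is waiting on it. A minimal sketch, assuming each approval card carries an identifier the relay chose when it sent the card; `ApprovalRouter` and its methods are hypothetical and do not describe the webex-bot library's API.

```python
from typing import Dict, Optional


class ApprovalRouter:
    """Tracks which session is waiting on which approval card (illustrative only)."""

    def __init__(self) -> None:
        self._pending: Dict[str, str] = {}

    def register(self, approval_id: str, session_id: str) -> None:
        """Record that session_id is paused on the card identified by approval_id."""
        self._pending[approval_id] = session_id

    def resolve(self, approval_id: str) -> Optional[str]:
        """Return the waiting session and forget it, so a repeated click is ignored."""
        return self._pending.pop(approval_id, None)


if __name__ == "__main__":
    router = ApprovalRouter()
    router.register("card-1", "session-abc")
    print(router.resolve("card-1"))  # the session to resume
    print(router.resolve("card-1"))  # None: the click was already handled
```

Consuming the entry on first use is one simple way to make duplicate or late clicks harmless.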

Runtime Boundaries

The live fault path is intentionally read-only with respect to the repository. network-troubleshooter can read files, call RADKit tools, invoke fault-remediation and webex-notify, and delegate read-only work to ia-reader and kb-reader. It cannot edit files and it cannot call curator agents.

Author-time maintenance uses separate curator agents:

| Agent | Purpose |
| --- | --- |
| ia-curator | Creates, researches, optimizes, and publishes intelligence artifacts. |
| kb-curator | Ingests, lints, and maintains the KB wiki vault. |

This separation prevents a live troubleshooting session from silently changing the knowledge base or the artifact library while a fault is active.
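The boundary can be summarized as a capability map. The sketch below is a simplified model for reasoning about the separation, not OpenCode's permission syntax: the capability strings (`read_files`, `edit_files`, `radkit_tools`, and the `skill:` and `task:` prefixes) are hypothetical labels derived from the prose in this section.

```python
from typing import Dict, FrozenSet

# Hypothetical capability labels; the real constraints live in .opencode/agents/
# and opencode.json.
AGENT_CAPABILITIES: Dict[str, FrozenSet[str]] = {
    "network-troubleshooter": frozenset({
        "read_files",
        "radkit_tools",
        "skill:fault-remediation",
        "skill:webex-notify",
        "task:ia-reader",
        "task:kb-reader",
    }),
    "ia-reader": frozenset({"read_files"}),
    "kb-reader": frozenset({"read_files", "skill:wiki-query"}),
    "ia-curator": frozenset({"read_files", "edit_files"}),
    "kb-curator": frozenset({"read_files", "edit_files"}),
}


def is_allowed(agent: str, capability: str) -> bool:
    """Deny by default: unknown agents and unlisted capabilities are refused."""
    return capability in AGENT_CAPABILITIES.get(agent, frozenset())


if __name__ == "__main__":
    print(is_allowed("network-troubleshooter", "radkit_tools"))
    print(is_allowed("network-troubleshooter", "edit_files"))
    print(is_allowed("network-troubleshooter", "task:ia-curator"))
```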