Architecture Overview

Fault Intelligence as Code is a live demo system for turning a detected network fault into a grounded, auditable troubleshooting session. Splunk detects the event, the relay opens an OpenCode session, the `network-troubleshooter` agent loads the relevant fault intelligence, and RADKit MCP provides device access.

The architecture has two connected sides:

Side	Purpose	Agents
Author-time	Compile source knowledge into reviewed intelligence artifacts and operational KB content before any fault fires.	`ia-curator`, `kb-curator`
Runtime	Consume approved artifacts and KB context to diagnose, remediate, verify, or escalate an active fault.	`network-troubleshooter`, `ia-reader`, `kb-reader`

For the runtime context assembled during an incident, see Data Flow. For the shared vocabulary behind the agentic terms, see Agentic Concepts.

Author-Time Architecture

Author-time work turns scattered expertise into durable knowledge sources. The IA curator workflow researches source material, drafts Remediation Guides, derives Fault Signatures and Repair Action Workflows, generates tests, and publishes through Git. The KB curator workflow ingests operational knowledge into the Markdown wiki.

graph LR
    SRC[Source material<br/>support data / docs / cases / lab findings] --> IAC[ia-curator]
    IAC --> IASKILLS[ia-* skills<br/>research / create / optimize / test / publish]
    IASKILLS --> DRAFTS[ia-drafts]
    DRAFTS --> PR[Git review]
    PR --> IA[intelligence-artifacts<br/>FS / RAW / RG]
    IA --> SPLUNK[Generated Splunk alerting]

    OPS[Operational knowledge<br/>policies / incidents / known issues] --> KBC[kb-curator]
    KBC --> WSKILLS[wiki skills<br/>ingest / query / lint / save]
    WSKILLS --> KB[kb/wiki]

The expensive synthesis happens here, before runtime. That keeps the live agent from rediscovering or reinterpreting raw documents during an incident.

Runtime Architecture

graph LR
    subgraph Detection
        SP[Splunk saved search]
    end

    subgraph Relay[Webhook relay container]
        API[FastAPI relay<br/>app/alert_pipeline.py]
        WB[Webex websocket bot<br/>fault_approval callbacks]
    end

    subgraph OpenCode[OpenCode runtime]
        NT[network-troubleshooter]
        IAR[ia-reader]
        KBR[kb-reader]
        FR[fault-remediation skill]
        WN[webex-notify skill]
    end

    subgraph Repo[Repository knowledge]
        IA[intelligence-artifacts<br/>FS / RAW / RG]
        KB[kb/wiki<br/>business rules / runbooks / incidents]
    end

    subgraph Lab[Lab network]
        RK[RADKit MCP server]
        DEV[IOS XR routers]
    end

    WX[Webex room]

    SP -- POST /fault-alert --> API
    API -- create session + prompt_async --> NT
    NT -- task --> IAR
    IAR -- read --> IA
    NT -- task --> KBR
    KBR -- wiki-query --> KB
    NT -- invoke --> FR
    FR -- radkit_* tools --> RK
    RK -- CLI --> DEV
    NT -- invoke --> WN
    WN -- messages + cards --> WX
    WX -- outbound websocket event --> WB
    WB -- prompt_async approval --> NT

At runtime, the troubleshooting agent consumes the compiled knowledge sources, verifies live network state through RADKit MCP, and pauses for approval before crossing a service-impacting action boundary.
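
The approval pause described above can be pictured as a small gate in front of each step. The following is a minimal sketch, not the demo's actual implementation: the `Step`, `Session`, and `service_impacting` names are hypothetical, and the step names are made up for illustration. It only shows the control flow of running steps in order and stopping before an unapproved service-impacting one.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    DONE = "done"
    AWAITING_APPROVAL = "awaiting_approval"


@dataclass
class Step:
    name: str
    service_impacting: bool = False  # hypothetical flag marking the approval boundary


@dataclass
class Session:
    steps: list[Step]
    cursor: int = 0  # index of the next step to run
    approved: set[str] = field(default_factory=set)
    log: list[str] = field(default_factory=list)

    def approve(self, step_name: str) -> None:
        """Record a human approval (in the demo this would arrive from Webex)."""
        self.approved.add(step_name)

    def run(self) -> StepStatus:
        """Run steps in order; stop before any unapproved service-impacting step."""
        while self.cursor < len(self.steps):
            step = self.steps[self.cursor]
            if step.service_impacting and step.name not in self.approved:
                self.log.append(f"paused before {step.name}")
                return StepStatus.AWAITING_APPROVAL
            self.log.append(f"ran {step.name}")
            self.cursor += 1
        return StepStatus.DONE


# Illustrative session: diagnose, then a disruptive action, then verification.
session = Session(
    [
        Step("collect-show-output"),
        Step("bounce-interface", service_impacting=True),
        Step("verify-state"),
    ]
)
first = session.run()                 # stops at the approval boundary
session.approve("bounce-interface")   # simulated approval card click
second = session.run()                # resumes from where it paused
```

Because the cursor is kept on the session, resuming after approval does not repeat the read-only diagnostic steps that already ran.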

Repository as Control Plane

The repository is more than a storage location: it is the control plane for the agent harness.

Path	Control-plane role
`.opencode/agents/`	Defines agent roles, allowed skills, sub-agent permissions, and behavioral constraints.
`.opencode/skills/`	Defines reusable procedures for artifact authoring, RAW execution, Webex notification, and KB operations.
`opencode.json`	Selects model/provider settings, MCP servers, and agent-level tool allow-lists.
`intelligence-artifacts/`	Stores approved FS/RAW/RG bundles that runtime agents treat as technical ground truth.
`kb/`	Stores compiled operational memory in Markdown.
`app/`	Bridges Splunk-style alerts and Webex approval events into OpenCode sessions.
`scripts/`	Provides simulation, validation, testing, and deployment helpers.

This is the practical payoff of the "as code" model: the same source of truth controls authoring, review, testing, deployment, and runtime behavior.
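
Since the layout itself carries meaning, a validation helper could check that the control-plane paths exist before a deployment. The sketch below is an assumption about what such a check might look like. The path list comes from the table above, but `missing_control_plane_paths` is a hypothetical function and is not claimed to exist in `scripts/`.

```python
import tempfile
from pathlib import Path

# Paths taken from the control-plane table; the check itself is hypothetical.
CONTROL_PLANE_PATHS = (
    ".opencode/agents",
    ".opencode/skills",
    "opencode.json",
    "intelligence-artifacts",
    "kb",
    "app",
    "scripts",
)


def missing_control_plane_paths(repo_root: Path) -> list[str]:
    """Return the control-plane paths that do not exist under repo_root."""
    return [p for p in CONTROL_PLANE_PATHS if not (repo_root / p).exists()]


# Demo against a throwaway directory that contains only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".opencode" / "agents").mkdir(parents=True)
    (root / "opencode.json").write_text("{}")
    missing = missing_control_plane_paths(root)
```

An empty result means every path in the table is present. It says nothing about whether the contents are valid.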

Components

Component	Technology	Role
Splunk	Splunk saved search / alert action	Detects a fault signature match and sends a Splunk-shaped webhook payload to the relay.
Webhook relay	FastAPI in `app/alert_pipeline.py`	Normalizes Splunk payloads, creates OpenCode sessions, sends prompts to the target agent, exposes health checks, and proxies Splunk REST requests when needed.
Webex websocket bot	`webex-bot` library inside the relay process	Maintains an outbound websocket to Webex and forwards approval card clicks back to the matching OpenCode session.
OpenCode	OpenCode server or TUI	Hosts the configured model, agent definitions, skills, sub-agent tasks, and MCP tool calls.
`network-troubleshooter`	OpenCode primary agent	Orchestrates live fault diagnosis, session logging, artifact loading, KB lookup, RAW execution, Webex notifications, and approval handling.
`ia-reader`	OpenCode read-only sub-agent	Finds and returns Fault Signature, Repair Action Workflow, and Remediation Guide artifacts from `intelligence-artifacts/`.
`kb-reader`	OpenCode read-only sub-agent	Queries the KB wiki vault at `kb/wiki/` through the `wiki-query` skill.
`fault-remediation`	OpenCode skill	Interprets and executes the RAW against the device through RADKit MCP.
`webex-notify`	OpenCode skill	Renders notification templates and sends Webex messages or adaptive cards.
RADKit MCP	Remote MCP server	Provides network-device CLI access through tools prefixed with `radkit_`.
Cisco Support MCP	Local MCP server	Available for author-time intelligence-artifact work through the configured OpenCode environment. It is not part of the live remediation path.
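
To make the relay's normalization role more concrete, here is a minimal sketch of turning a Splunk webhook payload into a prompt for the target agent. It assumes the standard Splunk webhook alert-action fields (`search_name`, `sid`, and a `result` object holding the first matching event). The `FaultAlert` type, the two functions, the fallback values, and the sample payload are all hypothetical and do not describe the actual code in `app/alert_pipeline.py`.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FaultAlert:
    search_name: str  # name of the saved search that fired
    sid: str          # Splunk search job id
    host: str         # device that emitted the event, if Splunk supplied one
    raw_event: str    # original log line, if present


def normalize_splunk_payload(payload: dict[str, Any]) -> FaultAlert:
    """Flatten a Splunk-shaped payload, tolerating missing fields."""
    result = payload.get("result") or {}
    return FaultAlert(
        search_name=str(payload.get("search_name", "unknown-search")),
        sid=str(payload.get("sid", "")),
        host=str(result.get("host", "unknown-host")),
        raw_event=str(result.get("_raw", "")),
    )


def build_prompt(alert: FaultAlert) -> str:
    """Render the text that would be sent to the troubleshooting agent."""
    return (
        f"Fault alert from saved search '{alert.search_name}' "
        f"on device {alert.host}.\n"
        f"Raw event: {alert.raw_event}"
    )


# Made-up sample payload; the values are placeholders, not real demo data.
sample = {
    "search_name": "example-fault-signature",
    "sid": "scheduler__admin__search__example",
    "result": {"host": "xr-1", "_raw": "Interface Gi0/0/0/1, changed state to Down"},
}
alert = normalize_splunk_payload(sample)
prompt = build_prompt(alert)
```

Normalizing into one small, immutable structure keeps the rest of the pipeline independent of how Splunk shapes its payload.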

Design Decisions

Decision	Current choice	Why it matters
Agent runtime	OpenCode with repository-defined agents	The demo is model-provider independent and the agent behavior is versioned with the repository.
Live troubleshooting authority	`network-troubleshooter` only	Runtime responsibilities are narrow: diagnose, execute the RAW, notify, and escalate.
Artifact access	`ia-reader` sub-agent	Live sessions can read FS/RAW/RG artifacts but cannot create or publish them.
KB access	`kb-reader` sub-agent plus `wiki-query`	Live sessions can retrieve business rules and incident context without gaining write access to the vault.
Device access	RADKit MCP	Network operations go through a controlled MCP tool surface rather than ad hoc SSH scripts.
Human approval	Webex adaptive card plus outbound websocket callback	Approval does not require exposing the relay on a public inbound URL.
Notification rendering	`webex-notify` templates	Message and card format lives in one skill and can be reviewed independently from the RAW interpreter.
Relay deployment	Docker Compose	The relay is a small FastAPI service; the OpenCode agent runtime runs on the host or as a separate service.

Runtime Boundaries

The live fault path is intentionally read-only with respect to the repository. `network-troubleshooter` can read files, call RADKit tools, invoke `fault-remediation` and `webex-notify`, and delegate read-only work to `ia-reader` and `kb-reader`. It cannot edit files or call curator agents.

Author-time maintenance uses separate curator agents:

Agent	Purpose
`ia-curator`	Creates, researches, optimizes, and publishes intelligence artifacts.
`kb-curator`	Ingests, lints, and maintains the KB wiki vault.

This separation prevents a live troubleshooting session from silently changing the knowledge base or the artifact library while a fault is active.
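
The boundary above amounts to a default-deny capability table. The sketch below is an illustrative model only: the `Capability` names and the `RUNTIME_POLICY` mapping are hypothetical and do not reflect OpenCode's actual permission format. The reader entries are inferred from the component descriptions rather than stated as policy in the source.

```python
from enum import Enum, auto


class Capability(Enum):
    READ_FILES = auto()
    EDIT_FILES = auto()
    CALL_RADKIT = auto()
    INVOKE_SKILL = auto()
    DELEGATE_TO_READER = auto()
    CALL_CURATOR = auto()


# Hypothetical policy table mirroring the prose; anything not listed is denied.
RUNTIME_POLICY: dict[str, frozenset[Capability]] = {
    "network-troubleshooter": frozenset(
        {
            Capability.READ_FILES,
            Capability.CALL_RADKIT,
            Capability.INVOKE_SKILL,
            Capability.DELEGATE_TO_READER,
        }
    ),
    # Read-only sub-agents; kb-reader also uses the wiki-query skill.
    "ia-reader": frozenset({Capability.READ_FILES}),
    "kb-reader": frozenset({Capability.READ_FILES, Capability.INVOKE_SKILL}),
}


def is_allowed(agent: str, capability: Capability) -> bool:
    """Default-deny lookup: unknown agents and unlisted capabilities are refused."""
    return capability in RUNTIME_POLICY.get(agent, frozenset())
```

Under this model, no runtime agent holds `EDIT_FILES` or `CALL_CURATOR`, which is the property the separation is meant to guarantee.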