Fault Intelligence Overview¶
Fault intelligence is the version-controlled expression of how to detect, diagnose, and repair a known network fault. In this repository, a published alert definition is represented by three linked artifacts: a Fault Signature (FS), a Repair Action Workflow (RAW), and a Remediation Guide (RG).
For the full data model reference, see the Fault Intelligence Standards Reference in the repository's root AGENTS.md.
Section Guide¶
| Page | Purpose |
|---|---|
| Concepts | Linked sets, authoring vs runtime, and current scenarios. |
| Authoring Workflow | How ia-start, ia-create, ia-publish, and ia-explorer fit together. |
| Remediation Guides | Markdown RG format and RG-to-FS/RAW derivation. |
| Fault Signatures | FS YAML structure and detection logic. |
| Repair Action Workflows | RAW YAML structure and runtime action model. |
| Cross-References | Linked-set IDs, references, and validation checks. |
| Best Practices | Common authoring and publishing guidance. |
| Glossary | IA terms and naming conventions. |
| Health Intelligence | Stub/future health-artifact model. |
Artifact Group Layout¶
Artifacts are grouped under intelligence-artifacts/<alert-definition-id>-<slug>/:
intelligence-artifacts/
    AD000002-bgp-neighbor-admin-shutdown-xr/
        FS000002-BGP_NEIGHBOR_ADMIN_SHUTDOWN.yml
        RAW000002-BGP_NEIGHBOR_ADMIN_SHUTDOWN_REPAIR.yml
        RG000002-BGP_NEIGHBOR_ADMIN_SHUTDOWN_GUIDE.md
        tests/
The generated index files at intelligence-artifacts/index.md and intelligence-artifacts/index.json summarize the published catalog.
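The naming convention in the layout above can be checked mechanically. The following is a minimal sketch, not part of the repository tooling: the two regular expressions and the `check_group` helper are hypothetical, and the rule that every artifact shares the numeric ID of its alert definition is inferred from the AD000002 example rather than stated by the standards reference.

```python
import re

# Assumed naming patterns, inferred from the AD000002 example only.
GROUP_RE = re.compile(r"^AD(?P<num>\d{6})-(?P<slug>[a-z0-9-]+)$")
ARTIFACT_RE = re.compile(
    r"^(?P<kind>FS|RAW|RG)(?P<num>\d{6})-(?P<name>[A-Z0-9_]+)\.(?:yml|md)$"
)


def check_group(group_dir: str, filenames: list[str]) -> dict[str, str]:
    """Return {kind: filename} when the group holds one FS, RAW, and RG
    whose numeric ID matches the group directory; raise ValueError otherwise."""
    group = GROUP_RE.match(group_dir)
    if group is None:
        raise ValueError(f"unexpected group directory name: {group_dir}")

    found: dict[str, str] = {}
    for filename in filenames:
        artifact = ARTIFACT_RE.match(filename)
        if artifact is None:
            continue  # skip non-artifact entries such as tests/
        if artifact["num"] != group["num"]:
            raise ValueError(f"{filename} does not share ID {group['num']}")
        if artifact["kind"] in found:
            raise ValueError(f"duplicate {artifact['kind']} artifact: {filename}")
        found[artifact["kind"]] = filename

    missing = {"FS", "RAW", "RG"} - set(found)
    if missing:
        raise ValueError(f"missing artifacts: {sorted(missing)}")
    return found
```

Calling `check_group("AD000002-bgp-neighbor-admin-shutdown-xr", [...])` with the four entries shown in the layout returns a mapping with the keys `FS`, `RAW`, and `RG`.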
Fault Signature (FS)¶
A Fault Signature defines when a fault has occurred. It contains metadata, applicability, detection events, regex evaluation rules, extracted variables, and optional clear-event logic.
For AD000002, FS000002-BGP_NEIGHBOR_ADMIN_SHUTDOWN.yml detects this IOS XR syslog pattern:
%ROUTING-BGP-5-ADJCHANGE : neighbor <ip> Down - Admin. shutdown (VRF: <vrf>) (AS: <asn>)
It extracts:
| Variable | Meaning |
|---|---|
| neighbor_ip | BGP neighbor that was administratively shut down. |
| vrf_name | VRF containing the neighbor session. |
| neighbor_as | Remote AS shown in the event. |
The relay receives these values from Splunk in the alert payload, then passes them to the agent as alert_vars.
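To illustrate how the three variables could be pulled out of the syslog pattern above, here is a minimal sketch using Python's `re` module. The regular expression is an assumption written for this example and is not copied from FS000002; the sample message uses made-up values (a documentation address and a private AS number), and `extract_alert_vars` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed pattern: mirrors the syslog shape shown above, not the real FS regex.
ADJCHANGE_RE = re.compile(
    r"%ROUTING-BGP-5-ADJCHANGE : neighbor (?P<neighbor_ip>\S+) "
    r"Down - Admin\. shutdown "
    r"\(VRF: (?P<vrf_name>[^)]+)\) \(AS: (?P<neighbor_as>[^)]+)\)"
)


def extract_alert_vars(message: str) -> Optional[dict[str, str]]:
    """Return the extracted variables, or None when the message does not match."""
    match = ADJCHANGE_RE.search(message)
    return match.groupdict() if match else None


# Made-up sample message for illustration only.
sample = (
    "%ROUTING-BGP-5-ADJCHANGE : neighbor 192.0.2.1 "
    "Down - Admin. shutdown (VRF: default) (AS: 65001)"
)
print(extract_alert_vars(sample))
# {'neighbor_ip': '192.0.2.1', 'vrf_name': 'default', 'neighbor_as': '65001'}
```

All values come back as strings, which matches the idea of passing them along unchanged as alert_vars; any type conversion would be a separate design decision.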
Repair Action Workflow (RAW)¶
A Repair Action Workflow is the executable remediation plan. It defines inputs, ordered steps, validations, action selection, and terminal outcomes such as resolve, escalate, fail, or partial success.
For AD000002, RAW000002-BGP_NEIGHBOR_ADMIN_SHUTDOWN_REPAIR.yml performs this sequence:
| Step | Name | Purpose |
|---|---|---|
| 1 | confirm_admin_shutdown | Confirm BGP state = Idle (Admin) and extract the local BGP AS from running config. |
| 2 | confirm_shutdown_in_config | Confirm shutdown is present under the neighbor stanza. |
| 3 | remove_admin_shutdown | Request approval, then run no shutdown under the BGP neighbor. |
| 4 | verify_session_reestablishment | Confirm the session returns to Established or escalate with diagnostics. |
The fault-remediation skill interprets the RAW. It runs validation commands through RADKit MCP, evaluates regex patterns, updates variables, selects actions, and surfaces events back to network-troubleshooter.
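The interpretation loop described above (run a command, evaluate a regex, update variables, pick an outcome) can be sketched in a few lines. Everything here is an assumption for illustration: the `Step` fields are not the RAW YAML schema, the command strings are placeholders rather than IOS XR CLI, the canned outputs are stand-ins rather than device output, and a dictionary lookup replaces the RADKit MCP call. Treating a denied approval as an escalation is also an assumption.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    command: str                   # placeholder text, not real device CLI
    expect: str                    # regex the command output must match
    needs_approval: bool = False
    on_mismatch: str = "escalate"  # terminal outcome when `expect` fails


def run_workflow(
    steps: list[Step],
    variables: dict[str, str],
    run_command: Callable[[str], str],
    approve: Callable[[Step], bool],
) -> tuple[str, Optional[str]]:
    """Run steps in order; return (outcome, name of the step that stopped the run)."""
    for step in steps:
        if step.needs_approval and not approve(step):
            return "escalate", step.name  # assumed handling of a denied approval
        output = run_command(step.command.format(**variables))
        match = re.search(step.expect, output)
        if match is None:
            return step.on_mismatch, step.name
        # Named groups become variables available to later steps.
        variables.update(match.groupdict())
    return "resolve", None


STEPS = [
    Step("confirm_admin_shutdown", "state-check {neighbor_ip}",
         r"Idle \(Admin\).*local_as=(?P<local_as>\d+)"),
    Step("confirm_shutdown_in_config", "config-check {local_as} {neighbor_ip}",
         r"\bshutdown\b"),
    Step("remove_admin_shutdown", "apply-no-shutdown {local_as} {neighbor_ip}",
         r"ok", needs_approval=True, on_mismatch="fail"),
    Step("verify_session_reestablishment", "state-check-after {neighbor_ip}",
         r"Established"),
]

# Canned stand-in outputs keyed by the rendered placeholder command.
CANNED = {
    "state-check 192.0.2.1": "state=Idle (Admin) local_as=65000",
    "config-check 65000 192.0.2.1": "neighbor 192.0.2.1 shutdown",
    "apply-no-shutdown 65000 192.0.2.1": "ok",
    "state-check-after 192.0.2.1": "state=Established",
}

outcome = run_workflow(
    STEPS, {"neighbor_ip": "192.0.2.1"}, CANNED.__getitem__, lambda step: True
)
print(outcome)  # ('resolve', None)
```

Passing the command runner and the approval check in as callables keeps the loop testable without a device, which is the main reason for this shape.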
Remediation Guide (RG)¶
A Remediation Guide is the human-readable troubleshooting and remediation source document for a fault. It is written in Markdown by a network engineer or SME, contains no regex or YAML syntax, and reads like a knowledge base article that another engineer can review and edit.
In the artifact-creation workflow, the RG is generated and reviewed first. AI tooling then derives both the Fault Signature and the Repair Action Workflow from the edited RG: triggering events map to FS detection logic, and diagnosis/repair steps map to RAW validation and action flow.
For AD000002, RG000002-BGP_NEIGHBOR_ADMIN_SHUTDOWN_GUIDE.md documents the IOS XR operator workflow for restoring a neighbor that was placed into administrative shutdown.
At runtime, the live agent executes the machine-readable RAW. The RG remains the human-readable source and review artifact for the linked FS and RAW.
Runtime Relationship¶
graph LR
RG[Remediation Guide<br/>human-authored Markdown] --> FS[Fault Signature<br/>derived detection logic]
RG --> RAW[Repair Action Workflow<br/>derived repair flow]
SP[Splunk alert] --> FS
FS --> VARS[Extracted variables]
VARS --> RAW
RAW --> FR[fault-remediation skill]
KB[KB wiki context] --> NT[network-troubleshooter]
NT --> FR
FR --> OUT{resolve / escalate / fail}
At authoring time, ia-create treats the RG as the primary Markdown source for deriving FS and RAW artifacts. At runtime, ia-reader loads the matching artifact group and returns the FS/RAW/RG context to network-troubleshooter; kb-reader separately retrieves operational context from kb/wiki/, such as approval requirements and escalation policy.
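As a rough picture of the runtime lookup, the sketch below finds an artifact group by alert definition ID and returns the text of its FS, RAW, and RG files. It is not the ia-reader implementation: the `load_group` function, the `KINDS` mapping, and the glob patterns are hypothetical and rely only on the directory and file naming shown in the artifact group layout.

```python
from pathlib import Path

# Assumed file suffix per artifact kind, based on the AD000002 example.
KINDS = {"FS": ".yml", "RAW": ".yml", "RG": ".md"}


def load_group(root: Path, alert_definition_id: str) -> dict[str, str]:
    """Return {kind: file text} for the group directory that starts with the ID."""
    groups = sorted(
        path for path in root.glob(f"{alert_definition_id}-*") if path.is_dir()
    )
    if len(groups) != 1:
        raise LookupError(
            f"expected one group for {alert_definition_id}, found {len(groups)}"
        )

    number = alert_definition_id.removeprefix("AD")
    context: dict[str, str] = {}
    for kind, suffix in KINDS.items():
        files = sorted(groups[0].glob(f"{kind}{number}-*{suffix}"))
        if len(files) != 1:
            raise LookupError(
                f"expected one {kind} file in {groups[0].name}, found {len(files)}"
            )
        context[kind] = files[0].read_text(encoding="utf-8")
    return context
```

For example, `load_group(Path("intelligence-artifacts"), "AD000002")` would return a dictionary with the keys `FS`, `RAW`, and `RG`, provided the group directory exists with exactly one file of each kind.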
Published Scenarios¶
| Alert definition | Fault | Artifacts |
|---|---|---|
| AD000002 | BGP neighbor administrative shutdown on IOS XR | FS000002, RAW000002, RG000002 |
| AD000003 | BGP neighbor maximum-prefix limit exceeded on IOS XR | FS000003, RAW000003, RG000003 |
AD000002 is the current default for scripts/simulate_alert.py and the primary demo walkthrough in these docs.