
The Failure Library

Real production incidents, told as stories. Each one ends with the Claude prompt that triages it.

7 posts · bi-weekly · failure

#01 · 2026-05-20 · 11 min · failure

Failure Library #01: The Spanning-Tree Loop That Hit at 3:14 AM

A 28-minute network outage caused by a $19 unmanaged switch, a single line of "fix-it" config from 2019, and a backup job. The full post-mortem — symptoms, false leads, the tell that cracked it, the fix, and the Claude prompt that would have saved 15 minutes.

03:14  ALERT  ping loss 47% · all VLANs
03:14  ALERT  core cpu 98% · mac-flap 2400/s
03:31  "why is bpdufilter set on Gi1/0/14?"
03:33  RESOLVED  port [BLK] · cpu 5%
[!] root cause: rogue switch + bpdufilter

#switching #troubleshooting

#02 · 2026-05-31 · 10 min · failure

Failure Library #02: The DNS Outage That Wasn't DNS

An app server lost its database the instant the DB was migrated to a new IP — and DNS pointed at the right place the entire time. A two-hour outage caused by one line in a hosts file, added as a "temporary" override in 2018. The post-mortem: the false lead, the tell, the fix, and the Claude prompt that would have ended it in five minutes.

14:02  db migrated → 10.0.6.40
14:03  app01: connection refused
nslookup db.corp → 10.0.6.40  (right!)
app still connects to 10.0.6.12
[!] hosts file pinned it since 2018

#dns #troubleshooting
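
The tell in this incident is a mismatch between what DNS answers and what the hosts file pins. As a rough sketch of that comparison (not the post's own tooling), the Python below parses hosts-file text and flags a stale entry. The function names `hosts_override` and `is_stale` and the sample file contents are hypothetical; the hostname and both addresses come from the teaser above.

```python
from typing import Optional

# Hypothetical hosts-file contents, built from the addresses in the teaser.
SAMPLE_HOSTS = """\
127.0.0.1    localhost
# temporary override (illustrative comment, not from the real file)
10.0.6.12    db.corp
"""


def hosts_override(hosts_text: str, hostname: str) -> Optional[str]:
    """Return the IP the hosts file pins for hostname, or None if absent."""
    wanted = hostname.lower()
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        ip, *names = line.split()
        if wanted in (name.lower() for name in names):
            return ip  # stop at the first matching entry
    return None


def is_stale(hosts_text: str, hostname: str, dns_ip: str) -> bool:
    """True when the hosts file pins hostname to something other than dns_ip."""
    pinned = hosts_override(hosts_text, hostname)
    return pinned is not None and pinned != dns_ip


if __name__ == "__main__":
    print(hosts_override(SAMPLE_HOSTS, "db.corp"))
    print(is_stale(SAMPLE_HOSTS, "db.corp", "10.0.6.40"))
```

Taking the hosts text as a string keeps the check testable. In practice you would read it from `/etc/hosts` or `C:\Windows\System32\drivers\etc\hosts`.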

#03 · 2026-05-31 · 11 min · failure

Failure Library #03: The Authentication Outage That Was a Clock

Logins failing building-wide, services throwing errors, the identity team in a war room, yet the domain controllers were up and replication was healthy. The root cause was a 47-minute clock skew on the PDC emulator and Kerberos doing exactly what it is designed to do. The post-mortem: the false lead, the tell, the fix, and the prompt that names it.

08:31  auth failing: OWA/VPN/shares
DCs up · replication healthy
every error: KRB_AP_ERR_SKEW
w32tm offset +2847s on the PDC
[!] clock drifted 47m; Kerberos >5m = no

#security #troubleshooting
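
The arithmetic behind the teaser is small enough to check by hand: an offset of 2847 seconds is a little over 47 minutes, far past the 5-minute tolerance Kerberos allows by default. A minimal sketch of that check follows, assuming the default tolerance has not been changed by policy. `MAX_SKEW`, `skew_exceeds_tolerance`, and `describe_offset` are illustrative names, not part of any Windows or Kerberos API.

```python
from datetime import timedelta

# Assumption: the default 5-minute Kerberos clock-skew tolerance is in effect.
MAX_SKEW = timedelta(minutes=5)


def skew_exceeds_tolerance(offset_seconds: float,
                           tolerance: timedelta = MAX_SKEW) -> bool:
    """True when the absolute clock offset is larger than the tolerance."""
    return abs(offset_seconds) > tolerance.total_seconds()


def describe_offset(offset_seconds: float) -> str:
    """Format an offset in seconds as signed minutes and seconds."""
    sign = "-" if offset_seconds < 0 else "+"
    minutes, seconds = divmod(int(abs(offset_seconds)), 60)
    return f"{sign}{minutes}m{seconds:02d}s"


if __name__ == "__main__":
    pdc_offset = 2847  # seconds, the w32tm reading from the teaser
    print(describe_offset(pdc_offset), skew_exceeds_tolerance(pdc_offset))
```

The comparison uses the absolute value because a clock that runs slow breaks authentication just as surely as one that runs fast.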

2026-04-24 · 6 min · failure

How to Diagnose DNS Failures Fast

When DNS breaks, everything looks broken — but the real cause is rarely obvious. This step-by-step guide takes you from "the internet is down" to root cause using nslookup, dig, and a handful of resolver checks.

$ dig @192.168.1.1 corp.local +stats
;; ANSWER SECTION:
corp.local. 60 IN A 10.0.4.20
;; Query time: 412 msec   ← way too slow
↳ check resolver upstream chain

#dns #troubleshooting

2026-04-24 · 8 min · failure

How to Find What's Saturating Your WAN

A saturated WAN feels like the entire network is broken — but the cause is usually one app, one host, or one runaway backup. This step-by-step guide takes you from "everything is slow" to the exact source using interface stats, NetFlow, DPI, and Wireshark.

TOP TALKERS  · last 5m
  10.0.4.42   480 Mb/s   ████████████
  10.0.4.20    96 Mb/s   ███
  10.0.7.11    44 Mb/s   ██
→ host .42 = nightly backup window

#wan #automation #troubleshooting

2026-04-18 · 9 min · failure

How to Diagnose Packet Loss Fast

Packet loss kills voice calls, video, and file transfers — but the cause is rarely obvious. This step-by-step guide walks you from complaint to root cause using ping, traceroute, switch commands, and Wireshark.

C:\> ping 192.168.1.1 -n 100
Sent=100  Received=83  Lost=17 (17%)
SW01# show interfaces Gi0/4
  CRC: 3,201   ← bad cable / duplex
[!] root cause: Gi0/4 cabling

#troubleshooting

2026-04-18 · 10 min · failure

Troubleshoot Slow Network Performance Step-by-Step

A systematic methodology for diagnosing slow networks — from end-user complaint to root cause. Covers ping, traceroute, packet capture, interface stats, and common fixes.

LAYER 1  cable/port      ✓ pass
LAYER 2  switching       ✓ pass
LAYER 3  routing/MTU     ✗ MTU 1492
LAYER 4  TCP retrans     ✗ 4.1%
LAYER 7  app             — N/A

#troubleshooting
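
The Layer 3 row in the teaser (MTU 1492) is the kind of result a don't-fragment ping confirms. As a small worked example, the sketch below computes the largest ICMP echo payload that fits in one unfragmented IPv4 packet for a given MTU. It assumes a 20-byte IPv4 header with no options and the standard 8-byte ICMP header; `max_ping_payload` is a hypothetical helper name.

```python
IPV4_HEADER_BYTES = 20  # assumes no IP options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one IPv4 packet of size mtu."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an IPv4 ICMP echo")
    return payload


if __name__ == "__main__":
    for mtu in (1500, 1492):
        print(mtu, max_ping_payload(mtu))
```

On a path with MTU 1492, a don't-fragment ping with a 1464-byte payload should pass, while the 1472-byte payload that fits standard Ethernet should fail.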
