SENAR Guide: Failure Modes
SENAR can fail. Every methodology can. This document catalogs the ways SENAR implementations break down, how to detect problems early, and what to do about them.
The failure modes are organized into three categories: process failures (the methodology is adopted but practiced incorrectly), organizational failures (the people and structures around SENAR break down), and technical failures (the AI tooling and infrastructure undermines the process).
Process Failures
PF-1: Shadow Coding
Description: Supervisors secretly write code instead of directing AI Agents. The Task appears AI-generated, metrics show normal throughput, but the human is doing the implementation work. This defeats the fundamental premise of SENAR: that the Supervisor’s value is judgment, not keystrokes.
Early warning signs:
- Manual Intervention Rate climbs above 30% without corresponding Dead End entries explaining why AI couldn’t do the work
- Commit timestamps cluster outside of session windows
- Supervisors report feeling “faster without the AI” but can’t articulate what the AI struggled with
- Knowledge base has no Dead End or Gotcha entries — everything “just works” (because the human did it)
- Session tool call counts are suspiciously low relative to task complexity
Recovery action:
- Conduct a non-punitive audit: compare git diffs against AI session logs for a sample of Tasks (a sketch of this check follows this list)
- Identify which task types trigger shadow coding — these likely have poor context templates
- Improve context quality for those task types (better AC, patterns, architectural constraints)
- If AI genuinely cannot handle certain task types, create an explicit “manual” work type with its own tracking
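One way to seed the non-punitive audit is to flag commits whose timestamps fall outside any recorded AI session window. The sketch below is a minimal illustration, assuming session logs can be exported as start/end timestamps; the loader and its example window are placeholders for whatever your tooling actually provides.

```python
"""Flag commits whose timestamps fall outside any recorded AI session window."""
import subprocess
from datetime import datetime, timezone

def load_session_windows() -> list[tuple[datetime, datetime]]:
    # Hypothetical loader: replace with however your tooling exports session logs.
    return [
        (datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc),
         datetime(2024, 5, 2, 13, 30, tzinfo=timezone.utc)),
    ]

def commits_outside_sessions(windows) -> list[str]:
    # %H = commit hash, %ct = committer timestamp (unix epoch)
    log = subprocess.run(
        ["git", "log", "--format=%H %ct"],
        capture_output=True, text=True, check=True,
    ).stdout
    flagged = []
    for line in log.splitlines():
        sha, epoch = line.split()
        when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
        if not any(start <= when <= end for start, end in windows):
            flagged.append(sha)  # a candidate for discussion, not proof of shadow coding
    return flagged

if __name__ == "__main__":
    for sha in commits_outside_sessions(load_session_windows()):
        print("outside any session window:", sha)
```

A flagged commit is a conversation starter ("what made this easier to do by hand?"), not an accusation.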
Prevention:
- Normalize manual intervention as a documented, tracked activity — not a shameful secret
- Track Manual Intervention Rate as a diagnostic metric, not a performance metric
- Ensure Supervisors understand that high MIR on certain task types is a signal to improve context, not a personal failure
- Session logs should be reviewed in Quality Sweeps — not to police, but to identify context gaps
PF-2: Rubber Stamping
Description: Supervisors accept AI output without meaningful verification. Code passes QG-2 (automated tests, lint, types) but nobody checks whether it actually solves the problem correctly, handles edge cases, or fits the architecture. The human becomes a button-presser.
Early warning signs:
- Defect Escape Rate rises while First-Pass Success Rate stays suspiciously high (everything “passes” but bugs appear later)
- Code review comments are absent or perfunctory (“LGTM”, “looks good”)
- Verification time per task drops below plausible minimums (a 200-line change reviewed in 30 seconds)
- Architectural drift: patterns diverge across the codebase because nobody enforces consistency
- Security vulnerabilities appear in production that a human reviewer should have caught
Recovery action:
- Mandatory Quality Sweep focused on the last 2 Increments — compare merged code against requirements
- Introduce QG-3 (Verification Gate) if not already active: independent review by a different Supervisor
- Establish minimum review time thresholds proportional to change size (see the sketch after this list)
- Create an “AI Review Checklist” (see Guide 02) and require it to be completed per task
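A size-proportional review threshold can be checked mechanically. The sketch below is one illustration; the constants and the review record fields (`task`, `lines_changed`, `review_minutes`) are assumptions, not values prescribed by SENAR.

```python
"""Flag reviews whose recorded verification time is implausibly low for the change size."""

BASE_MINUTES = 5            # floor for any change, however small
MINUTES_PER_100_LINES = 10  # scales with diff size; tune to your codebase

def minimum_review_minutes(lines_changed: int) -> float:
    return BASE_MINUTES + (lines_changed / 100) * MINUTES_PER_100_LINES

def flag_rubber_stamps(reviews: list[dict]) -> list[dict]:
    """Each review dict is assumed to carry 'task', 'lines_changed', 'review_minutes'."""
    return [
        r for r in reviews
        if r["review_minutes"] < minimum_review_minutes(r["lines_changed"])
    ]

# Example: a 200-line change "reviewed" in 30 seconds is flagged; a 40-line change
# reviewed in 12 minutes is not.
print(flag_rubber_stamps([
    {"task": "T-118", "lines_changed": 200, "review_minutes": 0.5},
    {"task": "T-119", "lines_changed": 40, "review_minutes": 12},
]))
```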
Prevention:
- QG-3 with independent verification is the strongest safeguard
- Rotate Verification Engineer assignments so reviewers stay sharp
- Track Defect Escape Rate prominently — it is the primary indicator of rubber stamping
- Quality Sweeps must sample actual code, not just metrics
PF-3: Gate Bypass Normalization
Description: Quality Gates exist but are routinely bypassed with documented exceptions. “We’ll fix it later” becomes the default. Gate bypass was designed for genuine emergencies; when it becomes normal, the gates are decorative.
Early warning signs:
- More than 10% of Tasks have gate bypass entries
- The same gate is bypassed repeatedly (e.g., QG-0 skipped because “we already know what to do”)
- Bypass justifications become formulaic (“time pressure”, “will address in next increment”)
- Tech debt tickets created from bypasses are never completed
- New team members learn to bypass gates as standard practice
Recovery action:
- Audit all bypass entries from the last 2 Increments — categorize by gate and justification
- For each repeatedly-bypassed gate, determine: is the gate too strict (adjust criteria) or is the team undertrained (invest in process)?
- Require bypass approval from someone other than the Supervisor doing the work
- Create a “bypass budget” per Increment — a maximum number of allowed bypasses. When exhausted, the gate is hard-blocked
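A bypass budget is easier to honor when it is enforced by tooling rather than willpower. The following is a minimal sketch of the idea; the default budget of 3 and the shape of the bypass entries are illustrative assumptions.

```python
"""Enforce a per-Increment bypass budget: once it is spent, the gate is hard-blocked."""

class BypassBudgetExceeded(Exception):
    pass

def check_bypass_budget(bypass_entries: list[dict], increment_id: str, budget: int = 3) -> int:
    """Return the remaining budget, or raise when the Increment's budget is exhausted."""
    used = sum(1 for entry in bypass_entries if entry.get("increment") == increment_id)
    if used >= budget:
        raise BypassBudgetExceeded(
            f"Increment {increment_id}: {used} bypasses recorded, budget is {budget}. "
            "Gate is hard-blocked; fix the underlying issue or revisit the gate criteria."
        )
    return budget - used
```

Wiring this into the gate tooling turns "we'll fix it later" from a habit into a scarce, visible resource.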
Prevention:
- Track bypass rate as a metric reviewed at Increment Retrospectives
- Retrospective must include a “bypass review” — every bypass is discussed, not just counted
- Gate criteria should be reviewed quarterly: if a gate is routinely bypassed, it may need adjustment rather than enforcement
- Distinguish between “gate too strict” (criteria problem) and “discipline too low” (culture problem) — the responses are different
PF-4: Knowledge Base Staleness
Description: The Knowledge Base was populated during initial adoption but is no longer maintained. Entries refer to old patterns, deprecated APIs, or architectural decisions that have been superseded. AI Agents receive stale context and produce output based on outdated information.
Early warning signs:
- Knowledge Capture Rate drops below 0.3 entries per task for 3+ consecutive sessions
- New Supervisors report that KB entries contradict what they see in the codebase
- AI Agents produce output using patterns documented in KB that the team has since abandoned
- “Last reviewed” dates on KB entries are older than 3 Increments
- Dead End entries reference tools or libraries that are no longer in the stack
Recovery action:
- Schedule a dedicated KB audit session — treat it as a Task, not a side activity
- Mark every entry with a freshness status: current, needs-review, deprecated (a sketch of such an audit follows this list)
- Delete or archive deprecated entries immediately — stale context is worse than no context
- Assign KB maintenance as an explicit responsibility (Knowledge Engineer role)
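A freshness audit is easy to automate once entries carry a verification date. The sketch below assumes KB entries are markdown files containing a `last_verified: YYYY-MM-DD` line; the field name and the 90-day threshold are illustrative assumptions to adjust to your Increment cadence.

```python
"""Flag knowledge base entries whose last-verified date is stale or missing."""
from datetime import date, timedelta
from pathlib import Path
import re

STALE_AFTER = timedelta(days=90)  # roughly 3 Increments; tune to your cadence

def audit_kb(kb_dir: str) -> dict[str, list[Path]]:
    report: dict[str, list[Path]] = {"needs-review": [], "no-date": []}
    for entry in Path(kb_dir).rglob("*.md"):
        match = re.search(r"last_verified:\s*(\d{4}-\d{2}-\d{2})", entry.read_text())
        if not match:
            report["no-date"].append(entry)
        elif date.today() - date.fromisoformat(match.group(1)) > STALE_AFTER:
            report["needs-review"].append(entry)
    return report

if __name__ == "__main__":
    for status, entries in audit_kb("knowledge-base").items():
        for entry in entries:
            print(f"{status}: {entry}")
```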
Prevention:
- Quality Sweep ceremony (Standard, Section 7.4) should include KB freshness audit
- Knowledge entries should have a “last verified” timestamp, updated when someone confirms the entry is still accurate
- Set a target Knowledge Capture Rate (0.3-0.5 entries/task) and review at Retrospectives
- When a Task produces output that contradicts a KB entry, the entry must be updated before the Task is marked done
PF-5: Dead End Documentation Stops
Description: Teams stop documenting failed approaches. Dead Ends are one of the highest-value knowledge types — they prevent AI Agents from repeating mistakes. When teams are under pressure, Dead End documentation is the first thing dropped because it documents what didn’t work, which feels low-priority.
Early warning signs:
- Dead End entries per Increment drop to zero
- AI Agents attempt approaches that were already tried and abandoned in previous sessions
- Session durations increase because the AI is exploring paths that a human remembers failing but didn’t document
- Handoff documents mention “we tried X but it didn’t work” without creating a KB entry
Recovery action:
- Review the last 5 session handoffs for mentions of failed approaches — create Dead End entries for each
- Add “Dead End capture” as an explicit step in the Session End ceremony
- Make Dead End documentation part of QG-2 acceptance criteria for complex tasks
Prevention:
- Session End template should include a “Failed approaches” section that becomes a Dead End entry
- Knowledge Capture Rate metric should be decomposed by entry type — a healthy KB has a mix of decisions, patterns, gotchas, and dead ends
- Celebrate Dead End documentation: it is not a record of failure, it is a map of the territory
PF-6: Session Discipline Erosion
Description: Sessions stretch beyond reasonable limits. Supervisors skip Session Start (no context loading, no task selection) or Session End (no handoff, no metrics capture). The session boundary — SENAR’s primary rhythm mechanism — dissolves into continuous, unstructured work.
Early warning signs:
- Session durations exceed 8 hours regularly
- Session Start entries have no task list or context verification
- Handoff documents are empty or missing
- Checkpoints are never taken
- Tool call counts per session exceed 500 (fatigue territory)
- Quality of AI output degrades in the second half of sessions (context window exhaustion)
Recovery action:
- Enforce hard session time limits — 4-6 hours is the recommended range
- Make Session End a blocking step: the next session cannot start without a completed handoff
- Implement automatic checkpoint reminders at 40-60 tool call intervals (see the sketch after this list)
- Review the last 10 sessions for missing handoffs and create them retroactively
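Checkpoint reminders and a hard tool call budget are simple to wire into session tooling. The class below is an illustrative sketch, not SENAR-provided tooling; the default thresholds mirror the ranges suggested in this guide.

```python
"""Track tool calls within a session, remind at checkpoint intervals, stop at a hard limit."""

class SessionBudget:
    def __init__(self, checkpoint_every: int = 50, hard_limit: int = 400):
        self.calls = 0
        self.checkpoint_every = checkpoint_every
        self.hard_limit = hard_limit

    def record_tool_call(self) -> None:
        self.calls += 1
        if self.calls % self.checkpoint_every == 0:
            # Nudge the Supervisor toward a checkpoint at regular intervals.
            print(f"[reminder] {self.calls} tool calls reached: take a checkpoint and verify state")
        if self.calls >= self.hard_limit:
            # Past this point, context exhaustion and fatigue make further work unreliable.
            raise RuntimeError(
                f"Tool call budget ({self.hard_limit}) exhausted; "
                "end the session with a handoff instead of pushing on"
            )
```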
Prevention:
- Session Start and Session End are ceremonies, not optional steps
- Tooling should enforce: session cannot start if previous session has no handoff
- Set a maximum tool call budget per session (300-500 depending on task complexity)
- Supervisor fatigue is real: verification quality degrades after 4-6 hours. The process must acknowledge human limits
PF-7: Vanity Metrics
Description: Teams optimize metrics without improving outcomes. Throughput is inflated by splitting tasks into trivially small units. First-Pass Success Rate is gamed by lowering acceptance criteria. Lead Time is reduced by starting tasks before they’re ready. The numbers look good; the software does not.
Early warning signs:
- Throughput increases dramatically but stakeholder satisfaction doesn’t
- Average task complexity drops to “trivial” for 80%+ of tasks
- FPSR is above 95% but Defect Escape Rate is also above 10% (contradictory signals)
- Lead Time decreases but deployed features are incomplete or buggy
- Teams resist adding new metrics or changing existing ones
- Cost per Task drops but total Increment cost doesn’t
Recovery action:
- Cross-reference metrics: Throughput × DER, FPSR × post-deployment defects, Lead Time × rework rate (see the sketch after this list)
- Introduce “outcome metrics” alongside process metrics: deployment frequency, change failure rate, time to restore
- Redefine task granularity guidelines — a task should represent a meaningful, verifiable change
- Retrospective should compare metrics against actual stakeholder outcomes
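Cross-referencing can be automated as a small consistency check over increment-level metrics. The sketch below is illustrative; the metric keys and the thresholds are assumptions to be replaced with your own baselines.

```python
"""Flag contradictory metric combinations that suggest gaming rather than improvement."""

def metric_warnings(m: dict) -> list[str]:
    """m is assumed to hold increment-level metrics, e.g.
    {'fpsr': 0.96, 'der': 0.12, 'throughput_delta': 0.4,
     'stakeholder_delta': 0.0, 'lead_time_delta': -0.3, 'rework_rate': 0.2}."""
    warnings = []
    if m["fpsr"] > 0.95 and m["der"] > 0.10:
        warnings.append("FPSR and DER both high: acceptance criteria may be too weak")
    if m["throughput_delta"] > 0.25 and m["stakeholder_delta"] <= 0:
        warnings.append("Throughput up but stakeholder outcomes flat: check task granularity")
    if m.get("lead_time_delta", 0) < -0.25 and m.get("rework_rate", 0) > 0.15:
        warnings.append("Lead Time down but rework up: tasks may be started before they're ready")
    return warnings
```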
Prevention:
- Never reward or punish based on a single metric
- Metrics exist to diagnose, not to judge — make this explicit in team norms
- DER is the hardest metric to game because it’s measured by external reality (production bugs)
- Quality Sweeps should validate that metrics tell a consistent story
Organizational Failures
OF-1: Supervisor Burnout
Description: Constant verification of AI output creates cognitive fatigue. The Supervisor must maintain deep technical understanding while processing large volumes of generated code. Unlike traditional development where the developer builds context incrementally, the Supervisor must rapidly context-switch between tasks and verify unfamiliar code patterns.
Early warning signs:
- Verification quality drops in afternoon sessions (rubber stamping after lunch)
- Supervisors report feeling “drained” despite not writing code
- Increased use of gate bypasses in later sessions of the week
- Sick days and turnover increase
- Supervisors start preferring simple tasks and avoiding complex ones
Recovery action:
- Reduce session length to 3-4 hours with mandatory breaks
- Rotate between deep verification work and lighter tasks (documentation, knowledge curation)
- Implement pair supervision: two Supervisors verify each other’s complex tasks
- Ensure Supervisors have autonomy in task selection and session planning
Prevention:
- Design sessions with verification load in mind — not all tasks require the same level of scrutiny
- Budget time for non-verification work: KB maintenance, context improvement, learning
- Track tool calls and session duration — these are leading indicators of burnout
- Respect the difference between “AI can work 24 hours” and “humans cannot verify for 24 hours”
OF-2: Skills Atrophy
Description: Supervisors stop writing code entirely and gradually lose the ability to evaluate AI output at a deep technical level. They can spot surface-level issues (typos, obvious bugs) but miss architectural problems, performance anti-patterns, and subtle security issues. The “judgment over keystrokes” value requires that judgment remains sharp.
Early warning signs:
- Supervisors struggle to explain why a particular implementation approach is wrong
- Architectural decisions are increasingly delegated to the AI with no human challenge
- Code reviews become “does it work?” rather than “is this the right approach?”
- Supervisors avoid tasks in unfamiliar parts of the codebase
- New technologies are adopted without Supervisor understanding of tradeoffs
Recovery action:
- Institute “manual implementation days” — periodic sessions where Supervisors write code without AI assistance
- Require Supervisors to write the architectural approach before the AI implements it
- Rotate Supervisors across different parts of the codebase to force learning
- Include “explain the implementation” as part of verification: if you can’t explain it, you can’t verify it
Prevention:
- Supervisors should occasionally implement complex tasks manually to stay sharp
- Technical learning time should be budgeted — not as overhead, but as capability maintenance
- The Context Architect role keeps Supervisors engaged in system design, not just task verification
- Quality Sweeps that require deep code understanding serve double duty: quality assurance and skill maintenance
OF-3: AI Tool Vendor Lock-in
Description: The SENAR implementation becomes deeply coupled to a specific AI tool, model, or platform. Session templates, knowledge base format, context injection mechanisms, and verification workflows all assume a particular vendor’s API and capabilities. Switching becomes prohibitively expensive.
Early warning signs:
- All context templates use vendor-specific syntax or features
- Knowledge base format is optimized for one model’s context window size
- Session management is embedded in a proprietary platform with no export
- Cost tracking depends on vendor-specific billing APIs
- Team expertise is in “using Tool X” rather than “supervising AI agents”
Recovery action:
- Audit all touchpoints between SENAR process and AI tooling — create an abstraction inventory
- Define a vendor-neutral interface for context injection, output capture, and session management (see the sketch after this list)
- Extract knowledge base into a portable format (structured markdown, JSON)
- Run parallel sessions with alternative tools to validate portability
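One way to keep the seam thin is a small, vendor-neutral interface that every tool adapter must implement. The sketch below uses a Python Protocol; the method names are illustrative. The point is that templates, the knowledge base, and session tooling depend only on this surface, never on a specific vendor's API.

```python
"""A thin, vendor-neutral seam between the SENAR process and any AI tool."""
from typing import Protocol

class AgentAdapter(Protocol):
    def inject_context(self, entries: list[str]) -> None:
        """Provide KB entries, constraints, and task context to the agent."""
        ...

    def run_task(self, instructions: str) -> str:
        """Execute one task and return the agent's output for verification."""
        ...

    def export_session(self) -> dict:
        """Return a portable record (tool calls, outputs, cost) for handoffs and metrics."""
        ...

# Each vendor gets one small adapter class implementing AgentAdapter;
# everything else in the SENAR tooling stays vendor-agnostic.
```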
Prevention:
- SENAR deliberately does not prescribe specific AI tools — maintain this neutrality
- Keep knowledge base in a format readable by any LLM (structured text, not proprietary embeddings)
- Session management should be a thin layer, not a platform
- Periodically evaluate alternative tools to maintain optionality
OF-4: Over-Process (Bureaucratic SENAR)
Description: SENAR becomes an end in itself. Every task requires exhaustive documentation. Every session has a 30-minute startup ceremony. Every knowledge entry needs three approvals. The methodology that was designed to increase throughput becomes the primary obstacle to throughput.
Early warning signs:
- Session Start takes longer than 30 minutes
- More time is spent on SENAR ceremonies than on actual work
- Supervisors create “process compliance” tasks that produce no software value
- New Supervisors are overwhelmed by the number of required steps
- Teams maintain two systems: the “official” SENAR process and their actual workflow
Recovery action:
- Measure “process overhead ratio”: time on ceremonies / time on productive work. If it exceeds 20%, the process is too heavy
- Revisit which SENAR level (Core/Team/Enterprise) actually matches the team’s needs — many teams over-adopt
- Automate everything that can be automated — manual ceremony steps should be minimized
- Apply SENAR’s own principle: “Enforcement over Ceremony” — if a step can be automated, it shouldn’t be a manual ceremony
Prevention:
- Start with SENAR Core and upgrade only when specific pain points justify additional process
- Every ceremony should have a defined timebox
- Every manual step should have a justification: “what failure does this prevent?” If the answer is unclear, remove the step
- Periodically run a “process audit” — are all steps still earning their cost?
OF-5: Under-Process (Perpetual L2)
Description: The team adopts SENAR Core (Maturity Level 2) and never progresses. Tasks exist, Start Gate and Done Gate are enforced, FPSR and DER are tracked — but there’s no independent verification, no knowledge capture discipline, no federation coordination. The team gets the minimum benefit and plateaus.
Early warning signs:
- Same process for 6+ months with no evolution
- Knowledge base growth has flatlined
- DER is stable but never improves
- Cross-project dependencies are managed ad-hoc (email, chat) rather than through federation
- No Increment Retrospectives — metrics are collected but never reviewed
- Team says “SENAR is working fine” but can’t point to specific improvements
Recovery action:
- Run a maturity assessment across all 6 dimensions (Section 12.2) — identify the weakest
- Pick ONE dimension to improve in the next Increment — don’t try to jump from L2 to L3 across all dimensions simultaneously
- Assign improvement ownership to a specific person (Flow Manager responsibility)
- Set a measurable target: “Reduce DER from 12% to 8% by end of Increment”
Prevention:
- Include “process improvement” as a standing item in Increment Planning
- The Flow Manager role exists precisely to drive process evolution — ensure it’s staffed
- Set explicit maturity targets with timelines
- Review the SENAR Maturity Model (Section 12) at each Retrospective
OF-6: Supervisor Resistance
Description: Developers refuse to adopt the Supervisor role. They see SENAR as de-skilling: taking away the creative, satisfying work of writing code and replacing it with bureaucratic verification of machine output. The resistance may be quiet (passive non-compliance) or loud (active pushback).
Early warning signs:
- High shadow coding rates (PF-1) — developers code and retroactively create tasks to match
- Supervisors describe themselves as “reviewers” or “testers” rather than “engineers”
- Hiring for Supervisor roles is difficult — candidates want “real engineering” positions
- Team morale surveys show dissatisfaction with role definition
- Experienced developers leave for traditional development roles
Recovery action:
- Address the identity question directly: Supervisor is a more senior role than Developer, not a lesser one
- Highlight the aspects of the role that require deeper skill: architectural judgment, context design, verification of complex systems
- Allow Supervisors to choose when to code manually — autonomy reduces resistance
- Share concrete examples of how the Supervisor role amplifies individual impact (throughput multiplied by AI)
Prevention:
- Frame the transition as career progression, not role diminishment
- Ensure compensation reflects the increased scope and responsibility of the Supervisor role
- Provide the Transition Guide (Guide 04) and allow adequate adjustment time
- Create communities of practice where Supervisors share techniques and celebrate wins
OF-7: Knowledge Silos
Description: Each Supervisor develops their own undocumented patterns for context preparation, task structuring, and AI interaction. These patterns are effective but exist only in that Supervisor’s head. When that person is unavailable, other Supervisors can’t replicate the results.
Early warning signs:
- Throughput varies dramatically between Supervisors on similar task types
- Knowledge base entries are sparse or generic — the “real” context is in personal notes
- Supervisors can’t cover for each other during absences
- Handoff documents don’t capture the “tricks” that make a session productive
- New Supervisors take months to reach baseline productivity
Recovery action:
- Run “context pairing” sessions: have high-performing Supervisors work alongside others, capturing their undocumented patterns
- Convert personal notes and templates into shared knowledge entries
- Standardize context templates for common task types
- Include “knowledge sharing” as an explicit goal in Quality Sweeps
Prevention:
- Knowledge Capture Rate metric should be reviewed per Supervisor, not just in aggregate
- Context templates should be shared artifacts, not personal files
- Session handoffs should include “what worked well” for context preparation, not just task status
- The Knowledge Engineer role exists to prevent silos — ensure it’s actively practiced
Technical Failures
TF-1: Model Change Breaks Baselines
Description: The AI model is updated (new version, different provider, or fine-tuning change) and all established baselines become invalid. Throughput, FPSR, Lead Time, and Cost per Task all shift unpredictably. The team loses its ability to detect process problems because the signal is buried in model-change noise.
Early warning signs:
- Sudden shift in all metrics after a model update
- Tasks that previously had high FPSR start failing
- Cost per Task changes significantly (often in both directions for different task types)
- Context templates that worked well now produce worse output
- AI behavior changes subtly: different coding style, different library preferences, different error handling patterns
Recovery action:
- Treat a model change as a “baseline reset event” — document the change and mark metrics before/after
- Run a sample of representative tasks on the new model to establish new baselines
- Update context templates and knowledge entries that assumed specific model behavior
- Allow 1-2 sessions of reduced throughput expectation while the team calibrates
Prevention:
- Pin model versions in production — don’t auto-update
- When evaluating a new model, run a parallel evaluation on a representative task set before switching
- Keep context templates model-neutral: describe what you want, not how the model should produce it
- Maintain a “model change log” that records which model version was used for each Increment’s baselines
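A model change log can be as simple as recording the pinned model version alongside each Increment's baselines and refusing to trend metrics across a change. The sketch below is a minimal illustration; the model names and numbers are hypothetical.

```python
"""Record the pinned model version with each baseline; do not trend across model changes."""
from dataclasses import dataclass

@dataclass
class Baseline:
    increment: str
    model: str          # a pinned version string, never "latest"
    fpsr: float
    der: float
    cost_per_task: float

def comparable(a: Baseline, b: Baseline) -> bool:
    """Metrics are only trend-comparable if the model did not change in between."""
    return a.model == b.model

# Hypothetical entries in the model change log:
before = Baseline("2024-I3", "model-x-2024-03-01", fpsr=0.88, der=0.06, cost_per_task=4.10)
after = Baseline("2024-I4", "model-x-2024-06-15", fpsr=0.81, der=0.09, cost_per_task=3.20)

if not comparable(before, after):
    print("Model changed between increments: treat this as a baseline reset event")
```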
TF-2: Context Window Limitations
Description: The AI’s context window is finite. As projects grow, the context required for a task (codebase knowledge, architectural constraints, related patterns, previous decisions) exceeds what fits in a single session. Tasks are artificially split not because they’re naturally separable, but because the AI can’t hold enough context.
Early warning signs:
- Tasks are split into sub-tasks that don’t make sense as independent units
- AI “forgets” instructions given earlier in the session
- Late-session output contradicts early-session decisions
- Supervisors spend increasing time re-explaining context after checkpoints
- Complex refactoring tasks fail because the AI can’t see enough of the codebase simultaneously
Recovery action:
- Restructure context injection: prioritize high-value context (architectural constraints, active patterns) over low-value context (full file contents)
- Use knowledge base entries as compressed context: a 5-line decision entry replaces reading 500 lines of code
- Implement progressive context loading: start with architecture overview, load details as needed (see the sketch after this list)
- Accept that some tasks require multiple sessions — design handoff documents to carry context across the boundary
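Progressive context loading amounts to assembling context in priority order under an explicit token budget. The sketch below is illustrative: it uses a crude word-count estimate in place of a real tokenizer, and the example items are hypothetical.

```python
"""Assemble session context under a token budget, highest-value items first."""

def estimate_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)  # rough heuristic, good enough for budgeting

def assemble_context(items: list[tuple[int, str]], budget: int) -> list[str]:
    """items: (priority, text) pairs; lower number means loaded first."""
    selected, used = [], 0
    for _, text in sorted(items, key=lambda item: item[0]):
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # skip, or replace with a KB-style compressed summary instead
        selected.append(text)
        used += cost
    return selected

context = assemble_context(
    [
        (0, "Architectural constraint: services never import repository classes directly."),
        (1, "Decision D-42: use the event bus for cross-module notifications."),
        (2, "Full module source is loaded last, and only if budget remains..."),
    ],
    budget=4_000,
)
```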
Prevention:
- Knowledge base entries are context compression tools — invest in high-quality, concise entries
- Architectural documentation should be structured for AI consumption: clear, concise, with explicit constraints
- Design task granularity around natural boundaries, not context window size
- Monitor context window utilization and establish team conventions for context budget allocation
TF-3: Architecturally Wrong but Gate-Passing Code
Description: AI generates code that passes all automated quality gates (tests pass, types check, linter clean, security scan clear) but is architecturally wrong. It might use the wrong pattern, create inappropriate coupling, bypass established abstractions, or introduce a dependency that will cause problems in 6 months. Automated gates verify correctness; they cannot verify judgment.
Early warning signs:
- Code reviews increasingly find “it works but it’s wrong” issues
- Similar functionality is implemented differently in different parts of the codebase
- Established abstractions are bypassed — AI creates new patterns instead of using existing ones
- Dependency graph grows in unexpected directions
- Refactoring costs increase over time as architectural drift accumulates
Recovery action:
- Strengthen the QG-3 Verification Gate with explicit architectural review criteria
- Create “architectural constraint” knowledge entries that are injected into every session context
- Run an architectural health audit — identify and document all established patterns
- Consider architecture-specific linting rules that can be automated (dependency constraints, layer violations)
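Layer rules are one architectural fitness function that automates cleanly. The sketch below scans imports against an allowed-dependency map; the layer names and policy are examples, and mature projects may prefer an existing contract tool such as import-linter instead.

```python
"""A tiny architectural fitness function: fail when a module imports a forbidden layer."""
import ast
import sys
from pathlib import Path

# layer -> layers it may import from (example policy; adjust to your architecture)
ALLOWED = {
    "api": {"services", "domain"},
    "services": {"domain"},
    "domain": set(),
}

def layer_of(name: str) -> str | None:
    head = name.split(".")[0]
    return head if head in ALLOWED else None

def violations(src_root: str) -> list[str]:
    found = []
    for path in Path(src_root).rglob("*.py"):
        source_layer = layer_of(path.relative_to(src_root).parts[0])
        if source_layer is None:
            continue  # file is not inside a known layer
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            for target in targets:
                target_layer = layer_of(target)
                if (target_layer and target_layer != source_layer
                        and target_layer not in ALLOWED[source_layer]):
                    found.append(f"{path}: layer '{source_layer}' may not import '{target}'")
    return found

if __name__ == "__main__":
    problems = violations("src")
    print("\n".join(problems))
    sys.exit(1 if problems else 0)  # non-zero exit makes this usable as a CI gate
```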
Prevention:
- Architectural decisions must be documented in the knowledge base, not just in the code
- Context templates for each area of the codebase should include “use this pattern, not that pattern” instructions
- Quality Sweeps should include architectural consistency checks
- The Context Architect role is critical here: they ensure AI receives the right architectural constraints
- Automated architectural fitness functions (dependency rules, layer checks) convert judgment into enforcement
TF-4: Phantom Dependencies
Description: Across sessions, AI agents introduce dependencies (libraries, services, API calls, shared state) that are not explicitly tracked. Each dependency is individually reasonable, but the aggregate creates a fragile system. No single person or document captures the full dependency graph.
Early warning signs:
- Build times increase without obvious cause
- Package manifest (package.json, requirements.txt, go.mod) grows steadily
- “Works on my machine” issues increase — environment setup is unreliable
- Upgrading one dependency triggers cascading failures
- Nobody can explain why a particular dependency was added
- Service coupling increases: changes in one service require changes in others
Recovery action:
- Run a dependency audit: for every dependency, find the Task that introduced it and verify it’s still needed
- Create a “dependency decision” knowledge entry type — every new dependency requires a documented justification
- Implement dependency change detection in CI — any new dependency triggers a review flag (see the sketch after this list)
- Prune unused dependencies
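Dependency change detection can run as a CI step that diffs the manifest between the base branch and the head. The sketch below handles a requirements.txt-style manifest; other ecosystems need their own parsing, and the justification lookup is left as a comment.

```python
"""Flag new dependencies on a branch so CI can require a documented justification."""
import re
import subprocess

def manifest_deps(ref: str, manifest: str = "requirements.txt") -> set[str]:
    text = subprocess.run(
        ["git", "show", f"{ref}:{manifest}"],
        capture_output=True, text=True, check=True,
    ).stdout
    deps = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # Keep only the package name, dropping version specifiers and extras.
            deps.add(re.split(r"[=<>\[;~ ]", line, maxsplit=1)[0].lower())
    return deps

def new_dependencies(base: str = "origin/main", head: str = "HEAD") -> set[str]:
    return manifest_deps(head) - manifest_deps(base)

if __name__ == "__main__":
    added = new_dependencies()
    if added:
        # CI would fail here unless a "dependency decision" KB entry exists for each.
        print("New dependencies requiring justification:", ", ".join(sorted(added)))
```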
Prevention:
- Add dependency justification to QG-2 acceptance criteria: “no new dependencies without documented reason”
- Track dependency count as a secondary metric
- Quality Sweeps should include dependency review
- Context templates should list approved dependencies and their purposes — the AI will prefer dependencies it already knows about
TF-5: Meaningless Test Coverage
Description: AI generates tests that achieve high coverage metrics but don’t actually verify meaningful behavior. Tests mirror the implementation: they test that the code does what the code does, not that the code does what the requirements say. This is particularly insidious because the coverage number looks excellent.
Early warning signs:
- Test coverage is above 90% but Defect Escape Rate is also high
- Tests break when implementation changes but not when behavior changes (brittle, tightly coupled tests)
- Test descriptions are generic: “should work correctly”, “handles input”
- Tests don’t cover edge cases, error paths, or boundary conditions
- Mutation testing kills few mutants despite high line coverage
- Integration tests are actually unit tests with mocked-out dependencies
Recovery action:
- Run mutation testing to evaluate actual test effectiveness — coverage means nothing without kill rate
- Review test quality in Quality Sweeps: do tests verify requirements or mirror implementation?
- Require acceptance criteria to include specific test scenarios — AI writes tests from scenarios, not from code (see the sketch after this list)
- Introduce property-based testing for critical paths
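When acceptance criteria define scenarios, tests can be written from those scenarios instead of from the implementation. The sketch below shows a scenario-derived test plus a property-based test using Hypothesis; `apply_discount` and its module are hypothetical stand-ins for the code under test.

```python
"""Tests derived from an acceptance-criteria scenario, not mirrored from the code."""
from hypothesis import given, strategies as st

from shop.pricing import apply_discount  # hypothetical module under test

# Acceptance criterion (written before implementation):
#   Given an order total of 120.00 and the 10% discount code,
#   when the discount is applied,
#   then the total is 108.00, and no discount ever produces a negative total.

def test_discount_applies_ten_percent():
    assert apply_discount(total=120.00, code="SAVE10") == 108.00

@given(total=st.floats(min_value=0, max_value=10_000, allow_nan=False))
def test_discount_never_exceeds_total_or_goes_negative(total):
    # Property derived from the requirement, independent of how the discount is computed.
    assert 0 <= apply_discount(total=total, code="SAVE10") <= total
```

Tests like these break when behavior changes and survive refactoring, which is the opposite of the brittle, implementation-mirroring tests this failure mode produces.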
Prevention:
- Acceptance criteria should define test cases: “given X, when Y, then Z” — not “write tests for this function”
- Track mutation testing score alongside coverage
- QG-2 should include test quality checks, not just coverage thresholds
- Knowledge base should contain testing patterns: “how we test X-type functionality”
- Separate the test-writing agent prompt from the implementation agent prompt — tests should verify requirements, not mirror code
Summary
No failure mode in this document is theoretical. Every one has been observed in practice, either in AI-native development or in analogous patterns from traditional Agile/SAFe adoption.
The common thread across all failure modes is the same: when the uncomfortable parts of the process are skipped, the process stops working. Verification is uncomfortable. Documentation is tedious. Dead Ends feel like wasted time. Gate compliance feels like bureaucracy. But these are the load-bearing elements of SENAR — remove them and the structure collapses.
The best defense is not more process but better feedback loops:
- Metrics that cross-reference — no single metric tells the truth alone
- Quality Sweeps that sample reality — not just metrics dashboards but actual code and actual knowledge entries
- Retrospectives that ask hard questions — not “are our numbers good?” but “is our software good?”
- A culture where documenting failure is valued — Dead Ends and failure mode awareness prevent repeated mistakes