Initial Lessons Learned with Agentic Programming

My journey through four phases of learning to work effectively with AI agents—from failed attempts at prompt engineering to discovering a process that actually delivers production-ready code.

It’s honestly not a secret: we are finally getting technology to where I imagined it back in the early 90s, when I was in high school. I remember having discussions with my friends about the potential and the problems we would face when we got here. It’s been an interesting journey to watch, but now we’re here, so what do we do with this technology I dreamed about for thirty years? We do cool things!

When OpenAI announced ChatGPT to the world, I immediately started using it to write code. I’m huge into automation and have a list of projects and ideas I would like to get executed. The problem is that I’m one measly, weak human who requires sleep, food and such… I would open the console, ask a question, get a response or a code snippet, test, iterate, test, iterate, test, rinse and repeat. Sometimes I would get useful code, but other times I would ask for something resembling a simple sandwich and be served a 12-course meal with all the fixings but not a sandwich in sight. Exceptionally exciting if you want to see what the technology can do, but not very useful when you have a detailed task list and a deadline.

So began my journey to figure out what actually works.

Phase 1 - Prompt Engineering

Initially we believed that the key to making this work was to improve the prompt. It’s a computer, after all, and it’s supposed to follow instructions, right? If it’s making a mistake, it’s probably because I wasn’t clear enough or left a detail out… something like that.

So we ended up writing long, detailed prompts with examples, gotchas, lists of do/don’t actions: everything we could think of to make sure the system knew what we wanted.

This only got us so far. It worked a bit, but it definitely didn’t give the results we needed and certainly wouldn’t ship production-ready code. We also noticed that the agents had a tendency to overcomplicate simple solutions: the wheelbarrow becomes a tricycle big rig. It works, but it’s definitely not what you need: way too complicated, and a city-sized mountain of technical debt right out of the gate.

Not the solution.

Phase 2 - Gates/Guardrails

As a systems admin, I want to trust the process and remove the human element as much as possible. I also believe that people, given enough opportunity, will find ways around constraints. So we took a different approach: we don’t let the agent do the thing; we build a process that ensures it does the thing correctly, so the process has permission but the agent doesn’t.

That didn’t work out. The agent would start working, realize there was a guardrail, and spend its time and tokens trying to circumvent the git hooks rather than fixing the linting problems. It would spend more effort trying to get around the guardrail than actually doing the work.
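For context, the guardrails themselves were simple automated checks. Here’s a minimal sketch of the kind of pre-commit hook involved, written in Python for illustration (the linter name and the exact checks are placeholders, not our actual setup):

#!/usr/bin/env python3
# .git/hooks/pre-commit -- illustrative guardrail: abort the commit if linting fails
import subprocess
import sys

# Run a linter over the repo ("ruff" here is just an example; any linter works)
result = subprocess.run(["ruff", "check", "."])
if result.returncode != 0:
    print("Lint errors found; fix them before committing.")
    sys.exit(1)  # a non-zero exit code makes git abort the commit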

Not the solution.

Phase 3 - Structure

A little digression first:

Jocko Willink has a great podcast, and one of the things he talks about is “Discipline Equals Freedom.” The principles are outlined in his book Discipline Equals Freedom: Field Manual Mk1-MOD1. The more structure you have, the more freedom you have to operate. This is counterintuitive to most people, who think structure is confining and limiting. It’s not; it’s liberating.

We started to figure out that if we gave the agent smaller, focused tasks within a clear process, along with definitions of success, it would stay on track and deliver somewhat better results.

Not perfect but better! Definitely making progress.

Phase 4 - Process

The key to the “Structure” component was the process. We needed to define a clear, step-by-step process for the agent to follow, covering not just actions but also intent. If the agent is aware of the intent and the final result, it’s better able to make decisions about the steps required to keep moving toward the goal instead of taking a hard left into hallucination land.

This is the part that made all the difference. We decided to develop a simple framework built around three fundamental questions:

  • What’s important to you and your team?
  • What does success look like?
  • What is the definition of done?

We found that these three simple questions help the agent understand the context of the task and what is required to complete it. Everything else is supporting information, but these questions are essential for making the right decisions.

This was about the time Claude Code was released, and we were finally able to leverage agents directly instead of using ChatGPT as a middleman with a lot of copy/paste. (I was a bit behind.) Claude Code also allowed for the use of slash-commands, and leveraging this functionality, we were able to develop a practical workflow.

The Evolution: From Failure to Success

Here’s a visual representation of the journey through these four phases:

%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'14px'}}}%%
graph TB
    Start([Start: Agentic Programming Journey]) --> Phase1

    %% Phase 1: Prompt Engineering
    Phase1[🔴 Phase 1: Prompt Engineering<br/>Long detailed prompts with examples,<br/>gotchas, do/don't lists]
    Phase1 --> Result1{Result}
    Result1 -->|❌ FAILED| Problem1[Problem: Overcomplicated solutions<br/>Wheelbarrow → Tricycle Big Rig<br/>Too much technical debt]
    Problem1 --> Phase2

    %% Phase 2: Gates/Guardrails
    Phase2[🔴 Phase 2: Gates/Guardrails<br/>Build process with guardrails<br/>to ensure correct execution]
    Phase2 --> Result2{Result}
    Result2 -->|❌ FAILED| Problem2[Problem: Agent circumvented guardrails<br/>Spent time fighting git-hooks<br/>instead of doing work]
    Problem2 --> Phase3

    %% Phase 3: Structure
    Phase3[🟡 Phase 3: Structure<br/>Smaller focused tasks within<br/>clear process & success definitions]
    Phase3 --> Result3{Result}
    Result3 -->|⚠️ BETTER| Problem3[Better but not perfect<br/>Making progress!]
    Problem3 --> Phase4

    %% Phase 4: Process - The Solution
    Phase4[🟢 Phase 4: Process<br/>Three Key Questions]
    Phase4 --> Questions{What's important?<br/>What does success look like?<br/>What is definition of done?}

    Questions --> Workflow[Three-Step Workflow]

    %% Step 1: Draft PRP
    Workflow --> Step1[Step 1: /draft-prp command]
    Step1 --> Draft[Create detailed draft document<br/>20-150k tokens]
    Draft --> Sections[Sections:<br/>• Constraints<br/>• Requirements<br/>• Definitions<br/>• Success Criteria<br/>• Definition of Done<br/>• Stories/Tasks<br/>• Acceptance Criteria<br/>• Test Cases]
    Sections --> Review{Review &<br/>Validate Draft}
    Review -->|Needs Changes| Step1
    Review -->|✅ Approved| Step2

    %% Step 2: Generate PR
    Step2[Step 2: /generate-prp command]
    Step2 --> Breakdown[Break large document into<br/>smaller discrete PRPs]
    Breakdown --> Step3[Step 3: /execute-prp command]
    Step3 --> Execute[Execute individual PRPs<br/>with quality gates]
    Execute --> Success[✅ SUCCESS: Production-Ready Code]

    %% Styling
    classDef failedPhase fill:#ffcccc,stroke:#cc0000,stroke-width:3px
    classDef betterPhase fill:#ffffcc,stroke:#ccaa00,stroke-width:3px
    classDef successPhase fill:#ccffcc,stroke:#00cc00,stroke-width:3px
    classDef processBox fill:#e6f3ff,stroke:#0066cc,stroke-width:2px
    classDef doneBox fill:#d4edda,stroke:#28a745,stroke-width:3px

    class Phase1,Phase2,Result1,Result2,Problem1,Problem2 failedPhase
    class Phase3,Result3,Problem3 betterPhase
    class Phase4,Questions,Workflow successPhase
    class Step1,Draft,Sections,Review,Step2,Breakdown,Step3,Execute processBox
    class Success doneBox

The Three-Step Workflow That Actually Works

Once we understood that process was the key, we built a workflow that embodies those three fundamental questions. Here’s how it works in practice:

Step 1: Create the Draft PRP

/draft-prp Create a webpage with a blue background and white text that says "Hello World"

Our /draft-prp command is a custom prompt that includes the requirements for the task. It lays out the process, definitions, and context needed, and it creates a draft Product Requirements Proposal (PRP) that is placed in the ./prp/drafts directory for review.
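In Claude Code, a custom slash-command is defined by a markdown prompt file under .claude/commands/. As a stripped-down, hypothetical sketch (our real command is much longer and more detailed), /draft-prp might look something like this:

# .claude/commands/draft-prp.md (hypothetical sketch)
Create a draft PRP for the following task: $ARGUMENTS

1. Restate what's important, what success looks like, and the definition of done.
2. Write out Constraints, Requirements, Definitions, Success Criteria,
   Definition of Done, Stories/Tasks, Acceptance Criteria, and Test Cases.
3. Save the document to ./prp/drafts/ for human review. Do not implement anything yet.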

This document will contain all the sections and requirements you’ve defined, such as:

  • Constraints
  • Requirements
  • Definitions
  • Success Criteria
  • Definition of Done
  • Stories/Tasks
  • Acceptance Criteria
  • Test Cases

Yes, all of that! The more structure, the better. This will create a large document (20-150k tokens), but it’s worth it. Review the document and make sure it meets your needs. If not, update the doc or work with the agents to adjust it. This is your starting point, so make sure it’s complete and correct.
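To make that concrete, here’s a heavily abbreviated, hypothetical skeleton of a draft PRP for the “Hello World” example above (real drafts run much longer):

PRP-333 Draft: Hello World Page (hypothetical skeleton)
  Constraints: static HTML/CSS only, no build tooling
  Requirements: blue background, white text reading "Hello World"
  Success Criteria: the page renders correctly in a current browser
  Definition of Done: all tasks complete, all tests passing, changes committed
  Stories/Tasks:
    333-a-001 create index.html
    333-a-002 create styles.css
    333-a-003 create tests
  Acceptance Criteria / Test Cases:
    background color is blue; text color is white; text reads "Hello World"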

Step 2: Generate Individual PRPs

/generate-prp <draft-filename>

This command takes that detailed draft and breaks it down into smaller, more manageable pieces. As you may have noticed, the document that was created is very detailed, includes a lot of information, and is a bit overwhelming to implement as one monolithic task. This step creates individual PRPs for each discrete unit of work.

Out of that 100k-token file we break out individual PRPs that are easier to manage, review, and implement. Remember, the point is to accomplish the work and keep the agent focused; smaller, discrete tasks are better than large, complex ones where the agent will get lost.

I’ve configured my system to organize them using this pattern:

###-[a-z]-###-task-title.md
[PRP #]-[Major Task ID]-[Sequence Number]-[Task Title].md

So now we have a directory of PRPs that are ready to be worked on:

./prp/active/333-a-001-create-index-html.md
./prp/active/333-a-002-create-styles-css.md
./prp/active/333-a-003-create-tests.md
etc...

You’ll need to figure out the system and process that works for you, but this is a good starting point.

Step 3: Execute the PRP

/execute-prp <pr-filename>

This is where the magic happens. The agent will read the PRP, understand the requirements, and make the changes to your codebase. It will also run any tests, linters, or actions you have configured to make sure the code is correct.

If the PRP completes successfully, the agent moves it to the ./prp/completed directory and commits and pushes the change. If there are any issues, it moves the PRP to the ./prp/failed directory and leaves it open for review.
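The bookkeeping at the end of a run is straightforward. Here’s a rough Python sketch of the move-and-commit logic, assuming the directory layout described above (illustrative only; in practice the agent performs these steps itself):

import shutil
import subprocess
from pathlib import Path

def finalize_prp(prp_file: Path, passed: bool) -> None:
    """Move a finished PRP to completed/ or failed/, committing on success."""
    dest_dir = Path("prp/completed") if passed else Path("prp/failed")
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(prp_file), dest_dir / prp_file.name)
    if passed:
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run(["git", "commit", "-m", f"Complete {prp_file.stem}"], check=True)
        subprocess.run(["git", "push"], check=True)

# e.g. finalize_prp(Path("prp/active/333-a-001-create-index-html.md"), passed=True)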

Now let’s dive into what’s actually happening behind the scenes during execution.

Behind the Scenes: What Actually Happens During /execute-prp

When you run /execute-prp, you’re not just executing a script—you’re launching a sophisticated multi-agent orchestration system that coordinates 12+ specialized AI agents to deliver production-ready code. Here’s what happens behind the scenes.

The 12-Phase Pipeline

The execution follows a carefully choreographed sequence where each agent specializes in one domain and hands off to the next:

Phases 1-4: Planning & Design

  • Project Manager coordinates the entire pipeline and tracks progress
  • Architect Reviewer designs the system architecture and validates design patterns
  • Security Reviewer performs threat modeling before any code is written
  • Compliance Officer ensures regulatory and standards compliance (NIST, WCAG, etc.)

Phases 5-9: Implementation & Validation

  • Developer Agents (language-specific) implement features using strict TDD
  • Test Automation validates test quality and coverage
  • Performance Profiler benchmarks and optimizes critical paths
  • API Designer validates API contracts and endpoints
  • Database Administrator handles all data layer concerns

Phases 10-12: Documentation & Deployment

  • Documentation Writer creates README, API docs, ADRs, and deployment guides
  • Deployment Manager configures infrastructure and deployment pipelines
  • Code Management creates the pull request with comprehensive description

Each agent is a specialist. The Python developer doesn’t touch security. The security reviewer doesn’t write database schemas. This separation ensures expertise at every step.
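To give a feel for the choreography, here’s a toy Python sketch of a sequential pipeline with gates between phases (the agent and gate names are illustrative, not the actual implementation):

from typing import Callable

# (phase, agent) pairs in hand-off order; names mirror the roles described above
PIPELINE = [
    ("architecture", "architect-reviewer"),
    ("security-review", "security-reviewer"),
    ("implementation", "python-developer"),
    ("test-validation", "test-automation"),
    ("documentation", "documentation-writer"),
    ("deployment", "deployment-manager"),
]

def run_pipeline(run_agent: Callable[[str], bool], gate: Callable[[str], bool]) -> None:
    """Run each specialist in sequence; a failed quality gate halts the hand-off."""
    for phase, agent in PIPELINE:
        if not run_agent(agent):
            raise RuntimeError(f"agent {agent} failed during {phase}")
        if not gate(phase):
            raise RuntimeError(f"quality gate rejected the output of {phase}")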

The TDD Engine: RED-GREEN-REFACTOR

Here’s where things get interesting. Every developer agent is required to follow Test-Driven Development:

  1. RED: Write a failing test first (proves the feature doesn’t exist yet)
  2. GREEN: Write minimal code to make the test pass
  3. REFACTOR: Clean up the code while keeping tests green

The system blocks any commit that doesn’t follow this pattern. You can’t skip straight to implementation. This ensures:

  • Every feature has tests from day one
  • Tests actually validate behavior (not just pass by accident)
  • Code remains maintainable through refactoring

This is real TDD, not “tests eventually” or “we’ll add tests later” development.
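Here’s what one RED-GREEN cycle looks like in practice; a minimal Python/pytest example with a made-up slugify feature (not from our codebase):

# RED: write the failing test first -- it fails because slugify() doesn't exist yet
def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

# GREEN: the minimal implementation that makes the test pass
def slugify(text: str) -> str:
    return text.strip().lower().replace(" ", "-")

# REFACTOR: tidy up (naming, edge cases) while re-running the test to keep it green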

Quality Gates That Actually Enforce Quality

Between each phase, automated gates validate the work. These aren’t suggestions—they’re enforced requirements:

Test Coverage Gate

  • Requires 100% coverage (lines, branches, functions, statements)
  • No “we’ll add tests later”—the code won’t merge without them
  • Enforced by automated tooling, not trust
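With Python tooling, for example, this kind of gate is typically a one-liner in CI using pytest-cov; the run fails outright if coverage drops below the threshold (the src path here is a placeholder):

pytest --cov=src --cov-branch --cov-fail-under=100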

Mutation Testing Gate

  • Validates that tests actually catch bugs (not just execute code)
  • Requires ≥95% mutation score
  • Prevents weak tests that just achieve coverage numbers without actually validating behavior

For those unfamiliar with mutation testing: the system deliberately introduces bugs into your code (changing + to -, inverting boolean conditions, etc.) and ensures your tests catch them. If your tests still pass when the code is broken, they’re not really testing anything useful. Mutation testing catches this.
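A tiny Python illustration of why this matters; both tests below give 100% line coverage on the (made-up) function, but only one survives mutation testing:

def apply_discount(price: float, rate: float) -> float:
    return price - price * rate

# Weak: executes the code but barely asserts anything.
# A mutant that flips '-' to '+' (returning 110.0) would still pass.
def test_discount_weak():
    assert apply_discount(100.0, 0.1) is not None

# Strong: pins the expected value, so the '+' mutant is caught and killed.
def test_discount_strong():
    assert apply_discount(100.0, 0.1) == 90.0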

Production-Ready Gate

  • Zero stubs allowed: No pass, no NotImplementedError, no “TODO: implement this”
  • Complete error handling: Every external call wrapped in try/catch with meaningful errors
  • Comprehensive logging: Entry/exit logging, error logging, state changes tracked
  • No unresolved TODOs: Either implement it or link to a future issue
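To give a flavor of how a “zero stubs” check can be automated, here’s a minimal sketch using Python’s ast module (a simplified stand-in, not the actual validator):

import ast

def find_stubs(source: str) -> list[str]:
    """Flag functions whose body is only `pass` or `raise NotImplementedError`."""
    stubs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            body = node.body
            # ignore a leading docstring
            if body and isinstance(body[0], ast.Expr) and isinstance(body[0].value, ast.Constant):
                body = body[1:]
            if len(body) == 1 and isinstance(body[0], ast.Pass):
                stubs.append(node.name)
            elif (len(body) == 1 and isinstance(body[0], ast.Raise)
                  and "NotImplementedError" in ast.dump(body[0])):
                stubs.append(node.name)
    return stubs

print(find_stubs("def todo():\n    pass\n"))  # ['todo']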

Security Gate

  • No hardcoded secrets or credentials
  • Encryption requirements validated
  • Auth/authorization patterns verified
  • Vulnerability scan passes

These gates aren’t optional. They’re enforced by the system and you can’t merge without passing them all.

What You Get at the End

After all 12 phases complete, you receive:

1. Production-Ready Code

  • 100% test coverage (not aspirational—enforced)
  • Zero stub implementations
  • Complete error handling and logging
  • ≥95% mutation score
  • Security-validated
  • Performance-benchmarked

2. Comprehensive Documentation

  • README with installation, usage, and examples
  • API documentation (if applicable)
  • Architecture Decision Records (ADRs)
  • Deployment guides with infrastructure setup
  • Security documentation

3. Quality Reports

  • Coverage report (coverage/index.html)
  • Mutation testing results (mutation/results.html)
  • Performance benchmarks
  • Security scan results
  • Accessibility audit (if UI-related)

4. Pull Request

  • Comprehensive PR description with context
  • All changes explained with rationale
  • Test results embedded
  • Ready for human review

5. Complete Audit Trail

  • Every agent action logged
  • Every decision documented
  • Every test result recorded
  • Full traceability from requirement to implementation

The Human Still Owns the Decision

Here’s the crucial part: /execute-prp creates a pull request, not a direct commit. The system delivers production-ready code, but you still review and merge. This keeps humans in control while automating the grunt work.

The PR description includes:

  • What was built and why
  • Design decisions made
  • Test coverage summary
  • Performance characteristics
  • Security considerations
  • Breaking changes (if any)

You review it like any other PR, except it’s already passed 6+ quality gates and has 100% test coverage.

Why This Matters

Traditional development: Developer writes code → manually writes tests → hopes they caught everything → ships and prays.

Agentic development: PRP defines requirements → 12 agents orchestrate delivery → every quality gate passes → human reviews and merges → ship with confidence.

The difference? You can’t skip quality. The gates enforce it. You can’t ship incomplete code. The production-ready validator blocks it. You can’t merge without tests. The coverage gate prevents it.

It’s not about AI replacing developers—it’s about AI handling the tedious parts (tests, docs, quality checks) so developers can focus on architecture, design, and strategic decisions.

Next time you run /execute-prp, you’ll know there are 12+ agents working behind the scenes, following strict TDD, enforcing quality gates, and delivering code that’s actually production-ready, not “hope it works” ready.

Lessons Learned: From Four Phases to One Framework

Looking back at this journey, the pattern becomes clear:

Phase 1 (Prompt Engineering) taught us that more words don’t equal better results. The agent doesn’t need a novel—it needs structure.

Phase 2 (Gates/Guardrails) showed us that fighting the agent is counterproductive. Guardrails without context create adversarial relationships where the agent spends energy circumventing controls instead of doing work.

Phase 3 (Structure) was the breakthrough: smaller, focused tasks within clear processes work better. But we were still missing something.

Phase 4 (Process) brought it all together. The three questions—What’s important? What does success look like? What is the definition of done?—combined with the three-step workflow (draft, generate, execute) created a framework that actually delivers.

The key insight? Agents need the same things humans need to do good work: clear objectives, defined success criteria, and a process to follow. The difference is that with agents, we can enforce quality gates that humans might be tempted to skip under pressure.

This process has worked well for us and we are able to get a lot of work done effectively. The agents stay focused and deliver quality code that meets our requirements because the process keeps them on track.

We’re still learning and iterating, but this framework has proven solid. If you’re struggling with agentic programming, consider whether you’re giving your agents structure and process, or just hoping better prompts will fix everything.

The technology I dreamed about in the 90s is finally here. Now we know how to use it effectively.