
How Planning Impacts AI Coding

August 7, 2025

Intro


The development community is divided on how much AI truly enhances real-world engineering. Some folks report massive productivity improvements, while others say that reviewing and editing AI-written code slows them down overall. Opinions also vary on how beneficial planning is before diving into the code versus running strictly on vibes. After spending months using AI daily in a production environment, we wanted to measure the impact of proper planning on the productivity benefits of AI-assisted coding.


Experiment


The importance of preparing high-quality requirements for AI coding agents such as Claude Code or Cursor is somewhat well known. However, we wanted to test just how much carefully prepared requirements at the user story or feature level really affect the quality of agent output compared with running quick hand-written prompts directly. We also wanted to see whether better planning affects the ability to run AI coding agents in parallel, which is also described in the Claude Code documentation. We have covered some of this in the author’s practical guide to AI coding, which was created during day-to-day work.


While this experiment is small in scale, it achieved our goal of understanding the impact of detailed planning as a first step for AI coding. We asked each of the agents to implement the same task twice: once with detailed, generated instructions and once with just a simple hand-written prompt. The task is based on an open-source repository we maintain. The simple hand-written prompt was the following:


# Implement github changes summary

1. Contexify app should provide functionality to analyze recent changes happened in a github repository.

2. The analysis should be performed automatically periodically for repositories that are enrolled into such analysis.

3. Report should be persisted and available through API.

4. Reports should also be viewable in the UI.


The first attempt was performed by each coding agent with only these high-level requirements as input. The second attempt used much more detailed requirements generated by Devplan, covering various aspects of the implementation. The full description/prompt can be found here.


The conditions of the experiment:

  1. We nudge the agents to get as far as they reasonably can, answer any questions they have, and unblock them when they get stuck.

  2. We do not provide any other requirements information during implementation.


In the end, we had six GitHub PRs, which were evaluated according to the following criteria.


Criteria


| Name | Description | Ranges |
| --- | --- | --- |
| Correctness | How close the implementation is to being functionally correct and in line with a properly designed solution. | 1 - functionally, almost nothing is done right; 5 - all functionality is implemented correctly, the way it is intended to work. |
| Quality | Quality of the code: how maintainable it is and how well it follows the existing coding standards. | 1 - doesn’t follow any patterns, files are in the wrong places, uses poor/unreasonable approaches; 5 - production-grade quality, can be shipped to prod as-is. |
| Autonomy | How autonomously the AI got to the final solution. | 1 - required constant supervision and couldn’t make any meaningful progress by itself; 5 - got to a final solution without any nudges beyond the initial setup. |
| Completeness | How complete the solution is and whether it satisfied all the explicitly set requirements. | 1 - none of the requirements are satisfied and none of them work; 5 - all explicitly set requirements were satisfied. |


These criteria allow us to estimate how likely an AI coding assistant is to reach a working solution with all requirements satisfied and without much guidance, which is what would eventually enable agents to run in parallel. Note that for parallel execution to work well, it is more important that all scores stay within the 4-5 range than to have a higher average; e.g., scores of [4, 4, 4, 4] are better than [5, 5, 2, 5], even though the latter has a higher total. That’s because a low score on any dimension can easily eliminate any productivity wins a human engineer would expect to gain from an AI assistant.
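To make that dispersion point concrete, here is a quick illustrative calculation (a sketch, not part of the experiment tooling) comparing those two example score profiles:

```python
from statistics import mean, stdev

# Two example per-criterion score profiles:
# [Correctness, Quality, Autonomy, Completeness]
consistent = [4, 4, 4, 4]  # lower total, but no weak dimension
spiky = [5, 5, 2, 5]       # higher total, but one weak dimension

for name, scores in (("consistent", consistent), ("spiky", spiky)):
    print(f"{name}: total={sum(scores)}, mean={mean(scores):.2f}, SD={stdev(scores):.2f}")

# consistent: total=16, mean=4.00, SD=0.00
# spiky: total=17, mean=4.25, SD=1.50
```

The "spiky" profile wins on total, but its weak dimension is exactly the kind of gap that forces a human back into the loop.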


Results


Below are the scores for each tested case with links to the corresponding PRs.

| Solution | Correctness | Quality | Autonomy | Completeness | Mean ± SD | Improvement % |
| --- | --- | --- | --- | --- | --- | --- |
| Claude, Short PR | 2 | 3 | 5 | 5 | 3.75 ± 1.5 | |
| Claude, Planned PR | 4+ | 4 | 5 | 4+ | 4.5 ± 0.4 | 20% |
| Cursor, Short PR | 2- | 2 | 5 | 5 | 3.4 ± 1.9 | |
| Cursor, Planned PR | 5- | 4- | 4 | 4+ | 4.1 ± 0.5 | 20% |
| Junie, Short PR | 1+ | 2 | 5 | 3 | 2.9 ± 1.6 | |
| Junie, Planned PR | 4 | 4 | 3 | 4+ | 3.9 ± 0.6 | 34% |

The Improvement % column compares the mean score of each agent’s planned run against its short-prompt run.
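For readers who want to reproduce the summary columns, the sketch below matches the reported figures if the "+" and "-" score modifiers are treated as plus or minus 0.5, SD is the sample standard deviation, and Improvement % is the relative change in mean between the planned and short runs for the same agent (using the Claude rows as the example):

```python
from statistics import mean, stdev

# Claude scores in order [Correctness, Quality, Autonomy, Completeness],
# with the "+" / "-" modifiers read as +/-0.5 (a convention chosen because
# it reproduces the reported figures).
claude_short = [2, 3, 5, 5]
claude_planned = [4.5, 4, 5, 4.5]  # 4+, 4, 5, 4+

for name, scores in (("short", claude_short), ("planned", claude_planned)):
    print(f"{name}: mean={mean(scores):.2f}, SD={stdev(scores):.1f}")
# short: mean=3.75, SD=1.5
# planned: mean=4.50, SD=0.4

improvement = mean(claude_planned) / mean(claude_short) - 1
print(f"improvement: {improvement:.0%}")  # 20%
```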





Here is what we observed after looking at the scores and the produced PRs.


  1. High-quality planning significantly improves correctness and quality. No surprises here. AI can’t read your mind, so properly prepared product and technical requirements help the AI assistant do what you actually want and follow your technical guidelines.

  2. Planning reduces score dispersion. In other words, the results of all AI assistants were much closer to each other and more stable when they worked with detailed, unambiguous requirements. Looking at the PRs, you can also see that even the approaches they chose were very similar. In practice, this means you can use any of the good coding assistants out there, as long as you plan and prepare a detailed spec for their work.

  3. Smaller tasks run more autonomously. This task was intentionally on the larger side to test AI’s boundaries. Even with the more detailed plan, Cursor and Junie required quite a bit of nudging to finish their work, while Claude Code was able to run without any nudges to continue. Breaking work down into smaller chunks increases the chances of AI autonomously getting to the end of a task list. Common best practices still apply, for example keeping PRs under 400-500 LOC.

  4. Code reviews are a major bottleneck. It was much easier to get all six AI assistant runs close to the finish line than it was to review even two of the PRs. Granted, the task was on the bigger side for a single PR, but as AI coding becomes more common, teams will expect to run larger features more autonomously.


Overall, this experiment confirms what standard best practices have taught us for years. It emphasizes that high-quality planning is crucial for achieving meaningful benefits from AI beyond code completion and boilerplate generation.



To take full advantage of AI’s promise and run agents in parallel, you need to:


  1. Prepare a detailed spec for each feature. The goal should be to remove as much ambiguity as possible ahead of time. The spec should include what’s in scope, what’s out of scope, acceptance criteria, test coverage, DB schema changes if any, etc. AI coding agents can already figure out quite well where and how to write code, but they need guardrails and specific requirements to produce production-ready code. It is especially important to outline design patterns and architecture decisions that may be obvious to a human, as this is where AI will often go off the rails. We used Devplan for our spec creation, and it worked very well for this step.

  2. Keep execution right-sized. This is key. To take full advantage of AI coding’s promise of speed and parallelism, you want to execute at the right level. Tasks should be small enough for AI to finish autonomously, but not so small that you have to babysit each step. Creating right-sized tasks is much easier with a purpose-built tool like Devplan, and having them generated for you is where the real speed unlock is. When you can assign a set of tasks for a full feature to an AI agent and let it run alongside other AI agents, you start to see the magic of the new world we are in.

  3. Review every single change. In our experiment, the code usually wasn't production-ready on the first pass. For anything but the smallest tasks, we rarely see any of the AI agents getting code to a ready-to-merge state completely by themselves. There’s usually some level of tweaking or editing that needs to be done manually. However, with proper planning, AI can get to about 80% complete, even while running in parallel.


If you're using AI to build products, read about how we use Devplan to build Devplan.
