How Planning Impacts AI Coding
August 7, 2025
Intro
The development community is divided on how much AI truly enhances real-world engineering. Some folks report massive productivity improvements, while others say that reviewing and editing AI-written code slows them down overall. Opinions also vary on how beneficial planning is before diving into the code versus running strictly on vibes. After spending months using AI daily in a production environment, we wanted to measure the impact of proper planning on the productivity benefits of AI-assisted coding.
Experiment
The importance of preparing high-quality requirements for AI coding agents such as Claude Code or Cursor is somewhat well known. However, we wanted to test just how much carefully prepared requirements at the user story or feature level really affect the quality of the agent output compared to quick hand-written prompts run directly. Additionally, we wanted to see whether better planning improves the ability to run AI coding agents in parallel, which is also described in the Claude Code documentation. We have also covered some of this in the author’s practical guide to AI coding, which was created during day-to-day work.
It is worth noting that this experiment is very low scale, and we are not pretending to have any scientific significance. The goal was to check the basic effects of planning in AI coding. To do that, we asked each agent to implement the same task twice: once with detailed, generated instructions, and once with just a simple hand-written prompt. The task is based on an open-source repository we maintain. The simple hand-written prompt was the following:
# Implement github changes summary
1. Contexify app should provide functionality to analyze recent changes happened in a github repository.
2. The analysis should be performed automatically periodically for repositories that are enrolled into such analysis.
3. Report should be persisted and available through API.
4. Reports should also be viewable in the UI.
The first attempt was performed by each coding agent with only these requirements as input. The second attempt was performed with much more detailed requirements covering various aspects of the implementation. The full description/prompt can be found here.
The conditions of the experiment:
Nudge agents to get as far as they reasonably can. Answer any questions they have and unblock them when stuck.
Do not provide any other information during their code implementation.
In the end, we had six GitHub PRs, which were evaluated according to the following criteria.
Criteria
Name | Description | Ranges |
---|---|---|
Correctness | How functionally correct the implementation is and how closely it matches a properly designed solution. | 1 - functionally, almost nothing is done right; 5 - all functionality is implemented correctly, the way it is intended to work. |
Quality | Quality of the code: how maintainable it is and how well it follows the existing coding standards. | 1 - doesn’t follow any patterns, files are in the wrong places, uses poor/unreasonable approaches; 5 - production-grade quality, can be shipped to prod as-is. |
Autonomy | How autonomously the AI got to the final solution. | 1 - required constant supervision and couldn’t make any meaningful progress by itself; 5 - got to a final solution without any nudges beyond the initial setup. |
Completeness | How complete the solution is: did it satisfy all the explicitly set requirements? | 1 - none of the requirements are satisfied and nothing works; 5 - all explicitly set requirements are satisfied. |
These criteria allow us to estimate how likely an AI coding assistant is to get to a working solution with all requirements satisfied and without much guidance, which is what will eventually enable such assistants to run in parallel. Note that for parallel execution to work well, it is more important that all scores stay within the 4-5 range than to have a higher average; e.g., scores of [4, 4, 4, 4] are better than [5, 5, 2, 5], even though the latter has a higher total. That’s because a low score on any dimension can easily eliminate any productivity wins a human engineer would expect to gain from an AI assistant.
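To make that concrete, here is a minimal sketch (using only the illustrative score profiles above) of why we compare runs by their weakest dimension rather than their average when deciding whether agents can safely run in parallel:

```python
# Minimal sketch: when agents run in parallel, the weakest criterion is what
# forces a human back into the loop, so we compare profiles by their minimum
# score rather than their mean. The profiles below are illustrative only.
from statistics import mean

profiles = {
    "uniform": [4, 4, 4, 4],   # Correctness, Quality, Autonomy, Completeness
    "spiky":   [5, 5, 2, 5],
}

for name, scores in profiles.items():
    print(f"{name}: mean={mean(scores):.2f}, min={min(scores)}")

# uniform: mean=4.00, min=4  -> safe to leave running unattended
# spiky:   mean=4.25, min=2  -> the low dimension drags the run back into manual work
```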
Results
Below are the scores for each tested case with links to the corresponding PRs.
Solution | Correctness | Quality | Autonomy | Completeness | Mean ± SD | Improvement % |
---|---|---|---|---|---|---|
Claude, Short PR | 2 | 3 | 5 | 5 | 3.75 ± 1.5 | 20% |
Claude, Planned PR | 4+ | 4 | 5 | 4+ | 4.5 ± 0.4 | |
Cursor, Short PR | 2- | 2 | 5 | 5 | 3.4 ± 1.9 | 20% |
Cursor, Planned PR | 5- | 4- | 4 | 4+ | 4.1 ± 0.5 | |
Junie, Short PR | 1+ | 2 | 5 | 3 | 2.9 ± 1.6 | 34% |
Junie, Planned PR | 4 | 4 | 3 | 4+ | 3.9 ± 0.6 | |
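For readers who want to check the summary columns, here is a small sketch of how they can be reproduced from the raw scores (matching the table up to rounding). It assumes that a “+” or “-” modifier shifts a score by ±0.5 and that SD is the sample standard deviation; treat it as a reading aid, not part of the experiment.

```python
# Sketch for reproducing the Mean ± SD and Improvement % columns from raw scores.
# Assumptions: a "+" / "-" modifier shifts a score by +/-0.5, and SD is the
# sample standard deviation. Only the Claude rows are shown; the others follow
# the same pattern.
from statistics import mean, stdev

def parse(score: str) -> float:
    """Turn a score like '4+' or '2-' into a number."""
    if score.endswith("+"):
        return int(score[:-1]) + 0.5
    if score.endswith("-"):
        return int(score[:-1]) - 0.5
    return float(score)

short   = [parse(s) for s in ["2", "3", "5", "5"]]    # Claude, Short PR
planned = [parse(s) for s in ["4+", "4", "5", "4+"]]  # Claude, Planned PR

print(f"Short:   {round(mean(short), 2)} ± {round(stdev(short), 2)}")      # 3.75 ± 1.5
print(f"Planned: {round(mean(planned), 2)} ± {round(stdev(planned), 2)}")  # 4.5 ± 0.41
print(f"Improvement: {round(100 * (mean(planned) - mean(short)) / mean(short))}%")  # 20%
```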
Here is what we can observe after looking at the scores and produced PRs.
High quality planning significantly improves correctness and quality. No surprises here. AI can’t read your mind, so properly prepared product and technical requirements help the AI assistant do what you actually want and follow your technical guidelines.
Planning reduces score dispersion. In other words, the results of all AI assistants were much closer to each other and more stable when they worked with detailed, unambiguous requirements. Looking at the PRs, you can also see that even the approaches they chose were very similar. In practice, it means you can use any of the good coding assistants available out there, as long as you plan and prepare a detailed spec for their work.
Autonomy drops with increased scope. This task was intentionally larger than a typical single-PR change in order to test the AI’s boundaries. Cursor and Junie required quite a bit of nudging to finish their work with the more detailed plan, but Claude Code was able to run to the end without any nudging. So, breaking work down into smaller chunks increases the chances of the AI autonomously getting to the end of the requirements list, but exactly how small the chunks should be will probably vary from one assistant to another. There is a good chance that a common best practice still applies, for example keeping PRs under 400-500 LOC, which brings us to the next point.
Code reviews are a major bottleneck. It was much easier to get all six AI assistant runs close to the finish line than it was to review even two of the PRs. Granted, the task was bigger than what should typically go into a single PR, but as AI coding becomes more and more common, teams will also need to adopt AI-assisted code review approaches. In general, the smaller the change and the higher the quality of the code in it, the easier it is for a human to review.
Overall, this experiment confirms what standard best practices have taught us for years. It emphasizes that high-quality planning is crucial for achieving meaningful benefits from AI beyond code completion and boilerplate generation.
In order to take full advantage of AI’s promise and run agents in parallel, you need to:
Prepare a detailed spec for each feature. The goal should be to remove as much ambiguity as possible ahead of time. The spec should include what’s in scope, what’s out of scope, acceptance criteria, test coverage, DB schema changes if any, etc. AI coding agents can already figure out quite well where and how to write code, but they need guardrails and specific requirements to produce production-ready code. It is especially important to outline design patterns and architecture decisions that may be obvious to a human, as this is where AI will often go off the rails.
Keep execution right-sized. This is key. To take full advantage of AI coding’s promise of speed and parallel execution, you want to execute at the feature / user story level. This size only really works with proper planning, but that is where the real speed unlock is. When you can assign a full user story to an AI agent and let it run alongside other AI agents, you start to see the magic of the new world we are in. At this level, it should be easy to review and retry if the AI got something wrong. And at this level, the work is small enough for AI to finish autonomously, but typically not so small that you have to babysit each step.
Review every single change. In our experiment, none of the code was production-ready at first pass. Even for the smaller tasks we work on daily, we rarely see any AI agent get code to a ready-to-merge state completely by itself. There’s almost always some level of tweaking or editing that needs to be done manually. However, with proper planning, AI can get to an 80% complete level, even while running in parallel.
Practical Guide Using Devplan
This is a walkthrough of how we use Devplan in real day-to-day development to achieve the results presented here. Right now, more than 90% of the code we ship runs through Devplan, making it the foundation of our ability to execute fast and get the true benefits of AI-enabled development.
Keep in mind the goals here are to create a repeatable, scalable system where AI can:
Get to a working solution independently
Execute tasks in parallel
Require minimal human oversight
In our experience as senior engineers, without Devplan, the overhead of managing AI-assisted workflows can sometimes cancel out the benefit, but with it, the benefits of AI coding are tremendous.
1. Define Product & Technical Specs with Devplan Agents
Every project starts with Devplan’s agents helping to define requirements. They work with you to ask smart clarifying questions, flag ambiguity, and scope the work properly—grounded in knowledge of your codebase, past projects, and your company’s structure and goals.
This step is deceptively important. It seems basic, but the quality of the questions the AI asks here is critical. It often surfaces misalignments or assumptions that would cause a coding agent to fail or require multiple follow-ups. Without this clarity, you risk vague specs, restarts, and messy outcomes.
By the end of this step, you’ve got a clean, scoped project with resolved ambiguity. You can archive it to your backlog or move straight to execution.
2. Break the Project Down into Right-Sized Features
Devplan automatically breaks each scoped project into individual features or user stories. This is where AI prompts are generated—one per feature.
Your job here is light. Mostly you're validating:
Are the features correctly sized (ideally half-day to 5-day chunks)?
Are there too many or too few?
Do the acceptance criteria make sense?
Thanks to the planning in Step 1, this typically takes less than two minutes. Most of the ambiguity has already been resolved, and this step simply formalizes the work into bite-sized units that are ready to ship.
3. Run Prompts into Your AI IDE (Manual vs. Devplan CLI)
Once features and prompts are ready, it’s time to run them inside your IDE of choice—Claude, Cursor, Junie, etc. This is where execution happens, and also where things can get inefficient quickly.
Approach 1: Manual Execution (Without Devplan CLI)
Here’s what the manual process looks like, per feature:
Download the generated prompt and format it for your IDE (CLAUDE.md, rules.json, guidelines.md, etc.).
Clone your git repository or create a new worktree, which is especially important if you want to implement features in parallel.
Open your IDE manually in the correct folder with the right context.
Prompt the AI to begin coding the feature.
Doing this once isn’t a big deal. But doing it 6–10 times per day becomes a drag. It’s repetitive, error-prone, and easy to put off, especially if you forget to clean up worktrees or misplace prompts.
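To make that overhead concrete, here is a minimal sketch of what automating these manual steps could look like. This is not Devplan’s implementation, just an illustrative helper: the repository path, branch name, prompt location, and the `code` editor launcher are placeholder assumptions, and it presumes git 2.5+ for `git worktree`.

```python
# Illustrative helper that automates the manual per-feature setup described above.
# Not Devplan's implementation; paths, branch names, and the editor command are
# placeholders. Requires git 2.5+ for `git worktree`.
import subprocess
from pathlib import Path

def prepare_feature_workspace(repo_path: Path, feature_branch: str, prompt_file: Path) -> Path:
    """Create an isolated worktree for one feature and drop the generated prompt into it."""
    worktree_dir = repo_path.parent / f"wt-{feature_branch.replace('/', '-')}"
    # One worktree per feature keeps parallel agents from stepping on each other's changes.
    subprocess.run(
        ["git", "-C", str(repo_path), "worktree", "add", "-b", feature_branch, str(worktree_dir)],
        check=True,
    )
    # Place the prompt where the agent expects it (CLAUDE.md here; rules.json,
    # guidelines.md, etc. would work the same way).
    (worktree_dir / "CLAUDE.md").write_text(prompt_file.read_text())
    return worktree_dir

if __name__ == "__main__":
    repo = Path("~/code/contexify").expanduser()                  # placeholder repository path
    prompt = Path("~/Downloads/feature-prompt.md").expanduser()   # placeholder prompt export
    workspace = prepare_feature_workspace(repo, "feature/github-changes-summary", prompt)
    # Open the IDE in the right context; swap in your editor's CLI launcher.
    subprocess.run(["code", str(workspace)], check=True)
```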
Approach 2 (recommended): Automated Execution with Devplan CLI
With Devplan CLI, all of that overhead disappears. You can spin up a feature-ready workspace with one command:
This one-liner:
Creates a scoped cloned folder for the feature
Launches your IDE in the correct context
Automatically references the correct prompt file
After that, you just tell your AI agent: “Implement current feature.”
Before the CLI, we lost real time and energy just getting into a feature, switching between terminal, prompts, and IDEs. Parallel execution felt clunky, and small errors like forgetting a worktree setup often led to broken states or rework. With the CLI, feature execution is fast, consistent, and repeatable.
More importantly, this automation is what makes scale possible. We can run multiple features in parallel, and delegate reliably to AI.
4. Review and Polish the Output
This is the last human step before shipping. But the amount of work here drops dramatically if the planning and prompting were done well, which they are, if you followed the earlier steps.
Once the AI has written the code:
Manually review the output
Fix issues or edge cases
Test to ensure it meets your standards
Without this system, we wouldn’t be able to run or complete nearly as many AI-generated features per day. Devplan is what turns isolated AI prompts into a real production workflow.
We estimate that using Devplan makes our AI-assisted development planning process 8–10x faster compared to manually managing specs, prompts, repos, and execution, and the overall coding execution 2-3x faster. But more importantly, it makes the entire workflow scalable.
Requirements Adjustments
There is a fairly common flow worth calling out separately, which also highlights the power of using proper tools. When an AI coding agent goes sideways and you need to course-correct it, it is often easier to restart from scratch with corrected requirements. The flow described above allows you to do a full restart in a matter of minutes, if not seconds, depending on how complex the adjustment is.
The way you would do it is to go back to step 1 and update the PRD if the needed change is product-related, or the tech design doc if it is technical, and work with the AI agents to update the requirements with your new ask. Then, go to the Build Plan (step 2) and regenerate features and prompts with a single click. Finally, use the CLI to restart with the updated requirements. That’s it. It usually takes under two minutes from realizing that the AI did something you want to adjust to having it restart with the corrected prompt.
For example, I once worked on implementing a remote MCP server, and my AI IDE decided not to use an SDK at all. When I noticed that, I updated the Technical Requirements with a request to use the Python SDK for MCP, regenerated the prompts, and restarted. It took less than a minute.
Another important reason to centralize requirements: every change will persist, even if you blow up the repo or switch to a different AI IDE. For example, you could edit a requirement directly in a rule file, but that change won’t carry over to the next feature. And if you try a different AI IDE, you’ll likely need to manually migrate those changes or risk losing them altogether if you roll back the repo. (That’s happened to me more than once before I switched to this centralized flow.)
Conclusion
There are a lot of people and articles (e.g. this) suggesting that AI may be a net loss for productivity. And indeed, if it is not used smartly or with good tooling, that may be true. Good professional engineers are already quite efficient, and for them it is critical to have efficient processes and tools that minimize overhead while empowering AI to get big parts of a task to near-completion. Every minute of overhead, every single extra context switch matters. It will take time to figure out how to work with AI coding at scale, but when done well, AI can make engineers more productive and the job itself more fun.