Back to blog
X24LABS

What we got wrong with Stitch v1

Stitch v1 was a Python library that ran as a CI job, diagnosed failures with a regex classifier, and opened a merge request with the fix. It worked. And it was still the wrong shape.

The first version of Stitch did what it said on the box. You added two jobs to your CI config, stitch-monitor and stitch-heal, pointed at a public Docker image, and the next time a lint rule or a test broke, a merge request showed up with a patch. For the subset of failures it handled well, the experience was close to magical.

We shipped it. We used it on our own repositories. And over a few months, we compiled an honest list of the things it got wrong. This post is that list.

It lived in the wrong place

Stitch v1 was a citizen of your CI system. It ran as a downstream job, activated on upstream failure, and did all its work inside the pipeline environment. That sounds clean. It had consequences we underestimated.

Every fix required a full pipeline round trip. Your CI ran, failed, Stitch ran, committed, pushed, your CI ran again. Even on a fast runner, the feedback loop was minutes, sometimes tens of minutes. For the kind of error Stitch was meant to handle, a missing import, a formatter disagreement, a type annotation the test runner did not like, the fix itself took a second and the verification took forever. The human was still waiting on the machine, just in a different room.

It assumed the agent lived somewhere else

v1 talked to a language model over an API. We shipped an OpenRouter integration, a GitHubAdapter, a StitchAgent class. The onboarding story was “bring your own API key.” That read like a small ask. In practice it meant every team that wanted to try Stitch had to create an account on a provider, generate a key, store it in CI secrets, and budget tokens.

Meanwhile, more and more developers already had Claude Code or Codex sitting on their laptop, already authenticated, already paid for out of their personal or team subscription. Stitch v1 did not know how to talk to any of those. It asked users to pay twice.

It over-engineered the parts it could see

v1 shipped a regex classifier. A weighted engine that read CI logs, ran them through one hundred fifty patterns, and sorted errors into nine categories: lint, formatting, types, build, CI config, test, complex types, logic, unknown. Each category had a confidence score. Each category had a fix strategy.

It was satisfying to build. It was also mostly theater. The agent on the other end of the API was capable of reading a raw log and figuring out what went wrong. Our classifier was, at best, a way to keep the prompt short. At worst it was a translation layer that introduced its own bugs. When the agent got better, we kept the classifier. We had already built it.

It pushed before it knew

Auto-merge was a flag. Default off, but advertised. A fix lands, the re-run passes, the merge request can close itself. In practice, that flag is a footgun. Passing CI is evidence that the test suite is happy, not that the patch was the right one. We saw cases where the agent silenced a failing test by changing the assertion, got green, and the team lost the signal entirely. v1’s answer was “review every MR.” v2’s answer needed to be different.

It scaled poorly for the small case

v1 assumed the team had a CI platform running somewhere that could host its Docker image, had the budget for extra compute on every failure, and had someone empowered to add jobs to the pipeline config. For a startup with three engineers, that was three hurdles before the first fix. For a solo developer on a side project, it was a wall.

The failure cases Stitch was best at, the five-minute cleanup fixes, were the cases where the smaller team needed help most. And those were exactly the teams v1 could not reach.

What we kept

Not everything. But a lot.

The core idea, that AI agents can read failing logs, understand the problem, and produce a targeted patch, was right. The insistence on validation, that a patch that does not re-run cleanly is not a fix, was right. The notion of escalating after bounded attempts was right. These carried into v2 unchanged.

The shape around them did not. The next post in this series is about where we took it.

Back to blog