
Deleting our regex classifier: when AI replaces your own abstraction

We spent months building a weighted regex engine that classified CI errors into nine categories with confidence scores. Then we deleted it. This is the story of why it had to go, and what replaced it.

Most of the v1 to v2 rewrite was about external shape. Local instead of remote, TypeScript instead of Python, CLI instead of CI job. The part worth writing about on its own is internal. v2 shipped with a lot less code than v1, and most of what is missing was deliberately deleted.

The biggest single deletion was our error classifier.

What it did

v1’s classifier was a weighted regex engine. It read CI logs, ran them through roughly one hundred fifty patterns, and sorted errors into nine buckets: lint, formatting, types, build, CI config, test contracts, complex types, logic, and a catch-all called unknown. Each classification came with a confidence score. Each bucket mapped to a fix strategy.
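To make the mechanics concrete, here is a minimal sketch of how a weighted regex classifier like this works. The patterns, weights, and scoring rule are illustrative, not v1's actual rules.

```typescript
// Hypothetical sketch of a v1-style weighted regex classifier.
type Category =
  | "lint" | "formatting" | "types" | "build" | "ci-config"
  | "test-contracts" | "complex-types" | "logic" | "unknown";

interface Pattern {
  category: Category;
  regex: RegExp;
  weight: number; // how strongly a match votes for its category
}

// v1 had roughly 150 of these; three stand-ins here.
const patterns: Pattern[] = [
  { category: "types", regex: /error TS\d+:/, weight: 3 },
  { category: "lint", regex: /eslint/i, weight: 2 },
  { category: "build", regex: /Module not found/, weight: 2 },
];

function classify(log: string): { category: Category; confidence: number } {
  const scores = new Map<Category, number>();
  for (const p of patterns) {
    if (p.regex.test(log)) {
      scores.set(p.category, (scores.get(p.category) ?? 0) + p.weight);
    }
  }
  if (scores.size === 0) return { category: "unknown", confidence: 0 };
  const total = [...scores.values()].reduce((a, b) => a + b, 0);
  const [category, score] = [...scores.entries()].sort((a, b) => b[1] - a[1])[0];
  // Confidence here is just this category's share of all matched weight.
  return { category, confidence: score / total };
}
```

The unknown bucket falls out naturally: it is what you return when none of the patterns fire, which is exactly the fallback path that later made the rest of the machine look optional.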

The landing page for Stitch v1 proudly listed all nine categories. We thought it was a feature.

Why we built it

The classifier was an answer to a real problem. Language model calls cost tokens. CI logs are enormous. If you send the whole log, you waste context and you get worse results because the model has to find the signal itself. If you preprocess the log, strip the noise, isolate the relevant error, and label its kind, you save tokens and you give the model a head start.

So we built a preprocessor. It ran before the model. It produced a compact payload that said “this is a type error on file X line Y, here is the traceback, here are the five lines around the call site.” The model then produced the fix. This worked.
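A sketch of what that compact payload might look like. The field names and the extraction regex are assumptions for illustration, not v1's actual code.

```typescript
// Illustrative shape of the payload v1 handed to the model.
interface ErrorPayload {
  category: string;  // e.g. "types", from the regex classifier
  file: string;
  line: number;
  excerpt: string[]; // a few source lines around the error site
}

// Pull a "file.ts(12,5): ..." style location out of the log and
// slice the surrounding source lines. Returns null if nothing matches,
// which is when v1 fell back to sending the raw log.
function buildPayload(
  log: string,
  source: string[],
  category: string,
): ErrorPayload | null {
  const m = log.match(/([\w./-]+)\((\d+),\d+\)/);
  if (!m) return null;
  const line = Number(m[2]);
  const start = Math.max(0, line - 3);
  return { category, file: m[1], line, excerpt: source.slice(start, line + 2) };
}
```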

Why it had to go

Two things changed.

Models got better at reading raw logs. Claude and GPT got cheaper per token and sharper at parsing unstructured output. The gap between “send the model a surgically extracted error” and “send the model the log and let it figure it out” closed. The extraction step stopped earning its cost.

Our classifier got worse in comparison. We had patterns for TypeScript and Python and Go. We did not have patterns for Elixir or Rust or whatever framework a user was running this week. When a log came in that our regex did not match, we routed it to the unknown bucket and let the model handle it anyway. That was the hint. If the fallback worked, the specialized path was optional. If the specialized path was optional, it was also a source of bugs we were maintaining for no reason.

The breaking point came in v2’s local context. On a developer’s machine, running stitch run claude meant we were calling into Claude Code, a pluggable agent that already has tools for reading files, running commands, and reasoning about errors. Sending it a pre-classified payload was like sending a senior engineer a pre-filled bug report and then insisting they use our template. They did not need it. The template was adding friction.

What we replaced it with

Two things. One much simpler than before, one not even in the same layer.

Stitch v2 still does a small amount of preprocessing, but it is coarse and it is honest. It looks at job names and decides whether to run them. A job called deploy-prod is infra, so skip it locally. A job called lint is verify, so run it locally. This is a one-line decision per job. It does not pretend to know what the errors will be.
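That gate really can be a single predicate over the job name. The regex below is an assumption, not Stitch's actual heuristic:

```typescript
// Sketch of v2's coarse job gate. The keyword list is illustrative.
type JobKind = "verify" | "infra";

// Jobs that touch deployment or release are infra: never run those locally.
const INFRA = /deploy|release|publish|prod/i;

const jobKind = (name: string): JobKind =>
  INFRA.test(name) ? "infra" : "verify";

const shouldRunLocally = (name: string): boolean =>
  jobKind(name) === "verify";
```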

The real diagnosis moved into the agent. When a job fails, Stitch hands the log to the agent with a short, deliberately boring prompt: here is the job, here is the log, here is the repo, please fix it. The agent does what it would do in an interactive session. It reads files, it runs commands, it tries things. Stitch validates the result by re-running the job.
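Here is roughly what that loop looks like, sketched with assumed callbacks (runJob, fixWithAgent) standing in for Stitch's real internals:

```typescript
// Sketch of the hand-off: plain prompt in, re-run to validate.
// The prompt wording and the callback shapes are assumptions.
function buildPrompt(job: string, log: string, repoPath: string): string {
  return [
    "A CI job failed. Please fix the underlying problem.",
    `Job: ${job}`,
    `Repository: ${repoPath}`,
    "Log:",
    log,
  ].join("\n");
}

async function fixLoop(
  job: string,
  runJob: () => Promise<{ ok: boolean; log: string }>,
  fixWithAgent: (prompt: string) => Promise<void>,
  repoPath: string,
  maxAttempts = 3,
): Promise<boolean> {
  for (let i = 0; i < maxAttempts; i++) {
    const result = await runJob();
    if (result.ok) return true; // the re-run is the only validation step
    await fixWithAgent(buildPrompt(job, result.log, repoPath));
  }
  return (await runJob()).ok;
}
```

Note what is missing: no classification, no extraction, no confidence score. The agent reads the repo itself.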

The total code removed was in the thousands of lines. The total capability lost was zero. The diagnoses got better because the agent had access to the whole repo instead of a five-line excerpt.

The lesson

The lesson is uncomfortable, but it shows up over and over in tools built around AI agents: the abstractions you build to help the model often stop helping once the model gets good enough. If you own both the preprocessor and the prompt, you will reach a point where the preprocessor is making the prompt worse. You will not notice because you built it.

The only way we noticed was by letting ourselves delete it and see what happened. Nothing happened. That was the point.

Next post in the series: Stitch 2.0 is shipping. What you actually install and what it does.
