AI code review has had a credibility problem from day one. Bots flag a hundred "issues" per PR, developers learn to ignore the comments, and within a quarter the integration is muted in Slack and forgotten. The metric that matters — did the bot's comment actually get acted on — has been the industry's quiet embarrassment.
Cursor Bugbot just published the receipts. According to a benchmark Cursor ran on public GitHub repositories, Bugbot's resolution rate now sits at 78.13%, with the next-closest competitor — Greptile — at 63.49%. The rest of the field doesn't break 50%.
The number is interesting on its own. The reason the number jumped is more interesting still.
The benchmark: who actually ships acted-on bug reports
Cursor's methodology is the kind of thing the AI code-review category has needed for two years. For each comment a tool produced on a public PR, an LLM judge checked whether the comment was addressed before the PR merged. No customer surveys, no synthetic test suites — just real PRs and real merge decisions.
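Cursor hasn't published the judge prompt or the pipeline, but the shape of the measurement is easy to pin down. A minimal sketch in Python, with `llm_judge` as a hypothetical stand-in for whatever model call actually did the grading:

```python
from dataclasses import dataclass

@dataclass
class ReviewComment:
    body: str            # the bot's review comment
    code_at_review: str  # the code the comment was made against
    code_at_merge: str   # the code that actually merged

def llm_judge(prompt: str) -> str:
    # Stand-in for a real model call; Cursor's actual judge model
    # and prompt are not public.
    raise NotImplementedError("wire up an LLM provider here")

def was_resolved(c: ReviewComment) -> bool:
    """Ask the judge whether the merged code addresses the comment."""
    verdict = llm_judge(
        f"Review comment:\n{c.body}\n\n"
        f"Code at review time:\n{c.code_at_review}\n\n"
        f"Code as merged:\n{c.code_at_merge}\n\n"
        "Was the comment addressed before merge? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")

def resolution_rate(comments: list[ReviewComment]) -> float:
    """The benchmark's metric: fraction of comments acted on before merge."""
    return sum(was_resolved(c) for c in comments) / len(comments)
```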
| AI Code Review Product | Resolution Rate | PRs Analyzed |
|---|---|---|
| Cursor Bugbot | 78.13% | 50,310 |
| Greptile | 63.49% | 11,419 |
| CodeRabbit | 48.96% | 33,487 |
| GitHub Copilot | 46.69% | 24,336 |
| Codex | 45.07% | 19,384 |
| Gemini Code Assist | 30.93% | 21,031 |
Two things jump out. First, the spread. The gap between Bugbot at the top and Gemini Code Assist at the bottom is more than 47 percentage points — this is not a category where every product is roughly the same. Second, GitHub Copilot, the default review companion shipped to tens of millions of developers, sits at 46.69% — meaning more than half its review comments are ignored at merge time.
That 78% isn't just a marketing number. It's a defensible answer to the question "should I let an AI review my code at all?"
Where the resolution rate came from
When Bugbot left beta in July 2025, its resolution rate was 52%. Roughly half its comments were getting addressed, half were getting ignored. Decent for a v1, but well within the noise of the rest of the field.
The path from 52% to 78% wasn't pure model upgrades. According to Cursor engineer Michael Zhao, the lift came from a structural change in how Bugbot trains itself — moving from offline experimentation to a continuous learning loop driven by real-time signals from real PRs.
> Up until now, improvements have been propelled exclusively by offline experiments: We tweak Bugbot, test to see if the change improves the resolution rate, and we ship it if it does.
The new system is called learned rules, and it's the most important shift the category has seen since AI code review tools first appeared.
How learned rules actually work
Learned rules turn every merged PR into a labeled training example. Bugbot watches three signals on every PR it reviews:
1. Reactions to Bugbot comments. A 👎 from the developer who owns the PR is a strong negative signal — the finding wasn't useful, or wasn't actionable, or was just plain wrong.
2. Replies to Bugbot comments. When a developer types out why the suggestion missed — "this is intentional, see ADR-0142" or "we already have a guard upstream" — that's structured feedback the system can learn from.
3. Comments from human reviewers that flag issues Bugbot missed. This is the "false negative" signal. Every time a human spots a bug that Bugbot didn't, that's a gap in coverage Bugbot can patch.
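Concretely, each merged PR yields a small batch of labeled feedback. A sketch of that extraction, with a hypothetical PR shape (Cursor's internal data model isn't public):

```python
from dataclasses import dataclass

@dataclass
class Signal:
    kind: str              # "reaction", "reply", or "missed_bug"
    positive: bool | None  # None = needs interpretation downstream
    text: str

def extract_signals(pr: dict) -> list[Signal]:
    """Turn one merged PR into feedback for the learning loop.

    `pr` is a hypothetical structure holding Bugbot's comments plus
    the humans' reactions, replies, and review comments.
    """
    signals: list[Signal] = []
    for comment in pr["bugbot_comments"]:
        for reaction in comment["reactions"]:
            # Signal 1: a thumbs-down from the PR owner is strong negative
            positive = {"+1": True, "-1": False}.get(reaction)
            signals.append(Signal("reaction", positive, reaction))
        for reply in comment["replies"]:
            # Signal 2: a written reply explains *why* a finding missed;
            # an LLM interprets it downstream, so leave it unlabeled here
            signals.append(Signal("reply", None, reply))
    for human in pr["human_review_comments"]:
        # Signal 3: a human-flagged bug Bugbot didn't catch is a false
        # negative, i.e. a coverage gap to patch (assumes upstream
        # filtering already matched these to findings Bugbot lacked)
        signals.append(Signal("missed_bug", False, human))
    return signals
```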
Bugbot processes these signals into candidate rules — additional instructions that nudge future runs toward issues that matter and away from noise that doesn't. As a candidate rule accumulates positive signal across PRs, it gets promoted to active. If an active rule starts generating consistent negative signal, it gets disabled. Developers can also edit or delete rules directly in the Cursor Dashboard.
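Cursor hasn't published the promotion logic, but the lifecycle described above reduces to a small state machine. A sketch with made-up thresholds:

```python
from dataclasses import dataclass

PROMOTE_AT = 3   # hypothetical thresholds; the real values aren't public
DISABLE_AT = -3

@dataclass
class LearnedRule:
    instruction: str           # e.g. "Don't flag intentional fallthrough"
    status: str = "candidate"  # candidate -> active -> disabled
    score: int = 0             # running feedback tally across PRs

    def record(self, positive: bool) -> None:
        """Fold one PR's worth of feedback into the rule's score."""
        self.score += 1 if positive else -1
        if self.status == "candidate" and self.score >= PROMOTE_AT:
            self.status = "active"    # enough positive signal: promote
        elif self.status == "active" and self.score <= DISABLE_AT:
            self.status = "disabled"  # consistent negative signal: disable
```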
The mental model is closer to a self-tuning linter than a traditional ML training pipeline. Instead of retraining a model on logged feedback once a quarter, Bugbot is rewriting its own prompt continuously.
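In that mental model, each run assembles its instructions fresh from whatever rules are currently active, which is why no retraining is involved. A rough sketch:

```python
def build_review_prompt(base_prompt: str, active_rules: list[str], diff: str) -> str:
    """Inject the currently active learned rules into the review prompt.

    The model itself never changes; only this assembled instruction
    set does, which is what makes it feel like a self-tuning linter.
    """
    rule_block = "\n".join(f"- {rule}" for rule in active_rules)
    return (
        f"{base_prompt}\n\n"
        f"Repository-specific review rules:\n{rule_block}\n\n"
        f"Diff to review:\n{diff}"
    )
```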
The scale that makes this work
Two numbers from Cursor explain why this isn't a toy:
> Since launching learned rules in beta, more than 110,000 repos have enabled learning, generating more than 44,000 learned rules.
110,000 repositories is more than enough cross-domain signal to avoid overfitting to one codebase's quirks. And 44,000 rules is roughly one rule per 2.5 repos — a density that suggests the system is finding genuinely codebase-specific patterns rather than generic "don't suggest renaming variables" guardrails.
The other number that matters: Bugbot reviews hundreds of thousands of PRs per day. Each one is a natural experiment. Each merged PR — with its reactions, replies, and human reviewer comments — feeds the loop.
Why this category will consolidate fast
If you're running a competing AI code review tool, the Bugbot benchmark is bad news in three different ways.
First, the scoreboard is now public. Every CTO evaluating tools in 2026 has a clean comparison table to put on a slide. CodeRabbit at 48.96% has a steep hill to climb to justify a green-field deployment, and an even steeper one to justify a switch away from Bugbot.
Second, the flywheel compounds. Learned rules get better as more repos enable learning, and more repos enable learning when the resolution rate is high. Bugbot is on the right side of that loop. Competitors are not.
Third, the comparison is on resolution rate, not findings count. For years, AI code review tools competed on how many issues they could find — which incentivized noise. Bugbot's pitch flips the metric: the goal isn't volume, it's the percentage of comments that actually change the merged code. Once buyers internalize that metric, "we find more issues" stops being a feature and starts being a liability.
How to actually use this
If you're already running Bugbot, learned rules are managed in the Cursor Dashboard under repository rules. Three things to know:
Backfill is worth running. The dashboard offers a backfill across recent PRs, which seeds the rule set with signal from your team's existing review history. Don't skip it — running learned rules cold means weeks before the system has enough signal to promote rules.
Edit aggressively in the first two weeks. Candidate rules show up before they're promoted. Review them. Generic rules are usually fine; rules that encode tribal knowledge about your codebase ("don't ever change the order of fields in the API response shape") are exactly the ones you want active fast. Promote them by hand if needed.
Watch for rule rot. If a rule starts generating false positives because the underlying code has changed — say, you migrated off a deprecated framework — Bugbot will eventually disable it, but you can speed that up by deleting it manually.
The Bottom Line
Cursor Bugbot's 78% resolution rate isn't just a benchmark win. It's evidence that the AI code review category has finally found the right metric — and that the company with the most PR data is opening a structural lead. Learned rules are the mechanism: each merged PR is now a free, labeled training example, and Bugbot has hundreds of thousands of them per day.
If you're choosing a code review bot in 2026, the question isn't "which one finds the most issues." It's "which one are my developers actually going to listen to." On that question, the data points one direction.