SubQ: The 12M-Token Subquadratic LLM Splitting AI Researchers

SubQ is a new 12M-token subquadratic LLM claiming massive context and low compute, sparking debate among researchers.

May 16, 2026

For nine years, every serious frontier model has been a transformer with dense attention, and every one of them has run into the same wall: doubling the input quadruples the attention work. A Miami startup called Subquadratic says it has a model that breaks through that wall, and on May 5, 2026 it came out of stealth with $29 million in seed funding and a flagship model named SubQ.

The pitch is audacious: a 12-million-token context window — roughly 9 million words, or about 120 books — running at a fraction of the cost of dense-attention frontier models. The AI research community is not sure whether to call it the biggest architectural shift since the transformer or a marketing exercise dressed up as physics. Both camps have a point.
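The word and book equivalents imply specific conversion rates. A quick back-of-the-envelope check, using only the round numbers quoted above (the rates it derives are implied averages, not figures Subquadratic published):

```python
# Round figures quoted in the pitch
CONTEXT_TOKENS = 12_000_000
APPROX_WORDS = 9_000_000
APPROX_BOOKS = 120

# Implied conversion rates (derived here, not stated by the company)
words_per_token = APPROX_WORDS / CONTEXT_TOKENS
words_per_book = APPROX_WORDS / APPROX_BOOKS

print(f"{words_per_token:.2f} words per token")
print(f"{words_per_book:,.0f} words per book")
```

That works out to 0.75 words per token, in line with the common rule of thumb for English text, and 75,000 words per book, a plausible length for a mid-sized book.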

The numbers Subquadratic put on the table

Co-founders Justin Dangel (CEO) and Alexander Whedon (CTO, ex-Head of Generative AI at Meta) describe SubQ as a transformer built on Subquadratic Sparse Attention (SSA). Instead of comparing every token with every other token, the model picks the most relevant tokens and computes relationships only inside that subset. The claimed payoff is what the name promises: attention cost that grows roughly linearly, rather than quadratically, with sequence length.
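A minimal sketch of that idea in plain Python, for a single query token. Subquadratic has not published SSA's internals, so this is generic top-k sparse attention, not the company's method; the names `attend`, `pair_count`, and `top_k` are illustrative. Note that the toy selection step still scores every key, so it is not itself subquadratic. A real system needs a cheaper way to choose the subset.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, top_k=None):
    """Attention output for one query vector.

    With top_k=None every key takes part (dense attention).
    With top_k set, only the top_k highest-scoring keys take part.
    """
    # Dot-product relevance score between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    idx = list(range(len(keys)))
    if top_k is not None and top_k < len(keys):
        # Toy selection: rank all keys, keep the best top_k.
        # (Assumption for illustration only; this ranking pass still
        # touches every key.)
        idx = sorted(idx, key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in idx])
    dim = len(values[0])
    # Weighted sum of the selected value vectors.
    return [sum(w * values[i][d] for w, i in zip(weights, idx))
            for d in range(dim)]


def pair_count(n, top_k=None):
    """Token pairs compared for a sequence of n tokens."""
    per_query = n if top_k is None else min(top_k, n)
    return n * per_query


# Tiny example: three keys, keep only the single most relevant one.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
sparse_out = attend(query, keys, values, top_k=1)
dense_out = attend(query, keys, values)
```

With `pair_count`, doubling the sequence from 1,000 to 2,000 tokens quadruples the dense comparisons (1,000,000 to 4,000,000) but only doubles the sparse ones at a fixed `top_k` of 64 (64,000 to 128,000).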

Their headline claims:

Claim	SubQ	Frontier comparison
Max context	12,000,000 tokens	~1–2M for frontier cloud models (Claude Opus 4.7, Gemini 3.1 Pro)
Speed @ 1M tokens	~50× faster	Frontier dense-attention models
Cost @ 1M tokens	~50× cheaper	Frontier dense-attention models
Compute @ 12M tokens	~1,000× less	Other frontier models at the same context
RULER 128K eval	95% accuracy at ~$8	Claude Opus: 94% at ~$2,600
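Two pieces of arithmetic help in reading the table. The first is the cost ratio implied by the RULER row. The second is a hedged extrapolation: if dense-attention cost grows with the square of context length and SubQ's grows linearly (an idealized assumption, not a published result), the advantage should grow in proportion to context length. The function names below are illustrative.

```python
def cost_ratio(baseline_usd, subq_usd):
    """How many times more the baseline costs than SubQ."""
    return baseline_usd / subq_usd


def extrapolate_advantage(ratio_at_ref, ref_tokens, target_tokens):
    """Scale a claimed advantage to a longer context.

    Assumes dense cost ~ n^2 and sparse cost ~ n, so the ratio
    between them grows linearly with n. This is an idealization.
    """
    return ratio_at_ref * target_tokens / ref_tokens


# RULER 128K row: ~$2,600 (Claude Opus) vs ~$8 (SubQ)
ruler_ratio = cost_ratio(2600, 8)

# Claimed ~50x at 1M tokens, projected to 12M tokens
projected = extrapolate_advantage(50, 1_000_000, 12_000_000)

print(f"RULER cost ratio: {ruler_ratio:.0f}x")
print(f"Projected advantage at 12M tokens: {projected:.0f}x")
```

The RULER row implies a cost gap of roughly 325×, and the idealized model projects a 600× advantage at 12M tokens from the 50× claimed at 1M. That is the same order of magnitude as the ~1,000× compute claim, though the table's rows measure different things (speed, cost, compute), so the comparison is only rough.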

"The fundamental scaling laws imposed by the transformer architecture and dense attention have been broken through," Dangel told SiliconANGLE.

The investor lineup is unusually loud for a seed round: Tinder co-founder Justin Mateen, former SoftBank Vision Fund partner Javier Villamizar, and early investors in Anthropic, OpenAI, Stripe, and Brex. This is the round you raise when you want benchmark numbers to travel.

Why a 12M context window actually matters

If SubQ holds up, the immediate beneficiary is every product that currently leans on RAG, agentic retrieval, and prompt-curation gymnastics to squeeze information into a 128K or 1M-token window. Those systems work, but they add latency, add a second failure mode, and bias what the model gets to see.

Whedon framed the problem bluntly: "I used to manually curate prompts and retrieval systems and evals and conditional logic to chain together the workflows. That is kind of a waste of human intelligence and also limiting to the product quality."

A model that can swallow an entire codebase, a full legal corpus, or a quarter's worth of meeting transcripts in a single pass doesn't kill RAG — but it changes the cost calculus for a lot of products that exist mostly to work around the context limit.

Two SubQ products ship with the launch:

SubQ API — direct access to the 12M-token window for developers and enterprise teams.
SubQ Code — a CLI coding agent that loads a whole repository into a single context window, so a developer can plan, execute, and review across a codebase without orchestrating multiple agents.

A free SubQ Search product is also coming, hinting at the land-and-expand strategy you'd expect from a long-context bet. Access today is waitlist-only — the model is not public, and Dangel said it will not be open-weight or open-source in the near term.

The skeptics are not being unreasonable

The response from researchers within hours of launch was, charitably, spirited. AI commentator Dan McAteer summed up the mood: "SubQ is either the biggest breakthrough since the Transformer… or it's AI Theranos."

The substantive objections are worth taking seriously:

Provenance. Engineer Will Depue initially observed that SubQ is "almost surely a sparse attention finetune of Kimi or DeepSeek." Subquadratic has not published architectural details or weights that would let anyone check.
Single-run benchmarks. Each model was run only once per benchmark because of inference cost, and the company's own paper concedes the SWE-Bench margin is "harness as much as model."
Lab vs. production gap. On MRCR v2 the research score was 83, but the third-party verified production model scored 65.9, a gap of about 17 points between what the paper reports and what the shipping model does.
Cost transparency. Subquadratic has not publicly disclosed API pricing, so the cost-per-task comparisons against Opus are not independently verifiable.

Not everyone is dismissive. AI researcher John Rysana pushed back on the Theranos framing, arguing the work is "just subquadratic attention done well which is very meaningful for long context workloads," and that "odds of it being BS are extremely low."

The bottom line

Sparse attention is not new; it has been in academic papers for years. What is new is (a) a well-funded team claiming to have made it work at frontier scale, and (b) the willingness to put a claim of "~1,000× less" compute in a press release before the model is in anyone else's hands. Until the API opens up and independent labs can replicate the RULER 128K and MRCR v2 numbers on a fixed harness, treat the headline figures as marketing, not science.

But the underlying problem — quadratic attention is the load-bearing wall of the entire LLM economy — is real, and the first company to credibly knock it down will reshape pricing across the stack. SubQ deserves the scrutiny it is getting. It also deserves to be checked, not laughed off. The next move belongs to whoever gets API access first.
