Trinity-Large-Thinking: 400B U.S.-Made Open Reasoning Model

Open Source · 7 min read

Aisha Patel
Apr 30, 2026

For a year, the most exciting open-weight AI models have been Chinese. DeepSeek, Qwen, GLM, Kimi — every breakthrough on the leaderboards came stamped with a flag from across the Pacific. American labs that could compete on capability — Anthropic, OpenAI, Google — were busy shipping closed models behind APIs.

Then, on April 1, 2026, a 100-person startup called Arcee AI released Trinity-Large-Thinking under an Apache 2.0 license. It is a 400-billion-parameter sparse Mixture-of-Experts model trained on 2,048 NVIDIA B300 GPUs over roughly 33 days, optimized for long-horizon agentic reasoning. They spent $20 million to train it — nearly half the company's total funding — and then they gave away the weights.

This is not just another model release. It is a deliberate bet on the proposition that the open-weight AI ecosystem in the United States needs at least one frontier-class model that enterprises can inspect, fine-tune, host, and own. Here is what Trinity-Large-Thinking actually does — and why it matters.

The Architecture: Sparse, Reasoning-Heavy, Tool-Native

Trinity-Large-Thinking is a sparse Mixture-of-Experts with 398 billion total parameters and roughly 13 billion active parameters per token. That ratio — about 3.3% activation — puts it in the same architectural family as DeepSeek V3 and Qwen 3 Max: massive total knowledge with the inference economics of a much smaller dense model.
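As a back-of-the-envelope check, the sparsity numbers above work out like this (a quick sketch using the published parameter counts, not Arcee's actual configuration):

```python
# Back-of-the-envelope MoE sparsity math for Trinity-Large-Thinking.
total_params = 398e9    # total parameters across all experts
active_params = 13e9    # parameters activated per token

activation_ratio = active_params / total_params
print(f"activation ratio: {activation_ratio:.1%}")   # ~3.3%

# Inference FLOPs scale with *active* parameters (~2 FLOPs per
# parameter per generated token), so per-token compute looks like
# a ~13B dense model, not a 398B one.
dense_equivalent_flops = 2 * active_params
print(f"~{dense_equivalent_flops / 1e9:.0f} GFLOPs per generated token")
```

This is why sparse MoE models can carry frontier-scale knowledge while serving at small-model prices: you pay memory for all 398B, but compute only for the 13B that fire.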

It is the reasoning-tuned sibling of Trinity-Large-Base. Where the base model was a general-purpose foundation, Trinity-Large-Thinking has been post-trained with extended chain-of-thought and agentic reinforcement learning. The model emits its reasoning in explicit <think>...</think> blocks before producing a final answer, similar to DeepSeek's R-series or Anthropic's extended-thinking traces.
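In practice, a client has to strip or surface that reasoning block before showing the final answer. A minimal parser might look like this (the <think> tag format is as described above; the helper name and single-block assumption are mine):

```python
import re

# Split a Trinity-style completion into (reasoning trace, final answer).
# Assumes at most one <think>...</think> block, emitted before the answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(completion: str) -> tuple[str, str]:
    match = THINK_RE.search(completion)
    if match is None:
        return "", completion.strip()   # no reasoning block emitted
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>The user wants a one-word answer.</think>\nParis."
)
print(answer)  # "Paris."
```

Whether you log the trace, show it in a collapsible UI element, or discard it is a product decision; the key point is that the format is explicit and machine-parseable.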

The model was designed from day one for long-horizon agents and multi-turn tool use. That focus shows up in the benchmarks.

The Numbers That Matter

Arcee's reported benchmark scores are striking, especially in the agentic and code-execution categories:

| Benchmark | Trinity-Large-Thinking | Claude Opus 4.6 |
| --- | --- | --- |
| τ²-Bench (agentic) | 94.7% | (closely matched) |
| PinchBench | 91.9% | — |
| LiveCodeBench | 98.2% | — |
| SWE-bench Verified | 63.2 | 75.6 |

The pattern is clear: on tool-use and agentic tasks, Trinity-Large-Thinking is genuinely competitive with Opus 4.6. On complex repo-scale software engineering (SWE-bench Verified), there is still a real gap — Opus 4.6 wins by about 12 points.

That candor is the right note for a release like this. Trinity is not magically better than Anthropic's flagship. It is competitive enough on the use cases where most enterprises actually deploy agents — customer-service flows, retrieval-augmented workflows, code generation in well-defined contexts — at a fraction of the price.

Pricing That Reframes the Conversation

On Arcee's hosted API, Trinity-Large-Thinking is $0.90 per million output tokens. Claude Opus 4.6 is roughly $75 per million output tokens.

That is roughly 99% cheaper. Or, framed the other way, about 83x more output for the same dollar.
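The arithmetic, using the hosted output-token prices quoted above (input-token pricing ignored for simplicity, and the monthly workload figure is an illustrative assumption):

```python
# Compare hosted output-token pricing (USD per million output tokens).
trinity_price = 0.90   # Arcee hosted API, per the launch pricing
opus_price = 75.00     # Claude Opus 4.6, approximate

savings = 1 - trinity_price / opus_price
multiple = opus_price / trinity_price
print(f"{savings:.1%} cheaper, {multiple:.0f}x more output per dollar")

# Illustrative agent workload: 50M output tokens/month of reasoning traces.
monthly_tokens_m = 50
print(f"Trinity: ${trinity_price * monthly_tokens_m:,.0f}/mo  "
      f"Opus: ${opus_price * monthly_tokens_m:,.0f}/mo")
```

At that scale the difference is $45 versus $3,750 a month for output tokens alone, which is the difference between "experiment freely" and "justify every agent run."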

For agent workloads that burn through tokens — a planning agent that emits long reasoning traces, a coding agent that iterates over a codebase, a customer-service bot that holds long context — this kind of price gap is not a marginal optimization. It changes which use cases pencil out at all.

And because the weights are open under Apache 2.0, you do not have to use Arcee's API. You can:

  • Download the BF16 weights from Hugging Face at arcee-ai/Trinity-Large-Thinking.
  • Run quantized versions — FP8 or W4A16 (INT4 weights, 16-bit activations) — on more modest hardware.
  • Fine-tune the model on proprietary data without licensing concerns.
  • Self-host on your own GPUs for compliance-sensitive workloads.

"We built it for developers and enterprises that want models they can inspect, post-train, host, distill, and own." — Arcee AI's launch announcement.

That is the pitch in one sentence. Inspect, post-train, host, distill, own. Five verbs that closed-API providers cannot offer.

The Training Run: A $20 Million Bet

The story behind the training is almost as interesting as the model itself.

Arcee committed $20 million — nearly half of its total venture funding — to a single 33-day pretraining run on 2,048 NVIDIA B300 Blackwell GPUs. According to public reporting, it is the largest publicly disclosed pretraining run on B300 hardware to date. Post-training was handled separately on a cluster of 1,152 H100s.
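A quick sanity check on the scale of the run, in GPU-hours only (the $20M figure covers far more than compute, so the rate below is an upper bound, not a price):

```python
# Scale of the pretraining run described above.
gpus, days = 2048, 33
gpu_hours = gpus * days * 24
print(f"{gpu_hours:,} B300 GPU-hours")  # 1,622,016

# If compute were the whole $20M (it isn't -- the figure is all-in,
# including salaries, data, and storage), the effective rate would
# top out around $12/GPU-hour.
print(f"upper bound: ${20e6 / gpu_hours:.2f}/GPU-hour")
```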

The all-in $20M figure includes compute, salaries, data licensing, storage, and operations — not just rented GPU hours. For context, that is roughly an order of magnitude less than what Meta is reported to have spent on the original Llama 3 405B run, and it produced a model that is competitive on agentic benchmarks with frontier closed models that cost vastly more to train.

The lesson is not that frontier training is suddenly cheap. It is that algorithmic and architectural progress — better data curation, smarter MoE routing, more efficient post-training recipes — has compressed what a focused, technically excellent team can do with a constrained budget.

What This Means for Enterprises

The honest case for Trinity-Large-Thinking inside a company is different from the case for DeepSeek V4 or GLM-5.1, even though all three are open-weights MoE models.

For some U.S. and EU enterprises — particularly those in defense, critical infrastructure, healthcare, and finance — running inference against a Chinese-trained model raises real procurement, regulatory, and supply-chain questions. Whether or not those concerns are technically warranted in any given case, they are operationally real. A U.S.-trained, Apache-licensed, frontier-class open-weights model is a slot on the vendor list that genuinely did not exist a month ago.

Trinity-Large-Thinking fills that slot. Combined with the model's sparse architecture (cheap to serve), reasoning focus (good at agents), and unrestricted license (no royalty math), it is the rare U.S. open model that is plausibly deployable at enterprise scale, not just interesting for researchers.

The Caveats

A few things to keep in mind.

First, 13B active parameters means that even though the model is "only" using a fraction of its 398B at any moment, you still need to load all 398B parameters into memory to serve it. In BF16, that is roughly 800 GB of weights — a multi-GPU deployment. Quantized to INT4, you can fit it onto a single 8-GPU H200 node, but that is still a serious infrastructure commitment.
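A rough weights-only estimate makes the deployment math concrete (this ignores KV cache, activations, and runtime overhead, which add substantially on top):

```python
# Weights-only memory footprint at different quantization levels.
TOTAL_PARAMS = 398e9

for name, bits in [("BF16", 16), ("FP8", 8), ("W4A16 / INT4", 4)]:
    gigabytes = TOTAL_PARAMS * bits / 8 / 1e9
    print(f"{name:>12}: {gigabytes:,.0f} GB of weights")

# An 8x H200 node offers 8 * 141 GB = 1,128 GB of HBM, so INT4 weights
# (~199 GB) fit with plenty of room for KV cache; BF16 (~796 GB) does
# not leave much headroom even across a full node.
```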

Second, the SWE-bench gap is real. If your primary use case is "autonomously fix this bug in a 50,000-line repo," Claude Opus 4.6 and GPT-5.4 still win, sometimes decisively. Trinity is competitive on agentic flows, not on everything.

Third, Trinity is text-only. No vision, no audio, no native tool-use grammar baked in. That makes it a clean general-purpose reasoning engine, but it also means you are pairing it with separate models for multi-modal inputs.

The Bottom Line

Arcee just did something that has been missing from the U.S. AI ecosystem: shipped a frontier-capable open-weight model under a permissive license, from a U.S. team, with full transparency about the training process. That does not solve every open-source AI problem, but it changes the shape of the choice for any enterprise that has been waiting for a non-Chinese option.

At $0.90 per million output tokens hosted, or free to self-host under Apache 2.0, Trinity-Large-Thinking is competitive enough on the agentic benchmarks that matter to be a real contender — not a charity case. The fact that a 100-person startup pulled this off for $20 million is the more important story underneath: the door to frontier open AI is not closed, and it does not require Big Tech budgets to walk through.

If you have been holding off on agent infrastructure because the open options were not credible enough or the closed options were not affordable enough, Trinity-Large-Thinking is the first model in a while that genuinely changes the calculation.