Research · 2026-04-27

Introducing AVALON-2B

The first sub-3B language model that knows what it doesn't know.

By Akhil Ponnada and Naga Sri Arvapalli

Today we are open-sourcing AVALON-2B — a 1.88-billion-parameter language model that knows when to look things up.

AVALON is the first language model below 3B parameters to implement Self-Reflective Retrieval-Augmented Generation with learned reflection tokens. Built on Qwen 3.5 2B, AVALON introduces a five-token reflection vocabulary that lets the model decide, at generation time, whether a query needs external retrieval — and whether the response it just generated is good enough.
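Mechanically, a reflection vocabulary is a handful of special tokens added to the base tokenizer, with the embedding matrix resized to match, after which the model is fine-tuned to emit them. A minimal sketch using the Hugging Face transformers stack; the checkpoint id and exact token inventory below are illustrative, not the released ones:

```python
# Minimal sketch: extending a base checkpoint with reflection tokens.
# The checkpoint id and token strings are illustrative; consult the
# AVALON model card for the actual set.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3.5-2B"  # hypothetical id for the base model

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

reflection_tokens = (
    ["[Retrieval]", "[No Retrieval]", "[Relevant]"]
    + [f"[Utility:{i}]" for i in range(1, 6)]  # utility scale 1..5
)
tokenizer.add_special_tokens({"additional_special_tokens": reflection_tokens})
model.resize_token_embeddings(len(tokenizer))
# From here, fine-tune on data annotated with these tokens so the model
# learns to emit them at generation time.
```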

Why this matters

Personal intelligence requires self-knowledge. A mind that doesn't know when it doesn't know is not yet a mind. Until now, the technique behind that capability, Self-RAG, has required 7B+ parameters. AVALON does it at 1.88B.

What it does

AVALON generates reflection tokens in the following classes during inference (a sketch of how they gate decoding follows the list):

  • [Retrieval] / [No Retrieval] — does this query need external knowledge?
  • [Relevant] — is the retrieved content actually useful?
  • [Utility:1–5] — how good is the response I just generated?
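The control flow these tokens imply: a [Retrieval] token triggers a lookup and per-passage regeneration, [Relevant] filters the candidates, and [Utility:1–5] ranks what survives. A minimal sketch of that loop, with `lm` and `retrieve` as stand-in callables rather than AVALON's actual serving API:

```python
# Sketch of how reflection tokens gate decoding. `lm` and `retrieve` are
# stand-in callables, not AVALON's actual serving API.
def utility_score(text: str) -> int:
    """Highest [Utility:N] token present in a generation, 0 if none."""
    for u in range(5, 0, -1):
        if f"[Utility:{u}]" in text:
            return u
    return 0

def self_rag_answer(lm, retrieve, query: str) -> str:
    draft = lm(query)  # the model emits reflection tokens inline
    if "[Retrieval]" not in draft:
        return draft  # model judged its own knowledge sufficient
    # Re-generate once per retrieved passage, keep only generations the
    # model itself marks [Relevant], and rank by its utility judgment.
    candidates = [lm(query, passage=p) for p in retrieve(query)]
    relevant = [c for c in candidates if "[Relevant]" in c]
    return max(relevant, key=utility_score) if relevant else draft
```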

A 22M-parameter MiniLM router predicts retrieval necessity at the query level with 90.5% accuracy in 5 ms — letting the retrieval call happen in parallel with prompt encoding.
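The routing itself is one cheap sequence-classification forward pass, so it can run before anything else and hide the retrieval call behind prefill. A sketch of that overlap, assuming a hypothetical router checkpoint id and stand-in `prefill`/`generate`/`retrieve` callables:

```python
# Sketch: query-level retrieval routing overlapped with prompt encoding.
# The router checkpoint id is hypothetical; prefill/generate/retrieve
# are stand-ins for the serving stack's actual calls.
from concurrent.futures import ThreadPoolExecutor

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ROUTER = "nuro-ai/avalon-minilm-router"  # hypothetical checkpoint id
router_tok = AutoTokenizer.from_pretrained(ROUTER)
router = AutoModelForSequenceClassification.from_pretrained(ROUTER).eval()

def needs_retrieval(query: str) -> bool:
    """Fast binary prediction: does this query need external knowledge?"""
    inputs = router_tok(query, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return bool(router(**inputs).logits.argmax(-1).item())

def answer(query: str, prefill, generate, retrieve) -> str:
    if not needs_retrieval(query):
        return generate(prefill(query), passages=None)
    # Router said yes: kick off retrieval and prompt encoding together,
    # so the index lookup hides behind prefill instead of adding latency.
    with ThreadPoolExecutor(max_workers=2) as pool:
        passages = pool.submit(retrieve, query)
        cache = pool.submit(prefill, query)
        return generate(cache.result(), passages=passages.result())
```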

Numbers

Small enough for a phone. Big enough to know what it doesn't know.

  • MMLU: 62.04, ahead of the Qwen 3.5 2B base (61.63), Gemma 4 E2B (58.0), and SmolLM3 3B (55.0)
  • HellaSwag: 64.14 · ARC-Challenge: 42.75
  • Self-RAG token accuracy: 82.5%, with the reflection vocabulary learned end-to-end
  • Throughput: 40 tok/s on a MacBook Air M3 and 12 tok/s on an iPhone 15 Pro (Q4_K_M quant, 1.5 GB on disk)
  • Native context window: 262 K tokens

Open

AVALON-2B is released under Apache 2.0. Weights, GGUF quants and the paper are public from today.
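With the GGUF quants out, the laptop numbers above can be reproduced with any llama.cpp-based runtime. A quick sketch with llama-cpp-python; the model path is illustrative and should point at the downloaded quant:

```python
# Sketch: running the Q4_K_M quant locally via llama-cpp-python.
# The model path is illustrative; point it at the downloaded GGUF file.
from llama_cpp import Llama

llm = Llama(model_path="avalon-2b.Q4_K_M.gguf", n_ctx=8192)

out = llm("Who won the 2022 Fields Medal?", max_tokens=128)
print(out["choices"][0]["text"])  # the model may emit reflection tokens inline
```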

What's next

AVALON-2B is the first model in our research roadmap. PLMR — pre-tokenizer latent memory routing for byte-level LMs — is in preprint. Hydra is in active development. And on the infrastructure side, our cognitive memory layer for agents, Hypersave, is live.

We're a research lab on the long arc from personal intelligence to general intelligence. AVALON is the first proof that we mean it.

— Nuro AI Labs
