Nuro AI Labs
Released April 2026 · Apache 2.0

AVALON-2B

The first sub-3B language model that knows what it doesn't know.

Akhil Ponnada · Naga Sri Arvapalli · Nuro AI Labs
Abstract

AVALON-2B is the first language model below 3B parameters to implement Self-Reflective Retrieval-Augmented Generation with learned reflection tokens.

Built upon Qwen 3.5 2B, AVALON introduces a five-token reflection vocabulary — [Retrieval], [No Retrieval], [Relevant], [Utility:5], [Utility:4] — and a 22M-parameter MiniLM router that predicts retrieval necessity at the query level with 90.5% accuracy in 5 ms. The model achieves 82.5% Self-RAG token accuracy under LoRA fine-tuning when embed_tokens and lm_head are saved alongside the LoRA adapters — a recipe that lifts token accuracy from 12% (naive LoRA) to 82.5%.

We trained on 201,004 synthetic samples (80% Self-RAG, 20% general instruction) for ~6 hours on 8× A100 80GB. The model beats Qwen 3.5 2B base, Gemma 4 E2B, and SmolLM3 3B on its target benchmarks, runs at 40 tok/s on a MacBook Air M3 and 12 tok/s on an iPhone 15 Pro, and ships under Apache 2.0 with public weights, GGUF quants, and the synthetic-data pipeline.

Architecture · REFLECT

The router fires in parallel with prompt encoding.

[Diagram: REFLECT architecture. For a query such as "What's in the news today about lithium prices?", the MiniLM router (22M params, 5 ms latency, 90.5% accuracy) decides [Retrieval] vs [No Retrieval] while the prompt is being encoded; generation runs on Qwen 3.5 2B + LoRA (1.88B params, 18 GDN + 6 softmax layers) and closes the response with a utility token such as [Utility:5].]
Base model

Qwen 3.5 2B — 18 Gated DeltaNet linear-attention layers + 6 full softmax layers, with a 262K native context window. No RoPE hacks, no position-interpolation tricks.

Router

A 22M-parameter MiniLM router predicts retrieval necessity at the query level with 90.5% accuracy in 5 ms. The embedding call happens in parallel with prompt encoding — retrieval is free on the critical path.
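Conceptually, the overlap looks like this. A minimal sketch of the control flow, not the released code: the keyword check below is purely an illustrative stand-in for the learned MiniLM classifier.

```python
# Sketch: fire the router concurrently with prompt encoding so its
# ~5 ms never sits on the critical path.
from concurrent.futures import ThreadPoolExecutor

def route(query: str) -> str:
    # Stand-in for the 22M-parameter MiniLM router (the real router is
    # a learned classifier; this keyword heuristic is illustrative only).
    needs_facts = any(w in query.lower() for w in ("news", "today", "latest"))
    return "[Retrieval]" if needs_facts else "[No Retrieval]"

def encode_prompt(query: str) -> list:
    # Stand-in for tokenizer / prompt encoding.
    return query.split()

query = "What's in the news today about lithium prices?"
with ThreadPoolExecutor(max_workers=2) as pool:
    decision_f = pool.submit(route, query)        # router call
    tokens_f = pool.submit(encode_prompt, query)  # runs concurrently
decision, tokens = decision_f.result(), tokens_f.result()
print(decision)  # → [Retrieval]
```

Because both futures resolve before generation starts, the router's decision is available "for free" by the time the first token is sampled.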

The LoRA trick

We trained with LoRA plus modules_to_save=["embed_tokens", "lm_head"]. Without saving those two modules, the model cannot actually learn the new special tokens — Self-RAG token accuracy stays around 12%. With them saved, it lifts to 82.5%. That single config change is the difference between a paper and a noise floor.
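In PEFT terms, the change is one field in the adapter config. A sketch under stated assumptions: `modules_to_save` is confirmed by the text above, while the rank, alpha, and target modules are illustrative placeholders, not the paper's values.

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,                      # rank: assumed, not the paper's value
    lora_alpha=32,             # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    # The critical line: train and save full copies of the embedding
    # matrix and output head so the new reflection tokens are learnable.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
```

Without `modules_to_save`, the rows for the newly added special tokens stay at their random initialization, which is why naive LoRA plateaus near the 12% noise floor.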

~6 hours on 8× A100 80GB · 201,004 synthetic samples · 80% Self-RAG / 20% general
Benchmarks

Beats models 1.5–2× our size — and adds a capability nobody has at this scale.

| Model | Params | MMLU | HellaSwag | ARC-C | Self-RAG |
|---|---|---|---|---|---|
| AVALON-2B ★ (ours) | 1.88B | 62.04 | 64.14 | 42.75 | 82.5% |
| Qwen 3.5 2B (base) | 1.88B | 61.63 | 62.15 | 41.64 | – |
| Gemma 4 E2B | 2.3B | 58.0 | 68.0 | 48.0 | – |
| SmolLM3 3B | 3.0B | 55.0 | 70.0 | 50.0 | – |
| Phi-4-mini | 3.8B | 62.0 | – | – | – |
Highlights:

- MMLU: 62.04 (vs 61.63 base)
- Self-RAG token accuracy: 82.5% (vs 12% with naive LoRA)
- Router accuracy: 90.5% at 5 ms latency
- Context window: 262K native, no RoPE hacks
On-device · Q4_K_M (1.5 GB)

Frontier capability that fits in a phone.

The 1.5 GB Q4_K_M GGUF quant runs locally on every device we tested. Numbers are real measurements, not vendor claims.
| Device | Chip | Tokens/sec | Memory |
|---|---|---|---|
| MacBook Air | M3 | 40.2 t/s | 1.5 GB |
| MacBook Pro | M3 Pro | 52.4 t/s | 1.5 GB |
| Mac Studio | M2 Ultra | 78.6 t/s | 1.5 GB |
| iPhone 15 Pro | A17 Pro | 12.4 t/s | 1.5 GB |
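The 1.5 GB figure is consistent with a back-of-envelope check. Our arithmetic, assuming roughly 4.5 effective bits per weight for Q4_K_M (an approximation; the exact rate depends on the layer mix) plus KV-cache and runtime overhead:

```python
params = 1.88e9             # AVALON-2B parameter count
bits_per_weight = 4.5       # assumed effective rate for Q4_K_M
weights_gb = params * bits_per_weight / 8 / 1e9
print(round(weights_gb, 2))  # → 1.06
# ~1.06 GB of quantized weights; KV cache and runtime overhead
# account for the remainder of the measured ~1.5 GB footprint.
```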
Quickstart

Three lines of Python, or one line of bash.

quickstart.py (Python)
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("nuroai/Avalon-2B")
tokenizer = AutoTokenizer.from_pretrained("nuroai/Avalon-2B")

system = """You are AVALON, a self-reflective AI assistant.
Before answering: emit [Retrieval] or [No Retrieval].
End every response with [Utility:X] (1-5)."""

prompt = system + "\n\nUser: What's in the news today about lithium prices?"
inputs = tokenizer(prompt, return_tensors="pt")
out = tokenizer.decode(model.generate(**inputs)[0])
# → [Retrieval] ... <answer> ... [Utility:5]
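Downstream code needs to strip the reflection tokens out of the generation before showing it to a user. A minimal sketch (our helper, not part of the released package):

```python
import re

def parse_reflection(text: str) -> dict:
    """Extract AVALON's reflection tokens from a raw generation."""
    retrieval = re.search(r"\[(?:No )?Retrieval\]", text)
    utility = re.search(r"\[Utility:([1-5])\]", text)
    return {
        "retrieval": retrieval.group(0) if retrieval else None,
        "utility": int(utility.group(1)) if utility else None,
    }

sample = "[Retrieval] Lithium prices rose on supply concerns. [Utility:5]"
print(parse_reflection(sample))
# → {'retrieval': '[Retrieval]', 'utility': 5}
```

A `retrieval` value of `[Retrieval]` is the signal to run your retriever and re-prompt with the fetched context.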
ollama-cli.sh (bash)
# install
ollama pull nuroai/avalon-2b

# run
ollama run nuroai/avalon-2b

# 1.5 GB Q4_K_M quant
# ~40 tok/s on a MacBook Air M3
# ~12 tok/s on iPhone 15 Pro
Limitations

What this model does not do.

Every claim is auditable; that includes the regressions. We report what broke alongside what shipped.
01

GSM8K math drops 4.5%

Reflection-token training shifts probability mass away from chained arithmetic. We recover most of it with mixed-task fine-tuning, but the regression is real and we report it.

02

English-only

Training corpus is 100% English. The reflection vocabulary should transfer, but we have not validated multilingual Self-RAG behavior — and we don't claim it.

03

Static binary retrieval at start

The router emits a single binary [Retrieval] / [No Retrieval] decision per query. Mid-generation re-routing and graded retrieval intensity are on the roadmap, not in this release.

Citation

If AVALON-2B helps your work, please cite.

avalon-2b.bib (BibTeX)
@misc{ponnada2026avalon,
  title         = {AVALON-2B: Self-Reflective Retrieval-Augmented Generation
                   in Sub-3B Language Models},
  author        = {Ponnada, Akhil and Arvapalli, Naga Sri},
  year          = {2026},
  month         = {April},
  publisher     = {Nuro AI Labs},
  howpublished  = {\url{https://huggingface.co/nuroai/Avalon-2B}},
  note          = {Apache 2.0}
}

Read the paper. Then run it.

39 pages, NeurIPS format. Authors: Akhil Ponnada · Naga Sri Arvapalli.