AVALON-2B
The first sub-3B language model that knows what it doesn't know.
AVALON-2B is the first language model below 3B parameters to implement Self-Reflective Retrieval-Augmented Generation (Self-RAG) with learned reflection tokens.
Built on Qwen 3.5 2B, AVALON introduces a five-token reflection vocabulary ([Retrieval], [No Retrieval], [Relevant], [Utility:5], [Utility:4]) and a 22M-parameter MiniLM router that predicts retrieval necessity at the query level with 90.5% accuracy in 5 ms. Under LoRA fine-tuning, saving embed_tokens and lm_head alongside the LoRA adapters lifts Self-RAG token accuracy from 12% (naive LoRA) to 82.5%.
We trained on 201,004 synthetic samples (80% Self-RAG, 20% general instruction) for ~6 hours on 8× A100 80GB. The model beats Qwen 3.5 2B base, Gemma 4 E2B, and SmolLM3 3B on its target benchmarks, runs at 40 tok/s on a MacBook Air M3 and 12 tok/s on an iPhone 15 Pro, and ships under Apache 2.0 with weights, GGUF quants, and the synthetic-data pipeline all public.
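The 80/20 mixture is straightforward to reproduce with a weighted interleave. A minimal sketch using the datasets library; the file names are placeholders standing in for the synthetic-data pipeline's output, not the released corpus:

from datasets import load_dataset, interleave_datasets

# Placeholder file names; substitute the pipeline's actual output splits.
self_rag = load_dataset("json", data_files="self_rag_synthetic.jsonl", split="train")
general = load_dataset("json", data_files="general_instruct.jsonl", split="train")

# 80% Self-RAG reflection samples, 20% general instruction samples.
train_mix = interleave_datasets([self_rag, general], probabilities=[0.8, 0.2], seed=42)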
The router fires in parallel with prompt encoding.
Qwen 3.5 2B: 18 Gated DeltaNet linear-attention layers plus 6 full softmax-attention layers, with a 262K native context window. No RoPE hacks, no position-interpolation tricks.
A 22M-parameter MiniLM router predicts retrieval necessity at the query level with 90.5% accuracy in 5 ms. The embedding call runs in parallel with prompt encoding, so the routing decision adds nothing to the critical path.
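A minimal sketch of that overlap, assuming a MiniLM checkpoint from sentence-transformers and a placeholder linear head; the released router's weights and packaging may differ:

import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

# Hypothetical stand-ins for the learned router head.
router = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
ROUTER_W = np.zeros(384)  # placeholder weights for the linear head
ROUTER_B = 0.0            # placeholder bias

def needs_retrieval(query: str) -> bool:
    emb = router.encode(query)  # ~5 ms MiniLM embedding
    return float(emb @ ROUTER_W) + ROUTER_B > 0.0

tokenizer = AutoTokenizer.from_pretrained("nuroai/Avalon-2B")
query = "What's in the news today about lithium prices?"

with ThreadPoolExecutor(max_workers=1) as pool:
    route = pool.submit(needs_retrieval, query)      # router runs off the critical path
    inputs = tokenizer(query, return_tensors="pt")   # prompt encoding proceeds meanwhile
    retrieve = route.result()                        # decision is ready by the time encoding finishes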
We trained with LoRA plus modules_to_save=["embed_tokens", "lm_head"]. The reflection tokens occupy freshly added rows of the embedding and output matrices, which vanilla LoRA leaves frozen, so without saving (and training) those two modules the model cannot actually learn the new special tokens: Self-RAG token accuracy stays around 12%. With them saved, it lifts to 82.5%. That single config change is the difference between a paper and a noise floor.
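A minimal PEFT sketch of the recipe; the base-checkpoint path and LoRA hyperparameters are illustrative, not the paper's exact values:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "path/to/qwen-3.5-2b-base"  # placeholder: point at the Qwen 3.5 2B base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.add_special_tokens({"additional_special_tokens": [
    "[Retrieval]", "[No Retrieval]", "[Relevant]", "[Utility:5]", "[Utility:4]",
]})

model = AutoModelForCausalLM.from_pretrained(BASE)
model.resize_token_embeddings(len(tokenizer))  # new rows for the reflection tokens

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,                                                     # illustrative rank
    lora_alpha=32,                                            # illustrative
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    # The one line that matters: fully train and save the embedding and
    # output matrices so the new reflection-token rows can actually learn.
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()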
Beats models 1.5–2× our size — and adds a capability nobody has at this scale.
Frontier capability that fits in a phone.
Three lines of Python, or one line of bash.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("nuroai/Avalon-2B")
tokenizer = AutoTokenizer.from_pretrained("nuroai/Avalon-2B")
system = """You are AVALON, a self-reflective AI assistant.
Before answering: emit [Retrieval] or [No Retrieval].
End every response with [Utility:X] (1-5)."""
prompt = system + "\n\nUser: What's in the news today about lithium prices?"
out = model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=False))  # keep the reflection tokens visible
# → [Retrieval] ... <answer> ... [Utility:5]
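When the model opens with [Retrieval], fetch context before letting it answer. A minimal orchestration sketch continuing from the quickstart above; search_web and the context format are hypothetical stand-ins for your own retriever and prompt template:

def search_web(query):
    # Hypothetical stub: replace with your actual retriever or search API.
    return ["(retrieved passage 1)", "(retrieved passage 2)"]

text = tokenizer.decode(out[0], skip_special_tokens=False)
if "[Retrieval]" in text:
    context = "\n".join(search_web("lithium prices news"))
    augmented = prompt + "\n\nContext:\n" + context  # assumed context format
    out = model.generate(**tokenizer(augmented, return_tensors="pt"), max_new_tokens=256)
    print(tokenizer.decode(out[0], skip_special_tokens=False))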
# install
ollama pull nuroai/avalon-2b
# run
ollama run nuroai/avalon-2b
# 1.5 GB Q4_K_M quant
# ~40 tok/s on a MacBook Air M3
# ~12 tok/s on iPhone 15 Pro

What this model does not do.
GSM8K math drops 4.5%
Reflection-token training shifts probability mass away from chained arithmetic. We recover most of it with mixed-task fine-tuning, but the regression is real and we report it.
English-only
Training corpus is 100% English. The reflection vocabulary should transfer, but we have not validated multilingual Self-RAG behavior — and we don't claim it.
Static, binary retrieval decision at query start
The router emits a single binary [Retrieval] / [No Retrieval] decision per query. Mid-generation re-routing and graded retrieval intensity are on the roadmap, not in this release.
If AVALON-2B helps your work, please cite.
@misc{ponnada2026avalon,
title = {AVALON-2B: Self-Reflective Retrieval-Augmented Generation
in Sub-3B Language Models},
author = {Ponnada, Akhil and Arvapalli, Naga Sri},
year = {2026},
month = {April},
publisher = {Nuro AI Labs},
howpublished = {\url{https://huggingface.co/nuroai/Avalon-2B}},
note = {Apache 2.0}
}

Read the paper. Then run it.
39 pages, NeurIPS format. Authors: Akhil Ponnada · Naga Sri Arvapalli.