Shfrah
AI·7 min·February 20, 2026

Arabic RAG — what we're learning

Notes from building semantic search for Arabic documents.

Semantic search over Arabic documents is harder than it looks. Most available embedding models were trained largely on English text, and Arabic's rich morphology exposes their limits quickly.

Chunking matters

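Splitting Arabic text into meaningful chunks isn't trivial. Arabic sentences run long, and meaning is often carried from clause to clause by connectors such as و (and) and ف (so). Naive fixed-size chunking cuts through those links, so a retrieved chunk can arrive without the context that makes it useful.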
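
One way to avoid cutting mid-sentence is to split on sentence-ending punctuation first and only then pack whole sentences into chunks, carrying the last sentence forward as overlap. The sketch below is a minimal illustration under those assumptions. The names `split_sentences` and `chunk_sentences` and the `max_chars` and `overlap` parameters are hypothetical, and real Arabic text (which often has sparse punctuation) would likely need a stronger sentence segmenter.

```python
import re

# Split after a sentence-ending mark followed by whitespace.
# Marks covered: . ! ? plus the Arabic question mark (U+061F) and ellipsis (U+2026).
_SENTENCE_END = re.compile(r"(?<=[.!?\u061F\u2026])\s+")


def split_sentences(text: str) -> list[str]:
    """Return the non-empty sentences of `text`, stripped of outer whitespace."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def chunk_sentences(text: str, max_chars: int = 400, overlap: int = 1) -> list[str]:
    """Pack whole sentences into chunks of roughly `max_chars` characters.

    The last `overlap` sentences of each chunk are repeated at the start of
    the next one, so a chunk can exceed `max_chars` when sentences are long.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as shared context.
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The overlap is a deliberately crude stand-in for connector-aware splitting: it keeps the sentence a connector points back to inside the same chunk, at the cost of some duplicated text in the index.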

Normalisation is a double-edged sword

Stripping diacritics and unifying alif and hamza forms improves matching, but can erase meaningful distinctions: without diacritics, عِلْم (knowledge) and عَلَم (flag) collapse into the same string. We normalise carefully, keeping the original where the difference matters.
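
A minimal sketch of that trade-off, assuming normalisation is applied only to a search key while the original text is stored beside it. The function and class names are hypothetical, and the set of characters folded here (short-vowel diacritics, tatweel, and the hamza-carrying alif forms) is one reasonable choice, not a complete or authoritative rule set.

```python
import re
from dataclasses import dataclass

# Tashkeel marks U+064B..U+0652 plus the superscript alef U+0670.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character, carries no meaning

# Fold alif with hamza above/below, alif madda and alif wasla into bare alif.
_ALIF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0622": "\u0627",
    "\u0671": "\u0627",
})


def normalize_arabic(text: str) -> str:
    """Return a lossy matching key: no diacritics, no tatweel, unified alif."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_ALIF_VARIANTS)


@dataclass(frozen=True)
class IndexedText:
    original: str    # kept verbatim for display and for diacritic-sensitive checks
    normalized: str  # used for matching and embedding lookups


def prepare(text: str) -> IndexedText:
    """Pair the untouched text with its normalised key."""
    return IndexedText(original=text, normalized=normalize_arabic(text))


# "knowledge" and "flag" differ only in diacritics, so their keys collide.
ILM = "\u0639\u0650\u0644\u0652\u0645"
ALAM = "\u0639\u064E\u0644\u064E\u0645"
same_key = normalize_arabic(ILM) == normalize_arabic(ALAM)
```

Because `original` is never overwritten, a later stage can still tell the two words apart when the query itself carries diacritics.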

Evaluate in Arabic, not in translation

It's not enough for the system to work on translated examples. We build native Arabic evaluation sets with questions a real user would ask, and measure against them, because translation hides the flaws.
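
Measuring against such a set can be as simple as the sketch below. It assumes each evaluation item pairs a native Arabic question with the ids of the documents that answer it, and that `retrieve` is whatever function returns ranked document ids for a query. Both the data shape and the function names are hypothetical, and recall@k and mean reciprocal rank are used here as common retrieval metrics rather than as the specific ones behind these notes.

```python
from typing import Callable, Iterable, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant document ids found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    eval_set: Iterable[tuple[str, set[str]]],
    retrieve: Callable[[str], Sequence[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average recall@k and reciprocal rank over (question, relevant_ids) pairs."""
    recalls: list[float] = []
    ranks: list[float] = []
    for question, relevant in eval_set:
        retrieved = retrieve(question)
        recalls.append(recall_at_k(retrieved, relevant, k))
        ranks.append(reciprocal_rank(retrieved, relevant))
    if not recalls:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    return {
        "recall_at_k": sum(recalls) / len(recalls),
        "mrr": sum(ranks) / len(ranks),
    }
```

Running the same `evaluate` call over the native set and over a translated copy of it is one way to see how much the translated numbers flatter the system.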

Arabic deserves tools built for it, not tools adapted to it. That's what we're working toward, one step at a time.
