Arabic LLM · 6 min read · May 8, 2026

Falcon-H1 Arabic vs Jais 2: A Production Comparison for GCC Workloads

Arabic-capable open-weight LLMs reached production quality in 2026. A side-by-side comparison of Falcon-H1 Arabic and Jais 2 across MSA, Gulf dialects, deployment cost, and the use cases each is best suited for.

Technova Team

Expert Insights


For UAE and broader GCC AI workloads operating on Arabic data, the language-model layer reached an inflection point in late 2025. Open-weight Arabic-capable models — Falcon-H1 Arabic from TII and Jais 2 from G42's Inception — moved from research curiosity to production quality. By May 2026 they're the default choice for any sovereign deployment processing Arabic content.

This post is the side-by-side comparison we use during the discovery sprint of Sovereign AI engagements. It's written for the architect or AI engineer choosing the model layer for an Arabic-heavy workload.

Why this question matters more in 2026

Three trends converged.

The CBUAE Sovereign Financial Cloud (launched February 2026) made data residency a hard constraint for in-scope workloads. Calling Claude or GPT for Arabic banking conversations now triggers compliance friction even where it didn't before.

The Dubai agentic AI mandate added documentation requirements that favour models you can fully audit — open-weight, deployed on your infrastructure, with traceable provenance.

The Arabic-LLM quality bar crossed the production threshold. Earlier Arabic-capable models (Jais 1, AceGPT) had quality gaps that pushed sophisticated workloads back to GPT-4 with translation layers. Falcon-H1 Arabic and Jais 2 closed enough of that gap to be the right answer for most cases.

The benchmark picture

We benchmark both models against client-specific corpora during discovery. The pattern across 2026 engagements:

| Task type | Falcon-H1 Arabic 70B | Jais 2 70B | Claude Sonnet 4.6 |
| --- | --- | --- | --- |
| MSA Q&A on retrieved context | Strong | Strong | Strong |
| Gulf dialect customer support | Strongest | Strong | Adequate |
| Egyptian/Levantine dialect | Strong | Adequate | Adequate |
| Saudi-context factual recall | Strong | Strongest | Adequate |
| Code generation (English) | Adequate | Adequate | Strongest |
| Code generation (Arabic context) | Strong | Strong | Adequate |
| Math reasoning | Adequate | Adequate | Strongest |
| Structured output (JSON) | Strong | Strong | Strongest |
| English-only tasks | Adequate | Adequate | Strongest |
| Arabic-English code-switching | Strong | Strong | Adequate |

For Arabic-content-heavy workloads, the Arabic models match or beat Claude Sonnet on most relevant tasks. The places they don't match — code, math, English-only — are usually addressable by hybrid routing.

The hybrid pattern

The architecture we ship by default for GCC clients:

  1. Detect language at request entry (lightweight classifier — ~50ms latency).
  2. Arabic input → route to Falcon-H1 Arabic or Jais 2 on sovereign infrastructure.
  3. English input → route to Claude/GPT through Vercel AI Gateway (subject to data classification — for regulated data, English routes to a sovereign English model like Llama 3.3 instead).
  4. Code-switched input → route to whichever Arabic LLM benchmarked stronger for the use case.

Routing decisions happen at the application layer. From the user's perspective, it's one interface; from the architecture's perspective, the right model handles each request.
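The routing logic above can be sketched in a few lines. This is a minimal illustration, not our production classifier: the model identifiers are placeholders, and the script-ratio heuristic stands in for the ~50ms language classifier mentioned in step 1.

```python
import re

# Placeholder model identifiers -- illustrative names, not real endpoints.
ARABIC_MODEL = "falcon-h1-arabic-70b"      # sovereign infrastructure
ENGLISH_MODEL = "claude-sonnet"            # via gateway, non-regulated only
SOVEREIGN_ENGLISH = "llama-3.3-sovereign"  # sovereign English fallback

_ARABIC = re.compile(r"[\u0600-\u06FF]")   # Arabic Unicode block
_LATIN = re.compile(r"[A-Za-z]")

def detect_script_mix(text: str) -> str:
    """Cheap character-class heuristic standing in for a trained classifier."""
    arabic = len(_ARABIC.findall(text))
    latin = len(_LATIN.findall(text))
    letters = arabic + latin
    if letters == 0:
        return "unknown"
    ratio = arabic / letters
    if ratio > 0.8:
        return "arabic"
    if ratio < 0.2:
        return "english"
    return "code-switched"

def route(text: str, regulated: bool) -> str:
    """Arabic and code-switched input stays sovereign; English input goes to a
    frontier model unless the request touches regulated data."""
    lang = detect_script_mix(text)
    if lang in ("arabic", "code-switched"):
        return ARABIC_MODEL
    if regulated:
        return SOVEREIGN_ENGLISH
    return ENGLISH_MODEL
```

In production the thresholds and the code-switched target would come out of the discovery benchmark, not hard-coded constants.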

The cost picture

Steady-state for a typical UAE banking deployment (~20-50 concurrent users on Arabic queries, ~3M tokens/day of Arabic content):

| Configuration | Capex | Monthly opex |
| --- | --- | --- |
| Cloud-only (Claude via Bedrock) | AED 0 | AED 35,000–80,000 |
| Sovereign, Foundations tier (Falcon-H1 70B on H100) | AED 280,000 | AED 12,000–18,000 |
| Sovereign, Sovereign tier (multi-GPU H100/H200) | AED 380,000 | AED 18,000–28,000 |

Cloud-only is cheaper at very low volumes. Sovereign overtakes economically past about 1.5–2M tokens/day, while always being the only compliant option for in-scope regulated workloads.
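The crossover point can be sanity-checked with back-of-envelope arithmetic. The inputs below are assumptions inferred from the table above (roughly AED 500 per 1M tokens cloud-side, Foundations-tier capex amortised over 36 months), not quoted rates:

```python
def monthly_cost_cloud(tokens_per_day: float, aed_per_m_tokens: float) -> float:
    """Cloud: pure usage cost, no capex."""
    return tokens_per_day / 1e6 * aed_per_m_tokens * 30

def monthly_cost_sovereign(capex_aed: float, opex_aed: float,
                           amortisation_months: int = 36) -> float:
    """Sovereign: capex spread over the hardware's useful life, plus fixed opex."""
    return capex_aed / amortisation_months + opex_aed

def break_even_tokens_per_day(capex_aed: float, opex_aed: float,
                              aed_per_m_tokens: float,
                              amortisation_months: int = 36) -> float:
    """Daily volume at which sovereign and cloud monthly costs are equal."""
    sovereign = monthly_cost_sovereign(capex_aed, opex_aed, amortisation_months)
    return sovereign / (aed_per_m_tokens * 30) * 1e6

# Assumed figures: AED 280k capex, AED 15k/month opex, AED 500 per 1M tokens.
be = break_even_tokens_per_day(280_000, 15_000, 500)
print(round(be))  # → 1518519, i.e. ~1.5M tokens/day
```

With these assumed inputs the break-even lands around 1.5M tokens/day, consistent with the 1.5–2M range above; your per-token cloud rate and amortisation policy shift it.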

For our Hisabi.ai operations, the hybrid pattern (Sovereign Falcon-H1 for Arabic, Claude via Gateway for English non-regulated tasks) hits the sweet spot of compliance + cost + quality.

When to pick Falcon-H1 Arabic

Pick Falcon-H1 Arabic if:

  • Your primary workload is dialectal Arabic (Gulf, Egyptian, Levantine, Maghrebi)
  • Latency budget is tight (slightly faster than Jais 2 at same parameter count)
  • You want the open-source / non-commercial-restrictive licensing posture (Falcon's Apache 2.0 is more permissive)
  • You're deploying on UAE infrastructure and want a UAE-developed model (TII originated)

When to pick Jais 2

Pick Jais 2 if:

  • Saudi-context factual recall matters specifically (training corpus weighting)
  • You want G42 ecosystem integration (Inception, Stargate UAE infrastructure)
  • You're operating in a Saudi-regulatory context and prefer the KSA association
  • Your benchmarks on your specific corpus favour it (always test)

The benchmark you should actually run

Generic benchmarks tell you the rough order of magnitude. Your specific workload is what decides. Standard discovery procedure:

  1. Curate 50–100 cases representative of your real traffic.
  2. Hand-label expected outputs (or rubric-based scoring criteria).
  3. Run all candidate models on the same inputs with consistent prompts.
  4. Score with the same scoring function across all models.
  5. Tally by task type to see where each model is strong.

Expect this to take 1–2 weeks for a single use case. The output is defensible model selection, not a guess from generic benchmarks.
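The harness behind steps 3–5 is small. A minimal sketch (the stub models and exact-match scorer in the usage example are placeholders; real runs call the deployed models and use rubric-based scoring):

```python
from collections import defaultdict

def run_eval(cases, models, score):
    """Run every model on every case and tally mean scores by task type.

    cases:  list of dicts with keys "task_type", "input", "expected"
    models: mapping of model name -> callable(prompt) -> output string
    score:  callable(output, expected) -> float in [0, 1]
    """
    tally = defaultdict(lambda: defaultdict(list))
    for case in cases:
        for name, generate in models.items():
            output = generate(case["input"])
            tally[name][case["task_type"]].append(score(output, case["expected"]))
    return {
        model: {task: sum(scores) / len(scores) for task, scores in tasks.items()}
        for model, tasks in tally.items()
    }

# Usage with stub models and an exact-match scorer (illustration only):
cases = [
    {"task_type": "qa", "input": "2+2", "expected": "4"},
    {"task_type": "qa", "input": "3+3", "expected": "6"},
]
models = {"stub_a": lambda p: "4", "stub_b": lambda p: str(eval(p))}
exact = lambda out, exp: 1.0 if out.strip() == exp else 0.0
results = run_eval(cases, models, exact)
```

The per-task breakdown is the point: an aggregate score hides exactly the dialect and code/math splits the comparison table above turns on.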

What this changes about UAE/GCC AI agencies

Two strategic implications:

1. Translation layers are obsolete. The pattern of "translate Arabic to English, run English LLM, translate back" was an interim hack. With Falcon-H1 and Jais 2 production-ready, agencies still shipping translation-layer architectures are leaving quality and compliance on the table.

2. Sovereign + Arabic is the moat. Western AI agencies serving GCC clients have to either build Arabic-LLM expertise or partner — and most haven't. Local agencies that run Arabic LLMs in production (us, and a small number of others) hold a structural advantage on regulated GCC work.

Where Codenovai fits

Every Sovereign AI + RAG deployment we ship for Arabic-content clients runs on Falcon-H1 Arabic or Jais 2 — selected based on benchmark results against the client's specific corpus during discovery. We've shipped both in production. We have the H100 capacity, the inference tooling (vLLM and Ollama), and the eval harness to make either choice work.

For pure Arabic-content workloads, we recommend starting with the Foundations tier at AED 150,000 — a single-corpus deployment on hardware sized for your specific workload, with model selection part of the engagement, not pre-decided.

Book a scoping call — bring your sample corpus and we'll have benchmark results within 7 days.

Frequently asked questions

Why choose an Arabic LLM over a Western frontier model?

Three reasons. (1) Quality on dialectal Arabic — Gulf, Egyptian, Levantine — is materially better in Arabic-trained models because the training-data overlap is larger. Western frontier models handle MSA reasonably but degrade on dialect. (2) Sovereignty — Falcon-H1 and Jais 2 are open-weight and can run on UAE-resident infrastructure, satisfying data-residency requirements that cloud-hosted GPT-4 or Claude cannot. (3) Cost — open-weight models running on owned hardware are cheaper at steady-state utilisation past a threshold. For UAE/GCC workloads operating on Arabic data, the choice is increasingly Arabic-LLM by default.

How close are they to frontier models on quality?

On MSA reasoning tasks, Falcon-H1 Arabic and Jais 2 are within 5–15% of Claude Sonnet on most benchmarks, occasionally exceeding it on Arabic-specific reasoning. On Gulf dialects, the Arabic models often outperform — Western models translate to MSA internally and lose nuance. On English-language tasks within the same conversation (code-switched user input), Western models pull ahead. The right pattern is often hybrid: Arabic LLM for Arabic input, frontier model for English, with auto-detection.

What hardware does Falcon-H1 Arabic need?

Falcon-H1 Arabic ships in 70B and 180B parameter sizes. The 70B variant fits on a single H100 (80GB) only with quantisation (e.g. INT8 or FP8); at FP16 its weights alone are ~140GB, so a single H200 (141GB) is workable but leaves little headroom for the KV cache. The 180B variant needs a multi-GPU setup — typically 2× H100 minimum, 4× H100 for headroom. For a UAE-sized banking workload (~20–50 concurrent users on Arabic queries), a 70B deployment on 2× H100 with vLLM serves comfortably, with overhead for spikes.
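The sizing follows from a back-of-envelope weights calculation. A sketch (weights only; real deployments also budget GPU memory for KV cache and activations):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough weights-only footprint: params (in billions) x bytes per param ~= GB.

    bytes_per_param: 2 for FP16/BF16, 1 for INT8/FP8, 0.5 for 4-bit quantisation.
    """
    return params_billions * bytes_per_param

print(weight_memory_gb(70, 2))    # → 140.0  FP16: exceeds one H100 (80GB)
print(weight_memory_gb(70, 1))    # → 70.0   INT8/FP8: fits one H100
print(weight_memory_gb(180, 2))   # → 360.0  180B FP16: multi-GPU territory
```

The rule of thumb generalises: any 70B-class model needs roughly 140GB at FP16 before serving overhead, which is why the single-H100 configurations in the cost table imply quantised inference.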

Which should you pick, Falcon-H1 Arabic or Jais 2?

Neither dominates. Jais 2 has a slight edge on factual recall and consistency, particularly on Saudi-context content. Falcon-H1 Arabic has a slight edge on dialectal flexibility and lower latency at the same parameter count. For most production workloads the differences fall within the noise of prompt-engineering quality. We typically benchmark both against the client's specific corpus during the discovery sprint and pick the winner on actual data, not generic benchmarks.

How do they handle code generation and math?

On code generation, both are meaningfully behind Claude Sonnet or GPT-4 on English-prompted code, and about on par when prompted in Arabic with Arabic context. Math reasoning follows a similar pattern — competitive on Arabic word problems, behind on competition-style English math. For workloads requiring code or heavy math, the hybrid pattern (Arabic LLM for Arabic content tasks, Claude or GPT for code/math) outperforms either alone. For pure Arabic content workloads — customer service, document Q&A, Arabic content drafting — they're more than adequate.
