Building a ChatGPT for the Arab World

Meet the companies and countries that are leading the way

Welcome to FWDstart!

This week's deep-dive is a big one, as we explore the companies bridging the gap between the Arabic-speaking world and advancements in AI.

If you know someone who would be interested in reading, please feel free to share the newsletter with them here.

And a reminder, if you haven't subscribed yet, join readers from 500 Global, Speedinvest and Antler getting a FWDstart twice weekly.

AI is taking over the world, gradually transforming how we work and live.

Yet, many Arabic speakers risk being left out.

A few years ago, Mohammad AlSharekh, the renowned Kuwaiti entrepreneur who brought Arabic to computing, noticed something alarming.

Many Arabs had stopped using dictionaries.

They were too outdated and complicated, filled with archaic words and definitions.

AlSharekh's solution was the release of Sakhr Software Company's online Modern Arabic Dictionary, featuring 50 million Arabic vocabulary entries for everyday use.

In a TEDx Talk in 2018, he delivered a clear message: only Arabs can address the challenges facing the Arabic language.

In recent years, local founders and companies have embraced this challenge to drive an Arabic AI revolution.

🌍 Bridging the Arabic AI gap

When ChatGPT was released, it was a total revelation.

But there was a big problem - it struggled enormously with Arabic.

Given that Arabic is spoken by over 400 million people, this is remarkable for all the wrong reasons.

The underperformance of major LLMs can be put down to a variety of factors:

  1. Arabic is a complicated language: It's filled with diacritical markings and an inflected letter system. Letters can take up to four shapes (isolated, initial, medial and final) depending on their position and are often connected, which computers can struggle to make sense of.

  2. Lack of online content: LLMs need training on vast amounts of digital text. Although Arabic is the fourth-most-spoken language globally, it makes up less than 1% of internet content. That's not a lot for AI developers to play around with.

  3. Diverse dialects: There are at least 25 dialects. Some are similar, but others can be difficult to understand even for Modern Standard Arabic speakers.

Map of Arabic dialects

Add all of that together and you come out the other side with a language that's harder to represent in an AI model than most others.
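To make the first point concrete, here is a minimal Python sketch using only the standard library's unicodedata module. It shows that Unicode lists four positional shapes for a single letter (BEH, chosen arbitrarily) that all map back to one base letter, and that the short-vowel diacritics are separate combining marks that software often strips. The helper names describe_forms and strip_diacritics are made up for this illustration and are not taken from any Arabic NLP library.

```python
import unicodedata

# Unicode stores one abstract letter BEH (U+0628). Its four positional
# shapes also exist as legacy "presentation form" code points.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def describe_forms(forms):
    """Pair each presentation form's Unicode name with its base letter."""
    return [
        (unicodedata.name(ch), unicodedata.normalize("NFKC", ch))
        for ch in forms
    ]


def strip_diacritics(text):
    """Drop combining marks (category Mn), such as the short-vowel signs."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


for name, base in describe_forms(BEH_FORMS):
    print(f"{name} -> U+{ord(base):04X}")

# KAF, TEH, BEH, each followed by the FATHA vowel mark (a vowelled "kataba").
vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vocalised)
print(len(vocalised), "code points with diacritics,", len(bare), "without")
```

Most everyday Arabic text is written without these vowel marks, so a model has to infer them from context, which is one reason the same string of letters can be ambiguous.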

But if there's one common thread that has emerged throughout this newsletter to date, it's that MENA's founders love a challenge.

And they don't come much bigger than bridging the gap between the Arabic-speaking world and advancements in AI.

📢 Finding their voice

Back in 2021, two founders from Egypt launched a startup called Intella, on a mission to do just that.

Their startup has developed an Arabic speech-to-text AI model that localises AI across all Arabic dialects.

Nour Taher and Omar Mansour

And it doesn't just talk the talk - the platform has a 95.70% average accuracy across 25 Arabic dialects.

Their speech-to-text and analytics models use advanced AI to continuously enhance accuracy and efficiency by processing large datasets of various Arabic accents and dialects.

This has powered it to surpass industry leaders including Google's speech-to-text, ChatGPT maker OpenAI's Whisper, Meta Platforms' SeamlessM4T and IBM's Watson.
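How such accuracy figures are measured is not stated here. One common convention in speech recognition is to compute the word error rate (WER), the number of word substitutions, insertions and deletions needed to turn the system's transcript into the reference transcript, divided by the reference length, and to report 1 - WER as accuracy. The sketch below assumes that convention. The function name and the sample sentences are illustrative only and do not come from Intella or any vendor's benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from transcript
                curr[j - 1] + 1,     # extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the model handles many arabic dialects"
hypothesis = "the model handles arabic dialect"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}, word accuracy: {1 - wer:.3f}")
```

The function works on any whitespace-separated text, Arabic included, but results depend heavily on how the text is normalised first (for example, whether diacritics are removed), so headline numbers from different vendors are only comparable when the test set and normalisation match.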

Intella Voice can be used in chatbots, voice assistants, customer service centres, emergency hotlines, and IVR systems.

And the company plans to expand into audio analytics, including summarisation and sentiment analysis.

Nour and Omar are by no means alone.

Many MENA founders are committed to enhancing Arabic representation in AI.

Others include startups like Maqsam, Uktob, ClusterLab, and LisanAI, to name but a few.

The importance of ensuring Arabic catches up cannot be overstated.

🌍 The importance of an Arabic LLM

The genie is out of the AI bottle.

There are only going to be more and more people in Arabic-speaking countries turning to AI to complete tasks over the coming months and years.

But why is it so important that Arabic doesn't fall behind in the AI race in the first place?

1. Productivity and education

Made with DALL·E

AI is amazing when it comes to automating tasks, analysing data, and helping you rethink whether sending that faintly passive-aggressive email is really the wisest decision.

Okay but seriously, to drive productivity gains in Arabic-speaking countries, supporting a variety of dialects is crucial, as they are more commonly used than Modern Standard Arabic in business.

The same goes for education. If AI is to truly deliver on the promise of highly personalised and adaptive learning journeys for students, tailoring to local dialects will be crucial to its success in the MENA context. The need is significant: 59% of children in the region are in learning poverty.

AI needs to meet Arabic speakers where they already are.

2. Language preservation

Made with DALL·E

As we've mentioned, there's a significant shortage of Arabic content online.

While AI has the potential to remedy this by increasing the amount of Arabic content available, an unintended consequence arises when mainstream LLMs with poor Arabic skills produce low-quality text.

These models learn from internet data, so if they consume this poor-quality text, the Arabic language suffers.

This could make future AI models struggle with Arabic even more, harming the language's quality and preservation.

3. Representation and cultural nuance

The bias of Stable Diffusion (Bloomberg)

LLMs are trained on billions of examples of human language in all its flawed glory.

However, not every culture is represented inclusively, especially since Arabic accounts for less than 1% of online content.

Consequently, LLMs often learn about Arab culture from non-Arab perspectives, potentially incorporating unfair or untrue biases, and missing important cultural nuances.

To ensure accurate and fair representation, it is crucial to increase high-quality Arabic content and involve native Arabic speakers in the training process.

⛰️ A new peak

In recent months, the MENA region has made significant strides in advancing Arabic AI, most notably with the release of G42's Jais (named after the UAE's highest peak, Jebel Jais).

Jais chat interface

This 13-billion parameter model was trained on a unique dataset of 116 billion Arabic tokens, capturing the complexity and richness of the language.

To ensure comprehensive training, the training data also included 279 billion English tokens, resulting in a fully bilingual LLM that supports a variety of Arabic/English digital services.
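As a back-of-the-envelope check on those figures, the short Python sketch below works out the total token count, the share that is Arabic, and the number of training tokens per model parameter. The inputs are the numbers quoted above. The function name training_mix is invented for this example, and nothing here reflects how the Jais team itself reports its data.

```python
# Figures as quoted in this newsletter, not taken from the Jais team.
ARABIC_TOKENS = 116_000_000_000
ENGLISH_TOKENS = 279_000_000_000
PARAMETERS = 13_000_000_000


def training_mix(arabic_tokens, english_tokens, parameters):
    """Summarise a bilingual training corpus relative to model size."""
    total = arabic_tokens + english_tokens
    return {
        "total_tokens": total,
        "arabic_share": arabic_tokens / total,
        "tokens_per_parameter": total / parameters,
    }


mix = training_mix(ARABIC_TOKENS, ENGLISH_TOKENS, PARAMETERS)
print(f"Total tokens: {mix['total_tokens'] / 1e9:.0f} billion")
print(f"Arabic share: {mix['arabic_share']:.1%}")
print(f"Tokens per parameter: {mix['tokens_per_parameter']:.1f}")
```

On these numbers, Arabic makes up a little under a third of the training text, which is a far larger share than the language has on the open internet.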

The project was a collaboration between Inception, Mohamed bin Zayed University of Artificial Intelligence, and AI chip maker Cerebras Systems.

Open-sourced under an Apache 2.0 license and available via Hugging Face, Jais enables AI professionals, developers, and researchers worldwide to create their own use cases.

Microsoft CEO Satya Nadella introduces Model-as-a-Service at Microsoft Ignite (Image credit: Talal Al Kaissi, Core42)

But Jais is not the only one. In May alone, there was a flurry of Arabic AI model announcements, with Saudi Arabia, Qatar, and Huawei all announcing major developments.

  • The Saudi Data and Artificial Intelligence Authority and IBM launched 'ALLaM,' an open-source Arabic LLM on IBM's watsonx platform.

  • Qatar introduced "Fanar," a gen-AI Arabic language model developed in collaboration with the Ministry of Communications and Information Technology, Qatar Computing Research Institute, and other partners.

  • Huawei unveiled a 100+ billion parameter Arabic language model with 96% accuracy in Arabic speech recognition (ASR) tests. Trained on MSA and diverse data from the Arab world, it covers local culture, history, customs, and industry-specific knowledge like oil, gas, and financial services.

🤖 What's next?

The introduction of Arabic language models like Jais is a game-changer.

Until recently, developers lacked high-quality Arabic AI models.

Now, they can build specifically for the Arabic-speaking world's companies, consumers, cultures, and locations.

There's still a long road ahead, but rather than diminishing Arabic, AI might actually strengthen it.

👋 Message from the team

Thanks for reading this week's edition!

If you're enjoying the newsletter, don't forget to share it with a friend!

Have a question or any feedback? Just hit reply - we want to hear from you!