Using large language models to process Arabic text

5 Feb 2024

For all of the controversy surrounding large language models, these tools present science with the chance to transform lives for the better. At Qassim University, researchers are exploring ways of improving the tools' Arabic-language handling and text-processing power, and investigating how they can tackle misinformation in the social-media era.

Large language models (LLMs) have already changed the world, and they are only getting started. LLMs such as ChatGPT are now part of the cultural landscape, with the potential to transform our information ecosystem. They can be used to create content, writing stories and producing images that seem uncannily human.

There is much debate about whether LLMs will usher in a golden era of efficiency or the fall of humankind. However, everyone can agree that they will be transformative – and that they still have some improving to do. 

Mohammed Alsuhaibani is an assistant professor in the Computer Science Department at Qassim University. His research focuses on natural language processing and on how LLMs can be improved, particularly when processing languages such as Arabic, which has a rich diversity of dialects.

“Large language models are a cornerstone for breaking down language barriers and facilitating communication, but the problem is the uniqueness of some languages such as Arabic because it has a rich morphology,” he says. “AI, and these large language models, have been really helpful in enhancing the accuracy and fluency of translations, as an example.”
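The translation workflow he describes can be sketched in a few lines of Python. The example below is a minimal illustration using the open-source Hugging Face transformers library; the checkpoint Helsinki-NLP/opus-mt-ar-en is a publicly available Arabic-to-English model chosen here purely for demonstration, not a model from the Qassim University research.

```python
# Minimal sketch: Arabic-to-English machine translation with a
# pretrained model via the Hugging Face "transformers" pipeline.
# "Helsinki-NLP/opus-mt-ar-en" is a publicly available checkpoint
# used purely for illustration.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-ar-en")

# "Arabic is morphologically rich, and its dialects vary widely."
arabic_text = "اللغة العربية غنية بالصرف وتتنوع لهجاتها بشكل واسع"

result = translator(arabic_text)
print(result[0]["translation_text"])
```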

Enormous data sets are required to train and fine-tune LLMs, and to ensure they are able to perform tasks such as recognising the meaning behind words and preserving sentiment – and to use these insights constructively. “Understanding and accurately interpreting sentiments in language is really important, especially to understand the sentiment in a social media context, in the news and in literature,” Alsuhaibani says. “It would be really important for market analysis, sociocultural studies, many other fields.”
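Sentiment analysis of this kind is a standard fine-tuning task for LLMs. As a rough sketch, the snippet below runs an off-the-shelf Arabic sentiment classifier over short social-media-style posts; the CAMeL-Lab checkpoint named here is one publicly available option, not necessarily a model used in Alsuhaibani's work.

```python
# Minimal sketch: Arabic sentiment analysis with a fine-tuned
# transformer. The checkpoint below is one publicly available Arabic
# sentiment model; any comparable checkpoint could be substituted.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment",
)

posts = [
    "هذا المنتج رائع جدا",  # "This product is really wonderful"
    "الخدمة سيئة للغاية",   # "The service is extremely bad"
]

for post, pred in zip(posts, sentiment(posts)):
    print(f"{post} -> {pred['label']} ({pred['score']:.3f})")
```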

This technology presents many opportunities for innovation, but also for malfeasance. One of the sector’s challenges is ensuring that LLMs are used ethically, are trained on “fair data”, and that issues surrounding privacy and stereotyping are addressed. The case for regulation is convincing. “It’s really crucial and important to have these sorts of regulations,” Alsuhaibani says. “It may or may not accelerate LLM development but it will definitely help us utilise these models in a good way.”

Alsuhaibani’s work also involves neural machine translation and training LLMs to summarise text. His research condenses long-form Arabic texts, making them more accessible while preserving their meaning. Crucially, AI-based LLMs can also be trained to detect disinformation and flag fabricated texts.
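As an illustration of the summarisation task (a sketch, not the group's own pipeline), an Arabic document can be condensed with a publicly available multilingual summariser such as csebuetnlp/mT5_multilingual_XLSum:

```python
# Minimal sketch: abstractive summarisation of Arabic text.
# "csebuetnlp/mT5_multilingual_XLSum" is a publicly available
# multilingual summariser that covers Arabic, used here only as
# an example.
from transformers import pipeline

summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")

# A short Arabic passage standing in for a long-form document:
# "AI techniques are advancing rapidly in Arabic text processing.
#  Research teams are improving large language models' handling of
#  long texts, aiming to condense content while preserving meaning."
long_text = (
    "تشهد تقنيات الذكاء الاصطناعي تطورا سريعا في معالجة النصوص العربية. "
    "وتعمل فرق بحثية على تحسين قدرة النماذج اللغوية الكبيرة على التعامل مع النصوص الطويلة، "
    "بهدف تلخيص المحتوى مع الحفاظ على معناه."
)

summary = summarizer(long_text, max_length=48, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```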

“The goal here is to deploy AI-based large language models to identify or flag any fabricated text,” he says. “This is really important in the information age, especially when it comes to sacred texts such as the Koran and Hadith. It is crucially important to detect any fabricated text because it must be correct and from the original source.”
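One common way to frame this detection task is as binary text classification: a model fine-tuned on verified and fabricated passages assigns each candidate text a label and a confidence score. The sketch below assumes a hypothetical fine-tuned checkpoint and label scheme; neither is a real public model nor drawn from the research described here.

```python
# Sketch: fabricated-text detection framed as binary classification.
# "example-org/arabic-authenticity-classifier" is a HYPOTHETICAL
# checkpoint name, and the "FABRICATED"/"AUTHENTIC" labels are an
# assumed scheme; both are placeholders for illustration only.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="example-org/arabic-authenticity-classifier",  # placeholder
)

candidate = "نص مرشح للتحقق من مصدره وصحته"  # "a passage whose source and authenticity need checking"
pred = detector(candidate)[0]

if pred["label"] == "FABRICATED":
    print(f"Flagged as possibly fabricated (score = {pred['score']:.2f})")
else:
    print(f"No fabrication flag (score = {pred['score']:.2f})")
```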

The debate around LLMs is not going anywhere. Concerns over their ethical use, data privacy and the computational power needed to participate in LLM development will remain. But Alsuhaibani has faith that LLMs will be a net positive for society, making education more accessible to the global population. The more we understand them and their interpretability, the more useful they might become.

Find out more about Qassim University.