A UP academic is leading cutting-edge research to increase the availability of African languages online

If you have ever thought "there are hardly any African languages on the internet", Dr Vukosi Marivate's presentation at a recent USAf Round Table not only proved this assumption to be right, but provided snippets of what he is doing to overcome this – and how others can help.

Dr Marivate is the Absa Chair of Data Science in the Department of Computer Science at the University of Pretoria. He was invited to speak at a recent colloquium that was hosted by the Community of Practice for African Languages (CoPAL) of Universities South Africa (USAf), the umbrella body of the country's 26 public universities. The virtual event, titled "African Languages in the Age of 4IR," took place on 29 October.

In a presentation titled "A call to action. Using data science in the advancement of African languages", Dr Marivate showed how limited the resources for indigenous languages are in South Africa.

Of the country's most popular newspapers by circulation last year, the top three, the Sunday Times, Soccer Laduma and the Daily Sun, are in English. Then comes Rapport, which is in Afrikaans, followed by Isolezwe in isiZulu, whose circulation of 86 342 is far below the Sunday Times's 260 132.


Another way to gauge a language's resources is to count the articles written in it on Wikipedia, the free, crowdsourced encyclopaedia written by volunteers. Last year there were six million articles in English on Wikipedia. In South Africa, Afrikaans was second with about 89 000 articles, followed by just over 8 000 in Sepedi, 1 395 in isiZulu and 683 in Sesotho. IsiNdebele does not exist on Wikipedia at all. So there is inequality of language resources.

What does the availability of online text in a language have to do with 4IR?

Dr Marivate said text is a very rich interface to share information and interact with machines, such as computers, adding that understanding the definitions related to 4IR makes this clearer.

To explain artificial intelligence, he used the example of a robot vacuum cleaner. It lives in your house, can scan the floor and move around, building a map of the room to pick up the dirt or vacuum. Its goal is to make sure the room is clean. For another example, he said search engines such as Google are machines that live on the internet. You can type in a query and it will return documents that relate to the query. That is really what artificial intelligence (AI) is: machine environments that try to achieve a goal. Machine learning is a subset of AI in which you want to learn patterns from data. You give the machine data, and then it learns patterns from that.

Natural language processing is connected to AI and machine learning, and is about trying to learn language tasks from text. However, if you don't have available language resources, you cannot do machine learning and natural language processing. And that is the big challenge in South Africa, where there are very limited resources of African language texts.

Why is this important?

Machine learning involves giving training data to the machine, and then the machine learning algorithm pops out a model. (An algorithm is a set of rules to be followed in calculations or other problem solving, such as a multiplication algorithm.)

So if you are doing classification, such as classifying a novel as a thriller or a love story, that is what the model will try to do. Or you could be trying to identify whether a word is the name of a person, a place or a thing. You then train, you pop out the model, you give it real-time data and then it can predict, based on that data.
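The train-then-predict loop described above can be sketched in a few lines of Python. This is only an illustration, not the method Dr Marivate's team uses: the four training sentences are invented, the function names `train` and `predict` are hypothetical, and the algorithm chosen here is a simple naive Bayes word counter.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Learn word counts per label from (text, label) pairs and return the model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return {"word_counts": word_counts, "label_counts": label_counts, "vocab": vocab}


def predict(model, text):
    """Return the label whose learned word pattern best matches the new text."""
    total = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, n in model["label_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n / total)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data: the "training data" given to the machine.
training_data = [
    ("detective hunts the killer", "thriller"),
    ("a murder and a chase at midnight", "thriller"),
    ("two strangers fall in love", "love story"),
    ("a wedding and a broken heart", "love story"),
]

model = train(training_data)  # the algorithm "pops out" a model
prediction = predict(model, "the killer escapes at midnight")  # new, unseen text
print(prediction)
```

With only four training sentences the model knows very few words, which is the low-resource problem in miniature: any language with little text online gives the algorithm little to learn from.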

He said online voice assistants such as Apple's Siri or Google Assistant, audio search engines that answer aloud when you ask them a question, work in the same way.

What do you do if you don't have enough data?

Dr Marivate has been researching how to improve the algorithms even if there is not a lot of data to work with. He is involved in a multi-institute and multi-researcher project with institutions from across Africa and independent researchers. They are looking at low-resource machine translation, using African languages as a case study.

He is also involved in a project that involves extracting public data from the SABC's news snippets that it shares on social media. "A little bit of civil disobedience in this case but it's for the public good and it's from a public broadcaster," he said.

The SABC has 90 radio stations, five TV channels and online digital news that is published only in English. The radio stations cover all 11 official languages but their news scripts are not public. So Dr Marivate and his team have been taking the headlines from radio stations' Facebook pages, and working with students to annotate them according to categories. They then build machine-learning algorithms that can predict these categories automatically. The problem is that a set of only 200 articles is not enough to train on.

They have also used data from sources such as the South African Centre for Digital Language Resources (SADiLaR), a research institute at North-West University in Potchefstroom.

And they are building algorithms that can artificially increase the amount of data by recreating sentences using synonyms, or changing the sentence to something very similar. This is giving them up to 60% accuracy in being able to classify articles in the local languages. This research was presented at the Language Resources and Evaluation (LREC) workshop on Resources for African Indigenous Languages (RAIL) in May. The paper, co-written with collaborators Tshephisho Sefara and Abiodun Modupe and a group of undergraduate students from different South African universities, was titled "Investigating an approach for low resource languages dataset creation, curation and classification: Setswana and Sepedi".
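A minimal sketch of the synonym idea, in Python, might look like the following. It is not the team's tool: the `augment` function and the tiny English synonym table are hypothetical, and a real system would need a synonym resource for the target language, which is itself scarce for many African languages.

```python
# Hypothetical toy synonym table; a real one would come from a language resource.
SYNONYMS = {
    "big": ["large", "huge"],
    "win": ["victory", "triumph"],
}


def augment(sentence, synonyms):
    """Return new sentences, each with exactly one word swapped for a synonym."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        for alternative in synonyms.get(word.lower(), []):
            variants.append(" ".join(words[:i] + [alternative] + words[i + 1:]))
    return variants


headline = "big win for local team"
augmented = augment(headline, SYNONYMS)
for sentence in augmented:
    print(sentence)
```

Each new sentence keeps the category label of the original headline, so one annotated example becomes several training examples without any extra annotation work.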

"This is an active project that's being used across the world by researchers who are working on different languages, such as Chinese and languages in Eastern Europe," said Dr Marivate. Their tool, which artificially increases the amount of data, is getting an encouraging number of downloads a month.

How non-traditional researchers can help the data science community

The Masakhane project is about putting Africa on the natural language processing map and wants researchers across Africa to join them in building translation models for African languages.

Everyone is welcome to join the project. Dr Marivate said Masakhane relies on people who are not data scientists or in research councils. They might be students in other disciplines or just people who are interested. Masakhane is open source and continent-wide and has over 50 collaborators. They are given the tools to start working on the language tasks.

The first task was machine translation: effectively, building Google Translate for African languages from scratch using Dr Marivate and his team's techniques. It is available online at masakhane.io.

Masakhane is presenting its cutting-edge research at top conferences

Masakhane has made progress: by February 2020 it had more than 144 participants from 17 African countries with diverse educational backgrounds, as well as from two countries outside Africa, namely the USA and Germany. Thirty-five translations for 29 African languages have been published by over 25 contributors on GitHub, an open source platform for software development, and the community recently had nine papers accepted in Natural Language Processing (NLP) workshops at the International Conference on Learning Representations (ICLR), one of the top machine learning conferences.

The project has been running for about 14 months and its first joint paper, written by a large team of about 40 people, was accepted a few weeks ago for the Empirical Methods in Natural Language Processing (EMNLP) Conference.

There is money for people across Africa to create data sets and resources

Dr Marivate is one of five people on the steering committee that governs the international Lacuna Fund, which punts itself as "putting the benefits of machine learning within reach of data scientists, researchers, and social entrepreneurs worldwide".

He says it is giving a couple of million dollars for people to collect data in different subsets, one of which is language. "We're making money available across the African continent for people to create data sets and resources," he said.

And his research group, Data Science for Social Impact, at the University of Pretoria, is continuing its research in enhancement of machine learning pipelines for low resource scenarios. A lot of his students are working on his new models for increasing data sets.

Facebook is funding them to work on isiZulu. They also have funding from Mozilla to build a Masakhane web tool that allows access to machine translation systems.

"We need to get to a point where we're teaching more and more natural language processing in our universities, not just in humanities, but also in computing sciences and engineering, because having more researchers who are exposed increases the pipeline for everybody," said Dr Marivate.

This is the second piece in a series of four articles being published from the Roundtable on African Languages in the Age of 4IR.

The organising body, the Community of Practice on the Teaching of African Languages (CoPAL), was formed in 2015 to enable academics and relevant other university staff members to collaborate, network and share knowledge on issues of common interest. CoPAL is just one of numerous communities of practice operating under USAf's banner.

In addition to influencing and contributing to national and institutional language policies, the CoPAL seeks to benchmark, develop, advocate for and share good practices and relevant information needed to advance the teaching of African Languages in schools and universities. It does so by contributing to teacher training initiatives for African Languages and to the development of approaches to the teaching of African Languages that use African Languages in the teaching process.

This CoP also seeks to enhance regional collaboration among African Languages scholars, as well as to actively contribute to the establishment of linguistic networks to ensure that information and common understandings are shared.

Two more articles from the Roundtable will follow shortly on this platform.

Written by Gillian Anstey, a freelance writer commissioned by Universities South Africa.
