College of Human Sciences

Developing digital African language resources

Working with government on the South African Centre for Digital Language Resources (SADiLaR), Unisa’s College of Human Sciences demonstrates how serious it is about developing African languages and contributing to the development of South African and African indigenous knowledge systems.

SADiLaR, a national centre supported by the Department of Science and Innovation (DSI), forms part of the South African Research Infrastructure Roadmap (SARIR). According to the SADiLaR website, "SARIR is a high-level strategic and systemic intervention to provide research infrastructure across the entire public research system, building on existing capabilities and strengths, and drawing on future needs."

SADiLaR has an enabling function, with a focus on all official languages of South Africa, supporting research and development in the domains of language technologies and language-related studies in the humanities and social sciences. The centre supports the creation, management and distribution of digital language resources, as well as applicable software, which are freely available for research purposes through the Language Resource Catalogue.

SADiLaR runs two programmes. The first is a digitisation programme, which entails the systematic creation of relevant digital text, speech and multi-modal resources related to all official languages of South Africa. The development of appropriate natural language processing software tools for research and development purposes is included as part of the digitisation programme.

The second is a digital humanities programme, which facilitates the building of research capacity by promoting and supporting the use of digital data and innovative methodological approaches within the humanities and social sciences.

Unisa’s African languages in a unique position

The centre’s website states that its clients are academic scholars and professionals in all domains of the humanities and social sciences, language technologies, natural language processing and computer science, as well as potential end-users in education, business and industry. SADiLaR is also a multi-partner entity, with the North-West University functioning as host and as the hub of a network of linked nodes, one of which is the Unisa Department of African Languages. The Unisa node has two node managers, Prof Sonja Bosch and Prof Mampaka Lydia Mojapelo.

According to Bosch, the Unisa Department of African Languages is working in close collaboration with SADiLaR on the creation, management and distribution of digital language resources, which are made freely available.

"The Unisa node of SADiLaR, linked to the Department of African Languages, specialises in language development since this department is in the unique position of offering all nine official African languages. The Unisa node stands on two legs; the first leg is the African Wordnet (AfWN) and the second one is the Multilingual Linguistic Terminology."

She explains that SADiLaR has contributed considerably to the sustainability of these projects at Unisa. "Instead of time and effort being spent on writing short-term funding proposals, longer-term arrangements with SADiLaR ensure a more stable research and development environment."

Mojapelo states that, in this way, the project teams can concentrate on the development and quality assurance of the wordnets and the linguistic terminology. "Linguists involved in this project also benefit in the sense that they can now be supplied with all essential equipment such as laptops, have access to the relevant software, and attend dedicated training sessions. Furthermore, researchers have the opportunity to publish their research findings."

She says that the language resources that are being developed are managed with great ease via the SADiLaR server, while the hosting of the wordnet editing tool on this server means stability and continuous technical support.

A training workshop held in 2019 to introduce WordnetLoom as an editing tool was facilitated by international experts in the field of wordnet development and was well attended by project members:
Front row: Justina Wieczorek (PolNet developer and facilitator of the workshop, University of Wroclaw, Poland), Dr Janek Wieczorek (WordnetLoom development team member and facilitator of the workshop, University of Wroclaw, Poland), Mmasibidi Setaka (Sesotho), Angelinah Dazela (isiXhosa) and Valencia Wagner (Setswana)
Middle row: Prof Sonja Bosch (Node manager), Matseleng Mabusela (Sesotho sa Leboa), Mercy Mahwasane (Tshivenḓa), Dr Inie Kock (Sesotho), Taki Matamela (Tshivenḓa), Opelo Thole (Setswana), Celimpilo Dladla (isiZulu) and Lindelwa Mahonga (isiZulu)
Back row: Delvah Mathevula (Xitsonga), Respect Mlambo (Xitsonga), Prof Mampaka Lydia Mojapelo (Node manager, Sesotho sa Leboa), Mlamli Diko (isiXhosa), Dr Jurie le Roux (Setswana), Dr Celani Zwane (isiZulu) and Prof Stanley Madonsela (Siswati)

Making multilingual linguistic terminology freely available

Mojapelo explains that the Unisa project team is also well aware that African languages have important and specialised terminology in specific fields. In this regard, it is their aim to make the multilingual linguistic terminology freely available in a large database so that these resources will contribute positively to the teaching and learning domain as well as to other forms of language practice such as language learning and interpretation.

"Open access to the African Wordnet data as well as the Multilingual Linguistic Terminology is bound to have a significant impact, not only on the promotion of African languages, but also on the further development of natural language processing applications such as inter-lingual information retrieval, question-answering systems as well as machine translation," explains Bosch.

Speaking more on the African Wordnet project, Bosch says that South Africa, with its rich diversity of 11 official languages, is seen as a potential emerging market where language technology (LT) applications can contribute to the promotion of multilingualism and language development, and as such have a positive impact on the South African community. In this regard, one of the fundamental resources required for the development of a large number of core language technologies (LTs) and LT applications is a wordnet. A wordnet is a lexical database consisting of words that are grouped into sets of synonyms called synsets. Various conceptual-semantic and lexical relations are indicated between the synsets contained in a wordnet.
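To make the idea of synsets and relations concrete, the structure described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the African Wordnet or WordnetLoom data model: the class names, the identifiers "s1" and "s2", the relation label "hypernym" and the English example words are all assumptions chosen for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Synset:
    """A set of synonymous words (lemmas) that express one concept."""
    synset_id: str          # hypothetical identifier, e.g. "s1"
    lemmas: List[str]       # the synonyms grouped in this synset
    gloss: str = ""         # short definition of the concept
    # Maps a relation name (e.g. "hypernym") to the ids of related synsets.
    relations: Dict[str, List[str]] = field(default_factory=dict)


class Wordnet:
    """A minimal in-memory lexical database of synsets."""

    def __init__(self) -> None:
        self.synsets: Dict[str, Synset] = {}

    def add(self, synset: Synset) -> None:
        self.synsets[synset.synset_id] = synset

    def relate(self, source_id: str, relation: str, target_id: str) -> None:
        """Record a directed relation from one synset to another."""
        source = self.synsets[source_id]
        source.relations.setdefault(relation, []).append(target_id)

    def synonyms(self, word: str) -> Set[str]:
        """Return every other lemma that shares a synset with `word`."""
        found: Set[str] = set()
        for synset in self.synsets.values():
            if word in synset.lemmas:
                found.update(synset.lemmas)
        found.discard(word)
        return found

    def related(self, synset_id: str, relation: str) -> List[Synset]:
        """Return the synsets reached from `synset_id` via `relation`."""
        targets = self.synsets[synset_id].relations.get(relation, [])
        return [self.synsets[target] for target in targets]


# Illustrative English data only; real wordnets hold thousands of synsets.
wn = Wordnet()
wn.add(Synset("s1", ["car", "automobile"], "a road vehicle with an engine"))
wn.add(Synset("s2", ["vehicle"], "a means of transport"))
wn.relate("s1", "hypernym", "s2")  # a car is a kind of vehicle
```

Looking up wn.synonyms("car") returns the other members of the synset, and wn.related("s1", "hypernym") follows the conceptual relation to the more general synset. Linking wordnets across languages could in principle reuse the same mechanism, with relations pointing at synsets in another language's wordnet.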

She explains that wordnets for African languages were introduced in 2007 through a training workshop presented by international experts for linguists, lexicographers and computer scientists. Since then, wordnets for five African languages, namely Setswana (tsn), isiXhosa (xho), isiZulu (zul), Sesotho sa Leboa (nso) and Tshivenḓa (ven), have grown to roughly 10 000 synsets each, while the other four official African languages, namely Sesotho (sot), Xitsonga (tso), isiNdebele (nde) and Siswati (ssw), each boast 1 000 synsets. Marissa Griesel, a PhD student in the Department of African Languages, is the general project manager.

Bosch concludes by highlighting that wordnets are not only useful, but indispensable components of large automatic language understanding systems being developed and tested in academia and industry. "Adding several South African languages to the wordnet web enables many such applications for each of these languages in isolation. Moreover, linking the South African wordnets to one another and to the many global wordnets makes cross-linguistic information retrieval and question answering possible, and significantly aids machine translation, which is an important contribution to the empowerment of the African languages."

* Compiled by Rivonia Naidu-Hoffmeester, Communications and Marketing Specialist, College of Human Sciences

Publish date: 2020/09/04