Linguistics and Modern Languages


Spoken Language Corpora for the 9 official African Languages of South Africa Project

One of the subjects recorded speaking Xhosa at an imbizo (meeting or gathering) in the Eastern Cape, South Africa.

This is a collaborative research project between the Linguistics Departments at Unisa and the University of Göteborg. The main objective of this project is to develop a platform of computer-supported basic linguistic resources for the previously disadvantaged languages of South Africa. The project aims to develop audio-visual spoken language corpora of the 9 official African Languages of South Africa, namely S-Sotho, N-Sotho, Tswana, Xhosa, Zulu, Swati, Ndebele, Tsonga and Venda.

The basic methodological framework is that of Corpus Linguistics. It involves the audio-visual recording of spoken and phatic language use in a variety of social activities in a natural environment, followed by the transcription and codification of the data in order to facilitate electronic analysis and processing.

The significance of the project lies in its contribution to the creation of textual corpora of natural language use, which will serve as basic linguistic resources for language development (spoken language grammars and lexica), language teaching (language teaching materials), remedial and therapy programs (speech therapy for sufferers of language disorders and other language deficiencies), and language functionality (translation and interpretation).

The Swedish team is mainly responsible for the training of the recorders and transcribers, the design of the infrastructure for the corpora as well as the development of codification systems (tagging) and the design and maintenance of the electronic processing software.

The South African team is responsible for the overall management of the project and specifically for the collection, collation and codification of data (i.e. recordings and transcriptions), the analysis of collocational and phraseological patterns, and the publishing of the resource material.

Principal investigators

Prof Jens Allwood

Göteborg University

Prof Rusandré Hendrikse

University of South Africa

Other researchers participating in the project

Prof George Poulos

University of South Africa

Prof Sheila Mmusi

University of the North

Prof Rosemary Moeketsi

University of South Africa

Prof Themba Msimang

University of South Africa

Mrs Shirley Mukhari

University of South Africa

Dr Abraham Mulaudzi

University of South Africa

Prof Sizwe Satyo

University of Cape Town

Mr Leif Grönqvist

Göteborg University

Mr Magnus Gunnarsson

Göteborg University

At a conference entitled Against All Odds: African Languages and Literatures into the 21st Century, held in Asmara, Eritrea, from 11 to 17 January 2000, the attendees issued the so-called Asmara Declaration, which reads:

  1. African languages must take on the duty, the responsibility, and the challenge of speaking for the continent.
  2. The vitality and equality of African languages must be recognized as a basis for the future empowerment of African peoples.
  3. The diversity of African languages reflects the rich cultural heritage of Africa and must be used as an instrument of African unity.
  4. Dialogue among African languages is essential: African languages must use the instrument of translation to advance communication among all people, including the disabled.
  5. All African children have the inalienable right to attend school and learn in their mother tongues. Every effort should be made to develop African languages at all levels of education.
  6. Promoting research on African languages is vital for their development, while the advancement of African research and documentation will be best served by the use of African languages.
  7. The effective and rapid development of science and technology in Africa depends on the use of African languages and modern technology must be used for the development of African languages.
  8. Democracy is essential for the equal development of African languages and African languages are vital for the development of democracy based on equality and social justice.
  9. African languages, like all languages, contain gender bias. The role of African languages in development must overcome this gender bias and achieve gender equality.
  10. African languages are essential for the decolonization of African minds and for the African Renaissance.

This collaborative research project between the Linguistics Departments at Unisa and the University of Göteborg addresses all the concerns expressed in this declaration with reference to the 9 official African Languages of South Africa, and points 4-7 in particular: it will develop a platform of computer-supported basic linguistic resources for applications in translation (point 4), language teaching (point 5), language development (point 6) and language adaptations for science and technology (point 7).

Detailed description of the subject
This project falls within the ambit of Corpus Linguistics, but by its nature and scope also involves the principles and methodologies of Sociolinguistics, Textlinguistics and Conceptual-Functional Typology. Essentially, Corpus Linguistics deals with the analysis of patterns of language use in natural texts. Its approach is empirical in the sense that the analyses involve a large and theoretically informed collection of corpora of discourse or textual data. For the purpose of data storage, management and analysis, extensive use is made of electronic means, particularly computers. The design of the Corpus Linguistic software enables both automatic and interactive processing of data. The analyses involve both statistical (e.g., frequency analyses) and conceptual (e.g., semantic collocation analyses) methods and techniques.
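
To illustrate the statistical side of such analyses, the following is a minimal sketch of a frequency analysis over transcribed utterances. It is not the project's software: the function name `word_frequencies`, the whitespace tokenisation and the placeholder tokens (`w1`, `w2`, `w3`) are all assumptions made for illustration, and a real transcription standard would need a tokeniser that respects its own conventions.

```python
from collections import Counter


def word_frequencies(utterances):
    """Count how often each token occurs across transcribed utterances.

    Assumption: tokens are separated by whitespace and compared
    case-insensitively. A real corpus would need its own tokeniser.
    """
    counts = Counter()
    for utterance in utterances:
        counts.update(utterance.lower().split())
    return counts


# Placeholder tokens only; this is not real corpus data.
sample = ["w1 w2 w1", "w3 w1 w2"]
freqs = word_frequencies(sample)

# List tokens from most to least frequent.
for token, count in freqs.most_common():
    print(token, count)
```

Sorting by count, as `most_common()` does, gives the kind of ranked frequency list from which further statistical comparisons could start.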

The principles underlying the collection of the corpora derive from, among others, Textlinguistics (involving parameters such as styles, registers and genre, formal/informal, role-playing/authentic), Sociolinguistics (involving parameters such as age, gender, levels of education, rural/urban) and Conceptual-Functional Typology (e.g., conceptual rather than formal construals of construction types).

The project involves the development of electronic databases from audio-visual recordings of language usage in a variety of social activities of speakers of S-Sotho, N-Sotho, Tswana, Xhosa, Zulu, Swati, Ndebele, Tsonga and Venda within the framework of Corpus Linguistics. The corpora will contain representative samples in terms of typical sociolinguistic parameters (e.g., age, gender, levels of education, rural/urban) as well as textlinguistic parameters (e.g., styles, registers and genre; formal/informal; role-playing and authentic). Data evaluation and processing will be based on typical markedness criteria developed in language typology.

The Linguistics Department at Göteborg University under the leadership of Professor Jens Allwood established several Corpus Linguistic research projects on Swedish spoken language. Infrastructures for multimodal spoken language corpora as well as the electronic tools to capture, process and analyse the data were developed.

The Linguistics Department at Unisa under the leadership of Professor A P Hendrikse established three Corpus Linguistic projects on African Language Typology, Semantic Collocations in African Languages and Conceptual Metaphors in African Languages.

The idea is to combine the expertise in Corpus Linguistics developed at Göteborg with the conceptual semantic and typological information on the African languages of South Africa developed at Unisa in the development of spoken language corpora for African Languages.

Specific objectives and expected significance of the research
The objective of this project is to develop audio-visual spoken language corpora of the 9 official African Languages of South Africa, namely S-Sotho, N-Sotho, Tswana, Xhosa, Zulu, Swati, Ndebele, Tsonga and Venda.

  • Audio-visual recordings of +/-300 hours of spoken language used in a variety of social activities in natural settings for each one of the languages.
  • The development of infrastructures for the electronic databases of the spoken language corpora, as well as codification principles and a set of morphological and syntactic tags.
  • The adaptation (and, where necessary, the development) of software for the electronic processing of the data, from the Swedish situation to the African Language situation.
  • The interpretation of the electronically processed data and the publication of findings in formats that will enable the development of subsequent applications.

The data analysis aims to provide, among others, the following linguistic resources:

  • word frequencies
  • lexical collocations
  • conceptual metaphors
  • activity-specific phraseology in different registers and styles
  • differentials of spoken and written languages
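
As one possible illustration of how lexical collocations might be extracted from a tokenised transcript, the sketch below scores adjacent word pairs by pointwise mutual information (PMI). PMI is a common association measure chosen here as an assumption; the source does not say which measure the project uses. The function name `bigram_pmi`, the `min_count` threshold and the placeholder tokens are likewise hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
    estimated from raw counts. Pairs seen fewer than `min_count`
    times are skipped, because PMI is unreliable for rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bigrams
        p_a = unigrams[a] / n_unigrams
        p_b = unigrams[b] / n_unigrams
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Placeholder tokens only; this is not real corpus data.
tokens = "a b c a b d a b e c d".split()
scores = bigram_pmi(tokens)
```

A positive score indicates that the pair co-occurs more often than its individual word frequencies would predict, which is the usual starting point for a list of collocation candidates.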

These spoken language corpora will serve as the platform of linguistic resources for the development of applications in the spirit of the Asmara Declaration with specific reference to:

  • Language Development programs (grammars and lexica)
  • Remedial and Therapy programs (linguistic deficiencies, speech therapy and language disorders)
  • Language Teaching (language teaching materials)
  • Multicultural interactions and communication strategies (intercultural understanding, translation and interpretation)