Spoken Language Corpora for the 9 official African Languages of South Africa Project
One of the subjects recorded speaking Xhosa at an imbizo (meeting or gathering) in the Eastern Cape, South Africa.
This is a collaborative research project between the Linguistics Departments at Unisa and the University of Göteborg. The main objective of this project is to develop a platform of computer-supported basic linguistic resources for the previously disadvantaged languages of South Africa. The project aims to develop audio-visual spoken language corpora of the 9 official African Languages of South Africa, namely S-Sotho, N-Sotho, Tswana, Xhosa, Zulu, Swati, Ndebele, Tsonga and Venda.
The basic methodological framework is that of Corpus Linguistics. It involves the audio-visual recording of spoken and phatic language use in a variety of social activities in a natural environment, and the transcription and codification of the data to facilitate their electronic analysis and processing.
The significance of the project lies in its contribution to the creation of textual corpora of natural language use, which will serve as basic linguistic resources for language development (spoken language grammars and lexica), language teaching (language teaching materials), remedial and therapy programs (speech therapy for sufferers of language disorders and other language deficiencies), and language functionality (translation and interpretation).
The Swedish team is mainly responsible for the training of the recorders and transcribers, the design of the infrastructure for the corpora as well as the development of codification systems (tagging) and the design and maintenance of the electronic processing software.
The South African team is responsible for the overall management of the project, and specifically for the collection, collation and codification of data (i.e. recordings and transcriptions), the analysis of collocational and phraseological patterns, and the publishing of the resource material.
Other researchers participating in the project
At a conference entitled Against All Odds: African Languages and Literatures into the 21st Century, held in Asmara, Eritrea, from 11 to 17 January 2000, the attendees issued the so-called Asmara Declaration, which reads:
This collaborative research project between the Linguistics Departments at Unisa and the University of Göteborg addresses all the concerns expressed in this declaration with reference to the 9 official African Languages of South Africa. More specifically, it addresses points 4-7, in that it will develop a platform of computer-supported basic linguistic resources for applications in translation (point 4), language teaching (point 5), language development (point 6) and language adaptations for science and technology (point 7).
Detailed description of the subject
The principles underlying the collection of the corpora may derive from, among others, Textlinguistics (involving parameters such as styles, registers and genre, formal/informal, role-playing/authentic), Sociolinguistics (involving parameters such as age, gender, levels of education, rural/urban) and Conceptual-Functional Typology (e.g., conceptual rather than formal construals of construction types).
The project involves the development of electronic databases from audio-visual recordings of language usage in a variety of social activities of speakers of S-Sotho, N-Sotho, Tswana, Xhosa, Zulu, Swati, Ndebele, Tsonga and Venda within the framework of Corpus Linguistics. The corpora will contain representative samples in terms of typical sociolinguistic parameters (e.g., age, gender, levels of education, rural/urban) as well as textlinguistic parameters (e.g., styles, registers and genre; formal/informal; role-playing and authentic). Data evaluation and processing will be based on typical markedness criteria developed in language typology.
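As a rough illustration of how the sampling parameters named above might be attached to each recording, the following Python sketch defines a metadata record and counts recordings per language and parameter value. It is not the project's actual database design: the class `Recording`, its field names, the helper `coverage` and the sample entries are all hypothetical and invented for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    """Metadata for one audio-visual recording (all field names are hypothetical)."""
    language: str    # e.g. "Xhosa"
    activity: str    # social activity that was recorded
    age_group: str   # sociolinguistic parameter
    gender: str      # sociolinguistic parameter
    education: str   # sociolinguistic parameter
    setting: str     # sociolinguistic parameter: "rural" or "urban"
    formality: str   # textlinguistic parameter: "formal" or "informal"
    authentic: bool  # True for authentic speech, False for role-playing


def coverage(recordings, parameter):
    """Count recordings per (language, value of the given parameter)."""
    return Counter((r.language, getattr(r, parameter)) for r in recordings)


# Invented sample records, for illustration only.
sample = [
    Recording("Xhosa", "imbizo", "40-59", "male", "secondary", "rural", "formal", True),
    Recording("Xhosa", "market", "20-39", "female", "tertiary", "urban", "informal", True),
    Recording("Zulu", "classroom", "20-39", "female", "tertiary", "urban", "formal", False),
]

setting_counts = coverage(sample, "setting")
for (language, setting), count in sorted(setting_counts.items()):
    print(language, setting, count)
```

A tally of this kind would show at a glance which combinations of language and parameter value are still under-represented in a collection.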
The Linguistics Department at Göteborg University, under the leadership of Professor Jens Allwood, established several Corpus Linguistic research projects on Swedish spoken language. Infrastructures for multimodal spoken language corpora, as well as the electronic tools to capture, process and analyse the data, were developed.
The Linguistics Department at Unisa, under the leadership of Professor A P Hendrikse, established three Corpus Linguistic projects on African Language Typology, Semantic Collocations in African Languages and Conceptual Metaphors in African Languages.
The idea is to combine the expertise in Corpus Linguistics developed at Göteborg with the conceptual semantic and typological information on the African languages of South Africa developed at Unisa in the development of spoken language corpora for African Languages.
Specific objectives and expected significance of the research
The data analysis aims at providing, among others, the following linguistic resources:
These spoken language corpora will serve as the platform of linguistic resources for the development of applications in the spirit of the Asmara Declaration with specific reference to: