Somali Corpus
The Somali Corpus, also known as Kaydka Af Soomaaliga (KAF), is a digital collection of texts in Somali, a language spoken in Greater Somalia, Ethiopia, and Kenya. Jama Musse Jama developed it in 2016, starting with 3 million words of Somali literature and language material,[1][2] as part of his doctoral dissertation.[3] The corpus currently contains over 7 million words, mainly from literature, poetry, songs, news, essays, and political speeches,[4] making it one of the most extensive corpora of any African language in its range of text types and an important addition to online materials for under-resourced languages.[5][6][7][8] The words of the corpus are tagged for part-of-speech categories. The corpus can be used to derive frequency lists of Somali words,[9] and it also serves as the basis for an online Somali spell checker.[10]
Other Somali language corpora
- Bangiga Af Soomaaliga,[11] 79.7 million tokens (as of October 2024), at the Swedish Language Bank, University of Gothenburg.
- Somali Web Corpus,[12] 18.9 million tokens (as of October 2024), at NLP Center, Brno University, Czech Republic, in cooperation with Oslo and Addis Ababa.
References
- "The Official Somali Corpus 2016".
- Morgan Nilsson. 2018. Three Somali Language Corpora: How can they be useful? https://morgannilsson.se/ppt/2018-08-15-Mogadishu.pdf
- Jama Musse Jama. 2016. A Syntactically Annotated Corpus of Somali Literature. Unpublished PhD thesis.
- Jama Musse Jama. 2017. Somali Corpus: state of the art, and tools for linguistic analysis. https://www.academia.edu/26504727/Somali_Corpus_state_of_the_art_and_tools_for_linguistic_analysis
- Bendjaballah, Sabrina. 2024. Somali particle clusters: Complete paradigms, syncretism and corpus frequency. Brill’s Journal of Afroasiatic Languages and Linguistics 16(1). 102–136. https://doi.org/10.1163/18776930-01601003
- Mohammed, Siraj. 2020. Using machine learning to build POS tagger for under-resourced language: the case of Somali. International Journal of Information Technology 12(3). 717–729. https://doi.org/10.1007/s41870-020-00480-2
- Hashi, Awil. 2014. Developing a Model Corpus for Endangered Languages. Doctoral thesis. Graduate Studies, University of Calgary. https://doi.org/10.11575/PRISM/25614
- Nimaan, Abdillahi. 2014. Building and Evaluating Somali Language Corpora. In Jeff Good, Julia Hirschberg & Owen Rambow (eds.), Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, 73–76. Baltimore, Maryland, USA: Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-2210
- Giorgio Banti. 2022. Some Issues For An Etymological Dictionary Of Somali. https://www.academia.edu/81600790/Banti_2022_Some_issues_for_an_Etymological_Dictionary_of_Somali
- "RCF Somali Corpus | 2012-2013".
- "Bangiga Af Soomaaliga".
- "Somali Web Corpus".