Developing a Fine-Grained Corpus for a Less-resourced Language: the case of Kurdish
Authors | Roshna Omer Abdulrahman, Hossein Hassani, Sina Ahmadi |
Journal/Conference Name | WS 2019 8 |
Paper Category | Artificial Intelligence |
Paper Abstract | Kurdish is a less-resourced language consisting of different dialects written in various scripts. Approximately 30 million people in different countries speak the language. The lack of corpora is one of the main obstacles in Kurdish language processing. In this paper, we present KTC, the Kurdish Textbooks Corpus, which is composed of 31 K-12 textbooks in the Sorani dialect. The corpus is normalized and categorized into 12 educational subjects, and it contains 693,800 tokens (110,297 types). Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license. |
Date of publication | 2019 |
Code Programming Language | Unspecified |
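The abstract reports the corpus size as tokens (running words) and types (distinct word forms). As a minimal sketch of how such counts relate, the Python snippet below counts both for a string and computes the type-token ratio from the reported figures. The function names and the whitespace tokenization are assumptions made for illustration; they are not the authors' normalization or tokenization method, which this listing does not describe.

```python
from collections import Counter


def token_type_counts(text: str) -> tuple[int, int]:
    """Return (tokens, types) for a text.

    Assumption: tokens are whitespace-separated strings. A real pipeline
    for Sorani would normalize the script and handle punctuation first.
    """
    tokens = text.split()
    # Counter keys are the distinct word forms, i.e. the types.
    return len(tokens), len(Counter(tokens))


def type_token_ratio(n_tokens: int, n_types: int) -> float:
    """Types divided by tokens; 0.0 for an empty text."""
    return n_types / n_tokens if n_tokens else 0.0


if __name__ == "__main__":
    # Toy input: 6 tokens, 3 distinct types.
    print(token_type_counts("a b a c b a"))  # (6, 3)

    # Figures reported in the abstract: 693,800 tokens, 110,297 types.
    print(f"{type_token_ratio(693_800, 110_297):.3f}")  # 0.159
```

With the reported figures, roughly one in six running words is a new word form under this definition. The ratio depends heavily on corpus size and tokenization, so it is only comparable between corpora processed the same way.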