• Identifying relations between characters in Afrikaans, Tshivenḓa, and Xitsonga books

    Phathushedzo Maxwell Ramukhadi, Respect Mlambo, Benito Trollip, Menno van Zaanen
    DH2020, Network for Digital Humanities in Africa
    Computational linguistics, Natural language processing (Computer science), South African literature
    Meeting Date:
    20-24 July 2020
    Afrikaans, Named Entity Recognition, Tshivenda, Xitsonga, Natural language processing
    The usefulness of computational linguistic tools, such as named entity recognition (NER) systems, in linguistic or literary studies of under-resourced languages is still relatively unexplored. In this study the CTexTools2 NER system, which performs NER for all official South African languages, is applied to one Afrikaans novel and two scanned dramas, one in Tshivenḓa and one in Xitsonga. Next, personal relations are identified through the co-occurrence of character names in sentences, and these relationships are visualized using Gephi. The research identified several practical problems. First, the NCHLT Optical Character Recognition (OCR) system for South African languages was used to extract text from the Tshivenḓa and Xitsonga scans, but the quality of the resulting text turned out to be relatively low. We expect that one of the reasons is that the language model of the OCR system was trained on limited amounts of data. Second, the quality of the NER was also low for all three languages, not only because of limited amounts of training data, but also because the recognizers were trained on government data, which typically contains few named entities. Third, we found that in Tshivenḓa the same person can be addressed in different ways, with the formal form marked by the prefix Vho-. For example, Vho-Tshibovhola (formal) and Tshibovhola (informal) refer to the same person. Finally, identifying the relationships between characters turned out to be difficult. All texts are relatively short, so few sentences contain more than one character. The type of text also has an effect: whereas the Afrikaans novel contains some multi-character sentences, the Tshivenḓa and Xitsonga plays contain hardly any. This shows that relationship finding based on sentence-level co-occurrence may not be well suited to plays. To address this, we treat the speaker of each sentence as a character in that sentence, which yields more extensive relationship information.
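The relationship-finding step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that character names have already been recognized, that each sentence comes with an optional speaker label, and that stripping the Vho- prefix is enough to link formal and informal forms. The function names, the sample sentences, and the character name Masindi are hypothetical; only Vho-Tshibovhola comes from the abstract.

```python
import csv
import io
import re
from collections import Counter
from itertools import combinations

HONORIFIC = "Vho-"  # formal prefix mentioned in the abstract


def normalize_name(token):
    """Map the formal form (Vho-X) onto the informal form (X)."""
    return token[len(HONORIFIC):] if token.startswith(HONORIFIC) else token


def cooccurrence_edges(lines, characters):
    """Count how often two characters share a sentence.

    `lines` is a list of (speaker, text) pairs; speaker is None for
    narrative text. The speaker, if any, counts as present in the sentence.
    """
    edges = Counter()
    for speaker, text in lines:
        tokens = re.findall(r"[\w-]+", text)
        present = {normalize_name(t) for t in tokens} & characters
        if speaker is not None:
            present.add(normalize_name(speaker))
        for a, b in combinations(sorted(present), 2):
            edges[(a, b)] += 1
    return edges


def to_gephi_csv(edges):
    """Serialize edges as a Source,Target,Weight CSV edge list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(edges.items()):
        writer.writerow([a, b, weight])
    return buffer.getvalue()


# Hypothetical sample: two lines of dialogue and one narrative sentence.
characters = {"Tshibovhola", "Masindi"}
lines = [
    ("Masindi", "Where is Vho-Tshibovhola?"),
    (None, "Tshibovhola greeted Masindi."),
    ("Tshibovhola", "I am here."),
]
edges = cooccurrence_edges(lines, characters)
print(to_gephi_csv(edges))
```

In this sample the first sentence produces an edge only because the speaker is counted as a character, which mirrors the adjustment proposed for plays. The CSV uses the Source, Target, and Weight column names that Gephi accepts when importing an edge list.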
    Presenting Author: Respect Mlambo (respect.mlambo@nwu.ac.za)
    Last Updated:
    2 years ago


    Item Name: mp4 adho.2020.identifying.relations.1.0.rm_.mp4