Development of Morphological Analyzer For Af-Somali
September 4, 2022 | Author: Anonymous | Category: N/A
DEVELOPMENT OF MORPHOLOGICAL ANALYZER FOR AF-SOMALI
MAHDI YONIS KAYAD
A Thesis Submitted to the Department of Computer Science in Partial Fulfillment for the Degree of Master of Science in Computer Science
Addis Ababa, Ethiopia May, 2017
Advisor: Dr. Yaregal Assabie
This is to certify that the thesis prepared by Mahdi Yonis, titled "Development of Morphological Analyzer for Af-Somali" and submitted in partial fulfillment of the requirements for the Degree of Master of Science in Computer Science, complies with the regulations of the University and meets the accepted standards with respect to originality and quality.
Signed by the Examining Committee:
Name __________ Signature __________ Date __________
Advisor: __________
Examiner: __________
Examiner: __________
ABSTRACT
Morphological analysis is critical for natural language processing tasks, especially on inflectional languages. This thesis gives the implementation details of a morphological analyzer for Af-Somali, which is an inflectional language. A detailed computational analysis of Af-Somali morphology, including the formalization of its alternation and morphotactic rules, is worked out in order to create the analyzer. In the implementation, the alternation and morphotactic rules of Af-Somali are represented by two-level morphology rules. This is the first detailed computational analysis of Af-Somali from a morphological point of view. The thesis is mainly based on the dictionary by Annarita, known as Qaamuus, and on the declensions of nouns described by Andrzejewski. The work follows the finite-state two-level approach using the Xerox finite state toolkit. It is done in two parts: the lexicon is encoded in the lexical formalism (lexc), and the alternation rules are implemented in xfst. We evaluated the morphological analyzer by comparing the number of word tokens correctly analyzed with the number of words incorrectly processed. We manually annotated 218 tokens (90 nouns, 120 verbs and 8 adjectives) taken from the Qaamuus dictionary. Of these, 77 nouns, 105 verbs and 6 adjectives were correctly analyzed. That is, 85.5% of nouns, 87.5% of verbs and 75% of adjectives were correctly analyzed. Of the total of 218 tokens, 86.2% were correctly analyzed and 13.76% were wrongly analyzed, and a total of 10 tokens failed to be analyzed by the system. The results were evaluated by a human reader familiar with the language. We therefore consider this an encouraging result for a preliminary work on the computational treatment of Af-Somali.
Keywords: Natural Language Processing (NLP), morphological analyzer, finite state transducer (FST), Xerox finite state toolkit (XFST), lexical formalism (LEXC).
ACKNOWLEDGEMENTS
I thank all who in one way or another contributed to the completion of this thesis. First, I give thanks to Allah, who gives me protection and the ability to work. I am grateful to Addis Ababa University, the College of Natural Science and the Department of Computer Science for making it possible for me to study here. I give deep thanks to the lecturers at the Department of Computer Science, the librarians, and the other workers of the faculty. My special and heartfelt thanks go to my advisor, Dr. Yaregal Assabie, who encouraged and directed me. His challenges brought this work towards completion, and it is with his advice that this work came into existence. For any faults I take full responsibility. My special gratitude and appreciation also go to Annarita Puglielli and Cabdalla Cumar Mansuur for their invaluable contribution to the Af-Somali dictionary, which was the first fully written dictionary with full grammatical information. Their discussions and comments on Af-Somali lexicons and morphology have been the base of this work. Moreover, I am grateful to the many friends and colleagues who stood by me through these difficult years. I appreciate my dear mother and my good-hearted brothers, Mr Abdirashid Yonis and Hamse Yonis, who have supported and helped me through many setbacks; I greatly value their contribution.
Table of Contents
List of Figures
List of Tables
Chapter 1 : Introduction
1.1 Background of the Study
1.2 Morphological Analysis
1.3 Statement of the Problem
1.4 Objectives
1.5 Methodology
1.5.1 Literature Review
1.5.2 Data Collection and Classification
1.5.3 Analysis
1.5.4 Implementation
1.5.5 Testing
1.6 Application of the Result
1.7 Scope and Limitation
1.8 Organization of the Thesis
Chapter 2 : Literature Review
2.1 Introduction
2.2 Introduction to Morphological Analysis
2.2.1 Morphemes
2.2.2 Affixes
2.2.3 Types of Morphological Processes
2.2.4 Inflection
2.2.5 Derivation
2.2.6 Compounding
2.3 AF-Somali Morphology
2.3.1 AF-Somali Phonetics
2.3.2 Basic Characteristics of Af-Somali
2.4 Inflectional Process of AF-Somali
2.4.1 Nouns
2.4.2 AF-Somali Noun Determiners
2.4.3 Adjectives
2.4.4 The Verb
2.4.5 Classification AF-Somali Verbs
2.5 Derivational System of AF-Somali
2.6 Approaches to Morphological Analysis
2.6.1 Corpus-based Approaches
2.6.2 Rule-based Approach
2.7 Finite State Technology
2.7.1 Finite State Machines
2.7.2 Finite-state transducers
2.7.3 Two Level Morphological Approach
2.7.4 The Xerox Finite State Frame work
2.8 Summary
Chapter 3 : Related work
3.1 Introduction
3.2 Morphological Analyzer for European Languages
3.3 Morphological Analyzer for Asian Languages
3.4 Morphological Analyzer for Ethiopian Languages
3.5 Summary
Chapter 4 : Design of Af-Somali Morphological Analyzer
4.1 Introduction
4.2 General Architecture of AF-Somali Morphological Analyzer
4.2.1 Lexicon/ Morph-tactics
4.2.2 Alternation Rules
4.3 The Design of AF-Somali Part-Of-Speech Lexicon and Alternation Rules
4.3.1 AF-Somali Verb Lexicon Design
4.3.2 Alternation Rules of AF-Somali Verbs
4.3.3 Noun Lexicon Design
4.3.4 Alternation Rules of AF-Somali Nouns
4.3.5 Adjectives Lexicon Design
Chapter 5 : Experimentation and Evaluation
5.1 Introduction
5.2 Experimentation
5.3 Discussion and Evaluation
Chapter 6 : Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
Appendix-A: Alternation Rules for Noun and Verb
Appendix-B: Af-Somali verb Lexicon
Appendix-C: Af-Somali Noun lexicon
List of Figures
Figure 2-1: Example of two level representation of Af-Somali
Figure 2-2: Creation of a lexical transducer. The .o. operator represents the composition operation
Figure 4-1: Af-Somali morphological analyzer architecture design
Figure 4-2: Af-Somali verb lexicon
Figure 4-3: Af-Somali verbs finite state networks
Figure 4-4: Example representation of Af-Somali second and third group verb FSN
Figure 4-5: Af-Somali verbs alternation rules
Figure 4-6: Alternation rule representation with xfst
Figure 4-7: person morpheme realization
Figure 4-8: Af-Somali noun lexicon
Figure 4-9: Af-Somali noun suffixes
Figure 4-10: Af-Somali verb finite state networks
Figure 4-11: Af-Somali noun alternation rules
Figure 4-12: Af-Somali adjective lexicon
Figure 4-13: Af-Somali Adjective finite state networks
Figure 5-1: AF-Somali Verb to suffix attachment
List of Tables
Table 2.1: Pluralization system of Af-Somali
Table 2.2: Derivational inflected plural form of Af-Somali
Table 2.3: Af-Somali Gender Markers
Table 2.4: Example of noun with gender markers
Table 2.5: Example of Af-Somali pluralization and declension formation
Table 2.6: Example of Af-Somali Articles
Table 2.7: Af-Somali Demonstratives
Table 2.8: AF-Somali possessive
Table 2.9: Interrogative representation of Af-Somali
Table 2.10: Pluralization of adjectives
Table 2.11: Example of person agreement with tenses
Table 2.12: First conjugation representation of Af-Somali verbs
Table 2.13: Second Af-Somali verb conjugation (toosi)
Table 2.14: Example of Af-Somali 3rd. conjugation representation
Table 2.15: Fourth Af-Somali verb conjugation representation
Table 2.16: Example of Af-Somali two level representation
Table 4.1: Tags of AF-Somali grammatical information
Table 4.2: Mappings of root words and their morphemes
Table 4.3: An example of Af-Somali verb morphotactics
Table 4.4: Realization with sh when it suffixed with t
Table 4.5: Example of noun declension 2 morphotactics
Table 4.6: Partial reduplication of nouns
Table 4.7: The alternation of declension 5 representation
Table 4.8: Example of adjective morphotactics
Table 5.1: Overall accuracy of the system
List of Abbreviations
Af-Somali: Somali Language
FSA: Finite State Automata
FST: Finite State Transducers
IR: Information Retrieval
MT: Machine Translation
NLP: Natural Language Processing
POS: Part-Of-Speech
SOV: Subject-Object-Verb
Chapter 1 : Introduction
1.1 Background of the Study
A natural language is the preferred medium of communication for people. It can be in spoken or written form, neither of which is easily understood by computers. Computers therefore need a mechanism with sufficient information about the language, including its word grammar and sentence structure. The processing of this information by a computer is known as natural language processing (NLP). NLP is used both for generating human-readable information from computer systems and for converting human language into more formal structures that a computer can understand [6]. It is a field of study that involves different levels of linguistic analysis, such as phonetic, morphological, syntactic and semantic analysis; morphological analysis is the basic level for many NLP applications.
1.2 Morphological Analysis
Morphological analysis is the process of segmenting words into morphemes, assigning grammatical information to grammatical categories, and assigning lexical information to a particular lexeme or lemma [30]. It retrieves the grammatical features and properties of an inflected word. The analyzer breaks the word into minimal meaning-bearing morphemes and produces morphosyntactic features such as the root, tense, person and number.
Words are formed by a combination of one or more free morphemes and zero or more bound morphemes. In spoken language, morphemes are composed of phonemes, the smallest linguistically distinctive units of sound. Examples of morphemes are re-, de-, un-, -ish, -ly, -ceive, -mand, tie, boy and like, as found in receive, demand, untie, boyish and likely. Morphology is seen as "the study of words that are formally and semantically related". In order to consider a word as an expression, it must be characterized as having three features: a phonological form, a category or word class, and a meaning.
Morphology is concerned with the study of the internal structure of words. Morphological analysis consists of identifying the parts, or constituents, of words. For example, the word toosi (strengthen) in Af-Somali consists of two constituents, the root word toos (straight) and the imperative marker (i). Morphological analysis primarily consists of breaking words up into their parts and establishing the rules that govern the co-occurrence of these parts. Morphology can be viewed as the process of building words by inflection and word formation. So, the task of morphological analysis is to take word forms and relate them to other word forms, while at the same time deriving information about the form [30].
A morphological analyzer is an essential and basic tool for building any natural language processing application, such as a machine translation system, and it is an essential technology for most text analysis applications, such as information retrieval (IR) and text summarization. The most obvious applications are found in the areas of lexicography and computational linguistics [24]. Two factors are essential to achieve accurate automatic morphological analysis: one is the construction of a set of morphological rules (morphotactics) and the other is the morphological analysis procedure [24]. The absence or underperformance of either of them impairs the overall ability of the morphological analyzer. For example, for the word "dogs", we can say that "dog" is the root form and "s" is the affix. Here the affix gives the number information of the root word. Thus, morphological analysis is centered on the analysis and generation of word forms. It deals with the internal structure of words and how those words can be formed.
Morphological analysis also plays an important role in applications such as spell checking, electronic dictionary interfacing and information retrieval systems, where it is important that words which are only morphological variants of each other are identified and treated similarly [30]. In NLP, and especially in machine translation (MT) systems, we need to identify words in texts in order to determine their syntactic and semantic properties. Morphological study helps us by providing rules for analyzing the structure and formation of words.
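The two factors named above, a set of morphotactic rules and an analysis procedure, can be illustrated with a minimal Python sketch. It is not the lexc/xfst implementation developed in this thesis. The tiny lexicon, the root classes V and N, and the tags +Imp and +Pl are hypothetical and chosen only to cover the two examples from the text (toosi and dogs).

```python
# Toy lexicon: each root is assigned a class (illustrative assumption).
ROOTS = {"toos": "V", "dog": "N"}

# Toy morphotactics: which suffixes each class may take, and the tag each
# suffix contributes. The empty suffix stands for the bare root.
SUFFIXES = {
    "V": {"": "", "i": "+Imp"},
    "N": {"": "", "s": "+Pl"},
}

def analyze(word):
    """Return all analyses of `word` licensed by the toy lexicon and rules."""
    analyses = []
    # Try every split of the word into a candidate root and suffix.
    for cut in range(1, len(word) + 1):
        root, suffix = word[:cut], word[cut:]
        root_class = ROOTS.get(root)
        if root_class is None:
            continue  # candidate root is not in the lexicon
        tag = SUFFIXES[root_class].get(suffix)
        if tag is not None:  # suffix is allowed after this class of root
            analyses.append(f"{root}+{root_class}{tag}")
    return analyses

print(analyze("toosi"))  # ['toos+V+Imp']
print(analyze("dogs"))   # ['dog+N+Pl']
print(analyze("dogi"))   # [] because the noun class does not allow -i
```

The last call shows why both factors are needed: the root dog is in the lexicon, but the morphotactic rules reject the combination dog + i, so no analysis is produced.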
Therefore, having a morphological analyzer is a vital step in starting natural language processing for any natural language. For lesser-studied and under-resourced languages in particular, it is often a practical and extremely valuable first step, making use of corpora, lexicons, morphological grammars and phonological rules already produced by field linguists and descriptive linguists [9]. Several morphological analyzers have been developed for well-documented languages such as English [30] and Arabic [13]. There are also some significant studies in the area of computational morphology for Ethiopian languages such as Amharic [5, 8, 21, 22, 29], Oromo [22] and Tigrinya [22]. Moreover, there are works on Afaraf [2] and on Ge’ez by Yitayal Abate [34]. However, to the best of our knowledge, no academic or published study has so far been made to develop a morphological analyzer for Af-Somali.
1.3 Statement of the Problem
Af-Somali is the official language of Somalia and of the Ethiopian Somali region, and it is the working language of the Kenyan Northern Province and Djibouti [26]. It is also the medium of instruction in the schools of these areas, which means that the language is spoken by a large number of people and deserves attention in terms of computational processing. Furthermore, a large number of official documents, religious books and computerized documents are found in Af-Somali, which makes the language predominantly used in word processing activities in different areas. In addition, some NLP applications have been developed for Af-Somali, such as the machine translation system by Google, a bilingual English-to-Somali electronic dictionary project and a Somali speech corpus by Niman Abdillahi [26], and these need to identify words in texts in order to determine their syntactic and semantic properties and their lexical category. For example, when translating an Af-Somali word into English using the electronic dictionary, users may not find the exact meaning or the corresponding English word. This process first needs a morphological analyzer to determine the category of the word, for instance whether it is past or present, and to identify its part of speech. Furthermore, if someone wants to conduct research on NLP and to access the resources available in different formats in Af-Somali, the language must either be processed computationally or be translated into well-resourced languages.
Considerable research has been done on NLP systems for the main Ethiopian languages in general, including various works on computational morphology for Amharic [5, 8, 21, 22, 29], Afaan Oromo [22], Tigrinya [22] and Afaraf [2]. However, no research has been conducted so far on automatic morphological analysis for Af-Somali. The absence of morphological analysis systems limits the effort to make computers work comfortably with Af-Somali. Af-Somali shares its Cushitic origin with Afaraf, Afaan Oromo and the other languages of the Cushitic family, and it is similar to them in vocabulary and grammatical structure; in particular, they all follow SOV word order. However, it has its own unique features: it differs extensively in its focus markers for nouns and verbs, in its morphology, and in aspects of word order that resemble the Semitic family of Arabic. It is also unique in that modifiers occupy a single position, and in its pluralization patterns and word formation processes; hence, it needs its own independent morphological analyzer. Af-Somali is morphologically rich, and word formation in the language involves a number of different morphological features, including complex verb and noun inflection, derivation and compounding. Because of this complexity, an automated morphological analyzer is difficult to construct, and building one is a challenging task. Moreover, Af-Somali has particularly complex verb inflection, which adds a large number of affixes to the stem, and morphological analysis is vital for the development of many practical natural language processing systems such as machine-readable dictionaries, machine translation, information retrieval, spell checkers and speech recognition. Therefore, the aim of this work is to conduct research on morphological analysis for Af-Somali that can be implemented from a computational point of view, to analyze the word and morphological categories and the word formation processes of the language, and to model computational morphological analysis for Af-Somali.
1.4 Objectives
General Objective
The main objective of this thesis work is to develop a morphological analyzer for Af-Somali word morphology.
Specific Objectives
In order to achieve the above general objective, this thesis work has the following specific objectives:
Studying and understanding the word and morphological categories in Af-Somali;
Studying and understanding the phonological and morphological alternation rules involved in Af-Somali word formation and conjugation;
Assessing the different techniques and approaches employed so far in morphological analysis tasks and selecting the ones that are appropriate to the properties of Somali inflectional morphology;
Designing a morphological analyzer for Af-Somali;
Formulating the phonological/orthographic rules involved in inflectional morphological processes in the language;
Testing the prototype of the morphological analyzer to measure its performance.
1.5 Methodology
1.5.1 Literature Review
A literature review will be conducted to understand the language’s morphology in developing the morphological analyzer. Scholars in the area of Af-Somali morphology will be consulted to better understand the morphology of the language and to obtain information that is helpful for the thesis work. Developing a morphological analyzer requires analyzing and identifying the properties of Somali word formation, and it is important to review the research done on the development of morphological analyzers for other languages. This review also helps in studying and selecting a suitable approach to morphology for Af-Somali. Besides this, literature in the area of morphological analysis in particular, and computational linguistics in general (e.g. approaches), will be reviewed to better understand how words are analyzed. On this basis, the finite state transducer based approach to morphological analysis was selected to analyze and derive the root and grammatical properties of Somali words.
1.5.2 Data Collection and Classification
Any study needs to collect and analyze data relevant to the research being conducted. In this thesis work, corpus data in the form of an electronic list of Af-Somali words, such as those found in the book known as Qaamuus and in different magazines on the internet, will be collected. The unique word forms will be classified into different categories such as nouns, verbs and adjectives, and further subdivisions will be made according to their morpho-syntactic behavior using the Xerox finite state tool.
1.5.3 Analysis
The classified data will be analyzed into roots (or stems) and affixes for each category using the Xerox finite state tool in the lexicon formalism. Then the phonological rules will be identified and formalized for each category using the xfst tool.
1.5.4 Implementation
Finite state transducers for each group of words will be created following the concept of the ‘finite state transducer’. Then, a computational model for Af-Somali inflectional morphology will be implemented using the Xerox Finite State Tool (xfst), developed by the two principal researchers at the Xerox Palo Alto Research Center.
1.5.5 Testing
In this thesis work, the finite state approach will be used to develop the morphological analyzer. A word list of surface word forms (tokens) will be extracted from the Af-Somali dictionary (Qaamuus) and fed into the prototype to be analyzed. An output is considered correct only if it contains all legal combinations of roots and grammatical structures for a given word form and includes no incorrect roots or structures.
1.6 Application of the Result
As a morphological analyzer is a vital step in starting natural language processing for any language, the Af-Somali morphological analyzer is developed to enable more efficient and improved NLP applications such as spelling and grammar checkers, POS taggers and machine translation systems. Besides, it makes a great contribution to linguistic experts, who can easily analyze the language’s morphological properties. When applications related to Af-Somali are developed, end users who are seeking information stored in Af-Somali can also benefit from the analyzer, since it identifies the morphological category of a word. In this regard, this work can be a basic and very useful contribution to the technological improvement of the language. The computational analysis of Af-Somali morphology would be a central and essential component for the development of other Af-Somali processing applications.
1.7 Scope and Limitation
Somali linguistic varieties are divided into three main groups: Northern, Benadir and Maay. Northern Somali forms the basis for Standard Af-Somali. So, the scope of this study is limited to developing a morphological analyzer for the morphology of Standard (Northern) Af-Somali. It does not include other dialects of Af-Somali. Moreover, this study mainly focuses on the written form of words. Derivation and compounding are also morphologically important, but they have not been dealt with in this thesis work. Although there are a number of models and approaches for computational analysis in the literature, a finite state approach is employed in this thesis work.
1.8 Organization of the Thesis
This thesis work is structured into six chapters. The first chapter gives background information on the thesis work: it introduces natural language processing and morphological analysis and presents the problems that motivated us, the objectives and the methodologies followed. The first chapter also describes the importance and the scope of the thesis work. Chapter 2 presents the literature reviewed for the thesis work. It looks into general Af-Somali word morphology and the general characteristics of the Af-Somali parts of speech, and it also presents the morphological analysis approaches. The studies related to this thesis work are presented in Chapter 3. Chapter 4 describes the design and implementation based on the analyses done in the preceding chapters. In Chapter 5, the experimentation and evaluation are discussed. Finally, Chapter 6 concludes the thesis work and gives directions for future work.
Chapter 2: Literature Review
2.1 Introduction
This chapter presents the documents reviewed, which are important for the development of the Af-Somali morphological analyzer. Mainly, this chapter presents Af-Somali morphology, giving more emphasis to the description of the morphological processes involved in word formation and generation. It also presents Af-Somali background information and phonetics. In addition, the chapter reviews the different computational approaches employed in natural language processing systems and morphological analysis.
2.2 Introduction to Morphological Analysis
2.2.1 Morphemes
Morphs are the phonological/orthographical realization of morphemes. A single morpheme may be realized by more than one morph. In such cases, the morphs are said to be allomorphs of a single morpheme. The following examples demonstrate the concept of morphemes and their realization as morphs. Free morphemes, like town and dog in English, can appear with other lexemes (as in town hall or dog house) or they can stand alone, i.e. they are "free". In other words, free morphemes are morphemes that can stand by themselves as single words, for example caleemo (‘leaves’) and saar (‘get off’) in Af-Somali, whereas bound morphemes (or affixes) never stand alone. They always appear attached to other morphemes; for example, English "un-" appears only together with other morphemes to form a lexeme. Bound morphemes in general tend to be prefixes and suffixes. For example, in Af-Somali, the morph ‘in’ is the realization of the morpheme that marks the verb infinitive. Words like “afuri, ababi, toosi” and other verbs of the second group of Af-Somali verbs use “in” as the infinitive marker, which gives “afurin, ababin and toosin”. But when the same morpheme is attached to a different word, it may be realized as a different morph. So, the same morpheme can be realized by different morphs in a language. These different morphs of the same morpheme are called allomorphs. An allomorph is a special variant of a morpheme. For example, the second person singular marker in Af-Somali is sometimes realized as o, t or s, and the definite marker -t of feminine nouns has the morph "-t" in birta (the metal) but "-d" in mindida (the knife). These are allomorphs of "-t". A group of allomorphs makes up one morpheme class. In addition, morphology deals with all the combinations that form words or parts of words. The two broad classes of morphemes are stems and affixes. The stem is the “main morpheme” of the word, supplying the main meaning. For example, in “guriga”, guri (house) is the stem and “ga” is the affix, which adds the meaning “the”.
2.2.2 Affixes
An affix is a bound morph that is realized as a sequence of phonemes. Affixes are classified according to whether they are attached before or after the form to which they are added: prefixes are attached before and suffixes after. Most Af-Somali words use suffixes, and a small number of verbs use prefixes. Languages can also be classified as concatenative or non-concatenative based on the morphology they possess. Non-concatenative morphology is also called template or root-and-pattern morphology, and Af-Somali possesses this system in its plural formation of nouns. An example is the duplifix “aC” of the fourth noun declension, as in buug (plural buugag) and fool (plural foolal).
2.2.3 Types of Morphological Processes
A word is defined as the smallest vocally expressible unit of thought, composed of one or more sounds combined in one or more syllables. A word is a minimum free form consisting of one or more morphemes. There are three broad classes of ways to form words from morphemes, namely inflection, derivation and compounding, and Af-Somali makes use of all three in word formation.
2.2.4 Inflection
Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem and usually filling some syntactic function. Inflection is productive, and the meaning of the resulting word is easily predictable; for example, the verb toos (direct!) with the imperative marker gives toos+i (straighten). Inflectional morphemes modify a word's tense, number, aspect, and so on.
2.2.5 Derivation
Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning that is hard to predict exactly. In derivation, the part of speech (POS) of the newly derived word may change. Af-Somali mostly uses inflectional word formation processes, although some words are formed by derivation.
2.2.6 Compounding
Compounding is the joining of two or more base forms to form a new word. Such root-root fusions are very common in written Af-Somali. Compounds are formed by combining noun forms that carry semantic content with different verbal forms. For example, the Af-Somali plural noun “buugag” (books) combines with the verbal form “sheeg” to give another noun, “buugagsheeg” (bibliography).
2.3 AF-Somali Morphology
The Somali language (Af-Somali) is an Afro-Asiatic language belonging to the Cushitic branch of the family. It is a Lowland East Cushitic language spoken by up to roughly 16 million people in Somalia, Somaliland, Puntland, Djibouti, Ethiopia (Somali Region) and Kenya (Northeastern Province) [25]. Somali linguistic varieties are divided into three main groups: Northern, Benadir and Maay. Northern Somali (or Northern-Central Somali) forms the basis for the Standard Somali language. The Northern Somali dialect, commonly known simply as the Somali language, is spoken in Djibouti, Ethiopia, Puntland and north of the Wabi-Shabeele, and it represents the spoken standard of literary Somali [26]. The writing system of the language was adopted in 1972, and there are no textual archives before this date. It uses Roman letters and does not mark the tonal accent [26].
2.3.1 AF-Somali Phonetics
The phonetic structure of Af-Somali has 22 consonants and 10 vowels, 5 long and 5 short [33]. Af-Somali is also a tone accent language with 2 to 3 lexical tones. Af-Somali consonants follow the same order and have the same values as the equivalent letters of the Arabic alphabet, except G. As presented below, some letters are not found in English, and these letters are similar to voiced Arabic sounds. The Af-Somali alphabet begins with ' (alif), followed by 21 consonants, which are B, T, J, X, KH, D, R, S, SH, DH, C, G, F, Q, K, L, M, N, W, H and Y, and by the ten vowels of the Somali language, which are the short vowels a, i, e, u and o and their long counterparts aa, ee, ii, oo and uu. The Latin letters pose no difficulty for readers, and the vowels have the same values as in Spanish or Italian.
2.3.2 Basic Characteristics of AF-Somali
The syllable structure of the Somali language is (C)V(C)(C) [items in parentheses are optional], and most words have a di- or tri-syllabic structure (root morphemes and affixes are usually mono- or disyllabic) [33]. Af-Somali shares its Cushitic origin with Afaraf, Afaan Oromo and the other languages of the Cushitic family, and it is similar to them in vocabulary and in basic word order, which means they follow SOV structure. The most distinguishing characteristic of Af-Somali, however, is its double pluralization processes, such as the ones illustrated in Table 2.1, where an independently productive plural suffix -yáal can be added to already plural forms such as nim-á-n ‘men’ or naag-ó ‘women’.
Table 2.1: Pluralization system of Af-Somali
| Singular word | Simple plural | Plural of plural |
| --- | --- | --- |
| Nin(ka), masculine: ‘the man’ | Niman(ka), masculine: ‘men’ | Niman-yaal, feminine: ‘groups of men’ |
| Roob(ka), masculine: ‘the rain’ | Roobab(ka), masculine: ‘rains’ | Roobab-yow(ga), masculine |
Another important characteristic that distinguishes Af-Somali from the other Cushitic languages is the existence of an unquestionably derivational process that takes inflected plural forms as its basis, as illustrated in Table 2.2.
Table 2.2: Derivational inflected plural form of Af-Somali
| Root word | Inflected plural | Derivation | Word/English |
| --- | --- | --- | --- |
| Buug | ag | sheeg | Buugagsheeg/bibliography |
| Buug | ag | haye | Buugaghaye/librarian |
| Geed | o | aqoon | Geedaqoon/botany |
| Xagl | o | gooye | Xaglogooye/diagonal |
Therefore, like any other language, Af-Somali has some common notable characteristics: an inflectional system, inflected forms in composition/derivation, conjugational classes, affixation and reduplication. In addition, there are three broad classes of ways to form words from morphemes in Af-Somali, namely inflection, derivation and compounding. In this work we consider the analysis of the inflectional word formation processes of the important Af-Somali parts of speech. The most important parts of speech in the Somali language are nouns, verbs and adjectives, and we present their word formation processes in the following sections.
2.4 Inflectional Process of AF-Somali
2.4.1 Nouns
Grammatically, Af-Somali nouns are encoded morphologically by way of affixation to roots and stems. As in other related languages, Af-Somali nouns are inflected for gender, number and person. Nouns in Af-Somali, as in other languages, are the names of persons, places, things and abstract entities. Nouns are inherently masculine or feminine. In general, a noun consists of a root and affixes, which provide a combination of gender and number marking. The main complication is that there are several declension classes, with specific singular and plural suffixes for groups of classes. Af-Somali nouns are thus marked for gender, pluralization and determiners, as presented in the following subsections.
Gender Markers
Somali nouns can be marked for gender to distinguish between masculine and feminine. Some Af-Somali nouns are distinguished only by a difference in accentual tone, but in this thesis work we consider only the nouns that are overtly marked for gender. The markers of Af-Somali gender are all suffixes. The markers for the feminine and masculine are shown in Table 2.3: “ka, ha, a and ga” are masculine markers, and “ta, da and sha” are feminine markers.
Table 2.3: Af-Somali Gender Markers
| Gender | Markers |
| --- | --- |
| Masculine | ka, ha, ga, a |
| Feminine | ta, da, sha |
The gender markers in Af-Somali are attached to nouns as suffixes to differentiate between masculine and feminine. Table 2.4 shows how the markers are suffixed to Af-Somali nouns.
Table 2.4: Example of noun with gender markers
| Words | Masculine marker | Words | Feminine marker |
| --- | --- | --- | --- |
| Ninka (the man) | ka | Gabadha (the girl) | da |
| Guriga (the house) | ga | Badda (the sea) | da |
| Aabaha (a father) | ha | Hasha (the camel) | sha |
Although we have presented the nouns and how the gender markers are suffixed to them, there are different rules that have to be captured in this study. In Af-Somali the basic markers for gender are “ka” and “kii” for masculine and “ta” and “tii” for feminine. But the markers can change based on the last character of the word. For example, if a masculine noun ends with the vowel i or e, the marker ka changes into ga or ha respectively, and if a feminine noun ends with the consonant l, the feminine marker “ta” changes into “sha” and the “l” is deleted. The other rule is that all feminine words that end with the vowel o take the “da” gender marker.
Pluralization System of AF-Somali Nouns
There are different rules for changing singular Af-Somali nouns into plural, depending on the gender of the word. Most Af-Somali noun pluralization is inflectional, which means it does not change the grammatical word category, and most nouns become plural by simply taking suffixes. As described in Table 2.5, monosyllabic Af-Somali words can be pluralized by partial reduplication of their last consonant, with the vowel ‘a’ inserted between the doubled consonants. If a singular Af-Somali noun ends with a consonant such as b, d, n, l or r, the last consonant of the word is doubled, the vowel ‘o’ is added to make the word plural, and the gender changes to feminine. If the noun ends with s, q, c, f, x or i, the suffix ‘yo’ is added to the root word to form the plural. Some disyllabic singular nouns are changed into plural by adding the suffix ‘o’ and deleting the letter found before the last consonant, and their gender remains unchanged. In addition, nouns that end with the letter -e are changed into plural by adding the suffix -yaal. There are also some words derived from Arabic which become plural following Arabic pluralization. As a result, Af-Somali nouns are classified into seven declensions, as shown in Table 2.5, based on how they become plural and on the gender of the plural with respect to the singular. For example, if the singular word is masculine and changes into feminine when it becomes plural, that word is in declension one [1].
Table 2.5: Example of Af-Somali pluralization and declension formation
| Word | Gender | Number | Word | Gender | Number | English | Declension |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Miis | Masculine | Singular | Miis+as | Masculine | Plural | Tables | Dec-4 |
| Baal | Feminine | Singular | Baal+al | Feminine | Plural | Diagonal | Dec-4 |
| Erey | Masculine | Singular | Erey+yo | Masculine | Plural | Words | Dec-2 |
| Mindi | Feminine | Singular | Mindi+yo | Masculine | Plural | Knives | Dec-2 |
| Naag | Feminine | Singular | Naag+o | Masculine | Plural | Women | Dec-1 |
| Ilig | Masculine | Singular | Ilk+o | Masculine | Plural | Teeth | Dec-3 |
| Gabadh | Feminine | Singular | Gabdh+o | Masculine | Plural | Girls | Dec-3 |
| Dameer | Masculine | Singular | Dameer+ro | Feminine | Plural | Donkeys | Dec-5 |
| Hooyo | Feminine | Singular | Hooyo+oyin | Masculine | Plural | Mothers | Dec-6 |
| Sheeko | Feminine | Singular | Sheeko+oyin | Masculine | Plural | Stories | Dec-6 |
| Aabe | Masculine | Singular | Aabe+yaal | Feminine | Plural | Fathers | Dec-7 |
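The purely suffixing patterns of Table 2.5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the two-level rules implemented in this thesis: the function name pluralize is hypothetical, the declension number is assumed to be supplied by the lexicon rather than inferred from the word, and declension 3 (ilig/ilko, gabadh/gabdho) is left out because it also deletes a vowel and may change the stem consonant.

```python
def pluralize(noun, declension):
    """Apply the plural pattern that Table 2.5 lists for a declension.

    The declension number is assumed to come from the lexicon.
    """
    last = noun[-1]
    if declension == 1:
        return noun + "o"            # naag -> naago
    if declension == 2:
        return noun + "yo"           # mindi -> mindiyo
    if declension == 4:
        return noun + "a" + last     # reduplicated final consonant: miis -> miisas
    if declension == 5:
        return noun + last + "o"     # doubled final consonant: dameer -> dameerro
    if declension == 6:
        return noun + "oyin"         # hooyo -> hooyooyin
    if declension == 7:
        return noun + "yaal"         # aabe -> aabeyaal
    raise ValueError(f"declension {declension} is not covered by this sketch")


for word, dec in [("miis", 4), ("naag", 1), ("dameer", 5), ("aabe", 7)]:
    print(word, "->", pluralize(word, dec))
```

The sketch treats the final consonant as a single character, so nouns ending in the digraphs kh, sh or dh would need extra handling.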
2.4.2 AF-Somali Noun Determiners
Determiners are modifiers that add meaning to the noun by attaching to it as a suffix. They are classified into four types according to the meaning they add to the noun: articles (Qodob), demonstratives (Tilmaame), interrogatives (Weydimo) and possessives (Lahaansho).
Articles
Af-Somali articles take different forms: -ka and -kii for masculine nouns, and -ta and -tii for feminine nouns. If the person we are talking about is far from us, or the thing we are reporting is in the past, -ka/-ta is changed into -kii/-tii respectively. The form of the article also changes depending on the last letter of the noun to which it is attached. For example, if we take the noun “kabo” and add the article “ka”, then “ka” is changed into “ha” and the word becomes “kabaha”. Table 2.6 shows which article is attached to each noun and how it changes. As indicated in Table 2.6, the article marker -k changes into -g when it is suffixed to masculine nouns that end with -g, -w, -aa, -u, -y or -i, and the article -ka is reduced to -a when the masculine noun ends with -h, -x, -q, -c or -kh. In addition, the feminine article marker -ta can change into -da or -sha: -t becomes -d if it is suffixed to a noun that ends with -o, -d, -c, -x, -h, -y or ('), and the article -ta becomes -sha when it is suffixed to feminine nouns that end with -l, with the “l” deleted.
Table 2.6: Example of Af-Somali Articles
| Root word | Gender | Article | Formed word |
| --- | --- | --- | --- |
| Kabo | Masculine | ka | Kaba(ha) |
| Buug | Masculine | ka | Buug(ga) |
| Magac | Masculine | ka | Magac(a) |
| Maro | Feminine | ta | Mara(da) |
| Bac | Feminine | ta | Bac(da) |
| Ul | Feminine | ta | U(sha) |
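The article alternations described above can be sketched in Python. This is a minimal illustration, not the xfst rules of the thesis: the names DIGRAPHS, final_letter and add_article are hypothetical, nouns are assumed to be given in lowercase, and only the contexts listed in the text are covered. Two points are assumptions inferred from the examples rather than stated rules: the final -o or -e of the noun is written as -a before -ha and -da (kabo/kabaha, maro/marada), and the deletion of t after dh follows the remark made with Table 2.7.

```python
DIGRAPHS = ("kh", "sh", "dh")  # two-letter consonants of the Somali alphabet


def final_letter(word):
    """Return the last letter of a word, treating digraphs as one letter."""
    return word[-2:] if word.endswith(DIGRAPHS) else word[-1:]


def add_article(noun, gender):
    """Attach the basic article (-ka / -ta) to a lowercase noun.

    Hypothetical helper that follows only the contexts listed in the text.
    """
    last = final_letter(noun)
    if gender == "masculine":
        if last in {"h", "x", "q", "c", "kh"}:
            return noun + "a"          # k is dropped: magac -> magaca
        if last in {"g", "w", "u", "y", "i"} or noun.endswith("aa"):
            return noun + "ga"         # k -> g: buug -> buugga
        if last in {"e", "o"}:
            return noun[:-1] + "aha"   # assumed from kabo -> kabaha
        return noun + "ka"             # default: nin -> ninka
    if gender == "feminine":
        if last == "l":
            return noun[:-1] + "sha"   # l is deleted: ul -> usha
        if last == "dh":
            return noun + "a"          # t is deleted after dh: gabadh -> gabadha
        if last == "o":
            return noun[:-1] + "ada"   # assumed from maro -> marada
        if last in {"d", "c", "x", "h", "y", "'"}:
            return noun + "da"         # t -> d: bac -> bacda
        return noun + "ta"             # default: bir -> birta
    raise ValueError(f"unknown gender: {gender!r}")


for noun, gender in [("buug", "masculine"), ("kabo", "masculine"), ("ul", "feminine")]:
    print(noun, "->", add_article(noun, gender))
```

Each branch corresponds to one alternation context, which is the same information a two-level rule encodes as a surface change conditioned on the end of the stem.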
Demonstrative suffixes
Like the articles, demonstratives are suffixed to nouns to modify their meaning by indicating distance or location. Their choice depends on the relationship between the subject and object, or on the distance between the speaker and what is being talked about. In Af-Somali there are three different noun demonstratives, as described in Table 2.7, which indicate nearness (kan), farness (kaas) and to the left/right (keer) for masculine nouns, and nearness (tan), farness (taas) and to the left/right (teer) for feminine nouns.
Table 2.7: Af-Somali Demonstratives
| Word | Gender | Near | Farness | To left/right |
| --- | --- | --- | --- | --- |
| Gabadh | Feminine | tan | taas | teer |
| Gabadh | Feminine | Gabadhan | Gabadhaas | Gabadheer |
| Nin | Masculine | kan | kaas | keer |
| Nin | Masculine | Ninkan | Ninkaas | Ninkeer |
Table 2.7 also shows that whenever a suffix starting with t is added to a feminine noun whose last letter is “dh”, the t is deleted and only the remaining part of the suffix is attached.
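The demonstrative suffixes and the deletion of t after “dh” can be sketched in Python as below. The dictionary DEMONSTRATIVES, the keys "near", "far" and "side", and the function add_demonstrative are hypothetical names chosen for this illustration; other consonant alternations of the suffix are deliberately not modelled.

```python
# Demonstrative suffixes from Table 2.7; the distance keys are hypothetical labels.
DEMONSTRATIVES = {
    "masculine": {"near": "kan", "far": "kaas", "side": "keer"},
    "feminine": {"near": "tan", "far": "taas", "side": "teer"},
}


def add_demonstrative(noun, gender, distance):
    """Attach a demonstrative suffix, deleting its initial t after a final dh."""
    suffix = DEMONSTRATIVES[gender][distance]
    if suffix.startswith("t") and noun.endswith("dh"):
        suffix = suffix[1:]  # gabadh + taas -> gabadhaas
    return noun + suffix


print(add_demonstrative("nin", "masculine", "near"))
print(add_demonstrative("gabadh", "feminine", "far"))
```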
Possessive Suffixes
In Af-Somali, as in other languages, possessive suffixes are used to express ownership or possession. They are classified into masculine and feminine forms and vary with person and number, which gives 6 different possessives as indicated in Table 2.8.
Table 2.8: AF-Somali possessive
| Person | Masculine | Feminine | Root noun | Gender | Word |
| --- | --- | --- | --- | --- | --- |
| 1st.Sg | Kayga | Tayda | Buug | Masculine | Buug-gayga |
| 2nd.Sg | Kaaga | Taada | Gabadh | Feminine | Gabadh-aada |
| 3rd.Sg | Kiisa | Tiisa | Buug | Masculine | Buug-giisa |
| 3rd.Sg | Keeda | Teeda | Gabadh | Feminine | Gabadh-eeda |
| 1st.Pl | Keenna | Teenna | Nin | Masculine | Nin-keenna |
| 2nd.Pl | Kiinna | Tiinna | Wiil | Masculine | Wiil-kiinna |
| 3rd.Pl | Kooda | Tooda | Bac | Feminine | Bac-dooda |
Interrogative Suffixes
Interrogative suffixes are determiners that add a question-like meaning and, like other determiners, have masculine and feminine markers. The suffix -kee is used for masculine nouns and the suffix -tee for feminine nouns, as described in Table 2.9.
Table 2.9: Interrogative representation of Af-Somali
| Root noun | Gender | Interrogative suffix | Word |
| --- | --- | --- | --- |
| Dal | Masculine | kee | Dalkee |
| Sacad | Feminine | tee | Sacaddee |
| Meel | Feminine | tee | Meeshee |
2.4.3 Adjectives
Adjectives, in turn, do not belong to a clearly defined category in Af-Somali. Items such as yár ‘small’ and wéyn ‘big’ are best interpreted as stative verbs displaying a particular defective paradigm. Adjectives are inflectionally pluralized through reduplication. The reduplicated plural is formed by prefixing a copy of the first syllable to the stem. Only the second syllable bears the high tone. Besides this, adjectives can be marked for person and definiteness and can take tense markers. For example, the plural forms of adjectives such as cad, cusub and yar are shown in Table 2.10.
Table 2.10: Pluralization of adjectives
| Root adjective word | Number | Word | Number |
| --- | --- | --- | --- |
| Cad (white) | Sg | Cadcad | Plural |
| Cusub | Sg | Cuscusub | Plural |
| Yar | Sg | Yaryar | Plural |
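The reduplicated adjective plurals of Table 2.10 can be reproduced with a short Python sketch. It assumes, based only on the three examples (cad, cusub, yar), that the copied portion runs from the start of the word through the first consonant after the first vowel; the names _COPY and pluralize_adjective are hypothetical, tone is ignored, and digraph consonants and irregular adjectives are not handled.

```python
import re

# Copied portion: optional initial consonants, the first vowel (or long vowel),
# and at most one following consonant. Inferred from cad, cusub and yar.
_COPY = re.compile(r"[^aeiou]*[aeiou]+[^aeiou]?")


def pluralize_adjective(adjective):
    """Prefix a copy of the initial part of a lowercase adjective to itself."""
    match = _COPY.match(adjective)
    if match is None:
        raise ValueError(f"no vowel found in {adjective!r}")
    return match.group(0) + adjective


for adjective in ["cad", "cusub", "yar"]:
    print(adjective, "->", pluralize_adjective(adjective))
```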
2.4.4 The Verb
The verb is the most important part of speech in Af-Somali, and it can be inflectionally more complex than other parts of speech. A typical verb consists of a root plus a number of affixes. These include derivational affixes (Somali has a passivizing form which can only be applied to verbs that have a ‘causative’ argument, and a causative affix which adds such an argument) and a set of inflectional affixes which mark aspect, tense and agreement [25]. The verb has complex alternation patterns, and the basic building blocks of Af-Somali verbs are the root word, modifiers, person and conjugation. The most important of these to describe is the verb's conjugation, so we present some properties of conjugation with examples as follows. Conjugation shows the verb’s tense, aspect and mood. The agreement of person and tense produces 6 different forms of a word, as illustrated with examples in Table 2.11. The table also shows the agreement of person with tense and the person marker for each of the 6 forms.
Table 2.11: Example of person agreement with tenses
| Person | Root verb | Present (segmented) | Past (segmented) | Present verb | Past verb |
| --- | --- | --- | --- | --- | --- |
| 1st.Sg | Cun | Cun-0-aa | Cun-0-ay | Cunaa | Cunay |
| 2nd.Sg | Cun | Cun-t-aa | Cun-t-ay | Cuntaa | Cuntay |
| 3rd.Sg.masc | Cun | Cun-0-aa | Cun-0-ay | Cunaa | Cunay |
| 3rd.Sg.fem | Cun | Cun-t-aa | Cun-t-ay | Cuntaa | Cuntay |
| 1st.Pl | Cun | Cun-n-aa | Cun-n-ay | Cunnaa | Cunnay |
| 2nd.Pl | Cun | Cun-t-aan | Cun-t-een | Cuntaan | Cunteen |
As shown in Table 2.11, (0) indicates the persons 1st.Sg, 3rd.Sg.masc and 3rd.Pl, the suffix -t indicates 2nd.Sg, 3rd.Sg.fem and 2nd.Pl, and the suffix -n indicates 1st.Pl. These markers are followed by the suffixes -ay or -een in the conjugation of the past tense, and by the suffixes -aa/-aan when the verb is in the present tense. Af-Somali verbs are classified into five conjugation categories based on their imperative markers.
2.4.5 Classification of AF-Somali Verbs
Based on conjugation, Af-Somali verbs are classified into two broad categories: a large number of verbs that take only suffixes, and a small number of verbs that take both prefixes and suffixes. We first consider the suffix-only verbs, which are the most common in Af-Somali. These verbs are classified into five conjugations, known as the 1st, 2nd, 3rd, 4th and 5th conjugations. The 1st conjugation verbs are characterized by the absence of an imperative marker, and they are mostly monosyllabic words, as shown in Table 2.12.
Table 2.12: First conjugation representation of Af-Somali verbs
| Verb | Person | Tense | Imperative | The word |
|---|---|---|---|---|
| Cun | 0 | ay | 0 | cunay |
| Jab | 0 | ay | 0 | jabay |
| Qor | t | ay | 0 | qortay |
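The way a first conjugation form is assembled from root, person marker and tense suffix can be sketched as plain concatenation. This is only an illustrative sketch, not the analyzer developed in this thesis: the function name `conjugate`, the person labels and the two dictionaries are hypothetical, the assignment of -aan/-een to the 2nd and 3rd person plural is an assumption read off the conjugation tables, and no alternation rules are applied.

```python
# Person markers as described for Table 2.11; "" stands for the zero marker (0).
PERSON_MARKER = {
    "1sg": "", "2sg": "t", "3sg.masc": "", "3sg.fem": "t",
    "1pl": "n", "2pl": "t", "3pl": "",
}

# Tense suffixes. Assumption: 2nd and 3rd person plural take the long
# endings -aan / -een, all other persons take -aa / -ay.
TENSE_SUFFIX = {
    ("present", False): "aa", ("past", False): "ay",
    ("present", True): "aan", ("past", True): "een",
}

def conjugate(root, person, tense):
    """Concatenate root + person marker + tense suffix (no alternation rules)."""
    long_plural = person in ("2pl", "3pl")
    return root + PERSON_MARKER[person] + TENSE_SUFFIX[(tense, long_plural)]

print(conjugate("cun", "1sg", "past"))      # root + 0 + ay
print(conjugate("qor", "3sg.fem", "past"))  # root + t + ay
print(conjugate("cun", "2pl", "past"))      # root + t + een
```

A real analyzer would place rules between this concatenation and the surface form, since simple concatenation does not cover verbs whose stems or markers change in context.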
Secondly, the 2nd conjugation verbs are mostly formed from other verbs and are suffixed with the imperative marker “i”. For example, the verb “toos” is suffixed with “i” to become the 2nd conjugation verb “toosi”, as shown in Table 2.13.
Table 2.13: Second Af-Somali verb conjugation (toosi)
| Person/number | Imperative | Habitual present | Present continuous | Past | Past continuous |
|---|---|---|---|---|---|
| 1st.Sg | I | Toosiyaa | Toosinayaa | Toosiyay | Toosinayay |
| 2nd.Sg | I | Toosisaa | Toosinaysaa | Toosiyay | Toosinaysay |
| 3rd.Sg.masc | I | Toosiyaa | Toosinayaa | Toosiyay | Toosinayay |
| 3rd.Sg.fem | I | Toosisaa | Toosinaysaa | Toosisay | Toosinaysay |
| 1st.Pl | I | Toosinaa | Toosinaynaa | Toosinay | Toosinaynay |
| 2nd.Pl | I | Toosisaan | Toosinaysaan | Toosiseen | Toosinayseen |
| 3rd.Pl | I | Toosiyaan | Toosinayaan | Toosiyeen | Toosinayeen |
The 3rd conjugation verbs are characterized by the imperative marker “ee”. Some verbs of this conjugation are listed as examples in Table 2.14.
Table 2.14: Example of Af-Somali 3rd. conjugation representation
| The root verb | Imperative | Infinitive | The verb | In English |
|---|---|---|---|---|
| Dhab | ee | eyn | Dhabeyn | Make the truth |
| Ciid | ee | eyn | Ciideyn | Put the soil |
Lastly, the 4th conjugation verbs are characterized by their imperative marker “o”, which gives these verbs a different representation, and the 5th conjugation verbs are characterized by their imperative marker “so”. Table 2.15 gives an example of the 4th conjugation, showing its inflection for person, number and tense and how the agreement of person with number and tense yields seven different forms of the verb.
Table 2.15: Fourth Af-Somali verb conjugation representation
| Person/number | Imperative | Habitual present | Present continuous | Past | Past continuous |
|---|---|---|---|---|---|
| 1st.Sg | O (dhaqo) | Dhaqdaa | Dhaqanayaa | Dhaqday | Dhaqanayay |
| 2nd.Sg | O | Dhaqataa | Dhaqanaysaa | Dhaqatay | Dhaqanaysay |
| 3rd.masc | O | Dhaqdaa | Dhaqanayaa | Dhaqday | Dhaqanayay |
| 3rd.fem | O | Dhaqataa | Dhaqanaysaa | Dhaqatay | Dhaqanaysay |
| 1st.Pl | O | Dhaqannaa | Dhaqanaynaa | Dhaqannay | Dhaqanaynay |
| 2nd.Pl | O | Dhaqataan | Dhaqanaysaan | Dhaqateen | Dhaqanayseen |
| 3rd.Pl | O | Dhaqdaan | Dhaqanayaan | Dhaqdeen | Dhaqanayeen |

2.5 Derivational System of AF-Somali
Morphologically, Af-Somali words are inflectional like those of other Cushitic languages, but some words are derivational. The derived words in Af-Somali are mostly verbs and adjectives, which can be formed from words of other categories, and most adjectives are formed from verbs. Some nouns are also morphologically derived from other word classes in the process of word formation. Most verbs in Af-Somali can be changed into nouns by taking the suffix (a) and doubling the last consonant. For example, the verb “dil” can be changed into a noun by simply adding “aa”, becoming “dilaa”, and the verb “cun” is also changed into a noun by adding the character “o”, and the noun formed is “cunto”.
Verb morphology is slightly more complex; again, a typical verb consists of a root plus a number of affixes. These include derivational affixes (Somali includes a passivizing form which can only be applied to verbs which have a ‘causative’ argument, and a causative affix which adds such an argument) and a set of inflectional affixes which mark aspect, tense and agreement [25]. For example, the noun aadaan (prayer) becomes the verb “aadanay” (praying), and the noun iskaashato (cooperation) is changed into iskashi, which is a verb. Like other parts of speech, Af-Somali adjectives also have a derivational process. There are two sorts of adjectives: ‘basic adjectives’ (a small number), such as yár ‘small’ and wéyn ‘big’, and those formed from nouns and verbs by the addition of lexical suffixes, such as caan-sán ‘famous’ (cáan ‘fame’), wanaag-sán ‘good’ (wanáag ‘goodness’) and jar-án ‘chopped’ (jár ‘to break’). In addition, compounding creates a derived word from two different words, such as a verb and a noun or an adjective and a noun.
2.6 Approaches to Morphological Analysis
There are a number of approaches which are widely used in computational morphology. Some of these approaches are based on concepts in automata theory, probability, principle of analogy, and information theory. The computational morphological approaches are broadly categorized into rule-based and corpus-based approaches. 2.6.1 Corpus-based Approaches
Corpus-based approaches are statistical in nature and do not strictly follow an explicit theory of linguistics [32]. A suitable machine learning algorithm is used to train the system and collect the necessary information and features from the corpus. The knowledge acquired is then used to perform the morphological analysis task [32]. Based on the type of text corpora used, corpus-based approaches can be further categorized into supervised and unsupervised approaches. Supervised approaches use annotated text corpora, while unsupervised approaches use natural corpora such as those found in newspapers and books. These approaches need a huge corpus of words to train the algorithm. They are therefore difficult to apply to under-resourced languages like Somali and may not produce efficient, high-quality output. The most developed languages mostly use the machine learning approach, which requires large word corpora, electronic dictionaries, newspapers and other documents found on the Internet. These languages use this approach to overcome the overhead created by the rule-based approach; examples include English [30] and Arabic [13]. Limited research has been done in this area for local languages such as Amharic [22] and Ge’ez [34] using corpus-based approaches. Most local languages, however, use a rule-based approach, specifically two-level morphological analysis.
2.6.2 Rule-based Approach
The rule-based approach strictly follows an explicit theory of linguistics, that is, a theory of morphology laid down by an expert. Kazakov and Munandhar [32] stated that this approach makes it possible to incorporate sophisticated linguistic theories such as generative phonology into computational morphology. Because of their reliance on linguistic theories, systems developed using rule-based approaches are often efficient and produce better quality outputs [28]. There are different rule-based methods for developing a morphological analyzer for a language, among them the paradigm-based method and the finite state automata method. In the paradigm-based method, each word category of a particular language, such as nouns, verbs, adjectives, adverbs and postpositions, is classified into certain types of paradigms based on morphophonemic behavior, and a paradigm-based morphological compiler program is used to develop the morphological analyzer. The Finite State Automata (FSA) based method uses regular expressions and is used to accept or reject a string in a given language. In general, an FSA is used to study the behavior of a system composed of states, transitions and actions. An FSA starts in its initial state, and if it is in one of its final states when the input has been consumed, it accepts the input and stops.
Within computational morphology, a very significant advance came with the demonstration that phonological rules could be implemented as finite state transducers (FSTs) and that rule ordering could be dispensed with by using FSTs that relate the surface and lexical levels directly, so-called “two level” morphology (TLM). The same transducer can be switched from one that performs analysis (surface input to lexical output) to one that performs generation (lexical input to surface output) [32]. TLM is devised to handle morphological analysis and generation in a bi-directional way. The approach is based on two lexica (one for the underlying and the other for surface word forms) and a set of morphological rules. The rules establish whether a given sequence of characters at the surface level (as it appears in the text) can correspond to a sequence of symbols used to represent the morphemes in the lexicon. In other words, the rules map the two strings to each other. TLM is currently a very popular method in computational morphology [32].
The most common benefits of FSTs for NLP stem from several properties of finite-state devices: true representation, modularity, compactness, efficiency and reversibility. True representation means that the kind of phonological and morphological rules that are common in linguistic theories can be directly implemented as finite-state relations. The implementation of linguistically motivated rules in FSTs is therefore straightforward and direct. Modularity follows from the closure properties of regular languages and relations, which provide various means for combining regular expressions and support a variety of operations on the languages these expressions denote. For example, closure under union facilitates the separate development of two grammar fragments, which can then be directly combined in a single operation. The most useful operation under which transductions are closed is probably composition, which is the central vehicle for implementing replace rules. Compactness means that finite-state automata can be minimized, guaranteeing that for a given language an automaton with a minimal number of states can always be generated. Toolboxes can apply minimization either explicitly or implicitly to improve storage requirements. Efficiency means that when an automaton is deterministic, recognition is optimally efficient (linear in the length of the string to be recognized). Automata can always be determinized, and toolboxes can take advantage of this to improve time efficiency. Reversibility means that finite-state automata and transducers are inherently declarative: it is the application program that implements either recognition or generation. In particular, transducers can be used to map strings from the upper language to the lower language or vice versa with no changes in the underlying finite-state device [28].
2.7 Finite State Technology
Finite-state technology (FST) denotes the use of finite-state devices, such as automata and transducers, in natural language processing. Since the early works which demonstrated the applicability of this technology to linguistic representation, FST has been considered adequate for describing the phonological and morphological processes of the world’s languages [32]. In order to understand how to build the linguistic application, we first need to be acquainted with the basics of how a finite-state machine works.
2.7.1 Finite State Machines
A finite-state machine (FSM) is an abstract machine that implements a regular language. Regular languages can be described formally in a concise notation, through regular expressions. A finite-state machine is a network of states with one start state and one or more final states. Transitions between states are possible only if the required input is recognized. A path is a sequence of transitions over arcs to a particular state. In computational morphology, a path spells out a sequence of alphabet symbols equivalent to a word in natural language. A technology that uses finite-state networks in creating an application is therefore called a finite state technology. A finite state automaton, however, only accepts a word and checks whether it is a valid word of the language. It does not produce or generate any output.
2.7.2 Finite-state transducers
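As a point of reference before transducers are discussed, the accept/reject behaviour of a plain finite-state acceptor can be sketched as follows. This is a minimal illustration only: the helper names `make_acceptor` and `accepts` are hypothetical, the network is a simple trie built from a three-word list, and no minimization is performed.

```python
def make_acceptor(words):
    """Build a trie-shaped acceptor: state 0 is the start state."""
    transitions = {}   # (state, symbol) -> next state
    finals = set()     # states in which a whole word has been read
    next_state = 1
    for word in words:
        state = 0
        for symbol in word:
            key = (state, symbol)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals.add(state)
    return transitions, finals

def accepts(acceptor, word):
    """Follow one arc per symbol; accept only if we end in a final state."""
    transitions, finals = acceptor
    state = 0
    for symbol in word:
        state = transitions.get((state, symbol))
        if state is None:   # no arc for this symbol: reject
            return False
    return state in finals

network = make_acceptor(["cunay", "cuntay", "jabay"])
print(accepts(network, "cuntay"))  # a word stored in the network
print(accepts(network, "cun"))     # a prefix only, no final state reached
```

The acceptor answers only yes or no; it returns no analysis of the word, which is the limitation that transducers remove.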
So far, the analysis of words in a network has simply yielded one of two responses: either accept, indicating that the word is in the language of the network, or reject, indicating that the word is not in the language. While this can be valuable, as for instance in spell-checking, finite-state networks are capable of storing and returning much more interesting information [28]. Within computational morphology, a very significant advance came with the demonstration that phonological rules could be implemented as finite state transducers [11] and that rule ordering could be dispensed with by using FSTs that relate the surface and lexical levels directly [11], so-called “Two-level” morphology. A second important advance was the recognition by [11] that a cascade of composed FSTs could implement the two-level model. Finite-state techniques are probably the most prevalent approach employed by automatic morphology systems, as their simplicity and outstanding efficiency are unequaled. FSAs can be used to recognize particular patterns but do not, by themselves, allow for any analysis of word forms. Hence for morphology we use finite state transducers (FSTs), which allow the surface structure to be mapped into the list of morphemes. FSTs are useful for both analysis and generation, since the mapping is bidirectional [28].
2.7.3 Two Level Morphological Approach
The two-level morphology approach to morphological analysis is a language-independent general formalism for the analysis and generation of word forms [28]. Kimmo Koskenniemi introduced this approach in 1983. The generative phonology approach creates unnecessary intermediate levels and is also unidirectional, so Koskenniemi decided to eliminate the intermediate levels. This created a new approach which has only two levels, the lexical level and the surface level, hence the name Two-Level Morphology. This model has the added advantage of being bi-directional: both analysis and generation can be done using the same system, which was not possible with the earlier, unidirectional approaches. Two-level morphology depends heavily on finite state methods, which are well known and are often described as elegant [28]. The two-level approach has already been used successfully to develop comprehensive morphological analyzers for Swahili (a Bantu language), Amharic, Afan Oromo and Tigrinya. Table 2.16 shows the two-level representation of the Af-Somali words tagay (he went) and waddooyin (roads). The surface level is the inflected word form, and the lexical level defines the stem plus a set of morphological feature tags relating to the word.
Table 2.16: Example of Af-Somali two level representation
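| Level | Word | Word class | Inflectional type | Person | Generated word |
|---|---|---|---|---|---|
| Lexical level | Tag (go) | Verb | past | 3rd.per.Sg.masc | tagay |
| Surface level | Tag | 0 | ay | 0 | tagay |
| Lexical level | Waddo | Noun | Pl | - | waddooyin |
| Surface level | Waddo | 0 | oyin | - | waddooyin |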
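The two levels of Table 2.16 can be sketched as paired strings that are looked up in either direction. This is a hedged illustration, not the transducer built in this work: the tag spellings (`+V`, `+Past`, `+3SgMasc`, `+N`, `+Pl`) and the names `ENTRIES`, `surface_of`, `analyze` and `generate` are assumptions made for the example, and a zero morph is written as an empty string.

```python
# Each entry pairs a lexical string (stem plus tags) with its surface morphs.
# Tag spellings are hypothetical; "" represents the zero morph (0).
ENTRIES = [
    {"lexical": "tag+V+Past+3SgMasc", "morphs": ["tag", "", "ay"]},
    {"lexical": "waddo+N+Pl", "morphs": ["waddo", "oyin"]},
]

def surface_of(entry):
    """The surface form is the concatenation of the surface morphs."""
    return "".join(entry["morphs"])

def analyze(surface):
    """Surface level to lexical level."""
    return [e["lexical"] for e in ENTRIES if surface_of(e) == surface]

def generate(lexical):
    """Lexical level to surface level, using the same entries."""
    return [surface_of(e) for e in ENTRIES if e["lexical"] == lexical]

print(analyze("tagay"))
print(generate("waddo+N+Pl"))
```

Both functions read the same table, which mirrors the bi-directionality of the two-level model: nothing in the stored pairs is specific to analysis or to generation.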
2.7.4 The Xerox Finite State Framework
The Xerox research institute has developed a set of finite-state tools which provide a means of implementing two-level morphologies. The tools are natural language independent and have been used to implement morphologies for many of the major languages, such as English, Spanish, French, German and Arabic, as well as for Afaraf, Afan Oromo, Amharic and others. Xerox finite state technology (XFST) is a programming language for regular expressions, which can be compiled into finite state networks, and is used here for the analysis of Af-Somali morphology. It comes bundled with a set of tools for compiling and working with FSTs. XFST includes two components known as lexc and xfst. lexc is a compiler for lexicons written in the lexc language, which is specifically designed for handling morphotactics (the syntax of the morphemes) in natural languages. xfst is the core tool, providing an interface to the finite state calculus for building, accessing and manipulating finite state networks, together with a compiler for regular expressions and replacement rules, which are essential for this work.
Lexicon Compiler
Lexicon compiler (lexc) is the finite-state tool developed by Xerox for defining two-level lexicons. lexc is just one of several ways to specify finite-state transducers, but it is especially designed to facilitate the work of the lexicographer [28]. Lexicons and morphotactic information are encoded in the lexc language, which is a kind of right-recursive phrase-structure grammar, and are compiled into finite-state transducers as shown in Figure 2-1. Finite-state transducers (FSTs) are data structures that encode regular relations [28], which are mappings between two regular languages. For convenience, we can visualize a finite-state relation as having an upper-side regular language and a lower-side regular language, where each string in one language is related to one or more strings in the other language. By convention, the upper-side or analysis strings of an FST compiled from a lexc description consist of underlying morphemes (strings of phonemes and morphophonemes) and multi-character symbol tags such as +Noun, +Verb, +Adj (adjective), +Conj (conjugation), +ImpeV (imperative verb), +Masc (masculine), +Fem (feminine), +Sg (singular) and +Pl (plural) that identify the morphemes [3]. lexc accepts a text file containing a user-defined lexicon encoded using the following syntax.
Lexical-item Continuation-class;
The lexical item is usually the unmarked form of the word (the root or headword given in a dictionary). In the context of this work the lexical item is the stem (the root in most cases) to which inflectional affixes are attached, i.e. a free morpheme. The continuation class can be a pointer to another lexicon or it can be the end-of-string marker. The example in Figure 2-1 shows two entries for ‘tag (go)’, one of which is followed by the end-of-string marker ‘#’ and the second of which points to the continuation class past Tense, where the aspect form of the word will be defined.
Figure 2-1: Example of two level representation of Af-Somali
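The continuation-class mechanism described above can be modelled in a few lines of Python. This is a sketch of the idea only and not lexc syntax: the lexicon names `Root` and `PastTense`, the tag spellings `+Verb` and `+Past`, the placement of tags after the stem, and the helper `expand` are all assumptions made for illustration.

```python
# Each lexicon lists entries of the form (upper string, lower string, continuation).
# "#" is the end-of-string marker. Names and tags are hypothetical.
LEXICONS = {
    "Root": [
        ("tag+Verb", "tag", "#"),          # bare stem, ends here
        ("tag+Verb", "tag", "PastTense"),  # stem that continues to a suffix lexicon
    ],
    "PastTense": [
        ("+Past", "ay", "#"),
    ],
}

def expand(lexicon="Root"):
    """Yield every (lexical, surface) pair reachable from the given lexicon."""
    if lexicon == "#":   # end of string: nothing more to append
        yield ("", "")
        return
    for upper, lower, continuation in LEXICONS[lexicon]:
        for upper_rest, lower_rest in expand(continuation):
            yield (upper + upper_rest, lower + lower_rest)

PAIRS = sorted(expand())
for lexical, surface in PAIRS:
    print(lexical, ":", surface)
```

Following continuation classes until the end-of-string marker is reached enumerates exactly the word forms the lexicon licenses, which is how morphotactic order is enforced.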
We make use of the two-level representation to encode valuable morphological information about the words, as the example in Figure 2-1 shows. The symbols to the left of the colon represent the lexical level ‘Verb+tag’, and the symbols to the right of the colon represent the surface form ‘tag’.
Xerox Finite State Technology Interface
The xfst part of this framework is mainly concerned with realization, i.e. surface forms, and with phonological alternation rules. This component takes as input the output of the lexc transducer (the lexical grammar), which has stems with grammatical features labeled with tags, and passes it through additional rules to obtain acceptable surface forms. The xfst component compiles the lexc grammar and the rules into FSTs, using lexc files and rule files respectively. Figure 2-2 illustrates the components of a morphological analyzer built with finite state transducers, where the .o. operator represents the composition operation.
Figure 2-2: Creation of a lexical transducer
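The composition of a lexicon transducer with a rule transducer can be illustrated with finite relations represented as Python sets of string pairs. This is a simplified sketch under stated assumptions: the intermediate forms with a `^` morpheme boundary, the single boundary-deletion rule and the names `compose`, `LEXICON`, `RULES`, `apply_down` and `apply_up` are hypothetical and stand in for the compiled networks.

```python
def compose(upper, lower):
    """Relation composition: (a, c) whenever (a, b) is in upper and (b, c) in lower."""
    return {(a, c) for (a, b) in upper for (b2, c) in lower if b == b2}

# Hypothetical lexicon relation: lexical strings paired with intermediate
# forms in which "^" marks a morpheme boundary.
LEXICON = {
    ("tag+V+Past", "tag^ay"),
    ("waddo+N+Pl", "waddo^oyin"),
}

# Hypothetical rule relation: the only rule here deletes the boundary symbol.
RULES = {(mid, mid.replace("^", "")) for (_, mid) in LEXICON}

# Composing the two hides the intermediate level entirely.
LEXICAL_TRANSDUCER = compose(LEXICON, RULES)

def apply_down(lexical):
    """Generation: lexical string to surface forms."""
    return sorted(s for (l, s) in LEXICAL_TRANSDUCER if l == lexical)

def apply_up(surface):
    """Analysis: surface form to lexical strings."""
    return sorted(l for (l, s) in LEXICAL_TRANSDUCER if s == surface)

print(apply_down("waddo+N+Pl"))
print(apply_up("tagay"))
```

After composition the intermediate strings no longer appear in the result, so the single composed relation relates lexical and surface strings directly.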
2.8 Summary
In this chapter, we introduced background information on Af-Somali, its morphology and its most important parts of speech. We also described finite state technology, which has been successfully applied to computational morphology. A regular expression can be compiled into a finite state network, and the network encodes the same regular language that the expression denotes. Complex finite state networks can be built from smaller networks using various mathematical operations such as union, concatenation, composition, complementation, subtraction and intersection.
Chapter 3 : Related work
3.1 Introduction
In this chapter, we present systems developed for computational morphological analysis of different languages of the world and look at the approaches used to develop them. Specifically, we look in detail at the rule-based approach of finite state technology as used in morphological analyzers for Ethiopian and Cushitic languages, which are related to Af-Somali. Creating an automatic morphological analyzer/generator is just one step in starting natural language processing for any language; but especially for minority, emerging or generally lesser-studied languages, it is often a practical and extremely valuable first step, making use of corpora, lexicons, morphological grammars and phonological rules already produced by linguists and descriptive linguists [6].
3.2 Morphological Analyzer for European Languages
Cagri [17] developed TRmorph, a two-level morphological analyzer for Turkish. The system is implemented entirely with the freely available Stuttgart finite state transducer tools (SFST). As Cagri [17] presented, SFST is a freely available finite state tool set particularly aimed at implementing morphological analyzers. The tool uses a simple specification language mainly based on regular expressions, with the addition of the well-known two-level operators that are particularly useful in implementing phonological (or orthographic) alternations. TRmorph was analyzed and evaluated with real-world data during its development, and the system has been tested on two relatively large corpora, the METU corpus and Turkish Wikipedia. According to Cagri [17], the same process was repeated for words that were analyzed successfully, without errors but with some ambiguous analyses.
Elaine [18] also developed a morphological analyzer for the Irish language. The system was developed using a finite-state two-level description with the Xerox Finite-State Tools. The system encodes the inflectional morphology of all inflected parts of speech in modern Irish; the morphotactics of stems and affixes are encoded in the lexicon, and word mutations are implemented as a series of replace rules encoded as regular expressions. A major advantage that Elaine [18] gets from finite-state two-level implementations of morphology is their inherent bi-directionality: the same system is used for both analysis and generation of word forms in the language. The system, designed for broad coverage of the language, is evaluated against the most frequently used words in a corpus of contemporary Irish texts. Finally, Elaine [18] suggests including derivational morphology and dialectal or historical word forms, which the system did not implement. In general, morphological analyzer systems can be used as a component in many NLP applications such as spelling checkers/correctors, stemmers and text-to-speech synthesizers [18].
In addition, Xuri [30] developed an English morphological analyzer using a machine learning approach. The system consists of two closely related components: morphological rule learning and morphological analysis. As Xuri [30] presented, unsupervised learning has been employed to obtain a set of affix transformational rules, and the experiment presented shows that the analyzer has a satisfactory performance. However, as stated in [30], problems remain, and the most difficult is combinatory ambiguity. This shows that a larger context, such as part of speech or the context between words, is needed for a correct analysis of these words. Machine learning approaches thus mostly require a corpus with a huge word list for training, and the analyses they give do not exactly follow the linguistic rules of the language.
3.3 Morphological Analyzer for Asian Languages
Gulshat and Ilyas [19] developed a rule-based morphological analyzer and a morphological disambiguator for the Kazakh language. Their work gives the implementation details of a rule-based morphological analyzer for Kazakh, which is an agglutinative language. In the implementation of the morphological analyzer, the alternation and morphotactic rules are represented by two-level morphology rules, and the Foma finite state compiler is employed. As Gulshat and Ilyas [19] have presented, the morphotactic rules and possible morphemes are defined in the lexicon file, the alternation rules of the system are defined separately, and the rules are composed with the lexicon file in a Foma file. The system was tested and evaluated, and it represents a beginning of work on the development of a morphological analyzer for Kazakh. The system works in two directions, at the lexical and the surface level. Due to the ambiguities in the language there is no one-to-one mapping between surface and lexical forms of words, and the system can produce more than one result.
Kenneth [20] also developed a system for morphological analysis and generation of the Arabic language. The system uses the Xerox finite state transducer toolkit for its implementation. Kenneth [20] described that the lexicons and morphotactic information are encoded in the lexc language, which is a kind of right-recursive phrase-structure grammar, and are compiled into finite-state transducers, and that alternation rules to perform deletion, epenthesis, assimilation and metathesis are written in the twolc language and/or in a notation known as REPLACE rules. The system, which includes about 4930 roots, was tested and evaluated with encouraging performance. For any language, having a morphological analyzer is one step forward in language technology.
3.4 Morphological Analyzer for Ethiopian Languages
Michael [22] developed a morphological analyzer called HornMorpho for three Ethiopian languages: Amharic, Afaan Oromo and Tigrinya. The system uses finite state transducers integrated with the Python programming language for the implementation, with a separate finite state transducer for each language. In addition, the system was evaluated using a web crawler developed by Biniam Gebremicheal, and Michael Gasser [22] stated that, although more testing is called for, this evaluation suggests excellent coverage of Amharic and Tigrinya verbs for which the roots are known. Although Oromo, a Cushitic language, does not exhibit the root+template morphology that is typical of Semitic languages, it is also convenient to handle its morphology using the same technique, because there are some long-distance dependencies and because it is useful to have the grammatical output that this approach yields for analysis. For Amharic, the system is apparently able to analyze at least the great majority of nouns and adjectives. The system treats all Amharic words other than verbs, nouns and adjectives as unanalyzed lexemes. However, the tool is less suitable for Afaan Oromo, because analysis of the language is complicated by the great variation in the use of double consonants and vowels by Oromo writers [22].
The other closely related language is Afaraf. Ali Mohamed [2] developed the first morphological analyzer for this language, using a finite state transducer. As Ali described, 312 tokens were manually annotated: 200 verbal (100 consonant-initial and 100 vowel-initial), 80 nominal and 32 adjectival words from three popular Afar magazines published in Ethiopia and Djibouti. Of these, 192 verbal, 75 nominal and 28 adjectival words were correctly analyzed, and the results were evaluated by a human reader familiar with the language. An output was considered correct only if it found all legal combinations of roots and grammatical structure for a given word form and included no incorrect roots or structures [2].
3.5 Summary
Limited research has been conducted on developing morphological analyzers for Cushitic languages such as Afaan Oromo [22] and Afaraf [2], and the analyzers for both languages used a rule-based approach with finite state transducers. To the best of our knowledge, however, no research has been conducted so far in the area of automatic morphological analysis for Af-Somali. The absence of morphological analysis systems limits efforts to make computers work comfortably with Somali.
Chapter 4 : Design of Af-Somali Morphological Analyzer
4.1 Introduction
This chapter presents the design of the Af-Somali morphological categories and phonological rules as a computational model using the Xerox finite state toolkit. It presents the general architecture of the lexical FSTs for Af-Somali morphological analysis and the morphotactics of the language, that is, how the morphemes co-occur. It also shows the morphotactics for each word class separately in the lexc formalism, and the alternation rules using the xfst interface. The main objective in the design of the morphological analyzer is to construct a network which accepts all and only the valid Somali words and delivers the right analysis. In this chapter, we therefore present a detailed overview of the design of the morphological analyzer system and its components.
4.2 General Architecture of AF-Somali Morphological Analyzer
The construction of the morphological analyzer using finite state transducers is broken down into two large components: the lexicon/morphotactics part and the phonological or alternation rules part. The morphotactics of the language, which describe which stems and affixes can co-occur and in what order, are captured in the lexicon, while phonological and morphophonological alternations between underlying forms and surface spoken or written forms are implemented using alternation rules. A word to be analyzed follows the path lexicon→morphotactic rules→alternation rules→surface. Before the result of the morphological analyzer appears at the surface, the word follows the lexicon path to determine its actual morphemes. After leaving the lexicon, the word is analyzed by the morphotactic and morphophonemic rules. Only after this process is finished is the result of the morphological analysis for that word delivered, as shown in Figure 4-1.
Figure 4-1: Af-Somali morphological analyzer architecture design
Other common applications of finite-state techniques include handling words whose roots or stems are not found in the lexicon by means of guessers, in which the lexical component is replaced by a phonotactic component characterizing the possible shapes of roots or stems. A guesser recognizes words that are not found in the lexicon, which is needed because collecting all the words of a language is impossible or too time consuming.
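The guesser idea can be sketched with a regular expression that stands in for the phonotactic component. Everything specific in this sketch is an assumption made for illustration: the consonant and vowel inventories, the (C)V(V)(C) syllable shape, the three known roots, the restriction to the past suffix -ay, and the names `STEM_SHAPE`, `KNOWN_ROOTS` and `analyze_past`.

```python
import re

# Hypothetical phonotactics: optional consonant, one or two vowels,
# optional consonant, repeated one or more times.
CONSONANT = r"(?:dh|kh|sh|[bcdfghjklmnqrstwxy'])"
VOWEL = r"[aeiou]"
SYLLABLE = rf"{CONSONANT}?{VOWEL}{{1,2}}{CONSONANT}?"
STEM_SHAPE = re.compile(rf"(?:{SYLLABLE})+")

KNOWN_ROOTS = {"cun", "jab", "qor"}   # stand-in for the real lexicon
PAST = "ay"

def analyze_past(word):
    """Analyze a past form; fall back to the guesser for unknown stems."""
    if not word.endswith(PAST):
        return None
    stem = word[: -len(PAST)]
    if stem in KNOWN_ROOTS:
        return stem + "+V+Past"
    if STEM_SHAPE.fullmatch(stem):          # plausible but unlisted stem
        return stem + "+V+Past+Guess"
    return None

print(analyze_past("cunay"))   # stem is in the lexicon
print(analyze_past("tagay"))   # stem is not listed, shape is plausible
```

Marking guessed analyses with a separate tag keeps them distinguishable from analyses backed by a lexicon entry, so they can be reviewed and added to the lexicon later.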
4.2.1 Lexicon/ Morph-tactics
The design of the tags is very important in the development of morphological analyzers, since the tags deliver the linguistic information carried by the word being analyzed. The morphological analyses of Somali word forms are presented in this system in terms of the symbols listed in Table 4.1.
Table 4.1: Tags of AF-Somali grammatical information
No.   Grammatical information   Tags
1     POS                       +N(noun), +V(verb), +Adj(adjective)
2     Number                    +Sg(singular), +pl(plural)
3     Definiteness              +def(definite), +indef(indefinite)
4     Gender                    +fem(feminine), +masc(masculine)
5     Tenses                    +pres(present tense), +paste(past tense), +pres.conti(present continuous tense), +paste.conti(past continuous tense)
6     Imperative                +imp(imperative)
7     Demonstratives            +close, +far, +near
8     Possessives               1st.Sg, 2nd.Sg, 3rd.masc, 3rd.fem, 1st.pl, 2nd.pl, 3rd.pl
9     Interrogatives            +inter(interrogative)
10    Infinitive                +inf(infinitive)
After the various affixes in the morphology were identified, the order in which they attach to verbal, nominal and adjectival stems was specified in the lexicon database.
The lexicon component is a transducer that accepts as input only valid Somali stems/roots followed by a legal sequence of tags, and produces as output an intermediate form in which the tags are replaced by the morphemes they correspond to. Within the lexicon, stems are assigned to separate classes depending on the inflection they require. Each stem class has an associated continuation class where morphological tags and affixes are concatenated to the stem. Internal modifications (ablaut) to stems have also been implemented in the lexicon. The part that accomplishes this, the lexicon transducer, is written in a formalism called lexc, which is well suited to lexicon construction and to expressing morph-tactics. For example, in the analyzer about to be constructed, the lexicon component FST performs the mappings shown in Table 4.2.
Table 4.2: Mappings of root words and their morphemes
                Word      Verb   Imperative   Infinitive
Lexical level   Caddeyn   +V     +imp         +inf
Surface level   caddeyn   +0     ee           eyn
All root words and morph-tactic rules were entered into the lexicon database, and all spelling rules were entered into the rules database. Separate FSTs were created for the lexicon and the rules, and then combined into one big FST by applying the FST composition operation. For each word class we therefore created a separate lexicon and set of alternation rules, described in the following sections.
4.2.2 Alternation Rules
Having accomplished the first part of the grammar construction, we now turn to the alternation rules component. The idea is to construct a set of ordered rule transducers that modify the intermediate forms output by the lexicon component. At the very least we need to remove the ^ symbol, which marks morpheme boundaries, before we produce valid surface forms. The role of the alternation rules is to modify the output of the lexicon transducer according to phonological and morpho-phonological rules. For the example in Table 4.2, we have seen that a root of the Af-Somali verb3 word class, cadd, concatenated with the imperative ee and the infinitive marker eyn gives caddeyn (clarifying). However, when the infinitive marker eyn is suffixed to a double vowel (ee), the last vowel of the double vowel, e, is replaced with the character y. One way to describe the formation of the correct verb3 form is to always represent the infinitive suffix as the morpheme eyn, as we have done, and then subject these word forms to alternation rules that eliminate the final double vowel and leave only the infinitive suffix. This, among others, is the task of the alternation rules component: to produce the valid surface forms from the intermediate forms output by the lexicon transducer. Since alternation rule FSTs that are conditioned by their environment are very difficult to construct by hand, we use the replacement rules formalism in xfst to compile the necessary rules into FSTs. The rule FSTs are then combined using the regular expression composition operator (.o.). Somali has several phonological alternations involving reduplication, lenition, vowel harmony and tone. In the following sections we describe the design of the alternation rules in more detail and illustrate them with examples.
4.3 The Design of AF-Somali Part-Of-Speech Lexicon and Alternation Rules
As described in Chapter 2, a number of approaches have been used to develop morphological analyzers for many languages; for this thesis work we have chosen the rule-based approach, using finite state transducer technology with the Xerox finite state toolkit. As mentioned in the previous section, the rule-based approach requires two components: the lexicon and the alternation rules of the language. For the development of the Af-Somali morphological analyzer we have therefore created a separate lexicon for the morph-tactics of each of the most important Af-Somali parts of speech (verbs, nouns and adjectives), and the rules are captured with the xfst tool.
4.3.1 AF-Somali Verb Lexicon Design
Verbs in Af-Somali express actions and states. They agree in person, number and gender. We classified the verbs into 5 interrelated groups based on their imperative markers. Their representation and encoding in the finite state transducer lexc formalism, written in Notepad, is shown in Figure 4-2.
As mentioned above, in developing the Af-Somali verb lexicon we classified the verbs into five groups known as V1, V2, V3, V4 and V5; Figure 4-2 illustrates the V1 verbs. The figure also shows that there is a lexicon called verbs which contains the five sub-lexicons v1, v2, v3, v4 and v5, which in turn lead to a sub-lexicon called v_suffixing; the detailed description of the lexicon is found in Appendix-B. The v_suffixing sub-lexicon contains all the suffixes attached to the root verbs and is defined as a separate lexicon, as shown in Figure 4-2. In this lexicon, we present the morphemes that go with the root verbs and the order in which they co-occur with the verbs.
Figure 4-2: Af-Somali verb lexicon
In addition, the development of a morphological analyzer requires building finite state networks which show how the morphemes and the root word can co-occur. We have therefore presented the Af-Somali verb finite state networks, which show the morphemes, the root verb and their order, in Figure 4-3. In this process the states are described following the conventions of the Xerox finite state tools, starting from the root verb until the word ends. In Figure 4-3, the circles represent states, the arrows are labeled with the tags, and a double circle indicates a final state.
Figure 4-3: Af-Somali verbs finite state networks
Generally, we have described the word root/stem lexicon and its morphotactics with examples, as shown in Table 4.3. For example, the morphotactics of the Af-Somali second subgroup of verbs (V2) are illustrated in Figure 4-4, which also presents the finite state network for the verbs “toosi” and “caddee”, showing how verbs of the second and third groups are generated and the order in which their morphemes co-occur.
Table 4.3: An example of Af-Somali verb morphotactics
Lexical level   Toos   +V   +imp   +Sg   +inf   +pers   +paste   The word
Surface level   toos   0    i      0     in     0       ay       toosinay
Figure 4-4: Example representation of Af-Somali second and third group verb FSN
4.3.2 Alternation Rules of AF-Somali Verbs
Af-Somali has a number of morpho-phonemic alternations that a morphological analyzer has to consider. These alternations depend on the phonological context, where the features of the individual morphemes in the context affect the process. The alternation rules of Af-Somali are defined in an xfst file, where they are composed with the lexicon. Af-Somali has several phonological alternations involving reduplication, lenition, vowel harmony and tone.
In order to construct a finite state transducer for the alternation rules, we first defined the Af-Somali alphabet: the consonants ‘, b, t, j, x, kh, d, r, s, sh, dh, c, g, f, q, k, l, m, n, w, h, y, their uppercase forms ‘, B, T, J, X, KH, D, R, S, SH, DH, C, G, F, Q, K, L, M, N, W, H, Y, and the five vowels a, e, i, o, u. Af-Somali also has five long vowels: aa, ee, ii, oo, uu. Some vowels in certain words are dropped if a suffix starting with a vowel is attached; a detailed description of the Af-Somali alternations is presented in Appendix-A.
Figure 4-5: Af-Somali verbs alternation rules
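The cascade described in Section 4.2.2 (lexicon output, then ordered alternation rules, then the surface form) can be mimicked in a few lines of plain Python. This is only a sketch under stated assumptions, not the xfst implementation: the toy `LEXICON` entry, the analysis string `cadd+V+imp+inf`, and the function names are hypothetical and serve only to illustrate how the intermediate form with `^` boundaries becomes caddeyn.

```python
import re

# Hypothetical toy lexicon: analysis string -> intermediate form,
# with "^" marking morpheme boundaries (illustrative entry only).
LEXICON = {
    "cadd+V+imp+inf": "cadd^ee^eyn",
}


def drop_final_ee(form: str) -> str:
    """Delete the double vowel ee when the infinitive suffix eyn follows."""
    return re.sub(r"ee(?=\^eyn)", "", form)


def remove_boundaries(form: str) -> str:
    """Remove the ^ morpheme-boundary symbol before producing the surface form."""
    return form.replace("^", "")


# Ordered rule cascade: each rule rewrites the output of the previous one.
RULES = (drop_final_ee, remove_boundaries)


def generate(analysis: str) -> str:
    """Look up the intermediate form and apply the rules in order."""
    form = LEXICON[analysis]
    for rule in RULES:
        form = rule(form)
    return form


print(generate("cadd+V+imp+inf"))
```

Applying the rules one after another plays the role that composition plays for the rule transducers; the order matters, since the boundary symbol must still be present when the vowel rule looks for its context.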
For example, caddee is an imperative verb; if we suffix it with the infinitive eyn, one of the two final e's of the imperative is replaced with y, as shown in Figure 4-5.
Table 4.4: Realization of l as sh when suffixed with t
The root   English   Person   Past   The verb   Alternation
Maqal      Listen    T        Ay     Maqashay   l->sh
Hadal      Talk      T        Ay     Hadashay   l->sh
Dil        Kill      T        Ay     Dishay     l->sh
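The l->sh alternation in Table 4.4 can be sketched as a single context-dependent replacement. The following plain-Python illustration is an assumption-laden sketch, not the xfst rule of Figure 4-6: it assumes intermediate forms of the shape root^t^ay with `^` as the boundary symbol, and it assumes that the person marker t is absorbed into sh, as the surface forms in the table suggest. The function names are hypothetical.

```python
def apply_l_to_sh(form: str) -> str:
    """Rewrite a root-final l plus the person marker t as sh."""
    return form.replace("l^t", "sh")


def to_surface(form: str) -> str:
    """Apply the l->sh rule, then remove the ^ boundary symbols."""
    return apply_l_to_sh(form).replace("^", "")


for intermediate in ("maqal^t^ay", "hadal^t^ay", "dil^t^ay"):
    print(intermediate, "->", to_surface(intermediate))
```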
Partial ablaut occurs in verbal infinitives with almost any word of the pattern CaC: when the infinitive ending is appended, a vowel is raised. It also occurs around person suffixes and tense endings. For example, tag ‘go’ takes the infinitive marker i and becomes tagi ‘to go’, but when we add the 2nd.PL.paste ending een the verb becomes tageen ‘they went’, which means that i is replaced with e. In Af-Somali verbs we also have to consider the replacement of l with sh: when the 3rd.Sg.masc marker t is added to a verb ending in l, the l is realized as sh, as represented in Figure 4-6 and exemplified in Table 4.4.
Figure 4-6: Alternation rule representation with xfst
The realization of the person morphemes (personal suffixes) on verbs is somewhat complex and depends mostly on the declension type and on whether or not the suffix is preceded by the progressive. The realization of these suffixes is currently handled entirely in xfst, as shown in Figure 4-7.
Figure 4-7: person morpheme realization
4.3.3 Noun Lexicon Design
Nouns in Somali denote things, and we have developed a separate lexicon for them, Nounlex.lexc, using lexc. Nouns have separate paradigms depending on their morpho-phonological behavior, and are split into subgroups which correspond to pluralization patterns. Hence the Af-Somali noun lexicon in this study is classified into seven declensions based on pluralization pattern. Nominals marked for gender undergo gender polarity changes in the plural. We want to mark +Masc and +Fem so that disambiguation is easier, but the gender of the lemma is not predictable from a given plural form. To solve this, we created a lexicon database which records the gender of each noun. Nominals are also affixed with the demonstrative markers aas, eer and an. We have therefore defined a root lexicon known as noun, which in turn contains seven sub-lexicons, one per declension, whose entries are suffixed with the morphemes of the Af-Somali nouns; Figure 4-8 shows the first declension.
Figure 4-8: Af-Somali noun lexicon
In addition, there is a separate lexicon which includes the suffix tags and the order in which these suffixes co-occur with the root nouns, as illustrated in Figure 4-9. The general co-occurrence of the root noun with the morphemes is shown using finite state networks, which show the states through which the transducer passes. This figure shows only the first declension, known as D1_f, which contains feminine nouns; the detailed description of the noun lexicon is given in Appendix-C.
Figure 4-9: Af-Somali noun suffixes
In general, the morphemes attached to the root nouns are number (Sg, Pl), definiteness (def, indef), interrogatives (inter), possessives and demonstratives, as presented in Figure 4-10, which is the finite state network of the Af-Somali nouns.
Figure 4-10: Af-Somali noun finite state networks
For example, the morph-tactics of an Af-Somali feminine noun of declension 2, as found in the finite state networks above, are described in Table 4.5.
Table 4.5: Example of noun declension 2 morphotactics
Lexical level   Mindi   D2_F   +Pl   +def   +inter   The noun
Surface level   Mindi   0      Yo    Ha     ee       mindiyahee
4.3.4 Alternation Rules of AF-Somali Nouns
Generally, to develop and use a lexicon and alternation rules with the Xerox finite state toolkit, we have to define the characters used in the language. In the following sections we therefore define the variables of Af-Somali and the rules used to implement the transducer. In declension 5, some consonants are doubled when the noun is made plural; this process is captured by the alternation rule component as shown in Figure 4-11, and a detailed description of the alternation rules is presented in Appendix-A.
Figure 4-11: Af-Somali noun alternation rules
The other rule that has to be considered in xfst is the deletion of a segment when it follows a back consonant (which is not itself). For example, the Af-Somali noun magac shows this property when it is suffixed with the definite marker ‘ka. AF-Somali has two kinds of reduplication: partial and full. Reduplication is typically a strategy for marking the plural of nouns and adjectives in some declensions, but it also appears in verbs as a derivational process. The inflectional processes are quite productive, but the derivational processes are not as productive. Partial reduplication occurs in the 4th declension of nouns, but a subtype of these 4th declension nouns also has full reduplication. Partial reduplication includes epenthesis, and in nouns it is suffixing. The template is also slightly different; an example is given in Table 4.6.
Table 4.6: Partial reduplication of nouns
Root noun   English           Suffixing   Number   The noun
Af          Mouth, language   Af          Pl       Afaf
Qoys        Family            As          Pl       Qoysas
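The suffixing pattern in Table 4.6 (the vowel a followed by a copy of the final consonant) can be written as a one-line function. This is a hedged sketch in plain Python rather than the thesis's xfst rule; the function name is hypothetical, and it assumes the stem ends in a consonant written with a single letter, so digraphs such as sh or dh are not handled.

```python
def partial_reduplication_plural(stem: str) -> str:
    """Form the plural by suffixing a plus a copy of the stem-final consonant."""
    final_consonant = stem[-1]  # assumption: a single-letter consonant
    return stem + "a" + final_consonant


for noun in ("af", "qoys"):
    print(noun, "->", partial_reduplication_plural(noun))
```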
This alternation can be illustrated with the examples in Table 4.7.
Table 4.7: The alternation of declension 5 representation
Noun     Declension   Plural marker   The rule
Sacab    Dec-5        CCo             sacabbo
Dameer   Dec-5        CCo             Dameerro
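The declension 5 pattern in Table 4.7 (plural marker CCo: the final consonant is doubled and o is added) can be sketched in the same way. Again this is an illustrative Python sketch under assumptions, not the xfst rule from Figure 4-11: the function name is hypothetical and the stem is assumed to end in a single-letter consonant.

```python
def declension5_plural(stem: str) -> str:
    """Double the stem-final consonant and append the plural vowel o."""
    final_consonant = stem[-1]  # assumption: a single-letter consonant
    return stem + final_consonant + "o"


for noun in ("sacab", "dameer"):
    print(noun, "->", declension5_plural(noun))
```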
4.3.5 Adjectives Lexicon Design
The Af-Somali adjective is formed by an adjectival root and the inflected forms of the reduced paradigm of the verb yahay ‘to be’. A reduced paradigm is characterized by reduced distinctions in subject marking.
Figure 4-12: Af-Somali adjective lexicon
Reduced present forms are identical to the root, whereas past forms display distinct inflectional endings. As shown in Figure 4-12, Af-Somali adjectives are few in number; we defined a root lexicon known as adjectives and a sub-lexicon known as Ad_suffix, which lists the suffixes attached to the adjectives, using the lexc formalism. Af-Somali adjectives inflect with person markers and tenses, which must agree in number, as shown with an example in Table 4.8.
Table 4.8: Example of adjective morphotactics
adjective   1st.Sg   1st.Pl   2nd.masc   2nd.fem   pres         paste   The word
fiican      0        N        Y          T         Ahay/ihiin   0/een   fiicanahay
fiican      0        N        Y          T         Ahay/ihiin   0/een   Fiicantahay
Fiican      0        N        Y          T         Ahay/ihiin   0/een   Fiicannahay
Fiican      0        N        Y          T         Ahay/ihiin   0/een   fiicanyihiin
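The four word forms in Table 4.8 can be reproduced by concatenating the root, one person marker and one present-tense form of the copula. The sketch below is plain Python written for illustration only. Which marker pairs with which copula form is not stated row by row in the table, so the pairings in `COMBINATIONS` are an assumption inferred from the surface forms in the last column; the names are hypothetical.

```python
ROOT = "fiican"

# (person marker, copula form) pairs; the pairing is an assumption
# inferred from the surface forms listed in Table 4.8.
COMBINATIONS = [
    ("", "ahay"),    # marker 0
    ("t", "ahay"),   # marker T
    ("n", "ahay"),   # marker N
    ("y", "ihiin"),  # marker Y
]


def inflect(root: str, marker: str, copula: str) -> str:
    """Concatenate root + person marker + copula form."""
    return root + marker + copula


forms = [inflect(ROOT, marker, copula) for marker, copula in COMBINATIONS]
print(forms)
```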
In addition, the morphotactic representation of adjectives is presented in Figure 4-13, which shows the order in which the suffixes attach to the adjectives.
Figure 4-13: Af-Somali Adjective finite state networks
Chapter 5 : Experimentation and Evaluation
5.1 Introduction
This chapter discusses the tests and evaluation conducted on the Af-Somali morphological analyzer. The discussion emphasizes the outputs produced and the test results obtained. Testing any sizable natural-language processing system is notoriously difficult [8], yet a morphological analyzer is an essential and basic tool for building any language processing application for a natural language, e.g., a machine translation system.
5.2 Experimentation
We have developed the morphological analyzer using the XFST tool developed by Xerox. It supports UTF-8 character encoding, which is important for the implementation of Af-Somali computational morphology. The tool is based on a lexicon and a set of rules for roots and morphemes. The lexicon contains the list of root words, each separated from its category by a tab. The analyzer fails when a complex word is given as input and the corresponding root word does not exist in the lexicon file. We have developed the Af-Somali lexicon and the rules file required for analysis. The lexicon is designed to reflect the word categories of the Af-Somali language. It contains different states for each of the root words, starting with the declaration of the tags. For example, the verb lexicon is illustrated in Figure 4-2. In the Af-Somali verb suffix entries shown in Figure 5-1, the two sides of each entry are separated by a colon. The left side of the colon represents the upper side, or analysis form, of the transducer, and the right side shows the lower side, or surface form, as presented in Appendix-B. The hash symbol at the end of a row indicates the end of the transition, so that state is a final state. The analyzer takes the surface form as input and produces as its result the grammatical structure of the word, or lexical form.
Figure 5-1: AF-Somali Verb to suffix attachment
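The entry format described in Section 5.2 (upper side, colon, lower side, then a continuation class or the hash symbol, terminated by a semicolon) can be made concrete with a small parser. This is a simplified Python sketch for illustration, not part of the Xerox tools: the `Entry` type and `parse_entry` function are hypothetical, and real lexc syntax (comments, escapes, entries that consist of a continuation class only) is not covered.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    upper: str         # analysis (lexical) side
    lower: str         # surface side
    continuation: str  # next lexicon, or "#" for end of word


def parse_entry(line: str) -> Entry:
    """Parse one simplified lexc-style entry such as '+V1+Sg+inf:i #;'."""
    body = line.strip()
    if not body.endswith(";"):
        raise ValueError(f"entry must end with ';': {line!r}")
    # Drop the terminating semicolon, then split off the continuation class.
    form, continuation = body[:-1].rsplit(None, 1)
    upper, colon, lower = form.partition(":")
    if not colon:
        lower = upper  # no colon: both sides are identical
    if lower == "0":
        lower = ""     # 0 stands for the empty string
    return Entry(upper, lower, continuation)


print(parse_entry("+V1+Sg+inf:i #;"))
print(parse_entry("toosi V2suffixing;"))
```

Reading the entries this way makes the direction of analysis explicit: the analyzer matches the `lower` side against the input word and reports the `upper` side as the analysis.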
5.3 Discussion and Evaluation
Evaluation of a morphological analyzer can be performed using a reliable broad-coverage morphological analyzer, or by having human experts annotate a text manually. The former option was not possible, as no such tool has been developed for Af-Somali. The latter option is laborious and can only be done on relatively small texts. Generally, evaluating and testing a morphological analyzer requires measuring the total number of word tokens correctly accepted by the analyzer versus the number of words incorrectly processed by it, and the percentage of tokens that are correctly analyzed in context versus the percentage of tokens that are not analyzed at all. We also have to know the percentage of analyses that are linguistically wrong regardless of context. Finally, the number of correct analyses that were not output for a token is calculated. We therefore manually annotated 218 tokens (90 nouns, 120 verbs and 8 adjectives) taken from the dictionary known as Qaamuus. Of these, 77 nominal, 105 verbal and 6 adjectival tokens were correctly analyzed. The results were evaluated by a human reader familiar with the language. An output was considered correct only if it contained all legal combinations of roots and grammatical structure for a given word form and included no incorrect roots or structures. Thus, the overall accuracy of the system is 86.2%, as shown in Table 5.1.
Table 5.1: Overall accuracy of the system
Category    Tokens   Correct %
Nominal     90       85.55%
Verb        120      87.5%
Adjective   8        75%
Total       218      86.2%
Total correct: 188
Total wrong: 30
Wrong %: 13.76%
From this we can see that the total number of tokens analyzed was 218, of which 86.2% were correctly analyzed and 13.76% were wrongly analyzed; in total, 10 tokens failed to be analyzed by the system. Lastly, we observed that errors arose because of the limited size of the lexicon we annotated, and because we have not incorporated a Guesser component, which helps to guess words that are not found in the lexicon. In addition, Af-Somali authors spell words in different ways, which leads to one word being analyzed in different ways. For example, some writers write the word Dawlad while others write Dowlad (government).
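The percentages reported in Table 5.1 follow directly from the token counts given in this section (90 nominal, 120 verbal and 8 adjectival tokens, of which 77, 105 and 6 were correctly analyzed). The short Python sketch below re-derives them; the variable names are illustrative only.

```python
# (annotated tokens, correctly analyzed tokens) per word class, from Section 5.3
COUNTS = {
    "nominal": (90, 77),
    "verbal": (120, 105),
    "adjective": (8, 6),
}


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


per_class = {name: percent(correct, total) for name, (total, correct) in COUNTS.items()}
total_tokens = sum(total for total, _ in COUNTS.values())
total_correct = sum(correct for _, correct in COUNTS.values())
total_wrong = total_tokens - total_correct

overall_accuracy = percent(total_correct, total_tokens)
overall_error = percent(total_wrong, total_tokens)

for name, value in per_class.items():
    print(f"{name}: {value:.2f}%")
print(f"overall: {overall_accuracy:.2f}% correct, {overall_error:.2f}% wrong")
```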
Chapter 6 : Conclusion and Future Work
6.1 Conclusion
Language is one of the main tools of communication, so its investigation provides better perspectives on all other aspects related to NLP. However, the formalization and computational analysis of Af-Somali morphology had not been worked out; in other words, there was a lack of tools for the analysis of Af-Somali morphology from a computational point of view. Moreover, grammar resources vary from scholar to scholar. For example, some resources treat adjectives as verbs, whereas others describe adjectives as a separate word class. Building a correctly working morphological analysis system that combines all of this information is therefore valuable for further research on the language. In this thesis, a detailed analysis of Af-Somali has been performed, and the rules covering the morphotactics of Af-Somali have been formalized. By combining the information gained, a morphological analyzer was constructed. This thesis reports on an attempt to develop an Af-Somali morphological analysis system using the finite state two-level approach. The report started with a brief introduction to the concepts and principles used in the study, including a description of morphological analysis and the unique features of Af-Somali words along with their peculiar morphemic components. The different subcategories of rule-based approaches were described briefly, and the finite state two-level approach was adopted for this study. The finite state transducer is the main tool for the development of the morphological analyzer, and the implementation has been based on [8]. Two-level morphology is proving to be very well suited to Af-Somali morphology. A major advantage of finite state two-level implementations of morphology is their inherent bi-directionality: the same system is used for both analysis and generation of word forms in the language.
An additional advantage is the high efficiency of finite-state networks, which allows even large words to be processed within a few seconds. We presented the design and implementation of the analyzed categories as a finite state transducer using the Xerox Finite State Toolkit in Chapter 4. First, all forms of verbs, nouns and adjectives were implemented in separate lexc files. The rules identified were implemented in the corresponding xfst files. The finite state transducers of each category and the finite state transducers of the rules for the respective category were composed separately. All the finite state transducers were then composed together, resulting in a single lexical finite state transducer which can be used as a morphological analyzer and generator. However, the study was carried out under a number of constraints. The main challenge was to work out the linguistic details, especially the exact morphotactic details, needed for analysis and generation. The lack of linguistic lexical resources for Af-Somali in electronic form, such as word lists, made the work demanding. It was also difficult to identify the morphological rules used in the system.
6.2 Future Work
The morphological analyzer/generator can be useful for linguists who wish to understand the morphological processes of Af-Somali, as well as for language learners, to aid their language comprehension and their practice of word conjugation or declension. The main weakness of the system results from the limited number of roots and stems available in the lexicon; it can be improved by increasing the number of stems and phonological alternation rules and by incorporating a Guesser component. As this work deals only with inflectional morphology and the northern Somali dialect, there is a need to extend the system to also include derivational and compounding morphology and the Benaadir and Maay dialects of Af-Somali. Finally, it is worth noting that once SoMorph completely describes Af-Somali morphology, it will be a useful tool for large-scale NLP applications such as machine translation and POS checkers.
References
[1] Annarita Puglielli iyo Cabdalla Cumar Mansuur, “QAAMUUSKA AF‒SOOMAALIGA”, diyaarintii Roma TRE-PRESS, 2012.
[2] Ali Mohamed, “Development of morphological analyzer for afaraf”, M.Sc. Thesis, Debra Birhan University, 2014.
[3] Andrzejewski, B. W., The Declensions of Somali Nouns, London: School of Oriental and African Studies, 1964.
[4] Banti, G., ‘Two Cushitic Systems: Somali and Oromo Nouns’, in H., 1988.
[5] BATI, T. B., AUTOMATIC MORPHOLOGICAL ANALYZER FOR AMHARIC, 2002.
[6] Beesley K. R., Morphological Analysis and Generation: A First-Step in Natural Language Processing, 2004, p. 1.
[7] Elaine Uí Dhonnchadha, A Two-level Morphological Analyser and Generator for Irish using Finite-State Transducers, Institiúid Teangeolaíochta Éireann 31 Plás Mhic Liam, Baile Átha Cliath 2, Éire, and Dublin City University Glasnevin, Dublin 11, Ireland.
[8] Fissaha and Haller, “First larger-scale morphological analyzer for Amharic verbs used XFST”, the Xerox Finite State Tools, 2003.
[9] Jackson Muhirwe, Computational Analysis of Kinyarwanda Morphology: The Morphological Alternations. Advances in Systems Modelling and ICT Applications.
[10] Jurafsky Daniel and James Martin, Speech and Language Processing, Prentice-Hall, 2000.
[11] Karttunen, Lauri, Kaplan & Zaenen, Two-level morphology with composition, 1992.
[12] Kenneth Beesley and Lauri Karttunen, Finite State Morphology, CSU Studies in Computational Linguistics, 2003.
[13] Kenneth Beesley, Finite-State Morphological Analysis and Generation of Arabic at Xerox Research, Status and Plans in 2001, Xerox Research Centre Europe.
[14] Kenneth Beesley, Finite state morphology / Kenneth Beesley and Lauri Karttunen, p. cm. - (Studies in computational linguistics; 3), 1954.
[15] Koskeniemmi, K., Two-level morphology: a general computational model for word-form recognition and production. Ph.D. thesis, University of Helsinki, 1983.
[16] Lauri Karttunen, Constructing lexical transducers. In the proceedings of the fifteenth international conference on computational linguistics, 1994.
[17] Çagrı Çöltekin, A Freely Available Morphological Analyzer for Turkish, Center for Language and Cognition (CLCG), University of Groningen.
[18] Elaine Uí Dhonnchadha, A Two-level Morphological Analyser and Generator for Irish using Finite-State Transducers, institute of technology of Éireann 31 Plás Mhic Liam, Baile Átha Cliath 2, Éire, and Dublin City University Glasnevin, Dublin 11, Ireland.
[19] Gulshat Kessikbayeva and Ilyas Cicekli, A Rule Based Morphological Analyzer and A Morphological Disambiguator for Kazakh Language, Linguistics and Literature Studies, 2016.
[20] Kenneth R. Beesley, Finite-State Morphological Analysis and Generation of Arabic, Xerox Research Centre Europe 6, chemin de Maupertuis 38240 MEYLAN, France, 2001.
[21] Mesfin Abate, Yaregal Assabie (2014). “Development of Amharic morphological analyzer using memory based approach”, 9th International Conference on NLP, PolTAL, Warsaw, Poland, September 17-19, 2014. Proceedings.
[22] Michael Gasser (2009). “HornMorpho1.0: a system for morphological processing of Amharic, Oromo, and Tigrinya”.
[23] Khumbar Debbarma, Braja Gopal Patra, Dipankar Das, Sivaji Bandyopadhyay, Morphological Analyzer for Kokborok.
[24] Koray Ak, Olcay Taner Yıldız, 2011. Unsupervised Morphological Analysis Using Tries, Dept. of Computer Science and Engineering, Isık University.
[25] Nicola Lampitelli, Evaluative morphology in Somali, Université Paris Diderot-Paris.
[26] Nimaan Abdillahi, Building and Evaluating Af-Somali Corpora, Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 73–76.
[27] R. Akilan and Prof. E. R. Naganathan, Morphological Analyzer for Classical Tamil Texts: A Rule-based approach, Research Scholar (Department of Computer Science, Bharathiar University, Coimbatore), Programmer, Central Institute of Classical Tamil, Chennai.
[28] Shuly Wintner and Gelbukh: Finite-State Technology as a Programming Environment, CICLing 2007, LNCS 4394, pp. 97–106, 2007.
[29] Saba Amsalu, Girma A. Demeke (2006). Non-concatinative Finite State Morphotactics of Amharic Simple Verbs.
[30] Xuri TANG, English Morphological Analysis with Machine-learned Rules, Dept. Foreign Languages, Wuhan University of Science and Engineering, 430073 Wuhan, P. R. China.
[31] Nicola Lampitelli, The morphophonology of Somali nouns, June 15-18, 2011.
[32] Kazakov Dimater & Manandhar Suresh (2000). Unsupervised Learning for Word Segmentation Rules with Genetic Algorithms and Inductive Logic Programming.
[33] John I. Saeed, “Somali Reference Grammar”, the University of Virginia, 26 Sep 2007.
[34] Yitayal Abate, 2013. “Morphological analyzer for Ge’ez verbs using machine learning approach”, in the thesis of Addis Ababa University.
[35] Shlomo Yona, A finite-state based morphological analyzer for Hebrew, thesis in Department of Computer Science, November, 2004.
1.9 Appendix-A: Alternation Rules for Noun and Verb
1.10 Appendix-B: Af-Somali verb Lexicon
!!Somorph-lex.txt
LEXICON Root
Verb;

LEXICON Verb
aammus V1suffixing;
abab V1suffixing;
aadaan V1suffixing;
aammin V1suffixing;
daab V1suffixing;
daabul V1suffixing;
edeg V1suffixing;
faalal V1suffixing;
faan V1suffixing;
gaad V1suffixing;
gaadaan V1suffixing;
gaaddabbuur V1suffixing;
gaadh V1suffixing;
haajir V1suffixing;
habaar V1suffixing;
kaah V1suffixing;
kaamil V1suffixing;
naafow V1suffixing;
naanays V1suffixing;
naaqus V1suffixing;
qaad V1suffixing;
raac V1suffixing;
raadgoob V1suffixing;
raadgur V1suffixing;
saacid V1suffixing;
saaf V1suffixing;
!!aaddi V2suffixing;
!!aammusi V2suffixing;
!!baafi V2suffixing;
!!baahi V2suffixing;
!!caafi V2suffixing;
!!faafi V2suffixing;
!!faahi V2suffixing;
!!gaabi V2suffixing;
!!gaadhsii V2suffixing;
!!haadi V2suffixing;
toosi V2suffixing;
maadi V2suffixing;
maahi V2suffixing;
maalgeli V2suffixing;
qaawi V2suffixing;
rafaadi V2suffixing;
rafaaji V2suffixing;
ragaadi V2suffixing;
saaci V2suffixing;
taabsii V2suffixing;
uburi V2suffixing;
ubxi V2suffixing;
xaadi V2suffixing;
xaadiri V2suffixing;
caddee V3suffixing;
dhabee V3suffixing;
aabee V3suffixing;
aafee V3suffixing;
aaladee V3suffixing;
baabee V3suffixing;
baadiyee V3suffixing;
baahee V3suffixing;
caalsaaree V3suffixing;
caanee V3suffixing;
hallee V3suffixing;
hambalyee V3suffixing;
hambee V3suffixing;
qaaligaree V3suffixing;
qaaliyee V3suffixing;
qaamee V3suffixing;
raacdee V3suffixing;
saamee V3suffixing;
saandambee V3suffixing;
saawee V3suffixing;
taakee V3suffixing;
taallee V3suffixing;
tabaabulee V3suffixing;
waayee V3suffixing;
yaree V3suffixing;
abyood V4suffixing;
adaadumo V4suffixing;
baaho V4suffixing;
caashaqo V4suffixing;
lifaaqo V4suffixing;
liidaanyoo V4suffixing;
liido V4suffixing;
qaysho V4suffixing;
qiiroo V4suffixing;
qodo V4suffixing;
rigoo V4suffixing;
riiqo V4suffixing;
saloolo V4suffixing;
tacdaaro V4suffixing;
tafaxaydo V4suffixing;
tafwareemo V4suffixing;
unko V4suffixing;
urugoo V4suffixing;
waabo V4suffixing;
xeroo V4suffixing;
xeydo V4suffixing;
yeelo V4suffixing;
aamuso V5suffixing;
badso V5suffixing;
bahayso V5suffixing;
caddayso V5suffixing;
cadgooso V5suffixing;
galdhacso V5suffixing;
hakaabso V5suffixing;
halabayso V5suffixing;
ilaaleyso V5suffixing;
janjeerso V5suffixing;
naso V5suffixing;
qaawiso V5suffixing;
qalayso V5suffixing;
raacdayso V5suffixing;
samayso V5suffixing;
tallaabso V5suffixing;
taraarayso V5suffixing;
ubaxayso V5suffixing;
udgoonso V5suffixing;
waabariiso V5suffixing;
xabeebso V5suffixing;
xabkayso V5suffixing;
aammiinso V5suffixing;

LEXICON V1suffixing
+V1+Sg+1P:0 #;
+V1+Pl:a #;
+V1+Sg+inf:i #;
+V1+2P:s #;
+V1+Sg+3Pfem:t #;
+V1+1PPl:n #;
+V1+pres:aa #;
+V1+1P+pres:naa #;
+V1+paste:ay #;
+V1+2P+paste:tay #;
+V1+1PPl+paste:nay #;
+V1+3Pfem+paste+1PPl:teen #;
+V1+paste+1PPl:een #;
+V1+1Ppres.conti:ayaa #;
+V1+2Ppres.conti:aysaa #;
+V1+1PPlpres.conti:aynaa #;
+V1+2PPlpres.conti:aysaan #;
+V1+3PPl+pres.conti:ayaan #;

LEXICON V2suffixing
+V2:i #;
+V2+Sg:0 #;
+V2+Pl:ya #;
+V2+Sg+inf:in #;
+V2+1PSg:y #;
+V2+3PSgmasc:y #;
+V2+3PPl:y #;
+V2+3PSgfem:s #;
+V2+2PSg:s #;
+V2+2PPl:s #;
+V2+1PPl:n #;
+V2+pres:aa #;
+V2+2P+pres:saan #;
+V2+3PPl+paste:yay #;
+V2+Sg+inf+paste:nay #;
+V2+3PPl+paste:yaan #;
+V2+paste:ay #;
+V2+2PSg+paste:seen #;
+V2+3PPl+paste:yeen #;

LEXICON V3suffixing
+V3:ee #;
+V3+Sg:0 #;
+V3+Pl:ya #;
+V3+Sg+inf:yn #;
+V3+Sg+3PSgmasc:y #;
+V3+3PSgfem:s #;
+V3+1PPl:n #;
+V3+pres:aa #;
+V3+3PSgfem+pres:saan #;
+V3+Sg+3PSgmasc+paste:yaan #;
+V3+paste:ay #;
+V3+3PSgfem+paste:seen #;
+V3+Sg+3PSgmasc+paste:yeen #;

LEXICON V4suffixing
+V4:o #;
+V4+Sg:0 #;
+V4+Pl:da #;
+V4+Sg+inf:an #;
+V4+Sg+1PSg:0 #;
+V4+3PSgfem:t #;
+V4+1PPl:n #;
+V4+Sg+pres:aa #;
+V4+3PSgfem+pres:taan #;
+V4+Sg+1PSg+paste:aan #;
+V4+Sg+paste:ay #;
+V4+3PSgfem+paste:teen #;
+V4+Sg+1PSg+paste:een #;

LEXICON V5suffixing
+V5+Sg:0 #;
+V5+Sg+3Pmasc:0 #;
+V5+Pl:da #;
+V5+Sg+inf:an #;
+V5+3PSgfem:t #;
+V5+Sg+1PPl:n #;
+V5+Sg+pres:aa #;
+V5+3PSgfem+pres:taan #;
+V5+Sg+3Pmasc+pres:aan #;
+V5+Sg+paste:ay #;
+V5+3PSgfem+paste:teen #;
+V5+Sg+3Pmasc+paste:een #;
1.11 Appendix-C: Af-Somali Noun lexicon

!!Somorph-lex.txt

LEXICON Root
Nouns;

LEXICON Nouns
aalad N1;
abaar N1;
bad N1;
beer N1;
hees N1;
kab N1;
kal N1;
naag N1;
qayb N1;
saacad N1;
sannad N1;
shimbir N1;
suurad N1;
toobad N1;
aroos N2MYo;
asaas N2MYo;
dalool N2MYo;
dheri N2MYo;
erey N2MYo;
magac N2MYo;
qori N2MYo;
qurub N2MYo;
ubax N2MYo;
unug N2MYo;
xijaab N2MYo;
qor N2FYo;
quraac N2FYo;
sabti N2FYo;
subax N2FYo;
mindi N2FYo;
gabadh N3F2V;
gacan N3F2V;
galab N3F2V;
kibis N3F2V;
xubin N3F2V;
garab N3M2V;
hilib N3M2V;
ilig N3M2V;
jilib N3M2V;
xadhig N3M2V;
baal N4FaC;
seef N4FaC;
weel N4FaC;
wiil N4FaC;
af N4MaC;
baaf N4MaC;
ceel N4MaC;
dal N4MaC;
fal N4MaC;
miis N4MaC;
qoys N4MaC;
riig N4MaC;
shil N4MaC;
weel N4MaC;
aabbur N5MCC;
albaab N5MCC;
alool N5MCC;
baabuur N5MCC;
dagaal N5MCC;
dameer N5MCC;
hoteel N5MCC;
ijaar N5MCC;
sacab N5MCC;
shaqal N5MCC;
wadaad N5MCC;
yaraan N5MCC;
daymo N6Moyin;
dhismo N6Moyin;
barkimo N6Moyin;
abeeso N6Foyin;
daawo N6Foyin;
darajo N6Foyin;
hooyo N6Foyin;
magalo N6Foyin;
taallo N6Foyin;
ujeeddo N6Foyin;
waddo N6Foyin;
aabbe N7Myaal;
beenaale N7Myaal;
biyoole N7Myaal;
caanoole N7Myaal;
fure N7Myaal;
gacaliye N7Myaal;
jaalle N7Myaal;
walaale N7Myaal;
waraabe N7Myaal;
yeele N7Myaal;
LEXICON N1
+N1+Sg:0 #;
+N1+Pl:o #;
+N1+defM:ka #;
+N1+defF:ta #;
+N1+defF+inter:tee #;
+N1+defM+inter:kee #;
+N1+defF+1PSg:tayda #;
+N1+defF+2PSg:taada #;
+N1+defF+3Pmasc:tiisa #;
+N1+defF+3Pfem:teeda #;
+N1+defF+1PPl:taayada #;
+N1+defF+close:tan #;
+N1+defF+near:tas #;
+N1+defF+far:teer #;

LEXICON N2MYo
+N2M+Sg:0 #;
+N2M+Pl:yo #;
+N2M+defM:ka #;
+N2M+defM+inter:kee #;
+N2M+defM+1PSg:kayga #;
+N2M+defM+2PSg:kaaga #;
+N2M+defM+3Pmasc:kiisa #;
+N2M+defM+3Pfem:keeda #;
+N2M+defM+1PPl:kaayaga #;
+N2M+defM+close:kan #;
+N2M+defM+near:kas #;
+N2M+defF+far:keer #;

LEXICON N2FYo
+N2F+Sg:0 #;
+N2F+Pl:yo #;
+N2F+Pl:O #;
+N2F+defF:ta #;
+N2F+defF:ha #;
+N2F+defF+inter:yahee #;
+N2F+defF+inter:tee #;
+N2F+defF+1stSg:tayda #;
+N2F+defF+2ndSg:taada #;
+N2F+defF+3rdmasc:tiisa #;
+N2F+defF+3rdfem:teeda #;
+N2F+defF+1stPl:taayada #;
+N2F+defF+close:tan #;
+N2F+defF+near:tas #;
+N2F+defF+far:teer #;

LEXICON N3F2V
+N3F+Sg:0 #;
+N3F+Pl:0 #;
+N3F+defF:ta #;
+N3F+defF+inter:tee #;
+N3F+defF+1PSg:tayda #;
+N3F+defF+2PSg:taada #;
+N3F+defF+3Pmasc:tiisa #;
+N3F+defF+3Pfem:teeda #;
+N3F+defF+1PPl:taayada #;
+defF+close:tan #;
+defF+near:tas #;
+N3F+defF+far:teer #;
LEXICON N3M2V
+N3M+Sg:0 #;
+N3M+Pl:0 #;
+N3M+defM:ka #;
+N3M+defM+inter:kee #;
+N3M+defM+1PSg:kayga #;
+N3M+defM+2PSg:kaaga #;
+N3M+defM+3Pmasc:kiisa #;
+N3M+defM+3Pfem:keeda #;
+N3M+defM+1PPl:kaayaga #;
+N3M+defM+close:kan #;
+N3M+defM+near:kas #;
+N3M+defM+far:keer #;

LEXICON N4FaC
+N4F+Sg:0 #;
+N4F+Pl:aC #;
+N4F+defF:ta #;
+N4F+defF+inter:tee #;
+N4F+defF+1PSg:tayda #;
+N4F+defF+2PSg:taada #;
+N4F+defF+3Pmasc:tiisa #;
+N4F+defF+3Pfem:teeda #;
+N4F+defF+1PPl:taayada #;
+N4F+defF+close:tan #;
+N4F+defF+near:tas #;
+N4F+defF+far:teer #;

LEXICON N4MaC
+N4M+Sg:0 #;
+N4M+Pl:aC #;
+N4M+defM:ka #;
+N4M+defM+inter:kee #;
+N4M+defM+1PSg:kayga #;
+N4M+defM+2PSg:kaaga #;
+N4M+defM+3Pmasc:kiisa #;
+N4M+defM+3Pfem:keeda #;
+N4M+defM+1PPl:kaayaga #;
+N4M+defM+close:kan #;
+N4M+defM+near:kas #;
+N4M+defM+far:keer #;

LEXICON N5MCC
+N5M+Sg:0 #;
+N5M+Pl:CC #;
+N5M+defM:ka #;
+N5M+defM+inter:kee #;
+N5M+defM+1PSg:kayga #;
+N5M+defM+2PSg:kaaga #;
+N5M+defM+3Pmasc:kiisa #;
+N5M+defM+3Pfem:keeda #;
+N5M+defM+1PPl:kaayaga #;
+N5M+defM+close:kan #;
+N5M+defM+near:kas #;
+N5M+defM+far:keer #;
LEXICON N6Moyin
+N6M+Sg:0 #;
+N6M+Pl:oyin #;
+N6M+defM:ka #;
+N6M+defM+inter:kee #;
+N6M+defM+1PSg:kayga #;
+N6M+defM+2PSg:kaaga #;
+N6M+defM+3Pmasc:kiisa #;
+N6M+defM+3Pfem:keeda #;
+N6M+defM+1PPl:kaayaga #;
+N6M+defM+close:kan #;
+N6M+defM+near:kas #;
+N6M+defM+far:keer #;

LEXICON N6Foyin
+N6F+Sg:0 #;
+N6F+Pl:oyin #;
+N6F+defF:ta #;
+N6F+defF+inter:tee #;
+N6F+defF+1PSg:tayda #;
+N6F+defF+2PSg:taada #;
+N6F+defF+3Pmasc:tiisa #;
+N6F+defF+3Pfem:teeda #;
+N6F+defF+1PPl:taayada #;
+N6F+defF+close:tan #;
+N6F+defF+near:tas #;
+N6F+defF+far:teer #;

LEXICON N7Myaal
+N7M+Sg:0 #;
+N7M+Pl:yaal #;
+N7M+defM:ka #;
+N7M+defM+inter:kee #;
+N7M+defM+1PSg:kayga #;
+N7M+defM+2PSg:kaaga #;
+N7M+defM+3Pmasc:kiisa #;
+N7M+defM+3Pfem:keeda #;
+N7M+defM+1PPl:kaayaga #;
+N7M+defM+close:kan #;
+N7M+defM+near:kas #;
+N7M+defM+far:keer #;
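
As a rough illustration of how a noun stem and its continuation class combine in the lexicons above, the following Python sketch enumerates lexical:surface pairs for a small hand-copied subset of the N1 entries. The function name `expand`, the dictionary layout, and the choice of stems are assumptions made for this example only. It performs plain concatenation, treating the lexc symbol `0` as the empty string, and does not model the alternation rules applied by xfst, so forms that need morphophonemic changes may differ in the actual analyzer.

```python
# Sketch only: plain concatenation of stems and suffix entries,
# mimicking how lexc continuation classes chain together.
# STEMS and SUFFIXES are hand-copied subsets of the Nouns and N1 lexicons.

STEMS = {"naag": "N1", "bad": "N1"}

SUFFIXES = {
    "N1": [
        ("+N1+Sg", "0"),     # "0" is the lexc epsilon (no surface material)
        ("+N1+Pl", "o"),
        ("+N1+defF", "ta"),
    ],
}


def expand(stem, stems=STEMS, suffixes=SUFFIXES):
    """Return (lexical, surface) pairs for one stem, before any alternation rules."""
    continuation_class = stems[stem]
    pairs = []
    for tags, surface in suffixes[continuation_class]:
        ending = "" if surface == "0" else surface
        pairs.append((stem + tags, stem + ending))
    return pairs


for lexical, surface in expand("naag"):
    print(f"{lexical}\t{surface}")
```

Each printed line pairs the lexical (upper) string with the concatenated surface (lower) string, which is the relation a compiled lexc network encodes before the two-level rules are composed with it.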
Submitted by:
Mahdi Yonis (Student)   Signature: _____________________   Date: May 30, 2017

Approved by:
1. Yaregal Assabie (Advisor)   Signature: ______________________   Date: May 30, 2017
2. ______________________ (Chairman, Dept's Graduate Committee)   Signature: ______________________   Date: ______________________
3. ______________________ (Chairman, Faculty's Graduate Commission)   Signature: ______________________   Date: ______________________
4. ______________________ (Dean, Graduate School)   Signature: ______________________   Date: ______________________