Corpus linguistics as a method for the decipherment of rongorongo (MRes Dissertation)

Martyn Harris

The rongorongo writing system of Easter Island is the only example of writing in Polynesia. The structural properties of the script and the few remaining inscriptions have complicated decipherment work for many years. With the development of sophisticated language and word-frequency distribution models (Zipf, 1949; Baayen, 2001; Evert and Baroni, 2006) and text-classification methods (Mosteller and Wallace, 1964; Landauer and Dumais, 1997; Landauer et al., 2002; 2004; 2007; Manning et al., 2008), the issues relating to the study of large multilingual corpora may be addressed. This study adopts a mixed-methods approach and evaluates the potential for classifying the rongorongo texts according to literary-genre. Although there is still work to be done in determining the types of literary-genre produced by rongorongo scribes, it is believed that a classification of the texts will restore some contextual information, enabling future studies to identify some of the structural-principles behind the productivity of particular glyphs. The paper highlights the empirical issues relating to the use of multilingual-corpora, word-frequency distributions, and samples of varying size. The Barthel (1958) corpus is tested to find the optimum representation for statistical exploration. Furthermore, the results validate previous conclusions regarding the presence of shared passages (Barthel, 1958; Fischer, 1997; Sproat, 2003; Guy, 2006; Horley, 2007a; 2009; 2010; Melka, 2008; 2009b; 2010) and literary-genre (Butinov and Knorozov, 1957; Barthel, 1958; Guy, 1985; 1990; Fischer, 1997; Davletshin, 2002; Berthin and Berthin, 2006; Melka, 2008; 2009a; 2009b; 2010; Wieczorek, 2010). The final results may provide a possible foundation for further exploration of literary-genre in the rongorongo corpus.

1st June 2010

School of Social Sciences, History and Philosophy
Department of Applied Linguistics and Communication
Birkbeck University

Corpus linguistics as a method for the decipherment of rongorongo

Martyn Harris
martynharris@ymail.com

Word count: 28,439

Dissertation submitted in partial fulfilment of the requirement for the MRes in Applied Linguistics

Acknowledgements

This dissertation benefited from the help of many kind individuals, all of whom I would like to thank (in no particular order): Paul Horley and Tomas Melka for openly discussing their views on rongorongo, in addition to articles and hard to find sources; Fr. Jean Louis Schuester, and Maria Centofanti of the Congregation of the Sacred Hearts (SSCC, Rome), for allowing access to the Tahua (A), Aruku-kurenga (B), and Mamari tablets (C), and for granting permission to take photographs (see appendix); Jill Hasell of the British Museum, for granting access to the London tablet (K), and the Reimiro (L), and (J); Marco Baroni and Stefan Gries for comments and advice on lexicostatistic, and corpus linguistic methods.
Table of Contents

Abstract
Chapter 1 Introduction
Chapter 2 Writing systems and the rongorongo script
  2.1 Empirical issues relating to writing systems research and rongorongo
  2.2 The principle of the autonomy of the graphic system
  2.3 The principle of interpretation
  2.4 The principle of historicity
  2.5 Literary genres of the rongorongo inscriptions
Chapter 3 Data
  3.1 Data
  3.2 Pre-processing
Chapter 4 Methodology
  4.1 Corpus linguistics
    4.1.1 N-grams: uni-grams, bi-grams, and tri-grams
    4.1.2 Key Word In Context concordances (KWIC)
  4.2 Authorship-attribution methods
    4.2.1 Principal components analysis
    4.2.2 Factor analysis
    4.2.3 Tests of lexical-richness
    4.2.4 Latent Semantic Analysis (LSA)
Chapter 5 Presentation of findings
  5.1 Feature selection
  5.2 Principal components analysis
  5.3 Factor analysis: Rongorongo corpus
  5.4 N-grams
  5.5 Factor analysis: Language corpus
  5.6 Frequency spectrum and VGCs as tests for lexical-richness
  5.7 Latent semantic analysis
    5.7.1 Document-document analysis
    5.7.2 Term-term similarity analysis
Chapter 6 Conclusion and Recommendations for further research
  6.1 Conclusion
  6.2 Further research
Bibliography
Appendix

Table of figures

Figure 1. Two sets of parallel-passages
Figure 2. Classification of tablets by shared glyph-sequences (Barthel, 1958)
Figure 3. Evidence of an inscription continuing on to the opposite side
Figure 4. Examples of spatial orientation and glyph-contraction
Figure 5. Empirical vocabulary growth curve of all data-sets
Figure 6. Zipf frequency/rank plots
Figure 7. Zipf's predicted straight-line in double-logarithmic space
Figure 8. Scree plots (rongorongo data-sets)
Figure 9. PCA (rongorongo dataset)
Figure 10. Factor analysis of rongorongo (DD4) with varimax and promax rotation
Figure 11. Correspondence plot of the influence of morpheme loadings
Figure 12. Comparison of the number of types: FZM and GIGP LNRE models
Figure 13. Number of types V(N) compared to the expected E[V(N)]
Figure 14. Measure of Baayen's P for texts in the rongorongo corpus
Figure 15. VGC plot of the Santiago staff (Ia)
Figure 16. Discourse map of the Santiago staff (Ia)
Figure 17. Discourse map of the Mamari tablet (Ca)
Figure 18. VGC plots of experimental texts: a genealogy, and a narrative text
Figure 19. Plot of VGC and V1 for experimental texts
Figure 20. VGC plot of texts Hr, Pr, and Qr
Figure 21. Dendrogram of rongorongo corpus
Figure 22. Dendrogram of language corpora
Figure 23. Dendrogram of Egyptian hieroglyph corpus
Figure 24. Examples of glyph <A1>
Figure 25. Plot of glyph <N36>
Figure 26. Sample correlation-coefficients: English, Egyptian, Rongorongo, and Cuneiform

List of tables

Table 1. Rongorongo tablet identifier codes (Barthel, 1958; Fischer, 1997:393)
Table 2. Example transliteration of glyphs from the Barthel (1958) sign inventory
Table 3. Summary of the four possible data-sets
Table 4. Grammatical structures present in Maori, Marquesan, and Rapanui
Table 5. Compound morphemes in the Rapanui corpus
Table 6. The most productive bi-grams in the rongorongo corpus
Table 7. The most productive tri-grams in the rongorongo corpus
Table 8. Parameters and resulting VGC of Generalized Inverse Gauss Poisson model
Table 9. Observed and modelled values for ranked-frequencies
Table 10. Pairwise correlation between texts of group 1
Table 11. Pairwise correlation between texts of group 2
Table 12. Pairwise correlation between texts of group 3
Table 13. Pairwise correlation between texts of group 4
Table 14. Summary of text clusters and revised groupings
Table 15. Top 20 frequency count of glyph <A1>
Table 16. Summary of the KWIC concordance and LSA results
Table 17. KWIC concordance of <U19>
Table 18. Summary of KWIC results for glyph <U19>

Abstract

The rongorongo writing system of Easter Island is the only example of writing in Polynesia. The structural properties of the script and the few remaining inscriptions have complicated decipherment work for many years. With the development of sophisticated language and word-frequency distribution models (Zipf, 1949; Baayen, 2001), and text classification (Mosteller and Wallace, 1964; Landauer and Dumais, 1997; Landauer et al., 2002; 2004; 2007; Manning et al., 2008), the issues relating to the study of large multilingual corpora can be addressed. This study adopts a mixed-methods approach and evaluates methods for classifying the rongorongo texts according to their literary-genre.
Although there is still work to be done in determining the types of literary-genre produced by rongorongo scribes, it is believed that a classification of the texts will restore some contextual information, enabling future studies to identify some of the structural-principles behind the productivity of particular glyphs. The paper highlights the empirical issues relating to the use of multilingual-corpora, word-frequency distributions, and different sample-sizes. The Barthel (1958) corpus is tested to find the optimum representation for statistical exploration. Furthermore, the results validate previous conclusions regarding the presence of shared passages (Barthel, 1958; Fischer, 1997; Sproat, 2003; Guy, 2006; Horley, 2007a; 2009; 2010; Melka, 2008; 2009b; 2010), and literary-genre (Butinov and Knorozov, 1957; Barthel, 1958; Guy, 1985; 1990; Fischer, 1997; Davletshin, 2002; Berthin and Berthin, 2006; Melka, 2008; 2009a; 2009b; 2010; Wieczorek, 2010). The final results will provide a possible foundation for further exploration of literary-genre in the rongorongo corpus, and a series of methods that can produce robust statistical measures.

Chapter 1 Introduction

Rongorongo is the undeciphered writing system of Rapanui (Easter Island), discovered in 1864. As for the lack of an agreed decipherment, the situation according to Steven Fischer is that 'after 130-odd years [now 150 years, my note], there is still no complete translation of Easter Island's rongorongo inscriptions' (Fischer, 1997:263) [1]. So why decipher a forgotten writing system? Aside from the academic challenge, the rongorongo tablets can no doubt contribute to our current understanding of the culture that flourished on Easter Island, and a particularly interesting question that needs to be addressed is: what was so important to the Rapanui that it had to be written down?
This paper uses corpus-based techniques, including lexicostatistics and text-classification or authorship-attribution methods, to establish whether the corpus of rongorongo tablets can be categorised according to the scribe who produced them or the literary-genre they represent. This will help to evaluate previous methods and hypotheses regarding rongorongo genres (Butinov and Knorozov, 1957; Barthel, 1958; Fedorova, 1978; Guy, 1985, 1990; Pozdniakov, 1996; Pozdniakov and Pozdniakov, 2007; Fischer, 1997; Berthin and Berthin, 2006; Horley, 2007a; 2010; Melka, 2008, 2009a, 2009b, 2010; Wieczorek, 2010). This paper focuses on the methods adopted by Mosteller and Wallace (1964), Biber et al. (1998), Baayen (1994; 2001; 2002; 2008), Baayen et al. (1996), Landauer et al. (2004), and Gries (2009).

Only one genre is confirmed, or at least agreed upon, by the majority of researchers; namely, that the Mamari tablet (Ca and Cb) appears to be some form of lunar calendar (Barthel, 1958; Guy, 1990; Berthin and Berthin, 2006; Melka, 2009a, 2009b, 2010). In addition, one side of the Keiti tablet (Er1-Er8) may also bear some calendar-sequences (see Butinov and Knorozov, 1957; Wieczorek, 2010). Fischer (1997) concludes that the majority of the tablets represent procreation chants, based on the structure of the Santiago staff (Ia) and its apparent structural-parallels with the tradition Atua mata-riri (Thomson, 1889). These analyses rely on qualitative descriptions of the glyphs combined with an internal structural analysis [2], but offer little in the way of quantitative support (see, however, Barthel, 1958; Horley, 2005; 2007; Pozdniakov and Pozdniakov, 2007, for examples of statistical analysis).

[1] Inferring that his proposed decipherment has provided a partial solution, although there is still no consensus among the research community that any part of the inscriptions has yet been successfully deciphered.

[2] A method similar to distribution analysis, a structural-linguistic method that identifies the 'systematic relations and structural properties of language elements', involving segmentation and substitution, and used by linguists to analyse unknown languages. This method has been shown to identify the direction of writing and make proposals for the underlying inventory of basic signs (see Coulmas, 1996:132).

This paper aims to address this need for further statistical analysis by bringing together several linguistic methods capable of identifying structures and stylistic 'fingerprints' present in corpora. Applying corpus-linguistic methods will identify the 100 most productive glyphs (or features), which form the basis for authorship-attribution methods, namely: principal component analysis (PCA), factor analysis, and latent semantic analysis (LSA). These methods are capable of drawing out structures present in multivariate data (see Baayen, 2008). Lexicostatistic methods will enable the study to address an issue underlying corpus linguistics: the comparison of texts with different sample-sizes. This involves modelling the distribution of word frequencies and measuring vocabulary growth as a test of lexical-richness, commonly using statistical models based on Zipf's law (Baayen, 2001; Baroni, 2006). Word-frequency data, KWIC concordances (Key Word In Context, see Summers, 1993; Biber et al., 1998; Gries, 2009), and an analysis of collocations (bi-grams and tri-grams, see Baayen, 2008) will identify compound glyphs and their given context. Latent Semantic Analysis (LSA) will describe the correlation between a group of texts and the strength of that correlation, providing a faster, though computationally expensive, alternative to traditional corpus-methods (word-frequency counts and KWIC concordances).
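As an illustration of the n-gram counts and KWIC concordances described above, the following is a minimal Python sketch. It is not the procedure used in this study: the function names `ngrams` and `kwic`, the window size, and the toy sequence of glyph codes are all assumptions introduced here for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits


# Hypothetical toy sequence of glyph codes, NOT taken from the Barthel corpus.
sample = "380 1 3 600 380 1 3 595 5 380 1 3".split()

bigram_counts = Counter(ngrams(sample, 2))
trigram_counts = Counter(ngrams(sample, 3))
concordance = kwic(sample, "380", window=2)

# The most frequent tri-gram and its count in the toy sequence.
top_trigram = trigram_counts.most_common(1)[0]
```

In a sketch of this kind, the most frequent bi-grams and tri-grams are candidates for stable glyph-sequences, while the concordance lines show the immediate context in which a chosen glyph occurs.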
In terms of the validation of results, previous research, including the identified parallel-passages, will act as a means to measure the success of text classification, in combination with n-gram counts and KWIC concordances. These methods are also applied to a second corpus of oral traditions in three Polynesian languages: Rapanui, Marquesan, and Maori. This corpus is used for experimental purposes (see results section, particularly factor analysis, VGC, and LSA). The assumption is that a genealogy, whether present in the language corpora or the rongorongo corpus, is likely to have a high degree of lexical-richness. Because genealogies generally consist of a list of individual names, they will have more words with a low frequency of occurrence (i.e. hapax-legomena). A chant or song, on the other hand, may contain repeated or formulaic expressions in which some words are repeated many times over (see the reconstructed form of Atua-mata-riri, 'god angry-eyes', in Fischer, 1997), and will therefore be less lexically-rich than a genealogy. Therefore, the data produced by the above methods may allow for the categorisation of texts or textual fragments into their respective groups based on the productivity of stable glyph-sequences.

Naturally, it is not possible to hint at what 'genre' the groups may represent; for example, are they grouped according to the individual writing-style of the scribe, or by the genre of the text (procreation chant versus calendar), or a mixture of the two? Indeed, the absence of a glyph or group of glyphs on a tablet may enlighten us as to what motivates their productivity: for example, based on previous studies of the Mamari tablet (a likely calendar, see Barthel, 1958; Guy, 1990; Horley, 2009; Melka, 2009a; 2009b; 2010), can other tablets containing similar features or structures be classified as calendars, or at least as sequences of important dates?
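The assumption about lexical-richness can be made concrete with a small sketch. The code below counts tokens (N), types (V), and hapax-legomena (V1), and records the vocabulary size after each token as a simple vocabulary growth curve. The two toy texts are hypothetical stand-ins for a name list and a formulaic chant, not material from the corpora, and the ratio V1/N is used here only as a rough indicator; none of this reproduces the LNRE models applied later in the study.

```python
from collections import Counter


def richness(tokens):
    """Return token count N, type count V, hapax count V1, and V1/N."""
    counts = Counter(tokens)
    n = len(tokens)
    v = len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    return {"N": n, "V": v, "V1": v1, "P": v1 / n if n else 0.0}


def vocabulary_growth(tokens):
    """Return the number of distinct types seen after each token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


# Hypothetical toy texts: a list of names versus a repetitive chant.
genealogy = "ko A ko B ko C ko D ko E".split()
chant = "e atua e atua e mata e mata e atua".split()

genealogy_stats = richness(genealogy)
chant_stats = richness(chant)
genealogy_curve = vocabulary_growth(genealogy)
chant_curve = vocabulary_growth(chant)
```

Under these assumptions the name list keeps adding new types as it grows, whereas the curve for the repetitive text flattens quickly, which is the contrast the classification relies on.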
If methods in authorship-attribution can discriminate between tablets with shared features, then it will be possible to retrieve some form of contextual information, which has not been consistently recorded in early records.

The notation adopted here identifies rongorongo glyphs with angle brackets '<...>', and uses forward slashes '/.../' for phonetic descriptions. The zero-padding in the Barthel (1958) corpus has been removed to enable the transcription to line up with glyph images wherever possible, and accordingly, the same convention is applied in the discussion to enable comparison (not to be confused with the transliteration scheme proposed by Horley, 2005).

The study is divided into five main sections. The first of these will address the principles that should be adopted when approaching writing from a linguistic perspective, combined with a literature review of past research, with particular emphasis on information elicited about the contents, or literary genre, of the tablets (see chapter two). Chapter three discusses the corpora and the stages involved in pre-processing. The following chapter introduces the methodology, evaluating the benefits and shortcomings associated with corpus-linguistic and authorship-attribution based studies. The fifth chapter presents the results of PCA and factor analysis, a comparative analysis of relative word-frequencies and vocabulary growth curves, and a final section dedicated to latent semantic analysis, with an overview of its utility in identifying lexical or grammatical-association patterns (Biber et al., 1998). An Egyptian corpus is introduced to evaluate LSA methods applied to a deciphered writing system, allowing the analysis to be extended to glyphs and yielding more transparent results than can be achieved with an undeciphered writing system.
The final chapter will review what the study has achieved and offer some suggestions for future research, which may contribute to the decipherment of the rongorongo script and the messages that the tablets contain. It is hoped that the results presented here will provide quantitative support for some of the previous suggestions that have been made concerning literary genre and the rongorongo tablets. A decipherment of the rongorongo script is beyond the scope of this paper and the data; however, what this paper does aim to achieve is the categorisation of the inscriptions into their respective groups or 'genres', restoring some contextual information to the inscriptions.

Chapter 2 Writing systems and the rongorongo script

2.1 Empirical issues relating to writing systems research and rongorongo

Before any analysis, it is important to clarify some of the assumptions underlying the study of writing systems. This section outlines some key principles proposed by Coulmas (2003) for approaching writing systems from a linguistic perspective.

Firstly, the terminology used in the rest of this paper with regard to writing needs to be outlined. The term 'script' is defined along the lines of Sproat (2000), who states that a script is considered to be 'a set of distinct marks conventionally used to represent the written form of one or more languages' (Sproat, 2000:25). In this analysis it is assumed that the language underlying rongorongo is a Polynesian one, and one which is likely to reflect the linguistic conventions of an older form of the modern-day Rapanui language. In addition, the term 'glyph' will refer to 'a written symbol with a particular shape', regardless of whether it could be sub-divided into smaller constituents (see Guy, 1982; Horley, 2005; Pozdniakov and Pozdniakov, 2007). Two main assumptions need addressing with regard to writing.
Firstly, there is no necessary one-to-one correspondence between the elements of a writing system and the linguistic units it is supposed to represent. For example, in the case of the proto-form of the Cuneiform syllabary, the script was tied to a specific social context (accounting), glyphs represent semantic categories, and the arrangement of signs has little to do with the linear arrangement of speech; phonetic values were later introduced, but played only a minor role (Damerow, 2006:4; see also Hyman, 2006). Therefore it cannot be assumed that a sequence of glyphs represents a string of phonemes, nor can it be concluded that, as with other examples of writing, rongorongo represents a form of proto-writing with nothing but semantic categories. Proto-cuneiform tended to be tabular in form, reflecting its use as a device for recording economic transactions. The rongorongo glyphs, on the other hand, are arranged linearly, and the boustrophedon nature of the script allowed for continuous chanting by rotating the tablet through 180 degrees every other line (see Jaussen, 1893b; Routledge, 1919; Metraux, 1940; Fischer, 1997, and the appendix). Therefore, the social or economic context may also have an influence on the structural properties of a script.

Secondly, we cannot interpret non-Western writing systems according to a Western concept of writing (Coulmas, 2003), in other words, with preconceived ideas of how the writing system should work. One case in point is given by the informant sessions between Metero and Bishop Jaussen (1893b). Jaussen believed that the tablets could be deciphered by eliciting each word from his informant and assigning the word to the relevant glyph according to its position in the oral chant. In his own words:

I had as many gathering of words, separated one from the other, as there were signs in one line of the tablet; and as a person could, without knowledge of the language, by counting exactly, place each sign above the word that is its proper meaning.
(Jaussen, 1893b:253; cf. Fischer, 1997:52)

This one-to-one correspondence between the words of his informant's chant and the glyphs did not produce any coherent results, partly because the assignment of phonetic values was performed in the absence of his informant (see Fischer, 1997:53). As noted by Coulmas (2003:33), 'there is no perfect fit between the linguistic constructs that are functional in speech and writing', which is primarily caused by the fact that writing systems are 'static', whilst speech is 'dynamic' in nature, as seen from evidence in historical linguistic studies: for example, Indo-European was the predecessor or proto-language of modern English, French, German, and Hindi. These have become mutually unintelligible over time, whilst retaining some correspondences through their common ancestral root.

Coulmas (2003:33) proposes four main assumptions regarding writing and its relationship to the spoken language. Three of these assumptions will be adopted in this paper: writing and speech are distinct systems; they are related in a variety of complex ways; and speech and writing have both shared and distinct functions. These assumptions are the basis for three principles, which constitute the key reasons why linguists should study writing as a system of communication (Coulmas, 2003:33).

2.2 The principle of the autonomy of the graphic system

Under this principle, writing systems are considered to be structured, consequently allowing them to be analysed in terms of 'functional units and relationships' (Coulmas, 2003:34), in the same way that linguists analyse the components of speech. The arrangement of the graphic signs is restricted by rules, which govern their 'linear arrangement in forming large expressions' (Coulmas, 2003:34); this mirrors language, where syntax places restrictions on the ordering of linguistic constituents, i.e. subject and object.
A question that needs addressing under this principle is: 'What are the basic operational units of the system' (Coulmas, 1991:49); in other words, what units could be considered as forming the main inventory of signs that define the writing system? A number of proposals have been made in connection with rongorongo. One extreme is the mnemonic-aid hypothesis, where glyphs allow the reader to recall parts of chants that they have previously memorised, motivated by the belief that 'pictographies generally have more variety and richness in the choice of figures and symbols' (Metraux, 1940:404). Other researchers conclude that rongorongo has a logographic or logo-phonetic system similar to Egyptian, 'in which the auxiliary parts of speech and affixes may easily be omitted' (Butinov and Knorozov, 1957:16; Davletshin, 2002:4); a mixed syllabary (Pozdniakov and Pozdniakov, 2007; Guy, 2006; Horley, 2005); or a combination of all three: a script with syllabic, logographic, and semasiographic properties (Fischer, 1998:5).

Statistical analysis (Horley, 2005:114; Pozdniakov and Pozdniakov, 2007) indicates that the script is likely to be syllabic, in accordance with the structure of the Rapanui language. The phonology of the Rapanui language allows for syllables of type (C)V only. Consequently, there are no 'closed syllables' or 'consonant clusters' of type CVC, CCV, CVCC, or CCVC. In addition, 'morphological considerations do not affect the division in syllables', and hence the only division between syllables is either 'before a consonant if there is one otherwise between vowels' (Du Feu, 1996:186).

Metraux (1940) observed that glyphs appear either in isolation or linked together. This form of conjunction is intentional, with Metraux concluding that 'these compound signs have no definite meaning of their own independent of the elements which compose them' (Metraux, 1940:401). Therefore, he concludes that compound forms are derived by 'cursive writing' conventions.
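The (C)V syllable constraint reported above for Rapanui, with divisions falling before a consonant if there is one and otherwise between vowels, can be sketched as a simple syllabifier. This is only an illustration under stated assumptions: the function name `syllabify` is hypothetical, vowels are limited to the five plain letters a, e, i, o, u, the digraph 'ng' is treated as a single consonant, and vowel length is ignored.

```python
import re

# Assumptions: five plain vowels; "ng" counts as one consonant;
# any other non-vowel character counts as a single consonant.
SYLLABLE = re.compile(r"(?:ng|[^aeiou])?[aeiou]")


def syllabify(word):
    """Split a word into (C)V syllables; raise if it does not fit the pattern."""
    word = word.lower()
    syllables = SYLLABLE.findall(word)
    if "".join(syllables) != word:
        raise ValueError(f"{word!r} is not a sequence of (C)V syllables")
    return syllables


rapanui = syllabify("rapanui")
rongorongo = syllabify("rongorongo")
atua = syllabify("atua")
```

A word containing a closed syllable or a consonant cluster is rejected by this sketch, which mirrors the absence of CVC, CCV, CVCC, and CCVC syllable types noted in the text.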
One argument against this view is given by Kudrjavtsev (1949), who observed stable sequences of glyphs believed to be the same inscription repeated across several tablets. These are termed the 'parallel-passages' and provide evidence of alloglyphic variants. They illustrate part of the script's composition, and the features that may form the main constituents [3]. The parallel-passages also show that glyphs joined together on one tablet are sometimes carved individually on others (Barthel, 1958:151-168); see, for example, tablet H versus tablets P and Q, or Gr versus K, in figure 1 below.

[3] For sign lists reducing the current Barthel (1958) sign inventory to approximately 50 main signs, see Horley (2005:112); Pozdniakov and Pozdniakov (2007).

The tablets are identified with the letters A-Z, and each side with r and v (recto/verso, where the tablet's starting point is agreed upon), or a and b (where the beginning of the tablet is unknown). In addition, the following number indicates the line number on the tablet. The period and colon denote affixed and fused glyphs, respectively, and an asterisk indicates the end of a line.

1.a Group One
Aa1    […] 430.40-320.9-320.9-440-440-440-440-445-695-4.120a-4.67-34c-60a.260
Hr5    200.5-21h:5-2e-41-220.9h-220.9jh-440-440-440a-20t.440-205s-205s-4.3a-4.3a-72a-51.3a-66a.90Vj
Pr5    200.5-2a-5-2a-41-220.9-220.9-440-440-440a-440a-20V-205-205-4.3ax-4.3ax-66c-65-65.3ax-66a.95aj
Qr5    200.5-21:5-2a-41-220a.9-220a.9-440-440-440-440b-305-305s-4.3ax-4.3ax-117d-65.22t-65.3ax-66y.95.711b
Barthel (1958:155-156)

1.b Group Two
Gr3    680-470-1t-430-580c-380.1.3-602.9-232-600-380.1.3-595.5-122-280-67-59f-69.700-380.1.3-2-609-380.1.3-597-380.1.3-59f-720-380.1.3*
Kr 3-4 400-470-1t-430-580c-380.1.3*-402-9-380.1-595.5-122-280-380.1-2-409-380.1.3-597-380.1.3-59f-720-380.22
Barthel (1958:156)

Figure 1. Two sets of parallel-passages between four tablets (1.a), and a group of two (1.b) tablets.
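The transliteration conventions used in figure 1 can be read mechanically. The sketch below is a hedged illustration rather than the pre-processing applied in this study: it assumes that hyphens separate glyph groups, that a period or a colon (the separators appearing in the printed sequences) joins the components of a compound glyph, and that an asterisk closes a tablet line. The function name `parse_passage` is hypothetical, and the fragment is the opening of the Kr 3-4 sequence as printed in figure 1.

```python
import re


def parse_passage(text):
    """Split a transliterated passage into lines, glyph groups, and components."""
    lines = []
    for raw_line in text.split("*"):          # asterisk = end of a tablet line
        groups = []
        for group in raw_line.split("-"):     # hyphen = boundary between groups
            group = group.strip()
            if group:
                # period / colon = components of an affixed or fused glyph
                groups.append(re.split(r"[.:]", group))
        if groups:
            lines.append(groups)
    return lines


# Opening of the Kr 3-4 sequence as printed in figure 1.
fragment = "400-470-1t- 430-580c-380.1.3*-402-9- 380.1"
parsed = parse_passage(fragment)

# Compound groups are those with more than one component.
compounds = [g for line in parsed for g in line if len(g) > 1]
```

Separating groups from their components in this way makes it possible to count a sequence such as 380.1.3 either as one compound unit or as three constituent glyphs, depending on which representation of the corpus is being tested.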
The parallel-passages show glyphs acting as affixes, as well as compounding and reduplication, and hint at possible allographs. Note the differences in head and arm shape, and the missing elements, in the examples (H, P, and Q) and (Gr, K). Metraux believed that these differences have no implications for understanding the script, stating it to be 'inconceivable that a particular significance would be given to the presence or absence of an arm or a leg, the position of the head, or the form of the hand' (Metraux, 1940:401).

In summary, the current assumptions regarding the nature of the script's typology, and the support given by statistical analysis (Horley, 2005; Pozdniakov and Pozdniakov, 2007), would indicate that rongorongo is likely to be a syllabary, with a small set of logograms perhaps representing the most frequent functional unit, or the most revered; for example, glyph <600> possibly represents a bird-motif attributed to the god Makemake (Metraux, 1940:311), and glyph <50> may be associated with fertility rites (Routledge, 1919); such motifs are also found on the island's rocks (Lee, 1992; McLaughlin, 2004).

2.3 The principle of interpretation

Writing systems follow a structure, whether 'phonetic, phonemic, morphophonemic', or 'lexical' (Coulmas, 2003:34). Analysing a writing system requires an awareness of how the elements relate to the linguistic structures and units. This refers to orthographic depth, or what Sproat (2000) terms the 'orthographically relevant level' (ORL) (Sproat, 2000:10), which is divided into deep and shallow, or opaque and transparent. This describes how closely the script represents the phonological system of the language. A shallow orthography, like that of Finnish or Spanish, shows a one-to-one correspondence between a grapheme and a phoneme, whereas a deep orthography, like Hànzi (Chinese), shows a one-to-many relationship between grapheme and phoneme.
The way in which scholars previously tackled decipherment 4 involved 'identifying the underlying language and reconstructing the way it is coded in the written symbols' (Damerow, 2006:1). This has been successful in the case of Egyptian and Cuneiform, where a large corpus of texts has been preserved. It should be acknowledged, however, that the earlier systems of writing had very little to do with spoken language, and so the philological methods adopted by researchers are not as effective where the relationship between speech and writing is weak (Damerow, 2006:1). In addition, without prior knowledge of the underlying language it becomes even more difficult to find a solid basis for decipherment. Etruscan, for example, is a non-Indo-European language, which thrived in Italy around 1200-100 BC (Bonfante and Bonfante, 2002:3), but adopted the Greek script, complicating any decipherment work. Although it is not currently possible to draw conclusions regarding the mapping of any particular rongorongo glyph to a phoneme, researchers believe that 'it is possible to isolate independent groups of words, and […] more important, single words from the continuous text' (Butinov and Knorozov, 1957:10), and statistical studies indicate that 'the average glyph may consist of about three elements corresponding to letters or two elements of a certain syllabic value' (Horley, 2005:108). Pozdniakov and Pozdniakov (2007) state that analyses attributing a logographic reading to the rongorongo script ‘conflict with the frequency distribution of the glyphs’ (Pozdniakov and Pozdniakov, 2007:11).

4. Methods include the analysis of bilingual texts, and the re-construction of sign-lists. See the decipherments of Egyptian, Linear B, and Mayan (Robinson, 2002).
Their study identifies the 52 glyphs with the highest frequency in the corpus (Pozdniakov and Pozdniakov, 2007:8); set against the approximately 55 possible syllables of the Rapanui language, this figure is ‘itself a weighty argument in support of the hypothesis’ that the glyphs represent syllables (Pozdniakov and Pozdniakov, 2007:11). However, later in their analysis they appear to class rongorongo as a mixed script with both logographic and syllabic elements. After an analysis of word-length in both corpora, it appears that ‘the average length of a word in the Rapanui language coincides almost exactly with the average length of a word in the written texts’ (Pozdniakov and Pozdniakov, 2007:14). Pozdniakov and Pozdniakov (2007:31) account for the divergent statistics between the rongorongo script and the Rapanui language by proposing the presence of 'determinative' or 'reduplicator' glyphs (glyphs <3> or <200>), and changes in meaning or phonology through mirroring, where an element of a glyph, such as the head, faces in the opposite direction to the established left-to-right reading order (Jaussen, 1893b; Kudrjavtsev, 1949). They attribute this behaviour to reduplication of the morpheme (Pozdniakov and Pozdniakov, 2007:32). This is not necessarily an incorrect assumption given, as mentioned previously, that writing and speech are distinct systems for different communicative needs, but the proposal has not yet been confirmed or investigated further. Another characteristic of the script is the constant repetition of small stable glyph-sequences 5, which may represent 'independent phraseological units or single words' (Butinov and Knorozov, 1957:11), a conclusion paralleled by Metraux (1940:401): 'if the signs are a form of writing corresponding to words and syllables, groups or sequences of them will be repeated many times, especially for a Polynesian language'. A further proposal is that some elements may be omitted by the scribe, for example glyph <39>, possibly representing the particle te, whose omission, if verified, would not cause any significant change at the semantic level (Butinov and Knorozov, 1957; Horley, 2005:114). As a result, it should not be assumed that the glyphs can be assigned phonetic content, due to the possible existence of 'semantic classifiers' represented by the delimiter groups (see Horley, 2005), and further complications caused by the probable high level of polysemy (see Guy, 2006; Fischer, 1997; Horley, 2007a; 2009; Melka, 2009a).

5. Glyph sequence <380.1.(3)> appears on a sub-grouping of tablets, for example the London (K), Small Santiago (Gr), Mamari (C), Echancree (D), and Small Vienna tablets.

In conclusion, on the basis of previous research, the assumption made here is that the rongorongo script is a mixed system with an inventory consisting of logographic and syllabic glyphs. It is possible that there are additional glyphs in the inventory that have a special function or provide a shorthand version of a more complex glyph. How much logography is present in the script is still debated, and new studies posit a gestural function for some glyphs, such as the sequence <480.2-483-2-4802> at the start of line Cb10 on the Mamari tablet (Melka, 2010). This cannot be ruled out, due to the autonomy of writing (Coulmas, 1996:27). Therefore, it is necessary for the ORL of rongorongo to be determined in order to provide better grounds for decipherment. This would require an entirely new study beyond the scope of this paper (see the further research section for a preliminary study).
2.4 The principle of historicity

It is assumed that writing developed at a later phase of human history, when it became necessary to keep records of transactions or levels of production, as seen in the developmental phases undergone by Cuneiform on its way to becoming a syllabary capable of writing down different languages (from Sumerian to Akkadian, though with slight variations). Writing is therefore considered to be a 'technological development', which has only been 'independently invented four times in history' (Sproat, 2000:21), namely the Hieroglyphs (Egypt), Cuneiform (Sumer), Jiǎgǔwén (China, later known as Hànzi), and Mayan (Central America). That is to say, these writing systems were not the product of cultural diffusion due to close contact between one group of people and another. Rongorongo is another writing system that appears to have been developed independently and in relative isolation.

The rongorongo tablets were discovered by Joseph Eugène Eyraud, who visited Rapanui in 1864. The purpose of his visit was to survey the island on behalf of the Congrégation des Sacrés Coeurs and to establish a Catholic mission there. During his stay he became aware of wooden boards wrapped in leaves and suspended from the rafters of the Rapanui huts:

One finds in all the houses wooden tablets or staffs covered with sorts of hieroglyphic characters. These are figures of animals unknown on the island, which the natives trace by means of sharp stones. (Altman, 2004:21)

The term rongorongo derives from te kohau rongorongo, which is translated as ‘the stick of the rongorongo men’ (Metraux, 1940:389). Prior to this designation, the script was referred to as taa or he timo (Fischer, 1997:35).
The script is classified as a boustrophedon writing system: the reader starts at the bottom-left corner and reads along to the end of the line, whereupon the tablet is turned 180° and reading continues left-to-right. This was first noticed by Maklai (Fischer, 1997:39), and confirmed later in informant sessions by Jaussen (1893b), Thomson (1889), and Routledge (1919:244), and by comparison of the glyphs in the parallel-passages (Kudrjavtsev, 1949). This method may have allowed for continuous chanting, or prevented the reader from missing a line of glyphs (Metraux, 1940:405; Fischer, 1997:353). According to Thomson's informants, the first king to arrive on Rapanui, Hotu-Matua, 'possessed the knowledge of this written language, and brought with him to the island sixty-seven tablets containing allegories, traditions, genealogical tables, and proverbs relating to the land from which he had migrated' (Thomson, 1889:514). The first evidence of Rapanui 'writing' is seen on a document presented by Gonzalez y Haedo to the islanders, when a Spanish expedition annexed the island in 1770. The signatures on the document are composed of linear and abstract signs, with a couple resembling rongorongo glyphs (Horley, 2005). Metraux (1940:400) dismisses these signatures as 'meaningless scrawls not connected to the tablets', although he does acknowledge that the 'bird figure' may be a possible exception (glyph <400>). Other research concludes that rongorongo may have been the result of contact with this 1770 Spanish expedition (see Bastian, 1872; Emory, 1968; Fischer, 1997; 1998; Facchetti, 2002): a cultural solution to the encoding of language, flourishing during the period leading up to 1860-4, but only lasting approximately three generations (Fischer, 1998:3). Horley (2005:115) argues that it is ‘reasonable to assume that the rongorongo script was already developed before the contact with Europeans’.
Horley (2005:115) shows that some signs on the document can be traced to known rongorongo glyphs and to motifs that are inscribed on skulls, as petroglyphs on rocks (see Metraux, 1940:399), and on moai hats (see also Guy, 1985; Lee, 1992; McLaughlin, 2004; Melka, 2009a).

A knowledge of the written characters was confined to the royal family, the chiefs of the six districts (into which the island was divided), sons of those chiefs, and certain priests or teachers. (Thomson, 1889:514)

There are a number of reports that show the tablets were used for chanting, and that there existed a tradition of transmission facilitated by readings of the tablets on public occasions (Routledge, 1919). Thomson's (1889) report also supports this:

[…] people were assembled at Anekena Bay once each year to [h]ear all of the tablets read. The feast of the tablets was regarded as their most important fête day, and not even war was allowed to interfere with it. (Thomson, 1889:514)

2.5 Literary genres of the rongorongo inscriptions

In order to assess the success or failure of any statistical analysis in identifying groups of similar inscriptions, it is important to study previous ethnographic evidence and research. Roussel reports that the close ancestors of a number of his informants could still read the script, and that the tablets contained the history of their island (see Fischer, 1997:36; Altman, 2004). Meinicke (1871:550-551) proposes that the tablets were ancient genealogical texts of island ariiki (chiefs); this belief was dismissed by another researcher, Bastian (1872), who instead argues that they contain songs that were memorised and recited at festivals. This would entail that the script was developed after European contact rather than through ancient origins (cf. Fischer, 1997:40-42).
According to Metraux (1940:395), neither he nor Routledge (1919) elicited any report that the tablets contained genealogies, lists of chiefs, or the origins and exploits of the Rapanui, which conflicts with the native tradition reported by Thomson (1889). Therefore, there is still no agreement as to what the rongorongo inscriptions represent in terms of literary genre. Ray (1932:155) believes the rongorongo tablets are ‘a collection of more or less symbolic reminders of objects or actions which would serve the native orator as notes of a discourse on history, a prayer, or even an inventory’, agreeing with Thomson (1889). Butinov and Knorozov (1957) propose that some tablets are genealogies, namely the Small Santiago tablet (G), based on the composition of the glyphs and repeated sequences (Butinov and Knorozov, 1957:7-8, 15). In addition, they believe that ‘one tablet could have several different texts’ (Butinov and Knorozov, 1957:8), a view consistent with statistical evidence that shows glyph sequences representing a lunar series on one side of the Keiti tablet (Wieczorek, 2010). Fischer (1997; 1998:5), however, suggests that the majority of the tablets are likely to represent procreation chants, due to their triadic structure and the high productivity of 'the phallus', glyph <76>, on the Santiago staff, which represents a semasiograph meaning 'copulated with'. However, there is still some doubt whether this hypothesis holds, as some tablets lack this suffix. Barthel (1958) proposed that the tablets could be categorised along the following lines:

Aruku-kurenga(B) - Grt. Washington(S)
Sm. Washington(R) - Tahua(A) - Grt. Santiago(H)
Keiti(E) - Sm. Santiago(G) - Santiago staff(I) - Honolulu(T)
Mamari(C) - Sm. Vienna(N)

Figure 2. Classification of tablets by shared glyph-sequences (Barthel, 1958:167).

The above paradigm illustrates four main groups of tablets.
Barthel (1958:1968) shows that there is some overlap, with text C having sequences in common with R, A and H. In addition, text G can be split according to side, with Gr sharing sequences with N and E, and with sections of C and S. Barthel (1958) attributes Gv as being similar to texts I and T. This classification is agreed upon by other research (Metraux, 1940; Butinov and Knorozov, 1957; Fischer, 1997; Horley, 2007a; 2010). To complicate the issue, not only may more than one text-type be inscribed on a tablet, but each genre of inscription may also have 'held its own formulaic requirement', or 'different reading techniques' (Fischer, 1997:283). According to Fischer (1997), ethnographic data supports the tablets being thematically grouped, with one acting as a summary or inventory of one or more other tablets. An example is observed in the apparent 'close connection between the two inscription categories of 'ika and timo. […]' (Fischer, 1997:285). These categories were recorded by Routledge (1919) during her work with an informant, who confirmed that 'a connected, or possibly the same, tablet was made at the instance of the relatives of the victim and helped to secure his vengeance' (Routledge, 1919:248; see also Fischer, 1997:285). Fischer (1997) summarises that 'the timo category of inscription recorded the spell against the murderer(s), whereas the ika category listed each ahu’s 6 slain victims' (Fischer, 1997:286). Another genre was known as the Ta'u, a list of deeds (Routledge, 1919:251-2), listing the dates and number of chickens stolen during the course of a man's life. These tablets were also known as kouhau koro. An additional tablet was produced, providing an inventory of these koro and recording only the name of the man and the year of his koro (festival).
The Ta’u genre of tablet, in combination with its lesser form (koro), may be regarded as an example of 'island historiography' equating to a 'register [of] historical events or names' (Fischer, 1997:289). There is structural evidence for the multiple-genre hypothesis, and for the existence of list-like texts. The inscriptions show regular 'delimiter groups', which Horley (2007a) attributes to lists, showing that each tablet may be identified through these delimiter-lists, allowing for further segmentation of the texts into smaller fragments. Horley (2007a) uses statistical methods to reveal sequences of glyphs that were not previously identified, and demonstrates that there is a correspondence between the number of syllables in the Rapanui language and the number of glyph elements appearing between 'delimiter groups' like <380.1.3>, based on a revised sign-list (Horley, 2005). The presence of these repeated glyphs, separating textual fragments, indicates that we are dealing with an important structural formation reflecting an inventory or list (Butinov and Knorozov, 1957:82). Horley (2007a:27) confirms that 'from a statistical point of view, structured lists significantly increase the occurrence of the glyphs belonging to their delimiter groups'. Consequently, any comparative statistical study should take into account the 'corresponding Rapanui lists'; legends or songs, however, should be excluded, as they 'feature different patterns and vocabulary', supporting the idea of a 'formulaic requirement' for interpreting some genres of text (Fischer, 1997:283). A further clue lies in the “Great Tradition” (or Große Tradition; Barthel, 1958:156), which Horley (2007a:27) attributes to 'some kind of a refrain rather than a fixed introductory text'. Fischer (1997) assigns the “Great Tradition”, or procreation chant genre, to texts Gv, Ia, and Ta, due to their similar structural features and the presence of glyph <76> on all three.
Although the genre of the tablet is still debated, previous observations (Barthel, 1958; Guy, 1985; 1990; Pozdniakov and Pozdniakov, 1996; 2007; Fischer, 1997; Horley, 2005; 2007a; 2009; 2010; Melka, 2009b; 2010) can be referred to as a guide for analysis of the inscriptions. If it is possible to group the tablets into their respective genres, a comparative statistical study may be undertaken to discover why glyphs are more productive on particular tablets than on others. Consequently, if text-classification methods show that previous assumptions on the genre of a tablet are correct, then previous qualitative studies will be supported by quantitative evidence, where previously there was little or none. This is the domain of corpus linguistics, lexicostatistics, and authorship-attribution studies, which are discussed in the methodology section below.

6. A burial chamber on which moai were erected.

Chapter 3 Data

3.1 Data

The rongorongo corpus consists of 25 tablets (there is no consensus on the authenticity of the Poike tablet; see Fischer, 1997:533-534; Melka, 2009a:112). The corpus is segmented by tablet side, creating 41 separate texts. This decision is motivated by previous research showing that a selection of tablets are a collection of smaller sub-texts (Metraux, 1940; Butinov and Knorozov, 1957; Barthel, 1958:151-157; Fischer, 1997; Horley, 2007a:26-30; 2010; Melka, 2008; 2009a). Ideally, each of these 41 texts would be split still further into smaller fragments or 'chunks'. However, it is not clear where the segments begin and end, or which statistical method will help identify them; this is therefore reserved for a future study (see further research section). Further evidence suggests that a text may continue on to the opposite side of the tablet.
One example (figure 3, below) shows a sequence of glyphs, <59f-2.76-187-[...]-605-2>, appearing on the last line of the recto of the Small Santiago tablet (Gr8) and repeated on the first line of the verso (Gv1) as <59f.76-187-[...]-607.700x.76> (note the slight differences, with removal of glyph <2>; see also Horley, 2010, on Keiti). Therefore, it is acknowledged that this categorisation is still far from ideal.

Gr 8 …7-59f- 2.76-187-200.10.124-605-2-599-59f-256-200.200.11-2.76?*
Gv 1 ...59f.76- 187-186-607.700x.76-205.76?-113-3.95x.3.76-33c.10f.76-43t-33-450.24.?.76-6.33b-607.6.76-493
Barthel (1958)

Figure 3. Evidence of an inscription continuing on to the opposite side.

A list of the rongorongo corpus, with the alphabetic codes assigned by Barthel (1958) and the codes designated by Fischer (1997), is presented below.

No.   Barthel (1958)   Fischer (1997)   Description
1.    A                RR1              Tahua
2.    B                RR4              Aruku Kurenga
3.    C                RR2              Mamari
4.    D                RR3              Échancrée
5.    E                RR6              Keiti
6.    F                RR7              Stephen-Chauvet Fragment
7.    G                RR8              Small Santiago tablet
8.    H                RR9              Great Santiago tablet
9.    I                RR10             Santiago staff
10.   J                RR20             London Rei Miro 6847
11.   K                RR19             London tablet
12.   L                RR21             London Rei Miro
13.   M                RR24             Great Vienna tablet
14.   N                RR23             Small Vienna tablet
15.   O                RR22             Berlin tablet
16.   P                RR18             Great St Petersburg tablet
17.   Q                RR17             Small St Petersburg tablet
18.   R                RR15             Small Washington tablet
19.   S                RR16             Great Washington tablet
20.   T                RR11             Honolulu 3629
21.   U                RR12             Honolulu 3623
22.   V                RR13             Honolulu 3622
23.   W                RR14             Honolulu 445
24.   X                RR25             New York Birdman
25.   Y                RR5              Paris snuffbox
26.   Z                -                Poike (authenticity disputed)

Table 1. Rongorongo tablet identifier codes (Barthel, 1958; Fischer, 1997:393).

In this study, the Barthel (1958) paradigm is used to reduce the amount of clutter in the multivariate graph plots (results section). The rongorongo glyphs are transliterated with numeric codes (Barthel, 1958).
Below is an example showing the family of glyphs with a 'lozenge' head-shape, the <200-246> series:

200 201 202 220 222 240 242 203 243 204 205 206 224 225 226 244 245 246

Table 2. Example transliteration of glyphs classified under glyph-family <200> in the Barthel (1958) sign inventory.

Note that each digit in the three-digit code reflects a change in constituents: the first digit is assigned to the head-type, the middle digit to the body-type, and the final digit reflects the change in arm-shape. Barthel (1958) also uses alphabetic codes to reflect nuances in the composition or arrangement of the glyphs, i.e. variants, subscripts, and mirrored or rotated glyphs. Certain alpha-numeric codes denote contracted glyph-forms. One example is glyph <3>, which may be an alloglyph for alpha-code 'f', called 'feathered glyphs' in Barthel's (1958) nomenclature. As an additional example, glyphs with alpha-code 't', like <4t>, are problematic, as the only difference between the plain and the coded form is that the scribe has had to fit the glyph in between two others (glyphs highlighted in grey in figure 4, Br1): see glyph <4> between <50.394> and <2>, written beneath glyph <394> further on in the inscription as <4t>. Therefore glyph <4t> could be considered an instance of the same glyph <4>, and may not have a different phonetic or semantic value as a result of subscripting. Alpha-code 'h' is assigned to glyphs that are squeezed between two glyphs through superscription; for example, on the Mamari tablet (Ca7), glyph <41h> (highlighted in grey) represents a superscripted 'crescent moon' variant of the glyph <41> appearing in Ca8. In addition, as per the evidence from the parallel-passages (tablets H, P, and Q; Gr and K), combinations of glyphs known as ‘suffixed glyphs’ and 'fused glyphs', transliterated with a period and colon respectively, are separated into their constituent parts. Therefore, sequences such as <380.1.3> are represented as individual glyphs: <380>, <1>, and <3>.
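The treatment of alpha-codes and conjuncts described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and is not the software written for this study: the function name normalise and its two flags are hypothetical, and the rule that an 'f' code becomes the base glyph followed by a separate glyph <3> is one possible reading of the alloglyph proposal.

```python
import re

# A glyph code is taken to be a number plus optional alpha-codes (e.g. '41h').
_GLYPH = re.compile(r"(\d+)([A-Za-z]*)")


def normalise(transliteration, f_as_glyph_3=True, strip_alpha=True):
    """Return a flat list of glyph tokens from a Barthel-style string.

    Conjuncts are always separated: '-', '.' and ':' all act as boundaries.
    Assumption: an 'f' alpha-code is rewritten as base glyph + glyph '3'.
    """
    tokens = []
    for part in re.split(r"[-.:]", transliteration.replace("*", "")):
        part = part.strip()
        if not part:
            continue
        match = _GLYPH.fullmatch(part)
        if match is None:
            # Keep anything that is not a plain code (e.g. '?') untouched.
            tokens.append(part)
            continue
        number, alpha = match.groups()
        feathered = f_as_glyph_3 and "f" in alpha
        if feathered:
            alpha = alpha.replace("f", "")
        tokens.append(number if strip_alpha else number + alpha)
        if feathered:
            tokens.append("3")
    return tokens


if __name__ == "__main__":
    print(normalise("380.1.3-59f-4t-42:211s"))
```

With both flags switched off the alpha-codes survive, so <41h> and <41> are counted as two different types; with the defaults they collapse into one, which is the behaviour argued for above in the case of <4t> and <4>.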
Guy (1982:447) concludes that the reading order for fused glyphs is bottom-to-top, due to structural evidence showing fused glyphs to be contractions of two or more glyphs that stand independently in other inscriptions. For example, note the contraction of glyphs <40> and <211s> to a fused form, where glyph <40> is affixed below glyph <211s> as its alloglyph <42> (example Br1, in red).

Br 1 595-1-50.394s- 4-2-595.1-50-301s - 4 -2-40-211s -91 -200 -595.2- 50.394-4t- 2 -595.2-50 -394 -4t-2- 42:211s -91 -595s
Ca 7 40 -390.41 -378y-41h-670-8.78.711
Ca 8 3.40-390.41-378y-41- 670 -8.78.711

Figure 4. Examples of spatial orientation and glyph-contraction.

As a result of the issues with alpha-codes, a number of data-sets were generated with different parameters for pre-processing the corpus, in order to find the best representation for statistical purposes. Table 3 shows the different data-sets and how they were processed.

Pre-processing stages              DD1   DD2   DD3   DD4
Conjuncts separated                N     Y     N     Y
Replacement of f with glyph <3>    N     N     Y     Y
Alpha codes removed                N     N     Y     Y

Table 3. Summary of the four possible data-sets (prefixed with DD).

Data-set 1 (DD1) and data-set 2 (DD2) are abandoned due to the number of hapax legomena caused by the preservation of the alpha-codes. These two data-sets do not consider the similarity between glyphs and their contracted or superscripted forms, and will over-inflate tests for lexical-richness (see below). Further evidence from Vocabulary Growth Curves (VGC) supports this (see the section on lexical-richness for more on VGCs). Figure 5 shows the empirical VGC of the four derivations of the Barthel (1958) transcriptions, plotted with the War of the Worlds, which acts as an example of a VGC calculated across a larger sample. The y-axis shows the number of types (V) across the sample-size (N), denoted by the x-axis.

Figure 5. Empirical vocabulary growth curve of all data-sets and a comparative sample from War of the Worlds.
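An empirical vocabulary growth curve of the kind plotted in figure 5 records the number of distinct types V after each of the first N tokens. The sketch below shows one straightforward way to compute such a curve, together with a hapax legomena count; the function names and the step parameter are illustrative assumptions, and the toy token list is not taken from the corpus.

```python
from collections import Counter


def vocabulary_growth(tokens, step=1):
    """Return (N, V) pairs: distinct types V seen in the first N tokens."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        # Record a point every `step` tokens, and always at the final token.
        if n % step == 0 or n == len(tokens):
            curve.append((n, len(seen)))
    return curve


def hapax_count(tokens):
    """Count the types that occur exactly once in the sample."""
    return sum(1 for freq in Counter(tokens).values() if freq == 1)


if __name__ == "__main__":
    sample = ["380", "1", "3", "380", "1", "3", "2"]  # toy sequence only
    print(vocabulary_growth(sample))
    print(hapax_count(sample))
```

A data-set that keeps the alpha-codes yields more types for the same number of tokens, so its curve climbs faster and its hapax count is higher, which is the effect that led to DD1 and DD2 being set aside.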
The curve for the original Barthel (1958) data-set 1 (DD1) shows over 3000 individual types (V) in the whole corpus (N); the curve is not particularly smooth compared to the War of the Worlds text, and the sample ends at just over N = 10,000. This is also the case with data-set 2 (DD2) and data-set 3 (DD3). Data-set 4 (DD4) has fewer types, and is closer to the reported sample-size of the rongorongo corpus, between N = 14,000 and N = 14,500 tokens (see Barthel, 1958; Fischer, 1997; Horley, 2005:108; Melka, 2009a:114). The best representation therefore comes from DD4, which has a smoother curve than the other data-sets, reflecting what we expect from natural language (Zipf, 1949), as observed with the War of the Worlds sample. The number of types (V) is also more in line with the number of glyphs, approximately 632 (Horley, 2005:108), appearing in Barthel (1958). The analysis in this paper adopts DD4, with all alpha-codes 'f' replaced by glyph <3> (Horley, 2005:111), and with fused and affixed glyphs separated. It is acknowledged that no one-to-one correspondence can be made between glyph and morpheme. This is due to the nature of word-frequency distributions and varying sample-size, which cause a shift in position between the ranks, making a comparison meaningless (Baayen, 1998; 2001; Pozdniakov and Pozdniakov, 2007:18).

The language corpus introduced for comparison is composed of Rapanui language texts collated by Englert (1970), Thomson (1889), Metraux (1940), Felbermayer (1971), Campbell (1971), Barthel (1978), and Blixen (1979). These are collections of native oral tradition, and include a possibly authentic rongorongo procreation chant, Atua-mata riri (Thomson, 1889). However, some of these texts show contamination from Tahitian, where Tahitian equivalents of Rapanui words are intermingled with the rest of the text (Fischer, 1997:93-95), making them unreliable sources of data.
In the case of the chant Atua-mata riri, Fischer (1997:96-100) offers a provisional reconstruction to compensate for this issue. The traditional texts in Barthel (1978) are not very representative of Rapanui oral tradition. They are a collection of traditions from an informant known as Mrs E, and are mainly narrative, with typographical and segmentation errors. If the tablets contain a variety of genres, then the comparative language corpus needs to reflect this fact with the inclusion of Rapanui songs, chants, prayers, and poetry. The lack of representativeness of the Rapanui data is compensated for by including additional Polynesian language texts from Maori and Marquesan. The Maori data is gathered from George Grey (1853). Grey published a large collection of Maori poems, traditions, chants, and songs. Consequently, this part of the corpus is much more representative of different genres. The first 150 of these texts are selected. This decision is arbitrary: more texts could be included, and perhaps this would even be desirable; however, processing-time, which runs to several hours with large corpora, is a consideration. Furthermore, a random-sampling approach is chosen due to the other corpora having been subjected to the same mode of selection. The Marquesan texts form a collection of chants and love songs recorded by Elbert (1941:61-91). Within the Eastern Polynesian language group, Rapanui is closest to Marquesan morphologically, whereas its phonology has more in common with New Zealand Maori. Many Polynesian languages share the same syntactic particles, some with slight phonemic changes; for example the causative /haka-/ (Rapanui) and /whaka-/ (Maori); the +specific determiner /te/; possessive /o/ and /a/; and tense marker /i/. Phonological differences include Marquesan (i.e. the switching of [k] to the glottal stop in North and South Marquesan, respectively, for example: /koe/ – North Marquesan, /`oe/ – South Marquesan, the second person singular.
See Elbert, 1941:58; 1982:502), but all three languages show the same patterns of use, confirmed by KWIC concordances:

1.) Possessive structure: o … te, and o te ...
Examples a.) o b.) o c.) o miru te haumea ki te mokomae o te tumu taha o te o te mata te pua rangi
Language RAP MAR MAO

2.) Dative Structure: ki te …
Examples a.) i b.) i c.) i tepo ki ha'ati'o 'i maka ki te te te rona tu'u apai
Language RAP MAR MAO

Table 4. Grammatical structures present in Maori, Marquesan, and Rapanui.

Metraux (1940:419) believes that the three cultures are linked by a shared heritage of cultural traits, each related to a particular area of expertise, for example the building of ahu (burial chambers), a part of Rapanui and Marquesan tradition. In addition, Marquesan and Rapanui cultures practise the carving of stone statues, whereas the Maori carved in wood, which is perhaps attributable to the type and abundance of certain materials. Another motivation for selecting the Maori texts is their ancient origin; in a footnote to the text He tangi, mo mokowera, i mate ia ki wharekura, kei ohiwa, Grey notes:

[…] it is the custom of the natives to compose their poetry rather by combining materials drawn from ancient poems, than by inventing original matter. An apparently recent poem is thus sometimes really of very ancient origin. (Grey, 1853:14)

Grey's texts are important as they pre-date many of the collections made by rongorongo investigators (Barthel, 1978; Jaussen, 1893b; Thomson, 1889; Routledge, 1919; Metraux, 1940). If any of these texts have 'ancient origins', an assumption that could be made is that there exists a set of formulaic expressions shared among Polynesian orators, which may be present in the rongorongo inscriptions. The Maori texts, 150 in total, were categorised into literary genre according to the genre designated in their titles.
For example, He waiata meaning 'song', He tangi – 'lament', He tau – 'recitation before speaking', He karakia – 'incantation', He whakaaraara – 'war chant'. Texts that could not be categorised were labelled 'prose'. It is acknowledged that this is not the most accurate form of classification.

3.2 Pre-processing

The rongorongo glyphs are divided into units according to the hyphen between glyphs, along the lines of Pozdniakov and Pozdniakov (2007:14). Each glyph is assumed to be representative of a morpheme in the Rapanui language, in line with Pozdniakov and Pozdniakov (2007) and Horley (2005; 2007). The language corpus was trialled in a pilot study 7, and it became apparent that there is a wealth of compound forms. These distort the frequency-counts and KWIC concordances, reducing the number of actual types in the corpus (see, for example, the forms used in interrogative Q-constructions, appearing in De Feu, 1996:19-21). As an example, the possessive particle system is particularly complex.

     Surface     Underlying   Syntactic form
     structure   structure
1.   ta'aku      te-a-au      Specifier - possessive alienable - 1st Person singular
2.   to'oku      te-o-au      Specifier - possessive inalienable - 1st Person singular
3.   ta'au       te-a-koe     Specifier - possessive alienable - 2nd Person singular
4.   to'ou       te-o-koe     Specifier - possessive inalienable - 2nd Person singular
5.   ta'ana      te-a-ia      Specifier - possessive alienable - 3rd Person singular
6.   to'ona      te-o-ia      Specifier - possessive inalienable - 3rd Person singular

Table 5. Compound morphemes in the Rapanui corpus.

The final syllable of the possessive is a substituted form of the personal pronouns for the 1st, 2nd, and 3rd persons, with /-ku/ representing /au/ (1st), /-u/ for /koe/ (2nd), and /-na/ for /ia/ (3rd). The above paradigm represents the singular forms, and the plural is derived through elision of the initial /t/ of the specifier.
This is a small sample of the possessive system and there are many more forms, including inclusive and exclusive versions (see De Feu, 1996:144-145; Karena-Holmes, 1997:50). The question is whether these possessive forms should be divided up in the corpus. This paper adopts the surface-structure, and reserves further segmentation for a future study. Although an error-free dataset has been one of the primary aims, it has not always been possible, due to the nature of working with corpora from varying languages, sources, and transliterations (for example in vowel length: taana versus tana, 3rd person possessive singular; Tregear, 1891:460). The language corpora are primarily pre-processed through removal of all punctuation, as this is a Western orthographic convention. There are also differences in transcription; for example, in the Rapanui texts we see entries for /kiroto ki/ and /iroto i/, which are divided into /ki roto ki/ and /i roto i/ 8, respectively. The pre-processing stage was performed with software written for this study, and manual corrections were kept to a minimum in order to avoid errors associated with manual processing. The language corpus provides a good environment for future investigation of linguistic structures and literary-genre in Polynesian languages. This is especially true here, as there does not appear to be a recognised Polynesian corpus in existence comparable to those, like the Brown corpus, available for English-based studies.

7. Pilot study results presented at the Bloomsbury Conference in Applied Linguistics (2009).

8. The morpheme /roto/ is a separate word meaning 'inside' or 'within', and forms part of a complex preposition, here denoted by /ki/ and /i/, which are particles (see Tregear, 1891:428, and the Austronesian Basic Vocabulary Database, word no. 174; Greenhill et al., 2008).
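The pre-processing steps for the language corpus can be illustrated as follows. The software written for this study is not reproduced here, so this Python sketch is only an assumption about how such a stage might look: the names SPLITS, preprocess and kwic are hypothetical, lower-casing is an added convenience, apostrophes are retained on the assumption that they mark the glottal stop (as in ta'aku), and the example sentence is invented purely for demonstration.

```python
import re

# Only the two fused forms mentioned in the text; any further entries
# would have to come from inspection of the corpus itself.
SPLITS = {"kiroto": "ki roto", "iroto": "i roto"}


def preprocess(text):
    """Lower-case, strip punctuation, and split known fused prepositions."""
    text = text.lower()
    # Keep word characters, whitespace, and apostrophe-like glottal marks.
    text = re.sub(r"[^\w\s'`’]", " ", text)
    tokens = []
    for word in text.split():
        tokens.extend(SPLITS.get(word, word).split())
    return tokens


def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


if __name__ == "__main__":
    # Invented string for demonstration, not a line from the corpus.
    toks = preprocess("Kiroto ki te hare, iroto i te ana.")
    print(toks)
    print(kwic(toks, "roto", width=1))
```

Splitting the fused forms before counting means that /roto/ is tallied as a single type wherever it occurs, which keeps the frequency-counts and the concordance lines comparable across sources with different transcription habits.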
Chapter 4 Methodology

4.1 Corpus linguistics

Many of the principles and methods adopted in this analysis are founded on corpus linguistics (Biber et al., 1998; Gries, 2009), lexicostatistics (Baayen, 2001; 2008; Baroni, 2006; 2009; Evert and Baroni, 2005; 2006; 2007), authorship-attribution studies (Mosteller and Wallace, 1964; Baayen et al., 1996; 2002; Stamatatos, 2000; 2009; Peng et al., 2002; 2003), and information retrieval (see in particular Landauer et al., 2004; Manning et al., 2008). As stated by Pozdniakov and Pozdniakov (2007:4), ‘the appearance of yet another monograph with a new ‘translation’ of the corpus of texts would be significantly less interesting than the appearance of an article presenting the results of a structural analysis of some specific aspect of the writing system’. In light of this statement, the main guiding principle for this study is a strong emphasis on statistics to support any generalisations about the script’s composition. Melka (2009a) is the first to mention corpus linguistics as a method for the decipherment of rongorongo. In his view, 'the solution to what exactly rongorongo is will not be scheduled or delivered via a corpus linguistic methodology', although 'the approach will take us closer to its 'understanding'' (Melka, 2009a:112).

The first decision to be made is the unit of analysis, which determines not only the falsifiability of the final results, but also the initial steps needed to collate and process the corpus. Corpus studies are largely based on two forms of analysis, either 'describing a linguistic structure', or a 'group of texts' (Biber et al., 1998:273). These are termed the 'observations' of the study, with each observation being an occurrence of the feature to be studied. The unit of analysis adopted in this study is a single glyph or morpheme. However, when applying authorship-attribution methods, each text is considered to be an observation (Biber et al., 1998:269).
One notable study of this kind is by Mosteller and Wallace (1964), who investigate the disputed authorship of the Federalist papers through an analysis of each text, showing how they can be attributed to the most likely author.

Melka (2009a) highlights the problematic nature of the rongorongo corpus itself: an issue that causes a 'number of problems for the investigation' (Melka, 2009a:114) is the destruction, misplacement, and sale of many tablets by the turn of the nineteenth century. This makes it impossible to consider the corpus to be representative of the entire wealth of possible literary genres that formed the rongorongo tradition. History has only handed down a few tablets for study; however, there is still enough material [9] for a statistical analysis of the mechanisms underlying the rongorongo script. A mixed-methods and interdisciplinary approach is predicted to be the most successful for exploring the rongorongo script, as the best parts of each method can be adopted, whilst abandoning their individual shortcomings (Dörnyei, 2007).

The corpus of twenty-five tablets is not of the magnitude normally associated with corpus linguistic studies, compared with the Brown Corpus of English, totalling a million words. However, Melka (2009a:113) states that current rongorongo research 'should consider the approach of a text-oriented analysis as essential in order to be able to establish safer parameters'. In light of this statement, a corpus-based approach is adopted in this paper. Biber et al. (1998:249) propose that 1000-word samples are sufficient for reliable counts of a linguistic feature in a text.
However, in lexicographic studies this sample would need to be much larger, totalling millions of words, because many words occur with a very low frequency, perhaps only once. These words, known as the hapax-legomena, are a very important focus for word-frequency-distribution studies (see Baayen, 2001), but we also need to know which glyphs and morphemes are the most productive to begin identifying possible syntax markers. According to Biber et al. (1998), if a corpus contains too few texts, 'a single text can have an undue influence on the results of the analysis' (Biber et al., 1998:248-249). This issue is attributable to a particular kind of study, i.e. a study into dialectal differences, or studies into language and gender. Therefore, as long as the nature of the texts used in a corpus is made explicit, and the methods and assumptions are transparent, then any imbalance in the corpus will be clear.

Representativeness is an additional consideration in corpus linguistics (Summers, 1993; Biber et al., 1998; Melka, 2009a). The term 'representativeness' is understood to mean:

    […] covering what we judge to be the typical and central aspects of the language, and providing enough occurrences of words and phrases […] to make accurate statements about lexical behaviour.
    Summers (1993:186-187)

Biber et al. (1998:248) observe that 'there are important differences in the use of lexical, grammatical, and discourse features across varieties of language'.

[9] Although rongorongo does not exhibit the wealth of inscriptions of Linear B (more than 3000, see Robinson, 2002:84), when compared to the Phaistos disc (one artefact, see Fischer, 1997a; Robinson, 2002:298) or the 3700 short inscriptions of the Indus Valley seals (Robinson, 2002:268), sizeable sections of continuous text can be extracted for statistical scrutiny.
If the rongorongo corpus is not considered to be representative of the diversity of literary genres, a comparable corpus that does contain a variety of text-genres is necessary in order to offset any imbalance. This is where multilingual corpora are useful for comparing grammatical structures and word-frequency to identify parallels between rongorongo and Rapanui, Marquesan, or Maori. In this study, we must consider the corpus of rongorongo texts to be a random-sample, and consequently not all literary genres are likely to be represented (Melka, 2009a). It should also be kept in mind that any corpus-based study will never be truly representative of the whole population, as this population is either very large, difficult to sample, or continuously being expanded. Although we could say, for example, that we have collated all texts produced by Noam Chomsky, there is no reason to assume he will not publish another book in the future. Therefore, as long as the corpus is representative of the unit(s) under investigation, and its limitations are recognised, the study can still provide valuable insights under these principles. This paper therefore views the rongorongo corpus as a random sample of those tablets that may have existed as part of a wider collection of inscriptions, with the unit of analysis being a single glyph or morpheme.

Due to the multilingual nature of the corpora, and differing transliteration schemes, off-the-shelf software is not sufficient. Although freeware and commercially available software could perform sufficiently, the need to edit the data according to the requirements of the software is unattractive due to the size and complexities of the rongorongo corpus. In addition, many applications are not equipped to deal with multilingual data or numeric transliteration systems.
The R statistical programming language is selected as the best approach due to it being freely available, very flexible, powerful, and fast (see http://cran.r-project.org/). The R environment makes it possible to process all the corpora procedurally with little intervention from the user aside from validating the results. Ultimately this means that the problems outlined above, with respect to transliteration schemes, are overcome by writing programs specific to each corpus, ensuring that the same analysis is applied consistently throughout.

In corpus linguistics, words are known as tokens; in fact, punctuation and white space are also considered as tokens in some authorship-attribution studies (see Baayen et al., 2002). Word-frequency forms a large part of corpus-based investigations, for example in lexicostatistics, historical linguistics, and Natural Language Processing (NLP), which study the productivity of words and lexical-association patterns to discover how language changes over time, by its speakers, and in different contexts (see Biber et al., 1998). Although informative, generalisations cannot be made purely on word counts. Due to the varying nature of language, the productivity of a word may vary depending on context, speaker/writer, and more importantly the size of the sample. Therefore, even if word counts are normalised (Biber et al., 1998:263-264), the counts would still not reflect the true population from which the sample came. Relative frequency counts add up to 100%, which does not allow any space for unseen types. This means that there are still words in the population which have not yet appeared, but are likely to if we increase the sample size. Recent lexicostatistic studies (Baayen, 2001) have sought a method to compensate for this issue. Baayen (2001:5) defines this problem as LNRE, or Large Number of Rare Events.
This idea refers to the fact that word-distributions contain a large number of low-probability words, represented by the hapax-, dis-, and tris-legomena, or words occurring only once, twice, or three times. Another issue with word-frequency counts is that comparisons based on different sample-sizes are not accurate, as the sample mean increases as a function of sample-size (Baayen, 2001:4-5; Evert and Baroni, 2006). In addition, median and mode are usually uninformative because their value generally lies between 1 and 3, owing to the previously mentioned abundance of low-probability words.

There are a number of options available to address these fundamental issues. Firstly, we may take a set number of tokens and form a sub-corpus of each text, for example, the first 1000 glyphs in each text. This would allow a comparison, but with a certain amount of data-loss. For this study, this results in a considerable loss of data from what is already a relatively small corpus by comparison. In order to address this issue, a word-frequency model can be generated first. An important law in lexicostatistics is Zipf's law (Zipf, 1949; see also Baayen, 2001; Baroni, 2006). Application of Zipf’s law involves organising the frequencies according to decreasing value in rank-order, with the highest frequency being assigned rank 1. We can now compensate for the issues corresponding to relative-frequency counts through prediction of the unseen types. The ranking does not consider which word is at each rank, for example, whether the word and is the most frequent, but does enable us to model word-distributions and analyse vocabulary-growth and lexical-richness. A language model is generated to extrapolate a small sample text to a larger sample-size whilst accounting for the LNRE, by freeing up the probability-space for new word-types.
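The rank-ordering step described above can be sketched in a few lines of Python. This is an illustration only and not the R code used in the study. The function names `rank_frequency` and `loglog_slope` are hypothetical, and the least-squares slope is simply one convenient way of checking how close a profile lies to a straight line in double-logarithmic space.

```python
from collections import Counter
import math

def rank_frequency(tokens):
    """Return [(rank, frequency)], with rank 1 for the most frequent type."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(profile):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two ranks; a value near -1 is the classic Zipfian slope.
    """
    xs = [math.log(rank) for rank, _ in profile]
    ys = [math.log(freq) for _, freq in profile]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy token list with frequencies 6, 3, 2 (i.e. 6 / rank), not corpus data.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
profile = rank_frequency(tokens)
print(profile, loglog_slope(profile))
```

As the text notes, the profile discards which type sits at which rank; only the ordered frequencies are kept.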
As observed by Evert and Baroni (2006:1), 'in order to compare samples of different sizes […] it is thus necessary to extrapolate their observed values to much larger samples'. Zipf’s law predicts that word-distributions follow a straight line in double-logarithmic space. According to Baayen (2001:32), ‘many statistical tests pre-suppose normality’, in other words the well recognised ‘bell-shape’ of the normal distribution, where the mean centres around zero. However, as seen in the plot below (figure 6), word-distributions do not follow a normal distribution:

Figure 6. Zipf Frequency/Rank plots for ‘Alice in Wonderland’. Note the long tail stretching along the bottom-right of the plot. This represents the hapax-legomena and consequently the lexical-richness of the text.

In terms of Zipf's double-logarithmic space, it is possible to make the above distributions more 'normal' by transforming the absolute frequency of each word to ‘log frequency’ (Baayen, 2001:32), reflected in figure 7.

Figure 7. Texts 'Huckleberry Finn' and ‘Alice in Wonderland’ against Zipf's predicted straight line in double-logarithmic space.

In figure 7, the straight line in double-logarithmic space, predicted by Zipf (1949), does not hold for the text 'Huckleberry Finn' nor for ‘Alice in Wonderland', though it accounts for the middle-frequencies fairly well. Zipf's law therefore fails to account for the top and bottom frequencies, which curve downwards from their predicted values. As a result, Zipf's law does not predict the correct distribution of the hapax-legomena and the most frequent words, occupying the bottom and top ranks, respectively. Tests of lexical-richness would also be unreliable due to the difference in sample-size: Huckleberry Finn consists of 116,608 tokens, and Alice in Wonderland of 29,951 tokens.
Texts can also be divided into equal-sized 'chunks' (Mosteller and Wallace, 1964; Peng and Hengartner, 2002), or, as per Baayen (2001) and Evert and Baroni (2006), a model of the word-distributions can be implemented by selecting an appropriate model and experimenting with the parameters until a good fit of the word-distributions is achieved. These models form the basis for interpolating and extrapolating data to the required sample-size. Baayen (2001) recommends that sample-size is not extrapolated to more than twice the original size, due to potential inaccuracies in the predictions made by the model. This issue may be addressed by sampling only part of the population. However, as stated by Evert and Baroni (2005:5), 'extrapolation quality rapidly degrades when less than 25% of the data are used for estimation'; still, with only 25% it is possible to extrapolate up to four times the original sample-size (Evert and Baroni, 2005:2). In addition, they believe that a large part of the issues related to extrapolation can be attributed to the 'nonrandomness of the corpus data' (Evert and Baroni, 2005:14), although they admit that the non-randomness of word-distributions cannot be the sole cause of some poor estimates (Evert and Baroni, 2005:15).

Evert and Baroni (2006) developed the zipfR package for R, which calculates Zipf-like LNRE models such as the Zipf-Mandelbrot (ZM), finite Zipf-Mandelbrot (fZM), and the Generalised Inverse Gauss-Poisson (GIGP) model. The Zipf-Mandelbrot model is similar to Zipf's original model, with two additional parameters: a to modify the slope of the rank-frequency graph in double-logarithmic space, and b to adjust the downward curvature of the line, accounting for the mismatch between the straight line predicted by Zipf (1949) in figure 7 and the actual curve of the rank-frequency distribution (see Baayen, 2001:101-102).
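The relationship between the two models can be written out as follows. This is a sketch in the usual textbook form, where f(r) is the frequency of the type at rank r. The normalising constant C is notation introduced here for illustration, while a and b are the two parameters named above.

```latex
% Zipf's law: a straight line in double-logarithmic space
f(r) = \frac{C}{r^{a}}
\quad\Longleftrightarrow\quad
\log f(r) = \log C - a \log r

% Zipf-Mandelbrot: b bends the line downwards at the top ranks
f(r) = \frac{C}{(r + b)^{a}}
\quad\Longleftrightarrow\quad
\log f(r) = \log C - a \log (r + b)
```

Setting b = 0 recovers the straight line, and for ranks much larger than b the two forms are practically indistinguishable, which is consistent with the good fit in the middle-frequencies.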
The finite or non-finite distinction concerns whether the model assumes that the number of words in the distribution is finite or potentially infinite, which influences the final predictions. The fZM and GIGP models generally provide 'the best results' (Evert and Baroni, 2005:17), as will be seen in the results section.

Before describing how these models are generated, the notation used to describe the features of the model is explained (see Baayen, 2001; 2008:231). Sample-size is denoted by N, with the number of types in a sample of N tokens expressed as V(N). Let m represent a frequency class, so that Vm(N) is the number of individual types with frequency m in a sample of N tokens. The models are based on two types of data. Firstly, the frequency spectrum, Vm(N), is a ‘table of frequencies with frequencies’ (Baayen, 2008:230). For example, the frequency spectrum of Alice in Wonderland shows that there are 858 words in the lowest frequency class, 295 in the second lowest, and 152 in the third lowest. The second data-type is the vocabulary growth rate, which is the ‘joint probability of the unseen types’ (Baayen, 2009:229). The data are used to predict how the frequency of a word changes as a function of an increase in the size of the sample.

In order to address the issue of words having non-random distributions, Baayen (2001:64-69) and Evert and Baroni (2006:8) propose a technique called binomial interpolation. This resolves the non-random distribution of words (a problem for extrapolation quality) by performing a series of calculations that treat the words as if they were randomly ordered. Binomial interpolation takes a frequency spectrum of the words and calculates the expected vocabulary size, E[V(N)], and the expected count for each frequency class, E[Vm(N)].
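The frequency spectrum and the interpolation step can be sketched as follows. This is a minimal Python illustration rather than the zipfR routines used in the study. It assumes the standard binomial form E[V(n)] = sum over m of Vm(N) * (1 - (1 - n/N)^m) for a reduced sample of n <= N tokens. The function names are hypothetical.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map m -> Vm(N): the number of types that occur exactly m times."""
    type_frequencies = Counter(tokens)
    return dict(Counter(type_frequencies.values()))

def expected_vocabulary(spectrum, n):
    """Binomially interpolated E[V(n)] for a sample of n <= N tokens."""
    N = sum(m * v for m, v in spectrum.items())  # recover the sample size
    if N == 0:
        return 0.0
    if not 0 <= n <= N:
        raise ValueError("n must lie between 0 and the original sample size N")
    p = n / N
    # A type seen m times is missed by the smaller sample with
    # probability (1 - p)^m, so it is expected to appear with 1 - (1 - p)^m.
    return sum(v * (1 - (1 - p) ** m) for m, v in spectrum.items())

# Toy data: one type seen three times, one twice, one once.
spectrum = frequency_spectrum("a a a b b c".split())
print(spectrum, expected_vocabulary(spectrum, 3))
```

Because the calculation uses only the spectrum, the original order of the tokens plays no role, which is how the technique sidesteps non-random word order.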
This means that vocabulary growth can be calculated for any sample-size, as long as it is equal to or less than the size of the sample that the frequency spectrum was originally generated from. By applying these models to the rongorongo and language corpora, we can select the texts with the biggest sample-size and interpolate them to values comparable to the smaller sample texts, or extrapolate the smaller texts to the text with the largest sample-size. This permits a comparison between texts, but it is still necessary to exclude some texts from the analysis, such as the Honolulu tablets (3622, 3623 and 445), the New York Birdman (X), the Paris snuffbox (Y), and the Poike (Z) tablet, and language texts with fewer than 50 words. This is because it is not possible to model the vocabulary growth of a text with hardly any data available for the model to base its calculations on.

4.1.1 N-gram: Uni-gram, bi-grams, and tri-grams

An n-gram is a sequence of n morphemes or words. Frequency counts of n-grams are used as data for a frequency spectrum analysis. The 100 most frequent glyphs and morphemes are selected for further scrutiny using additional statistical techniques (described below), specifically as features for PCA and factor analysis. A bi-gram may represent a collocation and provides the basis for investigating larger stable sequences in the arrangement of linguistic or orthographic elements. Bi-grams may also denote sequences of [noun + prepositional phrase], for example in [school of fish], or [verb + preposition], as seen in [went to town]. Tri-grams are also important as they capture larger sequences or reveal more information in terms of what morpheme is present after a collocation, whether it is a preposition, another noun, or the verb. However, one caveat to analysing bi-grams, and in particular tri-grams, is the data-sparseness issue.
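The n-gram counts described above can be sketched as follows, in Python, as an illustration rather than the study's own software. The helper names `ngrams` and `top_ngrams` are hypothetical, and the default of 100 simply mirrors the size of the feature set mentioned in the text. The sketch assumes one token list per text, so that no n-gram spans a text boundary.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(tokens, n, k=100):
    """Return the k most frequent n-grams with their counts."""
    return Counter(ngrams(tokens, n)).most_common(k)

# Toy example built on the [went to town] pattern quoted in the text.
tokens = "went to town went to school".split()
print(top_ngrams(tokens, 2, k=1))

# Counting hapaxes for n = 1 and n = 2 shows how sparseness grows with n.
for n in (1, 2):
    counts = Counter(ngrams(tokens, n))
    print(n, sum(1 for c in counts.values() if c == 1))
```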
By dividing the corpus into these constituent parts there is a reduction in the amount of data that can be analysed, as more hapax-, dis-, and tris-legomena are generated.

4.1.2 Key Word In Context concordances (KWIC)

Key Word In Context concordances (or KWIC, see Biber et al., 1998:26-28) are another method in corpus linguistics used to identify stable linguistic sequences, such as those that can be extracted with bi-gram and tri-gram data. The idea behind this method is to analyse a particular lexical unit within the context in which it appears. This method is used in dictionary compilation and the production of second language education literature. A KWIC listing provides all the contexts for a given word. The lexical unit under investigation is called the node, and appears in the centre; tokens before and after the node can be extracted to see which words form ‘collocates’ (Biber et al., 1998:36), or set-phrases. KWIC data can often locate regular patterns of usage or lexical-association patterns (see Biber et al., 1998), and the analysis does not need to be restricted to neighbouring words, but may also include words appearing two or three places from the node, reflecting grammatical-association patterns (see Biber et al., 1998). This method may reveal possible collocations in the rongorongo corpus or identify a grammatical structure or pattern that may not be discovered by palaeographic methods. Combining all the texts in the corpus into one text file makes it more apparent which patterns are productive in more than one text, and by how much. This allows the data to be filtered for those key structures that appear with a high frequency, and for the structural context in which they appear. This is particularly useful for disambiguating homophones such as a number of grammatical particles found in Polynesian languages: the particle /i/ in Maori, for example, has around seven different uses, including as a tense marker and a dative.
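A KWIC listing of the kind described above can be sketched as follows, in Python, as an illustration only. The function name `kwic` and the `width` parameter (the number of tokens kept on each side of the node) are hypothetical.

```python
def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for each occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]   # up to `width` tokens before
            right = tokens[i + 1:i + 1 + width]  # up to `width` tokens after
            lines.append((left, tok, right))
    return lines

# Toy sequence assembled from particles quoted in the text, not an attested line.
tokens = "ki roto ki te ia i roto i te koe".split()
for left, node, right in kwic(tokens, "roto", width=2):
    print(" ".join(left).rjust(12), f"[{node}]", " ".join(right))
```

Sorting the returned tuples by the first token of the right (or last token of the left) context is a common next step for spotting recurring collocates.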
Another instance is the particle /te/, which may appear in either the nominal or verbal frame (De Feu, 1996). Consequently, KWIC lists allow us to explore and understand the structure of the corpus.

4.2 Authorship-attribution methods

Authorship-attribution denotes a group of methods used to identify the literary-genre or author of a text. In short, it is a 'special form of text classification in which the classes are essentially just the works of given authors' (Mosteller and Wallace, 1964:xvii). A significant body of research based on the use of statistical and computational methods was brought about by Mosteller and Wallace (1964), and has since been expanded in modern studies to include more advanced quantitative methods, such as machine learning, neural networks, and plagiarism detection. Research shows that different literary genres can be determined by underlying grammatical structures, including collocations, function words, and phrases; these are just some of the indicators used to identify the stylistic 'finger print' of a particular author (Baayen, 2002).

As noted previously, relative frequencies of glyphs in the rongorongo corpus will not provide sufficient statistical evidence to support any generalisations. Therefore, other methods have been sought that will support and contextualise the frequency data, in the form of meta-data recording the source, year, context, and register of a text. Much of this information is missing in the rongorongo corpus and we do not know which tablet represents which genre of text. Authorship-attribution studies provide a set of methods that are non-parametric and capable of processing multivariate data, and of highlighting categories or clusters in the data. The first issue to address in authorship-attribution studies is the selection of features that ‘most accurately summarise an author’s style’ (Peng and Hengartner, 2002:175).
Mosteller and Wallace (1964) discovered that function words were more successful as discriminators of authorship as they are context independent, unlike content words (see Mosteller and Wallace, 1964:17-39). This is supported by Peng and Hengartner’s study, which shows that function words are the most successful in discriminating between texts because they show high variability and frequency (Peng and Hengartner, 2002:184). Peng et al. (2003) use character-level n-grams. This involves removing all white space in the corpus, which reduces the data to one long block of text. This is then divided into single characters (letters) and frequencies are generated. Peng et al. (2003) show that by using groups of three characters as features, the classification of their Greek data-set is improved by 18% compared to previous studies (Stamatatos et al., 1999; 2000; 2001). In addition, the methods are language independent, which is demonstrated through a parallel analysis of Chinese and English data (Peng, 2003:6). Holmes (1992) uses multivariate analysis, including principal component analysis and tests for lexical-richness, to assess whether Mormon texts are attributable to more than one author. A study by Tomatsu (2006), by contrast, investigates the literary motifs adopted by Japanese authors in order to identify three of the most traditional authors of Japanese prose. This is achieved through analysis of word-frequency and length, but also by investigating each author’s use of hiragana, katakana, and kanji (Tomatsu, 2006:5).

It is possible to use the unit of analysis proposed by Peng et al. (2003), as Horley (2005) demonstrates that we can further subdivide the main glyphs into smaller units, which may represent syllables. Although this is not the unit of analysis adopted by this study, it would no doubt be interesting to compare the results from this corpus with the others to see if a smaller unit of analysis performs better, as concluded by Peng et al. (2003).
This will be the focus of a future study.

Lexical-studies demonstrate that the top-frequencies in corpora are generally occupied by ‘function words’ and the lower frequencies tend to be the domain of ‘content words’ (Baroni, 2006:5). In this study, word-frequency data is filtered for the hundred most productive function words in the language and glyph corpora, which act as the feature set. This is an arbitrary figure, but in accordance with previous research by Mosteller and Wallace (1964:28), and Pozdniakov and Pozdniakov (2007). Despite not knowing what the set of a hundred glyphs consists of in terms of 'functional' units, it is reasonable to assume that the most frequent glyphs have some structural significance in their usage. The Santiago staff (Ia) is often excluded from statistical analysis because of the high occurrence of the 'phallus' (glyph <76>) and because the large size of the text causes glyph <76> to appear in the top rank in the whole corpus of rongorongo texts. In other texts, sequence <380.1> dominates the frequency ranks. This is advantageous for this study, which looks for variability between texts, as these glyphs appear to be attributable to certain tablets or small sections of tablets, which makes them good candidates as discriminators of rongorongo literary genre.

To summarise, this study uses the texts and their n-grams of whole morphemes or glyphs as the unit of analysis for authorship-attribution methods. The term ‘genre’, when applied to the language corpora, is used in its traditional sense, i.e. to refer to a specific literary-style. However, when referring to the groupings of the rongorongo tablets, it is used as a convenient label to describe these groupings, but without implying a specific genre.
Authorship-attribution methods will identify the elements that categorise the tablets into their respective groups, extracted through word-frequency, lexical-richness, principal components analysis (PCA), factor analysis, and latent semantic analysis (LSA), which are commonly adopted in information-retrieval studies, forensic linguistics, and psycholinguistics.

4.2.1 Principal components analysis

Principal components analysis (PCA) is part of a series of methods involving the analysis of more than one variable (termed multivariate analysis). PCA applies advanced statistical methods in order to reduce the number of variables, particularly if they offer little discriminatory value. One of the most attractive aspects of PCA is its non-parametric nature, which is beneficial as it removes part of the subjective nature of choosing the factors that best define multivariate data-sets. PCA takes a matrix of data and tries to reduce the number of dimensions necessary for identifying the position of the data-points (Baayen, 2008). If this process were visualised in three-dimensional space, a cube would contain all of these data-points as coordinates. This cube can be rotated using varimax or promax rotation [10] (Baayen, 2008), providing a new view of the data, until an interesting structure is identified. This is achieved through 'rotating the axes in such a way that you get two new axes in the diagonal plane of the original, unrotated axes' (Baayen, 2008:120). Once the cube is optimally positioned, the principal components (PCs) that capture the most variance are extracted. PC1 therefore contains the points explaining the majority of the variation, with subsequent PCs explaining less and less. When plotted in a scree plot, the most important PCs are easily recognisable (see results below). The rotation-matrix contains the values (loadings) needed to plot the data in a scatter-plot (see Baayen, 2008:124).
These loadings are proportionate to the correlation of the word or glyph frequency-counts. In this way the scatter-plot reveals interesting structures or clusters in the data.

Baayen et al. (2002) discovered that in a strictly controlled experiment involving groups of authors with very similar backgrounds and training, there exists an authorial 'finger print' or style. They note that a principal components analysis of the most frequent function words does not provide any insight into authorial structure where controls are in place, but other principal component studies (Baayen, 1996), where the authors have very different backgrounds or training, or span different time periods, are more successful at categorising authors (Baayen et al., 2002:6). Since the rongorongo tablets were likely carved by different scribes (Routledge, 1919), this method may be successful.

The results of the PCA (below) are based on the methods proposed by Baayen (2008) in a study of 'affix productivity' from Baayen (1994). Here Baayen (2008) investigates 'the extent to which the productivity of an affix is co-determined by stylistic factors' (Baayen, 2008:118). Baayen's analysis is based on a matrix of n-gram counts: the frequency counts of 27 derivational affixes, over 44 texts with different genres. Baayen's results show that the productivity of the affix can be used as a means to group texts according to genre. In addition, PCA makes it possible to reduce the number of dimensions from 27 to only 3, without losing too much of the structure in the data, whilst accounting for 76.6% of the variance in the data (Baayen, 2008:122).

[10] Promax rotation tends to provide the better fit, as it allows the factors to be correlated. Varimax rotation assumes that there is no correlation between factors, and is applied when the primary aim is on the generalisability of the findings (see Baayen, 2008:127).
In conclusion, PCA is a data-reduction method allowing an informed choice on which dimensions are important for explaining the underlying structure of a large multivariate data-set.

This paper applies principal components analysis to evaluate the available data-sets (DD1-DD4) to identify the potential for analysing rongorongo literary-genre. The analysis assesses the extent to which the rongorongo texts exhibit any form of clustering or clearly defined groups. The data-set used by PCA is based on the results from the frequency-count data. The 100 most productive glyphs and morphemes are chosen as the feature-set (or dimensions). The assumption is that these features capture the most productive grammatical particles in the Polynesian language and rongorongo data. Of course, it is not expected that a one-to-one correspondence will be found. In other words, despite glyph <1> being the most productive in certain texts, it does not necessarily entail that it relates to the most frequent word in the Rapanui language data: the definite article /te/. The highest occurring glyphs in the rongorongo corpus may be ‘functional units’ of a kind, but they would be restricted to the writing system, due to the disparity between language and writing. Whether this is the case or not, it can be assumed that the top-ranking glyph-frequencies represent those glyphs that have a functional or syntactic role in the script.

A matrix of glyph-frequencies for each text is used as the feature-set, which is selected according to the criteria adopted by Mosteller and Wallace (1964:45). They recommend that the 'pool of potential discriminators should be large enough, say, 50 to 1000, to offer a good chance of success'. In addition, Pozdniakov and Pozdniakov (2007:12) consider the 100 most productive glyphs to be appropriate for their statistical analysis. This paper therefore adopts the same principle.
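The PCA step on a texts-by-features matrix can be sketched as follows. This is a Python and NumPy illustration under stated assumptions, not the R analysis used in the study. Counts are first converted to relative frequencies per text and each feature is then centred and scaled to unit variance. Both choices are assumptions made here, and the function name `pca` is hypothetical. In the study's setting the columns would be the 100 most frequent glyphs or morphemes.

```python
import numpy as np

def pca(counts):
    """PCA of a texts-by-features count matrix via singular value decomposition.

    Returns (scores, loadings, explained), where explained[i] is the share
    of variance captured by principal component i + 1.
    """
    X = np.asarray(counts, dtype=float)
    # Assumption: use relative frequencies so long texts do not dominate.
    X = X / X.sum(axis=1, keepdims=True)
    # Centre each feature and scale it to unit variance.
    X = X - X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    X = X / np.where(sd > 0, sd, 1.0)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U * s                    # coordinates of each text on the PCs
    loadings = Vt.T                   # weight of each feature on the PCs
    explained = s**2 / np.sum(s**2)   # the values shown in a scree plot
    return scores, loadings, explained

# Toy matrix: 4 texts (rows) by 3 features (columns); invented counts.
counts = [[10, 0, 5], [8, 1, 6], [1, 9, 5], [0, 10, 4]]
scores, loadings, explained = pca(counts)
print(np.round(explained, 3))
print(np.round(scores[:, 0], 3))
```

Plotting the first two columns of `scores` gives the kind of scatter-plot in which clusters of texts would become visible, and `explained` supplies the heights for a scree plot.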
4.2.2 Factor analysis

Factor analysis is used to 'examine how underlying constructs influence the responses on a number of measured variables' (DeCoster, 1998:1). Factor analysis is another multivariate technique that identifies structures in large data-sets, and is an extension of PCA. It differs in that an error term is applied to the model to account for the presence of possible noise in the data (Baayen, 2008:127). The factor analysis requires a matrix of counts, which in this study is composed of the features used in the PCA. Columns represent glyphs or morphemes, and rows hold the counts for each text in the corpus. A technique called factor rotation is then applied to this matrix, providing a simpler interpretation of the data. This method is more successful where there are high loadings on a few factors (Baayen, 2008:127). The data-points can also be rotated through varimax or promax rotation methods. Promax rotation involves rotating the loadings through a general linear transformation of each factor in order to get a better idea of the underlying structure, whilst preserving the variance of the factors. This will therefore provide a better view of the separation between texts. For the purposes of this study two factor analyses (one varimax-based and one promax-based) are produced in order to demonstrate the difference. In this study three factors are chosen, as fewer than three factors provide hardly any separation in the data, and too many cause the points to be too dispersed. The results of the PCA provide some clue as to how many factors should be selected, based on the number of significant principal components (PC), which are identified by a scree-plot (see the results section, below).

4.2.3 Tests of lexical-richness

Data from vocabulary growth curves (VGC) and frequency-spectra provide the basis for a model of lexical-richness. Relative frequency counts are susceptible to the issues associated with sample-size and LNRE (see Baayen, 2001).
VGCs make it possible to adjust the sample-size of each text. This involves interpolating the samples, for smoothing, and then extrapolating the interpolated curve to the size of the largest text through a LNRE model (Baayen, 2001): for example, the finite Zipf-Mandelbrot model (fZM), or the Generalised Inverse Gauss Poisson model (GIGP), illustrated by Baroni (2006; 2007; 2009). Once an acceptable goodness-of-fit is achieved, the model provides the basis for text comparison with Baayen’s P (Baayen, 2001). Baayen’s P is equal to the number of hapax-legomena divided by the number of types in the sample (V1/Vm). High values show that a text is lexically-rich. Many of the hapax-legomena are likely to be content words, and many of these may be nouns denoting names of people, places, or things. As a result, it is expected that a genealogical text should be particularly rich in content-words, as it references names of people, events, and dates, compared to a prose or narrative text.

4.2.4 Latent Semantic Analysis (LSA)

Another method adopted is latent semantic analysis (LSA), which is based on a 'representation that captures the similarity of words and text passages', known as a ‘semantic space’ (Landauer and Dumais, 1997:211). LSA works by representing the words used in a corpus of texts, which may range from the size of a whole essay to just the key words appearing as a title in a document (Foltz et al., 1998:4). LSA applied to information-retrieval is shown to increase accuracy by up to 30% (Dumais, 1991), despite differences in language use. Wolfe et al. (1998:1) determine whether the acquisition of knowledge is dependent on knowledge already acquired, and how the complexity of a text has an impact on acquisition. They apply LSA to the problem of knowledge induction, by examining the entries listed in an encyclopaedia on the 'human heart', and comparing the results to a student-based survey on the same subject.
LSA has found applications in modelling language acquisition (Landauer and Dumais, 1997), word-disambiguation (Pino and Eskenazi, 2009), plagiarism detection, and computational biology, applied to protein sequencing (Dong et al., 2006). LSA is particularly relevant for the study of rongorongo, as it provides an insight into how glyphs are used by calculating the 'approximate estimates of the contextual usage substitutability of words in larger text segments' (Foltz et al., 1998:3). Applying LSA to the glyphs (see below) retrieves lexical-association patterns (Biber et al., 1998), and the degree of correlation between the individual lexical-components. Although the analysis of bi-grams and tri-grams captures similar patterns, the output is often cluttered with incorrect collocates, which are included in the counts of hapax-legomena, and result in false measures of lexical-richness (see Baayen, 2008; Evert and Baroni, 2006). The data produced by LSA requires no manual processing, as a threshold can be set, typically >=0.7, to return only the most correlated texts and glyphs (see Landauer et al., 2004). The texts and words (or glyphs) are viewed as a high-dimensional multivariate space where each word is listed row-wise and the associated texts across columns, with the counts in each cell, creating a high-dimensional matrix. The Singular Value Decomposition (SVD) model (see Manning et al., 2008:407) provides LSA with the ability to represent the mechanisms underlying human knowledge (Landauer and Dumais, 1997). It achieves this by optimising the 'prediction of the presence of all other events from those currently identified in a given temporal context, and does so using all relevant information it has experienced' (Landauer and Dumais, 1997:217). The terms context and event refer to the phenomenon being described.
In this paper, context refers to an individual text in the rongorongo corpus, and event relates to a single instance of a glyph. In other studies (Landauer et al., 2004), an event may also be a word, a syntactic construction (through corpus tagging, see Biber et al., 1998), or a paragraph. Likewise, a paragraph can be considered a context. LSA has been empirically tested on a variety of data (Dumais, 1991; Landauer and Dumais, 1997; Foltz et al., 1998; Wolfe et al., 1998; Landauer et al., 2004; Dong et al., 2006; Pino and Eskenazi, 2009), and it appears to give consistent results. These tests demonstrate that the model is particularly robust with matrices of 300 dimensions; however, performance drops below 100 dimensions, and above 1000 dimensions, due to limitations in computational power (see Landauer et al., 2007:71). To illustrate, in an initial study as part of this paper, processing a corpus of 46 texts, with approximately 100,000 words each, required nearly 32GB of memory, on a high-end laptop fitted with only 2.1GB (as of 2010). Therefore, LSA may not be able to process large corpora comparable to the Brown Corpus, unless a series of smaller sample texts are extracted. An SVD model is generated, in a similar way to factor analysis, and the data-matrix is transformed into a series of co-ordinates in high-dimensional space. This allows the researcher to explore the correlation between texts, words, or the correspondence between the two. In this paper, Pearson's correlation is used as the measure of similarity between texts, or glyphs. One of the main advantages of using LSA in a corpus-based analysis is that it captures the fact that:

[…] if a particular stimulus, X has been associated with some other stimulus, Y, by being frequently found in joint context, and Y is associated with Z, then the condensation can cause X and Z to have similar representations. (Landauer, et al., 1997:217).
In short, the model measures the probability of a word joining with other words to form larger clauses, given their productivity in the current context (Landauer et al., 1994:5215). This entails that LSA ignores the ordering of constituents, as it is purely interested in the ‘meaning’ attached to the documents as opposed to the words contained within, unlike corpus linguistic methods, which adopt the token as the unit of analysis (see Biber et al., 1998; Gries, 2009). The resulting data is normally plotted in a graph to explore possible relationships between text-text, word-word, and text-word dependencies. LSA is predicted to perform better than previously mentioned methods, as the accuracy of a factor analysis relies on: how the method is applied, the decisions behind selecting representative features, and the number of 'significant' factors. LSA processes all texts and words, or glyphs, in a corpus to create a latent-semantic space from which additional functions are applied (see Landauer et al., 1997; Wild, 2009). In this paper, the rongorongo corpus is transformed into a matrix, with the contexts (tablets) organised by column and the events (glyphs) arranged in rows. The frequency-counts are calculated for each event in each context and assigned to the corresponding cells. A Pearson correlation is calculated and only the results meeting a >=0.7 threshold are retained. The results are presented as a dendrogram to make the relationship between groups more apparent (see below). A correlation value of 1 means that the given word or glyph was expected to appear in an infinite corpus of texts. Lesser values naturally imply that the word or glyph in question is unexpected in the current context. This captures the grammatical or syntactic association between glyphs, highlighting any potential compounds, or structural correspondence.
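The pipeline described above (a glyph-by-tablet count matrix, an SVD, Pearson correlation between contexts, and a >=0.7 threshold) can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation used in this study: the function names, the number of retained dimensions k, the text labels, and the counts are all hypothetical, and no term weighting is applied.

```python
import numpy as np

def lsa_document_correlations(counts, k):
    """Pearson correlation between contexts in a k-dimensional SVD space.

    counts: events (glyphs) in rows, contexts (texts) in columns.
    """
    A = np.asarray(counts, dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Co-ordinates of each context on the first k singular dimensions.
    doc_vectors = (s[:k, None] * Vt[:k, :]).T
    return np.corrcoef(doc_vectors)     # rows are treated as variables

def pairs_above(corr, labels, threshold=0.7):
    """Return (label, label, r) for every pair meeting the threshold."""
    pairs = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if corr[i, j] >= threshold:
                pairs.append((labels[i], labels[j], round(float(corr[i, j]), 3)))
    return pairs

# Hypothetical counts: 6 glyphs (rows) x 4 texts (columns).
# T2 is an exact multiple of T1, so the two should correlate perfectly.
labels = ["T1", "T2", "T3", "T4"]
counts = [
    [4, 8, 0, 0],
    [2, 4, 0, 1],
    [0, 0, 3, 2],
    [0, 0, 5, 6],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
]
corr = lsa_document_correlations(counts, k=3)
print(pairs_above(corr, labels))
```

At least three dimensions are kept here because a Pearson correlation computed over only two co-ordinates is always +1 or -1 and is therefore uninformative. A term-term analysis follows the same pattern with the roles of U and Vt exchanged.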
For example, what motivates the presence of glyph <1> given its presence as an independent or main glyph, or as a prefix in the compound <1.6>? To assess the success or failure of LSA methods applied to glyphs, a term-term analysis is performed on an Egyptian corpus, since the values of the glyphs are known. A glyph is selected on the basis of glyph-frequency data. As with the document-document analysis, a threshold of >=0.7 is set in order to return the highest term-to-term similarity scores (positive correlation values). The final results are compared to the factor analysis, bi-gram, tri-gram, and Key Word In Context concordance data (KWIC) for validation.

Chapter 5 Presentation of findings

5.1 Feature selection

The feature list is generated from the 100 most productive glyphs and morphemes, and a frequency-matrix created with the relative-frequency of the features across all texts. The feature-list was also enriched with current hypotheses concerning the significance of particular glyphs, for example the crescent moon glyph <40>, plus variants: <41> mirrored, <42> subscript, and <43> superscript (Guy, 1990; Fischer, 1997; Berthin and Berthin, 2006; Melka, 2008; 2009b; 2010; Horley, 2009), and the text-divider glyph on the Santiago staff (Ia): <199> and <999> (Fischer, 1997). Due to issues associated with relative-frequencies, there is some imbalance in the final features selected for classification. For example, tablets X, Y, and Z have only two glyphs each, so each glyph is attributed a relative-frequency of 0.5 (or 50%). Therefore, when sorting the n-grams in descending-order, they are ranked amongst the top-frequencies. Consequently it was necessary to remove them from the rest of the analysis. This decision was extended to texts J, M, and W due to their small sample-size, or fragmentary nature (damaged areas).
5.2 Principal components analysis

After the initial feature selection, a PCA was performed (Baayen, 2008), and a scree-plot produced to identify how many PCs are required to explain the majority of the variance in the data-set, hence enabling the reduction of the number of variables. The general rule adopted is that only the significant PCs accounting for over 5% of the variance should be selected for further analysis.

Figure 8. Scree plots (rongorongo data-sets)

The scree plot shows that the first three PCs explain the majority of the variance. These PCs account for almost 75% of the variance in data-set 4 (see figure 8).

Figure 9. PCA (rongorongo dataset)

The graph (figure 9) shows the resulting PCA scatter plot. The data-points show some separation and what appears to be a cluster of texts (referring to PC1 against PC2), with a few texts separated from this main group. In addition it is possible to see how much variability there is in the data on different combinations of PCs; for example, PC2 and PC3 show less scatter than the comparison between PC1 and PC2. It is not clear which texts are associated with which clusters. However, PCA provides a means for reducing the number of variables needed to explain the underlying structure of the data-set. In addition, it is possible to spot any clustering, which is important for the remainder of the analysis, as it is based on text-categorisation methods.

5.3 Factor analysis: Rongorongo corpus

A factor analysis, with varimax rotation, was applied to the same frequency-matrix as the PCA. Three factors were selected, and a separate analysis performed with promax rotation. The factor analysis reveals the actual texts that are part of the clusters identified by the PCA. One issue that needs to be addressed is how to explain what these groups represent, and how to justify the inclusion of a text with any of the members in its cluster.
There are roughly four main groups, measured by their proximity to other texts as highlighted by the curves dividing the texts (see figure 10). These texts are distributed across the graph, forming a left-right distinction (highlighted by the dashed line). It is not possible to state with any certainty where the texts should be divided. As an example, text La could be part of group 3 or 4, as it is almost equal in distance from any single text in either group. It is apparent that these results alone do not make it possible to make any definite decision, though they do provide a basis for exploring the corpus.

Figure 10. Plot of factor analysis of rongorongo (DD4) with varimax (left), and promax rotation (right).

The texts are quite dispersed in the first plot (figure 10, left), but the promax-based analysis shows clearer clusters forming (figure 10, right). The texts of group 1 (H, P, and Q) are known parallel-passages and are expected to be similar in terms of genre (see in particular Barthel, 1958:82-168; Horley, 2007a). As for group 2, texts Gr, Kr, and Kv cluster together at the top of the plot, joined by two additional texts: Na and Ev. The similarity of the Small Santiago (Gr) and London tablet (Kr and Kv) is confirmed by internal analysis of the script, showing that tablet (K) is a copy of one side of the Small Santiago (Gr) (see Barthel, 1958). In group 3, the Mamari tablet (Ca and Cb) is at the centre of the plot, with texts Rb, Xa, Sa, and Fa in close proximity. This may indicate that these texts contain some content similar to tablet C. This group is partly driven by the presence of the 'crescent moon' glyph <40>, occurring twice on the Stephen-Chauvet Fragment (lines Fa3 and Fa4).
By comparison, there are many more instances of this glyph on tablet C, and on Er (another possible calendar sequence, see Wieczorek, 2010), which is also in proximity to tablet C, but seems to form its own group with the verso of the Small Santiago (Gv), and one side of the Stephen-Chauvet fragment (Fb), designated group 4. Group 4, in the bottom-left of the plot, includes the Santiago staff (Ia). This text is the most separated from the other groups, with only a number of fragmented-texts (Wa, Ya, and Va) included in this group. Internal analysis of the inscription shows that its structure is very different to the other tablets: an X1YZ structure, proposed by Fischer (1997). Text Ia is also joined by the Honolulu text (Ta). The similarity between tablets I and T is already noted by Barthel (1958:167). Turning to an assessment of the classification obtained, it was assumed that a correlation would exist between text Aa and texts Hr, Pr, and Qr, due to the shared sequence appearing on Aa1 (repeated on Hr5, Pr5, and Qr5, see Guy, 1985:383); instead, the opposite side (Ab) is grouped with these texts (see group 1). Horley (2007a) shows that this is attributed to the presence of shared glyph-sequences on Ab2 and Ab4, which are also present on Pr7. Text Aa has sequences in common with Hr, Pr, and Qr, including a group of glyphs <1-9a>, also present on text Ra (lines 5-8). This may point to a fixed glyphic compound, although on Ra the compound is expanded with the addition of fused-glyph <5> (Horley, 2007a:27), with further examples of this form on texts Bv and Sa. The factor analysis illustrates this to some degree, although text Sa is separated from the rest of the group, suggesting that there are other features with more influence over the categorisation: one sequence is the compound <380.1> on Cb. Barthel (1958), and Horley (2007:28) observe that this compound is quite productive (with instances on texts: Ab, Ca, Cb, Ev, Gr, Kr, Kv, and Na).
The factor analysis splits texts with this delimiter into two groups, group 2 and group 3, one appearing at the top-left of the graph, the other appearing in the middle and drifting to the lower part of the graph. The difference between these two groups, which share the compound <380.1>, may be caused by the presence of glyph <40>. The question to ask at this stage is whether these observations point to there being two list-types: one a genealogical text, the other a mix of genealogical and calendar sequences (for example a significant date attributed to a number of individuals). However, it is not possible to draw conclusions on this basis alone.

5.4 N-grams

An analysis of bi-grams and tri-grams reveals which glyphs are driving the classification of texts in the factor analysis. It appears from frequency-data that the four groups can be reduced to at least three, due to the presence of the previously mentioned shared-sequences. The frequency-data also shows which texts contribute little to the classification of each group of texts. There is cause to rearrange some texts from the factor analysis, as they share more bi-grams and tri-grams with texts in other neighbouring groups (for example, texts Ev and Gr have some bi-grams in common with texts Gv, Ia, and Ta):

Table 6. The most productive bi-grams in the rongorongo corpus.

Table 6 presents the top 15 results from the bi-gram analysis. Only glyph-compounds that appear on several texts are retained, as they are likely to be true compounds rather than ‘junk’ strings caused by the concatenation process. The bi-gram data supports a distinction between three broad groups as opposed to the four illustrated in the factor analysis. Therefore the groups are re-arranged according to the number of shared bi-grams; for example, bi-grams on text Gv compared to Ia and Ta show that Gv should be moved to group 4. Also, group 2 (Ev, Gr, Kr, Kv, Na, in figure 10) shares more glyph-sequences with texts of group 4.
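The filtering step described above, in which only the glyph-compounds attested on several texts are retained, can be sketched as follows. The function names, the min_texts parameter, and the glyph sequences are hypothetical and serve only to illustrate the counting; they do not reproduce the corpus data behind table 6.

```python
from collections import Counter, defaultdict

def ngrams(glyphs, n):
    """All contiguous n-grams of a glyph sequence, as tuples."""
    return [tuple(glyphs[i:i + n]) for i in range(len(glyphs) - n + 1)]

def shared_ngrams(texts, n, min_texts=2):
    """Count n-grams per text and keep those found on at least min_texts texts.

    texts: mapping of text name -> list of glyph codes.
    Returns a mapping of n-gram -> {text name: frequency}.
    """
    counts = defaultdict(Counter)
    for name, glyphs in texts.items():
        for gram in ngrams(glyphs, n):
            counts[gram][name] += 1
    return {gram: dict(c) for gram, c in counts.items() if len(c) >= min_texts}

# Hypothetical glyph sequences for three short texts.
toy_texts = {
    "T1": ["380", "1", "3", "380", "1"],
    "T2": ["1", "380", "1", "22"],
    "T3": ["40", "41"],
}
print(shared_ngrams(toy_texts, 2))
```

Strings that occur on one text only, which are the most likely to be artefacts of concatenation, are dropped by the min_texts filter; raising the value makes the criterion for a 'true compound' stricter.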
There appears to be a correspondence between bi-grams and the type of texts they appear on. However, it is apparent that some texts are damaged, and therefore not reliably classifiable; for example, some texts have hardly any shared glyphs, or do not appear in the top-ranks (Fb, Ma, Oa, and Fa, respectively). It is therefore questionable whether these texts should be included. The texts in group 2 are more homogeneous: there are examples of the delimiter-list glyph-compound <380.1> across all texts in this group. Group 2 consequently needs some adjustment, or the removal of texts Er and Fb, as they share few bi-grams with the rest of the texts in this group. The statistics show that Fb has glyph-compounds in common with other texts, for example <200-96>, appearing once on both Fb and Gv, and the 'moon' glyphs. With so few occurrences it does not seem justifiable to include Fb in the corpus, and so it is removed from the rest of this analysis. Texts Ca, Cb, and Sa show some correspondence with group 2, again due to the presence of <380.1>; as a result, this classification still appears to be far from perfect. In conclusion, the factor analysis categorises the tablets with delimiters <380.1> and <76> on the left of the chart, and the more diverse texts, or those with less structured-lists (i.e. group 1), on the right of the plot.
Group 1 Aa Ab Br Bv Da Db Hr Hv Ma Nb Oa Pr Pv Qr Qv Ra Sb 6_74_3 3 5 2 3 1_62_6 2 2 2 10_144_3 5 7 3 15_22_3 8 10 4 3 22_3_8 3_22_3 3_306_3 3_600_3 3_93_3 440_440_440 6_1_62 62_6_1 1_9_5 1_95_3 10_2_10 Total 3 1 4 2 2 3 3 17 5 0 3 0 27 12 0 0 6 6 6 1 1 1 1 760_40_6 1_22_3 10_3_70 3_70_760 6 5 5 5 2 2 3 5 4 4 8 2 2 90_90_76 90_90_90 0_0_160 Total 0 3 2 3 3 378_41_670 390_41_378 41_378_41 66_760_4 49_3_76 76_73_3 76_90_76 90_76_70 8 2 1 1 1 3 2 2 3 2 2 3 2 3 1 30 10 22 6 3 0 6 0 39 0
Group 3 Ca Cb Rb Sa 40_40_40 8 41_670_8 7 670_8_78 7 8_78_711 7 3 2 2 2 2 3 2 2 260_1_4 3_90_76 430_76_530 430_76_55 4 3
Group 2 Er Ev Fb Gr Gv Ia Kr Kv Na Ta 1_380_1 3 5 2 3 380_1_3 30 8 11 1_2_34 3 2 1_4_711 2 2 1 1 1 33 12 13 5 10 6 380_1_22 5 40_390_41 5 70_760_40 5 Total 83 6 0 0

Table 7. The most productive tri-grams in the rongorongo corpus.

The tri-gram data reveals that texts Ca and Cb from group 3 should perhaps be included with group 2, due to the productivity of <380.1>. However, what seems to separate these texts is the presence of the 'moon' glyph <40>, plus variants (<41>, <42>, and <43>). This may indicate three general groups: those with glyph <40> and delimiter <380.1> (groups 2-3, figure 10); another with glyphs <76> and <380.1>, represented by the texts appearing in group 4 (figure 10); and those with fewer delimiter-glyphs (group 1, figure 10). The tri-gram data show that texts Ia and Ta contain both the <380.1> delimiter and <76>, with Gr showing a few occurrences of <1.380>, meaning they could be moved to group 3; however, they lack the fully expanded version <1-380.1> and the more common <380.1.3>. Once again, it seems necessary to revise the previously defined groups. Some texts show hardly any shared tri-grams: in group 1 (Aa, Br, Bv, Da, Db, Ma, Nb, Oa, Ra, and Sb), group 2 (Er, Fb, and Gv), and group 3 (Rb, and Sa).
In conclusion, at this stage of the analysis, it is assumed that there are at least three main groups, with maximally a fourth should the remainder of the analysis support one. A latent semantic analysis (LSA) will provide confirmation on the validity of the current classification by explaining the degree of correlation between each of the texts.

5.5 Factor analysis: Language corpus

Turning to the language corpora, the factor analysis failed to identify any particular groups; in fact, the programs reported an error in the factor analysis algorithm. The matrix was composed of only 312 texts, with the frequency-counts of the top 100 morphemes as features. To investigate the issue, a correspondence analysis was performed to show the amount of variance between the texts, and the loadings of each morpheme (see figure 11).

Figure 11. Correspondence plot of the influence of morpheme loadings on texts.

The correspondence plot shows that the data is very skewed, resulting in the texts being bunched together. This is because the features fail to draw out any significant difference in structure. Only two texts have separated from the rest: a Rapanui song (Ate-a-renga hokan iti poheraa) from Thomson (1889), and a Maori song (He Mata na te Kahu-kore) from Grey (1853). It is not possible, with these two results alone, to conclude that there has been any success. The two song texts are too far apart to suggest any relationship, and the scores on the factors account for only 15.6% of the variance in the data, hence the skew.

A possible explanation for this is polysemy. Synonymy and polysemy are rife in Polynesian languages due to the restricted set of phonemes (only nine consonants, ten if we include the glottal stop, and five vowels, see De Feu, 1996). There is a large set of homophones with more than one interpretation, particularly in the case of syntax markers: for example /a/ acts as a possessive (De Feu, 1996:11), and as a person marker (De Feu, 1996:12).
Tregear (1891) reveals further examples of: /a/ = 'collar-bone', 'god', 'an interjection', the plural of particle /ta/, and 'to drive', or 'urge' (Tregear, 1891:1). The word /hoki/ is a verb 'to return', but examples also exist of other word-types including: 'also', 'for', or 'because', and as the name for a fish (Tregear, 1891:79). Therefore, it is likely that many of these morphemes represent other word-classes such as noun, verb, and preposition. The matrix is composed of morphemes with the highest-frequency in the corpus, which are generally 'function words' (Mosteller and Wallace, 1964; Evert and Baroni, 2005; 2006). However, the actual 'function' of a morpheme proposed to be part of the syntax of the language is obscured by polysemy. In addition, there is evidence to suggest that Maori poetry has a different structure compared to everyday usage. The comments of Rev. R Maunsell illustrate the complications in interpreting Maori poetry:

We shall see that it was not only abrupt and elliptical to an excess not allowed in English poetry, but that it also carries its license so far as to disregard rules of grammar that are strictly observed in prose; alters words so as to make them sound more poetically; deals most arbitrarily with the length of syllables, and sometimes even inverts their order, or adds other syllables. (cf. Grey, 1853:xiii-xiv)

The issues attributed to these genres materialise in the form of omissions, including: articles /ko/ and /te/, /ai/, pronouns, particles i.e. /nei/, the nominative case, verbs, and prepositions. In addition, there is substitution of one preposition for another, and unusual or rare words are introduced or inverted (see Maunsell, 1862). The same phenomenon is paralleled in the rongorongo corpus, where whole passages may undergo 'significant re-phrasing' (Horley, 2007a:31). Consequently it is likely that these issues have an impact on the statistical analysis, as observed in the correspondence plot.
One way to address this is to tag the corpus; however, for the purposes of this study, the time constraints associated with handling large corpora (see Biber et al., 1998) were an issue, due to the different transcription principles and the multilingual nature of the corpora. Applying factor analysis methods to collections of multivariate data can be complicated, and many of the decisions relating to feature selection are fairly subjective, as is the subsequent selection of factors believed to be appropriate to elicit structures present in the data. As a result, it is not possible to make a decision regarding which of the rongorongo texts are similar in terms of genre based on these results alone. Therefore, we move to an analysis of lexical-richness through the application of frequency spectrum, Vocabulary Growth Curves (VGC), and Baayen's Productivity Index (see Baayen, 2001).

5.6 Frequency spectrum and VGCs as tests for lexical-richness

As discussed previously, a test of lexical-richness based on the application of Baayen's P, or productivity-index, is calculated by dividing the number of hapax-legomena by the number of types (V1/Vm). This calculation will provide a measure of how lexically-rich the texts are. A lexically-rich text will have a high productivity-index, as more and more new words are encountered as sample-size increases. A narrative text, for example, may have a low score, and will display a straight line showing that the types have been fully sampled, and no more are expected, despite an increase in sample-size. In comparison, a list of unique individuals or entities will show a steep curve upwards, as each new type V is a hapax-legomenon (V1).
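The calculation described above can be sketched in a few lines. The sketch follows the definition given in the text (hapax-legomena divided by types); the function names and the two toy token lists are hypothetical, and the example is computed on raw samples rather than on the extrapolated values produced by an LNRE model.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map m -> V_m, the number of types that occur exactly m times."""
    type_freqs = Counter(tokens)
    return Counter(type_freqs.values())

def productivity_index(tokens):
    """Hapax-legomena divided by types (V1 / V), as defined in the text."""
    spectrum = frequency_spectrum(tokens)
    v = sum(spectrum.values())      # number of types
    v1 = spectrum.get(1, 0)         # number of hapax-legomena
    return v1 / v if v else 0.0

# Hypothetical samples: a list-like text where most types occur once,
# and a repetitive text that re-uses a small vocabulary.
list_like = ["ko", "a", "ko", "b", "ko", "c"]
repetitive = ["te", "a", "te", "a", "te", "b"]
print(productivity_index(list_like), productivity_index(repetitive))
```

In the list-like sample three of the four types are hapax-legomena, whereas in the repetitive sample only one of the three types is, which is the contrast between lists and narrative that the measure is intended to capture.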
To compensate for the difference in sample-size, which distorts measures of lexical-richness, a Generalised Inverse Gauss Poisson LNRE model was implemented, as it 'achieves excellent results' (Evert and Baroni, 2005:14), in order to account for the number of unseen or expected types (E[Vm]) predicted by the Large Number of Rare Events (see Baayen, 2001). Compared to the finite Zipf-Mandelbrot model (fZM), which supposedly adjusts the top and bottom ranks to comply more fully with Zipf's (1949) straight-line in a rank/frequency plot (see Evert and Baroni, 2006), the Generalised Inverse Gauss Poisson (GIGP) model is more efficient at modelling the hapax-legomena than the rest of the frequency spectrum. This was the same result for all texts, and GIGP was therefore selected as the best fitting model for extrapolating the texts to the larger sample-size.

Figure 12. Comparison of the number of types predicted by fZM and GIGP LNRE models of text Aa (E[Vm]), and the number of observed types (Vm).

A model for each text is created by applying binomial interpolation to the empirical growth curve, followed by extrapolation up to the size of the largest text (see figure 13 for an example). This provides a smoother curve by computing a series of randomised permutations over the distribution of the words (or glyphs) to compensate for the non-randomness of word distributions, which would otherwise result in an incorrect model of the extrapolated VGC curve. The interpolated VGC is extrapolated to the sample-size of the Santiago staff (Ia), the largest of the texts at N(2594). The results are plotted alongside the empirical VGC of each text. The parameters of the model are estimated by the zipfR package (Evert and Baroni, 2006); see figure 12, above, and table 8, below.

Parameters
Shape          gamma = -0.29
Lower decay    B = 0.05
Upper decay    C = 0.02
Zipf size      Z = 50.29

Goodness-of-fit (multivariate chi-squared test)
X2      df    p
2.17    4     0.71

Table 8. Parameters and resulting VGC generated by a Generalised Inverse Gauss Poisson (GIGP) LNRE model

The parameters of the model can be adjusted to provide a better fit; however, this would involve a trial and error approach for each individual text. The goodness-of-fit based on the estimated values is close enough for the purposes of this study. To assess the model's success, a high p-value and a low chi-squared score are required (see Baayen, 2008).

      Observed    Expected
V     225         229
V1    103         103
V2    38          38
V3    20          20
V4    15          13
V5    8           9

Table 9. Observed and modelled values for ranked-frequencies for text Aa.

Despite the poor estimation of the frequency of words at ranks V4 and V5 (those occurring four and five times, respectively), the only values actually required for Baayen's P are the total number of expected types (V) and the hapax-legomena (V1). The main benefit of applying such a model is to address the issues relating to the comparison of texts at differing sample-size. Below (figure 13) is an example of the resulting fit for text Aa, showing the empirical, binomially interpolated, and extrapolated curve at the size of the largest text (Ia, N=2594).

Figure 13. Number of types V(N) compared to the expected number of types E[V(N)] for empirical and modelled VGC (interpolated and extrapolated curves).

The end of the empirical curve of text Aa is apparent at approximately N(1000), where the straight line reflects no further growth in vocabulary. The interpolated curve follows the slope of the empirical curve fairly well until about N(450), where the empirical curve begins to diverge, as observed by Baayen (2001), and Evert and Baroni (2006), in their studies; hence the reasoning behind extrapolating from a binomially interpolated growth curve, and not from the empirical VGC [11]. After fitting the GIGP model, Baayen's P is calculated over each text, and plotted for comparison (see figure 14).

[11] The interpolated-curve smooths the empirical curve in order to prevent the extrapolated curve from being calculated on wildly fluctuating values caused by the LNRE and the non-randomness of word distributions.

Figure 14. Measure of Baayen's P for texts in the rongorongo corpus at N(2594).

Comparing the groups identified by the factor analysis with the results of the lexical-richness test shows some correspondence between the graph above (figure 14) and the factor analysis (figure 10). For example, texts Ab, Bv, Hv, Pv, La, and Sb (the texts forming group 1) have similar Baayen's P values, and are joined by texts Ia and Ta, which show a similar amount of lexical-richness. The variation between Gv and texts Ia and Ta, despite their similarities in terms of glyph usage (glyph <76> delimiter), is now more apparent: Gv is lexically-richer. In addition, texts with the delimiter <380.1>, or with many list-like structures (Horley, 2007a), are generally those with a Baayen's P value lower than 0.2. In conclusion, texts Da, Gv, and Aa are the most lexically-rich texts, and the right-side of the plot illustrates the list-like texts, which are the least lexically-rich. Given the assumption that lists are lexically-richer than narrative discourse, the plot does not appear to support this hypothesis. However, texts Aa, Bv, Cb, Da, Gv, Ia, Sb, and Ta do actually appear to have a large number of list-like sequences, for example: texts Aa and Ab with glyph-sequence <25.9:5, and variant 1.9:5>; on tablet Br, a sequence marked with glyph <384> in line Br3, later replaced by glyph <63> in Br6; and Cb with <4-760> and <380.1> glyph delimiters. However, if texts are composed of more than one text per tablet-side, the model may not estimate the correct VGC for some texts. This is where proper segmentation is required in order to be sure that these results are valid. Inspecting the resulting VGC of each text reveals whether this holds true.
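The interpolation step used to smooth the growth curves can be illustrated with the closed-form binomial expectation given by Baayen (2001), which yields the average curve over random orderings of a sample without carrying out the permutations explicitly. The sketch below is an assumption about how such a curve may be computed, not a description of the internals of the software used in this study; the function names and the toy sample are hypothetical, and extrapolation beyond the observed sample-size, which requires an LNRE model, is not covered.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map m -> V_m, the number of types that occur exactly m times."""
    return Counter(Counter(tokens).values())

def interpolated_vgc(tokens, sizes):
    """Expected types E[V(n)] and hapaxes E[V1(n)] at sample sizes n <= N.

    Binomial interpolation:
      E[V(n)]  = V - sum_m V_m * (1 - n/N)**m
      E[V1(n)] = sum_m V_m * m * (n/N) * (1 - n/N)**(m - 1)
    """
    N = len(tokens)
    spectrum = frequency_spectrum(tokens)
    V = sum(spectrum.values())
    curve = []
    for n in sizes:
        p = n / N
        ev = V - sum(vm * (1 - p) ** m for m, vm in spectrum.items())
        ev1 = sum(vm * m * p * (1 - p) ** (m - 1) for m, vm in spectrum.items())
        curve.append((n, ev, ev1))
    return curve

# Hypothetical sample of 7 tokens: 4 types, 2 of them hapax-legomena.
sample = ["a", "b", "a", "c", "a", "b", "d"]
for n, ev, ev1 in interpolated_vgc(sample, range(0, len(sample) + 1)):
    print(n, round(ev, 2), round(ev1, 2))
```

At n = N the expected values coincide with the observed number of types and hapax-legomena, and at n = 0 both are zero; in between, the curve is free of the local fluctuations that the order of the tokens introduces into the empirical VGC.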
As an example, the Santiago staff (Ia) has a sharper increase in new types as the number of glyphs encountered (N) increases (see figure 15). However, the empirical VGC reveals the possible presence of multiple sub-texts.

Figure 15. VGC plot of the Santiago staff (Ia).

The Santiago staff (Ia) is likely to be one text genre, given the prevalence of glyph <76> across the whole text, forming a long list. The plot shows, however, four or five possible list-types, due to the observed peaks in the hapax-legomena (marked by the arrows). To confirm whether this is indeed the case, a discourse map (see Biber et al., 1998:122-130) was produced for text Ia, with the most frequently occurring glyphs (<1> to <11>), a number of delimiter glyphs (<76>, and <380>), the crescent moon glyphs (<40> to <42>, as they may represent calendar sequences), and the text delimiter (glyph <999>), which provides more information on the organisation of the whole inscription.

Figure 16. Discourse map of the Santiago staff (Ia)

The large number of occurrences and the distribution of glyph <76> (figure 16) reveal that the text itself is probably one genre. The text delimiter glyph <999> is less frequent, but is again distributed throughout the length of the text, and shows a number of breaks and packed groups; see for example the dense clusters at N(750) and N(1750). In addition, other glyphs may indicate smaller thematically-related passages. To illustrate, glyphs <1> and <2>, like <76>, are distributed across the whole inscription, whereas glyphs <4>, <5>, <6>, <7>, and <380> are clustered together in small passages repeated in isolated sections of the text. Two glyphs are of particular interest: glyph <40> and glyph <41> (the mirror image of glyph <40>). If glyph <41> is synonymous with glyph <40>, then there are five instances, evenly distributed throughout the text. Glyph <42> also occurs five times, but after the first instance of glyphs <40> and <41>, at approximately N(1250).
The distribution of these glyphs corresponds to the peaks in the VGC and hapax-legomena. However, to claim that this is the mechanism underlying the VGC, in that there are five significant dates dividing up these five textual fragments into some inventory or list of names, is probably too strong a claim to make based on these results alone. To explore this idea further, a discourse map is created for the Mamari tablet (text Ca) for comparison (figure 17). Note the 'moon' glyphs (<40>, <41>, and <42>) and their distribution: it is very clear where this calendar resides in the text. In addition, glyph <8> appears to be structurally-related to, or dependent on, this sequence of 'moon' glyphs, as it occupies the same position in the text, perhaps indicating a semantic or syntactic correspondence. The other hypothesis is that this is a sub-text, as the distribution of glyphs <1> and <380> (i.e. the delimiter) appears to be mutually exclusive with that of the 'moon' glyphs.

Figure 17. Discourse map of the Mamari tablet (Ca).

From the statistics of the VGCs and the structures illustrated by the discourse maps, it can be demonstrated which texts are likely to be similar in terms of their literary-genre, and where some texts may be composed of multiple fragments. However, it may be argued that there are a number of interpretations for the charts. In particular, the VGCs of each text may not show where segmentation can be made, as they are not designed to measure this. In order to test the hypothesis that VGCs can act as predictors for segmentation, or for identifying sub-texts, an additional experiment is performed. Two texts are selected from the Maori corpus. The first is a sample of the prose text Hinemoa, with a sample-size of N(2000); the second is created by concatenating a series of short Maori genealogical texts of the structure: Ko [NOUN], Ko [NOUN] etc. up to a sample-size of approximately N(500).
Three final texts are generated through replacement of 500 tokens from the Hinemoa text with the genealogical text, resulting in the same total sample-size of N(2000). The three modified texts are composed of the genealogical text at the beginning, middle, and end of the text Hinemoa. It is predicted that the difference between the two text genres will be obvious due to the variation in the number of hapax-legomena. Figure 18 shows the original sample texts. Note the steep curve for the hapax-legomena of the genealogical text, indicating a high number of new types as we move through the sample (N). By comparison, the Hinemoa text is considered to be one whole prose text with less variability and lexical-richness than the genealogical text, as illustrated by its curve and the small fluctuations in the distribution of new types. Turning to the experimental texts (below), the presence of the genealogical text is clearly distinguishable as it moves through the Hinemoa text from the beginning (figure 19.a), to the middle (figure 19.b), and to the end (figure 19.c).

Figure 18. VGC plots of the experimental texts: a genealogy, and a narrative text.

Figure 19. Plot of VGC and V1 for experimental texts (panels a, b, and c) illustrating sharp increase of hapax-legomena. The genealogical text is marked by dashed-lines.

In conclusion, VGCs provide a good indicator of change in discourse or literary type. Hapax-legomena are particularly important: where the graphs display a high number of hapax-legomena, this indicates a text in which each word sampled is new, and generates a steep upward curve in comparison to a prose or narrative text. Therefore, VGCs may act as good predictors of sub-texts in the rongorongo corpus, and of where to segment the rongorongo texts in order to improve statistical measures. Golcher (2007), on analysing the undeciphered Voynich manuscript, developed an ‘original constant’ that may be more stable than ZipfR, though ‘probably also trickier to compute’ (Baroni, 2010, personal communication).
Therefore, more work is needed before it can be demonstrated that VGC data can be used as a guide for segmentation, or shift in discourse-type.

5.7 Latent semantic analysis

5.7.1 Document-document analysis

The LSA is predicted to provide more tangible results compared to a factor analysis. Although a factor analysis can be successful in spatially orientating apparent similarities between texts, the researcher is still left to decide where the groups begin and end, and how far the influence extends. In addition, lexical-richness tests can tell us how rich the vocabulary is in a text, but this does not equate to the text having any form of shared-content. With LSA there is less subjectivity, as the similarity between texts (documents) and glyphs (terms) can be measured according to their correlation. Texts are therefore judged to be similar in terms of the glyphs that are present, and the structures that put these glyphs together into larger phrases. LSA not only identifies collocations along similar lines to bi-gram, tri-gram or KWIC data, but also any structural-dependencies existing between constituents present further on in the text. Pearson's correlation is used here as a measure of text and glyph similarity (see Landauer, et al., 1994; Wild, 2009 on Cosine measures). The results (table 10) show the correspondence between texts. Here, LSA has correctly identified texts predicted by both the factor analysis and n-gram data (group 1).
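The similarity measure itself can be sketched briefly. The example below builds raw glyph-count vectors for a few documents and computes Pearson's correlation between every pair. It deliberately omits the dimensionality reduction (singular value decomposition) that precedes this step in LSA, so it illustrates only the correlation measure and is not the procedure used to produce the tables in this section. The document names and glyph codes are hypothetical.

```python
from collections import Counter
from math import sqrt


def count_vectors(docs):
    """Turn {name: [glyph, ...]} into count vectors over a shared, sorted vocabulary."""
    vocab = sorted({g for toks in docs.values() for g in toks})
    vectors = {}
    for name, toks in docs.items():
        counts = Counter(toks)
        vectors[name] = [counts[g] for g in vocab]
    return vocab, vectors


def pearson(x, y):
    """Pearson's r between two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def correlation_table(vectors):
    """Pairwise correlations, one entry per unordered pair of documents."""
    names = sorted(vectors)
    return {
        (a, b): pearson(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical documents: X1 and X2 share all glyphs, X3 shares none.
docs = {
    "X1": ["1", "2", "2", "3"],
    "X2": ["1", "2", "2", "3"],
    "X3": ["4", "4", "5"],
}
vocab, vectors = count_vectors(docs)
table = correlation_table(vectors)
```

In this toy case the two identical documents correlate perfectly, while the document with a disjoint glyph inventory correlates negatively with both.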
Text  Aa    Ab    Br    Bv    Da    Db    Hr    Hv    Nb    Oa    Pr    Pv    Qr    Qv    Ra
Ab    0.67
Br    0.43  0.42
Bv    0.57  0.54  0.55
Da    0.37  0.48  0.39  0.44
Db    0.36  0.40  0.43  0.43  0.48
Hr    0.66  0.59  0.55  0.73  0.56  0.48
Hv    0.63  0.58  0.46  0.66  0.50  0.45  0.75
Nb    0.39  0.45  0.33  0.46  0.45  0.23  0.48  0.51
Oa    0.41  0.53  0.40  0.45  0.36  0.32  0.55  0.51  0.35
Pr    0.65  0.61  0.55  0.72  0.56  0.50  0.89  0.77  0.53  0.52
Pv    0.56  0.49  0.38  0.57  0.48  0.40  0.69  0.86  0.41  0.41  0.74
Qr    0.66  0.59  0.51  0.72  0.50  0.43  0.87  0.69  0.43  0.47  0.84  0.64
Qv    0.54  0.50  0.37  0.49  0.48  0.31  0.63  0.72  0.43  0.42  0.66  0.78  0.61
Ra    0.61  0.54  0.39  0.49  0.50  0.34  0.63  0.58  0.36  0.56  0.63  0.50  0.58  0.55
Sb    0.50  0.49  0.34  0.57  0.40  0.41  0.68  0.66  0.36  0.55  0.64  0.58  0.54  0.45  0.49

Table 10. Pairwise correlation between texts of group 1. Figures of 0.7 or above meet the threshold.

Tablets H, P, and Q are correlated due to the number of shared passages (Kudrjavtsev, 1949). Text Pr shows a stronger correlation with Hr (0.89) than with Qr. Similarly, Pv has a higher correlation with Hv (0.86) than with Qv. The difference between Hr and Pr, compared to Qr, is probably caused by damage or allography.¹² This is confirmed by plotting their VGCs (see figure 20, below).

Figure 20. VGC plot of texts Hr, Pr, and Qr, explaining variation in observed results for LSA.

The VGC of the recto side of each tablet shows that text Qr is shorter than the other parallel-texts (Hr and Pr), which explains the slight differences observed in previous results (i.e. factor analysis, and now LSA). The second group (table 11), identified by the factor analysis, reveals that texts Kr and Kv are similar to one side of the Small Santiago (Gr) (see figure 1.b, above). Text Ev shows a degree of correlation with Gr, Kr, and Kv, more so than text Na. All these texts share the compound <380.1>, although differences exist with respect to the glyphs occurring between instances of this delimiter.
Although text Gr and tablet K are one text considered a ‘list’, Ev and Na may also be of the same genre-type, but listing different individuals or dates, for example.

12. The fact that tablets H and P are so correlated would point to there being less allographic variation between them; an example exists between the head-shapes of glyphs <205> and <305> in line 5 of Hr, Pr, and Qr (see figure 1.a), possibly implying the work of a separate scribe.

Text  Ev    Gr    Kr    Kv
Gr    0.62
Kr    0.60  0.74
Kv    0.57  0.79  0.64
Na    0.48  0.46  0.44  0.43

Table 11. Pairwise correlation between texts of group 2.

The texts Ia, Ta, and Gv in group 3 (see figure 10, and table 12) show the most correlation of all texts currently analysed. The correlation between Ia and Ta suggests that text Ta may be a copy of Ia, supported in part by the bi-gram data (above), where texts Ia and Ta share many compound-forms not present in text Gv.

Text  Er    Fb    Gv    Ia
Fb    0.05
Gv    0.28  0.10
Ia    0.10  0.01  0.79
Ta    0.15  0.02  0.80  0.91

Table 12. Pairwise correlation between texts of group 3.

Group 4 (table 13) shows little correlation between any texts, as per the results of the n-gram analysis. Therefore, these texts may represent separate genres from the rest of the corpus, or the classification according to the factor analysis is incorrect. Text Fa is fragmented and so there is no reliable correlation.

Text Ca Cb Fa Rb Sa
0.52 0.22 0.15 0.31 0.36 0.08 0.20 0.33 0.08 0.29

Table 13. Pairwise correlation between texts of group 4.

The LSA is performed again with removal of the fragmented texts (Fa, Fb, Ma, and Oa), and small samples (La, Ua, Va, Wa, Xa, Ya, Yb, and Za). Texts are joined together where both sides of the tablet show a correlation (texts Ca and Cb, Hr and Hv, Pr and Pv), or where previous palaeographic research points to a duplicate copy (texts Kr and Kv, a copy of text Gr).
A hierarchical-cluster analysis is applied using agglomerative-clustering in order to locate the groups identified by the LSA. The document-document similarity values (correlations) are converted to a distance-object by squaring the correlation matrix, since the rest of the analysis requires non-negative values. The distance-object represents the dissimilarity between two groups as the maximum value of the dissimilarities between individual texts in the groups (see Baayen, 2008:138; Johnson, 2008:183-192). These are plotted as a dendrogram, with the groups identified by the links between the texts (see figure 21).

Figure 21. Dendrogram of rongorongo texts based on the LSA data of the sub-corpora.

The plot shows two main groups divided into roughly five sub-groups. The texts forming linked-pairs, i.e. Ia and Ta, Gr and K, H and Q etc., are highly correlated. The two main groups appear to be divided between tablets with large numbers of delimiter-groups (see Horley, 2007a:27), and those that may be considered narrative or prose structured texts. This does not mean they are all of the same genre, as there is still a distinction between texts delimited with glyph compound <380.1>, and those with glyph <76> (see group 2 in table 11, and group 3, table 12). Horley (2007a:28) identifies shared-glyph sequences between Ev and text Ab, and Pr, and between texts Ab, Ca, Cb, Ev, and Pr. There are, however, more shared-glyphs between texts: Ab - Ra - Gr, Cb - Ev - Gr - Kv - Na, and Aa - Pr; see Horley (2007a:28) for examples. It is therefore certainly possible that the groups are skewed by the presence of these fragments. However, the categorisation is driven by the correlation between texts, which are grouped according to how much they have in common overall. Consequently, this provides a general view of the main groups, which can be improved once we are aware of the proper segmentation.
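The agglomerative procedure described above, in which the dissimilarity between two groups is the maximum dissimilarity between their members (complete linkage), can be sketched as follows. The correlations are those reported for Er, Gv, Ia, and Ta in table 12. The conversion of a correlation r into a dissimilarity as 1 - r² is an assumption made for this illustration, since the text specifies only that the correlations are squared; the function name is hypothetical.

```python
from itertools import combinations


def complete_linkage(labels, dist):
    """Agglomerative clustering; returns the merges as (group_a, group_b, height)."""
    clusters = [(label,) for label in labels]
    merges = []

    def between(a, b):
        # Complete linkage: the largest pairwise distance between two groups.
        return max(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((a, b, between(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Correlations from table 12 (texts Er, Gv, Ia, Ta).
corr = {
    ("Er", "Gv"): 0.28, ("Er", "Ia"): 0.10, ("Er", "Ta"): 0.15,
    ("Gv", "Ia"): 0.79, ("Gv", "Ta"): 0.80, ("Ia", "Ta"): 0.91,
}
# Assumed conversion: squared correlation turned into a dissimilarity.
dist = {frozenset(pair): 1 - r ** 2 for pair, r in corr.items()}

merges = complete_linkage(["Er", "Gv", "Ia", "Ta"], dist)
for a, b, height in merges:
    print("+".join(a), "joins", "+".join(b), "at", round(height, 4))
```

Under these assumptions Ia and Ta are linked first, Gv joins them next, and Er is attached last at a much greater height, which mirrors the linked-pair reading of the dendrogram.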
In addition, it is possible that these shared-sequences relate to commonly used set-phrases, or formulaic introductory expressions; therefore the presence of these shared-glyphs may not be the final decider of literary-genre.

Group 1: Ia-Ta, Gv
Group 2: Gr-K, Ev
Group 3: H-Q, P, A
Group 4: Db-Rb, Bv-Sb, Da-Ra
Group 5: Br-Er, Na-Sa, C, Nb

Table 14. Summary of text clusters and revised groupings (highlighted columns represent the main links between possible sub-genres).

Group 1 supports the results seen previously in the factor analysis, showing the correlation between texts Gv, Ia and Ta. Group 2 agrees with the factor analysis. Group 3 contains the parallel-passages of tablets H, P, and Q, with the slight separation of tablet P, and the similarity between these texts and text A is also apparent (Guy, 1985:383). Groups 4 and 5 show a distinction between the parallel-texts (Gr, Kr, and Kv), and the rest of the inscriptions (those with the delimiter-glyph compound <380.1>). Furthermore, group 5 contains the lunar-calendar series (Ca and Cb) and a repetitive lunar-sequence on one side of the Keiti tablet (Er) (Wieczorek, 2010). The final classification agrees to an extent with that concluded by Barthel (1958). However, the hierarchical clustering shows that some texts are less closely associated with the ones proposed by Barthel (1958), but are still part of the general category if we consider the top-level clusters: for example, the parallel-passages H, P, and Q, and tablet A, are seemingly related to some extent, with some slight separation. The statistics also support previous extensive internal-structural analyses (see Horley, 2007a; 2007b; 2010), which identify the shared-sequences through palaeographic methods. In conclusion, these groups appear to support previous research.
If the list-like texts of groups 4-5 represent one broad-genre, for example funerary texts, then we may 'expect that they would present slightly different versions of the same text, as happens with Egyptian funeral texts' (Horley, 2007a:31), hence the range of texts in these groups. As a result, the glyphs between delimiter-glyphs may represent 'personal names'. Groups 1 and 2 reflect a separate list-like genre, and the prose-like texts (group 3) could represent 'short songs' (kaikai), or 'prayers or charms' (Horley, 2007a:31). What the broad groups of inscription represent, and whether the distinction is between prose and structured-lists, is still unknown. It cannot be assumed that the surviving tablets represent the wealth of rongorongo literary-genre (Melka, 2009a:116), and consequently, the classification presented here may hint that there is a restricted set of surviving genres.

The above methods and procedures are applied to the language corpora, which previously showed little discrimination between texts in the factor analysis. Due to space restrictions, the dendrogram plot is summarised by mapping the genre assigned to each text in the meta-data, rather than the name of each text individually. This reveals the main clusters formed by the genres, presented below (figure 22).

Figure 22. Dendrogram of language corpora based on LSA data.

The results of hierarchical clustering show that two broad categories exist: one group composed of genres between haka and recitation before speaking, with another sub-category consisting of kaikai and war chant genres. The last cluster is split between prose, RR (the Jaussen texts; Barthel, 1958:173-199), and song.
The groups appear to be part of a broader category of genre: haka, incantation, lament, love song, lullaby, pepeha, and recitation before speaking may be classified as instances of 'ceremonial' or 'public address' performance, with a related sub-genre of 'chant' denoted by kaikai (see Blixen, 1979), and war chant (see Grey, 1853:39, 67, 72). Kaikai songs were apparently used as 'magical spells or charms' (Horley, 2007a:31). Some examples are found in Campbell (1971:93-120), 'Cantos de Aku - songs of the spirits', which may have been chanted for protection against malevolent spirits. Horley (2007a:31) highlights the list-like structure of some kaikai songs. The fact that the kaikai category is included as part of the incantation genre is an interesting parallel given their use as 'magic spells', and their similar structure with repeat passages (marked in bold and italics):

KAIKAI HANUANUA MEA Y PIKEA 'URI
1. Ka hau e, ka hau nga'e he ka hau te nukunuku ka Kava'aro, ka Kavatu'a kakokako.
2. A Ure a Ohovehi ku kahakihia a e Nga Ihu
3. More a Pua Katiki e hiahia pua mauku 'uta tangitangi pua mauku tai [...]
(Blixen, 1979)

A pepeha is a Maori 'introductory text' used to introduce oneself to an individual or group, detailing one's ancestors and place of origin. A recitation before speaking may similarly be viewed as an instance of an 'introductory text'. Laments were sung when something affected the community negatively, such as a funeral, with the crowd often singing the chorus. Grey (1853:10) recorded one such lament, 'Ko Te Tangi A Te Ikaherengutu, Mo Ana Tamariki, I Mate Taua Etahi, I Mate Kongenge Etahi', which was, 'sung by Te Wherowhero, on the death of his brother kati, […]', and is, 'always sung by the aged chiefs if many members of their family die'. Another lament was sung on, 'the occasion of the Governor quitting Taupo in 1819', (Grey, 1853:70), showing that they were not restricted to the death of family members.
An example of a lament showing group participation, sung in chorus, appears below. The parts sung by the group are marked by italics (Grey, 1853:118):

HE TANGI, MO TE MATE TURORO.
1. Mate kino, mate kino!
2. Mate taurekareka!
3. Me he mate taua pea koe,
4. Tataia he toroa,
5. Hoea i, te moana pea.

The haka shows similar parallels with laments, in that a chief orator leads the group, with the group singing the chorus (below). This text also illustrates the issues concerning repetition and elliptical phrases (mentioned above).

KO TE HAKA A TAHATAHA, TE WAHINE A TE UIRA TE RANGATIRA O NGATI-POU
1. He aha ra he kai, ma te tuna o te raupo,
2. E anga mai ai?
3. Aha! ha!
4. Ko nga mongamonga o nga wheua o Tawhiroi,
5. Whiua ki Whangape, ki reira whiriwhiri ai, kia pahoho.
6. I, i, i,
7. Me whakatapu ranei?
8. Me whakanoa ranei?
9. Me whakatapu ranei?
10. Me whakanoa ranei?
11. Me marau ki Kariwha, kia manana ake ko te puhi-tuna,
12. He' rino;
13. He tuna ha,
14. A te kai a te koioio.
(Grey, 1853:79)

An example of an incantation (Grey, 1853:58) shows similar parallels with the other examples (below). What is particularly evident from this example is the number of repeat phrases (in bold and italics). There are two main verses repeated (see Grey, 1853:40, 61-62, 107, for further examples of repeated passages):

MO TE TAANGA NGUTU KAUAE, MO TE WAHINE, TENEI WAIATA KARAKIA WHAKAWAI.
1. Takoto ra, e hine,
2. Pirori e,
3. Kia taia o ngutu,
4. Pirori e,
5. Mo to haerenga atu, ki nga whare tapere,
6. I kiia ana mai,
7. Ko hea tenei wahine kino ?
8. E haere mai nei.
9. Takoto ra, e hine
10. Pirori e,
11. Kia taia o ngutu,
12. To kauae,
13. Kia pai ai koe:
14. Pirori e
15. Mo to haerenga, ki nga whare matoro,
16. E kiia ana mai koe,
17. Ko hea tenei wahine ngutu whero ?
18. E haere mai nei.
[...]
(Grey, 1853:58)

Consequently, it would seem plausible to attribute the category of 'public oration' or 'chant' to the first group (at the top of the chart), whose texts all show some form of formulaic expression, repeated sequences, or removal of some syntax particles (note lines 5 and 15, and the insertion of additional content at lines 13-14). The second category may be described as a 'folklore' or 'narrative' genre, with the texts classified as song being a possible misclassification, perhaps of the genre poetry. The rongorongo (RR) texts (cf. Barthel, 1958) of the Tahua (A), Aruku-kurenga (B), Mamari (C), and Keiti (E) tablets are apparently authentic rongorongo contents (Fischer, 1997). This point is still contentious (Guy, 1985), though Barthel (1958) believes they should not be completely disregarded. Given their sample-size, between N(2056) and N(5522), it may be possible to rule them out as any of the genres represented in the first group. The hierarchical cluster plot shows that the RR texts are related to the category prose. There exists a correlation between the inclusion of these texts in the prose category and their large sample-size, which fits more with a prose text than with a chant or song (typically between 50-200 words). In addition, Maunsell (cf. Grey, 1853:xiii-xiv) notes that prose was very different, and adhered more to the grammar of Maori than poetry, songs, or chants, which have different grammatical structures. This observation is supported by the hierarchical cluster plot, where prose is separated from the first main group of 'public address'. However, the meta-data related to the classification of the language corpus (according to key-words in the title) may over-generalise and could therefore be unreliable, despite the apparent correspondence between the genres. Without assistance from a native Maori speaker, caution is needed until the text-categories are properly validated.
Based on the results of the rongorongo and language corpus, it would be difficult to argue for a successful classification, given that rongorongo is still undeciphered, and that the issues associated with polysemy in the language corpora need to be properly addressed. Consequently, an additional corpus is introduced, containing 43 Egyptian Hieroglyphic texts collected from Rosmorduc (1997). This will allow the methods applied above to be validated properly. The genre of one group of texts is known: they are classified as teachings. This group consists of 28 texts: 27 of the Teachings of Amenemope, and a further teaching text, Kagemni. If all the texts are similar in terms of their contents, this will become apparent from an LSA and the resulting hierarchical clustering procedures. The analysis will also replicate the condition of the rongorongo corpus, as the Egyptian texts are transliterated in much the same way. The genre of the remaining texts is unknown, but they generally represent texts collected from coffin and stela inscriptions.

Figure 23 presents the results. Texts AM001 to AM027 represent the Teachings of Amenemope, and KAG001 the Kagemni teachings text. There are two categories of text. One group is divided into two sub-groups, with the majority of the Amenemope texts clustered together at the top of the chart. Six of the Amenemope texts show less correlation with this group and are instead part of other clusters. Four of the Amenemope texts (AM003, AM004, AM012, and AM014) appear with the Kagemni (KAG001) text, showing some relationship with the main Teachings of Amenemope, and forming an additional sub-group which includes 7 other texts (see figure 23, below).

Figure 23. Dendrogram of Egyptian Hieroglyphic texts categorised by literary-genre.

The remaining group, appearing at the bottom of the chart (headed by text SHT001), contains two of the Amenemope texts (AM002 and AM027), showing a departure from the main teachings group.
This might, again, be a result of over-generalisation or pre-processing issues. However, for the purposes of this study, the categorisation has been quite successful, with the bulk of the Amenemope texts clustered together, and a secondary group containing the remaining teaching texts.

5.7.2 Term-term similarity analysis

The same LSA methods applied to text classification may be applied to the glyphs and morphemes themselves. To demonstrate, a term-to-term similarity analysis was performed on a chosen glyph. The Egyptian texts were selected again, as the values of the glyphs are known, and results based on the 'known' highlight the accuracy of the method better than those based on the 'unknown', i.e. the rongorongo corpus. In addition, attributing significance to one glyph or group of glyphs in the rongorongo corpus would require an entire new study in itself, and substantial statistical support. In Egyptian, glyph <A1> depicts a 'seated man', and acts as a determiner meaning 'I' or 'me', but also as a semantic classifier when attached to glyphs denoting actions, occupations, relationships, and personal names (Gardiner, 2005:442); see figure 24 for examples. Glyph <A1> should be quite productive in the corpus, making it a good candidate for analysis.

Figure 24. Examples of glyph <A1> (panels a, b, and c).¹³

Above, glyph <A1> stands before the verb 'be silent' (Faulkner, 1988:290), forming the noun 'silent man' (24.a). In the second example (24.b), it acts as a determiner denoting an occupation, 'scribe' (Faulkner, 1988:246). In the final example (24.c), glyph <A1> is a determiner for son (Faulkner, 1988:207), as in 'his son', and also appears at the end of the passage denoting the plural of the compound 'king's children' (Faulkner, 1988:116).
Glyph <A1> is highly productive over a large set of texts (table 15), which means we can proceed with the analysis, as there seem to be enough instances of glyph <A1> for the LSA to make good predictions of its relationship with other glyphs in the corpus.

13. These examples are generated in LaTeX with the Sesh package by Serge Rosmorduc (1997).

Rank  Glyph  Count  Rel. Freq.  Text
1     A1     260    0.03        OUN001
2     A1     211    < 0.01      WST001
3     A1     177    < 0.01      BRO001
4     A1     169    0.05        NAU001
5     A1     114    < 0.01      HET001
6     A1     109    0.02        DPA001
7     A1     85     0.02        INE001
8     A1     30     0.03        IKH001
9     A1     19     0.03        CTA001
10    A1     16     0.02        KAG001
11    A1     13     0.04        AM002
12    A1     11     0.03        AM022
13    A1     10     0.02        AM020
14    A1     10     0.02        AM015
15    A1     9      0.01        GUA001
16    A1     8      0.02        AM024
17    A1     8      0.01        CNE001
18    A1     8      0.01        AM001
19    A1     7      0.02        AM025
20    A1     7      0.02        AM010

Table 15. Top 20 frequency counts of glyph <A1> in descending order.

The glyphs meeting the >=0.7 threshold are once again returned (as per Landauer et al., 2004:5217). A KWIC concordance is generated to validate the results of the LSA. Due to the large number of instances, the results have been summarised by selecting the most correlated-glyphs, with a count of their position relative to the node <A1> (see table 16). The second to last column shows the total number of instances of each correlated-glyph, and the last column their correlation to glyph <A1>.

Glyph  POS.1  POS.2  POS.3  POS.4  POS.5  POS.6  NODE  POS.8  POS.9  POS.10  POS.11  POS.12  POS.13  Total  Correlation
V31A   9      5      2      5      25     4      A1    0      5      12      15      7       7       89     0.91
A2     9      5      7      7      8      22     A1    0      8      6       1       9       13      82     0.86
D54    4      10     6      14     11     4      A1    2      0      6       2       9       8       68     0.92
G41    9      3      9      19     0      0      A1    2      2      3       3       3       6       53     0.81
N23    4      0      1      0      0      1      A1    0      1      7       2       1       4       17     0.93
gm     0      0      3      6      0      0      A1    3      0      1       1       2       0       16     0.91
mw     3      1      1      0      1      1      A1    0      0      0       1       0       1       8      0.91
U19    0      0      0      0      0      0      A1    0      0      0       0       0       0       0      0.91
N36    0      1      0      0      1      0      A1    0      3      0       0       1       0       6      0.72

Table 16. Summary of the KWIC concordance and LSA results.
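The positional summary in table 16 can be sketched as follows: for every occurrence of a node glyph, each neighbouring glyph within a fixed span is counted under its position relative to the node. With a span of six, the slots correspond to POS.1 to POS.13, with the node itself at POS.7. This is a minimal illustration under that assumption, not the concordancer used in the study; the function name and the short glyph sequence are hypothetical.

```python
def kwic_position_counts(tokens, node, span=6):
    """Count neighbours of `node` by position; slot index span holds the node itself."""
    width = 2 * span + 1
    table = {}
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            if offset == 0 or not 0 <= j < len(tokens):
                continue  # skip the node slot and positions outside the text
            row = table.setdefault(tokens[j], [0] * width)
            row[offset + span] += 1
    return table


# Hypothetical sequence; a span of 2 gives five slots with the node in the middle.
sequence = ["V31A", "A1", "A2", "G41", "A1"]
counts = kwic_position_counts(sequence, "A1", span=2)
for glyph, row in sorted(counts.items()):
    print(glyph, row, "total", sum(row))
```

Summing a row gives the total number of co-occurrences within the window, which is the kind of figure reported in the Total column.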
The KWIC concordance shows that there are examples of glyph <A1> forming a relationship with the glyphs identified by the LSA (i.e. <V31A>, <A2>, <D54>, <G41>, <N23>, and <gm>). There are, however, a few concerns in relation to the correlation of 0.91 assigned to glyph <U19>, an 'adze', or variant of the preposition <n> (Gardiner, 2005:518). There are no instances of it in the KWIC concordances. A further analysis reveals that glyph <U19> is assigned this value due to it appearing in proximity to the other glyphs associated with glyph <A1> (table 17).

No.  POS.1  POS.2  POS.3  POS.4  POS.5  POS.6  NODE  POS.8  POS.9  POS.10  POS.11  POS.12  POS.13
1    A      i      i      W      k      n      U19   nw     W      D6      W       Sd      d
2    i      r      i      W      f      n      U19   nw     W      D6      r       G41     A
3    A      s      D26    i      A2     n      U19   nw     W      D6      r       G41     A
4    Z1     P1     i      W      f      n      U19   nw     W      D6      m       wA      A
5    t      N23    Z1     Z2     n      n      U19   nw     W      H       W       t       F37B
6    ir     r      t      6      1      n      U19   nw     W      ra      Hr      1       p
7    W      N31    D54    D2     1      n      U19   nw     W      D6      O       i       W
8    n      i      n      z      b      n      U19   nw     W      D6      i       W       s
9    a      n      b      C7     G7     n      U19   nw     6      4       D6      i       W
10   W      b      n      sw     W      n      U19   nw     W      D6      D       d       nb

Table 17. KWIC concordance of <U19>.

The correlation of 0.84 between glyph <nw> and <A1> is explained by the KWIC concordance. Glyph <U19> appears mainly in combination with glyphs <n> and <nw>. On closer examination, glyph <U19> combined with <n> and <nw> (actually glyphs <N35> and <W24> respectively) forms the demonstrative 'this' (Gardiner, 2005:518). In addition, the other glyphs associated with <A1> occur with glyph <U19>, as shown by the summary (table 18). A few previously observed glyphs, for example <A2>, <D54>, <G41>, <mw>, and <N23>, are also returned in the results.
Glyph  POS.1  POS.2  POS.3  POS.4  POS.5  POS.6  NODE  POS.8  POS.9  POS.10  POS.11  POS.12  POS.13  Total
mw     0      0      0      0      0      0      U19   0      0      0       0       2       2       4
A2     0      0      0      0      3      0      U19   0      0      0       0       0       0       3
G41    0      0      0      0      0      0      U19   0      0      0       0       2       0       2
D54    0      0      1      0      0      0      U19   0      0      0       0       0       0       1
N23    0      1      0      0      0      0      U19   0      0      0       0       0       0       1
V31A   0      0      0      0      0      0      U19   0      0      0       0       0       0       0
gm     0      0      0      0      0      0      U19   0      0      0       0       0       0       0
U19    0      0      0      0      0      0      U19   0      0      0       0       0       0       0
N36    0      0      0      0      0      0      U19   0      0      0       0       0       0       0

Table 18. Summary of KWIC results for glyph <U19>.

What must be considered is that LSA identifies the similarity between glyphs by looking at examples of collocations, as well as the clause and paragraph as a whole. LSA does this for all texts and glyphs in the corpus. Consequently, if a glyph has a small correlation (see <N36> at 0.72, in table 16), this may be due to it being associated with many glyphs in the same domain (i.e. as a determiner or phonetic complement). This is suggested by the plot of the term-to-term similarity scores (correlation); see figure 25, below.

Figure 25. Plot of glyph <N36>.

The long flat line at the top of the graph shows that this glyph is equally correlated with many other glyphs, at approximately 0.92. This glyph is therefore less correlated with glyph <A1> because it is more productive, and less likely to favour a particular glyph. The strong correlation between glyphs is thus attributed to those that form collocations, or where there is a semantic dependency. To illustrate, glyphs <N36> and <N23> form compounds, including /mrt/ 'weaver' and /mr/ 'friend', which include glyph <A1> as determiner (Gardiner, 2005:491). In addition, <G41> forms the sequence <A1-N35-G41-G54> meaning 'alight' or 'halt', and <G17-D36-X1-N35-T14-G41-A1>, as 'nomad hunter' (Gardiner, 2005:472), which explains to some degree its positive correlation with <A1>. Therefore, many of the glyphs retrieved by the LSA have some function associated with glyph <A1>, either as a determiner, phonetic complement, or logogram.
In conclusion, the results show that the methods used here have elicited some form of literary genre based on the features peculiar to each text, whether from data containing glyphs or morphemes. Some of these methods, for example the factor analysis, provide a good starting point for exploratory studies, but there are issues related to the number and type of features selected. Traditional corpus linguistic techniques (counts of n-grams, and KWIC concordances) offer more precision in terms of identifying which glyphs and morphemes form some lexical or grammatical association pattern (Biber et al., 1998). However, it is often time consuming to filter through the data looking for these patterns. Latent semantic analysis (LSA) produces results which are easier to interpret, and shows the strength of correlation between terms. One disadvantage is that LSA is computationally expensive when applied to large corpora; however, for this study it is capable of processing the whole rongorongo corpus without any problems. To summarise, future rongorongo studies may wish to consider adopting LSA as part of a mixed-methods approach when exploring the semantics of the rongorongo corpus.

Chapter 6 Conclusion and Recommendations for further research

6.1 Conclusion

The analysis of the rongorongo corpus appears to require a mixed-methods approach borrowing from corpus linguistics, including word-frequency distribution analysis, collocations, and KWIC concordances (see Biber et al., 1998; Gries, 2009; Baayen, 2008); and from lexicostatistics, involving spectrum frequencies and Vocabulary Growth Curves (Baroni, and Evert, 2006; Baayen, 2001; 2008), combined with authorship-attribution studies (see Mosteller and Wallace, 1964; Baayen, 2008; Stamatatos et al., 2009). These studies have been successful in classifying texts of unknown authorship against a list of known texts; one example being the study of the 'Federalist papers' (Mosteller and Wallace, 1964).
However, despite rigorous feature-selection criteria, the choice of the number of features is still left to the judgement of the researcher. Consequently, one of the most attractive properties of applying LSA is that it requires high dimensional data-sets representative of 'global knowledge' (Landauer, et al., 1997). An understanding of the structural features, context, or genre of the rongorongo texts will provide an insight into the use of glyphs, and is therefore paramount in revealing the underlying mechanisms that were exploited for creating inscriptions with such allographic diversity, yet a relative amount of standardisation (Horley, 2005; 2007a; 2009). Tests for lexical-richness hint at the nature of a text. The correlation between texts with large numbers of delimiters (Horley, 2007a) and those without shows that these list-like texts are lexically-richer and more likely to be lists of content-words, due to their low occurrence - typically among the hapax-legomena. Analysing Vocabulary Growth Curves (VGC) provides some insight into how the texts can be segmented, in order to compensate for any skew in statistical measures caused by the presence of possible sub-texts. The most successful of the multivariate methods would appear to be Latent semantic analysis, as a result of its underlying model, which is not dependent on frequency data in the way that other analyses (factor analysis, relative frequency counts) are. Each form of analysis has its advantages and shortcomings; however, they all reveal a small piece of the puzzle, or a different view of the data. Each method is used to validate the results of previous methods, or to provide model versions of raw data.

The groupings of tablets would appear to be broadly divided into two main groups, with one designated 'prose', and the other 'lists' of personal names, places, dates or other possible inventory style items.
These terms are applied generally, however, and do not represent a final hypothesis concerning the literary genre of the tablets, merely that the tablets in each group share a common theme or contents, as denoted by the strength of their correlation. Nevertheless, the grouping does provide statistical support for the conclusions drawn by previous studies (Barthel, 1958; Butinov and Knorozov, 1957; Fedorova, 1978; Guy, 1982; 1985; 1990; 2006; Fischer, 1997; Davletshin, 2002; Horley, 2005; 2007a; 2007b; Pozdniakov, 1996; Pozdniakov and Pozdniakov, 2007; Melka 2008; 2009b; 2010). If these groups show that structural variation is restricted to a specific group of tablets, then it may be concluded that this is a result of literary genre, or that the tablets are attributable to different scribal schools (Routledge, 1919), which would help to identify the mechanisms behind the alloglyphs evident in parallel passages. In addition, the language corpus provided some evidence for the application of VGCs as a possible tool for segmentation. The LSA shows the presence of sub-groups, and how they are related to the broader categories of 'chant' or 'public address', and 'prose'. Although more work would need to be done, a comparative analysis of rongorongo and Rapanui oral tradition is expected to identify some fragment or passage of text that reflects the structures assigned to the glyphs, an approach which, on reviewing previous research (Barthel, 1958; Fedorova, 1978; Davletshin, 2002; Horley, 2007a; Pozdniakov, and Pozdniakov, 2007; Melka, 2010), 'would seem entirely justified' (Fischer, 1997: 304). Fischer (1997: 305) proposes, however, that language texts used in comparative structural analysis should be pre-missionary texts from before 1866, as they are the most likely to parallel the rongorongo inscriptions. As the majority of texts in the language corpus of this study date from before 1853, some being of ancient origin (particularly Grey, 1853), it is hoped that proper pre-processing and segmentation will improve the current results.
The study evaluated a series of alternative data-sets. For the moment most scholars agree that Barthel's (1958) corpus is sufficient, though Guy (1985; 1990; 2006) highlights a number of inconsistencies, and both Horley (2007a) and Pozdniakov and Pozdniakov (2007) choose to preprocess the Barthel (1958) corpus, removing particular alpha-codes. More importantly, there is currently no alternative, although Horley (2005) proposes a new scheme based on the compositional elements of the glyphs, reducing the sign list to 50 elementary signs. For statistical purposes it appears that DD4 provides better results, as it identifies all instances of a glyph in the corpus, simplifying the programming. Whether a glyph like 700 (fish) is the same as 700y (an upside-down fish) will only be settled once further steps have been taken towards decipherment. It is possible to remove many of the Barthel (1958) alpha-codes, though 'f', representing 'feathered' glyphs (i.e. those with a series of lines emanating from their contour), is problematic (Paul Horley, 2010, personal communication). To summarise, this paper has attempted to address one problem in rongorongo studies, that of literary genre. Once it is established which tablets are likely to contain similar inscriptions, this will provide some contextualisation for the study of the glyphs and the motivation for their presence on specific tablets. Lastly, this paper has benefited from the application of the R programming language for statistics, which provides numerous functions that can be incorporated into programs and applied to corpora. In addition, as R is free, there are no software costs, in contrast to packages such as SPSS, where an annual licence fee is required.
It is hoped therefore that this paper also shows what applied linguists can achieve with a little computer programming knowledge (see Biber et al., 1998; Evert and Baroni, 2006; Baayen, 2008; Johnson, 2008; Gries, 2009; Wild, 2009 for more on these methods).

6.2 Further research

From the correspondence analysis, we see that preprocessing of the texts and tagging of grammatical constructions, such as the dative, active, and passive, would be beneficial to the study. In addition, the rongorongo corpus needs further segmentation based on entropy (Rao, R., et al., 2009), or on the constant proposed by Golcher (2007), which segments morphemes using unsupervised, language-independent methods based on frequency counts of substrings (character-level n-grams, see Golcher, 2007:1). The underlying assumption is that a larger unit is composed of smaller units that commonly occur together. Consequently, the text is segmented at the points where the predictability of the following character falls (Golcher, 2007:2). The level of orthography is still undetermined for rongorongo. Calculating the sample correlation coefficients for Chinese texts and glyphs, Penn (2006) demonstrates how the degree of logography and phonography can be determined by a 3D pairs plot. The level of logography and phonography present in a particular writing system can be used as a tool to 'classify entirely unknown writing systems to assist in attempts at archaeological decipherment' (Penn, 2006:1). Penn addresses this issue through an analysis of Chinese Hànzi and English, calculating the sample correlation coefficients of a corpus of text from a term-document matrix similar to that used in LSA. Although still in its early stages, this would appear to be a promising method for assessing the next issue in rongorongo research, the Orthographically Relevant Level (ORL) (Sproat, 2000): is rongorongo syllabic, logographic, or a mixture of the two, plus semantic categories and phonetic complements?
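The segmentation principle described above, cutting where the predictability of the following symbol falls, can be sketched in a deliberately simplified form. The Python below estimates P(next | current) from bigram counts and places a boundary wherever that probability drops below a threshold. It is an illustration of the general idea only: it is not Golcher's (2007) procedure or constant, the `threshold` value is an arbitrary assumption, and the glyph codes are invented.

```python
from collections import Counter


def transition_probs(symbols):
    """Estimate P(next | current) from bigram counts over one sequence."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    first_counts = Counter(symbols[:-1])
    return {pair: c / first_counts[pair[0]] for pair, c in pair_counts.items()}


def segment(symbols, threshold=0.75):
    """Cut the sequence wherever the next symbol is less predictable
    than `threshold` given the current one (threshold is an assumption)."""
    if not symbols:
        return []
    probs = transition_probs(symbols)
    segments, current = [], [symbols[0]]
    for prev, nxt in zip(symbols, symbols[1:]):
        if probs[(prev, nxt)] < threshold:
            segments.append(current)
            current = []
        current.append(nxt)
    segments.append(current)
    return segments


# Invented sequence built from three recurring two-glyph units.
toy = ["600", "001", "700", "200", "600", "001", "430",
       "010", "700", "200", "430", "010", "600", "001"]

segments = segment(toy)
print(segments)
```

On a corpus as small as the surviving inscriptions, many transitions are observed only once and so appear perfectly predictable; any real application would need smoothing or longer contexts, which this sketch omits.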
The perspective plots (figure 26) show a 3D pairs plot, with the strength of correlation between glyphs represented by the height of each peak, for Egyptian (as per the previous LSA), Cuneiform (35 texts from the Laws of Hammurabi), rongorongo, and a number of English texts (46 in total, including Alice in Wonderland, Huckleberry Finn, Dracula, Pride and Prejudice, and Sherlock Holmes). There does seem to be a difference between writing systems like Egyptian and Cuneiform on the one hand, and English and rongorongo on the other. A logographic writing system may display properties similar to the Egyptian plot, where glyphs are used for a specific text, e.g. a funerary formula, which will mean less 'semantic clumpiness' (see Penn, 2006:2), shown by the absence of any peaks in the rest of the Egyptian plot. However, the depth of the chart shows that there are quite a few similarities in the usage of glyphs (i.e. syntax particles or phonetic and semantic classifiers). The plot for Cuneiform, on the other hand, shows more correlation between the individual constituents, indicated by the depth of the correlation (filling the whole 3D cube). This may, however, be due to the repeated formula in the Laws of Hammurabi, i.e. 'If a man does X to Y, he will receive Z punishment', which makes each section quite similar in structure; the sections are also considered to be one text genre (a list of laws). English and rongorongo show an interesting parallel in the distribution of the peaks, though the plot is slightly deeper for English, showing that more of the documents share common vocabulary.

Figure 26. Sample correlation coefficients for English, Egyptian, Rongorongo, and Cuneiform.

These are only preliminary results, and more comparisons need to be made between writing systems of the world, both modern and ancient.
However, it is obvious from these plots that there is an interesting difference between these scripts; whether this difference is quantifiable, and can be shown to assign the correct ORL, needs to be determined by a more extensive study. In particular, parameters for the representativeness and size of the corpora, and the transliteration schemes, need to be assessed in order to ensure that results are consistent and do not lead to incorrect assumptions. Returning to the previous discussion, LSA can be applied to glyphs to quantify glyph behaviour, though it requires more testing, as this paper may be the first to apply LSA to ancient writing systems and decipherment. However, the study shows that by following the semantic correlation between glyphs, it is possible to adopt a hierarchical approach similar to syntax trees and the notion of binding, to discover which glyphs 'govern' others (Chomsky, 1981a) and which reveal an anaphoric relationship to other glyphs in the corpus. Markov models, as used in Natural Language Processing, speech recognition, and work on deciphering the Indus Valley script (see Rao et al., 2009), could produce model glyph collocations, or predict the value of damaged glyphs and the semantic relationships that might have been observed had the corpus of surviving inscriptions been larger. In addition, cryptographic methods such as the transposition cypher can be trained with the same data presented in this paper, i.e. counts of features, correlations, typical collocations, and KWIC concordance data. This method can produce statistically based guesses of glyph values which, although likely to generate incorrect strings, may provide correct 'guesses' or some avenues for further study. A final remark is that each study submitted to the ever-growing pool of research on rongorongo and writing systems is a valid contribution and, when supplemented with statistical analysis as part of a mixed-methods approach, will result in more robust conclusions.
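The suggestion that Markov models could predict the value of damaged glyphs can be made concrete with a first-order (bigram) sketch. The Python below is a minimal illustration under stated assumptions: the three training sequences are invented, `train_bigram_model` and `predict_next` are hypothetical helper names, and no smoothing is applied, so it is not the n-gram model of Rao et al. (2009).

```python
from collections import Counter, defaultdict


def train_bigram_model(sequences):
    """Count, for every glyph, which glyphs follow it and how often."""
    model = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            model[prev][nxt] += 1
    return model


def predict_next(model, prev, k=2):
    """Return up to k candidate glyphs after `prev`, each with its
    relative frequency; an unseen context yields an empty list."""
    followers = model.get(prev)
    if not followers:
        return []
    total = sum(followers.values())
    return [(glyph, count / total) for glyph, count in followers.most_common(k)]


# Invented glyph-code sequences (an assumption for illustration only).
toy_corpus = [
    ["600", "001", "700"],
    ["600", "001", "200"],
    ["430", "600", "001", "700"],
]

model = train_bigram_model(toy_corpus)

# Suppose the glyph following "001" is damaged: rank the candidates.
candidates = predict_next(model, "001")
print(candidates)
```

The ranked candidates are statistical guesses only; with so few surviving inscriptions, the counts behind each probability are small, and any proposed restoration would still need palaeographic confirmation.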
Corpus linguistic methods offer a good basis on which to build more advanced approaches using language modelling techniques, or to provide support to palaeographic analysis. It is hoped that, with the recent increase in the pace of rongorongo research (Horley, 2009; 2010; Melka, 2009a; 2009b; 2010; Wieczorek, 2010), we will see a decipherment in the not too distant future.

Bibliography

Aiolli, F., M. Simi, D. Sona, A. Sperduti, A. Starita, and G. Zaccagnini. (1999). 'SPI: a System for Palaeographic Inspections'. AIIA Notizie. 4:34-38. URL: http://www.dsi.unifi.it/AIIA/. Accessed: 19/09/2008.
Altman, A. (2004). 'Early Visitors to Easter Island 1864-1877 (translations of the accounts of Eugène Eyraud, Hippolyte Roussel, Pierre Loti and Alphonse Pinart; with an Introduction by Georgia Lee)'. Los Osos, CA: Easter Island Foundation.
Baayen, H. (1994). 'Derivational Productivity and Text Typology'. Journal of Quantitative Linguistics. 1:16-34.
Baayen, H., Halteren, H., and Tweedie, F. (1996). 'Outside the cave of shadows: Using syntactic annotation to enhance authorship-attribution'. Literary and Linguistic Computing. 11:121-131.
Baayen, H. (2001). 'Word frequency distributions'. Kluwer Academic Publishers.
Baayen, H., Halteren, H., Neijt, A., and Tweedie, F. (2002). 'An experiment in authorship-attribution'. 6es Journées internationales d'Analyse statistique des Données Textuelles.
Baayen, H. (2009). 'Analysing Linguistic Data: A practical introduction to statistics using R'. Cambridge University Press.
Barthel, T. (1958). 'Grundlagen zur Entzifferung der Osterinselschrift (Bases for the Decipherment of the Easter Island Script)'. Hamburg: Cram, de Gruyter.
Barthel, T. (1974). 'The Eighth Land – The Polynesian discovery and settlement of Easter Island'. Honolulu Press of Hawaii. Translated from the German by Anneliese Martin.
Baroni, M. (2006). 'Counting Words: An Introduction to Lexical Statistics'. The 18th European Summer School in Logic, Language and Information, Málaga, Spain.
Baroni, M. (2009). 'Distributions in text'. In Anke Lüdeling and Merja Kytö (eds.), Corpus linguistics: An international handbook, 2. Berlin: Mouton de Gruyter: 803-821.
Berthin, G., and Berthin, M. (2006). 'Astronomical Utility and Poetic Metaphor in the rongorongo Lunar Calendar'. Applied Semiotics. 8:18. 85-98.
Biber, D., Conrad, S., and Reppen, R. (1998). 'Corpus Linguistics: investigating language structure and use'. Cambridge University Press.
Blixen, O. (1979). 'Figuras de hilo tradicionales de la Isla de Pascua'. Moana: Estudios de Antropología Oceánica. 2:1. 1-106.
Bonfante, G., Bonfante, L. (2002). 'The Etruscan Language: An Introduction (2nd Edition)'. Manchester University Press.
Butinov, N., and Knorozov, Y. (1957). 'Preliminary Report on the Study of the Written Language of Easter Island'. Journal of the Polynesian Society. 66:1. 5-17.
Campbell, R. (1971). 'La Herencia Musical de Rapanui: Etnomusicología de la Isla de Pascua'. Santiago de Chile: Andrés Bello.
Chomsky, N. (1981a). 'Lectures on Government and Binding'. Dordrecht: Foris.
Ciula, A. (2005). 'Digital palaeography: Using the digital representation of medieval script to support palaeographic analysis'. Digital Medievalist. 1.
Coulmas, F. (1991). 'The Writing Systems of the World (The Language Library)'. Basil Blackwell: Oxford.
Coulmas, F. (1996). 'Encyclopedia of Writing Systems'. Blackwell Publishing.
Coulmas, F. (2003). 'Writing Systems: An Introduction to their Linguistic Analysis'. Cambridge University Press.
Damerow, P. (2006). 'The Origins of Writing as a Problem of Historical Epistemology'. Cuneiform Digital Library Journal. 1. Available at http://cdli.ucla.edu/pubs/cdlj/2006/cdlj2006_001.html. Accessed: 4/3/2010.
Davletshin, A. (2002). 'Names in the Kohau Rongorongo Script'. Paper presented as From Kohau Rongorongo Tablets to Rapanui Social Organization at the 2nd International Conference 'Hierarchy and Power in the History of Civilizations'. Saint Petersburg, Russia, July 4-7.
DeCoster, J. (1998). 'Overview of Factor Analysis'. URL: http://www.stathelp.com/notes.html. Accessed: 10/2/2010.
De Feu, V. (1996). 'Rapanui'. Routledge, London.
Dong, Q., Wang, X., Lin, L. (2006). 'Application of Latent Semantic Analysis to Protein Remote Homology Detection'. Bioinformatics. 22(3):285-290.
De Hevesy, G. (1932). 'Écriture de l'Ile de Pâques'. Bulletin de la Société des Américanistes de Belgique. 9:120-7.
Dörnyei, Z. (2007). 'Research methods in Applied Linguistics'. Oxford University Press.
Elbert, S. (1941). 'Chants and love songs of the Marquesas islands, French Oceania'. Journal of the Polynesian Society. 50(198):53-91.
Elbert, S. (1982). 'Lexical diffusion in Polynesia and the Marquesan-Hawaiian relationship'. Journal of the Polynesian Society. 91(4):499-518.
Emory, K. (1968). 'Review of Reports of the Norwegian Archaeological Expedition to Easter Island and the East Pacific'. (2) Miscellaneous Papers by Thor Heyerdahl and Edwin N Ferdon, Jr., Eds. American Anthropologist, 70:152-154.
Englert, S. (1970). 'Island at the Center of the World'. Translated and Edited by William Mulloy. New York: Charles Scribner's Sons.
Evert, S., and Baroni, M. (2005). 'Testing the extrapolation quality of word frequency models'. Proceedings of Corpus Linguistics 2005. URL: http://www.corpus.bham.ac.uk/PCLC/. Accessed: 12/02/2009.
Evert, S., and Baroni, M. (2006). 'The ZipfR library: Words and other rare events in R'. Presentation at useR! 2006: The Second R User Conference, Vienna, Austria.
Evert, S., and Baroni, M. (2007). 'ZipfR: Word distributions in R'. Proceedings of the ACL 2007 Demo and Poster Sessions. 29–32, Prague, June 2007. Association for Computational Linguistics.
Facchetti, G. (2002). 'Antropologia della Scrittura: Con un'Appendice sulla Questione del Rongorongo dell'Isola di Pasqua'. Milano: Arcipelago Edizioni.
Faulkner, R. (1988). 'A Concise Dictionary of Middle Egyptian'. Griffith Institute, Oxford.
Fedorova, I. (1978). 'Mify, predaniya i legendy ostrova Paskhi'. Moscow: Nauka.
Felbermayer, F. (1971). 'Sagen und Überlieferungen der Osterinsel'. Darmstadt: Verlag Hans Carl Nürnberg.
Fischer, S. (1997a). 'Glyph-Breaker'. New York.
Fischer, S. (1997b). 'Rongorongo: The Easter Island script, History, Traditions, Texts'. Clarendon Press: Oxford.
Fischer, S. (1998). 'Reading Rapanui's rongorongo'. In C. M. Stevenson, G. Lee, & F. J. Morin (Eds.), Easter Island in Pacific Context: South Seas Symposium, Proceedings of the Fourth International Conference on Easter Island and East Polynesia, University of New Mexico, Albuquerque, 5–10 August 1997. 3–7. Los Osos: The Easter Island Foundation.
Foltz, P.W., Kintsch, W., & Landauer, T.K. (1998). 'The measurement of textual coherence with Latent Semantic Analysis'. Discourse Processes. 25. 285-307.
Gardiner, A. (2005). 'Egyptian grammar'. Third Edition. Published on behalf of the Griffith Institute, Ashmolean Museum, Oxford, by Oxford University Press, London.
Golcher, F. (2007). 'A stable statistical constant specific for human language texts'. In Recent Advances in Natural Language Processing 2007 (RANLP-07), to appear. Available at: http://amor.rz.hu-berlin.de/~golcherf/ranlp.pdf. Accessed: 24/02/2010.
Greenhill, S., Blust, R., and Gray, R. (2008). 'The Austronesian Basic Vocabulary Database: From Bioinformatics to Lexomics'. Evolutionary Bioinformatics, 4:271-283. Available at: http://language.psy.auckland.ac.nz/austronesian/. Accessed: 06/12/2009.
Grey, G. (1853). 'Ko Nga Moteatea, Me Nga Hakirara O Nga Maori'. Robert Stokes, Wellington.
Gries, S. (2009). 'Quantitative Corpus Linguistics with R: A Practical Introduction'. Routledge.
Guy, J. (1982). 'Fused Glyphs in the Easter Island Script'. Journal of the Polynesian Society. 91:445–447.
Guy, J. (1985). 'On a Fragment of the "Tahua" Tablet'. Journal of the Polynesian Society. 94:367–387.
Guy, J. (1990). 'On the Lunar Calendar of Tablet Mamari'. Journal de la Société des Océanistes. 91(2):135–149.
Guy, J. (2006). 'General properties of the rongorongo Writing'. The Rapanui Journal. 20:1, May.
Hyman, M. (2006). 'Of glyphs and glottography'. Language & Communication. 26:3-4. 231-249.
Holmes, D. (1992). 'A Stylometric Analysis of Mormon Scripture and Related Texts'. Journal of the Royal Statistical Society. Series A (Statistics in Society), 155:1. 91-120.
Horley, P. (2005). 'Allographic variations and statistical analysis of the rongorongo script'. Rapanui Journal. 19:2.
Horley, P. (2007a). 'Structural Analysis of rongorongo Inscriptions'. Rapanui Journal. 21(1). 25-32.
Horley, P. (2007b). Presentation at the VII International Conference on Easter Island. Gotland University: Sweden. 20-25th August 2007.
Horley, P. (2009). 'Rongorongo Script: Carving Techniques and Scribal Corrections'. Le Journal de la Société des Océanistes. 129. Juillet-Décembre.
Horley, P. (2010). 'Rongorongo Tablet Keiti'. Rapanui Journal. 24:1. 45-56.
Jaussen, T. (1893). 'L'île de Pâques. Historique et Écriture'. Bulletin de Geographie, Historique et Descriptive 2.
Johnson, K. (2008). 'Quantitative methods in Linguistics'. Blackwell Publishing Ltd.
Karena-Holmes, D. (1997). 'Māori language: understanding the grammar'. Auckland: Reed Publishing (NZ).
Kudrjavtsev, B. (1949). 'The Writing of Easter Island'. Compilation of the Museum of Anthropology and Ethnography 11. Saint Petersburg. 175–221.
Landauer, T., and Dumais, S. T. (1997). 'A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge'. Psychological Review. 104(2). 211-240.
Landauer, T. (2002). 'Applications of Latent Semantic Analysis'. Paper presented at the 24th Annual Meeting of the Cognitive Science Society. August.
Landauer, T., Laham, D., and Derr, M. (2004). 'From paragraph to graph: Latent semantic analysis for information visualization'. Proceedings of the National Academy of Science. 101. 5214-5219.
Landauer, T., McNamara, D., Dennis, S., and Kintsch, W. (2007). 'Handbook of Latent Semantic Analysis (University of Colorado Institute of Cognitive Science)'. Psychology Press, 1st edition.
Lee, G. (1992). 'Rock Art of Easter Island'. Monumenta Archaeologica 17. Los Angeles: UCLA Institute of Archaeology.
Manning, C., Raghavan, P., and Schutze, H. (2008). 'Introduction to Information Retrieval'. Cambridge University Press.
Maunsell, R. (1862). 'Grammar of the New Zealand Language'. W. C. Wilson, Auckland.
McLaughlin, S. (2004). 'Rongorongo and the rock art of Easter Island'. Rapanui Journal. 18:87-94.
Métraux, A. (1940). 'Ethnology of Easter Island'. Bernice P. Bishop Museum Bulletin 160.
Melka, T. (2008). 'Structural Observations Regarding rongorongo Tablet Keiti'. Cryptologia. 32. 155-179. Taylor and Francis group, LLC.
Melka, T. (2009a). 'The Corpus Problem in the rongorongo Studies'. Glottotheory. 1: 11-136.
Melka, T. (2009b). 'Linearity, Calligraphy and Syntax in the rongorongo script'. Glottotheory. 2(2). 70-96.
Melka, T. (2010). 'On Some Examined Features of Rongorongo: Tablet Mamari'. Oxford Journal of Writing Systems, Oxford.
Mosteller, F., and Wallace, D. (1964). 'Inference and Disputed Authorship - The Federalist'. CSLI Publications.
Penn, G. (2006). 'Quantitative methods for classifying writing systems'. Proceedings of the 18th International Congress of Linguists (CIL-18), 2. 175-176.
Peng, R., and Hengartner, N. (2002). 'Quantitative Analysis of Literary Styles'. The American Statistician. 5:3. 175-185.
Peng, F., Schuurmans, D., Keselj, V., Wang, S. (2003). 'Language Independent Authorship-attribution using Character Level Language Models'. In Proceedings of 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003). 267-274, April 12-17, 2003, Budapest, Hungary.
Pino, J., and Eskenazi, M. (2009). 'An application of latent semantic analysis to word sense discrimination for words with related and unrelated meanings'. In Proc. of the 4th Workshop on Innovative Use of NLP for Building Educational Applications.
Pozdniakov, K. (1996). 'Les Bases du Dechiffrement de l'Ecriture de l'Ile de Paques'. Journal de la Société des Océanistes, 103:2. 289-303.
Pozdniakov, K., and Pozdniakov, I. (2007). 'Rapanui Writing and the Rapanui Language: Preliminary Results of a Statistical Analysis'. Forum for Anthropology and Culture 3. 3–36. http://pozdniakov.free.fr/1620Easter%20Island%20english.pdf. Accessed: 6/8/2009.
Ray, S. (1932). 'Note on Inscribed Tablets from Easter Island'. Royal Anthropological Institute of Great Britain and Ireland, Man 32. 153-155.
Rao, R., Yadav, N., Vahia, M., Joglekar, H., Adhikari, R., Mahadevan, I. (2009). 'Statistical analysis of the Indus script using n-grams'. PLoS ONE 5(3): e9506.
Robinson, A. (2002). 'Lost Languages'. BCA: Great Britain.
Rosmorduc, S. (1997). SETH. http://www.iut.univparis8.fr/~rosmord/hieroglyphes/hieroglyphes.html. Accessed: 20th March 2010.
Routledge, K. (1919). 'The Mystery of Easter Island: The story of an expedition'. London and Aylesbury: Hazell, Watson and Viney.
Sproat, R. (2000). 'A Computational theory of Writing Systems'. Cambridge University Press: Studies in Natural Language Processing.
Sproat, R. (2003). 'Approximate String matches in the rongorongo Corpus'. http://www.cslu.ogi.edu/~sproatr/ror/.
Stamatatos, E., Fakotakis, N., and Kokkinakis, G. (2000). 'Text Genre Detection Using Common Word Frequencies'. In Proc. of the 18th Int. Conf. on Computational Linguistics (COLING2000). 808-814.
Stamatatos, E. (2009). 'A Survey of Modern Authorship-attribution Methods'. Journal of the American Society for Information Science and Technology. 60:3. 538-556.
Summers, D. (1993). 'Longman/Lancaster English Language Corpus Criteria and Design'. International Journal of Lexicography, 6:3.
Thomson, W. J. (1889). 'Te Pito te Henua or Easter Island: U.S. National Museum'. Annual report: 447-552.
Tomatsu, R. (2006). 'A Computational Analysis of Literary Style: Comparison of Kawabata Yasunari and Mishima Yukio'. Second Annual Rhizomes: Re-Visioning Boundaries Conference, The University of Queensland, Brisbane. Available at: http://espace.library.uq.edu.au/eserv/UQ:7704/rt_rhiz.pdf. Accessed: 05/12/2009.
Tregear, E. (1891). 'The Maori-Polynesian Comparative Dictionary'. Kessinger Publishing.
Vives Solar, J. (1920). 'Rapanui: Cuentos Pascuences'. Santiago de Chile: Imprenta Universitaria, Estado 63.
Wieczorek, R. (2010, in press). 'Astronomical Content in Rongorongo Tablet Keiti'. Le Journal de la Société des Océanistes.
Wild, F. (2009). 'LSA: Latent Semantic Analysis'. Open University. LSA package (Version 0.63-1) for the R programming language for statistical computing. http://cran.r-project.org/web/packages/LSA/index.html. Accessed: 16/03/2010.
Zipf, G. (1949). 'Human Behavior and the principle of Least-Effort'. Addison-Wesley.

Appendix

Presented here are photographs of tablets Tahua (A), Aruku-kurenga (B), and Mamari (C). Photographed by Ilaria Rovera (September, 2008). (Not to be reproduced in any form without prior permission from Fr. Jean Louis Schuester, Congregation of the Sacred Hearts, Rome.)

[The images are removed for the online version to prevent commercial use – please request the appendix from the author]
