1st June 2010
School of Social Sciences, History and Philosophy
Department of Applied Linguistics and Communication
Birkbeck University
Corpus linguistics as a method for the decipherment of
rongorongo
Martyn Harris
martynharris@ymail.com
Word count: 28,439
Dissertation submitted in partial fulfilment of the requirement for the MRes in Applied
Linguistics
Acknowledgements
This dissertation benefited from the help of many kind individuals, all of whom I would like to
thank (in no particular order): Paul Horley and Tomas Melka for openly discussing their views on
rongorongo, in addition to articles and hard to find sources; Fr. Jean Louis Schuester, and Maria
Centofanti of the Congregation of the Sacred Hearts (SSCC, Rome), for allowing access to the
Tahua (A), Aruku-kurenga (B), and Mamari tablets (C), and for granting permission to take
photographs (see appendix); Jill Hasell of the British Museum, for granting access to the London
tablet (K), and the Reimiro (L), and (J); Marco Baroni and Stefan Gries for comments and advice on
lexicostatistic, and corpus linguistic methods.
Table of Contents
Abstract........................................................................................................7
Chapter 1 Introduction...............................................................................8
Chapter 2 Writing systems and the rongorongo script........................12
2.1 Empirical issues relating to writing systems research and rongorongo......................12
2.2 The principle of the autonomy of the graphic system.................................................13
2.3 The principle of interpretation....................................................................................15
2.4 The principle of historicity..........................................................................................18
2.5 Literary genres of the rongorongo inscriptions...........................................................20
Chapter 3 Data..........................................................................................24
3.1 Data.............................................................................................................................24
3.2 Pre-processing.............................................................................................................31
Chapter 4 Methodology............................................................................34
4.1 Corpus linguistics.........................................................................................................34
4.1.1 N-grams: uni-grams, bi-grams, and tri-grams..............................................40
4.1.2 Key Word In Context concordances (KWIC)..............................................41
4.2 Authorship-attribution methods..................................................................................41
4.2.1 Principal components analysis.....................................................................44
4.2.2 Factor analysis..............................................................................................46
4.2.3 Tests of lexical-richness...........................................................................................46
4.2.4 Latent Semantic Analysis (LSA)..............................................................................47
Chapter 5 Presentation of findings.........................................................50
5.1 Feature selection.........................................................................................................50
5.2 Principal components analysis....................................................................................50
5.3 Factor analysis: Rongorongo corpus...........................................................................52
5.4 N-grams.......................................................................................................................55
5.5 Factor analysis: Language corpus...............................................................................57
5.6 Frequency spectrum and VGCs as tests for lexical-richness......................................60
5.7 Latent semantic analysis.............................................................................................70
5.7.1 Document-document analysis......................................................................70
5.7.2 Term-term similarity analysis........................................................................80
Chapter 6 Conclusion and Recommendations for further research....85
6.1 Conclusion....................................................................................................................85
6.2 Further research...........................................................................................................87
Bibliography..............................................................................................91
Appendix...................................................................................................98
Table of figures
Figure 1. Two sets of parallel-passages........................................................................................15
Figure 2. Classification of tablets by shared glyph-sequences, (Barthel, 1958)...........................21
Figure 3. Evidence of an inscription continuing on to the opposite side......................................24
Figure 4. Examples of spatial orientation and glyph-contraction.................................................27
Figure 5. Empirical Vocabulary growth curve of all data-sets......................................................28
Figure 6. Zipf Frequency/Rank plots............................................................................................37
Figure 7. Zipf's predicted straight-line in double-logarithmic space...........................................38
Figure 8. Scree plots (rongorongo data-sets)................................................................................51
Figure 9. PCA (rongorongo dataset).............................................................................................51
Figure 10. Factor analysis of rongorongo (DD4) with varimax and promax rotation.................53
Figure 11. Correspondence plot of the influence of morpheme loadings.....................................58
Figure 12. Comparison of the number of types: FZM and GIGP LNRE models.........................61
Figure 13. Number of types V(N) compared to the expected E[V(N)]........................................63
Figure 14. Measure of Baayen's P for texts in the rongorongo corpus.........................................64
Figure 15. VGC plot of the Santiago staff (Ia).............................................................................65
Figure 16. Discourse map of the Santiago staff (Ia).....................................................................66
Figure 17. Discourse map of the Mamari tablet (Ca)...................................................................67
Figure 18. VGC plots of experimental texts: a genealogy, and a narrative text...........................68
Figure 19. Plot of VGC and V1 for experimental texts................................................................68
Figure 20. VGC plot of texts Hr, Pr, and Qr.................................................................................70
Figure 21. Dendrogram of rongorongo corpus.............................................................................72
Figure 22. Dendrogram of language corpora................................................................................74
Figure 23. Dendrogram of Egyptian hieroglyph corpus...............................................................79
Figure 24. Examples of glyph <A1>............................................................................................80
Figure 25. Plot of glyph <N36>....................................................................................................83
Figure 26. Sample correlation-coefficients: English, Egyptian, Rongorongo, and Cuneiform....89
List of tables
Table 1. Rongorongo tablet identifier codes (Barthel, 1958; Fischer, 1997:393).........................25
Table 2. Example transliteration of glyphs from the Barthel (1958) sign inventory....................25
Table 3. Summary of the four possible data-sets..........................................................................27
Table 4. Grammatical structures present in Maori, Marquesan, and Rapanui..............................30
Table 5. Compound morphemes in the Rapanui corpus...............................................................31
Table 6. The most productive bi-grams in the rongorongo corpus...............................................55
Table 7. The most productive tri-grams in the rongorongo corpus...............................................56
Table 8. Parameters and resulting VGC of Generalized Inverse Gauss Poisson model...............62
Table 9. Observed and modelled values for ranked-frequencies..................................................62
Table 10. Pair wise correlation between texts of group 1.............................................................69
Table 11. Pair wise correlation between texts of group 2.............................................................71
Table 12. Pair wise correlation between texts of group 3.............................................................71
Table 13. Pair wise correlation between texts of group 4.............................................................71
Table 14. Summary of text clusters and revised groupings..........................................................73
Table 15. Top 20 Frequency count of Glyph <A1>......................................................................81
Table 16. Summary of the KWIC concordance and LSA results.................................................81
Table 17. KWIC concordance of <U19>......................................................................................82
Table 18. Summary of KWIC results for glyph <U19>...............................................................82
Abstract
The rongorongo writing system of Easter Island is the only example of writing in Polynesia. The
structural properties of the script and the few remaining inscriptions have complicated decipherment
work for many years. With the development of sophisticated language and word-frequency
distribution models (Zipf, 1949; Baayen, 2001), and text classification (Mosteller and Wallace,
1964; Landauer and Dumais, 1997; Landauer et al., 2002; 2004; 2007; Manning et al., 2008), the
issues relating to the study of large multilingual corpora can be addressed. This study adopts a
mixed-methods approach and evaluates methods for classifying the rongorongo texts according to
their literary-genre. Although there is still work to be done in determining the types of
literary-genre produced by rongorongo scribes, it is believed that a classification of the texts will
restore some contextual information, enabling future studies to identify some of the
structural-principles behind the productivity of particular glyphs.
The paper highlights the empirical issues relating to the use of multilingual-corpora and
word-frequency distributions, and to dealing with different sample-sizes. The Barthel (1958) corpus
is tested for the optimum representation for statistical exploration. Furthermore, the results validate previous
conclusions regarding the presence of shared passages (Barthel, 1958; Fischer, 1997; Sproat, 2003;
Guy, 2006; Horley, 2007a; 2009; 2010; Melka, 2008; 2009b; 2010), and literary-genre (Butinov and
Knorozov, 1957; Barthel, 1958; Guy, 1985; 1990; Fischer, 1997; Davletshin, 2002; Berthin and
Berthin, 2006; Melka, 2008; 2009a; 2009b; 2010; Wieczorek, 2010). The final results provide
a possible foundation for further exploration of literary-genre in the rongorongo corpus, and a series
of methods that can produce robust statistical measures.
Chapter 1 Introduction
Rongorongo is the undeciphered writing system of Rapanui (Easter Island), discovered in 1864. On
the lack of an agreed decipherment, Steven Fischer notes that ‘after
130-odd years [now 150 years, my note], there is still no complete translation of Easter Island's
rongorongo inscriptions’ (Fischer 1997:263)1.
So why decipher a forgotten writing system? Aside from the academic challenge, the rongorongo
tablets can no doubt contribute to our current understanding of the culture that flourished on Easter
Island, and a particularly interesting question that needs to be addressed is: what was so important
to the Rapanui, that it had to be written down?
This paper uses corpus-based techniques, including lexicostatistics, and text classification or
authorship-attribution methods to establish whether the corpus of rongorongo tablets can be
categorised according to the scribe who produced them, or the literary-genre they represent. This
will help to evaluate previous methods and hypotheses regarding rongorongo genres (Butinov and
Knorozov, 1957; Barthel, 1958; Fedorova, 1978; Guy, 1985, 1990; Pozdniakov, 1996; Pozdniakov
and Pozdniakov, 2007; Fischer, 1997; Berthin and Berthin, 2006; Horley, 2007a; 2010; Melka,
2008, 2009a, 2009b, 2010; Wieczorek, 2010). This paper focuses on the methods adopted by
Mosteller and Wallace (1964), Biber et al. (1998), Baayen (1994; 2001; 2002; 2008), Baayen et al.
(1996), Landauer et al. (2004), and Gries (2009).
Only one genre is confirmed, or at least agreed upon, by the majority of researchers; namely, that
the Mamari tablet (Ca and Cb) appears to be some form of lunar calendar, (Barthel, 1958; Guy,
1990, Berthin and Berthin, 2006; Melka, 2009a, 2009b, 2010). In addition, one side of the Keiti
tablet (Er1-Er8), may also bear some calendar-sequences, (see Butinov and Knorozov, 1957;
Wieczorek, 2010). Fischer (1997) concludes that the majority of the tablets represent procreation
chants, based on the structure of the Santiago staff (Ia) and its apparent structural-parallels with the
tradition Atua mata-riri (Thomson, 1889). These analyses rely on qualitative descriptions of the
glyphs combined with an internal structural analysis2, but with little in the way of quantitative
support (see however, Barthel, 1958; Horley, 2005; 2007; Pozdniakov and Pozdniakov, 2007; for
examples of statistical analysis). This paper addresses this need for further statistical
analysis by pulling together several different methods in linguistics capable of identifying structures
and stylistic “fingerprints” present in corpora.
1. Inferring that his proposed decipherment has provided a partial solution, although there is still no consensus among the research
community that any part of the inscriptions has yet been successfully deciphered.
2. A method similar to distribution analysis, a structural-linguistic method that identifies the 'systematic relations and
structural properties of language elements', involving segmentation and substitution, and used by linguists to analyse unknown languages.
This method has been shown to identify the direction of writing and make proposals for the underlying inventory of basic signs (see
Coulmas, 1996: 132).
Applying corpus-linguistic methods will identify the 100 most productive glyphs (or features),
which form the basis for authorship-attribution methods, namely: principal component analysis
(PCA), factor analysis, and latent semantic analysis (LSA). These methods are capable of drawing
out structures present in multivariate data (see Baayen, 2008). Lexicostatistic methods will enable
the study to address an issue underlying corpus linguistics: the comparison of texts with different
sample-sizes. This involves modelling the distribution of word frequencies and measuring
vocabulary growth as a test of lexical-richness, commonly using statistical models based on Zipf's
law (Baayen, 2001; Baroni, 2006).
Word frequency data, KWIC concordances (Key Word In Context, see Summers 1993; Biber et al.,
1998; Gries, 2009), and an analysis of collocations (bi-grams and tri-grams, see Baayen, 2008) will
identify compound glyphs and their given context. Latent Semantic Analysis (LSA) will describe
the correlation between a group of texts and the strength of that correlation, providing a faster,
though computationally expensive, alternative to traditional corpus-methods (word-frequency counts
and KWIC concordances). In terms of the validation of results, previous research, including the
identified parallel-passages, will act as a means to measure the success of text classification, in
combination with n-gram counts and KWIC concordances.
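As an illustration of the counts described above, the following Python sketch tallies bi-grams and builds a simple KWIC listing over a sequence of glyph codes. It is a minimal sketch and not the procedure used in this study: the function names `ngrams` and `kwic`, the window size, and the toy glyph sequence are assumptions introduced here for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for every occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits


# Toy glyph sequence (hypothetical, not a passage from the corpus).
glyphs = ["380", "1", "3", "600", "380", "1", "3", "2", "380", "1"]

# Frequency of each bi-gram; tri-grams follow by passing n=3.
bigram_counts = Counter(ngrams(glyphs, 2))
print(bigram_counts.most_common(2))

# One concordance line per occurrence of the keyword glyph.
for left, key, right in kwic(glyphs, "380"):
    print(" ".join(left).rjust(10), f"<{key}>", " ".join(right))
```

Frequent bi-grams and tri-grams are candidates for stable glyph-sequences, and the concordance lines show the contexts in which a given glyph recurs.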
These methods are applied to a second corpus of oral traditions in three Polynesian languages:
Rapanui, Marquesan, and Maori. This corpus is used for experimental purposes (see results section,
particularly, factor analysis, vocabulary growth curves (VGCs), and LSA). The assumption is that a
genealogy, whether present in the language corpora or the rongorongo corpus, is likely to have a
high degree of lexical-richness. This is due to genealogies generally consisting of a list of
individual names; they will therefore have more words with a low frequency of occurrence (i.e.
hapax-legomena). A chant or song, on the other hand, may contain repeated or formulaic expressions
in which some words are repeated many times over, and will therefore be less lexically-rich than a
genealogy (see reconstructed form of Atua-mata-riri, 'god angry-eyes' in Fischer, 1997). Therefore,
the data produced by the above methods may allow for the categorisation of texts or textual
fragments into their respective groups based on the productivity of stable glyph-sequences.
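The assumption can be illustrated with a minimal Python sketch that tracks the number of types V(N) as tokens are read and counts the hapax legomena (V1). The two token lists are invented placeholders standing in for a name list and a formulaic chant, not data from either corpus, and the function names are assumptions made for this illustration.

```python
from collections import Counter


def vocabulary_growth(tokens):
    """Return V(N): the number of distinct types after each successive token."""
    seen, curve = set(), []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def hapax_count(tokens):
    """Return V1: the number of types that occur exactly once."""
    return sum(1 for count in Counter(tokens).values() if count == 1)


# Placeholder texts of equal length (hypothetical): a linking word followed by
# a new name each time, versus a refrain built from a few recurring words.
genealogy = ["x", "NAME1", "x", "NAME2", "x", "NAME3", "x", "NAME4", "x", "NAME5"]
chant = ["r", "a", "r", "a", "r", "b", "r", "a", "r", "c"]

for label, text in (("genealogy", genealogy), ("chant", chant)):
    curve = vocabulary_growth(text)
    print(label, "V(N) =", curve[-1], "V1 =", hapax_count(text))
```

At the same sample-size the name list ends with more types and more hapax legomena than the refrain, which is the contrast the vocabulary growth curves are expected to show.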
Naturally, it is not possible to hint at what 'genre' the groups may represent; for example, are they
grouped according to the individual writing-style of the scribe, or by the genre of the text
(procreation chant versus calendar), or a mixture of the two? Indeed, the absence of a glyph or
group of glyphs, on a tablet, may enlighten us as to what motivates their productivity: for example,
based on previous studies of the Mamari tablet (a likely calendar, see Barthel, 1958; Guy, 1990;
Horley, 2009; Melka, 2009a; 2009b; 2010), can other tablets containing similar features or
structures be classified as calendars, or at least, sequences of important dates? If methods in
authorship-attribution can discriminate between tablets with shared features, then it will be possible
to retrieve some form of contextual information, which has not been consistently recorded in early
records.
The notation adopted here identifies rongorongo glyphs with angle brackets
'<...>', and phonetic descriptions with forward slashes '/.../'. The zero-padding in the Barthel (1958)
corpus has been removed to enable the transcription to line up with glyph images wherever
possible, and accordingly, the same convention is applied in the discussion to enable comparison
(not to be confused with the transliteration scheme proposed by Horley, 2005). The study is divided
into five main sections. The first of these will address the principles that should be adopted when
approaching writing from a linguistic perspective, combined with a literature review of past research,
with particular emphasis on information elicited about the contents, or literary genre, of the tablets
(see chapter two).
Chapter three discusses the corpora and the stages involved in pre-processing. The following
chapter introduces the methodology, evaluating the benefits and shortcomings associated with
corpus linguistic and authorship-attribution based studies. The fifth chapter presents the results of
PCA and factor analysis, a comparative analysis of relative word-frequencies and vocabulary
growth curves, and a final section dedicated to latent semantic analysis, with an overview of its
utility in identifying lexical or grammatical-association patterns (Biber et al., 1998). An Egyptian
corpus is introduced to evaluate LSA methods applied to a deciphered writing system, allowing the
analysis to be extended to glyphs, resulting in more transparent results than achieved with an
undeciphered writing system.
The final chapter will review what the study has achieved and offer some suggestions for future
research, which may contribute to the decipherment of the rongorongo script and the messages that
the tablets contain.
It is hoped that the results presented here will provide quantitative support for some of the previous
suggestions that have been made concerning literary genre and the rongorongo tablets. A
decipherment of the rongorongo script is beyond the scope of this paper and the data. However,
what this paper does aim to achieve is the categorisation of the inscriptions into their respective
groups or 'genres', and the restoration of some contextual information to the inscriptions.
Chapter 2 Writing systems and the rongorongo script
2.1 Empirical issues relating to writing systems research and rongorongo
Before any analysis, it is important to clarify some of the assumptions underlying the study of
writing systems. This section outlines some key principles proposed by Coulmas (2003) for
approaching writing systems from a linguistic perspective. It begins with an outline of the
terminology used in the rest of this paper with regard to writing. The term 'script' is defined along
the lines of Sproat (2000), who states that a script is considered to be 'a set of distinct marks
conventionally used to represent the written form of one or more languages' (Sproat, 2000:25). In
this analysis it is assumed that the language underlying rongorongo is a Polynesian one, and one
which is likely to reflect the linguistic conventions of an older form of the modern day Rapanui
language. In addition, the term 'glyph' will refer to 'a written symbol with a particular shape',
regardless of whether it could be sub-divided into smaller constituents (see Guy, 1982; Horley, 2005;
Pozdniakov and Pozdniakov, 2007).
Two main assumptions need addressing with regard to writing. Firstly, there is no necessary
one-to-one correspondence between the elements of a writing system and the linguistic units it is
supposed to represent. For example, the proto-form of the Cuneiform syllabary was used in a
specific social context (accounting): its glyphs represent semantic categories, and the arrangement
of signs has little to do with the linear arrangement of speech. Phonetic values were later
introduced, but played only a minor role (Damerow, 2006:4; see also Hyman, 2006). Therefore it
cannot be assumed that a sequence of glyphs represents a string of phonemes, nor can it be
concluded that, as with other examples of writing, rongorongo represents a form of proto-writing
with nothing but semantic categories. Proto-cuneiform tended to be tabular in form, reflecting its
use as a device for recording economic transactions. The rongorongo glyphs, on the other hand, are
arranged linearly, and the boustrophedon nature of the script allowed for continuous chanting by
rotating the tablet through 180 degrees at the end of each line (see Jaussen, 1893b; Routledge, 1919;
Metraux, 1940; Fischer, 1997, and the appendix). Therefore, the social or economic context may
also have an influence on the structural properties of a script.
Secondly, we cannot interpret non-Western writing systems according to a Western concept of
writing (Coulmas 2003), in other words, with preconceived ideas on how the writing system should
work. One case in point is given by the informant sessions between Metoro and Bishop Jaussen
(1893b). Jaussen believed that the tablets could be deciphered by eliciting each word from his
informant, and assigning the word to the relevant glyph in relation to the position in the oral chant,
in his own words:
I had as many gathering of words, separated one from the other, as there
were signs in one line of the tablet; and as a person could, without
knowledge of the language, by counting exactly, place each sign above the
word that is its proper meaning.
(Jaussen, 1893b:253; cf. Fischer, 1997:52)
This one-to-one correspondence between the words of his informant's chant and the glyphs did not
produce any coherent results, partly because the assignment of phonetic values was performed in
the absence of his informant (see Fischer, 1997:53). As noted by Coulmas (2003:33) 'there is no
perfect fit between the linguistic constructs that are functional in speech and writing', which is
primarily caused by the fact that writing systems are 'static', whilst speech is 'dynamic' in nature, as
seen from evidence in historical linguistic studies: for example, Indo-European was the predecessor
or proto-language of modern English, French, German, and Hindi. These have become mutually
unintelligible over time, whilst retaining some correspondences through their common ancestral
root.
Coulmas (2003:33) proposes four main assumptions regarding writing and its relationship to the
spoken language. Three of these assumptions will be adopted in this paper: writing and speech are
distinct systems; they are related in a variety of complex ways; speech and writing have both shared
and distinct functions. These assumptions are the basis for three principles, which constitute the
key reasons why linguists should study writing as a system of communication (Coulmas, 2003:33).
2.2 The principle of the autonomy of the graphic system
Under this principle, writing systems are considered to be structured, consequently allowing them to
be analysed in terms of 'functional units and relationships', (Coulmas, 2003:34), in the same way
that linguists analyse the components of speech.
The arrangement of the graphic signs is restricted by rules, which govern their 'linear arrangement
in forming large expressions', (Coulmas, 2003:34); this mirrors language where syntax places
restrictions on the ordering of linguistic constituents, i.e. subject and object.
A question that needs addressing under this principle is: 'What are the basic operational units of the
system', (Coulmas, 1991:49), in other words, what units could be considered as forming the main
inventory of signs that define the writing system?
A number of proposals have been made in connection with rongorongo. One extreme is the
mnemonic-aid hypothesis, where glyphs allow the reader to recall parts of chants that they have
previously memorised, motivated by the belief that, 'pictographies generally have more variety and
richness in the choice of figures and symbols', (Metraux, 1940:404). Other researchers conclude
that rongorongo has a logographic or logo-phonetic system similar to Egyptian, 'in which the
auxiliary parts of speech and affixes may easily be omitted', (Butinov and Knorozov, 1957:16;
Davletshin, 2002:4); a mixed syllabary (Pozdniakov and Pozdniakov, 2007; Guy, 2006; Horley,
2005); or a combination of all three: a script with syllabic, logographic, and semasiographic
properties (Fischer, 1998:5).
A statistical analysis (Horley, 2005:114; Pozdniakov and Pozdniakov, 2007) suggests that the script
is likely to be syllabic, in accordance with the structure of the Rapanui language. The phonology of
the Rapanui language allows for syllables of type (C)V only. Consequently, there are no 'closed
syllables' or 'consonant clusters' of type CVC, CCV, CVCC, or CCVC. In addition, 'morphological
considerations do not affect the division in syllables', and hence the only division between syllables
is either 'before a consonant if there is one otherwise between vowels' (Du Feu, 1996:186).
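The division rule quoted above can be expressed as a short Python sketch. It is only an illustration under stated assumptions: the function name `syllabify` is introduced here, vowels are taken to be the five letters a, e, i, o, u with no length marking, and every consonant is assumed to be written with a single character, so a digraph such as 'ng' would need to be replaced by one symbol before the rule is applied.

```python
VOWELS = set("aeiou")  # assumption: five plain vowels, length not marked


def syllabify(word):
    """Split a word into (C)V syllables; raise ValueError if it does not fit."""
    syllables, onset = [], ""
    for ch in word.lower():
        if ch in VOWELS:
            # A vowel closes the syllable, with or without a preceding consonant.
            syllables.append(onset + ch)
            onset = ""
        elif onset:
            # Two consonants in a row would form a cluster, which (C)V excludes.
            raise ValueError(f"consonant cluster in {word!r}")
        else:
            onset = ch
    if onset:
        # A trailing consonant would close the final syllable.
        raise ValueError(f"closed syllable at end of {word!r}")
    return syllables


for word in ("rapanui", "makemake", "atua"):
    print(word, syllabify(word))
```

The boundary falls before each consonant and, where no consonant intervenes, between adjacent vowels, so 'rapanui' divides as ra-pa-nu-i and 'atua' as a-tu-a.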
Metraux (1940) observed that glyphs appear either in isolation or linked together. This form of
conjunction is intentional, with Metraux concluding that 'these compound signs have no definite
meaning of their own independent of the elements which compose them', (Metraux 1940:401).
Therefore, he concludes that compound forms are derived by 'cursive writing' conventions. One
argument against this view is given by Kudrjavtsev (1949) who observed stable sequences of glyphs
believed to be the same inscription repeated across several tablets. These are termed the
'parallel-passages' and provide evidence of alloglyphic variants. These illustrate part of the script's
composition, and the features that may form the main constituents3. The parallel-passages also
show that glyphs joined together on one tablet are sometimes carved individually on others
(Barthel, 1958:151-168); see for example tablet H, versus tablet P and Q, or Gr versus K, in figure 1
below.
3. For sign lists reducing the current Barthel (1958) sign inventory to approximately 50 main signs, see Horley (2005:112);
Pozdniakov and Pozdniakov (2007).
The tablets are identified with the letters A-Z, and each side with r and v (recto/verso where the
tablet's starting point is agreed upon), or an a or b (where the beginning of the tablet is unknown).
In addition, the number that follows indicates the line number on the tablet. The period and
colon denote affixed and fused glyphs, respectively, and an asterisk indicates the end of a line.
1.a Group One
Aa1
[…] 430.40- 320.9-320.9-440-440-440-440-445-695-4.120a-4.67-34c-60a.260
Hr5
200.5-21h:5-2e-41-220.9h-220.9jh-440-440-440a-20t.440-205s-205s-4.3a-4.3a-72a-51.3a-66a.90Vj
Pr5
200.5-2a-5-2a-41-220.9-220.9-440-440-440a-440a-20V-205-205-4.3ax-4.3ax-66c-65-65.3ax-66a.95aj
Qr5
200.5-21:5-2a-41-220a.9-220a.9-440-440-440-440b-305-305s-4.3ax-4.3ax-117d-65.22t-65.3ax-66y.95.711b
Barthel (1958:155-156)
1.b Group Two
Gr3
680-470-1t-430-580c-380.1.3-602.9-232-600-380.1.3-595.5-122-280-67-59f-69.700-380.1.3-2-609-380.1.3-597-380.1.3-59f-720-380.1.3*
Kr3-4
400-470-1t- 430-580c-380.1.3*-402-9- 380.1- 595.5- 122-280- 380.1- 2- 409- 380.1.3-597-380.1.3-59f- 720- 380.22
Barthel (1958:156)
Figure 1. Two sets of parallel-passages between four tablets (1.a), and a group of two (1.b) tablets.
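The transcription conventions can be made concrete with a small Python sketch that splits a transcribed passage into lines, glyph groups, and component signs. This is a minimal sketch rather than the pre-processing used in this study: the function name `parse_passage` is hypothetical, the hyphen is assumed to separate successive glyph groups, spaces are assumed to be alignment padding only, and the period and colon are kept as joiners within a group.

```python
import re


def parse_passage(transcription):
    """Return a list of lines; each line is a list of glyph groups, and each
    group is a list of signs interleaved with their joiners ('.' or ':')."""
    text = transcription.replace(" ", "")  # assumption: spaces are padding
    lines = []
    for raw_line in text.split("*"):  # an asterisk marks the end of a line
        groups = [re.split(r"([.:])", g) for g in raw_line.split("-") if g]
        if groups:
            lines.append(groups)
    return lines


# Opening groups of line Qr5 as given in figure 1.
for group in parse_passage("200.5-21:5-2a-41-220a.9")[0]:
    print(group)
```

Keeping the joiners in the output allows affixed and fused groups to be counted separately from glyphs carved in isolation, which is one way the variation between parallel-passages could be tabulated.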
The parallel-passages show glyphs acting as affixes, as well as compounding and reduplication, and
hint at possible allographs. Note the differences in head and arm shape, and missing elements in the
examples (H, P, and Q) and (Gr, K). Metraux believed that these differences have no implications
for understanding the script, stating it to be 'inconceivable that a particular significance would be
given to the presence or absence of an arm or a leg, the position of the head, or the form of the
hand' (Metraux, 1940:401).
In summary, the current assumptions regarding the nature of the script's typology and the support
given by statistical analysis (Horley, 2005; Pozdniakov and Pozdniakov, 2007) would indicate that
rongorongo is likely to be a syllabary, with a small set of logograms perhaps representing the most
frequent functional unit, or the most revered; for example, glyph <600> possibly represents a
bird-motif attributed to the god Makemake (Metraux, 1940:311), or glyph <50>, which may be
associated with fertility rites (Routledge, 1919); both motifs are found on the island rocks (Lee, 1992;
McLaughlin, 2004).
2.3 The principle of interpretation
Writing systems follow a structure, whether, 'phonetic, phonemic, morphophonemic', or 'lexical',
(Coulmas, 2003:34). Analysing a writing system requires an awareness of how the elements relate
to the linguistic structures and units. This refers to orthographic depth, or what Sproat (2000)
terms, the 'orthographically relevant level' (ORL), (Sproat, 2000:10), which is divided into deep and
shallow, or opaque and transparent.
This describes how closely the script represents the
phonological system of the language. A shallow orthography is a script like Finnish or Spanish,
where there is a one-to-one correspondence between a grapheme and a phoneme. Whereas, a deep
orthography, like Hànzi (Chinese), shows a one-to-many relationship between grapheme and
phoneme. The way in which scholars previously tackled decipherment4 involved, 'identifying the
underlying language and reconstructing the way it is coded in the written symbols' (Damerow,
2006:1). This has been successful in the case of Egyptian and Cuneiform where a large corpus of
texts has been preserved. It should be acknowledged, however, that the earlier systems of writing
had very little to do with spoken language, and so the philological methods adopted by researchers
are not as effective where the relationship between speech and writing is weak, (Damerow, 2006:1).
In addition, without prior knowledge of the underlying language it becomes even more difficult to
find a solid basis for decipherment. Etruscan, for example, is a non-Indo-European language, which
thrived in Italy around 1200-100 BC (Bonfante and Bonfante, 2002:3), but adopted the Greek
script, complicating any decipherment work.
4. Methods include the analysis of bilingual texts, and the re-construction of sign-lists. See the decipherments of Egyptian, Linear B,
and Mayan (Robinson, 2002).
Although it is not currently possible to draw conclusions regarding the mapping of any particular
rongorongo glyph to a phoneme, researchers believe that, 'it is possible to isolate independent
groups of words, and […] more important, single words from the continuous text', (Butinov and
Knorozov, 1957:10), and statistical studies indicate that, 'the average glyph may consist of about
three elements corresponding to letters or two elements of a certain syllabic value', (Horley,
2005:108).
Pozdniakov and Pozdniakov (2007) state that any analysis attributing a logographic reading to the
rongorongo script would, ‘conflict with the frequency distribution of the glyphs’, (Pozdniakov and
Pozdniakov 2007:11). According to their study, the 52 glyphs with the highest frequency in the
corpus (Pozdniakov and Pozdniakov, 2007:8) come close in number to the approximately 55 possible
syllables of the Rapanui language, which is, ‘itself a weighty argument in support of the hypothesis’
that the glyphs represent syllables, (Pozdniakov and Pozdniakov 2007:11).
However, later on in their analysis they appear to class rongorongo as a mixed script with both
logographic and syllabic elements. After an analysis of word-length in both corpora, it appears that,
‘the average length of a word in the Rapanui language coincides almost exactly with the average
length of a word in the written texts’, (Pozdniakov and Pozdniakov 2007:14).
Pozdniakov and Pozdniakov (2007:31) account for the issues associated with divergent statistics
between the rongorongo script and the Rapanui language, by proposing the presence of
'determinative' or 'reduplicator' glyphs (glyphs <3> or <200>) and changes in meaning or
phonology through mirroring: where an element of a glyph such as the head, faces in the opposite
direction to the established reading order of left-to-right (Jaussen, 1893b; Kudrjavtsev, 1949). They
attribute this behaviour to reduplication of the morpheme (Pozdniakov and Pozdniakov, 2007:32).
This is not necessarily an incorrect assumption given, as mentioned previously, that writing and
speech are distinct systems for different communicative needs, but this proposal has not yet been
confirmed or further investigated.
Another characteristic of the script is the constant repetition of small stable glyph-sequences 5,
which may represent, 'independent phraseological units or single words' (Butinov and Knorozov
1957:11), a conclusion paralleled by Metraux (1940:401), 'if the signs are a form of writing
corresponding to words and syllables, groups or sequences of them will be repeated many times,
especially for a Polynesian language'. A further proposal is that some elements may be omitted by
the scribe, for example glyph <39>, possibly representing the particle te, which, if verified, would
not cause any significant change at the semantic level, (Butinov and Knorozov, 1957; Horley,
2005:114). As a result, it should not be assumed that the glyphs can be assigned phonetic content
due to the possible existence of 'semantic classifiers' represented by the delimiter groups (see
Horley 2005), and further complications caused by the probable high-level of polysemy (see Guy,
2006; Fischer, 1997; Horley, 2007a; 2009; Melka, 2009a).
5. Glyph sequence <380.1.(3)> appears on a sub-grouping of tablets, for example the London (K), Small Santiago (Gr), Mamari (C),
Echancree (D), and Small Vienna tablets.
In conclusion, on the basis of previous research the assumption made here is that the rongorongo
script is a mixed system with an inventory consisting of logographic and syllabic glyphs. It is
possible that there are additional glyphs in the inventory that have a special function or provide a
shorthand version of a more complex glyph. How much logography is present in the script is still
debated, and new studies posit a gestural function for some glyphs, such as the sequence
<480.2-483-2-4802> at the start of line Cb10 on the Mamari tablet (Melka, 2010). This cannot be
ruled out due to the autonomy of writing (Coulmas, 1996:27). Therefore, it is necessary for the ORL
of rongorongo to be determined in order to provide better grounds for decipherment. This would
require an entirely new study beyond the scope of this paper (see the further research section for a
preliminary study).
2.4 The principle of historicity
It is assumed that writing developed at a later phase of human history when it became necessary to
keep records of transactions, or levels of production, as seen by the developmental phases
undergone by Cuneiform into a syllabary capable of writing down different languages (from
Sumerian to Akkadian, though with slight variations). Writing is therefore considered to be a,
'technological development', which has only been 'independently invented four times in history'
(Sproat 2000:21), namely the Hieroglyphs (Egypt), Cuneiform (Sumer), Jiăgŭwén (China, later
known as Hànzi), and Mayan (Central America). That is to say, that these writing systems were not
the product of cultural diffusion due to close contact between one group of people and another.
Rongorongo is another writing system that appears to have been developed independently and in
relative isolation.
The rongorongo tablets were discovered by Joseph Eugène Eyraud, who visited Rapanui in 1864.
The purpose of his visit was to survey the island on behalf of the Congrégation des Sacrés Coeurs
and to establish a Catholic mission there. During his stay he became aware of wooden boards
wrapped in leaves suspended from the rafters of the Rapanui huts:
One finds in all the houses wooden tablets or staffs covered with sorts of
hieroglyphic characters. These are figures of animals unknown on the
island, which the natives trace by means of sharp stones.
(Altman, 2004:21)
The term rongorongo derives from te kohau rongorongo, which is translated as ‘the stick of the
rongorongo men’ (Metraux, 1940:389). Prior to this designation, the script was referred to as taa
or he timo (Fischer 1997:35). The script is classified as a boustrophedon writing system; the reader
starts reading from the bottom-left corner, and reads along until the end of the line, whereupon the
tablet is turned 180° and reading continues left-to-right. This was first noticed by Maklai (Fischer 1997:39),
confirmed later in informant sessions by Jaussen (1893b), Thomson (1889), Routledge (1919:244),
and by comparison of the glyphs in the parallel-passages (Kudrjavtsev, 1949). This method may
have allowed for continuous chanting, or prevented the reader from missing a line of glyphs,
(Metraux, 1940:405; Fischer, 1997:353).
According to Thomson's informants, the first king to arrive on Rapanui, Hotu-Matua, 'possessed the
knowledge of this written language, and brought with him to the island sixty-seven tablets
containing allegories, traditions, genealogical tables, and proverbs relating to the land from which
he had migrated' (Thomson, 1889:514). The first evidence of Rapanui 'writing' is seen on a
document presented by Gonzalez y Haedo to the islanders, when a Spanish expedition annexed the
island in 1770. The signatures on the document are composed of linear and abstract signs, with a
couple resembling rongorongo glyphs (Horley, 2005). Metraux (1940:400) dismisses these
signatures as 'meaningless scrawls not connected to the tablets', although he does acknowledge that
the 'bird figure' may be a possible exception (glyph <400>).
Other research concludes that rongorongo may have been the result of contact with this 1770
Spanish expedition (see Bastian, 1872; Emory, 1968; Fischer, 1997; 1998; Facchetti, 2002); a
cultural solution to the encoding of language, flourishing during the period leading up to 1860-4,
but only lasting approximately three generations (Fischer, 1998:3). Horley (2005:115), however,
argues that it is, ‘reasonable to assume that the rongorongo script was already developed before the
contact with Europeans’. Horley (2005:115) shows that some signs on the document can be traced to known
rongorongo glyphs and motifs that are inscribed on skulls, and as petroglyphs on rocks (see
Metraux, 1940:399), and moai hats (see also Guy, 1985; Lee, 1992; McLaughlin, 2004; Melka,
2009a).
A knowledge of the written characters was confined to the royal family, the chiefs of the six districts
(into which the island was divided), sons of those chiefs, and certain priests or teachers (Thomson,
1889:514). There are a number of reports that show the tablets were used for chanting, and that
there existed a tradition of transmission facilitated by readings of the tablets on public occasions,
(Routledge, 1919). Thomson's (1889) report also supports this:
[…] people were assembled at Anekena Bay once each year to [h]ear all of
the tablets read. The feast of the tablets was regarded as their most
important fête day, and not even war was allowed to interfere with it.
(Thomson, 1889:514)
2.5 Literary genres of the rongorongo inscriptions
In order to assess the success or failure of any statistical analysis in identifying groups of similar
inscriptions, it is important to study previous ethnographic evidence and research. Roussel reports
that the close ancestors of a number of his informants could still read the script and the tablets
contained the history of their island, (see Fischer, 1997:36; Altman, 2004).
Meinicke (1871:550-551) proposes that the tablets were ancient genealogical texts of island ariiki
(chiefs); this belief was dismissed by another researcher, Bastian (1872), who instead argues that
they contain songs that were memorised and recited at festivals. This would entail that the script
was developed after European contact rather than through ancient origins (cf. Fischer 1997:40-42).
According to Metraux (1940:395) there have been no reports elicited by him or Routledge (1919)
that any of the tablets contained genealogies, lists of chiefs, or the origins and exploits of the
Rapanui, which conflicts with native tradition reported by Thomson (1889). Therefore, there is still
no agreement as to what the rongorongo inscriptions represent in terms of literary genre. Ray
(1932:155) believes the rongorongo tablets are, ‘a collection of more or less symbolic reminders of
objects or actions which would serve the native orator as notes of a discourse on history, a prayer, or
even an inventory’, agreeing with Thomson (1889).
Butinov and Knorozov (1957) propose that some tablets are genealogies, namely the Small
Santiago tablet (G), based on the composition of the glyphs and repeated sequences (Butinov and
Knorozov 1957:7-8, 15). In addition, they believe that, ‘one tablet could have several different
texts’, (Butinov and Knorozov, 1957:8), given statistical evidence that shows glyph sequences
representing a lunar series on one side of the Keiti tablet, (Wieczorek, 2010). Fischer (1997;
1998:5) however, suggests that the majority of the tablets are likely to represent procreation chants
due to their triadic structure and the high productivity of 'the phallus', glyph <76>, on the Santiago
staff, which represents a semasiograph meaning 'copulated with'. However, there is still some doubt
whether this hypothesis holds as some tablets lack this suffix.
Barthel (1958), proposed that the tablets could be categorised along the following lines:
Aruku-kurenga(B) - Grt. Washington(S)
Sm. Washington(R) - Tahua(A) - Grt. Santiago(H)
Keiti(E) - Sm. Santiago(G) - Santiago staff(I) - Honolulu(T)
Mamari(C) - Sm. Vienna(N)
Figure 2. Classification of tablets by shared glyph-sequences, (Barthel, 1958:167).
The above paradigm illustrates four main groups of tablets. Barthel (1958:1968) shows that there is
some overlap with text C having sequences in common with R, A and H. In addition, text G can be
split according to side, with Gr sharing sequences with N and E, and sections of C and S. Barthel
(1958) attributes Gv as being similar to texts I and T. This classification is agreed upon by previous
research (Metraux, 1940; Butinov and Knorozov, 1957; Fischer, 1997; Horley, 2007a; 2010).
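A grouping by shared glyph-sequences of the kind described above can be approximated by comparing contiguous glyph n-grams between texts. The following is only a minimal sketch under stated assumptions: the function names, the choice of n = 3, and the toy sequences are hypothetical and are not taken from Barthel (1958) or from the software written for this study.

```python
from itertools import combinations


def ngrams(glyphs, n):
    """Return the set of contiguous glyph n-grams in a sequence."""
    return {tuple(glyphs[i:i + n]) for i in range(len(glyphs) - n + 1)}


def shared_sequences(texts, n=3):
    """Map each pair of text labels to the n-grams the two texts share."""
    grams = {label: ngrams(seq, n) for label, seq in texts.items()}
    return {
        (a, b): grams[a] & grams[b]
        for a, b in combinations(sorted(grams), 2)
    }


# Toy sequences, invented for illustration (not real transcriptions).
toy_texts = {
    "X": ["380", "1", "3", "200", "10"],
    "Y": ["6", "380", "1", "3", "76"],
    "Z": ["700", "76", "8", "78"],
}
overlap = shared_sequences(toy_texts, n=3)
```

Pairs with a non-empty intersection are candidates for the same group; how many shared sequences justify a grouping is a separate judgement that this sketch does not make.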
To complicate the issue, as well as more than one text-type being inscribed on a tablet, it may also
be the case that each genre of inscription, 'held its own formulaic requirement', or 'different reading
techniques', (Fischer, 1997:283). According to Fischer (1997), ethnographic data supports the
tablets being thematically-grouped, with one acting as a summary or inventory of one or more other
tablets. An example is observed in the apparent 'close connection between the two inscription
categories of 'ika and timo. […]' (Fischer, 1997:285). These categories were recorded by
Routledge (1919) during her work with an informant who confirmed that, 'a connected, or possibly
the same, tablet was made at the instance of the relatives of the victim and helped to secure his
vengeance' (Routledge, 1919:248. See also Fischer, 1997:285). Fischer (1997) summarises that, 'the
timo category of inscription recorded the spell against the murderer(s), whereas the ika category
listed each ahu’s6 slain victims' (Fischer 1997:286).
Another genre was known as the Ta'u, a list of deeds (Routledge 1919:251-2), listing the dates and
number of chickens stolen during the course of a man's life. These tablets were also known as
kouhau koro. An additional tablet was produced, providing an inventory of these koro: recording
only the name of the man, and the year of his koro (festival). The Ta’u genre of tablet in
combination with its lesser form (koro) may be regarded as an example of 'island historiography'
equating to a, 'register [of] historical events or names' (Fischer, 1997:289).
There is structural evidence for the multiple-genre hypothesis, and for the existence of list-like
texts. The inscriptions show regular 'delimiter groups', which Horley (2007a) attributes to lists,
showing that each tablet may be identified through these delimiter-lists, allowing for further
segmentation of the texts into smaller fragments. Horley (2007a) uses statistical methods to reveal
sequences of glyphs that were not previously identified and demonstrates that there is a
correspondence between the number of syllables in the Rapanui language and the number of glyph
elements appearing between 'delimiter groups' like <380.1.3>, based on a revised sign-list, (Horley,
2005). The presence of these repeated glyphs, separating textual-fragments, indicates that we are
dealing with an important structural formation reflecting an inventory or list, (Butinov and
Knorozov, 1957:82). Horley (2007a:27) confirms that, 'from a statistical point of view, structured
lists significantly increase the occurrence of the glyphs belonging to their delimiter groups'.
Consequently, any comparative statistical study should take into account the, 'corresponding
Rapanui lists'; however, legends or songs should be excluded, as they 'feature different patterns and
vocabulary', supporting the idea of a, 'formulaic requirement' for interpreting some genres of text,
(Fischer, 1997:283).
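The segmentation of a text into fragments at each delimiter group can be sketched as follows. This is an illustrative example only: it assumes that a delimiter group opens a fragment rather than closing it, which is not established in the sources cited above, and the function name and toy sequence are hypothetical.

```python
def split_on_delimiter(glyphs, delimiter):
    """Split a glyph list into fragments, each starting at a delimiter group.

    Glyphs that precede the first delimiter form a fragment of their own.
    """
    k = len(delimiter)
    fragments, current, i = [], [], 0
    while i < len(glyphs):
        if glyphs[i:i + k] == delimiter:
            if current:
                fragments.append(current)
            current = list(delimiter)  # assumption: the delimiter opens a fragment
            i += k
        else:
            current.append(glyphs[i])
            i += 1
    if current:
        fragments.append(current)
    return fragments


# Toy sequence, invented for illustration (not a real transcription).
toy = ["380", "1", "3", "200", "10", "380", "1", "3", "76", "6", "4"]
fragments = split_on_delimiter(toy, ["380", "1", "3"])
# Number of glyph elements following the delimiter in each fragment.
lengths = [len(f) - 3 for f in fragments]
```

The per-fragment lengths are the quantity that would be set against syllable counts in the corresponding Rapanui lists.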
A further clue lies in the “Great Tradition” (or Große Tradition, Barthel, 1958:156), which Horley
(2007a:27) attributes to, 'some kind of a refrain rather than a fixed introductory text'. Fischer
(1997) assigns the “Great Tradition”, or procreation chant genre to texts Gv, Ia, and Ta due to their
similar structural features and the presence of glyph <76> on all three. Although the genre of the
tablet is still debated, previous observations (Barthel, 1958; Guy, 1985; 1990; Pozdniakov and
Pozdniakov, 1996; 2007; Fischer, 1997; Horley 2005; 2007a; 2009; 2010; Melka, 2009b, 2010) can
be referred to as a guide for analysis of the inscriptions. If it is possible to group the tablets into
their respective genres, a comparative statistical study may be undertaken to discover why glyphs
are more productive on particular tablets than others. Consequently, if text-classification methods
show that previous assumptions on the genre of a tablet are correct, then previous qualitative studies
will be supported by quantitative methods, where there previously was little or none at all. This is
the domain of corpus linguistics, lexicostatistics, and authorship-attribution studies, which are
discussed in the methodology section below.
6. A burial chamber on which moai were erected.
Chapter 3 Data
3.1 Data
The rongorongo corpus consists of 25 tablets (there is no consensus on the authenticity of the Poike
tablet, see Fischer, 1997:533-534; Melka, 2009a:112). The rongorongo corpus is segmented by
tablet side creating 41 separate texts. This decision is motivated by previous research showing that
a selection of tablets are a collection of smaller sub-texts (Metraux, 1940; Butinov and Knorozov,
1957; Barthel, 1958:151-157; Fischer, 1997; Horley, 2007a:26-30; 2010; Melka, 2008, 2009a).
Ideally, each of these 41 texts would be split still further into even smaller fragments or
'chunks'; however, it is not clear where the segments begin and end, or which statistical method will
help identify them. This is therefore reserved for a future study (see further research section).
Further evidence suggests that a text may continue on to the opposite side of the tablet. One
example (figure 3, below) shows a sequence of glyphs, <59f-2.76-187-[...]-605-2>, appearing on
the last line of the recto of the Small Santiago tablet (Gr8) and repeated on the first line of the verso
(Gv1) as <59f.76-187-[...]-607.700x.76> (note the slight differences, such as the removal of glyph
<2>; see also Horley 2010 on Keiti). Therefore, it is acknowledged that this categorisation is still far from ideal.
Gr 8
…7-59f- 2.76-187-200.10.124-605-2-599-59f-256-200.200.11-2.76?*
Gv 1
...59f.76- 187-186-607.700x.76-205.76?-113-3.95x.3.76-33c.10f.76-43t-33-450.24.?.76-6.33b-607.6.76-493
Barthel (1958)
Figure 3. Evidence of an inscription continuing on to the opposite side.
A list of the rongorongo corpus and the alphabetic-code assigned by Barthel (1958), and the codes
designated by Fischer (1997) are presented below.
No.   Barthel (1958)   Fischer (1997)   Description
1.    A                RR1              Tahua
2.    B                RR4              Aruku Kurenga
3.    C                RR2              Mamari
4.    D                RR3              Échancrée
5.    E                RR6              Keiti
6.    F                RR7              Stephen-Chauvet Fragment
7.    G                RR8              Small Santiago tablet
8.    H                RR9              Great Santiago tablet
9.    I                RR10             Santiago staff
10.   J                RR20             London Rei Miro 6847
11.   K                RR19             London tablet
12.   L                RR21             London Rei Miro
13.   M                RR24             Great Vienna tablet
14.   N                RR23             Small Vienna tablet
15.   O                RR22             Berlin tablet
16.   P                RR18             Great St Petersburg tablet
17.   Q                RR17             Small St Petersburg tablet
18.   R                RR15             Small Washington tablet
19.   S                RR16             Great Washington tablet
20.   T                RR11             Honolulu 3629
21.   U                RR12             Honolulu 3623
22.   V                RR13             Honolulu 3622
23.   W                RR14             Honolulu 445
24.   X                RR25             New York Birdman
25.   Y                RR5              Paris snuffbox
26.   Z                -                Poike (authenticity disputed)
Table 1. Rongorongo tablet identifier codes (Barthel, 1958; Fischer, 1997:393).
In this study, the Barthel (1958) paradigm is used to reduce the amount of clutter in the multivariate
graph plots, (results section). The rongorongo glyphs are transliterated with numeric-codes,
(Barthel, 1958). Below is an example, showing the family of glyphs with a 'lozenge' head-shape,
<200-246> series:
200  201  202  203  204  205  206
220  222  224  225  226
240  242  243  244  245  246
Table 2. Example transliteration of glyphs classified under glyph-family <200> in the Barthel (1958) sign
inventory.
Note that each number in the three-digit code reflects the changes in constituents: the first digit is
assigned to the head-type, the middle number to the body-type, and the final digit reflects the
change in arm-shape.
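The three-digit convention just described can be made explicit with a small helper. This is a sketch for illustration only: the function name and the dictionary keys are hypothetical, and the helper deliberately rejects shorter codes such as <3> or <76>, to which the head/body/arm reading is not assumed to apply.

```python
def decompose(code):
    """Split a three-digit Barthel code into head, body and arm digits."""
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"expected a three-digit code, got {code!r}")
    return {
        "head": int(code[0]),  # first digit: head-type
        "body": int(code[1]),  # middle digit: body-type
        "arm": int(code[2]),   # final digit: arm-shape
    }


parts = decompose("245")
```

Under this reading, glyphs <200> to <246> all share the head digit 2, which is what places them in the same family.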
Barthel (1958), also uses alphabetic-codes to reflect nuances in the composition or arrangement of
the glyphs, i.e. variants, subscripts, mirrored or rotated glyphs. Certain alpha-numeric codes denote
contracted glyph-forms. One example is glyph <3>, which may be an alloglyph for alpha-code 'f',
called 'feathered glyphs' in Barthel's (1958) nomenclature. As an additional example, glyphs with
alpha-code 't', like <4t>, are problematic as the only difference between the two types is where the
scribe has had to fit the glyph in between two others (glyphs highlighted in grey in figure 4, Br1):
see glyph <4> between <50.394> and <2>, written beneath glyph <394> further on in the
inscription as <4t>. Therefore glyph <4t> could be considered an instance of the same glyph <4>,
and may not have a different phonetic or semantic value as a result of subscripting. Alpha-code 'h'
is assigned to glyphs that are squeezed between two glyphs through superscription, for example
in the Mamari tablet (Ca7): glyph <41h> (highlighted in grey) represents a superscripted 'crescent
moon' glyph variant of <41> appearing in Ca8.
In addition, as per the evidence from the parallel-passages (tablets H, P, and Q, Gr and K),
combinations of glyphs, known as ‘suffixed glyphs’ and 'fused glyphs', transliterated with a period
and colon respectively, are separated into their constituent parts. Therefore, a sequence such as
<380.1.3> is represented as individual glyphs: <380>, <1>, and <3>. Guy (1982:447) concludes
that the reading order for fused glyphs is bottom-to-top due to structural evidence showing fused
glyphs to be contractions of two or more glyphs that stand independently in other inscriptions. For
example, note the contraction of glyph <40> and <211s> to a fused form, where glyph <40> is
affixed below glyph <211s> as its alloglyph <42> (example Br1, in red).
Br 1
595-1-50.394s-4-2-595.1-50-301s-4-2-40-211s-91-200-595.2-50.394-4t-2-595.2-50-394-4t-2-42:211s-91-595s
Ca 7
40-390.41-378y-41h-670-8.78.711
Ca 8
3.40-390.41-378y-41-670-8.78.711
Figure 4. Examples of spatial orientation and glyph-contraction.
As a result of the issues with alpha-codes, a number of data-sets were generated with different
parameters for pre-processing the corpus to find the best representation for statistical purposes.
Table 3 shows the different data-sets and how they were processed.
Pre-processing stages              DD1   DD2   DD3   DD4
Conjuncts separated                N     Y     N     Y
Replacement of f with glyph <3>    N     N     Y     Y
Alpha codes removed                N     N     Y     Y
Table 3. Summary of the four possible data-sets (prefixed with DD)
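The four data-sets can be thought of as one pre-processing routine run with different switches. The sketch below is not the software written for this study; it is a hypothetical reconstruction that assumes hyphen-separated units, '.' and ':' as conjunct markers, alpha-codes written as lower-case letters after the number, and a feathered glyph such as <59f> rewritten as <59> suffixed with <3>.

```python
import re

# Switch settings per data-set, following Table 3.
DATASETS = {
    "DD1": dict(split_conjuncts=False, replace_f=False, strip_alpha=False),
    "DD2": dict(split_conjuncts=True, replace_f=False, strip_alpha=False),
    "DD3": dict(split_conjuncts=False, replace_f=True, strip_alpha=True),
    "DD4": dict(split_conjuncts=True, replace_f=True, strip_alpha=True),
}


def preprocess(line, split_conjuncts=False, replace_f=False, strip_alpha=False):
    """Tokenise one transliterated line under the chosen switches."""
    tokens = []
    for unit in (u.strip() for u in line.split("-")):
        if not unit:
            continue
        if replace_f:
            # assumption: 'f' becomes glyph <3> suffixed to its base glyph
            unit = re.sub(r"(\d+)f", r"\1.3", unit)
        if strip_alpha:
            # drop any remaining alpha-codes that follow a digit
            unit = re.sub(r"(?<=\d)[a-z]+", "", unit)
        if split_conjuncts:
            tokens.extend(p for p in re.split(r"[.:]", unit) if p)
        else:
            tokens.append(unit)
    return tokens


sample = "59f.76-187-42:211s-4t"
variants = {name: preprocess(sample, **flags) for name, flags in DATASETS.items()}
```

Uncertain readings and lacunae (marked in the transcriptions with '?', '*' or '[...]') are not handled here and would need their own rule.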
Data-set 1 (DD1) and Data-set 2 (DD2) are abandoned due to the number of hapax-legomena
caused by the preservation of the alpha-codes. These two data-sets do not consider the similarity
between glyphs and their contracted or superscripted forms, and will over-inflate tests for
lexical-richness (see below). Further evidence from Vocabulary Growth Curves (VGC) supports this (see
section on lexical-richness for more on VGCs). Figure 5 shows the empirical VGC of the four
derivations of the Barthel (1958) transcriptions, plotted with the War of the Worlds, acting as an
example of a VGC calculated across a larger sample. The y-axis shows the number of types (V)
across the sample-size (N), denoted by the x-axis.
Figure 5. Empirical Vocabulary growth curve of all data-sets and a comparative sample from War of the worlds.
The curve for the original Barthel (1958) data-set 1 (DD1) shows over 3000 individual types (V) in
the whole corpus (N) and the curve is not particularly smooth compared to the War of the Worlds
text, and the sample ends at just over N=10,000. This is also the case with data-set 2 (DD2) and
data-set 3 (DD3). Data-set 4 (DD4) has fewer types, and is closer to the actual reported sample-size
of the rongorongo corpus between N=14,000 and N=14,500 tokens, (see Barthel, 1958; Fischer,
1997; Horley, 2005:108; Melka, 2009a:114). The best representation therefore comes from DD4,
which has a smoother curve than the previous data-sets reflecting what we expect from natural
language (Zipf, 1949), observed with the War of the Worlds sample. The number of types (V) is also
more in line with the number of glyphs, approximately 632 (Horley, 2005:108), appearing in
Barthel (1958). The analysis in this paper adopts DD4, with all alpha-codes 'f', replaced by glyph
<3> (Horley, 2005:111), and by separating fused and affixed glyphs. It is acknowledged that no
one-to-one correspondence can be made between glyph and morpheme. This is due to the nature of
word-frequency distribution and varying sample-size, which cause a shift in position between the
ranks, making a comparison meaningless, (Baayen, 1998; 2001; Pozdniakov and Pozdniakov,
2007:18).
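An empirical vocabulary growth curve of the kind discussed above records the number of types V seen after each N tokens. The following is a minimal sketch with hypothetical function names and an invented toy token list; it is not the implementation used to produce Figure 5.

```python
from collections import Counter


def vocabulary_growth(tokens, step=1):
    """Return (N, V) pairs: tokens read so far and distinct types seen."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        # sample every `step` tokens, and always at the end of the text
        if n % step == 0 or n == len(tokens):
            curve.append((n, len(seen)))
    return curve


def hapax_count(tokens):
    """Count the types that occur exactly once (hapax legomena)."""
    return sum(1 for c in Counter(tokens).values() if c == 1)


# Toy token list, invented for illustration.
toy_tokens = ["200", "3", "200", "76", "3", "1"]
curve = vocabulary_growth(toy_tokens)
hapaxes = hapax_count(toy_tokens)
```

A data-set that preserves alpha-codes treats variants such as <4> and <4t> as different types, so its curve rises faster and its hapax count is higher for the same N.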
The language corpus introduced for comparison is composed of Rapanui language texts collated
by Englert (1970), Thomson (1889), Metraux (1940), Felbermayer (1971), Campbell (1971),
Barthel (1978), and Blixen (1979). These are collections of native oral tradition, and a possible
authentic rongorongo procreation chant, Atua-mata riri, (Thomson, 1889). However, some of these
texts show contamination from Tahitian, where Tahitian equivalents of Rapanui words are
intermingled with the rest of the text (Fischer, 1997:93-95), making them unreliable sources of data.
In the case of the chant Atua-mata riri, Fischer (1997:96-100) offers a provisional reconstruction to
compensate for this issue.
The traditional texts in Barthel (1978) are not very representative of Rapanui oral tradition. They
are a collection of traditions from an informant known as Mrs E, and are mainly narrative with
typographical and segmentation errors. If the tablets contain a variety of genres, then the
comparative language corpus needs to reflect this fact with the inclusion of Rapanui songs, chants,
prayers, and poetry.
The lack of representativeness of the Rapanui data is compensated for by including additional
Polynesian language texts from Maori and Marquesan. The Maori data is gathered from George
Grey (1853). Grey published a large collection of Maori poems, traditions, chants, and songs.
Consequently, this part of the corpus is much more representative of different genres. The first 150
of these texts are selected. This decision is arbitrary; more texts could be included, and perhaps this
would even be desirable. However, processing-time, which runs to several hours with large
corpora, is a consideration. Furthermore, a random-sampling approach is chosen due to the other corpora
having been subjected to the same mode of selection.
The Marquesan texts form a collection of chants and love songs recorded by Elbert (1941:61-91).
Within the Eastern Polynesian language group, Rapanui is closest to Marquesan morphologically,
whereas its phonology has more in common with New Zealand Maori. Many Polynesian languages
share the same syntactic particles, some with slight phonemic changes; for example the
causative: /haka-/ (Rapanui), and /whaka-/ (Maori); the +specific determiner /te/; possessive /o/ and
/a/; tense marker /i/. Phonological differences include, within Marquesan, the alternation between [k]
and the glottal stop in North and South Marquesan, respectively; for example, the second person
singular is /koe/ in North Marquesan and /`oe/ in South Marquesan (see Elbert, 1941:58; 1982:502).
Nevertheless, all three languages show the same patterns of use, confirmed by KWIC concordances:
1.) Possessive structure: o … te, and o te ...
Examples
a.) o
b.) o
c.) o
miru te
haumea
ki
te
mokomae
o
te
tumu
taha
o
te
o
te
mata
te
pua
rangi
Language
RAP
MAR
MAO
2.) Dative Structure: ki te …
    Examples                               Language
a.) i    tepo        ki    te    rona      RAP
b.) i    ha'ati'o    'i    te    tu'u      MAR
c.) i    maka        ki    te    apai      MAO
Table 4. Grammatical structures present in Maori, Marquesan, and Rapanui
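A key-word-in-context (KWIC) concordance of the kind used to check these structures lists every occurrence of a particle together with a fixed window of neighbouring words. The sketch below is illustrative only; the function name, the window width, and the placeholder tokens (w1 to w4 stand in for content words) are assumptions and are not drawn from the corpus.

```python
def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Placeholder tokens: w1..w4 are stand-ins, not real Rapanui words.
tokens = "w1 ki te w2 w3 ki te w4".split()
concordance = kwic(tokens, "ki", width=2)
```

Sorting the resulting lines by their right context brings recurrent frames such as ki te … together, which is how the shared patterns become visible across the three languages.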
Metraux (1940:419) believes that the three cultures are linked by a shared heritage of cultural traits
each related to a particular area of expertise, for example the building of ahu (burial chambers), a
part of Rapanui and Marquesan tradition. In addition, Marquesan and Rapanui cultures practise the
carving of stone statues, whereas the Maori carved in wood, which is perhaps attributable to the
type and abundance of certain materials.
Another motivation for selecting the Maori texts is their ancient origin; in a footnote to the text He
tangi, mo mokowera, i mate ia ki wharekura, kei ohiwa, Grey notes:
[…] it is the custom of the natives to compose their poetry
rather by combining materials drawn from ancient poems, than
by inventing original matter. An apparently recent poem is thus
sometimes really of very ancient origin.
Grey (1853:14)
Grey's texts are important as they pre-date many of the collections made by rongorongo
investigators (Barthel, 1978; Jaussen, 1893b; Thomson, 1889; Routledge, 1919; Metraux, 1940). If
any of these texts have 'ancient origins', an assumption that could be made is that there exists a set
of formulaic expressions shared among Polynesian orators, which may be present in the
rongorongo inscriptions.
The Maori texts, 150 in total, were categorised into literary genre according to the genre designated
in their titles. For example, He waiata meaning 'song', He tangi – 'lament', He tau – 'recitation
before speaking', He karakia – 'incantation', He whakaaraara – 'War chant'. Texts that could not be
categorised were labelled 'prose'. It is acknowledged that this is not the most accurate form of
classification.
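The title-based categorisation can be sketched as a lookup on the first two words of each title, with 'prose' as the fallback. The mapping below contains only the genre labels named in this section; the function name and the matching rule are assumptions for illustration, not a record of how the 150 texts were actually labelled.

```python
import re

# Genre labels named in the text, keyed by the opening words of the title.
GENRE_BY_PREFIX = {
    "he waiata": "song",
    "he tangi": "lament",
    "he tau": "recitation before speaking",
    "he karakia": "incantation",
    "he whakaaraara": "war chant",
}


def genre_from_title(title):
    """Label a text by the first two words of its title, else 'prose'."""
    words = re.findall(r"[^\W\d_]+", title.lower())
    return GENRE_BY_PREFIX.get(" ".join(words[:2]), "prose")


label = genre_from_title("He tangi, mo mokowera, i mate ia ki wharekura, kei ohiwa")
```

Matching on whole words rather than on a raw string prefix avoids confusing a short label such as He tau with a longer word that merely begins with the same letters.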
3.2 Pre-processing
The rongorongo glyphs are divided into units according to a hyphen between glyphs, along the lines
of Pozdniakov and Pozdniakov (2007:14). Each glyph is assumed to be representative of a
morpheme in the Rapanui language in line with Pozdniakov and Pozdniakov (2007) and Horley
(2005; 2007). The language corpus was trialled in a pilot study7, and it became apparent that there
is a wealth of compound forms. These distort the frequency-counts and KWIC concordances,
reducing the number of actual types in the corpus (see for example, the forms used in interrogative
Q-constructions, appearing in De Feu, 1996:19-21). As an example, the possessive particle system
is particularly complex.
    Surface structure   Underlying form   Syntactic structure
1.  ta'aku              te-a-au           Specifier - possessive alienable - 1st Person singular
2.  to'oku              te-o-au           Specifier - possessive inalienable - 1st Person singular
3.  ta'au               te-a-koe          Specifier - possessive alienable - 2nd Person singular
4.  to'ou               te-o-koe          Specifier - possessive inalienable - 2nd Person singular
5.  ta'ana              te-a-ia           Specifier - possessive alienable - 3rd Person singular
6.  to'ona              te-o-ia           Specifier - possessive inalienable - 3rd Person singular
Table 5. Compound morphemes in the Rapanui corpus
The final syllable of the possessive is a substituted form of the personal pronouns for 1st, 2nd, and 3rd
persons, with /-ku/ representing /au/ (1st), /-u/ for /koe/ (2nd), and /-na/ for /ia/ (3rd). The above
paradigm represents the singular forms, and the plural is derived through elision of the initial /t/ of
the specifier. This is a small sample of the possessive system and there are many more, including
inclusive and exclusive versions (see De Feu, 1996:144-145; Karena-Holmes, 1997:50). The
question is whether these possessive forms should be divided up in the corpus. This paper adopts
the surface-structure, and reserves further segmentation for a future study.
7. Pilot study results presented at the Bloomsbury Conference in Applied Linguistics (2009).
Although an error-free dataset has been one of the primary aims, it has not always been possible due
to the nature of working with corpora from varying languages, sources, and transliterations (for
example, in vowel length: taana versus tana, 3rd person possessive singular; Tregear, 1891:460). The
language corpora are primarily pre-processed through removal of all punctuation, as this is a
Western orthographic convention. There are also differences in transcription: for example, the
Rapanui texts contain entries for /kiroto ki/ and /iroto i/, which are divided into /ki roto ki/ and
/i roto i/8, respectively. The pre-processing stage was performed with software written for this study,
and manual corrections were kept to a minimum in order to avoid errors associated with manual
processing.
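The two pre-processing steps named above (punctuation removal and the splitting of fused prepositions) can be illustrated with a short sketch. This is not the software written for the study; it is a minimal Python illustration in which the function name `preprocess` and the lookup table `FUSED_FORMS` are hypothetical, and in which the apostrophe is assumed to mark the glottal stop and is therefore kept.

```python
import re

# Hypothetical lookup table: only the two fused forms named in the text.
FUSED_FORMS = {"kiroto": "ki roto", "iroto": "i roto"}

def preprocess(line):
    """Strip punctuation and split fused prepositions into separate tokens."""
    # Keep word characters, white space, and apostrophes (assumed glottal stop).
    cleaned = re.sub(r"[^\w\s']", " ", line)
    tokens = []
    for token in cleaned.split():
        # Replace a fused form by its segmented equivalent, if one is listed.
        tokens.extend(FUSED_FORMS.get(token, token).split())
    return tokens

print(preprocess("kiroto ki, iroto i; ta'aku."))
```

Keeping the fused forms in a table, rather than in the code itself, means that further transcription variants could be added without changing the function.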
The language corpus provides a good environment for future investigation of linguistic structures,
and literary-genre in Polynesian languages. This is especially true here as there does not appear to
be a recognised Polynesian corpus in existence, compared to those like the Brown corpus available
for English-based studies.
8. The morpheme /roto/ is a separate word meaning 'inside' or 'within', and forms part of a complex preposition together with the particles /ki/ and /i/ (see Tregear, 1891:428, and word no. 174 of the Austronesian Basic Vocabulary Database, Greenhill et al., 2008).
Chapter 4 Methodology
4.1 Corpus linguistics
Many of the principles and methods adopted in this analysis are founded on corpus linguistics
(Biber et al., 1998; Gries, 2009), lexicostatistics (Baayen, 2001; 2008; Baroni, 2006; 2009; Evert
and Baroni, 2005; 2006; 2007), authorship-attribution studies (Mosteller and Wallace, 1964; Baayen
et al., 1996; 2002; Stamatatos, 2000; 2009; Peng et al., 2002; 2003), and information retrieval (see
in particular Landauer et al., 2004; Manning et al., 2008).
As stated by Pozdniakov and Pozdniakov (2007:4), ‘the appearance of yet another monograph with
a new ‘translation’ of the corpus of texts would be significantly less interesting than the appearance
of an article presenting the results of a structural analysis of some specific aspect of the writing
system’. In light of this statement, the guiding principle of this study is that any generalisation
about the script’s composition must be supported by statistical evidence.
Melka (2009a) is the first to mention corpus linguistics as a method for the decipherment of
rongorongo. In his view, 'the solution to what exactly rongorongo is will not be scheduled
or delivered via a corpus linguistic methodology', although 'the approach will take us closer to its
'understanding'' (Melka, 2009a:112).
The first decision to be made is the unit of analysis, which determines not only the falsifiability of
the final results, but also the initial steps needed to collate and process the corpus. Corpus studies
are largely based on two forms of analysis, either 'describing a linguistic structure', or a 'group of
texts' (Biber et al., 1998:273). These are termed the 'observations' of the study, with each observation
being an occurrence of the feature to be studied. The unit of analysis adopted in this study is a
single glyph or morpheme. However, when applying authorship-attribution methods, each text is
considered to be an observation (Biber et al., 1998:269). One notable study of this kind is by
Mosteller and Wallace (1964), who investigate the disputed authorship of the Federalist papers
through an analysis of each text, showing how each can be attributed to its most likely author.
Melka (2009a) highlights the problematic nature of the rongorongo corpus itself: an issue that
causes a 'number of problems for the investigation' (Melka, 2009a:114), is the destruction,
misplacement, and sale of many tablets by the turn of the nineteenth century. This makes it
impossible to consider the corpus to be representative of the entire wealth of possible literary genres
that formed the rongorongo tradition. History has handed down only a few tablets for study;
however, there is still enough material9 for a statistical analysis of the mechanisms underlying the
rongorongo script. A mixed-methods and interdisciplinary approach is predicted to be the most
successful for exploring the rongorongo script, as the strengths of each method can be adopted
whilst their individual shortcomings are avoided (Dörnyei, 2007).
The corpus of twenty-five tablets is not of the magnitude normally associated with corpus linguistic
studies, such as the Brown Corpus of English, which totals a million words. However, Melka
(2009a:113) states that current rongorongo research 'should consider the approach of a text-oriented
analysis as essential in order to be able to establish safer parameters'. In light of this statement, a
corpus-based approach is adopted in this paper.
Biber et al. (1998:249) propose that 1000-word samples are sufficient for reliable counts of a
linguistic feature in a text. However, in lexicographic studies this sample would need to be much
larger, totalling millions of words, because many words occur with a very low frequency,
perhaps only once. These words, known as the hapax-legomena, are a very important focus for
word-frequency-distribution studies (see Baayen, 2001), but we also need to know which glyphs and
morphemes are the most productive in order to begin identifying possible syntax markers. According to
Biber et al. (1998:248-249), if a corpus contains too few texts, 'a single text can have an undue
influence on the results of the analysis'. This issue arises chiefly in particular
kinds of study, for example studies into dialectal differences, or into language and gender. Therefore,
as long as the nature of the texts used in a corpus is made explicit, and the methods and
assumptions are transparent, any imbalance in the corpus will be clear.
Representativeness is an additional consideration in corpus linguistics (Summers, 1993; Biber et al.,
1998; Melka, 2009a). The term 'representativeness' is understood to mean:
[…] covering what we judge to be the typical and central
aspects of the language, and providing enough occurrences of
words and phrases […] to make accurate statements about
lexical behaviour.
Summers (1993:186-187)
9. Although rongorongo does not exhibit the wealth of inscriptions of Linear B (more than 3000, see Robinson, 2002:84), in contrast to the Phaistos disc (one artefact, see Fischer, 1997a; Robinson, 2002:298) or the 3700 short inscriptions of the Indus Valley seals (Robinson, 2002:268), sizeable sections of continuous text can be extracted for statistical scrutiny.
Biber et al. (1998:248) observe that 'there are important differences in the use of lexical,
grammatical, and discourse features across varieties of language'. If the rongorongo corpus is not
considered to be representative of the diversity of literary genres, a comparable corpus that does
contain a variety of text-genres is necessary in order to offset any imbalance. This is where
multilingual corpora are useful for comparing grammatical structures and word-frequency to
identify parallels between rongorongo and Rapanui, Marquesan, or Maori. In this study, we must
consider the corpus of rongorongo texts to be a random sample, and consequently not all literary
genres are likely to be represented (Melka, 2009a).
It should also be kept in mind that a corpus-based study will never be truly representative of the
whole population, as this population is either very large, difficult to sample, or continuously being
expanded. We could say, for example, that we have collated all texts produced by Noam
Chomsky, but there is no reason to assume he will not publish another book in the future. As
long as the corpus is representative of the unit(s) under investigation, and its limitations are
recognised, the study can still provide valuable insights. This paper therefore
views the rongorongo corpus as a random sample of those tablets that may have existed as part of a
wider collection of inscriptions, with the unit of analysis being a single glyph or morpheme.
Due to the multilingual nature of the corpora, and differing transliteration schemes, off-the-shelf
software is not sufficient. Although freeware and commercially available software could perform
adequately, the need to edit the data according to the requirements of the software is unattractive
given the size and complexities of the rongorongo corpus. In addition, many applications are not
equipped to deal with multilingual data or numeric transliteration systems. The R statistical
programming language is selected as the best approach because it is freely available, very
flexible, powerful, and fast (see http://cran.r-project.org/). The R environment makes it possible to
process all the corpora procedurally with little intervention from the user aside from validating the
results. Ultimately this means that the problems outlined above, with respect to transliteration
schemes, are overcome by writing programs specific to each corpus, ensuring that the same analysis is
applied consistently throughout.
In corpus linguistics, words are known as tokens; punctuation and white space are also
considered tokens in some authorship-attribution studies (see Baayen et al., 2002). Word-frequency
forms a large part of corpus-based investigations, for example in lexicostatistics,
historical linguistics, and Natural Language Processing (NLP), which study the productivity of
words and lexical-association patterns to discover how language varies over time, between speakers,
and in different contexts (see Biber et al., 1998).
Although informative, word counts alone cannot support generalisations. Due to the varying
nature of language, the productivity of a word may vary depending on context, speaker/writer, and,
more importantly, the size of the sample. Therefore, even if word counts are normalised (Biber et
al., 1998:263-264), the counts would still not reflect the true population from which the sample
came. Relative frequency counts add up to 100%, which does not allow any space for unseen types:
there are still words in the population which have not yet appeared, but are likely to
if we increase the sample size. Recent lexicostatistic studies (Baayen, 2001) have sought a
method to compensate for this issue. Baayen (2001:5) defines this problem as LNRE, or Large
Number of Rare Events. This idea refers to the fact that word-distributions contain a large number
of low-probability words, represented by the hapax-, dis-, and tris-legomena, or words occurring
only once, twice, or three times.
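The point about relative frequencies can be made concrete with a small sketch. The function names `relative_frequencies` and `rare_types` are hypothetical, and the token list is a toy example rather than corpus data. The relative frequencies of the observed types always sum to one, so no probability is left over for types that have not yet been seen.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Relative frequency of each observed type; the values sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def rare_types(tokens, max_count=3):
    """Types occurring exactly 1, 2, ... max_count times (hapax-, dis-, tris-legomena)."""
    counts = Counter(tokens)
    return {m: sorted(t for t, c in counts.items() if c == m)
            for m in range(1, max_count + 1)}

toy = "a b a c a b d".split()  # toy data, not taken from any corpus
print(sum(relative_frequencies(toy).values()))
print(rare_types(toy))
```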
Another issue with word-frequency counts is that comparisons based on different sample-sizes are
not accurate, as the sample mean increases as a function of sample-size (Baayen, 2001:4-5; Evert
and Baroni, 2006). In addition, median and mode are usually uninformative, their values
generally lying between 1 and 3 because of the previously mentioned abundance of low-probability
words. There are a number of options available to address these fundamental issues.
Firstly, we may take a set number of tokens and form a sub-corpus of each text, for example, the
first 1000 glyphs in each text. This would allow a comparison, but with a certain amount of
data-loss. For this study, the loss would be considerable for what is already a relatively small
corpus.
In order to address this issue, a word-frequency model can be generated first. An important law in
lexicostatistics is Zipf's law (Zipf, 1949; see also Baayen, 2001; Baroni, 2006). Application of Zipf’s
law involves organising the frequencies according to decreasing value in rank-order, with the
highest-frequency being assigned rank 1. We can now compensate for the issues corresponding to
relative-frequency counts through prediction of the unseen types. The ranking does not consider
which word is at each rank, for example, whether the word and is the most frequent, but does
enable us to model word-distributions and analyse vocabulary-growth and lexical-richness.
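The ranking step can be sketched as follows. The function name `rank_frequency` is hypothetical and the tokens are toy data. As described above, the identity of the word at each rank is discarded and only the ordered frequencies are kept; tied frequencies simply receive consecutive ranks.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the highest frequency."""
    frequencies = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(frequencies, start=1))

print(rank_frequency("a b a c a b d".split()))
```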
A language model is generated to extrapolate a small sample text to a larger sample-size whilst
accounting for the LNRE, by freeing up the probability-space for new word-types. As observed by
Evert and Baroni (2006:1), 'in order to compare samples of different sizes […] it is thus necessary
to extrapolate their observed values to much larger samples'.
Zipf’s law predicts that word-distributions follow a straight line in double-logarithmic space.
According to Baayen (2001:32), ‘many statistical tests pre-suppose normality’, in other words the
well-recognised ‘bell-shape’ of the normal distribution, in which values cluster symmetrically
around the mean. However, as seen in the plot below (figure 6), word-distributions do not follow a
normal distribution:
Figure 6. Zipf Frequency/Rank plots for ‘Alice in Wonderland’.
Note the long tail stretching along the bottom-right of the plot. This represents the hapax-legomena
and consequently the lexical-richness of the text. In terms of Zipf's double-logarithmic space, it is
possible to make the above distributions more 'normal' by transforming the absolute frequency of
each word to ‘log frequency’ (Baayen, 2001:32), as reflected in figure 7.
Figure 7. Texts 'Huckleberry Finn' and ‘Alice in Wonderland’ against Zipf's predicted straight line in double-logarithmic space.
In figure 7, the straight line in double-logarithmic space predicted by Zipf (1949) does not hold for
the text 'Huckleberry Finn' nor for ‘Alice in Wonderland', though it accounts for the
middle-frequencies fairly well. Zipf's law therefore fails to account for the top and bottom frequencies,
which curve downwards from their predicted values. As a result, Zipf's law does not predict the
correct distribution of the most frequent words and the hapax-legomena, which occupy the top and
bottom ranks, respectively. Tests of lexical-richness would also be unreliable due to the difference in
sample-size: Huckleberry Finn consists of 116,608 tokens, and Alice in Wonderland of 29,951
tokens.
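A simple way to check how closely a text follows the predicted straight line is to fit a least-squares line to log frequency against log rank and inspect the slope. The sketch below is an illustration only; the function name `loglog_fit` is hypothetical, and the input is an artificial list of frequencies proportional to 1/rank, for which the fitted slope is -1 by construction.

```python
import math

def loglog_fit(frequencies):
    """Least-squares slope and intercept of log(frequency) on log(rank).

    Requires at least two frequencies, all greater than zero.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Artificial frequencies 120 / rank for ranks 1 to 5.
slope, intercept = loglog_fit([120, 60, 40, 30, 24])
print(slope, math.exp(intercept))
```

For a real text, the differences between the observed log frequencies and the fitted line would be expected to be largest at the highest and lowest ranks, which is the deviation described for figure 7.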
Texts can also be divided into equal-sized 'chunks' (Mosteller and Wallace, 1964; Peng and
Hengartner, 2002). Alternatively, as per Baayen (2001) and Evert and Baroni (2006), the
word-distributions can be modelled by selecting an appropriate model and experimenting with its
parameters until a good fit is achieved. These models form the basis for
interpolating and extrapolating data to the required sample-size. Baayen (2001) recommends that
sample-size is not extrapolated to more than twice the original size, due to potential inaccuracies in
the predictions made by the model. This issue may be addressed by sampling only part of the
population. However, as stated by Evert and Baroni (2005:5), 'extrapolation quality rapidly
degrades when less than 25% of the data are used for estimation'; still, with only 25% it is possible
to extrapolate up to four times the original sample-size (Evert and Baroni, 2005:2). In addition,
they believe that a large part of the issues related to extrapolation can be attributed to the
'non-randomness of the corpus data' (Evert and Baroni, 2005:14), although they admit that the
non-randomness of word-distributions cannot be the sole cause of some poor estimates (Evert and
Baroni, 2005:15).
Evert and Baroni (2006) developed the zipfR package for R, which calculates Zipf-like LNRE
models such as the Zipf-Mandelbrot (ZM), finite Zipf-Mandelbrot (fZM), and the Generalised
Inverse Gauss-Poisson (GIGP) model. The Zipf-Mandelbrot model is similar to Zipf's original
model, with two additional parameters: a to modify the slope of the rank-frequency graph in
double-logarithmic space, and b to adjust the downward curvature of the line, accounting for the
mismatch between the straight line predicted by Zipf (1949) in figure 7 and the actual curve of the
rank-frequency distribution (see Baayen, 2001:101-102). The finite or non-finite distinction
concerns whether the model assumes that the number of words in the distribution is finite or
potentially infinite, which influences the final predictions. The fZM and GIGP models generally
provide 'the best results' (Evert and Baroni, 2005:17), as will be seen in the results section.
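The role of the two parameters can be illustrated with the rank-frequency form in which the Zipf-Mandelbrot law is commonly written, f(r) = c / (r + b)^a. This is a sketch under that assumption; the function name `zipf_mandelbrot` and the scaling constant `c` are introduced here for illustration, and the LNRE models fitted by the package are formulated differently, so this is not its implementation.

```python
def zipf_mandelbrot(rank, a, b, c=1.0):
    """Predicted frequency c / (rank + b) ** a at the given rank.

    a controls the slope in double-logarithmic space; b flattens the
    curve at the highest ranks. With a = 1 and b = 0 the formula
    reduces to Zipf's original law, c / rank.
    """
    return c / (rank + b) ** a

# With b = 0 the top rank receives the full constant; with b = 2 it is lowered.
print(zipf_mandelbrot(1, a=1.0, b=0.0, c=120.0))
print(zipf_mandelbrot(1, a=1.0, b=2.0, c=120.0))
```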
Before describing how these models are generated, the notation used to describe the features of the
model is explained (see Baayen, 2001; 2008:231). Sample-size in tokens is denoted by N, with the
number of types in a sample of size N expressed as V(N). Let m denote a frequency class, so that
Vm(N) is the number of individual types with
frequency m in a sample of N tokens.
The models are based on two types of data. Firstly, the frequency spectrum, Vm(N), is a ‘table of
frequencies with frequencies’ (Baayen, 2008:230). For example, the frequency spectrum of Alice in
Wonderland shows that 858 words have the lowest frequency, 295 the second-lowest, and 152 the
third-lowest. The second data-type is the vocabulary growth
rate, which is the ‘joint probability of the unseen types’ (Baayen, 2009:229). The data are used to
predict how the frequency of a word changes as a function of an increase in the size of the sample.
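Both data types can be computed directly from a token list. In the sketch below the function names are hypothetical and the tokens are toy data; the growth rate is estimated as V1(N)/N, the proportion of hapax-legomena among the tokens, which is assumed here to be the estimator intended.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency class m to Vm(N), the number of types occurring m times."""
    type_counts = Counter(tokens)
    return dict(sorted(Counter(type_counts.values()).items()))

def growth_rate(tokens):
    """Estimate the vocabulary growth rate as V1(N) / N (assumed estimator)."""
    if not tokens:
        return 0.0
    return frequency_spectrum(tokens).get(1, 0) / len(tokens)

toy = "a b a c a b d".split()  # toy data: N = 7 tokens, V(N) = 4 types
print(frequency_spectrum(toy))
print(growth_rate(toy))
```

Two identities are useful for checking a spectrum: the values Vm(N) sum to V(N), and the products m x Vm(N) sum to N.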
In order to address the issue of words having non-random distributions, Baayen (2001:64-69) and
Evert and Baroni (2006:8) propose a technique called binomial interpolation. This addresses the
non-random distribution of words (a problem for extrapolation quality) through a series of
calculations that treat the words as if they were randomly ordered. Binomial interpolation takes a
frequency spectrum of the words, and calculates the expected vocabulary size, E[V], and the expected
count for each frequency class, E[Vm]. This means that vocabulary growth can be calculated for any
sample-size, as long as it is equal to or less than the size of the sample from which the frequency
spectrum was originally generated. By applying these models to the rongorongo and language corpora, we can select
the texts with the biggest sample-size and interpolate them to values comparable to the smaller
sample texts, or extrapolate the smaller texts to the text with the largest sample-size. This permits a
comparison between texts, but it is still necessary to exclude some texts from the analysis, such as
the Honolulu tablets (3622, 3623 and 445), the New York Birdman (X), the Paris snuffbox (Y),
the Poike (Z) tablet, and language texts with fewer than 50 words. This is because it is not possible to
model the vocabulary growth of a text with hardly any data available for the model to base its
calculations on.
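The interpolation can be sketched as follows, assuming the usual binomial form: a type that occurs k times in the full sample of N tokens is expected to occur exactly m times in a random subsample of n tokens with the binomial probability for m successes in k trials at rate n/N. The function names are hypothetical and the spectrum is a toy example.

```python
from math import comb

def sample_size(spectrum):
    """N, recovered from the spectrum as the sum of m * Vm(N)."""
    return sum(m * v for m, v in spectrum.items())

def expected_vocabulary(spectrum, n):
    """Expected number of types in a random subsample of n <= N tokens."""
    full = sample_size(spectrum)
    if not 0 <= n <= full:
        raise ValueError("interpolation is only defined for 0 <= n <= N")
    p_missing = 1 - n / full
    # A type with frequency m is absent from the subsample with probability p_missing ** m.
    return sum(v * (1 - p_missing ** m) for m, v in spectrum.items())

def expected_spectrum_element(spectrum, n, m):
    """Expected number of types occurring exactly m times in the subsample."""
    full = sample_size(spectrum)
    if not 0 <= n <= full:
        raise ValueError("interpolation is only defined for 0 <= n <= N")
    p = n / full
    return sum(v * comb(k, m) * p ** m * (1 - p) ** (k - m)
               for k, v in spectrum.items() if k >= m)

toy_spectrum = {1: 3, 2: 1, 3: 1}  # toy data: N = 8 tokens, V(N) = 5 types
print(expected_vocabulary(toy_spectrum, 4))
print(expected_spectrum_element(toy_spectrum, 4, 1))
```

The guard on n reflects the restriction stated above: the calculation is only valid for sample-sizes up to that of the original spectrum, and larger sizes require one of the extrapolation models.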
4.1.1 N-grams: uni-grams, bi-grams, and tri-grams
An n-gram is a sequence of n morphemes or words. Frequency counts of n-grams are used as
data for a frequency spectrum analysis. The 100 most frequent glyphs and morphemes are selected
for further scrutiny using additional statistical techniques (described below), specifically as features
for PCA and factor analysis.
A bi-gram may represent a collocation and provides the basis for investigating larger stable
sequences in the arrangement of linguistic or orthographic elements. Bi-grams may also denote
sequences of [noun + prepositional phrase], for example in [school of fish], or [verb + preposition],
as seen in [went to town]. Tri-grams are also important as they capture larger sequences or reveal
more information in terms of what morpheme is present after a collocation, whether it is a
preposition, another noun, or the verb. However, one caveat to analysing bi-grams, and in
particular tri-grams, is the data-sparseness issue. By dividing the corpus into these constituent parts
there is a reduction in the amount of data that can be analysed, as more hapax-, dis-, and
tris-legomena are generated.
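N-gram extraction amounts to sliding a window of n units along the token sequence. In the sketch below the function names are hypothetical, and the sequence of numeric glyph codes is invented for illustration and does not reproduce any tablet.

```python
from collections import Counter

def ngrams(tokens, n):
    """All overlapping sequences of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(tokens, n):
    """Frequency count of each n-gram in the token sequence."""
    return Counter(ngrams(tokens, n))

glyphs = ["1", "76", "1", "76", "380"]  # invented sequence of glyph codes
print(ngram_counts(glyphs, 2))
print(ngram_counts(glyphs, 3))
```

Even in this tiny example every tri-gram occurs only once while one bi-gram occurs twice, which illustrates the data-sparseness caveat: the longer the n-gram, the larger the share of types that are hapax-legomena.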
4.1.2 Key Word In Context concordances (KWIC)
Key Word In Context concordances (or KWIC, see Biber et al., 1998:26-28) are another method in
corpus linguistics used to identify stable linguistic sequences such as those that can be extracted
with bi-gram and tri-gram data. The idea behind this method is to analyse a particular lexical unit
within the context in which it appears. This method is used in dictionary compilation and production of
second language education literature. A KWIC listing provides all the contexts for a given word.
The lexical unit under investigation is called the node, and appears in the centre; tokens before and
after the node can be extracted to see which words form ‘collocates’ (Biber et al., 1998:36), or
set-phrases. KWIC data can often locate regular patterns of usage or lexical-association patterns (see
Biber et al., 1998), and the analysis does not need to be restricted to neighbouring words, but may
also include words appearing two or three places from the node, reflecting grammatical-association
patterns (see Biber et al., 1998).
This method may reveal possible collocations in the rongorongo corpus or identify a grammatical
structure or pattern that may not be discovered by palaeographic methods. Combining all the texts
in the corpus into one text file makes it more apparent which patterns are productive in more than
one text, and by how much. This allows the data to be filtered for those key structures that appear
with a high frequency, and for the structural contexts in which they appear. This is particularly useful for
disambiguating homophones, such as a number of grammatical particles found in Polynesian
languages: the particle /i/ in Maori, for example, has around seven different uses, including as a
tense marker and a dative. Another instance is the particle /te/, which may appear in either the
nominal or verbal frame (De Feu, 1996). Consequently, KWIC lists allow us to explore and
understand the structure of the corpus.
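A KWIC listing can be produced by scanning the token sequence for the node and collecting a fixed window of tokens on either side. The sketch below is an illustration; the function name `kwic` and its `window` parameter are hypothetical, and the token sequence is assembled from the forms /ki roto ki/ and /i roto i/ mentioned in the pre-processing section rather than taken from a text.

```python
def kwic(tokens, node, window=2):
    """Return (left context, node, right context) for each occurrence of node."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((left, token, right))
    return lines

tokens = ["ki", "roto", "ki", "te", "i", "roto", "i", "te"]  # illustrative sequence
for left, node, right in kwic(tokens, "roto", window=1):
    print(" ".join(left), "[" + node + "]", " ".join(right))
```

Widening the window to two or three tokens gives the more distant contexts described above.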
4.2 Authorship-attribution methods
Authorship-attribution defines a group of methods used to identify the literary-genre or author of a
text. In short, it is a 'special form of text classification in which the classes are essentially just the works
of given authors' (Mosteller and Wallace, 1964:xvii). A significant body of research based on the
use of statistical and computational methods was initiated by Mosteller and Wallace (1964)
and has since been expanded in modern studies to include more advanced quantitative methods,
such as machine learning, neural networks, and plagiarism detection. Research shows that different
literary genres can be determined by underlying grammatical structures, including collocations,
function words, and phrases, which are just some of the indicators used to identify the stylistic
'finger print' of a particular author (Baayen, 2002).
As noted previously, relative frequencies of glyphs in the rongorongo corpus will not provide
sufficient statistical evidence to support any generalisations. Therefore, other methods have been
sought that will support and contextualise the frequency data, in the form of meta-data recording
the source, year, context, and register of a text. Much of this information is missing in the
rongorongo corpus and we do not know which tablet represents which genre of text.
Authorship-attribution studies provide a set of methods that are non-parametric and capable of processing
multivariate data, and of highlighting categories or clusters in the data.
The first issue to address in authorship-attribution studies is the selection of features that ‘most
accurately summarise an author’s style’ (Peng and Hengartner, 2002:175). Mosteller and Wallace
(1964) discovered that function words were more successful as discriminators of authorship as they
are context independent, unlike content words (see Mosteller and Wallace, 1964:17-39). This is
supported by Peng and Hengartner’s study, which shows that function words are the most successful
in discriminating between texts because they show high variability and frequency (Peng and
Hengartner, 2002:184).
Peng et al. (2003) use character-level n-grams. This involves removing all white space in the
corpus, which reduces the data to one long block of text. This is then divided into single characters
(letters) and frequencies are generated. Peng et al. (2003) show that by using groups of three
characters as features, the classification of their Greek data-set is improved by 18% compared to
previous studies (Stamatatos et al., 1999; 2000; 2001). In addition, the methods are language
independent, which is demonstrated through a parallel analysis of Chinese and English data (Peng
et al., 2003:6).
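The character-level procedure described above can be sketched in a few lines. This is an illustration of the general idea only, not a reproduction of the cited implementation; the function name `char_ngram_counts` is hypothetical and the input string is a toy example.

```python
from collections import Counter

def char_ngram_counts(text, n=3):
    """Count overlapping character n-grams after removing all white space."""
    block = "".join(text.split())  # one long block of characters
    return Counter(block[i:i + n] for i in range(len(block) - n + 1))

print(char_ngram_counts("aba ba", n=3))
```

Because the white space is removed first, n-grams may span what were originally word boundaries, as the tri-gram "bab" does in the toy example.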
Holmes (1992) uses multivariate analysis, including principal component analysis and tests for
lexical-richness, to assess whether Mormon texts are attributable to more than one author. A
study by Tomatsu (2006), by contrast, investigates the literary motifs adopted by Japanese authors
in order to identify three of the most traditional authors of Japanese prose. This is achieved through
analysis of word-frequency and length, and also of each author’s use of hiragana, katakana, and
kanji (Tomatsu, 2006:5).
It is possible to use the unit of analysis proposed by Peng et al. (2003), as Horley (2005)
demonstrates that we can further subdivide the main glyphs into smaller units, which may represent
syllables. Although this is not the unit of analysis adopted by this study, it would no doubt be
interesting to compare the results from such a corpus with the others to see whether a smaller unit
of analysis performs better, as concluded by Peng et al. (2003). This will be the focus of a future study.
Lexical studies demonstrate that the top-frequencies in corpora are generally occupied by ‘function
words’ and the lower frequencies tend to be the domain of ‘content words’ (Baroni, 2006:5). In this
study, word-frequency data is filtered for the hundred most productive function words in the
language corpora, and glyphs in the glyph corpora, which act as the feature set. This is an arbitrary
figure, but in accordance with previous research by Mosteller and Wallace (1964:28) and Pozdniakov
and Pozdniakov (2007).
Despite not knowing what the set of a hundred glyphs consists of in terms of 'functional' units, it is
reasonable to assume that the most frequent glyphs have some structural significance in their usage.
The Santiago staff (Ia) is often excluded from statistical analysis because of the high occurrence of
the 'phallus' (glyph <76>) and because the large size of the text causes glyph <76> to appear in the
top rank in the whole corpus of rongorongo texts. In other texts, sequence <380.1> dominates the
frequency ranks. This is an advantage for this study, which looks for variability between texts, as
these glyphs appear to be attributable to certain tablets or small sections of tablets, which makes
them good candidates as discriminators of rongorongo literary genre.
To summarise, this study uses the texts and their n-grams of whole morphemes or glyphs as the unit
of analysis for authorship-attribution methods. The term ‘genre’, when applied to the language
corpora, is used in its traditional sense; i.e. to refer to a specific literary-style. However, when
referring to the groupings of the rongorongo tablets, it is used as a convenient label to describe
these groupings, but without implying a specific genre.
Authorship-attribution methods will identify the elements that categorise the tablets into their
respective groups, extracted through word-frequency, lexical-richness, principal components
analysis (PCA), factor analysis, and latent semantic analysis (LSA), methods commonly adopted in
information-retrieval studies, forensic linguistics, and psycholinguistics.
4.2.1 Principal components analysis
Principal components analysis (PCA) is part of a series of methods involving the analysis of more
than one variable (termed multivariate analysis). PCA applies advanced statistical methods in order
to reduce the number of variables, particularly if they offer little discriminatory value. One of the
most attractive aspects of PCA is its non-parametric nature, which is beneficial as it removes part of
the subjective nature of choosing the factors that best define multivariate data-sets.
PCA takes a matrix of data and tries to reduce the number of dimensions necessary for identifying
the position of the data-points (Baayen, 2008). If this process were visualised in three-dimensional
space, a cube would contain all of these data-points as coordinates. This cube can be rotated using
varimax or promax rotation10 (Baayen, 2008), providing a new view of the data, until an interesting
structure is identified. This is achieved through 'rotating the axes in such a way that you get two
new axes in the diagonal plane of the original, unrotated axes' (Baayen, 2008:120). Once the cube is
optimally positioned, the principal components (PCs) that capture the most variance are extracted.
PC1 thus captures the largest share of the variation, with subsequent PCs explaining
less and less. When plotted in a scree plot, the most important PCs are easily recognisable (see
results below). The rotation-matrix contains the values (loadings) needed to plot the data in a
scatter-plot (see Baayen, 2008:124). These loadings are proportionate to the correlation of the
word or glyph frequency-counts. In this way the scatter-plot reveals interesting structures or clusters
in the data.
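The core of the calculation can be illustrated without any statistical library. The sketch below extracts only the first principal component of a small texts-by-features matrix by power iteration on the covariance matrix; it performs no varimax or promax rotation, and a library routine would normally be used in practice. All function names are hypothetical, the matrix is toy data, and the starting vector is assumed not to be orthogonal to the first component.

```python
import math

def centre_columns(matrix):
    """Subtract each column mean so that every feature has mean zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def covariance(centred):
    """Sample covariance matrix (divisor n - 1) of the centred columns."""
    n = len(centred)
    cols = list(zip(*centred))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    k = len(cov)
    vector = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        product = [sum(c * v for c, v in zip(row, vector)) for row in cov]
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0.0:
            break
        vector = [p / norm for p in product]
    eigenvalue = sum(v * sum(c * w for c, w in zip(row, vector))
                     for v, row in zip(vector, cov))
    return vector, eigenvalue

def pc1(matrix):
    """Loadings, text scores, and proportion of variance explained by PC1."""
    centred = centre_columns(matrix)
    cov = covariance(centred)
    loadings, eigenvalue = first_component(cov)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    scores = [sum(x * w for x, w in zip(row, loadings)) for row in centred]
    return loadings, scores, eigenvalue / total_variance

# Toy matrix: three texts (rows) by two features (columns), not corpus data.
loadings, scores, explained = pc1([[1, 2], [2, 4], [3, 6]])
print(loadings, scores, explained)
```

In the toy matrix the second feature is exactly twice the first, so a single component accounts for all of the variance; real frequency data would spread the variance over several components, which is what the scree plot displays.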
Baayen et al. (2002) found that, in a strictly controlled experiment involving groups of
authors with very similar backgrounds and training, there exists an authorial 'finger print' or style.
They note that a principal components analysis of the most frequent function words does not
provide any insight into authorial structure where such controls are in place, but that texts in other
principal component studies (Baayen, 1996), where the authors have very different backgrounds or
training, or span different time periods, are categorised more successfully
(Baayen et al., 2002:6). Since the rongorongo tablets were likely carved by different
scribes (Routledge, 1919), this method may be successful.
10. Promax rotation tends to provide the better fit, as it allows the factors to be correlated. Varimax rotation assumes that there is no correlation between factors, and is applied when the primary aim is the generalisability of the findings (see Baayen, 2008:127).
The results of the PCA (below) are based on the methods proposed by Baayen (2008) in a study of
'affix productivity' from Baayen (1994). Here Baayen (2008) investigates 'the extent to which the
productivity of an affix is co-determined by stylistic factors' (Baayen, 2008:118). Baayen's analysis
is based on a matrix of counts, namely the frequency counts of 27 derivational affixes over 44 texts
of different genres. Baayen's results show that the productivity of an affix can be used as a
means to group texts according to genre. In addition, PCA makes it possible to reduce the number
of dimensions from 27 to only 3 without losing too much of the structure in the data, whilst
accounting for 76.6% of the variance (Baayen, 2008:122).
In conclusion, PCA is a data-reduction method allowing an informed choice on which dimensions
are important for explaining the underlying structure of a large multivariate data-set. This paper
applies principal components analysis to evaluate the available data-sets (DD1-DD4) to identify the
potential for analysing rongorongo literary-genre. The analysis assesses the extent to which the
rongorongo texts exhibit any form of clustering or clearly defined groups. The data-set used by
PCA is based on the results from the frequency-count data. The 100 most productive glyphs and
morphemes are chosen as the feature-set (or dimensions). The assumption is that these features
capture the most productive grammatical particles in the Polynesian language and rongorongo data.
Of course, it is not expected that a one-to-one correspondence will be found. In other words,
despite glyph <1> being the most productive in certain texts, it does not necessarily entail that it
relates to the most frequent word in the Rapanui language data: the definite article /te/. The highest
occurring glyphs in the rongorongo corpus may be ‘functional units’ of a kind, but they would be
restricted to the writing system, due to the disparity between language and writing. Whether this is
the case or not, it can be assumed that the top-ranking glyph-frequencies represent those glyphs that
have a functional or syntactic role in the script.
A matrix of glyph-frequencies for each text is used as the feature-set, which is selected according to
the criteria adopted by Mosteller and Wallace (1964:45). They recommend that the 'pool of
potential discriminators should be large enough, say, 50 to 1000, to offer a good chance of success'.
In addition, Pozdniakov and Pozdniakov (2007:12) consider the 100 most productive glyphs to be
appropriate for their statistical analysis. This paper therefore adopts the same principle.
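The feature selection described above can be sketched as follows. This is a minimal illustration and not the procedure actually used in this study: the function name `build_feature_matrix`, the input format (a dictionary mapping a text label to a list of glyph codes), and the toy data are all assumptions made for the example.

```python
from collections import Counter

def build_feature_matrix(texts, top_n=100):
    """Select the top_n most frequent glyphs and count them in every text.

    texts: dict mapping a text label to its list of glyph codes
           (a hypothetical input format chosen for this sketch).
    Returns (features, matrix), where matrix[label] holds one raw count
    per feature, in the same order as the features list.
    """
    # Pool the counts over the whole corpus to rank the glyphs.
    corpus_counts = Counter()
    for glyphs in texts.values():
        corpus_counts.update(glyphs)
    # Sort by descending count; ties are broken by glyph code so the
    # ranking is reproducible.
    ranked = sorted(corpus_counts.items(), key=lambda item: (-item[1], item[0]))
    features = [glyph for glyph, _ in ranked[:top_n]]
    # One row of counts per text, one column per selected feature.
    matrix = {}
    for label, glyphs in texts.items():
        counts = Counter(glyphs)
        matrix[label] = [counts[glyph] for glyph in features]
    return features, matrix

# Invented toy data; the glyph codes are for illustration only.
toy_corpus = {
    "Aa": ["1", "6", "1", "380", "1"],
    "Ca": ["40", "40", "1", "380"],
}
features, matrix = build_feature_matrix(toy_corpus, top_n=3)
print(features)
print(matrix)
```

With a real corpus, `top_n` would be set to 100, in line with the range recommended by Mosteller and Wallace (1964:45).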
4.2.2 Factor analysis
Factor analysis is used to 'examine how underlying constructs influence the responses on a number
of measured variables' (DeCoster, 1998:1). It is another multivariate technique that
identifies structures in large data-sets, and is an extension of PCA. It differs in that an error term is
added to the model to account for the presence of possible noise in the data (Baayen, 2008:127).
The factor analysis requires a matrix of counts, which in this study is composed of the features used
in the PCA.
Columns represent glyphs or morphemes, and rows represent the texts of the corpus, with the counts
in each cell. A technique called factor rotation is then applied to this matrix, providing a simpler
interpretation of the data. This method is more successful where there are high loadings on a few factors (Baayen,
2008:127). The data-points can also be rotated through varimax or promax rotation methods.
Promax rotation involves rotating the loadings through a general linear transformation of each
factor in order to get a better idea of the underlying structure, whilst preserving the variance of the
factors. This will therefore provide a better view of the separation between texts. For the purposes
of this study two factor analyses (one varimax-based and one promax-based) are produced in order
to demonstrate the difference.
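To make the idea of rotation more concrete, the following sketch implements the widely used iterative form of varimax rotation with NumPy. It is an illustration under stated assumptions, not the routine used to produce the results in this study: the loading matrix is invented, and promax (which starts from a varimax solution and then allows the factors to correlate) is not shown.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a (variables x factors) loading matrix.

    Returns the rotated loadings and the rotation matrix R.
    """
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # Gradient of the varimax criterion with respect to the rotation.
        column_ss = (rotated ** 2).sum(axis=0)
        gradient = L.T @ (rotated ** 3 - rotated @ np.diag(column_ss) / p)
        # The nearest orthogonal matrix to the gradient is the new rotation.
        u, s, vt = np.linalg.svd(gradient)
        R = u @ vt
        current = s.sum()
        # Stop once the criterion no longer improves.
        if previous != 0 and current / previous < 1 + tol:
            break
        previous = current
    return L @ R, R

# Invented loadings for four variables on two factors.
toy_loadings = np.array([
    [0.7, 0.5],
    [0.6, 0.6],
    [0.5, -0.6],
    [0.6, -0.5],
])
rotated, R = varimax(toy_loadings)
print(np.round(rotated, 3))
```

Because the rotation is orthogonal, the communality of each variable (the sum of its squared loadings) is unchanged; only the distribution of the loadings across the factors is simplified.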
In this study three factors are chosen, as fewer than three factors provide hardly any separation in the
data, and too many cause the points to be too dispersed. The results of the PCA provide some clue
as to how many factors should be selected, based on the number of significant principal components
(PCs), which are identified by a scree-plot (see the results section, below).
4.2.3 Tests of lexical-richness
Data from vocabulary growth curves (VGC) and frequency-spectra provide the basis for a model of
lexical-richness. Relative frequency counts are susceptible to the issues associated with sample-size,
and with the Large Number of Rare Events (LNRE) (see Baayen, 2001). VGCs make it possible to adjust the sample-size of each text.
This involves interpolating the samples, for smoothing, and then extrapolating the interpolated
curve to the size of the largest text through an LNRE model (Baayen, 2001): for example, the finite
Zipf-Mandelbrot model (fZM), or the Generalised Inverse Gauss Poisson model (GIGP), illustrated
by Baroni (2006; 2007; 2009). Once an acceptable goodness-of-fit is achieved, the model provides
the basis for text comparison with Baayen’s P (Baayen, 2001). Baayen’s P is equal to the number
of hapax-legomena divided by the number of types in the sample (V1/Vm). High values show
that a text is lexically-rich. Many of the hapax-legomena are likely to be content
words, and many of these may be nouns denoting names of people, places, or things. As a result, it is
expected that a genealogical text, which references names of people, events, and dates, should be
particularly rich in content-words compared to a prose or narrative text.
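The ratio itself is simple to compute from a list of tokens. The sketch below is illustrative only: the function name `hapax_ratios` and the toy data are assumptions. Because the measure is stated per type in this section, and the productivity index is also commonly stated per token, the sketch returns both ratios and leaves the choice to the reader.

```python
from collections import Counter

def hapax_ratios(tokens):
    """Return (V1 / V, V1 / N) for a list of tokens.

    V1 = number of hapax-legomena (types occurring exactly once),
    V  = number of types, N = number of tokens.
    """
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    if n_tokens == 0:
        return 0.0, 0.0
    return hapaxes / n_types, hapaxes / n_tokens

# A repetitive toy sample: one frequent particle and a single hapax.
per_type, per_token = hapax_ratios(["te", "te", "te", "a"])
print(per_type, per_token)

# A list-like toy sample in which every item is new.
list_per_type, list_per_token = hapax_ratios(["a", "b", "c", "d"])
print(list_per_type, list_per_token)
```

The second sample illustrates the expectation stated above: a list of unique names scores higher than a repetitive passage on either version of the ratio.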
4.2.4 Latent Semantic Analysis (LSA)
Another method adopted is latent semantic analysis (LSA), which is based on a 'representation that
captures the similarity of words and text passages', known as a ‘semantic space’ (Landauer and
Dumais, 1997:211). LSA works by representing the words used in a corpus of texts, which may range in size from
a whole essay to just the key words appearing as a title in a document (Foltz et al., 1998:4). LSA
applied to information-retrieval is shown to increase accuracy by up to 30% (Dumais, 1991),
despite differences in language use. Wolfe et al. (1998:1) investigate whether the acquisition of
knowledge is dependent on knowledge already acquired, and how the complexity of a text has an
impact on acquisition. They apply LSA to the problem of knowledge induction by examining the
entries listed in an encyclopaedia on the 'human heart', and comparing the results to a student-based
survey on the same subject. LSA has also found applications in modelling language acquisition
(Landauer and Dumais, 1997), word-disambiguation (Pino and Eskenazi, 2009), plagiarism
detection, and computational biology, where it is applied to protein sequencing (Dong et al., 2006).
LSA is particularly relevant for the study of rongorongo, as it provides an insight into how glyphs
are used by calculating the 'approximate estimates of the contextual usage substitutability of words
in larger text segments' (Foltz et al., 1998:3). Applying LSA to the glyphs (see below) retrieves
lexical-association patterns (Biber et al., 1998), and the degree of correlation between the individual
lexical-components. Although the analysis of bi-grams and tri-grams captures similar patterns, the
output is often cluttered with incorrect collocates, which are included in the counts of hapax-legomena,
and result in false measures of lexical-richness (see Baayen, 2008; Evert and Baroni,
2006). The data produced by LSA requires no manual processing, as a threshold can be set, typically
>=0.7, to return only the most correlated texts and glyphs (see Landauer et al., 2004).
The texts and words (or glyphs) are viewed as a high-dimensional multivariate space where all
occurrences of a word are listed row-wise, and the associated texts across columns, with the counts
in each cell, creating a high-dimensional matrix. The Singular Value Decomposition (SVD) model
(see Manning et al., 2008:407) provides LSA with the ability to represent the mechanisms
underlying human knowledge (Landauer and Dumais, 1997). It achieves this by optimising the
'prediction of the presence of all other events from those currently identified in a given temporal
context, and does so using all relevant information it has experienced' (Landauer and Dumais,
1997:217). The terms context and event refer to the phenomenon being described. In this paper,
context refers to an individual text in the rongorongo corpus, and event relates to a single instance
of a glyph. In other studies (Landauer et al., 2004), an event may also be a word, a syntactic
construction (through corpus tagging, see Biber et al., 1998), or a paragraph. Likewise, a paragraph
can be considered a context.
LSA has been empirically tested on a variety of data (Dumais, 1991; Landauer and Dumais, 1997;
Foltz et al., 1998; Wolfe et al., 1998; Landauer et al., 2004; Dong et al., 2006; Pino and Eskenazi,
2009), and it appears to give consistent results. These tests demonstrate that the model is
particularly robust with matrices of 300 dimensions; however, performance drops below 100
dimensions, and above 1000 dimensions due to limitations in computational power (see
Landauer et al., 2007:71). To illustrate, in an initial study as part of this paper, processing a corpus
of 46 texts, with approximately 100,000 words each, required nearly 32GB of memory on a high-end
laptop fitted with only 2.1GB (as of 2010). Therefore, LSA may not be able to process large
corpora comparable to the Brown Corpus, unless a series of smaller sample texts is extracted.
An SVD model is generated, in a similar way to factor analysis, and the data-matrix is transformed
into a series of co-ordinates in high-dimensional space. This allows the researcher to explore the
correlation between texts, words, or the correspondence between the two. In this paper, Pearson's
correlation is used as the measure of similarity between texts or glyphs. One of the main
advantages of using LSA in a corpus-based analysis is that it captures the fact that:
[…] if a particular stimulus, X has been associated with some other
stimulus, Y, by being frequently found in joint context, and Y is associated
with Z, then the condensation can cause X and Z to have similar
representations.
(Landauer et al., 1997:217).
In short, the model measures the probability of a word joining with other words to form larger
clauses, given their productivity in the current context (Landauer et al., 1994:5215). This entails
that LSA ignores the ordering of constituents, as it is purely interested in the ‘meaning’ attached to
the documents as opposed to the words contained within, unlike corpus linguistic methods, which
adopt the token as the unit of analysis (see Biber et al., 1998; Gries, 2009). The resulting data is
normally plotted in a graph to explore possible relationships between text-text, word-word, and
text-word dependencies. LSA is predicted to perform better than the previously mentioned methods, as the
accuracy of a factor analysis relies on how the method is applied, the decisions behind selecting
representative features, and the number of 'significant' factors. LSA processes all texts and words or
glyphs in a corpus to create a latent-semantic space, to which additional functions are applied
(see Landauer et al., 1997; Wild, 2009).
In this paper, the rongorongo corpus is transformed into a matrix, with the contexts (tablets)
organised by column and the events (glyphs) arranged in rows. The frequency-counts are
calculated for each event in each context and assigned to the corresponding cells. A Pearson
correlation is calculated and only the results meeting a >=0.7 threshold are retained.
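The pipeline just described (count matrix, SVD, Pearson correlation, threshold) can be sketched with NumPy as follows. This is a hedged illustration, not the software used in this study: the function name, the number of retained dimensions `k`, the text labels, and the toy counts are all assumptions. The sketch correlates the columns of the rank-reduced matrix, which is one common way of comparing texts in a latent-semantic space.

```python
import numpy as np

def lsa_text_correlations(counts, labels, k=2, threshold=0.7):
    """Correlate texts in a rank-k latent space.

    counts: matrix with events (glyphs) in rows and contexts (texts) in columns.
    labels: one label per column.
    Returns a list of (label_a, label_b, r) with Pearson r >= threshold.
    """
    X = np.asarray(counts, dtype=float)
    # Singular value decomposition of the glyph-by-text matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(s))
    # Rank-k reconstruction: the smoothed counts in the latent space.
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    # np.corrcoef treats each row as a variable, so transpose to get texts.
    corr = np.corrcoef(X_k.T)
    pairs = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if corr[i, j] >= threshold:
                pairs.append((labels[i], labels[j], float(corr[i, j])))
    return pairs

# Invented counts: five glyphs (rows) across four hypothetical texts (columns).
toy_counts = [
    [5, 4, 0, 1],
    [3, 3, 0, 0],
    [0, 1, 4, 5],
    [0, 0, 3, 3],
    [1, 1, 1, 1],
]
toy_labels = ["T1", "T2", "T3", "T4"]
pairs = lsa_text_correlations(toy_counts, toy_labels, k=2, threshold=0.7)
for a, b, r in pairs:
    print(a, b, round(r, 3))
```

A glyph-glyph (term-term) comparison would follow the same steps, correlating the rows of the reduced matrix instead of its columns.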
The results are presented as a dendrogram to make the relationship between groups more apparent
(see below). A correlation value of 1 means that the given word or glyph was expected to appear in
an infinite corpus of texts. Lesser values naturally imply that the word or glyph in question is
unexpected in the current context. This captures the grammatical or syntactic association between
glyphs, highlighting any potential compounds, or structural correspondence. For example, what
motivates the presence of glyph <1> given its presence as an independent or main glyph, or as a
prefix in the compound <1.6>?
To assess the success or failure of LSA methods applied to glyphs, a term-term analysis is performed
on an Egyptian corpus, since the values of the glyphs are known. A glyph is selected on the basis of
glyph-frequency data. As with the document-document analysis, a threshold of >=0.7 is set in order
to return the highest term-to-term similarity scores (positive correlation values). The final results
are compared to the factor analysis, bi-gram, tri-gram, and Key Word In Context concordance data
(KWIC) for validation.
Chapter 5 Presentation of findings
5.1 Feature selection
The feature list is generated from the 100 most productive glyphs and morphemes, and a frequency-matrix
is created with the relative-frequency of the features across all texts. The feature-list was also
enriched on the basis of current hypotheses concerning the significance of particular glyphs, for example the
crescent moon glyph <40>, plus variants: <41> mirrored, <42> subscript, and <43> superscript
(Guy, 1990; Fischer, 1997; Berthin and Berthin, 2006; Melka, 2008; 2009b; 2010; Horley, 2009),
and the text-divider glyph on the Santiago staff (Ia): <199> and <999> (Fischer, 1997).
Due to issues associated with relative-frequencies, there is some imbalance in the final features
selected for classification. For example, tablets X, Y, and Z have only two glyphs each, so each glyph is
attributed a relative-frequency of 0.5 (or 50%). Therefore, when the n-grams are sorted in
descending order, these glyphs are ranked amongst the top frequencies. Consequently it was necessary to
remove these tablets from the rest of the analysis. This decision was extended to texts J, M, and W due to
their small sample-size, or fragmentary nature (damaged areas).
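The distortion can be reproduced with a few lines of code. The sketch below is illustrative: the function names, the toy glyph sequences, and in particular the `min_tokens` cut-off are assumptions, since no numerical threshold is stated for the exclusions above.

```python
from collections import Counter

def relative_frequencies(glyphs):
    """Relative frequency of each glyph within a single text."""
    counts = Counter(glyphs)
    total = len(glyphs)
    return {glyph: count / total for glyph, count in counts.items()}

def drop_short_texts(texts, min_tokens):
    """Keep only texts with at least min_tokens glyphs (hypothetical cut-off)."""
    return {label: glyphs for label, glyphs in texts.items()
            if len(glyphs) >= min_tokens}

# Invented data: a two-glyph fragment and a longer text.
toy_texts = {
    "X": ["1", "6"],
    "Aa": ["1", "6", "1", "380", "1", "40", "1", "3"],
}
fragment = relative_frequencies(toy_texts["X"])
longer = relative_frequencies(toy_texts["Aa"])
print(fragment)        # every glyph in the fragment scores 0.5
print(longer["1"])     # the most frequent glyph in the longer text scores the same
kept = drop_short_texts(toy_texts, min_tokens=5)
print(sorted(kept))
```

In the toy data, the glyphs of the two-glyph fragment receive the same relative frequency as the dominant glyph of the longer text, which is why such fragments rise to the top of a ranked list.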
5.2 Principal components analysis
After the initial feature selection, a PCA was performed (Baayen, 2008), and a scree-plot produced
to identify how many PCs are required to explain the majority of the variance in the data-set, hence
enabling the reduction of the number of variables. The general rule adopted is that only the
significant PCs accounting for over 5% of the variance should be selected for further analysis.
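The selection rule can be sketched as follows, using NumPy. This is an illustration under assumptions rather than the analysis reported here: the toy matrix is invented, the columns are centred but not rescaled (whether the features were standardised is not stated), and the function names are hypothetical.

```python
import numpy as np

def explained_variance(matrix):
    """Proportion of variance explained by each principal component.

    matrix: texts in rows, features (glyph frequencies) in columns.
    """
    X = np.asarray(matrix, dtype=float)
    centred = X - X.mean(axis=0)
    # The squared singular values of the centred data are proportional
    # to the variances of the principal components.
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def significant_components(proportions, cutoff=0.05):
    """1-based indices of the PCs explaining more than `cutoff` of the variance."""
    return [i + 1 for i, p in enumerate(proportions) if p > cutoff]

# Invented frequency matrix: five texts, four features.
toy_matrix = [
    [3, 1, 0, 2],
    [4, 1, 1, 2],
    [0, 5, 3, 1],
    [1, 4, 4, 0],
    [2, 2, 2, 2],
]
proportions = explained_variance(toy_matrix)
print(np.round(proportions, 3))
print(significant_components(proportions))
```

Plotting the values in `proportions` against their rank gives the scree-plot; the 5% rule simply reads the cut-off from that curve.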
Figure 8. Scree plots (rongorongo data-sets)
The scree plot shows that the first three PCs explain the majority of the variance. These PCs
account for almost 75% of the variance in data-set 4 (see figure 8).
Figure 9. PCA (rongorongo dataset)
The graph (figure 9) shows the resulting PCA scatter plot. The data-points show some separation
and what appears to be a cluster of texts (referring to PC1 against PC2) with a few texts separated
from this main group. In addition it is possible to see how much variability there is in the data on
different combinations of PCs, for example, PC2 and PC3 show less scatter than the comparison
between PC1 and PC2.
It is not clear which texts are associated with which clusters. However, PCA provides a means for
reducing the number of variables needed to explain the underlying structure of the data-set. In
addition, it is possible to spot any clustering, which is important for the remainder of the analysis,
as it is based on text-categorisation methods.
5.3 Factor analysis: Rongorongo corpus
A factor analysis, with varimax rotation, was applied to the same frequency-matrix as the PCA.
Three factors were selected, and a separate analysis performed with promax rotation.
The factor analysis reveals the actual texts that are part of the clusters identified by the PCA. One
issue that needs to be addressed is how to explain what these groups represent, and how to justify
the inclusion of a text with any of the members in its cluster. There are roughly four main groups;
measured by their proximity to other texts as highlighted by the curves dividing the texts (see figure
10). These texts are distributed across the graph, forming a left-right distinction (highlighted by the
dashed line). It is not possible to state with any certainty where the texts should be divided. As an
example, text La could be part of group 3 or 4 as it is almost equal in distance from any single text
in either group. It is apparent, that these results alone do not make it possible to make any definite
decision, though they do provide a basis for exploring the corpus.
Figure 10. Plot of factor analysis of rongorongo (DD4) with varimax (left), and promax rotation (right).
The texts are quite dispersed in the first plot (figure 10, left), but the promax-based analysis shows
clearer clusters forming (figure 10, right). The texts of group 1 (H, P, and Q) are known parallel-passages and
are expected to be similar in terms of genre (see in particular Barthel, 1958:82-168; Horley, 2007a).
As for group 2, texts Gr, Kr, and Kv cluster together at the top of the plot, joined by two additional
texts: Na and Ev. The similarity of the Small Santiago (Gr), and London tablet (Kr and Kv) is
confirmed by internal analysis of the script showing that tablet (K) is a copy of one side of the
Small Santiago (Gr), (see Barthel, 1958).
In group 3, the Mamari tablet (Ca and Cb) is at the centre of the plot with texts Rb, Xa, Sa, and Fa
in close proximity. This may indicate that these texts contain some content similar to tablet C.
This group is partly driven by the presence of the 'crescent moon' glyph <40>, occurring twice on
the Stephen-Chauvet fragment (lines Fa3 and Fa4). By comparison, there are many more
instances of this glyph on tablet C, and on Er (another possible calendar sequence, see Wieczorek,
2010), which is also in proximity to tablet C, but seems to form its own group with the verso of the
Small Santiago (Gv), and one side of the Stephen-Chauvet fragment (Fb), designated group 4.
Group 4, in the bottom-left of the plot, includes the Santiago staff (Ia). This text is the most
separated from the other groups, with only a number of fragmented texts (Wa, Ya, and Va) included
in this group. Internal analysis of the inscription shows that its structure is very different to the
other tablets: an X1YZ structure, proposed by Fischer (1997). Text Ia is also joined by the Honolulu
text (Ta). The similarity between tablets I and T is already noted by Barthel (1958:167).
Turning to an assessment of the classification obtained, it is assumed that a correlation exists
between text Aa and texts Hr, Pr, and Qr, due to the shared sequence appearing on Aa1 (repeated on Hr5,
Pr5, and Qr5, see Guy, 1985:383). Instead, the opposite side (Ab) is grouped with these texts (see
group 1). Horley (2007a) shows that this is attributable to the presence of shared glyph-sequences on
Ab2 and Ab4, which are also present on Pr7. Text Aa has sequences in common with Hr, Pr, and Qr,
including a group of glyphs <1-9a>, also present on text Ra (line 5-8). This may point to a
fixed-glyphic compound, although on Ra the compound is expanded with the addition of fused-glyph
<5> (Horley, 2007a:27), with further examples of this form on texts Bv and Sa. The factor analysis
illustrates this to some degree, although text Sa is separated from the rest of the group, suggesting
that there are other features with more influence over the categorisation: one sequence is the
compound <380.1> on Cb. Barthel (1958) and Horley (2007:28) observe that this compound is
quite productive (with instances on texts Ab, Ca, Cb, Ev, Gr, Kr, Kv, and Na). The factor analysis
splits texts with this delimiter into two groups, group 2 and group 3, one appearing at the top-left of the
graph, the other appearing in the middle and drifting to the lower part of the graph. The difference
between these two groups, which share the compound <380.1>, may be caused by the presence of
glyph <40>.
The question to ask at this stage is whether these observations point to there being two list-types:
one a genealogical text, the other a mix of genealogical and calendar sequences (for example a
significant date attributed to a number of individuals). However, it is not possible to draw
conclusions on this basis alone.
5.4 N-grams
An analysis of bi-grams and tri-grams reveals which glyphs are driving the classification of texts in
the factor analysis. It appears from frequency-data that the four groups can be reduced to at least
three due to the presence of the previously mentioned shared-sequences. The frequency-data also
shows which texts contribute little to the classification of each group of texts. There is cause to
rearrange some texts from the factor analysis, as they share bi-grams and tri-grams more in common
with texts in other neighbouring groups (for example, texts Ev and Gr have some bi-grams in
common with text Gv, Ia and Ta):
Table 6. The most productive bi-grams in the rongorongo corpus.
Table 6 presents the top 15 results from the bi-gram analysis. Only glyph-compounds that appear
on several texts are retained as they are likely to be true compounds, rather than ‘junk’ strings
caused by the concatenation process. The bi-gram data supports a distinction between three broad
groups as opposed to the four illustrated in the factor analysis. Therefore the groups are re-arranged
according to the number of shared bi-grams, for example bi-grams on text Gv compared to Ia, and
Ta, show that Gv should be moved to group 4. Also, group 2 (Ev, Gr, Kr, Kv, Na, in figure 10),
shares more glyph-sequences with texts of group 4.
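The filtering step described above, in which only glyph-compounds attested on several texts are retained, can be sketched as follows. The function names, the `min_texts` parameter, and the toy sequences are assumptions made for illustration; the value used for "several" in the actual analysis is not stated.

```python
from collections import Counter, defaultdict

def ngrams(glyphs, n):
    """All contiguous n-grams of a glyph sequence, as tuples."""
    return [tuple(glyphs[i:i + n]) for i in range(len(glyphs) - n + 1)]

def shared_ngrams(texts, n=2, min_texts=2):
    """Return {ngram: {label: count}} for n-grams found on at least min_texts texts."""
    table = defaultdict(dict)
    for label, glyphs in texts.items():
        for gram, count in Counter(ngrams(glyphs, n)).items():
            table[gram][label] = count
    # N-grams confined to a single text are treated as possible 'junk' strings.
    return {gram: per_text for gram, per_text in table.items()
            if len(per_text) >= min_texts}

# Invented glyph sequences for three hypothetical texts.
toy_texts = {
    "Gr": ["380", "1", "3", "380", "1"],
    "Kr": ["380", "1", "22"],
    "Fb": ["200", "96"],
}
shared = shared_ngrams(toy_texts, n=2, min_texts=2)
print(shared)
```

Setting `n=3` gives the corresponding tri-gram table, and the per-text counts can then be compared across the groups proposed by the factor analysis.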
A correspondence appears to exist between bi-grams and the type of texts they appear on.
However, it is apparent that some texts are damaged, and therefore not reliably classifiable: for
example, some texts have hardly any shared glyphs, or do not appear in the top ranks (Fb, Ma, Oa,
and Fa, respectively). It is therefore questionable whether these texts should be included. The texts
in group 2 are more homogeneous: there are examples of the delimiter-list glyph-compound
<380.1> across all texts in this group. Group 2 consequently needs some adjustment, or the removal of
texts Er and Fb, as they share few bi-grams with the rest of the texts in this group. The statistics
show that Fb has glyph-compounds in common with other texts, for example <200-96>, appearing
once on both Fb and Gv, and the 'moon' glyphs. With so few occurrences it does not seem
justifiable to include Fb in the corpus, and so it is removed from the rest of this analysis. Texts Ca,
Cb, and Sa show some correspondence with group 2, again due to the presence of <380.1>; as a
result, this classification still appears to be far from perfect. In conclusion, the factor analysis
categorises the tablets with delimiters <380.1> and <76> on the left of the chart, and the more
diverse texts, or those with less structured lists (i.e. group 1), on the right of the plot.
Group 1 Aa Ab Br Bv Da Db Hr Hv Ma Nb Oa Pr Pv Qr Qv Ra Sb
6_74_3
3 5
2
3
1_62_6
2
2
2
10_144_3
5
7
3
15_22_3
8
10
4
3
22_3_8
3_22_3
3_306_3
3_600_3
3_93_3
440_440_440
6_1_62
62_6_1
1_9_5
1_95_3
10_2_10
Total
3
1
4
2
2
3
3
17
5
0
3
0 27 12 0
0
6
6
6
1
1
1
1
760_40_6
1_22_3
10_3_70
3_70_760
6
5
5
5
2
2
3
5
4
4
8
2
2
90_90_76
90_90_90
0_0_160
Total 0
3
2
3
3
378_41_670
390_41_378
41_378_41
66_760_4
49_3_76
76_73_3
76_90_76
90_76_70
8
2
1
1
1
3
2
2
3
2
2
3
2
3
1 30 10 22 6
3
0
6
0 39 0
Group 3 Ca Cb Rb Sa
40_40_40 8
41_670_8 7
670_8_78 7
8_78_711 7
3
2
2
2
2
3
2
2
260_1_4
3_90_76
430_76_530
430_76_55
4
3
Group 2 Er Ev Fb Gr Gv Ia Kr Kv Na Ta
1_380_1
3
5
2 3
380_1_3
30
8 11
1_2_34
3
2
1_4_711
2
2
1
1
1
33 12 13 5 10
6
380_1_22 5
40_390_41 5
70_760_40 5
Total 83
6
0
0
Table 7. The most productive tri-grams in the rongorongo corpus.
The tri-gram data reveal that texts Ca and Cb from group 3 should perhaps be included with group
2, due to the productivity of <380.1>. However, what seems to separate these texts is the presence
of the 'moon' glyph <40>, plus variants (<41>, <42>, and <43>). This may indicate three general
groups: those with glyph <40> and delimiter <380.1> (groups 2-3, figure 10); another with glyphs
<76> and <380.1>, represented by the texts appearing in group 4 (figure 10); and those with fewer
delimiter-glyphs (group 1, figure 10).
The tri-gram data show that texts Ia and Ta contain both the <380.1> delimiter and <76>, with Gr
showing a few occurrences of <1.380>, meaning they could be moved to group 3; however, they
lack the fully expanded version <1-380.1> and the more common <380.1.3>. Once again, it seems
necessary to revise the previously defined groups. Some texts show hardly any shared tri-grams: in
group 1 (Aa, Br, Bv, Da, Db, Ma, Nb, Oa, Ra, and Sb), group 2 (Er, Fb, and Gv), and group 3 (Rb
and Sa). In conclusion, at this stage of the analysis, it is assumed that there are at least three main
groups, with at most a fourth should the remainder of the analysis support one. A latent
semantic analysis (LSA) will provide confirmation of the validity of the current classification by
explaining the degree of correlation between each of the texts.
5.5 Factor analysis: Language corpus.
Turning to the language corpora, the factor analysis failed to identify any particular groups; in fact,
the programs reported an error in the factor analysis algorithm. The matrix was composed of only
312 texts, with the frequency-counts of the top 100 morphemes as features. To investigate the issue,
a correspondence analysis was performed to show the amount of variance between the texts, and the
loadings of each morpheme (see figure 11).
Figure 11. Correspondence plot of the influence of morpheme loadings on texts.
The correspondence plot shows that the data is very skewed resulting in the texts being bunched
together. This is because the features fail to draw out any significant difference in structure. Only
two texts have separated from the rest, a Rapanui song (Ate-a-renga hokan iti poheraa) from
Thomson (1889), and a Maori song (He Mata na te Kahu-kore) from Grey (1853). It is not possible,
with these two results alone, to conclude that there has been any success. The two song texts are
too far apart to suggest any relationship and the scores on the factors account for only 15.6% of the
variance in the data, hence the skew.
A possible explanation for this is polysemy. Synonymy and polysemy are rife in Polynesian
languages due to the restricted set of phonemes (only nine consonants, ten if we include the glottal
stop, and five vowels, see De Feu, 1996). There is a large set of homophones with more than one
interpretation, particularly in the case of syntax markers: for example, /a/ acts as a possessive (De
Feu, 1996:11) and as a person marker (De Feu, 1996:12). Tregear (1891) reveals further examples
of /a/: 'collar-bone', 'god', 'an interjection', the plural of particle /ta/, and 'to drive', or 'urge'
(Tregear, 1891:1). The word /hoki/ is a verb 'to return', but examples also exist of other word-types
including 'also', 'for', or 'because', and as the name for a fish (Tregear, 1891:79). Therefore, it is
likely that many of these morphemes represent other word-classes such as noun, verb, and
preposition. The matrix is composed of morphemes with the highest frequency in the corpus,
which are generally 'function words' (Mosteller and Wallace, 1964; Evert and Baroni, 2005; 2006).
However, the actual 'function' of a morpheme proposed to be part of the syntax of the language is
obscured by polysemy. In addition, there is evidence to suggest that Maori poetry has a different
structure compared to everyday usage. The comments of Rev. R. Maunsell illustrate the
complications in interpreting Maori poetry:
We shall see that it was not only abrupt and elliptical to an excess not
allowed in English poetry, but that it also carries its license so far as to
disregard rules of grammar that are strictly observed in prose; alters
words so as to make them sound more poetically; deals most arbitrarily
with the length of syllables, and sometimes even inverts their order, or
adds other syllables.
(cf. Grey, 1853:xiii-xiv)
The issues attributed to these genres materialise in the form of omissions, including articles /ko/
and /te/, /ai/, pronouns, particles (e.g. /nei/), the nominative case, verbs, and prepositions. In
addition, there is substitution of one preposition for another, and unusual or rare words are introduced or
inverted (see Maunsell, 1862). The same phenomenon is paralleled in the rongorongo corpus,
where whole passages may undergo 'significant re-phrasing' (Horley, 2007a:31). Consequently it is
likely that these issues have an impact on the statistical analysis, as observed in the correspondence
plot. One way to address this is to tag the corpus; however, for the purposes of this study, the time
constraints associated with handling large corpora (see Biber et al., 1998) were an issue, due to the
different transcription principles and the multilingual nature of the corpora.
Applying factor analysis methods to collections of multivariate data can be complicated, and many
of the decisions relating to feature selection are fairly subjective, as is the subsequent selection of
factors believed to be appropriate to elicit structures present in the data. As a result, it is not
possible to make a decision regarding which of the rongorongo texts are similar in terms of genre
based on these results alone. Therefore, we move to an analysis of lexical-richness through the
application of frequency spectra, Vocabulary Growth Curves (VGC), and Baayen's Productivity
Index (see Baayen, 2001).
5.6 Frequency spectrum and VGCs as tests for lexical-richness
As discussed previously, a test of lexical-richness based on Baayen's P, or
productivity-index, is calculated by dividing the number of hapax-legomena by the number of
tokens (V1/Vm). This calculation provides a measure of how lexically-rich the texts are. A
lexically-rich text will have a high productivity-index, as more and more new words are
encountered as sample-size increases. A narrative text, for example, may have a low score, and will
display a straight line showing that the types have been fully sampled, and that no more are
expected despite an increase in sample-size. In comparison, a list of unique individuals or entities
will show a steep upward curve, as each new type V is a hapax-legomenon (V1).
To compensate for the difference in sample-size, which distorts measures of lexical-richness, a
Generalised Inverse Gauss Poisson LNRE model was implemented, as it 'achieves excellent results'
(Evert and Baroni, 2005:14), in order to account for the number of unseen or expected types
(E[Vm]) predicted by the Large Number of Rare Events model (see Baayen, 2001). The Generalised
Inverse Gauss Poisson (GIGP) model is more efficient at modelling the hapax-legomena than the
rest of the frequency spectrum, in contrast to the finite Zipf-Mandelbrot model (fZM), which
supposedly adjusts the top and bottom ranks to comply more fully with Zipf's (1949) straight line in
a rank/frequency plot (see Evert and Baroni, 2006). This was the same result for all texts, and
GIGP was therefore selected as the best fitting model for extrapolating the texts to the larger
sample-size.
Figure 12. Comparison of the number of types predicted by FZM and GIGP LNRE models of text Aa (E[Vm]),
and the number of observed types (Vm).
A model for each text is created by applying binomial interpolation to the empirical growth curve,
followed by extrapolation up to the size of the largest text (see figure 13 for an example). This
provides a smoother curve by computing a series of randomised permutations over the distribution
of the words (or glyphs) to compensate for the non-randomness of word distributions, which would
otherwise result in an incorrect model of the extrapolated VGC.
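As an illustration of the interpolation step, the sketch below computes the expected vocabulary size of a random sub-sample of n tokens directly from the frequency spectrum, using the standard binomial interpolation formula E[V(n)] = sum over m of V_m * (1 - (1 - n/N)^m). It is a minimal sketch and not the software used in this study: the function names and toy data are assumptions, and extrapolation beyond the observed sample-size N, which requires a fitted LNRE model, is not shown.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency m to V_m, the number of types occurring m times."""
    type_counts = Counter(tokens)
    return Counter(type_counts.values())

def interpolated_vocab_size(spectrum, n):
    """Expected number of types in a random sub-sample of n tokens.

    Binomial interpolation: a type seen m times in N tokens is missed by
    the sub-sample with probability (1 - n/N) ** m.
    """
    total = sum(m * v_m for m, v_m in spectrum.items())
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the observed sample-size")
    miss = 1 - n / total
    return sum(v_m * (1 - miss ** m) for m, v_m in spectrum.items())

# Invented sample: N = 6 tokens, V = 3 types.
toy_tokens = ["a", "a", "a", "b", "b", "c"]
spectrum = frequency_spectrum(toy_tokens)
curve = [interpolated_vocab_size(spectrum, n) for n in range(0, 7)]
print(curve)
```

The interpolated values increase smoothly from 0 to the observed number of types, which is the property that makes the interpolated curve a more stable basis for extrapolation than the empirical one.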
The interpolated VGC is extrapolated to the sample-size of the Santiago staff (Ia), the largest of the
texts at N(2594). The results are plotted alongside the empirical VGC of each text. The parameters
of the model are estimated by the Zipf package (Evert and Baroni, 2006), see figure 12, above, and
table 8, below.
Parameters
  Shape          gamma = -0.29
  Lower decay    B = 0.05
  Upper decay    C = 0.02
  Zipf size      Z = 50.29
Goodness-of-fit (multivariate chi-squared test)
  X2      df      p
  2.17    4       0.71
Table 8. Parameters and resulting VGC generated by a Generalized Inverse Gauss Poisson (GIGP) LNRE
model
The parameters of the model can be adjusted to provide a better fit; however, this would involve a
trial and error approach for each individual text. The goodness-of-fit based on the estimated values
is close enough for the purposes of this study. For the model to be judged successful, a high p-value and a
low chi-squared score are required (see Baayen, 2008).
        Observed    Expected
  V     225         229
  V1    103         103
  V2    38          38
  V3    20          20
  V4    15          13
  V5    8           9
Table 9. Observed and modelled values for ranked-frequencies for text Aa.
Despite the poor estimation of the frequency of words at ranks V4 and V5 (those occurring four and
five times, respectively), the only values actually required for Baayen's P are the total number of expected types
(V) and the hapax-legomena (V1). The main benefit of applying such a model is to
address the issues relating to the comparison of texts of differing sample-size. Below (figure 13) is
an example of the resulting fit for text Aa, showing the empirical, binomially interpolated, and
extrapolated curves at the size of the largest text (Ia, N=2594).
Figure 13. Number of types V(N) compared to the expected number of types E[V(N)] for empirical and modelled
VGC (interpolated and extrapolated curves).
The end of the empirical curve of text Aa is apparent at approximately N(1000), where the straight
line reflects no further growth in vocabulary. The interpolated curve follows the slope of the
empirical curve fairly well until about N(450), where the empirical curve begins to diverge, as
observed by Baayen (2001) and Evert and Baroni (2006) in their studies; hence the reasoning
behind extrapolating from a binomially interpolated growth curve, and not from the empirical
VGC (see footnote 11). After fitting the GIGP model, Baayen's P is calculated over each text, and
plotted for comparison (see figure 14).
11. The interpolated curve smooths the empirical curve in order to prevent the extrapolated curve from being calculated on wildly
fluctuating values caused by the LNRE and the non-randomness of word distributions.
Figure 14. Measure of Baayen's P for texts in the rongorongo corpus at N(2594).
Comparing the groups identified by the factor analysis with the results of the lexical-richness test
shows some correspondence between the graph above (figure 14) and the factor analysis (figure
10). For example, texts Ab, Bv, Hv, Pv, La, Sb, and Bv (the texts forming group 1) have similar
Baayen's P values, and are joined by texts Ia and Ta, which show a similar amount of
lexical-richness. The variation between Gv and texts Ia and Ta, despite their similarities in terms
of glyph usage (glyph <76> delimiter), is now more apparent: Gv is lexically-richer. In addition,
texts with the delimiter <380.1>, or with many list-like structures (Horley, 2007a), are generally
those with a Baayen's P value lower than 0.2.
In conclusion, texts Da, Gv, and Aa are the most lexically-rich texts, and the right-side of the plot
illustrates the list-like texts, which are the least lexically-rich. Given the assumption that lists
are lexically-richer than narrative discourse, the plot does not appear to support this hypothesis.
However, texts Aa, Bv, Cb, Da, Gv, Ia, Sb, and Ta do actually appear to have a large number of
list-like sequences, for example: texts Aa and Ab with glyph-sequence <25.9:5, and variant 1.9:5>; on
tablet Br, a sequence marked with glyph <384> in line Br3, later replaced by glyph <63>, in Br6;
and Cb with <4-760> and <380.1> glyph delimiters.
However, if a tablet-side is composed of more than one text, the model may not estimate
the correct VGC for some texts. This is where proper segmentation is required in order to be sure
that these results are valid. Inspecting the resulting VGC of each text reveals whether this holds
true. As an example, the Santiago staff (Ia) has a sharper increase in new types as the number of
glyphs encountered (N) increases (see figure 15). However, the empirical VGC reveals the possible
presence of multiple sub-texts.
Figure 15. VGC plot of the Santiago staff (Ia).
The Santiago staff (Ia) is likely to be one text genre given the prevalence of glyph <76> across the
whole text, forming a long list. The plot shows, however, four or five possible list-types, indicated
by the observed peaks in the hapax-legomena (marked by the arrows). To confirm whether this is indeed
the case, a discourse map (see Biber et al., 1998:122-130) was produced for text Ia, with the most
frequently occurring glyphs (<1> to <11>), a number of delimiter glyphs (<76>, and <380>), the
crescent moon glyphs (<40> to <42>, as they may represent calendar sequences), and the text
delimiter (glyph <999>), which provides more information on the organisation of the whole
inscription.
Figure 16. Discourse map of the Santiago staff (Ia)
The large number of occurrences and distribution of glyph <76> (figure 16) reveals that the text
itself is probably one genre. The text delimiter glyph <999> is less frequent, but is again distributed
throughout the length of the text, and shows a number of breaks and packed groups; see for example
the dense clusters at N(750) and N(1750). In addition, other glyphs may indicate smaller
thematically-related passages. To illustrate, glyphs <1> and <2>, like <76>, are distributed across
the whole inscription, whereas glyphs <4>, <5>, <6>, <7>, and <380> are clustered together in small
passages repeated in isolated sections of the text. Two glyphs are of particular interest: glyph <40>
and glyph <41> (the mirror image of glyph <40>). If glyph <41> is synonymous with glyph <40>, then
there are five instances, evenly distributed throughout the text. Glyph <42> also occurs five times,
but after the first instance of glyph <40> and <41>, at approximately N(1250). The distribution of
these glyphs corresponds to the peaks in the VGC and hapax-legomena. However, to claim that this
is the mechanism underlying the VGC, in that there are five significant dates dividing up these five
textual fragments into some inventory or list of names, is probably too strong a claim to make based
on these results alone.
To explore this idea further, a discourse map is created for the Mamari tablet (text Ca) for
comparison (figure 17). Note the 'moon' glyphs (<40>, <41>, and <42>) and their distribution. It
is very clear where this calendar resides in the text. In addition, glyph <8> appears to be
structurally-related or dependent on this sequence of 'moon' glyphs, corresponding to the same
position in the text, perhaps indicating a semantic or syntactic correspondence. The other
hypothesis is that this is a sub-text, as the distribution of glyphs <1> and <380> (i.e. the delimiter)
appears to be mutually-exclusive with that of the 'moon' glyphs.
Figure 17. Discourse map of the Mamari tablet (Ca)
From the statistics of the VGCs and the structures illustrated by the discourse maps it can be
demonstrated which texts are likely to be similar in terms of their literary-genre, and where some
texts may be composed of multiple fragments. However, it may be argued that there are a number
of interpretations for the charts. In particular, the VGCs of each text may not show where
segmentation can be made, as they are not designed to measure this. In order to test the hypothesis
that VGCs can act as predictors for segmentation, or for identifying sub-texts, an additional
experiment is performed.
Two texts are selected from the Maori corpus. The first is a sample of the prose text Hinemoa, with
a sample-size of N(2000); the second is created by concatenating a series of short Maori
genealogical texts of the structure: Ko [NOUN], Ko [NOUN] etc. up to a sample-size of
approximately N(500). Three final texts are generated by replacing 500 tokens of the
Hinemoa text with the genealogical text, resulting in the same total sample-size of N(2000). The
three modified texts place the genealogical text at the beginning, middle, and end of the
Hinemoa text. It is predicted that the difference between the two text genres will be obvious due to
the variation in the number of hapax-legomena. Figure 18 shows the original sample texts. Note
the steep curve for the hapax-legomena of the genealogical text, indicating a high number of new
types as we move through the sample (N). By comparison, the Hinemoa text is considered to be
one whole prose text with less variability and lexical-richness than the genealogical text, as
illustrated by its curve and the small fluctuations in the distribution of new types. Turning to the
experimental texts (below), the presence of the genealogical text is clearly distinguishable as it
moves through the Hinemoa text from the beginning (figure 19.a), to the middle (figure 19.b), and
the end (figure 19.c).
Figure 18. VGC plots of the experimental texts: a genealogy, and a narrative text.
Figure 19 (panels a, b, and c). Plot of VGC and V1 for experimental texts illustrating sharp increase of
hapax-legomena. The genealogical text is marked by dashed-lines.
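The splicing experiment can be reproduced in outline as follows. This is a minimal sketch under stated assumptions: the token lists are toy stand-ins (a six-word cyclic "prose" sample and a "ko [NOUN]" list with hypothetical noun labels), not the Hinemoa or genealogical data, and the function names are illustrative.

```python
def growth_curves(tokens):
    """Return (V, V1): running counts of types and of hapax legomena."""
    counts = {}
    hapaxes = 0
    v_curve, v1_curve = [], []
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
        if counts[tok] == 1:
            hapaxes += 1      # a new type starts life as a hapax
        elif counts[tok] == 2:
            hapaxes -= 1      # a second occurrence removes it again
        v_curve.append(len(counts))
        v1_curve.append(hapaxes)
    return v_curve, v1_curve

def splice(host, insert, start):
    """Replace len(insert) tokens of host, beginning at start, with insert."""
    return host[:start] + insert + host[start + len(insert):]

# Toy stand-ins (hypothetical): repetitive prose and a list of names.
prose = [f"w{i % 6}" for i in range(140)]
genealogy = [tok for i in range(20) for tok in ("ko", f"noun{i}")]

results = {}
for label, start in (("beginning", 0), ("middle", 50), ("end", 100)):
    text = splice(prose, genealogy, start)
    v, v1 = growth_curves(text)
    before = v[start - 1] if start > 0 else 0
    gain = v[start + len(genealogy) - 1] - before   # new types inside the insert
    results[label] = (len(text), gain, v1[-1])
    print(label, results[label])
```

In each position the inserted list contributes a burst of new types and hapax legomena while the total sample-size stays constant, which is the effect the dashed regions in figure 19 are intended to show.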
In conclusion, VGCs provide a good indicator of change in discourse or literary type. Hapax-legomena
are particularly important: a high number of hapax-legomena indicates a text where each word
sampled is new, which generates a steep upward curve in comparison to a prose or narrative text.
Therefore VGCs may act as good predictors of sub-texts in the rongorongo corpus, and of where to
segment the rongorongo texts in order to improve statistical measures. Golcher (2007), on analysing
the undeciphered Voynich manuscript, developed an ‘original constant’ that may be more stable than
ZipfR, though ‘probably also trickier to compute’ (Baroni, 2010, personal communication).
Therefore, more work is needed before it can be demonstrated that VGC data can be used as a guide
for segmentation, or shift in discourse-type.
5.7 Latent semantic analysis
5.7.1 Document-document analysis
The LSA is predicted to provide more tangible results compared to a factor analysis. Although a
factor analysis can be successful in spatially orientating apparent similarities between texts, the
researcher is still left to decide where the groups begin and end, and how far the influence extends.
In addition, lexical-richness tests can tell us how rich the vocabulary is in a text, but this does not
equate to the text having any form of shared-content. With LSA there is less subjectivity, as the
similarity between texts (documents), and glyphs (terms) can be measured according to their
correlation. Texts are therefore judged similar in terms of the glyphs that are present,
and the structures that put these glyphs together into larger phrases. LSA not only identifies
collocations along similar lines to bi-gram, tri-gram or KWIC data, but also any
structural-dependencies existing between constituents present further on in the text. Pearson's
correlation is used here as a measure of text and glyph similarity (see Landauer et al., 1994;
Wild, 2009 on Cosine measures).
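A document-document comparison of this kind can be sketched with a truncated singular value decomposition. The block below is an illustration only: it assumes the LSA space is a rank-k reconstruction of a glyph-by-text count matrix and that Pearson's r is then taken between the reconstructed columns. The counts, the text labels, and the choice of k = 2 are hypothetical, and NumPy stands in for whatever software was actually used.

```python
import numpy as np

def lsa_reconstruct(counts, k):
    """Rank-k approximation of a term-by-document matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(counts, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

# Rows are glyphs (terms), columns are texts (documents); toy counts.
texts = ["T1", "T2", "T3", "T4"]
counts = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

reduced = lsa_reconstruct(counts, k=2)
doc_corr = np.corrcoef(reduced.T)      # Pearson r between texts (columns)

# Report the pairs that meet the >=0.7 threshold used in the tables.
strong = [(texts[i], texts[j])
          for i in range(len(texts)) for j in range(i + 1, len(texts))
          if doc_corr[i, j] >= 0.7]
print(np.round(doc_corr, 2))
print(strong)
```

On the raw counts T1 and T2 share only one glyph and correlate weakly; after the rank reduction they fall together, because the reduced space keeps only the dominant pattern of shared usage.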
The results (table 10), show the correspondence between texts. Here, LSA has correctly identified
texts predicted by both the factor analysis and n-gram data (group 1).
Text  Aa    Ab    Br    Bv    Da    Db    Hr    Hv    Nb    Oa    Pr    Pv    Qr    Qv    Ra
Ab    0.67
Br    0.43  0.42
Bv    0.57  0.54  0.55
Da    0.37  0.48  0.39  0.44
Db    0.36  0.40  0.43  0.43  0.48
Hr    0.66  0.59  0.55  0.73  0.56  0.48
Hv    0.63  0.58  0.46  0.66  0.50  0.45  0.75
Nb    0.39  0.45  0.33  0.46  0.45  0.23  0.48  0.51
Oa    0.41  0.53  0.40  0.45  0.36  0.32  0.55  0.51  0.35
Pr    0.65  0.61  0.55  0.72  0.56  0.50  0.89  0.77  0.53  0.52
Pv    0.56  0.49  0.38  0.57  0.48  0.40  0.69  0.86  0.41  0.41  0.74
Qr    0.66  0.59  0.51  0.72  0.50  0.43  0.87  0.69  0.43  0.47  0.84  0.64
Qv    0.54  0.50  0.37  0.49  0.48  0.31  0.63  0.72  0.43  0.42  0.66  0.78  0.61
Ra    0.61  0.54  0.39  0.49  0.50  0.34  0.63  0.58  0.36  0.56  0.63  0.50  0.58  0.55
Sb    0.50  0.49  0.34  0.57  0.40  0.41  0.68  0.66  0.36  0.55  0.64  0.58  0.54  0.45  0.49
Table 10. Pair wise correlation between texts of group 1.
Figures meeting the >=0.7 threshold highlight the most strongly correlated texts. Tablets H, P, and Q
are correlated due to the number of shared passages (Kudrjavtsev, 1949). Text Pr shows a stronger
correlation with Hr (0.89) than with Qr. Similarly, Pv has a higher correlation with Hv (0.86) than
with Qv. The difference between Hr and Pr, compared to Qr, is probably caused by damage or
allography (see footnote 12). This is confirmed by plotting their VGCs (see figure 20, below).
Figure 20. VGC plot of texts Hr, Pr, and Qr, explaining variation in observed results for LSA.
The VGC of the recto side of each tablet shows that text Qr is shorter than the other parallel-texts
(Hr, and Pr), which explains the slight differences observed in previous results (i.e. factor analysis,
and now LSA).
The second group (table 11), identified by the factor analysis, reveals that texts Kr and Kv are
similar to one side of the Small Santiago (Gr) (see figure 1.b, above). Text Ev shows a degree of
correlation with Gr, Kr, and Kv, more so than text Na. All these texts share the compound <380.1>,
although differences exist with respect to the glyphs occurring between instances of this delimiter.
Although text Gr and tablet K are one text considered a ‘list’, Ev and Na may also be of the same
genre-type, but listing different individuals or dates, for example.
12. The fact that tablets H and P are so correlated would point to there being less allographic variation between them. An example
exists between the head-shapes of glyph <205> and <305> in line 5 of Hr, Pr, and Qr (see figure 1.a), possibly implying the work of
a separate scribe.
Text  Ev    Gr    Kr    Kv
Gr    0.62
Kr    0.60  0.74
Kv    0.57  0.79  0.64
Na    0.48  0.46  0.44  0.43
Table 11. Pair wise correlation between texts of group 2.
The texts Ia, Ta, and Gv in group 3 (see figure 10, and table 12) show the strongest correlation of
all texts currently analysed. The correlation between Ia and Ta suggests that text Ta may be a copy of
Ia, supported in part by the bi-gram data (above), where texts Ia and Ta share many compound-forms
not present in text Gv.
Text  Er    Fb    Gv    Ia
Fb    0.05
Gv    0.28  0.10
Ia    0.10  0.01  0.79
Ta    0.15  0.02  0.80  0.91
Table 12. Pair wise correlation between texts of group 3.
Group 4 (table 13) shows little correlation between any texts, as per the results of the n-gram
analysis. Therefore these texts may represent separate genres from the rest of the corpus, or the
classification according to the factor analysis is incorrect. Text Fa is fragmented and so there is
no reliable correlation.
Text  Ca    Cb    Fa    Rb
Cb    0.52
Fa    0.22  0.15
Rb    0.31  0.36  0.08
Sa    0.20  0.33  0.08  0.29
Table 13. Pair wise correlation between texts of group 4.
The LSA is performed again with removal of the fragmented texts (Fa, Fb, Ma, and Oa), and small
samples (La, Ua, Va, Wa, Xa, Ya, Yb, and Za). Texts are joined together where they show a
correlation between both sides of the tablet (texts Ca and Cb, Hr and Hv, Pr and Pv), or where
previous palaeographic research points to a duplicate copy (texts Kr and Kv, a copy of text Gr).
A hierarchical-cluster analysis is applied using agglomerative-clustering in order to locate the
groups identified by the LSA. The document-document similarity values (correlations) are
converted to a distance-object by squaring over the correlation matrix, to satisfy the non-negative
values required by the rest of the analysis. The dissimilarity between two groups is taken to be the
maximum value of the dissimilarities between the individual texts in each group (see Baayen,
2008:138; Johnson, 2008:183-192). These are plotted as a dendrogram, with the groups identified by
the links between the texts (see figure 21).
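The clustering step described above corresponds to complete-linkage agglomeration, which can be sketched without any external library. The example below is illustrative rather than a reproduction of the analysis: it uses the group 3 correlations from table 12 as input, and it assumes the distance is 1 - r^2, which is one possible reading of "squaring over the correlation matrix". The function names are hypothetical.

```python
from itertools import combinations

def linkage_height(a, b, dist):
    """Complete linkage: the largest pairwise distance between two clusters."""
    return max(dist[frozenset((x, y))] for x in a for y in b)

def complete_linkage(labels, dist):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage_height(pair[0], pair[1], dist))
        merges.append((sorted(a), sorted(b), round(linkage_height(a, b, dist), 4)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Pair-wise correlations taken from table 12 (group 3).
corr = {
    ("Er", "Fb"): 0.05, ("Er", "Gv"): 0.28, ("Fb", "Gv"): 0.10,
    ("Er", "Ia"): 0.10, ("Fb", "Ia"): 0.01, ("Gv", "Ia"): 0.79,
    ("Er", "Ta"): 0.15, ("Fb", "Ta"): 0.02, ("Gv", "Ta"): 0.80,
    ("Ia", "Ta"): 0.91,
}
# Assumed conversion: squared correlation turned into a non-negative distance.
dist = {frozenset(pair): 1 - r ** 2 for pair, r in corr.items()}

merges = complete_linkage(["Er", "Fb", "Gv", "Ia", "Ta"], dist)
for left, right, height in merges:
    print(left, "+", right, "at", height)
```

Under these assumptions Ia and Ta are linked first and Gv joins them next, which mirrors the Ia - Ta pair and its association with Gv discussed for this group.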
Figure 21. Dendrogram of rongorongo texts based on the LSA data of the sub-corpora.
The plot shows two main groups divided into roughly five sub-groups. The texts forming
linked-pairs, i.e. Ia and Ta, Gr and K, H and Q etc., are highly correlated. The two main groups
appear to be divided between tablets with large numbers of delimiter-groups (see Horley, 2007a:27),
and those that may be considered narrative or prose structured texts. This does not mean they are
all of the same genre, as there is still a distinction between texts delimited with glyph compound
<380.1>, and those with glyph <76> (see group 2 in table 11, and group 3, table 12).
Horley (2007a:28) identifies shared-glyph sequences between Ev and text Ab, and Pr, and between
texts Ab, Ca, Cb, Ev, and Pr. There are, however, more shared-glyphs between texts: Ab - Ra - Gr,
Cb - Ev - Gr - Kv - Na, and Aa - Pr; see Horley (2007a:28) for examples. It is therefore possible
that the groups are skewed by the presence of these fragments; however, the
categorisation is driven by the correlation between texts, grouped according to how much they have
in common overall. Consequently, this provides a general view of the main groups, which can be
improved once we are aware of the proper segmentation. In addition, it is possible that these
shared-sequences relate to commonly used set-phrases, or formulaic introductory expressions;
therefore the presence of these shared-glyphs may not be the final decider of literary-genre.
Group 1    Group 2    Group 3    Group 4    Group 5
Ia - Ta    Gr - K     H - Q      Db - Rb    Br - Er
Gv         Ev         P          Bv - Sb    Na - Sa
                      A          Da - Ra    C
                                            Nb
Table 14. Summary of text clusters and revised groupings (highlighted columns represent the main links
between possible sub-genres).
Group 1 supports the results seen previously in the factor analysis, showing the correlation between
texts Gv, Ia and Ta. Group 2 agrees with the factor analysis. Group 3 contains the parallel-passages
of tablets H, P, and Q, with the slight separation of tablet P, and the similarity between these texts
and text A is also apparent (Guy, 1985:383). Groups 4 and 5 show a distinction between the
parallel-texts (Gr, Kr, and Kv), and the rest of the inscriptions (those with the delimiter-glyph
compound <380.1>). Furthermore, group 5 contains the lunar-calendar series (Ca, and Cb) and a
repetitive lunar-sequence on one side of the Keiti tablet (Er) (Wieczorek, 2010). The final
classification agrees to an extent with that concluded by Barthel (1958). However, the hierarchical
clustering shows that some texts are less associated with the ones proposed by Barthel (1958), but
still part of the general category if we consider the top level clusters: the relationship between the
parallel-passages H, P, and Q, and tablet A, which are seemingly related to some extent, with some
slight separation. The statistics also support previous extensive internal-structural analyses (see
Horley, 2007a; 2007b; 2010), which identify the shared-sequences through palaeographic methods.
In conclusion, these groups appear to support previous research. If the list-like texts of groups 4-5
represent one broad-genre, for example funerary texts, then we may 'expect that they would
present slightly different versions of the same text, as happens with Egyptian funeral texts' (Horley,
2007a:31), hence the range of texts in these groups.
As a result, the glyphs between delimiter-glyphs may represent 'personal names'. Groups 1 and 2
reflect a separate list-like genre, and the prose-like texts (group 3) could represent 'short songs'
(kaikai), or 'prayers or charms' (Horley, 2007a:31). What the broad groups of inscription represent,
and whether the distinction is between prose and structured-lists, is still unknown. It cannot be
assumed that the surviving tablets represent the wealth of rongorongo literary-genre (Melka,
2009a:116), and consequently, the classification presented here may hint that there is a restricted
set of surviving genres.
The above methods and procedures are applied to the language corpora, which previously showed
little discrimination between texts in the factor analysis. Due to space restrictions the dendrogram
plot is summarised by mapping the genre assigned to each text in the meta-data, rather than the
name of each text individually. This reveals the main clusters formed by the genres, presented
below (figure 22).
Figure 22. Dendrogram of language corpora based on LSA data
The results of hierarchical clustering show that two broad categories exist: one group composed of
genres between haka and recitation before speaking, with another sub-category consisting of kaikai
and war chant genres. The last cluster is split between prose, RR (the Jaussen texts; Barthel,
1958:173-199), and song.
The groups appear to be part of a broader category of genre: haka, incantation, lament, love song,
lullaby, pepeha, and recitation before speaking may be classified as instances of 'ceremonial' or
'public address' performance, with a related sub-genre of 'chant' denoted by kaikai (see Blixen,
1979), and war chant (see Grey, 1853:39, 67, 72). Kaikai songs were apparently used as 'magical
spells or charms' (Horley, 2007a:31). Some examples are found in Campbell (1971:93-120),
'Cantos de Aku - songs of the spirits', which may have been chanted for protection against
malevolent spirits. Horley (2007a: 31) highlights the list-like structure of some kaikai songs. The
fact that the kaikai category is included as part of the incantation genre is an interesting parallel
given their use as 'magic spells', and similar structure with repeat passages (marked in bold and
italics):
KAIKAI HANUANUA MEA Y PIKEA 'URI
1. Ka hau e, ka hau nga'e he ka hau te nukunuku ka Kava'aro, ka Kavatu'a kakokako.
2. A Ure a Ohovehi ku kahakihia a e Nga Ihu
3. More a Pua Katiki e hiahia pua mauku 'uta tangitangi pua mauku tai [...]
(Blixen, 1979)
A pepeha is a Maori 'introductory text' used to introduce oneself to an individual or group, detailing
one's ancestors and place of origin. A recitation before speaking may similarly be viewed as an
instance of an 'introductory text'. Laments were sung when something affected the community
negatively, such as a funeral, often with the crowd singing the chorus. Grey (1853:10) recorded one
such lament, 'Ko Te Tangi A Te Ikaherengutu, Mo Ana Tamariki, I Mate Taua Etahi, I Mate
Kongenge Etahi', which was, 'sung by Te Wherowhero, on the death of his brother kati, […]', and is,
'always sung by the aged chiefs if many members of their family die'. Another lament was sung on,
'the occasion of the Governor quitting Taupo in 1819', (Grey, 1853:70), showing that they were not
restricted to the death of family members. An example of a lament showing group participation,
sung in chorus, appears below. The parts sung by the group are marked by italics (Grey,
1853:118):
HE TANGI, MO TE MATE TURORO.
1. Mate kino, mate kino!
2. Mate taurekareka!
3. Me he mate taua pea koe,
4. Tataia he toroa,
5. Hoea i, te moana pea.
The haka shows similar parallels with laments, in that a chief orator leads the group,
with the group singing the chorus (below). This text also illustrates the issues concerning repetition
and elliptical phrases (mentioned above).
KO TE HAKA A TAHATAHA, TE WAHINE A TE UIRA TE RANGATIRA
O NGATI-POU
1. He aha ra he kai, ma te tuna o te raupo,
2. E anga mai ai?
3. Aha! ha!
4. Ko nga mongamonga o nga wheua o Tawhiroi,
5. Whiua ki Whangape, ki reira whiriwhiri ai, kia pahoho.
6. I, i, i,
7. Me whakatapu ranei?
8. Me whakanoa ranei?
9. Me whakatapu ranei?
10. Me whakanoa ranei?
11. Me marau ki Kariwha, kia manana ake ko te puhi-tuna,
12. He' rino;
13. He tuna ha,
14. A te kai a te koioio.
(Grey, 1853:79)
An example of an incantation (Grey, 1853:58), shows similar parallels with the other examples,
(below). What is particularly evident from this example is the number of repeat phrases (in bold
and italics). There are two main verses repeated (see Grey, 1853:40, 61-62, 107, for further
examples of repeated passages):
MO TE TAANGA NGUTU KAUAE, MO TE WAHINE, TENEI WAIATA
KARAKIA WHAKAWAI.
1. Takoto ra, e hine,
2. Pirori e,
3. Kia taia o ngutu,
4. Pirori e,
5. Mo to haerenga atu, ki nga whare tapere,
6. I kiia ana mai,
7. Ko hea tenei wahine kino ?
8. E haere mai nei.
9. Takoto ra, e hine
10. Pirori e,
11. Kia taia o ngutu,
12. To kauae,
13. Kia pai ai koe:
14. Pirori e
15. Mo to haerenga, ki nga whare matoro,
16. E kiia ana mai koe,
17. Ko hea tenei wahine ngutu whero ?
18. E haere mai nei. [...]
(Grey, 1853:58)
Consequently, it would seem plausible to attribute the category of 'public oration' or 'chant' to the
first group (at the top of the chart), whose members all show some form of formulaic expression,
repeated sequences, or removal of some syntax particles (note lines 5 and 15, and the insertion of
additional content at lines 13-14).
The second category may be described as a 'folklore' or 'narrative' genre, with the texts classified
as song being a possible misclassification, perhaps of the genre poetry. The rongorongo (RR) texts
(cf. Barthel, 1958) of the Tahua (A), Aruku-kurenga (B), Mamari (C), and Keiti (E) tablets are
apparently authentic rongorongo contents (Fischer, 1997); however, this point is still contentious
(Guy, 1985), though Barthel (1958) believes they should not be completely disregarded.
Given their sample-size, between N(2056) and N(5522) it may be possible to rule them out as any
of the genres represented in the first group. The hierarchical cluster plot shows that the RR texts are
related to the category prose. There exists a correlation between the inclusion of these texts in the
prose category, and their large sample-size, which fits more with a prose text, than a chant or song
(typically between 50-200 words). In addition, Maunsell (cf. Grey, 1853:xiii-xiv), notes that prose
was very different and adhered more to the grammar of Maori than poetry, songs, or chants, which
have different grammatical structures. This observation is supported by the hierarchical cluster plot
where prose is separated from the first main group of 'public address'.
However, the meta-data related to the classification of the language corpus (according to key-words
in the title) may over-generalise and could therefore be unreliable, despite the apparent
correspondence between the genres. Without assistance from a native Maori speaker, caution is
needed until the text-categories are properly validated.
Based on the results of the rongorongo and language corpus, it would be difficult to argue for a
successful classification given that rongorongo is still undeciphered, and the issues associated with
polysemy in the language corpora need to be properly addressed. Consequently, an additional
corpus is introduced, containing 43 Egyptian Hieroglyphic texts collected from Rosmorduc (1997).
This will allow the methods applied above to be validated properly.
The genre of one group of texts is known: they are classified as teachings. This group consists of 28
texts: 27 of the Teachings of Amenemope, and a further teaching text, Kagemni. If all the texts are
similar in terms of their contents, this will become apparent from an LSA, and the resulting
hierarchical clustering procedures. The analysis will also replicate the condition of the rongorongo
corpus, as the Egyptian texts are transliterated in much the same way. The genre of the remaining
texts is unknown, but they generally represent texts collected from coffin and stela inscriptions.
Figure 23 presents the results. Texts AM001 to AM027 represent the Teachings of Amenemope, and
KAG001 the Kagemni teachings text. There are two categories of text. One group is divided into
two sub-groups, with the majority of the Amenemope texts clustered together at the top of the chart.
Six of the Amenemope texts show less correlation with this group and are instead part of other
clusters. Four of the Amenemope texts (AM003, AM004, AM012, and AM014) appear with the
Kagemni (KAG001) text, showing some relationship with the main Teachings of Amenemope,
and forming an additional sub-group that includes 7 other texts (see figure 23, below).
Figure 23. Dendrogram of Egyptian Hieroglyphic texts categorised by literary-genre.
The remaining group, appearing at the bottom of the chart, (headed by text SHT001), contains two
of the Amenemope texts (AM002, and AM027), showing a departure from the main teachings
group. This might, again, be a result of over-generalisation or pre-processing issues. However, for
the purposes of this study, the categorisation has been quite successful, with the bulk of the
Amenemope texts clustered together, and a secondary group containing the remaining teaching
texts.
5.7.2 Term-term similarity analysis
The same LSA methods applied to text classification may be applied to the glyphs and morphemes
themselves. To demonstrate, a term-to-term similarity analysis was performed on a chosen glyph.
The Egyptian texts were selected again, as the values of the glyphs are known, and results based on
the 'known' highlight the accuracy of the method better than those based on the 'unknown', i.e. the
rongorongo corpus. In addition, attributing significance to one glyph or group of glyphs in the
rongorongo corpus would require an entire new study in itself, and substantial statistical support.
In Egyptian, glyph <A1> depicts a 'seated man', and acts as a determiner meaning 'I' or 'me', but
also as a semantic classifier when attached to glyphs denoting actions, occupations, relationships,
and personal names (Gardiner, 2005:442); see figure 24 for examples. Glyph <A1> should be quite
productive in the corpus, making it a good candidate for analysis.
Figure 24 (panels a, b, and c). Examples of glyph <A1> (see footnote 13).
Above, glyph <A1> stands before the verb 'be silent' (Faulkner, 1988:290), forming the noun 'silent
man' (24.a). In the second example (24.b), it acts as a determiner denoting an occupation, 'scribe'
(Faulkner, 1988:246), and in the final example (24.c), glyph <A1> is a determiner for son (Faulkner,
1988:207), as in 'his son', and also appears at the end of the passage denoting the plural of the
compound 'king's children' (Faulkner, 1988:116).
Glyph <A1> is highly productive over a large set of texts (table 15), which means we can proceed
with the analysis, as there seem to be enough instances of glyph <A1> for the LSA to make good
predictions of its relationship with other glyphs in the corpus.
13. These examples are generated in LaTeX with the Sesh package by Serge Rosmorduc (1997).
Rank  Glyph  Count  Rel. Freq.  Text
1     A1     260    0.03        OUN001
2     A1     211    < 0.01      WST001
3     A1     177    < 0.01      BRO001
4     A1     169    0.05        NAU001
5     A1     114    < 0.01      HET001
6     A1     109    0.02        DPA001
7     A1     85     0.02        INE001
8     A1     30     0.03        IKH001
9     A1     19     0.03        CTA001
10    A1     16     0.02        KAG001
11    A1     13     0.04        AM002
12    A1     11     0.03        AM022
13    A1     10     0.02        AM020
14    A1     10     0.02        AM015
15    A1     9      0.01        GUA001
16    A1     8      0.02        AM024
17    A1     8      0.01        CNE001
18    A1     8      0.01        AM001
19    A1     7      0.02        AM025
20    A1     7      0.02        AM010
Table 15. Top 20 Frequency count of Glyph <A1> in descending order.
The glyphs meeting the >=0.7 threshold are once again returned, (as per Landauer et al.,
2004:5217). A KWIC concordance is generated to validate the results of the LSA. Due to the large
number of instances, the results have been summarised by selecting the most correlated-glyphs,
with a count of their position relative to the node <A1> (see table 16). The second to last column
shows the total number of instances of each correlated-glyph, and the last column, their correlation
to glyph <A1>.
Glyph  POS.1  POS.2  POS.3  POS.4  POS.5  POS.6  NODE  POS.8  POS.9  POS.10  POS.11  POS.12  POS.13  Total  Correlation
V31A   9      5      2      5      25     4      A1    0      5      12      15      7       7       89     0.91
A2     9      5      7      7      8      22     A1    0      8      6       1       9       13      82     0.86
D54    4      10     6      14     11     4      A1    2      0      6       2       9       8       68     0.92
G41    9      3      9      19     0      0      A1    2      2      3       3       3       6       53     0.81
N23    4      0      1      0      0      1      A1    0      1      7       2       1       4       17     0.93
gm     0      0      3      6      0      0      A1    3      0      1       1       2       0       16     0.91
mw     3      1      1      0      1      1      A1    0      0      0       1       0       1       8      0.91
U19    0      0      0      0      0      0      A1    0      0      0       0       0       0       0      0.91
N36    0      1      0      0      1      0      A1    0      3      0       0       1       0       6      0.72
Table 16. Summary of the KWIC concordance and LSA results.
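A positional summary of this kind can be produced by counting which glyph fills each slot around every occurrence of the node. The following is a minimal sketch, assuming a window of six glyphs on either side so that the slots run from POS.1 to POS.13 with the node in slot 7; the function name and the short glyph sequence are hypothetical.

```python
from collections import defaultdict

def kwic_position_counts(tokens, node, span=6):
    """Count how often each glyph fills each slot around the node.

    Slots are numbered as in the table: 1..span to the left of the node,
    span + 1 for the node itself, and span + 2..2 * span + 1 to the right.
    """
    table = defaultdict(lambda: defaultdict(int))
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue          # skip the node itself and text edges
            table[tokens[j]][offset + span + 1] += 1
    return table

# Hypothetical transliterated sequence, for illustration only.
tokens = ["A2", "n", "A1", "G41", "A1", "D54", "n", "A1"]
table = kwic_position_counts(tokens, node="A1")

for glyph in ("n", "G41", "A2"):
    row = dict(sorted(table[glyph].items()))
    print(glyph, row, "total:", sum(row.values()))
```

Summing a row across all slots gives the kind of total reported per glyph, and a glyph that never enters the window simply has no row, as is the case for <U19> in table 16.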
The KWIC concordance shows that there are examples of glyph <A1> forming a relationship with
the glyphs identified by the LSA (i.e. <V31A>, <A2>, <D54>, <G41>, <N23>, and <gm>). There
are, however, a few concerns in relation to the correlation of 0.91 assigned to glyph <U19>, an
'adze', or variant of the preposition <n> (Gardiner, 2005:518). There are no instances of it in the
KWIC concordances. A further analysis reveals that glyph <U19> is assigned this value due to it
appearing in proximity to the other glyphs associated with glyph <A1> (table 17).
No.  POS.1  POS.2  POS.3  POS.4  POS.5  POS.6  NODE  POS.8  POS.9  POS.10  POS.11  POS.12  POS.13
1    A      i      i      W      k      n      U19   nw     W      D6      W       Sd      d
2    i      r      i      W      f      n      U19   nw     W      D6      r       G41     A
3    A      s      D26    i      A2     n      U19   nw     W      D6      r       G41     A
4    Z1     P1     i      W      f      n      U19   nw     W      D6      m       wA      A
5    t      N23    Z1     Z2     n      n      U19   nw     W      H       W       t       F37B
6    ir     r      t      6      1      n      U19   nw     W      ra      Hr      1       p
7    W      N31    D54    D2     1      n      U19   nw     W      D6      O       i       W
8    n      i      n      z      b      n      U19   nw     W      D6      i       W       s
9    a      n      b      C7     G7     n      U19   nw     6      4       D6      i       W
10   W      b      n      sw     W      n      U19   nw     W      D6      D       d       nb
Table 17. KWIC concordance of <U19>.
The correlation of 0.84 between glyph <nw> and <A1> is explained by the KWIC concordance.
Glyph <U19> appears mainly in combination with glyph <n> and <nw>. On closer examination,
glyph <U19> combined with <n> and <nw> (actually glyph <N35> and <W24> respectively)
forms the demonstrative 'this' (Gardiner, 2005:518). In addition, the other glyphs associated with
<A1> occur with glyph <U19>, as shown by the summary (table 18). A few previously observed
glyphs, for example <A2>, <D54>, <G41>, <mw>, and <N23>, are also returned in the results.
Glyph  POS.1  POS.2  POS.3  POS.4  POS.5  POS.6  NODE  POS.8  POS.9  POS.10  POS.11  POS.12  POS.13  Total
mw         0      0      0      0      0      0  U19       0      0       0       0       2       2      4
A2         0      0      0      0      3      0  U19       0      0       0       0       0       0      3
G41        0      0      0      0      0      0  U19       0      0       0       0       2       0      2
D54        0      0      1      0      0      0  U19       0      0       0       0       0       0      1
N23        0      1      0      0      0      0  U19       0      0       0       0       0       0      1
V31A       0      0      0      0      0      0  U19       0      0       0       0       0       0      0
gm         0      0      0      0      0      0  U19       0      0       0       0       0       0      0
U19        0      0      0      0      0      0  U19       0      0       0       0       0       0      0
N36        0      0      0      0      0      0  U19       0      0       0       0       0       0      0
Table 18. Summary of KWIC results for glyph <U19>.
What must be considered is that LSA identifies the similarity between glyphs by looking at
examples of collocations, as well as at the clause and paragraph as a whole. LSA does this for all
texts and glyphs in the corpus. Consequently, if a glyph has a lower correlation (see <N36> at 0.72
in table 16), this may be due to its being associated with many glyphs in the same domain (i.e. as a
determiner or phonetic complement). This is suggested by the plot of the term-to-term similarity
scores (correlation); see figure 25, below.
Figure 25. Plot of glyph <N36>
The long flat line at the top of the graph shows that this glyph is equally correlated with many other
glyphs, at approximately 0.92. This glyph is therefore less correlated with glyph <A1> because it is
more productive, and less likely to favour a particular glyph.
The strong correlation between glyphs is thus attributed to those that form collocations, or
where there is a semantic dependency. To illustrate, glyphs <N36> and <N23> form compounds,
including /mrt/ 'weaver' and /mr/ 'friend', which include glyph <A1> as determiner (Gardiner,
2005:491). In addition, <G41> forms the sequence <A1-N35-G41-G54>, meaning 'alight' or 'halt',
and <G17-D36-X1-N35-T14-G41-A1>, 'nomad hunter' (Gardiner, 2005:472), which explains to
some degree its positive correlation with <A1>. Many of the glyphs retrieved by the LSA therefore
have some function associated with glyph <A1>, either as a determiner, phonetic complement,
or logogram.
In conclusion, the results show that the methods used here have elicited some form of literary genre
based on the features peculiar to each text, whether from data containing glyphs or morphemes.
Some of these methods, for example the factor analysis, provide a good starting point for exploratory
studies, but there are issues related to the number and type of features selected. Traditional corpus
linguistic techniques (counts of n-grams, and KWIC concordances) offer more precision in
identifying which glyphs and morphemes form some lexical or grammatical association pattern
(Biber et al., 1998). However, it is often time-consuming to filter through the data looking for these
patterns. Latent semantic analysis (LSA) produces results which are easier to interpret, and shows
the strength of correlation between terms. One disadvantage is that LSA is computationally
expensive when applied to large corpora; for this study, however, it is capable of processing the
whole rongorongo corpus without any problems. To summarise, future rongorongo studies may
wish to consider adopting LSA as part of a mixed-methods approach when exploring the semantics
of the rongorongo corpus.
Chapter 6 Conclusion and Recommendations for further research
6.1 Conclusion
The analysis of the rongorongo corpus appears to require a mixed-methods approach borrowing
from corpus linguistics, including word-frequency distribution analysis, collocations, and KWIC
concordances (see Biber et al., 1998; Gries, 2009; Baayen, 2008); lexicostatistics involving
spectrum frequencies and Vocabulary Growth Curves (Baroni, and Evert, 2006; Baayen, 2001;
2008); combined with authorship-attribution studies (see Mosteller and Wallace, 1964; Baayen,
2008; Stamatatos et al., 2009). These studies have been successful in classifying texts of unknown
authorship against a list of known texts, one example being the study of the 'Federalist papers'
(Mosteller and Wallace, 1964). However, despite rigorous feature-selection criteria, the choice of
the number of features is still left to the judgement of the researcher. Consequently, one of the most
attractive properties of applying LSA is that it requires high dimensional data-sets representative of
'global knowledge' (Landauer, et al., 1997).
An understanding of the structural features, context, or genre of the rongorongo texts will provide
an insight into the use of glyphs and is therefore paramount in revealing the underlying mechanisms
that were exploited for creating inscriptions with such allographic diversity, yet a relative amount
of standardisation (Horley, 2005; 2007a; 2009).
Tests for lexical richness hint at the nature of a text. The correlation between texts with large
numbers of delimiters (Horley, 2007a) and those without shows that these list-like texts are
lexically richer and more likely to be lists of content words, owing to the low frequency of their
glyphs, which typically fall among the hapax legomena. Analysing Vocabulary Growth Curves (VGC)
provides some insight into how the texts can be segmented, in order to compensate for any skew in
statistical measures caused by the presence of possible sub-texts.
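As a rough illustration of what a Vocabulary Growth Curve records, the sketch below tracks, for
each number of tokens read (N), the number of distinct types (V) and of hapax legomena (V1). It is
a minimal sketch under stated assumptions, not the procedure used in the study: the function name
`vocabulary_growth` and the two toy sequences are invented for illustration.

```python
def vocabulary_growth(tokens, step=1):
    """Return (N, V, V1) triples: tokens read, types seen, hapax legomena so far."""
    freq = {}
    hapax = 0
    curve = []
    for n, tok in enumerate(tokens, start=1):
        freq[tok] = freq.get(tok, 0) + 1
        if freq[tok] == 1:
            hapax += 1      # first sighting: a new hapax
        elif freq[tok] == 2:
            hapax -= 1      # second sighting: no longer a hapax
        if n % step == 0 or n == len(tokens):
            curve.append((n, len(freq), hapax))
    return curve

# Toy sequences, invented for illustration only
list_like = "a b c d e f g h".split()   # every symbol new: V grows with N
formulaic = "a b a b a c a b".split()   # repeated formula: V flattens early
list_curve = vocabulary_growth(list_like)
formulaic_curve = vocabulary_growth(formulaic)
```

On these toy inputs the list-like sequence keeps adding types and hapaxes at every token, while the
formulaic one levels off. A sudden change of slope within a single text would be the kind of signal
that might suggest a boundary between sub-texts.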
The most successful of the multivariate methods would appear to be Latent semantic analysis, as a
result of its underlying model, which is not dependent on frequency data in the way that other
analyses (factor analysis, relative frequency counts) are. Each form of analysis has its
advantages and shortcomings; however, they all reveal a small piece of the puzzle, or a different
view of the data. Each method is used to validate the results of previous methods, or to provide
model versions of raw data.
The groupings of tablets would appear to be broadly divided into two main groups, with one
designated 'prose', and the other 'lists' of personal names, places, dates or other possible
inventory-style items. These terms are applied generally, however, and do not represent a final
hypothesis concerning the literary genre of the tablets, merely that they share a common theme or
contents denoted by the strength of their correlation. Nevertheless, the grouping does provide
statistical support for the conclusions drawn by previous studies (Barthel, 1958; Butinov and
Knorozov, 1957; Fedorova, 1978; Guy, 1982; 1985; 1990; 2006; Fischer, 1997; Davletshin, 2002;
Horley, 2005; 2007a; 2007b; Pozdniakov, 1996; Pozdniakov and Pozdniakov, 2007; Melka 2008; 2009b;
2010). If these groups show that structural variation is restricted to a specific group of tablets,
then it may be concluded that this is a result of literary genre, or that the tablets are
attributable to different scribal schools (Routledge, 1919), which would help to identify the
mechanisms behind the alloglyphs evident from parallel passages.
In addition, the language corpus provided some evidence for the application of VGCs as a possible
tool for segmentation. The LSA shows the presence of sub-groups, and how they are related to the
broader categories of 'chant' or 'public address', and 'prose'. Although more work would need to be
done, a comparative analysis of rongorongo and Rapanui oral tradition is expected to identify some
fragment or passage of text that reflects the structures assigned to the glyphs, which, on reviewing
previous research (Barthel, 1958; Fedorova, 1978; Davletshin, 2002; Horley, 2007a; Pozdniakov,
and Pozdniakov, 2007; Melka, 2010), 'would seem entirely justified' (Fischer, 1997: 304). Fischer
(1997:305) proposes, however, that language texts used in comparative structural analysis should
rely on pre-missionary texts from before 1866, as they are the most likely to parallel the
rongorongo inscriptions. As the majority of texts in the language corpus of this study date from
before 1853, some being of ancient origin (particularly Grey, 1853), it is hoped that proper
pre-processing and segmentation will improve the current results.
The study evaluated a series of alternative data-sets. For the moment most scholars agree that
Barthel's (1958) corpus is sufficient, though Guy (1985; 1990; 2006) highlights a number of
inconsistencies, and both Horley (2007a) and Pozdniakov and Pozdniakov (2007) choose to
pre-process the Barthel (1958) corpus, removing particular alpha-codes. More importantly, there is no
current alternative, although Horley (2005) proposes a new scheme based on the compositional
elements of the glyphs, reducing the sign list to 50 elementary signs.
For statistical purposes it appears that DD4 provides better results, as it identifies all instances of a
glyph in the corpus, simplifying the programming. The question of whether a glyph like 700 (fish) is
the same as 700y (an upside-down fish) will only be settled once further steps have been taken
towards the decipherment. It is possible to remove many of the Barthel (1958) alpha-codes, though
'f', representing 'feathered' glyphs (i.e. those with a series of lines emanating from their
contour), is problematic (Paul Horley, 2010, personal communication).
To summarise, this paper has attempted to address one problem in rongorongo studies, that of
literary genre. Once it is established which tablets are likely to contain similar inscriptions, this
will provide some contextualisation for the study of the glyphs and the motivation for their
presence on specific tablets.
Lastly, this paper has benefited from the application of the R programming language for statistics,
which provides numerous functions that can be incorporated into programs and applied to corpora.
In addition, as R is free, there are no licensing costs, in contrast to packages such as SPSS, where
an annual licence fee is required. It is hoped therefore that this paper also shows what applied
linguists can achieve with a little computer programming knowledge (see Biber et al., 1998; Evert and
Baroni, 2006; Baayen, 2008; Johnson, 2008; Gries, 2009; Wild, 2009 for more on these methods).
6.2 Further research
From the correspondence analysis, we see that preprocessing of the texts and tagging of
grammatical constructions, such as the dative, active, and passive, would be beneficial to the study.
In addition, the rongorongo corpus needs further segmentation based on entropy (Rao et al.,
2009), or on the constant proposed by Golcher (2007), which segments morphemes using
unsupervised, language-independent methods based on frequency counts of substrings
(character-level n-grams, see Golcher, 2007:1). The underlying assumption is that a larger unit is
composed of smaller units commonly occurring together. Consequently, the text is segmented at
the points where the predictability of the following character falls (Golcher, 2007:2).
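The assumption that boundaries fall where the next character becomes hard to predict can be
illustrated with a deliberately simplified sketch. It is not Golcher's (2007) constant nor the
entropy measure of Rao et al. (2009): it substitutes a plain bigram estimate of P(next | current)
and a fixed cut-off. The function names, the threshold of 0.7, and the toy string are all
assumptions made for illustration.

```python
from collections import Counter

def transition_probs(sequences):
    """Estimate P(next | current) from bigram counts over all sequences."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            unigrams[a] += 1          # counts of `a` as the first member of a pair
            bigrams[(a, b)] += 1
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

def segment(seq, probs, threshold=0.5):
    """Cut `seq` wherever the next symbol is poorly predicted by the current one."""
    if not seq:
        return []
    units, current = [], [seq[0]]
    for a, b in zip(seq, seq[1:]):
        if probs.get((a, b), 0.0) < threshold:
            units.append(current)     # predictability falls: close the unit
            current = []
        current.append(b)
    units.append(current)
    return units

# Toy corpus built from two recurring units, "ab" and "xy" (invented for illustration)
corpus = [list("abxyabababxyxyab")]
probs = transition_probs(corpus)
units = segment(list("abxyab"), probs, threshold=0.7)
```

Within a unit the transition is fully predictable in this toy corpus (`a` is always followed by
`b`), whereas across units several continuations compete, so the cuts fall between units. On real
inscriptions the cut-off would need to be justified empirically rather than fixed by hand.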
The level of orthography is still undetermined for rongorongo. Calculating the sample
correlation-coefficients for Chinese texts and glyphs, Penn (2006) demonstrates how the degree of
logography and phonography can be determined by a 3D pairs plot. The level of logography and
phonography present in a particular writing system can be used as a tool to 'classify entirely
unknown writing systems to assist in attempts at archaeological decipherment' (Penn, 2006:1). Penn
addresses this issue through an analysis of Chinese Hànzi and English, calculating the sample
correlation-coefficients of a corpus of text from a term-document matrix similar to that used in
LSA. Although still in its early stages, this would appear to be a promising method for assessing
the next issue in rongorongo research, the Orthographically Relevant Level (ORL) (Sproat, 2000).
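The calculation of sample correlation-coefficients between terms can be sketched as follows. This
is a minimal illustration using raw counts and Pearson's r between the rows of a term-document
matrix; it omits any weighting or dimension reduction, and it should not be read as Penn's (2006)
exact procedure. The function names and the four toy 'documents' are assumptions made for
illustration.

```python
from math import sqrt

def term_document_matrix(docs):
    """Rows are terms (sorted), columns are documents, cells are raw counts."""
    terms = sorted({t for d in docs for t in d})
    return terms, [[d.count(t) for d in docs] for t in terms]

def pearson(x, y):
    """Sample correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def term_correlations(docs):
    """Correlation for every unordered pair of terms, keyed by (term_a, term_b)."""
    terms, rows = term_document_matrix(docs)
    pairs = list(zip(terms, rows))
    return {(a, b): pearson(ra, rb)
            for i, (a, ra) in enumerate(pairs)
            for b, rb in pairs[i + 1:]}

# Toy 'documents' of glyph codes, invented for illustration only
docs = [["A1", "G41", "A1"], ["A1", "G41"], ["N36"], ["N36", "D54"]]
corr = term_correlations(docs)
```

Terms that tend to occur in the same documents receive a coefficient near 1, and terms that avoid
one another a negative coefficient; the full set of pairwise values is the kind of surface that a
3D perspective plot would display.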
Is rongorongo syllabic, logographic, or a mixture of the two, plus semantic categories and phonetic
complements? The perspective plots (figure 26) show a 3D pairs plot with the strength of
correlation between glyphs represented by the height of each peak for Egyptian (as per previous
LSA), Cuneiform (35 texts from the Laws of Hammurabi), rongorongo, and a number of English
texts (46 in total, including Alice in Wonderland, Huckleberry Finn, Dracula, Pride and Prejudice,
and Sherlock Holmes).
There does seem to be a difference between writing systems like Egyptian and Cuneiform on the one
hand, and English and rongorongo on the other. A logographic writing system may display properties
similar to the Egyptian plot, where glyphs are used for a specific text, i.e. a funerary formula,
which will mean less 'semantic clumpiness' (see Penn, 2006:2), shown by the absence of any peaks in
the rest of the Egyptian plot. However, the depth of the chart shows that there are quite a few
similarities in the usage of glyphs (i.e. syntax particles or phonetic and semantic classifiers).
The plot for Cuneiform, on the other hand, shows more correlation between the individual
constituents, indicated by the depth of the correlation (filling the whole 3D cube). This may,
however, be due to the repeated formula in the Laws of Hammurabi, i.e. ‘If a man does X to Y, he
will receive Z punishment’, which causes each section to be quite similar in structure; the
sections are also considered to belong to one text genre (list of laws). English and rongorongo
show an interesting parallel in the distribution of the peaks, though the plot is slightly deeper
for English, showing that more of the documents share common vocabulary.
Figure 26. Sample correlation-coefficients for English, Egyptian, Rongorongo, and Cuneiform (four
panels, one per writing system).
These are only preliminary results, and more comparisons need to be made between writing systems
of the world, both modern and ancient. It is evident from these plots that there is an
interesting difference between these scripts, but whether this difference is quantifiable, and can
be shown to assign the correct ORL, needs to be determined by a more extensive study. In particular,
parameters for the representativeness and size of the corpora and transliteration schemes need to be
assessed in order to ensure that results are consistent, and do not lead to incorrect assumptions.
Returning to the previous discussion, LSA can be applied to glyphs to quantify glyph behaviour,
though it requires more testing, as this paper may be the first to apply LSA to ancient writing
systems and decipherment. However, the study shows that by following the semantic correlation
between glyphs, it is possible to adopt a hierarchical approach similar to syntax trees and the notion
of binding to discover which glyphs 'govern' others (Chomsky, 1981a), and which reveal an
anaphoric relationship to other glyphs in the corpus.
Markov models, used in Natural Language Processing, speech recognition, and work on deciphering
the Indus Valley script (see Rao et al., 2009), could produce model glyph collocations, or predict
the value of damaged glyphs and the semantic relationships that might have been observed had the
corpus of surviving inscriptions been larger. In addition, cryptographic methods such as the
transposition cypher can be trained with the same data presented in this paper, i.e. counts of
features, correlations, typical collocations, and KWIC concordance data. This method can produce
statistically based guesses of glyph values, which, although likely to generate incorrect strings,
may provide correct 'guesses' or some avenues for further study.
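How a Markov model might propose a value for a damaged glyph can be sketched with a first-order
(bigram) model: each candidate x for the gap in 'prev ? next' is scored by P(x | prev) * P(next | x).
This is a minimal sketch under stated assumptions, not the model of Rao et al. (2009); the function
names, the absence of smoothing, and the toy sequences of glyph codes are all invented for
illustration.

```python
from collections import Counter, defaultdict

def train(sequences):
    """Count first-order transitions: counts[a][b] = number of times b follows a."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def prob(counts, a, b):
    """Maximum-likelihood estimate of P(b | a); 0.0 if a was never seen."""
    followers = counts.get(a, Counter())
    total = sum(followers.values())
    return followers[b] / total if total else 0.0

def rank_fillers(counts, prev, nxt):
    """Rank candidates x for the gap in 'prev ? nxt' by P(x | prev) * P(nxt | x)."""
    scores = {}
    for x in counts.get(prev, Counter()):
        score = prob(counts, prev, x) * prob(counts, x, nxt)
        if score > 0:
            scores[x] = score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy training sequences, invented for illustration only
sequences = [
    ["n", "U19", "nw", "W"],
    ["n", "U19", "nw", "D6"],
    ["n", "A2", "nw", "W"],
    ["r", "U19", "W"],
]
counts = train(sequences)
ranked = rank_fillers(counts, "n", "nw")
```

With so small a corpus as rongorongo, unsmoothed estimates of this kind would assign zero to any
unseen combination, so a practical model would need smoothing and, probably, longer contexts.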
A final remark is that each study submitted to the ever-growing pool of research on rongorongo and
writing systems is a valid contribution, and, when supplemented with statistical analysis as part of
a mixed-methods approach, will result in more robust conclusions. Corpus linguistic methods offer a
good basis on which to build more advanced approaches using language modelling techniques, or to
provide support to palaeographic analysis. It is hoped that, with the recent increase in the pace of
rongorongo research (Horley, 2009; 2010; Melka, 2009a; 2009b; 2010; Wieczorek, 2010), we
will see a decipherment in the not too distant future.
Bibliography
Aiolli, F., M. Simi, D. Sona, A. Sperduti, A. Starita, and G. Zaccagnini. (1999). 'SPI: a System for
Palaeographic Inspections'. AIIA Notizie. URL: http://www.dsi.unifi.it/AIIA/. 4:34-38. Accessed:
19/09/2008.
Altman, A. (2004). ‘Early Visitors to Easter Island 1864-1877 (translations of the accounts of
Eugène Eyraud, Hippolyte Roussel, Pierre Loti and Alphonse Pinart; with an Introduction by
Georgia Lee)’. Los Osos, CA: Easter Island Foundation.
Baayen, H. (1994). 'Derivational Productivity and Text Typology'. Journal of Quantitative
Linguistics. 1:16-34.
Baayen, H., Halteren, H., and Tweedie, F.(1996) 'Outside the cave of shadows: Using syntactic
annotation to enhance authorship-attribution'. Literary and Linguistic Computing. 11:121-131.
Baayen, H. (2001). 'Word frequency distributions'. Kluwer Academic Publishers.
Baayen, H., Halteren, H., Neijt, A., and Tweedie, F. (2002). 'An experiment in
authorship-attribution'. 6es Journées internationales d'Analyse statistique des Données Textuelles.
Baayen, H. (2008). 'Analyzing Linguistic Data: A practical introduction to statistics using R'.
Cambridge University Press.
Barthel, T. (1958). ‘Grundlagen zur Entzifferung der Osterinselschrift (Bases for the Decipherment
of the Easter Island Script)’. Hamburg: Cram, de Gruyter.
Barthel, T. (1974). 'The Eighth Land – The Polynesian discovery and settlement of Easter Island',
Honolulu Press of Hawaii. Translated from the German by Anneliese Martin.
Baroni, M. (2006). ‘Counting Words: An Introduction to Lexical Statistics’. The 18th European
Summer School in Logic, Language and Information, Málaga, Spain.
Baroni, M. (2009). ‘Distributions in text’. In Anke Lüdeling and Merja Kytö (eds.), Corpus
linguistics: An international handbook, 2. Berlin: Mouton de Gruyter:803-821.
Berthin, G., and Berthin, M. (2006). 'Astronomical Utility and Poetic Metaphor in the rongorongo
Lunar Calendar'. Applied Semiotics. 8:18. 85-98.
Biber, D., Conrad, S., and Reppen, R. (1998). ‘Corpus Linguistics: Investigating language structure
and use’. Cambridge University Press.
Blixen, O. (1979). 'Figuras de hilo tradicionales de las Isla de Pascua'. Moana: Estudios de Antropología Oceánia. 2:1. 1-106.
Bonfante, G., Bonfante, L. (2002). 'The Etruscan Language: An Introduction (2nd Edition)'.
Manchester University Press.
Butinov, N and Knorozov, Y. (1957). ‘Preliminary Report on the Study of the Written Language of
Easter Island.’, Journal of the Polynesian Society. 66:1. 5-17
Campbell, R. (1971). 'La Herencia Musical de Rapanui: Etnomusicología de la Isla de Pascua'.
Santiago de Chile: Andrés Bello.
Chomsky, N. (1981a). 'Lectures on Government and Binding'. Dordrecht: Foris.
Ciula, A. (2005). 'Digital palaeography: Using the digital representation of medieval script to
support palaeographic analysis'. Digital Medievalist. 1.
Coulmas, F. (1991). 'The Writing Systems of the World (The Language Library)'. Basil Blackwell:
Oxford.
Coulmas, F. (1996). 'Encyclopedia of Writing Systems'. Blackwell Publishing.
Coulmas, F. (2003). 'Writing Systems: An Introduction to their Linguistic Analysis'. Cambridge
University Press.
Damerow, P. (2006). 'The Origins of Writing as a Problem of Historical Epistemology'. Cuneiform
Digital Library Journal. 1. Available at http://cdli.ucla.edu/pubs/cdlj/2006/cdlj2006_001.html.
Accessed: 4/3/2010.
Davletshin, A. (2002). 'Names in the Kohau Rongorongo Script'. Paper presented as From Kohau
Rongorongo Tablets to Rapanui Social Organization at the 2nd International Conference 'Hierarchy
and Power in the History of Civilizations'. Saint Petersburg, Russia, July 4-7.
DeCoster, J. (1998). 'Overview of Factor Analysis'. URL: http://www.stathelp.com/notes.html,
Accessed: 10/2/2010
De Feu, V. (1996). 'Rapanui'. Routledge, London.
Dong, Q; Wang, X; Lin, L. (2006). 'Application of Latent Semantic Analysis to Protein Remote
Homology Detection'. Bioinformatics. 22(3):285-290.
De Hevesy, G. (1932). 'Écriture de l'Ile de Pâques'. Bulletin de la Société des Américanistes de
Belgique. 9:120-7
Dörnyei, Z. (2007). 'Research methods in Applied Linguistics'. Oxford University Press.
Elbert, S. (1941). 'Chants and love songs of the Marquesas islands, French Oceania'. Journal of the
Polynesian Society . 50(198):53-91.
Elbert, S. (1982). 'Lexical diffusion in Polynesia and the Marquesan-Hawaiian relationship'.
Journal of the Polynesian Society. 91(4):499-518
Emory, K. (1968). 'Review of Reports of the Norwegian Archaeological Expedition to Easter Island
and the East Pacific'. (2) Miscellaneous Papers by Thor Heyerdahl and Edwin N Ferdon, Jr., Eds.
American Anthropologist, 70:152-154.
Englert, S. (1970). ‘Island at the Center of the World'. Translated and Edited by William Mulloy.
New York: Charles Scribner's Sons.
Evert, S., and Baroni, M. (2005). 'Testing the extrapolation quality of word frequency models'. Proceedings of Corpus Linguistics 2005, URL http://www.corpus.bham.ac.uk/PCLC/. Accessed
12/02/2009.
Evert, S., and Baroni, M. (2006). 'The ZipfR library: Words and other rare events in R'. Presentation
at useR! 2006: The Second R User Conference, Vienna, Austria.
Evert, S., and Baroni, M. (2007). ‘ZipfR: Word distributions in R’. Proceedings of the ACL 2007
Demo and Poster Sessions. 29–32, Prague, June 2007. Association for Computational Linguistics.
Facchetti, G. (2002). Antropologia della Scrittura: Con un' Appendice sulla Questione del
Rongorongo dell' Isola di Pasqua. Milano: Arcipelago Edizioni.
Faulkner, R. (1988). 'A Concise Dictionary of Middle Egyptian'. Griffith Institute, Oxford.
Fedorova, I. (1978). 'Mify, predaniya i legendy ostrova Paskhi'. Moscow: Nauka.
Felbermayer, F. (1971). 'Sagen und Überlieferungen der Osterinsel'. Darmstadt: Verlag Hans Carl
Nürnberg.
Fischer, S. (1997a). 'Glyph-Breaker'. New York.
Fischer, S. (1997b). 'Rongorongo: The Easter Island script, History, Traditions, Texts', Clarendon
Press: Oxford.
Fischer, S. (1998). ‘Reading Rapanui's rongorongo.’ In C. M. Stevenson, G. Lee, & F. J. Morin
(Eds.), Easter Island in Pacific Context: South Seas Symposium, Proceedings of the Fourth
International Conference on Easter Island and East Polynesia, University of New Mexico,
Albuquerque, 5–10 August 1997. 3–7. Los Osos: The Easter Island Foundation.
Foltz, P.W., Kintsch, W., & Landauer, T.K. (1998). 'The measurement of textual coherence with
Latent Semantic Analysis'. Discourse Processes. 25. 285-307.
Gardiner, A. (2005). 'Egyptian grammar'. Third Edition. Published on behalf of the Griffith
Institute, Ashmolean Museum, Oxford, by Oxford University Press, London
Golcher, F. (2007). ‘A stable statistical constant specific for human language texts’. In Recent
Advances in Natural Language Processing 2007 (RANLP-07), to appear. Available at:
http://amor.rz.hu-berlin.de/~golcherf/ranlp.pdf Accessed: 24/02/2010.
Greenhill, S., Blust. R, and Gray, R. (2008). 'The Austronesian Basic vocabulary Database: From
Bioinformatics to Lexomics'. Evolutionary Bioinformatics, 4:271-283. Available at:
http://language.psy.auckland.ac.nz/austronesian/ Accessed: 06/12/2009.
Grey, G. (1853). 'Ko Nga Moteatea, Me Nga Hakirara O Nga Maori '. Robert Stokes, Wellington.
Gries, S. (2009). 'Quantitative Corpus Linguistics with R: A Practical Introduction'. Routledge.
Guy, J. (1982). 'Fused Glyphs in the Easter Island Script'. Journal of the Polynesian Society.
91:445–447.
Guy, J. (1985). 'On a Fragment of the ‘‘Tahua’’ Tablet'. Journal of the Polynesian Society. 94:367–
387.
Guy, J. (1990). 'On the Lunar Calendar of Tablet Mamari'. Journal de la Societe des Oceanistes.
91(2):135–149.
Guy, J. (2006). ‘General properties of the rongorongo Writing’. The Rapanui Journal. 20:1, May.
Hyman, M. (2006). 'Of glyphs and glottography.' Language & Communication. 26,3-4. 231-249.
Holmes, D. (1992). ‘A Stylometric Analysis of Mormon Scripture and Related Texts’. Journal of
the Royal Statistical Society. Series A (Statistics in Society), 155:1. 91-120.
Horley, P. (2005). ‘Allographic variations and statistical analysis of the rongorongo script’.
Rapanui Journal. 19:2.
Horley, P. (2007a). ‘Structural Analysis of rongorongo Inscriptions’. Rapanui Journal. 21(1). 25-32.
Horley, P. (2007b) Presentation at the VII International Conference on Easter Island. Gotland
University: Sweden. 20-25th August 2007.
Horley, P. (2009). 'Rongorongo Script: Carving Techniques and Scribal Corrections', Le Journal de
la Société des Océanistes. 129. Juillet-Décembre
Horley, P. (2010). 'Rongorongo Tablet Keiti'. Rapanui Journal. 24:1. 45-56
Jaussen, T. (1893). ‘L’île de Pâsques. Historique et Écriture'. Bulletin de Geographie, Historique et
Descriptive 2.
Johnson, K. (2008). 'Quantitative methods in Linguistics'. Blackwell Publishing Ltd.
Karena-Holmes, D. (1997). Māori language : understanding the grammar. Auckland: Reed
Publishing (NZ)
Kudrjavtsev, B. (1949). 'The Writing of Easter Island'. Compilation of the Museum of Anthropology
and Ethnography 11. Saint Petersburg. 175–221.
Landauer, T., and Dumais, S. T. (1997). 'A Solution to Plato’s Problem: The Latent Semantic
Analysis Theory of Acquisition, Induction, and Representation of Knowledge'. Psychological
Review. 104(2). 211-240.
Landauer, T. (2002). 'Applications of Latent Semantic Analysis'. Paper presented at the 24th
Annual Meeting of the Cognitive Science Society. August.
Landauer, T., Laham, D., and Derr, M. (2004). 'From paragraph to graph: Latent semantic analysis
for information visualization'. Proceedings of the National Academy of Sciences. 101. 5214-5219.
Landauer, T., McNamara, D., Dennis, S., and Kintsch, W. (2007). 'Handbook of Latent Semantic
Analysis (University of Colorado Institute of Cognitive Science)'. Psychology Press 1st edition.
Lee, G. (1992). 'Rock Art of Easter Island'. Monumenta Archaeologica 17. Los Angeles: UCLA
Institute of Archaeology.
Manning, C., Raghavan, P., and Schutze, H. (2008). 'Introduction to Information Retrieval'.
Cambridge University Press.
Maunsell, R. (1862). 'Grammar of the New Zealand Language'. W. C. Wilson, Auckland.
McLaughlin, S. (2004). 'Rongorongo and the rock art of Easter Island'. Rapanui Journal. 18:87-94
Métraux, A. (1940). ‘Ethnology of Easter Island'. Bernice P. Bishop Museum Bulletin 160.
Melka, T. (2008). ‘Structural Observations Regarding rongorongo Tablet Keiti’, Cryptologia. 32.
155-179. Taylor and Francis group, LLC.
Melka, T. (2009a). ‘The Corpus Problem in the rongorongo Studies’. Glottotheory. 1: 11-136.
Melka, T. (2009b). ‘Linearity, Calligraphy and Syntax in the rongorongo script’. Glottotheory. 2(2).
70-96.
Melka, T. (2010). 'On Some Examined Features of Rongorongo: Tablet Mamari', Oxford Journal of
Writing Systems, Oxford.
Mosteller, F and Wallace, D. (1964), ‘Inference and Disputed Authorship - The Federalist’. CSLI
Publications.
Penn, G. (2006). 'Quantitative methods for classifying writing systems'. Proceedings of the 18th
International Congress of Linguists (CIL-18), 2. 175-176.
Peng, R and Hengartner, N. (2002). ‘Quantitative Analysis of Literary Styles’. The American
Statistician. 5:3. 175-185
Peng, F. Schuurmans, D. Keselj, V. Wang, S. (2003). 'Language Independent Authorship-attribution
using Character Level Language Models'. In Proceedings of 10th Conference of the European
Chapter of the Association for Computational Linguistics (EACL 2003). 267-274, April 12-17,
2003, Budapest, Hungary.
Pino, J. and Eskenazi, M. (2009). 'An application of latent semantic analysis to word sense
discrimination for words with related and unrelated meanings'. In Proc. of the 4th Workshop
on Innovative Use of NLP for Building Educational Applications.
Pozdniakov, K. (1996) 'Les Bases du Dechiffrement de l'Ecriture de l'Ile de Paques'. Journal de la
Societe des Oceanistes, 103:2. 289-303.
Pozdniakov, K and Pozdniakov, I (2007). ‘Rapanui Writing and the Rapanui Language: Preliminary
Results of a Statistical Analysis’. Forum for Anthropology and Culture 3. 3–36. Source:
http://pozdniakov.free.fr/1620Easter%20Island%20english.pdf. Accessed: 6/8/2009.
Ray, S. (1932). ‘Note on Inscribed Tablets from Easter Island’. Royal Anthropological Institute of
Great Britain and Ireland, Man 32. 153-155
Rao, R., Yadav, N., Vahia, M., Joglekar, H., Adhikari, R., Mahadevan, I. (2009). ‘Statistical analysis
of the Indus script using n-grams’. PLoS ONE 5(3): e9506.
Robinson, A. (2002). 'Lost Languages'. BCA: Great Britain.
Rosmorduc, S. (1997). SETH. http://www.iut.univparis8.fr/~rosmord/hieroglyphes/hieroglyphes.html.
Accessed: 20/03/2010.
Routledge, K (1919). ‘The Mystery of Easter Island: The story of an expedition'. London and
Aylesbury: Hazell, Watson and Viney.
Sproat, R. (2000). 'A Computational theory of Writing Systems'. Cambridge University Press:
Studies in Natural Language Processing.
Sproat, R. (2003). 'Approximate String matches in the rongorongo Corpus'.
http://www.cslu.ogi.edu/~sproatr/ror/.
Stamatatos, E., Fakotakis, N., and Kokkinakis, G. (2000). ‘Text Genre Detection Using Common
Word Frequencies’, In Proc. of the 18th Int. Conf. on Computational Linguistics (COLING2000).
808-814
Stamatatos, E. (2009). ‘A Survey of Modern Authorship-attribution Methods’, Journal of the
American Society for Information Science and Technology. 60:3. 538-556.
Summers, D. (1993). 'Longman/Lancaster English Language Corpus Criteria and Design'.
International Journal of Lexicography, 6:3.
Thomson, W. J. (1889). 'Te Pito te Henua or Easter Island: U.S. National Museum', Annual report:
447-552.
Tomatsu, R. (2006). ‘A Computational Analysis of Literary Style: Comparison of Kawabata
Yasunari and Mishima Yukio’. Second Annual Rhizomes: Re-Visioning Boundaries Conference, The
University of Queensland, Brisbane. Available at:
http://espace.library.uq.edu.au/eserv/UQ:7704/rt_rhiz.pdf) Accessed: 05/12/2009.
Tregear, E. (1891). 'The Maori-Polynesian Comparative Dictionary'. Kessinger Publishing.
Vives Solar, J. (1920). 'Rapanui: Cuentos Pascuences'. Santiago de Chile: Imprenta Universitaria,
Estado 63.
Wieczorek, R. (2010 in press). 'Astronomical Content in Rongorongo Tablet Keiti'. Le Journal de
la Société des Océanistes.
Wild, F. (2009). 'LSA: Latent Semantic Analysis'. Open University. LSA package (Version 0.63-1)
for the R programming language for statistical computing.
http://cran.r-project.org/web/packages/LSA/index.html. Accessed: 16/03/2010.
Zipf, G. (1949). 'Human Behavior and the principle of Least-Effort'. Addison-Wesley.
Appendix
Presented here are photographs of tablets Tahua (A), Aruku-kurenga (B), and Mamari (C). Photographed by Ilaria
Rovera (September, 2008). (Not to be reproduced in any form without prior permission from Fr. Jean Louis Schuester,
Congregation of the Sacred Hearts, Rome).
[The images are removed for the online version to prevent commercial use – please request the appendix from the
author]