Corpus Linguistics

A study of language that covers all processes related to the processing, use and analysis of written or spoken machine-readable corpora. Corpus linguistics is a relatively modern term for a methodology based on examples of ‘real life’ language use. At present, the effectiveness and usefulness of corpus linguistics are closely tied to the development of computer science. See McEnery and Wilson 1996; Aarts and Meijs 1990; Leech 1991; Svartvik 1992.

Corpus Processing

A general term used to refer to all processes related to annotation, presentation and analysis of corpora. See Aarts and Meijs 1990; McEnery and Wilson 1996: Ch. 2.

Alignment

A term used to refer to the practice of defining explicit links between texts in a parallel corpus. Alignment links the elements (sentences, phrases or words) that are mutual translations of each other in a parallel corpus. Sentence and word alignment may be performed automatically with a high degree of accuracy; a program that performs this operation is called an aligner. See McEnery and Oakes 1996; McEnery and Wilson 1996: Ch. 2.
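One way to picture the "explicit links" in this entry is as pairs of sentence indices. The sketch below is only an illustration: the sample sentences, the `links` list and the `aligned_pairs` helper are all assumptions, and it shows how links might be stored and read back, not how an aligner finds them.

```python
# Toy parallel corpus (invented sentences): English and French versions.
english = ["The cat sleeps.", "It is raining.", "We are leaving."]
french = ["Le chat dort.", "Il pleut.", "Nous partons."]

# Explicit alignment links as (source index, target index) pairs.
# These hypothetical links are all 1:1; one sentence may also map to
# several sentences, or to none, in a real parallel corpus.
links = [(0, 0), (1, 1), (2, 2)]


def aligned_pairs(source, target, links):
    """Return the sentence pairs that the links mark as mutual translations."""
    return [(source[i], target[j]) for i, j in links]


for src, tgt in aligned_pairs(english, french, links):
    print(f"{src}  <->  {tgt}")
```

Keeping the links separate from the texts means that neither text has to be modified in order to be aligned.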

Annotation

A term used to refer to (i) the practice of adding explicit additional information to machine-readable text; (ii) the physical representation of such information. Annotation (or markup) makes it quicker and easier to retrieve and analyse information about the language contained in the corpus. A corpus may be annotated manually, by a single person or by a number of people; alternatively, the annotation may be carried out by a computer program, either completely automatically or semi-automatically (in the latter case the output needs to be post-edited by human beings). Certain kinds of linguistic annotation, which involve the attachment of special codes to words in order to indicate particular features, are frequently known as tagging rather than annotation, and the codes which are assigned are known as tags. See McEnery and Wilson 1996: Ch. 2; Leech 1993; Aarts and Meijs 1990; Brill 1992; Källgren 1996; Leech and Wilson 1994.

anaphoric annotation A form of annotation that marks pronoun reference in corpora. Anaphoric annotation can only be carried out by human analysts, since one of the aims of the annotation is to provide the data on which to train computer programs to carry out this task (see bootstrapping). It is of great importance to NLP, since a large amount of the conceptual content of a text is carried by pronouns. See McEnery and Wilson 1996: Ch. 2; Halliday and Hasan 1976; Garside 1993.

discoursal annotation A type of annotation used to mark items whose role in the discourse is primarily to do with discourse management (e.g. politeness, level of formality) rather than with propositional content. Discoursal annotation has never become widely used in corpus linguistics, since identifying such items in texts is difficult and a major source of dispute among linguists. See McEnery and Wilson 1996: Ch. 2; Aone and Bennet 1994; Stenström 1984.

ditto tagging, ditto tag A term used to refer to the practice of assigning the same tag to each word in an idiomatic sequence to indicate that they belong to a single phraseological unit. See McEnery and Wilson 1996: Ch. 1; Garside 1987.

part-of-speech tagging The most basic type of linguistic corpus annotation (also known as grammatical tagging, morphosyntactic annotation or part-of-speech annotation); its aim is to assign to each lexical unit in the text a code (or tag) indicating its part-of-speech (e.g. singular common noun - NN, past participle - VBN). Part-of-speech information is a fundamental basis for increasing the specificity of data retrieval from corpora and also forms an essential foundation for further forms of analysis such as syntactic parsing and semantic field annotation. See McEnery and Wilson 1996: Ch. 2; Leech and Wilson 1994; Garside 1987; Brill 1992.

phonetic transcription A form of phonetic annotation that is used to transcribe spoken corpora. Few fully phonetically transcribed corpora are publicly available at the present time. Much phonetic annotation exists at the level of prosodic annotation. Phonetic transcription needs to be carried out by human beings rather than computer programs, and moreover by human beings who are well skilled in the perception and transcription of speech sounds. See McEnery and Wilson 1996: Ch. 2.
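As a rough illustration of part-of-speech tagging as described above, the sketch below attaches one tag to every token by lexicon lookup. The mini-lexicon, the `tag` and `render` helpers, the fallback to NN for unknown words and the word_TAG output layout are all assumptions made for the example. Only the tag names NN and VBN come from the entry, and real taggers also resolve ambiguity from context.

```python
# Hypothetical mini-lexicon; NN and VBN are the tags named in the entry,
# AT and BEDZ are illustrative assumptions.
LEXICON = {
    "the": "AT",
    "letter": "NN",
    "was": "BEDZ",
    "written": "VBN",
}


def tag(tokens, default="NN"):
    """Pair every token with a tag, using `default` for unknown words."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]


def render(tagged):
    """Write tagged tokens in word_TAG form, one space between tokens."""
    return " ".join(f"{tok}_{t}" for tok, t in tagged)


print(render(tag("The letter was written".split())))
# prints: The_AT letter_NN was_BEDZ written_VBN
```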

portmanteau tag A term used to refer to the practice of assigning two tags to some words in order to help the user in cases where there is a strong chance that the computer might otherwise have selected the wrong part-of-speech from the choices available to it. See McEnery and Wilson 1996: Ch. 1.

problem-oriented tagging A particular type of annotation that is used to annotate only the phenomena directly relevant to the research rather than the whole corpus or text (each word, each sentence etc.). It is not exhaustive. Problem-oriented tagging uses an annotation scheme which is selected not for its broad coverage and consensus-based theory-neutrality but for the relevance of the distinctions which it makes to the specific questions which each analyst wishes to ask of his or her data. See McEnery and Wilson 1996: Ch. 2; Haan 1984.

prosodic annotation A type of annotation that aims to capture in written form the suprasegmental features of spoken language, primarily stress, intonation and pauses. Prosodic annotation (or prosodic transcription) is a task which requires the manual involvement of highly skilled phoneticians; unlike part-of-speech analysis, it is not a task which can be delegated to the computer. See McEnery and Wilson 1996: Ch. 2; Nespor and Vogel 1990; Johansson et al. 1991; O’Connor and Arnold 1961.

recoverability A term used to refer to the possibility for the user to recover the basic original text from any text which has been annotated with further information. See McEnery and Wilson 1996: Ch. 2.
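Recoverability can be made concrete with a small sketch. It assumes, purely for illustration, an annotated text in which each tag is joined to its word by an underscore; the `strip_tags` helper is hypothetical, and other annotation layouts would need a different stripping rule.

```python
def strip_tags(annotated):
    """Recover the words by dropping the part after each token's last underscore."""
    return " ".join(token.rsplit("_", 1)[0] for token in annotated.split())


print(strip_tags("The_AT letter_NN was_BEDZ written_VBN"))
# prints: The letter was written
```

In this sketch only the word sequence is recovered; original line breaks and spacing would be restored only if the annotation scheme had recorded them.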

semantic annotation A type of annotation that is used to mark semantic relationships between items in the text (e.g. agents or patients of particular actions) or semantic features of words in a text (the annotation of word senses in one form or another). See McEnery and Wilson 1996: Ch. 2; Jansen 1990; Schmidt 1991.

tag A term used to refer to (i) a code attached to words in a text representing some feature or set of features relating to those words; (ii) in the TEI, the physical markup of an element such as a paragraph. See McEnery and Wilson 1996: Ch. 2.

tagset A term used to refer to a collection of tags in the form of a scheme for annotating corpora. See McEnery and Wilson 1996: Ch. 2; Johansson et al. 1986; Garside et al. 1987.

Concordance

A term that signifies a list of the occurrences of a particular word or sequence of words, each shown in its context. The concordance is at the centre of corpus linguistics, because it gives access to many important language patterns in texts. Concordances of major works such as the Bible and Shakespeare have been available for many years. The computer has made concordances easy to compile.

Computer-generated concordances can be very flexible; the context of a word can be selected on various criteria (for example counting the words on either side, or finding the sentence boundaries). Also, sets of examples can be ordered in various ways. See Sinclair 1991: Ch. 2; McEnery and Wilson 1996: Ch. 1; Collier 1994; Kaye 1990; Hockey and Martin 1988.

co-text A more precise term than context or verbal context used to refer to the words on either side of a selected word or phrase. See Sinclair 1991: Ch. 9.

collocate A term used to refer to the words that occur to the left and to the right of the node. See Sinclair 1991: Ch. 8; Kennedy 1991; Kjellmer 1991; Kjellmer 1990; Renouf and Sinclair 1991; Jackson 1988.

collocation A term used to refer to a combination of words that have a certain mutual expectancy, i.e. words that regularly keep company with certain other words. When a collocation occurs with a greater frequency than chance would predict, it is called a significant collocation. The usual measure of proximity is a maximum of four words intervening. The identification of patterns of word co-occurrence in textual data is particularly important in dictionary writing, natural language processing and language teaching. See Sinclair 1991: Ch. 8; Kennedy 1991; Kjellmer 1991; Kjellmer 1990; Renouf and Sinclair 1991; Jackson 1988.
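Counting collocates can be sketched as follows. The `collocates` function, its window of four words on either side of the word under study, and the sample sentence are assumptions made for illustration. The sketch only gathers raw co-occurrence counts; deciding whether a collocation is significant would require an additional statistical test, which is not shown.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


tokens = "strong tea and strong coffee but weak tea".split()
print(collocates(tokens, "tea", span=2).most_common())
```

With a window of two, "strong" is counted twice for "tea" in this invented sentence and every other neighbour once.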

KWAL An abbreviation for key word and line; a form of concordance which can allow several lines of context either side of the key word. See McEnery and Wilson 1996.

KWIC An abbreviation for key word in context; a form of concordance in which a word is given within x words of context and is normally centered down the middle of the page. See Sinclair 1991: Ch. 2; Kaye 1989.

node A term used to refer to the word or phrase in a collocation whose lexical behaviour is under examination. See Sinclair 1991: Ch. 8; Jackson 1988.

span A term used to refer to the measurement, in words, of the co-text of a word selected for study. A span of -4, +4 means that four words on either side of the node word will be taken to be its relevant verbal environment. See Sinclair 1991; Jackson 1988.
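A KWIC display with a span of -4, +4 might be produced along these lines. This is a minimal sketch: the `kwic` function, the whitespace tokenisation and the sample text are assumptions, and the left context is padded so that the key word lines up in one column.

```python
def kwic(tokens, node, span=4):
    """Return concordance lines with the key word aligned in one column."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            hits.append((left, tok, right))
    # Pad every left context to the width of the longest one.
    width = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{width}}  {key}  {right}" for left, key, right in hits]


text = "the cat sat on the mat and then the old dog sat down too"
for line in kwic(text.split(), "sat"):
    print(line)
```

Changing `span` widens or narrows the co-text shown on each side of the node.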

Text Chunking

A term used to refer to the practice of dividing sentences into non-overlapping segments on the basis of fairly superficial analysis. Text chunking is a useful preliminary step to parsing. Chunking includes identifying the non-recursive portions of noun phrases; it can also be useful for other purposes, including index term generation. See Ramshaw and Marcus; Sinclair 1991: Ch. 9.
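The identification of non-recursive noun-phrase portions can be sketched over text that has already been tagged. Everything here is an assumption made for illustration: the `np_chunks` function, the tag names DT, JJ and VBD, and the simple pattern (optional determiner, any adjectives, one or more nouns). It is a toy pattern matcher, not the method of the works cited in the entry.

```python
def np_chunks(tagged):
    """Return non-overlapping noun-phrase chunks from (word, tag) pairs."""
    chunks, current = [], []

    def flush():
        # Keep the candidate chunk only if it ends in a noun.
        if current and current[-1][1] == "NN":
            chunks.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged:
        if tag == "DT":
            flush()  # a determiner always opens a new chunk
            current.append((word, tag))
        elif tag == "JJ":
            if current and current[-1][1] == "NN":
                flush()  # an adjective after a noun opens a new chunk
            current.append((word, tag))
        elif tag == "NN":
            current.append((word, tag))
        else:
            flush()  # any other tag closes the current chunk
    flush()
    return chunks


sentence = [("the", "DT"), ("old", "JJ"), ("dog", "NN"),
            ("chased", "VBD"), ("a", "DT"), ("cat", "NN")]
print(np_chunks(sentence))  # ['the old dog', 'a cat']
```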

Disambiguation

A term used to refer to the practice of resolving ambiguity by choosing one specific analysis, or code (tag), from a variety of possibilities in corpus processing. Disambiguation may be applied at many levels, from deciding the part-of-speech of an ambiguous word (i.e. a word that may be associated with a number of different parts-of-speech) through to choosing one possible translation from many. Disambiguation may be probabilistic, i.e., carried out using statistically based methods, or rule-based, i.e., performed using rules created by drawing on a linguist’s intuitive knowledge. See McEnery and Wilson 1996: Ch. 5; Jansen 1990; Hindle 1989; DeRose 1988.
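The contrast between probabilistic and rule-based disambiguation can be illustrated with a toy part-of-speech example. The frequency counts, the tag names and the single hand-written rule below are all invented for the sketch; real systems use much larger statistical models or rule sets.

```python
# Hypothetical counts of how often an ambiguous word form carries each tag.
TAG_COUNTS = {"run": {"VB": 70, "NN": 30}}


def probabilistic(word):
    """Choose the tag seen most often with this word form."""
    counts = TAG_COUNTS[word]
    return max(counts, key=counts.get)


def rule_based(word, previous_tag):
    """Apply one hand-written rule: after a determiner (DT), prefer a noun."""
    if previous_tag == "DT" and "NN" in TAG_COUNTS[word]:
        return "NN"
    # No rule applies, so fall back to the most frequent tag.
    return probabilistic(word)


print(probabilistic("run"))     # VB
print(rule_based("run", "DT"))  # NN
```

The second call shows the rule overriding the frequency-based choice when the preceding tag makes a noun reading more plausible.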

Encoding

A term used to refer to the practice of representing textual and linguistic data (i.e. annotations, or tags) in a certain format in a corpus. The demand for extensive reusability of large text collections requires standardisation of encoding formats. A standard encoding format must provide the greatest possible generality and flexibility, i.e., accommodate all potential types of information and processing. See Bryan 1988; McEnery and Wilson 1996: Ch. 2; Ide 1996.

CES An abbreviation for Corpus Encoding Standard used to refer to a set of encoding standards developed by MULTEXT (one of the largest EU projects in the domain of language tools and resources). The CES is an application of SGML, based on and in broad agreement with the TEI Guidelines, and is optimally suited for use in corpus linguistics and language engineering applications. See Ide and Véronis 1995; Erjavec et al. 1995.

COCOA references An encoding convention named after COCOA, a very early computer program used for extracting indexes of words in context from machine-readable texts. Its conventions were carried forward into several other programs (e.g. the Oxford Concordance Program (OCP)). COCOA references represent only an informal trend for encoding specific types of textual information, for example, authors, dates, and titles. See McEnery and Wilson 1996: Ch. 2; Hockey and Martin 1988.

DTD An abbreviation for Document Type Definition used in the TEI. A TEI DTD is a formal representation which tells the user or a computer program what elements a text contains and how these elements are combined. A TEI DTD is composed of the core tagsets, a single base tagset, and any number of user-selected additional tagsets, built up according to a set of rules documented in the TEI Guidelines. See Ide 1995; McEnery and Wilson 1996: Ch. 2; Sperberg-McQueen and Burnard 1994.

EAGLES An abbreviation for Expert Advisory Groups on Language Engineering Standards, an EU sponsored project to define standards for the computational treatment (e.g. annotation) of EU languages, and also used to refer to a base set of features for the annotation of parts-of-speech. See McEnery and Wilson 1996: Ch. 2.

entity reference A term in the TEI used to refer to a shorthand way of encoding information in a text. See Sperberg-McQueen and Burnard 1994.

SGML An abbreviation for Standard Generalized Markup Language used to refer to a text encoding standard on which the TEI is based. SGML is an internationally recognized standard. SGML-aware software is widely used in corpus processing. See McEnery and Wilson 1996: Ch. 2; Erjavec 1995; Ide 1995; Goldfarb 1990; Bryan 1988.

TEI An abbreviation for Text Encoding Initiative, which signifies an international cooperative research project established in 1988 to develop a general and flexible set of guidelines for the preparation and interchange of electronic texts. TEI employs an already existing form of document markup known as SGML. The TEI’s own original contribution is a detailed set of guidelines as to how this standard is to be used in text encoding. See Ide 1995; McEnery and Wilson 1996: Ch. 2; Sperberg-McQueen and Burnard 1994.

base tagset A term used in the TEI to refer to a particular group of codes (tags) which determines the basic structure of the document with which it is to be used. Eight distinct TEI base tagsets are proposed: prose, verse, drama, transcribed speech, letters and memos, dictionary entries, terminological entries, language corpora and collections. See Ide 1995; Sperberg-McQueen and Burnard 1994.

TEI Guidelines A term used to refer to standardized conventions for the encoding and interchange of machine-readable texts. The TEI Guidelines (issued in May 1994) cover a large range of text types and features relevant for a broad range of applications, including NLP, information retrieval, hypertext, electronic publishing, various forms of literary and historical analysis, lexicography, etc. The Guidelines are intended to apply to texts, written or spoken, in any natural language, of any date, in any genre or text type, without restriction on form or content. SGML is the framework for development of the Guidelines. See Sperberg-McQueen and Burnard 1994; Ide 1995; McEnery and Wilson 1996: Ch. 2.

header A term used to refer to the part of an electronic document preceding the text proper and containing information about the document such as author, title, source and so on. See Ide 1995; McEnery and Wilson 1996: Ch. 2; Sperberg-McQueen and Burnard 1994.

WSD An abbreviation for Writing System Declaration used in the TEI to define the character set used in encoding an electronic text. See Sperberg-McQueen and Burnard 1994.

Lemmatisation

A term used to refer to the practice of reducing word forms in a corpus to their respective lexemes (the head word forms that one would look up in a dictionary). For example, the forms kicks, kicked, and kicking would all be reduced to the lexeme KICK. These variants are said to form the lemma of the lexeme KICK. Lemmatisation applies equally to morphologically irregular forms, so that went, as well as goes, going, and gone, belongs to the lemma of GO. Lemmatisation allows the researcher to extract and examine all the variants of a particular lexeme without having to input all the possible variants. (Software that performs lemmatisation is called a lemmatizer.) See McEnery and Wilson: Ch. 2; Beale 1987; Sinclair 1991: Ch. 3.
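The KICK and GO examples above can be turned into a small lookup-based sketch. The table, the `lemmatise` and `group_by_lexeme` helpers and the fallback for unknown forms (the upper-cased form itself) are assumptions made for illustration; a practical lemmatizer would rely on a full lexicon or morphological rules.

```python
from collections import defaultdict

# Hand-built table covering only the forms mentioned in the entry.
LEMMA_TABLE = {
    "kick": "KICK", "kicks": "KICK", "kicked": "KICK", "kicking": "KICK",
    "go": "GO", "goes": "GO", "going": "GO", "gone": "GO", "went": "GO",
}


def lemmatise(token):
    """Map a word form to its lexeme; unknown forms map to themselves."""
    return LEMMA_TABLE.get(token.lower(), token.upper())


def group_by_lexeme(tokens):
    """Collect every word form in the text under its lexeme."""
    groups = defaultdict(list)
    for tok in tokens:
        groups[lemmatise(tok)].append(tok)
    return dict(groups)


text = "She went home and he goes there after kicking the ball".split()
print(group_by_lexeme(text)["GO"])  # ['went', 'goes']
```

Grouping by lexeme is what lets a researcher retrieve went and goes with a single query for GO.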

Parsing

A term used to refer to the practice of assigning the syntactic structure to a text. Parsing is usually performed after basic morphosyntactic categories have been identified in a text; it brings these categories into higher level syntactic relationships with one another. Parsing is probably the most commonly encountered form of corpus annotation after part-of-speech tagging. Corpora which have been parsed are sometimes known as treebanks. See McEnery and Wilson 1996: Ch. 2; Garside and McEnery 1993; Sampson 1992; Aarts and Heuvel 1985.

full parsing A type of parsing that aims to provide an analysis of the sentence structure that is as detailed as possible. See McEnery and Wilson 1996: Ch. 2.

skeleton parsing A type of parsing that is a less detailed approach which tends to use a less finely distinguished set of syntactic constituent types and ignores, for example, the internal structure of certain constituent types. See Garside and McEnery 1993; Leech and Garside 1991.
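The output of parsing is often stored as labelled bracketing. The sketch below assumes a tree held as nested tuples and a hypothetical `bracket` helper that prints it; the constituent labels S, NP and VP and the bracket notation are illustrative assumptions and do not reproduce the scheme of any particular treebank.

```python
def bracket(tree):
    """Render a (label, child, ...) tuple as labelled bracketing.

    Leaves are plain strings; every other node is a tuple whose first
    item is the constituent label.
    """
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return f"[{label} " + " ".join(bracket(child) for child in children) + "]"


tree = ("S", ("NP", "the", "dog"), ("VP", "barked"))
print(bracket(tree))  # [S [NP the dog] [VP barked]]
```

A skeleton parse in this representation would simply leave some constituents unanalysed, for example by keeping a whole phrase under one label without nested tuples.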

Validation

A term used to refer to the investigation of the conformance of any products or elements to certain acknowledged standards, i.e., the corpus has to be the size it claims, it must be composed and encoded the way it claims, all features encoded can be used for retrieval, annotations conform to a given standard, and the error rate for encoding and annotation does not exceed a certain level. Validation guarantees the client that he gets what he ordered and that he can rely on the resources to the extent stated by the validation certificate. Validation has to be carried out on an unbiased and neutral basis, which means not by the institution where the resources were created. See Teubert 1995.