Language/Linguistic Resources

A general term used to refer to resources such as corpora of spoken and written language, frequency lists, lexicons, computational linguistic lexicons, and tools for extracting linguistic knowledge in order to develop and optimize products. Linguistic resources are divided into corpora, lexical resources and tools, although the borderlines between these categories are not very distinct. See Gellerstam 1995; McEnery and Wilson 1996; Aarts and Meijs 1990; Edwards 1994.

Corpora

A central term in corpus linguistics used to refer to (i) (loosely) any body of text; (ii) (most commonly) a body of machine-readable text; (iii) (more strictly) a finite collection of machine-readable texts, sampled to be maximally representative of a language variety. See McEnery and Wilson 1996: Ch. 2; Sinclair 1982 and 1991; Johansson 1991; Collins 1988; Meyer 1986; Aarts and Meijs 1990; Biber and Finegan 1991; Edwards 1994.

annotated corpus (or tagged corpus) A type of corpus enhanced with various types of linguistic information. An annotated corpus may be considered a repository of linguistic information, because information which was implicit in the plain text has been made explicit through concrete annotation. See McEnery and Wilson 1996: Ch. 1; Aarts and Meijs 1990.

balanced corpus A type of corpus composed according to parameters such as text type, genre or domain. See Teubert 1995.

comparable (reference) corpus A type of corpus used for the comparison of different languages. A comparable corpus consists of a number of corpora in the different languages, all following the same composition pattern. The Commission of the European Community is funding a project whose main goal is the creation of comparable reference corpora (of 50 million words each) for all the official languages of the European Union. Comparable corpora are an indispensable source for bilingual and multilingual lexicons and for a new generation of dictionaries. See LE-PAROLE 1995: Ann. 1.

monitor corpus A type of corpus which is a growing, non-finite collection of texts, of primary use in lexicography. A monitor corpus reflects language change through a constant rate of growth, leaving untouched the relative weight of its components (i.e. its balance) as defined by the parameters. The same composition schema should be followed year by year, the basis being a reference corpus with texts spoken or written in one single year. See Sinclair and Ball 1995; Sinclair 1991: Ch. 1; Clear 1987.

monolingual corpus A type of corpus which contains texts in a single language. See McEnery and Wilson 1996: Ch. 2.

multilingual corpus A type of corpus which consists of a small collection of individual monolingual corpora (or subcorpora) that use the same or similar sampling procedures and categories for each language but contain completely different texts in the several languages (for two languages, a bilingual corpus). See McEnery and Wilson 1996: Ch. 2; McEnery and Oakes 1994.

opportunistic corpus A type of corpus which is an inexpensive collection of electronic texts that can be obtained, converted, and used free or at a very modest price, but which is often unfinished and incomplete: users are left to fill in the blank spots for themselves. Such corpora belong in environments where size and corpus access do not pose a problem. The opportunistic corpus is a virtual corpus in the sense that the selection of an actual corpus (from the opportunistic corpus) depends on the needs of a particular project. Today’s monitor corpora are usually opportunistic corpora. See Sinclair and Ball 1995.

parallel (aligned) corpus A type of multilingual corpus where texts in one language and their translations into other languages are aligned, sentence by sentence, preferably phrase by phrase. Sometimes reciprocal parallel corpora are set up, i.e. corpora containing authentic texts as well as translations in each of the languages involved. This allows translation equivalents to be double-checked. Note: Some corpus linguists employ a different terminology for multilingual corpora: they refer to parallel corpora (as defined here) as ‘translation corpora’ and use the term ‘parallel corpora’ instead to refer to the other kind of multilingual corpus, which does not contain the same texts in different languages. See Sinclair and Ball 1995; McEnery and Wilson 1996: Ch. 2; McEnery and Oakes 1994 and 1996; Zanettin 1994; Erjavec et al. 1995.
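
As an illustration only, sentence-level alignment can be pictured as a list of source/translation pairs that is searched for a term in order to inspect its translation equivalents. The class name AlignedUnit, the function find_equivalents and the English-French sample sentences are all assumptions made for this sketch; they are not taken from any of the cited corpora.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedUnit:
    """One aligned sentence pair (hypothetical structure for this sketch)."""
    source: str  # sentence in the original language
    target: str  # its translation


def find_equivalents(units, term):
    """Return every aligned pair whose source sentence contains `term` as a word."""
    term = term.lower()
    return [u for u in units if term in re.findall(r"\w+", u.source.lower())]


# Invented sample data, for illustration only.
units = [
    AlignedUnit("The house is old.", "La maison est vieille."),
    AlignedUnit("She sold the house.", "Elle a vendu la maison."),
    AlignedUnit("The garden is small.", "Le jardin est petit."),
]

for unit in find_equivalents(units, "house"):
    print(unit.source, "|", unit.target)
```

Reading the matched pairs side by side is what allows a lexicographer to check how a word has actually been translated in context.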

reference corpus A type of corpus that is composed on the basis of relevant parameters agreed upon by the linguistic community and that should include spoken and written, formal and informal language representing various social and situational strata. Reference corpora are used as benchmarks for lexicons and for the performance of generic tools and specific language technology applications. They are large in size: 50 million words is considered to be the absolute minimum, and 100 million will become the European standard in a few years. See Sinclair and Ball 1995.

sampled corpus A type of corpus which contains a finite collection of texts, often chosen with great care and studied in detail. Once a sampled corpus is established, it is not added to or changed in any way. See Sinclair 1991: Ch. 1.

saturated corpus A type of corpus in which the growth rate of the vocabulary stops decreasing and becomes constant (i.e. saturated). Thus, saturation is the point from which there will be perhaps eight new words for each 10000 additional words of text. Saturation of corpora is a fairly new concept, and no one knows what it leads to in terms of corpus size. See Teubert 1995.
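
The notion of vocabulary growth behind saturation can be made concrete with a small sketch that counts how many previously unseen word forms appear in each successive block of text. The function name new_types_per_window and the toy token list are assumptions for this example; a real measurement would use blocks of 10000 words, as in the definition above, rather than the tiny window shown here.

```python
def new_types_per_window(tokens, window=10_000):
    """Count the word forms not seen before in each successive block of `window` tokens."""
    seen = set()
    counts = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        new_types = {t for t in chunk if t not in seen}
        counts.append(len(new_types))
        seen.update(new_types)
    return counts


# Toy input with a window of 5 tokens, for illustration only.
tokens = "a b c a b d a b c a".split()
print(new_types_per_window(tokens, window=5))
```

On a large corpus the successive counts typically fall at first; under the definition above, the corpus would be called saturated once they level off at a roughly constant value.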

special corpus A type of corpus that is assembled for a specific purpose and that varies in size and composition according to that purpose. Special corpora are not balanced (except within the scope of their given purpose) and, if used for other purposes, give a distorted view of the language segment. Their main advantage is that the texts can be selected in such a way that the phenomena one is looking for occur much more frequently than in a balanced corpus. A corpus that is enriched in this way can be much smaller than a balanced corpus providing the same data. See Sinclair and Ball 1995.

spoken corpus A type of corpus that contains texts of spoken language. Spoken corpora are annotated using a form of phonetic transcription. Not many publicly available, fully phonetically transcribed corpora exist at the present time. Phonetically transcribed corpora are a useful addition to the battery of annotated corpora, especially for the linguist who lacks the technological tools and expertise for the laboratory analysis of recorded speech. See McEnery and Wilson 1996: Ch. 2; Crowdy 1993; Greenbaum 1990.

treebank (or parsed corpus) A type of corpus which has been annotated with phrase structure information. The term alludes to the representation of syntactic relationships (see parsing) by tree diagrams or phrase markers. See McEnery and Wilson 1996: Ch. 2; Garside and McEnery 1993; Souter and Atwell 1994.

unannotated corpus (or raw corpus) A type of corpus that is in the raw state of plain text, as opposed to an annotated corpus. Unannotated corpora have been, and are, of considerable use in language study, but the utility of a corpus is considerably increased by the provision of annotation. See McEnery and Wilson 1996: Ch. 2.

Lexical Resources/Data

A general term used to refer to lexical data, preferably in machine-readable form, that can be used in lexical research and/or form the basis of commercial products. See Gellerstam 1995; Calzolari 1989.

computational linguistic lexicon A more complex type of lexicon for parsing, for artificial intelligence (question-answering) and for machine translation. See Gellerstam 1995.

frequency list A term used to refer to a list that is based on counts of words or of other textual elements in a text and that gives the frequency of occurrence of each element. At present, the making of frequency lists is one of the most trivial functions that lingware deals with. See Sinclair 1991: Ch. 2; Johansson and Hofland 1989; Woods et al. 1986; McEnery and Wilson 1996.
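
A minimal sketch of how such a list might be produced is given here. The tokenisation rule (lowercased runs of letters and apostrophes), the function name frequency_list and the sample sentence are assumptions made for the example and are not prescribed by the cited sources.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    # Assumption: a word is a run of letters or apostrophes, compared in lowercase.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample = "The cat sat on the mat. The mat was flat."
for word, count in frequency_list(sample):
    print(f"{count:3d}  {word}")
```

The same counting scheme applies to other textual elements (lemmas, tags, word pairs) once the tokenisation step is replaced accordingly.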

lexical data base (LDB) A term used to refer to a data base which contains formalized lexical information at many descriptive levels. It is one of the chief tools today for processing great quantities of lexical data. It can be used for various types of linguistic applications and for general research in the lexical field. A data base management system provides users with tools which enable them to access the data without necessarily being familiar with its internal or physical organisation, knowing only the type of information they can retrieve. See Gellerstam 1995; Halteren and Heuvel 1990; Haan 1987; Kaye 1988; Calzolari 1989.

lexicon A term essentially synonymous with ‘dictionary’, i.e. a collection of words and information about them, although it is used more commonly than ‘dictionary’ to refer to machine-readable dictionary data bases (or electronic dictionaries). See Beale 1987; McEnery and Wilson 1996: Ch. 5; Garside and McEnery 1993; Garside 1987; Zernik 1991; Sinclair 1996; Calzolari 1989.

machine lexicon A type of lexicon which is not designed to be read by humans but provides explicit lexical information for performing specific tasks, e.g. automatic lemmatisation. See Gellerstam 1995.
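
By way of a hedged illustration, the lemmatisation task mentioned above can be reduced to a table lookup from word form to lemma. The table LEXICON, its four entries and the function lemmatise are invented for this sketch; a real machine lexicon would be far larger and would normally carry additional information such as part of speech.

```python
# Hypothetical word-form -> lemma table, for illustration only.
LEXICON = {
    "went": "go",
    "goes": "go",
    "mice": "mouse",
    "better": "good",
}


def lemmatise(tokens, lexicon):
    """Map each token to its lemma; forms missing from the lexicon are returned lowercased."""
    return [lexicon.get(token.lower(), token.lower()) for token in tokens]


print(lemmatise(["Mice", "went", "home"], LEXICON))
```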

Products

A general concept which includes any tools or applications that are worth putting money into. See Engelien and McBryde 1991.

automatic hyphenizer A tool that automatically hyphenates a text according to grammatical conventions. See Gellerstam 1995.

computer-aided learning / computer-assisted language learning (CALL) A term used to refer to computer applications and software based on lexical data that can be used in various types of interactive teaching of written or spoken language skills, such as sentence restructuring, checking of translation, dictation tasks, dictionary look-up, etc. One method of language learning is the data-driven learning approach, which attempts to give the learner direct access to the data and to cut out the middleman. This approach is based on the assumption that effective language learning is a form of research performed by the learner. See Johns 1991; McEnery and Wilson 1993; McEnery et al. 1995; Wilson and McEnery 1994.

computer-aided translation (CAT) (or translator’s workbench) A term used to refer to computer systems, programs or applications which contain tools and facilities which help translators to increase their productivity and the quality of their work. These include monolingual or bilingual lexicons, translation memories (which help to avoid translating the same or similar fragments more than once), spelling checkers, terminology databases, translation editors, terminology extraction, access to previously translated texts, document comparison, thesauruses, etc. See Krauwer 1995; McEnery and Wilson 1996: Ch. 5.
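
The idea of a translation memory can be sketched as a store of previously translated segments that is searched for the closest match to a new segment. The similarity measure used here (the ratio from Python's standard difflib.SequenceMatcher), the threshold of 0.75, the function name best_match and the sample segments are all assumptions for this example; commercial systems use their own matching schemes.

```python
from difflib import SequenceMatcher


def best_match(segment, memory, threshold=0.75):
    """Return (source, target, score) for the stored segment most similar to
    `segment`, or None if no stored segment reaches `threshold`."""
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best


# Invented sample memory, for illustration only.
memory = {
    "Press the start button.": "Appuyez sur le bouton de démarrage.",
    "Close the cover.": "Fermez le couvercle.",
}

print(best_match("Press the stop button.", memory))
```

A match below 1.0 is offered to the translator as a suggestion to be edited rather than as a finished translation, which is how the same or similar fragments avoid being translated more than once.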

general text checker A tool that checks practical things like starting a new sentence with a capital letter, spotting extra spaces between words, etc. See Gellerstam 1995.
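
The two checks named above can be sketched with regular expressions. The function name check_text, the wording of the messages and the simple rule for recognising a sentence boundary (a full stop, exclamation mark or question mark followed by whitespace) are assumptions for this example; the rule will misfire on abbreviations.

```python
import re


def check_text(text):
    """Return a sorted list of (position, message) pairs for two simple problems."""
    problems = []
    # Two or more consecutive spaces between words.
    for match in re.finditer(r" {2,}", text):
        problems.append((match.start(), "extra space between words"))
    # A lowercase letter directly after sentence-final punctuation and whitespace.
    for match in re.finditer(r"[.!?]\s+([a-z])", text):
        problems.append((match.start(1), "sentence starts with a lowercase letter"))
    return sorted(problems)


for position, message in check_text("This is  fine. but this is not."):
    print(position, message)
```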

spelling checker A tool that is used to find spelling errors in a text and is usually based on a collection of word forms representing an actual corpus, or on a list of word forms generated from a dictionary. The spelling checker is probably the number one commercial application, and its facilities are more or less standard ingredients of word processing today. See Teubert 1995; Gellerstam 1995.
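
In its simplest form, the word-form approach amounts to flagging every token that is absent from the stored collection. The function name misspelled, the tokenisation rule and the tiny word list are assumptions made for this sketch; it detects unknown forms only and does not suggest corrections.

```python
import re


def misspelled(text, word_forms):
    """Return the tokens of `text` that do not occur in the set `word_forms`."""
    # Assumption: tokens are runs of letters or apostrophes, compared in lowercase.
    tokens = re.findall(r"[A-Za-z']+", text)
    return [token for token in tokens if token.lower() not in word_forms]


# Tiny invented word list, standing in for forms drawn from a corpus or dictionary.
word_forms = {"the", "cat", "sat", "on", "mat"}
print(misspelled("The cat szat on the mat", word_forms))
```

Because the check is a plain membership test, a correctly spelled word that is missing from the collection is also flagged, which is why the coverage of the underlying corpus or dictionary matters.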

style checker A tool that checks particular words from a stylistic point of view (“why do you use the passive form?”), parses the text to spot grammatical errors (such as errors of congruence), and checks contextual data (“have you used the right preposition after the verb?”). See Gellerstam 1995.