Lingware/Language Engineering Tools

A general term for software, ranging from relatively small independent programs to larger systems, that extracts linguistic information from lexical resources or corpora. Language engineering tools are conventionally divided into rule-based tools (built on hand-crafted rules) and statistical tools. Most systems, however, are hybrid in one sense or another and form a continuum of approaches, so the division is not sharp and is not used here. This section describes general types of tools and their functions, as well as some well-known specific lingware. See Erjavec 1995; Engelien and McBryde 1991.

CL tools Short for Computational Linguistic tools: software that belongs to computational linguistics proper, including morphological analysers, implementations of formalisms, and lexicon environments. These systems can hardly be considered ‘tools’, as they are often large and complex. Furthermore, they are only distantly connected to corpus development or exploitation. See Erjavec 1995: Ch. 5; Engelien and McBryde 1991; Manandhar 1993.

CLAWS The leading English part-of-speech tagger (developed by Garside R. in 1987). CLAWS uses probabilistic disambiguation, with probabilities derived automatically from the previously constructed Brown corpus. See Garside 1987; McEnery and Wilson 1996: Ch. 5.

Cutting tagger A part-of-speech tagger (developed by Cutting D. in 1992) that employs probabilistic techniques similar to those of CLAWS. Its success rate is comparable to that of the leading English part-of-speech taggers. The Cutting tagger can be trained on unannotated text, and its developers also claim that it can construct its lexicon and train its probabilistic model directly from an automatically analysed corpus. See Cutting et al. 1992; McEnery and Wilson 1996: Ch. 5.

parser/syntactic parser A type of tool that syntactically analyses a text, i.e. performs parsing. A parser determines which part of speech to assign to each word in a sentence and combines these part-of-speech tagged words into larger and larger grammatical fragments, using some kind of grammar that specifies which combinations are possible and/or likely. The output of this analysis, either a single-rooted tree or a string of tree fragments, then goes through semantic analysis, which determines the literal meaning of the sentence in isolation. Developers of parsers have employed a variety of approaches. However, existing systems are still far from robust and their accuracy remains rather low, which makes them impractical as a tool for the corpus linguist at present. See McEnery and Wilson 1996: Ch. 5; Marcus 1995; Eeg-Olofsson 1990.
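The idea of combining tagged words into larger and larger fragments can be illustrated with a minimal chart (CYK-style) sketch. The grammar, tag names and function name below are hypothetical and chosen only for illustration. The sketch assumes the words have already been tagged and that every rule is binary, and it does not correspond to any specific parser cited in this entry.

```python
# Hypothetical toy grammar: a pair of adjacent categories -> the category they form.
GRAMMAR = {
    ("DET", "NOUN"): "NP",
    ("VERB", "NP"): "VP",
    ("NP", "VP"): "S",
}

def cyk_parse(tags, grammar):
    """Return a tree rooted in 'S' covering all tags, or None if there is none."""
    n = len(tags)
    if n == 0:
        return None
    # chart[(i, j)] maps each category found over tags[i:j] to one tree for it.
    chart = {(i, i + 1): {tag: tag} for i, tag in enumerate(tags)}
    for span in range(2, n + 1):              # fragments of growing length
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):          # every split point inside the span
                for left, ltree in chart[(i, k)].items():
                    for right, rtree in chart[(k, j)].items():
                        parent = grammar.get((left, right))
                        if parent and parent not in cell:
                            cell[parent] = (parent, ltree, rtree)
            chart[(i, j)] = cell
    return chart[(0, n)].get("S")

# Tags for a sentence such as "the dog chased the cat".
tree = cyk_parse(["DET", "NOUN", "VERB", "DET", "NOUN"], GRAMMAR)
print(tree)
```

When no single-rooted tree exists, the function returns None. A fuller sketch could instead return the fragments left in the chart, which corresponds to the "string of tree fragments" output mentioned above.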

part-of-speech tagger A tool that assigns a part of speech to each word form in a corpus, i.e. performs part-of-speech tagging. A part-of-speech tagger takes as its input a word form together with all its possible morphosyntactic interpretations and outputs the most likely interpretation, given the context in which the word form appears. Automated part-of-speech taggers are among the very best NLP applications in use today in terms of reliability, and hence usefulness. Both probabilistic (or stochastic, i.e. based on a statistical grammar) and rule-based taggers (i.e. based on traditional hand-crafted rule grammars) have been developed. For specific taggers, see Cutting tagger, CLAWS, TAGGIT. See Cutting et al. 1992; Eeg-Olofsson 1990; Greene and Rubin 1971.
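A probabilistic tagger of this kind can be sketched as a search for the most likely tag sequence. The sketch below uses the Viterbi algorithm over a bigram model, which is one common technique and is not claimed to be the method of any tagger named here. The lexicon, the tag set and all probabilities are invented for illustration.

```python
import math

# Hypothetical lexicon: each word form with its possible tags and
# invented probabilities P(word | tag).
LEXICON = {
    "the": {"DET": 1.0},
    "can": {"NOUN": 0.3, "AUX": 0.6, "VERB": 0.1},
    "rusts": {"VERB": 0.8, "NOUN": 0.2},
}

# Hypothetical transition probabilities P(tag | previous tag).
TRANSITIONS = {
    ("<s>", "DET"): 0.6,
    ("DET", "NOUN"): 0.9, ("DET", "AUX"): 0.01, ("DET", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
    ("AUX", "VERB"): 0.7, ("AUX", "NOUN"): 0.05,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.05,
}

def viterbi_tag(words, lexicon, transitions, floor=1e-6):
    """Return the most likely tag sequence for words (all must be in lexicon)."""
    # best[tag] = (log probability, tag path) of the best path ending in tag.
    best = {"<s>": (0.0, [])}
    for word in words:
        new_best = {}
        for tag, emit in lexicon[word.lower()].items():
            scored = []
            for prev, (logp, path) in best.items():
                trans = transitions.get((prev, tag), floor)  # floor for unseen pairs
                scored.append((logp + math.log(trans) + math.log(emit), path + [tag]))
            new_best[tag] = max(scored, key=lambda item: item[0])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

print(viterbi_tag(["The", "can", "rusts"], LEXICON, TRANSITIONS))
```

In isolation "can" is most likely an auxiliary in this toy lexicon, but the preceding determiner makes the noun reading win, which is the role of context described in the entry.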

public domain tools Freely available software (sometimes called public domain generic tools) that can be used for any purpose. Public domain tools make it possible to explore a particular technology even if it is not yet fully developed. See Erjavec 1995; Engelien and McBryde 1991.

query(ing) tools A type of tool that retrieves all or part of specific data held in a corpus. In other words, querying tools answer the user's questions about lexical or corpus data. An integrated corpus query system must combine speed, a powerful query language, and a display engine. The choice of corpus querying tools is quite limited at present. See Erjavec 1995: Ch. 4; Engelien and McBryde 1991; Jacobs 1992; Kobsa and Wahlster 1989.
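A very small example of a corpus query is a keyword-in-context (KWIC) search, sketched below. Regular expressions stand in for the query language and a list of (left context, match, right context) tuples stands in for the display engine. The function name, the tokenisation and the sample text are assumptions made for illustration, not features of any particular query system.

```python
import re

def kwic(text, pattern, width=3):
    """Return (left context, match, right context) for each token matching pattern."""
    tokens = re.findall(r"\w+", text)          # naive tokenisation, an assumption
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

sample = "The tagger tags the corpus. Taggers need a tagged corpus."
for left, match, right in kwic(sample, r"tag\w*", width=2):
    print(f"{left:>12} [{match}] {right}")
```

This sketch scans every token on each query. A real system would need an index over the corpus to meet the speed requirement mentioned above.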

TAGGIT One of the earliest part-of-speech tagging programs (developed by Greene and Rubin in 1971), which achieved a success rate of 77 per cent correctly tagged words. The program used rule-based templates for disambiguation. See Greene and Rubin 1971; McEnery and Wilson 1996: Ch. 2.
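The general idea of template-based disambiguation can be sketched as follows: each word starts with all its candidate tags, and context templates discard readings that do not fit an unambiguous neighbour. The lexicon, the tag names and the two templates are invented for illustration and are not TAGGIT's actual rules.

```python
# Hypothetical lexicon: each word form with its set of candidate tags.
CANDIDATES = {
    "the": {"DET"},
    "to": {"TO"},
    "run": {"NOUN", "VERB"},
}

# Hypothetical templates: (tag of the preceding word, reading to discard).
TEMPLATES = [
    ("DET", "VERB"),  # after a determiner, discard the verb reading
    ("TO", "NOUN"),   # after infinitival "to", discard the noun reading
]

def disambiguate(words, candidates, templates):
    """Return one set of remaining tags per word after applying the templates."""
    tags = [set(candidates[word.lower()]) for word in words]
    for i in range(1, len(tags)):
        previous = tags[i - 1]
        if len(previous) != 1:
            continue                  # only an unambiguous neighbour triggers a template
        (previous_tag,) = previous
        for left, discard in templates:
            if left == previous_tag and len(tags[i]) > 1:
                tags[i].discard(discard)
    return tags

print(disambiguate(["the", "run"], CANDIDATES, TEMPLATES))
print(disambiguate(["to", "run"], CANDIDATES, TEMPLATES))
```

Words that no template resolves keep more than one tag, so a purely rule-based scheme of this sort leaves some ambiguity unresolved.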

terminological data bank (TDB) A terminological tool: a more or less sophisticated organizational structure established for handling and maintaining terminological data with the help of a TMS (terminological management system). A TDB can comprise several or many terminology databases. See Galinski 1995: Ch. 2; Daille 1995.

terminological management system (TMS) Terminological tools used to record, store, process, and output terminological data in a professional manner. TMS modules are integrated into all kinds of application software for co-operative writing, documentation, or co-operative terminology work. See Galinski 1995; Daille 1995.

terminology databases Terminological tools that consist of terminological data and a TMS (terminological management system) to handle these data. Several terminology databases can be included in one terminological data bank (TDB). See Galinski 1995; Daille 1995; Pearson and Kenny 1991.
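The relationship between term entries, terminology databases and a terminological data bank can be pictured with a minimal data-structure sketch. All class names, field names and the sample entry are hypothetical. A real TMS records far richer data than the three fields used here.

```python
from dataclasses import dataclass, field

@dataclass
class TermEntry:
    term: str
    language: str
    definition: str

@dataclass
class TerminologyDatabase:
    """Terminological data plus minimal TMS-like operations (record, look up)."""
    name: str
    entries: list = field(default_factory=list)

    def record(self, entry):
        self.entries.append(entry)

    def lookup(self, term):
        return [e for e in self.entries if e.term.lower() == term.lower()]

@dataclass
class TermBank:
    """A data bank grouping several terminology databases."""
    databases: list = field(default_factory=list)

    def lookup(self, term):
        # Map each database name to its matching entries, skipping empty results.
        results = {}
        for db in self.databases:
            found = db.lookup(term)
            if found:
                results[db.name] = found
        return results

corpus_terms = TerminologyDatabase("corpus-linguistics")
corpus_terms.record(TermEntry("tagger", "en", "A tool that assigns part-of-speech tags."))
bank = TermBank([corpus_terms, TerminologyDatabase("translation")])
print(bank.lookup("Tagger"))
```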

Language Engineering (LE)

The aim of Language Engineering (sometimes referred to as language technology) is to facilitate the use of telematics applications and to increase the possibilities for communication in and between world languages by integrating new spoken and written language processing methods. Language Engineering covers the following action lines:

(i) creation and improvement of pilot applications (document creation and management, information and communication services, translation and foreign language acquisition);

(ii) corpora;

(iii) language engineering research;

(iv) support issues specific to language engineering (i.e. standards, assessment and evaluation, awareness activities, user surveys). See Andersen 1995; Cohen et al. 1990.

Machine Translation (MT)

A branch of computational linguistics that covers all processes related to automatic translation. Literally, machine translation refers to the imitation of a human translator by a computer or machine. However, no fully automatic machine translation systems exist at present; there are, instead, a number of applications that increase the productivity, effectiveness and quality of a human translator's work (computer-aided translation). See Krauwer 1995; Hutchins and Somers 1992; Copeland et al. 1991; Hutchins 1986; Nagao 1989.