MSU-LGCLL:: DicTUM-1 - A System for Dictionary-Text Universal Manipulations and Analysis

[Presented at the XI International Conference "History and Computing" held at Moscow University, August 20-24 1996]

DicTUM-1

A SYSTEM FOR DICTIONARY-TEXT UNIVERSAL MANIPULATIONS AND ANALYSIS

O.V.Kukushkina, A.A.Polikarpov

1. General functions.

Since 1991 the DicTUM-1 project has been under way at the Laboratory for General and Computational Lexicology and Lexicography (Philological Faculty, Moscow State University) under the leadership of A.A.Polikarpov. The system is written in Borland Pascal 7.0 and runs under MS-DOS 3.3.1 (and higher) on PC-compatible computers (386 and higher). It provides the following general functions:

- (1) to maintain growing, coordinated textual and dictionary databases (DBs) without noticeable limits on their size and without loss of their integrity;

- (2) to carry out multi-aspect research on textual and dictionary data with software that is still being developed, including the combination and comparison of various classifications of textual and dictionary elements of natural language.

Until now, systems for text processing, on the one hand, and systems for accumulating and analysing dictionary data, on the other, have in most cases been developed separately. Only a real integration of these two kinds of language data processing systems makes possible large-scale, empirically grounded and instrumentally supported linguistic research aimed at solving a wide variety of problems.

2. Main components.

2.1. Shell.

It controls all other components and supports the exchange of data between them.

2.2. DIC subsystem.

It supports the creation, control, development, combination, analytic study, comparison and use of dictionary DBs of various structures. "Use" means not only retrieving any reference information from them, but also marking units of textual material with field information from a dictionary.
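
The "marking" use described above can be pictured with a minimal Python sketch. It is not DicTUM-1 code (the system itself is written in Borland Pascal); the dictionary contents, the field name `pos` and the function `mark_text` are hypothetical and serve only as an illustration.

```python
# Hypothetical dictionary DB: each headword maps to a record of fields.
DICTIONARY = {
    "system": {"pos": "noun"},
    "text": {"pos": "noun"},
    "analyses": {"pos": "verb"},
}


def mark_text(tokens, dictionary, field):
    """Pair each token with the value of `field` from the dictionary.

    Tokens missing from the dictionary are paired with None,
    so gaps in dictionary coverage stay visible.
    """
    marked = []
    for token in tokens:
        entry = dictionary.get(token.lower())
        marked.append((token, entry.get(field) if entry else None))
    return marked


tokens = "The system analyses text".split()
marked = mark_text(tokens, DICTIONARY, "pos")
print(marked)
```

Looking tokens up by exact lower-cased form is a simplification; a real system would presumably lemmatize word-forms first.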

2.3. QuText subsystem.

It lets the user manage and analyse textual data. Its functions include the following:

- external characterization of a text entering the QuText subsystem, both by relevant external features and by features obtained from analysis of its internal structure;

- automatic and interactive structuring of a text: singling out units and attaching conventional tags to them;

- automatic lexical, morphological and syntactic analysis of a text (i.e., automatic tagging of textual units);

- linking units of textual DBs with information on the same units from various external dictionaries, and obtaining textual projections of dictionary fields in the modes "transparent text", "characterized text", "shadow text" and "textual density of a feature";

- automatic compilation of frequency and distributional dictionaries and of co-occurrence matrices of elements of a text (word-forms characterized grammatically and lexically, lexemes characterized grammatically and linked to their word-forms, semantic and grammatical categories of words, phrases, syntactic constructions, etc.);

- recombination and statistical analysis of data from the above-mentioned frequency dictionaries (with the possibility of looking through the source textual and dictionary data).
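
As an illustration of the frequency dictionaries and co-occurrence matrices listed above, here is a minimal Python sketch. It is an assumption-laden toy, not the QuText implementation: the function names are hypothetical, the units are plain tokens, and "co-occurrence" is taken to mean appearing in the same sentence, which the source does not specify.

```python
from collections import Counter
from itertools import combinations


def frequency_dictionary(tokens):
    """Count how often each unit occurs."""
    return Counter(tokens)


def cooccurrence_matrix(sentences):
    """Count unordered pairs of distinct units sharing a sentence.

    The result is a sparse matrix stored as a Counter keyed by
    alphabetically ordered pairs (assumption: the co-occurrence
    window is one sentence).
    """
    counts = Counter()
    for sentence in sentences:
        for a, b in combinations(sorted(set(sentence)), 2):
            counts[(a, b)] += 1
    return counts


# Toy corpus, already split into sentences and tokens.
sentences = [
    ["dictionary", "text", "analysis"],
    ["text", "analysis"],
    ["dictionary", "text"],
]
freq = frequency_dictionary(t for s in sentences for t in s)
cooc = cooccurrence_matrix(sentences)
print(freq.most_common())
print(cooc[("analysis", "text")])
```

The same two functions work unchanged if the tokens are replaced by lemmas, grammatical tags or semantic categories, which is the kind of recombination the list describes.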

3. Main analytic tools:

- Morpholemmatizer;

- Stable-phrases-Finder;

- LinThes: an interactive tool for using and developing semantic classifications of words, phrases, etc.;

- Content-Analysis: a subsystem for semantic analysis of a text as a whole, using data obtained with LinThes;

- Style: a subsystem for analysing various individual and group features of texts, using data from all the analytic tools mentioned above and filtering them specifically for different classification goals.

4. Main external DBs supporting analytical procedures:

DBs of synonyms, homonyms, idioms, thesauri, semantically characterized meanings, morphemes, word-formation nests, grammatically characterized words, words with stylistically marked meanings, etc.

5. Main information products of DicTUM:

- a family of frequency and distributional dictionaries and co-occurrence matrices for the lexical, morphemic, phraseological, etc. units mentioned above, each preserving its link to the textual fragments in which the unit was used;

- a set of concordances for the same units;

- a series of transformed versions of a text ("characterized", "transparent", "shadow", etc.);

- a set of statistically derived measures of the position of a text in a classification space;

- a series of specific lexical, semantic and grammatical quantitative profiles of a text, interpreted stylistically and content-analytically.
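
The concordances listed above, and the idea of keeping each unit linked to the place where it was used, can be sketched in Python as a simple keyword-in-context listing. This is a hypothetical illustration, not DicTUM-1 output: the function name `concordance`, the `width` parameter and the sample sentence are assumptions.

```python
def concordance(tokens, keyword, width=2):
    """Return keyword-in-context lines for `keyword`.

    Each line is (position, left context, keyword, right context);
    keeping the position preserves the link back to the source text.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((i, left, token, right))
    return lines


tokens = "the text and the dictionary describe the same text".split()
lines = concordance(tokens, "text")
for position, left, word, right in lines:
    print(f"{position:>3}  {left:>10} [{word}] {right}")
```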