Parsing of Natural Languages

Parsing

The term parsing describes the process of decomposing a given object into its components. So-called parsers are omnipresent in computer science: they are part of each compiler and for many programming languages there are parser generators, which generate code for the decomposition of user input. From a theoretic perspective, these parsers are often recognizers for regular languages or context-free languages – classes of formal languages which have been analysed in the discipline of theoretical computer science.

However, the origin of these classes of formal languages is linguistics. They have been introduced by Noam Chomsky with the intention to find a formalism suitable for the description of natural languages, i.e., languages that are spoken and written by humans. Although such a formalisation has not fully been achieved yet, parsers are of importance in the fields of computational linguistics and natural language processing. The goal of these fields are computer-aided analysis of natural language and automatic processing of natural language by means of a computer, respectively. Possible applications of the latter are for instance translation or summarization of text or chatbots that are capable to communicate fluitly with humans on a wide range of topics. Parsers aid these applications by providing a synactic (i.e., structural) analysis of the a given utterance of natural language.

Mild context-sensitive languages

Since natural languages turned out to be neither regular nor context-free and the richer class of context-sensitive languages has an prohibitive complexity, so-called mild context-senstitive languages have been investigated. Two formalisms in this class are the linear context-free rewriting systems (LCFRS) and the multiple context-free grammars (MCFG); these formalisms are syntactically similar and semantically equivalent. Specifically, these formalisms allow to model discontinuous phrases, i.e., parts of a sentence that form a logical unit but not a sequence of consecutive words in the sentence. Discontinuous phrases occur for instance in languages with free worder such as German, Swiss German, Dutch but also in English.

Theoretic contributions

At the chair for foundations of programming we investigate mild context-senstive grammars. On the one hand, this involves theoretic aspects such as providing a Chomsky-Schützenberger and an automaton characterization of weighted MCFG. As the degree of discontinuity representable by some MCFG influences its parsing complexity, we consider approximation techniques which try to preserve a given language while reducing the complexity. Another endeavor in this direction was the introduction of hybrid grammars, which couple and synchronize e.g. an LCFRS with a tree-generating grammar. This allows to shift part of the complexity into the tree-generating grammar, which reduces the string-parsing complexity. Moreover, we developed an invertable method for lexicalization of MCFG.

Applied contributions

We went further by also implementing our theoretic contributions and evaluating them on large linguistic data-sets, so-called corpora. These implementation are part of the software projects rustomata and panda-parser. Moreover, we investigate how grammar-based parsers can be combined with neural models.

Slides

Folien presented by Heiko Vogler at CAI 2017

Contact

Prof. Dr.-Ing. habil. Dr. h.c./Univ. Szeged
Heiko Vogler
Phone: +49 (0) 351 463-38232

Last modified: 2020-10-20 11:57am