Practical Course in Natural Language Processing (Summer 2026)
Overview
This practical course will explore how context-free grammars, which are studied in the course on formal systems and used to describe formal languages, can be used to parse natural languages, which are by nature informal, having developed organically over centuries. The end goal will be to implement a parser able to accurately identify the different syntactic parts of a given input sequence in English.
To this end, we will construct a probabilistic grammar out of an extensive corpus and examine various methods to improve the accuracy and speed of the parser. The parser will be tested on sentences taken from another data set to see how it generalizes. At the end of the course, the most accurate and fastest parsers will be selected for a competition.
Organization
- Language: English
- Timeslot: Tuesdays at 9:20 a.m.
- Room: APB-E042
The practical course will be split into four parts spread over the semester, each of which concludes with a partial delivery. Each part will be introduced with a mandatory lecture going over the minimum required tasks and new concepts. During the rest of the time, the room will be available for independent work and I will be there if you have any questions or need guidance. Outside of these slots, you can always reach me by e-mail, should you need help with anything.
The introductory lecture will take place on the 21st of April in room APB-E042 at 9:20 a.m., and attending it is a prerequisite for participating in the course. Please note that you do not need to register in advance.
Please send an e-mail with your name to register for the course. This will give me a better idea of the number of participants and also allow me to reach out should there be any changes to the course (deadlines, additional lectures, etc.).
You can find all relevant organizational information in the schedule, including:
- Important dates for submissions and compulsory tutorials
- Descriptions of the tasks and sub-tasks of each iteration
- Documentation of the expected command-line interface
Resources
Here you can find all resources relevant to the course, including the corpus used for training and testing and the slides of the tutorials. Please note that to access them from outside TU Dresden’s campus, you need to be connected to the university’s VPN (find more here).
Slides
Corpora
Annotated sentences from the Wall Street Journal in the Penn Treebank (PTB) serve as the training and test corpora. These are made available here in a processed format. Please note that TU Dresden's license for the use of the Penn Treebank does not authorize you to redistribute this data.
The data is divided into the folders large and small. Each folder contains a grammar trained on segments of varying sizes from the PTB, sentences for testing the parser, and the gold parse trees for these sentences. In the case of large, the trees used to train the grammar are also included.
The specific contents of the individual files are as follows.
- training.mrg: Training corpus (annotated sentences) for grammar induction.
- training_b.mrg: Binarized training corpus for grammar induction (grammar induction on it should yield grammar.rules, grammar.lexicon, and grammar.words).
- grammar.rules, grammar.lexicon, grammar.words: Sample grammar (binarized) for the parser.
- testsentences: Sample sentences (i.e., the test corpus) for the parser.
- gold.mrg: Gold parse trees, i.e., annotated sentences from the test corpus (not binarized, so they may not match the parser output!).
- gold_b.mrg: Binarized gold parse trees.
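To make the idea of grammar induction concrete, here is a minimal sketch of relative-frequency estimation from bracketed trees. It rests on several assumptions that are not taken from the course material: each tree is given as one bracketed string in which every bracket carries a label, every word is the only child of its part-of-speech tag, and the names parse_bracketed and induce are hypothetical. It does not read or write the grammar.rules, grammar.lexicon, or grammar.words formats, which are documented in the schedule.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Assumption: every opening bracket is directly followed by a label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def induce(trees):
    """Estimate rule and lexicon probabilities by relative frequency."""
    rules, lexicon, lhs_total = Counter(), Counter(), Counter()

    def visit(node):
        label, children = node
        lhs_total[label] += 1
        if len(children) == 1 and isinstance(children[0], str):
            # Lexical rule: tag -> word
            lexicon[(label, children[0])] += 1
        else:
            # Phrase-level rule: label -> sequence of child labels
            rules[(label, tuple(child[0] for child in children))] += 1
            for child in children:
                visit(child)

    for tree in trees:
        visit(tree)

    # Probability = count of the rule / count of its left-hand side
    rule_probs = {key: n / lhs_total[key[0]] for key, n in rules.items()}
    lex_probs = {key: n / lhs_total[key[0]] for key, n in lexicon.items()}
    return rule_probs, lex_probs


toy_corpus = [
    "(S (NP (DT the) (NN dog)) (VP (VBZ barks)))",
    "(S (NP (NN dog)) (VP (VBZ barks)))",
]
rule_probs, lex_probs = induce(parse_bracketed(t) for t in toy_corpus)
```

In this toy corpus, NP is expanded once as DT NN and once as NN, so each of the two NP rules receives probability 0.5.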
Evaluation of the Parser (F1 Score)
The trees output by the parser will usually not match the gold trees exactly. The quality of the parser is therefore measured by the F1 score, which can be calculated using discodop, for example:
discodop eval --fmt=bracket <FILE WITH GOLD TREES> <FILE WITH PARSE TREES>
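For intuition about what such a score measures, the following sketch computes a labeled-bracket F1 by hand. It is a simplification under stated assumptions, not a reimplementation of discodop: each tree is one bracketed string with a label after every opening bracket, part-of-speech brackets are not counted, and no further evaluation conventions (such as special treatment of punctuation) are applied, so the numbers need not agree exactly with the tool's output. The function names labeled_spans and f1_score are hypothetical.

```python
from collections import Counter


def labeled_spans(bracketed):
    """Return a multiset of (label, start, end) for every phrase-level bracket."""
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []       # open brackets as [label, start index, number of child brackets]
    n_words = 0
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            if stack:
                stack[-1][2] += 1
            stack.append([tokens[i + 1], n_words, 0])
            i += 2                   # consume "(" and the label
        elif tok == ")":
            label, start, n_children = stack.pop()
            if n_children > 0:       # skip part-of-speech brackets
                spans[(label, start, n_words)] += 1
            i += 1
        else:
            n_words += 1             # a word
            i += 1
    return spans


def f1_score(gold_trees, parsed_trees):
    """Labeled-bracket F1 over paired lists of bracketed trees."""
    matched = n_gold = n_parsed = 0
    for gold, parsed in zip(gold_trees, parsed_trees):
        g, p = labeled_spans(gold), labeled_spans(parsed)
        matched += sum((g & p).values())   # multiset intersection
        n_gold += sum(g.values())
        n_parsed += sum(p.values())
    precision = matched / n_parsed if n_parsed else 0.0
    recall = matched / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["(S (NP (DT the) (NN dog)) (VP (VBZ barks)))"]
parsed = ["(S (NP (DT the)) (VP (NN dog) (VBZ barks)))"]
score = f1_score(gold, parsed)
```

In this example only the S bracket spanning the whole sentence is shared, so precision and recall are both 1/3 and the F1 score is 1/3 as well.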
Working Environment
You are free to choose how to structure the implementation and which tools to use. To enable automated testing, the only requirements are the following.
- The working directory must contain a file named Makefile at the top level. This file must include a target that builds an executable file named pcfg_tool. A more detailed introduction to Make can be found here, for example.
- The program pcfg_tool contains the solutions to all subtasks. Individual functions are called via subcommands (as with git). If a function is not implemented, the program should terminate with status code 22. Further details regarding the command-line interface can be found in the schedule.
- No file at the top level of the working directory may begin with .eval_.
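The subcommand requirement above can be sketched as a small dispatch table. This is only one possible structure, and several details are assumptions rather than part of the specification: the subcommand names induce and parse are placeholders (the actual names are listed in the schedule), exit status 1 for a missing subcommand is an arbitrary choice, and treating an unknown subcommand like an unimplemented one is a design decision of this sketch. Only the status code 22 for unimplemented functions comes from the requirements.

```python
import sys

NOT_IMPLEMENTED = 22  # exit status required when a function is not implemented


def induce(args):
    """Placeholder for an implemented subcommand (hypothetical name)."""
    print("induce called with", args)
    return 0


# Hypothetical subcommand table; None marks a subcommand without an implementation.
SUBCOMMANDS = {
    "induce": induce,
    "parse": None,
}


def main(argv):
    """Dispatch argv[1] to its handler and return the process exit status."""
    if len(argv) < 2:
        print("usage: pcfg_tool <subcommand> [args...]", file=sys.stderr)
        return 1  # assumption: the specification does not fix this value
    handler = SUBCOMMANDS.get(argv[1])
    if handler is None:
        # Unimplemented (or, in this sketch, unknown) subcommand
        return NOT_IMPLEMENTED
    return handler(argv[2:])


# An entry script would finish with: sys.exit(main(sys.argv))
```

Keeping main as a function that returns the status, instead of calling sys.exit deep inside the handlers, makes the behaviour easy to check without spawning a process.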
You can check for yourself whether your implementation meets the specifications regarding input and output formats. To do so, unpack the following test suite into your working directory and run make -f tests.mk. Please ensure that these tests run successfully before submitting a partial solution. Note: the tests do not verify the correctness of the implementation.