Statistical Machine Translation (Winter 2018/2019)

General Information

Instructor:	Jakub Waszczuk

Course web page:	https://user.phil.hhu.de/~waszczuk/teaching/hhu-smt-wi18/
	(This web page, which will be updated throughout the course.)
Office hours:	by appointment
Language:	English

Course Description

In this course, we will introduce the basic methods of statistical machine translation (SMT), such as word-based and phrase-based models (towards the end, we will also consider more sophisticated methods). A main issue in SMT is not so much the models themselves as the estimation of their parameters; hence there is a strong focus on machine learning methods for parameter estimation.
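
To make "estimation of parameters" concrete, here is a minimal sketch of maximum-likelihood estimation for a bigram language model, where each probability is a relative frequency of counts. The class name BigramMle, its methods, and the toy sentences are illustrative assumptions and are not taken from the course code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the course material.
class BigramMle {
    // bigramCounts.get(prev).get(next) = how often "prev next" was observed.
    private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
    // contextCounts.get(prev) = how often prev occurred as the first word of a bigram.
    private final Map<String, Integer> contextCounts = new HashMap<>();

    // Collect counts from one tokenized sentence, padded with <s> and </s>.
    void addSentence(List<String> tokens) {
        String prev = "<s>";
        for (int i = 0; i <= tokens.size(); i++) {
            String next = (i < tokens.size()) ? tokens.get(i) : "</s>";
            bigramCounts.computeIfAbsent(prev, k -> new HashMap<>()).merge(next, 1, Integer::sum);
            contextCounts.merge(prev, 1, Integer::sum);
            prev = next;
        }
    }

    // Maximum-likelihood estimate: P(next | prev) = count(prev, next) / count(prev).
    // Unseen contexts and unseen bigrams get probability 0 (no smoothing in this sketch).
    double probability(String prev, String next) {
        int context = contextCounts.getOrDefault(prev, 0);
        if (context == 0) {
            return 0.0;
        }
        int joint = bigramCounts.get(prev).getOrDefault(next, 0);
        return (double) joint / context;
    }

    public static void main(String[] args) {
        BigramMle model = new BigramMle();
        model.addSentence(Arrays.asList("the", "house", "is", "small"));
        model.addSentence(Arrays.asList("the", "house", "is", "big"));
        model.addSentence(Arrays.asList("the", "book", "is", "small"));
        // "the" occurs 3 times as a context and is followed by "house" twice.
        System.out.println("P(house | the) = " + model.probability("the", "house"));
    }
}
```

The sketch deliberately omits smoothing, so any bigram not seen in the training data receives probability zero.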

Format of Course

Every week, there will be one theoretical session, where we introduce the main concepts, methods and techniques, and one practical session, where we implement them in an incrementally growing program.

We will program in Java, hence some background is required (but there will be a short introduction).

Passing the Course

BN: You will need to complete the theoretical and the programming exercises, which we will be working on during the practical sessions. You may also need some additional time at home to finalize and polish your solutions.
AP: Same requirements as for BN, plus a project (which replaces the originally planned final written examination). The project topics to choose from will be announced at the end of January.

Schedule

Preliminary schedule:

12 Oct	Introduction and Overview (slides)
19 Oct	Probability theory (part I): lecture slides, lab session slides
26 Oct	Probability theory (part II): lecture slides (handout version), binomial Java code; homework exercises and the accompanying code
2 Nov	No session. Catch-up sessions doodle: https://doodle.com/poll/pfv3g6ea7v6xnqwe (please fill in the doodle by Wednesday, October 24!)
9 Nov	Language models: Bayes’ theorem, parameter estimation (maximum a posteriori, maximum likelihood), and n-gram models (lecture slides); homework exercises and the accompanying code. Note: extra session from 16:30 to 18:00.
16 Nov	IBM model I (lecture slides)
23 Nov	IBM model I (continued): Expectation-Maximization (lecture slides, complementary proofs); homework exercises (the part on perplexity updated on 26 November), the accompanying code, and the (partial) expected results. Note: extra session from 16:30 to 18:00.
30 Nov	Higher IBM models (2 & 3) (lecture slides, updated with answers)
7 Dec	Higher IBM models (continued): Homework 3 feedback, IBM 3 revisited, IBM 4 & 5 (lecture slides, complementary material); homework exercises and the accompanying code (deadline: Tuesday, 8 January 2019 (definitive)); notes on efficient EM
14 Dec	Phrase-based models: IBM 4 and 5 (slides); phrase-based translation (slides, phrase extraction algorithm)
21 Dec	Phrase-based models (continued): phrase-based translation continued (slides, complementary material); supplementary material related to Homework 4: Viterbi for IBM-1, writeMostProbableAlignments2File, writeTransProbTable2File
11 Jan	Decoding. Theoretical session: decoding, i.e., how to efficiently determine the best translation for a given sentence using a combination of the phrase-based model and the bigram language model (slides, complementary material). Homework exercises (concerning phrase extraction and parameter estimation within the context of the phrase-based model) and the accompanying code. UPDATE 14/01/2019: the complementary material extended with information on A* decoding.
18 Jan	Evaluation. Theoretical session: quality evaluation in SMT (slides, complementary material about Levenshtein alignment). Practical session: continue working on phrase extraction and phrase translation probability estimation. UPDATE 20/01/2019: added complementary material about the Levenshtein alignment.
25 Jan	Current trends in SMT. Theoretical session: selected approaches in neural MT (slides). Homework exercises (implementation of a simple decoding algorithm) and the accompanying code.
1 Feb	Catch-up & project. There will be no lecture; the remaining time will be dedicated to (i) finishing the last practical/theoretical exercises (if needed), and (ii) project presentation and discussions (for those who are interested in doing the project and getting AP).
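
As an illustration of the IBM model I and Expectation-Maximization sessions listed in the schedule, the following is a minimal sketch of EM training for lexical translation probabilities t(f | e) on a toy corpus. It is not the course's homework code: the class name Ibm1Sketch, the method train, and the three-sentence corpus are assumptions made for this example, and the NULL word of the standard model is omitted for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; a simplified IBM Model 1 without the NULL word.
class Ibm1Sketch {
    // Returns t such that t.get(e).get(f) approximates t(f | e).
    // foreign[s] and english[s] are the tokens of the s-th sentence pair.
    static Map<String, Map<String, Double>> train(String[][] foreign, String[][] english, int iterations) {
        Map<String, Map<String, Double>> t = new HashMap<>();
        // Initialization: the same positive constant for every co-occurring pair.
        // The first E-step normalizes, so this behaves like a uniform start.
        for (int s = 0; s < foreign.length; s++) {
            for (String e : english[s]) {
                for (String f : foreign[s]) {
                    t.computeIfAbsent(e, k -> new HashMap<>()).put(f, 1.0);
                }
            }
        }
        for (int it = 0; it < iterations; it++) {
            Map<String, Map<String, Double>> count = new HashMap<>();
            Map<String, Double> total = new HashMap<>();
            // E-step: distribute each foreign word over the English words of its sentence.
            for (int s = 0; s < foreign.length; s++) {
                for (String f : foreign[s]) {
                    double z = 0.0;
                    for (String e : english[s]) {
                        z += t.get(e).get(f);
                    }
                    for (String e : english[s]) {
                        double c = t.get(e).get(f) / z; // expected (fractional) count
                        count.computeIfAbsent(e, k -> new HashMap<>()).merge(f, c, Double::sum);
                        total.merge(e, c, Double::sum);
                    }
                }
            }
            // M-step: renormalize expected counts so that sum_f t(f | e) = 1 for each e.
            for (Map.Entry<String, Map<String, Double>> byE : count.entrySet()) {
                double norm = total.get(byE.getKey());
                for (Map.Entry<String, Double> byF : byE.getValue().entrySet()) {
                    t.get(byE.getKey()).put(byF.getKey(), byF.getValue() / norm);
                }
            }
        }
        return t;
    }

    public static void main(String[] args) {
        String[][] foreign = {{"das", "Haus"}, {"das", "Buch"}, {"ein", "Buch"}};
        String[][] english = {{"the", "house"}, {"the", "book"}, {"a", "book"}};
        Map<String, Map<String, Double>> t = train(foreign, english, 10);
        System.out.println("t(das | the) = " + t.get("the").get("das"));
    }
}
```

On this toy corpus, "das" co-occurs with "the" in two sentence pairs, so repeated EM iterations shift probability mass towards t(das | the) and away from the competing candidates.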

Project

Here is a potential project topic, which should give you some idea of what an SMT project can look like (and what the deliverables are).

Multiword expressions in SMT

There are other possible topics and you are encouraged to propose your own project topic.

Acknowledgments

A large portion of the material available on this page was originally created by Christian Wurm, Miriam Kaeshammer, Simon Petitjean, and Thomas Schoenemann.

This course draws heavily on the book Statistical Machine Translation by Philipp Koehn.