Toolchain - Leipzig Corpora Collection

Information about our toolchain

The standardized toolchain for corpus production consists of the following steps:

Web crawling
HTML stripping (or XML stripping for Wikipedia)
Document-based language identification
Sentence segmentation
Removal of duplicate sentences
Pattern-based sentence cleaning
Sentence-based language identification
Corpus production
- Tokenization and word indexing
- Word frequency calculation
- Word co-occurrence calculation
Optional post-processing (depending on the availability of the corresponding tools)
- Part-of-speech tagging
- Lemmatization
- Near-duplicate sentence detection and removal
- Word similarity based on co-occurrences
- Word similarity based on string similarity (Levenshtein)
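
To make a few of the steps above more concrete, here is a minimal Python sketch of exact duplicate removal, tokenization, word frequency counting, co-occurrence counting, and Levenshtein distance. It is an illustration only and not the toolchain's actual implementation: all function names are hypothetical, the regex tokenizer is a simplifying assumption, and co-occurrence is assumed here to mean "two words appear in the same sentence".

```python
from collections import Counter
from itertools import combinations
import re


def deduplicate(sentences):
    """Drop exact duplicate sentences, keeping first occurrences in order."""
    return list(dict.fromkeys(sentences))


def tokenize(sentence):
    """Naive tokenizer (an assumption): lowercased runs of word characters."""
    return re.findall(r"\w+", sentence.lower())


def word_frequencies(sentences):
    """Count how often each token occurs across all sentences."""
    return Counter(tok for s in sentences for tok in tokenize(s))


def sentence_cooccurrences(sentences):
    """Count unordered word pairs that appear together in one sentence."""
    pairs = Counter()
    for s in sentences:
        types = sorted(set(tokenize(s)))  # each word type once per sentence
        pairs.update(combinations(types, 2))
    return pairs


def levenshtein(a, b):
    """Edit distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Toy input; a real corpus would come from the crawling and cleaning steps.
raw = ["The cat sat.", "The dog sat.", "The cat sat."]
sentences = deduplicate(raw)
freq = word_frequencies(sentences)
cooc = sentence_cooccurrences(sentences)
```

On the toy input, `sentences` keeps two of the three lines, and `freq` and `cooc` are ordinary `Counter` objects that can be queried with `most_common()`. A production pipeline would likely need language-specific tokenization and a significance measure on top of the raw co-occurrence counts; both are left out of this sketch.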

You can find more details in our publications.