Information about our toolchain
The standardized toolchain for corpus production contains the following steps:
- Web crawling
- HTML stripping (or XML stripping for Wikipedia)
- Document-based language identification
- Sentence segmentation
- Removal of duplicate sentences
- Pattern-based sentence cleaning
- Sentence-based language identification
- Corpus production
  - Tokenization and word indexing
  - Word frequency calculation
  - Word co-occurrence calculation
- Optional post-processing (depending on the availability of the corresponding tools)
  - Part-of-speech tagging
  - Lemmatization
  - Near-duplicate sentence detection and removal
  - Word similarity based on co-occurrences
  - Word similarity based on string similarity (Levenshtein)
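A few of the steps above can be illustrated with a minimal Python sketch. This is not the toolchain's actual code: the function names, the whitespace tokenizer, and the use of raw co-occurrence counts are all assumptions made for illustration, and the real tools are likely language-specific and more elaborate.

```python
from collections import Counter
from itertools import combinations


def deduplicate(sentences):
    """Remove exact duplicate sentences, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for sentence in sentences:
        if sentence not in seen:
            seen.add(sentence)
            unique.append(sentence)
    return unique


def tokenize(sentence):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return sentence.lower().split()


def word_frequencies(sentences):
    """Count how often each token occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


def sentence_cooccurrences(sentences):
    """Count unordered word pairs that occur together in the same sentence.

    Pairs are stored as alphabetically sorted tuples; these are raw counts,
    without any significance measure.
    """
    pairs = Counter()
    for sentence in sentences:
        words = sorted(set(tokenize(sentence)))
        pairs.update(combinations(words, 2))
    return pairs


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    raw = ["the cat sat", "the dog sat", "the cat sat"]
    corpus = deduplicate(raw)
    print(corpus)
    print(word_frequencies(corpus).most_common(3))
    print(sentence_cooccurrences(corpus).most_common(3))
    print(levenshtein("kitten", "sitting"))
```

The functions are kept independent so that each one mirrors a single step of the list; in a real pipeline the tokenizer would be shared and replaced by a language-aware one.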
You can find more details in our publications.