To the documentation main page
Publications
How to cite the Leipzig Corpora Collection
For the whole collection, please cite the following general paper:
- Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012 (Download).
For another description focusing on the German collection (Deutscher Wortschatz):
- Uwe Quasthoff and Matthias Richter (2005): Projekt Deutscher Wortschatz, Babylonia 3-2005, p. 33-35 (Download).
How to cite a corpus
For citing a specific corpus, please use the following:
- Leipzig Corpora Collection (YEAR): CORPUS_DESCRIPTION. Leipzig Corpora Collection. Dataset. URL.
For example, for the corpus eng_news_2008:
- Leipzig Corpora Collection (2008): English newspaper corpus based on material from 2008. Leipzig Corpora Collection. Dataset. https://corpora.uni-leipzig.de?corpusId=eng_news_2008.
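The citation format above can be filled in programmatically when several corpora need to be cited. The following is a minimal sketch, not an official tool: the function name `format_corpus_citation` and its parameters are hypothetical, and the URL pattern is assumed from the single `eng_news_2008` example shown above.

```python
def format_corpus_citation(year: int, description: str, corpus_id: str) -> str:
    """Fill the corpus citation template with the given values."""
    # Assumption: the corpus URL follows the pattern seen in the eng_news_2008 example.
    url = f"https://corpora.uni-leipzig.de?corpusId={corpus_id}"
    # Drop a trailing period or space so the template's own period is not doubled.
    description = description.rstrip(". ")
    return (
        f"Leipzig Corpora Collection ({year}): {description}. "
        f"Leipzig Corpora Collection. Dataset. {url}."
    )


if __name__ == "__main__":
    print(
        format_corpus_citation(
            2008,
            "English newspaper corpus based on material from 2008",
            "eng_news_2008",
        )
    )
```

Called with the values from the example, the function reproduces the eng_news_2008 citation given above. For other corpora, the description and URL should be checked against the corpus page rather than taken from this sketch.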
Frequency Dictionaries
The series Frequency Dictionaries is published by Leipziger Universitätsverlag. All dictionaries follow the same scheme:
- The frequency dictionary is based on the word list of the largest corpus available for the corresponding language.
- A chapter on language statistics describes the alphabet, the distribution of vowels and consonants, syllable and word length, text coverage, etc.
- The most frequent words ordered by rank: the top 1,000 words in print and the top 1,000,000 words on the accompanying CD-ROM.
- The most frequent words ordered alphabetically: the top 10,000 words in print and the top 1,000,000 words on the accompanying CD-ROM.
- The word lists provided on the CD-ROM can be used freely under the Creative Commons Licence CC-BY 3.0.
The following frequency dictionaries are currently available. More information is provided by the publisher.
- Vol. 1: Frequency Dictionary German (2011)
- Vol. 2: Frequency Dictionary English (2012)
- Vol. 3: Frequency Dictionary Icelandic (2012)
- Vol. 4: Frequency Dictionary French (2013)
- Vol. 5: Frequency Dictionary Hungarian (2013)
- Vol. 6: Frequency Dictionary Esperanto (2014)
- Vol. 7: Frequency Dictionary Indonesian (2015)
- Vol. 8: Frequency Dictionary Ukrainian (2016)
- Vol. 9: Frequency Dictionary Russian (2017)
- Vol. 10: Frequency Dictionary Vietnamese (2018)
- Vol. 11: Frequency Dictionary Czech (2018)
- Vol. 12: Frequency Dictionary Georgian (2018)
- Vol. 13: Frequency Dictionary Afrikaans (2019)
- Vol. 14: Frequency Dictionary Zulu (2020)
- Vol. 15: Frequency Dictionary Danish (2021)
List of publications (selection)
- Biemann, Chris; Bordag, Stefan; Heyer, Gerhard; Quasthoff, Uwe and Wolff, Christian (2004): Language-independent Methods for Compiling Monolingual Lexical Data. In: Proceedings of CicLING 2004, Seoul, Korea and Springer LNCS 2945, pp. 215-228, Springer Verlag Berlin Heidelberg (Download).
- Biemann, Chris; Heyer, Gerhard; Quasthoff, Uwe and Richter, Matthias (2007): The Leipzig Corpora Collection – Monolingual corpora of standard size. In: Proceedings of Corpus Linguistics 2007, Birmingham, UK, 2007.
- Bosch, Sonja; Eckart, Thomas; Klimek, Bettina; Goldhahn, Dirk and Quasthoff, Uwe (2018): Preparation and Usage of Xhosa Lexicographical Data for a Multilingual, Federated Environment. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki (Japan), 2018.
- Eckart, Thomas; Quasthoff, Uwe and Goldhahn, Dirk (2012): Language Statistics-Based Quality Assurance for Large Corpora. In: Proceedings of Asia Pacific Corpus Linguistics Conference 2012, Auckland, New Zealand, 2012.
- Eckart, Thomas and Quasthoff, Uwe (2010): Statistical Corpus and Language Comparison Using Comparable Corpora. In: Workshop on Building and Using Comparable Corpora, LREC, Malta, 2010.
- Goldhahn, Dirk; Eckart, Thomas and Quasthoff, Uwe (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the 8th Language Resources and Evaluation Conference (LREC) 2012.
- Hallsteinsdóttir, Erla; Eckart, Thomas; Biemann, Chris; Quasthoff, Uwe and Richter, Matthias (2007): Íslenskur Orðasjóður - Building a Large Icelandic Corpus. In: Proceedings of NODALIDA-07, Tartu, Estonia, 2007 (Download).
- Quasthoff, Uwe; Richter, Matthias and Biemann, Chris (2006): Corpus Portal for Search in Monolingual Corpora. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa (Italy), 2006.
- Quasthoff, Uwe; Goldhahn, Dirk and Eckart, Thomas (2015): Building Large Resources for Text Mining: The Leipzig Corpora Collection. In: Text Mining - From Ontology Learning to Automated Text Processing Applications, Springer, 2015.
- Richter, Matthias; Quasthoff, Uwe; Hallsteinsdóttir, Erla and Biemann, Chris (2006): Exploiting the Leipzig Corpora Collection. In: Proceedings of IS-LTC'06, Ljubljana, Slovenia, 2006 (Download).