Publications


How to cite the Leipzig Corpora Collection

For the whole collection, please cite the following general paper:

Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012 (Download).

For another description focusing on the German collection (Deutscher Wortschatz):

Uwe Quasthoff and Matthias Richter (2005): Projekt Deutscher Wortschatz, Babylonia 3-2005, pp. 33-35 (Download).

How to cite a corpus

To cite a specific corpus, please use the following format:

Leipzig Corpora Collection (YEAR): CORPUS_DESCRIPTION. Leipzig Corpora Collection. Dataset. URL.

For example, for the corpus eng_news_2008:

Leipzig Corpora Collection (2008): English newspaper corpus based on material from 2008. Leipzig Corpora Collection. Dataset. https://corpora.uni-leipzig.de?corpusId=eng_news_2008.
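
When many corpora have to be cited, the template above can be filled in programmatically. The following Python sketch is only an illustration: the function name format_corpus_citation is hypothetical, and the URL pattern is an assumption generalised from the single eng_news_2008 example, so the resulting link should be checked for each corpus.

```python
def format_corpus_citation(year: int, description: str, corpus_id: str) -> str:
    """Fill the corpus citation template for a single corpus.

    Assumption: the URL follows the pattern seen in the eng_news_2008
    example; verify it for any other corpus before publishing.
    """
    url = f"https://corpora.uni-leipzig.de?corpusId={corpus_id}"
    return (
        f"Leipzig Corpora Collection ({year}): {description}. "
        f"Leipzig Corpora Collection. Dataset. {url}."
    )


# Reproduce the example citation for eng_news_2008.
citation = format_corpus_citation(
    2008,
    "English newspaper corpus based on material from 2008",
    "eng_news_2008",
)
print(citation)
```

The year and the corpus description are passed in explicitly rather than derived from the corpus identifier, since the description wording is chosen by the person citing.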

Frequency Dictionaries

The series Frequency Dictionaries is published by Leipziger Universitätsverlag. All dictionaries follow the same scheme:

The frequency dictionary is based on the word list of the largest corpus available for the corresponding language.
A chapter on language statistics describes the alphabet, the distribution of vowels and consonants, syllable and word length, text coverage, etc.
The most frequent words ordered by rank: the top 1,000 words in print and the top 1,000,000 words on the accompanying CD-ROM.
The most frequent words ordered alphabetically: the top 10,000 words in print and the top 1,000,000 words on the accompanying CD-ROM.
The word lists provided on the CD-ROM can be used freely under the Creative Commons Licence CC-BY 3.0.
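
To make the terms in the scheme above concrete, here is a minimal Python sketch of a rank-ordered word list, an alphabetical word list, and text coverage (the share of all tokens accounted for by the top-N words). It is not the procedure used to produce the dictionaries: the whitespace tokenisation, the toy sentence, and the function names rank_ordered and text_coverage are assumptions made purely for illustration.

```python
from collections import Counter


def rank_ordered(counts: Counter) -> list[tuple[str, int]]:
    """Return (word, frequency) pairs, most frequent first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def text_coverage(counts: Counter, top_n: int) -> float:
    """Fraction of all tokens covered by the top_n most frequent words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(freq for _, freq in rank_ordered(counts)[:top_n])
    return covered / total


# Toy input; a real word list would come from a large corpus.
tokens = "the cat sat on the mat and the dog sat too".split()
counts = Counter(tokens)

by_rank = rank_ordered(counts)          # list ordered by rank
alphabetical = sorted(counts)           # list ordered alphabetically
coverage_top2 = text_coverage(counts, 2)

print(by_rank[:3])
print(alphabetical[:3])
print(f"coverage of top 2 words: {coverage_top2:.2%}")
```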

The following frequency dictionaries are currently available. More information is provided by the publisher.

Vol. 1: Frequency Dictionary German (2011)
Vol. 2: Frequency Dictionary English (2012)
Vol. 3: Frequency Dictionary Icelandic (2012)
Vol. 4: Frequency Dictionary French (2013)
Vol. 5: Frequency Dictionary Hungarian (2013)
Vol. 6: Frequency Dictionary Esperanto (2014)
Vol. 7: Frequency Dictionary Indonesian (2015)
Vol. 8: Frequency Dictionary Ukrainian (2016)
Vol. 9: Frequency Dictionary Russian (2017)
Vol. 10: Frequency Dictionary Vietnamese (2018)
Vol. 11: Frequency Dictionary Czech (2018)
Vol. 12: Frequency Dictionary Georgian (2018)
Vol. 13: Frequency Dictionary Afrikaans (2019)
Vol. 14: Frequency Dictionary Zulu (2020)
Vol. 15: Frequency Dictionary Danish (2021)

List of publications (selection)

Biemann, Chris; Bordag, Stefan; Heyer, Gerhard; Quasthoff, Uwe and Wolff, Christian (2004): Language-independent Methods for Compiling Monolingual Lexical Data. In: Proceedings of CICLing 2004, Seoul, Korea and Springer LNCS 2945, pp. 215-228, Springer Verlag Berlin Heidelberg (Download).
Biemann, Chris; Heyer, Gerhard; Quasthoff, Uwe and Richter, Matthias (2007): The Leipzig Corpora Collection – Monolingual corpora of standard size. In: Proceedings of Corpus Linguistics 2007, Birmingham, UK, 2007.
Bosch, Sonja; Eckart, Thomas; Klimek, Bettina; Goldhahn, Dirk and Quasthoff, Uwe (2018): Preparation and Usage of Xhosa Lexicographical Data for a Multilingual, Federated Environment. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki (Japan), 2018.
Eckart, Thomas; Quasthoff, Uwe and Goldhahn, Dirk (2012): Language Statistics-Based Quality Assurance for Large Corpora. In: Proceedings of Asia Pacific Corpus Linguistics Conference 2012, Auckland, New Zealand, 2012.
Eckart, Thomas and Quasthoff, Uwe (2010): Statistical Corpus and Language Comparison Using Comparable Corpora. In: Workshop on Building and Using Comparable Corpora, LREC, Malta, 2010.
Goldhahn, Dirk; Eckart, Thomas and Quasthoff, Uwe (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the 8th Language Resources and Evaluation Conference (LREC) 2012.
Hallsteinsdóttir, Erla; Eckart, Thomas; Biemann, Chris; Quasthoff, Uwe and Richter, Matthias (2007): Íslenskur Orðasjóður - Building a Large Icelandic Corpus. In: Proceedings of NODALIDA-07, Tartu, Estonia, 2007 (Download).
Quasthoff, Uwe; Richter, Matthias and Biemann, Chris (2006): Corpus Portal for Search in Monolingual Corpora. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa (Italy), 2006.
Quasthoff, Uwe; Goldhahn, Dirk and Eckart, Thomas (2015): Building Large Resources for Text Mining: The Leipzig Corpora Collection. In: Text Mining - From Ontology Learning to Automated Text Processing Applications, Springer, 2015.
Richter, Matthias; Quasthoff, Uwe; Hallsteinsdóttir, Erla and Biemann, Chris (2006): Exploiting the Leipzig Corpora Collection. In: Proceedings of IS-LTC'06, Ljubljana, Slovenia, 2006 (Download).
