The Leipzig Corpora Collection provides different tools and data for download, which are protected by copyright. For more details please refer to our terms of usage.
The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. All data are available as plain text files and can be imported into a MySQL database by using the provided import script. They are intended both for scientific use by corpus linguists and for applications such as knowledge extraction programs.
The corpora are identical in format and similar in size and content. They contain randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. Non-sentences and foreign language material were removed. Because word co-occurrence information is useful for many applications, these data are precomputed and included as well. For each word, the most significant words appearing as its immediate left or right neighbor, or appearing anywhere within the same sentence, are given. More information about the format and content of these files can be found here.
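To make the two kinds of co-occurrence concrete, the following is a minimal sketch that counts neighbor pairs and same-sentence pairs over a list of sentences. It is only an illustration: the function name count_cooccurrences, the whitespace tokenization, and the use of raw counts are assumptions made here. The precomputed data rank words by a significance measure that this page does not specify, so raw counts stand in for it.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(sentences):
    """Count neighbor and same-sentence co-occurrences (raw counts only).

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which is an assumption and not the collection's
    actual tokenizer.
    """
    neighbour_pairs = Counter()  # keyed by (left word, right word)
    sentence_pairs = Counter()   # keyed by an alphabetically sorted word pair
    for sentence in sentences:
        tokens = sentence.split()
        # Adjacent tokens: each pair is (word, its immediate right neighbor).
        neighbour_pairs.update(zip(tokens, tokens[1:]))
        # Distinct words sharing a sentence, counted once per sentence.
        sentence_pairs.update(combinations(sorted(set(tokens)), 2))
    return neighbour_pairs, sentence_pairs


if __name__ == "__main__":
    neighbours, in_sentence = count_cooccurrences(["the cat sat", "the cat ran"])
    print(neighbours.most_common(3))
    print(in_sentence.most_common(3))
```

Looking up a word's left neighbors then amounts to filtering neighbour_pairs for keys whose second element is that word, and its right neighbors for keys whose first element is that word.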
The corpora are automatically collected from carefully selected public sources without considering in detail the content of the contained text. No responsibility is taken for the content of the data. In particular, the views and opinions expressed in specific parts of the data remain exclusively with the authors.
If you use one of these corpora in your work we kindly ask you to cite this paper as
D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.
In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'12), 2012
To download a corpus please select a language. You will be forwarded to the download site.
SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining, and similar tasks. It lists positive and negative polarity-bearing words, each weighted within the interval [-1; 1], together with their part-of-speech tag and, where applicable, their inflections. The current version of SentiWS contains around 1,650 positive and 1,800 negative words, which amount to around 16,000 positive and 18,000 negative word forms once inflections are included. It contains not only adjectives and adverbs that explicitly express a sentiment, but also nouns and verbs that implicitly carry one.
SentiWS is organised in two UTF-8-encoded text files structured as follows:
<Word>|<POS tag> \t <Polarity weight> \t <Infl_1>,...,<Infl_k> \n
where \t denotes a tab, and \n denotes a new line.
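As an illustration of this line format, the following minimal sketch parses a single line into its four parts. The names SentiWSEntry and parse_sentiws_line are hypothetical, and the sample lines are made up for demonstration rather than taken from the SentiWS files. The sketch also assumes that the inflection field may be absent or empty for words without inflections.

```python
from typing import List, NamedTuple


class SentiWSEntry(NamedTuple):
    """One parsed line (hypothetical container, not part of SentiWS)."""
    word: str
    pos: str
    weight: float
    inflections: List[str]


def parse_sentiws_line(line: str) -> SentiWSEntry:
    """Parse '<Word>|<POS tag> \\t <Polarity weight> \\t <Infl_1>,...,<Infl_k>'."""
    fields = line.rstrip("\n").split("\t")
    # The first field holds the word and its POS tag, separated by '|'.
    word, _, pos = fields[0].strip().partition("|")
    # The second field is the polarity weight in [-1, 1].
    weight = float(fields[1])
    # The third field is optional here (assumption): comma-separated inflections.
    raw = fields[2] if len(fields) > 2 else ""
    inflections = [form.strip() for form in raw.split(",") if form.strip()]
    return SentiWSEntry(word, pos, weight, inflections)


if __name__ == "__main__":
    # Made-up sample line, for illustration only.
    print(parse_sentiws_line("Beispiel|NN\t0.25\tBeispiels,Beispiele\n"))
```

To read a whole file, open it with encoding="utf-8" and apply parse_sentiws_line to each non-empty line.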
SentiWS is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 Unported License.
If you use SentiWS in your work we kindly ask you to cite this paper as
R. Remus, U. Quasthoff & G. Heyer: SentiWS - a Publicly Available German-language Resource for Sentiment Analysis.
In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'10), pp. 1168-1171, 2010
TinyCC 2.0 is a text corpus production engine that can be used to produce corpora in the Leipzig Corpora Collection (LCC) format.
Documentation and download:
TinyCC 2.0
The ASV Toolbox is a modular collection of tools for the exploration of written language data. It was created at the NLP Group, Leipzig University, and is no longer actively developed.
Download at the Language Technology Group, University of Hamburg:
ASV Toolbox