Downloads – Wortschatz Leipzig

Corpus downloads

The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. All data are available as plain text files and can be imported into a MySQL database using the provided import script. They are intended both for scientific use by corpus linguists and for applications such as knowledge extraction programs.

The corpora are identical in format and similar in size and content. They contain randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. Non-sentences and foreign-language material were removed. Because word co-occurrence information is useful for many applications, these data are precomputed and included as well. For each word, the most significant words appearing as immediate left or right neighbor or appearing anywhere within the same sentence are given. More information about the format and content of these files can be found here.
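To illustrate the three kinds of co-occurrence mentioned above (left neighbor, right neighbor, same sentence), here is a minimal Python sketch. It is not the collection's own procedure: it assumes whitespace tokenization and ranks by raw counts, whereas the precomputed data rank neighbors by a significance measure that is not described on this page. The function name `cooccurrence_counts` is made up for this example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_counts(sentences):
    """Count left-neighbor, right-neighbor and same-sentence co-occurrences.

    Assumption: tokens are separated by whitespace. Raw counts stand in for
    the significance measure used in the real precomputed data.
    """
    left = defaultdict(Counter)      # left[w][v]: v occurs directly before w
    right = defaultdict(Counter)     # right[w][v]: v occurs directly after w
    sentence = defaultdict(Counter)  # sentence[w][v]: w and v share a sentence

    for text in sentences:
        tokens = text.split()
        # Immediate neighbors: walk over adjacent token pairs.
        for prev, nxt in zip(tokens, tokens[1:]):
            right[prev][nxt] += 1
            left[nxt][prev] += 1
        # Sentence co-occurrence: each pair of distinct word types counts
        # once per sentence, in both directions.
        for a, b in permutations(set(tokens), 2):
            sentence[a][b] += 1

    return left, right, sentence


if __name__ == "__main__":
    left, right, sentence = cooccurrence_counts(["the cat sat", "the dog sat"])
    print(right["the"].most_common())
    print(sentence["the"]["sat"])
```

Replacing the raw counts with a significance score would bring the sketch closer to the data described above, but the choice of measure is left open here.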

The corpora are automatically collected from carefully selected public sources without considering in detail the content of the contained text. No responsibility is taken for the content of the data. In particular, the views and opinions expressed in specific parts of the data remain exclusively with the authors.

If you use one of these corpora in your work we kindly ask you to cite this paper as

D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.
In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), 2012.

To download a corpus please select a language. You will be forwarded to the download site.

German English French Arabic Russian

A Asturian Afrikaans Afar Arabic Armenian Azerbaijani Albanian Aragonese Achinese Anaang Acoli Assamese Amharic Akan Aymara
B Bulgarian Bashkir Belarusian Burmese Bengali Buginese Bosnian Banjar Breton Basque Bavarian Bambara Bishnupriya Balinese Buriat Bemba (Zambia)
C Chinese Czech Catalan Croatian Cebuano Central Khmer Chuvash Corsican Chechen Central Bikol Central Kurdish
D Danish Dutch Dhivehi Dimli Dyula Dari
E English Esperanto Estonian Egyptian Arabic Extremaduran Eastern Mari Ewe Emiliano-Romagnolo Erzya Eastern Yiddish Eastern Maninkakan
F French Finnish Faroese Fiji Hindi Fulah Fon
G German Ganda Gujarati Georgian Galician Guarani Gilaki Goan Konkani Gan Chinese
H Hungarian Hindi Hebrew Haitian Halh Mongolian Hausa Hiligaynon
I Indonesian Iranian Persian Icelandic Italian Igbo Irish Iloko Ido Interlingua Ibibio Interlingue
J Javanese Japanese
K Kannada Kurdish Kölsch Kalaallisut Kazakh Kashubian Korean Kirghiz Kashmiri Kinyarwanda Kabiyè Kabardian Konkani Karachay-Balkar Komi-Permyak Kituba (Congo) Komi Kabyle Kikuyu Koongo Kongo Kabuverdianu
L Lithuanian Latvian Low German Limburgan Luxembourgish Lower Sorbian Latin Lombard Lushai Lao Ladino Lumbu Lugbara Lomwe Ligurian Lingala
M Marathi Modern Greek Malay Min Nan Chinese Macedonian Mirandese Mandarin Chinese Maltese Mongolian Mazanderani Maori Manx Malayalam Malagasy Mingrelian Minangkabau Maithili Madurese Mossi Makonde Min Dong Chinese
N Norwegian Bokmål Norwegian Newari Nepali Norwegian Nynorsk North Azerbaijani Northern Uzbek Northern Sami Neapolitan Nyanja Ndonga Northern Frisian Nyankole Navajo Nigerian Pidgin
O Occitan Oriya Ossetian Oromo
P Polish Panjabi Portuguese Papiamento Persian Pushto Pfaelzisch Pedi Plateau Malagasy Pulaar Piemontese Pontic Pangasinan Pampanga
Q Quechua
R Russian Romanian Rundi Rusyn Romansh Romany
S Spanish Slovenian Swedish Silesian Sicilian Serbian Slovak Sundanese Swiss German Scots Sinhala Sardinian Standard Estonian Swahili Sindhi Seraiki Serbo-Croatian Somali Samogitian Sukuma Southern Sotho Sanskrit South Ndebele Standard Latvian Shona Swati Susu Sena Sami S'gaw Karen Swahili Soninke
T Tamil Tajik Telugu Turkish Tosk Albanian Thai Turkmen Tatar Tiv Tumbuka Tagalog Timne Tibetan Tulu Tuvinian Tsonga Tswana Tigrinya
U Urdu Ukrainian Uzbek Uighur Upper Sorbian Udmurt
V Venda Venetian Vietnamese Volapük Vlaams Võro
W Western Frisian Wu Chinese Western Mari Waray (Philippines) Walloon Welsh Western Panjabi Wolof
X Xhosa
Y Yiddish Yakut Yoruba
Z Zulu Zeeuws Zhuang

SentiWS

SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining, and similar tasks. It lists positive and negative polarity-bearing words, each weighted within the interval [-1; 1], together with their part-of-speech tag and, if applicable, their inflections. The current version of SentiWS contains around 1,650 positive and 1,800 negative words, which amount to around 16,000 positive and 18,000 negative word forms when inflections are included. It contains not only adjectives and adverbs that explicitly express a sentiment, but also nouns and verbs that implicitly carry one.

SentiWS is organised in two UTF-8-encoded text files structured as follows:
<Word>|<POS tag> \t <Polarity weight> \t <Infl_1>,...,<Infl_k> \n
where \t denotes a tab, and \n denotes a new line.
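The line format above can be read with a few lines of Python. The following is a minimal sketch, not an official SentiWS loader: the names `SentiWSEntry`, `parse_sentiws_line` and `build_lookup` are hypothetical, the sample line in the usage part is invented for illustration rather than taken from the resource, and the inflection field is treated as optional because inflections are listed only "if applicable".

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class SentiWSEntry:
    word: str
    pos: str
    weight: float
    inflections: List[str] = field(default_factory=list)


def parse_sentiws_line(line: str) -> SentiWSEntry:
    """Parse one line of the form <Word>|<POS tag>\\t<weight>\\t<Infl_1>,...,<Infl_k>."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"expected at least two tab-separated fields: {line!r}")

    # First field: base form and part-of-speech tag, separated by '|'.
    word, sep, pos = fields[0].strip().partition("|")
    if not sep:
        raise ValueError(f"missing '|' between word and POS tag: {line!r}")

    weight = float(fields[1].strip())

    # Third field (assumed optional): comma-separated inflected forms.
    inflections: List[str] = []
    if len(fields) > 2 and fields[2].strip():
        inflections = [f.strip() for f in fields[2].split(",") if f.strip()]

    return SentiWSEntry(word, pos, weight, inflections)


def build_lookup(lines: Iterable[str]) -> Dict[str, float]:
    """Map every lower-cased base form and inflection to its polarity weight."""
    lookup: Dict[str, float] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = parse_sentiws_line(line)
        for form in [entry.word, *entry.inflections]:
            lookup[form.lower()] = entry.weight
    return lookup


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    sample = "Beispiel|NN\t0.5\tBeispiele,Beispiels\n"
    print(parse_sentiws_line(sample))
```

Lower-casing the keys in `build_lookup` is a design choice of this sketch; it simplifies lookups but merges forms that differ only in capitalisation.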

SentiWS is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 Unported License.
If you use SentiWS in your work we kindly ask you to cite this paper as

R. Remus, U. Quasthoff & G. Heyer: SentiWS - a Publicly Available German-language Resource for Sentiment Analysis.
In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), pp. 1168-1171, 2010.

Downloads

TinyCC

TinyCC 2.0 is a Perl script that can be used to create text corpora.

To the documentation and download…

ASV Toolbox

The ASV Toolbox is a collection of different tools for analysing written language. It was developed at the Department of Automatic Speech Processing at the University of Leipzig and is no longer being developed further. It can be downloaded from the Language Technology Group, Universität Hamburg.

To the ASV Toolbox download…