Corpora

List of (largely) unlabeled text collections.

SoNaR-500: over 500 million words from different domains and genres, automatically tokenized, POS-tagged and lemmatised, with named entities automatically extracted.
SoNaR-1: 1 million words, largely drawn from SoNaR-500, with manually annotated named entities, coreferences, and spatial and temporal relations.
SoNaR Nieuwe Media Corpus 1.0: tweets, chat messages and SMS messages.