
List of (largely) unlabeled text collections.

  • SoNaR-500: over 500 million words from different domains and genres, automatically tokenized, POS-tagged, and lemmatized, with named entities automatically extracted.
  • SoNaR-1: 1 million words, largely from SoNaR-500, with manually annotated named entities, coreferences, and spatial and temporal relations.
  • SoNaR Nieuwe Media Corpus 1.0: tweets, chat messages, and SMS messages.