Corpora
List of (largely) unlabeled text collections.
- SoNaR-500: over 500 million words from different domains and genres, automatically tokenized, POS-tagged, lemmatised, and annotated with named entities.
- SoNaR-1: 1 million words, largely from SoNaR-500, with manually annotated named entities, coreferences, and spatial and temporal relations.
- SoNaR Nieuwe Media Corpus 1.0: tweets, chat and SMS messages.