Welcome to NLP Resources for Dutch’s documentation!

Tools

List of NLP software for Dutch.

Models

List of pretrained word embeddings and other models.

Word embeddings

  • fastText: fastText word vectors trained on Common Crawl and Wikipedia.
  • Word2Vec: Word2Vec vectors trained by CLIPS on different Dutch corpora.
  • Word2Vec: Word2Vec vectors trained by the Nordic Language Processing Laboratory (NLPL) on the CoNLL17 corpus.
  • ConceptNet Numberbatch: multilingual word embeddings in the same semantic space. Built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.
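As a quick illustration of how such pretrained vectors can be used, below is a minimal sketch that loads the Dutch fastText vectors with gensim and queries nearest neighbours. The filename cc.nl.300.vec and the query word are assumptions; the vectors themselves can be downloaded from https://fasttext.cc/docs/en/crawl-vectors.html.

    # Minimal sketch: load the Dutch fastText vectors with gensim and query them.
    # Assumes cc.nl.300.vec has been downloaded and unpacked from
    # https://fasttext.cc/docs/en/crawl-vectors.html (the path is an assumption).
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("cc.nl.300.vec", binary=False)

    # Nearest neighbours of "fiets" (bicycle) in the embedding space.
    print(vectors.most_similar("fiets", topn=5))

The same loading pattern works for the other word-level embeddings listed above, as long as they are distributed in the standard word2vec text or binary format.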

ULMFiT

BERT

  • BERTje: Dutch pre-trained BERT model developed at the University of Groningen.
  • RobBERT: Dutch BERT model pre-trained with the RoBERTa procedure, developed at KU Leuven. Trained on the Dutch section of the OSCAR corpus, which consists of scraped web data. Four smaller, distilled models are also available.
  • BERT-NL: cased and uncased BERT model trained on the SoNaR corpus by the Text Mining and Retrieval research group at Leiden University.
  • mBERT: Multilingual BERT model.
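These models can typically be loaded through the Hugging Face transformers library. The sketch below uses BERTje as an example; the model identifier GroNLP/bert-base-dutch-cased reflects how the model is published on the Hugging Face Hub, but treat it as an assumption and check the model card before relying on it.

    # Minimal sketch: load BERTje via Hugging Face transformers and encode a sentence.
    # The model id "GroNLP/bert-base-dutch-cased" is assumed from the Hub listing.
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
    model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")

    inputs = tokenizer("Het is een mooie dag.", return_tensors="pt")
    outputs = model(**inputs)

    # Contextual token embeddings: shape (1, sequence_length, 768).
    print(outputs.last_hidden_state.shape)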

Corpora

List of (largely) unlabeled text collections.

  • SoNaR-500: over 500 million words from different domains and genres, automatically tokenized, POS-tagged and lemmatised, with named entities extracted.
  • SoNaR-1: 1 million words, largely drawn from SoNaR-500, with manually annotated named entities, coreferences, and spatial and temporal relations.
  • SoNaR Nieuwe Media Corpus 1.0: tweets, chat messages and SMS messages.
