Models

List of pretrained word embeddings and other models.

Word embeddings

  • fastText: fastText word vectors trained on Common Crawl and Wikipedia (a loading sketch follows this list).
  • Word2Vec: Word2Vec vectors trained by CLIPS on different Dutch corpora.
  • Word2Vec: Word2Vec vectors trained by the Nordic Language Processing Laboratory (NLPL) on the CoNLL17 corpus.
  • ConceptNet Numberbatch: multilingual word embeddings that share a single semantic space, built with an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.
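
As a minimal sketch of how these vectors can be used, the snippet below loads the Dutch fastText vectors with gensim. The file name cc.nl.300.vec and the query words are illustrative; any of the embeddings above that are distributed in word2vec text format can be loaded the same way.

```python
from gensim.models import KeyedVectors

# Illustrative path: assumes the Dutch fastText vectors have been
# downloaded and unpacked locally (e.g. as cc.nl.300.vec).
vectors = KeyedVectors.load_word2vec_format("cc.nl.300.vec", binary=False)

# Nearest neighbours of "fiets" (bicycle) in the embedding space.
print(vectors.most_similar("fiets", topn=5))

# Cosine similarity between "hond" (dog) and "kat" (cat).
print(vectors.similarity("hond", "kat"))
```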

ULMFiT

BERT

  • BERTje: Dutch pre-trained BERT model developed at the University of Groningen (a loading sketch follows this list).
  • RobBERT: Dutch BERT model pre-trained with RoBERTa's training regime, developed at KU Leuven and trained on the Dutch section of the web-scraped OSCAR corpus. Four smaller, distilled models are also available.
  • BERT-NL: cased and uncased BERT models trained on the SoNaR corpus by the Text Mining and Retrieval research group at Leiden University.
  • mBERT: Multilingual BERT model.
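
As a minimal sketch, the snippet below loads BERTje with the Hugging Face transformers library and extracts contextual embeddings for a Dutch sentence. The Hub identifier GroNLP/bert-base-dutch-cased is an assumption about the published model name; RobBERT and mBERT can be loaded the same way by swapping in their identifiers.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed Hub identifier for BERTje; substitute a RobBERT or mBERT
# identifier to load those models instead.
model_name = "GroNLP/bert-base-dutch-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a Dutch example sentence and run it through the encoder.
inputs = tokenizer("De kat zit op de mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)word token: (batch, tokens, hidden size).
print(outputs.last_hidden_state.shape)
```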