Models

List of pretrained word embeddings and other models.

Word embeddings

  • fastText: fastText word vectors trained on Common Crawl and Wikipedia (a loading sketch follows this list).
  • Word2Vec: Word2Vec vectors trained by CLIPS on different Dutch corpora.
  • Word2Vec: Word2Vec vectors trained by the Nordic Language Processing Laboratory (NLPL) on the CoNLL17 corpus.
  • ConceptNet Numberbatch: multilingual word embeddings that share a single semantic space, built with an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.
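
As a minimal sketch of how these vectors can be used, the snippet below loads the Dutch fastText vectors with gensim. The file name cc.nl.300.vec and the query words are illustrative; any of the embeddings above that are distributed in word2vec text format can be loaded the same way.

```python
from gensim.models import KeyedVectors

# Illustrative path: assumes the Dutch fastText vectors have been
# downloaded and unpacked locally (e.g. as cc.nl.300.vec).
vectors = KeyedVectors.load_word2vec_format("cc.nl.300.vec", binary=False)

# Nearest neighbours of "fiets" (bicycle) in the embedding space.
print(vectors.most_similar("fiets", topn=5))

# Cosine similarity between "hond" (dog) and "kat" (cat).
print(vectors.similarity("hond", "kat"))
```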

ULMFiT

BERT

  • BERTje: Dutch pre-trained BERT model developed at the University of Groningen (a loading sketch follows this list).
  • RobBERT: Dutch BERT model pre-trained with RoBERTa's training regime, developed at KU Leuven and trained on the Dutch section of the web-scraped OSCAR corpus. Four smaller, distilled models are also available.
  • BERT-NL: cased and uncased BERT models trained on the SoNaR corpus by the Text Mining and Retrieval research group at Leiden University.
  • mBERT: Multilingual BERT model.
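
As a minimal sketch, the snippet below loads BERTje with the Hugging Face transformers library and extracts contextual embeddings for a Dutch sentence. The Hub identifier GroNLP/bert-base-dutch-cased is an assumption about the published model name; RobBERT and mBERT can be loaded the same way by swapping in their identifiers.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed Hub identifier for BERTje; substitute a RobBERT or mBERT
# identifier to load those models instead.
model_name = "GroNLP/bert-base-dutch-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a Dutch example sentence and run it through the encoder.
inputs = tokenizer("De kat zit op de mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)word token: (batch, tokens, hidden size).
print(outputs.last_hidden_state.shape)
```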