Models
A list of pretrained word embeddings and other language models for Dutch.
Word embeddings
- fastText: fastText word vectors trained on Common Crawl and Wikipedia (see the loading sketch after this list).
- Word2Vec: Word2Vec vectors trained by CLIPS on different Dutch corpora.
- Word2Vec: Word2Vec vectors trained by the Nordic Language Processing Laboratory (NLPL) on the CoNLL17 corpus.
- ConceptNet Numberbatch: multilingual word embeddings aligned in a single semantic space, built from an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.
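As a quick illustration, here is a minimal sketch of loading the Dutch fastText vectors with gensim and querying nearest neighbours. The file name `cc.nl.300.vec` matches fastText's distribution naming, but the path and the example words are assumptions for illustration.

```python
from gensim.models import KeyedVectors

# Assumed path: the Dutch fastText vectors are distributed as
# cc.nl.300.vec(.gz); download and unpack them first.
vectors = KeyedVectors.load_word2vec_format("cc.nl.300.vec", binary=False)

# Nearest neighbours in the embedding space for an example word.
print(vectors.most_similar("fiets", topn=5))

# Cosine similarity between two example words.
print(vectors.similarity("fiets", "auto"))
```

The same loader should work for the CLIPS and NLPL Word2Vec downloads, since they use the standard word2vec format; pass `binary=True` for `.bin` files.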
ULMFiT
- Leiden fastai ULMFiT model: Trained on Wikipedia by the Text Mining and Retrieval research group at Leiden University.
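A sketch of reusing such a pretrained ULMFiT model with fastai might look like the following. The CSV path, text column, and pretrained file names are placeholders; substitute the weights and vocabulary files as distributed with the Leiden model.

```python
import pandas as pd
from fastai.text.all import TextDataLoaders, language_model_learner, AWD_LSTM

# Assumed: a DataFrame with a "text" column of Dutch documents.
df = pd.read_csv("dutch_texts.csv")
dls = TextDataLoaders.from_df(df, text_col="text", is_lm=True)

# Assumed file names: the pretrained weights (.pth) and vocabulary (.pkl),
# given without extensions, must sit in the learner's models/ directory.
learn = language_model_learner(
    dls, AWD_LSTM,
    pretrained_fnames=["dutch_wt", "dutch_vocab"],
)
learn.fit_one_cycle(1, 2e-3)  # fine-tune the language model on the new corpus
```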
BERT
- BERTje: Dutch pre-trained BERT model developed at the University of Groningen (see the loading sketch after this list).
- RobBERT: Dutch BERT model using RoBERTa's pre-training procedure, developed by KU Leuven. Trained on the Dutch section of the web-scraped OSCAR corpus. Four smaller, distilled models are also available.
- BERT-NL: cased and uncased BERT models trained on the SoNaR corpus by the Text Mining and Retrieval research group at Leiden University.
- mBERT: Google's multilingual BERT model, trained on Wikipedia in 104 languages, including Dutch.
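The BERT-family models above can be loaded through the Hugging Face transformers library. The sketch below uses the hub IDs commonly associated with these models (`GroNLP/bert-base-dutch-cased` for BERTje, `pdelobelle/robbert-v2-dutch-base` for RobBERT); verify the IDs on the hub before relying on them.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed hub ID for BERTje; swap in e.g. "pdelobelle/robbert-v2-dutch-base"
# for RobBERT or "bert-base-multilingual-cased" for mBERT.
model_id = "GroNLP/bert-base-dutch-cased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Dit is een voorbeeldzin.", return_tensors="pt")
outputs = model(**inputs)

# Contextual token embeddings: (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```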