Welcome to NLP Resources for Dutch’s documentation!¶
Tools¶
List of NLP software for Dutch.
- NLTK
- Tutorial on stemming and lemmatization in Python, a possible starting point for a notebook: https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
- spaCy
- Pattern
- Flair
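The stemming and lemmatization approach from the tutorial linked above can be sketched for Dutch with NLTK's Snowball stemmer (a minimal illustration; the word list is just an example):

```python
# Dutch stemming with NLTK's Snowball stemmer, as covered in the
# stemming/lemmatization tutorial linked above.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("dutch")
words = ["katten", "lopen", "huizen"]  # example inflected Dutch words
stems = [stemmer.stem(w) for w in words]
print(stems)  # e.g. "katten" (cats) stems to "kat"
```

NLTK ships the Snowball stemmer for Dutch out of the box; lemmatization, by contrast, needs a tool with Dutch models such as spaCy or Pattern.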
Datasets¶
List of labeled datasets.
Models¶
List of pretrained word embeddings and other models.
Word embeddings¶
- fastText: fastText word vectors trained on Common Crawl and Wikipedia.
- Word2Vec: Word2Vec vectors trained by CLIPS on different Dutch corpora.
- Word2Vec: Word2Vec vectors trained by the Nordic Language Processing Laboratory (NLPL) on the CoNLL17 corpus.
- ConceptNet Numberbatch: multilingual word embeddings in the same semantic space. Built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.
ULMFiT¶
- Leiden fastai ULMFiT model: Trained on Wikipedia by the Text Mining and Retrieval research group at Leiden University.
BERT¶
- BERTje: Dutch pre-trained BERT model developed at the University of Groningen.
- RobBERT: Dutch BERT model using RoBERTa’s pre-training, developed by KU Leuven. Trained on scraped data from the Dutch section of the OSCAR corpus. Four smaller, distilled models are also available.
- BERT-NL: cased and uncased BERT models trained on the SoNaR corpus by the Text Mining and Retrieval research group at Leiden University.
- mBERT: Multilingual BERT model.
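The BERT models above can be loaded through the Hugging Face `transformers` library. A minimal sketch for BERTje, assuming its hub id `GroNLP/bert-base-dutch-cased` and a working network connection for the first download:

```python
# Loading BERTje's tokenizer via the transformers library.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
tokens = tokenizer.tokenize("Het is een mooie dag.")
print(tokens)

# To get contextual embeddings, additionally load the model weights:
#   from transformers import AutoModel
#   model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")
```

RobBERT, BERT-NL and mBERT expose the same interface under their respective hub or file names.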
Corpora¶
List of (largely) unlabeled text collections.
- SoNaR-500: over 500 million words from different domains and genres, automatically tokenized, POS-tagged, and lemmatised, with named entities extracted.
- SoNaR-1: 1 million words, largely drawn from SoNaR-500, with manually annotated named entities, coreferences, and spatial and temporal relations.
- SoNaR Nieuwe Media Corpus 1.0: tweets, chat messages and SMS messages.