Language identifier

This is a Python 3 application and Docker container for language identification, based on the langdetect PyPI package, to which we have added the Luxembourgish (lb) profile. It works for the languages of the project, namely German, French, Luxembourgish, Romanian, Danish and English. If text in another language is supplied, its ISO 639-1 language code is returned.
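As a rough illustration of how the underlying langdetect package can be called from Python, here is a minimal sketch. It is not the interface of the application in the repository below. The names `PROJECT_LANGUAGES` and `identify` are hypothetical, and the stock langdetect release on PyPI does not include the lb profile, so Luxembourgish detection assumes the extended profiles described above.

```python
# Project languages named in the text, keyed by ISO 639-1 code.
# (Hypothetical helper table, not taken from the repository.)
PROJECT_LANGUAGES = {
    "de": "German",
    "fr": "French",
    "lb": "Luxembourgish",
    "ro": "Romanian",
    "da": "Danish",
    "en": "English",
}


def identify(text, detect_fn=None):
    """Return (iso_code, is_project_language) for the given text.

    detect_fn is any callable mapping a string to an ISO 639-1 code.
    When omitted, langdetect.detect is used.
    """
    if detect_fn is None:
        # Imported lazily so the helper can also be used with a stub detector.
        from langdetect import DetectorFactory, detect

        # langdetect is non-deterministic unless the seed is fixed.
        DetectorFactory.seed = 0
        detect_fn = detect
    code = detect_fn(text)
    return code, code in PROJECT_LANGUAGES
```

With langdetect installed, `identify("Dies ist ein kurzer deutscher Satz.")` returns the detected code together with a flag saying whether it is one of the six project languages. Very short inputs tend to be detected less reliably.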

https://github.com/racai-ai/e4a-langdetect

Datasets

There are three datasets so far, covering three domains: COVID-19 (Romanian), construction permits (Romanian), and public administration (Luxembourgish). They are available here: https://github.com/racai-ai/e4all-models.

BERT Medium for Luxembourgish

This BERT model is available on Hugging Face and can be used directly with the transformers Python API. It was trained for 3 epochs and reached a final perplexity of 58.76 on the validation set. The vocabulary has 70K word pieces. We are working on extending this model with a larger vocabulary.
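The following sketch shows one way such a model could be loaded through the transformers API, and how a perplexity figure relates to the validation loss. Two assumptions apply. The model identifier is not given here, so `model_id` is a required parameter rather than a real name. The perplexity is assumed to be the exponential of the mean masked-language-model cross-entropy, which is a common convention but is not confirmed for this model. The function names are hypothetical.

```python
import math


def load_masked_lm(model_id):
    """Load a tokenizer and masked-LM model from the Hugging Face Hub.

    model_id is the Hub identifier of the model; it is not stated in the
    text, so the caller must supply it.
    """
    # Imported lazily: transformers is only needed when a model is loaded.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return tokenizer, model


def perplexity_from_loss(mean_cross_entropy):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity):
    """Inverse of perplexity_from_loss."""
    return math.log(perplexity)
```

Under that assumption, the reported perplexity of 58.76 corresponds to a mean validation loss of `loss_from_perplexity(58.76)`, which is a little over 4 nats per masked token.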