Introducing Occiglot: A Large-Scale European Initiative for the Open-Source Development of Large Language Models

Occiglot, an initiative launched by a group of European researchers, aims to meet the need for inclusive language models that reflect European values of linguistic diversity and cultural richness. By focusing on these values, the initiative intends to help preserve Europe's academic and economic competitiveness and to support AI sovereignty and digital language equality. The large language models currently offered by major tech companies and deep tech startups focus predominantly on English and neglect linguistic and cultural diversity.

The field of language modeling is currently dominated by a few key players, which has left European languages and cultures underrepresented. To address this gap, Occiglot has published Model Release v0.1, a set of preliminary 7B model checkpoints covering the five largest European languages: English, German, French, Spanish, and Italian. Developed through bilingual continual pre-training and instruction tuning for each language, these models are available under an open-source license on Hugging Face, which makes them freely accessible to researchers and developers.
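As a rough illustration of what access through Hugging Face can look like, the sketch below loads one bilingual checkpoint with the `transformers` library and generates a short continuation. The repository naming scheme (`occiglot/occiglot-7b-<lang>-en`) and the helper names `checkpoint_id` and `generate` are assumptions made for this example, not details taken from the announcement; check the exact model IDs on the Hugging Face Hub before use.

```python
# Assumed language codes for the bilingual (language + English) checkpoints.
BILINGUAL_LANGUAGES = ("de", "fr", "es", "it")


def checkpoint_id(lang: str, instruct: bool = False) -> str:
    """Build an ASSUMED Hub repository ID for a bilingual 7B checkpoint.

    The naming pattern is a guess for illustration; verify it on the Hub.
    """
    if lang not in BILINGUAL_LANGUAGES:
        raise ValueError(f"unsupported language code: {lang!r}")
    suffix = "-instruct" if instruct else ""
    return f"occiglot/occiglot-7b-{lang}-en{suffix}"


def generate(prompt: str, lang: str = "de", max_new_tokens: int = 40) -> str:
    """Download the checkpoint for `lang` and continue `prompt`."""
    # Imported lazily so checkpoint_id() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = checkpoint_id(lang)
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads several gigabytes of weights on first use):
# print(generate("Die Hauptstadt von Deutschland ist", lang="de"))
```

A 7B model needs substantial memory when loaded this way, so in practice a GPU and a reduced-precision setting are usually required.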

Occiglot's approach combines continual pre-training and instruction tuning of transformer-based language models for each target language. The process starts from an existing model pre-trained primarily on English, which is then further trained and tuned for each target language with attention to cultural nuance and linguistic diversity. This iterative method is intended to produce high-quality language models tailored to the European context. In addition, the collective encourages collaboration within the community to gather large-scale training data, curate instruction-tuning datasets, and rigorously evaluate model performance.

The performance of Occiglot's language models is assessed by how well they handle diverse linguistic tasks and applications across European languages. The release of intermediate model checkpoints marks significant progress toward comprehensive language modeling that covers all official European Union languages and beyond. Moreover, hessian.AI's commitment to supply computing resources underpins the initiative's scalability and sustainability.

In conclusion, the Occiglot initiative addresses the need for accessible and culturally sensitive language models in Europe. By releasing open-source LLM checkpoints and encouraging collaboration within the research community, Occiglot is paving the way for advances in language technology that align with European ideals of linguistic diversity and cultural richness. The Occiglot team invites anyone working on multilingual datasets, benchmarks, or models to join the effort, underscoring its commitment to open-source, collaborative research.
