The increasingly sophisticated language models of today need vast quantities of text data for pretraining, often on the order of trillions of words. This poses a considerable problem for smaller languages that lack the necessary resources. To tackle this issue, researchers from the TurkuNLP Group, the University of Turku, Silo AI, the University of Helsinki, and the CSC – IT Center for Science have developed Poro 34B, a 34-billion-parameter model trained on 1 trillion tokens of Finnish, English, and programming languages.
This research shows that a multilingual training approach can significantly enhance the capabilities of Finnish models while remaining competitive on English and code tasks. Poro 34B is a state-of-the-art generative model built around strategies for overcoming the limitations of available data: limited multilingualism that pairs the target language with a high-resource language sharing the same script (and, where possible, language family), oversampling of the smaller language's data, and the inclusion of programming language data.
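To make the oversampling idea concrete, here is a minimal sketch of how a multilingual pretraining mix could be turned into sampling weights. The corpus sizes and oversampling factors below are hypothetical placeholders, not the actual Poro 34B configuration.

```python
# Illustrative sketch only: converting corpus sizes and oversampling factors
# into batch sampling weights. All numbers are hypothetical.

def sampling_weights(corpora: dict[str, dict[str, float]]) -> dict[str, float]:
    """Turn raw corpus sizes and oversampling factors into sampling weights."""
    effective = {
        name: spec["tokens_billion"] * spec["oversample"]
        for name, spec in corpora.items()
    }
    total = sum(effective.values())
    return {name: size / total for name, size in effective.items()}

corpora = {
    "finnish": {"tokens_billion": 30.0, "oversample": 4.0},   # small corpus, repeated
    "english": {"tokens_billion": 500.0, "oversample": 1.0},
    "code":    {"tokens_billion": 300.0, "oversample": 1.0},
}

for name, weight in sampling_weights(corpora).items():
    print(f"{name}: {weight:.1%} of training batches")
```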
To pretrain Poro 34B, the researchers first eliminated low-quality and duplicate texts from the dataset and filtered out toxic content. They also introduced a cross-lingual signal by including English-Finnish translation pairs from the Tatoeba Challenge dataset; these pairs comprise less than 1% of the pretraining tokens.
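The sketch below illustrates two of these steps in simplified form: formatting translation pairs as plain-text documents and dropping exact duplicates with a hash check. The pair template and the helper names are illustrative assumptions, not the preprocessing actually used for Poro 34B.

```python
import hashlib

# Hypothetical sketch: turning English-Finnish sentence pairs into plain-text
# pretraining documents and removing exact duplicates. The template string is
# illustrative, not the format used for Poro 34B.

def format_pair(english: str, finnish: str) -> str:
    # A simple parallel template exposes the model to a cross-lingual signal.
    return f"English: {english}\nFinnish: {finnish}"

def deduplicate(documents: list[str]) -> list[str]:
    """Keep only the first occurrence of each exact document (hash-based)."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

pairs = [
    ("The weather is nice today.", "Sää on tänään mukava."),
    ("The weather is nice today.", "Sää on tänään mukava."),  # exact duplicate
]
docs = deduplicate([format_pair(en, fi) for en, fi in pairs])
print(len(docs), "unique documents")
```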
The Poro 34B model demonstrated strong performance when evaluated, with low character-level perplexity across English, Finnish, and code datasets, indicating effective learning across all three. Poro 34B also excelled on various benchmarks and performed particularly well on Finnish tasks, surpassing the results of previous monolingual models, while its English proficiency remained competitive with models trained predominantly in English. Notably, Poro 34B generated Finnish text that was coherent and grammatically correct, and in English-Finnish translation it outperformed both dedicated translation models and Google Translate.
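For readers who want to reproduce the perplexity measurement in spirit, here is a minimal sketch of character-level perplexity: the total token-level negative log-likelihood divided by the number of characters, exponentiated. The Hugging Face checkpoint name is an assumption (adjust it to wherever the model is published), and loading a 34B model realistically requires multi-GPU or quantized inference rather than the plain setup shown here.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "LumiOpen/Poro-34B"  # assumed identifier; replace as needed

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def char_perplexity(text: str) -> float:
    """exp(total token NLL in nats / number of characters)."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    num_predicted = enc["input_ids"].shape[1] - 1  # loss is averaged over shifted tokens
    total_nll = out.loss.item() * num_predicted
    return math.exp(total_nll / len(text))

print(char_perplexity("Hyvää huomenta! Tänään on kaunis päivä."))
```

Normalizing by characters rather than tokens makes the metric comparable across languages with different tokenizer fertility, which is why it is a natural choice for a Finnish-English comparison.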
The Poro 34B model may serve as a template for creating larger models for other smaller languages, facilitating further research and development. Future work includes a systematic exploration of the impact of multilingual training.