Date added: 2024-03-12
Potential Polish equivalent of GPT. Successful cooperation on AI between Gdańsk Tech and OPI
Gdańsk Tech and OPI have developed Polish generative models called Qra, trained on a data corpus containing texts exclusively in Polish. Initially, the corpus comprised a total of almost 2 TB of raw text data; cleaning and deduplication reduced it by almost half in order to keep only high-quality, unique content. This is the first generative model trained on such a large collection of Polish texts using this much computing power. For comparison, the Llama, Mistral and GPT models are trained mostly on English data, with only a fraction of a percent of the training corpus consisting of Polish.
Most complex version of the model trained in a month at STOS
A computing environment dedicated to building artificial intelligence models was created at Gdańsk University of Technology in the IT Competence Center STOS, one of the most modern IT centers in this part of Europe and home to the Kraken supercomputer. A cluster of 21 NVIDIA A100 80 GB graphics cards was used in the process. The teams worked for about six months on preparing the environment, creating the tools and models, training them (on content from areas such as law, technology, social sciences, biomedicine, religion and sport, among others) and testing them. Thanks to the rich infrastructure available at STOS, the actual training process for the most complex models was shortened from years to about a month.
Qra uses Polish better
As a result of the cooperation between Gdańsk Tech and OPI, the research teams created three models of differing complexity: Qra 1B, Qra 7B and Qra 13B. The Qra 7B and Qra 13B models achieve significantly better (lower) perplexity, i.e. a better ability to model the Polish language in terms of comprehension, the lexical layer and grammar, than the original Llama-2-7b-hf (Meta) and Mistral-7B-v0.1 (Mistral AI) models.
Perplexity was measured, among others, on the first 10,000 sentences of the PolEval-2018 test set and, additionally, on a set of 5,000 long and more demanding documents written in 2024.
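The kind of comparison described above can be reproduced with standard tooling. Below is a minimal sketch of measuring a causal language model's perplexity on Polish sentences with the Hugging Face transformers library; the repository name "OPI-PG/Qra-1b" and the sample sentences are assumptions for illustration, not the project's published evaluation code.

```python
# Minimal perplexity sketch for a causal language model on Polish text.
# "OPI-PG/Qra-1b" and the sample sentences are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OPI-PG/Qra-1b"  # assumed Hugging Face repository name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

sentences = [
    "Politechnika Gdańska jest jedną z najstarszych uczelni technicznych w Polsce.",
    "Model językowy przewiduje kolejne słowa na podstawie kontekstu.",
]

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in sentences:
        enc = tokenizer(text, return_tensors="pt")
        # Passing labels == input_ids makes the model return the average
        # cross-entropy (negative log-likelihood per token) for this sentence.
        out = model(**enc, labels=enc["input_ids"])
        n_tokens = enc["input_ids"].numel()
        total_nll += out.loss.item() * n_tokens
        total_tokens += n_tokens

# Perplexity is the exponential of the average per-token negative log-likelihood.
perplexity = torch.exp(torch.tensor(total_nll / total_tokens)).item()
print(f"corpus perplexity: {perplexity:.2f}")  # lower is better
```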
Solutions requiring better language understanding
Qra models will constitute the basis for IT solutions handling issues and processes that require a better understanding of the Polish language.
At this stage, Qra is a base (foundation) language model that can generate grammatically and stylistically correct text in Polish. The generated content is of very high quality, which is confirmed, among other things, by the perplexity measure. The team will now begin tuning the models to verify their capabilities in text classification, summarization and question answering.
The developed models have been published in the dedicated OPI-Gdańsk Tech repository on the Hugging Face platform. Anyone can download a model and adapt it to their own domain and tasks, such as question answering.
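For readers who want to try a model locally, the following is a minimal sketch of downloading one of the published checkpoints from Hugging Face and generating Polish text with it. The repository name "OPI-PG/Qra-7b", the prompt and the sampling parameters are illustrative assumptions; as a base model, Qra continues a prompt rather than answering questions directly.

```python
# Sketch of loading an assumed Qra checkpoint and generating a continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OPI-PG/Qra-7b"  # assumed repository in the OPI-Gdańsk Tech namespace
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Sztuczna inteligencja w Polsce"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation of the prompt; a base (non-instruction-tuned)
# model extends the text rather than following instructions.
output_ids = model.generate(
    **inputs, max_new_tokens=100, do_sample=True, temperature=0.7, top_p=0.9
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```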
-
2024-11-22
INTERNATIONAL DAYS at Gdańsk Tech