MarIA: First Massive Artificial Intelligence System in Spanish
The MarIA project, a language model system created at the Barcelona Supercomputing Center (BSC) and promoted by the Secretary of State for Digitization and Artificial Intelligence, has reached a new version that can summarize existing texts and generate new ones from headlines or a few words. MarIA has been trained on more than 135 billion words from the web archive of the National Library of Spain.
Thanks to MarIA's size and capabilities, Spanish ranks third among the languages with massive open-access models, after English and Mandarin. MarIA was built from the digital documentary heritage of the National Library, which crawls and archives websites written in Spanish, and was trained on the MareNostrum 4 supercomputer. It is published openly so that application developers, companies, research groups and society in general can put it to countless uses.
The latest advances of MarIA are a milestone toward the objectives of the National Strategy for Artificial Intelligence and the Recovery, Transformation and Resilience Plan, with which Spain intends to lead the development of tools, technologies and applications for the projection and use of the Spanish language in AI. Specifically, the National Plan for Language Technologies, within which this project is framed, aims to promote the development of natural language processing, automatic translation and conversational systems in Spanish and the co-official languages.
Models to understand the language and models to generate texts
A language model is an artificial intelligence system formed by a set of deep neural networks that have been trained to acquire an understanding of the language, its lexicon and its mechanisms to express meaning and write at an expert level. These complex statistical models that link words in texts in a systematic and massive way are capable of “understanding” not only abstract concepts, but also their context. With these models, developers of different applications can create tools for multiple uses, such as classifying documents or creating proof-readers or translation tools.
The first version of MarIA was made with RoBERTa, a technology that creates “encoder”-type language models. This type of model, given a text sequence, generates an interpretation that can be used to, for example, classify documents, answer multiple choice questions, find semantic similarities in different texts, or detect the feelings that are expressed in them.
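To make the idea of "text in, interpretation out" concrete, here is a minimal Python sketch of comparing two texts by similarity. It is only an illustration: a bag-of-words count vector stands in for the learned representation an encoder such as RoBERTa would produce, and the function names `encode` and `cosine_similarity` are hypothetical, not part of MarIA.

```python
import math
from collections import Counter

def encode(text):
    """Toy stand-in for an encoder: map a text to word counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = encode("la biblioteca archiva sitios web")
doc_b = encode("la biblioteca nacional archiva la web")
doc_c = encode("superordenador de alto rendimiento")

print(cosine_similarity(doc_a, doc_b))  # texts that share words score higher
print(cosine_similarity(doc_a, doc_c))  # texts with no shared words score 0.0
```

A real encoder captures meaning beyond shared words, which word counts cannot do; the sketch only shows how a fixed representation of each text makes tasks such as similarity search or classification possible.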
The new version has been created with GPT-2, a more advanced technology that creates generative decoder models and adds features to the system. The decoder models, given a text sequence, can generate new texts. With this, they can be used, for example, to make automatic summaries, simplify complicated wording tailored to different user profiles, generate questions and answers, have complex dialogues with users and even write full texts (which could appear to be written by humans) from a headline or a small number of words.
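The generation loop of a decoder model (take the text so far, pick a next token, append it, repeat) can be sketched in a few lines of Python. This is a toy, not GPT-2: a table of observed word pairs replaces the neural network, and the names `train_bigram` and `generate` are hypothetical.

```python
import random
from collections import defaultdict

def train_bigram(tokens):
    """Map each token to the list of tokens observed right after it."""
    followers = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current].append(nxt)
    return followers

def generate(followers, prompt, max_new_tokens, seed=0):
    """Extend a non-empty prompt one token at a time (autoregressively)."""
    rng = random.Random(seed)
    output = list(prompt)
    for _ in range(max_new_tokens):
        candidates = followers.get(output[-1])
        if not candidates:  # no known continuation: stop early
            break
        output.append(rng.choice(candidates))
    return output

corpus = "la biblioteca nacional archiva la web en español".split()
model = train_bigram(corpus)
print(" ".join(generate(model, ["la"], max_new_tokens=5)))
```

A model like GPT-2 follows the same outer loop but conditions each choice on the whole preceding sequence rather than on the last word alone.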
These new capabilities make MarIA a tool that, with “ad hoc” training adapted to specific tasks, can be very useful for application developers, companies and public administrations. For example, the models that until now have been developed in English are used to generate text suggestions in writing applications, to summarize contracts or the complicated documents that detail the benefits of a product, depending on what each user wants to know, and to search for specific information within large text databases and relate it to other relevant information.
Trained with over 135 billion words and 969 exaflops of computation
In language models, the number of parameters with which the system is trained is the element that gives them the greatest capacity for generalization and, therefore, intelligence. The National Library data with which MarIA has been trained consists of more than 135 billion words (135,733,450,668, specifically), occupying a total of 570 Gigabytes.
To create and train MarIA, BSC's MareNostrum supercomputer was used, and a total of about 9.7 × 10^20 floating-point operations (969 exaflops) was required. A flop is a floating-point operation, the unit in which a supercomputer's computing power per second is expressed, and exa is the prefix for 10^18, that is, one quintillion (one trillion on the long scale used in Spanish).
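The unit arithmetic can be checked directly. The short Python sketch below assumes that the 969 figure counts total floating-point operations in units of 10^18; the helper name `exa_to_ops` is hypothetical.

```python
EXA = 10**18  # SI prefix "exa"

def exa_to_ops(exa_units):
    """Convert a count expressed in exa-operations to plain operations."""
    return exa_units * EXA

total_ops = exa_to_ops(969)   # figure quoted in the text
print(f"{total_ops:.2e}")     # prints 9.69e+20
```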
Of these 969 exaflops, 201 were necessary to process the data from the National Library, eliminate everything that was not well-formed text (page numbers, graphics, sentences that do not end, erroneous encodings, duplicate sentences, other languages, etc.) and save only the correct texts in the Spanish language, as it is actually used. The remaining 768 exaflops were used to train the neural networks of the GPT-2 model.
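The kind of filtering described here can be illustrated with a minimal Python sketch. It is not BSC's pipeline: the rules below (bare page numbers, decoding errors, unfinished sentences, exact duplicates) are simple assumed heuristics, language identification is left out, and the name `clean_lines` is hypothetical.

```python
import re

SENTENCE_END = (".", "!", "?")
PAGE_NUMBER = re.compile(r"^\W*\d+\W*$")  # a line holding only a number

def clean_lines(lines):
    """Keep well-formed, non-duplicate sentences using rough heuristics."""
    seen = set()
    kept = []
    for line in lines:
        text = " ".join(line.split())  # normalize whitespace
        if not text or PAGE_NUMBER.match(text):
            continue  # empty line or bare page number
        if "\ufffd" in text:
            continue  # replacement character signals a decoding error
        if not text.endswith(SENTENCE_END):
            continue  # sentence that does not end
        key = text.casefold()
        if key in seen:
            continue  # duplicate sentence
        seen.add(key)
        kept.append(text)
    return kept

sample = [
    "La biblioteca archiva sitios web.",
    "  12  ",
    "Una frase que no termina",
    "La biblioteca archiva sitios web.",
    "Texto con codificaci\ufffdn err\ufffdnea.",
    "¿Qué es un modelo de lenguaje?",
]
print(clean_lines(sample))
```

On this sample only the first sentence and the final question survive; the page number, the unfinished sentence, the duplicate and the badly decoded line are dropped.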
The current version of MarIA will now lead to specialized versions for different application areas, such as biomedicine and law, and will evolve to solve the specific problems mentioned above.
In parallel, PlanTL will continue to expand MarIA to:
- adapt to new technological developments in natural language processing (models more complex than the GPT-2 now implemented), trained with greater amounts of data,
- create workspaces to facilitate the use of MarIA by companies and research groups in the appropriate computing environments, and
- embed it in systems for evaluating and certifying the quality of the systems developed in different domains.