
BigScience: a new language model for all

07.01.2022, by
Martin Koppe
Reading time: 5 minutes
Open source and covering forty-six languages, BigScience is a new heavyweight in artificial intelligence for natural language processing. Claire Gardent, a CNRS research professor at the LORIA, winner of the CNRS Silver Medal and a member of the project's steering committee, explains the principle as well as its potential applications.

What is BigScience?
Claire Gardent (1): BigScience represents a major advance for language models, which calculate the probability of a sequence of words and then generate text by linking together the terms that are most likely to follow one another. Older models were restricted to narrow contexts, typically five consecutive words, but this limitation has been removed by neural networks that can take into account a context that is, in theory, unlimited. In practice, they have truly revolutionised text generation by producing content that is practically flawless grammatically. For a short piece of writing, it is very difficult to distinguish their prose from what a human would have written.
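To make this chaining idea concrete, here is a toy sketch in Python (ours, not the BigScience model): a tiny, invented bigram probability table and a greedy loop that repeatedly appends the word most likely to follow the previous one. Real neural models use much longer contexts and learned probabilities, but the principle of linking high-probability continuations is the same.

```python
# Toy illustration (invented probabilities, not the BigScience model):
# generate text by repeatedly picking the word with the highest probability
# of following the previous one.
bigram_probs = {
    "the": {"cat": 0.4, "dog": 0.3, "weather": 0.3},
    "cat": {"sat": 0.6, "ran": 0.4},
    "sat": {"quietly": 1.0},
    "ran": {"away": 1.0},
}

def generate(start, max_words=5):
    words = [start]
    while len(words) < max_words and words[-1] in bigram_probs:
        next_probs = bigram_probs[words[-1]]
        # Greedy choice: append the most probable continuation.
        words.append(max(next_probs, key=next_probs.get))
    return " ".join(words)

print(generate("the"))  # -> "the cat sat quietly"
```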

Photo: Discussion of different examples of automatic language processing, with automatic text and sign-language generation with a virtual signer and latent sentence representation, at the Laboratoire interdisciplinaire des sciences du numérique (LISN). © Christian MOREL / LISN / CNRS Photothèque

Yet until now these models were developed mostly in English and by multinationals, which made them available but provided no information about how they were trained or about the data used. The idea behind BigScience is to create new language models using an unprecedented multilingual, open-source approach.
BigScience also has a very valuable collaborative dimension. Although coordinated by the private company Hugging Face, which carried out a significant part of the engineering, the project has brought together a high-calibre international academic community.

How are language models developed?
C. G.: 
They are made of neural networks that learn on their own from large quantities of text. They take the first word in a sentence and try to predict the second, then the third, and so on. Each prediction is compared with the word that actually appears in the text, and in the event of an error the model's weights are adjusted using what is known as backpropagation. Learning continues until the model converges, in other words until its predictions are reliable.
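As an illustration of this training loop, here is a minimal sketch in PyTorch (a generic setup we assume for the example, not the BigScience code): a tiny network predicts each next token, its errors are measured with a cross-entropy loss, and backpropagation adjusts the weights.

```python
# Minimal next-word-prediction training loop (illustrative only).
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)  # one next-word distribution per position

model = TinyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random token ids stand in for real training text: (batch, sequence).
tokens = torch.randint(0, vocab_size, (8, 20))
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict word t+1 from words up to t

for step in range(100):
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # backpropagation: adjust weights where predictions were wrong
    optimizer.step()
```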

What obstacles and difficulties had to be overcome by BigScience?
C. G.: To learn these probabilities, the model needs data and computing power, which is provided by the Jean Zay supercomputer (2). BigScience also had to manage the scaling up that results from its multilingual nature. Considerable reflection and engineering were required to create high-quality training data, as well as to train such large models.
The information used is collected automatically on the Internet. This poses little difficulty in the case of English, for which large bodies of text are already available. But BigScience covers forty-six languages, which are present online to varying degrees. Significant amounts of data must therefore be assembled for languages with little available material, and what does exist is not always of good quality. For example, working on Breton, I observed that a corpus built automatically for machine translation included content that was actually in English and Chinese. This must be avoided across vast volumes of data, and the presence of biases and offensive wording must also be minimised.
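By way of illustration, the sketch below shows one such cleaning step with a deliberately crude, self-contained heuristic (our own, not the BigScience pipeline, which relies on proper language-identification tooling): lines that are plainly in the wrong language, for instance Chinese or English sentences found in a Breton corpus, are dropped.

```python
# Illustrative corpus-cleaning heuristic (not the BigScience pipeline):
# drop lines that are clearly not in the target language.
def looks_misfiled(line: str) -> bool:
    if not line.strip():
        return True
    # Lines dominated by CJK characters cannot be Breton.
    cjk = sum("\u4e00" <= ch <= "\u9fff" for ch in line)
    if cjk / len(line) > 0.3:
        return True
    # Very frequent English function words are a cheap signal of English text.
    english_markers = {"the", "and", "of", "is", "was", "with"}
    words = line.lower().split()
    hits = sum(w in english_markers for w in words)
    return bool(words) and hits / len(words) > 0.2

corpus = [
    "Demat d'an holl, penaos emañ kont ?",      # Breton: kept
    "The quick brown fox jumps over the dog.",  # English: dropped
    "这是一个中文句子。",                          # Chinese: dropped
]
cleaned = [line for line in corpus if not looks_misfiled(line)]
```

A production pipeline would of course use a trained language-identification model rather than hand-written rules, but the filtering logic is of this kind.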

Photo: The Jean Zay converged supercomputer extends the traditional uses of high-performance computing (HPC) to new uses for artificial intelligence (AI). Through two successive extensions, its capacity was increased to 28.3 petaflops by 2022. © Cyril FRESILLON / IDRIS / CNRS Photothèque

What are the potential applications?
C. G.: All tasks relating to text generation. Machine translation remains the leading application, although simplification, paraphrasing, and summarisation are also important. Generation models are likewise used to convert knowledge bases or numerical databases into text, for example to produce a weather report from raw meteorological data. And if it is trained on dialogue, a language model can be integrated within a conversational agent. Finally, question-answering models often use generation to produce answers based on the question and relevant information from the Internet.
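As a simple illustration of the data-to-text idea, the sketch below feeds raw weather figures to a pretrained generation model through the Hugging Face transformers library. The small "gpt2" checkpoint is used purely as a stand-in for illustration; it is not the BigScience model, and a real system would use a model fine-tuned for report generation.

```python
# Illustrative data-to-text generation with a generic pretrained model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
report = generator(
    "Weather for Paris, 3 June: min 12°C, max 23°C, light rain in the morning. Summary:",
    max_new_tokens=40,
    do_sample=False,
)
print(report[0]["generated_text"])
```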

A Google AI engineer recently claimed that the language model he is working on, LaMDA, had become conscious. What is your reaction to such declarations?
C. G.: The intelligence of language models is an issue that arises regularly. They have truly amazing capacities, and some examples are quite remarkable. However, this does not mean that the programme is aware of what it is saying. Natural-language understanding involves reasoning that is complex and varied (spatial, temporal, ontological, arithmetical), that draws on knowledge, and that connects form with meaning (objects and actions in the real world). While a few selected examples may suggest that language models are capable of such reasoning, many incorrect cases crop up when working with them every day.

With Angela Fan of Facebook AI Research Paris, I worked on an algorithm that generates Wikipedia pages. The wording is perfect and seems to have been written by a human. Yet while the form is correct, the content is full of factual errors. A basketball player's page sometimes states that he is a tennis player, certain dates are wrong, and the university where he studied is incorrect. After a few lines, an automatically generated text often ends up contradicting itself. If language models were capable of understanding, they would not generate such incoherence.

We must be wary of hype created by a few well-chosen examples. Research should continue to develop tasks and benchmarks that truly test the capacity of natural language processing models. They should be assessed based on reasoning exercises that are multidomain, multilingual, and that interact with the world, for example via systems in which the computer has to answer questions regarding the content of an image, video, or conversation.  

There are other essential objectives, such as the models' capacity for generalisation, in other words ensuring that they work equally well on journalistic, technical, or literary content. Models must also be designed for languages that are said to be less well-resourced. By gathering data covering forty-six different languages, BigScience has taken a major step in this direction.

Footnotes
  • 1. LORIA (CNRS / University of Lorraine / INRIA).
  • 2. Managed by the company GENCI (Grand équipement national de calcul intensif).


Source URL: https://news.cnrs.fr/articles/bigscience-a-new-language-model-for-all
