The biggest neural nets, as measured by their parameters or “weights,” are clocking in at over half a trillion weights. Models such as Google’s Pathways Language Model, or PaLM, and Nvidia and Microsoft’s Megatron-Turing NLG 530B are among the biggest, with 540 billion and 530 billion parameters, respectively. The more parameters a program has, in general, the greater the amount of computing power it consumes to train, and also to run for making predictions, what’s called inference. The cognoscenti of AI insist the path is definitely up and to the right for parameter count, toward a trillion parameters and way beyond in the not-so-distant future. The figure of 100 trillion is a kind of magical target because it is believed to be the number of synapses in a human brain, so it serves as a benchmark of sorts. Also: Nvidia clarifies Megatron-Turing scale claim At the same time, there is a fervor to make deep neural networks that can be as general as possible. For much of the machine learning history of the last 40 years, programs were specialized for tasks such as image recognition or speech recognition. That has changed in recent years, with more and more programs offering to be generalists, such as DeepMind’s Perceiver AR, and another DeepMind program, Gato, referred to as “a generalist agent” capable of solving myriad tasks. The generalizing tendency has been reinforced by the observations of machine learning pioneers such as Richard Sutton, who has remarked that “historically, generic models that are better at leveraging computation have also tended to overtake more specialized domain-specific approaches eventually.” Also: DeepMind’s ‘Gato’ is mediocre, so why did they build it? And yet, there are deep learning results that sometimes run the other way: against giant and general to economical and somewhat focused, if not specialized. In addition to the paper, Soltan has written a blog post on the topic. As with AlexaTM 20B, the Atlas program beats PaLM by a significant margin, the authors write, even with just the 64 examples. The key to Atlas is to combine the pre-trained language model with an ability to retrieve information from online sources, such as Wikipedia, as if phoning a friend for the answer. Also: DeepMind’s Perceiver AR: A step toward more AI efficiency The first interesting tweak is to go back to basics and restore something taken out of recent giant language models. The basis of AlexaTM 20B is the same as PaLM and GPT-3 and others, a Transformer encoder-decoder – the approach pioneered in 2017 by Google scientists Ashish Vaswani and colleagues. The Transformer uses units called “self-attention” to come up with a probability score for how every word may be found in the context of other words. That score is then used to fill in the blanks when predicting words to form meaningful text blocks. In the case of AlexaTM 20B, Soltan and colleagues make a critical departure from PaLM and GPT-3 and other gigantic descendants of the original Transformer. Those more-recent models dispensed with one half of the Transformer, what’s called the encoder (the thing that maps input data into hidden states to then be decoded into an answer). Instead, PaLM and GPT-3 merge the input with the decoder, to form a stripped-down program that is a “decoder-only” model. The Alexa team puts the encoder back into the program. Their claim is that having both elements helps to improve accuracy in what’s called “de-noising,” which means reconstructing an original sentence where some of the words have dropped out. In the decoder-only model, the conditional probability of predicted text runs only in one direction: every next answer is based only on what came before. In the full encoder-decoder version, by contrast, the model is making an assessment of probabilities in both directions: what came before a given word and what follows. That serves better in tasks where one is not only generating the next element in a sentence, but also doing things like word-for-word comparison, as in translation tasks from one language to another. Also: Meta’s massive multilingual translation opus still stumbles on Greek, Armenian, Oromo As they write, “AlexaTM 20B achieves a new state-of-the-art of 82.63% in the zero-shot setting in the denoising mode. The main reason denoising mode performs better for this task is that in the denoising mode, the input is being repeated in encoder and decoder allowing the model to use both encoder and decoder fully to find the best answer.” The second thing the authors add is to train the model with what’s called “causal language modeling.” CLM, for short, is the task that is used in GPT-3 and other decoder-only Transformers. It specifically represents every word as dependent only on the words that came before – a sequential, one-way dependency that is trained to generate sentences based on an initial prompt. The authors mix the de-noising task with the causal task in training AlexaTM 20B, with de-noising taking up 80% of the training activity and causal modeling the remaining fifth. The virtue of adding causal modeling is that, similar to GPT-3, it aids in what is called “in-context learning.” In-context learning is a broad rubric covering any models that are able to perform zero- or few-shot learning. That means that the program has no domain-specific knowledge; you just give it an example prompt, and the program makes a prediction that is in accord with the type of question being posed. Because of that hybrid training regime, AlexTM 20B not only does well at reconstructing sentences – the de-noising task, it also is “the first multilingual seq2seq [sequence to sequence] model capable of in-context learning,” the authors write. It’s a hybrid program, in other words. The third interesting tweak by Soltan and colleagues is to increase enormously how many data points are input to the program during training. They input one trillion “tokens,” individual pieces of data, during training; that’s more than three times as many as GPT-3 receives. The training data sets in this case consist of Wikipedia entries and also what’s called mC4, a data set for training Transformers introduced last year by Linting Xue and colleagues at Google. It’s based on natural-language text in 101 languages from the Common Crawl Web-scraped data sources. Also: Sentient? Google LaMDA feels like a typical chatbot The use of a very large amount of input training data is one of the key elements of the Alexa work. Soltan and team decided to go that route, they write, based on an observation made by Jordan Hoffman and colleagues at OpenAI, as published in a paper this past March, “Training compute-optimal large language models.” In that paper, Hoffman and colleagues conclude that “current large language models are significantly under-trained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant.” By taking a wide range of language models of different sizes and testing them all with varying amounts of input tokens, the authors concluded that “for compute-optimal training, the model size and the number of training tokens should be scaled equally.” Hence, AlexaTM 20B is not just parsimonious – it aims to prove that fewer parameters can be balanced with more training data to equal compelling performance. Incidentally, the authors also take pains to shape the majority of the input as natural spoken text, dropping capitalization and punctuation, which has importance in an Alexa setting. “We include more spoken than written text to satisfy our internal use cases,” they write. Also: Google’s massive language translation work identifies where it goofs up The authors train the AlexaTM 20B model “for 120 days on 128 [Nvidia] A100 GPUs for the total of 500k updates with the accumulated batch size of 2 million tokens (total of 1 trillion token updates),” they write. That might sound like a lot, but it’s less than PaLM, which was trained by Google on two of its fourth-generation TPU Pods, consisting of 3,072 TPU chips in each Pod, which are attached to 768 host computers. As Google authors Aakanksha Chowdhery and team noted in April, that was “the largest TPU configuration described to date.” The results are spelled out in specific test results. Soltan and team place a special emphasis on their success in particular tasks as opposed to every task conceivable. For example, Soltan and team observe that “AlexaTM 20B performs better or in par to the largest dense decoder-only model to date (i.e., PaLM 540B) in summarization both in 1-shot and fine-tuning settings.” This is specifically true in a task of summarizing paragraphs known as MLSum; in German, Spanish, and French, AlexaTM 20B beat PaLM handily. The MLSum benchmark test, introduced in 2020 by France’s National Centre for Scientific Research, comprises 1.5 million articles from newspapers. The task is for a language model to output a few sentences of text that express the idea laid out in the entire article. This requires a lot of reduction, obviously, of hundreds of words down to perhaps a few dozen. On a fourth test, XSum, performed in English, the AlexaTM 20B model was a close second, and it beat out a version of PaLM that was bigger than AlexaTM 20B but smaller than the 540-billion-parameter version of PaLM. While it excels at summarization, the AlexTM 20B falls down on some other tasks. For example, tested on “reasoning” data sets (such as MultiArith) and “chain of thought” reasoning tasks (which are very simple arithmetic problems written in natural language), the program falls far behind what is accomplished by the much-larger models like GPT-3. Also: The future of AI is a software story, says Graphcore’s CEO Write Soltan and team, “AlexaTM 20B performs slightly better than similar sized models, however, we did not observe the gain that much larger models like GPT3 175B show from such special prompts,” meaning, clues given to the program about the next step in a problem. “The results indicate that scaling up the model parameters is crucial in performing well in ‘reasoning’ tasks as was previously demonstrated […] in decoder-only architectures using Instruct-GPT3 models.” Focusing on the successful task,s such as summarization, the main conclusion that Soltan and team arrive at is that their mixed approach to training the program – using both objectives of de-noising and causal language modeling – is a key to making things more efficient. “This suggests that mixed pre-training, and not necessarily additional multitask training […] is the key to train strong seq2seq-based Large-scale Language Models (LLM),” they write. To return to the original question of size, as has been noted in many contexts, the energy usage of increasingly large AI programs is an ethical concern within AI practices. The authors make a strong case for the relevance of their more-efficient approach. Also: Ethics of AI: Benefits and risks of artificial intelligence Because the AlexaTM 20B “is much smaller in size than models like GPT3 175B, yet achieving similar or better performance across different tasks,” they write, “the ongoing environmental impact of using AlexaTM 20B for inference is much lower than that of larger models (roughly 8.7 times lower).” They add, “Hence, overtime, AlexaTM 20B has [a] lower carbon footprint as well.” The authors offer a table of stats showing the relative carbon footprint, and there’s a big difference in the numbers. That table of carbon footprints is perhaps the most interesting aspect of all this. More deep learning research is going to seek to put up scores for environmental assessments, it would seem, in order to show how energy-efficient a given approach can be. That’s in keeping with the world’s increasing focus on “ESG,” meaning, environmental, social, and governance factors, in all things. That may mean that being eco-conscious has in some ways become part of the objective of mainstream AI research. Also: AI in sixty seconds