A look under the hood of transformers, the engine driving AI model evolution




Today, virtually every cutting-edge AI product and model uses a transformer architecture. Large language models (LLMs) such as GPT-4o, LLaMA, Gemini and Claude are all transformer-based, and other AI applications such as text-to-speech, automatic speech recognition, image generation and text-to-video models have transformers as their underlying technology.  

With the hype around AI unlikely to slow down anytime soon, it’s time to give transformers their due. Here, I’d like to explain a little about how they work, why they are so important for the growth of scalable solutions and why they are the backbone of LLMs.  

Transformers are more than meets the eye 

In brief, a transformer is a neural network architecture designed to model sequences of data, making it ideal for tasks such as language translation, sentence completion, automatic speech recognition and more. Transformers have become the dominant architecture for these sequence modeling tasks because the underlying attention mechanism can be easily parallelized, allowing for massive scale in both training and inference.  

The transformer was originally introduced in the 2017 paper “Attention Is All You Need” from researchers at Google as an encoder-decoder architecture specifically designed for language translation. The following year, Google released bidirectional encoder representations from transformers (BERT), which could be considered one of the first LLMs — although it’s now considered small by today’s standards. 

Since then — and especially accelerated with the advent of GPT models from OpenAI — the trend has been to train bigger and bigger models with more data, more parameters and longer context windows.   

To facilitate this evolution, there have been many innovations: more advanced GPU hardware and better software for multi-GPU training; techniques like quantization and mixture of experts (MoE) for reducing memory consumption; new optimizers for training, like Shampoo and AdamW; and techniques for computing attention efficiently, like FlashAttention and KV caching. The trend will likely continue for the foreseeable future. 
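
To make one of those ideas concrete, below is a minimal sketch of KV caching during autoregressive decoding, written in plain NumPy with a single attention head and randomly initialized projections (all illustrative assumptions, not any particular model's implementation). Rather than recomputing keys and values for every previous token at each step, the model computes them once for the newest token and appends them to a growing cache.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d = 16  # model/head dimension (illustrative)
rng = np.random.default_rng(0)
# Illustrative projection matrices; a real model learns these during training.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """Attend the newest token's query over all cached keys and values."""
    q = x @ W_q
    k_cache.append(x @ W_k)   # only the new token's key and value are computed
    v_cache.append(x @ W_v)
    K = np.stack(k_cache)     # (t, d)
    V = np.stack(v_cache)     # (t, d)
    scores = K @ q / np.sqrt(d)   # similarity of the new query to every past key
    return softmax(scores) @ V    # weighted sum over the cached values

# Simulate five decoding steps with random token embeddings.
for _ in range(5):
    out = decode_step(rng.standard_normal(d))
print(out.shape)  # (16,)
```

The saving is that each decoding step only does work proportional to the current sequence length, instead of recomputing the projections for the entire sequence from scratch.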

The importance of self-attention in transformers

A full transformer model follows an encoder-decoder architecture, but depending on the application, only one of the two components may be used. The encoder component learns a vector representation of data that can then be used for downstream tasks such as classification and sentiment analysis. The decoder component takes a vector or latent representation of the text or image and uses it to generate new text, making it useful for tasks such as sentence completion and summarization. For this reason, many familiar state-of-the-art models, such as the GPT family, are decoder-only.   
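
As a quick illustration of that split, the following sketch uses the Hugging Face transformers library (my choice for the example, not something prescribed by the models above) to run an encoder-style classifier and a decoder-only generator through its pipeline API.

```python
# Requires: pip install transformers torch (assumed environment)
from transformers import pipeline

# Encoder-style model (BERT-like): maps text to a representation used for classification.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers scale remarkably well."))

# Decoder-only model (GPT-like): generates a continuation token by token.
generator = pipeline("text-generation", model="gpt2")
print(generator("The transformer architecture", max_new_tokens=20))
```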

Encoder-decoder models combine both components, making them useful for translation and other sequence-to-sequence tasks. For both encoder and decoder architectures, the core component is the attention layer, as this is what allows a model to retain context from words that appear much earlier in the text.  

Attention comes in two flavors: self-attention and cross-attention. Self-attention captures relationships between words within the same sequence, whereas cross-attention captures relationships between words across two different sequences. Cross-attention connects the encoder and decoder components in a model; during translation, for example, it allows the English word “strawberry” to relate to the French word “fraise.” Mathematically, both self-attention and cross-attention come down to matrix multiplications, which can be performed extremely efficiently on a GPU. 
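
To make the matrix-multiplication point concrete, here is a minimal NumPy sketch of scaled dot-product attention. Self-attention and cross-attention use the same formula; the only difference is whether the queries come from the same sequence as the keys and values. The learned projection matrices and multiple heads of a real transformer are omitted, and the shapes are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V -- just matrix multiplies."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (len_q, len_k) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
x_en = rng.standard_normal((5, 8))   # e.g. 5 English tokens, dimension 8
x_fr = rng.standard_normal((7, 8))   # e.g. 7 French tokens, dimension 8

# Self-attention: queries, keys and values all come from the same sequence.
self_out = scaled_dot_product_attention(x_en, x_en, x_en)

# Cross-attention: queries from one sequence, keys and values from the other.
cross_out = scaled_dot_product_attention(x_fr, x_en, x_en)
print(self_out.shape, cross_out.shape)  # (5, 8) (7, 8)
```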

Because of the attention layer, transformers can capture relationships between words separated by long stretches of text, whereas previous models such as recurrent neural networks (RNNs) and long short-term memory (LSTM) models lose track of the context of words from earlier in the text. 

The future of models 

Currently, the transformer is the dominant architecture for many use cases that require LLMs, and it benefits from the most research and development. Although this does not seem likely to change anytime soon, one different class of model that has gained interest recently is state-space models (SSMs) such as Mamba. This highly efficient algorithm can handle very long sequences of data, whereas transformers are limited by a context window.  

For me, the most exciting applications of transformer models are multimodal models. OpenAI’s GPT-4o, for instance, is capable of handling text, audio and images — and other providers are starting to follow. Multimodal applications are very diverse, ranging from video captioning to voice cloning to image segmentation (and more). They also present an opportunity to make AI more accessible to those with disabilities. For example, a blind person could be greatly served by the ability to interact through voice and audio components of a multimodal application.  

It’s an exciting space with plenty of potential to uncover new use cases. But do remember that, at least for the foreseeable future, these applications are largely underpinned by the transformer architecture. 

Terrence Alsup is a senior data scientist at Finastra.



