Diffusion models are among the most innovative techniques in generative AI. They allow for the creation of new images, text, or other data from random noise. This educational article, aimed at high school students, explains in an accessible way how diffusion models work using simple analogies and diagrams. We will discover why they are important in artificial intelligence and how they have revolutionized content generation. Finally, we present the latest advancements in the field, particularly the recent model from Google DeepMind named Gemini Diffusion, which stands out for its exceptional speed.
A diffusion model is a type of generative model, i.e., an AI program capable of generating new data similar to what it was trained on. Initially popularized for image generation, these neural networks learn how noise gradually "diffuses" through examples (such as training images), and then learn to reverse this process to produce high-quality images from pure noise. Diffusion models are thus at the heart of modern generative AI, used by well-known text-to-image programs such as Stable Diffusion (from Stability AI), DALL-E 2 (from OpenAI), Midjourney, and Imagen (from Google). Compared to older approaches (such as GANs or variational autoencoders), diffusion models often offer better training stability and higher image quality.
The intuition behind the term "diffusion" is inspired by a physical phenomenon. Think of a drop of ink diffusing in a glass of water: the ink molecules gradually disperse in the water until it is uniform. Similarly, if you randomly add noise to an image, it eventually becomes a texture of "TV snow," i.e., pure random noise. By mathematically modeling this diffusion process (adding noise) and then learning to reverse it, an AI model can generate new images by simply starting from noise and denoising it step by step. The noise can be likened to the static on an off-air television: this random "grain" is what the model uses as the raw material for creation.
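To make the forward process concrete, here is a tiny Python sketch (using NumPy) that blends an image array with random noise at increasing levels. The "image" is just a random stand-in, and the blending formula is a simplified illustration rather than the exact noise schedule real models use.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # stand-in for a real grayscale photo, values in [0, 1]

def add_noise(img, noise_level):
    """Blend the image with Gaussian noise; at level 1.0 only noise remains."""
    noise = rng.normal(0.0, 1.0, img.shape)
    return (1.0 - noise_level) * img + noise_level * noise

# Eleven steps, from the clean image (level 0.0) to pure "TV snow" (level 1.0).
steps = [add_noise(image, t / 10) for t in range(11)]
print(steps[-1].std() > steps[0].std())  # → True: noise dominates more and more
```

Running this, the first step is the untouched image and the last is pure random noise, exactly like the ink drop that has fully dispersed in the water.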
Diffusion models operate in two main phases: first, the gradual addition of noise to data during training, and then the removal of noise (generation) to create new data. In the learning phase, the model trains to gradually destroy data by drowning it in noise, then to reconstruct that data in reverse. Directly transforming random noise into a clear image is a very difficult problem, but transforming a slightly noisy image into a slightly less noisy image is much simpler. The model therefore learns to perform successive small improvements rather than a large direct leap towards the final image. This step-by-step process ensures much more controlled and efficient generation.
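The idea of "successive small improvements" can be sketched in a few lines of Python. The `denoise_step` function below is a hypothetical stand-in for a trained neural network: it only nudges the data slightly toward a clean target, so many iterations are needed, just as in real diffusion sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.linspace(0.0, 1.0, 16)  # the "clean" data we want to reach

def denoise_step(x, strength=0.2):
    """Hypothetical stand-in for a trained network: removes only a little
    noise per call, nudging x toward the clean data."""
    return x + strength * (target - x)

x = rng.normal(0.0, 1.0, 16)  # start from pure random noise
for _ in range(50):           # many small steps instead of one big leap
    x = denoise_step(x)

print(np.abs(x - target).max() < 1e-3)  # → True: the noise is essentially gone
```

Each call only has to solve the easy problem (slightly less noisy than before), yet the loop as a whole turns pure noise into clean data.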
Figure: diagram illustrating the diffusion process on images. At the top (blue arrow), the forward process gradually adds noise to a starting image (here a photo of a chair) until a completely random image is obtained. At the bottom (orange arrow), the reverse diffusion process conversely generates an image by starting from pure noise and removing that noise step by step. By progressively "denoising" the initial noise, the model manages to reconstruct a clear image.
Diffusion models have demonstrated remarkable efficiency for image generation. For example, Stable Diffusion (released in 2022) popularized the creation of images from text descriptions in open source. These models can also perform inpainting (completing missing areas of an image) or super-resolution (improving the quality/resolution of an image) in a very realistic way. They have opened the door to new forms of digital creativity by allowing anyone to generate illustrations or artwork from their imagination described in words.
Technically, one of the major advancements involved accelerating generation. Early diffusion models could be relatively slow because they require many denoising iterations. Stable Diffusion partially bypassed this problem thanks to latent diffusion: instead of adding and removing noise directly on the raw image (e.g., 512x512 pixels), the model works on a compressed version of the image (e.g., 64x64 "features" instead of pixels). Once the denoising is performed in this reduced space, a final step reconstructs the image in high resolution. This latent space trick, made possible thanks to an autoencoder, has drastically reduced the time and computation needed to generate images. In practice, this has multiplied inference speed by about 2.7 (compared to a model diffusing directly on pixels) while maintaining equivalent quality.
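A quick back-of-the-envelope calculation shows why working in the latent space helps. The 512x512 and 64x64 sizes come from the text above; the channel counts (3 for RGB pixels, 4 per latent cell) are assumptions matching common latent-diffusion setups.

```python
# Values the denoiser must process at each step, using the sizes from the
# text; the channel counts (3 RGB, 4 latent) are typical assumptions.
pixel_values = 512 * 512 * 3   # raw pixel space
latent_values = 64 * 64 * 4    # compressed latent space
print(pixel_values / latent_values)  # → 48.0, i.e. ~48x less data per step
```

The raw data reduction per step is much larger than the overall 2.7x speedup because the autoencoder's encoding and decoding, and the rest of the pipeline, also take time.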
If diffusion models are initially associated with images, they also find applications in other fields. In audio, for example, diffusion models can be trained to generate music or voices starting from noise (imagine white noise refined into music). In video, recent work allows generating animated sequences by diffusing noise over time (although this remains computationally expensive). Their use is even explored in chemistry and medicine: models learn to randomly diffuse molecular structures and then improve them to discover new molecules with interesting properties (e.g., for drugs). These examples illustrate the growing importance of diffusion models in AI: they offer a flexible framework for generating all kinds of high-quality data.
Until recently, AI text generation relied almost exclusively on autoregressive models (like GPT-3, ChatGPT, etc.), which produce text word-by-word sequentially. Now, the concept of diffusion also extends to text: a language diffusion model generates text starting from a chaotic initial input (e.g., a sequence of random characters or scrambled text) and progressively refines it to obtain coherent sentences. Google DeepMind presented its first model of this type in May 2025, named Gemini Diffusion. According to Google, Gemini Diffusion is a state-of-the-art model that "learns to generate coherent text or code by converting random noise into structured information," much like image diffusion models generate visuals from noise. In other words, where a model like GPT builds a sentence by adding each word one after another, Gemini Diffusion starts from the equivalent of a blank page filled with noise and makes the complete text emerge in several passes.
This "noise-to-text" approach offers several advantages. On one hand, the model considers the entire sentence at each generation step, allowing it to adjust and correct the text as it goes to improve coherence. By comparison, an autoregressive model that has poorly chosen a word at the beginning of a sentence cannot easily go back to change it, which can lead to inconsistencies. Diffusion, by re-evaluating the entire text at each iteration, can avoid this problem and produce more coherent texts, especially when they are long. On the other hand, since generation is not purely sequential, it can be partially parallelized, opening the way to much higher production speeds than classical models.
Gemini Diffusion illustrates precisely these advancements. It stands out for its record generation speed. Where the best previous language models produce perhaps tens of words per second, Gemini Diffusion can generate the equivalent of several pages in an instant. In figures, Google announces a throughput of approximately 1,479 tokens per second (a "token" is a unit of text, comparable to a word or word fragment) in its tests, with a very low initial latency of about 0.8 seconds. This makes Gemini Diffusion one of the fastest text models in the world. Moreover, this speed does not compromise quality: the new model matches the performance of Google's previous flagship model in code generation while being much faster. In essence, Gemini Diffusion produces text of similar quality to the best existing models, but at a significantly higher speed.
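To see where "several pages in an instant" comes from, a rough calculation based on the announced throughput helps. The words-per-token and words-per-page figures below are common ballpark assumptions, not numbers from Google's announcement.

```python
# Rough arithmetic behind "several pages in an instant"; only the token
# throughput comes from Google's figures, the rest are ballpark assumptions.
tokens_per_second = 1479
words_per_token = 0.75  # assumption: English averages ~3/4 of a word per token
words_per_page = 500    # assumption: a typical printed page

pages_per_second = tokens_per_second * words_per_token / words_per_page
print(round(pages_per_second, 1))  # → 2.2 pages of text per second
```

Even with conservative assumptions, that is on the order of a full page of text every half second.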
Technically, little detailed information has been made public about the innovations enabling this acceleration. It is known, however, that researchers have optimized the scheduler (the program that regulates the denoising progression) and explored masked-input strategies to better guide the model in reconstructing text. Google has also indicated that it is working to reduce the latency of its entire suite of Gemini models, with an upcoming version called 2.5 Flash Lite being even faster. It is therefore likely that Gemini Diffusion benefits from novel algorithmic and architectural optimizations that accelerate the diffusion process without sacrificing accuracy. In any case, its experimental launch (a demo is available upon registration) was received as a major breakthrough, demonstrating that a diffusion model can compete with, and even surpass, traditional models for text generation, including demanding tasks like programming.
In a few years, diffusion models have transitioned from academic curiosities to a pillar of modern generative AI. Their unique way of creating content – by starting from noise and progressively sculpting the information – has proven extremely powerful for generating strikingly realistic images, coherent text, and many other types of data. These models have improved the quality and diversity of outputs compared to previous techniques, while avoiding certain pitfalls (for example, less mode collapse risk than classical GANs).
Recent work, exemplified by Google DeepMind's Gemini Diffusion, shows that diffusion models continue to evolve rapidly. They are gaining efficiency and speed, which gradually removes the main obstacle to their large-scale use. Admittedly, the iterative denoising process remains computationally expensive and can require time for very high-resolution outputs or complex tasks. However, the gap is narrowing thanks to clever optimizations. One can imagine that in the future, diffusion models will be increasingly integrated into AI applications, whether for creating 3D virtual worlds, assisting content creators, generating code instantly, or aiding scientific discovery (molecules, materials, etc.). Their ability to start from chaos and arrive at a structured result is not only a technical feat but also a new way of thinking about automatic content generation. For today's high school students who will be tomorrow's creators, understanding diffusion models means looking at the state-of-the-art of AI and envisioning the exciting possibilities ahead in this rapidly evolving field.