Diffusion models are among the most innovative techniques in generative AI. They allow for the creation of new images, text, or other data from random noise. This educational article, aimed at high school students, explains in an accessible way how diffusion models work using simple analogies and diagrams. We will discover why they are important in artificial intelligence and how they have revolutionized content generation. Finally, we present the latest advancements in the field, particularly the recent model from Google DeepMind named Gemini Diffusion, which stands out for its exceptional speed.
A diffusion model is a type of generative model, i.e., an AI program capable of generating new data similar to what it was trained on. Initially popularized for image generation, these neural networks learn how noise gradually "diffuses" through examples (such as training images), and then learn to reverse this process to produce high-quality images from pure noise. Diffusion models are thus at the heart of modern generative AI, used by well-known text-to-image programs such as Stable Diffusion (from Stability AI), DALL-E 2 (from OpenAI), Midjourney, and Imagen (from Google). Compared to older approaches (such as GANs or variational autoencoders), diffusion models often offer better training stability and higher image quality.
The intuition behind the term "diffusion" is inspired by a physical phenomenon. Think of a drop of ink diffusing in a glass of water: the ink molecules gradually disperse in the water until it is uniform. Similarly, if you randomly add noise to an image, it eventually becomes a texture of "TV snow," i.e., pure random noise. By mathematically modeling this diffusion process (adding noise) and then learning to reverse it, an AI model can generate new images by simply starting from noise and denoising it step by step. The noise can be likened to the static on an off-air television: this random "grain" is what the model uses as the raw material for creation.
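To make the forward process concrete, here is a tiny Python sketch (using NumPy) that blends an image array with random noise at increasing levels. The "image" is just a random stand-in, and the blending formula is a simplified illustration rather than the exact noise schedule real models use.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # stand-in for a real grayscale photo, values in [0, 1]

def add_noise(img, noise_level):
    """Blend the image with Gaussian noise; at level 1.0 only noise remains."""
    noise = rng.normal(0.0, 1.0, img.shape)
    return (1.0 - noise_level) * img + noise_level * noise

# Eleven steps, from the clean image (level 0.0) to pure "TV snow" (level 1.0).
steps = [add_noise(image, t / 10) for t in range(11)]
print(steps[-1].std() > steps[0].std())  # → True: noise dominates more and more
```

Running this, the first step is the untouched image and the last is pure random noise, exactly like the ink drop that has fully dispersed in the water.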
Diffusion models operate in two main phases: first, the gradual addition of noise to data during training, and then the removal of noise (generation) to create new data. In the learning phase, the model trains to gradually destroy data by drowning it in noise, then to reconstruct that data in reverse. Directly transforming random noise into a clear image is a very difficult problem, but transforming a slightly noisy image into a slightly less noisy image is much simpler. The model therefore learns to perform successive small improvements rather than a large direct leap towards the final image. This step-by-step process ensures much more controlled and efficient generation.
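The idea of "successive small improvements" can be sketched in a few lines of Python. The `denoise_step` function below is a hypothetical stand-in for a trained neural network: it only nudges the data slightly toward a clean target, so many iterations are needed, just as in real diffusion sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.linspace(0.0, 1.0, 16)  # the "clean" data we want to reach

def denoise_step(x, strength=0.2):
    """Hypothetical stand-in for a trained network: removes only a little
    noise per call, nudging x toward the clean data."""
    return x + strength * (target - x)

x = rng.normal(0.0, 1.0, 16)  # start from pure random noise
for _ in range(50):           # many small steps instead of one big leap
    x = denoise_step(x)

print(np.abs(x - target).max() < 1e-3)  # → True: the noise is essentially gone
```

Each call only has to solve the easy problem (slightly less noisy than before), yet the loop as a whole turns pure noise into clean data.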
Figure: diagram illustrating the diffusion process on images. At the top (blue arrow), the forward process gradually adds noise to a starting image (here a photo of a chair) until a completely random image is obtained. At the bottom (orange arrow), the reverse diffusion process conversely generates an image by starting from pure noise and removing that noise step by step. By progressively "denoising" the initial noise, the model manages to reconstruct a clear image.
Diffusion models have demonstrated remarkable efficiency for image generation. For example, Stable Diffusion (released in 2022) popularized the creation of images from text descriptions in open source. These models can also perform inpainting (completing missing areas of an image) or super-resolution (improving the quality/resolution of an image) in a very realistic way. They have opened the door to new forms of digital creativity by allowing anyone to generate illustrations or artwork from their imagination described in words.
Technically, one of the major advancements involved accelerating generation. Early diffusion models could be relatively slow because they require many denoising iterations. Stable Diffusion partially bypassed this problem thanks to latent diffusion: instead of adding and removing noise directly on the raw image (e.g., 512x512 pixels), the model works on a compressed version of the image (e.g., 64x64 "features" instead of pixels). Once the denoising is performed in this reduced space, a final step reconstructs the image in high resolution. This latent space trick, made possible thanks to an autoencoder, has drastically reduced the time and computation needed to generate images. In practice, this has multiplied inference speed by about 2.7 (compared to a model diffusing directly on pixels) while maintaining equivalent quality.
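A quick back-of-the-envelope calculation shows why working in the latent space helps. The 512x512 and 64x64 sizes come from the text above; the channel counts (3 for RGB pixels, 4 per latent cell) are assumptions matching common latent-diffusion setups.

```python
# Values the denoiser must process at each step, using the sizes from the
# text; the channel counts (3 RGB, 4 latent) are typical assumptions.
pixel_values = 512 * 512 * 3   # raw pixel space
latent_values = 64 * 64 * 4    # compressed latent space
print(pixel_values / latent_values)  # → 48.0, i.e. ~48x less data per step
```

The raw data reduction per step is much larger than the overall 2.7x speedup because the autoencoder's encoding and decoding, and the rest of the pipeline, also take time.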
If diffusion models are initially associated with images, they also find applications in other fields. In audio, for example, diffusion models can be trained to generate music or voices starting from noise (imagine white noise refined into music). In video, recent work allows generating animated sequences by diffusing noise over time (although this remains computationally expensive). Their use is even explored in chemistry and medicine: models learn to randomly diffuse molecular structures and then improve them to discover new molecules with interesting properties (e.g., for drugs). These examples illustrate the growing importance of diffusion models in AI: they offer a flexible framework for generating all kinds of high-quality data.
Until recently, AI text generation relied almost exclusively on autoregressive models (like GPT-3, ChatGPT, etc.), which produce text word-by-word sequentially. Now, the concept of diffusion also extends to text: a language diffusion model generates text starting from a chaotic initial input (e.g., a sequence of random characters or scrambled text) and progressively refines it to obtain coherent sentences. Google DeepMind presented its first model of this type in May 2025, named Gemini Diffusion. According to Google, Gemini Diffusion is a state-of-the-art model that "learns to generate coherent text or code by converting random noise into structured information," much like image diffusion models generate visuals from noise. In other words, where a model like GPT builds a sentence by adding each word one after another, Gemini Diffusion starts from the equivalent of a blank page filled with noise and makes the complete text emerge in several passes.
This "noise-to-text" approach offers several advantages. On one hand, the model considers the entire sentence at each generation step, allowing it to adjust and correct the text as it goes to improve coherence. By comparison, an autoregressive model that has poorly chosen a word at the beginning of a sentence cannot easily go back to change it, which can lead to inconsistencies. Diffusion, by re-evaluating the entire text at each iteration, can avoid this problem and produce more coherent texts, especially when they are long. On the other hand, since generation is not purely sequential, it can be partially parallelized, opening the way to much higher production speeds than classical models.
Gemini Diffusion illustrates precisely these advancements. It stands out for its record generation speed. Where the best previous language models produce perhaps tens of words per second, Gemini Diffusion can generate the equivalent of several pages in an instant. In figures, Google announces a throughput of approximately 1,479 tokens per second (a "token" is a unit of text, comparable to a word or word fragment) in its tests, with a very low initial latency of about 0.8 seconds. This makes Gemini Diffusion one of the fastest text models in the world. Moreover, this speed does not compromise quality: the new model matches the performance of Google's previous flagship model in code generation while being much faster. In essence, Gemini Diffusion produces text of similar quality to the best existing models, but at a significantly higher speed.
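To see where "several pages in an instant" comes from, a rough calculation based on the announced throughput helps. The words-per-token and words-per-page figures below are common ballpark assumptions, not numbers from Google's announcement.

```python
# Rough arithmetic behind "several pages in an instant"; only the token
# throughput comes from Google's figures, the rest are ballpark assumptions.
tokens_per_second = 1479
words_per_token = 0.75  # assumption: English averages ~3/4 of a word per token
words_per_page = 500    # assumption: a typical printed page

pages_per_second = tokens_per_second * words_per_token / words_per_page
print(round(pages_per_second, 1))  # → 2.2 pages of text per second
```

Even with conservative assumptions, that is on the order of a full page of text every half second.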
Technically, little detailed information has been made public about the innovations enabling this acceleration. It is known, however, that researchers have optimized the scheduler (the program that regulates the denoising progression) and explored masked-input strategies to better guide the model in reconstructing text. Google has also indicated that it is working to reduce the latency of its entire suite of Gemini models, with an upcoming version called 2.5 Flash Lite being even faster. It is therefore likely that Gemini Diffusion benefits from novel algorithmic and architectural optimizations that accelerate the diffusion process without sacrificing accuracy. In any case, its experimental launch (a demo is available upon registration) was received as a major breakthrough, demonstrating that a diffusion model can compete with, and even surpass, traditional models for text generation, including demanding tasks like programming.
In a few years, diffusion models have transitioned from academic curiosities to a pillar of modern generative AI. Their unique way of creating content – by starting from noise and progressively sculpting the information – has proven extremely powerful for generating strikingly realistic images, coherent text, and many other types of data. These models have improved the quality and diversity of outputs compared to previous techniques, while avoiding certain pitfalls (for example, less mode collapse risk than classical GANs).
Recent work, exemplified by Google DeepMind's Gemini Diffusion, shows that diffusion models continue to evolve rapidly. They are gaining efficiency and speed, which gradually removes the main obstacle to their large-scale use. Admittedly, the iterative denoising process remains computationally expensive and can require time for very high-resolution outputs or complex tasks. However, the gap is narrowing thanks to clever optimizations. One can imagine that in the future, diffusion models will be increasingly integrated into AI applications, whether for creating 3D virtual worlds, assisting content creators, generating code instantly, or aiding scientific discovery (molecules, materials, etc.). Their ability to start from chaos and arrive at a structured result is not only a technical feat but also a new way of thinking about automatic content generation. For today's high school students who will be tomorrow's creators, understanding diffusion models means looking at the state-of-the-art of AI and envisioning the exciting possibilities ahead in this rapidly evolving field.