It’s not about audio and video anymore
For decades, we treated compression as an audio/video problem; today, it's about compressing every kind of data: genomes, point clouds, haptics, 3D scenes, neural networks, and machine features.
Every data type now has to go through some form of compression, simply because we are generating an absurd number of bits in every sector, from entertainment to medicine to autonomous vehicles.
In just over 70 years since the transistor was invented in 1947, we’ve unlocked unprecedented computing power, wireless networks, the internet, artificial intelligence, mobile devices, high‑resolution displays, and spectacular advances in genetics, medicine, and space exploration.
All of this rests on one substrate: digital data.
As humans, we love data. Food and water may feed our bodies, but data, once transformed into knowledge, feeds our minds.
When we share it, we evolve as a species.
We innovate.
And we don’t seem to be slowing down.
Back in 2020, global data created, captured, copied and consumed in a single year was about 59 zettabytes (the equivalent of 59 trillion gigabytes), and projections put 2025 at around 175 ZB. One zettabyte is 8,000,000,000,000,000,000,000 bits. We’re good at generating data, but the problem is how to transfer it, store it, process it… and trust it.
You’ve probably heard the line “data is the new oil.” It isn’t. Oil is finite. Data is not.
That’s exactly why compression is now a foundational technology for the entire digital ecosystem.
The backbone of the media world
ISO/IEC JTC 1/SC 29 is not a brand many people recognize, but its work underpins the entire digital media and entertainment industry. This subcommittee coordinates JPEG, which defines image compression standards, and the MPEG-affiliated groups, which develop technologies for compressing and transporting video, audio, and other multimedia data.
The standards coming out of SC 29 cover the full value chain: content creation, processing, and storage; broadcast distribution; streaming over IP; and consumption on everything from smartphones to large‑screen TVs. What’s changing now is that their scope is widening from “media for humans” to “data for humans and machines” across images, video, 3D, AI, and beyond.
JPEG: From .jpg to AI, Trust, Plenoptic and DNA
For over 30 years, JPEG (.jpg) has been the default visual format of the web. But the committee has significantly expanded its portfolio.
JPEG AI: Latent tensors, not pixels
JPEG AI is the first learning‑based image coding standard that uses AI and latent spaces instead of hand‑crafted transforms.
At the core of JPEG AI, the codec transforms an image into a latent tensor that is then compressed and transmitted. The decoder reconstructs the image from this latent representation, but can also operate directly in the compressed domain, enabling analysis pipelines and computer vision tasks without fully decoding pixels.
A single compressed representation serves both human viewing and machine analysis.
JPEG AI also integrates the concept of "on-demand" complexity: the standard defines three decoding variants with different computational cost, letting each device choose the variant best suited to its hardware. This adaptive architecture allows optimised use across a wide range of devices, from low-power terminals to high-performance platforms.
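To make the idea concrete, here is a deliberately toy sketch of the latent-tensor pipeline, using average pooling and uniform quantization as stand-ins for JPEG AI's learned analysis/synthesis networks and entropy model (all function names and shapes here are illustrative assumptions, not the standard):

```python
import numpy as np

def analysis_transform(img: np.ndarray) -> np.ndarray:
    """Map an HxWx3 image to a smaller latent tensor (here: 8x average pooling)."""
    h, w, c = img.shape
    return img.reshape(h // 8, 8, w // 8, 8, c).mean(axis=(1, 3))

def quantize(latent: np.ndarray, step: float = 4.0) -> np.ndarray:
    """Uniform quantization; a real codec would entropy-code these symbols."""
    return np.round(latent / step).astype(np.int32)

def synthesis_transform(symbols: np.ndarray, step: float = 4.0) -> np.ndarray:
    """Reconstruct pixels from the latent (here: naive nearest upsampling)."""
    latent = symbols.astype(np.float32) * step
    return latent.repeat(8, axis=0).repeat(8, axis=1)

img = np.random.rand(256, 256, 3).astype(np.float32) * 255
symbols = quantize(analysis_transform(img))

# Human path: full pixel reconstruction.
recon = synthesis_transform(symbols)

# Machine path: consume the latent directly, no pixel decode needed.
global_descriptor = symbols.mean(axis=(0, 1))   # e.g., input to a classifier
print(symbols.shape, recon.shape, global_descriptor.shape)
```

The point of the "machine path" is that a vision task can operate on the compressed-domain tensor directly, skipping pixel reconstruction entirely.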
JPEG Trust: Authenticity in an AI‑synthetic world
The web is being flooded with synthetic images, videos, and even news, all generated in seconds by powerful generative models. Humans and algorithms can no longer easily distinguish what’s real from what’s synthetic, and that directly impacts search ranking, brand trust, and user perception of authenticity.
JPEG Trust defines a framework for tracking origin, authenticity, and ownership of digital images, including AI‑generated content. This is essential for properly managing the dissemination and use of media that has been deliberately modified or created to manipulate public opinion (deepfakes), an objective defined in the AI Act issued by the European Union.
It is built upon and extends the Coalition for Content Provenance and Authenticity (C2PA) framework for verifying authenticity on the web, and defines a standard way to attach metadata to media that records its origin and any subsequent modifications.
Think of it as an embedded digital signature that travels with your content.
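As a rough sketch of what such traveling metadata could look like, assuming a JSON manifest and an HMAC in place of the real C2PA binary manifests and asymmetric signatures (all field names here are hypothetical):

```python
import hashlib, hmac, json

SECRET_KEY = b"demo-key-not-for-production"

def make_manifest(image_bytes: bytes, author: str, action: str) -> dict:
    manifest = {
        "content_hash": hashlib.sha256(image_bytes).hexdigest(),
        "author": author,
        "action": action,          # e.g. "captured", "cropped", "ai-generated"
    }
    payload = json.dumps(manifest, sort_keys=True).encode()
    manifest["signature"] = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return manifest

def verify(image_bytes: bytes, manifest: dict) -> bool:
    claimed = dict(manifest)
    sig = claimed.pop("signature")
    payload = json.dumps(claimed, sort_keys=True).encode()
    ok_sig = hmac.compare_digest(
        sig, hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest())
    ok_hash = claimed["content_hash"] == hashlib.sha256(image_bytes).hexdigest()
    return ok_sig and ok_hash

image = b"...jpeg bytes..."
m = make_manifest(image, author="camera-123", action="captured")
print(verify(image, m))          # True
print(verify(image + b"x", m))   # False: any edit breaks the provenance chain
```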
JPEG Pleno: Light fields, point clouds, holograms
JPEG Pleno is an international framework for representing and compressing plenoptic data: light fields, point clouds, and holograms. It goes beyond 2D images to capture the direction and intensity of light in space, not just color and brightness.
This is crucial for VR/AR, medical imaging, and cultural‑heritage applications, where you need high compression, random access, and interactivity on complex volumetric content.
JPEG Pleno standardises encoding tools and file formats so that next-generation capture devices, such as light-field cameras, LiDAR systems and volumetric platforms, can be integrated into interoperable workflows rather than customised, isolated pipelines.
JPEG XS: Lightweight, low‑latency video
JPEG XS is different: it targets video, but with a very different focus from MPEG's codecs. Instead of pushing compression to the limit, JPEG XS prioritizes ultra-low latency and low complexity. In many professional workflows, it emerges as a practical alternative to uncompressed video.
A lightweight compression ratio of around 4:1 can be enough to transport a 4K/50p/4:2:2/10‑bit signal within the bandwidth traditionally required for 1080p50 over 3G‑SDI. That means: same cable, HD bandwidth, but 4K content. This not only saves bandwidth but also reduces energy consumption whenever the cost of compression is lower than the cost of transmission.
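The arithmetic is easy to verify; here is a quick back-of-the-envelope check, assuming 20 bits per pixel for 4:2:2 10-bit and a ~2.97 Gbit/s usable 3G-SDI payload (interface overheads ignored):

```python
# 4K/50p/4:2:2/10-bit over 3G-SDI, at a ~4:1 JPEG XS compression ratio.
width, height, fps = 3840, 2160, 50
bits_per_pixel = 10 + 2 * 5          # luma + two half-rate chroma channels

raw = width * height * fps * bits_per_pixel      # uncompressed rate
compressed = raw / 4                              # JPEG XS at ~4:1

print(f"raw:        {raw / 1e9:.2f} Gbit/s")      # ~8.29 Gbit/s
print(f"compressed: {compressed / 1e9:.2f} Gbit/s (fits ~2.97 Gbit/s 3G-SDI)")
```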
This is one of the reasons the Television Academy awarded a 2025 Technology & Engineering Emmy to Fraunhofer and intoPIX for their work on JPEG XS.
JPEG DNA: Storing images in molecules
One of the most fascinating explorations is JPEG DNA, where the storage medium is not magnetic or optical, but biological. DNA is nature’s original data store, with incredible longevity, orders of magnitude beyond any disk or tape.
The problem: our digital data production is exponential, while conventional storage media have limited durability and often become unreadable within a couple of decades (think floppy disks and CDs). DNA, by contrast, can preserve information for centuries or longer under the right conditions.
JPEG DNA aims to define how digital images can be encoded into DNA sequences in a way that is both efficient and robust, while respecting biochemical constraints and handling high error rates in synthesis and sequencing.
It sounds like science fiction, but it is an attempt to imagine what a ‘future-proof’ storage solution might look like when we start thinking in terms of centuries rather than years.
MPEG: Beyond bitrate, and toward AI‑native and energy-aware codecs
Why we still need new video codecs
The Moving Picture Experts Group (MPEG) has, over more than 35 years of activity, released codecs that literally built the media industry: MPEG‑2, MPEG‑4, AVC, HEVC, VVC, AAC, and more.
The latest video standard is called VVC (Versatile Video Coding) and was published in 2020.
Why do we need a new video codec?
Video is still the most bandwidth‑hungry medium we distribute today; it floods IP networks, terrestrial and satellite links, and data centers.
Each new codec generation brought bitrate reductions that translated directly into lower delivery costs and wider reach (e.g., UHD for users who couldn’t access it under AVC/HEVC constraints).
That logic still holds, but it's no longer the only driver: bitrate reduction at equal visual quality, although important, is not the only motivation behind a new video codec. Next-generation codecs will be evaluated not just on compression efficiency but also on latency, deployability, implementation cost, and use‑case relevance.
The MPEG Enhanced Compression Model (ECM) project has reached version 19, demonstrating roughly 27% bitrate savings over VVC in random-access configurations. ECM is a likely foundation for the future H.267 codec, which aims for a ~40% bitrate reduction relative to VVC (H.266).
The codec is designed for diverse applications, including mobile streaming, live broadcasting, immersive VR/AR, cloud gaming, and AI-generated content. It targets efficient real-time decoding and scalable encoder complexity, supporting resolutions up to 8K×4K and frame rates up to 240 fps. It supports stereoscopic 3D, multi-view content, wide color gamut, and high dynamic range.
The H.267 standard is currently projected to be finalized in 2028, with meaningful deployment likely not occurring until around 2034–2036.
Interestingly, the codec cycle has shortened: roughly 10 years between AVC and HEVC, about 7 between HEVC and VVC, and an expected ~8 between VVC and H.267. One reason is that MPEG is no longer the only “rooster in the henhouse”: AOMedia is advancing AV1 and AV2, AVS3 is growing in China, and proprietary or niche codecs are emerging in parallel.
The proposed timeline aims to maintain a cycle in which the new standard remains competitive in the marketplace, delivering superior compression performance and meeting industry needs.
A Call for Proposals is expected in July 2026.
Between mid-2026 and early 2027, a series of subjective video quality assessments will be conducted, covering various categories of content (SDR, HDR, games and user-generated content). Independent laboratories are invited to participate (deadline extended to 15 April 2026), subject to strict technical, organisational and conflict-of-interest requirements; the results will contribute to the evaluation of future video coding standards.
Energy efficiency and green metadata
As codecs grow more complex, encoding becomes more energy‑intensive. That's increasingly unacceptable in a world where sustainability is a board-level priority.
ISO/IEC 23001‑11 (Green Metadata) addresses energy-efficient media consumption by defining metadata that allows devices and displays to reduce power usage, for example by adapting backlight levels to content characteristics.
Rather than treating energy as an afterthought, the ecosystem is starting to treat “joules per bit” as seriously as bits per pixel.
“Energy efficiency” is becoming a formal selection criterion for codecs, alongside BD‑Rate, especially in mobile and large‑scale streaming contexts.
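As a toy illustration of the backlight-adaptation principle (the field names below are hypothetical, not the actual ISO/IEC 23001‑11 syntax), an encoder could emit per-scene brightness statistics that let a display dim its backlight on dark content:

```python
import numpy as np

def green_metadata_for_scene(frames: np.ndarray) -> dict:
    """frames: (N, H, W) luma in [0, 255]."""
    peak = frames.max() / 255.0
    # If the brightest pixel sits at 60% of full scale, the backlight can be
    # driven at ~60% while pixel values are scaled up to compensate.
    return {"max_luma_ratio": float(peak), "suggested_backlight": float(peak)}

scene = (np.random.rand(48, 270, 480) * 160).astype(np.float32)  # dark-ish scene
print(green_metadata_for_scene(scene))
```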
AI in video coding: Hybrid, super‑resolution, end‑to‑end
What about AI applied to video coding?
MPEG continues its work to move beyond the constraints of the traditional 2D transform plus motion‑compensation framework.
AI has become increasingly significant in the 2020s and will no doubt influence many aspects of our lives. However, its impact on the near-term evolution of communication technologies is still uncertain. In the context of video coding, it is important to recognize that widely deployed, mass-market video systems must be practical, robust, energy-efficient, and cost-effective, while still delivering state‑of‑the‑art compression performance.
Any AI-based approach must therefore address the entire video processing chain, from pre‑processing and encoding to storage, transmission, decoding, post‑processing, analysis, and content repurposing. Moreover, these solutions must support high resolution, high frame rates, and high dynamic range, all while operating in real time.
This is why JVET evaluates neural tools at several operating points: VLOP, LOP, and HOP (very low, low, and high operation points), explicitly balancing coding gains against computational budget.
The JVET group is exploring Neural Network Video Coding (NNVC) along three main directions:
- Hybrid codecs with neural tools
Neural components are added into the traditional transform + motion‑compensation framework, replacing or augmenting existing tools. Examples include:
- Deep Reference Frame (DRF inter), which enhances reference frames for motion compensation… at the cost of higher decoder complexity.
- Cross Component Convolutional Models (CCCM), which improve chroma prediction and denoising by learning cross‑component structure, offering a favorable trade‑off.
- Neural super‑resolution and post‑filters
The codec remains conventional (e.g., VVC), but the pipeline is modified (see the sketch after this list):
- Input resolution is reduced before encoding, resulting in a much smaller bitstream. If the pre‑processing stage downsamples the input by a factor of two in both the horizontal and vertical directions, the amount of data entering the encoder is immediately reduced by a factor of four.
- After decoding, neural super‑resolution upsamples the video back to the target resolution. The key is to recover perceived quality with super‑resolution.
- Neural post‑filters (NNPF) operate after decoding to enhance quality while preserving bitstream conformance.
- End‑to‑end neural codecs
Here the entire pipeline (analysis transform, entropy model, synthesis transform) is learned as a single network. In MPEG's taxonomy, neural super-resolution, learned intra coding, and DRF inter (Deep Reference Frame) are also considered end-to-end tools.
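The second direction, downsampling before the codec and super-resolving after it, is easy to sketch. Below, coarse quantization stands in for the VVC encode/decode round trip and nearest-neighbor upsampling stands in for the trained neural upsampler; both are toy assumptions:

```python
import numpy as np

def downsample2x(frame: np.ndarray) -> np.ndarray:
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))  # 4x fewer samples

def fake_codec(frame: np.ndarray, step: float = 8.0) -> np.ndarray:
    return np.round(frame / step) * step           # stand-in for VVC encode+decode

def upsample2x(frame: np.ndarray) -> np.ndarray:
    return frame.repeat(2, axis=0).repeat(2, axis=1)  # a trained NN would do better

src = np.random.rand(1080, 1920).astype(np.float32) * 255
out = upsample2x(fake_codec(downsample2x(src)))
print(src.shape, "->", downsample2x(src).shape, "->", out.shape)
```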
NNVC is at version 15 of its algorithm description and software, and reports BD‑rate reductions of roughly 6–14% versus VVC under Y‑PSNR, with higher gains at the higher-complexity operating points: about 6% with NN intra plus the VLOP filter, and about 14% with NN intra plus the HOP filter (two tools in each case).
The downside is decoder complexity: from roughly an order of magnitude above VVC at the very low operating point to two orders of magnitude in the most aggressive modes (about 14× the VTM anchor at VLOP, up to 118× at HOP), which is challenging for mobile devices.
The 2026–2027 roadmap is shaping up to be particularly compelling. From a technical perspective, there is a strong and growing emphasis on reducing computational complexity and energy consumption. Complexity reporting has become a standardized and integral part of the development process, not an afterthought. Techniques such as weight pruning, reduced receptive fields, knowledge distillation, and fully integer‑only inference are increasingly viewed as baseline requirements rather than optional optimizations.
At this stage, true differentiation is expected to come from deep, kernel‑level optimizations, particularly targeting SIMD architectures, where a single instruction operates on multiple data elements in parallel, and NPU backends, which are specialized processors designed to efficiently accelerate AI and machine‑learning workloads.
Ultimately, success will belong to those who can deliver the highest performance in real deployments, in other words, those who can ship the fastest, most efficient solutions.
Reproducibility is another major theme.
There is a clear shift toward bit‑exact inference and the establishment of model registries, official repositories where the committee hosts the exact neural network models used for experiments and cross‑checks. These registries capture not only the model architecture and weights, but also versioning, training recipes, and relevant metadata.
The goal is to ensure that every submission is fully reusable, auditable, and verifiable by others. The ecosystem is increasingly embracing a “trust, but verify” philosophy, and the tooling is evolving accordingly to support transparent validation and long‑term reproducibility.
We’re also seeing early consolidation around specific NN tools. For mainstream profiles, Neural Network Loop Filter (NNLF), both LOP and VLOP, and Cross Component Convolutional Models (CCCM) are looking like the early winners. Meanwhile, DRF inter seems poised to appear more often in higher‑tier encoders and decoders, especially in environments where NPUs are available.
The question is no longer “Do neural tools help?” but “How much gain can we keep while meeting decoder power and latency budgets?”
The next two JVET cycles will put these ideas to the test, as the community defines a path beyond VVC. The lessons learned from NNVC are expected to play a major role in shaping future test conditions and in setting expectations for permissible complexity, helping to establish a realistic and well‑grounded baseline for next‑generation video coding technologies.
Video for Machines: VCM and FCM
Most people still think of video compression as something done “for humans” to watch. But today, a large share of visual data, especially from cameras, is consumed by machines: autonomous vehicles, drones, industrial robots, smart‑city sensors, and surveillance systems.
Yet, the majority of these systems still stream pixel‑based video compressed with human‑centric codecs. This wastes bandwidth, doesn’t scale well, and exposes raw visual content, including faces and sensitive scenes, to third‑party servers.
MPEG‑AI (ISO/IEC 23888) responds to this with a family of standards designed for machine‑to‑machine (M2M) communication, with two key pillars: Video coding for machines and Feature coding for machines.
Video coding for machines (VCM)
VCM reorganizes the classical video coding pipeline around machine task performance rather than human visual quality. Instead of optimizing PSNR or SSIM, VCM optimizes object detection, tracking, segmentation, and similar tasks in scenarios like smart cities and autonomous driving.
VCM represents an important step toward machine-centric video coding.
To achieve this, VCM departs from traditional signal‑centric approaches in several key ways. It applies temporal resampling, dropping frames that provide no additional information for the target task. Rather than transmitting full-resolution frames, it adaptively downsamples spatial resolution based on task relevance. In addition, it safely reduces luma and chroma precision by discarding least‑significant bits that do not impact machine inference performance.
Importantly, VCM still wraps around a standard H.26x codec (AVC/HEVC/VVC), surrounding it with task-aware pre- and post-processing. The drawback is that it still transmits recognizable frames, which raises privacy concerns.
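A minimal sketch of these three preprocessing steps, with made-up thresholds and policies (the actual standard specifies how such decisions are made and signalled around the inner codec):

```python
import numpy as np

def temporal_resample(frames: list, keep_every: int = 3) -> list:
    """Drop frames that add nothing for the machine task (here: fixed stride)."""
    return frames[::keep_every]

def spatial_downsample(frame: np.ndarray, factor: int = 2) -> np.ndarray:
    h, w = frame.shape
    return frame.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def truncate_lsbs(frame: np.ndarray, keep_bits: int = 6) -> np.ndarray:
    """Discard least-significant bits that do not affect inference."""
    shift = 8 - keep_bits
    return (frame.astype(np.uint8) >> shift) << shift

frames = [np.random.rand(480, 640) * 255 for _ in range(30)]
pre = [truncate_lsbs(spatial_downsample(f)) for f in temporal_resample(frames)]
print(len(frames), "frames ->", len(pre), "frames of shape", pre[0].shape)
```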
VCM has reached Draft International Standard (DIS) status.
Feature coding for machines (FCM)
FCM addresses that privacy and bandwidth problem by compressing intermediate neural features instead of pixels.
Today, most machine‑to‑machine systems rely on remote inference, where edge devices send full video frames to the cloud for processing. This approach is problematic because pixel video data is bandwidth‑intensive, and raw images often contain sensitive information such as faces, locations, and contextual scenes that should not necessarily be exposed to third-party servers.
At the same time, modern edge devices increasingly include NPUs capable of executing parts of a neural network locally, even if they cannot run full deep models end‑to‑end. This creates an opportunity to split the model: execute the early layers on the device and transmit only the compressed intermediate features, significantly reducing data transfer while preserving task performance.
The idea is collaborative intelligence: run the first part of a neural network on the edge device, then transmit only the intermediate feature tensor to the cloud, which finishes the inference.
To achieve this, FCM reduces the dimensionality of the intermediate feature tensors, eliminating unnecessary degrees of freedom. It prunes redundant feature channels that do not contribute meaningful information to the task, and it quantizes numerical precision, mapping 32‑bit floating‑point values to 10‑bit (or similar) integer representations to significantly reduce data size and computational cost without compromising task performance.
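Here is one way the split plus 10-bit feature quantization could look, assuming a toy two-layer network and a simple uniform quantizer rather than the actual FCM pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((64, 128)), rng.standard_normal((128, 10))

def edge_part(x: np.ndarray) -> np.ndarray:
    return np.maximum(x @ W1, 0.0)               # early layers, run on-device

def quantize_10bit(f: np.ndarray):
    lo, hi = f.min(), f.max()
    q = np.round((f - lo) / (hi - lo + 1e-9) * 1023).astype(np.uint16)
    return q, lo, hi                              # 10-bit symbols + range metadata

def dequantize(q, lo, hi):
    return q.astype(np.float32) / 1023 * (hi - lo) + lo

def cloud_part(f: np.ndarray) -> np.ndarray:
    return f @ W2                                 # remaining layers, in the cloud

x = rng.standard_normal((1, 64))
q, lo, hi = quantize_10bit(edge_part(x))          # what actually crosses the network
logits = cloud_part(dequantize(q, lo, hi))
print(q.dtype, logits.shape)                      # uint16 carrying 10-bit symbols
```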
Bandwidth savings can be massive, up to about 97% reduction in some scenarios, while features preserve semantics but not identities. Even if intercepted, they do not directly reveal images. In practice, using HEVC as an inner codec for these features can perform nearly as well as VVC, making deployment more flexible.
FCM is now at the Working Draft (WD) stage, progressing toward Committee Draft (CD).
Compression of neural networks for multimedia content description and analysis
The irony of AI is that the models we use to compress and interpret data are themselves huge blobs of data. Shipping a state‑of‑the‑art deep model over a network, or deploying it on a device with tight memory, is expensive in both bandwidth and storage.
To understand the need for compression, consider a camera that adjusts its automatic mode based on scene or object recognition performed by a trained neural network. This is a rapidly evolving area, and it is common for newer, better trained models to become available over time.
However, developing this “intelligence” is time and labor intensive, so once a model is ready it is typically deployed from a central location to millions of user devices. With modern neural networks now reaching hundreds of megabytes in size, this creates a scalability problem. A scenario in which millions of devices simultaneously download the latest model with enhanced features would place a significant and potentially unsustainable load on the network.
While simpler deployments involve training a neural network once, transferring it to the device, and using it locally for inference, emerging paradigms such as federated learning require continuous, bidirectional communication among large numbers of devices and central servers. In these scenarios, efficient compression and communication mechanisms become essential.
Neural Network Coding (NNC) comes to the rescue with a standardized, efficient, modular way to shrink neural networks dramatically without hurting their accuracy.
The NNC standard is designed to achieve very high compression efficiency for deep neural networks by combining several complementary techniques. These include pre‑processing methods for data reduction, such as sparsification (e.g., setting selected weights to zero to make tensors more compressible) and structural pruning, where entire neurons or filters that contribute little to performance are removed. These steps are followed by quantization and context‑adaptive binary arithmetic coding, specifically DeepCABAC, to efficiently encode the remaining information.
The result: neural networks can be compressed by up to about 97% while keeping their accuracy.
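A minimal sketch of the first two steps, sparsification and quantization, on a random weight matrix (DeepCABAC entropy coding is omitted, and the 80% pruning ratio is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((512, 512)).astype(np.float32)

# 1) Sparsify: zero out the 80% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(weights), 0.80)
sparse = np.where(np.abs(weights) < threshold, 0.0, weights)

# 2) Quantize the survivors to 8-bit symbols.
step = np.abs(sparse).max() / 127
symbols = np.round(sparse / step).astype(np.int8)

nonzero = np.count_nonzero(symbols) / symbols.size
print(f"nonzero after pruning: {nonzero:.0%}")    # ~20%: mostly zeros,
print("int8 symbols ready for entropy coding")    # ideal for arithmetic coding
```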
This technology is a sort of Lego brick that can be snapped into other MPEG standards. This modularity explains why the group is advancing work on NNC as an inner codec for FCM, as well as on applying NNC to the spherical harmonics (SH) coefficients within Gaussian Splat Coding (GSC).
Gaussian Splatting and Point Clouds
MPEG continues to explore Gaussian Splat Coding (GSC), which addresses the compression of Gaussian Splat (GS) representations used for 3D scene capture and rendering.
3D Gaussian Splatting has significantly changed how real world scenes are captured and visualized. Unlike traditional photogrammetry, which produces mesh based models, or neural radiance fields (NeRFs), which rely on computationally expensive ray tracing, Gaussian splatting represents a scene as millions of fuzzy ellipsoids (splats) that can be rendered efficiently. New viewpoints are generated simply by drawing these splats from the desired perspective.
The result is photorealistic rendering at real time frame rates on consumer hardware, making Gaussian splatting particularly well suited for virtual and augmented reality, immersive video, interactive web experiences, and game development.
Gaussian splat data fundamentally consists of collections of points in 3D space, each associated with attributes such as position, orientation, scale, opacity, and color coefficients. Determining the most effective way to compress and transport this data remains an active area of research.
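For a sense of the data involved, here is one plausible in-memory layout for a splat soup; the attribute set mirrors common 3D Gaussian splatting implementations, but the exact fields and SH order are assumptions, not the GSC format:

```python
import numpy as np

splat_dtype = np.dtype([
    ("position", np.float32, 3),   # center of the ellipsoid in world space
    ("rotation", np.float32, 4),   # orientation as a quaternion
    ("scale",    np.float32, 3),   # per-axis extent of the ellipsoid
    ("opacity",  np.float32),
    ("sh",       np.float32, 48),  # degree-3 spherical-harmonic color coeffs
])

scene = np.zeros(1_000_000, dtype=splat_dtype)    # a million splats
print(f"{splat_dtype.itemsize} B/splat -> {scene.nbytes / 1e6:.0f} MB uncompressed")
```

At over 200 MB for a million uncompressed splats, it is easy to see why dedicated compression is needed before such scenes can travel over networks.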
Compressed Gaussian splat representations could potentially be carried over existing video infrastructure, leveraging familiar encoding and delivery pipelines, or alternatively be handled within geometry‑based point cloud frameworks. Different industries tend to favor different approaches, largely depending on the infrastructures they already operate.
On the point cloud side, the G‑PCC family of standards has continued to expand to address a wider range of use cases. This now includes E‑G‑PCC, which introduces enhanced temporal prediction to improve the compression of dynamic and time‑evolving point clouds; GeS‑PCC, targeting dense, solid objects and surface‑like structures that behave more like continuous manifolds; and L3C2, a low‑latency point‑cloud codec designed specifically for spinning LiDAR sensors, enabling real‑time processing for applications such as autonomous driving and robotics.
Audio: Immersion, personalization, and dialogue clarity
Video usually gets the spotlight, but immersive experiences fail without great audio. MPEG‑H Audio is gaining momentum precisely because it moves from channel‑based to object‑based approaches.
Object‑based audio allows users to personalize their mix, raising commentary, lowering crowd noise, or switching between different perspectives offered by broadcasters. Experts are working on full six degrees‑of‑freedom (6DoF) audio, where users can localize sound sources in 3D space, perceive loudness changes as they move, and experience realistic reverberation and occlusion (when a physical object is interposed between a sound source and a user).
One particularly impactful feature is MPEG‑H Dialog+, which separates speech from the rest of the soundtrack and allows selective enhancement of dialogue. This is a practical solution for people with hearing difficulties and for anyone who simply wants to follow speech in a dense mix. The result: consistently enhanced dialogue while maintaining the quality of the background music and effects.
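Object-based personalization is conceptually simple: the renderer sums audio objects with per-object, user-controllable gains, and Dialog+ effectively exposes the dialogue gain by separating speech from the mix first. A toy sketch with synthetic stems (object names and gain values are illustrative):

```python
import numpy as np

sr, secs = 48_000, 2
t = np.arange(sr * secs) / sr
objects = {
    "dialogue": np.sin(2 * np.pi * 220 * t),      # stand-ins for real stems
    "crowd":    np.random.default_rng(0).standard_normal(t.size) * 0.3,
    "music":    np.sin(2 * np.pi * 440 * t) * 0.5,
}

user_gains = {"dialogue": 1.8, "crowd": 0.4, "music": 1.0}  # "clear speech" preset
mix = sum(g * objects[name] for name, g in user_gains.items())
print(mix.shape)
```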
Conclusion
A clear pattern comes into focus: compression is no longer merely a technique for reducing file sizes but has become a unifying principle that guides the design, scalability and reliability of digital systems.
We are compressing everything: from images and audio to genomes, neural networks and high-dimensional world models. In the course of this process, the focus is shifting from human-centred representations to machine-centred ones, where semantics matter more than pixels and meaning takes precedence over raw fidelity. At the same time, trust, authenticity, energy consumption and computational complexity are moving to the foreground, embedded directly into the fabric of our media and data pipelines.
In this broader view, compression functions like an “operating system” for the global datasphere: rarely noticed when it works, but foundational to everything built on top of it. It defines what we can store, what we can transmit, how efficiently we can learn and iterate, and increasingly, what we can believe.
