Introduction
Supervised classification usually comes with an implicit assumption: you need a lot of labeled data.
At the same time, many models are capable of discovering structure in data without any labels at all.
Generative models, in particular, often organize data into meaningful clusters during unsupervised training. When trained on images, they may naturally separate digits, objects, or styles in their latent representations.
This raises a simple but important question:
If a model has already discovered the structure of the data without labels, how much supervision is actually needed to turn it into a classifier?
In this article, we explore this question using a Gaussian Mixture Variational Autoencoder (GMVAE) (Dilokthanakul et al., 2016).
Dataset
We use the EMNIST Letters dataset introduced by Cohen et al. (2017), which is an extension of the original MNIST dataset.
- Source: NIST Special Database 19
- Processed by: Cohen et al. (2017)
- Size: 145 600 images (26 balanced classes)
- Ownership: U.S. National Institute of Standards and Technology (NIST)
- License: Public domain (U.S. government work)
This choice is not arbitrary. EMNIST is far more ambiguous than the classical MNIST dataset, which makes it a better benchmark to highlight the importance of probabilistic representations (Figure 1).
Disclaimer
The code provided in this article is intended for research and reproducibility purposes only.
It is currently tailored to the MNIST and EMNIST datasets, and is not designed as a general-purpose framework.
Extending it to other datasets requires adaptations (data preprocessing, architecture tuning, and hyperparameter selection).
Code and experiments are available on GitHub: https://github.com/murex/gmvae-label-decoding
The GMVAE: Learning Structure in an Unsupervised Way
A standard Variational Autoencoder (VAE) is a generative model that learns a continuous latent representation 𝒛\boldsymbol{z} of the data.
More precisely, each data point 𝒙\boldsymbol{x} is mapped to a multivariate normal distribution 𝒒(𝒛|𝒙)\boldsymbol{q(z \mid x)}, called the posterior.
However, this is not sufficient if we want to perform clustering. With a standard Gaussian prior, the latent space tends to remain continuous and does not naturally separate into distinct groups.
This is where GMVAEs come into play.
A GMVAE extends the VAE by replacing the prior with a mixture of 𝑲\boldsymbol{K} Gaussian components, where 𝑲\boldsymbol{K} is chosen beforehand.
To achieve this, a new discrete latent variable 𝒄\boldsymbol{c} is introduced, so that the prior becomes 𝒑(𝒛) = ∑𝒄 𝒑(𝒄) 𝒑(𝒛|𝒄)\boldsymbol{p(z) = \sum_c p(c)\, p(z \mid c)}.
This allows the model to learn a posterior distribution over clusters, 𝒒(𝒄|𝒙)\boldsymbol{q(c \mid x)}.
Each component of the mixture can then be interpreted as a cluster.
In other words, GMVAEs intrinsically learn clusters during training.
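As a minimal illustration of this generative structure, sampling from the mixture prior first draws a cluster and then a latent code conditioned on it. The parameters below are made up for the sketch (in a trained GMVAE they would be learned):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 100, 16  # number of mixture components, latent dimension (illustrative)

# Hypothetical mixture parameters (random here, learned in a real GMVAE)
pi = np.full(K, 1.0 / K)        # uniform prior over clusters p(c)
mu = rng.normal(size=(K, D))    # per-component means
sigma = np.full((K, D), 0.1)    # per-component scales

# Sampling from the prior: first a cluster c ~ p(c), then z ~ p(z | c)
c = rng.choice(K, p=pi)
z = rng.normal(mu[c], sigma[c])
```

Decoding such a 𝒛 through the trained decoder is exactly how the per-component samples of Figure 1 are produced.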
The choice of 𝑲\boldsymbol{K} controls a trade-off between expressivity and reliability.
- If 𝑲\boldsymbol{K} is too small, clusters tend to merge distinct styles or even different letters, limiting the model’s ability to capture fine-grained structure.
- If 𝑲\boldsymbol{K} is too large, clusters become too fragmented, making it harder to estimate reliable label–cluster relationships from a limited labeled subset.
We choose 𝑲=100\boldsymbol{K = 100} as a compromise: large enough to capture stylistic variations within each class, yet small enough to ensure that each cluster is sufficiently represented in the labeled data (Figure 1).
Figure 1 — Samples generated from several GMVAE components.
Different stylistic variants of the same letter are captured, such as an uppercase F (c=36) and a lowercase f (c=0).
However, clusters are not pure: for instance, component c=73 predominantly represents the letter “T”, but also includes samples of “J”.
Turning Clusters Into a Classifier
Once the GMVAE is trained, each image is associated with a posterior distribution over clusters: 𝒒(𝒄|𝒙)\boldsymbol{q(c | x)}. (In practice, when the number of clusters is unknown, 𝑲\boldsymbol{K} can be treated as a hyperparameter and tuned via grid search.)
A natural idea is to assign each data point to a single cluster.
However, clusters themselves do not yet have semantic meaning. To connect clusters to labels, we need a labeled subset.
A natural baseline for this task is the classical cluster-then-label approach: data are first clustered using an unsupervised method (e.g. k-means or GMM), and each cluster is assigned a label based on the labeled subset, typically via majority voting.
This corresponds to a hard assignment strategy, where each data point is mapped to a single cluster before labeling.
In contrast, our approach does not rely on a single cluster assignment.
Instead, it leverages the full posterior distribution over clusters, allowing each data point to be represented as a mixture of clusters rather than a single discrete assignment.
This can be seen as a probabilistic generalization of the cluster-then-label paradigm.
How many labels are theoretically required?
In an ideal scenario, clusters are perfectly pure (each cluster corresponds to a single class) and all clusters have the same size.
Still in this ideal setting, suppose we can choose which data points to label.
Then, a single labeled example per cluster would be sufficient — that is, only K labels in total.
In our setting (N = 145 600, K = 100), this corresponds to only 0.07% of labeled data.
However, in practice, we assume that labeled samples are drawn at random.
Under this assumption, and still assuming equal cluster sizes, we can derive an approximate lower bound on the amount of supervision needed to cover all 𝑲\boldsymbol{K} clusters with a chosen level of confidence.
In our case (𝑲=100\boldsymbol{K = 100}), we obtain a minimum of approximately 0.6% labeled data to cover all clusters with 95% confidence.
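This bound can be sketched numerically. The minimal computation below assumes equal cluster sizes and uses a union bound over the K clusters; the function name is ours, and the exact percentage depends on the approximation used, so it lands in the same ballpark as the figure quoted above rather than reproducing it exactly:

```python
import math

def min_labels_for_coverage(K: int, confidence: float = 0.95) -> int:
    """Smallest n such that n uniform random draws hit all K equally
    sized clusters with the given confidence.

    Union bound: P(some cluster missed) <= K * (1 - 1/K)**n,
    so we solve K * (1 - 1/K)**n <= 1 - confidence for n.
    """
    delta = 1.0 - confidence
    return math.ceil(math.log(delta / K) / math.log(1.0 - 1.0 / K))

n = min_labels_for_coverage(K=100, confidence=0.95)
print(n, f"{100 * n / 145_600:.2f}% of the dataset")
```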
We can relax the equal-size assumption and derive a more general inequality, although it does not admit a closed-form solution.
Unfortunately, all these calculations are optimistic:
in practice, clusters are not perfectly pure. A single cluster may, for example, contain both “i” and “l” in comparable proportions.
And now, how do we assign labels to the remaining data?
We compare two different ways to assign labels to the remaining (unlabeled) data:
- Hard decoding: we ignore the probability distributions provided by the model
- Soft decoding: we fully exploit them
Hard decoding
The idea is straightforward.
First, we assign to each cluster 𝒄\boldsymbol{c} a unique label ℓ(𝒄)\boldsymbol{\ell(c)} by using the labeled subset.
More precisely, we associate each cluster with the most frequent label among the labeled points assigned to it.
Now, given an unlabeled image 𝒙\boldsymbol{x}, we assign it to its most likely cluster, 𝒄hard(𝒙)\boldsymbol{c_{hard}(x) = \arg\max_c q(c \mid x)}.
We then assign to 𝒙\boldsymbol{x} the label associated with this cluster, i.e. ℓ(𝒄hard(𝒙))\boldsymbol{\ell(c_{hard}(x))}.
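Concretely, hard decoding is a two-step procedure (majority vote per cluster, then argmax over clusters). The sketch below is our own illustration of it, not the repository's API; function names and array shapes are assumptions:

```python
import numpy as np

def fit_hard_labels(q_labeled: np.ndarray, y_labeled: np.ndarray,
                    n_labels: int) -> np.ndarray:
    """Majority-vote label l(c) for each cluster c.

    q_labeled: (n, K) posteriors q(c|x) for the labeled subset.
    y_labeled: (n,) integer labels.
    Returns a (K,) array mapping cluster -> label,
    with -1 for clusters that received no labeled sample.
    """
    K = q_labeled.shape[1]
    hard_clusters = q_labeled.argmax(axis=1)
    cluster_to_label = np.full(K, -1, dtype=int)
    for c in range(K):
        labels_in_c = y_labeled[hard_clusters == c]
        if labels_in_c.size:
            cluster_to_label[c] = np.bincount(labels_in_c,
                                              minlength=n_labels).argmax()
    return cluster_to_label

def predict_hard(q: np.ndarray, cluster_to_label: np.ndarray) -> np.ndarray:
    """Label each point with the label of its most probable cluster."""
    return cluster_to_label[q.argmax(axis=1)]
```

Note the -1 sentinel: with very few labels, some clusters receive no labeled point at all and cannot be decoded, which is exactly the regime where hard decoding degrades.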
However, this approach suffers from two major limitations:
1. It ignores the model’s uncertainty for a given input 𝒙\boldsymbol{x} (the GMVAE may “hesitate” between several clusters)
2. It assumes that clusters are pure, i.e. that each cluster corresponds to a single label — which is generally not true
This is precisely what soft decoding aims to address.
Soft decoding
Instead of assuming that each cluster corresponds to a single label, we use the labeled subset to estimate, for each label ℓ\boldsymbol{\ell}, a probability vector 𝒎(ℓ)\boldsymbol{m(\ell)} of size 𝑲\boldsymbol{K}.
This vector empirically represents the probability of belonging to each cluster 𝒄\boldsymbol{c}, given that the true label is ℓ\boldsymbol{\ell}; in other words, it is an empirical estimate of 𝒑(𝒄|ℓ)\boldsymbol{p(c | \ell)}!
At the same time, the GMVAE provides, for each image 𝒙\boldsymbol{x}, a posterior probability vector 𝒒(𝒙)=𝒒(𝒄|𝒙)\boldsymbol{q(x) = q(c \mid x)}.
We then assign to 𝒙\boldsymbol{x} the label ℓ\boldsymbol{\ell} that maximizes the similarity between 𝒎(ℓ)\boldsymbol{m(\ell)} and 𝒒(𝒙)\boldsymbol{q(x)}.
This soft decision rule naturally takes into account:
- The model’s uncertainty for 𝒙\boldsymbol{x}, by using the full posterior 𝒒(𝒙)=𝒒(𝒄|𝒙)\boldsymbol{q(x) = q(c \mid x)} rather than only its maximum
- The fact that clusters are not perfectly pure, by allowing each label to be associated with multiple clusters
This can be interpreted as comparing 𝒒(𝒄|𝒙)\boldsymbol{q(c \mid x)} with 𝒑(𝒄|ℓ)\boldsymbol{p(c \mid \ell)}, and selecting the label whose cluster distribution best matches the posterior of 𝒙\boldsymbol{x}!
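The rule can be sketched as follows. Here the profile 𝒎(ℓ) is estimated by averaging posteriors over labeled examples of each label, and the similarity is a dot product; both are plausible choices for the sketch, not necessarily the exact ones used in our experiments:

```python
import numpy as np

def fit_soft_profiles(q_labeled: np.ndarray, y_labeled: np.ndarray,
                      n_labels: int) -> np.ndarray:
    """Estimate m(l), an empirical p(c | l), as the average posterior
    q(c|x) over labeled examples of label l. Returns (n_labels, K)."""
    K = q_labeled.shape[1]
    m = np.zeros((n_labels, K))
    for label in range(n_labels):
        mask = (y_labeled == label)
        if mask.any():
            m[label] = q_labeled[mask].mean(axis=0)
    return m

def predict_soft(q: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Assign each point the label whose cluster profile m(l) best
    matches its posterior q(c|x), scored here by a dot product."""
    scores = q @ m.T  # (n, n_labels)
    return scores.argmax(axis=1)
```

Unlike the hard rule, no cluster is ever discarded: a cluster with no labeled sample simply contributes nothing to any profile, while every entry of the posterior still gets to vote.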
A concrete example where soft decoding helps
To better understand why soft decoding can outperform the hard rule, let’s look at a concrete example (Figure 2).
Figure 2 — An example showing the interest of soft decoding.
In this case, the true label is e. The model produces the cluster posterior distribution shown in the center of Figure 2, with most of the probability mass spread over clusters 76, 40, 35, 81, and 61.
The hard rule only considers the most probable cluster, here 𝒄=76\boldsymbol{c = 76}.
Since cluster 76 is mostly associated with the label c, the hard prediction becomes c,
which is incorrect.
Soft decoding instead aggregates information from all plausible clusters.
Intuitively, this computes a weighted vote of clusters using their posterior probabilities.
In this example, several clusters strongly correspond to the correct label e.
Approximating the weighted vote, the aggregated score for the label e exceeds the score for the label c.
Even though cluster 76 clearly dominates the posterior, most of the probability mass actually lies on clusters associated with the correct label. By aggregating these signals, the soft rule correctly predicts e.
This illustrates the key limitation of hard decoding: it discards most of the information contained in the posterior distribution 𝒒(𝒄|𝒙)\boldsymbol{q(c \mid x)}. Soft decoding, on the other hand, leverages the full uncertainty of the generative model.
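To make the mechanism concrete, here is the hard vs. soft decision on a posterior dominated by one "wrong" cluster. The numbers are purely illustrative (they are not the ones from Figure 2):

```python
import numpy as np

# A posterior where the top cluster votes for the wrong label, while
# the remaining mass sits on clusters tied to the right label.
q = np.array([0.40, 0.25, 0.20, 0.15])       # posterior over 4 clusters
label_of = {0: "c", 1: "e", 2: "e", 3: "e"}  # majority label per cluster

# Hard rule: only the single most probable cluster counts.
hard = label_of[int(np.argmax(q))]

# Soft rule (in its simplest form): clusters vote with their posterior mass.
votes: dict[str, float] = {}
for c, p in enumerate(q):
    votes[label_of[c]] = votes.get(label_of[c], 0.0) + p
soft = max(votes, key=votes.get)

print(hard, soft)  # hard picks "c", soft picks "e"
```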
How Much Supervision Do We Need in Practice?
Theory aside, let’s see how this works on real data.
The goal here is twofold:
- to understand how many labeled samples are needed to achieve good accuracy
- to determine when soft decoding is beneficial
To this end, we progressively increase the number of labeled samples and evaluate accuracy on the remaining data.
We compare our approach against standard baselines: logistic regression, MLP, and XGBoost.
Results are reported as mean accuracy with confidence intervals (95%) over 5 random seeds (Figure 3).
Even with extremely small labeled subsets, the classifier already performs surprisingly well.
Most notably, soft decoding significantly improves performance when supervision is scarce.
With only 73 labeled samples — meaning that several clusters are not represented — soft decoding achieves an absolute accuracy gain of around 18 percentage points compared to hard decoding.
Moreover, with 0.2% labeled data (291 samples out of 145 600, roughly 3 labeled examples per cluster), the GMVAE-based classifier already reaches 80% accuracy.
In comparison, XGBoost requires around 7% labeled data — 35 times more supervision — to achieve a similar performance.
This striking gap highlights a key point:
Most of the structure required for classification is already learned during the unsupervised phase — labels are only needed to interpret it.
Conclusion
Using a GMVAE trained entirely without labels, we see that a classifier can be built using as little as 0.2% labeled data.
The key observation is that the unsupervised model already learns a large part of the structure required for classification.
Labels are not used to build the representation from scratch.
Instead, they are only used to interpret clusters that the model has already discovered.
A simple hard decoding rule already performs well, but leveraging the full posterior distribution over clusters provides a small yet consistent improvement, especially when the model is uncertain.
More broadly, this experiment highlights a promising paradigm for label-efficient machine learning:
- learn structure first
- add labels later
- use supervision primarily to interpret representations rather than to construct them
This suggests that, in many cases, labels are not needed to learn — only to name what has already been learned.
All experiments were conducted using our own implementation of GMVAE and evaluation pipeline.
References
- Cohen, G., Afshar, S., Tapson, J., & van Schaik, A. (2017). EMNIST: Extending MNIST to handwritten letters.
- Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., & Shanahan, M. (2016). Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders.
© 2026 MUREX S.A.S. and Université Paris Dauphine — PSL
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

