Voxtral TTS has recently been released by Mistral. It is a powerful text-to-speech model that, according to Mistral’s tests, beats ElevenLabs v2.5 Flash. Apart from state-of-the-art performance on text-to-speech tasks (among models of similar size), Mistral announced voice cloning capabilities and published the model weights. That attracted huge interest, because a high-quality text-to-speech (TTS) model that is small enough for local inference and supports voice cloning is in demand across both business and the community.
The issue, however, is that Mistral removed the encoder weights of the audio autoencoder, so users cannot clone arbitrary voices; we can only use the voices Mistral prepared for us. That is a huge limitation compared to the paper and the initial announcement.
Here I provide (1) an overview of the Voxtral TTS architecture with some technical details and comparisons, (2) my research on the audio autoencoder and how it actually encodes audio, and (3) a study of how we can still get a representation for arbitrary audio to potentially use voice cloning, even though the published weights lack the encoder part.
Why I am sure Voxtral TTS is a technology worth understanding (short personal story)
Years ago, I was working at Skyeng, where we were building our automatic speech recognition (ASR) system. It was 2021, Whisper had not yet been released by OpenAI, and ASR was a hot topic, especially for uncommon speech: our ASR targeted non-native speakers.
At that time I was reading many papers about audio understanding and encoding (mainly using transformer encoders). When I read the Wav2Vec2 paper, I got a strong feeling that we should adopt this technology, understand it in tiny detail, and use it, because I was sure the approaches behind it were fundamental enough to persist as the new classics of the audio understanding and signal processing domain. I was right. And with the Voxtral TTS release I got the same feeling.
Voxtral TTS overview
Voxtral-4B-TTS is a 4-billion-parameter model built on an autoregressive large language model (LLM) backbone, the 3B Ministral model. Simply speaking, the model takes as input audio tokens that represent the voice to clone and text tokens to be voiced. Similar to LLMs, the model generates tokens autoregressively, with the difference that these tokens are voice tokens. The paper that describes the model contains the following illustration for this part:
Autoregressive audio generator (based on LLM), source: https://arxiv.org/pdf/2603.25551
Important things to note: both the reference audio and the generated audio are split into non-overlapping, independent tokens, each representing 80 ms of audio; and on top of the backbone sits a composite head, consisting of a linear head and a flow-matching transformer, that produces the semantic and acoustic components (also called tokens) of an individual voice token. These two details make the model special. Independent audio (voice) tokens make it capable of native audio streaming. And the composite head is an ingenious combination of two current approaches to audio generation: discrete token prediction (the linear head) and complex distribution approximation through a diffusion process, using diffusion models or flow-matching transformers (the same approach we usually use for image and video generation).
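To make the two-part head concrete, here is a toy sketch of one generation step. The module names and the hidden size are my assumptions (only the 8192 semantic vocabulary and the 36 acoustic dimensions come from the paper), and a small conditional network stands in for the real flow-matching transformer:

```python
import torch
import torch.nn as nn

# Toy sketch: for each 80 ms frame the backbone emits a hidden state; a
# linear head classifies the discrete semantic token, while a conditional
# network (a stand-in for the flow-matching transformer) maps noise plus
# the hidden state to a continuous 36-dim acoustic vector.
class ToyVoxtralHead(nn.Module):
    def __init__(self, hidden=512, semantic_vocab=8192, acoustic_dim=36):
        super().__init__()
        self.acoustic_dim = acoustic_dim
        self.semantic_head = nn.Linear(hidden, semantic_vocab)
        # stand-in for the 3-layer flow-matching model, conditioned on h
        self.acoustic_net = nn.Sequential(
            nn.Linear(hidden + acoustic_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, h):                                   # h: [T, hidden]
        semantic_logits = self.semantic_head(h)             # [T, 8192]
        noise = torch.randn(h.shape[0], self.acoustic_dim)  # flow starts from noise
        acoustic = self.acoustic_net(torch.cat([h, noise], dim=-1))  # [T, 36]
        return semantic_logits.argmax(dim=-1), acoustic

head = ToyVoxtralHead()
sem, ac = head(torch.randn(10, 512))  # 10 frames = 0.8 s of audio
```

In the real model the flow-matching component runs several denoising steps instead of a single forward pass; the sketch only shows the data flow.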
That looks conceptually elegant and beautiful. But another component is required: the audio autoencoder. The audio autoencoder is responsible for producing the acoustic and semantic tokens that are fundamental to the model. Here is the architecture overview of the 350M autoencoder from the original paper:
Audio autoencoder, source: https://arxiv.org/pdf/2603.25551
The audio autoencoder, Voxtral Codec, produces 37 discrete tokens for each 80 ms audio frame (in the middle of the architecture, within the quantization block) and can reconstruct the audio back from these discrete tokens. The classic description of an autoencoder is encoder -> bottleneck -> decoder. Mistral, however, has not released the encoder. That means we can still generate audio after the autoregressive model produces audio tokens, by applying the decoder of the Voxtral Codec, but we cannot natively feed arbitrary audio in to get audio tokens to use as the voice condition in the autoregressive model: the encoder weights are missing, and the decoder is not invertible (like most deep learning models).
Another interesting detail: in the bottleneck there is a mention of two types of tokens, semantic and acoustic. The implementation here is also worth understanding. The encoder produces a 292-dim latent that is split into a 256-dim vector and 36 one-dimensional scalars. The 256-dim vector is mapped to codebook embeddings and is later represented by a code, the ID of the closest codebook entry (that is vector quantization). Each of the 36 scalars is mapped to an integer in the range 0 to 21 using a scaled tanh activation followed by rounding. These elements are already a simplification over some previous approaches that used Residual Vector Quantization.
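A minimal sketch of this bottleneck split, with the shapes taken from the paper (the variable names are mine):

```python
import torch

# One frame's 292-dim latent is cut into a 256-dim part for vector
# quantization and 36 scalars that are finite-scalar-quantized to
# 22 levels (0..21) via a scaled tanh.
latent = torch.randn(292)
vq_part, fsq_part = latent[:256], latent[256:]  # 256-dim VQ input + 36 scalars

levels = 22
normalized = torch.tanh(fsq_part)                                   # squashed into (-1, 1)
fsq_codes = (((normalized + 1) / 2) * (levels - 1)).round().long()  # integers 0..21
```

The 256-dim part would then be matched against a learned codebook; the 36 scalars need no codebook at all.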
A few words about semantic tokens. The embeddings of the semantic tokens are linearly projected to match the hidden states of the Whisper decoder when it transcribes the same audio. In other words, a constraint is applied to link the semantic tokens to the latent state Whisper (an ASR model) uses right before producing text out of speech. That is why these tokens are called semantic: they are supposed to be associated with text-related latents, and text is associated with semantics.
The question I had here (and tested later): do these semantic tokens actually represent the meaning, the words that should be voiced, while the acoustic tokens represent the voice itself?
A copy of some technical notes I took for my conspectus after reading the Voxtral paper (with some cross-references)
Voxtral-TTS (TTS from Mistral) — https://arxiv.org/pdf/2603.25551. Similar to Qwen-3 TTS (the 12Hz variant) to some extent. Mistral trained their own codec (an encoder-decoder architecture). Here the encoder produces a 292-dimensional latent state vector that is split into 256 dimensions for VQ and 36 for finite scalar quantization (FSQ). Compared to Mimi and Qwen-3, Mistral uses FSQ instead of RVQ and concatenates instead of summing the vectors in the decoder backbone. Another difference: the VQ vectors are linearly projected to match Whisper decoder hidden states (previous approaches used self-supervised WavLM), which controls semantics.
These vectors are used in the decoder-only 3B model to do voice cloning and text-to-speech. During training, in around 50% of cases tokens are replaced with noise or quantization is skipped, to make the model more robust. Each state has an embedding; the embeddings are summed and used as the input to the transformer. The transformer autoregressively predicts hidden states for audio tokens and end-of-audio tokens. These hidden states are used with a linear head to predict the VQ-based semantic token and to condition a 3-layer flow-matching model (comment: diffusion with an improved objective) that generates continuous acoustic tokens from noise (later quantized with FSQ); classifier-free guidance (CFG) can be applied. Audio is split into short non-overlapping parts. During training, both the semantic cross-entropy loss and the flow-matching loss for a single sampled timestep are computed (the loss is reweighted, for example, with a lower weight for silences). Post-training DPO helps to further improve the results.
Finite scalar quantization (FSQ) — https://arxiv.org/pdf/2309.15505. A simple approach that improves on RVQ and VQ (VQ is just vector quantization, while RVQ is residual vector quantization). RVQ refines each VQ codebook with smaller codebooks, so to reconstruct a vector we need to sum a set of codebook embeddings. FSQ is simpler (and produces independent codes, unlike RVQ): we apply a (scaled) tanh activation to each dimension, round the result to the closest integer, and use that as the code. This helps utilize codebooks better and train more simply (without trainable codebook embeddings).
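A toy FSQ round trip (my own illustration, not the Voxtral code) makes the codebook-free property visible: because an FSQ code is just a rounded grid point, decoding it back to a continuous value is a fixed linear map, and the quantization error is bounded by half the grid spacing.

```python
import torch

levels = 22
x = torch.randn(36)
z = torch.tanh(x)                              # continuous values in (-1, 1)
code = (((z + 1) / 2) * (levels - 1)).round()  # integers 0..21
z_hat = code / (levels - 1) * 2 - 1            # dequantize back to (-1, 1), no codebook
max_err = (z - z_hat).abs().max()              # bounded by 1 / (levels - 1)
```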
Do semantic tokens really represent semantics?
Some tests I was about to conduct:
- Do semantic tokens represent the words to be voiced? If they really do, we could manipulate these tokens to change the meaning of the speech while preserving the same voice. A more realistic outcome would be just broken words or nonsense in the generated speech, with the voice remaining the same.
- How robust is Voxtral Codec’s decoder to noise in the codes? If we randomly replace some of the codes and the audio remains similar, that means there may be a way to approximate the audio through some gradient-based code selection. Otherwise, if a small change in the codes destroys the audio, it is nearly impossible to reconstruct the codes from the audio without the actual encoder.
We do not have the weights for the encoder part of the Voxtral Codec, but we have the decoder, we have the autoregressive backbone weights, and we have some voices Mistral provided as references (we can generate speech from text using these voices).
The pipeline:
- Having the reference voice embeddings, we can apply the coordinate descent algorithm to extract the codes for the voice. Here I have a script that does that.
- The decoder weights and codes are not enough for the Voxtral Codec’s decoder to work; we also need the architecture implemented in code. vllm-omni has a Voxtral implementation under a permissive Apache 2.0 license. I used Claude Code to extract the Voxtral Codec architecture code from vllm-omni. The standalone extracted Voxtral Codec architecture is also in my GitHub repository.
- I prepared a Jupyter Notebook that takes the voice embeddings (provided by Mistral), reconstructs their codes, optionally destroys some of the codes (the semantic and some acoustic ones), and reconstructs the audio using the Voxtral Codec’s decoder with the real weights.
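To make the “destroys some of the codes” step concrete, here is a hedged sketch of such a randomization. It assumes a [T, 37] code layout with the semantic code in column 0 and the 36 acoustic codes after it; the exact layout in my notebook may differ, and the function name is mine:

```python
import torch

def randomize_codes(codes: torch.Tensor, n_acoustic: int = 4, seed: int = 0) -> torch.Tensor:
    """Corrupt the semantic column and a few acoustic columns of [T, 37] codes."""
    g = torch.Generator().manual_seed(seed)
    out = codes.clone()
    T = codes.shape[0]
    out[:, 0] = torch.randint(0, 8192, (T,), generator=g)      # random semantic IDs
    cols = 1 + torch.randperm(36, generator=g)[:n_acoustic]    # pick random acoustic columns
    out[:, cols] = torch.randint(0, 22, (T, n_acoustic), generator=g)
    return out

codes = torch.zeros(100, 37, dtype=torch.long)  # placeholder for reconstructed codes
noisy = randomize_codes(codes)
```

The corrupted codes then go through the decoder exactly as the clean ones do, so any audible difference is attributable to the randomized tokens.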
Here you can download and listen to audio reconstructed from the embeddings — audio
Here is the audio reconstructed from the same embeddings, but with the semantic and some acoustic tokens randomized (following the script) — audio
From these audio files, it is clear that semantic tokens, despite their name, do not really represent the actual words or meaning to be voiced. And, more importantly, the decoder is robust to some changes in the codes, which means we can try to apply gradient descent to directly train codes for a specific audio.
A gradient descent approach to reconstruct codes when the encoder is missing
By training the codes themselves I mean that we initialize a single layer in the form of nn.Parameter(torch.randn(num_frames, num_codes_per_frame)), where for each frame we have 37 codes, and train it with a reconstruction loss against the real audio. That could work directly if we were working in continuous space and the target object to reconstruct were some simple signal, not a high-frequency audio waveform.
Complications due to discrete tokens
Each token is discrete, similar to tokens in an LLM; if there are two discrete tokens A and B, we cannot gradually optimize a transition from A to B. In Voxtral we have separate semantic and acoustic tokens, and the acoustic ones are easier to model because of the finite scalar quantization (FSQ) used: they are obtained from continuous space through a rounding operation. Still, both semantic and acoustic tokens require the straight-through estimator (STE): quantized values on the forward pass and a differentiable surrogate on the backward pass.
For acoustic tokens we could simply apply STE directly:
…
# trained acoustic codes for the selected audio
# initialization
self.acoustic_values = nn.Parameter(
    torch.randn(num_frames, 36)
)
…
# number of levels for a single acoustic token
# according to the paper and the vllm-omni implementation
acoustic_levels = 22
acoustic_normalized = torch.tanh(self.acoustic_values)
# quantize to discrete levels in [0, 21]
acoustic_scaled = ((acoustic_normalized + 1) / 2) * (acoustic_levels - 1)
acoustic_quantized = acoustic_scaled.round()
# STE: forward uses the quantized values, backward uses the continuous ones
acoustic_codes = acoustic_scaled + (acoustic_quantized - acoustic_scaled).detach()
For semantic tokens it is more complicated. We cannot apply a tanh activation, because each semantic code is associated with a multi-dimensional embedding, so we are actually training “which embedding to select out of 8192 options”, not “which scalar value to use”:
…
# trained semantic codes for the selected audio
# initialization
semantic_vocab = 8192
self.semantic_logits = nn.Parameter(
    torch.randn(num_frames, semantic_vocab)
)
…
# Soft probabilities (for gradients)
probs = F.softmax(self.semantic_logits, dim=-1) # [T, 8192]
# Hard selection (for forward pass)
hard_codes = probs.argmax(dim=-1) # [T] integer indices in range 0-8191
# Get semantic embedding table
# uses the actual codebooks, because each of 8192 semantic codes is mapped
# to 256-dim embedding
sem_embedding = tokenizer.quantizer.semantic_codebook.embedding # [8192, 256]
# Soft embedding (weighted sum for gradients)
soft_emb = torch.matmul(probs, sem_embedding) # [T, 256]
# Hard embedding (discrete lookup for forward)
hard_emb = F.embedding(hard_codes, sem_embedding) # [T, 256]
# STE: forward uses hard, backward uses soft
semantic_emb = soft_emb + (hard_emb - soft_emb).detach() # [T, 256]
A full implementation of the training with illustrated and explained STE is here.
Complications due to the complexity of the signal to reconstruct
In every machine learning task, the more high-quality training signal you can provide, the better training goes. That is especially important when modeling high-frequency, high-dimensional data. In these experiments, using L1 loss alone as the reconstruction loss led to poor results: the model hit local optima and could not converge further with just the signal from a high-frequency data reconstruction loss. In the audio processing domain there is a list of common techniques for providing additional training signals, such as the Short-Time Fourier Transform (STFT) and Mel spectrograms. Both transform the signal from the time domain to a time-frequency representation through frequency bins and filterbanks. Following the description from the Mistral paper, I applied a similar STFT term as an additional loss on top of my L1 reconstruction loss: “…A multi-resolution discriminator with 8 STFT sizes (2296, 1418, 876, 542, 334, 206, 126, 76) is trained along with the codec…”
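As an illustration, here is a minimal multi-resolution STFT loss using the FFT sizes quoted above. Note this is my own simplified sketch: the paper trains a multi-resolution discriminator, while here I use a plain L1 distance between magnitude spectrograms, which is a common auxiliary-loss substitute.

```python
import torch

FFT_SIZES = (2296, 1418, 876, 542, 334, 206, 126, 76)

def multi_stft_l1(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean L1 distance between magnitude spectrograms at several resolutions."""
    def mag(x, n_fft):
        window = torch.hann_window(n_fft)
        return torch.stft(x, n_fft, hop_length=n_fft // 4,
                          window=window, return_complex=True).abs()
    losses = [(mag(pred, n) - mag(target, n)).abs().mean() for n in FFT_SIZES]
    return torch.stack(losses).mean()
```

Large FFT sizes give fine frequency resolution, small ones give fine time resolution, so summing over all of them constrains the reconstruction at several scales at once.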
There are also voice-cloning-specific losses we could apply. There are models able to create a speaker embedding used for speaker identification and diarization, for example, SpeechBrain’s models. These models produce embeddings for a voice. If two embeddings are close to each other, the two embedded audios are highly likely from the same speaker; otherwise, from different speakers. We can apply a speaker loss as an additional component that forces the model to create codes that, decoded with the Voxtral Codec’s decoder, produce a voice similar to the target one.
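A sketch of such a speaker loss is below. `embed` is any function mapping a waveform to a speaker embedding; for example, a SpeechBrain speaker model could be plugged in (loading one requires a model download, so this demo uses a random-projection stand-in, and the 192-dim embedding size is my choice):

```python
import torch
import torch.nn.functional as F

def speaker_loss(pred_wav, target_wav, embed):
    """1 - cosine similarity between speaker embeddings: 0 when voices match."""
    e_pred = embed(pred_wav)
    e_tgt = embed(target_wav).detach()  # the target embedding stays fixed
    return 1.0 - F.cosine_similarity(e_pred, e_tgt, dim=-1).mean()

torch.manual_seed(0)
proj = torch.randn(24000, 192)          # stand-in embedder: a random projection
toy_embed = lambda w: w @ proj
wav = torch.randn(24000)
loss_same = speaker_loss(wav, wav, toy_embed)  # near zero for identical audio
```

In the full setup, `pred_wav` would come from the Voxtral Codec decoder, so the gradient of this loss flows back into the trained codes.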
Results of training
I trained the described model for 5000 epochs (on a single sample). It took around an hour on a Mac with an M-series processor to recreate the codes for 8 seconds of audio. The speaker loss component made training slower, but led to a better final loss.
The audio I reconstructed from the trained codes is here. You can hear that it is very similar to the first 8-second fragment of the target audio.
In this training setup, we do not run any evaluation epochs, because we have just one sample and our task is to overfit the trained parameters to reconstruct this exact audio. It may sound a little unusual, but this is exactly the case where overfitting is the objective.
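For reference, the overall shape of that overfitting loop can be sketched as follows. The real Voxtral decoder, the STE machinery, and the STFT/speaker loss terms are replaced with stand-ins so the sketch is self-contained; the 24 kHz sample rate and all names here are my assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_frames = 100                                   # 100 * 80 ms = 8 s of audio
codes = nn.Parameter(torch.randn(num_frames, 37))  # the only trainable object
target = torch.randn(num_frames * 1920)            # 80 ms at 24 kHz = 1920 samples
decoder = nn.Linear(37, 1920)                      # stand-in for the frozen decoder
for p in decoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam([codes], lr=1e-2)
initial_loss = None
for step in range(200):
    opt.zero_grad()
    pred = decoder(codes).flatten()
    loss = (pred - target).abs().mean()            # plus STFT and speaker terms in practice
    if step == 0:
        initial_loss = loss.item()
    loss.backward()
    opt.step()
final_loss = loss.item()
```

Because the decoder is frozen, gradient descent can only move the per-frame code parameters, which is exactly the overfit-one-sample objective described above.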
AI usage disclaimer
I used LLM tools to support me in this research and to accelerate experimentation. They were very helpful. However, there were multiple episodes when I had to adjust the LLM’s decisions around some ML and DL details to make them work.
Please use AI tools responsibly!
Contacts
My LinkedIn if anybody wants to connect: Roman Smirnov

