MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

Rating
4 - Good
Authors
Xiquan Li
Date
2025
Review Status
Todo
Review Date
2026/03/11 07:23
Key Findings
Venue
Field
Flow Matching
Diffusion Acceleration
Paper Library
R
Review Type

Before getting started

While promising, the performance appears slightly weaker than SoTA
Requirements
Maintain the quality that the flow matching model currently being trained reaches at high NFE.
Alternatively, look at methods that improve quality further, such as DMD2.
Since the objective is more general, ideally a single model could be served.
In particular, is there a good way to utilize a pre-trained model?
Verify that the training is stable and working well
Verify that it works well in Audio/Speech as well

Motivation

Mean Flows
training instability
the training target u_tgt is defined by the model’s own derivative
a randomly initialized model may fail to provide effective guidance at the beginning, resulting in slow convergence
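To make the instability point concrete, here is a minimal sketch of the Mean Flows target on a 1-D toy problem. It assumes the usual MeanFlow identity u(z_t, r, t) = v(z_t, t) - (t - r) * d/dt u(z_t, r, t), where the total derivative d/dt u = v * du/dz + du/dt is evaluated on the model itself. The function names, the affine stand-in for the network, and the finite-difference derivative are all hypothetical choices for illustration, not the paper's implementation.

```python
# Toy 1-D illustration of the MeanFlow training target (all names are hypothetical).

def u_theta(z, r, t, a=0.3, b=-0.2, c=0.5):
    """Stand-in for the network: a fixed affine function of (z, r, t)."""
    return a * z + b * r + c * t


def total_derivative(u, z, r, t, v, eps=1e-5):
    """Approximate d/dt u(z_t, r, t) along the path, i.e. v * du/dz + du/dt.

    Uses a central finite difference in the direction (dz, dr, dt) = (v, 0, 1).
    A real implementation would use a Jacobian-vector product instead.
    """
    plus = u(z + eps * v, r, t + eps)
    minus = u(z - eps * v, r, t - eps)
    return (plus - minus) / (2 * eps)


def meanflow_target(u, z, r, t, v):
    """u_tgt = v - (t - r) * d/dt u, treated as a constant (stop-gradient) in the loss."""
    return v - (t - r) * total_derivative(u, z, r, t, v)


z_t, r, t, v_t = 0.8, 0.2, 0.7, 1.5
u_tgt = meanflow_target(u_theta, z_t, r, t, v_t)
u_tgt_instant = meanflow_target(u_theta, z_t, t, t, v_t)  # r = t removes the derivative term
```

Because the second term is computed from the model's own output, a randomly initialized network contributes an essentially arbitrary correction whenever t > r, which is the slow-convergence issue noted above. With r = t the correction vanishes and the target is just the instantaneous velocity.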

Related work

TTA with Inference Acceleration
Consistency Distillation
ConsistencyTTA
multi-step ODE solver
AudioLCM
novel feature distance to enable flexible single-step and multi-step sampling
SoundCTM
dual-faceted distillation strategy
Presto
Rectified Flow
FlashAudio
AudioTurbo
contrastive post-training
Stable-audio-small

Method

Model Architecture
Flux-style latent transformer
Flux (Black Forest Labs, 2024)
MMDiT & DiT
MLP → ConvMLP (1d conv)
channel-last conv (in official implementation)
(B,C,T)
(B@T, C)
RoPE
RMSNorm with learnable scale in attention
lightweight architecture with only 120M params
supirit diffusion: 320M
supertonicTTL: 22~30M
Text conditioning
FLAN-T5
instruction-tuned LLM capable of producing fine-grained token embeddings
trained on text-only data
CLAP
trained on large-scale audio-text dataset
can offer acoustic-aligned text embeddings
Training Strategy
Latent
STFT → VAE → latent
Curriculum
Stage 1 - instantaneous velocity
Training with standard flow matching objective
Stage 2 - mixed
Finetuning with MeanFlow objective
for part of the samples, set r = t
with r = t the interval has zero length, so the target reduces to the instantaneous velocity
Mixed Flow
Finetuning
CFG training target
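The mixed stage of the curriculum can be sketched as a time-pair sampler in which a fixed fraction of pairs is collapsed to r = t. This is only an illustration: the name `sample_r_t`, the `instant_ratio` parameter and its default, and the uniform distribution over [0, 1] are assumptions, not values taken from the paper.

```python
import random


def sample_r_t(batch_size, instant_ratio=0.5, rng=None):
    """Sample (r, t) pairs with 0 <= r <= t <= 1.

    With probability `instant_ratio` (hypothetical default) a pair is collapsed
    to r = t, so that sample is trained on the instantaneous velocity; the
    remaining pairs keep r < t and are trained on the mean-flow target.
    """
    rng = rng or random.Random()
    pairs = []
    for _ in range(batch_size):
        a, b = rng.random(), rng.random()
        r, t = min(a, b), max(a, b)   # order the two draws so that r <= t
        if rng.random() < instant_ratio:
            r = t                     # zero-length interval: instantaneous case
        pairs.append((r, t))
    return pairs


pairs = sample_r_t(8, instant_ratio=0.25, rng=random.Random(0))
```

Setting `instant_ratio=1.0` recovers plain flow matching (Stage 1), while `instant_ratio=0.0` trains on mean-flow intervals only, so the flow mix-up ratio ablation corresponds to sweeping this one parameter.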

Result

Main results on AudioCaps test set.
Ablation study of the training curriculum
260 epochs
Ablation study of the flow mix-up ratio
Ablation study of CFG scale