MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

Rating

4 - Good

Authors

Xiquan Li

Date

2025

Review Status

Todo

paper: arxiv.org

Before getting started

•

While promising, the performance appears slightly weaker than SoTA

•

Requirements

◦

I want to preserve the quality that the flow-matching model currently being trained reaches at high NFE.

▪

Or find methods that improve it further, such as DMD2

◦

Since the objective is more general, it would be good if a single model could serve all use cases.

▪

In particular, is there a good way to utilize a pre-trained model?

◦

Verify that training is stable and converges well

◦

Verify that it also works well for audio and speech

Motivation

•

Mean Flows

◦

training instability

▪

the training target u_tgt is defined by the model’s own derivative

•

a randomly initialized model may fail to provide effective guidance at the beginning, resulting in slow convergence
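
For reference, a sketch of why the target depends on the model itself, written in the notation of the original MeanFlow paper as I understand it (u is the average velocity, v the instantaneous velocity, sg the stop-gradient). The exact form used in MeanAudio should be checked against the paper.

```latex
% Average velocity over [r, t] along the flow path z_\tau
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% MeanFlow identity: differentiate (t - r)\, u(z_t, r, t) with respect to t
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)

% Training target: the derivative terms come from the network u_\theta itself
u_{\text{tgt}} = v_t - (t - r)\left( v_t\, \partial_z u_\theta + \partial_t u_\theta \right),
\qquad
\mathcal{L}(\theta) = \mathbb{E}\, \bigl\lVert u_\theta(z_t, r, t) - \operatorname{sg}(u_{\text{tgt}}) \bigr\rVert_2^2

% Few-step sampling: one jump from time t to time r
z_r = z_t - (t - r)\, u_\theta(z_t, r, t)
```

With random weights, the derivative terms of u_θ carry no useful signal, which is consistent with the slow convergence noted above.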

Related work

•

TTA with Inference Acceleration

◦

Consistency Distillation

▪

ConsistencyTTA

◦

multi-step ODE solver

▪

AudioLCM

◦

novel feature distance to enable flexible single-step and multi-step generation

▪

SoundCTM

◦

dual-faceted distillation strategy

▪

Presto

◦

Rectified Flow

▪

FlashAudio

▪

AudioTurbo

◦

contrastive post-training

▪

Stable-audio-small

Method

•

Model Architecture

◦

Flux-style latent transformer

▪

Flux (Black Forest Labs, 2024)

•

MMDiT & DiT

•

MLP → ConvMLP (1d conv)

◦

channel-last conv (in official implementation)

▪

(B,C,T)

◦

(B*T, C)

•

RoPE

•

RMSNorm with learnable scale in attention
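
A minimal pure-Python sketch of the two attention-side components listed above, RoPE and RMSNorm with a learnable scale. It operates on a single feature vector for readability. The function names, the `base` value and the assumption that the norm is applied to queries and keys (as in Flux-style blocks) are mine, not taken from the MeanAudio code.

```python
import math


def rms_norm(x, scale, eps=1e-6):
    """RMSNorm over the feature axis with a learnable per-feature scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [s * v / rms for v, s in zip(x, scale)]


def rope(x, pos, base=10000.0):
    """Rotate each feature pair (x[i], x[i+1]) by the angle pos * base**(-i/d)."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even feature dimension"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention logit."""
    return sum(p * q for p, q in zip(a, b))
```

The property that matters for attention is that the logit between a rotated query at position m and a rotated key at position n depends only on the offset m - n, so the model sees relative positions.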

◦

lightweight architecture with only 120M params

▪

supirit diffusion: 320M

▪

supertonicTTL: 22~30M

◦

Text conditioning

▪

FLAN-T5

•

instruction-tuned LLM capable of producing fine-grained token embeddings

◦

trained on a text-only dataset

▪

CLAP

•

trained on large-scale audio-text dataset

•

can offer acoustic-aligned text embeddings

•

Training Strategy

◦

Latent

▪

STFT → VAE → latent

◦

Curriculum

▪

Stage 1 - instantaneous velocity

•

Training with standard flow matching objective

▪

Stage 2 - mixed

•

Finetuning with MeanFlow objective

◦

using r=t

▪

with r = t the average velocity reduces to the instantaneous velocity, i.e. the standard flow matching target

◦

Mixed Flow

▪

Finetuning
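
A rough sketch of the mixed objective described in the curriculum above: a fraction of training samples is forced to r = t, where the MeanFlow target collapses to the flow matching target, and the rest use r < t. The name `flow_ratio`, the uniform sampling of (r, t) and the helper names are assumptions for illustration; the paper's actual time sampler may differ.

```python
import random


def sample_r_t(flow_ratio, rng):
    """Draw (r, t) with 0 <= r <= t <= 1; with probability flow_ratio set r = t."""
    a, b = rng.random(), rng.random()
    t, r = max(a, b), min(a, b)
    if rng.random() < flow_ratio:
        r = t  # instantaneous case: plain flow matching target
    return r, t


def meanflow_target(v, du_dt, r, t):
    """u_tgt = v - (t - r) * du/dt.

    du_dt is the total derivative of the model output along the path; in
    practice it comes from a JVP and is treated as a constant (stop-gradient).
    """
    return [vi - (t - r) * di for vi, di in zip(v, du_dt)]
```

When r = t the correction term vanishes regardless of du_dt, so those samples behave exactly like Stage 1 and give the model a stable anchor while it learns the average velocity.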

◦

CFG training target
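
For reference, the guided target as formulated in the original MeanFlow paper, as far as I recall it. Here ω is the guidance scale, v_t the conditional sample velocity, and ∅ the null text condition. Whether MeanAudio uses exactly this form needs to be verified against the paper.

```latex
% Guided instantaneous velocity: mix the conditional velocity with the
% model's own unconditional prediction at r = t
\tilde{v}_t = \omega\, v_t + (1 - \omega)\, u_\theta^{\text{cfg}}(z_t, t, t \mid \varnothing)

% Same MeanFlow target as before, with v_t replaced by \tilde{v}_t
u_{\text{tgt}} = \tilde{v}_t - (t - r)\left( \tilde{v}_t\, \partial_z u_\theta^{\text{cfg}} + \partial_t u_\theta^{\text{cfg}} \right)
```

Because guidance is baked into the target, sampling needs only one network evaluation per step instead of the two required by standard CFG.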

Results

•

Main results on AudioCaps test set.

•

Ablation study of the training curriculum

◦

260 epochs

•

Ablation study of the flow mix-up ratio

•

Ablation study of CFG scale