paper: arxiv.org
Before getting started
•
While promising, the performance appears slightly weaker than SoTA
•
Requirements
◦
Preserve the high-NFE sample quality of the flow matching model that is currently being trained.
▪
Or methods to improve it further, such as DMD2
◦
Since the objective is more comprehensive, ideally we could serve everything with a single model.
▪
In particular, is there a good way to utilize a pre-trained model?
◦
Verify that the training is stable and working well
◦
Verify that it also works well for audio and speech
Motivation
•
Mean Flows
◦
training instability
▪
the training target u_tgt is defined by the model’s own derivative
•
a randomly initialized model may fail to provide effective guidance at the beginning, resulting in slow convergence
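For reference, the self-referential target can be written out as follows. This is a sketch of the standard Mean Flows formulation, not a formula copied from the paper under review. The notation is an assumption: $z_t = (1-t)x + t\epsilon$, $v_t = \epsilon - x$, and $\operatorname{sg}$ denotes stop-gradient.

```latex
\begin{align}
  % Average velocity over the interval [r, t]
  u(z_t, r, t) &= \frac{1}{t-r}\int_r^t v(z_\tau, \tau)\,d\tau \\
  % MeanFlow identity, obtained by differentiating (t-r)\,u with respect to t
  u(z_t, r, t) &= v(z_t, t) - (t-r)\,\frac{d}{dt}u(z_t, r, t) \\
  % Total derivative along the trajectory (computed as a JVP in practice)
  \frac{d}{dt}u(z_t, r, t) &= v(z_t, t)\,\partial_z u + \partial_t u \\
  % Regression target: the network's own derivatives appear inside it
  u_{\mathrm{tgt}} &= v_t - (t-r)\bigl(v_t\,\partial_z u_\theta + \partial_t u_\theta\bigr) \\
  \mathcal{L}(\theta) &= \mathbb{E}\,\bigl\lVert u_\theta(z_t, r, t) - \operatorname{sg}(u_{\mathrm{tgt}})\bigr\rVert_2^2
\end{align}
```

Because $\partial_z u_\theta$ and $\partial_t u_\theta$ come from the network being trained, a randomly initialized $u_\theta$ makes $u_{\mathrm{tgt}}$ noisy whenever $t \neq r$. When $r = t$ the correction term vanishes and the target reduces to $v_t$.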
Related work
•
Text-to-audio (TTA) with inference acceleration
◦
Consistency Distillation
▪
ConsistencyTTA
◦
multi-step ODE solver
▪
AudioLCM
◦
novel feature distance to enable flexible single-step and multi-step sampling
▪
SoundCTM
◦
dual-faceted distillation strategy
▪
Presto
◦
Rectified Flow
▪
FlashAudio
▪
AudioTurbo
◦
contrastive post-training
▪
Stable-audio-small
Method
•
Model Architecture
◦
Flux-style latent transformer
▪
Flux: (BlackForestLabs 2024)
•
MMDiT & DiT
•
MLP → ConvMLP (1d conv)
◦
channel-last conv (in official implementation)
▪
(B,C,T)
◦
(B*T, C)
•
RoPE
•
RMSNorm with learnable scale in attention
◦
lightweight architecture with only 120M params
▪
supirit diffusion: 320M
▪
supertonicTTL: 22~30M
◦
Text conditioning
▪
FLAN-T5
•
instruction-tuned LLM capable of producing fine-grained token embeddings
◦
trained on a text-only dataset
▪
CLAP
•
trained on large-scale audio-text dataset
•
can offer acoustic-aligned text embeddings
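The channel-last ConvMLP noted in the architecture list can be sketched in NumPy as below. Assumptions not taken from the notes: kernel size 3 with zero "same" padding, a plain two-layer MLP with SiLU (the official block may be gated), and the hypothetical helper names `conv1d_channel_last`, `conv_mlp`, and `init_conv_mlp`. The point of the sketch is that, for channel-last input, each kernel tap is an ordinary matmul on the flattened (B*T, C) view.

```python
import numpy as np


def conv1d_channel_last(x, weight, bias):
    """1D convolution over time for channel-last input.

    x:      (B, T, C_in)
    weight: (K, C_in, C_out), K odd, zero 'same' padding (assumed)
    bias:   (C_out,)
    """
    B, T, C_in = x.shape
    K, _, C_out = weight.shape
    pad = K // 2
    x_pad = np.pad(x, ((0, 0), (pad, pad), (0, 0)))
    out = np.zeros((B, T, C_out))
    for k in range(K):
        # Each tap is a plain matmul on the flattened (B*T, C_in) view.
        tap = x_pad[:, k:k + T, :].reshape(B * T, C_in)
        out += (tap @ weight[k]).reshape(B, T, C_out)
    return out + bias


def silu(x):
    return x / (1.0 + np.exp(-x))


def conv_mlp(x, params):
    """Two-layer ConvMLP: conv -> SiLU -> conv (ungated, an assumption)."""
    h = silu(conv1d_channel_last(x, params["w1"], params["b1"]))
    return conv1d_channel_last(h, params["w2"], params["b2"])


def init_conv_mlp(dim, hidden, kernel_size=3, seed=0):
    """Hypothetical initializer, for illustration only."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0, 1 / np.sqrt(kernel_size * dim), (kernel_size, dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0, 1 / np.sqrt(kernel_size * hidden), (kernel_size, hidden, dim)),
        "b2": np.zeros(dim),
    }


params = init_conv_mlp(dim=8, hidden=16)
x = np.random.default_rng(1).normal(size=(2, 5, 8))  # (B, T, C)
y = conv_mlp(x, params)                               # same shape as x
```

With `kernel_size=1` the block degenerates to the usual token-wise MLP, which is a quick sanity check on the layout.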
◦
•
Training Strategy
◦
Latent
▪
STFT → VAE → latent
◦
Curriculum
▪
Stage 1 - instantaneous velocity
•
Training with standard flow matching objective
▪
Stage 2 - mixed
•
Finetuning with MeanFlow objective
◦
using r=t
▪
r=t reduces the average velocity to the instantaneous velocity
◦
Mixed Flow
▪
Finetuning
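A minimal sketch of how the curriculum could be driven by the time-pair sampler. The names `sample_r_t`, `interpolate`, and `flow_ratio` are hypothetical. Uniform sampling of the time steps is an assumption (the paper may use a different distribution), as is the convention z_t = (1 - t) x + t eps with t = 1 being pure noise. Setting `flow_ratio=1.0` gives r = t everywhere (Stage 1, standard flow matching); a smaller value mixes instantaneous and average-velocity samples.

```python
import numpy as np


def sample_r_t(batch_size, flow_ratio, rng):
    """Sample (r, t) with 0 <= r <= t <= 1.

    A `flow_ratio` fraction of the batch is forced to r = t, i.e. those
    samples are supervised with the instantaneous velocity.
    """
    a = rng.uniform(0.0, 1.0, size=batch_size)
    b = rng.uniform(0.0, 1.0, size=batch_size)
    t = np.maximum(a, b)
    r = np.minimum(a, b)
    n_inst = int(round(flow_ratio * batch_size))
    idx = rng.permutation(batch_size)[:n_inst]
    r[idx] = t[idx]
    return r, t


def interpolate(x, eps, t):
    """Return z_t = (1 - t) x + t eps and the velocity v_t = eps - x."""
    t = t.reshape(-1, *([1] * (x.ndim - 1)))  # broadcast over latent dims
    z_t = (1.0 - t) * x + t * eps
    v_t = eps - x
    return z_t, v_t


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 16))     # toy latents, (B, C, T)
eps = rng.normal(size=x.shape)      # Gaussian noise

r1, t1 = sample_r_t(8, flow_ratio=1.0, rng=rng)   # Stage 1: r = t everywhere
r2, t2 = sample_r_t(8, flow_ratio=0.25, rng=rng)  # mixed: 25% of pairs have r = t
z_t, v_t = interpolate(x, eps, t2)
```

For the samples with r = t the regression target is just `v_t`, so no derivative of the model is needed for them; only the remaining samples need the JVP-based MeanFlow target.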
◦
CFG training target
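The formula for this item did not survive in the notes. As a stand-in, the CFG-aware target from the original Mean Flows formulation is sketched below; whether the paper under review uses exactly this variant is an assumption. Here $\omega$ is the guidance scale, $c$ the text condition, $\varnothing$ the dropped condition, and $v_t = \epsilon - x$.

```latex
\begin{align}
  % Guided instantaneous velocity: the unconditional branch is the model itself at r = t
  \tilde{v}_t &= \omega\, v_t + (1-\omega)\, u_\theta^{\mathrm{cfg}}(z_t, t, t \mid \varnothing) \\
  % MeanFlow target built on the guided velocity
  u_{\mathrm{tgt}} &= \tilde{v}_t - (t-r)\bigl(\tilde{v}_t\,\partial_z u_\theta^{\mathrm{cfg}} + \partial_t u_\theta^{\mathrm{cfg}}\bigr) \\
  \mathcal{L}(\theta) &= \mathbb{E}\,\bigl\lVert u_\theta^{\mathrm{cfg}}(z_t, r, t \mid c) - \operatorname{sg}(u_{\mathrm{tgt}})\bigr\rVert_2^2
\end{align}
```

Under this formulation guidance is baked into the training target, so sampling does not need a separate unconditional forward pass.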
Result
•
Main results on AudioCaps test set.
•
Ablation study of the training curriculum
◦
260 epochs
•
Ablation study of the flow mix-up ratio
•
Ablation study of CFG scale