Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

Rating

4 - Good

Authors

Hila Chefer

Date

2026

Review Status

In Progress

•

Motivation

◦

REPA performance versus the representation model

▪

DINOv2-B > DINOv2-L > DINOv3 > DINOv3-H+

◦

Aligning features to a stronger representation learner actually becomes a bottleneck!

▪

The fixed representation is far from the generative goal.

◦

Let's solve this within the generative framework itself.

Intuitively, a better representation should yield a better model. But the main objective is ultimately the diffusion loss, so the lesson is that any auxiliary term has to be one that can actually interact with that loss.

The same holds for VAEs: reconstruction quality and representation quality are in a trade-off to some extent.

However, unlike REPA, which is VAE-based to begin with, using RAE would likely let the strong representation used as the target latent help.

•

Method

◦

The teacher is not an external model but the model's own forward pass on an easier (less noisy) input

◦

Noising method

▪

A different noise level per token

•

A consistent noise distribution within each token

▪

Dual-Timestep Scheduling

•

Strong noise only on the tokens selected by the mask

◦

Teacher

▪

Run the forward pass with weak noise and use the features at a higher layer (k) as the training target [SRA]

▪

$\tau_{\min} = \min(\tau) \in \{t, s\}$

▪

The teacher model is an EMA of the student model
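The Method notes above can be sketched roughly as follows. This is a minimal NumPy illustration and not the paper's implementation. Everything not stated in the notes is an assumption: the function names (`dual_timestep_noise`, `ema_update`, `alignment_loss`) are hypothetical, the interpolation convention is taken to be x_tau = (1 - tau) * x + tau * eps (so a larger tau means more noise and tau_min is the weaker level), the student and teacher are assumed to share the same noise sample, the alignment term is assumed to be a cosine distance, and the EMA decay value is arbitrary.

```python
import numpy as np


def dual_timestep_noise(x, mask, t, s, rng):
    """Noise tokens with two timesteps (assumed convention: larger = noisier).

    x:    (N, D) clean tokens
    mask: (N,) bool, True for tokens that receive the stronger noise
    t, s: two scalar timesteps in [0, 1]
    """
    tau_min, tau_max = min(t, s), max(t, s)
    # Per-token noise level: one scalar per token, shared by all its channels.
    tau = np.where(mask, tau_max, tau_min)
    eps = rng.standard_normal(x.shape)
    # Student input: masked tokens are noised more strongly.
    x_student = (1.0 - tau[:, None]) * x + tau[:, None] * eps
    # Teacher input: every token at the weaker level tau_min.
    # Sharing eps with the student is an assumption of this sketch.
    x_teacher = (1.0 - tau_min) * x + tau_min * eps
    return x_student, x_teacher, tau, tau_min


def ema_update(teacher, student, decay=0.999):
    """Return teacher parameters moved toward the student (EMA)."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def alignment_loss(student_feat, teacher_feat, eps=1e-8):
    """Mean (1 - cosine similarity) per token; the loss form is assumed."""
    a = student_feat / (np.linalg.norm(student_feat, axis=-1, keepdims=True) + eps)
    b = teacher_feat / (np.linalg.norm(teacher_feat, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((8, 16))   # 8 tokens, 16 channels
    mask = rng.random(8) < 0.5               # tokens that get the strong noise
    x_s, x_t, tau, tau_min = dual_timestep_noise(tokens, mask, t=0.7, s=0.3, rng=rng)
    print("per-token tau:", tau, "teacher tau:", tau_min)
```

In a real training loop the teacher features would come from layer k of the EMA model run on `x_teacher` without gradients, while the student features would come from the trainable model run on `x_student`. Both models are omitted here to keep the sketch self-contained.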

•