Skip to content

ExplainableML/SOTAlign

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 

Repository files navigation

[ICML 2026] SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Authors: Simon Roschmann*, Paul Krzakala*, Sonia Mazelet, Quentin Bouniot, Zeynep Akata

preprint

Abstract

The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image–text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.

Methodology

SOTAlign architecture

SOTAlign is a two-step method for the alignment of pretrained unimodal image and text encoders. First, we fit a linear alignment model only using the limited amount of available image-text pairs. Then, we use this linear model as a teacher to regularize the training of alignment layers $f$ and $g$ for a joint embedding space leveraging unimodal (unpaired) data.

Code

Coming soon.

Citation

If you find SOTAlign useful, please star this repository and cite our work:

@article{roschmann2026sotalign,
  title={SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport},
  author={Simon Roschmann and Paul Krzakala and Sonia Mazelet and Quentin Bouniot and Zeynep Akata},
  journal={arXiv preprint arXiv:2602.23353},
  year={2026}
}

Contact

If you have any questions, feel free to contact us: simon.roschmann@tum.de

About

[ICML 2026] SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors