[ICML 2026] SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Authors: Simon Roschmann*, Paul Krzakala*, Sonia Mazelet, Quentin Bouniot, Zeynep Akata

Abstract

The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image–text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.

Methodology

SOTAlign is a two-step method for aligning pretrained unimodal image and text encoders. First, we fit a linear alignment model using only the limited number of available image-text pairs. Then, we use this linear model as a teacher to regularize the training of the alignment layers $f$ and $g$, which map into a joint embedding space, on unimodal (unpaired) data.
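Until the official code is released, the two steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: the teacher is fit by ridge regression, the transport cost is cosine distance, the optimal-transport step is plain entropic Sinkhorn with uniform marginals, and the divergence is a KL term between teacher and student transport plans. The names `fit_linear_teacher`, `sinkhorn_plan`, `plan_kl`, `ridge`, and `eps` are hypothetical, and the data is synthetic.

```python
import numpy as np


def fit_linear_teacher(X_img, Y_txt, ridge=1e-3):
    """Stage 1 (assumed form): ridge-regularized least squares W with X_img @ W ~ Y_txt."""
    d = X_img.shape[1]
    gram = X_img.T @ X_img + ridge * np.eye(d)
    return np.linalg.solve(gram, X_img.T @ Y_txt)


def l2_normalize(Z):
    """Scale every row to unit Euclidean norm."""
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)


def cosine_cost(A, B):
    """Pairwise cosine distance between the rows of A and the rows of B."""
    return 1.0 - l2_normalize(A) @ l2_normalize(B).T


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    K = np.exp(-cost / eps)
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]


def plan_kl(P_teacher, P_student, tiny=1e-12):
    """KL(P_teacher || P_student) between two transport plans (assumed divergence)."""
    log_ratio = np.log(P_teacher + tiny) - np.log(P_student + tiny)
    return float(np.sum(P_teacher * log_ratio))


rng = np.random.default_rng(0)
d_img, d_txt, d_joint = 16, 12, 8

# Stage 1: a small set of paired embeddings (synthetic stand-ins for encoder outputs).
W_true = rng.normal(size=(d_img, d_txt))
X_pair = rng.normal(size=(32, d_img))
Y_pair = X_pair @ W_true + 0.01 * rng.normal(size=(32, d_txt))
W_teacher = fit_linear_teacher(X_pair, Y_pair)

# Stage 2: unpaired image and text embeddings; the set sizes need not match.
X_unp = rng.normal(size=(20, d_img))
Y_unp = rng.normal(size=(24, d_txt))

# The teacher maps images into the text space and defines a relational target.
P_teacher = sinkhorn_plan(cosine_cost(X_unp @ W_teacher, Y_unp))

# Student alignment layers f and g, here untrained linear maps into the joint space.
F = rng.normal(size=(d_img, d_joint))
G = rng.normal(size=(d_txt, d_joint))
P_student = sinkhorn_plan(cosine_cost(X_unp @ F, Y_unp @ G))

loss = plan_kl(P_teacher, P_student)
print(f"teacher-student plan divergence: {loss:.4f}")
```

The sketch only evaluates the divergence once. Training $f$ and $g$ would require minimizing it with an automatic-differentiation framework, typically alongside a supervised loss on the paired samples. Comparing transport plans rather than embeddings is one way to transfer relational structure without forcing the student space to copy the teacher's coordinates.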

Code

Coming soon.

Citation

If you find SOTAlign useful, please star this repository and cite our work:

@article{roschmann2026sotalign,
  title={SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport},
  author={Simon Roschmann and Paul Krzakala and Sonia Mazelet and Quentin Bouniot and Zeynep Akata},
  journal={arXiv preprint arXiv:2602.23353},
  year={2026}
}

Contact

If you have any questions, feel free to contact us: simon.roschmann@tum.de
