Conclusion
In this work we presented an enhanced strategy, based on contrastive learning, for training dual-encoder vision-language embedding models for cross-modal and text retrieval tasks.
We introduced several improvements to the training pipeline, namely support for multiple languages, Matryoshka Representation Learning, and finer-grained visual perception.
We trained and released a new model, jina-clip-v2, which achieves strong cross-modal and text retrieval performance on standard benchmarks.
Finally, we conducted analyses of three important considerations for CLIP models going forward: the ability to understand complex visual inputs, how image resolution affects performance, and the effects of efforts to mitigate the modality gap.