Related Work
Radford et al. (2021) pioneered the dual-encoder architecture trained contrastively on image-text pairs, demonstrating impressive zero-shot performance and laying the groundwork for pre-trained vision-language models. Building upon this foundation, Zhai et al. (2024) propose an alternative and more efficient sigmoid objective, while Sun et al. (2024) optimize the model’s dimensions and scale up both dataset and model sizes, achieving state-of-the-art performance on cross-modal tasks.
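The efficiency of the sigmoid objective comes from treating each image-text pair as an independent binary classification, avoiding the batch-wide softmax normalization of the original contrastive loss. The following is a minimal illustrative sketch of such a pairwise sigmoid loss; the function name is ours, and the temperature and bias (learnable scalars in the original formulation) are fixed here for simplicity.

```python
import numpy as np

def sigmoid_contrastive_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    Every image-text pair is scored independently, so no batch-wide
    softmax normalization is required (unlike InfoNCE).
    """
    logits = temperature * img_emb @ txt_emb.T + bias   # (B, B) similarity logits
    labels = 2.0 * np.eye(len(img_emb)) - 1.0           # +1 on matched pairs, -1 elsewhere
    # -log sigmoid(label * logit), written via log1p for numerical stability
    loss = np.log1p(np.exp(-labels * logits))
    return loss.sum() / len(img_emb)

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 8)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(sigmoid_contrastive_loss(img, txt))
```

Because each of the B x B terms is independent, the loss decomposes over devices, which is what makes the objective attractive at large batch sizes.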
Koukounas et al. (2024) equip a CLIP-based retriever with strong text-to-text and cross-modal retrieval capabilities, though the model falls short in retrieval scenarios where the candidate pool mixes modalities, owing to the modality gap (Liang et al., 2024).
To mitigate this modality gap, Lin et al. (2024) take a different approach, fine-tuning a multimodal LLM with modality-aware hard-negative mining to create a universal multimodal retriever that maintains competitive text-to-text retrieval performance across diverse tasks.
Extensive research has also focused on extending CLIP’s capabilities to multilingual contexts.
To address the lack of image-caption pairs in languages other than English, Carlsson et al. (2024) apply knowledge distillation to retrain the text encoder using machine translation data.
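The core of this distillation setup is to keep the original (English) text encoder frozen as a teacher and train a multilingual student so that its embedding of a machine-translated caption matches the teacher's embedding of the English original. A minimal sketch of the objective, assuming precomputed embeddings (the function name and a simple MSE formulation are illustrative):

```python
import numpy as np

def distillation_loss(teacher_emb, student_emb):
    """Mean-squared error between the frozen teacher's embedding of an
    English caption and the student's embedding of its machine
    translation. Minimizing this pulls the multilingual student into
    the teacher's text embedding space, so the original image encoder
    can be reused unchanged.
    """
    return np.mean((teacher_emb - student_emb) ** 2)

teacher = np.array([0.1, -0.2, 0.3])    # frozen teacher output (illustrative values)
student = np.array([0.12, -0.18, 0.25]) # multilingual student output (illustrative values)
print(distillation_loss(teacher, student))
```

Since the supervision signal is purely text-to-text, no multilingual image-caption pairs are needed; parallel sentences from machine translation suffice.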
NLLB-CLIP (Visheratin, 2024a) leverages Locked-image Tuning (LiT) (Zhai et al., 2024) alongside the NLLB (Costa-jussà et al., 2024) text encoder, achieving state-of-the-art results in both retrieval and classification tasks.
The importance of good information retrieval in Retrieval-Augmented Generation (RAG) systems has motivated rapid progress in multilingual text embedding models.
These models are typically based on either a robust encoder-only architecture, such as XLM-RoBERTa (Conneau et al., 2024), or a decoder-only multilingual large language model, such as Mistral 7B (Jiang et al., 2024).
Multilingual E5 (Wang et al., 2024) and BGE-M3 (Chen et al., 2024) both use XLM-RoBERTa as their backbone, leveraging extensive multilingual training with instruction tuning and multi-task learning techniques, respectively.
mGTE (Zhang et al., 2024b) achieves comparable performance to BGE-M3 using a smaller transformer architecture, enhanced by RoPE (Su et al., 2024) to extend the context length.
Similarly, Sturua et al. (2024) employ RoPE to increase the model’s context length and apply LoRA tuning (Hu et al., 2024) to optimize embeddings for downstream tasks.
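RoPE supports such context extension because it rotates pairs of channels by a position-dependent angle, so the attention score between a query and a key depends only on their relative offset rather than their absolute positions. A minimal per-sequence sketch (real implementations operate per attention head; the function name is ours):

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Channels are split into pairs, and each pair is rotated by an angle
    proportional to the token position, with a different frequency per
    pair. Dot products between rotated vectors then depend only on the
    relative offset between positions.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1[i], x2[i]) channel pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

The relative-position property is easy to verify: embedding the same query/key vectors at positions (0, 2) and at positions (1, 3) yields identical dot products.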