Introduction
Contrastive text-image pre-training has become a widely adopted approach to building robust text-image alignment models.
These models are particularly effective for tasks like cross-modal retrieval and zero-shot classification (Radford et al., 2024), demonstrating strong generalization capabilities.
Conventional text-image pre-training excels at aligning text and image embeddings but struggles with aligning text-to-text embeddings (Koukounas et al., 2024).
This is because cross-modal contrastive training is poorly suited to text-only retrieval: the texts available are typically image captions, which tend to be short and information-poor, and the training procedure omits standard techniques for learning text embeddings, such as hard negatives (Zhang et al., 2024a).
To overcome these limitations, Koukounas et al. (2024) introduced a multi-task, multi-stage contrastive learning approach to simultaneously align text-text and text-image embeddings.
Their method involves three stages: optimizing text-text and text-image embeddings on short text-text and image-caption pairs (stage 1), refining alignment with long text-text pairs and detailed image-caption pairs (stage 2), and further improving performance with text triplets containing hard negatives alongside detailed image-caption pairs (stage 3).
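To make the multi-task objective concrete, the following sketch combines a text-image and a text-text InfoNCE loss in a single training step. The loss weights, temperature, and function names are illustrative assumptions, not the exact recipe of Koukounas et al. (2024).

```python
import torch
import torch.nn.functional as F

def info_nce(q, d, temperature=0.05):
    """Symmetric InfoNCE over a batch of paired embeddings, using in-batch negatives."""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = q @ d.t() / temperature                      # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)     # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def multi_task_step(caption_emb, image_emb, query_emb, pos_text_emb, w_t2i=1.0, w_t2t=1.0):
    """One training step that jointly optimizes text-image and text-text alignment."""
    loss_t2i = info_nce(caption_emb, image_emb)   # caption <-> image pairs
    loss_t2t = info_nce(query_emb, pos_text_emb)  # text <-> text pairs
    return w_t2i * loss_t2i + w_t2t * loss_t2t
```

Across stages, only the composition of the pairs changes (short captions, long texts, triplets with hard negatives); the combined objective keeps the same form.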
The resulting model, jina-clip-v1, achieves strong performance on both the cross-modal CLIP Benchmark (https://github.com/LAION-AI/CLIP_benchmark) and the text embedding MTEB benchmark (Muennighoff et al., 2024).
Despite its strengths, this model has several limitations. First, it is an English-only multimodal embedding model, which makes it unsuitable for multilingual document retrieval.
Second, jina-clip-v1 struggles with images that contain text, tables, graphs, diagrams, and other complex content (Faysse et al., 2024).
To address these challenges, we propose an enhanced text-image pre-training scheme.
Our approach incorporates multilingual text-image and text-text pairs, along with text-image pairs whose images contain embedded text, tables, graphs, diagrams, and other hard-to-process visual structures.
Additionally, we gradually increase image resolution during pre-training to balance computational efficiency and model performance.
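A progressive resolution schedule of this kind can be expressed as a simple step-to-resolution mapping, as in the sketch below; the resolutions and step boundaries shown are assumptions for illustration, not the published training configuration.

```python
# Illustrative schedule: train most steps at low resolution for throughput,
# then switch to higher resolutions so the vision encoder sees fine-grained
# detail only in the later, more expensive phase.
RESOLUTION_SCHEDULE = [
    (0,      224),   # from step 0:      224 x 224 inputs
    (50_000, 384),   # from step 50_000: 384 x 384 inputs
    (80_000, 512),   # from step 80_000: 512 x 512 inputs (visually rich documents)
]

def image_resolution(step: int) -> int:
    """Return the input image resolution to use at a given training step."""
    resolution = RESOLUTION_SCHEDULE[0][1]
    for start_step, res in RESOLUTION_SCHEDULE:
        if step >= start_step:
            resolution = res
    return resolution
```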
The resulting model, jina-clip-v2 depicted in Figure 1, not only achieves competitive performance on cross-modal retrieval benchmarks against state-of-the-art models like NLLB-CLIP-SigLIP (Visheratin, 2023b) but also performs comparably to dedicated multilingual text embedding models like jina-embeddings-v3 (Sturua et al., 2024) on the Retrieval and STS tasks of the multilingual MTEB benchmark (Muennighoff et al., 2024).
Moreover, due to the inclusion of visually rich training data and the progressive increase in image resolution, jina-clip-v2 demonstrates significantly improved performance on benchmarks for visually rich document retrieval, compared to jina-clip-v1.
To reduce vector storage costs, we employ Matryoshka Representation Learning (Kusupati et al., 2024) to enable dimension truncation on the output vectors with minimal performance degradation.
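In rough terms, Matryoshka-style truncation keeps a prefix of the output vector and re-normalizes it, while training applies the contrastive loss to several nested prefixes so that short vectors remain useful. The prefix lengths and helper names in this sketch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def truncate_embedding(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Inference-time truncation: keep the first `dim` dimensions and re-normalize."""
    return F.normalize(emb[..., :dim], dim=-1)

def matryoshka_loss(q, d, contrastive_loss, dims=(64, 128, 256, 512, 1024)):
    """Training-time sketch: apply the contrastive loss to nested embedding prefixes.
    `contrastive_loss` and the prefix set `dims` are assumed, not the paper's exact choices."""
    return sum(contrastive_loss(q[..., :k], d[..., :k]) for k in dims)
```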