Training
The jina-clip-v2 model uses the dual encoder architecture introduced in the original CLIP model (Radford et al., 2021) and reused in jina-clip-v1 (Koukounas et al., 2024). One encoder embeds texts into a 1024-dimensional embedding space, while the other embeds images into the same space.
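The dual encoder setup can be sketched as two projections into a shared space, where cross-modal similarity reduces to a dot product of unit-normalized vectors. The projection matrices and input dimensions below are toy stand-ins for illustration, not the actual encoders:

```python
import numpy as np

EMBED_DIM = 1024  # shared embedding dimension, as in jina-clip-v2

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two encoders: each maps its modality's
# features into the same 1024-dimensional space via a linear projection.
W_text = rng.standard_normal((768, EMBED_DIM)) / np.sqrt(768)
W_image = rng.standard_normal((512, EMBED_DIM)) / np.sqrt(512)

def encode_text(features: np.ndarray) -> np.ndarray:
    z = features @ W_text
    return z / np.linalg.norm(z, axis=-1, keepdims=True)  # unit-normalize

def encode_image(features: np.ndarray) -> np.ndarray:
    z = features @ W_image
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Toy inputs standing in for raw encoder outputs before projection.
text_emb = encode_text(rng.standard_normal((2, 768)))
image_emb = encode_image(rng.standard_normal((2, 512)))

# Because both modalities live in the same space, text-image similarity
# is simply the cosine similarity, i.e. a dot product of unit vectors.
similarity = text_emb @ image_emb.T  # shape (2, 2)
```

Unit-normalizing both sides is what makes the dot product a cosine similarity, bounded in [-1, 1] regardless of input scale.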
The text encoder is initialized with the pre-trained Jina-XLM-RoBERTa model weights. Introduced in jina-embeddings-v3 (Sturua et al., 2024), the Jina-XLM-RoBERTa model is a port of the XLM-RoBERTa (Conneau et al., 2020) weights to a modern encoder-only architecture that adds FlashAttention (Dao et al., 2022), rotary positional embeddings (Su et al., 2024), and LoRA (Hu et al., 2022).
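The core idea of rotary positional embeddings is to rotate pairs of feature dimensions by a position-dependent angle, so that attention scores between two tokens depend only on their relative offset. The following is a minimal illustrative sketch using the half-split pairing convention, not the Jina-XLM-RoBERTa implementation:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary positional embeddings to x of shape (seq_len, dim).

    Sketch of RoPE (Su et al., 2024): the feature vector is split into
    two halves, and each (x1[i], x2[i]) pair is rotated by an angle
    proportional to the token position, with a per-pair frequency.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied pairwise: norms are preserved exactly.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because each pair undergoes a pure rotation, the transform preserves vector norms, and the token at position 0 is left unchanged (all angles are zero there).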
For the image encoder, we opted for the EVA02 (Fang et al., 2024b) family of ViT models, as in jina-clip-v1. We selected a relatively large pre-trained model, comparable in size to the text encoder. The model implementation includes 2D rotary positional embeddings (Su et al., 2024) and a memory-efficient attention implementation based on xFormers (Lefaudeux et al., 2022).
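One common way to extend rotary embeddings to the 2D patch grid of a ViT is axial: half the channels are rotated according to each patch's row coordinate, the other half according to its column coordinate. The sketch below illustrates this axial scheme as an assumption; it is not EVA02's exact implementation:

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    # Rotate consecutive halves of the last dim by angles pos * freq.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(patches: np.ndarray, grid_h: int, grid_w: int) -> np.ndarray:
    """Axial 2D RoPE sketch: half the channels encode the row coordinate
    of each patch, the other half the column coordinate."""
    n, dim = patches.shape
    rows = np.repeat(np.arange(grid_h), grid_w).astype(float)  # row per patch
    cols = np.tile(np.arange(grid_w), grid_h).astype(float)    # col per patch
    h = dim // 2
    return np.concatenate(
        [rope_1d(patches[:, :h], rows), rope_1d(patches[:, h:], cols)],
        axis=-1,
    )
```

As in the 1D case, the transform is a composition of pairwise rotations, so patch-token norms are preserved and the patch at grid position (0, 0) is unchanged.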
Architectural details for both encoders are presented in Table 1.