Analysis
Our empirical study demonstrates that image resolution plays a critical role in retrieving visually rich documents. To better understand the relationship between image resolution and model performance on such documents, we conducted a targeted experiment.
We checkpointed jina-clip-v2 at the end of Stage 1, when it had only seen images at a resolution of (224, 224). This checkpoint had been trained on 1.45 billion image-caption pairs and 1.45 billion text-text pairs. We conducted four runs, keeping the image resolution at (224, 224) for the first run and increasing it to (384, 384), (512, 512), and (768, 768) for the three additional runs. Each run was trained for 3,500 steps with the same visually rich training set and hyperparameters. The models were then evaluated on the ViDoRe benchmark (Faysse et al., 2024), which comprises 10 datasets designed for retrieving visually rich documents from text queries.
We plot the results in Figure 2. Unsurprisingly, increasing the image resolution has a positive impact on linking captions to visually rich documents. The most significant improvement occurs when the resolution increases from (224, 224) to (384, 384), with the average nDCG@5 score across the 10 benchmark datasets rising from 0.256 to 0.454. A further increase to (512, 512) also brings a noticeable gain in performance.
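For reference, nDCG@5 scores such as those above can be computed as follows. This is a minimal sketch of the standard nDCG@k formula, not the benchmark's own evaluation code; the function names and the binary-relevance example are illustrative assumptions.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k retrieved documents,
    # where `relevances` lists graded relevance in ranked order.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    # Normalize by the DCG of an ideally ordered ranking of the same items.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Example: the single relevant document is retrieved at rank 3.
print(ndcg_at_k([0, 0, 1, 0, 0], k=5))  # -> 0.5
```

A perfect ranking yields 1.0, so the reported averages of 0.256 and 0.454 can be read directly as fractions of the ideal ranking quality.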
Considering that the model is trained with a patch size of 14, increasing the resolution beyond (512, 512) significantly raises the number of patches: a 2.25x increase when moving from (512, 512) to (768, 768). This adds substantial computational overhead while yielding only a marginal improvement, with nDCG@5 increasing by just 0.019. We therefore identify (512, 512) as the optimal image resolution, offering a good balance between performance and computational efficiency.
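The 2.25x figure follows directly from the patch-grid arithmetic. The sketch below assumes square inputs tiled by floor division into 14x14 patches; the model's exact preprocessing (e.g. padding or resizing to a patch multiple) may differ slightly.

```python
PATCH = 14  # ViT patch size used in the experiment

def num_patches(side: int, patch: int = PATCH) -> int:
    # A square image of `side` pixels yields a (side // patch) x (side // patch)
    # grid of non-overlapping patches.
    return (side // patch) ** 2

for side in (224, 384, 512, 768):
    print(side, num_patches(side))
# 224 -> 256, 384 -> 729, 512 -> 1296, 768 -> 2916 patches

# Moving from (512, 512) to (768, 768) scales the patch count by (768/512)^2:
print(num_patches(768) / num_patches(512))  # -> 2.25
```

Since self-attention cost grows quadratically in sequence length, the actual compute increase from the extra patches is even steeper than 2.25x, which reinforces the choice of (512, 512).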