Benchmarking Models for Multi-modal Search

April 10, 2024
TL;DR We benchmark over one hundred CLIP models to assess them not only on text-to-image retrieval (a proxy for search relevancy), but also on latency when embedding text and images. We also provide a comprehensive table of model-specific details like context length and input image size, along with recommendations on which models to use.

Why performance matters

A multitude of factors need to be considered when selecting models and model architectures for production search and recommendation systems. Models that return more relevant results might have more parameters and require more FLOPS for inference. This means there is a relevancy-latency trade-off: larger models retrieve better candidates but take longer to produce embeddings.

Score Vs Time for CLIP Models

However, this can be further complicated by other differences like context length, image input size, and embedding dimension. A longer context allows more text to be encoded per embedding but typically takes longer for inference. Larger images can take longer to process but can yield better results. Larger embedding dimensions take up more memory and require more compute when calculating distances, which impacts latency.
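The memory and compute cost of embedding dimensionality can be estimated with a quick back-of-the-envelope calculation. The sketch below (illustrative corpus size, not from the benchmarks) shows how both the storage for a corpus of embeddings and the FLOPs per brute-force query grow linearly with dimension.

```python
import numpy as np

# Hypothetical corpus size for illustration only.
n_docs = 100_000

for dim in (512, 768, 1024):
    # float32 embeddings: 4 bytes per dimension per document.
    corpus = np.zeros((n_docs, dim), dtype=np.float32)
    mb = corpus.nbytes / 1e6
    # One brute-force query = n_docs dot products of length `dim`
    # (one multiply and one add per element).
    flops = 2 * n_docs * dim
    print(f"dim={dim}: {mb:.1f} MB of embeddings, {flops / 1e6:.0f} MFLOPs per query")
```

Approximate nearest neighbor indexes reduce the per-query compute, but the linear growth in memory with dimension remains.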

Score Vs Dimension for CLIP Models

The table below provides some high level guidance about how to reason about model specifications.

Attribute | What does it mean? | Why does it matter?
Image size | The size of image the model uses as input. All images will be resized to the model input size. | Increasing image size can improve retrieval performance but can increase latency.
Context length | The maximum amount of tokenized text the model can use. There is no fixed rule for converting words to tokens, but 1 token ≈ 0.75 words. | Longer context lengths can increase latency but allow fewer vectors to be used to represent large pieces of text.
Embedding dimension | The size of the vector that the model outputs. | The number of dimensions directly impacts latency in the retrieval stage, since larger dimensions require more computation.
Parameters/FLOPS | The model "size" in terms of parameters, memory and compute. | The larger these are, the more memory is required and the higher the latency typically is.
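The tokens ≈ 0.75 words rule of thumb from the table can be turned into a quick estimate of whether a piece of text fits within a model's context length. The helper below is a hypothetical sketch (not part of open_clip); most CLIP text towers use a 77-token context.

```python
def fits_context(text: str, context_length: int = 77) -> bool:
    """Rough estimate of whether `text` fits the model's context length.

    Uses the rule of thumb 1 token ≈ 0.75 words, i.e. 1 word ≈ 1.33 tokens.
    Adds 2 for the start-of-text and end-of-text tokens CLIP tokenizers use.
    """
    est_tokens = len(text.split()) / 0.75
    return est_tokens + 2 <= context_length
```

Text that exceeds the context length is truncated by the tokenizer, so long documents need to be split into multiple chunks (and therefore multiple vectors).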

Introduction to CLIP

CLIP and derivative models have become pervasive amongst multi-modal applications. CLIP stands for "contrastive language-image pre-training" and provides a simple and effective way to learn visual representations with natural language. CLIP has been used extensively in image generation, zero-shot classification, LLMs, and generating embeddings for images and text. The latter application has meant that CLIP models have found popularity in multi-modal retrieval tasks like search, and it is this use case which is the focus of this post. We combine retrieval performance (from open_clip) with latency measurements, context lengths and image input sizes to provide a holistic and practical view of model performance. All benchmarking done in this article utilizes the open_clip implementations.

Figure 1. A diagram illustrating how CLIP works in pre-training (left) and as a classifier post-training.

CLIP in Retrieval

CLIP models and their derivatives [CLIPA, EVA, SigLIP] have become popular for retrieval use cases like search and recommendations. CLIP can produce embeddings for images and text that live in the same latent (embedding) space. This means that text can be compared directly to images and vice versa. This simple property permits cross-modal search by comparing the embedding generated from a text query to a corpus of image embeddings using nearest neighbor search.
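A minimal sketch of this cross-modal comparison is below. The ranking function is the core idea: normalize both sets of embeddings so that a dot product equals cosine similarity, then sort. The embeddings here are random stand-ins; in practice they would come from a CLIP model via open_clip (shown in the comment).

```python
import numpy as np

def cosine_rank(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """Rank images for each text query by cosine similarity (best first)."""
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    i = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    # After normalization, the dot product is the cosine similarity.
    return np.argsort(-(t @ i.T), axis=-1)

# In practice the embeddings come from a CLIP model, e.g. with open_clip:
#   model, _, preprocess = open_clip.create_model_and_transforms(
#       "ViT-B-32", pretrained="laion2b_s34b_b79k")
#   tokenizer = open_clip.get_tokenizer("ViT-B-32")
#   with torch.no_grad():
#       text_emb = model.encode_text(tokenizer(["a photo of a cat"])).numpy()
#       image_emb = model.encode_image(preprocessed_images).numpy()

# Toy stand-in: 1 query against 3 "image" embeddings in a 512-d space.
rng = np.random.default_rng(0)
ranking = cosine_rank(rng.normal(size=(1, 512)), rng.normal(size=(3, 512)))
```

For a large corpus, the brute-force ranking above would be replaced by an approximate nearest neighbor index, but the embedding comparison is identical.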

Measuring Latency

Latency benchmarking was performed across two GPUs commonly used for inference: the T4 and A10G. Each model had its text and image inference (not pre-processing) timed as an average over 100 inferences, after a warm-up of 100 inferences. The text used for benchmarking was random combinations of three-letter dictionary words. Each text inference was on a unique sentence and all models saw the exact same set of text. Images were pre-processed and then had random noise added to them. Each image was unique but all the models saw the same set of images. Finally, all models ran inference using PyTorch's AMP.
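The warm-up-then-average timing loop described above can be sketched as follows. This is a simplified stand-in for the benchmark harness, not the exact code used; on GPU, `sync=torch.cuda.synchronize` must be passed because CUDA kernels launch asynchronously, and in the benchmarks `fn` would additionally be wrapped in `torch.no_grad()` and `torch.autocast` (AMP).

```python
import time

def time_inference(fn, batch, warmup=100, runs=100, sync=lambda: None):
    """Average latency (ms) of fn(batch) over `runs`, after `warmup` passes.

    `sync` is a no-op by default; on GPU pass sync=torch.cuda.synchronize,
    otherwise the loop measures kernel launch overhead, not compute time.
    """
    for _ in range(warmup):   # warm-up: caches, allocator, JIT, clocks
        fn(batch)
    sync()
    start = time.perf_counter()
    for _ in range(runs):
        fn(batch)
    sync()                    # wait for all queued work before stopping the clock
    return (time.perf_counter() - start) * 1000 / runs

# Usage (assuming `model` and pre-processed `images`/`tokens` exist):
#   image_ms = time_inference(model.encode_image, images,
#                             sync=torch.cuda.synchronize)
#   text_ms = time_inference(model.encode_text, tokens,
#                            sync=torch.cuda.synchronize)
```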

Measuring Retrieval Performance

The retrieval results are the same as those that appear in the open_clip repository. These scores are the average retrieval performance across three datasets (Flickr, MSCOCO and WinoGAViL) and asymmetric text-to-image and image-to-text retrieval tasks. These tasks provide a general view of performance. It should be noted that the performance measured is representative of the model's performance on these tasks, and generalizing beyond them may yield different results, particularly in search. It is strongly suggested to develop an evaluation benchmark that represents the end use case as closely as possible.
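Retrieval benchmarks of this kind are typically reported as Recall@K: the fraction of queries whose ground-truth match appears in the top K results. The toy sketch below illustrates the metric (the similarity matrix is made up, not benchmark data), assuming query i's correct match is item i.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@K where sim[i, j] = similarity of query i to item j
    and the ground-truth match for query i is item i."""
    topk = np.argsort(-sim, axis=1)[:, :k]          # indices of top-k items
    hits = sum(i in topk[i] for i in range(sim.shape[0]))
    return hits / sim.shape[0]

# Toy similarity matrix: 3 text queries against 3 images.
sim = np.array([[0.9, 0.1, 0.2],   # query 0: match ranked 1st
                [0.3, 0.5, 0.8],   # query 1: match ranked 2nd
                [0.1, 0.2, 0.7]])  # query 2: match ranked 1st
print(recall_at_k(sim, 1))  # 2/3: only queries 0 and 2 hit at k=1
print(recall_at_k(sim, 2))  # 1.0: all matches appear in the top 2
```

Averaging such scores over both retrieval directions (text-to-image and image-to-text) and over several datasets gives headline numbers like those in the open_clip repository.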


Below we show the performance of the models along different dimensions. We use the average score for retrieval performance and then plot that with respect to different aspects that impact performance.

All benchmarks are performed using an NVIDIA A10 Tensor Core GPU.

Benchmark Score by Embedding Dimensionality

Score by embedding dimensionality

Text Inference Time (ms) by Maximum Context Length

Text inference time by text token context length

Image Inference Time (ms) by Input Image Shape

Image inference time by image input size

All the Data in One Table

For an interactive version please refer to our Hugging Face space.

How do I Choose a Model?

If you do not know where to start then we suggest one of the following models:

Use case | Model | Pretrained | What it is best for
Fastest inference | ViT-B-32 | laion2b_s34b_b79k | When the lowest latency and smallest memory footprint are required.
Best balanced | ViT-L-14 | laion2b_s32b_b82k | When low latency is still required but with much better retrieval performance. GPU recommended.
Best all-round | xlm-roberta-large-ViT-H-14 | frozen_laion5b_s13b_b90k | When the best performance is required. Latency is increased along with memory. GPU recommended.

Alternatively, when you have specific requirements, selecting a model from the full table will work well. A good rubric to follow is:

  • Are there any image/text latency requirements? → Filter any that have unacceptable times.
  • Are there storage requirements? → Filter out larger dimension models.
  • Are there memory requirements? → Filter out larger parameter/FLOPs models.

Then select the best performing model based on the average score.
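The rubric above amounts to filtering the model table on hard constraints and then maximizing the average score. A minimal sketch of that selection, with entirely made-up example numbers (the real values should come from the benchmark table), could look like this:

```python
# Hypothetical rows mirroring the columns of the benchmark table in this
# post. The latency/score numbers below are illustrative, NOT the measured ones.
models = [
    {"name": "ViT-B-32", "text_ms": 2.0, "image_ms": 3.0, "dim": 512,
     "params_m": 151, "score": 0.56},
    {"name": "ViT-L-14", "text_ms": 5.0, "image_ms": 12.0, "dim": 768,
     "params_m": 428, "score": 0.61},
    {"name": "ViT-H-14", "text_ms": 9.0, "image_ms": 25.0, "dim": 1024,
     "params_m": 986, "score": 0.64},
]

def pick_model(models, max_ms=None, max_dim=None, max_params_m=None):
    """Filter on latency, embedding dimension, and model size; then return
    the remaining model with the best average retrieval score."""
    ok = [m for m in models
          if (max_ms is None or max(m["text_ms"], m["image_ms"]) <= max_ms)
          and (max_dim is None or m["dim"] <= max_dim)
          and (max_params_m is None or m["params_m"] <= max_params_m)]
    return max(ok, key=lambda m: m["score"]) if ok else None
```

With no constraints this picks the highest-scoring model; tightening a latency budget (e.g. `max_ms=15`) or a storage budget (`max_dim=512`) walks the choice down the table, which is exactly the rubric in list form.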

Although some models score better on the three benchmark datasets, we have found that they may not generalize as well. In addition to the benchmark results shown, internal benchmarking has shown the models above to be very good at specific domains like product search. They are very good general models and should perform well across many tasks. It is always a good idea to develop a benchmark closely related to the task and evaluate multiple models. At the very least, a vibe check can be used on the real use case.

Jesse Clark
Jesse is a co-founder and the CTO at Marqo, he leads the applied sciences division performing R&D in AI for search and recommendations.