Announcement

Introducing Marqo Specialized Embedding Models for Ecommerce: Powering Multimodal AI Search

We have launched two foundation models for ecommerce that deliver significantly higher performance for product search and recommendations. These models excel at generating multimodal product embeddings from images and text, outperforming models from Amazon, Google, and Cohere, as well as the leading open source models. They are optimized specifically for ecommerce, offering enhanced performance in real-world search scenarios.

We've developed two new state-of-the-art foundation models for generating multimodal product embeddings from images and text: Marqo-Ecommerce-B and Marqo-Ecommerce-L. These models outperform existing state-of-the-art solutions like Amazon Titan's Multimodal Embedding by up to 88% and the best open source model (ViT-SO400M-14-SigLIP) by up to 31%.

Figure 1: Average performance of all benchmarked models for both marqo-ecommerce-easy (200k products across all tasks) and marqo-ecommerce-hard (4M products across all tasks). Baseline model is ViT-B-16-SigLIP.

Summary of Results

We benchmarked a number of embedding models for multimodal product retrieval. These included the base models ViT-B-16-SigLIP and ViT-L-16-SigLIP, as well as the best open source CLIP/SigLIP model, ViT-SO400M-14-SigLIP. We also included API-based multimodal embeddings offered by Amazon-Titan-Multimodal, GCP-Vertex, Jina-V1-CLIP, and Cohere-Embedding-v3, although there are limitations on some of these private providers (see below for more details).

Our benchmarking process was divided into two distinct regimes, each using different datasets of ecommerce product listings: marqo-ecommerce-hard and marqo-ecommerce-easy. Both datasets contained product images and text and differed only in size. The "easy" dataset is approximately 10-30 times smaller (200k vs 4M products) and is designed to accommodate rate-limited models, specifically Cohere-Embedding-v3 and GCP-Vertex (with limits of 0.66 rps and 2 rps respectively). The "hard" dataset represents the true challenge, since it contains four million ecommerce product listings and is more representative of real-world ecommerce search scenarios. For both the marqo-ecommerce-hard and marqo-ecommerce-easy datasets, the models were benchmarked on three different tasks:

  • GoogleShopping-Text2Image: uses the product title to search product images from Google Shopping data. This is representative of descriptive queries in search.
  • GoogleShopping-Category2Image: uses the product categories as queries to search product images from Google Shopping data. This is analogous to short, keyword-like queries in search.
  • AmazonProducts-Text2Image: uses the product title to search product images from Amazon product data. This is representative of descriptive queries in search.

We have made these datasets available on Hugging Face along with scripts to reproduce the evaluation.
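
For example, here is a minimal sketch of pulling the Google Shopping evaluation data with the Hugging Face datasets library. The dataset name is the one used in the GCL evaluation command later in this post; the split and field layout are not spelled out here, so check the dataset card for the exact schema.


from datasets import load_dataset

# Google Shopping evaluation dataset published on Hugging Face
# (dataset name taken from the GCL evaluation command further below).
ds = load_dataset("Marqo/google-shopping-general-eval")

print(ds)  # lists the available splits and columns
first_split = next(iter(ds.values()))
print(first_split[0])  # a single record, e.g. product image, title and item_ID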

The benchmarking results show that the Marqo-Ecommerce models consistently outperformed all other models across various metrics. Specifically, marqo-ecommerce-L achieved an average improvement of 17.6% in MRR and 20.5% in nDCG@10 when compared with the current best open source model, ViT-SO400M-14-SigLIP, across all three tasks in the marqo-ecommerce-hard dataset. When compared with the best private model, Amazon-Titan Multimodal, we saw an average improvement of 38.9% in MRR and 45.1% in nDCG@10 across all three tasks, and 35.9% in Recall across the Text-to-Image tasks in the marqo-ecommerce-hard dataset.

Why Did We Create Marqo-Ecommerce Embedding Models?

The Problem

While contrastive learning models like CLIP and SigLIP are powerful, they are not optimized for the needs of ecommerce. They were trained on a large collection of images, many of which aren't related to ecommerce, with little curation or domain specificity. The product data in ecommerce datasets differs significantly from general-purpose datasets, resulting in suboptimal performance when these models are used for search and recommendations. Additionally, these models were trained on data that is now several years old, and they have no understanding of recent products or trends.

Our Solution

We built the Marqo-Ecommerce-B and Marqo-Ecommerce-L models, which excel at ecommerce search, retrieval, and recommendation tasks. The models were trained on hundreds of millions of samples from ~50 million unique products across 20,000 Amazon ASIN categories, spanning categories from appliances and automotive to office products and pet supplies. The models were evaluated on extensive benchmark datasets that spanned over 4 million unique products covering the same 20,000 categories, which are taken from Amazon's product taxonomy.

The Marqo-Ecommerce embedding models are designed specifically to work seamlessly with Marqo Cloud, our end-to-end embeddings platform. Additionally, you can fine-tune our embedding models on your own product catalogs and user behavior using Marqtune, our embedding model training platform backed by our contrastive learning framework, GCL.

Why Do These Models Matter for Ecommerce?

If you're building an ecommerce site, here's how Marqo's new models can help you:

  • Faster, More Accurate Searches: Our benchmarks show that marqo-ecommerce-L achieved an average improvement of 17.6% in MRR when compared with the current best open source model, ViT-SO400M-14-SigLIP, across all three tasks on the 4M evaluation dataset. This translates to customers finding relevant products more quickly and accurately, enhancing their experience and boosting conversion rates.
  • Enhanced Precision and Recall: When compared with the best private model, Amazon-Titan Multimodal, the marqo-ecommerce-L model demonstrated an average improvement of 45.1% in nDCG@10 across all three tasks, and 35.9% in Recall across the Text-to-Image tasks on the 4M evaluation dataset. This means customers are more likely to see highly relevant products immediately in their search results, leading to higher satisfaction and increased likelihood of repeat business, whether they're searching for niche items or popular products.
  • Better Category-Based Recommendations: In the Google Shopping Category-to-Image task, marqo-ecommerce-L achieved an improvement of 43.6% in MRR and 35.4% in Precision@10 against Amazon-Titan Multimodal on the 4M evaluation dataset. This capability enables more accurate category-based product suggestions, helping customers discover new products and increasing opportunities for upselling and cross-selling, which directly drives revenue.

These performance gains over the best open source (ViT-SO400M-14-SigLIP) and best private model (Amazon-Titan Multimodal) highlight the potential of Marqo's ecommerce-specific models to significantly enhance real-world ecommerce applications, leading to improved customer satisfaction, higher conversion rates, and increased revenue for online retailers.

Released Marqo Models

We've released both our models, Marqo-Ecommerce-B and Marqo-Ecommerce-L, on Hugging Face. The B model is smaller and faster for inference (5.1 ms for single-batch text and 5.7 ms for a single image) and has a smaller embedding dimension (768). The L model is larger (652M parameters) and has a larger embedding dimension (1024), but delivers better retrieval performance: Marqo-Ecommerce-L shows an average improvement of up to 7.3% in MRR and 7.4% in nDCG@10 over Marqo-Ecommerce-B across the three tasks on the 4M evaluation dataset.

Table 1. Summary of model parameters and inference times.

Before we show you how to use these models in Marqo Cloud and/or Hugging Face, let's first take a look at their performance against existing, state-of-the-art embedding models. If you want to start using the models straight away, skip ahead to Deploy Marqo-Ecommerce Models on Marqo Cloud or Loading Marqo-Ecommerce Models.

Performance

Here are the detailed results for three general ecommerce retrieval tasks. These tasks measure the performance of various embedding models in retrieving images based on long and short text descriptions and categories. We focus on Precision, Recall, MRR (Mean Reciprocal Rank), and nDCG to showcase how our Marqo-Ecommerce models stack up against existing solutions, such as Amazon Titan Multimodal and other popular open-weights SigLIP ViT models from Google.
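
To make these numbers concrete, here is a minimal, illustrative sketch of how MRR, Recall@10, and nDCG@10 behave in the single-relevant-item case (each Text2Image query has exactly one correct image). This is not the exact evaluation code, which lives in the GCL repository referenced later in this post.


import math

def single_relevant_metrics(ranked_doc_ids, relevant_id, k=10):
    # Reciprocal rank, recall@k and nDCG@k for queries with exactly one
    # relevant document, as in the Text2Image tasks.
    if relevant_id not in ranked_doc_ids:
        return {"rr": 0.0, "recall@k": 0.0, "ndcg@k": 0.0}
    rank = ranked_doc_ids.index(relevant_id) + 1  # 1-based rank of the correct item
    rr = 1.0 / rank
    recall_at_k = 1.0 if rank <= k else 0.0
    # With a single relevant item the ideal DCG is 1, so nDCG@k reduces to 1/log2(rank + 1)
    ndcg_at_k = 1.0 / math.log2(rank + 1) if rank <= k else 0.0
    return {"rr": rr, "recall@k": recall_at_k, "ndcg@k": ndcg_at_k}

# Example: the correct image is ranked third for this query
print(single_relevant_metrics(["img_7", "img_2", "img_5", "img_9"], "img_5"))
# {'rr': 0.333..., 'recall@k': 1.0, 'ndcg@k': 0.5}
# The MRR and nDCG@10 figures in the tables are these values averaged over all queries.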

Marqo-Ecommerce-Hard

As previously noted, our benchmarking process was structured around two distinct scenarios: marqo-ecommerce-hard and marqo-ecommerce-easy. This section covers the comprehensive evaluation conducted on the full 4 million products across the two source datasets (Google Shopping and Amazon Products).

GoogleShopping-Text2Image Retrieval

In this task, we evaluate how well models retrieve relevant images when given text descriptions. This dataset has 1 million image-title pairs.
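
Mechanically, these retrieval tasks boil down to embedding every corpus image once, embedding each query, and ranking images by similarity. Below is a minimal sketch of that ranking step with placeholder tensors; in practice the embeddings come from the model's encode_image/encode_text calls shown later in this post, and the 1024 dimension matches the L model.


import torch

# Placeholder embeddings standing in for real model outputs.
image_embeddings = torch.nn.functional.normalize(torch.randn(1_000, 1024), dim=-1)  # corpus of product images
query_embedding = torch.nn.functional.normalize(torch.randn(1, 1024), dim=-1)       # one text query

# Cosine similarity reduces to a dot product for L2-normalized embeddings
scores = query_embedding @ image_embeddings.T   # shape: (1, num_images)
top10 = torch.topk(scores, k=10, dim=-1)
print(top10.indices)  # indices of the 10 best-matching product images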

Figure 2: Performance comparison of various models on the GoogleShopping-Text2Image Retrieval task, evaluated using 1 million image-title pairs. Benchmarking for Marqo-Ecommerce-Hard.

The benchmark results are:

Table 2. Benchmark results for GoogleShopping-Text2Image Retrieval task in Marqo-Ecommerce-Hard.

These results demonstrate the advantage of Marqo's ecommerce models over other leading models in the GoogleShopping-Text2Image Retrieval task. Marqo-Ecommerce-L and Marqo-Ecommerce-B achieved top performance across key metrics, with Marqo-Ecommerce-L scoring a relative improvement of 43.7% in MRR and 35.4% in Recall@10 over Amazon-Titan-Multimodal, and 19% in MRR and 15% in Recall@10 over ViT-SO400M-14-SigLIP.

GoogleShopping-Category2Image Retrieval

For this task, we assess the model's ability to retrieve images that correspond to a particular product category. Using the same split of 1 million image-title pairs, we evaluate how well each model can associate text inputs with categories rather than specific product titles. The categories typically consist of a few words and are shorter than titles. While each Text2Image query has exactly one corresponding image, a Category2Image query can have multiple corresponding images.

Figure 3: Performance comparison of various models on the GoogleShopping-Category2Image Retrieval task, evaluated using 1 million image-title pairs. Benchmarking for Marqo-Ecommerce-Hard.

The benchmark results are:

Table 3. Benchmark results for GoogleShopping-Category2Image Retrieval task in Marqo-Ecommerce-Hard.

Again, the Marqo-Ecommerce-L model has the highest scores across all metrics, with Marqo-Ecommerce-B also outperforming all other models. This includes an improvement of 88% in mAP, 52% in Precision@10, and 49.3% in nDCG@10 over Amazon-Titan. When compared with the best open source model, ViT-SO400M-14-SigLIP, we see an improvement of 31.5% in mAP, 26.4% in Precision@10, 16.3% in MRR, and 25.9% in nDCG@10. These results demonstrate how well Marqo-Ecommerce models can recognize product categories and retrieve relevant images from large datasets.
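
Because each category query has multiple relevant images, this task also reports Precision@10 and mAP. The sketch below is an illustrative implementation of those two metrics, again not the exact GCL evaluation code.


def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    # Fraction of the top-k retrieved images that belong to the query's category.
    return sum(1 for d in ranked_doc_ids[:k] if d in relevant_ids) / k

def average_precision(ranked_doc_ids, relevant_ids):
    # Precision at each rank where a relevant image appears, averaged over all
    # relevant images; mAP is this value averaged over all category queries.
    hits, precisions = 0, []
    for rank, d in enumerate(ranked_doc_ids, start=1):
        if d in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

ranked = ["img_1", "img_4", "img_2", "img_9"]
relevant = {"img_4", "img_9"}
print(precision_at_k(ranked, relevant))     # 0.2
print(average_precision(ranked, relevant))  # (1/2 + 2/4) / 2 = 0.5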

AmazonProducts-Text2Image Retrieval

In this task, we scaled the evaluation to 3 million image-title pairs taken from the Amazon-products dataset. This test focuses on the challenge of finding the correct image based on a product's title, simulating a real-world ecommerce environment where users search for products based on text queries.

Figure 4: Performance comparison of various models on the AmazonProducts-Text2Image Retrieval task, evaluated using 3 million image-title pairs. Benchmarking for Marqo-Ecommerce-Hard.

The benchmark results are:

Table 4. Benchmark results for AmazonProducts-Text2Image Retrieval task in Marqo-Ecommerce-Hard.

We can see from both Figure 4 and Table 4 that the Marqo-Ecommerce embedding models led on this task too, with an improvement of 36% in Recall@10, 45% in MRR, and 43% in nDCG@10 when compared to the best private model, Amazon-Titan. When compared with the best open source model, ViT-SO400M-14-SigLIP, we saw an improvement of 15% in Recall@10 and 17% in both MRR and nDCG@10.

Marqo-Ecommerce-Easy

As mentioned, our benchmarking process was divided into two distinct scenarios: marqo-ecommerce-hard and marqo-ecommerce-easy. This section covers the latter, which features a corpus 10-30 times smaller and was designed to accommodate rate-limited models. We will look at the comprehensive evaluation conducted on the full 200k products across the two source datasets. In addition to the models already benchmarked above, these benchmarks also include Cohere-Embedding-v3 and GCP-Vertex.

GoogleShopping-Text2Image Retrieval

In this task, we evaluate how well models retrieve relevant images when given text descriptions. For this, we selected 100k samples from the original 1M image-title pairs.

Figure 5: Performance comparison of various models on the GoogleShopping-Text2Image Retrieval task, evaluated using 100k image-title pairs. Benchmarking for Marqo-Ecommerce-Easy.

The benchmark results are:

Table 5. Benchmark results for GoogleShopping-Text2Image Retrieval task in Marqo-Ecommerce-Easy.

Again, the Marqo-Ecommerce-L model has the highest scores across all metrics, with Marqo-Ecommerce-B also outperforming all other models. This includes an improvement of 11.8% in Recall@10, 26.8% in MRR, and 22.9% in nDCG@10 when compared with Amazon-Titan. Against ViT-SO400M-14-SigLIP, we saw an improvement of 3.9% in Recall@10, 11% in MRR, and 9.2% in nDCG@10.

GoogleShopping-Category2Image Retrieval

For this task, we assess the model's ability to retrieve images that correspond to a particular product category, using 100k samples selected from the original 1M image-title pairs.

Figure 6: Performance comparison of various models on the GoogleShopping-Category2Image Retrieval task, evaluated using 100k image-title pairs. Benchmarking for Marqo-Ecommerce-Easy.

The benchmark results are:

Table 6. Benchmark results for GoogleShopping-Category2Image Retrieval task in Marqo-Ecommerce-Easy.

For the GoogleShopping-Category2Image Retrieval task, there is an improvement of 67% in mAP, 55% in Precision@10, 36.9% in MRR, and 56.5% in nDCG@10 when compared with the best private model, Amazon-Titan. Against the best open source model, ViT-SO400M-14-SigLIP, we saw an improvement of 21.7% in mAP, 18.5% in Precision@10, 18.6% in MRR, and 21.1% in nDCG@10.

AmazonProducts-Text2Image Retrieval

In this task, we use 100k image-title pairs from the Amazon-Products dataset.

Figure 7: Performance comparison of various models on the AmazonProducts-Text2Image Retrieval task, evaluated using 100k image-title pairs. Benchmarking for Marqo-Ecommerce-Easy.

The benchmark results are:

Table 7. Benchmark results for AmazonProducts-Text2Image Retrieval task in Marqo-Ecommerce-Easy.

For the AmazonProducts-Text2Image Retrieval task, there is an improvement of 10% in Recall@10, 36.9% in MRR, and 18.8% in nDCG@10 when compared with the best private model, Amazon-Titan. When compared with the best open source model, ViT-SO400M-14-SigLIP, we saw an improvement of 2.5% in Recall@10, 7.9% in MRR, and 6.6% in nDCG@10.

Deploy Marqo-Ecommerce Models on Marqo Cloud

These models are available in both Marqo Cloud and Marqo open source, the end-to-end vector search engine. Here’s how you can implement them yourself.

Marqo-Ecommerce-B


import marqo

# To obtain your API Key, visit https://www.marqo.ai/blog/finding-my-marqo-api-key
api_key = "your_api_key"
mq = marqo.Client("https://api.marqo.ai", api_key=api_key)

# Alternatively, if you want to run Marqo locally with Docker, use the client below.
# For more information, see https://github.com/marqo-ai/marqo
# mq = marqo.Client("http://localhost:8882", api_key=None)

# Define settings for creating an index
settings = {
    "type": "unstructured",  # Specify the type of data to be indexed
    "model": "Marqo/marqo-ecommerce-embeddings-B",  # Name of the embedding model to be used
    "modelProperties": {
        "name": "hf-hub:Marqo/marqo-ecommerce-embeddings-B",  # Full name of the model on Hugging Face Hub
        "dimensions": 768,  # Dimensions of the embeddings
        "type": "open_clip"  # Type of the model
    },
    "treatUrlsAndPointersAsImages": True,  # Treat URLs and pointers as images for indexing
}

# Create a new index called "marqo-ecommerce-b" with the specified settings
mq.create_index("marqo-ecommerce-b", settings_dict=settings)

Marqo-Ecommerce-L


import marqo

# To obtain your API Key, visit https://www.marqo.ai/blog/finding-my-marqo-api-key
api_key = "your_api_key"
mq = marqo.Client("https://api.marqo.ai", api_key=api_key)

# Alternatively, if you want to run Marqo locally with Docker, use the client below.
# For more information, see https://github.com/marqo-ai/marqo
# mq = marqo.Client("http://localhost:8882", api_key=None)


# Define settings for creating an index
settings = {
    "type": "unstructured",  # Specify the type of data to be indexed
    "model": "Marqo/marqo-ecommerce-embeddings-L",  # Name of the embedding model to be used
    "modelProperties": {
        "name": "hf-hub:Marqo/marqo-ecommerce-embeddings-L",  # Full name of the model on Hugging Face Hub
        "dimensions": 1024,  # Dimensions of the embeddings
        "type": "open_clip"  # Type of the model
    },
    "treatUrlsAndPointersAsImages": True,  # Treat URLs and pointers as images for indexing
}

# Create a new index called "marqo-ecommerce-l" with the specified settings
mq.create_index("marqo-ecommerce-l", settings_dict=settings)

For a guide on how you can build your own Ecommerce Search Application with these models, visit our article.

Loading Marqo-Ecommerce Models

Hugging Face Transformers

If you’re ready to integrate Marqo-Ecommerce-B or Marqo-Ecommerce-L into your application, here’s how you can load them using Hugging Face's transformers library in Python:


from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image
import requests

# Choose your model: 'Marqo/marqo-ecommerce-embeddings-L' or 'Marqo/marqo-ecommerce-embeddings-B'
model_name = 'Marqo/marqo-ecommerce-embeddings-B'

# Load the model and processor
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Load an image and text for testing
img = Image.open(requests.get('https://raw.githubusercontent.com/marqo-ai/marqo-ecommerce-embeddings/refs/heads/main/images/dining-chairs.png', stream=True).raw).convert("RGB")
image = [img]
text = ["dining chairs", "a laptop", "toothbrushes"]

# Process the inputs
processed = processor(text=text, images=image, padding='max_length', return_tensors="pt")
processor.image_processor.do_rescale = False

# Perform inference
with torch.no_grad():
    image_features = model.get_image_features(processed['pixel_values'], normalize=True)
    text_features = model.get_text_features(processed['input_ids'], normalize=True)

    # Compute the similarity probabilities
    text_probs = (100 * image_features @ text_features.T).softmax(dim=-1)

print(text_probs)
# Output: [1.0000e+00, 8.3131e-12, 5.2173e-12]

OpenCLIP

You can also load the models using the OpenCLIP framework. Here’s how to do it:


from PIL import Image
import open_clip
import requests
import torch

# Specify model from Hugging Face Hub: 'hf-hub:Marqo/marqo-ecommerce-embeddings-L' or 'hf-hub:Marqo/marqo-ecommerce-embeddings-B'
model_name = 'hf-hub:Marqo/marqo-ecommerce-embeddings-L'
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(model_name)
tokenizer = open_clip.get_tokenizer(model_name)

# Preprocess the image and tokenize text inputs
# Load an example image from a URL
img = Image.open(requests.get('https://raw.githubusercontent.com/marqo-ai/marqo-ecommerce-embeddings/refs/heads/main/images/dining-chairs.png', stream=True).raw)
image = preprocess_val(img).unsqueeze(0)
text = tokenizer(["dining chairs", "a laptop", "toothbrushes"])

# Perform inference
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)

    # Calculate similarity probabilities
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Display the label probabilities
print("Label probs:", text_probs)
# [1.0000e+00, 8.3131e-12, 5.2173e-12]

Evaluating with Generalized Contrastive Learning (GCL)

GCL (Generalized Contrastive Learning) is a framework, built by Marqo, that's designed to go beyond binary relevance and leverage fine-grained rankings for multimodal retrieval tasks.

Here’s how to run the evaluation. First, install git if you don't already have it, and then run the following command from your terminal:


git clone https://github.com/marqo-ai/GCL

Install the packages required by GCL and then input the following into your terminal:


cd ./GCL
MODEL=hf-hub:Marqo/marqo-ecommerce-B
outdir=/MarqoModels/GE/marqo-ecommerce-B/gs-title2image2
hfdataset=Marqo/google-shopping-general-eval
python  evals/eval_hf_datasets_v1.py \
      --model_name $MODEL \
      --hf-dataset $hfdataset \
      --output-dir $outdir \
      --batch-size 1024 \
      --num_workers 8 \
      --left-key "['title']" \
      --right-key "['image']" \
      --img-or-txt "[['txt'], ['img']]" \
      --left-weight "[1]" \
      --right-weight "[1]" \
      --run-queries-cpu \
      --top-q 4000 \
      --doc-id-key item_ID \
      --context-length "[[64], [0]]"

All of the scripts to perform evaluations with GCL can be found on our GitHub.

Conclusion

With the release of Marqo-Ecommerce-B and Marqo-Ecommerce-L, ecommerce platforms now have access to powerful, purpose-built embedding models that outperform existing solutions by up to 88%. These models are specifically tailored for the unique challenges of ecommerce, delivering highly accurate retrieval results, whether it's matching product titles to images or associating products with broader categories. The Marqo-Ecommerce models are set to transform search, retrieval, and recommendation tasks in the ecommerce industry.

Next Steps

Try out the models, datasets, and evaluation scripts for yourself on Hugging Face, or deploy the models with Marqo Cloud.

To learn more about how we can help drive revenue to your business, book a demo with us or read more in our Redbubble Case study.

Ellie Sleightholm
Head of Developer Relations at Marqo