February 18, 2026

How AI Transforms Visual Product Search in Ecommerce

The hot-dog 100k dataset

Multimodal AI is changing how people discover products online. In ecommerce, shoppers often search visually, whether they are browsing styles, looking for similar products, or trying to match an image to something in a retailer’s catalog. To build high-performing visual discovery experiences, search systems need to understand both text and images at scale.

Marqo is an AI-native ecommerce search and product discovery platform designed to power multimodal search experiences, including image search and text-to-image retrieval. In this post, we explore how semantic search can be applied to a large image dataset using a fun example: a dataset of nearly 100,000 AI-generated hot dog images.

I was particularly interested in seeing what a generative model could produce from the same prompt across a large number of images. Using the Hugging Face diffusers library, I generated close to 93,000 images and used them as the basis for this experiment.

My original plan of 1 million hot dogs in a day quickly unraveled once I realized the required 695 hot dogs/minute was unattainable, so I had to settle for 13 hot dogs/minute and ~93,000 images (only 7,000 off the target).
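For reference, a minimal sketch of the generation loop looks something like the following; the exact checkpoint, prompt wording, and output paths are assumptions for illustration rather than the precise setup used:

```python
import os

import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint (v1.4 is an assumption here).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a photo of a hot dog"  # the single prompt reused for every image
os.makedirs("hot_dogs", exist_ok=True)

# Generate and save one image at a time; batching prompts per call
# would raise the hot-dogs/minute rate considerably.
for i in range(93_000):
    image = pipe(prompt).images[0]
    image.save(f"hot_dogs/hot_dog_{i:06d}.png")
```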

Here is a sample of 100 randomly selected images. There is a pretty wide variety of hot dogs generated, along with some very interesting interpretations of what constitutes a hot dog.

Indexing the hot-dog 100k dataset

To dig a bit deeper into the hot-dog 100k dataset, we can index the data using Marqo.

In ecommerce, this same workflow applies to product catalogs. Indexing images with multimodal AI makes it possible to build visual search and product discovery experiences where shoppers can search using natural language, style descriptions, or even images. The goal is not just similarity search, but commercially relevant discovery that improves conversion and revenue.

After downloading the dataset we start up Marqo:
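Assuming the standard setup from the Marqo README:

```bash
docker pull marqoai/marqo:latest
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
```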

Once Marqo is up and running we can get the files ready for indexing:
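A sketch of that preparation: gather the file paths and wrap each image in a document. The _id scheme is illustrative, and note that the image locations must be reachable by the Marqo server (e.g. URLs, or a directory mounted into the container):

```python
import glob
import os

# Collect the generated images and wrap each one in a Marqo document.
image_paths = sorted(glob.glob("hot_dogs/*.png"))

documents = [
    {"_id": os.path.basename(path), "image": path}
    for path in image_paths
]
```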

Now we can start indexing:
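Something like the following, assuming a recent Marqo Python client (the index name and model choice are illustrative, and the add_documents signature has shifted between versions):

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

# An image-capable index backed by a CLIP model.
mq.create_index(
    "hot-dogs-100k",
    model="ViT-L/14",
    treat_urls_and_pointers_as_images=True,
)

# "image" is the tensor field that gets embedded; client_batch_size
# splits the upload into smaller requests.
mq.index("hot-dogs-100k").add_documents(
    documents,
    tensor_fields=["image"],
    client_batch_size=64,
)
```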

Check we have our images in the index:
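A quick sanity check on the document count, plus a test query:

```python
# How many documents made it into the index?
print(mq.index("hot-dogs-100k").get_stats())

# A quick text-to-image query to confirm retrieval works.
for hit in mq.index("hot-dogs-100k").search("a hot dog", limit=3)["hits"]:
    print(hit["_id"], hit["_score"])
```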

Cleaning the hot-dog 100k dataset

One noticeable thing is the presence of some black images. These are caused by the built-in filter that suppresses images deemed NSFW. I couldn’t be bothered to remove the filter, so some of these images remain in the dataset. We can easily remove them using Marqo, though. Since I am lazy, I will just search for a blank image using a natural language description: “a black image”.
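In Marqo that is a one-line search; we keep the top hit as our reference black image:

```python
results = mq.index("hot-dogs-100k").search("a black image", limit=3)
black_image = results["hits"][0]["_id"]
```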

Top 3 results for “a black image”

Now that we have a black image, we can use it as the query to find all the other duplicate black images and remove them from the dataset.
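A sketch of that cleanup, using the black image itself as the query. The similarity cutoff of 0.99 and the result limit are hand-picked assumptions:

```python
# Reconstruct the image location from the _id (this assumes the _id
# scheme used when indexing).
black_image_path = f"hot_dogs/{black_image}"

# Near-identical images score very close to 1.0 against the query image.
results = mq.index("hot-dogs-100k").search(black_image_path, limit=500)

to_delete = [h["_id"] for h in results["hits"] if h["_score"] > 0.99]
mq.index("hot-dogs-100k").delete_documents(ids=to_delete)
```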

Our dataset should now be free of these blank images that do not contain hot dogs.

Labeling the hot-dog 100k dataset

Given the variety of hot dogs generated from a single prompt, I was keen to understand more. A quick search for the query “two hot dogs” yielded something of interest, and more was to follow with “a hamburger” and “a face”.

The top two images (up/down) for the queries “two hot dogs”, “a hamburger” and “a face” (left/right).

Armed with this survey, I created a new index containing the following four documents: “one hot dog”, “two hot dogs”, “a hamburger” and “a face”. We can use our dataset images as queries against these labels and get back scores for each image:label pair. This is effectively zero-shot learning: it provides a score for each category, which could be thresholded to assign classification labels to each image.
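A sketch of this zero-shot scoring, reusing image_paths from earlier (the index name and settings are illustrative):

```python
labels = ["one hot dog", "two hot dogs", "a hamburger", "a face"]

# A tiny index where each "document" is one category label.
mq.create_index(
    "hot-dog-labels",
    model="ViT-L/14",
    treat_urls_and_pointers_as_images=True,
)
mq.index("hot-dog-labels").add_documents(
    [{"_id": label, "label": label} for label in labels],
    tensor_fields=["label"],
)

# Use each image as the query; the hit scores are that image's
# zero-shot scores for the four categories.
scores = {}
for path in image_paths:
    hits = mq.index("hot-dog-labels").search(path, limit=len(labels))["hits"]
    scores[path] = {h["_id"]: h["_score"] for h in hits}
```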

Now, for each image, we have the computed scores against each category, where the categories were just “documents” in our small index.

Updating the hot-dog 100k dataset

We have now calculated scores for each of the categories described above. The next step is to update our indexed data with these scores. If we re-index a document, it will be updated with the new information. We can leave the image field out of these documents, as it has already been indexed with the model.
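A sketch of the update, relying on the re-index-by-_id behavior just described; the score field names are illustrative, and tensor_fields is left empty so nothing is re-embedded:

```python
import os

# Attach the label scores as plain (non-tensor) fields on each document.
updates = []
for path, label_scores in scores.items():
    doc = {"_id": os.path.basename(path)}
    for label, score in label_scores.items():
        doc["score_" + label.replace(" ", "_")] = score
    updates.append(doc)

mq.index("hot-dogs-100k").add_documents(
    updates, tensor_fields=[], client_batch_size=64
)
```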

Animating the hot-dog 100k dataset

After all this work, we can now animate the hot-dog 100k dataset. We will do this by effectively “sorting” the vectors and creating a movie from the vectors’ accompanying images. Before we animate, we can make things more interesting by restricting the images we search over, pre-filtering on some of the scores we calculated earlier. This restricts the space used for the animation according to the filtering criteria.
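For example, a filtered search over the stored scores, which we also use to pick the starting image for the walk described next (the field name and score range are illustrative):

```python
results = mq.index("hot-dogs-100k").search(
    "a photo of a smiling face",
    filter_string="score_a_face:[0.6 TO 1.0]",
    limit=1,
)
start_id = results["hits"][0]["_id"]
```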

To animate the images based on their sorted vectors, we take an image as a starting point (found with the query “a photo of a smiling face”) and find the next closest image (as seen above). Repeat the process until no images are left and you have walked across the latent space. We are effectively solving a variant of the traveling salesman problem, albeit with an approximate, greedy algorithm.
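A sketch of that walk, assuming the image embeddings have been exported to a numpy array whose rows line up with image_paths (for example via Marqo’s expose_facets option, or by re-embedding with the same CLIP model). For ~90k images you would batch the similarity step or use an ANN library, but the idea is the same:

```python
import numpy as np

def greedy_walk(embeddings: np.ndarray, start: int) -> list[int]:
    """Order images by repeatedly hopping to the nearest unvisited
    neighbour: a greedy, approximate take on the traveling salesman
    ordering described above."""
    # Normalise rows so a dot product equals cosine similarity.
    vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order = [start]
    remaining = set(range(len(vecs))) - {start}
    current = start
    while remaining:
        candidates = np.fromiter(remaining, dtype=int)
        sims = vecs[candidates] @ vecs[current]
        current = int(candidates[np.argmax(sims)])
        remaining.remove(current)
        order.append(current)
    return order
```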

After we have done this walk we have the “sorted” list of images. These can then be animated in the order they appear in the list.
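Finally, a sketch of the animation step with imageio, carrying over the embeddings array and start_id from the sketches above:

```python
import imageio.v2 as imageio
import numpy as np
from PIL import Image

# Map the starting image's _id back to a row index (assumes embedding
# rows are aligned with image_paths).
start_index = image_paths.index(f"hot_dogs/{start_id}")
order = greedy_walk(embeddings, start_index)

# Stitch the sorted images into a GIF.
frames = [
    np.asarray(Image.open(image_paths[i]).resize((256, 256)))
    for i in order
]
# ~10 fps; newer imageio versions prefer duration (seconds per frame).
imageio.mimsave("hot_dog_walk.gif", frames, fps=10)
```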

Closing thoughts

Multimodal AI is rapidly reshaping how search and discovery work. As generative models improve, the line between image understanding, text understanding, and product discovery continues to blur. For ecommerce teams, this creates a major opportunity to deliver better shopping experiences through visual search, intent-based discovery, and catalog-aware relevance.

Marqo applies these multimodal capabilities within an AI-native ecommerce search and product discovery platform. By combining advanced image understanding with catalog-trained intelligence, Marqo helps retailers deliver more relevant discovery experiences that increase conversion and reduce friction across the buying journey.
