Often the items we want to search over contain more than just text; they may also include images or videos. These other modalities often carry a wealth of information that is not captured by the text. Incorporating them into search can improve the relevance of results and unlock new ways to search. Multimodal search is common in domains like fashion and e-commerce, where an item may have a title and description as well as multiple images. The text can also help disambiguate the subject of an image - for example, if an image contains both pants and a top, the text provides the context needed to identify the correct subject. The information carried across these modalities is rich and complementary.
Standard search methodologies are not well suited to this kind of semantic search problem, but machine learning algorithms and AI-focused technologies like vector databases and multimodal embedding models provide a rich toolset for implementing multimodal search. These tools open up multiple types of search use cases. For e-commerce companies, a popular implementation is a powerful and intuitive product search engine built into a website or app.
Let’s dig into multimodal search. This article has three main parts: what multimodal search is, the ways multimodal queries and documents can be used to improve and curate results, and how to implement these techniques using Marqo.
Multimodal search is search that operates over multiple modalities. There are two sides to it: multimodal queries and multimodal documents, either of which may contain any combination of text and image data. For clarity we will stick to two modalities, text and images, but the concepts are not restricted to these and can be extended to video, audio and other data types.
There are numerous benefits to this multimodal approach, including improved relevance, new ways to query (with images, weighted terms, or both), and finer-grained curation of results.
In this section we will walk through a number of ways multimodal search can be used to improve and curate results.
Multimodal queries are queries made up of multiple components and/or multiple modalities. The benefit is that they effectively modify the scoring function of the approximate k-NN search to take additional similarities into account - for example, across multiple images or across text and images. The similarity is now scored against a weighted collection of items rather than a single piece of text, which allows finer-grained curation of search results than a single-part query alone. We have already seen examples of this earlier in the article, where both images and text were used to curate the search.
Shown below is an example where the query has multiple components. The first part of the query describes the item we want, while the second part further conditions the results. This acts as a "soft" or "semantic" filter.
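A multi-part query can be expressed as a set of weighted terms; the terms and weights below are illustrative rather than taken from the original example.

```python
# Illustrative multi-part query: the first term describes the item,
# the second acts as a "soft"/semantic filter (weights are hypothetical)
query = {
    "green shirt": 1.0,
    "short sleeves": 0.6,
}
```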
This multi-part query can be understood to be a form of manual query expansion. The animation below illustrates how the query can be used to modify search results.
In the previous examples we saw how multiple query terms can be used to condition the search. In those examples, the terms were added with positive weights. Another way to utilize these queries is to add negatively weighted terms to move away from particular terms or concepts. Below is an example of a query with an additional negative term:
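A sketch of such a query, again with hypothetical terms and weights:

```python
# Illustrative query with a negative term: move away from "buttons"
query = {
    "green shirt": 1.0,
    "short sleeves": 0.6,
    "buttons": -0.9,
}
```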
Now the search results are also moving away from the buttons while being drawn to the green shirt and short sleeves.
Negation can help avoid particular things when returning results, such as low-quality images or ones with artifacts. Properties like low quality or NSFW content can be easily described using natural language, as in the example query below:
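A minimal sketch of this kind of query; the wording and weight are illustrative.

```python
# Illustrative query that negates unwanted image properties described in natural language
query = {
    "green shirt": 1.0,
    "low quality, jpeg artifacts, blurry": -1.0,
}
```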
In the example below the initial results contain three low-quality images. These are denoted by a red mark for clarity and the poor image quality can be seen by the strong banding in the background of these images.
An alternative is to use the same query to clean up existing data by using a positive weight to actively identify low-quality images for removal.
In the earlier examples we have seen how searching can be performed using weighted combinations of images and text. Searching with images alone (via image embeddings) can also be performed to utilize image similarity to find similar looking items. An example query is below:
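An image-only query can be as simple as pointing at an image; the URL below is a placeholder.

```python
# Illustrative image-only query: documents with the most visually similar
# images will be returned (placeholder URL)
query = "https://example.com/images/green-shirt.jpg"
```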
It can also be easily extended in the same way as with text to include multiple multimodal terms.
Another way to utilize multimodal queries is to condition the query using a set of items. For example, this set could come from previously liked or purchased items (a form of similarity search). This steers the search in the direction of these items and can be used to promote particular items or themes. The method can be seen as a form of relevance feedback that uses items instead of variations on the query words themselves. To avoid any extra inference at search time, we can pre-compute the item vectors and fuse them into a single context vector.
Below is an example of two sets of 4 items that are going to be used to condition the search. The contribution for each item can also be adjusted to reflect the magnitude of its popularity.
An alternative to constructing multi-part queries is to append specific characteristics or styles to the end of a query. This is effectively the same as "prompting" in text-to-image generation models like DALL-E and Stable Diffusion. For example, additional descriptors can be appended to a query to curate the results. An example query with additional prompting is below:
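A sketch of a prompted query; the descriptors are illustrative.

```python
# Illustrative "prompted" query with appended style descriptors
query = "green shirt with short sleeves, high quality, studio lighting, cotton"
```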
The impact of this prompting on the results can be seen in the animation.
Another example query of searching as prompting:
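Again, the wording here is illustrative:

```python
# Another illustrative prompted query
query = "black leather handbag, gold hardware, product photo, white background"
```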
In addition to curating the search with the methods outlined above, we can modify the similarity score to allow ranking with other signals or metrics. For example, document-specific values can be used to multiply or bias the vector similarity score. This allows document-specific concepts like overall popularity to impact the ranking. Below are the regular query and search results based on vector similarity alone. There are three low-quality images in the result set, which can be identified by the strong banding in the background of the images.
To illustrate the ability to modify the score and use other signals for ranking, we have calculated an aesthetic score for each item. The aesthetic score is meant to identify "aesthetic" images and rates them between 1 and 10. We can now bias the score using this document-specific (but query-independent) field. An example is below:
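In Marqo this kind of biasing is expressed with score modifiers; the structure below is a sketch, and the weight is hypothetical.

```python
# Illustrative score modification: add a scaled document field to the similarity score
score_modifiers = {
    "add_to_score": [{"field_name": "aesthetic_score", "weight": 0.02}]
}
```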
With this modification, the results are biased by the aesthetic score, removing the low-quality images (which have a low aesthetic score). This example uses an aesthetic score, but any number of other scalars can be used - for example, ones based around sales and/or popularity.
Multimodal entities or items are just that - representations that take into account multiple pieces of information, which can be images, text, or some combination of both. An example is using the multiple display images of an e-commerce listing. Using multiple images can aid retrieval and help disambiguate between the item for sale and other items in the images. If a multimodal model like CLIP is used, the different modalities can be combined directly because they live in the same latent space.
In the next section we will demonstrate how all of the above concepts can be implemented using Marqo.
The dataset consists of ~220,000 e-commerce products with images, text and some metadata. The items span many categories, from clothing and watches to bags, backpacks and wallets. Along with the images, each item has an aesthetic score, caption and price. We will use all these features in the following example. Some images from the dataset are below.
The first thing to do is start Marqo (you can set up our open source product, or start immediately with our cloud offering). To start the workflow, we can run the following docker command from a terminal (for M-series Mac users see here).
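A minimal sketch of the Docker command, assuming the default image and port; exact flags can vary between Marqo versions and platforms.

```bash
docker pull marqoai/marqo:latest
docker rm -f marqo
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
```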
The next step is to install the python client (a REST API is also available).
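The client is installed from PyPI:

```bash
pip install marqo
```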
With Marqo running and the client installed, the first step in the workflow is to load the data. The images are hosted on S3 for easy access. We use a file that contains all the image pointers as well as the metadata for them (found here).
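A sketch of loading the metadata with pandas; the filename and column layout here are assumptions.

```python
import pandas as pd

# Placeholder filename for the CSV of image pointers and metadata
df = pd.read_csv("ecommerce_meta_data.csv")

# One dictionary per product, ready to be indexed as documents
documents = df.to_dict(orient="records")
```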
Now that we have the data prepared, we can set up the index. We will use a ViT-L-14 model from OpenCLIP, which is a good starting point. A GPU (with at least 4GB of VRAM) is recommended; otherwise a smaller model can be used, although results may be worse.
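A sketch of creating the index; the index name is arbitrary, and the exact model string and parameter names may differ between Marqo versions.

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

index_name = "multimodal-search"
mq.create_index(
    index_name,
    model="open_clip/ViT-L-14/laion2b_s32b_b82k",
    treat_urls_and_pointers_as_images=True,  # embed image URLs as images
)
```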
Now we can add the images to the index (these become vector embeddings, specifically image embeddings) which can then be searched over. We can also select the device we want to use and which fields in the data to embed. To use a GPU, change the device to "cuda" (see here for how to use Marqo with a GPU).
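A sketch of indexing the documents; the field name "image" is an assumption about the dataset, and argument names can vary slightly between client versions.

```python
# Embed only the image field; switch device to "cuda" if a GPU is available
mq.index(index_name).add_documents(
    documents,
    tensor_fields=["image"],   # assumed field holding the image URL
    device="cpu",
    client_batch_size=64,
)
```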
Now the images are indexed, we can start searching.
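A basic text search might look like this:

```python
results = mq.index(index_name).search("green shirt", limit=10)
for hit in results["hits"]:
    print(hit["_score"], hit.get("image"))  # "image" is the assumed URL field
```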
Like in the examples above, it is easy to do more specific searches by adopting a similar style to prompting.
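A prompt-style query simply appends descriptors to the text; the wording is illustrative.

```python
# More specific, "prompted" query
results = mq.index(index_name).search(
    "green shirt with short sleeves, high quality, studio lighting"
)
```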
Now we can extend the searching to use multi-part queries. These can act as "semantic filters" that can be based on any words to further refine the results.
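Multi-part queries are passed as a dictionary of weighted terms; the weights here are hypothetical.

```python
# The second term acts as a semantic filter on the results
results = mq.index(index_name).search(
    {"green shirt": 1.0, "short sleeves": 0.6}
)
```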
In addition to additive terms, negation can be used. Here we remove buttons from long sleeve shirt examples.
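A sketch with a negatively weighted term:

```python
# The negative weight moves the results away from shirts with buttons
results = mq.index(index_name).search(
    {"long sleeve shirt": 1.0, "buttons": -0.9}
)
```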
In addition to text, searching can be done with images alone.
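An image query is just an image URL (placeholder shown here):

```python
# Search using an image alone
results = mq.index(index_name).search(
    "https://example.com/images/green-shirt.jpg"
)
```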
The multi-part queries can span both text and images.
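Text and image terms can be mixed in the same weighted query; the URL and weights are placeholders.

```python
results = mq.index(index_name).search(
    {
        "https://example.com/images/green-shirt.jpg": 1.0,
        "short sleeves": 0.5,
        "buttons": -0.5,
    }
)
```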
We can now extend the search to include document-specific values that boost the ranking of documents in addition to the vector similarity. In this example, each document has a field called aesthetic_score which can be used to bias the score of each document.
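A sketch using score modifiers; the weight is hypothetical.

```python
# Bias the similarity score with the per-document aesthetic_score field
results = mq.index(index_name).search(
    "green shirt",
    score_modifiers={
        "add_to_score": [{"field_name": "aesthetic_score", "weight": 0.02}]
    },
)
```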
Results can be personalized at a per-query level using sets of items, for example previously liked or popular items. We do this in two stages. The first is to calculate a "context vector", which is a condensed representation of the items; this is pre-computed and stored to remove any additional overhead at query time. The context is generated by creating documents from the item sets and retrieving the corresponding vectors. The first step is to create a new index to calculate the context vectors.
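A sketch of the helper index, using the same model as before; the index name is arbitrary.

```python
context_index_name = "context-vectors"
mq.create_index(
    context_index_name,
    model="open_clip/ViT-L-14/laion2b_s32b_b82k",
    treat_urls_and_pointers_as_images=True,
)
```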
Then we construct the objects from the sets of items we want to use for the context.
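Each item set becomes a single document whose sub-fields are the item images; the URLs below are placeholders.

```python
context_documents = [
    {
        "_id": "liked_items",
        "multimodal": {
            "image_1": "https://example.com/images/item1.jpg",
            "image_2": "https://example.com/images/item2.jpg",
            "image_3": "https://example.com/images/item3.jpg",
            "image_4": "https://example.com/images/item4.jpg",
        },
    },
]
```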
We can now define mappings objects to determine how we want to combine the different fields. We can then index the documents.
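A sketch of the mappings object and indexing call; equal weights are used here, but they can be adjusted to reflect each item's contribution.

```python
mappings = {
    "multimodal": {
        "type": "multimodal_combination",
        "weights": {"image_1": 0.25, "image_2": 0.25, "image_3": 0.25, "image_4": 0.25},
    }
}
mq.index(context_index_name).add_documents(
    context_documents,
    tensor_fields=["multimodal"],
    mappings=mappings,
)
```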
To use these as context vectors at search time, we need to retrieve the calculated vectors and place them in a context object that is passed to the search.
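A sketch of retrieving the fused vector and searching with it as context; the query text and weight are illustrative.

```python
# Fetch the stored document with its embeddings exposed
doc = mq.index(context_index_name).get_document("liked_items", expose_facets=True)
context_vector = doc["_tensor_facets"][0]["_embedding"]

# Use the pre-computed vector to steer the search towards the liked items
results = mq.index(index_name).search(
    "shirt",
    context={"tensor": [{"vector": context_vector, "weight": 0.3}]},
)
```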
For the final part of this example, we demonstrate how text and images can be combined into a single entity to form a multimodal representation. We will create a new index in the same way as before but with a new name.
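A sketch of the new index; the name is arbitrary.

```python
multimodal_index_name = "multimodal-objects"
mq.create_index(
    multimodal_index_name,
    model="open_clip/ViT-L-14/laion2b_s32b_b82k",
    treat_urls_and_pointers_as_images=True,
)
```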
To index the documents as multimodal objects, we need to create a new field and add in what we want to use.
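Each document gets a combined field containing the parts we want to fuse; the "image" and "title" field names are assumptions about the dataset.

```python
for doc in documents:
    doc["image_title"] = {
        "image": doc["image"],
        "title": doc["title"],
    }
```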
The next step is to index. The only change is an additional mappings object which details how we want to combine the different fields for each document.
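A sketch of indexing with the mappings object; the 80/20 weighting between image and title is hypothetical.

```python
mappings = {
    "image_title": {
        "type": "multimodal_combination",
        "weights": {"image": 0.8, "title": 0.2},
    }
}
mq.index(multimodal_index_name).add_documents(
    documents,
    tensor_fields=["image_title"],
    mappings=mappings,
    device="cpu",
    client_batch_size=64,
)
```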
Finally we can search in the same way as before.
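Searching is unchanged:

```python
results = mq.index(multimodal_index_name).search("green shirt with short sleeves")
```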
To summarize, we have shown how vector search can be easily modified to enable a number of useful use cases. These include multimodal queries that combine text and images, queries with negative terms, excluding low-quality images, searching with images, per-query curation using popular items, verbose searching via prompting, ranking with external scalars and multimodal representations.
You can also see that building powerful search experiences requires more than a basic large language model (like OpenAI’s GPT models), NLP, or AI image technology. Marqo gives you all of the tools you need, out of the box, to ship powerful search experiences across data types.
If you are interested in learning more, then head to Marqo, see other examples or read more in our blog.