In the age of big data and artificial intelligence, the ability to efficiently search through large datasets has become increasingly important. Traditional keyword-based search methods often fall short in providing accurate and relevant results, especially in complex scenarios involving natural language processing and multimedia data. This is where vector search comes into play. In this article, we’ll be showing you how to set up your own vector search applications using Marqo.
Marqo is easy to implement (it takes only a few lines of code to set up) and handles much of the complexity for you, including embedding generation.
To run Marqo, you will need Docker, to run the Marqo server, and Python with pip, to install the Marqo client.
Let's now take a look at how we can set up and install Marqo.
We’ll start by downloading and installing Marqo. If you have any issues setting up Marqo, visit our Slack Community and post your issue in the ‘get-help’ channel, where we’ll be happy to help!
docker pull marqoai/marqo:latest
docker rm -f marqo
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
When you run these commands, Docker first pulls the marqoai/marqo image and then sets up a vector store. Next, the Marqo artefacts begin downloading, which can take a little while. Once everything is set up successfully, you’ll be greeted with a welcome message in the terminal.
That’s it, it really is as easy as that! Now we’re ready to use Marqo. Keep this terminal open while we begin programming.
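Because startup can take a while, it can help to confirm from Python that the server is answering before running the examples. The following is a minimal sketch and not part of Marqo: the helper name wait_for_server and its timeout values are assumptions, and it only checks that something responds to HTTP requests at the URL used in this article.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:8882", timeout_s=120.0, interval_s=2.0):
    """Poll `url` until it answers an HTTP request or `timeout_s` elapses.

    Returns True as soon as any HTTP response arrives, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status code.
            return True
        except (urllib.error.URLError, OSError):
            # Nothing is listening yet; wait and try again.
            time.sleep(interval_s)
    return False
```

Calling wait_for_server() in a second terminal or at the top of a script returns True once the container is reachable, and False if nothing responds within the timeout.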
While Docker is running, we can use Marqo as we would any other Python library. We’ll begin with a simple example where we create an index and perform searches on movie descriptions. If you have any issues with the following code, visit our Slack Community and post your issue in the ‘get-help’ channel, where we’ll be happy to help!
Let’s first install the Marqo Python client from a new terminal:
pip install marqo
Now we’re ready to write our first vector search system!
Open a new Python script and begin by importing Marqo:
import marqo
Next, we need to create a Marqo client that will communicate with the Marqo server. We'll specify the server URL, which in this case is running locally on http://localhost:8882.
# Create a Marqo client
mq = marqo.Client(url="http://localhost:8882")
This step sets up the client to interact with the Marqo API, allowing us to perform various operations such as creating indexes and adding documents.
Before we create a new index, it's good practice to delete any existing index with the same name to avoid conflicts. Here, we are deleting the "movies-index" if it already exists.
# Delete the index if it already exists
try:
    mq.index("movies-index").delete()
except Exception:
    # The index did not exist yet, so there is nothing to delete
    pass
This ensures that we start with a clean slate every time we run our script.
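Silently ignoring every error can hide real problems, such as the server not running. One possible variation, sketched below, wraps the same delete call in a helper that reports what happened. The name reset_index is hypothetical, and the two stand-in classes exist only so the snippet runs without a Marqo server.

```python
def reset_index(client, index_name):
    """Delete `index_name` if it exists.

    Returns True when the delete call succeeded and False when it raised,
    which in this walkthrough usually means the index was not there yet.
    """
    try:
        client.index(index_name).delete()
        return True
    except Exception as exc:  # broad on purpose: this is only a sketch
        print(f"Skipping delete of {index_name!r}: {type(exc).__name__}")
        return False


# Stand-in objects (not part of Marqo) so the sketch runs without a server.
class _FakeIndex:
    def __init__(self, exists):
        self.exists = exists

    def delete(self):
        if not self.exists:
            raise LookupError("index not found")


class _FakeClient:
    def __init__(self, existing):
        self.existing = set(existing)

    def index(self, name):
        return _FakeIndex(name in self.existing)


fake = _FakeClient(existing={"movies-index"})
deleted_existing = reset_index(fake, "movies-index")  # index is present
deleted_missing = reset_index(fake, "other-index")    # index is absent
```

With the real client from this article, the call would be reset_index(mq, "movies-index").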
Next, we create an index named "movies-index" using a specific machine learning model, hf/e5-base-v2. This model is designed to generate embeddings for various types of text inputs. It will be used for vectorizing the documents we add to the index.
# Create an index - Using this model: https://huggingface.co/intfloat/e5-base-v2
mq.create_index("movies-index", model="hf/e5-base-v2")
Creating an index is crucial as it prepares Marqo to store and manage the documents we'll be working with.
Now, we add some movie descriptions to our index. These descriptions will be vectorized and stored in the index, making them searchable. We specify a 'Title' and 'Description' for each movie.
# Add documents (movie descriptions) to the index
mq.index("movies-index").add_documents(
    [
        {
            "Title": "Inception",  # Title of the movie
            "Description": "A mind-bending thriller about dream invasion and manipulation.",  # Movie description
        },
        {
            "Title": "Shrek",
            "Description": "An ogre's peaceful life is disrupted by a horde of fairy tale characters who need his help.",
        },
        {
            "Title": "Interstellar",
            "Description": "A team of explorers travel through a wormhole in space to ensure humanity's survival.",
        },
        {
            "Title": "The Martian",
            "Description": "An astronaut becomes stranded on Mars and must find a way to survive.",
        },
    ],
    # Specifies which fields of the documents should be used to generate vectors. In this case, 'Description'.
    tensor_fields=["Description"],
)
In this step, we specify that the "Description" field of each document should be used for vector search by including it in the tensor_fields parameter.
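In practice, documents often come from a file rather than being typed out by hand. As a rough sketch, the snippet below turns CSV text into a list of dictionaries with the same 'Title' and 'Description' keys used above. The helper name rows_to_documents and the inline CSV sample are assumptions made for illustration.

```python
import csv
import io

# A small inline sample; in a real script this text would be read from a file.
CSV_TEXT = """Title,Description
Inception,A mind-bending thriller about dream invasion and manipulation.
The Martian,An astronaut becomes stranded on Mars and must find a way to survive.
"""


def rows_to_documents(csv_text):
    """Convert CSV text with Title/Description columns into document dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        title = (row.get("Title") or "").strip()
        description = (row.get("Description") or "").strip()
        # Skip incomplete rows so every document has both fields.
        if title and description:
            documents.append({"Title": title, "Description": description})
    return documents


documents = rows_to_documents(CSV_TEXT)
print(documents)
```

The resulting list has the same shape as the hand-written one, so it could be passed to add_documents with tensor_fields=["Description"] in the same way.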
With our index populated with movie descriptions, we can now perform a search query. Let's search for a movie related to space exploration.
# Perform a search query on the index
results = mq.index("movies-index").search(
    # Our query
    q="Which movie is about space exploration?"
)
This query searches the descriptions in our index for content related to space exploration.
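Conceptually, a vector search embeds the query and ranks documents by how similar their vectors are to it. The toy sketch below illustrates that idea with cosine similarity, a common choice of measure. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions, and the similarity measure Marqo actually uses depends on the index configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings": these numbers are illustrative, not model output.
toy_vectors = {
    "Inception": [0.2, 0.9, 0.1],
    "Shrek": [0.0, 0.2, 0.9],
    "Interstellar": [0.8, 0.2, 0.1],
    "The Martian": [0.7, 0.3, 0.2],
}
query_vector = [0.9, 0.1, 0.0]  # stands in for the embedded query

# Rank titles from most to least similar to the query vector.
ranked = sorted(
    toy_vectors,
    key=lambda title: cosine_similarity(query_vector, toy_vectors[title]),
    reverse=True,
)
print(ranked)
```

Documents whose vectors point in nearly the same direction as the query vector rank first, which is the intuition behind the relevance scores returned by a search.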
Finally, we print out the search results, including the title, description, and the relevance score for each movie that matches the query.
# Print the search results
for result in results['hits']:
    print(f"Title: {result['Title']}, Description: {result['Description']}. Score: {result['_score']}")
The relevance score (_score) indicates how well each document matches the search query.
Let’s look at the outputs:
Title: Interstellar, Description: A team of explorers travel through a wormhole in space to ensure humanity's survival.. Score: 0.8173517436600624
Title: The Martian, Description: An astronaut becomes stranded on Mars and must find a way to survive.. Score: 0.8081475581626953
Title: Inception, Description: A mind-bending thriller about dream invasion and manipulation.. Score: 0.7978701791216605
Title: Shrek, Description: An ogre's peaceful life is disrupted by a horde of fairy tale characters who need his help.. Score: 0.7619883916893311
Interstellar has the highest relevance score (0.817), indicating it is the most relevant to the query "Which movie is about space exploration?". The Martian follows closely with a score of 0.808, also highly relevant to the query. Inception and Shrek have lower scores (0.798 and 0.762, respectively), indicating they are less relevant to the space exploration theme. These scores help us understand how well each movie's description aligns with the search query, allowing us to identify the most pertinent results efficiently.
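Note that all four scores sit fairly close together, so a cut-off between relevant and irrelevant results has to be chosen with care. The sketch below applies a minimum score to the hits shown above. The helper name filter_hits and the threshold of 0.80 are assumptions for illustration, and a suitable threshold will vary with the model and the data.

```python
# Titles and scores copied from the output above.
hits = [
    {"Title": "Interstellar", "_score": 0.8173517436600624},
    {"Title": "The Martian", "_score": 0.8081475581626953},
    {"Title": "Inception", "_score": 0.7978701791216605},
    {"Title": "Shrek", "_score": 0.7619883916893311},
]


def filter_hits(hits, min_score):
    """Keep only the hits whose relevance score is at least `min_score`."""
    return [hit for hit in hits if hit["_score"] >= min_score]


relevant = filter_hits(hits, min_score=0.80)
relevant_titles = [hit["Title"] for hit in relevant]

# Difference between the best and the worst score in this result set.
score_gap = hits[0]["_score"] - hits[-1]["_score"]

print(relevant_titles)
print(round(score_gap, 3))
```

With this threshold only the two space-themed movies remain, even though the top and bottom scores differ by less than 0.06.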
Awesome! Now that we’ve seen how to get started with a simple search demo in Marqo, let’s look at searching over different types of data!
We’ll now walk through a practical example of using the Marqo library for multimodal indexing. We'll create an index that can handle both text and image data, add a document to the index, and perform a search.
As with the previous example, we'll import the Marqo library and create a Marqo client.
import marqo
# Create a Marqo client with the specified URL
mq = marqo.Client(url="http://localhost:8882")
As with our previous example, before we create the index, it's important to delete any index with the same name that may already exist.
# Delete the multimodal index if it already exists
try:
    mq.index("my-multimodal-index").delete()
except Exception:
    # The index did not exist yet, so there is nothing to delete
    pass
Next, we'll define the settings for our index. We'll enable image indexing and specify the model to use for indexing. In this case, we're using the open_clip/ViT-B-32/laion2b_s34b_b79k model. Note that if you do not configure multimodal search, image URLs will be treated as strings.
# Settings for the index creation, enabling image indexing and specifying the model to use.
settings = {
    "treat_urls_and_pointers_as_images": True,  # allows us to treat URLs as images and index them
    "model": "open_clip/ViT-B-32/laion2b_s34b_b79k",  # model used for indexing: https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K
}
# Create the index with the specified settings
response = mq.create_index("my-multimodal-index", **settings)
Now we'll add a document to our index. This document includes an image of a hippopotamus and a description. The image URL is treated as a tensor field.
# Add documents to the created index, including an image and its description
response = mq.index("my-multimodal-index").add_documents(
    [
        {
            "My_Image": "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b3/Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg/640px-Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg",
            "Description": "The hippopotamus, also called the common hippopotamus or river hippopotamus, is a large semiaquatic mammal native to sub-Saharan Africa",
            "_id": "hippo-facts",  # unique identifier for the document
        }
    ],
    tensor_fields=["My_Image"],  # specify that "My_Image" should be treated as a tensor field
)
Finally, we can perform a search on our index. We'll search for the term "animal" and print the results.
# Search the index for the term "animal"
results = mq.index("my-multimodal-index").search("animal")
# Print the search results
import pprint
pprint.pprint(results)
After running the search query for the term "animal," we received the following output:
{'hits': [{'Description': 'The hippopotamus, also called the common '
                          'hippopotamus or river hippopotamus, is a large '
                          'semiaquatic mammal native to sub-Saharan Africa',
           'My_Image': 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b3/Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg/640px-Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg',
           '_highlights': [{'My_Image': 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b3/Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg/640px-Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg'}],
           '_id': 'hippo-facts',
           '_score': 0.5586894792769398}],
 'limit': 10,
 'offset': 0,
 'processingTimeMs': 256,
 'query': 'animal'}
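Let's break down what each part of the output means. The hits list contains the matching documents, best match first. Each hit returns the stored fields (Description and My_Image) along with its _id, a _score measuring how closely it matches the query, and _highlights, which shows the part of the document that matched best (here, the My_Image tensor field). The remaining keys describe the request itself: limit is the maximum number of hits returned, offset is the number of leading results skipped, processingTimeMs is how long the search took in milliseconds, and query echoes the search text.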
The search output provides detailed information about the documents that match your search query, including their descriptions, image URLs, relevance scores, and more. By understanding this output, you can gain insights into how your data is being indexed and retrieved, allowing you to refine your search capabilities and improve the relevance of your results.
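When working with this output in code, it is often enough to pull out a few keys per hit. The sketch below condenses a response of the shape shown above into the document ID, a rounded score, and the names of the highlighted fields. The helper name summarize_hits is hypothetical, and the image URL is replaced with a short placeholder to keep the example readable.

```python
# A response with the same shape as the output above.
# The image URL is a shortened placeholder, not the real Wikimedia link.
results = {
    "hits": [
        {
            "Description": "The hippopotamus, also called the common hippopotamus "
            "or river hippopotamus, is a large semiaquatic mammal "
            "native to sub-Saharan Africa",
            "My_Image": "https://example.com/hippo.jpg",
            "_highlights": [{"My_Image": "https://example.com/hippo.jpg"}],
            "_id": "hippo-facts",
            "_score": 0.5586894792769398,
        }
    ],
    "limit": 10,
    "offset": 0,
    "processingTimeMs": 256,
    "query": "animal",
}


def summarize_hits(results):
    """Return one small dict per hit: id, rounded score, highlighted fields."""
    summary = []
    for hit in results["hits"]:
        highlights = hit.get("_highlights") or []
        # Tolerate a single dict as well as the list of dicts shown above.
        if isinstance(highlights, dict):
            highlights = [highlights]
        matched_fields = [field for highlight in highlights for field in highlight]
        summary.append(
            {
                "id": hit["_id"],
                "score": round(hit["_score"], 3),
                "matched_fields": matched_fields,
            }
        )
    return summary


summary = summarize_hits(results)
print(summary)
```

Here the only highlighted field is My_Image, which reflects that the image, not the description, was the tensor field the query was matched against.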
You can see the complete code for these examples on the Marqo GitHub repo.
Movie Search: Source Code
Multimodal Search: Source Code
In this article, we've walked through the steps of setting up a Marqo client, creating an index, adding documents, and performing a search query. This process allows us to efficiently search through content using vector search. Marqo makes it straightforward to implement powerful search capabilities in your applications.
If you want to see what else Marqo is capable of, visit our documentation.