In this blog, we’ll show you how to search through video and audio content effortlessly. We’ll guide you through setting up Marqo to search YouTube videos. You’ll learn how to create your own searchable video library, leverage Marqo’s seamless search capabilities, and get a user-friendly interface to explore your video content with ease. The video below is what you can expect to see. Let’s dive in!
First, install Marqo using pip. We will use this to build the video and audio search application.
pip install marqo
We will also be using Pandas to load our dataset, so install this library too:
pip install pandas
This will prepare everything you need to start working with the demo.
We need to create a Marqo index which we can add our documents to. The following code sets up your index with the appropriate settings for video and audio data.
from marqo import Client
# Initialize the Marqo client
api_key = "input_your_api_key" # replace this with your actual API Key. See this article: https://www.marqo.ai/blog/finding-my-marqo-api-key
mq = Client(url="https://api.marqo.ai", api_key=api_key)
# Define settings for the index
settings = {
    "type": "unstructured",  # Specifies the type of data to be indexed; supports multiple media types
    "vectorNumericType": "float",  # Defines the numeric type of vector embeddings for precision
    "model": "LanguageBind/Video_V1.5_FT_Audio_FT_Image",  # Specifies the embedding model for audio, video, and images
    "normalizeEmbeddings": True,  # Normalizes embeddings to ensure uniform vector magnitudes
    "treatUrlsAndPointersAsMedia": True,  # Treats URLs and pointers as media files for processing
    "treatUrlsAndPointersAsImages": True,  # Specifically treats URLs and pointers as images
    "audioPreprocessing": {  # Configuration for audio file preprocessing
        "splitLength": 10,  # Splits audio into 10-second segments for embedding
        "splitOverlap": 5,  # Adds 5-second overlap between audio segments for context
    },
    "videoPreprocessing": {  # Configuration for video file preprocessing
        "splitLength": 20,  # Splits video into 20-second segments for embedding
        "splitOverlap": 5,  # Adds 5-second overlap between video segments for context
    },
    "inferenceType": "marqo.GPU",  # Specifies GPU usage for faster inference
}
# Name of the index to be created
index_name = "youtube-search"
# Create a new index in Marqo with the specified name and settings
mq.create_index(index_name=index_name, settings_dict=settings)
The settings dictionary tailors the index to handle unstructured data, which is perfect for managing multimodal content like videos, images, and audio. Key settings include:

- model: Specifies the model used to generate embeddings. In this example, we use a model designed to handle video, audio, and image modalities.
- audioPreprocessing and videoPreprocessing: Split the media into overlapping segments to preserve context, ensuring more accurate embeddings.
- normalizeEmbeddings: Ensures that embeddings are on a consistent scale, enhancing search accuracy.
- inferenceType: Leverages GPU for faster processing, making it efficient for large datasets.

We also specify our index name as youtube-search, but feel free to change this to whatever you'd prefer. Finally, we create the index with this name and our defined settings.
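To build intuition for how splitLength and splitOverlap interact, here is a small illustrative sketch (not part of the Marqo API; segment_starts is a hypothetical helper) that computes where overlapping segments would begin for the audio settings above:

```python
def segment_starts(duration, split_length, split_overlap):
    """Compute start times (seconds) for overlapping segments covering `duration`."""
    step = split_length - split_overlap  # each new segment starts this many seconds later
    starts = []
    t = 0
    while t < duration:
        starts.append(t)
        t += step
    return starts

# A 30-second audio clip with 10-second segments and 5-second overlap:
print(segment_starts(30, 10, 5))  # [0, 5, 10, 15, 20, 25]
```

Each segment shares 5 seconds with its neighbour, so content that straddles a boundary still appears whole in at least one embedding.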
When you create your index, your terminal will display output similar to the following:
2024-12-23 11:12:32,551 logger:'marqo' INFO Current index status: IndexStatus.CREATING
2024-12-23 11:12:43,705 logger:'marqo' INFO Current index status: IndexStatus.CREATING
2024-12-23 11:12:54,834 logger:'marqo' INFO Current index status: IndexStatus.CREATING
Your index will also become available in Marqo Cloud. The index may take some time to create; you can track its progress under the 'Status' header in Marqo Cloud.
Once our index is created (its status will be 'ready'), we need to populate it with our data. For this example, we took all of our YouTube videos, chunked them into 20-second clips, and hosted them on a public AWS S3 bucket. This gives us video URLs that we can add to our index. We created a CSV file containing all of the links, called video_urls.csv, which you can find on our GitHub here. If you want to run this for your own YouTube data, please see our GitHub README for more information.
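If you are curious how the 20-second chunking might look in code, here is a rough sketch (the chunk_commands helper and the exact ffmpeg flags are illustrative assumptions; the actual pipeline we used is described in the GitHub README) that builds one ffmpeg command per clip, following the videoN_M.mp4 naming convention used later in this post:

```python
def chunk_commands(input_path, duration, chunk_length=20):
    """Build ffmpeg commands that cut a video into fixed-length clips.

    Illustrative only: assumes `duration` is the video length in seconds and
    that ffmpeg is available on the system.
    """
    commands = []
    for i, start in enumerate(range(0, duration, chunk_length)):
        out = f"video1_{i + 1}.mp4"  # matches the videoN_M.mp4 naming used later
        commands.append(
            f"ffmpeg -ss {start} -i {input_path} -t {chunk_length} -c copy {out}"
        )
    return commands

# A 60-second video yields three 20-second clips:
for cmd in chunk_commands("video1.mp4", 60):
    print(cmd)
```

You would then upload the resulting clips to a publicly readable bucket and record their URLs in video_urls.csv.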
import pandas as pd
# Define the path to the CSV file containing video URLs or metadata - replace this with the path of your video_urls.csv file. You can find this file at https://github.com/marqo-ai/youtube-search/blob/main/data/video_urls.csv
path_to_data = "data/video_urls.csv"
# Read the CSV file into a pandas DataFrame
df = pd.read_csv(path_to_data)
# Prepare the documents for indexing in Marqo
# Each document contains a single field named "video_field" populated with data from the "video_field" column in the CSV
documents = [
    {"video_field": video_field}
    for video_field in df["video_field"]
]
# Add the documents to the specified Marqo index
res = mq.index(index_name).add_documents(
    documents=documents,  # List of documents to be indexed
    client_batch_size=1,  # Processes one document at a time (adjust for larger datasets)
    tensor_fields=["video_field"],  # Fields to be converted to embeddings
    use_existing_tensors=False,  # Ensures new embeddings are created instead of using precomputed ones
)
# Print the response from Marqo to confirm successful indexing
print(res)
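Because we pass client_batch_size, the response is a list with one entry per batch, and individual documents can fail (for example, if a media URL is unreachable) while the rest succeed. Here is a minimal sketch for surfacing failures; the failed_items helper is our own, and it assumes each batch response carries an 'errors' flag and an 'items' list with HTTP-style 'status' codes, so check the shape of your own response before relying on it:

```python
def failed_items(responses):
    """Collect items that failed to index from a list of batch responses.

    Assumes each batch response is a dict with an 'errors' flag and an
    'items' list whose entries carry an HTTP-style 'status' code.
    """
    failures = []
    for batch in responses:
        if batch.get("errors"):
            failures.extend(
                item for item in batch.get("items", [])
                if item.get("status", 200) >= 400
            )
    return failures

# Example with a mocked response shape:
mock = [
    {"errors": False, "items": [{"status": 200}]},
    {"errors": True, "items": [{"status": 400, "error": "bad media URL"}]},
]
print(failed_items(mock))  # [{'status': 400, 'error': 'bad media URL'}]
```

Logging failures this way makes it easy to retry just the clips that did not make it in.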
While our documents are being added to our index, we can check the status of our index with the following code. Alternatively, you can click on your index in Marqo Cloud and it will give you this information.
res = mq.index(index_name).get_stats()
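If a script needs to wait until all documents have landed, you can poll get_stats in a loop. Below is a sketch; wait_for_documents is a hypothetical helper, and it assumes the stats payload exposes a numberOfDocuments field (check your own get_stats output). The fetcher is passed in as a callable, e.g. lambda: mq.index(index_name).get_stats():

```python
import time

def wait_for_documents(get_stats, target, poll_seconds=5, timeout=600):
    """Poll until the index reports at least `target` documents.

    `get_stats` is any callable returning a dict with 'numberOfDocuments',
    e.g. lambda: mq.index(index_name).get_stats().
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        count = get_stats().get("numberOfDocuments", 0)
        if count >= target:
            return count
        time.sleep(poll_seconds)
    raise TimeoutError(f"Index still below {target} documents after {timeout}s")

# Example with a stubbed stats call:
print(wait_for_documents(lambda: {"numberOfDocuments": 42}, target=10))  # 42
```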
You are able to perform searches across your index even while documents are being added. Let’s take a look at how you can do that now.
With the index set up and populated, you’re ready to search… and you can do it in just one line of code!
res = mq.index(index_name).search("fun fact")
print(res['hits'])
This will return the top hits related to your query, including the video URLs and other metadata. To return just the top hit, print res['hits'][0] instead. For the best results, make your queries specific to the content of your channel.
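The raw response can be verbose, so it helps to pull out just the fields you care about. A small sketch (summarize_hits is our own helper; it assumes each hit carries the video_field we indexed plus Marqo's _score):

```python
def summarize_hits(res, top_k=3):
    """Return (url, score) pairs for the top hits of a search response."""
    return [
        (hit.get("video_field"), hit.get("_score"))
        for hit in res.get("hits", [])[:top_k]
    ]

# Example with a mocked search response:
mock_res = {"hits": [{"video_field": "https://example.com/video1_3.mp4", "_score": 0.82}]}
print(summarize_hits(mock_res))  # [('https://example.com/video1_3.mp4', 0.82)]
```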
If you prefer a more interactive experience, Streamlit offers a simple and intuitive interface. Users can input a query, view search results, and jump to specific timestamps on YouTube.
Copy the following code into a Python script called app.py. You will need to pip install streamlit for this (re and csv are part of Python's standard library). Note, this code uses "data/youtube_ids.csv", which contains the YouTube video IDs. This allows us to map each search result back to the correct video and timestamp. If you want to run this tutorial for your own YouTube channel, see our GitHub README for information on how to create this CSV.
import streamlit as st
from marqo import Client
import re
import csv
# Set full-screen page layout
st.set_page_config(page_title="Marqo YouTube Video Search App", layout="wide")
api_key = "put_your_api_key_here"
# Initialize Marqo client
mq = Client(url="https://api.marqo.ai", api_key=api_key)
# Load VIDEO_ID_MAP from CSV
VIDEO_ID_MAP = {}
with open("data/youtube_ids.csv", "r") as file:
    reader = csv.DictReader(file)
    for row in reader:
        VIDEO_ID_MAP[row["video_name"]] = row["video_id"]
def get_youtube_link(url):
    """
    Given a chunked video URL, return the corresponding YouTube link with timestamp.
    """
    match = re.search(r"video(\d+)_(\d+)\.mp4", url)
    if not match:
        return "Invalid URL"
    video_num = match.group(1)  # Video number
    chunk_num = int(match.group(2))  # Chunk number
    # Map video number to YouTube video ID
    video_key = f"video{video_num}"
    youtube_id = VIDEO_ID_MAP.get(video_key)
    if not youtube_id:
        return "YouTube video ID not found"
    # Calculate the start time in seconds
    start_time = (chunk_num - 1) * 20  # Each chunk is 20 seconds
    # Construct the YouTube link
    youtube_url = f"https://www.youtube.com/watch?v={youtube_id}&t={start_time}s"
    return youtube_url

@st.cache_data
def search_videos(query):
    """Fetch top 6 video URLs."""
    try:
        index_name = "youtube-search"
        res = mq.index(index_name).search(query)
        video_urls = [result.get('video_field') for result in res.get('hits', [])[:6]]
        return video_urls
    except Exception as e:
        print(e)
        return []
# Streamlit app layout
st.title("Marqo YouTube Video Search")
st.text("Perform visual and audio searches over YouTube videos. This demo uses Marqo's YouTube channel to search for relevant information and will direct you to the corresponding timestamp in the YouTube video.")
st.text("Examples: \"What are embedding models?\", \"Marqo API Key?\", \"Demo Presentation?\"")
query = st.text_input("Input your query...", "fun fact")
if st.button("Search") and query:
    with st.spinner("Fetching videos..."):
        video_urls = search_videos(query)
    # Display videos in a 3x2 grid
    if video_urls:
        rows = 2
        cols = 3
        video_grid = st.columns(cols)  # Define columns for the grid layout
        for i, video_url in enumerate(video_urls):
            if video_url:
                youtube_link = get_youtube_link(video_url)  # Generate the YouTube link
                with video_grid[i % cols]:  # Place video in the correct column
                    st.video(video_url)  # Display video
                    if "youtube" in youtube_link:
                        st.markdown(f"[**Watch this on YouTube**]({youtube_link})", unsafe_allow_html=True)
            if (i + 1) % cols == 0:  # Create new columns for the next row
                video_grid = st.columns(cols)
    else:
        st.error("No videos found. Try a different query.")
You can launch this app by running the following:
streamlit run app.py
This will launch a UI similar to the following:
Here we type in a query such as 'embedding models' and it takes us to the video we did where we discuss what embedding models are. When we click on 'Watch on YouTube', it directs us straight to the YouTube video timestamp. Awesome!
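The jump-to-timestamp behaviour comes from the get_youtube_link helper in app.py. You can sanity-check the chunk-to-seconds mapping on its own; chunk_to_timestamp below is a hypothetical standalone version that mirrors the same regex and arithmetic:

```python
import re

def chunk_to_timestamp(url, chunk_seconds=20):
    """Map a chunked clip URL like 'video3_4.mp4' to its start time in seconds.

    Mirrors the regex and (chunk_num - 1) * 20 arithmetic used in app.py.
    """
    match = re.search(r"video(\d+)_(\d+)\.mp4", url)
    if not match:
        return None
    chunk_num = int(match.group(2))
    return (chunk_num - 1) * chunk_seconds

# The fourth 20-second chunk starts at 60 seconds:
print(chunk_to_timestamp("https://example.com/video3_4.mp4"))  # 60
```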
If you follow the steps in this guide, you will create an index with GPU inference and a basic storage shard. This index will cost $1.03 per hour. When you are done with the index you can delete it with the following code:
mq.delete_index(index_name)
You can find all scripts mentioned here, as well as further detailed instructions, on our GitHub. This also contains additional information, e.g. how to delete specific items from your index, check your index statistics, and list all documents in your index. These optional tools provide valuable insights into your index.
Have questions? Reach out to us: