Growth Metrics

April 15, 2024

What Is NDCG? The Metric That Measures Whether Your Ecommerce Search Actually Works

Ellie Sleightholm, Head of Developer Relations

Every ecommerce team has the same problem: shoppers search, results appear, but nobody knows whether those results are actually good. Revenue goes up or down, and the search team attributes it to seasonality, pricing, or inventory. Rarely does anyone ask the harder question: are we surfacing the right products in the right order?

Normalized Discounted Cumulative Gain, or NDCG, is the metric that answers that question. It does not just measure whether relevant products appear somewhere in the results. It measures whether the most relevant products appear first, where shoppers actually look.

For ecommerce teams serious about search quality, NDCG is the single most important ranking metric to understand. It is the difference between a search engine that technically returns relevant results and one that puts the best products at the top of the page, where they convert.

This post explains what NDCG is, how it works, and why it matters more for ecommerce search than any other retrieval metric. We will also show how Marqo uses NDCG to benchmark its Commerce Superintelligence against legacy search platforms, and what those benchmarks reveal about the state of product discovery.

The Problem NDCG Solves

Imagine a shopper searches for "black running shoes" on your site. Your search engine returns 20 results. Ten of them are genuinely relevant black running shoes. The other ten are black dress shoes, white running shoes, and shoe care kits.

A simple relevance metric might score this as 50% accurate. But that number hides something critical: where did those ten relevant products appear? If they all appear in positions 11 through 20, below the fold, most shoppers will never see them. The search technically "worked" but practically failed.

NDCG solves this by weighting position. A relevant product in position one contributes far more to the score than a relevant product in position fifteen. This matches how shoppers actually behave. Eye-tracking studies consistently show that the first three to five results receive the vast majority of attention. After that, engagement drops off steeply.

How NDCG Is Calculated

NDCG builds on two simpler concepts: Cumulative Gain and Discounted Cumulative Gain.

Cumulative Gain (CG) is the sum of relevance scores across all results. If you have a five-point relevance scale (0 = irrelevant, 4 = perfect match), you simply add up the scores. A result set of [4, 3, 0, 1, 2] has a cumulative gain of 10.

The problem with CG is that it ignores order. The result set [0, 1, 2, 3, 4] also scores 10, even though all the best products are at the bottom.

Discounted Cumulative Gain (DCG) fixes this by applying a logarithmic discount based on position. The formula is:

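DCG = Σ (relevance_i / log₂(i + 1)), summed over positions i = 1, 2, ..., k

Here i is the position of a result in the list (starting at 1), relevance_i is its graded relevance score, and k is the number of results evaluated.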

For position 1, the divisor is log₂(2) = 1, so the relevance score counts fully. For position 2, the divisor is log₂(3) ≈ 1.58. By position 10, the divisor is log₂(11) ≈ 3.46. This means a perfect-relevance product in position 10 contributes less than a third of what it would in position 1.

Normalized DCG (NDCG) takes the final step: it divides the actual DCG by the ideal DCG (IDCG), which is the DCG you would get if results were perfectly ordered by relevance. This normalization puts the score on a 0-to-1 scale, where 1.0 means perfect ranking.

NDCG = DCG / IDCG

A score of 0.95 means your search engine is ranking results very close to the ideal order. A score of 0.60 means significant ranking failures are pushing relevant products below irrelevant ones.
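The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `dcg` and `ndcg` are chosen here for the example, and the ideal ordering is built from the same list of scores, which assumes the list contains every judged product for the query.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), positions start at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the same scores sorted best-first (the IDCG)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

good_order = [4, 3, 0, 1, 2]  # best products near the top
bad_order = [0, 1, 2, 3, 4]   # same products, best at the bottom

print(sum(good_order), sum(bad_order))  # cumulative gain is 10 for both
print(round(ndcg(good_order), 3))       # about 0.969
print(round(ndcg(bad_order), 3))        # about 0.610
```

Both result sets have the same cumulative gain, but only the first one scores close to 1.0 once position is discounted.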

Why NDCG Matters for Ecommerce Specifically

Generic search metrics often treat all queries equally. But in ecommerce, different queries carry different commercial intent and different revenue potential. NDCG captures something that binary metrics miss: the gradient of relevance.

Consider the query "red dress for wedding." A bright red cocktail dress is somewhat relevant. A burgundy formal gown is more relevant. A red chiffon wedding guest dress in the shopper's size and price range is highly relevant. Binary relevance (relevant or not) collapses these distinctions. NDCG preserves them.

This matters because ecommerce conversion is not binary either. A shopper who sees a "somewhat relevant" result might browse but probably will not buy. A shopper who sees a "highly relevant" result in position one is far more likely to convert. NDCG captures this revenue-critical distinction by allowing graded relevance judgments and weighting them by position.

For most ecommerce search systems, NDCG at position 10 (NDCG@10) is the most informative cutoff. This roughly corresponds to the products visible on the first page of results. If your NDCG@10 is low, it means your best products are buried, and your revenue is suffering for it.

How Legacy Search Platforms Score on NDCG

Most ecommerce sites run on keyword-based search platforms that were built in the early 2010s or earlier. These systems match queries to products using text overlap: if the query words appear in the product title or description, the product is returned. Ranking is then determined by a combination of text match strength and manually configured boost rules.

This approach has a fundamental NDCG problem. Keyword systems do not understand products. They match strings. When a shopper searches for "summer office outfit," a keyword system has no concept of what makes an outfit appropriate for summer, for an office, or for both simultaneously. It returns products that contain those words, in whatever order its boost rules dictate.

The result is mediocre NDCG scores across the board. Relevant products appear, but they are scattered across the result set instead of concentrated at the top. The best product for the query might be in position 12 because its title says "lightweight linen blazer" instead of "summer office outfit."

Manual merchandising rules can improve NDCG for high-volume queries. A merchandiser can hand-curate the top results for "summer office outfit" and push the best products to the top. But this does not scale. Most ecommerce catalogs have hundreds of thousands of unique queries per month. No team can manually optimize all of them.

How Marqo Approaches NDCG

Marqo is an AI-native product discovery platform that takes a fundamentally different approach to ranking. Instead of matching keywords, Marqo understands every product in the catalog: what it looks like, what it pairs with, what it substitutes, and what drives margin. This deep product understanding translates directly into higher NDCG scores.

When a shopper searches for "summer office outfit," Marqo does not look for keyword matches. It understands the concept of summer-appropriate office wear and identifies products that fit, regardless of what words appear in their titles. A lightweight linen blazer ranks highly because the system understands it is a summer office staple, not because its metadata contains the right keywords.

Marqo's Commerce Superintelligence combines product intelligence with behavioral data to continuously improve ranking. When shoppers consistently click, add to cart, and purchase certain products for a given query, those signals refine future rankings. But unlike systems that rely solely on historical behavior, Marqo's product understanding ensures that new products with no behavioral data are ranked accurately from day one.

In benchmark testing against leading search platforms, Marqo demonstrated an 88% improvement in NDCG over Amazon Titan, one of the most widely used models for product search. This is not a marginal gain. An 88% improvement in NDCG means dramatically more relevant products appearing in the positions where shoppers actually look.

Real Revenue Impact

NDCG improvements are not abstract. They translate directly to revenue. When a leading fast fashion retailer deployed Marqo's Commerce Superintelligence, the results were measurable within weeks: $130M in incremental revenue attributed to improved product discovery. That revenue came from the same catalog, the same traffic, and the same shoppers. The difference was that the right products appeared in the right positions.

This is what NDCG measures in practice. It is the distance between "the product exists in our catalog" and "the product appears where the shopper will see it and buy it."

NDCG vs. Other Search Metrics

NDCG is not the only metric that matters, but it captures something the others miss.

Precision measures what fraction of returned results are relevant. A precision of 0.8 means 80% of results are relevant. But it says nothing about order. All relevant results could be at the bottom.

Recall measures what fraction of all relevant products appear in the results. High recall means nothing is missing. But it says nothing about ranking or result quality.

MRR (Mean Reciprocal Rank) measures how quickly the first relevant result appears. It is useful but limited. It only considers the first relevant result and ignores everything after it.

NDCG combines the best aspects: it cares about relevance, order, and the full result set. For ecommerce, where shoppers scan multiple results and the ranking of each product affects conversion probability, NDCG provides the most complete picture.
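To make the comparison concrete, the sketch below scores two hypothetical rankings of the same six products with all four metrics. The helper names, the example grades, and the rule that any grade above 0 counts as "relevant" for the binary metrics are assumptions made for this illustration.

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k results judged relevant (grade > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def recall_at_k(rels, k, total_relevant):
    """Fraction of all relevant products that appear in the top k."""
    return sum(1 for r in rels[:k] if r > 0) / total_relevant

def reciprocal_rank(rels):
    """1 / position of the first relevant result, or 0.0 if there is none."""
    for pos, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / pos
    return 0.0

def ndcg_at_k(rels, k):
    """NDCG with the log2(i + 1) discount, truncated at k."""
    def dcg(xs):
        return sum(r / math.log2(i + 1) for i, r in enumerate(xs, start=1))
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0

# Two hypothetical rankings of the same six products (grades 0 to 4).
ranking_a = [4, 3, 2, 1, 0, 0]  # best products first
ranking_b = [1, 0, 0, 2, 3, 4]  # marginal product first, best products last

for name, rels in (("A", ranking_a), ("B", ranking_b)):
    print(
        name,
        round(precision_at_k(rels, 6), 2),
        round(recall_at_k(rels, 6, total_relevant=4), 2),
        round(reciprocal_rank(rels), 2),
        round(ndcg_at_k(rels, 6), 2),
    )
```

In this example precision, recall, and reciprocal rank are identical for both rankings, while NDCG is 1.0 for ranking A and roughly 0.61 for ranking B. Only the graded, position-aware metric separates them.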

How to Measure NDCG for Your Search

Measuring NDCG requires relevance judgments: human assessments of how relevant each product is for each query. This is the most labor-intensive part.

Step 1: Select representative queries. Choose 200 to 500 queries that represent the range of search behavior on your site. Include head queries (high volume), torso queries (medium volume), and tail queries (low volume, specific).

Step 2: Collect results. For each query, capture the top 10 to 20 results from your current search engine.

Step 3: Judge relevance. Have human evaluators rate each result on a graded scale. A common scale is 0 (irrelevant), 1 (marginally relevant), 2 (relevant), 3 (highly relevant), 4 (perfect match). Use multiple judges and average their scores to reduce bias.

Step 4: Calculate NDCG. Apply the formula for each query, then average across all queries for your overall NDCG score.

Step 5: Segment and analyze. Break NDCG down by query type, category, and volume tier. You will likely find that NDCG varies significantly. Head queries with manual merchandising rules may score well. Tail queries with no manual curation often score poorly.
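The five steps above could be wired together roughly as follows. This is a sketch under stated assumptions: the `judgments` structure, the example queries and grades, and the tier labels are all hypothetical, and the ideal ranking is built only from the judged results that were returned, which will overstate NDCG when relevant products are missing from the results entirely.

```python
import math
from collections import defaultdict
from statistics import mean

def ndcg_at_k(rels, k=10):
    """NDCG with the log2(i + 1) discount, truncated at k."""
    def dcg(xs):
        return sum(r / math.log2(i + 1) for i, r in enumerate(xs, start=1))
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical data: query -> (volume tier, grades from two judges per result position).
judgments = {
    "black running shoes": ("head", [[4, 4], [3, 4], [2, 3], [0, 1]]),
    "linen blazer": ("torso", [[3, 3], [4, 4], [1, 2], [0, 0]]),
    "summer office outfit": ("tail", [[0, 1], [1, 1], [4, 3], [3, 3]]),
}

def score_queries(judgments, k=10):
    """Return per-query NDCG@k, the mean NDCG@k per tier, and the overall mean."""
    per_query = {}
    by_tier = defaultdict(list)
    for query, (tier, grades) in judgments.items():
        rels = [mean(g) for g in grades]  # Step 3: average the judges' scores
        score = ndcg_at_k(rels, k)        # Step 4: NDCG for this query
        per_query[query] = score
        by_tier[tier].append(score)       # Step 5: segment by volume tier
    tier_means = {tier: mean(scores) for tier, scores in by_tier.items()}
    overall = mean(list(per_query.values()))
    return per_query, tier_means, overall

per_query, tier_means, overall = score_queries(judgments)
print(per_query)
print(tier_means)
print(round(overall, 3))
```

With real data, the per-tier breakdown is usually the more useful output, since the overall average can hide weak tail-query performance.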

Most ecommerce search platforms score between 0.45 and 0.65 on NDCG@10 across a representative query set. If your score is below 0.50, your search is actively losing revenue. If it is above 0.70, you are performing better than most of the industry.

Why Marqo Benchmarks on NDCG

Marqo publishes NDCG benchmarks because it is the most honest measure of search quality. It is easy to cherry-pick metrics that make any system look good. High recall does not mean good search. High precision at position one does not mean the rest of the results are useful.

NDCG captures the full picture: are the right products in the right order across the entire result page? When Marqo's Commerce Superintelligence delivers an 88% improvement in NDCG, it means the entire result set is dramatically better, not just one product in one position.

Marqo operates as an AI-native product discovery platform built specifically for commerce. Every architectural decision, from how products are indexed to how queries are interpreted to how results are ranked, optimizes for the metrics that drive ecommerce revenue. NDCG is the primary one.

Sibbi and the Future of Search Quality Measurement

As product discovery evolves beyond traditional search boxes, the way we measure quality evolves too. Sibbi is the conversational interface of Marqo's Commerce Superintelligence, an autonomous agent that guides shoppers from discovery through post-purchase using deep product understanding. In a conversational context, NDCG still applies. When Sibbi recommends products in response to a natural language request, the relevance and ranking of those recommendations determine whether the shopper converts.

The difference is that conversational commerce creates opportunities to refine intent in real time. A traditional search box gives you one shot to interpret the query. Sibbi can ask clarifying questions, narrow preferences, and serve increasingly precise recommendations. This iterative refinement naturally pushes NDCG higher because each round of interaction produces a more accurate understanding of what the shopper wants.

Practical Takeaways

If you manage ecommerce search, here is what to do with this information:

1. Start measuring NDCG today. If you do not have relevance judgments for your top queries, that is the first gap to close. You cannot improve what you do not measure.

2. Benchmark against your category. An NDCG@10 of 0.55 might be average for your vertical, or it might be well below. Context matters.

3. Identify your NDCG gaps. The queries where NDCG is lowest are the queries where search is failing the most. These represent the largest revenue opportunities.

4. Evaluate whether your current platform can close those gaps. If NDCG is low on tail queries because your system relies on keyword matching, no amount of manual curation will fix it. You need a system that understands products, not just product metadata.

5. Look at what a modern approach delivers. Marqo's benchmarks show what is possible when product discovery is built on genuine product understanding rather than keyword matching. Results in 14 days, not months.

FAQ

What is a good NDCG score for ecommerce search? Most ecommerce sites score between 0.45 and 0.65 on NDCG@10. Scores above 0.70 are strong. Scores above 0.80 are exceptional and typically require AI-driven ranking that goes beyond keyword matching. Marqo customers consistently achieve scores in the 0.75 to 0.90 range.

How is NDCG different from click-through rate? Click-through rate measures what shoppers clicked, not what they should have clicked. CTR is influenced by product images, prices, and promotions, not just relevance. NDCG measures ranking quality independent of those factors. Both metrics are useful, but NDCG tells you whether your search is putting the right products in the right positions.

Can NDCG be gamed by showing fewer results? NDCG is evaluated at a fixed cutoff (such as @10 or @20), so reducing the number of results does not help, provided the ideal ranking is built from every product judged relevant for the query rather than only from the products returned. Under that normalization, you cannot score well simply by hiding bad results. You must rank good results at the top.

How often should I measure NDCG? Quarterly at minimum, monthly if your catalog changes frequently. Seasonal shifts, new product launches, and catalog expansion all affect NDCG. Continuous measurement catches degradation before it impacts revenue.

Does Marqo improve NDCG for all query types? Marqo's Commerce Superintelligence shows the largest NDCG improvements on natural language queries and tail queries, where keyword-based systems struggle most. For simple, high-volume queries that are already well-curated, improvements are still significant but the baseline is higher. The 88% improvement over Amazon Titan represents a blended score across all query types.

Ready to See Your NDCG Score Improve?

If your ecommerce search is ranking products based on keywords instead of genuine product understanding, your NDCG score reflects it, and so does your revenue. Marqo combines product intelligence with behavioral data to deliver the ranking quality that drives conversion.

Book a demo to see how Marqo's AI-native product discovery platform performs on your catalog, with your queries, measured by NDCG and the metrics that matter.

POST 2: what-is-mrr-in-machine-learning

Title: What Is MRR? Why First-Result Accuracy Determines Ecommerce Revenue

Meta description: MRR measures how often the best product appears first in search results. Learn why first-result accuracy drives ecommerce revenue and how to improve it.

What Is MRR? Why First-Result Accuracy Determines Ecommerce Revenue

There is a moment in every ecommerce search interaction that determines whether a shopper converts or leaves. It happens in the first one to two seconds after results load. The shopper's eyes land on the first result. If that product is exactly what they wanted, they click, they engage, and the path to purchase begins. If the first result is wrong, doubt sets in. The shopper starts scanning, scrolling, and mentally downgrading their confidence in the site.

Mean Reciprocal Rank, or MRR, is the metric that captures this moment. It measures how quickly the first relevant result appears in a ranked list. An MRR of 1.0 means a relevant product is in position one for every query. If the first relevant product always sat in position two, MRR would be 0.5; if it always sat in position three, MRR would be 0.33.

For ecommerce, where the first result carries disproportionate weight in purchase decisions, MRR is one of the most commercially relevant search metrics. This post explains how MRR works, why it matters for product discovery, and what Marqo's benchmark results reveal about the gap between legacy search and modern product understanding.

How MRR Is Calculated

MRR is elegantly simple. For each query, find the position of the first relevant result. Take the reciprocal of that position (1 divided by the position number). Average the reciprocals across all queries.

The formula:

MRR = (1/N) × Σ (1 / rank_i)

Where N is the number of queries and rank_i is the position of the first relevant result for query i.

Example:

Query A: first relevant result at position 1 → reciprocal rank = 1/1 = 1.0

Query B: first relevant result at position 3 → reciprocal rank = 1/3 = 0.33

Query C: first relevant result at position 2 → reciprocal rank = 1/2 = 0.50

MRR = (1.0 + 0.33 + 0.50) / 3 = 0.61

An MRR of 0.61 tells you that the typical first relevant product sits between position one and position two (1 / 0.61 ≈ 1.6). That sounds acceptable until you realize what it means in practice: for a significant portion of queries, the first result is wrong.
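The worked example above can be reproduced with a short Python sketch. The function name and the convention of passing `None` for a query with no relevant result are assumptions made for this illustration.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1 / rank over queries; a rank of None contributes 0."""
    reciprocal_ranks = [1.0 / rank if rank else 0.0 for rank in first_relevant_ranks]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Positions of the first relevant result for queries A, B, and C.
ranks = [1, 3, 2]
print(round(mean_reciprocal_rank(ranks), 2))  # 0.61
```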

Why the First Result Matters Disproportionately

Ecommerce search behavior follows a harsh power law. Research on search result interaction consistently shows that position one receives 30 to 40 percent of all clicks. Position two receives 15 to 20 percent. Position three receives 8 to 12 percent. By position five, click probability drops below 5 percent.

This means the first result is two to three times more likely to be clicked than the second result, and five to eight times more likely than anything below position three. If your best product is not in position one, you are not just slightly suboptimal. You are leaving the majority of potential engagement on the table.

MRR captures this reality more directly than any other metric. While NDCG evaluates the entire result set and recall checks for completeness, MRR focuses ruthlessly on the question that matters most to revenue: did we nail the first result?

The Ecommerce MRR Problem

Most ecommerce search platforms have an MRR problem they do not know about, because they do not measure it.

Consider how a typical keyword-based search platform handles the query "moisturizer for dry skin." The system identifies products containing those keywords. It ranks them using a combination of text match strength, popularity signals, and manually configured boost rules. The product that appears first is often the one with the strongest keyword match or the highest sales volume, not necessarily the one that best answers the query.

A high-sales-volume moisturizer that works for all skin types might rank first because it has "moisturizer" in the title and strong sales data. But the shopper specifically asked for dry skin. The ideal first result is a rich, hydrating formula specifically designed for dry skin, even if it sells fewer units overall.

Keyword systems cannot make this distinction because they do not understand products. They match text. "Moisturizer for dry skin" matches any product that contains those words, regardless of whether the product actually addresses dry skin concerns.

This is why MRR on most ecommerce sites hovers between 0.45 and 0.60. The first result is wrong often enough to measurably suppress conversion.

How Product Understanding Improves MRR

Improving MRR requires the search system to genuinely understand what a product is, not just what words describe it. When a system understands that a particular moisturizer contains hyaluronic acid and ceramides, has a rich cream texture, and is formulated for dehydrated and flaky skin types, it can match that product to "moisturizer for dry skin" with high confidence, even if those exact words do not appear in the product title.

This is where Marqo's approach diverges from legacy platforms. Marqo is an AI-native product discovery platform that understands every product in the catalog: what it looks like, what it pairs with, what it substitutes, and what drives margin. This product understanding is not a supplementary feature layered on top of keyword matching. It is the foundation of how Marqo indexes, retrieves, and ranks products.

When Marqo processes the query "moisturizer for dry skin," it identifies products based on their actual characteristics, ingredients, use cases, and suitability. The product that best matches the intent ranks first, not the product with the strongest keyword overlap or the highest historical sales.

Marqo's Commerce Superintelligence combines product intelligence with behavioral data to continuously refine first-result accuracy. Behavioral signals confirm or adjust the system's understanding of query intent. But crucially, the product intelligence layer ensures that even queries with no behavioral history produce accurate first results from the moment they appear.

Marqo's MRR Benchmark Results

In standardized benchmark testing across ecommerce product search tasks, Marqo demonstrated a 17.6% improvement in MRR over the best-performing proprietary model in the comparison set. This improvement represents the difference between the first result being wrong on roughly one in three queries versus being wrong on roughly one in five.

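At scale, this gap is significant. The 17.6% figure is a relative gain in MRR, not a share of queries, so the number of queries affected depends on the baseline. As an illustration, if MRR is read loosely as the share of queries with a correct first result, a site with 10 million monthly search queries and a baseline MRR of 0.55 would move to roughly 0.65, or on the order of one million additional queries per month with a better first result. If even a fraction of those improved first results convert to clicks and purchases, the revenue impact compounds rapidly.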

Mejuri, the fine jewelry brand, saw this play out directly. After deploying Marqo, Mejuri achieved a 19.8% increase in search revenue. Jewelry search is particularly sensitive to first-result accuracy because shoppers have specific aesthetic preferences. When the first result matches their taste, they buy. When it does not, they leave. Improving MRR for jewelry queries, where intent is highly visual and subjective, requires the kind of deep product understanding that keyword systems simply cannot provide.

MRR and the Zero-Result Problem

One of MRR's limitations is that it is often computed only over queries with at least one relevant result. But in ecommerce, a significant percentage of searches return no relevant results at all. Under that convention, these "zero-result" queries are invisible to MRR but devastating to revenue. Counting them as a reciprocal rank of 0 keeps them visible in the score.

Marqo addresses this by ensuring that product understanding extends to the full catalog, including new arrivals, long-tail items, and products with sparse metadata. Commerce Superintelligence does not depend on historical query-product pairs to rank effectively. It understands the products themselves, which means it can surface relevant results for novel queries that have never been searched before.

This capability directly improves MRR by reducing the number of queries where no relevant product appears in the top positions. When the system understands what products actually are, it can match them to queries it has never seen, putting the right product first even in unfamiliar territory.

MRR vs. Other Metrics: When to Use What

MRR is powerful but incomplete on its own. Understanding when to rely on MRR and when to complement it with other metrics is important.

Use MRR when: you want to know if your search nails the first result. MRR is the best single metric for "did we get it right immediately?" It is especially useful for navigational queries (where the shopper knows exactly what they want) and high-intent queries (where the shopper is ready to buy).

Complement with NDCG when: you care about the full result page. MRR ignores everything below the first relevant result. A search that puts a great product first but fills positions two through ten with garbage will score well on MRR but poorly on NDCG.

Complement with Recall when: you need to ensure coverage. MRR does not penalize missing products. If only one relevant product appears in the entire result set, MRR can still be 1.0 as long as that product is in position one.

Complement with Precision when: you want to minimize irrelevant results. MRR does not care how many irrelevant results appear below the first relevant one.

The strongest ecommerce search evaluation combines all four: MRR for first-result accuracy, NDCG for ranking quality, Recall for coverage, and Precision for relevance density. Marqo benchmarks across all four because optimizing one at the expense of others creates blind spots.

How to Measure MRR on Your Site

Measuring MRR is straightforward if you have relevance judgments.

Step 1: Sample queries. Select a representative set of 200 to 500 queries from your search logs. Include high-volume and low-volume queries. Exclude queries with obvious typos unless your search handles spelling correction (in which case, include them).

Step 2: Capture results. For each query, record the top 10 results from your current search engine.

Step 3: Judge relevance. For each query, identify which results are relevant. For MRR, binary relevance (relevant or not) is sufficient. You do not need the graded relevance scale that NDCG uses.

Step 4: Find the first relevant result. For each query, note the position of the first relevant product.

Step 5: Calculate. Take the reciprocal of each position, then average across all queries.
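The steps above can be sketched as follows, starting from binary relevance flags for each query's top results. The data structure, the example judgments, and the `skip_unanswered` flag are hypothetical. The flag exists because there are two conventions for queries with no relevant result: count them as 0, or leave them out of the average.

```python
def first_relevant_position(flags):
    """Position (starting at 1) of the first True flag, or None if there is none."""
    for position, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            return position
    return None

def mrr(judged_results, skip_unanswered=False):
    """Mean reciprocal rank over queries.

    judged_results maps each query to a list of booleans, one per result
    position. Queries with no relevant result contribute 0, unless
    skip_unanswered is True, in which case they are left out.
    """
    reciprocal_ranks = []
    for flags in judged_results.values():
        position = first_relevant_position(flags)
        if position is None:
            if not skip_unanswered:
                reciprocal_ranks.append(0.0)
        else:
            reciprocal_ranks.append(1.0 / position)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Hypothetical judgments for the top three results of three queries.
judged = {
    "nike air max 90 black": [True, True, False],
    "waterproof hiking boots under $150": [False, True, False],
    "something to wear to a beach wedding": [False, False, False],
}

print(mrr(judged))                        # 0.5
print(mrr(judged, skip_unanswered=True))  # 0.75
```

The gap between the two numbers shows how much the choice of convention matters. Reporting both makes it harder for queries with no relevant results to disappear from the score.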

Interpreting your score:

MRR above 0.80: Excellent. Your first result is almost always relevant.

MRR 0.60 to 0.80: Good but with clear room for improvement. Many queries have the best product in position two or three.

MRR below 0.60: Your search is failing on first-result accuracy for a large portion of queries. This is likely costing significant revenue.

MRR by Query Type

Not all queries are created equal, and MRR varies dramatically by query type.

Navigational queries ("Nike Air Max 90 black") typically have high MRR even on keyword systems because the product name appears directly in the query. If your MRR is low on navigational queries, something is fundamentally broken.

Attribute queries ("waterproof hiking boots under $150") have moderate MRR on keyword systems. The keywords partially match, but the system struggles to combine multiple attributes accurately. This is where product understanding starts to matter.

Intent queries ("something to wear to a beach wedding") have low MRR on keyword systems because there are no product-specific keywords to match. These queries require understanding of occasions, dress codes, and product suitability. This is where legacy platforms fail and where Marqo's approach delivers the largest MRR improvements.

Visual queries ("dress like the one Zendaya wore at the Met Gala") are almost impossible for keyword systems. MRR is near zero. Marqo's visual product understanding can interpret these queries because it understands what products look like, not just what their text descriptions say.

The Revenue Equation

The relationship between MRR and revenue is more direct than most metrics.

Consider a simplified model. Your site receives 5 million search queries per month. Your current MRR is 0.55, which this model treats as the share of queries where a relevant product appears at position one. (Strictly, MRR also gives partial credit for lower positions, so the true share is somewhat lower.) The click-through rate on position one is 35%, and the conversion rate from click to purchase is 8%.

At MRR 0.55: 5M × 0.55 × 0.35 × 0.08 = 77,000 conversions from first-result clicks.

Now improve MRR to 0.70 (a 27% relative improvement, chosen for illustration and larger than the 17.6% benchmark gain reported above): 5M × 0.70 × 0.35 × 0.08 = 98,000 conversions from first-result clicks.

That is 21,000 additional conversions per month from improving first-result accuracy alone, before accounting for improvements in positions two through ten. At an average order value of $80, that is $1.68M in additional monthly revenue.
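The arithmetic above is easy to rerun with your own numbers. The sketch below uses the same simplification as the text, treating MRR as the share of queries with a relevant first result; the function name and parameters are chosen for this example.

```python
def first_result_conversions(queries, mrr, ctr_position_one, click_to_purchase):
    """Simplified model: queries x MRR x position-one CTR x click-to-purchase rate."""
    return queries * mrr * ctr_position_one * click_to_purchase

baseline = first_result_conversions(5_000_000, 0.55, 0.35, 0.08)
improved = first_result_conversions(5_000_000, 0.70, 0.35, 0.08)
extra_conversions = improved - baseline
extra_revenue = extra_conversions * 80  # average order value of $80

print(round(baseline))           # 77000
print(round(improved))           # 98000
print(round(extra_conversions))  # 21000
print(round(extra_revenue))      # 1680000
```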

This is a simplified model, but it illustrates why MRR matters so directly to the bottom line. Kogan, the Australian ecommerce retailer, saw $10.1M in attributable value from improved product discovery after deploying Marqo. First-result accuracy was a significant contributor.

Sibbi and Conversational MRR

The concept of MRR extends naturally into conversational commerce. Sibbi is the conversational interface of Marqo's Commerce Superintelligence, an autonomous agent that guides shoppers from discovery through post-purchase using deep product understanding.

In a conversational context, every product recommendation is effectively a "first result." When a shopper tells Sibbi they need a gift for their mother who likes gardening, the first product Sibbi suggests carries the same weight as position one in traditional search. If that recommendation is right, trust builds and the conversation moves toward purchase. If it is wrong, the shopper disengages.

Marqo's deep product understanding ensures that Sibbi's recommendations are accurate from the first suggestion. The system understands every product in the catalog and can match it to nuanced, natural language intent without depending on keyword overlap.

Practical Steps to Improve MRR

1. Measure it first. You cannot improve what you do not track. Most ecommerce platforms do not report MRR natively. You will need to calculate it from search logs and relevance judgments.

2. Segment by query type. Your overall MRR may be acceptable, but specific query types (intent queries, attribute queries) may be severely underperforming. Target improvements where MRR is lowest.

3. Evaluate your ranking signals. If your ranking is primarily driven by keyword match strength and popularity, your MRR ceiling is limited. These signals cannot accurately rank products for nuanced or novel queries.

4. Consider whether your platform can improve. If your search platform does not understand products beyond their metadata, MRR improvements require manual curation for every query pattern. That does not scale.

5. See what genuine product understanding delivers. Marqo's 17.6% MRR improvement over the best proprietary model reflects what happens when ranking is driven by understanding rather than matching. Results in 14 days, not months.

FAQ

What is a good MRR score for ecommerce? An MRR above 0.75 is strong for ecommerce search. Most sites score between 0.45 and 0.65. The gap between "average" and "good" represents thousands of queries per day where the first result is wrong, each one a potential lost sale.

Is MRR the same as MAP (Mean Average Precision)? No. MRR only considers the first relevant result. MAP considers all relevant results and their positions. MRR is simpler and more directly tied to the "did we nail it immediately?" question. MAP provides a more complete view of ranking quality across the full result set.

Can MRR be 0? Yes. If no relevant result appears in the result set for a query, that query contributes 0 to MRR. If MRR is calculated only over queries with at least one relevant result (which is common practice), then the minimum is bounded by 1/K where K is the number of results evaluated.

How does Marqo achieve higher MRR than legacy platforms? Marqo's AI-native product discovery platform understands products at a level that keyword systems cannot. By understanding what a product is, not just what words describe it, Marqo can accurately identify the single best product for any query and place it in position one. This is the core reason for the 17.6% MRR improvement in benchmarks.

Does improving MRR hurt other metrics like recall? Not when done correctly. Improving MRR by genuinely understanding products better improves all metrics simultaneously. Marqo's Commerce Superintelligence improves MRR, NDCG, recall, and precision together because the underlying improvement, deeper product understanding, benefits all aspects of ranking.

See First-Result Accuracy in Action

If your ecommerce search puts the wrong product first on even 30% of queries, you are losing revenue on every one of those searches. Marqo combines product intelligence with behavioral data to ensure the best product appears where shoppers look first.

Book a demo to benchmark your current MRR against what Marqo delivers on your catalog, with results in 14 days, not months.

POST 3: what-is-recall-in-machine-learning

Title: What Is Recall? Why Missing Products in Search Results Costs You Revenue

Meta description: Recall measures how many relevant products your search actually surfaces. Learn why low recall means invisible inventory and lost ecommerce revenue.

What Is Recall? Why Missing Products in Search Results Costs You Revenue

Your catalog has 50,000 products. A shopper searches for "lightweight summer jacket." Your search returns 24 results. Twelve are relevant. But there are actually 45 lightweight summer jackets in your catalog. Your search just made 33 products invisible.

This is a recall problem. Recall measures the fraction of relevant products that actually appear in search results. A recall of 0.27 (12 out of 45) means your search engine is hiding 73% of the products that match what the shopper wants. Those products exist in your inventory. You paid to source them, photograph them, and list them. But your search engine does not know they are relevant, so shoppers never see them.

For ecommerce, low recall is invisible revenue loss. Unlike a broken checkout or a 404 error, nobody complains about products they never knew existed. The shopper sees 12 jackets, picks one or leaves, and never knows about the 33 others that might have been a better fit, a better price, or the exact color they wanted.

This post explains what recall is, how it works, why it is the most underappreciated search metric in ecommerce, and how Marqo's approach to product understanding solves the recall problem at its root.

How Recall Is Calculated

Recall is one of the simplest metrics in information retrieval:

Recall = (Relevant items retrieved) / (Total relevant items in the collection)

If your catalog contains 45 lightweight summer jackets and your search returns 12 of them, recall is 12/45 = 0.267, or 26.7%.

Recall is always measured relative to a specific query and a specific relevance definition. Different queries have different numbers of relevant products, and different definitions of "relevant" produce different recall scores for the same result set.

Recall at K (Recall@K) is a common variant that limits evaluation to the top K results. Recall@10 asks: of all relevant products, how many appear in the top 10 results? This is often more practical than total recall because shoppers rarely look beyond the first page.

For ecommerce, Recall@20 or Recall@50 are typical evaluation points, corresponding to one or two pages of search results on most sites.
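As a rough sketch of the formula above, the following Python reproduces the summer jacket example: 45 relevant jackets in the catalog, 24 results returned, 12 of them relevant. The function name, the product IDs, and the interleaved ordering of the results are assumptions made for illustration.

```python
from typing import List, Optional, Set


def recall_at_k(
    ranked_ids: List[str], relevant_ids: Set[str], k: Optional[int] = None
) -> float:
    """Fraction of all relevant products that appear in the top k results.

    With k=None the whole result list is evaluated (plain recall).
    """
    if not relevant_ids:
        raise ValueError("recall is undefined when no product is relevant")
    top = ranked_ids if k is None else ranked_ids[:k]
    hits = len(set(top) & relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical catalog facts from the running example.
relevant_jackets = {f"jacket_{i}" for i in range(45)}

# Hypothetical result list: 24 results alternating irrelevant and relevant.
results: List[str] = []
for i in range(12):
    results.append(f"other_{i}")
    results.append(f"jacket_{i}")

recall_all = recall_at_k(results, relevant_jackets)
recall_10 = recall_at_k(results, relevant_jackets, k=10)
print(f"Recall    = {recall_all:.3f}")
print(f"Recall@10 = {recall_10:.3f}")
```

One consequence worth keeping in mind when reading Recall@K numbers: K results can contain at most K relevant products, so a query with 45 relevant products cannot score above 20/45 (about 0.44) on Recall@20, however good the ranking is.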

Why Low Recall Happens in Ecommerce

Low recall in ecommerce search has a specific and well-understood cause: the search system cannot recognize relevance beyond keyword overlap.

A shopper searches for "lightweight summer jacket." A keyword system looks for products containing those words. It finds products with "lightweight" and "jacket" and "summer" in their titles or descriptions. It returns them.

But many relevant products do not contain those words:

A "packable windbreaker" is a lightweight summer jacket, but neither "lightweight" nor "summer" appears in its name.

A "linen blazer" is a lightweight summer jacket in many contexts, but the vocabulary is completely different.

A "UV protection layer" is functionally a lightweight summer jacket for outdoor use, but shares zero keywords with the query.

These products are relevant. They are in the catalog. The shopper would consider them. But the search engine cannot connect them to the query because it matches words, not concepts.
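The vocabulary gap can be made concrete with a deliberately naive keyword matcher. This is a toy sketch, assuming a matcher that requires every query word to appear in the product title; real keyword engines add stemming, field weighting, and synonym lists, so treat it as an illustration of the failure mode rather than a model of any specific platform. The catalog titles are hypothetical.

```python
from typing import List


def keyword_match(query: str, title: str) -> bool:
    """Return True only if every query word appears in the title."""
    query_terms = set(query.lower().split())
    title_terms = set(title.lower().split())
    return query_terms <= title_terms


# Hypothetical catalog: all four products are assumed relevant to the query.
catalog: List[str] = [
    "Lightweight Summer Jacket in Navy",
    "Packable Windbreaker",
    "Linen Blazer",
    "UV Protection Layer",
]

query = "lightweight summer jacket"
matched = [title for title in catalog if keyword_match(query, title)]
missed = [title for title in catalog if not keyword_match(query, title)]

print("Matched:", matched)
print("Missed: ", missed)
print(f"Recall = {len(matched) / len(catalog):.2f}")
```

Under these assumptions only the product that repeats the shopper's exact words is retrieved, and the three others never reach the results page.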

This vocabulary gap is the primary driver of low recall in ecommerce. It affects every product category but is especially severe in fashion (where style vocabulary is fluid and subjective), home goods (where function-based queries rarely match product names), and beauty (where ingredient, concern, and product-type vocabularies are largely separate).

The Revenue Impact of Missing Products

Low recall costs money in three distinct ways.

Lost direct sales. Every relevant product hidden from the shopper is a potential sale that never happens. If the ideal product for a shopper's query is in position 35 instead of position 5, most shoppers will never scroll to find it. They either buy a suboptimal product (lower satisfaction, higher return rate) or leave (zero revenue).

Reduced catalog efficiency. Ecommerce businesses invest heavily in assortment planning, product development, and inventory management. If 30 to 50% of relevant products are invisible to search, those investments are partially wasted. You are paying to stock products that shoppers cannot find.

Concentrated demand on a few products. When search consistently surfaces the same subset of products regardless of query variation, demand concentrates on those products. This leads to stockouts on popular items and stale inventory on invisible items. The problem looks like a demand planning issue, but it is actually a search recall issue.

SwimOutlet experienced this directly. After deploying Marqo, SwimOutlet saw a 10.6% increase in revenue per visit. Part of that improvement came from surfacing products that had been invisible to the old search system. When the full catalog participates in search results, demand distributes more naturally, conversion improves, and inventory moves faster.

How Marqo Solves the Recall Problem

Marqo is an AI-native product discovery platform that understands every product in the catalog: what it looks like, what it pairs with, what it substitutes, and what drives margin. This understanding eliminates the vocabulary gap that causes low recall.

When a shopper searches for "lightweight summer jacket," Marqo does not look for keyword matches. It identifies products that are functionally lightweight, seasonally appropriate for summer, and categorically jackets, regardless of what words appear in their titles. The packable windbreaker, the linen blazer, and the UV protection layer all surface because the system understands what they are.

This is what Marqo's Commerce Superintelligence delivers: comprehensive product understanding that connects any query to any relevant product in the catalog, even when the vocabulary is entirely different.

The system combines product intelligence with behavioral data to continuously refine relevance boundaries. If shoppers who search for "lightweight summer jacket" consistently engage with linen blazers, that behavioral signal strengthens the connection. But the critical difference from legacy systems is that Marqo's product understanding identifies the linen blazer as relevant from day one, before any behavioral data exists.

Zero-Shot Recall: New Products From Day One

One of the most commercially damaging recall failures happens with new products. A product arrives in the catalog, gets listed with basic metadata, and then sits invisible for weeks or months because the search system has no behavioral data to connect it to queries.

In keyword systems, a new product only appears in search results if its metadata contains the right keywords. If the copywriter uses different vocabulary than the shopper (which happens constantly), the product remains hidden until manual intervention.

In behavior-dependent systems, a new product cannot rank until enough shoppers have clicked on it to generate signals. This creates a cold-start problem: the product needs exposure to generate data, but it cannot get exposure without data.

Marqo solves this with zero-shot understanding. Because the system understands what a product is based on its attributes, images, and characteristics, it can determine relevance to any query from the moment the product enters the catalog. No keywords required. No behavioral history required. The product surfaces for relevant queries on day one.

For retailers with frequent new arrivals, fast fashion cycles, or seasonal inventory, this zero-shot recall capability directly translates to revenue. Products start generating revenue from the day they are listed, not weeks later when the system finally learns about them.

Recall vs. Precision: The Ecommerce Tradeoff

Recall and precision exist in natural tension. Increasing recall (showing more relevant products) often means showing more products overall, which can decrease precision (the fraction of shown products that are relevant). Showing fewer, more targeted results improves precision but risks missing relevant products.

In ecommerce, this tradeoff has a specific commercial dimension:

High recall, lower precision: Shoppers see more relevant products but also more irrelevant ones. This works well for browse-heavy categories (home decor, fashion) where shoppers enjoy exploring.

High precision, lower recall: Shoppers see fewer but more accurate results. This works well for high-intent, specific queries ("iPhone 15 Pro Max 256GB blue") where the shopper knows exactly what they want.

The ideal search system adapts the recall-precision balance to the query. Broad exploratory queries should favor recall. Specific navigational queries should favor precision.
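The tension between recall and precision described above shows up as soon as both metrics are computed at several cutoffs for the same ranked list. The sketch below uses a made-up list of relevance flags for ten results and assumes five relevant products exist in the catalog, one of which is never retrieved.

```python
from typing import List, Tuple


def precision_recall_at_k(
    relevance_flags: List[bool], total_relevant: int, k: int
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for one ranked result list."""
    hits = sum(relevance_flags[:k])
    return hits / k, hits / total_relevant


# Hypothetical judgments for the top 10 results of one query (True = relevant).
flags = [True, True, False, True, False, False, True, False, False, False]
total_relevant_in_catalog = 5  # assumed; one relevant product is never shown

sweep = {
    k: precision_recall_at_k(flags, total_relevant_in_catalog, k)
    for k in (2, 5, 10)
}
for k, (precision, recall) in sweep.items():
    print(f"K={k:>2}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy data, widening the cutoff from 2 to 10 results doubles recall while precision falls by more than half, which is the tradeoff a query-aware system has to balance.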

Marqo's Commerce Superintelligence handles this naturally because it understands query intent, not just query words. A broad query like "summer dresses" triggers high-recall behavior, surfacing diverse options from across the catalog. A specific query like "Reformation Juliette dress green size 6" triggers high-precision behavior, narrowing to exact matches.

How to Measure Recall for Your Search

Measuring recall requires knowing the total number of relevant products for each query, which is harder than it sounds.

Step 1: Select evaluation queries. Choose 100 to 300 queries that represent the range of search behavior on your site.

Step 2: Define relevance. For each query, determine which products in your catalog are relevant. This is the hard part. For small catalogs, manual review is feasible. For large catalogs, you may need to use a combination of category filtering, attribute matching, and human judgment on samples.

Step 3: Capture results. Record the top 20 to 50 results from your current search engine for each query.

Step 4: Calculate. For each query, divide the number of relevant products in the results by the total number of relevant products in the catalog. Average across all queries.

Interpreting your score:

Recall@20 above 0.70: Strong. Your search surfaces most relevant products on the first page.

Recall@20 between 0.40 and 0.70: Moderate. A significant portion of your catalog is invisible for many queries.

Recall@20 below 0.40: Weak. Your search is missing the majority of relevant products. This represents major revenue leakage.

Step 5: Analyze by query type. You will almost certainly find that recall is high for navigational queries (where the product name is in the query) and low for conceptual or attribute-based queries (where the shopper describes what they want rather than naming it). The conceptual queries are where the revenue opportunity lives.
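Steps 4 and 5 above can be sketched together: compute Recall@K per query, then average within each query type. The record layout, the query type labels, and the tiny data set below are hypothetical; a real evaluation would load the 100 to 300 judged queries from step 1.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class JudgedQuery(NamedTuple):
    query_type: str          # e.g. "navigational" or "conceptual" (assumed labels)
    ranked_ids: List[str]    # results captured from the search engine
    relevant_ids: Set[str]   # all relevant products in the catalog


def mean_recall_by_type(evaluations: List[JudgedQuery], k: int) -> Dict[str, float]:
    """Average Recall@k separately for each query type."""
    per_type: Dict[str, List[float]] = defaultdict(list)
    for item in evaluations:
        top_k = set(item.ranked_ids[:k])
        recall = len(top_k & item.relevant_ids) / len(item.relevant_ids)
        per_type[item.query_type].append(recall)
    return {qtype: sum(scores) / len(scores) for qtype, scores in per_type.items()}


# Hypothetical judged queries.
evaluations = [
    JudgedQuery("navigational", ["a", "b"], {"a"}),
    JudgedQuery("navigational", ["c", "d"], {"c", "d"}),
    JudgedQuery("conceptual", ["e", "f"], {"e", "x", "y", "z"}),
    JudgedQuery("conceptual", ["g", "h"], {"p", "q"}),
]

recall_by_type = mean_recall_by_type(evaluations, k=20)
for qtype, score in recall_by_type.items():
    print(f"{qtype:<13} Recall@20 = {score:.3f}")
```

A single blended average would hide the gap between the two groups, which is why the per-type breakdown is usually the more useful output.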

Common Recall Failures in Ecommerce

Synonym blindness. "Couch" vs. "sofa." "Sneakers" vs. "trainers." "Swimsuit" vs. "bathing suit." Keyword systems miss these unless every synonym is manually added to every product. Manual synonym management does not scale across a 100,000-product catalog.

Attribute-concept gaps. "Something warm for skiing" requires understanding that down jackets, fleece layers, and thermal base layers are all relevant. No product title says "something warm for skiing."

Cross-category relevance. "Gift for a runner" spans shoes, apparel, accessories, nutrition, and technology. Keyword systems typically search within a single category at a time, missing relevant products in other categories.

Visual similarity. A shopper sees a product on Instagram and searches for something similar using descriptive language. The relevant products in the catalog may look similar but use completely different descriptive text.

Marqo eliminates all four failure modes because its product understanding operates at the concept level, not the keyword level. Products are indexed by what they are, not just what their descriptions say.

Recall in the Age of Conversational Commerce

Recall becomes even more critical as product discovery moves into conversational interfaces. Sibbi is the conversational interface of Marqo's Commerce Superintelligence, an autonomous agent that guides shoppers from discovery through post-purchase using deep product understanding.

In conversation, shoppers describe what they want in natural, often imprecise language: "I need something for my friend's housewarming, she just moved to a new apartment and loves cooking." The recall challenge here is immense. The relevant products span categories (kitchen tools, cookbooks, serving ware, specialty ingredients), and the vocabulary gap between the query and product metadata is enormous.

Marqo's deep product understanding ensures that Sibbi can surface relevant products from across the entire catalog for queries like these. High recall in conversational contexts means the shopper sees the full range of options the catalog offers, leading to better purchases and fewer missed opportunities.

Why Recall Is the Most Underappreciated Metric

Search teams tend to focus on what they can see: the results that appear. Precision problems are visible. Irrelevant results on the first page are obvious and get flagged. Ranking problems are visible. A great product in position eight instead of position one is noticeable.

But recall problems are invisible by definition. The products that do not appear in results are not seen by anyone. No shopper complains about a product they do not know exists. No merchandiser notices that 30 jackets are missing from search results because they never appear on any report.

This invisibility makes recall the most dangerous metric to neglect. You can have perfect precision (every shown result is relevant) and perfect ranking (the shown results are in the ideal order) and still lose enormous revenue because half the relevant catalog is missing from results.

Marqo makes recall visible. By understanding every product in the catalog and connecting it to every relevant query, Marqo surfaces the products that legacy systems hide. The revenue impact of this expanded recall is often the single largest contributor to overall search revenue improvement.

The Business Case for Better Recall

If your current Recall@20 is 0.45 and you improve it to 0.70, you have made 25 percentage points more of your relevant catalog visible for each query. Across a full catalog and a full month of queries, this means hundreds of thousands of additional product impressions for products that were previously invisible.

Not every additional impression converts. But even a small conversion rate on previously invisible products represents pure incremental revenue. These are sales that were not possible before because the products were not surfaced.

Marqo customers consistently report that improved recall is one of the most surprising and valuable outcomes. Products that had been in the catalog for months suddenly start receiving traffic and generating sales. Inventory that had been marked for clearance starts moving at full price because shoppers can finally find it.

This is what happens when an AI-native product discovery platform replaces keyword matching with genuine product understanding. The catalog comes alive.

FAQ

What is the difference between recall and coverage? Recall measures the fraction of relevant products that appear in results for a specific query. Coverage typically refers to the fraction of the total catalog that appears in results across all queries. Both matter: recall tells you about individual query quality, coverage tells you about overall catalog utilization.

Can recall be too high? In theory, you could achieve perfect recall by returning every product in the catalog for every query. But precision would be near zero. The goal is high recall with high precision: surfacing most relevant products without overwhelming the shopper with irrelevant ones. Marqo's Commerce Superintelligence achieves this balance through genuine product understanding.

How does recall relate to "no results" searches? "No results" is the most extreme recall failure: recall of zero. If your site has a high "no results" rate, recall is failing at the most basic level. Marqo virtually eliminates "no results" queries because product understanding can find relevant products even when vocabulary does not match.

Does Marqo improve recall for long-tail queries? Yes, dramatically. Long-tail queries are where keyword systems have the worst recall because the specific language used in these queries rarely appears in product metadata. Marqo understands products at a concept level, so it can connect long-tail queries to relevant products regardless of vocabulary overlap. Results in 14 days, not months.

How often should recall be evaluated? At minimum, evaluate recall quarterly. Catalog changes (new products, discontinued products) directly affect recall. Seasonal shifts change which products are relevant to common queries. Continuous evaluation catches recall degradation early.

Stop Hiding Products From Your Shoppers

If your search engine understands keywords but not products, a significant portion of your catalog is invisible to every search query. Marqo combines product intelligence with behavioral data to surface every relevant product for every query, from day one.

Book a demo to see how much of your catalog your current search is hiding, and what happens when Marqo's AI-native product discovery platform makes it visible.

POST 4: what-is-precision-in-machine-learning

Title: What Is Precision? Why Irrelevant Search Results Are Killing Your Conversion Rate

Meta description: Precision measures how many of your search results are actually relevant. Learn why irrelevant results destroy shopper trust and tank conversion rates.

What Is Precision? Why Irrelevant Search Results Are Killing Your Conversion Rate

A shopper searches for "gold hoop earrings" on your site. The first result is gold hoop earrings. Good. The second result is silver hoop earrings. The third is gold stud earrings. The fourth is a gold bracelet. By the fifth result, the shopper is looking at a leather handbag with gold hardware.

Five results shown. One is what the shopper wanted. That is a precision of 0.20, or 20%. Four out of five results are wasting the shopper's time, eroding their trust, and pushing them closer to the back button.

Precision measures the fraction of returned results that are actually relevant. It is the metric that answers the question: when we show a shopper products, how many of them are actually what they asked for?

For ecommerce, low precision is a conversion killer. Every irrelevant result on the page is a signal to the shopper that your site does not understand them. That signal accumulates fast. One irrelevant result is forgivable. Three in the top five, and the shopper starts questioning whether the right product even exists in your catalog. Five in the top ten, and they leave.

This post explains what precision is, how it works, why it matters critically for ecommerce conversion, and how Marqo's product understanding delivers the precision that keyword-based systems cannot.

How Precision Is Calculated

Precision is straightforward:

Precision = (Relevant results retrieved) / (Total results retrieved)

If your search returns 20 results and 14 are relevant, precision is 14/20 = 0.70, or 70%.

Precision at K (Precision@K) limits evaluation to the top K results. Precision@5 asks: of the top 5 results, how many are relevant? This is often more useful than total precision because shoppers focus on the top results.

For ecommerce, Precision@10 is the most commercially relevant cutoff. It roughly corresponds to the products visible without scrolling on a desktop results page. If Precision@10 is 0.60, four out of ten visible products are irrelevant. That is four products actively damaging the shopper's experience.
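A minimal sketch of Precision@K, applied to the five results from the "gold hoop earrings" example at the top of this post. The product IDs and function name are hypothetical, and the sketch assumes the common convention of dividing by K even when fewer than K results come back.

```python
from typing import List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top k result slots filled by relevant products."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for product_id in ranked_ids[:k] if product_id in relevant_ids)
    return hits / k  # assumption: empty slots count against precision


# Hypothetical IDs for the five results in the example.
ranked = [
    "gold_hoop_earrings",
    "silver_hoop_earrings",
    "gold_stud_earrings",
    "gold_bracelet",
    "leather_handbag",
]
relevant = {"gold_hoop_earrings"}

p_at_5 = precision_at_k(ranked, relevant, k=5)
print(f"Precision@5 = {p_at_5:.2f}")
```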

The Psychology of Irrelevant Results

Irrelevant search results do more damage than most ecommerce teams realize. The harm goes beyond the simple missed opportunity of not showing a relevant product.

Trust erosion. When a shopper searches for something specific and sees irrelevant results, they lose confidence that the site has what they want. Even if the right product exists and appears further down the page, the irrelevant results above it have already planted doubt.

Cognitive load. Every irrelevant result forces the shopper to evaluate and reject it. This takes mental effort. After rejecting several irrelevant results, the shopper experiences decision fatigue and is more likely to abandon the search entirely.

Perceived catalog quality. Irrelevant search results make the entire catalog seem lower quality. If searching for "gold hoop earrings" returns a leather handbag, the shopper wonders what else is wrong with this site. The problem is not the catalog. The problem is the search. But the shopper does not make that distinction.

Bounce acceleration. Each irrelevant result increases the probability of the shopper leaving. The effect is not linear. The first irrelevant result in the top five has a moderate impact. The third has a severe impact. By the time half the visible results are irrelevant, most shoppers have already decided to leave.

Why Keyword Search Has a Precision Problem

Keyword-based search platforms struggle with precision because they match words, not meaning.

Consider the query "gold hoop earrings." A keyword system searches for products containing "gold," "hoop," and "earrings." This seems precise enough. But the system also returns:

Products tagged with "gold-tone" (which may be brass or plated, not what the shopper means by "gold")

Products where "hoop" appears in "hoop and chain bracelet"

Products where "earrings" appears in a cross-sell field ("pairs well with our earrings collection") rather than the product type

Products where "gold" refers to a color name applied to a completely unrelated item

Each partial keyword match generates an irrelevant result. At scale, across thousands of queries, these false matches accumulate into a significant precision problem.

The conventional fix is manual tuning: adding negative keywords, adjusting field weights, creating synonym lists, writing boost rules. This works for high-volume queries where a merchandiser has time to review and optimize results. But it does not scale. Most ecommerce sites have tens of thousands of unique queries. Manual precision tuning for all of them is impossible.

How Product Understanding Fixes Precision

Precision improves when the search system actually understands what products are, not just what words describe them.

When Marqo processes a product listing for gold hoop earrings, it does not just index the words in the title. It understands that this is a piece of jewelry, specifically earrings, in the hoop style, made from or colored in gold. When a shopper searches for "gold hoop earrings," Marqo matches against this structured understanding, not against raw text.

This means:

A "gold-tone chain bracelet" does not match because it is a bracelet, not earrings.

A "silver hoop earring" does not match because it is silver, not gold (unless the shopper's query is ambiguous enough to include it).

A product where "gold" appears only in an unrelated metadata field does not match because the system understands the product's actual color and material.

Marqo is an AI-native product discovery platform that understands every product in the catalog: what it looks like, what it pairs with, what it substitutes, and what drives margin. This understanding is the foundation of precision. When the system knows what a product actually is, it can determine with high confidence whether it matches a query.

Marqo's Commerce Superintelligence combines product intelligence with behavioral data to continuously refine precision. If shoppers who search for "gold hoop earrings" consistently skip gold-plated options and engage only with solid gold, the system tightens its precision for that query pattern. But the product intelligence layer ensures high precision from the start, even for queries with no behavioral history.

Precision and Conversion: The Data

The relationship between search precision and conversion rate is well-documented in ecommerce analytics. Across the industry, each 10-percentage-point improvement in Precision@10 correlates with a 4 to 8% improvement in search-to-purchase conversion rate.

The mechanism is straightforward. Higher precision means more relevant products visible to the shopper. More relevant products visible means higher click-through rates on results. Higher click-through rates mean more product page views. More product page views on relevant products mean more add-to-cart actions. More add-to-cart actions mean more purchases.

KICKS CREW, the global sneaker marketplace, saw this play out after deploying Marqo. With a 17.7% improvement in search-driven conversions, the precision improvement was a significant factor. Sneaker search is particularly sensitive to precision because shoppers have extremely specific preferences: exact model, exact colorway, exact year. An Air Jordan 4 Retro in "Bred" is not the same product as an Air Jordan 4 Retro in "Military Black" to the shopper who wants one of them. High precision means showing the exact right sneakers, not just sneakers that share some keywords with the query.

Precision by Query Specificity

Precision challenges vary based on how specific the query is.

Highly specific queries ("Levi's 501 Original Fit 32x30 dark wash") should have near-perfect precision. If your search returns irrelevant results for a query this specific, the system has fundamental matching failures.

Moderately specific queries ("men's dark wash straight leg jeans size 32") are where precision starts to degrade on keyword systems. The system matches individual attributes but struggles to enforce all of them simultaneously. A pair of slim-fit jeans in dark wash and size 32 partially matches, creating an irrelevant result.

Broad queries ("jeans for men") have an interesting precision dynamic. Almost any men's jeans are relevant, so precision is naturally higher. But this is misleading. The ease of achieving precision on broad queries masks the failure on specific queries, which are typically higher-intent and more likely to convert.

Natural language queries ("jeans that look good with boots for a country wedding") present the hardest precision challenge. Keyword systems return products containing "jeans," "boots," or "wedding" in their metadata, most of which are irrelevant to the actual intent. These queries are where Marqo's product understanding delivers the most dramatic precision improvements.

The Precision-Recall Balance in Ecommerce

Precision and recall pull in opposite directions. Returning more results tends to improve recall (more relevant products appear) but hurt precision (more irrelevant products also appear). Returning fewer results tends to improve precision but hurt recall.

This tradeoff has real commercial implications in ecommerce:

When to favor precision: High-intent queries where the shopper knows what they want. Gift searches where the shopper is unfamiliar with the category and overwhelmed by irrelevant options. Mobile searches where screen space is limited and every result slot is valuable.

When to favor recall: Browse-oriented queries where the shopper wants to explore. Category-level queries where variety is valuable. Queries where the shopper's intent is broad and multiple product types could satisfy it.

Marqo's Commerce Superintelligence adapts this balance dynamically based on query understanding. It does not apply a one-size-fits-all precision-recall tradeoff. Specific queries get high-precision treatment. Broad queries get high-recall treatment. This adaptive behavior is possible because the system understands query intent, not just query keywords.

How to Measure Precision on Your Site

Measuring precision is relatively straightforward compared to recall, because you only need to evaluate the results that were returned, not the entire catalog.

Step 1: Sample queries. Select 200 to 500 queries from your search logs, representing the full range of specificity and volume.

Step 2: Capture results. Record the top 10 results for each query from your current search engine.

Step 3: Judge relevance. For each query-result pair, determine whether the result is relevant. Binary judgment (relevant or irrelevant) is sufficient for precision measurement. Use multiple judges for contested cases.

Step 4: Calculate Precision@10. For each query, divide relevant results by 10. Average across all queries.

Interpreting your score:

Precision@10 above 0.80: Strong. Most results are relevant. Shoppers see a clean, focused result page.

Precision@10 between 0.60 and 0.80: Moderate. Two to four irrelevant results per page. Noticeable to shoppers but not devastating.

Precision@10 below 0.60: Weak. More than four of every ten results are irrelevant, and below 0.50 your search is showing more irrelevant results than relevant ones. This is actively damaging conversion.

Step 5: Segment by query type. Calculate precision separately for navigational queries, attribute queries, and natural language queries. The differences will reveal exactly where your search is failing.
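Steps 3 and 4 above can be sketched as a majority vote over judges followed by an average of Precision@10. The data layout (one list of judge votes per result) and the tie-breaking rule are assumptions chosen for illustration, as are the two sample queries.

```python
from typing import Dict, List


def majority_relevant(votes: List[bool]) -> bool:
    """A result counts as relevant only if more than half the judges say so.

    Assumption: ties are treated as irrelevant.
    """
    return sum(votes) * 2 > len(votes)


def mean_precision_at_k(judged: Dict[str, List[List[bool]]], k: int = 10) -> float:
    """Average Precision@k across queries, using majority-vote labels."""
    scores = []
    for votes_per_result in judged.values():
        hits = sum(majority_relevant(votes) for votes in votes_per_result[:k])
        scores.append(hits / k)
    return sum(scores) / len(scores)


# Hypothetical judgments: for each query, ten results with three votes each.
judged_results = {
    "gold hoop earrings": [[True, True, True]] * 8 + [[True, False, False]] * 2,
    "blue waterproof hiking boots": [[True, True, False]] * 4
    + [[False, False, False]] * 6,
}

mean_p10 = mean_precision_at_k(judged_results, k=10)
print(f"Mean Precision@10 = {mean_p10:.2f}")
```

Grouping the per-query scores by query type before averaging, rather than taking one overall mean, gives the segmented view described in step 5.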

Common Precision Failures and Their Causes

Partial attribute matching. The query specifies multiple attributes ("blue waterproof hiking boots size 10"), but the search returns products matching only some attributes. A brown hiking boot in size 10 partially matches but is irrelevant to the shopper who specified blue.

Category leakage. Products from irrelevant categories appear because they share keywords with the query. "Apple" returns fruit alongside electronics. "Coach bag" returns coaching bags alongside the luxury brand. "Tank" returns fish tanks alongside tank tops.

Metadata pollution. Products contain keywords in non-primary fields (SEO tags, cross-sell descriptions, marketing copy) that cause false matches. A dress described as "perfect with our new boots" appears in results for "boots."

Popularity bias overriding relevance. The search system boosts high-selling products regardless of query relevance. A bestselling product appears in results for queries where it is not relevant, simply because its popularity score overwhelms the relevance signal.

Marqo eliminates all four failure modes through genuine product understanding. When the system knows what a product is, partial matches, category confusion, metadata noise, and popularity bias cannot override true relevance.

Precision in Conversational Commerce

As product discovery moves beyond the search box, precision becomes even more critical. Sibbi is the conversational interface of Marqo's Commerce Superintelligence, an autonomous agent that guides shoppers from discovery through post-purchase using deep product understanding.

In a conversation, every product recommendation must be precise. There is no results page where the shopper can scan past irrelevant options. When Sibbi recommends a product, it carries an implicit promise: "this is right for you." An irrelevant recommendation breaks that promise and damages trust immediately.

Conversational commerce raises the precision bar because the format is intimate and personal. A search results page showing three irrelevant products out of ten is tolerable. A conversational agent recommending one irrelevant product out of three feels like a failure.

Marqo's deep product understanding ensures Sibbi's recommendations maintain high precision even for complex, multi-attribute, natural language requests. The system understands what the shopper wants and what each product is, enabling precise matching without keyword dependency.

The Compounding Effect of Precision Over Time

Precision does not just affect individual queries. It affects shopper behavior over time.

A shopper who consistently sees precise, relevant results develops trust in the search function and uses it more frequently. More search usage means more opportunities for discovery and purchase. This creates a virtuous cycle: high precision leads to more search, which leads to more revenue, which justifies further investment in search quality.

Conversely, a shopper who repeatedly encounters irrelevant results stops using search and relies on manual navigation (category browsing, menu navigation). Manual navigation is slower, surfaces fewer products, and produces lower conversion rates. Low precision leads to less search usage, which leads to less revenue.

Kogan observed this behavioral shift after deploying Marqo. As search precision improved, search usage increased, and search-attributed revenue grew. The $10.1M in attributable value from Marqo reflects not just better results on existing searches but also increased search engagement driven by shopper trust in result quality.

Why Precision Requires Product Understanding, Not More Rules

The traditional approach to improving precision is writing more rules. More negative keywords. More field weight adjustments. More category constraints. More manual curation.

This approach hits a ceiling quickly. Every rule you add fixes precision for one query pattern and potentially breaks it for another. Excluding "boots" from "coach" queries fixes the category leakage for Coach brand searches but breaks results for actual boot queries mentioning coaching features. The rule graph becomes increasingly complex, fragile, and impossible to maintain.
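A toy example makes the conflict concrete. The catalog, the naive keyword matcher, and the negative-keyword rule below are all invented for illustration and do not represent how any particular search platform implements rules.

```python
def keyword_search(query, catalog):
    """Naive keyword match: keep titles sharing at least one word with the query."""
    terms = set(query.lower().split())
    return [title for title in catalog if terms & set(title.lower().split())]


def apply_rule(query, results):
    """Hypothetical rule: on 'coach' queries, drop anything mentioning boots."""
    if "coach" in query.lower().split():
        return [t for t in results if "boots" not in t.lower().split()]
    return results


def search_with_rule(query, catalog):
    return apply_rule(query, keyword_search(query, catalog))


catalog = [
    "Coach leather tote bag",
    "Football coach training boots",
    "Leather ankle boots",
]

# The rule cleans up the brand query...
print(search_with_rule("coach bag", catalog))
# ...but a shopper who actually wants boots now gets only a bag.
print(search_with_rule("coach boots", catalog))
```

Patching the second query would take another rule with its own exceptions, which is how a rule set grows fragile.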

Marqo takes the opposite approach. Instead of writing rules to handle exceptions, Marqo builds understanding that handles all queries. When the system understands that Coach is a fashion brand and coaching is an activity, it resolves the ambiguity without rules. When it understands that "blue waterproof hiking boots size 10" requires all four attributes simultaneously, it enforces that constraint without manual configuration.

This is why Marqo's AI-native product discovery platform delivers sustainably high precision. The precision does not degrade as the catalog grows or query patterns shift. Product understanding scales in a way that rule systems cannot.

Building a Precision-Focused Search Strategy

1. Measure Precision@10 today. If you do not know your current precision, start there. The number will likely surprise you. Most ecommerce sites overestimate their search precision because merchandising teams focus on high-volume queries that are manually curated.

2. Identify precision failure patterns. Is precision failing on specific query types? On certain categories? On mobile vs. desktop? Understanding the pattern reveals the root cause.

3. Quantify the conversion impact. Calculate how many irrelevant impressions your search generates per month. Estimate the conversion rate improvement if those impressions were replaced with relevant products. The revenue opportunity is typically larger than expected.

4. Evaluate whether rules can close the gap. If your precision failures stem from fundamental limitations in keyword matching, more rules will not solve the problem. You need a system that understands products.

5. See what product understanding delivers. Marqo's Commerce Superintelligence eliminates the root cause of precision failures by understanding products at a level that keyword systems cannot match. Results in 14 days, not months.
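The irrelevant-impression count in the "quantify the conversion impact" step is simple arithmetic: searches per month, times results shown per search, times the share of results that are irrelevant. The sketch below uses made-up traffic numbers and hypothetical function names, and it assumes ten results are viewed per search. Turning impressions into revenue requires your own conversion data, so that part is left out.

```python
def irrelevant_impressions(monthly_searches, precision_at_10, results_per_page=10):
    """Estimated irrelevant product impressions per month."""
    return round(monthly_searches * results_per_page * (1 - precision_at_10))


def recoverable_impressions(monthly_searches, current, target, results_per_page=10):
    """Impressions that would become relevant if precision rose to `target`."""
    return round(monthly_searches * results_per_page * (target - current))


# Illustrative inputs only: 500,000 searches a month at Precision@10 of 0.65.
searches = 500_000
wasted = irrelevant_impressions(searches, 0.65)
regained = recoverable_impressions(searches, current=0.65, target=0.85)
print(f"Irrelevant impressions per month: {wasted:,}")
print(f"Impressions recovered at 0.85 precision: {regained:,}")
```

Running the same calculation per query type shows which segments account for most of the wasted impressions.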

FAQ

What is a good Precision@10 for ecommerce search? Above 0.80 is strong. Between 0.60 and 0.80 is moderate and improvable. Below 0.60 means more than four in ten visible results are irrelevant, which is actively harming conversion. Most sites score between 0.55 and 0.75 when measured honestly across all query types, not just manually curated head queries.

Is precision more important than recall? Neither is more important in absolute terms. But for ecommerce conversion, precision has a more direct impact on short-term revenue. Irrelevant results actively push shoppers away. Missing results (a recall problem) represent missed opportunities but do not actively damage the experience. The ideal system, which Marqo delivers, achieves high scores on both simultaneously.

Can improving precision hurt recall? In naive implementations, yes. Aggressively filtering results to improve precision can hide relevant products. Marqo avoids this tradeoff because precision improvement comes from better understanding, not stricter filtering. The system does not remove results to improve precision. It ranks relevant results higher and irrelevant results lower.

How does Marqo handle ambiguous queries where precision is hard? Ambiguous queries like "apple" or "coach" are resolved through context and product understanding. Marqo uses signals from the shopping context, the site's catalog composition, and the broader query pattern to resolve ambiguity. On a fashion site, "coach" maps to the brand. On a sports site, it maps to coaching equipment. This contextual resolution happens automatically, without manual rules.

Does precision matter for browse and category pages? Absolutely. Category pages and filtered browse experiences are essentially pre-defined queries. If a shopper navigates to "Women's Dresses" and sees jumpsuits, tops, or skirts mixed in, that is a precision failure. Marqo's product understanding applies to all product surfacing, not just the search box.

Stop Showing Shoppers Products They Did Not Ask For

Every irrelevant result on your search page is an active conversion deterrent. Marqo combines product intelligence with behavioral data to deliver precise, relevant results for every query, from the simplest product name to the most complex natural language request.

Book a demo to see how Marqo's AI-native product discovery platform transforms your search precision, and your conversion rate, with results in 14 days, not months.
