Note: All examples included in this article are appropriately censored and pixelated for reader comfort.
Here at Marqo, we recently compiled a dataset for an E-commerce search demonstration. The dataset is entirely AI-generated and contains approximately 250,000 images paired with product titles, text descriptions, and aesthetic scores.
Because the dataset was AI-generated, we had no explicit knowledge of what the images contained. Although we guided the generation process to stay within the E-commerce domain, the specific contents of each image remained unknown, and the dataset was too large to inspect manually.
To make sure the dataset was appropriate before going public with our demo, we decided to use Marqo to score the data for NSFW (Not Safe for Work) images. Unsurprisingly, we found several unsuitable images, particularly within the 'stockings' category, along with various NSFW lingerie and underwear photos.
Marqo proved instrumental as a discovery tool for identifying and removing these images. Its multimodal search and query composition let us find and weed out unwanted content directly. Our exploration also unearthed a number of bizarre and uncanny images, which we decided to discard as well.
The CLIP models we utilise at Marqo display an impressive understanding of semantics, transcending the boundaries of conventional keyword search. Consequently, if you're seeking images of a specific style, you can conveniently request them in natural language.
For instance, the query "weird, AI generated, piercing" helps us unearth some examples of peculiar images.
Similarly, the query "AI Generated, fake, bizarre" presents us with this.
Our exploration revealed that the majority of NSFW content was concentrated under the stockings category, initially brought to light with the query "lingerie".
The query "lingerie, nude" yielded the following top nine results:
Detecting weird, AI-generated images is relatively straightforward. To refine our search, we can append 'deformed' to our existing query.
This query yields a more targeted set of results:
Marqo facilitates the design of nuanced queries by allowing weighted components within the query.
Implementing this with Marqo is straightforward: the query is passed as a dictionary mapping each query string to its weight.
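A minimal sketch of such a weighted query is shown below. The terms and weights are illustrative placeholders rather than the values used on this dataset, and `search_weighted` is a hypothetical helper. It assumes the index object comes from Marqo's Python client, for example `marqo.Client(url=...).index(...)`, whose `search` method accepts a dictionary of query strings and weights as `q`.

```python
from typing import Any, Dict, List

# Hypothetical terms and weights, chosen only to show the shape of the query.
# Positive weights pull results towards a term, negative weights push away.
NSFW_QUERY: Dict[str, float] = {
    "lingerie, nude": 1.0,
    "nsfw": 1.0,
    "dress": -0.5,
    "t-shirt": -0.5,
}


def search_weighted(index: Any, query: Dict[str, float], limit: int = 10) -> List[dict]:
    """Run a weighted query against an index and return its list of hits.

    `index` is any object exposing search(q=..., limit=...) that returns a
    dictionary with a "hits" list, as Marqo's Python client is assumed to do.
    """
    response = index.search(q=query, limit=limit)
    return response["hits"]
```

Because the helper only relies on the `search` method, it can be exercised against a stub index before being pointed at a real one.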
In this case, we employed a blend of intuition and experimentation to formulate our query. Being able to utilise natural language and weighted components makes for an intuitive design process. The query attempts to match NSFW images by combining the embeddings of each query item, according to their corresponding weights. We applied negative weights to some work-appropriate clothing items that might be misidentified as NSFW content.
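To make the weighting concrete, the sketch below combines toy two-dimensional vectors as a weighted sum and rescales the result to unit length. The vectors, the terms, and the `combine_embeddings` helper are made up for illustration, and the exact combination and normalisation that Marqo performs internally may differ.

```python
import math
from typing import Dict, List


def combine_embeddings(
    vectors: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Weighted sum of per-term vectors, rescaled to unit length."""
    dim = len(next(iter(vectors.values())))
    combined = [0.0] * dim
    for term, weight in weights.items():
        for i, value in enumerate(vectors[term]):
            combined[i] += weight * value
    norm = math.sqrt(sum(v * v for v in combined))
    if norm == 0.0:
        # All contributions cancelled out; nothing to rescale.
        return combined
    return [v / norm for v in combined]


# Toy vectors standing in for real CLIP embeddings (assumption for illustration).
toy_vectors = {"lingerie": [1.0, 0.0], "dress": [0.0, 1.0]}
toy_weights = {"lingerie": 1.0, "dress": -0.5}
query_vector = combine_embeddings(toy_vectors, toy_weights)
```

In this toy case the combined vector points towards "lingerie" and away from "dress", which is the effect the negative weights are meant to have on work-appropriate clothing.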
To further ensure the relevance of our search results, we employed an additional technique. When we manually inspected the top 10 results from the previous step's query, all of them were NSFW. To sharpen our search precision, we took the embeddings from these top 10 results and fed them back into the search - this introduced embeddings specifically representative of the data we aimed to eliminate.
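The feedback step can be sketched as follows. Both helper names are hypothetical. The `_tensor_facets` and `_embedding` fields, and the shape of the `context` argument, are assumptions about Marqo's Python client that should be checked against the documentation for the version in use.

```python
from typing import Any, Dict, List


def embeddings_from_documents(documents: List[Dict[str, Any]]) -> List[List[float]]:
    """Collect every embedding stored under each document's tensor facets.

    Assumes the documents were fetched with their facets exposed, so each one
    may carry a "_tensor_facets" list whose entries hold an "_embedding".
    """
    vectors: List[List[float]] = []
    for doc in documents:
        for facet in doc.get("_tensor_facets", []):
            if "_embedding" in facet:
                vectors.append(list(facet["_embedding"]))
    return vectors


def build_context(vectors: List[List[float]], weight: float = 1.0) -> Dict[str, list]:
    """Wrap vectors in the assumed context shape: a list of vector/weight pairs."""
    return {"tensor": [{"vector": v, "weight": weight} for v in vectors]}
```

The resulting dictionary would then be passed alongside the weighted query, for example `index.search(q=query, context=context, limit=...)`, so that the known-bad examples steer the search as well as the text terms.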
After experimenting with various result limits, we noticed that the NSFW images tailed off at a similarity score of roughly 0.79.
We then ran the search and deleted every image that scored above this threshold.
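A sketch of the thresholding step is shown below. It assumes each hit carries `_id` and `_score` fields and that the index object exposes `delete_documents(ids=...)`, as in Marqo's Python client. The helper names are hypothetical, and the 0.79 cut-off is the approximate value mentioned above, which is specific to this dataset and query.

```python
from typing import Any, Dict, List

SCORE_THRESHOLD = 0.79  # approximate cut-off observed for this dataset


def ids_above_threshold(
    hits: List[Dict[str, Any]], threshold: float = SCORE_THRESHOLD
) -> List[str]:
    """Return the IDs of hits whose similarity score exceeds the threshold."""
    return [hit["_id"] for hit in hits if hit["_score"] > threshold]


def delete_flagged(
    index: Any, hits: List[Dict[str, Any]], threshold: float = SCORE_THRESHOLD
) -> List[str]:
    """Delete every hit scoring above the threshold and return the deleted IDs."""
    ids = ids_above_threshold(hits, threshold)
    if ids:
        # Skip the call entirely when nothing crossed the threshold.
        index.delete_documents(ids=ids)
    return ids
```

Returning the deleted IDs makes it easy to log or spot-check what was removed before running the same routine on a larger result limit.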
The same process was applied to our bizarre AI-generated images.
We can clearly see that the density of strange images has been reduced. There are still some oddities; however, these should be among the strangest left in the entire dataset.
Through this process, we were able to remove around 1.5k images from our dataset. This effort was instrumental in ensuring the appropriateness and quality of the content in our E-commerce demo. Using Marqo's powerful multimodal search and its ability to craft nuanced queries, we've turned a potentially daunting task into an efficient and manageable process.
This demonstrates that Marqo is not only a powerful search engine but also a powerful data curation and mining tool. While this demonstration focused on images, the same principles can apply to any form of data that you might have.