
Seems like the search is based only on the transcript/dialogue - not an image embedding. Would be super cool to actually use some CLIP/embedding search on these for a more effective fuzzy lookup.




Agreed. If you search for Barney, say, none of the top ten results actually picture him; it's mostly other characters speaking to or about him. Even running the frames through a vision LLM to pull out a list of keywords would, I suspect, yield better results than the subtitles.
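Roughly what the keywording idea could look like, as a sketch using the OpenAI Python SDK; the model name, prompt, and frames/ folder are placeholder assumptions, not anything the site actually uses:

  # Rough sketch: ask a multimodal model for a short keyword list per frame,
  # then store those keywords alongside the subtitle text for search.
  # Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
  # model name, prompt, and file layout are placeholders.
  import base64
  from pathlib import Path

  from openai import OpenAI

  client = OpenAI()

  def keywords_for_frame(path: Path) -> str:
      b64 = base64.b64encode(path.read_bytes()).decode()
      resp = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[{
              "role": "user",
              "content": [
                  {"type": "text",
                   "text": "List the characters, objects, and setting in this frame "
                           "as a short comma-separated list of keywords."},
                  {"type": "image_url",
                   "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
              ],
          }],
      )
      return resp.choices[0].message.content

  for frame in sorted(Path("frames").glob("*.jpg")):
      print(frame, "->", keywords_for_frame(frame))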

Just curious, how would someone go about doing this?

You’d run every frame through CLIP's image encoder to get an embedding. CLIP isn't really an image generator run backwards; it maps images and text into the same vector space, so you embed the search query with its text encoder and rank frames by cosine similarity to the query. (Stable Diffusion uses a CLIP text encoder for its text-to-image conditioning, which is where that association comes from; it's been a while since I've done this.)

I’d guess famous characters like Bart and Marge show up plenty in CLIP's training data, so queries that name them should match pretty well without any fine-tuning.

Feel free to correct me on the details if anyone has this fresher in their mind, but I think that's roughly right.
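A minimal sketch of that flow, assuming a local folder of frame JPEGs and the openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers (the paths and the query are made up):

  # Minimal sketch: index frames with CLIP image embeddings, then rank them
  # against a text query by cosine similarity.
  from pathlib import Path

  import torch
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  # 1) Embed every frame once and normalize the vectors.
  #    (In practice you'd batch this and cache the embeddings.)
  frame_paths = sorted(Path("frames").glob("*.jpg"))
  with torch.no_grad():
      images = [Image.open(p).convert("RGB") for p in frame_paths]
      image_inputs = processor(images=images, return_tensors="pt")
      image_emb = model.get_image_features(**image_inputs)
      image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

  # 2) Embed the text query into the same space and rank frames by similarity.
  query = "Barney drinking a Duff at Moe's"
  with torch.no_grad():
      text_inputs = processor(text=[query], return_tensors="pt", padding=True)
      text_emb = model.get_text_features(**text_inputs)
      text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

  scores = (image_emb @ text_emb.T).squeeze(1)
  for idx in scores.argsort(descending=True)[:10]:
      print(f"{scores[idx].item():.3f}  {frame_paths[int(idx)]}")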



