
Hope this one day will be used for auto-tagging all video assets with time codes. The dream is being able to search for "running horse" and find a clip containing a running horse at 4m42s, somewhere in one of thousands of clips.


This is a solved problem already: check out https://getjumper.io, where you can do exactly this (search through hundreds of hours of footage) offline and locally.

Disclaimer: co-founder


It’s not difficult to hack this together with CLIP. I did it with about a tenth of my movie collection last week on a GTX 1080, though CLIP lacks temporal understanding, so you have to do the scene analysis yourself.


I'm guessing you're not storing a CLIP embedding for every single frame, but rather one every second or so? Also, are you using cosine similarity? How are you finding the nearest vector?


I split per scene using PySceneDetect and sampled frames from each. Distance is via cosine similarity; I fed the embeddings into Qdrant.
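
If anyone wants to replicate the query side, it's roughly this (an untested sketch; the collection name and payload fields are placeholders for whatever you stored at ingest time):

    import torch
    import open_clip
    from qdrant_client import QdrantClient

    # Load the same CLIP checkpoint used at ingest time (ViT-B-32 as an example).
    model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")

    def search_scenes(query: str, top_k: int = 5):
        tokens = tokenizer([query])
        with torch.no_grad():
            vec = model.encode_text(tokens)
        # Normalize so dot product equals cosine similarity.
        vec = vec / vec.norm(dim=-1, keepdim=True)

        client = QdrantClient(host="localhost", port=6333)
        return client.search(
            collection_name="movie_frames",  # placeholder collection name
            query_vector=vec[0].tolist(),
            limit=top_k,
        )

    for hit in search_scenes("a running horse"):
        print(hit.payload.get("file"), hit.payload.get("timestamp"), hit.score)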


Would you be willing to share more details of what you did?


Sure. I had a lot of help from Claude Opus 4.5, but it was roughly the following (see the sketch after this list):

- Using pyscenedetect to split each video on a per scene level

- Using the decord library https://github.com/dmlc/decord to pull frames from each scene at a particular sample rate (I don't have the exact rate handy, but it was 1-2 frames per scene)

- Aggregating frames in batches of around 256 to be normalized for CLIP embedding on the GPU (I had to rewrite the normalization process for this, because the default library does it on the CPU)

- Uploading the frame embeddings along with metadata (timestamp, etc.) into a vector DB, in my case Qdrant running locally, together with a screenshot of the frame itself for debugging
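
Condensed, the ingest side looked something like this (a simplified sketch, not my exact script; the collection name, vector size, and two-frames-per-scene rate are illustrative, and the batched GPU normalization is omitted):

    import torch
    import open_clip
    from PIL import Image
    from decord import VideoReader, cpu
    from scenedetect import detect, ContentDetector
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
    model = model.to(device).eval()

    def index_video(path: str, client: QdrantClient, collection: str = "movie_frames"):
        # 1. Scene boundaries: detect() returns (start, end) FrameTimecode pairs.
        scenes = detect(path, ContentDetector())
        vr = VideoReader(path, ctx=cpu(0))
        fps = vr.get_avg_fps()

        points = []
        for scene_id, (start, end) in enumerate(scenes):
            # 2. Sample two frames per scene: one at the start, one in the middle.
            first = start.get_frames()
            last = max(end.get_frames() - 1, first)
            frame_ids = sorted({first, (first + last) // 2})
            frames = vr.get_batch(frame_ids).asnumpy()  # (N, H, W, 3) uint8

            # 3. Embed with CLIP and L2-normalize so cosine distance works.
            batch = torch.stack([preprocess(Image.fromarray(f)) for f in frames]).to(device)
            with torch.no_grad():
                embs = model.encode_image(batch)
            embs = embs / embs.norm(dim=-1, keepdim=True)

            # 4. Queue points with enough metadata to seek back into the video.
            for fid, emb in zip(frame_ids, embs):
                points.append(PointStruct(
                    id=len(points),  # fine for one video; use UUIDs across a library
                    vector=emb.cpu().tolist(),
                    payload={"file": path, "scene": scene_id, "timestamp": fid / fps},
                ))

        client.upsert(collection_name=collection, points=points)

    client = QdrantClient(host="localhost", port=6333)
    client.recreate_collection(
        collection_name="movie_frames",
        vectors_config=VectorParams(size=512, distance=Distance.COSINE),  # 512 for ViT-B-32
    )
    index_video("movie.mp4", client)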

I'm bottlenecked by GPU compute, so I also started experimenting with Modal for the embedding work, but then vacation ended :) Might pick it up again in a few weeks. I'd like to have a temporal-aware and potentially enriched search, so that I can say "Seek to the scene in Oppenheimer where Rami Malek testifies" and get back a timestamped clip from the movie.


You can do that with Morphik already :)

We use an embedding model that processes videos and lets you perform RAG on them.


Would it allow me to query my library for every movie that contains the dance routine move1-move2-move3, in that order?


RAG as in the content is used to generate an answer, or RAG as in searching for a video?


Gemini already does this (and has for a while): https://ai.google.dev/gemini-api/docs/video-understanding
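
For reference, a minimal sketch with the google-genai Python SDK (model name and prompt are illustrative; uploaded videos must finish server-side processing before you can prompt against them):

    import time
    from google import genai

    client = genai.Client()  # reads GEMINI_API_KEY from the environment

    # Upload the video, then poll until server-side processing finishes.
    video = client.files.upload(file="clip.mp4")
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = client.files.get(name=video.name)

    response = client.models.generate_content(
        model="gemini-2.0-flash",  # illustrative model name
        contents=[video, "At what timestamp does a horse run across the frame?"],
    )
    print(response.text)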



