What I want is to be able to feed in a bunch of videos and generate an animatable 3D face (driven by talking) from that data. In theory you only need three images (front and both sides) for the geometry, but mapping pixels to motion (facial expressions) is the interesting part.
There wouldn't be any depth data, so depth would have to be inferred from shading/shadows.
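As a rough sketch of just the "pixels to motion" front end (not a full pipeline; the video path and parameters below are placeholders), something like MediaPipe Face Mesh can already turn a monocular video into per-frame landmarks with a model-inferred relative depth, which could then be used to fit or drive a rigged/parametric head:

```python
# Minimal sketch: extract per-frame face geometry from a monocular video with
# MediaPipe Face Mesh. This only covers the tracking front end; the landmarks
# (with a model-inferred relative z, since there's no real depth data) would
# then need to drive or fit an actual rigged 3D head.
# Assumes: pip install mediapipe opencv-python, and a local file talking.mp4.
import cv2
import mediapipe as mp

face_mesh = mp.solutions.face_mesh.FaceMesh(
    static_image_mode=False,   # track across frames instead of re-detecting
    max_num_faces=1,
    refine_landmarks=True,     # extra landmarks around eyes/irises/lips
    min_detection_confidence=0.5,
)

cap = cv2.VideoCapture("talking.mp4")  # hypothetical input video
frames_landmarks = []                  # per-frame list of (x, y, z) tuples

while cap.isOpened():
    ok, frame_bgr = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB input
    results = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if results.multi_face_landmarks:
        lm = results.multi_face_landmarks[0].landmark
        # x, y are normalized image coords; z is a relative depth estimated
        # by the model, not measured by any sensor
        frames_landmarks.append([(p.x, p.y, p.z) for p in lm])

cap.release()
face_mesh.close()
print(f"Extracted {len(frames_landmarks)} frames of face landmarks")
```

The harder part is the step after this: turning those landmark trajectories into expression parameters on a personalized 3D model, which is where the "mapping pixels to motion" problem actually lives.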
My case is not directly nefarious. For example: taking the content of a popular YouTuber who streamed in the early 2000s and making a model of them for personal use, like a 3D chatbot with that person's quirks.
Edit: when I say "nefarious" I mean you could use that tech to impersonate someone (e.g. for political reasons), but my case is more the creepy kind: cloning someone for personal use, e.g. Replika.
Tangent: the Hololive VTuber industry is interesting, since they build up these characters with a unique persona/theme and people then follow that specific model. They could turn themselves into an AI fairly easily, since the avatar is already a rigged 3D asset, but of course it would be boring compared to the real thing.