
You’re saying this with confidence as if there isn’t a large body of working image and video generation algorithms out there that can produce physically plausible images of objects transposed into circumstances that don’t exist in their training set. A coffee cup shot with a macro lens, for example.

Is it so hard to believe that such models have developed a sense for how light propagates through a scene, a sense for how physical objects change when viewed from different angles, a sense for how lens distortion interacts with light? For goodness’ sake, these same models have a sense of what Greg Rutkowski’s art style is - we are well beyond ‘they’re just remembering pixels from past coffee cups’



> it so hard to believe that such models have developed a sense for how light propagates

Well, it's not a matter of belief or otherwise. I'm a trained practitioner in statistics, AI, physics, and other areas, and you can show trivially that you cannot learn light physics from pixel distributions.

Pixel distributions aren't stationary, and are caused by a very, very large number of factors; likewise, the physics of light for any given situation is subject to a large number of causes, all of them entirely absent from the pixel distributions. This is a pretty trivial thing to show.

> have a sense of what Greg Rutkowski’s art style is

Well what these models show is that when you have PBs of image data and TBs of associated text data, you can relate words and images together usefully. In particular, you can use patterns of text tokens to sample from image distributions, and combine and vary these samples to produce novel images.

The patterns in text and images are caused by people speaking, taking photos, etc. Those patterns necessarily obtain in any generated output. As in, if you train an LLM/etc. on how to speak, using vast amounts of conversational data, it cannot do anything other than appear to speak: that is the only thing the data distribution makes possible.

Likewise here, the image generator has a compressed representation of PBs of pixel data which can be sampled from using text. So when you say, "Greg Rutkowski" you select for a highly structured image space, whose structure the original artists placed there.

The generative model itself is not imparting structure to the data; it isn't aware of style. It's sampling from structure that we placed there. When we did so, it was because we were, e.g., in the room and taking a photo, or imagining what it would be like to apply Pre-Raphaelite painting styles to 60s psychedelic colour palettes because we sensed that fashions of a century ago would now be regarded as cool.
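
To make the "sampling from structure we placed there" point concrete, here's a deliberately silly toy of my own (nothing like how a real diffusion model is built): the caption just indexes a region of a tiny pixel space that the training images defined, and "generation" is sampling around it.

    # Toy only: captions index regions of a tiny "image" space; all structure
    # comes from the (fake) training data, none from the sampler itself.
    import numpy as np

    rng = np.random.default_rng(1)
    # pretend dataset: 4-pixel "images", each tagged with one style word
    training_images = {
        "rutkowski": rng.normal([0.9, 0.8, 0.2, 0.1], 0.05, size=(200, 4)),
        "pastel":    rng.normal([0.4, 0.6, 0.7, 0.9], 0.05, size=(200, 4)),
    }
    style_mean = {word: imgs.mean(axis=0) for word, imgs in training_images.items()}

    def generate(prompt):
        # "sampling": pick the structured region the prompt names, add noise
        return style_mean[prompt] + rng.normal(0.0, 0.05, size=4)

    print(generate("rutkowski"))   # lands in the region the artists' images defined

The sampler can add noise and interpolate, but it never produces structure that the captions-plus-images didn't already contain.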


The point of shoving so much data at those models is to help them pick up on the "very very large number of factors".

There was a story I saw on HN a few times in the past, but which I can't find anymore, of someone training a simple, dumb neural net to predict a product (or a sum?) of two numbers, and discovering to their surprise that, under optimization pressure, the network eventually picked up the Fourier transform.

It doesn't seem out of the realm of possibility for a large model to pick up on light propagation physics and the basic 3D structure of our reality just from watching enough images. After all, the information is implicitly encoded there, and you can handwave a Bayesian argument that it should be extractable.
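
For anyone who wants to poke at that kind of result themselves, here is a minimal sketch in plain numpy. I'm assuming the task was modular addition with one-hot inputs (the setting where Fourier-like features are usually reported); it is not the experiment from the story, just the same flavour, and the hyperparameters are untuned.

    import numpy as np

    rng = np.random.default_rng(0)
    P = 23                                    # small modulus
    a, b = np.meshgrid(np.arange(P), np.arange(P))
    a, b = a.ravel(), b.ravel()
    y = (a + b) % P                           # targets: (a + b) mod P

    X = np.zeros((P * P, 2 * P))              # one-hot encode both operands
    X[np.arange(P * P), a] = 1.0
    X[np.arange(P * P), P + b] = 1.0
    Y = np.eye(P)[y]

    H = 128                                   # one hidden layer of ReLUs
    W1 = rng.normal(0.0, 0.5, (2 * P, H))
    W2 = rng.normal(0.0, 0.5, (H, P))

    def forward(X):
        h = np.maximum(X @ W1, 0.0)
        logits = h @ W2
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        return h, p / p.sum(axis=1, keepdims=True)

    lr = 0.5
    for step in range(10000):                 # plain full-batch gradient descent
        h, p = forward(X)
        g = (p - Y) / len(X)                  # softmax cross-entropy gradient
        gW2 = h.T @ g
        gW1 = X.T @ ((g @ W2.T) * (h > 0.0))
        W1 -= lr * gW1
        W2 -= lr * gW2

    _, p = forward(X)
    print("train accuracy:", (p.argmax(axis=1) == y).mean())
    # Inspecting the learned first-layer weights per residue is where the
    # sinusoidal (Fourier-like) structure is said to show up after training.
    # (Skeleton only; treat the settings above as untuned.)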


Genuine question: what does it mean to be a trained practitioner in statistics, AI, physics, and other areas?


My undergrad/grad work is in Physics; I presently consult on statistics and AI (and other areas); I may soon start a part-time PhD on how to explain AI models. I am presently, as I type, avoiding rewriting a system to explain AI models, because I dislike redoing things I've already done.

It's quite hard to see the full picture of how these statistical models work without experience across a hard science, stats, and AI itself. However, people with backgrounds in mathematical finance would also have enough context. But it's seemingly rare in the physics, CS, stats, AI, etc. fields alone.

I'd hope that most practitioners in applied statistics could separate properties of the data generating process from properties of its measures; but that hope is fading the more direct experience I have of the field of statistics. I had thought that, at least within the field, you wouldn't have the sort of pseudoscientific thinking that goes along with associative modelling. I think mathematical finance is probably the only area where you can reliably get an end-to-end picture of reality-to-stats models.


Humans have painted with wonky perspective and impossible shadows, because they didn't know better, for literally 50,000 years. And those humans were just as smart as we are. Just look at 13th century paintings. Does this prove that humans back then didn't understand what a coffee cup looks like when rotated? No. So what does this prove about midjourney? Nothing.


I appreciate that when you're not an expert in physics, statistics and so on, all you have to go on are these circumstantial arguments: "two things that seem similar to me are alike, therefore they are alike in the same way".

However, I am making no such argument. I am explaining that statistical models of pixel frequencies cannot model the causes of those frequencies. I am illustrating this point with an example, not proving it.

If you want more detail about the reason it cannot: when the back of a coffee cup looks like the front, you can generate the back. But you cannot generate the bottom (assuming the bottom doesn't occur in the dataset). Why? Because the pixel distributions for the bottom of a cup have zero information about the rest of it, and the model has no information about the bottom.

If you want a "proof" you'd need at least to be familiar with applied mathematics and the like:

Say the RGB value of each pixel, X, of photos of coffee cups obtains from a data generating process parameterized on distance from camera, lens focal length, angle to cup, lighting conditions, etc. Now produce a model of such causes; call it Environment(distance, angle, cup albedo, ...).

Then show that X ~ Environment | fixed parameters induces a frequency distribution over pixels, f1(next | previous) = P(Xi..n | Xj..n); and that any variation in a fixed parameter induces a completely different distribution, say f2, f3, f4, ... Now check the covariance for most pairs of fs, and show that any given f is almost zero-informative about any other f.

Having done this, compare with a non-statistical (e.g., video game) model of Environment where the parameters are varied, and show that all frames, say v, generated by the video game do have high covariance over the time of their sampling. The video game model covaries with most of f1..fn; the associative statistical model covaries only with f1, or a very small number of others.
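
A toy stand-in for that comparison (my own construction, not the parent's actual setup): a trivial Environment(distance, angle) renders a one-dimensional strip of pixels, and we check how correlated the resulting frames are as the "fixed" parameters move.

    import numpy as np

    def environment(distance, angle, n_pixels=64):
        # Stand-in for Environment(distance, angle, ...): a bright blob whose
        # width shrinks with distance and whose centre moves with viewing angle.
        x = np.linspace(-1.0, 1.0, n_pixels)
        width = 0.4 / distance
        centre = np.sin(angle)
        return np.exp(-((x - centre) / width) ** 2)     # pixel intensities

    base = environment(distance=1.0, angle=0.0)
    for d, a in [(1.0, 0.1), (1.2, 0.0), (3.0, 1.2), (4.0, -1.4)]:
        r = np.corrcoef(base, environment(d, a))[0, 1]
        print(f"distance={d:.1f}, angle={a:+.1f} -> corr with base frame: {r:+.2f}")
    # Nearby parameter settings keep the pixel statistics close to the base
    # frame; distant settings give little or even negative correlation, though
    # the underlying scene (one object, one light) never changed. A parametric
    # model of the environment tracks those changes directly instead of having
    # to re-learn each regime's pixel distribution.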

There's something very obvious about this if you understand how these statistical AI systems work: in cases where variations in the environment induce radically different distributions, the AI will fail; in cases where they are close enough, it will (appear to) succeed.

The marketability of generative AI comes from rigging the use cases to situations where we don't need to change the environment. I.e., you aren't exposed to the fact that when you generated a photo, you could not have got the same one "at a different distance".

If a video game were built this way it would be unplayable: every time you moved the camera, all the objects would randomly change their apparent orientation, distance, style, etc.


Humans have those exact same constraints. For the longest time we could only speculate what the dark side of the moon looked like, for instance.

Yes, LLMs are constrained in what output they can generate based on their training data. Just as we humans are constrained in the output we can generate. When we talk about things we don't understand we speak gibberish, just like LLMs.


>Humans have those exact same constraints. For the longest time we could only speculate what the dark side of the moon looked like, for instance.

That isn't the exact same constraint. We could speculate that the moon had a "dark side," because we understood what a moon was, and what a sphere was. LLMs cannot speculate about things outside of their existing data model, at all.

>When we talk about things we don't understand we speak gibberish, just like LLMs.

No we don't, wtf? We may create inaccurate models or theories, but we don't just chain together random strings of words the way LLMs do.


> Is it so hard to believe that such models have developed a sense for how light propagates through a scene...

This specifically is the thing I usually notice in AI images (outside of the hand trope).

I'm not GP, and at best a layman in the field, but it's not hard to believe it's possible to generate believable lighting given enough training data. If I'm not mistaken, though, it would be through sheer volume: learned pairings like "lighting/shadow here usually follows item here".

But it's extremely inefficient, and not how we reason. It's like learning the multiplication table without understanding math: just pairing an infinite number of properties with each other.

We, on the other hand, develop a grasp of where the lighting is (sun/lamp), surmise where shadows fall, and can muster any image in our mind using that model instead.
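
A tiny illustration of the multiplication-table point (a toy of my own, not anyone's actual model): a memorised table only answers pairs it has seen, while the rule behind it covers the whole space.

    # "Training data": every product of single-digit numbers, memorised.
    table = {(a, b): a * b for a in range(10) for b in range(10)}

    def from_table(a, b):
        return table.get((a, b))    # None for any pair outside the table

    def from_rule(a, b):
        return a * b                # the underlying regularity itself

    print(from_table(7, 8), from_rule(7, 8))       # 56 56    (seen pair)
    print(from_table(12, 13), from_rule(12, 13))   # None 156 (unseen pair)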



