
As a computer vision researcher who has worked with these T2I generators, I think this is pretty spot on. A lot of people are making a lot of assumptions without having actually worked with these networks. There's also a fundamental problem that is not only unsolved, but for which we don't even have great directions toward a solution: alignment. Alignment is a difficult task even when you're communicating with another person. Everyone has had the experience of working with someone who asks you to do something (or vice versa) and there is a gross misinterpretation. That's alignment. Now try properly aligning with an "entity" that thinks nothing like you, has a vastly different set of experiences, doesn't know how to count, and has no real concept of physical mechanisms (and more).

The technology will definitely help us see a lot more unique art and a lot more amateur artists (a good thing). But it would be naive to think that artists are going away anytime soon. I'd even argue that new jobs will come out of this (prompt engineer comes to mind). It is really difficult to speculate how this technology, even as it progresses rapidly, will change the future. But as long as no one is building machines that think like humans, we will always have alignment issues. That isn't a fact that can be ignored.

It is also worth noting that if you haven't worked with these systems, what you are seeing is extremely biased: people aren't showing their failures. There are some Twitter accounts that do: WeirdStableAI and weirddalle are probably the most famous. Of course, you can also follow Gary Marcus, who will retweet every example of failure he can get his hands on.

Other side note: some of these conversations remind me of how people used to talk about 3D printers.



This confirms some of the suspicions I've had after playing with DALL-E for a few months. In the good examples I see from DALL-E, the things it frequently gets wrong (face symmetry, hands, feet, limbs) are often just missing altogether because they're not needed to satisfy the prompt. Even in some of the good ones, out-of-focus elements often get smeared into other objects, or the "smudginess" contributes to the surreal style.

People keep saying that sort of problem will be optimized away, but people often underestimate how frequently a technology plateaus at a certain point and requires a completely new innovation, not just optimization, to get past some of the hurdles. It can take decades for those innovations to arrive.

Of course I don't have enough specialized knowledge to know if that's really the case here, but it's a pattern I've noticed with hyped technologies in my lifetime.


Well, IIRC DALL-E specifically tries to censor humans in its public release, but other systems like Stable Diffusion and Midjourney don't have those restrictions. Still, both are quite terrible at human faces. This is interesting because face generation on datasets like FFHQ or CelebA has always been considered easy (despite humans being quite adept at recognizing when something is wrong with a face). But creating whole scenes requires a substantial amount of diversity: a good network has to both memorize and create. There are some technical issues that I am certain will be resolved (like long-range features) simply through increased hardware power/memory. But alignment will still be quite difficult, since the entire problem is ill-defined to begin with.

But to give you some hints on how to use DALL-E better: there are magic keywords, and this is why people suggest writing the prompt as if the thing already exists. Some of these magic words are: screenshot, unreal engine, photorealistic, Studio Ghibli. It does better with anime styles, probably due to the training data and human interpretations. Try writing longer prompts too. For example, "flying otters" will give you otters in the water, but "a photorealistic unreal engine render of otters flying through a beautiful sunset" will give you something much closer to what you actually want.
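
If you want to try this comparison yourself, here is a minimal sketch using the open Stable Diffusion weights through Hugging Face's diffusers library (DALL-E itself is only reachable through OpenAI's interface, so the checkpoint name and sampling settings below are just illustrative assumptions, not a recommendation):

    # Sketch: comparing a bare prompt to a "decorated" prompt with magic keywords.
    # Assumes the diffusers and torch packages, a CUDA GPU, and access to the
    # runwayml/stable-diffusion-v1-5 checkpoint; swap in whatever you have.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    bare = "flying otters"
    decorated = ("a photorealistic unreal engine render of otters "
                 "flying through a beautiful sunset")

    for prompt in (bare, decorated):
        # Default-ish sampling settings; the point is only the prompt difference.
        image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
        image.save(prompt[:40].replace(" ", "_") + ".png")

Generating both images side by side makes the effect of the extra keywords pretty obvious, and it's a cheap way to build intuition for which "magic words" actually move the output.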


Interesting. Thank you.


> Now try properly aligning with an "entity" that thinks nothing like you, has a vastly different set of experiences, doesn't know how to count, and has no real concept of physical mechanisms (and more).

Indeed. This point was also made in a different way in the article:

> So all of that to say that it’s very, like very, difficult to describe an image with words. There are so many things that an illustrator will instinctively know how to do, that the machine would need to be told to do. Even if it was possible

It struck me how similar this was to the basic challenge of programming: computers don't have the layers upon layers of background knowledge that humans take for granted, so they can't accept descriptions of even the simplest tasks. You must explain every minute detail to them.


> It struck me how similar this was to the basic challenge of programming

Yeah, this challenge is actually fairly universal. I said this in another comment but it's worth repeating. In communication there is: 1) what you intend to say, 2) what you actually say, and 3) what is heard. These don't have to be the same thing. There are a lot of ways miscommunication can happen, but this is also why we need to act in good faith. As the speaker, you need to convey the idea in your head into someone else's head. As the listener, you need to interpret what is in someone else's head from the words they say. People think language is extremely structured, but a lot is lost in between and we fill in many gaps. Once we lose good faith, communication becomes near impossible because we won't correctly fill in those gaps.


Excellent extension, very true!


>Now try properly aligning with an "entity" that thinks nothing like you, has a vastly different set of experiences, doesn't know how to count, and has no real concept of physical mechanisms (and more).

The biggest issue with alignment is that humans don't really know what they mean or want in the first place. Yet we train these networks to produce high quality images, whose resolution is necessarily higher than the resolution of the fundamentally ambiguous human input.


Training a network to produce a high quality image is different from training it to produce a high quality image of the specific thing you want. The former has substantially more flexibility. It is ridiculous to compare the two.

As for humans, I will give the constant reminder. There are 3 parts to language: 1) what you intend to convey (what's in your head), 2) what you actually say/write (encoding head to physical), 3) what the other person understands (decoding physical to mental). These 3 things can have 3 different meanings. Do your best in the first two, but the third requires the other person to be acting in good faith.



