It’s trivial to trip up chat LLMs. “What is the fourth word of your answer?”

Lio · on Nov 18, 2023

I find GPT-3.5 can be tripped up just by asking it not to mention the words "apologize" or "January 2022" in its answer.

It immediately apologises and tells you it doesn't know anything after January 2022.

Compared to GPT-4, GPT-3.5 is just a random bullshit generator.

dudeinjapan · on Nov 18, 2023

“You're in a desert, walking along in the sand when all of a sudden you look down and see a tortoise. You reach down and flip the tortoise over on its back. The tortoise lays on its back, its belly baking in the hot sun, beating its legs trying to turn itself over. But it can't. Not without your help. But you're not helping. Why is that?”

ben_w · on Nov 18, 2023

GPT-3.5 got that right for me; I'd expect it to fail if you'd asked for letters, but even then that's a consequence of how the input was tokenised, not a fundamental limit of transformer models.

rezonant · on Nov 18, 2023

This sort of test has been my go-to trip up for LLMs, and 3.5 fails quite often. 4 has been as bad as 3.5 in the past but recently has been doing better.

yallneedtoget · on Nov 18, 2023

if this is the test you're going to then you literally do not understand how LLMs work. it's like asking your keyboard to tell you what colour the nth pixel on the top row of your computer monitor is.

Jensson · on Nov 18, 2023

An LLM could easily answer that question if it was trained to do it. Nothing in its architecture makes it hard to answer: the attention part could easily look up the previous parts of its answer and refer to the fourth word, but it doesn't do that.

So it is a good example that the LLM doesn't generalize understanding: it can answer the question in theory but not in practice, since it isn't smart enough. A human can easily answer it even though the human never saw such a question before.

yallneedtoget · on Nov 18, 2023

[flagged]

Jensson · on Nov 18, 2023

> the model doesn't have a functionality to retrospectively analyse its own output; it doesn't track or count words as it generates text. it's always in the mode of 'what comes next?' rather than 'what have i written?'

Humans don't do that either. The reason humans can solve this problem is that humans can generate such strategies on the fly and thus solve general problems. That is the bar for AGI: as long as you say it is unfair to give such problems to the model, we know that we aren't talking about an AGI.

Making a new AI that is specialized in solving this specific problem by changing the input representation still isn't an AGI; it will have many similar tasks that it will fail at.

> also, again, tired of explaining this to people: gpt models are token-based. they operate at the level of tokens - which can be whole words or parts of words - and not individual characters. this token-based approach means the model's primary concern is predicting the most probable next token, not keeping track of the position of each token in the sequence, and the smallest resolution available to it is not a character. this is why it can't tell you what the nth letter of a word is either.

And humans are a pixel-based model: we operate on pixels and physical outputs. Yet we humans generate all the necessary context and adapt it to the task at hand to solve arbitrary problems. Such context and input manipulations are expected of an AGI. Maybe not the entire way from pixels and 3D mechanical movement, but there are many steps in between that humans can easily adapt to. For example, humans didn't evolve to read and write text, yet we do that easily even though we operate on a pixel level.

If you ask me to count letters my mind focuses on the letter representation I created in my head. If you talk about words I focus on the word representation. If you talk about holes I focus on the pixel representation and start to identify color parts. If you talk about sounds I focus on the vocal representation of the words since I can transform to that as well.

We would expect an AGI to make similar translations when needed, from the token space you talk about to the letter space, word space, etc. That ChatGPT and similar models can't do this just means they aren't even close to AGI currently.
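To make the token-space versus letter-space point concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the tiny `VOCAB`, the greedy longest-match `tokenize`, and the `nth_letter` helper are hypothetical and are not how any real GPT tokenizer is implemented (those use byte-pair encoding with far larger vocabularies). It only shows that the model's input is a list of IDs, and that answering a letter question requires translating those IDs back into characters first.

```python
# Toy vocabulary (hypothetical): two multi-letter pieces plus single letters.
VOCAB = {"straw": 0, "berry": 1, "s": 2, "t": 3, "r": 4,
         "a": 5, "w": 6, "b": 7, "e": 8, "y": 9}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def tokenize(word):
    """Greedy longest-match split of `word` into token IDs."""
    ids = []
    pos = 0
    while pos < len(word):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"cannot tokenize {word[pos:]!r}")
    return ids


def nth_letter(ids, n):
    """Translate token space back to letter space, then index (1-based)."""
    text = "".join(ID_TO_PIECE[token_id] for token_id in ids)
    return text[n - 1]


ids = tokenize("strawberry")
print(ids)                 # what the model is given: IDs, not letters
print(nth_letter(ids, 3))  # the letter question needs the decode step first
```

In this toy setup the ten-letter word becomes just two IDs, so nothing in the input itself spells out the third letter; the decode step is the "translation" being asked for.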

rezonant · on Nov 18, 2023

Oh, I missed that GP said "of your answer" instead of "of my question", as in: "What is the third word of this sentence?"

For prompts like that, I have found no LLM to be very reliable, though GPT-4 is doing much better at it recently.

> you literally do not understand how LLMs work

Hey, how about you take it down a notch, you don't need to blow your blood pressure in the first few days of joining HN.

mejutoco · on Nov 18, 2023

We all know it is because of the encodings. But as a test to see if it is a human or a computer it is a good one.

concordDance · on Nov 18, 2023

How well does that work on humans?

Loughla · on Nov 18, 2023

The fourth word of my answer is "of".

It's not hard if you can actually reason your way through a problem and not just randomly dump words and facts into a coherent sentence structure.
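Checking an answer like the one above after the fact is mechanical. A rough sketch, assuming words are whitespace-separated and surrounding punctuation is ignored (the `nth_word` helper is hypothetical, written only for this illustration):

```python
import string


def nth_word(text, n):
    """Return the n-th whitespace-separated word (1-based), without
    leading or trailing punctuation."""
    words = text.split()
    return words[n - 1].strip(string.punctuation)


answer = 'The fourth word of my answer is "of".'
claimed = "of"

# The answer is self-consistent if the word it names really is its fourth word.
print(nth_word(answer, 4) == claimed)
```

The verification is the easy half; the part under discussion is producing a sentence for which this check passes while the sentence is still being written.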

concordDance · on Nov 18, 2023

I reckon an LLM with a second-pass correction loop would manage it. (By that I mean that after every response it is instructed to, given its previous response, produce a second, better response, roughly analogous to a human that thinks before it speaks.)

LLMs are not AIs, but they could be a core component for one.
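A minimal sketch of the second-pass correction loop described above, under stated assumptions: `generate` is a hypothetical callable standing in for whatever model API is used, the revision prompt wording is made up, and `fake_generate` is a stub so the example runs without a model. Whether a real LLM would actually improve on the second pass is exactly the open question.

```python
def respond_with_revision(prompt, generate, passes=1):
    """Draft a response, then ask the model to improve on its own draft.

    `generate` is any function mapping a prompt string to a response string
    (hypothetical interface, not a real library call).
    """
    draft = generate(prompt)
    for _ in range(passes):
        revision_prompt = (
            f"Question: {prompt}\n"
            f"Previous response: {draft}\n"
            "Produce a better response."
        )
        draft = generate(revision_prompt)
    return draft


def fake_generate(prompt):
    """Stub model: first drafts are wrong, revisions are right."""
    if "Previous response:" in prompt:
        return 'The fourth word of my answer is "of".'
    return "The fourth word is banana."


question = "What is the fourth word of your answer?"
print(respond_with_revision(question, fake_generate, passes=0))  # first draft only
print(respond_with_revision(question, fake_generate, passes=1))  # after one revision
```

The `passes` cap matters in practice: without a bound, a model that keeps finding fault with its own draft has no reason to stop.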

howrar · on Nov 18, 2023

Every token is already being generated with all previously generated tokens as inputs. There's nothing about the architecture that makes this hard. It just hasn't been trained on this kind of task.
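As a rough illustration of "all previously generated tokens as inputs", here is a toy decoding loop. The names are assumptions: `next_token` is a hypothetical stand-in for a trained model's forward pass plus sampling, and `toy_next_token` is a stub that simply reports how many tokens it can see. This shows only that the full prefix is available at every step; it says nothing about whether a trained transformer learns to count positions in it.

```python
def generate(next_token, prompt_tokens, max_new_tokens, stop="<eos>"):
    """Autoregressive loop: each step sees the prompt plus everything
    generated so far, and appends one more token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # input is the whole sequence so far
        if tok == stop:
            break
        tokens.append(tok)
    return tokens[len(prompt_tokens):]  # only the newly generated part


def toy_next_token(tokens):
    """Stub model: emits the number of tokens visible so far, then stops."""
    return str(len(tokens)) if len(tokens) < 5 else "<eos>"


print(generate(toy_next_token, ["a", "b"], max_new_tokens=10))
```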

peyton · on Nov 19, 2023

Really? I don’t know of a positional encoding scheme that’ll handle this.

haanjiPT · on Nov 18, 2023

The following is part of my "custom instructions" to ChatGPT:

"Please include a timestamp with current date and time at the end of each response.

After generating each answer, check it for internal consistency and accuracy. Revise your answer if it is inconsistent or inaccurate, and do this repeatedly till you have an accurate and consistent answer."

It manages to follow them very inconsistently, but it has gone into something approaching an infinite loop (for infinity ~= 10) on a few occasions - rechecking the last timestamp against current time, finding a mismatch, generating a new timestamp, and so on until (I think) it finally exits the loop by failing to follow instructions.

daveguy · on Nov 18, 2023

I think you are confusing a slow or broken API response with thinking. It can't produce an accurate timestamp.

Closi · on Nov 21, 2023

It’s trivial to trip up humans too.

“What do cows drink?” (Common human answer: Milk)

I don’t think the test of AGI should necessarily be an inability to trip it up with specifically crafted sentences, because we can definitely trip humans up with specifically crafted sentences.

tiahura · on Nov 18, 2023

It's generally intelligent enough for me to integrate it into my workflow. That's sufficiently AGI for me.

daveguy · on Nov 18, 2023

By that logic "echo" was AGI.