
It's not good at reasoning. I asked the following (the term substitutions are there to avoid parroting):

  Let's define the following terms:
  col means "and", pel means "or" and erf is equivalent to "col" if the first argument evaluates to "true", otherwise it's equivalent to "pel"

  can you evaluate the following statement?
  col(true, pel(false, true), erf(col(true, false), true))

And it said this:

  ...
  Next, we look at the expression "erf(col(true, false), true)". Since the first argument is true, erf is equivalent to col, which means we evaluate the entire expression as:

  col(true, false) = false

  Therefore,

  erf(col(true, false), true) = col(true, true) = true

which is obviously wrong because the first arg is false. The final answer happens to be the same. So I then asked

  what about col(true, pel(false, true), erf(col(true, false), false))

and this is its full answer:

  Sure, let's evaluate the expression "col(true, pel(false, true), erf(col(true, false), false))" step by step:

  First, we look at the expression "pel(false, true)". This is an "or" operation, so we evaluate it as follows:

  pel(false, true) = true

  Next, we look at the expression "erf(col(true, false), false)". Since the first argument is true, erf is equivalent to col, which means we evaluate the entire expression as:

  col(true, false) = false

  Therefore,

  erf(col(true, false), false) = pel(false, false) = false

  Finally, we look at the expression "col(true, pel(false, true), erf(col(true, false), false))". Using the results from steps 1 and 2, we can evaluate this as:

  col(true, true, false) = true

  Therefore, the overall value of the statement "col(true, pel(false, true), erf(col(true, false), false))" is true.

You can see it's mostly random nonsense masked as "reasoning". It's not even consistent across sentences. I assume a bigger model might do better, but I suspect that as soon as you increase the size of the example it will get it wrong again.
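For reference, the definitions in the prompt are small enough to check mechanically. Here is a minimal Python sketch, assuming that col and pel accept any number of arguments (the prompt calls col with three) and that erf takes exactly two. Those arity choices are my reading of the prompt, not something it spells out.

```python
def col(*args: bool) -> bool:
    """'and' over any number of arguments (variadic by assumption)."""
    return all(args)


def pel(*args: bool) -> bool:
    """'or' over any number of arguments (variadic by assumption)."""
    return any(args)


def erf(first: bool, second: bool) -> bool:
    """Behaves like col if the first argument is true, otherwise like pel."""
    return col(first, second) if first else pel(first, second)


# The two expressions from the prompts above.
first_expr = col(True, pel(False, True), erf(col(True, False), True))
second_expr = col(True, pel(False, True), erf(col(True, False), False))

print(first_expr)
print(second_expr)
```

Under these assumptions the first expression comes out true and the second comes out false, so the "true" in the second transcript is wrong in the final step as well as in the intermediate ones.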


To be fair, I'm a fairly intelligent human (or so I'd like to think!), and I'm having a lot of trouble following this. I'd have to sit down with pencil and paper and do the transformations on it.

Keep in mind that LLMs like GPT 4 do not have a scratchpad and can't back-track. Asking it a question is the rough equivalent of asking a human standing in the middle of the room a question. No paper, no whiteboard, nothing. Just a demand for a verbal answer then and there, from a standing start.

I've only met a couple of humans who would be able to solve a problem like this one under those circumstances.

Despite its impediments GPT 4 solved this correctly:

    First, let's break down the statement and evaluate the individual components.
    
        col(true, false) = true AND false = false
        pel(false, true) = false OR true = true
        erf(col(true, false), true) = erf(false, true)
    
    Since the first argument of erf evaluates to false, erf is equivalent to "pel":
    
        erf(false, true) = pel(false, true) = false OR true = true
    
    Now we can substitute the results back into the original statement:
    
    col(true, pel(false, true), erf(col(true, false), true)) = col(true, true, true)
    
    The col function acts as an AND operator, so:
    
    col(true, true, true) = true AND true AND true = true
    
    Therefore, the given statement evaluates to true.

Alternatively, to give an idea of the upcoming "Wolfram Alpha" plugin, you can instead ask it to convert the expression into the equivalent Mathematica program:

    col[x_, y_] := And[x, y]
    pel[x_, y_] := Or[x, y]
    erf[x_, y_] := If[x, col[True, y], pel[False, y]]
    col[True, pel[False, True], erf[col[True, False], True]]

Very nice! Unfortunately, it didn't notice that "col" is used with three arguments, not just two. This is a mistake many (most?) humans would also make.

No worries, you can ask it to correct that. It'll apologise and produce this:

    col[args__] := Apply[And, {args}]
    pel[args__] := Apply[Or, {args}]
    erf[x_, y_] := If[x, col[True, y], pel[False, y]]

    col[True, pel[False, True], erf[col[True, False], True]]

Which evaluates to:

    True

There you go. Now go find me a random human who can do any of this from a standing start, no preparation, no warning, no tools, nothing but their wits.

As a bonus, here's a related joke:

    Why did the logical functions col, pel, and erf form a band? 
    Because together, they could create the perfect harmony: 
    they knew when to "col"aborate, when to give each other 
    some space and "pel"ay, and their "erf"forts were always well-timed!


Here's another example to test whether it can actually think critically about a problem instead of just following whatever the instructions and the training data suggest. A human with basic math skills could do it.

> can you replace the first letter of "banana" with the next letter of the alphabet?

Yes, I can do that! However, I should note that after several iterations, the resulting word may no longer be considered a recognizable word in the English language. Here are the next 10 iterations:

"canana" becomes "danana"
"danana" becomes "eanana"
"eanana" becomes "fanana"
"fanana" becomes "ganana"
"ganana" becomes "hanana"
"hanana" becomes "ianana"
"ianana" becomes "janana"
"janana" becomes "kanana"
"kanana" becomes "lanana"
"lanana" becomes "manana"
And so on for a total of 173 iterations

> I want the final answer

Sure, after 173 iterations of replacing the first letter of "banana" with the next letter of the alphabet, the final word would be:

"zcyzcj"

As noted earlier, after several iterations, the resulting word may no longer be considered a recognizable word in the English language.
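The question as asked calls for a single substitution, so the expected answer is simply "canana". Here is a minimal Python sketch of that operation. The name `shift_first_letter` is a hypothetical helper, and wrapping from "z" back to "a" is my assumption, since the prompt doesn't say what should happen past "z".

```python
import string


def shift_first_letter(word: str, steps: int = 1) -> str:
    """Replace the first letter of `word` with the letter `steps` places later.

    Assumption: the alphabet wraps around, so "z" is followed by "a".
    Words that are empty or start with a non-ASCII-letter are returned unchanged.
    """
    if not word:
        return word
    alphabet = string.ascii_lowercase
    first = word[0]
    idx = alphabet.find(first.lower())
    if idx == -1:
        return word
    shifted = alphabet[(idx + steps) % len(alphabet)]
    if first.isupper():
        shifted = shifted.upper()
    # Only the first character changes; the rest of the word is kept as-is.
    return shifted + word[1:]


print(shift_first_letter("banana"))       # one substitution, as the prompt asked
print(shift_first_letter("banana", 173))  # the model's own "173 iterations"
```

Even taking the model's 173 iterations at face value, only the first letter ever changes, so any correct result still ends in "anana". With the wrap-around assumption it would be "sanana", nothing like "zcyzcj".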


It's just an example. A human could solve this with an arbitrary number of substitutions, but I'm sure GPT4 breaks at some point. I don't think it's fair to compare it with a human who has no paper and pencil... It's like asking the AI to not use any RAM. Humans are very bad at keeping stuff in RAM, so paper and pencil is all we've got.



