Benchmarking leading AI agents against Google reCAPTCHA v2

jameslk · 2025-11-10T22:36:35 1762814195

Is it assumed that humans perform 100% against this captcha? Because being one of those humans it’s been closer to 50% for me

I’m guessing Google is evaluating more than whether the answer was correct enough (ie does my browser and behavior look like a bot?), so that may be a factor

daveguy · 2025-11-10T20:04:22 1762805062

Wow. Cross-tile performance was 0-2%. That's the challenge where you select all of the tiles containing an item where the single item is in a subset of tiles. As opposed to all the tiles that contain the item type (static - 60% max) and the reload version (21% max). Seems to really highlight how far these things are from reasoning or human level intelligence. Although to be fair, the cross-tile is the one I perform worst on too (but more like 90+% rather than 2%).

RobertDeNiro · 2025-11-10T20:18:53 1762805933

I think the prompt is probably at fault here. You can use LLMs for object segmentation and they do fairly well, less than 1% seems too low.

mdahardy · 2025-11-10T21:26:17 1762809977

The cross-tile challenges were quite robust - every model struggled with them, and we tried with several iterations of the prompt. I'm sure you could improve with specialized systems, but the models out-of-the-box definitely struggle with segmentation

flakiness · 2025-11-10T17:57:59 1762797479

To be honest I'm surprised how well it holds. I expected close-to-total collapse. It'll be a matter of time I guess, but still.

criddell · 2025-11-10T20:45:20 1762807520

I wonder if any of the agents hit the audio button and listened to the instructions? In my experience, that can be pretty helpful.

mdahardy · 2025-11-10T18:02:30 1762797750

Same! As we talk about in the article, the failures were less from raw model intelligence/ability than from challenges with timing and dynamic interfaces

swyx · 2025-11-10T19:24:19 1762802659

i mean did you see the cross-tile numbers

alexnewman · 2025-11-10T20:32:13 1762806733

Hcaptcha cofounder here. Enterprise users have a lot of fancy configuration behind the scenes. I wonder if they coordinated with recaptcha or just assume there sitekey in the same as others

amirhirsch · 2025-11-10T23:08:43 1762816123

This was done on the re-captcha demo page no invisible fingerprinting, behavioral test, or user classification.

alexnewman · 2025-11-11T19:32:07 1762889527

Ah yea probably not a good test then. Good point

Xenoamorphous · 2025-11-10T17:47:31 1762796851

I’m sure they do better than me. Sometimes I get stuck on an endless loop of buses and fire hydrants.

Also, when they ask you to identify traffic lights, do you select the post? And when it’s motor/bycicles, do you select the guy riding it?

Sayrus · 2025-11-10T18:06:28 1762797988

Testing those same captcha on Google Chrome improved my accuracy by at least an order of magnitude.

Either that or it was never about the buses and fire hydrants.

ACCount37 · 2025-11-10T18:16:22 1762798582

It's a known "issue" of reCaptcha, and many other systems like it. If it thinks you're a bot, it will "fail" the first few correct solves before it lets you through.

The worst offenders will just loop you forever, no matter how many solves you get right.

Alex2037 · 2025-11-10T19:46:07 1762803967

stock Chrome logged into a Google account = definitely not a bot. here, click a few fire hydrants and come on in :^)

I sincerely wish all the folx at Google directly responsible for this particular user acquisition strategy to get every cancer available in California.

varenc · 2025-11-10T21:35:20 1762810520

I would think that when you're viewing recaptcha on a site, if you have 3rd party cookies disabled the embedded recaptcha script won't have anyway of connecting you with your Google account, even if you're logged in. At least that's how disabling 3rd party cookies is supposed to work.

michaelt · 2025-11-10T23:53:24 1762818804

Of course, if you have 3rd party cookies disabled, Google would never link your recaptcha activity to your Google account.

They just link it to your IP address, browser, operating system, screen resolution, set of fonts, plugins, timezone, mouse movements, GPU, number of CPU cores, and of course the fact you've got third party cookies disabled.

varenc · 2025-11-11T04:22:24 1762834944

Isn't Chrome shifting to blocking 3rd party cookies by default? If that's the new default than the default behavior would be that being logged into Google isn't used as a signal for recaptcha

imcritic · 2025-11-11T17:42:28 1762882948

Do you really think they won't make a hidden whitelist for their own domains?

varenc · 2025-11-12T00:25:45 1762907145

There'd be no way to hide this. If 3rd party cookies are disabled it's trivial to observe if an embedded google.com iframe is sending my full google.com 1st party cookies in violation of the 3rd party cookie settings. There's no pinky promises involved, you can just check what it's sending with a MITM proxy.

I'm sure they're doing other sketchy things but wouldn't make sense to lie in such a blindingly obvious way. (I just tested it, and indeed, it works as expected)

Sayrus · 2025-11-12T12:20:19 1762950019

So like X-Client-Data which in many cases uniquely identified you but was, pinky promise, never used for tracking. Sent only to Google domains.

https://9to5google.com/2020/02/06/google-chrome-x-client-dat...

varenc · 2025-11-12T18:37:57 1762972677

that would fall under "I'm sure they're doing other sketchy things".

Tostino · 2025-11-11T01:05:00 1762823100

"Oh, that's interesting...there is one other user that matches all of that metadata"

viccis · 2025-11-10T20:22:05 1762806125

That's because Chrome tracks so much telemetry about you that Google is satisfied with how well it has you surveilled. If you install a ton of privacy extensions like Privacy Badger, uBlock, VPN extensions with information leakage protections, etc., watch that "accuracy" plummet again as it makes you click 20 traffic signals to pass one check.

consp · 2025-11-10T22:31:04 1762813864

I stop going to sites using that method due to this. I have no intention of proving I'm a human it I have to click several dubious images 3-4 times in a row.

timshell · 2025-11-10T21:27:01 1762810021

Yeah, we've looked at it in the context of reCAPTCHA v3 and 'invisible behavioral analysis': https://www.youtube.com/watch?v=UeTpCdUc4Ls

It doesn't catch OpenAI even though the mouse/click behavior is clearly pretty botlike. One hypothesis is that Google reCAPTCHA is overindexing on browser patterns rather than behavioral movement

terminalshort · 2025-11-10T23:23:45 1762817025

There is a doom loop mode where it doesn't matter how many you solve or even if you get them correct. My source for this works on this product at Google.

max51 · 2025-11-11T00:51:30 1762822290

That doesn't surprise me. I find it hard to believe it's a pure coincidence that I would get stuck in the loop regularly when I'm on the university wifi but it would never happen anywhere else ever. After a dozen try, I would remote connect to my home pc and it would magically work on the first try every single time.

fooker · 2025-11-11T05:23:28 1762838608

Someone from your university tried to scrape data from Google.

I know it's still not justified, but it's the easy solution that works for preventing DOS attacks.

max51 · 2025-11-12T03:29:20 1762918160

>Someone from your university tried to scrape data from Google.

Kinda wild that someone scraping google's data would prevent me from getting into my PAID (>90$/yr) Dropbox account. That experience is a big part of why I pay extra to host my host data on my own server now.

fooker · 2025-11-12T05:01:06 1762923666

Yep, that's how the internet works now unfortunately.

Decentralization, hosting your own stuff, is great until you run into DDOS attacks and have to make maintaining your server a full time job. Sure you have the skills (or can acquire it), but do you have the time ?

Tostino · 2025-11-11T00:02:47 1762819367

Tell them I hate them.

terminalshort · 2025-11-11T14:40:41 1762872041

Oh, don't worry. That's just v2. V3 uses the Google panopticon to watch your every move and decide if you're human or not that way without ever making you click on images. I'm sure you'll love it!

jgalt212 · 2025-11-11T02:50:11 1762829411

Firefox and Ubuntu can reliably trigger this mode.

plingbang · 2025-11-11T14:40:41 1762872041

I have a recording of me trying to pass the captcha for straight 5 minutes and giving up. To be fair, this has only happened once.

What is the purpose of such loop? Bots can simply switch to another residential proxy when the captcha success rate gets low. For normal humans, it is literally "computer says no".

nfRfqX5n · 2025-11-11T15:01:51 1762873311

It’s not just IP. The score is linked to your google account as well and tracked across google properties

cm2187 · 2025-11-10T21:39:29 1762810769

The buses and fire hydrants are easy. It is the bicycles. If it goes a pixel over the next box, do I select the next box? Is the pole part of the traffic light? And the guy as you say. There is a special place in hell for the inventor of reCaptcha (and for all of Cloudflare staff as fas as I am concerned!)

charcircuit · 2025-11-10T21:50:02 1762811402

It doesn't matter. Select it of you think other people would select it too.

cm2187 · 2025-11-10T21:55:47 1762811747

That's the thing, you could go either way. I am not sure I can answer the question "what would a resonable person click?".

jorvi · 2025-11-11T01:47:54 1762825674

The trick is to pretend you're an idiot. If the bicycle and the person on it map mostly to a rectangle of 8 squares, most people will be so stupid or hasty that they'll click that, nevermind that a human is not part of the bicycle.

The same is true with, say, buses. See an image of a delivery van? Bus! It asks you select all cars and you see no car but a vague pixel blob that someone stupid would identify as a car? Car!

One of the few things that this doesn't work with is stairs, because the side of stairs being stairs or not is something apparently no one can agree on.

abacadaba · 2025-11-10T23:19:08 1762816748

Answer how you'd want a Waymo to react

charcircuit · 2025-11-11T05:41:48 1762839708

If it can go either way then you can pass the captcha either way too. There isn't a single correct answer to these captchas.

timshell · 2025-11-10T21:59:59 1762811999

The 'Process Turing Test' extends the CAPTCHA from 'What would a reasonable person click' to 'How would a reasonable person click'.

For example, hesitation/confusion patterns in CAPTCHAs are different between humans and bots and those can actually be used to validate humans

dfxm12 · 2025-11-11T01:04:45 1762823085

There's not a right or wrong answer. They're just building consensus for self driving vehicles.

datadrivenangel · 2025-11-10T17:48:36 1762796916

That's not due to accuracy, you're getting tarpitted for not looking human enough.

hnburnsy · 2025-11-10T18:30:10 1762799410

Pro tip, select a section you know is wrong, then de select it before submitting. Seems to help prove you are not a bot.

sph · 2025-11-10T19:23:08 1762802588

Shhh, you're not supposed to tell people. Now they'll patch it and I'll have to select stairs and goddamn motorcycles 4 times in a row.

pants2 · 2025-11-10T21:22:54 1762809774

Another pro tip: the audio version of the captcha is usually much easier / faster to solve if you're in a quiet environment

johnbatch · 2025-11-11T00:32:12 1762821132

I wonder how well the AI models would do on the audio version.

sixhobbits · 2025-11-10T17:51:19 1762797079

Didn't look a lot into this but I think the fact that humans are willing to do this in the "cents per thousand" or something range means that it's really hard to get much interest in automating it

utopman · 2025-11-10T18:08:59 1762798139

Not sure it is your case but I think I sometimes had to solve many of them when I am in my daily task rush. My hypothesis is that I solve them too fast for "average human resolving duration" recaptcha seems to expect (I think solving it too fast triggers bot fingerprint). More recently when I fall on a recaptcha to solve, I consciently do not rush it and feel have no more to solve more than one anymore. I don't think I have super powers, but as tech guy I do a lot a computing things mechanically.

IG_Semmelweiss · 2025-11-10T19:27:56 1762802876

that, and VPN.

Yes.

nistiminic · 2025-11-11T02:19:14 1762827554

Just select the audio option. It's faster and easier. Maybe it's because google doesn't care about training on speech to text. I usually write something random for one word and get the other word correct. I can even write "bzzzzt" at the beginning. They don't care because they aren't focused on training on that data.

Now I think of it, it's really a failure that AI didn't use this and went with guessing which square of an image to select.

felixfurtak · 2025-11-10T19:49:30 1762804170

I always assume that people are lazy and try and click the least amount of squares as possible to get broadly the correct answer. Therefore, if it says motorbikes just click on the body of the bike and leave out rider and tiles with hardly any bike in them.

If it says traffic lights just click on the ones you can see lit and not the posts and ignore them if they are too far in the distance. Seems to work for me.

stephen_g · 2025-11-11T05:25:43 1762838743

The other fun thing is the complete lack of localisation for people not from the US. "Select the squares with crosswalks" - with what? Oh, right, the pedestrian crossings... And the fire hydrants look like we've seen in movies, it's like, oh yeah those do exist in real life!

perfmode · 2025-11-11T00:53:05 1762822385

> do you select the guy riding it? do you select the post?

Just select as _you_ would. As _you_ do.

Imperfection and differing judgments are inherent to being human. The CAPTCHA also measures your mouse movement on the X and Y axes and the timing of your clicks.

mdahardy · 2025-11-10T18:01:34 1762797694

While running this I looked at hundreds and hundreds of captchas. And I still get rejected on like 20% of them when I do them. I truly don't understand their algorithm lol

Semaphor · 2025-11-10T17:57:50 1762797470

There's a browser extension to solve them. Buster.

jameslk · 2025-11-10T22:25:00 1762813500

> Also, when they ask you to identify traffic lights, do you select the post? And when it’s motor/bycicles, do you select the guy riding it?

This type of captcha is too infuriating so I always skip it until I get the ones where I’m just selecting an entire image, not parts of an image

Google’s captchas are too ambiguous and might as well be answered philosophically with an essay-length textbox

nwellinghoff · 2025-11-11T05:05:14 1762837514

No you don’t select the post. No you don’t select the guy. Hence the point. Agreed they are annoying.

xnx · 2025-11-10T17:47:16 1762796836

Seems like Google Gemini is tied for the best and is the cheapest way to solve Google's reCAPTCHA.

Will be interesting to see how Gemini 3 does later this year.

mdahardy · 2025-11-10T18:08:41 1762798121

After watching hundreds of these runs, Gemini was by far the least frustrating model to observe.

dgacmu · 2025-11-10T19:01:52 1762801312

In my admittedly limited-domain tests, Gemini did _far_ better at image recognition tasks than any of the other models. (This was about 9 months ago, though, so who knows what the current state of things). Google has one of the best internal labeled image datasets, if not the best, and I suspect this is all related.

bena · 2025-11-10T17:54:49 1762797289

Makes sense, what do you think it was trained on?

padolsey · 2025-11-11T03:03:43 1762830223

To this day I hate captchas. Back when it was genuinely helping to improve OCR for old books, I loved that in the same way I loved folding@home, but now I just see these widgets as a fundamentally exclusionary and ableist blocker. People with cognitive, sight, motor, (and many other) impairments are at a severe disadvantage (and no, audio isn't a remedy, it is just shifting to other ableisms). You can add as many aria labels as you like but if you're relying on captchas, you are not accessible. It really upsets me that these are now increasing in popularity. They are not the solution. I don't know what is, but this aint it.

TulliusCicero · 2025-11-11T04:33:24 1762835604

> In general, all models performed best on Static challenges and worst on Cross-tile challenges.

I also perform poorly on cross-tile, I never know whether to count a tiny bit of a bicycle in a square as "a bike in that square".

ajsnigrutin · 2025-11-10T18:15:29 1762798529

So, when do we reach a level where AI is better than humans and we remove captcha from pages alltogether? If you don't want bots to read content, don't put it online, you're just inconveniencing real people now.

cubefox · 2025-11-10T18:54:26 1762800866

They can also sign up and post spam/scams. There are a lot of those spam bots on YouTube, and there probably would be a lot more without any bot protection. Another issue is aggressive scrapers effectively DOSing a website. Some defense against bots is necessary.

1gn15 · 2025-11-11T03:08:07 1762830487

I use manual verification on first post (note: can probably use an LLM to automate this), or just not have a comment section in the first place. That way, you moderate based on content, not identity or mental ability (which can be discriminatory and is also a losing game, as seen in TFA).

Either that, or just be honest and allow anonymous posting lol

rkagerer · 2025-11-10T20:47:23 1762807643

Forget whether humans can't distinguish your AI from another human. The real Turing test is whether your AI passes all the various flavors of captcha checks.

timshell · 2025-11-10T20:59:37 1762808377

One of the writers here. We believe the real Turing Test is whether your AI performs a CAPTCHA like a human would/does.

kjok · 2025-11-10T18:30:24 1762799424

If not today, models will get better at solving captchas in the near future. IMHO, the real concern, however, is cheap captcha solving services.

arbol · 2025-11-10T19:25:54 1762802754

The solvers are a problem but they give themselves away when they incorrectly fake devices or run out of context. I run a bot detection SaaS and we've had some success blocking them. Their advertised solve times are also wildly inaccurate. They take ages to return a successful token, if at all. The number of companies providing bot mitigation is also growing rapidly, making it difficult for the solvers to stay on top of reverse engineering etc.

kjok · 2025-11-10T19:38:35 1762803515

> when they incorrectly fake devices

And how often does this happen? Do you have any proof? Most YC companies building browser agents have built-in captcha solvers.

arbol · 2025-11-10T21:23:18 1762809798

That's a good question. I haven't checked the stats to see how often it happens but I will make a note to return with some info. We're dealing with the entire internet, not just YC companies, and many scrapers / solvers will pass up a user agent that doesn't quite match the JS capabilities you would expect of the browser version. Some solving companies allow you to pass up user agent , which causes inconsistencies as they're not changing their stack to match the user agent you supply. Under the hood they're running whatever version of headless Chrome they're currently pinned to.

tim333 · 2025-11-11T20:15:59 1762892159

I'd have a job with the first cross-tile one shown saying select squares with motorcycles. Does the square above the handle bars appearing to maybe contain part of a rear view mirror count? I'm not surprised the LLMs were failing on those.

akimbostrawman · 2025-11-11T09:11:32 1762852292

At this point i am convinced all captchas almost entirely rely on ip reputation. Even on linux with hardened firefox you can get stuck in a infinite loop with one IP but then switch to another one that let's you in after 0-2 tries.

VectorLock · 2025-11-13T16:30:24 1763051424

Is calling Browser Use and "open source framework" a bit misleading it looks like a commercial product that requires an API key to use even if you run the source?

maknee · 2025-11-10T18:04:05 1762797845

interesting results. why does reload/cross-tile have worse results? would be nice to see some examples of failed results (how close did it to solving?)

mdahardy · 2025-11-10T18:07:18 1762798038

We have an example of a failed cross-tile result in the article - the models seem like they're much better at detecting whether something is in an image vs. identifying the boundaries of those items. This probably has to do with how they're trained - if you train on descriptions/image pairs, I'm not sure how well that does at learning boundaries.

Reload are challenging because of how the agent-action loop works. But the models were pretty good at identifying when a tile contained an item.

Youden · 2025-11-10T19:12:20 1762801940

I'm also curious what the success rates are for humans. Personally I find those two the most bothersome as well. Cross-tile because it's not always clear which parts of the object count and reload because it's so damn slow.

cedws · 2025-11-10T19:08:23 1762801703

Would performance improve if the tiles were stitched together and fed to a vision model, and then tiles are selected based on a bounding box?

mdahardy · 2025-11-10T19:39:27 1762803567

That's a cool idea. I bet it would work better.

PaulHoule · 2025-11-10T16:58:44 1762793924

I know people were solving CAPTCHAS with neural nets (with PHP no less!) back in 2009.

mdahardy · 2025-11-10T18:00:31 1762797631

You could definitely do better than we do here - this was just a test of how well these general-purpose systems are out-of-the-box

golfer · 2025-11-10T17:54:39 1762797279

Indeed, captcha vs captcha bot solvers has been an ongoing war for a long time. Considering all the cybercrime and ubiquitous online fraud today, it's pretty impressive that captchas have held the line as long as they have.

mehdibl · 2025-11-10T19:07:02 1762801622

Ok and then? Those models were not trained for this purpose.

It's like the last hype over using generative AI for trading.

You might use it for sentiment analysis, summarization and data pre-processing. But classic forecast models will outperform them if you feed them the right metrics.

daveguy · 2025-11-10T20:06:23 1762805183

These are all multi-modal models, right? And the vision capabilities are particularly touted in Gemini.

https://ai.google.dev/gemini-api/docs/image-understanding

Legend2440 · 2025-11-10T20:05:40 1762805140

It is relevant because they are trained for the purpose of browser use and completing tasks on websites. Being able to bypass captchas is important for using many websites.

It would be nice to see comparisons to some special-purpose CAPTCHA solvers though.

bagacrap · 2025-11-11T06:26:31 1762842391

And more broadly, if an agent is supposed to do everything a human can on the web, its ability to solve a captcha is likely a decent litmus test.

throwawayu5pg · 2025-11-10T21:39:55 1762810795

static, cross-tile and reload. recaptcha call window pings LPRs.

guluarte · 2025-11-10T17:54:21 1762797261

in other words reasoning call fill the context window with crap

theoldgreybeard · 2025-11-10T21:14:13 1762809253

I’ve used LLMs to solve captchas for shits and giggles, just taking a screenshot and pasting it into ChatGPT and having it tell me what squares to click and I think it solves them better than I do.

Can we just get rid of them now, they are so annoying and basically useless.

WhereIsTheTruth · 2025-11-10T18:03:56 1762797836

3 models only, can we really call that a benchmark?

mdahardy · 2025-11-10T18:09:37 1762798177

WhereIsTheTruth · 2025-11-11T06:05:03 1762841103

Ignorance Is Bliss

jngiam1 · 2025-11-10T19:20:53 1762802453

I hypothesize that these AI agents are all likely higher than human performance now.