How to Bypass Cloudflare: A Comprehensive Guide (zenrows.com)
260 points by jakobdabo on Sept 18, 2022 | hide | past | favorite | 74 comments


There are legitimate use cases for bypassing Cloudflare's bot protection.

I discovered that our company's help documentation (and integration guides), hosted by readme.com, had been completely de-indexed from Google for the past 3 months.

Our Readme docs were formerly our #1 source of organic (free) leads.

After investigating, I found that Cloudflare (as configured by Readme) was blocking Googlebot when we used Cloudflare Workers. Cloudflare was returning a 403 for Googlebot, but returning pages as usual for regular users.

The cause: we were using Workers to rewrite some URLs at the edge (replacing Readme's default images with optimized + compressed images, using Cloudflare's own image optimization service).

Using Workers to do this meant Readme's Cloudflare account received requests from our domain with a "googlebot" user agent, but from an IP that wasn't verified as a Googlebot IP address (I assume the Worker was requesting the Readme site using the Googlebot user agent but with whatever IP address CF Workers egress from).

I emailed Cloudflare support but it was clear it would take a lot of time to get them to understand the issue (and probably longer to fix it).

So, we had to spend a lot of time figuring out how to allow Googlebot requests past Cloudflare's "fake bot" firewall rule.

In our own Cloudflare account, we have all security settings at the lowest sensitivity possible (or turned off completely). We serve over 500 billion requests a month (10+ TB of bandwidth), and the amount of blocked traffic to seemingly legitimate clients was surprisingly high.

I love Cloudflare (and own quite a bit of their stock) but I'm beginning to rethink my stance on their service. They make it extremely easy to enable powerful features with little visibility or control over the details of how those features work.

Another SEO nightmare is their "Crawler Hints" service. I highly recommend against using it if you are ever the target of automated security scanners (e.g. ones used by white-hat bug bounty hackers). With "Crawler Hints" enabled, a white-hat hacker running a scan of your site against random URLs results in Bingbot, Yandex, and other search engines attempting to index every single URL hit by the scanners.

Basically, it's a mess, and the only way to really fix it is to bypass Cloudflare or spend a lot of time and money debugging with Cloudflare.

Next quarter I'm faced with the decision of either doubling down on Cloudflare and getting an Enterprise plan with them ($20k+) or just ripping them out of our stack and going back to our old AWS CloudFront setup, which has fewer POPs but was much less of a hassle.


I feel this as well. Cloudflare markets itself as a set-and-forget solution, but it really doesn't work that way. Furthermore, in the limited visibility they give into blocking, they frame each blocked request as a success, unconditionally. Of course they would; that is the service they are providing. However, this is often not the case: for many websites most requests benefit very little from blocking, and bot protection only really needs to cover mutating endpoints and DoS attacks.

For example, the Cloudflare Blog's RSS feed is very often blocked for public-cloud IP ranges with specific clients. This is an endpoint that is intended to be public, is cacheable, and is even intended to be accessed by bots! This is a common issue that should be very easy to solve technically, but it highlights how Cloudflare is not a set-and-forget solution. If they can't configure their own blog (a super simple case) correctly, it is clear that using the tool correctly requires special care and monitoring of the limited visibility that they provide you.


> Furthermore in the limited visibility that they give to blocking they frame each blocked request as a success unconditionally.

Two things that have happened to me:

* Cloudflare has decided that I'm a bot and stalled me, given me captchas, or just blocked me outright.

* Cloudflare has shown me marketing claiming that 40% of traffic is bots.

I'm not particularly impressed.


I've used Googlebot as my fake browser user agent for years. It's really interesting to explore the web when everyone thinks you're Google.


Unless the originating IP address is a Google-controlled one, using Googlebot as a User-Agent header is (IME) generally no better than not sending a UA header at all.^1 If the goal is to make a server believe a request is coming from Google, then the request needs to be sent from a publicised Google-controlled IP address.^2

1. For many years I have had great results with not sending a UA header. It is also, IMO, an effective means to discover the true number of websites that refuse to fulfill a request in the absence of a UA header, which IME is extremely small. For that small handful of sites, one can send a "fake" UA header of one's choosing. sec.gov is an example of such a site.

2. http://developers.google.com/static/search/apis/ipranges/goo...
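A quick way to run the no-UA experiment from footnote 1 yourself (a sketch; host names are placeholders) is Python's `http.client`, which, unlike `urllib.request`, adds no default User-Agent header:

```python
import http.client
from typing import Optional


def build_headers(user_agent: Optional[str]) -> dict:
    """No UA at all (None) or a chosen fake one. http.client only adds a
    Host header on its own, so omitting User-Agent really sends none."""
    return {} if user_agent is None else {"User-Agent": user_agent}


def fetch_status(host: str, path: str = "/",
                 user_agent: Optional[str] = None) -> int:
    """Return the HTTP status code for a GET with the given (or absent) UA."""
    conn = http.client.HTTPSConnection(host, timeout=10)
    conn.request("GET", path, headers=build_headers(user_agent))
    status = conn.getresponse().status
    conn.close()
    return status
```

Comparing `fetch_status(site)` against `fetch_status(site, user_agent="...")` across a list of sites gives a rough count of how many actually refuse requests without a UA header.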


Interestingly, lite.duckduckgo.com recently started requiring a User-Agent header, after many years of operating without this requirement. Are there any enforceable limits on what DDG can do with the UA header data? There has been no update to DDG's privacy policy.


I wonder if fake-bot detectors can distinguish between any Google IP, like GCP instances (i.e. do they simply check the ASN), and crawler-specific IPs.

Or maybe Google's crawler also runs on GCP and is indistinguishable from regular $5 compute users.


Yes, Google and most major search engines support a reverse DNS (rDNS) lookup to validate that a crawler really is Googlebot.
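Google documents this as a double lookup: reverse-resolve the IP, check the name ends in googlebot.com or google.com, then forward-resolve the name and confirm it maps back to the same IP. A minimal Python sketch (function names are mine):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_hostname(host: str) -> bool:
    """Pure check on the PTR name; strip a trailing dot some resolvers return."""
    return host.rstrip(".").endswith(GOOGLE_SUFFIXES)


def is_verified_googlebot(ip: str) -> bool:
    """Reverse + forward DNS verification of a claimed Googlebot IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
    except OSError:
        return False
    if not is_google_hostname(host):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup
    except OSError:
        return False
    return ip in forward_ips
```

The suffix check alone is not enough: `fakegooglebot.com` must fail, which is why the leading dot in the suffixes matters.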



Like the OP, I’ve employed a custom configuration in Cloudflare which detects (and blocks) browsers which claim to be Googlebot but don’t originate from Google’s approved Googlebot IP ranges.

The vast majority of such requests are dodgy scanning operations likely looking for email addresses or exploitable forms.


What are some of the most interesting differences you've seen?


I think the best thing is that for some sites with many ads, subscriber-only content, cookie-consent pop-ups, and/or captchas, you will sometimes see that these are all gone and you get an ad-free, full-text version without pop-ups or captchas. But that's only the case for websites that don't check the origin IP and just rely on the user agent.


It is indeed interesting. Some sites even let you view their content without JS, registering/subscribing, and/or revert back to something approaching an unstyled static site without showing any ads, sidebars, or other useless content.

To add to some of the other experiences here about no-UA: I've tried that before too, and it was notably worse than pretending to be Google; lots of sites just return "Internal Server Error" or similar messages.


Do websites not spit at you, or do they just assume you 'will do no evil'?


> Next quarter I'm faced with the decision of either doubling down on Cloudflare and getting an Enterprise plan with them ($20k+) or just ripping them out of our stack and going back to our old AWS CloudFront setup, which has fewer POPs but was much less of a hassle.

Is Fastly a viable alternative for you?


> (I assume the Worker was requesting the Readme site using the Googlebot user agent but with whatever IP address is used when using CF Workers).

Regardless of whether Workers are used, Cloudflare requests URLs with the headers `cf-connecting-ip` and `x-forwarded-for` set to the actual client's IP address, and the website behind CF should be using these headers to get source IPs.
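For illustration (a hypothetical helper, not from this thread), an origin behind Cloudflare would typically resolve the client IP like this, preferring `CF-Connecting-IP`:

```python
def client_ip(headers: dict) -> str:
    """Resolve the real client IP behind Cloudflare: prefer CF-Connecting-IP,
    fall back to the first X-Forwarded-For hop, then to the socket peer
    address (stored here under the key REMOTE_ADDR)."""
    h = {k.lower(): v for k, v in headers.items()}  # headers are case-insensitive
    if "cf-connecting-ip" in h:
        return h["cf-connecting-ip"].strip()
    xff = h.get("x-forwarded-for", "")
    if xff:
        # XFF is a comma-separated chain; the leftmost entry is the client.
        return xff.split(",")[0].strip()
    return h.get("remote_addr", "")
```

Note that trusting these headers is only safe if the origin accepts traffic exclusively from Cloudflare's IP ranges; otherwise anyone can forge them.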


That's not true when you're sending a request to a site belonging to a different Cloudflare account. In that case, the headers are set to a non-routable placeholder IP address used to represent Workers itself. Passing through the original client IP in this case would be a security flaw because the Worker could have arbitrarily rewritten the request, so it can no longer be considered to have come from that client.


Ah indeed, I hadn't picked up on the use case being requesting a third-party site from a Worker on their site. In that case, the solution would be to add code that rewrites the incoming Googlebot user agent to something nondescript for these calls.


> By using Workers to do this, it resulted in Readme's Cloudflare account receiving requests from our domain with "googlebot" useragent, but from an IP that wasn't verified as a googlebot IP address (I assume the Worker was requesting the Readme site using the Googlebot user agent but with whatever IP address is used when using CF Workers).

Was this definitely the cause? It's somewhat surprising to hear that requests would be rejected if the user agent doesn't match a set of hard coded IP addresses.

Were you able to resolve this in the end? If not and the cause is what you suspect then perhaps changing the user agent in your worker might be a workaround.


> It's somewhat surprising to hear that requests would be rejected if the user agent doesn't match a set of hard coded IP addresses.

It’s fairly common for DDoS/scraping prevention; Googlebot (and most other crawlers) publish their IP ranges for exactly that reason[0][1][2]. I don’t work at Cloudflare though, so no insider knowledge of what you folks are doing.

[0] https://developers.google.com/search/docs/crawling-indexing/...

[1] https://developers.facebook.com/docs/sharing/webmasters/craw...

[2] https://developer.twitter.com/en/docs/twitter-for-websites/c...


It actually makes sense to me. I've pinged the bots team to see if we can improve here.


> Was this definitely the cause?

No, I haven't confirmed it. We jumped straight to fixing it without debugging the root cause. It's possible the cause is something totally different (I should have added this caveat in my original post). I was just speculating.


What are you using Cloudflare for??


The actual "easiest" way (at least for me) to bypass Cloudflare is to find the actual IP of the web-server running behind it. Of course in a lot of cases it's not possible, for example when the web admin correctly limits the webserver to only respond to Cloudflare IP ranges, or if https://developers.cloudflare.com/ssl/origin-configuration/o... is used.

The most useful services for that are https://shodan.io/ and https://search.censys.io/. I've had decent success with Censys in finding the real IP addresses of websites behind Cloudflare. You might also have success checking the DNS record history of a particular domain.


Companies spend thousands of dollars on these anti-bot solutions, and then they are so misconfigured that using a specific user agent, or faking browsing via mobile, bypasses them. Real-life stories.


Often this is because you are hamstrung by old mobile apps or TV apps that can't be forcibly updated, so closing the loophole would break those users. You're making a trade-off between user pain and bot deflection, and many times this is actually known and on purpose. Botters hitting that loophole makes it easier for engineering and product to prioritize closing it. Real-life stories.



Totally agree.


Just want to say THANK YOU for this insight. It never occurred to me, and I just checked out one of the big sites that I scrape for a side project and, lo and behold, you are 100% correct. Found their origin in Censys in about 30 seconds and I've never been able to crunch through their pages more easily.

To others out there who explore this: As with all scraping, be gentle! If you start pounding on someone's origin server directly, you're much more likely to be noticed than if you're pounding on something behind a Cloudflare cache. Set rate limits, scrape during off-peak hours, etc. Be a good scraping citizen.
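As a sketch of "being gentle" (the class name and the 2-second default are my own choices, not from the comment), a minimal inter-request throttle looks like:

```python
import time


class PoliteFetcher:
    """Enforce a minimum interval between requests to an origin server."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self.last = 0.0  # monotonic timestamp of the previous request

    def wait_time(self, now: float) -> float:
        """Pure calculation: seconds to sleep before the next request."""
        return max(0.0, self.min_interval - (now - self.last))

    def throttle(self) -> None:
        """Block until it is polite to send the next request."""
        time.sleep(self.wait_time(time.monotonic()))
        self.last = time.monotonic()
```

Calling `throttle()` before each fetch caps the request rate regardless of how fast the scraping loop itself runs.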


How did you do it? Asking for a friend who would like to get to the webservers behind CDNs that front Russian websites.


> or if https://developers.cloudflare.com/ssl/origin-configuration/o... is used.

How does using CF's origin CA prevent a connection to the real backend in order to bypass Cloudflare? You could just ignore the SSL error, couldn't you?


They probably meant to link https://developers.cloudflare.com/ssl/origin-configuration/a... where Cloudflare uses a client TLS certificate to pull from the origin and the origin should be configured to reject requests without a client certificate.


In addition, I think CF provides a list of IPs to whitelist to only allow access from their servers.


One more technique to find the backend IP address of the web server is to look up the DNS record history of the domain in question. Usually, the admins will test their website with SSL on the real domain to ensure it works.

Only afterward, they will protect it with Cloudflare/Akamai/Cloudfront, etc, which means you'll have a DNS history trail to lookup.

I've had good luck with services such as SecurityTrails, Virustotal and CompleteDNS for DNS history.

Once you find the original A record IP address, you can put it in your hosts file, pointing to the domain. Then, you will be able to access the site, bypassing the CDN without issue.
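The hosts-file trick can also be done per request without touching system files. This sketch (hypothetical helper names; plain HTTP for brevity, since HTTPS would additionally need SNI set to the domain) connects to the origin IP while sending the real Host header:

```python
import http.client


def hosts_line(ip: str, domain: str) -> str:
    """Line to append to /etc/hosts so the domain resolves to the origin,
    skipping the CDN entirely. (Including www. is an assumption.)"""
    return f"{ip}\t{domain} www.{domain}"


def fetch_direct(ip: str, domain: str, path: str = "/") -> int:
    """Connect straight to the origin IP but send the real Host header so
    virtual hosting still routes to the right site. Returns the status code."""
    conn = http.client.HTTPConnection(ip, 80, timeout=10)
    conn.request("GET", path, headers={"Host": domain})
    status = conn.getresponse().status
    conn.close()
    return status
```

Either approach only works when the origin accepts connections from arbitrary IPs, i.e. the admin has not restricted it to Cloudflare's published ranges.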


This is what I expected the article to be about. I would wager a lot of shops don't do the whitelisting. If they wanted to be really intense they could do authenticated origin pulls.


AWS CloudFront with S3 recommends that you just set your S3 to require a specific 'Referer' header variable and you set CloudFront to send that custom 'Referer' with each origin request.

Seems to work great when you use something like a GUID, and no need for IP whitelisting.


I would also like to mention FlareSolverr [1] here, which just uses a headless browser to solve the challenges; that might be acceptable in some situations (ones that don't need a high request rate).

1. https://github.com/FlareSolverr/FlareSolverr
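FlareSolverr runs as a local HTTP service (port 8191 by default) that you drive with JSON commands. A minimal client, assuming its documented `request.get` command:

```python
import json
import urllib.request

FLARESOLVERR_URL = "http://localhost:8191/v1"  # default local endpoint


def solve_payload(url: str, timeout_ms: int = 60000) -> dict:
    """Build the JSON command FlareSolverr expects for a GET request."""
    return {"cmd": "request.get", "url": url, "maxTimeout": timeout_ms}


def fetch_through_flaresolverr(url: str) -> dict:
    """POST the command to the FlareSolverr service; it drives a headless
    browser through the challenge and returns the solved response as JSON."""
    req = urllib.request.Request(
        FLARESOLVERR_URL,
        data=json.dumps(solve_payload(url)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The returned JSON includes the page body plus the cookies the browser earned, which you can reuse in a plain HTTP client until they expire.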


Use ZenRows. Got it. It's clickbait, but it does provide a good summary of how Cloudflare's anti-bot stuff works.


Yeah, it's content marketing. It's got all the stylistic tells:

- Giving more background than is appropriate to the subject (explaining what cloudflare is in an article about bypassing it)

- Lots of fluff about "what we're going to cover," like a poorly written high-school essay

- Asking and answering questions rather than stating things: "Can Cloudflare be bypassed? Thankfully, the answer is yes!"

I'm not entirely sure what drives these things, but they seem to be very common in this sort of content marketing article. I'm guessing a lot of it is SEO-driven.

This particular article has more actual content than most, but still ultimately devolves into an ad, of course.


> - Asking and answering questions rather than stating things: "Can Cloudflare be bypassed? Thankfully, the answer is yes!"

> I'm not entirely sure what drives these things, but they seem to be very common in this sort of content marketing article. I'm guessing a lot of it is SEO-driven.

I suspect that this is them trying to get into Google's "frequent question"/"people also ask" [0] box, because that seems like a common search term ("can you bypass cloudflare")

[0]: https://www.brightedge.com/glossary/people-also-ask


Another tell: emphasizing various phrases by boldfacing, as if the rest of the article is not intended to actually be read.

I find the mention of a series A fundraising round at the top interesting too. Do the funders really expect something other than an escalating technical arms race that eventually outpaces them?


I'd call it an ad, not clickbait. An ad with some useful content, but still.

Edit: I think the modern term is content native advertising, although I'm perfectly happy to keep using the word infomercial.


I called it clickbait because the article does not contain the thing promised in the title.


> An ad with some useful content, but still.

There's a term for that, infomercial :-)


I'd call it clickbait since its actual nature is not revealed til the very end.


The article is ~32 pages long if printed using the default settings in my browser.

The first 31 pages pretty extensively provide information on bypassing Cloudflare, which would be very useful to and save a lot of time for someone who is tasked with implementing Cloudflare bypass software.

This is followed by 1 page that talks about their product for doing this.

In other words it delivers pretty much what the title said it would, and then follows that with a small mention of their product.

That's not clickbait by any even remotely reasonable definition.


The bulk of the article is about the protections in place and there is no direct answer to how to bypass them.


The bulk of the article is "look how hard this is, it is beatable in principle but you really don't want to deal with it yourself". And sure enough, at the end, there is a solution for sale.


This is modern marketing. Provide information "for free" to make it seem as if it's in the best interest of those reading, but the reality is you're building a case for your product or service. All of this content is useful but the goal of the post was to show that it's much easier to just pay ZenRows to do this for you. Not bad.


>This is modern marketing

I certainly wouldn't mind if all advertising was done this way. Unless the information here has been copy-pasted from somewhere else, there's enough value there to stand on its own.


The frustrating thing to me is that CF is that invasive and still can't distinguish bots from people; it usually eventually lets me through, but I've spent enough time staring at the "are you sure you're not a bot?" screen to laugh off their claims about human/bot traffic ratios.


In the past I've always found that the easiest way to bypass Cloudflare was looking up the DNS history of the domain. The majority of servers will continue to respond on their IP directly.


If I just need to make plain GET requests in my web scraping, I've found the easiest way to bypass Cloudflare on most sites is to make the requests via the Internet Archive. That has some rate limiting, but it can be worked around by using several source IP addresses in parallel.
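One way to do that (a sketch using the Wayback Machine's public "availability" endpoint; helper names are mine) is to ask the Internet Archive for its latest snapshot of a URL and fetch that instead of the live page:

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def wayback_api_url(target: str) -> str:
    """URL of the Wayback 'availability' API for the target page."""
    return ("https://archive.org/wayback/available?url="
            + urllib.parse.quote(target, safe=""))


def latest_snapshot(target: str) -> Optional[str]:
    """Return the URL of the closest archived snapshot, or None if absent."""
    with urllib.request.urlopen(wayback_api_url(target)) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None
```

Snapshots can be stale, so this only suits content that changes slowly; and as the comment notes, the Archive applies its own rate limiting.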


I hate cloudflare. I had a really hard time making a web scraper.


That's... the whole point.


And it's an invalid point. Scraping prevention is the most stupid thing Cloudflare has ever done, and that's after a very long list.


It’s not an invalid point. Setting aside Government and business services, you aren’t morally entitled to clean, uninterrupted access to any random website. If a webmaster chooses to make your life difficult for any reason, that’s entirely their prerogative.


I'm saying anti-scraping is a misconception and if you buy it (or get one for free) you just paid for a placebo and it's also worse than nothing. Is there anything confusing about this?


I don't want you scraping my sites, and I can see the hundreds of thousands of requests from scrapers/bots blocked by Cloudflare. Seems like it works here.


If it's a misconception, why did you complain that Cloudflare caused you "a really hard time making a web scraper"?


You just confused me with someone else. Cloudflare gives me a hard time only as a user. When you view a page through a proxy or from a shared IP address, you have to solve a captcha, or enable JS to see any text with @ in it, and similar insanities. Scraping is a user use case; the web is meant to be automatable and automatically navigable (though as a user who sees it as an interactive one-time experience, like someone walking into a shop, I can see how you'd see otherwise). Anyway, it's easy to scrape even with Cloudflare; it just causes more bumps, adds to your fingerprint, and makes the internet as a whole more insecure.



Correct me if I have missed something, but this elaborate fingerprinting exercise called "bot protection" cannot conclusively distinguish whether a person is giving commands to a computer in real time or whether the computer is reading from a script of commands. It only serves to distinguish OS, client, IP address, etc. It is collecting tracking data.

Of course, those trying to profit from online advertising services seek to collect the same (fingerprinting) data. Do Cloudflare's terms of service/privacy policy allow Cloudflare to do anything it wants with this data, or are there limits?


If you use even a slightly customised user-agent and/or OS setup, it's likely you'll be blocked. They'll of course say it's for your "security", but we all know what that really means: use only the software and hardware configurations that we approve. Of all the ways to "herd the sheeple", this is the most insidious because it punishes those who don't want to submit to their whims. Meanwhile the actual attackers are going to still have enough motivation to find ways around it, similar to how DRM has encouraged piracy.


Are there other products out there that offers a similar feature set at this price point?


Some impressive documentation on how to get around this bot-management solution.


[flagged]


Never underestimate the power of corporate interests and what they'll do if they feel threatened.

> and so you will only be able to browse websites in a certified way, like using your bare IP address or a big 4 browser

...and that browser will have to be running on "trusted" software and hardware, which is very much the definition of user-hostility. There were a few articles recently about the imminent threat of remote attestation that got buried in a similar way. They first went after RMS, managed to get their user-hostile antifeatures into Linux, and are slowing locking things down and going after the dissenters. All in the name of "security" --- to secure their interests, not yours.


>"They first went after RMS, managed to get their user-hostile antifeatures into Linux,..."

What is RMS here?


Richard Matthew Stallman


Your observation in your edit is indeed a phenomenon; I have noticed this as well. Cloudflare seems to use HN as their exclusive marketing channel. Their CTO seems to submit almost every one of their blog posts or product updates here, then many of the company's employees join in on the thread and it becomes almost like a customer forum. I don't think this, mixed with the knee-jerk reactions to anyone with a critical voice, is healthy.


Exactly this. They used a simple formula to grow their company: posting some novice technical articles to HN ~15 years ago, then the 2010s-standard business practice of "contributing" to open source (I don't consider anything they've done progress; no different from any other web company). Then around 2011 they broke Tor and kept it broken until 2018, when they made it so that if you have a big-4 browser (or a fork thereof: Tor Browser), you can visit pages once again. But HN was already brainwashed, so articles pointing this out couldn't get upvoted, and it was never fixed.

The most insane backward nonsense is that Cloudflare proposed a browser extension to bypass the captcha. How would HN react if Microsoft suddenly made 50% of the web block Russian IPs by default until you install a plugin they provide?

Now the only thing I worry about is how they will gimp IPFS, since they are sort of leaning toward it, once it or something similar replaces the web. They have to make money somehow, and any solution requires breaking things in such a protocol.


> you guys are fucking losers.

you're on this site too, insulting strangers on the internet, for making your text gray. you are part of the club you are insulting. just fyi :)


I don't think the insults are warranted, but he definitely has a point. Now that his comment is dead, it's an even stronger point!


All this and it's just an ad for some SAAS? Fuck I got gypped.



