Using robots.txt with Cloudflare Pages
Cloudflare Pages is a great way to host your static website: it’s fast, easy, and free to start with. While egress traffic is free, you still pay in other ways when bots crawl your website. This is where robots.txt comes in handy: you can use it to keep bots from crawling your website without having to configure anything in Cloudflare itself.
Use robots.txt for a sitemap
robots.txt can tell bots where to find your sitemap, which lists the pages available on your website so search engines can discover them. Let’s start with a simple example that points bots at the sitemap.
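A minimal robots.txt for this only needs a single Sitemap line with the sitemap’s full URL (shown here with the sitemap URL used later in this article):

```
Sitemap: https://dailystuff.nl/sitemap.xml
```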
After uploading the file together with the sitemap to the website, you can check that it’s working by requesting /robots.txt on your site.
Using robots.txt to block bots
robots.txt can also be used to block bots from crawling your website. This is useful for bots that misbehave within the rules set in RFC 9309, for example by crawling your website too fast or using too much bandwidth. The Disallow: / statement blocks the listed bots for the whole site. Add the following to your robots.txt:

```
sitemap: https://dailystuff.nl/sitemap.xml

User-agent: Nuclei
User-agent: WikiDo
User-agent: Riddler
User-agent: PetalBot
User-agent: Zoominfobot
User-agent: Go-http-client
User-agent: Node/simplecrawler
User-agent: CazoodleBot
User-agent: dotbot/1.0
User-agent: Gigabot
User-agent: Barkrowler
User-agent: BLEXBot
User-agent: magpie-crawler
User-agent: MJ12bot
User-agent: AhrefsBot
Disallow: /
```
The list of bots is not complete; they were selected at random for this example. You can find a more complete list at robotstxt.org.
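If you want to verify the rules before uploading, Python’s standard `urllib.robotparser` module can evaluate them offline. The snippet below is a small sketch using a shortened version of the rules above; the URLs are only illustrative:

```python
from urllib.robotparser import RobotFileParser

# A shortened version of the robots.txt rules above.
rules = """\
User-agent: MJ12bot
User-agent: AhrefsBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The listed bots are blocked for the whole site...
print(rp.can_fetch("AhrefsBot", "https://dailystuff.nl/"))   # False
# ...while bots that are not listed are still allowed.
print(rp.can_fetch("Googlebot", "https://dailystuff.nl/"))   # True
```

Keep in mind that robots.txt is advisory: only crawlers that honor it will stay away.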
Block Cloudflare endpoints via robots.txt
Another use case for the robots.txt file is to block Cloudflare’s own endpoints. These endpoints are used by Cloudflare to provide services like the Web Application Firewall (WAF) and Bot Management. You can block them by adding the following to your robots.txt file and uploading it to your website:

```
sitemap: https://dailystuff.nl/sitemap.xml

User-agent: *
Disallow: /cdn-cgi/bm/cv/
Disallow: /cdn-cgi/challenge-platform/
Disallow: /cdn-cgi/images/trace/
Disallow: /cdn-cgi/rum
Disallow: /cdn-cgi/scripts/
Disallow: /cdn-cgi/styles/
Disallow: /cdn-fpw/sxg/
```
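Because these rules use `User-agent: *`, they apply to every compliant crawler, including Googlebot. You can again check this offline with Python’s standard `urllib.robotparser`; this sketch uses a subset of the paths above, and the page URLs are only illustrative:

```python
from urllib.robotparser import RobotFileParser

# A subset of the Cloudflare endpoint rules above.
rules = """\
User-agent: *
Disallow: /cdn-cgi/challenge-platform/
Disallow: /cdn-cgi/rum
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The Cloudflare endpoints are blocked for every bot...
print(rp.can_fetch("Googlebot", "https://dailystuff.nl/cdn-cgi/rum"))  # False
# ...but regular pages stay crawlable.
print(rp.can_fetch("Googlebot", "https://dailystuff.nl/blog/"))        # True
```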
After this, Google Search Console will warn you that the robots.txt file is blocking Google from crawling parts of your website. This is expected: you can ignore the warning, or remove the endpoints from the robots.txt file again, but then the endpoints will be crawled by bots and count against your crawl budget.
Block Cloudflare endpoints via HTTP headers
Another option to block the Cloudflare endpoints is to use HTTP headers. This is useful if you don’t want to rely on the robots.txt file. Add the following to your _headers file and upload it to your website. This adds the X-Robots-Tag: noindex, nofollow header to the endpoints, as described in the Google Search Central documentation. Note that in the _headers file of Cloudflare Pages, each URL pattern needs its own header lines:

```
/cdn-cgi/bm/cv/*
  X-Robots-Tag: noindex, nofollow
/cdn-cgi/challenge-platform/*
  X-Robots-Tag: noindex, nofollow
/cdn-cgi/images/trace/*
  X-Robots-Tag: noindex, nofollow
/cdn-cgi/rum
  X-Robots-Tag: noindex, nofollow
/cdn-cgi/scripts/*
  X-Robots-Tag: noindex, nofollow
/cdn-cgi/styles/*
  X-Robots-Tag: noindex, nofollow
/cdn-fpw/sxg/*
  X-Robots-Tag: noindex, nofollow
```
After uploading the file to your website, you can check an affected URL with the developer tools of any recent browser. In Firefox, for example, press F12 to open the Web Developer Tools and select the Network tab; then reload the page and inspect the response headers of the affected URL.