Using robots.txt with Cloudflare Pages
Cloudflare Pages is a great way to host your static website. It’s fast, easy, and free to start with. While the egress traffic is free, you still pay in other ways when bots crawl your website. This is where robots.txt comes in handy: you can use it to block bots from crawling your website without having to configure anything in Cloudflare.
Use robots.txt for a sitemap
The file robots.txt can be used to tell bots where to find your sitemap, which in turn tells search engines what pages are available on your website. Let’s start with a simple example that points bots to the sitemap.
sitemap: https://dailystuff.nl/sitemap.xml
After uploading the file together with the sitemap to the website, you can check that it’s working by visiting https://dailystuff.nl/robots.txt.
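If you prefer to script this check, the snippet below is a minimal sketch using only the Python standard library; it uses the example domain from this post, so swap in your own.

from urllib.request import urlopen

# Fetch the deployed robots.txt from the example domain.
with urlopen("https://dailystuff.nl/robots.txt") as response:
    body = response.read().decode("utf-8")

# The file should contain the sitemap directive we just added.
for line in body.splitlines():
    if line.lower().startswith("sitemap:"):
        print("sitemap directive found:", line)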
Using robots.txt to block bots
The file robots.txt can also be used to block bots from crawling your website. This is useful for bots that do not follow the rules set in RFC 9309, for example bots that crawl your website too fast or use too much bandwidth. The Disallow: / statement blocks these bots for the whole site; add the following to your robots.txt file.
sitemap: https://dailystuff.nl/sitemap.xml
User-agent: Nuclei
User-agent: WikiDo
User-agent: Riddler
User-agent: PetalBot
User-agent: Zoominfobot
User-agent: Go-http-client
User-agent: Node/simplecrawler
User-agent: CazoodleBot
User-agent: dotbot/1.0
User-agent: Gigabot
User-agent: Barkrowler
User-agent: BLEXBot
User-agent: magpie-crawler
User-agent: MJ12bot
User-agent: AhrefsBot
Disallow: /
Note
The list of bots is not complete; the entries were picked at random for the example. You can find a more complete list at robotstxt.org.
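To verify that the rules parse the way you expect, you can run Python’s standard urllib.robotparser against the deployed file. This is a minimal sketch; the user agents are taken from the example above.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://dailystuff.nl/robots.txt")
rp.read()

# AhrefsBot is in the Disallow: / group, so this should print False.
print(rp.can_fetch("AhrefsBot", "https://dailystuff.nl/"))
# A crawler that is not listed is still allowed and should print True.
print(rp.can_fetch("Googlebot", "https://dailystuff.nl/"))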
Block Cloudflare endpoints via robots.txt
Another use case for the robots.txt file is to block Cloudflare endpoints. These endpoints are used by Cloudflare to provide services like the Web Application Firewall (WAF) and Bot Management. You can block these endpoints by adding the following to your robots.txt file and uploading it to your website.
sitemap: https://dailystuff.nl/sitemap.xml
User-agent: *
Disallow: /cdn-cgi/bm/cv/
Disallow: /cdn-cgi/challenge-platform/
Disallow: /cdn-cgi/images/trace/
Disallow: /cdn-cgi/rum
Disallow: /cdn-cgi/scripts/
Disallow: /cdn-cgi/styles/
Disallow: /cdn-fpw/sxg/
After this, Google Search Console will show a warning that the robots.txt file is blocking Google from crawling your website. This is expected; you can ignore the warning or remove the endpoints from the robots.txt file, but then the endpoints will be crawled by bots and count against your crawl budget.
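The same robotparser check works here. This sketch assumes the robots.txt from this section is deployed and verifies a few of the /cdn-cgi/ paths for a crawler that falls under the User-agent: * group.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://dailystuff.nl/robots.txt")
rp.read()

# Each blocked endpoint should print False, even for Googlebot.
for path in ["/cdn-cgi/challenge-platform/", "/cdn-cgi/rum", "/cdn-cgi/scripts/"]:
    print(path, rp.can_fetch("Googlebot", "https://dailystuff.nl" + path))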
Block Cloudflare endpoints via HTTP headers
Another option to block the Cloudflare endpoints is to use HTTP headers. This is useful if you don’t want to use the robots.txt file. You can add the following to your _headers file and upload it to your website. This adds the X-Robots-Tag: noindex, nofollow header to the endpoints, as described in the Google Search Central documentation.
/cdn-cgi/bm/cv/*
/cdn-cgi/challenge-platform/*
/cdn-cgi/images/trace/*
/cdn-cgi/rum
/cdn-cgi/scripts/*
/cdn-cgi/styles/*
/cdn-fpw/sxg/*
X-Robots-Tag: noindex, nofollow
After uploading the file to your website, the affected URLs can be checked with the Web Developer Tools in any recent browser. In Firefox, for example, press F12 to open the Web Developer Tools and select the Network tab. Then reload the page and check the response headers for the affected URL.
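The header can also be checked from a script. The sketch below uses Python’s urllib and assumes the _headers file above is deployed, with one of the covered paths as an example.

from urllib.request import urlopen
from urllib.error import HTTPError

url = "https://dailystuff.nl/cdn-cgi/styles/"  # one of the paths covered above
try:
    response = urlopen(url)
except HTTPError as err:
    response = err  # an HTTP error response still carries its headers

# Expect "noindex, nofollow" once the _headers rules are active.
print(response.headers.get("X-Robots-Tag"))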