Migration to Cloudflare Pages

What started as a custom content management system quickly moved to WordPress to improve its maintainability, and that solution served its purpose over the years. Having an easy web-based editor to maintain everything is a good thing, but sadly also a bad thing. WordPress is a known target for attacks and you have to keep up with updates to avoid being compromised, but this also means you have to keep up with how WordPress generates its pages, otherwise the pages will not be shown correctly.

To reduce time and complexity, another solution was required, as deploying WordPress together with the content every time wasn’t very effective. Most of the content was already written in Markdown to bypass certain WordPress limitations, so the next question was how to deploy it straight from GitHub. Static website generators like Jekyll, Sphinx, and Pelican came into the picture as they remove the dependency on installed code and a database.

The next dependency to be solved was the hosting of the static website. After experimenting with GitHub Pages and GitLab Pages, and also hosting my own virtual server, I ended up with Cloudflare Pages. Hosting your own website still requires you to maintain a virtual server and the software on it, and to scale it out when required. A solution like Pages is basically AWS S3 behind a CDN without any other maintenance required.

The choice for Cloudflare Pages instead of GitHub or GitLab Pages is in the details, as they all serve the basic need of hosting a static website, but both GitHub Pages and GitLab Pages are aimed mainly at hosting project documentation. Cloudflare Pages also allows you to point multiple domains to the same content, set up redirects, add headers, and filter who can access your content.

Setting up Sphinx

Sphinx depends on a recent version of Python, but Cloudflare Pages by default uses Python 2.7 and installs dependencies based on the configuration file found at the root of the project. To circumvent this, we can set up a pipenv environment pinned to Python 3.7 and install Sphinx together with the module ablog:

$ pipenv --python 3.7
$ pipenv install sphinx ablog
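Pipenv records the dependencies and the required Python version in a Pipfile at the root of the project, which is the configuration file Cloudflare Pages picks up during the build. A minimal sketch of what the resulting file should look like (exact contents may differ depending on the pipenv version):

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
sphinx = "*"
ablog = "*"

[requires]
python_version = "3.7"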

After setting up the environment, the Sphinx quickstart command can create and configure the initial site:

$ pipenv run sphinx-quickstart --no-sep \
    -p 'Project Name' -a 'Author Name' \
    -v '' -r '' -l 'en' \
    --ext-intersphinx --no-batchfile .
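With --no-sep the source directory doubles as the project root, so the generated layout should look roughly like the sketch below (a Windows batch file is skipped because of --no-batchfile):

.
├── Makefile        # wraps sphinx-build: make html, make clean
├── conf.py         # Sphinx configuration
├── index.rst       # root document of the site
├── _build/         # build output, HTML under _build/html
├── _static/        # custom static assets
└── _templates/     # custom HTML templates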

The module ablog adds blogging functionality to Sphinx, but it still has to be enabled in the file conf.py by adding ablog to the extensions variable:

extensions = [
    ...
    'ablog',
    'sphinx.ext.intersphinx',
]

Sphinx is a static website generator and with make html it creates or updates the local build environment under _build/html:

$ pipenv run make html

The command make html processes only new or updated source files, so cross-references to unchanged files may become stale. For this, the build environment first needs to be cleaned with the command make clean. Luckily this doesn’t apply when pushing the site to Cloudflare Pages, as its build environment always starts clean.

$ pipenv run make clean
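On the Cloudflare Pages side, the build configuration then boils down to two settings in the dashboard; the values below are assumptions matching this setup:

Build command:          pipenv run make html
Build output directory: _build/html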

Setting up a development environment

Creating new pages requires a place to develop them. That could be done by running make html after every change and refreshing the browser, but it can also be done with sphinx-autobuild, which rebuilds automatically every time a file is saved.

$ pipenv install sphinx-autobuild --dev

To start the local development web server we need to give it the source and build directories manually. It will start a web server on localhost:8000 and automatically rebuild and refresh the browser on changes.

$ pipenv run sphinx-autobuild . _build/html/

Headers for X-Robots-Tag

As all sites point to the same storage location, we can’t use robots.txt to tell search engines not to index one site or not to archive the other. This can be solved by sending X-Robots-Tag headers for certain sites or URLs instead of adding the file robots.txt. Cloudflare Pages processes files starting with an underscore in the root folder of the build as configuration files, and _headers is the one for additional headers.

By creating the file _headers with the content below we tell Cloudflare Pages which headers to include in the response for which site. The first site is the project site from Cloudflare Pages itself, which shouldn’t be crawled and indexed. The second entry tells others not to create a cached version, like Google likes to do for example. This doesn’t stop the Internet Archive from making a copy, as they stopped honoring these kinds of requests; the only method left is to block them based on their user-agent string in Cloudflare WAF, as shown in the sketch after the _headers example.

The third site listed is an interesting case: as we currently can’t do domain-based redirects with Cloudflare Pages, another solution is needed. Google for example doesn’t give a penalty when your content is offered via multiple sites or URLs, but they like to know the canonical location so that location can be listed in the search results. With the third entry Cloudflare Pages signals to the search engine that it isn’t a problem to browse the site, but not to index it. By setting the variable html_baseurl, Sphinx will generate canonical metadata as described in RFC 6596 for Google to process.

https://:project.pages.dev/*
  X-Robots-Tag: noindex, nofollow

https://example.org/*
  X-Robots-Tag: noarchive

https://www.example.org/*
  X-Robots-Tag: noindex
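A minimal sketch of such a WAF block rule in Cloudflare’s expression language; the exact user-agent string is an assumption and should be verified against the crawler actually seen in the access logs:

(http.user_agent contains "archive.org_bot")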

Sphinx by default doesn’t add any additional files to the build environment unless they’re listed in the variable html_extra_path. In the configuration file conf.py the base URL has to be set to the desired site and the file _headers has to be listed so it will be copied to the build environment.

html_baseurl = 'https://example.org'
html_extra_path = [
    ...
    "_headers",
]
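With html_baseurl set, Sphinx adds a canonical link element to the head of every generated page, roughly like the line below (the page name is just an example):

<link rel="canonical" href="https://example.org/about.html" />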

RSS Feeds

By default ablog also creates an RSS feed and updates the metadata on all pages to announce its location. That location is different from the standard WordPress location, and here Cloudflare Pages allows us to add a redirection map to the site. Since the redirection is part of the HTTP protocol, most clients will not even notice that the location has changed.

The redirection options are still limited both in capabilities and amount, but should be sufficient for most use cases. For the RSS-feed redirection, only the lines below need to be added to the file _redirects. After this everyone will get a redirect when /feed or /feed/ is requested.

/feed /blog/atom.xml 301
/feed/ /blog/atom.xml 301

Like the file _headers, we also need to tell Sphinx in the configuration file conf.py that the file _redirects is part of the build.

html_extra_path = [
    ...
    "_redirects",
]
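Once deployed, the redirect can be verified from the command line; a quick check with curl against the example domain should show the 301 and the new location (output trimmed):

$ curl -sI https://example.org/feed
HTTP/2 301
location: /blog/atom.xml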

Generate a sitemap

Letting search engines discover your content by themselves is an option, but telling them via a sitemap what the site structure is and what has changed can speed things up considerably. Besides an RSS feed, you can also inform search engines via a sitemap.xml file; both Google and Microsoft use them. There is a module called sphinx-sitemap for Sphinx that can be installed:

$ pipenv install sphinx-sitemap

To enable the module sphinx-sitemap, the configuration file conf.py has to be updated by adding the extension sphinx_sitemap and listing robots.txt as an additional file to be included during the build process.

extensions = [
    ...
    'sphinx_sitemap',
]

html_extra_path = [
    ...
    "robots.txt",
]
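Since html_baseurl is already set for the canonical metadata, sphinx-sitemap can use it to build absolute URLs. The generated sitemap.xml should roughly look like the trimmed sketch below; the exact URLs depend on settings such as sitemap_url_scheme, and the page names are just examples:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/about.html</loc></url>
  <url><loc>https://example.org/blog/migration-to-cloudflare-pages.html</loc></url>
</urlset>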

Sitemap files can be manually submitted to Google Search Console for example, but by updating the file robots.txt with the location of the sitemap, others can discover the sitemap file as well.

Sitemap: https://example.org/sitemap.xml

Creating pages and posts

Creating pages and posts is how it all started, and for pages it is as simple as creating a reStructuredText file with a title.

About
=====
...

Creating posts is bound to some rules: the reStructuredText file needs to match the variable blog_post_pattern, whose default value is blog/*.rst. Secondly, the file needs to contain a title and a .. post:: directive with a date that isn’t in the future, otherwise it will be ignored during the build process.

.. post:: Dec 22, 2021
   :tags: WordPress, Sphinx, Cloudflare
   :category: Random

Migration to Cloudflare Pages
=============================
...

Conclusions

While Pelican has a script to transform WordPress posts into reStructuredText files, I converted them manually as the layout needed to be corrected. For now it will be a minimalistic website and more templating needs to be done, but time no longer has to be invested in keeping WordPress up to date and related tasks. For new pages or posts, a new branch in git with content written in reStructuredText will be enough.

With the transition from WordPress to a static website, we lost the ability to send out a ping to inform search engines and blog readers that new content is available, but there may be a way to resolve this. A second feature that was lost is the option to leave a comment; it would be possible to use Disqus, but most comments are spam anyway.

Overall this is a huge step toward using GitOps in combination with Content-as-Code. The next steps will be completing workflows with GitHub Actions to regularly validate links and content, but also maintaining the remaining infrastructure with Terraform Cloud, as the local Terraform solution already controls most of the Cloudflare configuration.