Hacking the blog: caching

Well, it had to come to this, didn't it? At some point in the life of every programmer, you have to heave a deep sigh, realise that you have a problem where caching is the least bad solution, and get to it. That point in my life was today, when I finally got tired of waiting 30 seconds for my blog to publish. I knew why it was taking so long: every time I ran my rendering process, it repeated work that had already been done, rewriting files and thereby making my publishing process think they had changed and upload them to S3. The horror!

Let me quickly sketch out how publishing my blog works. I use S3 static website hosting, which means that publishing my website is nothing more than uploading files to my jmglov.net S3 bucket. I'm using Babashka to manage all this, so I have a task like this in my bb.edn:

{:tasks
 {publish-blog {:doc "Publish blog"
                :depends [render-blog]
                :task (shell "aws s3 sync --delete public/blog/ s3://jmglov.net/blog/")}}}

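aws s3 sync uses rsync-like logic to upload only files that have changed: a file is uploaded if it is missing from the bucket, its size differs, or the local copy has a newer modification time than the one on S3. This is good for me, as I have a bunch of images, and I don't want to upload them over and over again; it takes time, and eventually AWS will charge me for the bandwidth.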
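To check what a sync would transfer without actually uploading anything, the AWS CLI offers a --dryrun flag. As a sketch only (the task name publish-blog-dry-run is hypothetical and not part of the original bb.edn), a companion task could look like this:

```clojure
{:tasks
 {;; Hypothetical task, added for illustration only.
  ;; --dryrun prints the operations sync would perform without executing them.
  publish-blog-dry-run {:doc "Show what publishing would upload or delete"
                        :depends [render-blog]
                        :task (shell "aws s3 sync --dryrun --delete public/blog/ s3://jmglov.net/blog/")}}}
```

Running it before and after a caching change is one way to see whether fewer files are being flagged for upload.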

Let's take a look at how the blog is rendered:

Copy all of the images from assets/ to public/blog/assets/
Copy the style.css file to public/blog/
Read in all of the posts from posts.edn
For each post, read the markdown source file, render it to HTML, insert it as the body into the page template with Selmer, and write the resulting HTML file to public/blog/
Create an archive.html page with links to all the posts
Create a tags/index.html page with links to all of the tags
For each tag, create a page with links to all the posts with that tag
Create a top-level index.html page with the last three posts
Create an atom.xml RSS feed with all of the posts
Create a planetclojure.xml RSS feed with posts tagged "clojure" or "clojurescript"

Each one of these steps is creating a file in public/blog that will be uploaded to S3 if the local file is newer than the file on S3. Without any caching, all of these files will be created every time I render the blog, which means they will always be uploaded, and this was what I was running into.
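To make the problem concrete, here is a minimal sketch showing that rewriting a file with byte-for-byte identical content still bumps its modification time. The scratch file and the timestamp are made up for illustration and are not taken from the blog's code:

```clojure
(require '[babashka.fs :as fs])

;; Hypothetical scratch file standing in for a rendered page.
(def page (fs/file (fs/create-temp-dir) "archive.html"))

(spit page "<h1>Archive</h1>")
;; Pretend the page was rendered long ago (epoch millis, whole seconds).
(.setLastModified page 1000000000000)
(def before (.lastModified page))

;; Re-render with identical content.
(spit page "<h1>Archive</h1>")
(def after (.lastModified page))

(println "content unchanged, mtime bumped?" (> after before))
```

A newer local timestamp is enough for the sync to treat the file as changed, even though nothing in it differs.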

Here's how the asset files used to be handled:

(ns render-blog
  (:require
   [babashka.fs :as fs]
   [lib]))

(def blog-dir (fs/file "blog"))
(def out-dir (fs/file "public" "blog"))
(def asset-dir (fs/create-dirs (fs/file out-dir "assets")))

(fs/copy-tree (fs/file blog-dir "assets") asset-dir
              {:replace-existing true})

fs/copy-tree is basically the same thing as cp -r: it copies all of the files from blog/assets to public/blog/assets. The problem is that it changes the modification timestamp on every file it copies, thus making aws s3 sync think each one is a newer file and upload it. What I would like to do instead is only copy the new and modified asset files to public/blog/assets.

In order to do this, I wrote a new function, copy-tree-modified, and used it like this:

(lib/copy-tree-modified (fs/file blog-dir "assets")
                        asset-dir
                        (.getParent out-dir))

Here's what the function looks like:

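(defn copy-tree-modified [src-dir target-dir out-dir]
  (let [;; files in src-dir newer than the newest file in target-dir
        modified-paths (fs/modified-since (fs/file target-dir)
                                          (fs/file src-dir))
        ;; paths in src-dir with no counterpart under out-dir yet
        new-paths (->> (fs/glob src-dir "**")
                       (remove #(fs/exists? (fs/file out-dir %))))]
    (doseq [path (concat modified-paths new-paths)
            :let [target-path (fs/file out-dir path)]]
      (fs/create-dirs (.getParent target-path))
      (println "Writing" (str target-path))
      ;; a modified file already exists at the target, so allow overwriting
      (fs/copy (fs/file path) target-path {:replace-existing true}))))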

I'll walk you through what's going on here:

fs/modified-since returns a list of the files in src-dir (which in the case of my assets, is blog/assets) which have been modified since target-dir (public/blog/assets) was last modified, that is, more recently than the newest file in it.
Since this will not pick up every file that has been added to src-dir (a newly added file can carry an old modification time), I do an fs/glob to get a list of everything in src-dir, then remove the paths that already exist under out-dir.
I concatenate the modified files and the new files and then doseq over them, creating subdirectories as needed, and copy each one to the same relative path under out-dir (public in the call above, so the assets land in public/blog/assets).
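The two lookups described above can be tried out in isolation. The following is a simplified sketch, not the blog's code: it uses flat scratch directories with made-up file names and pinned timestamps, and it matches files by name rather than by relative path.

```clojure
(require '[babashka.fs :as fs])

;; Scratch directories standing in for blog/assets (src) and
;; public/blog/assets (target); all names here are hypothetical.
(def root (fs/create-temp-dir))
(def src (fs/create-dirs (fs/file root "src")))
(def target (fs/create-dirs (fs/file root "target")))

(defn touch!
  "Writes file `name` in `dir` and pins its modification time to `millis`."
  [dir name millis]
  (let [f (fs/file dir name)]
    (spit f name)
    (.setLastModified f millis)
    f))

(def t0 1600000000000) ; arbitrary base time in epoch millis

(touch! target "old.png" (+ t0 60000))    ; copied on an earlier render
(touch! src "old.png" t0)                 ; unchanged since then
(touch! src "edited.png" (+ t0 120000))   ; edited after the last render
(touch! src "imported.png" t0)            ; new, but carries an old timestamp

;; Files in src newer than the newest file in target.
(def modified
  (set (map fs/file-name (fs/modified-since target src))))

;; Files in src that have no counterpart in target.
(def missing
  (->> (fs/glob src "**")
       (filter fs/regular-file?)
       (remove #(fs/exists? (fs/file target (fs/file-name %))))
       (map fs/file-name)
       set))

(println "modified:" modified)
(println "missing from target:" missing)
```

In this setup imported.png only shows up in the second lookup, which is why the timestamp check alone is not enough. A file can also appear in both lists, so a real implementation may want to deduplicate before copying.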

This handles recursively copying directories, but how about single files? We can take a look at how the style.css used to be handled:

(let [style-src (fs/file templates-dir "style.css")
      style-target (fs/file out-dir "style.css")]
  (fs/copy style-src style-target))

Now it's subtly changed to use a new copy-modified library function:

(let [style-src (fs/file templates-dir "style.css")
      style-target (fs/file out-dir "style.css")]
  (lib/copy-modified style-src style-target))

The copy-modified function looks like this:

(defn stale? [src target]
  (seq (fs/modified-since target src)))

(defn copy-modified [src target]
  (when (stale? src target)
    (println "Writing" (str target))
    (fs/create-dirs (.getParent (fs/file target)))
    ;; target may already exist (src was edited), so allow overwriting
    (fs/copy src target {:replace-existing true})))

We're using fs/modified-since in a slightly different way here. When both the target and the src are files, fs/modified-since will also notice when src exists but target doesn't (meaning that src has been added since last time we rendered). Wrapping it in a seq will make it return nil when the list of modified files is empty, so stale? yields a falsey value when nothing has changed and a truthy one otherwise.
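Here is a small sketch of the three cases stale? has to distinguish: target missing, target up to date, and source edited after the copy. The scratch paths and pinned timestamps are assumptions made for the example; only the stale? definition comes from the post.

```clojure
(require '[babashka.fs :as fs])

;; Same definition as in the post.
(defn stale? [src target]
  (seq (fs/modified-since target src)))

;; Hypothetical scratch files.
(def dir (fs/create-temp-dir))
(def src (fs/file dir "style.css"))
(def target (fs/file dir "out" "style.css")) ; does not exist yet

(spit src "body { color: black; }")
(.setLastModified src 1600000000000)

;; Case 1: the target has never been written.
(def stale-when-missing (boolean (stale? src target)))

;; Case 2: simulate a copy made after the source was last edited.
(fs/create-dirs (fs/parent target))
(spit target "body { color: black; }")
(.setLastModified target 1600000060000)
(def stale-when-up-to-date (boolean (stale? src target)))

;; Case 3: simulate editing the source after the copy.
(.setLastModified src 1600000120000)
(def stale-after-edit (boolean (stale? src target)))

(println stale-when-missing stale-when-up-to-date stale-after-edit)
```

Wrapping the calls in boolean is only for readable output; inside a when, the raw seq-or-nil result works just as well.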

The final piece of the puzzle is how to handle things like posts and archives and tags, which should only be written when there is a new post, an updated post, or something has changed with the rendering code or templates. I'll illustrate this by showing how the archive page is handled:

(def posts-file "posts.edn")
(def rendering-system-files ["render_blog.clj" templates-dir])

(let [archive-file (fs/file out-dir "archive.html")
      new-posts? (lib/stale? posts-file archive-file)
      rendering-modified? (lib/rendering-modified? rendering-system-files
                                                   archive-file)]
  (when (or rendering-modified? new-posts?)
    (println "Writing archive page" (str archive-file))
    (spit archive-file
          (selmer/render base-html
                         {:skip-archive true
                          :title (str blog-title " - Archive")
                          :body (hiccup/html (lib/post-links {} "Archive" posts))}))))

To determine if there are new posts, we check if the archive file is stale with respect to the posts file, meaning that the posts file has changed since we last wrote the archive file.

To determine if any of the rendering code has changed, we use lib/rendering-modified?:

(defn rendering-modified? [rendering-system-files target-file]
  (seq (fs/modified-since target-file rendering-system-files)))

What we're asking here is if render_blog.clj or any of the template files have changed since we last wrote the archive file. If so, we want to re-render the archive file.
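Because fs/modified-since accepts a collection that mixes files and directories, one call covers both the script and everything under the templates directory. A minimal sketch, assuming a made-up scratch layout with a single template (base.html is a hypothetical name):

```clojure
(require '[babashka.fs :as fs])

;; Same definition as in the post.
(defn rendering-modified? [rendering-system-files target-file]
  (seq (fs/modified-since target-file rendering-system-files)))

;; Hypothetical scratch layout: one script plus a templates directory.
(def root (fs/create-temp-dir))
(def script (fs/file root "render_blog.clj"))
(def templates (fs/create-dirs (fs/file root "templates")))
(def base (fs/file templates "base.html"))
(def archive (fs/file root "archive.html"))

(doseq [f [script base archive]]
  (spit f "placeholder"))
(.setLastModified script 1600000000000)
(.setLastModified base 1600000000000)
(.setLastModified archive 1600000060000) ; rendered after both inputs

(def before-edit
  (boolean (rendering-modified? [script templates] archive)))

;; Editing a template inside the directory makes the archive out of date.
(.setLastModified base 1600000120000)
(def changed
  (map fs/file-name (rendering-modified? [script templates] archive)))

(println before-edit changed)
```

The returned sequence names the files that triggered the re-render, which can be handy to print when debugging why a page was rewritten.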

If you're interested in seeing this in action, take a look at render_blog.clj and lib.clj. Just note that things are not very polished, and there are likely to be bugs. 😬

Update: shortly after writing this code (the next morning on my dog walk, in fact), I realised that my caching was horribly broken. To see how I fixed it, check out Hacking the blog: actually caching.

jmglov's blog
