
Last time out, I was desperately trying to understand why my beautifully crafted page-aware lazy loading S3 list objects function was fetching more pages than it actually needed to fulfil my requirements (doesn't sound very lazy to me!), but to no avail. If you cast your mind back, I had set my page size to 50, and was taking 105 objects:

(comment

  (->> (list-objects logs-client prefix)
       (take 105))
  ;; => ("logs/E1HJS54CQLFQU4.2022-09-15-00.0125de6e.gz"
  ;;     "logs/E1HJS54CQLFQU4.2022-09-15-00.3b36a099.gz"
  ;;     ...
  ;;     "logs/E1HJS54CQLFQU4.2022-09-15-10.ae86e512.gz"
  ;;     "logs/E1HJS54CQLFQU4.2022-09-15-10.b4a720f9.gz")

)

But sadly seeing the following in my REPL buffer:

Fetching page 1
Fetching page 2
Fetching page 3
Fetching page 4

I know that some lazy sequences in Clojure realise in chunks, but those chunks are usually realised 32 at a time in my experience. It is actually absurdly hard to find any documentation that explains exactly how chunking works, but one can gather hints here and there from the dusty corners of the web's ancient past.

The hints all mention the number 32 (holiest of all powers of 2, clearly), and one even suggests looking at the implementation of clojure.core/map and seeing that map calls the magic chunk-first, which "takes 32 elements for performance reasons". Spelunking deeper into the source for the definition of chunk-first leads one to the following lines:

(defn ^:static  ^clojure.lang.IChunk chunk-first ^clojure.lang.IChunk [^clojure.lang.IChunkedSeq s]
  (.chunkedFirst s))

Which leads to Clojure's Java implementation, which leads to me reading a couple of the classes that implement the IChunk interface, looking for some mention of the chunk size, and running away in tears.
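
At least chunking itself is easy to observe from the REPL. Realising a single element of a chunked sequence such as a range realises the entire first chunk of 32 (the println is just there to make the realisation visible):

(comment

  (first (map #(do (println "Realising" %) %) (range 100)))
  ;; Realising 0
  ;; Realising 1
  ;; ...
  ;; Realising 31
  ;; => 0

)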

The funny thing about all of this is that I know that one is not supposed to use functions with side effects when processing lazy sequences. In fact, it says exactly that in the docstring for clojure.core/iterate:

(iterate f x)

Returns a lazy sequence of x, (f x), (f (f x)) etc. f must be free of
side-effects.

But I figured that it would "probably be fine for this use case." 😂

Having received my well-deserved comeuppance (albeit without completely understanding the form said comeuppance is taking), it's time to figure out how to lazily page without chunking. As luck would have it, right after I published my previous post, I opened up Planet Clojure in my RSS reader and saw a post by Abhinav Omprakash on "Clojure's Iteration function". According to the post, Clojure has a function called iteration, and:

One of the most common use cases for iteration is making paginated api calls.

OK, this looks interesting. Why in the world didn't I know about this? Well, Abhinav's post links to a post on the JUXT blog called "The new Clojure iteration function" (written by the irrepressible Renzo Borgatti!) wherein it is revealed that iteration is new in Clojure 1.11. In the post's introduction, Renzo mentions:

the problem of dealing with batched API calls, those requiring the consumer a "token" from the previous invocation to be able to proceed to the next. This behaviour is very popular in API interfaces such as AWS S3, where the API needs to protect against the case of a client requesting the content of a bucket with millions of objects in it.

He goes on to make a bold claim:

In the past, Clojure developers dealing with paginated APIs have been solving the same problem over and over. The problem is to create some layer that hides away the need of knowing about the presence of pagination and provides the seqable or reducible abstraction we are all familiar with. It is then up to the user of such abstractions to decide if they want to eagerly load many objects or consume them lazily, without any need to know how many requests are necessary or how the pagination mechanism works.

OK, I buy this, having solved this problem in many sub-optimal ways over the years. So iteration really sounds like what I want here. Let's see if I can modify my iterate-based code to use iteration instead. Here's what I ended up with last time:

(defn get-s3-page [{:keys [s3-client s3-bucket s3-page-size]}
                   prefix prev]
  (let [{token :NextContinuationToken
         truncated? :IsTruncated
         page-num :page-num} prev
        page-num (if page-num (inc page-num) 1)
        done? (false? truncated?)
        res (when-not done?
              (println "Fetching page" page-num)
              (-> (aws/invoke s3-client
                              {:op :ListObjectsV2
                               :request (mk-s3-req s3-bucket prefix s3-page-size token)})
                  (assoc :page-num page-num)))]
    res))

(defn s3-page-iterator [logs-client prefix]
  (partial get-s3-page logs-client prefix))

(defn list-objects [logs-client prefix]
  (->> (iterate (s3-page-iterator logs-client prefix) nil)
       (drop 1)
       (take-while :Contents)
       (mapcat (comp (partial map :Key) :Contents))))

The JUXT post helpfully walks through an example of listing objects in an S3 bucket, which is exactly what I'm doing, but unhelpfully bases the example on Amazonica (an excellent Clojure wrapper around the AWS Java SDK that I used for years until some cool kids from Nubank told me that all the cool kids were now using Cognitect's aws-api, and I wanted to be cool like them, so I decided to use it for my next thing, which turned out to be a great decision since my next thing was Blambda, which runs on Babashka, which can't use the AWS Java SDK anyway).

Where was I? Oh yeah, the JUXT blog. So it breaks down the arguments to iteration:

(iteration step & {:keys [somef vf kf initk]
                   :or {vf identity
                        kf identity
                        somef some?
                        initk nil}})

Looking at this, my get-s3-page function sounds a lot like step, in that it contains the logic for making a request to S3. However, step is a function taking one argument, and get-s3-page takes three, so clearly it can't be used as is. But the same was actually true for my previous attempt at paging that used iterate, and in fact I wrote a function to take care of this:

(defn s3-page-iterator [logs-client prefix]
  (partial get-s3-page logs-client prefix))

s3-page-iterator closes over the client and the prefix and returns a function that takes only one argument: prev, which is the previous page of results from S3. So that's step sorted!
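
As a quick REPL sanity check (using the same logs-client and prefix as in my previous post), the returned function happily accepts a nil previous response and fetches the first page:

(comment

  (def step (s3-page-iterator logs-client prefix))

  (step nil)
  ;; Fetching page 1
  ;; => {:Contents [...], :NextContinuationToken "...", :IsTruncated true, :page-num 1, ...}

)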

In order to figure out what functions I need for somef, vf, and kf (gotta love the terse names of variables in clojure.core!), I need to look at what get-s3-page returns, since all three of those functions operate on the return value of (step token):

(comment

  (->> (get-s3-page logs-client "logs/A1BCD23EFGHIJ4.2022-09-26-" nil)
       keys)
  ;; => (:Prefix
  ;;     :NextContinuationToken
  ;;     :Contents
  ;;     :MaxKeys
  ;;     :IsTruncated
  ;;     :Name
  ;;     :KeyCount
  ;;     :page-num)

)

I'll tackle vf and kf first, since they are pretty straightforward. vf needs to return the items from the current response page. Those items live in the map returned by get-s3-page under the :Contents key, and since keywords are functions that, when called with a map, look themselves up in the map, I can use the :Contents keyword as my vf! 🎉

kf returns the next token, which I have in the response as :NextContinuationToken, so it sounds like I should use that for kf. The only problem is that the second invocation of my step function will look like this:

(step (:NextContinuationToken response))

and get-s3-page expects prev to be the response itself, from which it knows how to extract the token all by itself. So I want to just pass the response to my function as-is, and luckily, Clojure has a function for that: identity, which returns its argument unchanged.
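
In other words, with identity as my kf, the "token" handed to the next invocation of step is simply the entire previous response:

(comment

  (identity {:NextContinuationToken "t1", :IsTruncated true})
  ;; => {:NextContinuationToken "t1", :IsTruncated true}

)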

Now it's time to look at somef, a function that returns true if the response contains results and false otherwise. In my case, get-s3-page makes a request to the S3 API and returns the response unless the previous response wasn't truncated, in which case it returns nil. So what I want for somef is a function that returns true for any non-nil value, which is exactly what clojure.core/some? does (not to be confused with clojure.core/some).
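
Since the names are so easy to mix up, here's a quick REPL refresher on the difference:

(comment

  ;; some? is just a nil check
  (some? {:IsTruncated false}) ;; => true
  (some? false)                ;; => true
  (some? nil)                  ;; => false

  ;; some searches a collection for the first truthy result
  (some even? [1 2 3])         ;; => true

)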

Now that somef, vf, and kf are sorted, I'll turn my roving eye to initk, which is the initial value for the token passed to my step function. Just like in my previous attempt, I can use nil as the initial argument.

So putting this all together, my new list-objects function would look like this:

(defn list-objects [logs-client prefix]
  (->> (iteration (s3-page-iterator logs-client prefix)
                  :vf :Contents
                  :kf identity
                  :somef some?
                  :initk nil)
       (mapcat (partial map :Key))))

Looks good, lemme test it out!

(comment

  (->> (list-objects logs-client prefix)
       (take 5))
  ;; => ("logs/A1BCD23EFGHIJ4.2022-09-25-00.0187bda9.gz"
  ;;     "logs/A1BCD23EFGHIJ4.2022-09-25-00.0e46ca54.gz"
  ;;     "logs/A1BCD23EFGHIJ4.2022-09-25-00.348fa655.gz"
  ;;     "logs/A1BCD23EFGHIJ4.2022-09-25-00.4345d6ea.gz"
  ;;     "logs/A1BCD23EFGHIJ4.2022-09-25-00.63005d64.gz")

)

Nice! Except for one thing. My REPL buffer reveals that I actually haven't fixed the problem I set out to fix:

Fetching page 1
Fetching page 2
Fetching page 3
Fetching page 4

Looks like I should have read a little further in the JUXT blog article, because Renzo explains exactly what's happening here:

The results of calling [get-s3-page] are batched items as a collection of collections. In general, we need to collapse the batches into a single sequence and process them one by one [...]

Surprisingly, accessing the [first 5 items from] the first page produces additional network calls for pages well ahead of what we currently need. This is an effect of using [mapcat, which always evaluates the first 4 arguments]!

The reader should understand that this is not a problem of iteration itself, but more about the need to concatenate the results back for processing maintaining laziness in place.

Renzo being Renzo, of course he has a solution to this:

(defn lazy-concat [colls]
  (lazy-seq
   (when-first [c colls]
     (lazy-cat c (lazy-concat (rest colls))))))
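
To convince myself that lazy-concat actually stays lazy, here's a little sketch with a hypothetical fetch-page function standing in for the S3 call (iterate produces an unchunked sequence, so nothing sneaks ahead):

(comment

  (defn fetch-page [n]
    (println "Fetching page" n)
    (range (* n 10) (* (inc n) 10)))

  (->> (iterate inc 0)
       (map fetch-page)
       lazy-concat
       (take 3))
  ;; Fetching page 0
  ;; => (0 1 2)

)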

I can fold this into my list-objects function:

(defn list-objects [logs-client prefix]
  (->> (iteration (s3-page-iterator logs-client prefix)
                  :vf :Contents
                  :kf identity
                  :somef some?
                  :initk nil)
       lazy-concat
       (map :Key)))

Since lazy-concat is sewing the lists returned by iteration together, I don't need the chunktacular mapcat anymore; I can just use regular old map. Let's see if this works:

(comment

  (->> (list-objects logs-client prefix)
       (take 5))
  ;; => ("logs/A1BCD23EFGHIJ4.2022-09-25-00.0187bda9.gz"
  ;;     "logs/A1BCD23EFGHIJ4.2022-09-25-00.0e46ca54.gz"
  ;;     "logs/A1BCD23EFGHIJ4.2022-09-25-00.348fa655.gz"
  ;;     "logs/A1BCD23EFGHIJ4.2022-09-25-00.4345d6ea.gz"
  ;;     "logs/A1BCD23EFGHIJ4.2022-09-25-00.63005d64.gz")

)

And the REPL buffer?

Fetching page 1

Amazing!

There's one last thing that's bugging me, though. If I look back at the docs for iteration, I see that it has some default arguments:

(iteration step & {:keys [somef vf kf initk]
                   :or {vf identity
                        kf identity
                        somef some?
                        initk nil}})

So vf and kf default to identity, somef defaults to some?, and initk defaults to nil. Taking a look at how I call iteration, things look quite familiar:

(iteration (s3-page-iterator logs-client prefix)
           :vf :Contents
           :kf identity
           :somef some?
           :initk nil)

My kf, somef, and initk all match the defaults! Looks like the Clojure core team kinda knows what they're doing. 😉

With this knowledge under my belt, I can simplify list-objects even further:

(defn list-objects [logs-client prefix]
  (->> (iteration (s3-page-iterator logs-client prefix)
                  :vf :Contents)
       lazy-concat
       (map :Key)))

The cool thing about all of this is that I could use the exact same get-s3-page function I had before, as well as the same s3-page-iterator function, and only needed to change list-objects and sprinkle in the magic lazy-concat function from Renzo's box o' fun!

Before you try this at home, be sure to read the JUXT blog post carefully enough not to miss this sentence, which probably should have been bold and inside the dearly departed HTML <blink> tag:

You need to remember to avoid using sequence with transducers for processing items even after the initial concatenation, because as soon as you do, chunking will hunt you down.
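
This is easy to demonstrate in the REPL (the println is mine): even over an unchunked source like iterate, sequence with a transducer realises items 32 at a time:

(comment

  (->> (iterate inc 0)
       (sequence (map #(do (println "Realising" %) %)))
       first)
  ;; Realising 0
  ;; Realising 1
  ;; ...
  ;; Realising 31
  ;; => 0

)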

Discuss this post here.

Published: 2022-10-02

Tagged: clojure

Paging Doctor AWS

I've been using AWS for quite a while now, and one thing that is as inevitable as death and taxes is dealing with paging. All of the APIs that return multiple items support pagination in some way, and anytime you write code that needs to act on all items, you need to handle fetching pages. My quest to parse my own access logs is no exception, since my logs are stored in S3 and my API needs to list all of the logs for a specific date, of which there could be thousands. The way I've typically handled paging in Clojure is with a loop / recur similar to this:

(defn get-items [{:keys [s3-client s3-bucket s3-page-size] :as logs-client}
                 prefix]
  (loop [items []
         token nil]
    (let [_ (println "Fetching page")
          res (aws/invoke s3-client
                          {:op :ListObjectsV2
                           :request (mk-s3-req s3-bucket prefix s3-page-size token)})
          items (concat items (:Contents res))]
      (if (:IsTruncated res)
        (recur items (:NextContinuationToken res))
        items))))

logs-client is a data structure containing the stuff I need to talk to S3:

{:aws-region "eu-west-1",
 :s3-client s3-client, ; a Cognitect aws-api client goes here
 :s3-bucket "logs.jmglov.net",
 :s3-prefix "logs/",
 :s3-page-size 25}

And prefix is the start of S3 keys for CloudFront access logs for a specific date, for example: logs/A1BCD23EFGHIJ4.2022-09-15-

Finally, mk-s3-req is a function that constructs a ListObjectsV2 request, adding the continuation token if it's non-nil:

(defn mk-s3-req [s3-bucket prefix s3-page-size continuation-token]
  (merge {:Bucket s3-bucket
          :Prefix prefix}
         (when s3-page-size
           {:MaxKeys s3-page-size})
         (when continuation-token
           {:ContinuationToken continuation-token})))
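
For example, with the page size of 25 from logs-client above:

(comment

  (mk-s3-req "logs.jmglov.net" "logs/A1BCD23EFGHIJ4.2022-09-15-" 25 nil)
  ;; => {:Bucket "logs.jmglov.net",
  ;;     :Prefix "logs/A1BCD23EFGHIJ4.2022-09-15-",
  ;;     :MaxKeys 25}

  (mk-s3-req "logs.jmglov.net" "logs/A1BCD23EFGHIJ4.2022-09-15-" 25 "t1")
  ;; => {:Bucket "logs.jmglov.net",
  ;;     :Prefix "logs/A1BCD23EFGHIJ4.2022-09-15-",
  ;;     :MaxKeys 25,
  ;;     :ContinuationToken "t1"}

)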

The loop / recur approach has the benefit of being fairly simple, but the drawback of being a lazily-ticking time bomb, as Stuart Sierra explains: each trip through the loop wraps the previous result in another lazy concat, and realising a deeply nested result can blow the stack. Here's a minimal sketch of the failure mode (the 10000 is just an arbitrarily large number of pages):
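
(comment

  (let [items (reduce concat '() (repeat 10000 [1 2 3]))]
    (first items))
  ;; => throws java.lang.StackOverflowError

)

It also has the drawback of fetching all of the items, regardless of how many I need. For example, if I need the first two items for some reason: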

(take 2 (get-items logs-client prefix))

I'll see something like this in my REPL buffer:

Fetching page
Fetching page
Fetching page
Fetching page
Fetching page
Fetching page
Fetching page
Fetching page
Fetching page

Now imagine instead of nine pages of items, I have hundreds. Yikes!

In practice, this hasn't been a problem for me, as I've seldom had enough pages for this to really matter, but it does feel yucky, to use a technical term. Surely we can do better, right?

Right. In fact, Clojure has an entire abstraction built around the concept of taking just what you need: the sequence abstraction. A sequence (often called a "seq" in the Clojure literature) is a logical list that is usually lazily realised, and many of the sequence library functions in clojure.core are lazy, such as map, filter, take, drop, etc. If we can find a way to generate a lazy sequence, the code above that takes the first two items in the sequence should only fetch the first page.

And where there's a will, Clojure has a way: the iterate function. According to the docs, it returns a lazy sequence generated by a function that is repeatedly called with the previous value returned by the function. This is easier to see than it is to explain:

(comment

  (->> (iterate inc 5)
       (take 5))
  ;; => (5 6 7 8 9)

)

If we unroll this in our heads, this is what's going on:

(comment

  (cons
   5
   (cons
    (inc 5)
    (cons
     (inc (inc 5))
     (cons
      (inc (inc (inc 5)))
      (cons
       (inc (inc (inc (inc 5))))
       nil)))))
  ;; => (5 6 7 8 9)

)

The cool thing about iterate is that our generator function f can do whatever it wants, as long as it takes one argument. So what if we write an f that fetches a page of results from S3? We have something similar to that in the body of the loop we wrote above, so let's adapt that a bit:

(defn get-s3-page [{:keys [s3-client s3-bucket s3-page-size]}
                   prefix
                   token]
  (let [_ (println "Fetching page")
        res (aws/invoke s3-client
                        {:op :ListObjectsV2
                         :request (mk-s3-req s3-bucket prefix s3-page-size token)})]
    :???))

Then we can do stuff like this:

(defn s3-page-iterator [logs-client prefix]
  (partial get-s3-page logs-client prefix))

(->> (iterate (s3-page-iterator logs-client prefix) nil)
     (take 2))

The s3-page-iterator function transforms our get-s3-page function, which takes three parameters, into a function that takes one parameter so we can use it with iterate. It does this through the magic of partial, which takes a function and some arguments, and returns a new function with those arguments applied. For example, we can create a partial function from +, which takes two (or more) arguments, that adds 1 to its argument:

(comment

  (def plus1 (partial + 1))

  (plus1 2)
  ;; => 3

)

There's an open question here, though. What in the world should get-s3-page return? Since it's currently called with a continuation token, and the definition of iterate is that it calls the function with the return value of its previous invocation, we have no choice but to return the continuation token from the result of the ListObjectsV2 call:

(:NextContinuationToken res)
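
Plugging that in for the :??? placeholder gives us this intermediate version:

(defn get-s3-page [{:keys [s3-client s3-bucket s3-page-size]}
                   prefix
                   token]
  (let [_ (println "Fetching page")
        res (aws/invoke s3-client
                        {:op :ListObjectsV2
                         :request (mk-s3-req s3-bucket prefix s3-page-size token)})]
    (:NextContinuationToken res)))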

The only problem with this is that now we're throwing away the actual items, so what we'll get is something like this:

(comment

  (->> (iterate (s3-page-iterator logs-client prefix) nil)
       (take 2))
  ;; => (nil
  ;;     "1wfyyUjO9xuARj07VdnGBHFPmwEwTyDn7VTywaAH6L417g/fWqfeVJRrCV+nFPFHwLVJ3+CWT6BXTdzyKIPsmZm4U0VpZRffW")

)

That's not all that interesting, to be honest. The one piece of good news is that our REPL buffer looks like this:

Fetching page

We have at least succeeded in being lazy! 🏆

As things stand, our function takes the token as its argument, meaning we have what we need in order to do the pagination, but not what we need in order to build up the sequence of items to return. If we think about where the token comes from, it's actually the same place as the items: the response of the ListObjectsV2 call. So what if we return the API response from the function, meaning that the function would then be called with that response on the next iteration? That could look something like this:

(defn get-s3-page [{:keys [s3-client s3-bucket s3-page-size]}
                   prefix prev]
  (let [{token :NextContinuationToken} prev
        _ (println "Fetching page")
        res (aws/invoke s3-client
                        {:op :ListObjectsV2
                         :request (mk-s3-req s3-bucket prefix s3-page-size token)})]
    res))

(comment

  (->> (iterate (s3-page-iterator logs-client prefix) nil)
       (take 2))
  ;; => (nil
  ;;     {:Prefix "logs/A1BCD23EFGHIJ4.2022-09-15-",
  ;;      :NextContinuationToken
  ;;      "1s4mtjEeNyCwY1wu7URgZz/kHpfSH3HhaG26VfcL8sgSNXZF/iYdAjSNQQpTiLF+TRoIB/tD93dC/QmqcmFYo+ZgxX0oZg7+v",
  ;;      :Contents
  ;;      [{:Key "logs/A1BCD23EFGHIJ4.2022-09-15-00.0125de6e.gz",
  ;;        :LastModified #inst "2022-09-15T00:39:56.000-00:00",
  ;;        :ETag "\"0fc8a817d4f9b742b6ae83292a181385\"",
  ;;        :Size 1496,
  ;;        :StorageClass "STANDARD"}
  ;;       ...
  ;;       {:Key "logs/A1BCD23EFGHIJ4.2022-09-15-02.95762404.gz",
  ;;        :LastModified #inst "2022-09-15T02:49:56.000-00:00",
  ;;        :ETag "\"617e1235db32c79c37f5776abc5ff3ec\"",
  ;;        :Size 800,
  ;;        :StorageClass "STANDARD"}],
  ;;      :MaxKeys 25,
  ;;      :IsTruncated true,
  ;;      :Name "logs.jmglov.net",
  ;;      :KeyCount 25})

)

This is definitely moving in the right direction, but there are a couple of issues. First things first, the first thing is nil, since that's the initial argument we gave to iterate. Second, the second thing is not a list of items, but rather a data structure containing a list of items under the :Contents key. We can address both of these issues by wrapping get-s3-page in a new function that handles turning a list of pages into a list of items:

(defn list-objects [logs-client prefix]
  (->> (iterate (s3-page-iterator logs-client prefix) nil)
       (drop 1)
       (mapcat (comp (partial map :Key) :Contents))))

(comment

  (->> (list-objects logs-client prefix)
       (take 2))
  ;; ...
  ;; ...
  ;; OMG what is going on here??? My REPL hasn't printed the result yet,
  ;; and I've been waiting several minutes! 😱

)

Something is deeply wrong here, clearly. Looking at our REPL buffer, we find a clue:

Fetching page
Fetching page
Fetching page
Fetching page
Fetching page
Fetching page
Fetching page
Fetching page
Fetching page
Fetching page
...

This looks like an infinite loop, and given that we're looping on the continuation token, that is probably the issue. Let's have a look at the AWS documentation for ListObjectsV2 and see what we can learn about the continuation token. In the response:

NextContinuationToken

NextContinuationToken is sent when IsTruncated is true, which means there are more keys in the bucket that can be listed. The next list requests to Amazon S3 can be continued with this NextContinuationToken.

Type: string

Aha! What is this IsTruncated of which they speak?

IsTruncated

Set to false if all of the results were returned. Set to true if more keys are available to return.

Type: Boolean

Let's think about what our get-s3-page function is doing when called by iterate. Assuming we have five pages of objects with the prefix logs/A1BCD23EFGHIJ4.2022-09-15-, here's what our requests will look like:

 -> NextContinuationToken nil
<- IsTruncated true, NextContinuationToken "t1"
 -> NextContinuationToken "t1"
<- IsTruncated true, NextContinuationToken "t2"
 -> NextContinuationToken "t2"
<- IsTruncated true, NextContinuationToken "t3"
 -> NextContinuationToken "t3"
<- IsTruncated true, NextContinuationToken "t4"
 -> NextContinuationToken "t4"
<- IsTruncated false
 -> NextContinuationToken nil
<- IsTruncated true, NextContinuationToken "t1"
...

Right, so we'll just cycle through the pages until the heat death of the universe (assuming the universe continues expanding forever and thus atoms get too far apart to bang against each other and generate thermodynamic energy, an assumption which is frankly outside the scope of this blog post which is supposed to be about making S3 requests and not theoretical astrophysics). This is clearly A Bad Thing™ (cycling through pages, I mean, not the heat death of the universe, though that would also be A Bad Thing™), so let's see if we can't reason our way out of this with a little hammock time (this is actually an excuse for me to go walk Rover, since the poor fellow has been waiting 20 minutes for me to get to a good stopping point).

[Image: a dog walks in a grassy field]

OK, speaking of stopping points, it seems like what we need in this function is a stopping point. We know that when IsTruncated is false, there is no next page, so we should stop right there. Let's see how we can do that in code:

(defn get-s3-page [{:keys [s3-client s3-bucket s3-page-size]}
                   prefix prev]
  (let [{token :NextContinuationToken
         truncated? :IsTruncated
         page-num :page-num} prev
        page-num (if page-num (inc page-num) 1)
        done? (false? truncated?)
        res (when-not done?
              (println "Fetching page" page-num)
              (-> (aws/invoke s3-client
                              {:op :ListObjectsV2
                               :request (mk-s3-req s3-bucket prefix s3-page-size token)})
                  (assoc :page-num page-num)))]
    res))

(comment

  (->> (iterate (partial get-s3-page
                         (assoc logs-client :s3-page-size 100) prefix)
                nil)
       (take 6))
  ;; => (nil
  ;;     {:Prefix "logs/E1HJS54CQLFQU4.2022-09-15-",
  ;;      :page-num 1,
  ;;      :Contents [...],
  ;;      :NextContinuationToken "1kkaMk4RnoZHxnSRa5TBnVMb9NECfmmq...",
  ;;      :MaxKeys 100,
  ;;      :IsTruncated true,
  ;;      :Name "logs.jmglov.net",
  ;;      :KeyCount 100}
  ;;     {:Prefix "logs/E1HJS54CQLFQU4.2022-09-15-",
  ;;      :page-num 2,
  ;;      :Contents [...],
  ;;      :NextContinuationToken
  ;;      "1Jz5W+GkccdkP6Jc7tDsdqqrHFlgThRSt2ZhgHh1uYPA1gIzR4aer2l...",
  ;;      :ContinuationToken
  ;;      "1kkaMk4RnoZHxnSRa5TBnVMb9NECfmmqFfDLlcxdn6GdCiMc8ZzNQRj...",
  ;;      :MaxKeys 100,
  ;;      :IsTruncated true,
  ;;      :Name "logs.jmglov.net",
  ;;      :KeyCount 100}
  ;;     {:Prefix "logs/E1HJS54CQLFQU4.2022-09-15-",
  ;;      :page-num 3,
  ;;      :Contents [...],
  ;;      :ContinuationToken
  ;;      "1Jz5W+GkccdkP6Jc7tDsdqqrHFlgThRSt2ZhgHh1uYPA1gIzR4aer2l...",
  ;;      :MaxKeys 100,
  ;;      :IsTruncated false,
  ;;      :Name "logs.jmglov.net",
  ;;      :KeyCount 36}
  ;;     nil
  ;;     {:Prefix "logs/E1HJS54CQLFQU4.2022-09-15-",
  ;;      :page-num 1,
  ;;      :Contents [...],
  ;;      :NextContinuationToken
  ;;      "1EPegB1wcwtgRoGRiJpM4YYjiTvLHYYVjKr+ghX30LhowHoHMAepdcR...",
  ;;      :MaxKeys 100,
  ;;      :IsTruncated true,
  ;;      :Name "logs.jmglov.net",
  ;;      :KeyCount 100})
       
)

This is much better! Now we can see that we fetch page 1, which contains 100 items (:KeyCount); page 2, which also contains 100 items; page 3, which is the last page and contains only 36 items; then since :IsTruncated is false the next time iterate calls get-s3-page, we don't fetch the page and just return nil; then we start over again with page 1, and would continue cycling forever if it hadn't been for the (take 6).

This experiment has accomplished something very important, however. Now we have a marker for when we've reached the last page and should therefore break the cycle! We can update our list-objects wrapper function to look for this marker:

(defn list-objects [logs-client prefix]
  (->> (iterate (s3-page-iterator logs-client prefix) nil)
       (drop 1)
       (take-while :Contents)
       (mapcat (comp (partial map :Key) :Contents))))

(comment

  (->> (list-objects logs-client prefix)
       (take 5))
  ;; => ("logs/E1HJS54CQLFQU4.2022-09-15-00.0125de6e.gz"
  ;;     "logs/E1HJS54CQLFQU4.2022-09-15-00.3b36a099.gz"
  ;;     "logs/E1HJS54CQLFQU4.2022-09-15-00.54775acb.gz"
  ;;     "logs/E1HJS54CQLFQU4.2022-09-15-00.6c612378.gz"
  ;;     "logs/E1HJS54CQLFQU4.2022-09-15-00.73072440.gz")

)

Victory!

We can test out our laziness by taking 105 items, which should require fetching the first two pages of results but nothing more:

(comment

  (->> (list-objects logs-client prefix)
       (take 105))
  ;; => ("logs/E1HJS54CQLFQU4.2022-09-15-00.0125de6e.gz"
  ;;     "logs/E1HJS54CQLFQU4.2022-09-15-00.3b36a099.gz"
  ;;     ...
  ;;     "logs/E1HJS54CQLFQU4.2022-09-15-10.ae86e512.gz"
  ;;     "logs/E1HJS54CQLFQU4.2022-09-15-10.b4a720f9.gz")

)

Strangely, our REPL buffer reports that we fetched three pages:

Fetching page 1
Fetching page 2
Fetching page 3

In fact, we saw the same thing even when we took just 5 items. What in the world is going on here?

Well, it turns out that Clojure realises lazy sequences in chunks; in other words, it optimistically calls the function producing the lazy sequence a few times to ensure that you have enough items in the sequence. In our case, our sequence seems to be realised in chunks of 3?

I have to admit that my knowledge of Clojure internals doesn't go this deep, and I hope someone reading this can explain things to me on Twitter.

In any case, at least we've accomplished our original goal of not fetching all the things when we only need some of the things. So that's something. 🤷

Discuss this post here.

Published: 2022-09-22

Tagged: clojure

Retrospecting the blog

Back in the beginning of June this year, I left my previous job after six years. Since I wouldn't start my new job until September 5th, I had a nice long summer vacation ahead of me! In fact, my vacation even started before my son's (who's in school) and my wife's (who's a teacher), so I would have some time on my hands. One of the things I decided to do with this abundance of time was to start blogging every day.

A tweet reading: I'm starting a nice long summer vacation today, and I'm going to try a blogging challenge wherein I write and publish something everyday, even if it's a tiny thing. I'm sure there will be some Clojure stuff, some Arsenal stuff, and likely some NixOS stuff as well.

Ironically, despite posting this in the morning of June 13th, it wasn't until noon on June 15th that I posted my first blog entry over on Medium. In my defence, I did spend the 13th and 14th feverishly working on chapter 1 of "Story of a mediocre fan" for my friend Tim's blog, but I have to admit that the optics aren't great. 😉

The reason for deciding to blog every day, as I intended to explain in my first post but got sidetracked (no surprise to people who know me) and ended up talking about tennis, was to improve my writing. As a kid, I read voraciously, and loved books so much I decided to start writing them. My first efforts were picture books, which my mom showed me how to fold and bind with a stapler, but then I moved on to a mystery, inspired by the Hardy Boys series. As this was a serious novel (according to my 7 year old perspective), I needed a more serious writing tool than a pencil. We didn't have a computer yet (that would come the following year, and is a story that I really should write about, if I haven't already (oh yeah, I kinda did already, though I feel the story would support a lengthier treatment (haha, I just realised that I put a parenthetical inside a parenthetical 😂 (what am I, a Lisp programmer?)))), but we did have a typewriter. This was a real typewriter too, not one of those fancy IBM Selectric ones. When you wanted a carriage return, you needed to reach up there and pull a lever (at least you were rewarded with an incredibly satisfying ding!).

I don't remember much about that book other than there was a scary cave chase scene. I'm not sure if I ever finished it, or even what my definition of finishing it would have been back then. I do remember that I started writing a Tolkien-inspired book about elves and such at some point, but I'm sure I didn't get very far with that. I wrote a few chapters of a spy thriller starring characters called The Sniper and The Assassin. I did a bunch of writing in high school, as you do, and then did even more in university, and then pretty much stopped. I did write stuff at work, of course, but it's mostly emails and technical documents, which is a very different sort of writing indeed.

I'm not one of those people who just knows that I have a great novel in me, and by golly I'll write it one of these days, but I am someone who enjoys writing and misses it when I'm not doing much of it. So when a summer with no work unfurled before me, I decided I was going to seize the day and get to writing: one blog post a day for the rest of the summer.

In the 88 days since I made that decision, I've written 57 posts (including this one). If I saw my goal as writing one post a day, every single day, I would have to conclude that I failed to reach that goal. However, the real goal was to get better at writing through frequent practice, and the act of writing one post a day was simply the process by which I planned to accomplish this. As with any process, the point is not to follow it to the letter, the point is that the process is supposed to enable you to make progress towards whatever your goal is.

Speaking of process, a vital part of any good process is time to reflect, which is formalised in many software methodologies as a periodic meeting called a retrospective. In the retrospective, the team discusses how things have been going and what could be improved. This concept is useful even in individual work, and in fact in individual work, we often don't allocate sufficient time for reflection, which can greatly harm our ability to learn from our mistakes and our successes.

This post is a retrospective of my summer of blogging.

What went well?

What could be improved?

What did I learn?

I'll wrap up this retrospective with some statistics. Here are the 10 most popular posts from the summer:

I suspect the "Story of a mediocre fan" series got even more views, since they were posted on a very widely read Arsenal blog, but I don't want to bother Tim to get those stats, since the point of all of this wasn't page views. 😉

Now that this experiment is over, I've concluded that I like blogging, and it's something that I want to continue. I won't set a regular schedule for posting, but rather set aside time for writing (and hacking on quickblog and Blambda and the stuff that I often write about) and post when I have something interesting (to me) to post. I also want to try writing some more focused pieces, which I'll edit and get feedback from people and so on. We'll see how it goes! 🙂

Discuss this post here.

Published: 2022-09-10

Tagged: waffle blog
