Automate precaching resources

Posted: Apr 8, 2018

Nobody likes to wait. When a user clicks a link, they want an immediate response. If it takes a while, the user might switch to another tab and completely forget about the site which is still loading. For site owners, that means lost customers. Progressive Web Apps aim to fix this by responding almost as quickly as native apps do.

Native apps keep all resources on disk, so they don't need to download anything. Using new technologies shipped by browsers, we can serve resources from the browser's cache before the user requests them. This way, users can get an experience similar to native apps, and the instant response keeps the user's focus.

There are a few ways to precache resources:

At first glance, we might want to precache everything, but there are a few reasons not to do that:

Sirko Engine

Having considered the arguments above, I started working on Sirko Engine, which aims to be more accurate in precaching resources. Actually, Sirko Engine is only a part of the solution; there is also Sirko Client. Presently, the project has 2 big features:

Below, I describe how the project works technically. If you want to know how to install it, please refer to the installation guide.

Gathering data

To know which resources should be precached, the engine gathers information about how users navigate a site.

When the user visits the site, the client makes a request to the engine. That request includes the referrer, the current path and a list of URLs of assets on the current page (JS and CSS):

{
  "referrer": "/home",
  "current": "/project",
  "assets": [
    "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css",
    "https://demo.sirko.io/assets/css/style.css",
    "https://engine.sirko.io/assets/client.js",
    "https://demo.sirko.io/assets/app.js",
    "https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js",
    "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js",
    "https://demo.sirko.io/assets/project.js"
  ]
}
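To make this more concrete, here is a rough sketch of how such a request could be built and sent from the browser. Only the payload shape follows the example above; the endpoint path and the way assets are collected are assumptions for illustration, not Sirko Client's actual code.

// A hypothetical sketch of sending the navigation data to the engine.
// The endpoint path "/predict" is an assumption, not Sirko's actual API.
const assets = [...document.querySelectorAll('script[src], link[rel="stylesheet"]')]
  .map((el) => el.src || el.href);

fetch('https://engine.sirko.io/predict', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    referrer: document.referrer ? new URL(document.referrer).pathname : '',
    current: location.pathname,
    assets: assets
  })
});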

A relation between the referrer and the current page is a transition, and the transition has a direction. Internally, every user is represented by a separate session which keeps the directed transitions made by that particular user.

However, most sessions don't look that simple; there might be transitions back to previously visited pages.

It resembles a graph; actually, it is a graph.

The engine isn't interested in a particular user. So, if there are no more transitions from the user within 1 hour, the session expires, even though the user might come back later. Expiration is an important step: expired sessions contribute to the overall transition counts from one page to another. The overall graph looks like this:

You might have noticed a node without a path; it is the exit node. All sessions connected to this node are expired.

Let's review each element on this graph.

The session relation keeps the number of transitions made by a particular user:

{
  "occurred_at": 1521176451536,
  "count": 2,
  "key": "a9f81dfdefc9dad197bead6d812f0468dee1c5fc7976dc1ded8b0ae1d0535bd5"
}

During one session, the user might visit the same pages several times. To take that into consideration, the session relation keeps the count property. The occurred_at property is required to determine the age of the session, which is needed to expire inactive sessions and remove stale sessions (this step is described below).
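To illustrate how this timestamp could be used, here is a small JavaScript sketch applying the 1-hour and 7-day thresholds mentioned in this article. The real logic lives in the engine, so this is only an illustration.

// Illustration only: applying the age thresholds to occurred_at.
const HOUR = 60 * 60 * 1000;
const WEEK = 7 * 24 * HOUR;

function sessionState(session, now = Date.now()) {
  const age = now - session.occurred_at;

  if (age > WEEK) return 'stale';   // gets removed, counts are subtracted
  if (age > HOUR) return 'expired'; // contributes to the overall transition counts
  return 'active';
}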

The transition relation keeps the total number of transitions made by all users:

{
  "updated_at": 1521057839027,
  "count": 14
}

For example, suppose we have 3 session relations between the /about and /contact pages, and each of those sessions keeps 1 as the value of its count property. That means the transition relation between those pages keeps 3 in its count property. The updated_at property stores the time when the relation was last updated.

The page node keeps details about a page:

{
  "path": "/home",
  "assets": [
    "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css",
    "https://demo.sirko.io/assets/css/style.css",
    "https://engine.sirko.io/assets/client.js",
    "https://demo.sirko.io/assets/app.js",
    "https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js",
    "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"
  ]
}

Prediction model

To predict pages, the engine uses a Markov chain. The logic was described in another article of mine, although a few things have changed since then.
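Roughly speaking, the prediction picks the most probable transition from the current page based on the accumulated counts. Here is a simplified sketch of that idea; the data shape and numbers are made up, and the actual model runs inside the engine and differs in details.

// First-order Markov chain: the probability of the next page depends only on
// the current page and the accumulated transition counts.
const transitions = {
  '/home': { '/project': 14, '/about': 5 },
  '/about': { '/contact': 3 }
};

function predictNext(currentPath) {
  const candidates = transitions[currentPath] || {};
  const total = Object.values(candidates).reduce((sum, count) => sum + count, 0);

  let best = null;
  for (const [path, count] of Object.entries(candidates)) {
    const probability = count / total;
    if (!best || probability > best.probability) best = { path, probability };
  }
  return best;
}

predictNext('/home'); // => { path: '/project', probability: ~0.74 }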

Serving cached resources

Once the prediction is received from the engine, the client precaches the resources via the Cache Storage. Thus, when the user moves to another page, a service worker checks whether a requested resource is in the cache; if so, it is served from the cache, otherwise it is loaded normally. After the page has loaded, the cached resources get removed from the cache to avoid serving stale content.
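As a rough sketch (the cache name and details here are assumptions, not the actual Sirko Client implementation), the serving part of the service worker could look like this:

// A simplified fetch handler: serve a precached response if there is one,
// otherwise fall back to the network.
self.addEventListener('fetch', (event) => {
  event.respondWith(
    caches.open('sirko-precache').then((cache) =>
      cache.match(event.request).then((cached) => cached || fetch(event.request))
    )
  );
});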

Cache invalidation is a challenge. While a page is open, the user might submit data which might change the precached pages. Therefore, the service worker not only serves cached resources but also keeps an eye on fired requests. For example, if there is a request modifying data, the transition won't be tracked.

The service worker verifies requests between the referrer and the current page. In this example, the /contact page is the referrer and the message page is the current page. Obviously, the user entered something on the contact page, so the service worker spots it and tells the engine not to track this transition, because the message page cannot be precached in this case. It even works for AJAX requests.

The transition between the messages and home pages won't be tracked.
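A minimal sketch of that idea is shown below; the flag and the way it would be reported to the engine are hypothetical.

// If the page fires a request which modifies data (a form submission or an
// AJAX POST/PUT/DELETE), remember it so the next transition isn't tracked.
let dataModified = false;

self.addEventListener('fetch', (event) => {
  if (event.request.method !== 'GET') {
    dataModified = true;
  }
});

// Later, when reporting the next transition, the client could skip tracking:
// if (dataModified) { /* tell the engine not to track this transition */ }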

Stale sessions

Besides expired sessions, there are stale sessions. All sessions which have been kept in the DB for more than 7 days are stale. Stale sessions get removed, and their counts get subtracted from the total number of transitions between pages. Sites change, so some pages might be removed. The idea behind removing stale sessions is to slowly fade transitions between pages which aren't used anymore; eventually, they disappear.

For example, these pages don't have session relations anymore (probably, those pages were removed from the site); there is only the transition relation, which will be removed by the engine. The page nodes cannot stay orphaned, so they get removed as well. This operation makes sure there is no garbage in the DB which might mislead the prediction model and bloat the DB.
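A rough sketch of this fade-out follows; the actual work is done by the engine with Cypher queries against Neo4j, and the helper functions here are hypothetical.

// Removing a stale session subtracts its count from the transition; once a
// transition reaches zero, it disappears together with orphaned page nodes.
function removeStaleSession(session, transition) {
  transition.count -= session.count;

  if (transition.count <= 0) {
    deleteTransition(transition);      // hypothetical helper
    deleteOrphanedPages(transition);   // hypothetical helper
  }
  deleteSession(session);              // hypothetical helper
}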

Offline work

Above, I mentioned that precached resources get removed. Actually, that isn't quite true: the client moves them to a separate cache which is used when the user is offline.
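With the Cache Storage API, such a move could look roughly like this; the cache names are assumptions for illustration.

// Move a response from the precache to a separate cache used for offline work.
async function moveToOfflineCache(request) {
  const precache = await caches.open('sirko-precache');
  const offline = await caches.open('sirko-offline');
  const response = await precache.match(request);

  if (response) {
    await offline.put(request, response.clone());
    await precache.delete(request);
  }
}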

Only predicted resources are served while offline. However, there is a trick to cache the entire site for offline use, but it has its costs.

Technologies

The engine is written in Elixir. I've been working with Ruby for the last 12 years. So, for this project, I wanted a language which keeps me as productive as Ruby does, but which is also very fast and scalable and doesn't make me fight the language (this statement isn't about Ruby). It is easy to start working with Elixir; you just need to understand a few crucial things. A project is like an OS where applications are libraries which work concurrently; every application has processes, which are kind of like objects in OOP languages, and they also work concurrently. Processes have behavior and they might have state (very similar to objects in OOP, isn't it?).

The client is written in JavaScript. I chose Rollup to bundle it. Initially, I used Webpack, but after trying out Rollup, I discovered that Rollup compresses my JS code better. Anyway, I don't need most of Webpack's plugins; my library is plain JS code.

Neo4j was chosen as the DB. When your data structure is a graph, it makes sense to use a graph DB. Nodes and relations can have properties, which is a very useful feature for my project. Also, the Cypher query language is really powerful. When I can compute something in the DB without fetching the data, I prefer to do that; this way, memory consumption stays low. In Ruby projects I work with ActiveRecord; it is a great library, but it hides advanced features of DBs. For this project, I decided that I want access to everything the DB gives me.

Further work on the project

I am trying to create a task here and there for each of my ideas, but only finalized ones end up there. There are still lots of thoughts which I write down outside of GitHub. Currently, I have the following in mind (the order is arbitrary, no priority):

Before working on new features, I would like to gather feedback about the idea behind this project. If there isn't enough interest, there is no point in adjusting anything. So, please leave feedback in the comments.
