Anticipage: scalable pagination, especially for ACLs

Pagination is one of the hardest problems for web applications supporting access-control lists (ACLs). Drupal and Pressflow support ACLs through the node access system.

Problems with traditional pagination

  • Because pagination uses row offsets into the results, browsing listings where newly published items get added to the beginning of the results creates “page drift.” Page drift is where a user already browsing through paginated results sees, for example, items E, D, and C on page one, waits awhile, clicks to the next page, and sees items C, B, and A. Going back to page one again shows F (newly published), E, and D. Item C “drifted” to page two while the user was reading page one. If new items are published frequently enough, pagination can become unusable due to this drifting effect.
  • Even if content and ordering are fully indexed, jumping n rows into the results remains inefficient; it scales linearly with depth into pagination.
  • Paginating sets where the content and ordering are not fully indexed is even worse, often to the point of being unusable.
  • The design is optimized around visiting arbitrary page offsets, which does not reflect user needs. Users only need to make relative jumps in pagination of up to 10 pages (or so) in either direction or to start from the end of the results. (If users are navigating results by hopping to arbitrary pages to drill down to what they need, there are other flaws in the system.)

“Anticipage”

With a combination of paginating by inequality and, optionally, optimistic permission review, a site can paginate content with the following benefits:

  • No page drift
  • Stable pagination URLs that will generally include the same items, regardless of how much new content has been published to the beginning or end of the content listing
  • If the ordering is indexed, logarithmic time to finding the first item on a page, regardless of how many pages the user is into browsing
  • Minimal computation of JOINs, an especially big benefit for sites using JOINs for ACLs

The general strategy is to amortize the cost of pagination as the user browses through pages.

Paginating by inequality

The path to achieving fast pagination first involves a fresh strategy for sorting and slicing content. A “pagination key” must be selected for the intended set of content that:

  • Includes the column(s) desired for sorting. For a Drupal site, this might be the “created” column on the “node” table.
  • Is a superkey (unique for all rows in the table but not necessarily minimally unique). Sorting by the columns in a superkey is inherently deterministic. And because a superkey is also unique, it allows us to use where criteria on the deterministically sorted set to deterministically define pages. An existing set of sort columns for a listing can always be converted to a superkey by appending a primary key to the end.

For a Drupal site, a qualifying pagination key could be (created, nid) on the “node” table. This key allows us to deterministically sort the rows in the node table and slice the results into pages. Really, everyone should use such pagination keys regardless of pagination strategy in order to have a deterministic sort order.

Having selected (created, nid) as the key, the base query providing our entire listing would look something like this:

SELECT * FROM node ORDER BY created DESC, nid, DESC;

Traditionally, a site would then paginate the second page of 10 items in MySQL using a query like this:

SELECT * FROM node ORDER BY created DESC, nid, DESC LIMIT 10, 10;

But because we’re ordering by a pagination key (as defined above), we can simply run the base query for the first page and note the attributes of the final item on the page. In this example, the final node on the first page has a creation timestamp of “1230768000” and a node ID of “987.” We can then embed this data in the GET criteria of the link to the second page, resulting in running a query like this for rendering the second page:

SELECT * FROM node WHERE created <= 1230768000 AND (created <> 1230768000 OR nid < 987) ORDER BY created DESC, nid, DESC LIMIT 10;

We’re asking for the same sorting order but adding a WHERE condition carefully constructed to start our results right after the content on the first page. (Note: this query could also be dissected into a UNION if the database does not properly optimize the use of the index.) This strategy allows the database to fully employ indexes on the data to find, in logarithmic time, the first item on any page. Note how page drift becomes impossible when pagination happens using keys instead of offsets.

Should a system choose to support moving more than one page in either direction, it would either have to:

  • Read a sufficient depth into the results in nearby pages to obtain the necessary WHERE attributes. This is a bit inefficient but consistent with the rest of the approach.
  • Adopt a hybrid strategy by using a traditional-style query (a LIMIT that skips records) with WHERE conditions beginning the set on the adjacent page. For example, if a user were currently on page 9, the direct link to page 11 would load a page that runs the query for page 10 but starts its listing 10 items later (“LIMIT 10, 10”). Naturally, this becomes less efficient as we allow users to hop greater distances, but the running time, at worst, converges on how the traditional pagination approach works.

This inequality pagination strategy is already a huge win for pagination queries using expensive joins. If everything can be calculated in the database, this is about as good as it gets without denormalization or alternatives to relational databases. Unless, of course, we have a site where an optimistic permissions strategy works well:

An iterative, optimistic permissions strategy

One feature of ACLs is that they’re hard to generically and flexibly define in fixed schemas. Sometimes, it’s easiest to allow callback functions in the application that don’t have to fit into rigid ACL architectures. And for listings where a very large proportion of items are displayable to a very large proportion of users, it can be non-optimal to use a pessimistic permissions strategy where the database vets every item before sending it to the application.

Inequality-based pagination fits well with an optimistic, iterative pagination strategy:

  1. Fetch an initial batch of rows for a page without regard to permissions. The initial batch of rows need not be equivalent to the number intended for display on a page; the system could be optimized to expect approximately 20% of records it fetches to be non-displayable to most users.
  2. Test whether each item is displayable to the current user.
  3. Render and output the displayable items.
  4. Fetch more items if the quota intended for display on the page (say, 10 items) isn’t met. Each subsequent batch from the database may increase in size as the algorithm realizes that it’s finding a low proportion of displayable content.
  5. Repeat until the quota for the page is filled.

This strategy works well when a low percentage of items evenly distributed through result sets are locked away from general displayability. Fortunately, that case is quite common for large, public sites with:

  • Publishing workflows that exclude small quantities of content during the editorial process
  • Small quantities of content that need to be hidden, like Wikipedia for legally troublesome revisions
  • Small numbers of internal documents, like documentation intended for editors

Commenting on this Blog post is closed.

Comments

Very interesting read.

How do you approach a previous link? Would you simply advise to design without it or do you reverse the query and then reverse it back?

“reverse the query and then reverse it back”

Exactly. You’re just looking for the first X items before the first item on the current page.

Very nice approach.

What if your paginated content is a view of none linear items and/or your ORDER BY was not limited to created DESC?

Another approach would be to take the timestamp from the time the user first accessed the pagination and use it in your where clause. So the timestamp would function as a “caching” key and anything later than the timestamp would not be called. The SQL should also be faster.

“SQL … created <= [timestamp of user’s first access] ORDER BY…”