Software

Flexible Apache Configuration for Reusable Development Environments

Here at Four Kitchens we make BIG websites — a lot of them. In the past, keeping up with the development environments for all of these sites added a lot of overhead, which ultimately meant more time managing code than writing it. So, like good developers, we asked ourselves: “How can we make this better, lower maintenance, and reusable?” What we came up with isn’t necessarily anything new, but it has completely changed how our internal development process works.

Previously, we had one codebase per project with one user login to the server; for example, on our website redesign we had an “fk” codebase with an “fk” user. This is good in principle but quickly becomes problematic as your team grows. We decided a better approach would be to have individual user logins with sandboxed development environments. This solved our user problem but required some thinking and tinkering to make our Apache configuration flexible enough to support sandboxes.

Before we get started, you’ll need a server that you have root-level access to, with Apache and the following Apache modules installed:

  • mod_rewrite
  • mod_vhost_alias

To create our sandboxed environments we set up Apache with wildcard virtual document roots. Each developer needed their own sandbox and could have any number of projects in progress. Our goal was to have domains set up in the form DEVELOPER.PROJECT.DOMAIN.TLD, e.g. elliott.fk.fourkitchens.com.

Here’s an example using virtual document roots that’s derived from our default Apache configuration:

<VirtualHost *:80>
  ServerAdmin webmaster@localhost
  ServerAlias *.*.fourkitchens.com
 
  DirectoryIndex index.php
 
  UseCanonicalName Off
  VirtualDocumentRoot /home/%-4/www/%-3
 
  <Directory /home/*/www/*>
    Options Indexes FollowSymLinks MultiViews
    AllowOverride All
    Order allow,deny
    Allow from all
  </Directory>
 
  # ...
</VirtualHost>

The key pieces of this virtual host are the VirtualDocumentRoot and Directory directives. The %-N values in the VirtualDocumentRoot let Apache construct the document root on the fly from the requested domain name, with -N cherry-picking a segment of the domain counting from the end. So in elliott.fk.fourkitchens.com, %-2 returns fourkitchens, %-3 returns fk, and %-4 returns elliott, which means the configuration above maps that domain to /home/elliott/www/fk. Alternatively, you can cherry-pick segments from the beginning by using %N; in our example, %2 returns fk. We used %-N rather than %N so we could also support developer site subdomains (e.g. img.elliott.fk.fourkitchens.com), where we know the relative position of domain elements from the end, not the start, of the URL.

What this means for developers is that you no longer have to alter Apache settings to host new development sites. One VirtualHost catches them all, and adding a new site is as easy as creating a directory that Apache can read. If you’re setting up Drupal sites, you’ll need to make one change to your .htaccess file and uncomment the RewriteBase / line:

# If your site is running in a VirtualDocumentRoot at http://example.com/,
# uncomment the following line:
RewriteBase /

This system isn’t without its flaws but is a great starting point for flexible, reusable configurations. In future posts we’ll talk about how this empowers you to automatically provision new development instances and deploy code to staging and live sites using Jenkins. Happy hacking!

The CAP theorem is like physics to airplanes: every database must design around it

Back in 2000, Eric Brewer introduced the CAP theorem, an explanation of inherent tradeoffs in distributed database design. In short: you can’t have it all. (Okay, so there’s some debate about that, but alternative theories generally introduce other caveats.)

On Twitter, I recently critiqued a presentation by Bryan Fink on the Riak system for claiming that Riak is “influenced” by CAP. This sparked a short conversation with Justin Sheehy, also from the project. 140 characters isn’t enough to explain my objection in depth, so I’m taking it here.

While I give Riak credit for having a great architecture and pushing innovation in the NoSQL (non-relational database) space, it can no more claim to be “influenced” by CAP than an airplane design can claim influence from physics. Like physics to an airplane, CAP lays out the rules for distributed databases. With that reality in mind, a distributed database designed without regard for CAP is like an airplane designed without regard for physics. So, claiming unique influence from CAP is tantamount to claiming competing systems have a dangerous disconnect with reality. Or, to carry on the analogy, it’s like Boeing making a claim that their plane designs are uniquely influenced by physics.

But we all know Airbus designs their planes with physics in mind, too, even if they pick different tradeoffs compared to Boeing. And traditional databases were influenced by CAP and its ancestors, like BASE and Bayou from Xerox PARC. CAP says “pick two.” And they did: generally C and P. This traditional — and inflexible — design of picking only one point on the CAP triangle for a database system doesn’t indicate lack of influence.

What Riak actually does is quite novel: it allows operation at more than one point on the triangle of CAP tradeoffs. This is valuable because applications value different parts of CAP for different types of data or operations on data.

For example, a banking application may value availability for viewing bank balances. Lots of transactions happen asynchronously in the real world, so a slightly outdated balance is probably better than refusing any access if there’s a net split between data centers.

In contrast, transferring from one account to another of the same person at the same bank (say, checking to savings) generally happens synchronously. A bank would rather enforce consistency above availability. If there’s a net split, they’d rather disable transfers than have one go awry or, worse, invite fraud.

A system like Riak allows making these compromises within a single system. Something like MySQL NDB, which always enforces consistency, would either unnecessarily take down balance viewing during a net split or require use of a second storage system to provide the desired account-viewing functionality.

Anticipage: scalable pagination, especially for ACLs

Pagination is one of the hardest problems for web applications supporting access-control lists (ACLs). Drupal and Pressflow support ACLs through the node access system.

Problems with traditional pagination

  • Because pagination uses row offsets into the results, browsing listings where newly published items get added to the beginning of the results creates “page drift.” Page drift is where a user already browsing through paginated results sees, for example, items E, D, and C on page one, waits awhile, clicks to the next page, and sees items C, B, and A. Going back to page one again shows F (newly published), E, and D. Item C “drifted” to page two while the user was reading page one. If new items are published frequently enough, pagination can become unusable due to this drifting effect.
  • Even if content and ordering are fully indexed, jumping n rows into the results remains inefficient; it scales linearly with depth into pagination.
  • Paginating sets where the content and ordering are not fully indexed is even worse, often to the point of being unusable.
  • The design is optimized around visiting arbitrary page offsets, which does not reflect user needs. Users only need to make relative jumps in pagination of up to 10 pages (or so) in either direction or to start from the end of the results. (If users are navigating results by hopping to arbitrary pages to drill down to what they need, there are other flaws in the system.)

“Anticipage”

With a combination of paginating by inequality and, optionally, optimistic permission review, a site can paginate content with the following benefits:

  • No page drift
  • Stable pagination URLs that will generally include the same items, regardless of how much new content has been published to the beginning or end of the content listing
  • If the ordering is indexed, logarithmic time to find the first item on a page, regardless of how many pages deep the user has browsed
  • Minimal computation of JOINs, an especially big benefit for sites using JOINs for ACLs

The general strategy is to amortize the cost of pagination as the user browses through pages.

Paginating by inequality

The path to achieving fast pagination first involves a fresh strategy for sorting and slicing content. A “pagination key” must be selected for the intended set of content that:

  • Includes the column(s) desired for sorting. For a Drupal site, this might be the “created” column on the “node” table.
  • Is a superkey (unique across all rows in the table, though not necessarily minimal). Sorting by the columns of a superkey is inherently deterministic, and because a superkey is unique, it lets us use WHERE criteria on the deterministically sorted set to define pages deterministically. An existing set of sort columns for a listing can always be converted to a superkey by appending the primary key to the end.

For a Drupal site, a qualifying pagination key could be (created, nid) on the “node” table. This key allows us to deterministically sort the rows in the node table and slice the results into pages. Really, everyone should use such pagination keys regardless of pagination strategy in order to have a deterministic sort order.

Having selected (created, nid) as the key, the base query providing our entire listing would look something like this:

SELECT * FROM node ORDER BY created DESC, nid DESC;

Traditionally, a site would then paginate the second page of 10 items in MySQL using a query like this:

SELECT * FROM node ORDER BY created DESC, nid DESC LIMIT 10, 10;

But because we’re ordering by a pagination key (as defined above), we can simply run the base query for the first page and note the attributes of the final item on the page. In this example, the final node on the first page has a creation timestamp of “1230768000” and a node ID of “987.” We can then embed this data in the GET criteria of the link to the second page, resulting in running a query like this for rendering the second page:

SELECT * FROM node WHERE created <= 1230768000 AND (created <> 1230768000 OR nid < 987) ORDER BY created DESC, nid DESC LIMIT 10;

We’re asking for the same sorting order but adding a WHERE condition carefully constructed to start our results right after the content on the first page. (Note: this query could also be dissected into a UNION if the database does not properly optimize the use of the index.) This strategy allows the database to fully employ indexes on the data to find, in logarithmic time, the first item on any page. Note how page drift becomes impossible when pagination happens using keys instead of offsets.
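
Here’s a rough PHP sketch of how this might look in application code, assuming a PDO connection; the fetch_page() helper and the last_created/last_nid GET parameters are names made up for illustration, not part of any existing API:

<?php
// Key-based pagination sketch using the (created, nid) pagination key.
// Assumes an existing PDO connection in $pdo; fetch_page() and the
// last_created/last_nid GET parameters are hypothetical names.
function fetch_page(PDO $pdo, $last_created = NULL, $last_nid = NULL, $per_page = 10) {
  $order = ' ORDER BY created DESC, nid DESC LIMIT ' . (int) $per_page;
  if ($last_created === NULL) {
    // First page: just run the base query.
    return $pdo->query('SELECT * FROM node' . $order)->fetchAll(PDO::FETCH_ASSOC);
  }
  // Later pages: start strictly after the last item of the previous page.
  $stmt = $pdo->prepare(
    'SELECT * FROM node'
    . ' WHERE created <= :created AND (created <> :created2 OR nid < :nid)'
    . $order);
  $stmt->execute(array(
    ':created'  => (int) $last_created,
    ':created2' => (int) $last_created,
    ':nid'      => (int) $last_nid,
  ));
  return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// The next-page link simply embeds the pagination key of the last row shown:
$rows = fetch_page($pdo,
  isset($_GET['last_created']) ? $_GET['last_created'] : NULL,
  isset($_GET['last_nid']) ? $_GET['last_nid'] : NULL);
if ($rows) {
  $last = end($rows);
  $next_url = '?last_created=' . $last['created'] . '&last_nid=' . $last['nid'];
}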

Should a system choose to support moving more than one page in either direction, it would have to either:

  • Read a sufficient depth into the results in nearby pages to obtain the necessary WHERE attributes. This is a bit inefficient but consistent with the rest of the approach.
  • Adopt a hybrid strategy by using a traditional-style query (a LIMIT that skips records) with WHERE conditions beginning the set on the adjacent page. For example, if a user were currently on page 9, the direct link to page 11 would load a page that runs the query for page 10 but starts its listing 10 items later (“LIMIT 10, 10”), as in the sketch below. Naturally, this becomes less efficient as we allow users to hop greater distances, but the running time, at worst, converges on that of the traditional pagination approach.
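
For what it’s worth, here’s what that hybrid query could look like for the page 9 to page 11 jump, again assuming the (created, nid) key; $last_created and $last_nid are hypothetical variables holding the key of the last row on page 9:

<?php
// Hybrid jump sketch: from page 9 directly to page 11. Start after the key of
// page 9's last row, then skip one full page of 10 rows, i.e. "LIMIT 10, 10".
$stmt = $pdo->prepare(
  'SELECT * FROM node'
  . ' WHERE created <= :created AND (created <> :created2 OR nid < :nid)'
  . ' ORDER BY created DESC, nid DESC LIMIT 10, 10');
$stmt->execute(array(
  ':created'  => (int) $last_created,
  ':created2' => (int) $last_created,
  ':nid'      => (int) $last_nid,
));
$page11 = $stmt->fetchAll(PDO::FETCH_ASSOC);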

This inequality pagination strategy is already a huge win for pagination queries using expensive joins. If everything can be calculated in the database, this is about as good as it gets without denormalization or alternatives to relational databases. Unless, of course, we have a site where an optimistic permissions strategy works well:

An iterative, optimistic permissions strategy

One challenge with ACLs is that they’re hard to define generically and flexibly in fixed schemas. Sometimes, it’s easiest to allow callback functions in the application that don’t have to fit into rigid ACL architectures. And for listings where a very large proportion of items are displayable to a very large proportion of users, it can be suboptimal to use a pessimistic permissions strategy where the database vets every item before sending it to the application.

Inequality-based pagination fits well with an optimistic, iterative pagination strategy (a rough PHP sketch follows the steps below):

  1. Fetch an initial batch of rows for a page without regard to permissions. The initial batch of rows need not be equivalent to the number intended for display on a page; the system could be optimized to expect approximately 20% of records it fetches to be non-displayable to most users.
  2. Test whether each item is displayable to the current user.
  3. Render and output the displayable items.
  4. Fetch more items if the quota intended for display on the page (say, 10 items) isn’t met. Each subsequent batch from the database may increase in size as the algorithm realizes that it’s finding a low proportion of displayable content.
  5. Repeat until the quota for the page is filled.
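
Here’s a minimal PHP sketch of that loop. Both helpers are hypothetical: fetch_batch() stands in for the keyed query above (run without any ACL JOINs), and user_can_view() stands in for the per-item permission callback.

<?php
// Optimistic, iterative page fill (sketch). fetch_batch() and user_can_view()
// are hypothetical stand-ins for the keyed query and an ACL callback.
function build_page($account, $page_size = 10) {
  $display = array();
  $batch_size = (int) ceil($page_size * 1.2); // expect roughly 20% to be hidden
  $cursor = NULL;                             // (created, nid) of last fetched row

  while (count($display) < $page_size) {
    $rows = fetch_batch($cursor, $batch_size); // keyed query, no ACL joins
    if (!$rows) {
      break; // ran out of content entirely
    }
    foreach ($rows as $row) {
      $cursor = array($row['created'], $row['nid']);
      if (user_can_view($account, $row)) {     // per-item permission check
        $display[] = $row;
        if (count($display) == $page_size) {
          break;
        }
      }
    }
    // Finding a low proportion of displayable rows? Fetch bigger batches.
    $batch_size *= 2;
  }
  return $display;
}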

This strategy works well when a low percentage of items evenly distributed through result sets are locked away from general displayability. Fortunately, that case is quite common for large, public sites with:

  • Publishing workflows that exclude small quantities of content during the editorial process
  • Small quantities of content that need to be hidden, as with legally troublesome revisions on a site like Wikipedia
  • Small numbers of internal documents, like documentation intended for editors

Intelligent memcached and APC interaction across a cluster

Anyone experienced with high-performance, scalable PHP development is familiar with APC and memcached. But used alone, they each have serious limitations:

APC

  • Advantages
    • Low latency
    • No need to serialize/unserialize items
    • Scales perfectly with more web servers
  • Disadvantages
    • No enforced consistency across multiple web servers
    • Cache is not shared; each web server must generate each item

memcached

  • Advantages
    • Consistent across multiple web servers
    • Cache is shared across all web servers; items only need to be generated once
  • Disadvantages
    • High latency
    • Requires serializing/unserializing items
    • Sharding spreads items across multiple servers, but it remains one big, shared cache that every web server must query over the network

Combining the two

Traditionally, application developers simply think about consistency needs. If consistency is unnecessary (or the scope of the application is one web server), APC is great. Otherwise, memcached is the choice. There is, however, a third, hybrid option: use memcached as a coordination system for invalidation with APC as the main item cache. This functions as a loose L1/L2 cache structure. To borrow terminology from multimaster replication systems, memcached stores “tombstone” records.

The “extremely fresh” check for the APC item (see below) allows throttling hits to memcached. Even a one-second tolerance for cache incoherency massively limits the amount of traffic to the shared memcached pool.

Reading

The algorithm below may not be perfect, but I’ll revise it as I continue work on an implementation. A simplified PHP sketch of the read path follows the outline.

  1. Attempt to load the item from APC:
    1. On an APC hit, check if the item is extremely fresh or recently verified as fresh against memcached. (For perfect cache coherency, the answer is always “not fresh.”)
      1. If fresh, return the item.
      2. If not fresh, check if there is a tombstone record in memcached:
        1. If there is no tombstone (or the tombstone pre-dates the local item):
          1. Update the freshness timestamp on the local item.
          2. Return the local item.
        2. Otherwise, treat as an APC miss.
    2. On an APC miss, attempt to load the item from memcached:
      1. On a memcache hit:
        1. Store the item into APC.
        2. Return the item.
      2. On a soft memcache miss (the item is available but due for replacement), attempt to take out a semaphore in APC:
        1. If the APC semaphore was successful, attempt to take out a semaphore in memcached:
          1. If the memcached semaphore was successful:
            1. Write the semaphore to APC.
            2. Rebuild the cache item and write it (see below).
            3. Release the semaphore in memcached. (The semaphore in APC should clear itself very quickly.)
          2. If the memcached semaphore was unsuccessful:
            1. Copy the memcached rebuild semaphore to APC. Store this very briefly (a second or so); it is only to prevent hammering memcached for semaphore checks.
            2. Return the slightly stale item from memcache.
        2. If the APC semaphore was unsuccessful:
          1. Return the slightly stale item.
      3. On a hard memcache miss (no item available at all):
        1. Is a stampede to generate the item acceptable?
          1. If yes:
            1. Generate the item in real time.
            2. Store to the cache.
          2. If no:
            1. Use the APC/memcache semaphore system (see above) to lock regeneration of the item.
            2. If the current request cannot grab the semaphore, fail as elegantly as possible.
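
Here’s a stripped-down PHP sketch of the read path, covering the freshness and tombstone checks but leaving out the soft-miss and semaphore handling. It assumes the apc and memcached extensions; rebuild_item(), the tombstone key prefix, the stored array structure, and the one-second window are all just illustrative choices.

<?php
// Simplified L1 (APC) / L2 (memcached) read path with tombstone invalidation.
// rebuild_item() and the key/array layout are hypothetical.
define('FRESHNESS_WINDOW', 1); // seconds of tolerated cache incoherency

function cache_get($key, Memcached $memc) {
  // Local items are stored as array('data' => ..., 'created' => ..., 'verified' => ...).
  $local = apc_fetch($key, $hit);
  if ($hit) {
    // Extremely fresh or recently re-verified: skip memcached entirely.
    if (time() - $local['verified'] <= FRESHNESS_WINDOW) {
      return $local['data'];
    }
    // Otherwise, check the shared pool for a tombstone.
    $tombstone = $memc->get('tombstone:' . $key);
    if ($tombstone === FALSE || $tombstone < $local['created']) {
      // No newer invalidation: re-stamp freshness and serve the local copy.
      $local['verified'] = time();
      apc_store($key, $local);
      return $local['data'];
    }
    // The tombstone post-dates our copy: fall through and treat as an APC miss.
  }

  // APC miss (or stale local copy): try the shared memcached pool.
  $item = $memc->get($key);
  if ($item === FALSE) {
    // Hard miss, stampede acceptable: regenerate in real time. (The semaphore
    // logic from the outline would replace this in a stampede-protected build.)
    $item = rebuild_item($key);
    $memc->set($key, $item);
  }
  // Promote into APC with freshness metadata.
  apc_store($key, array('data' => $item, 'created' => time(), 'verified' => time()));
  return $item;
}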

Writing/invalidating

  1. Write to/delete from memcached.
  2. Write to/delete from APC.
  3. Set the tombstone record in memcached. This record should persist long enough for all web servers to notice that their local cache needs to be updated. (A short sketch of this write path follows.)
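
And a matching sketch of the write/invalidate path, under the same assumptions as the read sketch above; the 30-second tombstone lifetime is an arbitrary illustrative value that just needs to outlive every web server’s freshness window:

<?php
// Write/invalidate path for the hybrid cache (sketch, same assumptions as above).
function cache_set($key, $data, Memcached $memc) {
  $now = time();
  $memc->set($key, $data);  // 1. write to the shared pool
  apc_store($key, array('data' => $data, 'created' => $now, 'verified' => $now)); // 2. local copy
  // 3. Tombstone: tells other web servers their local copies are now stale.
  $memc->set('tombstone:' . $key, $now, 30);
}

function cache_delete($key, Memcached $memc) {
  $memc->delete($key);
  apc_delete($key);
  $memc->set('tombstone:' . $key, time(), 30);
}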

OpenSolaris 2009.06 first impressions

I downloaded and installed OpenSolaris to experiment with its LAMP stack implementation. I eventually want to set up a headless, SSH-accessible box I can use to experiment with unique OpenSolaris features like dtrace on PHP and MySQL.

I have no experience using Solaris or OpenSolaris. Nearly all of my *nix experience is with BSD and GNU/Linux. I understand that anyone even basically familiar with OpenSolaris would not have some of the problems I’ve had.

Downloading the CD

It’s a little too easy to start downloading an OpenSolaris ISO. Many “download” links go right to the x86 ISO. When downloading an operating system install CD, a landing page (like Ubuntu’s) is helpful to get new users oriented; all “download” links on the site should link there.

The landing page is also helpful because people downloading CD images don’t always want to download directly to the local machine, so a basic “click here to actually download the ISO” is helpful.

I do later find the download landing page, but I’m initially confused about what to download to install OpenSolaris. It shouldn’t merely be in the descriptive text on the download landing page that the live CD is also the installation CD. “Live CD” does not imply “installation CD.”

First boot

The live CD boot goes quite smoothly, though GRUB asks users to pick among some confusing options. My questions: “Do I pick the ‘SSH-enabled’ option? What does SSH have to do with my boot choice?”

It’s also unnerving to see a terminal login prompt while the GUI login is loading. It gives the impression that X has failed to load. There’s also no indication that a user should wait for the graphical login to load, other than some screen blanks.

Eventually, though, the GDM login displays.

First login

Ubuntu’s live CD automatically logs in a basic user on boot. OpenSolaris gives users a standard login prompt. That wouldn’t be so much of a problem if the download landing page mentioned the username and password. But it doesn’t, at least for the latest release. Users actually wanting to use their live CDs either have to go off-site or view the landing page for a previous edition.

So, the username and password are both “jack.” I’ll file that alongside Oracle’s default “scott” schema in the category of things arbitrarily named after someone who probably hasn’t worked on the project for years.

Installation

Starting the installation is straightforward: there’s a nice icon on the desktop. Running the installer is easy and successful.

But like most people trying OpenSolaris, I’m installing it for use as a server, not a desktop. For one, I want the time zone configured as UTC, not a local one. There seems to be no option to do this at installation time. I also don’t want to install a GUI, but “no GUI” is not an option in the installer. Nor is there an obvious “server” installation CD. I guess making it a real server will have to happen post-install.

While running the installation, I try to open the “getting started” link on the desktop, which launches Firefox. Well, at least it tries to. It actually fails, but I hold off investigating in case it starts working after completing the installation.

First login to the installed instance

Completing the first boot and first login is easy. I used the username and password I had set during installation. A GNOME desktop very similar to the one from the live CD loads effortlessly.

There seem to be two network-focused icons in the tray near the clock, and it’s not clear what the difference is. Upon further investigation, one is Network Manager and the other is a simple status icon for the ethernet adapter. Really, only one is necessary, especially given the visual confusion of having both and redundancy in what they do.

Firefox still doesn’t start. It gets a “704 Illegal Instruction” error when I try from the shell. Considering that Thunderbird and the rest of GNOME run fine, this isn’t a problem with the hardware. I guess I’ll withhold judgment until at least updating.

Given the problems with Firefox, the first thing I seriously try is updating the OS. The package manager GUI, which is sort of like Ubuntu’s, is linked from approximately three places (desktop, toolbar, and application menu). It starts quickly and provides a mostly sane interface.

Time to search for packages. Typing in the search box doesn’t immediately do anything, so I eventually press ENTER, which starts the search. There isn’t an obvious visual indicator of the search being in progress, so it’s not clear when all the results have been displayed. Clicking the “clear search box” (broom) button clears the search box and takes the user out of search mode. Other GNOME applications typically have live search in combination with the broom button, which usually makes this behavior less confusing.

The update buttons on the toolbar seem disabled, so I click “refresh” and use the standard menus to initiate the update process. It’s not very clear whether anything actually needs to be updated. The updater takes me through creating a boot system snapshot. This feature is very cool, and I imagine it’s implemented using ZFS under the hood. The installer offers to show me consolidated release notes for the update, which of course uses Firefox. I shrug and reboot.

First login after update

The post-update boot happens smoothly all the way to GNOME, but Firefox still doesn’t start. Same error.

Oh well, this wasn’t going to be used as a desktop, anyway.

I try to set the static IP address for the testing subnet of the Four Kitchens office network. After clicking on the non-Network Manager tray icon and setting the IP, OpenSolaris solves the redundant tray icon problem by having both widgets crash. I try to open the GNOME network connections configuration tool to at least see if the static IP took effect, but that spends about 30 seconds loading the interfaces before also crashing.

Granted all of these problems are GUI-related and shouldn’t affect my final work, but setting the IP and loading Firefox are both pretty basic.

Next adventures

  1. Install Bazaar on Solaris (not expected to be difficult, but necessary for my application deployments)
  2. Set up remote SSH logins
  3. Remove the GUI
  4. Set up Sun’s packaged LAMP stack
  5. Explore dtrace on PHP and MySQL

The hidden costs of proprietary software: #2 your vendor is an adversary

On December 2, 2008, customers of SonicWALL woke up to broken firewalls. This wasn’t the result of a real problem in the firewalls; it was a result of SonicWALL’s DRM server malfunctioning and deactivating all customer firewalls.

The relationship between customers and vendors of proprietary software is fundamentally adversarial: proprietary vendors have business models where customer activity (like installing Windows on a desktop) requires payment to the vendor. Because the activity happens entirely on the customer side and paying the vendor conflicts with the customer’s desire to save money, proprietary software vendors don’t trust their customers to pay them.

So, they’ve developed strategies based on customer distrust. One of these strategies is embedding DRM, software their customers run that looks out for the vendor’s interests. DRM systems, like Microsoft’s activation tools, continually threaten to disable the software customers rely on. And, with few exceptions, they run on code the customers cannot inspect.

Microsoft has removed activation code from many of their server products, but it remains common in their desktop products, like Windows and Office. Other proprietary vendors (like SonicWALL) still have it in their server and infrastructure code. Any software that has secret failure mechanisms integrated for the vendor’s sake has no place in important business infrastructure.

Software fails often enough without being defective by design.

Decorators and directories

Nodes have evolved remarkably over Drupal’s history. In Drupal 4.7, node types were typically created by modules that “owned” their node types. There was no way to create a node type without a module behind it. Modules creating node types would implement hook_node_info() and directly handle the main loading, saving, and editing of the node type, while Drupal core handled the loading and saving of the title and body. Modules doing this were effectively subclassing a pseudo-abstract node class (a class containing title and body only) in core and adding their own fields.

Drupal 4.7 was also the dawn of the Form API and its hook_form_alter(). Combined with the ability (beginning in Drupal 5) to create node types directly in core, this changed the dominant pattern of node type development: the decorator pattern emerged as the preferred approach, allowing multiple modules to simultaneously add fields to the same node types.

Using decorators with node types is straightforward:

  • Create your node types using Drupal’s content type administration tool.
  • Create a module that implements hook_form_alter() and hook_nodeapi() that adds fields to selected content types.
  • Configure the module to add its fields to the selected content types.

Even when it would be equally straightforward to have a module implement its own node type, implementing the fields with a decorator is superior because the approach maintains consistency. All fields are saved and loaded through hook_nodeapi(), all form elements are defined through hook_form_alter(), and no module has any special claim to particular node types. This approach should make hook_node_info() and the other “I own this node type” hooks obsolete.
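
To make the pattern concrete, here’s a minimal Drupal 6-style sketch of a decorator module that adds a single “subtitle” field to a configurable set of node types. The module name, its {subtitle} table, and the subtitle_types variable are hypothetical; the hook signatures are the standard Drupal 6 ones.

<?php
// subtitle.module (sketch): decorates selected node types with a subtitle field
// using hook_form_alter() and hook_nodeapi(). Names here are hypothetical.

function subtitle_form_alter(&$form, $form_state, $form_id) {
  // Only decorate node edit forms for the configured content types.
  if (isset($form['type'], $form['#node'])
      && $form['type']['#value'] . '_node_form' == $form_id
      && in_array($form['type']['#value'], variable_get('subtitle_types', array()))) {
    $form['subtitle'] = array(
      '#type' => 'textfield',
      '#title' => t('Subtitle'),
      '#default_value' => isset($form['#node']->subtitle) ? $form['#node']->subtitle : '',
    );
  }
}

function subtitle_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  if (!in_array($node->type, variable_get('subtitle_types', array()))) {
    return;
  }
  switch ($op) {
    case 'load':
      // Returned array is merged into the $node object by core.
      return db_fetch_array(db_query(
        'SELECT subtitle FROM {subtitle} WHERE vid = %d', $node->vid));

    case 'insert':
      db_query("INSERT INTO {subtitle} (nid, vid, subtitle) VALUES (%d, %d, '%s')",
        $node->nid, $node->vid, $node->subtitle);
      break;

    case 'update':
      // (A real module would also handle new revisions and missing rows.)
      db_query("UPDATE {subtitle} SET subtitle = '%s' WHERE vid = %d",
        $node->subtitle, $node->vid);
      break;

    case 'delete':
      db_query('DELETE FROM {subtitle} WHERE nid = %d', $node->nid);
      break;
  }
}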

But this model has some downsides, at least in Drupal:

  • When modules directly created node types, data tended to be better consolidated. Because we now prefer to decorate node types using the fields from multiple modules, data is more scattered. We’re typically implementing each decorator as a table with a foreign key to nid and vid. This has massive performance implications.
  • Configuration for managing the mapping of decorators to node types is wildly inconsistent, and there’s no way to see, globally, which decorators apply to which node types.
  • Managing interaction between decorators is inconsistent or absent. This interaction includes namespace conflicts on $node objects.
  • The node editing form and $node object are the only places where decorators all come together in a consistent way. This makes data importing and exporting nearly impossible without custom code for each module performing decoration.

Fortunately, we don’t have to solve this problem on our own. Systems like Sun’s OpenDS have sophisticated, well-reasoned data models that allow decorators to elegantly combine to form coherent, node-like objects. OpenDS discusses its schema model on its wiki, and I’ll use it as my example.

The OpenDS schema is built from a few basic layers:

  • Attributes, which are fields
  • Abstract and structural classes, which contain attributes
  • Object classes, which are the set of classes assigned to an object

Objects (which are node-like) can be assigned multiple object classes, each of which functions like a decorator. The objects may contain values (often even multiple values) for the attributes provided by their object classes.

Drupal modules would create attributes and directory classes, and Drupal core would contain a unified interface for assigning the classes to node types.

What this gets us:

  • Asynchronous multi-master replication support. Right now, node data is scattered all over the database, and there’s no way to “package” it for coherent, asynchronous replication across multiple hosts without a PHP-level implementation. In OpenDS, objects are fundamentally understood and managed by the directory’s data storage layer. It’s easy for it to replicate whole nodes.
  • Similarly, this ability to “package” objects gives us importing and exporting for free. OpenDS can import and export LDIF-formatted data, and this would allow nodes to be transported to dissimilar systems, even different directory servers. You would simply need the same classes supported on the destination system.
  • The “packaged” objects make sharding and partitioning data much easier.
  • We get tools like Apache Directory Studio that give a coherent object view, including the list of classes for each object. There’s no way to view a node in MySQL without a painful number of joins.
  • Built-in protection against namespace collisions for attributes.
  • We get unified indexing across decorators. Because decorators currently store their data in multiple tables, we can’t index, say, a city and the title for a node without denormalization. In OpenDS, you can create VLV indexes that span any set of attributes and selectable subsets of nodes. It basically allows creation of a comprehensive index for anything configured in the Views module. The only comparable features in relational databases are indexed views in SQL Server and materialized views in Oracle. MySQL does not support such indexes.
  • We can change a field from being single-valued to multi-valued without a fundamental change in the way we access the data.

I’ll be experimenting with using OpenDS as a node back-end in the upcoming weeks. It would be great to have a robust, multi-master, free/open-source node storage system.