Software Architecture

Drop that cron; use Hudson instead

Hudson: The butler for your cron jobs, tooHudson: The butler for your cron jobs, too
For years, I used cron (sometimes anacron) without asking questions. The method was simple enough, and every project requiring cron-related capabilities documented the setup.

There is a much better way, and it involves Hudson. I introduced “Hudson for cron” as a sidebar at the Drupal Scalability and Performance Workshop a few weeks ago. To my surprise, several of the attendees remarked on their feedback questionnaires that it was one of the most valuable things they picked up that day. So, I’ve decided to write this up for everyone.

Making Drupal and Pressflow more mundane

Drupal and Pressflow have too much magic in them, and not the good kind. On the recent Facebook webcast introducing HipHop PHP, their PHP-to-C++ converter, they broke down PHP language features into two categories: magic and mundane. The distinction is how well each capability of PHP, a dynamic language, translates to a static language like C++. “Mundane” features translate well to C++ and get a big performance boost in HipHop PHP. “Magic” features are either unsupported, like eval(), or run about as fast as today’s PHP+APC, like call_user_func_array().

Anticipage: scalable pagination, especially for ACLs

Pagination is one of the hardest problems for web applications supporting access-control lists (ACLs). Drupal and Pressflow support ACLs through the node access system.

Problems with traditional pagination

  • Because pagination uses row offsets into the results, browsing listings where newly published items get added to the beginning of the results creates “page drift.” Page drift is where a user already browsing through paginated results sees, for example, items E, D, and C on page one, waits awhile, clicks to the next page, and sees items C, B, and A. Going back to page one again shows F (newly published), E, and D. Item C “drifted” to page two while the user was reading page one. If new items are published frequently enough, pagination can become unusable due to this drifting effect.
  • Even if content and ordering are fully indexed, jumping n rows into the results remains inefficient; it scales linearly with depth into pagination.
  • Paginating sets where the content and ordering are not fully indexed is even worse, often to the point of being unusable.
  • The design is optimized around visiting arbitrary page offsets, which does not reflect user needs. Users only need to make relative jumps in pagination of up to 10 pages (or so) in either direction or to start from the end of the results. (If users are navigating results by hopping to arbitrary pages to drill down to what they need, there are other flaws in the system.)

Intelligent memcached and APC interaction across a cluster

Anyone experienced with high-performance, scalable PHP development is familiar with APC and memcached. But used alone, they each have serious limitations:

APC

  • Advantages
    • Low latency
    • No need to serialize/unserialize items
    • Scales perfectly with more web servers
  • Disadvantages
    • No enforced consistency across multiple web servers
    • Cache is not shared; each web server must generate each item

memcached

  • Advantages
    • Consistent across multiple web servers
    • Cache is shared across all web servers; items only need to be generated once
  • Disadvantages
    • High latency
    • Requires serializing/unserializing items
    • Easily shards data across multiple web servers, but is still a big, shared cache

Combining the two

Traditionally, application developers simply think about consistency needs. If consistency is unnecessary (or the scope of the application is one web server), APC is great. Otherwise, memcached is the choice. There is, however, a third, hybrid option: use memcached as a coordination system for invalidation with APC as the main item cache. This functions as a loose L1/L2 cache structure. To borrow terminology from multimaster replication systems, memcached stores “tombstone” records.

Giving schema back its good name

For modern applications, the word “schema” has become synonymous with the tables, columns, constraints, indexes, and foreign keys in a relational database management system. A typical relational schema affects physical concerns (like record layout on disk) and logical concerns (like the cascading deletion of records in related tables).

Schemas have gotten a bad name because current RDBMS tools give them these rotten attributes:

David's Epic Presentation Megapost

Drupal's vulnerability reports are not signs of security weakness

Photo by loop_oh on Flickr.Photo by loop_oh on Flickr.

I’ve been tweeting back and forth with Alex Limi, one of the founders of Plone, about the validity of the security analysis from a CMS comparison report that includes Plone and Drupal. He’s proud of Plone’s infrequent vulnerability notices; it had two in the last year. Drupal had 26. Alex also cited a related IBM report on security in a later tweet.

While both reports above seem to identify Drupal (and Joomla! and WordPress, to be fair) as having notably bad security, they’re also both based on one superficial metric: self-reported vulnerabilities. Neither severity nor response time nor history of actual exploitation factored in.

Decorators and directories

Nodes have evolved remarkably over Drupal’s history. In Drupal 4.7, node types were typically created by modules that “owned” their node types. There was no way to create a node type without a module behind it. Modules creating node types would implement hook_node_info() and directly handle the the main loading, saving, and editing of the node type. Drupal core handled the loading and saving of the title and body. Modules doing this were effectively subclassing a pseudo-abstract node class (a class containing title and body only) in core and adding their own fields.

Contact Four Kitchens

Pressflow makes Drupal scale