Migrating old HTML files into Drupal

The Internet in the ’90s - a much simpler place.

We’ve done several migrations for clients who need their old, legacy content imported into Drupal from a collection of static HTML files. In this post I’ll outline the procedure we use to migrate, and provide some solutions to common problems related to encoding, line endings, and parsing HTML with QueryPath. Code snippets are provided inline, and complete source code is provided as a GitHub gist.

1. Setting up a migration source

Let’s work with a hypothetical example site, which has the following directory listing:

$ ls /mnt/html
about.html
news.html
h1001.html
h1002.html
...
h1133.html
contact.html

The migration we’ll set up will specifically target the hxxxx.html files; migrating hundreds or thousands of semi-structured files like these is a pretty common requirement.

The main workhorse is the extremely well-architected Migrate module. If you haven’t yet discovered the wonders of its elegant abstractions and flexibility, my suggestion is to watch the presentations on the project page to gain a firm understanding of how it works. I won’t go into the basics here, as I’m only covering tips related to importing static HTML.

To use the Migrate module, you start by defining your migration. This is done by extending the Migration base class and providing a constructor that injects the required dependencies into the object: a migration source, a destination, field mappings, and so on. The migration source is what we’re interested in right now.

To assemble the source object we’ll need to create two things. The first is a MigrateListFiles, an object that provides a list of filenames; these filenames are used as IDs by the Migrate module. You just need to create the MigrateListFiles with some parameters that direct it to the files you want:

$regex = '/h[0-9]+\.html/';
$list_files = new MigrateListFiles(array('/mnt/html'), '/mnt/html', $regex);

The first parameter is an array of directories. In our case there’s only one, but there could be multiple source directories. Next is the base directory: the part of the directory structure you’d like explicitly excluded when IDs are created. Finally, there’s the regular expression used to filter the files in the directories down to just the ones we care about. In our case, we look for any .html filename that starts with the letter ‘h’ followed by digits. This means we won’t be migrating the individual about.html, contact.html, etc. pages.

We then create an object that provides a method for turning an ID (filename) into a migratable chunk of data, which in this case means doing a file_get_contents() and returning the file contents to you:

$item_file = new MigrateItemFile('/mnt/html');

We then create an array which explicitly states the fields we’re going to be providing, and finally create the actual migration source, passing it the two objects from above:

$fields = array('title' => t('Title'), 'body' => t('Body'));
$this->source = new MigrateSourceList($list_files, $item_file, $fields);

To reiterate: the MigrateSource is going to be a MigrateSourceList, a basic source that can iterate over a MigrateListFiles and get data from a MigrateItemFile.

If having all these objects seems unnecessarily complex, remember that the Migrate module is extremely flexible - there’s a lot of code reuse going on here, so the tradeoff is worth it.

The rest of the source setup is beyond the scope of this article, and the GitHub gist contains a full example.
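Still, for orientation, here’s a minimal sketch of how these pieces slot into a Migration subclass. The class name, the ‘page’ content type, and the source key schema are illustrative - adapt them to your own site:

```php
class LegacyPageMigration extends Migration {
  public function __construct() {
    parent::__construct();
    $this->description = t('Import legacy hxxxx.html files.');

    // Source: the three objects described above.
    $regex = '/h[0-9]+\.html/';
    $list_files = new MigrateListFiles(array('/mnt/html'), '/mnt/html', $regex);
    $item_file = new MigrateItemFile('/mnt/html');
    $fields = array('title' => t('Title'), 'body' => t('Body'));
    $this->source = new MigrateSourceList($list_files, $item_file, $fields);

    // Destination: a hypothetical 'page' content type.
    $this->destination = new MigrateDestinationNode('page');

    // The map ties source IDs (filenames) to created node IDs.
    $this->map = new MigrateSQLMap($this->machineName,
      array('id' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE)),
      MigrateDestinationNode::getKeySchema()
    );

    $this->addFieldMapping('title', 'title');
    $this->addFieldMapping('body', 'body');
  }
}
```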

2. Fixing HTML file encodings

Old static sites typically get modified over the years by multiple people using many different platforms and editors, which results in a line ending and character encoding disaster. Broken characters often don’t show in the browser, even when the web server headers and HTML <head> tags declare an encoding that conflicts with the actual encoding. This is because browsers are really, really good at sorting out the mess - something they probably shouldn’t be doing, since developers ought to be correcting problems at the source. But anyway.

The first step, then, is to use a good text editor to browse through a handful of files one by one, checking them for inconsistencies. You may find several are detected as ISO-8859-1 (Latin-1) but are actually Windows-1252, so your files contain broken characters where there should be angled quotes and other Windows niceties. Assuming this is the case, you can use a snippet like this to convert the file contents to UTF-8 before doing anything else with it:

$enc = mb_detect_encoding($html, 'UTF-8', TRUE);
if (!$enc) {
  $html = mb_convert_encoding($html, 'UTF-8', 'WINDOWS-1252');
}

It’s pretty simple - if the encoding is not UTF-8 (which is very reliably detected by PHP, unlike WINDOWS-1252), then assume it’s WINDOWS-1252 and convert. If some of your content is in ISO-8859-1 this shouldn’t cause a problem since WINDOWS-1252 is a superset of ISO-8859-1.

Modify the above snippet according to the encodings you find in your source data.
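If your corpus mixes more than two encodings, mb_detect_encoding() also accepts an ordered list of candidates. A sketch (the sample bytes and candidate list are illustrative; note that single-byte encodings will match almost any input, so UTF-8 must come first and the rest reflects your best guess):

```php
// Sample raw bytes containing Windows-1252 curly quotes (illustrative).
$html = "He said \x93hello\x94.";
// Try the most likely encodings in order; strict mode avoids false positives.
$candidates = array('UTF-8', 'WINDOWS-1252', 'ISO-8859-1');
$enc = mb_detect_encoding($html, $candidates, TRUE);
if ($enc && $enc !== 'UTF-8') {
  $html = mb_convert_encoding($html, 'UTF-8', $enc);
}
```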

3. Fixing line endings

There’s Unix, Windows, Classic Mac… and I’ve even discovered a crazy hybrid CRCRLF which I assume was created by a buggy editor. Best to convert everything to Unix (LF). Depending on your source, one way to do that is to simply blast the CR characters away:

$html = str_replace(chr(13), '', $html);

It’s worth noting that getting rid of the CR characters is a necessary step if you want to use QueryPath, as described below.
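One caveat: stripping CRs outright joins lines together in a pure Classic Mac file, where CR alone is the line separator. A normalization that preserves those line breaks (the sample input is just for illustration):

```php
// Sample string mixing CRLF, CR, and LF endings (illustrative).
$html = "line1\r\nline2\rline3\nline4";
// Normalize Windows (CRLF) and Classic Mac (CR) endings to Unix (LF);
// replacing CRLF first stops it from turning into two newlines.
$html = str_replace(array("\r\n", "\r"), "\n", $html);
```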

4. Fixing code point HTML entities

The correct entities to use in HTML for symbols are the “named” entities - for example, the ™ sign should be &trade;. However, browsers also accept &#153; for ™, which is a reference to a code point in the Windows-1252 (“extended ASCII”) table. This would be fine, except when the document is UTF-8: in Unicode, code point 153 is an invisible control character, not ™, whose actual code point is 8482 (&#8482;). The browser just cheats and assumes you mean the Windows-1252 character for ™. PHP isn’t so lenient, and will reject the invalid value. Try it yourself:

print html_entity_decode('&trade;', ENT_COMPAT | ENT_HTML401, 'UTF-8'); // ™
print html_entity_decode('&#153;', ENT_COMPAT | ENT_HTML401, 'UTF-8');  // Nothing!
print html_entity_decode('&#8482;', ENT_COMPAT | ENT_HTML401, 'UTF-8'); // ™

So, if your source content has numeric HTML entities in the extended ASCII range 127-159, you’re going to have to replace them with named entities before trying to parse them in QueryPath, as QueryPath uses PHP’s default decoding functions and will thus fail to generate the characters you want.

Luckily, this isn’t too hard:

function convertEntities($html) {
  $entities = array(
    '&#150;' => '&ndash;',
    '&#151;' => '&mdash;',
    '&#152;' => '&tilde;',
    '&#153;' => '&trade;',
    // ... get them all!
  );
  $html = str_replace(array_keys($entities), array_values($entities), $html);
  return $html;
}
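If enumerating every entity by hand gets tedious, an alternative sketch (assuming the mbstring extension is available) maps any numeric entity in the 127-159 range through Windows-1252 automatically:

```php
// Convert numeric entities in the Windows-1252 range (127-159) by treating
// the number as a Windows-1252 byte and re-encoding the resulting
// character as a named entity.
function convertCp1252Entities($html) {
  return preg_replace_callback('/&#(12[7-9]|1[3-5][0-9]);/', function ($m) {
    $char = mb_convert_encoding(chr((int) $m[1]), 'UTF-8', 'WINDOWS-1252');
    $named = htmlentities($char, ENT_COMPAT, 'UTF-8');
    // Keep the original entity for code points with no named equivalent
    // (a few, like 129 and 141, are undefined in Windows-1252).
    return ($named !== '' && $named[0] === '&') ? $named : $m[0];
  }, $html);
}
```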

Here is the full list of ASCII characters.

5. Parsing HTML with QueryPath

You’re probably going to want to process and transform the source in some way, and for that, QueryPath is an excellent choice. It uses PHP’s native DOM handling abilities to load HTML and execute jQuery-like chainable commands. In fact, it provides most of the jQuery functions for your usage in PHP.

Getting the HTML into a QueryPath object can be tricky. Depending on how the source HTML is set up, each individual file may be a complete page with headers and footers, or it could pull in the sidebar and other components using Apache’s Server-Side Includes (SSI) and contain just the main content area (a saner approach).

In the case where you just have the body content in each file, you’ll need to pad it with a complete HTML document structure before loading it into QueryPath. To do that, use a function like this:

function wrapHTML($body) {
  // We add surrounding <html> and <head> tags.
  $html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
  $html .= '<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>';
  $html .= $body;
  $html .= '</body></html>';
  return $html;
}

I found that I always have to add this surrounding HTML in exactly this way in order for QueryPath to correctly use UTF-8. We can now create our QueryPath object, combining all the above tips in the correct order:

$html = str_replace(chr(13), '', $html);
$enc = mb_detect_encoding($html, 'UTF-8', TRUE);
if (!$enc) {
  $html = mb_convert_encoding($html, 'UTF-8', 'WINDOWS-1252');
}
$html = wrapHTML($html);
$html = convertEntities($html);
$qp_options = array(
  'convert_to_encoding' => 'utf-8',
  'convert_from_encoding' => 'utf-8',
  'strip_low_ascii' => FALSE,
);
$qp = htmlqp($html, NULL, $qp_options);

You can now perform your transformations on the $qp object. For example, you may want to process all anchors, so that you can change the ‘href’ on internal links to point at a new path:

// For all anchor links.
$anchors = $qp->top('a');
foreach ($anchors as $a) {
  $href = trim($a->attr('href'));
  $href = getNewHref($href);
  // Set the new href.
  $a->attr('href', $href);
}
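getNewHref() here is a helper you’d write yourself - its logic depends entirely on your URL mapping. A hypothetical sketch that leaves external links, fragments, and mailto links alone and rewrites the hxxxx.html pattern to an imagined /news/ path:

```php
// Hypothetical mapping of legacy hrefs to new Drupal paths.
function getNewHref($href) {
  // Leave absolute URLs, fragments, and mailto links untouched.
  if (preg_match('#^(https?://|mailto:|\#)#i', $href)) {
    return $href;
  }
  // Illustrative rule: h1001.html becomes /news/1001.
  if (preg_match('/^h([0-9]+)\.html$/', $href, $m)) {
    return '/news/' . $m[1];
  }
  return $href;
}
```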

What about stripping out all those pesky Dreamweaver HTML comments? Easy:

foreach ($qp->top()->xpath('//comment()')->get() as $comment) {
  $comment->parentNode->removeChild($comment);
}

To get your body content out again, use the innerHTML function:

$body = $qp->top('body')->innerHTML();

Conclusion

These are some fairly tough problems to diagnose the first time you run into them, especially when dealing with a large body of content, so I hope this post helps you to get your clients safely into a Unicode-based Drupal world.


Comments

Interesting article. I’m about to start a project that will also have to migrate current content. I had already found the “Import HTML” module mentioned before, so I was thinking that was the way to go - even though the module only exists for D6, meaning a content migration from D6 to D7 would have had to follow the import.

Did you consider using the “Import HTML” module and, if so, can you explain why you chose to do it this way?

Note: There is now an initial D7 version of “Import HTML”

We reviewed the Import HTML module and found that it wasn’t flexible or advanced enough for our needs. Migrating ICANN.org to Drupal involved moving somewhere in the region of 10,000 pages of multilingual content and linking the translations, so we needed a lot of custom code and control. For this, Migrate is the best.

The Import HTML module uses XSLT & HTMLTidy, two technologies that I don’t like very much. QueryPath is much easier than XSLT (which is a nightmare), and we tried to use HTMLTidy but it did destructive things to our content that we couldn’t switch off - for example, if there was a line break in an anchor element, it would simply remove the whole element - things like that. Couldn’t trust it.

Thanks for your clear answer. I will also need to migrate a multilingual site, so it is important that the migrate/convert tool does support it. Talking about multilingual, I don’t see that covered in your article. I guess(/hope) that the Migrate documentation covers that?!

Not really. We wrote custom code to handle the linking of multilingual content, mostly because it was so uniquely organized - each page had a translation table which we parsed and extracted the complete set from, and then reconciled with tables extracted from other members of the set.

The “normal” situation is that languages are organized into folders, like “/en/”, “/fr/”, etc., so this is easier: you implement a post-migration method in your Migration class so that once all the content is created, you effectively do a second pass.

In this second pass, you query the complete migrate map in the database, which would look something like this:

id                 ::   node id
/en/content1.html  ::   1  
/fr/content1.html  ::   2
/de/content1.html  ::   3
/zh/content1.html  ::   4

Go through all the IDs and build translation sets of content, storing the data in a PHP array. Then go through the array and manually set the value of tnid in the node table for each translation set. “tnid” is the node ID of the “source” translation node: in Drupal, each node that is part of a translation set simply keeps the node ID of the “source” translation in its “tnid” field.

I don’t think it’s possible to link multilingual content on the first pass, unless you can be absolutely sure that you migrate the “source” language files first.
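To sketch that second pass in code (the migrate_map_page table name and the /xx/ prefix parsing are assumptions - check your own migration’s map table and ID format):

```php
// Second pass: group migrated nodes into translation sets by shared filename.
$sets = array();
$result = db_query('SELECT sourceid1, destid1 FROM {migrate_map_page}');
foreach ($result as $row) {
  // '/en/content1.html' -> language 'en', set key 'content1.html'.
  if (preg_match('#^/([a-z]{2})/(.+)$#', $row->sourceid1, $m)) {
    $sets[$m[2]][$m[1]] = $row->destid1;
  }
}
foreach ($sets as $set) {
  // Use the English node (or failing that, the first one) as the source.
  $tnid = isset($set['en']) ? $set['en'] : reset($set);
  db_update('node')
    ->fields(array('tnid' => $tnid))
    ->condition('nid', array_values($set), 'IN')
    ->execute();
}
```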

Wait… There was Internet in the 90’s? I thought there was only Sierra-Net?