
Scraping websites into Drupal using feeds and import.io


Recently, I was faced with an interesting challenge: developing a system for importing thousands of hand-built sites into Drupal. One of the tools we encountered in our research was import.io, a web data platform and web scraping tool. It’s a powerful system that hides much of the crawling and scraping complexity you would otherwise have to manage with existing open-source tools, and it can create an API for sites that don’t already have one.

I’m not going to dive too deeply into using the tool; there’s already comprehensive documentation and plenty of examples. Instead, I’ll focus on taking the data set from an import.io crawler and ingesting it into Drupal with Feeds.

I tested it with a niche use case: scraping a fan site listing Infocom Interactive Fiction games. The content is structured and lent itself well to training the import.io tool.

Zork Game Page

I started by creating a new API from the URL, which provides an interface both to define what you want to extract from the page and to train the extractor by selecting page elements.

Training Import.io

Once it was trained, I ran the crawler, which came preloaded with some sensible defaults. The advanced options were particularly useful.

Import.io Crawler

When the crawler completed, I was presented with a nice data set that could be exported in a variety of formats. However, I was more interested in the API, since I wanted direct access to the consumable data.
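Getting at that data programmatically is just an HTTP request that returns JSON. Here’s a minimal sketch from a Drupal 7 context; the endpoint shape and the _user/_apikey parameter names are assumptions for illustration, so check import.io’s API documentation for the real ones.

<?php
// Sketch: pull crawler results as JSON. The endpoint URL and the
// _user/_apikey parameter names are illustrative assumptions, not
// necessarily the documented import.io API.
function example_importio_query($connector_guid, $user_guid, $api_key) {
  $url = 'https://api.import.io/store/connector/' . $connector_guid . '/_query';
  $query = drupal_http_build_query(array(
    '_user' => $user_guid,
    '_apikey' => $api_key,
  ));
  $response = drupal_http_request($url . '?' . $query);
  if ($response->code == 200) {
    // Each row is an associative array of the fields defined in training.
    return drupal_json_decode($response->data);
  }
  return array();
}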

Feeds seemed like a logical system for ingesting the import.io data set, given its robust API and bevy of supporting modules like Feeds Tamper that can be used to transform data upon import. No module existed, so I wrote one called Feeds: Import.io that provides both a Fetcher and a Parser for consuming an import.io dataset.
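For the curious, wiring a custom fetcher and parser into Feeds on Drupal 7 comes down to implementing hook_feeds_plugins(). The skeleton below is only a rough sketch of the moving parts; the module machine name, class names, and file names are illustrative, so see the actual module on Drupal.org for the real thing.

<?php
/**
 * Implements hook_feeds_plugins().
 *
 * Illustrative names; the released module may differ.
 */
function feeds_importio_feeds_plugins() {
  return array(
    'ImportIOFetcher' => array(
      'name' => 'import.io Fetcher',
      'description' => 'Fetches a dataset from the import.io API.',
      'handler' => array(
        'parent' => 'FeedsFetcher',
        'class' => 'ImportIOFetcher',
        'file' => 'ImportIOFetcher.inc',
        'path' => drupal_get_path('module', 'feeds_importio'),
      ),
    ),
    'ImportIOParser' => array(
      'name' => 'import.io Parser',
      'description' => 'Parses an import.io JSON result set.',
      'handler' => array(
        'parent' => 'FeedsParser',
        'class' => 'ImportIOParser',
        'file' => 'ImportIOParser.inc',
        'path' => drupal_get_path('module', 'feeds_importio'),
      ),
    ),
  );
}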

Once the module is installed and enabled, create a new Feeds Importer.

Creating a Feed

Then, change the fetcher…

Change the Fetcher

and select the “import.io Fetcher”.

Select a fetcher

The fetcher needs to be configured before it can be used, so click on Settings.

Change fetcher settings

Three things are needed; the sketch following the screenshot below shows likely candidates. Specify the values in the fetcher settings and save.

Specify fetcher settings
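Under the hood, a Feeds fetcher declares its settings through configDefaults() and configForm(). The three values shown here, a user GUID, an API key, and a connector GUID, are my guess at what an import.io connection needs; the released module’s setting names may differ, as may the endpoint used in fetch().

<?php
// Illustrative fetcher; setting names and endpoint are assumptions.
class ImportIOFetcher extends FeedsFetcher {

  public function configDefaults() {
    return array('user_guid' => '', 'api_key' => '', 'connector_guid' => '');
  }

  public function configForm(&$form_state) {
    $form = array();
    $form['user_guid'] = array(
      '#type' => 'textfield',
      '#title' => t('User GUID'),
      '#default_value' => $this->config['user_guid'],
    );
    $form['api_key'] = array(
      '#type' => 'textfield',
      '#title' => t('API key'),
      '#default_value' => $this->config['api_key'],
    );
    $form['connector_guid'] = array(
      '#type' => 'textfield',
      '#title' => t('Connector GUID'),
      '#default_value' => $this->config['connector_guid'],
    );
    return $form;
  }

  public function fetch(FeedsSource $source) {
    // Request the dataset as raw JSON (hypothetical endpoint, as above).
    $url = 'https://api.import.io/store/connector/' . $this->config['connector_guid'] . '/_query';
    $query = drupal_http_build_query(array(
      '_user' => $this->config['user_guid'],
      '_apikey' => $this->config['api_key'],
    ));
    $response = drupal_http_request($url . '?' . $query);
    return new FeedsFetcherResult($response->data);
  }
}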

We’ll also need to change the parser, but this step is a bit more straightforward.

Change the parser

Choose the “import.io Parser”.

Select the parser

The parser itself doesn’t need any configuration, so it’s straight on to the Node processor settings.

Node processor settings

No workflow changes are necessary; treat this like any other feed. I had already created a Game content type with the appropriate fields.

Customize Node processor

The Node processor is the last piece of the puzzle; the mapping will need to be specified so data from import.io has a place to go.

Node Processor mapping

The mapping for the Node processor is routine; each Source should match the corresponding import.io machine name that was specified during training.

Adding mappings
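Those source keys come straight out of the parsed result rows. Here’s a sketch of the parser side, assuming the decoded JSON holds its rows under a ‘results’ key; that key, like the rest of the response structure, is an assumption rather than the documented format.

<?php
// Illustrative parser; assumes rows live under a 'results' key, each
// keyed by the machine names chosen while training the extractor.
class ImportIOParser extends FeedsParser {

  public function parse(FeedsSource $source, FeedsFetcherResult $fetcher_result) {
    $data = drupal_json_decode($fetcher_result->getRaw());
    $items = array();
    if (!empty($data['results'])) {
      foreach ($data['results'] as $row) {
        // One crawled page becomes one feed item.
        $items[] = $row;
      }
    }
    return new FeedsParserResult($items);
  }
}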

That’s pretty much all the customization that’s needed. Optionally, Feeds Tamper or a similar module can be used to transform the data as it comes in.

Finished mapping

Triggering the feed execution depends on how it was set up. In this case, I wanted it to be completely manual, so I used the Import link in the nav menu…

Import link

Selected the name of the feed…

Select feed

Clicked Import…

Start import

Took a sip of water…

Importing

And it was done. The result? A few dozen imported nodes, complete with taxonomy terms and images, all without writing code.

Planetfall game page in Drupal

Well, except for the Feeds: Import.io module, but it’s open-source and hosted on Drupal.org because that’s how I roll.

I’ve seen a number of interesting use cases for import.io, such as scraping political data from government websites, real estate listings, retail data, and a whole lot more. Any tool that improves the accessibility of data and exposes it to a greater audience is a good thing.