Recently, I was faced with an interesting challenge: develop a system for importing thousands of hand-built sites into Drupal. One of the tools that we encountered in our research was import.io, a web data platform and web scraping tool. It’s a powerful system that hides a lot of the crawling and scraping complexity that would otherwise have to be pieced together from existing open-source tools, and it can create an API for sites that don’t already have one.
I’m not going to dive too deeply into using the tool; there’s already some pretty comprehensive documentation and examples. Instead, I’ll focus on taking the data set from an import.io crawler and ingesting it with Feeds into Drupal.
I tested it with a niche use case: scraping a fan site listing Infocom Interactive Fiction games. The content is structured and lent itself well to training the import.io tool.
I started by creating a new API from the URL, which provides an interface for defining what you want to extract from the page and for training the extractor by selecting page elements.
Once it was trained, I ran the crawler, which came preloaded with some sensible defaults. The advanced options were particularly useful.
When the crawler completed, I was presented with a nice data set which could be exported into a variety of different formats. However, I was more interested in the API – I wanted direct access to the consumable data.
Feeds seemed like a logical system for ingesting the import.io data set, given its robust API and bevy of supporting modules like Feeds Tamper that can be used to transform data upon import. No module existed, so I wrote one called Feeds: Import.io that provides both a Fetcher and a Parser for consuming an import.io dataset.
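For the curious, here’s a rough sketch of what a Feeds fetcher/parser pair looks like in Drupal 7. This is illustrative only, not the actual Feeds: Import.io code; the class names, endpoint URL, and response key are placeholders.

```php
<?php

/**
 * Minimal sketch of a Feeds fetcher/parser pair for Drupal 7.
 * Hypothetical example only; not the Feeds: Import.io module code.
 */
class ExampleImportIoFetcher extends FeedsFetcher {

  /**
   * Downloads the raw data set and hands it to the parser.
   */
  public function fetch(FeedsSource $source) {
    // A real implementation would build this URL from the connector GUID,
    // user GUID, and API key configured in the fetcher settings.
    $url = 'https://example.com/connector-query';
    $response = drupal_http_request($url);
    return new FeedsFetcherResult($response->data);
  }
}

class ExampleImportIoParser extends FeedsParser {

  /**
   * Turns the raw response into items keyed by the import.io field names.
   */
  public function parse(FeedsSource $source, FeedsFetcherResult $fetcher_result) {
    $data = drupal_json_decode($fetcher_result->getRaw());
    // 'results' is a placeholder; the actual key depends on the response shape.
    return new FeedsParserResult($data['results']);
  }
}
```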
Once the module is installed and enabled, create a new Feeds Importer.
Then, change the fetcher…
and select the “import.io Fetcher”.
The fetcher needs to be configured before it can be used, so click on Settings.
Three things are needed:
- Connector GUID – Go to the appropriate Data Set on the My Data page at http://import.io/data/mine
- User GUID, API Key – Found at http://import.io/data/account
Enter the values into the fetcher settings and save.
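Under the hood, settings like these are typically collected with a standard Feeds configuration form. Here is a minimal sketch, building on the hypothetical fetcher class above; the form keys are placeholders, not the module’s actual setting names.

```php
<?php

class ExampleImportIoFetcher extends FeedsFetcher {

  /**
   * Default values for the fetcher settings.
   */
  public function configDefaults() {
    return array('connector_guid' => '', 'user_guid' => '', 'api_key' => '');
  }

  /**
   * Settings form shown when you click Settings on the fetcher.
   */
  public function configForm(&$form_state) {
    $form = array();
    foreach (array(
      'connector_guid' => t('Connector GUID'),
      'user_guid' => t('User GUID'),
      'api_key' => t('API key'),
    ) as $key => $title) {
      $form[$key] = array(
        '#type' => 'textfield',
        '#title' => $title,
        '#default_value' => $this->config[$key],
        '#required' => TRUE,
      );
    }
    return $form;
  }
}
```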
We’ll also need to change the parser, but this will be a bit more straightforward.
Choose the “import.io Parser”.
The parser itself doesn’t need any configuration, so straight on to the Node Processor settings.
No workflow changes are necessary; treat this like any other feed. I had already created a Game content type with the appropriate fields.
The Node processor is the last piece of the puzzle; the mapping will need to be specified so data from import.io has a place to go.
The mapping for the Node processor is routine; the Source values should match the import.io machine names that were specified during training.
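If you’re wondering where those Source options come from, a Feeds parser can expose them via getMappingSources(). A sketch, building on the hypothetical parser above and assuming placeholder column names like title, image, and year (yours will be whatever you defined during training):

```php
<?php

class ExampleImportIoParser extends FeedsParser {

  // parse() omitted; see the earlier sketch.

  /**
   * Exposes the import.io column names as mapping sources.
   */
  public function getMappingSources() {
    return parent::getMappingSources() + array(
      'title' => array(
        'name' => t('Title'),
        'description' => t('Game title extracted by import.io.'),
      ),
      'image' => array(
        'name' => t('Image'),
        'description' => t('Cover image URL extracted by import.io.'),
      ),
      'year' => array(
        'name' => t('Year'),
        'description' => t('Release year extracted by import.io.'),
      ),
    );
  }
}
```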
That’s pretty much all the customization that is needed. Optionally, Feeds Tamper or a similar module can be used to massage the data during import.
Triggering the feed execution depends on how it was set up. In this case, I wanted it to be completely manual, so I used the Import link in the nav menu…
Selected the name of the feed…
Clicked Import…
Took a sip of water…
And it was done. The result? A few dozen imported nodes, complete with taxonomy terms and images, all without writing code.
Well, except for the Feeds: Import.io module, but it’s open-source and hosted on Drupal.org because that’s how I roll.
I’ve seen a number of interesting use cases for import.io, such as scraping political data from government websites, real estate listings, retail, and a whole lot more. Any tool that improves the accessibility of data and exposes it to a greater audience is a good thing.