Using the Linux Shell for Web Scraping

Let’s assume we want to scrape the “Most Popular in News” box from bbc.com. What we need first is a CSS selector to locate what we are interested in. In this case it is simply a div tag with the ID “mostPopular”, and you can figure this out using the developer tools of your favorite web browser. Now we are going to apply a chain of command line tools, each feeding its output to the next one (that is called piping, btw), and in the end we have a nicely laid-out text representation of the box’s content:
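Sketched out, that chain might look something like the following; the BBC News front-page URL is an assumption on my part, while the div#mostPopular selector is the one described above:

    # URL is an assumption; the selector comes from the description above
    echo "http://www.bbc.com/news/" |
      wget --quiet -O - -i - |          # fetch the HTML and write it to stdout
      hxnormalize -x |                  # normalize it into well-formed XML
      hxselect -i "div#mostPopular" |   # cut out the element matched by the CSS selector
      lynx -stdin -dump                 # render the extracted fragment as plain text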

So let’s see what is going on. First the echo pipes the URL to wget. I could have also provided the URL directly as an argument, but I chose to do it like this to make clear that the URL, or a list of URLs, might itself be the result of some processing. wget fetches the HTML code from BBC, which is then normalized by hxnormalize to improve digestibility by hxselect (both installed on Ubuntu via sudo apt-get install html-xml-utils), which then extracts the part of the code identified by the CSS selector. Lynx finally turns that code into the laid-out text you would see in a browser.
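For clarity, the same steps can be written out one at a time; the intermediate file names are just placeholders and the URL is again an assumption:

    wget -O page.html "http://www.bbc.com/news/"            # fetch the page (URL assumed)
    hxnormalize -x page.html > page.xhtml                   # make the HTML well-formed XML
    hxselect -i "div#mostPopular" < page.xhtml > box.html   # extract the selected element
    lynx -stdin -dump < box.html                            # render the fragment as plain text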

And this is what we get in the end:

I love simple solutions and this is as simple as it can be (I guess).



(original article published on www.joyofdata.de)

14 thoughts on “Using the Linux Shell for Web Scraping”

  1. Pingback: HSS8120 – scraping « Mike Hirst

  2. Hi Raffael,

    Never thought of this use of the terminal. Thanks, you have no idea how much time I will save thanks to your article.

    Thanks again :)

  3. Hi, Raffael Vogler,

    I have searched all over Google, but all I saw was how to web scrape using PHP or .NET; I found only a few articles that explain how to web scrape on a Linux OS. I like this article because I like open source technologies. Keep up the good work. I want to ask: can we use XPath and regex on Linux for web scraping?

  4. Thank you for mentioning the hxtool set.

    I have a couple of “need that now” scrapers of my own for various tasks, and just last week I came across a website whose HTML code wasn’t really nice.

    Doing my usual sed actions seemed a bit cumbersome. So the hxtool set helped straighten it.

    But say I dump straight from lynx … why do you first go via wget? I couldn’t quite see the benefit.

    • Hey Steffen, that’s a valid question and right now I am not sure why I chose to do it this way. The post is from a time when I had just started to use Linux, and maybe I was simply fascinated by piping … maybe there was another, better reason :)

      • Most likely on the wget usage: you could use the -k switch, which converts all the links on the page to full links, i.e. href="/some/dir/index.php" to href="http://www.website.com/some/dir/index.php", which makes the saved HTML file much more useful to parse locally.
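
        For example (the URL below is only a placeholder):

        wget -k -O page.html "http://www.bbc.com/news/"   # -k rewrites relative links in the saved file to absolute ones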

  5. Very nice post. I had a minor issue installing the html-xml-tools package (I’m running Ubuntu 12). Specifically,

    sudo apt-get install html-xml-tools

    generated the following error:

    E: Unable to locate package html-xml-tools

    I got around this by installing html-xml-utils instead:

    sudo apt-get install html-xml-utils

  6. Great article. I was searching for a simple Linux web scraping solution and will give this a go when I get the opportunity. Thanks for sharing :)

  7. That’s nice. A Unix toolbox for web scraping. :-)
    Didn’t know hxselect… W3C tools, nice.
    Installed them now; maybe they make sense for some tasks,
    saving some programming time.

    But the BBC webpage has changed, so there will be no new news coming out of your script.
