Using the Linux Shell for Web Scraping

Let’s assume we want to scrape the “Most Popular in News” box from bbc.com. What we need first is a CSS selector to locate what we are interested in. In this case it is simply a div tag with the ID “mostPopular”, and you can figure this out using the Developer Tools of your favorite web browser. Now we are going to apply a chain of command line tools – each feeding its output to the next one (that is called piping, by the way) – and in the end we get a laid-out text representation of the box’s content:

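The chain looks roughly like this (a sketch only: the URL and the div#mostPopular selector are the ones discussed here, and the exact flags may vary):

    echo "http://www.bbc.com/news/" |            # the URL(s) might just as well come from an earlier processing step
      wget -O- -i- --quiet |                     # fetch the HTML for each URL read from stdin
      hxnormalize -x |                           # tidy the markup into well-formed XML
      hxselect -s '\n' -c 'div#mostPopular' |    # keep only the element matched by the CSS selector
      lynx -stdin -dump                          # render the extracted HTML as plain text
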
So let’s see what is going on. First, echo pipes the URL to wget. I could have provided the URL directly as an argument, but I chose to do it like this to make clear that the URL – or a list of URLs – might itself be the result of some earlier processing. wget fetches the HTML code from BBC, which is then normalized by hxnormalize to make it more digestible for hxselect (both can be installed on Ubuntu with sudo apt-get install html-xml-utils), which in turn extracts the part of the code identified by the CSS selector. Finally, lynx turns that code into the laid-out text you would see in a browser.
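
If you want to see what each stage contributes, you can also run the chain step by step and keep the intermediate results around (same assumptions as above for the URL and the selector):

    echo "http://www.bbc.com/news/" | wget -O- -i- --quiet > raw.html    # the HTML as delivered by the server
    hxnormalize -x < raw.html > clean.html                               # well-formed markup, ready for hxselect
    hxselect -s '\n' -c 'div#mostPopular' < clean.html > box.html        # just the "Most Popular in News" box
    lynx -stdin -dump < box.html                                         # the final plain-text rendering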

And what we get in the end is the content of the “Most Popular in News” box rendered as plain, readable text.

I love simple solutions and this is as simple as it can be (I guess).



(original article published on www.joyofdata.de)

14 thoughts on “Using the Linux Shell for Web Scraping”

  1. Pingback: HSS8120 – scraping « Mike Hirst

  2. Hi Raffael,

    I never thought of this use of the terminal. Thanks, you have no idea how much time your article will save me.

    Thanks again :)

  3. Hi, Raffael Vogler,

    I have searched all over Google, but all I saw was how to web scrape using PHP or .NET; I found only a few articles explaining how to web scrape on a Linux OS. I like this article because I like open source technologies. Keep up the good work. I want to ask: can we use XPath and regex on Linux for web scraping?
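
    (For what it’s worth, both are available from the shell as well. A minimal, purely illustrative sketch, assuming libxml2-utils and GNU grep with PCRE support are installed and that page.html is a saved copy of the page:)

    xmllint --html --xpath '//div[@id="mostPopular"]' page.html 2>/dev/null    # XPath instead of a CSS selector
    grep -oP '<h2>\K[^<]+' page.html                                           # PCRE regex: the text inside <h2> tags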

  4. Hi Raffael, Thanks a lot for this. I am using this in my weather conky … rather clumsily, I have to admit (couldn’t adapt hxselect properly).

    ${execi 50 echo 'http://www.timeanddate.com/astronomy/india/mumbai' | wget -O- -i- | hxnormalize -x | lynx -stdin -dump > Moonrise && cat Moonrise | head -n 131 | tail -n 7}

    http://forums.linuxmint.com/download/file.php?id=27448&mode=view

  5. Thank you for mentioning the hx tool set.

    I have a couple of “need that now” scrapers of my own for various tasks, and just last week I came across a website whose HTML code wasn’t really nice.

    Doing my usual sed actions seemed a bit cumbersome, so the hx tool set helped straighten it out.

    But say, I dump straight from lynx … why do you first go via wget? I couldn’t quite see the benefit.

    • Hey Steffen, that’s a valid question, and right now I am not sure why I chose to do it this way. The post is from a time when I was just starting to use Linux, and maybe I was simply fascinated by piping … maybe there was another, better reason :)

      • Most likely on the wget usage: you could use the -k switch, which converts all the links on the page into full links, i.e. href="/some/dir/index.php" becomes href="http://www.website.com/some/dir/index.php", which makes the saved HTML file much more useful to parse locally.
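
        A minimal illustration of the -k behaviour described above (URL and filename are hypothetical):

        wget -k -O page.html "http://www.website.com/some/dir/index.php"    # --convert-links rewrites relative links to full URLs once the download has finished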

  6. Very nice post. I had a minor issue installing the html-xml-tools package (I’m running Ubuntu 12). Specifically,

    sudo apt-get install html-xml-tools

    generated the following error:

    E: Unable to locate package html-xml-tools

    I got around this by installing html-xml-utils instead:

    sudo apt-get install html-xml-utils

  7. Great article. I was searching for a simple Linux web scraping solution and will give this a go when I get the opportunity. Thanks for sharing :)

  8. That’s nice. A Unix toolbox for web scraping. :-)
    Didn’t know hxselect… W3C tools, nice.
    I installed them now; maybe they make sense for some tasks,
    saving some programming time.

    But the BBC webpage has changed, so no new news will be coming out of your script.

Comments are closed.