Let’s assume we want to scrape the “Most Popular in News” box from bbc.com. The first thing we need is a CSS selector to locate the part we are interested in. In this case it is simply a div tag with the class “most_popular_content”, which you can figure out using the developer tools of your favorite web browser. Then we apply a chain of command line tools, each feeding its output to the next (that is called piping, by the way), and in the end we get a laid-out text representation of the box’s content:
~> echo "http://www.bbc.com" | wget -O- -i- | hxnormalize -x | hxselect -i "div.most_popular_content" | lynx -stdin -dump > theMostPoupularInNews
So let’s see what is going on. First, echo pipes the URL to wget. I could have provided the URL directly as an argument, but I chose to do it this way to make clear that the URL, or a list of URLs, might itself be the result of earlier processing. wget fetches the HTML code from BBC, which is then normalized by hxnormalize to improve its digestibility for hxselect (both installed on Ubuntu via sudo apt-get install html-xml-utils), which in turn extracts the part of the code identified by the CSS selector. Finally, lynx turns that code into the laid-out text you would see in a browser.
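If you want to reuse the chain, it generalizes nicely. Here is a minimal sketch of a small wrapper script (the script name and the argument handling are made up for illustration, the pipeline itself is the one from above) that takes a URL and a CSS selector and prints the laid-out text:

#!/bin/bash
# scrape-box.sh -- a sketch of the same chain as a reusable script
# usage: ./scrape-box.sh "http://www.bbc.com" "div.most_popular_content"

url="$1"        # page to fetch
selector="$2"   # CSS selector of the element we want

# fetch quietly to stdout, normalize into well-formed XML, cut out the selected
# element, then let lynx render the remaining markup as laid-out text
wget -qO- "$url" | hxnormalize -x | hxselect -i "$selector" | lynx -stdin -dump

Run as ./scrape-box.sh "http://www.bbc.com" "div.most_popular_content" > theMostPoupularInNews it produces the same file as the one-liner.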
And this is what we get in the end:
~> cat theMostPoupularInNews
   [1]Shared

     * [2]1 Galileo satellites on wrong orbit
     * [3]2 UK imams condemn Isis in online film
     * [4]3 Boy held for 'killing pet dinosaur'
     * [5]4 Experts to review stroke clot-buster

   [6]Read

     * [7]1 Merkel in Ukraine as crisis mounts
     * [8]2 Galileo satellites on wrong orbit
     * [9]3 UN call to 'prevent Iraq massacre'
     * [10]4 Many dead in Madrid plane crash
     * [11]5 Russian mother has 'giant' baby

   [12]Watched/Listened

     * [13]1 What are options in fight against IS?
     * [14]2 Victoria Beckham does ice challenge
     * [15]3 SpaceX rocket explodes during testing
     * [16]4 Obama refuses ice bucket challenge
     * [17]5 'Wedding was saddest day of my life'

References

   1. file:///tmp/lynxXXXXnXNyNy/L3209-8165TMP.html
   2. http://www.bbc.co.uk/news/world-europe-28910662
   3. http://www.bbc.co.uk/news/uk-28270296
   4. http://www.bbc.co.uk/news/blogs-news-from-elsewhere-28897353
   5. http://www.bbc.co.uk/news/health-28900824
   6. file:///tmp/lynxXXXXnXNyNy/L3209-8165TMP.html
   7. http://www.bbc.co.uk/news/world-europe-28910215
   8. http://www.bbc.co.uk/news/world-europe-28910662
   9. http://www.bbc.co.uk/news/world-middle-east-28910674
  10. http://www.bbc.co.uk/2/hi/europe/7572643.stm
  11. http://www.bbc.co.uk/2/hi/europe/7015841.stm
  12. file:///tmp/lynxXXXXnXNyNy/L3209-8165TMP.html
  13. http://www.bbc.co.uk/news/uk-28902128
  14. http://www.bbc.co.uk/news/entertainment-arts-28896231
  15. http://www.bbc.co.uk/news/world-us-canada-28910812
  16. http://www.bbc.co.uk/news/world-us-canada-28892227
  17. http://www.bbc.co.uk/news/world-middle-east-28315346
I love simple solutions and this is as simple as it can be (I guess).
(original article published on www.joyofdata.de)
Hi Raffael,
I never thought of using the terminal this way. Thanks, you have no idea how much time I will save because of your article.
Thanks again :)
Hi Raffael Vogler,
I have searched all over Google, but all I saw was how to web scrape using PHP or .NET; I found only a few articles that explain how to web scrape on a Linux OS. I like this article because I like open source technologies. Keep up the good work. I want to ask: can we use XPath and regex on Linux for web scraping?
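Both do work from a plain shell, by the way. A rough sketch, assuming xmllint from the libxml2-utils package and a GNU grep built with PCRE support (the file name and the expressions are only illustrative):

# XPath: xmllint can parse even sloppy HTML and evaluate an XPath expression
wget -qO- "http://www.bbc.com" > page.html
xmllint --html --xpath '//div[contains(@class,"most_popular_content")]//a/text()' page.html 2>/dev/null

# Regex: GNU grep prints only the matching parts with -o; -P enables Perl-style regexes
grep -oP '(?<=href=")[^"]+' page.html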
Hi Raffael, thanks a lot for this. I am using it in my weather Conky... rather clumsily, I have to admit (I couldn’t adapt hxselect properly).
Thank you for mentioning the hxtool set.
I have a couple of “need that now” scrapers of my own for various tasks, and just last week I came across a website whose HTML code wasn’t really nice.
Doing my usual sed actions seemed a bit cumbersome, so the hxtool set helped straighten it out.
But say I dump straight from lynx … why do you first go via wget? I couldn’t quite see the benefit.
Hey Steffen, that’s a valid question, and right now I am not sure why I chose to do it this way. The post is from a time when I started to use Linux, and maybe I was just fascinated by piping … maybe there was another, better reason :)
Most likely regarding the wget usage: you could use the -k switch, which converts all the links on the page into full links, i.e. href="/some/dir/index.php" to href="http://www.website.com/some/dir/index.php", which makes the saved HTML file much more useful to parse locally.
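A minimal sketch of that variant (assuming a reasonably recent wget; the saved file name is simply what wget derives from the URL):

# --convert-links / -k rewrites relative links in the saved copy into absolute ones
wget -k "http://www.bbc.com"     # saves index.html in the current directory

# the saved file can then be fed through the same chain as before
hxnormalize -x index.html | hxselect -i "div.most_popular_content" | lynx -stdin -dump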
Very nice post. I had a minor issue installing the html-xml-tools package (I’m running Ubuntu 12). Specifically,
sudo apt-get install html-xml-tools
generated the following error:
E: Unable to locate package html-xml-tools
I got around this by installing html-xml-utils instead:
sudo apt-get install html-xml-utils
Hey Robert, thanks for noticing! Actually that was my mistake – corrected now.
Same for Gentoo,
You can actually use
emerge -av app-text/html-xml-utils
Regards
Great article. I was searching for a simple Linux web scraping solution and will give this a go when I get the opportunity. Thanks for sharing :)
Thanks for the kind words. You’re welcome!
That’s nice. A Unix toolbox for web scraping. :-)
Didn’t know hxselect … W3C tools, nice.
Installed them now; maybe they make sense for some tasks, saving some programming time.
But the BBC webpage has changed, so there will be no new news coming out of your script.
When you’re right you’re right – I updated the command sequence and its result. Thanks for mentioning it!