Web Scraping Part I

In my spare time I am building a web scraping environment. Why an environment and not just a script or something like that? Because those pretty much already exist.

The aim of my project is to keep a list of websites to monitor, scrape them every now and then, and finally process the data. One of the key requirements is JavaScript support, since JavaScript is often used for lazy loading of websites’ content; that is why Beautiful Soup is not enough for my needs. I decided to use Selenium (currently in version 2.45) with Firefox, everything running on Lubuntu – at the moment as a virtual machine. All scraped data land in PostgreSQL, and screenshots (I take them too) in JPEG files. Besides that, I use a local DNS server (BIND9) to speed up DNS queries and block advertising/tracking domains that generate unnecessary traffic, and a local proxy server (Privoxy) to block more sophisticated advertising/tracking. Of course, everything is coded in Python.
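A minimal sketch of that browser setup, using the FirefoxProfile API as it exists in Selenium 2.x and assuming Privoxy listens on its default port 8118 (the preference values here are illustrative, not my exact configuration):

```python
from selenium import webdriver

# Firefox profile with tweaked preferences; all traffic is routed
# through the local Privoxy instance (8118 is Privoxy's default port).
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)  # 1 = manual proxy settings
profile.set_preference("network.proxy.http", "127.0.0.1")
profile.set_preference("network.proxy.http_port", 8118)
profile.set_preference("network.proxy.ssl", "127.0.0.1")
profile.set_preference("network.proxy.ssl_port", 8118)

driver = webdriver.Firefox(firefox_profile=profile)
driver.get("http://example.com")
print(driver.title)
driver.quit()
```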

Problems I faced:
– picking the right version of Firefox that would work smoothly with Selenium;
– setting Firefox preferences to speed up scraping;
– screenshots in Selenium are saved as PNG by default, so they need converting (see the first sketch below);
– sometimes interactions have to be driven through JavaScript (see the scrolling sketch below);
– making it work with several browsers running in parallel – I did not want to use tabs;
– blocking content loaded by websites over HTTPS that could not be handled by the DNS or proxy servers.
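For the PNG problem, one straightforward workaround – sketched here with Pillow, which is my assumption about the approach rather than the project's actual code – is to grab the screenshot as PNG bytes and re-encode them:

```python
import io

from PIL import Image  # Pillow
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com")

# Selenium hands back PNG bytes; re-encode them as JPEG with Pillow.
png_bytes = driver.get_screenshot_as_png()
image = Image.open(io.BytesIO(png_bytes))
# JPEG has no alpha channel, so drop it before saving.
image.convert("RGB").save("screenshot.jpg", "JPEG", quality=85)
driver.quit()
```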
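As for JavaScript-driven interactions, a typical case is scrolling to the bottom of the page so lazily loaded content gets fetched. A sketch of that idea (the fixed sleep is a simplification; an explicit wait on a concrete element would be more robust):

```python
import time

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com")

# Keep scrolling to the bottom until the page height stops growing,
# i.e. no more lazily loaded content arrives.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # crude wait for new content to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
driver.quit()
```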
