A few months back I had taken a virtual machine on Windows Azure for one month trial, to run an experimental project on web-crawling. The experiment was a good learning experience, more than anything, it helped me appreciate the technology infrastructure Google, Yahoo and MSN etc. might have. I couldn't crawl even a single website completely, while they continue to crawl billions of web-pages regularly. In this post, I’ll describe the way I went ahead in crawling more than 100 million public webpages by a machine with a core-2 duo processor and 3.5 GB RAM, till the trial period lasted.
I had used a popular C based library called libcurl to fetch the web pages and wrote a wrapper in C++ to parse the HTML pages.
Just like any web-crawler, the process started with a list of seed URLs, which may or may not result in new URLs. As you might have seen, most of the websites have many pages and inbuilt links to many other web-pages. These web-pages may be on the same website domain or may be part of other websites. The process of crawling is simple in logic - fetch a webpage, look for new URLs in its html (source code) and add the new URLs in the ever growing list which you have to fetch.
I started the crawler with a seed URL list of a few public web-profiles of a popular professional networking site. As every page has links to close to a dozen other profiles, it wasn't a problem to find new profile pages to add to the URL list.
The crawler started with a call to fetch one webpage at a time. Every request used to take around 5-10 seconds to completely fetch the webpage and parse its content. I was able to fetch close to 600 profiles in first hour. However it was very frustrating to see such a slow rate of crawling. After a few hours of search I came to understand the multiple request function of libcurl, which means that we can fetch up to 1000 pages in a single call. This could have increased the crawling rate very significantly and make the whole exercise meaningful.
As it was the wfirst time that I was using libcurl, it was taking time to explore its features, so I spent a weekend understand it and embedding its functionalities in my code.
So with the new crawler, I started to fetch 500 URLs at one go. As anyone can expect, the web-requests were blocked by the target website in just 2-3 minutes after a few attempts, however the multiple requests were successful, and it was able to fetch the pages successfully. Now the task was to run the crawler without getting blocked.
There can be two ways to do it, first is to hit the target website at a very low rate i.e. just a few hundred pages per day so that it doesn't take you as a threat and continue to respond to your requests. Another way is to send your requests such that these appear to come from disparate sources at a significant delay, so that target website doesn't get alarmed by an automated robot.
I chose the second option. I browsed many websites which provide proxy server IPs to make a list of around 3000 proxies and added a function in the code to continuously check the status of these proxies, 15-20% of these proxies were working regularly. The proxies which were working on my target website were used to fetch the webpages; those which were not active were removed and added at the end of the queue. In this fashion, the crawler was able to run continuously.
There were a few occurrences when the application will crash suddenly, after many debugging hours, I came to realize that the crash was happening in a libcurl function call, which I didn't have much control over as I was using it from the libcurl library. This was quite troubling as I had to restart the application once it crashes, that means I could not leave it running alone. So I found a solution to monitor the running status of the application through WMIC. If the application wasn't running than it launched the application, allowing me to sleep peacefully.
Next challenge came in handling the lists of URLs which were already fetched and the ones which were in the queue to be fetched. Every new URL had to be checked whether it was already fetched or already present in the queue to be fetched. When the list exceeded around 15 million URLs, it came to my notice that 3.5 GB RAM was not able to handle these lists. Since it was necessary to check whether a URL had been fetched already to avoid the repetition? I also had to see that in how many pages a URL has been found, giving it a rank, more the occurrences of the URL, better its rank and higher the priority of it in the queue.
After a few days of continuous crawling, it was a necessity to use a regular database to handle the URL queues. I chose SQLite because of its easy interface with C++. The process continued for over two-weeks, filling the hard drives close to 2 TB, with the fetched web-pages and the parsed content in the SQLite Database. The HTML parser was also written in C++, I tried doing that in Python, but it was relatively slower than C++, so I continued with C++.
Now that the trial period was about to be over as just 3 days’ worth of money was left in the trial account. I thought of taking the parsed content out of the virtual machine and store it in some safe place where I can access it later on. However, there was no way I could store close to 2 TB html content somewhere free of cost.
I thought of uploading a few GBs of meaningful data to an online drive. So I dragged the data file to the online drive and went to sleep. Next morning, what I saw was unbelievable! The trial was over and only a few GBs of data reached the online Drive. I could not log into the virtual machine as the trial was already over because of lack of credit in the trial account.
The cause was that, while the data download to the virtual machine was at very inexpensive cost, the data upload out of the machine was quite expensive. It sucked out all the remaining hundreds of rupees.
The trial account was closed abruptly, leaving all the parsed data and SQLite database files in the remote cloud server!
However, to me it was an amazing learning experience of doing a grain of what Google etc. do on a daily basis.