News broke that a political party was claiming there were duplicate voters in Delhi's voter lists. They even shared a few samples on their Facebook and Twitter accounts.
Is that really the case, I wondered. It certainly can be! Just about every organization struggles to maintain a single version of a person's identity, be it customers, leads or even its own employees. Even the largest online social networking platforms have duplicate and fake profiles. So it is no surprise if the voter lists, too, contain a certain number of duplicate or fake voters.
After all, these lists are updated by thousands of government employees, who keep adding, deleting and moving voters from one list to another. Even the technology used to store the data keeps changing, which results in duplicate and erroneous entries.
As a professional in the analytics industry, I regularly deal with data sets that may contain numerous duplicate or erroneous entries, which need to be removed before meaningful reports can be generated. So I decided to try my hand at finding the duplicate voters in Delhi's voter lists.
The task was challenging and interesting, and it had to be done! Just imagine finding duplicate voters scattered across 11,763 voter lists spanning more than 4,00,000 pages. It certainly cannot be done manually, even though, as I read, the volunteers of a political party tried to do just that.
I put my technology expertise to work on the task. The steps I took are described below:
1. I downloaded all the PDF files (11,763 in total) from the website of the Election Commission, using a web crawler I had written some time back.
|Pdf file format (Illustrative)|
2. I converted the PDF files into text/HTML files using open-source PDF libraries, e.g. xpdf and pdfminer.
|Extracted text file from a pdf file (Illustrative)|
3. I parsed the text files to extract the voters' information into a flat file format, e.g. CSV or TXT files.
|Parsing a text file to columnar format (Illustrative)|
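For readers curious about step 2, here is a minimal sketch of how the conversion could be scripted in Python. It is not the original script. It assumes the `pdftotext` command-line tool that ships with xpdf is installed, and the function names and folder names are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, out_dir):
    """Build the pdftotext command line for one PDF (nothing is run here)."""
    pdf_path = Path(pdf_path)
    txt_path = Path(out_dir) / (pdf_path.stem + ".txt")
    # -layout asks pdftotext to keep the physical layout of the page,
    # which may make the later parsing step easier (an assumption).
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_all(pdf_dir, out_dir):
    """Convert every PDF in pdf_dir and return the names of files that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        result = subprocess.run(pdftotext_command(pdf_path, out_dir))
        if result.returncode != 0:
            failed.append(pdf_path.name)
    return failed
```

Calling `convert_all("pdfs", "text")` would walk a hypothetical `pdfs` folder and write one `.txt` file per voter list. Collecting the failures instead of stopping matters when there are thousands of files to get through.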
The third part was the most challenging. It took about a week of late-night coding to figure out a way to extract the data and arrange it in columnar format so that it could be pushed into databases like MySQL.
Many improvements had to be made along the way to handle the errors present in the PDF files.
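To give a feel for the parsing step, here is a small Python sketch. The text layout in `SAMPLE_TEXT`, the field labels and the column names are all assumptions made for illustration. The real extracted files were messier and needed far more special handling than this.

```python
import csv
import io
import re

# Hypothetical layout of voter entries in an extracted text file.
SAMPLE_TEXT = """\
1 ABC1234567
Name: Ram Kumar
Father's Name: Shyam Lal
House No: 12
Age: 34 Sex: Male
2 XYZ7654321
Name: Sita Devi
Husband's Name: Mohan Singh
House No: 14A
Age: 29 Sex: Female
"""

# An entry is assumed to start with a serial number and a voter ID.
ENTRY_START = re.compile(r"^(\d+)\s+([A-Z]{3}\d{7})\s*$")
FIELD = re.compile(r"^(Name|Father's Name|Husband's Name|House No):\s*(.*)$")
AGE_SEX = re.compile(r"^Age:\s*(\d+)\s+Sex:\s*(\w+)\s*$")

COLUMNS = ["serial", "voter_id", "name", "relative_name", "house_no", "age", "sex"]


def parse_voters(text):
    """Turn the extracted text into a list of one-dict-per-voter records."""
    voters, current = [], None
    for line in text.splitlines():
        line = line.strip()
        start = ENTRY_START.match(line)
        if start:
            current = dict.fromkeys(COLUMNS, "")
            current["serial"], current["voter_id"] = start.groups()
            voters.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first entry, e.g. page headers
        field = FIELD.match(line)
        if field:
            label, value = field.groups()
            key = {"Name": "name", "House No": "house_no"}.get(label, "relative_name")
            current[key] = value.strip()
            continue
        age_sex = AGE_SEX.match(line)
        if age_sex:
            current["age"], current["sex"] = age_sex.groups()
    return voters


def to_csv(voters):
    """Render the records as CSV text, ready to load into a database."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(voters)
    return buffer.getvalue()
```

Lines that match none of the patterns are silently skipped here. In practice, logging them is a good way to discover the malformed entries that need extra rules.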
The result of the task was amazing. I got the details required to find the probable duplicate voters of the NCT of Delhi.
|Distribution of number of voters by Age - Delhi|
Having the data in columnar format enabled me to find duplicate voters across constituencies. Voters who had moved residence and got new voter ID cards but did not surrender their old ones were filtered out into an Excel file.
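One simple way to look for such cross-constituency duplicates is sketched below in Python. This is an assumption about how it could be done, not a description of the exact matching rules used. The records are made up, and the matching key (name, relative's name, age and sex) is only one possible choice.

```python
from collections import defaultdict


def normalize(text):
    """Lower-case and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def probable_duplicates(voters):
    """Return groups of records that share a key but sit in different constituencies."""
    groups = defaultdict(list)
    for voter in voters:
        key = (
            normalize(voter["name"]),
            normalize(voter["relative_name"]),
            voter["age"],
            voter["sex"],
        )
        groups[key].append(voter)
    return [
        records
        for records in groups.values()
        if len({record["constituency"] for record in records}) > 1
    ]


# Made-up records, for illustration only.
SAMPLE_VOTERS = [
    {"name": "Ram Kumar", "relative_name": "Shyam Lal", "age": "34",
     "sex": "Male", "constituency": "AC-01"},
    {"name": "ram  kumar", "relative_name": "Shyam Lal", "age": "34",
     "sex": "Male", "constituency": "AC-07"},
    {"name": "Sita Devi", "relative_name": "Mohan Singh", "age": "29",
     "sex": "Female", "constituency": "AC-01"},
]

for group in probable_duplicates(SAMPLE_VOTERS):
    print([record["constituency"] for record in group])
```

Exact matching like this only yields probable duplicates. Two different people can share all four fields, and spelling variants of the same name will be missed, so the output is a shortlist for manual scrutiny and not a verdict.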
The results were shared with the political party, which was interested in finding and removing duplicate voters to strengthen the democratic process in India. This whole exercise helped them make a strong case with the Election Commission to scrutinize the voter lists and remove the duplicate voters as far as possible!
Now that the elections in Delhi are to happen on 7th Feb '15, I hope that the best candidates get elected and people get a stable, inclusive and progressive government.