Open Source Data Tool For Analysing “Big Data”

Analysing Big Data does not have to be expensive – and it has become even cheaper thanks to a new tool that automates the data analysis needed to make sense of massive amounts of data.

Pete Warden, the UK-born developer who became famous after he scraped 220 million public Facebook profiles last year, unveiled his Data Science Toolkit at GigaOM’s Structure BigData conference in New York City on March 23. The Data Science Toolkit allows anyone to do automated conversions and data analysis on large data sets, he said.

In a 20-minute talk entitled “Supercomputing on a Minimum Wage”, Warden noted that data analysis does not have to be expensive. “You can hire a hundred servers from Amazon for $10 an hour,” he said.

Cloud Is The Key To “Big Data” Analysis

The toolkit is a collection of open data sets and open-source data analysis tools wrapped in an easy-to-use interface. Its features include extracting geographic locations from news articles and other types of unstructured data, and using optical character recognition (OCR) to convert PDFs of scanned images into text files, Warden said.

The Data Science Toolkit is available under the GPL (General Public Licence) and can be used either as a Web service or downloaded to run on Amazon EC2 (Elastic Compute Cloud) or in a virtual machine.

Users can also convert street addresses or IP addresses into latitude/longitude coordinates and apply those coordinates to map the information against political demographics data, according to the toolkit’s Website.

A quick test of a residential address in Brooklyn, New York, returned information about which Congressional district it was associated with.

It can also pull country, city and regional names from a block of text and return relevant co-ordinates using the Geodict tool. This is similar to Yahoo’s Placemaker tool, according to the toolkit’s description. Users can also paste in blocks of HTML from any page, including a news article, and see just the text that would actually be displayed in the browser, or identify real sentences within a block of text. The toolkit can also extract people’s names and titles, and guess gender from the entered text.
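The HTML-to-text conversion described above can be illustrated with a minimal sketch. This is not the toolkit's own code: it uses Python's standard `html.parser` module as a stand-in technique, and the names `VisibleTextExtractor` and `visible_text` are hypothetical, chosen only for this example. It simply assumes that anything inside `script`, `style`, `head`, `title` or `noscript` elements is not shown to the reader.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text that a browser would plausibly display (illustrative only)."""

    # Assumption: content inside these elements is never shown to the reader.
    SKIPPED_TAGS = {"script", "style", "head", "title", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # greater than zero while inside a skipped element
        self._chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the text content of an HTML string, without markup or scripts."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        "<html><head><title>Example</title><style>p {color: red}</style></head>"
        "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
    )
    print(visible_text(page))
```

A production tool would also need to handle hidden elements, navigation menus and adverts, which this sketch makes no attempt to do.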

Warden had used Amazon servers and a number of tools to analyse user profile data from 220 million Facebook users in February 2010. He used WebCrawler to crawl Facebook and scraped 500 million pages representing 220 million users last year. Thanks to “about a hundred bucks” and Amazon’s servers, he transformed the scraped data into a database-ready format in 10 hours, he said.
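The figures Warden quoted are roughly consistent with each other, as a quick back-of-the-envelope calculation shows. The sketch below assumes, purely for illustration, that the Facebook job ran on a fleet priced at the "$10 an hour for a hundred servers" rate from his talk; the article does not say how many servers he actually used, and the function names are hypothetical.

```python
# Figures quoted in the article.
SERVERS = 100                   # "a hundred servers from Amazon"
FLEET_RATE_USD_PER_HOUR = 10.0  # "$10 an hour" for the whole fleet
JOB_HOURS = 10                  # time taken to transform the scraped data


def per_server_rate(fleet_rate, servers):
    """Hourly cost of a single server, given the cost of the whole fleet."""
    return fleet_rate / servers


def job_cost(fleet_rate, hours):
    """Total cost of running the whole fleet for the given number of hours."""
    return fleet_rate * hours


if __name__ == "__main__":
    rate = per_server_rate(FLEET_RATE_USD_PER_HOUR, SERVERS)
    total = job_cost(FLEET_RATE_USD_PER_HOUR, JOB_HOURS)
    print(f"Per server: ${rate:.2f} an hour")
    print(f"Ten-hour job on the full fleet: ${total:.2f}")
```

Under that assumption, the total comes to $100, which matches the "about a hundred bucks" Warden mentioned.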

Using the data, he was able to analyse friendship relationships on Facebook and produced some fun visualisations of how cities and states in the United States are connected to each other through the site. He also correlated the data to show the most common names, fan pages and friend locations around the world.

Warden noted there were a number of ways to harvest similar data from other sources, including Google Profiles.

Facebook did not like what he was doing with the data, and took steps to stop him. It took him two months and $3,000 (£1,850) in legal fees to convince Facebook that what he was doing was not illegal, he said, but he still had to delete the data from the servers. Facebook claimed that he did not have permission to scrape the profiles, although he did not hack or compromise any pages and looked at only publicly available pages. Facebook also claimed that his stated plan to make the raw data available to researchers violated its terms of service.

“Big Data? Cheap. Lawyers? Not so cheap,” Warden said to audience laughter.

Fahmida Y Rashid eWEEK USA 2014. Ziff Davis Enterprise Inc. All Rights Reserved.
