Friday, October 7, 2011

Open source tools for big data analytics

Today I attended a webinar called "Big Data Technologies for Social Media Analytics" from Impetus Technologies. They introduced their iLaDaP platform built on top of a bunch of open source libraries. There were some case studies on financial and online-retailer data analytics, but not very detailed. My takeaway from this webinar: there are many open source projects surrounding Hadoop for big data analysis. Before simply adding them to your project, you need to understand their pros and cons.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop MapReduce
Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
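To make the model concrete, here is the classic word-count example sketched in plain Python. It simulates the two phases locally; a real Hadoop job would implement Mapper and Reducer classes in Java and run them across the cluster.

```python
from collections import defaultdict

def map_phase(document):
    # The mapper emits a (word, 1) pair for every word it sees.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # The reducer groups pairs by key and sums the values.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data big clusters", "data flows into clusters"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
word_counts = reduce_phase(pairs)
print(word_counts["big"])   # 2
print(word_counts["data"])  # 2
```

The framework handles the hard parts this toy version skips: partitioning input across nodes, shuffling pairs to reducers by key, and re-running failed tasks.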

Hadoop HDFS
Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
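The replication idea can be sketched in a few lines of Python. The function name and round-robin policy below are hypothetical, purely for illustration; real HDFS placement is rack-aware and handled by the NameNode.

```python
def place_replicas(blocks, nodes, replication=3):
    # Toy placement policy: assign each block to `replication`
    # distinct nodes, round-robin. Losing any one node still
    # leaves two live copies of every block.
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["blk_0", "blk_1"], nodes)
print(placement["blk_0"])  # ['node1', 'node2', 'node3']
```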

Apache Hive
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.

Apache Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

Oozie
Oozie is a workflow scheduler engine for managing chains of Hadoop jobs.

Sqoop
Sqoop is a tool designed to import data from relational databases into Hadoop.
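At its core, a Sqoop import runs a query against the source database and writes the rows out as delimited text files in HDFS. The effect can be sketched with the standard library; here sqlite3 stands in for the relational source, and the `orders` table is made up for illustration.

```python
import sqlite3

# An in-memory table standing in for the source RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 24.50)])

# Sqoop's default text output is comma-delimited, one row per line.
lines = [",".join(str(col) for col in row)
         for row in conn.execute("SELECT id, amount FROM orders")]
print(lines)  # ['1,9.99', '2,24.5']
```

The real tool adds what matters at scale: it splits the query range across parallel map tasks so the import itself runs as a MapReduce job.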

Apache Mahout
Mahout is a scalable machine learning library with implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering, and frequent pattern mining.
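To give a feel for the simplest of those algorithm families, here is a plain-Python k-means clustering sketch on 1-D points. Mahout implements the same idea as MapReduce jobs over HDFS data; this toy version is just to show what "clustering" computes.

```python
def kmeans(points, k, iterations=10):
    # Naive k-means: assign points to the nearest centroid,
    # then move each centroid to its cluster's mean.
    centroids = points[:k]  # naive initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(round(c, 6) for c in centroids)

# Two obvious groups of points -> two centroids near 1 and 9.
print(kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2))  # [1.0, 9.0]
```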

Apache HBase
HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data.

Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Apache Camel
Apache Camel is a powerful open source integration framework based on known Enterprise Integration Patterns with powerful Bean Integration.

NLTK: Natural Language Toolkit
Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux.
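A basic text-analytics step that NLTK automates is tokenization plus frequency counting. The idea can be sketched with the standard library alone; real work would use NLTK's tokenizers, corpora, and taggers rather than the crude regex below.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude word tokenizer; NLTK ships far more robust ones.
    return re.findall(r"[a-z']+", text.lower())

text = "Big data needs big tools, and big tools need data."
freq = Counter(tokenize(text))
print(freq.most_common(1))  # [('big', 3)]
```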

The Impetus presenter also mentioned two commercial vendors in this area.
Intellicus is a provider of web-based Business Intelligence and reporting solutions.

Greenplum is the pioneer of Enterprise Data Cloud solutions for large-scale data warehousing and analytics.
