Friday, October 7, 2011

Open sources for big data analytics

Today I attended a webinar called "Big Data Technologies for Social Media Analytics" from Impetus Technologies. They introduced their iLaDaP platform, built on top of a number of open source libraries. There were some case studies on financial and online-retailer data analytics, though not very detailed. My takeaway from this webinar: there are many open source projects surrounding Hadoop for big data analysis. Rather than simply adding them to your project, you need to understand their pros and cons.

Hadoop
http://hadoop.apache.org/
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop MapReduce
http://hadoop.apache.org/mapreduce/
Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
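To make the programming model concrete, here is a minimal, framework-free sketch of the classic MapReduce word count in plain Python. It only simulates the map, shuffle, and reduce phases in one process; a real Hadoop job would implement Mapper/Reducer classes in Java, or pipe scripts like these through Hadoop Streaming.

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, as a Mapper would.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group values by key, as the framework's shuffle/sort step does.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Sum the counts for one word, as a Reducer would.
    return key, sum(values)

lines = ["big data big analytics", "big data tools"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'analytics': 1, 'tools': 1}
```

The point of the pattern is that map_phase and reduce_phase are pure per-record functions, so the framework can run many copies of each in parallel across the cluster and only the shuffle needs coordination.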

Hadoop HDFS
http://hadoop.apache.org/hdfs/
Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
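The replication idea is easy to picture with a toy sketch (my own illustration, not HDFS code): split a file into fixed-size blocks and place each block on several data nodes, so any single node can fail without losing data. Real HDFS uses much larger blocks (64 MB by default in this era) and rack-aware placement rather than simple round-robin.

```python
BLOCK_SIZE = 4    # bytes here, purely for illustration
REPLICATION = 3   # HDFS's default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Chop the file into fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    # Round-robin placement of each block's replicas across nodes;
    # real HDFS also takes rack topology into account.
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"hello big data!!")
placement = place_replicas(blocks, nodes)
for block, replicas in placement.items():
    print(block, "->", replicas)
```

With three replicas per block, reads can be served from whichever copy is closest to the computation, which is what makes the "move the code to the data" style of MapReduce fast.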

Hive
http://hive.apache.org/
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.

Apache Pig
http://pig.apache.org/
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

Oozie
https://github.com/yahoo/oozie
Oozie is a workflow engine for Hadoop, used to chain MapReduce, Pig, and other jobs into larger pipelines.

Sqoop
https://github.com/cloudera/sqoop/wiki
Sqoop is a tool designed to import data from relational databases into Hadoop.

Mahout
http://mahout.apache.org/
Mahout provides scalable machine learning libraries, with implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering, and frequent pattern mining.

Hbase
http://hbase.apache.org/
HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data.

Flume
https://github.com/cloudera/flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Apache Camel
http://camel.apache.org/
Apache Camel is a powerful open source integration framework based on known Enterprise Integration Patterns with powerful Bean Integration.

NLTK: Natural Language Toolkit
http://www.nltk.org/
Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux.
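As a flavor of the kind of text analytics NLTK is used for, here is a tokenize-and-count sketch. To keep it runnable without the nltk package it uses only the standard library; with NLTK installed you would reach for its proper tokenizers and frequency distributions instead, which handle punctuation, contractions, and sentence boundaries far better than this crude regex.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude word tokenizer for illustration only; NLTK's tokenizers
    # are linguistically informed and language-aware.
    return re.findall(r"[a-z']+", text.lower())

text = "Big data needs big tools. Big tools need big clusters."
tokens = tokenize(text)
freq = Counter(tokens)
print(freq.most_common(2))  # [('big', 4), ('tools', 2)]
```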

Impetus webinar presenter also mentioned two companies in this area.
Intellicus
http://www.intellicus.com/
Intellicus is one of the leading providers of next-generation web-based Business Intelligence and Reporting solutions.

Greenplum
http://www.greenplum.com/
Greenplum is the pioneer of Enterprise Data Cloud solutions for large-scale data warehousing and analytics.
