Monday, June 13, 2016

Introduction to Big Data and Hadoop

Last time I wrote about the basic terms and definitions in data warehousing. Today, while searching for my next topic, I came across an interesting one: big data. Let's look at the impact of big data in the corporate sector and at which alternative to choose, a data warehouse or Hadoop, when storing massive amounts of data.
Introduction to Big data
In today's fast-moving technology landscape, organizations generate huge volumes of data with high velocity and variety (log files, video, images, text, etc.). Storing data at this scale makes storage capacity a crucial concern. Traditional relational database systems on their own are often no longer enough for this kind of workload, so new technologies and platforms with the right analytic tools have been introduced, and they have given a boost to top technology companies.
Big data is considered the new generation of data management. As the name suggests, the data is big in terms of its volume, its velocity (the speed at which it is generated and has to be processed) and the variety of forms in which it arrives. That's why it's called big data. For example, every time people tweet, a huge amount of data is generated. This data is captured, stored and analyzed using the right analytical tools to promote business growth.



What Is Apache Hadoop?
It's an open source Java framework that is primarily used for storing and analyzing big data. Hadoop helps in processing big data sets by splitting the data into small parts that are distributed across the nodes of a cluster. Many major tech companies, such as Yahoo and IBM, use the Hadoop framework for advertising, search engine optimization, etc., and Hadoop's design itself grew out of papers Google published on its file system and on MapReduce.
MapReduce is the processing framework of Apache Hadoop, in which the application logic splits the data so that it can be processed in parallel across the nodes of a large cluster. The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
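To make the idea concrete, here is a minimal sketch of the map, shuffle and reduce steps as a word count in plain Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a thread pool stands in for the cluster nodes that a real Hadoop job would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(chunks, workers=2):
    # Each chunk is mapped independently of the others. That independence is
    # what lets a real cluster spread the work over many nodes; here threads
    # are only a stand-in for those nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    chunks = ["big data is big", "hadoop stores big data"]
    print(word_count(chunks))
```

In a real Hadoop job the framework also handles what this sketch leaves out: scheduling the map and reduce tasks on different machines and re-running the ones that fail.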

Benefits of using Hadoop over other technologies
·         Apache Hadoop is considered a faster and cheaper analytical tool for big data.
·         Data can be stored in structured or unstructured form without formatting it first, whereas relational databases require the data to fit a properly defined schema before it is stored.
·         It's cost effective: the storage cost per terabyte is low, and it still delivers fast computation.
·         It's fault tolerant: data is replicated across the cluster, so it can be recovered in case of a failure.
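The schema point in the list above can be sketched in a few lines of Python. This is only an illustration and not Hadoop itself: an in-memory SQLite table plays the role of the relational database, a plain Python list of raw strings stands in for files in a store like HDFS, and the names events, raw_store, insert_with_extra_field and read_actions are all hypothetical.

```python
import json
import sqlite3

# Schema-on-write: the relational table must be defined before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT NOT NULL, action TEXT NOT NULL)")
conn.execute("INSERT INTO events VALUES (?, ?)", ("asha", "login"))


def insert_with_extra_field():
    """Try to store a record that does not match the table's schema."""
    try:
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?)", ("ravi", "tweet", "hello")
        )
        return True
    except sqlite3.Error:
        # The database rejects the row because it has one value too many.
        return False


# Schema-on-read: raw lines are stored as they arrive, whatever their shape.
raw_store = [
    '{"user": "asha", "action": "login"}',
    '{"user": "ravi", "action": "tweet", "text": "hello"}',
    "plain log line without any structure",
]


def read_actions(lines):
    """Apply a schema only at read time: keep records that have an action."""
    actions = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON; this particular analysis simply skips it
        if isinstance(record, dict) and "action" in record:
            actions.append(record["action"])
    return actions


if __name__ == "__main__":
    print(insert_with_extra_field())
    print(read_actions(raw_store))
```

The trade-off is that the relational table guarantees every stored row has the same shape, while the raw store accepts everything and leaves it to each analysis to decide what counts as a usable record.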

Why Big data?

Simply collecting and storing huge volumes of data doesn't by itself generate value for the organization. Value is created only when the data is analyzed and acted upon. We must make sure that the stored data can be analyzed and that the results provide real value to the business: they can be used for decision making, improving customer engagement, product development, and optimizing search in the digital marketing world.

Some takeaway points are:
·         Increased storage capacity
·         Stores and analyzes all structured and unstructured data
·         Deep data exploration for analysts
·         Flexible, and adapts to changing business trends

Some popular tools used in big data:
·         NoSQL: HBase, Cassandra
·         MapReduce: Hadoop, Hive, Pig, MapR
·         Storage: S3, Hadoop Distributed File System (HDFS)

What is the big difference between big data (Hadoop) and DW?
Nowadays IT organizations face tremendous challenges in deciding whether to use big data or a data warehouse to promote business growth. Many organizations are confused about when to use which alternative.
The major advantage of Hadoop lies in handling two complicated problems:
·         Capacity to handle large data sets
·         Capacity to run complex analytics
The diagram below shows how Hadoop fits alongside a DW in these two aspects.
http://www.jeanmartin.com/images/infographics/big-data-3.png





Below is a table that highlights the major differences between Hadoop and data warehousing.

|  | Hadoop | DWH |
|---|---|---|
| Data | All forms of data (structured, semi-structured and unstructured) | Only structured data, with a well-defined schema in place before the data is stored |
| Application | Newly used concept in the corporate sector, e.g. health care, retail | Traditional approach, already established in organizations |
| Tooling | New tools, e.g. MapReduce; business users use SQL queries or BI tools | Already installed, with good knowledge and experience |
| Costs | Low (per GB) | High (per GB) |
| Access | Batch processing in parallel | Interactive and batch |

In conclusion, Hadoop and DW share a symbiotic relationship. Consider implementing Hadoop if you are not able to solve your business problem with what you already have. Keep a check on security, governance and performance. Some differences are clear, but the choice largely depends on your organization and use cases. Analyze your business requirements and technical constraints carefully to ensure the best business outcome.
Potential value of Big data:
Below are some insights into the market growth that big data is driving:
·         Big data could generate $300 billion in potential annual value for US health care.
·         According to a Forbes report, big data analytics is the next trillion-dollar market, says Michael Dell. IDC has a more modest and specific prediction: it forecasts that the market for big data technology and services will grow at a 23.1% compound annual growth rate, reaching $48.6 billion in 2019.
·         The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020.
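As a quick sanity check on the last figure, the two numbers can be compared with a little compound growth arithmetic. The sketch below assumes that 2009 to 2020 counts as 11 yearly compounding periods, which is my reading and not something the estimate states; the function names are made up for illustration.

```python
def growth_factor(annual_rate, years):
    """Total growth multiple after compounding annual_rate for the given years."""
    return (1 + annual_rate) ** years


def implied_annual_rate(total_factor, years):
    """Constant annual rate that would produce total_factor over the given years."""
    return total_factor ** (1 / years) - 1


if __name__ == "__main__":
    years = 2020 - 2009  # assumption: 11 compounding periods

    # How much does data volume multiply at 40% growth per year?
    print(f"40% per year for {years} years: {growth_factor(0.40, years):.1f}x")

    # Which yearly rate would be needed to reach 44x in the same time?
    rate = implied_annual_rate(44, years)
    print(f"44x over {years} years needs about {rate * 100:.1f}% per year")
```

Under that assumption, 40% a year compounds to roughly 40x, and 44x needs a rate of about 41% a year, so the two figures in the estimate are roughly consistent with each other.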
Interesting links and videos to look out for:

1.        https://www.youtube.com/watch?v=lz_kIDxbzGA (How we found the worst place to park in New York City, using big data)
2.        https://www.youtube.com/watch?v=1RYKgj-QK4I (IBM Big Data and analytics at work in Banking)

Hope this helped you. See you soon in my next blog post!

