
hadoop

works on big data

Relational databases have scalability issues: they can't handle TB and PB of data, they aren't fast (real time, etc.), and they fall short on queryability and sophisticated processing (e.g. ML).

One of the core components of Hadoop is an alternate file system: HDFS, the Hadoop Distributed File System.

NoSQL databases –> key/value (Redis), column store, document, graph, etc.

Hadoop is simply a new/alternative filesystem with a processing library. Hadoop is most commonly implemented with HBase. HDFS is based on GFS (Google File System), which Google developed to index the web. The open source community took the GFS whitepaper and built HDFS from it; the two are very similar.

HBase is a NoSQL database - it is a wide column store - each row has a key and n values, e.g. a customer id as the key and the customer info as key/value pairs.

What makes it NoSQL is that the attributes can vary: you can have different attributes for different customers.
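
A minimal sketch of that flexibility with the HBase Java client API; the customers table, the info column family, and the attribute names are made-up examples, not anything prescribed:

#+BEGIN_SRC java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WideColumnSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table customers = conn.getTable(TableName.valueOf("customers"))) {

            // Row key = customer id; the columns are just key/value pairs.
            Put alice = new Put(Bytes.toBytes("cust-001"));
            alice.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            alice.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("alice@example.com"));

            // A different customer can carry a completely different attribute set;
            // no schema forces the rows to match.
            Put bob = new Put(Bytes.toBytes("cust-002"));
            bob.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Bob"));
            bob.addColumn(Bytes.toBytes("info"), Bytes.toBytes("loyalty_tier"), Bytes.toBytes("gold"));

            customers.put(alice);
            customers.put(bob);
        }
    }
}
#+END_SRC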

CAP theorem

Consistency - supports transactions. Availability - up-time (data is always available). Partition tolerance - the data can be stored as chunks, so parallel processing of the data is possible because of P.

Traditional RDBMSs have C and A.

The CAP theorem says a database can deliver only two of the three. Companies need scalability, which Hadoop has. Hadoop has P - the data is chunked, and 3 copies of each chunk are made by default. It has A - because of the 3 copies, your data is always there.

Hadoop doesn't have C - so mission-critical data, e.g. transactional data, is not a good fit for Hadoop.

What kind of data for Hadoop? LOB (line-of-business) transactional data - not a good fit.

Behavioral data - which is batch processed, not handled record by record - is a good fit for Hadoop, since C isn't very critical there.

The big data in the Hadoop DFS can be processed with both RDBMS-style and NoSQL query methods.

What is Hadoop? 2 components + Projects

  1. HDFS (an open source data storage method)
  2. MapReduce - processing API
  • Optional projects: Hive, Pig, HBase

Hadoop distributions: Open source: Apache Hadoop. Commercial: Cloudera, Hortonworks, MapR. Cloud: AWS, Azure HDInsight.

In the cloud (e.g. AWS), you can implement Apache's open source version or the Cloudera commercial ones, etc.

The commercial folks provide additional tooling, management, support, libraries, etc.

Why Hadoop? Cheaper, faster (parallel data processing), better (for particular types of big data).

Hadoop has the MapReduce processing algorithm for parallel data processing.

Examples of behavioral data workloads: recommendation engines, risk modelling, customer churn analysis, anomaly detection, ad targeting, search quality.

Difference between HBase and Hadoop: the Hadoop core is HDFS plus a processing framework, MapReduce. HDFS is a big data filesystem, so each chunk is 64 MB or 128 MB, or can be configured to be even larger.
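
A small sketch with the Hadoop FileSystem Java API showing the chunk (block) size being raised and then read back, along with the 3-fold replication mentioned elsewhere in these notes. The namenode address and file path are hypothetical:

#+BEGIN_SRC java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // hypothetical namenode
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // raise chunk size to 256 MB

        try (FileSystem fs = FileSystem.get(conf)) {
            Path p = new Path("/data/events.log");
            try (FSDataOutputStream out = fs.create(p)) {
                out.writeUTF("example record"); // tiny file, still occupies one block
            }
            FileStatus st = fs.getFileStatus(p);
            System.out.println("block size: " + st.getBlockSize());   // 268435456
            System.out.println("replicas:   " + st.getReplication()); // 3 by default
        }
    }
}
#+END_SRC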

MapReduce is written in Java. Folks don't want to query the data in HDFS directly; they want an abstraction, an API that makes the querying easier. They want an ORM, and MapReduce is that ORM.

MapReduce : Hadoop :: C++ : OOP
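
The canonical illustration of that API is word count, shown here essentially as in the Hadoop MapReduce tutorial: the map side emits (word, 1) pairs from each chunk in parallel, and the reduce side sums the counts per word.

#+BEGIN_SRC java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in this mapper's input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts gathered for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-sum on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
#+END_SRC

Packaged into a jar, this would be launched with something like hadoop jar wordcount.jar WordCount /input /output (the paths are placeholders).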

HBase, then, is a library that stores an id and a dict associated with that id. This is called schema on read: since the attributes aren't enforced on each id, each id can have its own attributes. Hive is a query language for HBase.

Hadoop processes run in separate JVMs, and the JVMs don't share state. The JVM processes differ between Hadoop 1 and 2.

Hadoop can be used with multiple filesystems as well, e.g.:

HDFS (larger chunks, 3-fold replication) –> it has 2 modes (distributed - 3 copies, pseudo-distributed - used for testing)

Regular file system, aka standalone mode - works on a single node

Cloud file system –> S3 on AWS, Blob storage on Azure

Files and JVMs: single node (standalone) - local file system, single JVM

Pseudo-distributed - uses HDFS; the JVM daemons run as processes on a single machine

Fully distributed - uses HDFS; the daemons run on various machines
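
All three modes run the same client code; what changes is configuration. A tiny sketch that just reports which filesystem the current configuration points at - in standalone mode fs.defaultFS resolves to file:///, while pseudo- and fully distributed setups point it at a namenode (e.g. hdfs://localhost:9000 in core-site.xml):

#+BEGIN_SRC java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ModeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml if it is on the classpath

        // file:/// => standalone; hdfs://... => (pseudo-)distributed
        System.out.println("default FS: " + conf.get("fs.defaultFS", "file:///"));

        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println("working dir: " + fs.getWorkingDirectory());
            System.out.println("/tmp exists: " + fs.exists(new Path("/tmp")));
        }
    }
}
#+END_SRC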

Hadoop ecosystem:

  • HDFS
  • HBase
  • YARN - MapReduce version 2 is called YARN (Yet Another Resource Negotiator)
  • Hive - SQL-like query language used to query HBase
  • Pig - scripting language used for ETL (extract, transform, load) like processes
  • Mahout - ML
  • Oozie - for workflows and scheduling
  • Zookeeper - coordination
  • Sqoop - data exchanger
  • Flume - log collector
  • Ambari - manages Hadoop clusters

Cloudera additionally has Hue - a UI framework which makes interacting with Hadoop easier.

Impala, Hive, and Pig are all used to query on top of MapReduce; they provide a layer of abstraction. Hive has HQL (Hive Query Language), which is very similar to the SQL used in RDBMSs.
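
For illustration, a sketch of firing an HQL query from Java over JDBC; the HiveServer2 host, the credentials, and the clickstream table are all hypothetical, but the jdbc:hive2:// URL scheme and driver class are the real Hive JDBC ones:

#+BEGIN_SRC java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HqlSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // Hive JDBC driver

        String url = "jdbc:hive2://hiveserver:10000/default"; // hypothetical host
        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement();
             // HQL reads like ordinary SQL; Hive compiles it into batch jobs
             // (classically MapReduce) over the underlying data.
             ResultSet rs = stmt.executeQuery(
                 "SELECT customer_id, COUNT(*) AS visits " +
                 "FROM clickstream GROUP BY customer_id")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
#+END_SRC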

You can set up EC2 and install Hadoop yourself, OR use Elastic (Up and running with )