Hadoop Deployment Cheat Sheet

Introduction

If you are using, or planning to use, the Hadoop framework for big data and Business Intelligence (BI), this document can help you navigate some of the technology and terminology, and guide you in setting up and configuring the system.

In this document we provide some background information about the framework, the key distributions, modules, components, and related products. We also provide you with single and multi-node Hadoop installation commands and configuration parameters.

The final section includes some tips and tricks to help you get started, and provides guidance in setting up a Hadoop project.

Key Hadoop Distributions

Vendor
Strength
Apache Hadoop
The open source distribution from Apache
Hortonworks
A leading vendor committed to a 100% open source package
Cloudera
Hadoop distribution with proprietary components for enterprise needs
MapR
Uses its own proprietary file system
IBM
Integration w/ IBM analytics products
Pivotal
Integration w/ Greenplum and Cloud Foundry (CF)

Hadoop Modules

Module
Description
Common
Common utilities. Supports other Hadoop modules
HDFS
Hadoop Distributed File System: provides high-throughput access to application data based on commodity hardware
YARN
Yet Another Resource Negotiator: a framework for cluster resource management including job scheduling
MapReduce
Software framework for parallel processing of large data sets based on YARN

Hadoop Components

Component / Module
Description
NameNode / HDFS
Maintains the directory tree and metadata (namespace) of the HDFS file system, analogous to the inode table of a traditional file system
Secondary NameNode / HDFS
Checkpointing helper for the NameNode (not a hot standby or failover mechanism). It creates checkpoints of the namespace by merging the edits file into the fsimage file
JournalNode / HDFS
Stores the shared edit log that keeps the active and standby NameNodes synchronized in an HA setup, enabling automatic failover
DataNode / HDFS
Nodes (or servers) that store the actual data
NFS3 Gateway / HDFS
Gateway daemon that allows clients to mount HDFS as an NFSv3 file system
ResourceManager / YARN
Global daemon that arbitrates resources among all the applications in the Hadoop cluster
ApplicationMaster / YARN
Takes care of a single application: gets resources for it from the ResourceManager and works with the NodeManager to consume them and monitor the tasks
NodeManager / YARN
Per-machine agent responsible for containers, monitoring their resource usage (CPU, memory, disk) and reporting back to the ResourceManager
Container / YARN
An allocation of resources (CPU, memory) on a specific machine, used to run the tasks of a specific application

Hadoop Ecosystem – Related Products

Product
Description
Ambari
A completely open-source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters
Apex
Stream ("data-in-motion") and batch processing platform based on YARN
Azkaban
Workflow job scheduling and management system for Hadoop
Flume
Reliable, distributed and available service that streams logs into HDFS
Knox
Authentication and Access gateway service for Hadoop
HBase
Distributed non-relational database that runs on top of HDFS
Hive
Data warehouse system based on Hadoop
Mahout
Machine learning algorithms (clustering, classification and batch-based collaborative filtering) implemented on top of MapReduce
Impala
Enables low-latency SQL queries on HBase and HDFS
Oozie
Workflow job scheduling and management system for Hadoop
Ranger
Access policy manager for HDFS files, folders, databases, tables and columns
Spark
Cluster computing framework that utilizes YARN and HDFS. Supports streaming and batch jobs, and provides a SQL interface and a machine learning library.
Sqoop
Command-line tool for migrating data between relational databases (RDBMS) and Hadoop
Tez
Application framework for running complex Directed Acyclic Graph (DAG) of tasks based on YARN
Pig
High level platform (and script-like language) to create and run programs on MapReduce, Tez and Spark
ZooKeeper
Distributed name registry, synchronization service and configuration service that is used as a sub-system in Hadoop

Major Hadoop Cloud Providers

Cloud operator
Service name
Amazon Web Services
EMR (Elastic MapReduce)
IBM Softlayer
IBM BigInsights
Microsoft Azure
HDInsight

Common Data Formats

Format
Description
Avro
JSON-based format that includes RPC and serialization support. Designed for systems that exchange data.
Parquet
Columnar storage format
ORC
Optimized Row Columnar: a fast columnar storage format
RCFile
Data placement format for relational tables
SequenceFile
Binary format that stores records as typed key/value pairs
Unstructured
Hadoop also supports various unstructured data formats

Single Node Installation

Requirement / Task
Command
Java Installation / Check version
>java -version
Java Installation / Install
>sudo apt-get -y update && sudo apt-get -y install default-jdk
Create User and Permissions / Create User
>useradd hadoop
>passwd hadoop
>mkdir /home/hadoop
>chown -R hadoop:hadoop /home/hadoop
Create User and Permissions / Create keys
>su - hadoop
>ssh-keygen -t rsa
>cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
>chmod 0600 ~/.ssh/authorized_keys
Download and unpack Hadoop
>wget http://apache.spd.co.il/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
>tar xzf hadoop-2.7.2.tar.gz
>mv hadoop-2.7.2 hadoop
Environment / Env Vars
Add the following lines to ~/.bashrc:
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Then reload the environment:
>source ~/.bashrc
Environment / Set Java_Home
>vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.8.0_05/
Configuration files / Edit if required (a minimal single-node example is shown after this table)
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
Format NameNode
>hdfs namenode -format
Start System
>cd $HADOOP_HOME/sbin/
>start-dfs.sh
>start-yarn.sh
Test System
>bin/hdfs dfs -mkdir /user
>bin/hdfs dfs -mkdir /user/hadoop
>bin/hdfs dfs -put /var/log/httpd logs
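
A minimal pseudo-distributed configuration, shown as a hedged example (the hostname, port and replication value below are common defaults, not requirements):
>vi $HADOOP_HOME/etc/hadoop/core-site.xml
//add inside the <configuration> element:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
>vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
//add inside the <configuration> element:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>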

Multi-node Installation

Task
Command
Configure hosts on each node
>vi /etc/hosts
192.168.1.11 hadoop-master
192.168.1.12 hadoop-slave-1
192.168.1.13 hadoop-slave-2
Enable cross node authentication
>su - hadoop
>ssh-keygen -t rsa
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
>chmod 0600 ~/.ssh/authorized_keys
>exit
Copy system
>su - hadoop
>cd /opt/hadoop
>scp -r hadoop hadoop-slave-1:/opt/hadoop
>scp -r hadoop hadoop-slave-2:/opt/hadoop
Configure Master
>su - hadoop

>cd /opt/hadoop/hadoop

>vi conf/masters
//add your master node to the file:
hadoop-master

>vi conf/slaves
//add your slave nodes to the file, one hostname per line:
hadoop-slave-1
hadoop-slave-2

Format NameNode on master
>su - hadoop
>cd /opt/hadoop/hadoop
>bin/hadoop namenode -format
Start system
>bin/start-all.sh
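
To verify that the daemons came up on each node (a quick check; jps is part of the JDK and dfsadmin requires HDFS to be running):
>jps
>hdfs dfsadmin -report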

Backup HDFS Metadata

Task
Command
Stop the cluster
>stop-all.sh
Perform cold backup to metadata directories
>cd /data/dfs/nn
>tar -czvf /tmp/backup.tar.gz .
Start the cluster
>start-all.sh
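
If stopping the cluster is not practical, the latest fsimage can also be fetched from a running NameNode (the target directory below is an example):
>hdfs dfsadmin -fetchImage /tmp/nn-backup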

HDFS Basic Commands

Task
Command
List the contents of the /data directory
>hdfs dfs -ls /data/
Upload a file from the local file system to HDFS
>hdfs dfs -put logs.csv /data/
Read the content of the file from HDFS
>hdfs dfs -cat /data/logs.csv
Change the permission of a file
>hdfs dfs -chmod 744 /data/logs.csv
Set the replication factor of a file to 3
>hdfs dfs -setrep -w 3 /data/logs.csv
Check the size of the file
>hdfs dfs -du -h /data/logs.csv
Create a subdirectory and move the file into it
>hdfs dfs -mkdir logs
>hdfs dfs -mv /data/logs.csv logs/
Remove directory from HDFS
>hdfs dfs -rm -r logs
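
A few related everyday commands (the paths are examples):
>hdfs dfs -get /data/logs.csv ./logs.csv
>hdfs dfs -mkdir -p /data/archive
>hdfs dfs -cp /data/logs.csv /data/archive/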

HDFS Administration

Task
Command
Balance the cluster storage
>hdfs balancer -threshold 10 (where 10 is the allowed disk-usage deviation, in percent)
Run the NameNode
>hdfs namenode
Run the secondary NameNode
>hdfs secondarynamenode
Run a datanode
>hdfs datanode
Run the NFS3 gateway
>hdfs nfs3
Run the RPC portmap for the NFS3 gateway
>hdfs portmap
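
Other useful administration commands (these assume HDFS superuser privileges):
>hdfs dfsadmin -report
>hdfs dfsadmin -safemode get
>hdfs fsck /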

YARN

Task
Command
Show yarn help
>yarn
Define configuration file
>yarn [--config confdir]
Define log level
>yarn [--loglevel loglevel] where loglevel is FATAL, ERROR, WARN, INFO, DEBUG or TRACE
User commands
Show Hadoop classpath
>yarn classpath
Show and kill application
>yarn application
Show application attempt
>yarn applicationattempt
Show container information
>yarn container
Show node information
>yarn node
Show queue information
>yarn queue
Administration commands
Start NodeManager
>yarn nodemanager
Start Proxy web server
>yarn proxyserver
Start ResourceManager
>yarn resourcemanager
Run ResourceManager admin client
>yarn rmadmin
Start Shared Cache Manager
>yarn sharedcachemanager
Start TimeLineServer
>yarn timelineserver
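
For example, to list running applications and kill one by ID (the application ID below is a placeholder):
>yarn application -list
>yarn application -kill application_1476886780452_0001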

MapReduce

Submit the WordCount MapReduce job to the cluster
>hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input logs-output
Check the output of this job in HDFS
>hadoop fs -cat logs-output/*
Submit a scalding job
>hadoop jar scalding.jar com.twitter.scalding.Tool Scalding
Kill a MapReduce job
>yarn application -kill <application-id>
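
To list running MapReduce jobs and check their status (mapred ships with the standard distribution; the job ID below is a placeholder):
>mapred job -list
>mapred job -status job_1476886780452_0001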

Web UIs

Resource
Default URI
NameNode
http://<namenode-host>:50070/
DataNode
http://<datanode-host>:50075/
Sec NameNode
http://<secondary-namenode-host>:50090/
Resource Manager
http://<resourcemanager-host>:8088
HBase Master
http://<hbase-master-host>:60010

Secure Hadoop

Aspect
Best Practice
Authentication
  • Define users
  • Enable Kerberos in Hadoop
  • Set up the Knox gateway to control access and authentication to the HDFS cluster
  • Integrate with the organization’s SSO and LDAP
Authorization
  • Define groups
  • Define HDFS Permissions
  • Define HDFS ACLs (see the example after this table)
  • Enable Ranger policies to control access to HDFS files, folders, databases, tables and columns
Audit
  • Enable process execution audit trail
Data Protection
  • Wire encryption with Knox or Hadoop
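
For example, HDFS ACLs can be managed from the command line (this assumes dfs.namenode.acls.enabled is set to true in hdfs-site.xml; the user and path are placeholders):
>hdfs dfs -setfacl -m user:analyst:r-- /data/logs.csv
>hdfs dfs -getfacl /data/logs.csv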

Hadoop Tips and Tricks

Project Concept
Iterate cluster sizing to optimize performance and meet actual load patterns
Hardware
Clusters with more nodes recover faster
The higher the storage per node, the longer the recovery time
Use commodity hardware:
  • Use large slow disks (SATA) without RAID (3-6TB disks)
  • Use as much RAM as is cost-effective (96-192GB RAM)
  • Use mainstream CPU with as many cores as possible (8-12 cores)
Invest in reliable hardware for the NameNodes
NameNode RAM should be 2GB + 1GB for every 100TB raw disk space
Networking cost should be 20% of hardware budget
40 nodes is the critical mass to achieve best performance/cost ratio
Plan for actual net storage capacity of about 25% of raw storage capacity: three-way replication divides raw capacity by three, and roughly 25% of disk space should be kept free as spare capacity
Operating System and JVM
Must be 64-bit
Set file descriptor limit to 64K (ulimit)
Enable time synchronization using NTP
Speed up reads by mounting disks with NOATIME
Disable transparent hugepages (see the command sketch below)
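A rough sketch of the corresponding commands (exact steps vary by Linux distribution; /data is a placeholder mount point):
>ulimit -n 65536
>systemctl enable --now ntpd (or chronyd, depending on the distribution)
>mount -o remount,noatime /data
>echo never > /sys/kernel/mm/transparent_hugepage/enabled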
System
Enable monitoring using Ambari
Monitor the checkpoints of the NameNode to verify that they occur at the correct times. This will enable you to recover your cluster when needed
Avoid reaching 90% cluster disk utilization
Balance the cluster periodically using balancer
Edit metadata files using Hadoop utilities only, to avoid corruption
Keep replication >= 3
Place quotas and limits on users and project directories, as well as on tasks to avoid cluster starvation
Clean /tmp regularly – it tends to fill up with junk files
Optimize the number of reducers to avoid system starvation
Verify that the file system you selected is supported by your Hadoop vendor
Data and System Recovery
Disk failure is not an issue
DataNode failure is not a major issue
NameNode failure is an issue, even in a clustered environment
Make regular backups of the NameNode metadata
Enable NameNode clustering using ZooKeeper
Provide sufficient disk space for NameNode logging
Enable the HDFS trash feature in core-site.xml to protect against accidental permanent deletion (rm -r)
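
A minimal sketch of the relevant property (the interval is in minutes; 1440, i.e. one day, is an example value, not a recommendation):
>vi $HADOOP_HOME/etc/hadoop/core-site.xml
//add inside the <configuration> element:
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>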