Hadoop Deployment Cheat Sheet

Introduction

If you are using, or planning to use, the Hadoop framework for big data and Business Intelligence (BI), this document can help you navigate some of the technology and terminology, and guide you in setting up and configuring the system.

In this document we provide some background information about the framework, the key distributions, modules, components, and related products. We also provide you with single and multi-node Hadoop installation commands and configuration parameters.

The final section includes some tips and tricks to help you get started, and provides guidance in setting up a Hadoop project.

Key Hadoop Distributions

Vendor
Strength
Apache Hadoop
The open source distribution from Apache
Hortonworks
A leading vendor committed to a 100% open source package
Cloudera
Hadoop distribution with proprietary components for enterprise needs
MapR
Uses its own proprietary file system
IBM
Integration w/ IBM analytics products
Pivotal
Integration w/ Greenplum and Cloud Foundry (CF)

Hadoop Modules

Module
Description
Common
Common utilities. Supports other Hadoop modules
HDFS
Hadoop Distributed File System: provides high-throughput access to application data based on commodity hardware
YARN
Yet Another Resource Negotiator: a framework for cluster resource management including job scheduling
MapReduce
Software framework for parallel processing of large data sets based on YARN

Hadoop Components

Component / Module
Description
NameNode / HDFS
Maintains the directory tree and metadata (namespace) of the HDFS file system, analogous to the inode table of a traditional file system
Secondary NameNode / HDFS
Checkpointing helper for the NameNode (not a hot standby or failover mechanism). It creates checkpoints of the namespace by merging the edits file into the fsimage file
JournalNode / HDFS
Stores the shared edit log that keeps the active and standby NameNodes synchronized in an HA setup, enabling automatic failover
DataNode / HDFS
Nodes (or servers) that store the actual data
NFS3 Gateway / HDFS
Gateway daemon that allows clients to mount HDFS as an NFSv3 file system
ResourceManager / YARN
Global daemon that arbitrates resources among all the applications in the Hadoop cluster
ApplicationMaster / YARN
Takes care of a single application: gets resources for it from the ResourceManager and works with the NodeManager to consume them and monitor the tasks
NodeManager / YARN
Per-machine agent responsible for containers, monitoring their resource usage (CPU, memory, disk) and reporting back to the ResourceManager
Container / YARN
An allocation of resources (CPU, memory) on a specific machine, used to run the tasks of a specific application

Hadoop Ecosystem – Related Products

Product
Description
Ambari
A completely open-source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters
Apex
Stream ("data-in-motion") and batch processing platform based on YARN
Azkaban
Workflow job scheduling and management system for Hadoop
Flume
Reliable, distributed and available service that streams logs into HDFS
Knox
Authentication and Access gateway service for Hadoop
HBase
Distributed non-relational database that runs on top of HDFS
Hive
Data warehouse system based on Hadoop
Mahout
Machine learning algorithms (clustering, classification and batch-based collaborative filtering) implemented on top of MapReduce
Impala
Enables low-latency SQL queries on HBase and HDFS
Oozie
Workflow job scheduling and management system for Hadoop
Ranger
Access policy manager for HDFS files, folders, databases, tables and columns
Spark
Cluster computing framework that utilizes YARN and HDFS. Supports streaming and batch jobs, and provides a SQL interface and a machine learning library.
Sqoop
Command-line tool for migrating data between relational databases (RDBMS) and Hadoop
Tez
Application framework for running complex Directed Acyclic Graph (DAG) of tasks based on YARN
Pig
High level platform (and script-like language) to create and run programs on MapReduce, Tez and Spark
ZooKeeper
Distributed name registry, synchronization service and configuration service that is used as a sub-system in Hadoop

Major Hadoop Cloud Providers

Cloud operator
Service name
Amazon Web Services
EMR (Elastic MapReduce)
IBM Softlayer
IBM BigInsights
Microsoft Azure
HDInsight

Common Data Formats

Format
Description
Avro
JSON-based format that includes RPC and serialization support. Designed for systems that exchange data.
Parquet
Columnar storage format
ORC
Optimized Row Columnar: a fast columnar storage format
RCFile
Data placement format for relational tables
SequenceFile
Binary format that stores records as typed key/value pairs
Unstructured
Hadoop also supports various unstructured data formats

Single Node Installation

Requirement / Task
Command
Java Installation / Check version
>java -version
Java Installation / Install
>sudo apt-get -y update && sudo apt-get -y install default-jdk
Create User and Permissions / Create User
>useradd hadoop
>passwd hadoop
>mkdir /home/hadoop
>chown -R hadoop:hadoop /home/hadoop
Create User and Permissions / Create keys
>su - hadoop
>ssh-keygen -t rsa
>cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
>chmod 0600 ~/.ssh/authorized_keys
Download and unpack Hadoop
>wget http://apache.spd.co.il/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
>tar xzf hadoop-2.7.2.tar.gz
>mv hadoop-2.7.2 hadoop
Environment / Env Vars
Add the following lines to ~/.bashrc:
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Then reload the environment:
>source ~/.bashrc
Environment / Set Java_Home
>vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.8.0_05/
Configuration files / Edit if required (a minimal single-node example is shown after this table)
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
Format NameNode
>hdfs namenode -format
Start System
>cd $HADOOP_HOME/sbin/
>start-dfs.sh
>start-yarn.sh
Test System
>bin/hdfs dfs -mkdir /user
>bin/hdfs dfs -mkdir /user/hadoop
>bin/hdfs dfs -put /var/log/httpd logs
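
A minimal pseudo-distributed configuration, shown as a hedged example (the hostname, port and replication value below are common defaults, not requirements):
>vi $HADOOP_HOME/etc/hadoop/core-site.xml
//add inside the <configuration> element:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
>vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
//add inside the <configuration> element:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>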

Multi-node Installation

Task
Command
Configure hosts on each node
>vi /etc/hosts
192.168.1.11 hadoop-master
192.168.1.12 hadoop-slave-1
192.168.1.13 hadoop-slave-2
Enable cross node authentication
>su - hadoop
>ssh-keygen -t rsa
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
>chmod 0600 ~/.ssh/authorized_keys
>exit
Copy system
>su - hadoop
>cd /opt/hadoop
>scp -r hadoop hadoop-slave-1:/opt/hadoop
>scp -r hadoop hadoop-slave-2:/opt/hadoop
Configure Master
>su - hadoop

>cd /opt/hadoop/hadoop

>vi conf/masters
//add your master node to the file:
hadoop-master

>vi conf/slaves
//add your slave nodes to the file, one hostname per line:
hadoop-slave-1
hadoop-slave-2

Format NameNode on master
>su - hadoop
>cd /opt/hadoop/hadoop
>bin/hadoop namenode -format
Start system
>bin/start-all.sh
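
To verify that the daemons came up on each node (a quick check; jps is part of the JDK and dfsadmin requires HDFS to be running):
>jps
>hdfs dfsadmin -report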

Backup HDFS Metadata

Task
Command
Stop the cluster
>stop-all.sh
Perform cold backup to metadata directories
>cd /data/dfs/nn
>tar -czvf /tmp/backup.tar.gz .
Start the cluster
>start-all.sh
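
If stopping the cluster is not practical, the latest fsimage can also be fetched from a running NameNode (the target directory below is an example):
>hdfs dfsadmin -fetchImage /tmp/nn-backup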

HDFS Basic Commands

Task
Command
List the contents of the /data directory
>hdfs dfs -ls /data/
Upload a file from the local file system to HDFS
>hdfs dfs -put logs.csv /data/
Read the content of the file from HDFS
>hdfs dfs -cat /data/logs.csv
Change the permission of a file
>hdfs dfs -chmod 744 /data/logs.csv
Set the replication factor of a file to 3
>hdfs dfs -setrep -w 3 /data/logs.csv
Check the size of the file
>hdfs dfs -du -h /data/logs.csv
Create a subdirectory and move the file into it
>hdfs dfs -mkdir logs
>hdfs dfs -mv /data/logs.csv logs/
Remove directory from HDFS
>hdfs dfs -rm -r logs
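
A few related everyday commands (the paths are examples):
>hdfs dfs -get /data/logs.csv ./logs.csv
>hdfs dfs -mkdir -p /data/archive
>hdfs dfs -cp /data/logs.csv /data/archive/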

HDFS Administration

Task
Command
Balance the cluster storage
>hdfs balancer -threshold 10 (where 10 is the allowed disk-usage deviation, in percent)
Run the NameNode
>hdfs namenode
Run the secondary NameNode
>hdfs secondarynamenode
Run a datanode
>hdfs datanode
Run the NFS3 gateway
>hdfs nfs3
Run the RPC portmap for the NFS3 gateway
>hdfs portmap
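
Other useful administration commands (these assume HDFS superuser privileges):
>hdfs dfsadmin -report
>hdfs dfsadmin -safemode get
>hdfs fsck /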

YARN

Task
Command
Show yarn help
>yarn
Define configuration file
>yarn [--config confdir]
Define log level
>yarn [--loglevel loglevel] where loglevel is FATAL, ERROR, WARN, INFO, DEBUG or TRACE
User commands
Show Hadoop classpath
>yarn classpath
Show and kill application
>yarn application
Show application attempt
>yarn applicationattempt
Show container information
>yarn container
Show node information
>yarn node
Show queue information
>yarn queue
Administration commands
Start NodeManager
>yarn nodemanager
Start Proxy web server
>yarn proxyserver
Start ResourceManager
>yarn resourcemanager
Run ResourceManager admin client
>yarn rmadmin
Start Shared Cache Manager
>yarn sharedcachemanager
Start TimeLineServer
>yarn timelineserver
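
For example, to list running applications and kill one by ID (the application ID below is a placeholder):
>yarn application -list
>yarn application -kill application_1476886780452_0001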

MapReduce

Submit the WordCount MapReduce job to the cluster
>hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input logs-output
Check the output of this job in HDFS
>hadoop fs -cat logs-output/*
Submit a scalding job
>hadoop jar scalding.jar com.twitter.scalding.Tool Scalding
Kill a MapReduce job
>yarn application -kill <application-id>
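
To list running MapReduce jobs and check their status (mapred ships with the standard distribution; the job ID below is a placeholder):
>mapred job -list
>mapred job -status job_1476886780452_0001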

Web UIs

Resource
Default URI
NameNode
http://<namenode-host>:50070/
DataNode
http://<datanode-host>:50075/
Sec NameNode
http://<secondary-namenode-host>:50090/
Resource Manager
http://<resourcemanager-host>:8088
HBase Master
http://<hbase-master-host>:60010

Secure Hadoop

Aspect
Best Practice
Authentication
  • Define users
  • Enable Kerberos in Hadoop
  • Set up the Knox gateway to control access and authentication to the HDFS cluster
  • Integrate with the organization’s SSO and LDAP
Authorization
  • Define groups
  • Define HDFS Permissions
  • Define HDFS ACLs (see the example after this table)
  • Enable Ranger policies to control access to HDFS files, folders, databases, tables and columns
Audit
  • Enable process execution audit trail
Data Protection
  • Wire encryption with Knox or Hadoop
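
For example, HDFS ACLs can be managed from the command line (this assumes dfs.namenode.acls.enabled is set to true in hdfs-site.xml; the user and path are placeholders):
>hdfs dfs -setfacl -m user:analyst:r-- /data/logs.csv
>hdfs dfs -getfacl /data/logs.csv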

Hadoop Tips and Tricks

Project Concept
Iterate cluster sizing to optimize performance and meet actual load patterns
Hardware
Clusters with more nodes recover faster
The higher the storage per node, the longer the recovery time
Use commodity hardware:
  • Use large slow disks (SATA) without RAID (3-6TB disks)
  • Use as much RAM as is cost-effective (96-192GB RAM)
  • Use mainstream CPU with as many cores as possible (8-12 cores)
Invest in reliable hardware for the NameNodes
NameNode RAM should be 2GB + 1GB for every 100TB raw disk space
Networking cost should be 20% of hardware budget
40 nodes is the critical mass to achieve best performance/cost ratio
Plan for actual net storage capacity of about 25% of raw storage capacity: three-way replication divides raw capacity by three, and roughly 25% of disk space should be kept free as spare capacity
Operating System and JVM
Must be 64-bit
Set file descriptor limit to 64K (ulimit)
Enable time synchronization using NTP
Speed up reads by mounting disks with NOATIME
Disable transparent hugepages (see the command sketch below)
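A rough sketch of the corresponding commands (exact steps vary by Linux distribution; /data is a placeholder mount point):
>ulimit -n 65536
>systemctl enable --now ntpd (or chronyd, depending on the distribution)
>mount -o remount,noatime /data
>echo never > /sys/kernel/mm/transparent_hugepage/enabled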
System
Enable monitoring using Ambari
Monitor the checkpoints of the NameNode to verify that they occur at the correct times. This will enable you to recover your cluster when needed
Avoid reaching 90% cluster disk utilization
Balance the cluster periodically using balancer
Edit metadata files using Hadoop utilities only, to avoid corruption
Keep replication >= 3
Place quotas and limits on users and project directories, as well as on tasks to avoid cluster starvation
Clean /tmp regularly – it tends to fill up with junk files
Optimize the number of reducers to avoid system starvation
Verify that the file system you selected is supported by your Hadoop vendor
Data and System Recovery
Disk failure is not an issue
DataNode failure is not a major issue
NameNode failure is an issue, even in a clustered environment
Make regular backups of the NameNode metadata
Enable NameNode clustering using ZooKeeper
Provide sufficient disk space for NameNode logging
Enable the HDFS trash feature in core-site.xml to protect against accidental permanent deletion (rm -r)
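
A minimal sketch of the relevant property (the interval is in minutes; 1440, i.e. one day, is an example value, not a recommendation):
>vi $HADOOP_HOME/etc/hadoop/core-site.xml
//add inside the <configuration> element:
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>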