
Personal Hadoop Cluster Deployment

Environment: CentOS 6.6 x64 (3 nodes, for learning)

Software: JDK 1.7 + Hadoop 2.7.3 + Hive 2.1.1

Environment preparation:

1. Install essential tools

yum -y install openssh wget curl tree screen nano lftp htop mysql

2. Use the 163 yum mirror:

cd /etc/yum.repos.d/
wget http://mirrors.163.com/.help/CentOS6-Base-163.repo   # CentOS6 repo file, to match the CentOS 6.6 environment
# back up the original repo file
mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
mv CentOS6-Base-163.repo CentOS-Base.repo
# rebuild the yum cache
yum clean all
yum makecache

3. Switch off the graphical interface (boot to runlevel 3):

vim /etc/inittab   # change the default runlevel from 5 to 3 (text-mode boot)
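Concretely, on CentOS 6 the line to change in /etc/inittab looks like this:

# before
id:5:initdefault:
# after
id:3:initdefault: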

4. Set static IPs, hostnames, and the hosts file

(1) Plan

192.168.235.138 node1
192.168.235.139 node2
192.168.235.140 node3

Next, on each node, set the IP address, hostname, and hosts file according to the plan for that machine.

(2) Static IP (each node)

# Option 1: configure it in the text UI
# setup

# Option 2: edit the network config file; a complete example:
# cat /etc/sysconfig/network-scripts/ifcfg-Auto_eth1
HWADDR=00:0C:29:2C:9F:4A
TYPE=Ethernet
BOOTPROTO=none
IPADDR=192.168.235.139
PREFIX=24
GATEWAY=192.168.235.1
DNS1=192.168.235.1
DEFROUTE=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
NAME="Auto eth1"
UUID=2753c781-4222-47bd-85e7-44877cde27dd
ONBOOT=yes
LAST_CONNECT=1491415778
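After saving the file, restart the network service so the new address takes effect (standard on CentOS 6):

service network restart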

(3) Hostname (each node)

# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=node1      # set the HOSTNAME value for each node

(4) hosts (each node)

# cat /etc/hosts
# append the following at the end of the file
192.168.235.138 node1
192.168.235.139 node2
192.168.235.140 node3
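A quick check that the names resolve between nodes:

ping -c 1 node2    # repeat for node1/node3; each name should resolve and answer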

5. Disable the firewall

# service iptables stop       # stop the firewall now
# service iptables status     # confirm it is stopped
# chkconfig iptables off      # keep it off after reboot

6. Create a regular user

# useradd hadoop
# passwd hadoop
# visudo    # below the line "root    ALL=(ALL)       ALL", add:
hadoop  ALL=(ALL)       ALL

7. Set up passwordless SSH login

Option 1: automated deployment script

# cat ssh.sh
# note: run ssh-keygen -t rsa first so a key pair exists to copy
SERVERS="node1 node2 node3"
PASSWORD=123456
BASE_SERVER=192.168.235.138
yum -y install expect

auto_ssh_copy_id() {
    expect -c "set timeout -1;
        spawn ssh-copy-id $1;
        expect {
            *(yes/no)* {send -- yes\r;exp_continue;}
            *assword:* {send -- $2\r;exp_continue;}
            eof        {exit 0;}
        }"
}

ssh_copy_id_to_all() {
    for SERVER in $SERVERS
    do
        auto_ssh_copy_id $SERVER $PASSWORD
    done
}

ssh_copy_id_to_all

Option 2: manual setup

ssh-keygen -t rsa    # generate a key pair
scp ~/.ssh/id_rsa.pub hadoop@192.168.235.139:~/    # distribute the public key to the other nodes with scp or ssh-copy-id
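With plain scp, the copied key still has to be appended to authorized_keys on the receiving node; a minimal sketch, run there as the hadoop user:

mkdir -p ~/.ssh
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys   # sshd ignores keys with loose permissions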

Cluster planning and installation

1. Node plan

node1: NameNode, DataNode, NodeManager
node2: ResourceManager, DataNode, NodeManager, JobHistoryServer
node3: SecondaryNameNode, DataNode, NodeManager

Note how the roles are divided: DataNodes store data and NodeManagers process it, so the two should run on the same nodes; otherwise moving data between them would consume a lot of network bandwidth.

This layout is for a personal machine and for learning only. A typical production layout looks like this:

Reference layout for 7 nodes, Hadoop 2.x (HA: high availability)

Hostname   IP address      Processes
cloud01    192.168.2.31    namenode, zkfc
cloud02    192.168.2.32    namenode, zkfc
cloud03    192.168.2.33    resourcemanager
cloud04    192.168.2.34    resourcemanager
cloud05    192.168.2.35    journalnode, datanode, nodemanager, QuorumPeerMain
cloud06    192.168.2.36    journalnode, datanode, nodemanager, QuorumPeerMain
cloud07    192.168.2.37    journalnode, datanode, nodemanager, QuorumPeerMain

Notes:
    namenode:         manages metadata
    resourcemanager:  cluster resource scheduling
    datanode:         stores data
    nodemanager:      runs computations on the data
    journalnode:      shared storage for NameNode edit logs (metadata)
    zkfc:             ZKFailoverController; switches NameNodes on failure
    QuorumPeerMain:   the ZooKeeper server process

HA plus ZooKeeper removes the single point of failure and provides automatic failover.

Source: http://blog.csdn.net/shenfuli/article/details/44889757

2. Install JDK and Hadoop

(1) Install JDK and Hadoop

Upload the packages to the server, then write and run the following script in the directory where the packages live:

#!/bin/bash
tar -zxvf jdk-7u79-linux-x64.tar.gz -C /opt/
tar -zxvf hadoop-2.7.3.tar.gz -C /usr/local/
# 'EOF' is quoted so $PATH etc. are written literally into /etc/profile
cat >> /etc/profile << 'EOF'
export JAVA_HOME=/opt/jdk1.7.0_79/
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
source /etc/profile
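Open a new shell (or run source /etc/profile) and confirm both tools are on the PATH:

java -version      # should print java version "1.7.0_79"
hadoop version     # should print Hadoop 2.7.3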

(2) Pre-create Hadoop working directories

# run from /usr, so the paths match the file:///usr/hadoop/... values in the configs below
mkdir -p hadoop/tmp
mkdir -p hadoop/dfs/data
mkdir -p hadoop/dfs/name
mkdir -p hadoop/namesecondary
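These directories must exist on every node and be writable by the daemon user; a minimal sketch, assuming the hadoop user created earlier and the /usr/hadoop base path used by the configs:

chown -R hadoop:hadoop /usr/hadoop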

3. Configure Hadoop

The basic configuration files are as follows:

# cd /usr/local/hadoop-2.7.3/etc/hadoop/
# ls -l | awk '{print $9}'
core-site.xml
hadoop-env.sh
hdfs-site.xml
mapred-site.xml
slaves
yarn-site.xml

The settings are as follows (each <property> block goes inside the <configuration> element of its file):

(1) core-site.xml

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:9000</value>
</property>
<property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
</property>
<property>
    <name>fs.trash.interval</name>
    <value>1440</value>
</property>

(2) hadoop-env.sh

export JAVA_HOME=/opt/jdk1.7.0_79/

(3) hdfs-site.xml

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/hadoop/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/hadoop/dfs/data</value>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>node3:9001</value>
</property>
<property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///usr/hadoop/namesecondary</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>replication factor</description>
</property>
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
<property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>4096</value>
</property>

(4) mapred-site.xml

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>node2:10020</value>
    <description>MapReduce JobHistory Server host:port. Default port is 10020.</description>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>node2:19888</value>
    <description>MapReduce JobHistory Server web UI host:port. Default port is 19888.</description>
</property>
<property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/history</value>
</property>
<property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>${yarn.app.mapreduce.am.staging-dir}/history/done</value>
</property>
<property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value>
</property>
<property>
    <name>mapreduce.map.log.level</name>
    <value>DEBUG</value>
</property>
<property>
    <name>mapreduce.reduce.log.level</name>
    <value>DEBUG</value>
</property>
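If mapred-site.xml is not present yet (Hadoop 2.7.3 ships it as a template), create it from the template first:

cd /usr/local/hadoop-2.7.3/etc/hadoop/
cp mapred-site.xml.template mapred-site.xml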

(5) slaves

node1
node2
node3

(6) yarn-site.xml

<property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<!-- Configurations for ResourceManager and NodeManager: -->
<!--
<property>
    <name>yarn.acl.enable</name>
    <value>false</value>
    <description>Enable ACLs? Defaults to false.</description>
</property>
<property>
    <name>yarn.admin.acl</name>
    <value>Admin ACL</value>
    <description>ACL to set admins on the cluster. ACLs are of the form comma-separated-users space comma-separated-groups. Defaults to the special value of *, which means anyone. The special value of just a space means no one has access.</description>
</property>
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>false</value>
    <description>Enable or disable log aggregation.</description>
</property>
-->

<!-- Configurations for ResourceManager: -->
<property>
    <name>yarn.resourcemanager.address</name>
    <value>node2:8032</value>
    <description>host:port for clients to submit jobs. If set, overrides the hostname set in yarn.resourcemanager.hostname.</description>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>node2:8030</value>
    <description>host:port for ApplicationMasters to talk to the Scheduler to obtain resources. If set, overrides the hostname set in yarn.resourcemanager.hostname.</description>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>node2:8031</value>
    <description>host:port for NodeManagers. If set, overrides the hostname set in yarn.resourcemanager.hostname.</description>
</property>
<property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>node2:8033</value>
    <description>host:port for administrative commands. If set, overrides the hostname set in yarn.resourcemanager.hostname.</description>
</property>
<property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>node2:8088</value>
    <description>host:port of the ResourceManager web UI. If set, overrides the hostname set in yarn.resourcemanager.hostname.</description>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node2</value>
    <description>Single hostname that can be set in place of setting all the yarn.resourcemanager.*.address properties. Results in default ports for the ResourceManager components.</description>
</property>

<!--
<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>ResourceManager Scheduler class.</value>
    <description>CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler.</description>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>Minimum limit of memory to allocate to each container request at the ResourceManager.</value>
    <description>In MBs.</description>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>Maximum limit of memory to allocate to each container request at the ResourceManager.</value>
    <description>In MBs.</description>
</property>
<property>
    <name>yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path</name>
    <value>List of permitted/excluded NodeManagers.</value>
    <description>If necessary, use these files to control the list of allowable NodeManagers.</description>
</property>
-->

<!-- Configurations for NodeManager: -->
<!--
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>Available physical memory, in MB, for the given NodeManager.</value>
    <description>Defines the total resources on the NodeManager made available to running containers.</description>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>Maximum ratio by which virtual memory usage of tasks may exceed physical memory.</value>
    <description>The virtual memory usage of each task may exceed its physical memory limit by this ratio.</description>
</property>
<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>Comma-separated list of paths on the local filesystem where intermediate data is written.</value>
    <description>Multiple paths help spread disk I/O.</description>
</property>
<property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>Comma-separated list of paths on the local filesystem where logs are written.</value>
    <description>Multiple paths help spread disk I/O.</description>
</property>
<property>
    <name>yarn.nodemanager.log.retain-seconds</name>
    <value>10800</value>
    <description>Default time (in seconds) to retain log files on the NodeManager. Only applicable if log aggregation is disabled.</description>
</property>
<property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/logs</value>
    <description>HDFS directory to which application logs are moved on application completion. Needs appropriate permissions. Only applicable if log aggregation is enabled.</description>
</property>
<property>
    <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
    <value>logs</value>
    <description>Suffix appended to the remote log dir. Logs are aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log aggregation is enabled.</description>
</property>
-->

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>Shuffle service that needs to be set for MapReduce applications.</description>
</property>
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<property>
    <name>yarn.log.server.url</name>
    <value>http://node2:19888/jobhistory/logs</value>
</property>

Note: avoid non-ASCII characters in the actual configuration files.

This configuration is for reference only; uncomment the optional properties you need, and add or remove settings to fit your own setup.
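Every node needs an identical copy of these files; a minimal sketch that pushes the config directory to the other nodes, assuming the same install path everywhere:

for host in node2 node3; do
    scp -r /usr/local/hadoop-2.7.3/etc/hadoop hadoop@$host:/usr/local/hadoop-2.7.3/etc/
done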

4. Start the cluster

(1) Format the NameNode

hadoop namenode -format    # run once on node1; the script is in ${HADOOP_HOME}/bin, and in Hadoop 2.x the preferred form is "hdfs namenode -format"

(2) Start and stop the cluster

Commonly used start/stop scripts (in ${HADOOP_HOME}/sbin):

# ls -l | awk '{print $9}'
start-all.sh / stop-all.sh             # start/stop all daemons
start-dfs.sh / stop-dfs.sh             # start/stop HDFS
start-yarn.sh / stop-yarn.sh           # start/stop YARN
mr-jobhistory-daemon.sh                # job history server
hadoop-daemon.sh / hadoop-daemons.sh
yarn-daemon.sh / yarn-daemons.sh
start-balancer.sh / stop-balancer.sh   # rebalance the block distribution across DataNodes

Three ways to start the cluster:

Option 1: start daemons one by one (the usual approach in production)
    hadoop-daemon.sh start|stop namenode|datanode|journalnode
    yarn-daemon.sh start|stop resourcemanager|nodemanager

Option 2: start HDFS and YARN separately
    start-dfs.sh
    start-yarn.sh

Option 3: start everything at once
    start-all.sh

Job history service:
    mr-jobhistory-daemon.sh start historyserver
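After starting, jps on each node should show the processes from the node plan, and with the configuration above the web UIs are reachable at node1:50070 (NameNode), node2:8088 (ResourceManager), and node2:19888 (JobHistory):

jps
# node1: NameNode, DataNode, NodeManager
# node2: ResourceManager, DataNode, NodeManager, JobHistoryServer
# node3: SecondaryNameNode, DataNode, NodeManager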

 
