
Hadoop: Hadoop Cluster Configuration Files

I. Environment Preparation

1. System Environment

Two CentOS 7 servers (server1 and server2), hosting the NameNode, ResourceManager, DataNode and NodeManager daemons between them.

2. Software Environment

java-1.8.0-openjdk

java-1.8.0-openjdk-devel

Hadoop 2.7.3

II. Hadoop Configuration Files

Hadoop's configuration files:

  • Read-only default configuration files: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml
  • Site-specific configuration files: etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml
  • Hadoop environment scripts: etc/hadoop/hadoop-env.sh, etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh

Administrators can customize site-specific settings by editing the etc/hadoop/hadoop-env.sh, etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts. Editing these scripts means configuring the environment variables used by the Hadoop daemons, for example JAVA_HOME.
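As a minimal sketch, the JDK path and per-daemon JVM flags could be set in etc/hadoop/hadoop-env.sh like this (the JDK path and heap size below are example values, not taken from the original post; adjust them for your hosts):

```shell
# Excerpt from etc/hadoop/hadoop-env.sh -- paths and heap sizes are examples.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

# Per-daemon JVM options, e.g. give the NameNode a larger heap:
export HADOOP_NAMENODE_OPTS="-Xmx2g ${HADOOP_NAMENODE_OPTS}"
```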

Administrators can configure individual Hadoop daemons through the following environment variables:

Daemon | Environment Variable
NameNode | HADOOP_NAMENODE_OPTS
DataNode | HADOOP_DATANODE_OPTS
Secondary NameNode | HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager | YARN_RESOURCEMANAGER_OPTS
NodeManager | YARN_NODEMANAGER_OPTS
WebAppProxy | YARN_PROXYSERVER_OPTS
MapReduce Job History Server | HADOOP_JOB_HISTORYSERVER_OPTS
Other useful configuration variables:

  • HADOOP_PID_DIR: directory where the daemons' process ID files are stored
  • HADOOP_LOG_DIR: directory where log files are stored; it is created automatically if it does not exist
  • HADOOP_HEAPSIZE / YARN_HEAPSIZE: maximum heap size, in MB. The default is 1000, i.e. 1000 MB. This lets you give the Hadoop daemons on an individual node a different heap size.

In most cases you should set HADOOP_PID_DIR and HADOOP_LOG_DIR explicitly, because the user running the Hadoop daemons needs write permission on these directories.
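These variables also go in etc/hadoop/hadoop-env.sh. A sketch, assuming example directories (make sure the hadoop user can write to them):

```shell
# Excerpt from etc/hadoop/hadoop-env.sh -- directories are example values.
export HADOOP_PID_DIR=/var/run/hadoop   # the user running the daemons must be able to write here
export HADOOP_LOG_DIR=/var/log/hadoop   # created automatically if it does not exist
export HADOOP_HEAPSIZE=2000             # daemon heap size in MB (default 1000)
```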

III. Hadoop Daemons and Their Configuration

  • HDFS daemons: NameNode, SecondaryNameNode, DataNode
  • YARN daemons: ResourceManager, NodeManager, WebAppProxy
  • MapReduce daemon: MapReduce Job History Server

The important parameters in each configuration file are described below.

1. etc/hadoop/core-site.xml

Parameter | Value | Notes
fs.defaultFS | NameNode URI | hdfs://host:port/
io.file.buffer.size | 131072 | Read/write file buffer size, in bytes
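A minimal core-site.xml for a cluster like this might look as follows (the hostname server1 and port 9000 are illustrative assumptions, not values from the original):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://server1:9000/</value> <!-- example NameNode URI -->
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value> <!-- 128 KB read/write buffer -->
  </property>
</configuration>
```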

2. etc/hadoop/hdfs-site.xml

NameNode configuration parameters:

Parameter | Value | Notes
dfs.namenode.name.dir | Directory on the local filesystem where the NameNode stores the namespace and transaction logs | For a comma-separated list of directories, a copy of the metadata is stored in each directory, for redundancy
dfs.hosts | DataNode whitelist | If unset, all DataNodes are permitted
dfs.hosts.exclude | DataNode blacklist (hosts not allowed to connect) | If unset, no DataNodes are excluded
dfs.blocksize | 268435456 | HDFS block size, in bytes. The default in Hadoop 2.x is 128 MB; 256 MB is a good choice for very large files
dfs.namenode.handler.count | 100 | Number of NameNode server threads handling RPCs from the DataNodes


DataNode configuration parameters:

Parameter | Value | Notes
dfs.datanode.data.dir | Comma-separated list of local filesystem directories where the DataNode stores its blocks | With multiple directories, blocks are spread across all of them, typically one directory per disk
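Putting the NameNode and DataNode parameters together, a hypothetical hdfs-site.xml could look like this (all directory paths are example values):

```xml
<configuration>
  <!-- NameNode: a full copy of the namespace metadata is kept in each listed directory -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/name1,/data/hadoop/name2</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value> <!-- 256 MB blocks for large files -->
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value>
  </property>
  <!-- DataNode: blocks are spread across the listed directories, typically one per disk -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hadoop/data1,/data/hadoop/data2</value>
  </property>
</configuration>
```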


3. etc/hadoop/yarn-site.xml

Configuration shared by the ResourceManager and NodeManager:

Parameter | Value | Notes
yarn.acl.enable | true / false | Whether ACL-based access control is enabled; defaults to false
yarn.admin.acl | Admin ACL | ACL for the cluster administrators. The default is *, which grants access to everyone; a value of just a space grants access to no one
yarn.log-aggregation-enable | false | Whether log aggregation is enabled; defaults to false


ResourceManager configuration parameters:

Parameter | Value | Notes
yarn.resourcemanager.address | host:port for clients to connect to and submit jobs | If set, this overrides the hostname set in yarn.resourcemanager.hostname
yarn.resourcemanager.scheduler.address | host:port for ApplicationMasters to talk to the scheduler and obtain resources | If set, this overrides the hostname set in yarn.resourcemanager.hostname
yarn.resourcemanager.resource-tracker.address | host:port for NodeManagers to connect to the ResourceManager | If set, this overrides the hostname set in yarn.resourcemanager.hostname
yarn.resourcemanager.admin.address | host:port for administrative commands | If set, this overrides the hostname set in yarn.resourcemanager.hostname
yarn.resourcemanager.webapp.address | host:port of the ResourceManager web UI | If set, this overrides the hostname set in yarn.resourcemanager.hostname
yarn.resourcemanager.hostname | ResourceManager hostname |
yarn.resourcemanager.scheduler.class | Java class of the ResourceManager scheduler | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler
yarn.scheduler.minimum-allocation-mb | Minimum memory granted per resource request, in MB |
yarn.scheduler.maximum-allocation-mb | Maximum memory granted per resource request, in MB |
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | Paths to files listing permitted/excluded NodeManagers | Use these files if you need to control which NodeManagers may join the cluster


NodeManager configuration parameters:

Parameter | Value | Notes
yarn.nodemanager.resource.memory-mb | Physical memory, in MB, the NodeManager may hand out to containers | Works together with yarn.scheduler.minimum-allocation-mb and yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which the virtual memory usage of tasks may exceed physical memory | Each task's virtual memory may exceed its physical memory limit by this ratio; the total virtual memory used by tasks on the NodeManager may exceed its physical memory usage by the same ratio
yarn.nodemanager.local-dirs | Comma-separated list of local directories for intermediate data | Multiple directories spread the disk I/O
yarn.nodemanager.log-dirs | Comma-separated list of local directories for logs | Multiple directories spread the disk I/O
yarn.nodemanager.log.retain-seconds | 10800 | Default time (in seconds) to retain log files on the NodeManager; only applicable if log aggregation is disabled
yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory to which application logs are moved on application completion; needs appropriate permissions; only applicable if log aggregation is enabled
yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log dir; logs are aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}; only applicable if log aggregation is enabled
yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that must be set for MapReduce applications
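A minimal yarn-site.xml combining a few of the parameters above might look like this (the hostname and the memory figures are illustrative assumptions, not values from the original):

```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>server1</value> <!-- example host; the RM addresses derive their host from this -->
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value> <!-- required for MapReduce applications -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value> <!-- example: physical memory available to containers -->
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
  </property>
</configuration>
```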


History Server configuration:

Parameter | Value | Notes
yarn.log-aggregation.retain-seconds | -1 | How long to keep aggregated logs before deleting them; -1 disables deletion. Be careful: setting this too small will spam the NameNode
yarn.log-aggregation.retain-check-interval-seconds | -1 | Time between checks for aggregated log retention; 0 or a negative value means one-tenth of the retention time. Be careful: setting this too small will spam the NameNode


4. etc/hadoop/mapred-site.xml

Configuration for MapReduce applications:

Parameter | Value | Notes
mapreduce.framework.name | yarn | Run MapReduce on the Hadoop YARN execution framework
mapreduce.map.memory.mb | 1536 | Larger resource limit for maps
mapreduce.map.java.opts | -Xmx1024M | Larger heap size for the child JVMs of maps
mapreduce.reduce.memory.mb | 3072 | Larger resource limit for reduces
mapreduce.reduce.java.opts | -Xmx2560M | Larger heap size for the child JVMs of reduces
mapreduce.task.io.sort.mb | 512 | Higher memory limit while sorting data, for efficiency
mapreduce.task.io.sort.factor | 100 | More streams merged at once while sorting files
mapreduce.reduce.shuffle.parallelcopies | 50 | More parallel copies run by reduces to fetch outputs from a very large number of maps


MapReduce JobHistory Server configuration:

Parameter | Value | Notes
mapreduce.jobhistory.address | MapReduce JobHistory Server host:port | Default port is 10020
mapreduce.jobhistory.webapp.address | MapReduce JobHistory Server web UI host:port | Default port is 19888
mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where history files are written by MapReduce jobs
mapreduce.jobhistory.done-dir | /mr-history/done | Directory where history files are managed by the MR JobHistory Server
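Combining the application and JobHistory settings above, a sketch of mapred-site.xml could look like this (the host server1 is an example value):

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024M</value> <!-- keep the JVM heap below the container limit -->
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>server1:10020</value> <!-- example host; default port 10020 -->
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>server1:19888</value> <!-- example host; default port 19888 -->
  </property>
</configuration>
```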


Health Monitoring of NodeManagers

Hadoop provides a mechanism by which administrators can configure the NodeManager to run a script periodically to determine whether a node is healthy and usable. If the script detects the node to be in a bad state, it must print a line beginning with ERROR to standard output. The NodeManager runs the script periodically and checks its output: if it finds an ERROR line, it marks the node as unhealthy and the node is blacklisted, so no further tasks are assigned to it. Once the node is healthy again, the periodic check detects this, the node is removed from the blacklist, and it resumes receiving tasks.

The health-monitoring script is configured in etc/hadoop/yarn-site.xml with the following parameters:

Parameter | Value | Notes
yarn.nodemanager.health-checker.script.path | Node health script | Script that checks the node's health status
yarn.nodemanager.health-checker.script.opts | Node health script options | Options passed to the health-check script
yarn.nodemanager.health-checker.script.interval-ms | Node health script interval | Time interval for running the health script
yarn.nodemanager.health-checker.script.timeout-ms | Node health script timeout interval | Timeout for the health script's execution
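For illustration, these properties could be added inside the <configuration> element of etc/hadoop/yarn-site.xml as follows (the script path and the intervals are hypothetical values):

```xml
<property>
  <name>yarn.nodemanager.health-checker.script.path</name>
  <value>/etc/hadoop/health-check.sh</value> <!-- hypothetical script location -->
</property>
<property>
  <name>yarn.nodemanager.health-checker.script.interval-ms</name>
  <value>600000</value> <!-- run the check every 10 minutes -->
</property>
<property>
  <name>yarn.nodemanager.health-checker.script.timeout-ms</name>
  <value>60000</value> <!-- give the script 60 seconds to finish -->
</property>
```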


The NodeManager also periodically checks the health of the local disks, specifically the directories configured in yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs. Once the number of bad directories reaches the threshold derived from yarn.nodemanager.disk-health-checker.min-healthy-disks, the whole node is marked unhealthy.

Slaves File

List all slave hostnames or IP addresses in the etc/hadoop/slaves file, one per line. The helper scripts (described below) use this file to run commands on many hosts at once. It is not used for any of the Java-based Hadoop configuration. To use this functionality, ssh trust (passwordless login) should be established between the Hadoop nodes.
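For the two servers in this setup, etc/hadoop/slaves would simply contain one hostname per line:

```
server1
server2
```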

 

 
