
[Spark Learning] Spark 1.1.0 with CDH5.2: Installation and Deployment

[Date] November 18, 2014

[Platform] CentOS 6.5

[Tools] scp

[Software] jdk-7u67-linux-x64.rpm

    spark-worker-1.1.0+cdh5.2.0+56-1.cdh5.2.0.p0.35.el6.noarch.rpm
    spark-core-1.1.0+cdh5.2.0+56-1.cdh5.2.0.p0.35.el6.noarch.rpm
    spark-history-server-1.1.0+cdh5.2.0+56-1.cdh5.2.0.p0.35.el6.noarch.rpm
    spark-master-1.1.0+cdh5.2.0+56-1.cdh5.2.0.p0.35.el6.noarch.rpm
    spark-python-1.1.0+cdh5.2.0+56-1.cdh5.2.0.p0.35.el6.noarch.rpm

[Steps]

    1. Prerequisites

      (1) Cluster plan

          Host type   IP address      Domain name
          master      192.168.50.10   master.hadoop.com
          worker      192.168.50.11   slave1.hadoop.com
          worker      192.168.50.12   slave2.hadoop.com
          worker      192.168.50.13   slave3.hadoop.com


      (2) Log in to the operating system as root

      (3) On each host in the cluster, run the following command to set its hostname (replace * with that host's own short name: master, slave1, slave2, or slave3):

          hostname *.hadoop.com 

          Then make the change permanent by editing /etc/sysconfig/network:

          HOSTNAME=*.hadoop.com 
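          The two hostname steps above can also be driven from a single machine. A minimal sketch, assuming passwordless root ssh between the hosts is already set up (the IP-to-name pairs come from the cluster plan above):

          ```shell
          # Set the hostname on every node, both for the running system (hostname)
          # and persistently (/etc/sysconfig/network on CentOS 6).
          for pair in "192.168.50.10 master" "192.168.50.11 slave1" \
                      "192.168.50.12 slave2" "192.168.50.13 slave3"; do
              set -- $pair   # $1 = IP address, $2 = short host name
              ssh root@$1 "hostname $2.hadoop.com && \
                  sed -i 's/^HOSTNAME=.*/HOSTNAME=$2.hadoop.com/' /etc/sysconfig/network"
          done
          ```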

      (4) Edit /etc/hosts as follows (the addresses must match the cluster plan above):

         192.168.50.10 master.hadoop.com
         192.168.50.11 slave1.hadoop.com
         192.168.50.12 slave2.hadoop.com
         192.168.50.13 slave3.hadoop.com

         Copy the hosts file to every other host in the cluster. Note that scp does not expand a wildcard such as 192.168.50.* in the host part, so the command must be run once per host, e.g.:

          scp /etc/hosts 192.168.50.11:/etc/hosts 
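          One way to cover all the workers in one go (a sketch, assuming root ssh access to each worker):

          ```shell
          # Copy /etc/hosts to each worker in turn; scp cannot expand 192.168.50.*.
          for ip in 192.168.50.11 192.168.50.12 192.168.50.13; do
              scp /etc/hosts root@$ip:/etc/hosts
          done
          ```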

      (5) Install the JDK

         rpm -ivh jdk-7u67-linux-x64.rpm 

      (6) Install hadoop-client

         yum install hadoop-client 

      (7) Disable iptables

         service iptables stop  

         chkconfig iptables off 

      (8) Disable SELinux. Edit /etc/selinux/config as follows, then reboot the operating system:

         SELINUX=disabled
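         The edit can also be scripted, and SELinux can be switched to permissive for the current session so the reboot can wait (a sketch; run as root on each host):

         ```shell
         # Persistently disable SELinux (takes effect after the next reboot).
         sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
         # Stop enforcing immediately for the running session.
         setenforce 0
         ```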

    2. Installation

           yum install spark-core spark-master spark-worker spark-history-server spark-python 

    3. Configuration. After editing the files below, copy them to every host in the cluster with scp.

      (1) Edit /etc/spark/conf/spark-env.sh

          export STANDALONE_SPARK_MASTER_HOST=master.hadoop.com 

      (2) Edit /etc/spark/conf/spark-defaults.conf

          spark.master                     spark://master.hadoop.com:7077 

          spark.eventLog.enabled           true 

          spark.eventLog.dir               hdfs://master.hadoop.com:8020/user/spark/eventlog

          spark.yarn.historyServer.address http://master.hadoop.com:18081 

          spark.executor.memory            2g 

          spark.logConf                    true 

      (3) Copy the configuration files to every host in the cluster

          scp /etc/spark/conf/* 192.168.50.11:/etc/spark/conf/

          (The destination must be a directory, not a glob; run the command once for each of the other hosts.)

      (4) Run the following commands against HDFS

          sudo -u hdfs hadoop fs -mkdir /user/spark 

          sudo -u hdfs hadoop fs -mkdir /user/spark/applicationHistory 

          sudo -u hdfs hadoop fs -chown -R spark:spark /user/spark 

          sudo -u hdfs hadoop fs -chmod 1777 /user/spark/applicationHistory 
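          To confirm the directory came out with the intended owner and permissions, a quick check:

          ```shell
          # List /user/spark; applicationHistory should be owned by spark:spark
          # and show the sticky bit from mode 1777 (drwxrwxrwt).
          sudo -u hdfs hadoop fs -ls /user/spark
          ```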

     4. Start Spark

       (1) Choose one host in the cluster as the master and run the following commands on it:

          service spark-master start 

          service spark-history-server start

          Note: the history server can be deployed on a separate host of its own.

      (2) On every other host in the cluster, run:

          service spark-worker start 

     5. Testing. There are three tools for submitting programs to Spark: spark-shell, pyspark, and spark-submit.

      (1) Run the following command to enter interactive mode and test with Scala code:

          spark-shell --driver-library-path /usr/lib/hadoop/lib/native/ --driver-class-path /usr/lib/hadoop/lib/ 

          Enter the following code:

          val file = sc.textFile("hdfs://master.hadoop.com:8020/tmp/input.txt") 

          val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) 

          counts.saveAsTextFile("hdfs://master.hadoop.com:8020/tmp/output") 

          When finished, exit with exit or Ctrl-D.

      (2) Run the following command to enter interactive mode and test with Python code:

          pyspark --driver-library-path /usr/lib/hadoop/lib/native/ --driver-class-path /usr/lib/hadoop/lib/
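          For parity with the Scala example above, the same word count can be typed at the pyspark prompt. A sketch, assuming the same input file exists on HDFS (the output path is different because saveAsTextFile fails if the target directory already exists):

          ```python
          # Run inside the pyspark shell, where `sc` (the SparkContext) is predefined.
          file = sc.textFile("hdfs://master.hadoop.com:8020/tmp/input.txt")
          counts = file.flatMap(lambda line: line.split(" ")) \
                       .map(lambda word: (word, 1)) \
                       .reduceByKey(lambda a, b: a + b)
          counts.saveAsTextFile("hdfs://master.hadoop.com:8020/tmp/output-py")
          ```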

          When finished, exit with exit(), quit(), or Ctrl-D.

      (3) Run the following commands to run the test code in non-interactive mode:

         1) local[N] mode: run the Spark application locally with N worker threads (N defaults to 1; choose it according to the number of CPU cores on your local host).

                             spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client  --master local[N] --driver-library-path /usr/lib/hadoop/lib/native/  \

           --driver-class-path /usr/lib/hadoop/lib/ /usr/lib/spark/examples/lib/spark-examples-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar 10

         2) local[*] mode: run the Spark application locally with as many worker threads as there are logical cores on your local host.

                            spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client  --master local[*] --driver-library-path /usr/lib/hadoop/lib/native/  \

          --driver-class-path /usr/lib/hadoop/lib/ /usr/lib/spark/examples/lib/spark-examples-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar 10

         3) standalone client mode: connect to the Spark Standalone cluster; the driver runs on the client while the executors run in the cluster.

                            spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client  --master spark://master.hadoop.com:7077 --driver-library-path /usr/lib/hadoop/lib/native/  \

          --driver-class-path /usr/lib/hadoop/lib/ /usr/lib/spark/examples/lib/spark-examples-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar 10

         4) standalone cluster mode: connect to the Spark Standalone cluster; both the driver and the executors run in the cluster.

                            spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master spark://master.hadoop.com:7077 --driver-library-path /usr/lib/hadoop/lib/native/  \

          --driver-class-path /usr/lib/hadoop/lib/ /usr/lib/spark/examples/lib/spark-examples-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar 10

         5) yarn-client mode: connect to a YARN cluster; the driver runs on the client while the executors run in the cluster. (Requires a deployed YARN cluster.)

                            spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client  --master yarn --driver-library-path /usr/lib/hadoop/lib/native/  \

          --driver-class-path /usr/lib/hadoop/lib/ /usr/lib/spark/examples/lib/spark-examples-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar 10

         6) yarn-cluster mode: connect to a YARN cluster; both the driver and the executors run in the cluster. (Requires a deployed YARN cluster.)

                            spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn --driver-library-path /usr/lib/hadoop/lib/native/  \

          --driver-class-path /usr/lib/hadoop/lib/ /usr/lib/spark/examples/lib/spark-examples-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar 10

          Note: adjust the command-line arguments to your needs. In any of the six spark-submit modes above, the *.jar file can be replaced by a *.py file to run Python code. See "spark-submit --help" for more options.

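          As a concrete example of the *.py variant, a minimal standalone script (a sketch modeled on Spark's bundled pi example; the file name pi.py is arbitrary) that could be submitted with, e.g., spark-submit --master local[2] pi.py:

          ```python
          # pi.py -- estimate pi by Monte Carlo sampling on Spark.
          import random
          from pyspark import SparkContext

          sc = SparkContext(appName="PythonPi")
          n = 100000  # number of random samples

          def inside(_):
              # Draw a point in the unit square; count it if it lands in the quarter circle.
              x, y = random.random(), random.random()
              return 1 if x * x + y * y < 1 else 0

          count = sc.parallelize(range(n), 2).map(inside).reduce(lambda a, b: a + b)
          print("Pi is roughly %f" % (4.0 * count / n))
          sc.stop()
          ```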
    6. Stop Spark

        service spark-master stop

        service spark-worker stop 

        service spark-history-server stop

    7. Check the Spark cluster status

      (1) Standalone mode: browse to http://192.168.50.10:18080


      (2) YARN mode: browse to http://192.168.50.10:8088


[References]

    1)http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_spark_installation.html

    2)http://blog.csdn.net/book_mmicky/article/details/25714287

[Further reading]

    1) JavaChen's Blog: Spark installation and usage    http://blog.javachen.com/2014/07/01/spark-install-and-usage/

    2) China_OS's Blog: Hadoop CDH5 study notes    http://my.oschina.net/guol/blog?catalog=483307

    3) A few problems encountered with Spark on YARN    http://www.cnblogs.com/Scott007/p/3889959.html

    4)How to Run Spark App on CDH5     http://muse.logdown.com/posts/2014/08/26/how-to-run-spark-app-on-cdh5

    5)Cloudera Spark on GitHub              https://github.com/cloudera/spark

    6)deploy Spark Server and compute Pi from your Web Browser           http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/
