首页 > 代码库 > 在Spark上用Scala实验梯度下降算法

在Spark上用Scala实验梯度下降算法

首先参考的是这篇文章:http://blog.csdn.net/sadfasdgaaaasdfa/article/details/45970185

但是其中的函数太老了。所以要改。另外出发点是我自己的这篇文章 http://www.cnblogs.com/charlesblc/p/6206198.html 里面关于梯度下降的那幅图片。

 

改来改去,在随机化向量上耗费了很多时间,最后还是做好了。代码如下:

package com.spark.my

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import breeze.linalg.DenseVector
import breeze.numerics.exp

/**
  * Created by baidu on 16/11/28.
  */


object  GradientDemo{
  case class DataPoint(x: DenseVector[Double], y: Double) // case class见下文
  def parsePoint(x: Array[Double]): DataPoint = {
    //DataPoint(Vectors.dense(x.slice(0, x.size-2)), x(x.size-1))
    DataPoint(DenseVector(x.slice(0, x.size-2)), x(x.size-1))
  }



  def main(args: Array[String]) {

    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    println("Begin load gradient file")
    // 装载数据集
    val text = sc.textFile("hdfs://master.Hadoop:8390/gradient_data/spam.data.txt")
    val lines = text.map {
      line =>
        line.split(" ").map(_.toDouble)
    }

    val points = lines.map(parsePoint(_)) // (parsePoint(_))看起来是一样的
    var w = DenseVector.rand(lines.first().size - 2)

    val iterations = 100
    for (i <- 1 to iterations) {
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x)
        .reduce(_ + _)
      w -= gradient
    }

    println("Finish data loading, w num: " + w.length + "; w: " + w)

  }
}

 

然后在m42n05机器上,先用的是把 http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data 这个文件拷贝到Hadoop上:

$hadoop fs -mkdir /gradient_data

$ hadoop fs -put spam.data.txt /gradient_data/

$ hadoop fs -ls /gradient_data/
Found 1 items
-rw-r--r--   3 work supergroup     698341 2016-12-21 17:59 /gradient_data/spam.data.txt

 

然后把jar包也拷贝过来,运行命令:

$ ./bin/spark-submit --class com.spark.my.GradientDemo  --master spark://10.117.146.12:7077 myjars/scala-demo.jar

得到输出:
16/12/21 18:17:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/21 18:17:58 INFO util.log: Logging initialized @1689ms
16/12/21 18:17:58 INFO server.Server: jetty-9.2.z-SNAPSHOT
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@107ed6fc{/jobs,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1643d68f{/jobs/json,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@186978a6{/jobs/job,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2e029d61{/jobs/job/json,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@482d776b{/stages,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4052274f{/stages/json,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@132ddbab{/stages/stage,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@297ea53a{/stages/stage/json,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@acb0951{/stages/pool,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5bf22f18{/stages/pool/json,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@267f474e{/storage,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7a7471ce{/storage/json,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@28276e50{/storage/rdd,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@62e70ea3{/storage/rdd/json,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3efe7086{/environment,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@675d8c96{/environment/json,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@741b3bc3{/executors,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2ed3b1f5{/executors/json,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@63648ee9{/executors/threadDump,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@68d6972f{/executors/threadDump/json,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@45be7cd5{/static,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7651218e{/,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3185fa6b{/api,null,AVAILABLE}
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6d366c9b{/stages/stage/kill,null,AVAILABLE}
16/12/21 18:17:58 INFO server.ServerConnector: Started ServerConnector@53e211ee{HTTP/1.1}{0.0.0.0:4040}
16/12/21 18:17:58 INFO server.Server: Started @1811ms
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6e0d4a8{/metrics/json,null,AVAILABLE}
Begin load gradient file
16/12/21 18:18:00 INFO mapred.FileInputFormat: Total input paths to process : 1
16/12/21 18:18:02 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/12/21 18:18:02 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Finish data loading, w num: 56; w: DenseVector(0.5742670447735152, 0.3793477463119241, 0.9681722093411653, 0.5967720119758925, 1.513648869152009, 0.8246263930800145, 0.8513296345703405, 0.5016541916805365, 0.10371045067354999, 1.0622529560536655, 0.7333760424194737, 2.1149483032187897, 0.9299367625800867, 0.7255747859512406, 0.13008556580706143, 1.4831202765138185, 0.7729907277492736, 0.9723309264036033, 13.394753146641808, 0.5531526429090097, 2.7444722115693665, 0.11325813324181622, 0.5096129116641023, 0.7201439311127137, 0.44719912156747926, 0.8273500952621051, 0.6736417633922696, 0.046531684571481415, 0.017895929000231802, 0.4726397794671698, 0.394438566392741, 0.8438784726078483, 0.4144073806784945, 0.18873920886297268, 0.4760240368798872, 0.31604719205329873, 0.694745503752298, 0.721380820951884, 0.988535475648986, 0.13515871744899247, 0.15694652560543523, 0.6939378895510522, 0.9279201378471407, 0.3336083293555714, 0.38938263676999685, 0.17159756568171308, 0.18897754115255144, 0.7281027812135723, 0.7233165381530381, 1.1093715737790655, 0.15675561193336351, 2.059622965151493, 0.6839713282339183, 0.11528695729374866, 7.413534050555067, 23.13404922028611)
16/12/21 18:18:07 INFO server.ServerConnector: Stopped ServerConnector@53e211ee{HTTP/1.1}{0.0.0.0:4040}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6d366c9b{/stages/stage/kill,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3185fa6b{/api,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7651218e{/,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@45be7cd5{/static,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@68d6972f{/executors/threadDump/json,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@63648ee9{/executors/threadDump,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@2ed3b1f5{/executors/json,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@741b3bc3{/executors,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@675d8c96{/environment/json,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3efe7086{/environment,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@62e70ea3{/storage/rdd/json,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@28276e50{/storage/rdd,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7a7471ce{/storage/json,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@267f474e{/storage,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5bf22f18{/stages/pool/json,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@acb0951{/stages/pool,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@297ea53a{/stages/stage/json,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@132ddbab{/stages/stage,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4052274f{/stages/json,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@482d776b{/stages,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@2e029d61{/jobs/job/json,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@186978a6{/jobs/job,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1643d68f{/jobs/json,null,UNAVAILABLE}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@107ed6fc{/jobs,null,UNAVAILABLE}

可以看到数据正常进行了处理。

 

在代码的迭代循环里面再加上这么一句,看看过程:

println("In data loading, w num: " + w.length + "; w: " + w)

然后重新拷贝jar包,然后运行。发现增加了很多中间数据,但是每次改动不大,有的只是最后几个数字改动:

In data loading, w num: 56; w: DenseVector(0.8387794911469437, 0.041931950643148204, 0.610593576873822, 0.775693127624059, 0.9595814255406686, 0.8346753461732199, 1.3049939469403333, 0.7056665962054256, 0.4607139317388798, 0.7272237992038442, 0.658182563650663, 0.733627042229442, 0.49543528179048996, 0.43928474305383947, 0.7784540121519834, 3.3618947233533456, 0.8863247999385253, 0.4007587753541083, 2.0631977325748334, 0.8211289850510815, 1.2076387347473903, 0.43209585536401196, 0.8361371667999544, 0.3902040623717107, 0.9249800607229486, 0.9684655358995048, 0.7122113545634148, 0.7564214721597596, 0.9295754044438086, 0.0667831407627083, 0.8262226990678785, 0.9866253536733688, 0.7214690647928418, 0.5992067836236182, 0.801215365214358, 1.0206941788488395, 0.8887684894893382, 0.39696145592511084, 0.7994301499483707, 0.39766237687949973, 0.3213782652296576, 0.3959330364022269, 0.6573698429264838, 0.5725594506918451, 0.932872703406284, 0.4276515117478306, 0.8908902872993782, 0.6281143587881469, 0.5136752276267151, 1.0933173640821512, 0.10820509511118362, 1.9426418431339785, 0.2017114624971559, 0.9827542778431644, 5.224634203803431, 16.694903977208174)
In data loading, w num: 56; w: DenseVector(0.8387794911469437, 0.041931950643148204, 0.6105935768739001, 0.775693127624059, 0.9595814255414439, 0.8346753461732199, 1.3049939469403333, 0.7056665962054256, 0.4607139317388798, 0.7272237992038442, 0.658182563650663, 0.733627042229442, 0.49543528179048996, 0.43928474305383947, 0.7784540121519834, 3.3618947233534118, 0.8863247999385373, 0.4007587753541083, 2.0631977325749897, 0.8211289850510815, 1.2076387347474142, 0.43209585536401196, 0.8361371667999544, 0.3902040623717107, 0.9249800607229486, 0.9684655358995048, 0.7122113545634148, 0.7564214721597596, 0.9295754044438086, 0.0667831407627083, 0.8262226990678785, 0.9866253536733688, 0.7214690647928418, 0.5992067836236182, 0.801215365214358, 1.0206941788488395, 0.8887684894893382, 0.39696145592511084, 0.7994301499483707, 0.3976623768795117, 0.3213782652296576, 0.3959330364022269, 0.6573698429264838, 0.5725594506918451, 0.932872703406296, 0.4276515117478306, 0.8908902872993782, 0.6281143587881469, 0.5136752276267151, 1.093317364082217, 0.10820509511118362, 1.942641843152015, 0.2017114624971559, 0.982754277843168, 5.22463420411604, 16.694903977520784)

 

 

梯度下降原理

梯度下降原理讲的比较好的,可以看这里:

http://blog.csdn.net/woxincd/article/details/7040944

还有这篇:

http://www.cnblogs.com/maybe2030/p/5089753.html?utm_source=tuicool&utm_medium=referral

 

仔细看了一下,发现上面的公式,和代码里面的公式好像不太一样。应该是代码里面用到了Sigmoid函数。

技术分享

 

还需要好好领悟一下。

上面代码里面用到的公式主要是:

(1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x)
上面p.x是一个n维的vector,p.y是一个数值。

然后 reduce(_+_)是说把没一行的都加起来。也就是最后是一个n维的vector.

然后
w -= gradient

 

然后迭代N次,得到一个新的w.

 

case class

case class和class的区别可以看:http://www.tuicool.com/articles/yEZr6ve

在Scala中存在case class,它其实就是一个普通的class。但是它又和普通的class略有区别,如下:

1、初始化的时候可以不用new,当然你也可以加上,普通类一定需要加new;

2、toString的实现更漂亮;

3、默认实现了equals 和hashCode;

4、默认是可以序列化的,也就是实现了Serializable ;

5、自动从scala.Product中继承一些函数;

6、case class构造函数的参数是public级别的,我们可以直接访问;

7、支持模式匹配。

 

Breeze

另外,上面的DenseVector其实都是用的Breeze里面的类

 

 

LinearRegressionWithSGD

另外,这是Spark里面实现的线性回归,是基于随机梯度下降的。相似的函数还有:

MLlib中可用的线性回归算法有:LinearRegressionWithSGD,RidgeRegressionWithSGD,LassoWithSGD;MLlib回归分析中涉及到的主要类有,GeneralizedLinearAlgorithm,GradientDescent。

 

 

Scala用Java

上文最后用的是DenseVector,所以没有用下面这段。但是下面这段说明了Scala里面可以用Java的:

import java.util.Random
val rand = new Random(53)

 

在Spark上用Scala实验梯度下降算法