Spark Enhanced WordCount: Counting File Requests in Logs

Preface

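Learning the basic syntax of Scala and Spark can be dry, and working through a simple practical application is an effective way to reinforce the fundamentals. An earlier post covered the most basic WordCount: http://blog.csdn.net/whzhaochao/article/details/72358215. This post builds a simple feature modeled on a real production scenario: analyzing CDN or Nginx log files to compute PV, UV, client IP addresses, referrers, and related statistics. It is only meant as a practice exercise; real-world use would likely need something more elaborate.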

Counting File Requests

The following is a sample of request logs from Qiniu CDN:

223.93.159.226 HIT 203 [15/Feb/2017:11:14:35 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 5444007 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 62 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4866645 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 15 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 4854183 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 91 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 4751957 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
61.164.41.226 HIT 2537 [15/Feb/2017:11:13:54 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 200 5173432 "http://www.abc.com.cn/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
115.215.115.229 HIT 1 [15/Feb/2017:11:17:53 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 5.1; M578CA Build/LMY47D; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
115.215.115.229 HIT 1 [15/Feb/2017:11:17:53 +0800] "GET http://v-cdn.abc.com.cn/videojs/video.js HTTP/1.1" 200 173397 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 5.1; M578CA Build/LMY47D; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
115.236.173.95 HIT 1 [15/Feb/2017:11:17:49 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "http://v.abc.com.cn/video/iframe/player.html?id=139067&auto=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 QQ/6.6.9.412 V1_IPH_SQ_6.6.9_1_APP_A Pixel/1080 Core/UIWebView NetType/WIFI"
183.129.251.218 HIT 486 [15/Feb/2017:11:18:40 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4845881 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 34 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4976817 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 27 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 3859028 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 37 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 3859028 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 43 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 19 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5304429 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.228.161.136 HIT 1 [15/Feb/2017:11:16:51 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140994&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_2_1 like Mac OS X) AppleWebKit/602.4.6 (KHTML, like Gecko) Mobile/14D27 MicroMessenger/6.5.4 NetType/WIFI Language/zh_CN"
202.107.208.102 HIT 1226 [15/Feb/2017:11:19:10 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
115.231.248.162 HIT 34 [15/Feb/2017:11:17:56 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 1208743 "http://www.abc.com.cn/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; GWX:DOWNLOADED; GWX:RESERVED)"
221.234.216.142 HIT 744 [15/Feb/2017:11:17:09 +0800] "GET http://v-cdn.abc.com.cn/140995.mp4 HTTP/1.1" 206 4194896 "https://v.abc.com.cn/video/iframe/player.html?id=140995&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12B411 MicroMessenger/6.3.31 NetType/WIFI Language/zh_CN"
183.132.170.2 HIT 1 [15/Feb/2017:11:15:22 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 6.0; HUAWEI MT7-CL00 Build/HuaweiMT7-CL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
183.132.170.2 HIT 1 [15/Feb/2017:11:15:22 +0800] "GET http://v-cdn.abc.com.cn/videojs/video.js HTTP/1.1" 200 173397 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 6.0; HUAWEI MT7-CL00 Build/HuaweiMT7-CL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
112.17.240.97 HIT 1440 [15/Feb/2017:11:20:31 +0800] "GET http://v-cdn.abc.com.cn/140941.mp4 HTTP/1.1" 206 6284261 "https://v.abc.com.cn/video/iframe/player.html?id=140941&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12D508 (zjxw;3.5.1;iPhone6,2;8.2;zh;bianfeng;b541b2039c2c00c66c14c7fb7e26df19fccd9cf4)"
125.118.106.43 HIT 32 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 1637949 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 31 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5042489 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 32 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 40 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4911485 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 30 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4583601 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
60.190.59.200 HIT 1741 [15/Feb/2017:11:20:05 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5173425 "http://www.abc.com.cn/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

The log format is:

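Fields, in order: client IP, cache hit/miss, response time, request time, request method, request URL, request protocol, status code, response size, referer, user agent
ClientIP Hit/Miss ResponseTime [Time Zone] Method URL Protocol StatusCode TrafficSize Referer UserAgent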
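
To make the field layout concrete, here is a minimal sketch that splits one log line into named fields. The names `LogEntry` and `LogParser`, the field names, and the regex itself are illustrative assumptions rather than anything from the original project. The sketch also assumes that the referer and user agent never contain an embedded double quote.

```scala
// Sketch: split one CDN log line into named fields.
// LogEntry, LogParser and the regex are illustrative assumptions.
final case class LogEntry(
  clientIp: String, hitMiss: String, responseTime: Long, time: String,
  method: String, url: String, protocol: String,
  status: Int, size: Long, referer: String, userAgent: String
)

object LogParser {
  // One capture group per field; quoted fields run up to the next double quote.
  private val Line =
    """^(\S+) (\S+) (\d+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+) "([^"]*)" "([^"]*)"$""".r

  // Returns None for lines that do not fit the expected layout.
  def parse(line: String): Option[LogEntry] = line match {
    case Line(ip, hit, rt, time, method, url, proto, status, size, ref, ua) =>
      Some(LogEntry(ip, hit, rt.toLong, time, method, url, proto,
        status.toInt, size.toLong, ref, ua))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    // One of the sample lines shown above.
    val sample = "61.164.41.226 HIT 2537 [15/Feb/2017:11:13:54 +0800] " +
      "\"GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1\" 200 5173432 " +
      "\"http://www.abc.com.cn/\" " +
      "\"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)\""
    println(parse(sample))
  }
}
```

Returning an `Option` lets a caller drop malformed lines with `flatMap` instead of failing on the first bad record.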

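From this log we first want the number of requests for each video file (note that this is the request count, not the view count), sorted by request count. Video file names match the pattern [0-9]+\.mp4, for example 141035.mp4.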
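
As a quick check of the file-name pattern, the sketch below applies it to two request strings taken from the sample log. The object and method names (`FileNameRegex`, `videoName`) are hypothetical helpers introduced only for this illustration. It also shows why the dot should be escaped: an unescaped `.` matches any character.

```scala
object FileNameRegex {
  // Escaped dot: only a literal ".mp4" extension matches.
  val Mp4Name = "[0-9]+\\.mp4".r
  // Unescaped dot, for comparison: "." matches any single character.
  val LooseName = "[0-9]+.mp4".r

  // Returns the first NNN.mp4 name found in the text, if any.
  def videoName(text: String): Option[String] = Mp4Name.findFirstIn(text)

  def main(args: Array[String]): Unit = {
    println(videoName("GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1"))       // Some(141035.mp4)
    println(videoName("GET http://v-cdn.abc.com.cn/videojs/video.js HTTP/1.1")) // None
    println(LooseName.findFirstIn("id=12xmp4"))                                 // Some(12xmp4)
  }
}
```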

Code

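 import org.apache.spark.{SparkConf, SparkContext}

 object CdnFileCount { // object name is illustrative
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf().setAppName("hello").setMaster("local")
     val sc = new SparkContext(conf)
     val input = sc.textFile("D:\\data\\cdn.txt")
     // Keep lines that request an NNN.mp4 file, extract the name, count per name, sort by count descending
     input.filter(x=>x.matches(".*([0-9]+)\\.mp4.*")).flatMap(x=>"([0-9]+)\\.mp4".r findFirstIn(x)).map(x=>(x,1)).reduceByKey((x,y)=>x+y).sortBy(_._2,false).foreach(println)
   }
 }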

How the Computation Works

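  • filter(x=>x.matches(".*([0-9]+)\\.mp4.*")): uses a regex to keep only the log lines that request a video file
  • flatMap(x=>"([0-9]+)\\.mp4".r findFirstIn(x)): uses a regex to extract the file name from each log line
  • .map(x=>(x,1)): turns each file name into a (name, 1) tuple, creating a Pair RDD
  • reduceByKey((x,y)=>x+y): sums the counts for each file name
  • sortBy(_._2,false): sorts by request count in descending order
  • foreach(println): prints the results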
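
The same stages can be tried without a Spark cluster. The sketch below mirrors them on an in-memory `Seq`, with `groupBy` plus a sum standing in for `reduceByKey`. The name `LocalPipeline`, the abbreviated sample lines, and the tie-break on file name are assumptions made for this illustration and are not part of the original job.

```scala
object LocalPipeline {
  private val Mp4Name = "[0-9]+\\.mp4".r

  // Same stages as the Spark job, applied to a local Seq instead of an RDD.
  def countRequests(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .filter(_.matches(".*[0-9]+\\.mp4.*"))             // keep video requests only
      .flatMap(line => Mp4Name.findFirstIn(line))        // log line -> file name
      .map(name => (name, 1))                            // (name, 1) pairs
      .groupBy(_._1)                                     // stand-in for reduceByKey
      .map { case (name, pairs) => (name, pairs.map(_._2).sum) }
      .toSeq
      .sortBy { case (name, count) => (-count, name) }   // descending count; name breaks ties (an addition)

  def main(args: Array[String]): Unit = {
    // Abbreviated request strings, not full log lines.
    val lines = Seq(
      "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1",
      "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1",
      "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1",
      "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1"
    )
    countRequests(lines).foreach(println)
  }
}
```

One caveat when moving beyond local mode: `foreach(println)` on an RDD runs on the executors, so on a cluster the output lands in executor logs and is not globally ordered. Calling `collect()` or `take(n)` before printing brings the sorted results back to the driver.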

Output

(141027.mp4,10534)
(141081.mp4,6823)
(140995.mp4,5076)
(141032.mp4,4988)
(141114.mp4,4244)
(141090.mp4,4198)
(141035.mp4,4123)
(141082.mp4,3916)
(89973.mp4,3477)
(140938.mp4,3227)
(138982.mp4,3048)
(139870.mp4,2580)
(141080.mp4,2484)
(140976.mp4,2476)
(140510.mp4,2167)
(132247.mp4,1785)
(141102.mp4,1726)
(141036.mp4,1703)
(140876.mp4,1584)
(140941.mp4,1479)
(140967.mp4,1414)
(140819.mp4,1287)
(140279.mp4,1276)
(140822.mp4,1241)
(140994.mp4,1174)
(141011.mp4,1148)
(141060.mp4,1033)
(140981.mp4,998)

Sample Data and Source Code

http://git.oschina.net/whzhaochao/spark-learning
