Hadoop Log Analysis

I. Project Requirements


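  • The logs discussed in this article are Web logs only. There is no precise definition of the term; it may include, but is not limited to, the user access logs produced by front-end web servers such as apache, lighttpd, nginx, and tomcat, as well as the logs written by Web applications themselves.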


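II. Requirements Analysis: KPI Design

 PV (PageView): count of page views
 IP: count of unique IPs visiting the pages
 Time: count of PV per hour
 Source: count of visits by referring domain
 Browser: count of visits by the user's access device (browser)

Below I focus on the browser statistics.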

III. Analysis Process

1. One nginx record from the log

222.68.172.190  - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 
"http://www.angularjs.cn/A00n" 
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

2. Breaking down the log record above

remote_addr: the client's IP address, 222.68.172.190
remote_user: the client's user name, -
time_local: the access time and time zone, [18/Sep/2013:06:49:57 +0000]
request: the requested URL and HTTP protocol, "GET /images/my.jpg HTTP/1.1"
status: the request status (200 means success), 200
body_bytes_sent: the size of the response body sent to the client, 19939
http_referer: the page the visitor was linked from, "http://www.angularjs.cn/A00n"
http_user_agent: information about the client's browser, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
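
The Time and Source KPIs from section II both need a key derived from one of these fields: an hour bucket from time_local and a domain from http_referer. A minimal sketch of how those keys might be derived is shown here; the class name KpiKeys and its two helper methods are hypothetical and are not part of the original project.

```java
import java.net.URI;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of the original project.
class KpiKeys {
    // Layout of the time_local field without the brackets, e.g. 18/Sep/2013:06:49:57 +0000
    private static final DateTimeFormatter TIME_LOCAL =
            DateTimeFormatter.ofPattern("dd/MMM/uuuu:HH:mm:ss Z", Locale.ENGLISH);
    private static final DateTimeFormatter HOUR_BUCKET =
            DateTimeFormatter.ofPattern("uuuuMMddHH");

    /** Returns the hour bucket (yyyyMMddHH) for a bracket-free time_local value. */
    static String hourOf(String timeLocal) {
        OffsetDateTime t = OffsetDateTime.parse(timeLocal, TIME_LOCAL);
        return t.format(HOUR_BUCKET);
    }

    /** Returns the host of a referer URL, or "-" when it is absent or cannot be parsed. */
    static String domainOf(String referer) {
        try {
            String host = URI.create(referer).getHost();
            return host == null ? "-" : host;
        } catch (IllegalArgumentException e) {
            return "-";
        }
    }

    public static void main(String[] args) {
        System.out.println(hourOf("18/Sep/2013:06:49:57 +0000")); // prints 2013091806
        System.out.println(domainOf("http://www.angularjs.cn/A00n")); // prints www.angularjs.cn
    }
}
```

Either value could then serve as the map output key in place of the user agent.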

3. Parsing the record above in Java (splitting on spaces)

// Split the sample record on single spaces and print each token with its index.
String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] \"GET /images/my.jpg HTTP/1.1\" 200 19939 \"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";
String[] elementList = line.split(" ");
for (int i = 0; i < elementList.length; i++) {
    System.out.println(i + " : " + elementList[i]);
}

Test output:

0 : 222.68.172.190
1 : -
2 : -
3 : [18/Sep/2013:06:49:57
4 : +0000]
5 : "GET
6 : /images/my.jpg
7 : HTTP/1.1"
8 : 200
9 : 19939
10 : "http://www.angularjs.cn/A00n"
11 : "Mozilla/5.0
12 : (Windows
13 : NT
14 : 6.1)
15 : AppleWebKit/537.36
16 : (KHTML,
17 : like
18 : Gecko)
19 : Chrome/29.0.1547.66
20 : Safari/537.36"
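
As the output shows, splitting on spaces scatters the quoted fields (the request and the user agent) across several tokens. One alternative, which is not the approach used in this article, is to match the whole line with a regular expression. The sketch below assumes every line has the same layout as the sample record; the class name CombinedLogParser is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical alternative to split(" "); assumes the layout of the sample record.
class CombinedLogParser {
    // Capture groups: 1 addr, 2 user, 3 time, 4 method, 5 url,
    // 6 protocol, 7 status, 8 bytes, 9 referer, 10 user agent.
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+)\\s+\\S+\\s+(\\S+)\\s+\\[([^\\]]+)\\]"
            + "\\s+\"(\\S+)\\s+(\\S+)\\s+([^\"]+)\""
            + "\\s+(\\d{3})\\s+(\\S+)"
            + "\\s+\"([^\"]*)\"\\s+\"([^\"]*)\"\\s*$");

    /** Returns the ten captured fields, or null when the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "
                + "\"GET /images/my.jpg HTTP/1.1\" 200 19939 "
                + "\"http://www.angularjs.cn/A00n\" "
                + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                + "Chrome/29.0.1547.66 Safari/537.36\"";
        String[] fields = parse(line);
        System.out.println(fields[4]); // the requested URL
        System.out.println(fields[9]); // the complete user agent, without quotes
    }
}
```

With this approach the user agent arrives as a single string, and a malformed line yields null instead of an index error.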

4. Code for the Kpi entity class:

package org.hmahout.kpi.entity; // matches the import used by the MapReduce job

public class Kpi {
    private String remote_addr;     // client IP address
    private String remote_user;     // client user name; "-" means none
    private String time_local;      // access time and time zone
    private String request;         // requested URL and HTTP protocol
    private String status;          // request status; 200 means success
    private String body_bytes_sent; // size of the response body sent to the client
    private String http_referer;    // page the visitor was linked from
    private String http_user_agent; // information about the client's browser
    private String method;          // request method: GET or POST
    private String http_version;    // HTTP version

    public String getMethod() {
        return method;
    }
    public void setMethod(String method) {
        this.method = method;
    }
    public String getHttp_version() {
        return http_version;
    }
    public void setHttp_version(String http_version) {
        this.http_version = http_version;
    }
    public String getRemote_addr() {
        return remote_addr;
    }
    public void setRemote_addr(String remote_addr) {
        this.remote_addr = remote_addr;
    }
    public String getRemote_user() {
        return remote_user;
    }
    public void setRemote_user(String remote_user) {
        this.remote_user = remote_user;
    }
    public String getTime_local() {
        return time_local;
    }
    public void setTime_local(String time_local) {
        this.time_local = time_local;
    }
    public String getRequest() {
        return request;
    }
    public void setRequest(String request) {
        this.request = request;
    }
    public String getStatus() {
        return status;
    }
    public void setStatus(String status) {
        this.status = status;
    }
    public String getBody_bytes_sent() {
        return body_bytes_sent;
    }
    public void setBody_bytes_sent(String body_bytes_sent) {
        this.body_bytes_sent = body_bytes_sent;
    }
    public String getHttp_referer() {
        return http_referer;
    }
    public void setHttp_referer(String http_referer) {
        this.http_referer = http_referer;
    }
    public String getHttp_user_agent() {
        return http_user_agent;
    }
    public void setHttp_user_agent(String http_user_agent) {
        this.http_user_agent = http_user_agent;
    }
    @Override
    public String toString() {
        return "Kpi [remote_addr=" + remote_addr + ", remote_user="
                + remote_user + ", time_local=" + time_local + ", request="
                + request + ", status=" + status + ", body_bytes_sent="
                + body_bytes_sent + ", http_referer=" + http_referer
                + ", http_user_agent=" + http_user_agent + ", method=" + method
                + ", http_version=" + http_version + "]";
    }
}

5. The Kpi utility class

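package org.hmahout.kpi.util; // matches the import used by the MapReduce job

import org.hmahout.kpi.entity.Kpi;

public class KpiUtil {
    /**
     * Converts one log line into a Kpi object.
     * @param line one record from the log
     * @return the parsed Kpi, or null when the line has too few fields
     * @author tianbx
     */
    public static Kpi transformLineKpi(String line) {
        String[] elementList = line.split(" ");
        if (elementList.length < 12) {
            return null; // malformed line; the mapper skips null results
        }
        Kpi kpi = new Kpi();
        kpi.setRemote_addr(elementList[0]);
        kpi.setRemote_user(elementList[1]);
        kpi.setTime_local(elementList[3].substring(1));         // drop the leading '['
        kpi.setMethod(elementList[5].substring(1));             // drop the leading '"'
        kpi.setRequest(elementList[6]);
        kpi.setHttp_version(elementList[7].replace("\"", ""));  // drop the trailing '"'
        kpi.setStatus(elementList[8]);
        kpi.setBody_bytes_sent(elementList[9]);
        kpi.setHttp_referer(elementList[10].replace("\"", "")); // drop the quotes
        // The user agent contains spaces, so rejoin every remaining token
        // instead of keeping only the first two.
        StringBuilder agent = new StringBuilder(elementList[11]);
        for (int i = 12; i < elementList.length; i++) {
            agent.append(' ').append(elementList[i]);
        }
        kpi.setHttp_user_agent(agent.toString().replace("\"", ""));
        return kpi;
    }
}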

6. Algorithm model: parallel algorithm

Browser: count of visits by the user's access device (browser)
- Map: {key:$http_user_agent, value:1}
- Reduce: {key:$http_user_agent, value:sum}
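
Before moving to Hadoop, the model above can be tried out in memory. The sketch below is only an illustration of the counting logic, not the job itself: each list element plays the role of one map output key with value 1, and the merge call plays the role of the reduce-side sum. The class name BrowserCountSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the Map/Reduce model; no Hadoop involved.
class BrowserCountSketch {
    /** Emits (agent, 1) for every element and sums the ones per key. */
    static Map<String, Integer> count(List<String> userAgents) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String agent : userAgents) {
            totals.merge(agent, 1, Integer::sum); // reduce step: sum per key
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> agents = Arrays.asList("Chrome", "Firefox", "Chrome");
        System.out.println(count(agents)); // prints {Chrome=2, Firefox=1}
    }
}
```

Because addition is associative and commutative, the same summing step can also run as a combiner on the map side.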

7. MapReduce analysis code

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.hmahout.kpi.entity.Kpi;
import org.hmahout.kpi.util.KpiUtil;

import cz.mallat.uasparser.UASparser;
import cz.mallat.uasparser.UserAgentInfo;

public class KpiBrowserSimpleV {

    // Map: emit (browser family, 1) for every log line.
    public static class KpiBrowserSimpleMapper extends MapReduceBase
        implements Mapper<Object, Text, Text, IntWritable> {
        UASparser parser = null;
        @Override
        public void map(Object key, Text value,
                OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            Kpi kpi = KpiUtil.transformLineKpi(value.toString());

            if (kpi != null && kpi.getHttp_user_agent() != null) {
                // Create the parser lazily, once per mapper instance.
                if (parser == null) {
                    parser = new UASparser();
                }
                UserAgentInfo info =
                    parser.parseBrowserOnly(kpi.getHttp_user_agent());
                if ("unknown".equals(info.getUaName())) {
                    out.collect(new Text(info.getUaName()), new IntWritable(1));
                } else {
                    out.collect(new Text(info.getUaFamily()), new IntWritable(1));
                }

            }
        }
    }

    // Reduce: sum the counts for each browser family.
    public static class KpiBrowserSimpleReducer extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterator<IntWritable> value,
                OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            IntWritable sum = new IntWritable(0);
            while (value.hasNext()) {
                sum.set(sum.get() + value.next().get());
            }
            out.collect(key, sum);
        }
    }
    public static void main(String[] args) throws IOException {
        String input = "hdfs://127.0.0.1:9000/user/tianbx/log_kpi/input";
        String output = "hdfs://127.0.0.1:9000/user/tianbx/log_kpi/browerSimpleV";
        JobConf conf = new JobConf(KpiBrowserSimpleV.class);
        conf.setJobName("KpiBrowserSimpleV");
        String url = "classpath:";
        conf.addResource(url + "/hadoop/core-site.xml");
        conf.addResource(url + "/hadoop/hdfs-site.xml");
        conf.addResource(url + "/hadoop/mapred-site.xml");

        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(KpiBrowserSimpleMapper.class);
        // The reducer only sums, so it can double as the combiner.
        conf.setCombinerClass(KpiBrowserSimpleReducer.class);
        conf.setReducerClass(KpiBrowserSimpleReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output));

        JobClient.runJob(conf);
        System.exit(0);
    }

}


8. Contents of the output file log_kpi/browerSimpleV

AOL Explorer 1
Android Webkit 123
Chrome 4867
CoolNovo 23
Firefox 1700
Google App Engine 5
IE 1521
Jakarta Commons-HttpClient 3
Maxthon 27
Mobile Safari 273
Mozilla 130
Openwave Mobile Browser 2
Opera 2
Pale Moon 1
Python-urllib 4
Safari 246
Sogou Explorer 157
unknown 4685

9. Plotting the chart with R


library(ggplot2)

data <- read.table(file="borwer.txt", header=FALSE, sep=",")

names(data) <- c("borwer", "num")

qplot(borwer, num, data=data, geom="bar")
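
The R snippet reads a comma-separated file, whereas TextOutputFormat by default writes each record as key, tab, value. The article does not show how borwer.txt was produced, so the following is only one possible conversion step, sketched under that assumption; the class name TsvToCsv is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical conversion step; the original article does not show how borwer.txt was made.
class TsvToCsv {
    /** Turns "key<TAB>count" lines into "key,count" lines, skipping lines without a tab. */
    static List<String> convert(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // not a key/value record
            }
            out.add(line.substring(0, tab) + "," + line.substring(tab + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> jobOutput = Arrays.asList("Chrome\t4867", "Sogou Explorer\t157");
        for (String row : convert(jobOutput)) {
            System.out.println(row);
        }
    }
}
```

This works for the browser names listed above because none of them contains a comma; names with commas would need quoting.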



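Open questions

1. How to exclude crawlers and programmatic clicks, and counter cheating?

One solution: have the page detect whether the mouse moves.

2. How to exclude images from the page view count?

3. How to exclude fake clicks from the page view count?

4. Which search engine did the visit come from?

5. Which keyword was clicked to reach the site?

6. Where did the visit come from?

7. Which browser was used for the visit?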
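
For questions 1 and 2, a server-side filter applied before counting is one possible complement to the mouse-movement check. This is a different technique from the one suggested above, and only a rough heuristic: the suffix list, the bot markers, and the class name PageViewFilter are all assumptions chosen for illustration.

```java
import java.util.Locale;

// Hypothetical heuristic filter; the suffix and marker lists are illustrative assumptions.
class PageViewFilter {
    private static final String[] STATIC_SUFFIXES =
            {".jpg", ".jpeg", ".png", ".gif", ".ico", ".css", ".js"};
    private static final String[] BOT_MARKERS = {"bot", "spider", "crawler"};

    /** True when the request path, ignoring any query string, ends in a static-file suffix. */
    static boolean isStaticResource(String requestPath) {
        String path = requestPath.toLowerCase(Locale.ROOT);
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        for (String suffix : STATIC_SUFFIXES) {
            if (path.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    /** True when the user agent contains one of the bot markers. */
    static boolean looksLikeBot(String userAgent) {
        String agent = userAgent.toLowerCase(Locale.ROOT);
        for (String marker : BOT_MARKERS) {
            if (agent.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    /** A request counts as a page view only if it is neither a static file nor a likely bot. */
    static boolean countsAsPageView(String requestPath, String userAgent) {
        return !isStaticResource(requestPath) && !looksLikeBot(userAgent);
    }

    public static void main(String[] args) {
        String chrome = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                + "Chrome/29.0.1547.66 Safari/537.36";
        System.out.println(countsAsPageView("/images/my.jpg", chrome)); // prints false
        System.out.println(countsAsPageView("/A00n", chrome));          // prints true
    }
}
```

A mapper could call countsAsPageView before emitting a record. Crawlers that send a browser-like user agent would still pass this check, which is why a client-side signal remains useful.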
