ASA的美国总统竞选

在这个大选之年，美国统计协会(ASA)将学生竞赛和总统选举放在一起，将学生预测谁是2016年总统大选的赢家准确的百分比作为比赛点。详情见:

http://thisisstatistics.org/electionprediction2016/

获取数据

互联网上有很多公开的民调数据。可以下面的网站获取总统大选的相关数据：

http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/

其他较好的数据源是：

http://www.realclearpolitics.com/epolls/latest_polls/

http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton

http://www.gallup.com/products/170987/gallup-analytics.aspx)

值得注意的是：数据是每天更新的，所以你在看本文的时候很可能数据变化而得到不同的结果。

因为原始的数据是JSON文件，R拉取下来将其作为了lists中的一个list(列表)。

原文的Github地址：https://github.com/hardin47/prediction2016/blob/master/predblog.Rmd

##载入需要的包require(XML)require(dplyr)require(tidyr)require(readr)require(mosaic)require(RCurl)require(ggplot2)require(lubridate)require(RJSONIO)##数据拉取url = "http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/"doc <- htmlParse(url, useInternalNodes = TRUE) #爬取网页内容sc = xpathSApply(doc,                  "//script[contains(., ‘race.model‘)]",                  function(x) c(xmlValue(x), xmlAttrs(x)[["href"]]))jsobj = gsub(".*race.stateData = http://www.mamicode.com/(.*);race.pathPrefix.*", "\\1", sc)data = fromJSON(jsobj)allpolls <- data$polls#unlisting the whole thingindx <- sapply(allpolls, length)pollsdf <- as.data.frame(do.call(rbind, lapply(allpolls, ‘length<-‘, max(indx))))##数据清洗#unlisting the weightspollswt <- as.data.frame(t(as.data.frame(do.call(cbind,                                                  lapply(pollsdf$weight,                                                        data.frame,                                                        stringsAsFactors=FALSE)))))names(pollswt) <- c("wtpolls", "wtplus", "wtnow")row.names(pollswt) <- NULLpollsdf <- cbind(pollsdf, pollswt)#unlisting the votingindxv <- sapply(pollsdf$votingAnswers, length)pollsvot <- as.data.frame(do.call(rbind, lapply(pollsdf$votingAnswers,                                                ‘length<-‘, max(indxv))))pollsvot1 <- rbind(as.data.frame(do.call(rbind, lapply(pollsvot$V1, data.frame,                                                       stringsAsFactors=FALSE))))pollsvot2 <- rbind(as.data.frame(do.call(rbind, lapply(pollsvot$V2, data.frame,                                                       stringsAsFactors=FALSE))))pollsvot1 <- cbind(polltype = rownames(pollsvot1), pollsvot1,                    polltypeA = gsub(‘[0-9]+‘, ‘‘, rownames(pollsvot1)),                   polltype1 = extract_numeric(rownames(pollsvot1)))pollsvot1$polltype1 <- ifelse(is.na(pollsvot1$polltype1), 1, pollsvot1$polltype1 + 1)pollsvot2 <- cbind(polltype = rownames(pollsvot2), pollsvot2,                    polltypeA = gsub(‘[0-9]+‘, ‘‘, rownames(pollsvot2)),                   polltype1 = extract_numeric(rownames(pollsvot2)))pollsvot2$polltype1 <- ifelse(is.na(pollsvot2$polltype1), 1, pollsvot2$polltype1 + 1)pollsdf <- pollsdf %>%   mutate(population = unlist(population),          sampleSize = as.numeric(unlist(sampleSize)),          pollster = unlist(pollster),          startDate = ymd(unlist(startDate)),         endDate = ymd(unlist(endDate)),          pollsterRating = unlist(pollsterRating)) %>%  select(population, sampleSize, pollster, startDate, endDate, pollsterRating,         wtpolls, wtplus, wtnow)allpolldata <- cbind(rbind(pollsdf[rep(seq_len(nrow(pollsdf)), each=3),],                           pollsdf[rep(seq_len(nrow(pollsdf)), each=3),]),                      rbind(pollsvot1, pollsvot2))allpolldata <- allpolldata %>%  arrange(polltype1, choice)

查看所有的选择数据：allolldata

技术分享

快速可视化

在找出2016年美国总统竞选的预测选票比例之前，简单的查看数据是非常有必要的。数据集已经整理好了，使用ggplot2包对其进行可视化（选取2016年8月以后的数据，x轴为endDate，y轴为adj_pct，颜色根据choice也就是两种颜色克林顿和希拉里，并根据wtnow设置点的大小）：

##快速可视化ggplot(subset(allpolldata, ((polltypeA == "now") & (endDate > ymd("2016-08-01")))),        aes(y=adj_pct, x=endDate, color=choice)) +   geom_line() + geom_point(aes(size=wtnow)) +   labs(title = "Vote percentage by date and poll weight\n",        y = "Percent Vote if Election Today", x = "Poll Date",        color = "Candidate", size="538 Poll\nWeight")

技术分享

快速分析

考虑到每位候选人的选票比例会基于当前投票的票数百分比，所以，必须基于538人（样本容量samplesize）的想法（投票举动）和投票关闭天数（day sine poll）进行选票权重设置。权重的计算公式如下：

技术分享

使用计算出的权重，我将计算被预测选票百分比的加权平均和其标准偏差（SE）。标准偏差（SE）计算公式来自 Cochran (1977) 。

##快速分析# 参考文献# code found at http://stats.stackexchange.com/questions/25895/computing-standard-error-in-weighted-mean-estimation# cited from http://www.cs.tufts.edu/~nr/cs257/archive/donald-gatz/weighted-standard-error.pdf# Donald F. Gatz and Luther Smith, "THE STANDARD ERROR OF A WEIGHTED MEAN CONCENTRATION-I. BOOTSTRAPPING VS OTHER METHODS"weighted.var.se <- function(x, w, na.rm=FALSE)  #  Computes the variance of a weighted mean following Cochran 1977 definition{  if (na.rm) { w <- w[i <- !is.na(x)]; x <- x[i] }  n = length(w)  xWbar = weighted.mean(x,w,na.rm=na.rm)  wbar = mean(w)  out = n/((n-1)*sum(w)^2)*(sum((w*x-wbar*xWbar)^2)-2*xWbar*sum((w-wbar)*(w*x-wbar*xWbar))+xWbar^2*sum((w-wbar)^2))  return(out)}# 计算累计平均和加权平均值Cumulative Mean / Weighted Meanallpolldata2 <- allpolldata %>%  filter(wtnow > 0) %>%  filter(polltypeA == "now") %>%  mutate(dayssince = as.numeric(today() - endDate)) %>%  mutate(wt = wtnow * sqrt(sampleSize) / dayssince) %>%  mutate(votewt = wt*pct) %>%  group_by(choice) %>%  arrange(choice, -dayssince) %>%  mutate(cum.mean.wt = cumsum(votewt) / cumsum(wt)) %>%  mutate(cum.mean = cummean(pct))View(allpolldata2 )

技术分享

可视化累计平均和加权平均值

##绘制累计平均/加权平均Cumulative Mean / Weighted Mean# 累计平均ggplot(subset(allpolldata2, ( endDate > ymd("2016-01-01"))),        aes(y=cum.mean, x=endDate, color=choice)) +   geom_line() + geom_point(aes(size=wt)) +   labs(title = "Cumulative Mean Vote Percentage\n",        y = "Cumulative Percent Vote if Election Today", x = "Poll Date",        color = "Candidate", size="Calculated Weight")# 加权平均ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))),        aes(y=cum.mean.wt, x=endDate, color=choice)) +   geom_line() + geom_point(aes(size=wt)) +   labs(title = "Cumulative Weighted Mean Vote Percentage\n",        y = "Cumulative Weighted Percent Vote if Election Today", x = "Poll Date",        color = "Candidate", size="Calculated Weight")

技术分享

选票百分比预测

此外，加权平均和平均的标准偏差(科克伦(1977))可以对每个候选人进行计算。使用这个公式，我们可以预测主要候选人的最后的百分比！

pollsummary <- allpolldata2 %>%   select(choice, pct, wt, votewt, sampleSize, dayssince) %>%  group_by(choice) %>%  summarise(mean.vote = weighted.mean(pct, wt, na.rm=TRUE),            std.vote = sqrt(weighted.var.se(pct, wt, na.rm=TRUE)))pollsummary## # A tibble: 2 x 3##     choice mean.vote  std.vote##      <chr>     <dbl>     <dbl>## 1 Clinton  43.48713 0.5073771## 2   Trump  38.95760 1.0717574

技术分享

显然，主要的候选人是克林顿和希拉里，克林顿的选票平均百分比高于希拉里，并且其标准偏差小于希拉里，也就是说其选票变化稳定，最后胜出的很可能就是克林顿，但是按照希拉里的变化波动大，也不排除希拉里获胜的可能。可以看到希拉里的选票比例最高曾达到51%。

技术分享

原文链接：https://www.r-statistics.com/2016/08/presidential-election-predictions-2016/

本文链接：http://www.cnblogs.com/homewch/p/5811945.html

2016年总统选举的预测

声明：以上内容来自用户投稿及互联网公开渠道收集整理发布，本网站不拥有所有权，未作人工编辑处理，也不承担相关法律责任，若内容有误或涉及侵权可进行投诉：投诉/举报工作人员会在5个工作日内联系你，一经查实，本站将立刻删除涉嫌侵权内容。

联系
我们

首页 > 代码库 > 2016年总统选举的预测

2016年总统选举的预测

ASA的美国总统竞选

获取数据

快速可视化

快速分析

可视化累计平均和加权平均值

选票百分比预测

看完仍有疑问？有类似问题直接问程序猿