首页 > 代码库 > 基于R语言的用户分析
基于R语言的用户分析
1. 基本分析理论
C5.0是决策树模型中的算法,79年由J R Quinlan发展,并提出了ID3算法,主要针对离散型属性数据,其后又不断的改进,形成C4.5,它在ID3基础上增加了队连续属性的离散化。C5.0是C4.5应用于大数据集上的分类算法,主要在执行效率和内存使用方面进行了改进。
C4.5算法是ID3算法的修订版,采用GainRatio来加以改进方法,选取有最大GainRatio的分割变量作为准则,避免ID3算法过度配适的问题。
C5.0算法则是C4.5算法的修订版,适用于处理大数据集,采用Boosting方式提高模型准确率,又称为BoostingTrees,在软件上计算速度比较快,占用的内存资源较少。
决策树模型,也称规则推理模型。通过对训练样本的学习,建立分类规则;依据分类规则,实现对新样本的分类;属于有指导(监督)式的学习方法,有两类变量:目标变量(输出变量),属性变量(输入变量)。
决策树模型与一般统计分类模型的主要区别:决策树的分类是基于逻辑的,一般统计分类模型是基于非逻辑的。
常见的算法有CHAID、CART、Quest和C5.0。对于每一个决策要求分成的组之间的“差异”最大。各种决策树算法之间的主要区别就是对这个“差异”衡量方式的区别。
决策树很擅长处理非数值型数据,这与神经网络智能处理数值型数据比较而言,就免去了很多数据预处理工作。
C5.0是经典的决策树模型算法之一,可生成多分支的决策树,目标变量为分类变量,使用C5.0算法可以生成决策树或者规则集。C5.0模型根据能偶带来的最大信息增益的字段拆分样本。第一次拆分确定的样本子集随后再次拆分,通常是根据另一个字段进行拆分,这一过程重复进行指导样本子集不能在被拆分为止。最后,重新缉拿眼最低层次的拆分,哪些对模型值没有显著贡献的样本子集被提出或者修剪。
C5.0优点:
C5.0模型在面对数据遗漏和输入字段很多的问题时非常稳健;
C5.0模型比一些其他类型的模型易于理解,模型退出的规则有非常直观的解释;
C5.0也提供强大技术以提高分类的精度。
C5.0算法
C5.0算法选择分支变量的依据:以信息熵的下降速度作为确定最佳分支变量和分割阀值的依据。信息熵的下降意味着信息的不确定性下降。
2. 涉及先关包
library(C50)
library(dplyr)
3. 实例
library(C50) data(churn) churn_data <- churnTrain outcome_name <- ‘churn‘ # make the outcome variable easier to read churn_data[,outcome_name] <- as.factor(ifelse(churn_data[,outcome_name]==‘yes‘,‘Does_Churn‘, ‘Stays‘))
interesting_interactions <- function(the_data_frame, outcome_name) {
# install.packages(...) if missing
require(C50)
require(dplyr)
c5model <- C5.0(
x = the_data_frame[,setdiff(names(the_data_frame), outcome_name)],
y = the_data_frame[,outcome_name],
rules = TRUE
)
rule_munger <- capture.output(c5model$rules , split = TRUE)
rule_munger <- strsplit(rule_munger,‘\\\\n‘)
rule_munger <- gsub(x = rule_munger[[1]], pattern = ‘\\\\|\"‘, replacement = ‘‘)[-1]
# extract results into data frame format
rule_count <- 0
conds_last <- 0
cover_last <- 0
ok_last <- 0
lift_last <- 0
class_last <- 0
rules <- c()
for (entry in rule_munger) {
print(entry)
if (substr(entry,1,5) == ‘rules‘)
print(entry)
# track only lines starting with conds or type - ignore rest
if (substr(entry,1,5) == ‘conds‘ |
substr(entry,1,4) == ‘type‘) {
if (substr(entry,1,5) == ‘conds‘) {
rule_count <- rule_count + 1
conds_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][1], split = ‘=‘)[[1]][2]
# cover is the number of training cases covered by the rule
cover_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][2], split = ‘=‘)[[1]][2]
# ok is the number of positives covered by class,
ok_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][3], split = ‘=‘)[[1]][2]
# lift is the estimated accuracy of the rule
lift_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][4], split = ‘=‘)[[1]][2]
# class predicted by
class_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][5], split = ‘=‘)[[1]][2]
}
if (substr(entry,1,4) == ‘type‘) {
# variable type
type_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][1], split = ‘=‘)[[1]][2]
att_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][2], split = ‘=‘)[[1]][2]
# sniff out optional parameters
elts_last <- ‘‘
if (grepl(x = entry, pattern = ‘elts‘)) {
elts_last <- strsplit(x = entry, split = "elts=")[[1]][2]
}
cut_last <- ‘‘
if (grepl(x = entry, pattern = ‘cut‘)) {
cut_last <-
strsplit(
x = strsplit(
x = entry, split = "cut="
)[[1]][2], split = ‘ ‘
)[[1]][1]
}
val_last <- ‘‘
if (grepl(x = entry, pattern = ‘val‘)) {
val_last <- strsplit(x = entry, split = "val=")[[1]][2]
}
result_last <- ‘‘
if (grepl(x = entry, pattern = ‘result‘)) {
result_last <- strsplit(x = entry, split = "result=")[[1]][2]
}
rules <- rbind(
rules, c(
rule_count,
conds_last,
cover_last,
ok_last,
lift_last,
type_last,
att_last,
elts_last,
result_last,
cut_last,
val_last,
class_last
)
)
}
}
}
if (!is.null(rules)) {
rules <- data.frame(rules)
names(rules) <-
c(
‘rule_number‘, ‘conditions‘, ‘cover‘, ‘true_pos‘,
‘lift‘, ‘type‘, ‘attribute‘, ‘elts‘, ‘cut‘, ‘result‘,
‘value‘, ‘outcome‘
)
rules[, 1:6] <- sapply(rules[, 1:6], as.character)
rules[, 1:6] <- sapply(rules[, 1:6], as.numeric)
if (length(unique(rules$rule_number) > 0)) {
rules %>% dplyr::arrange(desc(lift)) -> rules
}
}
return (rules)
}
results <- interesting_interactions(the_data_frame = churn_data, outcome_name = outcome_name)
print_rules <- function(rules_found, rulenum) { print(‘‘) print(paste0(‘Rule #‘, rulenum)) dplyr::filter(rules_found, rule_number == rulenum) -> pulled_rule dplyr::select(pulled_rule, cover, true_pos, outcome) %>% head(1) -> rule_def dplyr::select(pulled_rule, attribute, elts, cut, result, value) -> conditions print(paste0(‘In ‘, rule_def$cover, ‘ cases, ‘, round(rule_def$true_pos/rule_def$cover,2)*100, ‘% customers ‘, as.character(rule_def$outcome),‘ when:‘)) for (cond_id in seq(nrow(conditions))) { cond <- conditions[cond_id,] #attribute elts cut result value if (nchar(as.character(cond$elts)) > 0) { print(paste0(cond$attribute, ‘: ‘, cond$elts)) } else if (nchar(as.character(cond$value)) > 0) { print(paste0(cond$attribute, ‘ == ‘, cond$value)) } else { print(paste0(cond$attribute, " ", cond$cut, " ", cond$result)) } } print(‘‘) }
for (rule_number in unique(results$rule_number)) print_rules(results, rule_number)
## [1] "" ## [1] "Rule #1" ## [1] "In 60 cases, 100% customers Does_Churn when:" ## [1] "international_plan == yes" ## [1] "total_intl_calls < 2" ## [1] "" ## [1] "" ## [1] "Rule #2" ## [1] "In 57 cases, 100% customers Does_Churn when:" ## [1] "international_plan == yes" ## [1] "total_intl_minutes > 13.1" ## [1] "" ## [1] "" ## [1] "Rule #3" ## [1] "In 32 cases, 100% customers Does_Churn when:" ## [1] "total_day_minutes < 120.5" ## [1] "number_customer_service_calls > 3" ## [1] "" ## [1] "" ## [1] "Rule #4" ## [1] "In 79 cases, 96% customers Does_Churn when:" ## [1] "total_day_minutes < 160.2" ## [1] "total_eve_charge < 19.83" ## [1] "number_customer_service_calls > 3" ## [1] "" ## [1] "" ## [1] "Rule #5" ## [1] "In 43 cases, 95% customers Does_Churn when:" ## [1] "international_plan == no" ## [1] "voice_mail_plan == no" ## [1] "total_day_minutes > 246.60001" ## [1] "total_eve_charge > 20.5" ## [1] "" ## [1] "" ## [1] "Rule #6" ## [1] "In 28 cases, 93% customers Does_Churn when:" ## [1] "total_day_minutes < 264.39999" ## [1] "total_eve_calls < 125" ## [1] "total_eve_charge < 12.05" ## [1] "number_customer_service_calls > 3" ## [1] "" ## [1] "" ## [1] "Rule #7" ## [1] "In 78 cases, 90% customers Does_Churn when:" ## [1] "voice_mail_plan == no" ## [1] "total_day_minutes > 223.2" ## [1] "total_eve_charge > 20.5" ## [1] "total_night_minutes > 174.2" ## [1] "" ## [1] "" ## [1] "Rule #8" ## [1] "In 114 cases, 79% customers Does_Churn when:" ## [1] "voice_mail_plan == no" ## [1] "total_day_minutes > 223.2" ## [1] "total_eve_charge > 20.5" ## [1] "" ## [1] "" ## [1] "Rule #9" ## [1] "In 152 cases, 62% customers Does_Churn when:" ## [1] "total_day_minutes > 223.2" ## [1] "total_eve_charge > 20.5" ## [1] "" ## [1] "" ## [1] "Rule #10" ## [1] "In 211 cases, 60% customers Does_Churn when:" ## [1] "total_day_minutes > 264.39999" ## [1] "" ## [1] "" ## [1] "Rule #12" ## [1] "In 768 cases, 97% customers Stays when:" ## [1] "international_plan == no" ## [1] "voice_mail_plan == yes" ## [1] "number_customer_service_calls < 3" ## [1] "" ## [1] "" ## [1] "Rule #11" ## [1] "In 2221 cases, 97% customers Stays when:" ## [1] "international_plan == no" ## [1] "total_day_minutes < 223.2" ## [1] "number_customer_service_calls < 3" ## [1] "" ## [1] "" ## [1] "Rule #13" ## [1] "In 140 cases, 96% customers Stays when:" ## [1] "account_length < 123" ## [1] "total_eve_minutes < 187.7" ## [1] "total_night_minutes < 151.89999" ## [1] "" ## [1] "" ## [1] "Rule #14" ## [1] "In 45 cases, 98% customers Stays when:" ## [1] "international_plan == no" ## [1] "voice_mail_plan == yes" ## [1] "total_day_minutes > 264.39999" ## [1] "" ## [1] "" ## [1] "Rule #15" ## [1] "In 1972 cases, 96% customers Stays when:" ## [1] "total_day_minutes < 264.39999" ## [1] "total_intl_minutes < 13.1" ## [1] "total_intl_calls > 2" ## [1] "number_customer_service_calls < 3" ## [1] "" ## [1] "" ## [1] "Rule #16" ## [1] "In 197 cases, 95% customers Stays when:" ## [1] "total_day_minutes > 120.5" ## [1] "total_day_minutes < 160.2" ## [1] "total_eve_charge > 19.83" ## [1] "" ## [1] "" ## [1] "Rule #17" ## [1] "In 155 cases, 94% customers Stays when:" ## [1] "voice_mail_plan == no" ## [1] "total_day_minutes < 277" ## [1] "total_night_minutes < 126.9" ## [1] "" ## [1] "" ## [1] "Rule #18" ## [1] "In 1675 cases, 89% customers Stays when:" ## [1] "total_day_minutes > 160.2" ## [1] "total_day_minutes < 264.39999" ## [1] "total_eve_charge > 12.05" ## [1] "" ## [1] "" ## [1] "Rule #19" ## [1] "In 434 cases, 89% customers Stays when:" ## [1] "total_eve_charge < 12.26" ## [1] ""
基于R语言的用户分析