首页 > 代码库 > 数据挖掘算法学习(五)C4.5

数据挖掘算法学习(五)C4.5

C4.5分类决策树算法,其核心算法是ID3算法。目前应用在临床决策、生产制造、文档分析、生物信息学、空间数据建模等领域。算法的输入是带类标的数据,输出是树形的决策规则。

C4.5ID3改进:

1)用信息增益率来选择属性,克服了用信息增益选择属性时偏向选择取值多的属性的不足;
2)在树构造过程中进行剪枝;
3)能够完成对连续属性的离散化处理;
4)能够对不完整数据进行处理。

C4.5算法优点:产生的分类规则易于理解,准确率较高

C4.5算法缺点:在构造树的过程中,需要对数据集进行多次的顺序扫描和排序,因而导致算法的低效。

算法过程:

?C4.5(DataSet,featureList)
创建根节点R
如果当前DataSet中的数据都属于同一类,则标R的类别为该
果当featureList合为空,则标记R的类别为当前DataSet中样本最多的类别
归情况:
?featureList中选择属性F(选择GainRatio(DataSet, F)最大的属性,连续属性参见上面的离散化过程)
?F的每一个值v,将DataSet划分为不同的子集DS,对于每一个DS
创建节点C
如果DS为空,节点C标记为DataSet中样本最多的类别
如果DS不为空,节点C=C4.5(DS,featureList - F)
将节点C添加为R的子节点

利用weka将weather数据用c4.5分类构建的决策树如下图:


用SQL算法实现c4.5算法的核心代码:

drop procedure if exists buildtree;
DELIMITER |
create procedure buildtree()
begin
declare le int default 1;
declare letemp int default 1;
declare current_num int default 1;
declare current_class varchar(20);
declare current_gain double;
declare current_table varchar(20) default 'weather';
update infoset set state=1,statetemp=1;
rr:while (1=1) do
	set @weather = (select play from weather where class is null limit 0,1);
	set @feature =(select x from infoset where statetemp=1 limit 0,1);
	if (@weather is null ) then
		leave rr;
	else if(@feature is null) then
		update infoset set statetemp = state; 
	end if;
	end if;
	if (@weather is not null) then
		b:begin
			set current_gain = (select max(info_Gain) from infoset where statetemp=1);
			set current_class = (select x from infoset where info_Gain = current_gain);
			drop table if exists aa;
			set @a=concat('create temporary table aa select distinct ',current_class,' as namee from weather where class is null');
			prepare stmt1 from @a;
			execute stmt1;
			tt:while (1=1) do
				set @x = (select namee from aa limit 0,1);
				if (@x is not null) then
					a0:begin
						drop table if exists bb;
						set @b=concat('create temporary table bb select * from ', current_table,' where ',current_class,' = ? and class is null');
						prepare stmt2 from @b;
						execute stmt2 using @x;
						set @count = (select count(distinct play) from bb);
						if (@count =1) then
							a1:begin
								update weather set class = current_num,levelnum = letemp where id in (select id from bb); 
								set current_num = current_num+1;
								if (current_table ='cc') then
									delete from cc where id in (select id from bb);
								end if;
								set @f=(select play from cc limit 0,1);
								if (@f is null) then
									set current_table='weather';
									update infoset set statetemp=state; 
									set letemp =le;
								end if;
							delete from aa where namee = @x;
							end a1;
							end if;
						if (@count>1) then
								drop table if exists cc;
								create temporary table cc select * from bb;
								set current_table = 'cc';
								set letemp = letemp+1;
								leave tt;
							end if;
						if(@count=0) then
								delete from aa where namee = @x;
								set le = le+1;
							end if;
				end a0;
				else 
					update infoset set state=0 where x=current_class;
					leave tt;
				end if;
			end while;	
			update infoset set statetemp=0 where x=current_class; 
	 end b;
end if;
end while;
end |
delimiter ;

运行后分类结果如下图所示:



 

其中,class属性记录分类结果,levelnum属性记录各个数据在决策树中的层数。


对程序中各个表的解释:

?1 weather 增加levelnum属性列,用于存放每条记录在生成的分类树中的层数。
?2 C45_temp 用于存放中间结果。其中属性列logvparent,分别用于存储各属性分量的中间计算结果和所属类别。
?3 infoset用于存放各个属性的信息增益率。属性列statestatetemp用于在计算过程中做标记。



数据挖掘算法学习(五)C4.5