首页 > 代码库 > 导入GoogleClusterData到MySQL
导入GoogleClusterData到MySQL
本篇随笔记录如何导入google-cluster-data-2011-1-2的
job_events和task_events到MySQL
1. 下载数据
download_job_events:
import urllib2url = ‘https://commondatastorage.googleapis.com/clusterdata-2011-2/‘f = open(‘C:\\SHA256SUM‘)l = f.readlines()f.close()for i in l: if i.count(‘job_events‘)>0: fileAddr = i.split()[1][1:] fileName = fileAddr.split(‘/‘)[1] print ‘downloading‘, fileName data = urllib2.urlopen(url+fileAddr).read() print ‘saving‘, fileName fileDown = open(‘C:\\job_events\\‘+fileName, ‘wb‘) fileDown.write(data) fileDown.close()
download_task_events:
import urllib2url = ‘https://commondatastorage.googleapis.com/clusterdata-2011-2/‘f = open(‘C:\\SHA256SUM‘)l = f.readlines()f.close()for i in l: if i.count(‘task_events‘)>0: fileAddr = i.split()[1][1:] fileName = fileAddr.split(‘/‘)[1] print ‘downloading‘, fileName data = urllib2.urlopen(url+fileAddr).read() print ‘saving‘, fileName fileDown = open(‘C:\\task_events\\‘+fileName, ‘wb‘) fileDown.write(data) fileDown.close()
注意:这次用的数据是
clusterdata-2011-2
不同于之前重画GoogleCLusterData中的
clusterdata-2011-1
2. 解压缩
由于不能直接导入压缩包里的数据到mysql,故先将它们解压缩
unzip_job_events:
import gzipimport osfileNames = os.listdir(‘C:\\task_events‘)for l in fileNames: print ‘now at: ‘+ l f = gzip.open(‘C:\\job_events\\‘+l) fOut = open(‘C:\\job_events_unzip\\‘+l[:-3], ‘w‘) content = f.read() fOut.write(content) f.close() fOut.close() #raw_input()
unzip_task_events:
import gzipimport osfileNames = os.listdir(‘C:\\task_events‘)for l in fileNames: print ‘now at: ‘+ l f = gzip.open(‘C:\\task_events\\‘+l) fOut = open(‘C:\\task_events_unzip\\‘+l[:-3], ‘w‘) content = f.read() fOut.write(content) f.close() fOut.close()
3. 建数据库
create_job_events:
create table job_events(time bigint,missing_info int,job_id bigint,event_type int,user text,scheduling_class int,job_name text,logical_job_name text)engine = myisam;
create_task_events:
create table task_events(time bigint,missing_info int,job_id bigint,task_index bigint,machine_id bigint,event_type int,user text,scheduling_class int,priority int,cpu_request float,memory_request float,disk_space_request float,difference_machine_restriction boolean)engine = myisam;
注意:由于数据量非常大,这里一定要选择myisam作为engine。
4. 导入数据
由于数据中有部分为空的值,需要先设定mysql使其能够导入空值。
具体方法为:
在mysql的控制台输入
SET @@GLOBAL.sql_mode="NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION";
之后就可以开始导入数据了。
loadJobEvents2MySQL.py
import osimport MySQLdbfileNames = os.listdir(‘C:\\task_events_unzip‘)conn=MySQLdb.connect(host="localhost",user="root",passwd="123456",db="googleclusterdata",charset="utf8")cursor = conn.cursor()cursor.execute(‘truncate job_events‘)for f in fileNames: print ‘now at: ‘+ f order = "load data infile ‘C:/job_events_unzip/%s‘ into table job_events fields terminated by ‘,‘ lines terminated by ‘\n‘" %f print order cursor.execute(order) conn.commit()
loadTaskEvents2MySQL.py
import osimport MySQLdbfileNames = os.listdir(‘C:\\task_events_unzip‘)conn=MySQLdb.connect(host="localhost",user="root",passwd="123456",db="googleclusterdata",charset="utf8")cursor = conn.cursor()cursor.execute(‘truncate task_events‘)for f in fileNames: print ‘now at: ‘+ f order = "load data infile ‘C:/task_events_unzip/%s‘ into table task_events fields terminated by ‘,‘ lines terminated by ‘\n‘" %f print order cursor.execute(order) conn.commit()
注意:这里需要相应的修改密码和使用的数据库名(db)
导入GoogleClusterData到MySQL
声明:以上内容来自用户投稿及互联网公开渠道收集整理发布,本网站不拥有所有权,未作人工编辑处理,也不承担相关法律责任,若内容有误或涉及侵权可进行投诉: 投诉/举报 工作人员会在5个工作日内联系你,一经查实,本站将立刻删除涉嫌侵权内容。