pandas notes: pandas data input and output
http://blog.csdn.net/pipisorry/article/details/52208727
Data input and output
Data pickling
pandas pickling is about 2-3 times faster than saving and reading CSV files (the author's test was not rigorous, but it is roughly that much).
ltu_df.to_pickle(os.path.join(CWD, 'middlewares/ltu_df'))
ltu_df = pd.read_pickle(os.path.join(CWD, 'middlewares/ltu_df'))
[read_pickle]
That said, in the author's test, using pickle directly was still faster, roughly 2x faster than pd.read_pickle.
pickle.dump(ltu_df, open(os.path.join(CWD, 'middlewares/ltu_df.pkl'), 'wb'))
ltu_df = pickle.load(open(os.path.join(CWD, 'middlewares/ltu_df.pkl'), 'rb'))
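A minimal timing sketch for reproducing the comparison above; the DataFrame size and the tmp_* file names are only assumptions for illustration:

import pickle
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 4), columns=list('ABCD'))  # hypothetical test frame

def timed(label, fn):
    # run fn once and print the elapsed wall-clock time
    t0 = time.perf_counter()
    fn()
    print('%s: %.3fs' % (label, time.perf_counter() - t0))

timed('to_csv', lambda: df.to_csv('tmp_df.csv'))
timed('read_csv', lambda: pd.read_csv('tmp_df.csv', index_col=0))
timed('to_pickle', lambda: df.to_pickle('tmp_df.pkl'))
timed('read_pickle', lambda: pd.read_pickle('tmp_df.pkl'))
# raw pickle round trip, as in the snippet above
timed('pickle.dump', lambda: pickle.dump(df, open('tmp_df_raw.pkl', 'wb')))
timed('pickle.load', lambda: pickle.load(open('tmp_df_raw.pkl', 'rb')))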
CSV
E.g.: # Reading data locally
Writing to a csv file
In [136]: df.to_csv('foo.csv')
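to_csv also accepts formatting options such as the separator, whether to write the index, and the NA marker; a minimal sketch with a hypothetical frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})  # hypothetical frame

# tab-separated, no row index column, custom marker for missing values
df.to_csv('foo.tsv', sep='\t', index=False, na_rep='NULL')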
read_csv
lines = pd.read_csv(checkin_filename, sep='\t', header=None, names=col_names, parse_dates=[1], skip_blank_lines=True, index_col=0).reset_index()
dateparse = lambda dates: pd.datetime.strptime(dates, '%Y-%m')
data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month', date_parser=dateparse)
Parameters:
skiprows=2: skip the first two rows [0, 1]; equivalent to skiprows=[0, 1];
header=None: do not use row 0 as the column names;
names=['']: specify the column names;
parse_dates=[]: parse the specified columns as dates;
index_col=0: use the given column as the row index; otherwise an automatic index 0, 1, ... is used. reset_index() is the inverse operation.
parse_dates: specifies which column(s) contain the datetime information. As noted above, the column here is named 'Month'.
index_col: the key idea behind pandas time series is that the index becomes the variable describing the time information, so this parameter tells pandas to use the 'Month' column as the index.
date_parser: specifies a function that converts the input strings into datetime values. pandas' default format is 'YYYY-MM-DD HH:MM:SS'; if the data to be read is not in that format, the conversion has to be defined manually, which is what dateparse above does. [python模块 - 时间模块]
converters : dict, default None: Dict of functions for converting values in certain columns. Keys can either be integers or column labels. Since converters applies an arbitrary function to a column, it can take the place of a custom date_parser plus parse_dates.
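A minimal sketch pulling these read_csv options together; the file name checkins.tsv and the column names are hypothetical:

import pandas as pd

# hypothetical tab-separated file with no header row
col_names = ['user_id', 'time', 'venue_id']
df = pd.read_csv('checkins.tsv',
                 sep='\t',
                 header=None,           # the file has no header row
                 names=col_names,       # supply column names ourselves
                 parse_dates=['time'],  # parse the 'time' column as datetimes
                 index_col='time',      # and use it as the row index
                 skip_blank_lines=True)

df = df.reset_index()  # move 'time' back from the index into a regular column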
For example, when parsing times you may want the parser to return the timestamp as a float:
def dateParse(s):
    return float(__import__('datetime').datetime.timestamp(__import__('dateutil.parser').parser.parse(s)))

df = pd.read_csv(os.path.join(CA_DATASET_DIR, checkin_ca), header=0, sep='\t', converters={'Time(GMT)': dateParse})
[Reading from a csv file]
In [137]: pd.read_csv('foo.csv')
Out[137]:
     Unnamed: 0          A          B         C          D
0    2000-01-01   0.266457  -0.399641 -0.219582   1.186860
1    2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
..          ...        ...        ...       ...        ...
998  2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
999  2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 5 columns]
HDF5
Reading and writing to HDFStores
Writing to a HDF5 Store
In [138]: df.to_hdf('foo.h5', 'df')
Reading from a HDF5 Store
In [139]: pd.read_hdf('foo.h5', 'df')
Out[139]:
                    A          B         C          D
2000-01-01   0.266457  -0.399641 -0.219582   1.186860
2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
...               ...        ...       ...        ...
2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 4 columns]
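Besides to_hdf/read_hdf, the same file can be managed through an explicit HDFStore object; a minimal sketch (requires the PyTables package; the key name 'df' is arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))  # hypothetical frame

# HDFStore behaves like a dict of DataFrames backed by one .h5 file
with pd.HDFStore('store.h5') as store:
    store['df'] = df        # write (same effect as df.to_hdf('store.h5', 'df'))
    loaded = store['df']    # read it back
    print(store.keys())     # -> ['/df']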
Excel
It seems that using pd.read_excel requires installing xlrd: pip install xlrd
Reading and writing to MS Excel
Writing to an excel file
In [140]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
Reading from an excel file
pandas.read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None, names=None, parse_cols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, engine=None, squeeze=False, **kwds)
Parameters: converters: use converters when reading to specify the data type of given columns, e.g. pd.read_excel('a.xlsx', converters={0: str})
In [141]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[141]:
                    A          B         C          D
2000-01-01   0.266457  -0.399641 -0.219582   1.186860
2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
...               ...        ...       ...        ...
2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 4 columns]
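For writing several frames into one workbook, pd.ExcelWriter can be combined with to_excel; a minimal sketch (assumes an Excel engine such as openpyxl is installed; the file and sheet names are arbitrary):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})  # hypothetical frames
df2 = pd.DataFrame({'B': [3, 4]})

# one workbook, two sheets
with pd.ExcelWriter('multi.xlsx') as writer:
    df1.to_excel(writer, sheet_name='first')
    df2.to_excel(writer, sheet_name='second')

# read a single sheet back, forcing column 'A' to str via converters
back = pd.read_excel('multi.xlsx', sheet_name='first', converters={'A': str})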
Gotchas
If you are trying an operation and you see an exception like:
>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
    ...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
See Comparisons for an explanation and what to do.
See Gotchas as well.
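A minimal sketch of the usual fixes suggested by the error message (decide explicitly whether you mean any(), all(), or a non-empty check):

import pandas as pd

s = pd.Series([False, True, False])

if s.any():          # True if at least one element is True
    print("at least one is true")
if s.all():          # True only if every element is True
    print("all are true")
if not s.empty:      # True if the Series has any elements at all
    print("series is not empty")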
[CSV & Text files]
from: http://blog.csdn.net/pipisorry/article/details/52208727
ref: [IO Tools (Text, CSV, HDF5, ...)]