Basic Data Statistics with Python

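1. Convenient data acquisition

  1.1 Local data: opening, reading, writing, and closing files (covered in a separate chapter)

  1.2 Network data:

    1.2.1 urllib, urllib2, httplib, httplib2 (in Python 3: urllib.request, http.client)

      Regular expressions (covered in a separate chapter)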
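As a rough illustration of how these two pieces fit together, the sketch below downloads text with urllib.request (Python 3) and pulls numbers out of it with a regular expression. The helper names fetch_text and extract_prices, the sample text, and the data: URL are all assumptions made for this example; the data: URL only keeps the example offline, and a real http(s) URL would be passed the same way.

```python
import re
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download a URL and decode the response body to text."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


def extract_prices(text):
    """Return every number with a decimal part found in the text, as floats."""
    return [float(match) for match in re.findall(r"\d+\.\d+", text)]


# Hypothetical sample: a data: URL carrying the text "open=60.37 close=61.83".
sample_url = "data:text/plain;charset=utf-8,open%3D60.37%20close%3D61.83"
page = fetch_text(sample_url)
prices = extract_prices(page)
print(page)
print(prices)
```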

    1.2.2 Fetching Yahoo Finance data with the matplotlib.finance module

In [7]: from matplotlib.finance import quotes_historical_yahoo_ochl

In [8]: from datetime import date

In [9]: from datetime import datetime

In [10]: import pandas as pd

In [11]: today = date.today()

In [12]: start = (today.year-1, today.month, today.day)

In [14]: quotes = quotes_historical_yahoo_ochl('AXP', start, today)  # fetch the data

In [15]: fields = ['date', 'open', 'close', 'high', 'low', 'volume']

In [16]: list1 = []

In [18]: for i in range(0,len(quotes)):
    ...:     x = date.fromordinal(int(quotes[i][0]))  # first column of each row: convert the ordinal to a date with date.fromordinal
    ...:     y = datetime.strftime(x,'%Y-%m-%d')  # format the date as YYYY-MM-DD with datetime.strftime
    ...:     list1.append(y)  # collect the formatted dates in a list
    ...:     

In [19]: quotesdf = pd.DataFrame(quotes,index=list1,columns=fields)  # dates become the index, fields become the column names

In [20]: quotesdf = quotesdf.drop(['date'],axis=1)  # drop the date column

In [21]: print(quotesdf)
                 open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
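Later matplotlib releases no longer ship matplotlib.finance, so the session above may not run as written. The sketch below reproduces only the DataFrame-building step, assuming the quotes are already in memory as tuples in (date, open, close, high, low, volume) order; the three rows are the ones printed in this article, and the list comprehension replaces the explicit loop.

```python
from datetime import date

import pandas as pd

# The first three rows shown in this article, with the date as an ordinal.
quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (735985.0, 57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
]
fields = ["date", "open", "close", "high", "low", "volume"]

# Convert each Gregorian ordinal to a 'YYYY-MM-DD' label.
labels = [date.fromordinal(int(row[0])).strftime("%Y-%m-%d") for row in quotes]

# Use the labels as the index, then drop the now redundant date column.
quotesdf = pd.DataFrame(quotes, index=labels, columns=fields)
quotesdf = quotesdf.drop(["date"], axis=1)
print(quotesdf)
```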

    1.2.3 Fetching corpora and other data with the Natural Language Toolkit (NLTK)

      1. Install nltk: pip install nltk

      2. Download a corpus:

In [1]: import nltk

In [2]: nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> gutenberg
    Downloading package gutenberg to /root/nltk_data...
      Package gutenberg is already up-to-date!

      3. Load the data:

In [3]: from nltk.corpus import gutenberg

In [4]: print(gutenberg.fileids())
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']

In [5]: texts = gutenberg.words('shakespeare-hamlet.txt')

In [6]: texts
Out[6]: [u'[', u'The', u'Tragedie', u'of', u'Hamlet', u'by', ...]
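The word list behaves like a sequence of strings, so ordinary Python tools can summarise it. The sketch below is only an illustration: it uses a plain list holding the six tokens visible in Out[6] as a stand-in for the corpus, so it runs without NLTK or the downloaded data.

```python
from collections import Counter

# Stand-in for gutenberg.words(...): the six tokens visible in Out[6].
tokens = ["[", "The", "Tragedie", "of", "Hamlet", "by"]

# Keep alphabetic tokens only and normalise them to lower case.
words = [token.lower() for token in tokens if token.isalpha()]

# Count how many words there are of each length.
length_counts = Counter(len(word) for word in words)
print(words)
print(length_counts)
```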

2. Data preparation and cleanup

  2.1 Adding column names to the quotes data

In [79]: quotesdf = pd.DataFrame(quotes)

In [80]: quotesdf
Out[80]: 
            0          1          2          3          4           5
0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3    735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0

[253 rows x 6 columns]

In [81]: fields = ['date','open','close','high','low','volume']

In [82]: quotesdf = pd.DataFrame(quotes,columns=fields)  # set the column names

In [83]: quotesdf
Out[83]: 
         date       open      close       high        low      volume
0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3    735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0

  2.2 Adding index labels to the quotes data

In [84]: quotesdf
Out[84]: 
         date       open      close       high        low      volume
0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0

[253 rows x 6 columns]

In [85]: quotesdf = pd.DataFrame(quotes, index=range(1,len(quotes)+1),columns=fields)  # change the index from 0,1,2... to 1,2,3...

In [86]: quotesdf
Out[86]: 
         date       open      close       high        low      volume
1    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
2    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
3    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
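Sections 2.1 and 2.2 rebuild the DataFrame each time the labels change. As an alternative sketch, the column names and the index can also be assigned on an existing frame; the three rows below are copied from the output above so the example runs without the Yahoo data.

```python
import pandas as pd

# First three rows of the quotes data shown above.
quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (735985.0, 57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
]
quotesdf = pd.DataFrame(quotes)  # default labels: 0..5 for columns, 0..2 for rows

# Assign new labels in place instead of calling pd.DataFrame again.
quotesdf.columns = ["date", "open", "close", "high", "low", "volume"]
quotesdf.index = range(1, len(quotesdf) + 1)
print(quotesdf)
```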

  2.3 Date conversion: Gregorian ordinal => ordinary date notation

In [88]: from datetime import date

In [89]: firstday = date.fromordinal(735190)

In [93]: firstday
Out[93]: datetime.date(2013, 11, 18)

In [95]: firstday = datetime.strftime(firstday,'%Y-%m-%d')

In [96]: firstday
Out[96]: '2013-11-18'
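The conversion can be wrapped in a pair of small helpers, which also makes the reverse direction explicit. The function names ordinal_to_iso and iso_to_ordinal are made up for this sketch; only the standard datetime module is used.

```python
from datetime import date


def ordinal_to_iso(ordinal):
    """Convert a Gregorian ordinal (day 1 is 0001-01-01) to 'YYYY-MM-DD'."""
    return date.fromordinal(int(ordinal)).strftime("%Y-%m-%d")


def iso_to_ordinal(text):
    """Convert a 'YYYY-MM-DD' string back to its Gregorian ordinal."""
    year, month, day = (int(part) for part in text.split("-"))
    return date(year, month, day).toordinal()


print(ordinal_to_iso(735190))        # the value used in In [89]
print(iso_to_ordinal("2013-11-18"))
```

The int() call matters for the quotes data, where the ordinals are stored as floats such as 735983.0.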

  2.4 Creating a time series:

In [120]: import pandas as pd

In [121]: dates = pd.date_range('20170101', periods=7)  # build a date sequence from a start date and a length

In [122]: dates
Out[122]: 
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07'],
              dtype='datetime64[ns]', freq='D')

In [123]: import numpy as np

In [124]: dates = pd.DataFrame(np.random.randn(7,3), index=dates, columns=list('ABC'))  # dates as the index, A, B, C as the column names, filled with 7 rows x 3 columns of random numbers

In [125]: dates
Out[125]: 
                   A         B         C
2017-01-01  0.705927  0.311453  1.455362
2017-01-02 -0.331531 -0.358449  0.175375
2017-01-03 -0.284583 -1.760700 -0.582880
2017-01-04 -0.759392 -2.080658 -2.015328
2017-01-05 -0.517370  0.906072 -0.106568
2017-01-06 -0.252802 -2.135604 -0.692153
2017-01-07 -0.275184  0.142973 -1.262126
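Once dates are the index, rows can be selected by date. The sketch below builds the same kind of frame and then selects by label; the fixed seed and the variable names frame, first_three, and jan_fifth are assumptions for the example, so the random values differ from the output above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=7)  # 7 consecutive days
rng = np.random.RandomState(0)                # fixed seed so reruns give the same numbers
frame = pd.DataFrame(rng.randn(7, 3), index=dates, columns=list("ABC"))

# Label slices on a date index include both end points.
first_three = frame.loc["2017-01-01":"2017-01-03"]

# A single timestamp selects one row, returned as a Series.
jan_fifth = frame.loc[pd.Timestamp("2017-01-05")]

print(first_three)
print(jan_fifth)
```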

  2.5 Exercises

In [101]: datetime.now()  # show the current date and time
Out[101]: datetime.datetime(2017, 1, 20, 16, 11, 50, 43258)
=========================================
In [108]: datetime.now().month  # show the current month
Out[108]: 1

=========================================
In [126]: import pandas as pd

In [127]: dates = pd.date_range('2015-02-01',periods=10)

In [128]: dates
Out[128]: 
DatetimeIndex(['2015-02-01', '2015-02-02', '2015-02-03', '2015-02-04',
               '2015-02-05', '2015-02-06', '2015-02-07', '2015-02-08',
               '2015-02-09', '2015-02-10'],
              dtype='datetime64[ns]', freq='D')

In [133]: res = pd.DataFrame(range(1,11),index=dates,columns=['value'])

In [134]: res
Out[134]: 
            value
2015-02-01      1
2015-02-02      2
2015-02-03      3
2015-02-04      4
2015-02-05      5
2015-02-06      6
2015-02-07      7
2015-02-08      8
2015-02-09      9
2015-02-10     10

3. Displaying data

  3.1 Display methods:

In [180]: quotesdf2.index  # show the index
Out[180]: 
Index([u'2016-01-20', u'2016-01-21', u'2016-01-22', u'2016-01-25',
       ...
       u'2017-01-11', u'2017-01-12', u'2017-01-13', u'2017-01-17',
       u'2017-01-18', u'2017-01-19'],
      dtype='object', length=253)

In [181]: quotesdf2.columns  # show the column names
Out[181]: Index([u'open', u'close', u'high', u'low', u'volume'], dtype='object')

In [182]: quotesdf2.values  # show the underlying values
Out[182]: 
array([[  6.03741455e+01,   6.18359160e+01,   6.23362562e+01,
          6.01288817e+01,   9.04380000e+06],
       ..., 
       [  7.76100010e+01,   7.66900020e+01,   7.77799990e+01,
          7.66100010e+01,   7.79110000e+06]])

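In [183]: quotesdf2.describe  # without parentheses this returns the bound method itself; call quotesdf2.describe() for summary statistics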
Out[183]: 
<bound method DataFrame.describe of                  open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
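In [183] accesses describe without calling it, which is why the output is the bound method followed by the whole table rather than a summary. A minimal sketch of the call with parentheses follows; it uses a three-row frame built from the open and close values shown above, so the statistics cover only those three rows, not the full 253-row data set.

```python
import pandas as pd

# Three rows taken from the table above (open and close columns only).
quotesdf2 = pd.DataFrame(
    {
        "open": [60.374146, 61.806486, 57.283819],
        "close": [61.835916, 61.453305, 54.016907],
    },
    index=["2016-01-20", "2016-01-21", "2016-01-22"],
)

# Calling the method returns count, mean, std, min, quartiles and max per column.
summary = quotesdf2.describe()
print(summary)
```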

  3.2 Index format: the u prefix marks a unicode string (Python 2)

  3.3 Displaying rows:

In [193]: quotesdf.head(2)  # dedicated method: first two rows
Out[193]: 
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0

In [194]: quotesdf.tail(2)  # dedicated method: last two rows
Out[194]: 
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0

In [195]: quotesdf[:2]  # slicing: first two rows
Out[195]: 
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0

In [197]: quotesdf[251:]  # slicing: last two rows (positions 251 and 252 of 253)
Out[197]: 
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0
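Note that the slices above count positions, not index labels: with the 1-based index, quotesdf[251:] returns the rows labelled 252 and 253. The sketch below shows the same behaviour on a four-row frame (close values copied from section 2.1) and uses iloc to make the positional intent explicit; the frame name df is an assumption for the example.

```python
import pandas as pd

# Four close prices from section 2.1, with a 1-based index as in section 2.2.
df = pd.DataFrame(
    {"close": [61.835916, 61.453305, 54.016907, 53.977664]},
    index=range(1, 5),
)

first_two = df[:2]             # positional slice: rows labelled 1 and 2
last_two = df[len(df) - 2:]    # positional slice: rows labelled 3 and 4
last_two_iloc = df.iloc[-2:]   # same rows, with the positional intent spelled out

print(first_two)
print(last_two)
```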

4. Selecting data

5. Simple statistics and processing

6. Grouping

7. Merge
