首页 > 代码库 > 利用Python进行数据分析——数据规整化:清理、转换、合并、重塑(七)(2)

利用Python进行数据分析——数据规整化:清理、转换、合并、重塑(七)(2)

1、索引上的合并

有时候,DataFrame中的连接键位于其索引中。在这种情况下,你可以传入left_index = True或right_index = True(或两个都传)以说明索引应该被用作连接键:

In [8]: left1 = pd.DataFrame({‘key‘: [‘a‘, ‘b‘, ‘a‘, ‘a‘, ‘b‘, ‘c‘],
   ...:                       ‘value‘:range(6)})

In [9]: right1 = pd.DataFrame({‘group_val‘:[3.5, 7]}, index=[‘a‘, ‘b‘])

In [10]: left1
Out[10]: 
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

[6 rows x 2 columns]

In [11]: right1
Out[11]: 
   group_val
a        3.5
b        7.0

[2 rows x 1 columns]

In [12]: pd.merge(left1, right1, left_on=‘key‘, right_index=True)
Out[12]: 
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0

[5 rows x 3 columns]
由于默认的merge方法是求取连接键的交集,因此你可以通过外连接的方式得到它们的并集:

In [13]: pd.merge(left1, right1, left_on=‘key‘, right_index=True, how=‘outer‘)
Out[13]: 
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN

[6 rows x 3 columns]
对于层次化索引的数据,事情就有点复杂了:

In [14]: lefth = pd.DataFrame({‘key1‘: [‘Ohio‘, ‘Ohio‘, ‘Ohio‘, ‘Nevada‘, ‘Nevada‘],
   ....:                       ‘key2‘: [2000, 2001, 2002, 2001, 2002],
   ....:                       ‘data‘: np.arange(5.)})  

In [15]: righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
   ....:                       index=[[‘Nevada‘, ‘Nevada‘, ‘Ohio‘, ‘Ohio‘, ‘Ohio‘, ‘Ohio‘], [2001, 2000, 2000, 2000, 2001, 2002]],
   ....:                       columns=[‘event1‘, ‘event2‘])

In [16]: lefth
Out[16]: 
   data    key1  key2
0     0    Ohio  2000
1     1    Ohio  2001
2     2    Ohio  2002
3     3  Nevada  2001
4     4  Nevada  2002

[5 rows x 3 columns]

In [17]: righth
Out[17]: 
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11

[6 rows x 2 columns]
这种情况下,你必须以列表的形式指明用作合并键的多个列(注意对重复索引值的处理):

In [18]: pd.merge(lefth, righth, left_on=[‘key1‘, ‘key2‘], right_index=True)
Out[18]: 
   data    key1  key2  event1  event2
0     0    Ohio  2000       4       5
0     0    Ohio  2000       6       7
1     1    Ohio  2001       8       9
2     2    Ohio  2002      10      11
3     3  Nevada  2001       0       1

[5 rows x 5 columns]

In [19]: pd.merge(lefth, righth, left_on=[‘key1‘, ‘key2‘], right_index=True, how=‘outer‘)
Out[19]: 
   data    key1  key2  event1  event2
0     0    Ohio  2000       4       5
0     0    Ohio  2000       6       7
1     1    Ohio  2001       8       9
2     2    Ohio  2002      10      11
3     3  Nevada  2001       0       1
4     4  Nevada  2002     NaN     NaN
4   NaN  Nevada  2000       2       3

[7 rows x 5 columns]

同时使用合并双方的索引也没问题:

In [21]: left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=[‘a‘, ‘c‘, ‘e‘], columns=[‘Ohio‘, ‘Nevada‘])

In [22]: right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
   ....:                        index=[‘b‘, ‘c‘, ‘d‘, ‘e‘], columns=[‘Missouri‘, ‘Alabama‘])

In [23]: left2
Out[23]: 
   Ohio  Nevada
a     1       2
c     3       4
e     5       6

[3 rows x 2 columns]

In [24]: right2
Out[24]: 
   Missouri  Alabama
b         7        8
c         9       10
d        11       12
e        13       14

[4 rows x 2 columns]

In [25]: pd.merge(left2, right2, how=‘outer‘, left_index=True, right_index=True)Out[25]: 
   Ohio  Nevada  Missouri  Alabama
a     1       2       NaN      NaN
b   NaN     NaN         7        8
c     3       4         9       10
d   NaN     NaN        11       12
e     5       6        13       14

[5 rows x 4 columns]
DataFrame还有一个join实例方法,它能更为方便地实现按索引合并。它还可用于合并多个带有相同或相似索引的DataFrame对象,而不管它们之间有没有重叠的列。在上面那个例子中,我们可以编写:

In [26]: left2.join(right2, how=‘outer‘)
Out[26]: 
   Ohio  Nevada  Missouri  Alabama
a     1       2       NaN      NaN
b   NaN     NaN         7        8
c     3       4         9       10
d   NaN     NaN        11       12
e     5       6        13       14

[5 rows x 4 columns]
由于一些历史原因(早期版本的pandas),DataFrame的join方法是在连接键上做左连接。它还支持参数DataFrame的索引跟调用者DataFrame的某个列之间的连接:

In [28]: left1.join(right1, on=‘key‘)
Out[28]: 
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN

[6 rows x 3 columns]
最后,对于简单的索引合并,你还可以向join传入一组DataFrame(concat函数也能实现此功能):

In [29]: another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
   ....:                        index=[‘a‘, ‘c‘, ‘e‘, ‘f‘], columns=[‘New York‘, ‘Oregon‘])

In [30]: left2.join([right2, another])
Out[30]: 
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a     1       2       NaN      NaN         7       8
c     3       4         9       10         9      10
e     5       6        13       14        11      12

[3 rows x 6 columns]

In [31]: left2.join([right2, another], how=‘outer‘)
Out[31]: 
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a     1       2       NaN      NaN         7       8
b   NaN     NaN         7        8       NaN     NaN
c     3       4         9       10         9      10
d   NaN     NaN        11       12       NaN     NaN
e     5       6        13       14        11      12
f   NaN     NaN       NaN      NaN        16      17

[6 rows x 6 columns]

2、轴向连接

另一种数据合并运算也被称作连接(concatenation)、绑定(binding)或堆叠(stacking)。NumPy有一个用于合并原始NumPy数组的concatenation函数:

In [32]: arr = np.arange(12).reshape((3, 4))

In [33]: arr
Out[33]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [34]: np.concatenate([arr, arr], axis=1)
Out[34]: 
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
对于pandas对象(如Series和DataFrame),带有标签的轴使你能够进一步推广数组的连接运算。具体点说,你还需要考虑以下这些东西:

  • 如果各对象其他轴上的索引不同,那些轴应该是做并集还是交集?
  • 结果对象中的分组需要各不相同吗?
  • 用于连接的轴重要吗?

pandas的concat函数提供了一种能够解决这些问题的可靠方式。我将给出一些例子来讲解其使用方式。假设有三个没有重叠索引的Series:

In [35]: s1 = pd.Series([0, 1], index=[‘a‘, ‘b‘])

In [36]: s2 = pd.Series([2, 3, 4], index=[‘c‘, ‘d‘, ‘e‘])

In [37]: s3 = pd.Series([5, 6], index=[‘f‘, ‘g‘])
对这些对象调用concat可以将值和索引粘合在一起:

In [38]: pd.concat([s1, s2, s3])
Out[38]: 
a    0
b    1
c    2
d    3
e    4
f    5
g    6a
dtype: int64
默认情况下,concat是在axis=0上工作的,最终产生一个新的Series。如果传入axis=1,则结果就会变成一个DataFrame(axis=1是列):

In [39]: pd.concat([s1, s2, s3], axis=1)
Out[39]: 
    0   1   2
a   0 NaN NaN
b   1 NaN NaN
c NaN   2 NaN
d NaN   3 NaN
e NaN   4 NaN
f NaN NaN   5
g NaN NaN   6

[7 rows x 3 columns]
这种情况下,另外一条轴上没有重叠,从索引的有序并集(外连接)上就可以看出来。传入join=‘inner‘即可得到它们的交集:

In [40]: s4 = pd.concat([s1 * 5, s3])

In [41]: pd.concat([s1, s4], axis=1)
Out[41]: 
    0  1
a   0  0
b   1  5
f NaN  5
g NaN  6

[4 rows x 2 columns]

In [42]: pd.concat([s1, s4], axis=1, join=‘inner‘)
Out[42]: 
   0  1
a  0  0
b  1  5

[2 rows x 2 columns]
你可以通过join_axes指定要在其他轴上使用的索引:

In [43]: pd.concat([s1, s4], axis=1, join_axes=[[‘a‘, ‘c‘, ‘b‘, ‘e‘]])
Out[43]: 
    0   1
a   0   0
c NaN NaN
b   1   5
e NaN NaN

[4 rows x 2 columns]
不过有个问题,参与连接的片段在结果中区分不开。假设你想要在连接轴上创建一个层次化索引。使用keys参数即可达到这个目的:

In [44]: result = pd.concat([s1, s1, s3], keys=[‘one‘, ‘two‘, ‘three‘])

In [45]: result
Out[45]: 
one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64
In [46]: # 稍后将详细讲解unstack函数

In [47]: result.unstack()
Out[47]: 
        a   b   f   g
one     0   1 NaN NaN
two     0   1 NaN NaN
three NaN NaN   5   6

[3 rows x 4 columns]
如果沿着axis=1对Series进行合并,则keys就会成为DataFrame的列头:

In [49]: pd.concat([s1, s2, s3], axis=1, keys=[‘one‘, ‘two‘, ‘three‘])
Out[49]: 
   one  two  three
a    0  NaN    NaN
b    1  NaN    NaN
c  NaN    2    NaN
d  NaN    3    NaN
e  NaN    4    NaN
f  NaN  NaN      5
g  NaN  NaN      6

[7 rows x 3 columns]
同样的逻辑对DataFrame对象也是一样:

In [50]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=[‘a‘, ‘b‘, ‘c‘],
   ....:                    columns=[‘one‘, ‘two‘])

In [51]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=[‘a‘, ‘c‘],
   ....:                    columns=[‘three‘, ‘four‘])

In [52]: pd.concat([df1, df2], axis=1, keys=[‘level1‘, ‘level2‘])
Out[52]: 
   level1       level2      
      one  two   three  four
a       0    1       5     6
b       2    3     NaN   NaN
c       4    5       7     8

[3 rows x 4 columns]
如果传入的不是列表而是一个字典,则字典的键就会被当做keys选项的值:

In [53]: pd.concat({‘level1‘: df1, ‘level2‘: df2}, axis=1)
Out[53]: 
   level1       level2      
      one  two   three  four
a       0    1       5     6
b       2    3     NaN   NaN
c       4    5       7     8

[3 rows x 4 columns]
此外还有两个用于管理层次化索引创建方式的参数,参见下表:

In [54]: pd.concat([df1, df2], axis=1, keys=[‘level1‘, ‘level2‘], names=[‘upper‘, ‘lower‘])
Out[54]: 
upper  level1       level2      
lower     one  two   three  four
a           0    1       5     6
b           2    3     NaN   NaN
c           4    5       7     8

[3 rows x 4 columns]

最后一个需要考虑的问题是,跟当前分析工作无关的DataFrame行索引:

In [59]: df1 = pd.DataFrame(np.random.randn(3, 4), columns=[‘a‘, ‘b‘, ‘c‘, ‘d‘]) 
In [60]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=[‘b‘, ‘d‘, ‘a‘])

In [61]: df1
Out[61]: 
          a         b         c         d
0  0.189556 -1.076347  0.074607  0.920779
1 -1.109474 -0.643803 -0.109871 -0.441758
2 -1.385936 -1.468652 -0.979678  1.021477

[3 rows x 4 columns]

In [62]: df2
Out[62]: 
          b         d         a
0 -0.009724  0.514570  1.054630
1 -0.462397 -2.705826 -2.622081

[2 rows x 3 columns]
在这种情况下,传入ignore_index=True即可:

In [63]: pd.concat([df1, df2], ignore_index=True)Out[63]: 
          a         b         c         d
0  0.189556 -1.076347  0.074607  0.920779
1 -1.109474 -0.643803 -0.109871 -0.441758
2 -1.385936 -1.468652 -0.979678  1.021477
3  1.054630 -0.009724       NaN  0.514570
4 -2.622081 -0.462397       NaN -2.705826

[5 rows x 4 columns]


3、合并重叠数据

还有一种数据组合问题不能用简单的合并(merge)或连接(concatenation)运算来处理。比如说,你可能有索引全部或部分重叠的两个数据集。给这个例子增加一点启发性,我们使用NumPy的where函数,它用于表达一种矢量化的if-else:

In [64]: a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=[‘f‘, ‘e‘, ‘d‘, ‘c‘, ‘b‘, ‘a‘])

In [65]: b = pd.Series(np.arange(len(a), dtype=np.float64), index=[‘f‘, ‘e‘, ‘d‘, ‘c‘, ‘b‘, ‘a‘])

In [66]: b[-1] = np.nan

In [67]: a
Out[67]: 
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [68]: b
Out[68]: 
f     0
e     1
d     2
c     3
b     4
a   NaN
dtype: float64

In [69]: np.where(pd.isnull(a), b, a)
Out[69]: array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])
Series有一个combine_first方法,实现的也是一样的功能,而且会进行数据对齐:

In [70]: b[:-2].combine_first(a[2:])
Out[70]: 
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
对于DataFrame,combile_first自然也会在列上做同样的事情,因此你可以将其看做:用参数对象中的数据为调用者对象的缺失数据“打补丁”:

In [71]: df1 = pd.DataFrame({‘a‘: [1., np.nan, 5., np.nan],
   ....:                     ‘b‘: [np.nan, 2., np.nan, 6.],
   ....:                     ‘c‘: range(2, 18, 4)})

In [72]: df2 = pd.DataFrame({‘a‘: [5., 4., np.nan, 3., 7.],
   ....:                     ‘b‘: [np.nan, 3., 4., 6., 8.]})

In [73]: df1.combine_first(df2)
Out[73]: 
   a   b   c
0  1 NaN   2
1  4   2   6
2  5   4  10
3  3   6  14
4  7   8 NaN

[5 rows x 3 columns]