Python for Data Analysis - Data Wrangling: Clean, Transform, Merge, Reshape (Part 7) (2)
1. Merging on Index
In some cases, the merge key(s) in a DataFrame are found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key:
In [8]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
   ...:                       'value': range(6)})

In [9]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

In [10]: left1
Out[10]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

[6 rows x 2 columns]

In [11]: right1
Out[11]:
   group_val
a        3.5
b        7.0

[2 rows x 1 columns]

In [12]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[12]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0

[5 rows x 3 columns]

Since the default merge method is to intersect the join keys, you can instead form the union of them with an outer join:
In [13]: pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Out[13]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN

[6 rows x 3 columns]

With hierarchically indexed data, things are a bit more complicated:
In [14]: lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
   ....:                       'key2': [2000, 2001, 2002, 2001, 2002],
   ....:                       'data': np.arange(5.)})

In [15]: righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
   ....:                       index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
   ....:                              [2001, 2000, 2000, 2000, 2001, 2002]],
   ....:                       columns=['event1', 'event2'])

In [16]: lefth
Out[16]:
   data    key1  key2
0     0    Ohio  2000
1     1    Ohio  2001
2     2    Ohio  2002
3     3  Nevada  2001
4     4  Nevada  2002

[5 rows x 3 columns]

In [17]: righth
Out[17]:
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11

[6 rows x 2 columns]

In this case, you have to indicate multiple columns to merge on as a list (note the handling of duplicate index values):
In [18]: pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
Out[18]:
   data    key1  key2  event1  event2
0     0    Ohio  2000       4       5
0     0    Ohio  2000       6       7
1     1    Ohio  2001       8       9
2     2    Ohio  2002      10      11
3     3  Nevada  2001       0       1

[5 rows x 5 columns]

In [19]: pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')
Out[19]:
   data    key1  key2  event1  event2
0     0    Ohio  2000       4       5
0     0    Ohio  2000       6       7
1     1    Ohio  2001       8       9
2     2    Ohio  2002      10      11
3     3  Nevada  2001       0       1
4     4  Nevada  2002     NaN     NaN
4   NaN  Nevada  2000       2       3

[7 rows x 5 columns]
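As a rough alternative (my own sketch, not from the book), the same multi-key merge can be expressed entirely with column keys by naming righth's index levels and resetting them into columns; righth2 is a hypothetical helper name, and the level names 'key1'/'key2' are my own choice to make the columns line up:

# Name the MultiIndex levels to match lefth's key columns, move them into
# regular columns, then merge on those columns; roughly the same result
# as the left_on/right_index form above (row labels aside).
righth2 = righth.copy()
righth2.index.names = ['key1', 'key2']
pd.merge(lefth, righth2.reset_index(), on=['key1', 'key2'], how='outer')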
Using the indexes of both sides of the merge is also possible:
In [21]: left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
   ....:                      columns=['Ohio', 'Nevada'])

In [22]: right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
   ....:                       index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])

In [23]: left2
Out[23]:
   Ohio  Nevada
a     1       2
c     3       4
e     5       6

[3 rows x 2 columns]

In [24]: right2
Out[24]:
   Missouri  Alabama
b         7        8
c         9       10
d        11       12
e        13       14

[4 rows x 2 columns]

In [25]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Out[25]:
   Ohio  Nevada  Missouri  Alabama
a     1       2       NaN      NaN
b   NaN     NaN         7        8
c     3       4         9       10
d   NaN     NaN        11       12
e     5       6        13       14

[5 rows x 4 columns]

DataFrame also has a more convenient join instance method for merging by index. It can also be used to combine several DataFrame objects having the same or similar indexes but non-overlapping columns. In the example above, we could have written:
In [26]: left2.join(right2, how='outer')
Out[26]:
   Ohio  Nevada  Missouri  Alabama
a     1       2       NaN      NaN
b   NaN     NaN         7        8
c     3       4         9       10
d   NaN     NaN        11       12
e     5       6        13       14

[5 rows x 4 columns]

Partly for legacy reasons (much earlier versions of pandas), DataFrame's join method performs a left join on the join keys. It also supports joining the index of the passed DataFrame on one of the columns of the calling DataFrame:
In [28]: left1.join(right1, on='key')
Out[28]:
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN

[6 rows x 3 columns]

Lastly, for simple index-on-index merges, you can pass a list of DataFrames to join (the concat function can also do this):
In [29]: another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
   ....:                        index=['a', 'c', 'e', 'f'], columns=['New York', 'Oregon'])

In [30]: left2.join([right2, another])
Out[30]:
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a     1       2       NaN      NaN         7       8
c     3       4         9       10         9      10
e     5       6        13       14        11      12

[3 rows x 6 columns]

In [31]: left2.join([right2, another], how='outer')
Out[31]:
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a     1       2       NaN      NaN         7       8
b   NaN     NaN         7        8       NaN     NaN
c     3       4         9       10         9      10
d   NaN     NaN        11       12       NaN     NaN
e     5       6        13       14        11      12
f   NaN     NaN       NaN      NaN        16      17

[6 rows x 6 columns]
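As noted above, concat can express this kind of multi-way index join as well; a minimal sketch of the rough equivalent of In [31], reusing left2, right2, and another from this session:

# Align all three frames on the union of their row indexes; this should
# roughly match left2.join([right2, another], how='outer') above.
pd.concat([left2, right2, another], axis=1, join='outer')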
2. Concatenating Along an Axis
Another kind of data combination operation is variously referred to as concatenation, binding, or stacking. NumPy has a concatenate function for doing this with raw NumPy arrays:
In [32]: arr = np.arange(12).reshape((3, 4))

In [33]: arr
Out[33]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [34]: np.concatenate([arr, arr], axis=1)
Out[34]:
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In the context of pandas objects such as Series and DataFrame, having labeled axes lets you further generalize array concatenation. In particular, there are a few additional things to think about:
- If the objects are indexed differently on the other axes, should those axes be unioned or intersected? (See the sketch after this list.)
- Do the concatenated chunks of data need to be identifiable in the resulting object?
- Does the axis used for concatenation matter at all?
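To make the first question concrete before getting to concat proper, here is a small sketch of my own (not from the book): two hypothetical frames, dfa and dfb, whose columns only partially overlap, concatenated row-wise with either the union or the intersection of those columns.

# Stacking rows of frames with different columns forces a choice
# between the union and the intersection of the column labels.
dfa = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
dfb = pd.DataFrame({'y': [5, 6], 'z': [7, 8]})

pd.concat([dfa, dfb])                # outer: columns x, y, z, padded with NaN
pd.concat([dfa, dfb], join='inner')  # inner: only the shared column y survives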
The concat function in pandas provides a consistent way to address each of these questions. I'll give a number of examples to illustrate how it works. Suppose we have three Series with no index overlap:
In [35]: s1 = pd.Series([0, 1], index=['a', 'b'])

In [36]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

In [37]: s3 = pd.Series([5, 6], index=['f', 'g'])

Calling concat with these objects glues together the values and indexes:
In [38]: pd.concat([s1, s2, s3])
Out[38]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default, concat works along axis=0, producing another Series. If you pass axis=1, the result will instead be a DataFrame (axis=1 is the columns):
In [39]: pd.concat([s1, s2, s3], axis=1)
Out[39]:
     0    1    2
a    0  NaN  NaN
b    1  NaN  NaN
c  NaN    2  NaN
d  NaN    3  NaN
e  NaN    4  NaN
f  NaN  NaN    5
g  NaN  NaN    6

[7 rows x 3 columns]

In this case there is no overlap on the other axis, which, as you can see, is the sorted union (outer join) of the indexes. You can instead intersect them by passing join='inner':
In [40]: s4 = pd.concat([s1 * 5, s3])

In [41]: pd.concat([s1, s4], axis=1)
Out[41]:
     0  1
a    0  0
b    1  5
f  NaN  5
g  NaN  6

[4 rows x 2 columns]

In [42]: pd.concat([s1, s4], axis=1, join='inner')
Out[42]:
   0  1
a  0  0
b  1  5

[2 rows x 2 columns]

You can specify the indexes to be used on the other axes with join_axes:
In [43]: pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
Out[43]:
     0    1
a    0    0
c  NaN  NaN
b    1    5
e  NaN  NaN

[4 rows x 2 columns]

(Note that join_axes has since been removed in newer versions of pandas; reindexing the concatenated result achieves the same effect.)

One issue is that the concatenated pieces are not identifiable in the result. Suppose instead you wanted to create a hierarchical index on the concatenation axis. To do this, use the keys argument:
In [44]: result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])

In [45]: result
Out[45]:
one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64
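Since the keys end up as the outermost index level, each original piece stays addressable afterward; a quick sketch of my own, reusing result from above:

# Partial indexing on the outer level recovers the chunk that came from s3.
result.loc['three']    # f    5
                       # g    6
                       # dtype: int64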
In [46]: # Much more on the unstack function later

In [47]: result.unstack()
Out[47]:
         a    b    f    g
one      0    1  NaN  NaN
two      0    1  NaN  NaN
three  NaN  NaN    5    6

[3 rows x 4 columns]

In the case of combining Series along axis=1, the keys become the DataFrame column headers:
In [49]: pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
Out[49]:
   one  two  three
a    0  NaN    NaN
b    1  NaN    NaN
c  NaN    2    NaN
d  NaN    3    NaN
e  NaN    4    NaN
f  NaN  NaN      5
g  NaN  NaN      6

[7 rows x 3 columns]

The same logic extends to DataFrame objects:
In [50]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
   ....:                    columns=['one', 'two'])

In [51]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
   ....:                    columns=['three', 'four'])

In [52]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
Out[52]:
  level1      level2
     one two   three four
a      0   1       5    6
b      2   3     NaN  NaN
c      4   5       7    8

[3 rows x 4 columns]

If you pass a dict of objects instead of a list, the dict's keys will be used for the keys option:
In [53]: pd.concat({'level1': df1, 'level2': df2}, axis=1)
Out[53]:
  level1      level2
     one two   three four
a      0   1       5    6
b      2   3     NaN  NaN
c      4   5       7    8

[3 rows x 4 columns]

There are a couple of additional arguments governing how the hierarchical index is created; for example, you can name the created axis levels with the names argument:
In [54]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], names=['upper', 'lower'])
Out[54]:
upper  level1      level2
lower     one two   three four
a           0   1       5    6
b           2   3     NaN  NaN
c           4   5       7    8

[3 rows x 4 columns]
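One payoff of the keys/names machinery is that the pieces remain easy to pull back out of the combined object; a small sketch of my own, reusing df1 and df2 from above (the variable name combined is hypothetical):

combined = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
                     names=['upper', 'lower'])
# Selecting on the top ('upper') column level recovers the df1 piece.
combined['level1']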
A last consideration concerns DataFrames whose row index is not meaningful in the context of the analysis:
In [59]: df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])

In [60]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

In [61]: df1
Out[61]:
          a         b         c         d
0  0.189556 -1.076347  0.074607  0.920779
1 -1.109474 -0.643803 -0.109871 -0.441758
2 -1.385936 -1.468652 -0.979678  1.021477

[3 rows x 4 columns]

In [62]: df2
Out[62]:
          b         d         a
0 -0.009724  0.514570  1.054630
1 -0.462397 -2.705826 -2.622081

[2 rows x 3 columns]

In this case, just pass ignore_index=True:
In [63]: pd.concat([df1, df2], ignore_index=True)
Out[63]:
          a         b         c         d
0  0.189556 -1.076347  0.074607  0.920779
1 -1.109474 -0.643803 -0.109871 -0.441758
2 -1.385936 -1.468652 -0.979678  1.021477
3  1.054630 -0.009724       NaN  0.514570
4 -2.622081 -0.462397       NaN -2.705826

[5 rows x 4 columns]
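For intuition, ignore_index=True behaves much like discarding the concatenated row labels after the fact; a rough equivalence sketch of my own, reusing df1 and df2 from above:

# Concatenate, then drop the original row labels in favor of a fresh
# RangeIndex; this should match pd.concat([df1, df2], ignore_index=True).
pd.concat([df1, df2]).reset_index(drop=True)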
3. Combining Data with Overlap
There is another data combination situation that can't be expressed as either a merge or a concatenation operation. For example, you may have two datasets whose indexes overlap in full or in part. As a motivating example, consider NumPy's where function, which expresses a vectorized if-else:
In [64]: a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
   ....:               index=['f', 'e', 'd', 'c', 'b', 'a'])

In [65]: b = pd.Series(np.arange(len(a), dtype=np.float64),
   ....:               index=['f', 'e', 'd', 'c', 'b', 'a'])

In [66]: b[-1] = np.nan

In [67]: a
Out[67]:
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [68]: b
Out[68]:
f     0
e     1
d     2
c     3
b     4
a   NaN
dtype: float64

In [69]: np.where(pd.isnull(a), b, a)
Out[69]: array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])

Series has a combine_first method, which performs the equivalent of this operation plus data alignment:
In [70]: b[:-2].combine_first(a[2:])
Out[70]:
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64

On DataFrame, combine_first naturally does the same thing column by column, so you can think of it as "patching" missing data in the calling object with data from the object you pass:
In [71]: df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
   ....:                     'b': [np.nan, 2., np.nan, 6.],
   ....:                     'c': range(2, 18, 4)})

In [72]: df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
   ....:                     'b': [np.nan, 3., 4., 6., 8.]})

In [73]: df1.combine_first(df2)
Out[73]:
   a    b    c
0  1  NaN    2
1  4    2    6
2  5    4   10
3  3    6   14
4  7    8  NaN

[5 rows x 3 columns]
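To make the "patching" intuition concrete, here is a rough approximation of my own of what combine_first does for this example (the real method also handles dtype and alignment subtleties); patched is a hypothetical name, and df1 and df2 are as defined just above:

# Expand df1 to the union of both frames' row and column labels, then
# fill the remaining holes with the aligned values from df2.
patched = df1.reindex(index=df1.index.union(df2.index),
                      columns=df1.columns.union(df2.columns))
patched = patched.fillna(df2)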