In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
In [9]: df Out[9]: A B C D 2013-01-010.469112-0.282863-1.509059-1.135632 2013-01-021.212112-0.1732150.119209-1.044236 2013-01-03-0.861849-2.104569-0.4949291.071804 2013-01-040.721555-0.706771-1.0395750.271860 2013-01-05-0.4249720.5670200.276232-1.087401 2013-01-06-0.6736900.113648-1.4784270.524988
In [14]: df.head() Out[14]: A B C D 2013-01-010.469112-0.282863-1.509059-1.135632 2013-01-021.212112-0.1732150.119209-1.044236 2013-01-03-0.861849-2.104569-0.4949291.071804 2013-01-040.721555-0.706771-1.0395750.271860 2013-01-05-0.4249720.5670200.276232-1.087401
In [15]: df.tail(3) Out[15]: A B C D 2013-01-040.721555-0.706771-1.0395750.271860 2013-01-05-0.4249720.5670200.276232-1.087401 2013-01-06-0.6736900.113648-1.4784270.524988
In [19]: df.describe() Out[19]: A B C D count 6.0000006.0000006.0000006.000000 mean 0.073711-0.431125-0.687758-0.233103 std 0.8431570.9228180.7798870.973118 min-0.861849-2.104569-1.509059-1.135632 25% -0.611510-0.600794-1.368714-1.076610 50% 0.022070-0.228039-0.767252-0.386188 75% 0.6584440.041933-0.0343260.461706 max1.2121120.5670200.2762321.071804
4、 对数据的转置:
1 2 3 4 5 6 7
In [20]: df.T Out[20]: 2013-01-012013-01-022013-01-032013-01-042013-01-052013-01-06 A 0.4691121.212112-0.8618490.721555-0.424972-0.673690 B -0.282863-0.173215-2.104569-0.7067710.5670200.113648 C -1.5090590.119209-0.494929-1.0395750.276232-1.478427 D -1.135632-1.0442361.0718040.271860-1.0874010.524988
5、 按轴进行排序
1 2 3 4 5 6 7 8 9
In [21]: df.sort_index(axis=1, ascending=False) Out[21]: D C B A 2013-01-01-1.135632-1.509059-0.2828630.469112 2013-01-02-1.0442360.119209-0.1732151.212112 2013-01-031.071804-0.494929-2.104569-0.861849 2013-01-040.271860-1.039575-0.7067710.721555 2013-01-05-1.0874010.2762320.567020-0.424972 2013-01-060.524988-1.4784270.113648-0.673690
6、 按值进行排序
1 2 3 4 5 6 7 8 9
In [22]: df.sort_values(by='B') Out[22]: A B C D 2013-01-03-0.861849-2.104569-0.4949291.071804 2013-01-040.721555-0.706771-1.0395750.271860 2013-01-010.469112-0.282863-1.509059-1.135632 2013-01-021.212112-0.1732150.119209-1.044236 2013-01-06-0.6736900.113648-1.4784270.524988 2013-01-05-0.4249720.5670200.276232-1.087401
In [23]: df['A'] Out[23]: 2013-01-010.469112 2013-01-021.212112 2013-01-03-0.861849 2013-01-040.721555 2013-01-05-0.424972 2013-01-06-0.673690 Freq: D, Name: A, dtype: float64
2、 通过[]进行选择,这将会对行进行切片
1 2 3 4 5 6 7 8 9 10 11 12 13
In [24]: df[0:3] Out[24]: A B C D 2013-01-010.469112-0.282863-1.509059-1.135632 2013-01-021.212112-0.1732150.119209-1.044236 2013-01-03-0.861849-2.104569-0.4949291.071804
In [25]: df['20130102':'20130104'] Out[25]: A B C D 2013-01-021.212112-0.1732150.119209-1.044236 2013-01-03-0.861849-2.104569-0.4949291.071804 2013-01-040.721555-0.706771-1.0395750.271860
通过标签选择
1、 使用标签来获取一个交叉的区域
1 2 3 4 5 6 7
In [26]: df.loc[dates[0]] Out[26]: A 0.469112 B -0.282863 C -1.509059 D -1.135632 Name: 2013-01-0100:00:00, dtype: float64
2、 通过标签来在多个轴上进行选择
1 2 3 4 5 6 7 8 9
In [27]: df.loc[:,['A','B']] Out[27]: A B 2013-01-010.469112-0.282863 2013-01-021.212112-0.173215 2013-01-03-0.861849-2.104569 2013-01-040.721555-0.706771 2013-01-05-0.4249720.567020 2013-01-06-0.6736900.113648
3、 标签切片
1 2 3 4 5 6
In [28]: df.loc['20130102':'20130104',['A','B']] Out[28]: A B 2013-01-021.212112-0.173215 2013-01-03-0.861849-2.104569 2013-01-040.721555-0.706771
4、 对于返回的对象进行维度缩减
1 2 3 4 5
In [29]: df.loc['20130102',['A','B']] Out[29]: A 1.212112 B -0.173215 Name: 2013-01-0200:00:00, dtype: float64
5、 获取一个标量
1 2
In [30]: df.loc[dates[0],'A'] Out[30]: 0.46911229990718628
6、 快速访问一个标量(与上一个方法等价)
1 2
In [31]: df.at[dates[0],'A'] Out[31]: 0.46911229990718628
通过位置选择
1、 通过传递数值进行位置选择(选择的是行)
1 2 3 4 5 6 7
In [32]: df.iloc[3] Out[32]: A 0.721555 B -0.706771 C -1.039575 D 0.271860 Name: 2013-01-0400:00:00, dtype: float64
2、 通过数值进行切片,与 numpy/python 中的情况类似
1 2 3 4 5
In [33]: df.iloc[3:5,0:2] Out[33]: A B 2013-01-040.721555-0.706771 2013-01-05-0.4249720.567020
3、 通过指定一个位置的列表,与 numpy/python 中的情况类似
1 2 3 4 5 6
In [34]: df.iloc[[1,2,4],[0,2]] Out[34]: A C 2013-01-021.2121120.119209 2013-01-03-0.861849-0.494929 2013-01-05-0.4249720.276232
4、 对行进行切片
1 2 3 4 5
In [35]: df.iloc[1:3,:] Out[35]: A B C D 2013-01-021.212112-0.1732150.119209-1.044236 2013-01-03-0.861849-2.104569-0.4949291.071804
5、 对列进行切片
1 2 3 4 5 6 7 8 9
In [36]: df.iloc[:,1:3] Out[36]: B C 2013-01-01-0.282863-1.509059 2013-01-02-0.1732150.119209 2013-01-03-2.104569-0.494929 2013-01-04-0.706771-1.039575 2013-01-050.5670200.276232 2013-01-060.113648-1.478427
6、 获取特定的值
1 2
In [37]: df.iloc[1,1] Out[37]: -0.17321464905330858
快速访问标量(等同于前一个方法):
1 2
In [38]: df.iat[1,1] Out[38]: -0.17321464905330858
布尔索引
1、 使用一个单独列的值来选择数据:
1 2 3 4 5 6
In [39]: df[df.A > 0] Out[39]: A B C D 2013-01-010.469112-0.282863-1.509059-1.135632 2013-01-021.212112-0.1732150.119209-1.044236 2013-01-040.721555-0.706771-1.0395750.271860
2、 使用where操作来选择数据:
1 2 3 4 5 6 7 8 9
In [40]: df[df > 0] Out[40]: A B C D 2013-01-010.469112 NaN NaN NaN 2013-01-021.212112 NaN 0.119209 NaN 2013-01-03 NaN NaN NaN 1.071804 2013-01-040.721555 NaN NaN 0.271860 2013-01-05 NaN 0.5670200.276232 NaN 2013-01-06 NaN 0.113648 NaN 0.524988
3、 使用isin()方法来过滤:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one','two','three','four','three']
In [43]: df2 Out[43]: A B C D E 2013-01-010.469112-0.282863-1.509059-1.135632 one 2013-01-021.212112-0.1732150.119209-1.044236 one 2013-01-03-0.861849-2.104569-0.4949291.071804 two 2013-01-040.721555-0.706771-1.0395750.271860 three 2013-01-05-0.4249720.5670200.276232-1.087401 four 2013-01-06-0.6736900.113648-1.4784270.524988 three
In [44]: df2[df2['E'].isin(['two','four'])] Out[44]: A B C D E 2013-01-03-0.861849-2.104569-0.4949291.071804 two 2013-01-05-0.4249720.5670200.276232-1.087401 four
设置
1、 设置一个新的列:
1 2 3 4 5 6 7 8 9 10 11 12 13
In [45]: s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
In [51]: df Out[51]: A B C D F 2013-01-010.0000000.000000-1.5090595 NaN 2013-01-021.212112-0.1732150.11920951.0 2013-01-03-0.861849-2.104569-0.49492952.0 2013-01-040.721555-0.706771-1.03957553.0 2013-01-05-0.4249720.5670200.27623254.0 2013-01-06-0.6736900.113648-1.47842755.0
5、 通过where操作来设置新的值:
1 2 3 4 5 6 7 8 9 10 11 12 13
In [52]: df2 = df.copy()
In [53]: df2[df2 > 0] = -df2
In [54]: df2 Out[54]: A B C D F 2013-01-010.0000000.000000-1.509059-5 NaN 2013-01-02-1.212112-0.173215-0.119209-5-1.0 2013-01-03-0.861849-2.104569-0.494929-5-2.0 2013-01-04-0.721555-0.706771-1.039575-5-3.0 2013-01-05-0.424972-0.567020-0.276232-5-4.0 2013-01-06-0.673690-0.113648-1.478427-5-5.0
In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [56]: df1.loc[dates[0]:dates[1],'E'] = 1
In [57]: df1 Out[57]: A B C D F E 2013-01-010.0000000.000000-1.5090595 NaN 1.0 2013-01-021.212112-0.1732150.11920951.01.0 2013-01-03-0.861849-2.104569-0.49492952.0 NaN 2013-01-040.721555-0.706771-1.03957553.0 NaN
2、 去掉包含缺失值的行:
1 2 3 4
In [58]: df1.dropna(how='any') Out[58]: A B C D F E 2013-01-021.212112-0.1732150.11920951.01.0
3、 对缺失值进行填充:
1 2 3 4 5 6 7
In [59]: df1.fillna(value=5) Out[59]: A B C D F E 2013-01-010.0000000.000000-1.50905955.01.0 2013-01-021.212112-0.1732150.11920951.01.0 2013-01-03-0.861849-2.104569-0.49492952.05.0 2013-01-040.721555-0.706771-1.03957553.05.0
4、 对数据进行布尔填充:
1 2 3 4 5 6 7
n [60]: pd.isnull(df1) Out[60]: A B C D F E 2013-01-01FalseFalseFalseFalseTrueFalse 2013-01-02FalseFalseFalseFalseFalseFalse 2013-01-03FalseFalseFalseFalseFalseTrue 2013-01-04FalseFalseFalseFalseFalseTrue
In [63]: s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
In [64]: s Out[64]: 2013-01-01 NaN 2013-01-02 NaN 2013-01-031.0 2013-01-043.0 2013-01-055.0 2013-01-06 NaN Freq: D, dtype: float64
In [65]: df.sub(s, axis='index') Out[65]: A B C D F 2013-01-01 NaN NaN NaN NaN NaN 2013-01-02 NaN NaN NaN NaN NaN 2013-01-03-1.861849-3.104569-1.4949294.01.0 2013-01-04-2.278445-3.706771-4.0395752.00.0 2013-01-05-5.424972-4.432980-4.7237680.0-1.0 2013-01-06 NaN NaN NaN NaN NaN
Apply
1、 对数据应用函数:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
In [66]: df.apply(np.cumsum) Out[66]: A B C D F 2013-01-010.0000000.000000-1.5090595 NaN 2013-01-021.212112-0.173215-1.389850101.0 2013-01-030.350263-2.277784-1.884779153.0 2013-01-041.071818-2.984555-2.924354206.0 2013-01-050.646846-2.417535-2.6481222510.0 2013-01-06-0.026844-2.303886-4.1265493015.0
In [67]: df.apply(lambda x: x.max() - x.min()) Out[67]: A 2.073961 B 2.671590 C 1.785291 D 0.000000 F 4.000000 dtype: float64
In [87]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [88]: df Out[88]: A B C D 01.3460611.5117631.627081-0.990582 1-0.4416521.2115260.2685200.024580 2-1.5775850.396823-0.105381-0.532532 31.4537491.208843-0.080952-0.264610 4-0.727965-0.5893460.339969-0.693205 5-0.3393550.5936160.8843451.591431 60.1418090.2203900.4355890.192451 7-0.0967010.8033511.715071-0.708758
In [89]: s = df.iloc[3]
In [90]: df.append(s, ignore_index=True) Out[90]: A B C D 01.3460611.5117631.627081-0.990582 1-0.4416521.2115260.2685200.024580 2-1.5775850.396823-0.105381-0.532532 31.4537491.208843-0.080952-0.264610 4-0.727965-0.5893460.339969-0.693205 5-0.3393550.5936160.8843451.591431 60.1418090.2203900.4355890.192451 7-0.0967010.8033511.715071-0.708758 81.4537491.208843-0.080952-0.264610
In [92]: df Out[92]: A B C D 0 foo one -1.202872-0.055224 1 bar one -1.8144702.395985 2 foo two 1.0186011.552825 3 bar three -0.5954470.166599 4 foo two 1.3954330.047609 5 bar two -0.392670-0.136473 6 foo one 0.007207-0.561757 7 foo three 1.928123-1.623033
1、 分组并对每个分组执行sum函数:
1 2 3 4 5 6
In [93]: df.groupby('A').sum() Out[93]: C D A bar -2.8025882.42611 foo 3.146492-0.63958
2、 通过多个列进行分组形成一个层次索引,然后执行函数:
1 2 3 4 5 6 7 8 9 10
In [94]: df.groupby(['A','B']).sum() Out[94]: C D A B bar one -1.8144702.395985 three -0.5954470.166599 two -0.392670-0.136473 foo one -1.195665-0.616981 three 1.928123-1.623033 two 2.4140341.600434
In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [98]: df2 = df[:4]
In [99]: df2 Out[99]: A B first second bar one 0.029399-0.542108 two 0.282696-0.087302 baz one -1.5751701.771208 two 0.8164821.100230
1 2 3 4 5 6 7 8 9 10 11 12 13 14
In [100]: stacked = df2.stack()
In [101]: stacked Out[101]: first second bar one A 0.029399 B -0.542108 two A 0.282696 B -0.087302 baz one A -1.575170 B 1.771208 two A 0.816482 B 1.100230 dtype: float64
In [106]: df Out[106]: A B C D E 0 one A foo 1.418757-0.179666 1 one B foo -1.8790241.291836 2 two C foo 0.536826-0.009614 3 three A bar 1.0061600.392149 4 one B bar -0.0297160.264599 5 one C bar -1.146178-0.057409 6 two A foo 0.100900-1.425638 7 three B foo -1.0350181.024098 8 one C foo 0.314665-0.106062 9 one A bar -0.7737231.824375 10 two B bar -1.1706530.595974 11 three C bar 0.6487401.167115
可以从这个数据中轻松的生成数据透视表:
1 2 3 4 5 6 7 8 9 10 11 12 13
In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']) Out[107]: C bar foo A B one A -0.7737231.418757 B -0.029716-1.879024 C -1.1461780.314665 three A 1.006160 NaN B NaN -1.035018 C 0.648740 NaN two A NaN 0.100900 B -1.170653 NaN C NaN 0.536826
In [132]: df["grade"] Out[132]: 0 very good 1 good 2 good 3 very good 4 very good 5 very bad Name: grade, dtype: category Categories (5, object): [very bad, bad, medium, good, very good]
4、 排序是按照 Categorical 的顺序进行的而不是按照字典顺序进行:
1 2 3 4 5 6 7 8 9
In [133]: df.sort_values(by="grade") Out[133]: id raw_grade grade 56 e very bad 12 b good 23 b good 01 a very good 34 a very good 45 a very good
5、 对 Categorical 列进行排序时存在空的类别:
1 2 3 4 5 6 7 8 9
In [134]: df.groupby("grade").size() Out[134]: grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64
In [146]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']) Out[146]: A B C D 2000-01-010.266457-0.399641-0.2195821.186860 2000-01-02-1.170732-0.3458731.653061-0.282953 2000-01-03-1.7349330.5304682.060811-0.515536 2000-01-04-1.5551211.4526200.239859-1.156896 2000-01-050.5781170.5113710.103552-2.428202 2000-01-060.4783440.449933-0.741620-1.962409 2000-01-071.235339-0.091757-1.543861-1.084753 ... ... ... ... ... 2002-09-20-10.628548-9.153563-7.88314628.313940 2002-09-21-10.390377-8.727491-6.39964530.914107 2002-09-22-8.985362-8.485624-4.66946231.367740 2002-09-23-9.558560-8.781216-4.49981530.518439 2002-09-24-9.902058-9.340490-4.38663930.105593 2002-09-25-10.216020-9.480682-3.93380229.758560 2002-09-26-11.856774-10.671012-3.21602529.369368
[1000 rows x 4 columns]
十三、陷阱
如果你尝试某个操作并且看到如下异常:
1 2 3 4 5
>>> if pd.Series([False, True, False]): print("I was true") Traceback ... ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().