pandas 数据处理

pandas 数据格式
基本展示
创建 Series, DataFrame
修改索引
类型转换
切片
布尔索引
多重索引
查看缺失值
缺失值填充
逻辑与或
删除、去重
排序
按指定索引重排
频率统计
返回唯一值
查找、匹配
onehot 编码
拼接
读写文件
Map
apply
groupby
agg
时间操作
读取大文件的技巧
- 只读取前 n 行
- 分批次读取
- 跳过指定行
常见问题
参考

pandas 数据格式

pandas 两种常用数据格式：

Series
DataFrame，它的每一列是 Series

Series 常用的属性：

name –> str
index
shape –> tuple
dtype
values –> ndarray

DataFrame 常用属性：

index
columns
values –> ndarray
dtypes –> Series
shape –> tuple

基本展示

x.head(), x.head(20)
x.tail(), x.tail(20)
x.describe()
sr.value_counts() # 计数统计
df.info()

创建 Series, DataFrame

sr = pd.Series([1,2,3])
sr = pd.Series([1,2,3], index=[0,1,2])
>>> sr
0    1
1    2
2    3
dtype: int64

sr = pd.Series({'a':1, 'b':2, 'c':3})
>>> sr
a    1
b    2
c    3
dtype: int64

df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]}, index=range(3))
>>> df
   a  b
0  1  4
1  2  5
2  3  6
df = pd.DataFrame([[1,2],[3,4]], columns=['a','b'])
>>> df
   a  b
0  1  2
1  3  4

修改索引

# 修改索引
df = pd.DataFrame([[1,2],[3,4]], columns=['a','b'])
df.index = [1,2]
df.columns = ['c','d']
>>> df
   c  d
1  1  2
2  3  4

df2 = df.rename({'a': 'X', 'b': 'Y'}, axis=1)  # 修改部分索引

类型转换

df = pd.DataFrame([[1,2],[3,4]], columns=['a','b'])
>>> df.dtypes
a    int64
b    int64
df2 = df.astype('str')
>>> df2.dtypes
a    object
b    object
dtype: object

切片

df[column_key] 选择某一列（其中 column_key 是字符串）<==> df.column_key
df[[key1, …]] 选择多列
df.loc[idx1] 选择 index == idx1 的行
df.loc[[idx1, idx2], [key1, key2]] 选择 index 属于 {idx1, idx2}, columns 属于 {key1, key2} 的表（DataFrame）
df.iloc[pos1, pos2]

注意：df[‘a’] 和 df[[‘a’]] 不同，前者返回 Series, 后者返回 DataFrame

a=np.random.rand(10,4)
df=pd.DataFrame(a)
df.columns = list('abcd')
>>> df
          a         b         c         d
0  0.476674  0.491428  0.465477  0.976000
1  0.397214  0.483136  0.759356  0.119943
2  0.592388  0.867402  0.268019  0.357892
3  0.268457  0.316199  0.239093  0.908658
4  0.119298  0.894530  0.825473  0.799663
5  0.906994  0.540645  0.044490  0.896530
6  0.130975  0.148297  0.994240  0.169837
7  0.943039  0.543730  0.047538  0.327427
8  0.113551  0.683084  0.649643  0.810463
9  0.117071  0.101195  0.028766  0.957437

>>> df['a']
0    0.476674
1    0.397214
2    0.592388
3    0.268457
4    0.119298
5    0.906994
6    0.130975
7    0.943039
8    0.113551
9    0.117071
Name: a, dtype: float64

sr = df['a']
>>> sr[0]
0.4766742823955723

>>> df[['a','b']]
          a         b
0  0.476674  0.491428
1  0.397214  0.483136
2  0.592388  0.867402
3  0.268457  0.316199
4  0.119298  0.894530
5  0.906994  0.540645
6  0.130975  0.148297
7  0.943039  0.543730
8  0.113551  0.683084
9  0.117071  0.101195

>>> df.loc[1]
a    0.397214
b    0.483136
c    0.759356
d    0.119943
Name: 1, dtype: float64

>>> df.loc[[1,2]]
          a         b         c         d
1  0.397214  0.483136  0.759356  0.119943
2  0.592388  0.867402  0.268019  0.357892

>>> df.loc[[1,2],['a','c']]
          a         c
1  0.397214  0.759356
2  0.592388  0.268019

布尔索引

df[bool_series] 选择所有为 True 的行
df.loc[[bool_series1], [bool_series2]]
df.loc[:, [bool_series]]
df.loc[[bool_series], [key1, key2]]

a=np.random.rand(10,4)
df=pd.DataFrame(a)
df.columns = list('abcd')

>>> df['a']>.5
  False
  False
   True
  False
  False
   True
  False
   True
  False
  False
Name: a, dtype: bool

>>> df[df['a']>.5] <==> df.loc[df['a']>.5]
          a         b         c         d
0.592388  0.867402  0.268019  0.357892
0.906994  0.540645  0.044490  0.896530
0.943039  0.543730  0.047538  0.327427

>>> df[df>.5]
          a         b         c         d
     NaN       NaN       NaN  0.976000
     NaN       NaN  0.759356       NaN
0.592388  0.867402       NaN       NaN
     NaN       NaN       NaN  0.908658
     NaN  0.894530  0.825473  0.799663
0.906994  0.540645       NaN  0.896530
     NaN       NaN  0.994240       NaN
0.943039  0.543730       NaN       NaN
     NaN  0.683084  0.649643  0.810463
     NaN       NaN       NaN  0.957437

>>> df[(df['a']>.5) & (df['d']>.5)] <==> df.loc[(df['a']>.5) & (df['d']>.5)]
          a         b        c        d
0.906994  0.540645  0.04449  0.89653

>>> df.loc[:,df.loc[0]>0.5]
          d
0.976000
0.119943
0.357892
0.908658
0.799663
0.896530
0.169837
0.327427
0.810463
0.957437

多重索引

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

http://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

查看缺失值

>>> df2
          a         b         c
0  0.476674  0.491428  0.465477
1  0.397214       NaN  0.759356
2       NaN  0.867402  0.268019

>>> df2.isnull()
       a      b      c
0  False  False  False
1  False   True  False
2   True  False  False

>>> df2.isnull().any()  # 沿 axis=0 轴合并
a     True
b     True
c    False
dtype: bool

>>> df2.isnull().any(axis=1)  # 沿 axis=1 轴合并
0    False
1     True
2     True
dtype: bool

df2.index[df2.isnull().any(axis=1)]  # 找出缺失值所在的行标
Int64Index([1, 2], dtype='int64')

缺失值填充

sr.fillna(value, inplace=True)

逻辑与或

(sr1>sr2).all()
(sr1>sr2).any()

删除、去重

df.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors=’raise’)
drop_duplicates(subset=None, keep=’first’, inplace=False)

排序

按值排序：df.sort_values(by=[…], axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’)
按索引排序：df.sort_index(axis=0, level=None, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’, sort_remaining=True, by=None)

按指定索引重排

df.reindex(new_index)

频率统计

ser.value_counts()

返回唯一值

ser.unique()

查找、匹配

df.isin([..]) 返回 bool series
ser.str.contains(pattern) 返回 bool series，字符串查找，支持正则表达式

onehot 编码

sr.get_dummies() # 每一行单个取值
sr.str.get_dummies(‘,’) # 每一行多个取值，逗号分隔

拼接

pd.concat([df1, df2], axis=0, ignore_index=True)
- 可以将多个 df 或 sr, 行拼接或列拼接
- axis=1, 多列拼接，连接键默认为 index
- 指定 ignore_index=True, 合并后，索引按 0,1,2，… 重排
- keys = [‘xx’, …] 用于在拼后建立多重索引
pd.merge(left, right, on=.., how=’inner’, …)
- 用于列拼接
- on 指定拼接键
- how 取 ‘inner’, ‘outer’, ‘left’, ‘right’, inner 取连接键的交集，outer 取并集

df1 = pd.DataFrame({'a': [1,1,2,3,3,3,2], 'b':[1,2,3,4,5,6,7]})
df2 = pd.DataFrame({'a': [1,2,3,4], 'c':[5,6,7,8]})
print(df1)
print('-------')
print(df2)
print('-------------')
pd.merge(df1, df2, on='a')

  a  b
0  1  1
1  1  2
2  2  3
3  3  4
4  3  5
5  3  6
6  2  7
-------
   a  c
0  1  5
1  2  6
2  3  7
3  4  8
-------------
    a	b	c
0	1	1	5
1	1	2	5
2	2	3	6
3	2	7	6
4	3	4	7
5	3	5	7
6	3	6	7


df1 = pd.DataFrame({'a':np.random.rand(3),'c':np.random.rand(3),'b': np.random.rand(3)})
df2 = pd.DataFrame({'d':np.random.rand(3),'f':np.random.rand(3),'e': np.random.rand(3)})
df3 = pd.DataFrame({'i':np.random.rand(3),'g':np.random.rand(3),'h': np.random.rand(3)})
df = pd.concat([df1, df2, df3], axis=1, keys=['1', '2', '3'])
	1	                                2	                                3
    a	        c	        b	        d	        f       	e	        i	        g       	h
0	0.262559	0.448603	0.961162	0.760485	0.304973	0.544774	0.082724	0.080505	0.132155
1	0.721134	0.561027	0.937514	0.577973	0.042380	0.168546	0.645962	0.769562	0.640384
2	0.533538	0.097343	0.459582	0.276983	0.270667	0.818550	0.434499	0.697353	0.079110

df.loc[:, ('1', slice(None))]  # 多重索引的切片
    1
    a	        c       	b
0	0.044250	0.016172	0.636867
1	0.914084	0.610851	0.638163
2	0.196434	0.779077	0.529745

读写文件

csv

读取 csv

  df = pd.read_csv('xx.csv')  # 读取带有 header, 没有 index，以逗号间隔的表，（header 默认是第一行，自动添加 index）
  df = pd.read_csv('xx.csv', index_col=0)  # 读取带有 index 的表，并指定 index 的列
  df = pd.read_csv('xx.csv', sep='\t', header=None， names=[..])  # 没有 header 的话，必须设置 header=None, 并使用 names 指定 columns

保存 csv

   csv：df.to_csv('xx.csv', index=False, header=True, mode='w', sep=',')

hdf

df = pd.read_hdf(‘xx.h5’, key=’xxx’)
df.to_hdf(‘xx.h5’, key=’xxx’, mode=’w’)

对于超大文件，hdf 读写可能失败，可以用 pickle

pickle

df.read_pickle(..)
df.to_pickle(..)

Map

进行单列的函数映射或字典映射，输入输出形状不变

sr.map(xx), xx 可以是：

函数
字典
Series

a=np.random.rand(4,3)
df = pd.DataFrame(a, columns=list('abc'))
>>> df
          a         b         c
0  0.698255  0.037483  0.099545
1  0.531019  0.110648  0.601072
2  0.883828  0.433867  0.083165
3  0.421908  0.397592  0.148393

>>> df['a'].map(str)
0     0.6982551727271022
1     0.5310185100676619
2     0.8838283555917328
3    0.42190809006346663
Name: a, dtype: object

apply

对表的各行或各列应用一个函数映射，有压缩维度的效果

a=np.random.rand(4,3)
df = pd.DataFrame(a, columns=list('abc'))
>>> df
          a         b         c
0  0.698255  0.037483  0.099545
1  0.531019  0.110648  0.601072
2  0.883828  0.433867  0.083165
3  0.421908  0.397592  0.148393

>>> df.apply('mean')
a    0.372162
b    0.411302
c    0.303223
dtype: float64

>>> df.apply(lambda t: t.a+t.b-t.c, axis=1)
0    0.382569
1    0.734681
2    0.261291
3    0.542423
dtype: float64

groupby

https://blog.csdn.net/zhangxiaojiakele/article/details/78198568
https://blog.csdn.net/youngbit007/article/details/54288603

df=pd.DataFrame({'a': [1,2,2,2,3,1,3], 'b': [1,2,3,4,5,6,7]})
groupby_a = df.groupby(['a'])
groupby_a
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f50cc0889b0>

groupby_a.size()  # 分组统计数目
a
1    2
2    3
3    2
dtype: int64

groupby_a['b'].mean()  # 组内求均值
a
1    3.5
2    3.0
3    6.0
Name: b, dtype: float64

# 以上操作等价于
df.b.groupby(df.a).mean()

# 遍历分组
for g in groupby_a:
    print('a = ', g[0])
    print(g[1])
    print('-------')

a =  1
   a  b
0  1  1
5  1  6
-------
a =  2
   a  b
1  2  2
2  2  3
3  2  4
-------
a =  3
   a  b
4  3  5
6  3  7
-------

agg

df.agg([fun1, fun2]) 将表的每列分别按照 fun1, fun2 进行聚合
df.agg({‘a’: ‘sum’, ‘b’: ‘mean’}) 每列按照不同函数聚合
df.groupby([‘a’]).agg(..) 先分组，组内在聚合
df.groupby(‘a’)[‘b’].agg({‘b_sum’: ‘sum’, ‘b_mean’: ‘mean’}) 分组，对其中一列实施多个聚合操作，并通过字典的 key 指定新列名

时间操作

字符串转时间

  # 单个字符串转换时间类型
  t1 = pd.to_datetime('2019-06-01')  # Timestamp('2019-06-01 00:00:00')
  # 多个字符串进行转换
  t2 = pd.to_datetime(['2019-11-01', '2019-12-03'])  # DatetimeIndex(['2019-11-01', '2019-12-03'], dtype='datetime64[ns]', freq=None)

求星期几

  t1.dayofweek  # Int64Index([4, 1], dtype='int64')
  t2.dayofweek  # 5

时间戳转换转时间
- sr.map(lambda t: time.strftime(‘%Y%m%d %H:%M:%S’, time.localtime(t))) # 时间戳 -> 时间字符串
- pd.to_datetime(time.time(), unit=’s’) # Timestamp(‘2019-05-29 07:35:24.789425373’) # 时间戳 -> pandas 时间类型

https://www.jianshu.com/p/96ea42c58abe

读取大文件的技巧

只读取前 n 行

df = pd.read_csv(‘xxx.csv’, nrows=10000)

分批次读取

data = pd.read_csv('xxx.csv', chunksize=10000)
for d in data:
    pass

nrows 参数也可以和 chunksize 参数结合使用

跳过指定行

skiprows

常见问题

SettingWithCopyWarning

  SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

solution

参考

[1]https://blog.csdn.net/zhili8866/article/details/68134481
[2]https://blog.csdn.net/qq_42156420/article/details/82813482
[3]https://www.pypandas.cn/

pandas 数据处理

pandas 数据格式

基本展示

创建 Series, DataFrame

修改索引

类型转换

切片

布尔索引

多重索引

查看缺失值

缺失值填充

逻辑与或

删除、去重

排序

按指定索引重排

频率统计

返回唯一值

查找、匹配

onehot 编码

拼接

读写文件

Map

apply

groupby

agg

时间操作

读取大文件的技巧

只读取前 n 行

分批次读取

跳过指定行

常见问题

参考

Similar Posts

Comments