Creation methods

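A DataFrame can be created in two ways: by direct definition or by importing data.
1. Direct definition: pd.DataFrame(). Its data argument can be an ndarray, a list, a dict, a tuple, a Series, and so on.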

import numpy as np
import pandas as pd
df1=pd.DataFrame(np.arange(10).reshape(2,5))
print(df1)
# Out:
# 0 1 2 3 4
# 0 0 1 2 3 4
# 1 5 6 7 8 9
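
The other input types listed above can be tried the same way. The following is a minimal sketch; the variable names and sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the column labels.
df_from_dict = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# From a list of lists: columns get the default labels 0, 1, 2.
df_from_list = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# From a Series: the Series name becomes the single column label.
df_from_series = pd.DataFrame(pd.Series([10, 20, 30], name="value"))

# From an ndarray, this time with explicit row and column labels.
df_from_array = pd.DataFrame(
    np.arange(6).reshape(2, 3), index=["r0", "r1"], columns=["x", "y", "z"]
)

print(df_from_dict.shape)    # (2, 2)
print(df_from_list.shape)    # (2, 3)
print(df_from_series.shape)  # (3, 1)
print(df_from_array.shape)   # (2, 3)
```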

2. Import definition: when pandas reads an external file (for example with pd.read_csv), the result is returned as a DataFrame object.

df2 = pd.read_csv('bc_data.csv')
df2.shape
# Out
# (569,32)
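
If bc_data.csv is not at hand, the same behavior can be checked without a file on disk. This sketch assumes a tiny in-memory CSV (the text in csv_text is a stand-in, not the real data set) and feeds it to pd.read_csv through io.StringIO.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "id,diagnosis,area_mean\n842302,M,1001.0\n842517,M,1326.0\n"

# read_csv accepts a file path or any file-like object.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(type(df_demo).__name__)  # DataFrame
print(df_demo.shape)           # (2, 3)
```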

Rows and columns in a DataFrame

View the rows: use the index attribute

df2.index
# Out
# RangeIndex(start=0, stop=569, step=1)

Count the rows: use the .index.size attribute

df2.index.size
# Out
# 569

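View the columns: use the columns attribute. The outputs from here on assume that df2 has been narrowed to three columns with df2 = df2[["id", "diagnosis", "area_mean"]], as in the later examples.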

df2.columns
# Out
# Index(['id', 'diagnosis', 'area_mean'], dtype='object')

Count the columns: use the .columns.size attribute

df2.columns.size
# out
# 3

Show the row and column counts together, that is, the shape of the DataFrame: use the .shape attribute

df2.shape
# out
# (569,3)

Another way to get the row and column counts: .shape[0] and .shape[1]

print("Number of rows:", df2.shape[0])
print("Number of columns:", df2.shape[1])
# out
# Number of rows: 569
# Number of columns: 3
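
The attributes above all describe the same two numbers. As a quick cross-check, here is a sketch on a hypothetical three-row frame (df_demo is not the real data set; it only reuses a few values printed in this article).

```python
import pandas as pd

# Hypothetical three-row frame reusing values shown in this article.
df_demo = pd.DataFrame(
    {
        "id": [842302, 842517, 84300903],
        "diagnosis": ["M", "M", "M"],
        "area_mean": [1001.0, 1326.0, 1203.0],
    }
)

n_rows, n_cols = df_demo.shape  # unpack the shape tuple

print(n_rows == df_demo.index.size == len(df_demo))  # True
print(n_cols == df_demo.columns.size)                # True
print(df_demo.size == n_rows * n_cols)               # True: size counts all cells
```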

Ways to access elements

1. Reading by column name

  • Put the column name in the subscript
df2["id"].head()
# out
# 0 842302
# 1 842517
# 2 84300903
# 3 84348301
# 4 84358402
# Name: id, dtype: int64
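  • The column name can be used as an attribute of the data frame (this works only when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method)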
df2.id.head()
# 0 842302
# 1 842517
# 2 84300903
# 3 84348301
# 4 84358402
# Name: id, dtype: int64
  • Column name and row label together. Subscripting a data frame with [] selects a column first, so this cannot be written as df2[2]["id"]
df2["id"][2] 
# 84300903
  • Attribute name and row label together
df2.id[2]
# 84300903
  • Using fancy indexing
df2["id"][[2,4]]
# 2 84300903
# 4 84358402
# Name: id, dtype: int64
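
Two details of column access are easy to check on a small frame. The sketch below uses a hypothetical df_demo with a column deliberately named "count": it shows that .loc[row, column] reads the same cell as the chained form, and that attribute access returns the DataFrame method rather than the column when the names clash.

```python
import pandas as pd

# Hypothetical frame; the "count" column clashes with DataFrame.count().
df_demo = pd.DataFrame(
    {"id": [842302, 842517, 84300903], "count": [1, 2, 3]}
)

# Chained access and single-step .loc read the same cell.
print(df_demo["id"][2])      # 84300903
print(df_demo.loc[2, "id"])  # 84300903

# Attribute access finds the method first, so subscript the column instead.
print(callable(df_demo.count))    # True
print(df_demo["count"].tolist())  # [1, 2, 3]
```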

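2. Reading by index
In pandas, each row has two kinds of index: a default positional one that starts at 0 (the implicit index), and one defined through the index attribute (the explicit index). To tell them apart, use iloc for the implicit index and loc for the explicit index. Note that, unlike array indexing in C and Java, calculations on a DataFrame align data by the explicit index, not the implicit one.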

  • By implicit index
df2.iloc[1,2]
# 1326.0
  • By explicit index
df2.loc[[1,5],["id"]]
# id
# 1 842517
# 5 843786
df2.loc[1:5]
# id diagnosis area_mean
# 1 842517 M 1326.0
# 2 84300903 M 1203.0
# 3 84348301 M 386.1
# 4 84358402 M 1297.0
# 5 843786 M 477.1
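
With the default RangeIndex the implicit and explicit indexes coincide, which hides the difference between iloc and loc. The sketch below assumes a hypothetical frame whose row labels are 10, 20, 30, 40, so the two can be told apart. It also shows that a loc slice includes its end label while an iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical frame with row labels that differ from the positions.
df_demo = pd.DataFrame(
    {"area_mean": [1001.0, 1326.0, 1203.0, 386.1]}, index=[10, 20, 30, 40]
)

by_position = df_demo.iloc[1, 0]         # second row, first column
by_label = df_demo.loc[20, "area_mean"]  # row labeled 20
print(by_position, by_label)  # 1326.0 1326.0

label_slice = df_demo.loc[10:30]    # labels 10, 20, 30 (end included)
position_slice = df_demo.iloc[0:2]  # positions 0, 1 (end excluded)
print(len(label_slice), len(position_slice))  # 3 2
```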

Index operations

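  • Changing the explicit index: use the reindex() method. Here "changing" does not mean assigning brand-new labels; it means selecting and reordering the existing rows and columns by label. Labels that do not exist in the original produce NaN, and reindex() returns a new object, so df2 itself is unchanged in the output below.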
df2.reindex(index=["1","2","3"],columns=["1","2","3"])
df2.head()
# id diagnosis area_mean
# 0 842302 M 1001.0
# 1 842517 M 1326.0
# 2 84300903 M 1203.0
# 3 84348301 M 386.1
# 4 84358402 M 1297.0
df2.reindex(index=[2,3,1], columns=["diagnosis","id","area_mean"])
# diagnosis id area_mean
# 2 M 84300903 1203.0
# 3 M 84348301 386.1
# 1 M 842517 1326.0
  • A new explicit label can be added while reindexing; fill_value sets the value used for the new entries
df3=df2.reindex(index=[2,3,1], columns=["diagnosis","id","area_mean","MyNewColumn"],fill_value=100)
df3
# diagnosis id area_mean MyNewColumn
# 2 M 84300903 1203.0 100
# 3 M 84348301 386.1 100
# 1 M 842517 1326.0 100
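
The behavior of reindex() with labels that do not exist can be seen on a small example. This is a sketch on a hypothetical df_demo; label 5 is chosen because it is absent from the frame.

```python
import pandas as pd

# Hypothetical frame with row labels 0, 1, 2.
df_demo = pd.DataFrame({"id": [842302, 842517, 84300903]}, index=[0, 1, 2])

# Label 5 does not exist, so its row is filled with NaN.
reordered = df_demo.reindex(index=[2, 0, 5])
print(reordered["id"].isna().tolist())  # [False, False, True]

# fill_value replaces the default NaN for missing labels.
filled = df_demo.reindex(index=[2, 0, 5], fill_value=-1)
print(filled["id"].tolist())  # [84300903, 842302, -1]

# reindex() returned new objects; the original is untouched.
print(df_demo.index.tolist())  # [0, 1, 2]
```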

Deleting or filtering rows and columns

  • Deleting rows with drop()
import pandas as pd
df2 = pd.read_csv('bc_data.csv')
df2=df2[["id","diagnosis","area_mean"]]
df3 = df2.drop([2]).head()  # drop() does not modify df2 itself
df3.head()
# id diagnosis area_mean
# 0 842302 M 1001.0
# 1 842517 M 1326.0
# 3 84348301 M 386.1
# 4 84358402 M 1297.0
# 5 843786 M 477.1
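  • If the lines that reload df2 are omitted, running the drop() line below a second time raises an error, because df2 has already changed and labels 3 and 4 no longer exist. inplace=True means the data frame itself is modified in place. Meaning of axis=0:
    (1) the number of columns is the same before and after (2) for drop(), the labels refer to rows (3) for calculations such as sum(), the work is done column by column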
import pandas as pd
df2 = pd.read_csv('bc_data.csv')
df2=df2[["id","diagnosis","area_mean"]]
df2.drop([3,4], axis=0, inplace=True)
df2.head()
# id diagnosis area_mean
# 0 842302 M 1001.0
# 1 842517 M 1326.0
# 2 84300903 M 1203.0
# 5 843786 M 477.1
# 6 844359 M 1040.0
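
The error on a repeated in-place drop can be reproduced on a small frame. The sketch below uses a hypothetical df_demo and also shows the errors="ignore" option of drop(), which skips labels that are already gone.

```python
import pandas as pd

# Hypothetical frame with row labels 0 to 4.
df_demo = pd.DataFrame({"id": [1, 2, 3, 4, 5]})

# First in-place drop succeeds and changes df_demo itself.
df_demo.drop([3, 4], axis=0, inplace=True)

# Second identical drop fails: labels 3 and 4 no longer exist.
try:
    df_demo.drop([3, 4], axis=0, inplace=True)
    second_drop_failed = False
except KeyError:
    second_drop_failed = True
print(second_drop_failed)  # True

# errors="ignore" makes the call safe to repeat.
df_demo.drop([3, 4], axis=0, inplace=True, errors="ignore")
print(df_demo.index.tolist())  # [0, 1, 2]
```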
  • Deleting a column with the del statement
import pandas as pd
df2 = pd.read_csv('bc_data.csv')
df2=df2[["id","diagnosis","area_mean"]]
del df2["area_mean"]
df2.head()
# id diagnosis
# 0 842302 M
# 1 842517 M
# 2 84300903 M
# 3 84348301 M
# 4 84358402 M
  • Deleting columns with drop()
import pandas as pd
df2 = pd.read_csv('bc_data.csv')
df2=df2[["id","diagnosis","area_mean"]]
df2.drop(["id","diagnosis"], axis=1, inplace=True)
df2.head()
# area_mean
# 0 1001.0
# 1 1326.0
# 2 1203.0
# 3 386.1
# 4 1297.0
  • Filtering
import pandas as pd
df2 =pd.read_csv('bc_data.csv')
df2=df2[["id","diagnosis","area_mean"]]
df2[df2.area_mean> 1000].head()
# id diagnosis area_mean
# 0 842302 M 1001.0
# 1 842517 M 1326.0
# 2 84300903 M 1203.0
# 4 84358402 M 1297.0
# 6 844359 M 1040.0
df2[df2.area_mean> 1000][["id","diagnosis"]].head()
# id diagnosis
# 0 842302 M
# 1 842517 M
# 2 84300903 M
# 4 84358402 M
# 6 844359 M
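
Row filtering and column selection can also be done in one step with loc, and several conditions can be combined with & (each condition in its own parentheses). The following sketch uses a hypothetical four-row df_demo and made-up thresholds.

```python
import pandas as pd

# Hypothetical frame reusing values shown in this article.
df_demo = pd.DataFrame(
    {
        "id": [842302, 842517, 84300903, 84348301],
        "diagnosis": ["M", "M", "M", "M"],
        "area_mean": [1001.0, 1326.0, 1203.0, 386.1],
    }
)

# Boolean mask: True where both conditions hold.
mask = (df_demo["area_mean"] > 1000) & (df_demo["area_mean"] < 1300)

# Select matching rows and two columns in a single .loc call.
result = df_demo.loc[mask, ["id", "diagnosis"]]
print(result.index.tolist())  # [0, 2]
```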

Arithmetic operations

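1. Rule one: the row and column indexes of the two operands are aligned first (a label present in only one operand gets NaN), and the calculation runs once both have the same structure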

df4=pd.DataFrame(np.arange(6).reshape(2,3))
print(df4)
# 0 1 2
# 0 0 1 2
# 1 3 4 5
df5=pd.DataFrame(np.arange(10).reshape(2,5))
print(df5)
# 0 1 2 3 4
# 0 0 1 2 3 4
# 1 5 6 7 8 9
print(df4+df5)
# 0 1 2 3 4
#0 0 2 4 NaN NaN
#1 8 10 12 NaN NaN
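
The same alignment applies to row labels. In this sketch (df_a and df_b are hypothetical) the two frames share only the row label "b", so only that row gets a numeric result.

```python
import pandas as pd

# Two hypothetical frames that overlap only on row label "b".
df_a = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
df_b = pd.DataFrame({"x": [10, 20]}, index=["b", "c"])

total = df_a + df_b

print(total.index.tolist())        # ['a', 'b', 'c']
print(total.loc["b", "x"])         # 12.0
print(total["x"].isna().tolist())  # [True, False, True]
```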

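2. Rule two: the arithmetic operators +, -, * and so on produce NaN for labels that do not match. To use a specified value in place of the default NaN, use the methods add, sub, mul, and div with fill_value instead of the operators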

df6=df4.add(df5,fill_value=10)
df6
# 0 1 2 3 4
#0 0 2 4 13.0 14.0
#1 8 10 12 18.0 19.0
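
One detail worth knowing: fill_value only stands in for a value that is missing on one side. Where a cell is missing in both operands, the result is still NaN. The sketch below uses two hypothetical frames with different rows and different columns to show both cases.

```python
import pandas as pd

# Hypothetical frames: different columns, rows overlapping only on "b".
df_a = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
df_b = pd.DataFrame({"y": [10, 20]}, index=["b", "c"])

total = df_a.add(df_b, fill_value=0)

# Missing on one side only: fill_value is used.
print(total.loc["a", "x"])  # 1.0
print(total.loc["b", "y"])  # 10.0

# Missing on both sides: the result stays NaN.
print(pd.isna(total.loc["a", "y"]))  # True
print(pd.isna(total.loc["c", "x"]))  # True
```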

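3. Rule three, for calculations between a data frame and a Series: the Series is matched against the column labels (axis 1) and applied to each row. The rows are first brought to equal length by label alignment; values are not recycled to pad a row. The calculation simply proceeds one row at a time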
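
Rule three can be illustrated with a short sketch. The frame and Series below are hypothetical. The last part uses add(..., axis=0), which matches the Series against the row index instead of the columns.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame: [[0, 1, 2], [3, 4, 5]].
df_demo = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])

# The Series index is matched to the column labels and applied to every row.
row_offsets = pd.Series([10, 20, 30], index=["a", "b", "c"])
by_row = df_demo + row_offsets
print(by_row.values.tolist())  # [[10, 21, 32], [13, 24, 35]]

# A shorter Series is not recycled: unmatched columns become NaN.
partial = pd.Series([100], index=["a"])
by_row_partial = df_demo + partial
print(by_row_partial["b"].isna().all())  # True

# axis=0 matches the Series against the row index instead.
col_offsets = pd.Series([1000, 2000], index=[0, 1])
by_col = df_demo.add(col_offsets, axis=0)
print(by_col.values.tolist())  # [[1000, 1001, 1002], [2003, 2004, 2005]]
```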