Understanding Series and DataFrame in Depth | 田超杰's personal website, a blog sharing computer knowledge and life philosophy

Last updated: 1 year ago

Series and DataFrame are data structures provided by the third-party library pandas. One of the best ways to understand them is to read the underlying source code.

I. Series

The constructor of the Series class is: pandas.Series(data, index, dtype, name, copy)

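Parameter	Description
data	Array-like data, an iterable (e.g. range()), a dict, or a scalar value (e.g. 1). If a dict is passed, the data keeps the dict's insertion order.
index	Array-like data or an iterable. The values must be hashable (usable as dictionary keys) and the index must have the same length as the data; duplicate labels are allowed. Defaults to RangeIndex (0, 1, 2, …, n-1, where n is the length of the data). If data is a dict and no index is given, the dict keys are used as the index.
dtype	str, numpy.dtype, or ExtensionDtype, optional. Inferred from the data by default.
name	str, optional. The name of the Series.
copy	bool. Defaults to False.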
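As a quick illustration of the data and index parameters described in the table, here is a minimal sketch that builds a Series from a list, a dict, and a scalar. The variable names and sample values are made up for this example.

```python
import pandas as pd

# From a list: the default RangeIndex 0, 1, 2 is used.
s_list = pd.Series([10, 20, 30], name="score")

# From a dict: the keys become the index, in insertion order.
s_dict = pd.Series({"b": 2, "a": 1})

# From a scalar: the value is repeated once per index label.
s_scalar = pd.Series(1, index=["x", "y", "z"])

print(list(s_list.index))    # [0, 1, 2]
print(list(s_dict.index))    # ['b', 'a']
print(s_scalar.tolist())     # [1, 1, 1]
```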

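① The name attribute

The name of a Series can be thought of as a column header: when the Series becomes a column of a DataFrame, its name is used as the column label.

② The values attribute

Returns the stored values as an array (when printed, the elements are separated by a single space).

③ Using the len() function

len(series) returns the number of elements.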
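The three points above can be checked with a short sketch. The Series contents are arbitrary sample data, and to_frame() is used here only as a convenient way to show the name turning into a column label.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.nan, 3.0], name="price")

# Placing the Series in a DataFrame turns its name into the column label.
df = s.to_frame()
print(list(df.columns))      # ['price']

# values exposes the underlying NumPy array.
print(type(s.values))        # <class 'numpy.ndarray'>

# len() counts every element, including missing ones.
print(len(s))                # 3
```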

④ Ways to get values

# Excerpt from the pandas source code that handles value access
if is_integer(key) and self.index._should_fallback_to_positional:
    return self._values[key]

elif key_is_scalar:
    return self._get_value(key)

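From the above, values in a Series can be accessed either by integer position or by custom index label. The positional branch is only taken when the index labels are not themselves integers, and newer pandas releases deprecate this positional fallback in favor of the explicit .iloc accessor.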
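To make the two access paths concrete, here is a small sketch. It uses the explicit .loc (label) and .iloc (position) accessors, which the original text does not mention, because they behave the same across pandas versions.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Label-based access: plain brackets or .loc with an index label.
print(s["b"])        # 20
print(s.loc["b"])    # 20

# Position-based access: .iloc with an integer position.
print(s.iloc[1])     # 20
```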

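⑤ Ways to change values

Direct assignment:

# Excerpt from the pandas source code that handles value assignment
self._mgr.setitem_inplace(key, value)  # key is the integer position, value is the new value

When a value is assigned through a custom index label, pandas first looks up the corresponding integer position and then runs the code above.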
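The label-to-position lookup can be observed from the outside with Index.get_loc(). This is only a sketch of the idea with sample data, not a trace of the internal call path.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# The label "b" sits at integer position 1.
position = s.index.get_loc("b")
print(position)      # 1

# Assign by label, then by position.
s["b"] = 20
s.iloc[0] = 10
print(s.tolist())    # [10, 20, 3]
```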

Using the update() method:

Example:

>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, 5, 6]))
>>> s
0    4
1    5
2    6
dtype: int64
>>> s = pd.Series(['a', 'b', 'c'])
>>> s.update(pd.Series(['d', 'e'], index=[0, 2]))
>>> s
0    d
1    b
2    e
dtype: object
>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, 5, 6, 7, 8]))
>>> s
0    4
1    5
2    6
dtype: int64
If other contains NaNs the corresponding values are not updated in the original Series.
>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, np.nan, 6]))
>>> s
0    4
1    2
2    6
dtype: int64
other can also be a non-Series object type that is coercible into a Series
>>> s = pd.Series([1, 2, 3])
>>> s.update([4, np.nan, 6])
>>> s
0    4
1    2
2    6
dtype: int64
>>> s = pd.Series([1, 2, 3])
>>> s.update({1: 9})
>>> s
0    1
1    9
2    3
dtype: int64

⑥ Adding a value

s[new_label] = new_value
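A minimal sketch of adding an element by assigning to a label that does not exist yet (the labels and values are sample data):

```python
import pandas as pd

s = pd.Series([1, 2], index=["a", "b"])

# Assigning to a new label appends an element to the Series.
s["c"] = 3

print(list(s.index))   # ['a', 'b', 'c']
print(s.tolist())      # [1, 2, 3]
```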

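⑦ Deleting values

del s[label]: deletes s[label] directly from the original data.

s.drop(labels=, inplace=): labels can be a single index label or a list-like of several labels. With inplace=False the original data is left unchanged and a new Series is returned; with inplace=True the original data is modified directly. inplace defaults to False.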
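The difference between del, drop() with the default inplace=False, and drop() with inplace=True can be sketched as follows, using made-up sample data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# del removes the element from s itself.
del s["a"]
print(list(s.index))         # ['b', 'c', 'd']

# drop() returns a new Series by default; s is left untouched.
trimmed = s.drop(labels=["b", "c"])
print(list(trimmed.index))   # ['d']
print(list(s.index))         # ['b', 'c', 'd']

# With inplace=True, s itself is modified.
s.drop(labels="d", inplace=True)
print(list(s.index))         # ['b', 'c']
```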

⑦ Converting to a dict

The to_dict() method converts a Series object into a dict.
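A short sketch with sample data: the index labels become the dict keys and the stored values become the dict values.

```python
import pandas as pd

s = pd.Series([1, 2], index=["a", "b"])

# Index labels map to keys, stored values map to values.
d = s.to_dict()
print(d == {"a": 1, "b": 2})   # True
```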

⑧ The groupby() method

Example:

>>> ser = pd.Series([390., 350., 30., 20.],
...                 index=['Falcon', 'Falcon', 'Parrot', 'Parrot'], name="Max Speed")
>>> ser
Falcon    390.0
Falcon    350.0
Parrot     30.0
Parrot     20.0
Name: Max Speed, dtype: float64 
>>> ser.groupby(["a", "b", "a", "b"]).mean()
a    210.0
b    185.0
Name: Max Speed, dtype: float64 
>>> ser.groupby(level=0).mean()
Falcon    370.0
Parrot     25.0
Name: Max Speed, dtype: float64
>>> ser.groupby(ser > 100).mean()
Max Speed
False     25.0
True     370.0
Name: Max Speed, dtype: float64

⑨ The count() method

Returns the number of non-missing (non-NA/null) values.
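The difference from len() only shows up when the Series contains missing values, as this small sketch with sample data suggests:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# len() counts every element; count() skips missing values.
print(len(s))      # 3
print(s.count())   # 2
```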

⑩ The sort_values() method

Example:

>>> s = pd.Series([np.nan, 1, 3, 10, 5])
>>> s
0     NaN
1     1.0
2     3.0
3    10.0
4     5.0
dtype: float64
Sort values ascending order (default behaviour)
>>> s.sort_values(ascending=True)
1     1.0
2     3.0
4     5.0
3    10.0
0     NaN
dtype: float64
Sort values descending order
>>> s.sort_values(ascending=False)
3    10.0
4     5.0
2     3.0
1     1.0
0     NaN
dtype: float64
Sort values inplace
>>> s.sort_values(ascending=False, inplace=True)
>>> s
3    10.0
4     5.0
2     3.0
1     1.0
0     NaN
dtype: float64
Sort values putting NAs first
>>> s.sort_values(na_position='first')
0     NaN
1     1.0
2     3.0
4     5.0
3    10.0
dtype: float64
Sort a series of strings
>>> s = pd.Series(['z', 'b', 'd', 'a', 'c'])
>>> s
0    z
1    b
2    d
3    a
4    c
dtype: object
>>> s.sort_values()
3    a
1    b
4    c
2    d
0    z
dtype: object
Sort using a key function. Your key function will be given the Series of values and should return an array-like.
>>> s = pd.Series(['a', 'B', 'c', 'D', 'e'])
>>> s.sort_values()
1    B
3    D
0    a
2    c
4    e
dtype: object
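The last example above introduces key functions but stops at the plain sort, where uppercase letters come before lowercase ones. A sketch of the missing step, assuming a pandas version that supports the key parameter of sort_values() (1.1 or later):

```python
import pandas as pd

s = pd.Series(["a", "B", "c", "D", "e"])

# The key function receives the whole Series and returns the values to sort by;
# lowercasing them makes the sort case-insensitive.
result = s.sort_values(key=lambda x: x.str.lower())

print(result.tolist())       # ['a', 'B', 'c', 'D', 'e']
print(list(result.index))    # [0, 1, 2, 3, 4]
```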

⑪ The sort_index() method

Example:

>>> s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, 4])
>>> s.sort_index()
1    c
2    b
3    a
4    d
dtype: object
Sort Descending
>>> s.sort_index(ascending=False)
4    d
3    a
2    b
1    c
dtype: object
Sort Inplace
>>> s.sort_index(inplace=True)
>>> s
1    c
2    b
3    a
4    d
dtype: object

⑫ The drop() method

Example:

>>> s = pd.Series(data=np.arange(3), index=['A', 'B', 'C'])
>>> s
A  0
B  1
C  2
dtype: int64
Drop labels B and C
>>> s.drop(labels=['B', 'C'])
A  0
dtype: int64

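Note: drop() matches index labels, not positions. Once a custom index is in use, drop() no longer accepts integer positions; conversely, a Series with the default integer index cannot be dropped by custom labels it does not contain.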
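The note can be verified with a short sketch. Looking up the label through s.index[position] is one possible workaround for dropping by position; it is an illustration, not something the original text prescribes.

```python
import pandas as pd

s = pd.Series([0, 1, 2], index=["A", "B", "C"])

# drop() looks for the label 0, which does not exist in this index.
try:
    s.drop(labels=0)
    raised = False
except KeyError:
    raised = True
print(raised)                        # True

# To drop by position, look up the label at that position first.
without_first = s.drop(labels=s.index[0])
print(list(without_first.index))     # ['B', 'C']
```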

II. DataFrame

The constructor of the DataFrame class is: pandas.DataFrame(data, index, columns, dtype, copy)

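Parameter	Description
data	An ndarray (typically two-dimensional), an iterable (e.g. range()), or a dict. If a dict is passed, its keys become the column labels.
index	Array-like data or an iterable. The values must be hashable (usable as dictionary keys) and the index must have the same length as the data. Defaults to RangeIndex (0, 1, 2, …, n-1, where n is the number of rows).
columns	Array-like data or an iterable. Defaults to RangeIndex (0, 1, 2, …, n-1, where n is the number of columns) when the data carries no column labels.
dtype	Inferred from the data by default.
copy	bool. Defaults to None.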
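A minimal sketch of the constructor parameters in the table, using made-up data: a dict supplies column labels through its keys, while an array takes them from index and columns or falls back to the defaults.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the column labels.
df_dict = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# From a 2-D array with explicit row and column labels.
df_array = pd.DataFrame(np.array([[1, 2], [3, 4]]),
                        index=["x", "y"], columns=["A", "B"])

# Without labels, both axes fall back to 0, 1, ...
df_default = pd.DataFrame([[1, 2], [3, 4]])

print(list(df_dict.columns))      # ['A', 'B']
print(df_array.loc["y", "B"])     # 4
print(list(df_default.columns))   # [0, 1]
```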

① Converting to a dict

The to_dict() method converts a DataFrame object into a dict.

② Transposing rows and columns

df.T
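The two operations above can be sketched on a small sample frame. The orient="records" variant is an extra shown for comparison and is not covered in the original text.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# Default layout: {column label: {index label: value}}.
nested = df.to_dict()
print(nested == {"A": {"x": 1, "y": 2}, "B": {"x": 3, "y": 4}})   # True

# orient="records" gives one dict per row instead.
records = df.to_dict(orient="records")
print(records == [{"A": 1, "B": 3}, {"A": 2, "B": 4}])            # True

# T swaps the row labels and the column labels.
transposed = df.T
print(list(transposed.index))     # ['A', 'B']
print(list(transposed.columns))   # ['x', 'y']
```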

③ Ways to get values

df.at[]:

Example:

>>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
...                   index=[4, 5, 6], columns=['A', 'B', 'C'])
>>> df
    A   B   C
4   0   2   3
5   0   4   1
6  10  20  30

Get value at specified row/column pair

>>> df.at[4, 'B']
2

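df[start:end:step]: slicing like this can only select rows. Boolean indexing can also retrieve the rows that satisfy a condition (when several conditions are combined with & or |, each one must be wrapped in parentheses).

df.XX and df[column_label]: these two forms can only select columns. Note: attribute-style access does not support numeric column labels.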
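Here is a sketch of the selection styles just described, on a small frame with made-up string row labels (chosen so that the integer slice is unambiguously positional):

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 10], "B": [2, 4, 20], "C": [3, 1, 30]},
                  index=["r1", "r2", "r3"])

# Slicing selects rows by position.
first_two = df[0:2]
print(list(first_two.index))     # ['r1', 'r2']

# Boolean indexing: each condition sits in its own parentheses.
matches = df[(df["A"] == 0) & (df["B"] > 2)]
print(list(matches.index))       # ['r2']

# Bracket and attribute access both select the same column.
col_b = df["B"]
col_b_attr = df.B
print(col_b.equals(col_b_attr))  # True

# at[] fetches a single cell by row label and column label.
print(df.at["r2", "B"])          # 4
```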

④ Ways to change values

Direct assignment.

Using the update() method:

>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, 5, 6],
...                        'C': [7, 8, 9]})
>>> df.update(new_df)
>>> df
   A  B
0  1  4
1  2  5
2  3  6
The DataFrame's length does not increase as a result of the update, only values at matching index/column labels are updated.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']})
>>> df.update(new_df)
>>> df
   A  B
0  a  d
1  b  e
2  c  f
For Series, its name attribute must be set.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_column = pd.Series(['d', 'e'], name='B', index=[0, 2])
>>> df.update(new_column)
>>> df
   A  B
0  a  d
1  b  y
2  c  e
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e']}, index=[1, 2])
>>> df.update(new_df)
>>> df
   A  B
0  a  x
1  b  d
2  c  e
If other contains NaNs the corresponding values are not updated in the original dataframe.
>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, np.nan, 6]})
>>> df.update(new_df)
>>> df
   A      B
0  1    4.0
1  2  500.0
2  3    6.0

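⑤ Adding values

Direct assignment: df[new_column_label] = new_values. Note: attribute-style access (df.new_column = ...) cannot be used to create a column.

Using the insert() method.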
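The two approaches differ mainly in where the new column lands: bracket assignment appends it at the end, while insert(loc, column, value) places it at a chosen position. A minimal sketch with sample data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "C": [5, 6]})

# Bracket assignment appends the new column at the end.
df["D"] = [7, 8]

# insert() places the new column at position 1 and modifies df in place.
df.insert(1, "B", [3, 4])

print(list(df.columns))    # ['A', 'B', 'C', 'D']
print(df["B"].tolist())    # [3, 4]
```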

⑥ Deleting values

df.pop(column_label): removes the column directly from the original data and returns the removed column.

Using the drop() method.
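A short sketch contrasting the two: pop() changes the frame in place and hands back the column, whereas drop() returns a new frame by default and can remove either columns or rows. The data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# pop() removes column B from df and returns it as a Series.
popped = df.pop("B")
print(popped.tolist())              # [3, 4]
print(list(df.columns))             # ['A', 'C']

# drop() returns a new DataFrame; df itself is unchanged.
without_c = df.drop(columns=["C"])
without_row0 = df.drop(index=[0])
print(list(without_c.columns))      # ['A']
print(list(without_row0.index))     # [1]
```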

⑦ The sort_index() method

Example:

>>> df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150],
...                   columns=['A'])
>>> df.sort_index()
     A
1    4
29   2
100  1
150  5
234  3
By default, it sorts in ascending order, to sort in descending order, use ascending=False
>>> df.sort_index(ascending=False)
     A
234  3
150  5
100  1
29   2
1    4

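⑧ The append() method

Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the example below only runs on older versions; pandas.concat() is the replacement.

Example: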

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=['x', 'y'])
>>> df
   A  B
x  1  2
y  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'), index=['x', 'y'])
>>> df.append(df2)
   A  B
x  1  2
y  3  4
x  5  6
y  7  8
With ignore_index set to True:
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
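On pandas versions where append() is no longer available, the same two results can be obtained with pandas.concat(). A sketch using the same data as the example above:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"), index=["x", "y"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list("AB"), index=["x", "y"])

# Stack the rows of df2 under df, keeping the original row labels.
stacked = pd.concat([df, df2])
print(list(stacked.index))       # ['x', 'y', 'x', 'y']

# ignore_index=True renumbers the rows 0, 1, 2, 3.
renumbered = pd.concat([df, df2], ignore_index=True)
print(list(renumbered.index))    # [0, 1, 2, 3]
print(renumbered["A"].tolist())  # [1, 3, 5, 7]
```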

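⑨ The values attribute

Returns the stored values as a two-dimensional NumPy array (similar to a nested list; rows are separated from one another and the output is laid out in two dimensions).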
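A minimal sketch of what values looks like for a small sample frame: one inner list per row, one entry per column.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# values is a 2-D NumPy array with shape (rows, columns).
arr = df.values
print(arr.shape)       # (2, 2)
print(arr.tolist())    # [[1, 3], [2, 4]]
```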


Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!
