An Overview of Python's pandas Module
I. Definition
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Beyond that, it has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language, and it is already well on its way toward that goal. pandas works well with many different kinds of data:
Tabular data with heterogeneously typed columns, as in an SQL table or Excel spreadsheet
Ordered and unordered (not necessarily fixed-frequency) time series data
Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels
Any other form of observational/statistical data set; the data need not be labeled at all to be placed into a pandas data structure
II. Features
The two primary pandas data structures, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R's data.frame offers and more. pandas is built on top of NumPy and is designed to integrate well with many other third-party libraries in a scientific computing environment.
Here are some of the things pandas does well:
Easy handling of missing data (represented as NaN) in both floating-point and non-floating-point data
Size mutability: columns can be inserted into and deleted from DataFrame and higher-dimensional objects
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data in computations
Powerful, flexible group-by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
Easy conversion of ragged, differently indexed data in other Python and NumPy data structures into DataFrame objects
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
Intuitive merging and joining of data sets
Flexible reshaping and pivoting of data sets
Hierarchical labeling of axes (multiple labels per tick are possible)
Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving/loading data in the ultrafast HDF5 format
Time-series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging
Many of these features exist to address shortcomings frequently encountered in other languages and scientific research environments. For a data scientist, working with data typically falls into several stages: munging and cleaning the data, analyzing or modeling it, and then organizing the results into a form suitable for plotting or tabular display. pandas is an ideal tool for all of these tasks.
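The split-apply-combine workflow mentioned above can be sketched with a tiny, hypothetical sales table (the column names here are made up for illustration):

```python
import pandas as pd

# Split the rows by "region", apply a sum to each group, combine the results
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales": [100, 200, 300, 400],
})
totals = df.groupby("region")["sales"].sum()
print(totals)
```

The result is a Series indexed by the group keys: east is 100 + 300 = 400 and west is 200 + 400 = 600.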
III. Types
pandas is built around two data types: Series and DataFrame.
1. Series
A Series is a one-dimensional data type in which every element carries a label, similar to a NumPy array with labeled elements. The labels can be numbers or strings.
import numpy as np
import pandas as pd
s = pd.Series([1, 2, 5, np.nan, 6, 8])
print(s)
'''
Output:
0 1.0
1 2.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
'''
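Since the labels may also be strings, a Series can carry a custom index instead of the default 0..n-1; a minimal sketch:

```python
import pandas as pd

# A Series whose labels are strings instead of the default 0..n-1
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s2["b"])     # look up by label
print(s2.iloc[0])  # look up by position
```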
2. DataFrame
A DataFrame is a two-dimensional table structure. It can store many different data types, and each axis has its own labels. You can think of it as a dictionary of Series.
# Create a DataFrame:
# Create a sequence of date indexes
dates = pd.date_range('20130101', periods=6)
print(type(dates))
# Create the DataFrame; index sets the row index and columns sets the column names
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)
'''
Output:
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
A B C D
2013-01-01 0.406575 -1.356139 0.188997 -1.308049
2013-01-02 -0.412154 0.123879 0.907458 0.201024
2013-01-03 0.576566 -1.875753 1.967512 -1.044405
2013-01-04 1.116106 -0.796381 0.432589 0.764339
2013-01-05 -1.851676 0.378964 -0.282481 0.296629
2013-01-06 -1.051984 0.960433 -1.313190 -0.093666
'''
# Create a DataFrame from a dictionary
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
print(df2)
'''
Output:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
'''
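To see that each column of df2 really has its own type, inspect dtypes; a quick check that rebuilds the same frame:

```python
import numpy as np
import pandas as pd

# The same dictionary-built frame; each column keeps its own dtype
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
print(df2.dtypes)  # one dtype per column, e.g. float64, datetime64[ns], float32, int32, category
```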
IV. Usage
1. Import the modules
import pandas as pd
import numpy as np
2. Read a CSV file
df = pd.read_csv('file.csv')
'''
Parameters:
header=None                use the default column names 0, 1, 2, 3, ...
names=['A', 'B', 'C', ...] custom column names
index_col='A' | ['A', 'B', ...] column(s) to use as the row index; pass a list for a MultiIndex
skiprows=[0,1,2]           row numbers to skip, counted from the top of the file (0-based); skipfooter counts from the end
nrows=N                    number of rows to read (the first N rows)
chunksize=M                return an iterator (TextFileReader) yielding M rows at a time; useful when the data would not fit comfortably in memory
sep=':'                    field separator; the default is ','. Choose one that matches the file; if omitted, pandas tries to infer it
skip_blank_lines=False     the default is True (skip blank lines); if False, blank lines are kept and filled with NaN
converters={'col1': func}  apply func to the selected column; often used for ID-like columns (to avoid conversion to int)
dfjs = pd.read_json('file.json')  a JSON-format string can also be passed in
dfex = pd.read_excel('file.xls', sheet_name=[0, 1, ...])  read multiple sheets; returns a dict of DataFrames
'''
# df.to_csv() writes a DataFrame back out to CSV
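A round trip using a few of the parameters listed above (sep and nrows), written as a self-contained sketch with an in-memory buffer standing in for a real file:

```python
import io
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

# Write with a ':' separator, then read back only the first 2 rows
buf = io.StringIO()
df.to_csv(buf, index=False, sep=":")
buf.seek(0)
back = pd.read_csv(buf, sep=":", nrows=2)
print(back)
```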
3. Query the data
df.shape                   # number of rows and columns
df.dtypes                  # data type of each column
df.head(n)                 # the first n rows (default n=5)
df.tail(n)                 # the last n rows (default n=5)
df.head(1)['date']         # the date column of the first row
df.head(1)['date'][0]      # the value in the date column of the first row
df.describe(include='all') # 'all' means summarize every column
df.T                       # transpose the data
df.isnull()                # element-wise check for missing values: True where a value is missing, False otherwise; works on the whole table or on a single column
df["col"]                  # the data of the column named "col"
df[["name", "age"]]        # the two columns named name and age
df['col'].unique()         # all unique values in a column; zeros may appear if missing values were filled with 0
df = pd.read_excel(file, skiprows=[i])  # skip row i of the file while reading
df.loc[0]                  # select the row labeled 0 with loc[]
df.loc[0]["name"]          # the row labeled 0, column "name"
df.loc[2:4]                # rows labeled 2 through 4 (loc slices include both endpoints)
df.loc[[2, 5, 10]]         # the rows labeled 2, 5, and 10; the labels must be wrapped in a list
df.loc[:, 'test1']         # the test1 column; the colon means all rows, the comma separates rows from columns
df.loc[:, ['test1', 'test2']]  # the test1 and test2 columns
df.loc[1, ['test1', 'test2']]  # the test1 and test2 values of the second row
df.at[1, 'test1']          # second row, test1 column; like the above, but for a single scalar
df.iloc[0]                 # the first row, by position
df.iloc[0:2, 0:2]          # the first two rows and first two columns
df.iloc[[1, 2, 4], [0, 2]] # rows 1, 2, 4 and columns 0, 2, by position
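A few of the selections above, run against a small hypothetical frame (the columns test1..test3 are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"test1": [1, 2, 3],
                   "test2": [4, 5, 6],
                   "test3": [7, 8, 9]})

print(df.loc[1, "test1"])   # label-based: row label 1, column "test1"
print(df.at[1, "test2"])    # fast scalar access by label
print(df.iloc[0:2, 0:2])    # position-based: first two rows and columns
```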
4. Process the data
(1) Data acquisition (basic information about an Excel file)
# coding=utf-8
import pandas as pd
import numpy as np
excel_data = pd.read_excel("test.xlsx")
print(excel_data.shape)    # number of rows and columns
print(excel_data.index)    # the row index
print(excel_data.columns)  # the column names
excel_data.info()          # print a concise summary of the DataFrame
print(excel_data.dtypes)   # the data type of each column
# Help on function read_excel in module pandas.io.excel:
read_excel(*args, **kwargs)
Read an Excel table into a pandas DataFrame
Parameters
----------
io : string, path object (pathlib.Path or py._path.local.LocalPath),
file-like object, pandas ExcelFile, or xlrd workbook.
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/workbook.xlsx
sheet_name : string, int, mixed list of strings/ints, or None, default 0
Strings are used for sheet names, Integers are used in zero-indexed
sheet positions.
Lists of strings/integers are used to request multiple sheets.
Specify None to get all sheets.
str|int -> DataFrame is returned.
list|None -> Dict of DataFrames is returned, with keys representing
sheets.
Available Cases
* Defaults to 0 -> 1st sheet as a DataFrame
* 1 -> 2nd sheet as a DataFrame
* "Sheet1" -> 1st sheet as a DataFrame
* [0,1,"Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
* None -> All sheets as a dictionary of DataFrames
sheetname : string, int, mixed list of strings/ints, or None, default 0
.. deprecated:: 0.21.0
Use `sheet_name` instead
header : int, list of ints, default 0
Row (0-indexed) to use for the column labels of the parsed
DataFrame. If a list of integers is passed those row positions will
be combined into a ``MultiIndex``. Use None if there is no header.
names : array-like, default None
List of column names to use. If file contains no header row,
then you should explicitly pass header=None
index_col : int, list of ints, default None
Column (0-indexed) to use as the row labels of the DataFrame.
Pass None if there is no such column. If a list is passed,
those columns will be combined into a ``MultiIndex``. If a
subset of data is selected with ``usecols``, index_col
is based on the subset.
parse_cols : int or list, default None
.. deprecated:: 0.21.0
Pass in `usecols` instead.
usecols : int or list, default None
* If None then parse all columns,
* If int then indicates last column to be parsed
* If list of ints then indicates list of column numbers to be parsed
* If string then indicates comma separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
both sides.
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
Use `object` to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD
of dtype conversion.
.. versionadded:: 0.20.0
engine : string, default None
If io is not a buffer or path, this must be set to identify io.
Acceptable values are None or xlrd
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels, values are functions that take one
input argument, the Excel cell content, and return the transformed
content.
true_values : list, default None
Values to consider as True
.. versionadded:: 0.19.0
false_values : list, default None
Values to consider as False
.. versionadded:: 0.19.0
skiprows : list-like
Rows to skip at the beginning (0-indexed)
nrows : int, default None
Number of rows to parse
.. versionadded:: 0.23.0
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific
per-column NA values. By default the following values are interpreted
as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
'1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan',
'null'.
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN
values are overridden, otherwise they're appended to.
verbose : boolean, default False
Indicate number of NA values placed in non-numeric columns
thousands : str, default None
Thousands separator for parsing string columns to numeric. Note that
this parameter is only necessary for columns stored as TEXT in Excel,
any numeric columns will automatically be parsed, regardless of display
format.
comment : str, default None
Comments out remainder of line. Pass a character or characters to this
argument to indicate comments in the input file. Any data between the
comment string and the end of the current line is ignored.
skip_footer : int, default 0
.. deprecated:: 0.23.0
Pass in `skipfooter` instead.
skipfooter : int, default 0
Rows at the end to skip (0-indexed)
convert_float : boolean, default True
convert integral floats to int (i.e., 1.0 --> 1). If False, all numeric
data will be read in as floats: Excel stores all numbers as floats
internally
Returns
-------
parsed : DataFrame or Dict of DataFrames
DataFrame from the passed in Excel file. See notes in sheet_name
argument for more information on when a Dict of Dataframes is returned.
read_excel parameter reference
Get rows
excel_data.head(5)        # the first 5 rows
excel_data.tail(5)        # the last 5 rows
excel_data.loc[0]         # the row labeled 0
excel_data.loc[2:4]       # rows labeled 2 through 4 (loc slices include both endpoints)
excel_data.loc[[2, 5, 10]]  # the rows labeled 2, 5, and 10; the labels must be wrapped in a list
excel_data.iloc[0]        # the first row, by position
Get columns
excel_data["name"]        # the column named "name"
excel_data[["name", "age"]]  # the two columns named name and age
excel_data["name"].unique()  # all unique values in the name column; zeros may appear if missing values were filled with 0
Get a specific row and column
excel_data.head(5)["name"]     # the name column of the first 5 rows
excel_data.head(5)["name"][0]  # the first value in the name column
excel_data.at[1, "age"]        # the second row, "age" column
excel_data.loc[0]["name"]      # the row labeled 0, column name
excel_data.loc[:, "age"]       # the age column; the colon means all rows, the comma separates rows from columns
excel_data.loc[:, ["age", "time"]]  # the age and time columns of all rows
excel_data.loc[1, ["age", "time"]]  # the age and time values of the second row
excel_data.iloc[0:2, 0:2]      # the first two rows and first two columns
excel_data.iloc[[1, 2, 4], [0, 2]]  # rows 1, 2, 4 and columns 0, 2, by position
Get null values
excel_data.notnull()  # True where a value of excel_data is not null
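A minimal sketch of the notnull()/isnull() checks on a frame with one missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Amy"], "age": [20.0, np.nan]})
print(df.isnull())           # True exactly where a value is missing
print(df["age"].notnull())   # per-column check: True where "age" is present
```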
excel_data.isnull()  # element-wise check for missing values: True where a value is missing, False otherwise; works on the whole table or on a single column
(2) Data cleaning and transformation
1) Add
2) Delete
a. Delete invalid rows and columns (rows or columns that are entirely blank and carry no information)
b. Delete specified rows and columns
# Help on method drop in module pandas.core.frame:
drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') method of pandas.core.frame.DataFrame instance
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding
axis, or by specifying directly index or column names. When using a
multi-index, labels on different levels can be removed by specifying
the level.
Parameters
----------
labels : single label or list-like
Index or column labels to drop.
axis : {0 or 'index', 1 or 'columns'}, default 0
Whether to drop labels from the index (0 or 'index') or
columns (1 or 'columns').
index, columns : single label or list-like
Alternative to specifying axis (``labels, axis=1``
is equivalent to ``columns=labels``).
.. versionadded:: 0.21.0
level : int or level name, optional
For MultiIndex, level from which the labels will be removed.
inplace : bool, default False
If True, do operation inplace and return None.
errors : {'ignore', 'raise'}, default 'raise'
If 'ignore', suppress error and only existing labels are
dropped.
# excel_data.drop()
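A sketch of drop with both calling styles described above (labels plus axis, or the index/columns keywords):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

out1 = df.drop(index=0, columns="C")  # drop row label 0 and column "C"
out2 = df.drop("C", axis=1)           # equivalent column drop via axis
print(out1)
```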
# Help on method dropna in module pandas.core.frame:
dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
Remove missing values.
See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
Determine if rows or columns which contain missing values are
removed.
* 0, or 'index' : Drop rows which contain missing values.
* 1, or 'columns' : Drop columns which contain missing values.
.. deprecated:: 0.23.0
Pass tuple or list to drop on multiple axes.
how : {'any', 'all'}, default 'any'
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.
* 'any' : If any NA values are present, drop that row or column.
* 'all' : If all values are NA, drop that row or column.
thresh : int, optional
Require that many non-NA values.
subset : array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False
If True, do operation inplace and return None.
# excel_data.dropna()
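How the how and thresh parameters interact, as a small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                   "B": [4.0, np.nan, np.nan]})

print(df.dropna())           # 'any' (default): keep only rows with no NA at all
print(df.dropna(how="all"))  # drop only rows that are entirely NA
print(df.dropna(thresh=1))   # keep rows with at least 1 non-NA value
```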
3) Modify
# Help on method fillna in module pandas.core.frame:
fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs) method of pandas.core.frame.DataFrame instance
Fill NA/NaN values using the specified method
Parameters
----------
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for
each index (for a Series) or column (for a DataFrame). (values not
in the dict/Series/DataFrame will not be filled). This value cannot
be a list.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use NEXT valid observation to fill gap
axis : {0 or 'index', 1 or 'columns'}
inplace : boolean, default False
If True, fill in place. Note: this will modify any
other views on this object, (e.g. a no-copy slice for a column in a
DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be filled.
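The main fillna options sketched on a short Series and frame; .ffill() is used here as the current spelling of the pad/ffill method described above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.fillna(0))        # scalar fill: every NaN becomes 0
print(s.ffill(limit=1))   # forward-fill at most 1 consecutive NaN

df = pd.DataFrame({"A": [np.nan, 2.0], "B": [3.0, np.nan]})
print(df.fillna({"A": 0, "B": -1}))  # a different fill value per column
```

With limit=1, only the first NaN of the two-NaN gap is filled; the second stays NaN.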