Importing Data with Python (Part 1) -- 688IT编程网

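This part covers several ways to import data into Python:

(i) from flat files, such as .txt and .csv files;

(ii) from files native to other software, such as Excel spreadsheets and Stata, SAS, and MATLAB files;

(iii) from relational databases, such as SQLite and PostgreSQL.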

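A flat file is an electronic record stripped of all application-specific formatting, so that it can be moved to other applications for processing. Storing data without an application-specific format helps avoid data loss when hardware or proprietary software becomes obsolete. In a flat file, all of the information is held in a single string of characters.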

1. Importing an entire text file

# Open a file: file
file = open('', 'r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)

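2. Importing a text file line by line

# Read & print the first 3 lines
with open('') as file:
    # print(file.read())
    print(file.readline())
    print(file.readline())
    print(file.readline())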

CHAPTER 1. Loomings.

Call me Ishmael. Some years ago--never mind how long precisely--having
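The two reading patterns above can be tried without the original text file. The following is a minimal sketch, assuming a small temporary file with made-up contents (the sample text and the variable names are illustrative and do not come from the original). It reads the file once in full and once line by line.

```python
import os
import tempfile

# Made-up sample text standing in for the real file (an assumption).
sample = "first line\nsecond line\nthird line\nfourth line\n"

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# Read the whole file at once; the with-block closes it on exit.
with open(path, "r") as f:
    whole_text = f.read()
print(f.closed)  # the file object is closed after the with-block

# Read only the first three lines, one readline() call per line.
with open(path, "r") as f:
    first_three = [f.readline() for _ in range(3)]
print(first_three)

# Remove the temporary file.
os.remove(path)
```

Using `with` means there is no need to call `close()` by hand, even if an error occurs while reading.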

3. Importing flat files with NumPy

np.loadtxt(file, delimiter=',')

# Import packages
import numpy as np
import matplotlib.pyplot as plt

# Assign filename to variable: file
file = 'digits.csv'

# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')

# Print datatype of digits
print(type(digits))  # <class 'numpy.ndarray'>

# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

# Plot reshaped data
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()
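To see what np.loadtxt() returns without needing digits.csv, here is a small sketch that reads comma-separated text from an in-memory buffer. The data is made up (a label followed by four pixel values per row), and the 2 x 2 reshape simply mirrors the 28 x 28 reshape above on a smaller scale.

```python
from io import StringIO

import numpy as np

# Made-up data: each row is a label followed by four pixel values.
csv_text = "7,0,255,255,0\n3,128,0,0,128\n"

# np.loadtxt accepts a file-like object as well as a filename.
digits = np.loadtxt(StringIO(csv_text), delimiter=",")
print(type(digits))
print(digits.shape)

# Drop the label column of the second row and reshape the pixels.
pixels = digits[1, 1:]
image = np.reshape(pixels, (2, 2))
print(image)
```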

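Customizing the NumPy import

np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])

delimiter: changes the delimiter that loadtxt() expects; for example, use ',' for comma-delimited and '\t' for tab-delimited files.

skiprows: the number of lines to skip, counted from the start of the file (np.loadtxt() takes an integer here).

usecols: the indices of the columns you want to keep.

The .txt file we want to import has a header, consists of strings, and is tab-delimited.

Because of the header, if you try to import it as-is with np.loadtxt(), Python raises a ValueError saying that it could not convert a string to float.

There are two ways to deal with this. The first is to set the data type argument dtype to str, so that every value is read as a string.

The second is to use the skiprows argument: skiprows=1 skips the first line and so drops the header.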

# Import numpy
import numpy as np

# Assign the filename: file
file = ''

# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])

# Print data
print(data)

[[ 1.  0.]
 [ 0.  0.]
 [ 1.  0.]
 [ 2.  1.]]

np.loadtxt() returns a NumPy array (two-dimensional here), so you can pick out values with the usual array slicing.
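As a quick illustration of skiprows, usecols, and slicing together, the sketch below loads a made-up tab-separated table from an in-memory buffer. The column names and values are assumptions chosen only for the example.

```python
from io import StringIO

import numpy as np

# Made-up tab-separated data with a header row and three columns.
tsv_text = "a\tb\tc\n1\t10\t0\n0\t20\t0\n2\t30\t1\n"

# Skip the header line and keep only the first and third columns.
data = np.loadtxt(StringIO(tsv_text), delimiter="\t", skiprows=1, usecols=[0, 2])
print(data.shape)

# Ordinary NumPy slicing works on the result.
first_col = data[:, 0]   # every row, first kept column
last_row = data[-1]      # last row, all kept columns
print(first_col, last_row)
```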

# Assign filename: file
file = ''

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the data
print(data)

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

# Print the data_float
print(data_float)

[["b'Time'" "b'Percent'"]
 ["b'99'" "b'0.067'"]
 ["b'99'" "b'0.133'"]
 ......
 ["b'5'" "b'0.214'"]
 ["b'5'" "b'0.4'"]]

[[ 9.90000000e+01 6.70000000e-02]
 [ 9.90000000e+01 1.33000000e-01]
 ......
 [ 5.00000000e+00 2.14000000e-01]
 [ 5.00000000e+00 4.00000000e-01]]
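The header problem and its two workarounds can be reproduced with a few lines of made-up data. This is only a sketch: the two-column sample imitates the Time/Percent layout shown above, and the exact wording of the error message, as well as whether strings print with a b'' prefix, depends on the NumPy version.

```python
from io import StringIO

import numpy as np

# Made-up sample imitating the header-plus-numbers layout above.
tsv_text = "Time\tPercent\n99\t0.067\n5\t0.4\n"

# Importing as-is fails because the header cannot be converted to float.
try:
    np.loadtxt(StringIO(tsv_text), delimiter="\t")
    failed = False
except ValueError as err:
    failed = True
    print("loadtxt failed:", err)

# Workaround 1: read everything, header included, as strings.
as_str = np.loadtxt(StringIO(tsv_text), delimiter="\t", dtype=str)

# Workaround 2: skip the header line and read the rest as floats.
as_float = np.loadtxt(StringIO(tsv_text), delimiter="\t", skiprows=1)

print(as_str)
print(as_float)
```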

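Working with mixed data types

Most of the time you will need to import datasets whose columns have different data types; for example, one column may contain strings and another floats. np.loadtxt() handles this poorly. Another function, np.genfromtxt(), can deal with such structures: if we pass dtype=None, it works out what type each column should be. np.recfromcsv() behaves similarly, with delimiter=',', names=True, and dtype=None as its defaults.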

data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)

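Here the first argument is the filename, the second specifies the delimiter, and the third, names, tells genfromtxt() that the file has a header row. Because the columns have different data types, the result is an object called a structured array. Every element of a NumPy array must have the same type, so a structured array gets around this by being one-dimensional: each element of the array is one row of the imported flat file.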

In [2]: np.shape(data)
Out[2]: (10,)

In [5]: print(data)
[(1, 0, 3, b'male', 22.0, 1, 0, b'A/5 21171', 7.25, b'', b'S')
 (2, 1, 1, b'female', 38.0, 1, 0, b'PC 17599', 71.2833, b'C85', b'C')
 (3, 1, 3, b'female', 26.0, 0, 0, b'STON/O2. 3101282', 7.925, b'', b'S')
 (4, 1, 1, b'female', 35.0, 1, 0, b'113803', 53.1, b'C123', b'S')
 (5, 0, 3, b'male', 35.0, 0, 0, b'373450', 8.05, b'', b'S')
 (6, 0, 3, b'male', nan, 0, 0, b'330877', 8.4583, b'', b'Q')
 (7, 0, 1, b'male', 54.0, 0, 0, b'17463', 51.8625, b'E46', b'S')
 (8, 0, 3, b'male', 2.0, 3, 1, b'349909', 21.075, b'', b'S')
 (9, 1, 3, b'female', 27.0, 0, 2, b'347742', 11.1333, b'', b'S')
 (10, 1, 2, b'female', 14.0, 1, 0, b'237736', 30.0708, b'', b'C')]
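A structured array can be indexed by row position and by column name. The sketch below shows this on a small made-up CSV rather than titanic.csv; the column names and values are assumptions. Passing encoding="utf-8" is a choice made here so that text columns come back as ordinary strings instead of the b'...' bytes seen in the output above.

```python
from io import StringIO

import numpy as np

# Made-up CSV with an integer, a string, and a float column.
csv_text = "id,name,age\n1,Ann,22.0\n2,Bob,38.0\n3,Cy,14.0\n"

data = np.genfromtxt(
    StringIO(csv_text),
    delimiter=",",
    names=True,        # first line holds the column names
    dtype=None,        # let genfromtxt infer a type per column
    encoding="utf-8",  # return text columns as str rather than bytes
)

print(data.shape)        # one-dimensional: one element per row
print(data.dtype.names)  # column names taken from the header
print(data[0])           # first row
print(data["age"])       # one whole column, selected by name
```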

# Assign the filename: file
file = 'titanic.csv'

# Import file fromcsv: d
d = np.recfromcsv(file, delimiter=',', names=True, dtype=None)

# Print out first three entries of d
print(d[:3])

[(1, 0, 3, b'male', 22.0, 1, 0, b'A/5 21171', 7.25, b'', b'S')
 (2, 1, 1, b'female', 38.0, 1, 0, b'PC 17599', 71.2833, b'C85', b'C')
 (3, 1, 3, b'female', 26.0, 0, 0, b'STON/O2. 3101282', 7.925, b'', b'S')]
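Recent NumPy releases no longer expose recfromcsv in the main namespace, so the call above may fail on a current installation. A roughly equivalent result can be sketched with np.genfromtxt() and a record-array view. This is an approximation under that assumption, not the library's own implementation, and the sample data is made up.

```python
from io import StringIO

import numpy as np

# Made-up CSV standing in for titanic.csv.
csv_text = "Name,Age,Fare\nAnn,22.0,7.25\nBob,38.0,71.2833\n"

rec = np.genfromtxt(
    StringIO(csv_text),
    delimiter=",",
    names=True,
    dtype=None,
    case_sensitive="lower",  # lower-case the field names
    encoding="utf-8",
).view(np.recarray)          # record array: fields become attributes

print(rec.dtype.names)
print(rec.age)   # attribute access instead of rec["age"]
print(rec[:1])   # slicing still works as on any array
```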

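4. Importing flat files as DataFrames with pandas

Flat files whose columns have different data types can be imported as NumPy arrays. However, the pandas DataFrame is a structure better suited to storing this kind of data, and the pandas functions read_csv() and read_table() make it easy to import files with mixed data types as DataFrames.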

# Import pandas as pd
import pandas as pd

# Assign the filename: file
file = 'titanic.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)

# View the head of the DataFrame
print(df.head())

   PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1  female  35.0      1      0            113803  53.1000   C123         S
4            5         0       3    male  35.0      0      0            373450   8.0500    NaN         S

Customizing the pandas import

data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')

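comment: the character that marks a comment in the file, '#' in this example.

na_values: a string or list of strings to be recognized as NA/NaN, the string 'Nothing' in this example.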
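The effect of these arguments is easy to check on a small in-memory sample. The sketch below uses made-up tab-separated text containing a comment line and a 'Nothing' placeholder; the column names and values are assumptions.

```python
from io import StringIO

import pandas as pd

# Made-up tab-separated text with a comment line and a 'Nothing' placeholder.
tsv_text = (
    "# exported by a hypothetical tool\n"
    "name\tage\tcity\n"
    "Ann\t22\tOslo\n"
    "Bob\tNothing\tRome\n"
)

df = pd.read_csv(
    StringIO(tsv_text),
    sep="\t",             # columns are separated by tabs
    comment="#",          # ignore everything after a '#'
    na_values="Nothing",  # treat the string 'Nothing' as missing
)

print(df)
print(df.dtypes)
```

The comment line is dropped before the header is read, and the age column ends up as floats because it now contains a missing value.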
