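Importing Data in Python (Part 1)
In this part, you learn several ways to import data into Python:
(i) from flat files, such as .txt and .csv files;
(ii) from files native to other software, such as Excel spreadsheets and Stata, SAS, and MATLAB files;
(iii) from relational databases, such as SQLite and PostgreSQL.
A flat file is an electronic record stripped of all application-specific formatting, so that the data can be migrated to other applications for processing. Storing data without such formatting helps avoid data loss when hardware or proprietary software becomes obsolete. A flat file stores all of its information as a single stream of characters.
1. Importing an entire text file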
# Open a file: file
file = open('', 'r')
# Print it
print(file.read())
# Check whether file is closed
print(file.closed)
# Close file
file.close()
# Check whether file is closed
print(file.closed)
2. Importing a text file line by line
# Read & print the first 3 lines
with open('') as file:
    # print(file.read())
    print(file.readline())
    print(file.readline())
    print(file.readline())
CHAPTER 1. Loomings.
Call me Ishmael. Some years ago--never mind how long precisely--having
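The two approaches above can be compared in a minimal, self-contained sketch. The sample text and the temporary file are assumptions made for illustration; they are not the file used in this tutorial.

```python
import os
import tempfile

# Hypothetical sample text standing in for the text file used above.
sample = "line one\nline two\nline three\n"

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# read() returns the whole file as a single string.
with open(path, "r") as f:
    whole_text = f.read()

# readline() returns one line per call, including its trailing newline.
with open(path, "r") as f:
    first_line = f.readline()
    second_line = f.readline()

# Leaving the with block closes the file automatically.
closed_after_with = f.closed

os.remove(path)
print(whole_text)
print(first_line, second_line, closed_after_with)
```

Using with open(...) avoids the explicit file.close() call needed in section 1, because the context manager closes the file even if an error occurs inside the block.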
3. Importing flat files with NumPy
np.loadtxt(file, delimiter=',')
# Import package
import numpy as np
# Assign filename to variable: file
file = 'digits.csv'
# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')
# Print datatype of digits
print(type(digits)) # <class 'numpy.ndarray'>
# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))
# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()
Customizing NumPy imports
np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])
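delimiter: changes the delimiter that loadtxt() expects; for example, use ',' for comma-delimited files and '\t' for tab-delimited files.
skiprows: the number of lines to skip, counted from the start of the file. np.loadtxt() accepts only an integer here; passing a list of row indices to skip is supported by pandas.read_csv(), not by np.loadtxt().
usecols: a list of the indices of the columns you want to keep.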
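The effect of the three arguments can be sketched on a small in-memory table. The data below is hypothetical, and io.StringIO is used only so that the example runs without a file on disk.

```python
import io

import numpy as np

# Hypothetical tab-delimited data: one header row and three columns.
raw = "a\tb\tc\n1\t10\t0\n0\t20\t0\n2\t30\t1\n"

# io.StringIO lets np.loadtxt() read from a string as if it were a file.
data = np.loadtxt(
    io.StringIO(raw),
    delimiter="\t",   # fields are separated by tabs
    skiprows=1,       # skip the header row
    usecols=[0, 2],   # keep only the first and third columns
)

print(data.shape)
print(data)
```

The result has three rows and two columns, because the header row and the middle column were dropped.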
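The .txt file we want to import has a header row, consists of strings, and is tab-delimited.
Because of the header, importing it as-is with np.loadtxt() makes Python raise a ValueError saying that a string could not be converted to float.
There are two ways to deal with this. First, set the data type argument dtype to str (for strings).
Alternatively, skip the first row with the skiprows argument: skiprows=1 drops the header row.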
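As a rough illustration of the error and of both workarounds, the sketch below uses a tiny hypothetical table held in a string rather than the tutorial's file.

```python
import io

import numpy as np

# Hypothetical tab-delimited text with a header row.
raw = "Time\tPercent\n99\t0.067\n5\t0.4\n"

# Importing as-is fails: the header cannot be converted to float.
try:
    np.loadtxt(io.StringIO(raw), delimiter="\t")
    header_error = None
except ValueError as exc:
    header_error = exc

# Workaround 1: read every field, including the header, as a string.
as_text = np.loadtxt(io.StringIO(raw), delimiter="\t", dtype=str)

# Workaround 2: skip the header row and keep the default float type.
as_float = np.loadtxt(io.StringIO(raw), delimiter="\t", skiprows=1)

print(type(header_error).__name__)
print(as_text.shape, as_float.shape)
```

With dtype=str the header survives as the first row of a string array, while skiprows=1 discards it and yields numeric values that can be used in calculations directly.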
# Import numpy
import numpy as np
# Assign the filename: file
file = ''
# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])
# Print data
print(data)
[[ 1. 0.]
[ 0. 0.]
[ 1. 0.]
[ 2. 1.]]
np.loadtxt() returns a NumPy array (two-dimensional here), so values can be selected with ordinary array indexing and slicing.
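A brief sketch of such selections follows. The array is typed in by hand from the output printed above rather than loaded from the file.

```python
import numpy as np

# Values copied from the printed output above.
data = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [2.0, 1.0]])

first_column = data[:, 0]    # every row, first column
first_two_rows = data[:2]    # rows 0 and 1, all columns
last_value = data[3, 1]      # row 3, column 1

print(first_column)
print(first_two_rows)
print(last_value)
```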
# Assign filename: file
file = ''
# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)
# Print the data
print(data)
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
# Print the data_float
print(data_float)
[["b'Time'" "b'Percent'"]
["b'99'" "b'0.067'"]
["b'99'" "b'0.133'"]
......
["b'5'" "b'0.214'"]
["b'5'" "b'0.4'"]]
[[ 9.90000000e+01 6.70000000e-02]
[ 9.90000000e+01 1.33000000e-01]
......
[ 5.00000000e+00 2.14000000e-01]
[ 5.00000000e+00 4.00000000e-01]]
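Working with mixed data types
Often you need to import datasets whose columns have different data types; for example, one column may contain strings and another floats. np.loadtxt() does not handle this well with its default settings. Another function, np.genfromtxt(), can handle such structures: if we pass dtype=None, it infers the type of each column. np.recfromcsv() behaves similarly, with delimiter=',', names=True and dtype=None as its defaults.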
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
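Here, the first argument is the filename, the second specifies the delimiter, and the third, names=True, tells the function that the file has a header row. Because the columns have different types, the result is an object called a structured array. Since every element of a NumPy array must have the same type, a structured array gets around this by being one-dimensional: each element of the array is one row of the imported flat file.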
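How a structured array behaves can be sketched with a few hypothetical rows in the style of titanic.csv. The in-memory data and the encoding="utf-8" argument (used so that text columns come back as ordinary strings) are choices made for this illustration.

```python
import io

import numpy as np

# Hypothetical rows with integer, string and float columns.
raw = (
    "PassengerId,Sex,Age,Fare\n"
    "1,male,22.0,7.25\n"
    "2,female,38.0,71.2833\n"
    "3,female,26.0,7.925\n"
)

data = np.genfromtxt(
    io.StringIO(raw),
    delimiter=",",
    names=True,        # take field names from the header row
    dtype=None,        # infer a type for each column
    encoding="utf-8",  # decode text columns to str
)

print(data.shape)        # one-dimensional: one element per row
print(data.dtype.names)  # field names taken from the header
print(data["Fare"])      # select a whole column by field name
print(data[0])           # select a whole row by position
```

Indexing with a field name returns a regular one-dimensional array for that column, while indexing with an integer returns a single record holding one value per field.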
In [2]: np.shape(data)
Out[2]: (10,)
In [5]: print(data)
[(1, 0, 3, b'male', 22.0, 1, 0, b'A/5 21171', 7.25, b'', b'S')
(2, 1, 1, b'female', 38.0, 1, 0, b'PC 17599', 71.2833, b'C85', b'C')
(3, 1, 3, b'female', 26.0, 0, 0, b'STON/O2. 3101282', 7.925, b'', b'S')
(4, 1, 1, b'female', 35.0, 1, 0, b'113803', 53.1, b'C123', b'S')
(5, 0, 3, b'male', 35.0, 0, 0, b'373450', 8.05, b'', b'S')
(6, 0, 3, b'male', nan, 0, 0, b'330877', 8.4583, b'', b'Q')
(7, 0, 1, b'male', 54.0, 0, 0, b'17463', 51.8625, b'E46', b'S')
(8, 0, 3, b'male', 2.0, 3, 1, b'349909', 21.075, b'', b'S')
(9, 1, 3, b'female', 27.0, 0, 2, b'347742', 11.1333, b'', b'S')
(10, 1, 2, b'female', 14.0, 1, 0, b'237736', 30.0708, b'', b'C')]
# Assign the filename: file
file = 'titanic.csv'
# Import file fromcsv: d
d = np.recfromcsv(file,delimiter=',',names=True,dtype=None)
# Print out first three entries of d
print(d[:3])
[(1, 0, 3, b'male', 22.0, 1, 0, b'A/5 21171', 7.25, b'', b'S')
(2, 1, 1, b'female', 38.0, 1, 0, b'PC 17599', 71.2833, b'C85', b'C')
(3, 1, 3, b'female', 26.0, 0, 0, b'STON/O2. 3101282', 7.925, b'', b'S')]
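4. Importing flat files as DataFrames with pandas
Flat files containing columns of different data types can be imported as NumPy arrays. However, the pandas DataFrame is a more suitable structure for such data, and fortunately the pandas functions read_csv() and read_table() make it easy to import files with mixed data types as DataFrames.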
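To show why a DataFrame suits mixed data, the sketch below reads a few hypothetical rows from a string and inspects the result. The column names mimic titanic.csv, but the data is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical comma-delimited data; the first row has an empty Cabin field.
raw = (
    "PassengerId,Sex,Age,Cabin\n"
    "1,male,22.0,\n"
    "2,female,38.0,C85\n"
)

df = pd.read_csv(io.StringIO(raw))

print(df.dtypes)  # each column keeps its own inferred type
print(df)
```

Each column gets its own type, the header row becomes the column labels, and the empty field is read as NaN without any extra arguments.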
# Import pandas as pd
import pandas as pd
# Assign the filename: file
file = 'titanic.csv'
# Read the file into a DataFrame: df
df = pd.read_csv(file)
# View the head of the DataFrame
print(df.head())
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
Customizing pandas imports
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
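sep: the field delimiter, a tab in this example.
comment: the character that marks the start of a comment in the file, '#' in this example.
na_values: a string or list of strings to recognize as NA/NaN, in this example the string 'Nothing'.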
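These arguments can be tried on a small in-memory example. The file content below is hypothetical and exists only to exercise sep, comment and na_values.

```python
import io

import pandas as pd

# Hypothetical tab-delimited text with a comment line and a 'Nothing' marker.
raw = (
    "# this line is a comment\n"
    "name\tscore\n"
    "ann\t1.5\n"
    "bob\tNothing\n"
)

data = pd.read_csv(
    io.StringIO(raw),
    sep="\t",             # fields are separated by tabs
    comment="#",          # ignore everything after a '#'
    na_values="Nothing",  # treat the string 'Nothing' as missing
)

print(data)
```

The comment line is skipped before the header is read, and the 'Nothing' entry becomes NaN, so the score column stays numeric.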