Solving a Simple Fruit Classification Problem with Python
In this article, we will implement several machine learning algorithms in Python using Scikit-learn, one of the most popular machine learning libraries for Python, and use a simple dataset to train a classifier that distinguishes between different types of fruit.
The goal of this article is to identify the machine learning algorithm best suited to the problem at hand; to that end, we will compare several algorithms and select the one that performs best.
The Data
The fruit dataset was created by Dr. Iain Murray of the University of Edinburgh. He bought a few dozen oranges, lemons, and apples of different varieties and recorded measurements of each fruit.
Let's take a look at the first few rows of the data.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_')
fruits.head()
Each row of the dataset represents one piece of fruit, described by the features in the columns.
We have 59 fruits and 7 features in the dataset:
print(fruits.shape)
(59, 7)
There are four types of fruit in the dataset:
print(fruits['fruit_name'].unique())
['apple' 'mandarin' 'orange' 'lemon']
The data is fairly balanced, except for the mandarins. We will just have to work with that.
print(fruits.groupby('fruit_name').size())
import seaborn as sns
sns.countplot(x='fruit_name', data=fruits)  # bar chart of fruit counts per class
plt.show()
Visualization
Box plots of each numeric variable will give us a clearer idea of the distribution of the input variables:
fruits.drop('fruit_label', axis=1).plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False, figsize=(9,9),
title='Box Plot for each input variable')
plt.savefig('fruits_box')
plt.show()
The color_score feature is approximately Gaussian distributed.
import pylab as pl
fruits.drop('fruit_label' ,axis=1).hist(bins=30, figsize=(9,9))
pl.suptitle("Histogram for each numeric input variable")
plt.savefig('fruits_hist')
plt.show()
Some pairs of attributes are correlated (for example, mass and width). This suggests a strong correlation and a predictable relationship between them.
from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']
cmap = cm.get_cmap('gnuplot')
scatter = scatter_matrix(X, c=y, marker='o', s=40, hist_kwds={'bins': 15}, figsize=(9, 9), cmap=cmap)
plt.suptitle('Scatter-matrix for each input variable')
plt.savefig('fruits_scatter_matrix')
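To put a number on the mass/width relationship suggested by the scatter matrix, we can also print the pairwise correlation matrix. This is a small illustrative addition, not part of the original walkthrough:
# Pairwise Pearson correlations between the four numeric features;
# mass and width should show a strong positive coefficient.
print(X.corr())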
Statistical Summary
We can see that the numeric values are not on the same scale, so we will need to apply to the test set the same scaling that we computed from the training set.
Let's create the training and test sets and apply the scaling:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
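Note that the scaler is fitted on the training data only and then reused to transform the test data, so no information from the test set leaks into training. As a quick illustrative check (not in the original article), we can inspect the per-feature ranges the scaler learned:
# Per-feature minima and maxima learned from X_train alone.
print(scaler.data_min_)
print(scaler.data_max_)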
Building Models
Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
.format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
.format(logreg.score(X_test, y_test)))
Accuracy of Logistic regression classifier on training set: 0.70
Accuracy of Logistic regression classifier on test set: 0.40
Decision Tree
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier().fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
Accuracy of Decision Tree classifier on training set: 1.00
Accuracy of Decision Tree classifier on test set: 0.73
K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
.format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
.format(knn.score(X_test, y_test)))
Accuracy of K-NN classifier on training set: 0.95
Accuracy of K-NN classifier on test set: 1.00
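KNeighborsClassifier defaults to n_neighbors=5. If we wanted to check whether another k works better, a minimal sketch (not part of the original article) could sweep k with cross-validation on the training data; cv=3 is used because the mandarin class has very few samples:
from sklearn.model_selection import cross_val_score
# Illustrative sweep: score each candidate k with 3-fold cross-validation
# on the (already scaled) training data.
for k in range(1, 11):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn_k, X_train, y_train, cv=3)
    print('k={:2d}  mean CV accuracy: {:.2f}'.format(k, scores.mean()))
Strictly speaking, the scaling should be refit inside each fold (for example with a Pipeline), but on a dataset this small the difference is negligible.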
Linear Discriminant Analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print('Accuracy of LDA classifier on training set: {:.2f}'
.format(lda.score(X_train, y_train)))
print('Accuracy of LDA classifier on test set: {:.2f}'
.format(lda.score(X_test, y_test)))
Accuracy of LDA classifier on training set: 0.86
Accuracy of LDA classifier on test set: 0.67
Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('Accuracy of GNB classifier on training set: {:.2f}'
.format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
.format(gnb.score(X_test, y_test)))
Accuracy of GNB classifier on training set: 0.86
Accuracy of GNB classifier on test set: 0.67
Support Vector Machine
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)
print('Accuracy of SVM classifier on training set: {:.2f}'
.format(svm.score(X_train, y_train)))
print('Accuracy of SVM classifier on test set: {:.2f}'
.format(svm.score(X_test, y_test)))
Accuracy of SVM classifier on training set: 0.61
Accuracy of SVM classifier on test set: 0.33
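Having fitted all six classifiers, we can also collect the repeated fit-and-score pattern into a single loop to see the comparison at a glance; a minimal sketch reusing the estimators trained above:
# Compare all fitted classifiers side by side on the same train/test split.
models = [('Logistic Regression', logreg), ('Decision Tree', clf),
          ('K-NN', knn), ('LDA', lda), ('GNB', gnb), ('SVM', svm)]
for name, model in models:
    print('{:20s} train: {:.2f}  test: {:.2f}'.format(
        name, model.score(X_train, y_train), model.score(X_test, y_test)))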
The KNN algorithm was the most accurate model that we tried. The confusion matrix indicates that no errors were made on the test set. However, the test set is very small.
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
pred = knn.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
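As a quick sanity check, we can also classify a single made-up fruit. The measurements below are hypothetical, and the example must be scaled with the same scaler before calling predict:
# Hypothetical fruit (invented values): mass, width, height, color_score.
example = pd.DataFrame([[150, 7.3, 7.5, 0.70]], columns=feature_names)
prediction = knn.predict(scaler.transform(example))[0]
print(fruits.loc[fruits['fruit_label'] == prediction, 'fruit_name'].unique()[0])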
Plotting the Decision Boundary of the k-NN Classifier
import numpy as np
import matplotlib.cm as cm
from matplotlib.colors import ListedColormap, BoundaryNorm
import matplotlib.patches as mpatches
from sklearn import neighbors
X = fruits[['mass', 'width', 'height', 'color_score']]
y = fruits['fruit_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
def plot_fruit_knn(X, y, n_neighbors, weights):
    X_mat = X[['height', 'width']].to_numpy()
    y_mat = y.to_numpy()
    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF', '#AFAFAF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF', '#AFAFAF'])
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X_mat, y_mat)
    # Plot the decision boundary by assigning a color in the color map
    # to each mesh point.
    mesh_step_size = .01  # step size in the mesh
    plot_symbol_size = 50
    x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
    y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size),
                         np.arange(y_min, y_max, mesh_step_size))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    # Plot the training points
    plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y_mat, cmap=cmap_bold, edgecolor='black')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    patch0 = mpatches.Patch(color='#FF0000', label='apple')
    patch1 = mpatches.Patch(color='#00FF00', label='mandarin')
    patch2 = mpatches.Patch(color='#0000FF', label='orange')
    patch3 = mpatches.Patch(color='#AFAFAF', label='lemon')
    plt.legend(handles=[patch0, patch1, patch2, patch3])
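To actually draw the boundary, call the function on the (unscaled) training split; k=5 with uniform weights is an illustrative choice:
plot_fruit_knn(X_train, y_train, 5, 'uniform')
plt.show()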