Ex6_Machine Learning_Andrew Ng Course Assignment (Python): SVM (Support Vector Machines)
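Usage notes:
This article is a set of study notes for Andrew Ng's Machine Learning course on Coursera.
Part 1 reviews the key points of the corresponding course week and introduces the libraries imported by the code. Part 2 covers the implementation details of the self-created functions used in the code.
Part 3 contains the code that corresponds to the course exercise itself.
0. Pre-condition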
This section introduces the libraries used by the code.
00. Self-created Functions
This section includes self-created functions.
loadData(path): load the data
plotData(X, y): visualize the data

# This file includes self-created functions used in exercise 6
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re  # regular expressions for e-mail processing
import nltk.stem.porter  # Porter stemming algorithm for English words
from scipy.io import loadmat
from sklearn import svm

# Load data from the given file
# ARGS: { path: path to the data file }
def loadData(path):
    data = loadmat(path)
    return data['X'], data['y']

# Visualize data
# ARGS: { X: training set; y: label set }
def plotData(X, y):
    plt.figure(figsize=[8, 6])
    plt.scatter(X[:, 0], X[:, 1], c=y.flatten())
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.title('Data Visualization')
    # plt.show()

plotBoundary(classifier, X): plot the decision boundary between classes
displayBoundaries(X, y): plot the decision boundaries for different values of the SVM parameter C (linear kernel)
gaussianKernel(x1, x2, sigma): implement the Gaussian kernel function
displayGaussKernelBoundary(X, y, C, sigma): plot the decision boundary of a Gaussian-kernel SVM on a dataset

# Plot the boundary between two classes
# ARGS: { classifier: classifier; X: training set }
def plotBoundary(classifier, X):
    x_min, x_max = X[:, 0].min() * 1.2, X[:, 0].max() * 1.1
    y_min, y_max = X[:, 1].min() * 1.2, X[:, 1].max() * 1.1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                         np.linspace(y_min, y_max, 500))
    # Use the given classifier to predict the class of every grid point
    Z = classifier.predict(np.c_[xx.flatten(), yy.flatten()])
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z)
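
The grid evaluation inside plotBoundary can be illustrated on a tiny grid. The sketch below is only an illustration: `toy_predict` is a hypothetical stand-in for `classifier.predict`, and the 3 x 2 grid replaces the 500 x 500 grid used above.

```python
import numpy as np

# Build a small grid: 3 points along the first axis, 2 along the second
xx, yy = np.meshgrid(np.linspace(0.0, 2.0, 3), np.linspace(0.0, 1.0, 2))

# Flatten the two coordinate arrays into one (n_points, 2) array,
# which is the input shape a classifier's predict method expects
grid_points = np.c_[xx.flatten(), yy.flatten()]

def toy_predict(points):
    # Hypothetical stand-in for classifier.predict: class 1 above the line x0 + x1 = 1.5
    return (points[:, 0] + points[:, 1] > 1.5).astype(int)

# Reshape the flat predictions back to the grid shape so they can be contoured
Z = toy_predict(grid_points).reshape(xx.shape)
print(xx.shape, grid_points.shape)
print(Z)
```

The contour drawn over `Z` then traces the places where the predicted class changes, which is the decision boundary.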

# Display boundaries for different situations with different C (1 and 100)
# ARGS: { X: training set; y: label set }
def displayBoundaries(X, y):
    # Use sklearn's SVC with a linear kernel to build one SVM model per value of C
    models = [svm.SVC(C=C, kernel='linear') for C in [1, 100]]
    # Fit each model on the training set X and labels y to obtain the classifiers
    classifiers = [model.fit(X, y.flatten()) for model in models]
    # Plot titles
    titles = ['SVM Decision Boundary with C = {}'.format(C) for C in [1, 100]]
    # Plot the decision boundary produced by each classifier
    for classifier, title in zip(classifiers, titles):
        plotData(X, y)
        plotBoundary(classifier, X)
        plt.title(title)
    # Show the figures
    plt.show()

# Implement a Gaussian kernel function (could be considered a similarity function
# that measures the distance between a pair of samples)
# ARGS: { x1: sample 1; x2: sample 2; sigma: Gaussian kernel parameter }
def gaussianKernel(x1, x2, sigma):
    return np.exp(-(np.power(x1 - x2, 2).sum() / (2 * np.power(sigma, 2))))
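
To see why the Gaussian kernel behaves like a similarity measure, a minimal pure-Python sketch may help. The name `gaussian_kernel` and the sample points are illustrative assumptions; the formula is the same one gaussianKernel computes.

```python
import math

def gaussian_kernel(x1, x2, sigma):
    """Return exp(-||x1 - x2||^2 / (2 * sigma^2)) for two equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2 * sigma ** 2))

a = [1.0, 2.0]
same = gaussian_kernel(a, a, 1.0)               # identical samples give the maximum, 1.0
near = gaussian_kernel(a, [2.0, 2.0], 1.0)      # distance 1
far = gaussian_kernel(a, [4.0, 2.0], 1.0)       # distance 3, much closer to 0
far_wide = gaussian_kernel(a, [4.0, 2.0], 3.0)  # a larger sigma decays more slowly
print(same, near, far, far_wide)
```

The value is 1 for identical samples and falls toward 0 as they move apart, and sigma controls how quickly that happens.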

# Display the decision boundary using SVM with a Gaussian kernel
# ARGS: { X: training set; y: label set; C: SVM parameter; sigma: Gaussian kernel parameter }
def displayGaussKernelBoundary(X, y, C, sigma):
    # sklearn's RBF kernel is exp(-gamma * ||x1 - x2||^2), so gamma = 1 / (2 * sigma^2)
    gamma = np.power(sigma, -2.) / 2
    # 'rbf' is the radial basis function (Gaussian) kernel
    model = svm.SVC(C=C, kernel='rbf', gamma=gamma)
    classifier = model.fit(X, y.flatten())
    plotData(X, y)
    plotBoundary(classifier, X)
    plt.title('Decision boundary using SVM with a Gaussian Kernel')
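
The conversion from sigma to gamma used above follows from matching the two ways of writing the kernel: sklearn's RBF kernel is exp(-gamma * d^2), while the course writes exp(-d^2 / (2 * sigma^2)). A small sketch, with helper names that are assumptions made for illustration only:

```python
import math

def sigma_to_gamma(sigma):
    # gamma = 1 / (2 * sigma^2), the same quantity as np.power(sigma, -2.) / 2
    return 1.0 / (2.0 * sigma ** 2)

def rbf(sq_dist, gamma):
    # Form used by sklearn's 'rbf' kernel
    return math.exp(-gamma * sq_dist)

def gaussian(sq_dist, sigma):
    # Form used in the course material
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

sigma = 0.1
gamma = sigma_to_gamma(sigma)   # about 50 for sigma = 0.1
sq_dist = 0.04                  # an arbitrary squared distance
print(gamma, rbf(sq_dist, gamma), gaussian(sq_dist, sigma))
```

Both forms give the same value for any squared distance, so passing this gamma to the SVC reproduces the Gaussian kernel with the chosen sigma.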

trainGaussParams(X, y, Xval, yval): compare the errors on the cross-validation set to find the best parameters C and sigma
preprocessEmail(email): preprocess an email
email2TokenList(email): stem the words and remove non-word content, returning a list of words

# Find the parameters 'C' and 'sigma' with the best score (lowest error) on the validation set
# ARGS: { X: training set; y: training labels; Xval: cross-validation set; yval: cross-validation labels }
def trainGaussParams(X, y, Xval, yval):
    C_values = (0.01, 0.03, 0.1, 0.3, 1., 3., 10., 30.)
    sigma_values = C_values
    best_pair, best_score = (0, 0), 0
    for C in C_values:
        for sigma in sigma_values:
            gamma = np.power(sigma, -2.) / 2
            model = svm.SVC(C=C, kernel='rbf', gamma=gamma)
            model.fit(X, y.flatten())
            # score() returns the mean accuracy on the validation set
            this_score = model.score(Xval, yval.flatten())
            if this_score > best_score:
                best_score = this_score
                best_pair = (C, sigma)
    print('Best pair(C, sigma): {}, best score: {}'.format(best_pair, best_score))
    return best_pair[0], best_pair[1]
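
The search above is a plain grid search over 8 x 8 = 64 candidate pairs. The sketch below isolates that loop; `grid_search` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for "fit on the training set, then score on the validation set".

```python
from itertools import product

def grid_search(score_fn, C_values, sigma_values):
    """Return the (C, sigma) pair with the highest score, and that score."""
    best_pair, best_score = None, float('-inf')
    for C, sigma in product(C_values, sigma_values):
        score = score_fn(C, sigma)
        # Strict '>' keeps the first pair found when several pairs tie
        if score > best_score:
            best_pair, best_score = (C, sigma), score
    return best_pair, best_score

values = (0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0)

def toy_score(C, sigma):
    # Hypothetical score surface that peaks at C = 1.0, sigma = 0.1
    return 1.0 - ((C - 1.0) ** 2 + (sigma - 0.1) ** 2)

best_pair, best_score = grid_search(toy_score, values, values)
print(best_pair, best_score)
```

Because the selection uses the validation set rather than the training set, the chosen pair is the one that generalizes best among the candidates tried, not the one that fits the training data most tightly.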

# Preprocess an email
# Performs every step except word stemming and removal of non-words
def preprocessEmail(email):
    # Lower-case the whole text
    email = email.lower()
    # Strip HTML tags: match '<', then any characters other than '<' and '>', then '>'
    email = re.sub(r'<[^<>]+>', ' ', email)
    # Normalize URLs: replace every URL with "httpaddr"
    email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)
    # Normalize e-mail addresses: replace every address with "emailaddr"
    email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)
    # Normalize dollar signs
    email = re.sub(r'[\$]+', 'dollar', email)
    # Normalize numbers
    email = re.sub(r'[\d]+', 'number', email)
    return email
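
As a rough illustration of what these normalization steps do to a message, the sketch below applies the same sequence of substitutions to a made-up string. The function name `normalize_email` and the sample text are assumptions for this example, and the HTML pattern is assumed to be meant to match whole tags such as `<b>`.

```python
import re

def normalize_email(text):
    # Same order of steps as described above: lower-case, tags, URLs, addresses, '$', digits
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)
    text = re.sub(r'[\$]+', 'dollar', text)
    text = re.sub(r'[\d]+', 'number', text)
    return text

raw = 'Contact Bob@Example.com or visit https://example.com/deals for $100 off <b>today</b>'
words = normalize_email(raw).split()
print(words)
```

The order matters: URLs and addresses are replaced before digits, so numbers inside a URL never turn into stray "number" tokens.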

# Conduct word stemming and removal of non-words.
# Here we use the stemmer from the NLTK library, since it is more accurate and efficient.
# Returns the processed words as a list
def email2TokenList(email):
    # Preprocess the email
    email = preprocessEmail(email)
    # Instantiate the stemmer
    stemmer = nltk.stem.porter.PorterStemmer()
    # Split the whole email into separate words
    tokens = re.split(r'[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email)
    # Traverse all the split contents
    token_list = []
    for token in tokens:
        # Remove any non-alphanumeric characters
        token = re.sub('[^a-zA-Z0-9]', '', token)
        # Stem the word
        stemmed_word = stemmer.stem(token)
        # Skip empty strings, which contain no characters
        if not len(token): continue
        # Append the word to the list
        token_list.append(stemmed_word)
    return token_list

email2VocabularyList(email, vocab_list): get the indices of the words that appear in both the email and the vocabulary list
email2FeatureVector(email): extract the features of an email

# Get the indices of words that exist both in the email and the vocabulary list
# ARGS: { email: email; vocab_list: vocabulary list }
def email2VocabularyList(email, vocab_list):
    token = email2TokenList(email)
    index = [i for i in range(len(vocab_list)) if vocab_list[i] in token]
    return index

# Extract features from an email, turning it into a feature vector
# (its length equals the vocabulary size; an entry is 1 if the word occurs in the email, 0 otherwise)
# ARGS: { email: email }
def email2FeatureVector(email):
    # The provided vocabulary list
    df = pd.read_table('../', names=['words'])
    vocab_list = np.asmatrix(df)
    # Same length as the vocabulary list
    feature_vector = np.zeros(len(vocab_list))
    # Set the entry to 1 where the word occurs in the email, leave it 0 otherwise
    vocab_indices = email2VocabularyList(email, vocab_list)
    for i in vocab_indices:
        feature_vector[i] = 1
    return feature_vector

1. Support Vector Machines
In the first half of this exercise, you will be using support vector machines (SVMs) with various example 2D datasets. Experimenting with these datasets will help you gain an intuition of how SVMs work and how to use a Gaussian kernel with SVMs.
In the next half of the exercise, you will be using support vector machines to build a spam classifier.
The functions called here are described in detail in the "Self-created Functions" section at the top of this article.

# 1. Support Vector Machines
path = '../data/ex6data1.mat'
X, y = func.loadData(path)

1.1 Example dataset 1

# 1.1 Example dataset 1
# Visualize the data
func.plotData(X, y)
# Try different values of the parameter C and plot the decision boundary for each
func.displayBoundaries(X, y)

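Data visualization:
Decision boundary (linear kernel, C = 1):
Decision boundary (linear kernel, C = 100):
As the figures above show:
When $C$ is large (that is, $1/\lambda$ is large and $\lambda$ is small), the model penalizes misclassification more heavily: it is stricter, misclassifies fewer samples, and has a smaller margin.
When $C$ is small (that is, $1/\lambda$ is small and $\lambda$ is large), the model penalizes misclassification less: it is more lenient, tolerates some misclassification, and has a larger margin.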
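
The effect of C can be checked by hand on the soft-margin objective, C times the summed hinge loss plus half the squared weight. The sketch below is a toy 1-D illustration with made-up samples and two hand-picked candidate classifiers; it is not the solver sklearn uses.

```python
def svm_objective(w, b, C, samples):
    """Soft-margin objective for the 1-D linear classifier f(x) = w * x + b."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in samples)
    return C * hinge + 0.5 * w * w

# Three positives, three negatives, and one positive outlier at x = -0.5
samples = [(1.0, 1), (2.0, 1), (3.0, 1),
           (-1.0, -1), (-2.0, -1), (-3.0, -1),
           (-0.5, 1)]

wide = (1.0, 0.0)    # margin 1/|w| = 1, but the outlier violates it
narrow = (4.0, 3.0)  # margin 1/|w| = 0.25, fits every sample including the outlier

for C in (1, 100):
    j_wide = svm_objective(wide[0], wide[1], C, samples)
    j_narrow = svm_objective(narrow[0], narrow[1], C, samples)
    print(C, j_wide, j_narrow)
```

With C = 1 the wide-margin candidate has the lower objective (2.0 against 8.0), while with C = 100 the narrow-margin candidate wins (8.0 against 150.5), which matches the behaviour seen in the two decision-boundary plots.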
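1.2 SVM with Gaussian Kernels
To obtain a non-linear decision boundary with an SVM, we first need to implement the Gaussian kernel function. The Gaussian kernel can be thought of as a similarity function that measures the distance between a pair of samples $(x^{(i)}, x^{(j)})$. Note that most SVM libraries automatically add the extra feature $x_0$ and the parameter $\theta_0$, so there is no need to add them manually.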
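
For reference, the Gaussian kernel can be written out as below, together with the arithmetic for the test values $x_1 = (1, 2, 1)$, $x_2 = (0, 4, -1)$, $\sigma = 2$ used in section 1.2.1. The symbol $K$ is notation assumed here for the kernel.

```latex
K\left(x^{(i)}, x^{(j)}\right)
  = \exp\left(-\frac{\lVert x^{(i)} - x^{(j)} \rVert^{2}}{2\sigma^{2}}\right)

% Worked example: x_1 - x_2 = (1, -2, 2)
\lVert x_1 - x_2 \rVert^{2} = 1^{2} + (-2)^{2} + 2^{2} = 9,
\qquad 2\sigma^{2} = 2 \cdot 2^{2} = 8,
\qquad K(x_1, x_2) = e^{-9/8} \approx 0.3247
```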

# 1.2 SVM with Gaussian Kernels
path2 = '../data/ex6data2.mat'
X2, y2 = func.loadData(path2)
path3 = '../data/ex6data3.mat'
df3 = loadmat(path3)
X3, y3, Xval, yval = df3['X'], df3['y'], df3['Xval'], df3['yval']

1.2.1 Gaussian Kernel

# 1.2.1 Gaussian Kernel
res_gaussianKernel = func.gaussianKernel(np.array([1, 2, 1]), np.array([0, 4, -1]), 2.)
print(res_gaussianKernel)  # 0.32465246735834974

1.2.2 Example dataset 2

# 1.2.2 Example dataset 2
# Visualize the data
func.plotData(X2, y2)
# Plot the decision boundary of the Gaussian-kernel SVM on the dataset
func.displayGaussKernelBoundary(X2, y2, C=1, sigma=0.1)

Data visualization:
Decision boundary (Gaussian kernel):

1.2.3 Example dataset 3
