Working with the Iris Dataset in Scikit-learn

Installing Scikit-learn

sudo pip install scipy
sudo pip install scikit-learn

Visualizing the Data

from sklearn import datasets
import matplotlib.pyplot as plt

# Load the built-in iris dataset
iris = datasets.load_iris()
irisFeatures = iris["data"]
irisFeaturesName = iris["feature_names"]
irisLabels = iris["target"]

# Draw a scatter plot of any two feature dimensions
def scatter_plot(dim1, dim2):
    # zip() accepts any number of sequences and yields tuples of their elements
    # Plot each species separately so that it gets its own marker and color
    for t, marker, color in zip(range(3), ">ox", "rgb"):
        plt.scatter(irisFeatures[irisLabels == t, dim1],
                    irisFeatures[irisLabels == t, dim2],
                    marker=marker, c=color)
    dim_meaning = {0: 'sepal length', 1: 'sepal width',
                   2: 'petal length', 3: 'petal width'}
    plt.xlabel(dim_meaning.get(dim1))
    plt.ylabel(dim_meaning.get(dim2))

# One subplot for each of the six feature pairs
plt.subplot(231)
scatter_plot(0, 1)
plt.subplot(232)
scatter_plot(0, 2)
plt.subplot(233)
scatter_plot(0, 3)
plt.subplot(234)
scatter_plot(1, 2)
plt.subplot(235)
scatter_plot(1, 3)
plt.subplot(236)
scatter_plot(2, 3)
plt.show()
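
Before slicing the data by row index, it can help to confirm how the dataset is laid out. The following is a minimal sketch, not part of the original script; the names `shape`, `species`, `counts`, and `row_ranges` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# 150 samples, 4 features each
shape = iris["data"].shape

# Species names in label order (label 0, 1, 2)
species = [str(name) for name in iris["target_names"]]

# Number of samples per label
counts = np.bincount(iris["target"]).tolist()

# First and last row index occupied by each species
row_ranges = {}
for label, name in enumerate(species):
    rows = np.flatnonzero(iris["target"] == label)
    row_ranges[name] = (int(rows[0]), int(rows[-1]))

print(shape, species, counts)
print(row_ranges)
```

The rows are grouped by species, so a slice such as `[50:100]` selects exactly one species.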

A Simple SVM Example

from sklearn import datasets
from sklearn import svm
import numpy as np

# Load the built-in iris dataset
iris = datasets.load_iris()
irisFeatures = iris["data"]
irisFeaturesName = iris["feature_names"]
irisLabels = iris["target"]

# Data preprocessing
# Keep the first 100 samples: 50 Setosa (label 0) and 50 Versicolor (label 1).
# Rows 50-99 are Versicolor; Virginica occupies rows 100-149 and is not used here.
# Training set: 40 Setosa samples and 40 Versicolor samples
# Test set: 10 Setosa samples and 10 Versicolor samples
irisFeatures = irisFeatures[0:100, :]
irisLabels = irisLabels[0:100]
iris_Setosa = irisFeatures[0:50, :]
iris_Setosa_labels = irisLabels[0:50]
iris_Versicolor = irisFeatures[50:100, :]
iris_Versicolor_labels = irisLabels[50:100]
iris_Setosa_train = iris_Setosa[0:40, :]
iris_Setosa_labels_train = iris_Setosa_labels[0:40]
iris_Versicolor_train = iris_Versicolor[0:40, :]
iris_Versicolor_labels_train = iris_Versicolor_labels[0:40]
iris_Setosa_test = iris_Setosa[40:50, :]
iris_Setosa_labels_test = iris_Setosa_labels[40:50]
iris_Versicolor_test = iris_Versicolor[40:50, :]
iris_Versicolor_labels_test = iris_Versicolor_labels[40:50]
iris_Train = np.vstack([iris_Setosa_train, iris_Versicolor_train])
irisLabels_Train = np.hstack([iris_Setosa_labels_train, iris_Versicolor_labels_train])
iris_Test = np.vstack([iris_Setosa_test, iris_Versicolor_test])
irisLabels_Test = np.hstack([iris_Setosa_labels_test, iris_Versicolor_labels_test])

# Support vector classifier
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(iris_Train, irisLabels_Train)
print(clf.predict(iris_Test))
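
The script above prints the predictions but does not compare them with the true labels. The sketch below is one possible way to do that, and it is not the original author's method: it swaps the manual slicing for `train_test_split` with a stratified 80/20 split and scores the result with `accuracy_score`. The names `X`, `y`, `clf_eval`, and `accuracy`, and the choice of `random_state=0`, are assumptions made for this illustration.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Keep the first 100 rows (two species, labels 0 and 1)
X, y = iris["data"][:100], iris["target"][:100]

# Stratified 80/20 split: 40 + 40 training samples, 10 + 10 test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Same hyperparameters as the example above
clf_eval = svm.SVC(gamma=0.001, C=100.)
clf_eval.fit(X_train, y_train)

# Fraction of test samples whose predicted label matches the true label
y_pred = clf_eval.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("test accuracy:", accuracy)
```

A shuffled, stratified split avoids depending on the order of rows within each species, which the fixed `[0:40]` / `[40:50]` slices do.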

Model Persistence

In-Memory Persistence

Serialize the model into an in-memory string object (<type 'str'> in Python 2; bytes in Python 3).

import pickle

s = pickle.dumps(clf)
clf2 = pickle.loads(s)
# predict() expects a 2-D array, so slice the first test sample out as a single row
print(clf2.predict(iris_Test[0:1, :]))

File Persistence

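import os
import joblib  # sklearn.externals.joblib is no longer available in recent scikit-learn releases

# joblib.dump does not create missing directories
os.makedirs('model', exist_ok=True)
joblib.dump(clf, 'model/svm.pkl')
clf2 = joblib.load('model/svm.pkl')
print(clf2.predict(irisFeatures))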
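
To check that a restored model behaves like the original, one option is to compare predictions before and after the round trip. The following is a minimal, self-contained sketch under a few assumptions: it trains a fresh classifier on the first 100 rows, writes to a temporary directory instead of `model/`, and the names `model`, `restored_mem`, `restored_file`, `same_mem`, and `same_file` are introduced here for illustration only.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris["data"][:100], iris["target"][:100]
model = svm.SVC(gamma=0.001, C=100.).fit(X, y)

# In-memory round trip through pickle
restored_mem = pickle.loads(pickle.dumps(model))

# On-disk round trip through joblib, inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svm.pkl")
    joblib.dump(model, path)
    restored_file = joblib.load(path)

# Both restored models should reproduce the original predictions exactly
expected = model.predict(X)
same_mem = np.array_equal(expected, restored_mem.predict(X))
same_file = np.array_equal(expected, restored_file.predict(X))
print(same_mem, same_file)
```

Pickle-based files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.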
