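The complete demo code is available as 6-2.py in this book's GitHub repository.
1. Data collection and cleaning
We use the POP3-related records from the KDD 99 dataset; for a detailed introduction to KDD 99, see Chapter 3: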
4,tcp,pop_3,SF,30,93,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,guess_passwd.
4,tcp,pop_3,SF,28,93,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,guess_passwd.
4,tcp,pop_3,SF,30,93,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,guess_passwd.
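Load the records from the KDD 99 dataset:
def load_kdd99(filename):
    # Read the file line by line and split each record into its comma-separated fields.
    x = []
    with open(filename) as f:
        for line in f:
            line = line.strip('\n')
            line = line.split(',')
            x.append(line)
    return x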
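Select the POP3 records labeled guess_passwd or normal, marking password-guessing records as 1 and normal records as 0:
for x1 in x:
    # Field 41 holds the label and field 2 holds the service name.
    if (x1[41] in ['guess_passwd.', 'normal.']) and (x1[2] == 'pop_3'):
        if x1[41] == 'guess_passwd.':
            y.append(1)
        else:
            y.append(0)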
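The loading and filtering steps can be tried without the full dataset. The following is a minimal sketch, not the code from 6-2.py: it reads from an in-memory string instead of a file, and the helper names `load_records` and `label_pop3` are hypothetical. The first record is the sample shown earlier; the other two are derived from it purely for illustration.

```python
import io

# First sample record from the text (41 features plus the label).
BASE = ("4,tcp,pop_3,SF,30,93,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,"
        "0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,"
        "0.00,0.00,0.00,0.00,0.00,0.00,0.00,guess_passwd.")

# Illustrative input: the real record plus two made-up variants of it.
SAMPLE = "\n".join([
    BASE,                                      # POP3 password guessing
    BASE.replace("guess_passwd.", "normal."),  # made-up normal POP3 record
    BASE.replace("pop_3", "http").replace("guess_passwd.", "normal."),  # made-up non-POP3 record
]) + "\n"


def load_records(stream):
    """Split each non-empty line into its comma-separated fields."""
    return [line.strip().split(",") for line in stream if line.strip()]


def label_pop3(records):
    """Keep POP3 records labeled guess_passwd or normal; return (records, labels)."""
    kept, labels = [], []
    for rec in records:
        if rec[2] == "pop_3" and rec[41] in ("guess_passwd.", "normal."):
            kept.append(rec)
            labels.append(1 if rec[41] == "guess_passwd." else 0)
    return kept, labels


records = load_records(io.StringIO(SAMPLE))
kept, labels = label_pop3(records)
print(len(records), len(kept), labels)
```

Returning the kept records together with their labels keeps the two lists aligned, which matters once the records are turned into feature vectors.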
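2. Feature extraction
Select the network features related to POP3 password cracking, together with the TCP content features, as the sample features:
# For each record that passed the filter above (x1 is that loop's variable),
# keep field 0, fields 4-7 and fields 22-29.
x1 = [x1[0]] + x1[4:8] + x1[22:30]
v.append(x1)
# Convert every selected field from string to float.
for x1 in v:
    v1 = []
    for x2 in x1:
        v1.append(float(x2))
    w.append(v1)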
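To make the slices easier to read, the sketch below attaches names to the 13 selected columns and applies the same selection to the first sample record. The names follow the standard KDD 99 field order; the helper `to_features` and the `FEATURE_NAMES` list are illustrative and do not appear in 6-2.py.

```python
# Names of the selected columns, in the order produced by the slices
# [0], [4:8] and [22:30] (standard KDD 99 field order).
FEATURE_NAMES = [
    "duration",
    "src_bytes", "dst_bytes", "land", "wrong_fragment",
    "count", "srv_count", "serror_rate", "srv_serror_rate",
    "rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate",
]

# First sample record from the text, split into its 42 fields.
RECORD = ("4,tcp,pop_3,SF,30,93,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,"
          "0.00,0.00,0.00,0.00,0.00,0.00,0.00,guess_passwd.").split(",")


def to_features(record):
    """Select the 13 columns used in the text and convert them to floats."""
    selected = [record[0]] + record[4:8] + record[22:30]
    return [float(value) for value in selected]


features = to_features(RECORD)
named = dict(zip(FEATURE_NAMES, features))
print(len(features))
print(named["src_bytes"], named["dst_bytes"], named["same_srv_rate"])
```

The symbolic fields (protocol, service, flag) and the label are skipped, so every selected value converts cleanly with `float`.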
3. Training the model
Instantiate the decision tree classifier:
clf = tree.DecisionTreeClassifier()
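4. Validation
We use 10-fold cross-validation:
# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# on newer versions, cross_val_score lives in sklearn.model_selection.
print(cross_validation.cross_val_score(clf, x, y, n_jobs=-1, cv=10))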
The results are as follows. Accuracy is above 98% in every fold, and the mean across the ten folds exceeds 99%:
[0.98637602 1. 1. 1. 1. 1. 1. 1. 1. 1. ]
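As a quick check of the summary figure, the per-fold scores printed above can be averaged by hand. This is only an arithmetic sketch using the ten reported values.

```python
from statistics import mean

# Per-fold accuracies copied from the output above.
scores = [0.98637602] + [1.0] * 9

mean_accuracy = mean(scores)
worst_fold = min(scores)
print(round(mean_accuracy, 4))  # → 0.9986
print(worst_fold)               # → 0.98637602
```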
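Visualize the trained decision tree:
# cross_val_score fits clones of clf, so fit clf itself before exporting the tree.
clf = clf.fit(x, y)
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("../photo/6/iris-dt.pdf")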
The resulting decision tree is shown in Figure 6-3.
Figure 6-3 Decision tree obtained by training a brute-force attack detector
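For readers on a recent scikit-learn release, the validation and export steps might look like the sketch below. It assumes scikit-learn 0.20 or later, where `cross_val_score` is imported from `sklearn.model_selection`. The data is a made-up toy set with two invented features, not KDD 99, and the script stops at the DOT text so that pydotplus and Graphviz are not required.

```python
from sklearn import tree
from sklearn.model_selection import cross_val_score

# Toy data (NOT KDD 99): 40 "normal" and 40 "guess_passwd" samples.
# The second, made-up feature separates the two classes perfectly.
x = [[30 + i % 5, 0.0] for i in range(40)] + [[30 + i % 5, 1.0] for i in range(40)]
y = [0] * 40 + [1] * 40

clf = tree.DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation; for a classifier, an integer cv uses stratified folds.
scores = cross_val_score(clf, x, y, cv=10)
print(scores.mean())

# cross_val_score trains clones, so fit the estimator before exporting it.
clf.fit(x, y)
dot_data = tree.export_graphviz(
    clf,
    out_file=None,                           # return the DOT source as a string
    feature_names=["src_bytes", "toy_flag"],  # hypothetical names for the toy features
    class_names=["normal", "guess_passwd"],
)
print(dot_data.splitlines()[0])
```

The returned DOT string can then be passed to `pydotplus.graph_from_dot_data` as in the text, or saved to a file and rendered with the Graphviz `dot` command.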