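6.3 Example: Detecting POP3 Brute-Force Attacks with a Decision Tree

The complete demo code is available as 6-2.py in this book's GitHub repository.

1. Data collection and cleaning

We use the POP3-related records from the KDD 99 dataset; see Chapter 3 for a detailed introduction to KDD 99. Each record is one comma-separated line whose last field is the label: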


4,tcp,pop_3,SF,30,93,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,guess_passwd.
4,tcp,pop_3,SF,28,93,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,guess_passwd.
4,tcp,pop_3,SF,30,93,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,guess_passwd.
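
To make the record layout concrete, here is a minimal sketch that splits the first sample line above and prints the fields used later in this section. The variable names are illustrative and do not come from 6-2.py; the column positions assume the standard KDD 99 layout of 41 features followed by the label.

```python
# The first sample record shown above (variable names are illustrative).
record = ("4,tcp,pop_3,SF,30,93,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,"
          "0.00,0.00,0.00,0.00,0.00,guess_passwd.")

fields = record.split(',')

print(len(fields))   # 42: 41 features plus the label
print(fields[2])     # pop_3 (the service column)
print(fields[41])    # guess_passwd. (the label keeps its trailing dot)
```

Note that the label carries a trailing dot, which is why the filtering code compares against 'guess_passwd.' and 'normal.' rather than the bare names.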

Load the data from the KDD 99 dataset:


def load_kdd99(filename):
    # Read the file line by line and split each record into a list of fields
    x = []
    with open(filename) as f:
        for line in f:
            line = line.strip('\n')
            line = line.split(',')
            x.append(line)
    return x
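
As an alternative, the same loading step could be written with Python's standard csv module. The sketch below is only an illustration: the function name load_kdd99_csv is hypothetical, and the two shortened records written to the temporary file are made up so that the example runs without the real dataset.

```python
import csv
import os
import tempfile

def load_kdd99_csv(filename):
    """Hypothetical variant of load_kdd99 built on the csv module."""
    with open(filename, newline='') as f:
        # Drop empty rows so a blank line does not produce an empty record
        return [row for row in csv.reader(f) if row]

# Tiny stand-in file with two shortened, made-up records (not real KDD 99 rows)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("4,tcp,pop_3,SF,guess_passwd.\n\n0,tcp,http,SF,normal.\n")
    path = tmp.name

rows = load_kdd99_csv(path)
os.remove(path)

print(len(rows))     # 2
print(rows[0][-1])   # guess_passwd.
```

Either version returns a list of string lists, so the filtering and feature code that follows works unchanged.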

Keep only the POP3 records labeled guess_passwd or normal, marking brute-force records as 1 and normal records as 0:


y = []
for x1 in x:
    # Column 41 is the label and column 2 is the service
    if ( x1[41] in ['guess_passwd.','normal.'] ) and ( x1[2] == 'pop_3' ):
        if x1[41] == 'guess_passwd.':
            y.append(1)   # brute-force attempt
        else:
            y.append(0)   # normal traffic

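2. Feature extraction

As sample features we select the network features related to POP3 password cracking together with the TCP-level features. Concretely, the code keeps column 0 (duration), columns 4 to 7 (src_bytes, dst_bytes, land, wrong_fragment), and columns 22 to 29 (count through diff_srv_rate):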


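# These two lines run inside the filtering loop, for each record that was kept.
# v and w are lists initialized to [] beforehand.
x1 = [x1[0]] + x1[4:8] + x1[22:30]
v.append(x1)

# After the loop: convert every selected field from string to float
for x1 in v:
    v1 = []
    for x2 in x1:
        v1.append(float(x2))
    w.append(v1)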
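
Because the filtering and feature fragments above belong to the same loop, it can help to see them assembled. The following is a minimal sketch under that assumption; the helper name select_pop3_samples is hypothetical, and the "normal" and "http" records are made-up variants of the sample record, used only so the example runs on its own.

```python
def select_pop3_samples(records):
    """Illustrative helper (hypothetical name): filter records, build features."""
    features, labels = [], []
    for r in records:
        # Keep POP3 records labeled as brute force or normal
        if r[2] == 'pop_3' and r[41] in ('guess_passwd.', 'normal.'):
            labels.append(1 if r[41] == 'guess_passwd.' else 0)
            # duration, columns 4-7, and columns 22-29
            selected = [r[0]] + r[4:8] + r[22:30]
            features.append([float(v) for v in selected])
    return features, labels

attack = ("4,tcp,pop_3,SF,30,93,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,"
          "0.00,0.00,0.00,0.00,0.00,guess_passwd.").split(',')
# Made-up variants of the sample record, for illustration only
normal = attack[:41] + ['normal.']
other = attack[:2] + ['http'] + attack[3:]   # wrong service, must be dropped

X, y = select_pop3_samples([attack, normal, other])
print(len(X), len(X[0]), y)   # 2 13 [1, 0]
```

Each sample ends up with 13 numeric features (1 + 4 + 8), and the non-POP3 record is discarded.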

3. Training

Instantiate the decision tree classifier:


from sklearn import tree
clf = tree.DecisionTreeClassifier()
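
Instantiating the classifier does not train it; training happens when fit is called. The sketch below shows that step on a tiny made-up dataset with two features (duration and src_bytes). The data and the names X_toy and y_toy are assumptions for illustration only, not values from KDD 99.

```python
from sklearn import tree

# Made-up toy data: [duration, src_bytes]; 1 = brute force, 0 = normal
X_toy = [[4, 30], [4, 28], [4, 31], [60, 900], [45, 1200], [80, 700]]
y_toy = [1, 1, 1, 0, 0, 0]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_toy, y_toy)

# Classify two unseen connections
pred = clf.predict([[4, 29], [70, 1000]])
print(pred)   # [1 0]
```

With the real data, the same fit call would receive the 13-feature samples and the 0/1 labels built in the previous step.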

4. Evaluation

We use 10-fold cross-validation:


from sklearn.model_selection import cross_val_score
print(cross_val_score(clf, x, y, n_jobs=-1, cv=10))

The results are shown below. Accuracy is at least 98.6% in every fold and about 99.9% on average:


[0.98637602  1.          1.          1.          1.          1.          1.
 1.          1.          1.        ]
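
The ten values are per-fold accuracies. As a quick check, the summary figures can be computed directly from the printed scores; the list below simply copies the output above.

```python
# Per-fold accuracies copied from the output above
scores = [0.98637602, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

mean = sum(scores) / len(scores)
print(round(mean, 4))   # 0.9986
print(min(scores))      # 0.98637602
```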

To visualize the decision tree, fit the classifier on the full dataset and export it with Graphviz:


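import pydotplus
# cross_val_score trains clones of clf, so fit clf itself before exporting
clf = clf.fit(x, y)
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("../photo/6/iris-dt.pdf")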
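
If Graphviz or pydotplus is not installed, scikit-learn (version 0.21 or later) can also print a fitted tree as plain text with tree.export_text. The sketch below is an illustration on made-up toy data; the feature names and values are assumptions and are not taken from the book's code.

```python
from sklearn import tree

# Made-up toy data: [duration, src_bytes]; 1 = brute force, 0 = normal
X_toy = [[4, 30], [4, 28], [4, 31], [60, 900], [45, 1200], [80, 700]]
y_toy = [1, 1, 1, 0, 0, 0]

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Render the learned split rules as indented text
rules = tree.export_text(clf, feature_names=['duration', 'src_bytes'])
print(rules)
```

The text form shows the same thresholds and leaf classes as the PDF, which is convenient for logging or for quick inspection in a terminal.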

The resulting decision tree is shown in Figure 6-3.

Figure 6-3 Decision tree trained to detect brute-force attacks