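DDoS attacks can inflict heavy losses on a company's online business, with outages lasting hours or even days. Here we use sample data from KDD 99 and try the naive Bayes (NB) algorithm to identify DDoS attacks against Apache (see Figure 7-5). For a detailed introduction to the KDD 99 data, see Chapter 3. The complete demo code is in 7-5.py in this book's GitHub repository.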
Figure 7-5 Data processing workflow for DDoS attacks against Apache
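1. Data collection and data cleaning
Most of the data cleaning has already been done for the KDD 99 data. Each connection in the KDD 99 dataset is described by 41 features, followed by a label: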
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
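The features most relevant to DDoS are:
·Basic features of network connections, see Table 7-1.
·Time-based network traffic statistics, see Table 7-2.
·Host-based network traffic statistics, see Table 7-3.
Table 7-1 KDD 99 basic network connection features related to DDoS
Table 7-2 KDD 99 time-based network traffic statistical features related to DDoS
Table 7-3 KDD 99 host-based network traffic statistical features related to DDoS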
Load the records from the KDD 99 dataset:
def load_kdd99(filename):
    # Read the file and split every line into a list of comma-separated fields.
    x = []
    with open(filename) as f:
        for line in f:
            line = line.strip('\n')
            line = line.split(',')
            x.append(line)
    return x
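To make the record layout concrete, here is a minimal sketch that parses one of the sample records shown earlier and separates the 41 features from the label. It reads from an in-memory string rather than a file, and the helper name `parse_kdd99` is an illustrative assumption, not part of 7-5.py.

```python
import csv
import io

# One record copied from the sample shown earlier in this section.
SAMPLE = (
    "0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,"
    "0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,"
    "0.00,0.00,0.00,0.00,normal.\n"
)


def parse_kdd99(lines):
    """Split each non-empty CSV line into a list of string fields."""
    return [row for row in csv.reader(lines) if row]


records = parse_kdd99(io.StringIO(SAMPLE))
fields = records[0]

# The first 41 fields are features; the last field is the label.
features, label = fields[:41], fields[41]
print(len(fields), label)
```

Note that every field is still a string at this point, including the trailing dot in the label, which is why the later filtering step compares against 'normal.' and 'apache2.'.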
Keep only the records that are labeled apache2 or normal and that carry HTTP traffic (the service field equals http):
if (x1[41] in ['apache2.', 'normal.']) and (x1[2] == 'http'):
    # Label apache2 attacks as 1 and normal traffic as 0.
    if x1[41] == 'apache2.':
        y.append(1)
    else:
        y.append(0)
2. Feature extraction
Select the DDoS-related features as the sample features:
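# Keep field 0 plus the slices 4-7, 22-29 and 31-39 (22 fields in total).
x1 = [x1[0]] + x1[4:8] + x1[22:30] + x1[31:40]
v.append(x1)

# Convert every selected field from string to float.
for x1 in v:
    v1 = []
    for x2 in x1:
        v1.append(float(x2))
    w.append(v1)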
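The filtering and feature selection steps can be combined into one self-contained sketch. It assumes the usual KDD Cup 99 column order (taken from the public kddcup.names listing, not from this section), under which index 0 is duration, 4 to 7 are src_bytes, dst_bytes, land and wrong_fragment, 22 to 29 are the time-based statistics from count to diff_srv_rate, and 31 to 39 are the host-based statistics from dst_host_count to dst_host_rerror_rate. The function names and the `make_record` helper are hypothetical, and the records are synthetic.

```python
# Column positions follow the usual KDD Cup 99 field order (an assumption
# based on the public kddcup.names listing, not stated in this section).
SERVICE, LABEL = 2, 41


def select_ddos_features(record):
    """Return the 22 fields selected in the text, converted to floats."""
    picked = [record[0]] + record[4:8] + record[22:30] + record[31:40]
    return [float(value) for value in picked]


def build_dataset(records):
    """Keep http records labeled apache2. or normal.; the attack is class 1."""
    x, y = [], []
    for record in records:
        if record[LABEL] in ("apache2.", "normal.") and record[SERVICE] == "http":
            x.append(select_ddos_features(record))
            y.append(1 if record[LABEL] == "apache2." else 0)
    return x, y


def make_record(service, label):
    """Hypothetical helper: a synthetic 42-field record for illustration."""
    return ["0", "tcp", service, "SF"] + ["1"] * 37 + [label]


records = [
    make_record("http", "normal."),
    make_record("http", "apache2."),
    make_record("private", "normal."),  # dropped: service is not http
    make_record("http", "back."),       # dropped: label is neither class
]
x, y = build_dataset(records)
print(len(x), len(x[0]), y)
```

Building the label and the feature vector inside the same condition keeps x and y aligned, which is easy to get wrong when the two lists are filled in separate passes.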
3. Training
Instantiate the NB classifier:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
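To show what a Gaussian naive Bayes classifier estimates, the following is a minimal pure-Python sketch of the idea: per-class priors plus a mean and variance for each feature, with prediction by the highest log posterior. It is not scikit-learn's implementation; the function names, the fixed `eps` variance floor and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(x, y, eps=1e-9):
    """Estimate the log prior, feature means and variances for each class."""
    grouped = defaultdict(list)
    for row, label in zip(x, y):
        grouped[label].append(row)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps is a simplified variance floor that avoids division by zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(y)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def score(params):
        log_prior, means, variances = params
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=lambda label: score(model[label]))


# Toy data (made up): class 1 has much larger feature values than class 0.
x = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [9.0, 8.0], [9.5, 8.5], [8.5, 7.5]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(x, y)
print(predict_gaussian_nb(model, [1.1, 2.1]), predict_gaussian_nb(model, [9.2, 8.1]))
```

The "naive" part is the sum over features in `score`: each feature contributes its own one-dimensional Gaussian log density, as if the features were independent given the class.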
4. Validation
We use 10-fold cross-validation:
from sklearn.model_selection import cross_val_score
print(cross_val_score(clf, x, y, n_jobs=-1, cv=10))
The results are shown below. Accuracy is about 99% or higher in every fold, which is quite good.
[ 0.99925094 0.99875156 0.99950062 0.99950062 0.996004 0.9995005 0.997003 0.98975768 0.99975019 0.99925056]
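The ten numbers above are one accuracy score per held-out fold. As a rough illustration of how such a list comes about, here is a minimal sketch of plain k-fold cross-validation with contiguous folds. As far as I know, scikit-learn stratifies the folds by class for classifiers when cv is an integer, which this sketch does not do. The threshold classifier and the toy data are hypothetical stand-ins, not the NB model or the KDD 99 data.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(x, y, fit, predict, k=10):
    """Train on k-1 folds, test on the held-out fold, return k accuracies."""
    scores = []
    for test_idx in k_fold_indices(len(x), k):
        held_out = set(test_idx)
        train_x = [x[i] for i in range(len(x)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        model = fit(train_x, train_y)
        hits = sum(predict(model, x[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


# Hypothetical stand-in classifier: threshold halfway between class means.
def fit_threshold(x, y):
    mean0 = sum(v for v, t in zip(x, y) if t == 0) / y.count(0)
    mean1 = sum(v for v, t in zip(x, y) if t == 1) / y.count(1)
    return (mean0 + mean1) / 2


def predict_threshold(threshold, value):
    return 1 if value > threshold else 0


# Toy data (made up): classes alternate, so every fold contains both.
x = [float(i % 2) * 10 + (i % 5) for i in range(40)]
y = [i % 2 for i in range(40)]
scores = cross_val_accuracy(x, y, fit_threshold, predict_threshold, k=10)
print(scores)
```

Each sample is used for testing exactly once and for training in the other nine rounds, so the spread across the ten scores gives a sense of how stable the model is.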