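KDD stands for Knowledge Discovery and Data Mining. The KDD Cup is an annual competition organized by ACM SIGKDD, as shown in Figure 3-1. The KDD 99 dataset is the dataset used in the 1999 edition of that competition.
Figure 3-1 The KDD Cup competition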
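In 1998 the U.S. Defense Advanced Research Projects Agency (DARPA) ran an intrusion detection evaluation program at MIT Lincoln Laboratory. Lincoln Laboratory built a network environment that simulated a U.S. Air Force local area network and collected 9 weeks of network connection and system audit data, simulating a variety of user types, traffic patterns, and attack techniques so that it behaved like a real network. A network connection is defined as a sequence of TCP packets that starts and ends at well-defined times, during which data flows from a source IP address to a destination IP address under a well-defined protocol. Each connection is labeled either as normal or as an attack. Attacks are divided into 4 major categories covering 39 attack types in total: 22 types appear in the training set, and a further 17 previously unseen types appear only in the test set (see Table 3-2).
Table 3-2 KDD 99 attack types in detail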
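The label scheme described above can be sketched in Python as a lookup from raw label to attack category. The category names (DoS, Probe, R2L, U2R) and the grouping of the 22 training-set labels follow the commonly cited KDD 99 taxonomy; they are an assumption here, not a copy of Table 3-2, and the helper name `categorize` is hypothetical. Labels outside the training set, such as the 17 test-only types, are simply reported as "unknown".

```python
# Assumed grouping of the 22 training-set attack labels into the four
# commonly cited KDD 99 categories (not copied from Table 3-2).
ATTACK_CATEGORIES = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "Probe": ["ipsweep", "nmap", "portsweep", "satan"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop",
            "phf", "spy", "warezclient", "warezmaster"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
}

# Invert the table so a single label can be looked up directly.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORIES.items()
    for label in labels
}


def categorize(raw_label: str) -> str:
    """Map a raw label such as 'smurf.' to 'normal', a category, or 'unknown'."""
    # Labels in the raw files end with a trailing dot, e.g. 'normal.'.
    label = raw_label.strip().rstrip(".")
    if label == "normal":
        return "normal"
    # Attack types that are not in the training-set table fall through here.
    return LABEL_TO_CATEGORY.get(label, "unknown")
```

Returning "unknown" rather than raising keeps the sketch usable on test-set records, where labels that never occur in the training data are expected.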
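Professor Sal Stolfo of Columbia University and Professor Wenke Lee of North Carolina State University then applied data mining and related techniques to this data, performing feature analysis and preprocessing to produce a new dataset. That dataset was used in the 1999 KDD Cup and became the well-known KDD 99 dataset. Although it is now dated, KDD 99 is still widely used as a reference benchmark in network intrusion detection, and it laid the groundwork for intrusion detection research based on computational intelligence.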
Each connection in the KDD 99 dataset is described by 41 features:
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
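The four records above are written in CSV format. Including the final label, each record has 42 fields. The first 41 fields are the features, which fall into 4 categories: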
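· Basic features of TCP connections (see Table 3-3): basic attributes of a connection, such as its duration, protocol type, and number of bytes transferred.
· Content features of TCP connections (see Table 3-4).
· Time-based network traffic statistical features (see Table 3-5).
· Host-based network traffic statistical features (see Table 3-6).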
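Table 3-3 KDD 99 basic features of TCP connections
Table 3-4 KDD 99 content features of TCP connections
Table 3-5 KDD 99 time-based network traffic statistical features
Table 3-6 KDD 99 host-based network traffic statistical features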
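As a minimal sketch of the record layout described above, the following Python splits one CSV record into its 41 features and label, then slices the features into the four groups. The index ranges (9 basic, 13 content, 9 time-based, and 10 host-based features, in that column order) follow the usual KDD 99 column layout and should be read as an assumption; the names `FEATURE_GROUPS` and `parse_record` are hypothetical.

```python
# Assumed 0-based, end-exclusive column ranges for the four feature groups.
FEATURE_GROUPS = {
    "basic": (0, 9),           # duration, protocol type, service, flag, bytes, ...
    "content": (9, 22),        # content features of the connection
    "time_traffic": (22, 31),  # time-based traffic statistics
    "host_traffic": (31, 41),  # host-based traffic statistics
}

# One record from the text, split by group for readability.
SAMPLE = (
    "0,udp,private,SF,105,146,0,0,0,"
    "0,0,0,0,0,0,0,0,0,0,0,0,0,"
    "1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,"
    "255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,"
    "normal."
)


def parse_record(line: str):
    """Return (groups, label) for one comma-separated KDD 99 record."""
    fields = line.strip().split(",")
    if len(fields) != 42:
        raise ValueError(f"expected 42 fields, got {len(fields)}")
    # The first 41 fields are features; the last one is the label,
    # which carries a trailing dot in the raw data.
    features, label = fields[:41], fields[41].rstrip(".")
    groups = {
        name: features[start:end]
        for name, (start, end) in FEATURE_GROUPS.items()
    }
    return groups, label


groups, label = parse_record(SAMPLE)
print(label, {name: len(values) for name, values in groups.items()})
```

The values are kept as strings here. A real pipeline would still need to convert the numeric fields and encode the categorical ones (protocol type, service, flag) before training a model.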