5.5 Example: Detecting Rootkits with the K-Nearest Neighbors Algorithm

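A Rootkit is a special kind of malware whose purpose is to hide itself, along with specified files, processes, network connections, and other information, on the machine where it is installed. Rootkits are most commonly used together with other malicious programs such as Trojans and backdoors. Here, working from the KDD 99 sample data, we try to use the KNN algorithm to identify Rootkit behavior over telnet connections; the detection flow is shown in Figure 5-6. For a detailed introduction to the KDD 99 data, see Chapter 3. The complete demo code is in 5-4.py in this book's GitHub repository.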

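1. Data collection and data cleaning

Most of the data cleaning has already been done for KDD 99. Each connection in the KDD 99 dataset is described by 41 features, followed by a label: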


35,tcp,ftp,SF,96,533,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,221,3,0.01,0.03,0.00,0.00,0.00,0.00,0.00,0.00,rootkit.
0,tcp,ftp_data,SF,116,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,52,0.20,0.03,0.20,0.00,0.20,0.00,0.02,0.00,rootkit.
15,tcp,ftp,SF,45,214,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,226,4,0.02,0.03,0.00,0.00,0.00,0.00,0.00,0.00,rootkit.
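As a minimal sketch of how one such record breaks down, the following splits a line into its 41 feature fields and the trailing label. The helper name `parse_record` is hypothetical and is not part of 5-4.py; the sample string is the first record shown above.

```python
# Hypothetical helper (not from 5-4.py): split one KDD 99 record
# into its 41 feature fields and the trailing label.
def parse_record(line):
    fields = line.strip().split(',')
    features = fields[:41]   # indices 0..40 hold the 41 features
    label = fields[41]       # index 41 holds the label, e.g. 'rootkit.'
    return features, label

sample = ("35,tcp,ftp,SF,96,533,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,1,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,221,3,0.01,0.03,"
          "0.00,0.00,0.00,0.00,0.00,0.00,rootkit.")

features, label = parse_record(sample)
# Index 1 is the transport protocol and index 2 the service field
print(len(features), features[1], features[2], label)
```

Index 2 is the field that the filtering step later compares against 'telnet', and index 41 is the label it compares against 'rootkit.' and 'normal.'.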

Figure 5-6 Rootkit detection flow based on telnet connections

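Of these, the features relevant to Rootkits are mainly the content features of the TCP connection; see Table 5-1.

Table 5-1 KDD 99 TCP connection content features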

Load the data from the KDD 99 dataset:


def load_kdd99(filename):
    # Read the file line by line and split each record on commas
    x=[]
    with open(filename) as f:
        for line in f:
            line=line.strip('\n')
            line=line.split(',')
            x.append(line)
    return x

Keep only the records that are labeled rootkit or normal and that use the telnet service:


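# x1 is one record from the list returned by load_kdd99
if ( x1[41] in ['rootkit.','normal.'] ) and ( x1[2] == 'telnet' ):
    # label Rootkit samples 1 and normal samples 0
    if x1[41] == 'rootkit.':
        y.append(1)
    else:
        y.append(0)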

2. Feature extraction

Select the Rootkit-related features as the sample features:


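    # still inside the filter branch: keep fields 9 to 20
    x1 = x1[9:21]
    v.append(x1)
# convert the selected string fields to floats
for x1 in v :
    v1=[]
    for x2 in x1:
        v1.append(float(x2))
    w.append(v1)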
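The filtering and feature-selection fragments above can be combined into one function. The following is a minimal sketch of that idea, not the book's exact code: the names `select_telnet_samples` and `make_record` are hypothetical, and the records are synthetic placeholders built only to exercise the logic.

```python
# Sketch combining the filtering and feature-selection steps.
# Function names and the in-memory records are illustrative only.
def select_telnet_samples(records):
    features, labels = [], []
    for rec in records:
        # keep telnet connections labeled rootkit or normal
        if rec[41] in ('rootkit.', 'normal.') and rec[2] == 'telnet':
            labels.append(1 if rec[41] == 'rootkit.' else 0)
            # fields 9..20, converted from strings to floats
            features.append([float(v) for v in rec[9:21]])
    return features, labels

def make_record(service, label, content):
    # Build a synthetic 42-field record: 9 leading fields, 12 content
    # fields (indices 9..20), 20 trailing fields, then the label.
    leading = ['0', 'tcp', service, 'SF'] + ['0'] * 5
    return leading + list(content) + ['0'] * 20 + [label]

records = [
    make_record('telnet', 'rootkit.', ['0', '0', '1'] + ['0'] * 9),
    make_record('telnet', 'normal.', ['0'] * 12),
    make_record('ftp', 'rootkit.', ['0'] * 12),      # dropped: not telnet
    make_record('telnet', 'neptune.', ['0'] * 12),   # dropped: other label
]

x, y = select_telnet_samples(records)
print(len(x), y)
```

Only the first two synthetic records pass the filter, so `x` holds two 12-element float vectors and `y` holds their labels.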

3. Training on the samples

Instantiate the KNN algorithm with the number of neighbors set to 3:


from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
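To illustrate what a vote among 3 neighbors means, here is a minimal pure-Python sketch of the idea behind K-nearest neighbors. It is not scikit-learn's implementation; the function name `knn_predict` and the toy points are made up for illustration, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    # Euclidean distance from the query to every training point
    dists = []
    for point, label in zip(train_x, train_y):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query)))
        dists.append((d, label))
    # Keep the k closest points and take a majority vote on their labels
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Made-up 2-D toy points: label 0 clusters near the origin,
# label 1 clusters near (5, 5)
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_x, train_y, [0.5, 0.5])
near_five = knn_predict(train_x, train_y, [5.5, 5.5])
print(near_origin, near_five)
```

With two classes and an odd k such as 3, the vote cannot end in a tie, which is one practical reason to prefer an odd neighbor count for binary problems.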

4. Validating the results

We use 10-fold cross-validation.


from sklearn.model_selection import cross_val_score
print(cross_val_score(clf, x, y, n_jobs=-1, cv=10))
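To make the fold mechanics concrete, the sketch below shows how a plain k-fold scheme partitions sample indices into training and test sets. It is a simplified illustration with a hypothetical function name, `k_fold_splits`; scikit-learn uses stratified folds for classifiers when `cv` is an integer, which this sketch does not attempt to reproduce. The sample count 23 is arbitrary.

```python
def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for plain, unstratified k-fold."""
    # Fold sizes differ by at most one; the first `extra` folds are larger
    base, extra = divmod(n_samples, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

splits = list(k_fold_splits(23, k=10))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

Each sample lands in exactly one test fold, so every sample is used for testing once and for training in the other nine rounds.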

The results are as follows. Accuracy averages about 96% across the ten folds, with the weakest fold at about 78%.


[ 0.9         0.9         1.          1.          1.          0.77777778
  1.          1.          1.          1.        ]
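As a quick arithmetic check, the per-fold scores printed above can be summarized with a few lines of Python. The list below simply copies the ten reported values.

```python
# Per-fold accuracies copied from the output above
scores = [0.9, 0.9, 1.0, 1.0, 1.0, 0.77777778, 1.0, 1.0, 1.0, 1.0]

mean_score = sum(scores) / len(scores)
worst_score = min(scores)
perfect_folds = sum(1 for s in scores if s == 1.0)

print(round(mean_score, 4), round(worst_score, 4), perfect_folds)
```

The mean works out to roughly 0.96, with seven of the ten folds classified perfectly and a single fold pulling the average down.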