9.4.1 Data Collection and Cleaning

For the experiment, we collected the following data:

· 1,000 cryptolocker domains;

· 1,000 post-tovar-goz domains;

· the top 1,000 alexa domains.

Each line of a DGA file has the following format:


xsxqeadsbgvpdke.co.uk,Domain used by Cryptolocker - Flashback DGA for 13 Apr 2017,2017-04-13,http://osint.bambenekconsulting.com/manual/cl.txt

The domain names are extracted from a DGA file as follows:


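def load_dga(filename):
    # MIN_LEN (minimum domain length) must be defined at module level.
    domain_list = []
    # Sample line:
    # xsxqeadsbgvpdke.co.uk,Domain used by Cryptolocker - Flashback DGA for 13 Apr 2017,2017-04-13,
    # http://osint.bambenekconsulting.com/manual/cl.txt
    with open(filename) as f:
        for line in f:
            # The domain is the first comma-separated field.
            domain = line.split(",")[0]
            # Compare the length of the domain, not the string itself.
            if len(domain) >= MIN_LEN:
                domain_list.append(domain)
    return domain_list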
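
The loader above can be exercised on a small file. The following is a minimal sketch rather than the book's own code: the name load_dga_lenient, the value MIN_LEN = 10, the "#" comment line, and the second (made-up) sample record are all assumptions for illustration. It additionally skips blank lines and comment lines, which a downloaded feed may contain.

```python
import os
import tempfile

MIN_LEN = 10  # assumed threshold, for illustration only


def load_dga_lenient(filename, min_len=MIN_LEN):
    """Return first-column domains, skipping blank lines and '#' comments."""
    domains = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # The domain is the first comma-separated field.
            domain = line.split(",")[0].strip()
            if len(domain) >= min_len:
                domains.append(domain)
    return domains


# First record is the sample line from the text; the second is invented
# to show the length filter dropping a short domain.
sample = (
    "# hypothetical header comment\n"
    "xsxqeadsbgvpdke.co.uk,Domain used by Cryptolocker - Flashback DGA "
    "for 13 Apr 2017,2017-04-13,"
    "http://osint.bambenekconsulting.com/manual/cl.txt\n"
    "\n"
    "abc.com,made-up short record,2017-04-13,"
    "http://osint.bambenekconsulting.com/manual/cl.txt\n"
)

# Write the sample to a temporary file so the loader reads from disk.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
try:
    tmp.write(sample)
    tmp.close()
    dga_domains = load_dga_lenient(tmp.name)
finally:
    os.remove(tmp.name)

print(dga_domains)
```

Only the 21-character sample domain passes the length filter; the comment line, the blank line, and the short record are dropped.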

The alexa file stores each domain's rank and name in CSV format. The domains are extracted as follows:


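import csv

def load_alexa(filename):
    # MIN_LEN (minimum domain length) must be defined at module level.
    domain_list = []
    with open(filename) as f:
        # Each row is "rank,domain"; the domain is the second column.
        for row in csv.reader(f):
            domain = row[1]
            # Compare the length of the domain, not the string itself.
            if len(domain) >= MIN_LEN:
                domain_list.append(domain)
    return domain_list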
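
The same column extraction can be tried without a file on disk. This is a hedged sketch, not the book's code: read_alexa_rows is a hypothetical helper, MIN_LEN = 10 is an assumed value, and the three CSV rows are made-up samples in the "rank,domain" layout described above.

```python
import csv
import io

MIN_LEN = 10  # assumed threshold, for illustration only


def read_alexa_rows(lines, min_len=MIN_LEN):
    """Return second-column domains from an iterable of 'rank,domain' lines."""
    domains = []
    for row in csv.reader(lines):
        # Skip blank or malformed rows that lack a domain column.
        if len(row) < 2:
            continue
        domain = row[1].strip()
        if len(domain) >= min_len:
            domains.append(domain)
    return domains


# Made-up sample rows; the trailing blank line tests the malformed-row guard.
sample_csv = io.StringIO("1,google.com\n2,youtube.com\n3,qq.com\n\n")
alexa_domains = read_alexa_rows(sample_csv)
print(alexa_domains)
```

With this threshold, qq.com (6 characters) is filtered out while the two longer names are kept. Accepting any iterable of lines, rather than a filename, makes the helper easy to test with io.StringIO.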