I want to load the NSL-KDD dataset used in the notebook linked below, using Python.
https://github.com/smellslikeml/deepIDS/blob/master/deep_IDS.ipynb
In this dataset, the 22 features for the training and testing data are classified into 5 separate classes (Normal, DOS, U2R, R2L, Probe).
But when I run the line y_test = pd.get_dummies(y_test), the test labels are not categorized into 5 classes; the output still shows the same 22 features. The same step on the training data (target = pd.get_dummies(target)) gives the correct result, so I expected it to work for the test data as well. My code is below.
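For context, here is a minimal sketch of what pd.get_dummies does with a label column: it creates one column per distinct value, so the number of output columns equals the number of distinct labels still present in the Series. The label values below are made up for illustration and are not read from the dataset files.

```python
import pandas as pd

# Toy labels (hypothetical): three distinct values -> three dummy columns
labels = pd.Series(["normal", "dos", "probe", "dos", "normal"])

dummies = pd.get_dummies(labels)

# One column per distinct label
print(list(dummies.columns))
print(dummies.shape)
```

So if the output has more than 5 columns, the Series presumably still holds more than 5 distinct label values at that point.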
import numpy as np
import pandas as pd

with open('G:/RUN_PYTHON/kddcup.names.txt', 'r') as infile:
    kdd_names = infile.readlines()
kdd_cols = [x.split(':')[0] for x in kdd_names[1:]]
# The Train+/Test+ datasets include sample difficulty rating and the attack class
kdd_cols += ['class', 'difficulty']
kdd = pd.read_csv('G:/RUN_PYTHON/KDDTrain+.txt', names=kdd_cols)
kdd_t = pd.read_csv('G:/RUN_PYTHON/KDDTest+.txt', names=kdd_cols)
#kdd = pd.read_csv('G:/RUN_PYTHON/kddcup.txt.data_10_percent_corrected', names=kdd_cols)
#kdd_t = pd.read_csv('G:/RUN_PYTHON/kddcup.testdata.unlabeled_10_percent', names=kdd_cols)
# Consult the linked references for attack categories:
# https://www.researchgate.net/post/What_are_the_attack_types_in_the_NSL-KDD_TEST_set_For_example_processtable_is_a_attack_type_in_test_set_Im_wondering_is_it_prob_DoS_R2L_U2R
# The traffic can be grouped into 5 categories: Normal, DOS, U2R, R2L, Probe
# or more coarsely into Normal vs Anomalous for the binary classification task
kdd_cols = [kdd.columns[0]] + sorted(list(set(kdd.protocol_type.values))) + sorted(list(set(kdd.service.values))) + sorted(list(set(kdd.flag.values))) + kdd.columns[4:].tolist()
attack_map = [x.strip().split() for x in open('G:/RUN_PYTHON/training_attack_types.txt', 'r')]
attack_map = {x[0]: x[1] for x in attack_map if x}
# Here we opt for the 5-class problem
kdd['class'] = kdd['class'].replace(attack_map)
kdd_t['class'] = kdd_t['class'].replace(attack_map)
def cat_encode(df, col):
    return pd.concat([df.drop(col, axis=1), pd.get_dummies(df[col].values)], axis=1)
def log_trns(df, col):
    return df[col].apply(np.log1p)
cat_lst = ['protocol_type', 'service', 'flag']
for col in cat_lst:
    kdd = cat_encode(kdd, col)
    kdd_t = cat_encode(kdd_t, col)
log_lst = ['duration', 'src_bytes', 'dst_bytes']
for col in log_lst:
    kdd[col] = log_trns(kdd, col)
    kdd_t[col] = log_trns(kdd_t, col)
kdd = kdd[kdd_cols]
for col in kdd_cols:
    if col not in kdd_t.columns:
        kdd_t[col] = 0
kdd_t = kdd_t[kdd_cols]
# Now we have used one-hot encoding and log scaling
difficulty = kdd.pop('difficulty')
target = kdd.pop('class')
y_diff = kdd_t.pop('difficulty')
y_test = kdd_t.pop('class')
target = pd.get_dummies(target)
print(target)
y_test = pd.get_dummies(y_test)
print(y_test)
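One thing that may be worth checking is whether every label in the test set actually appears as a key in attack_map. Series.replace leaves any value it has no mapping for unchanged, and pd.get_dummies would then create a separate column for each such value. The sketch below uses a hypothetical toy mapping and toy labels (toy_map, train_labels and test_labels are stand-ins, not the real files). It lists the labels the map does not cover and then aligns the test columns to the training columns with reindex.

```python
import pandas as pd

# Hypothetical toy mapping and labels, standing in for attack_map
# and the 'class' columns of the train/test frames
toy_map = {"neptune": "dos", "smurf": "dos", "satan": "probe"}
train_labels = pd.Series(["normal", "neptune", "satan", "smurf"])
test_labels = pd.Series(["normal", "neptune", "mscan", "satan"])  # 'mscan' is not a key in toy_map

train_mapped = train_labels.replace(toy_map)
test_mapped = test_labels.replace(toy_map)

# Labels left in the test data that are not one of the expected classes
known = set(toy_map.values()) | {"normal"}
unmapped = sorted(set(test_mapped) - known)
print(unmapped)

target = pd.get_dummies(train_mapped)
y_test = pd.get_dummies(test_mapped)
print(list(y_test.columns))  # has an extra column for the unmapped label

# Align test columns to the training columns: extra columns are dropped,
# columns missing from the test set are added and filled with 0
y_test_aligned = y_test.reindex(columns=target.columns, fill_value=0)
print(list(y_test_aligned.columns))
```

Note that after the reindex, rows whose label was unmapped end up as all zeros. If the real data does contain unmapped labels, adding them to the mapping (using the categories from the reference linked in the code comments) is probably the better fix than dropping their columns.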