Pre-processing the NSL-KDD dataset

Question

I want to load the NSL-KDD dataset used in the notebook at this link, using Python.

https://github.com/smellslikeml/deepIDS/blob/master/deep_IDS.ipynb

In this dataset, the 22 features in the training and test data are grouped into 5 separate classes (Normal, DOS, U2R, R2L, Probe).
But when I run the line y_test = pd.get_dummies(y_test), the result is not categorized into 5 classes; it still shows me the same 22 features. I did the same thing for the training data (target = pd.get_dummies(target)) and got the correct result, so the problem occurs only with the test data.
import numpy as np
import pandas as pd

with open('G:/RUN_PYTHON/kddcup.names.txt', 'r') as infile:
    kdd_names = infile.readlines()
kdd_cols = [x.split(':')[0] for x in kdd_names[1:]]

The Train+/Test+ datasets include sample difficulty rating and the attack class

kdd_cols += ['class', 'difficulty']

kdd = pd.read_csv('G:/RUN_PYTHON/KDDTrain+.txt', names=kdd_cols)
kdd_t = pd.read_csv('G:/RUN_PYTHON/KDDTest+.txt', names=kdd_cols)
#kdd = pd.read_csv('G:/RUN_PYTHON/kddcup.txt.data_10_percent_corrected', names=kdd_cols)
#kdd_t = pd.read_csv('G:/RUN_PYTHON/kddcup.testdata.unlabeled_10_percent', names=kdd_cols)

Consult the linked references for attack categories:

https://www.researchgate.net/post/What_are_the_attack_types_in_the_NSL-KDD_TEST_set_For_example_processtable_is_a_attack_type_in_test_set_Im_wondering_is_it_prob_DoS_R2L_U2R

The traffic can be grouped into 5 categories: Normal, DOS, U2R, R2L, Probe

or more coarsely into Normal vs Anomalous for the binary classification task

kdd_cols = [kdd.columns[0]] + sorted(list(set(kdd.protocol_type.values))) + sorted(list(set(kdd.service.values))) + sorted(list(set(kdd.flag.values))) + kdd.columns[4:].tolist()
attack_map = [x.strip().split() for x in open('G:/RUN_PYTHON/training_attack_types.txt', 'r')]
attack_map = {x[0]: x[1] for x in attack_map if x}

Here we opt for the 5-class problem

kdd['class'] = kdd['class'].replace(attack_map)
kdd_t['class'] = kdd_t['class'].replace(attack_map)
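A note on the replace step above: Series.replace with a dict leaves any value that is not a key of the dict unchanged. If the test file contains attack names that are missing from training_attack_types.txt (KDDTest+ is generally described as including attack types that do not occur in the training set), those names would survive as their own classes and later produce extra one-hot columns. This is offered as a possible explanation, not something confirmed in this thread. The sketch below uses made-up toy labels and a hypothetical helper, unmapped_labels, to show how such leftovers could be spotted.

```python
import pandas as pd

# Toy stand-in for the mapping read from training_attack_types.txt (assumed contents).
attack_map = {"neptune": "dos", "smurf": "dos", "ipsweep": "probe"}

# Toy test labels; "mscan" plays the role of an attack name missing from the map.
y = pd.Series(["normal", "neptune", "mscan", "ipsweep", "smurf"])


def unmapped_labels(labels, mapping, keep=("normal",)):
    """Hypothetical helper: values that are neither keys of `mapping` nor in `keep`."""
    known = set(mapping) | set(keep)
    return sorted(set(labels) - known)


mapped = y.replace(attack_map)            # "mscan" passes through unchanged
leftover = unmapped_labels(y, attack_map)

print(mapped.tolist())
print(leftover)
```

If the leftover list is not empty for kdd_t['class'], adding those names to attack_map with the appropriate category before calling replace should bring the test labels down to the intended classes.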

def cat_encode(df, col):
    return pd.concat([df.drop(col, axis=1), pd.get_dummies(df[col].values)], axis=1)

def log_trns(df, col):
    return df[col].apply(np.log1p)

cat_lst = ['protocol_type', 'service', 'flag']
for col in cat_lst:
    kdd = cat_encode(kdd, col)
    kdd_t = cat_encode(kdd_t, col)

log_lst = ['duration', 'src_bytes', 'dst_bytes']
for col in log_lst:
    kdd[col] = log_trns(kdd, col)
    kdd_t[col] = log_trns(kdd_t, col)

kdd = kdd[kdd_cols]
for col in kdd_cols:
    if col not in kdd_t.columns:
        kdd_t[col] = 0
kdd_t = kdd_t[kdd_cols]

Now we have used one-hot encoding and log scaling
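The column-alignment loop above (adding any one-hot column missing from the test frame as zeros, then reordering to match the training frame) can also be expressed with DataFrame.reindex. A minimal sketch on toy frames; the column names and values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

train = pd.DataFrame({"duration": [0, 2], "tcp": [1, 0], "udp": [0, 1]})
test = pd.DataFrame({"duration": [5], "tcp": [1]})  # toy test set with no "udp" rows

# Use the training columns as the reference layout; columns absent from
# the test frame are created and filled with 0.
test_aligned = test.reindex(columns=train.columns, fill_value=0)

print(test_aligned)
```

Note that reindex also drops any column that exists only in the test frame, which matches what the final selection kdd_t[kdd_cols] does.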

difficulty = kdd.pop('class' if False else 'difficulty')
target = kdd.pop('class')
y_diff = kdd_t.pop('difficulty')
y_test = kdd_t.pop('class')

target = pd.get_dummies(target)
print(target)
y_test = pd.get_dummies(y_test)
print(y_test)
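If the goal is the same fixed set of five output columns for both target and y_test, one option is to declare the categories explicitly before calling pd.get_dummies. This is an assumption about the desired result, not something taken from the linked notebook, and the lower-case class names below are assumed; they should be adjusted to whatever the mapping file actually produces.

```python
import pandas as pd

classes = ["dos", "normal", "probe", "r2l", "u2r"]  # assumed class names

# Toy labels in which "r2l" and "u2r" happen not to occur.
y_test = pd.Series(["normal", "dos", "probe", "dos"])

# A categorical with fixed categories makes get_dummies emit one column
# per declared class, including classes absent from this sample.
y_cat = pd.Series(pd.Categorical(y_test, categories=classes))
y_onehot = pd.get_dummies(y_cat)

print(y_onehot.columns.tolist())
```

Any label outside the declared list becomes a missing value in the categorical and ends up as an all-zero row, so unmapped attack names still need to be mapped to one of the five classes first.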

MD · Answer 1 · May 13, 2020

Hi @arezoo,

I don't know why y_test = pd.get_dummies(y_test) is not giving you the proper output, but you can do this task another way, like this:

y_test = kdd['class']
Y_test = pd.get_dummies(y_test)

It will give you the categorical output.

answered May 13, 2020 by MD

Hi @MD,

When I replace the line y_test = pd.get_dummies(y_test) with

y_test = kdd['class']
Y_test = pd.get_dummies(y_test)

I received this error:

y_test = kdd['class']
Traceback (most recent call last):

File "<ipython-input-19-b667b206b69b>", line 1, in <module>
y_test = kdd['class']

File "C:\Python\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1964, in __getitem__
return self._getitem_column(key)

File "C:\Python\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1971, in _getitem_column
return self._get_item_cache(key)

File "C:\Python\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1645, in _get_item_cache
values = self._data.get(item)

File "C:\Python\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3590, in get
loc = self.items.get_loc(item)

File "C:\Python\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2444, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))

File "pandas\_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc

File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc

File "pandas\_libs\hashtable_class_helper.pxi", line 1210, in pandas._libs.hashtable.PyObjectHashTable.get_item

File "pandas\_libs\hashtable_class_helper.pxi", line 1218, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 'class'

Could you please help me?

commented May 14, 2020 by arezoo

I suggest building your model in a Jupyter notebook and running the code step by step; that makes it easier to troubleshoot the issue. Once the code runs correctly, save it as a .py file.

Regarding the error, the above code should work: kdd is a DataFrame, and we can select a single column from it.
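One likely reason for the KeyError reported above: the question's code calls kdd.pop('class') before this point, and DataFrame.pop returns the column while also removing it from the frame. After that call, kdd['class'] no longer exists, and the labels live only in the variables returned by pop (target and y_test in the question). A toy reproduction, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"duration": [0, 1], "class": ["normal", "dos"]})

labels = df.pop("class")  # returns the column AND deletes it from df

try:
    df["class"]           # the column is gone, so this lookup fails
    raised = False
except KeyError:
    raised = True

print(raised)
print(labels.tolist())
```

Note also that kdd holds the training data in the question, so even before the pop, kdd['class'] would have given the training labels rather than the test labels.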