Pre-processing the NSL-KDD data set

0 votes

I want to load and pre-process the NSL-KDD dataset in Python, following the notebook at this link:

https://github.com/smellslikeml/deepIDS/blob/master/deep_IDS.ipynb

In this dataset, 22 features of the training and testing data are classified into 5 separate classes (Normal, DOS, U2R, R2L, Probe).
But when I run the line y_test = pd.get_dummies(y_test), the result is not categorized into 5 classes; it shows me the same 22 features. I did the same thing for the training data (target = pd.get_dummies(target)) and got the correct result, but it does not work when I use it for the test data.
import numpy as np
import pandas as pd

with open('G:/RUN_PYTHON/kddcup.names.txt', 'r') as infile:
    kdd_names = infile.readlines()
kdd_cols = [x.split(':')[0] for x in kdd_names[1:]]

The Train+/Test+ datasets include sample difficulty rating and the attack class

kdd_cols += ['class', 'difficulty']

kdd = pd.read_csv('G:/RUN_PYTHON/KDDTrain+.txt', names=kdd_cols)
kdd_t = pd.read_csv('G:/RUN_PYTHON/KDDTest+.txt', names=kdd_cols)
#kdd = pd.read_csv('G:/RUN_PYTHON/kddcup.txt.data_10_percent_corrected', names=kdd_cols)
#kdd_t = pd.read_csv('G:/RUN_PYTHON/kddcup.testdata.unlabeled_10_percent', names=kdd_cols)

Consult the linked references for attack categories:

https://www.researchgate.net/post/What_are_the_attack_types_in_the_NSL-KDD_TEST_set_For_example_processtable_is_a_attack_type_in_test_set_Im_wondering_is_it_prob_DoS_R2L_U2R

The traffic can be grouped into 5 categories (Normal, DOS, U2R, R2L, Probe), or more coarsely into Normal vs Anomalous for the binary classification task

kdd_cols = [kdd.columns[0]] + sorted(list(set(kdd.protocol_type.values))) + sorted(list(set(kdd.service.values))) + sorted(list(set(kdd.flag.values))) + kdd.columns[4:].tolist()
attack_map = [x.strip().split() for x in open('G:/RUN_PYTHON/training_attack_types.txt', 'r')]
attack_map = {x[0]: x[1] for x in attack_map if x}

Here we opt for the 5-class problem

kdd['class'] = kdd['class'].replace(attack_map)
kdd_t['class'] = kdd_t['class'].replace(attack_map)

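def cat_encode(df, col):
    # Replace a categorical column with its one-hot encoded columns
    return pd.concat([df.drop(col, axis=1), pd.get_dummies(df[col].values)], axis=1)

def log_trns(df, col):
    # Apply log(1 + x) to a numeric column
    return df[col].apply(np.log1p)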

cat_lst = ['protocol_type', 'service', 'flag']
for col in cat_lst:
    kdd = cat_encode(kdd, col)
    kdd_t = cat_encode(kdd_t, col)

log_lst = ['duration', 'src_bytes', 'dst_bytes']
for col in log_lst:
    kdd[col] = log_trns(kdd, col)
    kdd_t[col] = log_trns(kdd_t, col)

kdd = kdd[kdd_cols]
for col in kdd_cols:
    if col not in kdd_t.columns:
        kdd_t[col] = 0
kdd_t = kdd_t[kdd_cols]

Now we have used one-hot encoding and log scaling
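To make the encoding and scaling steps above easier to check, here is a minimal sketch on two tiny made-up frames. The frames `train` and `test` and their values are assumptions for illustration only, and the alignment uses `reindex` as an alternative to the loop in the question, not as the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test frames (made-up values).
train = pd.DataFrame({"duration": [0, 9, 99], "flag": ["SF", "S0", "REJ"]})
test = pd.DataFrame({"duration": [0, 999], "flag": ["SF", "RSTO"]})

def cat_encode(df, col):
    # Replace one categorical column with its one-hot indicator columns.
    dummies = pd.get_dummies(df[col], dtype=int)
    return pd.concat([df.drop(col, axis=1), dummies], axis=1)

train_enc = cat_encode(train, "flag")
test_enc = cat_encode(test, "flag")

# log1p(x) = log(1 + x), so a duration of zero stays at zero.
train_enc["duration"] = np.log1p(train_enc["duration"])
test_enc["duration"] = np.log1p(test_enc["duration"])

# Align the test frame to the training layout: columns missing from the
# test frame are added as 0, columns unseen in training ("RSTO") are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

After the `reindex` call both frames have the same columns in the same order, which is what the `kdd_t[col] = 0` loop in the question achieves.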

difficulty = kdd.pop('difficulty')
target = kdd.pop('class')
y_diff = kdd_t.pop('difficulty')
y_test = kdd_t.pop('class')

target = pd.get_dummies(target)
print(target)
y_test = pd.get_dummies(y_test)
print(y_test)

May 13, 2020 by arezoo
• 220 points

edited Jun 25, 2020 by MD 1,075 views

1 answer to this question.

0 votes

Hi @arezoo,

I don't know why y_test = pd.get_dummies(y_test) is not giving you the proper output, but you can do this task another way, like this:

y_test = kdd['class']
Y_test = pd.get_dummies(y_test)

It will give you the categorical output.
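A possible explanation for the extra columns, offered as a guess rather than a confirmed diagnosis: `pd.get_dummies` creates one column per distinct label, and `replace` leaves any attack name that is missing from `attack_map` unchanged. If the test file contains attack names that the map built from training_attack_types.txt does not cover, those names survive as extra classes. The sketch below shows the effect; the map excerpt and the labels are made up for illustration, and the lowercase category names are an assumption.

```python
import pandas as pd

# Hypothetical excerpt of the name -> category map; the real one is read
# from training_attack_types.txt in the question.
attack_map = {"neptune": "dos", "smurf": "dos", "ipsweep": "probe",
              "guess_passwd": "r2l", "rootkit": "u2r"}

# Made-up test labels; "mscan" and "processtable" stand in for attack
# names that are absent from the map.
y_test = pd.Series(["normal", "neptune", "mscan", "rootkit", "processtable"])

mapped = y_test.replace(attack_map)

# replace() leaves unknown names untouched, so they show up as extra classes.
unmapped = sorted(set(mapped) - set(attack_map.values()) - {"normal"})
print(unmapped)

# Reindexing on a fixed class list keeps exactly these five columns;
# labels outside the list end up as all-zero rows.
classes = ["normal", "dos", "probe", "r2l", "u2r"]
y_onehot = pd.get_dummies(mapped, dtype=int).reindex(columns=classes, fill_value=0)
print(y_onehot)
```

If `unmapped` is not empty on the real data, extending `attack_map` with those names is probably a better fix than leaving all-zero rows, because every record then keeps a class.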

answered May 13, 2020 by MD
• 95,460 points

Hi @MD,

When I replace the line y_test = pd.get_dummies(y_test) with

y_test = kdd['class']
Y_test = pd.get_dummies(y_test)

I received this error:

 y_test = kdd['class']
Traceback (most recent call last):

  File "<ipython-input-19-b667b206b69b>", line 1, in <module>
    y_test = kdd['class']

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1964, in __getitem__
    return self._getitem_column(key)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1971, in _getitem_column
    return self._get_item_cache(key)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1645, in _get_item_cache
    values = self._data.get(item)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3590, in get
    loc = self.items.get_loc(item)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2444, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))

  File "pandas\_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\hashtable_class_helper.pxi", line 1210, in pandas._libs.hashtable.PyObjectHashTable.get_item

  File "pandas\_libs\hashtable_class_helper.pxi", line 1218, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 'class'

Could you please help me?

I suggest that you build your model in a Jupyter notebook and run your code step by step. That makes it easier to troubleshoot the issue. Once the code runs well, save it with a .py extension.

Regarding the error, the above code should work. kdd is a DataFrame, and we can select a single column from it.
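One likely cause of the KeyError, judging only from the code in the question: `target = kdd.pop('class')` returns the column and also deletes it from `kdd`, so a later `kdd['class']` has nothing to find. A minimal sketch with a made-up two-row frame:

```python
import pandas as pd

# Toy frame standing in for kdd (made-up values).
kdd = pd.DataFrame({"duration": [0, 5], "class": ["normal", "dos"]})

# pop() returns the column and removes it from the frame in place.
target = kdd.pop("class")
print("class" in kdd.columns)

raised = False
try:
    kdd["class"]
except KeyError as err:
    raised = True
    print("KeyError:", err)

# The popped Series is still available, so it can be encoded directly.
print(pd.get_dummies(target, dtype=int))
```

The same applies to `kdd_t` and `y_test`: after the `pop` calls, the labels live only in `target` and `y_test`, and selecting `kdd['class']` would in any case return the training labels rather than the test labels.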

Thanks for your response.
