Discretization in Weka

Question

I need to know when is the right time to do discretization in Weka. I have a data set, and I need to create training and testing samples from it. Should I discretize the numerical attributes before the sampling or after it?

Nandini · Answer 1 · Apr 7, 2022

This should be self-evident.

You can discretize after splitting only if you get the same result regardless of the split you used. But then there is no advantage in waiting, so just do the preprocessing first.

You should be alright if you discretize by rounding - for example, float to integer (which is unaffected by the split). However, if you discretize using quantiles, it should be evident that you can make a mistake because the different portions will be discretized differently!
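The difference between the two cases can be sketched in a few lines of plain Python (not Weka code). The helper names `round_bins` and `median_bins` are made up for this illustration, and the median cut is only a stand-in for a two-bin quantile discretizer. The numbers are the input values from the table below.

```python
from statistics import median

def round_bins(values):
    # Rounding maps each value on its own, so the split cannot change the result.
    return [round(v) for v in values]

def median_bins(values):
    # Two-bin quantile discretization (hypothetical helper):
    # the cut point is the median of whatever values are passed in.
    cut = median(values)
    return cut, [0 if v < cut else 1 for v in values]

part_a = [0.9, 1.0, 1.1, 1.2, 2.1, 2.3, 2.2]
part_b = [1.1, 1.2, 1.3, 1.9, 2.0, 2.1]

# Rounding: whole data set vs. part by part gives the same bins.
rounded_same = round_bins(part_a + part_b) == round_bins(part_a) + round_bins(part_b)

# Quantiles: each part gets its own cut point.
cut_a, bins_a = median_bins(part_a)
cut_b, bins_b = median_bins(part_b)

# The same input value 1.2 ends up in different bins.
bin_of_1_2_in_a = bins_a[part_a.index(1.2)]
bin_of_1_2_in_b = bins_b[part_b.index(1.2)]

print(rounded_same, cut_a, cut_b, bin_of_1_2_in_a, bin_of_1_2_in_b)
```

With rounding, the parts agree with the whole. With the median cut, the value 1.2 falls in the upper bin in one part and in the lower bin in the other, which is exactly the kind of mistake described above.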

Let's imagine you want to discretize the data into two bins, and you do it separately for each part:

Input data    Type     Output value
0.9           good     1.05
1.0           good     1.05
1.1           good     1.05
1.2           good     1.05
---
2.1           good     2.20
2.3           good     2.20
2.2           good     2.20
---  SPLIT HERE ---
1.1           bad      1.20
1.2           bad      1.20
1.3           bad      1.20
---
1.9           bad      2.00
2.0           bad      2.00
2.1           bad      2.00

Because the average of each cluster of values was used, both "good" and "bad" were discretized into two discrete values each. The resulting attribute, however, plainly leaks the true membership, because the averages for "good" and "bad" differ. The task of detecting "bad" has become a lot simpler than it really is.
So do not preprocess the parts separately. Discretize the whole data set once, before sampling.
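
The table above can be reproduced with a small plain-Python sketch (again not Weka code). The answer only says that the average of each cluster was used, so the rule for forming the two clusters here, a cut halfway between the minimum and the maximum, is an assumption, and `two_bin_means` is a hypothetical helper.

```python
from statistics import mean

def two_bin_means(values):
    # Assumed rule: cut halfway between min and max, then replace each
    # value by the mean of its bin (rounded to two decimals for display).
    cut = (min(values) + max(values)) / 2
    low_mean = round(mean(v for v in values if v < cut), 2)
    high_mean = round(mean(v for v in values if v >= cut), 2)
    return [low_mean if v < cut else high_mean for v in values]

good = [0.9, 1.0, 1.1, 1.2, 2.1, 2.3, 2.2]
bad = [1.1, 1.2, 1.3, 1.9, 2.0, 2.1]

# Discretizing each part on its own reproduces the output column of the table.
good_separate = two_bin_means(good)  # [1.05, 1.05, 1.05, 1.05, 2.2, 2.2, 2.2]
bad_separate = two_bin_means(bad)    # [1.2, 1.2, 1.2, 2.0, 2.0, 2.0]

# No output value is shared, so the value alone gives away the type.
leak = set(good_separate).isdisjoint(bad_separate)

# Discretizing everything together gives both types the same two values.
together = two_bin_means(good + bad)
good_together = together[:len(good)]
bad_together = together[len(good):]
shared = set(good_together) == set(bad_together)

print(leak, shared)
```

Under these assumptions, separate discretization yields disjoint value sets for "good" and "bad", while joint discretization gives both the same two values, so the discretized attribute no longer reveals the type by itself.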
