I’m trying to understand the Fayyad and Irani (1993) MDLP discretization of continuous variables (here is a link to the original paper). I understand how the algorithm works, but I have some doubts about the first step: sorting the variable and finding the potential cut-points.
My question is what happens when the same value of the variable occurs multiple times in the dataset and those occurrences have different target values. This can happen quite often in practice: for example, in a table of clients, some property is 0 for a large number of them, and the events associated with those clients belong to different classes, some to one class and others to another.
Here is a dummy table with the variable X sorted, for a 2-class classification problem:
If we apply the heuristic as suggested in the paper, the value 0.6 will end up in multiple different potential bins, since it occurs with both class 0 and class 1. What’s even worse, it seems that the potential cut-offs and bins will differ depending on how the target variable is ordered among tied values (ascending, descending or unsorted).
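To make the concern concrete, here is a minimal Python sketch. The rows are hypothetical and made up for illustration (they are not the dummy table), and both function names are my own. `naive_cut_points` is my row-by-row reading of the heuristic, which may not be what the paper intends. `grouped_cut_points` is one possible order-independent alternative that I am assuming rather than quoting from the paper: group rows by distinct value first, then place a candidate cut between adjacent distinct values unless both groups contain only the same single class.

```python
from collections import defaultdict


def naive_cut_points(rows):
    """Row-by-row scan over (x, y) pairs already sorted by x.

    A candidate cut is placed between consecutive rows whose class
    labels differ and whose x values differ.
    """
    cuts = []
    for (x1, y1), (x2, y2) in zip(rows, rows[1:]):
        if y1 != y2 and x1 != x2:
            cuts.append((x1 + x2) / 2)
    return cuts


def grouped_cut_points(rows):
    """Group rows by distinct x, then compare adjacent groups.

    A candidate cut is placed between two adjacent distinct values
    unless both groups are pure and share the same class.
    """
    classes = defaultdict(set)
    for x, y in rows:
        classes[x].add(y)
    values = sorted(classes)
    cuts = []
    for lo, hi in zip(values, values[1:]):
        if len(classes[lo] | classes[hi]) > 1:
            cuts.append((lo + hi) / 2)
    return cuts


# Hypothetical data: same rows, sorted by x, differing only in how the
# target is ordered among the tied x = 0.6 rows.
ties_ascending = [(0.2, 0), (0.6, 0), (0.6, 0), (0.6, 1), (0.9, 1)]
ties_descending = [(0.2, 0), (0.6, 1), (0.6, 0), (0.6, 0), (0.9, 1)]

print(naive_cut_points(ties_ascending))
print(naive_cut_points(ties_descending))
print(grouped_cut_points(ties_ascending))
print(grouped_cut_points(ties_descending))
```

On this toy data the naive scan finds no candidate cuts for one tie order and two for the other, while the grouped version returns the same candidates for both. Whether the grouped version matches what the paper means by a boundary point is exactly what I am unsure about.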
Any help would be appreciated. Has anyone already worked through this?