Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning

I’m trying to understand the Fayyad and Irani (1993) MDLP discretization of continuous variables. I understand how the algorithm works, but I have some doubts about the first step: sorting the variable and finding the potential cut-points.

My doubt is: what happens if the same value of the variable occurs multiple times in the dataset and those occurrences have different target values? This happens quite often in practice, e.g. we have a table of clients with some property that is 0 for a large number of them, and there are multiple different events associated with those clients, some belonging to one class and others to a different class.

Here is a dummy table with sorted Variable X and 2-class classification problem:

[Image not available: table of sorted X values with their class labels]

If we apply the heuristic as suggested in the paper, the value 0.6 will end up in multiple different potential bins, since it occurs with both class 0 and class 1. What’s even worse, it seems that the potential cut-offs and bins will differ depending on how the target variable is ordered within the tied values (ascending, descending or unsorted)…

Any help with this would be appreciated. Has someone already tried to figure this out?
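To make the tie problem concrete, here is a minimal sketch of one possible reading of the boundary-point idea. It assumes cut-points are only ever placed midway between adjacent distinct values of X, so a tied value such as 0.6 is never split. The function name `candidate_cut_points` and the data are hypothetical, since the original table is not available, and this is not claimed to be the paper's exact procedure.

```python
from collections import defaultdict


def candidate_cut_points(values, labels):
    """Midpoints between adjacent distinct values that can separate classes.

    Assumption: all rows sharing a value are treated as one group, so the
    order of the labels inside a tie cannot change the result.
    """
    classes_at = defaultdict(set)
    for value, label in zip(values, labels):
        classes_at[value].add(label)

    distinct = sorted(classes_at)
    cuts = []
    for low, high in zip(distinct, distinct[1:]):
        left, right = classes_at[low], classes_at[high]
        # Skip the midpoint only when both neighbouring groups are pure
        # and carry the same single class.
        if len(left) == 1 and left == right:
            continue
        cuts.append((low + high) / 2)
    return cuts


# Hypothetical data: 0.6 occurs three times with mixed classes.
x = [0.2, 0.4, 0.6, 0.6, 0.6, 0.8, 0.9]
y = [0, 0, 0, 1, 1, 1, 1]
print(candidate_cut_points(x, y))  # two candidates, near 0.5 and 0.7
```

Under this reading, reordering the labels within the three 0.6 rows leaves the candidates unchanged, because the grouping step only looks at which classes occur at each value.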

IFrame Vulnerability Classification

I was participating in a bug bounty on a website we will call example.com when I ran into a very strange edge case that I am not sure I should report. The website uses ads and tracking, similar to Google Analytics, from a website we can call tracking.com. When visiting the example website, there is an iframe pointing to the tracking website. The source that creates the iframe can be seen below.

<body>
<script type="text/javascript">
((function (e, t) {
    var n = function () {
        var e = t.createElement("iframe");
        e.src = "https://tracking.com/container/?utm_source=[INJECT]";
        e.style.cssText = "position: absolute";
        t.body.appendChild(e)
    }
    if (t.readyState === "complete") {
        n()
    } else {
        if (typeof e.addEventListener !== "undefined") {
            t.addEventListener("DOMContentLoaded", n, false)
        } else {
            e.attachEvent("onload", n, false)
        }
    }
})(window, document));
</script>
</body>

The example website also accepts a parameter called utm_source, through which JavaScript can be injected into the iframe (where I placed [INJECT] in the code above). For example, visiting https://example.com/?utm_source=";</script><script>alert(document.domain)</script> yields an alert that reads “An embedded page at tracking.com says: tracking.com”. The issue is that the tracking website is not in scope of the bug bounty, and I am not even sure that the issue is caused by the tracking website. It seems like the example website allows the user to inject arbitrary JS into the iframe of the tracking website. Is this a bug worth reporting, or am I missing some easy way of escaping the iframe?

So far I have tried injecting </iframe> and things like e.onload=alert(1) to escape the iframe, but without success. Since the example and tracking websites are on different domains, I cannot access things in the parent website (example) from the tracking website due to the “X-Frame-Options” header set to “SAMEORIGIN”.

As a beginner this bug has me very confused as to how it should be classified and if it is exploitable in any way. Any tips would be greatly appreciated!

How Joint Probability Distributions are used to solve the problem of missing inputs in Classification

With n input variables, we can now obtain all 2^n different classification functions needed for each possible set of missing inputs, but the computer program needs to learn only a single function describing the joint probability distribution.

This is page 98 of Ian Goodfellow’s Deep Learning Book. My confusion comes from how joint probability distributions are used to solve the problem of missing inputs. What are the random variables in this scenario? I don’t really understand the connection here so if someone could please elaborate that would be great.
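As a rough illustration of the quoted sentence, the sketch below treats the inputs and the class label together as the random variables, stores a single joint table p(x1, x2, y), and answers every missing-input case by summing out whichever inputs are absent. The probabilities and the helper name `p_y_given` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical joint distribution p(x1, x2, y) over three binary variables.
# The eight entries sum to 1.
joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}


def p_y_given(observed):
    """Return [p(y=0 | observed), p(y=1 | observed)].

    `observed` maps an input index (0 for x1, 1 for x2) to its value.
    Inputs that are not listed are treated as missing and summed out.
    """
    scores = [0.0, 0.0]
    for (x1, x2, y), prob in joint.items():
        inputs = (x1, x2)
        if all(inputs[i] == value for i, value in observed.items()):
            scores[y] += prob
    total = sum(scores)
    return [score / total for score in scores]


# With n = 2 inputs there are 2^2 = 4 patterns of missing inputs,
# all served by the same joint table:
print(p_y_given({0: 1, 1: 1}))  # nothing missing
print(p_y_given({0: 1}))        # x2 missing
print(p_y_given({1: 1}))        # x1 missing
print(p_y_given({}))            # both missing: the prior p(y)
```

Each call corresponds to a different classification function, yet only the one joint distribution had to be specified (or, in the book's setting, learned).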

Table-Driven Lexer and the Classification Table

I’m trying to implement a compiler for a custom language as part of an assignment.

I am still trying to figure out how to build the lexer. From what I understand, for a table-driven lexer, we have 3 tables:

  1. Classification Table
  2. Transition Table
  3. Token Type Table

My problem is mainly coming from the fact that the only example I’ve seen of the concept of a table-driven lexer is the “famous” (because I see it in every University’s online notes) Cooper & Torczon DFA for reading digits (page 25).

From what I gather, the purpose of each of these is as follows:

1: To classify the atomic parts of the language, such as digits (0,1,2,3….) and letters (a,b,c,…)

2: To define what should happen next according to what’s just been classified (If digit, go to state X, if letter, go to state Y)

3: Apparently this is used to check whether or not the string is accepted. Honestly I don’t even know what the point of this is.

The grammar I’m trying to build a compiler for is much more complicated than the examples I’ve seen online. It contains more “atomic” symbols, such as operators (*,+,-,/,>, etc..) and reserved keywords (if, for, while, etc…)

By atomic, I mean symbols that stand on their own (i.e. “if” is a symbol in its own right, not “i” and “f”). This poses a problem for me, since I won’t be able to tell whether I’m reading “if” or a string of the form “aifb”.

Here’s what I’m currently trying to do:

  1. First, I’m building a CAT (classifier table) for all the atomic symbols of the language. I don’t know if this is the right thing to do, especially when I have 52 letters (English alphabet), 10 digits and reserved words.
  2. I will then merge all the CATs together. So I will have one big CAT that covers letters, digits, and reserved words.
  3. Then, I will build a (big) transition table, so that when I read a character and determine its classification (problem: What about reserved words that take more than 1 character?) I will know where to transition to next.
  4. These tables are used by a simple DFA class which, once the lexeme is read, will spit out a token.
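The three tables can be sketched as follows. This is a minimal illustration under assumptions, not the assignment's required design: the character classes, state names, and keyword set are hypothetical, operators are single characters only, and reserved words are handled by scanning them as identifiers and checking a keyword set afterwards, which is one common way to avoid per-keyword states.

```python
import string

# 1. Classifier table: character -> character class.
CLASS = {}
CLASS.update({c: "letter" for c in string.ascii_letters})
CLASS.update({c: "digit" for c in string.digits})
CLASS.update({c: "op" for c in "+-*/<>="})
CLASS.update({c: "space" for c in " \t\n"})

# 2. Transition table: (state, character class) -> next state.
#    A missing entry means there is no transition.
DELTA = {
    ("start", "letter"): "ident",
    ("start", "digit"): "number",
    ("start", "op"): "operator",
    ("start", "space"): "blank",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
    ("number", "digit"): "number",
    ("blank", "space"): "blank",
}

# 3. Token type table: accepting state -> token type.
TOKEN_TYPE = {"ident": "IDENT", "number": "NUMBER",
              "operator": "OP", "blank": "SPACE"}

KEYWORDS = {"if", "for", "while"}  # hypothetical reserved words


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        state, i, last_accept = "start", pos, None
        # Run the DFA as far as it will go, remembering the last accepting
        # state seen (longest match).
        while i < len(text):
            nxt = DELTA.get((state, CLASS.get(text[i], "other")))
            if nxt is None:
                break
            state, i = nxt, i + 1
            if state in TOKEN_TYPE:
                last_accept = (i, state)
        if last_accept is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        end, accepted = last_accept
        lexeme, kind = text[pos:end], TOKEN_TYPE[accepted]
        # A reserved word is an identifier whose whole lexeme is a keyword.
        if kind == "IDENT" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        if kind != "SPACE":
            tokens.append((kind, lexeme))
        pos = end
    return tokens


print(tokenize("if aifb > 10"))
# [('KEYWORD', 'if'), ('IDENT', 'aifb'), ('OP', '>'), ('NUMBER', '10')]
```

Because the keyword check runs only on a complete identifier lexeme, “if” becomes a keyword while “aifb” stays an identifier. In this sketch the token type table is what decides whether the state the DFA stopped in is accepting, and which kind of token to emit.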

The assignment specifies that I have to use a table-driven lexer.

Land-cover classification in MATLAB with maximum likelihood and Gaussian models

Remotely sensed data are provided as 6 images showing an urban area, with the ground-truth information. These images have already been registered. You are required to implement the Maximum Likelihood (ML) algorithm to classify the given data into four classes, 1 – building; 2 – vegetation; 3 – car; 4 – ground. By doing so, each pixel in the images will be assigned a class. There are four objectives to achieve the final classification goal.

  1. Select training samples from the given source data based on the information in the given ground truth (at least 20 training samples for each class).
  2. Establish a Gaussian model for each class from the training samples.
  3. Apply maximum likelihood to the testing data (measured data) and classify each pixel into a class.
  4. Evaluate the classification accuracy using a confusion matrix and a visual aid (colour-coded figures).
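The four objectives can be sketched roughly as below. The sketch is in Python rather than MATLAB, uses only the standard library, and makes a simplifying assumption that should be stated loudly: each class is modelled with independent per-band Gaussians (a diagonal covariance), whereas a full maximum likelihood classifier would normally estimate a full covariance matrix per class. The pixel values and function names are hypothetical.

```python
import math


def fit_gaussians(samples):
    """samples: {class_id: [feature vectors]} -> {class_id: (means, variances)}.

    Assumption: bands are treated as independent (diagonal covariance).
    """
    models = {}
    for cls, rows in samples.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Sample variance per band, with a small floor to avoid division by zero.
        variances = [
            sum((v - m) ** 2 for v in col) / (n - 1) + 1e-6
            for col, m in zip(columns, means)
        ]
        models[cls] = (means, variances)
    return models


def log_likelihood(x, means, variances):
    """Log of the Gaussian density of pixel x under one class model."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        for v, m, var in zip(x, means, variances)
    )


def classify(x, models):
    """Assign the class whose model gives the pixel the highest likelihood."""
    return max(models, key=lambda cls: log_likelihood(x, *models[cls]))


def confusion_matrix(true, pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true, pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical two-band training pixels for two of the four classes.
training = {
    1: [[200, 190], [210, 185], [205, 195]],  # building
    2: [[50, 120], [60, 130], [55, 125]],     # vegetation
}
models = fit_gaussians(training)
test_pixels, truth = [[202, 191], [58, 122]], [1, 2]
predicted = [classify(p, models) for p in test_pixels]
print(confusion_matrix(truth, predicted, classes=[1, 2]))
```

In the actual task each pixel would carry six values (one per registered image), there would be four classes, and at least 20 training samples per class would be drawn from the ground truth.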

Are there any creatures that have more than one classification?

Are there any creatures that have more than one “monster type”?

I was wondering whether the official published materials contain any creatures with a dual classification, for instance a creature whose description says it is both Fey and Beast.

I appreciate any examples or, if none exist, confirmation that creatures only have the one classification of monster type and that’s it.

Thank you.

Classification accuracy based on top 3 most likely classifications

My goal is to recommend jobs to job seekers based on their skill set.

Currently I’m using an SVM for this, which is outputting one prediction, e.g. “software engineer at Microsoft”. However, consider this: how significantly different are the skill sets of a software engineer at Microsoft and a software engineer at IBM? Probably not significantly different. Indeed, by inspection of my data set I can confirm this. Hence, the SVM struggles to discriminate in situations like this, of which there are many in my data set, and my classification accuracy is about 50%.

So I had an idea.

In scikit-learn, once you’ve trained a model, you can compute the probability that a particular input X belongs to each class.

So for each input X in my test set, I took the top 3 most likely classifications. Then I tested whether or not the correct label was among those top 3 predictions. If it was, I considered the prediction to be correct. In doing so, the classification accuracy increased to over 80%.

So my question is: is this a valid approach to measuring classification accuracy? If it is, then does it have a name?

In my mind, it is valid given my intended application, which is to recommend a selection of jobs to a job seeker, which are relevant to their skill set.
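A measure of this kind is commonly called top-k accuracy (here with k = 3). The sketch below computes it in plain Python from a matrix of per-class probabilities. The class names, probabilities, and function name are hypothetical, and in practice the rows would come from something like a scikit-learn model's `predict_proba` output.

```python
def top_k_accuracy(probabilities, true_labels, classes, k=3):
    """Fraction of samples whose true label is among the k most probable classes.

    probabilities: one row per sample, one column per entry in `classes`.
    """
    hits = 0
    for probs, truth in zip(probabilities, true_labels):
        # Rank classes by probability, highest first.
        ranked = sorted(zip(probs, classes), key=lambda pc: pc[0], reverse=True)
        top_k = [cls for _, cls in ranked[:k]]
        hits += truth in top_k
    return hits / len(true_labels)


# Hypothetical predictions for three job seekers over four job classes.
classes = ["A", "B", "C", "D"]
probabilities = [
    [0.40, 0.30, 0.20, 0.10],
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.25, 0.30, 0.20],
]
true_labels = ["C", "A", "C"]

print(top_k_accuracy(probabilities, true_labels, classes, k=1))  # 1 of 3 correct
print(top_k_accuracy(probabilities, true_labels, classes, k=3))  # 2 of 3 correct
```

scikit-learn also provides `top_k_accuracy_score` in `sklearn.metrics`, which computes the same kind of score from true labels and predicted scores. Note that top-k accuracy can only stay the same or rise as k grows, so it should be reported alongside the value of k.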

Is the complexity of an algorithm connected to the “classification” of an optimization problem?

I wonder whether complexity is connected to the type of optimization problem, both in general and specifically, e.g. when looking at $O(n^2)$. Does this just describe the complexity in general, or does it also mean that the underlying problem has to be a nonlinear optimization problem? Or is it also valid for linear tasks?

In simple words: can a nonlinear task be solved in $O(n)$?

Active Directory GDPR classification

GDPR classifies data into non-personal, personal and sensitive personal. Sensitive personal data is further broken down into genetic and biometric, racial and ethnic, religious and philosophical, etc.

As I prepare to implement a new Active Directory, I want to present my stakeholders with the different options they have, with GDPR in mind.

Reading the Microsoft white paper, I cannot see a mapping of attributes to the GDPR classifications, or a bottom line on how to implement any of the 3 approaches.

I would ideally like to see something along these lines:

  • to class as non-personal: 1. use employee ID in userPrincipalName and sAMAccountName; 2. only use corporate mobile phone in homePhone/otherPhone etc.; 3. use office address in streetAddress, targetAddress etc.
  • avoid these to class as sensitive: 1. biometrics (unless stored on the device such as Windows Hello); 2. distribution lists for religion communities (e.g. prayer room, even if multi-faith)

There are of course extension attributes, which will have to be considered on a case-by-case basis.

Has anyone been through this process and can share from their work?

Classification of programming exercises

I wonder what attempts have been made to classify programming exercises for students (just like “taxonomy” in biology) along their learning path. For example:

  • A “covered concept(s)” criterion (from generic to specific):

    • At level 1, we have “general” concepts: iteration, recursion, ADT, conditionals, etc.
    • At level 2, we have “less general” concepts: for loop, tail recursion, list, ternary operator
  • A “difficulty” criterion (easy, intermediate, hard)

According to my research (on several search engines such as https://dblp.uni-trier.de/), only one classification is mentioned frequently: Bloom’s Taxonomy. What other classifications exist besides this one?

Thank you for your help in advance