Model suggestion for detection of malware based on multiple API call sequences

I’m trying to build an RNN (LSTM) model to classify a binary as benign or malware. The data structure I currently have looks as follows:

    {
        "binary1": {
            "label": 1,
            "sequences": [
                ["api1", "api2", "api3", ...],
                ["api1", "api2", "api3", ...],
                ["api1", "api2", "api3", ...],
                ["api1", "api2", "api3", ...],
                ...
            ]
        },
        "binary2": {
            "label": 0,
            "sequences": [
                ["api1", "api2", "api3", ...],
                ["api1", "api2", "api3", ...],
                ["api1", "api2", "api3", ...],
                ["api1", "api2", "api3", ...],
                ...
            ]
        },
        ...
    }

Here each binary has a variable number of sequences, and each sequence has a variable number of API calls. I can pad the data so that all binaries have the same number of sequences and every sequence has the same number of API calls. But my question is: how can I use this data for training?
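For illustration, the padding step described above could look like the following minimal sketch. The helper name `pad_binary`, the `vocab` mapping, and the choice of `0` as the padding id are assumptions made for this example, not part of the original data format.

```python
def pad_binary(sequences, vocab, max_seqs, max_calls, pad_id=0):
    """Turn one binary's list of API-call sequences into a fixed-size grid of ids.

    Sequences longer than max_calls and binaries with more than max_seqs
    sequences are truncated; everything shorter is filled with pad_id.
    """
    rows = []
    for seq in sequences[:max_seqs]:
        ids = [vocab[call] for call in seq[:max_calls]]
        rows.append(ids + [pad_id] * (max_calls - len(ids)))
    # Add all-padding rows until the binary has max_seqs sequences.
    while len(rows) < max_seqs:
        rows.append([pad_id] * max_calls)
    return rows


# Hypothetical vocabulary; id 0 is reserved for padding.
vocab = {"api1": 1, "api2": 2, "api3": 3}
sequences = [["api1", "api2", "api3"], ["api2"]]

padded = pad_binary(sequences, vocab, max_seqs=3, max_calls=4)
print(padded)  # [[1, 2, 3, 0], [2, 0, 0, 0], [0, 0, 0, 0]]
```

Every binary then has the same shape (`max_seqs` by `max_calls`), while the label stays attached to the binary as a whole rather than to any single sequence.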

The problem is that not all sequences of a malicious binary are necessarily malicious. So if I use the binary’s label to tell the model that all of its sequences are malicious, and some of those sequences also appear in benign files, a benign binary may be classified as malware.

To better understand the problem, treat each binary as a person on Twitter and each API call sequence as the words in a tweet. A user may post many tweets, but only a few of them may be about sports (for example). In my training data I know which people tweet about sports, but I don’t know which tweets are about sports. So what I’m trying to do is classify whether those people like sports or not, based on all of their tweets.

In the same way, I know whether a binary is malicious or not, but I don’t know which API call sequences are responsible for the maliciousness, and I want the model to identify those sequences from the training data. Is this possible? And what architecture should I use?

I hope I have conveyed my question. Thanks for reading; I look forward to your suggestions.

Interlacing sequences by polynomials?

Given sets of integers $0<a_1,\dots,a_t$ and $0<b_1,\dots,b_t$, where $a_i\leq a_{i+1}$ and $b_i\leq b_{i+1}$ for every $i\in\{1,\dots,t-1\}$ and $a_t<b_t$, we can find polynomials $f,g\in\mathbb Z[x]$ such that

$$f(a_i)<g(b_{\sigma(i)})<f(a_{i+1})$$ holds for every $i\in\{1,\dots,t-1\}$ and any given permutation $\sigma$.

  1. How small (among all permutations $\sigma$) can $\max(f_\infty,g_\infty)$ be when $d_{\max}=\max(\mathsf{deg}(f),\mathsf{deg}(g))$ is fixed, where $f_\infty$ and $g_\infty$ denote the largest coefficient by magnitude of $f$ and $g$ respectively?

  2. Same as 1., but with $f,g\in\mathbb Z_{\geq0}[x]$ and the permutation $\sigma=\mathsf{id}$.

Python: how to shuffle an ordered list to make sequences of elements?

For instance we have an ordered list:

a = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4] 

I want to reshuffle this array to form:

a = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4] 

Currently I’m doing:

    import numpy as np

    a = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
    n_unique_elements = 4
    # One sub-array per unique value: [1, 1, 1], [2, 2, 2], ...
    arrays_with_same_elements = np.array_split(a, n_unique_elements)

    final_list = []
    # Take the idx-th element of every sub-array in turn.
    for idx in range(len(a) // n_unique_elements):
        for same_elements in arrays_with_same_elements:
            final_list.append(same_elements[idx])

So the variable
final_list = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]

There must be a Pythonic way of doing this. Perhaps a built-in function in NumPy? What other techniques come to mind to solve this problem?
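One possible approach, sketched here under the assumption that every value occurs the same number of times and that the input is already sorted, is to reshape the array so that each row holds one value and then read it out column by column:

```python
import numpy as np

a = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
n_unique_elements = 4

# Rows: one per unique value -> [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]].
# Transposing and flattening then reads one element of each value in turn.
interleaved = a.reshape(n_unique_elements, -1).T.ravel()

print(interleaved.tolist())  # [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
```

If the counts are unequal, `reshape` raises a `ValueError`, so this sketch only covers the evenly grouped case from the question.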

Why not program our video text terminals/terminal emulators to use something like JSON or XML on the backend instead of ANSI escape sequences?


Backstory (You can skip)

A while back I was developing a console toolkit for displaying debug messages and the like. It gives me colour coding, blinking, underlines, bold, italic, etc.

While developing this library, I quickly learned that nesting with ANSI escape sequences was impossible, but assumed that good reasons existed why this was the case.

Of course working with other document types, nesting is more or less trivial:

<foo> the <bar> quick </bar> brown </foo> 

or with JSON

    {
        "type": "foo",
        "text": [
            "the",
            {"type": "bar", "text": "quick"},
            "brown"
        ]
    }

But with ANSI, it’s something like this:

\e[1m the \e[2m quick \e[1m brown 

giving an output like this:

(screenshot of the terminal output not included)

This basically means that you need to track the formatting manually and explicitly construct escape sequences to represent all output moving forward. You can obviously make do, but it complicates things. Before complaining about this, I’d like to clarify what reasons exist that would necessitate an escape sequence model over, say, a structured document style.
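To make the manual tracking concrete, here is a minimal sketch of how a toolkit might flatten a nested structure into ANSI output. The `(codes, children)` tuple format and the function names `sgr` and `render` are invented for this example. It uses the standard SGR codes 0 (reset), 1 (bold) and 4 (underline), and re-emits the full set of active attributes before every text run, because the escape sequences themselves have no notion of closing a nested scope.

```python
ESC = "\x1b["


def sgr(codes):
    # Reset first (code 0), then re-apply every attribute that is still active.
    return ESC + ";".join(str(c) for c in (0, *codes)) + "m"


def render(node, active=()):
    """Flatten a nested (codes, children) tree into one escaped string."""
    if isinstance(node, str):
        # A text run: state the complete formatting explicitly.
        return sgr(active) + node
    codes, children = node
    return "".join(render(child, active + tuple(codes)) for child in children)


# Bold outer element with an underlined inner element.
doc = ((1,), ["the ", ((4,), ["quick "]), "brown"])

text = render(doc) + sgr(())  # final reset so later output is unaffected
print(repr(text))
```

After the inner element ends, "brown" is preceded by a reset plus the outer bold code again, which is exactly the bookkeeping that a nested document format would make implicit.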

Questions:

Is it purely for legacy reasons that text is displayed on our video terminals with ANSI escape sequences rather than another format such as JSON, YAML, XML, or something else?

Are ANSI escape sequences in video terminals simply an old technology, similar to, say, X11, that sticks around solely because of how embedded it is within the computing paradigm?

If not, why don’t developers switch from an escape sequence style to something that would support nesting?

Are there any proposals to do away with ANSI escapes in terminals and replace them with something else?

What is the purpose of using hex escape sequences when writing buffer overflow exploits?

I was trying to overwrite the fp function pointer with 0x8048424 (the location of win()) so that win() will be called, which solves this problem (the machine is little-endian).

    #include <stdlib.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <string.h>

    void win()
    {
      printf("code flow successfully changed\n");
    }

    int main(int argc, char **argv)
    {
      volatile int (*fp)();
      char buffer[64];

      fp = 0;

      gets(buffer);

      if(fp) {
          printf("calling function pointer, jumping to 0x%08x\n", fp);
          fp();
      }
    }

I am able to do this by overflowing the buffer variable and then overwriting fp with python -c "print'A'*64 + '\x24\x84\x04\x08'" | ./stack3.

But my question is: why do we need hex escape sequences? I have seen this notation in many tutorials, but none of them explained its purpose.

I read about this and found that they can be used as escape sequences. For example, \n (the newline character) can be written as printf("\x0A") and it will do the same thing. So that makes sense.

But why do we need this when overwriting memory? I don’t understand its purpose in a buffer overflow. Why can’t we simply use python -c "print'A'*64 + '0x24840408'" | ./stack3? I mean, we are just writing a memory address to a pointer variable.

PS: My question is related to this question, but unfortunately it doesn’t answer why we need the \x notation in the first place.
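To compare the two payloads byte by byte, here is a small Python 3 sketch (the one-liners above are Python 2). It only inspects the bytes each string would place in memory; `WIN_ADDR` is the address quoted in the question, and the variable names are chosen for this example.

```python
import struct

WIN_ADDR = 0x08048424  # address of win() as given in the question

# Four raw bytes, least significant byte first (little-endian).
raw = struct.pack("<I", WIN_ADDR)

# Ten ASCII characters: '0', 'x', '2', '4', ...
text = b"0x24840408"

print(len(raw), raw.hex())    # 4 24840408
print(len(text), text.hex())  # 10 30783234383430343038

# What a little-endian machine reads back if the first four bytes of
# the text payload land in a 32-bit pointer.
overwritten = int.from_bytes(text[:4], "little")
print(hex(overwritten))       # 0x34327830
```

The escaped form produces the four byte values themselves, whereas the unescaped form produces the character codes of the digits, so a pointer filled from it would hold 0x34327830 rather than the address of win().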

Statistical tests for modular roots of high complexity integer sequences?

Take two integers $n$ and $m$ with $0<\log_2m<n<m$.

Let $r=(2(n!)+1)\bmod m$.

Denote the two square roots modulo $m$ of $((2(n!)+1)\bmod m)^2\bmod m$ by $r_1$ and $r_2$.

One of $r_1$ and $r_2$ equals $r$.
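As a small numeric illustration of the setup, the following brute-force sketch computes $r$, its square, and all square roots of that square modulo $m$. It assumes $m$ is a small odd prime, so that a nonzero square has exactly two roots; the function name `candidate_roots` is chosen for this example.

```python
from math import factorial, log2


def candidate_roots(n, m):
    """Return r, r^2 mod m, and every x in [0, m) with x^2 = r^2 (mod m)."""
    assert 0 < log2(m) < n < m, "the question requires 0 < log2(m) < n < m"
    r = (2 * factorial(n) + 1) % m
    square = r * r % m
    # Exhaustive search; only feasible for small m.
    roots = [x for x in range(m) if x * x % m == square]
    return r, square, roots


r, square, roots = candidate_roots(5, 7)
print(r, square, roots)  # 3 2 [3, 4]
```

Here $2\cdot 5!+1=241\equiv 3 \pmod 7$, and the two roots of $2$ modulo $7$ are $3$ and $4$, of which $3$ is the one equal to $r$.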

  1. Is there a statistical test that performs better than a random coin toss at finding the correct one of $r_1$ and $r_2$?

  2. Can we replace $a(n!)+b$ by any other sequence whose complexity is purportedly high, such that a statistical test can still separate the two roots with significance?

On norming weakly$^*$ sequences in the dual of the Banach space $c_0$

A bounded subset $B$ of the dual $X^*$ of a Banach space $X$ is called norming if the formula $\|x\|:=\sup\{|x^*(x)|:x^*\in B\}$ determines an equivalent norm on $X$.

Observe that the sequence $(e_n^*)_{n\in\omega}\subset c_0^*=\ell_1$ of coordinate functionals in the dual of $c_0$ is norming and weakly$^*$ null.

Question 1. Is it true that every absolutely convex bounded norming set $B\subset c_0^*$ in the dual of the Banach space $c_0$ contains a norming weakly$^*$ null sequence?

The same question can be asked for subspaces of $c_0$.

Question 2. Let $X$ be a closed subspace of the Banach space $c_0$ and let $B\subset X^*$ be a norming bounded absolutely convex set. Is it true that $B$ contains a norming weakly$^*$ null sequence?