Encoding an arbitrary stack trace into a fixed-length value

Background

I would like to store the nodes of a Calling Context Tree in a key-value store. I need to be able to directly access a node by its method name and complete stack trace. In addition, I need to access all nodes of a method by its name alone (the key-value store supports loading by key prefix).

Problem

The first idea is to use the method name plus an encoded stack trace as the key, e.g. the concatenated string representations. Unfortunately this can get quite large, and I cannot use keys of arbitrary length. So the second idea was to encode the stack trace in a deterministic and reversible way. My next idea was to encode the stack trace as a 64-bit integer by adding the 32-bit hash values of the methods in the stack. Unfortunately this is not collision free, as the traces A -> B -> C and B -> A -> C compute to the same value even though the traces are different. So my current idea is to encode the traces by:

encodeStacktrace(stack_trace)
  1. 64bit current = 0
  2. for every method m in stack_trace
  3.     current = rotateLeft(current) + hash(m)
  4. return current

The key is then the method name concatenated with the encoded stack trace value.
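For reference, a minimal Java sketch of the scheme above, assuming hash(m) is the method name's 32-bit String hash, rotateLeft rotates the accumulator by one bit, and the "#" separator in the key is just an illustrative choice:

import java.util.List;

public class StackTraceEncoder {

    // current = rotateLeft(current) + hash(m), folded over the stack trace
    static long encodeStacktrace(Iterable<String> stackTrace) {
        long current = 0L;
        for (String method : stackTrace) {
            current = Long.rotateLeft(current, 1) + method.hashCode();
        }
        return current;
    }

    // Key = method name concatenated with the encoded stack trace value.
    static String key(String methodName, List<String> stackTrace) {
        return methodName + "#" + Long.toHexString(encodeStacktrace(stackTrace));
    }

    public static void main(String[] args) {
        // A -> B -> C and B -> A -> C now (usually) get different encodings.
        System.out.println(key("C", List.of("A", "B", "C")));
        System.out.println(key("C", List.of("B", "A", "C")));
    }
}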

Question

Is this implementation collision safe? I think it is not 100% safe; however, I don’t know how to compute the collision probability under the assumption that the method hash computation is a perfect hashing algorithm.

If it is not safe, are there other implementations/directions I can look into?

URL encoding messed up

ASP.NET

Something is getting messed up / double encoded?

Breakdown of the URL, for a clearer view:

https://www.someSite.com/en/Download/DownloadLog?productId=6380 &downloadLink= http%3A%2F%2FsomeSite.com %2FDownload.ashx %3Frequest %3DIL7zxW6ETqiYU6cThSNKL8MpY %252bCRIVFZAVhd8DYPG85C1Uhdd %252f2hqqmoObeNmuS3dg4bDgGBb0kUUxGZhej89kTaLBHBXS %252bq3tlaEk2uMEcbWlUZzZQs00sirwZ2IvAvoSpU7HC3N1FaYSNciQ4iHNNmTU %252f6uMypNlPOJ6enlbZ1OrrYODkaMRdRfGKEba %252brusdryM4gp %252bopi1a0gNuMQVCtj %252bAvDcgXGOcZPNhPAnE %253d&version=Ma88r6Z6t2JQcnVhVXgp0A%3D%3D 

Replaced so far:

  • %3A%2F%2F would be ://
  • %2F would be /
  • %3F would be ?
  • %3D would be =
  • %252b would be <what?>
  • %252f would be <what?>
  • %253d would be <what?>

Altered version, as per progress so far:

https://www.someSite.com/en/Download/DownloadLog?productId=6380 &downloadLink= http://someSite.com /Download.ashx ?request =IL7zxW6ETqiYU6cThSNKL8MpY %252bCRIVFZAVhd8DYPG85C1Uhdd %252f2hqqmoObeNmuS3dg4bDgGBb0kUUxGZhej89kTaLBHBXS %252bq3tlaEk2uMEcbWlUZzZQs00sirwZ2IvAvoSpU7HC3N1FaYSNciQ4iHNNmTU %252f6uMypNlPOJ6enlbZ1OrrYODkaMRdRfGKEba %252brusdryM4gp %252bopi1a0gNuMQVCtj %252bAvDcgXGOcZPNhPAnE %253d&version=Ma88r6Z6t2JQcnVhVXgp0A== 
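For reference, here is a small Java sketch that applies the same percent-decoding programmatically via java.net.URLDecoder; the downloadLink value below is a shortened stand-in, not the real one:

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeLink {
    public static void main(String[] args) {
        // Shortened stand-in for the real downloadLink value.
        String downloadLink = "http%3A%2F%2FsomeSite.com%2FDownload.ashx%3Frequest%3DIL7z%252bXYZ%253d";
        String once = URLDecoder.decode(downloadLink, StandardCharsets.UTF_8);
        System.out.println(once);  // http://someSite.com/Download.ashx?request=IL7z%2bXYZ%3d
        // Sequences like %252b, %252f, %253d only unwrap one level per pass,
        // because %25 is itself an escape sequence, which is why the value
        // looks double encoded.
        System.out.println(URLDecoder.decode(once, StandardCharsets.UTF_8));
    }
}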

Can anyone help me figure out the remaining?

Does anyone know what this encoding format for passwords is? I think it is a decimal array but I can’t seem to convert it

During a penetration test, I ran across a server that was storing passwords in its database in what seems to be a binary array of sorts:

password_table
  1,10,11,21,21,11,21,13,00,00,00,000
  11,61,19,11,46,108,09,100
  110,118,100,107,108,117,123,62,108,108,62,62

(slightly edited for confidentiality)

The server in question is a Tomcat server and the application is running a Java program. I considered that this might be an array of sorts, but I can’t seem to convert these arrays into anything readable or usable. Does anyone have any ideas?
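For context, this is roughly the conversion I attempted, assuming each comma-separated value is a decimal character code (the row below is the last one from the table above):

public class DecodeAttempt {
    public static void main(String[] args) {
        String row = "110,118,100,107,108,117,123,62,108,108,62,62";
        StringBuilder sb = new StringBuilder();
        for (String part : row.split(",")) {
            // Interpret each decimal value as a character code.
            sb.append((char) Integer.parseInt(part.trim()));
        }
        System.out.println(sb);  // does not produce anything readable
    }
}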

Is Unicode character encoding a safe alternative to HTML encoding when rendering unsafe user input as HTML?

I am building a web application in which a third party library is used, which transforms the user input into JSON and sends it to a controller action. In this action, we serialize the input using the standard Microsoft serializer from the System.Text.Json namespace.

public async Task<IActionResult> Put([FromBody] JsonElement json)
{
    // Serialize the incoming JSON; this string is later rendered back
    // into the page inside a script block via @Html.Raw().
    string result = JsonSerializer.Serialize(json);
    return Ok(result);
}

However, currently the JSON is rendered back to the page, within a script block and using @Html.Raw(), which raised an alarm with me when I reviewed the code.

While testing if this creates an opening for script injection, I added

<script>alert("HACKED");</script> 

to the input. This input is transformed into

\u003Cscript\u003Ealert(\u0027HACKED\u0027);\u003C/script\u003E 

when serialized.

This looks fine. Rendering this to the page did not result in code execution when I tested it.

So, is Unicode character encoding really a good protection against script injection, or should I not rely on it?

Is it conceivable that the Unicode encoding is lost somewhere during processing, for example by (de)serializing once more?

This seems like a question that has been asked and answered before, but I couldn’t find it.

Is UTF-8 the final character encoding for all future time?

It seems to me that Unicode is the “final” character encoding. I cannot imagine anything else replacing it at this point. I’m frankly confused about why UTF-16 and UTF-32 etc. exist at all, not to mention all the non-Unicode character encodings (unless for legacy purposes).

In my system, I’ve hardcoded UTF-8 as the one and only supported character encoding for my database, my source code files, and any data I create or import to my system. My system internally works solely in UTF-8. I cannot imagine ever needing to change this, for any reason.

Is there a reason I should expect this to change at some point? Will UTF-8 ever become “obsolete” and replaced by “UniversalCode-128” or something, which also includes the alphabets of later discovered nearby galaxies’ civilizations?

CNF encoding of additions

I have $m$ equations of the following form: $$x_1+x_2+\cdots+x_n=s,$$ where each variable is either 1 or 0, and the total number of variables is $m\approx 3{,}000$. So I’m thinking of modeling each variable as a binary variable and each equation as a CNF formula so that, once I combine all formulas into one CNF, I can solve it using a SAT solver.

I’ve tried to solve the system of equations using Gaussian elimination, but it was too slow since the time complexity is $m^3\approx 27{,}000{,}000{,}000$.

My problem is how to encode addition efficiently and simply. My only known approach is to model $a+b$ as a circuit and then convert the whole collection of $n$ circuits to a CNF. Is there a better way?
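For illustration, this is what the circuit route gives for a single half adder, i.e. the standard Tseitin clauses for the sum bit $s \leftrightarrow a \oplus b$ and the carry bit $c \leftrightarrow a \wedge b$ (a sketch of the usual construction, not necessarily the most efficient encoding; a full adder chain is built from these):

$(\lnot a \lor \lnot b \lor \lnot s) \wedge (a \lor b \lor \lnot s) \wedge (a \lor \lnot b \lor s) \wedge (\lnot a \lor b \lor s)$

$(\lnot a \lor \lnot b \lor c) \wedge (a \lor \lnot c) \wedge (b \lor \lnot c)$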

Optimal encoding scheme for semi-rewritable memory?

Let’s define a “semi-rewritable” memory device as having the following properties:

  • The initial blank media is initialised with all zeroes.
  • When writing to the media, individual zeroes can be turned into ones.
  • Ones can not be turned back into zeroes.

Making a physical interpretation of this is easy. Consider for instance a punch card where new holes can easily be made, but old holes can not be filled.

What makes this different from a “write once, read many” device is that a used device can be rewritten (multiple times), at the cost of reduced capacity for each rewrite.

Implicit assumptions I would like to make explicit:

  1. The memory reader has no information about what was previously written on the device. It can therefore not be relied upon to use a mechanism such as “which symbols have been changed?” to encode data on a device rewrite. That is, the reader is stateless.
  2. On the other hand, different “generations” of the device may use different encoding schemes as the available capacity shrinks.
  3. The data stored can be assumed to be random bits.

Sample storage scheme, to demonstrate rewrite capability:

Information in this scheme is stored on the device as pairs of binary symbols, each pair encoding one of the three states of a ternary symbol, or [DISCARDED] in the case where both symbols have been written.

The first generation thus stores data at a density of $\frac{\log_2(3)}{2} \approx 0.79$ times that of simple binary encoding.

When the device is rewritten, the encoder considers each pair of binary symbols in sequence. If the existing state matches the one it wants to write, the encoder considers the symbol written. If the pair does not match, it writes the necessary modification to that pair; where that is not possible, it writes the symbol [DISCARDED] and moves on to the next pair, until it has successfully written the ternary symbol.

As such, every rewrite would discard $ \frac{4}{9}$ of existing capacity.

For a large number of cycles, the device would in sum have stored $\frac{9\log_2(3)}{8} \approx 1.78$ times the data of a simple one-time binary encoding.
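Spelling out where that figure comes from, assuming each generation retains $\frac{5}{9}$ of the previous generation’s capacity:

$\frac{\log_2(3)}{2} \sum_{i=0}^{\infty} \left(\frac{5}{9}\right)^{i} = \frac{\log_2(3)}{2} \cdot \frac{9}{4} = \frac{9\log_2(3)}{8} \approx 1.78$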

(For a variation of the above, one could also encode the first generation in binary and then apply this scheme on every subsequent generation. The loss from the first generation to the second would be larger, and the total life time capacity reduced, but the initial capacity would be larger).

Question:

  1. Is it possible to have a better life-time capacity than $\frac{9\log_2(3)}{8}$? I suspect the real asymptotic capacity is 2.

  2. Can a scheme do better than having $ \frac{4}{9}$ capacity loss between rewrites?

Encoding a huge number of tape symbols of a Turing machine when simulating the Turing machine on a real computer

I was going through the classic text “Introduction to Automata Theory, Languages and Computation” by Hopcroft, Ullman, and Motwani, where I came across the simulation of a Turing machine by a real computer. There the author argues that it would be almost impossible to carry out the above simulation (not considering a universal Turing machine) if the number of tape symbols is quite huge. In such a situation it might happen that the code of a tape symbol would not fit on a single hard disk of a computer.

Then the author makes the following claim:

There would have to be very many tape symbols indeed, since a 30 gigabyte disk, for instance, can represent any of $ 2^{240000000000}$ symbols.

Now I can’t figure out the specific mathematics that the author does…

I assume the usual encoding as we do in digital logic: say, to encode $8$ symbols we need at least $3$ bits of code. Then, to represent $n$ symbols, if $k$ bits are required, we should have the following relation:

$2^{k} = n \implies k = \log_2(n)$

Now,

$30$ gigabytes $= 30 \times 2^{33}$ bits $= \beta$ (say).

Now if our disk can hold all the $n$ symbols, then the following relation must hold:

disk size $=$ (bits required for each symbol) $\times$ (number of symbols)

$\implies \beta = k \times n = n\log_2(n)$

Solving graphically I have:


$n = 7.8 \times 10^{9}$, which is nowhere close to the number $2^{240000000000}$.
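In case it helps, a small Java sketch that solves $n\log_2(n) = \beta$ numerically (instead of graphically) via binary search; it arrives at roughly the same value:

public class SymbolCount {
    public static void main(String[] args) {
        double beta = 30.0 * Math.pow(2, 33);  // 30 gigabytes expressed in bits
        double lo = 2, hi = 1e12;
        for (int i = 0; i < 200; i++) {
            double mid = (lo + hi) / 2;
            // n * log2(n) is increasing, so binary search for the crossing point.
            if (mid * (Math.log(mid) / Math.log(2)) < beta) lo = mid; else hi = mid;
        }
        System.out.printf("n = %.2e%n", lo);  // approximately 7.8e9
    }
}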

Where am I making the mistake?

Help in understanding ‘reasonable’ encoding of inputs

I read that a reasonable encoding of inputs is one where the length of the encoding is no more than a polynomial in the size of the ‘natural representation’ of the input. For instance, binary encodings are reasonable, but unary encodings are not.

But say that the input is a graph, and its natural representation is a vertex and edge list. Suppose that the graph has $k$ vertices. If I use unary to encode, the overall length of the input referring to the vertex list would be $O(k^2)$, i.e. $|1^1|+|1^2|+|1^3|+\cdots+|1^k|$. Isn’t this unary encoding still a polynomial with respect to the number of vertices of the graph (which is $k$)?
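Spelling out the arithmetic behind that bound:

$|1^1|+|1^2|+\cdots+|1^k| = \sum_{i=1}^{k} i = \frac{k(k+1)}{2} = O(k^2)$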

What am I missing here?

Name of binary encoding scheme for integer numbers

I once found on Wikipedia a nice technique for encoding one of $k \in (2^{n-1}, 2^n)$ uniformly distributed integer values with fewer than $n$ average bits/symbol, thanks to a simple-to-compute variable-length code. Basically it used $n$ bits for some symbols and $n - 1$ bits for some others.

Unfortunately all my Googling has failed me. I recall something similar to “variable length binary”, but I keep ending up at VLQ, which is a different beast. Since your memory is better than mine, can you help me?