Representation of the smallest grammar problem (SGP) using free groups on alphabets


Short Version:

  • Let all lowercase Latin variables denote strings, except $n, m, l, r$, which are indices and index bounds.

  • Let $ ab$ mean usual string concatenation of $ a$ to $ b$ .

  • Let $ t \leqslant s \iff \exists$ strings $ u,v$ such that $ utv = s$ .

  • Let $ \Sigma(s) = \{ a \leqslant s : |a| = 1\}$ .

  • Let $ V(n) = \{X_1, \dots, X_n\}$ be disjoint from $ \Sigma(s)$ always.

  • Let $ G(s, n) = \text{free-group}(V(n)\cup \Sigma(s))$

  • Let $\text{pruned}(s, n) = \{$ grammars $g$ for $s$ on $\leq n$ variables such that $\forall X \in V(g)$ we have $X \leqslant g^k(S)$ for some $k \geq 0\}$.

  • Define a grammar $g$ on $n$ variables for $s$ to be a homomorphism of free groups $\alpha : G(s, n) \to G(s, n-1)$ such that $\alpha$ is some mapping $V(n) \to G(s, n-1)$ extended homomorphically, is the identity on $\Sigma(s)$, and satisfies $\alpha^k(S) = s$ for all $k$ at least some threshold $n_g$.

  • Let $\text{SGP}(s, n) = \{\overline{x} \in G(s, n)^n :$ if $\overline{x}\cdot \overline{g} = \overline{g'}$ then $g'$ can be pruned such that an equivalent grammar resides in $\text{pruned}(s, n)$ whenever $g$ does $\}$.

  • From now on, identify $ G(s, n-1)$ with the first $ n-1$ components of $ G(s, n)$ .

  • The question is at the bottom of the post.


Long Version:

A grammar $g$ of a string $s$ over $\Sigma$ is a string homomorphism $(\Sigma \cup V)^* \to (\Sigma \cup V\setminus \{S\})^*$ such that $g^n(S) = s$ for all sufficiently large $n$, where $V$ is a finite set of variable symbols and $S \in V$ is the designated start symbol, which is assumed to be present for any string $s$. A pruned grammar is one such that for every $X \in V(g)$, the letter $X$ appears as a substring of some iterate of $g$ evaluated at $S$: $X \leqslant g^k(S)$ for some $k \geq 0$. In other words, a pruned grammar just gets rid of unused rules.

Define the size of a grammar $g$ of $s$ to be the sum of the lengths of the right-hand sides of the rules defining $g$. Recall from context-free languages that a grammar can also be defined as a list of rules of the form $X \to x$. Formally we can express this as $|g| := \sum_{X \in V(g)} |g(X)|$.
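
To make these definitions concrete, here is a tiny illustrative sketch (a toy example of my own, not part of the construction): a grammar for $s = \texttt{abcabc}$ written as a rule table, expanded by iterating it as a string homomorphism, with its size computed as the sum of the RHS lengths.

# Toy grammar for s = "abcabc":  S -> XX,  X -> abc.
rules = {"S": "XX", "X": "abc"}

def apply_once(word, rules):
    # Apply the rules as a string homomorphism: variables expand, terminals stay.
    return "".join(rules.get(ch, ch) for ch in word)

w = "S"
while any(ch in rules for ch in w):   # iterate until g^k(S) contains no variables
    w = apply_once(w, rules)

assert w == "abcabc"

# |g| = sum of the lengths of the right-hand sides = 2 + 3 = 5.
size = sum(len(rhs) for rhs in rules.values())
print(w, size)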

In order to algebraize the problem, I did not find it beneficial to study string homomorphisms that are pruned grammars (the semigroup they form satisfies $g \circ g' = g'$ for all $g, g'$ in the semigroup). This is because composing two grammars like functions causes the first in the composition to hide the variables of the second, so that when the result is pruned you get the first operand as the result.

Instead, we'll study transformations of the rules of grammars with $n$ rules. Consider the direct product $G^n$, where $G$ is the free group on $\Sigma(g) = V(g) \cup \Sigma(s)$, where the union is made disjoint by an unlimited supply of new variable symbols. Clearly, if we assign a variable index to each factor of the product, then we have an encoding of all possible grammars on $n$ variables, a lot of them being redundant and "very similar" to other grammars.

Nonetheless, the smallest grammar problem should be statable as:

Find a grammar $ g$ of $ s$ such that $ |g|$ is minimized, or equivalently,

Find a vector $ \overline{g} \in G^n, n\geq 1$ such that $ |g|$ is minimized and $ g$ is a grammar for $ s$ .

Note the $n \geq 1$ part. It implies that any naive algorithm would have to check all $n = 1, \dots, |C(s)|$, where $|C(s)| = O(|s|^2)$ and $C(s)$ is the set of substrings of $s$. That is, in the worst case, assume you have to variablize approximately all substrings. This is not the case in practice or in theory, but it shouldn't matter for this analysis.


Now we have to be careful about how we define $|\cdot| : G \to \Bbb{Z}$ on a group, because it can literally be always equal to $1$ if not done right. Since $G$ is generated by $\Sigma(g)$, we say that $|x|$ is the length of a minimal word in the letters of $\Sigma(g) \cup \Sigma(g)^{-1}$ representing $x$: if $x_1 \cdots x_r = x$ with each $x_i \in \Sigma(g) \cup \Sigma(g)^{-1}$, and no such expression exists for any $r' < r$, then $|x| := r$.
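
As a small illustration of this word length (a sketch of my own, writing the inverse of a generator as the corresponding uppercase letter), free reduction computes it directly:

# Length of a free-group element: freely reduce a word over the generators and
# their inverses (uppercase = inverse of the lowercase generator).
def reduced_length(word: str) -> int:
    stack = []
    for ch in word:
        if stack and stack[-1] == ch.swapcase():
            stack.pop()   # a generator next to its inverse cancels
        else:
            stack.append(ch)
    return len(stack)

print(reduced_length("abBc"))  # 2: "bB" cancels, leaving "ac"
print(reduced_length("aAbB"))  # 0: this word is the identity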

Since the alphabet $\Sigma(g)$ depends on the grammar, we have to introduce an alphabet that doesn't change over $\overline{g} \in G^n$. That we'll just call $\Sigma_n = \{X_1, \dots, X_n\}$.

Now, to define $|\overline{x}|$ for $\overline{x} \in G^n$, we need to take out rules that are easily seen to be unnecessary or unhelpful in making a grammar the smallest it can be. For instance, if $X_1 \to a \in \Sigma(s)$, then $|a| = 1$, and so any grammar containing this rule would actually be larger than the equivalent grammar without it. Thus whenever $t \in \Sigma_n \cup \Sigma$ we define $|t|_n = 0$. To get to the equivalent non-redundant grammar, you just substitute $a$ for every occurrence of $X_1$ found on the RHS of other rules.
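
As a concrete illustration of that inlining step (a sketch of my own, with hypothetical single-letter rule names):

# Inline rules whose RHS is a single terminal, as described above: substitute
# the terminal for every occurrence of the variable and drop the rule.
def inline_single_terminal_rules(rules, terminals):
    trivial = {X: rhs for X, rhs in rules.items() if rhs in terminals}
    out = {}
    for X, rhs in rules.items():
        if X in trivial:
            continue  # drop the now-redundant rule
        for Y, a in trivial.items():
            rhs = rhs.replace(Y, a)
        out[X] = rhs
    return out

print(inline_single_terminal_rules({"S": "XbX", "X": "a"}, {"a", "b"}))  # {'S': 'aba'}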


Define a grammar symmetry of $\overline{g}$ to be any element $\sigma \in G^n$ such that when $\sigma$ is applied componentwise on the left (or right) of $\overline{g}$, written $\overline{h} = \sigma \overline{g}$, the property that $h^k(S) = s$ for sufficiently large $k$ is preserved.


We have $G^n$ acting on $G^n$ itself and on a subset $Y := \text{pruned-grammars}(n, s) = \{\overline{g} : g$ is a pruned grammar for $s\}$. The subset $H = \text{Stab}_{G^n}(Y)$, i.e. all group elements that stabilize the set of pruned grammars for $s$ (together with their componentwise inverses), is a subgroup of $G^n$. Proof: $gY = Y,\ hY = Y \implies Y = h^{-1}Y \implies gh^{-1}Y = Y$.

Let $X := \text{smallest-grammars}(n, s)$. Clearly $X \subset Y$, since a smallest grammar cannot contain unused rules. Note: elements of $X$ are precisely the smallest grammars one can achieve with $n$ or fewer variables. The reason the lower variable counts are represented there as well is that a component of $\overline{g}$ can simply map to the identity $1$, which has length $0$.


Lemma 0. The action $H \times Y \to Y$ is transitive.

Let $g, h$ be two pruned grammars of $s$, each with $n$ rules, i.e. $\overline{g}, \overline{h} \in Y$. Consider the component of lowest index $i$ such that $g_i \neq h_i$. Since we're in a group, $x g_i = h_i$ has a solution $x$ for any $g_i, h_i$; similarly we have that $g_i = x^{-1} h_i$. Do the same for each component, and call $x_j$ the left multiplier taking $g_j$ to $h_j$. Then $\overline{x} = (x_1, \dots, x_n)$ is such that $\overline{x}\,\overline{g} = \overline{h}$.

This is flawed, since $ \text{Stab}_{G^n}(Y)$ may not be writeable as a direct product.

Lemma -1. $\text{Stab}_{G^n}(Y)$ is equal to a direct product of subgroups of $G^n$. Let $K_i = \{(1, 1, \dots, x_i, \dots, 1, 1) : x_i$ is in the $i$th position and $(c_1, \dots, c_{i-1}, x_i, \dots, c_n) \in \text{Stab}_{G^n}(Y)$ for some $c_j \in G\}$. Then $K_i$ forms a group under componentwise multiplication.

Proof. We only need to show this for the $i$th component, since the rest are all $1$. Without loss of generality we only need to show that $H_i = \{x \in G : (c_1, \dots, c_{i-1}, x, \dots, c_n) \in \text{Stab}_{G^n}(Y)$ for some $c_j \in G\}$ is a subgroup of $G$. Since $\text{Stab}_{G^n}(Y)$ is in particular a subgroup of $G^n$, any multiplication of its elements must be carried out componentwise, by definition of the direct product. Let $x, y \in H_i$. Then clearly $(c_1 d_1, \dots, c_{i-1} d_{i-1}, xy^{-1}, \dots, c_n d_n) \in \text{Stab}_{G^n}(Y)$, where similarly we can prove, using componentwise multiplication, that $y^{-1} \in H_i$. $\blacksquare$

That proves that $\text{Stab}_{G^n}(Y)$ acts transitively on $Y$. Define $H' := \text{Stab}_{H}(X) = \{\overline{x} \in H : \overline{x} X = X\}$ similarly, and we have a proof that $H' \times X \to X$ is a transitive group action.

Thus suppose we are given a grammar $\overline{y} \in Y$ for $s$, say one that is reasonably small but not the smallest. Then there is a group element in $H = \text{Stab}_{G^n}(Y)$ that takes $\overline{y}$ to $\overline{g}$, a smallest grammar.

Thus the question can be phrased: compute $\text{SGP}(s,n) = \text{argmin}_{\overline{x} \in H}\{|\overline{x} \cdot \overline{y}|\} \cdot \overline{y} =$ the set of smallest grammars of $s$ on up to $n$ variables, given a pruned grammar $y$ for $s$. You can go ahead and take $y = \{S \to s\}$, with $1$'s in the other components of $\overline{y}$.


Questions.

  1. For any smallest grammar $g$, do there always exist two other smallest grammars $h_1, h_2$ such that $\overline{h_1}\,\overline{h_2} = \overline{g}$ in the direct product $G^n$? In other words, is there a recursive decomposition of the problem for each problem instance?


How to transform an arbitrary graph into a fixed vector representation?

Actually, I work in computer vision, specifically on a problem known as "scene graph modeling." This problem aims to convert an image $I$ into a graph $G=(V,E)$ where the nodes $V$ represent the objects (and their features) in the scene and the edges $E$ the relationships between objects. An interesting paper on this topic is Graph R-CNN for Scene Graph Generation (note that, unlike merely detecting the objects in an image, the scene graph aims to capture the contextual information of the image). A graph is a mathematical structure rich in information, and it would be very interesting to integrate graphs into a machine learning approach. In order to achieve this task it is necessary to transform a graph into a vector representation. Some works that intend to solve this problem are the following:

  • SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS: The problem with this algorithm is that it assumes a fixed number of nodes. After training, this algorithm takes a graph $G=(V,E)$ as input (with $N$ nodes, that is, $|V|=N$) and outputs a fixed vector representation.
  • graph2vec: Learning Distributed Representations of Graphs: This algorithm is flexible because it allows building a vector representation from a graph $G$ without restricting the number of nodes. However, it needs to know the whole space of graphs. That is, given a set $G=\{g_{1},g_{2},\dots,g_{i},\dots,g_{m}\}$, where $g_{i}$ is the $i$-th graph, this algorithm builds a vectorial representation $V=\{v_{1},v_{2},\dots,v_{i},\dots,v_{m}\}$, where $v_{i}$ is the $i$-th vector associated with the graph $g_{i}$. This algorithm was originally proposed for text analysis, where the node features are low-dimensional, and I do not know if it can work with high-dimensional node features. (A minimal fixed-size baseline is sketched after this list.)
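
For concreteness, here is a minimal sketch of the simplest permutation-invariant baseline (my own illustration, not taken from either paper): one round of neighbourhood averaging followed by a mean readout over nodes, which yields a fixed-size vector regardless of the number of nodes.

import numpy as np

def graph_to_vector(adjacency: np.ndarray, node_features: np.ndarray) -> np.ndarray:
    """Fixed-size representation of a graph with N nodes and d-dim node features."""
    # adjacency: (N, N) 0/1 matrix, node_features: (N, d)
    degree = adjacency.sum(axis=1, keepdims=True) + 1.0               # +1 for the self-loop
    propagated = (node_features + adjacency @ node_features) / degree  # neighbourhood average
    return propagated.mean(axis=0)                                     # (d,) regardless of N

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # a path graph on 3 nodes
X = np.random.rand(3, 16)                                     # 16-dim node features
print(graph_to_vector(A, X).shape)                            # (16,)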

I would like to know if there is another simple algorithm that allows me to convert any graph into a fixed vector representation.

Representation of the concatenation at the type level

I would like to learn more about concatenative programming through the creation of a small simple language, based on the stack and following the concatenative paradigm.

Unfortunately, I haven’t found many resources concerning concatenative languages and their implementation, so excuse me in advance for my possible naivety.

I therefore defined my language as a simple sequence of concatenation of functions, represented in the AST as a list:

data Operation
    = Concat [Operation]
    | Quotation Operation
    | Var String
    | Lit Literal
    | LitOp LiteralOperation

data Literal
    = Int Int
    | Float Float

data LiteralOperation
    = Add | Sub | Mul | Div

The following program, 4 2 swap dup * + (corresponding to 2 * 2 + 4) once parsed, will give the following AST:

Concat [Lit (Int 4), Lit (Int 2), Var "swap", Var "dup", LitOp Mul, LitOp Add] 

Now I have to infer and check the types.

I wrote this type system:

data Type
    = TBasic BasicType   -- 'Int' or 'Float'
    | TVar String        -- Variable type
    | TQuoteE String     -- Empty stack, noted 'A'
    | TQuote String Type -- Non empty stack, noted 'A t'
    | TConc Type Type    -- A type for the concatenation
    | TFun Type Type     -- The type of functions

That’s where my question comes in, because I don’t know what type to infer from that expression. The resulting type is obvious, it is Int, but I don’t know how to actually entirely check this program at the type level.

At the beginning, as you can see above, I had thought of a TConc type that represents concatenation in the same way as the TFun type represents a function, because in the end the concatenation sequence forms a unique function.

Another option, which I have not yet explored, would be to apply the function composition inference rule to each element of this expression sequence. I don't know how it would work with the stack-based approach.

So the question is: how do we do it? Which algorithm should be used, and which approach at the type level should be preferred?

Binary representation of 129 when using 8bits for two’s complement?

I have this confusion regarding the binary representation of the decimal value 129 (or even 128). If 8 bits are used to represent numbers in two's complement, then we know that '00000000' to '01111111' are used for 0 to 127, and a leading 1 is used for negative numbers, i.e. 1xxxxxxx where x is any combination of 0's and 1's representing the value of that negative number. But then how is 128 represented in 8 bits, given that the first bit is reserved for indicating that the number is negative (basically a signed number)?
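
To illustrate the mapping described above, here is a small sketch (assuming the standard 8-bit two's complement convention) that interprets a bit pattern as a signed value:

# Interpret an 8-bit pattern as a two's complement signed value.
def twos_complement_value(bits: str) -> int:
    assert len(bits) == 8
    unsigned = int(bits, 2)
    # A leading 1 marks a negative number: subtract 2^8 to get the signed value.
    return unsigned - 256 if bits[0] == "1" else unsigned

print(twos_complement_value("01111111"))  # 127, the largest non-negative value
print(twos_complement_value("10000000"))  # -128
print(twos_complement_value("10000001"))  # -127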

In a DDD CQRS API, is it preferable to have a separate DTO per query or per representation of a resource?

I am in the middle of starting up a new project and just wanted some reassurance as to which approach to the DTOs returned by the read side is easier to maintain in a real-world application of reasonable complexity.

I realize that with a DTO per query, a specific query's response can diverge from the others more easily in the future. However, common properties between that query and other queries would require changes in multiple places (e.g. the same resource could have multiple queries in the form of GetById, GetAll, GetByStatus, GetBySomethingElse, etc.), and there is more repetition.

If a DTO is created per representation of a resource, a GetById might return a DetailedDto with many properties, while a GetAll might return a SummarizedDto with fewer properties. This would require fewer changes for common properties, and diverging queries would just require a new version of the DTO to be created. The biggest disadvantage to this approach is that I'm terrible at naming classes and coming up with words, so "Detailed" + "Summarized" is the extent of my imagination.
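
For illustration, here is a minimal sketch of the "DTO per representation" shapes; the class names and fields below are hypothetical, borrowed from the examples in this question.

from dataclasses import dataclass

# "DTO per representation": two shapes of the same resource, reused across queries.
@dataclass
class SummarizedDto:               # returned by GetAll, GetByStatus, ...
    id: int
    name: str
    status: str

@dataclass
class DetailedDto(SummarizedDto):  # returned by GetById
    description: str = ""
    created_at: str = ""
    # ...the many extra properties of the detailed view

# "DTO per query" would instead define GetAllDto, GetByIdDto, GetByStatusDto, ...
# each duplicating the common properties above.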

I’m leaning towards refactoring my first attempt at writing this application from using the “DTO per query” to the “DTO per representation” approach. Are there any benefits to sticking with the “DTO per query” or is “DTO per representation” a good way to go?

Is there a connection between representation theory and PDEs?

As a PhD student, if I want to do something algebraic / linear-algebraic such as representation theory as well as do PDEs, in both the theoretical and numerical aspects of PDEs, would this combination be compatible and / or useful? Is it feasible?

I’d be grateful for an online resource to look into.

Thanks,

Representation of the space of lattices in $\Bbb R^n$

The space of 2D lattices in $ \Bbb R^2$ can be represented with the two Eisenstein series $ G_4$ and $ G_6$ . Each lattice uniquely maps to a point in $ \Bbb C^2$ using these two invariants, and the points $ (1,0)$ and $ (0,1)$ map to the unit square and triangular lattices.

This representation requires us to treat $ \Bbb R^2$ as being isomorphic to $ \Bbb C$ , so that we can evaluate, for some lattice $ \Lambda(\omega_1, \omega_2)$ generated by $ \omega_1$ and $ \omega_2$

$$ G_k(\Lambda) = \sum_{0 \neq \omega \in \Lambda(\omega_1, \omega_2)} \frac{1}{\omega^k} $$

where each element $ \omega \in \Lambda$ is then treated as having the algebra structure from $ \Bbb C$ .
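
For concreteness, here is a naive numerical sketch of this sum in the 2D case (my own illustration, truncating the lattice to $|m|, |n| \leq N$ and treating lattice points as complex numbers; the series is absolutely convergent for $k > 2$):

# Truncated numerical approximation of G_k for the lattice generated by w1, w2,
# treating R^2 as C. Purely illustrative.
def eisenstein(k, w1, w2, N=60):
    total = 0j
    for m in range(-N, N + 1):
        for n in range(-N, N + 1):
            if m == 0 and n == 0:
                continue          # the sum excludes omega = 0
            omega = m * w1 + n * w2
            total += 1 / omega**k
    return total

print(eisenstein(4, 1, 1j))  # square lattice: G_4 is nonzero
print(eisenstein(6, 1, 1j))  # square lattice: G_6 vanishes (approximately 0 here)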

The above is often treated in the setting of modular forms, where it is assumed that everything is in $ \Bbb C$ from the start, and that one of the lattice vectors is $ 1$ . However, it is also given fairly often as an equivalent treatment in terms of real lattices.

So the basic question is, how do you develop a similar theory for $ \Bbb R^n$ ? Is there a similar algebra structure one can place on $ \Bbb R^n$ taking the place of $ \Bbb C$ ?

My big-picture questions:

  1. What is a direct formula to represent the space of lattices in $ \Bbb R^n$ , so that every lattice corresponds to a unique point?
  2. If the above is too complicated, is there an easier representation if we allow for things like “signed lattices,” or allow different automorphisms of the same lattice to be represented with different elements of the same space, etc? (As an example, in $ \Bbb R^2$ , suppose we differentiated between two 90 degree rotations of the square lattice, or three 60 degree rotations of the triangular lattice, while equivocating between any 180 degree rotation of the same lattice.)

Assuming some generalization of the Eisenstein series is relevant, then the follow-up questions would be

  1. How do you generalize the Eisenstein series to lattices in $ \Bbb R^3$ , or $ \Bbb R^n$ in general?
  2. Assuming you manage to do that, the space of lattices for $ \Bbb R^n$ should have dimension $ n^2$ . Which Eisenstein invariants are needed to uniquely represent this space of lattices?
  3. The Eisenstein/Weierstrass invariants can be viewed as coefficients from the Laurent expansion of the Weierstrass elliptic function. Is the above simplified from looking at the Laurent expansions of higher-dimensional elliptic functions?

I have been pointed in a few interesting directions – Eisenstein Series on Real, Complex, and Quaternionic Half-Spaces, Eisenstein Series in Complexified Clifford Analysis, Generalization of Weierstrass Elliptic functions to $ \Bbb R^n$ , On a New Type of Eisenstein Series in Clifford Analysis, Clifford analysis with generalized elliptic and quasi elliptic functions. People have suggested the theory of automorphic forms and the Langlands’ Eisenstein series.

However, I have somehow been unable to unpack all this to get a simple definition of the space of n-dimensional real lattices. I would assume that some generalized Eisenstein series would be necessary, but I don’t know how to build them for arbitrary $ \Bbb R^n$ . Some of the suggestions for replacing $ \Bbb C$ with an arbitrary Clifford algebra can sometimes equivocate between different lattices. For instance, even in the simplest case of $ \Bbb R^4$ , where we can treat the entire thing as the quaternion algebra, we tend to have that the lattices $ \Lambda(1,2\mathbf e_1,\mathbf e_2,\mathbf e_3)$ and $ \Lambda(1,\mathbf e_1,2\mathbf e_2,\mathbf e_3)$ produce the same Eisenstein series, if we naively use the formula $ \sum_{0 \neq \omega \in \Lambda} \frac{1}{\omega^k}$ .

This was originally posted and bountied at MSE: https://math.stackexchange.com/questions/3260577/using-eisenstein-series-to-represent-the-space-of-lattices-generalization-to

I am hoping for some basic explication of how to do this that doesn't require extremely deep knowledge of elliptic curve geometry and the like. I just want a basic representation for the space of n-dimensional lattices.

portable hashable string representation of data object

I have a class of (flat) objects that are going to be passed around between three different parties, each possibly running a different software stack.

my_object:
  item: "string_a"
  money:
    amount: 99.9
    currency: "EUR"
  metadata: 5
  origin: "https://www.example.com/path"

These objects will need to be signed and the signatures will need to be verifiable.
In order for the signed hash to be stable from one party to another, there has to be an exact canonical string representation.

EDIT:
For example, both the following are valid string representations of the above object, but they’ll have different hashes.

<my_object xmlns="https://www.TBD.com">
  <item>string_a</item>
  <money>
    <amount>99.9</amount>
    <currency>EUR</currency>
  </money>
  <metadata>5</metadata>
  <origin>https://www.example.com/path</origin>
</my_object>

<my_object xmlns="https://www.TBD.com">
  <item>string_a</item>
  <money><amount>99.9</amount><currency>EUR</currency></money>
  <metadata>5.0</metadata>
  <origin>https://www.example.com/path</origin>
</my_object>

All parties will be parsing the objects, so it’s the values that need to be verifiable in the signature process. In the above example, the data that needs to be signed is
{“string_a”, 99.9, “EUR”, 5, “https://www.example.com/path”}.
Of course there must be no ambiguity about which value goes with which key; for example it must be clear that 5 is the metadata and not the amount.
That said, it seems reasonable to have the original serialization of the data treated as a data value in itself, with the contained data parsed out of it as needed. This would solve the above problem of needing a byte-perfect canonical version.
END EDIT

I think it’s a good idea to define the “outer” signed object in XML.

<signed_object xmlns="https://www.TBD.com">
  <data>string representation of the inner data object</data>
  <signature>RSA signature of the SHA512 hash of the value in `data`</signature>
</signed_object>

Clearly I have a variety of options for what should go in that inner <data> tag.

  • An escaped XML string would work.
    • We know all the parties have xml parsing set up.
    • We could help with verification by providing an XSD.
    • The relatively large size of an xml string isn’t a big deal because this will all get compressed.
    • On the other hand, it would be a pain for a human to look at an xml-escaped xml string and understand what was going on.
    • Also, a naive implementer might try to rebuild the XML from the contained data, get a different hash, and decline a valid signature.
  • JSON and Yaml might be slightly better than XML for human readability, but would have the same problem that “equivalent” objects could have different hashes.
  • A delimited string (with commas, pipes, etc.) would be more human-readable, and would clearly be string data in itself.
  • Taking that idea even farther, we could provide a canonical regex with capture groups for unambiguously validating and parsing these strings.
  • And finally, we could decide not to worry about having a canonical string version of each such object, and instead have a really well defined process for serializing the data for hashing.
    • This would look the best.
    • This would be the easiest to screw up.

EDIT:
Expanding on that last option (because it’s the one I’ve actually seen done in the wild):

  • Simply concatenating the values in an explicit order probably won't work: what if item ends with numeric characters?
  • Concatenating the keys and values in an explicit order would probably work; the only problems I can think of are ambiguities in how to represent numbers (5 vs 5.0), and possibly conversions between UTF-8/UTF-32/ASCII. (One possible canonical process is sketched after this list.)
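
To make that last option concrete, here is a minimal sketch of one possible canonical process (my own assumption, not a standard): sorted-key JSON with no extra whitespace, numbers carried as strings to dodge the 5 vs 5.0 issue, hashed as UTF-8 bytes with SHA-512.

import hashlib
import json

obj = {
    "item": "string_a",
    "money": {"amount": "99.9", "currency": "EUR"},  # amount kept as a string
    "metadata": "5",                                 # avoids the 5 vs 5.0 ambiguity
    "origin": "https://www.example.com/path",
}

# Canonical form: sorted keys, no extra whitespace, ASCII-escaped output.
canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
digest = hashlib.sha512(canonical.encode("utf-8")).hexdigest()
print(canonical)
print(digest)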

I like the idea of defining the string format using a regex. Is that a bad idea for any reason?

EDIT: I’ll be asking other people to implement parts of this system, on a variety of platforms. What system will be the easiest and/or most reliable to implement, in general?

Intuitive representation of “increase priority”


IGNORE THIS BLOCK
We have a grid where each row represents a pending download. By default, downloads will proceed top-down: first row first, second row after that, etc.

We want to put a column in the grid with some symbol/icon/button inside so the user can use that control to say: I want this row to be transferred after the first row, and then this one… in other words, to prioritize the transfer order, overriding the default 1-to-N order.

NOTE: sorting rows is not an option. Order has other implications and has to be kept as is.

Here's a mock of the existing GUI I am trying to improve.

The user adds items (hard covers, chapters, index, maps, etc.) to build a book. The table can be sorted, and this will change the order of the items in the book.
As the user adds an item, the app begins to download additional info for it (this process is sometimes a bit long). Let our book be: front cover, index, chapter 1, chapter 2, chapter 3, back cover. We add the proper items…

[screenshot of the item grid]

And the transfers begin automatically.

After a while, item 1 has successfully downloaded its info, so item 2 begins to download its info too.

[screenshot of the item grid]

Now suppose Chapter 1 and Chapter 2 are approved and have no recent changes, so there is no need for further inspection, but Chapter 3 and the back cover are the ones we want to check before approving the whole thing.

So we want to instruct the download queue to first download the Hard Cover, and after that proceed with Chapter 3. Otherwise we have to wait for the previous transfers to complete while doing nothing, because we don't need that info.

We want this to happen:

[screenshot of the desired queue order]

My question is:

How can I redesign that "Waiting" column so the user can intuitively know that the download queue can be altered? Which controls/icons (instead of the "Waiting" label) are best suited for that?
I'd like someone facing the waiting problem to say: Cool, if I click/use this I can alter the download queue and prioritize the items I want most!

Feel free to add a column, replace the Waiting label with buttons, icons…
Any suggestions are welcome.