How to convert recursive language grammar tree into automaton for optimal parsing?

So I have a definition of a sort of grammar for a programming language. Some aspects are recursive (like nested function definitions); other parts are just simple non-recursive trees.

Originally I just treated this roughly like a Parsing Expression Grammar (PEG) and parsed it using recursive descent. But this is kind of inefficient, as it sometimes has to build up objects as it’s parsing, only to find 10 calls down the road that the input isn’t a good match. So then it backs up and has to try the next pattern.

I’m wondering what the general solution to this is. What can be done to create an optimal algorithm, starting from nothing but a tree-like data structure encoding the grammar?

Can you process the tree-like, recursive-ish BNF/PEG data structure in a while loop, rather than using recursive descent? Better still, can you convert the data structure into a sort of automaton which you can (in some simple way) transition through, parsing the string without going down a bunch of wrong paths and generating objects you soon have to throw away on a mismatch? Or if not, is there anything that can be done here with automata?
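For the record, a recursive grammar can’t be handled by a finite automaton alone; you’d need a pushdown automaton, which is what table-driven LL/LR parsers effectively are. The standard fix for the throw-away-work problem while keeping PEG semantics is packrat parsing: memoize the result of every (rule, position) pair so no alternative is ever re-parsed at the same input position. A hedged sketch, using a made-up tuple encoding of the grammar (my own invention, not a standard format):

```python
# Tiny PEG interpreter with packrat memoization. Expressions are tuples:
#   ("lit", s)        literal string
#   ("seq", e1, e2..) sequence
#   ("alt", e1, e2..) ordered choice (PEG "/")
#   ("rule", name)    reference to a named rule
def parse(grammar, start, text):
    memo = {}  # (rule name, position) -> end position, or None on failure

    def apply(expr, pos):
        kind = expr[0]
        if kind == "lit":
            s = expr[1]
            return pos + len(s) if text.startswith(s, pos) else None
        if kind == "seq":
            for sub in expr[1:]:
                pos = apply(sub, pos)
                if pos is None:
                    return None
            return pos
        if kind == "alt":
            for sub in expr[1:]:
                end = apply(sub, pos)
                if end is not None:
                    return end
            return None
        if kind == "rule":
            key = (expr[1], pos)
            if key not in memo:
                memo[key] = None  # seed as failure, so left recursion fails instead of looping
                memo[key] = apply(grammar[expr[1]], pos)
            return memo[key]
        raise ValueError(kind)

    end = apply(("rule", start), 0)
    return end == len(text)

# Balanced parentheses:  S <- "(" S ")" S / ""
g = {"S": ("alt",
           ("seq", ("lit", "("), ("rule", "S"), ("lit", ")"), ("rule", "S")),
           ("lit", ""))}
```

The memo table is what bounds the work: each (rule, position) pair is computed at most once, so even with backtracking the parse is linear in the input for a fixed grammar (at the cost of memory for the table).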

Sorry for my imprecise terminology (BNF vs. PEG). I just mean I have a grammar, which is context-sensitive and thus falls outside the scope of either of these. I am picking one or the other to simplify this question for the purposes of this site, even though I’m not 100% sure what the difference between BNF and PEG is at this point.

Oriented spanning tree of a directed multi-graph with minimum weight

I have the problem of finding a minimum spanning tree of a simple graph, but the result is restricted by two additional kinds of condition:

  • There is a known root, which is s in the following example.
  • We know the directions of some edges if they are chosen. These edges are not necessarily chosen, or else the problem would become a Steiner tree problem.

Note that the numbers on the edges are their weights. A normal minimum spanning tree would give s -> b -> c -> a, but the direction of edge ac is wrong. On the other hand, we cannot use the Chu–Liu/Edmonds algorithm for a spanning arborescence of a directed graph, because we don’t know and cannot infer the direction of edge bc.

We can infer some edges’ directions from the position of the root. For example, in the example we know s -> b and s -> a.


Oriented Spanning Tree

In the final section of the Wikipedia article on spanning trees, the oriented spanning tree is mentioned and a paper [levine2011sandpile] is cited. The problem fits that setting. It says:

Given a vertex v on a directed multigraph G, an oriented spanning tree T rooted at v is an acyclic subgraph of G in which every vertex other than v has outdegree 1.

Note that the term "outdegree" is a bit confusing here; I think it should be "indegree". But it doesn’t matter, because it just restricts the subgraph to be a directed tree whose root is a source or a sink.

For edges (in the original simple graph) whose directions are unknown, we add two directed edges in opposite directions between the two vertices. Then we find the oriented spanning tree of this multigraph. But it is not clear to me how an algorithm can be implemented according to that paper.
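To illustrate that reduction, here is a hedged sketch (the toy graph and weights below are mine, not the figure from the question): double each undecided undirected edge into two opposite arcs of the same weight, then run Chu–Liu/Edmonds on the resulting multigraph. Writing arcs as (tail, head, weight) and choosing the cheapest incoming arc per non-root vertex yields exactly an arborescence in which every vertex other than the root has indegree 1, matching the quoted definition:

```python
def min_arborescence(n, root, edges):
    """Weight of a minimum spanning arborescence of vertices 0..n-1
    rooted at `root`, arcs given as (tail, head, weight).
    Returns None if some vertex is unreachable."""
    total = 0
    while True:
        best = {}  # head -> (tail, weight): cheapest arc entering each non-root vertex
        for u, v, w in edges:
            if v != root and u != v and (v not in best or w < best[v][1]):
                best[v] = (u, w)
        if len(best) != n - 1:
            return None                      # some vertex has no incoming arc
        total += sum(w for _, w in best.values())
        # Look for a cycle among the chosen arcs by walking parents.
        state = [0] * n                      # 0 unseen, 1 on current walk, 2 done
        cycle = None
        for s in range(n):
            path, x = [], s
            while x != root and state[x] == 0:
                state[x] = 1
                path.append(x)
                x = best[x][0]
            if x != root and state[x] == 1:  # walked back into the current path
                cycle, y = {x}, best[x][0]
                while y != x:
                    cycle.add(y)
                    y = best[y][0]
            for y in path:
                state[y] = 2
            if cycle:
                break
        if cycle is None:
            return total                     # chosen arcs already form an arborescence
        # Contract the cycle to one super-vertex; discount every arc
        # entering a vertex by the chosen weight it would replace.
        mapping, nid = {}, 0
        for v in range(n):
            if v not in cycle:
                mapping[v] = nid
                nid += 1
        for v in cycle:
            mapping[v] = nid
        new_edges = []
        for u, v, w in edges:
            if mapping[u] != mapping[v]:
                discount = best[v][1] if v != root else 0
                new_edges.append((mapping[u], mapping[v], w - discount))
        n, root, edges = nid + 1, mapping[root], new_edges

# s = 0, a = 1, b = 2; undirected edge a-b (weight 2) doubled into two arcs:
arcs = [(0, 1, 1), (0, 2, 5), (1, 2, 2), (2, 1, 2)]
```

This is the unoptimized O(VE)-per-contraction form of the algorithm; the point is only that doubling the undecided edges removes the obstacle of "we cannot infer the direction of bc", since Edmonds picks the orientation for you.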


  • Levine, L. (2011). Sandpile groups and spanning trees of directed line graphs. Journal of Combinatorial Theory, Series A, 118(2), 350-364.
  • https://en.wikipedia.org/wiki/Spanning_tree

Recording a histogram in a tree exhibits strange best case

The task is to record a histogram from a streaming data source.

One data point is, say, a 16-bit integer. The maximum multiplicity of any one data point before the stream ends is < 2^32. The main restriction is that there isn’t enough memory for an array of counters completely covering the range of the data points (65536 counters, each up to 2^32), much less to record all single data points and bulk-process them afterwards.

In the case where the dispersion of the data source covers the complete possible 16-bit range, the result is not of interest; i.e., this marks a histogram type which is not processed further. Valid histograms consist of only a few clusters where the majority of points are found. The data is roughly continuous, which has a very valuable consequence: new outliers can (without loss of detection capability) supersede old outliers (i.e. those with the lowest count), so that new clusters can enter the histogram even when the available memory is full. This is very rare, however; the data source is in a way nice to track, just the number and location of the clusters is unpredictable.

The following is an attempt of mine with a binary tree recording the multiplicity of each data point (I didn’t implement the substitution, just the insert). The special property of the tree is that it tries to move points with a high occurrence count close to the root, in the hope of a short search path in the majority of cases. The boolean predicate is the function exchange_p, which compares the weight of a root to its left or right child. In case the predicate is true, an exchange is executed, possibly invoking further exchanges later on.

The idea is that it is favorable not to exchange tree nodes at every insert, but to let a node’s count run up to a certain multiple of its parent’s before an exchange is initiated, thereby avoiding unnecessary oscillations of nodes. Things worked out very differently, though.

The setup exhibits a strange behaviour: QUOTA around 0.8 (i.e. exchange if root < 0.8 * child) gives the overall minimum number of operations. Is there some theory which combines O-notation with stochastics for special input distributions?

You can run the example below (it has real-world data in it); the file is completely self-contained and should run out of the box.

#include <cstdio>
#include <cstdint>
#include <cstdlib>

int exchg;
int ins;
int tree_ins;

float QUOTA = 1.0f;

void error_func (const char *s)
{
  printf ("%s", s);
  exit (1);
}

typedef enum { LEFT_CHILD, RIGHT_CHILD, NO_CHILD } child_side_t;

typedef int16_t element_t;
typedef uint32_t counting_t;

struct bin_tree_counting_node_t {
  bin_tree_counting_node_t *left, *right;
  counting_t cnt;
  element_t element;
};

void print_node (bin_tree_counting_node_t* root)
{
  printf ("%i\t%u\n", root->element, root->cnt);
}

void print_tree (bin_tree_counting_node_t* root, int lvl)
{
  if (root == NULL)
    return;
  print_tree (root->left, lvl+1);
  // printf ("Level %d: ", lvl);
  print_node (root);
  print_tree (root->right, lvl+1);
}

bool exchange_p (bin_tree_counting_node_t* root,
                 bin_tree_counting_node_t* child,
                 child_side_t side)
{
  return (root->cnt < QUOTA * child->cnt);
}

bin_tree_counting_node_t* tree_insert_tree (bin_tree_counting_node_t* tree_a,
                                            bin_tree_counting_node_t* tree_b)
{
  tree_ins++;

  if (tree_a == NULL)
    return tree_b;
  else if (tree_b == NULL)
    return tree_a;
  else if (exchange_p (tree_a, tree_b, NO_CHILD)) {
    if (tree_a->element < tree_b->element)
      tree_b->left = tree_insert_tree (tree_a, tree_b->left);
    else
      tree_b->right = tree_insert_tree (tree_a, tree_b->right);
    return tree_b;
  }
  // Case for a > b coincides with a ~ b
  // else if (exchange_p (tree_b, tree_a, NO_CHILD)) {
  //   if (tree_a->element < tree_b->element)
  //     tree_a->right = tree_insert_tree (tree_a->right, tree_b);
  //   else
  //     tree_a->left = tree_insert_tree (tree_a->left, tree_b);
  //   return tree_a;
  // }
  else {
    if (tree_a->element < tree_b->element)
      tree_a->right = tree_insert_tree (tree_a->right, tree_b);
    else
      tree_a->left = tree_insert_tree (tree_a->left, tree_b);
    return tree_a;
  }
}

bin_tree_counting_node_t* exchange (bin_tree_counting_node_t* root,
                                    child_side_t side)
{
  exchg++;

  // Exchange root with its child
  bin_tree_counting_node_t* child;
  if (side == LEFT_CHILD) {
    child = root->left;
    root->left = NULL;
    child->right = tree_insert_tree (root, child->right);
  }
  else {
    child = root->right;
    root->right = NULL;
    child->left = tree_insert_tree (root, child->left);
  }
  return child;
}

bin_tree_counting_node_t* tree_insert (bin_tree_counting_node_t* root,
                                       element_t elem)
{
  ins++;

  if (root == NULL) {
    bin_tree_counting_node_t* new_node = (bin_tree_counting_node_t*)malloc (sizeof (bin_tree_counting_node_t));
    if (new_node == NULL) { error_func ("Memory exhausted!"); }
    new_node->element = elem;
    new_node->cnt = 1;
    new_node->right = new_node->left = NULL;
    return new_node;
  }

  if (elem == root->element) {
    root->cnt++;
    if (root->cnt == 0) { error_func ("Counting overflow! Very large amount of inserts >2^32!"); }
  }
  else if (elem < root->element) {
    root->left = tree_insert (root->left, elem);
    if (exchange_p (root, root->left, LEFT_CHILD))
      return exchange (root, LEFT_CHILD);
  }
  else {
    root->right = tree_insert (root->right, elem);
    if (exchange_p (root, root->right, RIGHT_CHILD))
      return exchange (root, RIGHT_CHILD);
  }
  return root;
}

void free_tree (bin_tree_counting_node_t* root)
{
  if (root == NULL)
    return;
  free_tree (root->left);
  free_tree (root->right);
  free (root);
}

int main (void)
{
  element_t sample[] =
    {
      17060,17076,17076,17060,17092,17028,17076,17060,17060,17076,17060,17076,17060,17060,17140,17140,17124,17124,
      17124,17140,17108,17140,17124,17124,17108,17108,17124,17124,17140,17092,17124,17124,17140,17140,17124,17156,
      17140,17108,17156,17140,17124,17124,17124,17108,17124,17124,17124,17092,17140,17092,
      17124,17108,17156,17124,17156,17140,17124,17172,17124,17108,17124,17108,17108,17108,17108,17124,17108,17124,
      17108,17108,17124,17124,17140,17124,17108,17092,17108,17108,17092,17124,17108,17108,17124,17108,17124,17108,
      17124,17140,17156,17124,17108,17108,17124,17124,17124,17140,17092,17140,17124,
      17108,17124,17124,17124,17124,17108,17124,17108,17124,17108,17108,17108,17124,17108,17124,17124,17108,17108,
      17124,17108,17124,17140,17124,17124,17108,17108,17140,17140,17124,17108,17140,17124,16291,16339,16339,16307,
      16323,16259,16275,16275,16259,16388,16388,16355,15795,15731,16195,16179,16179,
      15715,14467,14643,14851,17284,17012,16147,15155,15203,16131,15331,14691,14739,14739,14755,14723,14739,14707,
      14771,14739,14707,14691,14531,14787,14563,15587,15907,15907,15923,15907,13123,13107,13091,13267,13091,13187,
      13091,14643,15875,15907,16964,16404,16227,15219,14771,14771,14803,14787,14803,
      14787,14803,14803,14755,14755,14771,14755,14723,14723,14739,14739,14963,16884,16868,15827,13075,13123,13091,
      12979,13043,13219,13075,13059,13075,13123,15779,15875,16916,17028,15155,14675,14707,14691,14707,14691,14691,
      14691,14707,14691,14691,14691,14675,14707,14691,14675,14595,14595,14595,14563,
      14547,14579,14611,14547,14515,14611,14595,14611,14595,14659,14659,14627,14643,14643,14659,14643,14643,14643,
      14659,14659,14643,14643,14611,14659,14611,14659,14659,14659,14643,14643,14643,14643,14611,14643,14627,14675,
      14627,14643,14531,14531,14499,15187,16884,16932,15891,15923,13107,13091,13043,
      13059,13075,13027,13027,13059,13059,13587,13059,15763,16900,16932,14899,14595,14627,14659,14691,14691,14691,
      14691,14659,14675,14675,14675,14675,14691,14675,14691,14675,14611,14643,15107,16788,16964,14835,13363,13059,
      13075,13075,13043,13043,13043,13075,13059,13059,15475,15715,16884,16932,16964,
      14707,14643,14675,14643,14611,14611,14611,14611,14627,14611,14611,14627,14627,14611,14595,14595,14595,14563,
      15027,16756,16932,14595,13059,13075,13059,13059,13011,13027,13059,13059,13043,13059,15683,16852,16948,16067,
      15267,14771,15875,16964,16996,16980,17012,17012,17028,17028,17076,17060,17044,
      17060,17076,17076,17092,17076,17092,17092,17108,17124,17092,17108,17124,17108,17124,17124,17108,17140,17124,
      17108,17124,17108,17028,17124,17124,17092,17124,17108,17124,17108,17124,17092,17108,17108,17108,17124,17124,
      17140,17108,17124,17108,17124,17108,17108,17156,17124,17108,17092,17108,17092,
      17108,17108,17124,17124,17108,17108,17108,17124,17108,17124,17108,17124,17124,17108,17124,17124,17124,17124,
      17124,17092,17140,17108,17124,17124,17140,17028,17124,17028,17108,17108,17124,17124,17124,17124,17108,17108,
      17124,17140,17044,17108,17028,17124,17124,17124,17108,17124,17124,17108,17124,
      17092,17124,17124,17124,17108,17124,17124,17140,17124,17140,17140,17140,17124,17044,17124,17092,17140,17124,
      17140,17124,17124,17124,17108,17124,17124,17124,17124,17124,17140,17076,17140,17140,17108,17140,17124,17140,
      17140,17108,17108,17124,17108,17124,17124,17124,17108,17140,17124,17140,17060,
      17124,17124,17188,17108,17124,17124,17140,
    };

  for (QUOTA = 0.1; QUOTA < 3; QUOTA += 0.1) {
    ins = tree_ins = exchg = 0;

    bin_tree_counting_node_t *tree = NULL;

    for (int i=0; i<sizeof sample/sizeof sample[0]; i++) {
      tree = tree_insert (tree, sample[i]);
    }
    printf ("Q:\t%f\tins:\t%d\t tree_ins:\t%d\t exchg:\t%d\t sum:\t%d\n", QUOTA, ins, tree_ins, exchg, ins + tree_ins + exchg);
    //print_tree (tree, 0);
    //printf ("\n");
    free_tree (tree);
  }
  return 0;
}

What is the most efficient way to turn a list of directory path strings into a tree?

I’m trying to find out the most efficient way of turning a list of path strings into a hierarchical tree of hash maps, using these rules:

  • Node labels are delimited/split by ‘/’
  • Hash maps have the structure:
{
    label: "Node 0",
    children: []
}
  • Node labels are also keys, so for example all nodes with the same label at the root level will be merged

So the following input:

[     "Node 0/Node 0-0",     "Node 0/Node 0-1",     "Node 1/Node 1-0/Node 1-0-0" ] 

Would turn into:

[
    {
        label: "Node 0",
        children: [
            {
                label: "Node 0-0",
                children: []
            },
            {
                label: "Node 0-1",
                children: []
            },
        ]
    },
    {
        label: "Node 1",
        children: [
            {
                label: "Node 1-0",
                children: [
                    {
                        label: "Node 1-0-0",
                        children: []
                    },
                ]
            },
        ]
    },
]
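For what it’s worth, a sketch of the linear-time approach in Python (the function name and the temporary `_index` field are mine, invented so that sibling lookup is an O(1) dict access instead of a scan of `children`; the total cost is then proportional to the combined length of all paths):

```python
def paths_to_tree(paths):
    # Build the children lists, keeping a parallel dict per node that
    # maps a label to its child node for O(1) merging of duplicates.
    root_children, root_index = [], {}
    for path in paths:
        children, index = root_children, root_index
        for label in path.split("/"):
            if label not in index:
                node = {"label": label, "children": [], "_index": {}}
                index[label] = node
                children.append(node)
            node = index[label]
            children, index = node["children"], node["_index"]

    def strip(nodes):  # drop the helper indexes before returning
        for n in nodes:
            del n["_index"]
            strip(n["children"])

    strip(root_children)
    return root_children
```

Keeping the index alongside each list (rather than searching `children` for a matching label) is what avoids the quadratic blow-up when many siblings share a parent.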

Why the decision tree method for a lower bound on finding a minimum doesn’t work

(Motivated by this question. Also I suspect that my question is a bit too broad)

We know the $\Omega(n \log n)$ lower bound for sorting: we can build a decision tree where each inner node is a comparison and each leaf is a permutation. Since there are $n!$ leaves, the minimum tree height is $\Omega(\log (n!)) = \Omega(n \log n)$.

However, this doesn’t work for the following problem: find the minimum in an array. For this problem, the results (the leaves) are just the indices of the minimum element. There are $n$ of them, and therefore the reasoning above gives an $\Omega(\log n)$ lower bound, which is obviously an underestimate.

My question: why does this method work for sorting but not for minimum? Is there some deeper intuition, or does it "just happen" that we were "lucky" that sorting has so many possible answers?

In a sense, the $\Omega(\log n)$ lower bound from the decision tree is perfectly accurate: if we may ask arbitrary yes/no questions, then $O(\log n)$ answers do suffice, since we can binary-search for the desired index. My question still remains.
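For contrast, a sketch of how the tight bound for minimum is usually obtained (this is the standard adversary-style argument, not part of the question): call an element a loser once it has lost at least one comparison. An algorithm can only certify the minimum once all other $n-1$ elements are losers, and each comparison creates at most one new loser, so at least $n-1$ comparisons are necessary. The decision-tree leaf count, by contrast, only measures the information content of the output (here $\log_2 n$ bits), which says nothing about the work needed to certify that output against comparison-based inputs.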

Understanding the proof of “DFS of an undirected graph $G$ yields either tree edges or back edges” better, with a graph for each statement in the proof

I was going through the edge classification section for the $\text{DFS}$ algorithm on an undirected graph in the text Introduction to Algorithms by Cormen et al., where I came across the following proof. I was having a little difficulty understanding the steps of the proof, and hence I made an attempt to understand it fully by accompanying each statement in the proof with a possible graph of the situation.

Theorem 22.10: In a depth-first search of an undirected graph $G$, every edge of $G$ is either a tree edge or a back edge.

Proof:

  1. Let $(u, v)$ be an arbitrary edge of $G$, and suppose without loss of generality that $d[u] < d[v]$. Then $v$ must be discovered and finished before we finish $u$ (while $u$ is gray), since $v$ is on $u$’s adjacency list.

  2. If the edge $(u, v)$ is explored first in the direction from $u$ to $v$, then $v$ is undiscovered (white) until that time.

Figure 1

Figure 1 : Situation in point 2. DFS starts from ‘u’; ‘u’ is grayed, and DFS then looks along the red arrow to ‘v’.

  3. Otherwise we would have explored this edge already in the direction from $v$ to $u$. Thus, $(u, v)$ becomes a tree edge.

Figure 2

Figure 2 : Situation in point 3. DFS starts from ‘u’; ‘u’ is grayed; it then discovers ‘w’, which is grayed, and then discovers ‘v’. DFS then looks along the red arrow to ‘u’; the green pointer explains the rest.

  4. If $(u, v)$ is explored first in the direction from $v$ to $u$, then $(u, v)$ is a back edge, since $u$ is still gray at the time the edge is first explored.

Figure 3

Figure 3 : Situation in point 4. DFS starts from ‘u’; ‘u’ is grayed; it then discovers ‘w’, which is grayed, and then discovers ‘v’. DFS then looks along the red arrow to ‘u’; ‘u’ is already gray, so the edge becomes a back edge, as indicated by the green pointer.
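As a sanity check of the case analysis above, here is a hedged sketch (my own code, not from CLRS) of a colouring DFS that classifies each undirected edge the first time it is traversed: a white neighbour gives a tree edge (points 2 and 3), a gray neighbour whose edge is not yet labelled gives a back edge (point 4).

```python
def classify_edges(adj, start):
    # Colours follow CLRS: white = undiscovered, gray = on the recursion
    # stack, black = finished.
    colour = {v: "white" for v in adj}
    kind = {}  # frozenset({u, v}) -> "tree" or "back"

    def dfs(u):
        colour[u] = "gray"
        for v in adj[u]:
            e = frozenset((u, v))
            if colour[v] == "white":
                kind[e] = "tree"      # explored first from u towards white v
                dfs(v)
            elif colour[v] == "gray" and e not in kind:
                kind[e] = "back"      # v is a gray ancestor: back edge
        colour[u] = "black"

    dfs(start)
    return kind
```

On a triangle a-b-c this labels two edges tree and the closing edge back, and no edge ever ends up forward or cross, which is exactly the theorem.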


Could anyone confirm whether I am on the right track, and if not, please rectify me.

[My question might seem similar to this or this but neither of them seemed to help me]

Visualizing a directory structure as a tree map of rectangles

There’s this nice tool called WinDirStat which lets you scan a directory and view files in a rectangular tree map. It looks like this:

windirstat

The size of each block is related to the file size, and blocks are grouped by directory and coloured distinctly according to the top level directory. I’d like to create a map like this in Mathematica. First I get some file names in the tree of my Mathematica installation and calculate their sizes:

fassoc = FileSystemMap[FileSize, File[$InstallationDirectory], Infinity, IncludeDirectories -> False];

Level Infinity ensures it traverses the whole tree. I could also add 1 to ensure the association is flattened, but I want the nesting so I can assign total sizes per directory.

I can find the total size which I’ll need to use to scale the rectangles:

QuantityMagnitude[Total[Cases[fassoc, Quantity[_, _], Infinity]], "Bytes"] 

My idea is to recursively apply this total. In theory I could use it with a tree graph and weight the vertices by size, but I want to convert this into a map of rectangles as in WinDirStat. While the colour is obvious (each level-1 association and all its descendants get a RandomColor[]), I’m not sure how I should go about positioning the rectangles in a Graphics. Any ideas?
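Not Mathematica, but a language-agnostic sketch of the simplest layout, "slice and dice": alternate the split axis at each nesting level and give each child a strip proportional to its total size. (WinDirStat’s squarified look is a refinement of this with better aspect ratios.) The nested-dict input format is my stand-in for the nested association; converting the output list into Rectangle primitives inside Graphics should then be mechanical.

```python
def total(node):
    # A leaf is a plain number (a file size); a dict is a directory.
    return node if isinstance(node, (int, float)) else sum(map(total, node.values()))

def layout(node, x0, y0, x1, y1, depth=0, path=()):
    """Return [(path, (x0, y0, x1, y1)), ...] rectangles for all leaves."""
    if isinstance(node, (int, float)):
        return [(path, (x0, y0, x1, y1))]
    rects, size = [], total(node)
    if size == 0:
        return rects
    pos = x0 if depth % 2 == 0 else y0
    for name, child in node.items():
        frac = total(child) / size
        if depth % 2 == 0:            # even depth: vertical strips
            w = (x1 - x0) * frac
            rects += layout(child, pos, y0, pos + w, y1, depth + 1, path + (name,))
            pos += w
        else:                         # odd depth: horizontal strips
            h = (y1 - y0) * frac
            rects += layout(child, x0, pos, x1, pos + h, depth + 1, path + (name,))
            pos += h
    return rects

tree = {"dirA": {"f1": 6, "f2": 2}, "dirB": {"f3": 2}}
rects = layout(tree, 0, 0, 1, 1)
```

Each rectangle’s area comes out proportional to its file’s share of the total, and the first element of each path tuple tells you which top-level directory to colour it by.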

Merkle tree sorting leaves and pairs

I am implementing a Merkle tree and am considering using either of the two options.

The first one is sorting only the leaves. This one makes sense to me, since you would like to get the same tree every time you construct it from the data, which might not arrive sorted by default.

       CAB
      /    \
     CA     \
    /  \     \
   C    A     B
  / \  / \   / \
 1   2 3  4 5   6

The second one is sorting the leaves and the pairs, which means that after sorting the leaves, you also sort all the pairs after hashing them. However, I’m not entirely sure about the benefits of this implementation (if any).

       ACB
      /    \
     AC     \
    /  \     \
   C    A     B
  / \  / \   / \
 1   2 3  4 5   6

I have seen these implementations of Merkle trees in the past but am not sure about their benefits. So why choose one over the other?
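A hedged sketch of both variants (SHA-256 and duplicate-the-odd-node padding are conventions I chose, not requirements). The practical difference, as I understand it: sorting only the leaves makes the root independent of arrival order, while additionally sorting each hashed pair makes Merkle proofs position-independent, so a verifier can hash each proof step in sorted order and never needs left/right flags.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves, sort_pairs=False):
    # Hash the leaves and sort them once, so arrival order is irrelevant.
    level = sorted(h(x) for x in leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])   # duplicate the odd node (one common convention)
        nxt = []
        for a, b in zip(level[::2], level[1::2]):
            if sort_pairs and b < a:  # variant 2: order each pair before hashing
                a, b = b, a
            nxt.append(h(a + b))
        level = nxt
    return level[0]
```

The cost of the sorted-pairs variant is that the tree no longer commits to leaf positions, which matters if your application needs to prove "leaf i is x" rather than just membership.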

How to answer the following queries on a tree?

Given a tree of "N" nodes (each node i has been assigned a value A[i]; node "1" is the root of the tree) and a constant "K", we have Q queries of the following type: [w]

(which means: find the lowest-valued node in the subtree of [w], considering only those nodes in the subtree of [w] whose depth, counting [w] itself as level 1, is less than or equal to K).

Example :

Values of the tree nodes: A[1] = 10, A[2] = 20, A[3] = 30, A[4] = 40, A[5] = 50, A[6] = 60.

Edges of the tree: [1-2], [2-3], [3-4], [4-5], [4-6].

K = 2.

Query-1: [w]=1. All nodes in the subtree of [w]: (1,2,3,4,5,6); of these, the nodes with depth less than or equal to K: (1,2). Hence minimum(A[1], A[2]) = min(10, 20) = 10 is the answer.

Query-2: [w]=4. All nodes in the subtree of [w]: (4,5,6); of these, the nodes with depth less than or equal to K: (4,5,6). Hence minimum(A[4], A[5], A[6]) = min(40, 50, 60) = 40 is the answer.
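One way to exploit the structure (a hedged sketch, using the depth convention from the example where [w] itself counts as level 1): since K is a single fixed constant, every possible answer can be precomputed in one post-order pass in O(nK) time, after which each query is O(1).

```python
def preprocess(n, A, adj, root, K):
    # down[d] = minimum A-value among descendants at relative depth d,
    # where the node itself is d = 0 and only d < K entries are kept,
    # i.e. the question's 1-based levels 1..K.
    ans = [None] * (n + 1)

    def dfs(v, parent):
        down = [A[v]] + [float("inf")] * (K - 1)
        for c in adj[v]:
            if c != parent:
                child_down = dfs(c, v)
                for d in range(1, K):          # shift the child's levels down by one
                    down[d] = min(down[d], child_down[d - 1])
        ans[v] = min(down)
        return down

    dfs(root, 0)
    return ans
```

For a deep tree you would make the DFS iterative to avoid recursion limits, and the O(nK) factor can be reduced by merging along the longest child path, but this shows the idea: answers for all w are available before any query arrives.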

What is the maximal difference between the depths of 2 leaves in AVL tree?

I’m wondering what’s the answer of the following question:

What is the maximal difference between the depths of 2 leaves in an AVL tree?

Intuitively I think that it shouldn’t exceed $\log n$, but I have no idea how to prove that. On the other hand, I have seen multiple articles claiming that the difference can become larger and larger, without any formal proof (I always reach a contradiction when trying to draw such trees).

I would really appreciate it if someone could explain to me why this is correct or incorrect.
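A sketch of the bounding argument I would expect here, assuming the usual AVL height-balance condition (sibling subtree heights differ by at most 1): in a tree of height $h$, every child of a node of height $h'$ has height at least $h' - 2$, so any root-to-leaf path loses at most 2 units of height per step and therefore has length at least $\lceil h/2 \rceil$. The deepest leaf has depth $h$, so the difference between any two leaf depths is at most $h - \lceil h/2 \rceil = \lfloor h/2 \rfloor$. Combined with the standard bound $h \le 1.44 \log_2 n$, the difference is $O(\log n)$, consistent with the intuition above; and Fibonacci trees (where the shallowest leaf sits at depth about $h/2$) show that $\Theta(\log n)$ is actually attained, so the difference does grow, but never faster than logarithmically.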