# Generating random words by grammar

## A bit of context

I was writing a parser for a grammar, and for testing purposes I come up with idea to generate some random inputs. The grammar I was dealing with was much more complicated, in this question I presented “minimal working example” for simplicity. And of course, I am able to avoid the issue I faced using a bunch of trivial heuristics, but the question really makes me wonder.

## The problem

Suppose we have a typical context-free grammar for arithmetic expressions of $$+,*$$, brackets, and integer literals:

$$E \longrightarrow F(“+”F)^*$$ $$F \longrightarrow T(“*”T)^*$$ $$T \longrightarrow int|“(”E“)”$$

It is easy to implement a staighforward algorithm for generating random words by this grammar: we implement a separate procedure for each nonterminal. If a nonterminal has multiple production rules (as $$T$$ does), we choose a production rule by tossing a coin. If rule contains kleene star (e.g. $$(“+”F)^*$$), we also toss a coin and generate zero or one repetition (sure we could pick any random integer $$k\geq0$$ and generate $$k$$ repetitions, but for simplicity we will focus on the simplest version of this procedure). Here is what we get:

generate_E():     if coin_toss():         return generate_F() + "+" + generate_F()     else:         return generate_F()  generate_F():     if coin_toss():         return generate_T() + "*" + generate_T()     else:         return generate_F()  def generate_T():     if coin_toss():         return "(" + generate_E() + ")"     else:         return random_integer() 

An invocation of generate_E() yields a random expression.

What could go wrong? It turns out that execution of this procedure on the real machine terminates with stack overflow quite often. Of course, technically here we have a possibility of endless recursion, but my intuition was telling me that probability of reaching recursion depth $$k$$ decays exponentially with increasing $$k$$, therefore getting deep levels (let’s say, 1000) is almost impossible. Apparently, a few consecutive runs reveals that procedure can easily reach depth of many thousands (by depth I mean maximum number of procedure calls the stack contains simultaneously).

I am curios how to formalize this empirical fact. I want either a formula for $$P(depth = k)$$, or an asymptotic approximation of it, or an inequality bounding right tail of CDF below (something like $$P(depth > 1000) > 0.05$$)

## My attempt

I tried to come up with a formula for $$P(depth = k)$$:

Let’s denote $$P(depth = k)$$ as $$P_E(k)$$. Also, we define similar values for generate_F() and generate_T()$$P_F(k)$$ and $$P_T(k)$$ respectivetely.

Clearly (else branch of generate_T), $$P_T(1) = \frac{1}{2}$$ and for $$k > 1$$ (then branch)$$P_T(k) = \frac{1}{2}P_E(k – 1)$$

Regarding $$P_F(k)$$, we can either execute else branch, and it gives term $$\frac{1}{2}P_T(k – 1)$$, or then branch, what gives a bit more complicated one $$\frac{1}{2}\sum_{(x, y) | max(x, y) = k – 1}{P_T(x)P_T(y)}$$ i.e. $$P_F(k)= \frac{1}{2}(P_F(k – 1) + \sum_{(x, y) | max(x, y) = k – 1}{P_T(x)P_T(y)})$$

Finally, the formula for $$P_E(k)$$ is almost the same as for $$P_F(f)$$, we only have to replace $$P_T(x)$$ with $$P_F(x)$$.

Now, we can calculate some values of $$P_e(k)$$

$$\begin{array} {|r|r|}\hline k & P_E(k) & P_E(k)\text{ in decimal}& P_E(k)\text{ by Monte-Carlo} \ \hline 3 & \frac{33}{128} & \approx0.257812 & \approx0.2527 \ \hline 6 & \frac{4448787585}{34359738368} & \approx0.129477 & \approx0.1282 \ \hline 9 & \frac{14080391757747154038821618051813151497122305}{178405961588244985132285746181186892047843328} & \approx0.078923 & \approx0.0761 \ \hline 12 & \text{ the fraction is too long} & \approx0.053213 & \approx0.0530 \ \hline \end{array}$$

As we can see, the recurrent formulas seem to be true, but neither they give me an insight on asymptotical behaviour of $$P_E(k)$$, nor the first values give a clue on formula in closed form.

Any help would be appreciated.