Generating random words by grammar


A bit of context

I was writing a parser for a grammar, and for testing purposes I come up with idea to generate some random inputs. The grammar I was dealing with was much more complicated, in this question I presented “minimal working example” for simplicity. And of course, I am able to avoid the issue I faced using a bunch of trivial heuristics, but the question really makes me wonder.

The problem

Suppose we have a typical context-free grammar for arithmetic expressions of $ +,*$ , brackets, and integer literals:

$ $ E \longrightarrow F(“+”F)^*$ $ $ $ F \longrightarrow T(“*”T)^*$ $ $ $ T \longrightarrow int|“(”E“)”$ $

It is easy to implement a staighforward algorithm for generating random words by this grammar: we implement a separate procedure for each nonterminal. If a nonterminal has multiple production rules (as $ T$ does), we choose a production rule by tossing a coin. If rule contains kleene star (e.g. $ (“+”F)^*$ ), we also toss a coin and generate zero or one repetition (sure we could pick any random integer $ k\geq0$ and generate $ k$ repetitions, but for simplicity we will focus on the simplest version of this procedure). Here is what we get:

generate_E():     if coin_toss():         return generate_F() + "+" + generate_F()     else:         return generate_F()  generate_F():     if coin_toss():         return generate_T() + "*" + generate_T()     else:         return generate_F()  def generate_T():     if coin_toss():         return "(" + generate_E() + ")"     else:         return random_integer() 

An invocation of generate_E() yields a random expression.

What could go wrong? It turns out that execution of this procedure on the real machine terminates with stack overflow quite often. Of course, technically here we have a possibility of endless recursion, but my intuition was telling me that probability of reaching recursion depth $ k$ decays exponentially with increasing $ k$ , therefore getting deep levels (let’s say, 1000) is almost impossible. Apparently, a few consecutive runs reveals that procedure can easily reach depth of many thousands (by depth I mean maximum number of procedure calls the stack contains simultaneously).

I am curios how to formalize this empirical fact. I want either a formula for $ P(depth = k)$ , or an asymptotic approximation of it, or an inequality bounding right tail of CDF below (something like $ P(depth > 1000) > 0.05$ )

My attempt

I tried to come up with a formula for $ P(depth = k)$ :

Let’s denote $ P(depth = k)$ as $ P_E(k)$ . Also, we define similar values for generate_F() and generate_T()$ P_F(k)$ and $ P_T(k)$ respectivetely.

Clearly (else branch of generate_T), $ $ P_T(1) = \frac{1}{2}$ $ and for $ k > 1$ (then branch)$ $ P_T(k) = \frac{1}{2}P_E(k – 1)$ $

Regarding $ P_F(k)$ , we can either execute else branch, and it gives term $ $ \frac{1}{2}P_T(k – 1)$ $ , or then branch, what gives a bit more complicated one $ $ \frac{1}{2}\sum_{(x, y) | max(x, y) = k – 1}{P_T(x)P_T(y)}$ $ i.e. $ $ P_F(k)= \frac{1}{2}(P_F(k – 1) + \sum_{(x, y) | max(x, y) = k – 1}{P_T(x)P_T(y)})$ $

Finally, the formula for $ P_E(k)$ is almost the same as for $ P_F(f)$ , we only have to replace $ P_T(x)$ with $ P_F(x)$ .

Now, we can calculate some values of $ P_e(k)$

\begin{array} {|r|r|}\hline k & P_E(k) & P_E(k)\text{ in decimal}& P_E(k)\text{ by Monte-Carlo} \ \hline 3 & \frac{33}{128} & \approx0.257812 & \approx0.2527 \ \hline 6 & \frac{4448787585}{34359738368} & \approx0.129477 & \approx0.1282 \ \hline 9 & \frac{14080391757747154038821618051813151497122305}{178405961588244985132285746181186892047843328} & \approx0.078923 & \approx0.0761 \ \hline 12 & \text{ the fraction is too long} & \approx0.053213 & \approx0.0530 \ \hline \end{array}

As we can see, the recurrent formulas seem to be true, but neither they give me an insight on asymptotical behaviour of $ P_E(k)$ , nor the first values give a clue on formula in closed form.

Any help would be appreciated.