Pervasive are smallest grammar *approximation* algorithms. But rarely if ever do they talk about what would be an exact algorithm for computing smallest grammars, regardless of its efficiency.

Here I attemp to exhibit an algorithm but of course I want to make it as efficient as possible. The idea is to construct a grammar starting at the leaf production rules and constructing from them the next stage, until there’s nothing left to construct.

Let $ s$ be a string over the singleton alphabet $ \Sigma = \{a\}$ .

For instance $ s = a^6$ .

A substring $ t \leqslant s$ is **repeating** in $ s$ if $ t \gamma t \leqslant s$ .

A substring $ t \leqslant s$ is **irreducible** if it contains no repeating substrings itself.

The repeating irreducibles of length $ \geq 2$ in $ s$ are $ I = \{ a^2, a^3 \}$ . These represent all possible **terminal** rules that could occur in a smallest grammar of $ s$ . A terminal rule is simply a rule with no variables occuring on the right of its arrow.

Now form and keep track of the rules $ R = \{A \to a^2, B \to a^3\}$ and put $ X := \{ A, B, a\}$ . Now compute all $ xy$ such that $ x, y \in X$ and the grammar $ G = \{ S \to xy\} \cup \text{Rules}(x) \cup \text{Rules}(y)$ is irreducible and $ \overline{G} \leqslant s$ , where $ \overline{G}$ is full grammar expansion to string $ \text{Rules}(\gamma)$ is defined recursively as follows:

$ $ \text{Rules}(a) = \{\} \ \text{Rules}(uv) = \text{Rules}(u) \cup \text{Rules}(v) \ \text{Rules}(U) = \{U \to \gamma_U\} \cup \text{Rules}(\gamma_U), \ $ $ where $ u,v$ are any nonempty strings, and $ U \to \gamma_U \in R$ .

Add each such $ xy$ to $ X$ . Now keep repeating the last step (augmenting $ X$ ) until it no longer changes size. So we have for $ s = a^6$ :

$ $ X_0 = \{ A, B, a\} \ X_1 = \{A^2, Aa, aA, aB, Ba, B^2 \} \ \vdots \ X_h = \dots $ $ Notice that we didn’t add in stuff such as $ AB$ since the resulting test grammar $ G$ would be reducible since $ a^2 \leqslant a^3$ . In other words we’re constructing only irreducible grammars!

What are methods to estimate the upper bound on $ |X_h|$ ? That would give an idea of the running time.

Forgot to mention, $ t$ is a substring of the grammar $ G$ , written $ t \leqslant G$ if for some rule $ M \to \gamma_M$ in $ G$ , $ t$ is a substring of $ \gamma_M$ .

A substring $ t \leqslant G$ is **repeating** if it is repeating in some $ \gamma_M$ for some $ M \to \gamma_M \in G$ , or if it is a substring of at least two rules in $ G$ .

Note that *smallest* grammar implies *irreducible grammar*, but not vise versa. So we’re constructing a larger class of grammars than “just the smallest” and then checking each.