The class of grammars recognizable by backtracking recursive-descent parsers

It is easy to show that there exists a grammar that can be parsed by a recursive-descent parser with backtracking but is not an $ \text{LL}(k)$ grammar (consider any grammar with a production featuring two alternatives starting with $ k$ occurrences of the same terminal).

My question is the following: Is there an identifiable strict superset of $ \bigcup_{k \in \mathbb{N}} \text{LL}(k)$ grammars that can be parsed by a backtracking recursive-descent parser, regardless of complexity?

If yes, is the maximal strict superset also identifiable?

Are there context free grammars for all restricted Dyck paths?

A Dyck path is a finite list of $ 1$ ‘s and $ -1$ ‘s whose partial sums are nonnegative and whose total sum is $ 0$ . For example, [1, 1, -1, -1] is a Dyck path. Rather than use $ 1$ and $ -1$ , it is common to use "U" and "D" for $ 1$ and $ -1$ , respectively, so we might write UUDD instead. (These might be more familiar as Dyck words.)

It is well-known that Dyck paths have a standard "grammar." The "Dyck grammar" for a path $ P$ is

$ $ P = \epsilon \quad | \quad U P D P.$ $

This grammar is very useful because it lets us quickly compute the generating function which enumerates the number of Dyck paths of given lengths.

I am interested not in all Dyck paths, but restricted sets of Dyck paths. For example, consider the Dyck paths which avoid "peaks" (the bigram UD) at positive even heights; UUUDDD is such a path, while UUDD is not. If we could devise an unambiguous grammar for such paths, then we could write down an equation for the generating function which counts them. (There is such a grammar. It requires two production rules – one for the paths which avoid peaks at even heights, and one for paths which avoid peaks at odd heights.)

This idea leads to a natural question:

Given a (possibly infinite) set of positive integers $ A$ , does there exist a finite, unambiguous context-free grammar which generates precisely the Dyck paths whose peaks avoid $ A$ ?

The answer (I think) is "yes" when $ A$ is finite or an arithmetic progression. I suspect that the answer is "no" in every other case, but I have no idea how to begin proving this.

For example, here is a special case that I cannot decide:

Is the set of Dyck paths which avoid peaks at positive squares a context-free language?

Are grammars consisting only of rules with one symbol on each side NL-complete?


The unrestricted grammars characterize the recursively enumerable languages. This is the same as saying that for every unrestricted grammar G there exists some Turing machine capable of recognizing L(G) and vice versa.

Context: Grammars are Turing-complete. Therefore complexity classes like NL have equivalences in grammars.

One important NL-complete problem is ST-connectivity (or “Reachability”) (Papadimitriou 1994 Thrm. 16.2), the problem of determining whether, given a directed graph G and two nodes s and t on that graph, there is a path from s to t. ST-connectivity can be seen to be in NL, because we start at the node s and nondeterministically walk to every other reachable node. ST-connectivity can be seen to be NL-hard by considering the computation state graph of any other NL algorithm, and considering that the other algorithm will accept if and only if there is a (nondetermistic) path from the starting state to an accepting state.

Given a directed graph, deciding if a->b is a directed path is NL-complete.

We will reduce the directed graph to a grammar rules with one symbol on each side:

For each directed edge in the graph, add a grammar rule. The directed edge a->b becomes the grammar rule a|b.

The NL-complete query becomes, “If I set a to the start symbol, can I derive symbol b using the grammar rules?”

Each grammar rule has one symbol on each side (i.e. a|b).

Therefore grammar rules with one symbol on each side is NL-complete.

Are grammars consisting only of rules with one symbol on each side NL-complete?

Grammars and Parsing: Ambiguity in Sequences

BACKGROUND

I am writing a grammar and parser for a domain-specific language. There is a specific form of expression that, while simple, is giving me headaches.

Given terminals:

  • “a”: KEYWORD
  • “b”: INTVAL
  • “c”: DECVAL
  • “d”: STRVAL

Given rules:

  • “E”: VALUE <- INTVAL
  • “F”: VALUE <- DECVAL
  • “G”: VALUE <- STRVAL
  • “H”: VALUES <- VALUE
  • “I”: VALUES <- VALUES VALUE
  • “J”: LINE <- KEYWORD VALUES

Not complicated, right? This specific subset of terminals and rules should be able to parse the following line:

  • KEYWORD DECVAL DECVAL DECVAL DECVAL

In other words, a keyword followed by one or more values should be a grammatically-correct line. DECVAL tokens should reduce to VALUE tokens, which are in turn aggregated into a series of VALUES, at which point the line reduces under rule “J”.

PROBLEM

However, consider the following parse state, after the first two tokens have been shifted and the lookahead is the next DECVAL:

  • KEYWORD DECVAL | DECVAL

This is a shift-reduce conflict, because it could shift the lookahead DECVAL under rule “F” or reduce the DECVAL on top of the stack under the same rule (“F”). By default, most parsers will perform the shift–in which case the series will never reduce because there is a DECVAL too “deep” on the stack.

QUESTION

But how would you (for example) define a precedence in such a way as to indicate a specific rule should reduce when in conflict with itself? Is that even a good idea? Based on my own limited understanding, there should be a way to restructure the grammar rules, right? But it’s not obvious to me what that is.

Why is FOLLOW not necessary for LL(1) grammars with no $\epsilon$ transitions?

I’m aware of how FIRST and FOLLOW sets are used to construct a parsing table for LL(1) grammars.

However, I’ve encountered this statement from my notes:

With $ \epsilon$ productions in the grammar, we may have to look beyond the current non-terminal to what can come after it

In my opinion, this suggests that FOLLOW is not necessary for LL(1) grammars that have no $ \epsilon$ transition. Am I wrong? And if I’m not, why is this the case?

Thanks

reference on relating Post systems to string rewriting systems and formal grammars?

wikipedia states:

Every Post canonical system can be reduced to a string rewriting system (semi-Thue system) […] It has been proved that any Post canonical system is reducible to such a substitution system, which, as a formal grammar, is also called a phrase-structure grammar, or a type-0 grammar in the Chomsky hierarchy

and on a different page:

A semi-Thue system is also a special type of Post canonical system, but every Post canonical system can also be reduced to an SRS.

I have not been able to find much explanation of these statements.

Is there a good reference that shows how to do conversions between these notions?

Proving sets of regular expressions and context free grammars are decidable [duplicate]

This question already has an answer here:

  • What is the complexity of deciding if the intersection of a regular language and a context free language is empty? 1 answer
  • Intersection between regular language and context-free language [closed] 1 answer
  • Proof Idea: How to prove the intersection of regular language and CFL is a CFL? 2 answers
  • Is there a *simple* proof that the intersection of a CFL and a regular language is a CFL? 4 answers
  • Intersection of context free with regular languages 1 answer

Consider below languages:

  1. $ L_1=\{<M>|M$ is a regular expression which generates at least one string containing an odd number of 1’s$ \}$
  2. $ L_2=\{<G>|G$ is context free grammar which generates at least one string of all 1’s$ \}$

Its given that both above languages are decidable, but no proof is given. I tried guessing. $ L_1$ is decidable, its a set of regular expressions containing

  • odd number of $ 1$ ‘s, or
  • even number of $ 1$ ‘s and $ 1^+$ or
  • $ 1^*$

So we just have to parse regular expression for these characteristics. Is this right way to prove $ L_1$ is decidable?

However, can we have some algorithm to check whether given input CFG accepts at least one string of all 1’s? I am not able to come up with and hence not able prove how $ L_2$ is decidable.

Proper algorithm for resolving ambiguity in grammars via enforcing associativity and precedence rules

I was told there is a algorithm that always resolves ambiguity for grammars that have issues with precedence and associativity. I know ambiguity in general is undecidable, so I only want to resolve those two sources of ambiguity (precedence and associativity).

As far as I know from reading this course, the heuristic I have developed are the following:

  • higher precedence = appear lower in the tree always (so they are never produced by any rule bellow a symbol that is higher up). A different way to understand this is; higher precedence operators are evaluated first, so they appear lower in the parse/computation tree.

  • right-associativity = when two operators of the same type fight for an argument if its right associative then the op on the right wins. Thus, this means that immediate recursion of the same production rule generating that op must always appear on the right (and that symbol/op can never be produced on the other side/left).

however, these rules must be incomplete or slightly wrong because when I was trying to resolve a “tricky” grammar I got wrong answers or couldn’t make sense of the right solution.

For example, imagine we have this:

<exp> ::= 0 | 1 | b<exp> | <exp>a | <exp>m<exp> 

and we want to enforce:

Right associativity for m and the following precedence inequality is the one we want: $ b >_p m >_p a$ which means b has higher precedence than m has higher precedence than a.

the solution given was:

<exp> ::= <no a m> | <not m>m<no a> | <exp>a <no a> ::= <no a m> | <no a m>m<no a> <not m> ::= <no a m> | <exp>a <no a m> ::= b<no a m> | 0 | 1 

However, according to the “rules” I came up this must be wrong. But since its the instructor solutions to its safer to assume I am wrong and need to refine my understanding.

The reason I think its wrong is that the first rule there is a <not m> to the left of m. That means we can generate an a. But a has lower precedence than m, so according to my rule it must never appear bellow the parse tree of m, but it does. So it seems I do not understand precedence properly or the algorithm to enforce it.

My understanding is that precedence means that if you have higher precedence then you bind tighter than other operators, so you steal the arguments from others. I was also told that means that you appear lower in the parse tree, which intuitively makes sense. But that’s too vague for me to really unambiguously and formally define a way to resolve precedence in any grammar (assuming I only want to resolve precedence and associativity only and solving that “solves the grammar”).

So is my understanding correct? Can someone help me make my algorithms for resolving these issue more rigorous (and ideally bullet proof)?


Note, I am aware that resolving ambiguity is undecidable in general, but I was told precedence and associativity is not undecidable assuming thats all we want. So thats what Im looking for.


Cross-posted:

  • https://www.quora.com/unanswered/What-is-the-proper-algorithm-for-resolving-ambiguity-in-grammars-via-enforcing-associativity-and-precedence-rules
  • https://www.reddit.com/r/ProgrammingLanguages/comments/e0lz9o/proper_algorithm_for_resolving_ambiguity_in/

Are these special (one production) Context-Free Grammars always unambiguous?

Consider the following (Context-Free) Grammars with only one production rule (not including the epsilon production):

  • $ S \rightarrow aSb\;|\;\epsilon$
  • $ S \rightarrow aSbS\;|\;\epsilon$
  • $ S \rightarrow aSaSb\;|\;\epsilon$
  • $ S \rightarrow aaSaaSbb\;|\;\epsilon$
  • $ S \rightarrow aSbScSdSeSf\;|\;\epsilon$
  • etc…

Grammars like these uphold 4 basic rules:

  1. The non-terminal symbol $ S$ can never appear next to itself.
    • e.g. $ [\;S \rightarrow aSSb\;|\;\epsilon\;]$ would not be allowed.
  2. The non-terminal symbol $ S$ can appear at the beginning or end of a production but not on both sides at the same time.
    • e.g. $ [\;S \rightarrow SabaS\;|\;\epsilon\;]$ would not be allowed.
    • e.g. $ [\;S \rightarrow Saba\;|\;\epsilon\;]$ or $ [\;S \rightarrow abaS\;|\;\epsilon\;]$ would be allowed.
  3. The sequence of terminal symbols that exist between each non-terminal $ S$ cannot all match.
    • e.g. $ [\;S \rightarrow aSaSa\;|\;\epsilon\;]$ , $ [\;S \rightarrow abSabSab\;|\;\epsilon\;]$ , etc. would not be allowed.
    • e.g. $ [\;S \rightarrow aSaSb\;|\;\epsilon\;]$ , $ [\;S \rightarrow abSabSaf\;|\;\epsilon\;]$ , etc. would be allowed.
  4. The sequence of terminal symbols at the beginning and end cannot match.
    (i.e. $ [\;S \rightarrow ☐_1SaS☐_2\;|\;\epsilon\;]$ s.t. $ ☐_1 \neq ☐_2$ where $ ☐_1$ and $ ☐_2$ are a sequence of terminal symbols)
    • e.g. $ [\;S \rightarrow aSbSa\;|\;\epsilon\;]$ , $ [\;S \rightarrow aaSbSaaS\;|\;\epsilon\;]$ , etc. would not be allowed.
    • e.g. $ [\;S \rightarrow aSbSb\;|\;\epsilon\;]$ , $ [\;S \rightarrow aaSbSaxS\;|\;\epsilon\;]$ , etc. would be allowed.

Are (Context-Free) Grammars, that follow these 4 rules, always unambiguous? It would seem so. I really don’t see any conceivable way that such Grammars could be ambiguous.

(Note: To show why Rule 4 is necessary consider the grammar $ S \rightarrow aSbSa\;|\;\epsilon$ with the string $ aababaababaa$ . A visual derivation can be found here.)

I’ve spent a lot of time thinking about this question and have had a hard time finding any way of either proving or disproving it. I’ve tried showing that Grammars like these are $ LL(1)$ . However, it seems only Grammars of the form $ S \rightarrow aSb\;|\;\epsilon$ are $ LL(1)$ . Grammars like $ S \rightarrow aSaSb\;|\;\epsilon$ are not $ LL(1)$ . Maybe I need to use a higher $ k$ for $ LL(k)$ ?

(This question is a follow-up/reformulation of a previous question.)

I would really appreciate any help I could get here.