How to convert recursive language grammar tree into automaton for optimal parsing?

So I have a definition of a sort of grammar for a programming language. Some aspects are recursive (like nesting function definitions), other parts of it are just simple non-recursive trees.

Originally I just treat this sort of like a Parsing Expression Grammar (PEG), and parse it using recursive descent. But this is kind of inefficient, as it sometimes has to build up objects as it’s parsing, only to find 10 calls down the road that it isn’t a good match. So then it backs up and has to try the next pattern.

I’m wondering what the general solution to this is. What can be done to create the most optimal algorithm starting from nothing but a tree-like data structure encoding the grammar?

Can you process the tree-like, recursive-ish BNF PEG data structure in a while loop, rather than using recursive descent? Better still, can you convert the data structure into a sort of automaton which you can (in some simple way) transition through, to parse the string without having to go down a bunch of wrong paths and generate stuff you only have to soon throw away if you find it’s a mismatch? Or if not, is there anything that can be done here with automata?

Sorry for my terminology (BNF vs. PEG), I am not using it exactly. I just mean I have a grammar, which is context-sensitive which falls outside the scope of either of these. So I am picking one or the other to simplify this question for the purpose of this site, even though I’m not 100% sure what the difference between BNF and PEG are at this point.

Is every language described by a grammar?

I read the following argument showing that not every language is described by a grammar:

For a fixed alphabet $ \Sigma$ and variables $ V$ there are uncountable many languages over $ \Sigma$ since the power set of $ \Sigma^\ast$ is uncountable. But grammars are finite objects by construction and thus there are only countably many grammars. In total, there can only be countably many languages described by grammars. Hence, there are this uncountably many languages that cannot be described by grammars.

I understand the idea of the argument, however i am not convinced by how they show that there are only countably many grammars. What do they mean by "finite objects"? Couldn’t one just take the set $ \{ (V,\Sigma, P, S) | \; P\subseteq (V\cup \Sigma)^+ \times (V\cup \Sigma)^\ast,\; S\in V\}$ , which is clearly uncountable, to get uncountably many grammars? Or do the languages that they generate fall together so often that we only get countably many languages generated by grammars in the end?

Is there any official terminology about something like double quotes “” grammar?

In many programming language string is a token.

For example:

 token               ::= '"' string                        | nat   string              ::= string                        | '"'   nat                 ::= digit nat                        | ϵ 

This is a LL(1) grammar for some programming language’s toke grammar.

When parsing a string, there is no need to check follow set, because there is a " at the end of each string.

Comparing with nat, string is more easy to parse.

My question is

Is there any official terminology about this kind of grammar?


Removing left factoring from Context-Free Grammar

I know that, removing left factoring is a simple task.
And i understand following procedure:

$ S→aA | aB$
$ S→aS’$
$ S’→A|B$

Yet I’m running into problems with this particular grammar:

$ S→AD|bbS|bScS|BS $
$ A→aAbb | abb$
$ B→aB|ba|b$
$ D→cDd|cccd$

How to remove left factoring from it, I’m trying to convert it into LL(1) grammar

Grammar and Enumerator for Decision and Halting Problem

in theoretical computer science I learned for every recursive enumerable language there would be an enumerator and a grammar. So since decision problem and halting problem are recursively enumerable, I was wondering what kind of grammar and enumerator could this be.

Ok since there exists a sequence of M_i I would start with M_1 and find all words for this TM and give them out. So if I have any TM is there a possibility to give all words out which are accepted by this TM? I probably would have to give all w_i to it and compute the first i words for i steps, then i+1 words for i+1 steps and so on. Or maybe something like DFS on all configurations. This really sounds like that only for one TM this could go on forever. So I would need to start the second TM for the same period of time after a while… Seems as if something similiar could work for Halting Problem. Do you have any more refined thoughts on this one?

But the Problem with the Grammar seems to be more challenging. How could I possibly come up with a Grammar for these problems? You would have to generate code(M_i) in such a way that its always in the language. So you would have to simulate the TM through grammmar on different words. It somewhat comes down to the question whether a TM accepts anything or holds on any entry. Or is it more "okay we proofed, so there must be some kind of grammar even though I can´t comeup with an idea for it".



Is EBNF a formal grammar? If yes, how can we generate production rules from EBNF expression?

According to Wikipedia definition EBNF, EBNF is a formal grammar.

My question is that how could I generate production rules base on EBNF expression:

For example:


letter = "A" | "B" | "C" ;

digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

identifier = letter , { letter | digit | "_" } ;

Generates production rules:

letter ⟶ "A"

letter ⟶ "B"

letter ⟶ "C"

digit ⟶ "0"

digit ⟶ "1"

digit ⟶ "2"

digit ⟶ "3"

digit ⟶ "4"

digit ⟶ "5"

digit ⟶ "6"

digit ⟶ "7"

digit ⟶ "8"

digit ⟶ "9"

identifier ⟶ letter

identifier ⟶ letter noname_nonterminal

noname_nonterminal ⟶ letter

noname_nonterminal ⟶ digit

noname_nonterminal ⟶ "_"

Thank you for your reading,

What is a make sense (meaningful) example of language that an unrestricted grammar could generate?

I have learned that:

  1. Unrestricted grammar is used to define (or describe) a formal language.
  2. Unrestricted grammar is used to define recursively enumerable set [][1].

I’d like to find a meaningful example for #1 case which is similar to below context sensitives grammar example to learn the purpose of unrestricted grammar. I could find a meaningful example for context sensitives grammar but I could not find a one for the unrestricted grammar yet. Could you help me?

Language for “Network racing game record” with below record instances:

Mr. Bean Male Player 1

Ms. Emma Female Player 2

Mr. Hải n/a Computer 3

Ms. Tú n/a Computer 4

Production rule:

S ⟶ Title Name TAB Sex UserType TAB Rank

Title WomanName ⟶ "Ms. " WomanName

Title ManName ⟶ "Mr. " ManName

WomanName TAB Sex "Player" ⟶ WomanName TAB "Female" "Player"

ManName TAB Sex "Player" ⟶ ManName TAB "Male" "Player"

Name TAB Sex "Computer" ⟶ Name TAB "n/a" "Computer"

Name ⟶ WomanName

Name ⟶ ManName

Sex ⟶ "Male"

Sex ⟶ "Female"

UserType ⟶ "Player"

UserType ⟶ "Computer"

Rank ⟶ "1"

Rank ⟶ "2"

Rank ⟶ "3"

Rank ⟶ "4"

WomanName ⟶ "Emma"

WomanName ⟶ "Tú"

ManName ⟶ "Bean"

ManName ⟶ "Hải"

TAB ⟶ "\t"