Regular expressions in the sense as equivalent to regular (Chomsky type 3) languages know concatenation `xy`

, alternation `(x|y)`

, and the Kleenee star `x*`

.

"Real-life" regular expressions as used in programming usually have a lot more operations available; amongst others, quantification `x{n}`

, negation `[^x]`

, positive and negative lookahead `x(?=y)`

, or back-reference `\n`

.

There is a famous post on SO stating that regular expressions can not be used to parse HTML for the reason that HTML is not a regular language.

My question is: Is this accurate? **Do "real-life" regular expressions, say the selection defined in the Java docs, really have the same expressive power as regular expressions as understood in formal language theory; or do the additional constructs, although possibly not strong enough to capture HTML and the like, put common regular expressions further up on the Chomsky scale than just Type 3 languages?**

I would imagine the proof of the computational equality of the two would amount to showing that each operation available for the common regexp is just syntactic sugar and can be expressed by means of the 3 basic operations (concatenation, alternation, Kleene start) alone; but I am finding it hard to see how one would e.g. simulate back-reference with classic regexes alone.