I was working on a bioinformatics practice problem named Finding a Protein Motif on rosalind.info. In essence, I was given a particular regular expression `N[^P](S|T)[^P]`

and is asked to find all matches.

Solving that problem is not the goal here, I have a ‘working’ solution here. In essence, I manually designed a state machine that can find all matches for that regular expression.

And here is a ‘visualization’ of the state machine, to make it clear how it is manually designed.

`digraph G { 0 [label="0,''"] 1 [label="1,'N'"] 2 [label="2,'NN'"] 3 [label="3,'NS|NX'"] 4 [label="4,'NNS'"] 5 [label="5,'NSS|NXS'"] 0 -> 1 [label="N"] 0 -> 0 [label="P"] 0 -> 0 [label="S"] 0 -> 0 [label="X"] 1 -> 2 [label="N"] 1 -> 0 [label="P"] 1 -> 3 [label="S"] 1 -> 3 [label="X"] 2 -> 2 [label="N"] 2 -> 0 [label="P"] 2 -> 4 [label="S"] 2 -> 3 [label="X"] 3 -> 1 [label="N"] 3 -> 0 [label="P"] 3 -> 5 [label="S"] 3 -> 0 [label="X"] 4 -> 1 [label="N(accept)"] 4 -> 0 [label="P"] 4 -> 5 [label="S(accept)"] 4 -> 0 [label="X(accept)"] 5 -> 1 [label="N(accept)"] 5 -> 0 [label="P"] 5 -> 0 [label="S(accept)"] 5 -> 0 [label="X(accept)"] } `

The classical theory allows us to convert a regular expression to a non-deterministic finite-state automaton and then convert it to a deterministic finite-state automaton through subset construction. In particular, subset construction guarantees that if there exists an accepting computation, then the deterministic finite-state automaton would also accept.

Let say I have a regular expression that matched twice, the corresponding deterministic finite-state automaton would accept after the first match, but then it doesn’t know what to do in order to set it to the right state for detecting overlapping matches. I guess I could start one character after the beginning of the first match, which in the worst case would probably lead to quadratic time, as we could imagine with {.*} on {{{{{}

In the worst case, I expect quadratic time (e.g. {{{{{}}}}}}), but it would be great if the timing is output-sensitive, for I believe a good deal of cases aren’t quadratic in output size.

It would be great if my state machine used to find all matches can be generalized (apparently sometimes it need linear space, not just a single state) or automatically designed. Do we know if there are existing theories for that?