For optimization purposes I’m trying to analyze a large list of executed program commands to find chunks of commands that are executed over and over again. This problem is similar to searching repeated substrings in a string. However, in my case I’m not looking for the longest substrings but rather smaller substrings that occur very often.
For example, say each command is represented by a letter, then a program might look like
xabca yabca xabca yabca. If we are looking for the longest repeated substrings, the best result is
xabca yabca. A “better” result would be
abca, though. While being shorter, it occurs more often in the string.
a occurs even more often on its own, but it would be considered a too short match. So an algorithm should be parameterizable by a minimum and maximum chunk length.
Things I have tried so far:
- I played with suffix trees to find the longest repeated substrings that occur at least k times. While that is simple to implement, it doesn’t work well in my use case, because also overlapping substrings are found. Trying to remove those wasn’t very successful either. The approach mentioned in this post either gave wrong or incomplete results (or I misunderstood the approach), and it also doesn’t seem to be customizable. Suffix trees still seem the most promissing approach to me. Perhaps someone has an idea how to include the minumim/maximum chunk lengths into the search here?
- Another attempt was using the substring table that is created for the LZW compression algorithm. The problem with this approach is that is doesn’t find repeated chunks that occur early and it also creates longer and longer table entries the farer it processes the input (which makes sense for compression but not in my use case).
- My best solution so far is the brute-force approach, i.e. building a dictionary of every possible substring and counting how often it occurs in the input. However, this is slow for large inputs and has a huge memory consumption.
- Another idea was searching for single commands that occur most frequently, and then somehow inspecting the local environments of those commands for repeated patterns. I didn’t come up with a good algorithm here, though.
What else algorithms are there that could be useful in this scenario? What I’m looking for is not necessarily the best match but a good heuristics. My input data is pretty big, strings up to a length of about 100MB; the chunk sizes will usually be in the range from 10 to 50.