Algorithm to find repeated patterns in a large string

For optimization purposes I’m trying to analyze a large list of executed program commands to find chunks of commands that are executed over and over again. This problem is similar to searching for repeated substrings in a string. However, in my case I’m not looking for the longest substrings but rather for smaller substrings that occur very often.

For example, say each command is represented by a letter, then a program might look like xabca yabca xabca yabca. If we are looking for the longest repeated substrings, the best result is xabca yabca. A “better” result would be abca, though. While being shorter, it occurs more often in the string. a occurs even more often on its own, but it would be considered too short a match. So an algorithm should be parameterizable by a minimum and maximum chunk length.

Things I have tried so far:

  • I played with suffix trees to find the longest repeated substrings that occur at least k times. While that is simple to implement, it doesn’t work well in my use case, because overlapping substrings are found as well. Trying to remove those wasn’t very successful either. The approach mentioned in this post either gave wrong or incomplete results (or I misunderstood the approach), and it also doesn’t seem to be customizable. Suffix trees still seem the most promising approach to me. Perhaps someone has an idea how to include the minimum/maximum chunk lengths in the search here?
  • Another attempt was using the substring table that is created for the LZW compression algorithm. The problem with this approach is that it doesn’t find repeated chunks that occur early, and it also creates longer and longer table entries the further it processes the input (which makes sense for compression but not in my use case).
  • My best solution so far is the brute-force approach, i.e. building a dictionary of every possible substring and counting how often it occurs in the input. However, this is slow for large inputs and has a huge memory consumption.
  • Another idea was searching for single commands that occur most frequently, and then somehow inspecting the local environments of those commands for repeated patterns. I didn’t come up with a good algorithm here, though.
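For reference, the brute-force approach from the list above, restricted to a minimum and maximum chunk length, can be sketched in Python roughly as follows. The function name `count_chunks` and its parameters are made up for illustration, and the example assumes the spaces in the sample program are only there for readability. Note that this sketch counts overlapping occurrences too, and its memory use grows with the number of distinct chunks, which is exactly the problem described above for large inputs.

```python
from collections import Counter

def count_chunks(commands, min_len, max_len):
    """Count every substring whose length is between min_len and max_len.

    Hypothetical helper: overlapping occurrences are counted as well.
    """
    counts = Counter()
    n = len(commands)
    for length in range(min_len, max_len + 1):
        for start in range(n - length + 1):
            counts[commands[start:start + length]] += 1
    return counts

# Sample program from the question, written without the spaces.
program = "xabcayabcaxabcayabca"
counts = count_chunks(program, 4, 5)
best_chunk, best_count = counts.most_common(1)[0]
```

With chunk lengths limited to 4 and 5, `best_chunk` is `abca`, which occurs four times, while `xabca` and `yabca` occur twice each.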

What other algorithms are there that could be useful in this scenario? What I’m looking for is not necessarily the best match but a good heuristic. My input data is pretty big, with strings up to a length of about 100MB; the chunk sizes will usually be in the range from 10 to 50.

Design patterns for machine learning as a service?

I wondered if there are design patterns and common best practices w.r.t. machine learning as a service (MLaaS). Making use of the model view controller (MVC) pattern seems quite obvious to me. Using a Docker container created from a Docker image to isolate machine learning functionality and providing it via a RESTful API seems to be common as well. However, finding best practices w.r.t. API design (e.g. for time series data analysis) is not straightforward.

Does someone know where I can find in-depth resources about design patterns w.r.t. machine learning as a service (in general or specific to Python)?

What are the common conversational design patterns used by voice-based assistants like Siri, Alexa, Google Assistant and Cortana?

Conversational UI is common in IoT and other connected devices these days, not to mention smartphones and also computers. The main players in the market would probably be Siri from Apple, Alexa from Amazon, Google Assistant and Cortana from Microsoft.

It is typically rare for someone to own products from all of these vendors, but I wonder if there are some common conversational design patterns that are shared by these types of devices in the way they interact with the user, or if they are designed to behave differently to suit the particular target market or group, and where these differences might be.

The specific areas that I am thinking about from a user experience point of view are:

  • Choice of default voice (and variety available for customization)
  • Language used to trigger specific actions
  • Type of language used by voice assistant to respond
  • Type of audio cues and indicators for specific actions/status

Legacy deep-inheritance XML schemas: how to design patterns for APIs that map those to and from flat schemas?

Consider a purely hypothetical legacy proprietary library for XML models, which has some really deep nested inheritance within its corresponding POJOs: 1-10 fields per class, lots of special instance classes that extend archetypal classes, types as wrappers of lists of type instances etc. The resulting model looks really pretty with some dubious performance specs, but that’s beside the point.

I want to make this work with the ugly, high-performance flat models that kids these days and people that claim not to have a drinking or substance abuse problem prefer for some reason. So say my beautiful, shiny model is something like this:

<QueryRequestSubtypeObject>
  <QueryRequestHeaders>
    <QueryReqParams>
      <Param value="4"/>
      <ParamDescWrapper index="12">
        <WrappedParamDesc>Foobar</WrappedParamDesc>
  ...

And the corresponding object as modeled by the vendor instead looks like

{
    paramVal = 4
    paramTypeIndex = 12
    paramDesc = "Foobar"
}

There are also regular updates to this ivory Tower of Babylon as well as updates to the business logic as vendor specs change.

Now the part where I convert my ancient classic into a teen flick is straightforward enough, however ugly it might be. Say something like below would be used by a query constructor and that would be enough abstraction for all business logic involved:

def extractParamVal(queryRequestSubtypeObject):
    # Walk the getter chain down to the single value the query constructor needs.
    return queryRequestSubtypeObject.getQueryRequestHeaders().getQueryReqParams().getParam().getValue()

Alas, it is not that simple. Now I want to convert whatever ugly, flat, subsecond latency response that comes back into our elegant, delicate model (with 5-10 second latency, patience is a virtue after all!). With some code like this:

queryRequestSubtypeObject = new QueryRequestSubtypeObject
queryRequestHeaders = new QueryRequestHeaders
queryReqParams = new QueryReqParams
queryReqParamList = new ArrayList
param = new Param
param.setValue(4)
queryReqParamList.add(param)
queryReqParams.setQueryReqParamList(queryReqParamList)
queryRequestSubtypeObject.setQueryRequestHeaders(queryRequestHeaders)
...

Code like this needs to be somewhere somehow for each and every field that is returned if someone were to convert data into this hypothetical format. Some solutions I have tried:

  • External libraries: Libraries like Dozer use reflection, which does not scale well for bulk mapping of massive objects like this. MapStruct et al. use code generation, which does not do well with deep nesting involving cases like the list wrappers I mentioned.

  • Factory approach: Generic response factories that take a set of transformation functions. Idea is to bury all model specific implementation into business logic based abstractions. In reality this results in some FAT functions.

  • Chain of responsibility: Methods that handle initialization of each field and other methods that handle what goes where from vendor response and some other methods that handle creation of a portion of the mapping and some other methods that handle a sub-group… loooooong chains of responsibility

Given that all of these approaches resulted in technical nightmares of some sort, is there an established way to handle cases like this specifically? Ideally it would have minimal non-business logic abstractions involved while providing enough granularity to implement updates and have it technically solid as well. Bonus points for the ability to isolate any given component for unit testing, wherever it might be in the model hierarchy, without null pointers getting thrown somewhere.
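To make the two mapping directions described above concrete, here is a minimal sketch in Python that uses plain dictionaries in place of the POJOs. It is only an illustration of the problem, not an established solution: the `FIELD_PATHS` table and the `flatten`/`unflatten` helpers are hypothetical, and the list wrappers from the real model are left out.

```python
# Hypothetical path table: flat field name -> path into the nested model.
FIELD_PATHS = {
    "paramVal": ("QueryRequestHeaders", "QueryReqParams", "Param", "value"),
    "paramTypeIndex": ("QueryRequestHeaders", "QueryReqParams",
                       "ParamDescWrapper", "index"),
    "paramDesc": ("QueryRequestHeaders", "QueryReqParams",
                  "ParamDescWrapper", "WrappedParamDesc"),
}

def flatten(nested):
    """Nested model -> flat model; missing branches become None instead of raising."""
    flat = {}
    for field, path in FIELD_PATHS.items():
        node = nested
        for key in path:
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        flat[field] = node
    return flat

def unflatten(flat):
    """Flat model -> nested model; intermediate levels are created on demand."""
    nested = {}
    for field, path in FIELD_PATHS.items():
        if field not in flat:
            continue
        node = nested
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = flat[field]
    return nested

flat_response = {"paramVal": 4, "paramTypeIndex": 12, "paramDesc": "Foobar"}
nested_model = unflatten(flat_response)
round_trip = flatten(nested_model)
```

Because both directions are driven by the same table, a change in the model hierarchy touches one entry rather than a chain of setters, and each entry can be tested in isolation on a partially filled model.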

C++ tactics / data structures / design patterns to avoid or postpone unnecessary object creation?

A couple of months ago I wrote a C++ program for computational mathematics that was supposed to compete with highly optimized C code.

After a while I did manage to get it fast enough to beat the C code, but before that happened, while analyzing what was slow, I was really surprised by how long it can take for C++ to create objects, especially if we have complicated class structures and templates.

I realized how useful it would be to “lazify” or postpone object creation when it is unnecessary. Do any good methods exist to do this?

Some things I thought about, but don’t have any sources on recommended ways to do:

  1. hash-tables, storing already created objects in case we are likely to want to re-solve same equation systems thousands of times and only sporadically create a new one.
  2. static variables that are constructed once at program startup and then “recycled” with new data when needed.
  3. some kind of object oriented design pattern I may be unaware of?

Patterns for listing users in a “room”

I’m building a little app that lists music recommendations to users in a room. I want to list the users in the room but have a hard time finding patterns for this.

Right now it looks like this (by no means “good”):

enter image description here

Since I only really need to present the username, I thought I would add categorical colors (with no expertise in knowing how to pick colors that look “good” while simultaneously being distinguishable). It just doesn’t look good, so I was wondering: this must be a common use case, so how have other pages and applications solved it? But I find myself dumbfounded trying to find examples of lists of usernames in a “room”.

Should I skip the colors? I thought they could be useful as a marker if I’m referring to users in other parts of the view (as a shorthand).

C# extension methods design patterns and usage guidelines?

C# extension methods have seen a rise in usage over recent years. The official Microsoft guidelines on usage state: “In general, we recommend that you implement extension methods sparingly and only when you have to” (see the extension method guidelines).

On the other hand, Microsoft is now using them heavily in .NET Core (see the Microsoft.Extensions namespace). This is particularly prevalent in ASP.NET Core, where the initialization of the IServiceCollection is implemented in extension methods; for example, see the service collection service extensions. Dot Net Core Tutorials has an article listing this as a design pattern, suggesting that it is best practice. Quite a few popular NuGet packages also use this method to initialise services: Swagger is configured this way (see the Microsoft docs), as is MediatR (see their implementation).

Should the guidelines be updated or should this practice be avoided? If the guidelines are to be updated what should they be?

Recommended patterns for help cues/hints/documentation

I’ve been scouring the web, but haven’t been finding much with regards to different options on how to present help text and/or documentation. Obviously, there’s an established pattern of just having a Help or FAQ menu item and then linking to some giant document, but I’m looking for something a little more context-sensitive and/or integrated within the content (but non-obtrusive, of course).

Can anyone point me to some good examples of alternatives to the traditional Help menu item? I’ve got some ideas of my own but would like some inspiration or validation.

Best approaches to manage API versioning in source code as per microservices design patterns considering DRY and SOLID principles

I would like to understand the best approaches for managing different versions of the same microservice API in source code. For example, I have one microservice that hosts a business service:

API ( v1 ) – Controller1 – Service1 + addXXX ( Model m1 )

API ( v2 ) – Controller2 – Service + addXXX ( Model m2 )

Both v1 and v2 share a common database / schema / table and point to the same functionality, but with a few differences in implementation logic. Should I maintain and manage different projects/modules to prevent code conflicts in the domain models, for the sake of backward compatibility and release management, or should I put everything in the same module and depend on each developer knowing all versions before they start coding and fixing the unit test cases for all previous versions?

Modifying the JPEG compression to use custom block patterns

Recently, while watching a video about machine learning, I had an idea about creating an image compression algorithm that works similarly to JPEG but is based on arbitrary bitmaps as blocks.

I should mention that I would not call myself a computer scientist, so my understanding of the math and algorithms involved is still limited at this point. I have an intuition for how deconstructing an image into blocks, which are represented as combinations of 8×8 px patterns (made up of cosine waves), works.


enter image description here
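The block decomposition described above can be sketched in plain Python as follows. This is a simplified illustration, not the actual JPEG pipeline: it only shows how one 8×8 block is expressed as a weighted sum of 64 cosine patterns (an orthonormal DCT-II basis) and rebuilt from those weights, leaving out quantization, colour handling and entropy coding. The helper names `dct_pattern`, `analyze` and `synthesize` are made up for this example.

```python
import math

N = 8  # block size

def dct_pattern(u, v):
    """Return the 8x8 cosine pattern for the frequency indices (u, v)."""
    cu = math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)
    cv = math.sqrt(1 / N) if v == 0 else math.sqrt(2 / N)
    return [[cu * cv
             * math.cos((2 * x + 1) * u * math.pi / (2 * N))
             * math.cos((2 * y + 1) * v * math.pi / (2 * N))
             for y in range(N)]
            for x in range(N)]

# All 64 basis patterns, keyed by their frequency indices.
PATTERNS = {(u, v): dct_pattern(u, v) for u in range(N) for v in range(N)}

def analyze(block):
    """Weight of each pattern in the block (dot product with the pattern)."""
    return {uv: sum(block[x][y] * p[x][y] for x in range(N) for y in range(N))
            for uv, p in PATTERNS.items()}

def synthesize(coeffs):
    """Rebuild a block as the weighted sum of the patterns."""
    block = [[0.0] * N for _ in range(N)]
    for uv, weight in coeffs.items():
        p = PATTERNS[uv]
        for x in range(N):
            for y in range(N):
                block[x][y] += weight * p[x][y]
    return block

sample = [[(x * N + y) % 17 for y in range(N)] for x in range(N)]
rebuilt = synthesize(analyze(sample))
```

Taking a plain dot product in `analyze` works here because the cosine patterns are orthonormal. An arbitrary set of patterns generally does not have that property, so the weights would have to be found some other way, for example with a least-squares fit.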

So my question is: is it possible to replace the standard JPEG patterns with something else, for instance the features that have been found by a neural network (see below)?

enter image description here

I’d be very interested in what the results / artifacts of such a compression would look like, both applied to the original data set (the neural net was trained on) and random other images.

Any ideas / directions on how to achieve this are very much appreciated.

Thanks a lot in advance!