Input: Recognized text from images of labels taken with a cell phone under varying conditions. One image may contain the entire text, or just a part of it.
Expected output: The most likely version of the original text, ideally with an indication of the certainty. More images should, of course, yield better results.
Even though this seems to me like it should be a rather common problem, I could not find any research, algorithms, or code directly related to it.
My best idea so far (after three or four implementation attempts, now discarded for various reasons…) was to find the best matches between the various inputs, potentially by finding the longest common substrings, and then to build some sort of tree recording the most frequent transitions between individual characters. Traversing the tree should then yield the most likely original. Even though this might work in principle, the devil is always in the details, and there may be much more efficient solutions out there.
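For what it's worth, here is a minimal sketch of one way the alignment-and-voting idea could look, not necessarily the approach described above: align each reading against a reference string (the longest one, as an arbitrary choice) with `difflib.SequenceMatcher`, then take a per-position majority vote, with the vote fraction serving as a rough certainty score. The function names and the choice of reference are my own assumptions, not an established algorithm.

```python
import difflib
from collections import Counter

def align_to_reference(ref, other):
    # Map characters of `other` onto positions of `ref` using difflib's
    # opcodes; positions where `other` has no counterpart stay None.
    sm = difflib.SequenceMatcher(a=ref, b=other, autojunk=False)
    aligned = [None] * len(ref)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("equal", "replace"):
            for k in range(i2 - i1):
                if j1 + k < j2:
                    aligned[i1 + k] = other[j1 + k]
    return aligned

def consensus(readings):
    # Use the longest reading as the reference (an arbitrary heuristic),
    # collect one column of candidate characters per reference position,
    # and pick the majority character in each column.
    ref = max(readings, key=len)
    columns = [[c] for c in ref]
    for other in readings:
        if other is ref:
            continue
        for pos, ch in enumerate(align_to_reference(ref, other)):
            if ch is not None:
                columns[pos].append(ch)
    result, certainty = [], []
    for col in columns:
        ch, votes = Counter(col).most_common(1)[0]
        result.append(ch)
        certainty.append(votes / len(col))  # fraction of readings agreeing
    return "".join(result), certainty

text, cert = consensus(["hello world", "he1lo world", "hell0 world"])
```

With those three noisy readings, the majority vote recovers "hello world", and the certainty list drops below 1.0 exactly at the positions where the readings disagree. A real solution would also need to handle readings that cover only part of the label, which this sketch glosses over.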