Text detection in computer vision

I’m curious about how machine learning handles text recognition (or, more generally, the question of object vs. not-object) in computer vision.

How are systems trained when the not-object data set is so much greater in quantity and apparently lacks structure?

One approach is to have the algorithm first search for a text box and, once it finds one, apply character recognition. The initial classification thus comes down to “text” or “not text”. But “not text” doesn’t have any particular structure, and in fact almost everything is “not text”, so how is this dealt with?
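
To make the setup concrete, here is a rough sketch of the two-stage pipeline I have in mind. It is only an illustration under my own assumptions: the sliding-window search, the function names (sliding_windows, crop, detect_then_recognize) and the toy stand-ins toy_is_text and toy_recognize are all hypothetical placeholders, not how any real system is implemented. A real system would put trained models where the toy functions are.

```python
from typing import Callable, Iterator, List, Tuple

Image = List[List[float]]          # grayscale image as rows of pixel values in [0, 1]
Box = Tuple[int, int, int, int]    # (top, left, height, width)


def sliding_windows(image: Image, size: int, stride: int) -> Iterator[Box]:
    """Yield every size x size window that fits inside the image."""
    height, width = len(image), len(image[0])
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield (top, left, size, size)


def crop(image: Image, box: Box) -> Image:
    """Cut the pixels covered by box out of the image."""
    top, left, h, w = box
    return [row[left:left + w] for row in image[top:top + h]]


def detect_then_recognize(
    image: Image,
    is_text: Callable[[Image], bool],
    recognize: Callable[[Image], str],
    size: int = 2,
    stride: int = 2,
) -> List[Tuple[Box, str]]:
    """Stage 1: keep only windows the binary classifier calls "text".
    Stage 2: run character recognition on those windows alone."""
    results = []
    for box in sliding_windows(image, size, stride):
        patch = crop(image, box)
        if is_text(patch):
            results.append((box, recognize(patch)))
    return results


# Toy stand-ins (hypothetical): a real system would use trained models here.
def toy_is_text(patch: Image) -> bool:
    # Pretend "text" simply means "dark patch".
    mean = sum(sum(row) for row in patch) / (len(patch) * len(patch[0]))
    return mean < 0.5


def toy_recognize(patch: Image) -> str:
    # Placeholder for a character recognizer.
    return "?"


image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
windows = list(sliding_windows(image, size=2, stride=2))
hits = detect_then_recognize(image, toy_is_text, toy_recognize)
```

Even in this tiny example, only one of the four windows passes the “text” check, so the first stage sees far more “not text” patches than “text” ones. That imbalance is exactly what I’m asking about.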

What would the “not text” training set be? Random images? Clearly negative examples are needed, but it’s not obvious how to choose them.