If we want to map abbreviations of full-English words (e.g. map “Jan” to “January”), how can we identify abbreviations which map to multiple words?



Short Version:

How can we construct such a trie which maps abbreviations of names-of-the-month to full-month (we map the abbreviation "mar" to "march")?

  • The set of all abbreviations is formed by:
    • keeping the first letter of the month name. (all abbreviations of "january" begin with "j")
    • deleting 1 or more characters ("jan" deletes "uary" from "january")

The Looonnnnggg Version:

How can we construct such a trie?
What algorithm will build the appropriate trie from the container of verbose strings.

Consider the English names for months of the year:

  • January
  • February
  • March
  • April
  • [… truncated …]
  • October
  • November
  • December

We find it useful say that the English names for months of the year are "verbose" strings.
For any "verbose" string $ v$ , and any string $ a$ we say that $ a$ is an "abbreviation" of $ v$ if and only if all of the following conditions are met:

  • $ a$ non-empty. $ |a| \geq 1$
  • $ a$ can formed by deleting 1 or more characters from "verbose" string $ v$
  • $ a(1) = v(1)$ . Assume that string indexing begins at $ 1$ , and not $ 0$ .

For example, "jan" is an abbreviation of "january."

Suppose you want to write an algorithm which:

  • accepts a list of verbose strings as inputs.
  • the algorithm outputs a "trie" data-structure (information retrieval tree) $ T$ such that:
    • The trie $ T$ accepts any ASCII string as input.
    • An output (leaf node) of the trie should be set of strings $ S$ such that:
      • every string in $ S$ is a verbose string
      • the string fed as input into trie $ T$ is an abbreviation of every verbose string in container $ S$

Some examples of input to the trie and output of the trie are shown below:

  • Example 1

    • Input: "Ma"
    • Output: $ \{$ "March", "May"$ \}$
  • Example 2

    • Input: "Mar"
    • Output: $ \{$ "March"$ \}$
  • Example 3

    • Input: "Decuary"
    • Output: $ \{$ " "$ \}$ ….. the empty-set

The output from the trie should be one of:

  • the empty set
  • a set of one item
  • a set of two or more items

For months of the year, we might write javascript so that an end-user can type in any half-way reasonable date-format, instead getting an error message when they put slashes instead of dashes, etc….

If you do not like the months of the year application, a different use-case would be to write write your own Linux Shell (similar to BASH). Maybe any half-way reasonable abbreviation of "make directory" will map to "mkdir" In that case, we could have many-to-one mapping from high-level shell-commands to low-level Linux commands.

The question is:

How can we construct such a trie?
What algorithm will build the appropriate trie from the container of verbose strings.

Also, can we avoid brute-force generating a list of all aberrations before-hand? The set of all strings form-able by deleting 1 or more characters from the verbose strings is quite large. We would like to avoid combinatorial explosion, if we can.

The programming language (Java, python, C+ + ) does not matter for answering this question.