My question: What data structure or algorithm could I use to solve the following set membership problem efficiently?
Imagine I am generating a random 32-bit integer every second, and adding it to a list of N integers. Imagine that each time I generate an integer, I ship it (and any other information I might have easy access to) off to a client. Later, the client will submit an integer (along with any other information I have given it), and I want to be able to quickly determine if a given integer has ever appeared in my set, but I want to do it in constant time, using a small, constant amount of memory. In other words, I don’t want to just keep all of my generated numbers in a list or hash table and check the submitted integer against this list. Ideally, adding a number to this set is a constant-time operation. I do not ever need to delete numbers from the set.
Option 1: I could use a bloom filter and add each number to the bloom filter. Unfortunately, the set membership test would be probabilistic, but I could make the filter big enough to reduce my probability of a false positive quite low. I’m open to this approach, but want to know what other options I could have.
Option 2: I have been reading about cryptographic accumulators. If each of the numbers I was generating were prime, I believe I could use an RSA accumulator to store a single accumulator value of constant size. Each time I add a new prime to my set, I add it to the accumulator, then I generate its witness as well and ship both the number and the witness to my client. Later, the client would submit the integer it is testing and the witness, and I would be able to quickly determine if the number being submitted is in fact a member of my set or not. Possible problems: I need to be able to hash to a prime number deterministically. (Not the end of the world, but adds complexity) I think I have to update my witness values as I add new values to the accumulator. Lastly: My understanding of accumulators is rudimentary.
Possible modifications: 1: Would this be any easier if subsequent values in my chain of numbers were somehow dependent upon previous values in some way? You could imagine some sort of non-cryptographic hash chain, whose current hash value includes enough information about its previous values that it could quickly determine, “Yep, if my current hash value is X, and you submit Y, Y definitely was a previous member of my chain.” 2: If I understand them correctly, accumulators seem like a very space-efficient way to store sets of prime numbers (and perhaps other values), but in the literature they all assume potential adversaries. In my case, I don’t need my witness be be unforgeable, so I would think that would make the problem much easier to solve. Perhaps it just means that I get to use smaller constants (so I don’t have to use RSA-2048?). Or perhaps this simplifies my problem even further? 3: What if the “random” values I was generating were known to be increasing, or were simply timestamps? (I still need to know if that particular timestamp were ever used)
Related problems: This seems a lot like having a lot of hash values in a Merkle tree or hash chain (blockchain), and wanting to be able to determine if a particular hash value were ever seen in the chain, without having to store every value that had been seen in the chain. I’m hopeful that with the additional concept of generating a “witness” value of some sort to be stored along with the number, the server can make a membership determination with much less overhead than having to store all of the numbers. This also seems similar to a verifiable log or authenticated dictionary.