The original problem: given a large input file with n lines of random strings, count the number of pairs of lines that have the same number and type of characters (i.e. the same character distribution). The characters can be any legal ASCII (0-127) character other than 10 and 13. I am trying to implement a hashCode function for a wrapper class over an int array of size 128 that is ideally fast and produces minimal collisions.

My approach is to process each line into an int count array of size 128 that encodes the ASCII character distribution, wrap it in a class, and store that object as a key in a hashtable/hashmap (or possibly, eventually, my own hashtable implementation with linear probing), with the value being the number of lines seen so far with that character distribution. From this I can accumulate the number of pairs additively via the handshake lemma as I read through the lines of text.
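For concreteness, a minimal sketch of the accumulation step I have in mind (the class name `LineKey` is just a placeholder, and the `hashCode` shown delegates to `Arrays.hashCode` only as a stand-in for whatever hash function I end up with):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class PairCounter {
    // Placeholder wrapper over the 128-int count array.
    static final class LineKey {
        final int[] counts = new int[128];
        LineKey(String line) {
            for (int i = 0; i < line.length(); i++) {
                counts[line.charAt(i)]++;
            }
        }
        @Override public boolean equals(Object o) {
            return o instanceof LineKey && Arrays.equals(counts, ((LineKey) o).counts);
        }
        @Override public int hashCode() {
            return Arrays.hashCode(counts); // stand-in hash, the open question
        }
    }

    // Handshake lemma, accumulated incrementally: when the (k+1)-th line
    // with a given distribution arrives, it forms k new pairs.
    static long countPairs(String[] lines) {
        Map<LineKey, Integer> freq = new HashMap<>();
        long pairs = 0;
        for (String line : lines) {
            pairs += freq.merge(new LineKey(line), 1, Integer::sum) - 1;
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] lines = {"abc", "cba", "bca", "xyz"};
        System.out.println(countPairs(lines)); // 3: the three anagrams of "abc"
    }
}
```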

To do this, I need advice on selecting an appropriate, fast hashing function that minimises collisions, and, perhaps more importantly for my own understanding, a proof (or a link to one). The total number of lines is given up front, so if there is a formula to calculate the required table size from that, it can be used. What I have done so far, using this hash function chosen arbitrarily:

```java
public int hashCode() {
    int hash = 0;
    for (int i = 0; i < 128; i++) {
        hash += b[i] * (i + 1);
    }
    return (hash * this.size) % capacity;
}
```

Capacity means the size of the hashtable itself. In math terms, with s = size of the string and m = size of the hashtable: $\left(\sum_{i=0}^{127} x[i]\,(i+1)\right) \cdot s \bmod m$
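For what it's worth, a quick check shows the weighted sum alone already collides for distinct distributions of the same string length (so the `* size` factor doesn't separate them either), e.g. two 'b's versus one 'a' plus one 'c':

```java
public class CollisionCheck {
    // The weighted sum from my current hashCode: sum of counts[i] * (i + 1).
    static int weightedSum(int[] counts) {
        int hash = 0;
        for (int i = 0; i < 128; i++) {
            hash += counts[i] * (i + 1);
        }
        return hash;
    }

    public static void main(String[] args) {
        int[] twoBs = new int[128];
        twoBs['b'] = 2;         // 2 * ('b' + 1) = 2 * 99 = 198
        int[] aAndC = new int[128];
        aAndC['a'] = 1;         // 'a' + 1 = 98
        aAndC['c'] = 1;         // 'c' + 1 = 100, total 198
        System.out.println(weightedSum(twoBs) == weightedSum(aAndC)); // true
    }
}
```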

The requirement for this hash function: given a constant-sized 128-int array in which each value is bounded above by the length of the string, how can I come up with a hash that is as close to unique as possible?

Or, alternatively: given a string of size x with a particular character distribution, how can I calculate a hash value such that another string with the same character distribution has the same hash value?
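One candidate I'm considering (not claiming it's optimal, and I'd welcome an argument either way) is a polynomial hash over the counts with a prime multiplier, which is what `java.util.Arrays.hashCode(int[])` computes, leaving the mod-by-capacity to the table itself:

```java
import java.util.Arrays;

public class CountHash {
    // Polynomial hash over the 128 counts with prime multiplier 31,
    // equivalent to java.util.Arrays.hashCode(int[]). Reduction modulo
    // the table capacity is left to the hashtable, not done here.
    static int hash(int[] counts) {
        int h = 1;
        for (int c : counts) {
            h = 31 * h + c;
        }
        return h;
    }

    public static void main(String[] args) {
        int[] counts = new int[128];
        counts['a'] = 2;
        counts['b'] = 1;
        System.out.println(hash(counts) == Arrays.hashCode(counts)); // true
    }
}
```

Unlike the plain weighted sum, this is position-sensitive in the count array, so different distributions are less likely to cancel out to the same value.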

Ideally it will be fast to compute.

Thanks!