Matlab, 30828 8620 6848
It builds the hash by assigning a prime number to each ASCII character/position combination (the prime at index character code times position) and taking the product of those primes over the word, modulo 16777213, the largest prime smaller than 2^24. Note that for testing I moved the call to primes out of the hash function into the tester, directly before the while loop, and passed the result into the hash function, because that sped it up by about a factor of 1000. This version still works and is self-contained. It may crash with words longer than about 40 characters.
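function h = H(s)
    p = primes(1e6);   % table of all primes below 1e6
    h = 1;
    for i = 1:length(s)
        % pick the prime indexed by (character code * position), multiply it in,
        % and reduce modulo the largest prime below 2^24
        h = mod(h*p(double(s(i))*i), 16777213);
    end
end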
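The tester below calls H(word,p), i.e. the faster testing variant mentioned above rather than the one-argument function as posted. A minimal sketch of what such a variant could look like follows. The optional second parameter p and the nargin fallback are assumptions for illustration, not the code that was actually timed.

```matlab
function h = H(s, p)
% Sketch of a variant of H that can reuse a precomputed prime table.
% The optional second argument p is an assumption, not the posted signature.
if nargin < 2
    p = primes(1e6);   % fall back to building the table, as the posted H does
end
h = 1;
for i = 1:length(s)
    % prime index = character code times (1-based) position
    h = mod(h * p(double(s(i)) * i), 16777213);
end
end
```

For example, with p = primes(1e6), the call H('a', p) picks p(97) = 509 (the character code of 'a' is 97 and its position is 1), so it returns 509. Calling H('a') without the table gives the same value, just more slowly.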
Tester:
clc
clear variables
close all
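file = fopen('british-english-huge.txt');
% maps hash value -> number of words that produced it
hashes = containers.Map('KeyType','uint64','ValueType','uint64');
words = 0;
p = primes(1e6);   % computed once here instead of inside H, for speed
while ~feof(file)
    words = words + 1;
    word = fgetl(file);
    % this calls the testing variant of H that takes the precomputed primes
    % as a second argument (see the note in the text above)
    hash = H(word,p);
    if hashes.isKey(hash)
        hashes(hash) = hashes(hash) + 1;
    else
        hashes(hash) = 1;
    end
end
fclose(file);
% count every word whose hash is shared with at least one other word
collisions = 0;
for key=keys(hashes)
    if hashes(key{1})>1
        collisions = collisions + hashes(key{1});
    end
end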