| Implementation | Memory Usage | Memory Overhead | Insertion Time | Query Time |
|---|---|---|---|---|
| std::unordered_map | 588M | 5.03x | 2.626 sec | 2.134 sec |
| sparse_hash_map | 494M | 4.22x | 7.393 sec | 2.112 sec |
| dense_hash_map | 1280M | 10.94x | 1.455 sec | 1.436 sec |
| libcuckoo | 708M | 6.05x | 2.026 sec | 2.120 sec |
| klib khash | 642M | 5.48x | 4.232 sec | 1.647 sec |
| Load | Chaining | Open Addressing |
|---|---|---|
| 100% | 1.31x | 1.00x |
| 90% | 1.37x | 1.11x |
| 80% | 1.47x | 1.25x |
| 70% | 1.60x | 1.42x |
| 50% | 2.09x | 2.00x |
| 25% | 4.03x | 4.00x |
A `div` instruction on a 64-bit integer can take 32-96 cycles. Almost all
major hash implementations use a power-of-2 table size, so the modulo
reduces to a single bitwise AND. The problem with power-of-2 table sizes
is that they scale too fast: if our data is one bit over 2GB, the table
must be at least 4GB, giving us 50% load. Finding a fast alternative to
the modulo operation is critical for building a table with high load
without losing much performance.
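To make the power-of-2 trick above concrete, here is a minimal sketch (not taken from any particular implementation):

```c
#include <stdint.h>

/* When the table size is a power of 2, reduction modulo the size is a
 * single AND with (size - 1), instead of a slow div instruction. */
static inline uint64_t mod_pow2(uint64_t hash, uint64_t size) {
  return hash & (size - 1); /* valid only when size is a power of 2 */
}
```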
Professor Lemire was probably the first person to address this issue.
He wrote a blog post that provides a fast alternative to modulo.
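The core of Lemire's alternative is a multiply-then-shift that maps a hash into `[0, range)` without any division. A minimal sketch of the 32-bit version (his post covers the details and a 64-bit variant):

```c
#include <stdint.h>

/* Lemire's fast alternative to modulo: treat the 32-bit hash as a
 * fraction in [0, 1) and scale it by `range`. The high 32 bits of the
 * 64-bit product give an index in [0, range) with no div instruction. */
static inline uint32_t fast_range(uint32_t hash, uint32_t range) {
  return (uint32_t)(((uint64_t)hash * (uint64_t)range) >> 32);
}
```

Unlike a plain mask, this works for any `range`, not just powers of 2, which is what makes high-load tables feasible.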
Along with capacity_clz, the scale is defined by the most significant
4 bits of the capacity, capacity_ms4b. capacity_ms4b is pre-computed at
hash table creation or resize time; it is a round-up of the desired
capacity with finer granularity compared to power-of-2 tables.
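To make the rounding concrete, here is a hypothetical sketch of storing a capacity as its most significant 4 bits plus a shift. The field names follow the text, but the exact rounding logic is my assumption, not the author's code:

```c
#include <stdint.h>

/* Hypothetical sketch: keep only the top 4 significant bits of the
 * capacity (capacity_ms4b) plus a shift, so capacity can grow in steps
 * of roughly 1/8 instead of doubling like a power-of-2 table. */
typedef struct {
  uint64_t capacity_ms4b; /* most significant 4 bits of the capacity */
  int shift;              /* how far those 4 bits are shifted left   */
} cap_repr;

static inline cap_repr cap_round_up(uint64_t desired) {
  cap_repr r;
  /* position of the highest set bit; `| 1` avoids clz(0), which is UB */
  int msb = 64 - __builtin_clzll(desired | 1);
  r.shift = msb > 4 ? msb - 4 : 0;
  /* round up so bits below the top 4 are absorbed */
  r.capacity_ms4b = (desired + ((1ULL << r.shift) - 1)) >> r.shift;
  return r;
}

static inline uint64_t cap_value(cap_repr r) {
  return r.capacity_ms4b << r.shift;
}
```

For example, a desired capacity of 600 rounds up to 640 (ms4b = 10, shift = 6) rather than jumping to 1024 as a power-of-2 table would.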
I used the Intel Architecture Code Analyzer to analyze the instruction
throughput of my methods, and the result is very satisfying:

- Power of 2 table with quadratic probing
  - Block Throughput: 4.10 Cycles
  - Total Num Of Uops: 9
- Fast mod and scale with quadratic probing
  - Block Throughput: 4.15 Cycles
  - Total Num Of Uops: 12
dense_hash_map under high load.
The reason is that robin hood hashing moves buckets around during
insertion, while dense_hash_map simply probes and inserts into an
empty bucket once one is found.
Luckily, robin hood hashing gets a faster lookup time compared to
dense_hash_map. I think the major reason is that robin hood hashing
yields a low expected probe count, and the overall throughput
benefits from it.
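To illustrate the difference described above, here is a minimal robin hood insertion sketch (a toy with a fixed power-of-2 capacity and an identity hash; not the OPIC implementation). On a collision, the incoming entry steals the bucket from any resident that sits closer to its home slot, and the displaced resident continues probing:

```c
#include <stdint.h>

#define CAP 16u  /* toy fixed capacity; caller must keep load < 100% */

typedef struct { uint32_t key; int used; } bucket;

static void rh_insert(bucket *tab, uint32_t key) {
  uint32_t idx = key % CAP;   /* toy hash: identity */
  uint32_t dist = 0;          /* probe distance of the entry in hand */
  for (;;) {
    bucket *b = &tab[idx];
    if (!b->used) { b->key = key; b->used = 1; return; }
    /* how far the resident already is from its own home slot */
    uint32_t resident_dist = (idx + CAP - b->key % CAP) % CAP;
    if (resident_dist < dist) {
      /* resident is "richer": rob it, then re-insert the displaced key */
      uint32_t tmp = b->key; b->key = key; key = tmp;
      dist = resident_dist;
    }
    idx = (idx + 1) % CAP; dist++;
  }
}
```

This is the extra shuffling that slows inserts down, but it bounds the variance of probe distances, which is where the lookup win comes from.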
The benchmark code is available at hash_bench. My robin hood
hashing implementation is available at opic robin hood hashing.
std::string has 24 bytes of overhead on small strings, so the memory
comparison is not fair. I'll conduct another set of benchmarks using
integers tonight.
Also, one of the authors of libcuckoo (@dga) pointed out that libcuckoo
would perform better if I used the thread-unsafe version. I'll also
update the benchmark with this new setup.
The short-string problem brings up a question: what is the best
practice for using a C++ hash map with short strings? Isn't this a
common use case in daily programming? I tried to do some quick
searching but didn't find any useful information, and I suck at C++…
Any good ideas on how to do this better?
Algorithm, C, HashTable