```
HashTable
    size():int                 -- how many entries?
    has(key:Hashable):boolean  -- is the key in the table?
    get(key:Hashable):Object throws KeyNotFound
                               -- get the associated value.
    put(key:Hashable, value:Object)
                               -- associate a key and value.
    remove(key:Hashable) throws KeyNotFound
                               -- remove the key from the table.

Hashable
    hash():int
```

You may have used hash tables by another name in your programming language. For instance, in Python they are called dictionaries, in Ruby hashes, and in Java they are called HashMaps.
```go
type Hashable interface {
	Equals(b Hashable) bool
	Hash() int
}

type entry struct {
	key   Hashable
	value interface{}
	next  *entry
}

type hash struct {
	table []*entry
	size  int
}
```

A hash is a struct with two elements: an array of pointers to entry, and a size. The entries hold our key-value pairs. The way the table works is we convert the key into a number, which we then clamp to the size of our table. That number is the index of some entry in our table, from which we can add, look up, or remove the key.
The entry struct represents a key-value pair and also serves as a node in a linked list. Since covering linked list operations is a bit beyond the scope of this paper, let me just present the operations on *entry with no explanation. The functions should be obvious even if the implementation is obscure.
```go
func (self *entry) Put(key Hashable, value interface{}) (e *entry, appended bool) {
	if self == nil {
		return &entry{key, value, nil}, true
	}
	if self.key.Equals(key) {
		self.value = value
		return self, false
	} else {
		self.next, appended = self.next.Put(key, value)
		return self, appended
	}
}

func (self *entry) Get(key Hashable) (has bool, value interface{}) {
	if self == nil {
		return false, nil
	} else if self.key.Equals(key) {
		return true, self.value
	} else {
		return self.next.Get(key)
	}
}

func (self *entry) Remove(key Hashable) *entry {
	if self == nil {
		panic(Errors["list-not-found"])
	}
	if self.key.Equals(key) {
		return self.next
	} else {
		self.next = self.next.Remove(key)
		return self
	}
}
```
To find a key's slot in the table we hash it and clamp the result to the table length with the bucket function.

```go
func (self *hash) bucket(key Hashable) int {
	return key.Hash() % len(self.table)
}
```
To put a key-value pair into the table we find its bucket and use the entry's Put method to place the pair into the list. If it was actually appended onto the list (rather than updating an existing entry) we increment the size field.
```go
func (self *hash) Put(key Hashable, value interface{}) (err error) {
	bucket := self.bucket(key)
	var appended bool
	self.table[bucket], appended = self.table[bucket].Put(key, value)
	if appended {
		self.size += 1
	}
	return nil
}
```

Now there is one more wrinkle, which I will return to in a moment: resizing the table when it gets too full.
```go
func (self *hash) Get(key Hashable) (value interface{}, err error) {
	bucket := self.bucket(key)
	if has, value := self.table[bucket].Get(key); has {
		return value, nil
	} else {
		return nil, Errors["not-found"]
	}
}
```
To remove a key we call Remove on the linked list instead of Put, and update the head as before. We check that the key is in the linked list first, as this slightly simplifies the removal algorithm above.
```go
func (self *hash) Remove(key Hashable) (value interface{}, err error) {
	bucket := self.bucket(key)
	has, value := self.table[bucket].Get(key)
	if !has {
		return nil, Errors["not-found"]
	}
	self.table[bucket] = self.table[bucket].Remove(key)
	self.size -= 1
	return value, nil
}
```
Finally, the wrinkle: when the table gets too full we must expand it, and every key must be rehashed, because the bucket function depends on the table size.
```go
func (self *hash) expand() error {
	table := self.table
	self.table = make([]*entry, len(table)*2)
	self.size = 0
	for _, E := range table {
		for e := E; e != nil; e = e.next {
			if err := self.Put(e.key, e.value); err != nil {
				return err
			}
		}
	}
	return nil
}
```
```go
func (self *hash) Put(key Hashable, value interface{}) (err error) {
	bucket := self.bucket(key)
	var appended bool
	self.table[bucket], appended = self.table[bucket].Put(key, value)
	if appended {
		self.size += 1
	}
	if self.size*2 > len(self.table) {
		return self.expand()
	}
	return nil
}
```
Figure 1. CPU and Storage
Unfortunately, the algorithm presented above does not work well when using secondary storage media like hard disks and solid state drives. There are several reasons for this:
1. Secondary storage is slower than RAM.
2. The bus is slower.
3. Many peripherals hang off of the South Bridge.
4. Disks may be daisy chained, causing bus contention.
To deal with these factors and others when using disks:
1. Read and write pages, which are blocks of 4096 bytes.
2. Try to read contiguous runs, and if writing more than one page write contiguous runs as well.
3. Batch writes.
4. Don't read one byte at a time; read several blocks and extract the byte you need.
5. Employ caching at every layer.
6. Measure performance in terms of the number of disk accesses (e.g. block reads and writes).
Figure 2. Block File
We could make a fairly straightforward adaptation of our separately chained hash table above to this restriction. However, there is a problem: what do we do when the table needs to be expanded? If the table is static there is no problem; we simply allocate the correct number of blocks right away. But if we have to expand the table, every entry will need to be rehashed. This causes us to read every block of our old table (N reads) and write to every block of our new table (2*N writes) -- ouch.
The solution is of course Linear Hashing.
Figure 3. Example
In the figure, n is the number of blocks, i is the number of bits of the hash function, and r is the number of records.
So to find which bucket a key goes to:
```
bkt_idx =
    let hash = x x x x a_1 a_2 ... a_i  (* base 2 expansion of the hash of the key *)
        m = a_1 a_2 ... a_i             (* just the last i bits *)
    in
        if m < n then
            m
        else
            m - 2^(i-1)                 (* == 0 a_2 a_3 ... a_i *)
    end
```

In Go:
```go
func bucket(hash uint) uint {
	m := hash & ((1 << i) - 1) // last i bits of hash as bucket number m
	if m < n {
		return m
	} else {
		return m ^ (1 << (i - 1)) // unset the top bit
	}
}
```
```go
func (self *LinearHash) Insert(key Hashable, value []byte) error {
	hash := key.Hash()
	bkt_idx := self.bucket(hash)
	bkt := self.get_bucket(bkt_idx)
	if err := bkt.Put(key, value); err != nil {
		return err
	}
	self.r += 1
	if self.r > UTILIZATION*self.n*self.records_per_block {
		return self.split()
	}
	return nil
}
```

As I mentioned above, if a bucket is full it should chain out an extra block for itself. This can be handled transparently.
Figure 4. Chaining Example
Figure 5. Split Example Part 1
Note that the bucket we added in the example was 1 a_i == 10 == a_1 a_2. There are some keys in the old bucket 0, which is now called 00, that actually belong to bucket 10. So in order to make the addition of the new bucket correct we need to split bucket 00.
Figure 5. Split Example Part 2
In general, if we add bucket 1 a_2 a_3 ... a_i we split bucket 0 a_2 a_3 ... a_i. In code:
```go
func (self *LinearHash) split() error {
	bkt_idx := self.n % (1 << (self.i - 1))
	bkt_a := self.get_bucket(bkt_idx)
	bkt_b, err := self.allocate()
	if err != nil {
		return err
	}
	self.n += 1
	if self.n > (1 << self.i) {
		self.i += 1
	}
	return bkt_a.split_into(bkt_b) // The split_into function is left as
	// an exercise for the reader!
}
```