HASH TABLE
HASH TABLE ==> #Implementation of an associative array where:
# - values are stored in an array
# - keys are hashed and used as array indices ("buckets")
# - "keyspace": all possible keys
#Also called "hash map"
#Array size N:
# - often smaller than the hash domain, in which case hashes are
#   reduced using the modulo operator.
#Collisions:
# - number of keys per bucket (n keys, N buckets):
#    - min: ⌊n/N⌋
#    - average: n/N ("density" or "load factor")
#    - max: n
# - collision resolution schemes:
# - increasing array size
# - separate chaining
# - open addressing
# - perfect hashing
# - extendible hashing
#Pros:
# - time complexity of operations (search, insert, delete): O(1)
#Cons:
# - low space efficiency: O(m) where m is the number of possible hashes
# - time cost of the hashing function
#    - i.e. not efficient for a small number of keys
# - time|space cost of handling collisions
# - poor locality of reference
# - unordered
# - not persistent data structure
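The key-to-bucket mapping described above can be sketched as follows (Python, with illustrative names; `hash()` stands in for any hash function):

```python
N = 8  # array size, usually much smaller than the hash domain

def bucket_index(key):
    # hash the key, then reduce with modulo to fit the N buckets
    return hash(key) % N

# every possible key maps to one of the N buckets
assert 0 <= bucket_index("some key") < N
```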
/=+===============================+=\
/ : : \
)==: RESIZING :==(
\ :_______________________________: /
\=+===============================+=/
ALL-AT-ONCE REHASH ==> #Triggered when density reaches a specific threshold
#Re-inserts each element into the new, larger array, recalculating each hash
INCREMENTAL REHASH ==> #Build the new array in parallel
# - either time-based, or incrementally after each operation
#Check both old and new array on search|delete
#Use only new array on insert
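A minimal sketch of the incremental scheme above (hypothetical class and method names): two bucket arrays coexist, inserts go only to the new one, lookups check both, and each operation migrates one old bucket.

```python
class IncrementalTable:
    MIGRATE_PER_OP = 1  # buckets migrated per operation

    def __init__(self):
        self.old = [[] for _ in range(4)]   # array being drained
        self.new = [[] for _ in range(8)]   # array being filled
        self.migrated = 0                   # next old bucket to move

    def _step(self):
        # incrementally move a few old buckets into the new array
        for _ in range(self.MIGRATE_PER_OP):
            if self.migrated >= len(self.old):
                return
            for k, v in self.old[self.migrated]:
                self.new[hash(k) % len(self.new)].append((k, v))
            self.old[self.migrated] = []
            self.migrated += 1

    def insert(self, key, value):
        self._step()
        # use only the new array on insert
        self.new[hash(key) % len(self.new)].append((key, value))

    def search(self, key):
        self._step()
        # check both old and new arrays
        for table in (self.new, self.old):
            for k, v in table[hash(key) % len(table)]:
                if k == key:
                    return v
        return None
```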
LINEAR HASHING ==> #Divide the array into three parts 1), 2) and 3)
# - when a hash falls into 1), use twice the modulo instead
#    - which means it could fall into either 1) or 3)
#Adding a single bucket:
# - added to 3)
# - first 2) bucket becomes 1)
# - bucket re-hashed ("split") with new modulo
# - point between 1) and 2) is the "split pointer"
#If 2) has no more buckets:
# - 1) and 3) merged into 2)
#I.e. rehash is done incrementally
#Only works with separate chaining
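The address computation above can be sketched as follows (illustrative values, assuming N original buckets and a split pointer separating already-split part 1) from not-yet-split part 2); part 3) holds the new buckets):

```python
N = 4        # buckets at the start of the current round
split = 2    # buckets 0..split-1 have already been split

def bucket(h):
    b = h % N
    if b < split:          # hash falls into part 1): use twice the modulo,
        b = h % (2 * N)    # so it lands in either part 1) or part 3)
    return b

assert bucket(1) in (1, 1 + N)   # part 1) key: old bucket or its new "split" image
assert bucket(3) == 3            # part 2) key: unchanged for now
```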
DISTRIBUTED HASH TABLES ==> #See below
/=+===============================+=\
/ : : \
)==: COLLISIONS :==(
\ :_______________________________: /
\=+===============================+=/
[SEPARATE] CHAINING ==> #Way to handle collisions in a hash table
#Storing colliding values (the "overflow") in a data structure, instead of a single value
#Often used data structures:
# - linked list:
# - can either store a pointer to a linked list, or directly a linked list ("list
# head cells")
# - dynamic array ("array hash table")
# - search tree: only makes sense when the hashing function is highly non-uniform
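A minimal sketch of separate chaining with linked-list-like buckets (here Python lists of (key, value) pairs; names are illustrative):

```python
class ChainedTable:
    def __init__(self, size=8):
        # each bucket is a chain of (key, value) pairs
        self.buckets = [[] for _ in range(size)]

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key exists: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))       # collision: append to the chain

    def search(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._chain(key)
        chain[:] = [(k, v) for k, v in chain if k != key]
```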
OPEN ADDRESSING ==> #Way to handle collisions in a hash table.
#Also called "closed hashing"
#How:
# - on write collision (bucket already occupied) or read collision (key does not match)
# - iterates to next element ("probing" through a "probe sequence") until one fits
# - cycles from end to start
# - possible schemes are below
#Clustering:
# - when values are close to each other, causing collisions
# - increases with high density
# - can be:
# - "primary":
# - when probing sequence is close to each other (e.g. linear probing)
# - amplified when hash function is not uniform
# - "secondary":
# - when probing sequence is far from each other (e.g. double hashing)
# - probe sequences overlap if using the same (from worst to best):
# - current bucket, i.e. can overlap even on different key and key hash
# - first bucket, i.e. overlaps if key hash is same
# - key, i.e. no overlap
# - when probing iteration is determined only by current item:
# - i.e. not first one nor input
# - collisions will make probe sequences overlap, amplifying the problem
#Time complexity:
# - O(1 + n/N): grows with density, i.e. with the probability of collision and clustering
# - high locality of reference improves cache performance
# - i.e. if values are too large to fit in CPU cache, open addressing is less interesting
#Deletion:
# - must copy the next values (according to iteration) to fill the iteration gap
# - "lazy deletion":
# - deleted values are just flagged instead of erased from memory
#    - inserting into a deleted-flagged value simply overwrites it
# - searching through a deleted-flagged value makes it iterate to next value, but the
# finally found value is then copied to the deleted-flagged value
# - can also do this as part of a cleanup routine, e.g. after a specific number of
# operations
# - pro: faster deletion
# - con: slower search
#Comparison with separate chaining:
# - pros:
# - more space efficient
# - more time efficient when low density because fewer pointers
# - cons:
# - less time efficient when high density because of clustering, i.e.:
# - need to dynamically resize on smaller density threshold
# - stronger need for uniform hash function
# - it is possible to run out of values if no resize is done
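A sketch of open addressing with linear probing and lazy deletion (illustrative names; deleted slots are flagged with a tombstone sentinel so probe sequences stay unbroken; a full version would also probe past tombstones on insert to check for an existing key):

```python
EMPTY, TOMBSTONE = object(), object()

class OpenTable:
    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        # linear probing: fixed increment of 1, wrapping from end to start
        n = len(self.slots)
        i = hash(key) % n
        for _ in range(n):
            yield i
            i = (i + 1) % n

    def insert(self, key, value):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY or slot is TOMBSTONE or slot[0] == key:
                self.slots[i] = (key, value)   # tombstones are reusable
                return
        raise RuntimeError("table full: resize needed")

    def search(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:                  # a true hole ends the probe
                return None
            if slot is not TOMBSTONE and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return
            if slot is not TOMBSTONE and slot[0] == key:
                self.slots[i] = TOMBSTONE      # flag, don't erase
                return
```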
PERFECT HASHING ==> #Way to handle collisions in a hash table.
#Choosing a hashing function (i.e. universal hashing function) that creates no collision
# - must try several hashing function until find one
#"FKS hashing":
# - doing hashing twice, and using a hash Iliffe vector instead
# - each second-level table has size nᵢ², which makes collision-free functions much easier to find
# - total expected space stays O(n)
#Set of keys can be:
# - static
# - dynamic: must rebuild and|or resize either:
# - when collision happens
# - after specific number of operations
#Comparison with separate chaining:
# - pro: very time efficient
# - con: building is very slow, so set of keys should be mostly static
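The "try hash functions until one is collision-free" step can be sketched with a universal family h(x) = ((a*x + b) mod p) mod N (illustrative constants; p is a prime larger than any key):

```python
import random

def find_perfect_hash(keys, N, p=2_147_483_647):  # p: Mersenne prime 2^31 - 1
    while True:
        # draw a random function from the universal family
        a, b = random.randrange(1, p), random.randrange(p)
        h = lambda x, a=a, b=b: ((a * x + b) % p) % N
        if len({h(k) for k in keys}) == len(keys):  # no collision: done
            return h

keys = [3, 17, 42, 99]
h = find_perfect_hash(keys, N=16)
assert len({h(k) for k in keys}) == len(keys)  # perfect on this key set
```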
EXTENDIBLE HASHING ==> #Way to handle collisions in a hash table.
#How:
# - each bucket has its own modulo size ("local depth")
# - initially same as global depth
# - "global depth":
# - hash table size
# - resized by a factor of D (often 2)
#On insert collision:
# - if bucket's local depth < global depth:
#    - split it into D buckets, redistributing its keys
#       - in the worst case, 1 bucket stays full and the others empty
# - repeat until either no collision or local depth = global depth
# - if still collision:
# - resize hash table
# - i.e. global depth is increased but not buckets' local depth
# - i.e. O(1) time complexity
# - then re-insert
#Often implemented using a hash trie (called "directory")
# - in which case, local depth is also number of pointers from trie to bucket
/=+===============================+=\
/ : : \
)==: OPEN ADDRESSING :==(
\ :_______________________________: /
\=+===============================+=/
K-CHOICE HASHING ==> #Using k different probing methods|sequences.
#On insertion, picking the one with the least number of collisions.
LINEAR PROBING ==> #Fixed increment, usually 1
#Stops iteration when going back to same position
#Pros:
# - low secondary clustering
# - high locality of reference
#Cons:
# - high primary clustering
# - probe sequence overlaps on current bucket
ROBIN HOOD HASHING ==> #Similar to linear probing except:
# - on final insertion, the number of iterations performed is stored
# - during iteration, if the new item's number of iterations > current item's stored number,
# swap them
#Goal: even out probe sequence lengths, reducing worst-case time complexity.
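A sketch of Robin Hood insertion (illustrative names): each stored entry remembers its probe distance, and a new entry that has already probed farther steals the slot from a "richer" one, which then keeps probing.

```python
def robin_hood_insert(slots, key, value):
    n = len(slots)
    i, dist = hash(key) % n, 0
    for _ in range(n):
        if slots[i] is None:
            slots[i] = (key, value, dist)      # store probe distance too
            return
        k, v, d = slots[i]
        if k == key:
            slots[i] = (key, value, d)         # key exists: overwrite
            return
        if dist > d:                           # resident probed less than us:
            slots[i] = (key, value, dist)      # swap ("rob the rich"),
            key, value, dist = k, v, d         # displaced entry moves on
        i, dist = (i + 1) % n, dist + 1
    raise RuntimeError("table full: resize needed")
```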
QUADRATIC PROBING ==> #Quadratic polynomial increment, e.g. x²
#Stops iteration when x exceeds the array size
# - probed positions may repeat though:
#    - at least the first ⌈N/2⌉ probes are unique if array size N is prime
#    - all probes are unique if either:
#       - array size is 2ᵐ and the increment is the triangular number x*(x+1)/2
#       - array size is a prime of the form 4m+3, and the increment alternates between +x² and -x²
#    - i.e. important to keep density low
#Pros and cons:
# - in-between linear probing and double hashing
# - probe sequence overlaps on first bucket
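The power-of-two case above can be sketched with triangular increments, which visit every slot exactly once when the table size is 2ᵐ (illustrative names):

```python
def quadratic_probe(h, size, i):
    # i-th probe position for initial hash h; size must be a power of 2
    return (h + i * (i + 1) // 2) % size

size = 8
seq = [quadratic_probe(3, size, i) for i in range(size)]
assert sorted(seq) == list(range(size))  # all slots visited, no repeats
```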
DOUBLE HASHING ==> #Increment is HASHₓ(key), using a series of universal hash functions
#Iteration keeps going until found
#Pros:
# - low primary clustering
# - no probe sequence overlaps
#Cons:
# - high secondary clustering
# - low locality of reference
# - requires extra hash computation
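A sketch of the double-hashing probe sequence: the step size comes from a second hash of the key, so different keys get different sequences (illustrative hash derivation; the array size should be prime so any step covers every slot):

```python
def double_hash_probe(key, size, i):
    h1 = hash(key) % size                       # first bucket
    h2 = 1 + (hash(key) // size) % (size - 1)   # step size, never 0
    return (h1 + i * h2) % size

size = 7  # prime, so gcd(step, size) = 1 and the sequence is a full cycle
seq = [double_hash_probe("some key", size, i) for i in range(size)]
assert sorted(seq) == list(range(size))  # covers every slot, no repeats
```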
CUCKOO HASHING ==> #Similar to double hashing except:
# - HASHₓ(key) is location not increment
# - instead of inserting at the last location found, insert at the first one and push
#   displaced values to their other hash location
# - uses limited number m of hash functions:
# - on loops, must trigger an array resize
# - often divide array in m, with each slice having its own hash function
# - when m is smaller:
# - pro: better worst time complexity
# - con:
# - higher probability of cycles, i.e. density must be smaller
# - m = 2 -> density < 0.5, m = 3 -> density < 0.91
#Comparison with double hashing:
# - worse average time complexity
# - better worst-case time complexity (because resize is triggered more often)
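A sketch of cuckoo hashing with m = 2 hash functions, here one per array slice (illustrative names; `("alt", key)` tuples stand in for a second hash function; exceeding max_kicks stands in for cycle detection triggering a resize):

```python
def cuckoo_insert(t1, t2, key, max_kicks=16):
    for _ in range(max_kicks):
        i = hash(key) % len(t1)
        if t1[i] is None or t1[i] == key:
            t1[i] = key
            return
        key, t1[i] = t1[i], key                # kick resident out of table 1
        j = hash(("alt", key)) % len(t2)       # displaced key's second hash
        if t2[j] is None or t2[j] == key:
            t2[j] = key
            return
        key, t2[j] = t2[j], key                # kick resident out of table 2
    raise RuntimeError("probable cycle: resize and rehash needed")

def cuckoo_search(t1, t2, key):
    # only two possible locations to check: O(1) worst-case lookup
    return (t1[hash(key) % len(t1)] == key
            or t2[hash(("alt", key)) % len(t2)] == key)
```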
COALESCED HASHING ==> #Hybrid with separate chaining
#On collision:
# - uses the first empty bucket on the table
# - its position is stored to avoid linearly searching it each time
# - uses pointers, i.e. forms linked lists inside the table
#"Cellar":
# - optimization where part of the hash table is reserved for collision values
# - pro: faster to find next empty bucket
# - con: must use heuristic to find good ratio, often 0.14
#Deletions are slow
/=+===============================+=\
/ : : \
)==: DISTRIBUTED HASH TABLE :==(
\ :_______________________________: /
\=+===============================+=/
DISTRIBUTED HASH TABLE (DHT) ==> #Hash table distributed over several nodes:
# - nodes communicate using an "overlay network"
# - keyspace is split among nodes using a "keyspace partitioning" scheme
#Goals: scalability, decentralization, fault tolerance
CONSISTENT HASHING ==> #Keyspace partitioning scheme, where:
# - each node gets assigned as value a random position ("token") in keyspace
# - each key belongs to the node with the closest greater|lesser node value
#Also called "stable hashing"
#"Weighting":
# - assigning several values per node, to ensure better spread
#Resizing:
# - on node removal, its keys are reassigned
# - on node insertion, its keys are calculated, reassigned from other nodes
# - i.e. on average, node removal|insertion moves N/n keys, where N is the number
#   of keys and n the number of nodes
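The token-ring lookup above can be sketched with a sorted token list and binary search (illustrative node names and keyspace size; "closest greater" node wins, wrapping around the ring):

```python
import bisect

KEYSPACE = 1024
nodes = ["node-a", "node-b", "node-c"]
# each node gets a pseudo-random token position on the ring
ring = sorted((hash(("token", n)) % KEYSPACE, n) for n in nodes)
tokens = [t for t, _ in ring]

def node_for(key):
    h = hash(key) % KEYSPACE
    i = bisect.bisect(tokens, h) % len(tokens)  # wrap past the last token
    return ring[i][1]
```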
RENDEZVOUS HASHING ==> #Keyspace partitioning scheme, where:
# - each node gets assigned a random value
# - total order over all nodes can also be set in case of score ties
# - each key belongs to the node with the lowest HASH(key, node_value) ("score"/"weight")
# - can also pick the n lowest scores, to assign several nodes per key
#Resizing is similar to consistent hashing
#Pro:
# - more distributed, less clustered keyspace
#Con:
# - need to iterate over each node, instead of just the closest one
# - i.e. O(n) time complexity where n is number of nodes
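A sketch of rendezvous (highest-random-weight) hashing: every node is scored against the key and the best score wins, so each lookup iterates over all n nodes (illustrative names; per the notes, the lowest score wins):

```python
nodes = ["node-a", "node-b", "node-c"]

def owner(key):
    # score every node against the key; lowest HASH(key, node) wins
    return min(nodes, key=lambda n: hash((key, n)))
```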
/=+===============================+=\
/ : : \
)==: COMBINATIONS :==(
\ :_______________________________: /
\=+===============================+=/
HASH TRIE ==> #Hash table where hashes are stored in a trie.
#Goal:
# - slightly worse time complexity than a plain hash table
# - but much lower space complexity
# - better suited as a persistent data structure
#"Hash array mapped trie" (HAMT): when using array mapped trie
/=+===============================+=\
/ : : \
)==: BLOOM FILTER :==(
\ :_______________________________: /
\=+===============================+=/
BLOOM FILTER ==> #Implementation for a set, using a probabilistic data structure
#How:
# - big single array of m bits
# - each value's hash is OR'd into the bit array
#    - hashes must have a constant number k of 1-bits
#       - can do it by using k hash functions that each map to [0, m-1]
#False positives are possible (but no false negatives), with probability p
#Operations:
# - cannot iterate|retrieve values (because they are hashed)
# - intersection (AND of two filters) increases p, union (OR) does not
# - remove():
# - not available
# - can emulate by using a second bloom filter for elements that have been removed
# - but this introduces possible false negatives
# - length():
# - not available
#    - can approximate with -m/k * ln(1 - X/m), where X is the number of bits set
#Pros:
# - space complexity: m, i.e. O(n), with very small amount of bits per n
# - time complexity: O(k)
#Cons:
# - limited operations
# - false positives
#If the set of possible elements is known, a plain bit array is more efficient
#Parameters:
# - p = (1-(1-1/m)ᵏⁿ)ᵏ
# - optimal k = m/n * ln(2)
# - all following assume optimal k
# - p = 2^(-m/n * ln(2))
#    - p increases exponentially with n, decreases exponentially with m
#    - p is roughly halved each time m/n increases by 1.44 (= 1/ln(2))
# - m = -n * log₂(p) / ln(2)
#    - for a given p, m increases linearly with n
# - m/n = -log₂(p) / ln(2)
#    - 9.6 bits per element for 1% false positives
#    - adding around 4.8 bits per element divides p by 10, e.g. 19.2 bits per element for 0.01% false positives
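A minimal Bloom filter sketch with m bits and k hash functions (the k functions are derived here from salted hashes, an illustrative choice; names are not from the text):

```python
class BloomFilter:
    def __init__(self, m=64, k=4):
        self.m, self.k = m, k
        self.bits = 0      # m-bit array packed into an int

    def _positions(self, value):
        # k hash functions, each mapping the value to [0, m-1]
        return [hash((i, value)) % self.m for i in range(self.k)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos          # OR the k bits into the array

    def __contains__(self, value):
        # all k bits set: "possibly present"; any unset: definitely absent
        return all(self.bits >> pos & 1 for pos in self._positions(value))

bf = BloomFilter()
bf.add("hello")
assert "hello" in bf   # no false negatives
```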
COUNTING FILTER ==> #Like bloom filter except:
# - instead of flagging individual bits, increments a counter of c bits
#    - higher c gives a higher max count before overflow
#I.e. new operations are available:
# - remove() becomes available by decreasing the counters
#    - even after removal, the element might still be reported as a false positive though
# - length() by summing all counters and dividing by k
#However space complexity is higher: m*c, i.e. O(n*c)
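A sketch of the counting variant (illustrative names; positions come from k salted hashes): the bit array becomes an array of counters, so remove() can decrement and length() can be approximated by the counter sum over k.

```python
class CountingFilter:
    def __init__(self, m=64, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m   # counters instead of single bits

    def _positions(self, value):
        return [hash((i, value)) % self.m for i in range(self.k)]

    def add(self, value):
        for pos in self._positions(value):
            self.counters[pos] += 1

    def remove(self, value):
        for pos in self._positions(value):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def __contains__(self, value):
        return all(self.counters[pos] for pos in self._positions(value))

    def approx_len(self):
        # each element increments k counters, so sum/k estimates the size
        return sum(self.counters) // self.k
```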