HASH TABLE
HASH TABLE ==> #Implementation of an associative array where:
# - values are stored in an array
# - keys are hashed and used as array indices ("buckets")
# - "keyspace": all possible keys
#Also called "hash map"
#Array size N:
# - often smaller than the hash domain, in which case hashes are
#   reduced using the modulo operator.
#Collisions:
# - number of keys per bucket (n keys, N buckets):
#    - min: ⌊n/N⌋
#    - average: n/N ("density" or "load factor")
#    - max: n
# - collision resolution schemes:
# - increasing array size
# - separate chaining
# - open addressing
# - perfect hashing
# - extendible hashing
#Pros:
# - time complexity of operations (search, insert, delete): O(1)
#Cons:
# - low space efficiency: O(m) where m is the number of possible hashes
# - time cost of the hashing function
#    - i.e. not efficient for a small number of keys
# - time|space cost of handling collisions
# - poor locality of reference
# - unordered
# - not persistent data structure
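The key-to-bucket mapping described above can be sketched as follows (Python, with illustrative names; `hash()` stands in for any hash function):

```python
N = 8  # array size, usually much smaller than the hash domain

def bucket_index(key):
    # hash the key, then reduce with modulo to fit the N buckets
    return hash(key) % N

# every possible key maps to one of the N buckets
assert 0 <= bucket_index("some key") < N
```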
/=+===============================+=\
/ : : \
)==: RESIZING :==(
\ :_______________________________: /
\=+===============================+=/
ALL-AT-ONCE REHASH ==> #Triggered when density reaches a specific threshold
#Re-inserts each element into the new, larger array, recalculating each hash
INCREMENTAL REHASH ==> #Build the new array in parallel
# - either time-based, or incrementally after each operation
#Check both old and new array on search|delete
#Use only new array on insert
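A minimal sketch of the incremental scheme above (hypothetical class and method names): two bucket arrays coexist, inserts go only to the new one, lookups check both, and each operation migrates one old bucket.

```python
class IncrementalTable:
    MIGRATE_PER_OP = 1  # buckets migrated per operation

    def __init__(self):
        self.old = [[] for _ in range(4)]   # array being drained
        self.new = [[] for _ in range(8)]   # array being filled
        self.migrated = 0                   # next old bucket to move

    def _step(self):
        # incrementally move a few old buckets into the new array
        for _ in range(self.MIGRATE_PER_OP):
            if self.migrated >= len(self.old):
                return
            for k, v in self.old[self.migrated]:
                self.new[hash(k) % len(self.new)].append((k, v))
            self.old[self.migrated] = []
            self.migrated += 1

    def insert(self, key, value):
        self._step()
        # use only the new array on insert
        self.new[hash(key) % len(self.new)].append((key, value))

    def search(self, key):
        self._step()
        # check both old and new arrays
        for table in (self.new, self.old):
            for k, v in table[hash(key) % len(table)]:
                if k == key:
                    return v
        return None
```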
LINEAR HASHING ==> #Divide the array into three parts 1), 2) and 3)
# - when a hash falls into 1), use twice the modulo instead
#    - which means it could fall into either 1) or 3)
#Adding a single bucket:
# - added to 3)
# - first 2) bucket becomes 1)
# - bucket re-hashed ("split") with new modulo
# - point between 1) and 2) is the "split pointer"
#If 2) has no more buckets:
# - 1) and 3) merged into 2)
#I.e. rehash is done incrementally
#Only works with separate chaining
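The address computation above can be sketched as follows (illustrative values, assuming N original buckets and a split pointer separating already-split part 1) from not-yet-split part 2); part 3) holds the new buckets):

```python
N = 4        # buckets at the start of the current round
split = 2    # buckets 0..split-1 have already been split

def bucket(h):
    b = h % N
    if b < split:          # hash falls into part 1): use twice the modulo,
        b = h % (2 * N)    # so it lands in either part 1) or part 3)
    return b

assert bucket(1) in (1, 1 + N)   # part 1) key: old bucket or its new "split" image
assert bucket(3) == 3            # part 2) key: unchanged for now
```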
DISTRIBUTED HASH TABLES ==> #See below
/=+===============================+=\
/ : : \
)==: COLLISIONS :==(
\ :_______________________________: /
\=+===============================+=/
[SEPARATE] CHAINING ==> #Way to handle collisions in a hash table
#Storing colliding values (the "overflow") in a data structure, instead of a single value
#Often used data structures:
# - linked list:
# - can either store a pointer to a linked list, or directly a linked list ("list
# head cells")
# - dynamic array ("array hash table")
# - search tree: only makes sense when the hashing function is highly non-uniform
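A minimal sketch of separate chaining with linked-list-like buckets (here Python lists of (key, value) pairs; names are illustrative):

```python
class ChainedTable:
    def __init__(self, size=8):
        # each bucket is a chain of (key, value) pairs
        self.buckets = [[] for _ in range(size)]

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key exists: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))       # collision: append to the chain

    def search(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._chain(key)
        chain[:] = [(k, v) for k, v in chain if k != key]
```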
OPEN ADDRESSING ==> #Way to handle collisions in a hash table.
#Also called "closed hashing"
#How:
# - on write collision (bucket already occupied) or read collision (key does not match)
# - iterates to next element ("probing" through a "probe sequence") until one fits
# - cycles from end to start
# - possible schemes are below
#Clustering:
# - when values are close to each other, causing collisions
# - increases with high density
# - can be:
# - "primary":
# - when probing sequence is close to each other (e.g. linear probing)
# - amplified when hash function is not uniform
# - "secondary":
# - when probing sequence is far from each other (e.g. double hashing)
# - probe sequences overlap if using the same (from worst to best):
# - current bucket, i.e. can overlap even on different key and key hash
# - first bucket, i.e. overlaps if key hash is same
# - key, i.e. no overlap
# - when probing iteration is determined only by current item:
# - i.e. not first one nor input
# - collisions will make probe sequences overlap, amplifying the problem
#Time complexity:
# - O(1 + n/N): grows with density, i.e. with the probability of collision and clustering
# - high locality of reference improves cache performance
# - i.e. if values are too large to fit in CPU cache, open addressing is less interesting
#Deletion:
# - must copy the next values (according to iteration) to fill the iteration gap
# - "lazy deletion":
# - deleted values are just flagged instead of erased from memory
#    - inserting into a deleted-flagged value simply overwrites it
# - searching through a deleted-flagged value makes it iterate to next value, but the
# finally found value is then copied to the deleted-flagged value
# - can also do this as part of a cleanup routine, e.g. after a specific number of
# operations
# - pro: faster deletion
# - con: slower search
#Comparison with separate chaining:
# - pros:
# - more space efficient
# - more time efficient when low density because fewer pointers
# - cons:
# - less time efficient when high density because of clustering, i.e.:
# - need to dynamically resize on smaller density threshold
# - stronger need for uniform hash function
# - it is possible to run out of values if no resize is done
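A sketch of open addressing with linear probing and lazy deletion (illustrative names; deleted slots are flagged with a tombstone sentinel so probe sequences stay unbroken; a full version would also probe past tombstones on insert to check for an existing key):

```python
EMPTY, TOMBSTONE = object(), object()

class OpenTable:
    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        # linear probing: fixed increment of 1, wrapping from end to start
        n = len(self.slots)
        i = hash(key) % n
        for _ in range(n):
            yield i
            i = (i + 1) % n

    def insert(self, key, value):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY or slot is TOMBSTONE or slot[0] == key:
                self.slots[i] = (key, value)   # tombstones are reusable
                return
        raise RuntimeError("table full: resize needed")

    def search(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:                  # a true hole ends the probe
                return None
            if slot is not TOMBSTONE and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return
            if slot is not TOMBSTONE and slot[0] == key:
                self.slots[i] = TOMBSTONE      # flag, don't erase
                return
```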
PERFECT HASHING ==> #Way to handle collisions in a hash table.
#Choosing a hashing function (i.e. universal hashing function) that creates no collision
# - must try several hashing function until find one
#"FKS hashing":
# - doing hashing twice, and using a hash Iliffe vector instead
# - each second-level table has size nᵢ², which makes collision-free functions much easier to find
# - total expected space stays O(n)
#Set of keys can be:
# - static
# - dynamic: must rebuild and|or resize either:
# - when collision happens
# - after specific number of operations
#Comparison with separate chaining:
# - pro: very time efficient
# - con: building is very slow, so set of keys should be mostly static
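The "try hash functions until one is collision-free" step can be sketched with a universal family h(x) = ((a*x + b) mod p) mod N (illustrative constants; p is a prime larger than any key):

```python
import random

def find_perfect_hash(keys, N, p=2_147_483_647):  # p: Mersenne prime 2^31 - 1
    while True:
        # draw a random function from the universal family
        a, b = random.randrange(1, p), random.randrange(p)
        h = lambda x, a=a, b=b: ((a * x + b) % p) % N
        if len({h(k) for k in keys}) == len(keys):  # no collision: done
            return h

keys = [3, 17, 42, 99]
h = find_perfect_hash(keys, N=16)
assert len({h(k) for k in keys}) == len(keys)  # perfect on this key set
```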
EXTENDIBLE HASHING ==> #Way to handle collisions in a hash table.
#How:
# - each bucket has its own modulo size ("local depth")
# - initially same as global depth
# - "global depth":
# - hash table size
# - resized by a factor of D (often 2)
#On insert collision:
# - if bucket's local depth < global depth:
#    - split it into D buckets, redistributing its keys
#       - in the worst case, 1 bucket stays full and the others empty
# - repeat until either no collision or local depth = global depth
# - if still collision:
# - resize hash table
# - i.e. global depth is increased but not buckets' local depth
# - i.e. O(1) time complexity
# - then re-insert
#Often implemented using a hash trie (called "directory")
# - in which case, local depth is also number of pointers from trie to bucket
/=+===============================+=\
/ : : \
)==: OPEN ADDRESSING :==(
\ :_______________________________: /
\=+===============================+=/
K-CHOICE HASHING ==> #Using k different probing methods|sequences.
#On insertion, picking the one with the least number of collisions.
LINEAR PROBING ==> #Fixed increment, usually 1
#Stops iteration when going back to same position
#Pros:
# - low secondary clustering
# - high locality of reference
#Cons:
# - high primary clustering
# - probe sequence overlaps on current bucket
ROBIN HOOD HASHING ==> #Similar to linear probing except:
# - on final insertion, the number of iterations performed is stored
# - during iteration, if the new item's number of iterations > current item's stored number,
# swap them
#Goal: even out probe sequence lengths, reducing worst-case time complexity.
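A sketch of Robin Hood insertion (illustrative names): each stored entry remembers its probe distance, and a new entry that has already probed farther steals the slot from a "richer" one, which then keeps probing.

```python
def robin_hood_insert(slots, key, value):
    n = len(slots)
    i, dist = hash(key) % n, 0
    for _ in range(n):
        if slots[i] is None:
            slots[i] = (key, value, dist)      # store probe distance too
            return
        k, v, d = slots[i]
        if k == key:
            slots[i] = (key, value, d)         # key exists: overwrite
            return
        if dist > d:                           # resident probed less than us:
            slots[i] = (key, value, dist)      # swap ("rob the rich"),
            key, value, dist = k, v, d         # displaced entry moves on
        i, dist = (i + 1) % n, dist + 1
    raise RuntimeError("table full: resize needed")
```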
QUADRATIC PROBING ==> #Quadratic polynomial increment, e.g. x²
#Stops iteration when x exceeds the array size
# - probed positions may repeat though:
#    - at least the first ⌈N/2⌉ probes are unique if array size N is prime
#    - all probes are unique if either:
#       - array size is 2ᵐ and the increment is the triangular number x*(x+1)/2
#       - array size is a prime of the form 4m+3, and the increment alternates between +x² and -x²
#    - i.e. important to keep density low
#Pros and cons:
# - in-between linear probing and double hashing
# - probe sequence overlaps on first bucket
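The power-of-two case above can be sketched with triangular increments, which visit every slot exactly once when the table size is 2ᵐ (illustrative names):

```python
def quadratic_probe(h, size, i):
    # i-th probe position for initial hash h; size must be a power of 2
    return (h + i * (i + 1) // 2) % size

size = 8
seq = [quadratic_probe(3, size, i) for i in range(size)]
assert sorted(seq) == list(range(size))  # all slots visited, no repeats
```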
DOUBLE HASHING ==> #Increment is HASHₓ(key), using a series of universal hash functions
#Iteration keeps going until found
#Pros:
# - low primary clustering
# - no probe sequence overlaps
#Cons:
# - high secondary clustering
# - low locality of reference
# - requires extra hash computation
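A sketch of the double-hashing probe sequence: the step size comes from a second hash of the key, so different keys get different sequences (illustrative hash derivation; the array size should be prime so any step covers every slot):

```python
def double_hash_probe(key, size, i):
    h1 = hash(key) % size                       # first bucket
    h2 = 1 + (hash(key) // size) % (size - 1)   # step size, never 0
    return (h1 + i * h2) % size

size = 7  # prime, so gcd(step, size) = 1 and the sequence is a full cycle
seq = [double_hash_probe("some key", size, i) for i in range(size)]
assert sorted(seq) == list(range(size))  # covers every slot, no repeats
```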
CUCKOO HASHING ==> #Similar to double hashing except:
# - HASHₓ(key) is location not increment
# - instead of inserting at the last location found, insert at the first one and push
#   displaced values to their other hash location
# - uses limited number m of hash functions:
# - on loops, must trigger an array resize
# - often divide array in m, with each slice having its own hash function
# - when m is smaller:
# - pro: better worst time complexity
# - con:
# - higher probability of cycles, i.e. density must be smaller
# - m = 2 -> density < 0.5, m = 3 -> density < 0.91
#Comparison with double hashing:
# - worse average time complexity
# - better worst-case time complexity (because resize is triggered more often)
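A sketch of cuckoo hashing with m = 2 hash functions, here one per array slice (illustrative names; `("alt", key)` tuples stand in for a second hash function; exceeding max_kicks stands in for cycle detection triggering a resize):

```python
def cuckoo_insert(t1, t2, key, max_kicks=16):
    for _ in range(max_kicks):
        i = hash(key) % len(t1)
        if t1[i] is None or t1[i] == key:
            t1[i] = key
            return
        key, t1[i] = t1[i], key                # kick resident out of table 1
        j = hash(("alt", key)) % len(t2)       # displaced key's second hash
        if t2[j] is None or t2[j] == key:
            t2[j] = key
            return
        key, t2[j] = t2[j], key                # kick resident out of table 2
    raise RuntimeError("probable cycle: resize and rehash needed")

def cuckoo_search(t1, t2, key):
    # only two possible locations to check: O(1) worst-case lookup
    return (t1[hash(key) % len(t1)] == key
            or t2[hash(("alt", key)) % len(t2)] == key)
```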
COALESCED HASHING ==> #Hybrid with separate chaining
#On collision:
# - uses the first empty bucket on the table
# - its position is stored to avoid linearly searching it each time
# - uses pointers, i.e. forms linked lists inside the table
#"Cellar":
# - optimization where part of the hash table is reserved for collision values
# - pro: faster to find next empty bucket
# - con: must use heuristic to find good ratio, often 0.14
#Deletions are slow
/=+===============================+=\
/ : : \
)==: DISTRIBUTED HASH TABLE :==(
\ :_______________________________: /
\=+===============================+=/
DISTRIBUTED HASH TABLE (DHT) ==> #Hash table distributed over several nodes:
# - nodes communicate using an "overlay network"
# - keyspace is split among nodes using a "keyspace partitioning" scheme
#Goals: scalability, decentralization, fault tolerance
CONSISTENT HASHING ==> #Keyspace partitioning scheme, where:
# - each node gets assigned as value a random position ("token") in keyspace
# - each key belongs to the node with the closest greater|lesser node value
#Also called "stable hashing"
#"Weighting":
# - assigning several values per node, to ensure better spread
#Resizing:
# - on node removal, its keys are reassigned
# - on node insertion, its keys are calculated, reassigned from other nodes
# - i.e. on average, node removal|insertion moves N/n keys, where N is the number
#   of keys and n the number of nodes
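The token-ring lookup above can be sketched with a sorted token list and binary search (illustrative node names and keyspace size; "closest greater" node wins, wrapping around the ring):

```python
import bisect

KEYSPACE = 1024
nodes = ["node-a", "node-b", "node-c"]
# each node gets a pseudo-random token position on the ring
ring = sorted((hash(("token", n)) % KEYSPACE, n) for n in nodes)
tokens = [t for t, _ in ring]

def node_for(key):
    h = hash(key) % KEYSPACE
    i = bisect.bisect(tokens, h) % len(tokens)  # wrap past the last token
    return ring[i][1]
```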
RENDEZVOUS HASHING ==> #Keyspace partitioning scheme, where:
# - each node gets assigned a random value
# - total order over all nodes can also be set in case of score ties
# - each key belongs to the node with the lowest HASH(key, node_value) ("score"/"weight")
# - can also pick the n lowest scores, to assign several nodes per key
#Resizing is similar to consistent hashing
#Pro:
# - more distributed, less clustered keyspace
#Con:
# - need to iterate over each node, instead of just the closest one
# - i.e. O(n) time complexity where n is number of nodes
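A sketch of rendezvous (highest-random-weight) hashing: every node is scored against the key and the best score wins, so each lookup iterates over all n nodes (illustrative names; per the notes, the lowest score wins):

```python
nodes = ["node-a", "node-b", "node-c"]

def owner(key):
    # score every node against the key; lowest HASH(key, node) wins
    return min(nodes, key=lambda n: hash((key, n)))
```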
/=+===============================+=\
/ : : \
)==: COMBINATIONS :==(
\ :_______________________________: /
\=+===============================+=/
HASH TRIE ==> #Hash table where hashes are stored in a trie.
#Goal:
# - slightly worse time complexity than a plain hash table
# - but much lower space complexity
# - better suited as a persistent data structure
#"Hash array mapped trie" (HAMT): when using array mapped trie
/=+===============================+=\
/ : : \
)==: BLOOM FILTER :==(
\ :_______________________________: /
\=+===============================+=/
BLOOM FILTER ==> #Implementation for a set, using a probabilistic data structure
#How:
# - big single array of m bits
# - each value's hash is OR'd into the bit array
#    - hashes must have a constant number k of 1-bits
#       - can do it by using k hash functions that each map to [0, m-1]
#False positives are possible (but no false negatives), with probability p
#Operations:
# - cannot iterate|retrieve values (because they are hashed)
# - intersection (AND of two filters) increases p, union (OR) does not
# - remove():
# - not available
# - can emulate by using a second bloom filter for elements that have been removed
# - but this introduces possible false negatives
# - length():
# - not available
#    - can approximate with -m/k * ln(1 - X/m), where X is the number of bits set
#Pros:
# - space complexity: m, i.e. O(n), with very small amount of bits per n
# - time complexity: O(k)
#Cons:
# - limited operations
# - false positives
#If the set of possible elements is known, a plain bit array is more efficient
#Parameters:
# - p = (1-(1-1/m)ᵏⁿ)ᵏ
# - optimal k = m/n * ln(2)
# - all following assume optimal k
# - p = 2^(-m/n * ln(2))
#    - p increases exponentially with n, decreases exponentially with m
#    - p is roughly halved each time m/n increases by 1.44 (= 1/ln(2))
# - m = -n * log₂(p) / ln(2)
#    - for a given p, m increases linearly with n
# - m/n = -log₂(p) / ln(2)
#    - 9.6 bits per element for 1% false positives
#    - adding around 4.8 bits per element divides p by 10, e.g. 19.2 bits per element for 0.01% false positives
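A minimal Bloom filter sketch with m bits and k hash functions (the k functions are derived here from salted hashes, an illustrative choice; names are not from the text):

```python
class BloomFilter:
    def __init__(self, m=64, k=4):
        self.m, self.k = m, k
        self.bits = 0      # m-bit array packed into an int

    def _positions(self, value):
        # k hash functions, each mapping the value to [0, m-1]
        return [hash((i, value)) % self.m for i in range(self.k)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos          # OR the k bits into the array

    def __contains__(self, value):
        # all k bits set: "possibly present"; any unset: definitely absent
        return all(self.bits >> pos & 1 for pos in self._positions(value))

bf = BloomFilter()
bf.add("hello")
assert "hello" in bf   # no false negatives
```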
COUNTING FILTER ==> #Like bloom filter except:
# - instead of flagging individual bits, increments a counter of c bits
#    - higher c gives a higher max count before overflow
#I.e. new operations are available:
# - remove() becomes available by decreasing the counters
#    - even after removal, the element might still be reported as a false positive though
# - length() by summing all counters and dividing by k
#However space complexity is higher: m*c, i.e. O(n*c)
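A sketch of the counting variant (illustrative names; positions come from k salted hashes): the bit array becomes an array of counters, so remove() can decrement and length() can be approximated by the counter sum over k.

```python
class CountingFilter:
    def __init__(self, m=64, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m   # counters instead of single bits

    def _positions(self, value):
        return [hash((i, value)) % self.m for i in range(self.k)]

    def add(self, value):
        for pos in self._positions(value):
            self.counters[pos] += 1

    def remove(self, value):
        for pos in self._positions(value):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def __contains__(self, value):
        return all(self.counters[pos] for pos in self._positions(value))

    def approx_len(self):
        # each element increments k counters, so sum/k estimates the size
        return sum(self.counters) // self.k
```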