Hash Table

See Hash Map

Python Implementation

Python's Dictionary Implementation: Being All Things to All People Beautiful Code

Dictionaries are represented by a C structure, PyDictObject, defined in Include/dictobject.h.

Here’s a schematic of the structure representing a small dictionary mapping “aa”, “bb”, “cc”, …, “mm” to the integers 1 to 13:

int ma_fill         13
int ma_used         13
int ma_mask         31
PyDictEntry ma_table[]:
[0]: aa, 1                  hash(aa) == -1549758592, -1549758592 & 31 = 0
[1]: ii, 9                  hash(ii) == -1500461680, -1500461680 & 31 = 16
[2]: null, null
[3]: null, null
[4]: null, null
[5]: jj, 10                 hash(jj) == 653184214, 653184214 & 31 = 22
[6]: bb, 2                  hash(bb) == 603887302, 603887302 & 31 = 6
[7]: null, null
[8]: cc, 3                  hash(cc) == -1537434360, -1537434360 & 31 = 8
[9]: null, null
[10]: dd, 4                 hash(dd) == 616211530, 616211530 & 31 = 10
[11]: null, null
[12]: null, null
[13]: null, null
[14]: null, null
[15]: null, null
[16]: gg, 7                 hash(gg) == -1512785904, -1512785904 & 31 = 16
[17]: ee, 5                 hash(ee) == -1525110136, -1525110136 & 31 = 8
[18]: hh, 8                 hash(hh) == 640859986, 640859986 & 31 = 18
[19]: null, null
[20]: null, null
[21]: kk, 11                hash(kk) == -1488137240, -1488137240 & 31 = 8
[22]: ff, 6                 hash(ff) == 628535766, 628535766 & 31 = 22
[23]: null, null
[24]: null, null
[25]: null, null
[26]: null, null
[27]: null, null
[28]: null, null
[29]: ll, 12                hash(ll) == 665508394, 665508394 & 31 = 10
[30]: mm, 13                hash(mm) == -1475813016, -1475813016 & 31 = 8
[31]: null, null

Load Factor

The performance of most collision resolution methods does not depend directly on the number n of stored entries, but depends strongly on the table's load factor, the ratio n/s between n and the size s of its bucket array. With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases.

On the other hand, as the load factor approaches zero, the size of the hash table increases with little improvement in the search cost, and memory is wasted.

Hash function

At the heart of the hash table algorithm is a simple array of items; this is often simply called the hash table. Hash table algorithms calculate an index from the data item's key and use this index to place the data into the array. The implementation of this calculation is the hash function, f:

index = f(key, arrayLength)

The hash function calculates an index within the array from the data key. arrayLength is the size of the array.

A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions, and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically by the chi-squared test.

Perfect hash function

If all of the keys that will be used are known ahead of time, a perfect hash function can be used to create a perfect hash table, in which there will be no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well.

Perfect hashing allows for constant time lookups in the worst case.

Collision resolution

Collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2500 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday paradox there is a 95% chance of at least two of the keys being hashed to the same slot.

Therefore, most hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.

Separate chaining

Separate Chaining

Open addressing

Open Addressing

Hash collision resolved by open addressing with linear probing (interval=1). Note that “Ted Baker” has a unique hash, but nevertheless collided with “Sandra Dee” which had previously collided with “John Smith”.

Implementations

Collision Resolution Load Factor
Java Separate chaining 3/4
Python Open addressing 2/3

The main reason for Python to choose an “open addressing” implementation is that many basic language mechanism are based on hash maps (namespaces, classes, etc). In every python program many hash maps are continuously created and updated. The “open addressing” implementation reduce a lot the need for dynamic memory allocation. In case of a collision, the entry is stored anyway in the hash main table, without the need to allocate and link a new element of a linked list.

algorithms/data_structures/hash_table.txt · Last modified: 2010/08/10 (external edit)
CC Attribution-Noncommercial-Share Alike 3.0 Unported
Valid CSS Driven by DokuWiki Recent changes RSS feed Valid XHTML 1.0