See Hash Map
Python's Dictionary Implementation: Being All Things to All People Beautiful Code
Dictionaries are represented by a C structure, PyDictObject, defined in Include/dictobject.h.
Here’s a schematic of the structure representing a small dictionary mapping “aa”, “bb”, “cc”, …, “mm” to the integers 1 to 13:
int ma_fill 13 int ma_used 13 int ma_mask 31 PyDictEntry ma_table[]: [0]: aa, 1 hash(aa) == -1549758592, -1549758592 & 31 = 0 [1]: ii, 9 hash(ii) == -1500461680, -1500461680 & 31 = 16 [2]: null, null [3]: null, null [4]: null, null [5]: jj, 10 hash(jj) == 653184214, 653184214 & 31 = 22 [6]: bb, 2 hash(bb) == 603887302, 603887302 & 31 = 6 [7]: null, null [8]: cc, 3 hash(cc) == -1537434360, -1537434360 & 31 = 8 [9]: null, null [10]: dd, 4 hash(dd) == 616211530, 616211530 & 31 = 10 [11]: null, null [12]: null, null [13]: null, null [14]: null, null [15]: null, null [16]: gg, 7 hash(gg) == -1512785904, -1512785904 & 31 = 16 [17]: ee, 5 hash(ee) == -1525110136, -1525110136 & 31 = 8 [18]: hh, 8 hash(hh) == 640859986, 640859986 & 31 = 18 [19]: null, null [20]: null, null [21]: kk, 11 hash(kk) == -1488137240, -1488137240 & 31 = 8 [22]: ff, 6 hash(ff) == 628535766, 628535766 & 31 = 22 [23]: null, null [24]: null, null [25]: null, null [26]: null, null [27]: null, null [28]: null, null [29]: ll, 12 hash(ll) == 665508394, 665508394 & 31 = 10 [30]: mm, 13 hash(mm) == -1475813016, -1475813016 & 31 = 8 [31]: null, null
The performance of most collision resolution methods does not depend directly on the number n of stored entries, but depends strongly on the table's load factor, the ratio n/s between n and the size s of its bucket array. With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases.
On the other hand, as the load factor approaches zero, the size of the hash table increases with little improvement in the search cost, and memory is wasted.
At the heart of the hash table algorithm is a simple array of items; this is often simply called the hash table. Hash table algorithms calculate an index from the data item's key and use this index to place the data into the array. The implementation of this calculation is the hash function, f:
index = f(key, arrayLength)
The hash function calculates an index within the array from the data key. arrayLength is the size of the array.
A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions, and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically by the chi-squared test.
If all of the keys that will be used are known ahead of time, a perfect hash function can be used to create a perfect hash table, in which there will be no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well.
Perfect hashing allows for constant time lookups in the worst case.
Collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2500 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday paradox there is a 95% chance of at least two of the keys being hashed to the same slot.
Therefore, most hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.
Hash collision resolved by open addressing with linear probing (interval=1). Note that “Ted Baker” has a unique hash, but nevertheless collided with “Sandra Dee” which had previously collided with “John Smith”.
| Collision Resolution | Load Factor | |
|---|---|---|
| Java | Separate chaining | 3/4 |
| Python | Open addressing | 2/3 |
The main reason for Python to choose an “open addressing” implementation is that many basic language mechanism are based on hash maps (namespaces, classes, etc). In every python program many hash maps are continuously created and updated. The “open addressing” implementation reduce a lot the need for dynamic memory allocation. In case of a collision, the entry is stored anyway in the hash main table, without the need to allocate and link a new element of a linked list.