Hash Function
Hash functions are a very important topic in computer programming, so I am going to try to learn more about them.
References
Definitions
- Linear Probing
- Scheme in computer programming for resolving collisions in hash tables
- When a collision is found, linear probing searches adjacent cells for the matching key or an empty cell.
- Quadratic Probing
- Scheme in computer programming for resolving collisions in hash tables
- Operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial until an open slot is found.
- Incurs less clustering than linear probing
- Open Addressing
- Method of collision resolution in hash tables. With this method, a hash collision is resolved by probing, or searching through alternative locations in the array until either the target record is found, or an unused array slot is found, which indicates there is no such key in the table.
- Double Hashing
- Computer programming technique used in conjunction with open addressing in hash tables to resolve hash collisions, by using a secondary hash of the key as an offset when a collision occurs.
- Bloom filter
- A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.
- Associative arrays
- An associative array, map, symbol table, or dictionary is an abstract data type that stores a collection of (key, value) pairs, such that each possible key appears at most once in the collection.
- The dictionary problem is the classic problem of designing efficient data structures that implement associative arrays. The two major solutions to the dictionary problem are hash tables and search trees.
- Geometric Hashing
- Geometric hashing is a method for efficiently finding two-dimensional objects represented by discrete points that have undergone affine transformations, though extensions exist to other object representations and transformations.
- In Euclidean Geometry, an affinity is a geometric transformation that preserves lines and parallelism, but not necessarily Euclidean distances and angles.
- Dynamic Sets
- A set that supports the insertion and/or deletion of objects.
- The Birthday Problem
- In probability theory, the birthday problem asks for the probability that, in a set of n random chosen people, at least two will share a birthday. The birthday paradox refers to the counterintuitive fact that only 23 people are needed for the probability to exceed 50%
Hash Function Notes
A hash function is any function that can be used to map data or arbitrary size to fixed size values, although there are some hash functions that support variable length output. The values returned by a hash function are sometimes called hash values, hash codes, hash digests, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash function is called hashing or scatter storage addressing.
- Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval.
- They require an amount of storage space only factionally greater than the total space required for the data or records themselves. Hashing is a computationally and storage space-efficient form of data access that avoids the non-constant access time of ordered and unordered lists and structured trees, and the often exponential storage requirements of direct access of state spaces of large or variable-length keys.
- Use of hash functions relies on statistical properties of key and function interaction: worst-case behavior is intolerably bas but rare, and average-case behavior can be nearly optimal.
Overview
- In a hash table, a hash function takes a key as an input, which is associated with a datum or record and used to identify it to the data storage and retrieval application.
- The keys may be fixed length, like an integer, or variable length, like a string. In some cases, the key is the datum itself.
- The output is a hash code used to index a hash table holding the data or records, or pointers to them.
- A hash function may be considered to perform three functions:
- Convert variable length keys into fixed length (usually machine word length or less) values, by folding them by words or other units using a party-preserving operator like ADD or XOR
- Scramble the bits of the key so that the resulting values are uniformly distributed over the "keyspace"
- Map the key values into ones less than or equal to the size of the table
- A good hash function:
- Should be very fast to compute
- Should minimize duplication of output values (collisions)
- Hash functions rely on generating favorable probability distributions for their effectiveness, reducing access time to nearly constant
- A necessary adjunct to the hash function is a collision-resolution method that employs an auxiliary data structure like linked lists, or systematic probing of the table to find an empty slot
Hash Tables
- Hash functions are used in conjunction with hash tables to store and retrieve data items or data records.
The hash function translates the key associated with each datum or record into a hash code, which is used to index the hash table. When an item is to be added to the table, the hash code may index an empty slot (also called a bucket), in which case the item is added to the table there. If the hash code indexes a full slot, some kind of collision resolution is required: the new item may be omitted (not added to the table), or replace the old item, or it can be added to the table in some other location by a specified procedure.
- That procedure pends on the structure of the hash table:
- In chained hashing, each slot is the head of a linked list or chain, and items that collide at the slot are added to the chain.
- In open address hashing, the table is probed starting from the occupied slot in a specified manner, usually by linear probing, quadratic probing, or double hashing until an open slot is located or the entire table is probed (overflow)
Specialized Uses
- Hash functions are also used to build caches for large data sets stored in slow media. A cache is generally simpler than a hashed search table since any collision can be resolved by discarding or writing back the older of the two colliding items.
- Hash functions are an essential ingredient of the Bloom filter.
- A special case of hashing is known as geometric hashing or the grid method.
- Hash tables are used to implement associated arrays and dynamic sets
Properties
Uniformity
- A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability.
- If the second sentence does not hold true, then the number of collisions, and thus, the lookup operation time, increases.
- The criterion only requires the value to be uniformly distributed, not random in any sense
- If a typical set of m records is hashed to n table slots, the probability of a bucket receiving more than m/n records should be vanishingly small.
- A hash function is said to be perfect if id it can be found that it achieves absolute (or collision-less) uniformity.
Testing and Measurement
- When testing a hash function, the uniformity of hash values can be evaluated by the chi-squared test. This test is a goodness-of-fit measure: it's the actual distribution of items in buckets versus the expected (or uniform) distribution of items:
- A ratio within one confidence interval (0.95-1.05) is indicative that the hash function evaluated has an expected uniform distribution.
- Hash functions can have some technical properties that make it more likely that they'll have a uniform distribution when applied. One is the strict avalanche criterion: whenever a single input is complemented, each of the output bits changes with a 50% probability.
- If the output bits are too resistant to change, then the keys would become clustered around those values.
- If the output bits are changed too readily, the mapping is approaching a fixed XOR function of a single bit.
Efficiency
- In data storage and retrieval applications, the use of a hash function is a trade-off between the search time and data storage space.
- If search time were unbounded, a very compact unordered linear list would be the best medium
- If the storage space were unbounded, a randomly accessible structure indexable by the key-value would be very large, very sparse, but very fast.
- It's usually preferable to choose a faster hash function over one that needs more computation but saves a few collisions.
Universality
- A universal hashing scheme is a randomized algorithm that selects a hashing function h among a family of such functions, in such a way that the probability of a collision of any two distinct keys is , where m is the number of distinct hash values desired - independently of two keys.
Applicability
- A hash function that allows only certain table sizes, strings only up to a certain length, or can't accept a seed (allow double hashing) isn't as useful as one that does.
Deterministic
- A hash procedure must be deterministic - meaning that for a given input value it must always generate the same hash value. It must be a function, in the mathematical sense, of the data to be hashed.
Defined Range
- It is often desirable that the output of a hash function have a fixed size.
Variable Range
- In many applications, the range of hash values may be different for each run of the program or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function which takes two parameters - the input data z, and the number n of allowed hash values.
Dynamic Hash Function
- When the hash function is used to store values in a hash table that outlives the run of the program, and the hash table needs to be expanded or shrunk, the hash table is referred to as a dynamic hash table.
Data Normalization
- In some applications, the input data may contain features that are irrelevant for comparison purposed. For such data, one must use a hash function that is compatible with the data equivalence criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value.
Hashing Integer Data Types
There are several common algorithms for hashing integers. The method giving the best distribution is data-dependent. One of the simplest and most common methods in practice is the modulo division method.
- Identity hash function
- If the data to be hashed is small enough, one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of composing this identity hash function is effectively zero. This hash function is perfect, as it maps each input to a distinct hash value.
- In mathematics, an identify function is a function that always returns the value that was used as its argument, unchanged.
- Trivial Hash function
- If the keys are uniformly or sufficiently uniformly distributed over the key space, so that the key values are essentially random, they may be considered to be already 'hashed'.
- Folding
- A folding hash code is produced by dividing the input into sections of m bits, where 2m is the table size, and using a parity preserving bitwise operation such as
ADD
orXOR
to combine the sections, followed by a mask or shifts to trim off any excess bit at the high or low end.
- A folding hash code is produced by dividing the input into sections of m bits, where 2m is the table size, and using a parity preserving bitwise operation such as
- Mid-squares
- A mid-squares hash code is produced by squaring the input and extracting an appropriate number of middle digits or bits. The mid-squares method produces a reasonable hash code if there is not a lot of leading or trailing zeros in the key.
- Division hashing
- A standard technique is to use a modulo function on the key, by selecting a divisor M, which is a prime number close to the table size, so . The table size if usually a power of 2. This gives a distribution from . This gives good results over a large number of keys.
- Algebraic Coding
- Algebraic coding is a variant of the division method of hashing which uses division by a polynomial modulo 2 instead of an integer to map n bits to m bits.
- Unique Permutation Hashing
- Has a guaranteed best worst-case insertion time.
- Multiplicative Hashing
- Fibonacci Hashing
- Zobrist Hashing
- Customized Hash Function
Hashing Variable Length Data
- When the data values are long or variable length character stings, their distribution is usually very uneven, with complicated dependencies. For such data, it is prudent to use a hash function that depends on all characters of the string - and depends on each character in a different way.
- Middle and ends
- Character folding
- Word length folding
- Radix conversion hashing
- Rolling hash
- Fuzzy hash
- Perceptual hash
Analysis
Worst case result for a hash function can be accessed in two ways: theoretical and practical. Theoretical worst case is the probability that all keys map to a single slot. Practical worst case is expected longest probe sequence. This analysis considers uniform hashing characteristic of universal hash functions.
History
- The term hash offers a natural analogy with its non-technical meaning (to chop up or make a mess out of something), given how hash functions scramble their input data to derive their output. The term hash began to be used in the late '60s.