Hash Table is a data structure that offers very fast
insertion and searching. If data is distributed evenly in Hash Table, its
insertion and searching time is close to constant. Size of the data does not
affect its performance. Hash Table is so fast that it is used by computer and
many applications for searching and insertion in large volume data, like spell
checker. It is faster than a Tree and relatively easier to implement.
Hash table does have some disadvantages. Hash table is
implemented using arrays and arrays are difficult to expand once it is created.
Its performance degrades badly if it is too full. Also there is no convenient
way to visit data in hash table in any order.
As I have said, Hash table is easy to implement, it is similar to array, get the index(hash) of the data then access the data at the index in array. But generating index or hash, both is similar term in Hash Table, is critical task. So we will discuss that.
Hashing
Hashing is transformation of sequence of characters into
shorter fixed size key (Hash), which will represent the data in data set. Hash
is used to index and retrieve data in hash table and database. Hashing is one
way operation i.e. original string can not be generated back from its hash.
Hashing should be referentially transparent i.e. it should always generate same
hash key for same input. An ideal hash function should not generate same hash
key for different inputs. If it is generating same hash key for different
inputs, it is called collision. A hash function with low risk of collision is
most preferred. The hash key is used as
index number to insert data in array.
Risk of collision increases as a large set of data is
squeezed into smaller hash table. It
may appear that possibility of collision renders hashing
scheme impractical, but there is several workaround. We will discuss Open
Addressing and Separate chaining.
Open addressing
If any data is hashed to an index which is already occupied,
search for empty cell in the array in systematic way, so the data will be stored in the
array only. The number of steps required to find next vacant length is called Probe Length. We will discuss three ways
of open addressing, which is used to find the next vacant cell.
Linear Probing
Linear probing searches for vacant cell sequentially. If
index 10 is occupied, it will go to 11 and then 12, and so on, incrementing the
index until it finds a vacant cell. The "sequence of filled cells" grows larger
as it keep adding new data next to it and this forms cluster. This clustering
can result in long probe length hence degrading the Hash Table performance.
Quadratic Probing
We saw that clustering can occur in Linear probing approach
to open addressing. Once a cluster forms it tends to grow larger. Items that
hash to any value in the range of cluster will step along and insert themselves
at the end of cluster, thus making it even bigger. Quadratic probing is an
attempt to keep clusters from forming by probing more widely separated cells.
If the hash value is X, probe start from X+1 then X+4 and then X+9 and so on,
increasing the hash value by square of step number.
STEP 1 – X+1^2
STEP 2 – X+ 2^2
STEP 3 – X+3^2
:
STEP N - X+N^2
Quadratic probe does eliminate the clustering problem that
we saw with Linear probing, but it suffers from different clustering problem.
This clustering happens because all the items which hash to a particular cell
follow the same sequence in trying to find the vacant cell.
Double hashing
To eliminate both type of clustering, there is an approach:
Double Hashing. Probe sequence is generated based on item, rather than same for
every item. So, items that hash to same index number have different probe
sequence. Second hash function which generates probe sequence should not be
same as hash function used to generate index. Second hash function should never
output 0; otherwise there would be no step.
Separate chaining
In open chaining, collision is handled by looking for vacant
cell in same array. Another approach is
to insert linked list at each index. Items are hashed in usual way to an index
and the item is inserted in the linked list at that index. If another item is
hashed to the same index, the new item will also be inserted at the end of the
linked list present at that index.