Thursday, February 16, 2012

Data Structure : Hash Table and Hash Function


Hash Table is a data structure that offers very fast insertion and searching. If data is distributed evenly in Hash Table, its insertion and searching time is close to constant. Size of the data does not affect its performance. Hash Table is so fast that it is used by computer and many applications for searching and insertion in large volume data, like spell checker. It is faster than a Tree and relatively easier to implement.

Hash table does have some disadvantages. Hash table is implemented using arrays and arrays are difficult to expand once it is created. Its performance degrades badly if it is too full. Also there is no convenient way to visit data in hash table in any order.

As I have said, Hash table is easy to implement, it is similar to array, get the index(hash) of the data then access the data at the index in array. But generating index or hash, both is similar term in Hash Table, is critical task. So we will discuss that.

Hashing
Hashing is transformation of sequence of characters into shorter fixed size key (Hash), which will represent the data in data set. Hash is used to index and retrieve data in hash table and database. Hashing is one way operation i.e. original string can not be generated back from its hash. Hashing should be referentially transparent i.e. it should always generate same hash key for same input. An ideal hash function should not generate same hash key for different inputs. If it is generating same hash key for different inputs, it is called collision. A hash function with low risk of collision is most preferred.  The hash key is used as index number to insert data in array.

Risk of collision increases as a large set of data is squeezed into smaller hash table. It
may appear that possibility of collision renders hashing scheme impractical, but there is several workaround. We will discuss Open Addressing and Separate chaining.

Open addressing
If any data is hashed to an index which is already occupied, search for empty cell in the array in systematic way, so the data will be stored in the array only. The number of steps required to find next vacant length is called Probe Length. We will discuss three ways of open addressing, which is used to find the next vacant cell.

Linear Probing
Linear probing searches for vacant cell sequentially. If index 10 is occupied, it will go to 11 and then 12, and so on, incrementing the index until it finds a vacant cell. The "sequence of filled cells" grows larger as it keep adding new data next to it and this forms cluster. This clustering can result in long probe length hence degrading the Hash Table performance.

Quadratic Probing
We saw that clustering can occur in Linear probing approach to open addressing. Once a cluster forms it tends to grow larger. Items that hash to any value in the range of cluster will step along and insert themselves at the end of cluster, thus making it even bigger. Quadratic probing is an attempt to keep clusters from forming by probing more widely separated cells. If the hash value is X, probe start from X+1 then X+4 and then X+9 and so on, increasing the hash value by square of step number.
STEP 1 – X+1^2
STEP 2 – X+ 2^2
STEP 3 – X+3^2
:
STEP N - X+N^2

Quadratic probe does eliminate the clustering problem that we saw with Linear probing, but it suffers from different clustering problem. This clustering happens because all the items which hash to a particular cell follow the same sequence in trying to find the vacant cell.

Double hashing
To eliminate both type of clustering, there is an approach: Double Hashing. Probe sequence is generated based on item, rather than same for every item. So, items that hash to same index number have different probe sequence. Second hash function which generates probe sequence should not be same as hash function used to generate index. Second hash function should never output 0; otherwise there would be no step.

Separate chaining
In open chaining, collision is handled by looking for vacant cell in same array.  Another approach is to insert linked list at each index. Items are hashed in usual way to an index and the item is inserted in the linked list at that index. If another item is hashed to the same index, the new item will also be inserted at the end of the linked list present at that index.

No comments:

Post a Comment