# Hash and hash bucket

## Concept

Hashing is a data structure technique commonly used for searching. Given a set of elements, the usual search approaches are:
1. Traversal. A simple brute-force approach with no special requirements on the collection. The time complexity is O(N), so it is usually slow.
2. Binary search. Requires the elements to be sorted. The time complexity is O(logN).
3. Binary search tree. The search rule is simple, but the tree can degenerate into a single-branch tree, in which case the time complexity is O(N).
4. AVL tree and red-black tree. The time complexity is O(logN).

All of the above common methods, without exception, have a key step: element comparison. By comparing the element to be found with the element in the set, the position of the element in the set can be determined. The real efficiency depends on the number of comparisons.

If an element could be located without any element comparisons, searching would reach its ideal efficiency.

Hashing is such a technique.

When storing an element, a function maps the element to its storage location, so a search can compute the location directly instead of comparing against other elements. This approach is called hashing. The structure that stores the elements is called a hash table, and the function that establishes the mapping is called a hash function. It computes the position at which an element is inserted into the hash table.

For example:

Element set: 1 6 7 5 4 9

Hash function: F(x) = x % capacity, where capacity is the number of positions in the table (10 in this example)

When inserting an element, compute its position in the hash table with the hash function. For example, 1 % 10 = 1, so 1 is placed at position 1.

When finding an element, calculate the position of the element in the table through the function, and detect whether the element stored in the position is the element to be found.
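The insert and find steps above can be sketched as follows. This is a minimal illustration rather than a complete hash table: the class name `SimpleTable` and its members are hypothetical, keys are assumed to be unsigned integers, and collisions are deliberately ignored (a later key simply overwrites an earlier one).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration type: a fixed-size table of unsigned keys.
// It deliberately ignores collisions (a later key overwrites an earlier one).
class SimpleTable {
public:
    explicit SimpleTable(std::size_t capacity)
        : keys_(capacity, 0), used_(capacity, false) {
        assert(capacity > 0);  // F(x) = x % capacity needs a non-zero capacity
    }

    // The hash function from the example: F(x) = x % capacity.
    std::size_t Position(unsigned key) const { return key % keys_.size(); }

    void Insert(unsigned key) {
        const std::size_t pos = Position(key);
        keys_[pos] = key;
        used_[pos] = true;
    }

    // One address computation plus one equality check, no scan of the table.
    bool Contains(unsigned key) const {
        const std::size_t pos = Position(key);
        return used_[pos] && keys_[pos] == key;
    }

private:
    std::vector<unsigned> keys_;
    std::vector<bool> used_;
};
```

With a capacity of 10 and the keys 1 6 7 5 4 9 inserted, `Contains(7)` inspects only position 7, and `Contains(3)` fails immediately because position 3 was never used.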

## Hash Collisions

Suppose element 11 is added. The hash function gives 11 % 10 = 1, so it should be inserted at position 1, but that position already holds an element. When different elements are mapped to the same hash address by the same hash function, the situation is called a hash collision (or hash conflict).

## Design of hash function

If the hash function is poorly designed, the hash addresses of different elements cluster in a narrow address range, which makes hash collisions very likely. A hash function should therefore meet the following requirements:
1. Its domain must include every key to be stored, and if the hash table has m addresses, its range must lie in [0, m).
2. The addresses it computes should be evenly distributed over the hash table.
3. It should be as simple as possible.

Common methods:
Direct addressing method. A linear function of the key is used as the hash address: Hash = A*Key + B. The function is simple and spreads keys evenly, but the distribution of the keys must be known in advance, so it suits small, contiguous key sets. For example, to find the character that appears only once in a string, the ASCII code of each character can be used directly as the hash address, that is, as an array subscript, and the array counts the occurrences.
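The string example can be sketched like this, using Hash = 1*Key + 0 over the ASCII codes. The function name `FirstUniqueChar` is hypothetical, and the sketch assumes the input contains only 7-bit ASCII characters.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Returns the index of the first character that occurs exactly once in s,
// or -1 if there is no such character. Assumes s holds only 7-bit ASCII.
int FirstUniqueChar(const std::string& s) {
    std::array<int, 128> count{};  // one counter per ASCII code, zero-initialised
    for (char ch : s) {
        const unsigned char code = static_cast<unsigned char>(ch);
        assert(code < 128);  // direct addressing needs every key inside the table
        ++count[code];       // Hash = 1 * Key + 0
    }
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (count[static_cast<unsigned char>(s[i])] == 1) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

No comparison between keys is needed: each character goes straight to its own counter, which is only practical because the key range (128 codes) is small and contiguous.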

Division method (divide and keep the remainder). Let the number of addresses in the hash table be m, and choose a prime p that is not greater than m but as close to m as possible. The hash address is then Hash = key % p. This is a very common way of computing hash addresses.
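A small sketch of the division method, assuming unsigned integer keys. The helper names `IsPrime`, `LargestPrimeNotAbove` and `DivisionHash` are made up for this illustration, and the primality test is plain trial division, which is fine for table-sized numbers but not meant as an efficient general algorithm.

```cpp
#include <cassert>
#include <cstddef>

// Trial division; written as d <= n / d to avoid overflow in d * d.
bool IsPrime(std::size_t n) {
    if (n < 2) return false;
    for (std::size_t d = 2; d <= n / d; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Largest prime p with p <= m. Requires m >= 2.
std::size_t LargestPrimeNotAbove(std::size_t m) {
    assert(m >= 2);
    while (!IsPrime(m)) --m;
    return m;
}

// Hash = key % p
std::size_t DivisionHash(std::size_t key, std::size_t p) {
    assert(p > 0);
    return key % p;
}
```

For a table with m = 100 addresses, `LargestPrimeNotAbove(100)` gives 97, and `DivisionHash(1234, 97)` gives 70.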

Mid-square method. Square the key and take the middle digits of the result as the hash address. For example, if the key is 1234, its square is 1522756, and the middle three digits, 227, are taken as the hash address. If the key is 5678, its square is 32239684, and the middle three digits 239 or 396 can be taken as the hash address. The mid-square method is suitable when the distribution of the keys is unknown and the number of keys is not very large.
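The mid-square computation can be sketched as below. `MidSquareHash` is a hypothetical helper that works on the decimal digits of the square. When the middle window is ambiguous (an even number of digits with an odd width), this sketch picks the left-leaning window, which is only one of the two valid choices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Middle `width` decimal digits of key * key.
// If the square has no more than `width` digits, the square itself is returned.
std::uint64_t MidSquareHash(std::uint32_t key, std::size_t width) {
    assert(width > 0);
    const std::uint64_t square = static_cast<std::uint64_t>(key) * key;
    const std::string digits = std::to_string(square);
    if (digits.size() <= width) return square;
    const std::size_t start = (digits.size() - width) / 2;  // left-leaning middle
    return std::stoull(digits.substr(start, width));
}
```

`MidSquareHash(1234, 3)` yields 227 and `MidSquareHash(5678, 3)` yields 239. In a real table the result would still be reduced to the table's address range.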

Folding method. Split the key from left to right into several parts with the same number of digits (the last part may be shorter). For example, the key 12345678 can be split into 123456 and 78. Add the parts together and, according to the capacity of the hash table, take the last few digits of the sum as the hash address. It is suitable for keys with many digits.
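A sketch of the folding method on decimal digits. `FoldHash` and its parameters are hypothetical names, and "taking the last few digits" is modelled here as a remainder by `capacity`, which matches the description exactly only when the capacity is a power of ten.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Splits the decimal digits of key into groups of part_digits (the last
// group may be shorter), adds the groups, and reduces the sum modulo capacity.
std::uint64_t FoldHash(std::uint64_t key, std::size_t part_digits,
                       std::uint64_t capacity) {
    assert(part_digits > 0 && capacity > 0);
    const std::string digits = std::to_string(key);
    std::uint64_t sum = 0;
    for (std::size_t pos = 0; pos < digits.size(); pos += part_digits) {
        sum += std::stoull(digits.substr(pos, part_digits));
    }
    return sum % capacity;
}
```

With the key 12345678 split into 123456 and 78, the sum is 123534, so `FoldHash(12345678, 6, 1000)` keeps the last three digits, 534.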

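Random number method. The value of a random function applied to the key is used as the hash address: Hash = random(Key). The function must return the same value every time it is given the same key, otherwise stored elements could not be found again. It is suitable for keys of different lengths.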

Digit analysis method. Suppose there are n keys with the same number of digits, and each digit position can take r different symbols. The symbols do not necessarily occur with the same frequency in every position: some positions are evenly distributed, while in others a few symbols dominate. According to the size of the hash table, choose the positions whose symbols are most evenly distributed as the hash address. Take a set of mobile phone numbers as an example: the first three digits repeat heavily and are unevenly distributed, while the last four digits are distributed uniformly.

A well-designed hash function can only reduce the probability of hash collisions. No matter how carefully it is designed, it cannot avoid them completely.

## Resolution of hash conflicts

##### Closed hash

Continuing the example above, suppose element 54 is to be inserted into the hash table. The hash function gives 54 % 10 = 4, so it should be inserted at position 4, but position 4 already holds an element. Closed hashing (also known as open addressing) handles this by checking the positions after 4 one by one, looking for an empty position, and inserting the element into the first one found. If the end of the table is reached without finding one, the search wraps around and continues from the head of the table.
In a closed hash, a state flag is added to each position in the hash table:
1. EMPTY indicates that the position is empty.
2. EXIST indicates that the position holds a valid element.
3. DELETE indicates that the element at the position has been deleted.

When inserting, compute the address with the hash function and check whether there is a hash collision at that position. If not, insert directly and set the state of the position to EXIST. When deleting, compute the position with the hash function, then compare the element stored there with the element to be deleted. If they match, delete it, but do not set the state of the position back to EMPTY. It should be set to DELETE, because an EMPTY position tells a search that the element does not exist, so elements that collided at this position and were stored further along could no longer be found.

This method of checking positions one after another for a free one is called linear probing. Its insertion and deletion rules are relatively simple, but it has a drawback: once collisions occur, elements tend to pile up in runs of adjacent positions. These runs occupy positions that other keys hash to, which lengthens later searches and hurts efficiency.
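The following is a minimal sketch of a closed hash with linear probing and the three states, not a production implementation. The names `LinearProbeTable` and `kNotFound` are hypothetical, keys are assumed to be unsigned integers hashed with key % capacity, and the table never grows, so `Insert` simply fails when no position is free.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class State { EMPTY, EXIST, DELETE };

const std::size_t kNotFound = static_cast<std::size_t>(-1);

// Hypothetical sketch: closed hash table of unsigned keys, linear probing.
class LinearProbeTable {
public:
    explicit LinearProbeTable(std::size_t capacity)
        : keys_(capacity, 0), states_(capacity, State::EMPTY), size_(0) {
        assert(capacity > 0);
    }

    bool Insert(unsigned key) {
        if (Find(key) != kNotFound || size_ == keys_.size()) return false;
        std::size_t pos = key % keys_.size();
        // EMPTY and DELETE positions can both take a new element.
        while (states_[pos] == State::EXIST) pos = (pos + 1) % keys_.size();
        keys_[pos] = key;
        states_[pos] = State::EXIST;
        ++size_;
        return true;
    }

    // Returns the position of key, or kNotFound.
    std::size_t Find(unsigned key) const {
        std::size_t pos = key % keys_.size();
        for (std::size_t probes = 0; probes < keys_.size(); ++probes) {
            if (states_[pos] == State::EMPTY) return kNotFound;  // probe run ends here
            if (states_[pos] == State::EXIST && keys_[pos] == key) return pos;
            pos = (pos + 1) % keys_.size();  // step over DELETE and other keys
        }
        return kNotFound;
    }

    bool Erase(unsigned key) {
        const std::size_t pos = Find(key);
        if (pos == kNotFound) return false;
        states_[pos] = State::DELETE;  // not EMPTY, so later keys stay reachable
        --size_;
        return true;
    }

private:
    std::vector<unsigned> keys_;
    std::vector<State> states_;
    std::size_t size_;  // number of EXIST positions
};
```

With a capacity of 10 and the keys 1 6 7 5 4 9 already stored, inserting 54 probes positions 4, 5, 6 and 7 before landing at position 8. Erasing 4 afterwards marks position 4 as DELETE, so a search for 54 still walks past it and finds the element at position 8.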

Another method is called quadratic probing. Instead of checking positions one by one, the i-th probe looks at H(i) = (H0 + i²) % m or H(i) = (H0 - i²) % m, where H0 is the address computed by the hash function and m is the capacity of the table. Compared with linear probing, it relieves the pile-up problem. However, linear probing is guaranteed, in the worst case, to find a free position after one full pass over the hash table, while quadratic probing gives no such guarantee. When few free positions remain in the table, it may take many probes to find one.
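To make the probe sequence concrete, here is a small sketch of the H(i) = (H0 + i²) % m variant. The function names `QuadraticProbeSequence` and `ReachableSlots` are invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// First `count` probe positions of H(i) = (H0 + i*i) % m.
std::vector<std::size_t> QuadraticProbeSequence(std::size_t h0, std::size_t m,
                                                std::size_t count) {
    assert(m > 0);
    std::vector<std::size_t> seq;
    seq.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // Reduce i before squaring so the product cannot overflow for small m.
        seq.push_back((h0 % m + (i % m) * (i % m) % m) % m);
    }
    return seq;
}

// Number of distinct positions the sequence can ever reach. i*i % m repeats
// with period m, so m probes are enough to see every reachable position.
std::size_t ReachableSlots(std::size_t h0, std::size_t m) {
    const std::vector<std::size_t> seq = QuadraticProbeSequence(h0, m, m);
    return std::set<std::size_t>(seq.begin(), seq.end()).size();
}
```

For m = 10 and H0 = 1 the first six probes are 1, 2, 5, 0, 7, 6, and `ReachableSlots(1, 10)` is 6: four of the ten positions are never visited, which is why a free position can be missed even though the table is not full.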

In general, as a closed hash table fills up, fewer and fewer positions remain free, collisions become more and more likely, and probing gets slower. A hash table exists to make searching efficient, so it must not be allowed to fill up: once the number of stored elements reaches a certain proportion of the table, the table must be expanded. That proportion is expressed by the load factor, which is the number of valid elements divided by the total number of positions. Some studies suggest expanding when the load factor reaches about 70% for linear probing and about 50% for quadratic probing. As a result, a closed hash table has fairly low space utilization.
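The expansion check itself is a one-line comparison. The sketch below uses a hypothetical helper, `ShouldExpand`, and takes the threshold as a whole-number percentage so that the comparison can be done in integer arithmetic.

```cpp
#include <cassert>
#include <cstddef>

// True when size / capacity has reached max_load_percent / 100.
// Integer arithmetic avoids floating-point rounding at the threshold.
bool ShouldExpand(std::size_t size, std::size_t capacity,
                  std::size_t max_load_percent) {
    assert(capacity > 0);
    return size * 100 >= capacity * max_load_percent;
}
```

With the thresholds quoted above, a 10-position table using linear probing would expand once it holds 7 elements (`ShouldExpand(7, 10, 70)`), and one using quadratic probing once it holds 5.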

##### Open hash

Open hashing, also known as the chain address method, is used more often in development than closed hashing. An open hash table is a combination of an array and linked lists: each element of the array is a singly linked list, and elements whose hash addresses collide are attached to the same linked list.

First, the hash address is computed with the hash function. Elements with the same hash address are grouped into a set called a bucket, and the elements in each bucket are organized as a singly linked list. The hash table stores the head of each linked list. This structure is called a hash bucket.

In the ideal case, elements inserted under these rules are spread evenly across the buckets.

In practice, when the inserted elements are unlucky, most of them may still end up in one linked list, making some lists particularly long. Finding an element in the hash bucket then amounts to finding it in a linked list, and the lookup time of a linked list is O(N), so this situation has to be avoided as well. Closed hashing solves it by expanding according to the load factor, and the same applies to the hash bucket: when the number of elements equals the number of buckets, expansion should be considered. After the number of buckets changes, the addresses computed by the hash function change too, so the elements in the old hash bucket must be moved to the new one.

There are more extreme cases. Even though the expansion condition has not been met, some linked lists may already hold many nodes, and the performance of the hash table declines. One remedy is to convert a linked list into a red-black tree when the number of nodes in it reaches a threshold and no expansion has taken place. A common design converts the linked list into a red-black tree when it holds 8 nodes, and converts the tree back into a linked list when deletions leave fewer than 6 nodes.

For the division method, the modulus is a prime number that is not greater than the capacity of the table but as close to it as possible. In SGI STL 3.0, the hash table underlying hash_map provides a way to pick such primes: it enumerates the following 28 commonly used prime numbers.
53ul, 97ul, 193ul, 389ul, 769ul,
1543ul, 3079ul, 6151ul, 12289ul, 24593ul,
49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,
50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul,
1610612741ul, 3221225473ul, 4294967291ul
Generic design: an ideal hash table should be able to store elements of any type, but some types cannot be used in a modulo operation directly, such as the common string type. atoi can only convert numeric strings to integers, so other algorithms, such as purpose-built string hash algorithms, are needed to turn such elements into integers before taking the remainder.

Implement a simple hash bucket

```cpp
#pragma once

#include<iostream>
#include<string>
#include<vector>
#include"Common.h"

using namespace std;

template<class T>

struct HashBucketNode {
HashBucketNode<T>* next;
T data;

HashBucketNode(const T& x = T())
:next(nullptr)
,data(x)
{}
};

template<class T>

class T2Intdef {
public:
const T& operator()(const T& data) {
return data;
}
};

//An algorithm for string transformation
class T2str {
public:
size_t operator()(const string& s){
const char* str = s.c_str();
unsigned int seed = 131;
unsigned int hash = 0;
while (*str)
{
hash = hash * seed + (*str++);
}
return (hash & 0x7FFFFFFF);
}
};

template<class T,class T2Int = T2Intdef<T>>

class HashBucket {
typedef HashBucketNode<T> Node;
public:
HashBucket(size_t capacity = 53)
:table(Getnextprime(capacity))
,size(0)
{}

~HashBucket() {
Destroy();
}

bool Insert(const T& data) {
//Check whether capacity expansion is required
Checkcapacity();
size_t bucketloc = Hashfuc(data);
Node* cur = table[bucketloc];
while (cur) {
if (data == cur->data) {
return false;
}
cur = cur->next;
}//cur is nullptr here: data is not in this bucket yet, so it can be inserted

cur = new Node(data);

cur->next = table[bucketloc];
table[bucketloc] = cur;
++size;
//Head insertion: table[bucketloc] was the first node of the linked list,
//so its address is first stored in cur->next,
//and then cur becomes the new first node
return true;
}

bool Erase(const T& data) {
size_t bucketloc = Hashfuc(data);
Node* cur = table[bucketloc];
Node* prev = nullptr;
while (cur) {
if (data == cur->data) {
//Determine whether cur is a header node
if (nullptr == prev) {
table[bucketloc] = cur->next;
}
else {
prev->next = cur->next;
}
delete cur;
--size;
return true;
}
else {
prev = cur;
cur = cur->next;
}
}
return false;
}

Node* Find(const T& data) {
size_t bucketloc = Hashfuc(data);
Node* cur = table[bucketloc];
while (cur) {
if (data == cur->data) {
return cur;
}
cur = cur->next;
}
return nullptr;
}

size_t Size()const {
return size;
}

bool Empty()const {
return 0 == size;
}

void Print() {
//table.size() is the number of buckets; capacity() may be larger than that
for (size_t i = 0; i < table.size(); i++) {
Node* cur = table[i];
cout << "table[" << i << "]";
while (cur) {
cout << cur->data << "--->";
cur = cur->next;
}
cout << "NULL" << endl;
}
cout << "=========================================";
}

void Swap(HashBucket<T,T2Int>& ht) {
table.swap(ht.table);
std::swap(size, ht.size);
}

private:
//Hash function: division method, modulo the number of buckets
size_t Hashfuc(const T& data) {
T2Int t2int;
return t2int(data) % table.size();
}

void Destroy() {
for (size_t i = 0; i < table.size(); i++) {
Node* cur = table[i];
while (cur) {
table[i] = cur->next;
delete cur;
cur = table[i];
}
}
size = 0;
}
void Checkcapacity() {
//Expand when the number of valid elements equals the number of buckets
if (size == table.size()) {
//Create a new hash bucket with the next larger prime number of buckets
HashBucket<T, T2Int> newh(Getnextprime(table.size()));
//Move the nodes out of the old table
for (size_t i = 0; i < table.size(); i++) {
Node* cur = table[i];
while (cur) {
//Detach cur from the old bucket first
table[i] = cur->next;
//Head-insert cur into its bucket in the new table
size_t newloc = newh.Hashfuc(cur->data);
cur->next = newh.table[newloc];
newh.table[newloc] = cur;
newh.size++;
cur = table[i];
}
}
//The old table now holds only empty buckets and is released with newh
this->Swap(newh);
}
}
private:
std::vector<Node*> table;//Hashtable
size_t size;//Number of valid elements
};
```
```cpp
#pragma once

#include<cstddef>

const int PRIMECOUNT = 28;
const size_t primeList[PRIMECOUNT] =
{
53ul, 97ul, 193ul, 389ul, 769ul,
1543ul, 3079ul, 6151ul, 12289ul, 24593ul,
49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,
50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul,
1610612741ul, 3221225473ul, 4294967291ul
};

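//Returns the smallest prime in primeList that is greater than prime,
//or the largest prime in the list if there is none.
//inline, because this function is defined in a header
inline size_t Getnextprime(size_t prime) {
for (int i = 0; i < PRIMECOUNT; i++) {
if (primeList[i] > prime) {
return primeList[i];
}
}
//Reading primeList[PRIMECOUNT] here would be out of bounds
return primeList[PRIMECOUNT - 1];
}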
```

Containers with a hash bucket at the bottom:

The unordered series of containers: unordered_map, unordered_set, unordered_multimap and unordered_multiset.

The bottom layer of these containers is a hash bucket structure. They were introduced in C++11 and are declared in the headers `<unordered_map>` and `<unordered_set>`. Lookup is O(1) on average, so they suit scenarios that do not care about element order and need fast searching. They only provide forward iterators.
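The bucket structure of these containers can be observed through their standard bucket interface. The sketch below uses std::unordered_set with `bucket_count`, `bucket` and `bucket_size`. The helper name `BucketSizeFor` is hypothetical, and the actual bucket numbers depend on the standard library implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Number of elements stored in the bucket that `key` maps to
// (this includes `key` itself if it is present).
std::size_t BucketSizeFor(const std::unordered_set<int>& s, int key) {
    assert(s.bucket_count() > 0);  // precondition of bucket()
    return s.bucket_size(s.bucket(key));
}
```

For a set holding 1 6 7 5 4 9, `s.count(7)` is 1, `BucketSizeFor(s, 7)` is at least 1, and `s.load_factor()` never exceeds `s.max_load_factor()`, because the container rehashes into more buckets before that limit is passed.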

Posted on Tue, 02 Nov 2021 08:09:18 -0400 by tonypr100