## Hash function

In computing, a function can be viewed as a black box that turns an input into an output, and a hash function is one such function. Two kinds of hash function come up most often.

- Hash functions for hash tables, for example the hash functions used by a Bloom filter or by HashMap.
- Cryptographic hash functions used for integrity checks and digital signatures, for example MD5 and SHA-256.

Hash functions usually have the following characteristics.

- Fixed length. The output has the same length for every input.
- Determinism. The same input always produces the same output.
- One-wayness. The output is computed from the input, but the input cannot feasibly be recovered from the output. This property is expected of cryptographic hash functions; hash-table hashes do not need it.
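The first two properties are easy to observe with SHA-256 from the standard `java.security.MessageDigest` API. The sketch below is only an illustration; the class name `DigestDemo` and the helper `sha256Hex` are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class, not part of any library.
final class DigestDemo {

    // Returns the SHA-256 digest of the text as a lowercase hex string.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Fixed length: inputs of different sizes give 64 hex characters each.
        System.out.println(sha256Hex("a").length());
        System.out.println(sha256Hex("a much longer input string").length());
        // Determinism: hashing the same input twice gives the same digest.
        System.out.println(sha256Hex("a").equals(sha256Hex("a")));
    }
}
```

One-wayness cannot be shown by running code; it is a claim about how hard it is to invert the function.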

## Hash function quality

A hash function maps a large piece of data to a short value that stands for the whole, much like an ID number.

The quality of a hash function is judged mainly on the following aspects:

- Whether the hash values are evenly distributed and look random, which improves the utilization of hash table space and makes the hash harder to crack;
- Whether the probability of a hash collision is low and stays within an acceptable range;
- Whether it is fast to compute: the less time a hash function takes, the more efficient it is.
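The first criterion can be checked empirically by counting how many keys fall into each bucket. The following is a minimal sketch, assuming keys are reduced to a bucket index with `Math.floorMod`; real hash tables such as HashMap use their own index computation, and the class name `BucketSpread` and the sample keys are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for eyeballing how evenly String.hashCode() spreads keys.
final class BucketSpread {

    // Counts the keys per bucket; floorMod keeps the index non-negative
    // even when hashCode() is negative.
    static int[] bucketCounts(Iterable<String> keys, int buckets) {
        int[] counts = new int[buckets];
        for (String key : keys) {
            counts[Math.floorMod(key.hashCode(), buckets)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            keys.add("key-" + i);
        }
        int[] counts = bucketCounts(keys, 16);
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (int c : counts) {
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        // A small gap between the two numbers suggests an even spread.
        System.out.println("smallest bucket: " + min + ", largest bucket: " + max);
    }
}
```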

## Collision probability

What is a collision?

A collision occurs when two different inputs map to the same hash value.

Collisions are inevitable, because there are more possible inputs than hash values. We can only reduce the collision probability as much as possible, and that probability is determined by the hash length and the algorithm.
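A well-known concrete case in Java is the pair "Aa" and "BB", which are different strings with the same `String.hashCode()`. The small sketch below checks this; the class and method names are only for illustration.

```java
// Hypothetical demo class showing a known String.hashCode() collision.
final class KnownCollision {

    // True when the two strings differ but share a hash code.
    static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());      // 2112 (65 * 31 + 97)
        System.out.println("BB".hashCode());      // 2112 (66 * 31 + 66)
        System.out.println(collide("Aa", "BB"));  // true
    }
}
```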

How do we evaluate the collision probability? Probability theory has a classic problem for this, the birthday problem. It shows that among 23 people, the probability that at least two share a birthday is greater than 50%, and among 100 people it is more than 99%. This runs against intuition, so it is also called the birthday paradox.
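These figures can be checked by multiplying out the probability that all birthdays are distinct. The sketch below assumes 365 equally likely birthdays and ignores leap years; the class and method names are illustrative.

```java
// Hypothetical demo class for the exact birthday-problem probability.
final class BirthdayProblem {

    // Probability that at least two of `people` share a birthday,
    // assuming `days` equally likely birthdays.
    static double sharedBirthdayProb(int people, int days) {
        double allDistinct = 1.0;
        for (int i = 0; i < people; i++) {
            // The (i + 1)-th person must avoid the i birthdays already taken.
            allDistinct *= (double) (days - i) / days;
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        System.out.printf("%.4f%n", sharedBirthdayProb(23, 365));  // just above 0.5
        System.out.printf("%.4f%n", sharedBirthdayProb(100, 365)); // above 0.99
    }
}
```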

The birthday problem is the theoretical basis for estimating collision probability. In cryptography, it implies that an attacker can expect to find a collision in an \(n\)-bit hash function after only about \(\sqrt{2^{n}} = 2^{n/2}\) attempts.

The following is a collision reference table for hashes of different bit lengths:
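A rough version of such a table can also be computed rather than looked up. The sketch below uses the birthday approximation \(n \approx \sqrt{2d \ln\frac{1}{1-p}}\) with \(d = 2^{bits}\); the chosen bit lengths and probabilities are arbitrary examples, and the class name is invented.

```java
// Hypothetical generator for a birthday-bound reference table.
final class BirthdayBoundTable {

    // Approximate number of random hash values needed before a collision
    // occurs with probability p, for a hash of the given bit length.
    static double hashesForProbability(int bits, double p) {
        double d = Math.pow(2, bits); // size of the hash range
        return Math.sqrt(2 * d * Math.log(1 / (1 - p)));
    }

    public static void main(String[] args) {
        double[] probabilities = {0.01, 0.5, 0.99};
        System.out.printf("%-6s %-12s %-12s %-12s%n", "bits", "p=1%", "p=50%", "p=99%");
        for (int bits : new int[] {32, 64, 128, 256}) {
            System.out.printf("%-6d", bits);
            for (double p : probabilities) {
                System.out.printf(" %-12.3e", hashesForProbability(bits, p));
            }
            System.out.println();
        }
    }
}
```

For a 32-bit hash, the 50% column comes out at roughly 77,000 values, which is why 32-bit hash codes collide so readily on large data sets.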

In addition, following the derivation on Wikipedia, we get the formulas below, where \(d\) is the size of the hash range.

Given the number of existing hash values \(n\), estimate the collision probability \(p(n)\):

\[ p(n) \approx 1 - e^{-\frac{n(n-1)}{2d}} \]

Given the collision probability \(p\), estimate the number of hashes \(n\) required to reach it:

\[ n(p) \approx \sqrt{2d \ln\frac{1}{1-p}} \]

Given the number of hash values \(n\), estimate the expected number of collisions \(rn\):

\[ rn = n - d + d\left(\frac{d-1}{d}\right)^{n} \]

Estimate the theoretical collision probability:

```java
// p(n) = 1 - exp(-n(n-1) / (2d))
public static double collisionProb(double n, double d) {
    return 1 - Math.exp(-0.5 * (n * (n - 1)) / d);
}
```

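Estimate the number of hashes required to reach a given collision probability:

```java
// n = sqrt(2d * ln(1 / (1 - p))); adding 0.5 before rounding biases the result upward
public static long collisionN(double p, double d) {
    return Math.round(Math.sqrt(2 * d * Math.log(1 / (1 - p))) + 0.5);
}
```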

Estimate the expected number of collisions:

```java
// rn = n - d + d * ((d - 1) / d)^n
public static double collisionRN(double n, double d) {
    return n - d + d * Math.pow((d - 1) / d, n);
}
```

Using the formulas above, let's evaluate String.hashCode(). It returns an int in Java, so the hash range is \(2^{32}\). Let's see how String.hashCode() behaves on 10 million UUIDs.

For 10 million UUIDs, the theoretical number of collisions is 11632.50:

```java
collisionRN(10000000, Math.pow(2, 32)) // 11632.50
```

Use the following code to test:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;

public class HashCodeCollisionTest {

    // Group strings by hash code; a bucket with more than one string is a collision.
    private static Map<Integer, Set<String>> collisions(Set<String> values) {
        Map<Integer, Set<String>> result = new HashMap<>();
        for (String value : values) {
            Integer hashCode = value.hashCode();
            Set<String> bucket = result.computeIfAbsent(hashCode, k -> new TreeSet<>());
            bucket.add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> uuids = new HashSet<>();
        for (int i = 0; i < 10000000; i++) {
            uuids.add(UUID.randomUUID().toString());
        }
        Map<Integer, Set<String>> values = collisions(uuids);

        // Find the hash code shared by the largest number of strings.
        int maxhc = 0, maxsize = 0;
        for (Map.Entry<Integer, Set<String>> e : values.entrySet()) {
            Integer hashCode = e.getKey();
            Set<String> bucket = e.getValue();
            if (bucket.size() > maxsize) {
                maxhc = hashCode;
                maxsize = bucket.size();
            }
        }

        System.out.println("UUID total: " + uuids.size());
        System.out.println("Total hash values: " + values.size());
        System.out.println("Total number of collisions: " + (uuids.size() - values.size()));
        System.out.println("Collision probability: "
                + String.format("%.8f", 1.0 * (uuids.size() - values.size()) / uuids.size()));
        if (maxsize != 0) {
            System.out.println("Maximum collision string: " + maxsize + " " + values.get(maxhc));
        }
    }
}
```

The total number of collisions, 11713, is very close to the theoretical value.

```
UUID total: 10000000
Total hash values: 9988287
Total number of collisions: 11713
Collision probability: 0.00117130
```

Note that this test alone is not enough to draw a conclusion about the behavior of String.hashCode(). There are far too many possible strings to cover them all.

The hashCode algorithm in the JDK determines how keys are distributed in a hash table. Comparing theoretical estimates with measured values is one way to guide improvements to such an algorithm.
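For reference, the Javadoc of `String.hashCode()` specifies the value as \(s[0] \cdot 31^{n-1} + s[1] \cdot 31^{n-2} + \dots + s[n-1]\) in int arithmetic. The sketch below recomputes that polynomial by hand to make the algorithm concrete; it is not the JDK's source code, and the class and method names are illustrative.

```java
// Hypothetical re-implementation of the documented String.hashCode() formula.
final class PolyHash {

    // Horner's rule: h = 31 * h + c for each char, with int overflow wrapping.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String sample = "hash function";
        // Both lines should print the same number.
        System.out.println(polyHash(sample));
        System.out.println(sample.hashCode());
    }
}
```

Because the multiplier is small and the result is only 32 bits wide, short strings with similar characters can collide easily, as the "Aa" and "BB" pair shows.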

For some well-known hash algorithms, such as FNV-1 and Murmur2, there is a post on the Internet that compares their collision probability and distribution.
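To give a feel for what such an algorithm looks like, here is a minimal sketch of 32-bit FNV-1 using its published offset basis (2166136261) and prime (16777619). It is meant for experimentation only and is not taken from the comparison mentioned above; the class name is invented.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical minimal 32-bit FNV-1 implementation.
final class Fnv1 {
    private static final int OFFSET_BASIS = 0x811C9DC5; // 2166136261 as an unsigned value
    private static final int PRIME = 0x01000193;        // 16777619

    // FNV-1: multiply by the prime first, then XOR in the next byte.
    // (FNV-1a performs the same two steps in the opposite order.)
    static int fnv1(byte[] data) {
        int hash = OFFSET_BASIS;
        for (byte b : data) {
            hash *= PRIME;       // int overflow gives arithmetic modulo 2^32
            hash ^= (b & 0xff);  // treat the byte as unsigned
        }
        return hash;
    }

    public static void main(String[] args) {
        byte[] input = "a".getBytes(StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(fnv1(input))); // 50c5d7e
    }
}
```

Swapping `fnv1` in for `hashCode()` in a collision test is one way to compare the measured collision count of two algorithms on the same data.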

## Summary

A hash function maps long information to short, fixed-length data. To judge the quality of a hash function, consider its collision probability and the distribution of its hash values.