What is the difference between Hashtable, HashMap and TreeMap

Typical answer

Hashtable, HashMap and TreeMap are the most common Map implementations. They are container types that store and manipulate data in the form of key-value pairs.

Hashtable is a hash table implementation provided by the early Java class libraries. It is synchronized and does not support null keys or values. Because of the performance overhead caused by synchronization, it is rarely recommended any more.

HashMap is the more widely used hash table implementation. Its behavior is roughly the same as Hashtable; the main differences are that HashMap is not synchronized and that it supports null keys and values. put and get on a HashMap generally run in constant time, so it is the first choice for most key-value access scenarios, for example an in-memory structure that maps user IDs to user information.

TreeMap is a Map that provides ordered access based on a red-black tree. Unlike HashMap, its get, put, remove and similar operations take O(log(n)) time. The specific order is determined by a supplied Comparator or by the natural ordering of the keys.
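The behavioral differences described above can be seen in a small sketch (the class and key names are illustrative, not from the original text):

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapDifferenceSample {
    public static void main(String[] args) {
        Map<String, String> hashMap = new HashMap<>();
        Map<String, String> treeMap = new TreeMap<>();
        for (String id : new String[]{"u3", "u1", "u2"}) {
            hashMap.put(id, "user " + id);
            treeMap.put(id, "user " + id);
        }
        hashMap.put(null, "HashMap allows a null key");
        // new java.util.Hashtable<String, String>().put(null, "x"); // Hashtable would throw NullPointerException
        System.out.println(hashMap.keySet()); // no guaranteed order
        System.out.println(treeMap.keySet()); // [u1, u2, u3], natural order of the keys
    }
}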

Analysis

There are many questions that can be extended from Map, ranging from the underlying data structures and typical application scenarios to design and implementation considerations. In particular, HashMap itself changed substantially in Java 8, and those changes are frequently examined.

  • Master the overall structure of the Map-related types, especially the ordered Map implementations.

  • Analyze the design and implementation of HashMap from the source code: what capacity and load factor are, why those parameters are needed, how they affect the performance of the Map, and how to choose them in practice.

  • Understand the principle behind treeification and the reasons it was introduced.

In addition to typical code analysis, some interesting concurrency-related problems are often mentioned. For example, a HashMap used in a concurrent environment can end up in an infinite loop that pins the CPU, and its size can become inaccurate. This is a typical usage error, because HashMap is explicitly documented as not being a thread-safe data structure. If you ignore this and simply use it in a multithreaded scenario, problems will inevitably occur.

Understanding the causes of this error is also a good way to gain a deeper understanding of how concurrent programs behave.
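The inaccurate-size symptom is easy to reproduce. Below is a minimal sketch (not from the original text) in which several threads put distinct keys into a shared, unsynchronized HashMap; the final size is frequently smaller than expected, and on older JDKs the same misuse could also trigger the infinite resize loop mentioned above:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

public class UnsafeHashMapSample {
    public static void main(String[] args) throws InterruptedException {
        Map<Integer, Integer> map = new HashMap<>();
        int threads = 4;
        int perThread = 10_000;
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            new Thread(() -> {
                try {
                    for (int i = 0; i < perThread; i++) {
                        map.put(base + i, i); // concurrent, unsynchronized writes
                    }
                } finally {
                    done.countDown();
                }
            }).start();
        }
        done.await();
        // 40,000 distinct keys were inserted, but the reported size is often smaller.
        System.out.println("size = " + map.size());
    }
}

Using ConcurrentHashMap, or external synchronization, avoids the problem.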

In depth

1. Overall structure of Map

First, let's have an overall understanding of Map related types. Although Map is usually included in the Java Collection framework, it is not a Collection in a narrow sense. For details, you can refer to the following simple class diagram.

(Simple class diagram of the Map hierarchy omitted.)

Hashtable is special. Like Vector and Stack, it is an early collection-related type, and it extends the Dictionary class, which makes its class structure clearly different from that of HashMap.

Other Map implementations such as HashMap extend AbstractMap, which contains the common method abstractions. The purpose of the different Maps can be read from the class diagram; the design intent is already reflected in the different interfaces.

Most scenarios that use a Map simply put, access or delete entries and have no particular ordering requirement; HashMap is basically the best choice in that case. The performance of HashMap depends heavily on the effectiveness of hashCode, so be sure to master the basic conventions between hashCode and equals, such as the following (a small sketch follows this list):

  • If two objects are equal according to equals, their hashCode values must be equal.

  • If you override equals, you must also override hashCode.

  • hashCode must be consistent: as long as the state that equals depends on does not change, repeated calls must keep returning the same hash value.

  • equals must be reflexive, symmetric and transitive.
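A minimal sketch of a key type that follows these conventions (the class name and fields are illustrative):

import java.util.Objects;

public final class UserId {
    private final long id;
    private final String region;

    public UserId(long id, String region) {
        this.id = id;
        this.region = region;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UserId)) return false;
        UserId other = (UserId) o;
        return id == other.id && Objects.equals(region, other.region);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, region); // equal objects produce equal hash codes
    }
}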

There is comparatively less to examine about ordered Maps. Although LinkedHashMap and TreeMap both guarantee some kind of order, they are quite different.

  • LinkedHashMap usually provides iteration in insertion order. It is implemented by maintaining a doubly linked list of the entries (key-value pairs). Note that through a specific constructor we can instead create an instance that reflects access order, where put, get, compute and so on all count as an "access".

This behavior suits some specific application scenarios. For example, we can build a space-sensitive resource pool that automatically releases the least recently accessed objects, using the mechanism LinkedHashMap provides. Refer to the following example:

import java.util.LinkedHashMap;
import java.util.Map;  
public class LinkedHashMapSample {
    public static void main(String[] args) {
        LinkedHashMap<String, String> accessOrderedMap = new LinkedHashMap<String, String>(16, 0.75F, true){
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) { // Implement user-defined deletion policy, otherwise the behavior is no different from that of general Map
                return size() > 3;
            }
        };
        accessOrderedMap.put("Project1", "Valhalla");
        accessOrderedMap.put("Project2", "Panama");
        accessOrderedMap.put("Project3", "Loom");
        accessOrderedMap.forEach( (k,v) -> {
            System.out.println(k +":" + v);
        });
        // Simulated access
        accessOrderedMap.get("Project2");
        accessOrderedMap.get("Project2");
        accessOrderedMap.get("Project3");
        System.out.println("Iterate over should not be affected:");
        accessOrderedMap.forEach( (k,v) -> {
            System.out.println(k +":" + v);
        });
        // Trigger delete
        accessOrderedMap.put("Project4", "Mission Control");
        System.out.println("Oldest entry should be removed:");
        accessOrderedMap.forEach( (k,v) -> {// The traversal order remains unchanged
            System.out.println(k +":" + v);
        });
    }
}
 
  • For TreeMap, the overall order is determined by the ordering relationship of the keys, supplied either by a Comparator or by Comparable (the natural ordering).

Building a scheduling system with priorities is essentially a typical priority queue scenario. The Java standard library provides PriorityQueue, which is based on a binary heap. All of these types rely on the same ordering mechanism, including TreeSet, which is essentially a thin wrapper over TreeMap.
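A minimal sketch of such a priority-based scheduler (the class and task names are illustrative); the Comparator plays exactly the same role it would play for a TreeMap or TreeSet:

import java.util.Comparator;
import java.util.PriorityQueue;

public class PrioritySchedulerSample {
    static class Task {
        final String name;
        final int priority;
        Task(String name, int priority) { this.name = name; this.priority = priority; }
    }

    public static void main(String[] args) {
        // A smaller priority value is scheduled first.
        PriorityQueue<Task> queue =
                new PriorityQueue<>(Comparator.comparingInt((Task t) -> t.priority));
        queue.offer(new Task("flush log", 1));
        queue.offer(new Task("compact storage", 5));
        queue.offer(new Task("send report", 3));
        while (!queue.isEmpty()) {
            System.out.println(queue.poll().name); // flush log, send report, compact storage
        }
    }
}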

Similar to the hashCode and equals conventions, the natural ordering also needs to follow a convention in order to avoid ambiguity: the result of compareTo should be consistent with equals, otherwise ambiguous behavior will occur.

We can analyze the put method implementation of TreeMap:

public V put(K key, V value) {
    Entry<K,V> t = ...
    cmp = k.compareTo(t.key);
    if (cmp < 0)
        t = t.left;
    else if (cmp > 0)
        t = t.right;
    else
        return t.setValue(value);
    // ...
}

As you can see from the code, when the convention is not followed, two objects that are not actually equal are treated as the same key (because compareTo returns 0), which leads to ambiguous behavior.
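The following minimal sketch (the Version class is illustrative) shows the effect: the two keys are not equal according to equals, but because compareTo only looks at one field and returns 0, the second put silently overwrites the first:

import java.util.TreeMap;

public class InconsistentOrderingSample {
    static class Version implements Comparable<Version> {
        final int major;
        final String label;
        Version(int major, String label) { this.major = major; this.label = label; }

        @Override
        public int compareTo(Version o) {
            return Integer.compare(major, o.major); // ignores label: inconsistent with equals
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Version
                    && major == ((Version) o).major
                    && label.equals(((Version) o).label);
        }

        @Override
        public int hashCode() { return 31 * major + label.hashCode(); }
    }

    public static void main(String[] args) {
        TreeMap<Version, String> map = new TreeMap<>();
        map.put(new Version(1, "GA"), "first");
        map.put(new Version(1, "beta"), "second"); // compareTo returns 0, overwrites "first"
        System.out.println(map.size()); // 1, even though the two keys are not equals()
    }
}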

2. HashMap source code analysis

The design and implementation of HashMap is a very high-frequency interview topic, so a relatively detailed source code walkthrough follows, focusing mainly on:

  • The basic points of HashMap's internal implementation.

  • Capacity and load factor.

  • Treeification.

First, let's look at the internal structure of HashMap. It can be regarded as a composite structure made up of an array (Node<K,V>[] table) and linked lists. The array is divided into buckets, and the hash value determines the addressing of a key-value pair within the array; key-value pairs with the same hash value are stored as a linked list. Note that if the length of a linked list exceeds a threshold (TREEIFY_THRESHOLD, 8), the list is converted into a tree structure.
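Each bucket entry is a plain singly linked node. The following is a simplified paraphrase of the Node class from the JDK 8 source (constructor and Map.Entry methods omitted):

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;   // cached, spread hash of the key, used for bucket addressing
    final K key;
    V value;
    Node<K,V> next;   // next node in the same bucket, forming the collision list
    // constructor and Map.Entry methods omitted
}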

From the implementation of the (non-copy) constructor, it seems that the table array is not initialized up front; only some initial values are set.

public HashMap(int initialCapacity, float loadFactor){  
    // ... 
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}
 

So we can reasonably suspect that, following a lazy-load principle, HashMap is initialized only when it is first used (the copy constructor aside; only the most common scenario is covered here). With that in mind, let's look at the put method; it appears to be just a single call to putVal:

public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

It seems the real secret is hidden in putVal:

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        // ...
        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
            treeifyBin(tab, hash);
        // ...
    }
}

From the first few lines of putVal method, we can find several interesting places:

  • If the table is null, the resize method is responsible for initializing it, as can be seen from tab = resize().

  • The resize method shoulders two responsibilities: creating the initial storage table, and expanding it when the capacity no longer meets demand.

  • While placing a new key-value pair, capacity expansion occurs when the following condition is met:

if (++size > threshold)
    resize();
  • The position of the specific key value pair in the hash table (array index) depends on the following bit operations:
i = (n - 1) & hash

Looking closely at where the hash value comes from, we find that it is not the key's own hashCode, but the result of another hash method inside HashMap. Why does it shift the high-order bits down and XOR them into the low-order bits? Because for some data the differences between hash values lie mainly in the high bits, while hash addressing in HashMap ignores all bits above the capacity; this processing effectively avoids hash collisions in such cases (a small sketch follows this list).

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
  • The linked-list structure mentioned earlier (called a bin here) is treeified when a threshold is reached. Later we will analyze why HashMap treeifies bins.
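The effect of the spreading step in hash can be seen in a small sketch (the values are illustrative): two hash codes that differ only above bit 15 would land in the same bucket of a 16-slot table without the XOR, but land in different buckets with it.

public class HashSpreadSample {
    // The same spreading step as HashMap.hash(): fold the high 16 bits into the low 16 bits
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16;             // table capacity (a power of 2)
        int h1 = 0x00010001;    // two hash codes that differ only in bits 16 and 17
        int h2 = 0x00020001;
        // Without spreading, (n - 1) & h keeps only the low 4 bits: both map to bucket 1
        System.out.println(((n - 1) & h1) + " vs " + ((n - 1) & h2));                 // 1 vs 1
        // With spreading, the high-bit difference is folded into the index: buckets 0 and 3
        System.out.println(((n - 1) & spread(h1)) + " vs " + ((n - 1) & spread(h2))); // 0 vs 3
    }
}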

As you can see, the logic of the putVal method itself is quite concentrated: it is involved in everything from initialization and capacity expansion to treeification.

Let's further analyze the resize method, which takes on several roles at once:

final Node<K,V>[] resize() {
    // ...
    else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
             oldCap >= DEFAULT_INITIAL_CAPACITY)
        newThr = oldThr << 1; // double threshold
    // ...
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {
        // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    // Move entries into the new array structure
}

From the resize source, leaving aside the extreme cases (the theoretical maximum capacity is specified by MAXIMUM_CAPACITY, whose value is 1 << 30, i.e. 2 to the 30th power), we can conclude the following:

  • The threshold equals load factor × capacity. If neither is specified when the HashMap is constructed, the corresponding default constants are used.

  • The threshold is usually adjusted by doubling (newThr = oldThr << 1). As mentioned earlier, according to the logic in putVal, the Map is resized when the number of elements exceeds the threshold.

  • After capacity expansion, the elements in the old array need to be relocated to the new array, which is a major source of overhead for capacity expansion.

3. Capacity, load factor and treeification

Earlier we quickly walked through the logic of HashMap from creation to putting in key-value pairs. Now let's think about why we need to care about capacity and load factor at all.

It is because capacity and load factor determine the number of available buckets. Too many empty buckets waste space, and buckets that are too full severely hurt operation performance. In the extreme case of a single bucket, the map degenerates into a linked list and can no longer provide the supposed constant-time access.

Since capacity and load factor are so important, how should we choose in practice?

If you know in advance how many key-value pairs the HashMap will hold, you can set an appropriate capacity up front. The specific value can be estimated from the resize condition; from the code analysis above, we know the following needs to hold:

Load factor * capacity > Number of elements

Therefore, the preset capacity needs to be larger than (estimated number of elements / load factor), and the capacity HashMap actually uses will be rounded up to a power of 2. The conclusion is clear enough, as the sketch below shows.
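A minimal sketch of this sizing rule (the numbers are illustrative):

import java.util.HashMap;
import java.util.Map;

public class PresizeSample {
    public static void main(String[] args) {
        int expectedEntries = 10_000;
        float loadFactor = 0.75f; // the default
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor); // 13334
        Map<String, Object> cache = new HashMap<>(initialCapacity, loadFactor);
        // HashMap rounds the capacity up to the next power of 2 (16384) via tableSizeFor,
        // so the threshold becomes 12288 and the 10,000 puts below never trigger a resize.
        for (int i = 0; i < expectedEntries; i++) {
            cache.put("key-" + i, i);
        }
        System.out.println(cache.size());
    }
}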

For the load factor, I suggest:

  • If there are no special requirements, do not change it easily, because the default load factor of JDK itself is very consistent with the requirements of general scenarios.

  • If you really need to adjust, it is recommended not to set a value greater than 0.75, because it will significantly increase conflicts and reduce the performance of HashMap.

  • If you do lower the load factor, also adjust the preset capacity according to the formula above; otherwise it may lead to more frequent resizing, adding unnecessary overhead and hurting access performance.

Treeification was mentioned earlier; the corresponding logic lives mainly in putVal and treeifyBin.

final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        // Tree transformation logic
    }
}

The above is a simplified version of treeifyBin. Combining these two methods, the treeification logic is clear: when the number of nodes in a bin exceeds TREEIFY_THRESHOLD:

  • If the table capacity is less than MIN_TREEIFY_CAPACITY, only a simple resize (capacity expansion) is performed.

  • If the capacity is at least MIN_TREEIFY_CAPACITY, the bin is treeified.

So why does HashMap treeify bins at all?

This is essentially a security issue. During element placement, objects whose hashes collide end up in the same bucket and form a linked list. Lookup in a linked list is linear, which severely degrades access performance.

In the real world, constructing data that produces hash collisions is not very complicated. Malicious code can send such data to a server, causing it to burn large amounts of CPU, which constitutes a hash-collision denial-of-service attack. Similar attacks have occurred at front-line Internet companies in China.
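To see how easy colliding keys are to construct: the strings "Aa" and "BB" have the same hashCode (2112), so any string built by concatenating these two blocks collides with every other such string of the same length. A minimal sketch (illustrative only):

import java.util.ArrayList;
import java.util.List;

public class CollidingKeysSample {
    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int round = 0; round < 3; round++) { // 2^3 = 8 colliding keys
            List<String> next = new ArrayList<>();
            for (String k : keys) {
                next.add(k + "Aa");
                next.add(k + "BB");
            }
            keys = next;
        }
        // All printed hash codes are identical, so these keys all land in the same bucket.
        keys.forEach(k -> System.out.println(k + " -> " + k.hashCode()));
    }
}

Before treeification was introduced in JDK 8, such keys would degrade a HashMap bucket into a long linked list; with treeification the worst case becomes O(log(n)).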
