Python summary (basic data structures: dictionaries and sets)

Dictionary and set basics

What is a dictionary, and what is a set? A dictionary is a collection of elements made up of key-value pairs. In Python 3.7+, dictionaries are guaranteed to preserve insertion order. (Note: in 3.6 this ordering was only an implementation detail; it became an official language feature in 3.7, so it cannot be 100% relied on in 3.6.) Before 3.6, dictionaries were unordered. A dictionary's length is variable, and its elements can be added, deleted and changed freely.

Compared with lists and tuples, dictionaries perform better for lookup, insertion and deletion by key: a dictionary can complete each of these operations in constant time on average.

A set is basically the same as a dictionary. The only difference is that a set has no key-value pairs; it is an unordered collection of unique elements.
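As a small illustration of the ordering and uniqueness rules above, here is a sketch that assumes Python 3.7 or later. The variable names are made up for this example.

```python
# Dictionaries remember insertion order (guaranteed from Python 3.7).
d = {}
d['b'] = 1
d['a'] = 2
d['c'] = 3
keys_in_order = list(d)           # keys come back in insertion order

# Deleting a key and inserting it again moves it to the end.
del d['b']
d['b'] = 10
keys_after_reinsert = list(d)

# A set keeps only unique elements and promises no particular order.
s = {3, 1, 3, 2, 1}
unique_count = len(s)

print(keys_in_order)        # ['b', 'a', 'c']
print(keys_after_reinsert)  # ['a', 'c', 'b']
print(unique_count)         # 3
```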

First, let's look at how to create dictionaries and sets. The usual ways are:

d1 = {'name': 'jason', 'age': 20, 'gender': 'male'}
d2 = dict({'name': 'jason', 'age': 20, 'gender': 'male'})
d3 = dict([('name', 'jason'), ('age', 20), ('gender', 'male')])
d4 = dict(name='jason', age=20, gender='male')
d1 == d2 == d3 == d4
True

s1 = {1, 2, 3}
s2 = set([1, 2, 3])
s1 == s2
True

In Python, the keys and values of a dictionary and the elements of a set can be of mixed types. For example, the following creates a set with the elements 1, 'hello' and 5.0:

s = {1, 'hello', 5.0}
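One restriction is worth sketching here: mixed types are fine, but every dictionary key and every set element must be hashable, which follows from the hash table storage described later in this article. The example below is only an illustration with made-up data.

```python
# Keys and set elements may mix types, as long as each one is hashable.
d = {1: 'int key', 'two': 2, (3, 4): [5.0, None]}   # values can be anything
s = {1, 'hello', 5.0, (1, 2)}

# A list is mutable and therefore unhashable, so it cannot be used as a key.
try:
    d[[1, 2]] = 'list key'
    list_key_accepted = True
except TypeError:
    list_key_accepted = False

print(len(d), len(s))       # 3 4
print(list_key_accepted)    # False
```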

Element access

A dictionary can be indexed directly by key. If the key does not exist, an exception is thrown:

d = {'name': 'jason', 'age': 20}
d['name']
'jason'
d['location']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'location'

You can also use the get(key, default) method. If the key does not exist, get() returns the default value instead of throwing an exception. For example, the following returns 'null':

d = {'name': 'jason', 'age': 20}
d.get('name')
'jason'
d.get('location', 'null')
'null'

Sets do not support indexing, because a set is essentially a hash table, unlike a list. The following operation is therefore an error, and Python throws an exception:

s = {1, 2, 3}
s[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable

To check whether an element is in a dictionary or set, use value in dict/set. For a dictionary, this tests the keys.

s = {1, 2, 3}
1 in s
True
10 in s
False

d = {'name': 'jason', 'age': 20}
'name' in d
True
'location' in d
False
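Because `in` on a dictionary looks only at keys, checking for a value needs a different expression. A minimal sketch, with variable names chosen just for this example:

```python
d = {'name': 'jason', 'age': 20}

key_present = 'name' in d                  # 'name' is a key
value_as_key = 'jason' in d                # 'jason' is a value, not a key
value_present = 'jason' in d.values()      # works, but scans the values linearly
pair_present = ('age', 20) in d.items()    # checks a (key, value) pair

print(key_present, value_as_key, value_present, pair_present)
# True False True True
```

Only the key test uses the hash table directly; the values() test walks through every value.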

Addition, deletion and modification

d = {'name': 'jason', 'age': 20}
d['gender'] = 'male' # Add element pair 'gender': 'male'
d['dob'] = '1999-02-01' # Add element pair 'dob': '1999-02-01'
d
{'name': 'jason', 'age': 20, 'gender': 'male', 'dob': '1999-02-01'}
d['dob'] = '1998-01-01' # Update value corresponding to key 'dob' 
d.pop('dob') # Delete element pair with key 'dob'
'1998-01-01'
d
{'name': 'jason', 'age': 20, 'gender': 'male'}

s = {1, 2, 3}
s.add(4) # Add element 4 to the set
s
{1, 2, 3, 4}
s.remove(4) # Remove element 4 from the set
s
{1, 2, 3}

Note that the set's pop() operation removes and returns an arbitrary element, not the last one. Because a set is unordered, you cannot know which element will be removed, so use it with caution.

In practice, we often need to sort a dictionary or set, for example to take out the 50 pairs with the largest values. For a dictionary, we usually sort in ascending or descending order by key or by value:

d = {'b': 1, 'a': 2, 'c': 10}
d_sorted_by_key = sorted(d.items(), key=lambda x: x[0]) # Sort by dictionary keys in ascending order
d_sorted_by_value = sorted(d.items(), key=lambda x: x[1]) # Sort by dictionary values in ascending order
d_sorted_by_key
[('a', 2), ('b', 1), ('c', 10)]
d_sorted_by_value
[('b', 1), ('a', 2), ('c', 10)]

A list is returned here. Each element of the list is a tuple made up of a key and a value from the original dictionary.
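The "largest N pairs" case mentioned above can be sketched by sorting in descending order and slicing. The data and the name `top_n` are made up for this example, and `n = 3` stands in for the 50 in the text.

```python
d = {'b': 1, 'a': 2, 'c': 10, 'e': 7, 'f': 4}
n = 3  # stands in for the 50 mentioned in the text

# Sort the (key, value) pairs by value, largest first, and keep the first n.
top_n = sorted(d.items(), key=lambda x: x[1], reverse=True)[:n]
print(top_n)              # [('c', 10), ('e', 7), ('f', 4)]

# From Python 3.7 on, a dict built from the sorted pairs keeps that order.
top_n_dict = dict(top_n)
print(list(top_n_dict))   # ['c', 'e', 'f']
```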

Sorting a set is very similar to sorting the lists and tuples mentioned above. Just call sorted(set) directly, and the result is a sorted list.

s = {3, 4, 2, 1}
sorted(s) # Sort the elements of the set in ascending order
[1, 2, 3, 4]

Dictionary and set performance

Dictionaries and sets are highly performance-optimized data structures, especially for lookup, insertion and deletion. Next, let's look at how they perform in specific scenarios and compare them with other data structures such as lists.

For example, the backend of an e-commerce company stores the ID, name and price of each product. The requirement now is: given a product's ID, find its price.

If we store this data in a list and search it, the corresponding code is as follows:

def find_product_price(products, product_id):
    for pid, price in products:  # pid avoids shadowing the built-in id()
        if pid == product_id:
            return price
    return None

products = [
    (143121312, 100), 
    (432314553, 30),
    (32421912367, 150) 
]

print('The price of product 432314553 is {}'.format(find_product_price(products, 432314553)))

# output
The price of product 432314553 is 30

Assuming the list has n elements, the search has to traverse the list, so the time complexity is O(n). Even if we sort the list first and then use binary search, each lookup still takes O(log n) time, not to mention the O(n log n) time needed to sort the list.

However, if we store the data in a dictionary, lookup becomes convenient and efficient, and can be completed in O(1) time. The reason is simple: like a set, a dictionary is internally a hash table, so you can find the value directly through the hash of the key.
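For comparison, the sort-then-binary-search alternative could look roughly like the sketch below. It uses the standard library's bisect module. The function name `find_product_price_sorted` is made up for this example, and it assumes the list holds (id, price) tuples sorted by ID.

```python
from bisect import bisect_left

def find_product_price_sorted(sorted_products, product_id):
    """Binary search over (id, price) tuples that are sorted by id."""
    # The 1-tuple (product_id,) sorts before any (product_id, price) tuple,
    # so bisect_left lands on the first entry with this id, if there is one.
    i = bisect_left(sorted_products, (product_id,))
    if i < len(sorted_products) and sorted_products[i][0] == product_id:
        return sorted_products[i][1]
    return None

products = [
    (143121312, 100),
    (432314553, 30),
    (32421912367, 150)
]
sorted_products = sorted(products)   # O(n log n), done once up front

print(find_product_price_sorted(sorted_products, 432314553))  # 30
print(find_product_price_sorted(sorted_products, 1))          # None
```

Each lookup is O(log n), but the list has to stay sorted as products are added.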

products = {
  143121312: 100,
  432314553: 30,
  32421912367: 150
}
print('The price of product 432314553 is {}'.format(products[432314553])) 

# output
The price of product 432314553 is 30
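If the data already exists as a list of (id, price) tuples, it does not have to be retyped. A sketch, where the names `product_list` and `price_by_id` are made up for this example:

```python
product_list = [
    (143121312, 100),
    (432314553, 30),
    (32421912367, 150)
]

# dict() accepts an iterable of (key, value) pairs: one O(n) pass to build.
price_by_id = dict(product_list)

print(price_by_id[432314553])   # 30
print(price_by_id.get(999))     # None, like the list version for a missing ID
```

The conversion only pays off when more than a few lookups follow it.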

Similarly, suppose the requirement becomes finding out how many different prices these products have. We compare the two approaches in the same way.

If you still choose a list, the corresponding code is as follows. The for loop at A and the membership test at B (itself a linear scan of the list) amount to two nested loops. Again assuming the original list has n elements, the worst case requires O(n^2) time.

# list version
def find_unique_price_using_list(products):
    unique_price_list = []
    for _, price in products: # A
        if price not in unique_price_list: # B
            unique_price_list.append(price)
    return len(unique_price_list)

products = [
    (143121312, 100), 
    (432314553, 30),
    (32421912367, 150),
    (937153201, 30)
]
print('number of unique price is: {}'.format(find_unique_price_using_list(products)))

# output
number of unique price is: 3

However, if we use a set, then because a set is a highly optimized hash table whose elements cannot repeat, and whose add and lookup operations take only O(1) time, the total time complexity is only O(n).

# set version
def find_unique_price_using_set(products):
    unique_price_set = set()
    for _, price in products:
        unique_price_set.add(price)
    return len(unique_price_set)        

products = [
    (143121312, 100), 
    (432314553, 30),
    (32421912367, 150),
    (937153201, 30)
]
print('number of unique price is: {}'.format(find_unique_price_using_set(products)))

# output
number of unique price is: 3
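The set version can also be written more compactly with a set comprehension. This is just an alternative spelling of the same O(n) idea, and the function name `count_unique_prices` is made up for this example.

```python
def count_unique_prices(products):
    # Build a set of prices in one pass; duplicates collapse automatically.
    return len({price for _, price in products})

products = [
    (143121312, 100),
    (432314553, 30),
    (32421912367, 150),
    (937153201, 30)
]
print('number of unique price is: {}'.format(count_unique_prices(products)))
# number of unique price is: 3
```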

Maybe these time complexities do not feel intuitive yet. Here is an example from a real work scenario to give you a feel for them.

The following code initializes a product list with 100,000 elements and measures the running time of the list version and the set version of the unique-price count:

import time
ids = [x for x in range(0, 100000)]  # named ids to avoid shadowing the built-in id()
prices = [x for x in range(200000, 300000)]
products = list(zip(ids, prices))

# Time the list version
start_using_list = time.perf_counter()
find_unique_price_using_list(products)
end_using_list = time.perf_counter()
print("time elapse using list: {}".format(end_using_list - start_using_list))
# output
time elapse using list: 41.61519479751587

# Time the set version
start_using_set = time.perf_counter()
find_unique_price_using_set(products)
end_using_set = time.perf_counter()
print("time elapse using set: {}".format(end_using_set - start_using_set))
# output
time elapse using set: 0.008238077163696289
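A smaller version of the same experiment can be sketched with the standard library's timeit module, which repeats the call for you. The function names and the size `n = 2000` are chosen for this example so that it finishes quickly. The absolute numbers depend on the machine, so no output is shown.

```python
import timeit

def unique_with_list(products):
    seen = []
    for _, price in products:
        if price not in seen:      # linear scan of the list each time
            seen.append(price)
    return len(seen)

def unique_with_set(products):
    return len({price for _, price in products})

n = 2000                                    # small on purpose, keeps the run short
products = list(zip(range(n), range(n)))    # every price is different (worst case)

list_time = timeit.timeit(lambda: unique_with_list(products), number=3)
set_time = timeit.timeit(lambda: unique_with_set(products), number=3)
print('list: {:.4f}s  set: {:.4f}s'.format(list_time, set_time))
```

Doubling n should roughly quadruple the list time while the set time only doubles, which is the O(n^2) versus O(n) difference in practice.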

You can see that with only 100,000 items the speed difference is already this large. In fact, the backend data of large companies often runs to hundreds of millions or even billions of records. With an inappropriate data structure, it is easy to bring the server down, which not only hurts the user experience but can also cause huge financial losses for the company.

How dictionaries and sets work

Through the examples and the comparison with lists, we have seen how efficient dictionary and set operations are. But why are dictionaries and sets so efficient, especially for lookup, insertion and deletion?

Of course, this comes down to the data structure inside dictionaries and sets. Unlike other data structures, both are internally hash tables.

For a dictionary, each entry in this table stores three items: hash, key and value.

For a set, the hash table has no key-value pairing; each entry holds only a single element.

The hash table structure in older versions of Python looked like this:

--+-------------------------------+
  |    hash       key     value
--+-------------------------------+
0 |    hash0      key0    value0
--+-------------------------------+
1 |    hash1      key1    value1
--+-------------------------------+
2 |    hash2      key2    value2
--+-------------------------------+
. |           ...
--+-------------------------------+

It is not hard to imagine that as the hash table grows, it becomes more and more sparse. For example, take this dictionary:

{'name': 'mike', 'dob': '1999-01-01', 'gender': 'male'}

Then it will be stored in a form similar to the following:

entries = [
['--', '--', '--'],
[-230273521, 'dob', '1999-01-01'],
['--', '--', '--'],
['--', '--', '--'],
[1231236123, 'name', 'mike'],
['--', '--', '--'],
[9371539127, 'gender', 'male']
]

Such a layout obviously wastes storage space. To improve space utilization, the current hash table keeps the indices separate from the hash values, keys and values, giving the following structure:

Indices
----------------------------------------------------
None | index | None | None | index | None | index ...
----------------------------------------------------

Entries
---------------------
hash0   key0  value0
---------------------
hash1   key1  value1
---------------------
hash2   key2  value2
---------------------
        ...
---------------------

Then, in this example, the storage form under the new hash table structure will become as follows:

indices = [None, 1, None, None, 0, None, 2]
entries = [
[1231236123, 'name', 'mike'],
[-230273521, 'dob', '1999-01-01'],
[9371539127, 'gender', 'male']
]

We can clearly see that the space utilization has been greatly improved.
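To make the split layout concrete, here is a toy sketch of it. It is not CPython's actual code: the function names `build_compact` and `lookup` are made up, collisions are resolved with plain linear probing as a simplification, and it assumes distinct keys and a table size that is a power of two larger than the number of pairs.

```python
def build_compact(pairs, size=8):
    """Toy split layout: a sparse index array plus a dense entries list."""
    mask = size - 1
    indices = [None] * size      # sparse, but each slot is only a small integer
    entries = []                 # dense, kept in insertion order
    for key, value in pairs:
        h = hash(key)
        i = h & mask
        while indices[i] is not None:    # slot taken: try the next one
            i = (i + 1) & mask
        indices[i] = len(entries)        # remember where the entry lives
        entries.append([h, key, value])
    return indices, entries

def lookup(indices, entries, key):
    mask = len(indices) - 1
    h = hash(key)
    i = h & mask
    while indices[i] is not None:
        entry_hash, entry_key, value = entries[indices[i]]
        if entry_hash == h and entry_key == key:
            return value
        i = (i + 1) & mask
    raise KeyError(key)

pairs = [('name', 'mike'), ('dob', '1999-01-01'), ('gender', 'male')]
indices, entries = build_compact(pairs)

print([key for _, key, _ in entries])    # ['name', 'dob', 'gender']
print(lookup(indices, entries, 'dob'))   # 1999-01-01
```

Since entries is dense and filled in insertion order, iterating over it yields the keys in the order they were added, which fits the ordered behavior described at the start of this article.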

With the design clear, let's look at how each of these operations works.

Insert operation

Each time an element is inserted into a dictionary or set, Python first computes the hash of the key, hash(key), and then ANDs it with mask = PyDict_MINSIZE - 1 (the minimum table size; in general the mask is the current table size minus one) to get the position where the element should go in the hash table: index = hash(key) & mask. If that position in the hash table is empty, the element is inserted there.

If the position is already occupied, Python compares the hash and key of the element stored there with those of the element being inserted.

If both are equal, the element already exists; if the value is different, the value is updated.

If either of the two is not equal, we have what is usually called a hash collision: the keys of the two elements are different, but their hashes lead to the same position. In this case, Python keeps looking for an empty position in the table until it finds one.

It is worth mentioning that the simplest way to do this is linear probing, that is, starting from this position and checking the following slots one by one for a vacancy. Python uses a more optimized probing scheme internally, so this step is more efficient.
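The index computation and a collision can be seen with a few integer keys. This sketch assumes CPython, where small non-negative integers hash to themselves, and a table of 8 slots. It uses the simple linear probing described above, not CPython's real probing scheme.

```python
mask = 8 - 1   # table of 8 slots, so the mask is 0b111

# 1, 9 and 17 share the same low three bits, so they collide on slot 1.
start_positions = {key: hash(key) & mask for key in (1, 9, 17, 4)}
print(start_positions)   # {1: 1, 9: 1, 17: 1, 4: 4}

slots = [None] * 8
for key in (1, 9, 17, 4):
    i = hash(key) & mask
    while slots[i] is not None:   # occupied: step to the next slot
        i = (i + 1) & mask
    slots[i] = key

print(slots)   # [None, 1, 9, 17, 4, None, None, None]
```

Keys 9 and 17 end up one and two slots past their home position, so finding them later takes extra comparisons.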

Find operation

Similar to insertion, Python finds the position the element should be in from its hash, then compares the hash and key stored at that position with those of the element being searched for. If they are equal, it returns directly; if not, it keeps probing until it either finds the element or reaches an empty slot, in which case the key is absent and an exception is thrown.

Delete operation

For deletion, Python temporarily writes a special marker value into the element's position, and actually removes it when the hash table is resized.

It is not hard to see that hash collisions reduce the speed of dictionary and set operations. Therefore, to stay efficient, the hash tables inside dictionaries and sets usually keep at least 1/3 of their space free. As elements keep being inserted, once the free space drops below 1/3, Python allocates more memory and expands the hash table. When this happens, however, the positions of all elements in the table are recomputed.

Although hash collisions and hash table resizing slow things down, they happen rarely. On average, therefore, insertion, lookup and deletion still run in O(1) time.
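The three operations and the resize rule can be put together in a toy class. This is a teaching sketch, not CPython's implementation. The class name `ToyDict`, linear probing, doubling the table on resize, and resizing as soon as the table is 2/3 full are all assumptions made for this example.

```python
class ToyDict:
    """Open-addressing hash table with linear probing (teaching sketch)."""

    _EMPTY = object()     # slot that has never been used
    _DELETED = object()   # special marker left behind by a delete

    def __init__(self, size=8):
        self._slots = [self._EMPTY] * size   # size must be a power of two
        self._used = 0                       # live entries plus delete markers

    def _probe(self, h):
        """Yield the slots to try, starting at h & mask."""
        mask = len(self._slots) - 1
        i = h & mask
        while True:
            yield i
            i = (i + 1) & mask

    def __setitem__(self, key, value):
        h = hash(key)
        for i in self._probe(h):
            slot = self._slots[i]
            if slot is self._EMPTY:                   # free slot: insert here
                self._slots[i] = [h, key, value]
                self._used += 1
                if self._used * 3 >= len(self._slots) * 2:
                    self._resize()                    # keep about 1/3 free
                return
            if slot is not self._DELETED and slot[0] == h and slot[1] == key:
                slot[2] = value                       # same key: update value
                return

    def _find(self, key):
        h = hash(key)
        for i in self._probe(h):
            slot = self._slots[i]
            if slot is self._EMPTY:                   # empty slot: key is absent
                raise KeyError(key)
            if slot is not self._DELETED and slot[0] == h and slot[1] == key:
                return i

    def __getitem__(self, key):
        return self._slots[self._find(key)][2]

    def __delitem__(self, key):
        # Only mark the slot; it is really dropped at the next resize.
        self._slots[self._find(key)] = self._DELETED

    def _resize(self):
        live = [s for s in self._slots
                if s is not self._EMPTY and s is not self._DELETED]
        self._slots = [self._EMPTY] * (len(self._slots) * 2)
        self._used = 0
        for _, key, value in live:                    # every position is recomputed
            self[key] = value


t = ToyDict()
for k in range(10):
    t[k] = k * k
t[3] = 'updated'
del t[4]

print(t[3])            # updated
print(t[9])            # 81
print(len(t._slots))   # 16
```

The delete marker matters because simply emptying the slot would cut the probe chain and hide any key that had been pushed past it by a collision.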

Summary

A dictionary is an ordered data structure in Python 3.7+, while a set is unordered. Their internal hash table storage makes lookup, insertion and deletion efficient. Dictionaries and sets are therefore commonly used for fast lookup and for de-duplicating elements.


Posted on Mon, 22 Nov 2021 04:30:41 -0500 by horstuff