Be careful with ToLower and ToUpper: they can drag your system down

I don't know when it started, but many programmers like to use ToLower and ToUpper to implement case-insensitive string equality comparison. This habit was probably brought over from another language. My bold guess is JS. To avoid controversy: by JS I mean "jishi" (technician)~

1: Background

1. The story

In our order aggregation system, each order is marked with its source, such as JD, Taobao, Etao, Shopex and other channels. The UI also has an advanced configuration input where a custom order source can be entered. Later, a customer reported that he could not find orders when querying by xxx. Take Shopex as an example: the user queried with lowercase shopex, but the system had stored it as uppercase SHOPEX, so naturally nothing matched. To fix this, the developer converted both sides to uppercase before comparing, with code like the following:

                // Upper-case both sides so the comparison ignores case.
                var orderFrom = "shopex".ToUpper();

                customerIDList = MemoryOrders.Where(i => i.OrderFrom.ToUpper() == orderFrom)
                                             .Select(i => i.CustomerId).ToList();

After the change, the code went online. At first glance there was no problem, but queries were noticeably slower than before, and it only took a few more clicks... Monitoring showed CPU and memory swinging up and down abnormally. The developer had written another bug. When I reviewed the code and asked why he wrote it that way, he said that this is how you compare strings in JS~~~

2. Switching to string.Compare

In fact, C# has a dedicated way to compare strings while ignoring case. It performs well and does not allocate extra memory: string.Compare. The code above can be changed to the following.

                var orderFrom = "shopex";

                // Compare in place, ignoring case, without creating temporary strings.
                customerIDList = MemoryOrders.Where(i => string.Compare(i.OrderFrom, orderFrom,
                                                                        StringComparison.OrdinalIgnoreCase) == 0)
                                             .Select(i => i.CustomerId).ToList();

The StringComparison.OrdinalIgnoreCase enumeration value makes the comparison ignore case. After this went online, everything was fine apart from minor CPU fluctuation.
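For reference, here is a minimal self-contained sketch of the query above. The Order class, the sample data, and the FindCustomerIds helper are hypothetical stand-ins for the real MemoryOrders collection, which the article does not show. The sketch uses string.Equals with StringComparison.OrdinalIgnoreCase, which should behave the same as checking that string.Compare returns 0.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the real order type.
public class Order
{
    public int CustomerId { get; set; }
    public string OrderFrom { get; set; }
}

public static class OrderQuery
{
    // Returns the customer IDs of all orders whose source matches
    // orderFrom, ignoring case, without creating temporary strings.
    public static List<int> FindCustomerIds(IEnumerable<Order> orders, string orderFrom)
    {
        return orders
            .Where(o => string.Equals(o.OrderFrom, orderFrom, StringComparison.OrdinalIgnoreCase))
            .Select(o => o.CustomerId)
            .ToList();
    }

    public static void Main()
    {
        var memoryOrders = new List<Order>
        {
            new Order { CustomerId = 1, OrderFrom = "SHOPEX" },
            new Order { CustomerId = 2, OrderFrom = "JD" },
            new Order { CustomerId = 3, OrderFrom = "ShopEx" },
        };

        var ids = FindCustomerIds(memoryOrders, "shopex");
        Console.WriteLine(string.Join(",", ids)); // prints 1,3
    }
}
```

The static string.Equals overload also tolerates a null OrderFrom, which the ToUpper version would not.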

2: Why ToLower and ToUpper have such an impact

To make this easy to demonstrate, I took a short English passage and searched it for a single word, to show why ToUpper has such an impact on CPU, memory, and query performance. The code is as follows:

        public static void Main(string[] args)
        {
            var strList = "Hooray! It's snowing! It's time to make a snowman.James runs out. He makes a big pile of snow. He puts a big snowball on top. He adds a scarf and a hat. He adds an orange for the nose. He adds coal for the eyes and buttons.In the evening, James opens the door. What does he see? The snowman is moving! James invites him in. The snowman has never been inside a house. He says hello to the cat. He plays with paper towels.A moment later, the snowman takes James's hand and goes out.They go up, up, up into the air! They are flying! What a wonderful night!The next morning, James jumps out of bed. He runs to the door.He wants to thank the snowman. But he's gone.".Split(' ');

            var query = "snowman".ToUpper();

            for (int i = 0; i < strList.Length; i++)
            {
                var str = strList[i].ToUpper();

                if (str == query)
                    Console.WriteLine(str);
            }

            Console.ReadLine();
        }

1. Research on memory fluctuation

Since memory fluctuates, there must be garbage piling up in it. Anyone who has learned the basics of C# knows that strings are immutable: any change produces a new string. In other words, every ToUpper call creates a new string. To prove this with data, let's use WinDbg.

0:000> !dumpheap -type System.String -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff8e7a9a120        1           24 System.Collections.Generic.GenericEqualityComparer`1[[System.String, mscorlib]]
00007ff8e7a99e98        1           80 System.Collections.Generic.Dictionary`2[[System.String, mscorlib],[System.Globalization.CultureData, mscorlib]]
00007ff8e7a9a378        1           96 System.Collections.Generic.Dictionary`2+Entry[[System.String, mscorlib],[System.Globalization.CultureData, mscorlib]][]
00007ff8e7a93200       19         2264 System.String[]
00007ff8e7a959c0      429        17894 System.String
Total 451 objects

You can see that there are Count=429 string objects on the managed heap. Where does 429 come from? The breakdown: 128 words from the passage, 128 produced by ToUpper, 165 created by the runtime by default, 2 query strings, and 6 unknown strings, so 128 + 128 + 165 + 2 + 6 = 429. Let's take a look at a couple of them.

!dumpheap -mt 00007ff8e7a959c0 > !DumpObj 000002244282a1f8

0:000> !DumpObj /d 0000017800008010
Name:        System.String
MethodTable: 00007ff8e7a959c0
EEClass:     00007ff8e7a72ec0
Size:        38(0x26) bytes
File:        C:\WINDOWS\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String:      HOUSE.
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff8e7a985a0  4000281        8         System.Int32  1 instance                6 m_stringLength
00007ff8e7a96838  4000282        c          System.Char  1 instance               48 m_firstChar
00007ff8e7a959c0  4000286       d8        System.String  0   shared           static Empty
                                 >> Domain:Value  0000017878943bb0:NotInit  <<
0:000> !DumpObj /d 0000017800008248
Name:        System.String
MethodTable: 00007ff8e7a959c0
EEClass:     00007ff8e7a72ec0
Size:        40(0x28) bytes
File:        C:\WINDOWS\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String:      SNOWMAN
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff8e7a985a0  4000281        8         System.Int32  1 instance                7 m_stringLength
00007ff8e7a96838  4000282        c          System.Char  1 instance               53 m_firstChar
00007ff8e7a959c0  4000286       d8        System.String  0   shared           static Empty
                                 >> Domain:Value  0000017878943bb0:NotInit  <<

Those are two of the uppercase strings, "HOUSE." and "SNOWMAN". Back to my scenario: with close to a million orders, each query generates close to a million strings on the managed heap, and clicking again generates close to a million more. No wonder memory spikes...
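The same point can be checked without a debugger. The sketch below (the class and method names are made up for illustration) shows that ToUpper leaves the original string untouched and returns a second object.

```csharp
using System;

public static class ImmutabilityDemo
{
    // Returns true when ToUpper handed back a different object than its input.
    public static bool ToUpperAllocates(string s)
    {
        string upper = s.ToUpper();
        return !object.ReferenceEquals(s, upper);
    }

    public static void Main()
    {
        string word = "snowman";
        string upper = word.ToUpper();

        Console.WriteLine(word);                                // prints snowman (the original is unchanged)
        Console.WriteLine(upper);                               // prints SNOWMAN
        Console.WriteLine(object.ReferenceEquals(word, upper)); // prints False (a second object exists)
    }
}
```

Only lowercase input is used here on purpose: whether ToUpper returns a new object for a string that is already uppercase may depend on the runtime version.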

2. Research on CPU and query time

Now you know that there may be millions of string objects on the heap. Allocating and freeing these objects puts a lot of pressure on the CPU, and the ToUpper calls themselves add work, so queries slow down. Worse, it triggers GC thrashing: during a collection, managed threads are suspended while memory is reclaimed, which makes everything even slower...
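One rough way to observe this pressure is to count generation 0 collections around each approach with GC.CollectionCount. This is only a sketch under assumed conditions: the one million order sources are synthetic, the helper names are hypothetical, and the actual counts depend on the runtime and GC settings, so no expected numbers are given for them.

```csharp
using System;
using System.Linq;

public static class GcPressureDemo
{
    // Counts matches the allocation-heavy way: one new string per element.
    public static int CountWithToUpper(string[] sources, string query)
    {
        string upperQuery = query.ToUpper();
        return sources.Count(s => s.ToUpper() == upperQuery);
    }

    // Counts matches in place, without creating temporary strings.
    public static int CountWithOrdinalIgnoreCase(string[] sources, string query)
    {
        return sources.Count(s => string.Compare(s, query, StringComparison.OrdinalIgnoreCase) == 0);
    }

    public static void Main()
    {
        // Synthetic data: one million order sources cycling through four channels.
        string[] channels = { "jd", "taobao", "etao", "shopex" };
        string[] sources = Enumerable.Range(0, 1000000)
                                     .Select(i => channels[i % channels.Length])
                                     .ToArray();

        int gen0Before = GC.CollectionCount(0);
        int a = CountWithToUpper(sources, "ShopEx");
        int gen0AfterToUpper = GC.CollectionCount(0);
        int b = CountWithOrdinalIgnoreCase(sources, "ShopEx");
        int gen0AfterCompare = GC.CollectionCount(0);

        Console.WriteLine("matches: {0} / {1}", a, b);
        Console.WriteLine("gen0 GCs during ToUpper: {0}", gen0AfterToUpper - gen0Before);
        Console.WriteLine("gen0 GCs during Compare: {0}", gen0AfterCompare - gen0AfterToUpper);
    }
}
```

Both methods should report the same number of matches; the difference to look for is in the collection counts.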

3: How string.Compare works

Now look back at string.Compare. Why is it so impressive? You can check the source code with dnSpy. It has a core function, shown below:

		// Token: 0x060004B8 RID: 1208 RVA: 0x00010C48 File Offset: 0x0000EE48
		[SecuritySafeCritical]
		private unsafe static int CompareOrdinalIgnoreCaseHelper(string strA, string strB)
		{
			int num = Math.Min(strA.Length, strB.Length);
			fixed (char* ptr = &strA.m_firstChar)
			{
				fixed (char* ptr2 = &strB.m_firstChar)
				{
					char* ptr3 = ptr;
					char* ptr4 = ptr2;
					while (num != 0)
					{
						int num2 = (int)(*ptr3);
						int num3 = (int)(*ptr4);
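						// Fold 'a'..'z' (97..122) to uppercase. The unsigned cast matters:
						// without it, every char below 'a' would also pass the test.
						if ((uint)(num2 - 97) <= 25u)
						{
							num2 -= 32;
						}
						if ((uint)(num3 - 97) <= 25u)
						{
							num3 -= 32;
						}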
						if (num2 != num3)
						{
							return num2 - num3;
						}
						ptr3++;
						ptr4++;
						num--;
					}
					return strA.Length - strB.Length;
				}
			}
		}

This code is very neat. It makes clever use of 97, the ASCII code of 'a': lowercase letters are folded to uppercase on the fly and the two strings are compared character by character in place, which is much faster than creating a pile of objects on the heap.
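To make the trick easier to follow, here is an illustrative re-implementation that uses indexers instead of pointers. It is a sketch, not the framework's code: the AsciiIgnoreCase class name is made up, and like the decompiled helper it only folds the ASCII letters a to z, so it is not a general replacement for string.Compare.

```csharp
using System;

public static class AsciiIgnoreCase
{
    // Compares two strings character by character, treating ASCII
    // lowercase letters as their uppercase equivalents.
    public static int Compare(string strA, string strB)
    {
        int length = Math.Min(strA.Length, strB.Length);
        for (int i = 0; i < length; i++)
        {
            int a = strA[i];
            int b = strB[i];

            // (uint)(c - 'a') <= 25 is true only for 'a'..'z': anything below 'a'
            // wraps around to a huge unsigned value, so one compare checks both bounds.
            if ((uint)(a - 'a') <= (uint)('z' - 'a')) a -= 0x20;
            if ((uint)(b - 'a') <= (uint)('z' - 'a')) b -= 0x20;

            // Return the case-insensitive difference at the first mismatch.
            if (a != b) return a - b;
        }
        // All shared characters match, so the shorter string sorts first.
        return strA.Length - strB.Length;
    }

    public static void Main()
    {
        Console.WriteLine(Compare("shopex", "SHOPEX"));    // prints 0
        Console.WriteLine(Compare("Snowman", "snowman.")); // prints -1
        Console.WriteLine(Compare("JD", "Taobao") < 0);    // prints True
    }
}
```

No string is created anywhere in the loop; the folded values live only in two local integers.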

Next, I modified the demo code to see what the heap looks like...

        public static void Main(string[] args)
        {
            ...

            var query = "snowman";

            for (int i = 0; i < strList.Length; i++)
            {
                if (string.Compare(strList[i], query, StringComparison.OrdinalIgnoreCase) == 0)
                {
                    Console.WriteLine(strList[i]);
                }
            }

            Console.ReadLine();
        }


0:000> !dumpheap -type System.String -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff8e7a9a120        1           24 System.Collections.Generic.GenericEqualityComparer`1[[System.String, mscorlib]]
00007ff8e7a99e98        1           80 System.Collections.Generic.Dictionary`2[[System.String, mscorlib],[System.Globalization.CultureData, mscorlib]]
00007ff8e7a9a378        1           96 System.Collections.Generic.Dictionary`2+Entry[[System.String, mscorlib],[System.Globalization.CultureData, mscorlib]][]
00007ff8e7a93200       19         2264 System.String[]
00007ff8e7a959c0      300        13460 System.String
Total 322 objects

The System.String row shows 300 strings on the heap now, compared with 429 before, which is 129 fewer. In other words, the 128 ToUpper results for the words plus the one ToUpper on the query have been eliminated.

4: Summary

The sloppy habits we get away with in everyday code fall apart in front of large amounts of data. At the same time, each such incident is a good opportunity to grow~

Posted on Tue, 05 May 2020 02:36:40 -0400 by swr