Strings take up too much memory, so I came up with a few tricks to compress them

One: Background

1. The story

In one of our all-in-memory projects, we need to load tens of millions of trades for a large brand store into memory. Trades usually carry fields such as order source, province and city. Once you load these fields, you find that they eat memory particularly fast, because they are all strings. You may not have a clear sense of how much memory strings cost, so let me ask a question.

Question: How much memory does an empty string consume? Do you know?

Once you have a guess, let's verify it by inspecting the managed heap with WinDbg. The code is as follows:

        static void Main(string[] args)
        {
            string s = string.Empty;

            Console.ReadLine();
        }

0:000> !clrstack -l
OS Thread Id: 0x308c (0)
        Child SP               IP Call Site
ConsoleApp6.Program.Main(System.String[]) [C:\dream\Csharp\ConsoleApp1\ConsoleApp6\Program.cs @ 19]
    LOCALS:
        0x00000087391febd8 = 0x000002605da91420
0:000> !DumpObj /d 000002605da91420
Name:        System.String
String:      
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9eb2b85a0  4000281        8         System.Int32  1 instance                0 m_stringLength
00007ff9eb2b6838  4000282        c          System.Char  1 instance                0 m_firstChar
00007ff9eb2b59c0  4000286       d8        System.String  0   shared           static Empty
                                 >> Domain:Value  000002605beb2230:NotInit  <<
0:000> !objsize 000002605da91420
sizeof(000002605da91420) = 32 (0x20) bytes (System.String)

As you can see from the output, an empty string takes up 32 bytes. Five million empty strings would cost 32 bytes x 5,000,000 = 160,000,000 bytes, which is roughly 152 MB. That is a shocking amount for strings with nothing in them.

2. Return to Trade

Now that the problem is clear, let's go back to Trade. For demonstration purposes, we simulate 200,000 trades and read the order source from a file as a stand-in for the database.

    class Program
    {
        static void Main(string[] args)
        {
            var trades = Enumerable.Range(0, 20 * 10000).Select(m => new Trade()
            {
                TradeID = m,
                TradeFrom = File.ReadLines(Environment.CurrentDirectory + "//orderfrom.txt")
                                 .ElementAt(m % 4)
            }).ToList();

            GC.Collect();  // Collect temporaries so the measurement is easier to read
            Console.WriteLine("Successful execution");
            Console.ReadLine();
        }
    }

    class Trade
    {
        public int TradeID { get; set; }
        public string TradeFrom { get; set; }
    }

Then use WinDbg to inspect the managed heap and measure trades.

0:000> !dumpheap -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff9eb2b59c0   200200      7010246 System.String

0:000> !objsize 0x000001a5860629a8
sizeof(000001a5860629a8) = 16097216 (0xf59fc0) bytes (System.Collections.Generic.List`1[[ConsoleApp6.Trade, ConsoleApp6]])

From the output above, you can see 200,200 strings on the managed heap: 200,000 allocated by the program plus 200 allocated by the runtime. The size of trades is 16097216 / 1024 / 1024 = 15.35 MB.
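
If WinDbg is not at hand, a rough version of the same measurement can be made in code. The sketch below is only an approximation under stated assumptions: StringCostSketch and MeasureDuplicates are hypothetical names, not part of the original project, and GC.GetTotalMemory reports an estimate rather than the exact figures that !objsize gives.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
static class StringCostSketch
{
    // Allocates `count` separate string instances with the same content and
    // returns the approximate growth of the managed heap in bytes.
    // `value` must be non-empty: new string() of an empty array returns string.Empty.
    public static long MeasureDuplicates(int count, string value)
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);

        var list = new List<string>(count);
        for (int i = 0; i < count; i++)
        {
            // new string(char[]) creates a fresh instance each time,
            // much like a value read from a file or a database.
            list.Add(new string(value.ToCharArray()));
        }

        long after = GC.GetTotalMemory(forceFullCollection: true);
        GC.KeepAlive(list); // keep the strings reachable until after the second reading
        return after - before;
    }
}
```

Calling StringCostSketch.MeasureDuplicates(200000, "WAP") gives a ballpark figure for 200,000 duplicated strings. The exact number depends on the runtime and on whether the process is 32-bit or 64-bit.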

Two: Analysis of compression techniques

1. Use a dictionary

There are 200,000 strings on the managed heap, but if you look closely they are just four distinct values repeated over and over (Taobao, for example). That gives me an opportunity to optimize: why not build an order-source dictionary while loading the data, and store in each trade a TradeFromID that maps into that dictionary? Because there are so few distinct values, a byte is enough. With this idea, the code can be modified as follows:

    class Program
    {
        public static Dictionary<int, string> orderfromDict = new Dictionary<int, string>();

        static void Main(string[] args)
        {
            var trades = Enumerable.Range(0, 20 * 10000).Select(m =>
            {
                var tradefrom = File.ReadLines(Environment.CurrentDirectory + "//orderfrom.txt")
                                 .ElementAt(m % 4);

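                var kv = orderfromDict.FirstOrDefault(k => k.Value == tradefrom);

                // Keys start at 1, so a key of 0 means the value is not registered yet
                var id = kv.Key;

                if (id == 0)
                {
                    id = orderfromDict.Count + 1;
                    orderfromDict.Add(id, tradefrom);
                }

                // Use the assigned ID; using kv.Key here would store 0 for a newly added value
                var trade = new Trade() { TradeID = m, TradeFromID = (byte)id };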

                return trade;

            }).ToList();

            GC.Collect();  // Collect temporaries so the measurement is easier to read

            Console.WriteLine("Successful execution");

            Console.ReadLine();
        }
    }

    class Trade
    {
        public int TradeID { get; set; }

        public byte TradeFromID { get; set; }

        public string TradeFrom
        {
            get
            {
                return Program.orderfromDict[TradeFromID];
            }
        }
    }

The code is still simple. Next, use WinDbg to see how much space was saved.

0:000> !dumpheap -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff9eb2b59c0      204        10386 System.String

0:000> !clrstack -l
OS Thread Id: 0x2ce4 (0)
        Child SP               IP Call Site
ConsoleApp6.Program.Main(System.String[]) [C:\dream\Csharp\ConsoleApp1\ConsoleApp6\Program.cs @ 42]
    LOCALS:
        0x0000006f4d9ff078 = 0x0000016fdcf82ab8

0000006f4d9ff288 00007ff9ecd96c93 [GCFrame: 0000006f4d9ff288] 
0:000> !objsize 0x0000016fdcf82ab8
sizeof(0000016fdcf82ab8) = 6897216 (0x693e40) bytes (System.Collections.Generic.List`1[[ConsoleApp6.Trade, ConsoleApp6]])

As you can see from the output above, the managed heap now holds 204 strings: 4 allocated by the program (the four dictionary values) plus 200 allocated by the runtime. The size of trades is 6897216 / 1024 / 1024 = 6.58 MB, about 57% smaller than the earlier 15.35 MB.
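
The dictionary above is searched by value with FirstOrDefault, which scans every entry. With four values that hardly matters, but with more distinct values a reverse lookup keyed by the string avoids the scan. The following is a minimal sketch of that idea, not the code used in the project: OrderSourceTable, GetOrAdd and GetName are hypothetical names, and IDs start at 0 here instead of 1.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
sealed class OrderSourceTable
{
    // name -> ID, for constant-time lookup while loading trades
    private readonly Dictionary<string, byte> _idByName = new Dictionary<string, byte>();

    // ID -> name, for resolving TradeFromID back to the string
    private readonly List<string> _nameById = new List<string>();

    // Returns the existing ID for `name`, or registers it and returns a new ID.
    public byte GetOrAdd(string name)
    {
        if (_idByName.TryGetValue(name, out byte id))
            return id;

        // A byte can only address 256 distinct values (IDs 0..255)
        if (_nameById.Count > byte.MaxValue)
            throw new InvalidOperationException("More than 256 distinct values; use a wider ID type.");

        id = (byte)_nameById.Count;
        _nameById.Add(name);
        _idByName.Add(name, id);
        return id;
    }

    public string GetName(byte id) => _nameById[id];
}
```

With such a table, the loading code would store table.GetOrAdd(tradefrom) in TradeFromID, and the TradeFrom getter would return table.GetName(TradeFromID).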

Although this saves nearly 60%, it is an invasive optimization: it requires modifying the Trade class, defining a Dictionary, and slightly changing business logic. Everyone knows that code already in production should not be changed unless it has to be. So the real question is how to compress memory while keeping the changes as small as possible. Is there a way to get the best of both worlds?

2. Use the string intern pool

Put that way, the answer is probably obvious: the string intern pool exists to solve exactly this problem. Internally, the CLR maintains the same kind of dictionary mechanism that I just built by hand. A duplicate string does not need to be allocated on the heap again; you only keep a reference to the existing one. If you are not familiar with the intern pool, I suggest you read this article: https://www.cnblogs.com/huangxincheng/p/12799736.html
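
To make the behaviour concrete, here is a small sketch showing that two equal strings built at run time are separate objects, while string.Intern returns one shared instance for both. InternSketch and InternSharesInstance are hypothetical names used only for this illustration.

```csharp
using System;

// Hypothetical helper for illustration only.
static class InternSketch
{
    // Returns true when two equal strings are separate objects before interning
    // but resolve to one shared object after string.Intern.
    public static bool InternSharesInstance()
    {
        // Build the strings at run time so each one is a distinct heap object
        string a = new string(new[] { 'W', 'A', 'P' });
        string b = new string(new[] { 'W', 'A', 'P' });

        bool distinctBefore = !object.ReferenceEquals(a, b);
        bool sharedAfter = object.ReferenceEquals(string.Intern(a), string.Intern(b));

        return distinctBefore && sharedAfter;
    }
}
```

The duplicates that are no longer referenced can then be reclaimed by the garbage collector, which is where the saving comes from.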

Next, you only need to wrap the TradeFrom value in String.Intern. The change could hardly be smaller. The code is as follows:

        static void Main(string[] args)
        {
            var trades = Enumerable.Range(0, 20 * 10000).Select(m => new Trade()
            {
                TradeID = m,
                TradeFrom = string.Intern(File.ReadLines(Environment.CurrentDirectory + "//orderfrom.txt")
                                 .ElementAt(m % 4)),   // Wrap the value in string.Intern
            }).ToList();

            GC.Collect();  // Collect temporaries so the measurement is easier to read
            Console.WriteLine("Successful execution");
            Console.ReadLine();
        }

Then inspect the managed heap with WinDbg.

0:000> !dumpheap -stat 
Statistics:
              MT    Count    TotalSize Class Name
00007ff9eb2b59c0      204        10386 System.String

0:000> !clrstack -l
OS Thread Id: 0x13f0 (0)
        Child SP               IP Call Site

ConsoleApp6.Program.Main(System.String[]) [C:\dream\Csharp\ConsoleApp1\ConsoleApp6\Program.cs @ 27]
    LOCALS:
        0x0000005e4d3ff0a8 = 0x000001f8a15129a8

0000005e4d3ff2b8 00007ff9ecd96c93 [GCFrame: 0000005e4d3ff2b8] 
0:000> !objsize 0x000001f8a15129a8
sizeof(000001f8a15129a8) = 8497368 (0x81a8d8) bytes (System.Collections.Generic.List`1[[ConsoleApp6.Trade, ConsoleApp6]])

With the intern pool, the space used is 8497368 / 1024 / 1024 = 8.10 MB. You may wonder why that is about 23% larger than the dictionary version. Look closely: with the intern pool, every Trade in List<Trade> still holds a TradeFrom reference to the string on the heap, and a reference takes 8 bytes on an x64 machine. In the dictionary version, Trade has no TradeFrom field at all and stores a single byte instead, so the field is 7 bytes smaller per trade. (The measured gap, 8497368 - 6897216 = 1,600,152 bytes, works out to about 8 bytes per trade across 200,000 trades.) Let's confirm this with WinDbg.

0:000> !da -length 1 -details 000001f8b16f9b68
Name:        ConsoleApp6.Trade[]
Size:        2097176(0x200018) bytes
Array:       Rank 1, Number of elements 262144, Type CLASS

    Fields:
                      MT    Field   Offset                 Type VT     Attr            Value Name
        00007ff9eb2b85a0  4000001       10             System.Int32      1     instance                    0     <TradeID>k__BackingField
        00007ff9eb2b59c0  4000002        8            System.String      0     instance     000001f8a1516030     <TradeFrom>k__BackingField

0:000> !DumpObj /d 000001f8a1516030
Name:        System.String
String:      WAP

You can see that 000001f8a1516030 is the address of the string "WAP" on the heap, and storing that reference takes up 8 bytes.

Now dump the dictionary-based version and check whether it has a <TradeFrom>k__BackingField field. It does not; only the TradeID and TradeFromID backing fields are present.

0:000> !da -length 1 -details 000001ed52759ac0
Name:        ConsoleApp6.Trade[]
Size:        262168(0x40018) bytes
Array:       Rank 1, Number of elements 32768, Type CLASS
    Fields:
                      MT    Field   Offset                 Type VT     Attr            Value Name
        00007ff9eb2b85a0  4000002        8             System.Int32      1     instance                    0     <TradeID>k__BackingField
        00007ff9eb2b7d20  4000003        c              System.Byte      1     instance                    0     <TradeFromID>k__BackingField
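
The two dumps can be turned into a rough per-object estimate. The sketch below assumes the usual x64 CLR layout (an 8-byte object header, an 8-byte method table pointer, sizes rounded up to a multiple of 8, and a 24-byte minimum); these constants are assumptions for illustration, not values read from the dumps, and ObjectSizeSketch is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration only.
static class ObjectSizeSketch
{
    const int HeaderBytes = 8;       // assumed object header size on x64
    const int MethodTableBytes = 8;  // assumed method table pointer size on x64
    const int MinObjectBytes = 24;   // assumed minimum object size on x64

    // Estimates the heap size of an object whose instance fields total `fieldBytes`.
    public static int Estimate(int fieldBytes)
    {
        int raw = HeaderBytes + MethodTableBytes + fieldBytes;
        int aligned = (raw + 7) / 8 * 8; // round up to a multiple of 8
        return Math.Max(aligned, MinObjectBytes);
    }
}
```

Under these assumptions, a Trade with an 8-byte reference and a 4-byte int comes to Estimate(12) = 32 bytes, and a Trade with a 4-byte int and a 1-byte ID comes to Estimate(5) = 24 bytes. The padding is why the saving is 8 bytes per object rather than 7.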


Three: Summary

Choose according to your own situation. The intern pool approach needs the smallest change and is simple and blunt. Building your own dictionary saves more memory, but you have to modify business logic, and that risk is yours to weigh.


Tags: C# Database

Posted on Wed, 03 Jun 2020 20:47:40 -0400 by Shaun13