Strings take up too much memory, so I tried all sorts of tricks to compress them

One: Background
Two: Analysis of compression techniques
Three: Summary

One: Background

1. Storytelling

In one of our fully in-memory projects, we need to load tens of millions of a large brand store's trades into memory. As you know, a trade usually carries fields such as the order source, province, and city. Once you load these fields, you find that they eat memory remarkably fast, because they are all strings. In case you are not sure just how memory-hungry strings are, let me ask you a question.

Question: how much memory does an empty string consume? Do you know?

Once you have thought about it, let's verify the answer by inspecting the managed heap with WinDbg. The code and the debugger session are as follows:

    static void Main(string[] args)
    {
        string s = string.Empty;
        Console.ReadLine();
    }

    0:000> !clrstack -l
    OS Thread Id: 0x308c (0)
            Child SP               IP Call Site
    ConsoleApp6.Program.Main(System.String[]) [C:\dream\Csharp\ConsoleApp1\ConsoleApp6\Program.cs @ 19]
        LOCALS:
            0x00000087391febd8 = 0x000002605da91420

    0:000> !DumpObj /d 000002605da91420
    Name:        System.String
    String:
    Fields:
                  MT    Field   Offset                 Type VT     Attr            Value Name
    00007ff9eb2b85a0  4000281        8         System.Int32  1 instance                0 m_stringLength
    00007ff9eb2b6838  4000282        c          System.Char  1 instance                0 m_firstChar
    00007ff9eb2b59c0  4000286       d8        System.String  0   shared           static Empty
                                     >> Domain:Value  000002605beb2230:NotInit  <<

    0:000> !objsize 000002605da91420
    sizeof(000002605da91420) = 32 (0x20) bytes (System.String)

As you can see from the output, an empty string takes up 32 bytes. Five million empty strings would therefore take 32 bytes x 5,000,000 = 160,000,000 bytes, about 152.6 MB. If you did not know that, it comes as a shock: these are just empty strings with nothing in them.
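The arithmetic above can be checked with a few lines of C#. This is only a sketch: the class name `EmptyStringEstimate` and the helper `ToMiB` are made up for illustration, and the 32-byte figure is simply the `!objsize` result from the session above, not something the program measures.

```csharp
using System;
using System.Globalization;

class EmptyStringEstimate
{
    // Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
    public static double ToMiB(long bytes) => bytes / 1024.0 / 1024.0;

    static void Main()
    {
        const long bytesPerEmptyString = 32;  // taken from the !objsize output
        const long count = 5_000_000;         // "500w" = 500 x 10,000

        long total = bytesPerEmptyString * count;
        Console.WriteLine(total);                                                      // 160000000
        Console.WriteLine(ToMiB(total).ToString("F2", CultureInfo.InvariantCulture)); // 152.59
    }
}
```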

2. Return to Trade

Now that the problem is laid out, let's go back to Trade. For demonstration purposes, I simulate 200,000 trades and read the order source from a file instead of from the database.

    using System;
    using System.IO;
    using System.Linq;

    class Program
    {
        static void Main(string[] args)
        {
            var trades = Enumerable.Range(0, 20 * 10000).Select(m => new Trade()
            {
                TradeID = m,
                // Each call reads the file again, so every trade gets its own string object.
                TradeFrom = File.ReadLines(Environment.CurrentDirectory + "//orderfrom.txt")
                                .ElementAt(m % 4)
            }).ToList();

            GC.Collect(); // for easier measurement, collect temporary objects first

            Console.WriteLine("Successful execution");
            Console.ReadLine();
        }
    }

    class Trade
    {
        public int TradeID { get; set; }
        public string TradeFrom { get; set; }
    }

Then use WinDbg to dump the managed heap and measure trades.

    0:000> !dumpheap -stat
    Statistics:
                  MT    Count    TotalSize Class Name
    00007ff9eb2b59c0   200200      7010246 System.String

    0:000> !objsize 0x000001a5860629a8
    sizeof(000001a5860629a8) = 16097216 (0xf59fc0) bytes (System.Collections.Generic.List`1[[ConsoleApp6.Trade, ConsoleApp6]])

From the output above, you can see that there are 200,200 strings on the managed heap: 200,000 allocated by the program plus 200 allocated by the runtime. The size of trades is 16097216 / 1024 / 1024 = 15.35 MB. That is the baseline.

Two: Analysis of compression techniques

1. Use a dictionary

There are 200,000 strings on the managed heap, but if you look closely, they are just the same four values repeated over and over (Taobao, WAP, and so on). That gives me room to optimize. Why not build an order-source dictionary while loading the data, and add a TradeFromID field to Trade that records the dictionary key? Because there are so few distinct values, a byte is enough. With this idea, the code can be modified as follows:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class Program
    {
        public static Dictionary<int, string> orderfromDict = new Dictionary<int, string>();

        static void Main(string[] args)
        {
            var trades = Enumerable.Range(0, 20 * 10000).Select(m =>
            {
                var tradefrom = File.ReadLines(Environment.CurrentDirectory + "//orderfrom.txt")
                                    .ElementAt(m % 4);

                // Look up the key already assigned to this order source (0 means not found).
                var id = orderfromDict.FirstOrDefault(k => k.Value == tradefrom).Key;

                if (id == 0)
                {
                    // First occurrence: register it and use the newly assigned key,
                    // so that TradeFromID always refers to an existing dictionary entry.
                    id = orderfromDict.Count + 1;
                    orderfromDict.Add(id, tradefrom);
                }

                return new Trade() { TradeID = m, TradeFromID = (byte)id };
            }).ToList();

            GC.Collect(); // for easier measurement, collect temporary objects first

            Console.WriteLine("Successful execution");
            Console.ReadLine();
        }
    }

    class Trade
    {
        public int TradeID { get; set; }
        public byte TradeFromID { get; set; }

        // Resolved through the dictionary on demand; no string is stored per trade.
        public string TradeFrom
        {
            get { return Program.orderfromDict[TradeFromID]; }
        }
    }

The code is still simple. Next, use WinDbg to see how much space was saved.

    0:000> !dumpheap -stat
    Statistics:
                  MT    Count    TotalSize Class Name
    00007ff9eb2b59c0      204        10386 System.String

    0:000> !clrstack -l
    OS Thread Id: 0x2ce4 (0)
            Child SP               IP Call Site
    ConsoleApp6.Program.Main(System.String[]) [C:\dream\Csharp\ConsoleApp1\ConsoleApp6\Program.cs @ 42]
        LOCALS:
            0x0000006f4d9ff078 = 0x0000016fdcf82ab8

    0000006f4d9ff288 00007ff9ecd96c93 [GCFrame: 0000006f4d9ff288]

    0:000> !objsize 0x0000016fdcf82ab8
    sizeof(0000016fdcf82ab8) = 6897216 (0x693e40) bytes (System.Collections.Generic.List`1[[ConsoleApp6.Trade, ConsoleApp6]])

As you can see from the output above, the managed heap now holds 204 strings: 4 allocated by the program plus 200 allocated by the runtime. The 4 are exactly the dictionary values. In terms of space, 6897216 / 1024 / 1024 = 6.58 MB, which is close to 60% smaller than the previous 15.35 MB.
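The dictionary idea can also be packaged as a small two-way lookup, which avoids scanning the dictionary values for every trade. The following is a minimal sketch, not the code used for the measurements above: `ByteCodec`, `Encode`, and `Decode` are hypothetical names, and it assumes there are at most 256 distinct values.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps each distinct string to a byte code and back.
class ByteCodec
{
    private readonly Dictionary<string, byte> _codes = new Dictionary<string, byte>();
    private readonly List<string> _values = new List<string>();

    // Returns the code already assigned to value, or assigns the next free one.
    public byte Encode(string value)
    {
        if (_codes.TryGetValue(value, out byte code)) return code;

        if (_values.Count > byte.MaxValue)
            throw new InvalidOperationException("More than 256 distinct values.");

        code = (byte)_values.Count;
        _codes.Add(value, code);
        _values.Add(value);
        return code;
    }

    // Returns the string that was assigned the given code.
    public string Decode(byte code) => _values[code];
}

class ByteCodecDemo
{
    static void Main()
    {
        var codec = new ByteCodec();
        byte wap = codec.Encode("WAP");
        byte taobao = codec.Encode("Taobao");

        Console.WriteLine(wap == codec.Encode("WAP")); // True
        Console.WriteLine(codec.Decode(taobao));       // Taobao
    }
}
```

A Trade would then keep only the byte returned by `Encode` and call `Decode` in its TradeFrom getter.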

Although this saves nearly 60%, it is an invasive optimization: I have to modify the Trade structure, define a Dictionary, and slightly change the business logic. Everyone knows that production code should not be touched unless it has to be, and if a change causes a problem, you are the one who takes the blame. So the real question is how to compress the space while keeping changes to a minimum. Is there anything that gives the best of both worlds?

2. Use the string intern pool

Put that way, it probably dawns on everyone: the intern pool exists to solve exactly this problem. Internally, the CLR maintains the same kind of dictionary mechanism that I just defined by hand. Duplicate strings do not need to be allocated again on the heap; only a reference to the existing string is stored. If you are not familiar with the intern pool, I suggest you read this article: https://www.cnblogs.com/huangxincheng/p/12799736.html

Next, you only need to wrap the TradeFrom value in String.Intern. The change could hardly be smaller. The code is as follows:
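To make the behavior concrete, here is a small sketch of what String.Intern does with two equal strings built at run time. The class name `InternDemo` and the helper `Fresh` are made up for illustration.

```csharp
using System;

class InternDemo
{
    // Builds a new string object at run time, so it is not a compile-time literal.
    public static string Fresh() => new string(new[] { 'W', 'A', 'P' });

    static void Main()
    {
        string a = Fresh();
        string b = Fresh();

        Console.WriteLine(a == b);                 // True: same characters
        Console.WriteLine(ReferenceEquals(a, b));  // False: two separate heap objects

        // After interning, both calls return the same pooled instance.
        Console.WriteLine(ReferenceEquals(string.Intern(a), string.Intern(b))); // True
    }
}
```

Once the un-interned duplicates are no longer referenced, the garbage collector can reclaim them, which is why only one copy per distinct value needs to stay alive.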

    static void Main(string[] args)
    {
        var trades = Enumerable.Range(0, 20 * 10000).Select(m => new Trade()
        {
            TradeID = m,
            TradeFrom = string.Intern(File.ReadLines(Environment.CurrentDirectory + "//orderfrom.txt")
                                          .ElementAt(m % 4)), // wrap the value in string.Intern
        }).ToList();

        GC.Collect(); // for easier measurement, collect temporary objects first

        Console.WriteLine("Successful execution");
        Console.ReadLine();
    }

Then inspect the managed heap with WinDbg.

    0:000> !dumpheap -stat
    Statistics:
                  MT    Count    TotalSize Class Name
    00007ff9eb2b59c0      204        10386 System.String

    0:000> !clrstack -l
    OS Thread Id: 0x13f0 (0)
            Child SP               IP Call Site
    ConsoleApp6.Program.Main(System.String[]) [C:\dream\Csharp\ConsoleApp1\ConsoleApp6\Program.cs @ 27]
        LOCALS:
            0x0000005e4d3ff0a8 = 0x000001f8a15129a8

    0000005e4d3ff2b8 00007ff9ecd96c93 [GCFrame: 0000005e4d3ff2b8]

    0:000> !objsize 0x000001f8a15129a8
    sizeof(000001f8a15129a8) = 8497368 (0x81a8d8) bytes (System.Collections.Generic.List`1[[ConsoleApp6.Trade, ConsoleApp6]])

With the intern pool, the size is 8497368 / 1024 / 1024 = 8.10 MB. You may wonder why this is about 23% larger than the dictionary version. Look closely: with the intern pool, the TradeFrom field of each Trade in the List<Trade> stores the address of the string on the heap, which takes 8 bytes on an x64 machine. In the dictionary version, Trade has no TradeFrom field at all and stores a single byte instead, so each trade holds 7 fewer bytes of field data. Let's confirm this with WinDbg.

    0:000> !da -length 1 -details 000001f8b16f9b68
    Name:        ConsoleApp6.Trade[]
    Size:        2097176(0x200018) bytes
    Array:       Rank 1, Number of elements 262144, Type CLASS
    Fields:
                  MT    Field   Offset                 Type VT     Attr            Value Name
    00007ff9eb2b85a0  4000001       10         System.Int32  1 instance                0 <TradeID>k__BackingField
    00007ff9eb2b59c0  4000002        8        System.String  0 instance 000001f8a1516030 <TradeFrom>k__BackingField

    0:000> !DumpObj /d 000001f8a1516030
    Name:        System.String
    String:      WAP

You can see that 000001f8a1516030 is the address of the string "WAP" on the heap, and that reference takes up 8 bytes.

Now go back to the dump of the dictionary version and check whether it has a <TradeFrom>k__BackingField field.

    0:000> !da -length 1 -details 000001ed52759ac0
    Name:        ConsoleApp6.Trade[]
    Size:        262168(0x40018) bytes
    Array:       Rank 1, Number of elements 32768, Type CLASS
    Fields:
                  MT    Field   Offset                 Type VT     Attr            Value Name
    00007ff9eb2b85a0  4000002        8         System.Int32  1 instance                0 <TradeID>k__BackingField
    00007ff9eb2b7d20  4000003        c          System.Byte  1 instance                0 <TradeFromID>k__BackingField
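As a cross-check, the two !objsize totals reported earlier can be turned into a per-trade difference. The sketch below only redoes the arithmetic on the measured numbers; the class name `TradeSizeCheck` and the helper `PerItemDelta` are made up for illustration. The result comes out at roughly 8 bytes per trade rather than 7, which would be consistent with objects being padded to 8-byte boundaries on x64. That explanation is an assumption; the dumps above do not show padding directly.

```csharp
using System;
using System.Globalization;

class TradeSizeCheck
{
    // Average difference in bytes per element between two totals.
    public static double PerItemDelta(long larger, long smaller, int count)
        => (larger - smaller) / (double)count;

    static void Main()
    {
        const long internTotal = 8497368; // !objsize of trades, String.Intern version
        const long dictTotal = 6897216;   // !objsize of trades, dictionary version
        const int trades = 20 * 10000;    // number of Trade objects in the list

        Console.WriteLine(internTotal - dictTotal); // 1600152
        Console.WriteLine(PerItemDelta(internTotal, dictTotal, trades)
            .ToString("F2", CultureInfo.InvariantCulture)); // 8.00
    }
}
```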

Three: Summary

Choose according to your own situation. The intern pool approach needs the smallest change and is simple and blunt. Building your own dictionary saves more memory, but it requires modifying business logic, and that risk is yours to weigh.


3 June 2020, 20:47 | Views: 4284
