I was recently asked to investigate the problem of high memory consumption by one of the backend services of our product. I cannot reveal the product and component names, their logic, and especially the code, so I'll call this service MxService and blur some areas in the screenshots.
The defect description contained a graph of MxService's memory consumption over a period of time:
As you can see, MxService occupied 3.5 GB of memory and then kept running without releasing it.
Of course, the first suspicion was a memory leak somewhere in the code. This is the kind of bug that is very difficult to analyze and detect. Here is my step-by-step investigation of the issue.
1. The first step is to reproduce the problem on my own server, so that I can debug, analyze the process memory in detail, enable trace logging, and so on. When reproducing, it is very important to follow precisely the steps that caused the problem: at this point you don't yet know the exact root cause, and any difference in environment or steps could prevent the issue from reproducing. In our case the main factor is to run the same MxService processing as in the reported defect. I checked the logs of the server where the problem had been reported, found out which tasks were executing, launched the same tasks on my own server, and left them running for several hours. When I checked the memory consumed by MxService, I saw that the problem had reproduced:
Cool! Having the issue reproduced on my own machine simplifies analysis greatly. In the original problem we saw 3.5 GB of occupied memory, while here we have about 1.2 GB, but the service was still increasing its memory consumption, and given more time it would have taken all it could.
2. Since MxService is a mix of managed and unmanaged code, we have to identify which part causes the memory leak (or huge memory consumption) to narrow down the root cause. For this I used the VMMap utility from Sysinternals:
As we can see, the problem is most likely in the managed part of MxService.
3. For further analysis I used a great tool, '.NET Memory Profiler'. It is feature-rich: it detects memory leaks, shows detailed memory usage per instance, compares process snapshots taken at different moments in time, etc. I launched MxService under the Memory Profiler, restarted the processing tasks, and took a base snapshot. Then I waited several hours while the managed heap grew to 435 MB, took a comparison snapshot, and analyzed the process memory. Here it is:
As we can see, about 75% of the managed heap (328 of 435 MB) is occupied by 5 instances of the class SomeWorker, and that memory is mostly held by ~11K instances of the class SomeData referenced by the SomeWorker objects. There are 11,388 instances of SomeData, which together hold 326 MB of memory, so one instance occupies about 30 KB. That is quite a lot, and we should check why it happens. The screenshot above gives us a hint that the issue is probably in holding compiled Regex instances.
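As a quick sanity check of the per-instance figure (the numbers are taken from the snapshot above):

```csharp
using System;

static class SnapshotMath
{
    static void Main()
    {
        long totalBytes = 326L * 1024 * 1024;  // 326 MB held by SomeData instances
        int instances = 11_388;                // instance count from the snapshot
        long perInstanceKb = totalBytes / instances / 1024;
        Console.WriteLine($"{perInstanceKb} KB per instance");  // roughly 30 KB
    }
}
```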
4. At this point we have enough information for code analysis:
public class SomeData
{
    readonly Regex Parser = new Regex(
        @"SOME COMPLEX REGEXP GOES HERE",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);

    // ...
}
Voila! Every instance of SomeData holds a reference to its own compiled Regex, and compiled regexes are known to be heavy memory consumers.
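To get a feel for the per-instance cost, here is a small standalone sketch (the class name and pattern are placeholders, not the real MxService code) that roughly measures how much managed memory each object with its own compiled Regex costs:

```csharp
using System;
using System.Text.RegularExpressions;

class PerInstanceRegex
{
    // Each instance compiles its own Regex, like the original SomeData.
    readonly Regex parser = new Regex(
        @"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b",   // placeholder pattern
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);

    public bool Matches(string s) => parser.IsMatch(s);
}

static class MemoryCheck
{
    static void Main()
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);

        var items = new PerInstanceRegex[1000];
        for (int i = 0; i < items.Length; i++)
            items[i] = new PerInstanceRegex();

        long after = GC.GetTotalMemory(forceFullCollection: true);
        Console.WriteLine($"~{(after - before) / items.Length} bytes per instance");
        GC.KeepAlive(items);
    }
}
```

The exact number depends on the pattern's complexity and the runtime version, but the point stands: each `new Regex(..., RegexOptions.Compiled)` builds its own parse tree and compiled matcher, so the cost is multiplied by the number of instances.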
5. Fixing the code.
In this case the fix is pretty simple. Since the regex is the same for all instances of SomeData, we can mark it as static so that it is shared by all SomeData objects. It is very important to make sure the shared object is thread-safe; fortunately, Regex is thread-safe, so the proposed fix is correct.
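With that constraint satisfied, the fixed class looks roughly like this (a sketch based on the fragment above; the actual pattern is omitted as before):

```csharp
using System.Text.RegularExpressions;

public class SomeData
{
    // static: one compiled Regex shared by all SomeData instances
    // instead of one per instance. Regex matching is thread-safe,
    // so sharing a single instance across threads is safe.
    static readonly Regex Parser = new Regex(
        @"SOME COMPLEX REGEXP GOES HERE",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);

    // ...
}
```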
6. Verifying the fix.
As you can see, the size of the managed heap has decreased by more than half (388 MB vs. 881 MB). QA also confirmed that the service's memory consumption dropped by more than half for the same processing load.
The initial suspicion of a memory leak proved to be wrong. Handy utilities (VMMap, .NET Memory Profiler) helped a lot in the investigation. The real root cause was non-optimal use of data structures in the code, and a simple, safe refactoring cut memory consumption by more than half.