June 25th, 2013 @ 15:30

One of the main side effects of my recent work on Printrun is that we now store the parsed G-code instead of the plain text one. I initially wrote and tested these changes on machines which were not memory tight (with either 8 or 16GB of RAM) and did not initially foresee that I was creating a huge memory hog. Indeed, instead of just storing the G-code commands as a list of strings, we now store a list of Python objects with various attributes, the raw G-code string being just one of them. Loading a 3.8MB file with about 140000 G-code lines lead to about 244MB of memory usage just for the parsed G-code, which is definitely not right (especially since some users have very large G-code files, more than 20 times bigger than the one I’m using).

Let’s look at how we can find out where we are using memory and what we can do to improve the memory usage.

Understanding the data structures

We basically have a large batch of Line objects. Each Line object is basically a storage hub for a single G-code command and the attributes we parsed from it: the command (as a string), X/Y/Z/E/F G-code arguments and absolute X/Y/Z position and a few meta-information, such as whether the coordinates were imperial, whether the move was relative or whether the command is going to involve the extruder.

These lines are accessed through three main paths: a list which references all the lines, a dictionary of Layer objects referencing all the G-codes started from a given Z (these Z are the layer indices) and a list of Layer objects, with one Layer object for each Z change (so a head lift during a move would create two new layers, the following commands going into the last created layer).

Profiling memory usage of Python code

I used two main approaches to profile the code: the first one is simply to use the sys.getsizeof function from Python standard library, which gives you the size of the object you provide it (not including the size of objects it references but including the size of the garbage collector metadata overhead), while the second one (which is actually the one I used the most, I only used the first one at a very late stage) is the memory inspector Heapy.

Heapy is very easy to use: just install guppy (the programming environment that provides Heapy), and then:

from guppy import hpy
h = hpy()
# the code you want to memory-profile
print h.heap()

Which should print something like (and this report is the starting point of our adventure, based on commit de02427):

Partition of a set of 2030760 objects. Total size = 256042336 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 139536   7 146214528  57 146214528  57 dict of printrun.gcoder.Line
     1 838522  41 43666288  17 189880816  74 str
     2 140075   7 23991064   9 213871880  84 list
     3 279047  14 21207472   8 235079352  92 tuple
     4 419904  21 10077696   4 245157048  96 float
     5 139536   7  8930304   3 254087352  99 printrun.gcoder.Line
     6  73040   4  1752960   1 255840312 100 int
     7    536   0   150080   0 255990392 100 dict of printrun.gcoder.Layer
     8    536   0    34304   0 256024696 100 printrun.gcoder.Layer
     9      4   0    13408   0 256038104 100 dict (no owner)

What we can see here is that 57% of the memory is used for the dictionaries that hold the Line objects attributes, 17% for storing strings, 9 for lists, 8 for tuples, 4 for floats. and 3 for the actual Line objects and 1 for integer objects.

Getting rid of temporary data

My first shot (which was even before using Heapy, so it might sound a little silly) was to drop data which had a limited lifetime. We were indeed storing the result of splitting the G-code string using a complex regular expression until the point where we consumed all the information from this split to parse the attributes of the command. Thus in 86bf8ca we started dropping this subresult (which was basically a tuple holding a bunch of strings) after using it, and reduced memory consumption by about 14%, down to 200MB:

Partition of a set of 1331825 objects. Total size = 210155496 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 139536  10 146214528  70 146214528  70 dict of printrun.gcoder.Line
     1 279047  21 21207472  10 167422000  80 tuple
     2 279103  21 16753624   8 184175624  88 str
     3 419904  32 10077696   5 194253320  92 float
     4 139536  10  8930304   4 203183624  97 printrun.gcoder.Line
     5    559   0  5016888   2 208200512  99 list
     6  73040   5  1752960   1 209953472 100 int
     7    536   0   150080   0 210103552 100 dict of printrun.gcoder.Layer
     8    536   0    34304   0 210137856 100 printrun.gcoder.Layer
     9      4   0    13408   0 210151264 100 dict (no owner)

The __slots__ magic

From there since I was not very sure of where to cut costs, I started using Heapy and figured out that most of our memory (70%) was being used by the dynamic dictionaries used to store Line objects attributes. This is an expected downside of Python’s flexibility, as you can add new attributes on the fly at any time, so it can’t just allocate a static amount of memory at object creation to store all the attributes. This is where the __slots__ class variable comes to the rescue: it lets you specify which attributes will ever be used on this object. We thus implemented this trick in 2a6eec7, further reducing memory consumption on our test file to 77MB, which is 62% less memory than before.

Partition of a set of 1192289 objects. Total size = 80685288 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 139536  12 25674624  32  25674624  32 printrun.gcoder.Line
     1 279047  23 21207472  26  46882096  58 tuple
     2 279103  23 16753624  21  63635720  79 str
     3 419904  35 10077696  12  73713416  91 float
     4    559   0  5016888   6  78730304  98 list
     5  73040   6  1752960   2  80483264 100 int
     6    536   0   150080   0  80633344 100 dict of printrun.gcoder.Layer
     7    536   0    34304   0  80667648 100 printrun.gcoder.Layer
     8      4   0    13408   0  80681056 100 dict (no owner)
     9      4   0     1120   0  80682176 100 dict of guppy.etc.Glue.Interface

Replacing lists of tuples with arrays

At this step the memory usage of Line objects felt pretty much right, still too high but back to a reasonable ratio compared to the other data structures. However, the costs of tuple objects now looks a little bit too high. We use tuples in two places, the first one being to store the current absolute X/Y/Z position at the end of each G-code line, the other being in a big list which is used to map between the line number and the layer number & position in this layer line list. The later can be easily replaced by two lists holding the same information without the tuples, or by two Python array, which are very compact storage arrays for standard types (characters/integers/floating point numbers) and thus have much less overhead than a list holding integer objects. Switching to integer arrays in 5e1a0be brought us down to 65.5MB, a new 15% improvement.

Removing more tuples

As explained in the previous part, we also had tuples storing the destination absolute position of each G-code line, which we simply split into three different properties in b27cb37, reducing memory consumption down to 57MB, a new 13% improvement.

Going C-style

So, here we are:

Partition of a set of 840204 objects. Total size = 59763064 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 139536  17 27907200  47  27907200  47 printrun.gcoder.Line
     1 279103  33 16753624  28  44660824  75 str
     2 419904  50 10077696  17  54738520  92 float
     3    558   0  3706096   6  58444616  98 list
     4      2   0  1116400   2  59561016 100 array.array
     5    536   0   150080   0  59711096 100 dict of printrun.gcoder.Layer
     6    536   0    34304   0  59745400 100 printrun.gcoder.Layer
     7      4   0    13408   0  59758808 100 dict (no owner)
     8      4   0     1120   0  59759928 100 dict of guppy.etc.Glue.Interface
     9      1   0     1048   0  59760976 100 dict of printrun.gcoder.GCode

47% of the memory is used by the Line objects, 28 by string objects and 17 by float objects. The strings take 16MB in memory (while we only have 3.8MB of input data, that’s a x4 factor, which is probably due to the fact that sys.getsizeof("") reports 40 bytes overhead per str objects), and each float takes 24 bytes, while a double should take no more than 8 bytes.
The solution we used for getting rid of all the overheads is to provide an extension module written in C, using Cython, which would store all the properties in a C structure. However, we had to take care of two issues: the first one was the need for a fallback mode if the Cython module isn’t available (so we had to keep the previous Python-based Line implementation and derive from either this PyLine or the Cython-based gcoder_line.Line object), while the second one was that Python properties could be either None or some value of the type we wanted, and that we use this behavior to detect whether an attribute was present in the G-code line or not (that is, if after parsing the command of an object called line the line.x attribute is None, then there was no X argument in the command). The solution to this second requirement is to store for each property an extra boolean telling wether the property was ever set or not, so that when the property is accessed we can return None if no value was set.

Let’s talk about implementation now. I first tried to implement argument access in a Pythonic way, hacking __setattr__ and __getattr__ to do the extra None-checking, as can be seen in a3b27af, but this solution was much slower (3 times slower) than the original Python code, though much lighter, down to 26MB, a new 54% improvement (though we lost track of some memory here, see below):

Partition of a set of 141504 objects. Total size = 27358912 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 139536  99 22325760  82  22325760  82 printrun.gcoder.Line

The solution to this slowdown, implemented in a511f20, was simply to use the properties mechanism from Cython, which lets you write getters and setters for a given attribute, which get compiled into very fast C code, while using the same amount of memory per-instance.

The next improvement was to compress all the boolean properties specifying whether a variable had been set into a single bitarray stored as a 4 bytes (32 bits) integer, which reduced the per-instance memory usage from 160 bytes to 144 (1c41072). We extended this trick to all the boolean properties in fc43d55, further reducing per-instance cost to 136 bytes.

Getting rid of the cyclic garbage collector

Though memory consumption had already been trimmed down a lot, I was a little bit confused because the Line objects were still taking 136 bytes, while the actual properties I had declared should have been taking about 97 bytes, and doing a manual sizeof on the C struct which was being compiled announced a size of 104 bytes (the 7 bytes difference is just a matter of memory padding). After looking around a lot and using the standard “get rid of one line of code at a time until things start making sense”, I figured out the 32 extra bytes were brought by the fact that one of my properties was a Python object (the tuple of strings from the G-code splitter regexp), and that referencing a Python triggered the addition of cyclic garbage collection metadata. Fine, I refactored the code to not store this information at all. The Cython-based Line objects then started being the right size, but the final Line objects (the ones which inherited from either the PyLine or Cython-based gcoder_line.Line) were still 32 bytes too large, even though they had no more memory __slots__ than their ancestors (so no other property than the parent’s could reference other Python objects, and as the Cython parent could not, the child class should not be able to reference other Python objects).

After investigating a lot more, I figured out that while the same derivation worked fine with new-style Python objects (the ones which derive from object class), the Cython objects were a little bit different and that deriving them would always pull in this GC metadata. I thus rewrote the code to not require the derivation (by moving the previous class methods to module methods) and after 77888d9 we had memory consumption down to 17.6MB on our test object.

After a few more tricks and fine tuning of the properties list (we got rid of super rare properties), the size of Line objects was reduced to 80 bytes, and the global memory consumption down to 15.3MB:

Partition of a set of 140384 objects. Total size = 16011448 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 139536  99 11162880  70  11162880  70 printrun.gcoder_line.GLine
     1    536   0  2394240  15  13557120  85 printrun.gcoder.Layer
     2      2   0  1313424   8  14870544  93 list
     3      2   0  1116400   7  15986944 100 array.array
     4      4   0    13408   0  16000352 100 dict (no owner)
     5    275   0     6600   0  16006952 100 float
     6      4   0     1120   0  16008072 100 dict of guppy.etc.Glue.Interface
     7      1   0     1048   0  16009120 100 dict of printrun.gcoder.GCode
     8      8   0      704   0  16009824 100 __builtin__.weakref
     9      1   0      448   0  16010272 100 types.FrameType

Adding Heapy support for Cython objects

However, the numbers I announced in the previous section are wrong. Indeed, one big issue with using Cython is that Heapy can’t automagically follow C pointers, and as we store the raw G-code string and G-code command (the “G1” or “M80” part for instance) as C strings, Heapy lost track of them, and we are using more memory than Heapy claims.
Luckily enough, Heapy provides a mechanism to specify a function which will take an object of the given type as an argument and will return the space used by the object and its properties, as documented on http://www.smira.ru/wp-content/uploads/2011/08/heapy.html
The only trick there was that the C code is only a subproduct of the Cython extension build process, and that I couldn’t add raw C-code to the Cython source, so that I had to write a patch which can be applied to the C source after it has been initially generated (the patch is available in 040b89f).

After this patch, we now have a correct estimation of the memory usage, and we have effectively reduced memory consumption from 244MB down to 19.4MB, which is a reduction factor of 12.5. We still use 5 times more memory than the initial file we load, but given how much metadata we store, it feels quite decent. I don’t expect anyone to load a G-code so big than 5 times the size would go over the memory limits of the person’s system.

Partition of a set of 140383 objects. Total size = 20369977 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 139536  99 15521497  76  15521497  76 printrun.gcoder_line.GLine
     1    536   0  2394240  12  17915737  88 printrun.gcoder.Layer
     2      2   0  1313424   6  19229161  94 list
     3      2   0  1116400   5  20345561 100 array.array
     4      4   0    13408   0  20358969 100 dict (no owner)
     5    275   0     6600   0  20365569 100 float
     6      4   0     1120   0  20366689 100 dict of guppy.etc.Glue.Interface
     7      1   0     1048   0  20367737 100 dict of printrun.gcoder.GCode
     8      7   0      616   0  20368353 100 __builtin__.weakref
     9      1   0      448   0  20368801 100 types.FrameType

So, long story short, Heapy is a very cool tool to profile the memory usage of Python code, Cython can help reducing your memory usage by letting you write extension modules very easily and that you should not forget about Python’s cyclic garbage collector when investigating your objects’ memory usage.