Previously I'd done a bit of fiddling around with Python, revisiting some string concatenation benchmarks from an old-ish article and trying to explain some unexpected results. After playing around a bit with IronPython I was curious whether it'd be faster or slower than CPython on windows.
I installed the latest versions of both IronPython (2.7.5) and CPython (2.7.12) into my Windows 10 VM and re-ran the same set of tests.
The most interesting thing I learned was that some changes to how memory was allocated for the new buffer caused the "naive" method to be on par with the optimised version. As it turns out, IronPython doesn't actually have this - so running stest.py we get the following:
$ ipy64 stest.py 1 method 1 time 2406.60858154 ms output size 86 kb $ ipy64 stest.py 6 method 6 time 46.9284057617 ms output size 86 kb
IronPython is a totally different beast to CPython, so my previous method of debugging - taking the code and examining it with the dis module doesn't yield anything useful:
This is because it compiles it into a different set of instructions to be executed using the .NET CLR (it's important to note that it does not go directly to .NET IL, there's still a level of instructions above this similar to CPythons opcodes).
However since we're on Windows with .NET we do have Visual Studio - which is arguably easier than working through python bytecode instructions in a text editor. To begin with it's extremely simple to find out where exactly we spend most of our execution time using dotTrace by JetBrains:
So the program execution is split with roughly 50% spent in initialisation (ImportSite, again!) but that's not included in our benchmark, however the remaining 50% is spent in String.Concat() in mscorlib (source here) which is what we're interested in:
[System.Security.SecuritySafeCritical] // auto-generated public static String Concat(String str0, String str1) { Contract.Ensures(Contract.Result() != null); Contract.Ensures(Contract.Result().Length == (str0 == null ? 0 : str0.Length) + (str1 == null ? 0 : str1.Length)); Contract.EndContractBlock(); if (IsNullOrEmpty(str0)) { if (IsNullOrEmpty(str1)) { return String.Empty; } return str1; } if (IsNullOrEmpty(str1)) { return str0; } int str0Length = str0.Length; String result = FastAllocateString(str0Length + str1.Length); FillStringChecked(result, 0, str0); FillStringChecked(result, str0Length, str1); return result; }
This explains why things are so slow - when concatenating two strings there are no realloc-based tricks like CPython had - we allocate a new memory buffer every time, copy both strings into it, and let the garbage collector handle the old buffers.
Sadly it's pretty non-trivial for someone like me to implement a similar optimisation here - since we depend on the underlying string implementation in .NET we're stuck with how string concatenation was implemented there. I toyed with the idea of re-writing a hacky reimplementation of FastAllocateString as FastReallocateString specifically for the += operator (it's possible to do - we need to change PythonBinaryOperationBinder.BindDelegate() to handle Add and AddAssign differently) this would've involved getting stuck into the mscorlib sources in coreclr - and I'd rather stay in Python-land for the time being.
However since it's possible to access the .NET standard libraries from IronPython we can at least test how System.Text.StringBuilder performs, since it is designed to solve this very problem. So I setup the stest.py code I previously used, and re-ran them on my Windows 10 VM for both CPython 2.7.12 and IronPython 2.7.5. Just to quickly recap, here are the string concatenation methods I tested:
Method 1: simple concatenation: s1 += s2
Method 2: concatenation using MutableString (s1 += s2, where s1 is a MutableString)
Method 3: appending to a long array of char
Method 4: building a list of strings, then calling "".join()
Method 5: writing to a cStringIO buffer in memory using write()
Method 6: same as Method 4 but using a list comprehension inline
Method 7: using System.Text.StringBuilder (IronPython only)
runtime (ms) | concatenations per second | |
---|---|---|
method 1 | 16.00 | 1,250,000 |
method 2 | 108.99 | 183,503 |
method 3 | 14.99 | 1,334,222 |
method 4 | 3.00 | 6,666,666 |
method 5 | 6.00 | 3,333,333 |
method 6 | 2.00 | 10,000,000 |
runtime (ms) | concatenations per second | |
---|---|---|
method 1 | 2517.44 | 7,944 |
method 2 | 3968.87 | 5,039 |
method 3 | 25.39 | 787,711 |
method 4 | 42.13 | 474,721 |
method 5 | 35.56 | 562,429 |
method 6 | 33.22 | 602,046 |
method 7 | 22.43 | 891,662 |
Conclusion
So in IronPython the most efficient way to join strings together is to hook into .NET's System.Text library and use StringBuilder, no surprises there. What was surprising was how much slower IronPython was than CPython. I'm curious if this is just a freak result or if IronPython is known to be pretty slow. I'll probably not attempt to speed up the AddAssign op in IronPython, however I would like to investigate why IronPython is so slow, and if there are any plans to improve things. In addition I was surprised that the realloc trick had little-to-no effect on CPython in Windows (i.e. method 1 was slow even on 2.7.12).
I am a little sick of this benchmark now - I might revisit it one final time to compare it across CPython, IronPython, PyPy and Pyjion to complete the picture, but only if I'm really bored :)