After working through Philip Guo's excellent series of lectures on the CPython VM I found myself looking at the performance of string concatenation. Strings are immutable in python - so they cannot be modified in-place, you have to create a new string. This also affects string concatenation, so if we want to tack one string onto the end of another we might have some code like this:

    x = "hello"
    x += ", world!"

When the CPython interpreter is running the above code, it's doing something conceptually similar to the following:

allocate a buffer for the string "hello" 6 bytes => len("hello") + 1, and copy the string into it
make x reference this
allocate a new buffer with length 14 bytes => len("hello") + len(", world") + 1
copy "hello", then ", world" into this new buffer, and make x reference it
decrement reference on the old string, possibly free it

Note even though Python is unlike C in that it doesn't rely on a trailing NULL to terminate strings (we store each string's length), it does by convention still null-terminate them - which is where the extra "+ 1" byte comes from.

Anyway, if we have to repeatedly join strings together we have to be careful how we do it, just in case we accidentally introduce some slow, inefficient code. This has been long understood in the python community, for instance here is an article from 2004 which explored different methods of concatenating strings, and compared their performance.

I was curious if there was any difference on how things shaped up on modern hardware (the test used an old 400 MHz Celeron) and a newer version of Python (this used python 2.2.1, the latest version of python 2 is 2.7.12) - so I grabbed the source code and started playing around.

It needed a tiny bit of tweaking however - the timing module it uses doesn't seem exist in the standard library and current version in pypi isn't compatible. I modified it to use the time module, the source code is here on GitHub if you're interested: stest-new.py.

As a quick recap, there are 6 methods of concatenating strings which we're putting under the microscope:

Method 1: simple concatenation: s1 += s2

Method 2: concatenation using MutableString (s1 += s2, where s1 is a MutableString)

Method 3: appending to a long array of char

Method 4: building a list of strings, then calling "".join()

Method 5: writing to a cStringIO buffer in memory using write()

Method 6: same as Method 4 but using a list comprehension inline

The original tests were performed using Python 2.2.1, for comparison's sake I've re-run them on my computer just to see:

    $ for i in {1..6}; do python stest.py $i; done

The results for Python 2.2.1 are below:

	runtime (ms)	concatenations per second
Method 1	55.11	362,910
Method 2	74.67	267,852
Method 3	10.80	1,851,337
Method 4	6.21	3,220,611
Method 5	8.11	2,467,612
Method 6	5.42	3,694,808

So in Python 2.2.1 the list comprehension method was the fastest by a pretty clear margin. However when I re-ran using 2.7.12 things turned out very differently:

	runtime (ms)	concatenations per second
Method 1	1.7995	11,113,977
Method 2	90.1073	221,957
Method 3	3.9557	5,055,967
Method 4	2.1804	9,172,689
Method 5	4.8047	4,162,585
Method 6	1.4191	14,093,289

In the time since 2.2.1 the performance of the naïve string concatenation method has improved hugely, it's now it's the fastest method (note: this is using a Python interpreter I built, using the Python 2,7.12 package that comes with Debian it's actually the fastest). This is surprising, since I thought it was relatively well-established and understood that it was slow, or just not ... pythonic. I was curious exactly why Method 6 was now slower than Method 1, so I disassembled the functions using the dis module.

There were a number of SET_LINENO instructions in the 2.2.1 version which I've not shown - it makes the disassembly a little easier to read and the performance impact would have been negligible - when tracing is turned off (which it is) all this instruction did was set the current frame's f_lineno and continue executing the next instruction.

Disassembly - Method 1

Python 2.2.1	Python 2.7.12
6 LOAD_CONST 1 ('') 9 STORE_FAST 0 (out_str) 15 SETUP_LOOP 37 (to 55) 18 LOAD_GLOBAL 1 (xrange) 21 LOAD_GLOBAL 2 (loop_count) 24 CALL_FUNCTION 1 27 GET_ITER 31 FOR_ITER 20 (to 54) 34 STORE_FAST 1 (num) 40 LOAD_FAST 0 (out_str) 43 LOAD_FAST 1 (num) 46 UNARY_CONVERT 47 INPLACE_ADD 48 STORE_FAST 0 (out_str) 51 JUMP_ABSOLUTE 28 54 POP_BLOCK 58 LOAD_FAST 0 (out_str) 61 RETURN_VALUE	0 LOAD_CONST 1 ('') 3 STORE_FAST 0 (out_str) 6 SETUP_LOOP 31 (to 40) 9 LOAD_GLOBAL 0 (xrange) 12 LOAD_GLOBAL 1 (loop_count) 15 CALL_FUNCTION 1 18 GET_ITER 19 FOR_ITER 17 (to 39) 22 STORE_FAST 1 (num) 25 LOAD_FAST 0 (out_str) 28 LOAD_FAST 1 (num) 31 UNARY_CONVERT 32 INPLACE_ADD 33 STORE_FAST 0 (out_str) 36 JUMP_ABSOLUTE 19 39 POP_BLOCK 40 LOAD_FAST 0 (out_str) 43 RETURN_VALUE

Python 2.2.1

Python 2.7.12

 6 LOAD_CONST               1 ('')
 9 STORE_FAST               0 (out_str)
15 SETUP_LOOP              37 (to 55)
18 LOAD_GLOBAL              1 (xrange)
21 LOAD_GLOBAL              2 (loop_count)
24 CALL_FUNCTION            1
27 GET_ITER            
31 FOR_ITER                20 (to 54)
34 STORE_FAST               1 (num)
40 LOAD_FAST                0 (out_str)
43 LOAD_FAST                1 (num)
46 UNARY_CONVERT       
47 INPLACE_ADD         
48 STORE_FAST               0 (out_str)
51 JUMP_ABSOLUTE           28
54 POP_BLOCK           
58 LOAD_FAST                0 (out_str)
61 RETURN_VALUE

 0 LOAD_CONST               1 ('')
 3 STORE_FAST               0 (out_str)
 6 SETUP_LOOP              31 (to 40)
 9 LOAD_GLOBAL              0 (xrange)
12 LOAD_GLOBAL              1 (loop_count)
15 CALL_FUNCTION            1
18 GET_ITER            
19 FOR_ITER                17 (to 39)
22 STORE_FAST               1 (num)
25 LOAD_FAST                0 (out_str)
28 LOAD_FAST                1 (num)
31 UNARY_CONVERT       
32 INPLACE_ADD         
33 STORE_FAST               0 (out_str)
36 JUMP_ABSOLUTE           19
39 POP_BLOCK           
40 LOAD_FAST                0 (out_str)
43 RETURN_VALUE

So pretty much identical, which I half-expected - even though we're looking at two versions of Python which were released 14 years apart the compiler doesn't seem to get a great deal of changes, so the generated bytecode shouldn't change much for simple code.

Disassembly - Method 6

Python 2.2.1	Python 2.7.12
6 LOAD_CONST 1 ('') 9 LOAD_ATTR 0 (join) 12 BUILD_LIST 0 15 DUP_TOP 16 LOAD_ATTR 1 (append) 19 STORE_FAST 0 (_[1]) 22 LOAD_GLOBAL 3 (xrange) 25 LOAD_GLOBAL 4 (loop_count) 28 CALL_FUNCTION 1 31 GET_ITER 35 FOR_ITER 17 (to 55) 38 STORE_FAST 2 (num) 41 LOAD_FAST 0 (_[1]) 44 LOAD_FAST 2 (num) 47 UNARY_CONVERT 48 CALL_FUNCTION 1 51 POP_TOP 52 JUMP_ABSOLUTE 32 55 DELETE_FAST 0 (_[1]) 58 CALL_FUNCTION 1 61 STORE_FAST 1 (out_str) 67 LOAD_FAST 1 (out_str) 70 RETURN_VALUE	0 LOAD_CONST 1 ('') 3 LOAD_ATTR 0 (join) 6 BUILD_LIST 0 9 LOAD_GLOBAL 1 (xrange) 12 LOAD_GLOBAL 2 (loop_count) 15 CALL_FUNCTION 1 18 GET_ITER 19 FOR_ITER 13 (to 35) 22 STORE_FAST 0 (num) 25 LOAD_FAST 0 (num) 28 UNARY_CONVERT 29 LIST_APPEND 2 32 JUMP_ABSOLUTE 19 35 CALL_FUNCTION 1 38 STORE_FAST 1 (out_str) 41 LOAD_FAST 1 (out_str) 44 RETURN_VALUE

Python 2.2.1

Python 2.7.12

 6 LOAD_CONST               1 ('') 
 9 LOAD_ATTR                0 (join)
12 BUILD_LIST               0
15 DUP_TOP             
16 LOAD_ATTR                1 (append)
19 STORE_FAST               0 (_[1])
22 LOAD_GLOBAL              3 (xrange)
25 LOAD_GLOBAL              4 (loop_count)
28 CALL_FUNCTION            1
31 GET_ITER            
35 FOR_ITER                17 (to 55)
38 STORE_FAST               2 (num)
41 LOAD_FAST                0 (_[1])
44 LOAD_FAST                2 (num)
47 UNARY_CONVERT       
48 CALL_FUNCTION            1
51 POP_TOP             
52 JUMP_ABSOLUTE           32
55 DELETE_FAST              0 (_[1])
58 CALL_FUNCTION            1
61 STORE_FAST               1 (out_str)
67 LOAD_FAST                1 (out_str)
70 RETURN_VALUE

 0 LOAD_CONST               1 ('')
 3 LOAD_ATTR                0 (join)
 6 BUILD_LIST               0
 9 LOAD_GLOBAL              1 (xrange)
12 LOAD_GLOBAL              2 (loop_count)
15 CALL_FUNCTION            1
18 GET_ITER            
19 FOR_ITER                13 (to 35)
22 STORE_FAST               0 (num)
25 LOAD_FAST                0 (num)
28 UNARY_CONVERT       
29 LIST_APPEND              2
32 JUMP_ABSOLUTE           19
35 CALL_FUNCTION            1
38 STORE_FAST               1 (out_str)

41 LOAD_FAST                1 (out_str)
44 RETURN_VALUE

There are slightly more differences when comparing how 2.2.1 and 2.7.12 generate the bytecode for the list comprehension method. However, aside from a couple of quirks in the 2.2.1 version (i'm not sure why we call DUP_TOP on the list we created, and I have no idea what _[1] is) much of the bytecode is broadly the same - we produce a list of integers by applying CALL_FUNCTION to xrange with argument loop_count, and then iterate over the results, calling UNARY_CONVERT on each and assembling either a list or string using INPLACE_ADD or LIST_APPEND.

Since the generated bytecode contains no substantial differences, if we want to understand why the naive concatenation method (which uses INPLACE_ADD) became super-fast over the last 14 years we'll need to inspect how the Python VM interprets this code.

Analysis - INPLACE_ADD in Python 2.2.1

To save some space and time I'll skip straight to the meat and bones of where the actual string concatenation occurs - which is in Objects/stringobject.c, in the string_concat() function. It's quite small, so I've incuded it below - but I've stripped some macros for easier reading:

static PyObject *
string_concat(register PyStringObject *a, register PyObject *bb)
{
        register unsigned int size;
        register PyStringObject *op;
        if (!PyString_Check(bb)) {
                PyErr_Format(PyExc_TypeError,
                             "cannot concatenate 'str' and '%.200s' objects",
                             bb->ob_type->tp_name);
                return NULL;
        }
#define b ((PyStringObject *)bb)
        /* Optimize cases with empty left or right operand */
        if ((a->ob_size == 0 || b->ob_size == 0) &&
            PyString_CheckExact(a) && PyString_CheckExact(b)) {
                if (a->ob_size == 0) {
                        Py_INCREF(bb);
                        return bb;
                }
                Py_INCREF(a);
                return (PyObject *)a;
        }
        size = a->ob_size + b->ob_size;
        /* PyObject_NewVar is inlined */
        op = (PyStringObject *)
                PyObject_MALLOC(sizeof(PyStringObject) + size * sizeof(char));
        if (op == NULL)
                return PyErr_NoMemory();
        PyObject_INIT_VAR(op, &PyString_Type, size);
        memcpy(op->ob_sval, a->ob_sval, (int) a->ob_size);
        memcpy(op->ob_sval + a->ob_size, b->ob_sval, (int) b->ob_size);
        op->ob_sval[size] = '\0';
        return (PyObject *) op;
#undef b
}

This is pretty much exactly the algorithm I described in my first paragraph - after a couple of checks to ensure we're definitely dealing with strings we malloc a new buffer and memcpy both strings into it.

Analysis - INPLACE_ADD in Python 2.7.12

Again I'll skip straight to where the concatenation occurs - which is in Python/ceval.c in the string_concatenate() function:

static PyObject *
string_concatenate(PyObject *v, PyObject *w,
                   PyFrameObject *f, unsigned char *next_instr)
{
    /* This function implements 'variable += expr' when both arguments                         
       are strings. */
    Py_ssize_t v_len = PyString_GET_SIZE(v);
    Py_ssize_t w_len = PyString_GET_SIZE(w);
    Py_ssize_t new_len = v_len + w_len;
    if (new_len < 0) {
        PyErr_SetString(PyExc_OverflowError,
                        "strings are too large to concat");
        return NULL;
    }

    if (v->ob_refcnt == 2) {
        /* In the common case, there are 2 references to the value                             
         * stored in 'variable' when the += is performed: one on the                           
         * value stack (in 'v') and one still stored in the                                    
         * 'variable'.  We try to delete the variable now to reduce                            
         * the refcnt to 1.                                                                    
         */
        switch (*next_instr) {
        case STORE_FAST:
        {
            int oparg = PEEKARG();
            PyObject **fastlocals = f->f_localsplus;
            if (GETLOCAL(oparg) == v)
                SETLOCAL(oparg, NULL);
            break;
        }
        case STORE_DEREF:
        {
            PyObject **freevars = (f->f_localsplus +
                                   f->f_code->co_nlocals);
            PyObject *c = freevars[PEEKARG()];
            if (PyCell_GET(c) == v)
                PyCell_Set(c, NULL);
            break;
        }
        case STORE_NAME:
        {
            PyObject *names = f->f_code->co_names;
            PyObject *name = GETITEM(names, PEEKARG());
            PyObject *locals = f->f_locals;
            if (PyDict_CheckExact(locals) &&
                PyDict_GetItem(locals, name) == v) {
                if (PyDict_DelItem(locals, name) != 0) {
                    PyErr_Clear();
                }
            }
            break;
        }
        }
    }

    if (v->ob_refcnt == 1 && !PyString_CHECK_INTERNED(v)) {
        /* Now we own the last reference to 'v', so we can resize it                           
         * in-place.                                                                           
         */
        if (_PyString_Resize(&v, new_len) != 0) {
            /* XXX if _PyString_Resize() fails, 'v' has been                                   
             * deallocated so it cannot be put back into                                       
             * 'variable'.  The MemoryError is raised when there                               
             * is no value in 'variable', which might (very                                    
             * remotely) be a cause of incompatibilities.                                      
             */
            return NULL;
        }
        /* copy 'w' into the newly allocated area of 'v' */
        memcpy(PyString_AS_STRING(v) + v_len,
               PyString_AS_STRING(w), w_len);
        return v;
    }
    else {
        /* When in-place resizing is not an option. */
        PyString_Concat(&v, w);
        return v;
    }
}

This one is a little more complex because there are a couple of neat optimisations.

Firstly, if the next instruction to be executed is one of a handful (STORE_FAST, STORE_DEREF, STORE_NAME) we save ourselves a few cycles by doing a little setup ahead of time.

However more importantly, if there's only a single reference to the destination string we attempt to resize it in-place using PyString_Resize() instead of allocating an entirely new buffer. We can check this is the case by forcing this condition to be false:

    //if (v->ob_refcnt == 1 && !PyString_CHECK_INTERNED(v)) {                                  
    if (0) {
        /* Now we own the last reference to 'v', so we can resize it                           
         * in-place.                                                                           
         */

If we build a Python 2.7.12 compiler without this resize optimisation, and retest:

	runtime (ms)	concatenations per second
Method 1	48.7949	409,878
Method 2	89.2184	224,168
Method 3	3.9204	5,101,535
Method 4	2.1489	9,307,231
Method 5	4.8215	4,148,073
Method 6	1.4203	14,081,697

We're right back where we started, with Method 1 being the second-slowest and the list comprehension used in Method 6 outperforming everyone else.

Conclusion

In the intervening years since the original benchmark was performed, an optimisation has been introduced to improve how the CPython VM handles string concatenation. In the context of this specific benchmark it means that if you need to concatenate strings you can go with a naive, slightly more intuitive implementation and performance very close to the recommended, more-pythonic implementation.

It's very likely that this is not applicable in all cases, I'm sure that if we have a number of objects allocated on the heap then it'll get slower very quickly (we won't be able to extend that buffer every time, and will need to find new locations) - however it's still an interesting and surprising result nonetheless.

Sean McLemon

Tag cpython

Python - Efficient String Concatenation in Python (2016 edition)

Disassembly - Method 1

Disassembly - Method 6

Analysis - INPLACE_ADD in Python 2.2.1

Analysis - INPLACE_ADD in Python 2.7.12

Conclusion