Scott Wolchok

Parameter Passing in C and C++

Posted at — Mar 7, 2021

Now that we know how to read assembly language, we can talk about how parameter passing works at the machine level and its consequences for writing faster code. We will focus on x86_64, but ARM64 works in a roughly similar way.

What does “parameter passing” mean?

Let’s say you’ve written a large program, and in it you have two files:

// Something.h

int doSomething(int x, int y);

// OtherStuff.cpp
#include "Something.h"

void doOtherStuff() {
  // ...
  doSomething(123, 456);
  // ...
}

The fundamental question of parameter passing is this: at the assembly language level, what does doOtherStuff need to do to pass 123 and 456 to doSomething as x and y? There is no single machine instruction for “call this function with these arguments”; somebody had to decide how that would work and document the calling convention for x86_64!

On x86_64 Linux and macOS (among other OSes), these decisions follow the System V AMD64 ABI. If you really want to, you can stop reading now and go read that document instead. However, it is a long, technical specification and I’ve personally never read it straight through. In the rest of this article, I’ll walk through some example code, summarize how parameter passing works in each case, and talk through the implications for how you structure your code.

Primitive types

Consider this function that takes far too many arguments:

#include <cstdint>

int takePrimitives(
    int8_t intArg1,
    int16_t intArg2,
    int32_t intArg3,
    int64_t intArg4,
    const char *intArg5,
    int &intArg6,
    float floatArg1,
    double floatArg2,
    int64_t intArg7,
    int64_t intArg8,
    int64_t intArg9,
    int64_t intArg10);

static int x = 123456;
void callWithPrimitives() {
    takePrimitives(1, 2, 3, 4, "hello", x, 0.5, 0.25, 6, 7, 8, 9);
}

and the corresponding generated assembly:

.LCPI0_0:
        .long   0x3f000000                      # float 0.5
.LCPI0_1:
        .quad   0x3fd0000000000000              # double 0.25
callWithPrimitives():                # @callWithPrimitives()
        pushq   %rax
        movss   .LCPI0_0(%rip), %xmm0           # xmm0 = mem[0],zero,zero,zero
        movsd   .LCPI0_1(%rip), %xmm1           # xmm1 = mem[0],zero
        movl    $4, %ecx
        movl    $.L.str, %r8d
        movl    $x, %r9d
        movl    $1, %edi
        movl    $2, %esi
        movl    $3, %edx
        pushq   $9
        pushq   $8
        pushq   $7
        pushq   $6
        callq   takePrimitives(signed char, short, int, long, char const*, int&, float, double, long, long, long, long)
        addq    $32, %rsp
        popq    %rax
        retq
.L.str:
        .asciz  "hello"

x:
        .long   123456                          # 0x1e240

This lays out parameter passing for primitives nicely. The first 6 integer or pointer arguments (intArg1 through intArg6 in the example) go into %rdi, %rsi, %rdx, %rcx, %r8, and %r9, in that order, regardless of whether they are 1, 2, 4, or 8 bytes in size. C++ reference parameters are represented as pointers. The first 8 floating point arguments go in registers %xmm0 through %xmm7 (here, we use only %xmm0 and %xmm1). Remaining arguments are pushed onto the stack from right to left.
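To make the rule concrete, here is a small sketch that models it in code. It is a deliberately simplified model, not the ABI's actual classification algorithm: it handles primitive arguments only, and the names ArgClass, Where, Location, and assignLocations are made up for this illustration.

```cpp
#include <cstddef>

// Simplified model: primitive arguments only (no structs, no varargs).
enum class ArgClass { Integer, Float };  // Integer also covers pointers and references
enum class Where { IntReg, FloatReg, Stack };

struct Location {
    Where where;
    // IntReg:   0..5 means %rdi, %rsi, %rdx, %rcx, %r8, %r9
    // FloatReg: 0..7 means %xmm0..%xmm7
    // Stack:    position among the stack-passed arguments, left to right
    int index;
};

template <std::size_t N>
struct Locations {
    Location at[N];
};

// Hypothetical helper: walk the arguments left to right and hand out
// registers until each register file runs out, then fall back to the stack.
template <std::size_t N>
constexpr Locations<N> assignLocations(const ArgClass (&args)[N]) {
    Locations<N> out{};
    int nextInt = 0, nextFloat = 0, nextStack = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (args[i] == ArgClass::Integer && nextInt < 6) {
            out.at[i] = Location{Where::IntReg, nextInt++};
        } else if (args[i] == ArgClass::Float && nextFloat < 8) {
            out.at[i] = Location{Where::FloatReg, nextFloat++};
        } else {
            out.at[i] = Location{Where::Stack, nextStack++};
        }
    }
    return out;
}

// The argument classes of takePrimitives, in declaration order.
constexpr ArgClass kTakePrimitivesArgs[] = {
    ArgClass::Integer, ArgClass::Integer, ArgClass::Integer,
    ArgClass::Integer, ArgClass::Integer, ArgClass::Integer,
    ArgClass::Float,   ArgClass::Float,
    ArgClass::Integer, ArgClass::Integer, ArgClass::Integer,
    ArgClass::Integer,
};
constexpr auto kTakePrimitivesLocations = assignLocations(kTakePrimitivesArgs);
```

Under this model, intArg7 through intArg10 land on the stack because the six integer registers are already taken, which matches the four pushq instructions in the assembly above.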

Consequence: consistent argument order & position

Suppose that you have a common series of arguments that you need to pass to several functions in a row. It would be ideal if those arguments stayed in the same order and positions:

#include <cstdint>

struct Color {
    int r, g, b, a;
};

Color makeColor(int red, int green, int blue, int alpha);

Color makeColor(int red, int green, int blue) {
    return makeColor(red, green, blue, 255);
}

Color makeColorBad(int alpha, int red, int green, int blue);

Color makeColorBad(int red, int green, int blue) {
    return makeColorBad(255, red, green, blue);
}

and the generated assembly:

makeColor(int, int, int):                        # @makeColor(int, int, int)
        movl    $255, %ecx
        jmp     makeColor(int, int, int, int)                # TAILCALL
makeColorBad(int, int, int):                    # @makeColorBad(int, int, int)
        movl    %edx, %ecx
        movl    %esi, %edx
        movl    %edi, %esi
        movl    $255, %edi
        jmp     makeColorBad(int, int, int, int)            # TAILCALL

makeColorBad has to shift red, green, and blue to make everything end up in the correct registers, whereas makeColor just puts 255 into the appropriate register and continues on.1
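One way to apply this in your own code is to append new parameters at the end of an overload set, so that the shared leading arguments never change position. The blend functions below are hypothetical and exist only to show the pattern; if the calls get inlined, the argument order stops mattering.

```cpp
// Hypothetical fixed-point blend: weight/scale selects how much of
// overlay to mix into base.
constexpr long blend(long base, long overlay, long weight, long scale) {
    return (base * (scale - weight) + overlay * weight) / scale;
}

// base, overlay, and weight keep their positions (and registers);
// only the trailing scale has to be materialized.
constexpr long blend(long base, long overlay, long weight) {
    return blend(base, overlay, weight, 256);
}

// Same idea one level up: only weight is new.
constexpr long blend(long base, long overlay) {
    return blend(base, overlay, 128);
}
```

Each forwarding overload only needs to set one extra register before jumping to the next function, just like makeColor above.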

Returning primitives

Returning primitives is even simpler than passing them. Here’s a quick example:

#include <cstdint>

int64_t getInt();
double getDouble();

void sinkInt(int64_t);
// Primitives should normally be passed by value;
// using reference here because the return register
// for double is %xmm0 and so is the first argument
// register, so the assembly would not show %xmm0 at
// all if passed by value.
void sinkDouble(const double&);

void demonstrateResultReturn() {
    sinkInt(getInt());
    sinkDouble(getDouble());
}

and the generated assembly:

demonstrateResultReturn():           # @demonstrateResultReturn()
        pushq   %rax
        callq   getInt()
        movq    %rax, %rdi
        callq   sinkInt(long)
        callq   getDouble()
        movsd   %xmm0, (%rsp)
        movq    %rsp, %rdi
        callq   sinkDouble(double const&)
        popq    %rax
        retq

We can see that integer/pointer return values go in %rax and floating-point return values go in %xmm0.

Structs

Now let’s see how structs work. Here are the different ways we could pass a 2-dimensional and 3-dimensional integer vector to a function:

#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
};

struct Vec3 {
    int64_t x;
    int64_t y;
    int64_t z;
};

void takeVec2ByValue(Vec2);
void takeVec2ByPointer(Vec2*);
void takeVec2ByConstRef(const Vec2&);
void takeVec3ByValue(Vec3);
void takeVec3ByPointer(Vec3*);
void takeVec3ByConstRef(const Vec3&);

void callVec2ByValue() {
    Vec2 v{1, 2};
    takeVec2ByValue(v);
}

void callVec2ByPointer() {
    Vec2 v{1, 2};
    takeVec2ByPointer(&v);
}

void callVec2ByConstRef() {
    Vec2 v{1, 2};
    takeVec2ByConstRef(v);
}

void callVec3ByValue() {
    Vec3 v{1, 2, 3};
    takeVec3ByValue(v);
}

void callVec3ByPointer() {
    Vec3 v{1, 2, 3};
    takeVec3ByPointer(&v);
}

void callVec3ByConstRef() {
    Vec3 v{1, 2, 3};
    takeVec3ByConstRef(v);
}

and here is the generated assembly:

callVec2ByValue():                   # @callVec2ByValue()
        movl    $1, %edi
        movl    $2, %esi
        jmp     takeVec2ByValue(Vec2)        # TAILCALL
callVec2ByConstRef():                # @callVec2ByConstRef()
        subq    $24, %rsp
        movups  .L__const.callVec2ByPointer().v(%rip), %xmm0
        movaps  %xmm0, (%rsp)
        movq    %rsp, %rdi
        callq   takeVec2ByConstRef(Vec2 const&)
        addq    $24, %rsp
        retq
callVec2ByPointer():                 # @callVec2ByPointer()
        subq    $24, %rsp
        movups  .L__const.callVec2ByPointer().v(%rip), %xmm0
        movaps  %xmm0, (%rsp)
        movq    %rsp, %rdi
        callq   takeVec2ByPointer(Vec2*)
        addq    $24, %rsp
        retq
callVec3ByValue():                   # @callVec3ByValue()
        subq    $24, %rsp
        movq    .L__const.callVec3ByPointer().v+16(%rip), %rax
        movq    %rax, 16(%rsp)
        movups  .L__const.callVec3ByPointer().v(%rip), %xmm0
        movups  %xmm0, (%rsp)
        callq   takeVec3ByValue(Vec3)
        addq    $24, %rsp
        retq
callVec3ByConstRef():                # @callVec3ByConstRef()
        subq    $24, %rsp
        movq    .L__const.callVec3ByPointer().v+16(%rip), %rax
        movq    %rax, 16(%rsp)
        movups  .L__const.callVec3ByPointer().v(%rip), %xmm0
        movaps  %xmm0, (%rsp)
        movq    %rsp, %rdi
        callq   takeVec3ByConstRef(Vec3 const&)
        addq    $24, %rsp
        retq
callVec3ByPointer():                 # @callVec3ByPointer()
        subq    $24, %rsp
        movq    .L__const.callVec3ByPointer().v+16(%rip), %rax
        movq    %rax, 16(%rsp)
        movups  .L__const.callVec3ByPointer().v(%rip), %xmm0
        movaps  %xmm0, (%rsp)
        movq    %rsp, %rdi
        callq   takeVec3ByPointer(Vec3*)
        addq    $24, %rsp
        retq
.L__const.callVec2ByPointer().v:
        .quad   1                               # 0x1
        .quad   2                               # 0x2

.L__const.callVec3ByPointer().v:
        .quad   1                               # 0x1
        .quad   2                               # 0x2
        .quad   3                               # 0x3

For our Vec2, passing by value is just like passing the two elements separately: they go in registers, assuming there aren’t too many other arguments. Passing by const reference is different: we create the Vec2 on the stack (in this case, by copying its data from a constant) and then pass a pointer to it in the first integer argument register, %rdi. This is identical to explicitly passing a pointer to a Vec2 on the stack.

Vec3 is different, because structs larger than two 8-byte words cannot go in registers. To pass a Vec3 by value, we push it onto the stack, and the called function knows that it will find its argument there. In contrast, when passing by const reference or pointer, we still create our Vec3 on the stack, but we must also explicitly pass a pointer to it in %rdi. This is because the referred-to Vec3 could, of course, be anywhere in memory; it doesn’t have to be on the caller’s stack.
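If you want a quick compile-time hint about which side of this threshold a struct falls on, a rough check might look like the sketch below. The helper likelyPassedInRegisters is a made-up name and only an approximation: it assumes plain structs of integers, pointers, floats, and doubles, and it ignores long double, packed or vector types, and the case where the argument registers have already been used up.

```cpp
#include <cstdint>
#include <type_traits>

struct Vec2 {
    int64_t x;
    int64_t y;
};

struct Vec3 {
    int64_t x;
    int64_t y;
    int64_t z;
};

// Rough approximation of "fits in at most two 8-byte registers".
// Types with non-trivial copy or destruction are excluded because the
// C++ ABI treats them differently.
template <typename T>
constexpr bool likelyPassedInRegisters() {
    return sizeof(T) <= 16 &&
           std::is_trivially_copyable<T>::value &&
           std::is_trivially_destructible<T>::value;
}

static_assert(likelyPassedInRegisters<Vec2>(), "16 bytes: two registers");
static_assert(!likelyPassedInRegisters<Vec3>(), "24 bytes: goes through memory");
```

The generated assembly is still the final authority; this check is only a cheap first guess.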

Aside: structure packing

Note that structs are laid out by packing fields together as close as possible, subject to alignment requirements (see “Aggregates and Unions” in Section 3.1.2 of the System V AMD64 ABI document). Let’s take a quick look at what would happen if we used int32_t instead of int64_t in our previous example:

#include <cstdint>

struct Vec2 {
    int32_t x;
    int32_t y;
};

struct Vec3 {
    int32_t x;
    int32_t y;
    int32_t z;
};

void takeVec2ByValue(Vec2);
void takeVec2ByConstRef(const Vec2&);
void takeVec3ByValue(Vec3);
void takeVec3ByConstRef(const Vec3&);

void callVec2ByValue() {
    Vec2 v{1, 2};
    takeVec2ByValue(v);
}

void callVec2ByConstRef() {
    Vec2 v{1, 2};
    takeVec2ByConstRef(v);
}

void callVec3ByValue() {
    Vec3 v{1, 2, 3};
    takeVec3ByValue(v);
}

void callVec3ByConstRef() {
    Vec3 v{1, 2, 3};
    takeVec3ByConstRef(v);
}

We can see from the generated assembly that Vec2 now fits in one register and Vec3 fits in two registers; we don’t waste space by using separate registers for each field.
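The sizes behind that observation can be checked at compile time. This sketch assumes the x86_64 System V layout rules; Vec2i, Vec3i, Padded, and eightbytesNeeded are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

struct Vec2i {
    int32_t x;
    int32_t y;
};

struct Vec3i {
    int32_t x;
    int32_t y;
    int32_t z;
};

// A counterexample where alignment forces padding: value must start
// on an 8-byte boundary, so 7 bytes after tag are unused.
struct Padded {
    int8_t tag;
    int64_t value;
};

// Number of 8-byte chunks the struct occupies.
template <typename T>
constexpr std::size_t eightbytesNeeded() {
    return (sizeof(T) + 7) / 8;
}

static_assert(sizeof(Vec2i) == 8, "two int32_t fields pack into 8 bytes");
static_assert(sizeof(Vec3i) == 12, "three int32_t fields pack into 12 bytes");
static_assert(offsetof(Vec3i, z) == 8, "z starts the second 8-byte chunk");
static_assert(offsetof(Padded, value) == 8, "padding inserted after tag");
```

Vec2i needs one 8-byte chunk and Vec3i needs two, which lines up with the one-register and two-register behavior described above.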

Returning structs

Returning structs has a similar discontinuity going from Vec2 to Vec3:

#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
};

struct Vec3 {
    int64_t x;
    int64_t y;
    int64_t z;
};

Vec2 globalVec2;
Vec3 globalVec3;

Vec2 getVec2();
Vec3 getVec3();
void getVec3ByOutParameter(Vec3 *out);

void testReturningVec2() {
    globalVec2 = getVec2();
}

void testReturningVec3() {
    globalVec3 = getVec3();
}

void testVec3OutParameter() {
    Vec3 out;
    getVec3ByOutParameter(&out);
    globalVec3 = out;
}

and the generated assembly:

testReturningVec2():                 # @testReturningVec2()
        pushq   %rax
        callq   getVec2()
        movq    %rax, globalVec2(%rip)
        movq    %rdx, globalVec2+8(%rip)
        popq    %rax
        retq
testReturningVec3():                 # @testReturningVec3()
        subq    $24, %rsp
        movq    %rsp, %rdi
        callq   getVec3()
        movq    16(%rsp), %rax
        movq    %rax, globalVec3+16(%rip)
        movups  (%rsp), %xmm0
        movups  %xmm0, globalVec3(%rip)
        addq    $24, %rsp
        retq
testVec3OutParameter():              # @testVec3OutParameter()
        subq    $24, %rsp
        movq    %rsp, %rdi
        callq   getVec3ByOutParameter(Vec3*)
        movq    16(%rsp), %rax
        movq    %rax, globalVec3+16(%rip)
        movups  (%rsp), %xmm0
        movups  %xmm0, globalVec3(%rip)
        addq    $24, %rsp
        retq
globalVec2:
        .zero   16

globalVec3:
        .zero   24

Just like for parameter passing, small structs like Vec2 get returned in registers %rax and %rdx, in that order.2 Larger structs like Vec3 get returned on the stack: the calling function makes a temporary Vec3 and passes a pointer to it in %rdi, which the called function fills in. Notice that the assembly for testReturningVec3 is identical to the assembly for testVec3OutParameter!
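At the source level, this suggests that the two styles are interchangeable for a struct of this size. Here is a minimal sketch with made-up function names (makeVec3, fillVec3, makeVec3ViaOutParameter, sameVec3) showing both forms producing the same value; it does not claim anything about what a particular compiler will emit.

```cpp
#include <cstdint>

struct Vec3 {
    int64_t x;
    int64_t y;
    int64_t z;
};

// Return by value: the caller provides hidden storage for the result.
constexpr Vec3 makeVec3() {
    return Vec3{1, 2, 3};
}

// Explicit out parameter: the caller provides the storage visibly.
constexpr void fillVec3(Vec3* out) {
    out->x = 1;
    out->y = 2;
    out->z = 3;
}

constexpr Vec3 makeVec3ViaOutParameter() {
    Vec3 v{};
    fillVec3(&v);
    return v;
}

constexpr bool sameVec3(const Vec3& a, const Vec3& b) {
    return a.x == b.x && a.y == b.y && a.z == b.z;
}
```

Since the by-value form is usually easier to read and to use in expressions, the matching assembly above is a reasonable argument for preferring it.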

What about floating point?

If our Vec structs contained double instead of int64_t, things would work out similarly:

#include <cstdint>

struct Vec2 {
    double x;
    double y;
};

struct Vec3 {
    double x;
    double y;
    double z;
};

void takeVec2ByValue(Vec2);
void takeVec2ByConstRef(const Vec2&);
void takeVec3ByValue(Vec3);
void takeVec3ByConstRef(const Vec3&);

void callVec2ByValue() {
    Vec2 v{1, 2};
    takeVec2ByValue(v);
}

void callVec2ByConstRef() {
    Vec2 v{1, 2};
    takeVec2ByConstRef(v);
}

void callVec3ByValue() {
    Vec3 v{1, 2, 3};
    takeVec3ByValue(v);
}

void callVec3ByConstRef() {
    Vec3 v{1, 2, 3};
    takeVec3ByConstRef(v);
}

The generated assembly (not included below for brevity) is nearly identical to the assembly for the int64_t case, except that callVec2ByValue puts x and y in %xmm0 and %xmm1 instead of %rdi and %rsi, just as if they were floating-point primitives not contained in a struct.

Returning structs with floating-point elements

As you might expect, returning our new Vec2 is very similar to the previous case as well, using %xmm0 and %xmm1 again instead of %rax and %rdx.

Furthermore, if we have one double and one int64_t, they get returned in %xmm0 and %rax, but it is still the case that a 24-byte struct gets returned in memory, not registers, even if it is a mix of doubles and integers.

Consequence: single out parameter vs. struct for returning 2 values

One way that structure return can make a difference is when considering how to return multiple values from a function. You might do it by returning a struct, or you might return one value normally and one via an out parameter:

#include <utility>

// Let's not worry about the best way to read things from files.
struct FileHandle;

// Returns whether read succeeded without error; read integer goes
// into *out.
bool readIntFromFile(FileHandle& f, int* out);

std::pair<int, bool> readIntFromFile(FileHandle& f);

int lastIntRead;

void readIntOutParameter(FileHandle& f) {
    int myInt;
    if (readIntFromFile(f, &myInt)) {
        lastIntRead = myInt;
    }
}

void readIntStruct(FileHandle& f) {
    auto p = readIntFromFile(f);
    if (p.second) {
      lastIntRead = p.first;
    }
}

Here’s the generated assembly:

readIntOutParameter(FileHandle&):   # @readIntOutParameter(FileHandle&)
        pushq   %rax
        leaq    4(%rsp), %rsi
        callq   readIntFromFile(FileHandle&, int*)
        testb   %al, %al
        je      .LBB0_2
        movl    4(%rsp), %eax
        movl    %eax, lastIntRead(%rip)
.LBB0_2:
        popq    %rax
        retq
readIntStruct(FileHandle&):         # @readIntStruct(FileHandle&)
        pushq   %rax
        callq   readIntFromFile(FileHandle&)
        btq     $32, %rax
        jae     .LBB1_2
        movl    %eax, lastIntRead(%rip)
.LBB1_2:
        popq    %rax
        retq
lastIntRead:
        .long   0                               # 0x0

If we use an out parameter, we have to reserve stack space for it, pass a pointer to that stack space as the out parameter, and then load the value from memory in order to use it. If we return multiple values in a struct that fits in registers, we can access them without touching memory again. On modern processors, arithmetic operations on the CPU tend to be significantly faster than memory access, so I would guess that the struct return option would be slightly better. The important point to take away, though, is that it’s not worse, so if you think it’s more readable, go for it!
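If std::pair's first and second feel too anonymous, a small named struct works the same way as long as it stays within 16 bytes. The following is a sketch using hypothetical names (ParsedDigit, parseDigit, digitOr), not code from the example above.

```cpp
// Small enough to come back in a register: 4 bytes of int, 1 byte of
// bool, and 3 bytes of padding.
struct ParsedDigit {
    int value;
    bool ok;
};

// Hypothetical parser: returns the digit and whether parsing succeeded.
constexpr ParsedDigit parseDigit(char c) {
    return (c >= '0' && c <= '9') ? ParsedDigit{c - '0', true}
                                  : ParsedDigit{0, false};
}

// The caller reads both fields without any out parameter.
constexpr int digitOr(char c, int fallback) {
    const ParsedDigit p = parseDigit(c);
    return p.ok ? p.value : fallback;
}

static_assert(sizeof(ParsedDigit) <= 16, "small enough for register return");
```

The field names document the meaning of each value at the call site, which is the readability benefit mentioned above.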

The Itanium ABI: how C++ types work

The System V AMD64 ABI document focuses on the C programming language. How to represent C++ types on x86_64 Linux and macOS is governed instead by the Itanium C++ ABI. In particular, Chapter 3 of that document “describes how to define and call functions.”

The this pointer

First, let’s talk about C++ member functions. Where does the this pointer come from? It turns out that it is a secret extra first argument to all C++ member functions:

struct GetThis {
    const void *getThis() const;
};

const void *GetThis::getThis() const {
    return this;
}

const void *equivalentFunction(const void *p) {
    return p;
}

We can see from the generated assembly that GetThis::getThis() const and equivalentFunction work exactly the same way.
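The same equivalence can be seen at the source level. In this sketch, Counter, plusFree, and kCounter are hypothetical names: a member function and a free function taking an explicit pointer compute the same thing, and this is simply the address of the object the member function was called on.

```cpp
struct Counter {
    int count;

    // Member function: receives the object's address implicitly as this.
    constexpr int plus(int n) const { return count + n; }

    constexpr const Counter* self() const { return this; }
};

// Free function: receives the same address as an explicit first argument.
constexpr int plusFree(const Counter* self, int n) {
    return self->count + n;
}

constexpr Counter kCounter{5};

static_assert(kCounter.self() == &kCounter, "this is the object's address");
static_assert(kCounter.plus(2) == plusFree(&kCounter, 2), "same computation");
```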

Consequence: *this has to go in memory

There is an immediate, arguably-disappointing consequence to this ABI for member functions. We saw previously that small structs can (copy constructors permitting, as we’ll see later) be passed in registers. However, if you call a non-inline member function on a small C++ object, this must be a pointer, so the object must go in memory, even if it started life in registers!

Let’s revisit the debugPrint example from How to Read Assembly Language:

#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
    void debugPrint() const;
};

int64_t normSquared(Vec2 v) {
    v.debugPrint();
    return v.x * v.x + v.y * v.y;
}

and its generated assembly:

        subq    $24, %rsp
        movq    %rdi, 8(%rsp)
        movq    %rsi, 16(%rsp)
        leaq    8(%rsp), %rdi
        callq   Vec2::debugPrint() const
        movq    8(%rsp), %rcx
        movq    16(%rsp), %rax
        imulq   %rcx, %rcx
        imulq   %rax, %rax
        addq    %rcx, %rax
        addq    $24, %rsp
        retq

We are forced to copy %rdi and %rsi (which hold our Vec2) to the stack so that we can supply the this pointer to Vec2::debugPrint() const! If we had instead written void Vec2DebugPrint(Vec2 v), we could just call it directly:

#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
};

void Vec2DebugPrint(Vec2 v);

int64_t normSquared(Vec2 v) {
    Vec2DebugPrint(v);
    return v.x * v.x + v.y * v.y;
}

Looking at the new generated assembly:

        pushq   %r14
        pushq   %rbx
        pushq   %rax
        movq    %rsi, %r14
        movq    %rdi, %rbx
        callq   Vec2DebugPrint(Vec2)
        imulq   %rbx, %rbx
        imulq   %r14, %r14
        leaq    (%r14,%rbx), %rax
        addq    $8, %rsp
        popq    %rbx
        popq    %r14
        retq

Now, we keep %rsi and %rdi in registers and don’t have to spill them to the stack. However, this doesn’t come for free: we still have to push and pop %r14 and %rbx, which we use to stash our Vec2 so we can continue using it after Vec2DebugPrint returns. We could come out ahead if our function was longer and had to push and pop those registers anyway.

Would I recommend avoiding member functions just to pass objects in registers? Definitely not, but it is something to keep in the back of your mind in case you ever need to squeeze every last bit of performance out of some code. It would be even better to just ensure that simple member functions can be inlined and avoid parameter passing concerns altogether.
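As a sketch of that last suggestion, a member function defined inside the class body is implicitly inline and visible to every caller that includes the header. Whether the compiler actually inlines it is still its decision. This Vec2 variant and normSquaredOf are illustrative only.

```cpp
#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;

    // Defined in the class body, so the definition is available at every
    // call site and no this pointer needs to exist once it is inlined.
    constexpr int64_t normSquared() const { return x * x + y * y; }
};

constexpr int64_t normSquaredOf(Vec2 v) {
    return v.normSquared();
}
```

When the call is inlined, v can stay in the registers it arrived in, since nothing needs its address.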

The sad story of “non-trivial for the purposes of calls”

The Itanium ABI defines a type to be “non-trivial for the purposes of calls” if it has a non-trivial copy constructor, move constructor, or destructor, or if all of its copy and move constructors are deleted.

A copy constructor, move constructor, or destructor is “trivial” if, in short, it is not user-provided (that is, it is implicitly declared or explicitly defaulted on its first declaration) and the class has no virtual member functions.3 In particular, ~MyClass() = default is trivial, but ~MyClass() {} is not.

The consequences of a type being non-trivial for the purposes of calls are not good: in short, the object must be copied or moved to a temporary on the stack as appropriate, passed by reference, and then have the temporary’s destructor called.

Let’s see an example:

struct TrivialForPurposesOfCalls {
    int x;
    // Doesn't matter if we uncomment this or not.
    // ~TrivialForPurposesOfCalls() = default;
};

struct NontrivialForPurposesOfCalls {
    int x;
    ~NontrivialForPurposesOfCalls() {}
};

void sink(TrivialForPurposesOfCalls);
void sink(NontrivialForPurposesOfCalls);
void sinkConstRef(const NontrivialForPurposesOfCalls&);

void passTrivial() {
    sink(TrivialForPurposesOfCalls{1});
}

void passNontrivial() {
    sink(NontrivialForPurposesOfCalls{1});
}

void passNontrivialConstRef() {
    sinkConstRef(NontrivialForPurposesOfCalls{1});
}

and the generated assembly:

passTrivial():                       # @passTrivial()
        movl    $1, %edi
        jmp     sink(TrivialForPurposesOfCalls) # TAILCALL
passNontrivial():                    # @passNontrivial()
        pushq   %rax
        movl    $1, (%rsp)
        movq    %rsp, %rdi
        callq   sink(NontrivialForPurposesOfCalls)
        popq    %rax
        retq
passNontrivialConstRef():            # @passNontrivialConstRef()
        pushq   %rax
        movl    $1, (%rsp)
        movq    %rsp, %rdi
        callq   sinkConstRef(NontrivialForPurposesOfCalls const&)
        popq    %rax
        retq

Our “default” destructor for NontrivialForPurposesOfCalls was a mistake! We cannot really pass it by value, in a register; we must pass it as though we were passing by const reference.
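To catch this kind of mistake early, you could approximate the ABI's condition with standard type traits. The helper looksTrivialForCalls and the struct DefaultedDestructor are hypothetical names, and the check is only an approximation: it does not handle the deleted-constructor corner case or compiler extensions such as [[clang::trivial_abi]].

```cpp
#include <type_traits>

struct TrivialForPurposesOfCalls {
    int x;
};

struct DefaultedDestructor {
    int x;
    ~DefaultedDestructor() = default;  // still trivial
};

struct NontrivialForPurposesOfCalls {
    int x;
    ~NontrivialForPurposesOfCalls() {}  // user-provided, so non-trivial
};

// Approximation: all three special members must be trivial.
template <typename T>
constexpr bool looksTrivialForCalls() {
    return std::is_trivially_destructible<T>::value &&
           std::is_trivially_copy_constructible<T>::value &&
           std::is_trivially_move_constructible<T>::value;
}

static_assert(looksTrivialForCalls<TrivialForPurposesOfCalls>(), "");
static_assert(looksTrivialForCalls<DefaultedDestructor>(), "");
static_assert(!looksTrivialForCalls<NontrivialForPurposesOfCalls>(), "");
```

Placing a static_assert like this next to a small value type makes an accidental ~MyClass() {} fail the build instead of silently changing how the type is passed.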

For another example of worse generated code for a non-trivial type, check out Arthur O’Dwyer’s [[trivial_abi]] 101 article.

Consequence: it’s hard to match C using std::unique_ptr

One standard library type that gets affected by this is std::unique_ptr.4 std::unique_ptr clearly has a non-trivial destructor, so it is non-trivial for purposes of calls. Let’s compare some simple unique_ptr code to a version using int *:

#include <cstdio>
#include <memory>
#include <utility>

// Suppose these functions are not inlined because
// they're defined in another file.
__attribute__((noinline))
void printAndFreeHeapAllocatedInt(std::unique_ptr<int> x) {
    printf("%d\n", *x);
}

__attribute__((noinline))
void printAndFreeHeapAllocatedInt(int *x) {
    printf("%d\n", *x);
    delete x;
}

void consumeHeapAllocatedInt(std::unique_ptr<int> x) {
    printAndFreeHeapAllocatedInt(std::move(x));
}

void consumeHeapAllocatedInt(int *x) {
    printAndFreeHeapAllocatedInt(x);
}

From reading the generated assembly, we can see that the std::unique_ptr version has the following extra costs compared to the int * version:

Given the current Itanium C++ ABI, these costs are not avoidable when std::unique_ptr<int> is passed by value.5
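One way to see why std::unique_ptr falls under these rules, even though it is no bigger than a raw pointer, is to compare type traits. This is a sketch with made-up constant names; the size check assumes the default deleter, as in the mainstream standard library implementations.

```cpp
#include <memory>
#include <type_traits>

// A raw pointer is trivially destructible, so it can travel in a register.
constexpr bool kRawPointerTrivial =
    std::is_trivially_destructible<int*>::value;

// unique_ptr has a user-provided destructor that frees the pointee.
constexpr bool kUniquePtrTrivial =
    std::is_trivially_destructible<std::unique_ptr<int>>::value;

// shared_ptr is likewise non-trivial to destroy and to copy.
constexpr bool kSharedPtrTrivial =
    std::is_trivially_destructible<std::shared_ptr<int>>::value;

// Same size as a raw pointer (with the default deleter), yet it is
// passed differently.
constexpr bool kUniquePtrIsPointerSized =
    sizeof(std::unique_ptr<int>) == sizeof(int*);

static_assert(kRawPointerTrivial, "int* is trivial");
static_assert(!kUniquePtrTrivial, "unique_ptr has a non-trivial destructor");
```

So size alone does not decide how a type is passed; the triviality of its special member functions matters too.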

Consequence: const std::shared_ptr<int>& isn’t an extra indirection compared to std::shared_ptr<int>

If you were familiar with the way parameter passing works for C types but didn’t know about the “non-trivial for the purposes of calls” rules, you might have thought that passing const std::shared_ptr<T>& implied a double indirection that shared_ptr<T> does not. (In other words, the former would pass a pointer to pointer to T, whereas the latter would pass a pointer to T.) This is not the case:

#include <memory>

void print(int);

void byConstRef(const std::shared_ptr<int>& x) {
    print(*x);
}

void byValue(std::shared_ptr<int> x) {
    print(*x);
}

As we can see, both of these functions actually have the same assembly.6 There is no need to worry about an “extra” layer of indirection – it is already there in the by-value case! (Of course, it would be better to just pass an int directly and have callers dereference any std::shared_ptr they may or may not have.)


  1. To be fair, register-to-register moves are often unusually cheap instructions due to move elimination by the CPU, but they’re still not free and they certainly increase code size. ↩︎

  2. Not all architectures' ABIs have this nice symmetry between result return and parameter passing. On 32-bit ARM, parameters go in r0 through r3, similarly to x86_64, and results are returned in r0. 64-bit primitives like int64_t and double are returned in r0 and r1, again similarly to x86_64. However, a struct containing 2 ints has to be returned in memory, not in r0 and r1! (See Procedure Call Standard for the ARM Architecture, Section 5.4. ARM64 fixed this; the corresponding section of Procedure Call Standard for the ARM 64-Bit Architecture defines result return simply by saying that result return for a type goes in the same registers that would be used if the type was passed as an argument.) ↩︎

  3. The full definitions of trivial for copy constructors, move constructors, and destructors are slightly more complicated. ↩︎

  4. Chandler Carruth presented these concerns as part of his excellent CppCon 2019 talk, “There Are No Zero-Cost Abstractions”. ↩︎

  5. For more discussion on how these costs could be avoided in both the short term and the long term, see Chandler Carruth’s talk mentioned in the previous footnote. ↩︎

  6. For byValue, the caller is responsible for destroying x. ↩︎