Solidity IR-based Codegen Changes

Solidity can generate EVM bytecode in two different ways: either directly from Solidity to EVM opcodes ("old codegen") or through an intermediate representation ("IR") in Yul ("new codegen" or "IR-based codegen").

The IR-based code generator was introduced to make code generation more transparent and auditable, and to enable more powerful optimization passes that span across functions.

Currently, the IR-based code generator is still marked experimental, but it supports all language features and has received a lot of testing, so we consider it almost ready for production use.

You can enable it on the command line using --experimental-via-ir or with the option {"viaIR": true} in standard-json. We encourage everyone to try it out!

For several reasons, there are tiny semantic differences between the old and the IR-based code generator, mostly in areas where we would not expect people to rely on this behaviour anyway. This section highlights the main differences between the old and the IR-based codegen.

Semantic Only Changes

This section lists the changes that are semantic-only, thus potentially hiding new and different behavior in existing code.

When storage structs are deleted, every storage slot that contains a member of the struct is set to zero entirely. Formerly, padding space was left untouched. Consequently, if the padding space within a struct is used to store data (e.g. in the context of a contract upgrade), you have to be aware that delete will now also clear the added member (while it wouldn’t have been cleared in the past).

```
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.7.1;

contract C {
    struct S {
        uint64 y;
        uint64 z;
    }
    S s;
    function f() public {
        // ...
        delete s;
        // s occupies only first 16 bytes of the 32 bytes slot
        // delete will write zero to the full slot
    }
}
```
The same behavior applies to implicit deletes, for example when an array of structs is shortened.
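
As a rough illustration of the implicit case, the sketch below plants a marker in the padding of an array element and then shortens the array with pop(). The contract layout, the marker value 0xdead and the names f, elementSlot and slotAfterPop are made up for this example. The outcomes described after the code follow from the rule stated above and are not a tested guarantee.

```solidity
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.8.1;

contract C {
    struct S {
        uint64 y;
        uint64 z;
    }
    S[] arr;

    function f() public returns (uint256 slotAfterPop) {
        arr.push(S(1, 2));
        uint256 elementSlot;
        assembly {
            // Element 0 of a dynamic storage array lives at keccak256(slot of the array).
            mstore(0, arr.slot)
            elementSlot := keccak256(0, 32)
            // Hypothetical marker in the unused upper 16 bytes of the element's slot.
            sstore(elementSlot, or(sload(elementSlot), shl(128, 0xdead)))
        }
        // Shortening the array implicitly deletes the removed element.
        arr.pop();
        assembly {
            slotAfterPop := sload(elementSlot)
        }
    }
}
```

Under the old code generator, one would expect slotAfterPop to still contain the marker in its upper half, because only the bytes occupied by y and z are cleared. Under the new code generator, the whole slot should read as zero.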

Function modifiers are implemented in a slightly different way regarding function parameters and return variables. This especially has an effect if the placeholder _; is evaluated multiple times in a modifier. In the old code generator, each function parameter and return variable has a fixed slot on the stack. If the function is run multiple times because _; is used multiple times or used in a loop, then a change to the function parameter’s or return variable’s value is visible in the next execution of the function. The new code generator implements modifiers using actual functions and passes function parameters on. This means that multiple evaluations of a function’s body will get the same values for the parameters, and the effect on return variables is that they are reset to their default (zero) value for each execution.

```
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.7.0;
contract C {
    function f(uint _a) public pure mod() returns (uint _r) {
        _r = _a++;
    }
    modifier mod() { _; _; }
}
```
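If you execute f(0) in the old code generator, it will return 1, while it will return 0 when using the new code generator. The post-increment _a++ yields the value of _a before the increment. With the old code generator, the second evaluation of _; sees the incremented _a and assigns 1. With the new code generator, every evaluation receives the original argument 0.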

```
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.7.1 <0.9.0;

contract C {
    bool active = true;
    modifier mod()
    {
        _;
        active = false;
        _;
    }
    function foo() external mod() returns (uint ret)
    {
        if (active)
            ret = 1; // Same as `return 1`
    }
}
```
The function C.foo() returns the following values:
- Old code generator: 1 as the return variable is initialized to 0 only once before the first _; evaluation and then set to 1 by ret = 1;. It is not initialized again for the second _; evaluation and foo() does not explicitly assign it either (due to active == false), thus it keeps its first value.
- New code generator: 0 as all parameters, including return parameters, will be re-initialized before each _; evaluation.
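
The loop case mentioned above can be sketched in the same way. The modifier name repeat and its parameter n are hypothetical. The values given after the code are what the description above implies, not measured output.

```solidity
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.7.1 <0.9.0;

contract C {
    // Hypothetical modifier that evaluates the function body n times.
    modifier repeat(uint n) {
        for (uint i = 0; i < n; i++) {
            _;
        }
    }
    function f(uint a) public pure repeat(3) returns (uint r) {
        a++;
        r = a;
    }
}
```

For f(0), the old code generator would be expected to return 3, since the incremented a carries over from one evaluation of _; to the next. The new code generator would be expected to return 1, since each evaluation starts again with a == 0 and r == 0.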

The order of contract initialization has changed in case of inheritance.

The order used to be:
- All state variables are zero-initialized at the beginning.
- Evaluate base constructor arguments from most derived to most base contract.
- Initialize all state variables in the whole inheritance hierarchy from most base to most derived.
- Run the constructor, if present, for all contracts in the linearized hierarchy from most base to most derived.
New order:
- All state variables are zero-initialized at the beginning.
- Evaluate base constructor arguments from most derived to most base contract.
- For every contract in order from most base to most derived in the linearized hierarchy execute:
  1. If present at declaration, initial values are assigned to state variables.
  2. Constructor, if present.

This causes differences in some contracts, for example:

```
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.7.1;

contract A {
    uint x;
    constructor() {
        x = 42;
    }
    function f() public view returns(uint256) {
        return x;
    }
}
contract B is A {
    uint public y = f();
}
```
Previously, y would be set to 0, because all state variables were initialized before any constructor ran: x is first set to 0, and when y is initialized, f() returns 0, so y is 0 as well. With the new rules, y will be set to 42. We first initialize x to 0, then call A’s constructor, which sets x to 42. Finally, when y is initialized, f() returns 42, so y is 42.
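
The change also works in the opposite direction: a base constructor can no longer observe initial values assigned at declaration in a more derived contract. The following sketch uses made-up names (seen, value) and assumes only the two orderings listed above.

```solidity
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.7.1;

contract A {
    uint public seen;
    constructor() {
        // Calls the most derived override, which reads B.y.
        seen = value();
    }
    function value() internal view virtual returns (uint) {
        return 0;
    }
}
contract B is A {
    uint y = 7;
    function value() internal view override returns (uint) {
        return y;
    }
}
```

Following the old order, y is already 7 when A’s constructor runs, so seen would be expected to be 7 after deploying B. Following the new order, A’s constructor runs before y receives its initial value, so seen would be expected to be 0.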

Copying bytes arrays from memory to storage is implemented in a different way. The old code generator always copies full words, while the new one cuts the byte array after its end. The old behaviour can lead to dirty data being copied after the end of the array (but still in the same storage slot). This causes differences in some contracts, for example:

```
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.8.1;

contract C {
    bytes x;
    function f() public returns (uint _r) {
        bytes memory m = "tmp";
        assembly {
            mstore(m, 8)
            mstore(add(m, 32), "deadbeef15dead")
        }
        x = m;
        assembly {
            _r := sload(x.slot)
        }
    }
}
```
Previously, f() would return 0x6465616462656566313564656164000000000000000000000000000000000010 (it has correct length, and correct first 8 elements, but then it contains dirty data which was set via assembly). Now it is returning 0x6465616462656566000000000000000000000000000000000000000000000010 (it has correct length, and correct elements, but does not contain superfluous data).

For the old code generator, the evaluation order of expressions is unspecified. For the new code generator, we try to evaluate in source order (left to right), but do not guarantee it. This can lead to semantic differences.

For example:

```
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.8.1;
contract C {
    function preincr_u8(uint8 _a) public pure returns (uint8) {
        return ++_a + _a;
    }
}
```
The function preincr_u8(1) returns the following values:
- Old code generator: 3 (1 + 2) but the return value is unspecified in general
- New code generator: 4 (2 + 2) but the return value is not guaranteed

On the other hand, function argument expressions are evaluated in the same order by both code generators with the exception of the global functions addmod and mulmod. For example:

```
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.8.1;
contract C {
    function add(uint8 _a, uint8 _b) public pure returns (uint8) {
        return _a + _b;
    }
    function g(uint8 _a, uint8 _b) public pure returns (uint8) {
        return add(++_a + ++_b, _a + _b);
    }
}
```
The function g(1, 2) returns the following values:
- Old code generator: 10 (add(2 + 3, 2 + 3)) but the return value is unspecified in general
- New code generator: 10 but the return value is not guaranteed

The arguments to the global functions addmod and mulmod are evaluated right-to-left by the old code generator and left-to-right by the new code generator. For example:

```
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.8.1;
contract C {
    function f() public pure returns (uint256 aMod, uint256 mMod) {
        uint256 x = 3;
        // Old code gen: addmod(5, 4, 3)
        // New code gen: addmod(4, 5, 5)
        aMod = addmod(++x, ++x, x);
        // x is now 5.
        // Old code gen: mulmod(7, 6, 5)
        // New code gen: mulmod(6, 7, 7)
        mMod = mulmod(++x, ++x, x);
    }
}
```
The function f() returns the following values:
- Old code generator: aMod = 0 and mMod = 2
- New code generator: aMod = 4 and mMod = 0

The new code generator imposes a hard limit of type(uint64).max (0xffffffffffffffff) for the free memory pointer. Allocations that would increase its value beyond this limit revert. The old code generator does not have this limit.

For example:

```
// SPDX-License-Identifier: GPL-3.0
pragma solidity >0.8.0;
contract C {
    function f() public {
        uint[] memory arr;
        // allocation size (number of elements): 576460752303423483
        // assumes freeMemPtr points to 0x80 initially
        uint solYulMaxAllocationBeforeMemPtrOverflow = (type(uint64).max - 0x80 - 31) / 32;
        // freeMemPtr overflows UINT64_MAX
        arr = new uint[](solYulMaxAllocationBeforeMemPtrOverflow);
    }
}
```

The function f() behaves as follows:

- Old code generator: runs out of gas while zeroing the array contents after the large memory allocation
- New code generator: reverts due to free memory pointer overflow (does not run out of gas)

Internals

Internal function pointers

The old code generator uses code offsets or tags for values of internal function pointers. This is especially complicated since these offsets are different at construction time and after deployment and the values can cross this border via storage. Because of that, both offsets are encoded at construction time into the same value (into different bytes).

In the new code generator, function pointers use internal IDs that are allocated in sequence. Since calls via jumps are not possible, calls through function pointers always have to use an internal dispatch function that uses the switch statement to select the right function.

The ID 0 is reserved for uninitialized function pointers which then cause a panic in the dispatch function when called.

In the old code generator, internal function pointers are initialized with a special function that always causes a panic. This causes a storage write at construction time for internal function pointers in storage.
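
From the outside, the two implementations should be indistinguishable when an uninitialized pointer is called. The sketch below is an illustration with made-up names (op, twice, set, run); it relies only on the documented rule that calling a zero-initialized variable of internal function type triggers Panic(0x51).

```solidity
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.8.1;

contract C {
    // Internal function pointer in storage; it stays uninitialized until set() is called.
    function (uint) internal pure returns (uint) op;

    function twice(uint x) internal pure returns (uint) {
        return 2 * x;
    }
    function set() public {
        op = twice;
    }
    function run(uint x) public view returns (uint) {
        // Before set(): panics, because op is still uninitialized.
        // After set(): returns 2 * x.
        return op(x);
    }
}
```

The difference lies in how the panic is reached. With the old code generator, op is written at construction time to point at the special panicking function. With the new code generator, op simply holds the reserved ID 0, and the dispatch function panics when it sees that ID.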

Cleanup

The old code generator only performs cleanup before an operation whose result could be affected by the values of the dirty bits. The new code generator performs cleanup after any operation that can result in dirty bits. The hope is that the optimizer will be powerful enough to eliminate redundant cleanup operations.

For example:

```
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.8.1;
contract C {
    function f(uint8 _a) public pure returns (uint _r1, uint _r2)
    {
        _a = ~_a;
        assembly {
            _r1 := _a
        }
        _r2 = _a;
    }
}
```

The function f(1) returns the following values:

- Old code generator: (fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe, 00000000000000000000000000000000000000000000000000000000000000fe)
- New code generator: (00000000000000000000000000000000000000000000000000000000000000fe, 00000000000000000000000000000000000000000000000000000000000000fe)

Note that, unlike the new code generator, the old code generator does not perform a cleanup after the bit-not assignment (_a = ~_a). This results in different values being assigned (within the inline assembly block) to return value _r1 between the old and new code generators. However, both code generators perform a cleanup before the new value of _a is assigned to _r2.