I saw this mentioned on Discord, so I tested it, and found it to be true. I can't think of a reason for it to be this way, so I figured it should be written up as a bug.
This code:
local c=x+y |
is faster than this code:
local c=0 |
Here's a dead-simple test cart. Hold a button down to switch from running the first assignment per pixel to running the second. I get stat(1) = 0.7709 for c=x+y and 0.8296 for c=0, when at best they ought to be at least identical.
You probably already noticed this, but modifying your code, I noticed the relationship is correct when c is global (0.8881 for c=x+y vs. 0.8295 for c=0).
Oddly, changing local c=0 to c=0 by removing the local qualifier does not change the performance for the assignment case.
It looks like this behavior is known: https://pico-8.fandom.com/wiki/CPU
> Simple (x=y): 0 cycles if right side of expression already has a cycle cost. 2 cycles otherwise. (yes, this means x=x+y is cheaper than x=y). [Note; this may be changed/fixed in the future]
(I don't know why this is true; my best guess is that these cycle costs are an artificial limitation to make the console feel more real?)
Thanks @Felice -- this problem would force optimised code to have "+ zero_val" sprinkled everywhere =:O
It happens because the ADD and SUB vm instructions are discounted one cycle. This was a change I made while re-balancing the cpu costs to roughly match existing totals after the disruption of making binary op function more expensive. It seemed reasonable for addition to be cheaper than other arithmetic operations. But C = (A+B) only needs a single vm instruction on the right hand side -- it adds AND loads the result onto the stack. LOADK (for loading a constant onto the stack) is also a single instruction but not discounted, so C = 0 ends up being more expensive.
There's no nice way to roll back the arithmetic costing, so I think the natural solution will be to simply make LOADK cost the same as ADD, SUB. Loading a constant should be cheap anyway! I tried a collection of cpu-intensive carts, and this normally results in 0.01..0.02 cpu saving which isn't too disruptive.
Discounting only LOADK, and not other types of LOAD instructions, is ugly in itself but as far as I can tell won't open up any horrible optimisation patterns which is my main concern here (like the one we have here: adding a zero-valued variable to every constant).
Hey @zep, that sounds pretty appropriate, actually!
It makes sense to me that loading a constant would be fast like an addition. In retro-similar hardware you would typically have bitfields in the opcode that specified which registers to operate on, and in the case of immediate operands would instead be combined to produce a larger bitfield with a decent range for common immediate values.
Also, other kinds of loads usually require additional reads from the instruction stream, where you either set up a source location in a register beforehand, or the opcode causes the CPU to fetch the the source location from following bytes. So it makes sense that an immediate load would be faster than other kinds of loads.
And yeah, the alternative is that suddenly people start trading out pointless +0 tokens for performance in inner loops, which would just be weird and counter-intuitive.
Thanks for looking into it! :)
[Please log in to post a comment]