Here's a demo effect kludged together from some doodles I had been playing with.
I'm quite fond of my new and improved 3D shading. (It's not quite as speedy as the line shading I was using previously, but it gives smoother gradients.)
-Electric Gryphon
To say it in the philosophical words of Ron Simmons: Damn!
This shading rocks beyond limits!
Looking through your source for ideas (stole one already), I saw this:
if(sx%2==1)then poke(start_a,bor(band(peek(start_a),0x000f),band(v,0x00f0)))
This is to fill the first pixel if a span starts on an odd pixel. You have sx and y, would it be faster just to use pset(sx,y,v/16)?
Very spiffy work, regardless. :)
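For anyone following along, the byte math in that line can be sketched roughly as below. PICO-8 packs two 4-bit pixels into each screen byte, with the low nibble holding the left (even x) pixel and the high nibble holding the right (odd x) pixel. The names merge_odd_pixel and plot_span_start are made up for illustration and are not from the cart, and the sketch assumes v carries the colour in its high nibble (which is what band(v,0x00f0) suggests).

```lua
-- merge_odd_pixel and plot_span_start are hypothetical names for illustration.

-- keep the left pixel (low nibble) of the existing byte and
-- take the right pixel (high nibble) from v
function merge_odd_pixel(old_byte, v)
 return bor(band(old_byte, 0x0f), band(v, 0xf0))
end

-- write the first pixel of a span that starts on an odd x.
-- assumes v carries the colour in its high nibble, and that
-- start_a is the screen address of the byte holding pixel sx
function plot_span_start(sx, start_a, v)
 if sx % 2 == 1 then
  poke(start_a, merge_odd_pixel(peek(start_a), v))
 end
end
```

Under those assumptions, a byte of 0x35 merged with v=0xaa becomes 0xa5: the left pixel (5) is kept and the right pixel becomes colour 10.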
Nope, that poke/peek construct is six times faster than pset.
The built-in single-pixel primitives are INCREDIBLY slow due to handling the camera/clip functions.
I know pset is slow, but for a single pixel at the start of a span that is otherwise poke'd in, that is a ton of overhead, especially with the logic calls. Annoyingly, band/bor cost more than a simple op like +-*/ should, since they're implemented as an actual call, so they have call overhead.
Wait...
There's something seriously wrong with cycle counting...
I have a test bench that tells me exactly how many pseudocycles an instruction or bit of code takes to execute.
pset(sx,y,v) takes 5 cycles. I believe that's just the standard overhead for the call itself. The internal workings appear to be given to you free.
poke(start_a,bor(band(peek(start_a),0x000f),band(v,0x00f0))) takes 0 cycles.
There's no way that's right, and I know from a lot of testing that it ain't my test bench that's messing up.
The test bench is a simple process: it has a loop with a large number of iterations that takes exactly one second with no inner code. I know that a FOR loop has exactly one cycle of overhead per iteration, so each additional second of runtime indicates one more cycle added inside the loop. Putting pset() inside the loop increases total runtime to 6s, so it's 6-1=5 cycles for pset(). Putting that poke() sequence inside the loop makes it run for exactly one second.
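The arithmetic behind the bench can be written out as a tiny helper. This is only a sketch of the reasoning above, not the actual test bench code; cycles_per_iteration is a hypothetical name, and it assumes the empty loop costs exactly one cycle per iteration.

```lua
-- hypothetical helper: turn measured runtimes into pseudocycles.
-- baseline_s is the runtime of the empty loop (one cycle per iteration),
-- so every extra baseline_s of runtime means one more cycle in the body.
function cycles_per_iteration(total_s, baseline_s)
 return (total_s - baseline_s) / baseline_s
end
```

With a 1s baseline, a 6s run works out to 5 cycles for the loop body, and a 1s run works out to 0.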
I considered that the Lua compiler might be hoisting loop invariants out of the loop, which would result in a single poke() outside the loop, so I added this line above the poke:
start_a=band(time(),0x1fff)+0x6000
By itself, that code comes in at 11 cycles, meaning the loop now takes 12s in total.
With the complex poke() added, it's still 12s (11 cycles). There's no chance it's being loop-invariant'ed anymore. I know the poke is happening inside the loop since I can see the actual pixels getting written to the screen as the 12s pass. So... yeah, the poke() is basically coming in as free...
WTF?
Why is a poke with all that math inside of it coming in at 0 cycles? Can I put my entire game in the argument to a poke() call and have it run at native speed? :)
Does your test bench measure the stat(1) accounting as well? This seems pretty serious…
Wow…
poke(start_a,bor(band(peek(start_a),0x000f),band(v,0x00f0)))
takes half as much stat(1) time as
poke(start_a,1)
when run in a loop. Very strange!
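A comparison like this can be set up along the following lines. It is a rough sketch rather than the code actually used: cpu_delta is a hypothetical helper name, and it relies on stat(1) being readable mid-frame as the CPU fraction used so far. Wrapping each snippet in a function adds call overhead of its own, so only the difference between two measurements is meaningful.

```lua
-- hypothetical helper: how much stat(1) rises while body runs n times.
-- the function call wrapper adds its own cost, so compare two
-- measurements against each other rather than reading one in isolation.
function cpu_delta(body, n)
 local before = stat(1)
 for i = 1, n do
  body()
 end
 return stat(1) - before
end
```

Usage would be something like cpu_delta(function() poke(0x6000,1) end, 1000) versus the same call with the bor/band/peek version, printed from _draw().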
So, it turns out bor, band, peek, poke, and other functions like bnot all have a negative cost. I don't understand why this is, but stepping through with a debugger, these built-in functions all decrease the "CPU use" value in memory by 3 each time they're called. The only reason why you can't decrease the value to zero is the "instruction_limiter" function, which adds 1024 to the "CPU use" value every so often.
Anyway, here's a cart that demonstrates the issue:
I wonder if zep was trying to compensate for the difference between a regular operator (1 cycle) and the hacked-in 2-arg calls for binary operators (um, 5 cycles, I think?), and either overcompensated, or adjusted global timing at some point and forgot to adjust the compensation.
I could have sworn the binary ops were more expensive until recently, but I may be thinking of the token cost, rather than the cycle cost. Token cost is a lot more apparent, after all.
This can definitely be abused, but seems to only really work when viewed from the pico-8 executable as opposed to the web browser.
If you press the z button, the frame rate goes up (becoming smooth) and the stat(1) value drops from 2.5 to 0.6. (In the browser, the stat(1) value changes, but the frame rate stays choppy.)
if(btn(4))then for i=0,5000 do bnot(bnot(bnot(bnot(bnot(bnot(bnot(1))))))) bnot(bnot(bnot(bnot(bnot(bnot(bnot(1))))))) bnot(bnot(bnot(bnot(bnot(bnot(bnot(1))))))) bnot(bnot(bnot(bnot(bnot(bnot(bnot(1))))))) bnot(bnot(bnot(bnot(bnot(bnot(bnot(1))))))) bnot(bnot(bnot(bnot(bnot(bnot(bnot(1))))))) bnot(bnot(bnot(bnot(bnot(bnot(bnot(1))))))) bnot(bnot(bnot(bnot(bnot(bnot(bnot(1))))))) bnot(bnot(bnot(bnot(bnot(bnot(bnot(1))))))) end end |
I was able to get to a negative stat(1) by raising the loop to 10000 iterations. This clearly means that frames are being sent back from the future. :-)