I'm using stat(1) to see how much CPU my game is using, and in certain spots it's a lot, sometimes hitting 100% and causing some serious jitters.
When that started happening, I went back through my code and did some refactoring of things...removing dupe code, nesting conditions, removing methods, limiting loops per tick...and that helped.
I'm not so dumb as to not realize that bloated code will impact CPU usage, but what else increases that usage?
Maybe that's a big question...I dunno, I'm looking for a somewhat high level answer, I guess. Just general "things to pay attention to" kind of stuff.
I'm not using any fancy techniques like poke/peek or memory storage or anything like that. Very straightforward coding. I don't feel like my games are very complex or doing a whole lot, which tells me I'm probably missing something in managing overhead.
It's all determined by cycle counts. PICO-8 seems to run at roughly 4MHz, which is to say roughly four million cycles per second.
I've tested cycle counts in the past and have a general idea of how they work out. I may be off here and there; this is just to give you an idea of how it works. Don't take these numbers as anything close to gospel. I don't trust my memory at all, and neither should you. :)
An assignment to a global usually costs you two cycles.
An assignment to a local usually costs you one cycle.
Math operators usually cost you one cycle.
(For example, LOCAL A=B+C should cost you one cycle for the addition and one cycle for the assignment.)
POKE and PEEK are supposed to be one cycle each, I think. It's a bit buggy and doesn't always work out right.
Usually a function call has a few cycles (3, I think?) of overhead, plus another cycle for each argument passed to it, plus the assignment if you use the result. Plus, of course, whatever the function does internally. Built-in functions don't always have internal costs, but most have some.
Ideally, graphical operations are supposed to cost you based on the number of pixels drawn. I can't remember if you should get one pixel per cycle, or two (since writing an 8-bit byte in one cycle conceptually writes two 4-bit pixels), but basically it should be proportional to the number of pixels drawn. (There's a bug in the RECT/RECTFILL code that reduces costs, but that's supposedly fixed in the next version.) CLS() is thus rather expensive, so don't do it more than once, and don't do it at all if you're going to write the whole screen anyway. Transparent pixels in a sprite still cost a cycle each. Offscreen pixels still cost a cycle as well.
Referencing a table entry, either with table.member or table[member] has one cycle of overhead, I think.
A simple FOR I=#,# loop, with a numeric range, has a few cycles of setup time and one cycle overhead per iteration. The FOR K,V IN PAIRS(T) version is the same, as I recall. Beware of the FOR V IN ALL(T) construct though, as it is very, very slow, for no apparent reason (@zep: is this even as-intended?).
Um, what else... oh, I dunno, I suppose this is enough to get the idea.
If you want an overly-simplified rule of thumb, assume every token you add to your function costs an extra cycle. I don't think that's 100% accurate, but it'll serve as a good way to figure out your relative code costs, at least, if not your draw costs.
Feel free to ask questions. I know I'm not often very clear when I explain things.
Good to know! - I was looking for this kind of info (see my thread on "reliable operator cost")!
This one is really scary:
"Usually a function call has a few cycles (3, I think?) of overhead, plus another cycle for each argument passed to it"
Function calls have added cost in any language, and each arg is usually extra. Quite often it's a lot more than that. PICO-8 Lua and zep are both being fairly kind to us here. :)
Anyway, it's only a few cycles, and you're bound to be putting a dozen or more, at least, inside the function, so it really doesn't hurt much.
This is great info. One small correction: as the Dank Tombs write-up pointed out, table lookups (e.g. table[member]) are significantly slower than peek() calls, so the cycle count shouldn't be considered the same.
Beyond that, when I have a problem with CPU spiking as you describe, I've often found that the limited fill rate of the Pico-8 raster "hardware" is the problem. If you can, find a clever way to avoid clearing the screen (as Felice mentions) or drawing very large sprites every frame.
However, in general I find it's best to start optimizing from the top down in the usual way—look at the time complexity of your algorithms, try to rearrange your data structures so that elements can be retrieved and iterated through in the most efficient way possible, etc. Save cycle counting for fine tuning if you can.
@musuca: when I tested table lookups, they were one cycle. I believe this choice was made by zep so as not to discourage the use of class/struct-like tables with the table.member syntax.
It does cost an extra cycle, though. Like, T.M=1 is one more cycle than X=1. (Edit: Not even. See below)
Also, the poke/peek() timing is screwed up and can even go negative. That's why he thought table lookups are so much slower. They're actually quite speedy.
Here, I just set up and ran my benchmark with a bunch of permutations of table lookups. You can draw your own conclusions. It's not very consistent, alas.
    CYC  CODE
    ---  -----------------------------------
      1  L1 = 2
      1  L1 = L2
      1  L1 = T2.M2
      2  L1 = T2.T2.M2
      1  T1.M1 = 2
      1  T1.M1 = L2
      2  T1.M1 = T2.M2
      3  T1.M1 = T2.T2.M2
      2  T1.T1.M1 = 2
      2  T1.T1.M1 = L2
      3  T1.T1.M1 = T2.M2
      4  T1.T1.M1 = T2.T2.M2
Note that the legend is L)ocal, T)able (also local), and M)ember.
It's interesting that a table member is the same speed as a local. I wonder if this is to make all the SELF.BLAH references in a class feel less troubling and thus encourage you to use classes. Too bad they still cost extra tokens.
Edit: Made the table more readable and added a few more times.
@Felice about pairs() vs all():
pairs() is standard lua (so, handled directly by the interpreter I guess) while all() is plain lua:
    --(pulled from pico8.exe by nosy me)
    function all(c)
      if (c == nil or #c == 0) then return function() end end
      local i=1
      local li=nil
      return function()
        if (c[i] == li) then i=i+1 end
        while(c[i]==nil and i <= #c) do i=i+1 end
        li=c[i]
        return c[i]
      end
    end
so I guess it doesn't get a pass on cycle count.
also for some reason all() is a replacement for standard ipairs(). the reason might be related to fixed point numbers, as there had been a long running bug that would reorder integer-indexed tables from 0x.0001 if I remember correctly. that would be great if ipairs() could make a "standard" comeback somehow.
Ah. Ick. :) Oh, well. I'll just keep using pairs() or iterate numerically.
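For anyone swapping out all(), here is a minimal sketch of the numeric alternative. The `enemies` list and its `x`/`hp` fields are made up for illustration, and the `remove_at` shim is only there so the same snippet runs both in PICO-8 (which has deli) and in desktop Lua (which has table.remove). One thing to keep in mind: pairs() makes no promise about visiting order, so the numeric loop is the safer pick when order matters.

```lua
-- hypothetical list of enemies; field names are for illustration only
enemies = {
  {x = 10, hp = 3},
  {x = 20, hp = 0},
  {x = 30, hp = 1},
}

-- shim (an assumption for portability): pico-8 has deli(t, i),
-- desktop lua has table.remove(t, i)
local remove_at = deli or table.remove

-- read-only pass: index numerically instead of using all()
function total_hp(list)
  local sum = 0
  for i = 1, #list do
    sum = sum + list[i].hp
  end
  return sum
end

-- removal pass: walk backward so that removing an entry
-- never shifts an element we have not visited yet
function remove_dead(list)
  for i = #list, 1, -1 do
    if list[i].hp <= 0 then
      remove_at(list, i)
    end
  end
end
```

Walking backward in the removal pass matters because removing entry i shifts everything after it down by one; a forward loop would skip the entry that slides into the freed slot.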
Awesome...thanks for all the insight, Felice et al. Great tips to keep an eye out for.
I'm definitely using the all() loop frequently, so I'll replace those with pairs() loops.
You say that cls() is expensive...but how do you avoid using it when making a game? Or is there something other than cls() that is better to use?
And a specific example from my game about drawing pixels...
I have a somewhat large maze level that a character walks through (see here), thus there are things "out of sight" offscreen that are getting drawn. At one point, I thought that not drawing all those extra sprites would save me some CPU, so I added in a check that would only draw sprites within the player's view. It worked just fine but the CPU usage actually went up...I'm assuming from the calculation to see if the sprite is in-view. As such, I took that check out and the game is currently drawing all sprites every tick, even if they're offscreen.
I guess in the long run, if it's working, it's working...but thought that was an interesting result and one I wasn't exactly expecting.
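For reference, a view check of the kind described above can be as small as four comparisons per sprite. This is only a sketch, not the code from the game in question: it assumes 8x8 sprites on the 128x128 screen, and `in_view`, `visible_cells`, `cam_x`, and `cam_y` are hypothetical names. If the maze is stored as a grid of 8-pixel cells (also an assumption), the second function computes the visible cell range once per frame, so the draw loop only visits those cells and needs no per-sprite test at all.

```lua
-- shim (an assumption for portability): pico-8 has flr, desktop lua has math.floor
local flr = flr or math.floor

-- true if an 8x8 sprite at world position (sx, sy) overlaps
-- the 128x128 view whose top-left corner is (cam_x, cam_y)
function in_view(sx, sy, cam_x, cam_y)
  return sx + 8 > cam_x and sx < cam_x + 128
     and sy + 8 > cam_y and sy < cam_y + 128
end

-- for a grid of 8x8 cells: the inclusive range of cells that can
-- touch the view. 17 cells per axis covers a camera that is not
-- aligned to the grid.
function visible_cells(cam_x, cam_y)
  local cx0 = flr(cam_x / 8)
  local cy0 = flr(cam_y / 8)
  return cx0, cy0, cx0 + 16, cy0 + 16
end
```

Whether either version beats drawing everything depends on how many sprites are offscreen and what the check costs per sprite, so it is worth measuring both ways with stat(1).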
Well, you have a budget for each frame, and as long as you're not exceeding your budget, you're fine. (Too-)early optimization is the bane of programmers everywhere, because you waste time looking for little speedups that will be dwarfed by dumb algorithmic changes you make or fix.
I worked for 20 years doing a lot of stuff right up next to the hardware, basically the equivalent of writing drivers, so I count cycles and tighten loops habitually, but anyone writing normal, higher-level code should be thinking in terms of algorithmic complexity. Look up "big-o notation" if you're not familiar with it. Reducing complexity usually gives bigger returns and allows your app to scale up better.
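To make the complexity point concrete, here is a rough sketch comparing two ways of counting pairs of points closer than some distance `r`. The point list and both function names are made up for illustration. The first version tests every pair, which grows with the square of the number of points; the second drops points into grid cells of size `r` and only tests against neighbouring cells. Both return the hit count plus the number of distance tests performed, so the difference is visible.

```lua
-- shim (an assumption for portability): pico-8 has flr, desktop lua has math.floor
local flr = flr or math.floor

-- brute force: n*(n-1)/2 distance tests
function count_pairs_naive(pts, r)
  local hits, tests = 0, 0
  for i = 1, #pts - 1 do
    for j = i + 1, #pts do
      tests = tests + 1
      local dx, dy = pts[i].x - pts[j].x, pts[i].y - pts[j].y
      if dx * dx + dy * dy < r * r then hits = hits + 1 end
    end
  end
  return hits, tests
end

-- grid buckets: each point is only tested against points already
-- stored in the 3x3 block of cells around it
function count_pairs_grid(pts, r)
  local cells = {}
  local hits, tests = 0, 0
  for i = 1, #pts do
    local p = pts[i]
    local cx, cy = flr(p.x / r), flr(p.y / r)
    for ox = -1, 1 do
      for oy = -1, 1 do
        local bucket = cells[(cx + ox) .. "," .. (cy + oy)]
        if bucket then
          for k = 1, #bucket do
            tests = tests + 1
            local q = bucket[k]
            local dx, dy = p.x - q.x, p.y - q.y
            if dx * dx + dy * dy < r * r then hits = hits + 1 end
          end
        end
      end
    end
    -- store p only after testing, so every pair is counted once
    local key = cx .. "," .. cy
    local own = cells[key]
    if not own then
      own = {}
      cells[key] = own
    end
    own[#own + 1] = p
  end
  return hits, tests
end
```

The grid version does more bookkeeping per point, so for a handful of objects the brute-force loop may well be cheaper. The payoff only shows up once the object count grows, which is exactly the kind of thing worth checking before counting individual cycles.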
For most programmers and for most of the code they write, spending a lot of time on the cycle-counting stuff is for later, when you've already tuned the algorithms and the dataset. Sure, you want to know what's natively expensive and write your code with it in mind, e.g. avoid all(), but otherwise don't go nuts with the finicky details. I really only wrote about the finicky details because you appeared to want to know where exactly your cycles were going.
Fair enough. Like I said, right now it's working...and that's after some serious refactoring/rewriting to simplify things. That helped a ton, but I'm at the finish line and trying to squeeze out a bit more so I can get a few more game details in there.
Plus, this is the first time I've hit smack up against Pico-8's limits with a game so it's a new thing to learn about.
Hey @Felice - I'm trying out a project that requires consistent timing and benchmark data like you shared here would be very helpful. Do you have any more resources on cycle costs?
And when you say peek()/poke() can use negative cycles, what do you mean by that? Is using memset() where len=1 any more consistent?
Also, do you know what a pico-8 cycle equates to in actual time, or is that not a consistent figure?
I basically just run stuff through my little cycle counter whenever I have a loop to tune, so there's no exhaustive list, no. It's especially difficult to say what any given thing will cost because I think there are a lot of little pragmatic patches that zep's done which can come together to produce an unintuitive cycle count, though by and large they help us out a lot, so it's tough to complain.
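This is not the actual counter mentioned above, just a minimal sketch of how one could be put together. It times `n` calls of a function, subtracts the time for `n` calls of an empty function so that loop and call overhead cancel out, and returns the difference as a fraction of a frame. The names `measure`, `frame_fraction_to_cycles`, and `kcycles_per_frame` are made up; the clock is passed in so the snippet also runs outside PICO-8. The 139.81 figure assumes a 4 MiHz clock and that stat(1) reports 1.0 for one full frame at 30 fps, neither of which is officially confirmed.

```lua
-- assumption: 4 * 1024 * 1024 cycles per second at 30 fps, in thousands
-- of cycles so the value fits in pico-8's number range
kcycles_per_frame = 139.81

local function nop() end

-- clock: function returning the fraction of the frame used so far.
-- returns the net frame fraction spent in n calls of fn, relative
-- to n calls of an empty function.
function measure(fn, n, clock)
  local t0 = clock()
  for i = 1, n do nop() end
  local t1 = clock()
  for i = 1, n do fn() end
  local t2 = clock()
  return (t2 - t1) - (t1 - t0)
end

-- convert a net frame fraction for n calls into cycles per call.
-- n should be a multiple of 1000 to keep intermediate values small.
function frame_fraction_to_cycles(net, n)
  return net * kcycles_per_frame / (n / 1000)
end

-- usage inside pico-8 only (stat is nil in desktop lua)
if stat then
  local net = measure(
    function() local a = 1 + 2 end,
    1000,
    function() return stat(1) end
  )
  print(frame_fraction_to_cycles(net, 1000))
end
```

Keep the total work well under one frame, since the reading is only meaningful within a single frame, and treat the result as relative rather than exact.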
Speaking of which, that's what seems to be happening with peek and poke. The "negative" count I mentioned can happen when you use additional peeks, pokes, or bitwise (band, bor, etc.) ops in the second operand of the poke. Peek, poke, and the bitwise ops have tweaks on their cycle counts to make them speedy like they'd be if they weren't a full-on function call. I think some of them overlap or remove one or two cycles more than they should, because you can end up with something that basically gives you cycles back, or at the very least has zero cost. I mean, that could be super useful if you want a real-time raytracer, but most of us mean to play by the rules, so it's a bit ... disappointing.
The counts you get should be reliable for a given piece of code. So if it takes 5 (or, sigh, -5) cycles now, it'll take that many on every run. The unreliable aspect is in what you'd expect vs. what you get. Oh and possibly it might change in a new version, of course.
I don't think zep has ever stated what the pseudo-frequency of the pico-8 processor is, but the magic numbers I came up with to force a certain operation to take exactly one second if repeated suggest around 4.2MHz. This seemed like a strange choice to me until recently, when I realized I was thinking in MHz (mega=1000 x 1000) instead of MiHz (mebi=1024 x 1024). 4MiHz is 4 x 1024 x 1024 = 4,194,304, which is pretty much exactly the 4.2MHz I was guessing at. So I think 4MiHz is very likely the clock speed we're getting, but we have no official confirmation.
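To put that figure into per-frame terms, here is the arithmetic as a short sketch. It takes the unconfirmed 4 MiHz estimate at face value and uses the two update rates PICO-8 offers (30 and 60 fps). Run it in desktop Lua rather than in PICO-8 itself, since these values are far outside PICO-8's 16.16 fixed-point number range. The variable names are made up.

```lua
-- desktop lua only: these numbers overflow pico-8's number range
clock_hz = 4 * 1024 * 1024             -- 4 MiHz, an unconfirmed estimate

cycles_per_frame_30 = clock_hz / 30    -- budget per frame at 30 fps
cycles_per_frame_60 = clock_hz / 60    -- budget per frame at 60 fps
ns_per_cycle = 1e9 / clock_hz          -- duration of one cycle in nanoseconds

print(string.format("%d cycles per second", clock_hz))
print(string.format("%.0f cycles per frame at 30 fps", cycles_per_frame_30))
print(string.format("%.0f cycles per frame at 60 fps", cycles_per_frame_60))
print(string.format("%.1f ns per cycle", ns_per_cycle))
```

So under this assumption a 30 fps game has roughly 140 thousand cycles to spend per frame, and a 60 fps game about half that.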