tl;dr

In the PICO-8 console, run load #prof, then edit the last tab with the code you want to measure:

prof(function(x)
  local _=sqrt(x)   -- code to measure
end,function(x)
  local _=x^0.5     -- some other code to measure
end,{ locals={9} }) -- "locals" (optional) are passed in as args

Run the cart: it will tell you exactly how many cycles it takes to run each code snippet.


what is this?

The wiki is helpful to look up CPU costs for various bits of code, but I often prefer to directly compare two larger snippets of code against each other. (plus, the wiki can get out of date sometimes)

For the curious, here's how I'm able to calculate exact cycle counts:
(essentially, I run the code many times, run an empty function the same number of times, and compare the two, using stat(1) and stat(2) for timing)

-- slightly simplified from the version in the cart
function profile_one(func)
  local n = 0x1000

  -- we want to type
  --   local m = 0x80_0000/n
  -- but 8MHz is too large a number to handle in pico-8,
  -- so we do (0x80_0000>>16)/(n>>16) instead
  -- (n is always an integer, so n>>16 won't lose any bits)
  local m = 0x80/(n>>16)

  -- given three timestamps (pre-calibration, middle, post-measurement),
  --   calculate how many more CPU cycles func() took compared to noop()
  -- derivation:
  --   T := ((t2-t1)-(t1-t0))/n (frames)
  --     this is the extra time for each func call, compared to noop
  --     this is measured in #-of-frames (at 30fps) -- it will be a small fraction for most ops
  --   F := 1/30 (seconds/frame)
  --     this is just the framerate that the tests run at, not the framerate of your game
  --     can get this programmatically with stat(8) if you really wanted to
  --   M := 256*256*128 = 8MHz (cycles/second)
  --     (PICO-8 runs at 8MHz; see https://www.lexaloffle.com/dl/docs/pico-8_manual.html#CPU)
  --   cycles := T frames * F seconds/frame * M cycles/second
  -- optimization / working around pico-8's fixed point numbers:
  --   T2 := T*n = (t2-t1)-(t1-t0)
  --   M2 := M/n := m (e.g. when n is 0x1000, m is 0x800)
  --   cycles := T2*M2*F
  local function cycles(t0,t1,t2) return ((t2-t1)-(t1-t0))*m/30 end

  local noop=function() end -- this must be local, because func is local
  flip()
  local atot,asys=stat(1),stat(2)
  for i=1,n do noop() end -- calibrate
  local btot,bsys=stat(1),stat(2)
  for i=1,n do func() end -- measure
  local ctot,csys=stat(1),stat(2)

  -- gather results
  local tot=cycles(atot,btot,ctot)
  local sys=cycles(asys,bsys,csys)
  return {
    lua=tot-sys,
    sys=sys,
    total=tot,
  }
end
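
To sanity-check the conversion, here is a minimal sketch of the same arithmetic in standard Lua (not PICO-8), where numbers are floating point, so 8MHz fits and the unoptimized formula can be written directly. The function name extra_cycles and the three timestamps are made up for illustration; they are not real stat(1) readings.

```lua
-- sketch of the frames-to-cycles conversion only, for standard lua
-- (in pico-8 itself, 256*256*128 would overflow)
local fps = 30           -- the profiler runs its tests at 30fps
local hz  = 256*256*128  -- 8388608 cycles per second

-- extra cycles per call of func, compared to noop
-- t0,t1,t2: timestamps in frames (before noop loop, between loops, after func loop)
local function extra_cycles(t0, t1, t2, n)
  local frames = ((t2-t1) - (t1-t0)) / n -- extra frames per call
  return frames / fps * hz
end

-- hypothetical readings: the noop loop took 0.25 frames and
-- the func loop took 0.30859375 frames, with n = 4096 calls each
local result = extra_cycles(0, 0.25, 0.55859375, 4096)
print(result)
```

With these made-up readings, the func loop takes 15/256 of a frame longer than the noop loop, which works out to 4 cycles per call. profile_one computes the same value as T2*m/30, with m = 0x800.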

how do I use it?

Here's an older demo to wow you:

Cart #cyclecounter-2 | 2022-01-16 | License: CC4-BY-NC-SA

This is neat but impractical; for everyday usage, you'll want to load #prof and edit the last tab.

The cart comes with detailed instructions, reproduced here for your convenience:

-----------------------
-- ★ usage guide ★ --
-----------------------

웃: i have two code snippets;
    which one is faster?

🐱: edit the last tab with your
    snippets, then run the cart.
    it will tell you precisely
    how much cpu it takes to
    run each snippet.

    the results are also copied
    to your clipboard.

웃: what do the numbers mean?

🐱: the cpu cost is reported
    as lua and system cycle
    counts. look up stat(1)
    and stat(2) for more info.

    if you're not sure, just
    look at the first number.
    lower is faster (better)

웃: why "{locals={9}}"
    in the example?

🐱: accessing local variables
    is faster than global vars.

    so if your test involves
    local variables, simulate
    this by passing them in:

      prof(function(a)
        sqrt(a)
      end,{ locals={9} })

    /!\     /!\     /!\     /!\
    local values from outside
    the current scope are also
    slower to access! example:

      global = 4
      local outer = 4
      prof(function(x)
        local _ = x --fast
      end,function(x)
        local _ = outer --slow
      end,function(x)
        local _ = global --slow
      end,{ locals={4} })
    /!\     /!\     /!\     /!\

웃: can i do "prof(myfunc)"?

🐱: no, this sometimes gives
    wrong results! always use
    inline functions:

      prof(function()
        --code for myfunc here
      end)

    as an example, "prof(sin)"
    reports "-2" -- wrong! but
    "prof(function()sin()end)"
    correctly reports "4"

    (see the technical notes at
    the start of the next tab
    for a brief explanation.
    technically, "prof(myfunc)"
    will work if myfunc was made
    by the user, but you will
    risk confusing yourself)

There are also instructions included on two alternate ways you can profile your code, without using prof:

---------------
 ★ method 2 ★
---------------

this cart is based on
code by samhocevar:
https://www.lexaloffle.com/bbs/?pid=60198#p

if you do this method, be very
careful with local/global vars.
it's very easy to accidentally
measure the wrong thing.

here's an example of how to
measure cycles (ignoring this
cart and using the old method)

  function _init()
    local a=11.2 -- locals

    local n=1024
    flip()
    local tot1,sys1=stat(1),stat(2)
    for i=1,n do   end --calibrate
    local tot2,sys2=stat(1),stat(2)
    for i=1,n do local _=sqrt(a) end --measure
    local tot3,sys3=stat(1),stat(2)

    function cyc(t0,t1,t2) return ((t2-t1)-(t1-t0))*128/n*256/stat(8)*256 end
    local lua = cyc(tot1-sys1,tot2-sys2,tot3-sys3)
    local sys = cyc(sys1,sys2,sys3)
    print(lua.."+"..sys.."="..(lua+sys).." (lua+sys)")
  end

run this once, see the results,
then change the "measure" line
to some other code you want
to measure.

note: wrapping the code inside
"_init()" is required, otherwise
builtin functions like "sin"
will be measured wrong.
(the reason is explained at
the start of the next tab)

---------------
 ★ method 3 ★
---------------

another way to measure cpu cost
is to run something like this:

  function _draw()
    cls(1)
    local x=9
    for i=1,1000 do
      local a=sqrt(x) --snippet1
  --    local b=x^0.5 --snippet2
    end
  end

while running, press ctrl-p to
see the performance monitor.
the middle number shows how much
of cpu is being used, as a
fraction. (0.60 = 60% used)

now, change the comments on the
two code snippets inside _draw()
and re-run. compare the new
result with the old to determine
which snippet is faster.

note: every loop iteration costs
an additional 2 cycles, so the
ratio of the two fractions will
not match the ratio of the 
execution time of the snippets.
but this method can quickly tell
you which snippet is faster.
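
As a rough illustration of that last note, the sketch below uses made-up snippet costs (4 and 12 cycles; these are assumptions, not measurements) together with the 2-cycle loop overhead mentioned above, to show how the ratio seen in the performance monitor differs from the true ratio. It ignores cls(1) and any other per-frame work.

```lua
-- hypothetical per-iteration costs in cycles (assumed, not measured)
local cost1, cost2 = 4, 12
local loop_overhead = 2 -- per-iteration cost of the for loop, per the note above

-- ratio of the snippets' real costs
local true_ratio = cost2 / cost1
-- ratio you would see when both snippets are measured inside the loop
local seen_ratio = (cost2 + loop_overhead) / (cost1 + loop_overhead)

print("true: "..true_ratio.."  seen: "..seen_ratio)
```

Here the second snippet really costs 3 times as much as the first, but the monitor would suggest a ratio of only about 2.33. The ordering (which snippet is faster) is unaffected.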

various results

Here are some speed comparisons I found interesting. Some of the numbers may be out of date by now:

poke4 vs. memcpy

prof(function() memcpy(0,0x200,64) end,       -- 71 (7 lua, 64 sys)
     function() poke4(0,peek4(0x200,16)) end) -- 67 (7 lua, 60 sys)

Copying 64 bytes of memory is very slightly faster if you use poke4 instead of memcpy -- interesting!
(iirc this is true for other data sizes too; download and run the cart to check for yourself!)

edit: this has changed in 0.2.4b! the memcpy in this example now takes 39 cycles

constant folding

I thought lua code was not optimized by the lua compiler/JIT at all, but it turns out there are a few specific optimizations it will do.

prof(function() return 2+2 end,
     function() return 2+2+2+2+2+2+2+2 end)

These functions both take a single cycle! That long addition gets optimized by lua, apparently. @luchak found these explanations:

https://stackoverflow.com/questions/33991369/does-the-lua-compiler-optimize-local-vars/33995520
> Since Lua often compiles source code into byte code on the fly, it is designed to be a fast single-pass compiler. It does do some constant folding

A No Frills Introduction to Lua 5.1 VM Instructions (book)
> As of Lua 5.1, the parser and code generator can perform limited constant expression folding or evaluation. Constant folding only works for binary arithmetic operators and the unary minus operator (UNM, which will be covered next.) There is no equivalent optimization for relational, boolean or string operators.

constant folding...?

One further test case:

prof(function() local a=2  return 2+2+2+2+2+2+2+a end, --2
     function() local a=2  return a+2+2+2+2+2+2+2 end) --8

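These cost different amounts! Constant folding only seems to work at the start of expressions. This is consistent with + being left-associative: a+2+2+... is evaluated as ((a+2)+2)+..., so no addition ever has constants on both sides. (This is all highly impractical code anyway, but it's fun to dig in and figure out this sort of thing.)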
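
Spelled out with explicit parentheses (+ is left-associative in Lua), the two expressions above group as follows. This is a sketch that only checks that the results are equal. The third form is a hypothetical variation added for illustration; its cycle cost has not been measured here, so run it through prof before relying on it.

```lua
local a = 2

-- 2+2+2+2+2+2+2+a groups like this: every addition before
-- the final +a has constants on both sides, so it can be folded
local r1 = ((((((2+2)+2)+2)+2)+2)+2)+a

-- a+2+2+2+2+2+2+2 groups like this: every addition has a
-- non-constant left side, so there is nothing to fold
local r2 = ((((((a+2)+2)+2)+2)+2)+2)+2

-- hypothetical variation: parentheses put the constants next to each other
local r3 = a+(2+2+2+2+2+2+2)

print(r1.." "..r2.." "..r3)
```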

credits

Cart by pancelor.

Thanks to @samhocevar for the initial snippet that I used as a basis for this profiler!

Thanks to @freds72 and @luchak for discussing an earlier version of this with me!

Thanks to thisismypassword for updating the wiki's CPU page!

changelog

v1.4

  • redo explanations
  • more thorough explanation of pitfalls of alternate methods
    • why measuring sin() at top-level is no good
    • why function _draw() for i=1,1000 do ... end end can be misleading

v1.3

  • simpler BBS post, friendlier cart instructions

v1.2

  • rewrite; recommend using load #prof instead now

v1.1

  • added: press X to copy to clipboard
  • added: can pass args; e.g. profile("lerp", lerp, {args={1,4,0.3}})

v1.0

  • initial release

the profiler is missing an input variable somehow - the current pattern forces declaration of a local (or global) to mimic real life usage

qol request: copy results to clipboard


good points -- added! passing input variables is slightly awkward, but it's at least possible now


I've updated the description + cart (load #prof) to be a lot clearer


