Just a few notes from playing around with userdata. These notes assume 8M VM cycles/sec and large enough arrays to avoid substantial overhead and fully realize economies of scale.
- Fast ops - `add`/`mul`/`copy`/etc. - cost 1/16 cycle.
- Slow ops - `div`/`convert`/etc. - cost 1/4 cycle.
- `matmul` is charged as a fast op based on the size of the output matrix. I'm a little suspicious that the answer seems to be so simple, so I'm wondering if I missed something.
- `copy` and `memmap`/`memcpy` are approximately the same speed for 64-bit datatypes. For smaller datatypes, `memcpy` is proportionally faster, though of course you then have to manage strides/spans yourself. `memcpy` should also enable `reinterpret_cast`-type shenanigans (see the sketches after this list).
- There is substantial overhead for small spans. If you use spans of length 1, you pay 1/4 cycle/span, the same as a slow op. It looks like this may be a flat cost per span, but I'm not sure. Using the full/strided forms of the ops does not seem to have noticeable additional costs beyond the per-span cost.
- For full-screen rendering, you have about 1 cycle/pixel at 480x270x60Hz (480*270*60 is roughly 7.8M pixels/sec against the 8M cycles/sec budget). This includes whatever scaling/conversion you need to do at the end of the process. So while 1/16-cycle fast ops give a theoretical ceiling of ~16 ops/pixel, realistically you'll get in the neighborhood of 10 additions/multiplications per pixel. Exact numbers depend on whether you need a divide at the end, and whether or not you can work in `u8`.
- userdata flat access w/ locals seems to cost 1 cycle/element, including the assignment.
- userdata `get` is 1/4 cycle/element at scale ... but each explicit assignment will cost you 1 cycle on top of this.
- userdata `set` is 1 cycle/element at scale. (See the access-pattern sketch after this list.)
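For concreteness, here's roughly how I'm exercising the ops above. A sketch, with the calling conventions from memory: I'm assuming the second argument can be an output userdata, and that the extended form takes offsets, span length, strides, and span count in that order - double-check against the manual.

```lua
local n = 4096
local a   = userdata("f64", n)  -- flat f64 vectors
local b   = userdata("f64", n)
local out = userdata("f64", n)

-- simple forms over the whole array:
a:add(b, out)  -- fast op, ~1/16 cycle/element
a:div(b, out)  -- slow op, ~1/4 cycle/element

-- assumed extended/strided form: offsets for a/b/out, span length,
-- strides for a/b/out, span count. 64 contiguous spans of 64 elements;
-- this seems to cost about the same as the single big span above.
a:add(b, out, 0, 0, 0, 64, 64, 64, 64, 64)

-- spans of length 1: the flat ~1/4 cycle/span overhead dominates, so a
-- "fast" op ends up costing the same as a slow one.
a:add(b, out, 0, 0, 0, 1, 1, 1, 1, n)
```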
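The `memmap`/`memcpy` reinterpret trick mentioned above looks something like this. Also a sketch: the `memmap(addr, ud)` argument order and the addresses are assumptions on my part, so check the manual before trusting it.

```lua
local n = 1024
local floats = userdata("f64", n)
local bytes  = userdata("u8",  n * 8)  -- same size in bytes as the f64 array

-- map both userdatas into the process address space
-- (0x80000 / 0xa0000 are hypothetical free regions)
memmap(0x80000, floats)
memmap(0xa0000, bytes)

-- raw byte copy: no conversion happens, so the u8 array now aliases the
-- f64 bit patterns - reinterpret_cast-type shenanigans
memcpy(0xa0000, 0x80000, n * 8)
```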
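And the access-pattern comparison behind the last three bullets, assuming 0-based `ud[i]` flat indexing and `get(offset, count)`/`set(offset, ...)` on 1D userdata:

```lua
local n = 4096
local src = userdata("f64", n)
local dst = userdata("f64", n)

-- flat access w/ locals: ~1 cycle/element, including the assignment
local acc = 0
for i = 0, n - 1 do
  local v = src[i]
  acc += v
end

-- get: ~1/4 cycle/element at scale, but each explicit assignment of the
-- returned multivals costs ~1 cycle on top
local v0, v1, v2, v3 = src:get(0, 4)

-- set: ~1 cycle/element at scale
dst:set(0, v0, v1, v2, v3)
```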
There also seems to be some interesting behavior where multivals, even very large multivals, do not noticeably increase CPU usage when passed to `pack` or `set`. While I'm enjoying taking advantage of this for `f64`-to-`u8` conversions at the same cost as `convert` (see the sketch below), I'm worried this might not last.
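For reference, the conversion trick looks something like this (again assuming `get(offset, count)` returns the elements of a 1D userdata as one multival):

```lua
local n = 4096
local f = userdata("f64", n)
local u = userdata("u8",  n)

-- get() returns all n elements as a single multival; feeding it straight
-- into set() coerces them to u8 on write. no explicit per-value
-- assignments, so (for now) it costs about the same as convert.
u:set(0, f:get(0, n))
```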