Adapting a solution found ca. pages 179-180 of this white paper:
support.amd.com/TechDocs/25112.PDF (archived)
Apparently it's commonly called "popcount" or "popcnt", after an asm instruction on some CPUs. Basically, it's the population-of-1's count. I think I assumed it would be called "bitcount".
I finagled it into this PICO-8-specific form, which counts every bit (including fractional bits) in constant and pretty-brief time, with just 2 muls, 2 adds, 3 shifts, and 5 bitwise ops:
function popcount(v) v-=v>>>1&0x5555.5555 v=(v&0x3333.3333)+(v>>>2&0x3333.3333) return (v*0x1.1&0x0f0f.0f0f)*0x0101.0101<<8&0xff end |
It's a bit uglier than it would be in C/C++, but it's mostly the same operations, with a minor adaptation for the 16-bit shifts inherent in 16.16 fixed-point multiplies. PICO-8 also benefits from being able to do a fast multiply by a fixed-point value like 0x1.1 in the third line of the function, which would have required a shift and an add otherwise.
2023 Edit: BUT WAIT!
We can do (slightly) better on PICO-8.
I realized we could eliminate the final <<8 by taking advantage of the automatic shifting that happens with PICO-8 fixed-point multiplies. We're already shifting less than the original version, which had to >>24 to ge the right value at the ones position of a 32-bit int, and that's because we're working with fixed-point multiplies that include a >>16 to maintain the fraction.
Turns out there are just enough bits of headroom during the final two multiplies that no section will overflow into the next if we scale up the constants. We can turn the final line's *0x1.1
into *0x4.4
to add a <<2, which also had to be applied to the &0x0f0f.0f0f
mask to get &0x3c3c.3c3c)
and then also in the final line we can turn *0x0101.0101
into *0x4040.4040
to add a <<6, meaning we don't need the <<8 anymore:
function popcount(v) v-=v>>>1&0x5555.5555 v=(v&0x3333.3333)+((v>>>2)&0x3333.3333) return (v*0x4.4&0x3c3c.3c3c)*0x4040.4040&0xff end |
Just leaving it here in case anyone ever needs it.
Changelog:
- 2023-10-20: Updated for more recent PICO-8 syntax.
- 2023-10-21: Avoided one final shift by incorporating it into two of the intermediate fixed-point multiplies.
Just in case you haven't seen it, I strongly recommend this page if you find yourself dealing with this sort of algorithm with any frequency: https://graphics.stanford.edu/~seander/bithacks.html
That page presents your algorithm as
v = v - ((v >> 1) & 0x55555555); // reuse input as temporary v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count |
Heads up! I found a small optimization while updating to PICO-8's current more-concise bit op syntax. 🙂
See the "2023 EDIT" in the OP to remove one shift operation.
Couldn't compete in the number of operations area, so I looked at token numbers, but most ideas were dead ends :
lshr(v) is the same as lshr(v,0) : useless
no ternary operator
Can't use 0 as false, it's considered true in lua (at least I learned something).
-- 23 tokens function popcount(v) if (v==0) return 0 return ceil(v&0x0.0001)+popcount(v>>1) |
Performance wise, we're looking at an average of 32 bit operations, 16 ceil, 16 additions, 16 comparisons, and 16 function calls (!). You'd really need the token very badly to use it.
Technically, my function gives better performance if v is zero, so I'd recommend it if you want to count the number of ones in zero, in witch case you can optimize it further by using 0 in your code instead.
[Please log in to post a comment]