3d engine experiment and some thoughts

ilovezeno • 2025-01-27 19:18

Cart #a3denginetest-0 by ilovezeno | 2025-01-27 | License: CC4-BY-NC-SA

Many interesting 3D projects have already been created on PICO-8, so 3D applications on Picotron will likely attract significant interest as well. I'd like to share some thoughts from my recent experiments:

Strategy for Textured Triangle Rendering and Userdata

Picotron's userdata is particularly intriguing for someone like me who frequently uses NumPy, even though its XY index order is the opposite of NumPy's, a difference that has occasionally led to subtle, hard-to-detect bugs in my code. FReDs72 has demonstrated a highly efficient triangle rendering method using userdata. I modified it in two ways: I use a fixed command buffer instead of generating a new one each time (creating large userdata structures is not cheap), and I manually unrolled functions and optimized data structures. These changes improved performance by about 20-30%.
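
The fixed command buffer idea can be sketched independently of Picotron. The snippet below is plain Python rather than Picotron Lua, and `CommandBuffer`, `push`, `reset`, and the three-value command layout are hypothetical names chosen for illustration, not part of FReDs72's renderer or the Picotron API. It only shows the general pattern: allocate storage once, then rewind a write cursor each frame instead of building a new buffer.

```python
class CommandBuffer:
    """Fixed-capacity buffer that is allocated once and reused every frame."""

    def __init__(self, max_commands, stride):
        self.stride = stride          # values per command (hypothetical layout)
        self.data = [0.0] * (max_commands * stride)  # allocated a single time
        self.count = 0                # number of commands written this frame

    def reset(self):
        # Rewind the cursor; stale values are simply overwritten later.
        self.count = 0

    def push(self, values):
        if len(values) != self.stride:
            raise ValueError("wrong command size")
        start = self.count * self.stride
        if start + self.stride > len(self.data):
            return False              # buffer full: caller may flush or drop
        self.data[start:start + self.stride] = values
        self.count += 1
        return True


buf = CommandBuffer(max_commands=4, stride=3)
storage_id = id(buf.data)
for frame in range(3):
    buf.reset()                       # no new allocation per frame
    buf.push([frame, 1.0, 2.0])
    buf.push([frame, 3.0, 4.0])
```

Only the first `count` commands are valid in a given frame, so whatever consumes the buffer has to be told how many entries to read.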

However, there are currently some bugs in the userdata functions and tline3d, which I hope will be fixed soon. Because batch processing provides such significant performance gains for triangle rendering, a Z-buffer approach is too costly by comparison to be feasible. I therefore decided to use Z-sorting, which requires a global sort of all triangles and other depth-related objects before rendering. The rendering process is as follows:

1. Sort the object list.
2. Perform axis-aligned bounding box (AABB) tests for individual objects.
3. Transform the vertices of individual objects.
4. Perform triangle face culling and add triangles to the rendering buffer.
5. Once all objects are processed, sort the rendering buffer.
6. Render triangles and sprites from back to front.
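
The six steps above can be sketched roughly as follows. This is plain Python for illustration, not the cart's Picotron code: the object layout (`name`, `center`, `verts`, `tris`), the `transform` and `in_view` callbacks, the use of average depth as the sort key, and the counter-clockwise front-face convention are all assumptions. Sprites are left out, and every vertex is assumed to lie in front of the camera (z > 0).

```python
def signed_area(a, b, c):
    """Twice the signed area of the 2D triangle abc."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def build_draw_list(objects, transform, in_view):
    """Return (depth, triangle, object name) entries ordered back to front."""
    # 1. Sort the object list, nearest object first.
    ordered = sorted(objects, key=lambda o: transform(o["center"])[2])

    buffer = []
    for obj in ordered:
        # 2. Skip objects that fail the bounding-box test.
        if not in_view(obj):
            continue
        # 3. Transform the object's vertices.
        pts = [transform(v) for v in obj["verts"]]
        flat = [(x / z, y / z) for x, y, z in pts]
        # 4. Cull back faces and append the rest to the rendering buffer.
        for i, j, k in obj["tris"]:
            if signed_area(flat[i], flat[j], flat[k]) <= 0:
                continue
            depth = (pts[i][2] + pts[j][2] + pts[k][2]) / 3.0
            buffer.append((depth, (i, j, k), obj["name"]))

    # 5. Sort the whole buffer; 6. farthest entries come first so that
    # nearer triangles are drawn over them.
    buffer.sort(key=lambda entry: entry[0], reverse=True)
    return buffer


def make_object(name, z):
    # One front-facing triangle (0, 1, 2) and its back-facing twin (0, 2, 1).
    verts = [(0.0, 0.0, z), (1.0, 0.0, z), (0.0, 1.0, z)]
    return {"name": name, "center": (0.0, 0.0, z), "verts": verts,
            "tris": [(0, 1, 2), (0, 2, 1)]}


scene = [make_object("far", 5.0), make_object("near", 2.0)]
draw_list = build_draw_list(scene, transform=lambda p: p, in_view=lambda o: True)
draw_order = [entry[2] for entry in draw_list]
```

Here the back-facing twin of each triangle is culled, and `draw_order` lists the far object before the near one.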

A key point worth mentioning is the computational overhead of userdata. While the parallel, C-based userdata implementation may be faster than Lua for mathematical operations, its setup and function call overhead must be taken into account. Even the choice of broadcasting mechanism for a mathematical operation affects speed. For example, selectively operating on certain elements of a 5x1 vector might be slower than operating on all five elements. Careful benchmarking is necessary.

Dual-Level Sorting

Since the strategy involves Z-sorting all renderable objects, pre-sorting the objects can effectively accelerate the subsequent sorting step. In my implementation, the objects in the table are first transformed into clip space and sorted. Each object is then processed in turn: depending on its type, it is broken down into triangles or 2D maps, which are added to the rendering buffer. Finally, the buffer is sorted again. Objects closer to the camera are processed first (the reverse of the drawing order), and the final sort ensures that everything is still rendered correctly.
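
One possible reason pre-sorting helps is that the rendering buffer arrives nearly in order, which benefits any adaptive sort. The post does not say which sort the cart uses, so the sketch below assumes an insertion sort purely for illustration (plain Python, hypothetical depth values) and counts how many element shifts it needs with and without pre-sorted objects.

```python
def insertion_sort_shifts(values):
    """Sort a copy of values in ascending order and return (sorted, shifts)."""
    items = list(values)
    shifts = 0
    for i in range(1, len(items)):
        current = items[i]
        j = i - 1
        while j >= 0 and items[j] > current:
            items[j + 1] = items[j]   # move the larger element one slot right
            j -= 1
            shifts += 1
        items[j + 1] = current
    return items, shifts


def triangle_depths(object_depths):
    """Hypothetical scene: each object adds three triangles near its depth."""
    depths = []
    for d in object_depths:
        depths.extend([d + 0.3, d - 0.2, d + 0.1])
    return depths


presorted = triangle_depths([1.0, 2.0, 3.0, 4.0])  # objects in depth order
unsorted = triangle_depths([3.0, 1.0, 4.0, 2.0])   # objects in arbitrary order

sorted_a, shifts_a = insertion_sort_shifts(presorted)
sorted_b, shifts_b = insertion_sort_shifts(unsorted)
```

Both inputs end up in the same order, but the pre-sorted one needs far fewer shifts because only triangles within the same object are out of place. Whether this carries over to Picotron depends on the sort actually used, so it is worth measuring.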

Matrix Form Choices for Vertex Transformations

In modern 3D engines, the inclusion of the W component makes vertex coordinates four-dimensional (XYZW), and projection into clip space is typically achieved with a 4x4 transformation matrix. This transformation costs 16 multiplications and 12 additions per vertex. However, if the operations are limited to rotation and scaling, the W component and the convenience of clip space can be sacrificed in favor of a simpler 3D engine approach. The transformation matrix then reduces to a 3x4 format, which Picotron supports through its matmul3d operation, saving computation time.

However, matmul3d is only suitable for vector-matrix or matrix-matrix multiplications. If you have a 3xN vector group, it cannot be used directly. Iterating over the vectors in a loop would incur significant function call and I/O overhead, which is unacceptable when performance is the priority. This leaves two options:

1. Manually append the W component to the 3xN vector table stored in POD files, transforming it into a 4xN format for direct 4xN × 3x4 matrix multiplication. However, this approach increases RAM usage and introduces additional memory access overhead.
2. Decompose the 3x4 matrix into a 3x3 transformation matrix and a 3x1 translation vector. Multiply the 3xN vector group by the 3x3 matrix and then add the 3x1 vector to each result. Since the appended W equals 1, this operation is equivalent to the first approach in terms of results. For computational efficiency, I chose the second approach.
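
The equivalence of the two options is easy to check numerically. The sketch below is plain Python rather than Picotron userdata code, and it assumes a row-vector convention in which the first three rows of the matrix hold the 3x3 part and the last row holds the translation. The matrix and point values are made up for the example.

```python
def transform_homogeneous(points4, m):
    """Option 1: rows of (x, y, z, 1) times a matrix with 4 rows, 3 columns."""
    return [
        [sum(p[k] * m[k][c] for k in range(4)) for c in range(3)]
        for p in points4
    ]


def transform_split(points3, rot, trans):
    """Option 2: multiply by the 3x3 part, then add the translation."""
    return [
        [sum(p[k] * rot[k][c] for k in range(3)) + trans[c] for c in range(3)]
        for p in points3
    ]


# Hypothetical affine matrix: three rows for the 3x3 part,
# followed by one translation row.
matrix = [
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
    [5.0, 6.0, 7.0],
]
rot, trans = matrix[:3], matrix[3]

points3 = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [-4.0, 0.5, 1.0]]
points4 = [p + [1.0] for p in points3]  # option 1 stores an extra W per vertex

result_a = transform_homogeneous(points4, matrix)
result_b = transform_split(points3, rot, trans)
```

Both results are identical, while option 2 avoids storing and reading a fourth component for every vertex.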

For each object, the transformation steps are as follows:

Vertex coordinates → Object-to-world matrix → World-to-camera matrix → Camera-to-clip-space matrix (degraded format) → Screen-space projection.

Each object requires its own object-to-world transformation matrix, while the downstream matrices are the same for every object. To minimize computational cost, the downstream matrices are pre-multiplied into a single world-to-clip-space matrix (3x4 format). During object processing, the object-specific matrix is multiplied by the world-to-clip-space matrix, and the resulting matrix is applied to transform the vertices into clip space. Before this, an AABB test ensures that the object is at least partially within clip space. Since W is not included, I use X/Z and Y/Z for the projection.
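
A minimal sketch of the pre-multiplication and the W-free projection, again in plain Python under the same assumed layout (three rows for the 3x3 part plus one translation row, applied to row vectors). The helper names `compose`, `apply_affine`, and `project` and the matrix values are hypothetical. Scaling to screen pixels is omitted, and Z is assumed to be positive.

```python
def compose(a, b):
    """Return the affine matrix that applies a first, then b."""
    rows = [
        [sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
        for r in range(4)
    ]
    # The translation row also picks up b's own translation.
    rows[3] = [rows[3][c] + b[3][c] for c in range(3)]
    return rows


def apply_affine(m, p):
    """Transform one (x, y, z) point by an affine matrix."""
    return [sum(p[k] * m[k][c] for k in range(3)) + m[3][c] for c in range(3)]


def project(p):
    """Perspective projection without W: divide X and Y by Z."""
    x, y, z = p
    return (x / z, y / z)


# Hypothetical matrices: the object is scaled by 2 and moved 1 unit along X;
# the fixed downstream transform pushes everything 10 units along Z.
object_to_world = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0],
                   [0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]
world_to_clip = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0], [0.0, 0.0, 10.0]]

object_to_clip = compose(object_to_world, world_to_clip)  # once per object
vertex = [1.0, 2.0, 5.0]
clip = apply_affine(object_to_clip, vertex)
two_step = apply_affine(world_to_clip, apply_affine(object_to_world, vertex))
screen = project(clip)
```

Composing once per object means each vertex goes through a single matrix instead of two, and `clip` matches the `two_step` result.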

Current Performance

Currently, the system can maintain 30 frames per second when rendering around 600 textured triangles. This provides a reasonable guideline for resource scaling when designing games.

The 3D model "Abandoned house" used here is by Alex, who made it with Crocotile 3D.