First thought: Wow, that looks just like how Syndicate works.
Second: that's a terrible idea (at least in 2025).
There's a tutorial (this one, I think: https://youtu.be/i_XV78N7Zuo) on how to make a tool to compose your tiles.
If you want to make a tile-space renderer, that's harder, but having done it, I can probably talk you through it. You need to look through tile-space diagonally to make in-front/behind work correctly. The way I'd do it today would probably be to 'shoot rays' along the view direction into the tile-space, and for each ray record the first tile fragment it hits, or however many fragments it takes to completely obscure the view. Then 'just' render from that look-up table. (There's a fruity view(x, y) to tile(x, y, z) transform, and you still need to render transient objects at the correct depth. Also, scrolling/panning: do you only do that by whole tile, or do you also do sub-tile-fragment pan?)
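Roughly what I mean by shooting rays, as a minimal Python sketch and not what any particular engine does. Everything here is my own assumption for illustration: the names `cast_ray` and `build_view_lut`, the world as a dict from (x, y, z) to a tile id, and the projection view_y = y - z (one step up in z moves a tile one row up the screen). It also treats a whole tile as one fragment, which a real renderer wouldn't.

```python
from typing import Dict, FrozenSet, List, Tuple

EMPTY = 0                      # assumed id for "no tile here"
Pos = Tuple[int, int, int]     # (x, y, z) in tile-space


def cast_ray(tiles: Dict[Pos, int], vx: int, vy: int, height: int,
             transparent: FrozenSet[int] = frozenset()) -> List[Pos]:
    """Walk one diagonal ray for view cell (vx, vy), front to back.

    Assumed projection: view_y = y - z, so the ray visits (vx, vy + z, z).
    The camera sits up and towards +y, so the highest z is nearest.
    """
    hits: List[Pos] = []
    for z in range(height - 1, -1, -1):
        pos = (vx, vy + z, z)
        tile_id = tiles.get(pos, EMPTY)   # out-of-bounds reads as empty
        if tile_id == EMPTY:
            continue
        hits.append(pos)
        if tile_id not in transparent:
            break                         # view is fully obscured, stop
    return hits


def build_view_lut(tiles: Dict[Pos, int], width: int, depth: int, height: int,
                   transparent: FrozenSet[int] = frozenset()
                   ) -> Dict[Tuple[int, int], List[Pos]]:
    """Map each view cell to the tile positions visible in it, front to back."""
    lut: Dict[Tuple[int, int], List[Pos]] = {}
    for vx in range(width):
        # Tall tiles stick out above the map, hence the negative view rows.
        for vy in range(-(height - 1), depth):
            hits = cast_ray(tiles, vx, vy, height, transparent)
            if hits:
                lut[(vx, vy)] = hits
    return lut


# Two floor tiles (id 1) and one raised tile (id 2) on top of the second.
world = {(0, 0, 0): 1, (0, 1, 0): 1, (0, 1, 1): 2}
lut = build_view_lut(world, width=1, depth=2, height=2)
for cell, hits in sorted(lut.items()):
    print(cell, hits)
```

In that little world the raised tile at (0, 1, 1) lands on view cell (0, 0) and hides the floor tile at (0, 0, 0). To draw, go through each cell's list in reverse (back to front), and slot transient objects in by comparing their tile position against the entries. You'd rebuild the affected rays whenever a tile changes.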
If you can get away with just stacking some tilemaps, do that instead, but ask if you need more.
Well, you can do the composition on the fly, by having the different sides of the diagonal on different tilemap layers, that'd make things easier, right?
Check my thinking here, but you'll get a checkerboard pattern alternating diagonals, right? Then, I'd suggest
It's how this works: https://github.com/aes/autotile3d and it would work.
But then, there's this, I guess: https://youtu.be/dclc8w6JW7Y