Previous journal: | Next journal: |
---|---|
0086-2023-05-26.md | 0088-2023-05-28.md |
- I created the raybox_reciprocal repo to try and optimise the reciprocal approximator in isolation.
- I've worked out that the
reciprocal
design (includinglzc
) can fit in 200x200µm as-is. I'm trying to optimise this as much as possible before I start trying to see if I can share it by reworking the Raybox design so it doesn't need so many parallel instances. - Currently, the max. speed it can run at is about 40MHz (25ns propagation delay). This might be too close for comfort but note that we can tolerate waiting for it when it is in the tracer... just not so much when used realtime for column/sprite rendering. In that case, if we really can't make it fast enough, we could try just calculating the reciprocal in the tracer and storing it in a bigger buffer (which might end up taking up less area).
- One area we might be able to optimise is
lzc
. I think the usual approach here is to have each bit of the LZC result calculated by testing a group of bits. - I think the rest of the delay comes from multipliers and what I've observed to be a pretty
long chain when looking at the
.dot
file out of Yosys. - Depending on how big it ends up being, and how deep, should we try sky130 HS cells? Might be worth it to meet timing requirements.
- NOTE: Some reciprocals are known to only need to work with inputs that are typically
≤ 1.0, e.g.
1/rayDir
. - Can some reciprocals work fine with Q6.10, or even have Q6.10 input and Q10.6 output?
- How much silicon space will we save by making memories (except trace buffer) external? I don't think we can make the trace buffer external because it would need to be separate from texture/sprite memory, and hence would require too many IOs.
- Per this:
Unfortunately, the OpenLANE tools are currently configured to only accept Verilog-2005, not SystemVerilog. For this, we can use an open-source tool called
sv2v
to modify our code from SystemVerilog to Verilog.% mv Counter.v Counter.sv % sv2v -w adjacent Counter.sv % rm Counter.sv % less Counter.v
- Do we need to worry about
USE_POWER_PINS
? - OpenLane start-up re sky130 HD cells:
[INFO]: PDK Root: /home/zerotoasic/asic_tools/pdk [INFO]: Process Design Kit: sky130A [INFO]: Setting PDKPATH to /home/zerotoasic/asic_tools/pdk/sky130A [INFO]: Standard Cell Library: sky130_fd_sc_hd [INFO]: Optimization Standard Cell Library: sky130_fd_sc_hd
- Write verification for
raybox_reciprocal
? Implement its own fixed-point algorithm in cocotb, and then also write a verification that compares it to actual reciprocal and verifies it is within a given tolerance. - Then try to share one instance in Raybox:
- 2 for rayDir (only needed during trace setup for each column in tracer).
- 2 for sprites (only needed once per each sprite in tracer).
- 1 for wall height scaler (needed constantly when out of VBLANK/tracing).
- Sprite height scaler?? Need to work out when this is needed if we have more than 1 sprite. Probably per each sprite, but maybe this can be done during VBLANK and stored.
- Should we use:
"CLOCK_TREE_SYNTH": false, "CLOCK_PORT": null,
- Also look into:
- GRT_ALLOW_CONGESTION
- GRT_OVERFLOW_ITERS
-
Extracted
reciprocal
andlzc_b
modules. -
Added as an OpenLane design:
cd $OPENLANE_ROOT pushd designs git clone [email protected]:algofoogle/raybox_reciprocal popd make mount ./flow.tcl -design raybox_reciprocal -init_design_config -add_to_designs
-
This created the
config.json
file which I only changed to specify that source files are insrc/rtl/
:{ "DESIGN_NAME": "reciprocal", "VERILOG_FILES": "dir::src/rtl/*.v", "CLOCK_PORT": "clk", "CLOCK_PERIOD": 10.0, "DESIGN_IS_CORE": true }
-
Tried hardening:
time ./flow.tcl -design raybox_reciprocal -verbose 1
-
NOTE:
logs/synthesis/1-synthesis.log
has some interesting info in it like:1. Executing Verilog-2005 frontend: /openlane/designs/raybox_reciprocal/src/rtl/reciprocal.v Parsing SystemVerilog input from `/openlane/designs/raybox_reciprocal/src/rtl/reciprocal.v' to AST representation. Generating RTLIL representation for module `\reciprocal'. /openlane/designs/raybox_reciprocal/src/rtl/reciprocal.v:29: Warning: converting real value 6.004236e+03 to binary 24'000000000001011101110100. /openlane/designs/raybox_reciprocal/src/rtl/reciprocal.v:35: Warning: converting real value 4.100415e+03 to binary 24'000000000001000000000100. reciprocal params for Q12.12: n1466=001774, n10012=001004, nSat=7FFFFF Successfully finished Verilog frontend.
-
Oops, got the OpenLane LVS error about unmatched pins at step 33, because I'm using negative bit vector indices on the inputs and outputs.
-
NOTE:
summary.py --design raybox_reciprocal --run --full-summary
still works and tells us:DIEAREA_mm^2 : 0.04553003842500001 cell_count : 1971 cells_pre_abc : 7875 Total_Physical_Cells : 699 suggested_clock_period : 23.89 suggested_clock_frequency : 41.85851820845542 CLOCK_PERIOD : 10.0 SYNTH_MAX_FANOUT : 10 FP_CORE_UTIL : 50 PL_TARGET_DENSITY : 0.55
In summary:
- 45,530squm (approx. 214x214µm, actually closer to 208x219µm).
- Propagation delay would allow up to about 42MHz operation.
- It took about 7mins on my desktop PC.
-
Running
./flow.tcl -design raybox_reciprocal -tag RUN_2023.05.27_05.50.03 -gui
: -
I managed to get it down to 200x200µm 40,000squm with:
{ "DESIGN_NAME": "reciprocal", "VERILOG_FILES": "dir::src/rtl/*.v", "CLOCK_PORT": "clk", "CLOCK_PERIOD": 10.0, "DESIGN_IS_CORE": true, "FP_SIZING": "absolute", "DIE_AREA": "0 0 200 200", "PL_TARGET_DENSITY": 0.65, "GRT_OVERFLOW_ITERS": 100 }
-
I also created a separate top module called
reciprocal_12_12
so I can avoid the negative vector bit indices, which allows the process to complete. Mind you, it does lead to slightly different timing for some reason... while previous attempts got to 41.98MHz, with this new top it gets to 40.37MHz. Maybe the pins ended up with different placement? -
Lower speed of course means we have timing violations (setup violations in this case), with most paths showing worst slack of -14.7ns (i.e. 10ns - -14.7ns = 24.7ns, which is the recommended clock period we're getting for ~40MHz).
-
I added this config:
"FP_PIN_ORDER_CFG": "dir::pin_order.cfg"
and this is the contents ofpin_order.cfg
:#N i_.* #S o_.*
-
That gives about the same max speed (40.78MHz), but with inputs neatly and sequentially at the top, and outputs likewise at the bottom...