Bridging the Gap Between RTL and Silicon
When developing custom AI accelerators, teams face a fundamental tradeoff between simulation speed, fidelity, and availability:
| Simulation Type | Speed | Fidelity | When Available |
|---|---|---|---|
| RTL Simulation | ~1-10 Hz | Cycle-accurate | Early design |
| FPGA Emulation | ~1-10 MHz | Near-accurate | Mid development |
| Real Silicon | GHz | Ground truth | Post tape-out |
RTL simulation is too slow for running real ML workloads. Real hardware doesn’t exist yet. FPGA emulation requires expensive boards and long build times.
Functional simulation fills this gap: fast enough to run real firmware, accurate enough to validate system behavior.
```mermaid
graph LR
    subgraph "Development Timeline"
        A[RTL Design] --> B[Functional Sim]
        B --> C[FPGA Emulation]
        C --> D[Silicon Bring-up]
    end
    subgraph "Speed vs Fidelity"
        E["RTL: 1 Hz<br/>Cycle-accurate"]
        F["FunSim: 100 MHz+<br/>Behavioral"]
        G["Silicon: GHz<br/>Ground truth"]
    end
    A -.-> E
    B -.-> F
    D -.-> G
```
Functional simulation isn’t about modeling every transistor. It’s about modeling what matters for software validation:
```mermaid
graph TD
    subgraph "Modeled (Behavioral)"
        M1[Network-on-Chip Topology]
        M2[Tile MMIO Interfaces]
        M3[Memory Controller Behavior]
        M4[Interrupt Delivery]
        M5[Accelerator Compute Results]
    end
    subgraph "Abstracted Away"
        S1[Pipeline Hazards]
        S2[Cache Coherence Protocols]
        S3[Clock Domain Crossings]
        S4[Power Management States]
    end
    subgraph "Goal"
        G[Firmware runs correctly<br/>System behavior matches spec]
    end
    M1 --> G
    M2 --> G
    M3 --> G
    M4 --> G
    M5 --> G
```
| Component | Why Model It? | How (QEMU + Extensions) |
|---|---|---|
| Custom NoC | Validates data routing, congestion, arbitration | QEMU TCG + custom MMIO devices; route traffic via sockets |
| Memory Subsystem | Tests bandwidth limits, bank conflicts | Emulate memory controllers with latency/bandwidth caps |
| Accelerator Tiles | Verify tile scheduling, DMA, interrupts | Model as QEMU platform devices; expose register interface |
| Inter-Chip Communication | Simulate multi-chip/module setups | Loopback TCP/UDP or vhost-user between QEMU instances |
| Clock & Power Domains | Validate DVFS, idle states | Inject synthetic delays or throttling based on workload |
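The "latency/bandwidth caps" approach in the table reduces to delaying each access until the modeled controller has drained prior traffic. A minimal sketch in virtual time; the `BandwidthLimiter` class and its interface are illustrative, not a QEMU API:

```python
class BandwidthLimiter:
    """Caps modeled memory traffic at bytes_per_sec of virtual time.
    Illustrative model only, not part of QEMU."""

    def __init__(self, bytes_per_sec: float):
        self.bytes_per_sec = bytes_per_sec
        self.next_free_time = 0.0  # virtual time when the controller is idle again

    def access(self, now: float, num_bytes: int) -> float:
        """Return the virtual completion time of a transfer issued at `now`."""
        start = max(now, self.next_free_time)  # queue behind in-flight traffic
        self.next_free_time = start + num_bytes / self.bytes_per_sec
        return self.next_free_time


# Two back-to-back 1 MiB reads at 1 GiB/s: the second queues behind the first.
limiter = BandwidthLimiter(bytes_per_sec=2**30)
t1 = limiter.access(now=0.0, num_bytes=2**20)
t2 = limiter.access(now=0.0, num_bytes=2**20)
```

A device model calls `access()` on every MMIO-visible transfer and uses the returned time to delay DMA completion interrupts, which is enough to surface bandwidth bottlenecks in firmware.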
```mermaid
flowchart TD
    subgraph "QEMU Instance"
        CPU[CPU Model<br/>RISC-V/ARM]
        subgraph "Custom Device Models"
            NOC[NoC Controller<br/>MMIO + Socket Backend]
            MEM[Memory Controller<br/>Bandwidth Limiting]
            ACC[Accelerator Model<br/>Compute + DMA]
            INT[Interrupt Controller<br/>Level/Edge Triggered]
        end
        subgraph "Backends"
            SOCK[Socket I/O<br/>Inter-Instance]
            FILE[File I/O<br/>Logging]
            SHM[Shared Memory<br/>Fast Path]
        end
    end
    CPU <--> NOC
    CPU <--> MEM
    CPU <--> ACC
    CPU <--> INT
    NOC <--> SOCK
    ACC <--> FILE
    MEM <--> SHM
```
| Component | Model Approach | Why |
|---|---|---|
| CPU Cores | Full ISA emulation (QEMU TCG) | Firmware must execute correctly |
| NoC | Message-passing with configurable latency | Validates routing and deadlock-freedom |
| Accelerator Units | Functional compute (golden model) | Validates data flow, not timing |
| Memory | Bandwidth-capped flat address space | Validates addressing and bandwidth limits |
| Interrupts | Level-triggered delivery | Validates firmware interrupt handlers |
| DMA Engines | Async copy with completion interrupt | Validates firmware DMA programming |
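The NoC row above ("message-passing with configurable latency") amounts to computing a delivery time per message. A sketch assuming dimension-ordered X-then-Y routing and a fixed per-hop latency; both are assumptions, not the only reasonable choices:

```python
def noc_delivery_time(src: tuple[int, int], dst: tuple[int, int],
                      now: float, hop_latency: float) -> float:
    """Virtual time at which a message injected at `now` arrives, assuming
    dimension-ordered (X-then-Y) routing with a fixed per-hop latency."""
    hops = abs(dst[0] - src[0]) + abs(dst[1] - src[1])  # Manhattan distance
    return now + hops * hop_latency


# (0,0) -> (1,1) is two hops; a message injected at t=0 with 5-unit hop
# latency arrives at t=10.
arrival = noc_delivery_time((0, 0), (1, 1), now=0.0, hop_latency=5.0)
```

Because the model is behavioral, it validates routing and ordering rather than cycle timing; congestion can be layered on later by making `hop_latency` a function of queue depth.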
For multi-tile systems, we run separate QEMU instances connected via sockets. Each tile keeps its own address space and failure domain, and the topology scales by simply launching more instances:
```mermaid
flowchart TB
    subgraph "Host Machine"
        subgraph "QEMU Instance: Host Controller"
            HC[x86/ARM CPU Model]
            HD[PCIe Device Model]
            HM[Host Memory]
        end
        subgraph "QEMU Instance: Tile 0"
            T0C[RISC-V CPU]
            T0A[Accelerator Model]
            T0M[Local Memory]
            T0N[NoC Interface]
        end
        subgraph "QEMU Instance: Tile 1"
            T1C[RISC-V CPU]
            T1A[Accelerator Model]
            T1M[Local Memory]
            T1N[NoC Interface]
        end
        subgraph "QEMU Instance: Tile N"
            TNC[RISC-V CPU]
            TNA[Accelerator Model]
            TNM[Local Memory]
            TNN[NoC Interface]
        end
    end
    subgraph "Inter-Instance Communication"
        RT[Routing Table]
        S0[Socket :65000]
        S1[Socket :65001]
        SN[Socket :6500N]
    end
    HD <--> S0
    T0N <--> S0
    T0N <--> S1
    T1N <--> S1
    T1N <--> SN
    TNN <--> SN
    RT -.->|"Tile ID → Socket"| S0
    RT -.-> S1
    RT -.-> SN
```
Each QEMU instance exposes a chardev socket for inter-tile communication. A routing table maps tile IDs to network addresses:
```mermaid
sequenceDiagram
    participant T0 as Tile 0<br/>(Port 65000)
    participant RT as Routing Table
    participant T1 as Tile 1<br/>(Port 65001)
    participant T2 as Tile 2<br/>(Port 65002)
    Note over T0,T2: Tile 0 sends message to Tile 2
    T0->>RT: Lookup destination: Tile 2
    RT-->>T0: 127.0.0.1:65002
    T0->>T2: TCP packet (payload + header)
    T2->>T2: Process in firmware
    T2->>RT: Lookup destination: Tile 0
    RT-->>T2: 127.0.0.1:65000
    T2->>T0: Response packet
```
A simple CSV maps tile IDs to socket endpoints:
```csv
# tile_id, die, x, y, host, port
0, 0, 0, 0, 127.0.0.1, 65000
1, 0, 0, 1, 127.0.0.1, 65001
2, 0, 1, 0, 127.0.0.1, 65002
3, 0, 1, 1, 127.0.0.1, 65003
600, 0, 0, 0, 127.0.0.1, 65100
```
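Loading this table in the orchestrator is a few lines. A sketch of a loader; the function name and `{tile_id: (host, port)}` return shape are our own convention, not a standard:

```python
import csv
from io import StringIO


def load_routing_table(csv_text: str) -> dict[int, tuple[str, int]]:
    """Parse the tile-routing CSV into {tile_id: (host, port)}."""
    table = {}
    for row in csv.reader(StringIO(csv_text)):
        if not row or row[0].lstrip().startswith("#"):
            continue  # skip comment and blank lines
        tile_id, _die, _x, _y, host, port = (f.strip() for f in row)
        table[int(tile_id)] = (host, int(port))
    return table


table = load_routing_table("""\
# tile_id, die, x, y, host, port
0, 0, 0, 0, 127.0.0.1, 65000
2, 0, 1, 0, 127.0.0.1, 65002
""")
# A sender looks up table[dest_tile_id] and opens a TCP connection to it.
```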
This approach keeps the topology declarative: adding a tile is one more CSV row, and instances can run on different hosts for distributed deployment.
Managing multiple QEMU instances manually is tedious. We use tmux to run the whole simulation in a single session, with one window for the host controller and a pane per tile, all launched by an orchestrator script:
```mermaid
flowchart LR
    subgraph "tmux Session: qemu"
        subgraph "Window 0: Host"
            P0[QEMU x86<br/>Host Controller]
        end
        subgraph "Window 1: Tiles"
            P1[Pane: Tile 0]
            P2[Pane: Tile 1]
            P3[Pane: Tile 2]
            P4[Pane: Tile 3]
        end
    end
    JSON[Config JSON] --> Orchestrator
    Orchestrator --> |"tmux new-session"| P0
    Orchestrator --> |"tmux split-window"| P1
    Orchestrator --> |"tmux split-window"| P2
    Orchestrator --> |"tmux split-window"| P3
    Orchestrator --> |"tmux split-window"| P4
```
A JSON configuration defines the simulation topology:
```json
{
  "tmux_split": "yes",
  "routing_table": "./routing_table.csv",
  "deployment": [{
    "host": {
      "memory": "512M",
      "port": 65100
    },
    "tiles": [
      {"tile_id": 0, "port": 65000, "firmware": "tile-0.elf", "cpus": 4},
      {"tile_id": 1, "port": 65001, "firmware": "tile-1.elf", "cpus": 4},
      {"tile_id": 2, "port": 65002, "firmware": "tile-2.elf", "cpus": 4},
      {"tile_id": 3, "port": 65003, "firmware": "tile-3.elf", "cpus": 4}
    ]
  }]
}
```
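The orchestrator turns each `tiles` entry into a QEMU command line. A sketch of one possible `build_qemu_command` helper, combining the config fields with the `-global`/`-chardev` flags shown later in this post; the exact flags depend on your device models:

```python
def build_qemu_command(tile: dict, routing_table: str = "./routing_table.csv") -> list[str]:
    """Assemble a QEMU invocation for one tile from its JSON config entry.
    Flag choices mirror the -global/-chardev examples in this post; the
    'noc' driver name is a custom device model, not stock QEMU."""
    return [
        "qemu-system-riscv64",
        "-smp", str(tile["cpus"]),
        "-kernel", tile["firmware"],
        "-global", f"driver=noc,property=routing_table,value={routing_table}",
        "-global", f"driver=noc,property=tile_id,value={tile['tile_id']}",
        "-chardev",
        f"socket,id=noc_socket,host=0.0.0.0,port={tile['port']},server=on,wait=off",
    ]


cmd = build_qemu_command(
    {"tile_id": 0, "port": 65000, "firmware": "tile-0.elf", "cpus": 4}
)
```

Returning a list (rather than a shell string) keeps the command safe to pass to `subprocess` directly; join it with spaces when handing it to `tmux send-keys`.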
Functional simulation isn’t just about “it boots.” We capture metrics that hardware architects and firmware teams actually care about:
| Metric | Why It Matters | How to Measure |
|---|---|---|
| NoC Latency (avg/max) | Predicts end-to-end inference delay | Timestamp packets at source/sink via shared memory or logs |
| Memory Bandwidth Utilization | Reveals bottlenecks before tape-out | Track bytes read/written per tile via QEMU device counters |
| Tile Idle vs. Active % | Measures compute efficiency | Log tile state transitions (busy/idle) from QEMU device |
| Interrupt Latency | Impacts real-time response | Inject event → measure host ISR response time |
| End-to-End Throughput | Ultimate system-level KPI (tokens/sec) | Drive workload from host → time completion |
| Determinism / Reproducibility | Critical for debug | Ensure identical seed → identical logs (use -icount in QEMU) |
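Computing the latency metrics is simple once the samples exist. A sketch assuming `(packet_id, t_send, t_recv)` tuples harvested from source/sink timestamps in the logs; the tuple layout is our own convention:

```python
def latency_stats(events: list[tuple[int, int, int]]) -> tuple[float, int]:
    """Compute (avg, max) latency from (packet_id, t_send, t_recv) samples,
    as produced by timestamping packets at source and sink."""
    latencies = [t_recv - t_send for _pid, t_send, t_recv in events]
    return sum(latencies) / len(latencies), max(latencies)


# Three packets with latencies 40, 60, and 30 time units.
avg, worst = latency_stats([(0, 100, 140), (1, 200, 260), (2, 300, 330)])
```

The max matters as much as the average: a healthy mean with a pathological tail usually points at backpressure or an arbitration bug.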
```mermaid
graph TD
    subgraph "Instrumentation Points"
        I1[Firmware Trace Macros]
        I2[QEMU Device Counters]
        I3[Socket Traffic Logs]
        I4[Memory Access Hooks]
    end
    subgraph "Collected Metrics"
        M1[NoC Latency<br/>per hop avg/max]
        M2[Tile Utilization<br/>busy vs idle %]
        M3[Memory Bandwidth<br/>GB/s per tile]
        M4[Queue Depths<br/>backpressure detection]
        M5[Interrupt Latency<br/>cycles to ISR]
        M6[Throughput<br/>tokens/second]
    end
    subgraph "Analysis Tools"
        A1[Perfetto Traces]
        A2[Aggregated Statistics]
        A3[Bottleneck Detection]
        A4[Regression Comparison]
    end
    I1 --> M1
    I1 --> M2
    I2 --> M3
    I2 --> M5
    I3 --> M4
    I4 --> M6
    M1 --> A1
    M2 --> A1
    M3 --> A2
    M4 --> A3
    M5 --> A2
    M6 --> A4
```
```mermaid
flowchart LR
    subgraph "QEMU Tile Instance"
        FW[Firmware] --> |"TRACE_BEGIN/END"| UART[Serial Output]
        DEV[Device Model] --> |"Counter Registers"| MMIO[MMIO Interface]
        NOC[NoC Model] --> |"Packet Timestamps"| SOCK[Socket Log]
    end
    subgraph "Collection"
        UART --> LOG[Per-Tile Logs]
        MMIO --> STATS[Statistics Dump]
        SOCK --> PCAP[Traffic Capture]
    end
    subgraph "Analysis Pipeline"
        LOG --> MERGE[Log Merger]
        STATS --> MERGE
        PCAP --> MERGE
        MERGE --> PERFETTO[Perfetto JSON]
        MERGE --> CSV[Metrics CSV]
        MERGE --> REPORT[Summary Report]
    end
```
Firmware emits structured trace events that are converted to Chrome Trace Format for visualization:
```text
TS:000653209372 T:04 C:01 PERFETTO_TAG_BEGIN:COMPUTE
TS:000653215000 T:04 C:01 PERFETTO_TAG_END:COMPUTE
TS:000653215100 T:04 C:01 PERFETTO_TAG_BEGIN:SEND
TS:000653220000 T:04 C:01 PERFETTO_TAG_END:SEND
```
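The conversion to Chrome Trace Format (which Perfetto loads directly) is a small script. A sketch assuming the exact line format above, reading `T:` as tile and `C:` as core and mapping them to Chrome's `pid`/`tid`; that mapping is our convention:

```python
import json
import re

# Matches lines like: TS:000653209372 T:04 C:01 PERFETTO_TAG_BEGIN:COMPUTE
LINE = re.compile(
    r"TS:(?P<ts>\d+) T:(?P<tile>\d+) C:(?P<core>\d+) "
    r"PERFETTO_TAG_(?P<phase>BEGIN|END):(?P<name>\w+)"
)


def to_chrome_trace(log_text: str) -> str:
    """Convert firmware trace lines into Chrome Trace Format JSON.
    Tiles become pids and cores become tids so Perfetto groups them."""
    events = []
    for m in LINE.finditer(log_text):
        events.append({
            "name": m["name"],
            "ph": "B" if m["phase"] == "BEGIN" else "E",  # duration begin/end
            "ts": int(m["ts"]),
            "pid": int(m["tile"]),
            "tid": int(m["core"]),
        })
    return json.dumps({"traceEvents": events})


trace = to_chrome_trace(
    "TS:000653209372 T:04 C:01 PERFETTO_TAG_BEGIN:COMPUTE\n"
    "TS:000653215000 T:04 C:01 PERFETTO_TAG_END:COMPUTE\n"
)
```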
This enables a per-tile timeline view in Perfetto, making waits and serialization between tiles visible at a glance:
```mermaid
gantt
    title Tile Execution Timeline (Perfetto View)
    dateFormat X
    axisFormat %s
    section Tile 0
    COMPUTE :0, 100
    SEND :100, 120
    WAIT :120, 200
    COMPUTE :200, 300
    section Tile 1
    WAIT :0, 50
    RECEIVE :50, 70
    COMPUTE :70, 170
    SEND :170, 190
    section Tile 2
    WAIT :0, 100
    RECEIVE :100, 120
    COMPUTE :120, 220
    DONE :220, 230
```
```mermaid
graph TD
    subgraph "Unit Tests"
        U1[Single-tile compute]
        U2[Memory operations]
        U3[Interrupt handling]
    end
    subgraph "Integration Tests"
        I1[Two-tile communication]
        I2[Loopback: send → receive → verify]
        I3[Multi-hop routing]
    end
    subgraph "System Tests"
        S1[Full topology boot]
        S2[ML workload execution]
        S3[Error injection & recovery]
    end
    subgraph "Correlation"
        C1[Compare vs RTL golden]
        C2[Compare vs FPGA results]
        C3[Compare vs silicon]
    end
    U1 --> I1
    U2 --> I1
    I1 --> S1
    I2 --> S2
    S2 --> C1
    C1 --> C2
    C2 --> C3
```
The simplest integration test validates the full data path:
```mermaid
sequenceDiagram
    participant Host
    participant Tile0
    participant Tile1
    participant Tile2
    participant Tile3
    Host->>Tile0: Send test vector
    Tile0->>Tile0: Compute (e.g., MVM)
    Tile0->>Tile1: Forward result
    Tile1->>Tile1: Compute
    Tile1->>Tile2: Forward result
    Tile2->>Tile2: Compute
    Tile2->>Tile3: Forward result
    Tile3->>Tile3: Compute
    Tile3->>Host: Return final result
    Host->>Host: Verify against golden
```
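On the host side, this test is just send, receive, compare. A sketch with a stand-in transport; a real run would use the tile-0 TCP socket as `send`/`recv`, and the fake pipeline below exists only to make the example self-contained:

```python
def loopback_test(send, recv, test_vector: bytes, golden: bytes) -> bool:
    """Push a test vector into the first tile, collect the result from the
    last tile, and compare against the golden-model output."""
    send(test_vector)
    result = recv()
    return result == golden


# Stand-in transport: the fake "pipeline" XORs each byte with 0xFF once,
# standing in for the tile chain; real runs replace these with socket I/O.
outbox = []
fake_pipeline = lambda v: bytes(b ^ 0xFF for b in v)
ok = loopback_test(
    send=outbox.append,
    recv=lambda: fake_pipeline(outbox[-1]),
    test_vector=b"\x01\x02",
    golden=b"\xfe\xfd",
)
```

Keeping the test transport-agnostic means the same check runs unmodified against functional sim, FPGA, and silicon, which is exactly what the correlation table below relies on.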
Track these across simulation levels to build confidence:
| Metric | FunSim | FPGA | Silicon | Status |
|---|---|---|---|---|
| Compute result | 0xDEADBEEF | 0xDEADBEEF | 0xDEADBEEF | PASS |
| Message count | 1024 | 1024 | 1024 | PASS |
| Relative ordering | A→B→C | A→B→C | A→B→C | PASS |
| Cycle count | N/A | 50,000 | 48,500 | ~3% delta |
Pass configuration to QEMU devices via -global:
```shell
qemu-system-riscv64 \
  -global driver=noc,property=routing_table,value=./routing.csv \
  -global driver=noc,property=tile_id,value=0 \
  -chardev socket,id=noc_socket,host=0.0.0.0,port=65000,server=on,wait=off
```
Capture per-core output to separate files:
```shell
-chardev file,id=log_core0,path=log_tile0_core0.txt -serial chardev:log_core0
-chardev file,id=log_core1,path=log_tile0_core1.txt -serial chardev:log_core1
```
```python
# Pseudocode for the orchestrator; generate_routing_table and
# build_qemu_command are helpers defined elsewhere.
import subprocess


def launch_simulation(config):
    # Create a detached tmux session for the whole simulation
    subprocess.run(["tmux", "new-session", "-d", "-s", "sim"])
    # Generate the routing table from the topology config
    generate_routing_table(config)
    # Launch each tile's QEMU instance in its own pane
    for i, tile in enumerate(config["deployment"][0]["tiles"]):
        cmd = build_qemu_command(tile)
        if i == 0:
            subprocess.run(["tmux", "send-keys", "-t", "sim", cmd, "C-m"])
        else:
            subprocess.run(["tmux", "split-window", "-v", "-t", "sim"])
            subprocess.run(["tmux", "send-keys", "-t", f"sim.{i}", cmd, "C-m"])
    # Attach for interactive use
    subprocess.run(["tmux", "attach", "-t", "sim"])
```
```mermaid
mindmap
  root((Functional Simulation))
    Speed
      100MHz+ virtual time
      Real firmware execution
      Interactive debugging
    Fidelity
      Behavioral accuracy
      Message ordering
      Interrupt delivery
    Scalability
      Multi-instance QEMU
      Socket interconnect
      Distributed deployment
    Validation
      Loopback tests
      Trace analysis
      Silicon correlation
```
The best simulation is one that finds bugs before they’re burned into silicon.