This article provides an in-depth analysis of aVisor — a lightweight AArch64 Type-1 Hypervisor designed for IoT scenarios. Starting from the fundamentals of ARM virtualization, it progressively unfolds the technical architecture and implementation details, covering the boot sequence, exception handling, memory virtualization, device emulation, scheduler, console system, and other core modules.
Table of Contents
- ARM Virtualization Fundamentals
- aVisor Overall Architecture
- Boot Sequence: From Power-On to Guest OS Execution
- Exception Handling and Trap Mechanism
- Memory Virtualization: Stage-2 Address Translation
- Device Emulation and MMIO Interception
- Interrupt Virtualization and Timers
- Scheduler and Context Switching
- Multi-Core Support (SMP)
- Console and Shell System
- Filesystem and VM Loading
- Source Code Structure Overview
1. ARM Virtualization Fundamentals
1.1 ARM Exception Levels
The AArch64 architecture defines four exception levels, with privilege increasing from bottom to top:
┌──────────────────────────────────────┐
│ EL3 Secure Monitor (Firmware/ATF) │ ← Highest privilege, secure world switch
├──────────────────────────────────────┤
│ EL2 Hypervisor │ ← aVisor runs here
├──────────────────────────────────────┤
│ EL1 Guest OS Kernel (Linux) │ ← Guest kernel
├──────────────────────────────────────┤
│ EL0 User Applications │ ← Guest userspace programs
└──────────────────────────────────────┘
As a Type-1 (bare-metal) Hypervisor, aVisor runs directly at EL2 without a host operating system. The Guest OS runs at EL1/EL0, and the privileged operations that HCR_EL2 is configured to trap are taken by hardware to EL2 for aVisor to handle.
1.2 Key System Registers
| Register | Function |
|---|---|
| HCR_EL2 | Hypervisor Configuration Register — controls trap behavior and Stage-2 translation |
| VTTBR_EL2 | Stage-2 page table base address + VMID |
| VTCR_EL2 | Stage-2 translation control (page size, address width, etc.) |
| ELR_EL2 | Exception Link Register (saves Guest PC on trap) |
| SPSR_EL2 | Saved Program Status Register (saves Guest PSTATE on trap) |
| ESR_EL2 | Exception Syndrome Register (encodes trap reason) |
| FAR_EL2 | Fault Address Register |
| HPFAR_EL2 | Hypervisor IPA Fault Address Register (for Stage-2 faults) |
1.3 Stage-2 Address Translation
ARM virtualization extensions provide two levels of address translation:
Guest VA ──Stage-1(Guest-controlled)──► IPA ──Stage-2(Hypervisor-controlled)──► PA
(EL1 MMU, TTBR0/1_EL1) (EL2 MMU, VTTBR_EL2)
- Stage-1: Managed by the Guest, translating Virtual Addresses (VA) to Intermediate Physical Addresses (IPA)
- Stage-2: Managed by the Hypervisor, translating IPA to real Physical Addresses (PA), achieving memory isolation
1.4 HVC/SMC Calls
- HVC (Hypervisor Call): Guest proactively calls Hypervisor services (e.g., PSCI power management)
- SMC (Secure Monitor Call): When HCR_EL2.TSC=1, SMC is also trapped to EL2
2. aVisor Overall Architecture
2.1 Design Goals
aVisor is designed for the Raspberry Pi 3 (BCM2837, 4-core Cortex-A53), aiming to run a full Linux Guest with minimal overhead on an embedded platform. Core features:
- Type-1 bare-metal: Runs directly on hardware EL2, no host OS needed
- 1:1 vCPU pinning: Each vCPU is pinned to a specific physical CPU core
- Full virtualization: Guest runs unmodified, standard AArch64 Linux kernel
- Demand paging: Guest memory is dynamically allocated on page-fault
- Complete device emulation: Emulates BCM2837 UART, interrupt controller, timer, Mailbox, and other peripherals
2.2 Architecture Overview
┌─────────────────────────────────────────────────┐
│ Guest Linux │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ vCPU 0 │ │ vCPU 1 │ │ vCPU 2 │ vCPU 3 │ ← EL1/EL0
│ └────┬────┘ └────┬────┘ └────┬────┘ │
├───────┼────────────┼────────────┼───────────────┤
│ │ HVC/Trap │ │ │
│ ┌────▼────────────▼────────────▼────────────┐ │
│ │ aVisor Hypervisor │ │ ← EL2
│ │ ┌──────────┐ ┌──────────┐ ┌─────────┐ │ │
│ │ │Exception │ │ Memory │ │ Device │ │ │
│ │ │ Handling │ │ Mgmt │ │Emulation│ │ │
│ │ │(sync_exc)│ │ (mm.c) │ │(bcm2837)│ │ │
│ │ ├──────────┤ ├──────────┤ ├─────────┤ │ │
│ │ │Scheduler │ │Stage-2 PT│ │ Console │ │ │
│ │ │ (sched) │ │ (vm.c) │ │(console)│ │ │
│ │ └──────────┘ └──────────┘ └─────────┘ │ │
│ └───────────────────────────────────────────┘ │
├─────────────────────────────────────────────────┤
│ BCM2837 Hardware (Raspberry Pi 3) │
│ CPU0 CPU1 CPU2 CPU3 UART Timer EMMC ... │
└─────────────────────────────────────────────────┘
2.3 Memory Layout
aVisor uses the following layout within the Raspberry Pi 3’s 1GB physical memory:
| Address Range | Purpose |
|---|---|
| 0x00000000 - 0x3EFFFFFF | Normal memory (Guest RAM + Hypervisor) |
| 0x3F000000 - 0x3FFFFFFF | BCM2837 GPU peripherals (MMIO) |
| 0x40000000 - 0x40FFFFFF | BCM2836 local peripherals (timers, interrupts, Mailbox) |
The Hypervisor itself occupies the low address region. The Guest kernel image is loaded at IPA 0x08000000, the DTB at 0x3B000000, and the initramfs at 0x02200000.
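The layout table can be read as a simple address classifier. The sketch below is illustrative only: the enum and the function name `classify_addr` are assumptions made for this example and are not taken from the aVisor sources.

```c
#include <stdint.h>

/* Hypothetical region tags for the three ranges in the layout table. */
enum region {
    REGION_NORMAL,      /* Guest RAM + Hypervisor */
    REGION_GPU_MMIO,    /* BCM2837 GPU peripherals */
    REGION_LOCAL_MMIO,  /* BCM2836 local peripherals */
    REGION_UNMAPPED     /* outside the documented layout */
};

/* Classify a physical address according to the layout table above. */
enum region classify_addr(uint64_t pa)
{
    if (pa <= 0x3EFFFFFFull) return REGION_NORMAL;
    if (pa <= 0x3FFFFFFFull) return REGION_GPU_MMIO;
    if (pa <= 0x40FFFFFFull) return REGION_LOCAL_MMIO;
    return REGION_UNMAPPED;
}
```

With this split, the kernel load address 0x08000000 falls into normal memory, while the Mini UART at 0x3F215040 falls into the GPU peripheral window.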
3. Boot Sequence: From Power-On to Guest OS Execution
3.1 EL3 → EL2 Transition
The aVisor boot code resides in boot.S, with the entry point _start in the .text.boot section. On QEMU raspi3b, the CPU starts at EL3:
_start:
mrs x0, mpidr_el1 // Read CPU ID
and x0, x0, #3
...
mrs x0, CurrentEL // Check current exception level
lsr x0, x0, #2 // CurrentEL holds the level in bits [3:2]
cmp x0, #3
beq el3 // If EL3, configure secure registers
el3:
ldr x0, =HCR_VALUE // msr takes a register operand, so load each constant first
msr hcr_el2, x0 // Configure Hypervisor control
ldr x0, =SCR_VALUE
msr scr_el3, x0 // Non-secure + HVC enabled + AArch64
ldr x0, =SPSR_VALUE
msr spsr_el3, x0 // Target: EL2h, all interrupts masked
adr x0, el2_entry
msr elr_el3, x0
eret // Exception return, enter EL2
SCR_EL3 sets NS=1 (Non-secure world), HCE=1 (enable HVC), and RW=1 (EL2 uses AArch64). SPSR_EL3 targets EL2h mode with all interrupts masked. eret transfers execution to EL2.
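As a rough illustration of how these two values are composed, the sketch below builds them from the bit positions defined by the architecture (NS is bit 0, HCE bit 8, RW bit 10 of SCR_EL3; EL2h is mode 0b1001 and D/A/I/F are bits 9..6 of the SPSR). The macro and function names are assumptions for this example, and aVisor's actual SCR_VALUE and SPSR_VALUE may set additional bits.

```c
#include <stdint.h>

/* SCR_EL3 bits named in the text */
#define SCR_NS   (1u << 0)   /* lower levels run in the Non-secure world */
#define SCR_HCE  (1u << 8)   /* HVC instruction enabled */
#define SCR_RW   (1u << 10)  /* next lower level (EL2) is AArch64 */

/* SPSR_EL3 fields */
#define SPSR_MODE_EL2H  0x9u          /* M[3:0] = 0b1001: EL2 using SP_EL2 */
#define SPSR_MASK_DAIF  (0xFu << 6)   /* D, A, I, F all masked */

/* Only the three bits mentioned in the text (assumption: nothing else set). */
uint32_t make_scr_el3(void)
{
    return SCR_NS | SCR_HCE | SCR_RW;
}

uint32_t make_spsr_el3(void)
{
    return SPSR_MASK_DAIF | SPSR_MODE_EL2H;
}
```

Under these assumptions the SCR value is 0x501 and the SPSR value is 0x3C9.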
3.2 EL2 Page Tables and MMU Initialization
After entering EL2, the BSP (CPU0) executes:
el2_entry:
// 1. Zero out BSS section
adr x0, bss_begin
adr x1, bss_end
sub x1, x1, x0
bl memzero
// 2. Create EL2 page tables
bl __create_page_tables
// 3. Configure MMU control registers
adrp x0, pg_dir
msr ttbr0_el2, x0 // Page table base address
ldr x1, =TCR_VALUE // msr takes a register operand, so load each constant first
msr tcr_el2, x1 // Translation control
ldr x1, =VTCR_VALUE
msr vtcr_el2, x1 // Stage-2 translation control
ldr x1, =MAIR_VALUE
msr mair_el2, x1 // Memory attributes
// 4. Enable MMU
mov x0, #SCTLR_MMU_ENABLED
msr sctlr_el2, x0
isb
// 5. Jump to C entry point
b hypervisor_main // Direct branch to a label (br would need a register operand)
__create_page_tables creates three-level page tables (PGD → PUD → PMD) using 2MB block mappings:
- Normal memory (0x00000000 - 0x3EFFFFFF): MMU_FLAGS = 0x705 (Normal, Inner Shareable, AF)
- Device memory (0x3F000000 - 0x3FFFFFFF): MMU_DEVICE_FLAGS = 0x701 (Device-nGnRnE, AF)
- Local peripherals (0x40000000): A single 2MB block with device memory attributes
The QEMU linker script linker_qemu.ld sets the code base address to 0x80000 (QEMU’s kernel load address), so the page tables are effectively identity-mapped (VA = PA).
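To make the flag values concrete, the following sketch decodes the low attribute bits of a block descriptor and counts how many 2MB blocks each range needs. The bit positions follow the AArch64 block descriptor format; the struct and function names are assumptions for this example. Which memory type an AttrIndx value selects depends on how MAIR_EL2 is programmed, which is not shown here.

```c
#include <stdint.h>

#define BLOCK_SHIFT 21                      /* 2MB blocks */
#define BLOCK_SIZE  (1ull << BLOCK_SHIFT)

/* Decoded view of the lower attribute bits of a block descriptor. */
struct block_attrs {
    int      valid;       /* bit 0 */
    int      is_table;    /* bit 1: 0 = block, 1 = table */
    unsigned attr_index;  /* bits [4:2]: index into MAIR_EL2 */
    unsigned sh;          /* bits [9:8]: 3 = Inner Shareable */
    int      af;          /* bit 10: Access Flag */
};

struct block_attrs decode_flags(uint64_t desc)
{
    struct block_attrs a;
    a.valid      = (int)(desc & 1);
    a.is_table   = (int)((desc >> 1) & 1);
    a.attr_index = (unsigned)((desc >> 2) & 0x7);
    a.sh         = (unsigned)((desc >> 8) & 0x3);
    a.af         = (int)((desc >> 10) & 1);
    return a;
}

/* Identity-mapped block descriptor: 2MB-aligned output address plus flags. */
uint64_t make_block_desc(uint64_t pa, uint64_t flags)
{
    return (pa & ~(BLOCK_SIZE - 1)) | flags;
}

/* Number of 2MB blocks needed to cover [start, end_inclusive]. */
uint64_t blocks_in_range(uint64_t start, uint64_t end_inclusive)
{
    return (end_inclusive - start + 1) >> BLOCK_SHIFT;
}
```

Read this way, 0x705 and 0x701 differ only in AttrIndx (1 versus 0); the normal-memory range needs 504 blocks and the GPU peripheral window needs 8.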
3.3 hypervisor_main Initialization Sequence
void hypervisor_main(void)
{
// 1. Initialize per-CPU data structures
init_per_cpu_data();
// 2. Initialize physical UART (Mini UART, 115200 baud)
uart_init();
// 3. Print logo and initialize Shell
printf(logo);
shell_init();
init_hv();
// 4. Install exception vector table
irq_vector_init();
// 5. Configure timers
init_misc_timer(); // BCM2835 system timer
init_hv_timer(0); // CPU0's Hypervisor physical timer
// 6. Enable interrupt controller
enable_interrupt_controller();
// 7. Mount SD card FAT32 filesystem
f_mount(&fatfs, "/", 1);
// 8. Create VM and load Guest kernel
for (int i = 0; i < get_avisor_config_amount(); i++) {
create_vm(i, get_avisor_config(i));
}
// 9. Start remaining physical CPU cores
start_secondary_cores(1, secondary_main);
start_secondary_cores(2, secondary_main);
start_secondary_cores(3, secondary_main);
// 10. Enter scheduling loop
enable_irq();
while (1)
schedule();
}
3.4 VM Creation and Kernel Loading
create_vm() is the starting point of the VM lifecycle:
- Allocate VM structure: Obtain a slot from the global vm_array[], assign VMID
- Initialize Stage-2 page tables: Mark device regions (0x3F000000+) as “inaccessible” to trigger MMIO traps on access
- Initialize console: Allocate in_fifo and out_fifo
- Create vCPUs: Create one vCPU for each physical CPU core
vCPU creation allocates a THREAD_SIZE (4KB) kernel stack for each vCPU and initializes:
- EL1 system register shadows: SCTLR_EL1 = 0 (MMU/Cache off), MPIDR_EL1 = vcpu_id
- Board interface: Binds bcm2837_board_ops (BCM2837 peripheral emulation)
- Entry point: The primary vCPU enters via switch_from_kthread → prepare_vcpu → raw_binary_loader to load the kernel
raw_binary_loader loads files from the SD card FAT32 filesystem:
- Image (Linux kernel) → IPA 0x08000000
- rasp3b.dtb (Device Tree) → IPA 0x3B000000
- rootfs.gz (initramfs) → IPA 0x02200000
After loading, it sets the AArch64 Linux boot protocol registers: x0 = DTB address, PC = kernel entry point.
4. Exception Handling and Trap Mechanism
4.1 Exception Vector Table
aVisor defines a standard AArch64 EL2 exception vector table in entry.S, with each vector entry aligned to 128 bytes:
┌────────────────────────────┐
VBAR_EL2 + 0x000 │ Current EL, SP_EL0, Sync │ (unused)
+ 0x080 │ Current EL, SP_EL0, IRQ │
+ 0x100 │ Current EL, SP_EL0, FIQ │
+ 0x180 │ Current EL, SP_EL0, SError │
+ 0x200 │ Current EL, SP_ELx, Sync │ (Hypervisor's own exceptions)
+ 0x280 │ Current EL, SP_ELx, IRQ │
+ 0x300 │ Current EL, SP_ELx, FIQ │
+ 0x380 │ Current EL, SP_ELx, SError │
+ 0x400 │ Lower EL, AArch64, Sync │ ← Guest trap entry
+ 0x480 │ Lower EL, AArch64, IRQ │ ← Guest IRQ entry
+ 0x500 │ Lower EL, AArch64, FIQ │
+ 0x580 │ Lower EL, AArch64, SError │
└────────────────────────────┘
All Guest synchronous exceptions (HVC, SMC, memory faults, system register accesses, etc.) enter through VBAR_EL2 + 0x400; IRQs enter through + 0x480.
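The offsets in the table follow a regular pattern: each of the four source groups spans 0x200 bytes and each exception type within a group spans 0x80 bytes. A minimal sketch of that arithmetic follows; the enum and function names are illustrative assumptions, not aVisor identifiers.

```c
/* Source of the exception, in vector-table order. */
enum vec_source {
    VEC_CUR_EL_SP0 = 0,  /* Current EL, SP_EL0 */
    VEC_CUR_EL_SPX = 1,  /* Current EL, SP_ELx */
    VEC_LOWER_A64  = 2,  /* Lower EL, AArch64 */
    VEC_LOWER_A32  = 3   /* Lower EL, AArch32 */
};

/* Exception type within a group. */
enum vec_type {
    VEC_SYNC   = 0,
    VEC_IRQ    = 1,
    VEC_FIQ    = 2,
    VEC_SERROR = 3
};

/* Byte offset of a vector entry relative to VBAR_EL2. */
unsigned vector_offset(enum vec_source src, enum vec_type type)
{
    return (unsigned)src * 0x200u + (unsigned)type * 0x80u;
}
```

For example, a Guest synchronous trap is `vector_offset(VEC_LOWER_A64, VEC_SYNC)`, which gives 0x400, and a Guest IRQ gives 0x480.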
4.2 kernel_entry / kernel_exit Macros
Context save/restore on each trap:
kernel_entry:
// 1. Save Guest general-purpose registers x0-x29 to EL2 stack
stp x0, x1, [sp, #-288]!
stp x2, x3, [sp, #16]
...
// 2. Save ELR_EL2 (Guest return address) and SPSR_EL2
mrs x22, elr_el2
mrs x23, spsr_el2
stp x22, x23, [sp, #256]
// 3. Call vm_leaving_work(): save EL1 sysreg shadows, flush console
bl vm_leaving_work
kernel_exit:
// 1. Call vm_entering_work(): restore sysregs, inject virtual interrupts
bl vm_entering_work
// 2. Restore ELR_EL2 and SPSR_EL2
ldp x22, x23, [sp, #256]
msr elr_el2, x22
msr spsr_el2, x23
// 3. Restore Guest general-purpose registers x0-x29; x0/x1 go last because the post-indexed load also pops the frame
...
ldp x0, x1, [sp], #288
// 4. Return to Guest
eret
4.3 Synchronous Exception Dispatch
handle_sync_exception() dispatches based on the Exception Class (EC) field of ESR_EL2:
void handle_sync_exception(unsigned long esr, struct pt_regs *regs)
{
int ec = (esr >> 26) & 0x3f;
switch (ec) {
case 0x16: // HVC (AArch64)
handle_system_call(esr, regs); // PSCI and other services
break;
case 0x17: // SMC (AArch64), trapped when HCR_EL2.TSC=1
handle_system_call(esr, regs);
break;
case 0x18: // MSR/MRS system register access
handle_trap_system(esr, regs);
break;
case 0x01: // WFI/WFE
handle_trap_wfx(esr, regs);
break;
case 0x20: // IABT (Lower EL instruction abort)
case 0x24: // DABT (Lower EL data abort)
handle_mem_abort(esr, regs); // Memory fault / MMIO
break;
}
}
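The dispatch above relies on the fixed layout of ESR_EL2: EC in bits [31:26], IL in bit 25, and the instruction-specific syndrome (ISS) in bits [24:0]. The helper below is a sketch of that split; the struct and function names are assumptions for this example, and the name table covers only the classes listed in the switch.

```c
#include <stdint.h>

struct esr_fields {
    unsigned ec;   /* Exception Class, bits [31:26] */
    unsigned il;   /* Instruction Length, bit 25 (1 = 32-bit instruction) */
    uint32_t iss;  /* Instruction Specific Syndrome, bits [24:0] */
};

struct esr_fields decode_esr(uint64_t esr)
{
    struct esr_fields f;
    f.ec  = (unsigned)((esr >> 26) & 0x3F);
    f.il  = (unsigned)((esr >> 25) & 0x1);
    f.iss = (uint32_t)(esr & 0x1FFFFFF);
    return f;
}

/* Human-readable name for the exception classes handled above. */
const char *ec_name(unsigned ec)
{
    switch (ec) {
    case 0x01: return "WFI/WFE";
    case 0x16: return "HVC (AArch64)";
    case 0x17: return "SMC (AArch64)";
    case 0x18: return "MSR/MRS";
    case 0x20: return "Instruction abort (lower EL)";
    case 0x24: return "Data abort (lower EL)";
    default:   return "other";
    }
}
```

An `HVC #0` executed by the Guest would report ESR_EL2 = 0x5A000000: EC 0x16, IL 1, and an ISS equal to the 16-bit immediate.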
4.4 PSCI Emulation
Guest Linux uses HVC to call PSCI (Power State Coordination Interface) for CPU power management:
void handle_system_call(unsigned long esr, struct pt_regs *regs)
{
uint32_t fid = regs->regs[0]; // Function ID
switch (fid) {
case PSCI_VERSION: // 0x84000000
regs->regs[0] = 0x00010000; // v1.0
break;
case PSCI_CPU_ON_64: // 0xC4000003
// target = regs[1], entry = regs[2], context = regs[3]
target_vcpu->state = VCPU_RUNNING;
vcpu_pt_regs(target_vcpu)->pc = regs->regs[2];
asm volatile("sev"); // Wake the WFE-waiting vCPU
regs->regs[0] = PSCI_SUCCESS;
break;
case PSCI_AFFINITY_INFO_64: // 0xC4000004
regs->regs[0] = (target_vcpu->state == VCPU_RUNNING) ? 0 : 1;
break;
case PSCI_SYSTEM_OFF: // 0x84000008
stop_vcpu();
break;
}
}
This enables Guest Linux to boot multiple cores and query CPU status through the standard PSCI interface.
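The function IDs in the switch are not arbitrary: they follow the SMC Calling Convention, where bit 31 marks a fast call, bit 30 selects the 64-bit convention, bits [29:24] name the owning service (4 is the standard secure service range used by PSCI), and bits [15:0] hold the function number. The decoder below is a sketch with hypothetical names.

```c
#include <stdint.h>

struct smccc_fid {
    int      fast;   /* bit 31: fast call */
    int      smc64;  /* bit 30: 64-bit calling convention */
    unsigned owner;  /* bits [29:24]: owning entity */
    unsigned func;   /* bits [15:0]: function number */
};

struct smccc_fid decode_fid(uint32_t fid)
{
    struct smccc_fid d;
    d.fast  = (int)((fid >> 31) & 1);
    d.smc64 = (int)((fid >> 30) & 1);
    d.owner = (fid >> 24) & 0x3F;
    d.func  = fid & 0xFFFF;
    return d;
}
```

This explains why CPU_ON appears as 0xC4000003 (64-bit variant, function 3) while PSCI_VERSION is 0x84000000 (32-bit convention, function 0).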
5. Memory Virtualization: Stage-2 Address Translation
5.1 Stage-2 Page Table Structure
aVisor uses a 38-bit IPA space (VTCR_EL2.T0SZ = 26), 4KB pages, and three-level page tables:
IPA[37:30] IPA[29:21] IPA[20:12] IPA[11:0]
│ │ │ │
▼ ▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ Level 1│────────►│ Level 2│────────►│ Level 3│────────► 4KB Physical Page
│ (PGD) │ 512 ent │ (PMD) │ 512 ent │ (PTE) │ 512 ent
└────────┘ └────────┘ └────────┘
↑
VTTBR_EL2
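The index split in the diagram can be written out directly. The sketch below extracts the three table indices and the page offset from an IPA, assuming the 38-bit, 4KB-granule layout described here; the struct and function names are illustrative.

```c
#include <stdint.h>

struct ipa_index {
    unsigned l1;      /* IPA[37:30], 8 bits with a 38-bit IPA space */
    unsigned l2;      /* IPA[29:21] */
    unsigned l3;      /* IPA[20:12] */
    unsigned offset;  /* IPA[11:0], byte offset inside the 4KB page */
};

struct ipa_index split_ipa(uint64_t ipa)
{
    struct ipa_index ix;
    ix.l1     = (unsigned)((ipa >> 30) & 0xFF);
    ix.l2     = (unsigned)((ipa >> 21) & 0x1FF);
    ix.l3     = (unsigned)((ipa >> 12) & 0x1FF);
    ix.offset = (unsigned)(ipa & 0xFFF);
    return ix;
}
```

For the kernel load address 0x08000000 this yields indices (0, 64, 0) with offset 0; an access to the Mini UART register at 0x3F215040 yields (0, 505, 21) with offset 0x40.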
Attribute bits in each PTE control Stage-2 permissions:
| Attribute | Normal Memory Page | MMIO Device Page |
|---|---|---|
| AP (Access Permission) | (3<<6) EL1 Read/Write | 0 No access |
| SH (Shareability) | Inner Shareable | — |
| MemAttr | (0x5<<2) WB Cacheable | 0 Device-nGnRnE |
5.2 Demand Paging
aVisor does not pre-allocate all Guest memory at VM creation time. Initially, the Stage-2 page tables are nearly empty, and Guest memory accesses trigger Stage-2 Translation Faults, which the Hypervisor catches and handles by dynamically allocating physical pages:
void handle_mem_abort(unsigned long esr, struct pt_regs *regs)
{
// Obtain faulting IPA: HPFAR_EL2 (hpfar) holds IPA[47:12] in bits [39:4]; FAR_EL2 (far) supplies the page offset
unsigned long ipa = (hpfar << 8) | (far & 0xFFF);
int dfsc = esr & 0x3f;
if ((dfsc >> 2) == 0x1) {
// Translation Fault: allocate physical page and map it
unsigned long page = allocate_page();
map_stage2_page(vm, ipa, page, MMU_STAGE2_PAGE_FLAGS);
}
else if ((dfsc >> 2) == 0x3) {
if (ipa >= DEVICE_BASE) {
// Device region access → MMIO emulation
int wnr = (esr >> 6) & 1;
int rt = (esr >> 16) & 0x1f;
if (wnr)
board_ops->mmio_write(vcpu, ipa, regs->regs[rt]);
else
regs->regs[rt] = board_ops->mmio_read(vcpu, ipa);
increment_current_pc(4); // Skip the trapped instruction
} else {
// Lazy mapping for normal memory
unsigned long page = allocate_page();
map_stage2_page(vm, ipa, page, MMU_STAGE2_PAGE_FLAGS);
}
}
}
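The IPA reconstruction at the top of the handler deserves a closer look. HPFAR_EL2 reports the faulting IPA page in its FIPA field (register bits [39:4] hold IPA[47:12]), and FAR_EL2 contributes the low 12 bits, which are identical in VA and IPA for 4KB pages. The sketch below spells this out; `fault_ipa` is a hypothetical helper name, and the register values are passed in as plain integers.

```c
#include <stdint.h>

/*
 * Rebuild the faulting IPA from the two fault address registers.
 * hpfar: value read from HPFAR_EL2 (FIPA in bits [39:4])
 * far:   value read from FAR_EL2 (faulting virtual address)
 */
uint64_t fault_ipa(uint64_t hpfar, uint64_t far)
{
    uint64_t page = (hpfar >> 4) << 12;  /* IPA[47:12] moved into place */
    return page | (far & 0xFFF);         /* add the in-page offset */
}
```

Because the low four bits of HPFAR_EL2 are reserved as zero on this core, `(hpfar >> 4) << 12` and the shorter `hpfar << 8` used in the handler produce the same page address.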
5.3 MMIO Interception Principle
Device MMIO regions are marked as AP=0 (inaccessible) in the Stage-2 page tables. When the Guest accesses these addresses:
- Stage-2 generates a Permission Fault (DFSC class 0x3)
- The Hypervisor catches the exception, decodes the target register and read/write direction from ESR_EL2
- Calls the corresponding MMIO handler to emulate device behavior
- Advances PC by 4 bytes to skip the handled instruction
- eret returns to the Guest to continue execution
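The decode step in the list above can be sketched as a small helper over the data-abort ISS fields: ISV (bit 24) says whether the syndrome is valid, SAS (bits [23:22]) gives the access size, SRT (bits [20:16]) the transfer register, WnR (bit 6) the direction, and DFSC (bits [5:0]) the fault status. All names below are assumptions for this example. Note that an SRT of 31 denotes the zero register rather than a saved GPR, a case a real handler would need to treat separately.

```c
#include <stdint.h>

struct dabt_info {
    int      isv;       /* 1 if SAS/SRT/WnR describe the access */
    unsigned size;      /* access size in bytes: 1, 2, 4 or 8 */
    unsigned srt;       /* register number used for the transfer */
    int      is_write;  /* 1 = store, 0 = load */
    unsigned dfsc;      /* Data Fault Status Code */
};

struct dabt_info decode_dabt(uint64_t esr)
{
    struct dabt_info d;
    d.isv      = (int)((esr >> 24) & 1);
    d.size     = 1u << ((esr >> 22) & 3);
    d.srt      = (unsigned)((esr >> 16) & 0x1F);
    d.is_write = (int)((esr >> 6) & 1);
    d.dfsc     = (unsigned)(esr & 0x3F);
    return d;
}

/* Fault class tests matching the (dfsc >> 2) comparisons in the handler. */
int is_translation_fault(unsigned dfsc) { return (dfsc >> 2) == 0x1; }
int is_permission_fault(unsigned dfsc)  { return (dfsc >> 2) == 0x3; }
```

As a worked value, ESR_EL2 = 0x9385004F describes a 4-byte store from x5 that hit a level-3 permission fault, which is the shape an MMIO write to an AP=0 page would take.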
6. Device Emulation and MMIO Interception
6.1 BCM2837 Peripheral Emulation Overview
aVisor emulates the major peripherals of the Raspberry Pi 3 in bcm2837.c:
┌──────────────────────────────────────────────────────┐
│ bcm2837_mmio_read/write │
│ │
│ ┌──────────┐ ┌──────────┐ ┌────────────────────┐ │
│ │ PL011 │ │ Mini UART│ │ Interrupt │ │
│ │ UART │ │ (AUX) │ │ Controller │ │
│ │0x3f201xxx│ │0x3f215xxx│ │ IRQ_PENDING_1/2 │ │
│ └──────────┘ └──────────┘ └────────────────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌────────────────────┐ │
│ │ System │ │ GPIO │ │ Local Interrupt │ │
│ │ Timer │ │ GPFSEL │ │ Controller │ │
│ │ CS/CLO │ │0x3f200xxx│ │ IRQ_PENDING / │ │
│ │ C0-C3 │ └──────────┘ │ Mailbox IPI │ │
│ │0x3f003xxx│ │ 0x40000xxx │ │
│ └──────────┘ └────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ VideoCore Mailbox (vmbox.c) │ │
│ │ ARM memory size / serial / power │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
Each vCPU holds an independent bcm2837_state structure containing interrupt enable registers, AUX UART state, PL011 IMSC, system timer, and other peripheral shadow state.
6.2 PL011 UART Emulation
PL011 is the Guest Linux main console (ttyAMA0). The emulation is FIFO-based:
Output path (Guest → Physical UART):
Guest writes PL011_DR → handle_pl011_write → enqueue(out_fifo)
↓
vm_entering_work → flush_console → _putchar → Physical Mini UART
Input path (Physical UART → Guest):
Physical UART IRQ → handle_uart_irq → enqueue(in_fifo)
↓
Guest reads PL011_DR → handle_pl011_read → dequeue(in_fifo)
PL011 interrupt emulation tracks IMSC (interrupt mask) and RIS (raw interrupt status):
- When in_fifo is non-empty, RIS bit RXRIS (bit 4) is set
- When out_fifo is not full, RIS bit TXRIS (bit 5) is set
- MIS = RIS & IMSC; when MIS is non-zero, a vIRQ is injected through the interrupt controller chain
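The three rules above translate into a few lines of C. This is a sketch under the assumption that the FIFOs expose a fill count and a capacity; the function names and parameters are hypothetical and do not mirror aVisor's fifo.c API.

```c
#include <stdint.h>

#define PL011_RXRIS (1u << 4)  /* receive interrupt status */
#define PL011_TXRIS (1u << 5)  /* transmit interrupt status */

/* Raw interrupt status derived from the state of the two FIFOs. */
uint32_t pl011_ris(unsigned in_used, unsigned out_used, unsigned out_capacity)
{
    uint32_t ris = 0;
    if (in_used > 0)
        ris |= PL011_RXRIS;          /* data is waiting for the Guest */
    if (out_used < out_capacity)
        ris |= PL011_TXRIS;          /* room for more output */
    return ris;
}

/* Masked interrupt status: only sources enabled in IMSC are reported. */
uint32_t pl011_mis(uint32_t ris, uint32_t imsc)
{
    return ris & imsc;
}
```

A non-zero result from `pl011_mis` is the condition under which the emulator would raise the PL011 line toward the emulated interrupt controller.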
6.3 Interrupt Controller Emulation
BCM2837 has two levels of interrupt controllers:
GPU Interrupt Controller (0x3F00B200):
- IRQ_PENDING_1: System timer match interrupts (bit 1, 3), AUX/Mini UART interrupt (bit 29)
- IRQ_PENDING_2: PL011 UART interrupt (bit 25, corresponding to IRQ 57)
- IRQ_BASIC_PENDING: Aggregates the status of PENDING_1 and PENDING_2
Local Interrupt Controller (0x40000060+) — independent per CPU core:
- bit 3: CNTV (Guest virtual timer) interrupt
- bit 4-7: Mailbox 0-3 interrupts (used for IPI inter-core communication)
- bit 8: GPU interrupt (from the upper-level GPU interrupt controller)
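The relationship between a GPU IRQ number and its pending register is plain arithmetic: IRQs 0..31 live in IRQ_PENDING_1 and IRQs 32..63 in IRQ_PENDING_2. The sketch below shows that mapping along with masks for the local controller bits listed above; all identifiers are assumptions for this example, and the IRQ number is assumed to be below 64.

```c
#include <stdint.h>

struct gpu_irq_loc {
    int      pending_reg;  /* 1 = IRQ_PENDING_1, 2 = IRQ_PENDING_2 */
    unsigned bit;          /* bit position inside that register */
};

/* Locate a GPU interrupt (0..63) in the two pending registers. */
struct gpu_irq_loc gpu_irq_locate(unsigned irq)
{
    struct gpu_irq_loc loc;
    loc.pending_reg = (irq < 32) ? 1 : 2;
    loc.bit = irq & 31;
    return loc;
}

/* Per-core local interrupt source bits, as listed above. */
#define LOCAL_SRC_CNTV      (1u << 3)
#define LOCAL_SRC_MBOX_MASK (0xFu << 4)
#define LOCAL_SRC_GPU       (1u << 8)

/* Bitmask (bit n = Mailbox n) of mailboxes pending in a source value. */
unsigned local_pending_mailboxes(uint32_t src)
{
    return (src & LOCAL_SRC_MBOX_MASK) >> 4;
}
```

For instance, IRQ 57 resolves to IRQ_PENDING_2 bit 25, matching the PL011 entry above.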
6.4 IPI Mailbox Emulation
Linux SMP uses BCM2836 Mailboxes for inter-processor interrupts. aVisor emulates this with a global array volatile uint32_t ipi_mbox[4][4]:
// Write to Mailbox SET register: atomic OR
handle_local_intc_write(MBOX_SET):
ipi_mbox[target_core][mbox] |= val;
asm volatile("dsb ish");
// Mailbox RDCLR register: a read returns the pending bits, a write clears the bits set in val
handle_local_intc_read(MBOX_RDCLR):
result = ipi_mbox[core][mbox];
handle_local_intc_write(MBOX_RDCLR):
ipi_mbox[core][mbox] &= ~val;
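The set/read/clear semantics can be modelled in portable C. The sketch below uses C11 atomics so that the OR and the clear are single read-modify-write operations even when several cores touch the same mailbox; this is a swapped-in technique for illustration, whereas the pseudocode above uses a volatile array with an explicit barrier. All names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_CORES 4
#define NR_MBOX  4

/* Toy mailbox state: one 32-bit word per (core, mailbox) pair. */
static _Atomic uint32_t toy_mbox[NR_CORES][NR_MBOX];

/* Write to a SET register: OR the new bits into the target mailbox. */
void mbox_set(unsigned core, unsigned mbox, uint32_t val)
{
    atomic_fetch_or(&toy_mbox[core][mbox], val);
}

/* Read from a RDCLR register: return the currently pending bits. */
uint32_t mbox_read(unsigned core, unsigned mbox)
{
    return atomic_load(&toy_mbox[core][mbox]);
}

/* Write to a RDCLR register: clear exactly the bits set in val. */
void mbox_clear(unsigned core, unsigned mbox, uint32_t val)
{
    atomic_fetch_and(&toy_mbox[core][mbox], ~val);
}

/* Non-zero if any mailbox of this core still has bits pending. */
int mbox_pending(unsigned core)
{
    for (unsigned m = 0; m < NR_MBOX; m++)
        if (atomic_load(&toy_mbox[core][m]))
            return 1;
    return 0;
}
```

A sender calls `mbox_set` on the target core's mailbox; the receiver reads the bits, handles them, and acknowledges with `mbox_clear`.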
6.5 System Timer Virtualization
The BCM2835 system timer (0x3F003000) is virtualized through time offsetting:
- bcm2837_state.systimer.offset records the difference between virtual and physical time
- When Guest reads CLO/CHI, the emulator returns physical_count - offset
- When Guest writes compare registers C0-C3, entering_vm programs the nearest deadline into the physical TIMER_C3
- When the timer expires, the IRQ is injected into the Guest through the interrupt controller emulation chain
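A compact way to picture the offsetting and the "nearest deadline" choice is the sketch below. It assumes a 64-bit free-running counter split into CLO/CHI and four 32-bit compare values matched against CLO; the function names and the enabled-mask parameter are assumptions made for this example, not aVisor's interface.

```c
#include <stdint.h>

/* Virtual counter value seen by the Guest. */
uint64_t systimer_virtual_count(uint64_t physical_count, uint64_t offset)
{
    return physical_count - offset;
}

uint32_t systimer_read_clo(uint64_t physical_count, uint64_t offset)
{
    return (uint32_t)systimer_virtual_count(physical_count, offset);
}

uint32_t systimer_read_chi(uint64_t physical_count, uint64_t offset)
{
    return (uint32_t)(systimer_virtual_count(physical_count, offset) >> 32);
}

/*
 * Pick the enabled compare register that fires soonest after clo.
 * Distances are computed modulo 2^32, matching a wrapping 32-bit counter.
 * Returns the index 0..3, or -1 if no compare register is enabled.
 */
int systimer_nearest_compare(uint32_t clo, const uint32_t cmp[4],
                             unsigned enabled_mask)
{
    int best = -1;
    uint32_t best_delta = 0;
    for (int i = 0; i < 4; i++) {
        if (!(enabled_mask & (1u << i)))
            continue;
        uint32_t delta = cmp[i] - clo;  /* wraps for values already passed */
        if (best < 0 || delta < best_delta) {
            best = i;
            best_delta = delta;
        }
    }
    return best;
}
```

The selected virtual compare value would then be shifted back by the offset before being written to the physical compare register.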
7. Interrupt Virtualization and Timers
7.1 Hypervisor Timer (Scheduling Tick)
Each physical CPU core uses CNTHP (Hypervisor Physical Timer) to generate scheduling ticks:
void init_hv_timer(int core)
{
uint64_t cntfrq;
asm volatile("mrs %0, cntfrq_el0" : "=r"(cntfrq));
uint64_t ticks = cntfrq / TICK_RATE_HZ; // TICK_RATE_HZ = 10
write_cnthp_tval(ticks); // 100ms per tick
enable_cnthp(); // CNTHP_CTL_EL2 = 1
// Route to local interrupt controller
put32(COREn_TIMER_IRQCNTL(core), 1 << 2); // HPtimer → core IRQ
}
7.2 Virtual Interrupt Injection
set_cpu_virtual_interrupt() evaluates before each Guest entry (in vm_entering_work) whether a virtual interrupt should be injected:
void set_cpu_virtual_interrupt(struct avisor_vcpu *vcpu)
{
int virq = 0;
// 1. Check board-level IRQ (GPU interrupt controller has pending)
if (board_ops->is_irq_asserted(vcpu))
virq = 1;
// 2. Check Guest virtual timer interrupt (CNTV)
if (is_cntv_irq_pending()) // CNTV_CTL: ENABLE && !IMASK && ISTATUS
virq = 1;
// 3. Check IPI Mailbox
for (int m = 0; m < 4; m++)
if (ipi_mbox[cpu][m]) { virq = 1; break; }
// 4. Inject via HCR_EL2 VI/VF bits
if (virq) assert_virq(); // Set HCR_EL2.VI
else clear_virq(); // Clear HCR_EL2.VI
}
assert_virq() sets the VI bit (bit 7) of HCR_EL2, causing the hardware to automatically trigger a virtual IRQ exception after eret returns to the Guest. The Guest’s interrupt handler sees an IRQ indistinguishable from real hardware.
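Two of the bit tests involved here are easy to state precisely. The CNTV timer asserts its interrupt when CNTV_CTL has ENABLE (bit 0) set, IMASK (bit 1) clear, and ISTATUS (bit 2) set; injection then amounts to setting or clearing bit 7 of the HCR_EL2 value. The sketch operates on plain integers rather than real system registers, and the function names are hypothetical.

```c
#include <stdint.h>

#define CNTV_CTL_ENABLE  (1ull << 0)
#define CNTV_CTL_IMASK   (1ull << 1)
#define CNTV_CTL_ISTATUS (1ull << 2)

#define HCR_VI (1ull << 7)  /* virtual IRQ pending */

/* ENABLE && !IMASK && ISTATUS, evaluated on a CNTV_CTL_EL0 value. */
int cntv_irq_pending(uint64_t ctl)
{
    uint64_t mask = CNTV_CTL_ENABLE | CNTV_CTL_IMASK | CNTV_CTL_ISTATUS;
    return (ctl & mask) == (CNTV_CTL_ENABLE | CNTV_CTL_ISTATUS);
}

/* Return the HCR_EL2 value with VI set or cleared, other bits untouched. */
uint64_t hcr_update_virq(uint64_t hcr, int pending)
{
    return pending ? (hcr | HCR_VI) : (hcr & ~HCR_VI);
}
```

Clearing VI when nothing is pending matters as much as setting it: a stale VI bit would keep re-delivering a virtual IRQ the Guest has already handled.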
8. Scheduler and Context Switching
8.1 Scheduling Algorithm
aVisor uses a Priority-Decay Round-Robin algorithm:
void _schedule(void)
{
while (1) {
// Select the vCPU with the highest counter among RUNNING ones
int c = -1, next = 0;
for (int i = 0; i < MAX_VCPUS; i++) {
if (vcpu[i] && vcpu[i]->state == VCPU_RUNNING
&& vcpu[i]->counter > c) {
c = vcpu[i]->counter;
next = i;
}
}
if (c) break; // c > 0: a runnable vCPU with time left; c == -1: none runnable, fall back to slot 0
// Recharge all counters when all are depleted
for (int i = 0; i < MAX_VCPUS; i++)
if (vcpu[i])
vcpu[i]->counter = (vcpu[i]->counter >> 1) + vcpu[i]->priority;
}
switch_to(cpu_data->vcpu[next]);
}
- Each vCPU has a counter (remaining time slice) and priority (base priority)
- Timer ticks decrement counter; scheduling triggers when it reaches zero
- The recharge formula counter = counter/2 + priority implements an aging effect
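The selection and recharge rules can be exercised outside the hypervisor with a small simulation. The version below is a sketch, not aVisor's `_schedule`: it uses its own toy structure, returns -1 when nothing is runnable instead of falling back to slot 0, and assumes every runnable vCPU has a priority greater than zero so that a recharge always makes progress.

```c
#include <stddef.h>

enum toy_state { VCPU_STOPPED = 0, VCPU_RUNNING = 1 };

struct toy_vcpu {
    enum toy_state state;
    long counter;   /* remaining time slice */
    long priority;  /* base priority, assumed > 0 */
};

/*
 * Pick the RUNNING vCPU with the largest counter.
 * If all RUNNING vCPUs have counter == 0, recharge every vCPU with
 * counter = counter/2 + priority and try again.
 * Returns the chosen index, or -1 if no vCPU is runnable.
 */
int pick_next(struct toy_vcpu *v[], int n)
{
    for (;;) {
        long best = -1;
        int next = -1;
        for (int i = 0; i < n; i++) {
            if (v[i] && v[i]->state == VCPU_RUNNING && v[i]->counter > best) {
                best = v[i]->counter;
                next = i;
            }
        }
        if (next < 0)
            return -1;      /* nothing runnable at all */
        if (best > 0)
            return next;    /* found a vCPU with time left */
        for (int i = 0; i < n; i++)
            if (v[i])
                v[i]->counter = (v[i]->counter >> 1) + v[i]->priority;
    }
}
```

The halving term is what produces the aging effect: a stopped vCPU that kept an unused counter carries half of it into the next round, so it gets a larger slice once it becomes runnable again.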
8.2 Context Switch
cpu_switch_to implements a classic symmetric stack switch in sched.S:
cpu_switch_to:
// Save prev's callee-saved registers
stp x19, x20, [x0, #THREAD_CPU_CONTEXT + 0]
stp x21, x22, [x0, #THREAD_CPU_CONTEXT + 16]
...
mov x9, sp
str x9, [x0, #THREAD_CPU_CONTEXT + 96] // Save SP
str x30, [x0, #THREAD_CPU_CONTEXT + 104] // Save LR (return address)
// Restore next's callee-saved registers
ldp x19, x20, [x1, #THREAD_CPU_CONTEXT + 0]
...
ldr x9, [x1, #THREAD_CPU_CONTEXT + 96]
mov sp, x9 // Restore SP
ldr x30, [x1, #THREAD_CPU_CONTEXT + 104] // Restore LR
ret // "Return" into next's execution flow
Guest GPRs are not saved/restored during the switch — that is the job of kernel_entry/kernel_exit. cpu_switch_to only switches the Hypervisor’s own C call stack.
8.3 Complete Trap-Schedule-Return Flow
Guest executes ──► Trap (IRQ/HVC/Fault) ──► kernel_entry
│
vm_leaving_work()
├─ save_sysregs()
├─ leaving_vm() (board hook)
└─ flush_console()
│
C exception handler
├─ handle_sync_exception()
├─ handle_irq()
└─ timer_tick() → _schedule()
│
vm_entering_work()
├─ entering_vm() (board hook)
├─ flush_console()
├─ set_cpu_sysregs() (Stage-2 + EL1 regs)
└─ set_cpu_virtual_interrupt()
│
kernel_exit ──► eret ──► Guest continues
9. Multi-Core Support (SMP)
9.1 Physical Core Startup
The 4 CPU cores of the Raspberry Pi 3 are started using a spin-table mechanism:
- The BSP (CPU0) writes the _start address into spin-table addresses (0xE0/E8/F0) in boot.S
- Sends SEV (Send Event) to wake Application Processors (APs)
- APs wake from secondary_cpu_entry, configure EL2 MMU, and jump to the BSP-specified entry point
- BSP calls start_secondary_cores(n, secondary_main) to write secondary_main into smp_cores[n]
void secondary_main(void)
{
irq_vector_init();
init_hv_timer(core_id);
disable_irq();
// Wait for BSP to finish creating VMs
while (hv->nr_vm_ready < get_avisor_config_amount())
;
enable_irq();
while (1)
schedule();
}
9.2 Guest vCPU SMP Startup
Guest Linux declares multi-core startup via enable-method = "psci" in the Device Tree:
- Guest CPU0 issues HVC #0 with x0 = PSCI_CPU_ON, x1 = target_cpu, x2 = entry_point
- aVisor catches the HVC, sets the target vCPU state to VCPU_RUNNING, and sets its PC
- Sends SEV to wake the physical AP waiting in switch_to_secondary_vcpu via WFE
- The AP detects regs->pc != 0 and enters the Guest’s secondary CPU entry via kernel_exit → eret
10. Console and Shell System
10.1 Dual-Mode Console Architecture
aVisor uses the physical Mini UART (0x3F215040) as the sole physical console, implementing multiplexing through software switching:
┌────────────────────┐
│ Physical Mini UART│
│ (Serial Terminal) │
└────────┬───────────┘
│
┌────▼─────────────────┐
│ handle_uart_irq() │
│ │
│ uart_forwarded_hv? │
│ ├─ true: shell │ ← Hypervisor Shell mode
│ └─ false: VM fifo │ ← Guest console mode
└──────────────────────┘
- Hypervisor mode (default): Input goes to the Shell command processor
- VM mode (after vmc <id>): Input goes to the specified VM’s in_fifo
10.2 Shell Commands
| Command | Function |
|---|---|
| help | Display all available commands |
| vml | List all VM/vCPU status (PC, counters, trap statistics, etc.) |
| vmc <id> | Switch to the specified VM’s console |
| vmld <file> [entry] [core] | Dynamically load and start a new VM |
| ls | List SD card files |
10.3 Escape Sequences
In VM console mode, the @ key triggers an escape sequence:
| Sequence | Function |
|---|---|
| @c | Return to Hypervisor Shell |
| @0-@9 | Switch to specified VM |
| @l | Display vCPU list |
| @@ | Input literal @ character |
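The escape handling in the table amounts to a two-state machine: either the previous key was `@` or it was not. The sketch below models that with hypothetical types and names; what happens to an unrecognized key after `@` is not stated in the text, so swallowing it is an assumption made here.

```c
enum esc_action {
    ESC_NONE,        /* nothing to do yet (first '@' seen, or key swallowed) */
    ESC_FORWARD,     /* pass character ch on to the Guest */
    ESC_TO_SHELL,    /* @c */
    ESC_SWITCH_VM,   /* @0-@9, target in vm_id */
    ESC_LIST_VCPUS   /* @l */
};

struct esc_state  { int pending; };  /* 1 if the previous key was '@' */
struct esc_result { enum esc_action action; int vm_id; char ch; };

/* Feed one input character through the escape state machine. */
struct esc_result esc_feed(struct esc_state *s, char c)
{
    struct esc_result r = { ESC_NONE, -1, 0 };

    if (!s->pending) {
        if (c == '@') {
            s->pending = 1;          /* wait for the second key */
            return r;
        }
        r.action = ESC_FORWARD;
        r.ch = c;
        return r;
    }

    s->pending = 0;
    if (c == 'c') {
        r.action = ESC_TO_SHELL;
    } else if (c == 'l') {
        r.action = ESC_LIST_VCPUS;
    } else if (c >= '0' && c <= '9') {
        r.action = ESC_SWITCH_VM;
        r.vm_id = c - '0';
    } else if (c == '@') {
        r.action = ESC_FORWARD;      /* "@@" yields a literal '@' */
        r.ch = '@';
    }
    /* Any other key after '@' is swallowed (assumption). */
    return r;
}
```

Keeping the state in a small struct rather than a global makes it straightforward to hold one instance per console if more than one input source is ever added.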
11. Filesystem and VM Loading
11.1 SD Card Driver
aVisor implements a complete BCM2835 EMMC controller driver (sd.c) supporting:
- SD card initialization (CMD0/CMD8/ACMD41/CMD2/CMD3/CMD7 sequence)
- 4-bit data bus mode
- Block reads (sd_readblock)
11.2 FAT32 Filesystem
By integrating the FatFs generic FAT filesystem library, aVisor can read standard FAT32-formatted SD card images. The disk I/O layer (diskio.c) bridges FatFs and the SD card driver.
File layout on the SD card:
/
├── kernel8.img ← aVisor Hypervisor itself
├── rasp3b.dtb ← Guest Linux Device Tree
├── rootfs.gz ← Guest Linux initramfs
└── Image ← Guest Linux kernel image
11.3 Loading Process
raw_binary_loader loads files page-by-page into the Guest IPA space:
void load_file_to_memory(vm, filename, ipa, max_size)
{
f_open(&file, filename, FA_READ);
do {
page = allocate_vcpu_page(vm, ipa); // Allocate physical page + Stage-2 mapping
f_read(&file, page_va, PAGE_SIZE, &bytes_read); // page_va: Hypervisor-visible address of that page
dcache_clean_invalidate_range(page_va, PAGE_SIZE);
ipa += PAGE_SIZE;
} while (bytes_read == PAGE_SIZE); // A short read means end of file
f_close(&file);
}
For each page loaded, an IPA → PA mapping is established via map_stage2_page, and the DCache is flushed to ensure coherency.
12. Source Code Structure Overview
avisor/
├── hypervisor/
│ ├── arch/aarch64/
│ │ ├── boot.S # Boot code: EL3→EL2, page tables, MMU
│ │ ├── entry.S # Exception vector table, kernel_entry/exit
│ │ ├── sched.S # cpu_switch_to context switch
│ │ ├── utils.S # save/restore_sysregs, set_stage2_pgd
│ │ ├── irq.S # IRQ enable/disable
│ │ ├── sync_exc.c # Sync exception dispatch: HVC/SMC/PSCI/sysreg/fault
│ │ ├── timer.c # Hypervisor timer initialization
│ │ ├── vcpu.c # vCPU creation and management
│ │ └── vm.c # VM creation, Stage-2 page table initialization
│ ├── boards/raspi/
│ │ ├── mini_uart.c # Physical UART driver, console forwarding
│ │ ├── irq.c # IRQ dispatch
│ │ ├── timer.c # BCM2835 system timer
│ │ └── sd.c # SD card EMMC driver
│ ├── common/
│ │ ├── main.c # hypervisor_main entry point
│ │ ├── sched.c # Scheduler, virtual interrupt injection
│ │ ├── mm.c # Physical page allocation, Stage-2 page table ops
│ │ ├── shell.c # Hypervisor Shell (vml/vmc/vmld/ls)
│ │ ├── console.c # Console FIFO flushing
│ │ ├── fifo.c # Ring buffer implementation
│ │ ├── loader.c # Guest kernel loader
│ │ ├── smp.c # Multi-core startup
│ │ └── spinlock.c # LL/SC spinlock
│ ├── emulator/raspi/
│ │ ├── bcm2837.c # BCM2837 full peripheral MMIO emulation
│ │ └── vmbox.c # VideoCore Mailbox emulation
│ ├── fs/
│ │ ├── ff.c # FatFs FAT32 filesystem
│ │ └── diskio.c # Disk I/O glue layer
│ └── config.c # VM static configuration
├── include/
│ ├── arch/aarch64/
│ │ ├── sysregs.h # HCR_EL2/SCR/SPSR/VTCR register definitions
│ │ ├── mmu.h # MMU constants (page size, attribute flags)
│ │ └── vm.h # avisor_vm / cpu_sysregs structures
│ ├── common/
│ │ ├── sched.h # avisor_vcpu / per_cpu_data_t
│ │ ├── mm.h # VA_START / PHYS_MEMORY_SIZE
│ │ └── board.h # board_ops interface
│ └── boards/raspi/
│ ├── base.h # DEVICE_BASE / PBASE
│ ├── irq.h # IRQ register addresses
│ └── timer.h # Timer register addresses
└── scripts/
└── mksd3.py # FAT32 SD card image build tool
Conclusion
aVisor implements a fully functional Type-1 Hypervisor in fewer than 10,000 lines of C and assembly code, covering all core virtualization technologies:
| Technology Dimension | aVisor Implementation |
|---|---|
| CPU Virtualization | EL2 Trap-and-Emulate + HCR_EL2 control |
| Memory Virtualization | Stage-2 address translation + demand paging |
| I/O Virtualization | MMIO trap + software device emulation |
| Interrupt Virtualization | HCR_EL2.VI/VF virtual interrupt injection |
| Timer Virtualization | CNTV passthrough + system timer offsetting |
| Multi-Core | PSCI emulation + spin-table physical core startup |
It is an excellent learning resource for understanding ARM virtualization principles — small enough to read through all the code, yet complete enough to run a real Linux kernel.
References
git clone -b boot_linux --single-branch https://github.com/calinyara/avisor.git
cd avisor
./scripts/linux.sh