Up until now, the microprocessors I have dealt with made decisions using the following pattern:
compare A to B
branch if higher/lower/same etc. to somewhere else
The compare instruction is similar to a subtraction, however the result is discarded: only the changes to the ALU flags are kept, and the subsequent conditional branch instruction uses them to determine whether the branch (jump) is taken or not. Comparison operands can typically be registers or (sometimes) immediate values.
The RV32IMAC architecture executes the compare and conditional jump operation as a single instruction. Some examples are:
Note that conditional branches work with registers only. The destination is a 12 bit signed relative offset expressed in two byte steps. In other words, if this is 5 then the actual offset is 10 bytes away. The first two comparisons above use the zero register in the CPU core as one of the operands. The u suffix on the conditional branch instructions indicates that an unsigned comparison is to be made.
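To make the offset encoding and the signed/unsigned distinction concrete, here is a small host-side C model. It is only a sketch: the function names are made up for illustration and are not part of any toolchain, and blt/bltu are used as representative signed and unsigned branches.

```c
#include <stdint.h>

/* Convert the signed offset stored in a conditional branch (counted in
   2 byte steps) into a byte offset relative to the branch instruction. */
int32_t branch_byte_offset(int32_t encoded_steps)
{
    return encoded_steps * 2;
}

/* blt: both operands are treated as signed 32 bit values. */
int blt_taken(uint32_t rs1, uint32_t rs2)
{
    return (int32_t)rs1 < (int32_t)rs2;
}

/* bltu: both operands are treated as unsigned 32 bit values. */
int bltu_taken(uint32_t rs1, uint32_t rs2)
{
    return rs1 < rs2;
}
```

With rs1 = 0xFFFFFFFF and rs2 = 1 the signed branch is taken (-1 is less than 1) but the unsigned one is not, because the same bit pattern is then read as 4294967295.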
The code below shows addition, subtraction, multiplication and division. Immediate addition and subtraction are the same instruction: you add a negative sign to the immediate value to perform subtraction. Immediate values are 12 bit signed so the range of values is -2048 to +2047.
Getting the full 64 bit product of two 32 bit numbers requires two steps: the mul instruction produces the 32 bit low order word of the result. The mulh(u) instruction produces the 32 bit high order word. The u suffix is used for unsigned multiplication.
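The relationship between the two halves can be sketched in C using a 64 bit intermediate. This is a host-side model under my own naming (mul_low, mulhu_high and mulh_high are hypothetical helpers), not code that runs on the GD32VF103.

```c
#include <stdint.h>

/* mul: low 32 bits of the product (identical for signed and unsigned). */
uint32_t mul_low(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* mulhu: high 32 bits of the unsigned 64 bit product. */
uint32_t mulhu_high(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* mulh: high 32 bits of the signed 64 bit product. */
uint32_t mulh_high(uint32_t a, uint32_t b)
{
    int64_t product = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return (uint32_t)((uint64_t)product >> 32);
}
```

Multiplying 0xFFFFFFFF by 2 shows why the suffix matters: the low word is 0xFFFFFFFE either way, but the high word is 1 when the operands are unsigned and 0xFFFFFFFF when they are signed (-1 times 2).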
32 bit division uses the div instruction while the rem instruction can be used to determine the remainder of a division. Both have unsigned variants (divu and remu).
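A rough C model of the unsigned pair is given below. The helper names are my own. One detail worth knowing is that RISC-V does not trap on division by zero; the model reproduces the values the specification defines for that case.

```c
#include <stdint.h>

/* divu: unsigned quotient. Division by zero does not trap;
   the quotient is defined to be all ones instead. */
uint32_t divu_model(uint32_t dividend, uint32_t divisor)
{
    if (divisor == 0)
        return UINT32_MAX;
    return dividend / divisor;
}

/* remu: unsigned remainder. Dividing by zero returns the dividend. */
uint32_t remu_model(uint32_t dividend, uint32_t divisor)
{
    if (divisor == 0)
        return dividend;
    return dividend % divisor;
}
```

For the values used in the listing (19 and 6) this gives a quotient of 3 and a remainder of 1.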
I have noticed some problems debugging this chip. Normally when you debug assembly language, the debugger shows you the line of code that will be executed next, i.e. it hasn’t happened yet. I have noticed that this is not the case with this debugger and MCU. I suspect it is due to the instruction pipeline in the CPU behaving in a way that is not expected by J-Link and/or gdb. I have taken to (temporarily) adding nop instructions at various places to stop the CPU from getting ahead of me.
/* Initialization routine which sets the stack pointer,
sets initial global values and clears those that are not
specifically initialized. Assumes that the linker script aligned
data sections along a word (4 byte) boundary.
*/
.global Reset_Handler
.section start
Reset_Handler:
lui sp,0x20005 # set stack pointer to top of RAM
lui t2,%hi(a) /* load 20 high bits of address of a into t2 */
addi t2,t2,%lo(a) /* add lower 12 bits of address of a to t2 */
lw t0,0(t2) /* load the value pointed to by (0+t2) into t0 */
lui t2,%hi(b) /* load 20 high bits of address of b into t2 */
addi t2,t2,%lo(b) /* add lower 12 bits of address of b to t2 */
lw t1,0(t2) /* load the value pointed to by (0+t2) into t1 */
add t4,t0,t1 /* register to register addition */
sub t5,t1,t0 /* register to register subtraction */
addi t4,t0,1 /* add immediate */
addi t5,t0,-1 /* subtract immediate */
addi t4,t0,2047 /* maximum immediate addition (12 bits signed) */
addi t5,t0,-2048 /* maximum immediate subtraction (12 bits signed) */
lui t2,%hi(a) /* load 20 high bits of address of a into t2 */
addi t2,t2,%lo(a) /* add lower 12 bits of address of a to t2 */
lw t0,0(t2) /* load the value pointed to by (0+t2) into t0 */
lui t2,%hi(b) /* load 20 high bits of address of b into t2 */
addi t2,t2,%lo(b) /* add lower 12 bits of address of b to t2 */
lw t1,0(t2) /* load the value pointed to by (0+t2) into t1 */
mul t4,t0,t1 /* low order multiplication word */
mulhu t5,t0,t1 /* high order multiplication word */
lui t2,%hi(mul64) /* load 20 high bits of address of mul64 into t2 */
addi t2,t2,%lo(mul64) /* add lower 12 bits of address of mul64 to t2 */
sw t4,0(t2) /* store the value in t4 to address pointed to by (0+t2) */
sw t5,4(t2) /* store the value in t5 to address pointed to by (4+t2) */
lui t2,%hi(e) /* load 20 high bits of address of e into t2 */
addi t2,t2,%lo(e) /* add lower 12 bits of address of e to t2 */
lw t0,0(t2) /* load the value pointed to by (0+t2) into t0 */
lui t2,%hi(f) /* load 20 high bits of address of f into t2 */
addi t2,t2,%lo(f) /* add lower 12 bits of address of f to t2 */
lw t1,0(t2) /* load the value pointed to by (0+t2) into t1 */
divu t4,t0,t1 /* 32 bit unsigned division result */
remu t5,t0,t1 /* 32 bit unsigned division remainder (matches divu) */
lui t2,%hi(divresult) /* load 20 high bits of address of divresult into t2 */
addi t2,t2,%lo(divresult) /* add lower 12 bits of address of divresult to t2 */
sw t4,0(t2) /* store the value in t4 to address pointed to by (0+t2) */
sw t5,4(t2) /* store the value in t5 to address pointed to by (4+t2) */
nop /* 1 */
nop /* 2 */
nop /* 3 */
nop /* 4 */
nop /* 5 */
nop /* 6 */
nop /* 7 */
nop /* 8 */
nop /* 9 */
exit_spin:
j exit_spin
a: .word 0x12345678
b: .word 0x23456789
e: .word 19
f: .word 6
.data
mul64: .word 0,0
divresult: .word 0
rem32: .word 0
The GD32VF103 has 32 CPU core registers (x0 to x31), each of which is 32 bits wide. There is also a 32 bit program counter (pc), also known as an instruction pointer. Apart from x0, which is read-only and always returns a value of zero, all the registers are interchangeable. This means that any register can be a stack pointer, a link register, an argument to a function and so on. While this freedom may seem great, it could lead to chaos if you want pre-compiled program modules or libraries to work with one another. There must be some agreement between the authors of such modules as to which registers carry return results and parameters, which one behaves as a stack pointer and so on. The RISC-V Application Binary Interface (ABI) defines this and also renames the registers so that their use is more apparent. Assemblers and compilers are aware of these names too. The register names used in the RISC-V ABI are:
x0 is renamed to zero. This reminds me of the constant generator in the TI MSP430, which could output 6 different commonly used constant values. Using the zero register is faster than loading the value 0 from memory and is commonly used in program loops etc.
a0 to a7 : These are used to pass arguments to functions.
a0 and a1 are also used to return values from functions.
x2 is nominated as the Stack Pointer (sp)
x1 is used as a link register: it holds the return address of the most recent function call, so a leaf function can return through it without touching the stack. It is called “ra” (return address). This is similar to the link register in ARM Cortex-M processors.
t0 to t6 are “temporary” registers. Functions need not preserve values in these registers.
s0 to s11 are “saved” or “variable” registers. Functions must preserve values in these registers. They typically are used to hold a variable for quick access in a function (e.g. a loop counter).
x3 is renamed as gp (global pointer) and can be used to point at the middle of the global memory space.
x4 is renamed as tp (thread pointer). It is used in multi-threaded applications and points at a block of memory containing static data used by the current thread.
The mapping of these ABI register names to the underlying “x” register names may seem a little arbitrary. Presumably it is influenced by various efficiency constraints and the need to accommodate a version of the architecture which has only 16 registers (the “E” or embedded architecture). From a programmer’s perspective it makes no difference which underlying “x” register is used for each role so don’t worry too much about it!
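For reference, the full mapping from x register numbers to ABI names can be written as a lookup table. The sketch below is a host-side C helper of my own (abi_name is a hypothetical function, not part of any library); the table contents follow the standard RISC-V calling convention.

```c
#include <stddef.h>

/* ABI names indexed by x register number (x0 .. x31). */
static const char *const abi_names[32] = {
    "zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2",
    "s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5",
    "a6", "a7", "s2", "s3", "s4", "s5", "s6", "s7",
    "s8", "s9", "s10", "s11", "t3", "t4", "t5", "t6"
};

/* Return the ABI name for register x<number>, or NULL if out of range. */
const char *abi_name(int number)
{
    if (number < 0 || number > 31)
        return NULL;
    return abi_names[number];
}
```

Note how the temporary and saved registers are split into two groups each. x8 (s0) doubles as the frame pointer, which is why debuggers may list it as fp.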
In summary, the registers typically used by an application program are as follows:
t0-t6 : temporary or scratch registers
a0 to a7 : function arguments and return values
s0 to s11 : registers where you can keep variables inside a block of code. Register s0 is used as a frame pointer inside a function call.
sp : stack pointer
ra : return address for leaf functions
gp : global pointer
tp : thread pointer
zero : a register that always returns a value of zero.
How do I put a number in a register?
The GD32VF103 uses an RV32IMAC core. This means it does Integer calculations only, has hardware Multiply (and divide), is capable of certain Atomic (non-interruptible) instructions (useful for multitasking and interrupts) and can execute Compressed (16 bit) instructions as well as 32 bit ones.
From a programmer’s point of view, it might be nice if we could write instructions like this:
1) Put this 32 bit number into this register.
2) Add 1 to this register.
3) Store this register at this 32 bit memory address.
4) Set this register to zero.
From a CPU design perspective these instructions are less than ideal. Instruction 1 must be more than 32 bits wide as it has to encode the instruction, the target register and the 32 bit value.
Instruction 2 could be easily encoded in 16 bits.
Instruction 3 is, once again, wider than 32 bits.
Instruction 4 could be encoded in 16 (or fewer) bits.
These variable length instructions cause problems for instruction pipelines and complicate the instruction fetch mechanism. It would be nicer if instructions were a fixed width e.g. 32 bits. If you have lots of memory then this is fine. In embedded situations, where memory is in short supply, this is quite wasteful. If all instructions occupy 32 bits then simpler instructions will include lots of unused bits. RISC-V and ARM designers have compromised on instruction size by processing a mix of 16 and 32 bit instructions. This allows more instructions to be packed into less memory and only slightly complicates the instruction fetch and pipeline hardware. In the case of RISC-V the 16 bit instructions are referred to as Compressed instructions (the “C” in RV32IMAC).
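Part of the reason the fetch hardware stays simple is that the length of a RISC-V instruction can be read from the two lowest bits of its first 16 bit parcel. A minimal C sketch of that test follows; instruction_length is a name I made up, and the longer-than-32-bit formats that the specification reserves are ignored.

```c
#include <stdint.h>

/* Return the instruction length in bytes, given the first (lowest
   addressed) 16 bit parcel of the instruction. 32 bit instructions
   have both low bits set; anything else is a 16 bit compressed
   instruction. Longer formats are ignored in this sketch. */
int instruction_length(uint16_t first_parcel)
{
    return ((first_parcel & 0x3) == 0x3) ? 4 : 2;
}
```

For example, the 32 bit nop encodes as 0x00000013 (low parcel 0x0013, length 4) while the compressed c.nop encodes as 0x0001 (length 2).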
Ok, we have 32 bit and 16 bit instructions. How do we do instruction 1 above:
Put this 32 bit value into this register
You could do it in two halves and load the upper 16 bits followed by the lower 16 bits using two 32 bit instructions.
Or, you could execute a command of the following form:
Load the 32 bit value in memory that is N bytes away from here.
In the case of RISC-V, you can do the following:
Load the following 20 bits into the upper bits of this register (clearing the lower 12 bits)
Add the following 12 bit number.
The programmer can write these two commands as follows:
lui t0,0x12345 /* load upper 20 bits */
addi t0,t0,0x678 /* add lower 12 bits */
This is further complicated by the fact that the addi instruction takes a signed value. If you need to add an immediate value whose 12th bit is set (implying a negative value) you have to figure out the two’s complement value, add what looks like a negative number and increase the upper 20 bits by one to compensate. Recognizing that this is likely to lead to all sorts of human errors, a handy pseudo instruction is available: load immediate or li. This is translated by the assembler into the correct pair of lui and addi instructions. So, our load now goes like this:
li t0,0x12345678
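The arithmetic behind the split can be sketched in C. The helper names below (hi20, lo12, lui_addi) are hypothetical; as far as I understand it, this is the same calculation that the %hi and %lo operators perform in the GNU assembler.

```c
#include <stdint.h>

/* Upper 20 bits for lui, rounded up when the low 12 bits will be
   treated as negative by addi. */
uint32_t hi20(uint32_t value)
{
    return ((value + 0x800u) >> 12) & 0xFFFFFu;
}

/* Low 12 bits for addi, sign extended to the range -2048 .. 2047. */
int32_t lo12(uint32_t value)
{
    int32_t low = (int32_t)(value & 0xFFFu);
    return (low >= 0x800) ? low - 0x1000 : low;
}

/* What the lui/addi pair leaves in the register. */
uint32_t lui_addi(uint32_t upper20, int32_t lower12)
{
    return (upper20 << 12) + (uint32_t)lower12;
}
```

For 0x12345678 the split is simply 0x12345 and 0x678. For 0x12345FFF the low part reads as -1, so the upper part has to become 0x12346 for the sum to come out right.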
The Load Store architecture.
All arithmetic and logical operations in the RV32IMAC are carried out via the CPU registers. It is not possible to add values in memory directly to one another: you need to get them into registers first (load), do the calculation and then optionally write (store) the result back to memory. Suppose you want to do the following calculation:
c = a + b;
Typically the process works like this:
Make a pointer to a.
Load the value at a into a register.
Make a pointer to b.
Load the value at b into a (different) register.
Add the two registers together.
Make a pointer to c.
Write the result to c.
The code shown below implements this (it is not particularly optimal).
lui t2,%hi(a) /* load 20 high bits of address of a into t2 */
addi t2,t2,%lo(a) /* add lower 12 bits of address of a to t2 */
lw t0,0(t2) /* load the value pointed to by (0+t2) into t0 */
lui t2,%hi(b) /* load 20 high bits of address of b into t2 */
addi t2,t2,%lo(b) /* add lower 12 bits of address of b to t2 */
lw t1,0(t2) /* load the value pointed to by (0+t2) into t1 */
add t0,t0,t1 /* add the values at a and b */
lui t2,%hi(c) /* load 20 high bits of address of c into t2 */
addi t2,t2,%lo(c) /* add lower 12 bits of address of c to t2 */
sw t0,0(t2) /* store the value in t0 to address pointed to by (0+t2) */
exit_spin:
j exit_spin
/* constants below are in flash */
a: .word 0x12345678
b: .word 0x23456789
/* variables are placed in ram */
.data
c: .word 0
I was looking around for a board to tinker with RV32 assembly language as a way of getting to know the architecture a bit better. I tried using a WCH-Link debugger module and a CH32VF103 board but so far I have had no success using OpenOCD with it. I have opted instead to use a Longan Nano GD32VF103 in conjunction with a J-Link Edu debugger. This worked well enough for me to get going although the debug interface appears to be very sensitive to noise.
Using the J-Link tools from Segger, a GDB link to the target can be started as follows:
JLinkGDBServer -device GD32VF103C8T6 -if JTAG
First code.
My goal here is to get started into RISC-V assembler with the minimum amount of fuss. When the Longan-Nano GD32VF103 boots it begins executing code at address 0. Typically this code would initialize global and static variables, set the stack pointer and then call on main. For this particular architecture it also needs to set up the interrupt controller. I will do this at a later time. For now I will work without interrupts.
/* init.s
Initialization routine which sets the stack pointer,
sets initial global values and clears those that are not
specifically initialized. Assumes that the linker script aligned
data sections along a word (4 byte) boundary.
*/
.global Reset_Handler
.extern INIT_DATA_VALUES
.extern INIT_DATA_START
.extern INIT_DATA_END
.extern BSS_START
.extern BSS_END
.extern main
.section start
Reset_Handler:
lui sp,0x20005 # set stack pointer to top of RAM
# Fill global and static variables with initial values
la t0,INIT_DATA_VALUES
la t1,INIT_DATA_START
la t2,INIT_DATA_END
init_data_store_loop:
beq t1,t2,done_init_data
lw a0,0(t0)
sw a0,0(t1)
addi t0,t0,4
addi t1,t1,4
j init_data_store_loop
done_init_data:
# Fill uninitialized global and static variables with zero
la t0,BSS_START
la t1,BSS_END
zero_data_store_loop:
beq t0,t1,done_zero_data
sw x0,0(t0) # t0 is the moving pointer; t1 marks the end of the region
addi t0,t0,4
j zero_data_store_loop
done_zero_data:
# call main C code
jal main
main_exit_spin: /* should not get here. */
j main_exit_spin
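For readers more comfortable with C, here is roughly what the two loops in init.s are meant to do. This is a host-side sketch only: init_memory is a hypothetical function and its pointer parameters stand in for the linker script symbols (INIT_DATA_VALUES, INIT_DATA_START and so on). It could not replace init.s itself, since C code needs the stack pointer and the data sections set up before it runs.

```c
#include <stdint.h>

/* Copy initial values into .data, then zero .bss, one word at a time.
   The pointer parameters stand in for the linker script symbols. */
void init_memory(const uint32_t *init_values,
                 uint32_t *data_start, uint32_t *data_end,
                 uint32_t *bss_start, uint32_t *bss_end)
{
    /* Equivalent of init_data_store_loop */
    while (data_start != data_end)
        *data_start++ = *init_values++;

    /* Equivalent of zero_data_store_loop */
    while (bss_start != bss_end)
        *bss_start++ = 0;
}
```

Both loops compare for equality rather than less-than, just like the beq instructions in the assembly, which is why the linker script has to keep the section boundaries word aligned.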
This code needs to be placed at address 0 (aliased from 0x08000000). The linker script helps do this by associating the section name “start” with the first entry in the flash ROM.
There are two files in this project: main.c (a simple C program) and init.s.
The -nostdlib argument really says that this is a completely bare-metal program that requires no additional components.
The linker script file name is specified with the -T argument.
The -g3 argument turns debugging information up to the maximum, which helps debugging.
The -O0 argument turns off all optimizations so that the code is left “as is”.
Debug session
Execute the following command to start the debug session (assuming you have started the JLinkGDBServer in another window).
gdb-multiarch a.out
This starts the following GDB session.
GNU gdb (Ubuntu 13.1-2ubuntu2) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type “show copying” and “show warranty” for details.
This GDB was configured as “x86_64-linux-gnu”.
Type “show configuration” for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type “help”.
Type “apropos word” to search for commands related to “word”…
Reading symbols from a.out…
(gdb) target ext :2331
Remote debugging using :2331
main () at main.c:12
12 x+=y;
(gdb) monitor reset
Resetting target
(gdb) load
Loading section .text, size 0x9c lma 0x0
Loading section .data, size 0x8 lma 0x9c
Start address 0x00000000, load size 164
Transfer rate: 160 KB/sec, 82 bytes/write.
(gdb) stepi
Reset_Handler () at init.s:17
17 la t0,INIT_DATA_VALUES
(gdb) i r
ra 0x0 0x0 <Reset_Handler>
sp 0x20005000 0x20005000
gp 0x0 0x0 <Reset_Handler>
tp 0x0 0x0 <Reset_Handler>
t0 0x0 0
t1 0x0 0
t2 0x0 0
fp 0x0 0x0 <Reset_Handler>
s1 0x0 0
a0 0x0 0
a1 0x0 0
a2 0x0 0
a3 0x0 0
a4 0x0 0
a5 0x0 0
a6 0x0 0
a7 0x0 0
s2 0x0 0
s3 0x0 0
s4 0x0 0
s5 0x0 0
s6 0x0 0
s7 0x0 0
s8 0x0 0
s9 0x0 0
s10 0x0 0
s11 0x0 0
t3 0x0 0
t4 0x0 0
t5 0x0 0
t6 0x0 0
pc 0x4 0x4 <Reset_Handler+4>
The commands that are entered appear after the (gdb) prompt in the above listing. The first of these is
target ext :2331
This connects to the JLinkGDBServer over TCP port 2331 on the local machine.
monitor reset
This resets (and halts in this case) the GD32VF103.
load
Loads the program specified in the command line (a.out) into flash memory.
stepi
Executes the single assembler instruction pointed to by the program counter (pc).
i r
Shorthand for info registers. This displays the contents of the CPU registers.
Now that all of this seems to be working, further adventures in RISC-V assembler will follow.
Following on from the previous article that talked about the ECLIC in the GD32VF103 RISC-V microcontroller, I decided to measure its interrupt response. The example works like this: configure a port pin to generate an interrupt when it goes through a high-low transition. The interrupt handler then drives the pin back high again.
The main body of this example consists of a loop which drives the pin low, thus triggering an interrupt; the interrupt handler sends it high again. The loop incorporates a small delay to facilitate measurement. The GPIO pin in question is Port C bit 13 which happens to control the red LED on the Sipeed Longan Nano. The results are shown in the figure below.
As can be seen, it takes 416.667 ns to drive the pin high again. This corresponds to 45 clock cycles at 108 MHz, which is the time it takes to save the CPU registers, determine the cause of the interrupt, jump to the handler and perform a standard function entry prologue (set up stack etc). Not too shabby 🙂
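The conversion from the measured time to clock cycles is simple enough to check by hand, but here it is as a small C helper in case you want to repeat the measurement at a different clock speed. The function name is my own invention.

```c
/* Convert a time interval to clock cycles, rounded to the nearest
   cycle. Time is in nanoseconds, clock frequency in MHz. */
long cycles_in(double nanoseconds, double clock_mhz)
{
    double cycles = nanoseconds * clock_mhz / 1000.0;
    return (long)(cycles + 0.5);
}
```

Calling cycles_in(416.667, 108.0) gives the 45 cycles quoted above (416.667 ns multiplied by 108 cycles per microsecond is 45.00 cycles).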
Interrupt handling is a core part of embedded systems and different architectures have different ways of dealing with it. The RISC-V Bumblebee core in the GD32VF103 uses an interrupt controller called the Enhanced Core-Local Interrupt Controller (ECLIC). All interrupts (internal and external) are handled by the ECLIC. The ECLIC handles prioritization and level/edge triggering of interrupts. The documentation on the ECLIC is pretty poor at the moment. The Bumblebee core datasheet is formatted very badly (fonts all over the place, broken diagrams etc.) and it was only by combining this with the assembler source provided with the GD32VF103 SDK that I began to get a clearer picture of what is going on.
The ECLIC has two modes of dealing with interrupts: vectored and non-vectored. Vectored interrupt handling is similar to other MCUs such as the ARM, MSP430 etc. The CPU receives interrupt request ‘N’ and finds the interrupt handler by looking up entry ‘N’ in the interrupt vector table. Saving of registers (context saving) is left to the writer of the interrupt handler and on exit the handler often must execute a return from interrupt instruction (though not for all architectures).
Non-Vectored interrupt handling is quite different. All interrupt requests cause the same interrupt handler code to be executed initially. This code saves the CPU registers, finds out which interrupt happened and calls the handler pointed to by the appropriate interrupt vector table entry. This happens using a standard subroutine/function call instruction. The handler deals with the interrupt and executes a normal function return instruction. The registers are then restored and a return from interrupt instruction is executed.
The advantage of this approach is that the code to preserve/restore the registers is written once (safely) and only occupies memory once. Also, the interrupt handlers can be written in C or C++ without the need to tag on unusual attributes like “interrupt” etc. The potential disadvantage is that the handling of interrupts may be a little slower than it would have been using the vectored approach because you end up saving and restoring all CPU registers, whether you use them or not.
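The shape of the non-vectored scheme can be modelled on a PC in a few lines of C. Everything in this sketch is hypothetical (the names, the table size, the counter standing in for the saved registers); on the real chip the common entry point is assembly code and the lookup is done with help from the ECLIC hardware.

```c
#include <stddef.h>

#define IRQ_COUNT 8   /* arbitrary size for this model */

typedef void (*irq_handler_t)(void);

/* The vector table: one plain function pointer per interrupt number. */
static irq_handler_t vector_table[IRQ_COUNT];

int context_depth;    /* stands in for the saved register frame */
int timer_irq_count;  /* side effect of the example handler */

/* An ordinary C function, no "interrupt" attribute needed. */
void example_timer_handler(void)
{
    timer_irq_count++;
}

/* Install a handler for one interrupt number. */
void register_handler(int irq, irq_handler_t handler)
{
    if (irq >= 0 && irq < IRQ_COUNT)
        vector_table[irq] = handler;
}

/* Common entry point shared by every interrupt request. */
void common_irq_entry(int irq)
{
    context_depth++;                  /* save registers */
    if (irq >= 0 && irq < IRQ_COUNT && vector_table[irq] != NULL)
        vector_table[irq]();          /* normal function call */
    context_depth--;                  /* restore registers, then return */
}
```

The point to notice is that the handler is reached with a normal call and leaves with a normal return; only the common entry code needs to know it is running in interrupt context.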
Excerpt from ~/.platformio/packages/framework-gd32vf103-sdk/RISCV/env_Eclipse/entry.S
.weak irq_entry
irq_entry: // -------------> This label will be set to MTVT2 register
// Allocate the stack space
SAVE_CONTEXT// Save 16 regs
//------This special CSR read operation, which is actually use mcause as operand to directly store it to memory
csrrwi x0, CSR_PUSHMCAUSE, 17
//------This special CSR read operation, which is actually use mepc as operand to directly store it to memory
csrrwi x0, CSR_PUSHMEPC, 18
//------This special CSR read operation, which is actually use Msubm as operand to directly store it to memory
csrrwi x0, CSR_PUSHMSUBM, 19
// ****************** THIS IS WHERE THE JUMP TO THE INTERRUPT HANDLER FUNCTION HAPPENS! **************
service_loop:
//------This special CSR read/write operation, which is actually Claim the CLIC to find its pending highest
// ID, if the ID is not 0, then automatically enable the mstatus.MIE, and jump to its vector-entry-label, and
// update the link register
csrrw ra, CSR_JALMNXTI, ra
//RESTORE_CONTEXT_EXCPT_X5
#---- Critical section with interrupts disabled -----------------------
DISABLE_MIE # Disable interrupts
LOAD x5, 19*REGBYTES(sp)
csrw CSR_MSUBM, x5
LOAD x5, 18*REGBYTES(sp)
csrw CSR_MEPC, x5
LOAD x5, 17*REGBYTES(sp)
csrw CSR_MCAUSE, x5
RESTORE_CONTEXT
// Return to regular code
mret
The comments just before the instruction csrrw ra, CSR_JALMNXTI, ra are a little obscure but their sense is clear enough: if there is a pending interrupt, the ECLIC will execute a call to its interrupt handler and will update the link register so that, when the handler executes a return from subroutine instruction, control will return to the next instruction in irq_entry.
The macros SAVE_CONTEXT and RESTORE_CONTEXT are defined elsewhere in the startup source code. The constant CSR_JALMNXTI evaluates to 0x7ed, a register number in the ECLIC. According to the Bumblebee core data sheet this register is
“The custom register is used to enable the ECLIC interrupt. The read operation of this register can process the next interrupt and return the entry address of the next interrupt handler. Jump to this address.”.
This is the mechanism by which the developer supplied interrupt handler is called.
How does the irq_entry code get called in the first place? Looking at the code for .platformio/packages/framework-gd32vf103-sdk/RISCV/env_Eclipse/start.S we find the following code that executes during boot:
_start:
csrc CSR_MSTATUS, MSTATUS_MIE
/* Jump to logical address first to ensure correct operation of RAM region */
la a0, _start
li a1, 1
slli a1, a1, 29
bleu a1, a0, _start0800
srli a1, a1, 2
bleu a1, a0, _start0800
la a0, _start0800
add a0, a0, a1
jr a0
_start0800:
/* Set the the NMI base to share with mtvec by setting CSR_MMISC_CTL */
li t0, 0x200
csrs CSR_MMISC_CTL, t0
/* Intial the mtvt*/
la t0, vector_base
csrw CSR_MTVT, t0
/* Intial the mtvt2 and enable it*/
la t0, irq_entry
csrw CSR_MTVT2, t0
csrs CSR_MTVT2, 0x1
Note the instructions la t0, irq_entry and csrw CSR_MTVT2, t0. These tell the ECLIC where to find the irq_entry code. The Bumblebee core reference manual defines this register, CSR_MTVT2 (0x7ec), as
“Custom registers are used to set non-vector interrupt handling Mode interrupt entry address” .
So, there we have it. The picture below shows my rough understanding of the process.
I have written up two more demo programs: one uses the internal clock cycle timer interrupt (mtimecmp) (Systick interrupt), the other uses Timer 6 as an external source of timer interrupts (TimerIRQ). Code is available over on GitHub.
The Longan-nano board shown above includes a microcontroller with a RISC-V core. The chip, a GD32VF103CBT6, seems to be very similar to the STM32F103C8T6 with the exception that the ARM Cortex-M3 core has been swapped out for a RISC-V Bumblebee core. This little development kit has an Arduino nano form factor and also sports a 160×80 full colour display, an RGB LED and a micro SD-card socket. This particular one came from Seeed Studio at a cost of $4.90 + shipping.
There are two ways of programming this chip: you can use a JTAG debugger (various kinds supported) or you can put the chip into DFU (Device Firmware Update) mode by holding the Reset and Boot buttons down and releasing the Reset button first. This allows you to program the board using a simple USB-Serial converter. I’ve gone with this initially but will definitely be exploring the details of the RISC-V core with a proper debugger shortly.
Developing code
The documentation for this chip pushes you towards using PlatformIO for development. This is an extension for Visual Studio Code – another first for me. You first install VSCode and then add in the PlatformIO extension (I just followed the online guide and it all just worked 🙂 )
There is good support for the GD32VF103CBT6 within PlatformIO and so there is no problem with setting up environment variables, directories and so on. I chose to develop in C++ which caused me a slight problem with the gd32vf103.h header file. This is intended for use with C projects and includes a definition of the bool type. This causes a problem for C++ as it already has such a type. If you edit gd32vf103.h as follows you can fix this (around line 180 look for the enum declaration called bool)
One more thing: when you include this file in your C++ files be sure to surround the include statement with extern "C" as shown below:
extern "C" {
#include "gd32vf103.h"
}
If you don’t do this you run into all sorts of problems with C++ decorated names.
And along came a gotcha
I have to admit that I have been a little lazy when it comes to function return values over the years. Let’s say you write an I/O function that may, in some future design, return an error code – as the project is only starting however you have not written that code yet.
So, your code might look like this:
int test()
{
int X;
X = 1;
/* etc */
}
Note the missing return statement; after all, you haven’t written that code yet. Well, this runs fine on the ARM Cortex MCUs I’ve used, and also on x86. On RISC-V/GCC9.2 this crashes your program. I spent a while scratching my head over this one. The C++ standard states that flowing off the end of a function that promises a return value is “undefined” behavior, so the compiler is free to generate anything. So, be warned: if you state that you are going to return a value then do!
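For completeness, here is the same placeholder function with the return statement supplied. The name test_with_return is mine, chosen to keep it distinct from the broken version above. GCC can catch the omission for you: the -Wreturn-type warning is included in -Wall, and -Werror=return-type turns it into a hard error.

```c
/* Same function with the promised return value supplied. */
int test_with_return(void)
{
    int X;
    X = 1;
    /* etc */
    return X;
}
```

Returning a dummy value until the real error handling is written costs nothing and keeps the behavior the same on every architecture.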
Demo code
I’ve started a repository on GitHub with some examples (multi-colour blinky and a graphics demo for now). You can view it here