Fast PWM on the STM32F103 Bluepill

I needed a signal source for another project I’m working on so I set about building a signal generator of sorts using a “Bluepill” board. The STM32F103 on these boards does not have a Digital to Analogue Converter but it does have a very capable timer which can be used to generate waveforms.
STM32F103_TIM1_DMA_PWM

The timer (TIM1) is shown in the diagram above. In the mode that it is configured for this project it works as follows:
The CNT register counts up to the value in ARR
When CNT = ARR the upper digital comparator outputs a pulse which is used to reset CNT to 0 and to set the PWM output signal high (via the S-R Flip-Flop).
When CNT is equal to the value in CCR1, the lower digital comparator outputs a pulse which resets the PWM output low. ARR therefore controls the output frequency while CCR1 controls the output duty.
(CNT = count, ARR = auto-reload register, CCR1 = Counter-Compare Register 1)

I’m interested in producing audio range signals so up to about 20kHz. An trade-off between switching frequency, resolution, and output filtering arises here. First of all, I want to use a simple RC filter on the output to remove the high frequency switching. This means that the switching frequency has to be a lot higher than the target output signal i.e. much greater than 20kHz. But how much higher? The clock in the STM32F103 runs at 72MHz so this places an upper limit on things. Lets say we go for a switching frequency of 12MHz. This means that the counter will count from zero to five and then back up to zero again (6 states in total). This means that our PWM waveform generation system is only able to produce 6 different outputs (between 2 and 3 bits of resolution) which is pretty poor. I wanted a lot more resolution than that so after playing with the numbers I arrived at the following. Let ARR = 255 (8 bits of resolution). This gives a switching frequency of 280500Hz which is certainly not audible and is pretty easy to filter with passive components.
In order to produce a sine-wave output the CCR1 value needs to be varied sinusoidally. This could be done by generating an interrupt for each PWM cycle and loading values from a look-up table into CCR1. This places a reasonably high load on the CPU however given the high switching frequency. Another option is to use DMA which can transfer values from the lookup-table directly into CCR1. After all the values of the table have been used up, the DMA controller can then generate an interrupt allowing the program to perhaps modify the output signal if necessary. The interrupt rate is a lot lower with this approach and the CPU is free to do other tasks.
stm32f103_280kHz_PWM_logic_out
The figure above shows a logic analyzer view of the outputs. The upper signal is a fixed duty signal from CCR2 (another PWM output channel). The middle signal is the sinusoidal PWM output. The lower signal is a digital output that is toggled at the end of each DMA transfer (every millisecond).

After filtering the output looks like this:
stm32f103_280kHz_PWM

The upper trace is the PWM output after it has been passed through an RC filter with a cutoff of 20kHz. As expected the switching frequency (which is most pronounced around zero-crossings of the sinewave) is attenuated to about 1 tenth of its original size. The lower trace shows the unfiltered output.

The next step is to produce a variable frequency and voltage output – perhaps even two channels. Code is available over on github
Incidentally, the particular board used here is one with 128kB of flash (earlier ones I used had only 64kB).

Measuring GD32VF103 Interrupt response time

Following on from the previous article that talked about the ECLIC in the GD32VF103 RISC-V microcontroller I decided to measure its interrupt response. The example works like this: Configure a port pin to generate an interrupt when it goes through a high-low transition. The interrupt handler then drives the pin back high again.
The main body of this example consists of a loop which drives the pin low thus triggering an interrupt. The interrupt handler sends it high again. This is done in a loop which incorporates a small delay to facilitate measurement. The GPIO pin in question is Port C bit 13 which happens to control the red LED on the Sipeed Longan Nano. The results are shown in the figure below.
GD32VF103IRQResponseTime

As can be seen, it takes 416.667ns to drive the pin high again. This corresponds to 45 clock cycles at 108Mhz which is the time it takes to save the CPU registers, determine the cause of the interrupt, jump to the handler and perform a standard function entry prologue (set up stack etc). Not too shabby 🙂

Code is available over on Github

Interrupts and the GD32VF103

Interrupt handling is a core part of embedded systems and different architectures have different ways of dealing with it. The RISC-V Bumblebee core in the GD32VF103 uses an interrupt controller called the Enhanced Core-Local Interrupt Controller (ECLIC). All interrupts (internal and external) are handled by the ECLIC. The ECLIC handles prioritization and level/edge triggering of interrupts.The documentation on the ECLIC is pretty poor at the moment. The Bumblebee core datasheet is formatted very badly (fonts all over the place, broken diagrams etc) and it was only by combining this with the assembler provided in the GD32VF103 that I began to get a clearer picture of what is going on.

The ECLIC has two modes of dealing with interrupts : vectored and non-vectored. Vectored interrupt handling is similar to other MCU’s such as the ARM, MSP430 etc. The CPU receives interrupt request ‘N’ and finds the interrupt handler by looking up entry ‘N’ in the interrupt vector table. Saving of registers (context saving) is left to the writer of the interrupt handler and on exit the handler often must execute a return from interrupt instruction (though not for all architectures).
Non-Vectored interrupt handling is quite different. All interrupt requests cause the same interrupt handler code to be executed initially. This code saves the CPU registers, finds out which interrupt happened and calls the handler pointed to by the appropriate interrupt vector table entry. This happens using a standard subroutine/function call instruction. The handler deals with the interrupt and executes a normal function return instruction. The registers are then restored and a return from interrupt instruction is executed.

The advantage of this approach is that the code to preserve/restore the registers is done (safely) and only needs to occur once in memory. Also, the interrupt handlers can be written in C or C++ without the need to tag on unusual attributes like “interrupt” etc. The potential disadvantage is that maybe the handling of interrupts is a little slower than it might have been using the vectored approach because you end up saving and restoring all CPU registers – whether you use them or not.

Excerpt from ~/.platformio/packages/framework-gd32vf103-sdk/RISCV/env_Eclipse/entry.S

.weak irq_entry
irq_entry: // -------------> This label will be set to MTVT2 register
  // Allocate the stack space
  

  SAVE_CONTEXT// Save 16 regs

  //------This special CSR read operation, which is actually use mcause as operand to directly store it to memory
  csrrwi  x0, CSR_PUSHMCAUSE, 17
  //------This special CSR read operation, which is actually use mepc as operand to directly store it to memory
  csrrwi  x0, CSR_PUSHMEPC, 18
  //------This special CSR read operation, which is actually use Msubm as operand to directly store it to memory
  csrrwi  x0, CSR_PUSHMSUBM, 19
 
// ****************** THIS IS WHERE THE JUMP TO THE INTERRUPT HANDLER FUNCTION HAPPENS! **************

service_loop:
  //------This special CSR read/write operation, which is actually Claim the CLIC to find its pending highest
  // ID, if the ID is not 0, then automatically enable the mstatus.MIE, and jump to its vector-entry-label, and
  // update the link register 
  csrrw ra, CSR_JALMNXTI, ra 
  
  //RESTORE_CONTEXT_EXCPT_X5

  #---- Critical section with interrupts disabled -----------------------
  DISABLE_MIE # Disable interrupts 

  LOAD x5,  19*REGBYTES(sp)
  csrw CSR_MSUBM, x5  
  LOAD x5,  18*REGBYTES(sp)
  csrw CSR_MEPC, x5  
  LOAD x5,  17*REGBYTES(sp)
  csrw CSR_MCAUSE, x5  


  RESTORE_CONTEXT

  
  // Return to regular code
  mret

The comments just before the instruction csrrw ra, CSR_JALMNXTI, ra are a little obscure but their sense is clear enough: If there is a pending interrupt the ECLIC will execute a call to its interrupt handler and will update the Link Register so that when the handler executes a return from subroutine instruction control will return to the next line in the irq_entry.
The macros SAVE_CONTEXT, RESTORE_CONTEXT are defined elsewhere in the startup source code. The constant CSR_JALMNXTI evalautes to 0x7ed – a register number in the ECLIC. According to the Bumblebee core data sheet this register is
“The custom register is used to enable the ECLIC interrupt. The read operation of this register can process the next interrupt and return the entry address of the next interrupt handler. Jump to this address.”.

This is the mechanism by which the developer supplied interrupt handler is called.
How does the irq_entry code get called in the first place? Looking at the code for .platformio/packages/framework-gd32vf103-sdk/RISCV/env_Eclipse/start.S we find the following code that executes during boot:

_start:

        csrc CSR_MSTATUS, MSTATUS_MIE
        /* Jump to logical address first to ensure correct operation of RAM region  */
    la          a0,     _start
    li          a1,     1
        slli    a1,     a1, 29
    bleu        a1, a0, _start0800
    srli        a1,     a1, 2
    bleu        a1, a0, _start0800
    la          a0,     _start0800
    add         a0, a0, a1
        jr      a0

_start0800:

    /* Set the the NMI base to share with mtvec by setting CSR_MMISC_CTL */
    li t0, 0x200
    csrs CSR_MMISC_CTL, t0

        /* Intial the mtvt*/
    la t0, vector_base
    csrw CSR_MTVT, t0

        /* Intial the mtvt2 and enable it*/
    la t0, irq_entry
    csrw CSR_MTVT2, t0
    csrs CSR_MTVT2, 0x1

Note the instructions la t0, irq_entry , csrw CSR_MTVT2, t0 . These tell the ECLIC where to find the irq_entry code. The Bumblebee core reference manual defines this register CSR_MTV2 (0x7ec) as
“Custom registers are used to set non-vector interrupt handling Mode interrupt entry address” .

So, there we have it. The picture below shows my rough understanding of the process.
ECLIC
I have written up two more demo programs: One which uses the internal clock cycle timer interrupt (mtimecmp) (Systick interrupt), the other uses Timer 6 as an external source of timer interrupts (TimerIRQ). Code is available over on Github

Longan-nano RISC-V and PlatformIO

longan_patchwork

The Longan-nano board shown above includes a microcontroller with a RISC-V core. The chip seems to be very similar to the STM32F103C8T6 with the exception that the ARM-Cortex M3 core has been swapped out for a RISC-V Bumblebee core called the GD32VF103CBT6. This little development kit has an Arduino nano form factor and also sports a 160×80 full colour display, an RGB LED and a micro SD-card socket. This particular one came from Seeed Studio at a cost of $4.90 + shipping.

There are two ways of programming this chip: You can use a JTAG debugger (various kinds supported) or you can put the chip into DFU (Device Firmware Update) mode by holding the Reset and Boot button down, releasing the Reset button first. This allows you to program the board using a simple USB-Serial converter. I’ve gone with this initially but will definitely be exploring the details of the RISC-V core with a proper debugger shortly.

Developing code
The documentation for this chip pushes you towards using PlatformIO for development. This is an extension for Visual Studio Code – another first for me. You first install VSCode and then add in the PlatformIO extension (I just followed the online guide and it all just worked 🙂 )

There is good support for the GD32VF103CBT6 within PlatformIO and so there is no problem with setting up environment variables, directories and so on. I chose to develop in C++ which caused me a slight problem with the gd32vf103.h header file. This is intended for use with C projects and includes a definition the bool type. This causes a problem for C++ as it already has such a type. If you edit gd32vf103.h as follows you can fix this (around line 180 look for the enum declaration called bool)

#ifndef __cplusplus
typedef enum {FALSE = 0, TRUE = !FALSE} bool;
#endif

One more thing: when you include this file in your C++ files be sure to surround the include statement with “extern C” as shown below:

extern "C" {
#include "gd32vf103.h"
}

If you don’t do this you run into all sorts of problems with C++ decorated names.

And along came a gotcha
I have to admit that I have been a little lazy when it comes to function return values over the years. Let’s say you write an I/O function that may, in some future design, return an error code – as the project is only starting however you have not written that code yet.
So, your code might look like this:

int test()
{
    int X;
    X = 1;
    /* etc */
}

Note the missing return statement – after all, you haven’t written that code yet. Well, this runs fine on ARM Cortex MCU’s I’ve used, and, also on x86. On RISC-V/GCC9.2 this crashes your program. I spent a while scratching my head over this one. The C++ standard apparently states that behavior in this example is “undefined”. So, be warned: If you state that you are going to return a value then do!

Demo code
I’ve started a repository on github with some examples (multi-colour blinky and a graphics demo for now). You can view it here

NRF905 Revisited

I have worked with the NRF905 radio transceiver in the past as part of a weather station. At the time I wasn’t entirely happy about the code for the radio as it felt a little hacky. Like many others before me I vowed “This time I’m going to do it right” 🙂 so I have set about rewriting the radio part in a more object oriented way. I have put together a transceiver pair as shown below and I intend working on the code to simply the NRF905 application interface as much as I can. Right now there is little abstraction going on – that needs to change.
The transceiver pair consists of a couple of NRF905’s wired to some STM32F030’s mounted on breakout boards. These are wire-wrapped together and a full schematic will eventually be posted along with the code over on github


Update 19-Nov-2019
I have updated the code on github with a tighter interface to the nrf905 class. I also did an initial range test and was getting error free line of sight range of about 200m @ 10dBm. The transmitter was on top of a car and the receiver was at about 1m above the ground. Transmitter and receiver had 1/4 wave monopole antennae. A longer range is possible if you are not concerned about the odd missing packet.

Added support for the ST7789 display module to Breadboard games

Breadboard games has been based on parallel interface displays so far. These displays have also had resistive touch screen interfaces. I have a number of ST7789 displays left over from Dublin Maker so, once again on a cross country train journey I had my breadboard, logic analyzer and laptop out on the table (making the other passengers a little nervous I suspect :). The result: an ST7789 port of Breadboard games. These displays don’t have touch input so the sprite designed doesn’t work however it does take about 2 euro off the price bringing it down below €5 per kit.
Code is available over here

A measurement of the performance of Rust vs C.

This post follows on from my previous one which was an opening foray into the world of Rust. Following a comment I received I decided I would look at the relative performance of Rust and C for the trivial example of blinky running as fast as the CPU can do it (using the default reset clock speed). This involved modifying the C and Rust code.
The modified main.c
The modified code for main.c is shown below. The main differences between this and the version from my previous post are:
#defines used to define the addresses of the I/O registers. This means that the handling of pointer indirection happens at compile time – not run time. Also, it means that RAM is not used to represent the pointers at run-time.
The BSR and BRR (bit-set and bit reset registers) are used for setting and clearing bit 13. This produces smaller and faster code as there is no software bitwise AND/OR involved at run-time.
Two build changes:
optimization is turned up to level 2 by passing the -O2 argument to arm-none-eabi-gcc
debugging information is removed (building for release)

/* User LED for the Blue pill is on PC13 */
#include  // This header includes data type definitions such as uint32_t etc.
#define rcc_apb2enr (  (volatile uint32_t *) 0x40021018  )
#define gpioc_crh  (  (volatile uint32_t *) 0x40011004  )
#define gpioc_odr (  (volatile uint32_t *) 0x4001100c  )
#define gpioc_bsr (  (volatile uint32_t *) 0x40011010  )
#define gpioc_brr (  (volatile uint32_t *) 0x40011014  )


// The main function body follows 
int main() {
    // Do I/O configuration
    // Turn on GPIO C
    *rcc_apb2enr |= (1 << 4); // set bit 4
    // Configure PC13 as an output
    *gpioc_crh |= (1 << 20); // set bit 20
    *gpioc_crh &= ~((1 << 23) | (1 << 22) | (1 << 21)); // clear bits 21,22,23
    while(1) // do this forever
    {
        *gpioc_bsr = (1 << 13); // set bit 13       
         *gpioc_brr = (1 << 13); // clear bit 13     
    }
}
// The reset interrupt handler
void reset_handler() {
    main(); // call on main 
    while(1); // if main exits then loop here until next reset. 
}

// Build the interrupt vector table
const void * Vectors[] __attribute__((section(".vector_table"))) ={
	reset_handler
};

This code is built and gdb reports it's load size as 76 bytes when loaded onto the STM32F103. Executing the code we see the port pin behave as follows:

rust_blink_speed
I’m not clear on why there is a variation of cycle length every other cycle – maybe its as a result of the sampling on my cheap logic analyzer but the important point here is that the waveform is exactly the same for both C and Rust versions of this program so there is no performance hit in tight loops such as this.

The Rust version
The Rust version of the program is shown below. This is different to the version in my previous post as it uses the stm32f1 crate. The example is taken from a Jonathan Klimt’s blog over here. There’s quite a lot going on under the surface in this crate but crucially this all seems to happen at compile time because the output code is just as efficient as the C code from a speed perspective although it is a good deal bigger : 542 bytes vs C’s 76. Why the difference? There are several reasons but one big chunk is accounted for by the fact that the Rust version has a fully populated interrupt vector table which takes up 304 bytes. The C-version only includes the initial stack pointer and the Reset vector : a total of 8 bytes.

// std and main are not available for bare metal software
#![no_std]
#![no_main]

extern crate stm32f1;
extern crate panic_halt;
extern crate cortex_m_rt;

use cortex_m_rt::entry;
use stm32f1::stm32f103;

// use `main` as the entry point of this application
#[entry]
fn main() -> ! {
    // get handles to the hardware
    let peripherals = stm32f103::Peripherals::take().unwrap();
    let gpioc = &peripherals.GPIOC;
    let rcc = &peripherals.RCC;   

    // enable the GPIO clock for IO port C
    rcc.apb2enr.write(|w| w.iopcen().set_bit());
    gpioc.crh.write(|w| unsafe{
        w.mode13().bits(0b11);
        w.cnf13().bits(0b00)
    });

    loop{
        gpioc.bsrr.write(|w| w.bs13().set_bit());        
        gpioc.brr.write(|w| w.br13().set_bit());                
    }
}

The blinky loops in assembler
The version produced by the C code is shown below. r1 and r2 have been previously set up to point at the BSR and BRR registers; while r3 contains the mask value (1 << 13).

   0x08000034 :    str     r3, [r1, #0]
   0x08000036 :    str     r3, [r2, #0]
   0x08000038 :    b.n     0x8000034 

The version produced by the Rust code is below. In this case, r0 is set to point at the data structure that manages Port C and the offsets used in the store instructions are used to select the BSR and BRR registers. Register r1 has been preset to the mask value (1 << 13).

   0x0800017a :    str     r1, [r0, #12]
   0x0800017c :    str     r1, [r0, #16]
   0x0800017e :    b.n     0x800017a 

Final thoughts
Rust is considered to be a safer language to use than C. This benefit must come at some cost however. I’ve seen here that the cost is not in execution speed so where is it? Well, the C version is completely built in 168ms while the Rust version takes over 2 minutes – note this is a “once-off” cost in a way; incremental builds are nearly as quick as the C version. A full rebuild only happens if you build after a “cargo clean” command or if you modify the Cargo.toml file (which changes the build options).
Does this encourage me to pursue Rust a little more – definitely yes and with a more complex project that might reveal the greater safety of Rust.