Speeding up some MicroPython with a touch of inline assembly on the Raspberry Pi Pico

I have been working on a graphics library for the ST7735 and the Raspberry Pi Pico. The first version was written in pure MicroPython and worked well enough, but it was quite slow, especially when writing out blocks of colour (fillRectangle). This was the original code for fillRectangle:

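def fillRectangle(self,x1,y1,w,h,colour):
        # open a window covering the rectangle
        self.openAperture(x1,y1,x1+w-1,y1+h-1)
        pixelcount=h*w
        self.command(0x2c) # RAMWR: start writing pixel data
        self.a0.value(1) # A0 high: the bytes that follow are data
        msg=bytearray()
        # build the whole rectangle in RAM, two bytes per pixel
        while(pixelcount >0):
            pixelcount = pixelcount-1
            msg.append(colour >> 8) # high byte first
            msg.append(colour & 0xff) # then low byte
        self.spi.write(msg)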

Not only was this slow, it was also memory intensive: it had to build a buffer in RAM that held the entire filled rectangle, at two bytes per pixel, before anything was sent to the display.
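To put a rough number on that memory cost, here is a small sketch. The helper name fill_buffer_bytes is made up for illustration, and the 128 x 160 panel size is an assumption (a common ST7735 resolution, not something stated above).

```python
def fill_buffer_bytes(w, h):
    """Bytes needed to hold a w x h rectangle at 16 bits (2 bytes) per pixel."""
    return 2 * w * h


# Assumed full-screen fill on a 128 x 160 panel
print(fill_buffer_bytes(128, 160))  # 40960
```

That is 40 KB for a single full-screen fill, which is a sizeable share of the RP2040's 264 KB of SRAM, and the bytearray also has to grow repeatedly as bytes are appended.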

The new version looks like this:

def fillRectangle(self,x1,y1,w,h,colour):
        self.openAperture(x1,y1,x1+w-1,y1+h-1)        
        pixelcount=h*w
        self.command(0x2c)
        self.a0.value(1)
        self.fill_block(colour,pixelcount) 

It makes use of an inline assembler function, the source code of which is as follows:

    @micropython.asm_thumb
    def fill_block(r0,r1,r2):
        # pointer to self passed in r0
        # r1 contains the 16 bit data to be written
        # r2 contains count
        # Going to use SPI0.
        # Base address = 0x4003c000
        # SSPCR0 Register OFFSET 0
        # SSPCR1 Register OFFSET 4
        # SSPDR Register OFFSET 8
        # SSPSR Register OFFSET c
        push({r1,r2,r3,r4,r7})
        # Convoluted load of a 32 bit value (0x4003c000) into r7
        mov(r7,0x40)
        lsl(r7,r7,8)
        add(r7,0x03)
        lsl(r7,r7,8)
        add(r7,0xc0)
        lsl(r7,r7,8)
        add(r7,0x00)
        mov(r4,2) # mask for bit 1 of SSPSR (transmit FIFO not full)
        label(fill_block_loop_start)
        cmp(r2,0)
        beq(fill_block_exit)        
        mov(r3,r1) # copy the colour
        lsr(r3,r3,8) # shift the high byte down
        strb(r3,[r7,8]) # write to SPI
        label(fill_block_spi_wait1)
        ldr(r3,[r7,0xc]) # read SPI status
        and_(r3,r4) # isolate the FIFO-not-full bit
        beq(fill_block_spi_wait1) # FIFO full: keep waiting
        
        mov(r3,r1) # copy the colour again (strb stores the low byte)
        strb(r3,[r7,8]) # write to SPI
        sub(r2,r2,1) # decrement count
        label(fill_block_spi_wait2)
        ldr(r3,[r7,0xc]) # read SPI status
        and_(r3,r4) # isolate the FIFO-not-full bit
        beq(fill_block_spi_wait2) # FIFO full: keep waiting
        b(fill_block_loop_start)
        
        label(fill_block_exit)
        pop({r1,r2,r3,r4,r7})

This writes the colour value directly to the SPI port the required number of times, high byte first and then low byte. After each byte it polls the SPI status register and waits while the transmit FIFO is full (hence the need for the labels fill_block_spi_wait1/2).
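For anyone who finds the assembler hard to follow, the plain Python sketch below mirrors its logic so it can be checked off-board. It is only a model: the function names are made up for illustration, the name SSPSR_TNF is my own label for the mask loaded by mov(r4,2), and no real SPI hardware is touched.

```python
SSPSR_TNF = 0x02  # same mask as mov(r4,2): bit 1, "transmit FIFO not full"


def build_spi0_base():
    """Mirror the mov/lsl/add sequence that builds the base address in r7."""
    r7 = 0x40
    for byte in (0x03, 0xC0, 0x00):
        r7 = (r7 << 8) + byte  # lsl(r7,r7,8) followed by add(r7,byte)
    return r7


def fifo_has_room(sspsr):
    """True when the status value says another byte can be written."""
    return (sspsr & SSPSR_TNF) != 0  # the wait loops spin while this is False


def fill_block_model(colour, count):
    """Return the byte stream fill_block sends: high byte, then low byte."""
    sent = bytearray()
    while count != 0:                      # cmp(r2,0) / beq(fill_block_exit)
        sent.append((colour >> 8) & 0xFF)  # lsr + strb: high byte
        sent.append(colour & 0xFF)         # strb: low byte
        count -= 1                         # sub(r2,r2,1)
    return bytes(sent)


print(hex(build_spi0_base()))       # 0x4003c000
print(fill_block_model(0xF800, 2))  # b'\xf8\x00\xf8\x00'
```

The model produces the same bytes as the original bytearray version, which is the point: the speed-up comes from skipping the buffer and the interpreter loop, not from changing what is sent.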

The performance improvement is about a factor of 20!
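If you want to check a figure like this on your own board, a rough timing harness could look like the sketch below. The helper name time_call and its parameters are made up for illustration. On MicroPython it uses time.ticks_us and time.ticks_diff, and it falls back to perf_counter_ns so the sketch can also be tried on a desktop Python.

```python
try:
    # MicroPython: wrap-safe microsecond tick counter
    from time import ticks_us, ticks_diff
except ImportError:
    # Desktop Python fallback so the sketch can be tried off-board
    from time import perf_counter_ns

    def ticks_us():
        return perf_counter_ns() // 1000

    def ticks_diff(end, start):
        return end - start


def time_call(fn, args=(), repeats=10):
    """Average microseconds per call of fn(*args) over `repeats` runs."""
    start = ticks_us()
    for _ in range(repeats):
        fn(*args)
    return ticks_diff(ticks_us(), start) / repeats
```

Calling something like time_call(display.fillRectangle, (0, 0, 128, 160, 0xF800)) with each version of the method, where display is a hypothetical instance of the driver class, gives two averages to compare.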

Code is available over on GitHub and is likely to change lots in the next couple of weeks while I prepare for a STEM event.