Pure vs. inline assembly

12 December 2018
Ever since the firmware for my 2nd-generation LED matrix display used some inline assembly to do the pulsing of signals for the 74HCT573 latches, I have been wondering whether I would be better off writing firmware entirely in assembly rather than using ever-increasing amounts of inline assembly, the latter being a trend that became apparent with the LED matrix tile firmware. When I was writing the firmware for the 17-segment LED mainboard I decided that I would try reimplementing the whole firmware in assembly, to see how it compares to using C with bits of inline assembly, and this article is a write-up of my experiences.

A major down-side to using assembly is the relative difficulty in understanding the code after some time away; this is not specific to C vs. assembly, but the low-level nature of assembly makes it more acute. In assembly basic things are multi-stage processes, which vastly increases the scope for small mistakes, and certainly in my case the mind is long-calibrated to what C considers to be the basic things. Having said that, anyone programming in C these days is using a language a lot lower-level than what 90% of software developers use.

Programming issues

Most of these issues I have touched on in previous articles to varying extents, but for completeness I detail them here. The main point is that assembly exposes everything related to the hardware architecture, whereas with C code such fine details are not conveniently expressed, if at all. It has to be said that a lot of such details are ones that one may well not even realise exist until being explicitly forced into having to deal with them; for instance until I actually looked into banking, I thought it was possible for all data memory to be part of one contiguous array.

Banking and Paging

Banking and paging are pretty much the same thing: the splitting up of memory into regions. In Microchip PIC terminology paging is the splitting up of code memory, and banking is the splitting up of data memory. Thankfully none of the PIC microcontrollers I have used split up code memory, so I have never needed to care about page selection and splitting up code into different code sections. I am not surprised, as a flat program memory model is quite easy to integrate into an instruction set, and the last time I came across architectures that did segment program memory, they were all in products that I suspected were about to sink without trace. When writing C code, banking (the splitting up of data memory) is taken care of for you. However when writing inline assembly it is likely to sting you because you have to be pessimistic about where the compiler has actually put variables in memory, hence the use of such tricks as using part of an array as scratch-space. Unlike the dialect of C that is implemented in SDCC, and I suspect that Microchip's MPLAB-X is little different, writing in assembly gives you full control over exactly where various variables end up. Being able to lock down where things are avoids all the instructions that a necessarily pessimistic C compiler needs to insert.

The PIC16F1823 microcontroller has the PORTx & PIRx special-purpose registers as well as 96 of the 128 bytes of data memory accessible from Bank 0, so the vast majority of the firmware can operate on this bank without having to care about switching. The I2C-related registers are all on Bank 4 and the 16 bytes of all-bank shared memory is in this case more than enough for a receive buffer, so the only awkward toggling between banks is the clearing of I2C-related interrupt flags within the I2C handling code. Avoiding bank switching means that it is easy to take advantage of PIC's jump-over-one branch instructions, rather than using a hop-scotch of GOTO instructions. The PIC16F630 is even more forgiving by having only two banks, and all the general-purpose registers are shared in the same equivalent locations.
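The bank arithmetic can be modelled on a desktop machine. The following is only a sketch in host C, assuming the enhanced mid-range layout of 128 addresses per bank with the top 16 of each bank mirroring the shared area; the helper names are made up for illustration.

```c
#include <stdint.h>

/* Assumption: each bank spans 128 addresses, and offsets 0x70-0x7F of
   every bank map onto the same 16 bytes of shared RAM. */
#define BANK_SIZE    0x80u
#define COMMON_START 0x70u

/* Bank number that must be selected to reach an absolute data address. */
static uint8_t bank_of(uint16_t addr)
{
    return (uint8_t)(addr / BANK_SIZE);
}

/* 7-bit offset within the bank, which is what the instruction encodes. */
static uint8_t offset_of(uint16_t addr)
{
    return (uint8_t)(addr % BANK_SIZE);
}

/* Non-zero when both addresses can be used without a bank switch between. */
static int same_bank(uint16_t a, uint16_t b)
{
    if (offset_of(a) >= COMMON_START || offset_of(b) >= COMMON_START)
        return 1;   /* shared RAM is visible from every bank */
    return bank_of(a) == bank_of(b);
}
```

Taking 0x011 and 0x211 as the addresses of an interrupt flag register and an I2C register (values that should be checked against the datasheet), `same_bank(0x011, 0x211)` is zero because they land in Bank 0 and Bank 4, whereas anything placed at 0x70 or above never forces a switch.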

Bit addressing & operations

First noticed with the latch toggling on the 2nd-gen LED matrix, the big nice thing about the PIC instruction set is the ability to address individual bits. SDCC is smart enough to compile some bit-wise operations to these instructions, but it is not as good as using them directly. Even though the assembly rewrite of the data-flip involved a lot of cut-n-pasted code, it was dealing with individual bits directly, and hence avoided a load of bit-shifting and bit-masking code. As noted in the past, PIC assembly is somewhat biased towards stateless code, although more recently I have found ways to harness the instructions in ways amenable to loops.
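For comparison, this is roughly the shifting and masking that C needs in order to express what the bit instructions do in one step. It is a sketch for a desktop compiler rather than firmware code, and the function names are hypothetical.

```c
#include <stdint.h>

/* Rough C counterparts of the PIC bit instructions, for comparison only. */

/* BSF: set one bit, selected by index. */
static uint8_t bit_set(uint8_t reg, uint8_t bit)
{
    return (uint8_t)(reg | (1u << bit));
}

/* BCF: clear one bit, selected by index. */
static uint8_t bit_clear(uint8_t reg, uint8_t bit)
{
    return (uint8_t)(reg & ~(1u << bit));
}

/* BTFSC/BTFSS test a bit and skip; here the bit is simply returned. */
static int bit_test(uint8_t reg, uint8_t bit)
{
    return (reg >> bit) & 1u;
}

/* Copy one bit of src into one bit of dst. In assembly this would be
   a BCF, then a BTFSC that skips over a BSF when the source bit is 0. */
static uint8_t bit_copy(uint8_t dst, uint8_t dst_bit,
                        uint8_t src, uint8_t src_bit)
{
    dst = bit_clear(dst, dst_bit);
    if (bit_test(src, src_bit))
        dst = bit_set(dst, dst_bit);
    return dst;
}
```

When the bit indices are compile-time constants a compiler can often reduce these to the single instructions, but once they are variables the shifts and masks have to be computed at run time.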

Chip-specific instructions

I am unsure of the extent to which SDCC takes advantage of chip-specific instructions, but my best guess is that it assumes a somewhat pessimistic sub-set that is common among all the PIC chip families. The umbrella of variants that the PIC16F1823 falls under has some useful instructions that are not found in the other chipsets I have tried, most notably the relative jump instructions BRA (constant jump) and BRW (jump based on the W accumulator); the latter allows lookup within constant arrays to be implemented as a cheap function call. There are two indirect-access registers available for reading and writing of data arrays, but code compiled from C seems to only ever use one of them.

Temporary storage

In the past when I have passed parameters in function calls and function call returns, it has almost always only been a single variable; in PIC assembly the W register can be used to do this very cheaply, but in C the W register is hidden. PIC does not have local variables, and as a result passing variables in C is disproportionately expensive, to the point that breaking down code into functions is almost a bad idea. Functions are not reentrant, which in this context means no recursive calls. The firmware for the I2C master has 36 compiler-assigned registers, which is in addition to using the shared memory area as manual stacks; on a chip with total data memory of 128 bytes, this is a large chunk.

Assembly snippets

Much can be found out about assembly by writing some C code and looking at the resulting assembly code, but there are limits to this approach as the compiler has to be general in its assumptions — it is a good starting point but ultimately one needs to digest the instruction set specifications. Below are various snippets of assembly code I feel are useful in getting up-and-running, along with some commentary.

Assembly boilerplate

Assembly files need to start with a bit of meta-data, specifying things such as what the target chipset is and any flash-time configuration registers. The snippet below is for a PIC16F1823, but unless you are completely new to PIC programming it should be easy to guess what needs to change for other chipsets:

    RADIX dec
    PROCESSOR 16f1823
    #include <p16f1823.inc>
    __config _CONFIG1, _FOSC_INTOSC & _WDTE_OFF & _MCLRE_OFF & _PWRTE_OFF & _CP_OFF & _CPD_OFF;
    __config _CONFIG2, _LVP_OFF

The RADIX dec line specifies decimal as the default number format. There are tricks that allow the __config statements to span multiple lines, but for now it is easiest to just live with the long line. On reset code execution starts from program address zero, so I put a jump to main, which is my real program entrypoint:

STARTUP CODE 0x0000
    goto main

And of course the interrupt handler, using a slightly different style of label. There is a lot of leeway in how labels are presented, such as being on the same or different lines, and the presence or absence of a colon. Pick your own convention:

Interrupt: CODE 0x0004
    goto interrupt

Labels are optional as the assembler will add in a default one, but I think it is good form to always include an explicit one; it avoids trouble if in the future you add a second instance of a directive, as the default label will then clash.

Memory allocation

Most of the time data memory is allocated using the UDATA (uninitialised data) directive as shown below. In this case the optional address 0x20 puts it at the very start of the data memory area on Bank 0. The numbers after RES are the number of bytes to reserve:

Bank0Data UDATA 0x20
segments    RES 16
idxSegment  RES 1
valChipLo   RES 4
valChipHi   RES 4

There is also the IDATA (initialised data) directive that is much like UDATA, except that initial values for each variable can also be specified. The problem is that although gputils will generate RETLW instructions that contain the data values, you have to write the program code that moves them into data memory. I think it is better to instead just use UDATA and initialise the variables manually, as then you have better control over where the initial data ends up in program memory. For variables going into the common shared memory area, which on the PIC16F1823 starts at 0x70, use the UDATA_SHR directive instead:

CommonData UDATA_SHR
cntBytes    RES 1
recvBuffer  RES 8

There are other data-related directives but I think the above two are the only ones worth using. For instance UDATA_OVR allows two or more variables to use the same address, which I do not consider useful.

Cheap loops

Assuming the number of iterations is no more than 255, so that it fits in a single 8-bit counter, and the iteration number itself is not required, for-style loops are dirt-cheap in PIC assembly. A count-down only needs four instructions, two of which are setup overheads:

    movlw 16        ; number of things
    movwf count
loop:
    ; some stuff
    ; more stuff
    decfsz count,1
    goto loop
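The closest C shape to this count-down is a do-while on an 8-bit counter, which is the form that maps most directly onto DECFSZ. The sketch below is host C with a hypothetical function name; it only counts how often the body runs.

```c
#include <stdint.h>

/* Runs a stand-in loop body once per count and reports how often it ran.
   count_iterations is a made-up name for illustration. */
static unsigned count_iterations(uint8_t count)
{
    unsigned ran = 0;
    do {
        ran++;                  /* stands in for the loop body */
    } while (--count != 0);     /* like DECFSZ: decrement, exit on zero */
    return ran;
}
```

Because the decrement happens before the test, a starting value of zero wraps round to 255 and the body runs 256 times rather than not at all, which is why the counter should be loaded with a value between 1 and 255.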

Constant array lookup

In PIC code constant data is stored in program memory, which traditionally was accessed using instructions that can read program code as if it was data, with the data being stored using RETLW instructions. This data would normally be accessed using.. Well I never really looked into it, as SDCC-compiled code uses some library function. However the PIC16F1823 simplifies things:

somedata:
    brw
    retlw 0x01
    retlw 0x02

The important thing here is the BRW instruction, which causes code execution to jump ahead by the offset held in W and so land on the RETLW for that entry. As a result all that is needed to retrieve a constant is to put the offset into W and do a function call:

    movlw 1
    call somedata

The desired array value will then be in the W register. I do note that the PIC16F88x chips, which is the other of the two families of PIC chips that I now use, do not include the required instructions.
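In C terms the BRW table behaves like an indexed read from a constant array, with the index going in through W and the value coming back through W. A minimal host-C sketch, using the same two values as the table above and a hypothetical wrapper name:

```c
#include <stdint.h>

/* Same contents as the RETLW table. */
static const uint8_t somedata[] = { 0x01, 0x02 };

/* index plays the part of W on entry, the return value that of W on exit.
   Like BRW there is no bounds check, so the caller must stay in range. */
static uint8_t somedata_lookup(uint8_t index)
{
    return somedata[index];
}
```

So `somedata_lookup(1)` corresponds to the `movlw 1` and `call somedata` pair and yields 0x02.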

Array access

For array accesses, an absolute memory location is put into two registers that store the memory address, which is dereferenced when an instruction uses a special indirection register. Firstly the memory address needs to be specified:

    movlw high array
    movwf FSR0H
    movlw array
    movwf FSR0L

Most of the PIC chips I have used have all the RAM in the first bank or two, and hence all have an absolute memory address less than 0xff, so FSRxH can be left set to zero. For demonstration purposes, it is assumed some value is loaded into the W accumulator, and this will be written to all elements within the array:

    movlw 0x02

The indirection registers can then be manipulated as follows while accessing the memory being pointed to. Here W is written to the “current” array element, and then the index is incremented. This is done four times, followed by a final write, so five elements are set in total:

    movwf INDF0
    incf FSR0L,1
    movwf INDF0
    incf FSR0L,1
    movwf INDF0
    incf FSR0L,1
    movwf INDF0
    incf FSR0L,1
    movwf INDF0

In reality a loop would be used, but the above hard-coded code strips back all the unnecessary detail.
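For readers more at home in C, the indirection registers are effectively a pointer: loading FSR0 is taking the address of the array, and INDF0 is the dereference. The loop form might look something like the following host-C sketch, where fill is a hypothetical name rather than anything in the firmware.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes value to the first count bytes of array through a moving pointer. */
static void fill(uint8_t *array, size_t count, uint8_t value)
{
    uint8_t *fsr = array;       /* FSR0H:FSR0L loaded with the address */
    while (count-- > 0) {
        *fsr = value;           /* movwf INDF0 */
        fsr++;                  /* incf FSR0L,1 */
    }
}
```

Calling `fill(array, 5, 0x02)` sets the same five elements as the hard-coded sequence, at the cost of one extra pointer increment after the final write.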

Performance

For demonstration purposes I will compare C and assembly versions of the segsFlip function, as it is by far the most complex part of the firmware. The assembly version weighs in at 67 instructions and takes 260 cycles to run, whereas the C-based version compiles to about 320 instructions and takes 1528 cycles to run. Many of the reasons for the code inefficiency have been covered in past articles, and as often as not they come down to necessary pessimism by the compiler. The main optimisation is using the W register for a bit-mask within the segFlipCopyBits sub-function, so that the bit that is set within the target memory locations is a parameter. This means that unrolling is only needed for the source bits read by the conditional BTFSC instructions. Per-chip overheads are then setting the two indirection registers and W as appropriate, with processing of valChipLo and valChipHi being interlaced to avoid some variable resets. Locality is exploited in order to avoid needing bank switching and setting of the upper indirection registers, and no temporary variables are used.

I have no doubt that the function could be unrolled completely, which I estimate would entail 128 instructions and take 192 cycles to run, but it would come at the cost of maintainability; the way it is currently implemented using a sub-function makes it easy to extend the code to handle up to eight LED chips. Ironically the flipping algorithm was actually easier in assembly than C, due to the ability to read and write bits directly. I found it easier to redo the algorithm from first principles than translate the C code, due to a lot of trickery being used in the latter. At time of writing the I2C code has not been converted, but I suspect I will do this as well in the very near future.

Return to gpsim

I have a very mixed view of gpsim: it undoubtedly was a huge help back in 2007, and most PIC chips I bought were partly chosen for having gpsim support, but it also has a significant number of issues that limit its usefulness. Aside from projects using DEM16217 LCD displays, firmware developed using gpsim had to make varying levels of abstractions and allowances for differences between hardware and emulation, and I generally concluded the benefit was marginal at best; the test setup used when creating a controller for my first LED display required serious stretching of the imagination as a development aid. gpsim is nevertheless very useful for profiling bits of program code, and in this case the memory view was particularly useful when debugging, since most of the functionality (in particular the segment data flipping that was used for performance analysis above) consisted of procedures that could be checked just by looking at the resultant memory content. With C code the memory view is basically useless, but with assembly code, where the location of variables is nailed down, it showed what was happening in the firmware, and this avoided a lot of chip flashing cycles.

Overview

For the vast majority of applications there is no practical benefit to using assembly, and in the cases where there is benefit, a few snippets of inline assembly will get 80% of the gain. This has to be weighed against both longer development time and more difficulty understanding code when returning to it at a later date. The two scenarios where there is a clear gain in writing an entire firmware in assembly are when there are timing requirements that cannot be met through the expedient of simply whacking up the clock rate, and perhaps more critically when there are tight memory requirements that mean paying close attention to where data variables end up. For most circuits I have made in the past, I very much doubt that any power saving from lower clock speeds is significant.

Having said that, once over the initial learning hurdles, a large portion of typical PIC code is actually quicker and easier to write in PIC assembly than it is in C. This is because the things done with PIC Microcontrollers are mostly procedural with relatively little state but a lot of bit-banging — the common tasks of checking and setting individual bits are single instructions, which take a bit index rather than having to work out a hexadecimal code. Something that is a pain however is nested loops and conditionals, which makes me reluctant to go over to using assembly completely for microcontroller projects — I dread to imagine how much more painful working out I2C would have been if it was in assembler rather than C — even though it is now very tempting to do so.

Overall the main gain from the whole process of having written a pure-assembly firmware is a much better understanding of what is going on underneath, and this is of benefit even when writing firmware entirely in C: why the compiler or assembler fails is less of a mystery, and it gives insight into circumstances where things don't work in the absence of an “obvious” fault. My one regret is that this resulting article is a little dry, with a complete lack of images.