This also implements more of the FastIO infrastructure.
Unfortunately, definitions aren't exactly straightforward, so we
need lots of tabular data. For example, for the pin function I
I had to step through the user manual, pin by pin.
We also learned a lesson here: Cortex-M0 has a 4 word ( = 16 bytes)
prefetch engine. Loops not starting at such a boundary take
additional 4 clock cycles, making them slower. The tight loop used
for testing previously happened to be 16-byte aligne by accident.
Adding just one line of code in the SET_OUTPUT() macro misaligned
it, so loop repetition rate dropped from 5.3 MHz to 3.7 MHz.
There are many measures to align code to 16-byte boundaries:
- -falign-functions=16 as gcc flag.
- -falign-loops=16 as gcc flag, found to not work.
- -falign-labels=16 as gcc flag, worked for aligning the loop,
but also bloated the binary by 10%.
- __attribute__ ((aligned(16))) attached to functions (not
tested)
- Adding this just before the loop worked fine and increased the
binary by just 16 bytes:
__ASM (".balign 16");
Take care of this when relying on exact execution times, e.g. when
implementing delay_us()!
Only SET_OUTPUT() and WRITE() for now, reading follows later.
A loop like this:
SET_OUTPUT(PIO0_1);
for (;;) {
WRITE(PIO0_1, 0);
WRITE(PIO0_1, 1);
}
toggles a pin at about 5.3 MHz. The low period is 63 ns on the
scope, so 3 clock cycles. With this loop, the binary is 1648
bytes.
Assembly shows four instructions inside the loop, which is about
as good as it can get:
movs r2, #0
str r2, [r3, #8]
adds r2, #2
str r2, [r3, #8]
For comparison, using the MBED provided gpio routines give a
toggle frequency of about 300 kHz, with a low period of 72 clock
cycles. Microoptimisation isn't just the last few percent ...
Tested with this code before main():
static void delay(uint32_t delay) {
while (delay) {
__ASM volatile ("nop");
delay--;
}
}
... and in main():
SET_OUTPUT(PIO0_1);
SET_OUTPUT(PIO0_2);
SET_OUTPUT(PIO0_3);
SET_OUTPUT(PIO0_4);
__ASM (".balign 16");
while (1) {
// 1 pulse on pin 1, two pulses on pin 2, ...
WRITE(PIO0_1, 0);
WRITE(PIO0_1, 1);
WRITE(PIO0_2, 0);
WRITE(PIO0_2, 1);
WRITE(PIO0_2, 0);
WRITE(PIO0_2, 1);
WRITE(PIO0_3, 0);
WRITE(PIO0_3, 1);
WRITE(PIO0_3, 0);
WRITE(PIO0_3, 1);
WRITE(PIO0_3, 0);
WRITE(PIO0_3, 1);
// PIO0_4 needs a pullup 10k to 3.3V
// to show a visible signal.
WRITE(PIO0_4, 0);
delay(10);
WRITE(PIO0_4, 1);
delay(10);
WRITE(PIO0_4, 0);
delay(10);
WRITE(PIO0_4, 1);
delay(10);
WRITE(PIO0_4, 0);
delay(10);
WRITE(PIO0_4, 1);
delay(10);
WRITE(PIO0_4, 0);
delay(10);
WRITE(PIO0_4, 1);
delay(1000);
}
With a 10k pullup, PIO0_4 has a rise time of about 1 microsecond.