That is: reformat the "Changes for Teacup" section, replace tabs
with spaces, remove trailing whitespace and otherwise keep the file
as close to the original as possible.
This is pretty complex and, as system clock and baud rate are
known at compile time and never change at runtime, unnecessary.
Replacing this calculation with fixed values makes the binary
a whopping 564 bytes smaller.
However, how do we get these values? Well, we do kind of an
easter egg: if the parameters aren't known, we calculate them at
runtime anyway and also report them to the user. She can then
insert them into the code and, after doing so, whoops, serial is
fast and the binary small :-)
With known parameters:
  SIZES ARM... lpc1114
  FLASH  :  1092 bytes   4%
  RAM    :   132 bytes   4%
  EEPROM :     0 bytes   0%
Without (1428 bytes more):
  SIZES ARM... lpc1114
  FLASH  :  2520 bytes   8%
  RAM    :   132 bytes   4%
  EEPROM :     0 bytes   0%
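The runtime fallback boils down to a search over the UART divider settings. Below is a minimal sketch of such a search, assuming the LPC111x user manual's formula baud = PCLK / (16 * DL * (1 + DIVADDVAL / MULVAL)) with 1 <= MULVAL <= 15 and DIVADDVAL < MULVAL. The struct and function names are hypothetical, and this is not necessarily how Teacup or MBED compute the values.

```c
#include <stdint.h>
#include <stdio.h>

/* Divider settings for the LPC111x UART (hypothetical field names). */
typedef struct {
  uint16_t dl;        /* 16-bit divisor latch, 256 * DLM + DLL */
  uint8_t divaddval;  /* fractional divider numerator, 0..14 */
  uint8_t mulval;     /* fractional divider denominator, 1..15 */
} uart_div_t;

/* Baud rate resulting from a setting, per the formula assumed above. */
static double uart_baud(uint32_t pclk, uart_div_t d) {
  return (double)pclk * d.mulval / (16.0 * d.dl * (d.mulval + d.divaddval));
}

/* Brute-force search over all legal fractions, keeping the smallest error. */
static uart_div_t uart_find_div(uint32_t pclk, uint32_t baud) {
  uart_div_t best = {0, 0, 1};
  double best_err = 1e30;
  for (uint8_t mul = 1; mul <= 15; mul++) {
    for (uint8_t add = 0; add < mul; add++) {
      double exact = (double)pclk * mul / (16.0 * baud * (mul + add));
      uint32_t dl = (uint32_t)(exact + 0.5);  /* round to nearest */
      if (dl == 0 || dl > 0xFFFF)
        continue;
      /* With the fractional divider active and DLM = 0, DLL must be >= 3. */
      if (add > 0 && dl < 3)
        continue;
      uart_div_t cand = {(uint16_t)dl, add, mul};
      double err = uart_baud(pclk, cand) - baud;
      if (err < 0)
        err = -err;
      if (err < best_err) {
        best_err = err;
        best = cand;
      }
    }
  }
  return best;
}

/* The "easter egg": report the values so they can be hardcoded. */
static void uart_report_div(uint32_t pclk, uint32_t baud) {
  uart_div_t d = uart_find_div(pclk, baud);
  printf("DL = %u, DIVADDVAL = %u, MULVAL = %u (actual baud %.1f)\n",
         (unsigned)d.dl, (unsigned)d.divaddval, (unsigned)d.mulval,
         uart_baud(pclk, d));
}
```

Calling `uart_report_div(48000000UL, 115200UL)` prints a setting at least as close to 115200 baud as the plain integer divider DL = 26 (about 115385 baud). Once such values are hardcoded, the search and the floating point code it pulls in drop out of the binary.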
On ARM we use only the 16-byte hardware buffer for sending and
receiving over the serial line, which is often too short for
debugging messages. This implementation works fine and, for short
messages, still neither blocks nor introduces delays.
Costs 72 bytes binary size, mostly because it's the first usage
of delay_us():
  SIZES ARM... lpc1114
  FLASH  :  1656 bytes   6%
  RAM    :   136 bytes   4%
  EEPROM :     0 bytes   0%
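The idea of "wait only when the hardware buffer is full" can be illustrated with a host-side simulation. This is a sketch, not the actual Teacup code: it assumes the driver can tell when the 16-byte TX FIFO is full (the real register interface may differ), and the FIFO state and `delay_us()` here are hypothetical stand-ins for hardware.

```c
#include <stdint.h>

#define BAUD       115200UL
#define FIFO_SIZE  16u

/* Time to send one character: start bit + 8 data bits + stop bit, rounded up. */
#define CHAR_TIME_US ((10UL * 1000000UL + BAUD - 1) / BAUD)

/* Simulated hardware state (hypothetical, host-side only). */
static unsigned fifo_fill = 0;       /* characters waiting in the TX FIFO */
static unsigned long waited_us = 0;  /* total time spent in delay_us() */

static void delay_us(uint32_t us) {
  waited_us += us;
  /* Simplification: the FIFO drains only while we wait, one character
     per CHAR_TIME_US. Real hardware drains continuously. */
  unsigned drained = (unsigned)(us / CHAR_TIME_US);
  fifo_fill = drained >= fifo_fill ? 0 : fifo_fill - drained;
}

/* Write one character, waiting only while the FIFO is full. */
static void serial_writechar(uint8_t c) {
  (void)c;
  while (fifo_fill >= FIFO_SIZE)
    delay_us(CHAR_TIME_US);
  fifo_fill++;  /* stands in for writing c to the transmit register */
}
```

Writing up to 16 characters costs no waiting at all; each character beyond that waits one character time, about 87 us at 115200 baud.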
Accuracy is pretty good, see committed comments :-)
Code used for testing, in main():
  uint32_t i;
  SET_OUTPUT(PIO0_1);
  while (1) {
    // 20 seconds or more for each frequency, so we
    // can measure all three with one upload.
    for (i = 10000; i > 0; i--) {
      WRITE(PIO0_1, 1);
      delay_us(1000);
      WRITE(PIO0_1, 0);
      delay_us(1000);
    }
    for (i = 1000; i > 0; i--) {
      WRITE(PIO0_1, 1);
      delay_us(10000);
      WRITE(PIO0_1, 0);
      delay_us(10000);
    }
    for (i = 200; i > 0; i--) {
      WRITE(PIO0_1, 1);
      delay_us(65000);
      WRITE(PIO0_1, 0);
      delay_us(65000);
    }
  }
(Hopefully) no functional change.
Also remove the wd_reset()s in delay_us() to match the behaviour
promised in delay.h. Not that this matters much: the watchdog is
disabled by default.
On ARM, enabling the pullup on an input pin isn't done by writing
a 1 to the pin, but by setting the corresponding register.
Accordingly, we need a distinct function for this.
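On AVR, writing 1 to the PORT bit of an input pin switches its pullup on. On the LPC111x, each pin has its own IOCON register instead, with a MODE field in bits 4:3 (0 = none, 1 = pull-down, 2 = pull-up, 3 = repeater, per the user manual). A minimal sketch of such a "distinct function"; the names `pullup_on()` and `pullup_off()` are hypothetical, and on the chip `reg` would point to the pin's IOCON register.

```c
#include <stdint.h>

/* IOCON MODE field, bits 4:3. */
#define IOCON_MODE_SHIFT   3
#define IOCON_MODE_MASK    (0x3u << IOCON_MODE_SHIFT)
#define IOCON_MODE_PULLUP  (0x2u << IOCON_MODE_SHIFT)

/* Select the pull-up, replacing any other MODE setting. */
static void pullup_on(volatile uint32_t *reg) {
  *reg = (*reg & ~IOCON_MODE_MASK) | IOCON_MODE_PULLUP;
}

/* Clear MODE, i.e. no pull resistor at all. */
static void pullup_off(volatile uint32_t *reg) {
  *reg &= ~IOCON_MODE_MASK;
}
```

Note that the read-modify-write leaves all other IOCON bits, such as the pin function, untouched.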
This also implements more of the FastIO infrastructure.
Unfortunately, the definitions aren't exactly straightforward, so
we need lots of tabular data. For the pin function, for example,
I had to step through the user manual, pin by pin.
We also learned a lesson here: Cortex-M0 has a 4-word (16-byte)
prefetch engine. Loops not starting at such a boundary take an
additional 4 clock cycles per iteration, making them slower. The
tight loop used for testing previously was 16-byte aligned by
accident. Adding just one line of code to the SET_OUTPUT() macro
misaligned it, so the loop repetition rate dropped from 5.3 MHz
to 3.7 MHz.
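The two repetition rates can be turned into clock cycles per iteration. This assumes a 48 MHz system clock, which is consistent with the 63 ns, 3-cycle low period reported for the toggle test (about 21 ns per cycle); the helper name is hypothetical.

```c
#include <stdint.h>

/* Assumed system clock, not stated in the original text. */
#define ASSUMED_F_CPU 48000000UL

/* Clock cycles per loop iteration for a measured repetition rate,
   rounded to the nearest whole cycle. */
static uint32_t cycles_per_loop(uint32_t loop_hz) {
  return (uint32_t)((ASSUMED_F_CPU + loop_hz / 2) / loop_hz);
}
```

`cycles_per_loop(5300000)` gives 9 and `cycles_per_loop(3700000)` gives 13, a difference of 4 cycles, which matches the misalignment penalty described above.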
There are several ways to align code to 16-byte boundaries:
- -falign-functions=16 as gcc flag.
- -falign-loops=16 as gcc flag, found not to work.
- -falign-labels=16 as gcc flag, worked for aligning the loop,
  but also bloated the binary by 10%.
- __attribute__ ((aligned(16))) attached to functions (not
  tested).
- Adding this just before the loop worked fine and increased the
  binary by just 16 bytes:
    __ASM (".balign 16");
Take care of this when relying on exact execution times, e.g. when
implementing delay_us()!
Only SET_OUTPUT() and WRITE() for now, reading follows later.
A loop like this:
  SET_OUTPUT(PIO0_1);
  for (;;) {
    WRITE(PIO0_1, 0);
    WRITE(PIO0_1, 1);
  }
toggles a pin at about 5.3 MHz. The low period is 63 ns on the
scope, so 3 clock cycles of about 21 ns each. With this loop, the
binary is 1648 bytes.
Assembly shows four instructions inside the loop, which is about
as good as it can get:
  movs r2, #0
  str  r2, [r3, #8]
  adds r2, #2
  str  r2, [r3, #8]
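The values 0 and 2 stored at offset 8 are consistent with the LPC111x GPIO masked access registers: the word index of a write acts as a bit mask, so only the selected pins change. For PIO0_1 the mask is 1 << 1 = 2, i.e. word 2 or byte offset 8 from the port base. A host-side simulation of that behaviour; the type and function names are hypothetical and this is not Teacup's WRITE() macro.

```c
#include <stdint.h>

/* Simulated GPIO port (host-side only). */
typedef struct {
  uint32_t data;  /* current pin levels */
} gpio_port_t;

/* What the hardware does for a write of value to MASKED_ACCESS[mask]:
   only bits set in mask are affected. */
static void masked_write(gpio_port_t *port, uint32_t mask, uint32_t value) {
  port->data = (port->data & ~mask) | (value & mask);
}

/* WRITE()-like helper: index and value are both 1 << pin, so no
   read-modify-write is needed in software. */
static void gpio_write(gpio_port_t *port, unsigned pin, int high) {
  uint32_t mask = 1u << pin;
  masked_write(port, mask, high ? mask : 0);
}

/* Byte offset of MASKED_ACCESS[1 << pin] from the port base. */
static uint32_t masked_offset(unsigned pin) {
  return (1u << pin) * 4u;
}
```

Since the mask lives in the address, a single store changes one pin, which is why the loop needs no load or bit manipulation.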
For comparison, using the MBED-provided GPIO routines gives a
toggle frequency of about 300 kHz, with a low period of 72 clock
cycles. Micro-optimisation isn't just about the last few percent ...
Tested with this code before main():
  static void delay(uint32_t delay) {
    while (delay) {
      __ASM volatile ("nop");
      delay--;
    }
  }
... and in main():
  SET_OUTPUT(PIO0_1);
  SET_OUTPUT(PIO0_2);
  SET_OUTPUT(PIO0_3);
  SET_OUTPUT(PIO0_4);
  __ASM (".balign 16");
  while (1) {
    // 1 pulse on pin 1, two pulses on pin 2, ...
    WRITE(PIO0_1, 0);
    WRITE(PIO0_1, 1);
    WRITE(PIO0_2, 0);
    WRITE(PIO0_2, 1);
    WRITE(PIO0_2, 0);
    WRITE(PIO0_2, 1);
    WRITE(PIO0_3, 0);
    WRITE(PIO0_3, 1);
    WRITE(PIO0_3, 0);
    WRITE(PIO0_3, 1);
    WRITE(PIO0_3, 0);
    WRITE(PIO0_3, 1);
    // PIO0_4 is an open-drain pin, so it needs a 10k pullup
    // to 3.3V to show a visible signal.
    WRITE(PIO0_4, 0);
    delay(10);
    WRITE(PIO0_4, 1);
    delay(10);
    WRITE(PIO0_4, 0);
    delay(10);
    WRITE(PIO0_4, 1);
    delay(10);
    WRITE(PIO0_4, 0);
    delay(10);
    WRITE(PIO0_4, 1);
    delay(10);
    WRITE(PIO0_4, 0);
    delay(10);
    WRITE(PIO0_4, 1);
    delay(1000);
  }
With a 10k pullup, PIO0_4 has a rise time of about 1 microsecond.
We previously put replacements for the von Neumann architecture
into arduino.h; now let's complete this by having only one
#include <avr/pgmspace.h>, in arduino.h. Almost all sources
include arduino.h anyway, so this is mostly a code reduction.
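A sketch of what such von Neumann replacements can look like, assuming the usual pgmspace names (`PROGMEM`, `PSTR()`, `pgm_read_byte()` and friends); the exact set Teacup keeps in arduino.h may differ. On a von Neumann machine like the Cortex-M0, flash and RAM share one address space, so "reading from program memory" is a plain load.

```c
#include <stdint.h>

#if defined(__AVR__)
  #include <avr/pgmspace.h>
#else
  /* Single address space: constants in flash are read like any memory. */
  #define PROGMEM
  #define PSTR(s)              (s)
  #define pgm_read_byte(addr)  (*(const uint8_t *)(addr))
  #define pgm_read_word(addr)  (*(const uint16_t *)(addr))
  #define pgm_read_dword(addr) (*(const uint32_t *)(addr))
#endif

/* Works unchanged on both architectures. */
static const uint8_t table[] PROGMEM = {10, 20, 30};
```

With this in one central header, the remaining sources only include arduino.h and never need to know about avr/pgmspace.h.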
We have only one UART, we use only one UART, so it's pointless to
do pin mapping calculations at runtime.
Binary size down by 268 bytes:
  SIZES ARM... lpc1114
  FLASH  :  1724 bytes   6%
  RAM    :   156 bytes   4%
  EEPROM :     0 bytes   0%
Pretty complex, this MBED system: it requires no less than
24 additional files. This will be fleshed out before too long.
  SIZES ARM... lpc1114
  FLASH  :  5956 bytes  19%
  RAM    :   176 bytes   5%
  EEPROM :     0 bytes   0%
This shows the new strategy to deal with architecture-specific
code:
- Keep common code as before.
- Keep the header file unchanged as well, no architecture-specific
  headers.
- Move architecture-specific code to an architecture-specific
  file and wrap the whole contents in an architecture test.
- Also wrap the whole contents with #ifdef TEACUP_C_INCLUDE.
  Without this wrapping, the Arduino IDE as well as Configtool
  would compile the stuff twice, because they compile everything
  unconditionally.
- Last but not least, #define TEACUP_C_INCLUDE and #include all
  architecture-specific files unconditionally.
Build tests were successful with the Makefile, with Configtool
and with Arduino 1.5.8, so this strategy is expected to work.
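The strategy can be condensed into a single-file sketch, with comments marking where file boundaries would be. The file names and `serial_arch()` are hypothetical, and the architecture tests assume the compiler-predefined `__AVR__` and `__ARMEL__`; a "host" variant is added only so the sketch runs on a PC.

```c
/* ---- serial.c: common code ---- */
#define TEACUP_C_INCLUDE
/* The real file would #include "serial-avr.c", "serial-arm.c", ... here;
   their contents are inlined below for this sketch. */

/* ---- serial-avr.c ---- */
#if defined TEACUP_C_INCLUDE && defined __AVR__
static const char *serial_arch(void) { return "avr"; }
#endif

/* ---- serial-arm.c ---- */
#if defined TEACUP_C_INCLUDE && defined __ARMEL__
static const char *serial_arch(void) { return "arm"; }
#endif

/* ---- serial-host.c: hypothetical, only for running this sketch ---- */
#if defined TEACUP_C_INCLUDE && ! defined __AVR__ && ! defined __ARMEL__
static const char *serial_arch(void) { return "host"; }
#endif

#undef TEACUP_C_INCLUDE

/* ---- the same host file seen again, as when an IDE compiles every
   file on its own: TEACUP_C_INCLUDE is no longer defined, so this
   compiles to nothing and causes no duplicate definition. ---- */
#if defined TEACUP_C_INCLUDE && ! defined __AVR__ && ! defined __ARMEL__
static const char *serial_arch(void) { return "duplicate"; }
#endif
```

Without the TEACUP_C_INCLUDE part of the test, the last section would redefine `serial_arch()` and the build would fail, which is the double compilation the strategy avoids.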
Regarding the copy operation of this commit: the code is
unchanged, other than rewriting all the comments to the current
idea of 'proper' formatting, getting rid of tabs and some other
whitespace editing.
This was forgotten with the recent move to storing configuration
items as tuples (value, enabled). It should fix the refusal to
build reported in issue #86.
The recent switch to sending 'ok' postponed requires also sending
a newline in a few places, because this 'ok' is no longer at the
start of a line. Now it appears on its own line.
Some trailing whitespace was also removed in heater.c.
Costs 14 bytes binary size on AVR.
Previously, the acknowledgement was sent as soon as the command
was parsed. Accordingly, the host would send the next command and
this command would wait in the RX buffer without being parsed.
This worked reasonably well, unless an incoming line of G-code was
longer than the RX buffer, in which case the line end was dropped
and parsing of the line never completed. With a 64-byte buffer
on AVR this was rarely the case; with the 16-byte hardware buffer
on ARM LPC1114 it happens regularly. And there's no recovering
from such a situation.
This should solve issue #52.
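The failure mode can be shown with a small model, assuming that with the early 'ok' the next line arrives while the parser is still busy, and with the postponed 'ok' it arrives while the parser is reading. The function name and the model are hypothetical simplifications, not Teacup code.

```c
#include <stdbool.h>

/* Count characters lost when a line of line_len characters arrives at an
   RX buffer of rx_size characters. If the parser is busy, nothing is
   consumed meanwhile; if it is idle, each character is read at once. */
static unsigned chars_dropped(unsigned line_len, unsigned rx_size,
                              bool parser_busy) {
  unsigned fill = 0, dropped = 0;
  for (unsigned i = 0; i < line_len; i++) {
    if (fill < rx_size)
      fill++;
    else
      dropped++;  /* buffer full: this character, e.g. the line end, is lost */
    if (!parser_busy && fill > 0)
      fill--;     /* parser takes the character right away */
  }
  return dropped;
}
```

A 30-character line loses its last 14 characters, including the line end, with a 16-byte buffer and a busy parser; it fits completely into a 64-byte buffer, and nothing is lost when the parser reads while the line arrives.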