As dda_clock() is potentially too slow for high step rates, we
call it from a secondary interrupt with slightly lower priority.
This makes sure the slow part is skipped on high system load,
while still being reasonably synchronized with the clock tick.
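A host-side sketch of this deferral pattern, for illustration only.
All names are hypothetical and the plain flag merely stands in for
the pending bit of a lower-priority interrupt. How the actual commit
triggers the secondary interrupt is not shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the pending bit of the lower-priority interrupt. */
static volatile bool clock_pending = false;

/* Counts how often the slow part actually ran. */
static uint32_t clock_runs = 0;

/* Placeholder for the slow work, e.g. what dda_clock() does. */
static void slow_clock_work(void) {
  clock_runs++;
}

/* Fast, high-priority tick: only requests the slow part. Requesting
   it twice before it ran still results in a single run. */
static void tick_interrupt(void) {
  clock_pending = true;
}

/* Lower-priority handler: runs when nothing more urgent is active. */
static void secondary_interrupt(void) {
  if ( ! clock_pending)
    return;
  clock_pending = false;
  slow_clock_work();
}
```

If two ticks arrive before the secondary handler gets a chance to
run, the slow part runs once, which is the "skipped on high system
load" behaviour.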
Test: steppers should move and accelerate now.
Current binary size:
SIZES ARM... lpc1114
FLASH : 7756 bytes 24%
RAM : 960 bytes 24%
EEPROM : 0 bytes 0%
Works nicely, much less code than on AVR, because we have 32-bit
hardware timers.
Test: steppers should move. Only slowly, because dda_clock() isn't
called, yet, so no acceleration.
Pulse time on the Debug LED is 5.21 us or 250 clock ticks.
All in one chunk, because it's all hardware-independent and doing
them one by one would amount to little more than a typing
exercise.
Compiles fine. For testing, remove if (DEBUG... for M114 in
gcode_process.c. Then one can see how the queue fills up when
sending movements and M114 repeatedly. This time with actual
coordinates.
No stepper movements, yet, because set_timer() is still empty.
Compiles fine. For testing, remove if (DEBUG... for M114 in
gcode_process.c. Then one can see how the queue fills up when
sending movements and M114 repeatedly.
queue_step() isn't called, yet, the stepper timer is still missing.
This test code in SysTick_Handler() should give you a rather
accurate clock with only a few seconds deviation per hour:

  #include "serial.h"
  #include "sersendf.h"

  void SysTick_Handler(void) {
    static uint32_t count = 0;
    static uint8_t minutes = 0, seconds = 0;

    count++;
    if ( ! (count % 500)) {  // A full second.
      seconds++;
      if ( ! (seconds % 60)) {
        seconds = 0;
        minutes++;
      }
      sersendf_P(PSTR("%su:"), minutes);
      if (seconds < 10)
        serial_writechar('0');
      sersendf_P(PSTR("%su\n"), seconds);
    }

    [...]

This enables pinio_init(), power_on() and power_off(). Now one
can turn on the power supply with M119 and turn it off with M2.
Code changes were necessary. Setting a pin first, then making
it an output doesn't work on ARM. A pin has to be an output
before it permanently accepts a given state. As I was never
sure the former strategy actually worked on AVR, the order of
these two steps was changed for both AVR and ARM.
Again, the whole file compiled flawlessly without change. Still,
to get it linked as well, most of the functionality had to
be #ifdef'd out.
Nevertheless, the firmware shows first signs of life, e.g. M115
works.
This uses 4 bytes less RAM, without any loss, due to fewer holes
in variable arrangements.
The general strategy is simple:
- Ideally, all variables are aligned in groups of 4 bytes
  (32 bits). This allows fastest access on 32-bit CPUs and doesn't
  change anything on 8 or 16 bit ones.
    - 1x 32-bit variable = 4 bytes = 4-byte group.
    - 2x 16-bit variables together = 4-byte group.
    - 4x 8-bit variables together = 4-byte group.
- Have as few incomplete groups as possible.
Another strategy is to simply order variables by size.
There's a compiler flag to pack such variable arrangements, but
this costs Flash size and processing time.
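To illustrate the grouping strategy, here is a small sketch with two
made-up structs (not taken from Teacup) holding the same members in
different order. Exact sizes depend on the compiler and ABI; with
4-byte alignment for uint32_t the scattered variant typically takes
12 bytes, the grouped one 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: members in arbitrary order. */
struct scattered {
  uint8_t  flag_a;  /* 1 byte, typically followed by 3 bytes padding */
  uint32_t steps;   /* 4 bytes */
  uint8_t  flag_b;  /* 1 byte, typically followed by 1 byte padding */
  uint16_t speed;   /* 2 bytes */
};

/* Same members, arranged in complete 4-byte groups. */
struct grouped {
  uint32_t steps;   /* 1x 32 bit = one group */
  uint16_t speed;   /* 1x 16 bit + 2x 8 bit = one group */
  uint8_t  flag_a;
  uint8_t  flag_b;
};

static size_t scattered_size(void) { return sizeof(struct scattered); }
static size_t grouped_size(void)   { return sizeof(struct grouped); }
```

Here, sorting by size and forming 4-byte groups give the same
layout.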
Just did it, no code changes necessary, except adjusting the
boundaries to not yet ported code.
Successful tests: controller answers with "ok", just like an AVR.
Binary size increased, of course:
SIZES ARM... lpc1114
FLASH : 3064 bytes 10%
RAM : 194 bytes 5%
EEPROM : 0 bytes 0%
That is, reformat the "Changes for Teacup" section, replace tabs
with spaces, remove trailing whitespace and keep the file as close
to the original as possible.
This is pretty complex and, as system clock and baudrate are
known at compile time and never changed at runtime, unnecessary.
Replacing this calculation with fixed values makes the binary
a whopping 564 bytes smaller.
However, how to get these values? Well, we do kind of an
easter egg. If the parameters are not known, we calculate them at
runtime anyway and also report them to the user. So she can
insert them into the code and after doing so, whoops, serial is
fast and the binary small :-)
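A much simplified sketch of the idea, not the actual code: it only
computes the integer baud rate divisor and ignores the fractional
divider a real UART setup may use as well. The names and the values
of 48 MHz and 115200 baud are assumptions for illustration.

```c
#include <stdint.h>

/* Assumed values; a real build would take them from its config. */
#define UART_CLOCK_HZ  48000000UL
#define BAUD           115200UL

/* Runtime variant: rounded integer divisor, clock / (16 * baud). */
static uint32_t uart_divisor(uint32_t clock_hz, uint32_t baud) {
  return (clock_hz + 8 * baud) / (16 * baud);
}

/* Baud rate which actually results from a given divisor. */
static uint32_t uart_actual_baud(uint32_t clock_hz, uint32_t divisor) {
  return clock_hz / (16 * divisor);
}

/* Compile-time variant: the same formula as a constant expression,
   so no division code ends up in the binary. */
enum { UART_DIVISOR = (UART_CLOCK_HZ + 8 * BAUD) / (16 * BAUD) };
```

With these assumed values both variants yield a divisor of 26,
which corresponds to an actual rate of about 115385 baud.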
With known parameters:
SIZES ARM... lpc1114
FLASH : 1092 bytes 4%
RAM : 132 bytes 4%
EEPROM : 0 bytes 0%
Without (1428 bytes more):
SIZES ARM... lpc1114
FLASH : 2520 bytes 8%
RAM : 132 bytes 4%
EEPROM : 0 bytes 0%
On ARM we use only the 16-byte hardware buffer for sending and
receiving over the serial line, which is often too short for
debugging messages. This implementation works fine and still
neither blocks nor introduces delays for short messages.
Costs 72 bytes binary size, mostly because it's the first usage
of delay_us():
SIZES ARM... lpc1114
FLASH : 1656 bytes 6%
RAM : 136 bytes 4%
EEPROM : 0 bytes 0%
Accuracy is pretty good, see committed comments :-)
Code used for testing, in main():

  uint32_t i;

  SET_OUTPUT(PIO0_1);
  while (1) {
    // About 20 seconds for each frequency (26 for the last one),
    // so we can measure all three with one upload.
    for (i = 10000; i > 0; i--) {
      WRITE(PIO0_1, 1);
      delay_us(1000);
      WRITE(PIO0_1, 0);
      delay_us(1000);
    }
    for (i = 1000; i > 0; i--) {
      WRITE(PIO0_1, 1);
      delay_us(10000);
      WRITE(PIO0_1, 0);
      delay_us(10000);
    }
    for (i = 200; i > 0; i--) {
      WRITE(PIO0_1, 1);
      delay_us(65000);
      WRITE(PIO0_1, 0);
      delay_us(65000);
    }
  }

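For reading the scope or frequency counter it helps to know what to
expect. A small helper (hypothetical, not part of the firmware)
which computes the ideal frequency of a square wave from the delay
used for each half period, ignoring loop overhead:

```c
#include <stdint.h>

/* Ideal square wave frequency in millihertz for a given half period
   in microseconds: f = 1 / (2 * half period). */
static uint32_t square_wave_millihertz(uint32_t half_period_us) {
  return 1000000000UL / (2 * half_period_us);
}
```

For the three delays used in the test this gives 500 Hz, 50 Hz and
about 7.69 Hz.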
(Hopefully) no functional change.
Also remove these wd_reset()s in delay_us() to match the behaviour
promised in delay.h. Not that this matters much, watchdog is
disabled by default.
On ARM enabling the pullup on an input pin isn't done by writing
a 1 to the pin, but by setting the corresponding register.
Accordingly we need a distinct function for this.
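A sketch of what such a function can look like, operating on a
pointer so it can be tried on a plain variable. The helper names are
hypothetical. The bit positions (MODE field in bits 4:3, value 0x2
for pull-up) are my reading of the LPC111x user manual, so verify
them against the manual before relying on them.

```c
#include <stdint.h>

/* MODE field of an IOCON register, assumed to be bits 4:3. */
#define IOCON_MODE_SHIFT   3
#define IOCON_MODE_MASK    (0x3UL << IOCON_MODE_SHIFT)
#define IOCON_MODE_PULLUP  (0x2UL << IOCON_MODE_SHIFT)

/* Enable the pullup without touching the other bits. */
static void pullup_on(volatile uint32_t *iocon) {
  *iocon = (*iocon & ~IOCON_MODE_MASK) | IOCON_MODE_PULLUP;
}

/* Disable pullup and pulldown without touching the other bits. */
static void pullup_off(volatile uint32_t *iocon) {
  *iocon &= ~IOCON_MODE_MASK;
}
```

On real hardware the pointer would be the address of the IOCON
register belonging to the pin.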
This also implements more of the FastIO infrastructure.
Unfortunately, definitions aren't exactly straightforward, so we
need lots of tabular data. For example, for the pin function
I had to step through the user manual, pin by pin.
We also learned a lesson here: Cortex-M0 has a 4-word (= 16 bytes)
prefetch engine. Loops not starting at such a boundary take
4 additional clock cycles, making them slower. The tight loop used
for testing previously happened to be 16-byte aligned by accident.
Adding just one line of code in the SET_OUTPUT() macro misaligned
it, so the loop repetition rate dropped from 5.3 MHz to 3.7 MHz.
There are several ways to align code to 16-byte boundaries:
- -falign-functions=16 as gcc flag.
- -falign-loops=16 as gcc flag, found to not work.
- -falign-labels=16 as gcc flag, worked for aligning the loop,
  but also bloated the binary by 10%.
- __attribute__ ((aligned(16))) attached to functions (not
  tested).
- Adding this just before the loop worked fine and increased the
  binary by just 16 bytes:

    __ASM (".balign 16");

Take care of this when relying on exact execution times, e.g. when
implementing delay_us()!
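The numbers fit together nicely. A small sketch of the arithmetic,
assuming a 48 MHz core clock and a loop of 9 clock cycles when
aligned. Both values are assumptions, chosen because they reproduce
the measured 5.3 MHz. The helper name is made up.

```c
#include <stdint.h>

/* Loop repetition rate in kHz for a given clock and loop length. */
static uint32_t loop_rate_khz(uint32_t clock_hz, uint32_t cycles_per_loop) {
  return clock_hz / cycles_per_loop / 1000;
}
```

loop_rate_khz(48000000, 9) gives 5333 kHz for the aligned loop,
loop_rate_khz(48000000, 9 + 4) gives 3692 kHz with the 4 cycles
penalty, matching the 5.3 MHz and 3.7 MHz seen on the scope.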
Only SET_OUTPUT() and WRITE() for now, reading follows later.
A loop like this:

  SET_OUTPUT(PIO0_1);
  for (;;) {
    WRITE(PIO0_1, 0);
    WRITE(PIO0_1, 1);
  }

toggles a pin at about 5.3 MHz. The low period is 63 ns on the
scope, so 3 clock cycles. With this loop, the binary is 1648
bytes.
Assembly shows four instructions inside the loop, which is about
as good as it can get:

  movs  r2, #0
  str   r2, [r3, #8]
  adds  r2, #2
  str   r2, [r3, #8]

For comparison, using the MBED-provided gpio routines gives a
toggle frequency of about 300 kHz, with a low period of 72 clock
cycles. Microoptimisation isn't just the last few percent ...
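As a cross-check of these measurements, a sketch which converts
clock cycles into time. It assumes a 48 MHz clock and the usual
Cortex-M0 timings of 1 cycle for movs/adds, 2 cycles for str and
3 cycles for a taken branch. The branch back to the loop start is
not part of the listing above, it is assumed here. All names are
hypothetical.

```c
#include <stdint.h>

#define CLOCK_HZ  48000000ULL  /* assumed core clock */

/* Assumed cycle counts per instruction. */
enum { CYC_MOVS = 1, CYC_STR = 2, CYC_ADDS = 1, CYC_BRANCH = 3 };

/* Duration of a number of clock cycles in ns, rounded down. */
static uint32_t cycles_to_ns(uint32_t cycles) {
  return (uint32_t)(cycles * 1000000000ULL / CLOCK_HZ);
}

/* Pin is low from the end of the first str to the end of the second. */
static uint32_t low_cycles(void) {
  return CYC_ADDS + CYC_STR;
}

/* One full loop iteration, including the branch back. */
static uint32_t loop_cycles(void) {
  return CYC_MOVS + CYC_STR + CYC_ADDS + CYC_STR + CYC_BRANCH;
}
```

This gives 3 cycles or 62 ns for the low period and 9 cycles per
iteration. The 72 cycles of the MBED routines correspond to 1500 ns.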
Tested with this code before main():

  static void delay(uint32_t delay) {
    while (delay) {
      __ASM volatile ("nop");
      delay--;
    }
  }

... and in main():

  SET_OUTPUT(PIO0_1);
  SET_OUTPUT(PIO0_2);
  SET_OUTPUT(PIO0_3);
  SET_OUTPUT(PIO0_4);

  __ASM (".balign 16");
  while (1) {
    // 1 pulse on pin 1, two pulses on pin 2, ...
    WRITE(PIO0_1, 0);
    WRITE(PIO0_1, 1);
    WRITE(PIO0_2, 0);
    WRITE(PIO0_2, 1);
    WRITE(PIO0_2, 0);
    WRITE(PIO0_2, 1);
    WRITE(PIO0_3, 0);
    WRITE(PIO0_3, 1);
    WRITE(PIO0_3, 0);
    WRITE(PIO0_3, 1);
    WRITE(PIO0_3, 0);
    WRITE(PIO0_3, 1);
    // PIO0_4 needs a pullup 10k to 3.3V
    // to show a visible signal.
    WRITE(PIO0_4, 0);
    delay(10);
    WRITE(PIO0_4, 1);
    delay(10);
    WRITE(PIO0_4, 0);
    delay(10);
    WRITE(PIO0_4, 1);
    delay(10);
    WRITE(PIO0_4, 0);
    delay(10);
    WRITE(PIO0_4, 1);
    delay(10);
    WRITE(PIO0_4, 0);
    delay(10);
    WRITE(PIO0_4, 1);
    delay(1000);
  }

With a 10k pullup, PIO0_4 has a rise time of about 1 microsecond.