Note by Traumflug: this
while read -r LINE; do
  some commands
done < <(some other command)
didn't work here (bash 4.3.11 on Ubuntu 14.04), so I had to swap
the sequence of these two commands and use a pipe instead. Anyway,
excellent idea, shortens some simulator runs drastically!
Finally had to look it up myself. RAMPS users are apparently all
incapable or too stupid to write such fixes into an issue report.
From IRC:
xxx: oh, motors move now, thanks for helping
yyy: please write the changes into a github issue
xxx: will do
...
xxx: have to run now
... and these are the last words the world reads about xxx :-)
Observed three times now.
This is mostly to reduce confusion, because analog pins can
be used as digital ones, too. It also matches what other firmwares
do, so people can simply copy & paste pin definitions.
Definitions were taken from Sprinter's fastio.h (which was
initially crafted by copying Teacup's arduino*.h files :-) )
This was the goal: to not bit-shift when calling setTimer(). Binary
size is another 40 bytes smaller and performance about 1.2 % better:
SIZES     ATmega...   '168   '328(P)   '644(P)   '1280
FLASH : 20136 bytes   141%       66%       32%     16%
RAM   :  2318 bytes   227%      114%       57%     29%
EEPROM:    32 bytes     4%        2%        2%      1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 302 clock cycles.
LED on time maximum: 718 clock cycles.
LED on time average: 311.258 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 9124.
LED on time minimum: 307 clock cycles.
LED on time maximum: 708 clock cycles.
LED on time average: 357.417 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 302 clock cycles.
LED on time maximum: 708 clock cycles.
LED on time average: 330.322 clock cycles.
Admittedly it looks like advancing in baby steps, but catching
every bit shifting instance really isn't trivial, because sometimes
these shifts are already embedded in other calculations.
Still no binary size or performance change.
While this shifting was meant to increase accuracy, there's no
actual use of it, other than that this value gets shifted back and
forth. Let's start to get rid of it.
Performance stays exactly the same:
SIZES     ATmega...   '168   '328(P)   '644(P)   '1280
FLASH : 20188 bytes   141%       66%       32%     16%
RAM   :  2318 bytes   227%      114%       57%     29%
EEPROM:    32 bytes     4%        2%        2%      1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 306 clock cycles.
LED on time maximum: 722 clock cycles.
LED on time average: 315.253 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 9124.
LED on time minimum: 311 clock cycles.
LED on time maximum: 712 clock cycles.
LED on time average: 361.416 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 306 clock cycles.
LED on time maximum: 712 clock cycles.
LED on time average: 334.319 clock cycles.
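Why the shifting can go away without changing behaviour can be pictured with a minimal C sketch. The shift width of 8 bits and both function names are assumptions made for illustration, they are not taken from the firmware; only the idea (a value shifted up when stored and shifted back down on every use) comes from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: number of fractional bits shifted in. */
#define FRACTION_BITS 8

/* Old style: keep the step delay shifted up, shift it back on every use. */
uint32_t timer_value_shifted(uint32_t c) {
  /* The value has to fit, else the upper bits are lost by the shift. */
  assert(c < (UINT32_C(1) << (32 - FRACTION_BITS)));
  uint32_t stored = c << FRACTION_BITS;  /* fractional bits are always zero */
  return stored >> FRACTION_BITS;        /* undone right before use */
}

/* New style: store and use the plain value, no shifting at all. */
uint32_t timer_value_plain(uint32_t c) {
  return c;
}
```

As long as nothing ever writes to the fractional bits, both functions return the same number, so dropping the shifts costs no accuracy.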
This finally brings Z axis up to speed.
So far we always assumed the fastest axis to have the same steps/mm
as the X axis. In cases where this wasn't true, the movement
wouldn't do sufficient acceleration steps and, accordingly,
not reach the expected maximum speed. This was particularly visible
on a typical Mendel printer, where the Z axis would reach only a
6th of the commanded speed in some configurations.
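The effect can be sketched in a few lines of C. This is not the firmware's code: the function name, the units and the example numbers are assumptions for illustration. It only uses the textbook relation that ramping from standstill to speed v at constant acceleration a takes a distance of v^2 / (2 a), which then has to be converted to steps with the steps/mm of the axis that actually moves.

```c
#include <assert.h>
#include <stdint.h>

/* Number of steps needed to accelerate from standstill to speed_mm_s
   at accel_mm_s2. Distance is v^2 / (2 a), multiplied by steps/mm. */
uint32_t ramp_steps(uint32_t speed_mm_s, uint32_t accel_mm_s2,
                    uint32_t steps_per_mm) {
  assert(accel_mm_s2 > 0);
  uint64_t v2 = (uint64_t)speed_mm_s * speed_mm_s;
  return (uint32_t)(v2 * steps_per_mm / (2u * (uint64_t)accel_mm_s2));
}
```

With made-up values of 80 steps/mm for X and 2560 steps/mm for Z, ramp_steps(4, 100, 2560) gives 204 steps, while wrongly using the X value, ramp_steps(4, 100, 80), gives only 6. A ramp that short ends long before the Z axis has reached its target speed.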
For now this is a square root function which should solve entirely
in the preprocessor. Test results described in the file.
Test code for runtime results, inserted right before the main loop
in mendel.c:
for (uint32_t i = 0; i < 10000000; i++) {
  uint32_t mathlib = (uint32_t)(sqrt(i) + .5);
  uint32_t preprocessor = (uint32_t)(SQRT(i) + .5);
  if (mathlib != preprocessor) {
    sersendf_P(PSTR("%lu: %lu %lu\n"), i, mathlib, preprocessor);
    break;
  }
  if ((i & 0x00001fff) == 0)
    sersendf_P(PSTR("%lu\n"), i);
}
sersendf_P(PSTR("Square root check done.\n"));
Test code for compile time results:
sersendf_P(PSTR("10000000: %lu\n"), (uint32_t)SQRT(10000000));
sersendf_P(PSTR("10000000: %lu\n"), (uint32_t)sqrt(10000000));
sersendf_P(PSTR("20000000: %lu\n"), (uint32_t)SQRT(20000000));
sersendf_P(PSTR("20000000: %lu\n"), (uint32_t)sqrt(20000000));
sersendf_P(PSTR("30000000: %lu\n"), (uint32_t)SQRT(30000000));
sersendf_P(PSTR("30000000: %lu\n"), (uint32_t)sqrt(30000000));
sersendf_P(PSTR("40000000: %lu\n"), (uint32_t)SQRT(40000000));
sersendf_P(PSTR("40000000: %lu\n"), (uint32_t)sqrt(40000000));
sersendf_P(PSTR("50000000: %lu\n"), (uint32_t)SQRT(50000000));
sersendf_P(PSTR("50000000: %lu\n"), (uint32_t)sqrt(50000000));
sersendf_P(PSTR("60000000: %lu\n"), (uint32_t)SQRT(60000000));
sersendf_P(PSTR("60000000: %lu\n"), (uint32_t)sqrt(60000000));
sersendf_P(PSTR("70000000: %lu\n"), (uint32_t)SQRT(70000000));
sersendf_P(PSTR("70000000: %lu\n"), (uint32_t)sqrt(70000000));
sersendf_P(PSTR("80000000: %lu\n"), (uint32_t)SQRT(80000000));
sersendf_P(PSTR("80000000: %lu\n"), (uint32_t)sqrt(80000000));
sersendf_P(PSTR("90000000: %lu\n"), (uint32_t)SQRT(90000000));
sersendf_P(PSTR("90000000: %lu\n"), (uint32_t)sqrt(90000000));
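For readers who wonder how a square root can be written as macros at all, here is a minimal sketch using unrolled Newton-Raphson iterations. It is not the SQRT() macro tested above: all names are hypothetical, and the iteration count is kept at 10, which is an assumption that limits useful accuracy to small arguments (checked here only up to 10000). Every level expands the previous one twice, so the expression grows quickly with each added iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Newton-Raphson, s' = (s + x / s) / 2, starting at x / 2.
   Not valid for x == 0 (division by zero). */
#define SQRT_IT00(x) (((double)(x)) / 2)
#define SQRT_IT01(x) ((SQRT_IT00(x) + ((x) / SQRT_IT00(x))) / 2)
#define SQRT_IT02(x) ((SQRT_IT01(x) + ((x) / SQRT_IT01(x))) / 2)
#define SQRT_IT03(x) ((SQRT_IT02(x) + ((x) / SQRT_IT02(x))) / 2)
#define SQRT_IT04(x) ((SQRT_IT03(x) + ((x) / SQRT_IT03(x))) / 2)
#define SQRT_IT05(x) ((SQRT_IT04(x) + ((x) / SQRT_IT04(x))) / 2)
#define SQRT_IT06(x) ((SQRT_IT05(x) + ((x) / SQRT_IT05(x))) / 2)
#define SQRT_IT07(x) ((SQRT_IT06(x) + ((x) / SQRT_IT06(x))) / 2)
#define SQRT_IT08(x) ((SQRT_IT07(x) + ((x) / SQRT_IT07(x))) / 2)
#define SQRT_IT09(x) ((SQRT_IT08(x) + ((x) / SQRT_IT08(x))) / 2)
#define SQRT_SKETCH(x) ((SQRT_IT09(x) + ((x) / SQRT_IT09(x))) / 2)

/* With a constant argument the whole expression is a constant
   expression, so it can initialise a file-scope constant. */
const uint32_t sqrt_of_2500 = (uint32_t)(SQRT_SKETCH(2500) + .5);

/* The same macro also works on a runtime value, which is what the
   runtime comparison loop above relies on. */
uint32_t rounded_sqrt(uint32_t x) {
  assert(x > 0);
  return (uint32_t)(SQRT_SKETCH(x) + .5);
}
```

The runtime variant is only useful for testing; the point of the macro form is that constant arguments cost neither flash for a math library nor CPU time.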
'all_time' sounds like forever to me, but this variable really
tracks the last time we hit one of "all the axes". It sticks
out more now in looping, so rename it to make sense.
Clean up code to reduce duplication by consolidating code into
loops for per-axis actions.
Part 9 is, finally use this set_direction() thing. As a dessert
topping, it reduces binary size by another 122 bytes.
SIZES     ATmega...   '168   '328(P)   '644(P)   '1280
FLASH : 19988 bytes   140%       66%       32%     16%
RAM   :  2302 bytes   225%      113%       57%     29%
EEPROM:    32 bytes     4%        2%        2%      1%
Clean up code to reduce duplication by consolidating code into
loops for per-axis actions.
Part 8 is, move remaining update_current_position() into a loop.
This makes the binary 134 bytes smaller. As it's not critical,
no performance test.
SIZES     ATmega...   '168   '328(P)   '644(P)   '1280
FLASH : 20134 bytes   141%       66%       32%     16%
RAM   :  2302 bytes   225%      113%       57%     29%
EEPROM:    32 bytes     4%        2%        2%      1%
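The kind of consolidation this series of commits is about can be sketched as follows. The enum, the struct and both function names are hypothetical, chosen for illustration; the sketch does not reproduce update_current_position(), it only shows one statement per axis being replaced by a loop over an axis index.

```c
#include <assert.h>
#include <stdint.h>

enum axis_e { X = 0, Y, Z, E, AXIS_COUNT };

typedef struct {
  int32_t axis[AXIS_COUNT];
} axes_int32_t;

/* Before: the same statement, written out once per axis. */
void add_offset_unrolled(axes_int32_t *pos, const axes_int32_t *delta) {
  assert(pos && delta);
  pos->axis[X] += delta->axis[X];
  pos->axis[Y] += delta->axis[Y];
  pos->axis[Z] += delta->axis[Z];
  pos->axis[E] += delta->axis[E];
}

/* After: one statement inside a loop over all axes. */
void add_offset_looped(axes_int32_t *pos, const axes_int32_t *delta) {
  assert(pos && delta);
  for (int i = X; i < AXIS_COUNT; i++)
    pos->axis[i] += delta->axis[i];
}
```

Both functions compute the same result. Whether the loop ends up smaller or faster depends on what the compiler makes of it, which is why the sizes are measured for each part.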
Clean up code to reduce duplication by consolidating code into
loops for per-axis actions.
Part 7 is, turn update_current_position() in dda.c partially into
a loop. Surprise, surprise, this changes neither binary size nor
performance. Looking into the generated assembly, the loop is
indeed completely unrolled. Apparently that's smaller than a
real loop.
SIZES     ATmega...   '168   '328(P)   '644(P)   '1280
FLASH : 20270 bytes   142%       66%       32%     16%
RAM   :  2302 bytes   225%      113%       57%     29%
EEPROM:    32 bytes     4%        2%        2%      1%
short-moves.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 888.
Sum of all LED on time: 279945 clock cycles.
LED on time minimum: 306 clock cycles.
LED on time maximum: 722 clock cycles.
LED on time average: 315.253 clock cycles.
smooth-curves.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 9124.
Sum of all LED on time: 3297806 clock cycles.
LED on time minimum: 311 clock cycles.
LED on time maximum: 712 clock cycles.
LED on time average: 361.443 clock cycles.
triangle-odd.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 1636.
Sum of all LED on time: 546946 clock cycles.
LED on time minimum: 306 clock cycles.
LED on time maximum: 712 clock cycles.
LED on time average: 334.319 clock cycles.
Clean up code to reduce duplication by consolidating code into
loops for per-axis actions.
Part 6c removes do_step(), but still tries to keep a loop. This
is about the maximum of performance I (Traumflug) can think of.
Binary size is as good as with the former attempt, but performance
is actually pretty bad, 45% worse than without looping:
SIZES     ATmega...   '168   '328(P)   '644(P)   '1280
FLASH : 19876 bytes   139%       65%       32%     16%
RAM   :  2302 bytes   225%      113%       57%     29%
EEPROM:    32 bytes     4%        2%        2%      1%
short-moves.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 888.
Sum of all LED on time: 406041 clock cycles.
LED on time minimum: 448 clock cycles.
LED on time maximum: 864 clock cycles.
LED on time average: 457.253 clock cycles.
smooth-curves.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 9124.
Sum of all LED on time: 4791132 clock cycles.
LED on time minimum: 453 clock cycles.
LED on time maximum: 867 clock cycles.
LED on time average: 525.113 clock cycles.
triangle-odd.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 1636.
Sum of all LED on time: 800586 clock cycles.
LED on time minimum: 448 clock cycles.
LED on time maximum: 867 clock cycles.
LED on time average: 489.356 clock cycles.
Clean up code to reduce duplication by consolidating code into
loops for per-axis actions.
Part 6b moves do_step() from the "tidiest" place into where it's
currently used, dda.c. Binary size goes down another 34 bytes, to
a total savings of 408 bytes and performance is much better, but
still 16% lower than without using loops:
SIZES     ATmega...   '168   '328(P)   '644(P)   '1280
FLASH : 19874 bytes   139%       65%       32%     16%
RAM   :  2302 bytes   225%      113%       57%     29%
EEPROM:    32 bytes     4%        2%        2%      1%
short-moves.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 888.
Sum of all LED on time: 320000 clock cycles.
LED on time minimum: 351 clock cycles.
LED on time maximum: 772 clock cycles.
LED on time average: 360.36 clock cycles.
smooth-curves.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 9124.
Sum of all LED on time: 3875874 clock cycles.
LED on time minimum: 356 clock cycles.
LED on time maximum: 773 clock cycles.
LED on time average: 424.8 clock cycles.
triangle-odd.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 1636.
Sum of all LED on time: 640357 clock cycles.
LED on time minimum: 351 clock cycles.
LED on time maximum: 773 clock cycles.
LED on time average: 391.416 clock cycles.
Clean up code to reduce duplication by consolidating code into
loops for per-axis actions.
Part 6a is putting stuff inside the step interrupt into a loop,
too. do_step() is put into the "tidiest" place. Binary size goes
down a remarkable 374 bytes, but stepping performance suffers by
almost 30%.
Traumflug's performance measurements:
SIZES     ATmega...   '168   '328(P)   '644(P)   '1280
FLASH : 19908 bytes   139%       65%       32%     16%
RAM   :  2302 bytes   225%      113%       57%     29%
EEPROM:    32 bytes     4%        2%        2%      1%
short-moves.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 888.
Sum of all LED on time: 354537 clock cycles.
LED on time minimum: 390 clock cycles.
LED on time maximum: 806 clock cycles.
LED on time average: 399.253 clock cycles.
smooth-curves.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 9124.
Sum of all LED on time: 4268896 clock cycles.
LED on time minimum: 395 clock cycles.
LED on time maximum: 807 clock cycles.
LED on time average: 467.875 clock cycles.
triangle-odd.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 1636.
Sum of all LED on time: 706846 clock cycles.
LED on time minimum: 390 clock cycles.
LED on time maximum: 807 clock cycles.
LED on time average: 432.057 clock cycles.
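What a looped step routine can look like is sketched below, runnable on a PC. This is an illustration under assumptions, not the code measured above: the struct layout, the Bresenham-style accumulators and the do_step() stand-in (which only counts steps instead of toggling pins) are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define AXES 4

typedef struct {
  uint32_t total_steps;  /* steps of the fastest axis */
  uint32_t delta[AXES];  /* steps each axis has to do, <= total_steps */
  uint32_t acc[AXES];    /* per-axis Bresenham accumulator */
  uint32_t done[AXES];   /* steps issued so far */
} move_t;

void move_init(move_t *m, const uint32_t delta[AXES]) {
  m->total_steps = 0;
  for (int i = 0; i < AXES; i++) {
    m->delta[i] = delta[i];
    m->done[i] = 0;
    if (delta[i] > m->total_steps)
      m->total_steps = delta[i];
  }
  for (int i = 0; i < AXES; i++)
    m->acc[i] = m->total_steps / 2;
}

/* Stand-in for toggling the step pin of one axis. */
static void do_step(move_t *m, int axis) {
  m->done[axis]++;
}

/* One invocation of the simulated step interrupt: the same code runs
   for every axis instead of being written out four times. */
void step_tick(move_t *m) {
  assert(m->total_steps > 0);
  for (int i = 0; i < AXES; i++) {
    m->acc[i] += m->delta[i];
    if (m->acc[i] >= m->total_steps) {
      m->acc[i] -= m->total_steps;
      do_step(m, i);
    }
  }
}
```

Calling step_tick() total_steps times issues exactly delta[i] steps on each axis. On a PC the loop is free; on an 8-bit AVR the indexed accesses are what show up in the measurements above.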
There's nothing special about this config.h, it's just the one I
happened to use for first profiling investigations. To allow
everybody else to do the very same profiling runs, I add it here.
Doing profiling isn't too complicated:
mv config.h config.h.backup
ln -s testcases/config.h.Profiling config.h
git checkout -b work
git cherry-pick simulavr # add tweaks convenient for simulation runs
make
cd testcases
./run-in-simulavr.sh short-moves.gcode smooth-curves.gcode triangle-odd.gcode
After being done you can restore your config.h and delete this work branch.
Currently, performance is as follows (with convenience commit applied):
SIZES     ATmega...   '168   '328(P)   '644(P)   '1280
FLASH : 20270 bytes   142%       66%       32%     16%
RAM   :  2302 bytes   225%      113%       57%     29%
EEPROM:    32 bytes     4%        2%        2%      1%
short-moves.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 888.
Sum of all LED on time: 279945 clock cycles.
LED on time minimum: 306 clock cycles.
LED on time maximum: 722 clock cycles.
LED on time average: 315.253 clock cycles.
smooth-curves.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 9124.
Sum of all LED on time: 3297806 clock cycles.
LED on time minimum: 311 clock cycles.
LED on time maximum: 712 clock cycles.
LED on time average: 361.443 clock cycles.
triangle-odd.gcode
Statistics (assuming a 20 MHz clock):
LED on occurences: 1636.
Sum of all LED on time: 546946 clock cycles.
LED on time minimum: 306 clock cycles.
LED on time maximum: 712 clock cycles.
LED on time average: 334.319 clock cycles.
Fix:
dda_lookahead.c:327:17: warning: 'crossF' may be used
uninitialized in this function [-Wmaybe-uninitialized]
sersendf_P(PSTR("Initial crossing speed: %lu\n"), crossF);
^
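A typical way to silence this kind of warning is sketched below. The function, its parameters and the use of snprintf() instead of sersendf_P() are assumptions so the sketch runs on a PC; whether the actual fix initialises the variable or restructures the code is not stated here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* crossF is only assigned inside a condition, but printed on every
   path, which is what gcc complains about. Initialising it at the
   declaration gives every path a defined value. */
uint32_t crossing_speed(int have_previous_move, uint32_t prev_speed,
                        char *log, size_t log_size) {
  assert(log != 0 && log_size > 0);
  uint32_t crossF = 0;

  if (have_previous_move)
    crossF = prev_speed;

  snprintf(log, log_size, "Initial crossing speed: %lu\n",
           (unsigned long)crossF);
  return crossF;
}
```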
As it's still a bit cumbersome to go through the whole .vcd file
to find the highest delay between On and Off, do this search
automatically and output some statistics. This can look like this:
Statistics (assuming a 20 MHz clock):
LED on occurences: 838.
Sum of all LED on time: 262055 clock cycles.
LED on time minimum: 306 clock cycles.
LED on time maximum: 717 clock cycles.
LED on time average: 312.715 clock cycles.
This should give a reasonable overview of whether and roughly
how much a particular code change makes your code slower or
faster. It should also reveal showstoppers, like occasional
huge delays.
BTW, the above data was collected by timing the step interrupt when
running short-moves.gcode with the current firmware.
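The numbers in such a statistics block are simple to compute. The following C sketch shows the same calculation, assuming the On and Off edges have already been extracted into two arrays of clock-cycle timestamps; it is not the code of run-in-simulavr.sh, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
  size_t occurrences;
  uint64_t sum, min, max;  /* all in clock cycles */
  double average;
} led_stats_t;

/* on[i] and off[i] are the timestamps of the i-th rising and
   falling edge of the LED signal. */
led_stats_t led_stats(const uint64_t *on, const uint64_t *off,
                      size_t count) {
  led_stats_t s = { 0, 0, UINT64_MAX, 0, 0.0 };

  for (size_t i = 0; i < count; i++) {
    assert(off[i] >= on[i]);
    uint64_t duration = off[i] - on[i];
    s.sum += duration;
    if (duration < s.min) s.min = duration;
    if (duration > s.max) s.max = duration;
  }
  s.occurrences = count;
  if (count > 0)
    s.average = (double)s.sum / (double)count;
  else
    s.min = 0;
  return s;
}
```

The average is just the sum divided by the number of occurrences, which matches the output above: 262055 / 838 is about 312.715.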
The idea is simple: if you want to time a portion of code
precisely, turn on the Debug LED (see config.h for
DEBUG_LED_PIN) at the start of the sequence and turn it off when
done. Running this in SimulAVR, you have two edges precise
to the clock cycle which exactly reflect the time taken to
run this code sequence. Ideally, you run this code in a loop
to get a number of samples, if it doesn't run in a loop anyway.
Time taken can then be measured in GTKWave. For convenience and
for a better overview, run-in-simulavr.sh also extracts all the
delays into its own signal, so it can be viewed as an ongoing
number.
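The bracketing pattern itself is tiny. Below is a host-side C sketch of it: the cycle counter, debug_led() and code_under_test() are stand-ins invented for illustration, as a PC has neither the LED pin nor a simulator trace. In the firmware, the two debug_led() calls would be writes to DEBUG_LED_PIN and the cycle counting would be done by SimulAVR.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t cycle_counter;  /* pretend CPU clock */
static uint64_t led_on_at;      /* time of the last rising edge */
static uint64_t last_on_time;   /* width of the last LED pulse */
static int led_state;

/* Stand-in for writing the debug LED pin; records both edges. */
static void debug_led(int on) {
  if (on && !led_state)
    led_on_at = cycle_counter;
  if (!on && led_state)
    last_on_time = cycle_counter - led_on_at;
  led_state = on;
}

/* Stand-in for the code to be timed; pretends to take 306 cycles. */
static void code_under_test(void) {
  cycle_counter += 306;
}

/* LED on, run the code, LED off: the pulse width is the run time. */
uint64_t time_once(void) {
  assert(led_state == 0);
  debug_led(1);
  code_under_test();
  debug_led(0);
  return last_on_time;
}
```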
Any debugging LEDs aren't part of the CPU, but part of the
electronics. Accordingly, define them in config.*.h, not in
arduino_*.h (which would be better named something like
"atmega_*.h").
Should be done for temptable in ThermistorTable.h, too, but this
would mess up existing users' configurations.
This tries to put emphasis on the fact that you have to read
these values with pgm_read_*() instead of just using the variable.
Unfortunately, the gcc compiler neither inserts PROGMEM reading
instructions automatically when reading data stored in flash,
nor does it complain or warn about the missing read instructions.
As such it's very easy to accidentally handle data stored in flash
just like normal data. It'll compile and work ... you just read
arbitrary data (often, but not always zeros) instead of what you
intend.
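The difference between the two ways of reading can be shown in a short sketch. The table, its name and its contents are made up for illustration. PROGMEM and pgm_read_word() are the avr-libc names; the #else branch is a stand-in added here so the sketch also compiles on a PC, where flash and RAM share one address space.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __AVR__
  #include <avr/pgmspace.h>
#else
  /* Host build: emulate the two avr-libc names used below. */
  #define PROGMEM
  #define pgm_read_word(addr) (*(const uint16_t *)(addr))
#endif

#define TABLE_SIZE 4

/* Hypothetical table stored in flash. */
static const uint16_t table_P[TABLE_SIZE] PROGMEM = { 100, 200, 300, 400 };

uint16_t table_lookup(uint8_t index) {
  assert(index < TABLE_SIZE);
  /* Wrong on AVR: "return table_P[index];" compiles without a
     warning, but reads RAM at that address instead of flash. */
  return pgm_read_word(&table_P[index]);
}
```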