dda.c has some limiting factors, e.g. the calculation of
move_duration: 'distance * 2400' must always stay below
UINT32_MAX.
We may find other limiting factors later, so the limit can now
be adjusted in dda.h.
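A minimal sketch of how such a limit follows from the multiplier. Only the factor 2400 and UINT32_MAX come from the text above; the macro and function names are made up for illustration and are not the ones used in dda.h.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplier taken from the text above; the macro names are hypothetical. */
#define DURATION_FACTOR UINT32_C(2400)

/* Largest distance for which distance * 2400 still fits into 32 bits. */
#define DISTANCE_LIMIT (UINT32_MAX / DURATION_FACTOR)

/* Returns 1 if the product cannot overflow a uint32_t, 0 otherwise. */
static int distance_in_range(uint32_t distance) {
  return distance <= DISTANCE_LIMIT;
}

/* Computes distance * 2400; callers have to respect the limit. */
static uint32_t duration_product(uint32_t distance) {
  assert(distance_in_range(distance));
  return distance * DURATION_FACTOR;
}
```

With a factor of 2400 the limit works out to 1789569. A different limiting factor would only change DURATION_FACTOR.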
We don't need to save the step_no. We can easily calculate it when needed.
Also some whitespace work. The only change in dda.h is the removal
of 'uint32_t step_no;'.
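One way such a value can be derived on demand, as a sketch only. It assumes the total step count and the number of remaining steps are tracked anyway; the structures and field names below are simplified, hypothetical stand-ins, not the firmware's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for the firmware structures. */
typedef struct { uint32_t total_steps; } DDA;
typedef struct { uint32_t steps_left; } MOVE_STATE;

/* Derives the number of steps already done instead of storing it. */
static uint32_t current_step_no(const DDA *dda, const MOVE_STATE *ms) {
  assert(ms->steps_left <= dda->total_steps);
  return dda->total_steps - ms->steps_left;
}
```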
Saves up to 16 clock cycles in dda_step():
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 209 clock cycles.
LED on time maximum: 504 clock cycles.
LED on time average: 241.441 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 22589.
LED on time minimum: 209 clock cycles.
LED on time maximum: 521 clock cycles.
LED on time average: 276.729 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 209 clock cycles.
LED on time maximum: 504 clock cycles.
LED on time average: 262.923 clock cycles.
In `ACCELERATION_RAMPING` code we use the dda->id field even when we do
not enable `LOOKAHEAD`. Expose the variable and its related `idcnt`
when `ACCELERATION_RAMPING` is used.
Add a regression-test to catch this in the future.
These values were queued up just for finding out individual axis
speeds in dda_find_crossing_speed(). Let's do this calculation
with other available movement properties and save 16 bytes of RAM
per movement queue entry.
The first version of this commit forgot to take care of the feedrate
sign (prevF, currF). This omission was found by @Wurstnase. The idea
of tweaking the calculation of 'dv' to achieve this is also by
@Wurstnase.
Setting the sign immediately after calculating the absolute values
was tried, but that resulted in larger (= slower) code.
Binary size down 132 bytes, among that two loops. RAM usage down
256 bytes for the standard test case:
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  17944 bytes      126%       59%       29%       14%
Data:      1920 bytes      188%       94%       47%       24%
EEPROM:      32 bytes        4%        2%        2%        1%
Neither of them brought a performance improvement, so we revert
both. Commits as well as revert kept to preserve the knowledge
gained.
This reverts commits
"DDA, dda_start(): use mb_tail_dda directly." and
"DDA, dda_start(): don't pass mb_tail_dda as parameter."
Performance and binary size are back to what we had before:
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19270 bytes      135%       63%       31%       15%
Data:      2179 bytes      213%      107%       54%       27%
EEPROM:      32 bytes        4%        2%        2%        1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 218 clock cycles.
LED on time maximum: 395 clock cycles.
LED on time average: 249.051 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 23648.
LED on time minimum: 237 clock cycles.
LED on time maximum: 438 clock cycles.
LED on time average: 272.216 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 237 clock cycles.
LED on time maximum: 395 clock cycles.
LED on time average: 262.572 clock cycles.
Instead, read the global variable directly.
The idea is that reading the global variable directly removes
the effort to build up a parameter stack, making things faster.
Actually, binary size increases by 4 bytes and the slowest step
takes 3 clock cycles longer. D'oh.
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19274 bytes      135%       63%       31%       15%
Data:      2179 bytes      213%      107%       54%       27%
EEPROM:      32 bytes        4%        2%        2%        1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 218 clock cycles.
LED on time maximum: 398 clock cycles.
LED on time average: 249.111 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 23648.
LED on time minimum: 237 clock cycles.
LED on time maximum: 441 clock cycles.
LED on time average: 272.222 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 237 clock cycles.
LED on time maximum: 398 clock cycles.
LED on time average: 262.576 clock cycles.
Not queuing up waits for the heaters in the movement queue removes
some code in performance critical paths. How lucky that we just
implemented an alternative M116 functionality with the previous
commit :-)
The slowest step got faster by a nice 29 clock cycles and binary
size decreased by a whopping 472 bytes. That's still 210 bytes
less than before implementing the alternative heater wait.
Best of all, average step time is down some 21 clock cycles, too,
so we increased general stepping performance by no less than 5%.
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19436 bytes      136%       64%       31%       16%
Data:      2177 bytes      213%      107%       54%       27%
EEPROM:      32 bytes        4%        2%        2%        1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 259 clock cycles.
LED on time maximum: 429 clock cycles.
LED on time average: 263.491 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 23648.
LED on time minimum: 251 clock cycles.
LED on time maximum: 472 clock cycles.
LED on time average: 286.259 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 251 clock cycles.
LED on time maximum: 429 clock cycles.
LED on time average: 276.616 clock cycles.
While this was an improvement of 9 clocks on AVRs, it had more
than the opposite effect on ARMs: 25 clocks slower on the slowest
step. Apparently ARMs aren't as efficient in reading and writing
single bits.
https://github.com/Traumflug/Teacup_Firmware/issues/189#issuecomment-262837660
Performance on AVR is back to what we had before:
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19610 bytes      137%       64%       31%       16%
Data:      2175 bytes      213%      107%       54%       27%
EEPROM:      32 bytes        4%        2%        2%        1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 280 clock cycles.
LED on time maximum: 549 clock cycles.
LED on time average: 286.273 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 23648.
LED on time minimum: 272 clock cycles.
LED on time maximum: 580 clock cycles.
LED on time average: 307.439 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 272 clock cycles.
LED on time maximum: 539 clock cycles.
LED on time average: 297.732 clock cycles.
In dda_step instead of checking our 32-bit-wide delta[n] value,
just check a single bit in an 8-bit field. Should be a tad faster.
It does make the code larger, but also about 10% faster, I think.
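The idea, sketched under assumptions: the mask is built once per movement, and the step routine then tests one bit per axis instead of a 32-bit value. The helper names are hypothetical, and AXIS_COUNT is added here for illustration only.

```c
#include <assert.h>
#include <stdint.h>

enum axis_e { X = 0, Y, Z, E, AXIS_COUNT };

/* Hypothetical helper: builds an 8-bit field with one bit set for
   each axis which has steps to do in this movement. */
static uint8_t active_axes_mask(const uint32_t delta[AXIS_COUNT]) {
  uint8_t mask = 0;

  for (int i = X; i < AXIS_COUNT; i++) {
    if (delta[i] != 0)
      mask |= (uint8_t)(1u << i);
  }
  return mask;
}

/* The cheap test for the step routine: a single bit of a single byte. */
static int axis_is_active(uint8_t mask, enum axis_e axis) {
  assert(axis < AXIS_COUNT);
  return (mask >> axis) & 1u;
}
```

The mask costs one extra byte per movement, while each per-step check shrinks from a four-byte comparison to a bit test.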
Performance:
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19696 bytes      138%       65%       32%       16%
Data:      2191 bytes      214%      107%       54%       27%
EEPROM:      32 bytes        4%        2%        2%        1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 263 clock cycles.
LED on time maximum: 532 clock cycles.
LED on time average: 269.273 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 23648.
LED on time minimum: 255 clock cycles.
LED on time maximum: 571 clock cycles.
LED on time average: 297.792 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 255 clock cycles.
LED on time maximum: 522 clock cycles.
LED on time average: 283.861 clock cycles.
Using the Bresenham algorithm it's safe to assume that if the axis
with the most steps is done, all other axes are done, too.
This way we save a lot of variable loading in dda_step(). We also
save this very expensive comparison of all axis counters against
zero. Minor drawback: update_current_position() is now even slower.
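A small host-side sketch of why this assumption holds. It is not the firmware's dda_step(), just a Bresenham-style loop which terminates on the step count of the fastest axis only and records how many steps each axis emitted. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define AXIS_COUNT 4

/* Runs a Bresenham-style movement. Only the step count of the
   fastest axis decides when the movement ends. */
static void run_move(const uint32_t delta[AXIS_COUNT],
                     uint32_t steps_done[AXIS_COUNT]) {
  uint32_t total_steps = 0;
  int32_t counter[AXIS_COUNT];

  /* The fastest axis is the one with the most steps. */
  for (int i = 0; i < AXIS_COUNT; i++) {
    if (delta[i] > total_steps)
      total_steps = delta[i];
  }
  assert(total_steps <= INT32_MAX / 2);  /* keep counters in range */

  for (int i = 0; i < AXIS_COUNT; i++) {
    counter[i] = -(int32_t)(total_steps >> 1);
    steps_done[i] = 0;
  }

  /* One iteration per step of the fastest axis, no per-axis check. */
  for (uint32_t n = total_steps; n > 0; n--) {
    for (int i = 0; i < AXIS_COUNT; i++) {
      counter[i] += (int32_t)delta[i];
      if (counter[i] > 0) {
        steps_done[i]++;
        counter[i] -= (int32_t)total_steps;
      }
    }
  }
}
```

After total_steps iterations every axis has emitted exactly delta[i] steps, so comparing all axis counters against zero adds nothing.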
About performance: the slowest step decreased from 719 to 604
clocks, which is quite an improvement. Average step time increased
for single axis movements by 16 clocks and decreased for multi-axis
movements. The bottom line is that this should improve real-world
performance quite a bit, because a printer's movement speed isn't
limited by average timings, but by the time needed for the slowest
step.
Along the way, binary size dropped by a nice 244 bytes and RAM
usage by an also nice 16 bytes.
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19564 bytes      137%       64%       31%       16%
Data:      2175 bytes      213%      107%       54%       27%
EEPROM:      32 bytes        4%        2%        2%        1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 326 clock cycles.
LED on time maximum: 595 clock cycles.
LED on time average: 333.62 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 23648.
LED on time minimum: 318 clock cycles.
LED on time maximum: 604 clock cycles.
LED on time average: 333.311 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 318 clock cycles.
LED on time maximum: 585 clock cycles.
LED on time average: 335.233 clock cycles.
We need the fastest axis instead of its steps.
Also eliminates an overflow when ACCELERATION > 596.
We save 118 bytes program and 2 bytes data.
Reviewer Traumflug's note: I see 100 bytes program and 32 bytes
RAM saving on ATmegas here. 16 and 32 on the LPC 1114. Either way:
great stuff!
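The commit does not say which expression overflowed. One product that matches the stated threshold is 7200000 * ACCELERATION, a term which appears in the ramp length test code elsewhere in this log. Treat that connection as an assumption; the sketch below only checks the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Constant borrowed from the ramp test code in this log. That it is
   the overflowing term is an assumption. */
#define RAMP_CONSTANT UINT32_C(7200000)

/* Returns 1 if RAMP_CONSTANT * acceleration fits into 32 bits. */
static int accel_product_fits_32_bits(uint32_t acceleration) {
  return (uint64_t)RAMP_CONSTANT * acceleration <= UINT32_MAX;
}

/* The 32-bit product, valid only within the range checked above. */
static uint32_t accel_product(uint32_t acceleration) {
  assert(accel_product_fits_32_bits(acceleration));
  return RAMP_CONSTANT * acceleration;
}
```

7200000 * 596 = 4291200000 still fits below 2^32 = 4294967296, while 7200000 * 597 = 4298400000 does not.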
Similar to M221 which sets a variable flow rate percentage, add
support for M220 which sets a percentage modifier for the
feedrate, F.
It seems a little disturbing that the flow rate modifies the next
G1 command and does not touch the buffered commands, but this
seems like the only reasonable thing to do since the M221 setting
could be embedded in the source gcode for some use cases. Perhaps
an "immediate" setting using P1 could be considered later if
needed.
`target` is an input to dda_create, but we don't modify it. We
copy it into dda->endpoint and modify that instead, if needed.
Make `target` const so this treatment is explicit.
Rely on dda->endpoint to hold our "target" data so any decisions
we make leading up to using it will be correctly reflected in our
math.
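A sketch of the pattern, with heavily simplified structures. The feedrate clamp and MAX_FEEDRATE are made-up examples of a modification "if needed"; the real dda_create() does a lot more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum axis_e { X = 0, Y, Z, E, AXIS_COUNT };

/* Simplified stand-ins for the firmware types. */
typedef struct {
  int32_t axis[AXIS_COUNT];
  uint32_t F;
} TARGET;

typedef struct {
  TARGET endpoint;
} DDA;

#define MAX_FEEDRATE 6000  /* hypothetical limit, for illustration */

/* 'target' is input only, so the compiler rejects writes to it. */
static void dda_create(DDA *dda, const TARGET *target) {
  assert(dda != NULL && target != NULL);

  /* Work on our own copy ... */
  dda->endpoint = *target;

  /* ... and apply adjustments there, never to the caller's data. */
  if (dda->endpoint.F > MAX_FEEDRATE)
    dda->endpoint.F = MAX_FEEDRATE;
}
```

All later math reads dda->endpoint, so an adjustment like the clamp above can't be missed by accident.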
The flow rate is given as a percentage which is kept as
100 = 100% internally. But this means we must divide by 100 for
every movement which can be expensive. Convert the value to
256 = 100% so the compiler can optimize the division to a
byte-shift.
Also, avoid the math altogether in the normal case where the
flow rate is already 100% and no change is required.
Note: This also requires an increase in the size of e_multiplier
to 16 bits so values >= 100% can be stored. Previously flow
rates only up to 255% (2.5x) were supported which may have
surprised some users. Now the flow rate can be as high as
10000% (100x), at least internally.
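The conversion and the shortcut, sketched with hypothetical function names. Only the 256 = 100% scale, the 16-bit multiplier and the skipped math at 100% come from the text above; rounding and types are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Converts a percentage (100 = 100%) to the 256 = 100% scale,
   rounded to the nearest value. Done once, when M221 is received. */
static uint16_t percent_to_multiplier(uint16_t percent) {
  assert(percent <= 10000);
  return (uint16_t)(((uint32_t)percent * 256 + 50) / 100);
}

/* Applies the flow rate to an extruder distance. Done per movement. */
static uint32_t apply_flow(uint32_t e_distance, uint16_t e_multiplier) {
  /* Normal case: 100%, no math at all. */
  if (e_multiplier == 256)
    return e_distance;

  /* The product has to fit into 32 bits. */
  assert(e_multiplier == 0 || e_distance <= UINT32_MAX / e_multiplier);

  /* Dividing by 256 is a shift by one byte. */
  return (e_distance * e_multiplier) >> 8;
}
```

The price is a small rounding error: 90% becomes 230/256, which is about 89.8%.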
Now it is possible to control the extruder's flow.
M221 S100 = 100% of the extruder's steps
M221 S90 = 90% of the extruder's steps
M221 is also used in other firmwares for this. A lot of hosts,
like OctoPrint and Pronterface, use this M-code for this
behaviour, too.
REPRAP style acceleration broke quite a while ago, but no one noticed.
Maybe it's not being used, and therefore also not tested. But it should
at least compile while it remains an option.
The compiler complains that dda->n is not defined and that current_id is
never used. The first bug goes back to f0b9daeea0 in late 2013.
In the interest of supporting exploratory accelerations, fix this to
build when ACCELERATION_REPRAP is chosen.
We previously put replacements for the von Neumann architecture
into arduino.h already, now let's complete this by having only
one #include <avr/pgmspace.h> in arduino.h. Almost all sources
include arduino.h anyways, so this is mostly a code reduction.
As we can always only move towards one end of an axis, one common
variable to count debouncing is sufficient.
Binary size 12 bytes smaller (and faster).
'all_time' sounds like forever to me, but this variable really
tracks the last time we hit one of "all the axes". It sticks
out more now in looping, so rename it to make sense.
A generic implementation here will allow callers to pass the
target axis in as a parameter so the callers can also be made more
generic.
Traumflug notes:
Split out application of the new implementation in dda.c into its
own commit.
This actually costs 128 bytes, but as we can access axes from within
a loop now, I expect to get more savings elsewhere.
Interestingly, binary size is raised by another 18 bytes if
um_to_steps(int32_t, enum axis_e)
is changed to
um_to_steps(enum axis_e, int32_t)
even on the 8-bit ATmega. While putting the axis number to the
front might be a bit more logical (think of additional parameters,
the axis number position would move), NXP application note
AN10963 states on page 10ff, 16-bit data should be 16-bit aligned
and 32-bit data should be 32-bit aligned for best performance.
Well, so let's do it this way.
Many places in the code use individual variables for int/uint values
for X, Y, Z, and E. A tip from a comment suggests making these into
arrays for scalability in the future. Replace the discrete variables
with arrays so the code can be simplified in the future.
In preparation for more efficient and scalable code using axis-loops
for common operations, add two new array-types for signed and unsigned
32-bit values per axis. Make the TARGET type use this array instead of
its current X, Y, Z, and E variables.
Traumflug notes:
- Did the usual conversion to spaces for changed lines.
- Added X = 0 to the enum. Just for peace of mind.
- Excellent patch!
Initially I wanted to make the new array an anonymous union with the
old variables to allow accessing values both ways. This way it would
have been possible to do the transition in smaller pieces. But as
the patch worked so flawlessly and binary size is precisely the
same, I abandoned this idea. Maybe it's a good idea in other areas.
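What the array types buy, as a sketch. The enum with X = 0 is mentioned above; the type names, AXIS_COUNT, the TARGET layout and the helper function are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum axis_e { X = 0, Y, Z, E, AXIS_COUNT };

/* One signed and one unsigned 32-bit value per axis. */
typedef int32_t axes_int32_t[AXIS_COUNT];
typedef uint32_t axes_uint32_t[AXIS_COUNT];

/* TARGET with an array instead of discrete X, Y, Z and E members. */
typedef struct {
  axes_int32_t axis;
  uint32_t F;
} TARGET;

/* Hypothetical helper: absolute distance per axis, done in a loop
   instead of four copies of the same code. Assumes coordinates far
   enough from the int32_t limits for the subtraction not to overflow. */
static void axes_delta(const TARGET *from, const TARGET *to,
                       axes_uint32_t delta) {
  assert(from != NULL && to != NULL);

  for (int i = X; i < AXIS_COUNT; i++) {
    int32_t d = to->axis[i] - from->axis[i];
    delta[i] = d < 0 ? 0u - (uint32_t)d : (uint32_t)d;
  }
}
```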
Test code which wants to customize config.h can do so without
touching config.h itself by wrapping config.h in a macro variable
which is passed in to the compiler. It defaults to "config.h" if
no override is provided.
This change would break makefile dependency checking since the selection
of a different header file on the command line is not noticed by make
as a build-trigger. To solve this, we add a layer to the BUILDDIR path
so build products are now specific to the USER_CONFIG choice if it is
not "config.h".
There's no apparent documentation for this on the AVR variant of GCC.
Likely it means to optimize "more aggressively". Uhm, is gcc
intentionally wasting cycles otherwise? Likely not.
Also, the compilation result is exactly the same size with or
without this attribute.
For now this is for the initial rampup calculation only, notably
for moving the Z axis (which otherwise gets far too few rampup
steps on a typical Mendel-like printer).
The used macro was verified with this test code (in mendel.c):
[...]
  int main (void) {
    init();
    uint32_t speed, spm;
    char string[128];
    for (spm = 2000; spm < 4099000; spm <<= 1) {
      for (speed = 11; speed < 65536; speed *= 8) {
        sersendf_P(PSTR("spm = %lu speed %lu ==> macro %lu "),
                   spm, speed, ACCELERATE_RAMP_LEN_SPM(speed, spm));
        delay_ms(10);
        sprintf(string, "double %f\n",
                (double)speed * (double)speed /
                ((double)7200000 * (double)ACCELERATION / (double)spm));
        serial_writestr((uint8_t *)string);
        delay_ms(10);
      }
    }
[...]
Note: to link the test code, this linker flag is required to add
the full printf library (which does print doubles):
LDFLAGS += -Wl,-u,vfprintf -lprintf_flt -lm
Previously, ramps were calculated with the combined speed,
which can differ from the speed of the fast axis by factor 2.
This solves part 2 of issue #68.
This obviously requires less space on the stack and accordingly a
few CPU cycles less, but more importantly, it lets dda_start()
decide whether a previous movement is to be taken into account
or not.
To make this decision more reliable, add a flag for movements done.
Otherwise we might try to join with a movement finished long
before.
This is a preparation for starting a move from non-zero speeds,
which is needed for look-ahead. Keeping both variables in
move_state and doing the calculations in dda_start() is possible
in principle, but might not fit the tight time budget we have when
going from one movement to the next at high step rates.
To deal with this, we have to pre-calculate n and c, so we have
to move it back into the DDA structure. It was there a year ago
already, but moved into move_state to save RAM (move_state exists
only once, dda as often as there are movement queue entries).
Before, endstops were checked on every step, wasting precious time.
Checking them 500 times a second should be more than sufficient.
Additionally, an endstop stop now properly decelerates the movement.
This is one important step towards handling accidental endstop hits
gracefully, as it avoids step losses in such situations.
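One possible scheme for such rate limiting, as a sketch. The text above only gives the rate of 500 checks per second; how the firmware derives it is not stated, so the microsecond timestamp and all names here are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENDSTOP_CHECK_INTERVAL_US 2000  /* 500 checks per second */

typedef struct {
  uint32_t last_check_us;
} endstop_timer_t;

/* Returns 1 if at least 2 ms passed since the last check and records
   the new check time, else returns 0. */
static int endstop_check_due(endstop_timer_t *timer, uint32_t now_us) {
  assert(timer != NULL);

  /* Unsigned subtraction stays correct across timer wrap-around. */
  if ((uint32_t)(now_us - timer->last_check_us) < ENDSTOP_CHECK_INTERVAL_US)
    return 0;

  timer->last_check_us = now_us;
  return 1;
}
```

The step routine then skips reading the endstop pins unless endstop_check_due() reports 1.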
This means modifying existing code to let the lookahead algorithms
do their work. It also means removing some unused code in
dda_lookahead.c and reordering some code to make it work with
LOOKAHEAD undefined.
This gets rid of overflows at micrometer to step conversion as
much as possible within 31 bits. It also opens the door to get
STEPS_PER_M configurable at runtime.
This also costs 290 bytes, unfortunately.
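To illustrate the overflow problem: a naive 32-bit 'um * steps_per_m' overflows after a few millimeters already. The sketch below sidesteps this with a 64-bit intermediate, which is the simplest approach on a host, but not necessarily what the firmware does, as 64-bit math is expensive on an 8-bit ATmega. Function names and the example values are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest distance in um a naive 32-bit product can handle. */
static uint32_t naive_limit_um(uint32_t steps_per_m) {
  assert(steps_per_m > 0);
  return UINT32_MAX / steps_per_m;
}

/* Micrometer to step conversion with a widened intermediate,
   rounded to the nearest step. */
static int32_t um_to_steps_sketch(int32_t um, uint32_t steps_per_m) {
  int64_t product = (int64_t)um * steps_per_m;
  int64_t steps = (product + (product < 0 ? -500000 : 500000)) / 1000000;

  /* The result still has to fit into a signed 32-bit value. */
  assert(steps >= INT32_MIN && steps <= INT32_MAX);
  return (int32_t)steps;
}
```

At 320000 steps/m the naive product is good for 13421 um only, so not even 14 mm of travel.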
These were commits 9dbfa7217e0de8b140846ab480d6b4a7fc9b6791 and
2b596cb05e621ed822071486f812eb334328267a.
There are several reasons why this new approach didn't work out well:
- The machine coordinate system is lost on relative movements.
OK, we could keep tracking it, but this would mean even more
code, so even more chances for bugs.
- With the lost coordinate system, no software endstops are possible.
- Neither of X, Y, Z will ever overflow.
- If a movement planner appears one day, it would have to handle
relative movements as well. Even more code duplication.
Instead of converting them to absolute first, then back to
relative and having all the fuss with working on the queue's
start vs. working at the queue's end, mark a movement as relative
and use this directly.
The implementation is slightly different this time, as it's not
using the famous Bresenham algorithm. The intention is to
allow axis-independent movements, as required for
EMC-quality look-ahead.
This is an intrusive patch and for now, it's done for the X axis only.
To make comparison with the former approach easier ...
The advantages of this change:
- Converting from mm to steps in gcode_parse.c and back in dda.c
wastes cycles and accuracy.
- In dda.c, UM_PER_STEP simply goes away, so distance calculations
work now with STEPS_PER_MM > 500 just fine. 1/16 microstepping
on threaded rods (Z axis) becomes possible.
- Distance calculations (feedrate, acceleration, ...) become much
simpler.
- A wide range of STEPS_PER_M can now be handled at reasonable
(4 decimal digit) accuracy with a simple macro. Formerly,
we were limited to 500 steps/mm, now we can do 4'096 steps/mm
and could easily raise this another digit.
Disadvantages:
- STEPS_PER_MM is gone in config.h, using STEPS_PER_M is required,
because the preprocessor refuses to compare numbers with decimal
points in them.
- The DDA has to store the position in steps anyways to avoid
rounding errors.
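Why integers matter here, as a sketch. The macro name and the value of 320 steps/mm are hypothetical; the limit of 4096 steps/mm is the one mentioned above. With steps per meter, a fractional value like 80.5 steps/mm becomes the plain integer 80500, which '#if' can compare.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical configuration value: 320 steps/mm, given per meter. */
#define STEPS_PER_M_X 320000

/* An integer can be range-checked by the preprocessor. The same check
   on a value like 320.0 is rejected, because '#if' evaluates integer
   constant expressions only. */
#if STEPS_PER_M_X > 4096000
  #error "STEPS_PER_M_X is above 4096 steps/mm."
#endif

/* Whole millimeters to steps, without floating point math. */
static uint32_t mm_to_steps_x(uint32_t mm) {
  assert(mm <= UINT32_MAX / STEPS_PER_M_X);
  return mm * (uint32_t)STEPS_PER_M_X / 1000;
}
```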