While this was an improvement of 9 clocks on AVRs, it had more
than the opposite effect on ARMs: the slowest step became 25 clocks
slower. Apparently ARMs aren't as efficient at reading and writing
single bits.
https://github.com/Traumflug/Teacup_Firmware/issues/189#issuecomment-262837660
Performance on AVR is back to what we had before:
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19610 bytes      137%       64%       31%       16%
   Data:   2175 bytes      213%      107%       54%       27%
 EEPROM:     32 bytes        4%        2%        2%        1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 280 clock cycles.
LED on time maximum: 549 clock cycles.
LED on time average: 286.273 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 23648.
LED on time minimum: 272 clock cycles.
LED on time maximum: 580 clock cycles.
LED on time average: 307.439 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 272 clock cycles.
LED on time maximum: 539 clock cycles.
LED on time average: 297.732 clock cycles.
In dda_step instead of checking our 32-bit-wide delta[n] value,
just check a single bit in an 8-bit field. Should be a tad faster.
It does make the code larger, but also about 10% faster, I think.
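A minimal sketch of the idea, not the actual dda.c code. It assumes
the 8-bit field holds one "axis moves" flag per axis; move_t,
active_axes, move_prepare() and axis_is_active() are hypothetical
names used for illustration only.

```c
#include <stdint.h>

enum axis_e { X = 0, Y, Z, E, AXIS_COUNT };

/* Hypothetical, simplified move descriptor. */
typedef struct {
  uint32_t delta[AXIS_COUNT];  /* steps to travel per axis */
  uint8_t  active_axes;        /* bit n is set if delta[n] != 0 */
} move_t;

/* Fill the bit field once, when the move is set up. */
static void move_prepare(move_t *m) {
  m->active_axes = 0;
  for (uint8_t n = 0; n < AXIS_COUNT; n++) {
    if (m->delta[n] != 0)
      m->active_axes |= (uint8_t)(1u << n);
  }
}

/* Per-step test: a single byte load and bit test instead of a
   32-bit comparison against zero. */
static uint8_t axis_is_active(const move_t *m, enum axis_e n) {
  return (uint8_t)((m->active_axes & (1u << n)) != 0);
}
```

The 32-bit compare is paid once per move in move_prepare(); the
step routine only touches one byte.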
Performance:
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19696 bytes      138%       65%       32%       16%
   Data:   2191 bytes      214%      107%       54%       27%
 EEPROM:     32 bytes        4%        2%        2%        1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 263 clock cycles.
LED on time maximum: 532 clock cycles.
LED on time average: 269.273 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 23648.
LED on time minimum: 255 clock cycles.
LED on time maximum: 571 clock cycles.
LED on time average: 297.792 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 255 clock cycles.
LED on time maximum: 522 clock cycles.
LED on time average: 283.861 clock cycles.
This time we don't test for remaining steps, but whether the axis
moves at all. A much cheaper test, because this variable has to
be loaded into registers anyway.
Performance is now even better than without this test. Slowest
step down from 604 to 580 clock cycles.
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19610 bytes      137%       64%       31%       16%
   Data:   2175 bytes      213%      107%       54%       27%
 EEPROM:     32 bytes        4%        2%        2%        1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 280 clock cycles.
LED on time maximum: 549 clock cycles.
LED on time average: 286.273 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 23648.
LED on time minimum: 272 clock cycles.
LED on time maximum: 580 clock cycles.
LED on time average: 307.439 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 272 clock cycles.
LED on time maximum: 539 clock cycles.
LED on time average: 297.732 clock cycles.
Apparently gcc doesn't manage to sort nested calculations. Putting
all the muldiv()s into one line gives this error:
dda.c: In function ‘update_current_position’:
dda.c:969:1: error: unable to find a register to spill in class ‘POINTER_REGS’
}
^
dda.c:969:1: error: this is the insn:
(insn 81 80 259 4 (set (reg:SI 82 [ D.3267 ])
(mem:SI (post_inc:HI (reg:HI 2 r2 [orig:121 ivtmp.106 ] [121])) [4 MEM[base: _97, offset: 0B]+0 S4 A8])) dda.c:952 95 {*movsi}
(expr_list:REG_INC (reg:HI 2 r2 [orig:121 ivtmp.106 ] [121])
(nil)))
dda.c:969: confused by earlier errors, bailing out
This problem was solved by doing the calculation step by step,
using intermediate variables. Glad I could help you, gcc :-)
Moving performance unchanged, M114 accuracy should have improved,
binary size 18 bytes bigger:
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19582 bytes      137%       64%       31%       16%
   Data:   2175 bytes      213%      107%       54%       27%
 EEPROM:     32 bytes        4%        2%        2%        1%
Using the Bresenham algorithm it's safe to assume that if the axis
with the most steps is done, all other axes are done, too.
This way we save a lot of variable loading in dda_step(). We also
save this very expensive comparison of all axis counters against
zero. Minor drawback: update_current_position() is now even slower.
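A minimal sketch of why this is safe, assuming a textbook
multi-axis Bresenham and not the real dda_step(). bresenham_t and
its fields are hypothetical; the steps[] counters exist only so the
result can be checked.

```c
#include <stdint.h>

#define AXIS_COUNT 4

typedef struct {
  uint32_t delta[AXIS_COUNT];    /* total steps per axis */
  int32_t  counter[AXIS_COUNT];  /* Bresenham error terms */
  uint32_t steps[AXIS_COUNT];    /* steps emitted so far (checking only) */
  uint32_t total_steps;          /* steps of the fastest axis */
  uint32_t remaining;            /* fastest-axis steps still to do */
} bresenham_t;

static void bresenham_init(bresenham_t *b, const uint32_t delta[AXIS_COUNT]) {
  b->total_steps = 0;
  for (int n = 0; n < AXIS_COUNT; n++) {
    b->delta[n] = delta[n];
    b->steps[n] = 0;
    if (delta[n] > b->total_steps)
      b->total_steps = delta[n];
  }
  for (int n = 0; n < AXIS_COUNT; n++)
    b->counter[n] = -(int32_t)(b->total_steps >> 1);
  b->remaining = b->total_steps;
}

/* One step interrupt. Returns 1 while the move is in progress and
   0 once it is finished. Only the fastest axis is counted down;
   no per-axis comparison against zero is needed. */
static int bresenham_step(bresenham_t *b) {
  if (b->remaining == 0)
    return 0;
  for (int n = 0; n < AXIS_COUNT; n++) {
    b->counter[n] += (int32_t)b->delta[n];
    if (b->counter[n] > 0) {
      b->steps[n]++;                            /* would pulse the stepper */
      b->counter[n] -= (int32_t)b->total_steps;
    }
  }
  b->remaining--;
  return b->remaining != 0;
}
```

After total_steps calls every axis has emitted exactly delta[n]
steps, so the single remaining counter is sufficient to detect the
end of the move.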
About performance: the slowest step decreased from 719 to 604
clocks, which is quite an improvement. Average step time increased
by 16 clocks for single-axis movements and decreased for
multi-axis movements. The bottom line is that this should improve
real-world performance quite a bit, because a printer's movement
speed isn't limited by average timings, but by the time needed for
the slowest step.
Along the way, binary size dropped by a nice 244 bytes and RAM
usage by an also nice 16 bytes.
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19564 bytes      137%       64%       31%       16%
   Data:   2175 bytes      213%      107%       54%       27%
 EEPROM:     32 bytes        4%        2%        2%        1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 326 clock cycles.
LED on time maximum: 595 clock cycles.
LED on time average: 333.62 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 23648.
LED on time minimum: 318 clock cycles.
LED on time maximum: 604 clock cycles.
LED on time average: 333.311 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 318 clock cycles.
LED on time maximum: 585 clock cycles.
LED on time average: 335.233 clock cycles.
Our standard performance test is to run these three G-code files
in SimulAVR and record step pulse timings. While this certainly
doesn't cover everything related to possible performance
measurements, it's a good basic standard to compare code changes.
Current performance:
ATmega sizes               '168   '328(P)   '644(P)     '1280
Program:  19808 bytes      139%       65%       32%       16%
   Data:   2191 bytes      214%      107%       54%       27%
 EEPROM:     32 bytes        4%        2%        2%        1%
short-moves.gcode statistics:
LED on occurences: 888.
LED on time minimum: 308 clock cycles.
LED on time maximum: 729 clock cycles.
LED on time average: 317.393 clock cycles.
smooth-curves.gcode statistics:
LED on occurences: 23648.
LED on time minimum: 308 clock cycles.
LED on time maximum: 726 clock cycles.
LED on time average: 354.825 clock cycles.
triangle-odd.gcode statistics:
LED on occurences: 1636.
LED on time minimum: 308 clock cycles.
LED on time maximum: 719 clock cycles.
LED on time average: 336.327 clock cycles.
Traumflug's note: if one uses #define LOOKAHEAD_DEBUG at line 177,
one should use the same symbol in line 321. Edited the commit to
do so.
This reduces binary size by 38 bytes and RAM usage by 4 bytes.
PCBScriber is a printer for the scratch 'n etch method, see
http://reprap.org/wiki/PCBScriber
Commit reviewer Traumflug's note:
- Rebased to current branch 'experimental', which adds
USE_INTERNAL_PULLDOWNS.
- Removed DEFINE_HOMING for now, this part isn't cooked, yet.
For example, it doesn't pass regression tests.
- Thank you very much for the contribution!
This was an attempt to make Teacup sources compatible with
Arduino IDE 1.6.0 - 1.6.9 and became obsolete as of 1.6.10. The
problem was fixed on the Arduino IDE side.
We calculate all steps from the fastest axis now. So X and Y
steps_per_m don't have to be the same anymore.
Traumflug's note: another 16 bytes of program size saved on AVR,
same size on LPC1114.
We need the fastest axis instead of its steps.
This also eliminates an overflow when ACCELERATION > 596.
We save 118 bytes program and 2 bytes data.
Reviewer Traumflug's note: I see 100 bytes program and 32 bytes
RAM saving on ATmegas here. 16 and 32 on the LPC 1114. Either way:
great stuff!
This should fix issue #235.
Recently ConfigTool has been very slow for me on Ubuntu Linux.
When I run the app there is a 15 second wait before the window is
first displayed. I bisected the problem and found it was tied to
the number of pins in `pinNames`, and ultimately that it was
caused by a slow initializer in wx.Choice() when the choices are
loaded when the widget is created. For some reason, moving the
load after the widget is created is significantly faster. This
change reduces my startup time to just under 4 seconds.
Further speedup could be had by using lazy initialization of the
controls. But the controls are too bound up in the loaded data
to make this simple. Maybe I will attack it later.
There is still a significant delay when closing the window, but I
haven't tracked what causes it. Maybe it is caused just by
destroying all these pin controls.
In the process of making this change, I wanted to simplify the
number of locations that bothered to copy the pinNames list and,
to support lazy loading, to try to keep the same list in all
pinChoice controls. I noticed that all the pinChoice controls
already have the same parameters passed to the addPinChoice
function which makes them redundant and confusing. I removed the
extra initializers and just rely on pinNames as the only list
option in addPinChoice for now. Maybe this flexibility is needed
for some reason later, but I can't see a purpose for it now.
Notes by reviewer Traumflug:
First of all, which "trick"? That's an excellent code
simplification and if this happens to make startup faster (it
does), all the better.
Measured startup & shutdown time here (click window close as soon
as it appears):
Before:               With this commit:
real    0m4.222s      real    0m3.780s
user    0m3.864s      user    0m3.452s
sys     0m0.084s      sys     0m0.100s
As the speedup was far more significant on the commit author's
machine, it might be a memory consumption issue (leading to
swapping on a machine with little RAM). Linux lets you view this
in /proc/<pid>/status.
             Before:        Now:
VmPeak:    708360 kB    708372 kB
VmSize:    658916 kB    658756 kB
VmHWM:      73792 kB     73492 kB
VmRSS:      73792 kB     73492 kB
VmData:    402492 kB    402332 kB
Still no obvious indicator, but a 300 kB smaller memory footprint
is certainly nice.
If you attempt to create a Steinhart-Hart table in the configtool
with parameters (4700, 25, 100000, 209, 475, 256, 201), it fails
with:
...
File "/Users/drf/2014/RepRap/GIT/Teacup_Firmware/configtool/
thermistortablefile.py", line 169, in SteinhartHartTable
(i, int(t * 4), int(delta * 4 * 256), c, int(t), int(round(r))),
TypeError: not enough arguments for format string
Caught and fixed by dr5fn, this should fix issue #246.
Heck, that's simply forbidden. A C compiler would have caught this
in a split second at compile time; Python didn't until the faulty
code section was actually executed (a section of code for rare
cases).
The simple fix is to replace the old tuple with a changed, new
tuple.
This resolved issue #242.
Similar to M221 which sets a variable flow rate percentage, add
support for M220 which sets a percentage modifier for the
feedrate, F.
It seems a little disturbing that the flow rate modifies the next
G1 command and does not touch the buffered commands, but this
seems like the only reasonable thing to do since the M221 setting
could be embedded in the source gcode for some use cases. Perhaps
an "immediate" setting using P1 could be considered later if
needed.
`target` is an input to dda_create, but we don't modify it. We
copy it into dda->endpoint and modify that instead, if needed.
Make `target` const so this treatment is explicit.
Rely on dda->endpoint to hold our "target" data so any decisions
we make leading up to using it will be correctly reflected in our
math.
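A minimal sketch of the pattern, with simplified, hypothetical
types (target_t and dda_t here are stand-ins, not the real
structures) and a made-up adjustment (forcing a non-zero feedrate)
to show where modifications go.

```c
#include <stdint.h>

/* Simplified stand-ins for the real structures. */
typedef struct {
  int32_t  axis[4];
  uint32_t F;
} target_t;

typedef struct {
  target_t endpoint;
} dda_t;

static void dda_create_sketch(dda_t *dda, const target_t *target) {
  /* Copy the caller's data; all later changes happen on the copy. */
  dda->endpoint = *target;

  /* Hypothetical adjustment, done on dda->endpoint only. */
  if (dda->endpoint.F == 0)
    dda->endpoint.F = 1;

  /* Writing "target->F = 1;" here would not compile, which is the
     point of making the parameter const. */

  /* All further math reads dda->endpoint, never *target. */
}
```

With this, the caller's target is guaranteed to stay untouched and
every later calculation sees the adjusted values.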
In a test, the system worked fine even for a change in config.h,
which is #included by a variable (config_wrapper.h, line 20).
This should speed up repeated regression tests, e.g. when doing a
'git regtest', substantially.
Disable it only when appropriate, of course.
Moving this code makes Teacup compile with both
ACCELERATION_REPRAP and LOOKAHEAD enabled. Such a configuration
makes no sense, but can happen anyway.
The flow rate is given as a percentage which is kept as
100 = 100% internally. But this means we must divide by 100 for
every movement which can be expensive. Convert the value to
256 = 100% so the compiler can optimize the division to a
byte-shift.
Also, avoid the math altogether in the normal case where the
flow rate is already 100% and no change is required.
Note: This also requires an increase in the size of e_multiplier
to 16 bits so values >= 100% can be stored. Previously flow
rates only up to 255% (2.55x) were supported, which may have
surprised some users. Now the flow rate can be as high as
10000% (100x), at least internally.
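A minimal sketch of the arithmetic, not the code of this commit.
percent_to_q8() and apply_flow() are hypothetical helpers; the
64-bit intermediate is used here only to keep the sketch safe from
overflow and is not claimed to match the firmware.

```c
#include <stdint.h>

/* Convert the user-facing percentage (M221 S<percent>) to the
   internal scale where 256 means 100%, rounded to nearest. */
static uint16_t percent_to_q8(uint16_t percent) {
  return (uint16_t)(((uint32_t)percent * 256u + 50u) / 100u);
}

/* Scale the extruder steps of one movement. */
static uint32_t apply_flow(uint32_t e_steps, uint16_t e_multiplier) {
  if (e_multiplier == 256u)
    return e_steps;  /* normal case: 100%, no math at all */
  /* Dividing an unsigned value by 256 compiles to a shift. */
  return (uint32_t)(((uint64_t)e_steps * e_multiplier) / 256u);
}
```

The division by 100 now happens once, when M221 is received,
instead of once per movement. Note that percentages which are not
exactly representable in 1/256 steps get rounded, e.g. 90% becomes
230/256.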
Now it is possible to control the extruder's flow.
M221 S100 = 100% of the extruder's steps
M221 S90 = 90% of the extruder's steps
M221 is also used in other firmwares for this. Also, a lot of
hosts, like Octoprint and Pronterface, use this M-Code for
this behaviour.
Note a performance improvement opportunity.
Review note by Traumflug: the original commit didn't add a
comment, but replaced the existing code with what's in the
comment now.
According to the comment in issue #223:
Pre-unroll:
LED on time minimum: 3138.44 clock cycles.
LED on time maximum: 5108.8 clock cycles.
LED on time average: 4590.58 clock cycles.
Unrolled:
LED on time minimum: 3016.92 clock cycles.
LED on time maximum: 4987.28 clock cycles.
LED on time average: 4469.06 clock cycles.
Thermistors and AD595 can be faster in that mode.
The new strategy is:
1. read the value
2. start the adc
3. return the result
- next cycle
instead of:
1. start the adc
- wait 10ms
2. read the value
3. return the result
- next cycle
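The new ordering can be sketched like this. It is a simulation
with stand-in functions, not the real ADC driver: fake_adc_input,
fake_adc_result, adc_start() and adc_result() are hypothetical and
the "conversion" completes instantly.

```c
#include <stdint.h>

/* Stand-ins for the hardware. */
static uint16_t fake_adc_input;   /* the analog value at the pin */
static uint16_t fake_adc_result;  /* the ADC result register */

/* Start a conversion; on real hardware it finishes in the
   background, long before the next read cycle. */
static void adc_start(void) {
  fake_adc_result = fake_adc_input;
}

static uint16_t adc_result(void) {
  return fake_adc_result;
}

/* One read cycle using the new strategy. */
static uint16_t temp_read_continuous(void) {
  uint16_t value = adc_result();  /* 1. read the value */
  adc_start();                    /* 2. start the adc */
  return value;                   /* 3. return the result */
}
```

Each call returns the conversion started by the previous call, so
no waiting is needed inside a cycle. In this sketch the very first
call returns whatever the result register held before any
conversion was started.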
Review changes by Traumflug: fixed the warnings appearing in some
configurations (case NEEDS_START_ADC undefined and case
NEEDS_START_ADC defined, but TEMP_READ_CONTINUOUS == 0)
This allows using EWMA_ALPHA in an #if clause, which is needed
for the next commit.
Review changes by Traumflug: made changes to comments more
complete, added rounding ("+ 500") and also adjusted Configtool
for the change.
After firmware startup it's always in a valid range, even in the
unlikely case analog_init() is called twice.
This saves 4 bytes binary size without drawback.
If we have EWMA mode turned on, then the user wants to average
several samples from the temp sensors over time. But now we read
temp sensors only 4 times per second, making this averaging take
much longer.
Read the temperatures continuously -- as fast as supported by the
probe type -- if we are using weighted averaging (TEMP_EMWA < 1.0).
Heater PID loops must be called every 250ms, and temperature
probes do not need to be called any more often than that. Some
probes require some asynchronous operations to complete before
they're ready. Handle these in a state machine that first begins
the conversion and finally completes it on some future tick.
Signal it is complete by setting the new state variable to IDLE.
Kick off the heater PID loop by simply beginning the temperature
conversion on all the temperature probes. When each completes,
it will finish the process by calling its PID routine.
Remove the "next_read_time" concept altogether and just run each
temp conversion at fixed 250ms intervals.
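A minimal sketch of such a state machine, not the actual temp.c
code. The state names other than IDLE, the sensor_t structure, the
tick counts and heater_pid() are hypothetical stand-ins.

```c
#include <stdint.h>

#define NUM_SENSORS      2
#define CONVERSION_TICKS 3  /* assumed duration of a conversion */

typedef enum { TEMP_IDLE, TEMP_START, TEMP_CONVERTING } temp_state_t;

typedef struct {
  temp_state_t state;
  uint8_t  ticks_left;  /* ticks until the conversion is done */
  uint16_t reading;     /* last completed reading */
  uint8_t  pid_calls;   /* counts PID runs, for checking only */
} sensor_t;

static sensor_t sensors[NUM_SENSORS];

/* Stand-in for the heater PID routine. */
static void heater_pid(sensor_t *s) {
  s->pid_calls++;
}

/* Called every 250 ms: kick off a conversion on all probes. */
static void temp_start_all(void) {
  for (uint8_t i = 0; i < NUM_SENSORS; i++) {
    if (sensors[i].state == TEMP_IDLE)
      sensors[i].state = TEMP_START;
  }
}

/* Called on every clock tick: advance each probe's state. */
static void temp_tick(void) {
  for (uint8_t i = 0; i < NUM_SENSORS; i++) {
    sensor_t *s = &sensors[i];
    switch (s->state) {
      case TEMP_IDLE:
        break;
      case TEMP_START:
        s->ticks_left = CONVERSION_TICKS;  /* begin the conversion */
        s->state = TEMP_CONVERTING;
        break;
      case TEMP_CONVERTING:
        if (--s->ticks_left == 0) {
          s->reading = 42;       /* placeholder for the real value */
          heater_pid(s);         /* completion runs the PID loop */
          s->state = TEMP_IDLE;  /* signal completion */
        }
        break;
    }
  }
}
```

The 250 ms caller never waits for a probe; it only requests a
conversion, and each probe runs its PID routine once when its own
conversion completes.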