This series of posts investigates a number of problems identified with the motor control system in leJOS 0.8.1. I hope it provides an insight into how some of these systems work and also into the process of identifying a problem and producing a possible solution.
A few weeks ago Roger, one of the leJOS developers, mentioned that he was seeing some strange results when running tests with the leJOS DifferentialPilot class. This class provides pretty much all you need to control a simple robot that uses two motors configured for tank (or differential) steering, something like the one shown below.
The problem Roger was seeing was that the robot did not seem able to drive a simple series of straight lines and rotations accurately; it finished out of place. The error wasn’t huge, but it was significantly worse than that of a similar robot built using the NXT. Roger investigated further and discovered that during a move operation the two motors did not seem to stay in sync very well. After some time investigating this, Roger provided several key observations:
- There is a delay in bringing the second (right) motor up to speed. When travel is called, the robot initially swerves to the right, travels straight, then turns once the left motor stops and the right motor completes its rotation.
- Calling pilot.travel(15) and printing the tacho count of each motor at 100 ms intervals shows that the maximum difference (left count – right count) usually occurs after about 300 ms, and then usually stays within 1 degree of that maximum until deceleration begins at the end of the travel.
- In the early experiments, the recorded maximum difference is much larger (in the range of 20 to 40 degrees) in the first call to pilot.travel(15) than in subsequent calls. If travel(0) or rotate(90) is called at the start of the experiment, this effect largely disappears.
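Roger’s second observation describes a simple measurement: sample both tacho counts at a fixed interval during a move and find the largest difference. A minimal sketch of such a logger is below. It is written against plain `java.util.function` interfaces so it runs anywhere; on the EV3 the suppliers might wrap each motor’s `getTachoCount()` and the pilot’s `isMoving()`, but that wiring, and the names `TachoLogger`, `log` and `largestDifference`, are assumptions for illustration rather than leJOS code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

/** Logs both tacho counts at a fixed interval while a move runs (illustrative sketch). */
final class TachoLogger {
    /** One reading: time since logging began, plus both counts in degrees. */
    static final class Sample {
        final long elapsedMs;
        final int left;
        final int right;

        Sample(long elapsedMs, int left, int right) {
            this.elapsedMs = elapsedMs;
            this.left = left;
            this.right = right;
        }

        int diff() {
            return left - right;
        }
    }

    /** Reads both counters every intervalMs until moving reports false. */
    static List<Sample> log(IntSupplier left, IntSupplier right,
                            BooleanSupplier moving, long intervalMs) {
        List<Sample> samples = new ArrayList<>();
        long start = System.nanoTime();
        while (moving.getAsBoolean()) {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            samples.add(new Sample(elapsedMs, left.getAsInt(), right.getAsInt()));
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt status and stop logging
                break;
            }
        }
        return samples;
    }

    /** The first sample whose (left - right) difference has the largest magnitude. */
    static Sample largestDifference(List<Sample> samples) {
        Sample worst = samples.get(0);
        for (Sample s : samples) {
            if (Math.abs(s.diff()) > Math.abs(worst.diff())) {
                worst = s;
            }
        }
        return worst;
    }
}
```

On a robot the logger would have to run while the move is actually in progress, for example from a separate thread or after starting the move without blocking; with Roger’s 100 ms interval, `intervalMs` would be 100.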
To better understand what was going on, I built the robot shown above. This robot has three significant features:
- It has a relatively narrow track width; this means that any small difference in motor position is amplified and errors are more obvious.
- It is designed so that the same base of motors and caster can be used with either the EV3 or the NXT, enabling direct comparisons to be made.
- It is designed to use a combination of a jig and a laser so that the start position can be reproduced easily.
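To see why a narrow track amplifies errors, suppose the wheels differ by Δ degrees of rotation. The leading wheel has rolled an extra (Δ/360)·π·d, where d is the wheel diameter, and pivoting through that arc about the other wheel turns the robot by that distance divided by the track width w, which simplifies to Δ·d/(2w) degrees for as long as the offset lasts. The sketch below computes this; the class name and the example dimensions in the note after it are assumptions, since the post does not give this robot’s measurements.

```java
/**
 * Heading change of a differential-drive robot caused by one wheel having
 * turned further than the other (illustrative sketch).
 */
final class HeadingError {
    /**
     * @param tachoDiffDeg  left minus right wheel rotation, in degrees
     * @param wheelDiameter wheel diameter
     * @param trackWidth    distance between the wheels, same unit as the diameter
     * @return heading change in degrees; positive means turned clockwise (to the right)
     */
    static double headingErrorDegrees(double tachoDiffDeg, double wheelDiameter, double trackWidth) {
        // Extra distance rolled by the leading wheel.
        double arc = tachoDiffDeg / 360.0 * Math.PI * wheelDiameter;
        // Pivoting through that arc about the other wheel turns the robot by arc / trackWidth radians.
        return Math.toDegrees(arc / trackWidth);
    }
}
```

For example, with assumed 5.6 cm wheels and a 10 cm track, the 20 to 40 degree differences Roger saw on the first move correspond to heading errors of roughly 5.6 to 11.2 degrees; widening the track to 15 cm would reduce that to about 3.7 to 7.5 degrees.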
The test was very simple: carefully align the robot in its start position, perform a series of 16 travel(25); travel(-25); operations (32 moves in total), then observe the final position of the robot. The following images show the results for the NXT and EV3:
The first image shows the NXT version in the starting position (it also shows the alignment jig). The second image shows its position after the 32 move operations; as can be seen, the robot has drifted somewhat from where it started. The third image shows the result when using the EV3 and version 0.8.1 of leJOS: the robot actually fell off the test track (not good). The final image combines the positions to make them easier to compare: the red rectangle is the start position, blue is the NXT after 32 moves and yellow is the EV3 after 16 moves (because it was not possible to record its final position accurately). There is clearly a considerable error in the case of the EV3, confirming Roger’s observations. The following image is a larger version of the final one of the set above; I have also included the result for the EV3 running a similar test program using the standard Lego software and firmware (in pink).
Looking at these results in more detail, we can see that the NXT is rotated and shifted slightly backwards, the EV3 has a smaller rotation but a large shift to the right, and the Lego software has a similar rotation to the NXT but a larger backwards shift.
During this investigation there was some discussion among the leJOS developers and further testing, and a theory began to develop. What seems to be happening is that there is sometimes a delay between the two motors starting. This delay leaves the motor positions offset from each other for the remainder of the move, which produces the swerves Roger described and the large right shift. This raises a number of questions:
- Why is this problem not seen on the NXT?
- Why was the error much larger on the first call to travel?
- What can we do to fix it?
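Before answering these, it is worth checking that the theory can produce the kind of error seen in the tests. The dead-reckoning sketch below is a simplified model, not leJOS code: it assumes instantaneous acceleration, perfect position regulation, a wheel speed of 17.6 cm/s (about 360 degrees per second on 5.6 cm wheels) and a 10 cm track, all hypothetical values, and it starts the right motor a fixed delay after the left on every move. Because both wheels still turn the same total angle, the heading returns to zero at the end of each move, but the robot has covered most of the move at an angle and so ends up displaced sideways.

```java
/**
 * Dead-reckoning model of a differential-drive robot whose right motor starts
 * a fixed delay after the left one. Assumes instantaneous acceleration and
 * perfect position regulation, so both wheels always turn the same total angle.
 */
final class StartDelaySim {
    static final double DT = 0.001; // integration step, seconds

    final double wheelSpeed; // linear speed of a turning wheel, cm/s
    final double trackWidth; // distance between the wheels, cm
    double x, y, theta;      // pose in cm and radians; +y is the robot's initial left

    StartDelaySim(double wheelSpeed, double trackWidth) {
        this.wheelSpeed = wheelSpeed;
        this.trackWidth = trackWidth;
    }

    /** Drive the given distance (negative = backwards); the right motor starts delay seconds late. */
    void travel(double distance, double delay) {
        int moveSteps = (int) Math.round(Math.abs(distance) / wheelSpeed / DT);
        int delaySteps = (int) Math.round(delay / DT);
        double v = Math.signum(distance) * wheelSpeed;
        for (int i = 0; i < moveSteps + delaySteps; i++) {
            double vl = i < moveSteps ? v : 0;                                   // left starts at once
            double vr = (i >= delaySteps && i < delaySteps + moveSteps) ? v : 0; // right starts and ends late
            double vc = (vl + vr) / 2;             // forward speed of the robot's centre
            double omega = (vr - vl) / trackWidth; // turn rate, positive = anticlockwise
            x += vc * Math.cos(theta) * DT;
            y += vc * Math.sin(theta) * DT;
            theta += omega * DT;
        }
    }
}
```

In this model the sideways shift is to the right for both forward and backward moves, because the right motor lags in both directions, so it accumulates over the 32-move test instead of cancelling. That is at least consistent with the small rotation and large right shift of the EV3 result, although a fixed delay on every move exaggerates the magnitude; in the real tests the delay only occurred sometimes.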
To answer the first question, we need to understand what leJOS does to synchronize motor movements, and the different Java implementations on the two platforms. The simple answer is that leJOS performs no explicit motor synchronization. Instead, it relies on the motor regulator providing accurate positional control of the motor rotation during a move. This means that, so long as the motors start at the same time, they will remain in sync for the duration of the move. On the NXT, starting the motors at the same time is easy. The Virtual Machine used to execute the byte code is a reasonably good real-time system with a strict priority-based scheduling mechanism, and the VM is the only code executing on the NXT. I won’t go into all of the gory details here, but this means it is simple to ensure that both motors start to rotate at pretty much the same time.
On the EV3 things are very different. To begin with, the EV3 is running a full Linux system, so much, much more is going on (the leJOS menu system, networking code and a bunch of other system threads). Secondly, the thread scheduling is different, which means that the thread running the Pilot code is much more likely to be interrupted. Finally, the motor control code on the EV3 is very different from that on the NXT. To get the level of control we need, we use a kernel module for the hard real-time operations and hardware access, and our Java code has to make system calls to talk to this module. Again, this gives a much greater chance that a thread switch happens between the two calls, introducing a delay between the two motors starting.
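One way to observe this kind of variation is to time the gap between issuing two commands back to back, many times over, and look at the worst case as well as the typical one: on a busy system an occasional thread switch shows up as a large maximum rather than in the median. In the sketch below the two `Runnable`s are hypothetical placeholders for the two motor start calls, and `StartGap` is an illustrative name, not part of leJOS.

```java
import java.util.Arrays;

/** Times the interval between issuing two back-to-back commands (illustrative sketch). */
final class StartGap {
    /** Nanoseconds from issuing the first command to issuing the second, for each trial. */
    static long[] measure(Runnable first, Runnable second, int trials) {
        long[] gaps = new long[trials];
        for (int i = 0; i < trials; i++) {
            long t0 = System.nanoTime();
            first.run();
            long t1 = System.nanoTime(); // the second command is issued from this point
            second.run();
            gaps[i] = t1 - t0;
        }
        return gaps;
    }

    /** Returns {median, max} of the measured gaps. */
    static long[] medianAndMax(long[] gaps) {
        long[] sorted = gaps.clone();
        Arrays.sort(sorted);
        return new long[] {sorted[sorted.length / 2], sorted[sorted.length - 1]};
    }
}
```

The numbers only mean something when the real start calls are measured on the target system; the gap between the median and the maximum is what indicates how often, and how badly, the calling thread is interrupted.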
OK, so it looks like delays between the two calls to start the motors rotating may be due to differences between the two systems. But why is it much worse for the first call? Again we need to understand the differences between the two systems. On the NXT, the leJOS VM executes Java byte code directly using an interpreter. This is slower than executing native code, but it allows the use of very compact code, which is important on the memory-limited NXT. On the EV3 we use a standard Oracle JVM, and one of the many features of this VM is a JIT (Just In Time) compiler. Although the system starts off interpreting byte code (just like on the NXT), the VM can choose to convert that byte code into native code and run that instead. In general this is a good thing, as it means our code runs much faster. The downside is that when the system decides to convert a method from byte code to native code, this can cause a delay. It is this delay that creates the large initial error.
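The extra cost of the early calls can be seen with a simple timing loop like the one below, which records how long each of the first N calls to a method takes. This is only a sketch: when a JVM compiles a method, and whether compilation runs on a separate thread, depend on the JVM and its settings, and the first entries usually also include class loading, so individual timings vary from run to run. `CallTimer` and its `work` method are hypothetical.

```java
import java.util.function.IntUnaryOperator;

/** Records the duration of each of the first calls to a method (illustrative sketch). */
final class CallTimer {
    private static volatile int sink; // consumes results so the work is not optimized away

    /** Hypothetical stand-in for a piece of time-critical code. */
    static int work(int n) {
        int acc = 0;
        for (int i = 0; i < 1_000; i++) {
            acc = acc * 31 + (n ^ i);
        }
        return acc;
    }

    /** Nanoseconds taken by each of the first {@code calls} invocations of op. */
    static long[] timeCalls(IntUnaryOperator op, int calls) {
        long[] nanos = new long[calls];
        for (int i = 0; i < calls; i++) {
            long start = System.nanoTime();
            sink = op.applyAsInt(i);
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }
}
```

Calling `CallTimer.timeCalls(CallTimer::work, 10_000)` and comparing the first few entries with later ones gives a rough picture of how much slower the code is before the JVM has warmed it up.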
So what can we do to fix this? There are probably a number of things we could do to “work around” the problem: adjust the thread priorities, give the system a chance to switch to other threads just before we run the critical code, or force the JIT to compile our code before we need to use it in a critical way (this is effectively what the extra call to travel(0) was doing, which is why Roger found that it reduced the size of the initial error). But really none of these is the right solution. The fundamental problem is that we have two operations (starting the motors) that need to happen at the same time, but we have no way of telling the system that this is what we require. We need to modify the motor control system so that we can make it aware of this synchronization requirement, and then work out a way to ensure that it happens.
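To illustrate what making the system aware of the requirement could mean, here is a generic sketch of the idea, not necessarily how leJOS does it. The caller queues a move for each motor and then commits them together, and each motor’s regulator derives its target position from one shared start timestamp, so a delay between queueing the two commands no longer turns into a position offset. `SyncGroup`, `Motor`, `queue`, `commit` and `targetDegrees` are all hypothetical names, and the motor model is reduced to a constant speed.

```java
/**
 * Hypothetical sketch: motors in a group are started from one shared
 * timestamp, so their position targets agree regardless of when each
 * motor's start is processed.
 */
final class SyncGroup {
    /** Minimal constant-speed model of a position-regulated motor. */
    static final class Motor {
        private double pendingSpeed;  // degrees per second, set by queue()
        private double speed;         // degrees per second once started
        private long startNanos = -1; // -1 means "not started"

        synchronized void queue(double degreesPerSecond) {
            pendingSpeed = degreesPerSecond;
        }

        synchronized void start(long sharedStartNanos) {
            speed = pendingSpeed;
            startNanos = sharedStartNanos;
        }

        /** Angle the regulator would aim for at time nowNanos. */
        synchronized double targetDegrees(long nowNanos) {
            return startNanos < 0 ? 0 : speed * (nowNanos - startNanos) / 1e9;
        }
    }

    private final Motor[] members;

    SyncGroup(Motor... motors) {
        members = motors.clone();
    }

    /** Start every member using a single timestamp taken once, up front. */
    void commit() {
        commit(System.nanoTime());
    }

    void commit(long sharedStartNanos) {
        for (Motor m : members) {
            m.start(sharedStartNanos);
        }
    }
}
```

By contrast, if each motor took its own timestamp when its start call arrived, a 50 ms thread switch between the two calls at 360 degrees per second would leave the two targets 18 degrees apart for the rest of the move.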
Part two of this series will look at an experimental solution to this problem and how well it works.