V 2.2.1962

Mar 16, 2015 at 1:46 PM
Edited Apr 24, 2015 at 9:58 PM
I've applied a small patch to the release, fixing a bug in the DHT initialization code and a problem seen when the system became very overloaded during startup. Unless you ran into this issue, which was hard to miss (startup failed), don't worry about switching to this patch release. The patch bumps the release version number to 1954.
Apr 24, 2015 at 10:02 PM
A precise description of the startup issues I fixed:
  • First, the system wasn't handling one of the "new" formats of Query calls properly. If you called g.Query(ALL, some-list-of-members, request-code, EOL); then instead of delivering the (null) messages and collecting replies, the Query would hang and eventually time out via an abort-reply exception (see the sketch after this list).
  • Second, there was a case in which Isis2 itself used this feature, namely when doing OOB transfers of the initial view in a group where more than 5 members joined simultaneously. This would fail because of the first bug.
  • Third, the OOB code itself seems to have a race condition that arises under heavy load of the kind you see if 20 members are all launched on the same computer and OOB is used internally for sharing the initial group view. I've temporarily disabled the "initial views via OOB" feature so that this won't kick in. I'll fix the race condition itself in the coming days or weeks and will then post another patch.
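For anyone unsure which call form the first bullet is describing, here is a rough sketch of the pattern. The group name, request code, handler body, and member list are placeholders of my own (not code from the release), ALL and EOL are written the way they appear in the call quoted above, and the IsisSystem.Start() call is my assumption about the usual startup sequence; the exact overloads in your build may differ.

    using System;
    using System.Collections.Generic;
    using Isis;

    class QuerySketch
    {
        const int PROBE = 0;    // hypothetical request code

        static void Main()
        {
            IsisSystem.Start();                 // assumed standard Isis2 startup
            Group g = new Group("demo");        // placeholder group name

            // Each member answers the (empty) request with a small reply.
            g.Handlers[PROBE] += (Action)delegate { g.Reply(0); };
            g.Join();

            // The "new" Query form: aim the query at an explicit list of
            // members rather than at the group as a whole.
            List<Address> targets = new List<Address>();   // fill with member addresses
            List<int> replies = new List<int>();

            // ALL and EOL as written in the Query call quoted above.
            int nr = g.Query(ALL, targets, PROBE, EOL, replies);
            Console.WriteLine("collected " + nr + " replies");
            // Before the patch this form could hang and then time out with
            // an abort-reply exception instead of collecting the replies.
        }
    }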
Apr 26, 2015 at 11:30 PM
Edited Apr 26, 2015 at 11:31 PM
Today (April 26) I've uploaded V2.2.1956, fixing the remaining startup issue -- a race condition in the OOB transfer of the initial view when processes were joining the system.

The bug fix isn't ideal -- it may slow down batch joins if you have very large groups with large numbers of members joining all at the same time. Let me know if you hit such an issue (I doubt anyone would see it with fewer than about 150 members all joining "en masse", and perhaps quite a bit more). I could do a more elaborate fix that would let the system use the OOB transfer feature for these initial views, but the code changes (just a few lines) wouldn't be trivial to test (I don't actually have really easy ways to launch 5000 copies at a time...), and right now my (scarce) coding time has been going into developing DMC V1.0, so fixing this properly would be a bit of a distraction. That said, it wouldn't take me more than an hour, and I'll invest the time if anyone needs the performance!
Apr 27, 2015 at 2:43 PM
April 27: I figured out how to make a small change that restores reasonable performance for batch joins with large numbers of joining members. The change involved only a few lines and should be very narrowly scoped; my testing exercises the new logic and didn't reveal any problems, so I feel good about it.

At the same time, I've begun to make the error messages printed during extreme overload a little less specific. Instead of complaining that some particular lock couldn't be acquired after 30 seconds, the system now prints a generic complaint ("Isis is shutting down due to extremely long scheduling delays. Is your computer unusually overloaded?") and then shuts down. Right now I get complaints from people who don't seem to understand that a real-time system can't really run on a computer experiencing 30- or 60-second scheduling delays; this should help them see where the issue is coming from.
Apr 29, 2015 at 8:31 PM
Edited Apr 29, 2015 at 8:50 PM
April 29. While testing, one of my students was able to trigger a hang; V2.2.1962 fixes the cause and seems to be completely stable.

The core problem turned out to be a race condition between the "whenDone" upcall on OOB ReReplication and the OOBFetch that needs to occur on the remote client system, and this is something you might run into too. When you use OOB to change a replication pattern, perhaps to create a copy on machine B of something that was on machine A, the upcall on machine A occurs as soon as the machine B replica has been created. It can be tempting to just call OOBDelete at that point. But keep in mind that if the code on B hasn't yet done its OOBFetch, you end up with the OOBFetch and the OOBDelete racing to machine B; the OOBFetch will fail if the OOBDelete arrives first.

This was my bug, and it suggests that you need to be fairly careful with OOBDelete. In fact, the better solution might be not to supply a "whenDone" method at all on the sender side, and instead to issue OOBDelete from the receiver side, as sketched below.
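To make the receiver-side suggestion concrete, here is a schematic of the ordering I have in mind. The FetchObject/DeleteObject methods below are stand-ins for the real OOBFetch/OOBDelete calls (whose actual signatures take more parameters than I show), and the upcall name and object key are invented for the example; the point is purely the ordering of the fetch and the delete.

    using System;

    class ReceiverDrivenCleanup
    {
        // Stand-ins for the Isis2 OOB calls; not the real signatures.
        static byte[] FetchObject(long key) { return new byte[0]; }   // plays the role of OOBFetch
        static void DeleteObject(long key) { }                        // plays the role of OOBDelete

        // Runs on machine B once it learns about the new replica. The delete
        // is issued only after the fetch has completed locally, so it cannot
        // race with the fetch the way a sender-side whenDone + OOBDelete can.
        static void OnNewReplica(long key)
        {
            byte[] data = FetchObject(key);
            Console.WriteLine("fetched " + data.Length + " bytes");
            DeleteObject(key);
        }

        static void Main()
        {
            // Machine A would start the OOB re-replication without supplying a
            // whenDone delegate that deletes the object; cleanup happens here on B.
            OnNewReplica(42);    // placeholder object key
        }
    }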