Flow control delays traced to overloaded router


While working on the DMC version of the system, I noticed an issue that seems to affect the most current release as well: some form of flow control stutter. To trigger it, I set up a group with 6-12 members all on one machine and then send multicasts at high rates. This is enough to overload the UDP IPMC layer of the kernel and trigger loss. In that situation, where multicasts get dropped now and then, the system seems to drop a message, freeze up very briefly, then retransmit and recover.

So this is actually what you might expect, but the freeze-up lasts longer than it should. [Ken: deleted a speculation about what caused this that proved to be incorrect!]

So you get a stutter that lasts longer than it should, but not super long: maybe 150 or 250ms.
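
One way to spot a stutter like this in a test run is to log the delivery time of each multicast and look for unusually long gaps between consecutive deliveries. The sketch below is only an illustration and is not part of Isis2; the function name `find_stutters` and the 150 ms default threshold (taken from the low end of the range above) are assumptions.

```python
def find_stutters(timestamps_ms, threshold_ms=150):
    """Return (start_ms, gap_ms) pairs for every gap between consecutive
    delivery timestamps that is longer than threshold_ms.

    timestamps_ms: delivery times in milliseconds, in arrival order.
    threshold_ms:  hypothetical cutoff for what counts as a stutter.
    """
    stutters = []
    # Pair each timestamp with its successor and measure the gap.
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        gap = cur - prev
        if gap > threshold_ms:
            stutters.append((prev, gap))
    return stutters
```

For example, with deliveries logged at 0, 10, 20, 250 and 260 ms, the only gap over the threshold is the 230 ms pause that starts at 20 ms.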

Since I can see what is going wrong, I should have this fixed fairly soon; I'll post a patch release when I have it. Meanwhile the system is definitely working perfectly well, which is probably why nobody reported this. But if you have an application that generates steady loads of multicasts this way, you'll probably see delays too.
Closed Jun 20, 2015 at 11:56 AM by birman


birman wrote Jun 1, 2015 at 1:04 AM

Seems hard to reproduce. Same computer, same experiment, but I don't see the problem with a different wireless router... (Fortunately, on Wednesday I'll be back where I first triggered it, and can debug it there!)

birman wrote Jun 4, 2015 at 4:18 PM

I was able to pin this down after a few days of experiments. The issue was actually not with Isis2: it originated in a problem with the device driver for my Dell laptop in combination with the particular kind of wireless router I use at home. Under heavy load the router was having problems (kind of bizarre, because I wasn't actually using it to do any routing, since all my applications ran on the laptop). Anyhow, the router would sort of freeze up for 500ms periodically, and somehow this caused Windows on the Dell laptop to freeze up, hence everything would pause for 500ms. Then the router recovered, Windows recovered and the experimental application recovered.

I was able to prevent this problem by setting the peak messages/second limit in Isis2 (a not widely used or advertised parameter) to 250. With this the system won't send more than 250 msgs/s, and that change is enough to eliminate the entire issue.
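
If you would rather cap the send rate in your own application, the same idea can be sketched as a simple pacing limiter that spaces sends evenly. This is a generic illustration, not how Isis2 implements its limit; the class name `RateLimiter` and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without real delays.

```python
import time


class RateLimiter:
    """Caps the send rate by spacing sends at least 1/max_per_second apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.interval = 1.0 / max_per_second  # minimum spacing between sends
        self.clock = clock
        self.sleep = sleep
        self.next_send = clock()  # earliest time the next send is allowed

    def acquire(self):
        """Block until the next send is allowed; return the seconds waited."""
        now = self.clock()
        wait = self.next_send - now
        if wait > 0:
            self.sleep(wait)
            now = self.next_send
        # Schedule the following send one interval after this one.
        self.next_send = now + self.interval
        return max(wait, 0.0)
```

The sender would create `RateLimiter(250)` once and call `acquire()` before each multicast, which holds the rate to at most 250 msgs/s no matter how fast the application produces messages.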

So it wasn't the fault of Isis2.

Along the way I noticed and fixed some very minor bugs triggered by extreme conditions. I can't imagine anyone running into them, so I'm not feeling that this is urgent (I would need to redo my regression tests, which takes me a whole day by now). But I will upload a patch release sometime in August. Let me know of any issues you've seen that I should fix!