Group membership across the WAN

Jul 28, 2015 at 6:59 AM
I am looking into using ISIS2 group membership features to detect server join/part events. These events update an algorithm used to partition a workload among servers.

My scenario is a very old SNMP monitoring infrastructure for ~60,000 devices. In about 12 different locations across the globe I have groups of 4-10 servers responsible for devices in their respective regions. I do not have access to multicast, so my initial plan is to run a small group in each region using unicast UDP mode. I know the list of servers in advance in each region, and am using that list to populate the ISIS_HOSTS environment variable so that oracles can be selected.

The tricky part is that I need to be able to communicate between these regions across the WAN. There are several reasons for this, but the main one is that if I lose enough servers in the region closest to a particular device, I need to start pulling in servers from the next closest region. I am trying to figure out the best way to do this, which I suspect would involve building some TCP level service on top of the Isis portion. There are dedicated links between these sites which does allow for pretty good UDP performance, but if I understand correctly running Isis across the WAN would require a set of oracles to be shared among the various regions, which I suspect would not perform well.

A simple solution would be to simply spin down the group in a particular region after some number of servers has become unavailable, and move the entire workload to the group of servers in another region. I would prefer however to make use of all available servers, and this requires a way to allow group membership across the WAN. Is this possible, or is there a better way to design this?

Thanks in advance for suggestions.

Jul 28, 2015 at 1:19 PM
Edited Jul 28, 2015 at 1:22 PM
This is an interesting question. In the old days we used to caution that WAN partitioning events were a big worry, but I suspect that the modern Internet basically never experiences them (slowdowns, perhaps, but not outright partitioning). So the main reason I don't particularly encourage WAN use of Isis2 is simply that because I yanked out the code that tunnels over TCP (it wasn't very stable), the system is doing all its communication over UDP. In WAN settings you tend to have firewalls that only allow unidirectional connections (e.g. you can make a TCP connection from inside your home to a server in the data center, but it can't make a TCP connection back to you). The only want to support Isis2 in such settings is to punch holes in the firewall for the UDP ports Isis2 will use, and most system admins won't agree to do that. So I find it easier to tell people to just not try.

In fact even UDP multicast works these days in the WAN, or at least in many parts of it. In the past, it was routinely disabled. But many ISPs allow multicast now, because some video streaming services use it to reduce load on them and have convinced them that by enabling IPMC, they actually carry less traffic, hence spend less money to carry the traffic.

What all this adds up to is that there isn't any obvious reason Isis2 wouldn't work in your setting. I would probably use ISIS_UNICAST_ONLY mode, to not have to experiment with IPMC, and then use ISIS_HOSTS to list the places where the system will be launched. If you do try to use IPMC you would need to fuss with TTL values and so forth and I honestly don't know how that would work. In an enterprise VLAN spanning your WAN network, it could probably be made to work well, but if you are just running out there in the wild, I bet it would be nearly hopeless...

So as you outline, there are then two big options. One is to run N instances of your fault-tolerant server in each "region" (e.g. maybe N=2 or N=3), have it do the SNMP monitoring for that region, and then build some other WAN overlay to connect them via TCP. I guess this is the design you've tentatively ruled out, but perhaps had initially been planning to use. The other is to just run Isis2 flat, so that if there are R regions and you wanted N server instances each, you would just end up with an Isis2 group containing N*R members at best, but with failures and so forth sometimes knocking it down to a smaller number temporarily. This actually should work too.

There are a few performance-tuning parameters you may want to experiment with. One is the failure detection threshold parameter, ISIS_DEFAULT_TIMEOUT. In fact it is quite large right now (Amazon has absurdly long scheduling delays and so forth, so we have a big timer to overcome that). But conceivably you will find that this value is currently too small even so. I think we have it at 90 seconds right now. The system isn't really designed for ISIS_DEFAULT_TIMEOUT to be smaller than 15 seconds or so, maybe even 30, so be aware that you just can't use this mechanism to get lightening fast failure reconfigurations. On the upper end, you certainly could push it up towards 2 minutes if you like. Failure sensing will be correspondingly slower.

One option to consider is to add some other form of failure sensing of your own; perhaps there is some obvious way you could do this. That might allow you to use a value like 90 seconds for the default timeout, but perhaps your own logic would generally call IsisSystem.NodeHasFailed() sometime earlier in most cases. Similarly, if you have any situation where a server deliberately would shut down, have it call IsisSystem.Shutdown() before doing so. This substitutes super-fast and accurate failure notification for slow ping-based detections.

In situations where some of your systems might be more stable than others, I would consider not including those servers in the ISIS_HOSTS list. Suppose that the servers in Delhi are prone to power outages and crash 10x more often than the ones in New York. Well, you could make ISIS_HOSTS list all the stable servers, but none of the unstable ones. The unstable guys can join the system, in fact, but can't run the ORACLE. More generally, the ORACLE plays some performance-critical roles, and perhaps it would be unwise to have its members separated by 100ms or 150ms. System might really slow down a lot if that happened. So for this, list in ISIS_HOSTS just a few (maybe 3 or 5) machines that have pretty low latency between them. For example maybe you would only have New York and Chicago in the list. Now you'll know that the ORACLE (typically 3 members from that set, but you can change ISIS_ORACLE_SIZE if you wish) is on these very stable machines. Have all the remainder of your system running with that same list of hosts, and hence not playing the ORACLE role at all. I think this should give better performance than if you ended up with one ORACLE member in NY, one in Dehli, and one in Moscow or Paris or something. For my protocols, very distant (==high network latency) ORACLE members would be a killer.

So I guess this is what I would try first -- just running Isis2 on the WAN, but thinking through all of those questions, picking ISIS_HOSTS appropriately, and then running some stability tests for a week or two before assuming this is going to work well. Feel free to post here with more issues. Do you know how to read the output from the IsisSystem.GetState() reports? You may want to generate one of those per node, per week, just to have a look at retransmission and packet loss rates. Also do make sure to have a logs directory for each Isis2 server and when things go down, check those logs to see why the system died. No comment => the node itself must have crashed. But keep one eye out for "I am in the minority partition" or "I lost contact with the ORACLE" or "I was poisoned". Those are examples of exceptions that might be thrown legitimately (e.g. if the network links were down for a while) but that could also arise if we get the parameters wrong...
Jul 28, 2015 at 1:35 PM
Edited Jul 28, 2015 at 11:28 PM
PS: One more thing I should mention. For your SNMP client systems, there is a question of how they should find the best Isis2 server group member to relay their data through. For example, above I was saying maybe R regions (perhaps 7, say, if you were a bank hosted in the big markets like London, Paris, NYC, Chicago, Chicago, Beijing, Tokyo). Then maybe N=1 or N=2 servers per region, so something like 7-14 members in the Isis2 group.

But I'm visualizing hundreds of client systems per region, like stock trading systems in all the various places (e.g. maybe you have 75 trading desks in Chicago, 280 in NYC, etc). These probably wouldn't be using Isis2 and would instead use a web API like RESTful or WCF to talk from client to what needs to look like a web server. In fact you can literally point a web browser at these servers and see a very basic web page with the stuff SOAP puts on those -- basically, a form-fill for calling into the server with whatever arguments you pass when you capture an SNMP reading. This lets you test the logic by hand, which is nice.

I often have my students work with this style of coding, and it can work well. Those 14 Isis2 servers run an extra thread that launches the RESTful or WCF subsystem. In effect each of them is now a kind of web page on a kind of web server. The client systems are, in effect, browsers -- it looks like remote method calls, but is implemented using web page exchanges.

So then the trick is to get the clients to actually find your servers. The way to do this is to do what does: you publish DNS records with backup options. So rather than having a single IP address, we want to think of a kind of structure where a client node in New York sees DNS resolution that lists the NYC server addresses (or perhaps one address, for a load-balancer, if you use one), and then lists a suitable backup, like Chicago. Use a TTL value that is reasonable but not huge, like 30s.

Then what will happen is this: if a client in NYC wants to report SNMP data, it calls this web-services reporting framework of yours. That framework builds a report record, then via RESTful or WCF, tries to push it to the website. This resolves to the NYC and Chicago IP address. If NYC is working, the report goes there, else to Chicago. It gets collected by your Isis2 server on that extra thread (note that Isis2 itself creates a lot of threads, but leaves them mostly idle. So these guys will have 20 threads in them, but even so you don't want to run with more than 1 or 2 cores each max because things actually slow down if Isis2 has too many cores -- we rarely have CONCURRENT active threads, even though we do have many waiting). Anyhow, that thread collects the report, says thank-you (the SOAP RPC reply), and then would probably relay the report into the group, maybe using g.OrderedSend, and perhaps issuing that OrderedSend from yet another thread to ensure liveness of the reporting framework. You might consider using an Isis2 bounded buffer between these threads (supports a bb.put API, and a bb.get, and you can specify the capacity before it blocks -- super simple).

So now we have this relaying structure. Every 30 seconds the DNS record is refreshed, which means that if NYC is down but recovers, reporting will be via NYC again within 30 seconds for clients in the NYC area.

Of course in Tokyo the primary would be Tokyo and the backup might be Beijing -- use latency to decide this.

Then your SNMP group would have this totally ordered event notification that would reach every member, one for every SNMP event passed in, and could log it or graph it on a monitoring display or whatever you plan to do with the data.

Done this way each Isis2 server could probably monitor thousands or tens of thousands of clients.
Jul 29, 2015 at 6:45 PM
Thanks for all of the information and suggestions. I will prototype this a few different ways and see what works best. I have some concerns with running a flat system. Some of the APAC regions tend to have network issues (not locally, but with other regions) and are more susceptible to fiber cuts. If an APAC region spends an extended period of time unable to communicate with oracles hosted in Chicago or the bay area, my concern would be that I wouldn't be able to properly detect a group membership change in APAC. I need to dig into the details of your paper to make sure I am understanding oracle behavior correctly. That said, a flat setup would make sharing updates applicable to all regions much simpler. This makes we wonder if it is possible to run two sets of oracles - one local to a region and another globally. Is it possible to run multiple instances of Isis on the same host but with different configurations?


Jul 29, 2015 at 7:30 PM
Edited Jul 29, 2015 at 7:31 PM
Oh, you could definitely have multiple instances on the same machines. Just use disjoint port numbers (you need to change a few parameter settings, but in unicast-only mode the number in use is small, just two. And you can certainly run one group per APAC. You would just need to decide whether (and if so, how) to merge their data during periods when connectivity is solid.

The easy way to do one group per APAC would be to host a few servers on nodes in the APAC, and then just let each group track the local data and also serve responses to remote status queries, probably using the same idea of a RESTFul or WCF request handler thread. Probably the same thread as for th client reporting logic, in fact.

So in fact any of these designs can work. I agree, though: Isis2 won't do well if some WAN connections are erratic. The system would need logic to route around such issues, along the lines of what MIT does in the Resilient Overlay Network (RON). I considered doing that, and if we get serious about Isis2 in the WAN I might someday add a RON layer. But right now, the system needs pretty good connectivity to run properly.
Jul 30, 2015 at 1:27 PM

I'm running multiple different application instances on the same machine as well.

I'm setting these values differently in each of the applications.

Environment.SetEnvironmentVariable("ISIS_MCRANGE_LOW", "50000");

Environment.SetEnvironmentVariable("ISIS_MCRANGE_HIGH", "54999");

Environment.SetEnvironmentVariable("ISIS_PORTNOp", "9853");

J.D. Hicks