Should the out-of-band (OOB) file transfer code support update actions?

Jul 15, 2014 at 9:15 PM
Edited Jul 15, 2014 at 9:19 PM
I'm unsure whether to add a partial-update feature to the OOB file transfer code and wanted to ask for opinions.

Right now the OOB code does "whole file" transfers. If you have a node A that creates a file, say /tmp/xyz, then the normal behavior is that some program on A maps the file, scribbles in it until happy, then calls OOBRegister and then OOBReReplicate. The entire file is then replicated, perhaps on B, C, D, E, etc. The job of Isis2, naturally, is to do that as fast as it can using remote DMA (RDMA) transfers, memory-to-memory at line speed. If RDMA is not available, we use the fastest mechanism Isis2 does have -- IPMC multicast, TCP connections, etc.

There is a way to modify a file. A can ReReplicate down to just itself, e.g. the file was on {A,B,C,D,E}, then drops to just A. Now A is the sole owner and can modify it as desired. Then A can ReReplicate back to {A,B,...} as desired. But Isis2 will transfer the whole file again, even if only a few bytes changed.

So the natural question arises: if xyz is a gigabyte in size, how does A change just a few bytes here and there? Right now there are a few options.

1) Use the scheme described above. But for a gigabyte file the ReReplicate would take 500ms or more at 20Gb/s (the raw transfer alone is about 400ms). Plus Isis2 incurs some delays while coordinating the replicas (setting the transfer up), so perhaps 50ms extra.
2) Send a normal Isis2 multicast and have the recipients each update their local replicas. This would work fine, and if the update is small, it probably wouldn't need more than 10ms or so.
3) Or, I could add an overload of ReReplicate to recopy just some portion of the file. I would add offset and length arguments, and by default make the offset 0 and the length the file length. If you passed other values, only that region would be recopied.
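To put rough numbers on option 1 versus option 3, here is a back-of-envelope estimate in Python. It assumes 1 GB means 10^9 bytes, a 20Gb/s link, and the roughly 50ms coordination delay mentioned above; the 4 KB region size is an arbitrary example, and the function name is made up for illustration.

```python
# Back-of-envelope cost of re-replicating a file or a region of it.
# Assumptions: 1 GB = 10**9 bytes, link rate 20 Gb/s = 20 * 10**9 bits/s,
# and a ~50 ms setup/coordination delay per transfer.

def transfer_ms(num_bytes, link_bits_per_s):
    """Raw wire time in milliseconds, ignoring protocol overhead."""
    return num_bytes * 8 / link_bits_per_s * 1000

FILE_BYTES = 10**9
LINK_BPS = 20 * 10**9
SETUP_MS = 50

whole_file_ms = transfer_ms(FILE_BYTES, LINK_BPS) + SETUP_MS   # option 1
region_ms = transfer_ms(4096, LINK_BPS) + SETUP_MS             # option 3, 4 KB region

print(f"whole file:  {whole_file_ms:.0f} ms")
print(f"4 KB region: {region_ms:.2f} ms")
```

Under these assumptions the raw transfer of the whole file is about 400ms before any overhead, while a small region's cost is almost entirely the setup delay. That suggests a region overload would only beat a plain multicast (option 2) when the changed region is too large to send inline.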

Argument in favor: Option 3 seems like a natural thing to support.
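As a rough sketch of what the option 3 overload might look like, here is a small Python model. It is not Isis2 code: `re_replicate` is a hypothetical name, and each replica is just a bytearray standing in for a node's memory-mapped copy. It also assumes that when only an offset is given, the length defaults to "the rest of the file".

```python
# Hypothetical model of a region-aware ReReplicate. Each replica is a
# bytearray standing in for a node's mapped copy; none of this is Isis2 API.

def re_replicate(source, replicas, offset=0, length=None):
    """Copy source[offset:offset+length] into every replica.

    With the defaults (offset 0, length = whole file) this behaves like a
    whole-file transfer; other values recopy only that region.
    Returns the number of bytes copied to each replica.
    """
    if length is None:
        length = len(source) - offset
    if offset < 0 or length < 0 or offset + length > len(source):
        raise ValueError("region falls outside the file")
    region = bytes(source[offset:offset + length])
    for replica in replicas:
        if len(replica) != len(source):
            raise ValueError("replica size differs from source")
        replica[offset:offset + length] = region
    return length

owner = bytearray(b"0123456789")
copies = [bytearray(owner) for _ in range(3)]   # replicas on B, C, D
owner[4:6] = b"XY"                              # A changes two bytes
sent = re_replicate(owner, copies, offset=4, length=2)
```

Note that nothing in this model stops a reader from looking at a replica while the region is being overwritten.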

Argument against: Without locking, wouldn't option 3 be dangerous? You get a kind of distributed race condition, because the update could arrive while someone is looking at their replica, which after all is a memory-mapped object. Moreover, the locking request would itself be sent as a multicast, so why bother with a separate lock request at all? Why not just send the update in a multicast? [If the update is small, send it in the multicast itself. If the update is large, do an OOB transfer of the update (in a different file, xyz-updt) and then send a multicast telling the owners to please lock their replica, apply that update, and unlock it.]
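The bracketed scheme can be modeled in a few lines of Python. This is only a sketch under stated assumptions: the `Replica` class, `send_update`, and the 1024-byte cutoff between "small" and "large" are all made up for illustration, the loops stand in for a real multicast and a real OOB transfer, and a `threading.Lock` stands in for whatever local locking the replicas would actually use.

```python
import threading

SMALL_UPDATE_LIMIT = 1024  # assumed cutoff between "small" and "large" updates

class Replica:
    """Stand-in for one node's memory-mapped copy of the file."""
    def __init__(self, data):
        self.data = bytearray(data)
        self.lock = threading.Lock()   # local readers would take this lock too
        self.staged = {}               # update files delivered OOB, e.g. "xyz-updt"

    def on_multicast(self, offset, payload=None, staged_name=None):
        """Handler run when the update multicast is delivered."""
        if payload is None:
            payload = self.staged.pop(staged_name)
        with self.lock:                # lock, apply, unlock
            self.data[offset:offset + len(payload)] = payload

def send_update(replicas, offset, payload, staged_name="xyz-updt"):
    """Deliver an update to every replica; returns which path was used."""
    if len(payload) <= SMALL_UPDATE_LIMIT:
        for r in replicas:             # small: payload rides in the multicast
            r.on_multicast(offset, payload=payload)
        return "inline"
    for r in replicas:                 # large: OOB-transfer the update file first
        r.staged[staged_name] = bytes(payload)
    for r in replicas:                 # then multicast "lock, apply, unlock"
        r.on_multicast(offset, staged_name=staged_name)
    return "oob"

replicas = [Replica(bytes(8192)) for _ in range(3)]
small_path = send_update(replicas, 10, b"hi")
large_path = send_update(replicas, 100, b"\x01" * 4096)
```

The point of the sketch is that the lock is purely local to each replica and is taken inside the multicast handler, so no separate distributed lock request is needed.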

In fact what troubles me is exactly this issue of locking. To me OOB is currently sort of separate from locking. Adding a region-update feature seems to force me to integrate OOB with locking, at least to some degree.

I'm leaning against doing this, but I am genuinely curious to hear your views.