5.20.2009

More Netcode

It is a good thing I threaded the netcode earlier. We probably need to use it soon.

I debugged a problem in blobd today. The apparent problem was that when two clients were connected, the server would crash upon one of them disconnecting. First, I will discuss networking and then I will address the nature of the crash.

On networking:
I was surprised that two clients could connect at all. If this were a TCP connection, that would not work. I suppose that in this case, the semantics of Unix sockets allowed two clients to share the same socket since neither of them had a named client socket. I am guessing it is even more accidental since the clients do not talk to the server at all right now.

The reason I did not expect it to accept multiple clients at first is that a connection involves two sockets: the server socket and the client socket. The server binds an address to a socket and then waits for people to connect on his socket. When someone connects, the server gets a second socket file descriptor which is associated with that client: if he reads or writes to it, the operating system makes that interact with the client who connected.

So the model is that the server waits, a client connects, and the server uses that newly forged socket to talk to the other program. The wrinkle here is that at this point the server is no longer waiting -- so normally a new connection would be denied (or queued by the OS until the next time the server gets around to waiting).
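
For reference, here is roughly what that model looks like with Unix sockets. This is just a sketch; the socket path and the missing error checking are placeholders, not blobd's actual code.

    #include <cstddef>
    #include <cstring>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main() {
        int listenFd = socket(AF_UNIX, SOCK_STREAM, 0);

        sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, "/tmp/blobd.socket", sizeof(addr.sun_path) - 1);

        unlink(addr.sun_path);                  // clear out a stale socket file
        bind(listenFd, (sockaddr*)&addr, sizeof(addr));
        listen(listenFd, 4);                    // the OS queues up to 4 pending
                                                // connections while we are busy

        // accept() blocks until someone connects, then hands back a second file
        // descriptor that is associated with that particular client.
        int clientFd = accept(listenFd, NULL, NULL);

        // read()/write() on clientFd talks to the client; while we are doing
        // that we are not back in accept(), which is the wrinkle described above.

        close(clientFd);
        close(listenFd);
        return 0;
    }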

It appears that when using Unix sockets -- and maybe this is because, the way it is written, the clients all have the same "address" right now -- if the second guy tries to connect after the server has taken the first guy, the OS just hooks it up anyway. The socket file is only an address in the namespace of the filesystem -- really just an inode. It identifies a communications channel between programs which the operating system maintains, so this behavior isn't completely unintuitive. UDP behaves in a similar way. I think both of these go away if the server chooses to inspect his peers' addresses. Anyway, that behavior is a pleasant surprise, I guess -- although I don't want to keep it.
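
If the server did want to inspect peer addresses, a sketch like this (acceptAndReport is a hypothetical helper, not real blobd code) should show that an unnamed client comes back with an empty path:

    #include <cstdio>
    #include <cstring>
    #include <sys/socket.h>
    #include <sys/un.h>

    int acceptAndReport(int listenFd) {
        sockaddr_un peer;
        memset(&peer, 0, sizeof(peer));
        socklen_t len = sizeof(peer);
        int clientFd = accept(listenFd, (sockaddr*)&peer, &len);
        if (clientFd >= 0) {
            if (peer.sun_path[0] != '\0')
                printf("client is bound to %s\n", peer.sun_path);
            else
                printf("client has no name of its own\n");  // unnamed socket
        }
        return clientFd;
    }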

The better way to handle it, which my test code does, is to spawn a new thread to service a client when someone connects. The main thread then returns to waiting for new clients. It isn't too bad in the way of restructuring code, and it has two big payoffs: you can spread processing load better, and you can more easily send different data to different clients. We will want to be able to send different data if we ever implement a window manager plugin which can inform the blob server about the window geometries of clients and thus which blobs each of them ought to receive.
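
The structure in my test code is roughly like the following. serviceClient and the commented-out sendBlobsForever are simplified stand-ins for the real per-client code, not the actual functions.

    #include <cstddef>
    #include <pthread.h>
    #include <sys/socket.h>

    void* serviceClient(void* arg) {
        int clientFd = (int)(long)arg;
        // sendBlobsForever(clientFd);   // stream blobLists to this one client
        return NULL;
    }

    void acceptLoop(int listenFd) {
        for (;;) {
            int clientFd = accept(listenFd, NULL, NULL);
            if (clientFd < 0)
                continue;                // accept failed; go back to waiting
            pthread_t tid;
            pthread_create(&tid, NULL, serviceClient, (void*)(long)clientFd);
            pthread_detach(tid);         // main thread goes right back to accept()
        }
    }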

The downside is that we have to think more carefully about data manipulation. There are now N threads reading the blobLists to send over the network and 1 thread writing to that list. Mutexes or semaphores would need to be implemented to synchronize those accesses so sane results come out. I stopped short of implementing semaphores in my test code because I would also need to implement some dynamic data generation (it only sends the same blobList over and over).
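
If and when we do add it, the synchronization could look something like this sketch; sharedBlobList and the function names here are placeholders, not code that exists yet.

    #include <pthread.h>

    static pthread_mutex_t blobListLock = PTHREAD_MUTEX_INITIALIZER;

    void writerUpdatesBlobs(/* const blobList& fresh */) {
        pthread_mutex_lock(&blobListLock);
        // sharedBlobList = fresh;        // the 1 writer swaps in new data
        pthread_mutex_unlock(&blobListLock);
    }

    void networkThreadSends(int clientFd) {
        pthread_mutex_lock(&blobListLock);
        // sharedBlobList.send(clientFd); // each of the N readers sends a
                                          // consistent snapshot over its socket
        pthread_mutex_unlock(&blobListLock);
    }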

On the crash:
With some pointers from Zach, I figured out the cause of the crash. My test server happens not to crash, as Zach and Davide noticed, because only the service threads terminate and not the main program. It may also be because they used my test client, which is friendlier and does not quite tickle this issue.

What happens is that a client goes away. The server doesn't know when a client is going to go away, so when one does, the server is almost always in the middle of sending blob data over the socket. The server is probably inside, or about to enter, blobList::send(), which will write() to the socket. Because the client quit, his socket was implicitly closed (I doubt there was an explicit close() call!). Writing to a closed socket/pipe generates a SIGPIPE signal, which has the default action of terminating the application. Suppressing the signal (right now I wrote a handler that notes the signal happened and then returns) lets us instead pick up the return code of the write() calls and inspect errno.
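
The suppression plus error check looks roughly like this. The handler and the sendBuffer helper are a sketch of the idea, not the exact code in blobd.

    #include <cerrno>
    #include <csignal>
    #include <cstdio>
    #include <unistd.h>

    static void onSigpipe(int) {
        // Note that the signal happened and return; the useful handling is done
        // by whoever called write() and sees the error code.
    }

    void installHandler() {
        signal(SIGPIPE, onSigpipe);      // or SIG_IGN to drop the signal entirely
    }

    bool sendBuffer(int clientFd, const void* buf, size_t len) {
        ssize_t n = write(clientFd, buf, len);
        if (n < 0 && errno == EPIPE) {
            printf("client went away mid-send\n");
            return false;                // let the caller clean up or re-accept
        }
        return n == (ssize_t)len;
    }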

I think that is better behavior for now, because in any case we want to recover from the error in the application logic and not in a signal handler. So now, if someone disconnects, the server says something nice about it and then exits. A slight improvement (maybe I will write this now) would be to make it go back up to the accept() call -- essentially wiping the slate and waiting for a whole new client. That would still mess up the current multi-client situation, though.

tldr;
I think the thing to do now is to multithread the client handling code and to also consider adding some synchronization messages like "I am ready for blobs", "I am leaving now" to the network protocol.
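
For the synchronization messages, even something as small as this would do; the names and values here are made up, not a decided protocol.

    // The per-client thread could read one small message before it starts
    // streaming, and a clean goodbye would let it close the socket instead of
    // running into EPIPE.
    enum ClientMessage {
        MSG_READY_FOR_BLOBS = 1,   // "I am ready for blobs"
        MSG_GOODBYE         = 2    // "I am leaving now"
    };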
