
Berkeley Sockets and IOCP

Started by
13 comments, last by hplus0603 7 years, 1 month ago
Of course a shared-nothing implementation will scale better, since there's less locking. This is pretty obvious if you think about it.

But not all games can use a shared-nothing implementation, e.g. an MMO. This is where your scalability becomes highly contingent on efficiently dispatching incoming traffic to the worker thread pool. It's hard to beat IOCP and the non-Windows equivalents when you're in that situation.

But to reiterate what I've already said, that isn't most games. Most games are fine taking a simpler approach, and IOCP is exceedingly tough to get right. It's not worth the effort unless you are really pushing the OS.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]


A more naive design would have jobs submit to a threadpool for packet receives,


The game doesn't need to "ask" for data. Data will be coming in. The game should just process what's there.

If you're using UDP, you only need one thread, because there is only one socket, and there is only one network card, and presumably your CPU can shuffle data faster than your network card can send/receive it over the internet.

Thus, the simplest possible implementation, which is portable across all kinds of systems, and is also fast, is something that uses a simple loop with non-blocking socket calls:

int udp = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
if ((udp < 0)
  || (fcntl(udp, F_SETFL, fcntl(udp, F_GETFL, 0) | O_NONBLOCK) < 0)
  || (bind(udp, (struct sockaddr *)&listenaddr, listenaddrsize) < 0)) {
  you_lose();    //  report the error and bail
}

while (true) {
  int worked = 0;
  int r;
  packet *p;
  addrlen = sizeof(address);    //  recvfrom() overwrites this; reset per call
  while ((r = recvfrom(udp, buffer, bufsize, 0, (struct sockaddr *)&address, &addrlen)) > 0) {
    dispatch_incoming_packet(buffer, r, address, addrlen);
    ++worked;
    addrlen = sizeof(address);
  }
  while ((p = dequeue_outgoing_packet()) != NULL) {
    if (sendto(udp, p->buffer, p->size, 0, (struct sockaddr *)&p->address, p->addrlen) < 0) {
      requeue_outgoing_packet(p);    //  try again next time around
      break;
    } else {
      ++worked;
      free_packet(p);
    }
  }
  if (!worked) {
    usleep(1000);    //  one millisecond is a good balance on modern systems
  }
}
Obviously, there are some bits that you have to implement here; this just shows the structure of an active network send/receive thread for UDP.

Because the kernel queues incoming and outgoing packets per socket, it's efficient (enough) to first drain the incoming queue, then generate the outgoing queue (from the game,) and then repeat. If there's no work to do (no packet came in, nothing to send) then the system is running at low load, and sleeps a millisecond to save some CPU.
"Polling" I/O like this is often frowned upon when you're learning systems programming, because it can be inefficient (burning CPU polling when there's nothing to do,) and it can add additional latency (the sleep means the CPU won't wake up the instant a packet comes in.)
That's a valid concern, but in this case, the construct as shown above is often the best solution for real-time networked games (which is kind of a special case compared to traditional server programming.)

There's still interesting code inside the "dispatch incoming message" function, where you have to figure out whether the address in question is an existing connection that belongs to a known client, or whether it's a new client trying to connect, and then route it appropriately.
A hash table of existing clients, pointing at their game instances, is usually used; when the client is not in the hash table, you can assume it's a new client trying to connect, and route it to some actor that deals with that.
enum Bool { True, False, FileNotFound };

7 years ago, we were handling between 100 and 500 concurrent clients on one process with nothing more complex than select() calls, on Windows and Linux. It helped to have this on a secondary thread which performed the basic read and deserialisation before handing the data to the logic thread, but that was a luxury rather than a requirement as there was plenty of CPU time to spare. This does require a strong separation between deserialisation and game logic, of course.

It's interesting however to see that the initial post mentions "many game contexts in one process" - if the idea is to host a lot of individual games (e.g. MOBA or RTS sessions) rather than one big session (e.g. an MMO) then you'll quickly reach the point where background threads no longer give you 'free' CPU time, and I'd speculate that context-switching could be costly; especially if you have OS-level interrupts trying to wake up each game's receiving thread every few milliseconds.

7 years ago, we were handling between 100 and 500 concurrent clients on one process with nothing more complex than select() calls, on Windows and Linux

While this will admittedly "work", even under Windows if you define FD_SETSIZE, it is nevertheless an awful approach. Not only are you transferring two kilobytes of data to the kernel every time (the descriptor set is not a bitfield!), but more importantly the reason why FD_SETSIZE is only 64 with Winsock is that 64 happens to be the limit of what WaitForMultipleObjects can tackle. Which, in the case of calling select on 500 sockets, means that Winsock will spawn 8 threads, each doing a wait-all on a subset of 64 sockets (well, one of them only has 52), and your main thread doing a wait-all on these threads. Compared to IOCP, that's just... a disastrous approach.

It's not quite that disastrous under Linux, but still epoll_wait will be roughly 30-40 times (30-40 times, not 30-40%) faster for a set of several hundred descriptors.

Don't get me wrong, I'm not saying select is inherently bad. For watching 2 or 3 descriptors, or for watching a moderately small set of descriptors that changes very often, it's mighty fine (possibly even better). It's just that, for mostly-static descriptor sets with counts in the hundreds, one might really consider something better, which is readily available and doesn't cost anything extra.

you'll quickly reach the point where background threads no longer give you 'free' CPU time,


The server can still use a single UDP socket for all the different sessions.
Only if you use TCP do you need select() and more sockets.
enum Bool { True, False, FileNotFound };

