Virtual machine live migration under a game server

9 comments, last by hplus0603 2 years, 10 months ago

Anyone ever try migrating a game server from one machine to another, to get more or less resources depending on load? AWS offers live migration for both Xen and KVM VMs. It's sometimes used when a multithreaded web server needs more threads.

Live migration uses a clever trick. First, all memory pages are marked as dirty (un-copied) by the hypervisor. Dirty pages then start to be copied, over the local network, from source to target machine. It's a lot like writing pages out to virtual memory on disk. Meanwhile, the program is still running, creating more dirty pages. There's thus a race between the copier, making copies of clean pages, and the program, making pages dirty. The endgame is freezing the program, copying the remaining pages, and starting up the version on the target machine. All networking and file connections are of course shifted over. The freeze period may be as little as a few hundred milliseconds.
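
Roughly, the pre-copy loop looks like this. This is a toy sketch in Rust; the types and the re-dirty rate are made up for illustration, and a real hypervisor tracks dirty pages in hardware page tables, not in a hash set:

use std::collections::HashSet;

// Toy model of pre-copy live migration (illustrative only).
struct Migration {
    dirty: HashSet<usize>, // pages not yet copied, or re-dirtied since copying
}

impl Migration {
    fn new(total_pages: usize) -> Self {
        // Round 1: every page starts out dirty (un-copied).
        Migration { dirty: (0..total_pages).collect() }
    }

    // The guest keeps running and re-dirties pages while we copy.
    fn guest_touches(&mut self, page: usize) {
        self.dirty.insert(page);
    }

    // One pre-copy round: ship every currently-dirty page to the target.
    fn precopy_round(&mut self) -> usize {
        let batch: Vec<usize> = self.dirty.drain().collect();
        for page in &batch {
            send_page(*page); // network copy elided
        }
        batch.len()
    }
}

fn send_page(_page: usize) {}

fn main() {
    let mut m = Migration::new(1_000);
    loop {
        let copied = m.precopy_round();
        // Guest runs concurrently; assume it re-dirties ~10% of what we copied.
        for p in 0..copied / 10 {
            m.guest_touches(p);
        }
        if m.dirty.len() < 50 {
            // Endgame: freeze the guest, copy the stragglers, resume on target.
            m.precopy_round();
            break;
        }
    }
    println!("final stop-and-copy done; resume on target");
}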

This is a standard AWS offering, not some experimental thing. It's been in use for years.

The question is, has anyone tried doing this underneath a running game server? And, if they have, how long was the “freeze” period? Did it work out?

I'm thinking about this as a strategy for big-world systems which need to apply more resources to areas with more load.

VMware has offered this for VMware ESXi for a long time, too! It's great for “critical web applications” and “shopping servers” and such, where 24/7 availability is really crucial.

I would recommend against basing a game design around this feature. It's better to be able to spin up more servers than to try to go to a bigger server – some bottleneck is going to hit you as you scale up, be it CPU, memory throughput, network throughput, disk I/O, or even kernel synchronization cost. Scaling out is always the better option.

That being said, if you try it for some game of yours, I'd be interested in reading a write-up, just like you would :-)

enum Bool { True, False, FileNotFound };

What I'm looking at here is the Second Life/Open Simulator architecture. Open Simulator is open source and compatible with Second Life at the message level. Big, seamless world, spread across processes on multiple servers. Each process runs one region, 256m × 256m. Currently, each process is mostly single-threaded, but that could be fixed for at least two of the major internal bottlenecks (user scripts and physics).

Now you want to allocate more resources to regions with more players. Restarting the region server and starting it on a different machine allows this, but the region blanks out for about two minutes and all users are kicked out of that region. If that could be reduced to a second or two, without user kick-out, adjusting resources more often becomes possible.

So the idea is that when traffic builds up at a club, or there's a big meeting, or a city gets busy, restart the region on hardware with more CPUs. Conversely, idle regions get migrated to smaller machines and have multiple region processes running on one CPU.
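
As a sketch, the placement policy could be as simple as this. The thresholds and names here are invented for illustration; none of this is OpenSimulator API:

// Assumed thresholds: idle regions get packed several per core; busy
// regions get a machine sized to the crowd.
#[derive(Debug)]
enum Placement {
    Packed { regions_per_core: usize }, // idle regions share a core
    Dedicated { cores: usize },         // busy region gets its own big box
}

fn place(avatars: usize) -> Placement {
    match avatars {
        0..=1 => Placement::Packed { regions_per_core: 4 },
        2..=19 => Placement::Dedicated { cores: 4 },
        _ => Placement::Dedicated { cores: 32 },
    }
}

fn main() {
    for (region, avatars) in [("quiet-forest", 0), ("busy-club", 180)] {
        println!("{region}: {:?}", place(avatars));
    }
}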

This is sort of a low-rent way to do what SpatialOS does, without a rewrite.

If you can manage to multi-thread physics and user scripts, which is a bit of an if …

Anyway, why would it take two minutes to kick players off? Start up the second instance ahead of time, let it load up all necessary resources, then do a hand-off similar to what you'd do at zone boundaries. Given that all of that hand-off code is already in the simulator, that should be possible without too much of a rewrite.
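
Something like this, where the slow part (booting and preloading) happens before the freeze. All the names here are hypothetical stand-ins, not OpenSimulator's actual interfaces:

struct RegionProcess; // stand-in for a running region simulator
struct LiveState;     // object positions, script state, connected avatars

impl RegionProcess {
    fn boot_and_preload() -> Self { RegionProcess } // the slow part: ~minutes
    fn snapshot(&self) -> LiveState { LiveState }
    fn restore(&mut self, _s: LiveState) {}
    fn redirect_clients_to(&self, _target: &RegionProcess) {}
    fn shutdown(self) {}
}

fn migrate_region(old: RegionProcess) -> RegionProcess {
    // Done ahead of time, while the old process still serves players.
    let mut new = RegionProcess::boot_and_preload();

    // Brief freeze: copy live state and re-point clients, much like a
    // zone-boundary crossing where everyone crosses at once.
    new.restore(old.snapshot());
    old.redirect_clients_to(&new);
    old.shutdown();
    new
}

fn main() {
    let _upgraded = migrate_region(RegionProcess::boot_and_preload());
    println!("handoff complete; region now runs on the new machine");
}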

Also, it's perhaps relevant that the business model of Second Life was “let's rent servers to people”; they never intended it to be auto-scaled – that would go against their desire to make a profit margin on top of base server hosting costs, so the overall architecture might be getting in the way.

Anyway, given that you pay by the minute for Amazon resources, it wouldn't be terribly expensive to play around with this and see if it could be made to work.

Actually, reading up on Amazon live migration, I don't think it works the way you suggest. They seem to support it only for importing servers from VMware clusters – they don't support EC2-to-EC2 hot migration, as far as I can tell. So you're kind of stuck with VMware NonStop/vMotion here, which is decidedly more expensive to play with than a couple of EC2 instances :-)

And, even so, their FAQ states this:

AWS MGN utilizes continuous, block-level replication and enables short cutover windows measured in minutes. AWS SMS utilizes incremental, snapshot-based replication and enables cutover windows measured in hours.

enum Bool { True, False, FileNotFound };

Not AWS MGN, but AWS KVM or Xen migration. MGN copies disks from non-AWS servers into AWS, trying to bring them into sync.

KVM or Xen migration requires that both source and target machines share networked disk storage. Since that's the normal case in AWS, that's not a big problem.

It's surprising how little is written about this on the Web, which suggests it's not used much.

hplus0603 said:
If you can manage to multi-thread physics and user scripts, which is a bit of an if …

Depends on the physics engine. SL uses Havok, which claims to be multi-threaded. The approach is to divide the world into sections which are not currently interacting physically, and solve each separately. That's a good fit for how SL works. Open Simulator seems to be using single-threaded Bullet, and I'm not sure how well that works. On the Second Life side, physics tends to use about 5%-10% of frame time. Scripts use most of the time.
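
The island decomposition is the textbook one: flood-fill the contact graph, then hand each independent island to its own thread. This sketch is generic, not Havok's or Bullet's actual code:

use std::collections::{HashMap, HashSet, VecDeque};
use std::thread;

type BodyId = usize;

// Flood-fill the contact graph into independent islands: bodies that are
// not in contact, directly or transitively, never share a constraint.
fn islands(bodies: &[BodyId], contacts: &[(BodyId, BodyId)]) -> Vec<Vec<BodyId>> {
    let mut adj: HashMap<BodyId, Vec<BodyId>> = HashMap::new();
    for &(a, b) in contacts {
        adj.entry(a).or_default().push(b);
        adj.entry(b).or_default().push(a);
    }
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &b in bodies {
        if seen.insert(b) {
            let mut island = vec![b];
            let mut queue = VecDeque::from([b]);
            while let Some(x) = queue.pop_front() {
                for &n in adj.get(&x).into_iter().flatten() {
                    if seen.insert(n) { island.push(n); queue.push_back(n); }
                }
            }
            out.push(island);
        }
    }
    out
}

fn main() {
    let bodies = [0, 1, 2, 3, 4];
    let contacts = [(0, 1), (2, 3)]; // body 4 touches nothing
    let islands = islands(&bodies, &contacts);
    // No island interacts with any other, so each can get its own thread.
    thread::scope(|s| {
        for island in &islands {
            s.spawn(move || {
                // a real engine would run the constraint solver here
                println!("solving island {:?}", island);
            });
        }
    });
}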

Script parallelism is not too hard, because of the way Linden Scripting Language works. There's not much state. Each script has 64KB of memory. There's no synchronization between scripts; they all free-run independently. It's quite possible, and normal, for some part of the system state, like an object location, to change while a script is running. So scripts can be run concurrently.
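
Reduced to a sketch (the types are invented, not LSL's actual runtime): each script thread owns its private heap and reads shared world state without locks, accepting that what it sees may be stale:

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

struct Script {
    heap: Vec<u8>, // each script gets its own 64 KB, shared with nobody
}

fn main() {
    // World state a script may read mid-run; an atomic stands in for
    // "an object's location can change while a script is running".
    let object_pos = Arc::new(AtomicU64::new(0));

    let mut handles = Vec::new();
    for id in 0..4 {
        let pos = Arc::clone(&object_pos);
        handles.push(thread::spawn(move || {
            let mut script = Script { heap: vec![0u8; 64 * 1024] };
            script.heap[0] = id as u8; // private state, no locks needed
            // Free-running: each read may observe a different world state.
            let seen = pos.load(Ordering::Relaxed);
            println!("script {id} saw position {seen}");
        }));
    }
    object_pos.store(42, Ordering::Relaxed); // the world moves on concurrently
    for h in handles { h.join().unwrap(); }
}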

I'm not working on this directly; I'm tied up trying to write a new client in Rust. What I'm trying to do is to find potential solutions to big-world scaling problems. Second Life, old and clunky though the implementation is, comes closer to the “metaverse” than anything out there. There's no real limit to the size of the world except your server budget.

But don't bunch up: 20 to 40 avatars per region is the max before it gets sluggish. That's the biggest scaling problem now. Solve that, and the other dimension of scaling is cracked as well.

That's the hard problem technically. The migration stuff is to make it cost-effective. At any given time, more than half the regions in Second Life are empty of users. Those could be stacked up four to a CPU core, if you could gracefully give them more resources when someone showed up. That frees up AWS resources for the regions where someone has a busy club with 200 people and you'd like to move that region to a 16-, 32-, or 64-core AWS instance.

What you describe doesn't sound like it's worth the effort.

There is already a bunch of well-established tech to let people co-exist on servers easily. It works in either direction, letting players share data or not, depending on what your game wants. It can work with players sharing a server collaboratively, or with people coexisting in the same virtual world data but independent of each other, with tech often called “phasing”. Passing players between servers is a solved problem.

In a game world where you do rolling updates (not common, but it still happens), you migrate everyone from one server to the next and it all just works. Then, when the player count on the old server hits zero, bring it down and do the work.

This tech already works well at controlling the costs of managed servers. When a server falls below a threshold, migrate the players off the box and shut off the instance. When a server rises above a threshold, migrate players to other servers, and potentially start another server if they're all getting busy.
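
The control loop is simple enough to sketch. The watermarks and names here are assumptions, not any particular engine's API:

struct Server { players: usize, capacity: usize }

fn migrate_players(_from: &Server, _to: &mut Vec<Server>) {
    // normal player hand-off path, elided
}

fn autoscale(fleet: &mut Vec<Server>) {
    let load = |s: &Server| s.players as f32 / s.capacity as f32;

    // Scale in: pick a server below the low-water mark, drain it, kill it.
    if let Some(i) = fleet.iter().position(|s| load(s) < 0.2) {
        let drained = fleet.remove(i);
        migrate_players(&drained, fleet);
        // ...terminate the underlying instance here...
    }

    // Scale out: if every server is above the high-water mark, start one.
    if !fleet.is_empty() && fleet.iter().all(|s| load(s) > 0.8) {
        fleet.push(Server { players: 0, capacity: 100 });
    }
}

fn main() {
    let mut fleet = vec![
        Server { players: 5, capacity: 100 },  // nearly empty: drain it
        Server { players: 90, capacity: 100 }, // busy: keep it
    ];
    autoscale(&mut fleet);
    println!("fleet size after autoscale: {}", fleet.len());
}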

So I don't see a benefit to migrating virtual machines in this case. There is no need to keep the virtual machine itself running and migrated between hardware. Just start up a server and shift players over to it when you want, or shift the players off the server and kill the server when you're done.

frob said:
letting players share data

The world itself, with its own objects and activity, needs to live someplace. I'm looking at this from the Second Life/Open Simulator perspective, where the land has a life of its own. Even when nobody is around, some things continue. Too much, by most game standards. There are animals grazing, NPCs working, vehicles running around, and even plants growing. This is all user-scripted; the game engine just sees this as non-player objects running. There's a slowdown mode when nobody is around, but all the state is maintained.
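
The slowdown itself can be as simple as a tick-rate throttle (values assumed for illustration; SL simulators nominally run at 45 frames per second):

// Hypothetical idle throttle: an empty region keeps simulating, just less
// often, so scripted life continues and no state is lost.
fn tick_interval_ms(avatars: usize) -> u64 {
    if avatars == 0 { 250 } else { 22 } // ~4 Hz idle vs. ~45 Hz active
}

fn main() {
    println!("empty region ticks every {} ms", tick_interval_ms(0));
    println!("busy region ticks every {} ms", tick_interval_ms(30));
}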

These systems are land-centric, not avatar-centric. That's common in building-type games, from Minecraft to Factorio. In a way, avatars are just visitors.

I think not much is written about Xen/KVM migration in Amazon, because it's not a feature they support.

When asked about it on their support forum, Amazon employees claim this is not something they support or intend to support.

There are approaches that try to use QEMU/KVM as a second-layer hypervisor inside Amazon KVM/Nitro, and in the sense that “Linux is Linux,” you might be able to make that work, but that's really no different from trying to do the same thing yourself on your own hardware (except you don't get to poke at the BIOS of the Amazon hosts).

If scripting really is separable and doesn't need synchronous RPC to other scripts (communicating instead through asynchronous messaging), then you could presumably patch OpenSim to run different scripts on different hosts and share the message queues between them, and either run the physics simulation on both hosts with a sync stream, or broadcast state from a master to each replica each frame. An architecture like that would be much more robust than trying to “scale up” or using in-flight migration.
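
As a sketch of the shape of it, with channels standing in for the cross-host queues (all the names are mine, not OpenSim's):

use std::sync::mpsc;
use std::thread;

#[derive(Debug)]
enum Msg {
    ScriptEvent { target: u32, payload: String },
    PhysicsState { frame: u64 /* positions elided */ },
}

fn main() {
    let (to_host_b, from_host_a) = mpsc::channel::<Msg>();

    // "Host B" runs some of the scripts, consuming the shared queue.
    let host_b = thread::spawn(move || {
        for msg in from_host_a {
            match msg {
                Msg::ScriptEvent { target, payload } =>
                    println!("host B: deliver {payload:?} to script {target}"),
                Msg::PhysicsState { frame } =>
                    println!("host B: apply replica physics for frame {frame}"),
            }
        }
    });

    // "Host A" is the physics master; it never blocks on host B (no sync RPC).
    to_host_b.send(Msg::ScriptEvent { target: 7, payload: "touch".into() }).unwrap();
    to_host_b.send(Msg::PhysicsState { frame: 1 }).unwrap();
    drop(to_host_b); // closing the queue lets host B finish
    host_b.join().unwrap();
}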

But, again – if you try something and have interesting results (positive or negative), then by all means, I'd love to hear about it.

enum Bool { True, False, FileNotFound };

hplus0603 said:
I think not much is written about Xen/KVM migration in Amazon, because it's not a feature they support.

I suspect you're right. It's a listed feature, but all I see about it are how-to guides copy-pasted from the documentation.

Trying to pull scripting out of the simulator process is probably going to result in more overhead. Typical scripts don't do much work per event. RPC overhead would dwarf the actual work.
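
For a rough sense of scale (illustrative numbers, not measurements): if a typical script event needs on the order of 10 µs of CPU, and a cross-host round trip on a datacenter network costs on the order of 100–500 µs, then synchronous RPC would spend 10–50x as long on transport as on the work itself.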

The Roblox people are going all-out on massive numbers of avatars in a scene. There's a “Metaverse” discussion on the Roblox site where they talk about having 50,000 avatars in a stadium: you wave a flag to your friend on the other side of the stadium, and they can see it and wave back.

you wave a flag to your friend on the other side of the stadium, and they can see it and wave back.

It's a pretty good team; I'm very impressed by how much they've done with fairly few people.

One of the projects we did while I was there was to significantly improve late-join performance for large levels with tons of objects – I imagine that work has continued, to the point where they may eventually support that. (Also, internets keep improving. What was “too much” 10 years ago is “achievable” now!)

enum Bool { True, False, FileNotFound };
