Dedicated Servers in Gears of War 3
Scaling to Millions of Players
Michael Weilbacher
Development Manager, Microsoft Studios
Introductions
Michael Weilbacher
● Technical Development Manager at Microsoft
● 1.5 years at Microsoft
● 16.5 years in the game industry
● Shipped games:
● Gears of War 3, Magic the Gathering: Tactical,
● Mortal Kombat: Deception to Mortal Kombat vs. DC Universe,
● John Woo presents Stranglehold, NBA Ballers, Blitz, MLB Slugfest, Psi-Ops: The Mindgate Conspiracy,
● NASCAR 02-03, Madden NFL 97-03, NCAA Football 97-02 and some more…
Topics – From the beginning to the end
What are/why dedicated servers
The consumer experience
The associated cost
Game code decisions
Administering the servers
Implementation rollout
Out in the wild
Trends and the future
What are dedicated servers?
● 32-bit headless client instance without renderer/user input
● Multiple clients hosted on a single server
● Servers hosted in a datacenter
● Multiple datacenters worldwide support the community
● Software infrastructure that ties it all together
What are / why dedicated servers
Why dedicated servers?
Best game experience
Addresses Gears 2 problems
Datacenters provide high bandwidth, low latency
Increased host performance
Consistency between games
Prevent host latency advantages
Reduce host quitting and game interruption
Cheaters and lag switches
Community perception/expectations
Decided against distributing the server to the public
Reduces problem scope
Security concerns
Control the experience with consistent performance/bandwidth
The downside of hosting is the increased cost of the game
Overview of our datacenters
Four large datacenters
Four small datacenters
Over 900 servers worldwide
Average ~70 users per core
What is our latency tolerance?
< 150 ms is playable, 50-90 ms is best
Average case after launch at datacenters was 75 ms
Able to tweak by region
Oceania / Asia requirements relaxed slightly after launch
During development
Playtest labs tested worst case
Artificial latency >200ms
Packet loss @ 5-10%
The consumer experience
Finding the server hosted game
Each server hosted game is assigned an ID based on datacenter
Each client is assigned one of these IDs based on an IP to location lookup
In matchmaking query
Client looks for a server hosted game with the ID
Hosted game balances experience and TrueSkill rating based on the players that join
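The IP to location step above can be sketched roughly as follows. This is only an illustration, not the shipped lookup: the address ranges are reserved documentation ranges, and the datacenter IDs, `DATACENTER_RANGES`, and `datacenter_id_for` are hypothetical names.

```python
import ipaddress

# Hypothetical table: documentation-only IP ranges mapped to made-up datacenter IDs.
DATACENTER_RANGES = [
    (ipaddress.ip_network("192.0.2.0/24"), 1),
    (ipaddress.ip_network("198.51.100.0/24"), 2),
    (ipaddress.ip_network("203.0.113.0/24"), 3),
]


def datacenter_id_for(client_ip, default=None):
    """Return the datacenter ID for a client IP, or `default` when no range matches."""
    addr = ipaddress.ip_address(client_ip)
    for network, dc_id in DATACENTER_RANGES:
        if addr in network:
            return dc_id
    return default
```

A client would then include the returned ID in its matchmaking query so that only games hosted in that datacenter come back.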
Consumers finding the best match
Matchmaking returns servers in TrueSkill/XP range
4 types of queries
● Best – looking for exactly my party size
● Any – looking for any match that fits the party
● Empty – configure a new host from the shared pool
● Default to peer to peer
Lots of knobs to tweak, allowing fine control over the matchmaking experience
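The four query types can be read as a cascade, sketched below. Everything here is an assumption for illustration: `HostedGame`, `find_match`, and `skill_window` are hypothetical names, a single number stands in for the TrueSkill/XP range, and "Best" is interpreted as a game whose open slots exactly match the party size.

```python
from dataclasses import dataclass

MAX_PLAYERS = 10  # from the talk: Gears supports 10 players max


@dataclass
class HostedGame:
    datacenter_id: int
    players: int   # players currently in the game; 0 means an empty pool server
    skill: float   # hypothetical stand-in for the game's TrueSkill/XP rating


def find_match(games, datacenter_id, party_size, party_skill, skill_window=5.0):
    """Try the Best, Any, and Empty queries in order, then fall back to peer to peer."""
    local = [g for g in games if g.datacenter_id == datacenter_id]
    in_range = [g for g in local
                if g.players > 0 and abs(g.skill - party_skill) <= skill_window]
    # Best: the party exactly fills the open slots (assumed interpretation).
    for game in in_range:
        if MAX_PLAYERS - game.players == party_size:
            return "best", game
    # Any: the party fits in the open slots.
    for game in in_range:
        if MAX_PLAYERS - game.players >= party_size:
            return "any", game
    # Empty: configure a new host from the shared pool.
    for game in local:
        if game.players == 0:
            return "empty", game
    # No server available: default to a player hosted match.
    return "peer_to_peer", None
```

Each stage only runs when the previous one finds nothing, so widening or narrowing `skill_window` is one example of the kind of knob the slide mentions.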
Games always available
Always fall back to a player hosted match
Necessary if we ever phase out servers over the life of the project
Underestimating server need should never affect players
Some people are just not close to datacenters
Favor server experience
Die roll to balance between "host vs. client rich"
Servers can host migrate if needed
Server to peer to peer migration, but not server to server
Tracking this metric shows host migration is rare: ~0.17% of matches
How much hardware do we need for day one?
Historical data from previous game
Gears 2 multiplayer data
Gears supports 10 players max
Sales forecast per region
Formula driven
Assumed 15% attach rate online and 30% concurrent rate
Can be costly if you are wrong
If too little, then community is unhappy
If too much, the accountants are unhappy
Easier to ask for the accountants' forgiveness
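A sketch of how such a formula might look. It assumes the attach rate is the share of buyers who play online and the concurrent rate is the share of those players online at peak; the talk does not define either term. The 10 player cap and the 7.2 hosted games per core come from other slides, while `cores_needed` and the 500,000 unit forecast are hypothetical.

```python
import math


def cores_needed(units_sold, attach_rate=0.15, concurrent_rate=0.30,
                 players_per_game=10, games_per_core=7.2):
    """Estimate CPU cores needed at peak for a regional sales forecast."""
    online_players = units_sold * attach_rate            # buyers who play online
    peak_concurrent = online_players * concurrent_rate   # players online at peak
    hosted_games = peak_concurrent / players_per_game    # full 10 player games
    return math.ceil(hosted_games / games_per_core)


# Hypothetical forecast of 500,000 units for one region.
print(cores_needed(500_000))  # 313
```

Note that 7.2 games of 10 players is 72 players per core, which lines up with the ~70 users per core quoted for the datacenters.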
The associated cost
How much hardware: should you buy versus rent?
Purchased enough for long term needs (not peak)
Rented over 45% in US
Rented in regions where it was hard to set up big deployments
● GameServers.Com
Monthly cost
Hardware is not the most expensive part
About the graph:
At our highest cost bandwidth facility
Hardware amortized over 36 months
How much bandwidth do you need?
Our average hosted game sends out ~7kb/sec
Our average consumer sends in ~4kb/sec
VOIP traffic is peer to peer to reduce host bandwidth requirement
Cost savings:
Pay for burst (more costly) versus committed (long term)
● More upfront, but cheaper over lifecycle
Matchmaking: LSP or XLIVE/G4WL?
Punch through LSP?
Extra level of indirection
Extra latency
Roll your own matchmaking (no advertising on LIVE)
Non-starter
Games for Windows Live?
Acts as a headless client
Codebase built around LIVE already (UE3 / Gears1 / Gears PC / Gears 2)
Only minor and focused additions/changes required
Game code decisions
G4WL challenges
Still beholden to client rules
CD Key / Local admin account necessary per instance
Need one local account for each game process on servers
One LIVE account for each hosted game
1 Gamertag for every 10 users
Microsoft Platform created a custom tool to generate all the accounts
Manually creating initial 50 Gamertags was no fun
Over 100k Gamertags created!
Platform did not maintain the accounts for us
Manually accepting Terms of Service for every Gamertag
Used a web testing solution to help upgrade accounts when account terms changed
Very painful for all parties involved
Talk to your Developer Account Manager before you go down this route
Modifications to the existing UE3 dedicated server platform
Sitting idle
Needed to restart every 10 minutes to pull down possibly new information
Dynamically need to configure themselves with new updates
Transition period where clients and hosts are synced up
Detecting "empty" and resetting
People start to go into the game and do not make it
People stop playing and server needs to become available again
Server shutdown whitelist
Need to be able to shut down gracefully for upgrades/maintenance
Auto configure when the first party joins and re-advertise
Players make a request for what game mode they want to play and the game needs to set up
Empty server pool shared across all playlists and configurations
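The idle, configure, and reset behaviour described above can be pictured as a small state tracker. This is a sketch under stated assumptions, not the UE3 code: the class `HostedGameSlot`, its state names, the returned action strings, and the 60 second join timeout are all hypothetical; only the 10 minute idle restart comes from the talk.

```python
IDLE_REFRESH_SECONDS = 10 * 60  # from the talk: idle servers restart every 10 minutes
JOIN_TIMEOUT_SECONDS = 60       # hypothetical: wait this long for a party that never arrives


class HostedGameSlot:
    """Tracks one game process: idle -> configured -> in_game -> idle."""

    def __init__(self, now):
        self.state = "idle"
        self.mode = None
        self.players = 0
        self.since = now  # time the current state was entered

    def configure(self, mode, now):
        # First party requested a game mode: set up and re-advertise.
        self.state, self.mode, self.since = "configured", mode, now

    def player_joined(self, now):
        self.players += 1
        if self.state != "in_game":
            self.state, self.since = "in_game", now

    def player_left(self):
        self.players = max(0, self.players - 1)

    def _reset(self, now):
        self.state, self.mode, self.players, self.since = "idle", None, 0, now

    def tick(self, now):
        """Return the action to take at time `now` (seconds)."""
        elapsed = now - self.since
        if self.state == "idle" and elapsed >= IDLE_REFRESH_SECONDS:
            self.since = now
            return "restart_and_refetch_config"
        if self.state == "configured" and elapsed >= JOIN_TIMEOUT_SECONDS:
            self._reset(now)  # players started to join but never made it
            return "reset_to_empty_pool"
        if self.state == "in_game" and self.players == 0:
            self._reset(now)  # everyone stopped playing
            return "reset_to_empty_pool"
        return "none"
```

Passing `now` in explicitly keeps the sketch easy to test without waiting on a real clock.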
General robustness
Needs solid uptime and clean handling of error conditions and shutdown
Fortunately not a single crash during the beta
However, precision issues creep in after 48 hours, so we reboot as players roll off servers close to that mark
Lots of memory leak testing
Lots of logging, events, perf counters (more on that later)
Most of these have been integrated back to UE3
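The talk does not say which values lose precision after 48 hours. One common culprit in long running game processes is a single precision (32-bit float) clock measured in seconds; assuming that is the case here, the sketch below shows how coarse such a clock becomes. The helper `float32_spacing` is a hypothetical name.

```python
import struct


def float32_spacing(seconds):
    """Gap between `seconds` (rounded to float32) and the next larger float32."""
    bits = struct.unpack("<I", struct.pack("<f", seconds))[0]
    current = struct.unpack("<f", struct.pack("<I", bits))[0]
    following = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return following - current


TICK = 1.0 / 30.0  # the 30 fps network tick goal from the talk

one_hour = float32_spacing(3600.0)        # 2**-12 s, about 0.24 ms
two_days = float32_spacing(48 * 3600.0)   # 2**-6 s, about 15.6 ms
print(one_hour, two_days, two_days / TICK)
```

At the 48 hour mark the smallest step such a clock can represent is roughly 15.6 ms, close to half of a 33.3 ms tick, so per-frame time deltas would no longer be trustworthy.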
Memory and Performance
Memory was not as big a deal as performance
Servers run under 150MB/instance
Memory was cheap on the server
Set a goal for a solid 30 fps network tick rate
Simulated load with automated bot matches
Charted fps via performance counters
2.5 hosted games per core (2009 Gears 2)
7.2 hosted games per core (2011 Gears 3)
Memory optimizations
Major performance wins
Stripped out the renderer
Lots of time spent removing "visual effects" code paths
Get the whole team thinking about dedicated servers
Moving from Server 2008 -> 2008R2 was 2x win (Vista -> Win7 kernel)
Lessons learned
Servers load much faster than clients
Server told clients to load things before they had unloaded previous maps -> higher watermark and occasional OOM errors
Introduced configurable latency before loading next map
No intrinsic first player assumption
Slow to connect players were missing the game based on checks that assumed a player host existed
Added code to check that at least one player existed before running existing checks
Mixing client and server side optimizations
Lots of animation optimizations: "last render time" code had to be double checked
Invisible collision in a few instances where the animation never played, leaving collision in a bad state
Make sure the "server" Gamertag was never exposed to the clients
Made sure arbitrated sessions did not include server in the TrueSkill calculation
Never registered a session for the “server”
Reporting systems
Created by Games IT at MS
SCMM – Monitoring system
Tells Tier1 staff an issue is occurring
Email reporting and graphing of health
Monitoring DB for heartbeats in game process and launcher
Most common issue is XLive not logging in.
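Heartbeat monitoring of this kind usually boils down to flagging any process whose last check-in is too old. A minimal sketch, assuming heartbeats are stored as timestamps in seconds; `stale_processes`, the process names, and the 60 second timeout are hypothetical and not taken from the monitoring system described here.

```python
def stale_processes(last_heartbeat, now, timeout_seconds=60.0):
    """Return names of processes whose most recent heartbeat is older than the timeout."""
    return sorted(name for name, seen in last_heartbeat.items()
                  if now - seen > timeout_seconds)


# Hypothetical heartbeat table: process name -> time of last heartbeat (seconds).
heartbeats = {"game-01": 100.0, "game-02": 30.0, "launcher": 95.0}
print(stale_processes(heartbeats, now=110.0))  # ['game-02']
```

The returned list is what a monitoring job would turn into an alert for Tier 1 staff.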
Administering the servers
Control center
Aimed at Tier 1 support
Silverlight app that interfaces with Master services
Locked down to datacenter and not accessible to the team
Silverlight app that shows high level metrics
Available through login
Webservice only has three read only service calls
Can fetch log files of game
Major components of infrastructure
• Master DB
• Master Service
• Launcher Servers
• Game Process
Master DB
All components handshake with the DB to accomplish work
Size fixed after all machines and accounts are added
Parameterized stored procedures only
Separate DB for metrics
No performance issues with proper indices in place
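The point of "parameterized only" is that values are always bound as parameters and never pasted into SQL text. The sketch below illustrates the idea with Python's built-in `sqlite3` module as a stand-in; SQLite has no stored procedures, and the table `game_status` and the helpers `set_status` and `get_status` are hypothetical, not the production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game_status (process_id TEXT PRIMARY KEY, status TEXT)")


def set_status(conn, process_id, status):
    # Values are bound through ? placeholders, never formatted into the SQL string.
    conn.execute(
        "INSERT OR REPLACE INTO game_status (process_id, status) VALUES (?, ?)",
        (process_id, status),
    )


def get_status(conn, process_id):
    row = conn.execute(
        "SELECT status FROM game_status WHERE process_id = ?", (process_id,)
    ).fetchone()
    return row[0] if row else None


set_status(conn, "game-01", "launching")
set_status(conn, "game-01", "map cycling")
print(get_status(conn, "game-01"))  # map cycling
```

Because the input is bound rather than concatenated, a hostile string is stored as plain data instead of being executed.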
Master service
Writes to the master DB
Configuration setting of the machines
Datacenter setup with ID association
Assigns accounts to each machine and each process (Account and 5x5 input)
Installs and health-monitors the launcher service on each machine
Tracks and moves builds to the datacenter local cache
Removed from DB and moved to file caching
Can inject into the ini for custom settings
Can fetch log files from any game process or launcher service
“Gears of War 3” process
Runs many per server
All communication is asynchronous with database
DB Status messages
Game status (datacenter/game mode/playlist version/map name)
Server status (launching/map cycling/restarting/shutting down/etc)
DB Configuration options
Query every time server restarts or idle threshold is reached
Query returned various key/value pairs
Very flexible
Many performance counters exposed
Frame rate, thread timings, number of players connected, client connection data (Ping, Incoming/Outgoing traffic, Packet loss)
Launcher service
Runs one per server
Owns game processes on server
DB Commands to interact with game
Start (install if needed from cache), Stop (bleeds off clients), Kill, Kill All
Restart server, and clean machine
Health monitor of the process
Reasons to restart:
Every 48 hours
In case game crashes
Datacenter ID or playlist version does not match
Server status hangs in any state for too long
Hot swappable
Allowed us to change health rules dynamically without stopping server hosted games
Gathers and records state of the game processes
Game status (datacenter/game mode/playlist version/map name)
Server status (launching/map cycling/restarting/shutting down/etc)
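The reasons to restart listed for the launcher's health monitor can be expressed as one rule function. This is a sketch only: `ProcessState`, `restart_reason`, the reason strings, and the 300 second hang threshold are hypothetical, and waiting for zero players before the 48 hour restart is an assumption based on rebooting "as players roll off".

```python
from dataclasses import dataclass

MAX_UPTIME_SECONDS = 48 * 3600  # from the talk: restart every 48 hours


@dataclass
class ProcessState:
    running: bool
    uptime_seconds: float
    datacenter_id: int
    playlist_version: int
    seconds_in_status: float  # how long the server status has been unchanged
    players: int


def restart_reason(state, expected_datacenter_id, expected_playlist_version,
                   max_status_seconds=300.0):
    """Return why a game process should be restarted, or None if it is healthy."""
    if not state.running:
        return "crashed"
    if state.datacenter_id != expected_datacenter_id:
        return "datacenter_mismatch"
    if state.playlist_version != expected_playlist_version:
        return "playlist_version_mismatch"
    if state.seconds_in_status > max_status_seconds:
        return "status_hung"
    # Assumed: only recycle for uptime once the players have rolled off.
    if state.uptime_seconds >= MAX_UPTIME_SECONDS and state.players == 0:
        return "uptime_limit"
    return None
```

Keeping the thresholds as arguments rather than constants is one simple way to let health rules change without touching running games.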
Health monitoring – good day
Health monitoring – bad day
Lessons learned
Restarting the process automatically is mandatory
Many small things are outside your control; automatic restarts let you come back online quickly
Live connectivity
Server hiccups
Configuration issues
G4WL cannot handle loading all processes at once
We needed 10-15 seconds between the load of each game process to prevent XLIVE DLL issues
All administrative applications need the ability to be updated without taking down the server hosted games
From the game to the monitoring services, you never know when you need to make an adjustment, and this allows you to do a simple form of A/B testing
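Spacing out process launches, as described above, can be as simple as sleeping between starts. A minimal sketch, assuming a caller-supplied `launch` callback; `launch_staggered` and its parameters are hypothetical names, and the sleep function is injectable so the pacing can be tested without waiting.

```python
import time


def launch_staggered(process_ids, launch, delay_seconds=10.0, sleep=time.sleep):
    """Call launch(pid) for each process, pausing between consecutive launches."""
    for index, pid in enumerate(process_ids):
        if index > 0:
            sleep(delay_seconds)  # give the previous process time to finish loading
        launch(pid)


# Dry run: record launches and pauses instead of starting real processes.
started, pauses = [], []
launch_staggered(["a", "b", "c"], started.append, delay_seconds=12.0, sleep=pauses.append)
print(started, pauses)  # ['a', 'b', 'c'] [12.0, 12.0]
```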
Developer environment
Client/Server Environment
Could run multiple clients and servers on same machine
Multiple Gamertags / local accounts required (runas.exe)
Maintained GFWL PC client for rapid iteration
Could run without admin tool from commandline
UnrealConsole could talk to server through socket
All the debugging functionality of UE3
Admin Environment
One datacenter simulation for testing
5 servers with 1 SQL/webservice server
Could run locally using
Visual Studio 2008 (for the game),
Visual Studio 2010 (admin tools),
SQL, and Internet Information Services (IIS)
Implementation rollout
Phase 1 - Gears 2 title update (April 2010)
Retrofit game to support planned Gears 3 features
Good way to introduce features with no expectations
First test of new matchmaking flow
First test of dedicated servers
Limited run of dedicated servers
Profiling servers in a real environment
Controlled environment, closely monitored
Tested CPU/Bandwidth usage in the wild on various hardware
Found 2 otherwise irreproducible crashes in the wild
Able to get minidumps and figure out the problems
Phase 1.5 – Large test in the labs (January 2011)
More than 100 people (mostly testers)
Lock down available machines and cores to create simulated overload
Monitor CPU and bandwidth
Stresses servers, but not infrastructure
Work with enterprise staff (outside of game devs) to look for flaws
DB analysis
Network sniffers
Locking down cores on a PC
Number of network cards, etc
Phase 2 - Gears 3 Beta (April 2011)
Real rollout of servers to datacenters
First consumer trials of our server administration tools
Phased rollout
Huge success
Solid uptime
Gamers happy
Lessons learned (beta)
Issues from lack of communication (zombie games)
Misconfigured servers
Small number of game and balance issues
Added more matchmaking tweaks to ease contention
Good sampling of ping data from around the world
Discovered data points which we should capture during release
An HTTPS webservice is better than direct DB access
Better caching of static data in the DB to offset the DB load
Submission process
MS Cert
Needs to be able to run the server in their environment
Needs to be able to see the client attached to a server
Likes to see that the server is attached to the client
Challenges of MS Cert Environment
Closed environment
Not accessible to our admin framework or network
Reverse IP lookup cannot find their server
Solutions
Always keep the ability to run the server by itself without any DB connections
Set cert environment to use only one datacenter, therefore all IPs return one datacenter ID
Security reviews of datacenters (before you go out…)
Always kill the process on security concerns; better to be alerted than be exposed
The game is signed, but we have exposed connections that must be protected!
Use SDL to examine how trustworthy those communications are and what happens if someone crashes your game process
File and network fuzzing can be difficult, but worthwhile
Look for exposure of personal information, especially in log files
Get an enterprise developer to look at your SQL stored procedures
Know the normal pattern for your game to help look for irregularities
Think of that pattern the way credit card companies look for fraud
DLC / Title updates
Ability to rev the dedicated servers faster / independent of client updates
Servers have to have all the content
Matchmaking can impose certain requirements on clients before searching
Balance between value to those who purchase content vs. fragmenting our client base
Plan for an update path for your servers
Out in the wild
Releases of new G4WL client dlls
If mandatory upgrade, servers will not work until upgraded
Update requires a server to shut down all games
Can be done in a rolling manner
No matter how much communication, be prepared to be surprised when these happen
Automated deployment solution to reduce impact
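A rolling update of the kind mentioned above takes servers out of rotation a batch at a time, so the rest keep hosting games. The sketch below is an assumption about how that could be automated, not the team's actual tool; `rolling_batches`, `rolling_update`, and the `drain`, `upgrade`, and `start` callbacks are hypothetical.

```python
def rolling_batches(servers, batch_size):
    """Split the server list into consecutive batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [servers[i:i + batch_size] for i in range(0, len(servers), batch_size)]


def rolling_update(servers, batch_size, drain, upgrade, start):
    """Upgrade one batch at a time while the remaining servers stay online."""
    for batch in rolling_batches(servers, batch_size):
        for server in batch:
            drain(server)    # stop new matches and bleed off clients
        for server in batch:
            upgrade(server)  # install the new G4WL client DLLs
        for server in batch:
            start(server)    # bring the games back up


# Dry run that only records the order of operations.
log = []
rolling_update(
    ["s1", "s2", "s3"], 2,
    drain=lambda s: log.append(("drain", s)),
    upgrade=lambda s: log.append(("upgrade", s)),
    start=lambda s: log.append(("start", s)),
)
```

A smaller `batch_size` lengthens the rollout but keeps more capacity online while it runs.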
To the cloud…
Trends suggest that we have a lot of unused time on our servers every day
With a cloud solution, you could possibly get the following:
Pay for what you use (but more likely at a higher hourly rate)
More volume upfront for day one demands
Tier1 built in to the purchase (hardware issues, network issues)
Could freeze VMs on machines to debug later
The Future
Hopefully your launch looks like ours…
Questions?
Email: [email protected]
Special thanks to:
• Epic Games
• MS Core Pub Team
• MS Games IT
Individual call outs:
Josh Markiewicz
Sam Zamani
Wes Hunt
Ian Thomas
Joe Graf
Vijay Krishnan
Nur Sheikhassan
Chris Kimmell
Chris Wynn
Microsoft Studios Core Publishing is recruiting