Simplifying WebRTC Connections (AKA Hacking the crap out of WebRTC)

Making online multiplayer games is hard, and making them for the browser doesn’t make anything easier. We don’t have the standard socket functions, so we have to use what the browser provides. WebSockets are the best-supported way of doing non-HTTP communication between server and client, and they work well for certain types of games, but for fast-paced synchronous multiplayer they have a lot of problems. Since they use TCP, any stall or packet loss in the data stream halts all updates until the lost data can be resent. There’s also no way to control low-level TCP features like Nagle’s algorithm and delayed ACKs from JavaScript, so we are at the mercy of browser vendors to set these properly for us, and there’s no guarantee they will. Without the right settings, we might get hundreds of milliseconds of extra lag.

We also have WebRTC data channels, which allow us to send data over UDP. This solves a lot of the problems that WebSockets have but introduces a whole lot of new ones. WebRTC was mainly designed for peer-to-peer internet telephony in the browser, so doing any kind of client-server architecture is really messy. There also aren’t a lot of libraries out there for using WebRTC outside of a browser. The most mature one is provided by the Chrome team. However, its threading model does not scale well to any significant number of connections, and using it requires you to download a massive amount of Chrome code and set up tools that don’t easily slot into other projects.

There are some other options that work much better on the API side, but they force the user to install a browser extension or plugin, which creates a lot of friction for anyone trying out your game. Multiplayer games live and die by the strength of their player base, so adding this kind of barrier to entry for new players is unacceptable. So we’re forced to deal with the problems of WebRTC until something better comes along.

How WebRTC Data Channels Work

The WebRTC connection handshake is complicated.  Really complicated.  It starts out by having each side of the connection generate a session description using the Session Description Protocol (SDP).  This protocol is about two decades old, nigh unreadable, and was designed to support everything from VoIP phones to fax machines.  It contains a lot of boilerplate for describing the session as well as information about which codecs are supported, bit rates, etc., which is all entirely useless for us since all we care about are data channels.  The session description has to be sent to the other end of the connection using some external method.  How is not specified, so you could use anything, such as HTTP or WebSockets.  One guy even hacked up a way to use Twitter.
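To make the format a bit less mysterious: every SDP line has the shape `<type>=<value>`, where the type is a single character, and each `m=` line starts a new media section. Here is a minimal sketch of a splitter built on just that rule. The function name `parseSdp` and the addresses in the sample are made up for illustration, and this is nowhere near a full SDP parser.

```javascript
// Minimal SDP splitter (illustrative only). Lines before the first "m=" line
// are session-level; each "m=" line opens a new media section.
function parseSdp(sdp) {
  const session = [];
  const media = [];
  for (const raw of sdp.split(/\r?\n/)) {
    const line = raw.trim();
    if (line.length < 2 || line[1] !== "=") continue; // skip blank or malformed lines
    const entry = { type: line[0], value: line.slice(2) };
    if (entry.type === "m") {
      media.push({ description: entry.value, lines: [] });
    } else if (media.length > 0) {
      media[media.length - 1].lines.push(entry);
    } else {
      session.push(entry);
    }
  }
  return { session, media };
}

// Sample description with placeholder addresses (not taken from a real session).
const example = [
  "v=0",
  "o=- 0 2 IN IP4 127.0.0.1",
  "s=-",
  "t=0 0",
  "m=application 9 DTLS/SCTP 5000",
  "c=IN IP4 0.0.0.0",
].join("\r\n");

const parsed = parseSdp(example);
console.log(parsed.media[0].description); // application 9 DTLS/SCTP 5000
```

For a data-channel-only session there is a single `m=application` section, which is why so much of the description ends up being boilerplate.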

Once each side has the other’s session description, they start sending STUN packets to try to find a valid way to connect.  This is mainly to handle the case of both ends being behind NAT.  If there isn’t a valid way to connect directly, both sides can connect to a publicly hosted TURN server to route packets through. The STUN packets themselves are mostly devoid of meaningful data; what matters is the fact that they can reach the other side. However, part of the session description is used to validate the STUN packets using an HMAC. The endpoints try as many methods as they want to, and each successful method generates an ICE candidate, which is just a piece of the session description that indicates how a valid connection can be made.
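For a sense of how little is in one of these packets, here is a rough sketch of parsing the fixed 20-byte STUN header as laid out in RFC 5389: a 16-bit message type, a 16-bit attribute length, the magic cookie `0x2112A442`, and a 96-bit transaction ID. The function name `parseStunHeader` is my own, and attribute parsing is left out entirely.

```javascript
const STUN_MAGIC_COOKIE = 0x2112a442;
const STUN_BINDING_REQUEST = 0x0001;

// Parses the fixed STUN header from a Uint8Array.
// Returns null if the bytes do not look like a STUN message.
function parseStunHeader(bytes) {
  if (bytes.length < 20) return null;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const type = view.getUint16(0); // big-endian by default
  if ((type & 0xc000) !== 0) return null; // the top two bits are always zero
  const length = view.getUint16(2); // attribute bytes after the header
  if (view.getUint32(4) !== STUN_MAGIC_COOKIE) return null;
  if (length % 4 !== 0 || 20 + length > bytes.length) return null;
  return { type, length, transactionId: bytes.slice(8, 20) };
}

// Build a bare binding request (no attributes) and parse it back.
const request = new Uint8Array(20);
const requestView = new DataView(request.buffer);
requestView.setUint16(0, STUN_BINDING_REQUEST);
requestView.setUint16(2, 0);
requestView.setUint32(4, STUN_MAGIC_COOKIE);
request.set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 8); // transaction ID

const header = parseStunHeader(request);
console.log(header.type === STUN_BINDING_REQUEST); // true
```

A real ICE binding request also carries attributes such as USERNAME and MESSAGE-INTEGRITY after this header.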

Then, each side decides which ICE candidate to use and begins the connection process.  At this point, we need to establish the next layer of the WebRTC connection using DTLS.  One side is designated as the DTLS client and sends the initial ClientHello packet over the same socket that the STUN packets use.  Each side’s DTLS certificate is validated against the session description from the other end.  Once DTLS is established, we have to create an SCTP-over-DTLS connection using the DTLS link.  SCTP is another protocol on the same level as TCP or UDP that gives some of the features of both.  It never caught on and is supported by very few routers or devices connected to the internet.  SCTP-over-DTLS is a way to simulate SCTP support by carrying each SCTP packet inside the DTLS stream, which itself runs over UDP. It is beyond me why they chose to do it this way, but this is the world we live in.
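Because STUN and DTLS share one UDP socket, the receiving side has to tell the two apart before it can do anything else. RFC 7983 describes doing this by looking at the first byte of each datagram: values 0 to 3 are STUN and values 20 to 63 are DTLS. A minimal sketch of that check, with `classifyPacket` as a made-up name:

```javascript
// Classifies an incoming UDP payload (Uint8Array) by its first byte,
// following the ranges in RFC 7983. Other ranges (RTP, TURN channels, ...)
// are not needed for data channels and are lumped together as "unknown".
function classifyPacket(bytes) {
  if (bytes.length === 0) return "unknown";
  const first = bytes[0];
  if (first <= 3) return "stun";
  if (first >= 20 && first <= 63) return "dtls";
  return "unknown";
}

console.log(classifyPacket(Uint8Array.of(0x00, 0x01))); // stun
console.log(classifyPacket(Uint8Array.of(22)));         // dtls
```

The value 22 in the second call is the DTLS record content type for handshake messages, which is what a ClientHello arrives as.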

So we connect the simulated SCTP socket inside of our DTLS connection, and finally we have to set up our WebRTC data channels.  Each channel can be created in either reliable or unreliable mode and is built on top of an SCTP channel.  The first packet sent on the channel must be a data channel header describing how the channel will be used.
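I am assuming here that the "data channel header" is the DATA_CHANNEL_OPEN message from the Data Channel Establishment Protocol (RFC 8832). It has a 12-byte fixed part (message type `0x03`, channel type, priority, reliability parameter, label length, protocol length) followed by the label and protocol strings. A rough sketch of decoding it, with `parseDataChannelOpen` as a hypothetical name:

```javascript
const DATA_CHANNEL_OPEN = 0x03;

// Decodes a DATA_CHANNEL_OPEN message from a Uint8Array, or returns null.
function parseDataChannelOpen(bytes) {
  if (bytes.length < 12 || bytes[0] !== DATA_CHANNEL_OPEN) return null;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const channelType = view.getUint8(1);
  const labelLength = view.getUint16(8);
  const protocolLength = view.getUint16(10);
  const end = 12 + labelLength + protocolLength;
  if (end > bytes.length) return null;
  const decoder = new TextDecoder(); // UTF-8
  return {
    ordered: (channelType & 0x80) === 0,  // high bit set means unordered
    reliable: (channelType & 0x7f) === 0, // 0x01 / 0x02 are partial-reliability modes
    priority: view.getUint16(2),
    reliabilityParameter: view.getUint32(4),
    label: decoder.decode(bytes.subarray(12, 12 + labelLength)),
    protocol: decoder.decode(bytes.subarray(12 + labelLength, end)),
  };
}

// Build an open message for an unordered channel with zero retransmits.
const label = new TextEncoder().encode("game");
const open = new Uint8Array(12 + label.length);
const openView = new DataView(open.buffer);
openView.setUint8(0, DATA_CHANNEL_OPEN);
openView.setUint8(1, 0x81);  // partial reliable (retransmit count), unordered
openView.setUint16(2, 0);    // priority
openView.setUint32(4, 0);    // reliability parameter: 0 retransmits
openView.setUint16(8, label.length);
openView.setUint16(10, 0);   // no protocol string
open.set(label, 12);

const channelInfo = parseDataChannelOpen(open);
console.log(channelInfo.label); // game
```

The receiver is expected to reply with a DATA_CHANNEL_ACK message (type `0x02`) on the same stream.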

After all of that is done and all of your channels are set up, you can send data.

Well that sounds like a lot of work, but where’s the problem?

The main problem is that this whole business of transferring session descriptions around adds a ton of complexity that I don’t need.  What I really want is a system where users can just connect to a server without having to go through an intermediary.  I don’t want to have to generate the session description on the server for every client or validate that the client’s session matches.  Since my server is going to be hosted on a public IP, I don’t want to have to handle generating ICE candidates or do STUN checks from the server itself.

Well that sounds like madness

Yes it is, but it’s possible for the browser to connect through WebRTC to a server without ever having to transfer the session description.  Here’s how I managed to do it on the client end:

First, hardcode the base boilerplate session description data on the client. Most of the session description never changes from one connection to the next.

var webrtc_sdp =
`o=- 0 2 IN IP4
t=0 0
a=group:BUNDLE data
a=msid-semantic: WMS
m=application 9 DTLS/SCTP 5000
c=IN IP4
`;

Next, hardcode the ice-ufrag and ice-pwd portions of the session description.

webrtc_sdp += `

The ice-ufrag field is required, but what you put there is arbitrary.  The ice-pwd contains 24 characters of randomly generated base64-encoded data that is used to validate STUN packets. Each STUN response is required to have a MESSAGE-INTEGRITY attribute, which is calculated by doing an HMAC-SHA1 over the message up to that attribute, using the ice-pwd data as the key.  Since the client does validate this data, the server has to be able to produce it.  We hardcode it here so that the client knows what to expect.
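To show what the server has to do with that password, here is a minimal sketch of appending MESSAGE-INTEGRITY to a STUN response, following RFC 5389: the header's length field is first bumped to include the 24-byte attribute, then HMAC-SHA1 is run over everything before it. I am using Node.js and its `crypto` module purely for illustration (my actual server is C++), and `appendMessageIntegrity` and the password below are made-up placeholders.

```javascript
const { createHmac } = require("node:crypto");

const ATTR_MESSAGE_INTEGRITY = 0x0008;
const HMAC_SHA1_LENGTH = 20;

// `message` is a Buffer holding the STUN header plus every attribute that
// should precede MESSAGE-INTEGRITY. Returns a new Buffer with it appended.
function appendMessageIntegrity(message, icePwd) {
  const signed = Buffer.from(message); // copy so the caller's buffer is untouched
  // The length field counts bytes after the 20-byte header and must already
  // include the attribute (4-byte attribute header + 20-byte HMAC) when hashed.
  signed.writeUInt16BE(signed.length - 20 + 4 + HMAC_SHA1_LENGTH, 2);
  const hmac = createHmac("sha1", icePwd).update(signed).digest();
  const attrHeader = Buffer.alloc(4);
  attrHeader.writeUInt16BE(ATTR_MESSAGE_INTEGRITY, 0);
  attrHeader.writeUInt16BE(HMAC_SHA1_LENGTH, 2);
  return Buffer.concat([signed, attrHeader, hmac]);
}

// Bare binding success response (type 0x0101) with a zeroed transaction ID.
const response = Buffer.alloc(20);
response.writeUInt16BE(0x0101, 0);
response.writeUInt32BE(0x2112a442, 4);

const signedResponse = appendMessageIntegrity(response, "placeholderIcePassword00");
console.log(signedResponse.length); // 44
```

A response the browser will actually accept also needs an XOR-MAPPED-ADDRESS attribute before MESSAGE-INTEGRITY, and the transaction ID has to be copied from the request.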

Next we hardcode the server’s DTLS certificate fingerprint.

webrtc_sdp += "a=fingerprint:sha-256 " + fingerprint + "\n";

The certificate itself can be self-signed, and you can use a tool like this to calculate the fingerprint yourself after you’ve created one.  The certificate fingerprint is also validated by the client.

Now we create a WebRTC offer and set our generated SDP as the answer.  We also create an ICE candidate for our server’s IP and port.  Setting the remote description as an ‘answer’ forces the client to behave as the DTLS connection initiator.

var desc = {'sdp': webrtc_sdp, 'type': 'answer'};

peer.setRemoteDescription(desc).then(function() {
  var ice_candidate = {
    'candidate': "a=candidate:0 1 UDP 1 " + ipaddr + " " + port + " typ host",
    // RTCIceCandidate needs sdpMid or sdpMLineIndex; 0 assumes the single m= line above.
    'sdpMLineIndex': 0
  };

  peer.addIceCandidate(new RTCIceCandidate(ice_candidate));
});
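Putting the client steps in order, the whole flow might look roughly like the sketch below. It is written as a function that takes the peer connection as a parameter so it can be exercised outside a browser. The name `connectToServer`, the channel label "game", and the channel options are my own choices for illustration, and `serverSdp` stands for the hardcoded description built above.

```javascript
// `peer` is expected to be an RTCPeerConnection. `serverSdp`, `ipaddr`, and
// `port` are the hardcoded server values described in this post.
async function connectToServer(peer, serverSdp, ipaddr, port) {
  // A data channel has to exist before createOffer() so that the offer
  // contains a data channel media section.
  const channel = peer.createDataChannel("game", {
    ordered: false,
    maxRetransmits: 0, // unreliable, UDP-like delivery
  });

  // The local offer is generated as usual but never sent anywhere.
  const offer = await peer.createOffer();
  await peer.setLocalDescription(offer);

  // Pretend the server replied with the hardcoded session description.
  await peer.setRemoteDescription({ type: "answer", sdp: serverSdp });

  // Point the browser at the server's public address. A plain object is
  // accepted here, so no RTCIceCandidate constructor call is needed.
  await peer.addIceCandidate({
    candidate: "a=candidate:0 1 UDP 1 " + ipaddr + " " + port + " typ host",
    sdpMLineIndex: 0, // assumes a single m= line in the description
  });

  return channel;
}
```

In the browser this would be called as `connectToServer(new RTCPeerConnection(), webrtc_sdp, ipaddr, port)`, and the returned channel's `onopen` handler fires once the server has completed the STUN, DTLS, and SCTP steps.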


At this point, the client should be able to connect to the server as long as the server is able to respond to STUN packets properly, is able to establish the DTLS and SCTP connection afterwards, and can handle the data channel protocol.


You can take a look at the beginnings of my C++ WebRTC server that follows all these steps at this GitHub:

This code isn’t ready for production use, but it does actually receive connections from the browser and is able to send data back and forth.  I’m also working on a C++ client API that can work with both emscripten and other platforms using Berkeley sockets.