If you are one of the few developers looking to use WebRTC in native Android code and are stuck without a tutorial resource, this guide is for you.
This is Part 2 of the series “Getting started with WebRTC for Android.” If you landed here accidentally, or if any of what follows feels unfamiliar, please go back to Part 1 of the series first to get an idea of what is discussed here.
Here is a glance at what we covered in Part 1 of the series.
- WebRTC is a peer-to-peer connection framework which can be used to provide high-quality audio/video/data transfer between peers.
- WebRTC for the Web is straightforward. On Android, we have to write a lot of verbose code to make it work right.
- WebRTC is available by default in almost all of the latest browsers. For the mobile platform, we have to include the peer connection native library to make it work.
- In Part 1 of the series, we saw how to get the video from the user’s camera using the WebRTC framework classes.
- We used the VideoCapturer class to capture the video from the user’s camera.
- A VideoSource was created using the capturer instance, which was then used to create the VideoTrack instance.
- This VideoTrack was rendered in a local SurfaceViewRenderer.
It has been a while since Part 1, as I had to take a break after writing it. Let us now look deeper into how PeerConnection works and how to implement it on Android.
First, let us establish a fact: WebRTC is based on peer-to-peer connections (or simply P2P). So what is P2P? Here is what Wikipedia has to say about it:
Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the application. They are said to form a peer-to-peer network of nodes.
If that still leaves you in the dark, a more human explanation would be:
A peer-to-peer (P2P) network is one where two or more devices are connected and share their resources without having to go through a separate server.
So in simpler terms: by using WebRTC, we can share audio/video streams directly with other peers without the need for an external server. This is a huge benefit for companies, as their infrastructure is not overloaded with the bulk of the audio/video data.
- A Peer is a device connected to the network.
- Sharing of data between peers is possible only when the network to which they are connected allows it. We will see more about this when we get to STUN/TURN.
- In WebRTC, we use a concept called “Rooms” to identify the peers who are connected to each other.
- Many peers can be connected to a single room thereby sharing their resources with other peers in the same room.
- Signaling (or Signalling, if you are British) is needed to inform the peers about the other peers in the room so that a P2P connection can be established between them.
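To make the “Rooms” idea concrete, here is a minimal sketch of the bookkeeping a signaling server might do. All the names here (`RoomRegistry`, `join`, `otherPeers`) are hypothetical; WebRTC itself defines no Room API at all.

```java
import java.util.*;

// Minimal in-memory "room" registry, as a signaling server might keep it.
// Class and method names are hypothetical -- WebRTC defines no Room API.
class RoomRegistry {
    private final Map<String, Set<String>> rooms = new HashMap<>();

    // Add a peer to a room, creating the room if it does not exist yet.
    void join(String room, String peerId) {
        rooms.computeIfAbsent(room, r -> new LinkedHashSet<>()).add(peerId);
    }

    // The peers that a newly joined peer must be introduced to.
    Set<String> otherPeers(String room, String peerId) {
        Set<String> members = new LinkedHashSet<>(rooms.getOrDefault(room, Set.of()));
        members.remove(peerId);
        return members;
    }
}
```

When a peer joins, the server looks up everyone already in the room and starts relaying signaling messages between them; that is the entire “room” concept.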
Even though WebRTC takes care of the P2P communication itself, we need to provide a signaling mechanism so that the peers can find each other. Signaling is deliberately left out of WebRTC so that developers have the freedom to choose a mechanism that suits their needs.
Isolating signaling from the WebRTC core lets a developer or organization plug in whatever infrastructure they have already set up.
WebRTC works only over secure channels: it guarantees that the media transfer itself is encrypted. It is up to the developer, however, to use a secure channel for the signaling messages as well.
WebRTC has two major components: the Offer and the Answer. Both are essentially metadata. They describe the audio/video encoding formats available on the device, its network details, and much more.
The offer and answer are encapsulated in a format called the Session Description Protocol, or simply SDP. Read this IETF document if you want to learn more about the format.
This SDP is exchanged between the peers so that each one knows about the other. WebRTC uses this data to agree on an audio/video format supported by both peers, along with the route through which the communication can take place.
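To give you a feel for the format, here is a heavily abbreviated SDP offer and a tiny (hypothetical) helper that pulls the media types and codec names out of it. A real SDP is far longer; this only shows the shape of `m=` and `a=rtpmap` lines.

```java
import java.util.*;

// A tiny, hypothetical helper that peeks into an SDP blob.
// Real SDPs are much longer; this sample only shows the shape.
class SdpPeek {
    static final String SAMPLE_OFFER =
        "v=0\r\n" +
        "o=- 4611731400430051336 2 IN IP4 127.0.0.1\r\n" +
        "s=-\r\n" +
        "m=audio 9 UDP/TLS/RTP/SAVPF 111\r\n" +
        "a=rtpmap:111 opus/48000/2\r\n" +
        "m=video 9 UDP/TLS/RTP/SAVPF 96\r\n" +
        "a=rtpmap:96 VP8/90000\r\n";

    // Media types announced by m= lines, e.g. "audio", "video".
    static List<String> mediaTypes(String sdp) {
        List<String> types = new ArrayList<>();
        for (String line : sdp.split("\r\n")) {
            if (line.startsWith("m=")) {
                types.add(line.substring(2).split(" ")[0]);
            }
        }
        return types;
    }

    // Codec names from a=rtpmap lines, e.g. "opus", "VP8".
    static List<String> codecs(String sdp) {
        List<String> names = new ArrayList<>();
        for (String line : sdp.split("\r\n")) {
            if (line.startsWith("a=rtpmap:")) {
                names.add(line.split(" ")[1].split("/")[0]);
            }
        }
        return names;
    }
}
```

Parsing the sample yields `[audio, video]` and `[opus, VP8]`; this is the kind of information WebRTC compares on both sides to pick a mutually supported format.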
Let’s assume Bruce Wayne calls Clark Kent to ask him to surrender. Wayne, the gadget freak that he is, wants to call Kent over his private P2P network, because using the Bat-Signal is too mainstream.
Here is what happens in the background:
- Wayne creates an Offer and sends it to the “SuperHeroes” room.
- He stores the Offer safely in his cave so that he would remember it.
- Kent, being a superhero, receives the offer, stores it safely in his fortress as Wayne’s personal information.
- Being a generous person and not knowing what the call is about, Kent answers the offer sent by Wayne after storing his answer safely.
- Wayne receives the answer from Kent and stores it safely again.
- Wayne and Kent also share some interesting additional information, like how both have a mother named Martha. (They exchange IceCandidates, which describe the network routes through which each of them can be reached. More on this below.)
- Now Wayne and Kent have the meta information about themselves and about each other. Wayne can now establish a direct link to Kent, and we can all have some action time!
Here, when Wayne creates the offer, he is called the “initiator” of the call. The offer that he created and stored is his “local description”.
When Kent receives the offer, he stores it as the “remote description” on his side. His answer becomes his “local description”.
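The bookkeeping above can be modeled with two plain objects. This is only a sketch of the state each side keeps: in real Android code these steps map to `PeerConnection.createOffer()`, `createAnswer()`, `setLocalDescription()` and `setRemoteDescription()` from the WebRTC library, and the `Peer` class and SDP strings below are hypothetical stand-ins.

```java
// Hypothetical model of offer/answer bookkeeping. In the real Android API
// these fields correspond to PeerConnection's local and remote descriptions.
class Peer {
    String localDescription;   // my own SDP (offer or answer)
    String remoteDescription;  // the other peer's SDP

    // Initiator side: create an offer and store it locally.
    String createOffer() {
        localDescription = "offer-sdp";
        return localDescription;
    }

    // Callee side: store the caller's offer, then create and store an answer.
    String receiveOffer(String offer) {
        remoteDescription = offer;
        localDescription = "answer-sdp";
        return localDescription;
    }

    // Back on the initiator: store the callee's answer.
    void receiveAnswer(String answer) {
        remoteDescription = answer;
    }
}
```

After `wayne.createOffer()`, `kent.receiveOffer(...)` and `wayne.receiveAnswer(...)`, both peers hold a local and a remote description, which is exactly the state a PeerConnection needs before media can flow.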
This transfer of the offer and answer as SDP can happen over any medium: WebSockets, Google Cloud Messaging, XMPP, SIP, an RTCDataChannel, etc. This freedom to choose the signaling medium is a major advantage of WebRTC.
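For example, over WebSockets a signaling message is often just a small JSON envelope around the SDP. The field names below are an assumption for illustration; WebRTC does not mandate any wire format.

```json
{
  "type": "offer",
  "room": "SuperHeroes",
  "from": "wayne",
  "to": "kent",
  "sdp": "v=0\r\no=- 4611731400430051336 2 IN IP4 127.0.0.1\r\n..."
}
```

The answer travels back in the same envelope with `"type": "answer"`, and IceCandidates can reuse it with `"type": "candidate"`.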
Even though the P2P connection itself does not require a server, the signaling part of WebRTC does require one to manage the sessions, rooms, and their participants. The signaling messages are tiny compared with the bulk audio/video data; an entire call session typically takes only a few KB of signaling traffic.
Even so, if you want to support a large customer base, you will have to think about scaling the signaling system to handle a large number of concurrent calls.
Now that we know how to signal the participants of a call, we can establish one, right? Yes, in an ideal world where villains like NAT gateways and firewalls do not exist. Unfortunately, we do not live in that world.
WebRTC is equipped to handle those villains, but we have to guide it, just like we helped it with signaling. This article explains in more detail what STUN and TURN are and what problems they solve:
WebRTC in the real world: STUN, TURN and signaling - HTML5 Rocks
In simpler terms:
- A P2P connection can happen only when each peer knows the public ip:port combination of the other peer. In NAT-protected networks, a peer does not even know its own public address.
- STUN is a public server whose only job is to find out the public ip:port of an incoming request and send that address back as the response. Peers can use STUN to learn their public ip:port, since NATs don’t give away that information directly. (Hence the name STUN: Session Traversal Utilities for NAT.)
- STUN is not bandwidth intensive, and there are many public STUN servers we can use.
- Again, in an ideal world where NATs don’t block the UDP/TCP audio/video streams, WebRTC can communicate through NATs using just the STUN response.
- But in cases where the NAT is too restrictive, TURN (Traversal Using Relays around NAT) is used.
- TURN servers are relays that act as a proxy for the peers. A TURN server has its own public ip:port, so peers can reach it directly even from behind a firewall.
- Each peer sends its media data to the TURN server, which relays it to the other peer.
- Unlike STUN, which handles a low volume of data, TURN handles large media streams and hence needs to be scaled for production apps.
- TURN servers are costly and are not available for free.
TURN is used to relay audio/video/data streams between peers, not signaling data!
- TURN is like a messenger who carries the messages from one person to another. It does nothing else to the media streams.
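To make STUN’s job concrete, here is a minimal sketch that decodes the public ip:port from the XOR-MAPPED-ADDRESS attribute of a STUN Binding success response. The XOR with the magic cookie `0x2112A442` comes from RFC 5389; the class name is hypothetical and the sample bytes are hand-crafted for illustration. In a real app the WebRTC library does all of this for you once you configure a STUN server URL.

```java
// Decode the public ip:port from a (hand-crafted) STUN Binding success
// response. XOR-MAPPED-ADDRESS values are XOR'd with the magic cookie
// 0x2112A442, as RFC 5389 specifies. Class name is hypothetical.
class StunPeek {
    // 20-byte header (type 0x0101, length 12, cookie, zero transaction id)
    // followed by one XOR-MAPPED-ADDRESS attribute for 203.0.113.5:54321.
    static final byte[] SAMPLE = hex(
        "0101000c2112a442" + "000000000000000000000000"
        + "00200008" + "0001" + "f523" + "ea12d547");

    static String decodeXorMappedAddress(byte[] msg) {
        int i = 20; // skip the 20-byte STUN header
        while (i + 4 <= msg.length) {
            int type = ((msg[i] & 0xFF) << 8) | (msg[i + 1] & 0xFF);
            int len  = ((msg[i + 2] & 0xFF) << 8) | (msg[i + 3] & 0xFF);
            if (type == 0x0020) { // XOR-MAPPED-ADDRESS
                int v = i + 4;
                int port = (((msg[v + 2] & 0xFF) << 8) | (msg[v + 3] & 0xFF)) ^ 0x2112;
                int[] cookie = {0x21, 0x12, 0xA4, 0x42};
                StringBuilder ip = new StringBuilder();
                for (int b = 0; b < 4; b++) {
                    if (b > 0) ip.append('.');
                    ip.append((msg[v + 4 + b] & 0xFF) ^ cookie[b]);
                }
                return ip + ":" + port;
            }
            i += 4 + len; // skip to the next attribute
        }
        return null;
    }

    static byte[] hex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int j = 0; j < out.length; j++)
            out[j] = (byte) Integer.parseInt(s.substring(2 * j, 2 * j + 2), 16);
        return out;
    }
}
```

Decoding the sample yields `203.0.113.5:54321`: the peer now knows the public address the NAT assigned to it, which it can advertise to the other peer as an IceCandidate.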
WebRTC works perfectly when two peers communicate with each other. But what happens when a whole group of peers wants to communicate? (Group calls, duh!)
Consider a full mesh network where every peer is connected to every other peer.
- This approach works well for a smaller number of peers in a connection. But, when the number grows beyond control, due to excessive redundancy and higher bandwidth requirements, it fails miserably.
- This approach is not suitable for low-bandwidth devices like our smartphones.
- In comes the Multipoint Control Unit (MCU). This is dedicated hardware/software that helps create a star topology instead of the full-mesh topology.
- The media server acts as the peer for all the other peers. It keeps track of every peer in the network and forwards the data between them.
- Each peer in star topology will be connected only to the MCU and will not have any network connection to the other peers in the network.
- See below for more information regarding MCU and Media Servers.
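The bandwidth argument is easy to quantify. A full mesh of n peers needs n·(n-1)/2 links in total, and every phone uploads its stream (n-1) times; with an MCU every peer keeps exactly one connection. A small sketch (the class and method names are hypothetical):

```java
// Compare connection counts: full mesh vs. star (MCU) topology.
// Names are hypothetical; the arithmetic is the point.
class Topology {
    // Every peer connects to every other peer.
    static int meshLinks(int n) { return n * (n - 1) / 2; }
    static int meshUploadsPerPeer(int n) { return n - 1; }

    // With an MCU, each peer holds exactly one connection -- to the server.
    static int starLinks(int n) { return n; }
    static int starUploadsPerPeer(int n) { return 1; }
}
```

For a 10-person call, a mesh needs 45 links and each phone uploads its video 9 times in parallel; with an MCU there are 10 links and a single upload per phone, which is why the star topology is the one that survives on mobile bandwidth.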
After all this theory, if your head is spinning, wait for the next part in this series to find out how all of these pieces fit together to create an excellent video calling Android application.
Until then, See ya!
The next part of the series is out. Do check it out for more info on how to get started with WebRTC for Android.