Editor’s Note: This post was compiled and created by Scott Andreas, who can be reached on Twitter: @cscotta.
Exploring the Durability of IP Connections from Android Devices
We see a lot of phones each day. Our Helium messaging platform serves hundreds of different models of Android devices from over 5100 global network providers in 208 countries via links running the gamut from GPRS to 3G to satellite. In order to maximize reliability and deliverability across our network, we’re continuously analyzing the behavior of our systems and the data available to us about devices in the field. Recently, we’ve taken steps to automate a more thorough analysis of these logs to understand how network interruptions impact individual devices and our system as a whole. This gives us some insight into what’s happening on these devices and the networks to which they connect.
Connection Durability as a Metric
Carriers and industry analysts have conducted many studies on the speed of mobile networks, dropped call stats, and coverage. In this post, however, we explore a different dimension called connection durability. Connection durability refers to the average duration of a mobile IP connection, or more precisely, the average number of times a device’s data connection reconnects throughout the day. While such blips are unlikely to interrupt web browsing or Twitter-checking unless they happen mid-session, they do affect the reliability of background services such as sync and messaging. As these services are the lifeblood of a mobile device, it’s worth looking into what happens over the course of a tumultuous day on a mobile network.
What factors affect connection durability?
Several factors combine to result in low-quality or short-lived network connections. You’re probably familiar with many of them: walking into an elevator, taking the subway, moving away from the windows inside a large building, or descending to the basement. CDMA devices suspend their data connection each time a phone call is made. All mobile networks have dead spots. Many devices using WiFi or WiMax connections aggressively manage the link, shutting it down as often as possible. Switching between WiFi and 3G/EDGE connections also triggers a temporary drop. Task Killer apps trigger a similar effect, interrupting a connection until the service restarts. While the quality of a network may appear very good while you’re using it, most mobile phones take a silent beating over the course of the day as connectivity fluctuates.
To better understand this fluctuation, we’ve analyzed the server-side logs generated by the connection activity of a slice of one million devices on our messaging cluster. After plotting a global baseline, we’ve analyzed this data to see what we can learn about connection quality by country, carrier, and device type. Due to a variety of factors, this data does not permit us to offer firm statistical conclusions about the quality of a given network, device, or connection from a country. It’s important to bear in mind that this data speaks primarily to the ability of a device to maintain a persistent connection in the presence of all factors that diminish connection duration, including the OS itself. By analyzing this data in different dimensions, we seek to understand if any interesting correlations are present.
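As a rough illustration of the counting step behind this analysis, here is a minimal sketch that tallies connection events per device per day. It assumes a hypothetical log format in which each record is a (device_id, unix_timestamp, event) tuple and a connection is logged with the event name "connect". Both are illustrative assumptions, not the actual Helium log schema.

```python
from collections import Counter
from datetime import datetime, timezone


def connection_events_per_device_day(records):
    """Count "connect" events per (device_id, UTC date).

    `records` is an iterable of (device_id, unix_timestamp, event) tuples.
    The tuple layout and the "connect" event name are assumptions made
    for this sketch.
    """
    counts = Counter()
    for device_id, ts, event in records:
        if event != "connect":
            continue  # ignore disconnects and any other event types
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        counts[(device_id, day)] += 1
    return dict(counts)
```

Bucketing by UTC date keeps the day boundary consistent across devices in different time zones. Bucketing by each device’s local day would be an equally reasonable choice.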
Diving In: Connection Events Across All Devices
Let’s start by looking at the global statistics of connection events per day across this slice of devices:
This chart shows that most devices in this sample lose and regain their data connection 10 or fewer times per day (55%), with the vast majority losing their connection fewer than 100 times per day (96.2%). Two reconnects per day is the most frequently occurring value, followed closely by three and then one. Beyond this, we find a long tail of a handful of devices with much higher reconnect rates, most likely indicating either a malfunctioning phone or one with a consistently poor connection. As a high rate of disconnect and reconnect events will typically occur when the device is in an area with marginal coverage (passing through a subway, dead spot, elevator, or building), these events are likely concentrated in small portions of the day during which adverse conditions are present.
With this data we can establish that over the course of a day, most mobile devices will experience a relatively low rate of reconnections. 55% of devices in this sample reconnected 10 or fewer times per day, which works out to at most one connection event every 2.4 hours on average (24 hours / 10 events).
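The summary figures above (the share of devices at or below 10 reconnects, the share below 100, and the most common value) can be derived from a list of per-device daily counts. The following is a minimal sketch, assuming one integer count per device. The function names are hypothetical and the sample numbers in the usage note are made up for illustration.

```python
from collections import Counter


def summarize_reconnects(counts_per_device):
    """Return (share with <= 10, share with < 100, most common count).

    `counts_per_device` is a non-empty list of daily reconnect counts,
    one integer per device.
    """
    total = len(counts_per_device)
    share_le_10 = sum(1 for c in counts_per_device if c <= 10) / total
    share_lt_100 = sum(1 for c in counts_per_device if c < 100) / total
    most_common = Counter(counts_per_device).most_common(1)[0][0]
    return share_le_10, share_lt_100, most_common


def hours_between_events(events_per_day):
    """Average spacing, in hours, between events spread over one day."""
    return 24 / events_per_day
```

For example, `summarize_reconnects([1, 2, 2, 3, 10, 11, 99, 150])` returns `(0.625, 0.875, 2)`, and `hours_between_events(10)` returns `2.4`, matching the figure quoted above.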
Geography: Breaking it Down by Country
We’ve also examined the breakdown by country. Might connection durability vary unevenly across national boundaries? Via MaxMind’s Geo-IP Country database, we’re able to map mobile device IPs to the country from which they’re connecting. While it’s not possible to reliably pinpoint city or regional data by mobile IP, we can determine the country with a high level of confidence.
The y-axis in this chart represents the number of times a device located in a given country reconnected throughout the day. Here, we see that devices in this sample connecting from China experience the fewest reconnections per day (12), slowly climbing upward toward Canada and the US with 21. However, after these, we see a spike indicating that connections from Indonesia, France, and Japan are significantly more volatile. While many devices in the sample from these three countries demonstrated low reconnect rates, others varied widely, with some samples in each country reaching hundreds of reconnects. Note that this plot excludes countries from which fewer than 1000 devices in this slice of data have connected.
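The per-country averages described above amount to a group-by with a minimum sample size. Here is a minimal sketch, assuming each device is a dict with a "reconnects" count and that the grouping key comes from a caller-supplied function. That function stands in for a GeoIP lookup, and no actual MaxMind API is shown. The field names and the `key_of` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_reconnects_by_group(devices, key_of, min_devices=1000):
    """Mean daily reconnects per group, sorted from lowest to highest.

    `devices` is an iterable of dicts with a "reconnects" field
    (an assumption for this sketch). `key_of` maps a device to its
    group, e.g. a country code. Groups with fewer than `min_devices`
    devices are dropped.
    """
    groups = defaultdict(list)
    for device in devices:
        groups[key_of(device)].append(device["reconnects"])
    means = {
        key: mean(values)
        for key, values in groups.items()
        if len(values) >= min_devices
    }
    # dicts preserve insertion order, so this yields a sorted mapping
    return dict(sorted(means.items(), key=lambda item: item[1]))
```

The same function works for other dimensions by swapping `key_of`, for example a function that returns the network provider or the device model instead of the country.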
Variations Across Mobile Networks
Surprised by the numbers in France and Japan, we broke the results down by network to see if uneven connection rates appeared at the carrier level as well. Via MaxMind’s ISP/Organization database, we can map device IPs to network providers. Parsing this data takes work, as many mobile networks function as independent systems under one brand following mergers with other carriers (e.g., AT&T and Cingular, or Verizon – Bell Atlantic – GTE). Rather than attempting to group these, we’ve provided the raw data of device-to-network mappings below. Note that this chart represents networks from which we see greater than 1000 devices in this sample connecting.
In this chart, AT&T Global Internet Services and “Service Provider Corporation” (formerly Cingular) represent AT&T. Cellco Partnership is the corporate name of Verizon Wireless. Orange Communication SA and Orange PCS Ltd. are networks operated by Orange in multiple countries. We also see landline and fiber providers due to devices connecting via WiFi. Once again, the y-axis in this chart represents the number of times a device reconnected throughout the day, this time grouped by the network from which it connected.
Consistent with our breakdown by country, we see that connections from NTT Docomo (Japan) and Bouygues (France) experience the highest level of interruptions. Devices on these networks experience significantly more dropped and re-established data connections to our messaging cluster than on other networks in other countries. This data also shows that connections originating from landline and fiber providers are interrupted more often. With the exception of NTT and Bouygues, the upper bounds of this dataset are weighted toward landline providers such as Cox, BellSouth, Charter, and Comcast.
Device Models and Manufacturers
What variations appear when we slice this data by device type? This chart represents the mean reconnect rates by device model, limited to models with more than 1000 devices in this sample.
The Motorola Xoom leads here, which may be attributed in part to the fact that tablets are less likely to be carried through volatile network conditions throughout the day. On the opposite end, LG’s Optimus V, T, and M phones showed significantly greater reconnect rates, topped out by Samsung’s Nexus S and the T-Mobile G2. The middle of the pack is dominated by an alternating flurry of Samsung, HTC, and Motorola phones. This chart does not demonstrate a direct correlation between device manufacturers and reconnect rates, suggesting that the variations between individual models (radios, chipsets, software, etc.), the networks on which they are deployed, and user behavior (such as leaving a tablet on a coffee table) may be more significant than the device’s manufacturer.
This sample is not pure enough to support statistically sound statements regarding the reliability of a particular device, carrier, or connection within a country. The number of confounding factors prevents us from making such statements with confidence. This would require a cleaner dataset, and a more thorough analysis that cuts across each of these categories to account for the variations introduced.
However, this data provides a fascinating picture of the life of a mobile device on data networks deployed throughout the world. We see that these devices must be capable of gracefully and transparently handling network failures throughout the day, retrying connections and backing off as appropriate. Network and geographic factors may correlate with the ability of a device to maintain a reliable IP connection to a remote server. We can also see that devices registered on mobile data networks tend to maintain more stable connections than those connecting over WiFi via traditional network providers. Finally, the data demonstrates that connection durability rates can vary widely across different Android device models as well.
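To make "retrying connections and backing off as appropriate" concrete, here is a minimal sketch of capped exponential backoff with random jitter. This is a common general technique and not a description of how the Helium client actually reconnects. The function names, default values, and the `connect` callable are all hypothetical.

```python
import random
import time


def backoff_delays(base=1.0, cap=300.0, attempts=8, rng=random.random):
    """Yield one delay (seconds) per attempt: capped exponential, jittered.

    The ceiling doubles on each attempt up to `cap`. The actual delay is
    a random fraction of that ceiling, so that many devices dropped at
    the same moment do not all reconnect in lockstep.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng() * ceiling


def connect_with_retry(connect, sleep=time.sleep, **backoff_options):
    """Call `connect()` until it succeeds, sleeping between failures.

    `connect` is a hypothetical callable that raises OSError on failure.
    Raises ConnectionError once every attempt has failed.
    """
    for delay in backoff_delays(**backoff_options):
        try:
            return connect()
        except OSError:
            sleep(delay)
    raise ConnectionError("gave up after repeated connection failures")
```

Capping the delay keeps a device that has been offline for hours from waiting excessively long once coverage returns, and limiting the attempts bounds the battery cost of retrying in a dead spot.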
More importantly, this slice of data provides insight into the behavior of devices connected to our messaging cluster. These results enable us to tune our software and systems on both the client and server to maximize connection durability and the reliability of our messaging services, while minimizing the impact on the device. Regardless of the factors contributing to poor connections, this type of analysis provides us with a better understanding of the best, average, and worst cases that devices are likely to experience, and feeds directly back into our development process. This rigorous analysis of our data is important, and constantly helps us to improve the reliability and performance of our systems.
We initially performed this analysis back in January on a much smaller dataset. After re-running the same jobs across a dataset about 8x the size, we found surprisingly little variation. Previously, 63% of devices reconnected 10 or fewer times per day (compared to the current 55%). Correspondingly, reconnect rates increased about 10% across a few of the dimensions we analyzed (country, network, and device type). While a few elements changed in the new analysis, it was refreshing to see that a revisit of this data six months later validated our first analysis, paving the way for more confident, data-driven changes to our messaging systems.