T For Tsunami: Dealing with traffic spikes

Akash Saxena
Published in Disney+ Hotstar
5 min read · Jun 21, 2017


TL;DR: In India, cricket is religion. At the recently concluded Champions Trophy, Hotstar broke its own previous best record of 3M concurrent users. We did 4.02M for Ind-Pak #1, broke that record with 4.2M for Ind-Ban, and rounded the tournament off with a 4.69M platform peak concurrency for the final! Such concurrency places high demands on our infrastructure and architecture. Read on for some in-the-trenches stories from the last couple of weeks.

The Tsunami

I’m writing this just as my team is going through the final checks for what is now an India-Pakistan dream final. It’s as good a time as any to retrospect on how the tournament has gone for my team from a technical perspective. There was a fair bit of pre-planning for the Champions Trophy (CT), as there is for any premier event. We were fresh off an IPL final that saw 3M concurrent users and change, and were ramping up for the juicy India-Pak opener. While our planning was solid, and we had planned for a peak concurrency of 7M, there was this one moment…

Virat Kohli walks in to bat — a notification is sent to a large segment and boom!

Our login API, which had been cruising at 10K TPS, shot up to 125K TPS in a matter of seconds. Kohli walked in and the platform had a hiccup. There was a flaw in our planning: we had not budgeted for that large a spike, something the nature of an Ind-Pak contest exposed. We take customer experience very seriously, and a blip on the platform is a serious event.

Upon Further Review

Debugging problems in a distributed system of such scale is never an easy task: so many moving parts and so much data to sift through. We don’t do easy; we like hard problems at Hotstar. Long hours of reviewing the configuration of our application and database servers, and of combing through logs, followed. When the large spike came in, the database started to melt down and at one point began swapping. We kept digging, and what emerged was that while we had scaled up our fleet for 7M, our calculation had failed to factor in the Tsunami pattern. We had aggressive caching at multiple layers, but login requests, by their nature, place limits on cache design, and we had not scaled the database up proportionately for the additional traffic all those front-facing servers brought on.
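
To make the pattern concrete, here is a back-of-the-envelope sketch. The numbers are purely illustrative, not our actual capacity figures; the point is that connection demand on the database grows linearly with the front-facing fleet, so scaling one without the other leaves no headroom.

```python
# Back-of-the-envelope sketch (all numbers are illustrative, not actual
# capacity figures): why scaling only the front-facing fleet can overwhelm
# a database that wasn't scaled alongside it.

POOL_SIZE_PER_APP_SERVER = 50        # connections each app server keeps open
DB_MAX_CONNECTIONS = 10_000          # what the primary is rated for

def db_connection_demand(app_servers: int) -> int:
    """Total connections the fleet will try to hold against the database."""
    return app_servers * POOL_SIZE_PER_APP_SERVER

for fleet in (100, 400, 800):        # fleet sizes before/after scaling out
    demand = db_connection_demand(fleet)
    status = "OK" if demand <= DB_MAX_CONNECTIONS else "database melts down"
    print(f"{fleet:>4} app servers -> {demand:>6} connections ({status})")
```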

However, that wasn’t all. We also analysed traffic patterns from our client applications during the Tsunami and realised that our clients could be smarter in their caching strategy so as to smooth out the wave. A slew of changes and updates and loads of performance testing later, here we are today for the big game, hoping to beat our own personal best.
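
We won’t go into the exact client changes here, but as a purely illustrative sketch of the idea, one common way to smooth such a wave is to cache the session on the device and add a random jitter window before any unavoidable refresh, so a notification blast doesn’t arrive at the login API as one synchronized spike. The TTL and jitter values below are made up.

```python
import random
import time

# Illustrative client-side smoothing sketch (not the exact changes shipped):
# reuse a cached session instead of hitting the login API on every app open,
# and spread unavoidable refreshes over a jitter window so a notification
# blast doesn't become a thundering herd.

SESSION_TTL_SECONDS = 6 * 60 * 60     # how long a cached session stays valid
MAX_JITTER_SECONDS = 120              # window over which refreshes are spread

class SessionCache:
    def __init__(self):
        self.token = None
        self.fetched_at = 0.0

    def get_token(self, login_api_call):
        now = time.time()
        if self.token and now - self.fetched_at < SESSION_TTL_SECONDS:
            return self.token                               # no network call at all
        time.sleep(random.uniform(0, MAX_JITTER_SECONDS))   # de-synchronize clients
        self.token = login_api_call()
        self.fetched_at = time.time()
        return self.token
```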

Progressive Degradation

On the web, we talk about progressive enhancement, but when scaling a platform like Hotstar we have multiple levers to think about, and we actually have to think of progressive degradation. One key lever is the bit-rates at which we offer the content, which must degrade as concurrency grows, because we’re limited by hard bandwidth in the infrastructure. We also need to build in “positive panics” and “negative panics” to be able to take traffic away from the origin when it goes beyond what the platform is rated for.
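
As an illustration of the bit-rate lever, capping the top of the adaptive bit-rate ladder as platform concurrency crosses thresholds could look something like the sketch below. The ladder rungs and thresholds are made-up numbers, not our actual configuration.

```python
# Illustrative sketch of the bit-rate lever: cap the top of the adaptive
# bit-rate ladder as platform concurrency crosses thresholds, so total egress
# stays under the hard bandwidth ceiling. Rungs and thresholds are made up.

BITRATE_LADDER_KBPS = [320, 800, 1_500, 3_000, 5_000]

# (concurrency threshold, highest rung still served above that threshold)
DEGRADATION_STEPS = [
    (2_000_000, 5_000),
    (3_500_000, 3_000),
    (4_500_000, 1_500),
]

def allowed_ladder(concurrency: int):
    cap = BITRATE_LADDER_KBPS[-1]
    for threshold, capped_kbps in DEGRADATION_STEPS:
        if concurrency >= threshold:
            cap = capped_kbps
    return [rung for rung in BITRATE_LADDER_KBPS if rung <= cap]

print(allowed_ladder(1_000_000))   # full ladder: [320, 800, 1500, 3000, 5000]
print(allowed_ladder(4_600_000))   # degraded ladder near peak: [320, 800, 1500]
```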

Tier 1 actions (Login / Subscriptions / Watch) must visibly keep going (positive panics) but can be deferred in the back-end, while Tier 2 actions (Account / Logout) can temporarily show failure messages (negative panics). Apart from simply building in the levers, reviewing data patterns is crucial to understand and tweak client applications so that they are gentle on the servers. Of course, none of this really stands up without caching at multiple levels, which is where cache design for APIs, and their usage in the application, becomes all the more important.
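
Here’s a minimal sketch of how such panics could be wired up; the tier membership and responses are illustrative, not our actual implementation. When the panic flag is on, Tier 1 actions are acknowledged immediately and deferred to a queue, while Tier 2 actions fail fast with a friendly message instead of touching the origin.

```python
import queue

# Minimal sketch of "positive" vs "negative" panics (tiers and responses are
# illustrative): when panic mode is on, Tier 1 actions are acknowledged and
# deferred, while Tier 2 actions fail fast without touching the origin.

TIER_1 = {"login", "subscribe", "watch"}
TIER_2 = {"account", "logout"}

deferred_work = queue.Queue()      # drained later, once the wave passes
panic_mode = True                  # flipped by operators past a threshold

def process_now(action: str, payload: dict) -> dict:
    # Placeholder for the real call to the origin.
    return {"status": "ok"}

def handle(action: str, payload: dict) -> dict:
    if not panic_mode:
        return process_now(action, payload)
    if action in TIER_1:           # positive panic: looks successful to the user
        deferred_work.put((action, payload))
        return {"status": "accepted"}
    if action in TIER_2:           # negative panic: fail fast, ask to retry later
        return {"status": "unavailable", "retry_after_seconds": 300}
    return process_now(action, payload)
```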

To prep for the big event, we had to orchestrate across our infrastructure:

  • “Warmed up” our cloud-based load balancers so they could receive such heavy traffic
  • Worked with our cloud provider to ensure that the appropriate instance types would be available if the fleet started to scale up
  • Ensured that Disaster Recovery (DR) infrastructure was in place and that fail-overs were tested
  • Tested every “panic” situation to ensure it did what it advertised
  • Established clear “thresholds” past which we would tweak system parameters to support peak loads (see the sketch after this list)
  • Ran regular scrum-of-scrums to maintain high communication and coordination across multiple teams
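
The threshold playbook from the list above, sketched with made-up thresholds and actions: each concurrency level maps to a ready-made step, and each step fires exactly once as the platform climbs past it.

```python
# Illustrative threshold runbook (thresholds and actions are made up):
# each concurrency threshold maps to a ready-made action, and each action
# fires exactly once as the platform climbs past it.

THRESHOLD_ACTIONS = [
    (2_500_000, "scale out the login fleet"),
    (3_500_000, "cap the top bit-rate rung"),
    (4_500_000, "enable Tier 2 negative panics"),
]

fired = set()

def run_script(action: str):
    # Placeholder: in practice this would invoke the pre-tested script
    # and announce the change in the on-call chat room.
    print(f"executing runbook step: {action}")

def on_concurrency_sample(concurrency: int):
    for threshold, action in THRESHOLD_ACTIONS:
        if concurrency >= threshold and threshold not in fired:
            fired.add(threshold)
            run_script(action)
```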

Protocols

During key events, when so many people are helping to orchestrate the system, I trust my ChatOps basics. We laid down some basic rules:

  • The whole on-call team was in a single chat room
  • Every action and decision was emitted in the chat room, creating a chronology of what transpired (see the sketch after this list)
  • Key decision-makers were all in the same room
  • Threshold scenarios were identified and plans were created with ready-made scripts to enact the changes
  • We used our “TV wall” to great effect, keeping all the key dashboards up on rotation with a nifty Chrome plug-in called Revolver
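
As a tiny illustration of the second rule, a helper that posts every action into the war-room chat could look like the sketch below. The webhook URL and payload format are placeholders; the details depend on whichever chat tool is in use.

```python
import json
import time
import urllib.request

# Tiny sketch of the "emit every action in the chat room" rule: post each
# decision to a chat webhook so the room doubles as a chronology.
# CHAT_WEBHOOK_URL and the payload format are placeholders.

CHAT_WEBHOOK_URL = "https://chat.example.com/hooks/war-room"

def announce(actor: str, action: str) -> None:
    entry = {"text": f"[{time.strftime('%H:%M:%S')}] {actor}: {action}"}
    request = urllib.request.Request(
        CHAT_WEBHOOK_URL,
        data=json.dumps(entry).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

# Example: announce("akash", "raised login fleet capacity past threshold 1")
```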

This is just how we orchestrated this particular event; there are many more things we could do better, and we’ll get there.

Looking Ahead

At Hotstar, we’re working on ensuring that our platform can scale out to tens of millions of customers, and beyond a point this is just not possible with monolithic architectures. No knock on monoliths, though; I’ve built a few in my time, and this piece makes for interesting reading. The challenge of scaling anything is how you partition it, which allows you to spread the “pain”, as it were. Monoliths, by their very nature, offer fewer options to tweak.

We’re in the middle of revamping our platform, and building for spikes, scale, and resiliency is core to the design. That’s a topic for an upcoming post.

We’re adding folks to our team who want to get in on the ground floor of building this highly scalable and reliable platform. If you like calm waters, stay away. However, if you want to ride a Tsunami, go check out our careers page!
