<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[JioHotstar - Medium]]></title>
        <description><![CDATA[Product and Engineering notes from JioHotstar, India&#39;s leading OTT service. Want to work with us? Head on over to https://www.jiostar.com - Medium]]></description>
        <link>https://blog.hotstar.com?source=rss----dbc3fcbc7f07---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>JioHotstar - Medium</title>
            <link>https://blog.hotstar.com?source=rss----dbc3fcbc7f07---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 11 Mar 2026 16:47:45 GMT</lastBuildDate>
        <atom:link href="https://blog.hotstar.com/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Orchestrating JioHotstar Traffic: The Difference Between a Loading Spinner and a Winning Six]]></title>
            <link>https://blog.hotstar.com/orchestrating-jiohotstar-traffic-the-difference-between-a-loading-spinner-and-a-winning-six-8a385b01380e?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/8a385b01380e</guid>
            <category><![CDATA[cdn]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[live-streaming]]></category>
            <category><![CDATA[qos]]></category>
            <category><![CDATA[last-mile-delivery]]></category>
            <dc:creator><![CDATA[Karan Kaul]]></dc:creator>
            <pubDate>Mon, 09 Mar 2026 09:41:53 GMT</pubDate>
            <atom:updated>2026-03-09T09:41:52.459Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/767/1*IxovPCvikcIuEEHqKwz8FQ.jpeg" /></figure><p>If you’ve ever streamed a high-stakes match on JioHotstar, the experience probably felt like a simple “tap and play”. It’s a seamless transition from the app icon to the stadium; a flick of a finger and you’re right there, cheering for your players to hit another six. Beneath that simple play button lies one of the most aggressive engineering challenges in the world.</p><p>At Hotstar, we navigate a “<a href="https://blog.hotstar.com/t-for-tsunami-dealing-with-traffic-spikes-c22443bcdd3e"><strong>Tsunami</strong></a>” of traffic across one of the most complex network landscapes in the world. During events like the IPL or a high-stakes ODI match, we don’t just manage millions of users - we manage millions of unique network realities. If you thought just placing a Content Delivery Network (CDN) makes the magic happen — read on.</p><h3>The Illusion of the Monolithic Network</h3><p>Traditionally, CDN traffic management happens at the <strong>network provider </strong>or <strong>state</strong> level. For years, this was enough. But at our scale, “good enough” is the enemy of the audacious. A live event requires upwards of 60–80 Tbps of network bandwidth for streaming. To put things into perspective, it’s like instantly grabbing the <strong>entire 4K movie collection</strong> from JioHotstar (hundreds of blockbusters). Now imagine finishing all those downloads in just one second… and then doing it again the <strong>next second and the next</strong>. That is the relentless pace of the tidal wave we face during a major event.</p><p>However, <strong>India’s network isn’t a monolith</strong>. It is a <strong>mosaic</strong> of fiber, 4G, 5G and fluctuating bandwidth that changes from one street to the next. 
A user on a 5G connection in South Delhi faces a vastly different network topology than a user on a local ISP in rural Rajasthan.</p><p>When you are only as good as the video you deliver, you realize that macro-level routing hits a ceiling. To provide the best <a href="https://en.wikipedia.org/wiki/Quality_of_service">Quality of Service</a> (QoS), we needed to treat every state, every city, and eventually every cohort as a unique routing decision.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NWpNsHiYvtc4oH3vqFuIBw.png" /></figure><h3>The Solution: The QoS Routing Manager</h3><p>To solve the “Last Mile” problem, we built a real-time observability and orchestration engine: the <strong>QoS Routing Manager</strong>.</p><p>The mission of this service is simple but massive: observe crucial video metrics at a granular level and adjust traffic weights dynamically to ensure every user is mapped to the best possible CDN for their specific location.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ia3iObTclEtPF3kL5rwbWQ.png" /></figure><h4>The Granular Cohort</h4><p>Instead of routing by “Maharashtra” or “Jio,” we segment users into <strong>Cohorts</strong>. A cohort is a group of people who share a common characteristic over a given period. Currently, a cohort for us is a specific combination of geographical, network, and business categorisation, i.e. <a href="https://www.cloudflare.com/learning/network-layer/what-is-an-autonomous-system/"><strong>ASN</strong></a><strong>-Country-State-City-UserType</strong>.</p><p>This allows us to detect if a specific provider is facing issues in a specific city, even if their national health looks perfect. 
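</p><p>As a toy illustration of the idea (the field names, ASN value, and weights below are hypothetical, not our production schema), a cohort can be modelled as a composite key into a per-cohort routing table:</p>

```python
from collections import defaultdict

# Hypothetical cohort key: ASN-Country-State-City-UserType.
# All identifiers and weights here are illustrative.
def cohort_key(asn, country, state, city, user_type):
    return (asn, country, state, city, user_type)

# Every cohort carries its own CDN weight vector, so one ASN/city
# pair can be rerouted without touching the rest of the country.
routing_table = defaultdict(lambda: {"cdn_a": 0.5, "cdn_b": 0.5})

delhi = cohort_key("AS-X", "IN", "DL", "New Delhi", "subscriber")
routing_table[delhi] = {"cdn_a": 0.8, "cdn_b": 0.2}  # steer this cohort only
```

<p>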
By slicing the data this way, we can bypass localized congestion before it affects the broader user base.</p><h4>The Power of The Scoring Logic</h4><p>The QoS Routing Manager pulls real-time performance data from our sophisticated in-house <a href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a> beast - called <strong>ARGUS</strong> - which serves as our eyes and ears when it comes to video performance across the platform. It collects heartbeat data from the devices, processes it, and provides the telemetry to our service.</p><p>We map every metric - <strong>Playback Failure Rate (PFR)</strong>, <strong>Rebuffering</strong>, and <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/"><strong>RTT</strong></a><strong> Latency </strong>- into normalized scores using a <a href="https://en.wikipedia.org/wiki/Piecewise_linear_function"><strong>Linear Piecewise Scoring</strong> function</a>. This allows us to define “Severity Buckets” (Ideal, Baseline, Sev3, Sev2, Sev1) based on direct business impact.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h0MVZwlj9tjN5kSZelKx8w.png" /></figure><p>To determine which CDN “wins” for a specific cohort, we calculate a <strong>Cumulative Health Score</strong> based on a specific precedence order:</p><blockquote>Cumulative Score = ( X * PFR{score} ) + ( Y * Rebuffer{score} ) + ( Z * RTT{score} )</blockquote><p>We weigh PFR most heavily to ensure that “<strong>reachability</strong>” is the absolute priority, followed closely by the “<strong>fluidity</strong>” of the stream (Rebuffering) and the “<strong>snappiness</strong>” of the connection (RTT).</p><h4>Filtering the Noise: The Power of EWMA</h4><p>Raw network telemetry is inherently noisy. 
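</p><p>A minimal sketch of the scoring described above (the knot positions and the X/Y/Z weights are illustrative, not our tuned production values):</p>

```python
import bisect

def piecewise_score(value, knots):
    """Linear piecewise score: knots are (metric_value, score) pairs
    sorted by metric value; values outside the range are clamped."""
    xs = [k[0] for k in knots]
    ys = [k[1] for k in knots]
    if value <= xs[0]:
        return ys[0]
    if value >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, value) - 1
    frac = (value - xs[i]) / (xs[i + 1] - xs[i])  # interpolate between knots
    return ys[i] + frac * (ys[i + 1] - ys[i])

# Hypothetical knots mapping PFR% to a 0-100 score (Ideal .. Sev1).
PFR_KNOTS = [(0.0, 100.0), (0.5, 90.0), (1.0, 70.0), (2.0, 40.0), (5.0, 0.0)]

def cumulative_score(pfr, rebuffer, rtt, weights=(0.5, 0.3, 0.2)):
    """Weighted sum with PFR weighted most heavily (X > Y > Z)."""
    x, y, z = weights
    return x * pfr + y * rebuffer + z * rtt
```

<p>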
A momentary 4G tower hand-off in Jammu or transient packet loss in Bangalore can look like a critical failure in a 10-second window.</p><p>If our routing engine reacted to every micro-spike, we would introduce dangerous volatility, creating a “jittery” experience where users are constantly bounced between CDNs. We needed a way to separate the true performance trend from the momentary noise.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/699/1*XBPEBQmQXYJ-2bwLdKfiDg.png" /><figcaption>CDN Scores for different cohorts — observe the intermittent drops and spikes</figcaption></figure><p>To achieve this, we don’t pass raw metrics directly into our scoring engine. Instead, we pass all incoming telemetry through an <a href="https://corporatefinanceinstitute.com/resources/career-map/sell-side/capital-markets/exponentially-weighted-moving-average-ewma/"><strong>Exponentially Weighted Moving Average</strong></a><strong> (EWMA)</strong> filter.</p><p>The formula we use is:</p><blockquote>EWMA{t} = α * x_t + ( 1 - α ) * EWMA{t-1}</blockquote><p><em>where </em>EWMA<em>{t} is the new smoothed value, x_t is the raw input score, and </em>EWMA<em>{t-1} is the previous smoothed history.</em></p><p>We tune our smoothing factor <strong>α</strong> to approximately <strong>0.6</strong>. In practical terms, this means recent metrics have a stronger influence on the final score while the <strong>last 5 scores or so still carry significant weight</strong>.</p><p>Only once the metrics are smoothed by EWMA do we move on to routing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Rpe7WzuMZdVqI4rA8XoPbA.png" /><figcaption>EWMA CDN Scores for the same cohorts — smoother spikes</figcaption></figure><h4>Two-Phase Capacity Steering</h4><p>A high score isn’t the only requirement for routing. We must balance “Customer Joy” with “Infrastructure Dynamics”. 
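</p><p>The EWMA filter described above can be sketched in a few lines (α = 0.6 as in the formula; the sample scores below are made up):</p>

```python
class EwmaFilter:
    """EWMA_t = alpha * x_t + (1 - alpha) * EWMA_{t-1}; with
    alpha = 0.6, history decays geometrically at (1 - alpha)^n."""
    def __init__(self, alpha=0.6):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x  # seed with the first raw sample
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value

# A transient dip in otherwise-healthy scores is damped toward 60
# rather than echoed verbatim as 40, and recovers within a few ticks.
f = EwmaFilter(alpha=0.6)
smoothed = [f.update(x) for x in [90, 90, 40, 90, 90]]
```

<p>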
For a media streaming entity, the biggest tradeoff is between quality and available network bandwidth.</p><blockquote>It’s like a bridge where every viewer wants to drive a wide luxury bus (<strong>High Quality</strong>), but the physical lanes (<strong>Bandwidth</strong>) are finite; during a peak surge, there simply isn’t enough pavement to let everyone drive a bus at once without the bridge failing, so you have to balance the vehicle size just to keep everyone moving.</blockquote><p>Our engine keeps track of the used bandwidth on the CDNs and employs two distinct strategies:</p><ul><li><strong>Pre-Threshold Guardrail (&lt; X% Utilization):</strong> We preemptively throttle CDNs that are on track to hit their capacity limits too early, even if they are performing well.</li><li><strong>Uniform Exhaustion (&gt;= X% Utilization):</strong> During a surge, we shift logic to ensure all CDNs exhaust their capacity at the same time, squeezing every possible megabit out of our infrastructure.</li></ul><h4>Safety Gates: Resilience Over Risk</h4><p>In a system of this scale, “no update” is better than a “bad update.” We built a defense-in-depth approach covering both data integrity and routing logic.</p><ol><li><strong>ASN Constraints (Hard Binding):</strong> Physics and contracts still matter. Some CDNs only have presence on specific networks. Before any scoring happens, the system applies hard constraints to ensure we never route a cohort to a CDN that physically cannot serve it.</li><li><strong>Volatility Dampening (Max Deviation):</strong> To prevent wild swings in traffic that could destabilize the network, we cap the maximum percentage change allowed in a single iteration (e.g., a CDN cannot gain or lose more than y% share in one minute).</li><li><strong>The “Warm-Up” Floor (Minimum Weight):</strong> We never let a functional CDN drop to 0% traffic. We enforce a configurable floor (typically 5%). 
This keeps CDN caches warm and DNS paths active, ensuring that if we need to fail back to them instantly during a crisis, they are ready to take the load immediately.</li></ol><h3>The Audacious Impact</h3><p>The results of moving to granular, cohort-based management have been significant. During a recent <strong>T20 Match</strong>, we ran an A/B rollout of the QoS Routing Manager with only ASN-State cohorts. For the treatment group:</p><ul><li><strong>Playback Failure Rate (PFR)</strong> improved by a staggering <strong>11%</strong>.</li><li><strong>Rebuffering</strong> and <strong>RTT Latency</strong> both saw a <strong>2%</strong> improvement.</li></ul><p>In our world of 50M+ concurrent users, an 11% improvement in PFR represents millions of users who stayed connected to the game instead of seeing a loading spinner. This is how we ensure that whether you are in a high-rise in Chennai or a village in Sikkim, you enjoy each and every six with best-in-class quality.</p><p>The work continues to build on the millions of data points streaming in, which allow us to steer all our customer sessions to a stable viewing experience!</p><p><em>Are you interested in solving high-concurrency challenges at the edge? 
Do check out </em><a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering"><em>open roles</em></a><em> if you want to build for millions of customers, who will use features that you build!</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8a385b01380e" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/orchestrating-jiohotstar-traffic-the-difference-between-a-loading-spinner-and-a-winning-six-8a385b01380e">Orchestrating JioHotstar Traffic: The Difference Between a Loading Spinner and a Winning Six</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Monetisation : Multi-period DASH x ExoPlayer]]></title>
            <link>https://blog.hotstar.com/monetisation-multi-period-dash-x-exoplayer-db8e8c00e521?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/db8e8c00e521</guid>
            <category><![CDATA[dash]]></category>
            <category><![CDATA[exoplayer]]></category>
            <category><![CDATA[streaming]]></category>
            <category><![CDATA[ads]]></category>
            <dc:creator><![CDATA[Abhishek Bansal]]></dc:creator>
            <pubDate>Wed, 04 Mar 2026 04:18:06 GMT</pubDate>
            <atom:updated>2026-03-04T04:18:04.447Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6OlEGvI87XY7LSE_hoNFiQ.png" /></figure><blockquote>Client-Side Ad Insertion (CSAI) levels up using multi-period DASH in the Android ecosystem. This is our journey to upgrade ExoPlayer to leverage multi-period DASH.</blockquote><h3>Introduction</h3><p><a href="https://ottverse.com/single-period-vs-multi-period-dash/">Multi-period DASH</a> is a variant of the <a href="https://en.wikipedia.org/wiki/Dynamic_Adaptive_Streaming_over_HTTP">DASH</a> format that offers significant advantages, such as the ability to insert dynamic content like disclaimers, dub cards, and ad breaks without re-encoding the entire video. This flexibility is crucial for delivering personalized and localized content to millions of users across diverse regions.</p><p>While the DASH standard natively supports multi-period manifests, the Android <strong>ExoPlayer</strong> ecosystem (specifically the AdsMediaSource component) was designed with a single-period assumption. This missing piece in the puzzle meant we couldn&#39;t support Client-Side Ad Insertion (CSAI) on these modern streams out of the box.</p><p>This post details our engineering journey: identifying the constraints, evaluating architectural alternatives, and ultimately redesigning ExoPlayer’s ad handling to support multi-period content seamlessly.</p><h3>Context</h3><h4>The Evolution of DASH Content</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MxXNu-p6Y4_x3VQqXe9Nlg.png" /></figure><p>JioHotstar relied on single-period DASH for entertainment content, where the entire stream including ads was encapsulated within a single period. 
This approach limited flexibility: inserting disclaimers, dub cards, and to some extent, dynamic ad breaks required re-encoding the entire video, which was both time-consuming and resource-intensive.</p><p>Consider a common localization scenario: the same title is launched across multiple regions with different primary languages. Audio dubs may be recorded by different artists per market, so dub-card credits vary by region; likewise, legal and regulatory requirements differ, driving region-specific disclaimers.</p><p>Under a single-period workflow, each of these variations would necessitate a distinct transcode of the full asset. If we target ~20 regions, that implies ~20 full transcodes, leading to long time-to-market, high cost, and significant operational complexity — super unscalable.</p><p>Multi-period DASH addresses these challenges by dividing content into multiple periods, each representing a distinct segment (e.g., disclaimer, main content, ad break). This modularity enables seamless insertion of additional content without re-encoding, significantly reducing operational overhead. In practice, we can attach region-specific disclaimers as a lightweight period and swap dub-card credits per locale without touching the main asset, keeping a single <a href="https://medium.com/freelance-filmmaker/intermediate-codec-using-mezzanine-video-formats-a28a53d3256e">mezzanine</a> and avoiding redundant full transcodes.</p><h4>The Challenge of Ad Monetization</h4><p>While multi-period DASH brought flexibility, it also introduced a critical challenge to our ad insertion mechanism for Video on Demand (VOD) content. ExoPlayer, our Android media player, lacked native support for client-side ad insertion in multi-period DASH streams.</p><h4>Root Cause</h4><p>ExoPlayer’s code restricted ad playback to single-period content. For ad-supported users, ExoPlayer uses AdsMediaSource for ad playback and tracking, and AdsMediaSource does not allow multi-period content with ads. 
It throws IllegalArgumentException if the content has more than one period.</p><p>The crash happens as soon as the AdsMediaSource object is created, irrespective of whether there are actual ads inserted in the stream or not.</p><p>Digging into the library code, we found explicit assertions guarding against complexity:</p><pre>// Logic inside AdsMediaSource / SinglePeriodAdTimeline <br>Assertions.checkState(periodCount == 1);</pre><p>An immediate thought here would be to upgrade to the latest <a href="https://github.com/androidx/media">AndroidX Media3</a> package; unfortunately, it had <a href="https://github.com/androidx/media/issues/1642">the same issue</a>.</p><p>Beyond just this assertion, the architecture of AdsLoader and AdPlaybackState was built around a &quot;Shared State&quot; model. In a multi-period timeline (e.g., <em>Period 0: Logo</em> -&gt; <em>Period 1: Movie</em>), ExoPlayer applied the <strong>same</strong> AdPlaybackState to every period.</p><p>Because of this shared state, we couldn’t just remove the assertion and move on with life. There were problems beyond it: a pre-roll scheduled at 0s would try to play at the start of <em>every</em> period (logo start, movie start, etc.), or a pre-roll would not play at ingress points like Continue Watching from the app.</p><h3><strong>Key Findings</strong></h3><p>ExoPlayer’s native multi-period DASH support has no client-side ad insertion. 
assert(periodCount == 1) assertions crash playback outright on multi-period streams (<a href="#">related discussion</a>).</p><p>Four problems followed from that root constraint:</p><p><strong>AdPlaybackState</strong> is designed for single-period timelines — applied to multi-period content, ads repeat across periods.</p><p><strong>Cue-point alignment</strong> breaks at period boundaries — ad breaks trigger early, late, or not at all.</p><p><strong>Preroll handling</strong> requires explicit edge-case logic: play once on cold start, suppress on re-entry from Continue Watching and equivalent ingress points.</p><p><strong>Backward compatibility</strong> forced a split serving strategy — single-period DASH for older clients, multi-period for newer ones.</p><p>Extending AdsMediaSource to handle this required architectural changes to ExoPlayer&#39;s ad handling layer. Rollout was phased, with playback failure rate, buffer times, and ad impressions as the primary watch metrics.</p><h4>Insights</h4><p>These findings underscored the need for a flexible, scalable approach to ad insertion in multi-period DASH content. 
Addressing ExoPlayer’s limitations and rethinking AdPlaybackState and cue-point handling laid the groundwork for a solution balancing technical feasibility and user experience.</p><h3>Methodology</h3><p>To address the challenges of enabling ads on multi-period DASH content in ExoPlayer, the team adopted a systematic and iterative approach:</p><ul><li>Audited ExoPlayer’s DASH manifest handling and ad timeline management to locate all single-period assumptions.</li><li>Prototyped assertion removal on multi-period manifests; used observed failures to surface edge cases early.</li><li>Built custom AdPlaybackState logic to split the main state into period-specific instances, scoping each ad to its designated period.</li><li>Implemented cue-point realignment relative to each period’s start time.</li><li>Replaced SinglePeriodAdTimeline with a custom MultiPeriodAdTimeline.</li><li>Gated multi-period DASH behind a version check; legacy clients continue on single-period.</li><li>Tested preroll/midroll, cross-period seeking, and platform variants (Android, Android TV, Fire TV).</li><li>Phased rollout starting with select content; monitored failure rates, buffer times, and ad impressions before expanding.</li></ul><h3>The Solution: A Custom MultiPeriodAdTimeline</h3><p>Two quick options were ruled out early — removing the assertion and upgrading to Media3 (covered in root cause).</p><p>Three turnkey approaches were evaluated:</p><ul><li>Separate players for ads and content</li><li>Client-side playlist stitching of single-period content and ad clips</li><li>ClippingMediaSource + ConcatenatingMediaSource to simulate a single timeline</li></ul><p>All three introduce buffering at player or content switches, lose AdsMediaSource capabilities (timeline management, seek handling), and scatter implementation complexity across multiple teams.</p><h4>Solution : MultiPeriodAdTimeline</h4><p>The chosen path: replace SinglePeriodAdTimeline with a custom 
MultiPeriodAdTimeline.</p><p><strong>First attempt</strong> — apply AdPlaybackState only to the content period; treat disclaimer and credits as ad-free.</p><p>Simple to implement. Two problems:</p><ul><li>Pre-roll plays after the disclaimer, not at stream start</li><li>No ads in the credits period — breaks standard behavior where the last cue-point fires on a direct seek to end</li></ul><p><strong>Second attempt</strong> — create a dedicated AdPlaybackState per period (logo, content, dub card), each including a pre-roll to handle Continue Watching entry points.</p><p>Works, but hardcodes period sequence and count on the client. Any structural change to the stream breaks compatibility. The streaming team loses the freedom to modify period structure independently.</p><h4>Breaking the Monolith</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Oh2XgcxQ1xAMxcmIRNkyCg.png" /><figcaption>Propagate Playback State to Each Period</figcaption></figure><p>We took a step back and tried to generalize the solution. The biggest challenge was the “Shared State,” so we now handle all periods equally: we still have a single main AdPlaybackState with cue-points as if we had single-period content.</p><p>Internally, it is duplicated and <strong><em>transformed specifically for each period</em></strong>. Any modifications happen on the main AdPlaybackState and are mirrored internally when it is updated.</p><p>We decided to keep the indexing of ad breaks in each period’s AdPlaybackState the same: every period carries all the cue-points at the same indices, but some are ignored (skipped). Each period applies these transformations to every cue-point:</p><ul><li>Subtract the start position of the period — times become relative to the period start — some cue-points may end up negative, but that is fine for us.</li><li>Mark cue-points after the period end as skipped. These will be played in the following periods.</li></ul><p>That is it. 
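</p><p>A sketch of those two rules in Python (ExoPlayer is Java; names and types here are simplified stand-ins, with a string marker in place of AdPlaybackState.withSkippedAdGroup()):</p>

```python
SKIPPED = "skipped"  # stand-in for marking an ad group as skipped

def transform_for_period(cue_points_us, period_start_us, period_end_us):
    """Re-base the single main list of cue-points for one period.
    Ad-group indices stay constant across periods; out-of-range
    groups are marked skipped instead of being removed."""
    out = []
    for cue in cue_points_us:
        local = cue - period_start_us       # rule 1: relative to period start
        if cue >= period_end_us:            # rule 2: belongs to a later period
            out.append((local, SKIPPED))
        else:
            out.append((local, "active"))   # may be negative: already elapsed
    return out

# Toy timeline: logo period [0s, 5s), movie period [5s, 65s);
# cue-points at 0s (pre-roll) and 35s (mid-roll), in microseconds.
cues = [0, 35_000_000]
logo = transform_for_period(cues, 0, 5_000_000)
movie = transform_for_period(cues, 5_000_000, 65_000_000)
```

<p>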
A lot of investigation and iterations crystallized into a few simple rules. After the transformation, the cue-point positions are as intended and all ad breaks trigger correctly, even across period boundaries. From the user’s perspective, there is no difference between multi-period and single-period content. In the end, <strong><em>single-period is now just a special case of multi-period content.</em></strong></p><h4>The Indexing Problem</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Oxn9qfqQbVsqukhztY4yOw.png" /><figcaption>Constant Break indices in Each Period</figcaption></figure><p>ExoPlayer identifies ad groups by <strong>index</strong>. If we simply removed “non-relevant” ad groups from a period’s state, the indices would shift, causing the player to play the wrong ad or crash.</p><p><strong>The Fix:</strong> We kept the Ad Group count constant across all periods.</p><ul><li>If Ad Group #1 belongs to Period 1, but we are currently configuring Period 0, we mark Ad Group #1 as SKIPPED in Period 0&#39;s state using AdPlaybackState.withSkippedAdGroup().</li><li>This ensures adGroupIndex 1 always refers to the same logical ad break, regardless of which period is currently active.</li></ul><h4>Handling Continuity</h4><p>We had to ensure that seeking across period boundaries didn’t re-trigger ads. 
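</p><p>A hedged sketch of that bookkeeping, assuming a bare global set of played break indices (the real state lives inside ExoPlayer’s ad playback state, not a set):</p>

```python
# Hypothetical global bookmark of played ad breaks, keyed by the
# ad-group index that stays constant across all periods.
played_breaks = set()

def on_break_finished(ad_group_index):
    played_breaks.add(ad_group_index)

def should_play(ad_group_index):
    # Seeking back across a period boundary must not replay a break.
    return ad_group_index not in played_breaks

on_break_finished(1)  # viewer watched Ad Break #1 in Period 1
```

<p>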
Using the global bookmarking map, if a user watches an ad in Period 1 and seeks back to Period 0, the player knows that logical “Ad Break #X” has already been played.</p><h4>Backward Compatibility</h4><p>We served multi-period DASH only to newer app versions; legacy versions continued with single-period DASH.</p><h3>Summary</h3><p>The implementation of ads on multi-period DASH content in ExoPlayer yielded the following outcomes:</p><h4>Technical Achievements</h4><ul><li><strong>Seamless Ad Playback</strong>: Ads are now played at the correct times, without repetition, even across period boundaries.</li><li><strong>Accurate Cue-point Handling</strong>: Dynamic adjustment of cue points ensures ads trigger precisely as intended.</li><li><strong>Robust Backward Compatibility</strong>: Users on older app versions experience no disruption, as they continue to receive single-period DASH.</li><li><strong>Performance Metrics</strong>: Key metrics such as start lag, buffering, ad impressions, and playback failure rates remained within acceptable thresholds throughout the rollout.</li></ul><h4>User Experience</h4><ul><li><strong>Dynamic Content Delivery</strong>: The platform can now insert disclaimers, dub cards, and localized content dynamically, enhancing personalization.</li><li><strong>Uninterrupted Viewing</strong>: Users experience smooth transitions between content and ads, with no playback disruptions.</li></ul><h4>Business Impact</h4><ul><li><strong>Uninterrupted Monetization</strong>: No impact on monetization as we modernized our media stack.</li><li><strong>Operational Efficiency</strong>: Reduced need for re-encoding and streamlined content pipelines have lowered operational overhead.</li><li><strong>Cost Savings: </strong>Re-encoding a large media library like JioHotstar’s would have been cost-prohibitive. 
With this solution, no separate encoding is needed for existing content.</li></ul><h3>Conclusion</h3><p>“Simple” features like adding a 5-second logo or disclaimer often hide iceberg-sized engineering challenges. By diving deep into the internals of ExoPlayer and rethinking how AdPlaybackStates are managed, we turned a hard constraint into a flexible capability.</p><p>If you are facing a similar issue, we have proposed these changes for merging upstream in the <a href="https://github.com/androidx/media/pull/2501">Media3 GitHub repo</a>.</p><p>This project reinforced a key lesson for us: sometimes the best way to move forward isn’t to work <em>around</em> the platform (multi-player), but to improve the platform itself!</p><p><em>Want to dig into player internals and contribute back to projects like ExoPlayer? Do check out </em><a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering"><em>open roles</em></a><em> if you want to build for millions of customers, who will use features that you build!</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=db8e8c00e521" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/monetisation-multi-period-dash-x-exoplayer-db8e8c00e521">Monetisation : Multi-period DASH x ExoPlayer</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Gen AI Video — Building Scalable Validation Framework]]></title>
            <link>https://blog.hotstar.com/building-scalable-validation-framework-for-video-generation-6c67d1177ce2?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/6c67d1177ce2</guid>
            <category><![CDATA[ai-video-generation]]></category>
            <category><![CDATA[generative-ai-use-cases]]></category>
            <category><![CDATA[ai-validation]]></category>
            <dc:creator><![CDATA[Sagar Tekwani]]></dc:creator>
            <pubDate>Wed, 18 Feb 2026 03:42:23 GMT</pubDate>
            <atom:updated>2026-02-18T03:42:22.487Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>Gen AI Video — Building A Scalable Validation Framework</strong></h3><blockquote>We discuss building strong eval frameworks as part of our Generative AI studio to ensure that generated video maintains a high quality bar.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hOF71rV-cfi6sJwXo9tV8A.png" /></figure><h3>Introduction</h3><p>As JioHotstar scales to serve one of the world’s largest streaming audiences, generative AI (GenAI) offers a powerful opportunity to accelerate content creation, adapt stories across languages and regions, and unlock creative workflows that traditional pipelines cannot match.</p><p>Customers evaluate AI-generated video with the same standards they apply to premium productions; any drift in character identity, structural deformity, abrupt scene shift, or unsafe visual instantly undermines realism.</p><p>Generative models, being probabilistic, introduce such inconsistencies naturally, and at our scale even rare defects accumulate into meaningful quality gaps. Ensuring stable characters, coherent locations, and safe content therefore becomes a scientific challenge central to customer acceptance.</p><p>To address this, we built a <strong>Validation Framework</strong> that operates as a first-class component of the generation pipeline. This closed-loop system enforces quality, safety, and consistency at the same cadence as creation, enabling generative video to meet production-grade expectations on the JioHotstar platform.</p><h3>System Overview: The Validation Layer</h3><p>The video generation process starts with editorial scripts, from which the system extracts characters, locations, accessories, and context. These entities drive keyframe generation, which expands into video clips and assembles into the final video.</p><p>The <strong>Validation Framework</strong> (Fig. 1) spans all stages of this process and operates synchronously within the workflow. 
It evaluates intermediate outputs, flags issues early, and triggers targeted regeneration or parameter adjustments to maintain quality. Brand detection and safety checks act as hard gates, while other modules guide regeneration to preserve consistency and visual integrity.</p><p>Each module governs a specific quality dimension (character consistency, deformity, scene continuity, brand safety, content appropriateness, or story coherence) and emits go/no-go signals that control progression. The system logs all validation outcomes and samples them for periodic human review to support ongoing calibration.</p><p>Together, these modules form the framework’s <strong>control surface</strong>, defining the quality of generative video. Most components operate at production readiness, while long-range story and concept continuity remains an active area of experimentation as we continue refining metrics and validation strategies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PiPJW32Mpdhx1ntQK7Ho3Q.png" /><figcaption><em>Fig.1: Overview of Developed Validation Framework</em></figcaption></figure><h3>Character Consistency</h3><p>Character consistency ensures that a character’s <strong>visual identity</strong> remains stable across all frames and scenes. Generative models, being stochastic, can drift subtly in facial features and proportions. These deviations break temporal realism and make the sequence unusable for production.</p><p>To quantify consistency, we represent each character through frame-level embeddings and compare them against a reference “hero character” image approved by the creative team. This hero image serves as the canonical visual anchor for that character.</p><p>We use an ensemble of independent facial similarity models (e.g., <em>Buffalo-L</em>, <em>Antelope-v2</em>, <em>FaceNet, etc.</em>), each fine-tuned for <strong>intra-character identity matching</strong>. 
Rather than representing a face with a single global embedding, these models extract <strong>multiple localised facial feature vectors</strong> corresponding to stable semantic regions of the face (such as eyes, nose bridge, jawline, and facial contours).</p><p>Sampling multiple localized descriptors improves robustness to pose changes, partial occlusion, lighting variation, and expression drift, failure modes that are common in video generation but underrepresented in still-image similarity tasks. Each character <em>k</em> is represented by a set of reference embeddings f_hero^(k).</p><p>For a given frame <em>i</em>, each model <em>m</em> produces an embedding f_i^(k,m). We compute similarity using <strong>cosine similarity</strong>, which measures angular alignment in embedding space and remains invariant to feature magnitude.</p><p>This property is critical because generative models can alter contrast, illumination, and texture intensity without changing identity. <strong>Distance-based metrics</strong> (Euclidean or Manhattan) are sensitive to these magnitude shifts and empirically produce unstable thresholds across frames.</p><p>We also experimented with <strong>Jaccard similarity</strong> on facial embeddings but observed weaker alignment with <strong>human-in-loop evaluations</strong>. 
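</p><p>The magnitude-invariance argument can be checked directly. The sketch below (toy vectors, not our production embeddings) scales an embedding uniformly, as a contrast or illumination shift might, and shows that cosine similarity is unchanged while Euclidean distance is not:</p>

```python
import math

def cosine(a, b):
    # Angular alignment: invariant to uniform scaling of either vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

def euclidean(a, b):
    # Sensitive to magnitude shifts even when direction is identical.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

hero = [0.2, 0.7, 0.1]
frame = [2 * x for x in hero]   # same identity direction, doubled magnitude

print(cosine(hero, frame))      # ~1.0: identity judged stable
print(euclidean(hero, frame))   # > 0: a distance metric penalizes the shift
```

<p>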
The similarity between the generated frame for a character <em>k</em> and its corresponding hero image for a model <em>m</em>, <em>S_i^(k,m)</em>, is computed as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/554/1*1rmvsE3Qb1CTcZs-OCD6pg.png" /></figure><p>Each model <em>m</em> has a threshold τ_m calibrated on <strong>human-labeled data.</strong> The binary decision per model <em>m</em> for each character <em>k</em> is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/626/1*tQvFqPVlDG9_mbzpTqL_vA.png" /></figure><p>The final consistency decision <em>C</em> uses an <strong>ensemble aggregation</strong> across all models (N) and characters, tuned for <strong>recall maximization</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/554/1*nPG9dNM59dlSQ1JJY5Z90g.png" /></figure><p>where w_m represents each model’s reliability weight and defines the ensemble’s sensitivity.</p><p>This design ensures that every potential inconsistency is captured, even if one model fails, prioritizing recall over precision. Minor false positives are filtered by enforcing consistency over short frame windows rather than acting on isolated detections. Thresholds τ_m and weights w_m are periodically re-calibrated using <strong>human-in-loop feedback</strong>. Annotators review flagged segments, refine labels, and feed corrections back into the validation loop. Thus, by fusing diverse fine-tuned models and anchoring every comparison to the hero reference, the system maintains stable character identity across the generation pipeline, regardless of lighting or scene transition.</p><p>For side-profile consistency, which often reveals identity drift missed in frontal views, we generate strict left and right 90-degree profile references from the approved hero image, excluding frontal or angled views. We store the <strong>front, left, and right profiles</strong> as canonical anchors and compare generated frames against them to validate identity across viewing angles. 
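</p><p>Putting the pieces together, the per-model thresholding, weighted recall-first aggregation, and multi-pose anchors can be sketched roughly as follows. Model names, thresholds, weights, and the sensitivity cutoff here are illustrative placeholders, not our calibrated production values:</p>

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# Per-model calibrated thresholds and reliability weights (illustrative).
MODELS = {
    "buffalo_l":   {"threshold": 0.62, "weight": 0.40},
    "antelope_v2": {"threshold": 0.58, "weight": 0.35},
    "facenet":     {"threshold": 0.55, "weight": 0.25},
}

def frame_is_consistent(frame_embs, anchor_embs, sensitivity=0.5):
    """frame_embs: model name -> embedding of the generated frame.
    anchor_embs: model name -> {"front"/"left"/"right" -> hero embedding}.
    A model votes 'consistent' if the frame matches ANY pose anchor;
    the weighted vote across models must reach the sensitivity cutoff."""
    score = 0.0
    for name, cfg in MODELS.items():
        best = max(cosine(frame_embs[name], anchor_embs[name][pose])
                   for pose in ("front", "left", "right"))
        if best >= cfg["threshold"]:
            score += cfg["weight"]
    return score >= sensitivity
```

<p>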
This check surfaces profile-specific inconsistencies early and triggers regeneration when needed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T8jH7gdKIhEtF8YkVHimYA.png" /><figcaption><em>Fig.2: Working of Character Consistency Framework</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2PeHKnBwtismAH-63JfJXA.png" /><figcaption><em>Table-1: Comparison of Hero Image with different generations</em></figcaption></figure><h3>Character Deformity Detection</h3><p>Character deformities break visual realism immediately. Generative models can produce warped limbs, misaligned joints, or anatomically implausible body structures when spatial constraints fail during sampling. To detect these failures reliably, we built a deformity-classification pipeline grounded in curated abnormality data and a trained YOLO-family detector.</p><p>We first evaluated general-purpose multimodal models such as <strong>Gemini</strong> and <strong>Qwen-VL</strong> on deformity detection. These models achieved only <strong>≈40% recall</strong> on our internal human-annotated deformity dataset, and they consistently failed to detect subtle or multi-region structural distortions. <strong>This baseline confirmed that deformity detection requires a dedicated model trained on explicit abnormality signals. 
</strong>To detect anatomical distortions reliably, we built a deformity-detection module optimized for high recall and early intervention during video generation.</p><ul><li>Used Tencent’s Distortion dataset (<em>Predicting Distortion in Real-World Human Images</em>) as the base corpus and curated it using human-in-loop review to improve label reliability.</li><li>Applied a segmentation-guided cleaning pipeline to remove annotations outside the human region, discard samples whose segmentation confidence fell below a threshold p_seg, and filter out deformity regions smaller than a minimum area threshold A_min; segmentation masks also provided body-part bounding boxes.</li><li>Trained a YOLO-family detector on the curated dataset to localize and classify deformities across full-body and body-part crops, explicitly optimizing for high recall, achieving a ~<strong>35% recall lift</strong> over Gemini and Qwen on the same human-annotated evaluation set.</li><li>Integrated the detector across all generation stages to flag anatomical failures early and trigger regeneration or parameter adjustments before outputs propagate downstream.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ixDPPIZNDvQCuWGOQxPkpQ.png" /><figcaption><em>Fig.3: Deformity Detection Framework for limb abnormalities</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FCJkhR6QAUNjqrXaAKgXYQ.png" /><figcaption><em>Fig.4: Deformity Detection Framework for facial abnormalities</em></figcaption></figure><h3>Location Consistency</h3><p>Location consistency ensures that generated scenes remain faithful to the <strong>scripted description</strong> and stable across <strong>multiple scenes within the same location</strong>. 
This is extremely important since drift in layout, lighting, spatial structure, or persistent objects breaks continuity and degrades perceived quality.</p><p><strong>Consistency with scripted location and object constraints: </strong>During script parsing, LLMs extract structured location descriptions that capture spatial layout, environmental cues, lighting intent, and object-level constraints. These descriptions define both the expected set of objects and a subset of <strong>mandatory elements</strong> whose presence must be preserved.</p><p>We evaluate <strong>object presence</strong> using vision language models (VLMs) and flag frames in which required objects are missing. In parallel, we assess overall scene fidelity by aligning text embeddings derived from the location description with visual scene embeddings extracted from generated keyframes using our VLM stack (e.g., Gemini, Qwen).</p><p>We calibrate <strong>similarity thresholds</strong> through human-in-loop (HIL) evaluation, selecting cutoffs that best correlate with human judgments of scene correctness. Frames that fall below the calibrated threshold indicate semantic or structural violations and trigger regeneration or parameter adjustment. Figure 5 illustrates how alignment scores reflect adherence to scripted location and object constraints.</p><blockquote>Scene Visualization: Interior. Police interrogation room. Dim overhead lighting. Metal table bolted to the floor. Two chairs facing each other. One-way mirror. 
No windows.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xiUH_ZdnlC7nIAqN-q3dVw.png" /><figcaption><em>Fig.5: Location Consistency with Script&lt;&gt;Object alignment</em></figcaption></figure><p><strong>Consistency across scenes within the same location: </strong>For locations that recur across multiple scenes, we treat the validated keyframe achieving the highest text–image alignment score as the <strong>location anchor</strong>. Using our vision–language model (VLM) stack, we extract scene-level embeddings from subsequent frames and compare them against this anchor via cosine similarity to detect structural and stylistic drift. This formulation allows us to enforce consistency in spatial layout, lighting characteristics, and persistent background elements when the same room, street, or set reappears at different points in the video. <strong>Figure 6</strong> illustrates this anchor-based consistency check.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*x0U1MUP-25NjmyVZsko32Q.png" /><figcaption><em>Fig.6: Location Consistency across scenes</em></figcaption></figure><p>LLMs provide strong priors when generating detailed location descriptions, but they do not yet reliably detect subtle spatial or geometric inconsistencies in generated frames. We are therefore exploring additional approaches, including embedding-based scene classifiers, layout-consistency models, and structure-aware validators, to strengthen this module. These efforts aim to convert location consistency into a fully measurable and enforceable dimension within the validation framework.</p><h3>Brand Logo Detection/Safety Checks</h3><p>Brand-logo/Safety violations act as <strong>hard safety gates</strong> in the generation pipeline. The system blocks any output containing such elements and triggers regeneration until the output passes all safety checks. 
The validation loop operates as follows: the detector scans each generated unit and flags any violation; the system regenerates the content with adjusted constraints, and the detector re-evaluates the regenerated output. If repeated attempts fail, the system escalates the case for manual review. This loop ensures no flagged safety issue propagates downstream.</p><p>We enforce these safety dimensions through <strong>prompt-level controls</strong> and <strong>automated detection</strong>. During script parsing, LLMs generate structured content descriptions, including required or disallowed visual categories. These descriptions guide the image-generation model to avoid branded items and unsafe content.</p><p>We curated datasets for both tasks: brand/logo samples across categories such as laptops, consumer electronics, and apparel, and safety violations spanning violence, child abuse, racism, hate symbols, and other violation types.</p><p>Using these datasets, we optimized prompts and detection thresholds for <strong>high recall</strong>, achieving <strong>&gt;95% recall</strong> on internal evaluations. When the detector identifies a brand or NSFW instance, the system regenerates the output with stricter constraints to remove the violation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XZZnbMrU2UUCdExv6t7bWw.png" /><figcaption><em>Fig.7: Brand Logo Detection across categories</em></figcaption></figure><h4>Engineering Controls to Prevent Recurrence of Violations</h4><p><strong>Context-Aware Decoding:<br></strong> We add structured negative constraints that suppress brand names, logos, or unsafe categories during generation. 
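</p><p>A minimal sketch of this gate-and-regenerate loop, with constraints tightened on each retry; here generate, detect_violations, and escalate are hypothetical stand-ins for the image-generation call, the brand/safety detectors, and manual review, not our production APIs:</p>

```python
def validate_with_regeneration(prompt, generate, detect_violations,
                               escalate, max_attempts=3):
    # Hard safety gate: nothing flagged is allowed to propagate downstream.
    negative = []                          # structured negative constraints
    for _ in range(max_attempts):
        output = generate(prompt, negative=negative)
        violations = detect_violations(output)
        if not violations:
            return output                  # passed all hard gates
        # Tighten constraints with the detected categories and regenerate.
        negative.extend(v for v in violations if v not in negative)
    escalate(prompt, negative)             # repeated failures -> manual review
    return None
```

<p>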
These constraints adjust the decoding trajectory of the image-generation model and reduce the probability of producing forbidden visual elements.</p><p><strong>Adaptive Prompt Rewriting:<br></strong> When a violation is detected, the system rewrites the prompt by tightening constraints, clarifying allowed content, and removing ambiguous phrasing. These rewritten prompts condition the regeneration step and help eliminate repeated violations across attempts.</p><h3>Summary and Future Directions</h3><p>This work presents a <strong>validation framework for generative video</strong> that operates as a first-class component of the generation pipeline. By integrating validation directly into the workflow, the system detects and corrects character inconsistency, anatomical deformities, location drift, and safety violations during generation.</p><p>The framework applies recall-first validation, multi-model similarity checks, vision–language alignment, and thresholds calibrated through human feedback to convert qualitative notions of visual quality into enforceable signals. This design prevents narrative-breaking errors from propagating while allowing controlled creative variation.</p><p>Next, we will formalize a unified <strong>evaluation and metrics layer</strong> that measures video, audio, lip-sync, and temporal consistency and supports systematic optimization.</p><p>We will also extend the framework to address <strong>long-range scene and concept continuity</strong>, enforcing coherence across scenes, episodes, and story arcs. These extensions will complete the quality stack required to deploy generative video systems at production scale.</p><p><em>Want to be at the forefront of generative video in India? 
Do check out </em><a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering"><em>open roles</em></a><em> if you want to build for millions of customers who will use the features you build!</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6c67d1177ce2" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/building-scalable-validation-framework-for-video-generation-6c67d1177ce2">Gen AI Video — Building Scalable Validation Framework</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Modernizing Dependency Management: Beyond CocoaPods (Part 2 — The Execution)]]></title>
            <link>https://blog.hotstar.com/modernizing-dependency-management-beyond-cocoapods-part-2-the-execution-bf3b9d739efb?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/bf3b9d739efb</guid>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[mobile-development]]></category>
            <category><![CDATA[ios-development]]></category>
            <category><![CDATA[swift-package-manager]]></category>
            <category><![CDATA[dependency-management]]></category>
            <dc:creator><![CDATA[Saurabh Kapoor]]></dc:creator>
            <pubDate>Fri, 13 Feb 2026 04:26:22 GMT</pubDate>
            <atom:updated>2026-02-13T04:26:20.830Z</atom:updated>
            <content:encoded><![CDATA[<h3>Modernizing Dependency Management: Beyond CocoaPods (Part 2 — The Execution)</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1022/1*uo1p1ZKkrCqxXMBsbm6uxw.jpeg" /></figure><h3>Recap</h3><p>In <a href="https://blog.hotstar.com/dependency-management-our-journey-beyond-cocoapods-part-1-the-strategy-c3c7874566a9">Part 1</a>, we shared the strategy — why we had to move away from CocoaPods, how we evaluated our options, and the phased migration plan we designed to modernize 60 pods over 2 quarters without disrupting feature development.</p><p>In this part, we bring it all home and share our learnings and scars.</p><h3>Phase 1: Proving That SPM Could Coexist</h3><p>Big bang approaches for a product that supports over a billion Indians are not on the menu. We could not stop the train; we had to upgrade while running at full speed.</p><blockquote><em>Swift Package Manager (SPM) had to live alongside CocoaPods without disrupting day-to-day development.</em></blockquote><p>This was not negotiable. It was also about building confidence gradually.</p><p>Our dependency graph was already complex, and CocoaPods was deeply embedded in how the app was built, tested, and shipped. Introducing a second dependency manager into that ecosystem was risky. If coexistence failed, everything downstream would be compromised.</p><p>So we treated Phase 1 as a controlled experiment.</p><h4>Defining Success</h4><p>We were explicit about what success looked like:</p><ul><li>Developers should not need to change their workflows</li><li>CI pipelines must remain stable</li><li>No runtime regressions</li><li>CocoaPods must remain fully functional</li></ul><h4>Choosing the Right First Dependencies</h4><p>We intentionally avoided internal SDKs in this phase. 
Instead, we chose third-party libraries that already had mature SPM support and met three criteria: widely used in the app, minimal customization, and no deep runtime coupling with internal frameworks or development pods.</p><p>This allowed us to isolate SPM behavior without risking business-critical flows. By migrating only a handful of carefully selected dependencies, we could observe real-world behavior without destabilizing the system.</p><h4>Invisible to Developers</h4><p>One of the most important constraints we enforced was developer invisibility.</p><p>Developers continued to open the same workspace, build using the same schemes, run tests the same way, and rely on the same CI signals. There were no new scripts to run, no new commands to remember, no changes to onboarding docs.</p><p>SPM dependencies were resolved automatically by Xcode in the background. If someone hadn’t been told we were testing SPM, they wouldn’t have noticed.</p><h3>Phase 2: Internal Pods and Binary Distribution</h3><p>Phase 1 inspired confidence, so we ramped up and doubled down. Internal pods were where things got interesting.</p><p>These weren’t isolated third-party libraries. They were shared across multiple apps, some across Android, and were actively developed. Moving them required more than a format change — it required rethinking how we distribute internal code.</p><h4>The Promise of Binary Targets</h4><p>Moving our internal pods to SPM binary targets felt like a natural evolution. 
Using .xcframework with .binaryTarget allowed us to distribute prebuilt artifacts instead of rebuilding large internal modules every time.</p><pre>.binaryTarget(<br>    name: &quot;CoreSDK&quot;,<br>    url: &quot;https://example.com/CoreSDK.xcframework.zip&quot;,<br>    checksum: &quot;...&quot;<br>)</pre><p>This allowed us to preserve encapsulation, reduce build times, and decouple SDK evolution from app builds.</p><p>But this phase also exposed one of our biggest challenges.</p><h4>The Problem: Binary Targets and Private Repositories</h4><p>Very quickly, we ran into a problem that wasn’t obvious from the documentation.</p><p>Swift Package Manager assumes that binary artifacts are publicly accessible. When SPM encounters a binaryTarget(url:), it attempts to download the zip file, verifies the checksum, and caches the artifact locally. What it does <em>not</em> do is authenticate.</p><p>That assumption works fine for open-source packages hosted publicly — but completely breaks down for private internal SDKs.</p><p>We hosted our .xcframework.zip files as GitHub release assets inside private repos. Everything looked correct: URL was valid, checksum matched, artifact was present. Yet builds consistently failed:</p><pre>Failed to download binary artifact<br>The requested URL returned error: 404</pre><p>The file existed. The problem was subtle but critical: SPM was making unauthenticated HTTP requests to private URLs. 
GitHub correctly responded with 404, not 401, masking the real issue.</p><h4>Why This Was a Big Deal</h4><p>This wasn’t just an inconvenience — it had architectural implications:</p><ul><li>Developers couldn’t resolve packages locally</li><li>CI pipelines failed deterministically</li><li>Binary targets became unusable for internal SDKs</li><li>The entire binary-distribution strategy was at risk</li></ul><p>At this point, we had to pause and ask: <strong><em>Can SPM actually work for private, enterprise-scale binary distribution?</em></strong></p><h4>Exploring Workarounds</h4><p>We explored multiple approaches, each with trade-offs:</p><p><strong>Making repositories public</strong> — Immediately ruled out. Internal SDKs contain proprietary logic.</p><p><strong>Embedding tokens in URLs</strong> — Technically possible, but unacceptable. Security risk, tokens leak via logs, impossible to rotate safely.</p><p><strong>Git LFS</strong> — Workable, but introduced large repository sizes, slower clones, and additional tooling overhead. Didn’t scale well for frequent SDK releases.</p><p><strong>Artifact repositories (S3/Nexus)</strong> — Viable, but required additional infrastructure, credential management, and URL signing logic.</p><p>We wanted something simpler for GitHub-hosted binaries.</p><h4>The Solve: .netrc Authentication</h4><p>The solution came from understanding how SPM downloads binaries.</p><p>SPM relies on standard system networking under the hood. That means it respects .netrc credentials, just like curl or git. 
By configuring authentication at the system level, we could allow SPM to fetch private binaries without changing a single line of Package.swift.</p><pre># ~/.netrc<br>machine github.com<br>login GITHUB_USERNAME<br>password GITHUB_PERSONAL_ACCESS_TOKEN</pre><p>Once this file was present, SPM successfully authenticated, binary artifacts downloaded correctly, checksums validated as expected, and builds became deterministic again.</p><p>Most importantly, this worked identically on developer machines and CI runners.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p9_PIC6RcQuH075qlgLOhA.png" /></figure><h4>Hardening: Making It CI-Friendly</h4><p>On CI, we injected the .netrc file securely at runtime using secrets:</p><pre>echo &quot;machine github.com login $GITHUB_USER password $GITHUB_TOKEN&quot; &gt;&gt; ~/.netrc<br>chmod 600 ~/.netrc</pre><p>This gave us no credentials in source control, easy token rotation, and clear audit boundaries. It also aligned well with GitHub Actions and self-hosted runners.</p><h4>What This Unlocked</h4><p>Solving this problem unlocked the full potential of SPM binary targets:</p><ul><li>Internal SDKs could be versioned independently</li><li>App builds became significantly faster</li><li>SDK releases became predictable artifacts</li><li>Teams consumed binaries without worrying about source-level coupling</li></ul><p>Onwards!</p><h3>Phase 3: Development Pods → Swift Packages</h3><p>Development pods were not just dependencies — they were living parts of the app, evolving alongside features, touched daily by multiple teams. 
They were also deeply intertwined with how our codebase was structured.</p><h4>The Challenge: Living Code, Not Just Dependencies</h4><p>Our development pods served a very specific purpose: they allowed teams to iterate on shared modules without releasing binaries, supported rapid local changes and debugging, and encoded architectural boundaries inside the Podfile.</p><p>Over time, they became like an extension of the app — not just dependencies.</p><p>Replacing them with Swift packages meant answering a hard question: <em>Can we retain the same developer experience without CocoaPods doing the heavy lifting for us?</em></p><h4>Preserving the Existing Structure</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9eV0SZYqS6nqBfdMlUxT-w.png" /></figure><p>One of our biggest concerns was accidentally reshaping the codebase. We did not want to flatten modules, merge responsibilities, or introduce artificial package boundaries.</p><p>Instead, we followed a strict rule: <strong>every development pod becomes a Swift package with the same conceptual boundaries.</strong></p><p>That meant one pod became one package, with the same folder structure, same ownership, and same responsibility. This discipline paid off later when debugging regressions and onboarding developers.</p><h4>Language Boundaries: Objective-C and Swift</h4><p>Swift Package Manager supports Objective-C, but it does not allow multiple languages within the same target. A single target cannot contain both Swift and Objective-C sources.</p><p>Several of our development pods relied on exactly that — Swift and Objective-C files coexisting within the same logical module, with bridging handled implicitly by the build system. Under SPM, this was no longer possible.</p><p>To move forward, we restructured these modules intentionally. Objective-C code was moved into dedicated package targets with explicitly defined public headers. 
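</p><p>In package-manifest terms, the split looks roughly like this; target names and paths are illustrative, not our actual modules:</p><pre>// Package.swift (sketch)<br>.target(<br>    name: &quot;PlayerCoreObjC&quot;,           // Objective-C sources only<br>    path: &quot;Sources/PlayerCoreObjC&quot;,<br>    publicHeadersPath: &quot;include&quot;      // explicitly declared public headers<br>),<br>.target(<br>    name: &quot;PlayerCore&quot;,               // Swift sources only<br>    dependencies: [&quot;PlayerCoreObjC&quot;], // explicit dependency on the ObjC target<br>    path: &quot;Sources/PlayerCore&quot;<br>)</pre><p>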
Swift targets then depended on these Objective-C targets through clear, declared dependencies.</p><p>This refactoring made language boundaries explicit and removed implicit bridging behavior. While it required effort, it resulted in a cleaner dependency graph and clearer ownership across modules.</p><h4>Platform Boundaries: XIBs and Multi-Platform Packages</h4><p>Legacy Interface Builder files (XIBs) introduced a different kind of challenge — one tied closely to how Swift Package Manager treats multi-platform packages.</p><p>Under CocoaPods, platform-specific behavior could be handled through Podfile logic or build configurations. This allowed development pods to bundle UI resources that behaved differently across iOS and tvOS without that distinction being explicit in the code.</p><p>Swift Package Manager takes a stricter approach. Packages are inherently multi-platform, and resources declared in a package are shared across all supported platforms. There is no native way to conditionally include or exclude resources based on platform within the same target.</p><p>Several development pods contained XIBs that were platform-specific and loaded conditionally at runtime. Once these pods became Swift packages, those assumptions no longer held.</p><p>Rather than fragmenting packages or introducing complex runtime branching, we made a deliberate architectural decision: <strong>we moved most XIB-based UI into code.</strong></p><p>This shift reduced reliance on resource bundling, eliminated fragile platform assumptions, and aligned better with the multi-platform model encouraged by Swift Package Manager — even though it required more upfront refactoring.</p><h4>Build Behavior: Replacing Post-Install Scripts</h4><p>One aspect we underestimated initially was how much logic lived outside the code itself.</p><p>Over the years, our CocoaPods setup had accumulated substantial scripting inside the post_install block. 
These scripts handled modifying build settings across targets, injecting compiler and linker flags, adjusting deployment targets, patching generated project settings, generating code for modules, and handling branding assets.</p><p>CocoaPods made this convenient because it centralized these changes in one place. But once we moved away from CocoaPods, that implicit behavior disappeared immediately.</p><p>Swift Package Manager does not offer an equivalent of a post_install hook. That forced us to confront an important reality: <strong>a lot of critical build behavior was hidden in scripts that developers rarely looked at.</strong></p><p>To preserve correctness without reintroducing global magic, we deliberately moved this logic closer to where it actually mattered. Most essential scripting was redistributed into explicit Xcode build phases, scoped to the relevant app or framework targets. In some cases, we replaced scripts entirely by fixing the underlying configuration rather than patching it at build time.</p><p>This shift had two important effects: build behavior became more visible and discoverable, and changes were scoped to specific targets instead of being applied globally.</p><h4>Resources, Flags, and Conditional Logic</h4><p><strong>Resources</strong> — Assets that were automatically bundled by CocoaPods now had to be declared explicitly:</p><pre>.target(<br>    name: &quot;UserProfileKit&quot;,<br>    resources: [<br>        .process(&quot;Resources&quot;)<br>    ]<br>)</pre><p>This forced us to audit every resource and validate runtime access paths — something CocoaPods had silently handled for years.</p><p><strong>Build Settings</strong> — CocoaPods’ pod_target_xcconfig allowed us to inject compiler and linker settings easily. SPM requires these to be expressed explicitly:</p><pre>swiftSettings: [<br>    .define(&quot;ENABLE_LOGGING&quot;, .when(configuration: .debug))<br>]</pre><p>For edge cases, we used .unsafeFlags — but sparingly and deliberately. 
This made us more intentional about what each module actually required.</p><p><strong>Conditional Inclusion</strong> — In CocoaPods, it was common to conditionally include pods based on build configurations. SPM does not support conditional dependencies for custom build configurations.</p><p>This forced a shift in thinking. Instead of conditionally including dependencies, we moved toward conditionally <em>using</em> them via compile-time flags and feature gates:</p><pre>#if ENABLE_EXPERIMENTAL_FEATURE<br>// Feature-specific code<br>#endif</pre><p>This change improved clarity, even though it required refactoring.</p><h4>Keeping Development Fast</h4><p>A common fear with moving development pods to SPM is slower iteration. We paid close attention to this.</p><p>Local packages were referenced via relative paths, changes reflected immediately in the app, and the debugging experience remained intact. From a developer’s perspective, very little changed — which was exactly what we wanted.</p><h4>CI and Testing Implications</h4><p>Moving development pods affected CI in subtle ways. Test targets needed explicit dependency declarations, schemes had to be updated, and build order changed slightly.</p><p>This surfaced hidden assumptions in our pipelines — but fixing them made CI more robust and predictable.</p><h4>The Emotional Reality</h4><p>This phase took time. It required patience. And it touched a lot of code.</p><p>But it also marked a turning point. By the end of Phase 3, CocoaPods was no longer central to our architecture, Swift packages were no longer “new,” and the system felt cleaner and more explicit.</p><p><strong>We stopped thinking in terms of pods and started thinking in terms of modules with explicit contracts.</strong></p><h4>The Surprises We Didn’t See Coming</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1020/1*vBOg5RI_WkYuOI86y48vtg.jpeg" /></figure><p>We expected friction. We anticipated refactoring. 
What we didn’t fully anticipate was how many assumptions CocoaPods had been quietly absorbing for us over the years — assumptions that surfaced only once Swift Package Manager forced everything into the open.</p><p>These challenges didn’t appear all at once. They emerged gradually, often at inconvenient moments, and almost always in places we thought were already “done.”</p><h4>Duplicate Symbols at Runtime</h4><p>One of the more subtle issues we encountered was related to duplicate symbols — but not in the way they typically present themselves. These were not build-time or linker failures. Builds succeeded, targets launched normally, and at a glance everything appeared to be working.</p><p>The problems surfaced only at runtime.</p><p>Because Swift Package Manager builds packages as static libraries by default, the same dependency could end up being embedded multiple times through different dependency paths. This resulted in multiple instances of what was expected to be a single module being present in the process.</p><p>At runtime, this caused the wrong instance of certain symbols to be referenced:</p><ul><li>Global state would appear to reset or diverge</li><li>Dependency injection would resolve to unexpected instances</li><li>Mocks would not behave as expected</li></ul><p>In some cases, the only signal we had was a vague runtime warning:</p><pre>objc[12345]: Class MySharedService is implemented in both<br>/path/to/App.app/App and /path/to/AppTests.xctest/AppTests.<br>One of the two will be used. Which one is undefined.</pre><p>Nothing crashed. Nothing failed to launch. But from that point onward, behavior was undefined.</p><p>The challenge was not detecting the issue, but diagnosing it. Since nothing failed during build or launch, the failures initially looked like flaky tests or logical bugs rather than a dependency problem.</p><p>Resolving this required a careful audit of how dependencies were introduced across app, framework, and test targets. 
We had to ensure that shared modules were linked exactly once and that dependency graphs were consistent across targets.</p><p><strong>This was a strong reminder that test targets are not passive consumers of the app binary. They are independent bundles with their own runtime environment.</strong></p><h4>Builds That Succeeded but Crashed</h4><p>Some of the most difficult issues gave us the least amount of signal.</p><p>The app compiled successfully. There were no compiler errors. There were no linker warnings. And yet, the application crashed immediately at runtime.</p><p>These failures didn’t surface during build because, from the compiler’s perspective, everything was valid. All symbols were present, all dependencies resolved, and the binary was produced without complaint. The problem only emerged once the app launched and the runtime attempted to load and resolve those symbols.</p><pre>dyld: Symbol not found: _$s15MySharedModule16CriticalServiceC11sharedInstanceACvgZ<br>  Referenced from: /Applications/App.app/App<br>  Expected in: /Applications/App.app/Frameworks/MySharedModule.framework/MySharedModule</pre><pre>dyld: Library not loaded: @rpath/MySharedModule.framework/MySharedModule<br>  Reason: image not found</pre><p>Diagnosing these issues required stepping outside the usual compile–link–run mental model. We had to inspect the final app binary, verify which frameworks and libraries were actually embedded, and confirm that runtime search paths and linkage settings were aligned with how the dependencies were built.</p><h4>Saying Goodbye to Slather</h4><p>One of the bigger surprises had nothing to do with compilation, linking, or dependency resolution. It was code coverage.</p><p>For years, we had relied on Slather as our coverage tool. It was stable, familiar, and deeply integrated into our CI pipelines. 
We assumed it would continue to work as we moved dependencies to Swift Package Manager.</p><p>That assumption turned out to be wrong.</p><p>Slather does not support pure Swift packages as first-class citizens. It expects coverage to be generated from Xcode projects or workspaces, not standalone packages. As more of our codebase moved into Swift packages, coverage for package-based modules simply disappeared.</p><p>The only way to keep Slather working would have been to create artificial “host” projects whose sole purpose was to run package tests and collect coverage. That approach went directly against our goal of making systems simpler.</p><p>At that point, it became clear that the problem wasn’t Swift Package Manager — it was the tooling around it.</p><p>We switched to an xcresult-based coverage pipeline, parsing coverage directly from Xcode&#39;s native test result bundles. This aligned us with the direction Apple was already taking. Coverage became more accurate, easier to reason about, and independent of how the code was packaged.</p><h3>Phase 4: Removing CocoaPods</h3><p>By the time we reached this phase, CocoaPods was no longer doing much work.</p><p>All third-party dependencies had already moved to Swift Package Manager. Internal SDKs were consumed as binary targets. Development pods had been fully replaced with package-based modules.</p><p>And yet, CocoaPods was still there — quietly present, still wired into the system.</p><p>Removing it was less about technical effort and more about confidence.</p><h4>Knowing When We Were Ready</h4><p>For several weeks, CocoaPods existed in the repository almost as a safety net. 
During this period, we monitored build stability across all configurations and regions, CI performance and reliability, developer onboarding, and QA cycles without CocoaPods involvement.</p><p>Only after we had multiple successful releases — without touching CocoaPods at any stage — did we decide it was time to move on.</p><h4>The Final Delete</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oHsBUFkMcLx6lZbkJqyAJw.jpeg" /></figure><p>The final step was straightforward: we removed the Podfile, the CocoaPods-generated .xcworkspace, and all remaining CocoaPods-related files and scripts.</p><p>There was no disruption, no rollback, and no follow-up fixes.</p><p>The system simply continued to work — without CocoaPods.</p><p><strong>The day we merged that PR felt like crossing a finish line we’d been racing toward for six months.</strong></p><h3>What This Journey Taught Us</h3><p>By the time CocoaPods was fully removed, the migration itself had stopped feeling like the most important outcome.</p><p>What mattered more was how the journey reshaped the way we think about our codebase, our tooling, and the systems that support day-to-day development.</p><p>This wasn’t just a dependency migration. It was a gradual recalibration of engineering discipline.</p><h4>Tooling Should Fade Into the Background</h4><p>The best tooling is the kind you don’t think about.</p><p>CocoaPods had accumulated years of scripts, configuration overrides, and implicit behavior. Swift Package Manager, in contrast, forced us to be explicit — but once configured, it largely disappeared into the background.</p><p>When developers no longer need to remember setup steps or debug dependency resolution, cognitive load drops. 
Productivity rises not because things move faster, but because there’s less to manage.</p><h4>Migration Is About Trust, Not Speed</h4><p>One of the most important decisions we made was not rushing.</p><p>By migrating in phases and allowing CocoaPods and SPM to coexist, we preserved trust: trust from developers that their workflows wouldn’t break, trust from QA that releases wouldn’t destabilize, trust from leadership that the migration wouldn’t impact delivery.</p><p>The time spent validating coexistence and waiting through release cycles was not overhead — it was risk mitigation.</p><h4>CI/CD Is Part of the Architecture</h4><p>Several challenges — especially around binary targets, coverage, and caching — forced us to acknowledge something we had underweighted before:</p><p>CI/CD is not infrastructure glue. It’s part of the system design.</p><p>Solving problems like authenticated binary downloads or deterministic package resolution required thinking beyond Xcode and into the pipeline itself. Once we did, CI became more reliable, more predictable, and easier to maintain.</p><h3>Closing Thoughts</h3><p>CocoaPods played a critical role in helping us scale our iOS and tvOS ecosystem at a time when the platform needed it most. It gave us structure, enabled modularization, and supported years of rapid development. For that, it deserves recognition.</p><p>Swift Package Manager represents where the Apple ecosystem is heading. It aligns more closely with Swift itself, integrates natively with Xcode, and encourages explicit, predictable dependency management. Adopting it wasn’t just a response to CocoaPods’ sunset — it was an investment in long-term maintainability and clarity.</p><p>The migration was not quick, and it was never meant to be. We approached it deliberately, prioritizing stability over speed and confidence over convenience. 
By moving in phases, preserving existing workflows, and validating each step through real release cycles, we ensured that modernization didn’t come at the cost of ongoing development.</p><p><strong>Six months. Two quarters. Sixty pods. Zero broken releases. One transformed dependency management system.</strong></p><p>If you’re standing at a similar crossroads, our advice is simple but hard-earned:</p><p><strong>Move with intent. Move carefully. And never break development in the process.</strong></p><p>We’re hiring for our client teams! Do check out <a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering">open roles</a> if you want to build it right while building for millions of customers!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bf3b9d739efb" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/modernizing-dependency-management-beyond-cocoapods-part-2-the-execution-bf3b9d739efb">Modernizing Dependency Management: Beyond CocoaPods (Part 2 — The Execution)</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Dependency Management: Our Journey Beyond CocoaPods (Part 1 — The Strategy)]]></title>
            <link>https://blog.hotstar.com/dependency-management-our-journey-beyond-cocoapods-part-1-the-strategy-c3c7874566a9?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/c3c7874566a9</guid>
            <category><![CDATA[ios-development]]></category>
            <category><![CDATA[swift-package-manager]]></category>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[dependency-management]]></category>
            <category><![CDATA[mobile-app-development]]></category>
            <dc:creator><![CDATA[Saurabh Kapoor]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 08:58:51 GMT</pubDate>
            <atom:updated>2026-02-03T08:58:50.101Z</atom:updated>
            <content:encoded><![CDATA[<h3>Modernizing Dependency Management: Beyond CocoaPods (Part 1 — The Strategy)</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LKzyVxVFyIXupSUYoQaH5g.jpeg" /></figure><blockquote>CocoaPods formed the spine of our iOS build workflow. In this two-part blog, we’re sharing our journey to transparently switch away from CocoaPods to SPM, minus the drama, but plus all the scars!</blockquote><h3>Where We Started</h3><p>CocoaPods was at the heart of our development workflow — for everything. Dependency resolution, configuration overrides, CI integration — all became tightly coupled to CocoaPods. It stopped being a convenience layer and became part of our infrastructure.</p><p>Then came the <a href="https://blog.cocoapods.org/CocoaPods-Specs-Repo/">announcement</a> that changed everything.</p><blockquote><strong>CocoaPods’ trunk would become read-only in December 2026.</strong></blockquote><p>This wasn’t just an ecosystem update. It was a tectonic shift. A read-only trunk meant no new pod versions, no straightforward path to adopt upstream fixes, and increasing exposure to unpatched issues over time. 
Any disruption here wouldn’t just affect builds — it would impact active feature development, CI stability, and release confidence across teams.</p><p>So when the announcement landed, the question wasn’t <strong><em>“Should we move?”</em></strong> That decision had effectively been made for us.</p><p>What followed was one of the most ambitious infrastructure initiatives we’ve undertaken — a six-month effort, touching every corner of our codebase, and ultimately transforming how we build and ship our apps.</p><p>This is that story.</p><h3>Choices, choices…</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7GFT2BmFXcc4bDk_evUDmw.jpeg" /></figure><p>Before committing to any solution, we took a deliberate step back and evaluated the dependency management landscape as a whole. Since we had to move, we wanted to make as future-proof a choice as possible.</p><p>Several factors mattered deeply to us:</p><ul><li><strong>Scalability</strong> — Handle the size and complexity of our codebase, while remaining adaptable.</li><li><strong>Learning curve</strong> — It couldn’t slow teams down or force widespread workflow changes.</li><li><strong>Future-proofing</strong> — We wanted to align with the direction the Apple ecosystem was moving.</li><li><strong>Tooling integration</strong> — It had to work well with project generation and tooling we already relied on.</li></ul><p>With those constraints in mind, we evaluated the available options.</p><h3>Carthage</h3><p>Carthage is lightweight, decentralized, and intentionally avoids modifying Xcode projects — qualities that align well with engineering simplicity. 
For smaller projects or teams with straightforward dependency needs, it can be an excellent choice.</p><p>However, managing binary distribution at scale, supporting internal SDKs, handling complex build configurations, and maintaining consistent tvOS support required more manual orchestration than we were comfortable with. Over time, this would have shifted operational complexity from tooling into team workflows.</p><p>Carthage wasn’t a bad fit universally; it just wasn’t the right fit for our ecosystem.</p><h3>Bazel and Buck</h3><p>We also explored Bazel and Buck — not dependency managers in the traditional sense, but complete build systems. Both are powerful and proven at massive scale, offering deterministic builds, strong caching, and sophisticated dependency graphs.</p><p>However, adopting either would have meant replacing our entire build system, moving away from Xcode’s native build model, and introducing a steep learning curve for developers whose daily workflows are deeply tied to Xcode.</p><p>For our teams, the cost of that transition far outweighed the benefits.</p><h3>Swift Package Manager (SPM)</h3><p>Swift Package Manager wasn’t perfect. It lacked some of the flexibility we were used to, and certain features required workarounds. But it had two qualities that ultimately mattered most.</p><p>First, it was <strong>native</strong> — integrated directly with Xcode, aligned with Swift’s evolution, and benefiting from ongoing investment by Apple. Second, it allowed us to <strong>preserve our existing project structure</strong> while gradually modernizing it. We could migrate incrementally, validate changes in production, and avoid large-scale rewrites.</p><p>That balance — modernization without disruption — was the turning point. 
Swift Package Manager positioned us to evolve with the platform, rather than constantly working around it.</p><h3>Replacing a beating heart…</h3><p>This was open-heart surgery on a system that powered millions of app sessions every day — performed while the patient was still running around!</p><p>At the time of the migration, our ecosystem included:</p><ul><li><strong>~60 iOS/tvOS engineers</strong> actively shipping features</li><li><strong>~30 SDETs</strong> maintaining extensive automation and test infrastructure</li><li><strong>A multi-language codebase</strong> (Swift, Objective-C, and supporting tooling) spanning millions of lines of code</li><li><strong>60+ pods in total</strong> — external dependencies, internal SDKs, and local development pods under constant iteration</li><li><strong>Multiple markets</strong> (India, International) with distinct configurations</li></ul><p>Every one of those engineers relied on CocoaPods behaving in very specific, sometimes undocumented ways. Every automation script, every CI pipeline, every local development workflow had assumptions baked in about how dependencies resolved, how builds were structured, and how artefacts were produced.</p><p>We made extensive use of pre_install and post_install hooks. We maintained multiple build configurations for India and international markets — involving conditional linking, selective dependency inclusion, and market-specific code paths. Our CI pipelines were tightly coupled to CocoaPods-generated projects in ways that weren&#39;t always visible until something broke. This was organisational Jenga, except we couldn’t let the tower fall, or even shake!</p><p>Easy, right?</p><h3>Core Constraint: Do Not Break Development</h3><p>From day one, we aligned on one non-negotiable rule:</p><blockquote><strong>This migration must not block ongoing development.</strong></blockquote><p>This constraint shaped everything.</p><p>Teams were actively shipping features. 
Local pods and internal SDKs were under constant iteration. CI pipelines had to remain stable across multiple environments. We couldn’t afford a <strong>“big bang”</strong> migration that froze development, forced widespread workflow changes, or introduced uncertainty into release cycles. Migration had to occur in phases with both systems co-existing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/250/1*lfC9cdpEstTKcYtt6p9xMQ.gif" /></figure><p>This wasn’t the cleanest approach — maintaining two dependency systems in parallel added complexity. It was the safest path forward. It meant teams could keep shipping while we rebuilt the foundation underneath them.</p><h3>Our Migration Strategy</h3><p>Once we committed to a phased migration, we needed a structure that balanced safety with forward momentum. Each phase had to deliver real progress, while still preserving the ability to ship features without disruption.</p><p>We broke the journey into four deliberate stages:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qxtpM2tK4lFe-VnXDi05OA.png" /></figure><h4>Phase 1: Proving That SPM Could Coexist</h4><p>Before touching any critical paths, we focused on coexistence. Swift Package Manager was introduced alongside CocoaPods — not as a replacement, but as a parallel system. This phase was about validation.</p><h4>Phase 2: Internal Pods and Binary Distribution</h4><p>With coexistence proven, we moved to internal SDKs and binary dependencies. These were more controlled, lower-churn modules, making them ideal candidates for early migration. 
This phase helped us establish patterns for versioning, distribution, and consumption at scale.</p><p><em>This is where we built the playbook that would carry us through the harder phases ahead.</em></p><h4>Phase 3: Development Pods → Swift Packages</h4><p>This was the crucible.</p><p>Development pods were deeply intertwined with active feature work, often containing mixed-language code and UI resources. Migrating these required structural changes — not just mechanical conversions — and forced us to confront some of Swift Package Manager’s core constraints head-on.</p><p>This phase stretched across quarters, required constant coordination with feature teams, and tested every assumption we’d made about the migration strategy.</p><p><em>This is where patience mattered more than speed.</em></p><h4>Phase 4: Removing CocoaPods</h4><p>Only after the system had fully stabilized under SPM did we execute the final step: removing CocoaPods entirely. By this point, CocoaPods was no longer a dependency — it was technical debt waiting to be deleted.</p><p>The day we removed the Podfile from the repository felt like crossing a finish line we’d been racing toward for six months!</p><h3>The Payoff: Measurable Wins</h3><p>This wasn’t change for change’s sake. 
After two quarters of sustained effort, the results were tangible and significant.</p><h4>Immediate Impact</h4><ul><li><strong>Automation stability improved</strong> — test targets became less sensitive to implicit CocoaPods behavior</li><li><strong>Dependency graphs became explicit</strong> — making ownership, impact analysis, and refactoring significantly easier</li><li><strong>CI pipelines became more predictable</strong> — fewer pod-related cache invalidations and mysterious failures</li><li><strong>Build times improved</strong> <strong>by 2x </strong>— especially in incremental builds, due to better dependency isolation and binary targets wherever possible</li><li><strong>App startup time reduced by 200–300ms</strong> — SPM’s cleaner dependency loading eliminated redundant framework initialization at launch. The migration also allowed us to adopt the latest linker (previously blocked due to crashes on older OS versions), which improved dynamic library load times and static linking performance.</li></ul><h4>Long-Term Gains</h4><p>Just as importantly, we fundamentally reduced our risk profile. Dependency management stopped being a fragile layer propped up by scripts, conventions, and tribal knowledge. It became something the platform itself understood — native, supported, and evolving with the ecosystem.</p><p>We went from dreading the CocoaPods deprecation deadline to being ahead of it by a year.</p><p><strong>Six months. Two quarters. Sixty engineers. Sixty-plus pods. Zero feature freezes. 
One satisfying Podfile deletion.</strong></p><h3>The Scars — Part 2</h3><p>In <strong>Part 2</strong>, we’ll go deep into how these phases were executed in practice.</p><p>We’ll cover the real issues we encountered along the way — mixed Objective-C and Swift targets, XIBs in development pods, CI assumptions that quietly broke, and the architectural decisions we had to make to move forward safely.</p><p>This is where theory met reality — and where the migration truly earned its scars!</p><p>We’re hiring for our client teams! Do check out <a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering">open roles</a> if you want to build it right while building for millions of customers!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c3c7874566a9" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/dependency-management-our-journey-beyond-cocoapods-part-1-the-strategy-c3c7874566a9">Dependency Management: Our Journey Beyond CocoaPods (Part 1 — The Strategy)</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Tesseract: JioHotstar’s Central Design Token System]]></title>
            <link>https://blog.hotstar.com/tesseract-jiohotstars-central-design-token-system-922e4616bc05?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/922e4616bc05</guid>
            <category><![CDATA[figma]]></category>
            <category><![CDATA[design-to-code]]></category>
            <category><![CDATA[kotlin-multiplatform]]></category>
            <category><![CDATA[design-systems]]></category>
            <dc:creator><![CDATA[Ritika Pahwa]]></dc:creator>
            <pubDate>Fri, 30 Jan 2026 08:01:05 GMT</pubDate>
            <atom:updated>2026-01-30T08:01:04.020Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/727/1*ZReKswl11juPXSC3Rwt7NA.png" /></figure><p>How do you keep a large customer platform like JioHotstar looking and working consistently, from both a visual and an experiential perspective?</p><blockquote>This blog explores our central design token system that’s integrated into our client platforms and made a massive dent in UI inconsistencies across platforms.</blockquote><p>JioHotstar is supported on multiple platforms, like Android and iOS devices, tablets, Android TV, tvOS and web. While the code-bases are different, it’s the design team’s job to ensure that the product UI/UX is harmonised across platforms.</p><p>We leverage our design system to unite these clients with a consistent brand identity, and the standards are defined in the “<em>Foundation Styles</em>” Figma file managed by our design team.</p><p>Our <a href="https://en.wikipedia.org/wiki/Design_system">design system</a> (SOUL) uses some basic ingredients — colors, typography, effect styles, spacings, sizes and other constants, referred to as design tokens. The tokens act as building blocks for all the myriad experiences at JioHotstar and any change in these tokens needs to be reflected in all the codebases as well.</p><h4>Speed Thrills, But Spills</h4><p>With the pace of change and the manual overhead that existed, we observed several inconsistencies creeping into the translation process.</p><p>Here is what we found:</p><h4>1. Variable Pace of Adoption</h4><p>Whenever the design team changes any tokens, the changes need to be propagated to the respective engineering teams. While some teams might develop, test, and release a change quickly, others might take longer before their release ships. This leads to inconsistent design across platforms, thereby affecting the branding of the organisation.</p><h4>2. 
Mis-Interpretation of Missing Values</h4><p>Sometimes the design suggestions get lost in translation. For instance, if developers receive design pages with missing hex codes at some places, then something appearing red could be interpreted as #D21D1D, #BF0F0F, or similar.</p><h4>3. Token Sprawl is real</h4><p>Tokens can be defined anywhere within the codebase. While this seems harmless at first, it turns large refactors into a nightmare when the overall theme needs to change. Often, visual bugs hide so deep in the user flows that they slip past sanity testing. Strict guardrails are, therefore, the need of the hour.</p><p>These inconsistencies accumulated over time, to the extent that designs across the platforms started to deviate significantly. To mitigate the challenges we faced with the design-to-production flow, we decided to automate the process.</p><blockquote><strong>The goal was simple: to make Figma the source of truth for all the design tokens.</strong></blockquote><p>Enter — <strong>Tesseract</strong>!</p><h3><strong>Tesseract: Central Design Token Framework</strong></h3><p>We created a central design token repository, named <em>Tesseract</em>, and integrated it with client platforms.</p><p>There were two parts to it, one for each of our two personas (designers and developers):</p><ul><li>Designers: a way to export tokens to the central repository</li><li>Developers: access to this repository within the client codebases, to use the foundational elements</li></ul><p>Designers and developers are our customers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-x3s_aibOWlvYL-e3pMTGw.png" /><figcaption>The revamped flow</figcaption></figure><h3>Digging into the workflow - Design to code</h3><p>The workflow was adapted to leverage the “Foundation Styles” file as the source of truth. 
Designers were now required to publish their tokens to <em>Tesseract</em> using a standard review+approval workflow.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8ubcM9X5GHhLrZ17xh9owg.png" /><figcaption>Designer Workflow : Export design tokens to the central repository</figcaption></figure><h4>Designer Workflow — Enter Tesseract Plugin</h4><p>One of the most crucial functional requirements was a Figma plugin that the designers could use to export the tokens to the central repository. In this process, the <a href="https://www.figma.com/plugin-docs/plugin-quickstart-guide/">official documentation for Figma plugin development</a> came in handy.</p><p>We decided to export all the tokens in JSON format. The following snippet shows a glimpse of the JSON structure we used in the tokens.json file:</p><pre>{<br>  &quot;Colors&quot;: {<br>    &quot;Background&quot;: {<br>      &quot;UI&quot;: {<br>        &quot;Default&quot;: {<br>          &quot;id&quot;: &quot;VariableID:11109:3198&quot;,<br>          &quot;name&quot;: &quot;Background/UI/Default&quot;,<br>          &quot;description&quot;: &quot;Panther Grey 10&quot;,<br>          &quot;type&quot;: &quot;SOLID&quot;,<br>          &quot;hex&quot;: &quot;#0F1014&quot;,<br>          &quot;opacity&quot;: 1<br>        },<br>        ...<br>      },<br>    },<br>  },<br> &quot;Typography&quot;: {<br>    ...<br>  },<br> &quot;Effects&quot;: {<br>    ...<br>  },<br> &quot;Spacings&quot;: {<br>    ...<br>  },<br> &quot;Radius&quot;: {<br>    ...<br>  }<br>...<br>}</pre><p>Our Foundation Styles Figma file initially utilised styles for colours, text, and effects, and separate pages were used for the size constants. We used Figma’s <a href="https://www.figma.com/plugin-docs/">Plugin API</a> to fetch all the styles from the files, while the plugin was being run on it.</p><p>The designers need to enter their personal access token to use the plugin. 
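</p><p>Under the hood, the style-collection step can be sketched with the Figma Plugin API. This is a minimal illustration for solid colour styles only; the real plugin also covers typography, effects, and constants, and writes the full tokens.json shape shown above:</p><pre>// Illustrative sketch, not our production plugin code.
// figma.getLocalPaintStyles() returns the local colour styles of the file.
function rgbToHex({ r, g, b }) {
  const to255 = (v) => Math.round(v * 255).toString(16).padStart(2, "0");
  return ("#" + to255(r) + to255(g) + to255(b)).toUpperCase();
}

function collectColorTokens() {
  const tokens = {};
  for (const style of figma.getLocalPaintStyles()) {
    const paint = style.paints[0];
    if (!paint || paint.type !== "SOLID") continue; // solid fills only
    tokens[style.name] = {
      id: style.id,
      name: style.name,
      description: style.description,
      type: paint.type,
      hex: rgbToHex(paint.color),
      opacity: paint.opacity ?? 1,
    };
  }
  return tokens;
}</pre><p>As noted, the plugin asks for the designer’s personal access token before exporting. 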
This is a one-time step that is not triggered again until the token expires. Instead of exporting all changes directly to GitHub, we added an extra review page to display differences, allowing for proofreading before committing any unintended changes:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*V7pUtp8W1pXOA-sd" /><figcaption>Custom Review Page</figcaption></figure><h4>Challenge with Constants — Adopting Figma variables</h4><p>Unlike styles, we had no direct API calls to retrieve the constants from the pages. We started with a simple solution — traversing all the pages to find the required frame using its name and populating our JSON object.</p><p>This approach, however, was quite error-prone, as it searched for frames by their names. So we advocated that the design team exclusively use Figma variables for constants, which required some evangelisation.</p><p>With these challenges solved, we were ready to pipe everything into a shared repository on GitHub where the design tokens were maintained.</p><h4>Developer Workflow — The code to deployment phase 🚀</h4><p>We built a library that the developers could use to retrieve these design tokens.</p><h4>Build Tenets</h4><p>We wanted to give developers a seamless experience while consuming these tokens at scale.</p><p>We decided on the following tenets:</p><ul><li><strong>Plug and play adoption: </strong>The developers should be able to use the library as a dependency.</li><li><strong>Autocomplete:</strong> Autocomplete should work in almost all the IDEs that the developers use (Android Studio, VS Code, Xcode, etc.).</li><li><strong>Intuitive syntax</strong>: The syntax should be developer-friendly, that is, the developers should be able to use something like Colors.primary to use the primary color, just like the native UIColor.systemBlue for iOS or Color.Red for Android.</li></ul><p>One way to approach this phase was to allow the client devices to fetch the tokens 
from the repository as and when required. However, this server-driven approach had some trade-offs. While this architecture would have given us the latest tokens on every run, we needed the resources to be bundled within the app itself.</p><p>Another way to approach this problem was to have a map with all the key-value pairs, with keys being the token names and values being the corresponding token instances. We discarded this idea as it was an in-memory approach and did not satisfy our auto-completion requirement.</p><h4>Kotlin Multiplatform: Ship platform-specific artefacts</h4><p>Different platforms use different templates. For instance, we have an XML file or an object containing variables and their corresponding values on Android as opposed to a class on iOS.</p><p>So this was essentially a scripting problem that we could have solved by having separate scripts and libraries for Android, web and iOS. However, Kotlin Multiplatform (KMP) fit our requirements.</p><p>Here’s why 👇</p><ul><li><strong>Flexibility: </strong>With KMP, we had the flexibility of using our own customised template, instead of going with the ones used on the platforms. KMP would take care of generating the platform-specific artefacts.</li><li><strong>Maintainability:</strong> For any design system, one of the most challenging parts is its maintenance. One script and one library meant that we would require much less maintenance later on!</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*dWEXWBnqWcifwe_A" /><figcaption>Why KMP?</figcaption></figure><p>The template creation itself was involved and required support and feedback from developers so that we could ensure maximum ease of use. DevX was key. We can’t emphasize enough how critical this step was to ultimate uptake and adoption.</p><h4>Managing Updates</h4><p>We have dedicated GitHub workflows to update the KMP library and release the artefacts. 
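</p><p>At the heart of these workflows is a simple transformation: walk the tokens.json tree and emit platform source files. A rough, illustrative sketch of the idea (the real template, naming, and structure differ):</p><pre>// Illustrative sketch of token-to-source generation for commonMain.
// Walks a tokens subtree and emits a Kotlin object of colour constants.
function kotlinColorObject(colors, objectName) {
  const lines = ["object " + objectName + " {"];
  const walk = (node, path) => {
    for (const [key, value] of Object.entries(node)) {
      if (value && typeof value === "object" && "hex" in value) {
        // Leaf token: Background/UI/Default becomes BackgroundUIDefault
        lines.push('    val ' + path.concat(key).join("") + ' = "' + value.hex + '"');
      } else if (value && typeof value === "object") {
        walk(value, path.concat(key));
      }
    }
  };
  walk(colors, []);
  lines.push("}");
  return lines.join("\n");
}</pre><p>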
The KMP library gets updated whenever the pull request with an updated tokens.json file gets merged:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gNVRHbuht_M1zfVU" /><figcaption>Releasing updates</figcaption></figure><p>Since the tokens need to be shared across all the platforms, we opted to keep them in the commonMain section of our KMP library.</p><p>We use a script called common-main-files-generator.js to traverse the tokens in the tokens.json file and update the tokens in commonMain:</p><pre>- name: Populate KMP library<br>  run: node scripts/common-main-files-generator.js</pre><p>The actual implementation for Android/iOS/Web uses native dependencies and APIs to handle platform-specific nuances.</p><p>Once the files in commonMain get updated, we use GitHub workflows to update the library version, generate platform-specific artefacts, and release them. The following snippets elucidate the process.</p><ul><li>Generating a node module for web and publishing it on Nexus:</li></ul><pre>- name: Generate Node Module<br>  run: ./gradlew jsNodeProductionLibraryDistribution --no-configuration-cache<br>      <br>- name: Publish the node module to nexus <br>  run: |<br>       ...<br>       npm publish</pre><ul><li>Creating a .aar file for Android and publishing it on Nexus:</li></ul><pre>- name: Build Android artifact<br>  run: ./gradlew assembleRelease<br><br>- name: Release Android artifact on Nexus<br>  run: ./gradlew publishTesseractPublicationToNexusRepository<br>  env:<br>    MAVEN_REPOSITORY_ANDROID_PUBLISH_PASSWORD: ${{ secrets.MAVEN_REPOSITORY_ANDROID_PUBLISH_PASSWORD }}</pre><ul><li>Building an XCFramework for iOS and releasing it on GitHub:</li></ul><pre>- name: Build and publish iOS Framework<br>  run: | <br>       ./gradlew podPublishReleaseXCFramework <br>    <br>- name: Release iOS Framework on GitHub<br>  working-directory: .<br>  run: |<br>        # Add only XCFramework and podspec file to the release branch<br>        git push -u -f 
origin iOS/Release<br><br>  ... <br><br> - name: Tag the iOS build<br>   run: |<br>         ...<br>         git push origin ${{ steps.get_version.outputs.version }}<br></pre><h4>Handling Breaking Changes</h4><p>Things change, and things break. The following are the use cases where design tokens change.</p><ul><li>Token gets updated <em>(attributes like hex, opacity, etc. get changed)</em></li><li>New token is added</li><li>Token is deleted</li><li>Name of an existing token is modified</li></ul><p>Out of these four cases, we were aware that our codebases would break if a token were deleted or its name altered. We wanted the library’s adoption to be as frictionless as possible. Thus, we opted to version tokens and deprecate older ones.</p><p>This gave us more control over the design system in general. We could even automate the version bump process since the changes wouldn’t disrupt the client codebases.</p><p>Once all the deprecated tokens get removed from the entire ecosystem, we delete them completely from the central repository. The Figma plugin allows the designers to delete deprecated tokens, and the flow is the same as for exporting updates to the central repository, as discussed earlier.</p><h4>Versioning our library</h4><p>We follow semantic versioning for Tesseract. 
For this, we use the following rules:</p><ul><li>All the exports happening via the plugin account for the <em>patch updates</em>.</li><li><em>Major updates</em> are released only when the deprecated tokens are completely removed from the entire ecosystem.</li></ul><p>The snippet below outlines the process we follow for versioning the KMP library:</p><pre>- name: Update Library Version <br>  id: get_version<br>  working-directory: ./design-tokens-lib<br>  run: |<br>        VERSION=$(grep &quot;libVersion&quot; ./Tesseract/build.gradle.kts | awk &#39;{print $4}&#39; | tr -d &#39;&quot;&#39; | tr -d &#39;\n&#39;)<br>        IFS=&#39;.&#39; read -r major minor patch &lt;&lt;&lt; &quot;$VERSION&quot;<br>        if [ &quot;$GITHUB_EVENT_PULL_REQUEST_TITLE&quot; = &quot;Delete Deprecated Tokens&quot; ]; then<br>           new_major=$((major + 1))<br>           NEW_VERSION=&quot;$new_major.0.0&quot;<br>        elif [ &quot;$GITHUB_EVENT_PULL_REQUEST_TITLE&quot; = &quot;Update Design Tokens&quot; ]; then<br>           new_patch=$((patch + 1))<br>           NEW_VERSION=&quot;$major.$minor.$new_patch&quot;<br>        else<br>           NEW_VERSION=&quot;$VERSION&quot;<br>        fi<br>        sed -i &quot;&quot; &quot;s/libVersion = \&quot;$VERSION\&quot;/libVersion = \&quot;$NEW_VERSION\&quot;/&quot; ./Tesseract/build.gradle.kts<br>        ...</pre><h3>Rolling it out</h3><p>One final step remained: verifying there was no regression in app size or app start time. No regressions were observed, and we were good to go! Finally, we needed to improve communication around changes. 
We leveraged a Slack channel to announce these changes for maximum visibility, since our team <em>lives </em>on Slack!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/976/1*k24TJgMmxzZgx8vGIHZR4w.png" /><figcaption>Slack notification on our design system channel</figcaption></figure><h3>Impact 📈</h3><blockquote>4 codebases.</blockquote><blockquote>400+ hard-coded values.</blockquote><blockquote>And that wasn’t all!</blockquote><p>We identified numerous inconsistencies, and we integrated our library in several iterations. Across the codebases, we discovered:</p><ul><li><strong>Over 20 tokens </strong>whose names weren’t consistent</li><li><strong>180+ tokens</strong> which were not available in the Foundation Styles at all</li><li><strong>More than a dozen tokens</strong> that were mismatched or improperly utilised</li><li>Primitive tokens being used within the code, while they shouldn’t be (primitive tokens are the basic tokens over which other tokens carrying a contextual meaning, often called semantic tokens, are built)</li><li><strong>Different font families</strong> being used across the platforms</li></ul><p>As a part of Tesseract, we now ensure that the font families are controlled centrally.</p><p>While we eradicated inconsistencies to a large extent, coming to a consensus on whether to retain a token was a challenging endeavour, and the process required an audit. Apart from this, we also made the developer experience smooth and seamless by making the usage of gradients and text styles hassle-free!</p><h3>What’s next? 
🆙</h3><p>The code hitting production was the ultimate reality check, and we learnt several lessons:</p><ul><li><strong>Multi-theme &amp; token variant support: </strong>While the static nature of tokens ensured blazingly fast access with zero startup delays or race conditions, it came with its own limitation: changing themes dynamically at runtime was not possible with this approach.</li><li><strong>Manual update bottleneck:</strong> The process of bumping up the library versions within the client codebases was still manual and often resulted in different parts of the ecosystem running on different versions of the design system.</li><li><strong>Delayed rollouts and rollbacks: </strong>Since the tokens were statically included in the KMP library, any change or revert meant going through the full release cycle on the App/Play Store.</li><li><strong>The older build trap:</strong> Once an app version was released, its look was frozen. Users on older builds saw the old theme, meaning we could never achieve 100% visual uniformity across our user base.</li></ul><p>These were limitations that we have since addressed in newer versions of the library.</p><h3>Wrapping Up</h3><p>Design consistency is crucial for a consumer application with the reach of JioHotstar, and developer and designer experience matters just as much. Our teams are able to spend their time on more meaningful experiences for our customers thanks to initiatives like Tesseract!</p><p>We’re hiring for our client teams! 
Do check out <a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering">open roles</a> if you want to contribute to initiatives like these and focus your time on quality engineering problems versus “token hunting”!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=922e4616bc05" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/tesseract-jiohotstars-central-design-token-system-922e4616bc05">Tesseract: JioHotstar’s Central Design Token System</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling Tales: Discovering Onboarding Rate]]></title>
            <link>https://blog.hotstar.com/scaling-tales-discovering-onboarding-rate-a862afe85345?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/a862afe85345</guid>
            <category><![CDATA[scaling]]></category>
            <category><![CDATA[onboarding-rate]]></category>
            <category><![CDATA[api]]></category>
            <dc:creator><![CDATA[Ajaychoudhary]]></dc:creator>
            <pubDate>Tue, 30 Dec 2025 08:38:48 GMT</pubDate>
            <atom:updated>2025-12-30T09:17:41.907Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mVfmOfoI0JIFOMb0cfLp0g.png" /></figure><blockquote>If there’s one thing streaming live sports teaches you, it’s humility.</blockquote><p>You can prepare for months. Tune every knob. Deploy every optimization.<br>And then, at <strong>7:29:59 PM</strong>, a nation decides to show up.</p><p>For the longest time, we believed we understood this rhythm.<br>We watched <strong>platform</strong> <strong>concurrency</strong> like hawks.<br>We built scaling ladders.<br>We separated Non-High-Scale-Days (<strong>BAU</strong>) and Scale-Days (<strong>Live</strong>) modes.</p><p>And we thought that was enough. It wasn’t.</p><p>The real turning point in our scaling journey came when we realized that <strong>platform concurrency</strong> — the metric we had anchored on for years — was only telling half the story.</p><p>The other half was hiding in plain sight, quietly shaping every surge, every incident, every 3 AM war room.</p><p>That missing piece was <strong>Onboarding Rate (OR)</strong> — the <strong>velocity at which users arrive</strong> 📈.</p><p>Building on the foundation of our <a href="https://medium.com/p/42b04ef5ed6a">multi-datacenter architecture</a>, this is the story of how we uncovered <strong>Onboarding Rate</strong> — the evolution of our scaling strategy — and how it fundamentally changed the way the JioHotstar platform scales.</p><h3>What’s beyond Concurrency?</h3><p>Platform concurrency was always our anchor, the bed-rock of our scaling ladders: how many customers are watching <em>right now? 
</em>For years, this held us in good stead; we learnt over the years how customers <em>shape </em>onto the platform, but we never named the “ramp-up” until relatively recently.</p><p>Through a slew of incidents at lower concurrencies, and the RCAs that followed, we started to realise that we needed to surface and name what we had already planned for instinctively over the years…</p><blockquote>It wasn’t just the number of customers watching at the same time, it was also the pace at which these customers <em>joined </em>the stream.</blockquote><p>During high scale, this phenomenon had remained masked to some degree, since our scaling ladders bake in the “tsunami”. However, as we moved to more autonomous scaling and optimised our infrastructure, cracks started to appear at lower concurrencies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T9K3Tct-__HiYKp-YCDiAA.png" /><figcaption><em>Concurrency tells you how many people are here. Onboarding Rate tells you how fast they’re arriving.</em></figcaption></figure><ul><li><strong>Concurrency (Green):</strong> Gradual incline</li><li><strong>Onboarding Rate (Yellow):</strong> Sharp vertical spike at first ball</li></ul><h3>The First Ball Problem: The Anatomy of a Surge</h3><p>A cricket match doesn’t start gently. It detonates. At the moment the first ball is bowled, we often see:</p><ul><li><strong>6–8 million</strong> users joining in the first <strong>2 minutes</strong></li><li>Homepage API traffic jumping 300–400%</li><li>Watch Page [page with the player] API traffic jumping 400–500%</li></ul><p>Everything in the onboarding path lights up red.</p><p>Then, strangely, 10 minutes later the same services hum along happily as concurrency climbs toward 50 or 60 million. 
While it sounds obvious now — the stress was caused by the arrival pattern.</p><h3>Naming the Invisible Force: Onboarding Rate (OR)</h3><p>We started to ask:</p><p>“How many people are watching now?”<br><strong>AND….</strong><br><strong>“How fast are people joining right now?”</strong></p><p>By plotting <em>new active sessions per minute</em>, the heartbeat of the match finally revealed itself:</p><ul><li>The mini-spike at toss</li><li>The explosion at first ball</li><li>The middle overs lull</li><li>Second innings spike</li><li>The pre-death overs ramp</li><li>The sudden drop at match end</li></ul><p>It was beautiful in a way — like seeing the pulse of a stadium appear on your Grafana dashboard.</p><p>We gave it a name: <strong>Onboarding Rate (OR)</strong>.</p><p>It also pushed us to recognise that every service experienced the wave differently, and to treat that as a first-class problem statement rather than <em>buffering </em>an instinctive number over the concurrency number.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Lh8oPKzITClm3Qk9koGp5A.png" /><figcaption><em>Every match has a pulse. OR makes that pulse visible.</em></figcaption></figure><h3>OR Didn’t Just Improve Scaling — It Rewrote It</h3><p>Recognising OR as a first-class signal forced us to rethink our entire scaling philosophy.</p><p>What we realised was uncomfortable but liberating:</p><ul><li>Our BAU (metrics-only scaling) vs Live mode (concurrency ladder + metrics-based scaling) dichotomy was artificial</li><li>Our reliance on concurrency created blind spots</li><li>Our scaling ladders could be better aligned with the customer journey for an event</li><li>Our systems needed different signals at different stages</li></ul><p>So we did what we do best as a team — adapt and mature.</p><h3>The End of “Modes”: One Unified Scaling Model</h3><p>With OR now visible, we no longer needed to remember to switch from BAU to Live mode. 
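</p><p>The signal itself is simple: as described above, OR is just new active sessions per minute. A toy sketch of deriving it from session-start timestamps (illustrative only, not our production pipeline):</p>

```javascript
// Toy sketch: Onboarding Rate (OR) as new active sessions per minute,
// derived from session-start timestamps (epoch seconds).
// Illustrative only — not the actual JioHotstar pipeline.
function onboardingRatePerMinute(sessionStarts) {
  const perMinute = new Map();
  for (const ts of sessionStarts) {
    const minute = Math.floor(ts / 60);
    perMinute.set(minute, (perMinute.get(minute) ?? 0) + 1);
  }
  // minute bucket -> number of new sessions that joined in that minute
  return perMinute;
}
```

<p>Plotting these per-minute buckets over a match is what reveals the toss mini-spike, the first-ball explosion, and the pre-death-overs ramp.</p><p>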
The system could adapt to user intent directly.</p><ul><li><strong>High OR:</strong> Scale onboarding path immediately</li><li><strong>High Concurrent Users (CCU):</strong> Scale playback and streaming paths</li><li><strong>Both high:</strong> Full platform lift</li><li><strong>Both low:</strong> Efficient, cost-effective base ladder which can auto-scale based on real-time traffic.</li></ul><p>Scaling became fluid — not binary.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LGmlWrNVrdLY_Ei8vH1pzg.png" /><figcaption>OR Ladder Scale up/down with real time metrics</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I6lXOrNLc1DzpyWUqbOdLQ.png" /><figcaption>Concurrency Ladder Scale up/down with real time metrics</figcaption></figure><h3>Mapping Scaling Signals to the Customer Journey</h3><p>Armed with new eyes, we now scale our system based on the stress it experiences.</p><p><strong>Auth</strong> → Onboarding rate + Throughput</p><p><strong>Customer Profile Selection</strong> → Onboarding rate + Throughput</p><p><strong>Home Page</strong> → Onboarding rate + Throughput + Concurrency * Post-match Coefficient</p><p><strong>Watch Page</strong> → Onboarding rate + Throughput</p><p><strong>Ads</strong> → Concurrency + Throughput</p><p>This alignment transformed our reliability dramatically, as is evident from the scaling patterns below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u-dK65m0uMMRlJ3kR1LkKQ.png" /><figcaption>OR based System</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wTayl41kxnEouReI3TRyXA.png" /><figcaption>Concurrency based System</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gey1-lX5_ixusr5WjWW4Xw.png" /><figcaption>OR + Concurrency*coefficient based system</figcaption></figure><h3>The Surprise We Didn’t Expect: The Post-Match Storm (Patent filed)</h3><p>We thought OR solved 
everything. Then we discovered match endings — or the “Movie Interval” conundrum.</p><p>At the end of a match/innings:</p><ul><li>Concurrency drops</li><li>OR is flat</li><li>But homepage traffic spikes <strong>200–300% within seconds</strong></li></ul><p>Once the event was over, customers flocked to the homepage, which was great; we <em>want </em>this to happen! Just like a movie <em>interval</em>, everyone streams out to the concession stands, causing a surge.</p><p>We call this the “off-boarding rate” (OBR).</p><p>OBR was invisible to both concurrency and OR.</p><p>So we built a hybrid model for homepage scaling:</p><p><strong>Homepage Scaling = OR + (Concurrency × Post-Match Coefficient)</strong></p><p>This resolved the reverse surge — though we continue to refine how we predict the OBR, to arrive at an optimal coefficient.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*d_rlNCkVvTSmcdUR1gWhjQ.png" /><figcaption><em>The moment after the match ends is its own event.</em></figcaption></figure><h3>Rounding It All Up</h3><p>Here’s what we’ve learnt so far, and we’re still learning!</p><h3>1. Ride the “Wave”</h3><p>The curve of user arrival tells you far more than the peak ever will. Right from when customers arrive [OR], to their “staying” [concurrency], and finally their off-boarding [OBR].</p><h3>2. Every Service is Unique Like A Snowflake</h3><p>Every service moves to a different rhythm, depending on where it is in the customer journey. While we always knew this, with OR, we are now able to articulate this difference more sharply, mathematically.</p><h3>3. Adapting Is Compulsory</h3><p>World record concurrencies are a given when you stream Indian men’s cricket. 
No other streaming service in the world has to deal with the <em>tsunamis </em>that JioHotstar has to plan for.</p><p>Even though we’ve learnt and built over the years, we had to keep adapting and naming parts of our scaling system that we had previously only scaled instinctively.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oYEXnXUV-oQ2LVWJZfacJw.png" /><figcaption><em>True scaling is preparing for all three.</em></figcaption></figure><p>Live sports streaming builds a lot of humility — it requires a unique blend of multi-disciplinary headspace to get right, from clients, to CDNs, to cloud capacity, to gateways, to services and back. At JioHotstar we constantly have to learn to build for the next big wave, as our digital footprint continues to grow.</p><p>Adapting our scaling systems to recognize OR as a first-class citizen has been a major step in our scaling architecture.</p><p>As we tread calm waters and wait for the next tsunami, we have our metrics right, for now!</p><p>Don’t miss the earlier editions of this scaling series:</p><ul><li><a href="https://medium.com/hotstar/scaling-infrastructure-for-millions-datacenter-abstraction-part-2-42b04ef5ed6a">Scaling for Millions — Datacenter Abstraction</a></li><li><a href="https://medium.com/hotstar/scaling-infrastructure-for-millions-from-challenges-to-triumphs-part-1-6099141a99ef">Scaling for Millions — Challenges &amp; Triumphs</a></li></ul><p>Want to work on problems like these? Come join our team! We’re actively hiring across all teams. 
Please apply <a href="https://jobs.lever.co/jiostar?department=Digital%20%7C%20Engineering">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a862afe85345" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/scaling-tales-discovering-onboarding-rate-a862afe85345">Scaling Tales: Discovering Onboarding Rate</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Demuxed 2025 Talk — Server Guided Ad Insertion (SGAI)]]></title>
            <link>https://blog.hotstar.com/demuxed-2025-talk-server-guided-ad-insertion-sgai-193d7326d270?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/193d7326d270</guid>
            <category><![CDATA[demuxed]]></category>
            <category><![CDATA[scale]]></category>
            <category><![CDATA[dynamic-ad-insertion]]></category>
            <category><![CDATA[aisg]]></category>
            <dc:creator><![CDATA[Akash Saxena]]></dc:creator>
            <pubDate>Wed, 24 Dec 2025 05:29:55 GMT</pubDate>
            <atom:updated>2025-12-24T05:29:55.092Z</atom:updated>
<content:encoded><![CDATA[<h3>Demuxed 2025 Talk — Server Guided Ad Insertion (SGAI)</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*I0LLn1RRzscTzDha" /><figcaption>Photo by <a href="https://unsplash.com/@varpap?utm_source=medium&amp;utm_medium=referral">Vardan Papikyan</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><blockquote>JioHotstar (JHS) has a problem that no other streamer in the world has — monetising mid-rolls at world record levels of concurrency, not global, but in a single geography. Live sports rights fees are significant, which means that monetisation strategies must be uber resilient to match up.</blockquote><p>We solved this for large cohorts using our home-spun Server Side Ad Insertion (SSAI) tech, which we pioneered from 2019 onwards, when no other commercial provider could step up, and none can, at this scale. We took this further with our work on SGAI — for granular 1:1 targeting at scale.</p><p><a href="https://medium.com/u/af4f3dac1e63">Prachi Sharma</a> recently gave a talk about our pioneering work on Server Guided Ad Insertion (SGAI) at <a href="https://2025.demuxed.com/">Demuxed UK</a>. 
Here is her talk and the transcript follows.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FSlLD-gvX7nM%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DSlLD-gvX7nM&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FSlLD-gvX7nM%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/9b6584b5bfdb1405f0cd64f84a9e9cc2/href">https://medium.com/media/9b6584b5bfdb1405f0cd64f84a9e9cc2/href</a></iframe><h3>SGAI @ Scale</h3><p><a href="https://en.wikipedia.org/wiki/Cricket">Cricket</a> matches typically last around three hours with two innings, and streams are available in multiple languages and camera angles. Unlike sports with fixed commercial timeouts, cricket’s ad break structure is unpredictable.</p><p>While there are planned slots between overs, an over’s duration can vary significantly due to fast bowlers, umpire reviews, tense game moments, or strategic timeouts. Additionally, viewership is not steady, with key moments driving massive spikes in concurrency. Therefore, inserting ads in live cricket involves managing irregular, dynamic, and often short ad breaks where even milliseconds are crucial for revenue.</p><h3>Ad Insertion Strategies</h3><h4>Client-Side Ad Insertion (CSAI)</h4><p>In this model, the client app uses a two-player strategy: one for content and a separate one for ads, hoping they stay in sync. A <a href="https://en.wikipedia.org/wiki/SCTE-35">SCTE-35</a> marker in the live feed triggers an ad break, pausing the content player and spinning up the ad player. The client fetches the ad creative from the ad server, plays it, and then hands control back to the content player. 
While CSAI offers direct measurement and control, it has challenges such as susceptibility to ad blockers, buffering risks on low-powered devices or weak networks, and a clunky user experience due to the constant switching between players.</p><h4>Enter SSAI</h4><p>To address these issues, Server-Side Ad Insertion (SSAI) emerged as a more seamless and resilient solution. Instead of a two-player model, SSAI shifts the work to the server, which rewrites the original content manifest to include ad segments. The client then sees a single stream with ads and video delivered together.</p><p>When the SSAI system detects a SCTE-35 marker, it calls the ad server to fetch relevant ad creatives and rewrites the manifest to seamlessly integrate them. SSAI can insert ads in three ways: spot ads (burnt directly into the playout, same ad for everyone), cohort-level ads (users grouped, each cohort gets personalized ads), and one-to-one stitching (unique manifest for every user).</p><p>SSAI solves many CSAI problems: ads are harder to skip or block, there’s no sync drift, low-powered devices can handle playback as they decode a single stream, and the overall user experience is seamless, similar to broadcast TV.</p><p>JioHotstar built its SSAI infrastructure in-house, giving us control over the entire pipeline. The live stream goes from production to our playout systems, where operators manage the stream and insert ad markers based on a production feed and a director’s feed (which runs a few seconds ahead for reaction time).</p><p>From playout, the stream passes to origin servers, where cohorts are layered. Instead of every user getting a unique manifest, users are grouped into large cohorts based on targeting parameters like age, gender, location, and device type.</p><p>When the CDN requests a manifest for a cohort, the in-house stitching service interacts with the ad server to fetch ads for that cohort and rewrites the manifest to include the ad segments. 
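</p><p>The rewrite step can be pictured as a simplified playlist splice. This is a toy sketch only; the tags and handling are heavily simplified, and real SSAI also deals with segment timing, discontinuity sequencing, and encryption:</p>

```javascript
// Toy sketch of a cohort-level manifest rewrite: when the cue tag is
// seen, splice the cohort's ad segment lines into the media playlist.
// Heavily simplified — not the actual stitching service. Real SSAI
// handles timing, encryption, discontinuity sequences, and much more.
function stitchAdBreak(playlistLines, cueTag, adLines) {
  const out = [];
  for (const line of playlistLines) {
    out.push(line);
    if (line.startsWith(cueTag)) {
      // Mark the splice boundaries so players reset decoders across it.
      out.push("#EXT-X-DISCONTINUITY", ...adLines, "#EXT-X-DISCONTINUITY");
    }
  }
  return out;
}
```

<p>Because each cohort gets its own ad segment lines, the number of distinct manifests multiplies across streams, qualities, and cohorts.</p><p>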
These stitched segments are then pushed to the CDN, providing each cohort with a personalized stream.</p><h3>SSAI Limitations For Targeting</h3><p>SSAI also presented challenges for JioHotstar, both in streaming infrastructure and monetization. The number of manifests produced by the stitching service multiplied as the product of streams, qualities, and cohorts, leading to decreased CDN cache offload and increased compute time at the origin shield.</p><p>While theoretically possible, backend stitching for tens of millions of users in real-time is a difficult problem (as of writing, the JHS peak concurrency record stands at 62.5Mn concurrent viewers). On the monetization side, SSAI’s cohort-level operation limited granularity, making it difficult to target specific user segments (e.g., women in Mumbai if the cohort is only women in metro cities).</p><p>This also hindered performance campaigns requiring precise per-user targeting and conversion tracking. Reach and frequency campaigns were also difficult to execute due to a lack of impression-level control, making programmatic pipeline integration challenging as they rely on one-to-one matching. Over-delivery of ads and inability to enforce frequency capping were also issues. While SSAI provided reliability and control, it imposed limits on utilizing remnant inventory and scaling the solution.</p><h3>Et <em>Voilà</em>! SGAI</h3><p>This led to the development of Server-Guided Ad Insertion (SGAI). The core idea of SGAI is that the stream remains the same for everyone, but ad delivery changes. Instead of the server pre-stitching ads for many users, it sends a common manifest with ad opportunities marked via SCTE-35 tags, HLS interstitials, or MPEG xlink cues.</p><p>When a break occurs, the client follows the manifest and requests the ad system, at which moment the ad server decides which ad to show to that specific user. 
This results in a cache-friendly, common manifest across users, while ad segments delivered can vary. Importantly, even though the call is client-side, SGAI operates in a single-player world where ads and content are just segments in the same pipeline, without player juggling like in CSAI.</p><p>SGAI offers significant benefits for scalability, as the single manifest version can be easily cached on the CDN, and the CDN only needs to scale for network, not compute. The separation of content and ad delivery also allows both infrastructures to scale independently. On the monetization side, SGAI unlocks one-to-one serving, enabling frequency caps, programmatic pipelines, and richer region and frequency campaigns.</p><h3>SGAI Challenges</h3><p>However, SGAI also has challenges. It remains network-dependent, requiring the client to fetch ads quickly to prevent playback stalls. Full SGAI adoption requires consistent support across all platforms for a truly universal solution.</p><p>When JioHotstar began its SGAI journey, player technology was still evolving in the industry, with limited native support in open-source players for features like HLS interstitials and MPEG xlink cues. Given our vast Android user base relying on ExoPlayer for HLS streaming, we leveraged our existing custom ExoPlayer fork that already parsed HLS manifests. We added lightweight logic to this fork to intercept ad markers and trigger our ad logic without additional overhead.</p><p>We also exported our in-house server-side stitching logic into a client-side library, acting as a manifest interceptor decoupled from the player. 
This architecture allows for independent evolution of the player and the library, enables multi-CDN controls (routing ad segments and content to different CDNs based on bandwidth), and allows the same signaling mechanism to trigger companion rendering, L-Bands, or on-screen overlays, making the framework extensible.</p><p>Crucially, because the core logic is player-independent, the solution can be plugged into multiple platforms with minimal adoption effort, setting us up for broader device coverage.</p><h3>Workflow</h3><p>The SGAI process involves the player receiving a pristine manifest with SCTE-35 markers. The client-side library intercepts the manifest and calls a middleware service, which in turn talks to the ad server to get ads for the user. It fetches master and child manifests and returns relevant ad segments to the client. The client library then assembles these ad segments into a temporary manifest, which is handed back to the player. The client-side stitching is deliberately lightweight, solely focused on inserting ad segments.</p><p>A significant challenge faced after moving to SGAI was at the ad server. During an IPL break with millions of concurrent viewers, every client would simultaneously try to contact the ad server for ads when a SCTE-35 marker hit, causing a “thundering herd” problem and multi-million RPS spikes.</p><p>To mitigate this, our team explored adding a jitter on the client side. As the player typically buffers a few seconds of content after detecting a SCTE-35 marker, this window allows a small, dynamic jitter to be applied before contacting the ad server, spreading requests so that the server is not hit all at once.</p><p>The challenge lies in carefully tuning this jitter: too little and the spike remains; too much and there might not be enough time to stitch the segments. 
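</p><p>That tuning constraint can be captured in a few lines. This is a sketch under assumed numbers; the buffer and stitch budgets below are illustrative, not production values:</p>

```javascript
// Sketch: bound the client-side jitter by the playback buffer, after
// reserving time to fetch and stitch the ad segments. The budget
// numbers used with this function are illustrative assumptions only.
function adRequestDelayMs(bufferAheadMs, stitchBudgetMs, rand = Math.random) {
  // Never jitter beyond what the buffer can absorb once stitch time
  // is reserved; clamp to zero when the buffer is too small.
  const maxJitterMs = Math.max(0, bufferAheadMs - stitchBudgetMs);
  return Math.floor(rand() * maxJitterMs);
}
```

<p>With roughly 8 seconds of buffer and a 3-second stitch budget, for example, clients would spread their ad calls across up to 5 seconds instead of hitting the ad server in the same instant.</p><p>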
Another approach is pre-fetching ads for the following break, but the critical part is deciding when to pre-fetch to give the ad server enough time to process inventory and make ad decisions.</p><h3>Looking Ahead</h3><p>Looking ahead, JioHotstar has three main directions. First, extending SGAI beyond Android, integrating the manifest interceptor into other platforms due to its player independence. Second, enabling longer DVR windows, which introduces additional storage, latency, and measurement challenges requiring smarter coordination between the CDN, player, and ad system.</p><p>Finally, using the manifest interceptor beyond just ads for personalized replays, alternate commentary feeds, and potentially a future where every user streams their own personalized version of the game.</p><p>Want to work on improving SGAI for millions of concurrent customers and problems that no other company has to solve? Come join our team! We’re actively hiring in our Ads team. Please apply <a href="https://jobs.lever.co/jiostar?department=Software+Engineering">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=193d7326d270" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/demuxed-2025-talk-server-guided-ad-insertion-sgai-193d7326d270">Demuxed 2025 Talk — Server Guided Ad Insertion (SGAI)</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[JioHotstar Android App — Road to 99.9% CFUR]]></title>
            <link>https://blog.hotstar.com/jiohotstar-android-app-road-to-99-9-cfur-e6bdd1299558?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/e6bdd1299558</guid>
            <category><![CDATA[memory-profiling]]></category>
            <category><![CDATA[android-development]]></category>
            <category><![CDATA[android]]></category>
            <category><![CDATA[crashlytics]]></category>
            <dc:creator><![CDATA[Vrihas Pathak]]></dc:creator>
            <pubDate>Tue, 23 Dec 2025 05:44:54 GMT</pubDate>
            <atom:updated>2025-12-23T05:44:53.112Z</atom:updated>
            <content:encoded><![CDATA[<h3>JioHotstar Android App — Road to 99.9% CFUR</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nXXJDHWWB2I4Odtj" /><figcaption>Photo by <a href="https://unsplash.com/@edge2edgemedia?utm_source=medium&amp;utm_medium=referral">Edge2Edge Media</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><blockquote>First rule of building mobile apps — “Don’t Crash”</blockquote><p>At JioHotstar, we keep an eye on the Crash Free User Rate (<strong>CFUR</strong>) as a key metric of success. Over the years, we’ve achieved a 99.8% CFUR, which we are very proud of. Here we share the strategies we used to resolve persistent Out Of Memory (OOM) crashes, a major source of instability.</p><h3>Goal — 99.9%</h3><p>Our Android app consistently maintained a <strong>99.5%</strong> CFUR, no small feat given our customer base of millions. However, our ultimate goal, our North Star metric, is to achieve a <strong>99.9%</strong> CFUR.</p><h3>Strategy to improve CFUR</h3><p>To reach our goal and improve our CFUR, we rigorously monitored crashes through the Firebase Crashlytics dashboard.</p><ul><li><strong>Quick off the block</strong>: We started by targeting the most critical crashes affecting users. To achieve this, we prioritized fixing the <strong>top 10</strong> crashes based on their impact on user experience. This focused approach proved effective, and we successfully raised the app’s CFUR to an impressive <strong>99.5%</strong>. However, as we approached this milestone, we encountered a challenge — further improvements became increasingly difficult.</li><li><strong>Pushing Further: </strong>At 99.5% CFUR, the crashes we were addressing in the top 10 list were no longer having a significant enough impact to push the needle further. It became evident that a different strategy was needed to break through this wall.
We decided to expand our analysis, shifting our focus from the top 10 crashes to the <strong>top 25</strong>. Through this broader investigation, we uncovered a pattern: a large portion of these crashes were related to Out Of Memory (OOM) issues.</li></ul><h4>Out of Memory And Into Our Radar</h4><p>While each individual OOM crash had a relatively minor impact on its own, collectively these crashes were affecting approximately <strong>0.3%</strong> of our users. This discovery was crucial: it pointed to a category of crashes that had previously gone under the radar because of their low individual impact but were collectively significant in keeping us from achieving an even higher CFUR.</p><p>Further analysis revealed that these OOM issues were particularly problematic during <strong>large-scale live events (&gt;40Mn concurrent)</strong> within our app. During these events, many users with lower-RAM devices were onboarded, which made them more susceptible to OOM crashes. As a result, these devices disproportionately contributed to the surge in OOM-related crashes, significantly impacting our overall CFUR.</p><p>Memory-intensive operations such as playing multiple videos or switching between different content types led to increased memory consumption, which the app was unable to handle effectively. As a result, the app frequently crashed for approximately <strong>0.3%</strong> of users when it exceeded its allocated memory limits.</p><p>These crashes were particularly challenging because they were often triggered by background operations that retained excessive memory even when the app was not actively being used.</p><p>Recognizing the need for a solution, we embarked on a comprehensive analysis and resolution of these persistent OOM issues.
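</p><p>To make “exceeded its allocated memory limits” concrete, here is the back-of-the-envelope arithmetic behind a typical ART OutOfMemoryError report (a quick sketch; the numbers mirror the kind of values seen in our crash reports):</p>

```python
def heap_free_percent(growth_limit_bytes: int, free_bytes: int) -> float:
    """Share of the ART heap still free, as OOM messages report it."""
    return 100.0 * free_bytes / growth_limit_bytes

# A 512 MiB growth limit with only ~3 MB free leaves well under 1% of
# the heap free after GC, so even a ~2 KB allocation can fail once that
# remaining free space is fragmented.
print(round(heap_free_percent(536_870_912, 3_015_400), 2))  # → 0.56
```

<p>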
Through meticulous memory profiling and strategic fixes, we aimed to elevate our app’s performance to achieve our North Star metric of a 99.9% CFUR.</p><h3>Deep Seek: OOM Analysis!</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/480/0*qxO2M1zBk4QxfDCZ" /></figure><h4><strong>Crashlytics dashboard analysis</strong></h4><p>We first analyzed the top OOM issues in the Crashlytics dashboard. These were the top variations of the OOM issues:</p><pre>* com.google.common.collect.ImmutableSet.construct<br>* kotlinx.coroutines.CancellableContinuationImpl.resumedState<br>* com.hotstar.DaggerHsApplication_HiltComponents_SingletonC$ActivityCImpl.getViewModelKeys<br>* android.os.perfdebug.MessageMonitorImpl$MessageMonitorInfoImpl.markDispatch</pre><p>For all of the above issues, one common error message appeared in the stack trace:</p><blockquote>Fatal Exception: java.lang.OutOfMemoryError<br>Failed to allocate a 2064 byte allocation with 3015400 free bytes and 2944KB until OOM, target footprint 536870912, growth limit 536870912; giving up on allocation because &lt;1% of heap free after GC.</blockquote><h4><strong>Unpacking the fault</strong></h4><ul><li><em>Memory Allocation Failure:</em> The application failed to allocate a relatively small amount of memory (2064 bytes), even though there was seemingly enough free memory. This can happen when the heap is fragmented, so the free memory isn’t contiguous.</li><li><em>Garbage Collection Ineffectiveness</em>: The garbage collector couldn’t free up enough memory to satisfy the allocation request.
This indicates that the application holds on to a lot of memory, potentially due to memory leaks or inefficient memory usage.</li><li><em>Approaching OOM Threshold</em>: The application was close to its maximum allowed heap size (512MB), suggesting that it was consuming a lot of memory overall.</li></ul><p>We analyzed more stack-trace threads and found that the app was crashing mostly on the watch page (the page where our video player sits).</p><h4>User flow analysis using custom logs</h4><p>Unlike other crashes, identifying the root cause of OOM issues from the stack trace is challenging. Crashlytics reports the point of failure at the moment of the crash, but it does not pinpoint the actual underlying cause. OOM crashes can stem from various objects being retained in memory, eventually leading to a crash when the app exceeds its allocated memory limit.</p><h4><strong>Custom Tracing</strong></h4><p>Until this point, we had identified a recurring issue related to the watch page when users were consuming content. To better understand the problem, we leveraged the custom logs implemented within the app. These logs provide detailed insights into the app’s and player’s states, allowing us to track the app’s behavior over time.</p><p>From the application side, we have been tracing detailed information about the app’s state, including whether it is in the foreground or background, as well as the player’s state, such as when the user starts and stops the player. By analyzing this timeline data, we can correlate the app’s state with user actions, providing a clearer picture of when and why OOM crashes occur.</p><h4><strong>Key Findings</strong></h4><p>While checking the logs, we found that most of the user content playing sessions were in the downloads flow due to an “offline mode” flag. Additionally, we noticed a pattern where users frequently switched between different downloaded contents within a single session.
This pattern was prevalent among most users experiencing OOM crashes.</p><h4>Memory Profiling</h4><p>We initiated our investigation by profiling the offline user journey. Using the Android Profiler, we performed memory profiling on our application. Our approach involved replicating the exact user journey patterns observed in Crashlytics to ensure accurate and relevant profiling.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fUpNb_369LpIihtM" /></figure><h4><strong>Analysis and Observations</strong></h4><p>We analyzed the heap dump and memory usage patterns of the app during various user journey steps. The key observations were as follows:</p><ul><li><strong>High Retained Sizes</strong><br>We observed that objects from third-party libraries such as LottieCompositionCache and EmojiCompat exhibited high retained sizes. Ideally, these objects should be garbage collected by the system’s Garbage Collector (GC) once the user journey is completed. However, they were not being released as expected, contributing to memory retention issues.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Jbl4Q_ESUgNWNCSjwgiGlQ.png" /><figcaption>High retained sizes of third-party libraries</figcaption></figure><ul><li><strong>Network Call Retention<br></strong>We use separate API calls to track client-side events whenever the user is interacting with the app. Using the network debugger tool, we identified that these calls were being retried continuously after playing content from the Downloads. They were failing due to offline network conditions but were being retried repeatedly. This behavior led to substantial memory allocation for network constructs such as SegmentPool (okio) and CipherSuite (okhttp3), exacerbating memory usage.</li><li><strong>Incremental Memory Usage<br></strong>During the same session, each time the player was opened for new content, the app’s memory usage increased by approximately 10 MB.
This cumulative increase in memory usage eventually led to app crashes, particularly during sessions with frequent content switching.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/930/1*1e8AUv0HuE2Dkfb0_0T6Qg.png" /><figcaption>Memory graph of the app in the profiler</figcaption></figure><h4><strong>Additional Profiling</strong></h4><p>In addition to profiling the offline user journey, we extended our analysis to cover other parts of the app as well. The observations in this section included:</p><ul><li><strong>ViewModel Leaks<br></strong>A ViewModel instance was found to be leaking after the user navigated back from a specific tab and then exited the app. This suggested that resources were not being properly released during navigation, leading to memory leaks.</li><li><strong>Retained Views<br></strong>Some views were retained in memory even after the user exited a screen with multiple players using the back button. These views were holding references to a singleton class, highlighting a need for better management of view lifecycles and dependency handling to prevent resource retention and ensure proper cleanup.</li></ul><h3>Fixing It All — Tune Up and Go 🚀</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/0*S_3VEjTFWRVWpqFY.gif" /></figure><p>Through memory profiling, we identified several critical issues contributing to OOM crashes, as follows:</p><h4>Lottie Composition Cache Retention</h4><ul><li>Lottie, a third-party library used for rendering images and animations, was retaining its Lottie Composition Cache object, which was not being collected by the Garbage Collector (GC). This object had a substantial retained size of around <strong>8 MB</strong>.
For our use case, we didn’t want Lottie to cache anything; in cases where Lottie’s cache is used so that image resources don’t have to be loaded every time, this retention would be valid.</li><li>To mitigate this, we manually cleared the Lottie library cache during the destruction of the app’s Main Activity and the disposal of Page UI. The Lottie library provides an API for clearing its cache, which we utilized to ensure the cache was properly released. Following this change, the Lottie Composition Cache was no longer retained in memory.</li></ul><h4>Emoji Compat Library Optimization</h4><ul><li>The Emoji Compat support library, designed to keep Android devices updated with the latest emojis and prevent the display of missing emoji characters, was retaining memory. The retained size of this library was approximately <strong>352 KB</strong>.</li><li>We placed the initialization of this library under a feature flag and subsequently disabled it. By disabling the Emoji Compat library, we ensured that it was not retained in memory, thereby reducing unnecessary memory usage.</li></ul><h4>High Memory Allocation of OkHttp resources</h4><ul><li>When a user watches content, our app’s internal library triggers client-side events via API calls. If a call fails, events are stored in a queue, and the library continuously retries until successful transmission.</li><li>In offline mode, as users watch downloaded content, these API calls fail due to lack of network connectivity, causing the events to remain queued and retried, leading to high memory allocation for OkHttp objects like Segment Pool and Cipher Suite. This problem worsened whenever a new player instance was initiated.</li><li>To address this, we modified the library to halt API calls when no network is detected, queuing events for transmission only when connectivity is restored.
This change significantly reduced unnecessary memory allocation and overall app memory usage.</li></ul><h4>View Model Memory Leak Fix</h4><ul><li>A view model was leaking memory when users navigated to a specific tab. The issue was traced to a handler retained by the view model to trigger certain actions, preventing proper garbage collection of the view model instance.</li><li>To address this, we refactored the view model to avoid direct usage of the handler. Instead, an event system was implemented where the view model fires an event, which is then captured within the UI containing the handler. Upon receiving the event, the handler triggers the required actions. This approach ensured that the view model did not hold unnecessary references, effectively eliminating the memory leak.</li></ul><h3>Results</h3><p>We incorporated all the aforementioned fixes into our build. Following the deployment, we closely monitored user adoption and performance metrics via the Crashlytics dashboard. The results were notable: all OOM issues previously listed among the top 10 crashes were resolved. Consequently, the Crash Free User Rate for that build exceeded <strong>99.8%</strong></p><h3>Do’s and Don’ts for Addressing OOM Issues</h3><h4>Do’s</h4><ul><li><strong>Conduct Comprehensive Memory Profiling</strong><br>Regularly perform memory profiling using tools such as Android Profiler to understand memory usage patterns and identify potential memory leaks.</li><li><strong>Implement Efficient Memory Management</strong><br>Use appropriate data structures and algorithms that minimize memory usage. 
Release unused resources promptly, such as database connections, file handles, and bitmap objects, to prevent memory bloat.</li><li><strong>Optimize Third-Party Libraries</strong><br>Clear caches and release retained objects from third-party libraries like Lottie and EmojiCompat to avoid unnecessary memory retention.</li><li><strong>Monitor and Log App States</strong><br>Implement custom logging to track app states. Use these logs to analyze user behavior and pinpoint memory usage spikes.</li><li><strong>Profile Specific User Journeys</strong><br>Replicate and profile specific user journeys, particularly those that are prone to OOM issues, to gain insights into memory consumption during those activities.</li><li><strong>Manage Network Calls Effectively</strong><br>Integrate network state checks to prevent unnecessary retries of network calls in offline mode, reducing memory allocation for network-related objects.</li><li><strong>Refactor Code to Avoid Memory Leaks</strong><br>Refactor ViewModels and other components to eliminate direct usage of resources that prevent garbage collection.</li><li><strong>Utilize Feature Flags</strong><br>Use feature flags to control the initialization and usage of memory-intensive libraries and features, enabling you to disable them when not needed.</li></ul><h4>Don’ts</h4><ul><li><strong>Avoid Retaining Unnecessary References</strong><br>Do not retain references to objects that are no longer needed, as this prevents garbage collection and leads to memory leaks.</li><li><strong>Do Not Overlook Background Services</strong><br>Background services that continuously sync data, download files, or process information can consume significant memory. 
Ensure they are managed and released properly.</li><li><strong>Avoid Large Data Operations Without Optimization</strong><br>Performing operations on large datasets or loading large files (e.g., images, videos) without efficient memory handling can lead to excessive memory consumption.</li><li><strong>Do Not Rely Solely on Garbage Collection</strong><br>Relying only on the system’s garbage collection to manage memory can be ineffective. Actively manage and release resources to maintain optimal memory usage.</li><li><strong>Avoid Singletons for Memory-Intensive Objects</strong><br>Using singleton patterns for objects that hold significant memory can lead to retention issues. Ensure that such objects are released when no longer needed.</li></ul><p>Want to focus on building customer experiences that are not only delightful but also demand high stability? Come join our team! We’re actively hiring in our Android team. Please apply <a href="https://jobs.lever.co/jiostar?department=Software+Engineering">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e6bdd1299558" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/jiohotstar-android-app-road-to-99-9-cfur-e6bdd1299558">JioHotstar Android App — Road to 99.9% CFUR</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Preventing Performance Regressions on iOS Apps]]></title>
            <link>https://blog.hotstar.com/preventing-performance-regressions-on-ios-apps-be3bd033f1dc?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/be3bd033f1dc</guid>
            <category><![CDATA[regression-analysis]]></category>
            <category><![CDATA[ios]]></category>
            <category><![CDATA[xcuitest]]></category>
            <category><![CDATA[performance]]></category>
            <dc:creator><![CDATA[Saurabh Kapoor]]></dc:creator>
            <pubDate>Fri, 12 Dec 2025 10:49:44 GMT</pubDate>
            <atom:updated>2025-12-12T10:49:44.051Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sLYaRXNYMv7IAGShtOWR4g.png" /></figure><blockquote>Dive into the nitty-gritties of how we ensure that our iOS performance remains top-notch, even with so many changes going in on a weekly basis, so that our customers always get the best experience.</blockquote><h3>Introduction</h3><p>Have you ever faced the frustration of using an app that just couldn’t keep up? You know, the kind that lags, stutters, and sometimes feels like it’s working against you rather than for you? If you’ve ever found yourself wondering why an app performs poorly on your device, you’re not alone.</p><p>In a world where competition is fierce and user expectations are higher than ever, performance isn’t just a nice-to-have — it’s table stakes. That’s why we at JioHotstar are on a mission to deliver nothing short of a blazing-fast experience to our millions of users, regardless of the device they’re using.</p><p>Our journey to optimal performance wasn’t without its challenges. We started by tackling the major performance issues head-on, fixing jank and glitches across our pages. Each victory felt like a triumph, a step closer to our goal of a seamless user experience. But our celebrations were short-lived. Time and time again, issues resurfaced, undoing our hard-earned progress.</p><blockquote>It was like chasing our own tail, stuck in a frustrating cycle of fix and repeat.</blockquote><p>We knew there had to be a better way — a way to identify and fix problems before they ever reached our users’ screens. Performance. Responsiveness. They’re not glamorous tasks. When done properly, nobody is going to thank you. When done incorrectly, app retention is going to suffer.</p><h3>How do you even start testing for performance?</h3><p>Our primary challenge lay in the testing environment itself.
Developers often rely on a single high-performance device or simulator, which doesn’t reflect the diverse realities of user experiences. This setup overlooks scenarios like memory constraints or battery saver mode, which can significantly impact performance, much like navigating congested traffic.</p><p>Manual testing also falls short in detecting subtle performance issues, such as app launch speed or scrolling smoothness, as these can be subjective. Different people perceive smoothness differently, making it difficult to measure performance accurately.</p><p>To address these limitations, we shifted to automation, aiming to improve efficiency and objectively identify issues. We explored two approaches to automating performance testing, each presenting its own set of challenges.</p><h3>Leveraging E2E Tests along with Instruments to detect performance issues</h3><p><a href="https://developer.apple.com/videos/play/wwdc2019/411/">Instruments</a>, Apple’s powerful performance analysis tool, is deeply integrated with Xcode, allowing developers to profile and debug their applications using a variety of specialized templates. Its real-time data and intuitive visualizations are invaluable for uncovering performance bottlenecks and gaining insights into app behavior.</p><p>However, as we delved deeper into detecting performance issues using automation, some crucial questions emerged:</p><blockquote><em>How do we automate Instruments? <br>Is there a way to detect performance issues without introducing another layer of testing complexity? <br>Aren’t our existing unit, integration, and E2E tests sufficient?</em></blockquote><p>Our ultimate goal was to integrate performance monitoring directly into our existing End to End (E2E) tests, ensuring a streamlined workflow without sacrificing thoroughness.</p><p>Thankfully, xctrace offered a solution.
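</p><p>One way to do this is to have the test driver start an Instruments recording while the XCUITests execute. A sketch of building that invocation (Python; the template, device name, and even the exact xctrace flags are assumptions here and should be checked against <code>xcrun xctrace record --help</code> for your Xcode version):</p>

```python
import subprocess

def xctrace_record_cmd(template: str, device: str, output: str) -> list:
    """Assemble an `xcrun xctrace record` command to run alongside the
    E2E tests. Flag names are assumptions to verify against --help."""
    return [
        "xcrun", "xctrace", "record",
        "--template", template,   # e.g. "Time Profiler"
        "--device", device,       # device name or UDID
        "--output", output,       # writes a .trace bundle
        "--all-processes",
    ]

cmd = xctrace_record_cmd("Time Profiler", "iPhone 14", "e2e.trace")
# subprocess.Popen(cmd) would start the recording in parallel with the tests
print(cmd[0:3])
```

<p>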
By running our E2E tests while simultaneously capturing performance metrics, we could weave performance monitoring directly into our established testing routine. Here’s how our solution unfolded:</p><ol><li><strong>Integrating xctrace with E2E Tests:</strong> We started by running a single E2E test, incorporating performance profiling using the xctrace command. This allowed us to monitor performance in real-time without disrupting our existing tests.</li><li><strong>Generating Detailed Reports:</strong> After each test, we generated a .trace report that captured a comprehensive dataset of performance metrics. This report provided a detailed view of the app’s behavior under real-world conditions, helping us spot potential issues early.</li><li><strong>Data Extraction and Analysis:</strong> Using the xctrace export command, we extracted and analyzed the performance data. This step was crucial for converting raw metrics into actionable insights, allowing us to pinpoint specific areas needing improvement.</li></ol><p>A sample E2E flow would look something like the following:</p><pre>    func testWatchNowOnHomePage() throws {<br>        let homePage = HomePage()<br>        homePage.tapOnTabBar(item: .home)<br>        let button = app.buttons.containing(NSPredicate(format: &quot;identifier CONTAINS[c] %@&quot;, &quot;watch_now&quot;))<br>            .lastMatch<br>        button.tap()<br>        app.buttons[&quot;HSTitleBar.rightBtn&quot;].tap()<br>        app.swipeDown()<br>    }</pre><p>Now, in order to detect performance issues in this flow, we could run Instruments in parallel using the xctrace command.</p><p>This appeared promising, but challenges emerged in managing large volumes of raw debug data, dealing with inconsistencies, and developing an effective reporting system:</p><ul><li><strong>Managing Large Data:</strong> Each test run produced massive amounts of raw debug data, posing a significant storage challenge.
Even just recording for 20 seconds could create a file as big as 200 MB, and if we recorded for a few minutes for larger E2E tests, the file size could easily reach GBs. This made it really hard to manage and store all the recordings.</li><li><strong>Dealing with Inconsistencies:</strong> The collected data exhibited inconsistencies, with traces sometimes being skipped or incomplete, making it challenging to pinpoint the root cause of regressions reliably.</li><li><strong>Developing a Reporting System: </strong>Creating a reporting system to handle raw trace files posed significant challenges. The goal was to parse through this data, filter out irrelevant details, and extract actionable insights that directly impact the application’s performance. However, not all information was easily extractable, complicating the development of a robust parser and making it challenging to deliver a comprehensive reporting solution.</li></ul><p>The above challenges made it clear that continuing with this approach would introduce more overhead and uncertainty than benefit, prompting us to explore alternative methods for automating performance testing.</p><h3>Harnessing Performance Tests for Optimal Results</h3><p>As we looked into alternatives, we came across XCTMetric and found that it offers various options to help us detect important issues across different aspects of our app’s performance while running our tests.</p><p>For example, there is:</p><ul><li><strong>XCTCPUMetric</strong>, which helps us monitor how much of the device’s CPU our app is using. This can be really useful for identifying if our app is using too much CPU power, which could slow down the device or drain its battery quickly.</li><li><strong>XCTMemoryMetric</strong>, which helps us keep track of how much memory our app is using.
This is important because if our app uses too much memory, it could cause the device to slow down or even crash.</li><li><strong>XCTStorageMetric</strong> for monitoring disk storage usage.</li><li><strong>XCTNetworkTransferMetric</strong> for tracking network performance.</li><li><strong>XCTOSSignpostMetric</strong> for capturing specific points in our code where performance might be an issue.</li></ul><p>By using these different types of XCTMetrics, we can get a comprehensive view of our app’s performance and quickly identify any areas that need improvement.</p><h4>Getting Started with writing Performance Tests</h4><p>At this point, we had a basic understanding of XCTMetrics, and the next step was to develop performance tests to discover any regressions.</p><p>We opted to take an incremental approach, beginning with a proof of concept focused on two primary metrics of interest:</p><ol><li>Hitches (utilizing XCTOSSignpostMetric)</li><li>Memory Leaks (leveraging XCTMemoryMetric)</li></ol><p>We began by crafting individual tests for both Hitches and Memory Leaks. Below are example performance tests for each:</p><p><strong>Hitches</strong></p><p>The test evaluates the scrolling performance of our home page by using the pre-defined scrollingAndDecelerationMetric. The test simulates scrolling actions—twice upwards and twice downwards—and captures performance data for five iterations.
This process helps us assess the smoothness and responsiveness of the scrolling experience.</p><pre>func testHomeScrollPerformance() {<br>        measure(<br>            metrics: [XCTOSSignpostMetric.scrollingAndDecelerationMetric]<br>        ) {<br>            let app = XCUIApplication()<br>            app.swipeUp()<br>            app.swipeUp()<br>            stopMeasuring()<br>            app.swipeDown()<br>            app.swipeDown()<br>        }<br>    }</pre><p><strong>Leaks</strong></p><p>The test below identifies memory leaks that may occur during navigation between the home page and the detail page. Using the XCTMemoryMetric, the test simulates clicking on an item on the home page to navigate to the detail page. This process is repeated for five iterations to assess the stability of memory usage over multiple transitions.</p><pre>func testHomeToDetailNavigation() {<br>        measure(metrics: [XCTMemoryMetric(application: app)]) {<br>            clickTrayWidget(tray)<br>        }<br>    }</pre><h4>Performance Tests in Action</h4><p>After running the above tests, we were surprised to discover that none of the expected metrics were present in the results. Further investigation revealed that the performance tests yielded the expected fields only when executed on a physical device, rather than on simulators.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/476/1*t8RJOLQJsJCdxnJl1vyCJA.gif" /></figure><p>We gave it a go on an actual device and were thrilled to finally witness all the essential information we anticipated from the performance tests. 🎉🥳</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gtPQfQbVYgj3lUqslgwMPw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HXH64U0FtY6JMfDboQA3yA.jpeg" /></figure><p>At this point we were able to run the tests as we expected, and had the data we planned for.
<strong>However, even though we would find a problem, we still weren’t sure which change in our app would have caused it.</strong></p><h4>Introducing Performance Test Diagnostics</h4><p>At this stage, we learned from <a href="https://developer.apple.com/documentation/xcode-release-notes/xcode-13-release-notes#New-Features">Apple’s documentation</a> that starting with Xcode 13, using the enablePerformanceTestsDiagnostics=YES flag with the xcodebuild command would include additional attachments in the .xcresult file for each performance test. Intrigued by this feature, we decided to run our tests via the command line with this flag enabled.</p><h4>Running Tests via the command line</h4><p>To run tests via the command line, use the xcodebuild command with the appropriate parameters to specify the project, scheme, and destination. For example:</p><pre>xcodebuild test -scheme OurSchemeName -destination &#39;platform=iOS,name=iPhone 14&#39;</pre><p>After executing the tests with this flag, we observed significant differences. A memgraph was generated for our MemoryLeakTests using XCTMemoryMetric, and a .ktrace file was produced for our ScrollPerformanceTests with XCTOSSignpostMetric. These files provided detailed insights into memory usage and performance events, offering valuable data for further analysis.</p><h4>How did we proceed further?</h4><p>At this time, we had a few sample performance tests that could output .xcresult files with both metrics and attachments. However, the .xcresult files and attachments had to be downloaded manually, which required a substantial amount of time and work. We wanted the development process to be as simple as possible, with little effort required.</p><p>The objective was to run the performance tests and provide the results to the developer, along with any relevant attachments in case a regression was identified.</p><p>This is when we decided to build our own parser.
The parser was designed with three primary objectives in mind:</p><ol><li>Extract performance metrics from the xcresult file corresponding to various types of performance tests</li><li>Retrieve attachments from the xcresult file and store them at designated paths</li><li>Store the test results and performance metrics to create our own baselines</li></ol><h4>How did we create the parser?</h4><p>The parser uses <strong>xcresulttool</strong>, a command-line tool provided by Apple, to inspect the result bundles:</p><pre>xcrun xcresulttool get --format json --path &lt;xcresult-bundle-path&gt; --id &lt;value&gt;</pre><p>The output is a well-structured JSON model:</p><pre>{<br>  &quot;_type&quot;: {<br>    &quot;_name&quot;: &quot;ActionTestPlanRunSummaries&quot;<br>  },<br>  &quot;summaries&quot;: {<br>    &quot;_type&quot;: {<br>      &quot;_name&quot;: &quot;Array&quot;<br>    },<br>    &quot;_values&quot;: [<br>      {<br>        &quot;_type&quot;: {<br>          &quot;_name&quot;: &quot;ActionTestPlanRunSummary&quot;,<br>          &quot;_supertype&quot;: {<br>            &quot;_name&quot;: &quot;ActionAbstractTestSummary&quot;<br>          }<br>        },<br>        &quot;name&quot;: {<br>          &quot;_type&quot;: {<br>            &quot;_name&quot;: &quot;String&quot;<br>          },<br>          &quot;_value&quot;: &quot;Test Scheme Action&quot;<br>        },<br>        &quot;testableSummaries&quot;: {<br>          &quot;_type&quot;: {<br>            &quot;_name&quot;: &quot;Array&quot;<br>          },<br>          &quot;_values&quot;: []<br>        }<br>      }<br>    ]<br>  }<br>}</pre><p>In summary, the parser recursively invokes the command above, following the id of every nested object reference, to extract the relevant information from the xcresult file. 
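</p><p>To make that step concrete, here is a minimal Python sketch (our own illustration, not the production parser) of collapsing one such xcresulttool JSON response into plain values. A real parser would additionally re-run xcresulttool with the id of every nested object reference it encounters:</p>

```python
# Minimal sketch: collapse the "_type" / "_value" / "_values" wrappers of an
# xcresulttool JSON response into plain Python values.
def flatten(node):
    if not isinstance(node, dict):
        return node
    if "_value" in node:        # typed scalar leaf
        return node["_value"]
    if "_values" in node:       # typed array
        return [flatten(item) for item in node["_values"]]
    # plain object: drop the type descriptor, flatten every field
    return {key: flatten(value) for key, value in node.items() if key != "_type"}

# The sample model from above, abbreviated:
sample = {
    "_type": {"_name": "ActionTestPlanRunSummaries"},
    "summaries": {
        "_type": {"_name": "Array"},
        "_values": [
            {
                "_type": {"_name": "ActionTestPlanRunSummary"},
                "name": {"_type": {"_name": "String"}, "_value": "Test Scheme Action"},
            }
        ],
    },
}

print(flatten(sample))  # {'summaries': [{'name': 'Test Scheme Action'}]}
```

<p>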
Ultimately, the result is a complete parsing of the xcresult file into a JSON object.</p><p>The model details and their type information can also be obtained from xcresulttool, which can provide comprehensive information about the types, including their properties and associated data types.</p><h4>Export Attachments</h4><p>After successfully parsing all the data, the next step was exporting the attachments. This can be done with the following command:</p><pre>xcrun xcresulttool export --path &lt;xcresult-bundle-path&gt; --output-path &lt;output-path&gt; --id &lt;attachment-id&gt; --type file</pre><h4>Parser in Action</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2HlGL57y33LWRTgwV-dP-A.gif" /><figcaption>Our parser taking the xcresult file(s) and returning the performance data</figcaption></figure><p>In conclusion, implementing automated performance testing with the approaches mentioned above posed significant challenges, but the journey has been transformative. As we continue to refine and evolve our methodologies, we remain committed to delivering optimal performance and reliability to our users, ensuring they receive nothing short of exceptional quality from our applications.</p><p>If you’re kicked about working on problems like these, come join our team! We’re actively hiring for our iOS team. Please apply <a href="https://jobs.lever.co/jiostar?department=Software%20Engineering">here</a>.</p><hr><p><a href="https://blog.hotstar.com/preventing-performance-regressions-on-ios-apps-be3bd033f1dc">Preventing Performance Regressions on iOS Apps</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>