<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[JioHotstar - Medium]]></title>
        <description><![CDATA[Product and Engineering notes from JioHotstar, India&#39;s leading OTT service. Want to work with us? Head on over to https://www.jiostar.com - Medium]]></description>
        <link>https://blog.hotstar.com?source=rss----dbc3fcbc7f07---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>JioHotstar - Medium</title>
            <link>https://blog.hotstar.com?source=rss----dbc3fcbc7f07---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 11 Mar 2026 16:47:45 GMT</lastBuildDate>
        <atom:link href="https://blog.hotstar.com/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Orchestrating JioHotstar Traffic: The Difference Between a Loading Spinner and a Winning Six]]></title>
            <link>https://blog.hotstar.com/orchestrating-jiohotstar-traffic-the-difference-between-a-loading-spinner-and-a-winning-six-8a385b01380e?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/8a385b01380e</guid>
            <category><![CDATA[cdn]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[live-streaming]]></category>
            <category><![CDATA[qos]]></category>
            <category><![CDATA[last-mile-delivery]]></category>
            <dc:creator><![CDATA[Karan Kaul]]></dc:creator>
            <pubDate>Mon, 09 Mar 2026 09:41:53 GMT</pubDate>
            <atom:updated>2026-03-09T09:41:52.459Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/767/1*IxovPCvikcIuEEHqKwz8FQ.jpeg" /></figure><p>If you’ve ever streamed a high-stakes match on JioHotstar, the experience probably felt like a simple “tap and play”. It’s a seamless transition from the app icon to the stadium; a flick of a finger and you’re right there, cheering for your players to hit another six. Beneath that simple play button lies one of the most aggressive engineering challenges in the world.</p><p>At Hotstar, we navigate a “<a href="https://blog.hotstar.com/t-for-tsunami-dealing-with-traffic-spikes-c22443bcdd3e"><strong>Tsunami</strong></a>” of traffic across one of the most complex network landscapes in the world. During events like the IPL or a high-stakes ODI match, we don’t just manage millions of users - we manage millions of unique network realities. If you thought just placing a Content Delivery Network (CDN) makes the magic happen — read on.</p><h3>The Illusion of the Monolithic Network</h3><p>Traditionally, CDN traffic management happens at the <strong>network provider </strong>or <strong>state</strong> level. For years, this was enough. But at our scale, “good enough” is the enemy of the audacious. A live event requires upwards of 60–80 Tbps of network bandwidth for streaming. To put things into perspective, it’s like instantly grabbing the <strong>entire 4K movie collection</strong> from JioHotstar (hundreds of blockbusters). Now imagine finishing all those downloads in just one second… and then doing it again the <strong>next second and the next</strong>. That is the relentless pace of the tidal wave we face during a major event.</p><p>However, <strong>India’s network isn’t a monolith</strong>. It is a <strong>mosaic</strong> of fiber, 4G, 5G and fluctuating bandwidth that changes from one street to the next. 
A user on a 5G connection in South Delhi faces a vastly different network topology than a user on a local ISP in rural Rajasthan.</p><p>When you are only as good as the video you deliver, you realize that macro-level routing hits a ceiling. To provide the best <a href="https://en.wikipedia.org/wiki/Quality_of_service">Quality of Service</a> (QoS), we needed to treat every state, every city, and eventually every cohort as a unique routing decision.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NWpNsHiYvtc4oH3vqFuIBw.png" /></figure><h3>The Solution: The QoS Routing Manager</h3><p>To solve the “Last Mile” problem, we built a real-time observability and orchestration engine: the <strong>QoS Routing Manager</strong>.</p><p>The mission of this service is simple but massive: observe crucial video metrics at a granular level and adjust traffic weights dynamically to ensure every user is mapped to the best possible CDN for their specific location.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ia3iObTclEtPF3kL5rwbWQ.png" /></figure><h4>The Granular Cohort</h4><p>Instead of routing by “Maharashtra” or “Jio,” we segment users into <strong>Cohorts</strong>. A cohort is a group of people who share a common characteristic over a given period. Currently, a cohort for us is a specific combination of geographical, network, and business categorisation, i.e. <a href="https://www.cloudflare.com/learning/network-layer/what-is-an-autonomous-system/"><strong>ASN</strong></a><strong>-Country-State-City-UserType</strong>.</p><p>This allows us to detect if a specific provider is facing issues in a specific city, even if their national health looks perfect. 
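</p><p>As a toy illustration of the idea (the field names, ASN value, and weights below are hypothetical, not our production schema), a cohort can be modelled as a composite key into a per-cohort routing table:</p>

```python
from collections import defaultdict

# Hypothetical cohort key: ASN-Country-State-City-UserType.
# All identifiers and weights here are illustrative.
def cohort_key(asn, country, state, city, user_type):
    return (asn, country, state, city, user_type)

# Every cohort carries its own CDN weight vector, so one ASN/city
# pair can be rerouted without touching the rest of the country.
routing_table = defaultdict(lambda: {"cdn_a": 0.5, "cdn_b": 0.5})

delhi = cohort_key("AS-X", "IN", "DL", "New Delhi", "subscriber")
routing_table[delhi] = {"cdn_a": 0.8, "cdn_b": 0.2}  # steer this cohort only
```

<p>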
By slicing the data this way, we can bypass localized congestion before it affects the broader user base.</p><h4>The Power of The Scoring Logic</h4><p>The QoS Routing Manager pulls real-time performance data from our sophisticated in-house <a href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a> beast - called <strong>ARGUS</strong> - which serves as our eyes and ears when it comes to video performance across the platform. It collects heartbeat data from the devices, processes it, and provides the telemetry to our service.</p><p>We map every metric - <strong>Playback Failure Rate (PFR)</strong>, <strong>Rebuffering</strong>, and <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/"><strong>RTT</strong></a><strong> Latency </strong>- into normalized scores using a <a href="https://en.wikipedia.org/wiki/Piecewise_linear_function"><strong>Linear Piecewise Scoring</strong> function</a>. This allows us to define “Severity Buckets” (Ideal, Baseline, Sev3, Sev2, Sev1) based on direct business impact.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h0MVZwlj9tjN5kSZelKx8w.png" /></figure><p>To determine which CDN “wins” for a specific cohort, we calculate a <strong>Cumulative Health Score</strong> based on a specific precedence order:</p><blockquote>Cumulative Score = ( X * PFR{score} ) + ( Y * Rebuffer{score} ) + ( Z * RTT{score} )</blockquote><p>We weigh PFR most heavily to ensure that “<strong>reachability</strong>” is the absolute priority, followed closely by the “<strong>fluidity</strong>” of the stream (Rebuffering) and the “<strong>snappiness</strong>” of the connection (RTT).</p><h4>Filtering the Noise: The Power of EWMA</h4><p>Raw network telemetry is inherently noisy. 
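</p><p>A minimal sketch of the scoring described above (the knot positions and the X/Y/Z weights are illustrative, not our tuned production values):</p>

```python
import bisect

def piecewise_score(value, knots):
    """Linear piecewise score: knots are (metric_value, score) pairs
    sorted by metric value; values outside the range are clamped."""
    xs = [k[0] for k in knots]
    ys = [k[1] for k in knots]
    if value <= xs[0]:
        return ys[0]
    if value >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, value) - 1
    frac = (value - xs[i]) / (xs[i + 1] - xs[i])  # interpolate between knots
    return ys[i] + frac * (ys[i + 1] - ys[i])

# Hypothetical knots mapping PFR% to a 0-100 score (Ideal .. Sev1).
PFR_KNOTS = [(0.0, 100.0), (0.5, 90.0), (1.0, 70.0), (2.0, 40.0), (5.0, 0.0)]

def cumulative_score(pfr, rebuffer, rtt, weights=(0.5, 0.3, 0.2)):
    """Weighted sum with PFR weighted most heavily (X > Y > Z)."""
    x, y, z = weights
    return x * pfr + y * rebuffer + z * rtt
```

<p>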
A momentary 4G tower hand-off in Jammu or transient packet loss in Bangalore can look like a critical failure in a 10-second window.</p><p>If our routing engine reacted to every micro-spike, we would introduce dangerous volatility, creating a “jittery” experience where users are constantly bounced between CDNs. We needed a way to separate the true performance trend from the momentary noise.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/699/1*XBPEBQmQXYJ-2bwLdKfiDg.png" /><figcaption>CDN Scores for different cohorts — observe the intermittent drops and spikes</figcaption></figure><p>To achieve this, we don’t pass raw metrics directly into our scoring engine. Instead, we pass all incoming telemetry through an <a href="https://corporatefinanceinstitute.com/resources/career-map/sell-side/capital-markets/exponentially-weighted-moving-average-ewma/"><strong>Exponentially Weighted Moving Average</strong></a><strong> (EWMA)</strong> filter.</p><p>The formula we use is:</p><blockquote>EWMA{t} = α * x_t + ( 1 - α ) * EWMA{t-1}</blockquote><p><em>where </em>EWMA<em>{t} is the new smoothed value, x_t is the raw input score, and </em>EWMA<em>{t-1} is the previous smoothed history.</em></p><p>We tune our smoothing factor <strong>α</strong> to approximately <strong>0.6</strong>. In practical terms, this means recent metrics have a stronger influence on the final score while the <strong>last 5 scores or so still carry significant weight</strong>.</p><p>Only once the metrics are smoothed by EWMA do we move on to routing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Rpe7WzuMZdVqI4rA8XoPbA.png" /><figcaption>EWMA CDN Scores for the same cohorts — smoother spikes</figcaption></figure><h4>Two-Phase Capacity Steering</h4><p>A high score isn’t the only requirement for routing. We must balance “Customer Joy” with “Infrastructure Dynamics”. 
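</p><p>The EWMA filter described above can be sketched in a few lines (α = 0.6 as in the formula; the sample scores below are made up):</p>

```python
class EwmaFilter:
    """EWMA_t = alpha * x_t + (1 - alpha) * EWMA_{t-1}; with
    alpha = 0.6, history decays geometrically at (1 - alpha)^n."""
    def __init__(self, alpha=0.6):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x  # seed with the first raw sample
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value

# A transient dip in otherwise-healthy scores is damped toward 60
# rather than echoed verbatim as 40, and recovers within a few ticks.
f = EwmaFilter(alpha=0.6)
smoothed = [f.update(x) for x in [90, 90, 40, 90, 90]]
```

<p>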
For a media streaming entity, the biggest tradeoff is between quality and available network bandwidth.</p><blockquote>It’s like a bridge where every viewer wants to drive a wide luxury bus (<strong>High Quality</strong>), but the physical lanes (<strong>Bandwidth</strong>) are finite; during a peak surge, there simply isn’t enough pavement to let everyone drive a bus at once without the bridge failing, so you have to balance the vehicle size just to keep everyone moving.</blockquote><p>Our engine keeps track of the used bandwidth on the CDNs and employs two distinct strategies:</p><ul><li><strong>Pre-Threshold Guardrail (&lt; X% Utilization):</strong> We preemptively throttle CDNs that are on track to hit their capacity limits too early, even if they are performing well.</li><li><strong>Uniform Exhaustion (&gt;= X% Utilization):</strong> During a surge, we shift logic to ensure all CDNs exhaust their capacity at the same time, squeezing every possible megabit out of our infrastructure.</li></ul><h4>Safety Gates: Resilience Over Risk</h4><p>In a system of this scale, “no update” is better than a “bad update.” We built a defense-in-depth approach covering both data integrity and routing logic.</p><ol><li><strong>ASN Constraints (Hard Binding):</strong> Physics and contracts still matter. Some CDNs only have presence on specific networks. Before any scoring happens, the system applies hard constraints to ensure we never route a cohort to a CDN that physically cannot serve it.</li><li><strong>Volatility Dampening (Max Deviation):</strong> To prevent wild swings in traffic that could destabilize the network, we cap the maximum percentage change allowed in a single iteration (e.g., a CDN cannot gain or lose more than y% share in one minute).</li><li><strong>The “Warm-Up” Floor (Minimum Weight):</strong> We never let a functional CDN drop to 0% traffic. We enforce a configurable floor (typically 5%). 
This keeps CDN caches warm and DNS paths active, ensuring that if we need to fail back to them instantly during a crisis, they are ready to take the load immediately.</li></ol><h3>The Audacious Impact</h3><p>The results of moving to granular, cohort-based management have been significant. During a recent <strong>T20 Match</strong>, we ran an A/B rollout of the QoS Routing Manager with only ASN-State cohorts. For the treatment group:</p><ul><li><strong>Playback Failure Rate (PFR)</strong> improved by a staggering <strong>11%</strong>.</li><li><strong>Rebuffering</strong> and <strong>RTT Latency</strong> both saw a <strong>2%</strong> improvement.</li></ul><p>In our world of 50M+ concurrent users, an 11% improvement in PFR represents millions of users who stayed connected to the game instead of seeing a loading spinner. This is how we ensure that whether you are in a high-rise in Chennai or a village in Sikkim, you enjoy each and every six with best-in-class quality.</p><p>The work continues to build on the millions of data points streaming in, which allow us to steer all our customer sessions to a stable viewing experience!</p><p><em>Are you interested in solving high-concurrency challenges at the edge? 
Do check out </em><a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering"><em>open roles</em></a><em> if you want to build for millions of customers, who will use features that you build!</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8a385b01380e" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/orchestrating-jiohotstar-traffic-the-difference-between-a-loading-spinner-and-a-winning-six-8a385b01380e">Orchestrating JioHotstar Traffic: The Difference Between a Loading Spinner and a Winning Six</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Monetisation : Multi-period DASH x ExoPlayer]]></title>
            <link>https://blog.hotstar.com/monetisation-multi-period-dash-x-exoplayer-db8e8c00e521?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/db8e8c00e521</guid>
            <category><![CDATA[dash]]></category>
            <category><![CDATA[exoplayer]]></category>
            <category><![CDATA[streaming]]></category>
            <category><![CDATA[ads]]></category>
            <dc:creator><![CDATA[Abhishek Bansal]]></dc:creator>
            <pubDate>Wed, 04 Mar 2026 04:18:06 GMT</pubDate>
            <atom:updated>2026-03-04T04:18:04.447Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6OlEGvI87XY7LSE_hoNFiQ.png" /></figure><blockquote>Client-Side Ad Insertion (CSAI) levels up using multi-period DASH in the Android ecosystem. This is our journey to upgrade ExoPlayer to leverage multi-period DASH.</blockquote><h3>Introduction</h3><p><a href="https://ottverse.com/single-period-vs-multi-period-dash/">Multi-period DASH</a> is a variant of the <a href="https://en.wikipedia.org/wiki/Dynamic_Adaptive_Streaming_over_HTTP">DASH</a> format that offers significant advantages, such as the ability to insert dynamic content like disclaimers, dub cards, and ad breaks without re-encoding the entire video. This flexibility is crucial for delivering personalized and localized content to millions of users across diverse regions.</p><p>While the DASH standard natively supports multi-period manifests, the Android <strong>ExoPlayer</strong> ecosystem (specifically the AdsMediaSource component) was designed with a single-period assumption. This missing piece in the puzzle meant we couldn&#39;t support Client-Side Ad Insertion (CSAI) on these modern streams out of the box.</p><p>This post details our engineering journey: identifying the constraints, evaluating architectural alternatives, and ultimately redesigning ExoPlayer’s ad handling to support multi-period content seamlessly.</p><h3>Context</h3><h4>The Evolution of DASH Content</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MxXNu-p6Y4_x3VQqXe9Nlg.png" /></figure><p>JioHotstar relied on single-period DASH for entertainment content, where the entire stream including ads was encapsulated within a single period. 
This approach limited flexibility: inserting disclaimers, dub cards, and to some extent, dynamic ad breaks required re-encoding the entire video, which was both time-consuming and resource-intensive.</p><p>Consider a common localization scenario: the same title is launched across multiple regions with different primary languages. Audio dubs may be recorded by different artists per market, so dub-card credits vary by region; likewise, legal and regulatory requirements differ, driving region-specific disclaimers.</p><p>Under a single-period workflow, each of these variations would necessitate a distinct transcode of the full asset. If we target ~20 regions, that implies ~20 full transcodes, leading to long time-to-market, high cost, and significant operational complexity — super unscalable.</p><p>Multi-period DASH addresses these challenges by dividing content into multiple periods, each representing a distinct segment (e.g., disclaimer, main content, ad break). This modularity enables seamless insertion of additional content without re-encoding, significantly reducing operational overhead. In practice, we can attach region-specific disclaimers as a lightweight period and swap dub-card credits per locale without touching the main asset, keeping a single <a href="https://medium.com/freelance-filmmaker/intermediate-codec-using-mezzanine-video-formats-a28a53d3256e">mezzanine</a> and avoiding redundant full transcodes.</p><h4>The Challenge of Ad Monetization</h4><p>While multi-period DASH brought flexibility, it also introduced a critical challenge to our ad insertion mechanism for Video on Demand (VOD) content. ExoPlayer, our Android media player, lacked native support for client-side ad insertion in multi-period DASH streams.</p><h4>Root Cause</h4><p>ExoPlayer’s code restricted ad playback to single-period content. For ad-supported users, ExoPlayer uses AdsMediaSource for ad playback and tracking, and AdsMediaSource does not allow multi-period content with ads. 
It throws IllegalArgumentException if the content has more than one period.</p><p>The crash happens as soon as the AdsMediaSource object is created, irrespective of whether there are actual ads inserted in the stream or not.</p><p>Digging into the library code, we found explicit assertions guarding against complexity:</p><pre>// Logic inside AdsMediaSource / SinglePeriodAdTimeline <br>Assertions.checkState(periodCount == 1);</pre><p>An immediate thought here would be to upgrade to the latest <a href="https://github.com/androidx/media">AndroidX Media3</a> package; unfortunately, it had <a href="https://github.com/androidx/media/issues/1642">the same issue</a>.</p><p>Beyond just this assertion, the architecture of AdsLoader and AdPlaybackState was built around a &quot;Shared State&quot; model. In a multi-period timeline (e.g., <em>Period 0: Logo</em> -&gt; <em>Period 1: Movie</em>), ExoPlayer applied the <strong>same</strong> AdPlaybackState to every period.</p><p>Because of this shared state, we couldn’t just remove the assertion and move on with life. There were problems beyond it: a pre-roll scheduled at 0s would try to play at the start of <em>every</em> period (logo start, movie start, etc.), or a pre-roll would not play at ingress points like Continue Watching from the app.</p><h3><strong>Key Findings</strong></h3><p>ExoPlayer’s native multi-period DASH support has no client-side ad insertion. 
assert(periodCount == 1) assertions crash playback outright on multi-period streams (<a href="#">related discussion</a>).</p><p>Four problems followed from that root constraint:</p><p><strong>AdPlaybackState</strong> is designed for single-period timelines — applied to multi-period content, ads repeat across periods.</p><p><strong>Cue-point alignment</strong> breaks at period boundaries — ad breaks trigger early, late, or not at all.</p><p><strong>Preroll handling</strong> requires explicit edge-case logic: play once on cold start, suppress on re-entry from Continue Watching and equivalent ingress points.</p><p><strong>Backward compatibility</strong> forced a split serving strategy — single-period DASH for older clients, multi-period for newer ones.</p><p>Extending AdsMediaSource to handle this required architectural changes to ExoPlayer&#39;s ad handling layer. Rollout was phased, with playback failure rate, buffer times, and ad impressions as the primary watch metrics.</p><h4>Insights</h4><p>These findings underscored the need for a flexible, scalable approach to ad insertion in multi-period DASH content. 
Addressing ExoPlayer’s limitations and rethinking AdPlaybackState and cue-point handling laid the groundwork for a solution balancing technical feasibility and user experience.</p><h3>Methodology</h3><p>To address the challenges of enabling ads on multi-period DASH content in ExoPlayer, the team adopted a systematic and iterative approach:</p><ul><li>Audited ExoPlayer’s DASH manifest handling and ad timeline management to locate all single-period assumptions.</li><li>Prototyped assertion removal on multi-period manifests; used observed failures to surface edge cases early.</li><li>Built custom AdPlaybackState logic to split the main state into period-specific instances, scoping each ad to its designated period.</li><li>Implemented cue-point realignment relative to each period’s start time.</li><li>Replaced SinglePeriodAdTimeline with a custom MultiPeriodAdTimeline.</li><li>Gated multi-period DASH behind a version check; legacy clients continue on single-period.</li><li>Tested preroll/midroll, cross-period seeking, and platform variants (Android, Android TV, Fire TV).</li><li>Phased rollout starting with select content; monitored failure rates, buffer times, and ad impressions before expanding.</li></ul><h3>The Solution: A Custom MultiPeriodAdTimeline</h3><p>Two quick options were ruled out early — removing the assertion and upgrading to Media3 (covered in root cause).</p><p>Three turnkey approaches were evaluated:</p><ul><li>Separate players for ads and content</li><li>Client-side playlist stitching of single-period content and ad clips</li><li>ClippingMediaSource + ConcatenatingMediaSource to simulate a single timeline</li></ul><p>All three introduce buffering at player or content switches, lose AdsMediaSource capabilities (timeline management, seek handling), and scatter implementation complexity across multiple teams.</p><h4>Solution : MultiPeriodAdTimeline</h4><p>The chosen path: replace SinglePeriodAdTimeline with a custom 
MultiPeriodAdTimeline.</p><p><strong>First attempt</strong> — apply AdPlaybackState only to the content period; treat disclaimer and credits as ad-free.</p><p>Simple to implement. Two problems:</p><ul><li>Pre-roll plays after the disclaimer, not at stream start</li><li>No ads in the credits period — breaks standard behavior where the last cue-point fires on a direct seek to end</li></ul><p><strong>Second attempt</strong> — create a dedicated AdPlaybackState per period (logo, content, dub card), each including a pre-roll to handle Continue Watching entry points.</p><p>Works, but hardcodes period sequence and count on the client. Any structural change to the stream breaks compatibility. The streaming team loses the freedom to modify period structure independently.</p><h4>Breaking the Monolith</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Oh2XgcxQ1xAMxcmIRNkyCg.png" /><figcaption>Propagate Playback State to Each Period</figcaption></figure><p>We took a step back and tried to generalize the solution. The biggest challenge was the “Shared State,” so we now handle all periods equally: we still have a single main AdPlaybackState with cue-points as if we had single-period content.</p><p>Internally, it is duplicated and <strong><em>transformed specifically for each period</em></strong>. Any modifications happen on the main AdPlaybackState and are mirrored internally when it is updated.</p><p>We decided to keep the indexing of ad breaks in each period’s AdPlaybackState the same: every period carries all the cue-points at the same indices, but some are ignored (skipped). Each period applies these transformations to every cue-point:</p><ul><li>Subtract the start position of the period — times become relative to the period start — some cue-points may end up negative, but that is fine for us.</li><li>Mark cue-points after the period end as skipped. These will be played in the following periods.</li></ul><p>That is it. 
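</p><p>A sketch of those two rules in Python (ExoPlayer is Java; names and types here are simplified stand-ins, with a string marker in place of AdPlaybackState.withSkippedAdGroup()):</p>

```python
SKIPPED = "skipped"  # stand-in for marking an ad group as skipped

def transform_for_period(cue_points_us, period_start_us, period_end_us):
    """Re-base the single main list of cue-points for one period.
    Ad-group indices stay constant across periods; out-of-range
    groups are marked skipped instead of being removed."""
    out = []
    for cue in cue_points_us:
        local = cue - period_start_us       # rule 1: relative to period start
        if cue >= period_end_us:            # rule 2: belongs to a later period
            out.append((local, SKIPPED))
        else:
            out.append((local, "active"))   # may be negative: already elapsed
    return out

# Toy timeline: logo period [0s, 5s), movie period [5s, 65s);
# cue-points at 0s (pre-roll) and 35s (mid-roll), in microseconds.
cues = [0, 35_000_000]
logo = transform_for_period(cues, 0, 5_000_000)
movie = transform_for_period(cues, 5_000_000, 65_000_000)
```

<p>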
A lot of investigation and iterations crystallized into a few simple rules. After the transformation, the cue-point positions are as intended and all ad breaks trigger correctly, even across period boundaries. From the user’s perspective, there is no difference between multi-period and single-period content. In the end, <strong><em>single-period is now just a special case of multi-period content.</em></strong></p><h4>The Indexing Problem</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Oxn9qfqQbVsqukhztY4yOw.png" /><figcaption>Constant Break indices in Each Period</figcaption></figure><p>ExoPlayer identifies ad groups by <strong>index</strong>. If we simply removed “non-relevant” ad groups from a period’s state, the indices would shift, causing the player to play the wrong ad or crash.</p><p><strong>The Fix:</strong> We kept the Ad Group count constant across all periods.</p><ul><li>If Ad Group #1 belongs to Period 1, but we are currently configuring Period 0, we mark Ad Group #1 as SKIPPED in Period 0&#39;s state using AdPlaybackState.withSkippedAdGroup().</li><li>This ensures adGroupIndex 1 always refers to the same logical ad break, regardless of which period is currently active.</li></ul><h4>Handling Continuity</h4><p>We had to ensure that seeking across period boundaries didn’t re-trigger ads. 
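</p><p>A hedged sketch of that bookkeeping, assuming a bare global set of played break indices (the real state lives inside ExoPlayer’s ad playback state, not a set):</p>

```python
# Hypothetical global bookmark of played ad breaks, keyed by the
# ad-group index that stays constant across all periods.
played_breaks = set()

def on_break_finished(ad_group_index):
    played_breaks.add(ad_group_index)

def should_play(ad_group_index):
    # Seeking back across a period boundary must not replay a break.
    return ad_group_index not in played_breaks

on_break_finished(1)  # viewer watched Ad Break #1 in Period 1
```

<p>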
Using the global bookmarking map, if a user watches an ad in Period 1 and seeks back to Period 0, the player knows that logical “Ad Break #X” has already been played.</p><h4>Backward Compatibility</h4><p>We served multi-period DASH only to newer app versions; legacy versions continued with single-period DASH.</p><h3>Summary</h3><p>The implementation of ads on multi-period DASH content in ExoPlayer yielded the following outcomes:</p><h4>Technical Achievements</h4><ul><li><strong>Seamless Ad Playback</strong>: Ads are now played at the correct times, without repetition, even across period boundaries.</li><li><strong>Accurate Cue-point Handling</strong>: Dynamic adjustment of cue points ensures ads trigger precisely as intended.</li><li><strong>Robust Backward Compatibility</strong>: Users on older app versions experience no disruption, as they continue to receive single-period DASH.</li><li><strong>Performance Metrics</strong>: Key metrics such as start lag, buffering, ad impressions, and playback failure rates remained within acceptable thresholds throughout the rollout.</li></ul><h4>User Experience</h4><ul><li><strong>Dynamic Content Delivery</strong>: The platform can now insert disclaimers, dub cards, and localized content dynamically, enhancing personalization.</li><li><strong>Uninterrupted Viewing</strong>: Users experience smooth transitions between content and ads, with no playback disruptions.</li></ul><h4>Business Impact</h4><ul><li><strong>Uninterrupted Monetization</strong>: No impact on monetization as we modernized our media stack.</li><li><strong>Operational Efficiency</strong>: Reduced need for re-encoding and streamlined content pipelines have lowered operational overhead.</li><li><strong>Cost Savings: </strong>Re-encoding a large media library like JioHotstar’s would have been cost-prohibitive. 
With this solution, no separate encoding is needed for existing content.</li></ul><h3>Conclusion</h3><p>“Simple” features like adding a 5-second logo or disclaimer often hide iceberg-sized engineering challenges. By diving deep into the internals of ExoPlayer and rethinking how AdPlaybackStates are managed, we turned a hard constraint into a flexible capability.</p><p>If you are facing a similar issue, we have proposed these changes for merging upstream in the <a href="https://github.com/androidx/media/pull/2501">Media3 GitHub repo</a>.</p><p>This project reinforced a key lesson for us: sometimes the best way to move forward isn’t to work <em>around</em> the platform (multi-player), but to improve the platform itself!</p><p><em>Want to dig into player internals and contribute back to projects like ExoPlayer? Do check out </em><a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering"><em>open roles</em></a><em> if you want to build for millions of customers, who will use features that you build!</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=db8e8c00e521" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/monetisation-multi-period-dash-x-exoplayer-db8e8c00e521">Monetisation : Multi-period DASH x ExoPlayer</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Gen AI Video — Building Scalable Validation Framework]]></title>
            <link>https://blog.hotstar.com/building-scalable-validation-framework-for-video-generation-6c67d1177ce2?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/6c67d1177ce2</guid>
            <category><![CDATA[ai-video-generation]]></category>
            <category><![CDATA[generative-ai-use-cases]]></category>
            <category><![CDATA[ai-validation]]></category>
            <dc:creator><![CDATA[Sagar Tekwani]]></dc:creator>
            <pubDate>Wed, 18 Feb 2026 03:42:23 GMT</pubDate>
            <atom:updated>2026-02-18T03:42:22.487Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>Gen AI Video — Building A Scalable Validation Framework</strong></h3><blockquote>We discuss building strong eval frameworks as part of our Generative AI studio to ensure that generated video maintains a high quality bar.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hOF71rV-cfi6sJwXo9tV8A.png" /></figure><h3>Introduction</h3><p>As JioHotstar scales to serve one of the world’s largest streaming audiences, generative AI (GenAI) offers a powerful opportunity to accelerate content creation, adapt stories across languages and regions, and unlock creative workflows that traditional pipelines cannot match.</p><p>Customers evaluate AI-generated video with the same standards they apply to premium productions; any drift in character identity, structural deformity, abrupt scene shift, or unsafe visual instantly undermines realism.</p><p>Generative models, being probabilistic, introduce such inconsistencies naturally, and at our scale even rare defects accumulate into meaningful quality gaps. Ensuring stable characters, coherent locations, and safe content therefore becomes a scientific challenge central to customer acceptance.</p><p>To address this, we built a <strong>Validation Framework</strong> that operates as a first-class component of the generation pipeline. This closed-loop system enforces quality, safety, and consistency at the same cadence as creation, enabling generative video to meet production-grade expectations on the JioHotstar platform.</p><h3>System Overview: The Validation Layer</h3><p>The video generation process starts with editorial scripts, from which the system extracts characters, locations, accessories, and context. These entities drive keyframe generation, which expands into video clips and assembles into the final video.</p><p>The <strong>Validation Framework</strong> (Fig. 1) spans all stages of this process and operates synchronously within the workflow. 
It evaluates intermediate outputs, flags issues early, and triggers targeted regeneration or parameter adjustments to maintain quality. Brand detection and safety checks act as hard gates, while other modules guide regeneration to preserve consistency and visual integrity.</p><p>Each module governs a specific quality dimension (character consistency, deformity, scene continuity, brand safety, content appropriateness, or story coherence) and emits go/no-go signals that control progression. The system logs all validation outcomes and samples them for periodic human review to support ongoing calibration.</p><p>Together, these modules form the framework’s <strong>control surface</strong>, defining the quality of generative video. Most components operate at production readiness, while long-range story and concept continuity remains an active area of experimentation as we continue refining metrics and validation strategies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PiPJW32Mpdhx1ntQK7Ho3Q.png" /><figcaption><em>Fig.1: Overview of Developed Validation Framework</em></figcaption></figure><h3>Character Consistency</h3><p>Character consistency ensures that a character’s <strong>visual identity</strong> remains stable across all frames and scenes. Generative models, being stochastic, can drift subtly in facial features and proportions. These deviations break temporal realism and make the sequence unusable for production.</p><p>To quantify consistency, we represent each character through frame-level embeddings and compare them against a reference “hero character” image approved by the creative team. This hero image serves as the canonical visual anchor for that character.</p><p>We use an ensemble of independent facial similarity models (e.g., <em>Buffalo-L</em>, <em>Antelope-v2</em>, <em>FaceNet, etc.</em>), each fine-tuned for <strong>intra-character identity matching</strong>. 
Rather than representing a face with a single global embedding, these models extract <strong>multiple localised facial feature vectors</strong> corresponding to stable semantic regions of the face (such as eyes, nose bridge, jawline, and facial contours).</p><p>Sampling multiple localized descriptors improves robustness to pose changes, partial occlusion, lighting variation, and expression drift, failure modes that are common in video generation but underrepresented in still-image similarity tasks. Each character <em>k</em> is represented by a set of reference embeddings f_hero^(k).</p><p>For a given frame <em>i</em>, each model <em>m</em> produces an embedding f_i^(k,m). We compute similarity using <strong>cosine similarity</strong>, which measures angular alignment in embedding space and remains invariant to feature magnitude.</p><p>This property is critical because generative models can alter contrast, illumination, and texture intensity without changing identity. <strong>Distance-based metrics</strong> (Euclidean or Manhattan) are sensitive to these magnitude shifts and empirically produce unstable thresholds across frames.</p><p>We also experimented with <strong>Jaccard similarity</strong> on facial embeddings but observed weaker alignment with <strong>human-in-loop evaluations</strong>. 
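</p><p>The magnitude-invariance argument can be checked directly. The sketch below (toy vectors, not our production embeddings) scales an embedding uniformly, as a contrast or illumination shift might, and shows that cosine similarity is unchanged while Euclidean distance is not:</p>

```python
import math

def cosine(a, b):
    # Angular alignment: invariant to uniform scaling of either vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

def euclidean(a, b):
    # Sensitive to magnitude shifts even when direction is identical.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

hero = [0.2, 0.7, 0.1]
frame = [2 * x for x in hero]   # same identity direction, doubled magnitude

print(cosine(hero, frame))      # ~1.0: identity judged stable
print(euclidean(hero, frame))   # > 0: a distance metric penalizes the shift
```

<p>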
The similarity between the generated frame for a character <em>k</em> and its corresponding hero image for a model <em>m</em>, <em>S_i^(k,m)</em>, is computed as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/554/1*1rmvsE3Qb1CTcZs-OCD6pg.png" /></figure><p>Each model <em>m</em> has a threshold τ_m calibrated on <strong>human-labeled data.</strong> The binary decision per model <em>m</em> for each character <em>k</em> is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/626/1*tQvFqPVlDG9_mbzpTqL_vA.png" /></figure><p>The final consistency decision <em>C</em> uses an <strong>ensemble aggregation</strong> across all models (N) and characters, tuned for <strong>recall maximization</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/554/1*nPG9dNM59dlSQ1JJY5Z90g.png" /></figure><p>where w_m represents each model’s reliability weight and defines the ensemble’s sensitivity.</p><p>This design ensures that every potential inconsistency is captured, even if one model fails, prioritizing recall over precision. Minor false positives are filtered by enforcing consistency over short frame windows rather than acting on isolated detections. Thresholds τ_m and weights w_m are periodically re-calibrated using <strong>human-in-loop feedback</strong>. Annotators review flagged segments, refine labels, and feed corrections back into the validation loop. Thus, by fusing diverse fine-tuned models and anchoring every comparison to the hero reference, the system maintains stable character identity across the generation pipeline, regardless of lighting or scene transition.</p><p>For side-profile consistency, which often reveals identity drift missed in frontal views, we generate strict left and right 90-degree profile references from the approved hero image, excluding frontal or angled views. We store the <strong>front, left, and right profiles</strong> as canonical anchors and compare generated frames against them to validate identity across viewing angles. 
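</p><p>Putting the pieces together, the per-model thresholding, weighted recall-first aggregation, and multi-pose anchors can be sketched roughly as follows. Model names, thresholds, weights, and the sensitivity cutoff here are illustrative placeholders, not our calibrated production values:</p>

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# Per-model calibrated thresholds and reliability weights (illustrative).
MODELS = {
    "buffalo_l":   {"threshold": 0.62, "weight": 0.40},
    "antelope_v2": {"threshold": 0.58, "weight": 0.35},
    "facenet":     {"threshold": 0.55, "weight": 0.25},
}

def frame_is_consistent(frame_embs, anchor_embs, sensitivity=0.5):
    """frame_embs: model name -> embedding of the generated frame.
    anchor_embs: model name -> {"front"/"left"/"right" -> hero embedding}.
    A model votes 'consistent' if the frame matches ANY pose anchor;
    the weighted vote across models must reach the sensitivity cutoff."""
    score = 0.0
    for name, cfg in MODELS.items():
        best = max(cosine(frame_embs[name], anchor_embs[name][pose])
                   for pose in ("front", "left", "right"))
        if best >= cfg["threshold"]:
            score += cfg["weight"]
    return score >= sensitivity
```

<p>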
This check surfaces profile-specific inconsistencies early and triggers regeneration when needed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T8jH7gdKIhEtF8YkVHimYA.png" /><figcaption><em>Fig.2: Working of Character Consistency Framework</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2PeHKnBwtismAH-63JfJXA.png" /><figcaption><em>Table-1: Comparison of Hero Image with different generations</em></figcaption></figure><h3>Character Deformity Detection</h3><p>Character deformities break visual realism immediately. Generative models can produce warped limbs, misaligned joints, or anatomically implausible body structures when spatial constraints fail during sampling. To detect these failures reliably, we built a deformity-classification pipeline grounded in curated abnormality data and a trained YOLO-family detector.</p><p>We first evaluated general-purpose multimodal models such as <strong>Gemini</strong> and <strong>Qwen-VL</strong> on deformity detection. These models achieved only <strong>≈40% recall</strong> on our internal human-annotated deformity dataset, and they consistently failed to detect subtle or multi-region structural distortions. <strong>This baseline confirmed that deformity detection requires a dedicated model trained on explicit abnormality signals. 
</strong>To detect anatomical distortions reliably, we built a deformity-detection module optimized for high recall and early intervention during video generation.</p><ul><li>Used Tencent’s Distortion dataset (<em>Predicting Distortion in Real-World Human Images</em>) as the base corpus and curated it using human-in-loop review to improve label reliability.</li><li>Applied a segmentation-guided cleaning pipeline to remove annotations outside the human region, discard samples whose segmentation confidence fell below a threshold p_seg, and filter out deformity regions smaller than a minimum area threshold A_min; segmentation masks also provided body-part bounding boxes.</li><li>Trained a YOLO-family detector on the curated dataset to localize and classify deformities across full-body and body-part crops, explicitly optimizing for high recall, achieving a ~<strong>35% recall lift</strong> over Gemini and Qwen on the same human-annotated evaluation set.</li><li>Integrated the detector across all generation stages to flag anatomical failures early and trigger regeneration or parameter adjustments before outputs propagate downstream.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ixDPPIZNDvQCuWGOQxPkpQ.png" /><figcaption><em>Fig.3: Deformity Detection Framework for limb abnormalities</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FCJkhR6QAUNjqrXaAKgXYQ.png" /><figcaption><em>Fig.4: Deformity Detection Framework for facial abnormalities</em></figcaption></figure><h3>Location Consistency</h3><p>Location consistency ensures that generated scenes remain faithful to the <strong>scripted description</strong> and stable across <strong>multiple scenes within the same location</strong>. 
This is extremely important since drift in layout, lighting, spatial structure, or persistent objects breaks continuity and degrades perceived quality.</p><p><strong>Consistency with scripted location and object constraints: </strong>During script parsing, LLMs extract structured location descriptions that capture spatial layout, environmental cues, lighting intent, and object-level constraints. These descriptions define both the expected set of objects and a subset of <strong>mandatory elements</strong> whose presence must be preserved.</p><p>We evaluate <strong>object presence</strong> using vision language models (VLMs) and flag frames in which required objects are missing. In parallel, we assess overall scene fidelity by aligning text embeddings derived from the location description with visual scene embeddings extracted from generated keyframes using our VLM stack (e.g., Gemini, Qwen).</p><p>We calibrate <strong>similarity thresholds</strong> through human-in-loop (HIL) evaluation, selecting cutoffs that best correlate with human judgments of scene correctness. Frames that fall below the calibrated threshold indicate semantic or structural violations and trigger regeneration or parameter adjustment. Figure 5 illustrates how alignment scores reflect adherence to scripted location and object constraints.</p><blockquote>Scene Visualization: Interior. Police interrogation room. Dim overhead lighting. Metal table bolted to the floor. Two chairs facing each other. One-way mirror. 
No windows.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xiUH_ZdnlC7nIAqN-q3dVw.png" /><figcaption><em>Fig.5: Location Consistency with Script&lt;&gt;Object alignment</em></figcaption></figure><p><strong>Consistency across scenes within the same location: </strong>For locations that recur across multiple scenes, we treat the validated keyframe achieving the highest text–image alignment score as the <strong>location anchor</strong>. Using our vision–language model (VLM) stack, we extract scene-level embeddings from subsequent frames and compare them against this anchor via cosine similarity to detect structural and stylistic drift. This formulation allows us to enforce consistency in spatial layout, lighting characteristics, and persistent background elements when the same room, street, or set reappears at different points in the video. <strong>Figure 6</strong> illustrates this anchor-based consistency check.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*x0U1MUP-25NjmyVZsko32Q.png" /><figcaption><em>Fig.6: Location Consistency across scenes</em></figcaption></figure><p>LLMs provide strong priors when generating detailed location descriptions, but they do not yet reliably detect subtle spatial or geometric inconsistencies in generated frames. We are therefore exploring additional approaches, including embedding-based scene classifiers, layout-consistency models, and structure-aware validators, to strengthen this module. These efforts aim to convert location consistency into a fully measurable and enforceable dimension within the validation framework.</p><h3>Brand Logo Detection/Safety Checks</h3><p>Brand-logo/Safety violations act as <strong>hard safety gates</strong> in the generation pipeline. The system blocks any output containing such elements and triggers regeneration until the output passes all safety checks. 
The validation loop operates as follows: the detector scans each generated unit and flags any violation; the system regenerates the content with adjusted constraints, and the detector re-evaluates the regenerated output. If repeated attempts fail, the system escalates the case for manual review. This loop ensures no flagged safety issue propagates downstream.</p><p>We enforce these safety dimensions through <strong>prompt-level controls</strong> and <strong>automated detection</strong>. During script parsing, LLMs generate structured content descriptions, including required or disallowed visual categories. These descriptions guide the image-generation model to avoid branded items and unsafe content.</p><p>We curated datasets for both tasks: brand/logo samples across categories such as laptops, consumer electronics, and apparel, and safety violations spanning violence, child abuse, racism, hate symbols, and other violation types.</p><p>Using these datasets, we optimized prompts and detection thresholds for <strong>high recall</strong>, achieving <strong>&gt;95% recall</strong> on internal evaluations. When the detector identifies a brand or NSFW instance, the system regenerates the output with stricter constraints to remove the violation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XZZnbMrU2UUCdExv6t7bWw.png" /><figcaption><em>Fig.7: Brand Logo Detection across categories</em></figcaption></figure><h4>Engineering Controls to Prevent Recurrence of Violations</h4><p><strong>Context-Aware Decoding:<br></strong> We add structured negative constraints that suppress brand names, logos, or unsafe categories during generation. 
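</p><p>A minimal sketch of this gate-and-regenerate loop, with constraints tightened on each retry; here generate, detect_violations, and escalate are hypothetical stand-ins for the image-generation call, the brand/safety detectors, and manual review, not our production APIs:</p>

```python
def validate_with_regeneration(prompt, generate, detect_violations,
                               escalate, max_attempts=3):
    # Hard safety gate: nothing flagged is allowed to propagate downstream.
    negative = []                          # structured negative constraints
    for _ in range(max_attempts):
        output = generate(prompt, negative=negative)
        violations = detect_violations(output)
        if not violations:
            return output                  # passed all hard gates
        # Tighten constraints with the detected categories and regenerate.
        negative.extend(v for v in violations if v not in negative)
    escalate(prompt, negative)             # repeated failures -> manual review
    return None
```

<p>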
These constraints adjust the decoding trajectory of the image-generation model and reduce the probability of producing forbidden visual elements.</p><p><strong>Adaptive Prompt Rewriting:<br></strong> When a violation is detected, the system rewrites the prompt by tightening constraints, clarifying allowed content, and removing ambiguous phrasing. These rewritten prompts condition the regeneration step and help eliminate repeated violations across attempts.</p><h3>Summary and Future Directions</h3><p>This work presents a <strong>validation framework for generative video</strong> that operates as a first-class component of the generation pipeline. By integrating validation directly into the workflow, the system detects and corrects character inconsistency, anatomical deformities, location drift, and safety violations during generation.</p><p>The framework applies recall-first validation, multi-model similarity checks, vision–language alignment, and thresholds calibrated through human feedback to convert qualitative notions of visual quality into enforceable signals. This design prevents narrative-breaking errors from propagating while allowing controlled creative variation.</p><p>Next, we will formalize a unified <strong>evaluation and metrics layer</strong> that measures video, audio, lip-sync, and temporal consistency and supports systematic optimization.</p><p>We will also extend the framework to address <strong>long-range scene and concept continuity</strong>, enforcing coherence across scenes, episodes, and story arcs. These extensions will complete the quality stack required to deploy generative video systems at production scale.</p><p><em>Want to be at the forefront of generative video in India? 
Do check out </em><a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering"><em>open roles</em></a><em> if you want to build for millions of customers who will use the features you build!</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6c67d1177ce2" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/building-scalable-validation-framework-for-video-generation-6c67d1177ce2">Gen AI Video — Building Scalable Validation Framework</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Modernizing Dependency Management: Beyond CocoaPods (Part 2 — The Execution)]]></title>
            <link>https://blog.hotstar.com/modernizing-dependency-management-beyond-cocoapods-part-2-the-execution-bf3b9d739efb?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/bf3b9d739efb</guid>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[mobile-development]]></category>
            <category><![CDATA[ios-development]]></category>
            <category><![CDATA[swift-package-manager]]></category>
            <category><![CDATA[dependency-management]]></category>
            <dc:creator><![CDATA[Saurabh Kapoor]]></dc:creator>
            <pubDate>Fri, 13 Feb 2026 04:26:22 GMT</pubDate>
            <atom:updated>2026-02-13T04:26:20.830Z</atom:updated>
            <content:encoded><![CDATA[<h3>Modernizing Dependency Management: Beyond CocoaPods (Part 2 — The Execution)</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1022/1*uo1p1ZKkrCqxXMBsbm6uxw.jpeg" /></figure><h3>Recap</h3><p>In <a href="https://blog.hotstar.com/dependency-management-our-journey-beyond-cocoapods-part-1-the-strategy-c3c7874566a9">Part 1</a>, we shared the strategy — why we had to move away from CocoaPods, how we evaluated our options, and the phased migration plan we designed to modernize 60 pods over 2 quarters without disrupting feature development.</p><p>In this part, we bring it all home and share our learnings and scars.</p><h3>Phase 1: Proving That SPM Could Coexist</h3><p>Big bang approaches for a product that supports over a billion Indians are not on the menu. We could not stop the train; we had to upgrade while running at full speed.</p><blockquote><em>Swift Package Manager (SPM) had to live alongside CocoaPods without disrupting day-to-day development.</em></blockquote><p>This was not negotiable. It was also about building confidence gradually.</p><p>Our dependency graph was already complex, and CocoaPods was deeply embedded in how the app was built, tested, and shipped. Introducing a second dependency manager into that ecosystem was risky. If coexistence failed, everything downstream would be compromised.</p><p>So we treated Phase 1 as a controlled experiment.</p><h4>Defining Success</h4><p>We were explicit about what success looked like:</p><ul><li>Developers should not need to change their workflows</li><li>CI pipelines must remain stable</li><li>No runtime regressions</li><li>CocoaPods must remain fully functional</li></ul><h4>Choosing the Right First Dependencies</h4><p>We intentionally avoided internal SDKs in this phase. 
Instead, we chose third-party libraries that already had mature SPM support and met three criteria: widely used in the app, minimal customization, and no deep runtime coupling with internal frameworks or development pods.</p><p>This allowed us to isolate SPM behavior without risking business-critical flows. By migrating only a handful of carefully selected dependencies, we could observe real-world behavior without destabilizing the system.</p><h4>Invisible to Developers</h4><p>One of the most important constraints we enforced was developer invisibility.</p><p>Developers continued to open the same workspace, build using the same schemes, run tests the same way, and rely on the same CI signals. There were no new scripts to run, no new commands to remember, no changes to onboarding docs.</p><p>SPM dependencies were resolved automatically by Xcode in the background. If someone hadn’t been told we were testing SPM, they wouldn’t have noticed.</p><h3>Phase 2: Internal Pods and Binary Distribution</h3><p>Phase 1 inspired confidence, so we ramped up and doubled down. Internal pods were where things got interesting.</p><p>These weren’t isolated third-party libraries. They were shared across multiple apps, some across Android, and were actively developed. Moving them required more than a format change — it required rethinking how we distribute internal code.</p><h4>The Promise of Binary Targets</h4><p>Moving our internal pods to SPM binary targets felt like a natural evolution. 
Using .xcframework with .binaryTarget allowed us to distribute prebuilt artifacts instead of rebuilding large internal modules every time.</p><pre>.binaryTarget(<br>    name: &quot;CoreSDK&quot;,<br>    url: &quot;https://example.com/CoreSDK.xcframework.zip&quot;,<br>    checksum: &quot;...&quot;<br>)</pre><p>This allowed us to preserve encapsulation, reduce build times, and decouple SDK evolution from app builds.</p><p>But this phase also exposed one of our biggest challenges.</p><h4>The Problem: Binary Targets and Private Repositories</h4><p>Very quickly, we ran into a problem that wasn’t obvious from the documentation.</p><p>Swift Package Manager assumes that binary artifacts are publicly accessible. When SPM encounters a binaryTarget(url:), it attempts to download the zip file, verifies the checksum, and caches the artifact locally. What it does <em>not</em> do is authenticate.</p><p>That assumption works fine for open-source packages hosted publicly — but completely breaks down for private internal SDKs.</p><p>We hosted our .xcframework.zip files as GitHub release assets inside private repos. Everything looked correct: URL was valid, checksum matched, artifact was present. Yet builds consistently failed:</p><pre>Failed to download binary artifact<br>The requested URL returned error: 404</pre><p>The file existed. The problem was subtle but critical: SPM was making unauthenticated HTTP requests to private URLs. 
GitHub correctly responded with 404, not 401, masking the real issue.</p><h4>Why This Was a Big Deal</h4><p>This wasn’t just an inconvenience — it had architectural implications:</p><ul><li>Developers couldn’t resolve packages locally</li><li>CI pipelines failed deterministically</li><li>Binary targets became unusable for internal SDKs</li><li>The entire binary-distribution strategy was at risk</li></ul><p>At this point, we had to pause and ask: <strong><em>Can SPM actually work for private, enterprise-scale binary distribution?</em></strong></p><h4>Exploring Workarounds</h4><p>We explored multiple approaches, each with trade-offs:</p><p><strong>Making repositories public</strong> — Immediately ruled out. Internal SDKs contain proprietary logic.</p><p><strong>Embedding tokens in URLs</strong> — Technically possible, but unacceptable. Security risk, tokens leak via logs, impossible to rotate safely.</p><p><strong>Git LFS</strong> — Workable, but introduced large repository sizes, slower clones, and additional tooling overhead. Didn’t scale well for frequent SDK releases.</p><p><strong>Artifact repositories (S3/Nexus)</strong> — Viable, but required additional infrastructure, credential management, and URL signing logic.</p><p>We wanted something simpler for GitHub-hosted binaries.</p><h4>The Solve: .netrc Authentication</h4><p>The solution came from understanding how SPM downloads binaries.</p><p>SPM relies on standard system networking under the hood. That means it respects .netrc credentials, just like curl or git. 
By configuring authentication at the system level, we could allow SPM to fetch private binaries without changing a single line of Package.swift.</p><pre># ~/.netrc<br>machine github.com<br>login GITHUB_USERNAME<br>password GITHUB_PERSONAL_ACCESS_TOKEN</pre><p>Once this file was present, SPM successfully authenticated, binary artifacts downloaded correctly, checksums validated as expected, and builds became deterministic again.</p><p>Most importantly, this worked identically on developer machines and CI runners.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p9_PIC6RcQuH075qlgLOhA.png" /></figure><h4>Hardening: Making It CI-Friendly</h4><p>On CI, we injected the .netrc file securely at runtime using secrets:</p><pre>echo &quot;machine github.com login $GITHUB_USER password $GITHUB_TOKEN&quot; &gt;&gt; ~/.netrc<br>chmod 600 ~/.netrc</pre><p>This gave us no credentials in source control, easy token rotation, and clear audit boundaries. It also aligned well with GitHub Actions and self-hosted runners.</p><h4>What This Unlocked</h4><p>Solving this problem unlocked the full potential of SPM binary targets:</p><ul><li>Internal SDKs could be versioned independently</li><li>App builds became significantly faster</li><li>SDK releases became predictable artifacts</li><li>Teams consumed binaries without worrying about source-level coupling</li></ul><p>Onwards!</p><h3>Phase 3: Development Pods → Swift Packages</h3><p>Development pods were not just dependencies — they were living parts of the app, evolving alongside features, touched daily by multiple teams. 
They were also deeply intertwined with how our codebase was structured.</p><h4>The Challenge: Living Code, Not Just Dependencies</h4><p>Our development pods served a very specific purpose: they allowed teams to iterate on shared modules without releasing binaries, supported rapid local changes and debugging, and encoded architectural boundaries inside the Podfile.</p><p>Over time, they became like an extension of the app — not just dependencies.</p><p>Replacing them with Swift packages meant answering a hard question: <em>Can we retain the same developer experience without CocoaPods doing the heavy lifting for us?</em></p><h4>Preserving the Existing Structure</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9eV0SZYqS6nqBfdMlUxT-w.png" /></figure><p>One of our biggest concerns was accidentally reshaping the codebase. We did not want to flatten modules, merge responsibilities, or introduce artificial package boundaries.</p><p>Instead, we followed a strict rule: <strong>every development pod becomes a Swift package with the same conceptual boundaries.</strong></p><p>That meant one pod became one package, with the same folder structure, same ownership, and same responsibility. This discipline paid off later when debugging regressions and onboarding developers.</p><h4>Language Boundaries: Objective-C and Swift</h4><p>Swift Package Manager supports Objective-C, but it does not allow multiple languages within the same target. A single target cannot contain both Swift and Objective-C sources.</p><p>Several of our development pods relied on exactly that — Swift and Objective-C files coexisting within the same logical module, with bridging handled implicitly by the build system. Under SPM, this was no longer possible.</p><p>To move forward, we restructured these modules intentionally. Objective-C code was moved into dedicated package targets with explicitly defined public headers. 
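</p><p>In package-manifest terms, the split looks roughly like this; target names and paths are illustrative, not our actual modules:</p><pre>// Package.swift (sketch)<br>.target(<br>    name: &quot;PlayerCoreObjC&quot;,           // Objective-C sources only<br>    path: &quot;Sources/PlayerCoreObjC&quot;,<br>    publicHeadersPath: &quot;include&quot;      // explicitly declared public headers<br>),<br>.target(<br>    name: &quot;PlayerCore&quot;,               // Swift sources only<br>    dependencies: [&quot;PlayerCoreObjC&quot;], // explicit dependency on the ObjC target<br>    path: &quot;Sources/PlayerCore&quot;<br>)</pre><p>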
Swift targets then depended on these Objective-C targets through clear, declared dependencies.</p><p>This refactoring made language boundaries explicit and removed implicit bridging behavior. While it required effort, it resulted in a cleaner dependency graph and clearer ownership across modules.</p><h4>Platform Boundaries: XIBs and Multi-Platform Packages</h4><p>Legacy Interface Builder files (XIBs) introduced a different kind of challenge — one tied closely to how Swift Package Manager treats multi-platform packages.</p><p>Under CocoaPods, platform-specific behavior could be handled through Podfile logic or build configurations. This allowed development pods to bundle UI resources that behaved differently across iOS and tvOS without that distinction being explicit in the code.</p><p>Swift Package Manager takes a stricter approach. Packages are inherently multi-platform, and resources declared in a package are shared across all supported platforms. There is no native way to conditionally include or exclude resources based on platform within the same target.</p><p>Several development pods contained XIBs that were platform-specific and loaded conditionally at runtime. Once these pods became Swift packages, those assumptions no longer held.</p><p>Rather than fragmenting packages or introducing complex runtime branching, we made a deliberate architectural decision: <strong>we moved most XIB-based UI into code.</strong></p><p>This shift reduced reliance on resource bundling, eliminated fragile platform assumptions, and aligned better with the multi-platform model encouraged by Swift Package Manager — even though it required more upfront refactoring.</p><h4>Build Behavior: Replacing Post-Install Scripts</h4><p>One aspect we underestimated initially was how much logic lived outside the code itself.</p><p>Over the years, our CocoaPods setup had accumulated substantial scripting inside the post_install block. 
These scripts handled modifying build settings across targets, injecting compiler and linker flags, adjusting deployment targets, patching generated project settings, generating code for modules, and handling branding assets.</p><p>CocoaPods made this convenient because it centralized these changes in one place. But once we moved away from CocoaPods, that implicit behavior disappeared immediately.</p><p>Swift Package Manager does not offer an equivalent of a post_install hook. That forced us to confront an important reality: <strong>a lot of critical build behavior was hidden in scripts that developers rarely looked at.</strong></p><p>To preserve correctness without reintroducing global magic, we deliberately moved this logic closer to where it actually mattered. Most essential scripting was redistributed into explicit Xcode build phases, scoped to the relevant app or framework targets. In some cases, we replaced scripts entirely by fixing the underlying configuration rather than patching it at build time.</p><p>This shift had two important effects: build behavior became more visible and discoverable, and changes were scoped to specific targets instead of being applied globally.</p><h4>Resources, Flags, and Conditional Logic</h4><p><strong>Resources</strong> — Assets that were automatically bundled by CocoaPods now had to be declared explicitly:</p><pre>.target(<br>    name: &quot;UserProfileKit&quot;,<br>    resources: [<br>        .process(&quot;Resources&quot;)<br>    ]<br>)</pre><p>This forced us to audit every resource and validate runtime access paths — something CocoaPods had silently handled for years.</p><p><strong>Build Settings</strong> — CocoaPods’ pod_target_xcconfig allowed us to inject compiler and linker settings easily. SPM requires these to be expressed explicitly:</p><pre>swiftSettings: [<br>    .define(&quot;ENABLE_LOGGING&quot;, .when(configuration: .debug))<br>]</pre><p>For edge cases, we used .unsafeFlags — but sparingly and deliberately. 
This made us more intentional about what each module actually required.</p><p><strong>Conditional Inclusion</strong> — In CocoaPods, it was common to conditionally include pods based on build configurations. SPM does not support conditional dependencies for custom build configurations.</p><p>This forced a shift in thinking. Instead of conditionally including dependencies, we moved toward conditionally <em>using</em> them via compile-time flags and feature gates:</p><pre>#if ENABLE_EXPERIMENTAL_FEATURE<br>// Feature-specific code<br>#endif</pre><p>This change improved clarity, even though it required refactoring.</p><h4>Keeping Development Fast</h4><p>A common fear with moving development pods to SPM is slower iteration. We paid close attention to this.</p><p>Local packages were referenced via relative paths, changes reflected immediately in the app, and the debugging experience remained intact. From a developer’s perspective, very little changed — which was exactly what we wanted.</p><h4>CI and Testing Implications</h4><p>Moving development pods affected CI in subtle ways. Test targets needed explicit dependency declarations, schemes had to be updated, and build order changed slightly.</p><p>This surfaced hidden assumptions in our pipelines — but fixing them made CI more robust and predictable.</p><h4>The Emotional Reality</h4><p>This phase took time. It required patience. And it touched a lot of code.</p><p>But it also marked a turning point. By the end of Phase 3, CocoaPods was no longer central to our architecture, Swift packages were no longer “new,” and the system felt cleaner and more explicit.</p><p><strong>We stopped thinking in terms of pods and started thinking in terms of modules with explicit contracts.</strong></p><h4>The Surprises We Didn’t See Coming</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1020/1*vBOg5RI_WkYuOI86y48vtg.jpeg" /></figure><p>We expected friction. We anticipated refactoring. 
What we didn’t fully anticipate was how many assumptions CocoaPods had been quietly absorbing for us over the years — assumptions that surfaced only once Swift Package Manager forced everything into the open.</p><p>These challenges didn’t appear all at once. They emerged gradually, often at inconvenient moments, and almost always in places we thought were already “done.”</p><h4>Duplicate Symbols at Runtime</h4><p>One of the more subtle issues we encountered was related to duplicate symbols — but not in the way they typically present themselves. These were not build-time or linker failures. Builds succeeded, targets launched normally, and at a glance everything appeared to be working.</p><p>The problems surfaced only at runtime.</p><p>Because Swift Package Manager builds packages as static libraries by default, the same dependency could end up being embedded multiple times through different dependency paths. This resulted in multiple instances of what was expected to be a single module being present in the process.</p><p>At runtime, this caused the wrong instance of certain symbols to be referenced:</p><ul><li>Global state would appear to reset or diverge</li><li>Dependency injection would resolve to unexpected instances</li><li>Mocks would not behave as expected</li></ul><p>In some cases, the only signal we had was a vague runtime warning:</p><pre>objc[12345]: Class MySharedService is implemented in both<br>/path/to/App.app/App and /path/to/AppTests.xctest/AppTests.<br>One of the two will be used. Which one is undefined.</pre><p>Nothing crashed. Nothing failed to launch. But from that point onward, behavior was undefined.</p><p>The challenge was not detecting the issue, but diagnosing it. Since nothing failed during build or launch, the failures initially looked like flaky tests or logical bugs rather than a dependency problem.</p><p>Resolving this required a careful audit of how dependencies were introduced across app, framework, and test targets. 
We had to ensure that shared modules were linked exactly once and that dependency graphs were consistent across targets.</p><p><strong>This was a strong reminder that test targets are not passive consumers of the app binary. They are independent bundles with their own runtime environment.</strong></p><h4>Builds That Succeeded but Crashed</h4><p>Some of the most difficult issues gave us the least amount of signal.</p><p>The app compiled successfully. There were no compiler errors. There were no linker warnings. And yet, the application crashed immediately at runtime.</p><p>These failures didn’t surface during build because, from the compiler’s perspective, everything was valid. All symbols were present, all dependencies resolved, and the binary was produced without complaint. The problem only emerged once the app launched and the runtime attempted to load and resolve those symbols.</p><pre>dyld: Symbol not found: _$s15MySharedModule16CriticalServiceC11sharedInstanceACvgZ<br>  Referenced from: /Applications/App.app/App<br>  Expected in: /Applications/App.app/Frameworks/MySharedModule.framework/MySharedModule</pre><pre>dyld: Library not loaded: @rpath/MySharedModule.framework/MySharedModule<br>  Reason: image not found</pre><p>Diagnosing these issues required stepping outside the usual compile–link–run mental model. We had to inspect the final app binary, verify which frameworks and libraries were actually embedded, and confirm that runtime search paths and linkage settings were aligned with how the dependencies were built.</p><h4>Saying Goodbye to Slather</h4><p>One of the bigger surprises had nothing to do with compilation, linking, or dependency resolution. It was code coverage.</p><p>For years, we had relied on Slather as our coverage tool. It was stable, familiar, and deeply integrated into our CI pipelines. 
We assumed it would continue to work as we moved dependencies to Swift Package Manager.</p><p>That assumption turned out to be wrong.</p><p>Slather does not support pure Swift packages as first-class citizens. It expects coverage to be generated from Xcode projects or workspaces, not standalone packages. As more of our codebase moved into Swift packages, coverage for package-based modules simply disappeared.</p><p>The only way to keep Slather working would have been to create artificial “host” projects whose sole purpose was to run package tests and collect coverage. That approach went directly against our goal of making systems simpler.</p><p>At that point, it became clear that the problem wasn’t Swift Package Manager — it was the tooling around it.</p><p>We switched to an xcresult-based coverage pipeline, parsing coverage directly from Xcode&#39;s native test result bundles. This aligned us with the direction Apple was already taking. Coverage became more accurate, easier to reason about, and independent of how the code was packaged.</p><h3>Phase 4: Removing CocoaPods</h3><p>By the time we reached this phase, CocoaPods was no longer doing much work.</p><p>All third-party dependencies had already moved to Swift Package Manager. Internal SDKs were consumed as binary targets. Development pods had been fully replaced with package-based modules.</p><p>And yet, CocoaPods was still there — quietly present, still wired into the system.</p><p>Removing it was less about technical effort and more about confidence.</p><h4>Knowing When We Were Ready</h4><p>For several weeks, CocoaPods existed in the repository almost as a safety net. 
During this period, we monitored build stability across all configurations and regions, CI performance and reliability, developer onboarding, and QA cycles without CocoaPods involvement.</p><p>Only after we had multiple successful releases — without touching CocoaPods at any stage — did we decide it was time to move on.</p><h4>The Final Delete</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oHsBUFkMcLx6lZbkJqyAJw.jpeg" /></figure><p>The final step was straightforward: we removed the Podfile, the CocoaPods-generated .xcworkspace, and all remaining CocoaPods-related files and scripts.</p><p>There was no disruption, no rollback, and no follow-up fixes.</p><p>The system simply continued to work — without CocoaPods.</p><p><strong>The day we merged that PR felt like crossing a finish line we’d been racing toward for six months.</strong></p><h3>What This Journey Taught Us</h3><p>By the time CocoaPods was fully removed, the migration itself had stopped feeling like the most important outcome.</p><p>What mattered more was how the journey reshaped the way we think about our codebase, our tooling, and the systems that support day-to-day development.</p><p>This wasn’t just a dependency migration. It was a gradual recalibration of engineering discipline.</p><h4>Tooling Should Fade Into the Background</h4><p>The best tooling is the kind you don’t think about.</p><p>CocoaPods had accumulated years of scripts, configuration overrides, and implicit behavior. Swift Package Manager, in contrast, forced us to be explicit — but once configured, it largely disappeared into the background.</p><p>When developers no longer need to remember setup steps or debug dependency resolution, cognitive load drops. 
Productivity rises not because things move faster, but because there’s less to manage.</p><h4>Migration Is About Trust, Not Speed</h4><p>One of the most important decisions we made was not rushing.</p><p>By migrating in phases and allowing CocoaPods and SPM to coexist, we preserved trust: trust from developers that their workflows wouldn’t break, trust from QA that releases wouldn’t destabilize, trust from leadership that the migration wouldn’t impact delivery.</p><p>The time spent validating coexistence and waiting through release cycles was not overhead — it was risk mitigation.</p><h4>CI/CD Is Part of the Architecture</h4><p>Several challenges — especially around binary targets, coverage, and caching — forced us to acknowledge something we had underweighted before:</p><p>CI/CD is not infrastructure glue. It’s part of the system design.</p><p>Solving problems like authenticated binary downloads or deterministic package resolution required thinking beyond Xcode and into the pipeline itself. Once we did, CI became more reliable, more predictable, and easier to maintain.</p><h3>Closing Thoughts</h3><p>CocoaPods played a critical role in helping us scale our iOS and tvOS ecosystem at a time when the platform needed it most. It gave us structure, enabled modularization, and supported years of rapid development. For that, it deserves recognition.</p><p>Swift Package Manager represents where the Apple ecosystem is heading. It aligns more closely with Swift itself, integrates natively with Xcode, and encourages explicit, predictable dependency management. Adopting it wasn’t just a response to CocoaPods’ sunset — it was an investment in long-term maintainability and clarity.</p><p>The migration was not quick, and it was never meant to be. We approached it deliberately, prioritizing stability over speed and confidence over convenience. 
By moving in phases, preserving existing workflows, and validating each step through real release cycles, we ensured that modernization didn’t come at the cost of ongoing development.</p><p><strong>Six months. Two quarters. Sixty pods. Zero broken releases. One transformed dependency management system.</strong></p><p>If you’re standing at a similar crossroads, our advice is simple but hard-earned:</p><p><strong>Move with intent. Move carefully. And never break development in the process.</strong></p><p>We’re hiring for our client teams! Do check out <a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering">open roles</a> if you want to build it right while building for millions of customers!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bf3b9d739efb" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/modernizing-dependency-management-beyond-cocoapods-part-2-the-execution-bf3b9d739efb">Modernizing Dependency Management: Beyond CocoaPods (Part 2 — The Execution)</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Dependency Management: Our Journey Beyond CocoaPods (Part 1 — The Strategy)]]></title>
            <link>https://blog.hotstar.com/dependency-management-our-journey-beyond-cocoapods-part-1-the-strategy-c3c7874566a9?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/c3c7874566a9</guid>
            <category><![CDATA[ios-development]]></category>
            <category><![CDATA[swift-package-manager]]></category>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[dependency-management]]></category>
            <category><![CDATA[mobile-app-development]]></category>
            <dc:creator><![CDATA[Saurabh Kapoor]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 08:58:51 GMT</pubDate>
            <atom:updated>2026-02-03T08:58:50.101Z</atom:updated>
            <content:encoded><![CDATA[<h3>Modernizing Dependency Management: Beyond CocoaPods (Part 1 — The Strategy)</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LKzyVxVFyIXupSUYoQaH5g.jpeg" /></figure><blockquote>CocoaPods formed the spine of our iOS build workflow. In this two-part blog, we’re sharing our journey to transparently switch away from CocoaPods to SPM, minus the drama, but plus all the scars!</blockquote><h3>Where We Started</h3><p>CocoaPods was at the heart of our development workflow — for everything. Dependency resolution, configuration overrides, CI integration — all became tightly coupled to CocoaPods. It stopped being a convenience layer and became part of our infrastructure.</p><p>Then came the <a href="https://blog.cocoapods.org/CocoaPods-Specs-Repo/">announcement</a> that changed everything.</p><blockquote><strong>CocoaPods’ trunk would become read-only in December 2026.</strong></blockquote><p>This wasn’t just an ecosystem update. It was a tectonic shift. A read-only trunk meant no new pod versions, no straightforward path to adopt upstream fixes, and increasing exposure to unpatched issues over time. 
Any disruption here wouldn’t just affect builds — it would impact active feature development, CI stability, and release confidence across teams.</p><p>So when the announcement landed, the question wasn’t <strong><em>“Should we move?”</em></strong> That decision had effectively been made for us.</p><p>What followed was one of the most ambitious infrastructure initiatives we’ve undertaken — a six-month effort, touching every corner of our codebase, and ultimately transforming how we build and ship our apps.</p><p>This is that story.</p><h3>Choices, choices…</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7GFT2BmFXcc4bDk_evUDmw.jpeg" /></figure><p>Before committing to any solution, we took a deliberate step back and evaluated the dependency management landscape as a whole. Since we had to move, we wanted to make as future-proof a choice as possible.</p><p>Several factors mattered deeply to us:</p><ul><li><strong>Scalability</strong> — Handle the size and complexity of our codebase, while remaining adaptable.</li><li><strong>Learning curve</strong> — It couldn’t slow teams down or force widespread workflow changes.</li><li><strong>Future-proofing</strong> — We wanted to align with the direction the Apple ecosystem was moving.</li><li><strong>Tooling integration</strong> — It had to work well with project generation and tooling we already relied on.</li></ul><p>With those constraints in mind, we evaluated the available options.</p><h3>Carthage</h3><p>Carthage is lightweight, decentralized, and intentionally avoids modifying Xcode projects — qualities that align well with engineering simplicity. 
For smaller projects or teams with straightforward dependency needs, it can be an excellent choice.</p><p>However, managing binary distribution at scale, supporting internal SDKs, handling complex build configurations, and maintaining consistent tvOS support required more manual orchestration than we were comfortable with. Over time, this would have shifted operational complexity from tooling into team workflows.</p><p>Carthage wasn’t a bad fit universally; it just wasn’t the right fit for our ecosystem.</p><h3>Bazel and Buck</h3><p>We also explored Bazel and Buck — not dependency managers in the traditional sense, but complete build systems. Both are powerful and proven at massive scale, offering deterministic builds, strong caching, and sophisticated dependency graphs.</p><p>However, adopting either would have meant replacing our entire build system, moving away from Xcode’s native build model, and introducing a steep learning curve for developers whose daily workflows are deeply tied to Xcode.</p><p>For our teams, the cost of that transition far outweighed the benefits.</p><h3>Swift Package Manager (SPM)</h3><p>Swift Package Manager wasn’t perfect. It lacked some of the flexibility we were used to, and certain features required workarounds. But it had two qualities that ultimately mattered most.</p><p>First, it was <strong>native</strong> — integrated directly with Xcode, aligned with Swift’s evolution, and benefiting from ongoing investment by Apple. Second, it allowed us to <strong>preserve our existing project structure</strong> while gradually modernizing it. We could migrate incrementally, validate changes in production, and avoid large-scale rewrites.</p><p>That balance — modernization without disruption — was the turning point. 
Swift Package Manager positioned us to evolve with the platform, rather than constantly working around it.</p><h3>Replacing a beating heart…</h3><p>This was open-heart surgery on a system that powered millions of app sessions every day — performed while the patient was still running around!</p><p>At the time of the migration, our ecosystem included:</p><ul><li><strong>~60 iOS/tvOS engineers</strong> actively shipping features</li><li><strong>~30 SDETs</strong> maintaining extensive automation and test infrastructure</li><li><strong>A multi-language codebase</strong> (Swift, Objective-C, and supporting tooling) spanning millions of lines of code</li><li><strong>60+ pods in total</strong> — external dependencies, internal SDKs, and local development pods under constant iteration</li><li><strong>Multiple markets</strong> (India, International) with distinct configurations</li></ul><p>Every one of those engineers relied on CocoaPods behaving in very specific, sometimes undocumented ways. Every automation script, every CI pipeline, every local development workflow had assumptions baked in about how dependencies resolved, how builds were structured, and how artefacts were produced.</p><p>We made extensive use of pre_install and post_install hooks. We maintained multiple build configurations for India and international markets — involving conditional linking, selective dependency inclusion, and market-specific code paths. Our CI pipelines were tightly coupled to CocoaPods-generated projects in ways that weren&#39;t always visible until something broke. This was organisational Jenga, except we couldn’t let the tower fall, or even shake!</p><p>Easy, right?</p><h3>Core Constraint: Do Not Break Development</h3><p>From day one, we aligned on one non-negotiable rule:</p><blockquote><strong>This migration must not block ongoing development.</strong></blockquote><p>This constraint shaped everything.</p><p>Teams were actively shipping features. 
Local pods and internal SDKs were under constant iteration. CI pipelines had to remain stable across multiple environments. We couldn’t afford a <strong>“big bang”</strong> migration that froze development, forced widespread workflow changes, or introduced uncertainty into release cycles. Migration had to occur in phases with both systems co-existing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/250/1*lfC9cdpEstTKcYtt6p9xMQ.gif" /></figure><p>This wasn’t the cleanest approach — maintaining two dependency systems in parallel added complexity. It was the safest path forward. It meant teams could keep shipping while we rebuilt the foundation underneath them.</p><h3>Our Migration Strategy</h3><p>Once we committed to a phased migration, we needed a structure that balanced safety with forward momentum. Each phase had to deliver real progress, while still preserving the ability to ship features without disruption.</p><p>We broke the journey into four deliberate stages:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qxtpM2tK4lFe-VnXDi05OA.png" /></figure><h4>Phase 1: Proving That SPM Could Coexist</h4><p>Before touching any critical paths, we focused on coexistence. Swift Package Manager was introduced alongside CocoaPods — not as a replacement, but as a parallel system. This phase was about validation.</p><h4>Phase 2: Internal Pods and Binary Distribution</h4><p>With coexistence proven, we moved to internal SDKs and binary dependencies. These were more controlled, lower-churn modules, making them ideal candidates for early migration. 
This phase helped us establish patterns for versioning, distribution, and consumption at scale.</p><p><em>This is where we built the playbook that would carry us through the harder phases ahead.</em></p><h4>Phase 3: Development Pods → Swift Packages</h4><p>This was the crucible.</p><p>Development pods were deeply intertwined with active feature work, often containing mixed-language code and UI resources. Migrating these required structural changes — not just mechanical conversions — and forced us to confront some of Swift Package Manager’s core constraints head-on.</p><p>This phase stretched across quarters, required constant coordination with feature teams, and tested every assumption we’d made about the migration strategy.</p><p><em>This is where patience mattered more than speed.</em></p><h4>Phase 4: Removing CocoaPods</h4><p>Only after the system had fully stabilized under SPM did we execute the final step: removing CocoaPods entirely. By this point, CocoaPods was no longer a dependency — it was technical debt waiting to be deleted.</p><p>The day we removed the Podfile from the repository felt like crossing a finish line we’d been racing toward for six months!</p><h3>The Payoff: Measurable Wins</h3><p>This wasn’t change for change’s sake. 
After two quarters of sustained effort, the results were tangible and significant.</p><h4>Immediate Impact</h4><ul><li><strong>Automation stability improved</strong> — test targets became less sensitive to implicit CocoaPods behavior</li><li><strong>Dependency graphs became explicit</strong> — making ownership, impact analysis, and refactoring significantly easier</li><li><strong>CI pipelines became more predictable</strong> — fewer pod-related cache invalidations and mysterious failures</li><li><strong>Build times improved</strong> <strong>by 2x </strong>— especially in incremental builds, due to better dependency isolation and binary targets wherever possible</li><li><strong>App startup time reduced by 200–300ms</strong> — SPM’s cleaner dependency loading eliminated redundant framework initialization at launch. The migration also allowed us to adopt the latest linker (previously blocked due to crashes on older OS versions), which improved dynamic library load times and static linking performance.</li></ul><h4>Long-Term Gains</h4><p>Just as importantly, we fundamentally reduced our risk profile. Dependency management stopped being a fragile layer propped up by scripts, conventions, and tribal knowledge. It became something the platform itself understood — native, supported, and evolving with the ecosystem.</p><p>We went from dreading the CocoaPods deprecation deadline to being ahead of it by a year.</p><p><strong>Six months. Two quarters. Sixty engineers. Sixty-plus pods. Zero feature freezes. 
One satisfying Podfile deletion.</strong></p><h3>The Scars — Part 2</h3><p>In <strong>Part 2</strong>, we’ll go deep into how these phases were executed in practice.</p><p>We’ll cover the real issues we encountered along the way — mixed Objective-C and Swift targets, XIBs in development pods, CI assumptions that quietly broke, and the architectural decisions we had to make to move forward safely.</p><p>This is where theory met reality — and where the migration truly earned its scars!</p><p>We’re hiring for our client teams! Do check out <a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering">open roles</a> if you want to build it right while building for millions of customers!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c3c7874566a9" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/dependency-management-our-journey-beyond-cocoapods-part-1-the-strategy-c3c7874566a9">Dependency Management: Our Journey Beyond CocoaPods (Part 1 — The Strategy)</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Tesseract: JioHotstar’s Central Design Token System]]></title>
            <link>https://blog.hotstar.com/tesseract-jiohotstars-central-design-token-system-922e4616bc05?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/922e4616bc05</guid>
            <category><![CDATA[figma]]></category>
            <category><![CDATA[design-to-code]]></category>
            <category><![CDATA[kotlin-multiplatform]]></category>
            <category><![CDATA[design-systems]]></category>
            <dc:creator><![CDATA[Ritika Pahwa]]></dc:creator>
            <pubDate>Fri, 30 Jan 2026 08:01:05 GMT</pubDate>
            <atom:updated>2026-01-30T08:01:04.020Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/727/1*ZReKswl11juPXSC3Rwt7NA.png" /></figure><p>How do you keep a large customer platform like JioHotstar looking and working consistently, from both a visual and an experiential perspective?</p><blockquote>This blog explores our central design token system that’s integrated into our client platforms and made a massive dent in UI inconsistencies across platforms.</blockquote><p>JioHotstar is supported on multiple platforms, like Android and iOS devices, tablets, Android TV, tvOS and web. While the code-bases are different, it’s the design team’s job to ensure that the product UI/UX is harmonised across platforms.</p><p>We leverage our design system to unite these clients with a consistent brand identity, and the standards are defined in the “<em>Foundation Styles</em>” Figma file managed by our design team.</p><p>Our <a href="https://en.wikipedia.org/wiki/Design_system">design system</a> (SOUL) uses some basic ingredients — colors, typography, effect styles, spacings, sizes and other constants, referred to as design tokens. The tokens act as building blocks for all the myriad experiences at JioHotstar and any change in these tokens needs to be reflected in all the codebases as well.</p><h4>Speed Thrills, But Spills</h4><p>With the pace of change and the manual overhead that existed, we observed several inconsistencies creeping into the translation process.</p><p>Here is what we found:</p><h4>1. Variable Pace of Adoption</h4><p>Whenever the design team changes any tokens, the changes need to be propagated to the respective engineering teams. While some teams might develop, test, and release a change quickly, others might take longer before their release ships. This leads to inconsistent design across platforms, thereby affecting the branding of the organisation.</p><h4>2. 
Mis-Interpretation of Missing Values</h4><p>Sometimes the design suggestions get lost in translation. For instance, if developers receive design pages with missing hex codes at some places, then something appearing red could be interpreted as #D21D1D, #BF0F0F, or similar.</p><h4>3. Token Sprawl is real</h4><p>Tokens can be defined anywhere within the codebase. While this seems harmless at first, it turns large refactors into a nightmare when the overall theme needs to change. Often, visual bugs hide so deep in the user flows that they slip past sanity testing. Strict guardrails are, therefore, the need of the hour.</p><p>These inconsistencies accumulated over time, to the extent that designs across the platforms started to deviate significantly. To mitigate the challenges we faced with the design-to-production flow, we decided to automate the process.</p><blockquote><strong>The goal was simple: to make Figma the source of truth for all the design tokens.</strong></blockquote><p>Enter — <strong>Tesseract</strong>!</p><h3><strong>Tesseract: Central Design Token Framework</strong></h3><p>We created a central design token repository, named <em>Tesseract</em>, and integrated it with client platforms.</p><p>There were two parts to it, one for each of our two personas (designers and developers):</p><ul><li>Designers: a way to export tokens to the central repository</li><li>Developers: access to this repository within the client codebases, to use the foundational elements</li></ul><p>Designers and developers are our customers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-x3s_aibOWlvYL-e3pMTGw.png" /><figcaption>The revamped flow</figcaption></figure><h3>Digging into the workflow - Design to code</h3><p>The workflow was adapted to leverage the “Foundation Styles” file as the source of truth. 
Designers were now required to publish their tokens to <em>Tesseract</em> using a standard review+approval workflow.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8ubcM9X5GHhLrZ17xh9owg.png" /><figcaption>Designer Workflow : Export design tokens to the central repository</figcaption></figure><h4>Designer Workflow — Enter Tesseract Plugin</h4><p>One of the most crucial functional requirements was a Figma plugin that the designers could use to export the tokens to the central repository. In this process, the <a href="https://www.figma.com/plugin-docs/plugin-quickstart-guide/">official documentation for Figma plugin development</a> came in handy.</p><p>We decided to export all the tokens in JSON format. The following snippet shows a glimpse of the JSON structure we used in the tokens.json file:</p><pre>{<br>  &quot;Colors&quot;: {<br>    &quot;Background&quot;: {<br>      &quot;UI&quot;: {<br>        &quot;Default&quot;: {<br>          &quot;id&quot;: &quot;VariableID:11109:3198&quot;,<br>          &quot;name&quot;: &quot;Background/UI/Default&quot;,<br>          &quot;description&quot;: &quot;Panther Grey 10&quot;,<br>          &quot;type&quot;: &quot;SOLID&quot;,<br>          &quot;hex&quot;: &quot;#0F1014&quot;,<br>          &quot;opacity&quot;: 1<br>        },<br>        ...<br>      },<br>    },<br>  },<br> &quot;Typography&quot;: {<br>    ...<br>  },<br> &quot;Effects&quot;: {<br>    ...<br>  },<br> &quot;Spacings&quot;: {<br>    ...<br>  },<br> &quot;Radius&quot;: {<br>    ...<br>  }<br>...<br>}</pre><p>Our Foundation Styles Figma file initially utilised styles for colours, text, and effects, and separate pages were used for the size constants. We used Figma’s <a href="https://www.figma.com/plugin-docs/">Plugin API</a> to fetch all the styles from the files, while the plugin was being run on it.</p><p>The designers need to enter their personal access token to use the plugin. 
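</p><p>Under the hood, the style-collection step can be sketched with the Figma Plugin API. This is a minimal illustration for solid colour styles only; the real plugin also covers typography, effects, and constants, and writes the full tokens.json shape shown above:</p><pre>// Illustrative sketch, not our production plugin code.
// figma.getLocalPaintStyles() returns the local colour styles of the file.
function rgbToHex({ r, g, b }) {
  const to255 = (v) => Math.round(v * 255).toString(16).padStart(2, "0");
  return ("#" + to255(r) + to255(g) + to255(b)).toUpperCase();
}

function collectColorTokens() {
  const tokens = {};
  for (const style of figma.getLocalPaintStyles()) {
    const paint = style.paints[0];
    if (!paint || paint.type !== "SOLID") continue; // solid fills only
    tokens[style.name] = {
      id: style.id,
      name: style.name,
      description: style.description,
      type: paint.type,
      hex: rgbToHex(paint.color),
      opacity: paint.opacity ?? 1,
    };
  }
  return tokens;
}</pre><p>As noted, the plugin asks for the designer’s personal access token before exporting. 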
This is a one-time step that is not triggered again until the token expires. Instead of exporting all changes directly to GitHub, we added an extra review page to display differences, allowing for proofreading before committing any unintended changes:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*V7pUtp8W1pXOA-sd" /><figcaption>Custom Review Page</figcaption></figure><h4>Challenge with Constants — Adopting Figma variables</h4><p>Unlike styles, we had no direct API calls to retrieve the constants from the pages. We started with a simple solution — traversing all the pages to find the required frame using its name and populating our JSON object.</p><p>This approach, however, was quite error-prone, as it searched for frames by their names. So we advocated that the design team exclusively use Figma variables for constants, which required some evangelisation.</p><p>With these challenges solved, we were ready to pipe everything into a shared repository on GitHub where the design tokens were maintained.</p><h4>Developer Workflow — The code to deployment phase 🚀</h4><p>We built a library that the developers could use to retrieve these design tokens.</p><h4>Build Tenets</h4><p>We wanted to give developers a seamless experience while consuming these tokens at scale.</p><p>We decided on the following tenets:</p><ul><li><strong>Plug and play adoption: </strong>The developers should be able to use the library as a dependency.</li><li><strong>Autocomplete:</strong> Autocomplete should work in almost all the IDEs that the developers use (Android Studio, VS Code, Xcode, etc.).</li><li><strong>Intuitive syntax</strong>: The syntax should be developer-friendly, that is, the developers should be able to use something like Colors.primary to use the primary color, just like the native UIColor.systemBlue for iOS or Color.Red for Android.</li></ul><p>One way to approach this phase was to allow the client devices to fetch the tokens 
from the repository as and when required. However, this server-driven approach had some trade-offs. While this architecture would have given us the latest tokens on every run, we needed the resources to be bundled within the app itself.</p><p>Another way to approach this problem was to have a map with all the key-value pairs, with keys being the token names and values being the corresponding token instances. We discarded this idea as it was an in-memory approach and did not satisfy our auto-completion requirement.</p><h4>Kotlin Multiplatform: Ship platform-specific artefacts</h4><p>Different platforms use different templates. For instance, we have an XML file or an object containing variables and their corresponding values on Android as opposed to a class on iOS.</p><p>So this was essentially a scripting problem that we could have solved by having separate scripts and libraries for Android, web and iOS. However, Kotlin Multiplatform (KMP) fit our requirements.</p><p>Here’s why 👇</p><ul><li><strong>Flexibility: </strong>With KMP, we had the flexibility of using our own customised template, instead of going with the ones used on the platforms. KMP would take care of generating the platform-specific artefacts.</li><li><strong>Maintainability:</strong> For any design system, one of the most challenging parts is its maintenance. One script and one library meant that we would require much less maintenance later on!</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*dWEXWBnqWcifwe_A" /><figcaption>Why KMP?</figcaption></figure><p>The template creation itself was involved and required support and feedback from developers so that we could ensure maximum ease of use. DevX was key. We can’t emphasize enough how critical this step was to ultimate uptake and adoption.</p><h4>Managing Updates</h4><p>We have dedicated GitHub workflows to update the KMP library and release the artefacts. 
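</p><p>At the heart of these workflows is a simple transformation: walk the tokens.json tree and emit platform source files. A rough, illustrative sketch of the idea (the real template, naming, and structure differ):</p><pre>// Illustrative sketch of token-to-source generation for commonMain.
// Walks a tokens subtree and emits a Kotlin object of colour constants.
function kotlinColorObject(colors, objectName) {
  const lines = ["object " + objectName + " {"];
  const walk = (node, path) => {
    for (const [key, value] of Object.entries(node)) {
      if (value && typeof value === "object" && "hex" in value) {
        // Leaf token: Background/UI/Default becomes BackgroundUIDefault
        lines.push('    val ' + path.concat(key).join("") + ' = "' + value.hex + '"');
      } else if (value && typeof value === "object") {
        walk(value, path.concat(key));
      }
    }
  };
  walk(colors, []);
  lines.push("}");
  return lines.join("\n");
}</pre><p>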
The KMP library gets updated whenever the pull request with an updated tokens.json file gets merged:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gNVRHbuht_M1zfVU" /><figcaption>Releasing updates</figcaption></figure><p>Since the tokens need to be shared across all the platforms, we opted to keep them in the commonMain section of our KMP library.</p><p>We use a script called common-main-files-generator.js to traverse the tokens in the tokens.json file and update the tokens in commonMain:</p><pre>- name: Populate KMP library<br>  run: node scripts/common-main-files-generator.js</pre><p>The actual implementation for Android/iOS/Web uses native dependencies and APIs to handle platform-specific nuances.</p><p>Once the files in commonMain get updated, we use GitHub workflows to update the library version, generate platform-specific artefacts, and release them. The following snippets elucidate the process.</p><ul><li>Generating a node module for web and publishing it on Nexus:</li></ul><pre>- name: Generate Node Module<br>  run: ./gradlew jsNodeProductionLibraryDistribution --no-configuration-cache<br>      <br>- name: Publish the node module to nexus <br>  run: |<br>       ...<br>       npm publish</pre><ul><li>Creating a .aar file for Android and publishing it on Nexus:</li></ul><pre>- name: Build Android artifact<br>  run: ./gradlew assembleRelease<br><br>- name: Release Android artifact on Nexus<br>  run: ./gradlew publishTesseractPublicationToNexusRepository<br>  env:<br>    MAVEN_REPOSITORY_ANDROID_PUBLISH_PASSWORD: ${{ secrets.MAVEN_REPOSITORY_ANDROID_PUBLISH_PASSWORD }}</pre><ul><li>Building an XCFramework for iOS and releasing it on GitHub:</li></ul><pre>- name: Build and publish iOS Framework<br>  run: | <br>       ./gradlew podPublishReleaseXCFramework <br>    <br>- name: Release iOS Framework on GitHub<br>  working-directory: .<br>  run: |<br>        # Add only XCFramework and podspec file to the release branch<br>        git push -u -f 
origin iOS/Release<br><br>  ... <br><br> - name: Tag the iOS build<br>   run: |<br>         ...<br>         git push origin ${{ steps.get_version.outputs.version }}<br></pre><h4>Handling Breaking Changes</h4><p>Things change, and things break. The following are the use cases where design tokens change.</p><ul><li>Token gets updated <em>(attributes like hex, opacity, etc. get changed)</em></li><li>New token is added</li><li>Token is deleted</li><li>Name of an existing token is modified</li></ul><p>Out of these four cases, we were aware that our codebases would break if a token were deleted or its name altered. We wanted the library’s adoption to be as frictionless as possible. Thus, we opted to version tokens and deprecate older ones.</p><p>This gave us more control over the design system in general. We could even automate the version bump process since the changes wouldn’t disrupt the client codebases.</p><p>Once all the deprecated tokens get removed from the entire ecosystem, we delete them completely from the central repository. The Figma plugin allows the designers to delete deprecated tokens, and the flow is the same as for exporting updates to the central repository, as discussed earlier.</p><h4>Versioning our library</h4><p>We follow semantic versioning for Tesseract. 
For this, we use the following rules:</p><ul><li>All the exports happening via the plugin account for the <em>patch updates</em>.</li><li><em>Major updates</em> are released only when the deprecated tokens are completely removed from the entire ecosystem.</li></ul><p>The snippet below outlines the process we follow for versioning the KMP library:</p><pre>- name: Update Library Version <br>  id: get_version<br>  working-directory: ./design-tokens-lib<br>  run: |<br>        VERSION=$(grep &quot;libVersion&quot; ./Tesseract/build.gradle.kts | awk &#39;{print $4}&#39; | tr -d &#39;&quot;&#39; | tr -d &#39;\n&#39;)<br>        IFS=&#39;.&#39; read -r major minor patch &lt;&lt;&lt; &quot;$VERSION&quot;<br>        if [ &quot;$GITHUB_EVENT_PULL_REQUEST_TITLE&quot; = &quot;Delete Deprecated Tokens&quot; ]; then<br>           new_major=$((major + 1))<br>           NEW_VERSION=&quot;$new_major.0.0&quot;<br>        elif [ &quot;$GITHUB_EVENT_PULL_REQUEST_TITLE&quot; = &quot;Update Design Tokens&quot; ]; then<br>           new_patch=$((patch + 1))<br>           NEW_VERSION=&quot;$major.$minor.$new_patch&quot;<br>        else<br>           NEW_VERSION=&quot;$VERSION&quot;<br>        fi<br>        sed -i &quot;&quot; &quot;s/libVersion = \&quot;$VERSION\&quot;/libVersion = \&quot;$NEW_VERSION\&quot;/&quot; ./Tesseract/build.gradle.kts<br>        ...</pre><h3>Rolling it out</h3><p>One final step remained: verifying there was no regression in app size or app start time. No regressions were observed, and we were good to go! Finally, we needed to improve communication around changes. 
We leveraged a Slack channel to announce these changes for maximum visibility, since our team <em>lives </em>on Slack!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/976/1*k24TJgMmxzZgx8vGIHZR4w.png" /><figcaption>Slack notification on our design system channel</figcaption></figure><h3>Impact 📈</h3><blockquote>4 codebases.</blockquote><blockquote>400+ hard-coded values.</blockquote><blockquote>And that wasn’t all!</blockquote><p>We identified numerous inconsistencies, and we integrated our library in several iterations. Across the codebases, we discovered:</p><ul><li><strong>Over 20 tokens </strong>whose names weren’t consistent</li><li><strong>180+ tokens</strong> which were not available in the Foundation Styles at all</li><li><strong>More than a dozen tokens</strong> that were mismatched or improperly utilised</li><li>Primitive tokens being used within the code, while they shouldn’t be (primitive tokens are the basic tokens over which other tokens carrying a contextual meaning, often called semantic tokens, are built)</li><li><strong>Different font families</strong> being used across the platforms</li></ul><p>As a part of Tesseract, we now ensure that the font families are controlled centrally.</p><p>While we eradicated inconsistencies to a large extent, coming to a consensus on whether to retain a token was a challenging endeavour, and the process required an audit. Apart from this, we also made the developer experience smooth and seamless by making the usage of gradients and text styles hassle-free!</p><h3>What’s next? 
🆙</h3><p>The code hitting production was the ultimate reality check, and we learnt several lessons:</p><ul><li><strong>Multi-theme &amp; token variant support: </strong>While the static nature of tokens ensured blazingly fast access with zero startup delays or race conditions, it came with its own limitation: changing themes dynamically at runtime was not possible with this approach.</li><li><strong>Manual update bottleneck:</strong> The process of bumping up the library versions within the client codebases was still manual and often resulted in different parts of the ecosystem running on different versions of the design system.</li><li><strong>Delayed rollouts and rollbacks: </strong>Since the tokens were statically included in the KMP library, any change or revert meant going through the full release cycle on the App/Play Store.</li><li><strong>The older build trap:</strong> Once an app version was released, its look was frozen. Users on older builds saw the old theme, meaning we could never achieve 100% visual uniformity across our user base.</li></ul><p>These were limitations that we have since addressed in newer versions of the library.</p><h3>Wrapping Up</h3><p>Design consistency is crucial for a consumer application with the reach of JioHotstar, and developer and designer experience matters just as much. Our teams are able to spend their time on more meaningful experiences for our customers thanks to initiatives like Tesseract!</p><p>We’re hiring for our client teams! 
Do check out <a href="https://jobs.lever.co/jiostar?department=Digital+%7C+Engineering">open roles</a> if you want to contribute to initiatives like these and focus your time on quality engineering problems versus “token hunting”!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=922e4616bc05" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/tesseract-jiohotstars-central-design-token-system-922e4616bc05">Tesseract: JioHotstar’s Central Design Token System</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling Tales: Discovering Onboarding Rate]]></title>
            <link>https://blog.hotstar.com/scaling-tales-discovering-onboarding-rate-a862afe85345?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/a862afe85345</guid>
            <category><![CDATA[scaling]]></category>
            <category><![CDATA[onboarding-rate]]></category>
            <category><![CDATA[api]]></category>
            <dc:creator><![CDATA[Ajaychoudhary]]></dc:creator>
            <pubDate>Tue, 30 Dec 2025 08:38:48 GMT</pubDate>
            <atom:updated>2025-12-30T09:17:41.907Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mVfmOfoI0JIFOMb0cfLp0g.png" /></figure><blockquote>If there’s one thing streaming live sports teaches you, it’s humility.</blockquote><p>You can prepare for months. Tune every knob. Deploy every optimization.<br>And then, at <strong>7:29:59 PM</strong>, a nation decides to show up.</p><p>For the longest time, we believed we understood this rhythm.<br>We watched <strong>platform</strong> <strong>concurrency</strong> like hawks.<br>We built scaling ladders.<br>We separated Non-High-Scale-Days (<strong>BAU</strong>) and Scale-Days (<strong>Live</strong>) modes.</p><p>And we thought that was enough. It wasn’t.</p><p>The real turning point in our scaling journey came when we realized that <strong>platform concurrency</strong> — the metric we had anchored on for years — was only telling half the story.</p><p>The other half was hiding in plain sight, quietly shaping every surge, every incident, every 3 AM war room.</p><p>That missing piece was <strong>Onboarding Rate (OR)</strong> — the <strong>velocity at which users arrive</strong> 📈.</p><p>Building on the foundation of our <a href="https://medium.com/p/42b04ef5ed6a">multi-datacenter architecture</a>, this is the story of how we uncovered <strong>Onboarding Rate</strong> — the evolution of our scaling strategy — and how it fundamentally changed the way the JioHotstar platform scales.</p><h3>What’s beyond Concurrency?</h3><p>Platform concurrency was always our anchor, the bed-rock of our scaling ladders: how many customers are watching <em>right now? 
</em>For years, this held us in good stead; we learnt over the years how customers <em>shape </em>onto the platform, but we never named the “ramp-up” until relatively recently.</p><p>Through a slew of incidents at lower concurrencies, and the RCAs that followed, we started to realise that we needed to surface and name what we had already planned for instinctively over the years…</p><blockquote>It wasn’t just the number of customers watching at the same time, it was also the pace at which these customers <em>joined </em>the stream.</blockquote><p>During high scale, this phenomenon had remained masked to some degree, since our scaling ladders bake in the “tsunami”. However, as we moved to more autonomous scaling and optimised our infrastructure, cracks started to appear at lower concurrencies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T9K3Tct-__HiYKp-YCDiAA.png" /><figcaption><em>Concurrency tells you how many people are here. Onboarding Rate tells you how fast they’re arriving.</em></figcaption></figure><ul><li><strong>Concurrency (Green):</strong> Gradual incline</li><li><strong>Onboarding Rate (Yellow):</strong> Sharp vertical spike at first ball</li></ul><h3>The First Ball Problem: The Anatomy of a Surge</h3><p>A cricket match doesn’t start gently. It detonates. At the moment the first ball is bowled, we often see:</p><ul><li><strong>6–8 million</strong> users joining in the first <strong>2 minutes</strong></li><li>Homepage API traffic jumping 300–400%</li><li>Watch Page [page with the player] API traffic jumping 400–500%</li></ul><p>Everything in the onboarding path lights up red.</p><p>Then, strangely, 10 minutes later the same services hum along happily as concurrency climbs toward 50 or 60 million. 
While it sounds obvious now — the stress was caused by the arrival pattern.</p><h3>Naming the Invisible Force: Onboarding Rate (OR)</h3><p>We started to ask:</p><p>“How many people are watching now?”<br><strong>AND….</strong><br><strong>“How fast are people joining right now?”</strong></p><p>By plotting <em>new active sessions per minute</em>, the heartbeat of the match finally revealed itself:</p><ul><li>The mini-spike at toss</li><li>The explosion at first ball</li><li>The middle overs lull</li><li>Second innings spike</li><li>The pre-death overs ramp</li><li>The sudden drop at match end</li></ul><p>It was beautiful in a way — like seeing the pulse of a stadium appear on your Grafana dashboard.</p><p>We gave it a name: <strong>Onboarding Rate (OR)</strong>.</p><p>It also pushed us to recognise that every service experienced the wave differently, and to treat that as a first-class problem statement rather than <em>buffering </em>an instinctive number over the concurrency number.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Lh8oPKzITClm3Qk9koGp5A.png" /><figcaption><em>Every match has a pulse. OR makes that pulse visible.</em></figcaption></figure><h3>OR Didn’t Just Improve Scaling — It Rewrote It</h3><p>Recognising OR as a first-class signal forced us to rethink our entire scaling philosophy.</p><p>What we realised was uncomfortable but liberating:</p><ul><li>Our BAU (metrics-only scaling) vs Live mode (concurrency ladder + metrics-based scaling) dichotomy was artificial</li><li>Our reliance on concurrency created blind spots</li><li>Our scaling ladders could be better aligned with the customer journey for an event</li><li>Our systems needed different signals at different stages</li></ul><p>So we did what we do best as a team — adapt and mature.</p><h3>The End of “Modes”: One Unified Scaling Model</h3><p>With OR now visible, we no longer needed to remember to switch from BAU to Live mode. 
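</p><p>The signal itself is simple: as described above, OR is just new active sessions per minute. A toy sketch of deriving it from session-start timestamps (illustrative only, not our production pipeline):</p>

```javascript
// Toy sketch: Onboarding Rate (OR) as new active sessions per minute,
// derived from session-start timestamps (epoch seconds).
// Illustrative only — not the actual JioHotstar pipeline.
function onboardingRatePerMinute(sessionStarts) {
  const perMinute = new Map();
  for (const ts of sessionStarts) {
    const minute = Math.floor(ts / 60);
    perMinute.set(minute, (perMinute.get(minute) ?? 0) + 1);
  }
  // minute bucket -> number of new sessions that joined in that minute
  return perMinute;
}
```

<p>Plotting these per-minute buckets over a match is what reveals the toss mini-spike, the first-ball explosion, and the pre-death-overs ramp.</p><p>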
The system could adapt to user intent directly.</p><ul><li><strong>High OR:</strong> Scale onboarding path immediately</li><li><strong>High Concurrent Users (CCU):</strong> Scale playback and streaming paths</li><li><strong>Both high:</strong> Full platform lift</li><li><strong>Both low:</strong> Efficient, cost-effective base ladder which can auto-scale based on real-time traffic.</li></ul><p>Scaling became fluid — not binary.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LGmlWrNVrdLY_Ei8vH1pzg.png" /><figcaption>OR Ladder Scale up/down with real time metrics</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I6lXOrNLc1DzpyWUqbOdLQ.png" /><figcaption>Concurrency Ladder Scale up/down with real time metrics</figcaption></figure><h3>Mapping Scaling Signals to the Customer Journey</h3><p>Armed with new eyes, we now scale our system based on the stress it experiences.</p><p><strong>Auth</strong> → Onboarding rate + Throughput</p><p><strong>Customer Profile Selection</strong> → Onboarding rate + Throughput</p><p><strong>Home Page</strong> → Onboarding rate + Throughput + Concurrency * Post-match Coefficient</p><p><strong>Watch Page</strong> → Onboarding rate + Throughput</p><p><strong>Ads</strong> → Concurrency + Throughput</p><p>This alignment transformed our reliability dramatically, as is evident from the scaling patterns below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u-dK65m0uMMRlJ3kR1LkKQ.png" /><figcaption>OR based System</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wTayl41kxnEouReI3TRyXA.png" /><figcaption>Concurrency based System</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gey1-lX5_ixusr5WjWW4Xw.png" /><figcaption>OR + Concurrency*coefficient based system</figcaption></figure><h3>The Surprise We Didn’t Expect: The Post-Match Storm (Patent filed)</h3><p>We thought OR solved 
everything. Then we discovered match endings — or the “Movie Interval” conundrum.</p><p>At the end of a match/innings:</p><ul><li>Concurrency drops</li><li>OR is flat</li><li>But homepage traffic spikes <strong>200–300% within seconds</strong></li></ul><p>Once the event was over, customers flocked to the homepage, which was great; we <em>want </em>this to happen! Just like a movie <em>interval</em>, everyone streams out to the concession stands, causing a surge.</p><p>We call this the “off-boarding rate” (OBR).</p><p>OBR was invisible to both concurrency and OR.</p><p>So we built a hybrid model for homepage scaling:</p><p><strong>Homepage Scaling = OR + (Concurrency × Post-Match Coefficient)</strong></p><p>This resolved the reverse surge — though we continue to refine how we predict the OBR, to arrive at an optimal coefficient.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*d_rlNCkVvTSmcdUR1gWhjQ.png" /><figcaption><em>The moment after the match ends is its own event.</em></figcaption></figure><h3>Rounding It All Up</h3><p>Here’s what we’ve learnt so far, and we’re still learning!</p><h3>1. Ride the “Wave”</h3><p>The curve of user arrival tells you far more than the peak ever will. Right from when customers arrive [OR], to their “staying” [concurrency], and finally their off-boarding [OBR].</p><h3>2. Every Service is Unique Like A Snowflake</h3><p>Every service moves to a different rhythm, depending on where it is in the customer journey. While we always knew this, with OR, we are now able to articulate this difference more sharply, mathematically.</p><h3>3. Adapting Is Compulsory</h3><p>World record concurrencies are a given when you stream Indian men’s cricket. 
No other streaming service in the world has to deal with the <em>tsunamis </em>that JioHotstar has to plan for.</p><p>Even though we’ve learnt and built over the years, we had to keep adapting and naming parts of our scaling system that we had previously only scaled instinctively.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oYEXnXUV-oQ2LVWJZfacJw.png" /><figcaption><em>True scaling is preparing for all three.</em></figcaption></figure><p>Live sports streaming builds a lot of humility — it requires a unique blend of multi-disciplinary headspace to get right, from clients, to CDNs, to cloud capacity, to gateways, to services and back. At JioHotstar we constantly have to learn to build for the next big wave, as our digital footprint continues to grow.</p><p>Adapting our scaling systems to recognize OR as a first-class citizen has been a major step in our scaling architecture.</p><p>As we tread calm waters and wait for the next tsunami, we have our metrics right, for now!</p><p>Don’t miss the earlier editions of this scaling series:</p><ul><li><a href="https://medium.com/hotstar/scaling-infrastructure-for-millions-datacenter-abstraction-part-2-42b04ef5ed6a">Scaling for Millions — Datacenter Abstraction</a></li><li><a href="https://medium.com/hotstar/scaling-infrastructure-for-millions-from-challenges-to-triumphs-part-1-6099141a99ef">Scaling for Millions — Challenges &amp; Triumphs</a></li></ul><p>Want to work on problems like these? Come join our team! We’re actively hiring across all teams. 
Please apply <a href="https://jobs.lever.co/jiostar?department=Digital%20%7C%20Engineering">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a862afe85345" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/scaling-tales-discovering-onboarding-rate-a862afe85345">Scaling Tales: Discovering Onboarding Rate</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Demuxed 2025 Talk — Server Guided Ad Insertion (SGAI)]]></title>
            <link>https://blog.hotstar.com/demuxed-2025-talk-server-guided-ad-insertion-sgai-193d7326d270?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/193d7326d270</guid>
            <category><![CDATA[demuxed]]></category>
            <category><![CDATA[scale]]></category>
            <category><![CDATA[dynamic-ad-insertion]]></category>
            <category><![CDATA[aisg]]></category>
            <dc:creator><![CDATA[Akash Saxena]]></dc:creator>
            <pubDate>Wed, 24 Dec 2025 05:29:55 GMT</pubDate>
            <atom:updated>2025-12-24T05:29:55.092Z</atom:updated>
<content:encoded><![CDATA[<h3>Demuxed 2025 Talk — Server Guided Ad Insertion (SGAI)</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*I0LLn1RRzscTzDha" /><figcaption>Photo by <a href="https://unsplash.com/@varpap?utm_source=medium&amp;utm_medium=referral">Vardan Papikyan</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><blockquote>JioHotstar (JHS) has a problem that no other streamer in the world has — monetising mid-rolls at world record levels of concurrency, not global, but in a single geography. Live sports rights fees are significant, which means that monetisation strategies must be uber resilient to match up.</blockquote><p>We solved this for large cohorts using our home-spun Server Side Ad Insertion (SSAI) tech, which we pioneered from 2019 onwards, when no other commercial provider could step up, and none can, at this scale. We took this further with our work on SGAI — for granular 1:1 targeting at scale.</p><p><a href="https://medium.com/u/af4f3dac1e63">Prachi Sharma</a> recently gave a talk about our pioneering work on Server Guided Ad Insertion (SGAI) at <a href="https://2025.demuxed.com/">Demuxed UK</a>. 
Here is her talk and the transcript follows.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FSlLD-gvX7nM%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DSlLD-gvX7nM&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FSlLD-gvX7nM%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/9b6584b5bfdb1405f0cd64f84a9e9cc2/href">https://medium.com/media/9b6584b5bfdb1405f0cd64f84a9e9cc2/href</a></iframe><h3>SGAI @ Scale</h3><p><a href="https://en.wikipedia.org/wiki/Cricket">Cricket</a> matches typically last around three hours with two innings, and streams are available in multiple languages and camera angles. Unlike sports with fixed commercial timeouts, cricket’s ad break structure is unpredictable.</p><p>While there are planned slots between overs, an over’s duration can vary significantly due to fast bowlers, umpire reviews, tense game moments, or strategic timeouts. Additionally, viewership is not steady, with key moments driving massive spikes in concurrency. Therefore, inserting ads in live cricket involves managing irregular, dynamic, and often short ad breaks where even milliseconds are crucial for revenue.</p><h3>Ad Insertion Strategies</h3><h4>Client-Side Ad Insertion (CSAI)</h4><p>In this model, the client app uses a two-player strategy: one for content and a separate one for ads, hoping they stay in sync. A <a href="https://en.wikipedia.org/wiki/SCTE-35">SCTE-35</a> marker in the live feed triggers an ad break, pausing the content player and spinning up the ad player. The client fetches the ad creative from the ad server, plays it, and then hands control back to the content player. 
While CSAI offers direct measurement and control, it has challenges such as susceptibility to ad blockers, buffering risks on low-powered devices or weak networks, and a clunky user experience due to the constant switching between players.</p><h4>Enter SSAI</h4><p>To address these issues, Server-Side Ad Insertion (SSAI) emerged as a more seamless and resilient solution. Instead of a two-player model, SSAI shifts the work to the server, which rewrites the original content manifest to include ad segments. The client then sees a single stream with ads and video delivered together.</p><p>When the SSAI system detects a SCTE-35 marker, it calls the ad server to fetch relevant ad creatives and rewrites the manifest to seamlessly integrate them. SSAI can insert ads in three ways: spot ads (burnt directly into the playout, same ad for everyone), cohort-level ads (users grouped, each cohort gets personalized ads), and one-to-one stitching (unique manifest for every user).</p><p>SSAI solves many CSAI problems: ads are harder to skip or block, there’s no sync drift, low-powered devices can handle playback as they decode a single stream, and the overall user experience is seamless, similar to broadcast TV.</p><p>JioHotstar built its SSAI infrastructure in-house, giving us control over the entire pipeline. The live stream goes from production to our playout systems, where operators manage the stream and insert ad markers based on a production feed and a director’s feed (which runs a few seconds ahead for reaction time).</p><p>From playout, the stream passes to origin servers, where cohorts are layered. Instead of every user getting a unique manifest, users are grouped into large cohorts based on targeting parameters like age, gender, location, and device type.</p><p>When the CDN requests a manifest for a cohort, the in-house stitching service interacts with the ad server to fetch ads for that cohort and rewrites the manifest to include the ad segments. 
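</p><p>The rewrite step can be pictured as a simplified playlist splice. This is a toy sketch only; the tags and handling are heavily simplified, and real SSAI also deals with segment timing, discontinuity sequencing, and encryption:</p>

```javascript
// Toy sketch of a cohort-level manifest rewrite: when the cue tag is
// seen, splice the cohort's ad segment lines into the media playlist.
// Heavily simplified — not the actual stitching service. Real SSAI
// handles timing, encryption, discontinuity sequences, and much more.
function stitchAdBreak(playlistLines, cueTag, adLines) {
  const out = [];
  for (const line of playlistLines) {
    out.push(line);
    if (line.startsWith(cueTag)) {
      // Mark the splice boundaries so players reset decoders across it.
      out.push("#EXT-X-DISCONTINUITY", ...adLines, "#EXT-X-DISCONTINUITY");
    }
  }
  return out;
}
```

<p>Because each cohort gets its own ad segment lines, the number of distinct manifests multiplies across streams, qualities, and cohorts.</p><p>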
These stitched segments are then pushed to the CDN, providing each cohort with a personalized stream.</p><h3>SSAI Limitations For Targeting</h3><p>SSAI also presented challenges for JioHotstar, both in streaming infrastructure and monetization. The number of manifests produced by the stitching service multiplied as the product of streams, qualities, and cohorts, leading to decreased CDN cache offload and increased compute time at the origin shield.</p><p>While theoretically possible, backend stitching for tens of millions of users in real-time is a difficult problem (as of writing, the JHS peak concurrency record stands at 62.5Mn concurrent viewers). On the monetization side, SSAI’s cohort-level operation limited granularity, making it difficult to target specific user segments (e.g., women in Mumbai if the cohort is only women in metro cities).</p><p>This also hindered performance campaigns requiring precise per-user targeting and conversion tracking. Reach and frequency campaigns were also difficult to execute due to a lack of impression-level control, making programmatic pipeline integration challenging as they rely on one-to-one matching. Over-delivery of ads and inability to enforce frequency capping were also issues. While SSAI provided reliability and control, it imposed limits on utilizing remnant inventory and scaling the solution.</p><h3>Et <em>Voilà</em>! SGAI</h3><p>This led to the development of Server-Guided Ad Insertion (SGAI). The core idea of SGAI is that the stream remains the same for everyone, but ad delivery changes. Instead of the server pre-stitching ads for many users, it sends a common manifest with ad opportunities marked via SCTE-35 tags, HLS interstitials, or MPEG xlink cues.</p><p>When a break occurs, the client follows the manifest and requests the ad system, at which moment the ad server decides which ad to show to that specific user. 
This results in a cache-friendly, common manifest across users, while ad segments delivered can vary. Importantly, even though the call is client-side, SGAI operates in a single-player world where ads and content are just segments in the same pipeline, without player juggling like in CSAI.</p><p>SGAI offers significant benefits for scalability, as the single manifest version can be easily cached on the CDN, and the CDN only needs to scale for network, not compute. The separation of content and ad delivery also allows both infrastructures to scale independently. On the monetization side, SGAI unlocks one-to-one serving, enabling frequency caps, programmatic pipelines, and richer region and frequency campaigns.</p><h3>SGAI Challenges</h3><p>However, SGAI also has challenges. It remains network-dependent, requiring the client to fetch ads quickly to prevent playback stalls. Full SGAI adoption requires consistent support across all platforms for a truly universal solution.</p><p>When JioHotstar began its SGAI journey, player technology was still evolving in the industry, with limited native support in open-source players for features like HLS interstitials and MPEG xlink cues. Given our vast Android user base relying on ExoPlayer for HLS streaming, we leveraged our existing custom ExoPlayer fork that already parsed HLS manifests. We added lightweight logic to this fork to intercept ad markers and trigger our ad logic without additional overhead.</p><p>We also exported our in-house server-side stitching logic into a client-side library, acting as a manifest interceptor decoupled from the player. 
This architecture allows for independent evolution of the player and the library, enables multi-CDN controls (routing ad segments and content to different CDNs based on bandwidth), and allows the same signaling mechanism to trigger companion rendering, L-Bands, or on-screen overlays, making the framework extensible.</p><p>Crucially, because the core logic is player-independent, the solution can be plugged into multiple platforms with minimal adoption effort, setting us up for broader device coverage.</p><h3>Workflow</h3><p>The SGAI process involves the player receiving a pristine manifest with SCTE-35 markers. The client-side library intercepts the manifest and calls a middleware service, which in turn talks to the ad server to get ads for the user. It fetches master and child manifests and returns relevant ad segments to the client. The client library then assembles these ad segments into a temporary manifest, which is handed back to the player. The client-side stitching is deliberately lightweight, solely focused on inserting ad segments.</p><p>A significant challenge faced after moving to SGAI was at the ad server. During an IPL break with millions of concurrent viewers, every client would simultaneously try to contact the ad server for ads when a SCTE-35 marker hit, causing a “thundering herd” problem and multi-million RPS spikes.</p><p>To mitigate this, our team explored adding a jitter on the client side. As the player typically buffers a few seconds of content after detecting a SCTE-35 marker, this window allows a small, dynamic jitter to be applied before contacting the ad server, spreading requests so that the server is not hit all at once.</p><p>The challenge lies in carefully tuning this jitter: too little and the spike remains; too much and there might not be enough time to stitch the segments. 
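</p><p>That tuning constraint can be captured in a few lines. This is a sketch under assumed numbers; the buffer and stitch budgets below are illustrative, not production values:</p>

```javascript
// Sketch: bound the client-side jitter by the playback buffer, after
// reserving time to fetch and stitch the ad segments. The budget
// numbers used with this function are illustrative assumptions only.
function adRequestDelayMs(bufferAheadMs, stitchBudgetMs, rand = Math.random) {
  // Never jitter beyond what the buffer can absorb once stitch time
  // is reserved; clamp to zero when the buffer is too small.
  const maxJitterMs = Math.max(0, bufferAheadMs - stitchBudgetMs);
  return Math.floor(rand() * maxJitterMs);
}
```

<p>With roughly 8 seconds of buffer and a 3-second stitch budget, for example, clients would spread their ad calls across up to 5 seconds instead of hitting the ad server in the same instant.</p><p>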
Another approach is pre-fetching ads for the following break, but the critical part is deciding when to pre-fetch to give the ad server enough time to process inventory and make ad decisions.</p><h3>Looking Ahead</h3><p>Looking ahead, JioHotstar has three main directions. First, extending SGAI beyond Android, integrating the manifest interceptor into other platforms due to its player independence. Second, enabling longer DVR windows, which introduces additional storage, latency, and measurement challenges requiring smarter coordination between the CDN, player, and ad system.</p><p>Finally, using the manifest interceptor beyond just ads for personalized replays, alternate commentary feeds, and potentially a future where every user streams their own personalized version of the game.</p><p>Want to work on improving SGAI for millions of concurrent customers and problems that no other company has to solve? Come join our team! We’re actively hiring in our Ads team. Please apply <a href="https://jobs.lever.co/jiostar?department=Software+Engineering">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=193d7326d270" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/demuxed-2025-talk-server-guided-ad-insertion-sgai-193d7326d270">Demuxed 2025 Talk — Server Guided Ad Insertion (SGAI)</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[JioHotstar Android App — Road to 99.9% CFUR]]></title>
            <link>https://blog.hotstar.com/jiohotstar-android-app-road-to-99-9-cfur-e6bdd1299558?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/e6bdd1299558</guid>
            <category><![CDATA[memory-profiling]]></category>
            <category><![CDATA[android-development]]></category>
            <category><![CDATA[android]]></category>
            <category><![CDATA[crashlytics]]></category>
            <dc:creator><![CDATA[Vrihas Pathak]]></dc:creator>
            <pubDate>Tue, 23 Dec 2025 05:44:54 GMT</pubDate>
            <atom:updated>2025-12-23T05:44:53.112Z</atom:updated>
            <content:encoded><![CDATA[<h3>JioHotstar Android App — Road to 99.9% CFUR</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nXXJDHWWB2I4Odtj" /><figcaption>Photo by <a href="https://unsplash.com/@edge2edgemedia?utm_source=medium&amp;utm_medium=referral">Edge2Edge Media</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><blockquote>First rule of building mobile apps — “Don’t Crash”</blockquote><p>At JioHotstar, we keep an eye on the Crash Free User Rate (<strong>CFUR</strong>) as a key metric of success. Over the years, we’ve achieved a 99.8% CFUR, which we are very proud of. Here we share the strategies we used to resolve persistent Out Of Memory (OOM) crashes, a major source of instability.</p><h3>Goal — 99.9%</h3><p>Our Android app consistently maintained a <strong>99.5%</strong> CFUR, no small feat given our customer base of millions. However, our ultimate goal, our North Star metric, is to achieve a <strong>99.9%</strong> CFUR.</p><h3>Strategy to improve CFUR</h3><p>To reach our goal and improve our CFUR, we rigorously monitored crashes through the Firebase Crashlytics dashboard.</p><ul><li><strong>Quick off the block</strong>: We started by targeting the most critical crashes affecting users. To achieve this, we prioritized fixing the <strong>top 10</strong> crashes based on their impact on user experience. This focused approach proved effective, and we successfully raised the app’s CFUR to an impressive <strong>99.5%</strong>. However, as we approached this milestone, we encountered a challenge — further improvements became increasingly difficult.</li><li><strong>Pushing Further: </strong>At 99.5% CFUR, the crashes we were addressing in the top 10 list were no longer having a significant enough impact to push the needle further. It became evident that a different strategy was needed to break through this wall.
We decided to expand our analysis, shifting our focus from the top 10 crashes to the <strong>top 25</strong>. Through this broader investigation, we uncovered a pattern: a large portion of these crashes were related to Out Of Memory (OOM) issues.</li></ul><h4>Out of Memory And Into Our Radar</h4><p>While each individual OOM crash had a relatively minor impact on its own, collectively these crashes were affecting approximately <strong>0.3%</strong> of our users. This discovery was crucial: it pointed to a category of crashes that had previously gone under the radar because of their low individual impact but were collectively significant in keeping us from achieving an even higher CFUR.</p><p>Further analysis revealed that these OOM issues were particularly problematic during <strong>large-scale live events (&gt;40Mn concurrent)</strong> within our app. During these events, many users with lower-RAM devices were onboarded, which made them more susceptible to OOM crashes. As a result, these devices disproportionately contributed to the surge in OOM-related crashes, significantly impacting our overall CFUR.</p><p>Memory-intensive operations such as playing multiple videos or switching between different content types led to increased memory consumption, which the app was unable to handle effectively. As a result, the app frequently crashed for approximately <strong>0.3%</strong> of users when it exceeded its allocated memory limits.</p><p>These crashes were particularly challenging because they were often triggered by background operations that retained excessive memory even when the app was not actively being used.</p><p>Recognizing the need for a solution, we embarked on a comprehensive analysis and resolution of these persistent OOM issues.
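</p><p>To make “exceeded its allocated memory limits” concrete, here is the back-of-the-envelope arithmetic behind a typical ART OutOfMemoryError report (a quick sketch; the numbers mirror the kind of values seen in our crash reports):</p>

```python
def heap_free_percent(growth_limit_bytes: int, free_bytes: int) -> float:
    """Share of the ART heap still free, as OOM messages report it."""
    return 100.0 * free_bytes / growth_limit_bytes

# A 512 MiB growth limit with only ~3 MB free leaves well under 1% of
# the heap free after GC, so even a ~2 KB allocation can fail once that
# remaining free space is fragmented.
print(round(heap_free_percent(536_870_912, 3_015_400), 2))  # → 0.56
```

<p>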
Through meticulous memory profiling and strategic fixes, we aimed to elevate our app’s performance to achieve our North Star metric of a 99.9% CFUR.</p><h3>Deep Seek: OOM Analysis!</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/480/0*qxO2M1zBk4QxfDCZ" /></figure><h4><strong>Crashlytics dashboard analysis</strong></h4><p>We first analyzed the top OOM issues in the Crashlytics dashboard. These were the top variations of the OOM issues:</p><pre>* com.google.common.collect.ImmutableSet.construct<br>* kotlinx.coroutines.CancellableContinuationImpl.resumedState<br>* com.hotstar.DaggerHsApplication_HiltComponents_SingletonC$ActivityCImpl.getViewModelKeys<br>* android.os.perfdebug.MessageMonitorImpl$MessageMonitorInfoImpl.markDispatch</pre><p>For all of the above issues, one common error message appeared in the stack trace:</p><blockquote>Fatal Exception: java.lang.OutOfMemoryError<br>Failed to allocate a 2064 byte allocation with 3015400 free bytes and 2944KB until OOM, target footprint 536870912, growth limit 536870912; giving up on allocation because &lt;1% of heap free after GC.</blockquote><h4><strong>Unpacking the fault</strong></h4><ul><li><em>Memory Allocation Failure:</em> The application failed to allocate a relatively small amount of memory (2064 bytes), even though there was seemingly enough free memory. This can happen when the heap is fragmented, so the free memory isn’t contiguous.</li><li><em>Garbage Collection Ineffectiveness</em>: The garbage collector couldn’t free up enough memory to satisfy the allocation request.
This indicates that the application holds on to a lot of memory, potentially due to memory leaks or inefficient memory usage.</li><li><em>Approaching OOM Threshold</em>: The application was close to its maximum allowed heap size (512MB), suggesting that it was consuming a lot of memory overall.</li></ul><p>We analyzed more stack-trace threads and found that the app was crashing mostly on the watch page (the page where our video player sits).</p><h4>User flow analysis using custom logs</h4><p>Unlike other crashes, identifying the root cause of OOM issues from the stack trace is challenging. Crashlytics reports the point of failure at the moment of the crash, but it does not pinpoint the actual underlying cause. OOM crashes can stem from various objects being retained in memory, eventually leading to a crash when the app exceeds its allocated memory limit.</p><h4><strong>Custom Tracing</strong></h4><p>Until this point, we had identified a recurring issue related to the watch page when users were consuming content. To better understand the problem, we leveraged the custom logs implemented within the app. These logs provide detailed insights into the app’s and player’s states, allowing us to track the app’s behavior over time.</p><p>From the application side, we have been tracing detailed information about the app’s state, including whether it is in the foreground or background, as well as the player’s state, such as when the user starts and stops the player. By analyzing this timeline data, we can correlate the app’s state with user actions, providing a clearer picture of when and why OOM crashes occur.</p><h4><strong>Key Findings</strong></h4><p>While checking the logs, we found that most of the user content playing sessions were in the downloads flow due to an “offline mode” flag. Additionally, we noticed a pattern where users frequently switched between different downloaded contents within a single session.
This pattern was prevalent among most users experiencing OOM crashes.</p><h4>Memory Profiling</h4><p>We initiated our investigation by profiling the offline user journey. Using the Android Profiler, we performed memory profiling on our application. Our approach involved replicating the exact user journey patterns observed in Crashlytics to ensure accurate and relevant profiling.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fUpNb_369LpIihtM" /></figure><h4><strong>Analysis and Observations</strong></h4><p>We analyzed the heap dump and memory usage patterns of the app during various user journey steps. The key observations were as follows:</p><ul><li><strong>High Retained Sizes</strong><br>We observed that objects from third-party libraries such as LottieCompositionCache and EmojiCompat exhibited high retained sizes. Ideally, these objects should be garbage collected by the system’s Garbage Collector (GC) once the user journey is completed. However, they were not being released as expected, contributing to memory retention issues.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Jbl4Q_ESUgNWNCSjwgiGlQ.png" /><figcaption>High retained sizes of third-party libraries</figcaption></figure><ul><li><strong>Network Call Retention<br></strong>We use separate API calls to track client-side events whenever the user is interacting with the app. Using the network debugger tool, we identified that these calls were being retried continuously after playing content from the Downloads. They were failing due to offline network conditions but were being retried repeatedly. This behavior led to substantial memory allocation for network constructs such as SegmentPool (okio) and CipherSuite (okhttp3), exacerbating memory usage.</li><li><strong>Incremental Memory Usage<br></strong>During the same session, each time the player was opened for new content, the app’s memory usage increased by approximately 10 MB.
This cumulative increase in memory usage eventually led to app crashes, particularly during sessions with frequent content switching.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/930/1*1e8AUv0HuE2Dkfb0_0T6Qg.png" /><figcaption>Memory graph of the app in the profiler</figcaption></figure><h4><strong>Additional Profiling</strong></h4><p>In addition to profiling the offline user journey, we extended our analysis to cover other parts of the app as well. The observations in this section included:</p><ul><li><strong>ViewModel Leaks<br></strong>A ViewModel instance was found to be leaking after the user navigated back from a specific tab and then exited the app. This suggested that resources were not being properly released during navigation, leading to memory leaks.</li><li><strong>Retained Views<br></strong>Some views were retained in memory even after the user exited a screen with multiple players using the back button. These views were holding references to a singleton class, highlighting a need for better management of view lifecycles and dependency handling to prevent resource retention and ensure proper cleanup.</li></ul><h3>Fixing It All — Tune Up and Go 🚀</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/0*S_3VEjTFWRVWpqFY.gif" /></figure><p>Through memory profiling, we identified several critical issues contributing to OOM crashes, as follows:</p><h4>Lottie Composition Cache Retention</h4><ul><li>Lottie, a third-party library used for rendering images and animations, was retaining its Lottie Composition Cache object, which was not being collected by the Garbage Collector (GC). This object had a substantial retained size of around <strong>8 MB</strong>.
For our use case, we didn’t want Lottie to cache anything; in cases where Lottie’s cache is used so that image resources don’t have to be loaded every time, this retention would be valid.</li><li>To mitigate this, we manually cleared the Lottie library cache during the destruction of the app’s Main Activity and the disposal of Page UI. The Lottie library provides an API for clearing its cache, which we utilized to ensure the cache was properly released. Following this change, the Lottie Composition Cache was no longer retained in memory.</li></ul><h4>Emoji Compat Library Optimization</h4><ul><li>The Emoji Compat support library, designed to keep Android devices updated with the latest emojis and prevent the display of missing emoji characters, was retaining memory. The retained size of this library was approximately <strong>352 KB</strong>.</li><li>We placed the initialization of this library under a feature flag and subsequently disabled it. By disabling the Emoji Compat library, we ensured that it was not retained in memory, thereby reducing unnecessary memory usage.</li></ul><h4>High Memory Allocation of OkHttp resources</h4><ul><li>When a user watches content, our app’s internal library triggers client-side events via API calls. If a call fails, events are stored in a queue, and the library continuously retries until successful transmission.</li><li>In offline mode, as users watch downloaded content, these API calls fail due to lack of network connectivity, causing the events to remain queued and retried, leading to high memory allocation for OkHttp objects like Segment Pool and Cipher Suite. This problem worsened whenever a new player instance was initiated.</li><li>To address this, we modified the library to halt API calls when no network is detected, queuing events for transmission only when connectivity is restored.
This change significantly reduced unnecessary memory allocation and overall app memory usage.</li></ul><h4>View Model Memory Leak Fix</h4><ul><li>A view model was leaking memory when users navigated to a specific tab. The issue was traced to a handler retained by the view model to trigger certain actions, preventing proper garbage collection of the view model instance.</li><li>To address this, we refactored the view model to avoid direct usage of the handler. Instead, an event system was implemented where the view model fires an event, which is then captured within the UI containing the handler. Upon receiving the event, the handler triggers the required actions. This approach ensured that the view model did not hold unnecessary references, effectively eliminating the memory leak.</li></ul><h3>Results</h3><p>We incorporated all the aforementioned fixes into our build. Following the deployment, we closely monitored user adoption and performance metrics via the Crashlytics dashboard. The results were notable: all OOM issues previously listed among the top 10 crashes were resolved. Consequently, the Crash Free User Rate for that build exceeded <strong>99.8%</strong></p><h3>Do’s and Don’ts for Addressing OOM Issues</h3><h4>Do’s</h4><ul><li><strong>Conduct Comprehensive Memory Profiling</strong><br>Regularly perform memory profiling using tools such as Android Profiler to understand memory usage patterns and identify potential memory leaks.</li><li><strong>Implement Efficient Memory Management</strong><br>Use appropriate data structures and algorithms that minimize memory usage. 
Release unused resources promptly, such as database connections, file handles, and bitmap objects, to prevent memory bloat.</li><li><strong>Optimize Third-Party Libraries</strong><br>Clear caches and release retained objects from third-party libraries like Lottie and EmojiCompat to avoid unnecessary memory retention.</li><li><strong>Monitor and Log App States</strong><br>Implement custom logging to track app states. Use these logs to analyze user behavior and pinpoint memory usage spikes.</li><li><strong>Profile Specific User Journeys</strong><br>Replicate and profile specific user journeys, particularly those that are prone to OOM issues, to gain insights into memory consumption during those activities.</li><li><strong>Manage Network Calls Effectively</strong><br>Integrate network state checks to prevent unnecessary retries of network calls in offline mode, reducing memory allocation for network-related objects.</li><li><strong>Refactor Code to Avoid Memory Leaks</strong><br>Refactor ViewModels and other components to eliminate direct usage of resources that prevent garbage collection.</li><li><strong>Utilize Feature Flags</strong><br>Use feature flags to control the initialization and usage of memory-intensive libraries and features, enabling you to disable them when not needed.</li></ul><h4>Don’ts</h4><ul><li><strong>Avoid Retaining Unnecessary References</strong><br>Do not retain references to objects that are no longer needed, as this prevents garbage collection and leads to memory leaks.</li><li><strong>Do Not Overlook Background Services</strong><br>Background services that continuously sync data, download files, or process information can consume significant memory. 
Ensure they are managed and released properly.</li><li><strong>Avoid Large Data Operations Without Optimization</strong><br>Performing operations on large datasets or loading large files (e.g., images, videos) without efficient memory handling can lead to excessive memory consumption.</li><li><strong>Do Not Rely Solely on Garbage Collection</strong><br>Relying only on the system’s garbage collection to manage memory can be ineffective. Actively manage and release resources to maintain optimal memory usage.</li><li><strong>Avoid Singletons for Memory-Intensive Objects</strong><br>Using singleton patterns for objects that hold significant memory can lead to retention issues. Ensure that such objects are released when no longer needed.</li></ul><p>Want to focus on building customer experiences that are not only delightful but also demand high stability? Come join our team! We’re actively hiring in our Android team. Please apply <a href="https://jobs.lever.co/jiostar?department=Software+Engineering">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e6bdd1299558" width="1" height="1" alt=""><hr><p><a href="https://blog.hotstar.com/jiohotstar-android-app-road-to-99-9-cfur-e6bdd1299558">JioHotstar Android App — Road to 99.9% CFUR</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Preventing Performance Regressions on iOS Apps]]></title>
            <link>https://blog.hotstar.com/preventing-performance-regressions-on-ios-apps-be3bd033f1dc?source=rss----dbc3fcbc7f07---4</link>
            <guid isPermaLink="false">https://medium.com/p/be3bd033f1dc</guid>
            <category><![CDATA[regression-analysis]]></category>
            <category><![CDATA[ios]]></category>
            <category><![CDATA[xcuitest]]></category>
            <category><![CDATA[performance]]></category>
            <dc:creator><![CDATA[Saurabh Kapoor]]></dc:creator>
            <pubDate>Fri, 12 Dec 2025 10:49:44 GMT</pubDate>
            <atom:updated>2025-12-12T10:49:44.051Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sLYaRXNYMv7IAGShtOWR4g.png" /></figure><blockquote>Dive into the nitty-gritties of how we ensure that our iOS performance remains top-notch, even with so many changes going in on a weekly basis, so that our customers always get the best experience.</blockquote><h3>Introduction</h3><p>Have you ever faced the frustration of using an app that just couldn’t keep up? You know, the kind that lags, stutters, and sometimes feels like it’s working against you rather than for you? If you’ve ever found yourself wondering why an app performs poorly on your device, you’re not alone.</p><p>In a world where competition is fierce and user expectations are higher than ever, performance isn’t just a nice-to-have — it’s table stakes. That’s why we at JioHotstar are on a mission to deliver nothing short of a blazing-fast experience to our millions of users, regardless of the device they’re using.</p><p>Our journey to optimal performance wasn’t without its challenges. We started by tackling the major performance issues head-on, fixing jank and glitches across our pages. Each victory felt like a triumph, a step closer to our goal of a seamless user experience. But our celebrations were short-lived. Time and time again, issues resurfaced, undoing our hard-earned progress.</p><blockquote>It was like chasing our own tail, stuck in a frustrating cycle of fix and repeat.</blockquote><p>We knew there had to be a better way — a way to identify and fix problems before they ever reached our users’ screens. Performance. Responsiveness. They’re not glamorous tasks. When done properly, nobody is going to thank you. When done incorrectly, app retention is going to suffer.</p><h3>How do you even start testing for performance?</h3><p>Our primary challenge lay in the testing environment itself.
Developers often rely on a single high-performance device or simulator, which doesn’t reflect the diverse realities of user experiences. This setup overlooks scenarios like memory constraints or battery saver mode, which can significantly impact performance, much like navigating congested traffic.</p><p>Manual testing also falls short in detecting subtle performance issues, such as app launch speed or scrolling smoothness, as these can be subjective. Different people perceive smoothness differently, making it difficult to measure performance accurately.</p><p>To address these limitations, we shifted to automation, aiming to improve efficiency and objectively identify issues. We explored two approaches to automating performance testing, each presenting its own set of challenges.</p><h3>Leveraging E2E Tests along with Instruments to detect performance issues</h3><p><a href="https://developer.apple.com/videos/play/wwdc2019/411/">Instruments</a>, Apple’s powerful performance analysis tool, is deeply integrated with Xcode, allowing developers to profile and debug their applications using a variety of specialized templates. Its real-time data and intuitive visualizations are invaluable for uncovering performance bottlenecks and gaining insights into app behavior.</p><p>However, as we delved deeper into detecting performance issues using automation, some crucial questions emerged:</p><blockquote><em>How do we automate Instruments? <br>Is there a way to detect performance issues without introducing another layer of testing complexity? <br>Aren’t our existing unit, integration, and E2E tests sufficient?</em></blockquote><p>Our ultimate goal was to integrate performance monitoring directly into our existing End to End (E2E) tests, ensuring a streamlined workflow without sacrificing thoroughness.</p><p>Thankfully, xctrace offered a solution.
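</p><p>One way to do this is to have the test driver start an Instruments recording while the XCUITests execute. A sketch of building that invocation (Python; the template, device name, and even the exact xctrace flags are assumptions here and should be checked against <code>xcrun xctrace record --help</code> for your Xcode version):</p>

```python
import subprocess

def xctrace_record_cmd(template: str, device: str, output: str) -> list:
    """Assemble an `xcrun xctrace record` command to run alongside the
    E2E tests. Flag names are assumptions to verify against --help."""
    return [
        "xcrun", "xctrace", "record",
        "--template", template,   # e.g. "Time Profiler"
        "--device", device,       # device name or UDID
        "--output", output,       # writes a .trace bundle
        "--all-processes",
    ]

cmd = xctrace_record_cmd("Time Profiler", "iPhone 14", "e2e.trace")
# subprocess.Popen(cmd) would start the recording in parallel with the tests
print(cmd[0:3])
```

<p>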
By running our E2E tests while simultaneously capturing performance metrics, we could weave performance monitoring directly into our established testing routine. Here’s how our solution unfolded:</p><ol><li><strong>Integrating xctrace with E2E Tests:</strong> We started by running a single E2E test, incorporating performance profiling using the xctrace command. This allowed us to monitor performance in real-time without disrupting our existing tests.</li><li><strong>Generating Detailed Reports:</strong> After each test, we generated a .trace report that captured a comprehensive dataset of performance metrics. This report provided a detailed view of the app’s behavior under real-world conditions, helping us spot potential issues early.</li><li><strong>Data Extraction and Analysis:</strong> Using the xctrace export command, we extracted and analyzed the performance data. This step was crucial for converting raw metrics into actionable insights, allowing us to pinpoint specific areas needing improvement.</li></ol><p>A sample E2E flow would look something like the following:</p><pre>    func testWatchNowOnHomePage() throws {<br>        let homePage = HomePage()<br>        homePage.tapOnTabBar(item: .home)<br>        let button = app.buttons.containing(NSPredicate(format: &quot;identifier CONTAINS[c] %@&quot;, &quot;watch_now&quot;))<br>            .lastMatch<br>        button.tap()<br>        app.buttons[&quot;HSTitleBar.rightBtn&quot;].tap()<br>        app.swipeDown()<br>    }</pre><p>Now, in order to detect performance issues in this flow, we could run Instruments in parallel using the xctrace command.</p><p>This appeared promising, but challenges emerged in managing large volumes of raw debug data, dealing with inconsistencies, and developing an effective reporting system:</p><ul><li><strong>Managing Large Data:</strong> Each test run produced massive amounts of raw debug data, posing a significant storage challenge.
Even just recording for 20 seconds could create a file as big as 200 MB, and if we recorded for a few minutes for larger E2E tests, the file size could easily reach GBs. This made it really hard to manage and store all the recordings.</li><li><strong>Dealing with Inconsistencies:</strong> The collected data exhibited inconsistencies, with traces sometimes being skipped or incomplete, making it challenging to pinpoint the root cause of regressions reliably.</li><li><strong>Developing a Reporting System: </strong>Creating a reporting system to handle raw trace files posed significant challenges. The goal was to parse through this data, filter out irrelevant details, and extract actionable insights that directly impact the application’s performance. However, not all information was easily extractable, complicating the development of a robust parser and making it challenging to deliver a comprehensive reporting solution.</li></ul><p>The above challenges made it clear that continuing with this approach would introduce more overhead and uncertainty than benefit, prompting us to explore alternative methods for automating performance testing.</p><h3>Harnessing Performance Tests for Optimal Results</h3><p>As we looked into alternatives, we came across XCTMetric and found that it offers various options to help us detect important issues across different aspects of our app’s performance while running our tests.</p><p>For example, there is:</p><ul><li><strong>XCTCPUMetric</strong>, which helps us monitor how much of the device’s CPU our app is using. This can be really useful for identifying if our app is using too much CPU power, which could slow down the device or drain its battery quickly.</li><li><strong>XCTMemoryMetric</strong>, which helps us keep track of how much memory our app is using.
This is important because if our app uses too much memory, it could cause the device to slow down or even crash.</li><li><strong>XCTStorageMetric</strong> for monitoring disk storage usage.</li><li><strong>XCTNetworkTransferMetric</strong> for tracking network performance.</li><li><strong>XCTOSSignpostMetric</strong> for capturing specific points in our code where performance might be an issue.</li></ul><p>By using these different types of XCTMetrics, we can get a comprehensive view of our app’s performance and quickly identify any areas that need improvement.</p><h4>Getting Started with writing Performance Tests</h4><p>At this point, we had a basic understanding of XCTMetrics, and the next step was to develop performance tests to discover any regressions.</p><p>We opted to take an incremental approach, beginning with a proof of concept focused on two primary metrics of interest:</p><ol><li>Hitches (utilizing XCTOSSignpostMetric)</li><li>Memory Leaks (leveraging XCTMemoryMetric)</li></ol><p>We began by crafting individual tests for both Hitches and Memory Leaks. Below are example performance tests for each:</p><p><strong>Hitches</strong></p><p>The test evaluates the scrolling performance of our home page by using the pre-defined scrollingAndDecelerationMetric. The test simulates scrolling actions—twice upwards and twice downwards—and captures performance data for five iterations.
This process helps us assess the smoothness and responsiveness of the scrolling experience.</p><pre>func testHomeScrollPerformance() {<br>        measure(<br>            metrics: [XCTOSSignpostMetric.scrollingAndDecelerationMetric]<br>        ) {<br>            let app = XCUIApplication()<br>            app.swipeUp()<br>            app.swipeUp()<br>            stopMeasuring()<br>            app.swipeDown()<br>            app.swipeDown()<br>        }<br>    }</pre><p><strong>Leaks</strong></p><p>The test below identifies memory leaks that may occur during navigation between the home page and the detail page. Using the XCTMemoryMetric, the test simulates clicking on an item on the home page to navigate to the detail page. This process is repeated for five iterations to assess the stability of memory usage over multiple transitions.</p><pre>func testHomeToDetailNavigation() {<br>        measure(metrics: [XCTMemoryMetric(application: app)]) {<br>            clickTrayWidget(tray)<br>        }<br>    }</pre><h4>Performance Tests in Action</h4><p>After running the above tests, we were surprised to discover that none of the expected metrics were present in the results. Further investigation revealed that the performance tests yielded the expected fields only when executed on a physical device, rather than on simulators.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/476/1*t8RJOLQJsJCdxnJl1vyCJA.gif" /></figure><p>We gave it a go on an actual device and were thrilled to finally witness all the essential information we anticipated from the performance tests. 🎉🥳</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gtPQfQbVYgj3lUqslgwMPw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HXH64U0FtY6JMfDboQA3yA.jpeg" /></figure><p>At this point we were able to run the tests as we expected, and had the data we planned for.
<strong>However, even though we would find a problem, we still weren’t sure which change in our app would have caused it.</strong></p><h4>Introducing Performance Test Diagnostics</h4><p>At this stage, we learned from <a href="https://developer.apple.com/documentation/xcode-release-notes/xcode-13-release-notes#New-Features">Apple’s documentation</a> that starting with Xcode 13, using the enablePerformanceTestsDiagnostics=YES flag with the xcodebuild command would include additional attachments in the .xcresult file for each performance test. Intrigued by this feature, we decided to run our tests via the command line with this flag enabled.</p><h4>Running Tests via the command line</h4><p>To run tests via the command line, use the xcodebuild command with the appropriate parameters to specify the project, scheme, and destination. For example:</p><pre>xcodebuild test -scheme OurSchemeName -destination &#39;platform=iOS,name=iPhone 14&#39;</pre><p>After executing the tests with this flag, we observed significant differences. A memgraph was generated for our MemoryLeakTests using XCTMemoryMetric, and a .ktrace file was produced for our ScrollPerformanceTests with XCTOSSignpostMetric. These files provided detailed insights into memory usage and performance events, offering valuable data for further analysis.</p><h4>How did we proceed further?</h4><p>At this time, we had a few sample performance tests that could output .xcresult files with both metrics and attachments. However, the .xcresult files and attachments had to be downloaded manually, which required a substantial amount of time and work. We wanted the development process to be as simple as possible, with little effort required.</p><p>The objective was to run the performance tests and provide the results to the developer, along with any relevant attachments in case a regression was identified.</p><p>This is when we decided to build our own parser.
The parser was designed with three primary objectives in mind:</p><ol><li>Extract performance metrics from the xcresult file corresponding to various types of performance tests</li><li>Retrieve attachments from the xcresult file and store them at designated paths</li><li>Store the test results and performance metrics to create our own baselines</li></ol><h4>How did we create the parser?</h4><p>The parser uses <strong>xcresulttool</strong>, a command-line tool provided by Apple, to inspect the result bundles:</p><pre>xcrun xcresulttool get --format json --path &lt;xcresult-bundle-path&gt; --id &lt;value&gt;</pre><p>The output is a well-structured JSON model:</p><pre>{<br>  &quot;_type&quot;: {<br>    &quot;_name&quot;: &quot;ActionTestPlanRunSummaries&quot;<br>  },<br>  &quot;summaries&quot;: {<br>    &quot;_type&quot;: {<br>      &quot;_name&quot;: &quot;Array&quot;<br>    },<br>    &quot;_values&quot;: [<br>      {<br>        &quot;_type&quot;: {<br>          &quot;_name&quot;: &quot;ActionTestPlanRunSummary&quot;,<br>          &quot;_supertype&quot;: {<br>            &quot;_name&quot;: &quot;ActionAbstractTestSummary&quot;<br>          }<br>        },<br>        &quot;name&quot;: {<br>          &quot;_type&quot;: {<br>            &quot;_name&quot;: &quot;String&quot;<br>          },<br>          &quot;_value&quot;: &quot;Test Scheme Action&quot;<br>        },<br>        &quot;testableSummaries&quot;: {<br>          &quot;_type&quot;: {<br>            &quot;_name&quot;: &quot;Array&quot;<br>          },<br>          &quot;_values&quot;: []<br>        }<br>      }<br>    ]<br>  }<br>}</pre><p>In summary, the parser recursively invokes the command above, following the id of every nested object reference, to extract the relevant information from the xcresult file. 
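</p><p>To make that step concrete, here is a minimal Python sketch (our own illustration, not the production parser) of collapsing one such xcresulttool JSON response into plain values. A real parser would additionally re-run xcresulttool with the id of every nested object reference it encounters:</p>

```python
# Minimal sketch: collapse the "_type" / "_value" / "_values" wrappers of an
# xcresulttool JSON response into plain Python values.
def flatten(node):
    if not isinstance(node, dict):
        return node
    if "_value" in node:        # typed scalar leaf
        return node["_value"]
    if "_values" in node:       # typed array
        return [flatten(item) for item in node["_values"]]
    # plain object: drop the type descriptor, flatten every field
    return {key: flatten(value) for key, value in node.items() if key != "_type"}

# The sample model from above, abbreviated:
sample = {
    "_type": {"_name": "ActionTestPlanRunSummaries"},
    "summaries": {
        "_type": {"_name": "Array"},
        "_values": [
            {
                "_type": {"_name": "ActionTestPlanRunSummary"},
                "name": {"_type": {"_name": "String"}, "_value": "Test Scheme Action"},
            }
        ],
    },
}

print(flatten(sample))  # {'summaries': [{'name': 'Test Scheme Action'}]}
```

<p>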
Ultimately, the result is a complete parsing of the xcresult file into a JSON object.</p><p>The model details and their type information can also be obtained from xcresulttool, which can provide comprehensive information about the types, including their properties and associated data types.</p><h4>Export Attachments</h4><p>After successfully parsing all the data, the next step was exporting the attachments. This can be done with the following command:</p><pre>xcrun xcresulttool export --path &lt;xcresult-bundle-path&gt; --output-path &lt;output-path&gt; --id &lt;attachment-id&gt; --type file</pre><h4>Parser in Action</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2HlGL57y33LWRTgwV-dP-A.gif" /><figcaption>Our parser taking the xcresult file(s) and returning the performance data</figcaption></figure><p>In conclusion, implementing automated performance testing with the approaches mentioned above posed significant challenges, but the journey has been transformative. As we continue to refine and evolve our methodologies, we remain committed to delivering optimal performance and reliability to our users, ensuring they receive nothing short of exceptional quality from our applications.</p><p>If you’re kicked about working on problems like these, come join our team! We’re actively hiring for our iOS team. Please apply <a href="https://jobs.lever.co/jiostar?department=Software%20Engineering">here</a>.</p><hr><p><a href="https://blog.hotstar.com/preventing-performance-regressions-on-ios-apps-be3bd033f1dc">Preventing Performance Regressions on iOS Apps</a> was originally published in <a href="https://blog.hotstar.com">JioHotstar</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>