Introduction

In 2022, Netflix migrated its iOS and Android applications from Falcor, its existing internal API framework, to GraphQL. The migration was a complete overhaul from the client to the API layer and was accomplished with zero downtime. This article summarizes the strategies and tools Netflix used to perform the migration safely for hundreds of millions of customers.

The Migration Plan

Before transitioning to GraphQL, Netflix’s API layer was built on a monolithic Falcor server maintained by the API Team. The migration plan was executed in two phases:

Phase 1: Creation of a GraphQL Shim Service

During the summer of 2020, a GraphQL shim was developed on top of the existing Falcor API. This allowed client engineers to adopt GraphQL quickly without being blocked by server-side migrations. The shim let client engineers experiment with client-side concerns such as cache normalization and investigate client performance. AB Testing was used to launch Phase 1 safely.
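To make the idea of a shim concrete, here is a minimal sketch of a resolver-style function that answers a GraphQL-shaped request by reading from a path-based legacy API. Every name in it (`LegacyPathApi`, `makeFakeLegacyApi`, `resolveVideo`, the `videos` path) is hypothetical and chosen for illustration; it does not reflect Netflix's actual code or the real Falcor client API, and the legacy call is synchronous here only to keep the example short.

```typescript
// Hypothetical sketch: none of these names come from Netflix's codebase.
type Path = (string | number)[];

// Stand-in for the legacy path-based API (synchronous here for brevity).
interface LegacyPathApi {
  getValue(path: Path): unknown;
}

// An in-memory fake that walks a nested object along the given path.
function makeFakeLegacyApi(data: Record<string, unknown>): LegacyPathApi {
  return {
    getValue(path: Path): unknown {
      let node: unknown = data;
      for (const key of path) {
        if (node === null || typeof node !== "object") return undefined;
        node = (node as Record<string, unknown>)[String(key)];
      }
      return node;
    },
  };
}

// The shape a GraphQL client would expect back for a video.
interface Video {
  id: number;
  title: string | null;
}

// Shim "resolver": translate a typed field lookup into legacy path reads.
function resolveVideo(api: LegacyPathApi, id: number): Video {
  const title = api.getValue(["videos", id, "title"]);
  return { id, title: typeof title === "string" ? title : null };
}

const api = makeFakeLegacyApi({ videos: { 42: { title: "Example Title" } } });
const video = resolveVideo(api, 42);
console.log(video);
```

The point of the design is that clients see only the typed GraphQL-style result, so the data source behind the resolver can later be swapped without changing client code.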

Phase 2: Deprecation of Legacy API and Adoption of Federated GraphQL

In the second phase, the legacy Falcor API was deprecated and Federated GraphQL was adopted. This approach allowed specific domain teams to independently manage and own sections of the API. Sticky Canaries and Replay Testing were used to ensure a smooth transition and functional correctness.

Testing Strategies: A Summary

Three key testing strategies were employed. Which one applied depended on whether the requirement under test was functional or non-functional, and on whether the requests involved were idempotent:

  1. AB Testing: Netflix traditionally uses AB Testing to evaluate new product features. For the GraphQL migration, AB Testing was used to compare the legacy Falcor stack with the new GraphQL client. This helped identify potential issues and provided insight into overall customer impact.

  2. Replay Testing: Replay Testing was used to verify the functional correctness of the new GraphQL APIs. By comparing responses from the GraphQL Shim and the new Video API service, engineers gained confidence that the business logic had been replicated correctly.

  3. Sticky Canaries: Sticky Canaries, an infrastructure experiment, allocated customers to either a canary or a baseline host for the duration of the experiment. By running two instances of the GraphQL gateway with different schemas, Netflix monitored performance, error rates, logs, and resource utilization to ensure a seamless migration.
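The mechanics behind the last two strategies can be sketched in a few lines. The example below is illustrative only and makes several assumptions: `diffResponses` shows one simple way a replay test could compare two JSON responses field by field, and `assignCluster` shows one way to make allocation "sticky" by hashing a customer ID together with an experiment ID. The function names, the path notation, the hash, and the 50/50 split are all hypothetical and are not taken from Netflix's tooling.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isObject(v: Json): v is { [key: string]: Json } {
  return v !== null && typeof v === "object" && !Array.isArray(v);
}

// Replay testing (sketch): list the paths at which two responses differ.
function diffResponses(a: Json, b: Json, path = "$"): string[] {
  if (Array.isArray(a) && Array.isArray(b)) {
    const diffs: string[] = a.length === b.length ? [] : [`${path}.length`];
    const n = Math.min(a.length, b.length);
    for (let i = 0; i < n; i++) {
      diffs.push(...diffResponses(a[i], b[i], `${path}[${i}]`));
    }
    return diffs;
  }
  if (isObject(a) && isObject(b)) {
    // Union of keys from both sides, sorted for stable output.
    const extra = Object.keys(b).filter((k) => !(k in a));
    const keys = Object.keys(a).concat(extra).sort();
    const diffs: string[] = [];
    for (const key of keys) {
      if (!(key in a) || !(key in b)) diffs.push(`${path}.${key}`);
      else diffs.push(...diffResponses(a[key], b[key], `${path}.${key}`));
    }
    return diffs;
  }
  // Scalars, null, or mismatched container types.
  return a === b ? [] : [path];
}

// Simple deterministic string hash (illustrative, not cryptographic).
function hashString(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (hash * 31 + input.charCodeAt(i)) >>> 0;
  }
  return hash;
}

type Cluster = "canary" | "baseline";

// Sticky allocation (sketch): the same customer always lands on the same
// cluster for a given experiment, because the choice depends only on the IDs.
function assignCluster(customerId: string, experimentId: string): Cluster {
  const bucket = hashString(`${experimentId}:${customerId}`) % 2;
  return bucket === 0 ? "canary" : "baseline";
}

const legacy: Json = { title: "Example", tags: ["a", "b"] };
const candidate: Json = { title: "Example", tags: ["a", "c"] };
console.log(diffResponses(legacy, candidate));
console.log(assignCluster("customer-1", "experiment-1"));
```

An empty diff means the two responses matched on every field. A strict comparison like this only makes sense for idempotent requests, where replaying the same request is expected to return the same data.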

Conclusion

Migrating critical traffic and evolving APIs are inevitable tasks for engineers. Netflix’s successful migration to GraphQL serves as an excellent example of how to accomplish this safely and with zero downtime. The use of AB Testing, Replay Testing, and Sticky Canaries allowed Netflix to confidently transition to Federated GraphQL and empower domain teams to manage specific sections of the API independently.

These techniques are not specific to Netflix. By adopting and adapting them, engineering teams can carry out complex API migrations while maintaining a seamless user experience.

				
References:

https://netflixtechblog.com/migrating-netflix-to-graphql-safely-8e1e4d4f1e72
