In these days of discussing how Facebook, Twitter, Foursquare, Tumblr, and others scale, I don't often think of eBay anymore. Yet eBay, despite its age and ugly UI, is still one of the largest sites on the internet, especially given its global reach. So I enjoyed this QCon talk by Randy Shoup, eBay Chief Engineer, about Best Practices for Large-Scale Websites.
Here are a few lessons that caught my eye:
- Partition functions as well as data: eBay has 220 clusters of servers, each running a different function like bidding, search, etc. This is the same model Amazon and others use
- Asynchrony everywhere: The only way to scale is to let events flow asynchronously throughout the app (a minimal sketch follows the list)
- Consistency isn't a yes/no issue: A few datastores require immediate consistency (Bids), most can handle eventual consistency (Search), and a few need no consistency guarantees at all (Preferences)
- Automate everything and embed machine learning in your automation loops so the system improves on its own
- Master-detail storage is written detail first, then master. If a transaction fails in the middle, eBay would rather have unreachable detail rows than a partial master record. Reconciliation processes clean up the orphaned details (see the sketch after the list)
- Schema-wise, eBay is moving away from strict relational schemas towards key/value pairs and object storage (illustrated after the list)
- Transitional states are the norm. Most of the time eBay is running multiple versions of its systems in parallel; it's rare for all parts of a system to be in sync. That makes backwards compatibility essential (see the versioned-event sketch below)
- "It is fundamentally the consumer's responsibility to manage unavailability and quality-of-service violations." In other words: expect and prepare for failure
- You have only one authoritative source of truth for each piece of data but many secondary sources (caches, search indexes, and the like), which are often the better place to read from than the system of record itself
- There is never enough data: Collect everything; you never know what you'll need. eBay processes 50TB of new data per day and analyzes 50PB of data per day. Predictions in the long tail require massive amounts of data
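
To make the asynchrony point concrete, here's a minimal sketch of events flowing through a queue, using Python's standard library as a stand-in for a real message bus. The event shape, the `BidPlaced` name, and the handler are my own illustrations, not eBay's actual APIs:

```python
import queue
import threading

events = queue.Queue()

def place_bid(item_id, amount):
    """Accept the bid immediately; downstream work happens off the request path."""
    events.put({"type": "BidPlaced", "item": item_id, "amount": amount})

def worker():
    """Drain the queue; in a real system, search indexing, notifications,
    etc. would each consume these events independently."""
    while True:
        event = events.get()
        if event is None:  # shutdown sentinel
            break
        print(f"processing {event['type']} for item {event['item']}")
        events.task_done()

threading.Thread(target=worker, daemon=True).start()
place_bid("item-123", 42.50)
events.join()     # block until the bid event has been handled
events.put(None)  # stop the worker
```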
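The master-detail write order is easiest to see in code. Below is a sketch under some loud assumptions: two in-memory dicts play the role of the master and detail tables, and `reconcile()` stands in for the cleanup job mentioned above; all names are hypothetical:

```python
import uuid

master = {}   # order_id -> {"detail_ids": [...]}
details = {}  # detail_id -> {"order_id": ..., "sku": ...}

def write_order(skus, fail_before_master=False):
    order_id = str(uuid.uuid4())
    # 1. Write detail rows first. Until the master exists, nothing points
    #    at them, so a crash here leaves only invisible orphans.
    detail_ids = []
    for sku in skus:
        detail_id = str(uuid.uuid4())
        details[detail_id] = {"order_id": order_id, "sku": sku}
        detail_ids.append(detail_id)
    if fail_before_master:
        raise RuntimeError("crashed between detail and master writes")
    # 2. Write the master last; its existence makes the order visible.
    master[order_id] = {"detail_ids": detail_ids}
    return order_id

def reconcile():
    """Delete detail rows whose master never got written (orphans)."""
    orphans = [d for d, row in details.items() if row["order_id"] not in master]
    for d in orphans:
        del details[d]
    return len(orphans)

write_order(["book"])                       # happy path
try:
    write_order(["lamp"], fail_before_master=True)
except RuntimeError:
    pass                                    # partial write: orphaned detail
print(reconcile(), "orphan(s) cleaned up")  # -> 1 orphan(s) cleaned up
```

Writing the details first means a mid-transaction crash leaves only rows that nothing references yet, which is exactly the failure mode a periodic reconciliation sweep can clean up safely.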
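On the schema point, here's what the shift from fixed columns to key/value pairs looks like in miniature. The item fields are invented for illustration; the pattern is the familiar entity-attribute-value style, which trades some query power for the ability to add attributes without migrations:

```python
# Each item carries only the attributes that apply to it, so new
# attribute types need no schema migration.
vase = {
    "id": 1,
    "title": "vase",
    "attributes": [("color", "blue"), ("material", "ceramic")],
}
lamp = {
    "id": 2,
    "title": "lamp",
    "attributes": [("wattage", "60"), ("color", "white")],
}

def get_attr(item, key):
    """Return the first value stored under `key`, or None if absent."""
    return next((v for k, v in item["attributes"] if k == key), None)

print(get_attr(lamp, "wattage"))  # -> 60
print(get_attr(vase, "wattage"))  # -> None
```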
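For the transitional-states point, here's a sketch of backwards-compatible event handling: a consumer that accepts both the old and new shape of a hypothetical `ItemListed` event, so producers can be upgraded in parallel. The v1/v2 field names are assumptions of mine:

```python
# v1 carried a single "price" field; v2 (an assumed evolution) splits it
# into "price_amount" + "price_currency". The consumer accepts both shapes
# for as long as old producers are still running.
def handle_item_listed(event):
    if "price_amount" in event:           # v2 shape
        amount = event["price_amount"]
        currency = event.get("price_currency", "USD")
    else:                                 # v1 shape
        amount = event["price"]
        currency = "USD"
    return {"title": event["title"], "amount": amount, "currency": currency}

print(handle_item_listed({"title": "vase", "price": 10.0}))
print(handle_item_listed({"title": "vase", "price_amount": 10.0,
                          "price_currency": "EUR"}))
```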
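Finally, for "expect and prepare for failure": one common way a consumer can manage an unreliable dependency is retries with backoff plus a graceful fallback. `fetch_recommendations` and the fallback list are hypothetical stand-ins; the pattern is the point, not these specifics:

```python
import random
import time

def fetch_recommendations(user_id):
    """Stand-in for a flaky downstream service."""
    if random.random() < 0.5:
        raise TimeoutError("recommendation service timed out")
    return [f"item-for-{user_id}"]

def recommendations_with_fallback(user_id, retries=2, backoff=0.1):
    for attempt in range(retries + 1):
        try:
            return fetch_recommendations(user_id)
        except TimeoutError:
            if attempt < retries:
                time.sleep(backoff * 2 ** attempt)  # exponential backoff
    # All attempts failed: degrade gracefully rather than failing the page.
    return ["bestseller-1", "bestseller-2"]

print(recommendations_with_fallback("u42"))
```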