This list is a copy of mmcgrana/services-engineering with ranks

Services Engineering Reading List

A reading list for services engineering, with a focus on cloud infrastructure services.

We welcome suggestions.

Papers

Fault Injection in Production (Allspaw)
Making Reliable Distributed Systems in the Presence of Software Errors (Armstrong)
Highly Available Transactions: Virtues and Limitations (Bailis et al.)
The Incident Command System (Bigley and Roberts)
The Chubby Lock Service for Loosely Coupled Distributed Systems (Burrows)
Bigtable: a Distributed Storage System for Structured Data (Chang et al.)
Spanner: Google’s Globally-Distributed Database (Corbett et al.)
Dynamo: Amazon’s Highly Available Key-Value Store (DeCandia et al.)
MapReduce: Simplified Data Processing on Large Clusters (Dean and Ghemawat)
The Google File System (Ghemawat et al.)
On Designing and Deploying Internet Scale Services (Hamilton)
Kafka: A Distributed Messaging System for Log Processing (Kreps et al.)
Weathering the Unexpected (Krishnan)
The Unified Logging Infrastructure for Data Analytics at Twitter (Lee et al.)
Automatic Management of Partitioned, Replicated Search Services (Leibert et al.)
Learning to Embrace Failure (Limoncelli et al.)
Scaling Big Data Mining Infrastructure: The Twitter Experience (Lin and Ryaboy)
Dremel: Interactive Analysis of Web-Scale Datasets (Melnik et al.)
Out of the Tar Pit (Moseley and Marks)
The Log-Structured Merge-Tree (O’Neil et al.)
In Search of an Understandable Consensus Algorithm (Ongaro and Ousterhout)
Failure Trends in a Large Disk Drive Population (Pinheiro et al.)
Fallacies of Distributed Computing Explained (Rotem-Gal-Oz)
F1 - The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business (Shute et al.)
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (Sigelman et al.)
Resilient Distributed Datasets: a Fault-Tolerant Abstraction for In-Memory Cluster Computing (Zaharia et al.)
The Human Side of Postmortems (Zwieback)
Crew Resource Management: a Positive Change for the Fire Service

Posts

Resilience Engineering: Part I, Part II (Allspaw)
Systems Engineering: a Great Definition (Allspaw)
Chaos Monkey Released Into The Wild (Bennett and Tseitlin)
Some Rules for Engineering and Operations (Black)
Service Level Disagreements Part I, Part II (Black)
Incuriosity Will Kill Your Infrastructure (Crayford)
My Philosophy on Alerting (Ewaschuk)
You Can’t Sacrifice Partition Tolerance (Hale)
Customer Trust (Hamilton)
Observations on Errors, Corrections, & Trust of Dependent Systems (Hamilton)
Game Day Exercises at Stripe: Learning from kill -9 (Hedlund)
Life Beyond Distributed Transactions: An Apostate’s Opinion (Helland)
Notes on Distributed Systems for Young Bloods (Hodges)
The Network is Reliable (Kingsbury)
The Trouble with Clocks (Kingsbury)
Call Me Maybe: Final Thoughts (Kingsbury)
Getting Real About Distributed Systems Reliability (Kreps)
The Log: What every software engineer should know about real-time data’s unifying abstraction (Kreps)
Incident Response at Heroku (McGranaghan)
On HTTP Load Testing (Nottingham)
Observability at Twitter (Watson)
Stevey’s Google Platforms Rant (Yegge)

Presentations

Design, Lessons, and Advice from Building Distributed Systems at Google (Dean)
Service Design Best Practices (Hamilton)

Books

The Field Guide To Understanding Human Error (Dekker)
Agile Retrospectives: Making Good Teams Great (Derby et al.)
Better: A Surgeon’s Notes on Performance (Gawande)
The Checklist Manifesto: How to Get Things Right (Gawande)
High Performance Browser Networking (Grigorik)
Resilience Engineering in Practice (Hollnagel et al.)
Effective Monitoring and Alerting (Ligus)
Release It!: Design and Deploy Production-Ready Software (Nygard)
The Challenger Launch Decision (Vaughan)
Managing the Unexpected (Weick and Sutcliffe)

Research Groups

Conferences
