The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization. This book is divided into four sections: Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE) Practices—Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems Management—Explore Google's best practices for training, communication, and meetings that your organization can use
Expensive For example, scientists lost the $125 million Mars Climate Orbiter because two separate engineering teams used different units of measurement (imperial versus metric). As before, strong types are a solution to this issue: they ...
But not completely evenly distributed, as William Gibson might say. 15 See relevant research at https://devops-research.com/research.html. 16 See http://en.wikipedia.org/wiki/Goodhart%27s_law and ...
This book covers: Service-level requirements and risk management Building and evolving an architecture for operational visibility Infrastructure engineering and infrastructure management How to facilitate the release management process Data ...
This book covers how to track and monitor application performance using Grafana, Prometheus, and Kibana along with how to extend monitoring more effectively by building full-stack observability into the system.
Inspired by that earlier work, this book explores a very different part of the SRE space. The more than two dozen chapters in Seeking SRE bring you into some of the important conversations going on in the SRE world right now.
In this chapter, we are going to focus on the various ways and means of bringing up the reliability assurance factor by ... There are machine learning algorithms besides big and fast data analytics platforms to speed up and simplify the ...
Knowing how to keep systems reliable has become a critical skill. With this practical book, newcomers and old hats alike will explore a broad range of conversations happening in SRE.
Here is your chance to dive into the SRE role and know what it takes to be and implement best SRE practices. The DevOps, Continuous Delivery and SRE movements are here to stay and grow, its time you to ride the wave!
This book goes in detail about DevOps Culture, Microservices Architecture, How to automate deployment using Kubernetes and How Google's SRE and DevOps philosophies overlap.
What you will learnMonitor for approaching catastrophic failureAlert your team to an outage emergencyDissect your incident response strategiesTest automation tools and build your own softwarePredict bottlenecks and fight for user ...