Site Reliability Engineering
Subtitle: How Google Runs Production Systems
Authors: Betsy Beyer / Chris Jones / Jennifer Petoff / Niall Richard Murphy
Publisher: O'Reilly Media
Publication date: 2016-4-16
Pages: 552
List price: USD 44.99
Binding: Paperback
ISBN: 9781491929124

Synopsis
The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to...

About the authors
Betsy Beyer
Betsy Beyer is a Technical Writer for Google in New York City specializing in Site Reliability Engineering. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally distributed datacenters. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En ro...

Contents

Chapter 1 Introduction
  The Sysadmin Approach to Service Management
  Google’s Approach to Service Management: Site Reliability Engineering
  Tenets of SRE
  The End of the Beginning
Chapter 2 The Production Environment at Google, from the Viewpoint of an SRE
  Hardware
  System Software That “Organizes” the Hardware
  Other System Software
  Our Software Infrastructure
  Our Development Environment
  Shakespeare: A Sample Service

Principles

Chapter 3 Embracing Risk
  Managing Risk
  Measuring Service Risk
  Risk Tolerance of Services
  Motivation for Error Budgets
Chapter 4 Service Level Objectives
  Service Level Terminology
  Indicators in Practice
  Objectives in Practice
  Agreements in Practice
Chapter 5 Eliminating Toil
  Toil Defined
  Why Less Toil Is Better
  What Qualifies as Engineering?
  Is Toil Always Bad?
  Conclusion
Chapter 6 Monitoring Distributed Systems
  Definitions
  Why Monitor?
  Setting Reasonable Expectations for Monitoring
  Symptoms Versus Causes
  Black-Box Versus White-Box
  The Four Golden Signals
  Worrying About Your Tail (or, Instrumentation and Performance)
  Choosing an Appropriate Resolution for Measurements
  As Simple as Possible, No Simpler
  Tying These Principles Together
  Monitoring for the Long Term
  Conclusion
Chapter 7 The Evolution of Automation at Google
  The Value of Automation
  The Value for Google SRE
  The Use Cases for Automation
  Automate Yourself Out of a Job: Automate ALL the Things!
  Soothing the Pain: Applying Automation to Cluster Turnups
  Borg: Birth of the Warehouse-Scale Computer
  Reliability Is the Fundamental Feature
  Recommendations
Chapter 8 Release Engineering
  The Role of a Release Engineer
  Philosophy
  Continuous Build and Deployment
  Configuration Management
  Conclusions
Chapter 9 Simplicity
  System Stability Versus Agility
  The Virtue of Boring
  I Won’t Give Up My Code!
  The “Negative Lines of Code” Metric
  Minimal APIs
  Modularity
  Release Simplicity
  A Simple Conclusion

Practices

Chapter 10 Practical Alerting from Time-Series Data
  The Rise of Borgmon
  Instrumentation of Applications
  Collection of Exported Data
  Storage in the Time-Series Arena
  Rule Evaluation
  Alerting
  Sharding the Monitoring Topology
  Black-Box Monitoring
  Maintaining the Configuration
  Ten Years On…
Chapter 11 Being On-Call
  Introduction
  Life of an On-Call Engineer
  Balanced On-Call
  Feeling Safe
  Avoiding Inappropriate Operational Load
  Conclusions
Chapter 12 Effective Troubleshooting
  Theory
  In Practice
  Negative Results Are Magic
  Case Study
  Making Troubleshooting Easier
  Conclusion
Chapter 13 Emergency Response
  What to Do When Systems Break
  Test-Induced Emergency
  Change-Induced Emergency
  Process-Induced Emergency
  All Problems Have Solutions
  Learn from the Past. Don’t Repeat It.
  Conclusion
Chapter 14 Managing Incidents
  Unmanaged Incidents
  The Anatomy of an Unmanaged Incident
  Elements of Incident Management Process
  A Managed Incident
  When to Declare an Incident
  In Summary
Chapter 15 Postmortem Culture: Learning from Failure
  Google’s Postmortem Philosophy
  Collaborate and Share Knowledge
  Introducing a Postmortem Culture
  Conclusion and Ongoing Improvements
Chapter 16 Tracking Outages
  Escalator
  Outalator
Chapter 17 Testing for Reliability
  Types of Software Testing
  Creating a Test and Build Environment
  Testing at Scale
  Conclusion
Chapter 18 Software Engineering in SRE
  Why Is Software Engineering Within SRE Important?
  Auxon Case Study: Project Background and Problem Space
  Intent-Based Capacity Planning
  Fostering Software Engineering in SRE
  Conclusions
Chapter 19 Load Balancing at the Frontend
  Power Isn’t the Answer
  Load Balancing Using DNS
  Load Balancing at the Virtual IP Address
Chapter 20 Load Balancing in the Datacenter
  The Ideal Case
  Identifying Bad Tasks: Flow Control and Lame Ducks
  Limiting the Connections Pool with Subsetting
  Load Balancing Policies
Chapter 21 Handling Overload
  The Pitfalls of “Queries per Second”
  Per-Customer Limits
  Client-Side Throttling
  Criticality
  Utilization Signals
  Handling Overload Errors
  Load from Connections
  Conclusions
Chapter 22 Addressing Cascading Failures
  Causes of Cascading Failures and Designing to Avoid Them
  Preventing Server Overload
  Slow Startup and Cold Caching
  Triggering Conditions for Cascading Failures
  Testing for Cascading Failures
  Immediate Steps to Address Cascading Failures
  Closing Remarks
Chapter 23 Managing Critical State: Distributed Consensus for Reliability
  Motivating the Use of Consensus: Distributed Systems Coordination Failure
  How Distributed Consensus Works
  System Architecture Patterns for Distributed Consensus
  Distributed Consensus Performance
  Deploying Distributed Consensus-Based Systems
  Monitoring Distributed Consensus Systems
  Conclusion
Chapter 24 Distributed Periodic Scheduling with Cron
  Cron
  Cron Jobs and Idempotency
  Cron at Large Scale
  Building Cron at Google
  Summary
Chapter 25 Data Processing Pipelines
  Origin of the Pipeline Design Pattern
  Initial Effect of Big Data on the Simple Pipeline Pattern
  Challenges with the Periodic Pipeline Pattern
  Trouble Caused By Uneven Work Distribution
  Drawbacks of Periodic Pipelines in Distributed Environments
  Introduction to Google Workflow
  Stages of Execution in Workflow
  Ensuring Business Continuity
  Summary and Concluding Remarks
Chapter 26 Data Integrity: What You Read Is What You Wrote
  Data Integrity’s Strict Requirements
  Google SRE Objectives in Maintaining Data Integrity and Availability
  How Google SRE Faces the Challenges of Data Integrity
  Case Studies
  General Principles of SRE as Applied to Data Integrity
  Conclusion
Chapter 27 Reliable Product Launches at Scale
  Launch Coordination Engineering
  Setting Up a Launch Process
  Developing a Launch Checklist
  Selected Techniques for Reliable Launches
  Development of LCE
  Conclusion

Management

Chapter 28 Accelerating SREs to On-Call and Beyond
  You’ve Hired Your Next SRE(s), Now What?
  Initial Learning Experiences: The Case for Structure Over Chaos
  Creating Stellar Reverse Engineers and Improvisational Thinkers
  Five Practices for Aspiring On-Callers
  On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
  Closing Thoughts
Chapter 29 Dealing with Interrupts
  Managing Operational Load
  Factors in Determining How Interrupts Are Handled
  Imperfect Machines
Chapter 30 Embedding an SRE to Recover from Operational Overload
  Phase 1: Learn the Service and Get Context
  Phase 2: Sharing Context
  Phase 3: Driving Change
  Conclusion
Chapter 31 Communication and Collaboration in SRE
  Communications: Production Meetings
  Collaboration within SRE
  Case Study of Collaboration in SRE: Viceroy
  Collaboration Outside SRE
  Case Study: Migrating DFP to F1
  Conclusion
Chapter 32 The Evolving SRE Engagement Model
  SRE Engagement: What, How, and Why
  The PRR Model
  The SRE Engagement Model
  Production Readiness Reviews: Simple PRR Model
  Evolving the Simple PRR Model: Early Engagement
  Evolving Services Development: Frameworks and SRE Platform
  Conclusion

Conclusions

Chapter 33 Lessons Learned from Other Industries
  Meet Our Industry Veterans
  Preparedness and Disaster Testing
  Postmortem Culture
  Automating Away Repetitive Work and Operational Overhead
  Structured and Rational Decision Making
  Conclusions
Chapter 34 Conclusion

Appendix: Availability Table
Appendix: A Collection of Best Practices for Production Services
  Fail Sanely
  Progressive Rollouts
  Define SLOs Like a User
  Error Budgets
  Monitoring
  Postmortems
  Capacity Planning
  Overloads and Failure
  SRE Teams
Appendix: Example Incident State Document
Appendix: Example Postmortem
  Lessons Learned
  Timeline
  Supporting information:
Appendix: Launch Coordination Checklist
Appendix: Example Production Meeting Minutes

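Full of twists and turns, rough and bumpy.
This one takes patience.
A good book. Worth reading, and even more worth keeping.
The writing is extremely expressive.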