DevOps
What role does HR play in process improvement
1) Implementing new organizational structures 2) Communicating and monitoring employee responses 3) Integrating process improvement into the talent system 4) Recruiting 5) People development 6) Preventing ambiguity wastes. The 8 types of ambiguity waste outlined by Karen Martin and Mike Osterling: 1) Terminology and communication 2) Problem solving and decision making 3) Work systems 4) Roles and responsibilities 5) Policies 6) Business goals and priorities 7) Customers and products 8) Organizational purpose and vision
How to Increase Flow
1) Make work visible 2) reduce batch sizes and intervals of work (limiting WIP) 3) Build quality in by preventing defects from being passed to downstream work centers 4) constantly optimize for global goals
DevOps
1. Collaborative Development 2. Continuous Testing 3. Continuous Release 4. Continuous Monitoring and Optimization
Transformation Best Practices (IBM)
1. Consider all elements of a delivery ecosystem 2. Implement a center of excellence 3. Plan improvements around capabilities 4. Adopt capabilities incrementally 5. Embrace principles of organizational change
Monolithic v1
All functionality in one application Pros: 1. simple at first 2. Low inter-process latencies 3. Single codebase, one deployment unit 4. Resource-efficient at small scales Cons: 1. Coordination overhead increases as teams grow 2. Poor enforcement of modularity 3. Poor scaling 4. All-or-nothing deploy (downtime, failures) 5. Long build times
Reduce the Number of Handoffs
Each work handoff requires communication. To mitigate the resulting problems, strive to reduce the number of handoffs by 1) automating significant portions of the work, or 2) reorganizing teams so they can deliver value to the customer themselves
Urgent Changes
Emergency and consequently, potentially high risk, changes that must be put into production immediately (urgent security patch, restore service). Require senior management approval, but allow documentation to be performed after the fact.
Enable creation of production metrics as part of daily work
Enable Dev and Ops to create and improve telemetry as part of their daily work
CHAPTER 19
Enable and Inject Learning into Daily Work Goal: Create a dynamic system of learning that allows us to understand mistakes and translate that understanding into actions that prevent those mistakes from occurring in the future
Fund not Projects, but Services and Products
Enable high-performing outcomes by creating stable service teams with ongoing funding to execute their own strategy and road map of initiatives. These teams are measured by achievement of organizational and customer outcomes such as revenue, customer lifetime value, or customer adoption rate. The traditional project-based model, which is measured by budget, time, and scope, works against this: teams cannot see the long-term consequences of the decisions they make, and funding only pays for the earliest stages of the SDLC, which are the least expensive.
Spread knowledge by using automated tests as documentation and communities of practice
Enable rapid propagation of expertise and improvements regarding shared libraries Shared libraries should have significant amounts of automated testing
Share your Experiences from DevOps Conferences
Encourage engineers to attend conferences, give talks at them, and create or organize internal/external conferences themselves
Create a single, shared source code repository for our entire organization
Engineers can leverage the diverse expertise of everyone in the organization Engineers can contribute to the single repository knowing that the code is updated and what areas of the repo will be affected
Integration Tests:
Ensure that our application correctly interacts with other production applications and services
Deployment Pipeline
Ensures that all code checked into version control is automatically built and tested in a production-like environment. Finds any build, test, or integration errors as soon as a change is introduced. Allows us to always be assured that we are in a deployable and shippable state.
2 Broad categories of release patterns we can use:
Environment-based release patterns Application-based release patterns
Experimental Model
Every day, every exercise and every piece of new information is evaluated and debated in a culture that resembles a research and design lab
Dev Shares pager rotation duties with Ops
Everyone in the value stream will share the downstream responsibilities of handling operation incidents. Do this by putting developers, developer managers, and architects on pager rotation This ensures that everyone in the value stream gets visceral feedback on any upstream architectural and coding decisions they make Operations doesn't struggle alone with code related production issues
Guidelines for Code Review
Everyone must have someone review their changes before committing to trunk. Everyone should monitor the commit stream of their fellow team members. Define which changes qualify as high risk and may require review from a designated subject matter expert. If someone submits a change that is too large to reason about easily, it should be split into multiple smaller changes that can be understood at a glance.
Instrument and Alert on Undesired outcomes
Analyze the most severe incidents of the recent past and create a list of telemetry that could have enabled earlier and faster detection and diagnosis of the problem, as well as easier and faster confirmation that an effective fix had been implemented Repeat this process on ever-weaker failure signals to find problems even earlier in the life cycle
Anomaly Detection Techniques
Anomaly Detection and Smoothing
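A moving average is one common smoothing technique: it damps short spikes in a metric so that sustained shifts stand out. The sketch below is only an illustration; the function name `moving_average` and the default window of 3 are assumptions, not something these notes prescribe.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int = 3) -> List[float]:
    """Smooth a metric series by averaging each point with the points before it.

    The window size is an illustrative assumption. The first few outputs
    average over fewer points because the window is not yet full.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # keeps only the latest `window` samples
    smoothed = []
    for value in values:
        recent.append(value)
        smoothed.append(sum(recent) / len(recent))
    return smoothed
```

With a window of 3, a single spike of 100 among readings of 10 is flattened to 40, which makes a one-off outlier less likely to trigger an alert than a sustained change.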
Logging Level: Debug
Anything that happens in the program, most often used during debugging. Debug logs are often disabled in production but temporarily enabled during trouble shooting
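As a rough sketch of how debug logging can be off by default and switched on for troubleshooting, the example below uses Python's standard `logging` module. The logger name `checkout` and the logged message are hypothetical.

```python
import logging

logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(logging.NullHandler())  # keep this sketch silent

# Production default: messages below INFO (that is, DEBUG) are filtered out.
logger.setLevel(logging.INFO)
debug_on_in_production = logger.isEnabledFor(logging.DEBUG)

# During troubleshooting, temporarily lower the threshold to DEBUG.
logger.setLevel(logging.DEBUG)
debug_on_while_troubleshooting = logger.isEnabledFor(logging.DEBUG)
logger.debug("cart contents: %s", ["sku-1", "sku-2"])

# Restore the production level once the investigation is over.
logger.setLevel(logging.INFO)
```

In Python's `logging` module the "fatal" level is exposed as `logging.CRITICAL`, with `logging.FATAL` as an alias.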
Convergence of DevOps
Applying principles from physical manufacturing and leadership to IT value stream. Blends principles from: Lean, Theory of Constraints, the Toyota Kata movement, resilience engineering, learning organizations, safety culture, Agile software journey
Fearlessly Cut bureaucratic process
Approval processes can significantly increase lead times. A great metric to publish is how many meetings and work tickets are mandatory to perform a release. The goal is to relentlessly reduce the effort required for engineers to perform work and deliver it to the customer.
CHAPTER 12: Automate and Enable Low-Risk Releases
As deployment batch-size grows, so does the risk of unexpected outcomes associated with the change
Ensure Documentation and Proof for Auditors and Compliance Officers
As tech increasingly adopts DevOps patterns it creates more tension between IT and audit b/c it challenges traditional thinking about auditing, controls, and risk mitigation
Definition of Done
At end of each development interval, we must have integrated, tested, working, and potentially shippable code, demonstrated in production-like environment, created from trunk using a one-click process, and validated with automated tests.
Automate our deployment process
Automate as many of the manual steps as possible: packaging code in ways suitable for deployment; creating pre-configured virtual machine images or containers; automating the deployment and configuration of middleware; copying packages or files onto production servers; restarting servers, applications, or services; generating configuration files from templates; running automated smoke tests; running testing procedures; scripting and automating database migrations
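One of the steps listed above, generating configuration files from templates, can be sketched with the standard library's `string.Template`. The template text, keys, and the `render_config` helper are hypothetical and only illustrate the idea.

```python
from string import Template

# Hypothetical template; the keys and layout are illustrative only.
CONFIG_TEMPLATE = Template(
    "server_name=$host\nlisten_port=$port\nenvironment=$env\n"
)


def render_config(host: str, port: int, env: str) -> str:
    """Fill in the template; substitute() raises KeyError if a value is missing."""
    return CONFIG_TEMPLATE.substitute(host=host, port=port, env=env)
```

Because the same template is rendered for every environment, the only differences between environments are the values passed in, which keeps environments consistent.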
Canary Release Pattern:
Automates the release process of promoting to successively larger and more critical environments as we confirm that the code is operating as designed. A1 group: production servers that only serve internal employees. A2 group: production servers that only serve a small percentage of customers. A3 group: the rest of the production servers.
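The promotion logic of a canary release can be sketched as a loop over the server groups that stops at the first group where the code misbehaves. This is a minimal sketch under assumptions: `deploy` and `is_healthy` are hypothetical callbacks standing in for real deployment and monitoring tooling.

```python
from typing import Callable, List, Sequence

# Group names follow the A1/A2/A3 description in these notes.
CANARY_GROUPS = ("A1", "A2", "A3")


def canary_release(
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    groups: Sequence[str] = CANARY_GROUPS,
) -> List[str]:
    """Deploy to each group in order and stop at the first unhealthy one.

    Returns the groups that were deployed to and confirmed healthy.
    """
    promoted = []
    for group in groups:
        deploy(group)
        if not is_healthy(group):
            break  # halt promotion; later groups never receive the change
        promoted.append(group)
    return promoted
```

If the A2 group reports a problem, the A3 group (the bulk of production) is never touched, which limits how many customers a bad change can reach.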
DEVOPS VS AGILE
Best results come from using DevOps and Agile together
To better enable fast flow, we want a code promotion process that can be performed by either Dev or Ops, ideally without any manual steps or handoffs. This affects the following steps:
Build, Test, Release
Case study - 18F automating compliance for the federal government with Compliance Masonry
Built a framework to automate the creation of system security plans
Make Infrastructure Easier to Rebuild than to Repair
With repeatable environment creation systems, we can easily increase capacity by adding more servers, and we avoid the disaster that inevitably results when we must restore service after a catastrophic failure of irreproducible infrastructure created through years of undocumented, manual production changes
Ensure Tests run quickly
Can run tests in parallel (performance testing at the same time as security testing)
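A minimal sketch of running independent suites at the same time, using the standard library's `concurrent.futures`. The suite names and the convention that each suite is a callable returning pass or fail are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_suites_in_parallel(suites: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run independent test suites concurrently and collect pass/fail results."""
    with ThreadPoolExecutor(max_workers=len(suites) or 1) as pool:
        # Submit every suite first so they all start before we wait on any.
        futures = {name: pool.submit(suite) for name, suite in suites.items()}
        return {name: future.result() for name, future in futures.items()}
```

Total wall-clock time then approaches the duration of the slowest suite rather than the sum of all suites, provided the suites do not depend on each other.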
Eliminate Hardship and Waste in the Value Stream
Categories of Waste and Hardship: Partially done work Extra Processes Extra Features Task Switching Waiting Motion Defects Nonstandard or manual work Heroics
Steps of Deployment pipeline:
Commit stage (builds and packages the software, runs automated tests, and performs additional validation such as static code analysis, duplication, test coverage) Acceptance Stage: (automatically deploys the packages created in the commit stage into a production-like environment, runs automated acceptance tests)
Logging Level: Info
Consists of actions that are usually user-driven or system specific
Dynamic Analysis
Consists of tests executed while a program is in operation (system memory, functional behavior, response time, performance)
Cluster immune System
Expands on canary release pattern by linking our production monitoring system with our release process and by automating the roll back of code when the user-facing performance of the production system deviates outside of a predefined expected range
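The core of the idea, rolling back automatically when a user-facing metric leaves its expected range, can be sketched as below. The names `guard_release`, `read_metric`, and `rollback` are hypothetical, and a real system would sample the metric repeatedly rather than once.

```python
from typing import Callable


def guard_release(
    read_metric: Callable[[], float],
    rollback: Callable[[], None],
    expected_min: float,
    expected_max: float,
) -> bool:
    """Roll back automatically if a user-facing metric leaves its expected range.

    Returns True when the release is kept, False when it was rolled back.
    """
    value = read_metric()
    if expected_min <= value <= expected_max:
        return True
    rollback()  # no human approval step: the monitoring result drives the release
    return False
```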
Multivariate Testing
Experiments with more than one variable, which allow us to see how the variables interact
Ways to influence people's Beliefs and Values:
Exposure to others who have been successful (books, mentoring programs, conferences, training, etc) Hiring new people
Having Developers follow work downstream
Contextual inquiry: When the product team watches a customer use the application in their natural environment, which uncovers ways that customers struggle with the application Developers often learn a lot after participating in customer observation Can use same technique to observe how our work affects our internal customers. Developers should follow their work downstream, so that they can see how work centers must interact with their product to get it running into production. can improve deployability, manageability, operability
Four common traits that companies need to develop as part of conscious effort to build a process improvement culture
Continually identify problems Discussing problems openly Focusing on large and small problems Solving real problems
Potential Dangers of "overly controlling changes"
Controls we tend to put in place when change control failures occur: 1) adding more questions that need to be answered to the change request form 2) requiring more authorizations 3) requiring more lead time for change approvals so that change requests can be properly evaluated. Creating approval steps that involve people located further and further away from the work may actually reduce the likelihood of success. High-performing organizations rely more on peer review and less on external approval of changes.
3 main areas that commonly impact corporate culture:
Corporate Behaviors Corporate Values Corporate Structures
Components of a great culture?
Corporate culture begins with leadership: leaders need to explain and share their vision. Make the culture story meaningful: it helps employees believe in the transformation by answering their most important questions. All employees should be treated equally. Two-way communication is essential. Hiring decisions should reflect the desired corporate culture. A proper vision is paramount. Company values should be continuously promoted. Company policies should embody corporate values. Company history is key. Spotlight success.
How a centralized Ops team can achieve outcomes typically associated with market-oriented teams:
Create Self-service capabilities to enable developers in the service to be productive Embed Ops engineers into the service teams Assign Ops liaisons to the service teams when embedding Ops is not possible
Enable Everyone to Teach and Learn
Create a culture of continual teaching and learning Code reviews can help teach skills through daily work Have dev and ops work together to solve small problems
CHAPTER 20: Convert local discoveries into global improvements
Create mechanisms that make it possible for new learnings and improvements discovered locally to be captured and shared globally throughout the entire organization
Contrasting Point - Creating vs. Deploying
Creating vs. Deploying: Developing software is inherent to Agile, but DevOps is concerned with the appropriate deployment of said software
System of Engagement
Customer-facing or employee facing systems, such as e-commerce systems and productivity applications. Typically have higher pace of change to support rapid feedback loops that enable them to conduct experimentation to discover how to best meet customer needs.
Daily Scrum Meeting
Daily, ~15 min, stand-up What did you do yesterday? What will you do today? What obstacles are in your way?
Architecture of centralized telemetry infrastructure should have following components:
Data collection at the business logic, application, and environments layer An event router responsible for storing our events and metrics
T-Shaped:(Generalist)
Deep expertise in one area Broad skills across many areas Can step up to remove bottlenecks Sensitive to downstream waste and impact Helps make planning flexible and absorbs variability
E-Shaped
Deep expertise in one area Experience across many areas, proven execution skills, always innovating Almost limitless potential
I-Shaped: (Specialist)
Deep expertise in one area Few skills or experiences in other areas Creates bottlenecks quickly Insensitive to downstream waste and impact Prevents planning flexibility or absorption of variability
Requirements for our Deployment Pipeline
Deploying the same way to every environment Smoke testing our deployments Ensure we maintain consistent environments
The Separate World of DevOps
DevOps is responsible for developing and deploying new products to the end-user DevOps walks a line between flexibility and the rigorous testing and communication that comes with deploying new software
Enable Automated Self-service Deployments
Developer's ability to self-deploy code into production, to quickly see happy customers when their feature works, and to quickly fix any issues without having to open up a ticket with Operations has diminished over the last decade --in part as a result of a need for control and oversight perhaps driven by security and compliance requirements
A Typical Deployment Landscape
Development (developer) Build (build engineer) QA (QA team) SIT (Integration Tester) UAT (User) Production (ops engineer)
Integrate A/B Testing into our Release
A/B testing is made possible by doing production deployments on demand, using feature toggles, and potentially delivering multiple versions of code simultaneously
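To serve multiple versions at once, each user has to land in the same variant on every visit. One common way to do that, shown here purely as an assumption and not as something these notes prescribe, is to hash the user and experiment identifiers. The function name and identifiers are hypothetical.

```python
import hashlib
from typing import Sequence


def assign_variant(
    user_id: str, experiment: str, variants: Sequence[str] = ("A", "B")
) -> str:
    """Deterministically map a user to a variant so repeat visits match."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # The hash is stable across processes, unlike Python's built-in hash().
    return variants[int(digest, 16) % len(variants)]
```

Including the experiment name in the hash input means a user's bucket in one experiment does not determine their bucket in another.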
Start with the Most Sympathetic and Innovative Groups
Find those teams who already believe in DevOps principles and practices, i.e. the innovators and early adopters, esp in early stages to spend less energy on converting more conservative groups. Strategy: choose to focus efforts in a few areas of org where there is the most support - make those initiatives successful and expand from there.
why we keep improvement planning horizons short
Flexibility and the ability to re-prioritize and re-plan quickly. Decreased delay between work expended and improvement realized, which strengthens our feedback loop. Faster learning generated from the first iteration. Reduction in the activation energy needed to get improvements. Quicker realization of improvements that make meaningful differences in our daily work. Less risk that our project is killed before we can generate any demonstrable outcomes.
Create a Value Stream Map to see the Work
Focus investigation and scrutiny on the following areas: Places where work must wait weeks or even months Places where significant rework is generated or received
Logging Level: Error
Focuses on Error conditions (ie API call failures)
What to do when Changes are Categorized as Normal Changes
For changes that cannot be classified as standard changes, they will be normal changes that require approval from at least a subset of CAB before deployment Goal: ensure that we can deploy quickly even if it is not fully automated Make sure submitted change requests are as complete and accurate as possible to prevent revisions that will increase the time required for us to get into production Automate the creation of complete and accurate Request for change forms to include links to machine readable data to give context of the change Goal should be to continually show an exemplary track record of successful changes, so we can eventually gain their agreement that our automated changes can be safely classified as standard changes
Integrating A/B Testing into our Feature Planning
Frame Hypotheses in feature development as: 1. We believe: 2. Will Result: 3. We will have confidence to proceed when:
Organizational Archetypes
Functional-oriented organizations Matrix-oriented organizations Market-oriented organizations
Case Study - Instrumenting the environment at Etsy
Galbreath (former VP of engineering at Etsy) defined fraud as when "the system works incorrectly, allowing invalid or un-inspected input into the system, causing financial loss, data loss/theft, system downtime, vandalism, or an attack on another system." Galbreath created security-related telemetry that were displayed alongside all other metrics, including: Abnormal production program terminations Database syntax error Indications of SQL injection attacks
CHAPTER 16: Enable Feedback so development and operations can safely deploy
Galbreath, VP of Engineering at Right Media, found that Dev and Ops feared deploying code. He observed that providing faster and more frequent feedback on their work created safety and confidence. It is not enough merely to automate the deployment process; we must also integrate the monitoring of production telemetry into our development work, as well as establish the cultural norm that everyone is equally responsible for the health of the entire value stream.
Institute Game Days to Rehearse Failures
Game Days: popularized by Jesse Robbins, they are specific disaster recovery rehearsals from the discipline of resilience engineering. They help teams simulate accidents and practice responding to them, which creates more resilient services and a higher degree of assurance that we can resume operations when inopportune events occur.
Apply Lean Principles
Get ideas into production fast; get people to use it; get feedback
CHAPTER 2: The first way principles of flow
Goal: Decrease the amount of time required for changes to be deployed into production and to increase the reliability and quality of those services
How to transform Local discoveries into global improvements
Goal: convert team/individuals experiences/expertise into explicit codified knowledge create mechanisms to create global knowledge such as making all post-mortem reports searchable by teams trying to solve similar problems, creating shared source code repositories that span the entire organization
CHAPTER 8: How to get great outcomes by integrating operations into the daily work of development
Goal: enable market-oriented outcomes where many teams can quickly and independently deliver value to the customer. Challenge: when Ops is centralized and functionally oriented, the result is long lead times for needed Ops work, constant reprioritization, and poor deployment outcomes. How: create more market-oriented outcomes by better integrating Ops capabilities into Dev teams.
CHAPTER 14: Create telemetry to enable seeing and solving problems
Goals: to enable problem solving-behavior, we need to design our systems so that they are continually creating telemetry, widely defined as "an automated communications process by which measurements and other data are collected at remote points and are subsequently transmitted to receiving equipment for monitoring" Ensure that we have enough telemetry to confirm our services are correctly operating in production
Ensure Security of the Application
Happy Path Testing Sad Path Testing /Bad Path Testing include as part of testing: Static analysis Dynamic Analysis Dependency Scanning Source code integrity and code signing
Continuously build, test, and integrate our code and environments
Have Developers build automated tests as part of daily work Create automated test suites that increase frequency of integration and testing of our code and our environments from periodic to continuous Build deployment pipeline that will perform integration of our code and environments Must create automated build and test processes
how Application-based patterns enable safer releases
Implement feature toggles: provide us with mechanism to selectively enable and disable features without requiring a production code deployment can control which features are visible and available to specific user segments Enable us to do the following: Roll back easily Gracefully degrade performance Increase our resilience through a service-oriented architecture Feature toggles enable the decoupling of code deployments and feature releases Perform dark launches: feature toggles allow us to deploy features into production without making them accessible to users, known as dark launching
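A feature toggle can be as simple as a lookup that decides, per user segment, whether a deployed feature is visible. The class below is a minimal in-memory sketch; the class name, feature name, and segment names are hypothetical, and a real system would load toggle state from configuration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FeatureToggles:
    """In-memory toggle store; a real system would load this from config."""

    # Maps a feature name to the user segments allowed to see it.
    enabled_segments: Dict[str, Set[str]] = field(default_factory=dict)

    def enable(self, feature: str, segment: str) -> None:
        self.enabled_segments.setdefault(feature, set()).add(segment)

    def disable(self, feature: str) -> None:
        # Rolling back needs no code deployment, only a toggle change.
        self.enabled_segments.pop(feature, None)

    def is_enabled(self, feature: str, segment: str) -> bool:
        return segment in self.enabled_segments.get(feature, set())
```

A feature that is deployed but has no enabled segments is effectively dark launched: the code is in production, yet no user can reach it until a segment is switched on.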
Reduce Batch Size
Large batch sizes were common prior to Lean and result in high levels of WIP and high levels of variability in flow, which lead to long lead times and poor quality. Lean lesson: smaller batches = shorter lead times and higher quality. Single-piece flow: each operation is performed one unit at a time. Small batches = less WIP, faster lead times, faster detection of errors, and less rework.
Pathological organizations
Large amounts of fear and threat. Failure is hidden
How does leadership play a role in process improvement?
Leaders should connect the process and the people Ensure leadership practices are universally aligned and accepted Leadership Behaviors: Positive Negative Process focused leadership is about enabling and empowering people
Standard Changes
Lower-risk changes that follow an established and approved process, but can also be pre-approved. Includes monthly updates of app tax tables or country codes, website content and styling changes, and certain types of app or operating system patches that have a well-understood impact. The change proposer does not require approval before deploying the change, and change deployments can be completely automated and should be logged so there is traceability.
Ways to Influence People's Level of Awareness:
Maintaining proper work-life balance Ensuring people take their vacation time Having flexible working hours Holding regular team workshops, townhalls, etc Creating opportunities to move within the organization Coaching and mentoring opportunities Conducting regular performance reviews
Methods to make Work visible
Difference between tech and manufacturing value streams: in tech, work is invisible. In tech value streams, we cannot easily see where flow is being impeded or when work is piling up. To see where work is flowing or piling up, we need to make work as visible as possible. Visual boards (Kanban or sprint planning boards): manage our work so that it flows from left to right as quickly as possible, and limit multitasking by enforcing WIP limits for each column. Lead time: the time from when a card is placed on the board to when it is moved into the "done" column.
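The two board mechanics described above, lead time and per-column WIP limits, reduce to very small calculations. The helper names `lead_time` and `can_pull` are hypothetical and only illustrate the definitions.

```python
from datetime import datetime, timedelta
from typing import List


def lead_time(placed_on_board: datetime, moved_to_done: datetime) -> timedelta:
    """Lead time: from the card entering the board until it reaches done."""
    return moved_to_done - placed_on_board


def can_pull(column_cards: List[str], wip_limit: int) -> bool:
    """A column accepts a new card only while it is under its WIP limit."""
    return len(column_cards) < wip_limit
```

When `can_pull` returns False, the intended response is to help finish a card already in the column rather than start new work.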
Prevent uncontrolled configuration variances and motivate version control by:
Disabling remote logins to production servers; routinely killing and replacing production instances, ensuring that manually applied production changes are removed
Consequence of Second Law of Architectural Thermodynamics
Downward spiral of deploying less frequently (change = unknowable catastrophes → so don't want to change). Reducing complexity and increasing productivity of dev teams is rarely the goal of an individual project
Creating Security Telemetry in our environments
We need to create telemetry in our environments so we can detect early indicators of unauthorized access, which may include: OS changes; security group changes; changes to configurations; cloud infrastructure changes; XSS attempts (cross-site scripting attacks); SQLi attempts (SQL injection attacks); web server errors
Agile (Definition Continued)
There is no single "finished product", which is the goal of the waterfall approach. Agile methodologies encourage developers to break software development down into small pieces called "user stories". This highlights the value Agile places on the customer, gives developers faster feedback loops, and keeps the product aligned with market need. Agile advocates adaptive planning, evolutionary development, early and continuous delivery, and continuous improvement.
Enable Pair Programming to improve all our changes
One engineer fills the role of the driver, the person who actually writes the code. Another engineer acts as the navigator, observer, or pointer, the person who reviews the work as it is being performed. Another pair programming pattern reinforces TDD by having one engineer write the automated test and the other engineer implement the code. Pair programming can also spread knowledge throughout the organization and increase information flow within the team.
Create Dedicated Transformation Team
One of the inherent challenges with initiatives such as DevOps transformations is that they are inevitably in conflict with ongoing business operations. A dedicated transformation team is able to operate outside of the rest of the organization that is responsible for daily operations. How to execute a DevOps initiative: assign members of the dedicated team to be solely allocated to the DevOps transformation efforts; select team members who are generalists; select team members who have long-standing and mutually respectful relationships with the rest of the organization; create a separate space for the dedicated team.
Positive Leadership Behaviors
Openness to learning, customer focus, accountability
Create shared services to increase developer Productivity
Operations can create set of centralized platforms and tooling services that any Dev team can use to become more productive Enable dev team to spend more time building functionality for their customer, as opposed to obtaining all the infrastructure required to deliver and support that feature in production
Integrate Ops into Dev rituals
Ops engineers discover what rituals the product teams follow, integrate into them, and add value to them
Market-oriented Organizations
Optimize for responding quickly to customer needs. Flat organizations, composed of multiple, cross-functional disciplines which lead to possible redundancies across the org. How many prominent organizations adopting DevOps operate.
Process Improvement Culture:
Organizations cannot improve unless they continually seek out and solve their problems. For many companies that means undertaking a profound cultural change
Conway's Law
Organizations which design systems...are constrained to produce designs which are copies of the communication structures of these organizations...the larger an organization the less flexibility it has and the more pronounced the phenomenon. The organization of the software and the organization of the software team will be congruent.
Why Automated build and test processes are critical?
Our build and test process can run all the time. A segregated build and test process ensures that we understand all the dependencies required to build, package, run, and test our code. We can package our application to enable the repeatable installation of code and configurations into an environment. Instead of putting code into packages, we may choose to package our applications into deployable containers. Environments can be made more production-like in a way that is consistent and repeatable.
Types of Code reviews:
Pair programming "Over-the-Shoulder" Email Pass-Around Tool-Assisted code review
Improvement Blitz (kaizen blitz)
Part of the Toyota Production System, defined as a dedicated and concentrated period of time to address a particular issue over the course of several days. Goal of the improvement blitz team: a new approach to solving a problem that we encounter in our daily work.
Negative Leadership Behaviors
Passive, Avoidance, politics, cover-ups
Integrate performance testing into our test suite
Performance problems are often difficult to detect, so write and run automated performance tests that validate our performance across the entire application stack
4 steps to effectively identify and solve problems:
Plan Do Check Act
Types of Corporate values
Positive values Negative values
Product Owner
Possibly a product manager or project sponsor Decides on features, release date, prioritization, $$$
Ensure Technology choices help achieve organizational goals
Prevent teams from optimizing for their own productivity instead of the achievement of organizational goals. Identify technologies that: impede or slow down the flow of work; disproportionately create high levels of unplanned work; disproportionately create large numbers of support requests; are inconsistent with our desired architectural outcomes.
Case Study -- static security testing at Twitter
Prevent security mistakes from being repeated. Integrate security objectives into existing developer tools. Preserve the trust of Development. Maintain fast flow through Infosec via automation. Make everything security-related self-service, if possible. Take a holistic approach to achieving Infosec objectives.
Limiting WIP
Prevents multitasking makes it easier to see problems that prevent the completion of work instead of starting something new, find out what is causing the delay and help fix it
Case Study--Pair programming replacing broken code review process at Pivotal Labs
Two accepted methods of code review at Pivotal: pair programming or a code review process managed by Gerrit (every code commit had two designated people "+1" the change) The problem with the Gerrit code review process was that it would often take an entire week for developers to receive their required reviews. Only senior engineers could "+1" changes Pivotal then eliminated the Gerrit code review process and required pair programming to implement code changes into the system.
Scrum
Scrum is an agile process that allows us to focus on delivering the highest business value in the shortest time Rapidly and repeatedly inspect actual working software every 2 weeks to one month Business sets the priorities Inspect and Adapt
Culture:
Shared attitudes, values, principles and beliefs that characterize a company and define its nature and approach to managing employees, customers, investors, and the greater community
Use tools to reinforce desired behavior
Shared tools allow for a shared backlog; Slack allows for fast and widespread communication
Contrasting Point - Specialization
Specialization: Agile is an equal opportunity team- every member of the scrum can do every job within the team, which prevents slowdowns and bottlenecks. DevOps on the other hand, assumes separate teams for development and operations and people stay within their teams but they all communicate frequently
Contrasting Point - Speed
Speed: Agile is all about rapid and frequent deployment, but this is rarely the goal or even part of it for DevOps
Build Reusable operations user stories into development
Standardize, automate, and document operations work as much as possible For all recurring Ops work, we should know the following : What work is required Who needs to perform it What the steps are to complete it Create well defined "ops user stories" that represent work activities that can be reused across all our projects
Decrease Incident Tolerances to Find Ever-Weaker Failure Signals
Standardized Model vs. Experimental Model: work in the technology value stream should use the experimental model, continually seeking ever-weaker failure signals so that we can better understand and manage the system we operate in
Sprint Retrospective
Take a look at what is and is not working after every sprint (~15-30 min)
Sprint Review
Team presents what it accomplished during the sprint. Typically takes the form of a demo of new features or underlying architecture. Informal: 2-hour prep time rule, no slides. Whole team participates. Invite the world.
Sprint Planning
Team selects items from the product backlog they can commit to completing. Sprint backlog is created (tasks are identified and each is estimated). High-level design is considered.
Logging Level: Fatal
Tell us when we must terminate
Unit Test
Test a single method, class, or function in isolation, providing assurance to the developer that their code operates as designed
Acceptance Tests
Test the application as a whole to provide assurance that a higher level of functionality operates as designed. They prove that our app does what the customer meant it to, not just that it works the way its programmers think it should
Changing Mindset
The actual change itself (the most difficult phase). At the conclusion of phase 2, you should have: new cultural behaviors being displayed by most employees, with leaders serving as role models for the rest of the organization; any needed performance management changes designed; new rituals, meetings, and activities that support the values and behaviors; a flurry of communications using various methods to reinforce desired behaviors and update the community on status
Corporate Behaviors
The behaviors of employees and leadership that are most commonly displayed
Continuously Improving
The change is reinforced, but you are also refining the processes to improve and accelerate your transformation. At the conclusion of phase 3 you should have: everyone trained in the new value system and fully bought into the new corporate vision; everyone articulating the unique aspects of the culture; everyone understanding how what they and others do builds and reinforces that culture; evidence of a competitive advantage in both the customer and employment markets based on the corporate culture; everyone looking to continually find ways to improve work, life, and operations
Assign an Ops Liaison to each service team
The designated Ops engineer is responsible for understanding: what the new product functionality is; how it works as it pertains to operability, scalability, and observability; how to monitor and collect metrics; any departures from previous architectures and patterns; any extra needs for infrastructure; feature launch plans
Corporate Structures
The mechanisms by which a company manages tasks and people
Product Backlog
The requirements: a list of all desired work on the project. A list of user stories along with "story points". Ideally expressed such that each item has value to the users or customers of the product. Prioritized by the product owner. Reprioritized at the start of each sprint.
Corporate Values
The values employees associate meaning to and continually promote
Sad Path Testing (or Bad Path)
Validates user journeys where things go wrong, especially security-related error conditions
Building Momentum
This is about building the business case, obtaining commitment, and building the velocity required to make a significant change. At the conclusion of phase 1 you should have: a clear outline of the current corporate culture and the ideal culture; a sponsored and endorsed plan for change; a communication strategy and plan for the cultural change; commitment by executives to take on the journey
Integrate Security and Compliance into Change Approval Process
Types of changes: Standard, Normal, Urgent
Automated Tests fall into one of the following categories:
Unit Tests, Acceptance Tests, Integration Tests
Happy Path Testing
Validates user journeys where everything goes as expected
Integrate preventative security controls into shared source code repositories and shared services
We want a centralized shared-services organization to collaborate with and a shared place to put our information security artifacts. Put code libraries and recommended configurations, secret management, OS packages, and builds into a shared source code repository. This makes it easy for engineers to correctly create and use logging and encryption standards in their applications and environments
Integrate non-functional requirements testing into our test suite
We want to validate every other attribute of the system we care about: availability, scalability, capacity, security, etc. Many of these requirements are fulfilled by correct configuration of our environments, so we must also validate that our environments have been built and configured properly
Build Fast and reliable validation test suite
We need fast automated tests that run within our build and test environments whenever a new change is introduced into version control
Green Build
Whatever is in version control is in a buildable and deployable state
Environment-based release patterns
We have 2 or more environments that we deploy into, but only one environment receives live customer traffic. New code is deployed into a non-live environment, and the release is performed by moving traffic to that environment. Types: blue-green deployment pattern, canary release pattern, cluster immune system
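A minimal sketch of how a canary release might decide where to send traffic, assuming users are identified by a string ID. The function name `choose_environment`, the environment labels, and the percentage parameter are illustrative assumptions, not part of any specific tool.

```python
import hashlib


def choose_environment(user_id: str, canary_percent: int) -> str:
    """Route a stable slice of users to the canary environment.

    Hypothetical helper: hashing the user ID keeps each user in the same
    environment across requests while the canary percentage is unchanged.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99 for this user
    return "canary" if bucket < canary_percent else "live"


# Start with a small slice of traffic, then raise the percentage as telemetry stays healthy.
routes = [choose_environment(f"user-{i}", canary_percent=10) for i in range(1000)]
canary_share = routes.count("canary") / len(routes)
```

Setting `canary_percent` to 0 sends everyone to the live environment and 100 completes the release, so the release itself is only a routing change.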
Goal of Deployment Pipeline
To provide everyone in the value stream the fastest possible feedback that a change has taken us out of a deployable state
Continually identify and elevate our constraints
To reduce lead times and increase throughput, we need to continually identify our system's constraints and improve its work capacity. The constraint usually follows this progression: environment creation, code deployment, test setup and run, overly tight architecture
Redefine Failure and Encourage Calculated Risk Taking
To reinforce a culture of learning and calculated risk taking, we need leaders to continually reinforce that everyone should feel both comfortable with and responsible for surfacing and learning from failures. High-performing DevOps orgs fail and make mistakes more often
Chaos Monkey
Tool built by Netflix to simulate AWS failures by constantly and randomly killing production servers. An example of constantly injecting failures into pre-production and production environments
Integrate security into development iteration demonstrations
Track all open security issues in the same work tracking system that dev and ops are using
Evolution of Delivery Practices
Traditional, Iterative, Agile, Scaled Agile, DevOps
How to Integrate Ops culture into Dev Teams?
Try to embed Ops engineers and architects into each of the Dev teams (hard because there are sometimes not enough Ops engineers). 2 Ops liaison models: Business Relationship Manager: works with product management, line-of-business owners, project management, Dev management, and developers; acts as an advocate for product owners inside of Operations; helps product teams navigate the Operations landscape to prioritize and streamline work requests. Dedicated Release Manager: familiar with the product's development and QA issues; helps the team get what it needs from the Ops organization to achieve its goals
Enabling Organizational Learning and Safety Culture
Dr. Westrum defines 3 types of culture: pathological organizations, bureaucratic organizations, and generative organizations. Remove blame and put organizational learning in its place
System of Record
ERP (enterprise resource planning) systems that run businesses, where correctness of the transactions and data is paramount. Typically have a slower pace of change and regulatory and compliance requirements.
Immutable infrastructure
Manual changes to the production environment are no longer allowed--the only way to make production changes is to put the changes into version control and re-create the code and environment from scratch, which prevents variance in production
Application-based release pattern
Modify the application so that we can selectively release and expose specific application functionality through small configuration changes (feature flags)
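A minimal sketch of a feature flag, assuming an in-memory flag store. `FeatureFlags`, `render_checkout`, and the flag name `new_checkout` are hypothetical names for illustration; a real system would typically load flags from a configuration service.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store."""
    flags: dict = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code paths stay dark until released.
        return self.flags.get(name, False)


def render_checkout(flags: FeatureFlags) -> str:
    # The new code is deployed either way; the flag decides whether users see it.
    if flags.is_enabled("new_checkout"):
        return "new checkout page"
    return "old checkout page"


flags = FeatureFlags({"new_checkout": False})
before_release = render_checkout(flags)
flags.flags["new_checkout"] = True  # the "release" is only a configuration change
after_release = render_checkout(flags)
```

Turning the flag back off is the rollback, which is what makes this pattern useful for decoupling deployment from release.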
Microservice
modular, independent graph relationship vs tiers, isolated persistence Pros: 1. Each unit is simple 2. Independent scaling and performance 3. Independent testing and deployment 4. Can optimally tune performance (caching, replication, etc) Cons: 1. Many cooperating units 2. Many small repos 3. Requires more sophisticated tooling and dependency management 4. Network latencies
Mass Marketing/Brand Marketing
placing as many ad impressions in front of people as possible to influence buying decisions
Resilience
Requires that we first define our failure modes and then perform testing to ensure that these failure modes operate as designed
Logging Level: Warn
Tells us of conditions that could potentially become an error (e.g., a database call taking longer than a predefined time)
Static Analysis
Testing we perform in a non-runtime environment, inspecting code for coding flaws, backdoors, and malicious code
Why fast flow from dev to ops is important?
to deliver value to customers quickly
CHAPTER 15
Analyze Telemetry to Better Anticipate Problems and Achieve Goals
Bad Apple Theory
(Dekker) The notion of eliminating error by eliminating the people who caused the errors. Dekker claims this is invalid because human error is not the cause of our troubles; instead, human error is a consequence of the design of the tools we gave people
Integrating A/B Testing into our Feature Testing
A/B Testing in UX and Multivariate Testing
Resilience Engineering
An exercise designed to increase resilience through large-scale fault injection across critical systems
Phases of Cultural Change
Analyze the current culture, along with its impact on the performance of your business and employees. Identify the values and culture you strive for. Set your goals, and build and implement your plan based on the values, behaviors, and structures
How is Agile Different?
Analysis, design, coding, and testing are continuous activities
Scrum Framework
-Roles: Product Owner, Scrum Master, Team -Ceremonies: Sprint Planning, Sprint Review, Sprint Retrospective, Daily Scrum meeting -Artifacts: Product Backlog, Sprint Backlog, Burndown Charts
3 main phases for basic framework necessary for culture to spread
1) Building Momentum 2) Changing mindsets 3) Continuously improving
GitHub Flow
1) Engineer creates a named branch off of master 2) Engineer commits to that branch locally, regularly pushing their work to the same named branch on the server 3) When they need feedback or help, or when they think the branch is ready for merging, they open a pull request 4) When they get their desired reviews and approvals, the engineer can merge it into master 5) Once code changes are merged and pushed to master, the engineer deploys them into production
Source Code integrity and code signing
All commits to version control should be signed - that is straightforward to configure using the open source tools gpg and git
What happens in a post-mortem meeting?
1. Construct timeline and details from multiple perspectives on failures, ensuring we don't punish people for making mistakes 2. Empower all engineers to improve safety by allowing them to give detailed accounts of their contributions to failures 3. Enable and encourage people who do make mistakes to be the experts who educate the rest of the organization on how not to make them in the future 4. Accept that there is always a discretionary space where humans can decide to take action or not, and that the judgement of those decisions lies in hindsight 5. Propose countermeasures to prevent a similar accident from re-occurring and ensure these countermeasures are recorded with a target date and an owner for follow up
Principles of organizational change
1. Establish a sense of urgency 2. Create the guiding coalition 3. Develop a vision and strategy 4. Communicate the change vision 5. Empower employees for broad-based action 6. Generate short-term wins 7. Consolidate gains and produce more change 8. Anchor new approaches in the culture
Iterative
1. Iterative Development 2. Risk-value lifecycle 3. Shared vision 4. Use case-driven development 5. Release planning
2 of Lean's major tenets
1. Manufacturing lead time required to convert raw materials into finished goods was the best predictor of quality, customer satisfaction, and employee happiness 2. Small batch sizes of work were the best predictor of short lead times
Scaled Agile
1. Measured Performance 2. Formal Change Management 3. Concurrent Testing
Traditional
1. Multiple views 2. Quality attribute-driven development 3. Component based development 4. Asset reuse 5. Decision capture 6. Architecture proving
Stakeholders needed at a Blameless-Post Mortem
1. People involved in decisions that may have contributed to the problem 2. People who identified the problem 3. People who responded to the problem 4. People who diagnosed the problem 5. People who were affected by the problem 6. Anyone else interested in attending the meeting
Institutionalize Rituals to Pay Down Technical Debt
1. Schedule and conduct day- or week-long improvement blitzes where everyone on a team or in the entire org self-organizes to fix problems they care about--no feature work is allowed 2. Spring or Fall Cleanings 3. Schedule week-long improvement blitzes that prioritize Dev and Ops working together towards improvement goals, where at the end each team presents to their peers the problem and what they built (work across the entire value stream to solve problems)
Contrasting Points
1. Speed 2. Creating vs. Deploying 3. Specialization 4. Communication and Documentation 5. Team Size 6. Scheduling 7. Automation
Agile
1. Test-driven development (TDD): Creating tests that are a specification of what the code should do first 2. Continuous integration: Encourage frequent integration and testing of program changes 3. Refactoring: Changing an existing body of code in order to improve its internal structure 4. Whole team: A focus on the value of highly collaborative teams, as exemplified by Scrum's daily standup meeting. Instills a sense of collective ownership and responsibility 5. User Story-driven Development: Capture requirements in a lightweight manner. Encourages collaboration with the relevant stakeholders throughout a project 6. Team Change Management: Supports the logging of defects or new requirements, by any member of the team, that are within the scope of the current iteration
Most common A/B testing method in UX
Users are randomly shown one of 2 versions of a page, control (A) or treatment (B). Statistical analysis of the subsequent behavior of these users demonstrates whether there is a causal link between the treatment and the outcome
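One common way to do the statistical analysis is a two-proportion z-test on conversion counts. This is a sketch under that assumption; the notes do not say which test is used, and the function name and example numbers are illustrative.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Compare conversion rates of control (A) and treatment (B).

    Assumes both groups are non-empty and the pooled rate is strictly
    between 0 and 1, otherwise the standard error would be zero.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative numbers only: 100 of 1000 control users converted, 150 of 1000 treatment users.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the difference between A and B is unlikely to be chance; the threshold for acting on it is a decision for the team.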
What % of cycles should be reserved for non-functional requirements and reducing tech debt
20%
Team
5-10 members. Teams are self-organizing and cross-functional. Membership should change only between sprints
Evaluating the effectiveness of pull request processes
A bad pull request is one that doesn't give the reader enough context, with little or no documentation of what the change is intended to do. A great PR gives sufficient detail on why the change is being made and identifies risks and the resulting countermeasures
Continuous Integration requires 3 capabilities
A comprehensive and reliable set of automated tests that validate we are in a deployable state; a culture that "stops the entire production line" when our validation tests fail; developers working in small batches on trunk rather than on long-lived feature branches
Burndown Chart
A display of what work has been completed and what is left to complete
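A minimal sketch of the numbers behind a burndown chart, assuming work is measured in story points and progress is recorded once per day. The function name and sample figures are illustrative.

```python
from typing import List


def burndown(total_points: int, completed_per_day: List[int]) -> List[int]:
    """Return the work remaining at the start of the sprint and after each day."""
    remaining = [total_points]
    for done in completed_per_day:
        # Remaining work never drops below zero, even if extra work gets finished.
        remaining.append(max(remaining[-1] - done, 0))
    return remaining


# A 20-point sprint with a day (the third) where nothing was completed.
chart = burndown(20, [5, 3, 0, 8])
```

Plotting `chart` against the day index gives the usual downward-sloping line; a flat segment marks a day without completed work.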
Use telemetry to make deployments safer
Actively monitor production telemetry when anyone performs a production deployment
Generative Organizations
Actively seeking and sharing information to better enable the organization to achieve its mission
Case Study-- Standardizing a new tech stack at Etsy
After a disastrous peak holiday season, Etsy decided to massively reduce the # of technologies used in production. The goal was to standardize and reduce the support infrastructure and configurations by migrating Etsy's entire platform to PHP and MySQL. Etsy wanted both Dev and Ops to understand the full technology stack
Re-Categorize the Majority of Our Lower Risk Changes as Standard Changes
Once a reliable deployment pipeline is in place and deployments are fast, reliable, and non-dramatic, the next step is to gain agreement from Ops and the relevant change authorities that our changes have been demonstrated to be low risk enough to be defined as standard changes, pre-approved by the CAB. Show that changes are low risk by presenting the history of changes over a significant period of time along with a complete list of production issues during the same period. Assert high change success rates and low Mean Time To Repair (MTTR). Standard changes still need to be visible and recorded in the change management system; ideally, automated deployments will have their results automatically recorded and linked to work planning tools (e.g., JIRA). This allows for visibility and traceability
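A minimal sketch of how change success rate and MTTR could be computed from a change history, assuming each record notes whether the change caused an incident and how long the repair took. `ChangeRecord` and its fields are hypothetical; real data would come from the change management system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeRecord:
    change_id: str
    caused_incident: bool
    minutes_to_repair: float = 0.0  # only meaningful when caused_incident is True


def change_success_rate(changes: List[ChangeRecord]) -> float:
    """Fraction of changes that did not cause an incident (assumes a non-empty history)."""
    failures = sum(1 for c in changes if c.caused_incident)
    return 1 - failures / len(changes)


def mean_time_to_repair(changes: List[ChangeRecord]) -> float:
    """Average repair time in minutes over the changes that caused incidents."""
    repairs = [c.minutes_to_repair for c in changes if c.caused_incident]
    return sum(repairs) / len(repairs) if repairs else 0.0


# Illustrative history: four changes, one of which caused a 30-minute incident.
history = [
    ChangeRecord("c1", False),
    ChangeRecord("c2", False),
    ChangeRecord("c3", True, 30.0),
    ChangeRecord("c4", False),
]
success_rate = change_success_rate(history)
mttr_minutes = mean_time_to_repair(history)
```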
Contrasting Point - Automation
Agile: Doesn't require automation. DevOps: Automation is the heart of DevOps because the goal is to minimize disruptions and maximize efficiency
Contrasting Point - Team Size
Agile: Small teams, so they can move faster. DevOps: Will have many teams working together, and each team can realistically practice different theories
Contrasting Point - Scheduling
Agile: Sprints; work in short, predetermined amounts of time, rarely longer than a month and as short as a week. DevOps: Values maximum reliability, so teams focus on a long-term schedule that minimizes business disruptions
Culture of Agile and DevOps
Agile represents a change in how we think about development: Agile thinking promotes small, manageable changes made quickly that over time lead to large changes. DevOps brings cultural shifts within an organization, including enhanced communication and balancing stability with change and flexibility
What is Agile?
Agile is a time-boxed, iterative approach to software and product delivery that builds software incrementally from the start of the project instead of trying to deliver it all at once near the end. It works by breaking down projects into little bits of user functionality called "user stories", prioritizing them, and then continuously delivering them over iterations
Contrasting Point - Communication & Documentation
Agile: Daily informal meetings are required so that team members can share progress and daily goals and ask for help when needed; scrums are not meant to go over documentation, milestones, or metrics. Teams don't codify their meeting minutes or other communications, often preferring lo-fi methods such as simple pen and paper. DevOps: Meetings are not daily, but a lot of documentation is required in order to communicate software deployment to all relevant teams. Requires design documents and specs in order to fully understand a software release
Have developers initially self-manage their production service
Have dev groups self-manage their services in production before they become eligible for a centralized Ops group to manage (Google started this). Define launch requirements that must be met in order for services to interact with real customers: defect counts and severity, type/frequency of pager alerts, monitoring coverage, system architecture, deployment process, production hygiene. Effective monitoring should be in place, deployments should be reliable and deterministic, and the architecture should support fast/frequent deployments. We also need a mechanism to ensure that Ops is never stuck with an unsupportable service in production: create a service hand-back mechanism, so that when a production service becomes sufficiently fragile, Ops can return production support responsibility to Dev (practice at Google)
Use Chatrooms and chat bots to automate and capture organizational knowledge
Having work performed by automation in the chat room has numerous benefits: everyone sees everything that is happening; engineers on their first day of work can see what daily work looks like; people are more apt to ask for help when they see others helping each other; rapid organizational learning is enabled and accumulated
Case Study: The Launch and Hand-off readiness review at Google
High-importance product teams are assigned site reliability engineers (SREs); SRE is "what happens when a SWE is tasked with what used to be called operations". Google created 2 sets of safety checks for 2 critical stages of releasing new services: the Launch Readiness Review (LRR) and the Hand-off Readiness Review (HRR). The LRR must be performed and signed off on before any new Google service is made publicly available to customers and receives live production traffic. The HRR is performed when the service is transitioned to an Ops-managed state, usually months after the LRR. Any product team going through an LRR or HRR has an SRE assigned to help them understand and achieve the requirements. Teams with the fastest HRR production approvals are the ones that worked with SREs earliest
Normal Changes
Higher-risk changes that require review or approval from the agreed-upon change authority. In many orgs this responsibility is placed with a change advisory board (CAB) or emergency change advisory board, which may lack the required expertise to understand the full impact of the change, often leading to unacceptably long lead times. The CAB will almost certainly have a well-defined request for change (RFC) form that defines what info is required for the go/no-go decision.
Reduce Reliance on Separation of Duty
Historically, we used separation of duty as one of our primary controls to reduce the risk of fraud or mistakes in the software development process. As complexity and deployment frequency increase, separation of duty slows us down and reduces the feedback engineers receive on their work. Solutions: pair programming, code check-ins, and code review give the necessary reassurance about the quality of our work
Catch Errors as early in our Automated testing as possible
In order to find errors as early as possible, run faster-running automated tests (unit tests) before slower-running automated tests (acceptance and integration tests). If most errors are found by acceptance and integration tests, feedback to devs is slower than with unit tests, and fixing those errors is much slower because we have to validate that the tests pass by re-running them. If an error is found by an acceptance or integration test, we should create a unit test that could find the error faster, earlier, and cheaper. If unit or acceptance tests are too difficult or expensive to write, it's likely that the architecture is too tightly coupled
Find and Fill any telemetry gaps
Identify gaps in our telemetry that impede our ability to quickly detect and resolve incidents; use this data later to better anticipate problems. We need metrics from the following levels: business (# of sales transactions, revenue, etc.), application (transaction times, user response time), infrastructure (web server traffic, CPU load), client software (errors and crashes), and deployment pipeline (build pipeline status, red or green; change deployment lead times). At the application level, the goal is to generate telemetry not only around application health but also around the extent to which we are achieving our organizational goals. Goal: every business metric should be actionable--these top metrics should help inform how to change our product and be amenable to experimentation and A/B testing. For production and non-production infrastructure, ensure that we are getting enough telemetry so that if a problem occurs in any environment, we can quickly determine whether infrastructure is a contributing cause of the problem
Protect our deployment pipeline
If someone compromises the servers running the deployment pipeline, which hold the credentials for our version control system, they could steal source code or inject malicious changes into our repository. A good place to hide malicious code is a unit test, because no one looks at them and they run every time someone commits code to the repo. To protect our continuous build, integration, or deployment pipeline, our mitigation strategies may include: hardening continuous build and integration servers; reviewing all changes introduced into version control, either through pair programming or code review; instrumenting our repository to detect when test code contains suspicious API calls; ensuring every CI process runs in its own isolated container or VM; ensuring the version control credentials used by the CI system are read-only
Design for operations through codified non-functional requirements
Implementing non-functional requirements will enable our services to be easy to deploy and keep running in production. Examples of non-functional requirements: sufficient production telemetry; ability to accurately track dependencies; services that are resilient and degrade gracefully; forward and backward compatibility between versions; ability to archive data to manage the size of the production data set; ability to easily search and understand log messages; ability to trace requests from users; simple, centralized runtime configuration using feature flags
Creating security telemetry in our applications
In order to detect problematic user behavior, we need to create relevant telemetry, which may include: successful and unsuccessful user logins, user password resets, user email address resets, user credit card changes
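A minimal sketch of counting these security events so they can be graphed or alerted on. `SecurityTelemetry` and the event names are hypothetical; in practice the counters would be sent to whatever metrics system Dev and Ops already use.

```python
from collections import Counter


class SecurityTelemetry:
    """Hypothetical in-process counter for security-relevant events."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, event: str, success: bool = True) -> None:
        outcome = "success" if success else "failure"
        self.counts[f"{event}.{outcome}"] += 1

    def failed_login_ratio(self) -> float:
        """Share of login attempts that failed; 0.0 when there were no attempts."""
        failed = self.counts["login.failure"]
        total = failed + self.counts["login.success"]
        return failed / total if total else 0.0


telemetry = SecurityTelemetry()
telemetry.record("login")
for _ in range(3):
    telemetry.record("login", success=False)
telemetry.record("password_reset")
telemetry.record("credit_card_change")
```

A sudden rise in `failed_login_ratio()` is the kind of signal that could feed the same dashboards used for application health.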
Pull our Andon cord when the pipeline breaks
In order to keep our deployment pipeline in a green state, we create a virtual Andon cord: whenever someone introduces a change that causes our build or automated tests to fail, no new work is allowed to enter the system until the problem is fixed. Create highly visible indicators so that the entire team can see when our automated tests are failing
Ways to Influence People's feelings:
Increase positive motivation (being valued and receiving recognition) Decrease Negative Motivation (eliminating mistrust, fear of making a mistake, fear of change, etc)
Inject Production Failures to Enable Resilience and Learning
Increase resilience by regularly injecting and rehearsing failure in the system to confirm that we have designed and architected our systems properly so that failures happen in specific and controlled ways
Sprint Backlog
Individuals sign up for work of their own choosing. Estimated work remaining is updated daily. Any team member can add, delete, or change the sprint backlog. Work for the sprint emerges. If work is unclear, define a sprint backlog item with a larger amount of time and break it down later. Update work remaining as more becomes known
How to change the behaviors of others
Influence: beliefs and values, people's feelings, people's level of awareness
DevOpsSec:
Infosec is integrated into all stages of the SDLC
Code Review:
Instead of requiring approval from an external body prior to deployment, we require engineers to get peer reviews of their changes. Goal: find errors by having fellow engineers close to the work scrutinize our changes. Require reviews prior to committing code to trunk in source control
CHAPTER 17
Integrate Hypothesis Driven Development and A/B Testing into our Daily Work
Create Internal Consulting and Coaches to Spread Practices
Internal coaching or consulting org is useful to spread expertise across an organization
Dependency Scanning
Inventorying all our dependencies for binaries and executables, ensuring that dependencies are free of vulnerabilities
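A minimal sketch of checking an inventory of dependencies against advisory data. The advisory table, package names, and versions are made up for illustration; a real scan would use an actual vulnerability feed and handle version ranges rather than exact matches.

```python
# Hypothetical advisory data: package name -> set of versions known to be vulnerable.
KNOWN_VULNERABLE = {
    "examplelib": {"1.0.0", "1.0.1"},
    "otherlib": {"2.3.0"},
}


def scan_dependencies(dependencies: dict, advisories: dict = KNOWN_VULNERABLE):
    """Return sorted (name, version) pairs that appear in the advisory data."""
    return sorted(
        (name, version)
        for name, version in dependencies.items()
        if version in advisories.get(name, set())
    )


# Inventory of what the application actually ships with (illustrative).
inventory = {"examplelib": "1.0.1", "otherlib": "2.4.0", "safelib": "0.1.0"}
vulnerable = scan_dependencies(inventory)
```

Running a check like this in the deployment pipeline gives fast feedback when a vulnerable dependency is introduced.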
Deployment
Is the installation of a specific version of software to a given environment (may or may not be associated with a release of a feature to customers)
How does Agile Work
Make a list: sit down with the customer and make a list of features they would like to see in the software ("user stories"). Size things up: size stories relative to each other, guessing how long each will take. Set some priorities: ask the customer to prioritize their list. Start executing: start at the top and work your way to the bottom, building, iterating, and getting feedback from the customer. Update the plan as you go: as you start delivering, you are either going fast enough or you have too much to do and not enough time; in the latter case there are 2 choices: do less and cut scope, or push out the date and ask for more money. Planning is adaptive. Roles blur. Scope varies. Requirements can change
Encourage Organizational Learning
Make engineers feel safe when giving details about mistakes. Make them enthusiastic about helping the company avoid the same error in the future. Continually reinforce that we value actions that expose and share more widely the problems in our daily work
An Agile Approach
Manifesto for Agile Software Development (2001) codified several values: People: teammates, customers, and interactions between these people, instead of processes and tools. Immediacy: working software, instead of comprehensive documentation. Flexibility: responding to, and even embracing, change, instead of following a predetermined plan
Potential Dangers of doing more manual testing and change freezes
Manual testing is slower and more tedious than automated testing; change freezes mean deploying less frequently and thus increasing deployment batch size. We want to fully integrate testing into our daily work as part of a smooth and continual flow into production
Integrate information security into production telemetry
Marcus Sachs (Verizon Data Breach) found that cardholder data breaches were detected months or quarters after the breach occurred, and that they were detected not by internal monitoring but by someone outside of the organization, either a business partner or a customer. This shows that internal security controls are often ineffective in successfully detecting breaches in a timely manner, either because of blind spots in our monitoring or because no one is examining the relevant telemetry in their daily work. We should integrate security telemetry into the same tools that development, QA, and operations are using, to create visibility into how their applications and environments are performing in a hostile threat environment. This reinforces that everyone needs to be thinking about security risks and designing countermeasures in their daily work.
Agile/DevOps transformation considers
Method, Tools, Enablement, Organization, Infrastructure, Adoption
Negative Values
Money, status, power, control, promoting winning over others
Create Application Logging telemetry that helps production
We must ensure that the applications we build and operate create sufficient telemetry. Do this by having Dev and Ops create production telemetry as part of their daily work. Dev can use telemetry to better diagnose problems on their workstations; Ops can use telemetry to diagnose a production problem. Different logging levels: Debug, Info, Warn, Error, Fatal. We should also create hierarchical logging categories
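A minimal sketch of the five logging levels and hierarchical categories using Python's standard `logging` module, where Warn corresponds to `WARNING` and Fatal to `CRITICAL`. The logger names (`shop`, `shop.payments`, `shop.security`) and messages are illustrative assumptions.

```python
import logging

logging.basicConfig(format="%(levelname)s %(name)s: %(message)s")

# Hierarchical categories: "shop.payments" and "shop.security" inherit from "shop".
shop_log = logging.getLogger("shop")
shop_log.setLevel(logging.INFO)        # default threshold for the whole application
payments_log = logging.getLogger("shop.payments")
security_log = logging.getLogger("shop.security")
payments_log.setLevel(logging.DEBUG)   # turn up detail for one category only

payments_log.debug("cart contents: %s", ["book"])                 # Debug
security_log.info("user %s logged in", "alice")                   # Info
payments_log.warning("database call took %d ms", 1200)            # Warn
payments_log.error("payment API call failed")                     # Error
security_log.critical("cannot load signing key, shutting down")   # Fatal
```

Because levels are inherited down the dotted names, detail can be raised for one category while the rest of the application stays at its default.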
Ensure Security of our software supply chain
Need to be aware of security vulnerabilities in open source/commercial software
Bureaucratic Organizations
Rules and processes. Failure is processed through system of judgement
How to institutionalize improvement of daily work
Processes degrade over time in the absence of improvements. Reserve time to pay down tech debt, fix defects, and refactor and improve problematic areas of our code and environments. Fix problems when it is not only easier and cheaper but also when the consequences are smaller
Identify the teams supporting our value stream
Product Owner, Development, QA, Operations, InfoSec, Release Manager, Technology Executives, Value Stream Manager
Scrum Master
Project manager or team leader. Responsible for enacting scrum values and practices. Removes impediments/politics
CHAPTER 23
Protecting the Deployment Pipeline
Integrate security into our deployment pipeline
Provide Dev and Ops fast feedback on their work so that they are notified whenever they commit changes that are potentially insecure
Ensure security of the environment
Put in monitoring controls to ensure that all production instances match these known good states
Automate standardized processes in software for re-use
Put knowledge into a centralized source code repository, making the tool available for everyone to search and use
Create Self-service access to telemetry and information radiators
Radiate this telemetry data to the rest of the organization, ensuring that anyone who wants info about any of the services we are running can get it without needing production system access or privileged accounts. By putting info radiators in highly visible places we promote responsibility among team members, actively demonstrating the following values: 1. Team has nothing to hide from its visitors 2. Team has nothing to hide from itself
Waterfall vs Scrum
Rather than doing everything at one time, scrum teams do a little of everything all the time
Chapter 22: Information Security is everyone's job everyday
Ratio of engineers in dev, ops, and infosec is 100:10:1
CHAPTER 18: Create review and coordination processes to increase quality of our current work
Reduce the risk of production changes before they are made. Pull requests allow for peer review to determine how risky a change is.
Decouple deployments from releases
Releases are driven by the marketing launch date, but things often don't go according to plan and failures may occur. Restoring service may require a painful rollback process or an equally risky fix-forward operation, where we make changes directly in production. To deploy more frequently and achieve our desired outcome of smooth and fast flow, we need to decouple our production deployments from our feature releases. Deployment and release are 2 distinct actions.
CHAPTER 21
Reserve Time to Create Organizational Learning and Improvement
Gated Commits
The deployment pipeline first confirms that a submitted change will merge successfully, build as expected, and pass all automated tests before it is actually merged into trunk. If not, the developer is notified, allowing corrections to be made without impacting anyone else in the value stream.
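The gating logic can be sketched as a function that trial-merges a change and only updates trunk when every check passes. This is a toy model, not a real CI system: `gated_commit`, the check names, and the file names are all hypothetical.

```python
def gated_commit(trunk, change, checks):
    """Return (new_trunk, failures). The change lands only if every check passes."""
    candidate = trunk + [change]  # trial merge; the real trunk is left untouched
    failures = [name for name, check in checks if not check(candidate)]
    if failures:
        return trunk, failures    # rejected: notify the developer, trunk unchanged
    return candidate, []          # accepted: the candidate becomes the new trunk


# Hypothetical stand-ins for "builds as expected" and "passes automated tests".
checks = [
    ("builds", lambda files: all(f.endswith(".py") for f in files)),
    ("tests_pass", lambda files: "broken.py" not in files),
]

trunk = ["app.py"]
trunk, failures_ok = gated_commit(trunk, "feature.py", checks)
trunk, failures_bad = gated_commit(trunk, "broken.py", checks)
print(trunk, failures_ok, failures_bad)
```

The rejected change never reaches trunk, so nobody else in the value stream is affected while the developer corrects it.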
Interdependence of the systems
When improving brownfield systems we should strive to reduce complexity and improve reliability and stability, making them faster, safer, and easier to change. When new functionality is added to greenfield systems of engagement, it often causes reliability problems in the brownfield systems of record they rely on.
Enable coordination and scheduling of changes
When multiple groups are working on systems that share dependencies, our changes will likely need to be coordinated to ensure that they don't interfere with each other. The more loosely coupled our architecture, the less we need to communicate and coordinate with other component teams. Even in a loosely coupled architecture, there may be a risk of changes interfering with each other (e.g. simultaneous A/B testing); to mitigate these risks, use chat rooms to announce changes and proactively find collisions that may exist. Organizations with more tightly coupled architectures need to deliberately schedule changes.
Release
When we make a feature available to all our customers or a segment of customers. Our code and environment should be architected in a way that the release of functionality does not require changing our application code
Standardized Model
Where routine and systems govern everything, including strict compliance with timelines and budgets
User Stories
Who (user role): is this a customer, employee, admin, etc.? What (goal): what functionality must be achieved/developed? Why (reason): why does the user want to accomplish this goal? Template: As a [user role], I want to [goal], so that I can [reason]
Publish our Post-Mortems as Widely as Possible
Widely publish meeting notes and artifacts, ideally in a centralized location where the entire org can access and learn from them, esp. if a similar incident arises on a different team (organizational learning). Prohibit production incidents from being closed until the post-mortem meeting has been completed. Example: Etsy's Morgue is a tool for better access to post-mortem meeting notes.
Test Driven Development:
Write automated tests before we write the code. Begin every change to the system by first writing an automated test that validates the expected behavior and fails, then write the code to make the test pass. Technique developed by Kent Beck has 3 steps: 1. Ensure the tests fail 2. Ensure the tests pass 3. Refactor both new and old code to make it well structured, then ensure the tests pass again
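A small illustration of the three steps using Python's `unittest`. The `slugify` function and its expected behavior are invented for this example; in practice the test class is written first and fails until the function exists.

```python
import unittest


def slugify(title):
    # Step 2: the simplest code that makes the tests below pass.
    # Step 3 (refactor) would tidy this up while keeping the tests green.
    return "-".join(title.lower().split())


class SlugifyTest(unittest.TestCase):
    # Step 1: these tests are written first and fail until slugify is implemented.
    def test_lowercases_and_joins_words(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  Gated   Commits "), "gated-commits")


# Run the suite programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```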
Blue-Green deployment pattern
You have 2 production environments: blue and green. At any time, only one of these is serving customer traffic. To release a new version of our service, we deploy to the inactive environment (here, blue), where we can perform our testing without interrupting the UX. To execute the release, we direct traffic to the blue environment, so blue goes live and green becomes staging. Rollback is performed by sending customer traffic back to the green environment. Having 2 versions creates problems when they depend upon a common database. Two approaches to solve this problem: 1. Create 2 databases (a blue and a green database) 2. Decouple database changes from application changes (make only additive changes to the database and make no assumptions in our application about which database version will be used in production)
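The traffic switch can be modeled in a few lines. This is a toy sketch, not a real load balancer configuration: `BlueGreenRouter` and the version labels are hypothetical, and green is assumed to start out live.

```python
class BlueGreenRouter:
    """Toy model of two production environments behind one traffic switch."""

    def __init__(self, live_version):
        # Assumption for this sketch: green starts out live, blue is idle.
        self.versions = {"blue": None, "green": live_version}
        self.live = "green"

    @property
    def idle(self):
        return "blue" if self.live == "green" else "green"

    def deploy(self, version):
        """Install a new version on the idle environment; customers are unaffected."""
        self.versions[self.idle] = version
        return self.idle

    def switch(self):
        """Point customer traffic at the other environment (release or rollback)."""
        self.live = self.idle

    def serving(self):
        return self.versions[self.live]


router = BlueGreenRouter(live_version="v1")
target = router.deploy("v2")        # lands on blue; green keeps serving v1
before_release = router.serving()
router.switch()                     # release: blue goes live
after_release = router.serving()
router.switch()                     # rollback: traffic returns to green
after_rollback = router.serving()
print(target, before_release, after_release, after_rollback)
```

Release and rollback are the same cheap operation, which is what makes the pattern attractive; the shared-database problem noted above is not modeled here.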
Just Culture
Codified by Dr. Sidney Dekker. When responses to incidents and accidents are seen as unjust, it can impede safety investigations, promote fear rather than mindfulness in people who do safety-critical work, make organizations more bureaucratic rather than more careful, and cultivate professional secrecy, evasion, and self-protection.
Blameless Post-Mortem
Coined by John Allspaw. Helps us examine "mistakes in a way that focuses on the situational aspects of a failure's mechanism and the decision-making process of individuals proximate to the failure". Schedule the post-mortem as soon as possible after the accident has occurred and the problem has been solved, before memories and the links between cause and effect fade or circumstances change. Disallow the phrases "would have" or "could have" because they are counterfactual statements that result from the tendency to create possible alternatives to events that have already happened.
Feature toggles
control what percent of users see the treatment version of an experiment
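One common way to implement a percentage toggle is to hash each user into a stable bucket from 0 to 99. The sketch below assumes that approach; `in_treatment`, the experiment name, and the user IDs are hypothetical.

```python
import hashlib


def in_treatment(user_id, experiment, percent):
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # stable bucket 0..99
    return bucket < percent


users = [f"user-{n}" for n in range(10_000)]
share = sum(in_treatment(u, "new-checkout", 20) for u in users) / len(users)
print(f"{share:.1%} of users see the treatment")  # close to the configured 20%
```

Hashing on the experiment name plus the user ID keeps each user's assignment stable across requests while giving different experiments independent splits.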
Positive Values
customer focus, performance, safety, innovation, service oriented, reliable
Rugged DevOps:
incorporating information security objectives into DevOps
Game Day team
Defines and executes drills; any problems or difficulties are identified, addressed, and tested again
Direct Response Marketing:
defining customer acquisition funnel and performing A/B Testing
Agile Infrastructure and Velocity Movement (2008-2009)
Shared goals b/t Dev and Ops and continuous integration practices make deployment part of everyone's daily work. "DevOps" term coined in Ghent, Belgium by Patrick Debois.
Big Bang Approach
Starting everywhere all at once (BAD STRATEGY), e.g. waterfall model
Using Strangler Application Pattern to Safely Evolve our Enterprise Architecture
Term "strangler application" coined by Martin Fowler in 2004. When implementing strangler applications, seek to access all services through versioned APIs. By creating strangler applications you avoid reproducing existing functionality in some new architecture or technology.
Deployment Lead Time
Subset of the value stream; begins when any engineer checks a change in to version control and ends when the change is successfully running in production, providing value to the customer and generating useful feedback and telemetry. PHASE 1 - Design and Development: highly variable and highly uncertain b/c of the high degree of creativity and because the work may never be performed again, resulting in high variability of process times. PHASE 2 - Testing and Operations: strives to have short and predictable lead times, with near 0 defects. Goal is to have testing/operations happening simultaneously with design/development, enabling fast flow and high quality.
Greenfield Project
A new software project or initiative, likely in the early stages of planning or implementation, where we build our applications and infrastructure anew, with few constraints. Typically easier to implement esp. if already funded / a team is in place, don't have to worry about existing code bases, processes, and teams. Typical examples: pilots to demonstrate feasibility of public or private clouds, piloting deployment automation, and similar tools
DevOps
A set of practices that seek to reduce the gap between software development and software operation. It focuses on automating and monitoring all steps of software construction, from integration, testing, releasing to deployment and infrastructure management. The objective is to build shorter development cycles, increased deployment frequency, more dependable releases, in close alignment with business objectives. DevOps is the result of applying Lean principles to the technology value stream and a logical continuation of the Agile software journey that began in 2001
What should happen when production changes are made?
Changes should be: 1. Put into version control 2. Automatically replicated everywhere in our production and pre-production environments as well as any newly created environments. Old environments should then be destroyed or taken out of rotation.
Matrix-oriented organizations
Attempt to combine functional and market orientation. Often result in complex organizational structures, such as individual contributors reporting to 2 or more managers. Sometimes achieve neither their functional nor their market orientation goals.
Versioned APIs/Immutable services
enable us to modify the service without impacting the callers, allowing for more loosely coupled architecture
Functional-oriented organizations
(How most IT Ops orgs are structured.) Optimize for expertise, division of labor, or reducing cost (DBAs in one group, InfoSec in another, QA in another). Centralize expertise, which helps enable career growth and skill dev. Often have tall hierarchical organizational structures. Prevailing method of organization for operations.
CI (continuous integration)
- a practice that focuses on making a release easier - developers merge their changes back to the main branch as often as possible - the developer's changes are validated by creating a build and running automated tests against the build - CI puts a great emphasis on testing automation to check that the application is not broken whenever new commits are integrated into the main branch
Continuous Delivery
- an extension of CI to make sure that you can release new changes to your customers quickly in a sustainable way - in addition to having automated testing, you have also automated your release process and can deploy your application at any point in time by clicking a button - when all developers are working in small batches on trunk, or everyone is working off trunk in short-lived feature branches that get merged to trunk regularly, trunk is always kept in a releasable state and we can release on demand
Continuous Deployment
- every change that passes all stages of your production pipeline is released to your customers - there's no human intervention; only a failed test will prevent a new change from being deployed to production - deploying to production at least once per day per developer, or perhaps even automatically deploying every change a developer commits - continuous delivery is the prerequisite for continuous deployment
What should be in version control?
1. All application code and dependencies 2. Any script used to create database schemas, application reference data 3. All the environment creation tools and artifacts 4. Any file used to create containers 5. All supporting automated tests and any manual test scripts 6. Any script that supports code packaging, deployment, database migration, and environment provisioning 7. All project artifacts (requirement documentation, deploy procedures, release notes) 8. All cloud configuration 9. Any other script or config info required to create infrastructure that supports multiple services
Problems Often Caused by Overly Functional Orientation ("Optimizing for Cost")
1. Long lead times and queues esp. for complex activities like large deployments where we need to open tickets w/ multiple groups and coordinate work handoffs leading to long queues at every step 2. Person performing the work has little visibility or understanding of how their work relates to any value stream goals (little motivation or creativity) 3. Poor handoffs, large amounts of re-work, quality issues, bottlenecks, delays 4. As you increase the number of dev teams and their deployment and release frequency most functionally oriented organizations will have difficulty keeping up and delivering satisfactory outcomes.
Why is swarming necessary?
1. Prevents problem from progressing downstream, where cost and effort to repair it increases exponentially and tech debt is allowed to accumulate 2. Prevents work center from starting new work, which will likely introduce new errors 3. If problem is not addressed, work center could have same problem in next operation, requiring more fixes and work 4. Enables learning and prevents loss of critical info due to fading memory or changing circumstances
Side effects of large batch size merges
1. Effort required to successfully merge branches back together increases exponentially as the number of branches increases 2. Increase in rate of code production → increased probability that any given change will impact someone else, and increase in the # of developers who will be impacted when someone breaks the dev pipeline 3. When merging is difficult → less able and motivated to improve and refactor code b/c refactoring will cause rework for everyone else → more reluctant to modify code that has dependencies throughout the codebase, which is where the payoff may be highest
Ineffective Quality Controls
1. Requiring another team to complete tedious, error-prone manual tasks that could be easily automated and run as needed by the team who needs the work performed 2. Requiring approvals from busy people who are distant from the work, forcing them to make decisions without adequate knowledge of the work or its potential implications 3. Creating large volumes of documentation of questionable detail which become obsolete shortly after they are written 4. Pushing large batches of work to teams and special committees for approval and processing and then waiting for responses
Overly tight architecture risks
1. Risk global failure every time a commit to trunk or release to production is attempted 2. Complicated integration b/c of big batch changes in deployment increases risk of something going wrong 3. May take weeks to find and fix problems when coordinating with hundreds of developers to
requirements for a deployment pipeline
1. deploying the same way to every environment 2. smoke testing our deployments 3. ensure we maintain consistent environments
Working Safely within Complex Systems
4 conditions make it safer to work in complex systems: 1. Complex work is managed so that problems in design and operations are revealed 2. Problems are swarmed and solved, resulting in quick construction of new knowledge 3. New local knowledge is exploited globally throughout the organization 4. Leaders create other leaders who continually grow these types of capabilities
Deployment lead times of months
Common in large, complex organizations with tightly-coupled, monolithic applications, often with scarce integration test environments, long test and production environment lead times, high reliance on manual testing, and multiple required approval processes. With long deployment lead times: heroics are required at almost every stage of the value stream; when all development teams' changes are merged together, the resulting code might not build correctly or pass any of the tests; fixing problems might require weeks, resulting in poor customer outcomes.
Complex system
Complex system means no one person can see the system as a whole and understand how all the pieces fit together, high interconnectedness of tightly-coupled components and system-level behavior that cannot be explained merely in terms of the behavior of the system components
How to automate environment build process:
1. Copying a virtualized environment 2. Building an automated environment creation process that starts from "bare metal" 3. Using "infrastructure as code" configuration management tools 4. Using automated operating system configuration tools 5. Assembling an environment from a set of virtual images or containers 6. Spinning up a new environment in a public cloud
Adopt Trunk-Based Development Practices
Countermeasure to large batch size merges: institute continuous integration and trunk-based development practices, where all developers check code into trunk at least once per day. Frequent code commits to trunk → can run all automated tests on our software system as a whole and receive alerts when a change breaks the app or interferes with someone else's work → small problems can be corrected faster
See problems as they occur
Dr. Peter Senge describes feedback loops as a critical part of learning organizations and systems thinking. Feedforward and feedback loops cause components within a system to reinforce or counteract each other. Fast feedback and feedforward loops: 1. Allow quick detection of and recovery from problems 2. Inform us how to prevent these problems from occurring again in the future
Expanding DevOps Across Our Organization
Dr. Roberto Fernandez describes ideal phases used by change agents to build and expand their coalition and base of support 1. Find Innovators and Early Adopters (Kindred spirits, and better if respected and have high credibility) 2. Build Critical Mass and Silent Majority (Work with more teams who are receptive to ideas, create bandwagon effect that further increase influence) 3. Identify Holdouts (After silent majority achieved, tackle high profile influential detractors who are most likely to resist)
The First Way
Enables fast left-to-right flow of work from Development to Operations to Customer. Speeding up flow through the TVS reduces the lead time required to fulfill internal or customer requests, increasing quality of work and throughput and boosting our ability to out-experiment the competition. Resulting practices: 1. Continuous build, integration, test, and deployment processes 2. Creating environments on demand 3. Limiting Work in Progress (WIP) 4. Building systems and organizations that are safe to change
The Third Way
Enables the creation of generative high-trust culture that supports dynamic, disciplined, and scientific approach to experimentation and risk taking, facilitating the creation of organizational learning, both from success and failures. Design system of work so that we can multiply the effects of new knowledge, transforming local discoveries into global improvements.
Keep Pushing Quality Closer to the Source
Everyone in value stream should find and fix problems in their area of control as part of daily work, so push quality and safety responsibilities and decision-making to where the work is performed, instead of relying on approvals from distant execs.
Strangler Application pattern
Example of evolutionary design. Instead of "ripping out and replacing" old services whose architectures no longer support our organizational goals, put the existing functionality behind an API and avoid making further changes to it. All new functionality is then implemented in new services that use the new desired architecture, making calls to the old system when necessary. Useful when migrating portions of a monolithic application or tightly-coupled service to one that is more loosely-coupled.
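A rough sketch of the pattern: callers talk only to a stable API facade, which sends migrated operations to new code and everything else to the legacy system. `LegacyOrders`, `OrdersApiV1`, and the operation names are hypothetical.

```python
class LegacyOrders:
    """Stand-in for the old system; we avoid making further changes to it."""

    def create_order(self, item):
        return f"legacy:create:{item}"

    def cancel_order(self, order_id):
        return f"legacy:cancel:{order_id}"


class OrdersApiV1:
    """Versioned facade: callers depend on this API, not on what sits behind it."""

    def __init__(self, legacy):
        self._legacy = legacy
        self._migrated = {}  # operation name -> handler in the new architecture

    def migrate(self, operation, handler):
        self._migrated[operation] = handler

    def call(self, operation, *args):
        handler = self._migrated.get(operation)
        if handler is not None:
            return handler(*args)                        # new service
        return getattr(self._legacy, operation)(*args)   # fall back to old system


api = OrdersApiV1(LegacyOrders())
before = api.call("create_order", "book")
api.migrate("create_order", lambda item: f"new:create:{item}")
after = api.call("create_order", "book")
untouched = api.call("cancel_order", 7)
print(before, after, untouched)
```

Callers never change as operations move over one at a time, so the old system can be retired gradually.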
Brownfield Project
Existing products or services that are already serving customers and have been in operation for years or even decades. Typically come with significant amounts of tech debt, e.g. no test automation or running on unsupported platforms. DevOps has been used to successfully transform brownfield projects (60% of transformation stories at the DevOps Enterprise Summit were brownfield).
The Second Way
Fast and constant flow of feedback from right to left at all stages of our value stream. Requires amplified feedback to prevent problems from happening again, or to enable faster detection and recovery. Create quality at the source and generate or embed knowledge where it is needed. By seeing problems as they occur and swarming them until effective countermeasures are in place, we continually shorten and amplify our feedback loops (maximizing the opportunities for our organization to learn and improve).
Swarm and Solve Problems to Build New Knowledge
Goal of Swarming: to contain problems before they have a chance to spread, and to diagnose and treat the problem so that it cannot recur (Dr. Spear)
CHAPTER 9
Goal: ensure that we can re-create the entire production environment based on what's in version control
Design Team Boundaries in Accordance with Conway's Law
Ideally software architecture should enable small teams to be independently productive, sufficiently decoupled from each other so that work can be done without excessive or unnecessary communication or coordination
Predictor of performance with DevOps
If application could be architected or rearchitected for testability and deployability.
Problems that arise when our telemetry data has non-gaussian distribution
In Operations data sets often have "chi squared" distribution so using standard deviation method can lead to over-alerting
Testing, Operations and Security as Everyone's Job Every Day
In high-performing orgs, the team shares the common goals of quality, availability, and security as part of everyone's everyday job. On-call duty is rotated so that people firefight for the services they built.
Continuous Delivery Movement (2006 and 2009)
Jez Humble and David Farley extended upon development discipline of continuous build, test, and integration. Continuous delivery: role of a "deployment pipeline" to ensure that code and infrastructure are always in a deployable state, all code checked into trunk can be safely deployed into production.
Principle of Evolutionary Architecture
Jez Humble says "architecture of any successful product or organization will necessarily evolve over its life cycle"
Keep Team Sizes Small (The "Two-Pizza Team" Rule)
Keeping teams small reduces the amount of inter-team communication and encourages us to keep the scope of each team's domain small and bounded. Small team size has 4 important effects: 1. Team has clear shared understanding of the system they are working on 2. Limits the growth rate of the product or service being worked on 3. Decentralizes power and enables autonomy 4. Leading a 2PT is a way for employees to gain some leadership experience in an environment where failure does not have catastrophic consequences
Modify our Definition of Development "Done" to include running in production-like environments
Only accept dev work as done when it can be successfully built, deployed, and confirmed to run as expected in a production-like environment, instead of merely when the developer thinks it's done. Ideally it runs under production-like load with a production-like dataset long before the end of the sprint.
Making Functional Orientation Work
DevOps outcomes are achievable through functional orientation as long as everyone in the value stream views customer and organizational outcomes as a shared goal, regardless of where they sit in the organization, so that teams can easily get what they need from Ops reliably and on demand. Requires a high-trust culture, transparently prioritized work, and sufficient slack in the system to allow high-priority work to be completed.
Measures of performance in value streams
Lead Time: starts when the request is made and ends when it is fulfilled (focus on this because lead time is what the customer experiences). Process Time: starts when we begin work on the customer request (omits the time that the work is in queue). %C/A (percent complete and accurate, a measure of rework): quality of output of each step in the value stream, obtained by asking downstream customers what % of the time they receive work that is "usable as is".
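A worked example of the three measures for a single work item. All timestamps and counts below are made up for illustration.

```python
from datetime import datetime


def hours_between(start, end):
    return (end - start).total_seconds() / 3600


# Hypothetical timestamps for one request.
requested = datetime(2024, 1, 1, 9, 0)   # customer makes the request
started = datetime(2024, 1, 3, 9, 0)     # work actually begins (after queueing)
finished = datetime(2024, 1, 3, 15, 0)   # request fulfilled

lead_time = hours_between(requested, finished)   # includes time in queue
process_time = hours_between(started, finished)  # excludes time in queue
queue_time = lead_time - process_time

# %C/A: share of handoffs the downstream step received "usable as is".
received, usable_as_is = 40, 34
percent_ca = 100 * usable_as_is / received

print(lead_time, process_time, queue_time, percent_ca)  # 54.0 6.0 48.0 85.0
```

Here only 6 of the 54 hours are spent working; the rest is waiting in queue, which is why lead time is the measure to focus on.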
Continually optimizing for downstream work centers
Lean defines 2 types of customers that we must design for: 1. External customer (who most likely pays for the service we are delivering) 2. Internal customer (who receives and processes the work immediately after us). Most important customer, according to Lean: the next step downstream. How to create quality at the source: optimize for downstream work centers by designing for operations, where non-functional requirements are as highly prioritized as user features.
Agile Manifesto (2001)
Lightweight set of values and principles. Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale (NO more waterfall). Small motivated teams working in a high-trust management model. Values: 1. Individuals and interactions over processes and tools 2. Working software over comprehensive documentation 3. Customer collaboration over contract negotiation 4. Responding to change over following a plan
An Architecture that Enables Productivity, Testability, and Safety
A loosely coupled architecture with well-defined interfaces that enforce how modules connect with each other promotes productivity and safety
Create Loosely-Coupled Architectures to Enable Developer Productivity and Safety
Loosely-coupled architecture means services can be updated in production independently, without having to update other services. Have bounded contexts: a dev should be able to understand and update the code of a service without knowing anything about the internals of its peer services. Ensure that services are compartmentalized and have well-defined interfaces for easier testing.
Toyota Kata (2009)
Mike Rother framed 20 year journey to understand and codify the Toyota Production System. Argued Lean community missed the "improvement kata": creating structure for the daily, habitual practice of improvement work b/c daily practice is what improves outcomes. Constant cycle of establishing desired future states, setting weekly target outcomes, and the continual improvement of daily work is what guided improvement at Toyota.
Deployment lead time of minutes
Needs for ideal: 1. Developers receive fast, constant feedback on their work 2. Developers can quickly and independently implement, integrate, and validate their code and deploy it into the production environment. Achievable by: 1. Continually checking small code changes into the version control repository 2. Performing automated and exploratory testing against changes and deploying them into production 3. Architecture that is modular, well encapsulated, and loosely coupled, so small teams are able to work autonomously with failures being small and contained
Enable Every Team Member to be a Generalist
Prevents the extreme case of functionally-oriented orgs: over-specialized departments. Siloization leads to multiple handoffs and queues and long lead times. We don't want engineers who are "frozen in time" when tech is ever changing.
The Principles of Feedback
Principles that enable reciprocal fast and constant feedback from right to left at all stages of the value stream. Goal is to achieve quality, reliability, and safety. 1. Seeing problems as they occur 2. Swarming and solving problems to build new knowledge 3. Pushing quality closer to the source 4. Continually optimizing for downstream work centers
Small Batch Development and what happens when we commit code to trunk infrequently
Problem with developers working in long-lived private branches ("feature branches"): merges into trunk only happen sporadically → large batch size of changes. Branching strategies: 1. Optimize for individual productivity (everyone has their own private branch) 2. Optimize for team productivity (everyone works in the same common area)
CHAPTER 11
Problem: When developers fail to integrate their branches regularly and create more branches, integration becomes very difficult, requiring rework and delaying feedback and production. When integration is painful we do it less often, making future merges even worse. Goal: use continuous integration to make merging into trunk a part of everyone's daily work.
Enable On-Demand Creation of Dev, Test, and Production Environments
Problem: teams can only see how their app behaves when it runs in a production-like environment, but test environments are often scarce and have long lead times → teams cannot test in production-like environments → massive failures at the actual release. Solution: developers run production-like environments on their own workstations, created on demand and self-service, so that they can run and test code in production-like environments as part of daily work. To do this, create a common build mechanism that creates all the environments, so anyone can get a production-like environment fast without a ticket. Carefully define stable, reliable, secure environment requirements.
Technology Value Stream
Process required to convert a business hypothesis into a technology-enabled service that delivers value to the customer. Input: formulation of a business objective, concept, idea, or hypothesis. Starts when we accept work in development, adding it to our committed backlog of work → dev teams follow an agile or iterative process to form user stories → implement in code into the app or service → check code into version control repository → each change integrated and tested w/ rest of software system. Value is only created when services are running in production: fast flow and deployment without service outages, impairments, or security or compliance failures.
Pros/Cons of Optimizing for individual productivity
Pros: everyone works independently and nobody can disrupt anyone else's work Cons: Merging is a nightmare, collaboration is very difficult b/c everyone's work has to be merged with everyone else's work to see even a small part of the complete system
Pros/Cons of optimizing for team productivity
Pros: no branches, just a long unbroken straight line of dev, nothing to understand so commits are simple Cons: each commit can break the entire project and bring all progress to a halt
Version Control
Records changes to files or set of files stored within the system, could be source code, assets, or other documents part of software project, stores changes as commits and allows for reversion back to previous versions.
Enable Market-Oriented Teams ("Optimizing for Speed")
Responsible for feature dev, testing, securing, deploying, and supporting their service in production, from idea conception → retirement. Cross-functional and independent teams that run experiments and build/deliver/deploy/run/fix features without manual dependencies on other teams. To achieve this, embed functional engineers and skills (Ops, QA, InfoSec) into each service team or provide their capabilities to teams through automated self-service platforms.
Use means and standard deviations to detect potential problems
Use means and standard deviations to create a filter that detects when a metric is significantly different from its norm and raises an alert. Improve these alerts by focusing on the variances or outliers that matter, to prevent alert fatigue.
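A minimal sketch of such a filter: flag a new reading when it lies more than a chosen number of standard deviations from the mean of recent history. The function name, the threshold of 3, and the sample values are assumptions for illustration.

```python
import statistics


def is_anomalous(history, value, threshold=3.0):
    """True if `value` is more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        return value != mean
    return abs(value - mean) > threshold * stdev


# Hypothetical requests-per-second readings: mean 100, sample stdev 2.
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_anomalous(history, 105))  # False: within 3 standard deviations
print(is_anomalous(history, 120))  # True: far outside the norm, raise an alert
```

This works best when the metric is roughly Gaussian; for skewed data the same threshold tends to over-alert.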
Create our Single Repository of Truth for the Entire System
VERSION CONTROL. Need to be able to re-create any previous state of the production environment, pre-production environments, and build processes, including tools and environments. Need code and environments to BOTH be in version control. Ops needs version control b/c there are way MORE config settings for environments than for code.
Lean Movement (1980s-1990s)
Value Stream Mapping, Kanban Boards, and Total Productive Maintenance. Create value for customer by creating constancy of purpose, scientific thinking, creating flow and pull, assuring quality at source, humility, respect everyone
Outlier detection
abnormal running conditions from which significant performance degradation may well result
Andon Cord
Cord that every worker and manager is trained to pull when something goes wrong. The team leader is alerted and immediately works to resolve the problem. If the problem cannot be resolved within a specified time, the production line is halted so the entire org can assist until the problem is resolved.
Anomaly Detection
the search for items or events which do not conform to an expected pattern
Benefits to creation of dev, test, and production environments
providing early and constant feedback to dev, reproduce/diagnose defects safely isolated from production, experiment with changes to environment to create more shared knowledge b/t dev and operations
Monolith v2
Sets of monolithic tiers: front-end presentation, application server, database layer. Pros: 1. Simple at first 2. Join queries are easy 3. Single schema, deployment 4. Resource-efficient at small scales Cons: 1. Tendency for increased coupling over time 2. Poor scaling and redundancy (all or nothing, vertical only) 3. Difficult to tune properly 4. All-or-nothing schema management
Smoothing
Suitable if the data is a time series. Uses moving averages, which transform the data by averaging each point with all the other data within a sliding window; this highlights longer-term trends or cycles.
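A short sketch of smoothing with a simple moving average. The window here covers each point and the points that follow it; the function name and sample series are illustrative.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive points in the series."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical latency samples with one spike at index 2.
latency_ms = [10, 12, 50, 11, 9, 10]
smoothed = moving_average(latency_ms, window=3)
print(smoothed)
```

The spike is spread across neighboring windows instead of appearing as a single sharp peak, which makes the longer-term trend easier to see.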
Discipline of daily code commits forces...
work broken down into smaller chunks while keeping trunk in a working, releasable state