DevOps

What role does HR play in process improvement?

1) Implementing new organizational structures 2) Communicating and monitoring employee responses 3) Integrating process improvement into the talent system 4) Recruiting 5) People development 6) Prevention of ambiguity wastes. The 8 types of ambiguity waste outlined by Karen Martin and Mike Osterling: 1) Terminology and communication 2) Problem solving and decision making 3) Work systems 4) Roles and responsibilities 5) Policies 6) Business goals and priorities 7) Customers and products 8) Organizational purpose and vision

How to Increase Flow

1) Make work visible 2) reduce batch sizes and intervals of work (limiting WIP) 3) Build quality in by preventing defects from being passed to downstream work centers 4) constantly optimize for global goals

DevOps

1. Collaborative Development 2. Continuous Testing 3. Continuous Release 4. Continuous Monitoring and Optimization

Transformation Best Practices (IBM)

1. Consider all elements of a delivery ecosystem 2. Implement a center of excellence 3. Plan improvements around capabilities 4. Adopt capabilities incrementally 5. Embrace principles of organizational change

Monolithic v1

All functionality in one application. Pros: 1. Simple at first 2. Low inter-process latencies 3. Single codebase, one deployment unit 4. Resource-efficient at small scales. Cons: 1. Coordination overhead increases as teams grow 2. Poor enforcement of modularity 3. Poor scaling 4. All-or-nothing deploy (downtime, failures) 5. Long build times

Reduce the Number of Handoffs

Each work handoff = communication To mitigate problems, strive to decrease number of handoffs by: 1) automating significant portions of the work OR 2) by reorganizing teams so they can deliver value to the customers themselves

Urgent Changes

Emergency, and consequently potentially high-risk, changes that must be put into production immediately (e.g., an urgent security patch or restoring service). They require senior management approval but allow documentation to be completed after the fact.

Enable creation of production metrics as part of daily work

Enable Dev and Ops to create and improve telemetry as part of their daily work

CHAPTER 19

Enable and Inject Learning into Daily Work Goal: Create a dynamic system of learning that allows us to understand mistakes and translate that understanding into actions that prevent those mistakes from occurring in the future

Fund not Projects, but Services and Products

Enable high-performing outcomes by creating stable service teams with ongoing funding to execute their own strategy and road map of initiatives. These teams are measured by achievement of organizational and customer outcomes such as revenue, customer lifetime value, or customer adoption rate. The traditional project-based model works against this because teams cannot see the long-term consequences of the decisions they make, and funding pays only for the earliest stages of the SDLC, which are the least expensive. Project teams are measured by budget, time, and scope.

Spread knowledge by using automated tests as documentation and communities of practice

Enable rapid propagation of expertise and improvements regarding shared libraries Shared libraries should have significant amounts of automated testing

Share your Experiences from DevOps Conferences

Encourage engineers to attend conferences, give talks at them, and create or organize internal/external conferences themselves

Create a single, shared source code repository for our entire organization

Engineers can leverage the diverse expertise of everyone in the organization Engineers can contribute to the single repository knowing that the code is updated and what areas of the repo will be affected

Integration Tests:

Ensure that our application correctly interacts with other production applications and services

Deployment Pipeline

Ensures that all code checked into version control is automatically built and tested in a production-like environment. Finds any build, test, or integration errors as soon as a change is introduced. Assures us that we are always in a deployable and shippable state.

2 Broad categories of release patterns we can use:

Environment-based release patterns Application-based release patterns

Experimental Model

Every day, every exercise and every piece of new information is evaluated and debated in a culture that resembles a research and design lab

Dev Shares pager rotation duties with Ops

Everyone in the value stream will share the downstream responsibilities of handling operational incidents. Do this by putting developers, development managers, and architects on pager rotation. This ensures that everyone in the value stream gets visceral feedback on any upstream architectural and coding decisions they make. Operations doesn't struggle alone with code-related production issues.

Guidelines for Code Review

Everyone must have someone review their changes before committing to trunk. Everyone should monitor the commit stream of their fellow team members. Define which changes qualify as high risk and may require review from a designated subject matter expert. If someone submits a change that is too large to reason about easily, it should be split into multiple smaller changes that can be understood at a glance.

Instrument and Alert on Undesired outcomes

Analyze the most severe incidents of the recent past and create a list of telemetry that could have enabled earlier and faster detection and diagnosis of the problem, as well as easier and faster confirmation that an effective fix had been implemented Repeat this process on ever-weaker failure signals to find problems even earlier in the life cycle

Anomaly Detection Techniques

Anomaly Detection and Smoothing
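
The card names no specific technique, so the following is only a minimal sketch of one simple pairing: a moving average for smoothing and a standard-deviation threshold for flagging outliers. The function names, the window size, and the threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def moving_average(values, window):
    """Smooth a series by averaging each run of `window` consecutive points."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def find_anomalies(values, threshold):
    """Return indexes of points more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # A flat series has no outliers.
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]


# Hypothetical response-time samples with one obvious spike.
samples = [10, 11, 9, 10, 12, 10, 11, 95, 10, 9]
spikes = find_anomalies(samples, threshold=2.0)
smoothed = moving_average(samples, window=3)
```

A fixed threshold like this assumes roughly bell-shaped data; telemetry that does not follow that shape usually needs a different technique.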

Logging Level: Debug

Anything that happens in the program; most often used during debugging. Debug logs are often disabled in production but temporarily enabled during troubleshooting.
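
As a minimal sketch of how this works in practice, Python's standard `logging` module drops messages below a logger's configured level. The logger name `example.service` is hypothetical; the point is only that debug output stays off at a production level and can be switched on temporarily.

```python
import logging

# Hypothetical logger name, used only for illustration.
logger = logging.getLogger("example.service")

# Typical production setting: INFO and above are recorded, DEBUG is dropped.
logger.setLevel(logging.INFO)
debug_on_in_production = logger.isEnabledFor(logging.DEBUG)

# Temporarily lower the threshold while troubleshooting.
logger.setLevel(logging.DEBUG)
debug_on_while_troubleshooting = logger.isEnabledFor(logging.DEBUG)

# Restore the production level afterwards.
logger.setLevel(logging.INFO)
```

In Python the remaining levels are spelled `INFO`, `WARNING`, `ERROR`, and `CRITICAL`, with `logging.FATAL` available as an alias for `CRITICAL`.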

Convergence of DevOps

Applying principles from physical manufacturing and leadership to IT value stream. Blends principles from: Lean, Theory of Constraints, the Toyota Kata movement, resilience engineering, learning organizations, safety culture, Agile software journey

Fearlessly Cut bureaucratic process

Approval processes can significantly increase lead times. A great metric to publish is how many meetings and work tickets are mandatory to perform a release. The goal is to relentlessly reduce the effort required for engineers to perform work and deliver it to the customer.

CHAPTER 12: Automate and Enable Low-Risk Releases

As deployment batch-size grows, so does the risk of unexpected outcomes associated with the change

Ensure Documentation and Proof for Auditors and Compliance Officers

As tech increasingly adopts DevOps patterns it creates more tension between IT and audit b/c it challenges traditional thinking about auditing, controls, and risk mitigation

Definition of Done

At end of each development interval, we must have integrated, tested, working, and potentially shippable code, demonstrated in production-like environment, created from trunk using a one-click process, and validated with automated tests.

Automate our deployment process

Automate as many of the manual steps as possible: packaging code in ways suitable for deployment; creating pre-configured virtual machine images or containers; automating the deployment and configuration of middleware; copying packages or files onto production servers; restarting servers, applications, or services; generating configuration files from templates; running automated smoke tests; running testing procedures; scripting and automating database migrations.

Canary Release Pattern:

Automates the release process of promoting to successively larger and more critical environments as we confirm that the code is operating as designed. A1 group: production servers that only serve internal employees. A2 group: production servers that only serve a small percentage of customers. A3 group: the rest of the production servers.
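
The promotion order can be sketched as a loop that stops at the first group where the release misbehaves. This is an illustrative sketch only: `promote`, `deploy`, and `is_operating_as_designed` are hypothetical names, and a real health check would draw on production telemetry.

```python
def promote(groups, deploy, is_operating_as_designed):
    """Deploy to each group in order; stop at the first group that looks unhealthy.

    Returns the list of groups the release reached.
    """
    reached = []
    for group in groups:
        deploy(group)
        reached.append(group)
        if not is_operating_as_designed(group):
            break  # Do not promote to larger, more critical groups.
    return reached


# Hypothetical run in which the release misbehaves on the A2 servers.
deployed_to = []
reached = promote(
    ["A1", "A2", "A3"],
    deploy=deployed_to.append,
    is_operating_as_designed=lambda group: group != "A2",
)
```

Here the release never reaches A3, so most customers are never exposed to the faulty change.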

DEVOPS VS AGILE

DevOps and Agile work best when used together

To better enable fast flow, we want a code promotion process that can be performed by either Dev or Ops, ideally without any manual steps or handoffs. This affects the following steps:

Build Test Release

Case study - 18F automating compliance for the federal government with compliance masonry

Built a framework to automate the creation of system security plans

Make Infrastructure Easier to Rebuild than to Repair

With repeatable environment creation systems, we can easily increase capacity by adding more servers, and we avoid the disaster that inevitably results when we must restore service after a catastrophic failure of irreproducible infrastructure created through years of undocumented and manual production changes.

Ensure Tests run quickly

Can run tests in parallel (performance testing at the same time as security testing)
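
One way to sketch this is with Python's `concurrent.futures`, which can start several suites at once and collect their results. The suite names and the stand-in functions below are assumptions; real suites would invoke the actual test runners.

```python
from concurrent.futures import ThreadPoolExecutor


def run_suites_in_parallel(suites):
    """Run each named suite concurrently and return {name: result}."""
    with ThreadPoolExecutor(max_workers=max(1, len(suites))) as pool:
        futures = {name: pool.submit(fn) for name, fn in suites.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins that report pass (True) or fail (False).
results = run_suites_in_parallel({
    "performance": lambda: True,
    "security": lambda: False,
})
all_passed = all(results.values())
```

Threads suit suites that mostly wait on external processes or services; CPU-bound suites would more likely be spread across processes or machines.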

Eliminate Hardship and Waste in the Value Stream

Categories of waste and hardship: partially done work, extra processes, extra features, task switching, waiting, motion, defects, nonstandard or manual work, heroics

Steps of Deployment pipeline:

Commit stage (builds and packages the software, runs automated tests, and performs additional validation such as static code analysis, duplication, test coverage) Acceptance Stage: (automatically deploys the packages created in the commit stage into a production-like environment, runs automated acceptance tests)
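
The two stages can be sketched as an ordered list of steps in which a failure stops everything downstream. This is a simplified illustration; `run_pipeline` and the stand-in step functions are hypothetical and not taken from any particular CI tool.

```python
def run_pipeline(stages):
    """Run (name, step) pairs in order.

    Returns the name of the first failing stage, or None if every stage passed.
    """
    for name, step in stages:
        if not step():
            return name  # Later stages never run on a failed build.
    return None


# Hypothetical steps: the commit stage passes, the acceptance stage fails.
executed = []


def commit_stage():
    executed.append("commit")
    return True


def acceptance_stage():
    executed.append("acceptance")
    return False


failed_stage = run_pipeline([("commit", commit_stage), ("acceptance", acceptance_stage)])
```

Putting the fast commit stage first means most errors surface before the slower acceptance tests start.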

Logging Level: Info

Consists of actions that are usually user-driven or system specific

Dynamic Analysis

Consists of tests executed while a program is in operation (system memory, functional behavior, response time, performance)

Cluster Immune System

Expands on canary release pattern by linking our production monitoring system with our release process and by automating the roll back of code when the user-facing performance of the production system deviates outside of a predefined expected range
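
The core of the idea is a check that compares a user-facing metric against its expected range and triggers a rollback when the value falls outside it. The sketch below assumes a single metric and a hypothetical `roll_back` callback; a real system would read these values from the production monitoring system.

```python
def enforce_expected_range(metric_value, expected_range, roll_back):
    """Roll back when the metric leaves the expected range.

    Returns True if a rollback was triggered, False otherwise.
    """
    low, high = expected_range
    if low <= metric_value <= high:
        return False
    roll_back()
    return True


# Hypothetical example: page load time in milliseconds after a deployment.
rollbacks = []
triggered = enforce_expected_range(
    metric_value=950,
    expected_range=(100, 400),
    roll_back=lambda: rollbacks.append("previous release restored"),
)
```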

Multivariate Testing

Experiments with more than one variable, allowing us to see how the variables interact

Ways to influence people's Beliefs and Values:

Exposure to others who have been successful (books, mentoring programs, conferences, training, etc) Hiring new people

Having Developers follow work downstream

Contextual inquiry: the product team watches a customer use the application in their natural environment, which uncovers ways that customers struggle with the application. Developers often learn a lot from participating in customer observation. We can use the same technique to observe how our work affects our internal customers. Developers should follow their work downstream so that they can see how downstream work centers must interact with their product to get it running in production. This can improve deployability, manageability, and operability.

Four common traits that companies need to develop as part of conscious effort to build a process improvement culture

Continually identify problems Discussing problems openly Focusing on large and small problems Solving real problems

Potential Dangers of "overly controlling changes"

Controls we often put in place when change control failures occur: 1) Adding more questions that need to be answered to the change request form 2) Requiring more authorizations 3) Requiring more lead time for change approvals so that change requests can be properly evaluated. Creating approval steps from people who are located further and further away from the work may actually reduce the likelihood of success. High-performing organizations rely more on peer review and less on external approval of changes.

3 main areas that commonly impact corporate culture:

Corporate Behaviors Corporate Values Corporate Structures

Components of a great culture?

Corporate culture begins with leadership: leaders need to explain and share their vision. Make the culture story meaningful: it helps employees believe in the transformation by answering their most important questions. All employees should be treated equally. Two-way communication is essential. Hiring decisions should reflect the desired corporate culture. A proper vision is paramount. Company values should be continuously promoted. Company policies should embody corporate values. Company history is key. Spotlight success.

How a centralized Ops team can achieve outcomes typically associated with market-oriented teams:

Create Self-service capabilities to enable developers in the service to be productive Embed Ops engineers into the service teams Assign Ops liaisons to the service teams when embedding Ops is not possible

Enable Everyone to Teach and Learn

Create a culture of continual teaching and learning Code reviews can help teach skills through daily work Have dev and ops work together to solve small problems

CHAPTER 20: Convert local discoveries into global improvements

Create mechanisms that make it possible for new learnings and improvements discovered locally to be captured and shared globally throughout the entire organization

Contrasting Point - Creating vs. Deploying

Creating vs. Deploying: Developing software is inherent to Agile, but DevOps is concerned with the appropriate deployment of said software

System of Engagement

Customer-facing or employee facing systems, such as e-commerce systems and productivity applications. Typically have higher pace of change to support rapid feedback loops that enable them to conduct experimentation to discover how to best meet customer needs.

Daily Scrum Meeting

Daily, ~15 min, stand-up What did you do yesterday? What will you do today? What obstacles are in your way?

Architecture of centralized telemetry infrastructure should have following components:

Data collection at the business logic, application, and environments layer An event router responsible for storing our events and metrics

T-Shaped:(Generalist)

Deep expertise in one area Broad skills across many areas Can step up to remove bottlenecks Sensitive to downstream waste and impact Helps make planning flexible and absorbs variability

E-Shaped

Deep expertise in one area Experience across many areas, proven execution skills, always innovating Almost limitless potential

I-Shaped: (Specialist)

Deep expertise in one area Few skills or experiences in other areas Creates bottlenecks quickly Insensitive to downstream waste and impact Prevents planning flexibility or absorption of variability

Requirements for our Deployment Pipeline

Deploying the same way to every environment Smoke testing our deployments Ensure we maintain consistent environments

The Separate World of DevOps

DevOps is responsible for developing and deploying new products to the end-user DevOps walks a line between flexibility and the rigorous testing and communication that comes with deploying new software

Enable Automated Self-service Deployments

Developer's ability to self-deploy code into production, to quickly see happy customers when their feature works, and to quickly fix any issues without having to open up a ticket with Operations has diminished over the last decade --in part as a result of a need for control and oversight perhaps driven by security and compliance requirements

A Typical Deployment Landscape

Development (developer) Build (build engineer) QA (QA team) SIT (Integration Tester) UAT (User) Production (ops engineer)

Integrate A/B Testing into our Release

Feature toggles: A/B testing is made possible by doing production deployments on demand, using feature toggles, and potentially delivering multiple versions of code simultaneously.
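
Serving multiple versions at once requires a stable way to decide which version each user sees. One common approach, sketched here as an assumption rather than as the method the notes describe, is to hash the user and experiment name so that the same user always lands in the same variant. The function and experiment names are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to one variant of an experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# The same user gets the same variant on every request.
first = assign_variant("user-42", "checkout-button")
second = assign_variant("user-42", "checkout-button")
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user in variant A of one test is not automatically in variant A of the next.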

Start with the Most Sympathetic and Innovative Groups

Find those teams who already believe in DevOps principles and practices, i.e. the innovators and early adopters, esp in early stages to spend less energy on converting more conservative groups. Strategy: choose to focus efforts in a few areas of org where there is the most support - make those initiatives successful and expand from there.

why we keep improvement planning horizons short

Flexibility and the ability to re-prioritize and re-plan quickly. Decreased delay between work expended and improvement realized, which strengthens our feedback loop. Faster learning generated from the first iteration. Reduction in the activation energy needed to get improvements. Quicker realization of improvements that make meaningful differences in our daily work. Less risk that our project is killed before we can generate any demonstrable outcomes.

Create a Value Stream Map to see the Work

Focus investigation and scrutiny on the following areas: Places where work must wait weeks or even months Places where significant rework is generated or received

Logging Level: Error

Focuses on error conditions (e.g., API call failures)

What to do when Changes are Categorized as Normal Changes

Changes that cannot be classified as standard changes are normal changes, which require approval from at least a subset of the CAB before deployment. Goal: ensure that we can deploy quickly even if the process is not fully automated. Make sure submitted change requests are as complete and accurate as possible to prevent revisions that increase the time required to get into production. Automate the creation of complete and accurate request for change (RFC) forms, including links to machine-readable data that give context for the change. The goal should be to continually show an exemplary track record of successful changes so we can eventually gain the CAB's agreement that our automated changes can be safely classified as standard changes.

Integrating A/B Testing into our Feature Planning

Frame Hypotheses in feature development as: 1. We believe: 2. Will Result: 3. We will have confidence to proceed when:

Organizational Archetypes

Functional-oriented organizations Matrix-oriented organizations Market-oriented organizations

Case Study - Instrumenting the environment at Etsy

Galbreath (former VP of engineering at Etsy) defined fraud as when "the system works incorrectly, allowing invalid or un-inspected input into the system, causing financial loss, data loss/theft, system downtime, vandalism, or an attack on another system." Galbreath created security-related telemetry that was displayed alongside all other metrics, including: abnormal production program terminations, database syntax errors, and indications of SQL injection attacks.

CHAPTER 16: Enable Feedback so development and operations can safely deploy

Galbreath, VP of Engineering at Right Media, found that Dev and Ops feared deploying code. Galbreath observed that providing faster and more frequent feedback on work created safety and confidence. It is not enough merely to automate the deployment process; we must also integrate the monitoring of production telemetry into our development work, as well as establish the cultural norm that everyone is equally responsible for the health of the entire value stream.

Institute Game Days to Rehearse Failures

Game Days: popularized by Jesse Robbins, they are specific disaster recovery rehearsals from the discipline of resilience engineering to help teams simulate accidents to practice → create more resilient service and higher degree of assurance that we can resume operations when inopportune events occur

Apply Lean Principles

Get ideas into production fast. Get people used to it. Get feedback.

CHAPTER 2: The First Way: The Principles of Flow

Goal: Decrease the amount of time required for changes to be deployed into production and to increase the reliability and quality of those services

How to transform Local discoveries into global improvements

Goal: convert team/individuals experiences/expertise into explicit codified knowledge create mechanisms to create global knowledge such as making all post-mortem reports searchable by teams trying to solve similar problems, creating shared source code repositories that span the entire organization

CHAPTER 8: How to get great outcomes by integrating operations into the daily work of development

Goal: enable market-oriented outcomes where many teams can quickly and independently deliver value to the customer. Challenge: when Ops is centralized and functionally oriented, the result is long lead times for needed Ops work, constant reprioritization, and poor deployment outcomes. How: create more market-oriented outcomes by better integrating Ops capabilities into Dev teams.

CHAPTER 14: Create telemetry to enable seeing and solving problems

Goals: to enable problem solving-behavior, we need to design our systems so that they are continually creating telemetry, widely defined as "an automated communications process by which measurements and other data are collected at remote points and are subsequently transmitted to receiving equipment for monitoring" Ensure that we have enough telemetry to confirm our services are correctly operating in production

Ensure Security of the Application

Happy path testing; sad path / bad path testing. Include as part of testing: static analysis, dynamic analysis, dependency scanning, and source code integrity and code signing.

Continuously build, test, and integrate our code and environments

Have Developers build automated tests as part of daily work Create automated test suites that increase frequency of integration and testing of our code and our environments from periodic to continuous Build deployment pipeline that will perform integration of our code and environments Must create automated build and test processes

how Application-based patterns enable safer releases

Implement feature toggles: they provide a mechanism to selectively enable and disable features without requiring a production code deployment, and can control which features are visible and available to specific user segments. They enable us to do the following: roll back easily, gracefully degrade performance, and increase our resilience through a service-oriented architecture. Feature toggles enable the decoupling of code deployments and feature releases. Perform dark launches: feature toggles allow us to deploy features into production without making them accessible to users, known as dark launching.
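
A feature toggle can be as simple as a lookup consulted at runtime before a feature is shown. The class below is a minimal in-memory sketch; the `FeatureToggles` name, the segment labels, and the `new_checkout` feature are hypothetical, and production systems typically load this state from configuration or a toggle service.

```python
class FeatureToggles:
    """Tracks which user segments can see each feature."""

    def __init__(self):
        self._segments_by_feature = {}

    def enable(self, feature, segment="all"):
        """Make a feature visible to one segment, or to everyone by default."""
        self._segments_by_feature.setdefault(feature, set()).add(segment)

    def disable(self, feature):
        """Hide a feature from every segment without redeploying code."""
        self._segments_by_feature.pop(feature, None)

    def is_enabled(self, feature, segment="all"):
        segments = self._segments_by_feature.get(feature, set())
        return "all" in segments or segment in segments


# Dark launch: the code is deployed, but only employees can reach it.
toggles = FeatureToggles()
toggles.enable("new_checkout", segment="employees")
visible_to_employees = toggles.is_enabled("new_checkout", "employees")
visible_to_customers = toggles.is_enabled("new_checkout", "customers")
```

Calling `toggles.disable("new_checkout")` acts as the rollback: the feature disappears for everyone while the deployed code stays in place.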

Reduce Batch Size

Large batch sizes were common prior to Lean and result in high levels of WIP and high levels of variability in flow, which lead to long lead times and poor quality. Lean lesson: smaller batches = shorter lead times and higher quality. Single-piece flow: each operation is performed one unit at a time. Small batches = less WIP, faster lead times, faster detection of errors, and less rework.

Pathological organizations

Large amounts of fear and threat. Failure is hidden

How does leadership play a role in process improvement?

Leaders should connect the process and the people Ensure leadership practices are universally aligned and accepted Leadership Behaviors: Positive Negative Process focused leadership is about enabling and empowering people

Standard Changes

Lower-risk changes that follow an established and approved process and can therefore be pre-approved. Examples include monthly updates of application tax tables or country codes, website content and styling changes, and certain types of application or operating system patches that have a well-understood impact. The change proposer does not require approval before deploying the change, and change deployments can be completely automated and should be logged so there is traceability.

Ways to Influence People's Level of Awareness:

Maintaining proper work-life balance Ensuring people take their vacation time Having flexible working hours Holding regular team workshops, townhalls, etc Creating opportunities to move within the organization Coaching and mentoring opportunities Conducting regular performance reviews

Methods to make Work visible

Difference between technology and manufacturing value streams: in technology, work is invisible. In technology value streams, we cannot easily see where flow is being impeded or when work is piling up. To see where work is flowing or piling up, we need to make work as visible as possible. Visual boards: Kanban or sprint planning boards manage our work so that it flows from left to right as quickly as possible. Limit multitasking by enforcing WIP limits for each column. Lead time: the time from when a card is placed on the board to when it is moved into the "done" column.
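
Both the WIP limit and the lead-time measurement reduce to small calculations, sketched below. The function names and the sample dates are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def can_pull(cards_in_column, wip_limit):
    """A new card may enter a column only while it is below its WIP limit."""
    return len(cards_in_column) < wip_limit


def lead_time(placed_on_board, moved_to_done):
    """Elapsed time from a card entering the board to reaching the done column."""
    return moved_to_done - placed_on_board


# Hypothetical column with a WIP limit of 2 that is already full.
in_progress = ["fix login bug", "add audit log"]
may_start_new_work = can_pull(in_progress, wip_limit=2)

# Hypothetical card that took three days to cross the board.
elapsed = lead_time(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 4, 9, 0))
```

When `can_pull` returns False, the team finishes or unblocks existing cards before starting anything new.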

Prevent uncontrolled configuration variances and motivate version control by:

Disable remote logins to production servers Routinely kill and replace production instances, ensuring that manually-applied production changes are removed

Consequence of Second Law of Architectural Thermodynamics

Downward spiral of deploying less frequently (change = unknowable catastrophes → so don't want to change). Reducing complexity and increasing productivity of dev teams is rarely the goal of an individual project

Creating Security Telemetry in our environments

We need to create telemetry in our environments so we can detect early indicators of unauthorized access, which may include: OS changes, security group changes, changes to configurations, cloud infrastructure changes, XSS attempts (cross-site scripting), SQLi attempts (SQL injection attacks), and web server errors.

Agile (Definition Continued)

No "finished product" which is the goal of waterfall approach Agile methodologies encourage developers to break down software dev into small pieces "user stories" highlighting value agile places on customer, helps devs provide faster feedback loops and ensuring product alignment with market need Agile advocates for adaptive planning, evolving development, early and continuous delivery, and continuous improvement

Enable Pair Programming to improve all our changes

One engineer fills the role of the driver, the person who actually writes the code. Another engineer acts as the navigator, observer, or pointer, the person who reviews the work as it's being performed. Another pair programming pattern reinforces TDD by having one engineer write the automated test and the other engineer implement the code. Pair programming can also spread knowledge throughout the organization and increase information flow within the team.

Create Dedicated Transformation Team

One of the inherent challenges with initiatives such as DevOps transformations is that they are inevitably in conflict with ongoing business operations. A dedicated transformation team is able to operate outside of the rest of the organization that is responsible for daily operations. How to execute a DevOps initiative: assign members of the dedicated team to be solely allocated to the DevOps transformation efforts; select team members who are generalists; select team members who have long-standing and mutually respectful relationships with the rest of the organization; create a separate space for the dedicated team.

Positive Leadership Behaviors

Openness to learning, customer focus, accountability

Create shared services to increase developer Productivity

Operations can create set of centralized platforms and tooling services that any Dev team can use to become more productive Enable dev team to spend more time building functionality for their customer, as opposed to obtaining all the infrastructure required to deliver and support that feature in production

Integrate Ops into Dev rituals

Ops engineers discover what rituals the product teams follow, integrate into them, and add value to them

Market-oriented Organizations

Optimize for responding quickly to customer needs. Flat organizations, composed of multiple, cross-functional disciplines which lead to possible redundancies across the org. How many prominent organizations adopting DevOps operate.

Process Improvement Culture:

Organizations cannot improve unless they continually seek out and solve their problems. For many companies that means undertaking a profound cultural change

Conway's Law

Organizations which design systems...are constrained to produce designs which are copies of the communication structures of these organizations...the larger an organization the less flexibility it has and the more pronounced the phenomenon. The organization of the software and the organization of the software team will be congruent.

Why Automated build and test processes are critical?

Our build and test process can run all the time. A segregated build and test process ensures that we understand all the dependencies required to build, package, run, and test our code. We can package our application to enable the repeatable installation of code and configurations into an environment. Instead of putting code into packages, we may choose to package our applications into deployable containers. Environments can be made more production-like in a way that is consistent and repeatable.

Types of Code Reviews:

Pair programming "Over-the-Shoulder" Email Pass-Around Tool-Assisted code review

Improvement Blitz (kaizen blitz)

Part of the Toyota Production System, defined as a dedicated and concentrated period of time to address a particular issue over the course of several days. Goal of the improvement blitz team: a new approach to solving a problem that we encounter in our daily work.

Negative Leadership Behaviors

Passivity, avoidance, politics, cover-ups

Integrate performance testing into our test suite

Performance problems are often difficult to detect, so write and run automated performance tests that validate our performance across the entire application stack

4 steps to effectively identify and solve problems:

Plan Do Check Act

Types of Corporate values

Positive values Negative values

Product Owner

Possibly a product manager or project sponsor Decides on features, release date, prioritization, $$$

Ensure Technology choices help achieve organizational goals

Prevent optimizing for team productivity at the expense of organizational goals. Identify technologies that: impede or slow down the flow of work; disproportionately create high levels of unplanned work; disproportionately create large numbers of support requests; are most inconsistent with our desired architectural outcomes.

Case Study -- static security testing at Twitter

Prevent security mistakes from being repeated. Integrate security objectives into existing developer tools. Preserve the trust of Development. Maintain fast flow through Infosec by using automation. Make everything security-related self-service, if possible. Take a holistic approach to achieving Infosec objectives.

Limiting WIP

Prevents multitasking and makes it easier to see problems that prevent the completion of work. Instead of starting something new, find out what is causing the delay and help fix it.

Case Study--Pair programming replacing a broken code review process at Pivotal Labs

Two accepted methods of code review at Pivotal: pair programming or a code review process managed by Gerrit (every code commit had two designated people "+1" the change) The problem with the Gerrit code review process was that it would often take an entire week for developers to receive their required reviews. Only senior engineers could "+1" changes Pivotal then eliminated the Gerrit code review process and required pair programming to implement code changes into the system.

Scrum

Scrum is an agile process that allows us to focus on delivering the highest business value in the shortest time Rapidly and repeatedly inspect actual working software every 2 weeks to one month Business sets the priorities Inspect and Adapt

Culture:

Shared attitudes, values, principles and beliefs that characterize a company and define its nature and approach to managing employees, customers, investors, and the greater community

Use tools to reinforce desired behavior

Shared tools allow for a shared backlog. Slack allows for fast and widespread communication.

Contrasting Point - Specialization

Specialization: Agile is an equal opportunity team- every member of the scrum can do every job within the team, which prevents slowdowns and bottlenecks. DevOps on the other hand, assumes separate teams for development and operations and people stay within their teams but they all communicate frequently

Contrasting Point - Speed

Speed: Agile is all about rapid and frequent deployment, but this is rarely the goal or even part of it for DevOps

Build Reusable operations user stories into development

Standardize, automate, and document operations work as much as possible For all recurring Ops work, we should know the following : What work is required Who needs to perform it What the steps are to complete it Create well defined "ops user stories" that represent work activities that can be reused across all our projects

Decrease Incident Tolerances to Find Ever-Weaker Failure Signals

Standardized Model vs. Experimental Model Work in technology value stream should use experimental model to continually seek to find ever-weaker failure signals so that we can better understand and manage the system we operate in

Sprint Retrospective

Take a look at what is and is not working after every sprint (~15-30 min)

Sprint Review

Team presents what it accomplished during the sprint Typically takes the form of a demo of new features or underlying architecture Informal 2-hour prep time rule No slides Whole team participates Invite the world

Sprint Planning

Team selects item from the product backlog they can commit to completing Sprint backlog is created (tasks are identified and each is estimated) High-level design is considered

Logging Level: Fatal

Tell us when we must terminate

Unit Test

Test a single method, class, or function in isolation, providing assurance to the developer that their code operates as designed
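
A minimal sketch of what such a test can look like, using Python's standard `unittest` module. The function `apply_discount` and its behavior are hypothetical, invented only to have something small to test in isolation.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical function under test: reduce a price by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    """Each test exercises one function with no database, network, or UI."""

    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_rejects_invalid_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(200.0, 150)
```

Saved in a file, these tests could be run with `python -m unittest`. Because they touch nothing outside the function, they run in milliseconds, which is what makes unit tests suitable as the first stage of a pipeline.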

Acceptance Tests

Test the application as a whole to provide assurance that a higher level of functionality operates as designed. They prove that our app does what the customer meant it to do, not just that it works the way its programmers think it should

Changing Mindset

The actual change itself (the most difficult phase) At conclusion of phase 2, you should have: New cultural behaviors being displayed in most employees, with leaders serving as role models for the rest of the organization Any needed performance management changes designed New Rituals, meetings, and activities that support the values and behaviors A flurry of communications using various methods to reinforce desired behaviors and update the community on status

Corporate Behaviors

The behaviors of employees and leadership that are most commonly displayed

Continuously Improving

The change is reinforced but you are also refining the processes to improve and accelerate your transformation At conclusion of phase 3 you should have: Everyone trained in the new value system and fully bought into the new corporate vision Everyone articulating the unique aspects about the culture Everyone understanding how what they and others do builds and reinforces that culture Evidence of a competitive advantage in both the customer and employment market based on the corporate culture Everyone looking to continually find ways to improve work, life, and operations

Assign an Ops Liaison to each service team

The designated Ops engineer is responsible for understanding: What the new product functionality is How it works as it pertains to operability, scalability and observability How to monitor and collect metrics Any departures from previous architectures and patterns Any extra needs for infrastructure Feature launch plans

Corporate Structures

The mechanisms by which a company manages tasks and people

Product Backlog

The requirements A list of all desired work on the project List of user stories along with "story points" Ideally expressed such that each item has value to the users or customers of the product Prioritized by the product owner Reprioritized at the start of each sprint

Corporate Values

The values employees attach meaning to and continually promote

Sad Path Testing (or Bad Path)

Things go wrong, especially in relation to security related error conditions

Building Momentum

This is about building the business case, obtaining commitment, and building the velocity required to make a significant change At the conclusion of phase 1 you should have: A clear outline of the current corporate culture and ideal culture A sponsored and endorsed plan for change A communication strategy and plan for the cultural change Commitment by executives to take on the journey

Integrate Security and Compliance into Change Approval Process

Types of changes: Standard, Normal, Urgent

Automated Tests fall into one of the following categories:

Unit Tests Acceptance Tests Integration Tests

Happy Path Testing

Validates user journeys where everything goes as expected

Integrate preventative security controls into shared source code repositories and shared services

Want a centralized shared service organization to collaborate and put our information security artifacts Put code libraries and recommended configurations, secret management, OS packages, and builds into shared source code repository Makes it easy for engineer to correctly create and use logging and encryption standards in their application and environments

Integrate non-functional requirements testing into our test suite

Want to validate every other attribute of the system we care about -- availability, scalability, capacity, security, etc. Many of these requirements are fulfilled by correct configuration of our environments, so we must also validate that our environments have been built and configured properly

Build Fast and reliable validation test suite

We need fast automated tests that run within our build and test environments whenever a new change is introduced into version control

Green Build

Whatever is in version control is in buildable and deployable state

Environment-based release patterns

This is where we have 2 or more environments that we deploy into, but only one environment is receiving live customer traffic. New code is deployed into a non-live environment, and the release is performed by moving traffic into this environment. Types: Blue-Green deployment pattern, Canary Release Pattern, Cluster Immune System
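
A toy model of the blue-green pattern described above, assuming two environments named "blue" and "green". The `BlueGreenRouter` class and the `health_check` callback are hypothetical illustrations; a real setup would switch a load balancer or DNS entry rather than a Python attribute.

```python
class BlueGreenRouter:
    """Two environments; only one of them receives live traffic."""

    def __init__(self):
        self.versions = {"blue": None, "green": None}
        self.live = "blue"

    @property
    def idle(self):
        """The environment that is NOT receiving customer traffic."""
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version):
        """Deployment: install a version into the idle environment only."""
        self.versions[self.idle] = version

    def release(self, health_check):
        """Release: move traffic to the idle environment if it is healthy."""
        if not health_check(self.versions[self.idle]):
            return False
        self.live = self.idle
        return True

    def rollback(self):
        """Send traffic back to the previously live environment."""
        self.live = self.idle


router = BlueGreenRouter()
router.deploy("v2")                                  # customers still see blue
released = router.release(lambda version: version is not None)
```

Note that `deploy` and `release` are separate calls: code can sit in the idle environment for as long as needed before any traffic is moved to it.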

Goal of Deployment Pipeline

To provide everyone in the value stream the fastest possible feedback that a change has taken us out of a deployable state

Continually identify and elevate our constraints

To reduce lead times and increase throughput we need to continually identify our system's constraints and improve its work capacity Constraint usually follows this progression: Environment creation Code Deployment Test setup and run Overly tight architecture

Redefine Failure and Encourage Calculated Risk Taking

To reinforce culture of learning and calculated risk taking we need leaders to continually reinforce that everyone should feel both comfortable with and responsible for surfacing and learning from failures High performing DevOps orgs fail and make mistakes more often

Chaos Monkey

Tool built by Netflix to simulate AWS failures by constantly and randomly killing production servers. An example of constantly injecting failures into pre-production and production environments

Integrate security into development iteration demonstrations

Track all open security issues in the same work tracking system that dev and ops are using

Evolution of Delivery Practices

Traditional Iterative Agile Scaled Agile DevOps

How to Integrate Ops culture into Dev Teams?

Try to embed Ops engineers and architects into each of the Dev Teams (hard b/c sometimes not enough Ops engineers) 2 Ops Liaison Models: Business Relationship Manager -- worked with product management, line of business owners, project management, dev Management and developers. Act as advocates for product owners inside of Operations. Help product teams navigate the Operations Landscape to prioritize and streamline work requests Dedicated Release Manager: Familiar with product's development and QA issues, helps them get what they need from Ops organization to achieve their goals

Enabling Organizational Learning and Safety Culture

Dr. Westrum defines 3 types of culture: Pathological Organizations, Bureaucratic Organizations, Generative Organizations. Remove blame and put organizational learning in its place

System of Record

ERP (enterprise resource planning) systems that run businesses, where correctness of the transactions and data is paramount. Typically have a slower pace of change and regulatory and compliance requirements.

Immutable infrastructure

manual changes to production environment are no longer allowed--the only way to make production changes is to put changes into version control and re-create the code and environment from scratch to prevent variance in production

Application-based release pattern

modify application so that we can selectively release and expose specific application functionality by small configuration changes (feature flags)
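
A minimal feature-flag sketch, assuming flags live in a small configuration dictionary. The flag name `new_checkout`, the `FLAGS` layout, and the percentage-rollout rule are all hypothetical choices made for illustration.

```python
import hashlib

# Hypothetical configuration; in practice this would be loaded at runtime
# so that a release is a small config change, not a code deployment.
FLAGS = {"new_checkout": {"enabled": True, "percent": 20}}


def is_enabled(flag, user_id, flags=FLAGS):
    """Return True if the feature is exposed to this user."""
    cfg = flags.get(flag)
    if not cfg or not cfg["enabled"]:
        return False  # unknown or switched-off flags hide the feature
    # Hash flag and user together so each user gets a stable bucket 0..99.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < cfg["percent"]


def checkout_page(user_id):
    """Application code branches on the flag, not on the deployed version."""
    return "new checkout" if is_enabled("new_checkout", user_id) else "old checkout"
```

Raising `percent` gradually exposes the feature to more users, and setting `enabled` to false hides it again without a rollback.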

Microservice

modular, independent graph relationship vs tiers, isolated persistence Pros: 1. Each unit is simple 2. Independent scaling and performance 3. Independent testing and deployment 4. Can optimally tune performance (caching, replication, etc) Cons: 1. Many cooperating units 2. Many small repos 3. Requires more sophisticated tooling and dependency management 4. Network latencies

Mass Marketing/Brand Marketing

placing as many ad impressions in front of people as possible to influence buying decisions

Resilience

requires that we first define our failure modes and then perform testing to ensure that these failure modes operate as designed

Logging Level: Warn

tells us of conditions that could potentially become an error (e.g. a database call taking longer than a predefined time)

Static Analysis

testing we perform in non-runtime environment (coding flaws, backdoors, malicious code)

Why is fast flow from dev to ops important?

to deliver value to customers quickly

CHAPTER 15

Analyze Telemetry to Better Anticipate Problems and Achieve Goals

Bad Apple Theory

(Dekker) The notion of eliminating error by eliminating the people who caused the errors. Dekker claims this is invalid because human error is not the cause of our troubles; instead, human error is a consequence of the design of the tools we gave them

Integrating A/B Testing into our Feature Testing

A/B Testing in UX and Multivariate Testing

Resilience Engineering

An exercise designed to increase resilience through large-scale fault injection across critical systems

Phases of Cultural Change

Analysis of current culture, along with its impact on the performance of your business and employees Identify the values and culture you strive for Set your goals, and build and implement your plan based on the values, behaviors, and structures

How is Agile Different?

Analysis, design, coding, and testing are continuous activities

Scrum Framework

-Roles: Product Owner, Scrum Master, Team -Ceremonies: Sprint Planning, Sprint Review, Sprint Retrospective, Daily Scrum meeting -Artifacts: Product Backlog, Sprint Backlog, Burndown Charts

3 main phases for basic framework necessary for culture to spread

1) Building Momentum 2) Changing mindsets 3) Continuously improving

GitHub Flow

1) Engineer creates named branch off of master 2)Engineer commits to that branch locally, regularly pushing their work to the same named branch on the server 3) when they need feedback or help or when they think the branch is ready for merging they open a pull request 4) When they get their desired reviews and approvals, the engineer can merge it into master 5) Once code changes are merged and pushed to master, the engineer deploys them into production

Source Code integrity and code signing

All commits to version control should be signed - that is straightforward to configure using the open source tools gpg and git

What happens in a post-mortem meeting?

1. Construct timeline and details from multiple perspectives on failures, ensuring we don't punish people for making mistakes 2. Empower all engineers to improve safety by allowing them to give detailed accounts of their contributions to failures 3. Enable and encourage people who do make mistakes to be the experts who educate the rest of the organization on how not to make them in the future 4. Accept that there is always a discretionary space where humans can decide to take action or not, and that the judgement of those decisions lies in hindsight 5. Propose countermeasures to prevent a similar accident from re-occurring and ensure these countermeasures are recorded with a target date and an owner for follow up

Principles of organizational change

1. Establish a sense of urgency 2. Create the guiding coalition 3. Develop a vision and strategy 4. Communicate the change vision 5. Empower employees for broad-based action 6. Generate short-term wins 7. Consolidate gains and produce more change 8. Anchor new approaches in the culture

Iterative

1. Iterative Development 2. Risk-value lifecycle 3. Shared vision 4. Use case-driven development 5. Release planning

2 of Lean's major tenets

1. Manufacturing lead time required to convert raw materials into finished goods was the best predictor of quality, customer satisfaction, and employee happiness 2. Small batches of work best predictor of short lead times

Scaled Agile

1. Measured Performance 2. Formal Change Management 3. Concurrent Testing

Traditional

1. Multiple views 2. Quality attribute-driven development 3. Component based development 4. Asset reuse 5. Decision capture 6. Architecture proving

Stakeholders needed at a Blameless-Post Mortem

1. People involved in decisions that may have contributed to the problem 2. People who identified the problem 3. People who responded to the problem 4. People who diagnosed the problem 5. People who were affected by the problem 6. Anyone else interested in attending the meeting

Institutionalize Rituals to Pay Down Technical Debt

1. Schedule and conduct day-week long improvement blitzes where everyone on a team or in the entire org self-organizes to fix problems they care about--no feature work is allowed 2. Spring or Fall Cleanings 3. Schedule week long improvement blitzes that prioritize Dev and Ops working together towards improvement goals where at the end each team presents to their peers the problem and what they built (work across entire value stream to solve problems)

Contrasting Points

1. Speed 2. Creating vs. Deploying 3. Specialization 4. Communication and Documentation 5. Team Size 6. Scheduling 7. Automation

Agile

1. Test-driven development (TDD): Creating tests that are a specification of what the code should do first 2. Continuous integration: Encourage frequent integration and testing of program changes 3. Refactoring: Changing an existing body of code in order to improve its internal structure 4. Whole team: A focus on the value of highly-collaborative teams as exemplified by Scrum's daily standup meeting. Instills a sense of collective ownership and responsibility 5. User Story-driven Development: Capture requirements in a lightweight manner. Encourages collaboration with the relevant stakeholders throughout a project 6. Team Change Management: Supports the logging of defects or new requirements, by any member of the team, that are within the scope of the current iteration

Most common A/B testing method in UX

Users are randomly shown one of 2 versions of a page, control (A) or treatment (B). Statistical analysis of the behavior of these users demonstrates whether there is a causal link between the treatment and the outcome
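
One common way to do the statistical analysis is a two-proportion z-test on conversion rates. The sketch below is an illustration under that assumption, not the method the card prescribes; the function name and the sample counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and treatment (B).

    Returns (z, p_value) for a two-sided test using a pooled standard error.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 1000 users per variant.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
significant = p_value < 0.05
```

A small p-value suggests the difference between A and B is unlikely to be chance alone. Random assignment of users to the two versions is what allows the difference to be read as causal.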

What % of cycles should be reserved for non-functional requirements and reducing tech debt

20%

Team

5-10 members, teams are self organizing Cross functional Membership should change only between sprints

Evaluating the effectiveness of pull request processes

A bad Pull request is one that doesn't have enough context for the reader, having little or no documentation of what the change is intended to do A great PR: Sufficient detail on why a change is being made, identify risks and resulting countermeasures

Continuous Integration requires 3 capabilities

A comprehensive and reliable set of automated tests that validate we are in a deployable state A culture that "stops the entire production line" when our validation tests fail Developers working in small batches on trunk rather than long-lived feature branches

Burndown Chart

A display of what work has been completed and what is left to complete

Use telemetry to make deployments safer

Actively monitor production telemetry when anyone performs a production deployment

Generative Organizations

Actively seeking and sharing information to better enable the organization to achieve its mission

Case Study-- Standardizing a new tech stack at Etsy

After disastrous peak holiday season, Etsy decided to massively reduce the # of technologies used in production Goal was to standardize and reduce the support infrastructure and configurations Migrate Etsy's entire platform to PHP and MySQL wanted both dev and ops to understand the full technology stack

Re-Categorize the Majority of Our Lower Risk Changes as Standard Changes

After having a reliable deployment pipeline in place → fast, reliable, non-dramatic deployments. Next step is to gain agreement from Ops and the relevant change authorities that our changes have been demonstrated to be low risk enough to be defined as standard changes, pre-approved by the CAB. Show that a change is low risk by showing the history of changes over a significant period of time and providing a complete list of production issues during the same period. Assert high change success rates and low Mean Time To Repair (MTTR). Need standard changes to be visible and recorded in the change management system; ideally automated deployments will have results automatically recorded and automatically linked to work planning tools (e.g. JIRA). Allows for visibility and traceability
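
A sketch of how change success rate and MTTR could be computed from a change history. The record layout (`failed`, `detected`, `restored`) is a hypothetical format chosen for illustration; a real change management system would have its own schema.

```python
from datetime import datetime, timedelta


def change_metrics(changes):
    """Return (success_rate, mean_time_to_repair) for a list of changes.

    Each change is a dict with a boolean "failed"; failed changes also
    carry "detected" and "restored" datetimes (assumed record format).
    """
    if not changes:
        raise ValueError("need at least one change")
    failures = [c for c in changes if c["failed"]]
    success_rate = (len(changes) - len(failures)) / len(changes)
    if not failures:
        return success_rate, timedelta(0)
    total_repair = sum(
        (c["restored"] - c["detected"] for c in failures), timedelta()
    )
    return success_rate, total_repair / len(failures)


# Hypothetical history: four changes, one of which caused a 30-minute outage.
start = datetime(2024, 1, 1, 12, 0)
history = [
    {"failed": False},
    {"failed": False},
    {"failed": False},
    {"failed": True, "detected": start, "restored": start + timedelta(minutes=30)},
]
rate, mttr = change_metrics(history)
```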

Contrasting Point - Automation

Agile Doesn't require automation DevOps: Automation is the heart of DevOps b/c their goal is to minimize disruptions and maximize efficiency

Contrasting Point - Team Size

Agile: Small teams, so they can move faster. DevOps: Will have many teams working together, and each team can realistically practice different theories

Contrasting Point - Scheduling

Agile Sprints: work in short predetermined amounts of time, rarely longer than a month and as short as a week DevOps: Values max reliability, so they focus on a long-term schedule that minimizes business disruptions

Culture of Agile and DevOps

Agile is a change in how we think about development. Agile thinking promotes making small, manageable changes quickly that over time lead to large changes. DevOps brings cultural shifts within an organization, including enhanced communication and balancing stability with change and flexibility

What is Agile?

Agile is a time-boxed, iterative approach to software and product delivery that builds software incrementally from the start of the project instead of trying to deliver it all at once near the end. It works by breaking down projects into little bits of user functionality called "user stories", prioritizing them, and then continuously delivering them over iterations

Contrasting Point - Communication & Documentation

Agile: Daily informal meetings required so that team members can share progress, daily goals, and indicate help when needed, scrums not meant to go over documentation or milestones or metrics Don't codify their meeting minutes or other communications - often preferring lo-fi methods of simple pen and paper DevOps: Meetings are not daily but they require a lot of documentation in order to communicate software deployment to all relevant teams Requires design documents and specs in order to fully understand a software release

Have developers initially self-manage their production service

Have dev groups self-manage their services in production before they become eligible for a centralized ops group to manage (Google started this). Define launch requirements that must be met in order for services to interact with real customers: Defect counts and severity, Type/frequency of pager alerts, Monitoring coverage, System architecture, Deployment process, Production hygiene. Effective monitoring should be in place, deployments should be reliable and deterministic, and architecture should support fast/frequent deployments. Need a different mechanism to ensure that ops is never stuck with an unsupportable service in production. Create a service hand-back mechanism: when a production service becomes sufficiently fragile, ops has the ability to return production support responsibility back to dev (practice at Google)

Use Chatrooms and chat bots to automate and capture organizational knowledge

Having work performed by automation in the chat room has numerous benefits: Everyone saw everything happening Engineers on the first day of work could see what daily work looked like People were more apt to ask for help when they saw others helping each other Rapid organizational learning was enabled and accumulated

Case Study: The Launch and Hand-off readiness review at Google

High-importance product teams are assigned site reliability engineers (SREs); an SRE is "what happens when a SWE is tasked with what used to be called operations". Google created 2 sets of safety checks for 2 critical stages of releasing new services, called the Launch Readiness Review (LRR) and the Hand-off Readiness Review (HRR). The LRR must be performed and signed off on before any new Google service is made publicly available to customers and receives live production traffic. The HRR is performed when the service is transitioned to an Ops-managed state, usually months after the LRR. Any product team going through an LRR or HRR has an SRE assigned to them to help them understand/achieve the requirements. Teams with the fastest HRR production approvals are the ones that worked with SREs earliest

Normal Changes

Higher-risk changes that require review or approval from the agreed-upon change authority. In many orgs this responsibility is placed with a change advisory board (CAB) or emergency change advisory board, which may lack the required expertise to understand the full impact of the change → unacceptably long lead times. The CAB will almost certainly have a well-defined request for change (RFC) form that defines what info is required for the go/no-go decision.

Reduce Reliance on Separation of Duty

Historically, separation of duty has been one of our primary controls to reduce the risk of fraud or mistakes in the software development process. As complexity and deployment frequency increase, separation of duty slows us down and reduces the feedback engineers receive on their work. Solutions: pair programming, inspection of code check-ins, and code review give the necessary reassurance about the quality of work

Catch Errors as early in our Automated testing as possible

In order to find errors as early as possible, run faster-running automated tests (unit tests) before slower-running automated tests (acceptance and integration tests). If most errors are found with acceptance and integration tests, feedback to devs is slower than with unit tests. Fixing errors found with acceptance and integration tests is much slower because you have to validate that the tests pass by re-running them. If an error is found with an acceptance or integration test, we should create a unit test that could find the error faster, earlier, and cheaper. If unit or acceptance tests are too difficult/expensive to write, it's likely that the architecture is too tightly coupled
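
A minimal sketch of the "fast tests first" idea: stages are ordered by estimated duration and the run stops at the first failure. The stage names, durations, and the `run_stages` helper are hypothetical.

```python
def run_stages(stages):
    """Run test stages fastest-first and stop at the first failure.

    stages: list of (name, estimated_seconds, check) tuples, where
    check is a callable returning True on success.
    Returns (executed_stage_names, failed_stage_name_or_None).
    """
    executed = []
    for name, _, check in sorted(stages, key=lambda stage: stage[1]):
        executed.append(name)
        if not check():
            return executed, name  # fail fast: skip the slower stages
    return executed, None


# Hypothetical pipeline, deliberately listed out of order.
stages = [
    ("acceptance", 300, lambda: True),
    ("unit", 5, lambda: True),
    ("integration", 60, lambda: True),
]
executed, failed = run_stages(stages)
```

With this ordering, a defect that a unit test can catch is reported after seconds instead of after the slower acceptance and integration stages have run.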

Find and Fill any telemetry gaps

Identify gaps in our telemetry that impede our ability to quickly detect and resolve incidents Use this data later to better anticipate problems Need metrics from the following levels: Business (# of sales transactions, revenue, etc) Application(transaction times, user response time) Infrastructure level (web server traffic, CPU load) Client software level(errors and crashes) Deployment pipeline level (build pipeline status--red or green, change deployment lead times) At application level, goal is to ensure generating telemetry not only around application health but also to what extent we are achieving our organizational goals Goals: to have every business metric be actionable--these top metrics should help inform how to change our product and be amenable to experimentation and A/B testing for production and non-production infrastructure ensure that we are getting enough telemetry so that if a problem occurs in any environment, we can quickly determine whether infrastructure is a contributing cause of the problem

Protect our deployment pipeline

If someone compromises the servers running deployment pipeline that has the credentials for our version control system, it could enable someone to steal source code. Or someone could inject malicious changes into our repository A good place for someone to hide malicious code is a unit test, because no one looks at them and they run every time someone commits code to the repo In order to protect our continuous build, integration, or deployment pipeline our mitigation strategies may include: Hardening continuous build and integration servers Reviewing all changes introduced into version control, either through pair programming or code review Instrumenting our repository to detect when test code contains suspicious API calls Ensuring every CI process runs on its own isolated container or VM Ensuring the version control credentials used by the CI system are read-only

Design for operations through codified non-functional requirements

Implementing non-functional requirements will enable our services to be easy to deploy and keep running in production. Examples of non-functional requirements: Sufficient production telemetry; Ability to accurately track dependencies; Services that are resilient and degrade gracefully; Forward and backward compatibility between versions; Ability to archive data to manage the size of the production data set; Ability to easily search and understand log messages; Ability to trace requests from users; Simple, centralized runtime configuration using feature flags

Creating security telemetry in our applications

In order to detect problematic user behavior we need to create relevant telemetry, which may include: Successful and unsuccessful user logins User password resets User email address resets User credit card changes
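
A sketch of turning such events into a simple signal: count failed logins per user and flag accounts above a threshold. The event format, the `login_failed` type name, and the threshold of 3 are assumptions made for illustration.

```python
from collections import Counter


def flag_suspicious_logins(events, threshold=3):
    """Return users with at least `threshold` failed logins, sorted by name.

    events: iterable of dicts such as {"type": "login_failed", "user": "amy"}
    (a hypothetical telemetry record format).
    """
    failures = Counter(
        event["user"] for event in events if event["type"] == "login_failed"
    )
    return sorted(user for user, count in failures.items() if count >= threshold)


events = [
    {"type": "login_failed", "user": "amy"},
    {"type": "login_failed", "user": "amy"},
    {"type": "login_ok", "user": "bob"},
    {"type": "login_failed", "user": "amy"},
    {"type": "password_reset", "user": "bob"},
]
flagged = flag_suspicious_logins(events)
```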

Pull our Andon cord when the pipeline breaks

In order to keep our deployment pipeline in a green state, we will create a virtual Andon Cord Whenever someone introduces a change that causes our build or automated test to fail, no new work is allowed to enter the system until the problem is fixed Create highly visible indicators so that the entire team can see when our automated tests are failing
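
A toy model of the virtual Andon cord: once a build or test run fails, new work is refused until a passing result is recorded. The `AndonPipeline` class and its method names are hypothetical.

```python
class AndonPipeline:
    """Toy deployment pipeline with a virtual Andon cord."""

    def __init__(self):
        self.broken_by = None  # commit that turned the build red, if any

    @property
    def is_green(self):
        return self.broken_by is None

    def record_result(self, commit, passed):
        """Record a build/test result; the first failure pulls the cord."""
        if passed:
            self.broken_by = None
        elif self.broken_by is None:
            self.broken_by = commit

    def may_submit(self, is_fix=False):
        """While red, only work that fixes the break may enter the system."""
        return self.is_green or is_fix

    def status_light(self):
        """Text for a highly visible indicator such as a wall monitor."""
        return "GREEN" if self.is_green else f"RED (broken by {self.broken_by})"


pipeline = AndonPipeline()
pipeline.record_result("abc123", passed=False)
blocked = not pipeline.may_submit()
```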

Ways to Influence People's feelings:

Increase positive motivation (being valued and receiving recognition) Decrease Negative Motivation (eliminating mistrust, fear of making a mistake, fear of change, etc)

Inject Production Failures to Enable Resilience and Learning

Increase resilience by regularly injecting and rehearsing failure in the system to confirm that we have designed and architected our systems properly so that failures happen in specific and controlled ways

Sprint Backlog

Individuals sign up for work of their own choosing. Estimated work remaining is updated daily. Any team member can add, delete, or change the sprint backlog. Work for the sprint emerges. If work is unclear, define a sprint backlog item with a larger amount of time and break it down later. Update work remaining as more becomes known

How to change the behaviors of others

Influence: Beliefs and Values People's feelings People's level of awareness

DevOpsSec:

Infosec is integrated into all stages of the SDLC

Code Review:

Instead of requiring approval from an external body prior to deployment, we need to require engineers to get peer reviews of their changes goal; find errors by having fellow engineers close to the work scrutinize our changes Require reviews prior to committing code to trunk in source code

CHAPTER 17

Integrate Hypothesis Driven Development and A/B Testing into our Daily Work

Create Internal Consulting and Coaches to Spread Practices

Internal coaching or consulting org is useful to spread expertise across an organization

Dependency Scanning

Inventorying all our dependencies for binaries and executables, ensuring that dependencies are free of vulnerabilities

Deployment

Is the installation of a specific version of software to a given environment (may or may not be associated with a release of a feature to customers)

How does Agile Work

Make a list: sit down w/ the customer and make a list of features they would like to see in the software, or "user stories". Size things up: size up stories relative to each other, guessing how long each will take. Set some priorities: Ask the customer to prioritize their list. Start executing: Start at the top and work your way to the bottom, building, iterating, and getting feedback from the customer. Update the plan as you go: As you start delivering you're either going too fast, or have too much to do and not enough time. 2 choices: do less and cut scope, or push out the date and ask for money. Planning is adaptive. Roles blur. Scope varies. Requirements can change

Encourage Organizational Learning

Make engineers feel safe when giving details about mistakes Make them enthusiastic in helping the company avoid the same error in the future Continually reinforce that we value actions that expose and share more widely the problems in our daily work

An Agile Approach

Manifesto for Agile Software Development (2001) codified several values: People: teammates, customers, and interactions between these people-instead of processes and tools Immediacy: Working software -instead of comprehensive documentation Flexibility: Responding to, and even embracing change - instead of following a predetermined plan

Potential Dangers of doing more manual testing and change freezes

Manual testing is slower and more tedious than automated testing, leading to deploying less frequently and thus increasing deployment batch size. We want to fully integrate testing into our daily work as part of a smooth and continual flow into production

Integrate information security into production telemetry

Marcus Sachs (Verizon Data Breach) found that cardholder data breaches were detected months or quarters after the breach occurred, and were detected not by internal monitoring but by someone outside of the organization, either a business partner or a customer. This shows that internal security controls are often ineffective in successfully detecting breaches in a timely manner, either because of blind spots in our monitoring or because no one is examining the relevant telemetry in their daily work. Should integrate security telemetry into the same tools that development, QA, and operations are using to create visibility into how their application and environments are performing in a hostile threat environment. This reinforces that everyone needs to be thinking about security risks and designing countermeasures in their daily work.

Agile/DevOps transformation considers

Method Tools Enablement Organization Infrastructure Adoption

Negative Values

Money, status, power, control, promoting winning over others

Create Application Logging telemetry that helps production

Must ensure that the applications we build and operate are creating sufficient telemetry. Do this by having Dev and Ops create production telemetry as part of their daily work. Dev can use telemetry to better diagnose problems on their workstation. Ops can use telemetry to diagnose a production problem. Different logging levels: Debug, Info, Warn, Error, Fatal. Should create hierarchical logging categories
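
A short sketch of these levels using Python's standard `logging` module, which names them DEBUG, INFO, WARNING, ERROR, and CRITICAL (`logging.FATAL` is an alias for CRITICAL). The logger name `shop.checkout` is a hypothetical example of a hierarchical category, and the messages are invented.

```python
import io
import logging

# Capture output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))

# Dotted names form a hierarchy: "shop.checkout" is a child of "shop".
log = logging.getLogger("shop.checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)   # DEBUG messages are suppressed at this level
log.propagate = False

log.debug("cart contents: %s", ["sku-1"])            # anything that happens
log.info("order %s placed", 42)                      # user or system actions
log.warning("database call took %.1fs", 2.5)         # could become an error
log.error("payment API returned %d", 502)            # an error condition
log.critical("cannot reach database, terminating")   # Fatal: must terminate

lines = stream.getvalue().splitlines()
```

Raising or lowering the level on a category changes how much telemetry is emitted without touching the call sites.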

Ensure Security of our software supply chain

Need to be aware of security vulnerabilities in open source/commercial software

Bureaucratic Organizations

Rules and processes. Failure is processed through system of judgement

How to institutionalize improvement of daily work

Processes degrade over time in the absence of improvements. Reserve time to pay down tech debt, fix defects, and refactor and improve problematic areas of our code and environments. Fix problems when it is not only easier and cheaper but also when the consequences are smaller

Identify the teams supporting our value stream

Product Owner Development QA Operations InfoSec Release Manager Technology Executives Value Stream Manager

Scrum Master

Project manager or team leader Responsible for enacting scrum values and practices Remove impediments/politics

CHAPTER 23

Protecting the Deployment Pipeline

Integrate security into our deployment pipeline

Provide Dev and Ops fast feedback on their work so that they are notified whenever they commit changes that are potentially insecure

Ensure security of the environment

Put in monitoring controls to ensure that all production instances match these known good states

Automate standardized processes in software for re-use

Put knowledge into a centralized source code repository, making the tool available for everyone to search and use

Create Self-service access to telemetry and information radiators

Radiate this telemetry data to the rest of the organization, ensuring that anyone who wants info about any of the services we are running can get it without needing production system access or privileged accounts. By putting info radiators into highly visible places we promote responsibility among team members, actively demonstrating the following values: 1. Team has nothing to hide from its visitors 2. Team has nothing to hide from itself

Waterfall vs Scrum

Rather than doing everything at one time, scrum teams do a little of everything all the time

Chapter 22: Information Security is everyone's job everyday

Ratio of engineers in dev, ops, and infosec is 100:10:1

CHAPTER 18: Create review and coordination processes to increase quality of our current work

Reduce the risk of production changes before they are made. Pull requests allow for peer review in order to determine how risky a change is.

Decouple deployments from releases

Releases are driven by marketing launch dates, but things often don't go according to plan and failures may occur. Restoring service may require a painful rollback process or an equally risky fix-forward operation, where we make changes directly in production. To deploy more frequently and achieve our desired outcome of smooth and fast flow, we need to decouple our production deployments from our feature releases. Deployment and release are 2 distinct actions.

CHAPTER 21

Reserve Time to Create Organizational Learning and Improvement

Gated Commits

When the deployment pipeline first confirms that a submitted change will successfully merge, build as expected, and pass all automated tests before it is actually merged into trunk. If not, the developer is notified, allowing corrections to be made without impacting anyone else in the value stream.
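The gating logic can be sketched as follows. This is an illustration only: `gated_commit`, the check names, and the list standing in for trunk are all hypothetical, and a real pipeline would run the merge, build, and test stages instead of simple callables.

```python
from typing import Callable, Dict, List, Tuple

# A check takes the submitted change and reports whether it passed.
Check = Callable[[str], bool]


def gated_commit(
    change: str, checks: Dict[str, Check], trunk: List[str]
) -> Tuple[bool, List[str]]:
    """Merge `change` into `trunk` only if every check passes.

    Returns (merged, names_of_failed_checks). On failure trunk is left
    untouched, so nobody else in the value stream is affected.
    """
    failed = [name for name, check in checks.items() if not check(change)]
    if failed:
        return False, failed  # the developer is notified with the failing checks
    trunk.append(change)
    return True, []
```

The key property is that trunk is only modified after all checks have passed, so a rejected change never becomes someone else's problem.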

Interdependence of the systems

When improving brownfield systems we should strive to reduce complexity, improve reliability and stability making them faster, safer and easier to change. When new functionality is added to greenfield systems of engagement, they often cause reliability problems in the brownfield systems of records they rely on.

Enable coordination and scheduling of changes

When multiple groups are working on systems that share dependencies, our changes will likely need to be coordinated to ensure that they don't interfere with each other. The more loosely coupled our architecture, the less we need to communicate and coordinate with other component teams. Even in a loosely coupled architecture, there may be a risk of changes interfering with each other (e.g., simultaneous A/B testing); to mitigate these risks, use chat rooms to announce changes and proactively find collisions that may exist. Organizations with more tightly coupled architectures need to deliberately schedule changes.

Release

When we make a feature available to all our customers or a segment of customers. Our code and environment should be architected in a way that the release of functionality does not require changing our application code

Standardized Model

Where routine and systems govern everything, including strict compliance with timelines and budgets

User Stories

Who (user Role)-- is this a customer, employee, admin, etc? What(goal)-- what functionality must be achieved/developed? Why(reason) -- why does the user want to accomplish this goal? As a [user role], I want to [goal], so that I can [reason]

Publish our Post-Mortems as Widely as Possible

Widely publish meeting notes and artifacts, ideally in a centralized location where the entire org can access them and learn from them, especially if a similar incident arises on a different team (organizational learning). Prohibit production incidents from being closed until the post-mortem meeting has been completed. Example: Etsy's Morgue is a tool for better access to post-mortem meeting notes.

Test Driven Development:

Write automated tests before we write the code. Begin every change to the system by first writing an automated test that validates the expected behavior and fails, then write the code to make the test pass. Technique developed by Kent Beck has 3 steps: 1. Ensure the tests fail 2. Ensure the tests pass 3. Refactor both new and old code to make it well structured, ensuring the tests still pass
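A small sketch of the test-first order using Python's `unittest`. The function `cart_total` and its (price, quantity) input format are hypothetical. The test class is written first; before `cart_total` existed, running it would fail (step 1), and the implementation below it is the simplest code that makes the tests pass (step 2).

```python
import unittest


# Step 1: write the tests first. Without cart_total defined, these fail.
class TestCartTotal(unittest.TestCase):
    def test_empty_cart_totals_zero(self):
        self.assertEqual(cart_total([]), 0)

    def test_total_sums_price_times_quantity(self):
        # Items are (price_in_cents, quantity) pairs.
        self.assertEqual(cart_total([(250, 2), (100, 1)]), 600)


# Step 2: write just enough code to make the tests pass.
def cart_total(items):
    return sum(price * quantity for price, quantity in items)


# Step 3 (refactoring) re-runs the same tests to confirm behavior is unchanged.
def run_tests():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCartTotal)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Calling `run_tests()` after each refactoring gives the fast feedback that makes the third step safe.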

Blue-Green deployment pattern

You have 2 production environments: blue and green. At any time, only one of these is serving customer traffic. To release a new version of our service, we deploy to the inactive environment (blue, in this example), where we can perform our testing without interrupting the UX. To execute our release, we direct traffic to the blue environment, so blue goes live and green becomes staging. Rollback is performed by sending customer traffic back to the green environment. Having 2 versions creates problems when they depend upon a common database. Two approaches to solve this problem: 1. Create 2 databases (a blue and a green database) 2. Decouple database changes from application changes (make only additive changes to the database and make no assumptions in our application about which database version will be used in production)
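The traffic switch can be sketched as a tiny in-memory model. The class `BlueGreenRouter` and the version labels are hypothetical; in practice the switch would be a load balancer or DNS change rather than an attribute assignment.

```python
class BlueGreenRouter:
    """Tracks which of two environments serves customer traffic."""

    def __init__(self, live="green", versions=None):
        if live not in ("blue", "green"):
            raise ValueError("live must be 'blue' or 'green'")
        # Which version is installed in each environment.
        self.versions = versions or {"blue": None, "green": "v1"}
        self.live = live

    @property
    def inactive(self):
        return "blue" if self.live == "green" else "green"

    def deploy(self, version):
        """Deploy to the inactive environment; customers are not affected."""
        target = self.inactive
        self.versions[target] = version
        return target

    def switch(self):
        """Release (or roll back) by pointing traffic at the other environment."""
        self.live = self.inactive
        return self.live

    def serving(self):
        """Version customers currently see."""
        return self.versions[self.live]
```

Note that release and rollback are the same operation here: calling `switch()` a second time sends traffic back to the previous environment, which still holds the old version.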

Just Culture

codified by Dr. Sidney Dekker, when responses to incidents and accidents are seen as unjust, it can impede safety investigations, promoting fear rather than mindfulness in people who do safety critical work, making organizations more bureaucratic rather than more careful and cultivating professional secrecy, evasion, and self-protection

Blameless Post-Mortem

Coined by John Allspaw; helps us examine "mistakes in a way that focuses on the situational aspects of a failure's mechanism and the decision-making process of individuals proximate to the failure." Schedule the post-mortem as soon as possible after the accident has occurred and the problem has been solved, before memories and the links between cause and effect fade or circumstances change. Disallow the phrases "would have" or "could have" because they are counterfactual statements that result from the tendency to create possible alternatives to events that have already happened.

Feature toggles

control what percent of users see the treatment version of an experiment
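One common way to control the percentage is to hash each user into a stable bucket from 0 to 99. The sketch below assumes string user IDs and a hypothetical experiment name; `bucket` and `sees_treatment` are illustrative helpers, not the API of any particular feature-flag library.

```python
import hashlib


def bucket(user_id: str, experiment: str) -> int:
    """Map a user to a stable bucket in the range 0-99 for one experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def sees_treatment(user_id: str, experiment: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` buckets."""
    return bucket(user_id, experiment) < percent
```

Because the bucket depends only on the user and the experiment, a user keeps the same experience across requests, and raising `percent` only adds users to the treatment group without removing anyone already in it.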

Positive Values

customer focus, performance, safety, innovation, service oriented, reliable

Rugged DevOps:

incorporating information security objectives into DevOps

Game Day team

Defines and executes drills; any problems or difficulties are identified, addressed, and tested again

Direct Response Marketing:

defining customer acquisition funnel and performing A/B Testing

Agile Infrastructure and Velocity Movement (2008-2009)

Shared goals b/t Dev and Ops and continuous integration practices make deployment part of everyone's daily work. "DevOps" term coined in Ghent Belgium by Patrick Debois.

Big Bang Approach

Starting everywhere all at once (BAD STRATEGY), e.g. the waterfall model

Using Strangler Application Pattern to Safely Evolve our Enterprise Architecture

"Strangler application" was coined by Martin Fowler in 2004. When we implement strangler applications, we seek to access all services through versioned APIs. Creating strangler applications helps us avoid merely reproducing existing functionality in some new architecture or technology.

Deployment Lead Time

Subset of value stream; begins when any engineer checks a change in to version control and ends when the change is successfully running in production, providing value to the customer and generating useful feedback and telemetry. PHASE 1 - Design and Development: highly variable and highly uncertain b/c of high degree of creativity and work that may never be performed again, resulting in high variability of process times. PHASE 2 - Testing and Operations: strives to have short and predictable lead times, with near 0 defects. Goal is to have testing/operations happening simultaneously with design/development, enabling fast flow and high quality.

Greenfield Project

A new software project or initiative, likely in the early stages of planning or implementation, where we build our applications and infrastructure anew, with few constraints. Typically easier to implement esp. if already funded / a team is in place, don't have to worry about existing code bases, processes, and teams. Typical examples: pilots to demonstrate feasibility of public or private clouds, piloting deployment automation, and similar tools

DevOps

A set of practices that seek to reduce the gap between software development and software operation. It focuses on automating and monitoring all steps of software construction, from integration, testing, releasing to deployment and infrastructure management. The objective is to build shorter development cycles, increased deployment frequency, more dependable releases, in close alignment with business objectives. DevOps is the result of applying Lean principles to the technology value stream and a logical continuation of the Agile software journey that began in 2001

What should happen when production changes are made?

Changes should be: 1. Put into version control 2. Automatically replicated everywhere in our production and pre-production environments, as well as any newly created environments 3. Old ones are destroyed or taken out of rotation

Matrix-oriented organizations

Attempt to combine functional and market orientation. Often result in complex organizational structures, such as individual contributors reporting to 2 or more managers, sometimes achieving neither their functional nor their market orientation goals.

Versioned APIs/Immutable services

enable us to modify the service without impacting the callers, allowing for more loosely coupled architecture

Functional-oriented organizations

(how most IT Ops orgs are structured) Optimize for expertise, division of labor, or reducing cost (DBAs in one group, InfoSec in another, QA in another). Centralize expertise, which helps enable career growth and skill dev. Often have tall hierarchical organizational structures. Prevailing method of organization for Operations.

CI (continuous integration)

- a practice that focuses on making a release easier - developers merge their changes back to the main branch as often as possible - the developer's changes are validated by creating a build and running automated tests against the build - CI puts a great emphasis on testing automation to check that the application is not broken whenever new commits are integrated into the main branch

Continuous Delivery

- an extension of CI to make sure that you can release new changes to your customers quickly in a sustainable way - in addition to having automated testing, you have also automated your release process and can deploy your application at any point in time by clicking a button - when all developers are working in small batches on trunk, or everyone is working off trunk in short-lived feature branches that get merged to trunk regularly, trunk is always kept in a releasable state and we can release on demand

Continuous Deployment

- every change that passes all stages of your production pipeline is released to your customers - there's no human intervention; only a failed test will prevent a new change from being deployed to production - deploying to production at least once per day per developer, or perhaps even automatically deploying every change a developer commits - continuous delivery is the prerequisite for continuous deployment

What should be in version control?

1. All application code and dependencies 2. Any script used to create database schemas, application reference data 3. All the environment creation tools and artifacts 4. Any file used to create containers 5. All supporting automated tests and any manual test scripts 6. Any script that supports code packaging, deployment, database migration, and environment provisioning 7. All project artifacts (requirement documentation, deploy procedures, release notes) 8. All cloud configuration 9. Any other script or config info required to create infrastructure that supports multiple services

Problems Often Caused by Overly Functional Orientation ("Optimizing for Cost")

1. Long lead times and queues esp. for complex activities like large deployments where we need to open tickets w/ multiple groups and coordinate work handoffs leading to long queues at every step 2. Person performing the work has little visibility or understanding of how their work relates to any value stream goals (little motivation or creativity) 3. Poor handoffs, large amounts of re-work, quality issues, bottlenecks, delays 4. As you increase the number of dev teams and their deployment and release frequency most functionally oriented organizations will have difficulty keeping up and delivering satisfactory outcomes.

Why is swarming necessary?

1. Prevents problem from progressing downstream, where cost and effort to repair it increases exponentially and tech debt is allowed to accumulate 2. Prevents work center from starting new work, which will likely introduce new errors 3. If problem is not addressed, work center could have same problem in next operation, requiring more fixes and work 4. Enables learning and prevents loss of critical info due to fading memory or changing circumstances

Side effects of large batch size merges

1. Required effort to successfully merge branches back together increases exponentially as the number of branches increases 2. Increase in rate of code production → increased probability that any given change will impact someone else, and increase in the # of developers who will be impacted when someone breaks the dev pipeline 3. When merging is difficult → less able and motivated to improve and refactor code b/c refactoring will cause rework for everyone else → more reluctant to modify code that has dependencies throughout the codebase, which is where the payoff may be highest

Ineffective Quality Controls

1. Requiring another team to complete tedious, error-prone manual tasks that could be easily automated and run as needed by the team who needs the work performed 2. Requiring approvals from busy people who are distant from the work, forcing them to make decisions without adequate knowledge of the work or its potential implications 3. Creating large volumes of documentation of questionable detail which become obsolete shortly after they are written 4. Pushing large batches of work to teams and special committees for approval and processing and then waiting for responses

Overly tight architecture risks

1. Risk global failure every time a commit to trunk or release to production is attempted 2. Complicated integration b/c of big batch changes in deployment increases risk of something going wrong 3. May take weeks to find and fix problems when coordinating with hundreds of developers to

requirements for a deployment pipeline

1. deploying the same way to every environment 2. smoke testing our deployments 3. ensure we maintain consistent environments

Working Safely within Complex Systems

4 conditions make it safer to work in complex systems: 1. Complex work is managed so that problems in design and operations are revealed 2. Problems are swarmed and solved, resulting in quick construction of new knowledge 3. New local knowledge is exploited globally throughout the organization 4. Leaders create other leaders who continually grow these types of capabilities

Deployment lead times of months

Common in large, complex organizations with tightly-coupled, monolithic applications, often with scarce integration test environments, long test and production environment lead times, high reliance on manual testing, and multiple required approval processes. With long deployment lead times, heroics are required at almost every stage of the value stream: when we merge all the development teams' changes together, the resulting code might not build correctly or pass any of the tests, and fixing problems might require weeks, resulting in poor customer outcomes.

Complex system

Complex system means no one person can see the system as a whole and understand how all the pieces fit together, high interconnectedness of tightly-coupled components and system-level behavior that cannot be explained merely in terms of the behavior of the system components

How to automate environment build process:

1. Copying a virtualized environment 2. Building an automated environment creation process that starts from "bare metal" 3. Using "infrastructure as code" configuration management tools 4. Using automated operating system configuration tools 5. Assembling an environment from a set of virtual images or containers 6. Spinning up a new environment in a public cloud

Adopt Trunk-Based Development Practices

Countermeasure to large batch size merges: institute continuous integration and trunk-based development practices, where all developers check code into trunk at least once per day. Frequent code commits to trunk → we can run all automated tests on our software system as a whole and receive alerts when a change breaks the app or interferes with someone else's work → small problems can be corrected faster

See problems as they occur

Dr. Peter Senge describes feedback loops as a critical part of learning organizations and systems thinking. Feedforward and feedback loops cause components within a system to reinforce or counteract each other. Fast feedback and feedforward loops: 1. Allow quick detection of and recovery from problems 2. Inform us how to prevent these problems from occurring again in the future

Expanding DevOps Across Our Organization

Dr. Roberto Fernandez describes ideal phases used by change agents to build and expand their coalition and base of support 1. Find Innovators and Early Adopters (Kindred spirits, and better if respected and have high credibility) 2. Build Critical Mass and Silent Majority (Work with more teams who are receptive to ideas, create bandwagon effect that further increase influence) 3. Identify Holdouts (After silent majority achieved, tackle high profile influential detractors who are most likely to resist)

The First Way

Enables fast left-to-right flow of work from Development to Operations to the customer. Speeding up flow through the technology value stream (TVS) reduces the lead time required to fulfill internal or customer requests, increasing the quality of work and throughput and boosting our ability to out-experiment the competition. Resulting practices: 1. Continuous build, integration, test, and deployment processes 2. Creating environments on demand 3. Limiting Work in Progress (WIP) 4. Building systems and organizations that are safe to change

The Third Way

Enables the creation of generative high-trust culture that supports dynamic, disciplined, and scientific approach to experimentation and risk taking, facilitating the creation of organizational learning, both from success and failures. Design system of work so that we can multiply the effects of new knowledge, transforming local discoveries into global improvements.

Keep Pushing Quality Closer to the Source

Everyone in value stream should find and fix problems in their area of control as part of daily work, so push quality and safety responsibilities and decision-making to where the work is performed, instead of relying on approvals from distant execs.

Strangler Application pattern

Example of evolutionary design: instead of "ripping out and replacing" old services with architectures that no longer support our organizational goals, put the existing functionality behind an API and avoid making further changes to it. All new functionality is then implemented in the new services that use the new desired architecture, making calls to the old system when necessary. Useful when migrating portions of a monolithic application or tightly-coupled service to one that is more loosely-coupled.
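A minimal sketch of the routing idea: a facade sits in front of the old system, sends already-migrated routes to new handlers, and falls back to the legacy system for everything else. The class `StranglerFacade`, the route strings, and the handler signatures are all hypothetical.

```python
from typing import Callable, Dict, Optional

LegacySystem = Callable[[str, dict], str]   # handles any route
NewHandler = Callable[[dict], str]          # handles one migrated route


class StranglerFacade:
    """Single entry point that gradually shifts routes to new services."""

    def __init__(self, legacy: LegacySystem,
                 new_handlers: Optional[Dict[str, NewHandler]] = None):
        self.legacy = legacy
        self.new_handlers = dict(new_handlers or {})

    def migrate(self, route: str, handler: NewHandler) -> None:
        """Register a new-architecture handler for one route."""
        self.new_handlers[route] = handler

    def handle(self, route: str, payload: dict) -> str:
        handler = self.new_handlers.get(route)
        if handler is not None:
            return handler(payload)
        # Not migrated yet: the old system keeps serving it unchanged.
        return self.legacy(route, payload)
```

Callers only ever talk to the facade, so routes can be moved one at a time without callers noticing, and the legacy code is left untouched until nothing depends on it.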

Brownfield Project

Existing products or services that are already serving customers and have been in operation for years or even decades. Typically come with significant amounts of tech debt, e.g. no test automation or running on unsupported platforms. DevOps has been used to successfully transform brownfield projects (60% of transformation stories at the DevOps Enterprise Summit were brownfield).

The Second Way

Fast and constant flow of feedback from right to left at all stages of our value stream. Requires ample feedback to prevent problems from happening again, or to enable faster detection and recovery. Create quality at the source and generate or embed knowledge where it is needed. By seeing problems as they occur and swarming them until effective countermeasures are in place, we continually shorten and amplify our feedback loops (maximizes the opportunities for our organizations to learn and improve).

Swarm and Solve Problems to Build New Knowledge

Goal of Swarming: to contain problems before they have a chance to spread, and to diagnose and treat the problem so that it cannot recur (Dr. Spear)

CHAPTER 9

Goal: ensure that we can re-create the entire production environment based on what's in version control

Design Team Boundaries in Accordance with Conway's Law

Ideally software architecture should enable small teams to be independently productive, sufficiently decoupled from each other so that work can be done without excessive or unnecessary communication or coordination

Predictor of performance with DevOps

If application could be architected or rearchitected for testability and deployability.

Problems that arise when our telemetry data has non-gaussian distribution

In Operations data sets often have "chi squared" distribution so using standard deviation method can lead to over-alerting

Testing, Operations and Security as Everyone's Job Every Day

In high-performing orgs, the team shares the common goal of quality, availability, and security as part of everyone's everyday job. Rotation of on-call duty to firefight for the services people built.

Continuous Delivery Movement (2006 and 2009)

Jez Humble and David Farley extended upon development discipline of continuous build, test, and integration. Continuous delivery: role of a "deployment pipeline" to ensure that code and infrastructure are always in a deployable state, all code checked into trunk can be safely deployed into production.

Principle of Evolutionary Architecture

Jez Humble says "architecture of any successful product or organization will necessarily evolve over its life cycle"

Keep Team Sizes Small (The "Two-Pizza Team" Rule)

Keeping the team small reduces the amount of inter-team communication and encourages us to keep the scope of each team's domain small and bounded. Small team size has 4 important effects: 1. Team has a clear, shared understanding of the system they are working on 2. Limits the growth rate of the product or service being worked on 3. Decentralizes power and enables autonomy 4. Leading a 2PT is a way for employees to gain some leadership experience in an environment where failure does not have catastrophic consequences

Modify our Definition of Development "Done" to include running in production-like environments

Only accept dev work as done when it can be successfully built, deployed, and confirmed to run as expected in a production-like environment, instead of merely when a developer thinks it's done. Ideally it runs under production-like load with a production-like dataset long before the end of the sprint.

Making Functional Orientation Work

Possible to achieve DevOps outcomes with a functional organization, as long as everyone in the value stream views customer and organizational outcomes as a shared goal, regardless of where they sit in the organization, so that teams can easily get what they need from Ops reliably and on demand. Requires a high-trust culture, transparently prioritized work, and sufficient slack in the system to allow high-priority work to be completed.

Measures of performance in value streams

Lead Time: starts when the request is made and ends when it is fulfilled (focus on this because lead time is what the customer experiences). Process Time: starts when we begin work on the customer request and ends when it is fulfilled; omits the time that the work is in queue. %C/A (percent complete and accurate, a measure of rework): quality of output of each step in the value stream, obtained by asking downstream customers what % of the time they receive work that is "usable as is".
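A worked example of the three measures, with made-up timestamps and counts. The variable names and the helper `percent_complete_accurate` are hypothetical and exist only for this illustration.

```python
from datetime import datetime


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


# Illustrative timestamps for a single request.
requested = datetime(2024, 3, 1, 9, 0)      # customer makes the request
work_started = datetime(2024, 3, 4, 9, 0)   # someone picks it up from the queue
fulfilled = datetime(2024, 3, 4, 17, 0)     # request is fulfilled

lead_time_h = hours_between(requested, fulfilled)        # what the customer experiences
process_time_h = hours_between(work_started, fulfilled)  # time actually spent working
queue_time_h = lead_time_h - process_time_h              # time spent waiting


def percent_complete_accurate(usable_as_is: int, total_received: int) -> float:
    """%C/A: share of work items the downstream step could use without rework."""
    return 100.0 * usable_as_is / total_received
```

Here the lead time is 80 hours but the process time is only 8, so 72 hours were spent in queue; the gap between the two is usually where the largest improvement opportunity lies.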

Continually optimizing for downstream work centers

Lean defines 2 types of customers that we must design for: 1. External customer (who most likely pays for the service we are delivering) 2. Internal customer (who receives and processes the work immediately after us). Most important customer, according to Lean: our next step downstream. How to create quality at the source: optimize for downstream work centers by designing for operations, where non-functional requirements are as highly prioritized as user features.

Agile Manifesto (2001)

Lightweight set of values and principles. Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale (NO more waterfall). Small, motivated teams working in a high-trust management model. Values: 1. Individuals and interactions over processes and tools 2. Working software over comprehensive documentation 3. Customer collaboration over contract negotiation 4. Responding to change over following a plan

An Architecture that Enables Productivity, Testability, and Safety

Loosely coupled architecture with well-defined interfaces that enforce how modules connect with each other promote productivity and safety

Create Loosely-Coupled Architectures to Enable Developer Productivity and Safety

Loosely-coupled architecture means services can update in production independently without having to update other services Have bounded contexts: dev should understand and update code of a service without knowing anything about the internals of its peer services Ensure that services are compartmentalized and have well-defined interfaces for easier testing

Toyota Kata (2009)

Mike Rother framed 20 year journey to understand and codify the Toyota Production System. Argued Lean community missed the "improvement kata": creating structure for the daily, habitual practice of improvement work b/c daily practice is what improves outcomes. Constant cycle of establishing desired future states, setting weekly target outcomes, and the continual improvement of daily work is what guided improvement at Toyota.

Deployment lead time of minutes

Needs for ideal: developers receive fast, constant feedback on their work, and can quickly and independently implement, integrate, and validate their code and deploy it into the production environment. Achievable by: 1. Continually checking small code changes into the version control repository 2. Performing automated and exploratory testing against changes and deploying them into production 3. An architecture that is modular, well encapsulated, and loosely coupled, so small teams are able to work autonomously with failures being small and contained

Enable Every Team Member to be a Generalist

Prevent the extreme case of functionally-oriented orgs: over-specialized departments and siloization, which lead to multiple handoffs and queues and long lead times. We don't want engineers who are "frozen in time" when tech is ever changing.

The Principles of Feedback

Principles that enable the reciprocal fast and constant feedback from right to left at all stages of the value stream. Goal is to achieve quality, reliability, and safety. 1. Seeing problems as they occur 2. Swarming and solving problems to build new knowledge 3. Pushing quality closer to the source 4. Continually optimizing for downstream work centers

Small Batch Development and what happens when we commit code to trunk infrequently

Problem with developers working in long-lived private branches("feature branches"): merges into trunk only happen sporadically → large batch size of changes Branching Strategies: 1. Optimize for individual productivity (everyone has their own private branch) 2. Optimize for team productivity (everyone works in the same common area)

CHAPTER 11

Problem: when developers fail to integrate their branches regularly and create more branches, integration becomes very difficult, requiring rework, delaying feedback, and delaying production. When integration is painful we do it less often, making future merges even worse. Goal: use continuous integration to make merging into trunk a part of everyone's daily work.

Enable On-Demand Creation of Dev, Test, and Production Environments

Problem: teams can only see how the app really behaves by running it in a production-like environment, but test environments are often scarce and have long lead times → teams cannot test in production-like environments → massive failures when actually released. Solution: developers run production-like environments on their own workstations, created on demand and self-service, so that they can run and test code in production-like environments as part of daily work. To do this, create a common build mechanism that creates all the environments, so anyone can get a production-like environment fast without opening a ticket. Carefully define stable, reliable, secure environment requirements.

Technology Value Stream

Process required to convert a business hypothesis into a technology-enabled service that delivers value to the customer. Input: formulation of a business objective, concept, idea, or hypothesis Start when accept work in development, adding it to our committed backlog of work → dev teams follow agile or iterative process to form user stories → implement in code into app or service → check code into version control repository → each change integrated and tested w/ rest of software system Value only created when services in production: Fast Flow and deployment without service outages, impairments, security or compliance failures.

Pros/Cons of Optimizing for individual productivity

Pros: everyone works independently and nobody can disrupt anyone else's work Cons: Merging is a nightmare, collaboration is very difficult b/c everyone's work has to be merged with everyone else's work to see even a small part of the complete system

Pros/Cons of optimizing for team productivity

Pros: no branches, just a long unbroken straight line of dev, nothing to understand so commits are simple Cons: each commit can break the entire project and bring all progress to a halt

Version Control

Records changes to files or set of files stored within the system, could be source code, assets, or other documents part of software project, stores changes as commits and allows for reversion back to previous versions.

Enable Market-Oriented Teams ("Optimizing for Speed")

Responsible for feature dev, testing, securing, deploying, and supporting their service in production, from idea conception → retirement. Cross-functional and independent teams that run experiments and build/deliver/deploy/run/fix features without manual dependencies on other teams. To achieve this, embed functional engineers and skills (Ops, QA, InfoSec) into each service team, or provide their capabilities to teams through automated self-service platforms.

Use means and standard deviations to detect potential problems

Use means and standard deviations to create a filter that detects when a metric is significantly different from its norm and raises an alert. Improve these alerts by focusing on the variances or outliers that matter, to prevent alert fatigue.
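A minimal sketch of such a filter using Python's `statistics` module. The function name `is_anomalous`, the sample values, and the default threshold of 3 standard deviations are assumptions for illustration; this approach also presumes the metric is roughly Gaussian.

```python
import statistics


def is_anomalous(history, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` standard deviations
    from the mean of `history` (needs at least two historical points)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        return value != mean
    return abs(value - mean) > threshold * stdev


# Illustrative request-latency samples in milliseconds.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
```

With this baseline the mean is 100 and the sample standard deviation is 2, so only values more than 6 ms from the mean trigger an alert; raising the threshold is one simple way to trade sensitivity for fewer alerts.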

Create our Single Repository of Truth for the Entire System

VERSION CONTROL. Need to be able to re-create any previous state of the production environment, pre-production, and build process, including tools and environments. Need code and environments to BOTH be in version control. Ops needs version control b/c there are way MORE config settings for the environment than for the code.

Lean Movement (1980s-1990s)

Value Stream Mapping, Kanban Boards, and Total Productive Maintenance. Create value for customer by creating constancy of purpose, scientific thinking, creating flow and pull, assuring quality at source, humility, respect everyone

Outlier detection

abnormal running conditions from which significant performance degradation may well result

Andon Cord

Cord that every worker and manager is trained to pull when something goes wrong. The team leader is alerted and immediately works to resolve the problem. If the problem cannot be resolved within a specified time, the production line is halted so the entire org can assist until the problem is resolved.

Anomaly Detection

the search for items or events which do not conform to an expected pattern

Benefits to creation of dev, test, and production environments

providing early and constant feedback to dev, reproduce/diagnose defects safely isolated from production, experiment with changes to environment to create more shared knowledge b/t dev and operations

Monolith v2

Sets of monolithic tiers: front-end presentation, application server, database layer. Pros: 1. Simple at first 2. Join queries are easy 3. Single schema, deployment 4. Resource-efficient at small scales Cons: 1. Tendency for increased coupling over time 2. Poor scaling and redundancy (all or nothing, vertical only) 3. Difficult to tune properly 4. All-or-nothing schema management

Smoothing

Suitable if the data is a time series. Uses moving averages, which transform the data by averaging each point with all the other data within a sliding window; this highlights longer-term trends or cycles.
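A minimal sketch of a simple moving average in plain Python. The function name `moving_average` and the sample series are hypothetical; this version uses a trailing window and returns one value per full window, which is one of several possible conventions.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive points in `series`."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# A steady series with one spike at index 2.
raw = [10, 10, 40, 10, 10]
smoothed = moving_average(raw, 5)  # one window covering all five points
```

The single spike of 40 is averaged down to 16.0, which shows both the benefit (less noise) and the cost (short-lived anomalies are dampened) of a wide window.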

Discipline of daily code commits forces...

work broken down into smaller chunks while keeping trunk in a working, releasable state

