Chapter 12: Dev/Ops
What is Availability?
The fraction of time your site is available and working correctly; that is, what percentage of the time is your app correctly serving requests?
When would one use primary/replica or primary/secondary configuration?
when making the persistence layer "shared-nothing" is much more complicated - used when the database is read much more frequently than it is written. In this approach, any replica can perform reads, only the primary can perform writes, and the primary updates the replicas with the results of writes as quickly as possible.
What is Monitoring and when do you use it?
what: collecting app performance data for analysis and visualization. In the case of SaaS, application performance monitoring (APM) refers to monitoring the Key Performance Indicators (KPIs) that directly impact business value. KPIs are by nature app-specific. when: to identify which parts of our app need attention.
What two kinds of computation does caching help?
1. if information needed from the database to complete an action hasn't changed, we can avoid querying the database at all 2. if the information underlying a particular view or view fragment hasn't changed, we can avoid re-rendering the view
For a SaaS app to scale to large numbers of users, it must maintain its _________ and _________ as the number of users increases, without increasing the _________.
Availability; responsiveness; cost per user
What is Defensive programming?
a philosophy that tries to anticipate potential software flaws and write code to handle them
What does the presentation tier (HTTP server/webserver) do?
accepts HTTP requests from the outside world (i.e., users) and handles the serving of static assets such as images, stylesheets, files of JavaScript code, and so on.
What do performance stability and security encompass?
performance stability: responsiveness, release management, and availability - high availability and responsiveness, release management without downtime, and scalability without increasing per-user costs are three key performance stability concerns of SaaS apps. security: privacy, authentication, and data integrity
Why should you be wary of Rails page caching?
the name of the cached object ignores embedded parameters in URIs such as /movies?ratings=PG+G, so parameters that affect how the page would be displayed should instead be part of the RESTful route, as in /movies/ratings/PG+G.
Although deployment is a non-event, what is the role for release milestones?
they reassure the customer that new work is being deployed.
Why is PaaS preferred for early-stage and many mature SaaS apps?
- basic scaling issues and performance tuning are handled for you by professional PaaS administrators who are more experienced at operations than most developers
What is security in the scope of non-functional requirements?
- defensive programming to make your system more robust against failures can also help make your system more secure
How do you address incremental feature rollout?
- deploying the code and migration atomically: take the service offline, apply the migration to perform the schema change and copy the data into the new column, and bring the service back online ---- may cause unacceptable unavailability due to the migration
Making deployment a true non-event requires meeting what two additional challenges?
- deployment testing: must account for differences between the development and production environments - incremental feature roll-out: features that may require several code pushes, especially features that require database schema changes
What is continuous deployment?
- extreme version of making deployment a non-event defined: every successful CI run (continuous integration) automatically triggers a deployment to staging or production. - CD can result in multiple deployments per day, many of which include changes not visible to the customer that "build towards" a feature that will be unveiled at a Release milestone.
What are three examples of defensive programming?
- Check input values. A common cause of problems is for the user to input values that the developer doesn't expect. Checking that the input is in a reasonable range for individual values, that it is not too big for a series of data, and that the collection of inputs are logically consistent can reduce the chances of outages. - Check input data type. Another mistake users can make is to enter an unexpected type of data in response to a query. Making sure the user enters a valid type of data increases the chances of success for the app. - Catch exceptions. Modern programming languages offer the ability to execute code when exceptions occur, such as arithmetic overflow. Offering code that can catch any exception increases the chances of the app continuing to run well even when unexpected events occur.
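The three checks above can be sketched in plain Ruby. This is a minimal illustration, not a prescribed implementation: `parse_count` and `MAX_COUNT` are hypothetical names, and the upper bound of 1,000 is an assumption chosen only for the example.

```ruby
# Hypothetical helper: defensively parse a user-supplied count.
MAX_COUNT = 1_000 # assumed upper bound, for illustration only

def parse_count(raw)
  # Check input data type: Integer() raises on non-numeric strings.
  count = Integer(raw.to_s, 10)
  # Check input values: reject anything outside the expected range.
  raise ArgumentError, "count out of range: #{count}" unless (0..MAX_COUNT).cover?(count)
  count
rescue ArgumentError, TypeError => e
  # Catch exceptions: report the problem and fall back instead of crashing.
  warn "rejected input #{raw.inspect}: #{e.message}"
  nil
end
```

Returning `nil` for rejected input is one possible design; a caller could equally re-prompt the user or substitute a default.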
How does monitoring also help you understand your customers' behavior?
- Clickstreams: what are the most popular sequences of pages your users visit? - Think times/dwell times: how long does a typical user stay on a given page? - Abandonment: if your site contains a flow that has a well-defined termination, such as making a sale, what percentage of users "abandon" the flow rather than completing it and how far do they get? - can be extremely valuable business data even though unrelated to performance per se
What does the total response time perceived by the users include?
- DNS lookup, - time to set up the TCP connection and send the HTTP request to the server, - Internet-induced latency in receiving a response containing enough content that the browser can start to draw something
In any caching scenario, we must address what two issues?
- Naming: how do we specify that the result of some computation should be cached for later reuse, and name it in a way that ensures it will be used only when that exact same computation is called for? - Expiration: How do we detect when the cached version is out of date (stale) because the information on which it depends has changed, and how do we remove it from the cache? The variant of this problem that arises in microprocessor design is often referred to as cache invalidation.
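Both issues can be seen in a toy in-memory cache. This is a sketch under stated assumptions, not how Rails implements caching: `TinyCache` is a hypothetical class, the key stands in for the name of the computation, and the `version` argument stands in for anything that changes when the underlying data changes (such as an updated-at timestamp).

```ruby
# Toy cache illustrating naming and expiration (not the Rails cache API).
class TinyCache
  def initialize
    @store = {}
  end

  # key names the computation; version changes whenever its inputs change.
  def fetch(key, version)
    entry = @store[key]
    return entry[:value] if entry && entry[:version] == version # fresh hit
    value = yield # miss or stale entry: recompute
    @store[key] = { version: version, value: value }
    value
  end
end

cache = TinyCache.new
calls = 0
cache.fetch("movies/ratings/PG+G", 1) { calls += 1; "view v#{calls}" } # miss: computes
cache.fetch("movies/ratings/PG+G", 1) { calls += 1; "view v#{calls}" } # hit: reuses
cache.fetch("movies/ratings/PG+G", 2) { calls += 1; "view v#{calls}" } # stale: recomputes
```

Putting every input that affects the result into the key (here, the ratings filter) is what guarantees the cached value is reused only for that exact computation.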
What is release management in the scope of non-functional requirements?
- Plan-and-document processes often produce software products that have major releases and minor releases. Considered a case of configuration management in Plan-and-Document processes - Release management includes picking dates for the release, information on how it will be distributed, and documenting everything so that you know what exactly is in the release and how to make it again so that it is easy to change when you have to make the next release.
What is reliability in the scope of Non-functional requirements?
- The main tool in our bag to make a system dependable is redundancy - By having more hardware than the absolute minimum needed to run the app and store the data, the system has the potential to continue even if a component fails - make sure there is no single point of failure, as it can be the Achilles' Heel of a system.
What does stress testing or longevity testing do?
- Use after a monitoring tool has identified the slowest or most expensive requests - can quantify the level of demand at which those requests become bottlenecks - frequently expose bugs that would otherwise remain hidden
What is Platform as a Service?
- a curated software stack ready for you to deploy your app, with much of the administration and scaling responsibility managed for you, making deployment much more developer-friendly ex) heroku - PaaS providers may either run their own datacenters or, increasingly, rely on lower-level Infrastructure as a Service (IaaS) providers such as the Amazon public cloud, as Heroku does
What is lazy evaluation?
- actual database query doesn't happen until each is called, because that's the first time the ActiveRelation object is asked to produce a value
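The same idea can be illustrated without a database using Ruby's built-in lazy enumerators. This is only an analogy to how an ActiveRelation defers its query; `lazy_squares_demo` is a hypothetical name and no Rails code is involved.

```ruby
# Plain-Ruby analogy: a lazy pipeline records the steps but performs no
# work until a value is actually demanded.
def lazy_squares_demo
  evaluated = []
  pipeline = (1..Float::INFINITY).lazy
               .map { |n| evaluated << n; n * n }
               .select(&:even?)
  before = evaluated.dup        # still empty: nothing has run yet
  first_two = pipeline.first(2) # demanding values forces evaluation
  { before: before, result: first_two, evaluated: evaluated }
end
```

Calling `first(2)` plays the role that `each` plays for a relation: it is the first point at which a value must be produced, so only then is the work done, and only as much as needed.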
What is the strategy for data migration?
- add a new column - remove the old column - if necessary rename the new column, - using feature flags during each transition so that every deployed version of the code works with both versions of the schema.
What are the thresholds for user satisfaction on responsiveness?
- if a computer system responds to a user action within 100 ms, it's perceived as instantaneous - within 1 second, the user will still perceive a cause-and-effect connection between their action and the response, but will perceive the system as sluggish - after about 8 seconds, the user's attention drifts away from the task while waiting for a response - currently the "eight-second rule" still applies even though systems as a whole have gotten faster
What is Dependability in the scope of non-functional requirements?
- is holistic, involving the software and the operators as well as the hardware - No matter how dependable the hardware is, errors in the software and mistakes by the operators can lead to outages that reduce the mean time to failure (MTTF)
Why might your app be unavailable?
- may have crashed because of an unexpected error
What do both page- and fragment-level caching have in common?
- reward our ability to separate things that change (non-cacheable units) from those that stay the same (cacheable units)
What is the principle of least privilege?
- states that a user or software component should be given no more privilege (that is, no further access to information and resources) than what is necessary to perform its assigned task - a "need-to-know" basis - ex) Unix processes corresponding to your Rails app, your database, and the Web server (presentation tier) should run with low privilege and in an environment where they cannot even create new files in the file system. Good PaaS providers, including Heroku, offer a deployment environment configured in just this way.
How do you measure reliability?
- reliability improves by taking longer between failures (mean time to failure, MTTF) or by making the app reboot faster (mean time to repair, MTTR) - to measure MTTR, we just crash a computer and see how long it takes the app to reboot.
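A short worked calculation shows how the two quantities combine. It uses the standard steady-state formula availability = MTTF / (MTTF + MTTR), which is not stated in these notes; the function name and the sample numbers are illustrative assumptions.

```ruby
# Steady-state availability from mean time to failure and mean time to
# repair, both in the same unit (hours here).
def availability(mttf_hours, mttr_hours)
  mttf_hours.to_f / (mttf_hours + mttr_hours)
end

availability(999, 1)   # one hour of repair per 999 hours of uptime
availability(999, 0.5) # halving repair time raises availability
```

The formula makes the trade-off explicit: rebooting twice as fast helps about as much as failing half as often.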
How do you address deployment testing?
- test the app in ways it was never meant to be used—users submitting nonsensical input, browsers disabling cookies or JavaScript, miscreants trying to turn your site into a distributor of malware —and ensuring that it survives those conditions without compromising customer data or responsiveness.
How is the collected monitoring data stored, and how is it presented or reported to the app's dev/ops team?
- the combination of hosted PaaS, Ruby's dynamic language features, and well-factored frameworks such as Rails allows internal monitoring without modifying your app's source code or installing software. - (Remote Performance Monitoring) RPM: data is sent back to New Relic's SaaS site where you can view and analyze it
What is fragment caching?
- use partials to isolate each noncacheable entity, such as a single model instance, into its own partial that can be fragment-cached. - partials can be used to break up views into cacheable fragments.
Why is overprovisioning used and what does it do?
- used: in anticipation of crash recovery and rolling reboot - what: provide more servers in a tier at any given time than you think you'll need - pitfall: overprovisioning is both economically unattractive and may be insufficient by itself - A few years ago, overprovisioning meant purchasing additional hardware that might sit idle, but pay-as-you-go cloud computing lets you "rent" the extra servers for pennies per hour only when needed
What are the components in three-tier architecture?
1. presentation tier: usually consists of an HTTP server (or simply "Web server"); renders views and interacts with the user - web server forwards requests for dynamic content to the logic tier 2. logic tier: where your actual application runs (application server) 3. persistence tier: where application data that must remain stored across HTTP requests is kept, such as users' login and profile information (ex, MySQL or PostgreSQL) - all tiers are "logical"
What are the two reasons databases became mainstream?
1. provided high durability for stored information—the guarantee that once something has been stored, unexpected events such as system crashes or transient data corruption won't cause data loss 2. databases store information in a structured format—in the case of relational databases, by far the most popular type, each kind of object is stored in a table whose rows represent object instances and whose columns represent object properties. makes things easier to manipulate
Which one, if any, is a POOR place to store the value (eg true/false) of a feature flag?
A YAML file in config/ directory of app explanation: If stored in a file, we need logic to determine if the file has changed and when it can be re-read, so that the feature flag value can be changed without restarting the app and re-reading all the config files. In contrast, database-stored values can be changed at runtime and the new values will be picked up immediately by the app.
Why is an indexed lookup less expensive than a table scan?
An index is a separate data structure maintained by the database that uses hashing techniques over the column values to allow constant-time access to any row when that column is used as the constraint. - You can have more than one index on a given table and even have indices based on the values of multiple columns - obvious attributes named explicitly in where queries and foreign keys (the subject of the association) should usually be indexed
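The difference between scanning and indexed lookup can be sketched with a plain Ruby hash standing in for the index. This is a simplified model, not how any particular database implements indices; `Row`, `scan`, and `build_index` are hypothetical names.

```ruby
Row = Struct.new(:id, :movie_id, :moviegoer_id)

# Table scan: examine every row, so cost grows with table size.
def scan(rows, movie_id)
  rows.select { |r| r.movie_id == movie_id }
end

# Index: a separate hash from column value to matching rows. It takes
# extra space and must be updated on every write to the table.
def build_index(rows)
  rows.group_by(&:movie_id)
end

reviews = [Row.new(1, 10, 7), Row.new(2, 11, 7), Row.new(3, 10, 8)]
index = build_index(reviews)
index.fetch(10, []) # same rows as scan(reviews, 10), found by hash lookup
```

The hash must be rebuilt or patched whenever a row changes, which is the space and update cost an index trades for faster queries.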
How does shrink-wrapped software compare to SaaS developer-operators?
Compared to shrink-wrapped software, SaaS developer-operators are typically much more involved with deploying, releasing, and upgrading their apps and monitoring them for problems with performance or security.
What is Authentication?
Can the app ensure that a given user is who they claim to be, by verifying a password or using third-party authentication such as Facebook Connect or OpenID in such a way that an impostor cannot successfully impersonate another user without having obtained the user's credentials?
Let R = RottenPotatoes app's availability; H = Heroku's availability; C = Internet connection availability; P = Armando's perception of RP availability; Which relationship among these quantities holds? - P >= min (C, H, R) - P <= C <= H <= R - P <= C <= min(H, R) - Can't tell without additional information
Can't tell without additional information Explanation: If Prof. Fox was accessing the site constantly around-the-clock, you could argue that his perception of availability would be no better than the minimum of C,H,R. But since we don't know how often or when he is accessing it, there's no way to tell. If the site is only down 1 minute per day but that's the only time he tries to access it, his perception will be 0% availability, and so on. This is one reason availability is more subtle to quantify than you might expect.
Explain why cloud computing might have had a lesser impact on SaaS if most SaaS apps didn't follow the shared-nothing architecture.
Cloud computing allows easily adding and removing computers while paying only for what you use. The shared-nothing architecture takes advantage of this ability to rapidly "absorb" new computers into a running app and "release" them when no longer needed.
True or False: From the perspective of responsiveness, faster is always better.
False. Faster than 100 ms is not perceptible to people, and people abandon sites only when responsiveness slows to 8 seconds or worse.
What is Release management?
How can you deploy or upgrade your app "in place" without reducing availability and responsiveness?
What is Responsiveness?
How long do most users wait before the app delivers a useful response? - the perceived delay between when a user takes an action such as clicking on a link and when the user perceives a response, such as new content appearing on the page
Which aspects of application scalability are not automatically handled for you in a PaaS environment?
If your app "outgrows" the capacity of the largest database offered by the PaaS provider, you will need to manually build a solution to split it into multiple distinct databases. This task is highly app-specific so PaaS providers cannot provide a generic mechanism to do it.
What are the three axes in which you can characterize the various techniques for monitoring SaaS apps
Is the monitoring active or passive? - In active monitoring, an external stimulus is deliberately applied to the app (even if the app would be otherwise idle) in order to ensure it's working. In passive monitoring, no monitoring data is collected until some external user asks the app to do something. Is the monitoring external or internal? - External monitoring can only report on the behavior of an app as seen from the outside, for example, reporting that some types of requests take longer than other types. Internal monitoring can "hook" into the code of the app server or the app itself, so it can provide better attribution: how long did a request spend in each tier of the SaaS stack and in different parts of your app (for example, the controllers, the models, the database, or the view rendering)? Is the monitoring focused on app performance or user behavior? - For example, you might want to know what fraction of users who added an item to their shopping cart ended up purchasing the item, and what actions were taken instead by the users who didn't end up completing the purchase. Such questions can be critical for a business even though they have little to do with performance. (Of course, performance monitoring might reveal the reasons some users don't complete the purchase flow!)
Which is probably NOT a metric of high interest to you, the app operator? - 99 percentile response time - Rendering time of 3 slowest views - Maximum CPU utilization - Slowest queries
Maximum CPU utilization Explanation: As an operator, you should focus on metrics that directly impact the customer. While CPU utilization may be related to the other metrics, unless you know for certain that it's the root cause of poor behavior in those other metrics, you should instead focus on understanding why the other metrics are poor.
Which statement regarding reliability and security is most likely FALSE? - Not removing data races could violate the security principle of psychological acceptability - Improper initialization of data could violate the security principle of fail-safe defaults - None are false; all are true - Not checking buffer limits could violate the security principle of least privilege
Not removing data races could violate the security principle of psychological acceptability explanation: Data races describe an error related to non-determinism caused by two competing processes commonly found in systems (OS, databases, parallel computing). This is not particularly related to psychological acceptability, which refers to a more social set of conditions.
Besides buffer overflows, arithmetic overflows, and data races, list another potential bug that can lead to a security problem by violating one of the three security principles listed above.
One example is improper initialization, which could violate the principle of fail-safe defaults.
What are needlessly expensive queries?
One way to relieve pressure on your database ex) - The n+1 queries problem occurs when traversing an association performs more queries than necessary. (be avoided by judicious use of eager loading) - The table scan problem occurs when your tables lack the proper indices to speed up certain queries. - avoid queries that result in a full table scan (use index instead!)
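The n+1 pattern can be made visible with a toy "database" that counts queries. This is plain Ruby rather than ActiveRecord, and `FakeDb` and its methods are hypothetical; in Rails, eager loading is typically requested with `includes`. The sketch also omits the initial query that fetches the list of movies (the "1" in n+1).

```ruby
# Toy query counter (not ActiveRecord).
class FakeDb
  attr_reader :queries

  def initialize(reviews)
    @reviews = reviews # array of { movie_id:, text: } hashes
    @queries = 0
  end

  # One query per call: traversing n movies this way costs n queries.
  def reviews_for(movie_id)
    @queries += 1
    @reviews.select { |r| r[:movie_id] == movie_id }
  end

  # Eager-loading analogue: one query fetches reviews for the whole batch.
  def reviews_for_all(movie_ids)
    @queries += 1
    @reviews.select { |r| movie_ids.include?(r[:movie_id]) }
            .group_by { |r| r[:movie_id] }
  end
end

db = FakeDb.new([{ movie_id: 1, text: "fun" }, { movie_id: 2, text: "slow" }])
[1, 2, 3].each { |id| db.reviews_for(id) } # three separate queries
db.reviews_for_all([1, 2, 3])              # one batched query
```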
How is performance a non-functional requirement?
Performance is not a topic of focus in conventional software engineering, in part because it has been the excuse for bad practices and in part because it is well covered elsewhere. Performance can be part of the non-functional requirements and then later in acceptance-level testing to ensure the performance requirement is met.
What ideas are behind part of the non-functional requirements?
Performance, Release Management, Reliability, Dependability, and Security
What does the tiger team do?
Pretends to be adversaries who perform penetration tests. The team reports back to the developers the uncovered vulnerabilities.
Which of the following key performance indicators (KPIs) would be relevant for Application Performance Monitoring: - CPU utilization of a particular computer; - completion time of slow database queries; - view rendering time of 5 slowest views.
Query completion times and view rendering times are relevant because they have a direct impact on responsiveness, which is generally a Key Performance Indicator tied to business value delivered to the customer. CPU utilization, while useful to know, does not directly tell us about the customer experience.
An index on a database table usually speeds up _________ at the expense of _________ and _________.
Query performance at the expense of space and table-update performance
The master-slave configuration is appropriate for applications that experience what kind of workload? - None, the configuration doesn't affect how the workload is handled - Read heavy - Write heavy - Equal distribution of read and write operations
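Read heavy Explanation: In the master/slave configuration, any slave (replica) can serve reads, but only the master accepts writes, which it then propagates to the slaves. Adding slaves therefore increases read capacity but not write capacity, so the configuration suits workloads that read far more often than they write.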
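A minimal sketch of the routing idea: writes go to a single primary copy and are pushed to the replicas, while reads are spread across the replicas. `ReplicatedStore` is a hypothetical in-memory model; it propagates writes synchronously and assumes at least one replica, whereas real databases replicate over the network and may lag.

```ruby
# In-memory model of primary/replica routing (illustration only).
class ReplicatedStore
  attr_reader :reads_per_replica

  def initialize(replica_count)
    @primary = {}
    @replicas = Array.new(replica_count) { {} }
    @reads_per_replica = Array.new(replica_count, 0)
    @next = 0
  end

  # Only the primary accepts writes; it then updates every replica.
  def write(key, value)
    @primary[key] = value
    @replicas.each { |r| r[key] = value }
  end

  # Any replica can serve a read; rotate through them round-robin.
  def read(key)
    i = @next % @replicas.size
    @next += 1
    @reads_per_replica[i] += 1
    @replicas[i][key]
  end
end
```

Because every write touches every copy, adding replicas increases read capacity only, which is why the arrangement pays off for read-heavy workloads.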
What is the key to a dependable system?
Redundancy is the key to a dependable system
How often should SaaS deployment be?
SaaS deployment should be so automated and straightforward that it can be done frequently, up to several times a day, while remaining a non-event.
We mentioned that passing :layout=>false to caches_action provides most of the benefit of action caching even when the page layout contains dynamic elements such as the logged-in user's name. Why doesn't the caches_page method also allow this option?
Since page caching is handled by the presentation tier, not the logic tier, a hit in the page cache means that Rails is bypassed entirely. The presentation tier has a copy of the whole page, but only the logic tier knows what part of the page came from the layout and what part came from rendering the action.
How do you alleviate a resource leak?
Software rejuvenation: the Apache web server runs a number of identical worker processes, and when a given worker process has "aged" enough, that process stops accepting requests and dies, to be replaced by a fresh worker - since only one worker (1/n of total capacity) is "rejuvenated" at a time, this process is sometimes called rolling reboot
When does a SaaS app change its behavior?
The moment a SaaS app is deployed, its behavior changes because it has actual users.
Which of the following are appropriate places to store the value of a simple Boolean feature flag and why: ( a ) a YAML file in the app's config directory, ( b ) a column in an existing database table, ( c ) a separate database table?
The point of a feature flag is to allow its value to be changed at runtime without modifying the app. Therefore ( a ) is a poor choice because a YAML file cannot be changed without touching the production servers while the app is running.
Which tier(s) of three-tier SaaS apps can be scaled just by adding more computers and why?
The presentation and logic tiers: since neither HTTP (Web) servers nor app servers maintain any of the state associated with user sessions, any computer in those tiers can in principle satisfy any user's request.
RottenPotatoes' target uptime is 99.9%. Yesterday there was a one hour outage. Which statement is true: - There isn't enough information to determine whether RottenPotatoes can meet its user-perceived uptime goal - If no live users actually tried to get to the site during the outage, uptime wasn't hurt - RottenPotatoes can still meet its uptime goal if there are no further outages this year - Because of the outage, RottenPotatoes has no hope of meeting its uptime goal this year
There isn't enough information to determine whether RottenPotatoes can meet its user-perceived uptime goal Explanation: Recall that Uptime is a measure of system reliability, expressed as the percentage of time a machine, typically a computer, has been working and available. Without knowing the time frame of the target uptime, there's no way for us to evaluate whether the percentage has been met.
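A quick calculation shows why the time frame matters. The function name is hypothetical, and the two windows (a 365-day year and a 30-day month) are assumptions chosen for illustration.

```ruby
# Downtime allowed by an uptime target over a window of the given length.
def downtime_budget_hours(target, window_hours)
  (1 - target) * window_hours
end

downtime_budget_hours(0.999, 24 * 365) # roughly 8.76 hours per year
downtime_budget_hours(0.999, 24 * 30)  # roughly 0.72 hours (about 43 minutes) per 30 days
```

A one-hour outage fits inside the yearly budget but exceeds the 30-day one, so the same outage can meet or miss a 99.9% target depending on the window over which it is measured.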
What is a feature flag?
a configuration variable whose value can be changed while the app is running to control which code paths in the app are executed. - each step is nondestructive: if something goes wrong at a given step, the app is still left in a working intermediate state. - use when you need to perform a complex upgrade that changes both the app code and the schema - The feature flag's value selectively enables certain code paths at runtime, and can be immediately turned off if a bug is observed after deployment
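A minimal sketch of a flag guarding two code paths during a schema transition. Everything here is hypothetical: the `FeatureFlags` class, the flag name, and the column names are illustrative, and the in-memory hash stands in for a store (such as a database row) whose value can change while the app runs.

```ruby
# Hypothetical in-memory flag store; unknown flags default to off.
class FeatureFlags
  def initialize
    @flags = Hash.new(false)
  end

  def enable(name)
    @flags[name] = true
  end

  def disable(name)
    @flags[name] = false
  end

  def enabled?(name)
    @flags[name]
  end
end

# Works with both schema versions: the flag selects which columns to read.
def display_name(user, flags)
  if flags.enabled?(:split_name_columns)
    "#{user[:first_name]} #{user[:last_name]}" # new schema
  else
    user[:name]                                # old schema
  end
end
```

Since the old code path stays in place, turning the flag off again restores the previous behavior without another deployment.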
What is key to making good decisions about scalability in your software architecture?
a high-level understanding of the hardware architecture
What is apdex score? (Application Performance Index)
an open standard that computes a simplified SLO as a number between 0 and 1 inclusive representing the fraction of satisfied users. - a simple SLO measure between 0.0 and 1.0 in which a site gets "full credit" for requests that complete within a site-specific latency threshold T, "half credit" for reque
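A small sketch of the computation, assuming the usual Apdex convention that requests finishing within T earn full credit, requests finishing within 4T earn half credit, and slower requests earn none. The function name and the sample latencies are illustrative.

```ruby
# Apdex sketch: satisfied <= T, tolerating <= 4T, otherwise frustrated.
def apdex(latencies, t)
  return nil if latencies.empty? # no requests, no score
  satisfied  = latencies.count { |l| l <= t }
  tolerating = latencies.count { |l| l > t && l <= 4 * t }
  (satisfied + tolerating / 2.0) / latencies.size
end

# With T = 0.5 s: two satisfied, one tolerating, one frustrated => 0.625
apdex([0.1, 0.3, 0.9, 2.5], 0.5)
```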
What does the logic tier (application server) do?
application server: whose job is to hide the low-level mechanics of these HTTP interactions from the app writer ex) Rack application server which ships with the Rails framework.
What is Scalability?
as the number of users increases, either gradually and permanently or as a one-time surge of popularity, can your app maintain its steady-state availability and responsiveness without increasing the operational cost per user? Chapter 3 noted that three-tier SaaS apps on cloud computing have excellent potential horizontal scalability, but good design alone doesn't guarantee that your app will scale (though poor design guarantees that it won't). Caching (Section 12.6) and avoiding abuse of the database (Section 12.7) can help.
What is an arithmetic overflow attack?
attack: might be to supply such an unexpectedly large number that when added to another number it will look small due to the wraparound nature of overflow with 32-bit arithmetic solution: checking input values or catching exceptions might prevent this attack.
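The wraparound can be reproduced in Ruby by masking to 32 bits. Ruby's own integers do not overflow, so the mask simulates fixed-width unsigned arithmetic; `add_u32` and `checked_add_u32` are hypothetical helpers written for this illustration.

```ruby
MASK32 = 0xFFFF_FFFF # largest 32-bit unsigned value

# Simulated 32-bit unsigned addition: the result silently wraps around.
def add_u32(a, b)
  (a + b) & MASK32
end

# Defensive version: check input values instead of letting the sum wrap.
def checked_add_u32(a, b)
  raise RangeError, "negative operand" if a.negative? || b.negative?
  sum = a + b
  raise RangeError, "overflow: #{a} + #{b}" if sum > MASK32
  sum
end

add_u32(4_294_967_295, 10) # wraps around to 9, which looks small
```

An attacker who controls one operand can use this to make a large total pass a check such as "sum is below the limit"; the checked version raises instead.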
What is a buffer overflow attack?
attack: the adversary sends too much data to a buffer to overwrite nearby memory with their own code hidden inside the data. solution: Checking the inputs to ensure that the user is not sending too much data can prevent such attacks
What is a data race attack?
attack: the program has non-deterministic behavior depending on the input.
What is Data integrity?
can the app prevent customer data from being tampered with, or at least detect when tampering has occurred or data may have been compromised?
What is a key to managing the growth of your app?
controlling the demands placed on the database, which is harder to scale horizontally
What is development-production parity and how do you achieve it?
defined: keeping the development and production environments as similar as possible, because differences between them cause code that works in development to fail in production. remedy: make deployment itself as agile as possible by relying heavily on automation (ex, how easily can you "roll back" to the previous version)
When do you want to use caches_action over caches_page?
for controller actions protected by before-filters, such as pages that require the user to be logged in and therefore require executing the controller filter - caches_action will still execute any filters but allow Rails to deliver a cached page without consulting the database or re-rendering views
When is Page-level caching not useful?
for pages whose content changes dynamically ex) the list of movies page (MoviesController#index action) changes when new movies are added or when the user filters the list by MPAA rating
Under-17 visitors to RottenPotatoes shouldn't see NC-17 movies in any listing. A controller filter exists that can determine if a user is under 17. What kinds of caching would be appropriate when implementing this: i. Page ii. Action iii. Fragment
ii. and iii. explanation: Page caching will bypass the controller filter, but action caching would allow us to avoid regenerating an 'under-17-specific' view of the listings page, and fragment caching could help if we miss in the action cache.
What is one way to ensure a smooth deployment?
include additional deployment tests that must run before a deploy is attempted, to test differences between the development and production environments and to stress the app by deliberately simulating user misbehaviors.
What is the idea behind caching?
information that hasn't changed since the last time it was requested can simply be regurgitated rather than recomputed
What is Privacy?
is important customer data accessible only to authorized parties, such as the data's owner and perhaps the app's administrators?
What are the two components in responsiveness?
latency: the initial delay to start receiving new content throughput: the time it takes for all the content to be delivered. - responsiveness is dominated by latency.
What is a resource leak?
long-running process eventually runs out of a resource, such as memory, because it cannot reclaim 100% of the unused resource due to either an application bug or the inherent design of a language or framework
Why do databases ultimately limit horizontal scaling?
not because you run out of space to store tables, but more likely because a single computer can no longer sustain the necessary number of queries per second while remaining responsive. fix: sharding and replication
Suppose Movie has many Moviegoers through Reviews. Which foreign-key index or indices would MOST help speed up the query: "fans = @movie.moviegoers" - moviegoers.review_id - reviews.movie_id - reviews.moviegoer_id - movies.review_id
reviews.movie_id Explanation: Because of the through-association, the query involves finding the review(s) whose movie_id matches this movie, and then for each of those, looking up the appropriate moviegoer_id. So we need an index on the movie_id field of reviews. We don't need any special index on moviegoers, since lookups by id are already indexed by default.
How is security different from other failures?
security is based on an intelligent adversary who is purposely exploiting unexpected events, such as buffer overflows.
What is principle of psychological acceptability?
states that the protection mechanism should not make the app harder to use than if there were no protection. That is, the user interface needs to be easy to use so that the security mechanisms are routinely followed.
What is the principle of fail-safe defaults?
states that unless a user or software component is given explicit access to an object, it should be denied access to the object. That is, the default should be denial of access. Proper use of strong parameters as described in Section 5.3 follows this principle.
What is service level objective (SLO)?
takes the form of a quantitative statement about the quantiles of the latency distribution over a time window of a given width ex) "95% of requests within any 5-minute window should have a latency below 100 ms." In statistical terms, the 95th percentile of the latency distribution must not exceed 100 ms.
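Checking such an SLO for one window of samples can be sketched as follows. The nearest-rank percentile used here is one of several common definitions, and the function names, the default 95% and 100 ms values, and the choice to treat an empty window as compliant are all assumptions made for illustration.

```ruby
# Nearest-rank percentile: the smallest sample such that at least pct
# percent of the samples are less than or equal to it.
def percentile(samples, pct)
  sorted = samples.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical SLO check for one time window of latencies in milliseconds.
def slo_met?(latencies_ms, pct: 95, limit_ms: 100)
  return true if latencies_ms.empty? # no requests, nothing violated
  percentile(latencies_ms, pct) <= limit_ms
end
```

Running `slo_met?` over each successive 5-minute window of samples would flag the windows in which the objective was missed.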