System design questions
Say the disadvantages of failover
- Fail-over adds more hardware and additional complexity. - There is a potential for loss of data if the active system fails before any newly written data can be replicated to the passive.
Say some questions to ask regarding features (system design)
- Size/capacity (of whatever we're making). - Queries per second. - Access strategy (for a cache).
General disadvantages of replication
- There is a potential for loss of data if the master fails before any newly written data can be replicated to other nodes. - Writes are replayed to the read replicas. If there are a lot of writes, the read replicas can get bogged down with replaying writes and can't do as many reads. - The more read slaves, the more you have to replicate, which leads to greater replication lag. - On some systems, writing to the master can spawn multiple threads to write in parallel, whereas read replicas only support writing sequentially with a single thread. - Replication adds more hardware and additional complexity.
IPv4 vs. IPv6
32-bit vs. 128-bit addresses. IPv6 also brings: - No need for NAT (Network Address Translation). - Auto-configuration. - No more private address collisions. - Better multicast routing. - Simpler header format. - Simplified, more efficient routing. - True quality of service (QoS), also called "flow labeling". - Built-in authentication and privacy support. - Flexible options and extensions. - Easier administration (stateless auto-configuration reduces the need for DHCP).
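The address-size difference can be checked with Python's standard `ipaddress` module. This is only a small illustration; the two sample addresses are taken from the documentation ranges and are not part of the original notes.

```python
import ipaddress

v4 = ipaddress.ip_address("192.0.2.1")    # sample IPv4 address (documentation range)
v6 = ipaddress.ip_address("2001:db8::1")  # sample IPv6 address (documentation range)

# max_prefixlen is the address width in bits for each IP version.
v4_bits = v4.max_prefixlen
v6_bits = v6.max_prefixlen

# Number of distinct addresses each version can represent.
v4_space = 2 ** v4_bits
v6_space = 2 ** v6_bits

print(f"IPv{v4.version}: {v4_bits} bits, {v4_space} addresses")
print(f"IPv{v6.version}: {v6_bits} bits, {v6_space} addresses")
```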
What's a CDN?
A content delivery network (CDN) is a globally distributed network of proxy servers, serving content from locations closer to the user. Generally, static files such as HTML/CSS/JS, photos, and videos are served from a CDN, although some CDNs such as Amazon's CloudFront support dynamic content. The site's DNS resolution will tell clients which server to contact. Serving content from CDNs can significantly improve performance in two ways: - Users receive content from data centers close to them. - Your servers do not have to serve requests that the CDN fulfills.
Say some additional benefits of Load Balancers
Additional benefits include: - SSL termination: decrypt incoming requests and encrypt server responses so backend servers do not have to perform these potentially expensive operations; this also removes the need to install X.509 certificates on each server. - Session persistence: issue cookies and route a specific client's requests to the same instance if the web apps do not keep track of sessions.
What do you call an external system that is used for authentication (like Facebook, Google...)?
An identity provider.
SQL vs NoSQL -> BASIC QUESTIONS
- Are joins required? => SQL. - Size of the DB: doesn't fit on one machine? => NoSQL. - Technology maturity: old? => SQL (rare). - Read-write pattern: lots of reads? => NoSQL.
What is Websocket?
WebSocket is a computer communications protocol, providing full-duplex communication channels over a single TCP connection. It sits at the same layer as HTTP (layer 7 of the OSI model) but is a distinct protocol.
In NoSQL DBs, what does BASE mean?
Basically Available, Soft state, Eventual consistency
What does BASE mean?
Basically Available, Soft state, Eventual consistency
In NoSQL DBs, what does Basically Available mean?
Basically Available: This constraint states that the system does guarantee the availability of the data as regards CAP Theorem; there will be a response to any request. But, that response could still be 'failure' to obtain the requested data or the data may be in an inconsistent or changing state, much like waiting for a check to clear in your bank account.
What types of NoSQL DBs are supported by DynamoDB?
Both key-values and documents.
Define master-master replication, and mention 3 disadvantages particular to this method of replication.
Both masters serve reads and writes and coordinate with each other on writes. If either master goes down, the system can continue to operate with both reads and writes. Disadvantages: - You'll need a load balancer or you'll need to make changes to your application logic to determine where to write. - Most master-master systems are either loosely consistent (violating ACID) or have increased write latency due to synchronization. (MR: This is basically CAP Theorem). - Conflict resolution comes more into play as more write nodes are added and as latency increases.
Disadvantage(s): CDN
CDN costs could be significant depending on traffic, although this should be weighed against the additional costs you would incur by not using a CDN. Content might be stale if it is updated before the TTL expires. CDNs require changing URLs for static content to point to the CDN.
What is denormalization?
Denormalization attempts to improve read performance at the expense of some write performance. Redundant copies of the data are written in multiple tables to avoid expensive joins. Some RDBMS such as PostgreSQL and Oracle support materialized views which handle the work of storing redundant information and keeping redundant copies consistent. Once data becomes distributed with techniques such as federation and sharding, managing joins across data centers further increases complexity. Denormalization might circumvent the need for such complex joins. In most systems, reads can heavily outnumber writes 100:1 or even 1000:1. A read resulting in a complex database join can be very expensive, spending a significant amount of time on disk operations. Disadvantage(s): denormalization - Data is duplicated. - Constraints can help redundant copies of information stay in sync, which increases complexity of the database design. - A denormalized database under heavy write load might perform worse than its normalized counterpart.
Say some disadvantages of LBs
Disadvantage(s) of a load balancer: - The load balancer can become a performance bottleneck if it does not have enough resources or if it is not configured properly. - Introducing a load balancer to help eliminate single points of failure results in increased complexity. - A single load balancer is a single point of failure; configuring multiple load balancers further increases complexity.
Say some advantages of MongoDB
Dynamic schema: This gives you flexibility to change your data schema without modifying any of your existing data. Scalability: MongoDB is horizontally scalable, which helps reduce the workload and scale your business with ease. Manageability: The database doesn't require a database administrator. Since it is fairly user-friendly in this way, it can be used by both developers and administrators. Speed: It's high-performing for simple queries. Flexibility: You can add new fields to MongoDB documents without affecting existing documents or application performance.
In NoSQL DBs, what does Eventual Consistency mean?
Eventual consistency: The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it should sooner or later, but the system will continue to receive input and is not checking the consistency of every transaction before it moves onto the next one.
What are the patterns of high availability?
Fail-over and replication. Fail-over can be active-passive or active-active. Replication can be master-slave or master-master.
Define DB Federation (or functional partitioning)
Federation (or functional partitioning) splits up databases by function. For example, instead of a single, monolithic database, you could have three databases: forums, users, and products, resulting in less read and write traffic to each database and therefore less replication lag. Smaller databases result in more data that can fit in memory, which in turn results in more cache hits due to improved cache locality. With no single central master serializing writes you can write in parallel, increasing throughput. Disadvantage(s) of federation: - Federation is not effective if your schema requires huge functions or tables. - You'll need to update your application logic to determine which database to read from and write to. - Joining data from two databases is more complex with a server link. - Federation adds more hardware and additional complexity.
Define active-active fail-over
In active-active, both servers are managing traffic, spreading the load between them. If the servers are public-facing, the DNS would need to know about the public IPs of both servers. If the servers are internal-facing, application logic would need to know about both servers. Active-active failover can also be referred to as master-master failover.
What enhancements does HTTP/2 bring over HTTP/1.1?
HTTP/2 allows the server to "push" content. It also adds multiplexing, header compression, and prioritization of requests.
Horizontal vs. vertical scalability
Horizontal: buy more machines. Vertical: improve the machines.
Why is it better to cache objects instead of database queries?
If you cache queries, when the object changes you'd need to find out which queries are affected by that change, which is not trivial.
Describe the differences in scalability between SQL and NoSQL DBs.
In most situations, SQL databases are vertically scalable, which means that you can increase the load a single server can handle by upgrading things like CPU, RAM or SSD. NoSQL databases, on the other hand, are horizontally scalable. This means that you handle more traffic by sharding, or adding more servers to your NoSQL database. NoSQL databases are the preferred choice for large or ever-changing data sets.
What is the CAP theorem?
It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: - Consistency (all nodes see the same data at the same time). - Availability (every request receives a response about whether it succeeded or failed). - Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures). In other words, in the presence of a network partition, one has to choose between consistency and availability.
Types of NoSQL DBs. Describe them.
Key-Wide-Docu-Graph. (kiwi-Grdo) - Key-value stores are the simplest. Every item in the database is stored as an attribute name (or "key") together with its value. Riak, Voldemort, and Redis are the most well-known in this category. - Wide-column stores store data together as columns instead of rows and are optimized for queries over large datasets. The most popular are Cassandra and HBase. - Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents. MongoDB is the most popular of these databases. - Graph databases are used to store information about networks, such as social connections. Examples are Neo4J and HyperGraphDB.
What are the four layers of the TCP/IP suite?
LITA: Link layer, Internet layer, Transport layer, Application layer.
Say some examples of design goals
- Latency: Is this problem very latency-sensitive? (In other words, is a request with high latency as bad as a failed request?) For example, search typeahead suggestions are useless if they take more than a second. - Consistency: Does this problem require tight consistency, or is it okay if things are eventually consistent? - Availability: Does this problem require 100% availability?
Define Layer 4 LB
Layer 4 load balancers look at info at the transport layer to decide how to distribute requests. Generally, this involves the source and destination IP addresses and ports in the header, but not the contents of the packet. Layer 4 load balancers forward network packets to and from the upstream server, performing Network Address Translation (NAT).
Define Layer 7 LB
Layer 7 load balancers look at the application layer to decide how to distribute requests. This can involve contents of the header, message, and cookies. A layer 7 load balancer terminates network traffic, reads the message, makes a load-balancing decision, then opens a connection to the selected server. For example, a layer 7 load balancer can direct video traffic to servers that host videos while directing more sensitive user billing traffic to security-hardened servers. At the cost of flexibility, layer 4 load balancing requires less time and computing resources than layer 7, although the performance impact can be minimal on modern commodity hardware.
In what ways can Load balancers distribute traffic?
Load balancers can route traffic based on various metrics, including: - Random. - Least loaded. - Session/cookies. - Round robin or weighted round robin. - Layer 4. - Layer 7.
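Three of the strategies listed above can be sketched in a few lines. This is a minimal, single-threaded illustration, not how any particular load balancer implements them; the class names, the server names, and the use of an "active connections" counter as the load metric are all assumptions made for the example.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class WeightedRoundRobin:
    """Rotation in which a server with weight w appears w times per cycle."""

    def __init__(self, weights):
        expanded = [server for server, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def pick(self):
        return next(self._cycle)


class LeastLoaded:
    """Pick the server with the fewest in-flight requests (assumed metric)."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when a request finishes so the counter reflects current load.
        self.active[server] -= 1


rr = RoundRobin(["a", "b", "c"])
rr_picks = [rr.pick() for _ in range(4)]

wrr = WeightedRoundRobin({"a": 2, "b": 1})
wrr_picks = [wrr.pick() for _ in range(3)]

ll = LeastLoaded(["a", "b"])
ll_picks = [ll.pick(), ll.pick()]
ll.release("a")
ll_picks.append(ll.pick())
```

A production balancer would also need health checks and thread-safe counters, which are left out here.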
Define load balancer
Load balancers distribute incoming client requests to computing resources such as application servers and databases. In each case, the load balancer returns the response from the computing resource to the appropriate client. Load balancers are effective at: - Preventing requests from going to unhealthy servers. - Preventing overloading resources. - Helping eliminate single points of failure. Load balancers can be implemented with hardware (expensive) or with software such as HAProxy.
Techniques to scale a relational database
Master-slave replication, master-master replication, federation (aka functional partitioning), sharding, denormalization, and SQL tuning (including DB partitioning).
Say some advantages of MySQL
Maturity: MySQL is an extremely established database, meaning that there's a huge community, extensive testing and quite a bit of stability. Compatibility: MySQL is available for all major platforms, including Linux, Windows, Mac, BSD and Solaris. It also has connectors to languages like Node.js, Ruby, C#, C++, Java, Perl, Python and PHP, meaning that it's not limited to SQL query language. Cost-effective: The database is open source and free. Replicable: The MySQL database can be replicated across multiple nodes, meaning that the workload can be reduced and the scalability and availability of the application can be increased. Sharding: While sharding cannot be done on most SQL databases, it can be done on MySQL servers. This is both cost-effective and good for business.
MySQL vs MongoDB. When to use?
MySQL is a strong choice for any business that will benefit from its pre-defined structure and set schemas. For example, applications that require multi-row transactions - like accounting systems or systems that monitor inventory - or that run on legacy systems will thrive with the MySQL structure. MongoDB, on the other hand, is a good choice for businesses that have rapid growth or databases with no clear schema definitions. More specifically, if you cannot define a schema for your database, if you find yourself denormalizing data schemas, or if your schema continues to change - as is often the case with mobile apps, real-time analytics, content management systems, etc.- MongoDB can be a strong choice for you.
Describe Optimistic vs. pessimistic locking
Optimistic concurrency control (OCC) is a concurrency control method applied to transactional systems such as relational database management systems and software transactional memory. OCC assumes that multiple transactions can frequently complete without interfering with each other. While running, transactions use data resources without acquiring locks on those resources. Before committing, each transaction verifies that no other transaction has modified the data it has read. If the check reveals conflicting modifications, the committing transaction rolls back and can be restarted. Pessimistic locking, by contrast, assumes conflicts are likely: a transaction acquires locks on the data before using it, and conflicting transactions wait until the locks are released.
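The commit-time check that OCC relies on is often implemented with a version number per record. The sketch below assumes that approach; `Record`, `read`, `commit`, and `VersionConflict` are hypothetical names for illustration, and a real database would perform the check-and-write atomically.

```python
class VersionConflict(Exception):
    """Raised when the record changed since the transaction read it."""


class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0  # bumped on every successful commit


def read(record):
    # No lock is taken; the caller just remembers which version it saw.
    return record.value, record.version


def commit(record, new_value, expected_version):
    # Validate at commit time: fail if someone else committed in between.
    if record.version != expected_version:
        raise VersionConflict(
            f"expected version {expected_version}, found {record.version}"
        )
    record.value = new_value
    record.version += 1


account = Record(100)

# Two transactions read the same version without locking.
value_a, version_a = read(account)
value_b, version_b = read(account)

commit(account, value_a - 30, version_a)  # first committer wins

try:
    commit(account, value_b - 50, version_b)
    second_committed = True
except VersionConflict:
    second_committed = False  # this transaction must re-read and retry
```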
Projects where SQL/NoSQL is ideal
Projects where SQL is ideal: - logically related, discrete data requirements which can be identified up-front - data integrity is essential - standards-based proven technology with good developer experience and support. Projects where NoSQL is ideal: - unrelated, indeterminate or evolving data requirements - simpler or looser project objectives, able to start coding immediately - speed and scalability are imperative.
Define Pull CDNs
Pull CDNs grab new content from your server when the first user requests the content. You leave the content on your server and rewrite URLs to point to the CDN. This results in a slower request until the content is cached on the CDN. A time-to-live (TTL) determines how long content is cached. Pull CDNs minimize storage space on the CDN, but can create redundant traffic if files expire and are pulled before they have actually changed. Sites with heavy traffic work well with pull CDNs, as traffic is spread out more evenly with only recently-requested content remaining on the CDN.
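The pull behaviour described above (fetch on first request, then serve from cache until the TTL runs out) can be sketched as follows. This is a simplified model of a single edge node; `PullCDNEdge`, the `fetch_from_origin` callback, and the injectable `clock` are assumptions made for the example.

```python
import time


class PullCDNEdge:
    """Caches origin content on first request and re-pulls after the TTL."""

    def __init__(self, fetch_from_origin, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # path -> (content, expires_at)

    def get(self, path):
        now = self._clock()
        entry = self._cache.get(path)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: the origin is not contacted
        # Miss or expired: this request is slower because it goes to the origin.
        content = self._fetch(path)
        self._cache[path] = (content, now + self._ttl)
        return content


origin_hits = []


def origin(path):
    origin_hits.append(path)
    return "body of " + path


fake_now = [0.0]  # controllable clock so the example does not need to sleep
edge = PullCDNEdge(origin, ttl_seconds=60, clock=lambda: fake_now[0])

first = edge.get("/style.css")   # pulled from the origin
second = edge.get("/style.css")  # served from the edge cache
hits_before_expiry = len(origin_hits)

fake_now[0] = 61.0
third = edge.get("/style.css")   # TTL expired, pulled again even if unchanged
hits_after_expiry = len(origin_hits)
```

The last request shows the redundant traffic mentioned above: the file is re-pulled after expiry even though it has not changed.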
Define Push CDNs
Push CDNs receive new content whenever changes occur on your server. You take full responsibility for providing content, uploading directly to the CDN and rewriting URLs to point to the CDN. You can configure when content expires and when it is updated. Content is uploaded only when it is new or changed, minimizing traffic, but maximizing storage. Sites with a small amount of traffic or sites with content that isn't often updated work well with push CDNs. Content is placed on the CDNs once, instead of being re-pulled at regular intervals.
Types of CDN?
Push and Pull
SQL vs NoSQL
Reasons for SQL: - Structured data. - Strict schema. - Relational data. - Need for complex joins. - Transactions. - Clear patterns for scaling. - More established: developers, community, code, tools, etc. - Lookups by index are very fast. - Data fits in a single machine. - Not write-heavy. Reasons for NoSQL: - Semi-structured data. - Dynamic or flexible schema. - Non-relational data. - No need for complex joins. - Store many TB (or PB) of data. - Very data-intensive workload. - Very high throughput for IOPS. Sample data well-suited for NoSQL: - Rapid ingest of clickstream and log data. - Leaderboard or scoring data. - Temporary data, such as a shopping cart. - Frequently accessed ('hot') tables. - Metadata/lookup tables.
Describe the differences in structure between SQL and NoSQL DBs, and when are SQL DBs better.
SQL databases are table-based, while NoSQL databases are either document-based, key-value pairs, graph databases or wide-column stores. This makes relational SQL databases a better option for applications that require multi-row transactions - such as an accounting system - or for legacy systems that were built for a relational structure.
How to speed up a noSQL DB?
Sharding (it reduces the index size per shard). Also denormalization (replacing references from one table to another with copies of the referenced entries, so that no join is needed). Also indexing, and using the statistical tools provided by the DB (like checking how many documents are scanned for a certain query). Also a memory cache. There could also be vertical scaling, like upgrading the memory or the processor of the DB server.
What is DB sharding? Say advantages and disadvantages.
Sharding distributes data across different databases such that each database manages only a subset of the data. Taking a users database as an example, as the number of users increases, more shards are added to the cluster. Similar to the advantages of federation, sharding results in less read and write traffic, less replication, and more cache hits. Index size is also reduced, which generally improves performance with faster queries. If one shard goes down, the other shards are still operational, although you'll want to add some form of replication to avoid data loss. Like federation, there is no single central master serializing writes, allowing you to write in parallel with increased throughput. Common ways to shard a table of users are by the user's last name initial or by the user's geographic location. Disadvantage(s) of sharding: - You'll need to update your application logic to work with shards, which could result in complex SQL queries. - Data distribution can become lopsided in a shard. For example, a set of power users on a shard could result in increased load to that shard compared to others. - Rebalancing adds additional complexity. A sharding function based on consistent hashing can reduce the amount of transferred data. - Joining data from multiple shards is more complex. - Sharding adds more hardware and additional complexity.
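Two simple shard-selection functions are sketched below: one by last-name initial, as mentioned above, and one by hashing the user ID. Both function names, the range boundaries, and the choice of SHA-256 are assumptions for illustration only.

```python
import bisect
import hashlib


def shard_by_initial(last_name, boundaries=("H", "P")):
    """Range sharding: A-G -> shard 0, H-O -> shard 1, P-Z -> shard 2."""
    initial = last_name[0].upper()
    return bisect.bisect_right(boundaries, initial)


def shard_by_hash(user_id, num_shards):
    """Hash sharding: spreads keys evenly regardless of their spelling."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


range_shards = [shard_by_initial(n) for n in ("Garcia", "Hopper", "Turing")]
hash_shard = shard_by_hash("user-42", num_shards=4)
```

Range sharding keeps related names together but can become lopsided (far more surnames start with S than with X). Plain modulo hashing balances better, but changing `num_shards` remaps most keys, which is the rebalancing cost that consistent hashing is meant to reduce.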
How to scale a noSQL DB
Sharding.
In NoSQL DBs, what does Soft State mean?
Soft state: The state of the system could change over time, so even during times without input there may be changes going on due to 'eventual consistency,' thus the state of the system is always 'soft.'
Define reverse proxy
Source: Wikipedia. A reverse proxy is a web server that centralizes internal services and provides unified interfaces to the public. Requests from clients are forwarded to a server that can fulfill them before the reverse proxy returns the server's response to the client. Additional benefits include: - Increased security: hide information about backend servers, blacklist IPs, limit the number of connections per client. - Increased scalability and flexibility: clients only see the reverse proxy's IP, allowing you to scale servers or change their configuration. - SSL termination: decrypt incoming requests and encrypt server responses so backend servers do not have to perform these potentially expensive operations; this also removes the need to install X.509 certificates on each server. - Compression: compress server responses. - Caching: return the response for cached requests. - Static content: serve static content directly (HTML/CSS/JS, photos, videos, etc.).
What are benchmarking and profiling?
Strategies for enhancing SQL DBs. Benchmark - Simulate high-load situations with tools such as ab. Profile - Enable tools such as the slow query log to help track performance issues.
What is TCP?
TCP is a connection-oriented protocol that addresses numerous reliability issues in providing a reliable byte stream: - Data arrives in order. - Data has minimal error (i.e., correctness). - Duplicate data is discarded. - Lost or discarded packets are resent. - Traffic congestion control is included.
What does the Internet Protocol (IP) do?
The Internet Protocol performs two basic functions: - Host addressing and identification: This is accomplished with a hierarchical IP addressing system. - Packet routing: This is the basic task of sending packets of data (datagrams) from source to destination by forwarding them to the next network router closer to the final destination.
What is MD5
The MD5 algorithm is a widely used hash function producing a 128-bit hash value.
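The 128-bit output size can be confirmed with Python's standard `hashlib` module; the input string here is arbitrary.

```python
import hashlib

digest = hashlib.md5(b"system design").digest()

digest_bits = len(digest) * 8  # 16 bytes -> 128 bits
hex_length = len(hashlib.md5(b"system design").hexdigest())  # 32 hex characters

# Hashing the same input always gives the same value.
same_again = hashlib.md5(b"system design").digest() == digest
```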
Describe the TCP/IP application layer
The application layer is the scope within which applications create user data and communicate this data to other applications on another or the same host. This is the layer in which all higher level protocols, such as SMTP, FTP, SSH, HTTP, operate. Processes are addressed via ports which essentially represent services.
Describe the TCP/IP link layer
The link layer defines the networking methods within the scope of the local network link on which hosts communicate without intervening routers. This layer includes the protocols used to describe the local network topology and the interfaces needed to effect transmission of Internet layer datagrams to next-neighbor hosts.
In SQL DBs, what does ACID mean?
The four characteristics of database transactions: Atomicity, Consistency, Isolation and Durability. Atomicity: Each transaction either happens completely or doesn't happen at all. Consistency: Each transaction takes the database from one valid state to another. Isolation: Transactions should be independent of each other; the effect of several concurrent transactions should be the same as if they had run one after another. Durability: The result of a committed transaction should be permanent.
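Atomicity is the easiest of the four to demonstrate. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table, the `transfer` helper, and the `CHECK` constraint are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.execute("INSERT INTO accounts VALUES ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Fails the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


overdraft_ok = transfer(conn, "alice", "bob", 150)  # debit fails, credit is undone
after_failed = balances(conn)

valid_ok = transfer(conn, "alice", "bob", 40)
after_valid = balances(conn)
```

In the failed transfer the credit to `bob` has already been executed when the debit fails, yet the rollback leaves both balances untouched.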
Describe the TCP/IP internet layer
The internet layer exchanges datagrams across network boundaries. It provides a uniform networking interface that hides the actual topology (layout) of the underlying network connections. This layer defines the addressing and routing structures used for the TCP/IP protocol suite. The primary protocol in this scope is the Internet Protocol, which defines IP addresses. Its function in routing is to transport datagrams to the next IP router that has the connectivity to a network closer to the final data destination.
Define master-slave replication, and mention one particular disadvantage of this method of replication.
The master serves reads and writes, replicating writes to one or more slaves, which serve only reads. Slaves can also replicate to additional slaves in a tree-like fashion. If the master goes offline, the system can continue to operate in read-only mode until a slave is promoted to a master or a new master is provisioned. Disadvantage: Additional logic is needed to promote a slave to a master.
Describe the TCP/IP transport layer
The transport layer performs host-to-host communications on either the same or different hosts and on either the local network or remote networks separated by routers. It provides a channel for the communication needs of applications. Protocols: TCP and UDP.
Describe Write back cache
This is a caching system where the write is done directly to the caching layer and is confirmed as soon as the write to the cache completes. The cache then asynchronously syncs this write to the DB. This leads to very low write latency and high write throughput. But, as is the case with any non-persistent / in-memory write, we run the risk of losing the data in case the caching layer dies. We can improve our odds by having more than one replica acknowledge the write (so that we don't lose data if just one of the replicas dies).
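A minimal write-back sketch follows. A plain dict stands in for the database, and an explicit `flush()` call stands in for the asynchronous sync; the class and method names are assumptions, not a real library API.

```python
class WriteBackCache:
    def __init__(self, db):
        self.db = db          # dict standing in for the database
        self.cache = {}
        self.dirty = set()    # keys written to the cache but not yet to the DB

    def write(self, key, value):
        # Acknowledged as soon as the cache holds the value.
        self.cache[key] = value
        self.dirty.add(key)

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db[key]
        self.cache[key] = value
        return value

    def flush(self):
        # In a real system this would run in the background.
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()


db = {}
cache = WriteBackCache(db)

cache.write("user:1", "Ada")
missing_from_db_before_flush = "user:1" not in db  # data lives only in the cache

cache.flush()
value_in_db_after_flush = db["user:1"]
```

If the cache process died between `write` and `flush`, the value would be lost, which is the risk described above.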
Describe Write around cache
This is a caching system where the write goes directly to the DB. The cache reads the information from the DB in case of a miss. While this ensures lower write load on the cache and faster writes, it can lead to higher read latency for applications that write and then quickly re-read the same information.
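A minimal write-around sketch, with a dict standing in for the database. The class name and the `misses` counter are assumptions for illustration; dropping any cached copy on write is also an assumption, added so that later reads do not return stale data.

```python
class WriteAroundCache:
    def __init__(self, db):
        self.db = db        # dict standing in for the database
        self.cache = {}
        self.misses = 0

    def write(self, key, value):
        # The write bypasses the cache entirely.
        self.db[key] = value
        # Assumption: invalidate any old cached copy to avoid stale reads.
        self.cache.pop(key, None)

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        # Miss: load from the DB and keep a copy for next time.
        self.misses += 1
        value = self.db[key]
        self.cache[key] = value
        return value


db = {}
cache = WriteAroundCache(db)

cache.write("user:1", "Ada")
first_read = cache.read("user:1")   # miss: a fresh write is never in the cache
misses_after_first_read = cache.misses
second_read = cache.read("user:1")  # hit
misses_after_second_read = cache.misses
```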
Describe Write through cache
This is a caching system where writes go through the cache and write is confirmed as success only if writes to DB and the cache BOTH succeed. This is really useful for applications which write and re-read the information quickly. However, write latency will be higher in this case as there are writes to 2 separate systems.
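A minimal write-through sketch, again with a dict standing in for the database. The class name and the `misses` counter are assumptions for illustration; writing to the DB first is one possible ordering, chosen so that a failed DB write never leaves an unconfirmed value in the cache.

```python
class WriteThroughCache:
    def __init__(self, db):
        self.db = db        # dict standing in for the database
        self.cache = {}
        self.misses = 0

    def write(self, key, value):
        # Both writes must succeed before the write is confirmed.
        # If the DB write raises, the cache is left untouched.
        self.db[key] = value
        self.cache[key] = value
        return True

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        self.misses += 1
        value = self.db[key]
        self.cache[key] = value
        return value


db = {}
cache = WriteThroughCache(db)

confirmed = cache.write("user:1", "Ada")
read_back = cache.read("user:1")  # hit: the write already populated the cache
misses = cache.misses
```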
Techniques to speed up a relational DB
Usually it's tuning in the style of vertical scaling: adding indexes, DB tuning (like DB partitioning), removing logs, and upgrading memory, processor power, or bandwidth. See https://www.catswhocode.com/blog/10-sql-tips-to-speed-up-your-database . Also a memory cache (like Amazon ElastiCache or memcached); MySQL + Memcached is a common combination. A CDN helps as well.
Define the 3 consistency patterns
- Weak consistency: After a write, reads may or may not see it. A best effort approach is taken. This approach is seen in systems such as memcached. Weak consistency works well in real time use cases such as VoIP, video chat, and realtime multiplayer games. For example, if you are on a phone call and lose reception for a few seconds, when you regain connection you do not hear what was spoken during connection loss. - Eventual consistency: After a write, reads will eventually see it (typically within milliseconds). Data is replicated asynchronously. This approach is seen in systems such as DNS and email. Eventual consistency works well in highly available systems. - Strong consistency: After a write, reads will see it. Data is replicated synchronously. This approach is seen in file systems and RDBMSes. Strong consistency works well in systems that need transactions.
Define active-passive fail-over
With active-passive fail-over, heartbeats are sent between the active and the passive server on standby. If the heartbeat is interrupted, the passive server takes over the active's IP address and resumes service. The length of downtime is determined by whether the passive server is already running in 'hot' standby or whether it needs to start up from 'cold' standby. Only the active server handles traffic. Active-passive failover can also be referred to as master-slave failover.
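The heartbeat timeout logic on the passive side can be sketched as below. This models only the decision to take over; claiming the active server's IP address is left as a comment. `PassiveStandby`, the timeout value, and the injectable `clock` are assumptions for the example.

```python
import time


class PassiveStandby:
    """Promotes itself when heartbeats from the active server stop arriving."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_heartbeat = clock()
        self.is_active = False

    def on_heartbeat(self):
        # Called whenever a heartbeat from the active server is received.
        self._last_heartbeat = self._clock()

    def check(self):
        # Called periodically; returns True once this node has taken over.
        silent_for = self._clock() - self._last_heartbeat
        if not self.is_active and silent_for > self._timeout:
            # A real system would take over the active server's IP here.
            self.is_active = True
        return self.is_active


fake_now = [0.0]  # controllable clock so the example does not need to sleep
standby = PassiveStandby(timeout_seconds=3.0, clock=lambda: fake_now[0])

fake_now[0] = 2.0
standby.on_heartbeat()

fake_now[0] = 4.0
still_passive = not standby.check()  # only 2s of silence

fake_now[0] = 6.0
took_over = standby.check()          # 4s of silence exceeds the 3s timeout
```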
What are the strategies for caching?
Write-through, write-around, write-back. https://www.interviewbit.com/problems/design-cache/
What are the steps for solving a systems design question? (HiredInTech)
http://old.hiredintech.com/system-design/the-system-design-process/ - Scope the problem: Don't make assumptions; ask questions; understand the constraints and use cases. - Sketch up an abstract design that illustrates the basic components of the system and the relationships between them. - Think about the bottlenecks these components face when the system scales. - Address these bottlenecks by using the fundamental principles of scalable system design.
Design Pastebin.com (or Bit.ly)
https://github.com/donnemartin/system-design-primer/blob/master/solutions/system_design/pastebin/README.md
Specify Twitter's System design
https://github.com/donnemartin/system-design-primer/blob/master/solutions/system_design/twitter/README.md
Steps to approach a problem (InterviewBit)
https://www.interviewbit.com/courses/system-design/topics/storage-scalability/ - Feature expectations (first 2 mins): There is no wrong design. There are just good and bad designs, and the same solution can be a good design for one use case and a bad design for another. It is therefore extremely important to get a very clear understanding of the requirements for the question. - Estimations (2-5 mins): The next step is usually to estimate the scale required for the system. The goal of this step is to understand the level of sharding required (if any) and to zero in on the design goals for the system. For example, if the total data required for the system fits on a single machine, we might not need sharding and the complications that go with a distributed system design. Or, if the most frequently used data fits on a single machine, caching could be done on a single machine. - Design goals (1 min): Figure out the most important goals for the system. Some systems are latency-sensitive, in which case a solution that does not account for latency might lead to a bad design. - Skeleton of the design (4-5 mins): 30-40 mins is not enough time to discuss every single component in detail. A good strategy is to discuss the design at a very high level with the interviewer and then dive deep into the components the interviewer asks about. - Deep dive (20-30 mins): This is an extension of the previous section.
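The estimation step is mostly back-of-the-envelope arithmetic. The numbers below are invented purely to show the shape of the calculation (they are not from InterviewBit): one million writes per day, 1 KB per object, a 100:1 read-to-write ratio, three years of retention, and a 4 TB disk per machine.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# All inputs are assumptions chosen for illustration.
writes_per_day = 1_000_000
avg_object_bytes = 1_000
read_write_ratio = 100
retention_days = 3 * 365
machine_disk_tb = 4

writes_per_second = writes_per_day / SECONDS_PER_DAY
reads_per_second = writes_per_second * read_write_ratio

total_bytes = writes_per_day * avg_object_bytes * retention_days
total_tb = total_bytes / 1e12

# If everything fits on one machine, sharding may not be needed at all.
fits_on_one_machine = total_tb <= machine_disk_tb

print(f"{writes_per_second:.1f} writes/s, {reads_per_second:.0f} reads/s")
print(f"{total_tb:.3f} TB over {retention_days} days")
print("single machine is enough" if fits_on_one_machine else "needs sharding")
```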
Describe consistent hashing
https://www.toptal.com/big-data/consistent-hashing
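In short, consistent hashing places both nodes and keys on a ring of hash values, and each key belongs to the first node found clockwise from it, so adding or removing a node only moves the keys adjacent to that node. The sketch below is one common way to implement the idea and is not taken from the linked article; `HashRing`, the use of SHA-256, and the 100 virtual nodes per server are assumptions for illustration.

```python
import bisect
import hashlib


def _hash(key):
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes  # virtual nodes per server smooth the distribution
        self._ring = []        # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["a", "b", "c"])
keys = [f"user-{i}" for i in range(1000)]

before = {k: ring.get(k) for k in keys}
ring.remove("c")
after = {k: ring.get(k) for k in keys}

# Only keys that lived on the removed node should have moved.
moved = [k for k in keys if before[k] != after[k]]
moved_only_from_c = all(before[k] == "c" for k in moved)
```

With plain `hash(key) % num_nodes`, removing one of three nodes would remap roughly two thirds of all keys instead.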