CS 101 & Scalability


Scaling Rules

1. We drop constant multiples. 2. We keep only the largest term in an expression.
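As a minimal sketch, the two rules applied to a hypothetical function (the function itself is just for illustration):

```python
def process(items):
    # First pass: O(n)
    total = 0
    for x in items:
        total += x
    # Nested pass: O(n^2)
    pairs = 0
    for a in items:
        for b in items:
            pairs += 1
    return total, pairs

# The runtime is O(n + n^2).
# Rule 1: drop constant multiples, so O(2n) is just O(n).
# Rule 2: keep only the largest term, so O(n + n^2) is just O(n^2).
```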

What is a load balancer?

Acts as a "traffic cop" sitting in front of some of your servers and routing client requests across all servers capable of fulfilling those requests, in a way that maximizes speed and capacity utilization.
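One simple routing strategy is round-robin: send each request to the next server in the pool. A minimal sketch (the class name and server names are made up for illustration):

```python
import itertools

class RoundRobinBalancer:
    """Toy round-robin load balancer: each incoming request
    is routed to the next server in the pool, in turn."""

    def __init__(self, servers):
        self._pool = itertools.cycle(servers)

    def route(self, request):
        server = next(self._pool)
        return server, request

balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
```

Real load balancers also track server health and current load rather than rotating blindly.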

What is caching?

An in-memory cache is a simple key-value pairing and typically sits between your application layer and your data store. When an application requests a piece of information, it first tries the cache. If the cache doesn't contain the key, it'll look up the data in the data store.
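The read path described above can be sketched in a few lines (the cache-aside pattern; the function and key names are assumptions):

```python
def get_user(user_id, cache, datastore):
    """Cache-aside read: try the in-memory cache first; on a miss,
    read from the data store and populate the cache."""
    if user_id in cache:
        return cache[user_id]       # cache hit
    value = datastore[user_id]      # cache miss: go to the data store
    cache[user_id] = value          # fill the cache for next time
    return value

cache = {}
datastore = {42: "Ada"}
get_user(42, cache, datastore)  # miss: reads the data store, fills the cache
get_user(42, cache, datastore)  # hit: served straight from the cache
```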

How would you design data structures for a large social network like Facebook or LinkedIn? How would you design an algorithm to show the shortest path between two people?

Construct a graph. To find a path between two people, you could use Dijkstra's algorithm or a bidirectional BFS: start a BFS from each person and see where the searches collide (that's the path). However, in this case we wouldn't be able to just mark a node as searched the way we would in a usual BFS, since multiple searches could be going on at the same time. Instead, we could mimic the marking of nodes for each search with a hash table, looking up a node's ID to determine whether it's been visited.
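A minimal sketch of the bidirectional idea, assuming the graph is a plain adjacency dict (each search keeps its own visited hash table, as described above):

```python
from collections import deque

def bidirectional_bfs(graph, start, goal):
    """Shortest-path length via BFS from both ends.
    graph is {node: [neighbors]}; returns -1 if unreachable."""
    if start == goal:
        return 0
    # One visited map per search: node -> distance from that end.
    dist_s, dist_g = {start: 0}, {goal: 0}
    q_s, q_g = deque([start]), deque([goal])

    def expand(queue, dist_this, dist_other):
        # Expand one full frontier level; if the searches collide,
        # return the shortest total length seen this level.
        best = None
        for _ in range(len(queue)):
            node = queue.popleft()
            for nb in graph.get(node, []):
                if nb in dist_other:  # collision with the other search
                    total = dist_this[node] + 1 + dist_other[nb]
                    best = total if best is None else min(best, total)
                if nb not in dist_this:
                    dist_this[nb] = dist_this[node] + 1
                    queue.append(nb)
        return best

    while q_s and q_g:
        # Expand the smaller frontier first.
        if len(q_s) <= len(q_g):
            hit = expand(q_s, dist_s, dist_g)
        else:
            hit = expand(q_g, dist_g, dist_s)
        if hit is not None:
            return hit
    return -1
```

Reconstructing the actual path would additionally require storing each node's parent in both searches.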

Imagine you're building a service that will be called by up to 1000 client applications to get stock price info (open, close, high, low). You already have the data and can store it in any format you wish. How would you design the client-facing service that provides the info to client applications?

A couple of options: 1. Keep the data in simple text files and let clients download it through an FTP server. The files are easy to view and back up, but querying and updating them would require awkward parsing. 2. Use a SQL database and let the clients plug directly into it. This gives clients an easy way to query and back up the data, and it's easy to integrate into existing applications, since SQL support is a standard feature of software development environments. The problem is that SQL is heavyweight for serving a few bits of info. It's also difficult for humans to read directly, so we'd need an additional layer to view and maintain the data, which increases implementation cost. Even so, SQL is probably better than plain text files.
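To make the SQL option concrete, a toy sketch using Python's built-in sqlite3 (the table name, schema, and sample row are all assumptions for illustration):

```python
import sqlite3

# In-memory database standing in for the real SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stock_prices (
        symbol TEXT,
        trade_date TEXT,
        open REAL, close REAL, high REAL, low REAL
    )
""")
conn.execute(
    "INSERT INTO stock_prices VALUES (?, ?, ?, ?, ?, ?)",
    ("ACME", "2015-01-02", 10.0, 10.5, 10.9, 9.8),
)

# What a client query might look like: one day's prices for one symbol.
row = conn.execute(
    "SELECT open, close, high, low FROM stock_prices "
    "WHERE symbol = ? AND trade_date = ?",
    ("ACME", "2015-01-02"),
).fetchone()
```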

A large eCommerce company wants to list the best-selling products, both overall and by category. For example, one product might be the #1,056th best-selling product overall, but the #13 best-selling product under "Sports Equipment" and the #24 best-selling product under "Safety." Describe how you would design this system.

First, let's figure out exactly what we're building:
a. What does sales rank mean? Total sales over time? Sales last month? Last week? Let's assume it's the last week.
b. We'll assume each product can be in multiple categories and that there's no concept of subcategories.
c. How often is the data updated? Let's say every hour.
d. And the rank reflects just the last week of sales, right?
e. We'll assume categorizations are based on the origin of the transaction, not the price or date.
f. In terms of precision: it's important for the most popular items, but some error is probably okay for the less popular ones.
These are all things that you should bring up! Also worth noting: analytics are expensive, and joins are expensive. We can take advantage of the fact that only the most popular items need to be up to date; the less popular ones can be updated just once per day.

What is horizontal scaling?

Horizontal scaling means increasing the number of nodes. For example, you might add additional servers, thus decreasing the load on any one individual server.

What is latency?

How long it takes data to get from the sender to the receiver.

What is Database Denormalization? Why is it a solution for SQL scaling issues? What's another, possibly better solution?

Joins in a relational DB like SQL can get really slow as the system gets bigger. For this reason, you generally avoid them. Database denormalization means adding redundant info into a database to speed up reads. Alternatively, you can use a NoSQL database: NoSQL databases don't support joins and are designed to scale better.

If you were designing a web crawler, how would you avoid getting into infinite loops?

Preventing infinite loops means detecting cycles. One way to do this is to create a hash table where we set hash[v] to true after we visit page v. We can crawl the web using BFS: each time we visit a page, we gather all its links and insert them at the end of a queue. If we've already visited a page, we ignore it. But what does it mean to actually visit a page? Is a page defined by its URL? That could be tripped up by differing query parameters. By its content? Then pages with randomly generated content on each visit (like, say, a news feed) can trip you up as well. The best thing to do would probably be to create a signature of the page based on specific subsections of the page and its URL. Then we can query the database for that signature and see if anything like it has been crawled recently. If so, we mark the page as low priority; if not, we crawl the page and insert its links into the DB.
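A minimal sketch of the BFS crawl with cycle detection. Here `get_links` and `get_content` are assumed helper functions that fetch a page's outgoing links and its text, and the signature is just the URL plus a content snippet; a real crawler would pick the subsections and priority logic much more carefully:

```python
from collections import deque
import hashlib

def crawl(start_url, get_links, get_content):
    """BFS crawl that avoids infinite loops by hashing a
    signature of each page (URL + start of its content)."""
    seen = set()              # signatures of pages already crawled
    queue = deque([start_url])
    crawled = []
    while queue:
        url = queue.popleft()
        sig = hashlib.sha256(
            (url + get_content(url)[:200]).encode()
        ).hexdigest()
        if sig in seen:
            continue          # cycle detected: we've seen this page
        seen.add(sig)
        crawled.append(url)
        queue.extend(get_links(url))  # enqueue outgoing links
    return crawled
```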

You have 10 billion URLs. How do you detect the duplicate documents? In this case, duplicate means the URLs are identical.

That's a ton of data (around 4,000 GB, assuming roughly 100 characters at 4 bytes each per URL), far too much to hold in memory. If we magically could, the answer would be a hash table that maps a URL to true once it has been visited. If we don't have that much memory: Disk storage: split the list of URLs into 4,000 chunks of 1 GB each, dividing URLs based on their hash value so that every copy of a given URL lands in the same chunk. Then load each file into memory, create a hash table, and look for duplicates. If we had multiple machines, we could process the chunks simultaneously. But it's really hard to coordinate 4,000 machines and count on all of them operating perfectly.
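The chunking idea at toy scale, as a sketch (in the real problem each chunk would be a 1 GB file on disk and `num_chunks` would be ~4,000):

```python
import hashlib

def find_duplicates(urls, num_chunks=4):
    """Partition URLs into chunks by hash value, then scan each
    chunk with its own in-memory hash table."""
    chunks = [[] for _ in range(num_chunks)]
    for url in urls:
        # The same URL always hashes to the same chunk, so all
        # copies of a duplicate land in the same "file".
        idx = int(hashlib.sha256(url.encode()).hexdigest(), 16) % num_chunks
        chunks[idx].append(url)

    duplicates = set()
    for chunk in chunks:      # each chunk fits in memory on its own
        seen = set()
        for url in chunk:
            if url in seen:
                duplicates.add(url)
            seen.add(url)
    return duplicates
```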

What is Bandwith?

The maximum amount of data that can be transferred in a unit of time.

What is vertical scaling?

Vertical scaling means increasing the resources of a specific node. For example, you might add additional memory to a server to improve its ability to handle load changes. It's generally easier than horizontal scaling, although it's limited: you can only add so much memory or disk space.

Linked lists are best suited

for situations where the size of the structure and the data in the structure are constantly changing

What is Sharding? What are the 3 ways to do it?

Sharding means splitting the data across multiple machines while ensuring you have a way of figuring out which data is on which machine. 1. Vertical partitioning: partitioning by feature. For example, in a social network you could have one partition for tables relating to profiles, one for messages, etc. 2. Key- or hash-based partitioning: uses some part of the data (like an ID) to partition it. 3. Directory-based partitioning: maintain a lookup table recording where each piece of data can be found.
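Key/hash-based partitioning (option 2) is the easiest to sketch; the function name is made up for illustration:

```python
import hashlib

def shard_for(key, num_shards):
    """Key/hash-based partitioning: hash the key (e.g. a user ID)
    and take it modulo the number of machines. A stable hash (not
    Python's built-in hash(), which varies between processes) keeps
    the key -> machine mapping consistent everywhere."""
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

# Every lookup for the same key goes to the same machine.
shard_for(12345, 10)
```

Note the drawback: with plain modulo, changing `num_shards` remaps almost every key, which is one reason directory-based partitioning (or consistent hashing) exists.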

