IDS 200 Facebook Module Exam Review
1 ) Hadoop 2 ) Map-Reduce
1 ) a large-scale number-crunching system 2 ) an algorithm that accepts a large task, breaks it into smaller tasks, completes them, and returns the combined results eg. searching a big text document for matching words: break the document into many segments, send each segment to a separate server for processing, then total up the match counts >> much faster b/c of parallelism, although the total work remains unchanged
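The word-search example above can be sketched in a few lines of Python. This is only an illustration of the map-then-reduce idea, not Hadoop's API: threads stand in for the separate servers, and the names split_into_segments, map_count, and map_reduce_count are made up for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_segments(words, n_segments):
    """Break the word list into roughly equal segments."""
    size = max(1, -(-len(words) // n_segments))  # ceiling division
    return [words[i:i + size] for i in range(0, len(words), size)]

def map_count(segment, target):
    """Map step: count matches inside one segment."""
    return sum(1 for word in segment if word == target)

def map_reduce_count(text, target, n_workers=4):
    words = text.lower().split()
    segments = split_into_segments(words, n_workers)
    # Each worker thread plays the role of one server (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(lambda seg: map_count(seg, target), segments))
    # Reduce step: total up the per-segment match counts.
    return sum(partial_counts)

text = "the cat saw the dog and the dog saw the cat " * 1000
total = map_reduce_count(text, "dog")
print(total)
```

The total amount of counting is the same as in a single pass; only the wall-clock time would shrink when the segments really run on separate machines.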
Types of Storage 1 ) RAM (Random Access Memory) 2 ) DASD (Direct Access Storage Device) 3 ) SAM (Sequential Access Memory)
1 ) accessing any memory location requires about the same time (equally fast wherever data is stored) v. Hard Drives: > HD holds roughly 100x the storage for the same price as RAM (1 TB HD ~$50 & 8 GB RAM ~$40) >> more RAM in one device, more $$$ (1 TB RAM = $24k) > HD roughly 100x slower than the same amount of RAM >> too slow for many real-time apps >> server farms already gigantic; would require 100x as many to run in parallel 2 ) think of vinyl records, where you can pick a track and play it sequentially, but picking up the needle & moving it imposes a delay 3 ) data is accessible only in sequence, like spooling through a VHS tape
Matching News Feed to User Preferences
1 ) asking users about their interests wouldn't work well 2 ) FB tries to observe what interests users (particularly friends, media types, topics) and preferentially show them that kind of content 3 ) implemented via post ranking algorithm
1 ) Striping 2 ) Mirroring 3 ) Parity
1 ) breaks files into segments, stores segments on different disks >> increases speed for reads & writes 2 ) makes copies of files on different disks >> doesn't inherently add time but it will add to the system load, so the RAID system can be much slower if busy 3 ) includes error-checking for any file >> writes substantially slower when combined with mirroring
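A minimal sketch of striping plus parity, assuming XOR parity (one common scheme; the notes don't say which one the course has in mind). The "disks" here are just byte strings, and stripe and xor_blocks are hypothetical helper names.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def stripe(data, n_disks):
    """Striping: split data into n_disks equal segments (zero-padded at the end)."""
    seg_len = -(-len(data) // n_disks)  # ceiling division
    padded = data.ljust(seg_len * n_disks, b"\0")
    return [padded[i * seg_len:(i + 1) * seg_len] for i in range(n_disks)]

data = b"IDS 200 exam review!"
disks = stripe(data, 4)        # one segment per disk
parity = xor_blocks(disks)     # extra block stored on a parity disk

# Simulate losing disk 2, then rebuild it from the survivors plus parity.
survivors = [d for i, d in enumerate(disks) if i != 2]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == disks[2])
```

Every write also has to update the parity block, which is one reason parity slows writes down.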
1 ) Cold Data 2 ) Hot Data
1 ) data that isn't used frequently (months, years, or decades) eg. old projects, records needed to be maintained > retrieval is much slower 2 ) data that is frequently used > commonly stored in hybrid or tiered storage environments eg. video editing, web content, online transactions, etc.
1 ) Phishing 2 ) Spear-phishing Attacks
1 ) one of the most popular and successful methods for obtaining access credentials (users are duped into giving up their id and passwords to a phony site) Attack methods > e-mail advising reader that some problem requires immediately logging into site >> personalized link helpfully included >> reader visits site, enters id & password > e-mail w/ a malware attachment (executable file or infected PDF doc) > preserving the illusion of legitimacy is vital >> typically a long delay before damage happens >> passwords could be changed in the meantime > phishers might: >> use a phony but plausible e-mail source address >> use a phony site w/ a similar URL (swapping a capital i, lowercase L, or the number 1) >> reorder or merge terms (targethelp.com) >> inform target of an id-password mismatch, then link to the equivalent legitimate site 2 ) phishing, but aimed at specific targets instead of a wide net > ordinary phishers don't esp. care whose accounts they can compromise >> they need a large volume of attacks to obtain a small number of successes > however, important targets may be targeted as small groups or even individually
Types of RAM 1 ) DRAM (Dynamic RAM) 2 ) SRAM (Static RAM) 3 ) SDRAM (Synchronous Dynamic RAM) 4 ) NVRAM (Non-Volatile Random Access Memory) Obsolete - no longer produced/used
1 ) was great back when, but now considered obsolete >> faster than the RAM that preceded it 2 ) doesn't require constant refreshing & is faster and uses less power than DRAM >> commonly used for high-performance CPU caches >>> L1/L2 for a particular CPU/core; L3 shared among CPUs/cores 3 ) still RAM but faster >> speed listed relative to DRAM 4 ) very slow for performing computations >> used for storage & retrieval eg. USB drives
Key-Value Stores
> type of NoSQL db > the key is used to uniquely id the record (like RDB), but the value can be any kind of data, and values for different records can have totally different structures >> very flexible and easier storage, but queries become difficult
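A plain Python dict is enough to sketch the idea. The keys and values below are invented for illustration; a real key-value store would add persistence and networking on top.

```python
# Each key uniquely identifies one record; the values share no common structure.
store = {}
store["user:1001"] = {"name": "Ana", "friends": [1002, 1003]}
store["photo:77"] = b"\x89PNG..."                 # raw bytes
store["last_login:1001"] = "2019-04-02T10:15:00"  # plain string

# Lookup by key is direct.
ana = store["user:1001"]

def find_users_named(name):
    """Querying by value means scanning every record and guessing its shape."""
    return [key for key, value in store.items()
            if isinstance(value, dict) and value.get("name") == name]

print(find_users_named("Ana"))
```

The scan in find_users_named is why queries get difficult: the store knows nothing about what is inside each value.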
Data Size Prefixes
> Kilo (thousand)... short text file > Mega (million)... big image > Giga (billion)... long video > Tera (trillion)... typical hard drive > Peta (quadrillion)... one day's new FB data > Exa (quintillion)... all data of a large busi. > Zetta (sextillion)... total annual internet traffic
Data Replication & Decentralization
> Problem: your org. continuously operates 10,000 servers for global app. users...Should they be centralized or not? > Solution: 'DECENTRALIZATION' - multiple copies of data are stored at separate server groups worldwide; when necessary, requests can be forwarded to other server groups >> adv. - less latency; locally important data can be made more accessible (eg. local news or language); system failure only affects local system; & smaller database size facilitates updates
Read-Only Memory v. Read & Write Memory
> Read-Only Memory: can't be overwritten by ordinary users, like computer chips or burned DVDs > Read & Write Memory: can be overwritten by users, like ordinary RAM or hard drive disks
ACID Properties
> Atomicity: any related database operations will be completed entirely or not at all > Consistency: database follows system rules, so if db is consistent before an operation, it will still be consistent afterward > Isolation: operations on the same data will not interleave (cross over each other) > Durability: you have mechanisms for ensuring that data & changes aren't lost
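Atomicity and consistency are easiest to see with a money transfer. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the accounts table, the names, and the transfer helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK rule is the "system rule" that consistency must preserve.
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one all-or-nothing unit."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would drive alice negative
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to bob had already been applied when the debit broke the rule, and the rollback undoes it, so no half-finished transfer survives.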
EdgeRank
> FB's algorithm for ranking posts & ads (each is termed an "edge" by FB) > obsolete b/c current algorithms are much more complex > Purpose: delivers content to users that they are more likely to enjoy & interact w/ Factors > Affinity: one-way relationship strength > Weight: includes things like media type, could also include post topic > Time Decay: older posts stop being seen eventually
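A rough sketch of how the three factors could combine, assuming the score is simply affinity x weight x time decay. The numbers, the 24-hour half-life, and the Edge class are invented for illustration and are not FB's actual values.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    label: str
    affinity: float   # one-way relationship strength (made-up scale)
    weight: float     # e.g. media type bonus (made-up scale)
    age_hours: float

def time_decay(age_hours, half_life_hours=24.0):
    """Assumed decay: the factor halves every half_life_hours."""
    return 0.5 ** (age_hours / half_life_hours)

def score(edge):
    return edge.affinity * edge.weight * time_decay(edge.age_hours)

edges = [
    Edge("close friend's photo", affinity=0.9, weight=1.5, age_hours=6),
    Edge("acquaintance's status", affinity=0.2, weight=1.0, age_hours=1),
    Edge("close friend's old link", affinity=0.9, weight=1.0, age_hours=240),
]
ranked = sorted(edges, key=score, reverse=True)
for edge in ranked:
    print(f"{score(edge):.4f}  {edge.label}")
```

Even a high-affinity post sinks to the bottom once it is old enough, which matches the Time Decay factor.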
Database Operations
> Insert (a new record or table) > Delete (a record or table) > Modify: change data within a record > Query: retrieve data >>most important b/c typical databases read more often than they write (can span across tables)
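The four operations map directly onto SQL statements. A small sketch using Python's built-in sqlite3 module; the users table and its contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Insert: add new records.
conn.executemany("INSERT INTO users (name, city) VALUES (?, ?)",
                 [("Ana", "Chicago"), ("Ben", "Boston"), ("Cy", "Chicago")])

# Modify: change data within a record.
conn.execute("UPDATE users SET city = ? WHERE name = ?", ("Denver", "Ben"))

# Delete: remove a record.
conn.execute("DELETE FROM users WHERE name = ?", ("Cy",))

# Query: retrieve data.
rows = conn.execute("SELECT name, city FROM users ORDER BY name").fetchall()
print(rows)
```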
RDBMS (Relational Database Management System) Elements
> Tables: the set of records of a particular type (photos or users) > Records (rows): represents a particular thing of the table's type (specific photo or user) > Fields (columns): the various attributes stored regarding each record in a table >> photo: size, location, poster >> user: name, password, profile picture
Thin v. Thick Clients
> Thin: not much storage or computing power (old smartphone) > Thick: lots of storage & computing power (current desktop)
Thin v. Thick Applications
> Thin: work happens on another machine, results sent over network > Thick: work happens on local machine
Content Management Systems (CMS)
> Two components 1. content management app.: tools for adding, changing, and deleting content 2. content delivery app.: tools for permission-based content distribution > FB is based on a semi-public CMS >> users, busi. and FB inject content; content delivery based on user-influenced rules for visibility and preferences Elements > maintaining a world-scale set of static data is a challenge b/c of ensuring accessibility & planning for growth > FB's dynamic data problem is more complex to manage b/c of the continuous flow of new "most recent" data; real time updates, comments, and edits; changes in access and friendship
Volatile Memory v. Non-volatile Memory Volatile - liable to change rapidly & unpredictably
> Volatile Memory: the memory goes blank when the power goes away > Non-volatile Memory: the memory is persistent even when power goes away
RAID (Redundant Arrays of Inexpensive or Independent Disks)
> a bunch of hard disks inside a single housing with one interface
Caching caching - to store away in hiding or for future use
> a solution to FB's basic capacity problem > applies to: 1 ) apps & web pages on servers 2 ) database tables 3 ) account data for most active clients 4 ) the last time you were on
Clickbait
> an enticing link that leads to misleading news; users must often click through many screens to reach content > makes $$$ from users clicking through to their sites & buying products, the same as any other ad
Denial-of-Service Attacks
> attempt to overload the network >> too many connection requests >> forcing network to respond to repeated small communications > single-machine attacks can easily be blocked but large groups of attacking machines are not: >> Distributed Denial-of-Service (DDoS) >>reliance on botnets, groups of infected machines, w/ infection unknown to their owners > Solution (besides backup capacity): >> DDoS too large to be controlled manually >> defenders look for usage patterns and block in accordance w/ est. attack membership
Brute Force Attacks
> attempting to use every possible combination of password characters in order to gain access >> will work, eventually > eventually could be a really, really long time > solutions >> strong passwords >> enforced delays or locking account after a number of failed attempts > in a well-designed system, not a severe threat
Security v. Usability
> b/c perfect security is impossible, managers must balance usability against security >> the cost of added inconvenience for system users (ie. reduced usage up to abandonment) >> the added system expenses of security >> the expected damage caused by all attack attempts over time >> at the ideal point, managers minimize the sum of security costs and attack damage > How would system users and operators respond to excessive security requirements?
In-Memory Grids/Database Systems
> basically RAM (for speed) in support of hard drives (for cheap) > data that's currently popular gets stored in RAM for faster access but other data kept on hard drives b/c RAM about 1000x more $$$
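A toy version of the RAM-in-front-of-disk idea, assuming a least-recently-used eviction rule (the notes don't name one). A dict stands in for the hard drive, and HotColdStore is a made-up class name.

```python
from collections import OrderedDict

class HotColdStore:
    """Small, fast 'RAM' tier in front of a large, slow 'disk' tier."""

    def __init__(self, disk, ram_capacity):
        self.disk = disk              # dict standing in for hard drives
        self.ram = OrderedDict()      # most recently used keys at the end
        self.ram_capacity = ram_capacity
        self.disk_reads = 0

    def get(self, key):
        if key in self.ram:           # hot data: served from RAM
            self.ram.move_to_end(key)
            return self.ram[key]
        self.disk_reads += 1          # cold data: slow read from disk
        value = self.disk[key]
        self.ram[key] = value
        if len(self.ram) > self.ram_capacity:
            self.ram.popitem(last=False)  # evict the least recently used key
        return value

disk = {f"post:{i}": f"content {i}" for i in range(100)}
store = HotColdStore(disk, ram_capacity=2)
for key in ["post:1", "post:1", "post:2", "post:1", "post:3", "post:2"]:
    store.get(key)
print(store.disk_reads, list(store.ram))
```

Repeated requests for the same popular post never touch the disk after the first read.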
2 Key Encryption
> every system has a pair of keys: public & private > for any pair of keys, the public key decrypts what's encrypted by the private key & vice versa eg. if X & Y are communicating securely, then they first share their public keys with each other
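The "either key undoes the other" property can be shown with a textbook-sized RSA key pair. The primes here are tiny on purpose, so this is a classroom sketch only and nowhere near secure; apply_key is a made-up helper name, and pow(e, -1, phi) needs Python 3.8 or later.

```python
# Toy key generation with tiny primes (real keys use primes hundreds of digits long).
p, q = 61, 53
n = p * q                  # shared modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                     # public exponent
d = pow(e, -1, phi)        # private exponent: modular inverse of e

def apply_key(number, exponent, modulus):
    """Encrypting and decrypting are the same operation with different exponents."""
    return pow(number, exponent, modulus)

message = 65               # must be smaller than n

# Public key encrypts, private key decrypts.
ciphertext = apply_key(message, e, n)
recovered = apply_key(ciphertext, d, n)

# Vice versa: private key "encrypts" (signs), public key decrypts (verifies).
signature = apply_key(message, d, n)
verified = apply_key(signature, e, n)

print(recovered == message, verified == message)
```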
Responsive & Adaptive Design
> need for auto-sizing b/c mobile devices have smaller screens and must prioritize visible content > app. windows can be designed to automatically scale to the current device >> all features are accessible on any device; start with min. size & accessible features; features shifted from menus to main screen as space permits
SQL Injection Attacks
> putting active SQL code in password field of a login attempt to bypass password checking
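A small demonstration using Python's built-in sqlite3 module. The table, the account, and both login functions are made up, and passwords are stored in plain text only to keep the sketch short (a real system would store hashes).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('ana', 's3cret')")

def login_vulnerable(username, password):
    # User input is pasted straight into the SQL text.
    query = ("SELECT COUNT(*) FROM users "
             f"WHERE username = '{username}' AND password = '{password}'")
    return conn.execute(query).fetchone()[0] > 0

def login_safe(username, password):
    # Placeholders keep user input as data, never as SQL code.
    query = "SELECT COUNT(*) FROM users WHERE username = ? AND password = ?"
    return conn.execute(query, (username, password)).fetchone()[0] > 0

attack = "' OR '1'='1"   # typed into the password field
print(login_vulnerable("ana", attack))  # password check is bypassed
print(login_safe("ana", attack))        # treated as a (wrong) password
```

In the vulnerable version the WHERE clause becomes `... AND password = '' OR '1'='1'`, which is true for every row.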
Multi-factor Authentication
> requires: password + code received via text (other secondary options available) > usefulness: >> works: some random person attempts to log into your account knowing your e-mail address >> but, what if your password is locally saved and they steal your smartphone? though your FB account may then be a minor concern >> people are dupable
Big Data
> so much data that it requires special handling (could be in terms of volume or speed of changes, different data types, or transmission) > avoid using this term b/c it's vague & situation-dependent ("big data" for a typical Excel user isn't the same as "big data" for Google)
Cloud Systems
> solution to FB's data storage problem > the Cloud is a geographically distributed server network with a shared interface, load balancing, backup systems, and security
Key Fields
> used to make each record uniquely identifiable within a table >> never duplicated within a table >> encompasses one or more fields >> automatically checked for uniqueness before a modification or creation attempt > records in a Table A may reference a key field from a Table B >> then, B's key within A is termed a "foreign key" eg. A has a zip code, B uses zip code as key
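The zip code example can be tried out with Python's built-in sqlite3 module. Here users plays the role of Table A and zips the role of Table B; the table names, rows, and the try_insert helper are made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.execute("CREATE TABLE zips (zip TEXT PRIMARY KEY, city TEXT)")        # Table B
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, "
             "zip TEXT REFERENCES zips(zip))")                              # Table A
conn.execute("INSERT INTO zips VALUES ('60607', 'Chicago')")
conn.execute("INSERT INTO users (name, zip) VALUES ('Ana', '60607')")

def try_insert(sql, params):
    """Return True if the insert is accepted, False if a key rule rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

# A key value can never be duplicated within its table.
dup_key_ok = try_insert("INSERT INTO zips VALUES (?, ?)", ("60607", "Elsewhere"))
# A foreign key must point at an existing key in the other table.
bad_ref_ok = try_insert("INSERT INTO users (name, zip) VALUES (?, ?)", ("Ben", "99999"))

# The foreign key is what lets a query span both tables.
joined = conn.execute("SELECT users.name, zips.city FROM users "
                      "JOIN zips ON users.zip = zips.zip").fetchall()
print(dup_key_ok, bad_ref_ok, joined)
```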
Encryption
> you're scrambling a message or data in some way so that it's unreadable unless you have the decryption key > a simple way is to multiply each byte by a very large prime # and keep the modulo result (0 - 255) > it's no good unless the recipient can decrypt the message
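A sketch of the multiply-and-take-the-modulo scheme, assuming the modulus is 256 so every result fits in one byte (0 - 255). The prime 7919 is just a stand-in for "a very large prime", scramble is a made-up helper, and the scheme is trivially breakable, so treat it as a teaching toy. pow(KEY, -1, MOD) needs Python 3.8 or later.

```python
MOD = 256                        # results stay in the byte range 0 - 255
KEY = 7919                       # a prime; being odd makes it reversible mod 256
DECRYPT_KEY = pow(KEY, -1, MOD)  # modular inverse: undoes the multiplication

def scramble(data, multiplier):
    """Multiply each byte by the multiplier and keep the result modulo 256."""
    return bytes((byte * multiplier) % MOD for byte in data)

plaintext = b"meet at noon"
ciphertext = scramble(plaintext, KEY)
recovered = scramble(ciphertext, DECRYPT_KEY)

print(ciphertext)
print(recovered)
```

The recipient can only reverse the scrambling with the decryption key, which is the point of the last bullet above.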
Haystack: Problem & Solution
Problem > FB has lots of pictures, and two options: 1. hard disk storage (cheap but slow) 2. going through another Content Delivery Network [CDN] (expensive) Solution > restructure retrieval process > instead of multiple db calls/image, embed the metadata in the image URL and arrange photos in albums >> runs much faster
RDBMS (Relational Database Management System) v. NoSQL Databases: Ads. & Disads. SQL > Structured Query Language
RDBMS > keeps data in tables, every table has a set of records (rows), and every record has the same set of fields > every record in a table occupies the same storage space 1. these systems are difficult to change (b/c one field change affects the whole table) 2. the good thing is you can run queries across tables, but there are big problems when the tables scale NoSQL Database > basically aren't as strict as RDBs eg. a table might allow flexible fields for different records or individual fields can be composite, not atomic > CONSEQUENCE: typically more efficient for storage but you also typically lose the ability to do cross-table queries
Passwords: Weak & Strong
WEAK > short, just letters; including any complete words; simple sequence >> risk factors: derived from personal info; used on other sites; never changed STRONG > long randomized string of numbers, upper- & lowercase letters, and punctuation & symbols; unrelated to prior passwords; specific to FB >> adv. - practically impossible to guess in time to be useable (or before account is locked) >> disadv. - very difficult to remember, unless you are robot-like PROBABILITY OF RANDOM GUESS: n^(-c) > n = # of options/character > c = # of characters in pass.
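The n^(-c) formula is easy to try out. The sketch below compares a 6-character lowercase-only password with a 12-character password drawn from all printable keyboard symbols; the two password shapes and the function names are just examples.

```python
import string

def total_combinations(n, c):
    """Number of possible passwords: n options per character, c characters."""
    return n ** c

def guess_probability(n, c):
    """Chance that one random guess is correct: n^(-c)."""
    return n ** (-c)

weak_n, weak_c = len(string.ascii_lowercase), 6
strong_n = len(string.ascii_letters + string.digits + string.punctuation)
strong_c = 12

print(weak_n, total_combinations(weak_n, weak_c), guess_probability(weak_n, weak_c))
print(strong_n, total_combinations(strong_n, strong_c), guess_probability(strong_n, strong_c))
```

Going from 26 options and 6 characters to 94 options and 12 characters multiplies the number of combinations by a factor far larger than any lockout policy would let an attacker try.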