MapReduce & Scala
Which strategy does MapReduce use: pull scheduling or push scheduling? Also define what those terms mean
MapReduce = pull scheduling; TTs pull tasks by making requests
Pull = worker asks the master for more work once it finishes its current task
Push = master assigns tasks to workers without waiting for requests
How is a Boolean query executed in Boolean retrieval?
NOTE: Traversal of postings is linear, assuming postings are sorted by docno; start with the shortest postings list first to keep intermediate results small
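The linear traversal above can be sketched as a merge of two sorted postings lists (toy doc-ID lists; real systems intersect compressed postings):

```python
def intersect(p1, p2):
    """Linear merge of two postings lists sorted by doc ID."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# For an AND of many terms, intersect the two shortest lists first,
# then fold the remaining lists into the (shrinking) result.
print(intersect([1, 3, 5, 9], [3, 4, 5]))  # [3, 5]
```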
Describe the solution for slow workers
Near the end of a phase, spawn backup copies of the remaining in-progress tasks; the first copy to finish "wins"
Describe the Cosine Similarity method Bonus question: If two vectors result in dot product of 0, are they the same doc or completely different?
Completely different: a dot product of 0 means the vectors are orthogonal, i.e., no similarity b/w the docs
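A minimal sketch of cosine similarity over term-weight vectors (toy vectors, plain lists assumed):

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|); 0 means orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Docs sharing no terms have dot product 0 -> no similarity
print(cosine_similarity([1, 0, 1], [0, 2, 0]))  # 0.0
```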
What are partitions and why are they important?
Partitions = subsets of intermediate results that have the same key - all values with the same key are presented to a single Reducer together in a partition
Describe the overall process for fault tolerance
- JT detects failure via heartbeat - JT re-executes completed + in-progress map tasks (since output is on local disk and inaccessible) - JT re-executes in-progress reduce tasks - Task completion is committed through the master
What is the Map function and what does it do to lists of input data?
- a higher order function that applies a function element-wise to a list of elements - transforms lists of elements into new lists of output data
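The element-wise transform can be sketched with Python's built-in `map`:

```python
# map takes a function as an argument (higher-order) and applies it
# element-wise, producing a new list from the input list
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
print(squares)  # [1, 4, 9, 16]
```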
How is text represented? Name and describe the process along with assumptions made
"Bag of Words"
- Treat all words in a doc as index terms
- Assign a "weight" to each term based on importance (simplest: absence/presence of word)
- Disregard everything else
Assumptions:
- Term occurrence is indep
- Doc relevance is indep
- "Words" are well-defined
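A bag-of-words sketch using term frequency as the weight (whitespace tokenization assumed; order and everything else is discarded):

```python
from collections import Counter

def bag_of_words(doc):
    """Weight each term by its frequency; word order is thrown away."""
    return Counter(doc.lower().split())

bw = bag_of_words("the blue cat saw the blue dog")
print(bw["blue"])  # 2
```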
Describe the K-V Pair data structure for MapReduce
(K,V) where K = key, V = value
- Mapper takes data with multiple keys as input
- Outputs data in a meaningful K-V pair
- Reducer takes the data with only a single key and compacts/aggregates the values of that key
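A word-count sketch of the K-V flow, with a tiny in-process driver standing in for the framework's shuffle (toy input assumed):

```python
from collections import defaultdict

def mapper(doc):
    # input covers many distinct keys; emit one (key, value) pair per word
    for word in doc.split():
        yield (word, 1)

def reducer(key, values):
    # receives a single key with all of its values; aggregates to one pair
    return (key, sum(values))

# driver: group mapper output by key, then reduce each group
groups = defaultdict(list)
for k, v in mapper("cat dog cat"):
    groups[k].append(v)
counts = dict(reducer(k, vs) for k, vs in groups.items())
print(counts)  # {'cat': 2, 'dog': 1}
```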
How does MapReduce help HDFS?
- Acts as processing engine of HDFS - Helps with concept of "moving computation" instead of "moving data" => locality of computation - Cluster consists of nodes that have storage and processing power - Have multiple nodes perform computation in parallel
How are applications represented in MapReduce? What does it encompass?
- App represented as a job - Job encompasses multiple map and reduce tasks
How does Hadoop locate stragglers?
- Hadoop monitors each task's progress using a progress score between 0 and 1 - If a task's progress score is less than 0.2 (default in Hadoop) and the task has run for at least 1 min, it is marked as a straggler
What are challenges with using commodity hardware with large-scale computing?
- How do you distribute computation? - How can we make it easy to write distributed programs? - What do you do when machines fail?
Define the following in terms of information retrieval: - Information retrieval - What do we search? - What do we find?
- Information retrieval > Focus on textual info, but can be image, video, etc. - What do we search? > Search collections - What do we find? > Find documents
Describe the generic retrieval process
- Look up postings lists corresponding to query terms - Traverse postings for each query term - Store partial query-doc scores in accumulators - Select top k results to return
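The steps above can be sketched with accumulators over a toy postings map (term -> list of (docno, weight); the postings and weights are made up for illustration):

```python
from collections import defaultdict

# hypothetical postings lists: term -> [(docno, weight), ...]
postings = {
    "blue": [(2, 0.5)],
    "cat": [(2, 0.3), (4, 0.9)],
}

def score(query, k=10):
    accumulators = defaultdict(float)  # partial query-doc scores
    for term in query:                 # traverse postings per query term
        for docno, weight in postings.get(term, []):
            accumulators[docno] += weight
    # select top-k results by score
    return sorted(accumulators.items(), key=lambda kv: -kv[1])[:k]

print(score(["blue", "cat"]))  # doc 4 first, then doc 2
```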
How is MapReduce used in Index Construction?
- Map over all docs > Emit term as key, (docno, tf) as value > Emit other info as needed - Sort/shuffle: group postings by term - Reduce > Gather and sort postings (by docno or tf) > Write postings to disk
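The three phases can be simulated in one process over a toy corpus (docnos and texts assumed; a real job would write postings to disk in the reducer):

```python
from collections import Counter, defaultdict

docs = {1: "blue cat", 2: "cat cat dog"}  # docno -> text (toy corpus)

def map_phase(docno, text):
    # emit <term, (docno, tf)> per distinct term in the doc
    for term, tf in Counter(text.split()).items():
        yield term, (docno, tf)

# sort/shuffle: group postings by term
groups = defaultdict(list)
for docno, text in docs.items():
    for term, posting in map_phase(docno, text):
        groups[term].append(posting)

# reduce: gather and sort postings by docno ("write" = keep in a dict here)
index = {term: sorted(ps) for term, ps in groups.items()}
print(index["cat"])  # [(1, 1), (2, 2)]
```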
Describe ranked retrieval
- Order docs by how likely they are to be relevant - User model - Can estimate relevance
What does the MapReduce environment take care of?
- Partitioning the input data - Scheduling program's execution across a set of machines - Performing the groupByKey step - Handling machine failures - Managing required inter-machine communication
What is TaskTracker and where is it? What is a JobTracker and where is it?
- TaskTracker is a process that monitors tasks and communicates results with a JobTracker; runs on each Datanode - JobTracker handles scheduling, progress tracking, fault tolerance, and resource management; runs on Namenode
Describe Boolean retrieval
- Users express queries as Boolean expr using AND, OR, NOT - Retrieval based on sets
What are the functions of a partition?
- Want to control how keys are partitioned - System uses a default partition function: hash(key) mod R - Sometimes useful to override the hash function to ensure URLs from the same host end up in the same output file
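A sketch of the default partitioner and a host-based override (the URL parsing is a crude stand-in; Hadoop's real partitioner interface differs):

```python
def default_partition(key, R):
    # system default: hash(key) mod R
    return hash(key) % R

def host_partition(url, R):
    # override: partition by hostname so all URLs from one host
    # land in the same reduce output file
    host = url.split("/")[2]  # naive "scheme://host/path" split
    return hash(host) % R

# both pages of the same host map to the same partition
p1 = host_partition("http://example.com/a", 4)
p2 = host_partition("http://example.com/b", 4)
print(p1 == p2)  # True
```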
What is the Reduce function and what does it do to lists of input data?
- also known as fold, a higher-order function that processes a list of elements by applying a function pairwise and finally returning a scalar - transforms/compacts a list into a scalar
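The fold behaviour in Python's `functools.reduce`:

```python
from functools import reduce

# fold: apply an operator pairwise across the list, starting from an
# initial value, and compact the list down to a single scalar
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
print(total)  # 10
```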
Name design considerations of MapReduce
- process vast amounts of data - parallel processing - large clusters of commodity hardware - fault-tolerant - should be able to increase processing power by adding more nodes -> "scale-out" not up - sharing data or processing between nodes is bad -> ideally want "shared-nothing" architecture - want batch processing -> process entire dataset and not random seeks
What is the general flow of the Map operation?
1) Define a function 2) Apply on a list 3) Get another list
What is the general flow of the Reduce operation?
1) Define an operator like + 2) Give initial value like 0 3) Apply on a list 4) Get a scalar
What is the general approach to MapReduce procedure?
1) Identify key 2) Identify mapper function 3) Identify reducer function 4) System does the rest!
Problem: copying data over a network takes time and can slow down distributed computation Soln: ?
Bring computation close to the data -> chunk servers also serve as compute servers Store files multiple times for reliability - MapReduce encompasses these solutions
Which of the following is considered when counting words? Case folding Syntax Tokenization Stopword removal Semantics Stemming Word knowledge
Case folding Tokenization Stopword removal Stemming
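The four steps can be sketched as a pipeline (the stopword list and the suffix-stripping "stemmer" are crude stand-ins; real systems use e.g. Porter stemming):

```python
STOPWORDS = {"the", "a", "of"}  # assumed toy stopword list

def stem(word):
    # crude suffix stripper standing in for a real stemmer
    return word[:-1] if word.endswith("s") else word

def preprocess(text):
    tokens = text.split()                               # tokenization
    tokens = [t.lower() for t in tokens]                # case folding
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [stem(t) for t in tokens]                    # stemming

print(preprocess("The cats of Paris"))  # ['cat', 'pari']
```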
What assumption is made in the vector space model?
Docs that are "close together" in vector space "talk about" the same things, so retrieve docs based on how close the doc is to the query
Describe the inverted index used in Boolean retrieval
Each word is mapped to a linked list of doc numbers the word is present in
EX:
blue -> 2
cat -> 3 -> 2
Describe the inverted index used in TF-IDF
Each word is mapped to a linked list of tuples (docNum, numOccurrences) with head = number of docs the word appears in
EX:
blue -> 2 -> (2,1) -> (3,2)
cat -> 1 -> (4,3)
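The structure above can be modelled as a dict (df head + postings) and used to compute a TF-IDF weight; the `tf * log(N/df)` form and the collection size N are assumed here:

```python
import math

# term -> (df, [(docno, tf), ...]) mirroring the flashcard example
index = {"blue": (2, [(2, 1), (3, 2)]), "cat": (1, [(4, 3)])}
N = 10  # assumed total number of docs in the collection

def tf_idf(term, docno):
    df, postings = index[term]
    tf = dict(postings).get(docno, 0)
    return tf * math.log(N / df)  # one common tf-idf variant

print(round(tf_idf("blue", 3), 3))  # 2 * ln(10/2)
```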
Define heartbeat message
Every TT sends an update signal periodically to JT encompassing a request for a map or a reduce task to run
Define and describe the schedulers that come with MapReduce in Hadoop
FIFO Scheduler: the default that schedules jobs in order of submission Fair Scheduler: a multi-user scheduler which aims to give every user a fair share of the cluster capacity over time
How does task granularity help pipelining?
Fine granularity tasks: map tasks >> machines > minimizes time for fault recovery > can do pipeline shuffling with map execution > better dynamic load balancing
How many Map and Reduce jobs should there be?
Given M = num map tasks and R = num reduce tasks: - Make M much larger than the number of nodes in the cluster - One DFS chunk per map is common - Improves dynamic load balancing and speeds up recovery from worker failures - Usually R < M b/c output is spread across R files
How is MapReduce helpful with info retrieval? How is it not so good?
Helpful for indexing - Requires scalability - Fast - Batch operation - Incremental updates may not be important - Crawling the web Not so helpful for retrieval problem - Must have sub-second response time - Only need relatively few results
What is a key feature of functional programming?
Higher order functions: functions that accept other functions as arguments
Do Map and Reduce tasks run in parallel or in isolation? Why?
In isolation: once all map tasks finish, reducers start - isolation saves bandwidth since tasks don't wait on data from other nodes
Where is the following stored: - Input - Intermediate results - Final output
Input: on a distributed file system (FS) Intermediate results: stored on LOCAL FS of map and reduce workers - NOT written to HDFS Final output: stored on distributed FS, often input to another MapReduce task - written to HDFS
If a TT fails to communicate with JT for a period of time (1 min by default), JT does what for map jobs and reduce jobs?
JT assumes the TT has crashed - For map phase job, JT asks another TT to re-execute ALL Mappers previously ran at failed TT - For reduce phase job, JT asks another TT to re-execute ALL Reducers that were in progress on the failed TT
Define Map Task Scheduling and Reduce Task Scheduling
Map Task Scheduling: JT satisfies requests for map tasks via attempting to schedule mappers in the vicinity of their input splits - So: considers locality Reduce Task Scheduling: JT simply assigns the next yet-to-run reduce task to a requesting TT regardless of TT's network location and its implied effect on the reducer's shuffle time - So: does NOT consider locality
How are the following failures dealt with: - Map worker failure - Reduce worker failure - Master failure
Map worker failure - Map tasks completed or in-progress at worker are reset to idle - Reduce workers are notified when task is rescheduled on another worker Reduce worker failure - Only in-progress tasks are reset to idle - Reduce task is restarted Master failure - MapReduce task is aborted and client notified
Describe the input of Mappers, Shuffling/ Sorting, and Reducers
Mappers - Input: <key, value> pairs - Output: <key, value> pairs that can be grouped by key Shuffling/ Sorting - Input: <key, value> pairs that can be grouped by key - Output: <key, <list of values>> Reducers - Input: <key, <list of values>> - Output: <key, value> where the list in the K-V pair is compacted to a scalar
Who takes care of coordination and how?
Master node - Task status tracking - Scheduling idle tasks as workers become available - Pushes map task results location and sizes to reducers - Pings periodically to detect failures
How can Positional Indexes be used in the Inverted Index for TF-IDF?
Positional indexes can be added to each tuple to record that the word is the nth word in the doc
EX:
blue -> 2 -> (2,1,[3]) -> (3,2,[2,4])
cat -> 1 -> (4,3,[1,5,7])
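Positions enable phrase queries: "t1 t2" matches when some position of t2 is exactly one past a position of t1. A sketch over a made-up two-term doc (the flashcard's example terms don't co-occur, so toy data is assumed):

```python
# term -> (df, [(docno, tf, [positions]), ...]); toy positional index
index = {
    "blue": (1, [(2, 2, [1, 4])]),
    "cat":  (1, [(2, 1, [2])]),
}

def positions(term, docno):
    for d, _tf, pos in index[term][1]:
        if d == docno:
            return pos
    return []

def phrase_match(t1, t2, docno):
    # phrase "t1 t2" occurs iff a position of t2 is one past a position of t1
    p1 = set(positions(t1, docno))
    return any(p - 1 in p1 for p in positions(t2, docno))

print(phrase_match("blue", "cat", 2))  # True ("blue" at 1, "cat" at 2)
```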
Pros and cons of Boolean retrieval method
Pros - Precise, if know right strategies - Precise, if have an idea of what you're looking for - Fast/ efficient implementation Cons - Users must learn Boolean logic - Boolean logic insufficient to capture richness of language - No control over size of result set: can have too many hits or none - When do you stop reading? All docs in results are equally good - What about partial matches? Those docs could still be useful
Describe the abstract IR architecture Bonus question: Where is the representation of the data created?
Representation of the data is created in the offline portion
What is the primary way MapReduce achieves fault tolerance?
Restarting tasks
What is the job of the Hadoop framework in MapReduce?
Sort and shuffle data from mappers to reducers so that all values for a given key reach a single reducer - hidden phase between mappers and reducers: groups all identical keys from all mappers, sorts, and passes each group to one reducer
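The hidden sort/shuffle phase can be sketched with `itertools.groupby`, which (like the framework) requires the pairs to be sorted by key first:

```python
from itertools import groupby
from operator import itemgetter

# combined mapper output from all mappers: unordered (key, value) pairs
mapped = [("cat", 1), ("dog", 1), ("cat", 1)]

# sort by key, then group so each key arrives with all of its values
mapped.sort(key=itemgetter(0))
shuffled = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}
print(shuffled)  # {'cat': [1, 1], 'dog': [1]}
```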
Define stragglers, speculative tasks, and speculative execution
Stragglers = slow tasks Speculative tasks = redundant tasks Speculative execution = MapReduce locates stragglers and runs a speculative task for each straggler that will hopefully finish before the straggler > the first to commit becomes the definitive copy and the other task is killed
When should you use TF-IDF over Cosine Similarity?
TF-IDF: 1 term and n docs Cosine sim: k terms and n docs
What architecture is used for task scheduling in MapReduce? Describe what is the JT and TT
Uses master-slave architecture JT = Job Tracker; Master node TT = Task Tracker; Slave node