MapReduce & Scala

Which strategy does MapReduce use: pull scheduling or push scheduling? Also define what those terms mean

MapReduce = pull scheduling: TTs pull tasks by making requests to the JT. Pull = worker asks for more work once it is done. Push = master keeps sending work to workers without being asked.

How is a Boolean query executed in Boolean retrieval?

NOTE: Traversal of postings is linear, assuming the postings are sorted; start with the shortest postings list first
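
The note above can be sketched as a linear merge of sorted postings lists. This is a minimal illustration for AND queries only, not a full Boolean query engine; `intersect`, `boolean_and`, and the sample `boolean_index` are hypothetical names and data, not part of the original card.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists with one linear pass."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def boolean_and(index, terms):
    """AND several terms, starting with the shortest postings list."""
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings:
        return []
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result


# Hypothetical sample index: term -> sorted list of doc numbers
boolean_index = {"blue": [2, 3, 7], "cat": [1, 2, 3, 5, 7, 9], "fish": [3, 7]}
```

Starting with the shortest list keeps the intermediate result small, so later merges touch fewer entries.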

Describe the solution for slow workers

Near end of phase, spawn backup copies of tasks and first to finish "wins"

Describe the Cosine Similarity method Bonus question: If two vectors result in dot product of 0, are they the same doc or completely different?

Bonus answer: a dot product of 0 means no similarity b/w docs (completely different)
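
A minimal sketch of the usual cosine similarity computation, assuming each doc is already a vector of term weights: the dot product of the two vectors divided by the product of their lengths. The function name and the zero-vector handling are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty doc is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two docs that share no terms give a dot product of 0 and so a similarity of 0; vectors pointing in the same direction give a similarity of 1.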

What are partitions and why are they important?

Partitions = subsets of intermediate results that have the same key - all values with the same key are presented to a single Reducer together in a partition

Describe the overall process for fault tolerance

- JT detects failure via heartbeat - JT re-executes completed + in-progress map tasks (since their output is on local disk and inaccessible) - JT re-executes in-progress reduce tasks - Task completion is committed through the master

What is the Map function and what does it do to lists of input data?

- a higher order function that applies a function element-wise to a list of elements - transforms lists of elements into new lists of output data

How is text represented? Name and describe the process along with assumptions made

"Bag of Words" - Treat all words in a doc as index terms - Assign a "weight" to each term based on importance (simplest: absence/ presence of word) - Disregard everything else Assumptions: - Term occurrence is indep - Doc relevance is indep - "Words" are well-defined

Describe the K-V Pair data structure for MapReduce

(K,V) where K = key, V = value - Mapper takes data with multiple keys as input - Outputs data in a meaningful K-V pair - Reducer takes the data with only a single key and compacts/ aggregates values of the key

How does MapReduce help HDFS?

- Acts as processing engine of HDFS - Helps with concept of "moving computation" instead of "moving data" => locality of computation - Cluster consists of nodes that have storage and processing power - Have multiple nodes perform computation in parallel

How are applications represented in MapReduce? What does it encompass?

- App represented as a job - Job encompasses multiple map and reduce tasks

How does Hadoop locate stragglers?

- Hadoop monitors each task's progress using a progress score between 0 and 1 - If a task's progress score is less than the average for its category (map or reduce) minus 0.2 (default in Hadoop) and the task has run for at least 1 min, it is marked as a straggler

What are challenges with using commodity hardware with large-scale computing?

- How do you distribute computation? - How can we make it easy to write distributed programs? - What do you do when machines fail?

Define the following in terms of MapReduce: - Information retrieval - What do we search? - What do we find?

- Information retrieval > Focus on textual info, but can be image, video, etc. - What do we search? > Search collections - What do we find? > Find documents

Describe the generic retrieval process

- Look up postings lists corresponding to query terms - Traverse postings for each query term - Store partial query-doc scores in accumulators - Select top k results to return
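
The four steps above can be sketched as follows. This is a simplified illustration that assumes each posting already carries a precomputed term weight; `retrieve` and `weighted_index` are hypothetical names and data.

```python
import heapq
from collections import defaultdict


def retrieve(index, query_terms, k):
    """Return the top-k (docno, score) pairs for a query."""
    accumulators = defaultdict(float)  # partial query-doc scores
    for term in query_terms:
        # Look up and traverse the postings list for this term
        for docno, weight in index.get(term, []):
            accumulators[docno] += weight
    # Select the k highest-scoring docs
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])


# Hypothetical index: term -> list of (docno, weight)
weighted_index = {
    "blue": [(1, 1.0), (3, 2.0)],
    "cat": [(1, 1.5), (2, 0.5)],
}
top = retrieve(weighted_index, ["blue", "cat"], 2)
```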

How is MapReduce used in Index Construction?

- Map over all docs > Emit term as key, (docno, tf) as value > Emit other info as needed - Sort/ shuffle: group postings by term - Reduce > Gather and sort postings (by docno or tf) > Write postings to disk
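
An in-memory sketch of this index construction flow, assuming whitespace tokenization and sorting postings by docno. It only imitates the map, shuffle, and reduce phases in a single process and does not use Hadoop; all names and the sample docs are hypothetical.

```python
from collections import Counter, defaultdict


def map_doc(docno, text):
    """Map: emit (term, (docno, tf)) for each distinct term in one doc."""
    counts = Counter(text.lower().split())
    return [(term, (docno, tf)) for term, tf in counts.items()]


def shuffle(pairs):
    """Sort/shuffle: group postings by term."""
    grouped = defaultdict(list)
    for term, posting in pairs:
        grouped[term].append(posting)
    return grouped


def reduce_postings(term, postings):
    """Reduce: gather postings and sort them by docno."""
    return term, sorted(postings)


docs = {1: "blue cat", 2: "cat cat fish", 3: "blue fish blue"}
pairs = [p for docno, text in docs.items() for p in map_doc(docno, text)]
inverted = dict(reduce_postings(t, ps) for t, ps in shuffle(pairs).items())
```

In a real job the reducer would write the postings to disk instead of returning them.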

Describe ranked retrieval

- Order docs by how likely they are to be relevant - User model - Can estimate relevance

What does the MapReduce environment take care of?

- Partitioning the input data - Scheduling program's execution across a set of machines - Performing the groupByKey step - Handling machine failures - Managing required inter-machine communication

What is TaskTracker and where is it? What is a JobTracker and where is it?

- TaskTracker is a process that monitors tasks and communicates results with a JobTracker; runs on each Datanode - JobTracker handles scheduling, progress tracking, fault tolerance, and resource management; runs on Namenode

Describe Boolean retrieval

- Users express queries as Boolean expr using AND, OR, NOT - Retrieval based on sets

What are the functions of a partition?

- Want to control how keys are partitioned - System uses a default partition function: hash(key) mod R - Sometimes useful to override the hash function to ensure URLs from a host end up in the same output file
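
A sketch of both partition functions, assuming R reducers and string keys. `zlib.crc32` stands in for the hash here because Python's built-in `hash()` is randomized per process for strings; the function names are hypothetical.

```python
import zlib
from urllib.parse import urlparse


def default_partition(key, num_reducers):
    """hash(key) mod R, using CRC32 as a stable stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def host_partition(url, num_reducers):
    """Override: hash only the hostname so a host's URLs stay together."""
    host = urlparse(url).hostname or ""
    return default_partition(host, num_reducers)
```

With `host_partition`, `http://example.com/a` and `http://example.com/b/c` always land in the same partition, which the default function does not guarantee.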

What is the Reduce function and what does it do to lists of input data?

- also known as fold, a higher order function that processes a list of elements by applying a function pairwise and finally returning a scalar - transforms/ compacts a list into a scalar

Name design considerations of MapReduce

- process vast amounts of data - parallel processing - large clusters of commodity hardware - fault-tolerant - should be able to increase processing power by adding more nodes -> "scale-out" not up - sharing data or processing between nodes is bad -> ideally want "shared-nothing" architecture - want batch processing -> process entire dataset and not random seeks

What is the general flow of the Map operation?

1) Define a function 2) Apply on a list 3) Get another list
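
A minimal sketch of these three steps using Python's built-in `map`; `square` and `numbers` are illustrative names.

```python
# 1) Define a function
def square(x):
    return x * x


# 2) Apply it on a list
numbers = [1, 2, 3, 4]

# 3) Get another list of the same length
squares = list(map(square, numbers))
```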

What is the general flow of the Reduce operation?

1) Define an operator like + 2) Give initial value like 0 3) Apply on a list 4) Get a scalar
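
The same four steps sketched with `functools.reduce`, using + as the operator and 0 as the initial value; the variable names are illustrative.

```python
from functools import reduce
import operator

values = [1, 2, 3, 4]

# Operator: +, initial value: 0, applied pairwise across the list
total = reduce(operator.add, values, 0)

# With an empty list, the initial value is returned unchanged
empty_total = reduce(operator.add, [], 0)
```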

What is the general approach to MapReduce procedure?

1) Identify key 2) Identify mapper function 3) Identify reducer function 4) System does the rest!

Problem: copying data over a network takes time and can slow down distributed computation Soln: ?

Bring computation close to the data -> chunk servers also serve as compute servers Store files multiple times for reliability - MapReduce encompasses these solutions

Which of the following is considered when counting words? Case folding, syntax, tokenization, stopword removal, semantics, stemming, word knowledge

Case folding, tokenization, stopword removal, stemming

What assumption is made in the vector space model?

Docs that are "close together" in vector space "talk about" the same things, so retrieve docs based on how close the doc is to the query

Describe the inverted index used in Boolean retrieval

Each word is mapped to a linked list of doc numbers the word is present in EX: blue -> 2; cat -> 3 -> 2

Describe the inverted index used in TF-IDF

Each word is mapped to a linked list of tuples (docNum, numOccurrences) with head = number of docs the word appears in EX: blue -> 2 -> (2,1) -> (3,2); cat -> 1 -> (4,3)
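
A sketch of this structure as a Python dict, with a weight computed from it. The weighting tf * log(N / df) is one common TF-IDF variant and is an assumption here, as is the collection size of 4 docs used in the usage lines; `tfidf_index` and `tf_idf` are hypothetical names.

```python
import math

# term -> (number of docs containing the term, [(docNum, numOccurrences), ...])
tfidf_index = {
    "blue": (2, [(2, 1), (3, 2)]),
    "cat": (1, [(4, 3)]),
}


def tf_idf(index, term, docno, num_docs):
    """Weight of `term` in doc `docno`, assuming tf * log(N / df)."""
    df, postings = index[term]
    tf = dict(postings).get(docno, 0)  # 0 if the term is not in the doc
    return tf * math.log(num_docs / df)


weight_cat = tf_idf(tfidf_index, "cat", 4, 4)    # 3 * log(4 / 1)
weight_blue = tf_idf(tfidf_index, "blue", 3, 4)  # 2 * log(4 / 2)
```

Storing the document frequency at the head of the list means the idf part is available without walking the postings.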

Define heartbeat message

Every TT sends an update signal periodically to JT encompassing a request for a map or a reduce task to run

Define and describe the schedulers that come with MapReduce in Hadoop

FIFO Scheduler: the default, which schedules jobs in order of submission. Fair Scheduler: a multi-user scheduler which aims to give every user a fair share of the cluster capacity over time

How does task granularity help pipelining?

Fine granularity tasks: map tasks >> machines > minimizes time for fault recovery > can do pipeline shuffling with map execution > better dynamic load balancing

How many Map and Reduce tasks should there be?

Given M = num map tasks and R = num reduce tasks: - Make M much larger than the number of nodes in the cluster - One DFS chunk per map is common - Improves dynamic load balancing and speeds up recovery from worker failures - Usually R < M b/c output is spread across R files

How is MapReduce helpful with info retrieval? How is it not so good?

Helpful for indexing - Requires scalability - Fast - Batch operation - Incremental updates may not be important - Crawling the web Not so helpful for retrieval problem - Must have sub-second response time - Only need relatively few results

What is a key feature of functional programming?

Higher order functions: functions that accept other functions as arguments

Do Map and Reduce tasks run in parallel or in isolation? Why?

In isolation: once map tasks finish, reduce tasks start - this saves bandwidth and means no waiting on other nodes

Where is the following stored: - Input - Intermediate results - Final output

Input: on a distributed file system (FS) Intermediate results: stored on LOCAL FS of map and reduce workers - NOT written to Hadoop Final output: stored on distributed FS, often input to another MapReduce task - written to Hadoop

If a TT fails to communicate with JT for a period of time (1 min by default), JT does what for map jobs and reduce jobs?

JT assumes the TT has crashed - For map phase job, JT asks another TT to re-execute ALL Mappers previously run at the failed TT - For reduce phase job, JT asks another TT to re-execute ALL Reducers that were in progress on the failed TT
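
A simplified sketch of this rule, assuming the JT keeps the timestamp of the last heartbeat per TT and a list of tasks per TT. The data layout, function names, and state strings are hypothetical and do not mirror Hadoop's actual classes.

```python
def find_failed_trackers(last_heartbeat, now, timeout=60.0):
    """TTs whose last heartbeat is older than the timeout (1 min by default)."""
    return sorted(tt for tt, ts in last_heartbeat.items() if now - ts > timeout)


def tasks_to_rerun(tasks):
    """Pick tasks of a failed TT to re-execute.

    tasks: list of (task_id, kind, state) with kind in {"map", "reduce"}.
    All map tasks are re-run (their output lived on the failed node's
    local disk); only in-progress reduce tasks are re-run.
    """
    rerun = []
    for task_id, kind, state in tasks:
        if kind == "map" or (kind == "reduce" and state == "in_progress"):
            rerun.append(task_id)
    return rerun
```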

Define Map Task Scheduling and Reduce Task Scheduling

Map Task Scheduling: JT satisfies requests for map tasks via attempting to schedule mappers in the vicinity of their input splits - So: considers locality Reduce Task Scheduling: JT simply assigns the next yet-to-run reduce task to a requesting TT regardless of TT's network location and its implied effect on the reducer's shuffle time - So: does NOT consider locality

How are the following failures dealt with: - Map worker failure - Reduce worker failure - Master failure

Map worker failure - Map tasks completed or in-progress at worker are reset to idle - Reduce workers are notified when task is rescheduled on another worker Reduce worker failure - Only in-progress tasks are reset to idle - Reduce task is restarted Master failure - MapReduce task is aborted and client notified

Describe the input of Mappers, Shuffling/ Sorting, and Reducers

Mappers - Input: <key, value> pairs - Output: <key, value> pairs that can be grouped by key Shuffling/ Sorting - Input: <key, value> pairs that can be grouped by key - Output: <key, <list of values>> Reducers - Input: <key, <list of values>> - Output: <key, value> where list in the K-V pair is compacted to scalar
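
The three stages can be sketched with word count, run in a single process rather than on a cluster. The function names and the sample lines are illustrative.

```python
from collections import defaultdict


def mapper(_key, line):
    """Input: <key, value> = (line number, line). Output: <word, 1> pairs."""
    for word in line.split():
        yield word.lower(), 1


def shuffle_sort(pairs):
    """Group values by key and sort by key: <key, <list of values>>."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key, values):
    """Compact the list of values for one key into a scalar."""
    return key, sum(values)


lines = ["the cat", "The blue cat"]
mapped = [kv for i, line in enumerate(lines) for kv in mapper(i, line)]
counts = dict(reducer(k, vs) for k, vs in shuffle_sort(mapped))
```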

Who takes care of coordination and how?

Master node - Task status tracking - Scheduling idle tasks as workers become available - Pushes map task results location and sizes to reducers - Pings periodically to detect failures

How can Positional Indexes be used in the Inverted Index for TF-IDF?

Positional indexes can be added to the tuple to represent the word is the nth word in the doc EX: blue -> 2 -> (2,1,[3]) -> (3,2,[2,4]); cat -> 1 -> (4,3,[1,5,7])
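
One common use of stored positions is phrase matching, sketched below. The dict layout (term -> docNum -> positions) is a simplification of the tuples above, and the extra posting for "cat" in doc 3 is hypothetical data added so the example has a match.

```python
# term -> {docNum: [positions of the term in that doc]}
positional_index = {
    "blue": {2: [3], 3: [2, 4]},
    "cat": {4: [1, 5, 7], 3: [5]},  # the doc 3 entry is hypothetical
}


def phrase_docs(index, first, second):
    """Docs where `second` appears immediately after `first`."""
    hits = []
    for docno, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(docno, []))
        if any(p + 1 in following for p in positions):
            hits.append(docno)
    return sorted(hits)


matches = phrase_docs(positional_index, "blue", "cat")
```

Here doc 3 matches because "blue" is at position 4 and "cat" at position 5.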

Pros and cons of Boolean retrieval method

Pros - Precise, if know right strategies - Precise, if have an idea of what you're looking for - Fast/ efficient implementation Cons - Users must learn Boolean logic - Boolean logic insufficient to capture richness of language - No control over size of result set: can have too many hits or none - When do you stop reading? All docs in results are equally good - What about partial matches? Those docs could still be useful

Describe the abstract IR architecture Bonus question: Where is the representation of the data created?

Representation of the data is created in the offline portion

What is the primary way MapReduce achieves fault tolerance?

Restarting tasks

What is the job of the Hadoop framework in MapReduce?

Sort and shuffle data from mappers to reducers so each reducer gets data from only one key - hidden phase between mappers and reducers: groups all similar keys from all mappers, sorts, and passes them to a certain reducer

Define stragglers, speculative tasks, and speculative execution

Stragglers = slow tasks Speculative tasks = redundant tasks Speculative execution = MapReduce locates stragglers and runs a speculative task for each straggler that will hopefully finish before the straggler > the first to commit becomes the definitive copy and the other task is killed

When should you use TF-IDF over Cosine Similarity?

TF-IDF: 1 term and n docs Cosine sim: k terms and n docs

What architecture is used for task scheduling in MapReduce? Describe what is the JT and TT

Uses master-slave architecture JT = Job Tracker; Master node TT = Task Tracker; Slave node

