Big Data Analytics
Consider the following relational database which consists of two relations (Project and Report). Notice that the attribute finalreport in the relation Project is a foreign key that references the primary key (attribute id) in the relation Report. Notice also that multiple projects may have the same final report.
PROJECT (in table format):
name ; budget ; finalreport
UsMis ; 1M ; 391
AMee3 ; 3.7M ; 391
Bee ; 1.3M ; 121
REPORT (in table format):
id ; pages ; location
121 ; 70 ; p1
391 ; 350 ; p2
699 ; 100 ; p3
Capture all the data in this relational database as: a document database. From Exam 2018-05 Q3a)
"Document ID -> Fields" Collection 1: Project UsMis -> {budget: 1M, finalreport: 391} AMee3 -> {budget: 3.7M, final report: 391} Bee -> {budget: 1.3M, finalreport: 121} Collection 2: Report 121 -> {pages: 70, location:p1} 391 -> {pages: 350, location:p2} 699 -> {pages: 100, location:p3}
The MapReduce construct is very powerful and consists of 7 substeps as presented in the lecture. Which one(s) of these substeps may involve network I/O and for what purpose? From Exam 2019-08 Q8b
1) The record reader: it may need to read an HDFS block from a remote disk. 2) Shuffle-and-sort: it may need to download data with the relevant keys from the mapper nodes' local disks. 3) The output formatter: it writes to HDFS, where at least two of the three replicas will go to remote disks (due to the replication mechanism in HDFS).
What are the ACID guarantees? From lecture slides
Atomicity: ensures that all changes made to data are executed as a single entity and operation (the operation succeeds ONLY when all changes are performed).
Consistency: ensures that all data changes are executed while maintaining a consistent state between the transaction start and end.
Isolation: ensures that each transaction executes as if it were the only transaction operating on the tables or keys.
Durability: once a transaction is completed successfully, it results in a permanent state change that remains persistent even in the event of a system failure.
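A minimal sketch of the atomicity guarantee using Python's built-in sqlite3 module (the table and the failing transfer are invented for illustration): the transaction either applies both balance updates or, on error, none of them.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account(name TEXT PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
con.commit()

try:
    with con:  # opens a transaction; commits on success, rolls back on any exception
        con.execute("UPDATE account SET balance = balance - 150 WHERE name = 'alice'")
        (bal,) = con.execute("SELECT balance FROM account WHERE name = 'alice'").fetchone()
        if bal < 0:  # simulated business rule: balances must not go negative
            raise ValueError("insufficient funds")
        con.execute("UPDATE account SET balance = balance + 150 WHERE name = 'bob'")
except ValueError:
    pass  # the whole transfer was rolled back as one unit (atomicity)

print(dict(con.execute("SELECT name, balance FROM account")))  # {'alice': 100, 'bob': 0}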
Read scalability can be achieved by scaling out, since adding more servers (nodes) to a distributed system makes reading of files faster. Read scalability can also be achieved by scaling up. For instance, adding more RAM to a server may increase the cache of a DBMS, which then may allow the DBMS to handle an increased number of reads with the same performance with which it handled fewer reads when it had a smaller cache due to less RAM. Similarly, upgrading the CPU or replacing a slower hard disk with a faster and bigger one may help to achieve read scalability on a single server. The claim is thus WRONG.
Consider the following claim: While read scalability can be achieved by scaling horizontally (scale out), it cannot be achieved by scaling vertically (scale up) Is this claim correct or wrong? Justify your answer in 2-4 sentences From exam: 2020-06 Q1
Give and explain the CAP theorem. Explain the notions in the CAP theorem. From Multiple exams
Consistency: The same copy of a replicated data item is visible from all nodes that have this item.
Availability: All requests for a data item get a response (the response may be that the operation cannot be completed).
Partition tolerance: The system continues to operate even if it gets partitioned into isolated sets of nodes.
Only 2 of the 3 properties can be guaranteed at the same time in a distributed system with data replication. RDBMS have Consistency and Availability; BASE (NoSQL) systems have Availability and Partition tolerance.
Read scalability: a system can handle an increasing number of read operations without losing performance. Write scalability: a system can handle an increasing number of write operations without losing performance.
Define the notions of read scalability and write scalability From exam: 2019-05 Q2a)
What is weak consistency? From lecture slides
Weak consistency: no guarantee that all subsequent accesses will return the updated value. Eventual consistency: if no new updates are made, eventually all accesses will return the last updated value. Inconsistency window: the period until all replicas have been updated in a lazy manner.
Which parallel algorithmic design pattern (name and short explanation) is used for the parallel reduction algorithm? From Exam 2018-06 Q8a)
Parallel divide-and-conquer: the recursive calls can be done in parallel, and the divide and combine phases can also be parallelized. It requires associativity, since, for example, x0 and x1 may be combined by one node while x2 and x3 are combined by another; the results of these two partial operations are then combined by one of those nodes.
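A minimal plain-Python sketch of the pattern, not the lecture's exact algorithm: the two halves of the operand sequence are reduced independently (here only the top-level split actually runs on two threads) and their results are combined with an associative operator.

from concurrent.futures import ThreadPoolExecutor
from operator import add

def reduce_dc(xs, op):
    # Divide and conquer: split the operands in half, reduce each half,
    # combine the two partial results with the (associative) operator op.
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return op(reduce_dc(xs[:mid], op), reduce_dc(xs[mid:], op))

def parallel_reduce(xs, op):
    # Only the top-level split is run in parallel here; in the full pattern
    # every level of the recursion could be executed on different nodes.
    if len(xs) < 2:
        return reduce_dc(xs, op)
    mid = len(xs) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(reduce_dc, xs[:mid], op)
        right = pool.submit(reduce_dc, xs[mid:], op)
        return op(left.result(), right.result())

print(parallel_reduce([1, 2, 3, 4, 5, 6, 7, 8], add))  # 36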
What is an RDD in Spark? From Exam 2019-05 Q7d)
RDDs are containers for operand data passed between parallel operations. They are read-only (after construction) collections of data objects that can be partitioned and distributed across workers (cluster nodes).
For what type of computations (structural property) will Spark outperform MapReduce considerably, and why? From Exam 2018-06 Q8e)
When the computation accesses the same data iteratively, or when the computation involves a sequence of maps/reductions. MapReduce has to access the data files (HDFS) for every new map/reduce step, while Spark can keep the data in memory, which gives large performance gains.
Why and in what situations can it be beneficial for performance to use a Combiner in a MapReduce instance? From Exam 2019-08 Q8c
When the Reduce function is associative and commutative. Since the key-value pairs are still in the cache/memory of the mapper node, a Combiner reduces the amount of network I/O required, and it aggregates the intermediate data into a smaller volume, so less data has to be shuffled and the Reduce step has less work to do.
How does a distributed file (HDFS) differ from a traditional file (with respect to how it is technically and logically structured, stored and accessed), and what is/are its advantage(s) for the processing of big-data computations over a distributed file compared to a traditional file? From Exam 2019-08 Q8a)
A distributed file is shared among many nodes. An advantage is that very large files can be stored that would not fit on a single regular disk, and the contents of the distributed file can be read in parallel. A distributed file is also replicated on different nodes, from which clients are able to read it; if a node fails, the computation can still proceed due to replication, whereas a traditional file is unavailable if the machine storing it is down.
Write pseudocode for a MapReduce program that reads floating point numbers x_i from an HDFS input file and computes their geometric mean: sqrt(sum x_i^2). Explain the code. Hint: Make sure to follow the MapReduce programming model and clearly distinguish the different program parts in your code. From Exam 2020-08 Q7
The mapper squares each input value and emits it under a single common key, an optional combiner pre-sums the squares locally (summation is associative and commutative), the single reducer adds up all partial sums and takes the square root, and the output formatter writes the result back to HDFS; see the sketch below.
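A possible sketch in Python-style pseudocode, using the formula exactly as stated in the question, r = sqrt(sum_i x_i^2); the function names and the tiny local driver at the end are illustrative, not a real Hadoop job:

from math import sqrt

# Mapper: the record reader passes one line of the HDFS input file at a time.
# Emit the squared value under a single constant key so that all partial
# results end up at the same reducer.
def mapper(_offset, line):
    x = float(line)
    yield ("result", x * x)

# Combiner (optional): pre-aggregate the squares produced by one mapper task.
# Allowed because summation is associative and commutative.
def combiner(key, values):
    yield (key, sum(values))

# Reducer: receives all partial sums for the key "result",
# adds them up and takes the square root.
def reducer(key, values):
    yield (key, sqrt(sum(values)))

# Tiny local driver that simulates the framework, just to show the parts fit:
lines = ["3.0", "4.0"]
intermediate = [kv for line in lines for kv in mapper(None, line)]
combined = list(combiner("result", [v for _, v in intermediate]))
print(list(reducer("result", [v for _, v in combined])))  # [('result', 5.0)]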
Describe a concrete application / use case for which read scalability is important but data scalability is not important. From Exam 2020-08 Q11
A video-hosting service is an example of this. Normally content will not be uploaded at increasing rates, while a single video may go viral and require a rapidly increasing number of reads.
What is an (RDD) "action" in Spark? Give also one example operation of an action. From Exam 2019-08 Q8e
An Action in Spark is an operation with an internal global dependence structure, such as collect(), reduce() or save(). It is the operation that triggers the evaluation of the preceding Transformations, and it usually involves some I/O.
What is strong consistency? From lecture slides
After an update completes, every subsequent access will return the updated value (this is the notion of consistency referred to in the CAP theorem).
Characterize the type and structure of applications that are expected to perform significantly better when expressed in Spark than in MapReduce, and explain why. From Exam 2020-08 Q9
Applications that require a sequence of Map/Reduce operations are expressed significantly better in Spark. This is due to a few reasons: 1) Since a data flow graph is defined and transformations are evaluated lazily, the scheduler can plan data locality better, and the data required by a connected graph node can be kept in memory instead of being unnecessarily written to disk. 2) No replication of data blocks for fault tolerance is required in Spark: if a task fails, its result can simply be recomputed from earlier data blocks by following the data flow graph.
What (mathematical) properties do functions need to fulfill that are to be used in Combine or Reduce steps of MapReduce, and why? From Exam 2019-08 Q8a
Associative: the grouping of the operations does not matter as long as the overall sequence of operands is kept the same, e.g. (2+2)+4 = 2+(2+4). Commutative: the order of the operands does not matter, the result is the same, e.g. 2+4+2 = 2+2+4. These properties are necessary because the Combine and Reduce steps are executed in parallel, so partial results are combined in an order and grouping that cannot be predicted; without these properties the result would depend on that order.
Why is Spark more suitable than MapReduce for implementing many machine learning algorithms? From Exam 2019-11 Q7
Because Spark evaluates Transformations lazily and keeps intermediate results in memory. Many ML algorithms optimize parameters iteratively through a loss function, which involves passing the data through the model many times and adjusting the parameter values in each pass. If this were done in MapReduce, each pass over the data would involve disk I/O, and thus performance losses compared to Spark, where the data can be distributed in memory across all nodes (e.g. a cached RDD) and each node can reuse that data for its part of the computation in every iteration.
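A minimal PySpark sketch of such an iterative computation (assumes pyspark is installed; the data, learning rate and the names points/w are invented for illustration): gradient descent for a 1-D linear model, where the cached training RDD is re-read from memory in every iteration.

from pyspark import SparkContext

sc = SparkContext("local[*]", "IterativeGradientDescent")

# (x, y) samples generated from y = 3x; cached so every iteration
# re-reads them from memory instead of from disk/HDFS.
points = sc.parallelize([(float(x), 3.0 * x) for x in range(1, 101)]).cache()
n = points.count()

w = 0.0
for _ in range(20):
    # One full pass over the cached data per iteration (map + reduce).
    grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]).sum() / n
    w -= 0.0001 * grad  # small fixed learning rate

print(w)  # converges towards 3.0
sc.stop()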
Why is it important to consider (operand) data locality when scheduling tasks (e.g. mapper tasks of a MapReduce program) to nodes in a cluster? From Exam 2019-08 Q8c)
Because the file blocks are distributed, it is more efficient to schedule a mapper task on a node that already stores the relevant block on its local disk (or has it in memory) than to send the block over the network to another node.
Assume a distributed database system in which every data item is stored on 3 nodes. For such a system to report to an application that a write operation has finished, what is the number of nodes that are required to complete the write successfully if the system aims to achieve the consistency property as per the CAP theorem? From Exam 2019-05 Q5
Every data item is stored on 3 nodes -> N = 3. Consistency per CAP requires W = N, where W is the number of nodes that need to confirm the successful write operation. Thus the system requires all 3 nodes to complete the write before it can be reported as finished. Another perspective: with quorum-based replication, write consistency requires 2*W > N, which with N = 3 gives W = 2, but then reads must also use a read quorum (R + W > N) to guarantee that the latest value is seen.
Data scalability is important for applications where the amount of data is increasing and must be handled without performance loss. For instance, in a self-driving car the system must be able to handle increasing size and frequency of image and sensor data. If sensor data cannot be stored and processed fast enough, the autonomous system could make incorrect computations and crashes could occur. Maybe another example for this?
Describe a concrete application / use case for which data scalability is important. From Exam: 2019-08 Q4
A streaming server is an example of this. Normally content is written once but accessed by millions.
Describe an example use case / application for which read scalability is important but write scalability is not. From exam: 2019-05 Q2b)
i) For key-value stores, only CRUD operations (create, retrieve, update, delete) with regard to keys are possible. It is not possible to query the values directly since they are opaque to the system (no secondary index over values). Usually key-value stores are "queried" from an application program and not through a query language such as SQL. ii) In key-value stores, data partitioning is implemented by sharding (horizontal partitioning) based on the keys: given a set of key-value pairs, the data is partitioned among the nodes in the cluster, e.g. by hashing the keys. iii) The key-value data model is well suited for horizontal scalability since it is easy to partition the data and distribute it over the nodes in the cluster. Because the key-value model does not have any relations (such as in an SQL schema), partitioning across different nodes is efficient and easy. With increasing amounts of data (key-value pairs), adding more nodes (scaling out) therefore naturally maintains performance.
Describe i) the types of queries and ii) the form of data partitioning implemented in key-value stores (1p) and iii) explain how these things are related to achieving horizontal scalability From Exam 2019-05 Q4
Write scalability can be achieved by upgrading to faster SSDs / hard disks. Faster storage makes writes faster, so the system can handle an increasing number of writes without losing performance.
Describe in 2-4 sentences how write scalability can be achieved by scaling vertically (scale up) From exam: 2020-08 Q3
YES
Do you hate the Big Data Analytics course? From self-reflection
How can the MapReduce user control (i) the grain size of work and (ii) the degree of parallelism? From Exam 2018-11 Q6d)
(i) The grain size of work is controlled via the size of the input splits (HDFS block size), since each mapper task processes one split. (ii) The degree of parallelism is controlled by the number of mapper tasks (which follows from the number of splits) and, in particular, by the number of reducer tasks, which the user can set explicitly.
Describe the execution model of Spark programs. In particular, there exist 2 different kinds of processes, driver and workers. Explain in general which operations of a Spark program are executed by each of them. From Exam 2019-05 Q7g)
Driver: creates mapper and reducer tasks and dispatches them to workers using dynamic load balancing. It builds the RDD lineage and keeps track of where data is cached in the workers' memory. Workers: execute the mapper/reducer tasks received from the driver and return the results to the driver after an action. A worker can persist (cache) data in its node's memory.
Where are the data elements of a Spark RDD stored when evaluating a lineage of RDDs? From Exam 2020-08 Q8)
Each partition's elements are stored in main memory on each node computing that partition of the RDD.
Describe (by an annotated drawing and text) the hardware structure of modern hybrid clusters (used for both HPC and distributed parallel big-data processing). In particular, specify and explain their memory structure and how the different parts are connected to each other. From Exam 2019-08 Q8b) and 2019-05
Hybrid clusters are a collection of cluster nodes/computers (DMS) where each node consists of multiple processors sharing the same memory (SMS). The processors within a node and the nodes in a cluster are connected by a network. The control structure can be characterized as MIMD (Multiple Instruction streams, Multiple Data streams), whereas the memory is characterized as a hybrid memory structure. The nodes do not share memory with each other, so together they are referred to as a distributed memory system (DMS); they rely on message passing for communication between each other. The nodes themselves are multiprocessors whose processors share memory, an SMS (Shared Memory System). MIMD means that the whole cluster operates on several data streams and executes different instructions at the same time, which allows parallelism to be exploited.
What is BASE? From lecture slides
Idea: by giving up ACID guarantees, one can achieve much higher performance and scalability in a distributed database system.
Basically Available: the system is available whenever accessed, even if parts of it are unavailable.
Soft state: the distributed data does not need to be in a consistent state at all times.
Eventually consistent: the state will become consistent after a certain period of time.
The BASE properties are suitable for applications for which some inconsistency may be acceptable.
Why is fault tolerance an important aspect in big-data computations? Compare the fault tolerance mechanisms in MapReduce and Spark to each other. From Exam 2020-08 Q16
In big-data computations, fault tolerance is important in order not to lose the progress of a long computation if part of the system fails. In MapReduce, fault tolerance is handled by replicating blocks of data, which involves I/O for transferring the replicas to other nodes. In Spark, fault tolerance is handled by recomputing lost RDD partitions following the RDD lineage information in the data flow graph. Some previously computed partial results may still be in memory, which makes recomputation rather quick compared to MapReduce.
We know that Spark offers support for stream computing, i.e. computing on very long or even infinite data streams. Which fundamental property of stream computations makes it possible to overlap computation with data transfer? From Exam 2020-08 Q16
In streaming / stream computing, we do not need to have all input data in place to start the computation; there is no data dependence on "previous" elements in the stream. Instead the system works forward on the stream as data comes in, in portions of arbitrary size. This allows one block of data to be processed while the next block of the stream is already being received and aggregated in parallel.
How does Spark-streaming differ from ordinary Spark? For what type of computations can it be suitably used? What is "windowing" in Spark streaming, and what is its (technical) purpose? From Exam 2019-08 Q8f
It differs from ordinary Spark in that it allows live data streams to be processed efficiently, with scalability and fault tolerance. It is suitable for computations where new data is generated very quickly. Spark Streaming is a daemon process that runs forever in the background: it keeps reading the stream of input data, creates batches of input data (discretizing the continuous data stream), and generates a stream of output for each batch to be processed by regular Spark. "Windowing" in Spark Streaming is a sliding window over the discretized data stream: an operation is applied to the batches (RDDs) of the DStream that fall within the window, then the window slides forward by the step size and the same operation is applied to the next window. It can be used to aggregate data over a larger time frame than the DStream's batch interval.
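A minimal PySpark Streaming sketch using the DStream API as presented in the lecture (assumes pyspark is installed and a text source on localhost:9999, e.g. started with nc -lk 9999): word counts over a sliding window of 30 seconds that moves forward every 10 seconds, on a stream discretized into 5-second batches.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WindowedWordCount")
ssc = StreamingContext(sc, batchDuration=5)      # discretize the stream into 5 s batches

lines = ssc.socketTextStream("localhost", 9999)  # the live input DStream
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b, None,
                                     windowDuration=30, slideDuration=10))

counts.pprint()          # print the word counts of each 30 s window, every 10 s
ssc.start()
ssc.awaitTermination()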
From a performance point of view, is it better to have long lineages or short lineages in Spark programs? Motivate your answer. From Exam 2020-08 Q15
It is better to have long lineages, since disk I/O is saved at every transformation within the lineage, which Spark can instead perform on in-memory data. The longer the lineage, the more time is saved, due to Spark's lazy evaluation of lineages with in-memory buffering of intermediate results.
What does the collect() operation in Spark do with its operand RDD? From Exam 2019-11 Q6f)
It retrieves all the elements of the RDD from the worker nodes to the driver node and returns them as a list.
Describe the fault tolerance mechanisms in MapReduce From Exam 2018-11 Q6c)
MapReduce uses replicas for fault tolerance. This means distributing 3 copies (replicas) of each block to different nodes. The Master node pings the workers periodically; if a worker dies, the Master reassigns its workload to one of the nodes holding a replica, usually the closest node or the one with the lowest current workload.
a) In the lecture, the MapReduce mechanism was referred to as the "Swiss Army Knife" of distributed parallel big-data computing over distributed files. Why? b) Explain how the Spark programming interface differs, and what benefit could be expected. From Exam 2020-08 Q15
a) MapReduce is a Swiss Army Knife for parallelizing computations since it is a sort of super-skeleton which provides several functionalities (map, sort, reduce) in a single programming construct. By a sequence of MapReduce steps, basically any distributed computation can be emulated. b) Spark splits this super-skeleton functionality into separate, simpler operations (Transformations, Actions) which can be composed directly (by forming RDD lineages); this is an enabling feature for forwarding partial results in memory rather than via distributed files, which would involve unnecessary I/O as in a sequence of MapReduce operations.
Give an example of some computation of your choice (high-level description, pseudocode..) that can be expressed with the MapReduce programming model but requires no Reduce functionality? From Exam 2017-01 Q7
One example is a map-only job, e.g. applying a trained classification model (i.e. classifying all your data with that classifier). We do not need any Reduce functionality there: we just load the classifier, split the data, and do the classification inside the map() function of each mapper. After that we write the results out directly, since each mapper output is a relevant end result. Therefore we do not need to aggregate the results and hence do not need to apply the Reduce functionality.
What is a Resilient Distributed Dataset (RDD) in Spark? From Exam 2017-01 Q9
RDDs are Spark's fundamental data structure, i.e. containers for operand data passed between parallel operations; they are immutable and distributed. They are read-only (after construction) collections of data objects that are partitioned and distributed across workers (cluster nodes). They are materialized on demand from their construction description and can be rebuilt if a partition (data block) is lost. They are, by default, cached in main memory and not persistent (in secondary storage) until written back, and are therefore fast to access. An RDD consists of (a reference to) the data and a lineage graph; the lineage graph records the transformations and actions that produced it, which makes the RDD fault tolerant since it can be recreated from the lineage graph.
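A minimal PySpark sketch (assumes pyspark is installed) illustrating these properties: the RDD is split into partitions, transformations create new RDDs instead of modifying existing ones, and the lineage can be inspected.

from pyspark import SparkContext

sc = SparkContext("local[4]", "RDDDemo")

rdd = sc.parallelize(range(1, 13), numSlices=4)  # construct an RDD with 4 partitions
print(rdd.getNumPartitions())                    # 4
print(rdd.glom().collect())                      # the elements grouped per partition

squares = rdd.map(lambda x: x * x)               # a new RDD; rdd itself is unchanged
squares.cache()                                  # keep the result in memory for reuse
print(squares.toDebugString())                   # textual view of the RDD lineage
sc.stop()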
Which substeps of the MapReduce construct involve disk I/O, and for what purpose? From Exam 2019-05 Q7b)
Record reader: reads blocks from disk (HDFS) and parses them into key-value pairs.
Partitioner: partitions the intermediate results from the mapper/combiner and writes them to local disk.
Shuffle-and-sort: reads those blocks remotely from the mappers' local disks and copies them to the node where the reducer is running.
Output formatter: receives the final key-value pairs from the reducer and writes them to the global file system (HDFS).
How can Spark be used with input data that arrives in a continuous stream (e.g. from external sensors over the network)? In particular, how can such a data stream be structured for processing by Spark? From Exam 2019-05 Q7f)
Spark has an extension for processing data streams. It uses a high-level abstraction for a continuous data stream called a DStream (discretized stream). Essentially, it is a continuous series of RDDs, which allows the usual RDD operations to be applied to each batch (stream pipelining). There is also a windowing functionality where the DStream's batches can be aggregated over a defined window with a window length and a window step (slide interval).
The execution of a MapReduce operation involves 7 phases. Which of these phases of MapReduce may involve network I/O, and for what purpose? From Exam 2020-08 Q6
Since the question regards NETWORK I/O: 1) The record reader: it may need to read an HDFS block from a remote disk. 2) Shuffle-and-sort: it may need to download data with the relevant keys from the mapper nodes' local disks. 3) The output formatter: it writes to HDFS, where at least two of the three replicas will go to remote disks (due to the replication mechanism in HDFS).
For a Spark program consisting of 2 subsequent Map computations, show how Spark execution differs from Hadoop/MapReduce execution and explain why Spark execution is usually more efficient. From Exam 2017-01 Q10)
Spark is more efficient for a number of reasons. 1) Fault tolerance mechanism: Spark does not use data replication for fault tolerance, which means that there is no I/O involved in replicating data. 2) In MapReduce, the output of the first Map computation is written to files in HDFS and then read again for the second; in Spark, on the other hand, the output of the first Map is kept in memory and passed directly to the second. Might be something more here to get the full 2p
What is the RDD lineage graph and how is it used in Spark for the efficient executing of Spark programs? From Exam 2019-08 Q8d
The RDD lineage graph (data flow graph) describes how to compute all intermediate and final results from the initial input data. It is used to avoid replication of data for fault tolerance and to allow the scheduler to better keep track of data locality in addition to allowing data to be kept in memory instead of written to disks.
Consider the following claim: In master-slave replication, a replica (copy) of a database object at a slave node never changes. Is this claim TRUE or FALSE? From Exam 2020-08 Q4
The claim is FALSE. In a master-slave system, changes to a database object are addressed to the node that is the master of the database object in question. This master node then requests all corresponding slave nodes to change their copy of the database object accordingly. This makes sense: if a replica never changed, the system would become inconsistent as soon as the master's copy is updated.
The execution of a MapReduce operation involves seven phases. Which of these phases of MapReduce may involve disk I/O a) to/from HDFS, b) not to/from HDFS, and for what purpose? From Exam 2020-08 Q7
The seven phases are: Map phase: 1) Record reader (input reader) 2) Mapper 3) Combiner (optional) 4) Partitioner; Shuffle phase: 5) Shuffle-and-sort; Reduce phase: 6) Reducer 7) Output formatter.
a) The record reader reads a block from HDFS; the output formatter writes the final results to HDFS.
b) The partitioner writes file(s) with key-value pairs to local disk; the shuffle-and-sort downloads the relevant files from the disks where the mappers' partitioners had written them.
Spark classifies its function on RDDs into two main categories: "Transformations" and "Actions". Describe the main differences between these categories, and give one example operation for each category. From Exam 2020-08 Q8
Transformations are element-wise operations that are fully parallelizable; they are typically variants of Map. The evaluation of transformations is done lazily: a transformation is only added to the data flow graph and is evaluated only after an Action is called. Actions are global operations, such as Reduce or writing back to a non-distributed file / to the master. When an action is called, all the transformations required by the data flow graph are executed, and the data from the various nodes is pulled together to produce the action's result. We can thus say that Actions are evaluated immediately.
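A minimal PySpark sketch (assumes pyspark is installed): filter and map are Transformations that only extend the lineage lazily, while reduce and collect are Actions that trigger the actual computation.

from pyspark import SparkContext

sc = SparkContext("local[*]", "TransformationsVsActions")

nums = sc.parallelize(range(1, 11))
evens = nums.filter(lambda x: x % 2 == 0)  # Transformation: lazy, returns a new RDD
doubled = evens.map(lambda x: 2 * x)       # Transformation: lazy, extends the lineage

print(doubled.reduce(lambda a, b: a + b))  # Action: triggers the computation, prints 60
print(doubled.collect())                   # Action: [4, 8, 12, 16, 20]
sc.stop()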
Why and in what situations can it be beneficial for performance to use Combiners in a MapReduce instance? From Exam 2019-11 Q6c)
Using a Combiner that aggregates a Mapper's intermediate elements with the same key before passing them on to the Reducer can be beneficial because the node that produced the intermediate elements still has the key-value pairs in memory, so no extra I/O is needed to perform the Combiner operation. In addition, the output of the Combiner is smaller than without it, so the files from the Mapper phase are smaller and can be transferred faster to the Reducer nodes. To use a Combiner, the reduce function must be commutative and associative.
Explain and compare the concepts of vertical and horizontal scalability From Exam: 2017-01 Q2
Vertical scalability means adding resources to the current nodes, typically by increasing/upgrading RAM, disks, or CPUs. Horizontal scalability means adding more nodes to the distributed system. Vertical scaling is usually cheaper but is limited by the available hardware, and there is a possibility of downtime while the server is being upgraded; a threshold for vertical scalability is easily reached, as there may, for instance, not exist faster/bigger memory hardware. Scaling horizontally, on the other hand, is easy from a hardware perspective: simply add a new node and connect it to the distributed system. It is, however, more expensive to buy a whole new node than to upgrade an existing one, and from a maintenance perspective, having many servers may also increase maintenance costs.
Give and explain 4 V's (big data properties) and give an example for each. From multiple exams
Volume - the size of the data. On a social media platform there are millions of users, and both operational data such as their email, interests, etc. and likes, comments, pictures, and videos need to be stored, which creates huge amounts of data for each user.
Variety - the type and nature of the data. In an e-commerce store, data could be in many different formats: some could be operational/transactional and include temporal aspects, while other data could be free text such as reviews.
Velocity - the speed of generation and processing of data. Again, on a social media platform the incoming messages, comments, likes, etc. are generated extremely fast with many users.
Veracity - the uncertainty (quality and availability) of the data. Consider the data in a large retail chain: massive amounts of data are collected, from products bought by customers to modes of payment, online product searches, product comparisons, etc. To use this data to support decision-making, it is important that the data is reliable, up-to-date and organized.
Describe one advantage that using a document store has, in comparison to using a key-value store. From Exam 2020-08 Q2
We can group documents into separate sets (collections). There may exist secondary indexes over fields in the documents, and we can query in terms of conditions on those fields. This is useful whenever we have items of a similar nature but slightly different structure.
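A minimal sketch of such a query using pymongo (assumes the pymongo package and a MongoDB server running on localhost; the database and collection names are invented): a condition over a field inside the documents, which a plain key-value store could not evaluate.

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
reports = client["examdb"]["reports"]       # a collection of report documents

reports.insert_many([
    {"_id": 121, "pages": 70, "location": "p1"},
    {"_id": 391, "pages": 350, "location": "p2"},
    {"_id": 699, "pages": 100, "location": "p3"},
])
reports.create_index("pages")               # secondary index over a value field

# Query on the contents of the documents, not only on the key:
for doc in reports.find({"pages": {"$gte": 100}}):
    print(doc)                              # the reports with 100 pages or more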
Consider the following key-value database which contains four key-value pairs where the keys are user IDs and the values consist of a username, the user's year of birth, and an array of IDs of users that the current user likes. "alice_in_se" -> "Alice, 1987, [bob95 charlie]" "bob95" -> "Bob, 1995, [charlie]" "charlie" -> "Charlie, 1996, []" "selaya" -> "Alice, 1974, [charlie]" Describe how the types of queries typically implemented in key-value store can be used to retrieve the names of all users liked by a user, specified by his/her ID (i.e. the ID of that user is given as input) From Exam 2020-08 Q1
We have to write application code that first issues a get(key) query with the given user ID as the key. From the value retrieved as a result of this query, the application has to extract the array of IDs of the users liked by the given user. Thereafter, the application has to iterate over this array and, for each user ID in it, issue another get(key) query. From the values retrieved by these queries, it can extract the names of the users liked by the given user.
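A minimal plain-Python sketch of that application-side logic, using a dict with get() as a stand-in for the key-value store (in a real store the value would be an opaque string that the application must parse itself):

# The key-value store simulated by a dict; values are kept as parsed tuples here.
kv_store = {
    "alice_in_se": ("Alice", 1987, ["bob95", "charlie"]),
    "bob95": ("Bob", 1995, ["charlie"]),
    "charlie": ("Charlie", 1996, []),
    "selaya": ("Alice", 1974, ["charlie"]),
}

def names_liked_by(user_id):
    _name, _year, liked_ids = kv_store.get(user_id)  # first get(key) query
    names = []
    for liked_id in liked_ids:                       # one further get(key) per liked user
        liked_name, _, _ = kv_store.get(liked_id)
        names.append(liked_name)
    return names

print(names_liked_by("alice_in_se"))  # ['Bob', 'Charlie']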
Does a combiner function need to be commutative? Explain why or why not. From Exam 2020-08 Q14
Yes, from a global view across nodes it is required: the same key may get values from different blocks (e.g. at block borders), and the order in which these values are combined cannot be guaranteed, so the combine function must be commutative (in addition to associative) for the result to be correct. Within a mapper task, a commutative combiner also gives the framework more flexibility in when and how to apply the combining.
Recall that the CAP theorem considers three possible properties of distributed systems: Consistency, Availability and Partition Tolerance. Specify a) what each of these properties means b) What the CAP theorem says about them
a) Consistency: the same copy of a replicated data item is visible from all nodes that have this item. Availability: all requests for a data item get a response (the response may be that the operation cannot be completed). Partition tolerance: the system continues to operate even if it gets partitioned into isolated sets of nodes.
b) Only 2 of the 3 properties can be guaranteed at the same time in a distributed system with data replication. Bonus: RDBMS have Consistency and Availability; BASE (NoSQL) systems have Availability and Partition tolerance.
The Combiner is optional in MapReduce. a) What does the Combiner do? b) Why could it be omitted from a correctness point of view? c) In what kind of scenarios can it be beneficial and why? From Exam 2017-01 Q8
a) The Combiner is an optional local reducer run in the mapper task as a postprocessor; it is a kind of "pre-reducer" inside the map task. It applies a user-provided function to aggregate the values in the intermediate elements of one mapper task. b) The reduction/aggregation could also be done entirely by the reducer, so the Combiner can be omitted without affecting correctness. c) Local reduction can improve performance considerably. Data locality: the key-value pairs are still in the cache resp. memory of the same node, so no I/O overhead is involved in the reduction. Data reduction: the aggregated information is usually smaller, which requires less network I/O since the Mapper results sent to the Reducers are smaller. It is applicable if the user-defined Reduce function is commutative (the order of the operands does not matter) and associative (the operands can be grouped in different ways and still give the same result). It is recommended if there is significant repetition of intermediate keys produced by each Mapper task.
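A minimal plain-Python sketch of the data-reduction effect for word count (not a real Hadoop job; the function names are illustrative): the combiner aggregates each mapper's (word, 1) pairs locally, so fewer pairs have to be shuffled to the reducer.

from collections import Counter

def mapper(block):                   # one input block -> intermediate (word, 1) pairs
    return [(word, 1) for word in block.split()]

def combiner(pairs):                 # local reduction inside the mapper task
    agg = Counter()
    for word, count in pairs:
        agg[word] += count
    return list(agg.items())

def reducer(pairs):                  # global reduction after the shuffle
    agg = Counter()
    for word, count in pairs:
        agg[word] += count
    return dict(agg)

blocks = ["to be or not to be", "to do or not to do"]
intermediate = [mapper(b) for b in blocks]

pairs_without_combiner = sum(len(p) for p in intermediate)   # 12 pairs to shuffle
combined = [combiner(p) for p in intermediate]
pairs_with_combiner = sum(len(p) for p in combined)          # only 8 pairs to shuffle

print(pairs_without_combiner, pairs_with_combiner)
print(reducer([kv for p in combined for kv in p]))  # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'do': 2}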
Consider the following key-value database which contains four key-value pairs where the keys are user IDs and the values consist of a username, the user's year of birth, and an array of IDs of users that the current user likes. "alice_in_se" -> "Alice, 1987, [bob95 charlie]" "bob95" -> "Bob, 1995, [charlie]" "charlie" -> "Charlie, 1996, []" "selaya" -> "Alice, 1974, [charlie]" a) Describe how the given key-value database can be changed/extended such that the data retrieval request can be done more efficiently (still key-value db). Your description has to say explicitly how the data retrieval request can be done now after the change of the database. b) Discuss the pros and, in particular the cons of your solution to change/extend the database From Exam 2020-08 Q10
a) We may extend the existing key-value database by adding another array to each value such that this array contains the names of all users liked by that user. For example: "alice_in_se" -> "Alice, 1987, [bob95 charlie], [Bob, Charlie]". Thus we can use a single get(key) query to get the names of the people someone likes. b) The advantage is that querying is more efficient. The disadvantage is that, due to the added redundancy, the extended database requires more storage space, and updating the database without introducing inconsistencies becomes more difficult (e.g. if a user changes their name, every value that embeds that name must be updated).
Consider the following relational database which consists of two relations (Project and Report). Notice that the attribute finalreport in the relation Project is a foreign key that references the primary key (attribute id) in the relation Report. Notice also that multiple projects may have the same final report.
PROJECT (in table format):
name ; budget ; finalreport
UsMis ; 1M ; 391
AMee3 ; 3.7M ; 391
Bee ; 1.3M ; 121
REPORT (in table format):
id ; pages ; location
121 ; 70 ; p1
391 ; 350 ; p2
699 ; 100 ; p3
Capture all the data in this relational database as: a key-value store. From Exam 2018-05 Q3a)
key -> [values]:
UsMis -> [1M, 391, 350, p2]
AMee3 -> [3.7M, 391, 350, p2]
Bee -> [1.3M, 121, 70, p1]
699 -> [ , 699, 100, p3] (report 699 is not referenced by any project, so it is stored under its own id with an empty budget field)