ISM4211 Quizzes Exam 2

You are presented with the following use case, which database type seems most appropriate: A company wants to extract new insights from their existing data by exploring hidden connections/relationships between different entities in their data. a) An Object Relational Mapper b) A document database like MongoDB c) Graph database e.g. Neo4J d) MySQL

c) Graph database e.g. Neo4J

One of the following distributed NoSQL databases does not use a master/slave architecture a) NoSQL b) MongoDB c) HBase d) Cassandra

d) Cassandra

You are presented with the following use case, which database type seems most appropriate: A company wants to manage all global data (multiple countries) in multiple regional data centers for maximum availability a) Neo4J b) MySQL c) Python d) Cassandra

d) Cassandra

Match the following document database concepts with the relational database concept MOST SIMILAR to it

A document (in document databases): a row (in relational databases)
A collection (in document databases): a table (in relational databases)
The key of a document (in document databases): a primary key (in relational databases)
The document structure (in document databases): an ERD schema (in relational databases)

The storage system used by Hadoop is called: a) Mapper b) MapReduce c) YARN d) HDFS (Hadoop Distributed File System)

HDFS (Hadoop Distributed File System)

What happens when new information is written to the common column-family database described by the popular Google paper?

The old value is not overwritten. Rather, a new value is added along with a timestamp
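
The append-with-timestamp behavior can be sketched in a few lines of Python. This is a toy in-memory model, not Bigtable's actual storage engine; the class name `VersionedCell`, the row key, and the column name are all hypothetical.

```python
from collections import defaultdict


class VersionedCell:
    """Toy cell store: every write appends a (timestamp, value) pair."""

    def __init__(self):
        # (row_key, column) -> list of (timestamp, value) pairs
        self._versions = defaultdict(list)

    def put(self, row_key, column, value, timestamp):
        # Append instead of overwriting, so earlier values stay readable.
        self._versions[(row_key, column)].append((timestamp, value))

    def get(self, row_key, column):
        # Return the value with the highest timestamp, or None if absent.
        versions = self._versions.get((row_key, column))
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

    def history(self, row_key, column):
        # All stored versions, newest first.
        return sorted(self._versions.get((row_key, column), []), reverse=True)


store = VersionedCell()
store.put("user42", "profile:city", "Tampa", timestamp=1)
store.put("user42", "profile:city", "Miami", timestamp=2)
latest = store.get("user42", "profile:city")
all_versions = store.history("user42", "profile:city")
```

A read returns the newest value by default, while the older version remains available through `history`.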

The full meaning of YARN is

Yet Another Resource Negotiator

You are testing out a new database and you would like to set it up on a cloud service. Which of the following is possible from your understanding of the cloud? a) If your laptop is not a UNIX based system (e.g. Red Hat Linux), you are limited to certain technologies b) You have to purchase licenses up front that can run you thousands of dollars c) You can pay just a few dollars to give this database a test run for a few hours d) You have to buy a cloud from Amazon.com, Alibaba.com or the Microsoft Store at Best Buy first

c) You can pay just a few dollars to give this database a test run for a few hours

In a graph database that stores information about a group of people who are on the same online social network (e.g. Facebook), which of the following bits of information can be used as an EDGE between people (vertices)? a) Appearing in a photo together b) A comment of one person on the other's post c) Similar music interests d) All of the above

All of the above

Consider the following Graph database schema. Which one of the following is a relationship that can be bi-directional?
              [Person]
           /     |     \
     acted in directed produced
           \     |     /
              [Movie]
a) Produced b) Directed c) Acted in d) All of the above

All of the above (Produced, Directed, Acted in)

The main disadvantage that arises from the similarities between column-family databases and relational databases is that column-family databases like Cassandra and HBase simply CANNOT work on commodity hardware. Companies still have to shell out big bucks for high-end servers.

False, Cassandra is designed for linear, incremental scalability on top of commodity hardware. Companies do not have to spend much for high-end servers anymore

A document database is simply a key-value database whose value has some structure / definition

False, both are NoSQL databases but document databases do not require data modelers to formally specify the structure of documents.

On-demand resources means being able to PAY FOR and USE only the EXACT AMOUNT of a resource you require, and only for as long as you want. But such a feature is NOT POSSIBLE with the cloud!

False, on-demand cloud pricing allows you to pay for compute capacity by the hour with no long-term commitments

While working on a project, a developer spent a few weeks modeling the intended data schema before writing any code. This person was most likely working with a document database.

False, document databases do not need specific schemas

The MATCH statement in Neo4J is the same as the WHERE statement in SQL

False, it would be SELECT in SQL

Consider the following Graph database schema. Which one of these represents a Vertex/Node?
              [Person]
           /     |     \
     acted in directed produced
           \     |     /
              [Movie]
a) Produced b) Movie c) Directed d) Acted in

Movie. Person is also a vertex, but it is not one of the options.

Graph databases are much faster for querying relationships in a database because they do not require complex JOINS.

True, instead of performing joins, you follow edges from vertex to vertex
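
Following edges from vertex to vertex can be sketched with a plain adjacency list. The names and the helper `within_hops` below are hypothetical; a real graph database such as Neo4j uses its own storage and query engine, so this only illustrates the idea of traversal without joins.

```python
from collections import deque

# Hypothetical friendship graph stored as an adjacency list.
friends = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dev"],
    "Cara": ["Ann"],
    "Dev": ["Bob"],
}


def within_hops(graph, start, max_hops):
    """Return {person: hops} for everyone reachable within max_hops edges."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if seen[person] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in graph[person]:
            if neighbour not in seen:
                seen[neighbour] = seen[person] + 1
                queue.append(neighbour)
    del seen[start]  # the start vertex is not its own friend
    return seen


friends_of_friends = within_hops(friends, "Ann", 2)
```

Each step is a direct lookup of a vertex's neighbours, where a relational design would typically join a friendship table against itself once per hop.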

The Cloud sounds a lot like AirBnB for hardware servers and processing capacity

True

If we are attempting to create a company to compete with Uber, we may want to set up our servers using Hadoop because: a) It would enable fast operations while minimizing network usage b) It would reduce the processing time of the requests on the system by having multiple systems address each request c) It would help prevent data loss by keeping distributed copies of each file d) All of the above

All of the above

Hadoop uses a peer to peer network structure very similar to Cassandra.

False

Column-family databases strive to keep similar pieces of information close together on disk since that is the manner in which data is most likely used in practice anyway.

True

Hadoop is an ecosystem, not a single product

True

One benefit of Hadoop is that you can achieve fast processing of large amounts of data using clusters of regular computers

True

An on-demand pay per use model in the cloud is ALWAYS BETTER than running your own servers locally.

False

Both a document database and a key-value database require STRUCTURED VALUES and UNIQUE KEYS

False, structured values are not required for document databases

Which of the following is NOT a characteristic of The Cloud and NOT a part of the definition of The Cloud according to NIST (National Institute of Standards and Technology)? a) Shared pool of computing resources (e.g. networks, servers, storage) b) Rapid provisioning and release c) On-demand network access d) Long term contract and commitment

Long term contract and commitment

Which of the following lines of SQL code will retrieve the SAME INFORMATION stored in a relational database? MongoDB Syntax: db.books.find({ "author": "Kurt Vonnegut, Jr." });

SELECT * FROM books WHERE author = 'Kurt Vonnegut, Jr.';
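
The equivalence between the two queries can be checked with a small Python sketch. It assumes an in-memory SQLite table and a plain list of dictionaries standing in for a MongoDB collection; the helper `find` and the sample rows are hypothetical and only mimic an equality filter, not MongoDB's full query language.

```python
import sqlite3

# Hypothetical sample data used only for illustration.
books = [
    {"title": "Slaughterhouse-Five", "author": "Kurt Vonnegut, Jr."},
    {"title": "Cat's Cradle", "author": "Kurt Vonnegut, Jr."},
    {"title": "Beloved", "author": "Toni Morrison"},
]


def find(collection, query):
    # Document-style filter: keep documents whose fields equal the query's.
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in query.items())]


doc_hits = find(books, {"author": "Kurt Vonnegut, Jr."})

# Relational equivalent: the same filter expressed as a WHERE clause.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT)")
conn.executemany("INSERT INTO books VALUES (:title, :author)", books)
sql_hits = conn.execute(
    "SELECT * FROM books WHERE author = ?", ("Kurt Vonnegut, Jr.",)
).fetchall()
conn.close()
```

Both approaches return the same two books; the query document plays the role of the WHERE clause.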

Match the language to the appropriate code for the code looking for actors who have collaborated with each other.

SQL = SELECT n, f, m FROM actors
Python = for (actor, actor2) in actors: print(actor, " acted with ", actor2)
Neo4J's Cypher = MATCH (n:Actors)-[f:Acted_With]-(m:Actors) RETURN n,f,m

What is the name of the Python package used to run this MapReduce code?
    from mrjob.job import MRJob                      # Import
    class MRRatingCounter(MRJob):                    # Create class
        def mapper(self, key, line):                 # MAP
            (userID, movieID, rating, timestamp) = line.split('\t')
            yield rating, 1
        def reducer(self, rating, occurences):       # REDUCE
            yield rating, sum(occurences)            # Perform operation
    if __name__ == '__main__':
        MRRatingCounter.run()                        # Run it
a) mrjob b) MRRatingCounter c) Occurences d) reducer

a) mrjob

You are presented with the following use case, which technology/database type seems most appropriate: A business analyst has an incredibly large dataset (over 500 GB) that needs to be explored and analyzed. a) Key/Value e.g. Redis b) Spark running on Hadoop c) Hadoop d) MySQL

b) Spark running on Hadoop

You see a MongoDB query that reads as follows: receipts.purchases.insert([ {"id": 923090, "location": "Aqua", "date": "1/1/2018"} ]) What is the name of the COLLECTION being written to?

purchases

You see a MongoDB query that reads as follows: receipts.purchases.insert([ {"id": 923090, "location": "Aqua", "date": "1/1/2018"} ]) What is the name of the DATABASE being written to?

receipts

What is the title of the research paper, published by Google employees, that introduced the column family database to the IT community?

BigTable: A Distributed Storage System for Structured Data

You want to retrieve a list of actor names from a Neo4j database WITHOUT REPEATING NAMES. Which one of these is the most likely correct RETURN statement? a) RETURN collect(distinct actor) b) RETURN actors.name c) RETURN actors d) RETURN actors, names

RETURN collect(distinct actor)
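
As a rough analogy in Python, collecting distinct values means gathering results into one list while dropping repeats. The sample names and the helper `collect_distinct` are hypothetical, and unlike this sketch, a Cypher query without ORDER BY makes no promise about ordering.

```python
# Hypothetical query results: one actor name per matched row, with repeats.
rows = ["Tom Hanks", "Meg Ryan", "Tom Hanks", "Kevin Bacon", "Meg Ryan"]


def collect_distinct(values):
    # dict.fromkeys keeps only the first occurrence of each value, in order.
    return list(dict.fromkeys(values))


unique_names = collect_distinct(rows)
```

The result is a single list in which each name appears once.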

Because of the similarities between column-family databases and relational databases, it is STRONGLY ENCOURAGED that you split your data into several column families and use joins and sub queries liberally.

False. Joins are computationally expensive, and column-family databases such as Cassandra do not support joins or subqueries the way relational databases do. Data that is used together should be kept together rather than split apart and rejoined at query time.

A business stores a large number of pdfs and word documents on a central server that is accessible over the network. This sounds like they require a document database.

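False. A document database stores semi-structured records (e.g. JSON documents), not files such as PDFs. This business needs file storage such as Hadoop (HDFS), which is characterized by centralized metadata management.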

The cloud is not new because it is the same thing as virtualization - the creation of a virtual resource such as a server, desktop, operating system, file, storage or network.

False. Although the two are related, virtualization is a technology that lets you create multiple simulated environments from a single physical hardware system, whereas the cloud is an IT environment that abstracts, pools, and shares scalable resources across a network.

Match the following processes with the appropriate phase in the Map-Reduce algorithm.

Filtering and sorting of the data = Mapper Phase
Summary operations = Reducer Phase
Having each node perform some operation separately = Mapper Phase
Carrying out an overall operation with the outputs from each node = Reducer Phase

What is the name of the protocol used to share information about the state of data on different servers within Cassandra? With this protocol, each server updates another server about itself as well as all the servers it knows about. Those servers can then share what they know with a second set of other servers, and this process continues until all nodes have complete information.

Gossip protocol
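
The spreading process described in the question can be simulated in Python. This is a toy push-style model under stated assumptions: each node picks one random peer per round and shares everything it knows. Cassandra's real gossip implementation exchanges versioned state and is considerably more involved; the function `gossip_rounds` and the node names are hypothetical.

```python
import random


def gossip_rounds(node_names, seed=0, max_rounds=1000):
    """Simulate push-style gossip; return (rounds_used, knowledge)."""
    rng = random.Random(seed)
    # Each node starts out knowing only about itself.
    knowledge = {name: {name} for name in node_names}
    everyone = set(node_names)
    for round_number in range(1, max_rounds + 1):
        for sender in node_names:
            peer = rng.choice([n for n in node_names if n != sender])
            # The sender shares itself plus every node it has heard about.
            knowledge[peer] |= knowledge[sender]
        if all(known == everyone for known in knowledge.values()):
            return round_number, knowledge
    raise RuntimeError("gossip did not converge within max_rounds")


rounds_used, final_knowledge = gossip_rounds(["n1", "n2", "n3", "n4", "n5"])
```

No node ever contacts a central coordinator; complete information emerges from repeated pairwise exchanges.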

You have the following bi-directional NETWORK GRAPH. Which of the following EDGE LISTS below accurately represents the GRAPH?
          John
         /
        /
    Jane
    /  \
   /    \
Jill ---- Jim
a. (John, Jane), (Jill, Jane), (Jane, Jim), (John, John)
b. (John, Jane), (Jim, Jill), (Jill, Jane), (Jane, Jim)
c. (John, Jane), (Jim, Jill), (Jim, John), (Jill, Jane), (Jane, John)
d. (Jim, Jill), (Jill, Jane), (Jane, Jim), (John, Jill)

b. (John, Jane), (Jim, Jill), (Jill, Jane), (Jane, Jim)
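
One way to check an edge list against a drawing is to convert it to an adjacency map and compare each person's neighbours. The sketch below does this for the edge list in answer b; the helper name `to_adjacency` is hypothetical.

```python
from collections import defaultdict

# Edge list from answer b; each pair is an undirected (bi-directional) edge.
edges = [("John", "Jane"), ("Jim", "Jill"), ("Jill", "Jane"), ("Jane", "Jim")]


def to_adjacency(edge_list):
    """Build {vertex: set of neighbours}, adding both directions per edge."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


adjacency = to_adjacency(edges)
degrees = {person: len(neighbours) for person, neighbours in adjacency.items()}
```

John has a single neighbour (Jane), while Jane connects to all three other people, which matches the drawing.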

Which of the following is definitely not a part of the Hadoop ecosystem? a) HIVE b) NEO4J c) FLUME d) SQOOP

NEO4J

Match the following concepts and definitions below:

NoSQL Databases: A series of modern database genres that provides storage and retrieval of information for structured, semi-structured, and unstructured data
Hadoop: A framework that enables reliable shared storage and analysis system that allows the distribution of storage and processing across tens, hundreds or thousands of regular computers serving as nodes
MapReduce: An approach to computing that distributes the processing of very large multi-structured data files across a large cluster of ordinary machines/processors in two major phases
Relational Database: A database style that provides consistent and reliable storage and retrieval of information for structured data in tabular form

Cassandra is especially good for multi data-center deployments.

True

When we used MongoDB Atlas in class, that was an example of using a service hosted in the cloud.

True

Google demonstrated the power of Cassandra running on the Google Cloud by achieving one million writes per second. This sort of milestone is important for the following reasons EXCEPT a) Because your local computer hard drive can get filled up pretty quickly b) Social network applications are used widely by millions of people and such a high throughput is essential for performance c) Real-time analysis of stocks is enhanced by such capabilities d) Big science e.g. analysis of genetic data to cure diseases could use this sort of capacity

a) Because your local computer hard drive can get filled up pretty quickly

Which of the following is NOT TRUE about the information stored in a column-family database? a) Column families are organized so that all information about each new record is stored close by on disk b) Column family databases, like document databases, do not require all columns in all rows, making it flexible. c) Column families are organized into groups of data items that are frequently used together. d) Both column family databases and relational databases use unique identifiers for rows of data

a) Column families are organized so that all information about each new record is stored close by on disk

You are presented with the following use case, which database type seems most appropriate: A company wants to create a data warehouse to support analytics operations. Which of the following will NOT be appropriate? a) Graph database e.g. Neo4J b) Column database e.g. HP's Vertica c) SQL d) Spark running on Hadoop

a) Graph database e.g. Neo4J

Which of the following is NOT a feature of Cassandra a) Linear scalability to hundreds of nodes b) No single point of failure c) High availability d) ACID compliance

d) ACID compliance because Cassandra does not support joins or foreign keys, and consequently does not offer consistency in the ACID sense.

Inspect the dataset below and the MapReduce code presented and annotated below very carefully. What do you think this code is doing?
    from mrjob.job import MRJob                      # Import
    class MRRatingCounter(MRJob):                    # Create class
        def mapper(self, key, line):                 # MAP
            (userID, movieID, rating, timestamp) = line.split('\t')
            yield rating, 1
        def reducer(self, rating, occurences):       # REDUCE
            yield rating, sum(occurences)            # Perform operation
    if __name__ == '__main__':
        MRRatingCounter.run()                        # Run it
a) Calculating the number of movie ratings each user has provided b) Calculating the average rating score for each movie in the dataset c) Calculating how common each rating score is for the movies in the dataset d) Calculating the most recent movie rating for a single movie

c) Calculating how common each rating score is for the movies in the dataset
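
To see why the answer is a count per rating score, the same map and reduce steps can be run without a cluster. The sketch below uses only the standard library instead of mrjob; the sample lines, the `run_job` helper, and the grouping step are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical tab-separated lines: userID, movieID, rating, timestamp.
lines = [
    "1\t10\t3\t1000",
    "2\t11\t3\t1001",
    "3\t12\t1\t1002",
    "4\t13\t2\t1003",
    "5\t14\t1\t1004",
]


def mapper(line):
    # Map phase: emit (rating, 1) for every input record.
    user_id, movie_id, rating, timestamp = line.split("\t")
    yield rating, 1


def reducer(rating, occurrences):
    # Reduce phase: add up the 1s emitted for this rating.
    yield rating, sum(occurrences)


def run_job(records):
    # Shuffle step: group mapper output by key before reducing.
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


rating_counts = run_job(lines)
```

The user and movie IDs are discarded by the mapper, so the output can only describe how often each rating value occurs.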

You are presented with the following use case, which database type/technology seems most appropriate: A company wants to manage routine financial transactions for a single country in a single dedicated datacenter that is highly secure. Transactions have to be ACID. a) NoSQL database with BASE guarantees b) Key/Value database e.g. Redis c) Graph database e.g. Neo4J d) SQL e.g. Oracle, etc.

d) SQL e.g. Oracle, etc.

