CS327E Midterm 2
In MongoDB, which query operator is used to query for values of a single key? $and $not $or $in
$in
In a distributed database system, which of the properties can be obtained at the same time? Select all those apply a) Consistency and Availability b) Availability and Partition tolerance c) Partition tolerance and Consistency d) Consistency, Availability, and Partition Tolerance e) None of the above
* CAP Theorem: You can only have two at the same time. Not all 3. a) Consistency and Availability b) Availability and Partition tolerance c) Partition tolerance and Consistency
What are the main differences and similarities between Column Family DBs and other DB types?
- Column families are analogous to the 'namespace' in Key-Value DBs and like 'tables' in RDBMS - Column Family DBs support queries to select subsets of data available in a row, like in Document DBs; As well, columns (fields) can be added as needed - Like Relational DBs, Column Family DBs use unique identifiers for rows of data (Row Keys). As well, they store 'tabular' data The 'keyspace' is analogous to the RDBMS 'database' - However, Column Family DBs do not support typed columns, transactions, joins, or subqueries (but use denormalization)
What does CAP stand for?
- Consistency: means that all the database replicas see the same data at any given point in time. - Availability: means that every request received by a DB server will result in a response as long as (even part of) the network is available. - Partition tolerance: means that the distributed database can continue to operate even when the cluster is partitioned by network failures into two or more disconnected sections (partitions). •Theorem: You can have at most two of these properties for any distributed database system at any given time. You can't have all three at the same time.
What are the typical uses for SQL Views?
- Data Security: To omit or obscure sensitive data (restrict data access) - Data Aggregation: To setup common aggregated-data queries as views - Hiding Complexity: To shield end users from complexity - Joining Partitioned Data: To allow multiple, partitioned data sets to appear as a single one
What is Apache Cassandra?
- Free & open-source - Distributed: It stores data across many machines (servers), making it perfect for handling large amounts of data. - Wide-column store: It organizes data in flexible tables with rows and columns, but it's not like a traditional relational database. Instead, it's part of the NoSQL family. - Designed for big data: Cassandra works across clusters of servers, which can even span multiple datacenters, ensuring fast performance and reliability.
What are some key features of Apache Cassandra?
- High Availability: Cassandra is designed to stay online, even if some servers fail, making it reliable for critical applications. - Eventual Consistency: It ensures that all copies of the data across the cluster eventually become consistent. This is a trade-off for performance and availability but works well for many big data applications. - Asynchronous Masterless Replication/No master nodes: All nodes in Cassandra are equal, and any node can handle requests. This is called a masterless architecture, and it eliminates bottlenecks. - Asynchronous replication: When data is written to one node, it's copied to others in the background, ensuring efficiency. - Fault-Tolerance: If one server goes down, the system keeps running because other nodes take over its work. - Scalable: If your data grows, you can add more servers to your cluster without downtime, and Cassandra will distribute the workload automatically.
What are the basic components of Column Family DBs?
- Keyspace: top-level data structure - Row Key: a unique identifier for a row of grouped columns and also used to partition and order the data - Column: stores a single data value - Column Families: collections of related columns
What are some main characteristics of Neo4j?
- Labeled property graph database for highly connected data - Declarative, SQL-inspired query language (Cypher) - Open-source, sponsored by Neo4j, Inc. - Rich plugin and extension language (similar to Postgres) - ACID-compliant transactions (Atomicity, Consistency, Isolation, and Durability) - Visualization tools (Neo4j Browser, Bloom) - Optimized for graph traversals - Available as a cloud offering (Aura) - Distributed architecture for scaling reads - Limited scalability for writes (no sharding)
What is Apache HBase?
- Open-source: It's free and you can modify it to fit your needs. - Distributed: It stores data across many machines, so it can handle huge datasets (big data). - Scalable: It grows with your data. If you need more space or speed, you just add more machines. - Non-relational: It's not like a traditional database with tables, rows, and columns. Instead, it's based on a model called Google Bigtable. - Versioned: It keeps multiple versions of data, which is useful for tracking changes. HBase is built on Hadoop and uses HDFS (Hadoop Distributed File System) for storage, giving it access to Hadoop's big data power.
What are some key features of HBase?
- Scalability: It has linear and modular scalability so it grows with your needs. If your data doubles, you can double your resources, and it will keep performing well. - Consistency: Strictly consistent reads and writes. You always get the latest data when reading or writing, making it reliable. - Automatic Table Management: Sharding. It splits large tables into smaller pieces automatically to spread the data across machines. - Failover Support: If one node (RegionServer) fails, HBase shifts its tasks to another node without downtime. - Integration with Hadoop: Hadoop MapReduce. HBase works seamlessly with Hadoop jobs to analyze big data. - Metrics Exporting: You can monitor and export performance data using tools like Ganglia or JMX. - Developer-Friendly: Java API: Developers can easily access HBase using its Java interface. jruby-based Shell: Advanced users can script operations using a Ruby-based command-line tool. - Query Optimization: Block Cache & Bloom Filters. These help speed up real-time queries by efficiently finding data. Predicate Pushdown: You can filter queries on the server-side to reduce the amount of data sent to the client. - Flexible Data Access Thrift Gateway & REST API: You can access HBase from almost any programming language using these web services. They support multiple data formats like XML, Protobuf, and binary.
What are some basic operations to query a key-value pair database?
- get(key): retrieves the value associated with a key - set(key, value): creates or updates a key-value pair - delete(key): removes a key-value pair
What architectures are used in Column Family DBs?
-Multiple node types •HBase has name nodes, data nodes, region servers, master, & Zookeeper (coord, a SPF) nodes •Requires more administration, but servers can be specific for a task -Peer-to-peer •Cassandra has all nodes running the same software •No single master node •Offers simplicity, no single point of failure (SPF), & scalability
What three elements is a column in a Column Family DB consist of?
1. column name (Ex: LastName) 2. column value or datum (Ex: Able) 3. timestamp to record when the value was stored in the column (Ex: 40324081235)
Given the SQL query below, what could an equivalent MongoDB query look like? Can you write it? SELECT Title, Artist, Date FROM Artworks WHERE Nationality = 'Swedish' OR Classification = 'Sculpture';
> selection = {$or: [{ "Nationality": "Swedish" }, { "Classification": "Sculpture" } ]} > projection = {"Title": 1, "Artist": 1, "Date": 1, _id: 0} > db.Artworks.find(selection, projection).pretty();
Fill in the blanks for each of the following statements from among the following: Varied structures Collections Documents A MongoDB database is a container for _______ A MongoDB collection is a group of ________ A MongoDB document is a BSON record with ________
A MongoDB database is a container for Collections. A MongoDB collection is a group of Documents. A MongoDB document is a BSON record with Varied Structures.
Describe in your own words, what is an SQL View?
A SQL view is basically a query, but the output of the query is saved and you can reference it as a table. It's not an actual subset but views make it easier for yourself to reference.
In a Key-Value Database, define a namespace or keyspace.
A collection of key-value pairs No duplicate keys allowed in it Provides for separate sets of keys
What is a distributed database?
A database that is stored and/or processed on more than one computer or server
In a Key-Value Database, define a key.
A reference to a value Normally a string, but can be more complex (list, set, etc.)
Generally speaking, in comparison with Relational Database, which data models have a generally higher flexibility? a) Key-value store b) Column-oriented store c) Document-oriented store d) Graph databases e) All of the above f) None of the other answers is correct
All of the above
What features would you ascribe to a key-value pair? Select all those apply a) Bare minimum structure b) No schema c) Uses replication on multiple servers for scalability d) Values can be any datatype allowed e) All the above
All of the above
What are examples of real life tools that deal with Key-Value Pair databases?
Amazon DynamoDB
In a Key-Value Database, define a value.
An object associated with a key Can be of any data type & size But it may be restricted by the specific system
What are examples of real life tools that deal with Key-Value Pair Column Family databases?
Apache Cassandra, HBase
What is the full form of the acronym ACID?
Atomic, Consistent, Isolated, Durable
What are some key features of DynamoDB?
Built in security Continuous Backups Automated Multi-Region Replication In-Memory Caching Data Export Tools
Explain how the different CAP properties relate to each other.
CA: Consistency and Availability - RDBMS (Oracle, MySQL) CP: Consistency and Partition Tolerance NoSQL (MongoDB, HBase) AP: Availability and Partition Tolerance NoSQL (Cassandra)
How would you write a create statement in Neo4j?
CREATE (p:Person {name: "Bob", age: 35});
Considering the 'customer' table definition shown below, write a SQL statement that will create an index for this table in order to improve its access performance, given that a lot of the user queries look up via the customer email. customer PK | customer_id | CHAR | first_name | VARCHAR | last_name | VARCHAR | email | VARCHAR
CREATE INDEX customer_email_idx ON customer(email);
How would you create a keyspace (similar to a database in CQL) in CQL?
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
What are the architectures used in Column Family DBs?
Column-family databases are designed to handle large amounts of data across multiple machines. They can use different types of architectures to manage this. -> Multiple Node Types (HBase) HBase has a structure where different types of nodes perform specific tasks: NameNode: Keeps track of where the data is stored. DataNode: Stores the actual data. Region Server: Manages subsets of the data and responds to client read/write requests .Master Node: Coordinates the system, assigns tasks, and keeps things running. Zookeeper Node: Helps with coordination and keeps the system from having a single point of failure (SPF). This setup requires more administration (you have to manage each node type), but it's efficient because each node specializes in its task. -> Peer-to-Peer Architecture (Cassandra) All nodes are the same and run the same software. There's no single master node; every node can handle any request. Offers simplicity, scalability, and no single point of failure because there's no one "central" component that could take down the system if it fails
What is the meaning of the CRUD functions of a database?
Create, Read, Update, Delete
Write the equivalent SQL query from the following MongoDB statements: selection = {"class": "CS 439", "grade": "D"} db.students.deleteMany(selection)
DELETE FROM students WHERE class = 'CS439' AND grade = 'D';
What are two important characteristics of NoSQL databases a) Distributed and sharded b) Clustered and replicated c) Distributed and schema-flexibility d) Documented and Key-value stored
Distributed and schema-flexibility
A database that is stored and/or processed on different servers and computer is known as what type of a database? a. Distributed database b. Server database c. Memory database d. All the above
Distributed database
What is DynamoDB?
DynamoDB is a ... fully managed serverless key-value NoSQL database built for high-performance applications designed to scale effortlessly, handle enormous amounts of data, and maintain fast performance.
True/False: In a key-value pair each value must occur exactly once in a database
False
True/False: Just like in an RDBMS, a Column Family database necessarily has the same number of columns in all rows, unlike in a Key Value pair
False
True/False: One negative aspect of Neo4j is that the graph database itself is saved as a table, and hence it takes time to translate between the data and the graph based representation.
False, one of the advantages is that it's saved internally as a graph instead of tables.
Given the following neo4j commands, what would get printed out? CREATE (n:Person {name: 'Feynman', FavSubject: 'Physics '}) MATCH (n) RETURN (n.name)
Feynman
What does this query in Neo4j do? MATCH (p:Person {name: "Alice"})-[:FRIENDS_WITH]->(friend) RETURN p.name;
Find all friends of a person named "Alice"
What does this query in Neo4j do? MATCH (p:Person) RETURN p;
Find all nodes with the label Person
Generally speaking, in comparison with Relational Database, which data models have a generally higher complexity? a) Key-value store b) Column-oriented store c) Document-oriented store d) Graph databases e) All of the above f) None of the other answers is correct
Graph databases
When is HBase ideal? or Cassandra?
HBase's architecture is great for systems where you need specialized tasks and control. It's perfect for real-time, large-scale applications like tracking customer behavior or storing sensor data. Cassandra's peer-to-peer setup is simple, scalable, and doesn't depend on a single node to keep the system running. Applications that need high availability (e.g., online shopping, messaging apps). Systems that span across multiple locations (e.g., global applications). Handling large-scale data (e.g., user logs, sensor data, or real-time analytics).
In a Column Family DB, what is a row key similar to in a RDBMS?
In a column family, we have rows of grouped columns, and each row has a Row Key (similar to a relational primary key).
What type of NoSQL DB is Amazon DynamoDB? Graph Document Column Family Key-Value Pair
Key-Value Pair
What are examples of real life tools that deal with Document databases?
MongoDB
If a database is designed for storing and processing web data, is schema-flexible, enabling horizontal scaling through sharding, and has a very rich query language which type of a database could this be? Select all that apply a) MongoDB b) Neo4j c) Cassandra d) MySQL
MongoDB - Neo4j is not for web-processing. In graph databases you're more concerned with relationships, while web processing uses documents.
What are examples of real life tools that deal with Graph databases?
Neo4J
What is the key claim of Apache Cassandra?
No Single Point of Failure: There's no central node that the system depends on. If one or more nodes fail, Cassandra keeps running smoothly.
Is this Neo4j command useful: MATCH (P:Person) WHERE P.name = "Einstein " RETURN U Why, or why not?
Not useful, because you have not defined U.
Given the following neo4j commands, what would get printed out? CREATE (n:Person {name: 'Feynman', FavSubject: 'Physics '}) MATCH (n) RETURN (name)
Nothing will get printed out
How would you describe the MapReduce model in Hadoop?
Parallel processing, divide problem into smaller subpart, map on to various servers, reduce the solution space and then recombine to obtain results (Break large problem into smaller chunks, map each of the small chunks into different servers, each chunk/task is the same in each server, the server solves the reduced problem)
Which of the following is NOT a reasonable use of Views? a) Data Aggregation b) Performance Improvement c) Data Security d) Hiding Complexity
Performance Improvement
What are the protocols used in Column Family DBs?
Protocols are the "rules" that govern how the nodes communicate and keep data consistent. - Gossiping Nodes send messages to each other to share information about the state of the cluster (like which nodes are up or down). This helps the system stay aware of changes, even if nodes fail or join. - Anti-Entropy Used to make sure replicas of the data are consistent across nodes. Nodes compare data using hashes (quick summaries of data) instead of directly comparing the data itself, which saves time. If differences are found, the nodes fix them. - Hinted Handoff When a node is temporarily unavailable, another node (the proxy node) holds onto the write request for the unavailable node. Once the unavailable node comes back online, the proxy node sends the data to it. This ensures that writes are not lost even if a node is down.
Give some examples of queries in DynamoDB.
PutItem: Add or replace an item GetItem: Retrieve a single item by key Query: Retrieve items by partition key Scan: Retrieve all items with optional filtering UpdateItem: Update an item's attributes DeleteItem: Remove an item ConditionExpression: Ensure conditions are met for updates ex: table.get_item(Key={'year': year, 'title': title}) table.query(KeyConditionExpression=Key('year').eq(year)) table.put_item( Item={ 'year': year,'title': title,'info': {'plot': plot, 'rating': Decimal(str(rating))}})
At the heart of a CREATE VIEW statement is a _________ statement. a) SELECT b) UPDATE c) DELETE d) INSERT
SELECT
What are some key characteristics of DynamoDB?
Serverless and Fully Managed Key-Value NoSQL Database High Performance Scalability It's ideal for applications requiring real-time performance, such as: E-commerce: Managing shopping carts on Amazon.com. Ad platforms: Real-time bidding. Gaming: Low-latency multiplayer games.
Partitioning means breaking a database into multiple parts and storing on multiple computers. Another word for it is a. Replication b. Sharding c. Clustering d. Networking e. None of the above
Sharding
When is sharding used? When is sharding particularly useful?
Sharding is particularly useful in large-scale database systems where data volume or query load exceeds the capacity of a single database instance. It is a horizontal partitioning technique that splits data across multiple servers or nodes, improving performance and scalability. In real time systems where high query throughput is required (example transactions, day trading, social media messaging, etc) as single database server will get overwhelmed. In such cases 'sharding' the data and by distributing the load across multiple servers one can improve efficiency. Another use is when the needs are geographically distributed, it makes sense to 'shard' the data and keep it closer to the geo where it will be needed more often. Another standard use of this is for fault tolerance. Data that is sharded and kept in multiple places avoids the single point of failure issue. A note to keep in mind however, sharding can introduce complexity in query routing, data consistency, and rebalancing, so it is most useful when these trade-offs are manageable and justified by the user needs.
Describe the three essential features of Key-Value databases.
Simplicity, speed, and scalability. Simplicity basically means that key-value databases use a bare-minimum data structure with no regard for schema or data types. And because of the simple structure, the database is flexible and forgiving. And key-value databases are also known for their speed. They use RAM, disk storage, and caching. As well as associative arrays (map, hash table) in their implementation. Scalability refers to the use of replications or having copies of the DB files on multiple servers.
In a Key-Value Database, define a partition.
Subset of a database Assigned to a server in a cluster
Match the terms in the RDBMS side with those on the GraphDB side where the terms have a similar connotation in the usage RDBMS |GraphDB Tables |Edges Columns |Node Properties E-R model relationships| Node Labels
Tables -> Node Labels Columns -> Node properties E-R model relationships -> Edges
In what way are the two commands shown below the same, and in what way are they different if at all: Command1: MATCH(n:Person) RETURN(n) Command 2: MATCH(p:Person) RETURN p.name AS PersonName
The first returns everything about the person. Second one will be tabular, and will only return name. And the column name will change from name to PersonName.
True/False: Can the edges of a GraphDB have numeric properties?
True
True/False: Each edge can have its own set of properties that are not shared by all other edges
True
True/False: One benefit of GraphDB over RDBMS is that the need for the 'bridge' tables is eliminated, and edges instead can model the relationships
True
True/False: One benefit of Neo4j is that the transactions are ACID-compliant in contrast with some other NoSQL database tools
True
True/False: One row does not have to have the same number of columns as another row.
True
True/False: Properties of the node can be thought of as the 'column values' for that particular entity in a SQL table
True
True/False: The standard index implementation in most database systems is the B-Tree Index.
True
True/False: There can be different types of vertices in the same graph
True
True/False: A view definition is not affected by any changes made to the underlying table(s) after the view's creation time.
True The view definition is "frozen" at creation time and is not affected by subsequent changes to the definitions of the underlying table(s), however... - SQL definition of the VIEW (the query itself) is static and does not automatically update when base table structures change. However, the execution of the VIEW is still dependent on the base tables, and if critical structural changes happen, it can fail.
True/False: In a key-value pair each key must occur exactly once in a database
True, the key is similar to primary key in SQL.
Given the following Neo4j commands, what are U and P in the database? MATCH (U:University), (P:Person) WHERE U.name = 'Trinity College' AND P.name = "Ramanujam" CREATE (P) -[stu:STUDIED_AT]->(U)
U and P represent vertices.
In a Key-Value Database, define a partition key.
Used to determine which partition should store a data value Any key in an item is used to determine the partition
What is typical for ACID compliance. Select all those that apply a) Harder to scale due to focus on consistency b) Only one transaction allowed at a time c) Adding multiple horizontal servers to improve availability d) Access will be available even when network issues exist in the system e) Multiple transactions on the same record will be allowed f) Delays that affect all subsequent transactions might occur due to an overhead in one particular transaction g) Partial inconsistencies might arise but eventually the transactions will all be correct
a) Harder to scale due to focus on consistency b) Only one transaction allowed at a time f) Delays that affect all subsequent transactions might occur due to an overhead in one particular transaction
If the data in a MongoDB in JSON is as follows: [ { "_id": 1, "title": "Book A", "publisher": null }, { "_id": 2, "title": "Book B" }, { "_id": 3, "title": "Book C", "publisher": "Penguin" } ] What would be the output of the following command? db.Books.find({ publisher: { $eq: null } }) a) Only Book A and Book B will be printed out b) Only Book B will be printed out c) Only book A will be printed out d) None of the other answers, as it is looking for a publisher's name which is "null"
a) Only Book A and Book B will be printed out
If the data in a MongoDB in JSON is as follows: [ { "_id": 1, "title": "Book A", "publisher": "null" }, { "_id": 2, "title": "Book B" }, { "_id": 3, "title": "Book C", "publisher": "Penguin" } ] What would be the output of the following command? db.Books.find({ publisher: { $eq: null } }) a) Only Book A and Book B will be printed out b) Only Book B will be printed out c) Only book A will be printed out d) None of the other answers, as it is looking for a publisher's name which is "null"
b) Only Book B will be printed out
What is the smallest unit of storage in Column Family DBs?
column
Given the following MongoDB query, what could an equivalent SQL query look like? >selection = {"Nationality": "Brazil", "Sports": "Soccer"} >projection = {"PlayerName":1, "PlayerNumber":0, "GoalsScored":1, _id:0} >db.EspnHistory.find(selection, projection).pretty() a) SELECT PlayerName, PlayerNumber, GoalsScored, _id FROM EspnHistory WHERE Nationality = 'Brazil' OR Sports = 'Soccer'; b) SELECT PlayerName, PlayerNumber, GoalsScored, _id FROM EspnHistory WHERE Nationality = 'Brazil' AND Sports = 'Soccer'; c) SELECT PlayerName, , GoalsScored, FROM EspnHistory WHERE Nationality = 'Brazil' OR Sports = 'Soccer'; d) SELECT PlayerName, , GoalsScored, FROM EspnHistory WHERE Nationality = 'Brazil' AND Sports = 'Soccer'; e) None of the given choices
d) SELECT PlayerName, , GoalsScored, FROM EspnHistory WHERE Nationality = 'Brazil' AND Sports = 'Soccer';
What does the following command achieve: MATCH(n) OPTIONAL MATCH (n)-[r]-() DELETE n, r a. Delete all nodes only b. Delete only nodes that have an edge, but not if they do not have an edge c. Delete only those nodes with edges and all edges connected only to them d. Delete all nodes and all edges
d. Delete all nodes and all edges
Translate the following SQL query into MongoDB's query language: SELECT ChairName, Department FROM Departments WHERE College = 'Natural Sciences' AND HighestDegree = 'PhD' ORDER BY Department ASC; Hint: Define the 'selection' & 'projection' first, and then define the method call(s).
sel = { College: "Natural Sciences", HighestDegree: "PhD" } proj = { id:0, ChairName: 1, Department: 1 } db.Departments.find( sel, proj ).sort( { Department: 1} )
What is a group of collected servers known as?
server cluster
What does each of the four facets of ACID mean? Can you give an example for each?
• Atomic: The whole transaction happens or nothing happens • Consistent: No matter what change occurs in the data base it is structurally sound and works • Isolated: Transactions don't affect each other and happen in isolation • Durable: Changes that are committed are permanent
What spurred the need for NOSQL databases?
• Need for greater scalability in throughput and response time • Need for schema flexibility • Preference for open source software • Varied type of data types became ubiquitous
What are some advantages of Graph Databases?
• Query faster by avoiding joins: It traverses a graph instead of joins • Simplified modeling: No need for "bridge" tables to model many-to-many relationships and uses edges to model the relationship • Multiple relationships between entities: Different relationships can be modeled with different types of edges and each edge type can have its own set of properties
