ITD 245 Quiz 1 & 2
Foreign keys allow rows from two tables to be joined, with the foreign key value referencing the secondary key in the primary table. True False
False
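Note: the statement is false because a foreign key references the primary key of the parent table, not a "secondary key". A minimal sketch using Python's built-in sqlite3 module; the department and employee tables and their rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, "
    "department_id INTEGER REFERENCES department(id))"  # FK -> parent primary key
)
conn.execute("INSERT INTO department VALUES (1, 'Engineering')")
conn.execute("INSERT INTO employee VALUES (3412, 'Ammons', 1)")

# Join the two tables through the foreign key.
joined = conn.execute(
    "SELECT e.name, d.name FROM employee e "
    "JOIN department d ON e.department_id = d.id"
).fetchall()

# A row pointing at a department that does not exist is rejected.
try:
    conn.execute("INSERT INTO employee VALUES (3554, 'Jenkins', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```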
In MySQL, the CHAR datatype allows the value to take up only the space that it needs, thereby saving storage space--as opposed to the VARCHAR, which takes up a fixed, non-variable amount of space regardless of the length of the text. True False
False
Text files supply an easy way for users to query their contents and get the answers to various data-oriented questions. True False
False
True or False? Artificial intelligence is a subset of machine learning. True False
False
True or False? Clustering uses supervised learning in that it works with unlabeled data that has not been assigned to a category or group. False True
False
Much of the data that social media users generate is typically of an unstructured form. True False
True
The process of normalization is used to reduce redundancy and thus possible errors during insert, update, and delete operations. True False
True
Which regex pattern matches any single character? A. . (period) B. * (asterisk) C. ? (question mark) D. \d
A. . (period)
One of the key advantages of word vector embeddings is their ability to: A. Capture the semantic relationships between words based on their context. B. Perfectly translate text into different languages. C. Identify the grammatical structure of sentences with 100% accuracy. D. Reduce the memory consumption of text data.
A. Capture the semantic relationships between words based on their context.
_____ is the assignment of data items to a related group, and it uses unsupervised learning. A. Clustering B. Association C. Classification D. Prediction
A. Clustering
The Bag-of-Words model represents text by: A. Considering the frequency of words in a document. B. Capturing the order and relationships between words. C. Using a single word to represent an entire document. D. Mapping words to their corresponding definitions.
A. Considering the frequency of words in a document.
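Note: a rough sketch of a Bag-of-Words representation using collections.Counter. The tokenizing regex and the sample sentences are assumptions chosen for illustration; real pipelines often tokenize differently.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Return word -> frequency for one document; word order is discarded."""
    tokens = re.findall(r"[a-z']+", document.lower())
    return Counter(tokens)

bow = bag_of_words("The cat sat. The cat slept.")

# Two sentences with the same words in a different order get the same bag.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")
```

Because only counts are kept, the model cannot tell "dog bites man" from "man bites dog", which is why choice B is wrong.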
What data format is often used to represent geographic features? A. GeoJSON B. CSV C. SQL D. XML
A. GeoJSON
Which chart type best visualizes changes in data over a continuous time period? A. Line chart B. Pie chart C. Box-and-whisker plot D. Bar chart
A. Line chart
Which type of database is typically better suited for handling data with flexible and changing schemas? A. NoSQL B. SQL C. Key-value D. Graph
A. NoSQL
Which of these technologies is primarily used for data manipulation within a database? A. SQL B. Pandas C. Beautiful Soup D. JSON
A. SQL
To ensure an ETL process runs smoothly and on a regular basis, what actions are critical? A. Scheduling jobs and setting up monitoring tools to track errors and performance. B. Manually running each data load to ensure quality. C. Modifying the source database structure frequently. D. Designing the data warehouse without considering data size.
A. Scheduling jobs and setting up monitoring tools to track errors and performance.
A company aims to reduce the time it takes for a European customer to load a high-resolution image file from their US-based server. What would be the most effective strategy? A. Use a CDN to distribute the images globally. B. Increase the bandwidth of the US-based server. C. Upgrade the customer's internet connection. D. Reduce the resolution of the image file.
A. Use a CDN to distribute the images globally.
Which of the following is the process of splitting a sentence or paragraph into individual words? A. Word tokenization B. Stemming C. Lemmatization D. Part-of-speech tagging
A. Word tokenization
What types of values are finite or categorical? A. discrete B. histograms C. continuous D. numbers
A. discrete
In the following example, what is the general term for any of the highlighted portions? <data> <items> <item name="item1">item1abc</item> <item name="item2">item2abc</item> </items> </data> A. element B. attribute C. key D. identifier
A. element
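Note: the highlighting from the original quiz is not preserved here, so the sketch below simply shows how the parts of the sample separate when parsed with Python's xml.etree.ElementTree. Each <item>...</item> pair is an element, name="..." is an attribute, and the text between the tags is the element's content.

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<data><items>'
    '<item name="item1">item1abc</item>'
    '<item name="item2">item2abc</item>'
    '</items></data>'
)

root = ET.fromstring(xml_text)
items = root.findall("./items/item")   # the two <item> elements

tags = [e.tag for e in items]          # element names
names = [e.get("name") for e in items] # attribute values
texts = [e.text for e in items]        # element text content
```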
In the document shown below, what term is used to describe the elements "account", "person", or "firstName" (or any highlighted item)? { "account" : 1, "person" : { "lastName" : "Jones", "firstName" : "Joe", "address" : { "home" : { "street" : "15 Elm", "city" : "Lakeville", "zip" : "12345" }, "work" : { "street" : "12 Main", "city" : "Lakeville", "zip" : "12345" } }, "phone" : { "home" : "800-555-1234", "work" : "877-123-4567" } } } A. key B. index C. value D. key-value
A. key
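Note: a small sketch with Python's json module, using a trimmed-down version of the document above (the address and phone sections are left out for brevity). The names to the left of each colon are keys; what sits to the right is the value.

```python
import json

doc = json.loads(
    '{"account": 1, "person": {"lastName": "Jones", "firstName": "Joe"}}'
)

top_level_keys = list(doc.keys())          # keys of the outer object
person_keys = list(doc["person"].keys())   # keys of the nested object
first_name = doc["person"]["firstName"]    # the value stored under a key
```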
Isolation in a database system means A. transactions are invisible until completed or fully rolled back B. transactions cannot be allowed to follow integrity constraints C. transactions are visible before being fully executed D. transactions cannot lead to an invalid data state
A. transactions are invisible until completed or fully rolled back
Which regex would match the words "dog" or "cat"? A. dog+cat B. (dog|cat) C. [dogcat] D. ^dog|cat$
B. (dog|cat)
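Note: a quick check with Python's re module of why alternation is the right choice and a character class is not. The sample sentence is made up.

```python
import re

# Alternation: matches the whole word "dog" or the whole word "cat".
alternation = re.compile(r"(dog|cat)")
# Character class: matches ONE character out of d, o, g, c, a, t.
char_class = re.compile(r"[dogcat]")

text = "my dog chased a cat"
alt_matches = alternation.findall(text)
class_matches = char_class.findall(text)
```

Here alt_matches holds the two words, while class_matches holds only single letters, so [dogcat] does not match words at all.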
Which of the following scenarios could benefit from applying text mining with sentiment analysis? A. A researcher wants to categorize historical documents by topic. B. A company wants to understand how customers feel about their latest product release by analyzing social media posts and online reviews. C. A website developer wants to optimize their site structure for search engines (SEO) D. A library wants to create a searchable database of book summaries.
B. A company wants to understand how customers feel about their latest product release by analyzing social media posts and online reviews.
Relational database systems support A. Applications, Consumers, Identity management and Difficulty B. Atomicity, Consistency, Isolation and Durability C. Auditing, Capitalization, Insurance and Data Fields
B. Atomicity, Consistency, Isolation and Durability
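Note: a minimal sketch of atomicity and consistency using Python's built-in sqlite3 module. The account table, the balances, and the CHECK rule are assumptions for illustration. The transfer fails halfway through, so the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule: no overdrafts
)
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.execute("INSERT INTO account VALUES (2, 0)")
conn.commit()

try:
    # "with conn" commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute("UPDATE account SET balance = balance + 150 WHERE id = 2")
        # This debit would leave account 1 at -50 and violates the CHECK.
        conn.execute("UPDATE account SET balance = balance - 150 WHERE id = 1")
except sqlite3.IntegrityError:
    pass

balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
```

Both balances end up unchanged: the credit that succeeded was undone together with the debit that failed.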
You're designing a system where high availability and the ability to handle massive write volumes are the top priorities, even if slight delays in data consistency are acceptable. Which consistency model aligns with these requirements? A. ACID B. BASE C. Strong consistency D. Transactional consistency
B. BASE
You want to visualize population density across different countries on a world map. What type of chart is the best fit? A. Area chart B. Choropleth map C. Histogram D. Box-and-whisker plot
B. Choropleth map
____ is the degree to which the data values align with the company's business rules such as, "the company will measure and store sensor values on 1-second intervals." A. Accuracy B. Conformity C. Completeness D. Consistency
B. Conformity
Which of the following techniques facilitates the distribution of data across multiple geographic regions? A. Data encryption B. Data replication C. Web scraping D. Data aggregation
B. Data replication
Which of the following correctly outlines the steps in the ETL process? A. Extract, Load, Transform B. Extract, Transform, Load C. Load, Transform, Extract D. Transform, Extract, Load
B. Extract, Transform, Load
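Note: a toy sketch of the three ETL steps in order, using only the standard library. The raw text, the staff table, and the cleaning rules are hypothetical; a real pipeline would read from actual sources and load into a warehouse.

```python
import csv
import io
import sqlite3

RAW = "id,name,hourly\n3412, ammons ,TRUE\n3554,JENKINS,FALSE\n"

def extract(text):
    """Extract: read raw rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: fix types, trim whitespace, normalize capitalization."""
    return [
        (int(r["id"]), r["name"].strip().title(), r["hourly"] == "TRUE")
        for r in rows
    ]

def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE staff (id INTEGER, name TEXT, hourly INTEGER)")
    conn.executemany("INSERT INTO staff VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
loaded = conn.execute("SELECT id, name, hourly FROM staff ORDER BY id").fetchall()
```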
HDFS stands for: A. HBase Distributed File System B. Hadoop Distributed File System C. Highly Dimensional File System D. Highly Distributed File System
B. Hadoop Distributed File System
You want to display the distribution of exam scores within a class. Which chart type would be most appropriate? A. Scatter plot B. Histogram C. Pie chart D. Area chart
B. Histogram
CDNs are primarily used for: A. Storing large volumes of data long-term. B. Improving content delivery speed by caching data closer to users. C. Data privacy and encryption. D. Automating ETL processes.
B. Improving content delivery speed by caching data closer to users.
MongoDB is an example of what type of NoSQL database? A. Graph database B. Key-value store C. Document store D. Columnar database
C. Document store
What is a common algorithm/approach to processing distributed, non-centralized data sets? A. Hadoop B. MapReduce C. remote function calls D. distributed file systems
B. MapReduce
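Note: a single-process sketch of the MapReduce idea (map, shuffle, reduce) applied to word counting. In a real cluster each chunk would be mapped on a different node; the function names and sample chunks here are illustrative only and do not use Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big", "data lake"]  # stand-ins for data held on two nodes
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```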
Which task identifies names of people, places, and organizations in text? A. Text summarization B. Named-entity recognition (NER) C. Word tokenization D. Sentiment analysis
B. Named-entity recognition (NER)
A company needs a database to store social media interactions with highly variable data structures (e.g., text, images, nested comments). Which database type would be a strong candidate? A. SQL B. NoSQL (document store) C. Key-value store D. Relational database
B. NoSQL (document store)
Word embeddings transform words into: A. Dictionaries B. Numerical vectors C. Binary representations D. Images
B. Numerical vectors
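Note: once words are numerical vectors, similarity can be measured with arithmetic. The sketch below uses cosine similarity on three hand-made 3-dimensional vectors; the numbers are invented for illustration and are not taken from any trained model, which would typically use hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings.
vectors = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.10, 0.05, 0.90],
}

sim_king_queen = cosine_similarity(vectors["king"], vectors["queen"])
sim_king_banana = cosine_similarity(vectors["king"], vectors["banana"])
```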
You're training an object recognition model to detect cars in photos. Which of the following factors can make this task particularly challenging? A. Cars always being the same color and shape. B. Variations in lighting, angles, occlusions (objects blocking others), and backgrounds. C. Images always being high-resolution and clear. D. Lack of available training data with labeled images of cars.
B. Variations in lighting, angles, occlusions (objects blocking others), and backgrounds.
Which of the following best describes the difference between vertical and horizontal scaling? A. Vertical scaling is for database systems, while horizontal scaling is for web servers. B. Vertical scaling involves adding more powerful hardware to a single machine, while horizontal scaling involves adding more machines to a distributed system. C. Vertical scaling is always less expensive than horizontal scaling. D. Horizontal scaling only works for relational databases.
B. Vertical scaling involves adding more powerful hardware to a single machine, while horizontal scaling involves adding more machines to a distributed system.
You need to access real-time data on product inventory from a company's website that doesn't have a publicly available API. Which technique is likely the most suitable? A. Accessing CSV files B. Web scraping C. Querying an RDBMS D. Extracting XML data
B. Web scraping
Which of the following is a benefit of using web-based charts built with HTML/CSS/JS? A. Limited customization options B. Wide reach and accessibility through web browsers C. Steep learning curve D. Platform-specific, not easily portable
B. Wide reach and accessibility through web browsers
Durability in a database system means A. committed transactions always create indexes B. committed transactions are permanent C. permanent transactions are committed D. transactions can never be permanent
B. committed transactions are permanent
What types of values are infinite, with no limit on the possible values? A. numbers B. continuous C. histograms D. discrete
B. continuous
A ___________ is a plot that shows the number of occurrences grouped by ranges of sample outcomes, where each group is defined by a bin. A. scatterplot B. histogram C. cluster D. boxplot
B. histogram
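Note: a small sketch of how a histogram groups values into bins before anything is drawn. The exam scores and the bin width of 10 are made-up inputs.

```python
def histogram(values, bin_width):
    """Count how many values fall into each bin.

    Each bin covers [start, start + bin_width) and is keyed by its start.
    """
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

scores = [55, 62, 68, 71, 74, 79, 83, 88, 91]
bins = histogram(scores, 10)
```

Plotting the bin starts on the x-axis and the counts as bar heights gives the histogram.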
A typical relational database management system supports A. none of the choices presented here are correct B. multiple users and multiple programs accessing its information simultaneously C. multiple users and multiple programs accessing its information one-at-a-time D. single user access to its information
B. multiple users and multiple programs accessing its information simultaneously
When speaking about the "velocity" of unstructured data, we are talking about the A. scale of unstructured data B. pace at which data streams are accumulated and accessed for processing C. different forms that unstructured data comes in D. uncertainty involved in believing the data
B. pace at which data streams are accumulated and accessed for processing
When speaking about the "volume" of unstructured data, we are talking about the A. analysis involved in handling streaming data B. scale of the generated data C. different forms of data D. uncertainty involved in believing the data
B. scale of the generated data
You want to write a simple regex to detect positive words like "happy", "good", or "excellent" in customer reviews. Which pattern could you use? A. (happy+good+excellent) B. [happygoodexcellent] C. (happy|good|excellent) D. (happy,good,excellent)
C. (happy|good|excellent)
Consider the sentence: "The book was on the table." If tokenized, how many tokens would likely be produced (don't exclude 'stop' words)? A. 4 B. 5 C. 6 D. 7
C. 6
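Note: a minimal word-tokenization sketch using a regular expression that keeps only runs of word characters. This is one simple approach, not the only one: some tokenizers also emit punctuation such as the final period as its own token, which would change the count.

```python
import re

sentence = "The book was on the table."

# \w+ matches each run of letters/digits/underscores, so the period is dropped.
tokens = re.findall(r"\w+", sentence)
token_count = len(tokens)
```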
Which consistency model prioritizes immediate updates and guarantees that all clients see the same data at the same time? A. NoSQL B. BASE C. ACID D. Eventual consistency
C. ACID
You're analyzing a tweet that says: "Just visited Apple in Cupertino." Which entities would a named-entity recognition (NER) system likely identify? A. Apple (Product) B. Just, visited (Verbs) C. Apple (Organization), Cupertino (Location) D. Cupertino (Person)
C. Apple (Organization), Cupertino (Location)
_____ is the science of making intelligent machines that can perceive visual items, recognize voices, make decisions, and more. A. Data mining B. Data association C. Artificial intelligence D. Machine learning
C. Artificial intelligence
ELT approaches to data integration are often favored in modern setups because: A. They offer flexibility in working with raw, unstructured data. B. They generally require less up-front planning and design. C. Both are correct.
C. Both are correct.
Which data source format is commonly used for storing tabular data? A. JSON B. RDBMS C. CSV D. XML
C. CSV
TF-IDF is a technique used to: A. Count the number of times a word appears in a document. B. Determine the grammatical role of a word in a sentence. C. Calculate the importance of a word in a document relative to a collection of documents. D. Identify the base form of a word.
C. Calculate the importance of a word in a document relative to a collection of documents.
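Note: a hand-rolled sketch of one common TF-IDF variant (term frequency times log of inverse document frequency). Libraries usually apply smoothing and normalization, so their numbers will differ; the tiny corpus is invented for illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`, where `corpus` is a list of token lists."""
    tf = doc.count(term) / len(doc)                    # frequency within this document
    df = sum(1 for d in corpus if term in d)           # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0    # rarer across corpus -> larger
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

score_the = tf_idf("the", corpus[0], corpus)  # appears in every document
score_sat = tf_idf("sat", corpus[0], corpus)  # appears in only one document
```

A word that occurs in every document ("the") scores zero, while a word unique to one document ("sat") scores highest.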
Which of the following is a common task performed during the 'Transformation' stage of an ETL pipeline? A. Retrieving data from a website using web scraping. B. Loading data into a data warehouse for analysis. C. Cleaning inconsistencies, formatting data types, and applying calculations. D. Creating a dashboard to visualize data insights.
C. Cleaning inconsistencies, formatting data types, and applying calculations.
The Python library Pandas is best known for its use in: A. Web scraping B. ETL pipelines C. Data analysis and manipulation D. Building REST APIs
C. Data analysis and manipulation
Which Python library is commonly used for working with geospatial data? A. Matplotlib B. PlotlyGeo C. Geopandas D. Geoplot
C. Geopandas
Which SQL statement is used to add new data into a database table? A. UPDATE B. SELECT C. INSERT D. DELETE
C. INSERT
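Note: a short sketch of INSERT run through Python's built-in sqlite3 module. The employee table and the row values are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")

# INSERT adds a new row; the ? placeholders keep values separate from the SQL.
conn.execute("INSERT INTO employee (id, name) VALUES (?, ?)", (3412, "Ammons"))
conn.commit()

rows = conn.execute("SELECT id, name FROM employee").fetchall()
```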
Which of the following is a common technique used in sentiment analysis? A. Image segmentation B. Fourier transforms C. Lexicon-based analysis (using dictionaries of positive/negative words) D. Convolutional Neural Networks (CNNs)
C. Lexicon-based analysis (using dictionaries of positive/negative words)
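Note: a bare-bones sketch of lexicon-based sentiment scoring. The two word lists are tiny hypothetical lexicons; real lexicons contain thousands of weighted entries, and this approach is blind to sarcasm and negation.

```python
import re

# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"happy", "good", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

def sentiment_score(text):
    """+1 for each positive word, -1 for each negative word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

positive_review = sentiment_score("Good product, excellent support")
negative_review = sentiment_score("Terrible battery")
neutral_review = sentiment_score("It arrived on Tuesday")
```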
How does the MapReduce programming paradigm benefit from horizontal scaling? A. Horizontal scaling has no impact on MapReduce performance. B. MapReduce requires specialized hardware for vertical scaling. C. MapReduce can distribute tasks across multiple nodes, allowing it to process larger datasets more efficiently with horizontal scaling. D. MapReduce is primarily designed for single-machine environments.
C. MapReduce can distribute tasks across multiple nodes, allowing it to process larger datasets more efficiently with horizontal scaling.
To compare the market share of several different products, which chart type is a good choice? A. Line chart B. Area chart C. Pie chart or Donut chart D. Box-and-whisker plot
C. Pie chart or Donut chart
You're building a sentiment analysis model to classify customer reviews. Which of these would be a key challenge to be aware of? A. Limited processing power on most computers B. Lack of available training data C. Sarcasm, irony, and context-dependent language D. The strict grammar rules of natural language
C. Sarcasm, irony, and context-dependent language
A rapidly growing application needs to improve its ability to handle increasing data volumes (think "Big Data"). Which of the following strategies can help address this? A. Normalization B. Adding database indexes C. Vertical scaling and/or sharding D. Reducing the number of database tables
C. Vertical scaling and/or sharding
Which of these is a popular word embedding model? A. Linear Regression B. TF-IDF C. Word2Vec D. Bag-of-Words
C. Word2Vec
Which of the following are commonly used techniques for creating word vector embeddings? A. Stemming and Lemmatization B. Simplified dimensional vectorization C. Word2Vec and GloVe D. Named-Entity Recognition (NER)
C. Word2Vec and GloVe
Attributes that contribute to data quality include which of the following? (select all that apply) Completeness Consistency Conformity Accuracy
Completeness Consistency Conformity Accuracy
_____ is the degree to which the data correctly represents the underlying real-world values, such as all temperatures from a sensor being in the correct range. A. Completeness B. Appropriateness C. Quality D. Accuracy
D. Accuracy
What format best describes the sample data below? ID,Name,Position,Hourly 3412,Ammons,Developer,TRUE 3554,Jenkins,DBA,FALSE 3672,Connor,Project Manager,FALSE A. text flatfile B. XML C. JSON D. CSV
D. CSV
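Note: a sketch of reading the sample with Python's csv module, assuming each record sits on its own line in the actual file (the line breaks are flattened in the question text above). The first line supplies the column names.

```python
import csv
import io

sample = (
    "ID,Name,Position,Hourly\n"
    "3412,Ammons,Developer,TRUE\n"
    "3554,Jenkins,DBA,FALSE\n"
    "3672,Connor,Project Manager,FALSE\n"
)

# DictReader uses the header row as keys for every following row.
rows = list(csv.DictReader(io.StringIO(sample)))
names = [row["Name"] for row in rows]
```

Every value comes back as a string, so "TRUE" and "FALSE" would still need converting before analysis.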
______ is the use of a supervised machine-learning algorithm to assign an observation into a specific category. A. Clustering B. Data mining C. Predicting D. Classification
D. Classification
______ is the degree to which the data represents all required values, such as a data set that should contain an hour of data, for a sensor that reports every second, having 100% of the data values. A. Accuracy B. Quality C. Appropriateness D. Completeness
D. Completeness
You want to create an interactive dashboard with charts, filters, and drill-down capabilities. Which libraries/tools would be most suitable? A. Geopandas B. HTML and CSS C. GeoJSON and Calendar D. Dash and Plotly
D. Dash and Plotly
Which of the following describes a value that falls outside of the expected range of values? A. Border value B. Extrinsic value C. Edge value D. Outlier value
D. Outlier value
What type of data store is most closely associated with ELT (as opposed to ETL)? A. data warehouse B. analytical database C. transactional database D. data lake
D. data lake
Which of the following describes a very large collection of data outside of a database, stored in a less structured form, such as a binary or text file? A. HDFS B. data dump C. data warehouse D. data lake
D. data lake
A text file manipulated by a program like Notepad or Textedit is an example of a A. binary data file B. database file C. hexadecimal random access file D. flat file
D. flat file
Modern Relational Database systems include A. Interface drivers B. Storage engine C. SQL engine D. Transaction Engine E. All of the choices are components of a modern relational database system
E. All of the choices are components of a modern relational database system