StudyList Copy

Brittany is using machine learning for an algorithm that classifies social media posts according to their sentiment ("positive", "negative", or "neutral"). She trains a neural network on a large open database of social media posts and tests the network on her personal social media feed. She notices that it's mis-classifying the posts from her teenage friends, who use different slang from her other friends. What's the best way that Brittany can improve the machine learning algorithm's ability to classify posts from teenagers?

She can add social media posts from teenagers into the training data set, both from her own network and globally available data.

Tajikistan is a country in Central Asia, where many people live in poverty and most do not have access to the internet. The following table shows the percentage of residents using the internet for the years 2012-2017:
Year | % Internet usage
2012 | 14.51%
2013 | 16.00%
2014 | 17.49%
2015 | 18.98%
2016 | 20.47%
2017 | 21.96%
Assuming the internet usage keeps growing at a similar rate, what is the most reasonable prediction for internet usage in 2019 (two years after the last data point)?

25.3%
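
A quick way to check a prediction like this is to extend the table by its average yearly change. The sketch below assumes the growth stays linear; `extrapolate` is a hypothetical helper name, not part of the original question. It gives roughly 24.9% for 2019, which is close to the listed answer (the other answer choices are not reproduced in this set).

```python
# Internet usage (% of residents) copied from the table in the question.
usage = {2012: 14.51, 2013: 16.00, 2014: 17.49, 2015: 18.98, 2016: 20.47, 2017: 21.96}


def extrapolate(series, target_year):
    """Extend the series using its average year-over-year change (assumes linear growth)."""
    years = sorted(series)
    first, last = years[0], years[-1]
    avg_step = (series[last] - series[first]) / (last - first)
    return series[last] + avg_step * (target_year - last)


prediction_2019 = extrapolate(usage, 2019)
print(f"Predicted 2019 usage: {prediction_2019:.2f}%")
```

The same helper works for any of the yearly usage tables in this set, as long as a constant yearly change is a fair assumption.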

Which of these is not an example of how members of the public contribute to the creation of large data sets?

A teacher using a desktop spreadsheet program to manage their classroom's grades.

An online essay writing website decides to implement a plagiarism detection system, after several teachers report that their students submitted suspicious essays on the site. The website engineering team is considering a number of ways to detect plagiarism. Which of these plagiarism detection algorithms would benefit the most from big data?

An algorithm that computes the similarity of the wording in the student's essay to all other essays on the site and elsewhere on the web.

Andy is using machine learning for an algorithm that classifies photos of restaurant meals (like "sandwich", "curry", "salad"). He trains a neural network on a large open database of photos of restaurant meals. He then tests the network on local restaurants and notices that the Ethiopian restaurant meals aren't classified correctly. What's the best way to improve the machine learning algorithm's ability to recognize Ethiopian meals?

He can add Ethiopian meals to the training data set, by finding photos online, crowd-sourcing, or taking them himself.

The San Francisco Health Department keeps track of health inspections at restaurants and makes the data publicly available. Each row in the inspections data set contains these details:
- Restaurant name
- Restaurant address
- Inspection date
- Inspection score (0-100)
- Violation description
- Risk severity (low/medium/high)
Which of the following questions can be answered using the available data?

- How many restaurants have an inspection score greater than 90?
- Which restaurant has the lowest inspection score?
- What is the average inspection score for the high-risk violations?
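
Each of these questions maps onto a simple filter or aggregate over the listed columns. The sketch below uses a handful of made-up rows (the restaurant names and scores are hypothetical, and only three of the six columns are included) to show how each answer could be computed.

```python
from statistics import mean

# Hypothetical sample rows; the real inspection data is not reproduced here.
inspections = [
    {"name": "Cafe A", "score": 95, "risk": "low"},
    {"name": "Diner B", "score": 72, "risk": "high"},
    {"name": "Grill C", "score": 88, "risk": "medium"},
    {"name": "Bistro D", "score": 64, "risk": "high"},
    {"name": "Taqueria E", "score": 91, "risk": "low"},
]

# How many restaurants have an inspection score greater than 90?
count_above_90 = len({row["name"] for row in inspections if row["score"] > 90})

# Which restaurant has the lowest inspection score?
lowest = min(inspections, key=lambda row: row["score"])

# What is the average inspection score for the high-risk violations?
high_risk_avg = mean(row["score"] for row in inspections if row["risk"] == "high")

print(count_above_90, lowest["name"], high_risk_avg)
```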

A hospital IT department is determining how much data storage capacity they will need to store electronic health records for patients. They start by making a list of the type of data that comes from each department:
Department | Data | Format & average size
Primary care | Notes from patient chats with doctors | 3 paragraphs per visit
Laboratory | Test results | A table with 20 rows and 3 columns
Radiology | Imagery from scans (CT/PET/MRI) | 64 1024x1024 grayscale images
Pharmacy | Medication prescriptions | Patient name/ID, medicine name, data
Which type of data is likely to require the most data storage capacity?

Imagery from scans (CT/PET/MRI)
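
A rough back-of-the-envelope estimate shows why the imagery dominates. Every per-item size below is an assumption made for illustration (1 byte per character, about 500 characters per paragraph, about 10 characters per table cell, about 100 characters per prescription, and uncompressed 8-bit grayscale pixels); only the counts come from the question.

```python
# Assumed sizes (not from the source): 1 byte per text character and
# 1 byte per pixel for uncompressed 8-bit grayscale images.
BYTES_PER_CHAR = 1
BYTES_PER_PIXEL = 1

estimates = {
    "Primary care notes": 3 * 500 * BYTES_PER_CHAR,           # 3 paragraphs, ~500 chars each
    "Laboratory results": 20 * 3 * 10 * BYTES_PER_CHAR,       # 20x3 table, ~10 chars per cell
    "Radiology scans": 64 * 1024 * 1024 * BYTES_PER_PIXEL,    # 64 images of 1024x1024 pixels
    "Pharmacy prescriptions": 100 * BYTES_PER_CHAR,           # short text record, ~100 chars
}

largest = max(estimates, key=estimates.get)

# Print the departments from largest to smallest estimated size.
for department, size in sorted(estimates.items(), key=lambda item: item[1], reverse=True):
    print(f"{department}: {size:,} bytes")
```

Even with generous text sizes, the scans come out tens of thousands of times larger than any of the text records under these assumptions.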

On June 22, 1944, the U.S. introduced the G.I. Bill, a law that provided many benefits to war veterans, including college tuition. Cornell University has been tracking enrollment numbers since their inception. This table shows enrollment in the 10-year period from 1940-1950, broken down by gender:
Year | Male | Female | Total enrollment
1940 | 5,570 | 1,546 | 7,116
1941 | 5,299 | 1,647 | 6,946
1942 | 4,789 | 1,690 | 6,479
1943 | 3,128 | 1,748 | 4,876
1944 | 2,722 | 2,112 | 4,834
1945 | 3,141 | 2,202 | 5,343
1946 | 7,358 | 1,891 | 9,249
1947 | 7,864 | 1,937 | 9,801
1948 | 7,901 | 1,852 | 9,753
1949 | 7,949 | 1,895 | 9,844
1950 | 7,857 | 1,971 | 9,828
Which hypothesis is most consistent with the data?

The G.I. Bill led to a large increase in male enrollment.

A medical diagnosis app lets users track their symptoms. Whenever a user reports a symptom, the app adds a row to a database table. Each row contains:
- The user ID
- The date of the report
- The time of the report
- A description of how they're feeling
- The severity of the feeling (1-10)
Here are a few rows from the table:
user_id | date | time | description | severity
62038 | 11/19/2018 | 07:52 | Skin rash on arms | 4
20394 | 09/24/2018 | 03:45 | Pounding headache | 9
36917 | 04/11/2018 | 23:22 | Leg cramps | 2
The app marketing team wants to understand their users better and asks the data analyst for various statistics. Which statistic cannot be calculated from the table of reports?

The average duration of the feeling.

Greenland is the world's largest island, located east of Canada. It's connected to the rest of the world's internet via underwater fiber cables. The following table shows the percentage of Greenland residents using the internet for the years 2012-2017:
Year | % Internet usage
2012 | 64.90
2013 | 65.80
2014 | 66.70
2015 | 67.60
2016 | 68.50
2017 | 69.48
Assuming the internet usage keeps growing at a similar rate, what is the most reasonable prediction for internet usage in 2019 (two years after the last data point)?

71.4%

The Democratic Republic of Congo is a country located in Central Africa, where many people live in extreme poverty and few have access to the internet. The following table shows the percentage of residents using the internet for the years 2013-2017:
Year | % Internet usage
2013 | 6.60%
2014 | 7.11%
2015 | 7.62%
2016 | 8.12%
2017 | 8.65%
Assuming the internet usage keeps growing at a similar rate, what is the most reasonable prediction for internet usage in 2019 (two years after the last data point)?

9.8%
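
Because the yearly steps in this table are not perfectly equal, one reasonable approach is to fit a straight line through all five points and read off 2019. The sketch below is a plain least-squares fit written by hand; `fit_line` is a hypothetical helper name, and linear growth is an assumption. It lands at roughly 9.7%, close to the listed answer (the other answer choices are not reproduced in this set).

```python
# Data copied from the table in the question.
years = [2013, 2014, 2015, 2016, 2017]
usage = [6.60, 7.11, 7.62, 8.12, 8.65]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years, usage)
prediction_2019 = slope * 2019 + intercept
print(f"Growth per year: {slope:.3f} points, predicted 2019 usage: {prediction_2019:.2f}%")
```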

A researcher is granted access to a large data set from a hospital's obstetrics department. The hospital hopes the researcher can both compute basic statistics about the data and find interesting patterns in the data. Which of these analyses is an example of searching for patterns in the data set?

Calculating the likelihood that a baby will be born premature, based on the similarity of the fetus and mother to others.

A national bank opts to use machine learning for deciding whether to award loans to applicants. The engineers create the algorithm by training a neural network on their large database of previous loan applications and decisions (made by loan officers). After they start using the algorithm for new loan applicants, they receive complaints that their algorithm must be biased, because all the loan applicants from a particular zip code are always denied. What is the most likely explanation for the algorithm's bias?

For that zip code, the training data set only has loan applications that were denied.
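
The effect can be illustrated with a much simpler stand-in for the neural network: a toy model that just learns the historical approval rate per zip code. This is an assumption made for illustration, not the bank's actual algorithm, and the zip codes and decisions below are hypothetical. If every past application from a zip code was denied, the learned approval rate for that zip code is zero.

```python
from collections import defaultdict

# Hypothetical past decisions as (zip_code, approved) pairs.
history = [
    ("00001", True), ("00001", False), ("00001", True),
    ("00002", False), ("00002", False), ("00002", False),
]


def approval_rates(records):
    """Return the fraction of approved applications for each zip code."""
    counts = defaultdict(lambda: [0, 0])  # zip code -> [approved, total]
    for zip_code, approved in records:
        counts[zip_code][0] += int(approved)
        counts[zip_code][1] += 1
    return {zip_code: approved / total for zip_code, (approved, total) in counts.items()}


rates = approval_rates(history)
print(rates)
```

A model trained on this history has no example of an approved loan from the second zip code, so it has nothing to learn from except denials.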

A high school computer science department wants to better understand how students study for the AP CSP exam. The CSP teachers send a survey to their 6 classes, asking these questions:
- What grade are you in?
- What CS classes did you take before CSP?
- How many hours did you study?
- How early did you start studying?
- Which study resources did you use?
- What did you earn on the exam?
They find the results very interesting, spread the survey to teachers all across the United States, and collate the results in a central database. Which of these analyses is an example of searching for patterns in a large data set?

Predicting the optimal start date and study hours, for a student using a particular study resource and with a certain level of experience.

A team of scientists and engineers is putting together a research project to study whale sounds. In order to develop the infrastructure for the project, they need to first determine how much data storage space their observational data will require. This is an example of a single observation:
Sound | Location | Date/time | Species
3 minute long audio file | 63.776871, -171.742193 | May 27, 2019, 2:23:13 PM | Beluga
The team hopes to capture thousands of whale sounds from all the world's oceans. Which piece of data will increase their data storage needs the most?

Recording of whale sound

Two neighboring high schools both offer an AP Biology course and track how well the students do on the exam. The first high school stores the data in this format:
Columns: student ID, grade year (9-12), age, AP exam score, study hours
Sample row: 673489, 11, 17, 5, 12
The second high school stores the data in this format:
Columns: student email, grade year (9-12), AP exam score, class score
Sample row: [email protected], 10, 4, 78
The two schools decide to combine their data sets to see what they can learn from them together. Which of the following can be determined from the combined data set?

- The distribution of AP exam scores for 11th graders
- The total number of students that earned either a 4 or 5 on the AP exam
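
A question can be answered from the combined data only if every column it needs exists in both schools' formats. The sketch below finds the shared columns with a set intersection; `can_answer` is a hypothetical helper name introduced for illustration.

```python
# Column names copied from the two formats in the question.
school_a_columns = {"student ID", "grade year", "age", "AP exam score", "study hours"}
school_b_columns = {"student email", "grade year", "AP exam score", "class score"}

# Only columns present in both data sets are usable for every student.
shared_columns = school_a_columns & school_b_columns


def can_answer(required_columns):
    """True if every required column is available for all students in the combined set."""
    return set(required_columns) <= shared_columns


print(sorted(shared_columns))
print(can_answer({"grade year", "AP exam score"}))   # score distribution for 11th graders
print(can_answer({"study hours", "AP exam score"}))  # needs a column only one school has
```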

Company recruiters use applicant tracking systems to keep track of the resumes that candidates send in for the job. Many applicant tracking systems use algorithms to automatically rank the resumes, to help recruiters sift through large quantities of resumes. Which of these algorithms would benefit the most from big data?

An algorithm that is trained on resumes from already hired applicants and ranks based on similarity to those resumes.

Bianca is planning to start a service for programmers who want to prepare for software engineering interviews. To help her figure out the target audience, she does some market research by sending around a survey. The survey asks:
- How many years have you been programming?
- From 1-10, how interested are you in a service that helps you prepare for interviews?
- How much would you be willing to pay monthly for the service?
She creates two scatter plots based on the results. The first plot compares years of programming experience to interest in the service (plot not shown). The second plot compares willing payment amount to interest (plot not shown). Which conclusions can Bianca make from the data? Note that there are 2 answers to this question.

- A higher interest in the service is positively correlated with a higher willingness to pay.
- There is a negative correlation between years of programming experience and interest in the service.

Spam emails are unsolicited messages sent in bulk, typically for advertising or phishing purposes. Email providers typically include a spam detection system, to automatically label and hide emails that look like spam. Which of these spam detection algorithms would benefit the most from big data?

An algorithm that computes the spam likelihood by computing the similarity of an email to other spam emails.
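
One simple way to picture such an algorithm is to score an email by its word overlap with known spam. The sketch below uses Jaccard similarity over word sets, which is an assumption chosen for brevity rather than the method any real provider uses; the sample messages are made up. The more known spam the system has seen, the more likely a new spam email resembles something in the collection, which is why this approach benefits from big data.

```python
def tokens(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def spam_score(email, known_spam):
    """Highest similarity between the email and any known spam message."""
    email_tokens = tokens(email)
    return max((jaccard(email_tokens, tokens(spam)) for spam in known_spam), default=0.0)


# Hypothetical examples of previously labeled spam.
known_spam = [
    "win a free prize now",
    "claim your free gift card today",
]

score_spammy = spam_score("claim your free prize now", known_spam)
score_normal = spam_score("meeting notes for tuesday attached", known_spam)
print(score_spammy, score_normal)
```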

Community gardens are public gardens where local residents can grow plants in a plot. They are very popular, so there are often waitlists to get a plot. Alioto Community Garden stores their waitlist data in this format:
Columns: Name, email, address, waitlist date, plot size
Sample row: Jolie Clover, [email protected], 501 Stanyan St, 05-06-2018, small
A neighboring garden, Arkansas Friendship Garden, stores their waitlist data in this format:
Columns: Last name, first name, phone, address, waitlist date
Sample row: McGee, Eirene, 631-421-4141, 1351 24th Ave, 11-11-2018
The gardens decide to combine their data sets, since they're located so near to each other. Which of the following can be done using the combined data set? Note that there are 2 answers to this question.

- Figure out who has been waiting the longest
- Make a map of the waitlisted people

An online curriculum provider offers their product to two audiences: independent learners (self-directed) and classroom learners (led by their teachers). They want to understand the differences between the audiences and how they use the product, so they sent surveys and collected data. Users rated their satisfaction with the product from 1-10, where 1 is least satisfied and 10 is the most satisfied. A scatter plot (not shown) compares the hours per week spent by a user to their rating of the product. The green dots represent classroom learners and the purple dots represent independent learners. Which hypothesis is most consistent with the chart?

Independent learners are generally more satisfied with the product as their usage increases.

A travel website is adding a feature for users to store trip itineraries. Here's a sample itinerary:
Title: Summer trip to Japan
1. Inari shrine (Kyoto, Japan)
2. Iwatayama Monkey Park (Kyoto, Japan)
3. Fushimi Inari Taisha (Kyoto, Japan)
4. Fukui Prefectural Dinosaur Museum (Katsuyama, Japan)
5. Kōtoku-in (Kamakura, Japan)
6. Ghibli Museum (Mitaka, Japan)
7. Tokyo Anime Center (Tokyo, Japan)
They are considering a number of enhancements to the trip itinerary feature, and the engineering team is considering the data storage requirements of the new features. Which feature is likely to require the greatest increase in data storage needs?

Making copies of the user's trip itinerary in 6 data centers around the world

A non-profit website decided to launch a fundraising campaign in December, to encourage people to make tax-deductible donations before the end of the year. For their campaign, the website displayed a "Please donate!" banner along the top of every page, starting on December 15th. This table shows their data from December 12th to December 22nd, tracking donations from signed in users, donations from users that weren't signed in, and total sign-ups for the site.
Day | Donations (Signed in) | Donations (Signed out) | Sign-ups
12/12 | $11,775 | $2,024 | 831
12/13 | $11,783 | $2,527 | 874
12/14 | $11,455 | $2,849 | 839
12/15 | $22,582 | $3,732 | 864
12/16 | $22,867 | $3,724 | 853
12/17 | $22,669 | $3,810 | 893
12/18 | $23,679 | $3,270 | 897
12/19 | $23,577 | $3,477 | 803
12/20 | $23,866 | $3,052 | 842
12/21 | $23,837 | $3,634 | 811
12/22 | $24,519 | $3,928 | 855
Which hypothesis is most consistent with the data?

The donation banner led to a significant increase in total donations and did not affect sign-ups.
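
The hypothesis can be checked by comparing daily averages before and after the banner went up on 12/15. The sketch below uses the figures from the table; splitting at the banner date and comparing means is a simple illustrative check, not a formal statistical test.

```python
from statistics import mean

# (day, signed-in donations, signed-out donations, sign-ups) copied from the table.
rows = [
    ("12/12", 11775, 2024, 831),
    ("12/13", 11783, 2527, 874),
    ("12/14", 11455, 2849, 839),
    ("12/15", 22582, 3732, 864),
    ("12/16", 22867, 3724, 853),
    ("12/17", 22669, 3810, 893),
    ("12/18", 23679, 3270, 897),
    ("12/19", 23577, 3477, 803),
    ("12/20", 23866, 3052, 842),
    ("12/21", 23837, 3634, 811),
    ("12/22", 24519, 3928, 855),
]
BANNER_START = "12/15"  # first day the banner was shown

# All days share the same month and use two-digit days, so string comparison orders them.
before = [row for row in rows if row[0] < BANNER_START]
after = [row for row in rows if row[0] >= BANNER_START]

donations_before = mean(row[1] + row[2] for row in before)
donations_after = mean(row[1] + row[2] for row in after)
signups_before = mean(row[3] for row in before)
signups_after = mean(row[3] for row in after)

donation_ratio = donations_after / donations_before
print(f"Average daily donations: {donations_before:,.0f} -> {donations_after:,.0f}")
print(f"Average daily sign-ups:  {signups_before:.1f} -> {signups_after:.1f}")
```

Average daily donations nearly double after the banner appears, while average sign-ups stay essentially flat.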

Two neighboring cities have created data sets of places where people can get their flu shot. Mountain View stores the data in this format:
Columns: Facility name, Street address, Zip code, Start date, End date
Sample row: Southeast Health Clinic, 2420 Shotwell St, 94041, 2013-11-22, 2013-11-25
Los Altos stores the data in this format:
Columns: Facility name, Street address, Begin date, End date, Eligibility
Sample row: Public Health Center, 1301 Pierce St, 2013-11-15, 2013-12-05, Uninsured adults
The two cities are combining their data sets to create informational campaigns for their residents. Which of the following can be determined from the combined data sets?

- The last possible date to get a flu shot
- The city that has the most locations open

StackOverflow is a popular question-and-answer site. Each time a user asks a new question, they insert a row in a database table. Each row contains:
- The user ID
- The user display name
- The timestamp of the question
- The text of the question
- The spam score of the question (0-5)
Here are a few rows from the table:
user_id | display_name | timestamp | question | spam_score
62038 | TheAskerator | 11/27/2012 06:15:28 | How do I geocode a lat/lng? | 0
20394 | NewCoder123 | 03/12/2015 10:55:10 | Where can I host my website for free? | 1
36917 | QuestionErrthing | 05/04/2014 11:34:25 | Wanna download this free file? | 3
The team wants to display question statistics on an internal dashboard. Which statistic cannot be calculated from the table of questions?

The user ID with the greatest number of unanswered questions

Craig is developing a new micro-blogging app and has shared it with a group of beta testers. He wants to understand their usage patterns, so he tracks data on the number of posts a user makes each day, the average length of their posts, and the average sentiment of their posts (from very negative to very positive). The first plot (not shown) compares the average sentiment for a user's posts to the number of posts they make each day. The second plot (not shown) compares the average length of a user's posts to their number of posts per day. Which conclusions can Craig make from the data?

There is a negative correlation between posts per day and sentiment.

The Chicago Police Department uses a database to keep track of reported crimes. After anonymizing the data, they make it freely available online. Here's what the crime data set includes:
- The date of the crime
- The address (block level)
- The type of crime (theft/battery/robbery/assault/etc.)
- The location type (street/residence/business/etc.)
- Whether an arrest was made (true/false)
Which of the following questions can be answered using the available data?

- What type of crime was the most common for each year in the data set?
- Which month has the most robberies?
- What was the average number of crimes committed per location type?

Stephanie is researching the friendliness of U.S. cities. To determine the friendliest and unfriendliest cities, she conducts a nationwide survey. She then collects data about each of the cities, to try to understand what factors are related to a city's friendliness, and visualizes the data in scatter plots. The first scatter plot (not shown) compares the population of cities to friendliness. The second plot (not shown) compares each city's latitude to its friendliness. Which conclusions can Stephanie make from the data?

There is a stronger correlation between latitude and friendliness than between population and friendliness.

An online website for pet lovers provides articles written by veterinarians on health and nutrition, plus a community forum for discussions and photo sharing. The website developers are curious to see if there are usage patterns in how pet owners use the site, and are especially curious to see if there's a difference between dog owners and cat owners. A scatter plot (not shown) compares the weekly hours spent on the site by a user to the number of forum posts they made that week. The green dots represent dog owners and the purple dots represent cat owners. Which hypothesis is most consistent with the chart?

The more that dog owners use the site, the more forum posts they make.

A website for sports fans includes a discussion forum for fans to discuss games, athletes, and predictions for the coming seasons. They decide to redesign their discussion forum to be more usable and modern looking, and ask the data analysis team to analyze the effect of the redesign on usage statistics. The website released the redesign on March 8, 2018. This table shows daily usage data before and after the redesign, including number of posts created, number of replies, and number of upvotes on posts:
Day | Posts | Replies | Upvotes
3/1/18 | 592 | 1328 | 4064
3/2/18 | 560 | 1349 | 4191
3/3/18 | 555 | 1362 | 4084
3/4/18 | 576 | 1318 | 4121
3/5/18 | 582 | 1332 | 4003
3/6/18 | 559 | 1340 | 4030
3/7/18 | 576 | 1323 | 4066
3/8/18 | 591 | 1558 | 3103
3/9/18 | 552 | 1599 | 3044
3/10/18 | 587 | 1611 | 3073
3/11/18 | 581 | 1567 | 3089
3/12/18 | 557 | 1554 | 3015
3/13/18 | 565 | 1607 | 3050
3/14/18 | 599 | 1601 | 3062
Which hypothesis is most consistent with the data?

The redesign led to a significant increase in replies and a decrease in upvotes.

A mood tracking app decides to help users understand their mood changes better by also tracking the hours they spend on other applications. A chart (not shown) visualizes the results for a video watching app, using a scatter plot to compare each user's hours spent in the app to their mood after exiting the app. Users rate their mood from 1-10, where 1 is least happy and 10 is most happy. The green dots represent users that reported using the app primarily for educational videos, and the purple dots represent users that reported using it primarily for amusing videos. Which hypothesis is most consistent with the chart?

Users who watch amusing videos generally feel less happy the more they watch.

OpenPowerLifting is an organization that tracks results in the sport of power lifting and makes the data openly available in a CSV file. Each row contains the following details:
- Name of the competitor
- Gender of the competitor (M/F)
- Age of the competitor
- Body weight (kilograms)
- Best bench press (kilograms)
- Best deadlift (kilograms)
Which of the following questions can be answered using the available data?

- What are the names of the oldest and youngest competitors?
- What is the overall proportion of females vs. males in the dataset?
- What is the correlation between best deadlift amount and gender?

A "red light camera" is a camera installed at street intersections that records whenever a car runs a red light. The camera records two images, one right before the car enters the intersection, and one after it's entered the intersection. In addition to the images, it records metadata about the incident: the date and time, the intersection location, the speed of the car, and the seconds elapsed past the light turning red. Which of these questions can be better answered by analyzing the metadata instead of the image data?

- What is the average speed of a car when it runs a red light?
- Which intersections have the greatest number of red light runners?

An online toy store keeps a database of all sales. For each purchase, the database includes the following details:
- The date of the sale
- The time of the sale
- The method of payment (credit/PayPal)
- The total amount paid
- A list of the items sold
Here are a few rows from the database:
date | time | method | total | items
01/09/2019 | 14:44 | PayPal | 38.47 | "Play-doh 36-pack", "MagicBeadz"
02/13/2019 | 11:25 | credit | 18.19 | "Grumpy cat stickers", "Sidewalk chalk"
03/04/2019 | 18:13 | PayPal | 59.42 | "Lego Death Star", "Etch-a-sketch"
The toy store manager asks the database administrator for a report on various sale metrics. Which of these metrics cannot be reported from the sales database?

The most expensive item sold.

Oil tankers are prone to fire, due to all the gasoline on the ship. Some tankers use specialized cameras for flame and smoke detection, and install them in the most fire-prone spots of the ship. The cameras record videos whenever they detect motion, and also record metadata along with the videos. The metadata includes the location of the camera, the temperature near the camera, the start date/time of the recording, and the end date/time of the recording. Which of these questions can be better answered by analyzing the metadata instead of the recorded videos? Note that there are 2 answers to this question.

- What is the range in temperature at the cameras?
- On average, how many recordings are made each day per camera location?

An obstetrics department is studying fetal heartbeat and how it corresponds to a healthy birth. They make audio recordings of the fetal heartbeat at various stages of pregnancy. Along with each recording, they also record metadata. The metadata includes the gestational age of the fetus (in weeks), the age of the mother, the height of the mother and the weight of the mother. Which of these questions can be better answered by analyzing the audio data instead of the metadata?

- What is the range in the heartbeat of a fetus?
- What is the average heartbeat of a fetus?

