Khan Academy answers: computer science
Brittany is using machine learning for an algorithm that classifies social media posts according to their sentiment ("positive", "negative", or "neutral"). She trains a neural network on a large open database of social media posts and tests the network on her personal social media feed. She notices that it's mis-classifying the posts from her teenage friends, who use different slang from her other friends. What's the best way that Brittany can improve the machine learning algorithm's ability to classify posts from teenagers?
She can add social media posts from teenagers into the training data set, both from her own network and globally available data.
A website for sports fans includes a discussion forum for fans to discuss games, athletes, and predictions for the coming seasons. They decide to redesign their discussion forum to be more usable and modern looking, and ask the data analysis team to analyze the effect of the redesign on usage statistics. The website released the redesign on March 8, 2018. This table shows daily usage data before and after the redesign, including number of posts created, number of replies, and number of upvotes on posts: Which hypothesis is most consistent with the data?
The redesign led to a significant increase in replies and a decrease in upvotes.
Craig is developing a new micro-blogging app and has shared it with a group of beta testers. He wants to understand their usage patterns, so he tracks data on the number of posts a user makes each day, the average length of their posts, and the average sentiment of their posts (from very negative to very positive). This plot compares the average sentiment for a user's posts to the number of posts they make each day: A scatter plot with post length on the x axis and posts per day on the y axis. The x axis goes from 0 to 280, and the y axis goes from 0 to 20. Dots are scattered on the plot for all values of post length and posts per day, seemingly randomly. Which conclusions can Craig make from the data?
There is a negative correlation between posts per day and sentiment.
Hong Lien is a research director for NASA. She is hoping to build a system that can process millions of images from satellites orbiting the earth and analyze them each day for signs of deforestation and ocean debris. She decides to hire an engineer specifically for the task of building the system and reviews many resumes. Which of these lines on a resume is most pertinent to this task?
"I have experience building highly scalable systems."
Tajikistan is a country in Central Asia, where many people live in poverty and most do not have access to the Internet. The following table shows the percentage of residents using the Internet for the years 2012-2017: Assuming the Internet usage keeps growing at a similar rate, which of the following is the most reasonable prediction for Internet usage in 2019 (two years after the last data point)?
25.1%
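The prediction questions in this set all rest on the same arithmetic: take the average yearly change and extend it past the last data point. A minimal sketch follows. The usage table is not reproduced in these notes, so the numbers below are placeholder values chosen only to make the arithmetic easy to follow, and `extrapolate_linear` is a hypothetical helper name.

```python
def extrapolate_linear(years, values, target_year):
    """Extend the average yearly change from the first to the last data point."""
    avg_change_per_year = (values[-1] - values[0]) / (years[-1] - years[0])
    return values[-1] + avg_change_per_year * (target_year - years[-1])


# Placeholder numbers only: these are NOT the figures from the original table.
years = [2012, 2013, 2014, 2015, 2016, 2017]
usage_percent = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

# Average change is 2 points per year, so two more years adds 4 points.
prediction = extrapolate_linear(years, usage_percent, 2019)
print(prediction)
```

With the real table, the same steps apply: estimate the typical yearly increase, then add two years' worth to the 2017 value and pick the closest answer choice.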
Greenland is the world's largest island, located east of Canada. It's connected to the rest of the world's Internet via underwater fiber cables. The following table shows the percentage of Greenland residents using the Internet for the years 2012-2017: Assuming the Internet usage keeps growing at a similar rate, what is the most reasonable prediction for Internet usage in 2019 (two years after the last data point)?
71.4%
Bianca is planning to start a service for programmers who want to prepare for software engineering interviews. To help her figure out the target audience, she does some market research by sending around a survey. Which conclusions can Bianca make from the data? 👁️Note that there are 2 answers to this question.
A higher interest in the service is positively correlated with a higher willingness to pay. There is a negative correlation between years of programming experience and interest in the service.
Note that this is a two-part question. Part 1: In 2016, researchers studied the differences between two dialects of English used in tweets on Twitter. They first categorized over eight million tweets as either African-American-aligned (AA-aligned) or White-aligned. The researchers then tried out three different language identification algorithms to see whether each tweet would be categorized as the English language, a different language, or an unknown language. The following table shows the proportion of tweets from each dialect that were classified as non-English by each algorithm: Which of the following statements is supported by that data? Part 2: Which of the following does not describe a way in which a biased language identification algorithm could result in discrimination?
Part 1: All three algorithms contain biases that cause them to misidentify an AA-aligned tweet as non-English more often than they misidentify the language of a White-aligned tweet. Part 2: The analytics team of a reviews website uses the language identification algorithm to display a pie chart of the number of reviews written per language in the last year.
An online essay writing website decides to implement a plagiarism detection system, after several teachers report that their students submitted suspicious essays on the site. The website engineering team is considering a number of ways to detect plagiarism. Which of these plagiarism detection algorithms would benefit the most from access to a large data set?
An algorithm that computes the similarity of the wording in a student's essay to all other essays on the site and elsewhere on the web.
Spam emails are unsolicited messages sent in bulk, typically for advertising or phishing purposes. Email providers typically include a spam detection system, to automatically label and hide emails that look like spam. Which of these spam detection algorithms would benefit the most from access to a large database of emails?
An algorithm that computes the spam likelihood based on the similarity of the email to other spam emails.
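To see why a similarity-based detector benefits from a large database, here is a minimal sketch that scores an email by its closest match among known spam messages. Jaccard similarity over word sets is an assumption made for illustration (the question does not name a similarity measure), and the example messages are made up.

```python
def jaccard_similarity(text_a, text_b):
    """Share of distinct words that the two texts have in common."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def spam_likelihood(email, known_spam):
    """Highest similarity between the email and any known spam message."""
    return max((jaccard_similarity(email, spam) for spam in known_spam), default=0.0)


# Made-up messages for illustration only.
known_spam = ["win a free prize now", "claim your free gift card now"]

score_spammy = spam_likelihood("claim your free prize now", known_spam)
score_normal = spam_likelihood("meeting moved to thursday afternoon", known_spam)
print(score_spammy, score_normal)
```

The more spam examples the database holds, the more likely a new spam email resembles at least one of them, which is why this approach depends on data volume in a way that a fixed rule does not.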
Company recruiters use applicant tracking systems to keep track of the résumés that candidates send in for the job. Many applicant tracking systems use algorithms to automatically rank the résumés, to help recruiters sift through large quantities of résumés. Which of these algorithms would require access to a large database of résumés?
An algorithm that ranks résumés based on similarity to résumés from already hired applicants.
Note that this is a two-part question. Part 1: In a 2019 research paper, a group of researchers analyzed two machine learning algorithms for automated hate speech detection. Both algorithms were trained on thousands of tweets that had been annotated by crowdsourced workers as being offensive or not. To test the algorithms, the researchers collected millions of tweets that were classified as written in either African American English (AAE) or White-aligned English. The researchers ran the algorithms over each tweet and recorded the rate at which non-offensive tweets were marked as offensive as well as the rate at which offensive tweets were marked as non-offensive. Which of the following statements is supported by that data? Part 2: Which of the following does not describe a way in which a biased hate speech detection algorithm could result in discrimination?
Part 1: Both algorithms contain bias that results in falsely labeling AAE tweets as "offensive" more often than White-aligned tweets. Part 2: A discussion platform could calculate the number of posts that have been categorized as "hate speech" by the algorithm compared to the number of posts that were flagged by users.
Community gardens are public gardens where local residents can grow plants in a plot. They are very popular, so there are often waitlists to get a plot. Alioto Community Garden stores their waitlist data in this format: The gardens decide to combine their data sets, since they're located so near to each other. Which of the following can be done using the combined data set? 👁️Note that there are 2 answers to this question.
Create an electronic mailing list for everyone waitlisted. Make a map of the waitlisted people.
Talisa is an engineer that is helping a museum to digitize and analyze all of its historical books. After running the software over the first 100 books, she realizes that the museum computer has run out of space to store the digital files. Which technique is the most needed to help them digitize the remaining books?
Distributed computing
Community gardens are public gardens where local residents can grow plants in a plot. They are very popular, so there are often waitlists to get a plot. Alioto Community Garden stores their waitlist data in this format: Columns: Name, email, address, waitlist date, plot size. Sample row: Jolie Clover, [email protected], 501 Stanyan St, 05-06-2018, small. A neighboring garden, Arkansas Friendship Garden, stores their waitlist data in this format: Columns: Last name, first name, phone, address, waitlist date. Sample row: McGee, Eirene, 631-421-4141, 1351 24th Ave, 11-11-2018. The gardens decide to combine their data sets, since they're located so near to each other. Which of the following can be done using the combined data set? 👁️Note that there are 2 answers to this question.
Figure out who has been waiting the longest. Make a map of the waitlisted people.
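The reasoning is that a combined data set can only support questions about fields both gardens record: a name, an address, and a waitlist date. A minimal sketch of that merge, using the two sample rows from the question; the dictionary keys, the `combine_waitlists` helper, and the month-day-year reading of the dates are assumptions.

```python
from datetime import datetime

# Sample rows from the question. The email is kept exactly as it appears there.
alioto_rows = [
    {"name": "Jolie Clover", "email": "[email protected]",
     "address": "501 Stanyan St", "waitlist_date": "05-06-2018", "plot_size": "small"},
]
arkansas_rows = [
    {"last_name": "McGee", "first_name": "Eirene", "phone": "631-421-4141",
     "address": "1351 24th Ave", "waitlist_date": "11-11-2018"},
]


def parse_date(text):
    # Assumption: dates are written month-day-year.
    return datetime.strptime(text, "%m-%d-%Y").date()


def combine_waitlists(alioto, arkansas):
    """Keep only the fields that both gardens record."""
    combined = []
    for row in alioto:
        combined.append({
            "name": row["name"],
            "address": row["address"],
            "waitlist_date": parse_date(row["waitlist_date"]),
        })
    for row in arkansas:
        combined.append({
            "name": f"{row['first_name']} {row['last_name']}",
            "address": row["address"],
            "waitlist_date": parse_date(row["waitlist_date"]),
        })
    return combined


combined = combine_waitlists(alioto_rows, arkansas_rows)
longest_waiting = min(combined, key=lambda row: row["waitlist_date"])
addresses_to_map = [row["address"] for row in combined]
print(longest_waiting["name"], addresses_to_map)
```

Email, phone, and plot size each exist in only one source, so a mailing list or a plot-size breakdown for everyone on the combined waitlist is not possible from these columns.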
A national bank opts to use machine learning for deciding whether to award loans to applicants. The engineers create the algorithm by training a neural network on their large database of previous loan applications and decisions (made by loan officers). After they start using the algorithm for new loan applicants, they receive complaints that their algorithm must be biased, because all the loan applicants from a particular zip code are always denied. What is the most likely explanation for the algorithm's bias?
For that zip code, the training data set only has loan applications that were denied.
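A toy illustration of how this happens: a model can only reproduce patterns present in its training data. The sketch below is a deliberately simple stand-in for the neural network in the question (it just memorizes the most common past decision per zip code), and the zip codes and decisions are hypothetical.

```python
from collections import Counter, defaultdict


def train(applications):
    """Learn the most common past decision for each zip code."""
    decisions_by_zip = defaultdict(Counter)
    for zip_code, decision in applications:
        decisions_by_zip[zip_code][decision] += 1
    return {
        zip_code: counts.most_common(1)[0][0]
        for zip_code, counts in decisions_by_zip.items()
    }


def predict(model, zip_code, default="needs review"):
    """Return the learned decision, or a default for unseen zip codes."""
    return model.get(zip_code, default)


# Hypothetical training data: zip code "22222" has only denials on record.
training_data = [
    ("11111", "approved"), ("11111", "denied"), ("11111", "approved"),
    ("22222", "denied"), ("22222", "denied"),
]

model = train(training_data)
print(predict(model, "11111"), predict(model, "22222"))
```

Because every example for the second zip code is a denial, the model has no evidence that an approval is ever the right output there, so every new applicant from that zip code is denied.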
Andy is using machine learning for an algorithm that classifies photos of restaurant meals by category (such as "sandwich", "curry", or "salad"). He trains a neural network on a large open database of photos of restaurant meals. He then tests the network on local restaurants and notices that the Ethiopian restaurant meals aren't classified correctly. What's the best way to improve the machine learning algorithm's ability to recognize Ethiopian meals?
He can add Ethiopian meals to the training data set, by finding photos online, crowd-sourcing, or taking them himself.
The Chicago Police Department uses a database to keep track of reported crimes. After anonymizing the data, they make it freely available online. Here's what the crime data set includes: the date of the crime; the address (block level); the type of crime (theft/battery/robbery/assault/etc.); the location type (street/residence/business/etc.); and whether an arrest was made (true/false). Which of the following questions can be answered using the available data? 👁️Note that there may be multiple answers to this question.
What type of crime was the most common for each year in the data set? What was the average number of crimes committed per location type?
A hospital IT department is determining how much data storage capacity they will need to store electronic health records for patients. They start by making a list of the type of data that comes from each department: Which type of data is likely to require the most data storage capacity?
Imagery from scans (CT/PET/MRI)
An online curriculum provider offers their product to two audiences: independent learners (self-directed) and classroom learners (led by their teachers). They want to understand the differences between the audiences and how they use the product, so they sent surveys and collected data. Users rated their satisfaction with the product from 1-10, where 1 is least satisfied and 10 is the most satisfied. This scatter plot compares the hours per week spent by a user to their rating of the product: Which hypothesis is most consistent with the chart?
Independent learners are generally more satisfied with the product as their usage increases.
A travel website is adding a feature for users to store trip itineraries. Here's a sample itinerary: Title: Summer trip to Japan 1. Inari shrine (Kyoto, Japan) 2. Iwatayama Monkey Park (Kyoto, Japan) 3. Fushimi Inari Taisha (Kyoto, Japan) 3. Fukui Prefectural Dinosaur Museum (Katsuyama, Japan) 4. Kōtoku-in (Kamakura, Japan) 5. Ghibli Museum (Mitaka, Japan) 6. Tokyo Anime Center (Tokyo, Japan) They are considering a number of enhancements to the trip itinerary feature, and the engineering team is considering the data storage requirements of the new features. Which feature is likely to require the greatest increase in data storage needs?
Making copies of the user's trip itinerary in 6 data centers around the world
Oil tankers are prone to fire, due to all the gasoline on the ship. Some tankers use specialized cameras for flame and smoke detection, and install them in the most fire-prone spots of the ship. The cameras record videos whenever they detect motion, and also record metadata along with the videos. The metadata includes the location of the camera, the temperature near the camera, the start date/time of the recording, and the end date/time of the recording. Which of these questions can be better answered by analyzing the metadata instead of the recorded videos? 👁️Note that there are 2 answers to this question.
On average, how many recordings are made each day per camera location? What is the range in temperature at the cameras?
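Both of these answers can be computed from the metadata fields alone, without opening a single video. A minimal sketch follows; the field names, camera locations, timestamps, and temperatures are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical metadata records; no video content is needed.
recordings = [
    {"location": "engine room", "temperature_c": 41.0,
     "start": "2024-01-01T08:00:00", "end": "2024-01-01T08:05:00"},
    {"location": "engine room", "temperature_c": 45.5,
     "start": "2024-01-01T14:30:00", "end": "2024-01-01T14:32:00"},
    {"location": "engine room", "temperature_c": 39.0,
     "start": "2024-01-02T09:15:00", "end": "2024-01-02T09:20:00"},
    {"location": "cargo deck", "temperature_c": 30.0,
     "start": "2024-01-02T10:00:00", "end": "2024-01-02T10:01:00"},
]


def average_recordings_per_day(records):
    """Recordings per camera location, divided by the number of days covered."""
    days = {datetime.fromisoformat(r["start"]).date() for r in records}
    counts = defaultdict(int)
    for r in records:
        counts[r["location"]] += 1
    return {location: n / len(days) for location, n in counts.items()}


def temperature_range(records):
    """Lowest and highest temperature seen across all recordings."""
    temperatures = [r["temperature_c"] for r in records]
    return min(temperatures), max(temperatures)


per_day = average_recordings_per_day(recordings)
temp_range = temperature_range(recordings)
print(per_day, temp_range)
```

A question such as whether flames are visible in a recording would, by contrast, require analyzing the video itself.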
Xiomara is a researcher studying the effect of carbon emissions from airplanes on global warming. She collects millions of data points tracking the path of airplanes and develops a program that analyzes the data. When she runs the program on a single company's airplanes, it takes an hour to complete, so she becomes concerned that it will take much too long to run on all of the airplane data. Her friend Dacari suggests using parallel computing to speed up the analysis of the airplane emission data. How would parallel computing speed up the analysis?
Parallel computing can run the program in parallel on subsets of the data, so that the total amount of time is less.
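The idea can be sketched as: split the data into subsets, analyze each subset at the same time, then combine the partial results. The example below is a minimal illustration with made-up data and a placeholder analysis (summing distances). It uses a thread pool to stay portable; for CPU-bound work in Python, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and is the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subsets(data, n_subsets):
    """Cut the data into roughly equal consecutive chunks."""
    size = max(1, -(-len(data) // n_subsets))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def analyze_subset(flight_distances_km):
    """Placeholder analysis: total distance flown in this subset."""
    return sum(flight_distances_km)


def analyze_in_parallel(data, n_workers=4):
    subsets = split_into_subsets(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each subset is handed to a worker; results come back in order.
        partial_totals = list(pool.map(analyze_subset, subsets))
    return sum(partial_totals)


# Made-up distances standing in for the airplane path data.
distances = list(range(1, 101))
total = analyze_in_parallel(distances)
print(total)
```

This works because the analysis of one subset does not depend on any other subset; only the final combining step has to wait for all workers.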
A team of scientists and engineers is putting together a research project to study whale sounds. In order to develop the infrastructure for the project, they need to first determine how much data storage space their observational data will require. This is an example of a single observation: will increase their data storage needs the most?
Recording of whale sound
Safiya is a software engineer at a company that's developing software for self-driving cars. She's working on software that uses computer vision and machine learning algorithms to detect pedestrians walking near the car and trigger the brakes when needed. After training the algorithm on a large dataset of training data (videos of pedestrians walking near cars), they try it out in cars with backup drivers. The drivers report that it detected most pedestrians, but failed to detect people using wheelchairs and parents pushing strollers. The drivers had to manually respond in those cases. What's the best way that Safiya can improve the machine learning algorithm's ability to detect all pedestrians?
She can add videos of people using wheelchairs and strollers into the training data set (perhaps crowd-sourcing them if there are none already available).
On June 22, 1944, the U.S. introduced the G.I. Bill, a law that provided many benefits to war veterans, including college tuition. Cornell University has been tracking enrollment numbers since their inception. This table shows enrollment in the 10-year period from 1940-1950, broken down by gender: Which hypothesis is most consistent with the data?
The G.I. Bill led to a large increase in male enrollment.
Note that this is a two-part question. Part 1: Yitaf discovers a new startup called GenderChecker, a service that claims to be able to "identify the gender of your customers" based on an email address. Yitaf is interested in the accuracy of its predictions and tries out a bunch of addresses: Which of the following statements is most supported by that data? Part 2: Which of the following does not describe a way in which the service could be used to discriminate against individuals?
Part 1: The algorithm contains bias that associates "doctor" (and related abbreviations) with the male gender. Part 2: A social media network could generate a graph that shows the fraction of user sign-ups each month per gender.
HireView is a startup that claims to speed up the process of interviewing job candidates. A candidate submits a video answering interview questions and HireView analyzes the video with a machine learning algorithm. The algorithm scores the candidate on various aspects of their personality, such as "willingness to learn" and "personal stability". HireView engineers trained the algorithm on a data set of past videos that were scored by employers and psychologists. In a test of the algorithm, HireView engineers discover that the algorithm always gives lower scores to people who speak more slowly. What is the most likely explanation for the algorithm's bias?
The algorithm was trained on data where videos with slower speech were scored lower, due to the bias of the scorers.
A medical diagnosis app lets users track their symptoms. Whenever a user reports a symptom, the app adds a row to a database table. Each row contains: the user ID; the date of the report; the time of the report; a description of how they're feeling; and the severity of the feeling (1-10). Here are a few rows from the table: The app marketing team wants to understand their users better and asks the data analyst for various statistics. Which statistic cannot be calculated from the table of reports?
The average duration of the feeling.
Two neighboring high schools both offer an AP Biology course and track how well the students do on the exam. The two schools decide to combine their data sets to see what they can learn from them together. Which of the following can be determined from the combined data set? 👁️Note that there are 2 answers to this question.
The distribution of AP exam scores for 11th graders. The total number of students that earned either a 4 or 5 on the AP exam.
A non-profit website decided to launch a fundraising campaign in December, to encourage people to make tax-deductible donations before the end of the year. For their campaign, the website displayed a "Please donate!" banner along the top of every page, starting on December 15th. This table shows their data from December 12th to December 22nd, tracking donations from signed in users, donations from users that weren't signed in, and total sign-ups for the site. Which hypothesis is most consistent with the data?
The donation banner led to a significant increase in total donations and did not affect sign-ups.
A company develops a program to help high school counselors suggest career paths to students. The program uses a machine learning algorithm that is trained on data from social media sites, looking at a user's current job title, their background, and their interests. After counselors use the program for a while, they observe that the program suggests jobs in STEM (science, technology, engineering, math) much more often for male students than for other students. What is the most probable explanation for the bias in the career suggestions?
The machine learning algorithm was trained on data from a society where STEM jobs are more likely to be held by male-identified people due to historical bias.
An online website for pet lovers provides articles written by veterinarians on health and nutrition, plus a community forum for discussions and photo sharing. The website developers are curious to see if there are usage patterns in how pet owners use the site, and are especially curious to see if there's a difference between dog owners and cat owners. This scatter plot compares the weekly hours spent on the site by a user to the number of forum posts they made that week: Which hypothesis is most consistent with the chart?
The more that dog owners use the site, the more forum posts they make.
An online toy store keeps a database of all sales. For each purchase, the database includes the following details: the date of the sale; the time of the sale; the method of payment (credit/PayPal); the total amount paid; and a list of the items sold. The toy store manager asks the database administrator for a report on various sale metrics. Which of these metrics cannot be reported from the sales database?
The most expensive item sold.
Shameeka is setting up a computing system for predicting earthquakes based on processing data from seismographs (devices that record earth movements). The system will start off with data from local seismographs but eventually handle millions of data points from seismographs worldwide. For her system to work well, what is an important feature?
The system must be scalable.
StackOverflow is a popular question & answers site. Each time a user asks a new question, they insert a row in a database table. Each row contains: the user ID; the user display name; the timestamp of the question; the text of the question; and the spam score of the question (0-5). The team wants to display question statistics on an internal dashboard. Which statistic cannot be calculated from the table of questions?
The user ID with the most number of unanswered questions
Stephanie is researching the friendliness of U.S. cities. To determine the friendliest and unfriendliest cities, she conducts a nationwide survey. She then collects data about each of the cities, to try to understand what factors are related to a city's friendliness, and visualizes the data in scatter plots. Which conclusions can Stephanie make from the data?
There is a stronger correlation between latitude and friendliness than between population and friendliness.
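On a scatter plot, a stronger correlation shows up as points that hug a line more tightly. The same comparison can be made numerically with the Pearson correlation coefficient, where values nearer to 1 or -1 mean a stronger linear relationship. The sketch below uses made-up numbers, not Stephanie's survey data, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up values for five imaginary cities (illustration only).
friendliness = [1, 2, 3, 4, 5]
latitude = [25, 30, 35, 40, 45]      # rises steadily with friendliness
population_rank = [5, 1, 4, 2, 3]    # no clear pattern

r_latitude = pearson_r(latitude, friendliness)
r_population = pearson_r(population_rank, friendliness)
print(r_latitude, r_population)
```

Comparing the absolute values of the two coefficients gives the same kind of conclusion as the answer above: the variable whose coefficient is further from zero has the stronger correlation with friendliness.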
A mood tracking app decides to help users understand their mood changes better by also tracking the hours they spend on other applications. This chart visualizes the results for a video watching app, using a scatter plot to compare each user's hours spent in the app to their mood after exiting the app: Which hypothesis is most consistent with the chart?
Users who watch amusing videos generally feel less happy the more they watch.
Lakeisha is developing a program to process data from smart sensors installed in factories. The thousands of sensors produce millions of data points each day. When she ran her program on her computer, it took 10 hours to complete. Which of these strategies are most likely to speed up her data processing? 👁️Note that there are 2 answers to this question.
Using parallel computing on a computer with a multi-core CPU. Distributing the computing to multiple machines to run the program on subsets of the data.
OpenPowerLifting is an organization that tracks results in the sport of power lifting and makes the data openly available in a CSV file. Each row contains the following details: name of the competitor; age of the competitor; body weight (kilograms); best bench press (kilograms); best deadlift (kilograms); and equipment used (multi-ply, single-ply, wraps, straps, or raw). Which of the following questions can be answered using the available data? 👁️Note that there may be multiple answers to this question.
What are the names of the oldest and youngest competitors? What is the relationship between best deadlift amount and the equipment used? What is the proportion of competitors using straps as their equipment?
An obstetrics department is studying fetal heartbeat and how it corresponds to a healthy birth. They make audio recordings of the fetal heartbeat at various stages of pregnancy. Along with each recording, they also record metadata. The metadata includes the gestational age of the fetus (in weeks), the age of the mother, the height of the mother and the weight of the mother. Which of these questions can be better answered by analyzing the audio data instead of the metadata? 👁️Note that there are 2 answers to this question.
What is the average heartbeat of a fetus? What is the range in the heartbeat of a fetus?
A "red light camera" is a camera installed at street intersections that records whenever a car runs a red light. The camera records two images, one right before the car enters the intersection, and one after it's entered the intersection. In addition to the images, it records metadata about the incident: the date and time, the intersection location, the speed of the car, and the seconds elapsed past the light turning red. Which of these questions can be better answered by analyzing the metadata instead of the image data? 👁️Note that there are 2 answers to this question.
What is the average speed of a car when it runs a red light? Which intersections have the greatest number of red light runners?
The San Francisco Health Department keeps track of health inspections at restaurants and makes the data publicly available. Each row in the inspections data set contains these details: restaurant name; restaurant address; inspection date; inspection score (0-100); violation description; and risk severity (low/medium/high). Which of the following questions can be answered using the available data? 👁️Note that there may be multiple answers to this question.
Which restaurant has the lowest inspection score? What is the average inspection score for the high risk violations? How many restaurants have an inspection score greater than 90?
The Democratic Republic of Congo is a country located in Central Africa, where many people live in extreme poverty and few have access to the Internet. The following table shows the percentage of residents using the Internet for the years 2013-2017: Assuming the Internet usage keeps growing at a similar rate, what is the most reasonable prediction for Internet usage in 2019 (two years after the last data point)?
9.8%
Two neighboring cities have created data sets of places where people can get their flu shot. The two cities are combining their data sets to create informational campaigns for their residents. Which of the following can be determined from the combined data sets? 👁️Note that there are 2 answers to this question.
The last possible date to get a flu shot. The city that has the most locations open.