APCSP Khan Academy QZ
Encryption
A process of encoding messages to keep them secret, so that only "authorized" parties can read them.
Decryption
A process that reverses encryption, taking a secret message and reproducing the original plain text.
Caesar's Cipher
A technique for encryption that shifts the alphabet by some number of characters.
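As a minimal sketch of the shifting idea, the Python below moves each letter forward by a fixed number of positions and wraps around at the end of the alphabet. The function name `caesar_shift`, the uppercase-only alphabet, and the choice to leave non-letters unchanged are assumptions made for illustration.

```python
import string

ALPHABET = string.ascii_uppercase  # assumption: 26 uppercase English letters


def caesar_shift(text: str, shift: int) -> str:
    """Shift each letter by `shift` positions, wrapping around the alphabet."""
    result = []
    for ch in text.upper():
        if ch in ALPHABET:
            result.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            result.append(ch)  # spaces and punctuation are left unchanged
    return "".join(result)


secret = caesar_shift("HELLO WORLD", 3)
original = caesar_shift(secret, -3)  # shifting back by the same amount decrypts
print(secret)    # KHOOR ZRUOG
print(original)  # HELLO WORLD
```

Decryption here is just the same shift applied in the opposite direction.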
An online essay writing website decides to implement a plagiarism detection system, after several teachers report that their students submitted suspicious essays on the site. The website engineering team is considering a number of ways to detect plagiarism. Which of these plagiarism detection algorithms would benefit the most from big data?
An algorithm that computes the similarity of the wording in the student's essay to all other essays on the site and elsewhere on the web.
Spam emails are unsolicited messages sent in bulk, typically for advertising or phishing purposes. Email providers typically include a spam detection system to automatically label and hide emails that look like spam. Which of these spam detection algorithms would benefit the most from big data?
An algorithm that computes the spam likelihood by computing the similarity of an email to other spam emails.
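One possible way to picture a similarity-based detector is sketched below. It scores an email by its highest word overlap (Jaccard similarity) with a list of known spam messages. The similarity measure, the function names, and the sample messages are all assumptions for illustration; real spam filters are far more sophisticated.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Fraction of distinct words shared by two messages (0.0 to 1.0)."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def spam_score(email: str, known_spam: list[str]) -> float:
    """Highest similarity between the email and any known spam message."""
    return max((jaccard_similarity(email, s) for s in known_spam), default=0.0)


# Hypothetical examples; a real system would compare against millions of messages.
known_spam = ["win a free prize now", "claim your free prize today"]

score_spammy = spam_score("free prize waiting claim now", known_spam)
score_normal = spam_score("meeting notes for tuesday", known_spam)
print(score_spammy > score_normal)  # True
```

The more known spam the algorithm can compare against, the more likely a new spam email resembles something already seen, which is why this approach benefits from big data.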
Company recruiters use applicant tracking systems to keep track of the resumes that candidates send in for the job. Many applicant tracking systems use algorithms to automatically rank the resumes, to help recruiters sift through large quantities of resumes. Which of these algorithms would benefit the most from big data?
An algorithm that is trained on resumes from already hired applicants and ranks based on similarity to those resumes.
Random Substitution Cipher
An encryption technique that maps each letter of the alphabet to a randomly chosen other letter of the alphabet.
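A rough sketch of this mapping in Python follows. It builds the key by shuffling the alphabet, so every letter gets exactly one substitute and the mapping can be reversed. The names `make_key` and `substitute` are hypothetical, and as a simplification a plain shuffle may occasionally map a letter to itself.

```python
import random
import string


def make_key(seed=None) -> dict:
    """Build a random one-to-one mapping from each letter to another letter."""
    letters = list(string.ascii_uppercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)  # seed only makes the demo repeatable
    return dict(zip(letters, shuffled))


def substitute(text: str, key: dict) -> str:
    """Replace every letter using the key; other characters pass through."""
    return "".join(key.get(ch, ch) for ch in text.upper())


key = make_key(seed=42)
inverse_key = {cipher: plain for plain, cipher in key.items()}

secret = substitute("MEET AT NOON", key)
original = substitute(secret, inverse_key)  # the inverted key decrypts
print(original)  # MEET AT NOON
```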
A researcher is granted access to a large data set from a hospital's obstetrics department. The hospital hopes the researcher can both compute basic statistics about the data and find interesting patterns in the data. Which of these analyses is an example of searching for patterns in the data set?
Calculating the likelihood that a baby will be born premature, based on the similarity of the fetus and mother to others.
A national bank opts to use machine learning for deciding whether to award loans to applicants. The engineers create the algorithm by training a neural network on their large database of previous loan applications and decisions (made by loan officers). After they start using the algorithm for new loan applicants, they receive complaints that their algorithm must be biased, because all the loan applicants from a particular zip code are always denied. What is the most likely explanation for the algorithm's bias?
For that zip code, the training data set only has loan applications that were denied.
Andy is using machine learning for an algorithm that classifies photos of restaurant meals (like "sandwich", "curry", "salad"). He trains a neural network on a large open database of photos of restaurant meals. He then tests the network on local restaurants and notices that the Ethiopian restaurant meals aren't classified correctly. What's the best way to improve the machine learning algorithm's ability to recognize Ethiopian meals?
He can add Ethiopian meals to the training data set, by finding photos online, crowd-sourcing, or taking them himself.
A hospital IT department is determining how much data storage capacity they will need to store electronic health records for patients. They start by making a list of the type of data that comes from each department:
Department | Data | Format & average size
Primary care | Notes from patient chats with doctors | 3 paragraphs per visit
Laboratory | Test results | A table with 20 rows and 3 columns
Radiology | Imagery from scans (CT/PET/MRI) | 64 1024x1024 grayscale images
Pharmacy | Medication prescriptions | Patient name/ID, medicine name, data
Which type of data is likely to require the most data storage capacity?
Imagery from scans (CT/PET/MRI)
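A back-of-the-envelope calculation suggests why the imagery dominates. The sketch below assumes uncompressed 8-bit grayscale (1 byte per pixel) and roughly 500 one-byte characters per paragraph of notes; both figures are assumptions, not values given in the question.

```python
IMAGES_PER_SCAN = 64
WIDTH = HEIGHT = 1024
BYTES_PER_PIXEL = 1  # assumption: 8-bit grayscale, no compression

scan_bytes = IMAGES_PER_SCAN * WIDTH * HEIGHT * BYTES_PER_PIXEL
scan_mib = scan_bytes / (1024 * 1024)

# Assumption: 3 paragraphs of about 500 one-byte characters each.
notes_bytes = 3 * 500

print(scan_bytes)   # 67108864
print(scan_mib)     # 64.0
print(notes_bytes)  # 1500
```

Under these assumptions, one set of scans takes tens of thousands of times more space than the notes from a visit.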
A travel website is adding a feature for users to store trip itineraries. Here's a sample itinerary:
Title: Summer trip to Japan
1. Inari shrine (Kyoto, Japan)
2. Iwatayama Monkey Park (Kyoto, Japan)
3. Fushimi Inari Taisha (Kyoto, Japan)
4. Fukui Prefectural Dinosaur Museum (Katsuyama, Japan)
5. Kōtoku-in (Kamakura, Japan)
6. Ghibli Museum (Mitaka, Japan)
7. Tokyo Anime Center (Tokyo, Japan)
They are considering a number of enhancements to the trip itinerary feature, and the engineering team is considering the data storage requirements of the new features. Which feature is likely to require the greatest increase in data storage needs?
Making copies of the user's trip itinerary in 6 data centers around the world
In the modern age, supermarket chains can collect a huge amount of information about customers and product inventory. They can analyze that data to help them understand their customer base better and to make more informed decisions about store layout and marketing campaigns. Which of these analyses is an example of searching for patterns in a large data set?
Predicting the likelihood of any two products being purchased together (e.g. if a customer buys Cheerios, there is a 30% chance they'll buy Strauss 2% milk.)
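To illustrate this kind of pattern search, the sketch below estimates how often one product is bought given that another is in the same basket. The basket data and the function name `purchase_likelihood` are made up for illustration; a real analysis would run over millions of transactions.

```python
# Hypothetical shopping baskets, each a set of product names.
baskets = [
    {"cereal", "milk"},
    {"cereal", "bread"},
    {"cereal", "milk", "eggs"},
    {"bread", "eggs"},
    {"cereal"},
]


def purchase_likelihood(baskets, given, target):
    """Share of baskets containing `given` that also contain `target`."""
    with_given = [b for b in baskets if given in b]
    if not with_given:
        return 0.0  # no evidence either way
    return sum(1 for b in with_given if target in b) / len(with_given)


p = purchase_likelihood(baskets, "cereal", "milk")
print(p)  # 0.5
```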
A high school computer science department wants to better understand how students study for the AP CSP exam. The CSP teachers send a survey to their 6 classes, asking these questions:
What grade are you in?
What CS classes did you take before CSP?
How many hours did you study?
How early did you start studying?
Which study resources did you use?
What did you earn on the exam?
They find the results very interesting, spread the survey to teachers all across the United States, and collate the results in a central database. Which of these analyses is an example of searching for patterns in a large data set?
Predicting the optimal start date and study hours, for a student using a particular study resource and with a certain level of experience.
A team of scientists and engineers is putting together a research project to study whale sounds. In order to develop the infrastructure for the project, they need to first determine how much data storage space their observational data will require. This is an example of a single observation:
Sound | Location | Date/time | Species
3 minute long audio file | 63.776871, -171.742193 | May 27, 2019, 2:23:13 PM | Beluga
The team hopes to capture thousands of whale sounds from all the world's oceans. Which piece of data will increase their data storage needs the most?
Recording of whale sound
Brittany is using machine learning for an algorithm that classifies social media posts according to their sentiment ("positive", "negative", or "neutral"). She trains a neural network on a large open database of social media posts and tests the network on her personal social media feed. She notices that it's mis-classifying the posts from her teenage friends, who use different slang from her other friends. What's the best way that Brittany can improve the machine learning algorithm's ability to classify posts from teenagers?
She can add social media posts from teenagers into the training data set, both from her own network and globally available data.
Cipher
The generic term for a technique (or algorithm) that performs encryption.
Cracking Encryption
When you attempt to decode a secret message without knowing all the specifics of the cipher, you are trying to crack the encryption.
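One simple way to crack a shift cipher without knowing the shift is to try every possibility, as sketched below. The function names and the sample ciphertext are assumptions for illustration; the attacker still has to read the 26 candidates and pick the one that makes sense.

```python
import string

ALPHABET = string.ascii_uppercase


def shift_text(text: str, shift: int) -> str:
    """Shift each letter by `shift` positions, wrapping around the alphabet."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in text.upper()
    )


def brute_force(ciphertext: str) -> list:
    """Return (shift, candidate plaintext) for all 26 possible shifts."""
    return [(shift, shift_text(ciphertext, -shift)) for shift in range(26)]


candidates = brute_force("KHOOR ZRUOG")
for shift, text in candidates[:4]:
    print(shift, text)
# The candidate for shift 3 reads HELLO WORLD.
```

With only 26 possible keys, a shift cipher falls to this kind of exhaustive search almost instantly.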