Data Science Quizlet
Xpath which outputs cities located between 2E and 2W in format "city: latitude, longitude"
/mondial/country//city[@longitude<2 and @longitude >-2]/concat(name,':',@latitude, ',', @longitutde);
Map/Reduce
A computational paradigm for distributed processing of large-scale datasets. A simple 2-step parallel programming paradigm (Map, Reduce). Takes a big file as input. Not a database system.
Foreign key
An attribute K (or collection of attributes) in relation B that uniquely identifies a type in a relation A. Typically K is a primary key in A.
Horizontal Partitioning
Distributing the rows of a logical relation into several separate tables
Round-robin (Block) partitioning
Each next group of k rows are assigned to the next node. The simplest strategy. uniform distribution
Inter-query parallelism
Each query runs on one node. Several different queries execute concurrently on multiple processors.
Supervised Learning
Explicit data labels. Goal is to learn that, given a sample of data and desired outputs, best approximates the relationship between input and output observable in the data (image tagging, sentiment analysis)
Xquery differs from Xpath because
FLW expresssion
Page rank (definition and equation)
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites Page Rank Formula: On Desktop
Benefits to Relational Model
Physical Data Independence (Oblivious to data organization on disk), Logical Data Independence (Scheme can evolve without affecting the users/applications).
Pros and Cons of Shared Disk Architecture
Pros: Easy to use, maintain and program Cons: Expensive, difficult to scale past a certain configuration, changes to codebase might be needed during migration
Pros and cons of shared-everything architecture
Pros: Easy to use, maintain, and program. Can migrate old codebase without major changes. Cons: Expensive, difficult to upgrade, difficult to scale past a certain configuration
Pros and Cons of Shared Nothing Architecture
Pros: Less contention for resources, the most scalable architecture. Cons: Difficult to administer, configure, and maintain. Changes to codebase required during migration
Inter-operator parallelism
Query runs on multiple nodes in parallel. Each operator runs of one node. enables different operators of the query to be executed in parallel. The more complex the query, the more opportunities there are for inter-operator parallelism.
Ranking Relevance Evaluation: F-Measure and equation
Recall + Precision. Equation on desktop
What can a learning algorithm do?
Recognizing Patterns, Generating Patterns, Recognizing Anomalies, Prediction
Main components of a relation schema:
Relation name, name of each attribute, domain of each field
Hash partitioning
Row r assigned to the node r.K mod P. applies a hash function to some attribute that yields the partition number. This strategy allows exact-match queries on the selection attribute to be processed by exactly one node and all other queries to be processed by all the nodes in parallel. May be skewed if k has the same value for all rows
Write a SQL query that lists the average length of movies played at Landmark Neptune Theatre
SELECT AVG(Length) FROM movies,movies2theaters, theaters WHERE theaters.Name = 'Landmark Neptune Theatre' AND theaters.TheaterCode = movies2theaters.TheaterCode AND movies2theaters.MovieCode = movies.MovieCode
Write a SQL query that shows how many times the Metro Cinemas theater has played a movie that is shorter than 2 hours
SELECT COUNT (*) FROM movies, theaters, movies2theaters WHERE theaters.name='Landmark Metro Cinemas' AND theaters.TheaterCode=movies2theaters.TheaterCode AND movies2theaters.MovieCode=movies.MovieCode AND movies.Length<120;
Company(name, city) Product(pname, maker) Purchase(id, product, buyer) Return cities having companies that manufacturer products bought by John (use nested query)
SELECT Company.city FROM Company WHERE Company.name IN ( SELECT Product.maker FROM Purchase, Product WHERE Product.pname = Purchase.product AND Purchase.buyer = 'John')
Indexing and two files that are created
Take the web pages, create an inverted index, output the inverted index. Split index into two files. Lexicon - hash table on disk (one read), main memory (has the words) Occurrence List: On Disk, distributed file system
Machine Learning
The study of algorithms that: improve their performance P, at some task T, with experience E. A well-defined learning task is given by <P,T,E>
what's a tuple
Tuple: is an element of S1 x S2 x ...x Sn,
is Kmeans supervised or unsupervised
Unsupervised
Normalized Discounted Cumulative Gain (NDCG) and equation
Using a graded relevance scale of documents in a search-engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. equation on desktop
Where can subqueries be found
WHERE clause, FROM clause, HAVING clause (less common)
What a relational Database
a collection of relations
An inverted index in a web search is
a data structure designed to help find web pages relevant to the user query
Dummy Variables
a numerical variable used in regression analysis to represent subgroups of the sample in the study
Sentiment Analysis
a technique that allows marketers to analyze data from social media sites to collect consumer comments about companies and their products
kmeans
an iterative algorithm that tries to partition the dataset into Kpre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the inter-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. · Assign each point to its closest mean · Recompute the means to be the mean of the points assigned to each cluster
MongoDB
an open source document-oriented DBMS
Shared nothing architecture
architecture where nodes communicate over the network
K nearest neighbors does all the work during:
classification time
MongoDB Query Language Find
db.collection.find( <condition>, <projection>).limit (N)
MongoDB Query Language insert
db.collection.insert( <document> )
XPath one-liner which outputs the names and population of the countries with population less than 5 thousand
doc("world.xml")//country[@population<5000]/data((name, @population))
Xpath, which outputs all seas shallower than 200
doc("world.xml")//sea[@depth<200]/data(@name)
Xpath which outputs the population of Guadalajara
doc("world.xml")/mondial//city[contains(.,'Guadalajara')]/data(normalize-space(population))
Xpath that returns cities in Brazil location nearby seashore along with their population
doc("world.xml")/mondial//country[@name='Brazil']//city[located_at[@type='sea']]/data((name, population))
dimensionality reduction
find features that differentiate individuals. methods used to represent data using less columns or features,
FLWR expression which outputs the name and population of countries with population less than 5 thousand in order of descending population
for $x in doc("world.xml")/mondial/country where $x/@population < 5000) order by xs:int($x/@population) descending return <name> {$x/@name, $x/@population}</name>
FLWR expression which outputs a list of capitals of countries, where national inflation is greater than 10
for $x in document("world.xml")//country[@inflation>10] where $x/@capital=$x//city/@id return ($x//city/name/text())
FLWR expression which outputs a list of river names in ascending order
for $x in document("world.xml"//river/data(@name) order by $x ascending return $x
FLWR expression which outputs the population for each city in Spain
for $x in document("world.xml"/mondial//country[@name='Spain']//city/data(population) return $x
Ranking Relevance Evaluation: Precision
fraction of the documents retrieved that are relevant to the user's information need. (Relevant and Retrieved)/Retrieved
Ranking Relevance Evaluation: Recall
fraction of the documents that are relevant to the query that are successfully retrieved. (Relevant and Retrieved)/relevant
Goal of Data Science
improve decision making by extracting insights from large, heterogenous data sets.
Bayes Naive Algorithm assumes the features are
independent (but conditioned on the class)
Pagerank (static feature)
is an algorithm to calculate rank of a web page in a web graph
Reinforcement learning
label on the result of a sequence of actions, but not on each action (game, robotics). concerned with how software agents ought to take actions in an environment in order to maximize some notion of cumulative reward.
String similarity measure
number of characters you need to change in one string to modify into another string
A ranking feature to help rank search results:
page rank
A web page is a good authority if
pointed to by many good hubs
A web page is a good authority if it is
pointed to by many good hubs
A web page is a good hub
points to many good authorities
Data becomes dirty from
primary key violations, diff data representations, formatting differences
Parallel Database
seeks to improve performance through parallelization of various operations, such as loading data, building indexes and evaluating queries. The distribution is governed solely by performance considerations. Parallel databases improve processing and input/output speeds by using multiple CPUs and disks in parallel. In parallel processing, many operations are performed simultaneously.
Range Partitioning
selects a partition by determining if the partitioning key is within a certain range. An example could be a partition for all rows where the "zipcode" column has a value between 70000 and 79999. May be skewed - difficult to partition k into equal width chunks.
XML data is:
semi-structured
Parallel-Join
split the pairs to be tested over several processors. Each processor computes part of the join, and then the results are assembled (merged).
Stragglers in map-reduce
take a long time to finish
Stragglers
tasks that take very long time to complete and drag overall job progress (bad sectors on disk -> significant read/write performance degradation, scheduler assigned too many tasks on a single machine)
Omitted Variable Bias
the specific type of bias that results from the failure to include a variable that belongs in our regression model. We assume changes in the dependent variable that are correlated with the explanatory variable are because of the explanatory variable. We assume that changes in the dependence variable that are not explained by the explanatory variables is "noise."
clustering
the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. EX: find groups of customers with similar tastes, find topics within a set of news articles
Weakly/Distantly supervised
using noisy labels or partial labels (bootstrapping, automatically labeled data) This approach alleviates the burden of obtaining hand-labeled data sets, which can be costly or impractical. Instead, inexpensive weak labels are employed with the understanding that they are imperfect, but can nonetheless be used to create a strong predictive model
Relational Algebra expression for "Addresses and phone numbers of theaters that play 'Frozen' and query plan
Πtheaters.Address, theaters.Phone ((theaters ⋈ movies2theaters ⋈ (σTitle = "Frozen" (movies))) tree on desktop
Reduce in Map/Reduce
Sum, Filter, Transform, Aggregate. hich performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)
Naiive Bayes practice session Positive Reviews: 1. Not bad, it is very fast 2. This is an awesome laptop 3. it has exceeded my expectations Negative Reviews 1. The fan noise is very bad. 2. Turned out to be factory defect. 3. You are going to hate it! P(+) = 1/2, P(-) = 1/2 Predict the class for 1. It is very bad. 2. Hate the noise P(+|words) P(-|words)
1. It is very bad. P(+|words) = 2/81. P(-|words) = 1/162 2. Hate the noise P(+|words) = 0, P(-|words) = 1/54
Google File System (GFS)
A distributed file system Google developed to meet rapidly increasing storage requirements; offers scalability to petabyte storage, the capability to handle extremely large files, data storage on distributed commodity servers, and simultaneous file access by multiple distributed applications. each file is N blocks (>= 64 MB), Blocks are randomly distributed among machines and replicated for fault tolerance. File system is optimized for append.
What is a relation
A subset of S1 x S2 x ...x Sn, where si is Domain of attribute i ' [1,n], and n is a number of attributes of R. A set of tuples
K-Nearest Neighbor
All work is done at classification time. Arguably the simplest ML algorithm. "Non-parametric" - no assumptions about the form of the classification model. Works with tiny amounts of training data. Assumes similar things exist in close proximity. The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increase.
Broadcast Join
Broadcast join can be very efficient for joins between a large table (fact) with relatively small tables. Do not reshuffle R or S, Replicate S to all nodes.
Inverted index for the following: Doc with ID 1 = Data Science Studies methods of recording, storing, and deriving insights from data. Doc w ID 2 = Machine learning studies algorithms that learn from data and improve with experience. Lexi Columns = Word, Ndocs, PTR Occurrence list columns = DocId, occurs, Pos1, Pos2
Check phone
Semi-supervised learning
Combining large amount of unlabeled with smaller amount of labelled (pre-training)
Map in Map/Reduce
Computes a function based on the information read from each record in a file. Performs filtering and sorting (such as sorting students by first name into queues, one queue for each name).
Web Search - crawling
Crawl the web, store documents, check for duplicates, extract links -> web pages
Data Science Workflow
Data collection, Data processing, exploration/visualization, Analysis/machine learning, insight/policy decision
shared nothing architecture
Many nodes share nothing; each node has its own memory and disk. Nodes do not share memory or storage. Ex: HP Vertica running on a cluster. each update request is satisfied by a single node (processor/memory/storage unit).
Shared Disk Architecture
Many nodes which share the same disk(s) Ex: Oracle running on a cluster. All disks are accessible from all cluster nodes.
what takes into account order of returned search results
NDCG (Normalized discounted cumulative gain)
Unsupervised
No explicit labels. Its goal is to infer the natural structure present within a set of data points. "Finding structure in data" which is "exploratory analysis" · Used extensively for "pre-training" (dimensionality reduction, language modeling, auto-encoding) (clustering, dimensionality reduction)
What is shared-everything architecture
Nodes share disk and memory, Up to hundreds of CPUs. Ex: MySQL running on a single powerful machine.
Goal of classification
One goal: P(Y|X) where Y is label and X is features. P(email is spam | words in the message)
Intra-operator parallelism
Operator runs on multiple nodes in parallel Ex: Join running on several nodes. Intra-operator parallelism is based on the decomposition of a relational operator into a set of independent operator instances, each processing a different relation partition.
Gradient Descent
Optimization algorithm for finding the input to a function that produces the optimal value; iterative
Oversimplified model of ML and with more detail
Oversimplified Machine Learning: Data -> Model -> Goal/Task In more detail Machine Learning: Training Text Documents, Images, Sounds -> Features vectors, Labels -> Machine Learning Algorithm -> Predictive Model, Expected Label. Also, New Text Document Image, sound, features vector -> Predictive Model -> Expected Label.
Formula for likelihood estimation
P(C)Pi(n, I=1) P(xi, c)
how to declare key constraints when creating a table
PRIMARY KEY (sno, pno) FOREIGN KEY(pno) REFERENCES Part
SQL query that lists title, length, and movie time of movies that are played in the Cineplex Odeon Uptown theaters and relational algebra and tree
SELECT DISTINCT movies.Title, movies.Length, movies2theaters.movietime FROM movies, movies2theaters, theaters WHERE movies.MovieCode = movies2theaters.MovieCode AND movies2theaters.TheaterCode = theaters.TheaterCode AND theaters.Name = "Cineplex Odeon Uptown"; 3. Πmovies.Title, movies.Length, movies2theaters.movietime((movies ⋈ movies2theaters ⋈ (σname = "Cineplex Odeon Uptown" (theaters))) tree on desktop
Write a SQL query that lists the name and address of all theaters showing less than five different movies
SELECT DISTINCT theaters.Name, theaters.Address FROM theatersNATURAL JOIN movies2theaters GROUP BY theaters.NameHAVING COUNT(DISTINCT movies2theaters.MovieCode) < 5;
Write SQL and Relational Algebra that lists only the length of all movies with an R rating
SELECT Length FROM movies WHERE Rating ="r' Πlength σrating = 'R' (Movies)
SQL query that lists theater names, phone numbers and the number of different movies that are shown in these theaters sorted by this number (of different movies) in descending order
SELECT theaters.Name, theaters.Phone, COUNT(DISTINCT movies2theaters.MovieCode) FROM theaters NATURAL JOIN movies2theaters GROUP BY theaters.Name ORDER BY COUNT(DISTINCT movies2theaters.MovieCode) DESC;
Write a SQL query that lists all theaters in Capitol Hill District that play the longest R-Rated movie and out their names (in ascending order) and telephone numbers
SELECT theaters.Name,theaters.Phone FROM ((((districts INNER JOIN theaters2districts ON districts.DistrictCode=theaters2districts.DistrictCode)INNER JOIN movies2theaters ON theaters2districts.TheaterCode=movies2theaters.TheaterCode)INNER JOIN theaters ON movies2theaters.TheaterCode= theaters.TheaterCode) INNER JOIN movies ON movies2theaters.MovieCode= movies.MovieCode) WHERE districts.Name='Capitol Hill' AND Rating='R' GROUP BY theaters.Name HAVING MAX(movies.Length) ORDER BYtheaters.Name;
Write SQL and Relational Algebra that lists only the phone number and district name of Boeing Imax Theater
SELECT theaters.Phone, districts,Name FROM theaters INNER JOIN theaters2districts ON theaters.TheaterCode = theaters2districts.TheaterCode INNER JOIN districts ON districts.DistrictCode = theaters2districts.District Code WHERE theaters.Name = "Boeing Imax Theater" Πtheaters.Phone, district.Name(( σname = 'Boeing Imax Theater' (theaters)) ⋈theaters2districts ⋈ district)
Extended Relation Algebra. Duplication elimination, grouping, sorting. What are the operators?
Screenshot on laptop
Web Search Static and Dynamic Ranking Features
Static: Spam, Pagerank Dynamic: Term matching, Headings, Fonts, Anchortext, Title/Summary/Keywords/Body
What is a result in statistical linear regression vs a result in machine learning linear regression
Statistics: A "result" is typically in the form of a significant relationship and/or practically relevant effect size ML: A "result" is typically in the form of an improvement in prediction performance on a (held out) test set Regression in Statistics, makes claims about whether there is a meaningful relationship between X and Y Regression in ML, Focused on prediction accuracy; exploiting correlation is totally fine
While training linear regression this is minimized:
Sum of square errors