Data Science Quizlet

Pataasin ang iyong marka sa homework at exams ngayon gamit ang Quizwiz!

Xpath which outputs cities located between 2E and 2W in format "city: latitude, longitude"

/mondial/country//city[@longitude<2 and @longitude >-2]/concat(name,':',@latitude, ',', @longitutde);

Map/Reduce

A computational paradigm for distributed processing of large-scale datasets. A simple 2-step parallel programming paradigm (Map, Reduce). Takes a big file as input. Not a database system.

Foreign key

An attribute K (or collection of attributes) in relation B that uniquely identifies a type in a relation A. Typically K is a primary key in A.

Horizontal Partitioning

Distributing the rows of a logical relation into several separate tables

Round-robin (Block) partitioning

Each next group of k rows are assigned to the next node. The simplest strategy. uniform distribution

Inter-query parallelism

Each query runs on one node. Several different queries execute concurrently on multiple processors.

Supervised Learning

Explicit data labels. Goal is to learn that, given a sample of data and desired outputs, best approximates the relationship between input and output observable in the data (image tagging, sentiment analysis)

Xquery differs from Xpath because

FLW expresssion

Page rank (definition and equation)

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites Page Rank Formula: On Desktop

Benefits to Relational Model

Physical Data Independence (Oblivious to data organization on disk), Logical Data Independence (Scheme can evolve without affecting the users/applications).

Pros and Cons of Shared Disk Architecture

Pros: Easy to use, maintain and program Cons: Expensive, difficult to scale past a certain configuration, changes to codebase might be needed during migration

Pros and cons of shared-everything architecture

Pros: Easy to use, maintain, and program. Can migrate old codebase without major changes. Cons: Expensive, difficult to upgrade, difficult to scale past a certain configuration

Pros and Cons of Shared Nothing Architecture

Pros: Less contention for resources, the most scalable architecture. Cons: Difficult to administer, configure, and maintain. Changes to codebase required during migration

Inter-operator parallelism

Query runs on multiple nodes in parallel. Each operator runs of one node. enables different operators of the query to be executed in parallel. The more complex the query, the more opportunities there are for inter-operator parallelism.

Ranking Relevance Evaluation: F-Measure and equation

Recall + Precision. Equation on desktop

What can a learning algorithm do?

Recognizing Patterns, Generating Patterns, Recognizing Anomalies, Prediction

Main components of a relation schema:

Relation name, name of each attribute, domain of each field

Hash partitioning

Row r assigned to the node r.K mod P. applies a hash function to some attribute that yields the partition number. This strategy allows exact-match queries on the selection attribute to be processed by exactly one node and all other queries to be processed by all the nodes in parallel. May be skewed if k has the same value for all rows

Write a SQL query that lists the average length of movies played at Landmark Neptune Theatre

SELECT AVG(Length) FROM movies,movies2theaters, theaters WHERE theaters.Name = 'Landmark Neptune Theatre' AND theaters.TheaterCode = movies2theaters.TheaterCode AND movies2theaters.MovieCode = movies.MovieCode

Write a SQL query that shows how many times the Metro Cinemas theater has played a movie that is shorter than 2 hours

SELECT COUNT (*) FROM movies, theaters, movies2theaters WHERE theaters.name='Landmark Metro Cinemas' AND theaters.TheaterCode=movies2theaters.TheaterCode AND movies2theaters.MovieCode=movies.MovieCode AND movies.Length<120;

Company(name, city) Product(pname, maker) Purchase(id, product, buyer) Return cities having companies that manufacturer products bought by John (use nested query)

SELECT Company.city FROM Company WHERE Company.name IN ( SELECT Product.maker FROM Purchase, Product WHERE Product.pname = Purchase.product AND Purchase.buyer = 'John')

Indexing and two files that are created

Take the web pages, create an inverted index, output the inverted index. Split index into two files. Lexicon - hash table on disk (one read), main memory (has the words) Occurrence List: On Disk, distributed file system

Machine Learning

The study of algorithms that: improve their performance P, at some task T, with experience E. A well-defined learning task is given by <P,T,E>

what's a tuple

Tuple: is an element of S1 x S2 x ...x Sn,

is Kmeans supervised or unsupervised

Unsupervised

Normalized Discounted Cumulative Gain (NDCG) and equation

Using a graded relevance scale of documents in a search-engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. equation on desktop

Where can subqueries be found

WHERE clause, FROM clause, HAVING clause (less common)

What a relational Database

a collection of relations

An inverted index in a web search is

a data structure designed to help find web pages relevant to the user query

Dummy Variables

a numerical variable used in regression analysis to represent subgroups of the sample in the study

Sentiment Analysis

a technique that allows marketers to analyze data from social media sites to collect consumer comments about companies and their products

kmeans

an iterative algorithm that tries to partition the dataset into Kpre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the inter-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. · Assign each point to its closest mean · Recompute the means to be the mean of the points assigned to each cluster

MongoDB

an open source document-oriented DBMS

Shared nothing architecture

architecture where nodes communicate over the network

K nearest neighbors does all the work during:

classification time

MongoDB Query Language Find

db.collection.find( <condition>, <projection>).limit (N)

MongoDB Query Language insert

db.collection.insert( <document> )

XPath one-liner which outputs the names and population of the countries with population less than 5 thousand

doc("world.xml")//country[@population<5000]/data((name, @population))

Xpath, which outputs all seas shallower than 200

doc("world.xml")//sea[@depth<200]/data(@name)

Xpath which outputs the population of Guadalajara

doc("world.xml")/mondial//city[contains(.,'Guadalajara')]/data(normalize-space(population))

Xpath that returns cities in Brazil location nearby seashore along with their population

doc("world.xml")/mondial//country[@name='Brazil']//city[located_at[@type='sea']]/data((name, population))

dimensionality reduction

find features that differentiate individuals. methods used to represent data using less columns or features,

FLWR expression which outputs the name and population of countries with population less than 5 thousand in order of descending population

for $x in doc("world.xml")/mondial/country where $x/@population < 5000) order by xs:int($x/@population) descending return <name> {$x/@name, $x/@population}</name>

FLWR expression which outputs a list of capitals of countries, where national inflation is greater than 10

for $x in document("world.xml")//country[@inflation>10] where $x/@capital=$x//city/@id return ($x//city/name/text())

FLWR expression which outputs a list of river names in ascending order

for $x in document("world.xml"//river/data(@name) order by $x ascending return $x

FLWR expression which outputs the population for each city in Spain

for $x in document("world.xml"/mondial//country[@name='Spain']//city/data(population) return $x

Ranking Relevance Evaluation: Precision

fraction of the documents retrieved that are relevant to the user's information need. (Relevant and Retrieved)/Retrieved

Ranking Relevance Evaluation: Recall

fraction of the documents that are relevant to the query that are successfully retrieved. (Relevant and Retrieved)/relevant

Goal of Data Science

improve decision making by extracting insights from large, heterogenous data sets.

Bayes Naive Algorithm assumes the features are

independent (but conditioned on the class)

Pagerank (static feature)

is an algorithm to calculate rank of a web page in a web graph

Reinforcement learning

label on the result of a sequence of actions, but not on each action (game, robotics). concerned with how software agents ought to take actions in an environment in order to maximize some notion of cumulative reward.

String similarity measure

number of characters you need to change in one string to modify into another string

A ranking feature to help rank search results:

page rank

A web page is a good authority if

pointed to by many good hubs

A web page is a good authority if it is

pointed to by many good hubs

A web page is a good hub

points to many good authorities

Data becomes dirty from

primary key violations, diff data representations, formatting differences

Parallel Database

seeks to improve performance through parallelization of various operations, such as loading data, building indexes and evaluating queries. The distribution is governed solely by performance considerations. Parallel databases improve processing and input/output speeds by using multiple CPUs and disks in parallel. In parallel processing, many operations are performed simultaneously.

Range Partitioning

selects a partition by determining if the partitioning key is within a certain range. An example could be a partition for all rows where the "zipcode" column has a value between 70000 and 79999. May be skewed - difficult to partition k into equal width chunks.

XML data is:

semi-structured

Parallel-Join

split the pairs to be tested over several processors. Each processor computes part of the join, and then the results are assembled (merged).

Stragglers in map-reduce

take a long time to finish

Stragglers

tasks that take very long time to complete and drag overall job progress (bad sectors on disk -> significant read/write performance degradation, scheduler assigned too many tasks on a single machine)

Omitted Variable Bias

the specific type of bias that results from the failure to include a variable that belongs in our regression model. We assume changes in the dependent variable that are correlated with the explanatory variable are because of the explanatory variable. We assume that changes in the dependence variable that are not explained by the explanatory variables is "noise."

clustering

the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. EX: find groups of customers with similar tastes, find topics within a set of news articles

Weakly/Distantly supervised

using noisy labels or partial labels (bootstrapping, automatically labeled data) This approach alleviates the burden of obtaining hand-labeled data sets, which can be costly or impractical. Instead, inexpensive weak labels are employed with the understanding that they are imperfect, but can nonetheless be used to create a strong predictive model

Relational Algebra expression for "Addresses and phone numbers of theaters that play 'Frozen' and query plan

Πtheaters.Address, theaters.Phone ((theaters ⋈ movies2theaters ⋈ (σTitle = "Frozen" (movies))) tree on desktop

Reduce in Map/Reduce

Sum, Filter, Transform, Aggregate. hich performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)

Naiive Bayes practice session Positive Reviews: 1. Not bad, it is very fast 2. This is an awesome laptop 3. it has exceeded my expectations Negative Reviews 1. The fan noise is very bad. 2. Turned out to be factory defect. 3. You are going to hate it! P(+) = 1/2, P(-) = 1/2 Predict the class for 1. It is very bad. 2. Hate the noise P(+|words) P(-|words)

1. It is very bad. P(+|words) = 2/81. P(-|words) = 1/162 2. Hate the noise P(+|words) = 0, P(-|words) = 1/54

Google File System (GFS)

A distributed file system Google developed to meet rapidly increasing storage requirements; offers scalability to petabyte storage, the capability to handle extremely large files, data storage on distributed commodity servers, and simultaneous file access by multiple distributed applications. each file is N blocks (>= 64 MB), Blocks are randomly distributed among machines and replicated for fault tolerance. File system is optimized for append.

What is a relation

A subset of S1 x S2 x ...x Sn, where si is Domain of attribute i ' [1,n], and n is a number of attributes of R. A set of tuples

K-Nearest Neighbor

All work is done at classification time. Arguably the simplest ML algorithm. "Non-parametric" - no assumptions about the form of the classification model. Works with tiny amounts of training data. Assumes similar things exist in close proximity. The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increase.

Broadcast Join

Broadcast join can be very efficient for joins between a large table (fact) with relatively small tables. Do not reshuffle R or S, Replicate S to all nodes.

Inverted index for the following: Doc with ID 1 = Data Science Studies methods of recording, storing, and deriving insights from data. Doc w ID 2 = Machine learning studies algorithms that learn from data and improve with experience. Lexi Columns = Word, Ndocs, PTR Occurrence list columns = DocId, occurs, Pos1, Pos2

Check phone

Semi-supervised learning

Combining large amount of unlabeled with smaller amount of labelled (pre-training)

Map in Map/Reduce

Computes a function based on the information read from each record in a file. Performs filtering and sorting (such as sorting students by first name into queues, one queue for each name).

Web Search - crawling

Crawl the web, store documents, check for duplicates, extract links -> web pages

Data Science Workflow

Data collection, Data processing, exploration/visualization, Analysis/machine learning, insight/policy decision

shared nothing architecture

Many nodes share nothing; each node has its own memory and disk. Nodes do not share memory or storage. Ex: HP Vertica running on a cluster. each update request is satisfied by a single node (processor/memory/storage unit).

Shared Disk Architecture

Many nodes which share the same disk(s) Ex: Oracle running on a cluster. All disks are accessible from all cluster nodes.

what takes into account order of returned search results

NDCG (Normalized discounted cumulative gain)

Unsupervised

No explicit labels. Its goal is to infer the natural structure present within a set of data points. "Finding structure in data" which is "exploratory analysis" · Used extensively for "pre-training" (dimensionality reduction, language modeling, auto-encoding) (clustering, dimensionality reduction)

What is shared-everything architecture

Nodes share disk and memory, Up to hundreds of CPUs. Ex: MySQL running on a single powerful machine.

Goal of classification

One goal: P(Y|X) where Y is label and X is features. P(email is spam | words in the message)

Intra-operator parallelism

Operator runs on multiple nodes in parallel Ex: Join running on several nodes. Intra-operator parallelism is based on the decomposition of a relational operator into a set of independent operator instances, each processing a different relation partition.

Gradient Descent

Optimization algorithm for finding the input to a function that produces the optimal value; iterative

Oversimplified model of ML and with more detail

Oversimplified Machine Learning: Data -> Model -> Goal/Task In more detail Machine Learning: Training Text Documents, Images, Sounds -> Features vectors, Labels -> Machine Learning Algorithm -> Predictive Model, Expected Label. Also, New Text Document Image, sound, features vector -> Predictive Model -> Expected Label.

Formula for likelihood estimation

P(C)Pi(n, I=1) P(xi, c)

how to declare key constraints when creating a table

PRIMARY KEY (sno, pno) FOREIGN KEY(pno) REFERENCES Part

SQL query that lists title, length, and movie time of movies that are played in the Cineplex Odeon Uptown theaters and relational algebra and tree

SELECT DISTINCT movies.Title, movies.Length, movies2theaters.movietime FROM movies, movies2theaters, theaters WHERE movies.MovieCode = movies2theaters.MovieCode AND movies2theaters.TheaterCode = theaters.TheaterCode AND theaters.Name = "Cineplex Odeon Uptown"; 3. Πmovies.Title, movies.Length, movies2theaters.movietime((movies ⋈ movies2theaters ⋈ (σname = "Cineplex Odeon Uptown" (theaters))) tree on desktop

Write a SQL query that lists the name and address of all theaters showing less than five different movies

SELECT DISTINCT theaters.Name, theaters.Address FROM theatersNATURAL JOIN movies2theaters GROUP BY theaters.NameHAVING COUNT(DISTINCT movies2theaters.MovieCode) < 5;

Write SQL and Relational Algebra that lists only the length of all movies with an R rating

SELECT Length FROM movies WHERE Rating ="r' Πlength σrating = 'R' (Movies)

SQL query that lists theater names, phone numbers and the number of different movies that are shown in these theaters sorted by this number (of different movies) in descending order

SELECT theaters.Name, theaters.Phone, COUNT(DISTINCT movies2theaters.MovieCode) FROM theaters NATURAL JOIN movies2theaters GROUP BY theaters.Name ORDER BY COUNT(DISTINCT movies2theaters.MovieCode) DESC;

Write a SQL query that lists all theaters in Capitol Hill District that play the longest R-Rated movie and out their names (in ascending order) and telephone numbers

SELECT theaters.Name,theaters.Phone FROM ((((districts INNER JOIN theaters2districts ON districts.DistrictCode=theaters2districts.DistrictCode)INNER JOIN movies2theaters ON theaters2districts.TheaterCode=movies2theaters.TheaterCode)INNER JOIN theaters ON movies2theaters.TheaterCode= theaters.TheaterCode) INNER JOIN movies ON movies2theaters.MovieCode= movies.MovieCode) WHERE districts.Name='Capitol Hill' AND Rating='R' GROUP BY theaters.Name HAVING MAX(movies.Length) ORDER BYtheaters.Name;

Write SQL and Relational Algebra that lists only the phone number and district name of Boeing Imax Theater

SELECT theaters.Phone, districts,Name FROM theaters INNER JOIN theaters2districts ON theaters.TheaterCode = theaters2districts.TheaterCode INNER JOIN districts ON districts.DistrictCode = theaters2districts.District Code WHERE theaters.Name = "Boeing Imax Theater" Πtheaters.Phone, district.Name(( σname = 'Boeing Imax Theater' (theaters)) ⋈theaters2districts ⋈ district)

Extended Relation Algebra. Duplication elimination, grouping, sorting. What are the operators?

Screenshot on laptop

Web Search Static and Dynamic Ranking Features

Static: Spam, Pagerank Dynamic: Term matching, Headings, Fonts, Anchortext, Title/Summary/Keywords/Body

What is a result in statistical linear regression vs a result in machine learning linear regression

Statistics: A "result" is typically in the form of a significant relationship and/or practically relevant effect size ML: A "result" is typically in the form of an improvement in prediction performance on a (held out) test set Regression in Statistics, makes claims about whether there is a meaningful relationship between X and Y Regression in ML, Focused on prediction accuracy; exploiting correlation is totally fine

While training linear regression this is minimized:

Sum of square errors


Kaugnay na mga set ng pag-aaral

Marketing Chapters 9, 12, 13, 14, 15, & 16 Multiple Choice Questions

View Set

HTML Tutorial 1 (test study material)

View Set

Which OSI layer contains/does this?

View Set

Tostevin World since 1500 S1 Final: Terms

View Set