Data Science (ds100) Midterm 1

¡Supera tus tareas y exámenes ahora con Quizwiz!

What is data cleaning?

transforming data into a format that can be used by the data scientist for analysis

What is the data science lifecycle?

1. Ask a question 2. Obtain relevant data 3. Understand the data 4. Propose and inference and prediction

what are the different join types? (for SQL joins)

1. Inner 2. Left 3. Right 4. Outer

what are the different kinds of data?

1. Quantitative Data 2. Categorical Data a) ordinal b) nominal

what are key properties of data to consider when collecting data sets for analysis (during EDA/Sci. method) Hint(5)

1. Structure 2. Granularity 3. Scope 4. Temporality 5. Faithfulness

describe these matplotlib functions 1. plt.title 2. plt.xlabel/ylabel 3. plt.grid 4. plt.legend 5. plt.imshow

1. param takes a string that will become the title of the graph 2. labels x and y 3. overlays background with a grid 4. shows legend if multiple lines are plotted on the same graph

What is a loss function?

A function that characterizes the error of a the model you build to interpret your data (like best predicting a dataset by yielding a single value) (where in this case the model we're choosing is a single parameter that we use to guess the entire distribution) ex. Mean squared error

what is a convex set? what is a non-convex set?

A polygon where a line segment between two points on the boundary never leaves the boundary

what is a model?

A way to represent a system

what is a simple random sample?

Analogy = population of animals randomly selecting individual animals when constructing a sample (dataset) where sampled animals =< total population Mathematically: every combination of individual datum has the same chance as appearing as another sequence of the same size

What does mathplotlib.plt.plot do?

Can plot a single array, Can also scatter multiple data arrays see image

what are the two sublanguages of SQL

Data Definition Language - writing schemas when building data Data Manipulation language - queries to fetch data

Considering data provenance (how data was generated) how do data scientists get information about a field or population of interest?

Data scientists collect samples on the desired population

True or False, when a field with null values is aggregated the null value is considered during aggregation? True or False, when a table is aggregate records with null values are considered

False True

True or false: More data is an indication that your model is more accurate?

False, a simple random sample tends to be more accurate than a biased data set with more data points

True or false: T = Null is False in SQL

False. T = Null is Null if you want to compare booleans with nulls must us the IS or IS NOT comparators

Concerning data structure what are keys? what is a primary key? what is a Foreign key

Fields in a record that uniquely identify that record (see example with ID) Primary keys are unique ids found within the table they're defined Foreign keys are references to primary keys in another table

What is full join and how is it different from cartesian join (default)

Full join still joins on a field but places nulls if the data cannot be found. Cartesian join creates all possible parings between tables

What is the huber loss function?

It's a mix of the squared loss and absolute loss huber loss on a single parameter = • 1/2(y-guess)^2 if |y-alpha| < alpha • alpha(|y-guess| - alpha / 2) o.w (avg. loss on entire dataset) minimize expected loss on all values y can take on.

what is a database?

It's a place to store organized data

Describe the output of the SQL bool comparsion T and Null T or Null F and Null F or Null f(Null, Null)

Null True False Null Null

what is data structure? What are the four different data formats discussed in class?

The fields and attributes of a record as well as the number of records. 1. TSV, 2. CSV, 3. JSON, 4. HTML/XML

True or False: Is it possible to group a dataframe and then aggregate by multiple functions? If true write the code that would group a dataset, dset, by field and agg for count mean and var.

True dset.groupby('field').agg(['count', 'var', 'mean'])

what is data science?

Using data to form experiments and test hypothesis (using the scientific method)

what is gradient descent intuitively?

When the loss function defined on a particular dataset is defined by multiple parameters and we calculated the gradient loss (partial losses in all dimensions of the loss) We then move in the direction that decreases the gradient Stop when we reach zero gradient

what does a, b equal and what do they mean? a, b = plt.subplots()

a = fig b = ax

what is a pandas series?

a column of a data frame that contains data of the same type

what is seaborn?

a python visualization library built on top of matplotlib

what is absolute loss?

abs. loss on a single param. = |(y - guess)| Minimizing abs. loss is E[|(y-guess)|] by deciding what guess is (guess is a variable) (avg. loss on entire dataset) in the graph theta represents the choices for guess

what is a dataframe in pandas?

an object in pandas that holds data in tabular format (rows and columns)

Create a table, Sailors, listing the sid, sname, and age enforcing the logical type in SQL. After creating the this table insert a "Bill Gates, 1, 20" into the table

create table Sailors ( sid integer sname char(20) age real ); insert into Sailors (sid, sname, age) values (Bill Gates, 1, 20)

what is the DOM

document object model. It's the hierarchy of how an html webpage is formatted and xml is stored

If you wanted to determine what the correlation between two fields x, y in a dataset were what would be the command in pandas? write the command to generate a linear regression of the correlation Note that seaborn under the hood does this for you

ds.loc[['x', 'y]].corr() from scipy import stats ds.apply(lambda df: stats.linearregress(df.x, df.y)

what is exploratory data analysis

falls in place with what data science is. Use data to understand and hypothesize about a question of interest (scientific method)

what does plt.gca() do?

get current axes

what does DISTINCT do in SQL?

get's unique results by removing duplicates

What is data granularity?

how aggregated the data becomes The finer the data the more representative of an individual datum the coarser the data the less representative of an individual datum

what is the group by method in pandas? How does aggregation come into play here?

it groups data by data in a specific column Aggregation then combines the data once aggregated the results are fused into a single table

what kind of language is SQL

it's a declarative language (programs describe desired results rather than writing commands on HOW to acquire results) see image for example

what is pandas

it's a module for manipulating a dataset

what is a join statement in SQL? If no join type is specified the outer product is executed Rows from first table combined with every row from the second table

it's a way to combine two different datasets

What is a kernel density estimator?

it's an estimate for a given data point when compared to other data points its used when converting from a descriptive plot to an inferential plot ex. see image

what is a gaussian kernal density estimator?

p(x) = the probability of a single data point based on the uniform normalized Gaussian for every data point

write the code that would read a csv file in pandas

pd.read_csv

what is a database management system?

platforms that allow for interacting with large collections of data Core functions: 1. Data Storage 2. Data Management 3. Data interaction (access)

what are inferential plots?

plots that draw conclusions beyond the scope of the data for example, if there's no data point for x10 but using the plot we make a prediction for x10

what are descriptive plots?

plots that exactly describe the data

what is the general format a query in SQL? What's the conceptual SQL pipeline?

see image

what is sns.lmplot?

see img

write a query that joins all values in tables A and B on id (common in both) but with records with the same id (from both tables) are excluded

select * from A full outer join B on A.id = B.id where A.id is Null or B.id is Null this selects records where there was no match in both tables

write two queries for two tables, A and B that inner joins the two tables by id (a field that's common in both tables) conceptually what's the idea behind inner joins?

select * from A, B where A.id = B.id or select * from A inner join B on A.id = B.id conceptually preserves rows where A.id and B.id are found in both columns (if there are ids that are unique to just A (and possibly just B) those rows are dropped)

write a query for two tables Right as R and Left as L that left joins the tables. conceptually what's the idea behind left joins

select * from L left join R on L.id = R.id conceptually we join records in L on records in R based on a specified field. If there is no matching field found when joining a record from L to a record in R we don't drop that record rather we preserve it with nulls for the values where a join could've been found

what is squared loss

squared loss on a single para. = (y - guess) ^2 Minimizing squared loss is E[(y-guess) ^2] by deciding what guess is (guess is a variable) (best guess is the mean) (in the graph theta represents the choices for guess)

what does sns.pairplot(ds, hue='field') do?

takes all fields except the entry to to hue and creates pairs of the fields and plots them. see example

Ver todos los conjuntos de estudio

Data Science (ds100) Midterm 1

Conjuntos de estudio relacionados

ECON 102

Chapter 14: Drugs Used to Treat Neurodegenerative Disorders

Intro leisure travel final exam Chapter 11,12

Chapter 4 fundies

ECON 102 FINAL Q'S

Daily Deductions

Chapter 11

Business Law MidTerm

Life Insurance Policies - Provisions, Options and Riders(2)

Basic Anatomy & Physiology of the Human Body

Chapter 26 - Behavioral Emergencies

Chapter 10

Insurance Review (Red Mod- Fortis)

World History - Chapter 16, Lesson 2

Principles of Bio Mid Term Study

10-4

Business Law: Chapter Questions for Exam #3

H5. Assessment Methods for Selection and Placement. Industrial and organizational Psychology

Frequently asked: Node JS Interview Questions and Answers

EBF 200 RANDOM FINAL MATERIAL