Data Science (ds100) Midterm 1
What is data cleaning?
transforming data into a format that can be used by the data scientist for analysis
What is the data science lifecycle?
1. Ask a question 2. Obtain relevant data 3. Understand the data 4. Propose and inference and prediction
what are the different join types? (for SQL joins)
1. Inner 2. Left 3. Right 4. Outer
what are the different kinds of data?
1. Quantitative Data 2. Categorical Data a) ordinal b) nominal
what are key properties of data to consider when collecting data sets for analysis (during EDA/Sci. method) Hint(5)
1. Structure 2. Granularity 3. Scope 4. Temporality 5. Faithfulness
describe these matplotlib functions 1. plt.title 2. plt.xlabel/ylabel 3. plt.grid 4. plt.legend 5. plt.imshow
1. param takes a string that will become the title of the graph 2. labels x and y 3. overlays background with a grid 4. shows legend if multiple lines are plotted on the same graph
What is a loss function?
A function that characterizes the error of a the model you build to interpret your data (like best predicting a dataset by yielding a single value) (where in this case the model we're choosing is a single parameter that we use to guess the entire distribution) ex. Mean squared error
what is a convex set? what is a non-convex set?
A polygon where a line segment between two points on the boundary never leaves the boundary
what is a model?
A way to represent a system
what is a simple random sample?
Analogy = population of animals randomly selecting individual animals when constructing a sample (dataset) where sampled animals =< total population Mathematically: every combination of individual datum has the same chance as appearing as another sequence of the same size
What does mathplotlib.plt.plot do?
Can plot a single array, Can also scatter multiple data arrays see image
what are the two sublanguages of SQL
Data Definition Language - writing schemas when building data Data Manipulation language - queries to fetch data
Considering data provenance (how data was generated) how do data scientists get information about a field or population of interest?
Data scientists collect samples on the desired population
True or False, when a field with null values is aggregated the null value is considered during aggregation? True or False, when a table is aggregate records with null values are considered
False True
True or false: More data is an indication that your model is more accurate?
False, a simple random sample tends to be more accurate than a biased data set with more data points
True or false: T = Null is False in SQL
False. T = Null is Null if you want to compare booleans with nulls must us the IS or IS NOT comparators
Concerning data structure what are keys? what is a primary key? what is a Foreign key
Fields in a record that uniquely identify that record (see example with ID) Primary keys are unique ids found within the table they're defined Foreign keys are references to primary keys in another table
What is full join and how is it different from cartesian join (default)
Full join still joins on a field but places nulls if the data cannot be found. Cartesian join creates all possible parings between tables
What is the huber loss function?
It's a mix of the squared loss and absolute loss huber loss on a single parameter = • 1/2(y-guess)^2 if |y-alpha| < alpha • alpha(|y-guess| - alpha / 2) o.w (avg. loss on entire dataset) minimize expected loss on all values y can take on.
what is a database?
It's a place to store organized data
Describe the output of the SQL bool comparsion T and Null T or Null F and Null F or Null f(Null, Null)
Null True False Null Null
what is data structure? What are the four different data formats discussed in class?
The fields and attributes of a record as well as the number of records. 1. TSV, 2. CSV, 3. JSON, 4. HTML/XML
True or False: Is it possible to group a dataframe and then aggregate by multiple functions? If true write the code that would group a dataset, dset, by field and agg for count mean and var.
True dset.groupby('field').agg(['count', 'var', 'mean'])
what is data science?
Using data to form experiments and test hypothesis (using the scientific method)
what is gradient descent intuitively?
When the loss function defined on a particular dataset is defined by multiple parameters and we calculated the gradient loss (partial losses in all dimensions of the loss) We then move in the direction that decreases the gradient Stop when we reach zero gradient
what does a, b equal and what do they mean? a, b = plt.subplots()
a = fig b = ax
what is a pandas series?
a column of a data frame that contains data of the same type
what is seaborn?
a python visualization library built on top of matplotlib
what is absolute loss?
abs. loss on a single param. = |(y - guess)| Minimizing abs. loss is E[|(y-guess)|] by deciding what guess is (guess is a variable) (avg. loss on entire dataset) in the graph theta represents the choices for guess
what is a dataframe in pandas?
an object in pandas that holds data in tabular format (rows and columns)
Create a table, Sailors, listing the sid, sname, and age enforcing the logical type in SQL. After creating the this table insert a "Bill Gates, 1, 20" into the table
create table Sailors ( sid integer sname char(20) age real ); insert into Sailors (sid, sname, age) values (Bill Gates, 1, 20)
what is the DOM
document object model. It's the hierarchy of how an html webpage is formatted and xml is stored
If you wanted to determine what the correlation between two fields x, y in a dataset were what would be the command in pandas? write the command to generate a linear regression of the correlation Note that seaborn under the hood does this for you
ds.loc[['x', 'y]].corr() from scipy import stats ds.apply(lambda df: stats.linearregress(df.x, df.y)
what is exploratory data analysis
falls in place with what data science is. Use data to understand and hypothesize about a question of interest (scientific method)
what does plt.gca() do?
get current axes
what does DISTINCT do in SQL?
get's unique results by removing duplicates
What is data granularity?
how aggregated the data becomes The finer the data the more representative of an individual datum the coarser the data the less representative of an individual datum
what is the group by method in pandas? How does aggregation come into play here?
it groups data by data in a specific column Aggregation then combines the data once aggregated the results are fused into a single table
what kind of language is SQL
it's a declarative language (programs describe desired results rather than writing commands on HOW to acquire results) see image for example
what is pandas
it's a module for manipulating a dataset
what is a join statement in SQL? If no join type is specified the outer product is executed Rows from first table combined with every row from the second table
it's a way to combine two different datasets
What is a kernel density estimator?
it's an estimate for a given data point when compared to other data points its used when converting from a descriptive plot to an inferential plot ex. see image
what is a gaussian kernal density estimator?
p(x) = the probability of a single data point based on the uniform normalized Gaussian for every data point
write the code that would read a csv file in pandas
pd.read_csv
what is a database management system?
platforms that allow for interacting with large collections of data Core functions: 1. Data Storage 2. Data Management 3. Data interaction (access)
what are inferential plots?
plots that draw conclusions beyond the scope of the data for example, if there's no data point for x10 but using the plot we make a prediction for x10
what are descriptive plots?
plots that exactly describe the data
what is the general format a query in SQL? What's the conceptual SQL pipeline?
see image
what is sns.lmplot?
see img
write a query that joins all values in tables A and B on id (common in both) but with records with the same id (from both tables) are excluded
select * from A full outer join B on A.id = B.id where A.id is Null or B.id is Null this selects records where there was no match in both tables
write two queries for two tables, A and B that inner joins the two tables by id (a field that's common in both tables) conceptually what's the idea behind inner joins?
select * from A, B where A.id = B.id or select * from A inner join B on A.id = B.id conceptually preserves rows where A.id and B.id are found in both columns (if there are ids that are unique to just A (and possibly just B) those rows are dropped)
write a query for two tables Right as R and Left as L that left joins the tables. conceptually what's the idea behind left joins
select * from L left join R on L.id = R.id conceptually we join records in L on records in R based on a specified field. If there is no matching field found when joining a record from L to a record in R we don't drop that record rather we preserve it with nulls for the values where a join could've been found
what is squared loss
squared loss on a single para. = (y - guess) ^2 Minimizing squared loss is E[(y-guess) ^2] by deciding what guess is (guess is a variable) (best guess is the mean) (in the graph theta represents the choices for guess)
what does sns.pairplot(ds, hue='field') do?
takes all fields except the entry to to hue and creates pairs of the fields and plots them. see example