PIC 16A Final Review

Why do we split into a training and test set at the beginning of our data inspection?

- To avoid overfitting
- To show the performance of machine learning models when they're used on data that wasn't used to train the model
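
A minimal sketch of making the split with scikit-learn's train_test_split (the data here is a made-up stand-in for the penguins features and species labels):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the penguins features and species labels (hypothetical values)
    X = pd.DataFrame({"Flipper Length (mm)": [181, 195, 210, 220],
                      "Body Mass (g)": [3500, 3800, 4800, 5200]})
    y = pd.Series(["Adelie", "Adelie", "Gentoo", "Gentoo"])

    # Hold out 25% of the rows as a test set; these rows are never used for fitting
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)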

Classification vs regression problems

- Both are types of supervised learning problems
Classification
- Output variables are qualitative
- Example: "red" or "blue", "disease" or "no disease"
Regression
- Output variable is a real value
- Example: "dollars", "weight"

Mathematical definition for mean squared error

- Average squared difference between the estimated values and actual values
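
In symbols, for n observations with actual values y_i and estimated values ŷ_i:

    MSE = (1/n) * Σ_{i=1}^{n} (y_i - ŷ_i)^2

A tiny sketch of the same computation with NumPy (the numbers are made up):

    import numpy as np

    y_true = np.array([3.0, 5.0, 2.5])   # actual values
    y_pred = np.array([2.5, 5.0, 4.0])   # estimated values

    mse = np.mean((y_true - y_pred) ** 2)  # average squared difference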

What are good reasons for removing certain columns of the DataFrame almost immediately? Think about what it means if a column is constant and/or whether a column holds physical or other measurements of penguins vs. is a mechanism for record-keeping. During your project you will almost certainly have removed certain columns very early on; make sure you can justify why you did this!

- Columns with comments, for example, aren't necessarily useful unless you are trying to explore an outlying point in greater depth (e.g., if you need context for it before deciding whether to remove it).
- For many models, especially linear ones, you'd also want to remove collinear variables (variables that are too closely related, e.g., one column for a length in cm and another column for the same length in mm). Collinear variables are redundant and can distort summary statistics (inflated p-values, etc.).
- You also don't need ID columns if the samples are independent and identically distributed; they are record-keeping, not measurements. For sequential data, like time series, such a column would be necessary, however.

What types of tasks are Pandas DataFrames well suited for?

- Easy handling of missing data (represented as NaN)
- Size mutability: columns can be deleted or inserted
- Automatic and explicit data alignment: objects are aligned to labels
- Flexible group by: performs split-apply-combine operations
- Label-based slicing, fancy indexing, and subsetting
- Merging and joining data sets
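
A small sketch illustrating a few of these features on a tiny hypothetical frame:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"Species": ["Adelie", "Gentoo", "Adelie"],
                       "Body Mass (g)": [3700.0, np.nan, 3550.0]})

    df["Body Mass (kg)"] = df["Body Mass (g)"] / 1000     # size mutability: insert a column
    print(df["Body Mass (g)"].isna().sum())               # missing data represented as NaN
    print(df.groupby("Species")["Body Mass (g)"].mean())  # split-apply-combine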

Indexing: loc

- loc selects rows and columns with specific labels
- May use a boolean array/mask
- Raises a KeyError when a label is not found
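
A minimal sketch of label-based selection (made-up labels and values):

    import pandas as pd

    df = pd.DataFrame({"Species": ["Adelie", "Gentoo"], "Body Mass (g)": [3700, 5000]},
                      index=["p1", "p2"])

    print(df.loc["p1", "Species"])                 # select by row label and column label
    print(df.loc[df["Body Mass (g)"] > 4000, :])   # boolean mask selects matching rows
    # df.loc["p3"] would raise a KeyError because that label does not exist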

Role of the loss function

- A method for evaluating how well the model predicts the dataset
- Mean squared error is one type of loss function

How to spot a model underfitting in your project using a decision boundary plot?

- Model fails to learn the patterns of the dataset
- Model performs poorly even on the training data (the test score may be similar to, or better than, the training score)
- (In classification problems) the decision boundary is too simple, e.g., just a straight line

How to spot a model overfitting in your project using a decision boundary plot?

- Model performs better on training data than test data
- (In classification problems) the decision boundary is very complex and tries to fit each individual data point
- (In regression problems) the "best fit" line is a complex curve that tries to pass through each individual data point
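
A hedged sketch of spotting both symptoms numerically (a synthetic dataset stands in for project data; the depths chosen are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for depth in [1, 4, None]:   # too simple, moderate, unconstrained
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
        print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))

    # Low scores on both sets suggest underfitting; a training score much higher
    # than the test score suggests overfitting.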

Define and describe algorithmic bias: Sample Bias

- Occurs when the training data is not sampled randomly enough from the collected data
- Creates a preference toward some populations

Indexing: [ ] (square brackets)

- Pass a list of columns to [] to select those columns in that order.
- If a column is not contained in the DataFrame, an exception (KeyError) will be raised.
- Multiple columns can also be set in this manner.
- Example: penguins[["Sex", "Island"]]
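
A minimal sketch on a made-up frame (column names chosen to mimic the penguins data):

    import pandas as pd

    penguins = pd.DataFrame({"Sex": ["MALE", "FEMALE"],
                             "Island": ["Torgersen", "Biscoe"],
                             "Body Mass (g)": [3750, 3800]})

    print(penguins["Sex"])               # a single column name gives a Series
    print(penguins[["Sex", "Island"]])   # a list of names gives a DataFrame, in that order
    # penguins["Color"] would raise a KeyError because that column does not exist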

What types of data are Pandas DataFrames well suited for?

- Tabular data (think Excel spreadsheet)
- Arbitrary matrix data
- Observational/statistical data sets

If we want a pair of quantitative features to have strong predictive power for species, what properties would we want the relevant scatter plot to have?

- Strong predictive power: points belonging to different species form distinct, well-separated clusters with little overlap
- Weak predictive power: the dots for the different species are widely spread and intermixed
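
A minimal sketch of such a scatter plot, colored by species (tiny made-up stand-in for the penguins data; substitute the column names from your copy):

    import matplotlib.pyplot as plt
    import pandas as pd

    penguins = pd.DataFrame({
        "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "Culmen Length (mm)": [38.5, 39.0, 47.5, 48.2],
        "Flipper Length (mm)": [185, 190, 215, 220],
    })

    # Well-separated per-species clusters indicate strong predictive power
    fig, ax = plt.subplots()
    for species, group in penguins.groupby("Species"):
        ax.scatter(group["Culmen Length (mm)"], group["Flipper Length (mm)"], label=species)
    ax.legend()
    plt.show()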

What are the potential pitfalls of using dropna() too early on...

- You may lose valuable data that could have been useful later on, especially if you're applying it to multiple columns at once. This can skew the data and lead to bias or less accurate insights.

How to control overfitting via complexity parameter, e.g., for decision trees?

- Decrease the complexity of the model, e.g., lower the maximum depth (max_depth) of a decision tree
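
A short sketch of what this looks like for a decision tree (illustrative depth values):

    from sklearn.tree import DecisionTreeClassifier

    # max_depth is the complexity parameter: a smaller value forces a simpler tree,
    # which reduces the risk of overfitting; leaving it unconstrained (None) lets
    # the tree grow until it can memorize the training data
    simple_tree = DecisionTreeClassifier(max_depth=3)
    complex_tree = DecisionTreeClassifier(max_depth=None)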

Indexing: iloc

- iloc selects rows and columns at specific integer positions
- May use a boolean array/mask
- Raises an IndexError when a position is out of bounds
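
A minimal sketch of position-based selection (made-up values):

    import pandas as pd

    df = pd.DataFrame({"Species": ["Adelie", "Gentoo", "Chinstrap"],
                       "Body Mass (g)": [3700, 5000, 3800]})

    print(df.iloc[0])                     # first row, by integer position
    print(df.iloc[0:2, 1])                # first two rows of the second column
    print(df.iloc[[True, False, True]])   # boolean mask also works
    # df.iloc[10] would raise an IndexError because position 10 is out of bounds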

How to control underfitting via complexity parameter, e.g., for decision trees?

- Increase the complexity of the model, e.g., allow a greater maximum depth
- Increase the size/number of parameters in the model
- Add more informative features (adding more training data mainly helps with overfitting, not underfitting)

Define and describe algorithmic bias: Measurement Bias

- Linked to underlying problems with the accuracy of the training data and how it is measured or assessed
- An experiment containing invalid measurement or data collection methods will create measurement bias and biased output
- Example: the aim of a system was to predict success at the company, but what was actually measured was whether or not a given candidate was hired, regardless of whether they were subsequently successful.

Explain what a Pandas DataFrame is.

A DataFrame is a 2D labeled data structure with columns containing potentially different types

Describe an example of each using the Penguin data set: Population Bias

Example from Campuswire:
- Only selecting penguins that are not fully grown on one island and only selecting fully grown penguins on another island
- Unless the population selection is changed, no sampling method will accurately represent the population

Define and describe algorithmic bias: Population Bias

- Bias inherent in the population the data is drawn from: the sampled population does not match the population of interest, so even a perfectly random sample will not be representative

How does k-fold cross-validation work? Why do we do it?

K-fold cross-validation:
- Data gets split into training and test data
- The training data then gets split into k folds
- Within this training data, the model is trained on k-1 folds and validated on the remaining fold
- This process is repeated k times so that each fold is used for validation exactly once
- Gives the average performance of the model and how it's expected to perform on unseen test data
Purpose:
- Tool for estimating the optimal complexity of a model
- You can test out different depths and compare validation scores to find the optimal depth and best CV score
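
A minimal sketch using scikit-learn's cross_val_score (the iris data is used here as a stand-in for your training set; in the project you would keep the test set untouched):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)   # stand-in for the training data

    # 5-fold CV: train on 4 folds, validate on the held-out fold, repeat 5 times
    for depth in [1, 2, 3, 5]:
        scores = cross_val_score(DecisionTreeClassifier(max_depth=depth), X, y, cv=5)
        print(depth, round(scores.mean(), 3))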

Supervised vs. unsupervised machine learning problems, what is the objective in each case?

Supervised:
- When you have input and output data
- Objective: the algorithm uses the input data to approximate the mapping function as well as possible, in order to predict output variables given new input variables
- Examples: classification and regression problems
Unsupervised:
- When you only have input data and no output data
- Objective: model the underlying structure or distribution of the data
- Clustering: finding groupings in the data
- Association: discovering rules that describe large portions of the data

Describe an example of each using the Penguin data set: Sample Bias

The penguin species were not randomly sampled, leading to bias within the dataset. For example, a researcher could have just grabbed the first 20 penguins they saw on an island without actually trying to make the penguin selection randomized. That island may favor a particular penguin species.

Describe an example of each using the Penguin data set: Measurement Bias

The scientists did not agree on where to measure the culmen length from: from the base of the beak or from a bit lower down, leading to inconsistent measurements.

What is the method you can use to replace values in a DataFrame?

Use a dictionary describing the changes and the map() method.
Example (assuming pandas and numpy are imported as pd and np):
    recode = {"MALE": "m", "FEMALE": "f", np.nan: "unknown"}
    penguins["Sex"] = penguins["Sex"].map(recode)

How to use .groupby( ) to create a summary table for certain statistics?

groupby() will...
- Split the data frame into pieces
- Apply an aggregation function to each piece, yielding a single number (mean, standard deviation (std), sum, etc.)
- Combine the results into a new data frame
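
A minimal sketch of such a summary table (tiny made-up stand-in for the penguins data):

    import pandas as pd

    penguins = pd.DataFrame({"Species": ["Adelie", "Adelie", "Gentoo"],
                             "Body Mass (g)": [3700.0, 3550.0, 5000.0]})

    # Split by species, apply mean and std to each piece, combine into a summary table
    summary = penguins.groupby("Species")["Body Mass (g)"].aggregate(["mean", "std"])
    print(summary)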

Model hyperparameters

Parameters that determine the training process; they are external to the model, their values cannot be estimated from the data, and they are set manually.
Examples: n_estimators (number of trees in a random forest), max_depth (maximum depth of a tree)
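
A one-line sketch of setting these hyperparameters by hand when constructing a model:

    from sklearn.ensemble import RandomForestClassifier

    # Hyperparameters are chosen before training starts (often via cross-validation)
    forest = RandomForestClassifier(n_estimators=100, max_depth=3)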

Model parameters

Features that the model learns on its own from the training data during training; they are internal to the model and their values can be estimated from the data.
Example: the split points in a decision tree

