INFO2950 Midterm Study Guide


Benefits of standard deviation

- Shares units with the underlying observations
- Useful properties with the Normal distribution

Why is low Inertia better?

Look for the point where inertia stops dropping rapidly as K increases (the "elbow"). You can't compare raw inertia values across different K, because inertia always decreases as the number of centroids grows.
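
A minimal sketch of this elbow check, using scikit-learn's KMeans on made-up toy data (two well-separated blobs; the data and numbers are illustrative, not from the course):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up toy data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Fit K-means for a range of K and record the inertia each time
inertias = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia keeps shrinking as K grows, so raw values aren't comparable
# across K; instead look for where the drop levels off (here, at K=2).
```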

Examine outliers

Use idxmin and idxmax: countries.loc[countries['change'].idxmin()]. idxmin returns the index label corresponding to the minimum value in a pandas Series, and .loc then pulls out that specific row.

Call index

book_df.index. You can't call the index by name; use the index accessor: book_df.index.get_level_values('author')

Information about dataframe

book_df.info()

Change column name

col_names = list(covid.columns)
col_names[6] = 'active_cases'
covid.columns = col_names

Visualize dataframes

covid['TCHD'].plot() or plt.plot(covid['Date'], covid['TCHD'])

df.index = [5,18,3]

Relabels the rows of a 3-row dataframe: the index labels become 5, 18, and 3 (the labels used by the .loc examples below).

Pearson correlation

Measures the degree and direction of the linear relationship between two variables, based on the actual data values.

Difference between Numpy and lists

- NumPy arrays are fixed size
- Elements in a NumPy array all share the same data type
- Lists can contain arbitrary types
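
A small illustration of the type rules above (toy values; the exact integer dtype can vary by platform):

```python
import numpy as np

mixed_list = [1, 'two', 3.0]   # a plain list can mix arbitrary types
arr = np.array([1, 2, 3])      # a NumPy array stores one dtype for all elements
coerced = np.array([1, 2.5])   # mixing ints and floats upcasts everything to float
```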

Limitations of Kmeans

- K must be set manually rather than learned directly from the data
- Not guaranteed to find the globally optimal solution
- Only produces linear partitions of the data

Kmeans algorithm

1. Set k = the number of clusters to find
2. Assign k cluster centroids at random
3. Assign each data point to its nearest centroid
4. Shift each centroid to the mean position of the data points assigned to it
5. Continue reassigning points and shifting centroids until equilibrium
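
The steps above can be sketched directly in NumPy (a teaching sketch on made-up data, with no handling of empty clusters; not the course's implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following the steps above; no empty-cluster handling."""
    rng = np.random.default_rng(seed)
    # 2. pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 3. assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. shift each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 5. stop once the centroids reach equilibrium
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Made-up toy data: two well-separated blobs, 20 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centroids = kmeans(X, k=2)
```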

Kmeans in SKLearn

1. Set up the learning object: kmeans = KMeans(n_clusters=2)
2. Calculate the centroids: kmeans.fit(X)
3. Use the predict method to generate cluster labels from the centroids learned by fit: y_means = kmeans.predict(X)

Dataframe

A 2D array of elements where different columns can hold different types of data (within a single column, elements share one type). book_df = pd.DataFrame(book_dict)

Spearman correlation

A correlation used with ordinal data or to evaluate a monotonic relationship; it runs the correlation on ranks rather than on actual values. countries['population'].corr(countries['change'], method='spearman')

Groupby

A groupby operation involves some combination of splitting the object, applying a function, and combining the results.
grouped = covid.groupby(by=['weekday'])
grouped['TCHD'].mean()
for name, group in grouped: display(group['TCHD'].head(5))
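
A self-contained split-apply-combine example on a toy frame standing in for the covid data (values made up):

```python
import pandas as pd

df = pd.DataFrame({'weekday': ['Mon', 'Tue', 'Mon', 'Tue'],
                   'TCHD':    [10, 20, 30, 40]})

grouped = df.groupby(by=['weekday'])   # split on 'weekday'
means = grouped['TCHD'].mean()         # apply mean per group, combine results
# means['Mon'] is the average of 10 and 30; means['Tue'] of 20 and 40
```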

Cosine Distance

A measure of distance based on the angle between two points' vectors from the origin.

Correlation

A measure of the extent to which two factors vary together, and thus of how well either factor predicts the other. Bounded in the range -1 to 1.

Manhattan Distance

A measure of travel through a grid system like navigating around the buildings and blocks of Manhattan, NYC.

Modify dataframe

Assign output back into dataframe to change data structure book_df['year']=book_df['year'] +100

Rank Transformation

Removes the influence of outliers by focusing on relative position (rank) instead of actual values. countries['population'].rank()

df.loc[[True, False, False]]

Boolean selection: keeps only the rows where the selector is True, so this returns just the first row.
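
A runnable version of this boolean-selection card (toy frame, made-up values):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]})
mask = [True, False, False]
subset = df.loc[mask]   # keeps only the rows where the mask is True
# subset contains just the first row
```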

Why only change one variable?

The log transform is intended to collapse the influence of far outliers; there are no huge outliers in the change variable, so there's no need to transform that data.

Covariance

A measure of how two variables change in relation to each other: given that an observation's x value is above/below the x mean, how likely is it that its y value is above/below the y mean?
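
A worked toy example of this idea, using pandas' cov with ddof=0 (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8]})

# Population covariance: mean of (x - x_mean) * (y - y_mean)
cov_xy = df.cov(ddof=0).loc['x', 'y']
# x deviations: [-1.5, -0.5, 0.5, 1.5]; y deviations: [-3, -1, 1, 3]
# products sum to 10, divided by n=4 gives 2.5 (positive: x and y move together)
```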

Silhouette scores

Normalized mean of the difference between the intra-cluster distance (distance from each point to its assigned centroid) and the nearest inter-cluster distance (distance from each point to the nearest non-assigned centroid). Bounded by +1 and -1.
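
A minimal sklearn sketch on made-up blobs (silhouette_score is one standard implementation of this idea):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up toy data: two tight, well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
# score is bounded by -1 and +1; tight, well-separated clusters score near +1
```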

Manipulate index in dataframe

Set column to be index book_df.set_index(['author'], inplace=True)

Difference between Pearson and Spearman

They are similar, but sometimes you can see relationships using ranks that you wouldn't see when looking only for a linear relationship.
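
A toy case that shows the difference: y = x ** 3 is monotonic but not linear, so the ranks line up perfectly while the linear fit does not (data made up for illustration):

```python
import pandas as pd

x = pd.Series(range(1, 11))
y = x ** 3                               # monotonic, but not linear

pearson = x.corr(y)                      # less than 1: relationship isn't linear
spearman = x.corr(y, method='spearman')  # exactly 1: the ranks match perfectly
```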

Why Log data transformation?

The log compresses the outliers. Create a new column: countries['LogPop'] = np.log(countries['Population'])

Why transform data?

There can be outliers in a linear relation. These outliers sit far from the mean and have an outsized influence on the measured correlation. Transforming the data reduces that influence, so the correlation reflects the bulk of the data points rather than a few large outliers.

Standard deviation

A measure of variability that describes the average distance of every score from the mean within a dataset. grouped_data['average_price'].std()

Select all rows and title of column

book_df.loc[:, ['title of column']]

Select rows in dataframe

book_df.loc[index_label_range, list_of_columns], e.g. book_df.loc[1:2, ['year', 'title']]

Select a dataframe column

book_df['author']: single brackets select one column (as a Series) from the DataFrame object.

Dataframe column operations

book_df[['year']] + 100 and book_df['year'] + 100 both add 100 to every element, but neither changes the underlying data structure; it won't change unless you assign the output of the operation back into the original data structure.
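
A quick demonstration of the assign-back rule (toy frame, made-up years):

```python
import pandas as pd

book_df = pd.DataFrame({'year': [1990, 2000]})

book_df['year'] + 100                    # computes a new Series; book_df unchanged
book_df['year'] = book_df['year'] + 100  # assigning back is what changes it
# book_df['year'] now holds 2090 and 2100
```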

Get rid of undesired data

covid.columns
columns_to_retain = list(covid.columns)[1:]
covid = covid.loc[1:, columns_to_retain]

Change column types

covid['Date'].dtype
type(covid.at[0, 'Date'])
covid['Date'] = pd.to_datetime(covid['Date'])

Calculate correlation in Python

df.corr()

Calculate covariance in Python

df.cov(ddof=0)

Create subset

Assign through .loc, df.loc[df['b'] <= 5, 'c'] = False, or make an explicit copy of the data first: subset = subset.copy()

df.at[0,'b']

gets a single value (fast scalar access)

df.loc[0, 'b']

gets the value at row 0, column 'b'

df.loc[18:3]

gives 2 rows, rows #18 and 3 (label-based slicing includes both endpoints)

df.iloc[[1]]

index positions: gives the row at position 1 (the second row) as a one-row dataframe

Using selector to plot only 'Monday' data

monday_selector = covid['Date'].dt.day_name() == 'Monday'
covid_mondays = covid.loc[monday_selector, ['Date', 'TCHD']]
covid_mondays.shape -> (77, 2): there are 77 Mondays in the dataset

N vs. N-1

n = descriptive statistics (facts and calculations about the data you have); n-1 = inferential statistics (estimating, with room for error). The sample mean is an imperfect estimate of the population mean, so the actual variance will be larger because the actual mean does not equal the sample mean; dividing by n-1 corrects for this.
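
A small numeric check of the n vs. n-1 distinction in pandas (toy series):

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8])  # mean is 5; squared deviations sum to 20

pop_var = s.var(ddof=0)      # divide by n=4: descriptive -> 5.0
sample_var = s.var(ddof=1)   # divide by n-1=3: inferential -> 20/3, a larger value
```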

Construct sets

Pass a list to the set constructor for fast set arithmetic:
regions = set(['West', 'Cali'])
US = set(['Total US'])
city_like = set(avo.geography.unique()) - regions - US

Data prep

sheet = 'link'
covid = pd.read_html(sheet)[0]  (read_html returns a list of tables)
covid.to_csv('___.csv', index=False)
covid = pd.read_csv('____.csv')

Inertia

The sum of squared distances between each data point and its assigned centroid. Lots of data points far from their assigned centroid means the squared distances, and the inertia, will be large. Lower inertia is better; inertia is a direct measure of cluster quality.

Supervised vs. Unsupervised learning

Supervised: you know the set of properties you're trying to predict, which gives an easy evaluation metric. Unsupervised: you don't know the set of properties you're trying to identify, which allows you to discover structures you didn't know about in advance.

Euclidean distance

the straight-line distance, or shortest possible path, between two points
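
The three distance measures in this guide (Euclidean, Manhattan, cosine) sketched in plain NumPy on two made-up points:

```python
import numpy as np

a = np.array([3.0, 0.0])
b = np.array([0.0, 4.0])

euclidean = np.linalg.norm(a - b)   # straight-line distance: 5.0
manhattan = np.abs(a - b).sum()     # grid-travel distance: 3 + 4 = 7.0
# cosine distance: 1 minus the cosine of the angle between the vectors
cosine = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
# a and b are perpendicular, so the cosine distance is 1.0
```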

covid.loc[1, 'Date'] covid.loc[1, 'Date'].day_name() covid['Date'].dt.weekday.head()

a timestamp; the day name ('Monday'); the first few weekday numbers

Clustering Algorithm DBSCAN

True density-based clustering: it can detect clusters of unusual shape and assign outlying data points to no cluster at all. You must specify what counts as "dense".

covid.dtypes covid.shape len(covid) covid.describe()

the data types of the columns; the table size as (rows, columns), e.g. (530, 13); the table length (number of rows); descriptive stats (basic summary statistics)

Plot selector

us_conv_selector = (avo['geography'] == 'Total US') & (avo['type'] == 'conventional')
sns.lineplot(x='date', y='total_volume', data=avo.loc[us_conv_selector])

