INFO2950 Midterm Study Guide
Benefits of standard deviation
- Shares units with the underlying observations
- Useful properties in combination with the Normal distribution
Why is low Inertia better?
Look for the point (the "elbow") where inertia stops dropping rapidly as K increases. Inertia always decreases as the number of centroids grows, so you can't directly compare inertia values across different numbers of clusters K.
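A minimal sketch of this "elbow" check, assuming a feature matrix X (a hypothetical (n_samples, n_features) array) and scikit-learn's KMeans:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    inertias = []
    k_values = range(1, 10)
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        km.fit(X)                      # X is assumed to already exist
        inertias.append(km.inertia_)   # sum of squared distances to assigned centroids

    plt.plot(k_values, inertias, marker='o')   # pick K near where the curve stops dropping rapidly
    plt.xlabel('K')
    plt.ylabel('inertia')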
Examine outliers
Use idxmin and idxmax: countries['change'].idxmin() returns the index label corresponding to the minimum value in a pandas Series, so countries.loc[countries['change'].idxmin()] gives that specific row.
Call index
book_df.index  You can't call the index by name like a column; use the index accessor instead -> book_df.index.get_level_values('author')
Information about dataframe
book_df.info()
Change column name
col_names = list(covid.columns); col_names[6] = 'active_cases'; covid.columns = col_names  (assign the modified list back so the column is actually renamed)
Visualize dataframes
covid['TCHD'].plot()  or  plt.plot(covid['Date'], covid['TCHD'])
df.index = [5,18,3]
Sets the index labels of the dataframe's 3 rows to 5, 18, and 3 (these labels are used by the label-based selections below)
Pearson correlation
Measures the degree and direction of the linear relationship between two variables, based on the actual data values (rather than their ranks)
Difference between Numpy and lists
- NumPy arrays are fixed size
- Elements in a NumPy array are all the same data type
- Lists can contain arbitrary types
Limitations of Kmeans
- K must be set manually rather than learned directly from the data
- Not guaranteed to find the globally optimal solution
- Only produces linear partitions of the data
Kmeans algorithm
1. Set k = the number of clusters to find
2. Assign k cluster centroids at random
3. Assign each data point to its nearest centroid
4. Shift each centroid to the mean position of the data points assigned to it
5. Continue reassigning points and shifting centroids until equilibrium (assignments stop changing)
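A rough from-scratch sketch of these steps in NumPy (illustrative only: X and k are assumed inputs, and a real implementation would also guard against empty clusters):

    import numpy as np

    def kmeans_sketch(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # 2. pick k random data points as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # 3. assign each point to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 4. shift each centroid to the mean position of its assigned points
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 5. stop once the centroids no longer move (equilibrium)
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids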
Kmeans in SKLearn
1. Set up the learning object: kmeans = KMeans(n_clusters=2)  2. Calculate the centroids: kmeans.fit(X)  3. Use the predict method to generate cluster labels based on the centroids learned from fit: y_means = kmeans.predict(X)
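A minimal end-to-end sketch of the same three steps; the make_blobs data here is synthetic and only an assumption to make the example runnable:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=200, centers=2, random_state=0)   # toy 2D data

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)   # 1. set up the learning object
    kmeans.fit(X)                                               # 2. learn the centroids
    y_means = kmeans.predict(X)                                 # 3. label each point by nearest centroid

    print(kmeans.cluster_centers_)   # learned centroid coordinates
    print(kmeans.inertia_)           # sum of squared distances to assigned centroids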
Dataframe
A 2D table of elements where different columns can hold different types of data (elements within a single column share a type). book_df = pd.DataFrame(book_dict)
Spearman correlation
A correlation used with ordinal data or to evaluate a monotonic relationship; it is computed on ranks rather than on the actual values. countries['population'].corr(countries['change'], method='spearman')
Groupby
A groupby operation involves some combination of splitting the object, applying a function, and combining the results.
grouped = covid.groupby(by=['weekday'])
grouped['TCHD'].mean()
for name, group in grouped:
    display(group['TCHD'].head(5))
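A small self-contained sketch of split-apply-combine; the toy frame below is made up, not the course dataset:

    import pandas as pd

    toy = pd.DataFrame({'weekday': ['Mon', 'Mon', 'Tue', 'Tue'],
                        'TCHD': [10, 14, 3, 5]})

    grouped = toy.groupby(by=['weekday'])   # split into one group per weekday
    print(grouped['TCHD'].mean())           # apply the mean to each group, combine into a Series

    for name, group in grouped:             # iterate over (group name, sub-DataFrame) pairs
        print(name, len(group))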
Cosine Distance
A measure of distance based on the difference in angle between two points, viewed as vectors from the origin (magnitude is ignored)
Correlation
A measure of the extent to which two factors vary together, and thus of how well either factor predicts the other. Bounded in the range -1 to 1.
Manhattan Distance
A measure of travel through a grid system like navigating around the buildings and blocks of Manhattan, NYC.
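A short sketch comparing the three distance measures with SciPy; the two example points are arbitrary:

    from scipy.spatial import distance

    a = [0, 3]
    b = [4, 0]

    print(distance.euclidean(a, b))   # straight-line distance: 5.0
    print(distance.cityblock(a, b))   # Manhattan (grid) distance: |0-4| + |3-0| = 7
    print(distance.cosine(a, b))      # cosine distance: 1 - cos(angle); 1.0 here since a and b are perpendicular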
Modify dataframe
Assign the output back into the dataframe to change the data structure: book_df['year'] = book_df['year'] + 100
Rank Transformation
Reduces the influence of outliers by focusing on the relative positions (ranks) of values instead of the actual values. countries['population'].rank()
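A tiny illustration on made-up numbers of how ranking collapses an extreme value:

    import pandas as pd

    pop = pd.Series([10, 25, 40, 9000])   # 9000 is a huge outlier
    print(pop.rank())                     # 1.0, 2.0, 3.0, 4.0 -- the outlier is just "largest"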
df.loc[[True, False, False]]
A boolean mask: returns only the rows where the mask is True (here, just the first row)
Why only change one variable?
The log transform is meant to collapse the influence of far outliers; the 'change' variable has no huge outliers, so there is no need to transform it.
Covariance
A measure of how two variables change in relation to each other: given that an observation has an x value above/below the mean, how likely is it that that observation's y value is also above/below the mean?
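A minimal sketch of that idea, computing the population covariance by hand and checking it against pandas (the numbers are made up):

    import pandas as pd

    df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 5, 9]})

    # mean of the products of deviations from each mean (population covariance, ddof=0)
    manual_cov = ((df['x'] - df['x'].mean()) * (df['y'] - df['y'].mean())).mean()
    print(manual_cov)
    print(df.cov(ddof=0).loc['x', 'y'])   # same value from pandas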
Silhouette scores
Normalized mean of the difference between the intracluster distance (distance from each point to its assigned centroid) and the nearest-cluster distance (distance from each point to the nearest centroid it is not assigned to). Bounded between -1 and +1.
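A short sketch with scikit-learn, reusing the hypothetical X and y_means from the KMeans example above:

    from sklearn.metrics import silhouette_score

    # y_means are the cluster labels from kmeans.predict(X); scores closer to +1 are better
    print(silhouette_score(X, y_means))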
Manipulate index in dataframe
Set a column to be the index: book_df.set_index(['author'], inplace=True)
Difference between Pearson and Spearman
They are similar, but a correlation computed on ranks (Spearman) can reveal monotonic relationships you wouldn't see with a linear (Pearson) correlation.
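A tiny illustration on made-up data where the relationship is monotonic but not linear:

    import pandas as pd

    x = pd.Series([1, 2, 3, 4, 5])
    y = x ** 3                              # perfectly monotonic, but curved

    print(x.corr(y))                        # Pearson: high, but below 1
    print(x.corr(y, method='spearman'))     # Spearman: exactly 1.0, since the ranks agree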
Why Log data transformation?
The log compresses the outliers. Create a new column: countries['LogPop'] = np.log(countries['Population'])
Why transform data?
There can be outliers in a roughly linear relationship. These outliers are far from the mean and have an outsized influence on the measured correlation. Transforming the data reins in the outliers so the correlation reflects the bulk of the data points.
Standard deviation
A measure of variability that describes the average distance of each score from the mean within a dataset. grouped_data['average_price'].std()
Select all rows and title of column
book_df.loc[:, ['title']]  (the : selects all rows; the list of column labels selects the 'title' column)
Select rows in dataframe
book_df.loc[row label slice, list of column labels], e.g. book_df.loc[1:2, ['year', 'title']]
Select a dataframe column
book_df['author']  single brackets select a single column (as a Series) from the DataFrame object
Dataframe column operations
book_df[['year']] + 100 -> adds 100 to every element (double brackets return a DataFrame); book_df['year'] + 100 -> does the same on a Series. Neither changes the underlying data structure unless you assign the output of the operation back into it.
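A small self-contained sketch of these selection and column operations; the toy book_df below is an assumption, built from a dictionary as described earlier:

    import pandas as pd

    book_df = pd.DataFrame({'author': ['Austen', 'Orwell', 'Morrison'],
                            'title': ['Emma', '1984', 'Beloved'],
                            'year': [1815, 1949, 1987]})

    print(book_df['author'])                      # single brackets: one column as a Series
    print(book_df.loc[:, ['title']])              # all rows, 'title' column, as a DataFrame
    print(book_df.loc[1:2, ['year', 'title']])    # rows labeled 1-2, two columns
    book_df['year'] = book_df['year'] + 100       # assign back to actually modify the dataframe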
Get rid of undesired data
covid.columns
columns_to_retain = list(covid.columns)[1:]
covid = covid.loc[1:, columns_to_retain]
Change column types
covid['Date'].dtype and type(covid.at[0, 'Date']) check the current type; covid['Date'] = pd.to_datetime(covid['Date']) converts the column to datetimes
Calculate correlation in Python
df.corr()
Calculate covariance in Python
df.cov(ddof=0)
Create subset
Either set values through .loc on the original, df.loc[df['b'] <= 5, 'c'] = False, or work on an explicit copy of the data: subset = subset.copy()
df.at[0,'b']
fast access to a single scalar value
df.loc[0, 'b']
get the value at row label 0, column 'b'
df.loc[18:3]
gives 2 rows, the rows labeled 18 and 3 (label-based slicing includes the end label)
df.iloc[[1]]
selects by integer position; [[1]] returns the row at position 1 as a one-row DataFrame
Using selector to plot only 'Monday' data
monday_selector = covid['Date'].dt.day_name() == 'Monday'
covid_mondays = covid.loc[monday_selector, ['Date', 'TCHD']]
covid_mondays.shape -> (77, 2), i.e. 77 Mondays in the dataset
N vs. N-1
n = descriptive statistics (describing the data you actually have); n-1 = inferential statistics (estimating a population, leaving room for error). The sample mean is an imperfect estimate of the population mean, so the actual variance tends to be larger than the sample suggests; dividing by n-1 corrects for that.
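A quick sketch of the two conventions in code (toy numbers):

    import numpy as np
    import pandas as pd

    data = pd.Series([2, 4, 4, 6])

    print(np.std(data))       # NumPy default ddof=0: divide by n (descriptive)
    print(data.std())         # pandas default ddof=1: divide by n-1 (inferential)
    print(data.std(ddof=0))   # matches the NumPy value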
Construct sets
Pass a list to the set constructor for fast set arithmetic: regions = set(['West', 'Cali']); US = set(['Total US']); city_like = set(avo.geography.unique()) - regions - US
Data prep
sheet = 'link'; covid = pd.read_html(sheet)[0]  (read_html returns a list of tables, so take the first); covid.to_csv('___.csv', index=False); covid = pd.read_csv('____.csv')
Inertia
Sum of squared distances between each data point and its assigned centroid. If many points are far from their assigned centroid, the squared distances will be larger. Lower inertia is better; inertia is a direct measure of cluster quality (for a fixed number of clusters).
Supervised vs. Unsupervised learning
Supervised: you know the set of properties you're trying to predict, so there is an easy evaluation metric. Unsupervised: you don't know the properties you're trying to identify, which lets you discover structure you didn't know about in advance.
Euclidean distance
the straight-line distance, or shortest possible path, between two points
covid.loc[1, 'Date']   covid.loc[1, 'Date'].day_name()   covid['Date'].dt.weekday.head()
a Timestamp; the string 'Monday'; the weekday number for each date
Clustering Algorithm DBSCAN
True density-based clustering; it detects clusters of unusual shape and can assign outlying data points to no cluster at all. You must specify what counts as "dense".
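A minimal scikit-learn sketch, again assuming a feature matrix X; eps and min_samples are the knobs that define "dense", and the values here are arbitrary:

    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=0.5, min_samples=5)   # "dense" = at least 5 points within radius 0.5
    labels = db.fit_predict(X)            # outlying points get the label -1 (no cluster)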
covid.dtypes covid.shape len(covid) covid.describe()
data type of each column; table size, e.g. (530, 13); table length (the number of rows); descriptive stats (basic summary statistics)
Plot selector
us_conv_selector = (avo['geography'] == 'Total US') & (avo['type'] == 'conventional')
sns.lineplot(x='date', y='total_volume', data=avo.loc[us_conv_selector])