W8 - Data & Pandas

Réussis tes devoirs et examens dès maintenant avec Quizwiz!

Analysis

- Calculating correlations + data.corr(method = "pearson") + data.corr()

Dataframe column operations

- Column is identified by column's name. - We could first use df.columns to get all columns' names - Adding a column requires first the construction of another dataframe with same structure then use "+" to add to existing dataframe. + df = df + df #add the dataframe to itself and treats data as string. - Deleting column using "pop(column_name)" remove a column. + df.pop('acquiree') #remove the first column

Making changes to data

- Create a new column from current columns + data["new_header"] = list - Delete columns + data.drop(columns = ["column_name"], inplace = True) - Replace + data.column.replace(old_value, new_value, inplace = True) + data.loc[data["column"] == "old_value", "column"] = "new_value" - Set values of a specific cell using loc[] + data.loc[0,"column"] = value - Set values using iloc[] + data.iloc[0,5] = value + data.iloc[0, data.columns.get_loc("column")] = value - Append other dataframe + df2.append(df3,ignore_index=True) #renames the axis(identifier) to 0 ,1,2,3,...

Data

- Data created from varying sources, can be automatically or manually collected - Data may be: + Redundant + Inconsistent + Inaccurate

Dataframe

- Dataframe stores tabular data or 2D data into a single variable. - Features of dataframe: + Mutable size: data can be added and shrank. + Each data point is identifiable by row index number and header name. - Constructs of dataframe: + pandas.DataFrame( data, index, columns, dtype, copy)

Exporting using pandas DataFrame

- Export your DataFrame to a CSV file + data.to.csv("updated_df.csv"), index = False) + use index = False to ensure that index number is not exported - Export your DataFrame to an Excel file + data.to.excel("updated_df.xlsv", index = False) - Export your DataFrame to a "tab separated values" file + data.to.csv("updated_df.xlsv", index = False, sep = "\t")

Data cleaning (Inconsistent unit)

- For inconsistent unit, we can write code to remove the unit like "m" and convert to same base unit by multiplying. - This involves searching for the respective data then conversion.

Groupby & Aggregation functions

- Groupby essentially splits the data into different groups depending on the unique values of a column of your choice. - Aggregation functions like max(), min(), mean(),count() can be quickly applied to the GroupBy object to obtain summary statistics for each group. + data.groupby(["column"]).mean().round(2) + data.groupby(["column1"]).mean()["column2"]

Data cleaning (Duplicated data)

- df.duplicated() #returns a Boolean value of True for each row if it is a duplicated record. - Use df.drop_duplicates("duplicated_value") method to remove duplicates - Use df.duplicated().sum() to return total nuber of rows with duplicated values - df[df["column1", "column2"]].duplicated() #show only those rows with duplicated on column1 and column2 - Delete rows with duplicates + df.drop_duplicates(inplace = True)

Scipy module

- from scipy import sub_module_name (as scipy has many sub packages) - For additional data processing. - High level visualization, numerical processing, and optimizations.

Pandas basics

- head() method returns the first 5 rows + By default is 5 + Index starts from 0 - tail() method returns the last 5 rows - shape attribute returns the number of rows, followed by the number of columns - infor() method returns basic infor about the file - value_count() returns counts of unique values

Matplotlib module

- import matplotlib.pyplot as plt - The most popular Python library for visualization. - Can make interactive plotting and data exploration. - pyplot in matplotlib package is the main interface for plotting. - Other modules are available that can wrap around this matplotlib module to produce more powerful visualization: § seaborn § ggpy

numPy module

- import numpy as np - Numpy stands for Numerical Python - Essential package for data computation and manipulation. -It is the foundational library which SciPy, Scikit-learn and etc are based on.

Pandas module

- import pandas as pd - It's a powerful and flexible open source for data analysis which provides a rich set of data structures to work on structured data. - The primary object that we will be using is DataFrame object - DataFrame is a two dimensional, table like structure organized into column with header and row number corresponds to each record.

Data cleaning (Missing headers)

- import pandas as pd + header = ["header_name"] + df = pd.read_csv("file_name.csv", name = header) - The code started by importing the important pandas module for data cleaning, and use pd as pointer to point to pandas. - Header is defined using the list structure to hold multiple strings. - Next calling read_csv() function from pandas through the pandas pointer. + Parse in the file name of csv file, together with header previously defined and set as names of header. - pd.read_csv() returns an important structure -> dataframe

Selecting rows using pandas

- read a column + data.iloc[1] + data.iloc[1,:] - read multiple rows + data.iloc[:4] -> 4 rows + data.iloc[:4,:] -> 4 rows + data.iloc[2:4] -> rows 2 & 3 + data.iloc[2:4,:] -> rows 2 & 3 + row with stop index is exluded

Selecting columns using pandas

- read each column + data["column_name"] + data.column_name - read multiple columns + data[["column1", "column2"]]

Filtering data

- Two criterias (and) + data[(data["column1" = criteria1]) & (data["column2" = criteria2)] - One or the other (or) + data[(data["column1" = criteria1]) | (data["column2" = criteria2)] - str.contrains() + data[data["column1"].str.contains("value")] - str.endswith() + data[data["column1"].str.endswith("value")] - str.startswith() + data[data["column1"].str.startswith("value")]

Creating DataFrame

- Pandas DataFrame can be created in multiple ways. - Here, Methods 1, 2, 3 do not require any CSV files and they use the pd.DataFrame() method. + note: it is not pd.Dataframe() + note: it is not pd.dataframe() - Method 1: + When you are given a list of lists, use this method. + df.list = pd.DataFrame(list, columns = ["columns_name"]) - Method 2: + When you are given a dictionary and the values are Python lists, use this method. + The keys will become the column names and the values will become the rows. + df_dictlist = pd.DataFrame(dictlist) - Method 3: + Two lists can be merged by using list(zip()) function. + tuple_list = list(zip(list1, list2)) + df = pd.DataFrame(tuple_list, columns = ["columns_name"]) - Method 4 + Used when you already have a CSV file and it uses the pd.read_csv() method + If you have a CSV file named abc.csv in the SAME folder as this .ipynb file, then you can simply read it in using the following: + my_file = pd.read_csv("abc.csv")

Selecting rows and columns

- Read rows and columns + data.iloc[2:4, 5:7] -> rows 2&3 + columns 5&6 - Read a specific location + data.iloc[3, 4] -> returns cell at row 3 and column 4 - Read all rows using column name + data.loc[:,"column_name"] - Read a specific row according to criterias + data.loc[data["header"] = criteria,["columns_name"]]

Dataframe row operations

- Row is accessed through the use of index, index 0 is typically for header + Getting specific row or rows (slicing) using df.loc[index] + index starts from 0 + index slicing [start:stop:step] and stop value is included - Other methods: + df.drop(number) #drop certain row + df.drop([no1,no2]) #drop row2 + df.append(another_df) #to insert another dataframe into df

Standard indexing

- Selecting rows and columns using standard indexing + This may look similar to loc[] or iloc[], but this is different! + When using standard indexing, we select rows using index numbers and we select columns using column names. + df[0:2][["column1","column2"]]

Data cleaning (Missing values)

- Some ways to handle missing values: + Removal: remove the record with missing values + Re-design data collection: to ensures every field is filled, giving user options to select instead of allowing empty value field. + Substitute or replace an appropriate value. - Check the total number of empty cells for each column + data.isnull().sum() + data.isna().sum() - Show which entries are NaN or empty cells + data.[data.isna().any(axis=1)] + Axis=0 means that it is referring to the row +Axis=1 means it is referring to the column - Fill in the missing value with acquisition_filled using fillna() method. - Other handling methods can be used accordingly: + Fill missing value with appropriate statistical value (median, mean, mode etc.) - Dropping the record + df.dropna() + df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True) #at least 1 empty cell + df.dropna(axis=0, how='all', thresh=None, subset=None, inplace=True) #rows where all cells are empty

Sorting data in pandas

- Sorting DataFrame in ascending order based on 1 specific column + data.sort_values("column_name") + data.sort_values("column_name", ascending = False) #reverse sort order - Notice that the rows are not sorted even though we used sort_values() in previous cell. + That's because you need to use inplace=True if you want the sorted order to be saved + data.sort_values(['product2_price', 'product2_quantity_ordered'], ascending = (True, False), inplace=True) - Sorting a text-based column + default is A to Z

Pandas indexing and slicing

- There are 3 ways to select and slice rows and colums of a Pandas DataFrame: + using standard indexing + using iloc[] + using loc[] - Note: selecting and slicing do NOT change the contents of the Pandas DataFrame.

Selecting using iloc[]

- When iloc[], you need to remember that there are 2 parts inside the pair of square brackets in iloc[]. + On the left of the comma, you will slice the rows of the DataFrame using index numbers. + On the right of the comma, you will slice the columns of the DataFrame using index numbers. NOT column names. - Important note: When using iloc[], Stop index is EXcluded!!! + This is different from when you are using iloc[] and when using standard indexing (that means without loc[] or iloc[]). - Selecting multiple rows by using iloc[] + df.iloc[0:2] -> return rows 0 and 1 + df.iloc[[0:2]] -> return rows 0 and 2 + df.iloc[0,2] -> return value at row 0 and column2 - Selecting multiple rows and multiple columns + df.iloc[[0,1],[0,1]] - Using slice object in iloc[] + When using iloc[], stop index is EXcluded!!! + df.iloc[0:2,0:2] -> return rows 0&1 + columns 0&1 + df.iloc[::2,0:2] -> return rows 0&2&4 + columns 0&1

Selecting using loc[]

- When loc[], you need to remember that there are 2 parts inside the pair of square brackets in loc[]. + On the left of the comma, you will slice the rows of the DataFrame using index numbers. + On the right of the comma, you will slice the columns of the DataFrame using the column names (Not index numbers) - Important note: When using loc[], Stop index is INcluded!!! + This is different from when you are using iloc[] and when using standard indexing (that means without loc[] or iloc[]). - Selecting single row + df.loc[0] - Selecting multiple rows + df.loc[[0,2]] + df.loc[0:2] - Selecting one row and one column + df.loc[[0], "columns"] - Using slice object in loc[] + When using loc[], Stop index is INcluded!!! + df.loc[0:4:2] -> return rows 0,2 and 4


Ensembles d'études connexes

Human Biology Test 3 Chapters 11, 12, 14

View Set

A&P Exam 2 - Four Major Phases of Muscle Contraction and Relaxation

View Set

Unit 16 - Respiratory Disorders Questions

View Set

DECA Marketing Cluster Exam District Level Practice Test

View Set

CHAPTER 17 Inflammation and Immunity

View Set