Data Cleaning
Import the following: pandas, numpy, stats from scipy, minmax_scaling from mlxtend.preprocessing, seaborn, and matplotlib.pyplot. Read the Kickstarter data into a variable called kickstarter and set a random seed with numpy.
# modules we'll use
import pandas as pd
import numpy as np
# for Box-Cox Transformation
from scipy import stats
# for min_max scaling
from mlxtend.preprocessing import minmax_scaling
# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt
# read in all our data (the filename here is a placeholder; point it at your copy of the Kickstarter CSV)
kickstarter = pd.read_csv("kickstarter.csv")
# set seed for reproducibility
np.random.seed(0)
Use the next code cell to create a variable new_entry that changes the encoding from "big5-tw" to "utf-8". new_entry should have the bytes datatype. 1. Create a new variable, before, equal to sample_entry with the decode('big5-tw') method applied. 2. Create new_entry by calling the encode() method on the decoded variable (encode() defaults to UTF-8).
Solution:
before = sample_entry.decode("big5-tw")
new_entry = before.encode()
Save a version of the police killings dataset to CSV with UTF-8 encoding. Your answer will be marked correct after saving this file. The good news is, since UTF-8 is the standard encoding in Python, when you save a file it will be saved as UTF-8 by default:
Solution:
police_killings.to_csv("my_file.csv")
Setup:
import pandas as pd
import numpy as np
import chardet
np.random.seed(0)
sample_entry = b'\xa7A\xa6n'
print(sample_entry)
print('data type:', type(sample_entry))
Can you guess what the class type is?
bytes
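If you aren't sure which encoding a byte string uses, the chardet module (imported in the setup above) can make a guess. A minimal sketch, assuming only the setup cell has run; on a string this short the guess may be weak, so treat it as a hint rather than an answer:

# ask chardet to guess the encoding of the raw bytes
guess = chardet.detect(sample_entry)
print(guess)  # a dict with 'encoding' and 'confidence' keys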
# remove all columns with at least one missing value
Create a variable called columns_with_na_dropped that drops the NAs along axis 1, then get the first rows of the new variable with head().
columns_with_na_dropped = nfl.dropna(axis=1)
columns_with_na_dropped.head()
earthquakes.loc[3378, "Date"] = "02/23/1975"
earthquakes.loc[7512, "Date"] = "04/28/1985"
earthquakes.loc[20650, "Date"] = "03/13/2011"
earthquakes['date_parsed'] = pd.to_datetime(earthquakes['Date'], format="%m/%d/%Y")
Create a variable day_of_month_earthquakes containing the 'date_parsed' column of earthquakes with the dt.day accessor. Set the same variable equal to itself with the dropna() method applied. Then plot the days of the month from your earthquake dataset using distplot with kde=False and bins=31.
day_of_month_earthquakes = earthquakes['date_parsed'].dt.day
day_of_month_earthquakes = day_of_month_earthquakes.dropna()
sns.distplot(day_of_month_earthquakes, kde=False, bins=31)
# modules we'll use
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
# read in our data (the filename here is a placeholder; point it at your copy of the earthquakes CSV)
earthquakes = pd.read_csv("earthquake.csv")
# set seed for reproducibility
np.random.seed(0)
You'll be working with the "Date" column from the earthquakes dataframe. Investigate this column now: does it look like it contains dates? What code can be used to find the dtype of the Date column?
earthquakes['Date'].dtype
dtype('O')
Note that print(type(earthquakes['Date'])) would give <class 'pandas.core.series.Series'>, which is the container's class, not the dtype of the values inside it. Pandas uses the "object" dtype for storing various types of data, but most often when you see a column with the dtype "object" it will have strings in it.
Now that we know that our date column isn't being recognized as a date, it's time to convert it so that it is recognized as a date. This is called "parsing dates" because we're taking in a string and identifying its component parts. Given all of this information, it's your turn to create a new column "date_parsed" in the earthquakes dataset that has correctly parsed dates in it.
* Use earthquakes.loc with an index and "Date" to set the following values:
1. Index 3378 with "Date" set to "02/23/1975"
2. Index 7512 with "Date" set to "04/28/1985"
3. Index 20650 with "Date" set to "03/13/2011"
* Then create earthquakes['date_parsed'] equal to pd.to_datetime(earthquakes['Date'], format='%m/%d/%Y')
earthquakes.loc[3378, "Date"] = "02/23/1975"
earthquakes.loc[7512, "Date"] = "04/28/1985"
earthquakes.loc[20650, "Date"] = "03/13/2011"
earthquakes['date_parsed'] = pd.to_datetime(earthquakes['Date'], format="%m/%d/%Y")
The values for indices 3378, 7512, and 20650 in the Time column need to be corrected as well. Follow the same steps as before: replace the values at all three indices with '13:44:18', then parse the Time column into a new time_parsed column with the format '%H:%M:%S'.
earthquakes.loc[3378, 'Time'] = '13:44:18'
earthquakes.loc[7512, 'Time'] = '13:44:18'
earthquakes.loc[20650, 'Time'] = '13:44:18'
# parse into a separate column so the date_parsed column isn't overwritten
earthquakes['time_parsed'] = pd.to_datetime(earthquakes['Time'], format='%H:%M:%S')
print("COMPLETE")
1) Most of the entries in the "Date" column follow the same format: "month/day/four-digit year". However, the entry at index 3378 follows a completely different pattern. Write code to view the rows from 3378 to 3383.
2) This does appear to be an issue with data entry: ideally, all entries in the column have the same format. We can get an idea of how widespread this issue is by checking the length of each entry in the "Date" column.
* Create a new variable, date_lengths, holding the length of each entry via earthquakes.Date.str.len()
* Then call value_counts() on the new variable
earthquakes[3378:3383]
date_lengths = earthquakes.Date.str.len()
date_lengths.value_counts()
1. Import pandas
2. Import numpy
3. Read nfl.csv into a variable called nfl
4. Set a random seed of 0 with numpy
import pandas as pd
import numpy as np
nfl = pd.read_csv("C:\\Users\\PC\\data\\nfl.csv")
np.random.seed(0)
print("Complete")
Now you'll practice normalization. We begin by normalizing the amount of money pledged to each campaign.
# create a new variable called index_of_positive_pledges: kickstarter.usd_pledged_real greater than 0
# create a new variable called positive_pledges: kickstarter.usd_pledged_real.loc indexed by index_of_positive_pledges
# create a new variable called normalized_pledges equal to pd.Series of stats.boxcox(positive_pledges) at index 0, with name='usd_pledged_real' and index=positive_pledges.index
# plot both the positive and normalized pledges together using a histogram for each
index_of_positive_pledges = kickstarter.usd_pledged_real > 0
positive_pledges = kickstarter.usd_pledged_real.loc[index_of_positive_pledges]
normalized_pledges = pd.Series(stats.boxcox(positive_pledges)[0],
                               name='usd_pledged_real',
                               index=positive_pledges.index)
fig, ax = plt.subplots(1, 2, figsize=(15, 3))
sns.distplot(positive_pledges, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_pledges, ax=ax[1])
ax[1].set_title("Normalized data")
Looks like there are two more rows that have a date in a different format. Write code that will find the other two.
1. First, show the exact index number of each value with a length equal to 24 (you will need numpy's where method and a new variable)
2. Then show those rows in a table, combining earthquakes.loc with the new variable
* Create a new variable named indices using np.where on a list containing date_lengths == 24, then take index [1] of the result
* Print the variable indices
* Combine earthquakes with loc and the index indices
indices = np.where([date_lengths == 24])[1]
print('Indices with corrupted data:', indices)
earthquakes.loc[indices]
The other data is the byt__________________ data type, which is a sequence of integ__________________. You can convert a string into bytes by specifying which encoding it's in
bytes integers
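A short sketch of the idea, using only built-in Python (the string "hi" is just an illustration): encoding a string produces bytes, and those bytes are a sequence of integers underneath:

# encode a string into bytes by naming the encoding
hi_bytes = "hi".encode("utf-8")
print(hi_bytes)         # b'hi'
print(list(hi_bytes))   # [104, 105] -- the underlying integers
# decode turns the bytes back into a string
print(hi_bytes.decode("utf-8"))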
Character encodings are specific sets of rules for map__________________ from raw bina__________________ byt__________________ strings (that look like this: 0110100001101001) to characters that make up human-readable text (like "hi"). There are many different encodings, and if you try to read in text with a different encoding than the one it was originally written in, you end up with scrambled text called "mojibake" (said like mo-gee-bah-kay). Here's an example of mojibake: æ-‡å—åŒ-ã?? UT____-8 is the standard text encoding. All Python code is in UT____-8 and, ideally, all your data should be as well. It's when things aren't in UT____-8 that you run into trouble.
mapping binary byte UTF
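A minimal sketch of how mojibake happens, using only built-in Python (the sample word is invented for illustration): encode text in one encoding, then wrongly decode it in another:

text = "héllo"
raw = text.encode("utf-8")     # write the text out as UTF-8 bytes
print(raw.decode("latin-1"))   # decode with the wrong encoding: prints 'hÃ©llo'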
Normal distribution: Also known as the "bell curve", this is a specific statistical distribution where roughly equal numbers of observations fall above and below the mea____________, the mea____________ and the med____________ are the sam____________, and there are more observ____________ clos____________ to the mea____________. The normal distribution is also known as the Gaussian distribution.
mean mean median same observations closer mean
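A small sketch that checks those properties on simulated data, reusing the numpy and seaborn imports from the setup cells; the exact numbers will vary from draw to draw:

# draw 10,000 points from a normal distribution
data = np.random.normal(loc=0, scale=1, size=10000)
print("mean:  ", data.mean())       # close to 0
print("median:", np.median(data))   # close to the mean
sns.distplot(data, kde=False)       # the familiar bell curve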
Ok, now we know that we do have some missing values. Let's see how many we have in each column.
# get the number of missing data points per column for the nfl data
1) Create a new variable missing_values_count
2) Use isnull() and sum() to count the null values
3) Get the first 10 columns (using a slice) from missing_values_count
missing_values_count = nfl.isnull().sum()
missing_values_count[0:10]
# look at a few rows of the nfl file using sample() with 5 rows
nfl.sample(5)
Fuzzy matching: The process of automatically finding text strings that are very simil__________________ to the targ__________________ string. In general, a string is considered "closer" to another one the fewer characters you'd need to change if you were transforming one string into another. So "apple" and "snapple" are two changes away from each other (add "s" and "n") while "in" and "on" are one change away (replace "i" with "o"). You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time. Fuzzywuzzy returns a rati__________________ given two strings. The closer the rati__________________ is to 1__________________, the smal__________________ the edit dista__________________ between the two strings.
similar target ratio ratio 100 smaller distance
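A quick sketch scoring the pairs from the card above, assuming the fuzzywuzzy package is installed:

from fuzzywuzzy import fuzz
# ratio runs from 0 to 100; higher means a smaller edit distance
print(fuzz.ratio("apple", "snapple"))  # two edits apart, still a high score
print(fuzz.ratio("in", "on"))          # one edit, but on very short strings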
We can use pandas' fillna() function to fill in missing values in a dataframe for us. One option we have is to specify what we want the NaN values to be replaced with. Here, I'm saying that I would like to replace all the NaN values with 0.
# replace all NA's with 0 by calling the fillna method on subset_nfl
subset_nfl.fillna(0)
I could also be a bit more savvy and replace missing values with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)
# replace all NA's with the value that comes directly after it in the same column
# call fillna on subset_nfl with method='bfill' and axis=0, then chain another fillna(0) after it to catch anything left over
subset_nfl.fillna(method='bfill', axis=0).fillna(0)
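Here's a tiny self-contained sketch of what that chain does, on a made-up column (the values are invented purely for illustration):

# NaNs take the next value below them; the trailing NaN has nothing
# after it, so the second fillna(0) catches it
toy = pd.DataFrame({'points': [np.nan, 7.0, np.nan, 3.0, np.nan]})
print(toy.fillna(method='bfill', axis=0).fillna(0))
# points become [7.0, 7.0, 3.0, 3.0, 0.0]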
Filling in missing values automatically
Another option is to try and fill in the missing values. For this next bit, I'm getting a small sub-section of the NFL data so that it will print well.
# get a small subset of the NFL dataset: create the variable subset_nfl using nfl.loc with [:, 'EPA':'Season'], take the first 5 rows, and print it
subset_nfl = nfl.loc[:, 'EPA':'Season'].head()
print(subset_nfl)
That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing, to give us a better sense of the scale of this problem:
1. Create a variable called total_cells and compute the total number of cells in the nfl file (use np.product with shape)
2. Create a variable called total_missing and take the sum of the missing-values variable made earlier
3. Divide total_missing by total_cells and multiply by 100
total_cells = np.product(nfl.shape)
total_missing = missing_values_count.sum()
(total_missing / total_cells) * 100
# look at a few rows of the permits file using sample() with 5 rows
sf_permits.sample(5)
Let's start by scaling the goals of each campaign, which is how much money they were asking for. The plots show a histogram of the values in the "usd_goal_real" column, both before and after scaling.
# create a new variable called original_data: pd.DataFrame of kickstarter.usd_goal_real
# create a new variable called scaled_data: minmax_scaling called with original_data and columns=['usd_goal_real']
# plot the original & scaled data together to compare: fig, ax = plt.subplots with 1 row, 2 columns, and figsize (15, 3); sns.distplot for original_data with ax=ax[0] and a title; sns.distplot for scaled_data with ax=ax[1] and a title
original_data = pd.DataFrame(kickstarter.usd_goal_real)
scaled_data = minmax_scaling(original_data, columns=['usd_goal_real'])
fig, ax = plt.subplots(1, 2, figsize=(15, 3))
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")
Figure out what the correct encoding should be and read in the file to a DataFrame police_killings.
* Make sure to end the call with encoding='Windows-1252'
police_killings = pd.read_csv("C:\\PC\\Users\\CSV\\policekillings.csv", encoding='Windows-1252')
# just how much data did we lose?
Find the total number of columns in the original nfl dataset using nfl.shape[1] (print it with %d and the % operator). Then find the number of remaining columns using columns_with_na_dropped.shape[1] (again with %d and %).
print("Columns in original dataset: %d \n" % nfl.shape[1]) print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1]) Columns in original dataset: 102 Columns with na's dropped: 41
# convert to lower case
professors['Country'] = convert this column to lower case using the str.lower() method on the same column
# remove trailing white spaces
professors['Country'] = strip trailing white spaces using the str.strip() method on the same column
# get the top 10 closest matches to "south korea"
create a variable called countries equal to professors['Country'].unique()
create a variable called matches equal to fuzzywuzzy.process.extract called with "south korea", the countries variable, limit=10, and scorer=fuzzywuzzy.fuzz.token_sort_ratio
# the fuzzywuzzy imports this cell needs
import fuzzywuzzy
from fuzzywuzzy import process
professors['Country'] = professors['Country'].str.lower()
professors['Country'] = professors['Country'].str.strip()
countries = professors['Country'].unique()
# pass the countries array itself, not the string 'countries'
matches = fuzzywuzzy.process.extract("south korea", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
In scaling, you're changing the ran___________ of your data. This means that you're transforming your data so that it fits within a specific scale, like 0-100 or 0-1. In normalization, you're changing the sha____________ of the distri________________ of your data. The point of normalization is to change your observ____________ so that they can be described as a norm____________ distrib____________.
range shape distribution observations normal distribution
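A compact sketch contrasting the two, reusing the setup-cell imports (the exponential data is generated just for illustration):

# 1000 positive, skewed data points
skewed = np.random.exponential(size=1000)
# scaling: same shape, new range (0 to 1)
scaled_demo = minmax_scaling(skewed, columns=[0])
# normalization: Box-Cox reshapes the distribution toward a bell curve
normalized_demo = stats.boxcox(skewed)[0]
fig, ax = plt.subplots(1, 3, figsize=(15, 3))
sns.distplot(skewed, ax=ax[0])
ax[0].set_title("Original")
sns.distplot(scaled_demo, ax=ax[1])
ax[1].set_title("Scaled (0-1)")
sns.distplot(normalized_demo, ax=ax[2])
ax[2].set_title("Normalized")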
