7. Cleaning data


# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

1-to-1 data merge

# Concatenate the DataFrames row-wise
gapminder = pd.concat([g1800s, g1900s, g2000s])

# Print the shape of gapminder
print(gapminder.shape)

# Print the head of gapminder
print(gapminder.head())

Assembling your data

# Convert the year column to numeric
gapminder.year = pd.to_numeric(gapminder['year'])

# Test if country is of type object
# (np.object was removed from NumPy; the builtin object works the same here)
assert gapminder.country.dtypes == object

# Test if year is of type int64
assert gapminder.year.dtypes == np.int64

# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64

Checking the data types

# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)

# Print the head of ebola_tidy
print(ebola_tidy.head())

Combining columns of data. This uses the same pd.concat() function, but this time with the keyword argument axis=1. The default, axis=0, performs a row-wise concatenation.

# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

# Print the shape of row_concat
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())

Combining rows of data

# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Print the info of tips
print(tips.info())

Converting data types. Storing categorical variables in a DataFrame with the category dtype reduces memory usage.

# Define recode_sex()
def recode_sex(sex_value):
    # Return 1 if sex_value is 'Male'
    if sex_value == 'Male':
        return 1
    # Return 0 if sex_value is 'Female'
    elif sex_value == 'Female':
        return 0
    # Return np.nan otherwise
    else:
        return np.nan

# Apply the function to the sex column
tips['sex_recode'] = tips.sex.apply(recode_sex)

Custom functions to clean data. Recoding variables like this is a common data cleaning task. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code. This makes your code more readable and less error-prone.

# Create the new DataFrame: tracks
tracks = billboard[['year', 'artist', 'track', 'time']]

# Print info of tracks
print(tracks.info())

# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()

# Print info of tracks_no_duplicates
print(tracks_no_duplicates.info())

Dropping duplicate data. Duplicate data causes a variety of problems: from the point of view of performance, it uses up unnecessary amounts of memory and causes unneeded calculations to be performed when processing data. In addition, it can bias any analysis results.

In the IPython Shell, use pandas methods such as .head(), .info(), and .describe(), and DataFrame attributes like .columns and .shape, to explore the DataFrame.

Exploratory analysis
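A minimal sketch of this kind of first look, assuming a hypothetical dogs.csv file (the file name and columns are made up for illustration):

import pandas as pd

# Load the dataset (hypothetical file)
df = pd.read_csv('dogs.csv')

# Peek at the first few rows
print(df.head())

# Column dtypes and non-null counts
print(df.info())

# Summary statistics for the numeric columns
print(df.describe())

# Column names and dimensions
print(df.columns)
print(df.shape)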

# Import the regular expression module
import re

# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 6 strawberries and 2 bananas')

# Print the matches
print(matches)  # ['6', '2']

Extracting numerical values from strings. Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

Filling missing data. It's rare to have a real-world dataset without any missing values, and it's important to deal with them: certain calculations cannot handle missing values, while others will, by default, skip over them.

# Import the glob module
import glob

# Write the pattern: pattern
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

# Print the file names
print(csv_files)

# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])

Finding files that match a pattern. Use the glob module to find all CSV files in the workspace: you can find all .csv files with '*.csv', or all parts with 'part_*'. The ? wildcard represents exactly one character, and the * wildcard represents any number of characters.
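A small sketch of the wildcard variants mentioned above (the file names are hypothetical):

import glob

# All .csv files in the current directory
print(glob.glob('*.csv'))

# All files whose names start with 'part_'
print(glob.glob('part_*'))

# ? matches exactly one character: part_1.csv matches, part_10.csv does not
print(glob.glob('part_?.csv'))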

# Create an empty list: frames
frames = []

# Iterate over csv_files
for csv in csv_files:
    # Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    # Append df to frames
    frames.append(df)

# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

Iterating and concatenating all matches. Load all the files into a list of DataFrames that can then be concatenated.

# Import the regular expression module
import re

# Write the lambda function using replace
tips['total_dollar_replace'] = tips['total_dollar'].apply(lambda x: x.replace('$', ''))

# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips['total_dollar'].apply(lambda x: re.findall(r'\d+\.\d+', x)[0])

Lambda functions to clean data

# Write the regular expression: pattern
pattern = r'^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
mask_inverse = ~mask

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

Looking at country spellings. It is reasonable to assume that country names will contain: the set of lower and upper case letters, whitespace between words, and periods for any abbreviations.

# Merge the DataFrames: m2o
m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

Many-to-1 data merge. One of the values will be duplicated and recycled in the output; that is, one of the keys in the merge is not unique.

# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Merge m2m and survey: m2m
m2m = pd.merge(left=m2m, right=survey, left_on='ident', right_on='taken')

Many-to-many data merge. For each duplicated key, every pairwise combination will be created.

# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()

# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()

# Drop the missing values
gapminder = gapminder.dropna()

# Print the shape of gapminder
print(gapminder.shape)

More data cleaning and processing

plot and visualize

One of the best ways to confirm what the numbers are telling you

# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)  # True

# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)  # True

# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A]\w*', string='Australia'))
print(pattern3)  # True

Pattern matching

In melting, we turn columns into rows; in pivoting, we turn unique values into separate columns, moving from an analysis-friendly shape to a reporting-friendly shape.

airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

Pivoting data (un-melting)

# Pivot airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading', aggfunc=np.mean)

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

Pivoting duplicate values

Columns represent different variables.
Rows represent individual observations.
Observational units form tables.

Principles of tidy data
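A tiny illustration of these principles with made-up treatment data: the messy frame stores a variable (the treatment) in its column headers, and melting moves it into a column of its own:

import pandas as pd

# Messy: 'treatment_a' and 'treatment_b' are values, not variables
messy = pd.DataFrame({'name': ['Alice', 'Bob'],
                      'treatment_a': [10, 12],
                      'treatment_b': [14, 15]})

# Tidy: one column per variable, one row per observation
tidy = pd.melt(messy, id_vars='name', var_name='treatment', value_name='result')
print(tidy)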

Example match and the regex that produces it:

18: \d* matches any digits
$18: \$\d* matches a dollar sign followed by digits
$18.00: \$\d*\.\d* matches a value with a decimal point
$18.89: \$\d*\.\d{2} matches only 2 digits after the decimal point
$18.897: ^\$\d*\.\d{2}$ does not match, because the anchors require the string to end after exactly 2 decimal digits

Regex example matches
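A quick check of the anchored dollar pattern above (a sketch, not part of the course code):

import re

pattern = r'^\$\d*\.\d{2}$'

# Matches: exactly 2 digits after the decimal point
print(bool(re.match(pattern, '$18.89')))   # True

# Does not match: 3 digits after the decimal point
print(bool(re.match(pattern, '$18.897')))  # False

# Without the anchors, the extra digit would be silently ignored
print(bool(re.match(r'\$\d*\.\d{2}', '$18.897')))  # True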

airquality_pivot_reset = airquality_pivot.reset_index()

.reset_index() is a simple method you can use to get back the original DataFrame from the pivoted DataFrame.

Resetting the index of a DataFrame

# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(gapminder, id_vars='Life expectancy')

# Rename the columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']

# Print the head of gapminder_melt
print(gapminder_melt.head())

Reshaping your data. The gapminder DataFrame has a separate column for each year. What you want instead is a single column that contains the year, and a single column that represents the average life expectancy for each year and country. By having year in its own column, you can use it as a predictor variable in a later analysis.

Melting data is the process of turning columns of your data into rows of data.

pd.melt(df, id_vars=['A'], value_vars=['B'], var_name='myVarname', value_name='myValname')

Reshaping your data using melt

# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt['type_country'].str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

Splitting a column with .split() and .get()

# Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=['country', 'year'])

# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

Splitting a column with .str. You can take advantage of pandas' vectorized string slicing by using the str attribute of columns of type object.

# Import the regular expression module
import re

# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))  # True

# See if the pattern matches
result2 = prog.match('1123-456-7890')
print(bool(result2))  # False

String parsing with regular expressions

# Assert that there are no missing values
assert pd.notnull(ebola).all().all()

# Assert that all values are >= 0
assert (ebola >= 0).all().all()

Testing your data with asserts. Use an assert statement to verify that there are no more missing values.

def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()[1:-1]
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0

# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == 'Life expectancy'

# Check whether the values in the row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()

# Check that there is only one instance of each country
assert g1800s['Life expectancy'].value_counts()[0] == 1

Thinking about the question at hand. Before continuing, however, it's important to make sure that the following assumptions about the data are true: 'Life expectancy' is the first column (index 0) of the DataFrame; the other columns contain either null or numeric values; the numeric values are all greater than or equal to 0; there is only one instance of each country. You can write a function that you can apply over the entire DataFrame to verify some of these assumptions. Note that spending the time to write such a script will help you when working with other datasets as well.

One-to-one
Many-to-one / one-to-many
Many-to-many

Types of DataFrame merges
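A minimal sketch of how key uniqueness determines the merge type, with made-up site/visited tables in the spirit of the examples above:

import pandas as pd

site = pd.DataFrame({'name': ['DR-1', 'DR-3'],
                     'lat': [-49.85, -47.15]})
visited = pd.DataFrame({'site': ['DR-1', 'DR-3', 'DR-3'],
                        'dated': ['1927-02-08', '1930-01-12', '1939-01-07']})

# The keys are unique on the left, but 'DR-3' repeats on the right,
# so this is a many-to-1 merge: the lat for 'DR-3' is recycled.
m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')
print(m2o)

# If keys were unique on both sides, the same call would be a 1-to-1 merge;
# duplicated keys on both sides would make it many-to-many, producing
# every pairwise combination of matching rows.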

String manipulation: Python has many built-in and external libraries for it, such as the re module. A regular expression is a formal way of specifying a text pattern.

Using regular expressions to clean strings (regex)

df.boxplot(column='initial_cost', by='Borough', rot=90)

Use a boxplot to compare the 'initial_cost' across the different values of the 'Borough' column. Further investigation is needed to determine whether you can drop or keep the outliers in your data.

Visualizing multiple variables with boxplots: To visualize multiple variables, boxplots are useful, especially when one of the variables is categorical

df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)

In many cases you need to remove outliers to see patterns in a scatter plot.

Visualizing multiple variables with scatter plots. When you want to visualize two numeric columns, scatter plots are ideal

df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)

Create a histogram of the 'Existing Zoning Sqft' column. Rotate the axis labels by 70 degrees and use a log scale for both axes.

Visualizing single variables with histograms

# Create the scatter plot
g1800s.plot(kind='scatter', x='1800', y='1899')

# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')

# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)

# Display the plot
plt.show()

Visualizing your data. The goal is to visually check the data for insights as well as errors. When looking at the plot, pay attention to whether the scatter plot takes the form of a diagonal line, and which points fall below or above the diagonal line. This will inform how life expectancy in 1899 changed (or did not change) compared to 1800 for different countries.

# Convert 'total_bill' to a numeric dtype
# (errors='coerce' turns unparseable values into NaN)
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

# Print the info of tips
print(tips.info())

Working with numeric data. If you expect the data type of a column to be numeric (int or float) but it is instead of type object, this typically means that there is a non-numeric value in the column, which also signifies bad data.

# Add first subplot
plt.subplot(2, 1, 1)

# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind='hist')

# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()

# Print the head of gapminder_agg
print(gapminder_agg.head())

# Print the tail of gapminder_agg
print(gapminder_agg.tail())

# Add second subplot
plt.subplot(2, 1, 2)

# Create a line plot of life expectancy per year
gapminder_agg.plot()

# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')

# Display the plots
plt.tight_layout()
plt.show()

# Save both DataFrames to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')

Wrapping up. You'll begin by creating a histogram of the life_expectancy column. You should not get any values under 0, and you should see something reasonable on the higher end of the life_expectancy age range. Your next task is to investigate how average life expectancy changed over the years. To do this, you need to subset the data by each year, get the life_expectancy column from each subset, and take an average of the values. You can achieve this using the .groupby() method, which is covered in greater depth in Manipulating DataFrames with pandas. Finally, you can save your tidy and summarized DataFrame to a file using the .to_csv() method.

# Check that there is only one instance of each country
assert g1800s['Life expectancy'].value_counts()[0] == 1

Use an assert statement to make sure that each country occurs only once in the data, with the .value_counts() method on the 'Life expectancy' column. Specifically, index 0 of .value_counts() will contain the most frequently occurring value. If this is equal to 1 for the 'Life expectancy' column, then you can be certain that no country appears more than once in the data.

# Check whether the values in the row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()

Use an assert statement to test that all the values are valid for the g1800s DataFrame, with the check_null_or_valid() function placed inside the .apply() method. Note that because you're applying it over the entire DataFrame, and not just one column, you'll have to chain the .all() method twice, and remember that you don't have to use () for functions placed inside .apply().

melting and pivoting

basic tools to prepare data

Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

hierarchical index
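A small sketch of a hierarchical (multi-level) index; pivot_table with a list of index columns, as above, produces exactly this kind of MultiIndex (the values here are made up):

import pandas as pd

df = pd.DataFrame({'Month': [5, 5, 6, 6],
                   'Day': [1, 2, 1, 2],
                   'Ozone': [41.0, 36.0, 12.0, 23.0]})

# Build a two-level index from Month and Day
indexed = df.set_index(['Month', 'Day'])
print(indexed.index.nlevels)  # 2

# Select all rows for Month == 5 using the outer level
print(indexed.loc[5])

# .reset_index() flattens the MultiIndex back into ordinary columns
print(indexed.reset_index())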

.describe()

method for computing summary statistics
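For example (made-up values):

import pandas as pd

s = pd.Series([1, 2, 2, 3, 100])

# count, mean, std, min, quartiles, and max in one call
print(s.describe())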

1. Compile the pattern (save it in a variable).
2. Use the compiled pattern to match values.

steps to use regex

Categorical data

a variable that can take on one of a limited, and usually fixed, number of possible values
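A small sketch of the memory savings, assuming a column with few distinct values:

import pandas as pd

# A low-cardinality string column (made-up data)
sex = pd.Series(['Male', 'Female'] * 10000)

print(sex.memory_usage(deep=True))                     # object dtype: large
print(sex.astype('category').memory_usage(deep=True))  # category: much smaller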

assert statement

an error-checking statement: it raises an AssertionError when its condition is false
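In its simplest form:

# Passes silently when the condition is true
assert 1 + 1 == 2

# Raises an AssertionError when the condition is false
# assert 1 + 1 == 3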

Outlier

a statistical outlier: an observation that lies abnormally far from the other values in the data
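One common way to flag outliers (not from the course) is the 1.5 * IQR rule that boxplots use; a sketch assuming a numeric Series:

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 300])  # 300 is a likely outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Flag points outside [q1 - 1.5*iqr, q3 + 1.5*iqr]
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)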

