8. pandas Foundations
# Read in the data file: df
df = pd.read_csv(data_file)
# Print the output of df.head()
print(df.head())
# Read in the data file with header=None: df_headers
df_headers = pd.read_csv(data_file, header=None)
# Print the output of df_headers.head()
print(df_headers.head())
1. Reading in a data file The problem with real data such as this is that the files are almost never formatted in a convenient way. In this exercise, there are several problems to overcome in reading the file. First, there is no header, and thus the columns don't have labels. There is also no obvious index column, since none of the data columns contain a full date or time.
# Split on the comma to create a list: column_labels_list
column_labels_list = column_labels.split(',')
# Assign the new column labels to the DataFrame: df.columns
df.columns = column_labels_list
# Remove the appropriate columns: df_dropped
df_dropped = df.drop(list_to_drop, axis='columns')
# Print the output of df_dropped.head()
print(df_dropped.head())
2. Re-assigning column names After the initial step of reading in the data, the next step is to clean and tidy it so that it is easier to work with.
# Convert the date column to string: df_dropped['date']
df_dropped['date'] = df_dropped['date'].astype(str)
# Pad leading zeros to the Time column: df_dropped['Time']
df_dropped['Time'] = df_dropped['Time'].apply(lambda x: '{:0>4}'.format(x))
# Concatenate the new date and Time columns: date_string
date_string = df_dropped['date'] + df_dropped['Time']
# Convert the date_string Series to datetime: date_times
date_times = pd.to_datetime(date_string, format='%Y%m%d%H%M')
# Set the index to be the new date_times container: df_clean
df_clean = df_dropped.set_index(date_times)
# Print the output of df_clean.head()
print(df_clean.head())
3. Cleaning and tidying datetime data
# Print the dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
print(df_clean.loc['2011-06-20 08:00:00':'2011-06-20 09:00:00', 'dry_bulb_faren'])
# Convert the dry_bulb_faren column to numeric values: df_clean['dry_bulb_faren']
df_clean['dry_bulb_faren'] = pd.to_numeric(df_clean['dry_bulb_faren'], errors='coerce')
# Print the transformed dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
print(df_clean.loc['2011-06-20 08:00:00':'2011-06-20 09:00:00', 'dry_bulb_faren'])
# Convert the wind_speed and dew_point_faren columns to numeric values
df_clean['wind_speed'] = pd.to_numeric(df_clean['wind_speed'], errors='coerce')
df_clean['dew_point_faren'] = pd.to_numeric(df_clean['dew_point_faren'], errors='coerce')
4. Cleaning the numeric columns The numeric columns contain missing values labeled as 'M'. In this exercise, your job is to transform these columns such that they contain only numeric values and interpret missing data as NaN.
# Make a string with the value 'PA': state
state = 'PA'
# Construct a dictionary: data
data = {'state': state, 'city': cities}
# Construct a DataFrame from dictionary data: df
df = pd.DataFrame(data)
Building DataFrames with broadcasting You can implicitly use 'broadcasting', a feature of NumPy, when creating pandas DataFrames. In this exercise, you're going to create a DataFrame of cities in Pennsylvania that contains the city name in one column and the state name in the second. We have imported the names of 15 cities as the list cities.
# Print the minimum value of the Engineering column
print(df['Engineering'].min())
# Print the maximum value of the Engineering column
print(df['Engineering'].max())
# Construct the mean percentage per year: mean
mean = df.mean(axis='columns')
# Plot the average percentage per year
mean.plot()
# Display the plot
plt.show()
Compute the minimum and maximum values of the 'Engineering' column and generate a line plot of the mean value of all 17 academic fields per year. To perform this step, you'll use the .mean() method with the keyword argument axis='columns'. This computes the mean across all columns per row.
# Prepare a format string: time_format
time_format = '%Y-%m-%d %H:%M'
# Convert date_list into a datetime object: my_datetimes
my_datetimes = pd.to_datetime(date_list, format=time_format)
# Construct a pandas Series using temperature_list and my_datetimes: time_series
time_series = pd.Series(temperature_list, index=my_datetimes)
Creating and using a DatetimeIndex The pandas DatetimeIndex is a powerful way to handle time series data, so it is valuable to know how to build one yourself. Pandas provides the pd.to_datetime() function for just this task. For example, if passed the list of strings ['2015-01-01 091234','2015-01-01 091234'] and a format specification variable, such as format='%Y-%m-%d %H%M%S', pandas will parse each string into the proper datetime elements and build the datetime objects.
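The prose example above can be run directly; here is a minimal, self-contained sketch using the exact strings and format specification quoted in the description:

import pandas as pd

# The example strings and format specification from the description above
date_strings = ['2015-01-01 091234', '2015-01-01 091234']
parsed = pd.to_datetime(date_strings, format='%Y-%m-%d %H%M%S')
print(parsed)  # DatetimeIndex of 2015-01-01 09:12:34 entries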
# Create a Boolean Series for sunny days: sunny
sunny = df_clean['sky_condition'] == 'CLR'
# Resample the Boolean Series by day and compute the sum: sunny_hours
sunny_hours = sunny.resample('D').sum()
# Resample the Boolean Series by day and compute the count: total_hours
total_hours = sunny.resample('D').count()
# Divide sunny_hours by total_hours: sunny_fraction
sunny_fraction = sunny_hours / total_hours
# Make a box plot of sunny_fraction
sunny_fraction.plot(kind='box')
plt.show()
Daily hours of clear sky The 'sky_condition' column is recorded hourly. Your job is to resample this column appropriately such that you can extract the number of sunny hours in a day and the number of total hours. Then, you can divide the number of sunny hours by the number of total hours, and generate a box plot of the resulting fraction.
# Read in the file with the correct parameters: df2
df2 = pd.read_csv(file_messy, delimiter=' ', header=3, comment='#')
# Print the output of df2.head()
print(df2.head())
# Save the cleaned up DataFrame to a CSV file without the index
df2.to_csv(file_clean, index=False)
# Save the cleaned up DataFrame to an Excel file without the index
df2.to_excel('file_clean.xlsx', index=False)
Delimiters, headers, and extensions
Reduce datetime rows to a slower frequency.
Downsampling
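A minimal sketch of downsampling, assuming a hypothetical hourly Series ts (the index and values here are invented for illustration):

import numpy as np
import pandas as pd

# Hypothetical hourly series spanning two days
index = pd.date_range('2010-01-01', periods=48, freq='h')
ts = pd.Series(np.arange(48), index=index)
# Downsample from hourly to daily frequency, aggregating each day by its mean
daily = ts.resample('D').mean()
print(daily)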
df[df['origin']=='Asia'].count()
Filtering and counting How many automobiles were manufactured in Asia in the automobile dataset?
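Note that .count() above reports a count for every column. If all you want is the number of matching rows, either of these equivalent forms (assuming the same automobiles DataFrame df) is more direct:

# Number of automobiles manufactured in Asia, as a single integer
print((df['origin'] == 'Asia').sum())
print(len(df[df['origin'] == 'Asia']))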
Fuel efficiency
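A minimal sketch for this exercise, assuming the same automobiles DataFrame df with an 'mpg' column (the median is used here as a robust summary of fuel efficiency):

# Median fuel efficiency across all automobiles
print(df['mpg'].median())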
# Resample dew_point_faren and dry_bulb_faren by month, aggregating the maximum values: monthly_max
monthly_max = df_clean[['dew_point_faren', 'dry_bulb_faren']].resample('M').max()
# Generate a histogram with bins=8, alpha=0.5, subplots=True
monthly_max.plot(kind='hist', bins=8, alpha=0.5, subplots=True)
# Show the plot
plt.show()
Heat or humidity Dew point is a measure of relative humidity based on pressure and temperature. A dew point above 65 degrees F is considered uncomfortable, as is a temperature above 90 degrees F.
# Build a list of labels: list_labels
list_labels = ['year', 'artist', 'song', 'chart weeks']
# Assign the list of labels to the columns attribute: df.columns
df.columns = list_labels
Labeling your data
# Print summary statistics of the fare column with .describe()
print(df.fare.describe())
# Generate a box plot of the fare column
df.fare.plot(kind='box')
# Show the plot
plt.show()
Median vs mean In many data sets, there can be a large difference between the mean and the median due to the presence of outliers.
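A tiny illustration of the effect, using invented data: a single extreme fare pulls the mean far away from the median.

import pandas as pd

# Nine typical fares plus one extreme outlier
fares = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 11, 500])
print(fares.mean())    # 60.2, dragged up by the outlier
print(fares.median())  # 11.5, robust to the outlier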
# Strip extra whitespace from the column names: df.columns
df.columns = df.columns.str.strip()
# Build a Boolean Series that is True where the destination airport is Dallas: dallas
dallas = df['Destination Airport'].str.contains('DAL')
# Compute the total number of Dallas departures each day: daily_departures
daily_departures = dallas.resample('D').sum()
# Generate the summary statistics for daily Dallas departures: stats
stats = daily_departures.describe()
Method chaining and filtering We've seen that pandas supports method chaining. This technique can be very powerful when cleaning and filtering data.
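For instance, the steps above can be condensed into a single chain (assuming, as above, that df has a DatetimeIndex and that its column names have already been stripped):

# One chained expression: select the column, match 'DAL', and sum the matches per day
daily_departures = df['Destination Airport'].str.contains('DAL').resample('D').sum()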
# Reset the index of ts2 to ts1, and then use linear interpolation to fill in the NaNs: ts2_interp
ts2_interp = ts2.reindex(ts1.index).interpolate(method='linear')
# Compute the absolute difference of ts1 and ts2_interp: differences
differences = np.abs(ts1 - ts2_interp)
# Generate and print summary statistics of the differences
print(differences.describe())
Missing values and interpolation One common application of interpolation in data analysis is to fill in missing data. In this exercise, noisy measured data that has some dropped or otherwise missing values has been loaded. The goal is to compare two time series, and then look at summary statistics of the differences. The problem is that one of the data sets is missing data at some of the times.
# Create array of DataFrame values: np_vals
np_vals = df.values
# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)
# Create array of new DataFrame by passing df to np.log10(): df_log10
df_log10 = np.log10(df)
# Print original and new data containers
[print(x, 'has type', type(eval(x))) for x in ['np_vals', 'np_vals_log10', 'df', 'df_log10']]
NumPy and pandas working together pandas depends upon and interoperates with NumPy, the Python library for fast numeric array computations. For example, you can use the DataFrame attribute .values to represent a DataFrame df as a NumPy array. You can also pass pandas data structures to NumPy methods.
# Extract the hour from 9pm to 10pm on '2010-10-11': ts1
ts1 = ts0.loc['2010-10-11 21:00:00':'2010-10-11 22:00:00']
# Extract '2010-07-04' from ts0: ts2
ts2 = ts0.loc['2010-07-04']
# Extract data from '2010-12-15' to '2010-12-31': ts3
ts3 = ts0.loc['2010-12-15':'2010-12-31']
Partial string indexing and slicing Pandas time series support "partial string" indexing. What this means is that even when passed only a portion of the datetime, such as the date but not the time, pandas is remarkably good at doing what one would expect. Pandas datetime indexing also supports a wide variety of commonly used datetime string formats, even when mixed.
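For example, several common spellings of the same date should select the same rows; a sketch, assuming the hourly series ts0 from above:

# All of these select July 4, 2010 from a DatetimeIndex
print(ts0.loc['2010-07-04'])
print(ts0.loc['2010-Jul-04'])
print(ts0.loc['July 4, 2010'])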
# Plot all columns (default)
df.plot()
plt.show()
# Plot all columns as subplots
df.plot(subplots=True)
plt.show()
# Plot just the Dew Point data
column_list1 = ['Dew Point (deg F)']
df[column_list1].plot()
plt.show()
# Plot the Dew Point and Temperature data, but not the Pressure data
column_list2 = ['Temperature (deg F)', 'Dew Point (deg F)']
df[column_list2].plot()
plt.show()
Plotting DataFrames
# Plot the summer data
df.Temperature['2010-Jun':'2010-Aug'].plot()
plt.show()
plt.clf()
# Plot the one week data
df.Temperature['2010-06-10':'2010-06-17'].plot()
plt.show()
plt.clf()
Plotting date ranges, partial indexing Now that you have set the DatetimeIndex in your DataFrame, you have a much more powerful and flexible set of tools to use when plotting your time series data. Of these, one of the most convenient is partial string indexing and slicing.
# Create a plot with color='red'
df.plot(color='red')
# Add a title
plt.title('Temperature in Austin')
# Specify the x-axis label
plt.xlabel('Hours since midnight August 1, 2010')
# Specify the y-axis label
plt.ylabel('Temperature (degrees F)')
# Display the plot
plt.show()
Plotting series using pandas
# Plot the raw data before setting the datetime index
df.plot()
plt.show()
# Convert the 'Date' column into a collection of datetime objects: df.Date
df.Date = pd.to_datetime(df.Date)
# Set the index to be the converted 'Date' column
df.set_index('Date', inplace=True)
# Re-plot the DataFrame to see that the axis is now datetime aware!
df.plot()
plt.show()
Plotting time series, datetime indexing Pandas handles datetimes not only in your data, but also in your plotting.
# Print the number of countries reported in 2015
print(df['2015'].count())
# Print the 5th and 95th percentiles
print(df.quantile([0.05, 0.95]))
# Generate a box plot
years = ['1800', '1850', '1900', '1950', '2000']
df[years].plot(kind='box')
plt.show()
Quantiles
# Read in the file: df1
df1 = pd.read_csv(data_file)
# Create a list of the new column labels: new_labels
new_labels = ['year', 'population']
# Read in the file, specifying the header and names parameters: df2
df2 = pd.read_csv(data_file, header=0, names=new_labels)
Reading a flat file
# Read in the file, parsing the 'Date' column as the index: df3
df3 = pd.read_csv(filename, index_col='Date', parse_dates=True)
# Select all rows for a single day with partial string indexing
df3.loc['2010-Aug-01']
Reading and slicing times
# Reindex without fill method: ts3
ts3 = ts2.reindex(ts1.index)
# Reindex with fill method, using forward fill: ts4
ts4 = ts2.reindex(ts1.index, method='ffill')
Reindexing the Index Reindexing is useful in preparation for adding or otherwise combining two time series data sets.
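To see why: arithmetic between two Series aligns on their indexes and produces NaN wherever a timestamp exists in one series but not the other, so reindexing with a fill method first avoids those gaps. A sketch using ts1, ts2, and ts4 from above:

# Mismatched indexes leave NaN at timestamps missing from ts2
sum_raw = ts1 + ts2
# After forward-fill reindexing, every timestamp in ts1.index has a value
sum_filled = ts1 + ts4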
# Extract the August 2010 data: august
august = df['Temperature']['2010-08']
# Resample to daily data, aggregating by max: daily_highs
daily_highs = august.resample('D').max()
# Use a rolling 7-day window with method chaining to smooth the daily high temperatures in August
daily_highs_smoothed = daily_highs.rolling(window=7).mean()
print(daily_highs_smoothed)
Resample and roll with it As of pandas version 0.18.0, the interface for applying rolling transformations to time series has become more consistent and flexible, and feels somewhat like a groupby. You can now flexibly chain together resampling and rolling operations.
Use statistical methods over different time intervals
Resampling
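A minimal sketch, assuming a Series ts with a DatetimeIndex: .resample() groups the observations into time bins, and the chained method computes a statistic for each bin.

# Weekly mean, daily maximum, and monthly count of the same series
weekly_mean = ts.resample('W').mean()
daily_max = ts.resample('D').max()
monthly_count = ts.resample('M').count()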
# Downsample to 6 hour data and aggregate by mean: df1
df1 = df['Temperature'].resample('6h').mean()
# Downsample to daily data and count the number of data points: df2
df2 = df['Temperature'].resample('D').count()
Resampling and frequency Pandas provides methods for resampling time series data. When downsampling or upsampling, the syntax is similar, but the methods called are different. Both use the concept of 'method chaining' - df.method1().method2().method3() - to direct the output from one method call to the input of the next, and so on, as a sequence of operations, one feeding into the next.
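The example above only downsamples; upsampling uses the same .resample() call but chains a filling method instead of an aggregation. A sketch, assuming the same df (the 15-minute target frequency is arbitrary):

# Upsample the temperature to 15-minute frequency and forward-fill the gaps
df3 = df['Temperature'].resample('15min').ffill()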
# Extract data from 2010-Aug-01 to 2010-Aug-15: unsmoothed
unsmoothed = df['Temperature']['2010-Aug-01':'2010-Aug-15']
# Apply a rolling mean with a 24 hour window: smoothed
smoothed = unsmoothed.rolling(window=24).mean()
# Create a new DataFrame with columns smoothed and unsmoothed: august
august = pd.DataFrame({'smoothed': smoothed, 'unsmoothed': unsmoothed})
# Plot both smoothed and unsmoothed data using august.plot()
august.plot()
plt.show()
Rolling mean and frequency Rolling means (or moving averages) are generally used to smooth out short-term fluctuations in time series data and highlight long-term trends. To use the .rolling() method, you must always use method chaining, first calling .rolling() and then chaining an aggregation method after it.
# Display the box plots on 3 separate rows and 1 column
fig, axes = plt.subplots(nrows=3, ncols=1)
# Generate a box plot of the fare prices for the First passenger class
titanic.loc[titanic['pclass'] == 1].plot(ax=axes[0], y='fare', kind='box')
# Generate a box plot of the fare prices for the Second passenger class
titanic.loc[titanic['pclass'] == 2].plot(ax=axes[1], y='fare', kind='box')
# Generate a box plot of the fare prices for the Third passenger class
titanic.loc[titanic['pclass'] == 3].plot(ax=axes[2], y='fare', kind='box')
# Display the plot
plt.show()
Separate and plot Use Boolean filtering and generate box plots of the fare prices for each of the three passenger classes. The fare prices are contained in the 'fare' column and passenger class information is contained in the 'pclass' column.
# Compute the global mean and global standard deviation: global_mean, global_std
global_mean = df.mean()
global_std = df.std()
# Filter the US population from the origin column: us
us = df[df['origin'] == 'US']
# Compute the US mean and US standard deviation: us_mean, us_std
us_mean = us.mean()
us_std = us.std()
# Print the differences
print(us_mean - global_mean)
print(us_std - global_std)
Separate and summarize Let's use population filtering to determine how the automobiles in the US differ from the global average and standard deviation. How does the distribution of fuel efficiency (MPG) for the US differ from the global average and standard deviation?
# Extract temperature data for August: august
august = df.loc['2010-08']['Temperature']
# Downsample to obtain only the daily highest temperatures in August: august_highs
august_highs = august.resample('D').max()
# Extract temperature data for February: february
february = df.loc['2010-02']['Temperature']
# Downsample to obtain the daily lowest temperatures in February: february_lows
february_lows = february.resample('D').min()
Separating and resampling With pandas, you can resample in different ways on different subsets of your data. For example, resampling different months of data with different aggregations.
# Downsample df_clean by day and aggregate by mean: daily_mean_2011
daily_mean_2011 = df_clean.resample('D').mean()
# Extract the dry_bulb_faren column from daily_mean_2011 using .values: daily_temp_2011
daily_temp_2011 = daily_mean_2011.dry_bulb_faren.values
# Downsample df_climate by day and aggregate by mean: daily_climate
daily_climate = df_climate.resample('D').mean()
# Extract the Temperature column from daily_climate using .reset_index(): daily_temp_climate
daily_temp_climate = daily_climate.reset_index()['Temperature']
# Compute the difference between the two arrays and print the mean difference
difference = daily_temp_2011 - daily_temp_climate
print(difference.mean())
Signal variance Your job is to first resample df_clean and df_climate by day and aggregate the mean temperatures. You will then extract the temperature related columns from each - 'dry_bulb_faren' in df_clean, and 'Temperature' in df_climate - as NumPy arrays and compute the difference.
# Print the mean of the January and March data
print(january.mean(), march.mean())
# Print the standard deviation of the January and March data
print(january.std(), march.std())
Standard deviation of temperature
# Print the median of the dry_bulb_faren column
print(df_clean['dry_bulb_faren'].median())
# Print the median of the dry_bulb_faren column for the time range '2011-Apr':'2011-Jun'
print(df_clean.loc['2011-04':'2011-06', 'dry_bulb_faren'].median())
# Print the median of the dry_bulb_faren column for the month of January
print(df_clean.loc['2011-01', 'dry_bulb_faren'].median())
Statistical exploratory data analysis: Signal min, max, median Print the median temperatures for specific time ranges. You can do this using partial datetime string selection.
# Select days that are sunny: sunny
sunny = df_clean.loc[df_clean['sky_condition'] == 'CLR']
# Select days that are overcast: overcast
overcast = df_clean.loc[df_clean['sky_condition'].str.contains('OVC')]
# Resample sunny and overcast, aggregating by maximum daily temperature
sunny_daily_max = sunny.resample('D').max()
overcast_daily_max = overcast.resample('D').max()
# Print the difference between the mean of sunny_daily_max and overcast_daily_max
print(sunny_daily_max.mean() - overcast_daily_max.mean())
Sunny or cloudy On average, how much hotter is it when the sun is shining? In this exercise, you will compare temperatures on sunny days against temperatures on overcast days. Your job is to use Boolean selection to filter out sunny and overcast days, and then compute the difference of the mean daily maximum temperatures between each type of day.
# Build a Boolean mask to filter out all the 'LAX' departure flights: mask
mask = df['Destination Airport'] == 'LAX'
# Use the mask to subset the data: la
la = df[mask]
# Combine two columns of data to create a datetime series: times_tz_none
times_tz_none = pd.to_datetime(la['Date (MM/DD/YYYY)'] + ' ' + la['Wheels-off Time'])
# Localize the time to US/Central: times_tz_central
times_tz_central = times_tz_none.dt.tz_localize('US/Central')
# Convert the datetimes from US/Central to US/Pacific
times_tz_pacific = times_tz_central.dt.tz_convert('US/Pacific')
Time zones and conversion
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Select the visibility and dry_bulb_faren columns and resample them: weekly_mean
weekly_mean = df_clean[['visibility', 'dry_bulb_faren']].resample('W').mean()
# Print the output of weekly_mean.corr()
print(weekly_mean.corr())
# Plot weekly_mean with subplots=True
weekly_mean.plot(subplots=True)
plt.show()
Weekly average temperature and visibility Your job is to plot the weekly average temperature and visibility as subplots. To do this, you need to first select the appropriate columns and then resample by week, aggregating the mean.
# Zip the 2 lists together into one list of (key, value) tuples: zipped
zipped = list(zip(list_keys, list_values))
# Inspect the list using print()
print(zipped)
# Build a dictionary with the zipped list: data
data = dict(zipped)
# Build and inspect a DataFrame from the dictionary: df
df = pd.DataFrame(data)
Zip lists to build a DataFrame
# Make a list of the column names to be plotted: cols
cols = ['weight', 'mpg']
# Generate the box plots
df[cols].plot(kind='box', subplots=True)
# Display the plot
plt.show()
pandas box plots
# This formats the plots such that they appear on 2 separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)
# Plot the PDF (density=True replaces the deprecated normed=True argument)
df.fraction.plot(ax=axes[0], kind='hist', bins=30, density=True, range=(0, .3))
# Plot the CDF
df.fraction.plot(ax=axes[1], kind='hist', bins=30, cumulative=True, density=True, range=(0, .3))
# Display both plots
plt.show()
pandas hist, pdf and cdf Pandas uses the .hist() method not only to generate histograms, but also to plot probability density functions (PDFs) and cumulative distribution functions (CDFs).
# Create a list of y-axis column names: y_columns
y_columns = ['AAPL', 'IBM']
# Generate a line plot
df.plot(x='Month', y=y_columns)
# Add the title
plt.title('Monthly stock prices')
# Add the y-axis label
plt.ylabel('Price ($US)')
# Display the plot
plt.show()
pandas line plots
# Generate a scatter plot
df.plot(kind='scatter', x='hp', y='mpg', s=sizes)
# Add the title
plt.title('Fuel efficiency vs Horse-power')
# Add the x-axis label
plt.xlabel('Horse-power')
# Add the y-axis label
plt.ylabel('Fuel efficiency (mpg)')
# Display the plot
plt.show()
pandas scatter plots