Python Workout

How to format or suppress scientific notations in a pandas dataframe?

# Solution 1: Rounding
df.round(4)
# Solution 2: Format every cell as a string, element-wise
# (a row-wise apply with axis=1 fails because '%.4f' expects a single number)
df.applymap(lambda x: '%.4f' % x)  # DataFrame.map in pandas >= 2.1
# Solution 3: Use set_option
pd.set_option('display.float_format', lambda x: '%.4f' % x)
# Solution 4: Assign display.float_format
pd.options.display.float_format = '{:.4f}'.format
print(df)
# Reset/undo float formatting
pd.options.display.float_format = None

How to import only every nth row from a csv file to create a dataframe?

# Solution 1: Use chunks and a for-loop (DataFrame.append was removed in pandas 2.0)
rows = []
for chunk in pd.read_csv('https://.csv', chunksize=50):
    rows.append(chunk.iloc[0])  # first row of each 50-row chunk
df2 = pd.DataFrame(rows)
# Solution 2: Use chunks and list comprehension
df = pd.read_csv('https://csv', chunksize=50)
df2 = pd.concat([chunk.iloc[0] for chunk in df], axis=1)
df2 = df2.transpose()
# Solution 3: Use csv reader (open() needs a local file path)
import csv
with open('https://csv', 'r') as f:
    reader = csv.reader(f)
    out = []
    for i, row in enumerate(reader):
        if i % 50 == 0:
            out.append(row)
df2 = pd.DataFrame(out[1:], columns=out[0])
print(df2.head())
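Another option, sketched here under the assumption that the file has a single header row, is to pass a callable to `skiprows` so that pandas drops the unwanted lines while parsing. The in-memory CSV and the step `n = 3` are illustrative only.

```python
import io

import pandas as pd

# Illustrative CSV: a header line followed by the values 0..9.
csv_text = "x\n" + "\n".join(str(i) for i in range(10))
n = 3  # keep every 3rd data row

# The callable receives the line number (0 is the header).
# Returning True skips the line, so the header and every nth data row survive.
df = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and (i - 1) % n != 0,
)
print(df)
```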

How to stack two series vertically and horizontally?

# Vertical (Series.append was removed in pandas 2.0)
pd.concat([ser1, ser2])
# Horizontal
df = pd.concat([ser1, ser2], axis=1)

How to get the nrows, ncolumns, datatype, summary stats of each column of a dataframe? Also get the array and list equivalent.

# number of rows and columns
print(df.shape)
# datatypes
print(df.dtypes)
# how many columns under each dtype (get_dtype_counts was removed in pandas 1.0)
print(df.dtypes.value_counts())
# summary statistics
df_stats = df.describe()
# numpy array
df_arr = df.values
# list
df_list = df.values.tolist()

How to bin a numeric series to 10 groups of equal size?

pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1], labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']).head()
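When custom labels are not needed, the same decile split can be sketched by passing an integer to `q`. The series below is made-up data used only to show that the bins come out equally sized.

```python
import numpy as np
import pandas as pd

# Illustrative data: 100 distinct values.
ser = pd.Series(np.arange(100))

# q=10 asks for ten quantile-based bins; labels=False returns bin numbers 0..9.
deciles = pd.qcut(ser, q=10, labels=False)

# Each bin should hold the same number of items.
counts = deciles.value_counts().sort_index()
print(counts.tolist())
```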

How to extract items at given positions from a series?

pos = [0, 4, 8, 14, 20]
ser.take(pos)

How to find the position of the nth largest value greater than a given value?

print('ser: ', ser.tolist(), 'mean: ', round(ser.mean()))
# position of the 2nd value that exceeds the mean
np.flatnonzero((ser > ser.mean()).to_numpy())[1]

How to filter every nth row in a dataframe?

print(df.iloc[::20, :][['Date', 'Close', 'Volume']])

How to convert the first character of each element in a series to uppercase?

ser = pd.Series(['how', 'to', 'kick', 'ass?'])
# Solution 1
ser.map(lambda x: x.title())
# Solution 2
ser.map(lambda x: x[0].upper() + x[1:])
# Solution 3
pd.Series([i.title() for i in ser])

How to create a TimeSeries starting '2000-01-01' and 10 weekends (saturdays) after that having random numbers as values?

ser = pd.Series(np.random.randint(1,10,10), pd.date_range('2000-01-01', periods=10, freq='W-SAT'))

How to calculate the number of characters in each word in a series?

ser.map(lambda x: len(x))

How to assign name to the series' index?

ser.index.name = 'alphabets'  # names the index
# ser.name = 'alphabets' would name the series itself, not its index

How to get frequency counts of unique items of a series?

ser.value_counts()

How to create a series from a list, numpy array and dict?

ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)

How to get the items of series A not present in series B?

ser1[~ser1.isin(ser2)]

How to change the order of columns of a dataframe?

# Input
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
# Solution Q1
df[list('cbade')]
# Solution Q2 - No hard coding
def switch_columns(df, col1=None, col2=None):
    colnames = df.columns.tolist()
    i1, i2 = colnames.index(col1), colnames.index(col2)
    colnames[i2], colnames[i1] = colnames[i1], colnames[i2]
    return df[colnames]
df1 = switch_columns(df, 'a', 'c')
# Solution Q3: sort columns alphabetically
df[sorted(df.columns)]
# or, in reverse order and in place: df.sort_index(axis=1, ascending=False, inplace=True)

How to get the mean of a series grouped by another series?

# Input
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
# Solution
weights.groupby(fruit).mean()
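Because the input above is random, the result changes on every run. A small deterministic sketch, with made-up values, makes the grouping easier to check by hand.

```python
import pandas as pd

# Illustrative, fixed inputs: the two series are aligned by position.
fruit = pd.Series(['apple', 'banana', 'apple', 'banana'])
weights = pd.Series([1.0, 2.0, 3.0, 4.0])

# Group the weights by the labels in `fruit`, then average each group.
means = weights.groupby(fruit).mean()
print(means)
```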

How to compute the euclidean distance between two series?

# Input
p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
# Solution
sum((p - q)**2)**.5
# Solution (using func)
np.linalg.norm(p-q)

How to select a specific column from a dataframe as a dataframe instead of a series?

# Input
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
# Solution
type(df[['a']])
type(df.loc[:, ['a']])
type(df.iloc[:, [0]])
# Alternately the following returns a Series
type(df.a)
type(df['a'])
type(df.loc[:, 'a'])
type(df.iloc[:, 0])

How to find and cap outliers from a series or dataframe column?

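def cap_outliers(ser, low_perc, high_perc):
    # low_perc and high_perc are quantile fractions, e.g. .05 and .95
    low, high = ser.quantile([low_perc, high_perc])
    print(low_perc, 'quantile: ', low, '|', high_perc, 'quantile: ', high)
    ser = ser.copy()  # avoid modifying the caller's series
    ser[ser < low] = low
    ser[ser > high] = high
    return ser
capped_ser = cap_outliers(ser, .05, .95)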
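The same capping can be sketched with `Series.clip`, which returns a new series and leaves the input untouched. The data below is illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data: the values 1.0 .. 100.0.
ser = pd.Series(np.arange(1, 101, dtype=float))

# Quantile bounds for the 5th and 95th percentiles.
low, high = ser.quantile([0.05, 0.95])

# clip() replaces values below `low` with `low` and values above `high` with `high`.
capped = ser.clip(lower=low, upper=high)
print(capped.min(), capped.max())
```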

How to compute the mean squared error on a truth and predicted series?

truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)
# Solution
np.mean((truth-pred)**2)

How to set the number of rows and columns displayed in the output?

# Solution
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

How to convert a series of date-strings to a timeseries?

# Solution 1
from dateutil.parser import parse
ser.map(lambda x: parse(x))
# Solution 2
pd.to_datetime(ser)

How to compute the autocorrelations of a numeric series?

autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]
print(autocorrelations[1:])
print('Lag having highest correlation: ', np.argmax(np.abs(autocorrelations[1:]))+1)
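To see what `autocorr` computes, the sketch below compares it with the Pearson correlation between a series and a shifted copy of itself. The random data and the lag of 3 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with a fixed seed so the run is repeatable.
rng = np.random.default_rng(0)
ser = pd.Series(rng.normal(size=50))

lag = 3
builtin = ser.autocorr(lag)

# Same quantity by hand: correlate the series with itself shifted by `lag`.
# corr() ignores the NaN values that shift() introduces at the start.
manual = ser.corr(ser.shift(lag))
print(builtin, manual)
```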

How to use apply function on existing columns with global variables as additional arguments?

d = {'Min.Price': np.nanmean, 'Max.Price': np.nanmedian}
df[['Min.Price', 'Max.Price']] = df[['Min.Price', 'Max.Price']].apply(
    lambda x, d: x.fillna(d[x.name](x)), args=(d, ))

How to find all the local maxima (or peaks) in a numeric series?

dd = np.diff(np.sign(np.diff(ser)))
peak_locs = np.where(dd == -2)[0] + 1
peak_locs
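A worked run on a small, made-up series shows why `-2` marks a peak: the sign of the first difference flips from `+1` (rising) to `-1` (falling), so its own difference is `-2`.

```python
import numpy as np
import pandas as pd

# Illustrative series with peaks at positions 1, 5 and 7.
ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])

slopes = np.sign(np.diff(ser))   # +1 where rising, -1 where falling
dd = np.diff(slopes)             # -2 where a rise is followed by a fall

# +1 because dd[i] describes the turn at position i + 1 of the series.
peak_locs = np.where(dd == -2)[0] + 1
print(peak_locs)
```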

How to convert a numpy array to a dataframe of given shape?

df = pd.DataFrame(ser.values.reshape(7,5))

How to change column values when importing csv to a dataframe?

df = pd.read_csv('https://.csv', converters={'medv': lambda x: 'High' if float(x) > 25 else 'Low'})

How to import only specified columns from a csv file?

df = pd.read_csv('https://.csv', usecols=['col1', 'col2'])

How to convert the index of a series into a column of a dataframe?

df = ser.to_frame().reset_index()

How to check if a dataframe has any missing values?

df.isnull().values.any()

How to create a primary key index by combining relevant columns?

df[['Manufacturer', 'Model', 'Type']] = df[['Manufacturer', 'Model', 'Type']].fillna('missing')
df.index = df.Manufacturer + '_' + df.Model + '_' + df.Type
print(df.index.is_unique)

How to replace missing values of multiple numeric columns with the mean?

cols = ['Min.Price', 'Max.Price']
df[cols] = df[cols].apply(lambda x: x.fillna(x.mean()))
df_out = df[cols]
print(df_out.head())
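A shorter variant, sketched on a made-up frame that reuses the column names from the card, passes the column means straight to `fillna`, which matches them to columns by name.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column.
df = pd.DataFrame({
    'Min.Price': [1.0, np.nan, 3.0],
    'Max.Price': [np.nan, 4.0, 6.0],
})

# df.mean() is a Series indexed by column name; fillna aligns on those names.
filled = df.fillna(df.mean())
print(filled)
```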

How to filter words that contain at least 2 vowels from a series?

from collections import Counter
mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
ser[mask]

How to import pandas and check the version?

import numpy as np  # optional
import pandas as pd
print(pd.__version__)
pd.show_versions(as_json=True)  # prints its own report and returns None

How to replace missing spaces in a string with the least frequent character?

my_str = 'dbc deb abed gade'
# Solution
ser = pd.Series(list(my_str))
freq = ser.value_counts()
print(freq)
least_freq = freq.dropna().index[-1]
"".join(ser.replace(' ', least_freq))

How to get the row number of the nth largest value in a column?

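n = 5
# Row position of the nth largest value (n = 1 is the largest).
# Sorting the underlying array avoids label-based lookup on the argsort result.
df['a'].to_numpy().argsort()[::-1][n - 1]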
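An alternative sketch uses `Series.nlargest`, which returns the top `n` values in descending order together with their index labels. The frame below is illustrative.

```python
import pandas as pd

# Illustrative column.
df = pd.DataFrame({'a': [10, 50, 30, 40, 20]})
n = 2  # look for the 2nd largest value

# The last entry of the top-n selection is the nth largest value.
row_label = df['a'].nlargest(n).index[-1]

# Convert the index label to a positional row number.
row_pos = df.index.get_loc(row_label)
print(row_label, row_pos, df['a'].iloc[row_pos])
```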

How to count the number of missing values in each column?

n_missings_each_col = df.isnull().sum()
# name of the column with the most missing values
n_missings_each_col.idxmax()

How to find the positions of numbers that are multiples of 3 from a series?

np.argwhere((ser % 3 == 0).to_numpy())

How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?

np.percentile(ser, q=[0, 25, 50, 75, 100])

How to format all the values in a dataframe as percentages?

out = df.style.format({ 'colname': '{0:.2%}'.format, })

How to compute the difference of differences between consecutive numbers of a series?

print(ser.diff().tolist())
print(ser.diff().diff().tolist())

How to keep only top 2 most frequent values as it is and replace everything else as 'Other'?

ser[~ser.isin(ser.value_counts().index[:2])] = 'Other'
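The one-liner above modifies `ser` in place. A non-mutating sketch with `Series.where`, on made-up data, keeps the original series intact.

```python
import pandas as pd

# Illustrative data: 'a' and 'b' are the two most frequent values.
ser = pd.Series(list('aaabbcd'))

# value_counts() sorts by frequency, so the first two labels are the top 2.
top2 = ser.value_counts().index[:2]

# where() keeps entries where the condition holds and substitutes 'Other' elsewhere.
out = ser.where(ser.isin(top2), 'Other')
print(out.tolist())
```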

How to get the items not common to both series A and series B?

ser_u = pd.Series(np.union1d(ser1, ser2))  # union
ser_i = pd.Series(np.intersect1d(ser1, ser2))  # intersect
ser_u[~ser_u.isin(ser_i)]
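Union minus intersection is the symmetric difference, which NumPy exposes directly as `np.setxor1d`. A minimal sketch with illustrative inputs:

```python
import numpy as np
import pandas as pd

# Illustrative inputs that share the values 4 and 5.
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# setxor1d returns the sorted values that appear in exactly one of the inputs.
sym_diff = pd.Series(np.setxor1d(ser1.to_numpy(), ser2.to_numpy()))
print(sym_diff.tolist())
```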

How to extract the row and column number of a particular cell with given criterion?

# Solution
# Get Manufacturer with highest price
df.loc[df.Price == np.max(df.Price), ['Manufacturer', 'Model', 'Type']]
# Get Row and Column number
row, col = np.where(df.values == np.max(df.Price))
# Get the value
df.iat[row[0], col[0]]
df.iloc[row[0], col[0]]
# Alternate (DataFrame.get_value was removed in pandas 1.0)
df.at[row[0], 'Price']
# The difference between `iat` - `iloc` vs `at` - `loc` is:
# `iat` and `iloc` accept row and column numbers,
# whereas `at` and `loc` accept index and column names.

How to rename specific columns in a dataframe?

# Solution
# Step 1: rename the 'Type' column
df = df.rename(columns={'Type': 'CarType'})
# or df.columns.values[2] = "CarType" (edits the index in place; prefer rename)
# Step 2: replace '.' with '_' in every column name
df.columns = df.columns.map(lambda x: x.replace('.', '_'))
print(df.columns)

How to reshape a dataframe to the largest possible square after removing the negative values?

# Solution
# Step 1: remove negative values from arr
arr = df[df > 0].values.flatten()
arr_qualified = arr[~np.isnan(arr)]
# Step 2: find side-length of largest possible square
n = int(np.floor(arr_qualified.shape[0]**.5))
# Step 3: Take top n^2 items without changing positions
top_indexes = np.argsort(arr_qualified)[::-1]
output = np.take(arr_qualified, sorted(top_indexes[:n**2])).reshape(n, -1)
print(output)

How to get the last n rows of a dataframe with row sum > 100?

# Solution
# compute row sums
rowsums = df.apply(np.sum, axis=1)
# last two rows with row sum greater than 100
last_two_rows = df.iloc[np.where(rowsums > 100)[0][-2:], :]

How to get the day of month, week number, day of year and day of week from a series of date strings?

# Solution
from dateutil.parser import parse
ser_ts = ser.map(lambda x: parse(x))
# day of month
print("Date: ", ser_ts.dt.day.tolist())
# week number (dt.weekofyear was removed in pandas 2.0)
print("Week number: ", ser_ts.dt.isocalendar().week.tolist())
# day of year
print("Day number of year: ", ser_ts.dt.dayofyear.tolist())
# day of week (dt.weekday_name was removed in pandas 1.0)
print("Day of week: ", ser_ts.dt.day_name().tolist())
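A small sketch with two made-up ISO-formatted dates shows the values these accessors return. Note that the week number follows the ISO calendar, so 1 January can fall in week 52 or 53 of the previous year.

```python
import pandas as pd

# Illustrative date strings.
ser = pd.Series(['2010-01-01', '2011-02-02'])
ser_ts = pd.to_datetime(ser)

days = ser_ts.dt.day.tolist()                 # day of month
weeks = ser_ts.dt.isocalendar().week.tolist() # ISO week number
doy = ser_ts.dt.dayofyear.tolist()            # day of year
names = ser_ts.dt.day_name().tolist()         # weekday name
print(days, weeks, doy, names)
```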

How to fill an intermittent time series so all missing dates show up with values of previous non-missing date?

# Solution
ser.resample('D').ffill()  # fill with previous value
# Alternatives
ser.resample('D').bfill()  # fill with next value
ser.resample('D').bfill().ffill()  # fill next else prev value
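A worked sketch on a made-up series with gaps shows how daily resampling adds the missing dates and forward-fill carries the last known value into them.

```python
import pandas as pd

# Illustrative series: observations on 1, 3 and 6 January 2000 only.
ser = pd.Series(
    [1.0, 10.0, 3.0],
    index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06']),
)

# resample('D') creates one row per calendar day; ffill copies the previous value forward.
filled = ser.resample('D').ffill()
print(filled)
```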

How to get the positions of items of series A in another series B?

# Solution 1
[np.where(i == ser1)[0].tolist()[0] for i in ser2]
# Solution 2
[pd.Index(ser1).get_loc(i) for i in ser2]

How to combine many series to form a dataframe?

# Solution 1
df = pd.concat([ser1, ser2], axis=1)
# Solution 2
df = pd.DataFrame({'col1': ser1, 'col2': ser2})
print(df.head())

How to convert year-month string to dates corresponding to the 4th day of the month?

# Solution 1
from dateutil.parser import parse
# Parse the date
ser_ts = ser.map(lambda x: parse(x))
# Construct date string with date as 4
ser_datestr = ser_ts.dt.year.astype('str') + '-' + ser_ts.dt.month.astype('str') + '-' + '04'
# Format it.
[parse(i).strftime('%Y-%m-%d') for i in ser_datestr]
# Solution 2
ser.map(lambda x: parse('04 ' + x))

How to filter valid emails from a series?

# Solution 1 (as series of strings)
import re
pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'
mask = emails.map(lambda x: bool(re.match(pattern, x)))
emails[mask]
# Solution 2 (as series of list)
emails.str.findall(pattern, flags=re.IGNORECASE)
# Solution 3 (as list)
[x[0] for x in [re.findall(pattern, email) for email in emails] if len(x) > 0]
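`re.match` only anchors at the start of the string, so an entry with trailing text after a valid address still passes Solution 1. If the whole entry must be an address, a stricter sketch uses `Series.str.fullmatch` (available in pandas 1.1 and later). The sample entries are illustrative.

```python
import pandas as pd

pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'

# Illustrative entries: two clean addresses, one plain sentence,
# and one address followed by extra text.
emails = pd.Series([
    'buying books at amazom.com',
    'rameses@egypt.com',
    'matt@t.co extra text',
    'narendra@modi.com',
])

# fullmatch requires the entire string to match the pattern.
valid = emails[emails.str.fullmatch(pattern)]
print(valid.tolist())
```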

How to create a dataframe with rows as strides from a given series?

L = pd.Series(range(15))
def gen_strides(a, stride_len=5, window_len=5):
    n_strides = ((a.size-window_len)//stride_len) + 1
    return np.array([a[s:(s+window_len)] for s in np.arange(0, a.size, stride_len)[:n_strides]])
gen_strides(L, stride_len=2, window_len=4)
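With NumPy 1.20 or later, the same windows can be sketched with `sliding_window_view`, taking every `stride_len`-th window, and then wrapped in a dataframe. This is an alternative technique, not the card's original method.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(15)
stride_len, window_len = 2, 4

# All length-4 windows (starting at 0, 1, 2, ...), then keep every 2nd one.
strides = sliding_window_view(a, window_shape=window_len)[::stride_len]

# One row per stride.
df = pd.DataFrame(strides)
print(df)
```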

