Pandas Cheat Sheet
df['w'].nunique()
# of distinct values in a column (Summarize data)
len(df)
# of rows in df (Summarize data)
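A quick sketch of both counting idioms with invented toy data:

```python
import pandas as pd

df = pd.DataFrame({'w': ['a', 'b', 'a', 'c']})
n_distinct = df['w'].nunique()  # distinct values in column 'w' -> 3
n_rows = len(df)                # number of rows -> 4
```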
Latex functions
%%latex
interactive debugger
%debug
magic functions
%lsmagic
better charts
%matplotlib notebook
df['Volume'] = df.Length*df.Height*df.Depth
Add single column (Make new columns)
adf[~adf.x1.isin(bdf.x1)]
All rows in adf that do not have a match in bdf (Combine Datasets Filter joins)
adf[adf.x1.isin(bdf.x1)]
All rows in adf that have a match in bdf (Combine Datasets Filter joins)
pd.concat([df1, df2], axis=1)
Append columns of DF (Reshaping Data)
pd.concat([df1, df2])
Append rows of DF (Reshaping Data)
df.describe()
Basic descriptive stat of each column (Summarize data)
Reshaping Data
Change the layout of a data set
df.assign(Area=lambda df: df.Length*df.Height)
Compute and append one or more new columns (Make new columns)
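A minimal sketch of assign; the column names and values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Length': [2.0, 3.0], 'Height': [4.0, 5.0]})
# assign returns a new DataFrame with the computed column appended
df = df.assign(Area=lambda d: d.Length * d.Height)
```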
count()
Count non-NA/null values of each object (Summarize data - the result is returned as a Pandas Series)
df['w'].value_counts() (pass dropna=False to count missing values)
Count number of rows with each unique value of variable (Summarize data)
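A toy example showing the dropna=False difference (data invented):

```python
import pandas as pd
import numpy as np

s = pd.Series(['a', 'a', 'b', np.nan])
counts = s.value_counts()                      # NaN is excluded by default
counts_with_na = s.value_counts(dropna=False)  # NaN is counted as its own value
```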
df.drop(columns=["Length", "Height"])
Drop columns from DF (Reshaping Data)
df.dropna()
Drop rows with any column having NA/null data (Handling Missing Data)
df[df['Length']>7]
Extract rows that meet logical criteria (Subset observation - Rows)
pd.melt(df)
Gather columns into rows (Reshaping Data)
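A small sketch of melt on invented data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# each (column, cell) pair becomes a row with 'variable' and 'value' columns
long_df = pd.melt(df)
```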
df.column.isin(values)
Group membership (Logic)
pd.merge(adf, bdf, how='outer', on='x1')
Join data. Retain all values, all rows. (Combine Datasets Standard joins)
pd.merge(adf, bdf, how='inner', on='x1')
Join data. Retain only rows in both sets (Combine Datasets Standard joins)
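A toy comparison of outer vs inner joins (keys and values invented):

```python
import pandas as pd

adf = pd.DataFrame({'x1': ['A', 'B', 'C'], 'x2': [1, 2, 3]})
bdf = pd.DataFrame({'x1': ['A', 'B', 'D'], 'x3': [True, False, True]})
outer = pd.merge(adf, bdf, how='outer', on='x1')  # all keys; NaN where one side is missing
inner = pd.merge(adf, bdf, how='inner', on='x1')  # only keys present in both
```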
max()
Maximum value of each object (Summarize data - the result is returned as a Pandas Series)
median()
Median value of each object (Summarize data - the result is returned as a Pandas Series)
df = (pd.melt(df).rename(columns={'variable': 'var', 'value': 'val'}))
Method chaining - most pandas methods return a DataFrame/Series, so calls can be strung together (improves code readability)
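A runnable chaining sketch on invented data; each method returns a DataFrame, so the next call chains off the result:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
result = (pd.melt(df)
            .rename(columns={'variable': 'var', 'value': 'val'})
            .query('val >= 3'))
```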
df.sort_values('mpg', ascending=False)
Order rows by values of column (high to low) (Reshaping Data)
df.sort_values('mpg')
Order rows by values of column (low to high) (Reshaping Data)
df.pct_change()
Percent change over the DataFrame or Series (a great way to calculate growth)
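A toy growth calculation (prices invented); the first element has no predecessor, so it is NaN:

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])
growth = prices.pct_change()  # [NaN, +10%, -10%]
```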
quantile()
Quantile of each object (Summarize data - the result is returned as a Pandas Series)
df.sample(frac=0.5)
Randomly select fraction of rows (Subset observation - Rows)
df.sample(n=10)
Randomly select n rows (Subset observation - Rows)
df.drop_duplicates()
Remove duplicate rows (only considers columns) (Subset observation - Rows)
df.fillna(value)
Replace all NA/null data with value (Handling Missing Data)
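A minimal fillna sketch with invented data; note fillna returns a new object:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
filled = df.fillna(0)  # every NaN becomes 0; the original df is unchanged
```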
df.reset_index()
Reset index of DF to row numbers, moving index to columns (Reshaping Data)
df.groupby(by='col')
Return a groupby object grouped by values in column (Group Data)
pd.merge(ydf, zdf)
Rows that appear in both ydf and zdf (Combine Datasets Set-like operations)
pd.merge(ydf, zdf, how='outer')
Rows that appear in either or both ydf and zdf (Union) (Combine Datasets Set-like operations)
df.nsmallest(n, 'value')
Select and order bottom n entries (Subset observation - Rows)
df.nlargest(n, 'value')
Select and order top n entries (Subset observation - Rows)
df.filter(regex='regex')
Select columns whose name matches regex (Subset Variables - Columns)
df[['length', 'width']]
Select multiple columns with specific names (Subset Variables - Columns)
df.iloc[10:20]
Select rows by position (Subset observation - Rows)
df.loc[df['a']>10, ['a', 'c']]
Select rows meeting logical conditions and only specific columns (Subset Variables - Columns)
df['width']
Select single column with specific name (Subset Variables - Columns)
size()
Size of each group (Group Data)
df.loc[:, 'x2':'x4']
Select all columns between x2 and x4 inclusive (Subset Variables - Columns)
df.iloc[:,[1,2,5]]
Select columns by position (positions start at 0) (Subset Variables - Columns)
df.sort_index()
Sort the index of DF (Reshaping Data)
df.pivot(columns='var', values='val')
Spread rows into columns (Reshaping Data)
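A pivot sketch, the inverse of melt; the 'row'/'var'/'val' names are invented:

```python
import pandas as pd

df = pd.DataFrame({'row': ['r1', 'r1', 'r2', 'r2'],
                   'var': ['A', 'B', 'A', 'B'],
                   'val': [1, 2, 3, 4]})
# unique 'var' entries become columns; 'val' fills the cells
wide = df.pivot(index='row', columns='var', values='val')
```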
std()
Standard deviation of each object (Summarize data - the result is returned as a Pandas Series)
sum()
Sum of each object (Summarize data - the result si return as a Pandas Series)
Each variable in its own column; each observation in its own row
Tidy data foundation
Vectorized functions
When performance is paramount, you should avoid using .apply() and .map() because those constructs perform Python for-loops over the data stored in a pandas Series or DataFrame. By using vectorized functions instead, you can loop over the data at the same speed as compiled code (C, Fortran, etc.)
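A small illustration of the point above (toy data); both produce the same values, but the vectorized call hands the whole array to compiled NumPy code at once:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])
via_apply = s.apply(np.sqrt)  # element-by-element dispatch
vectorized = np.sqrt(s)       # one call into compiled NumPy code
```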
.map(dict) election['color'] = election.winner.map(red_vs_blue)
a great way to map categorical data (e.g. Obama: blue, Romney: red) as an additional categorical column in a DataFrame
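A self-contained version of the election example above; the toy data mirrors the mapping described:

```python
import pandas as pd

election = pd.DataFrame({'winner': ['Obama', 'Romney', 'Obama']})
red_vs_blue = {'Obama': 'blue', 'Romney': 'red'}
# each winner is looked up in the dict to produce the new column
election['color'] = election.winner.map(red_vs_blue)
```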
df1.add(df2, fill_value=0)
add df1 + df2, filling missing entries with 0 to avoid null values
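A toy sketch of why fill_value matters when indexes don't fully overlap:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, 2.0]}, index=[0, 1])
df2 = pd.DataFrame({'a': [10.0]}, index=[1])
plain = df1 + df2                    # index 0 has no match in df2 -> NaN
filled = df1.add(df2, fill_value=0)  # missing entries are treated as 0
```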
agg(function)
aggregate group using function (Group Data)
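A minimal groupby-plus-agg sketch (column names invented):

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 'a', 'b'], 'val': [1, 2, 3]})
totals = df.groupby(by='col')['val'].agg('sum')  # one aggregate per group
```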
&, |, ~, ^, df.any(), df.all()
and, or, not, xor, any, all (Logic)
apply(function)
apply function to each object (Summarize data - the result is returned as a Pandas Series)
.values (np_vals = df.values)
attribute that returns the df's data as a NumPy array
df.filter(["Name", "College", "Salary"])
choose specific columns of df
df.astype()
convert column dtypes of a DataFrame
dict(list(zip(list_keys, list_values)))
convert list of tuple into dictionary
pd.to_numeric(df['col'])
convert non-numeric types (e.g. strings) to a suitable numeric type
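A toy sketch; with errors='coerce', unparseable entries become NaN instead of raising:

```python
import pandas as pd

s = pd.Series(['1', '2', 'oops'])
nums = pd.to_numeric(s, errors='coerce')  # 'oops' -> NaN
```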
list(zip(list_keys, list_values))
create a list of tuples
cufflinks
power of Plotly (interactivity)
df.divide(series, axis='index')
divide df by a pd.Series with fine-grained control (using plain '/' can misalign when the shapes don't match)
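A toy sketch; axis='index' divides each row by the matching Series element:

```python
import pandas as pd

df = pd.DataFrame({'a': [2.0, 4.0], 'b': [6.0, 8.0]})
s = pd.Series([2.0, 4.0])
result = df.divide(s, axis='index')  # row 0 / 2.0, row 1 / 4.0
```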
Tidy Data (complements vectorized operations)
foundation of wrangling in pandas
.resample('A') / .resample('W')
groups a time series into bins at the chosen frequency ('A' = annual, 'W' = weekly); an aggregation such as .mean() or .sum() is then applied to each bin. Various frequencies can be used to resample time series data.
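A small weekly-resample sketch on an invented 10-day daily series:

```python
import pandas as pd

idx = pd.date_range('2021-01-01', periods=10, freq='D')
s = pd.Series(range(10), index=idx)
weekly = s.resample('W').sum()  # one aggregated value per weekly bin
```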
pd.notnull(obj)
is not NaN (Logic)
pd.isnull(obj)
is NaN (Logic)
pd.merge(adf, bdf, how='right', on='x1')
join matching rows from adf to bdf (Combine Datasets Standard joins)
pd.merge(adf, bdf, how='left', on='x1')
join matching rows from bdf to adf (Combine Datasets Standard joins)
mean()
mean value of each object (Summarize data - the result is returned as a Pandas Series)
ffill()
method to replace the null values with the last preceding non-null value
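A minimal ffill sketch on invented data:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])
filled = s.ffill()  # each NaN takes the last preceding non-null value
```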
min()
min value of each object (Summarize data - the result is returned as a Pandas Series)
!=
not equal (Logic)
pandas_profiling.ProfileReport
best exploratory analysis of a df
df.sort_index() df.sort_values()
rearrange the sequence of the rows of a DataFrame by sorting
df.groupby(level='ind')
return a groupby object grouped by values in index level named 'ind' (Group Data)
zscore (vectorized function from scipy.stats)
the number of standard deviations by which an observation lies above the mean; a negative value means the observation is below the mean
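A hand-rolled version of the z-score on invented data; scipy.stats.zscore computes the same thing (both default to population std, ddof=0):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = (x - x.mean()) / x.std()  # standardized: mean 0, unit spread
```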
var()
variance of each object (Summarize data - the result is returned as a Pandas Series)