Pandas Cheat sheet

¡Supera tus tareas y exámenes ahora con Quizwiz!

df['w'].nunique()

# of distinct values in a column (Summarize data)

len(df)

# of rows in df (Summarize data)

Latex functions

%%latex

interactive debugger

%debug

magic fuctions

%lsmagic

better charts

%matplotlib notebook

df['Volume'] = df.Length*df.Height*df.Depth

Add single column (Make new columns)

adf[~adf.x1.isin(bdf.x1)]

All rows in adf that do not have a match id bdf (Combine Datasets Filter joins)

adf[adf.x1.isin(bdf.x1)]

All rows in adf that have a match in bdf (Combine Datasets Filter joins)

pd.concat([df1, df2], axis=1)

Append columns of DF (Reshaping Data)

pd.concat([df1, df2])

Append rows of DF (Reshaping Data)

df.describe()

Basic descriptive stat of each column (Summarize data)

Reshaping Data

Change the layout of a data set

df.assign(Area=lambda df: df.Length*df.Height)

Compute and append one or more new columns (Make new columns)

count()

Count non-NA/null values fo each object (Summarize data - the result si return as a Pandas Series)

df['w'].value_counts() (dropna=False - to count missing valus)

Count number of rows with each unique value of variable (Summarize data)

df.drop(columns=["leangth", "Hight"]

Draop columns from DF (Reshaping Data)

df.dropna()

Drop rows with any column having having NA/null data (Handling Missing Data)

df[df['Lenght']>7]

Extract row that meet logical criteria (Subset observation - Rows)

pd.melt(df)

Gather columns into rows (Reshaping Data)

df.column.isin(values)

Grpoup membership (Logic)

pd.merge(adf, bdf, how='outer', on='x1')

Join data. Retain all values, all rows. (Combine Datasets Standard joins)

pd.merge(adf, bdf, how='inner', on='col1')

Join data. retain only rows in both sets (Combine Datasets Standard joins)

max()

Maximum value of each object (Summarize data - the result si return as a Pandas Series)

median()

Median value of each object (Summarize data - the result si return as a Pandas Series)

df = (pd.melt(df).rename().rolling().mean())

Method Chaining (improve code readability)

df.sort_values('mpg', ascending=False)

Order rows by values of column (high to low) (Reshaping Data)

df.sort_values('mpg')

Order rows by values of column (low to high) (Reshaping Data)

df.pct_change()

Persent change method over the df or series (a great way to calculate growth)

quantile()

Quantile of each object (Summarize data - the result si return as a Pandas Series)

df.sample(frac=0.5)

Randomly select fraction of rows (Subset observation - Rows)

df.sample(n=10)

Randomly select n rows (Subset observation - Rows)

df.drop_duplicates()

Remove duplicate rows only considers columns (Subset observation - Rows)

df.fillna(value)

Replace all NA/null data with value (Handling Missing Data)

df.reset_index()

Reset index of DF to row (Reshaping Data)

df.groupby(by='col')

Return a groupby opject grouped by vlues in columen (Group Data)

pd.merge(ydf, zdf)

Rows that appear in both ydf and zdf (Combine Datasets Set-like operations)

pd.merge(ydf, zdf, how='outer')

Rows that in either or both ydf and zdf Union (Combine Datasets Set-like operations)

df.nsmallest(n, 'value')

Select and order bottom n entries (Subset observation - Rows)

df.nlargest(n, 'value')

Select and order top n entries (Subset observation - Rows)

df.filter(regex='regex')

Select columns whose name matches regex (Subset Variables - Columns)

df[['lenght', 'width']]

Select multiple columns with specific names (Subset Variables - Columns)

df.iloc[10:20]

Select rows by position (Subset observation - Rows)

df.loc[df['a']>10, ['a', 'c']

Select rows meeting logical conditions and only specific columns (Subset Variables - Columns)

df['width']

Select single column with specific name (Subset Variables - Columns)

size()

Size of each group (Group Data)

df.loc[:, 'x2':'x4']

Slect all columns between x2 and x4 inclusive (Subset Variables - Columns)

df.iloc[:,[1,2,5]]

Slect all columns in a position starting from 0 (Subset Variables - Columns)

df.sort_index()

Sort the index of DF (Reshaping Data)

df.pivot(columns='var', values='val')

Spread rows into columns (Reshaping Data)

std()

Standard deviation of each object (Summarize data - the result si return as a Pandas Series

sum()

Sum of each object (Summarize data - the result si return as a Pandas Series)

Each variable in it's own column Each observation in it's own row

Tidy data foundation

Vectorized functions

When performance is paramount, you should avoid using .apply() and .map() because those constructs perform Python for-loops over the data stored in a pandas Series or DataFrame. By using vectorized functions instead, you can loop over the data at the same speed as compiled code (C, Fortran, etc.)

.map(dict) election['color'] = election.winner.map(red_vs_blue)

a great way to map categorical data as Obama: Blue, Romney: red as additional categorical column to dataframe.

df1.add(df2, fill_value=0)

adding df + df to avoind null values

agg(function)

aggregate group using function (Group Data)

&, |, ~,df.any(), df.all()

and, or, not, xor, any, all (Logic)

apply(function)

apply function to each object (Summarize data - he result si return as a Pandas Series)

.values (np_vals = df.values)

attribute that helps to store df column in np array type

df.filter(["Name", "College", "Salary"])

choose specific columns of df

df.astype()

convert columns of data frame

dict(list(zip(list_keys, list_values)))

convert list of tuple into dictionary

df.to_numeric()

convert non-numeric types (e.g. strings) to a suitable numeric type

list(zip(list_keys, list_values))

cteate list of tuples

pawer of plotly (interactivity)

cufflinks

df.divide(series) (if we use '/' then number of coumns don't match)

df/pd.Series with fine-grain control

Tidy Data (complements vectorized operations)

foundation of wrangling in pandas

.resample('A' or 'W')

generates a unique sampling distribution on the basis of the actual data. We can apply various frequency to resample our time series data.

pd.notnull(obj)

in not NaN (Logic)

pd.isnull(obj)

is NaN (Logic)

pd.merge(adf, bdf, how='right', on='col1')

join matchin rows from adf to bdf (Combine Datasets Standard joins)

pd.merge(adf, bdf, how='left', on='col1')

join matching rows from bdf to adf (Combine Datasets Standard joins)

mean()

mean value of each object (Summarize data - the result si return as a Pandas Series)

ffill()

method to replace the null values with the last preceding non-null value

min()

min value of each object (Summarize data - the result si return as a Pandas Series)

!=

not equal (Logic)

best explorytary analysis of df

pandas_profiling.ProfileReport

df.sort_index() df.sort_values()

rearrange the sequence of the rows of a DataFrame by sorting

df.groupby(level='ind')

return a groupby oblect grouped by values in index level named 'ind' (Group Data)

zscore (vectorized function from scipy.stats)

the number of standard deviations by which an observation is above the mean - so if it is negative, it means the observation is below the mean

var()

variance data of each object (Summarize data - the result si return as a Pandas Series


Conjuntos de estudio relacionados

Chapter 9 Muscles and Muscle Tissue

View Set

Sadlier-Oxford Vocab Level H - Unit 14

View Set

Chapter 3: Cilia and Microvilli 3.8

View Set

NCLEX liver / cancer / ETOH / Substance abuse from my books/lecture

View Set