Week 2: Data visualization (Pandas, Matplotlib, seaborn) (mlcourse.ai)

Importing pyplot from Matplotlib

import matplotlib.pyplot as plt

Code for creating a correlation matrix for quantitative vs quantitative multivariate visualization.

# Drop non-numerical variables
numerical = list(
    set(df.columns)
    - set(
        [
            "State",
            "International plan",
            "Voice mail plan",
            "Area code",
            "Churn",
            "Customer service calls",
        ]
    )
)
# Calculate and plot
corr_matrix = df[numerical].corr()
sns.heatmap(corr_matrix);

First we collect the columns we want. The list numerical holds the numerical columns, obtained by subtracting the set of non-numerical columns from the set of all columns. Then we build a correlation matrix by calling the corr() method on those columns, and pass it to seaborn's heatmap() function to draw the visualization. Afterwards, remove any dependent variables (columns that were created by applying operations to other columns). They are essentially proportional to the column they were computed from, so they give no new information: we already know they correlate with the values they were calculated from.

Code for a scatter plot between the total day minutes and the total night minutes

plt.scatter(df["Total day minutes"], df["Total night minutes"]);

How to interpret a box plot?

The box itself illustrates the interquartile spread of the distribution; its length is determined by the 25th (Q1) and 75th (Q3) percentiles. The line inside the box marks the median (50th percentile). The whiskers extending from the box represent the scatter of data points that, by default, fall within the interval (Q1 - 1.5*IQR, Q3 + 1.5*IQR), where IQR = Q3 - Q1 is the interquartile range. The individual points beyond the whiskers are the outliers.
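
To make the parts of a box plot concrete, here is a minimal sketch that computes them by hand with NumPy. The sample values are made up for illustration, and the 1.5 * IQR whisker rule is assumed to be the one in use (it is the usual default in Matplotlib and seaborn).

```python
import numpy as np

# Hypothetical sample, invented for illustration only.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # length of the box

# Points outside these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers end at the most extreme data points still inside the fences.
whisker_low, whisker_high = inside.min(), inside.max()

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

In this sample the value 30 lies above the upper fence, so it would be plotted as a single outlier point while the upper whisker stops at 9.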

Difference between the box plot and the violin plot?

The box plot illustrates certain statistics of the individual examples in a dataset, while the violin plot concentrates more on the smoothed distribution as a whole.

Code for creating box plots to visualize the distribution statistics of the numerical variables in two disjoint groups (churn = False and churn = True)

# Sometimes you can analyze an ordinal variable just as a numerical one
numerical.append("Customer service calls")
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(10, 7))
for idx, feat in enumerate(numerical):
    ax = axes[int(idx / 4), idx % 4]
    sns.boxplot(x="Churn", y=feat, data=df, ax=ax)
    ax.set_xlabel("")
    ax.set_ylabel(feat)
fig.tight_layout();
# numerical is the list of all the numerical attributes we have.

Matrix of plots that shows a scatter plot for each pair of variables; on the diagonal, where the x and y variables are the same, it shows that variable's distribution.

# `pairplot()` may become very slow with the SVG format
%config InlineBackend.figure_format = 'png'
sns.pairplot(df[numerical]);

A plot which is the graphical representation of the frequency table.

Bar plot

This plot displays the values of two numerical variables as Cartesian coordinates in 2D space. These plots are also possible in 3D.

Scatter plots

What are categorical and binary features?

Categorical features take on a fixed number of values. Binary variables are a special case of categorical variables in which the number of possible values is exactly 2 (True/False, Yes/No, Male/Female). If the values of a categorical variable are ordered, it is called ordinal.
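
The distinction between plain categorical, ordinal, and binary features can be sketched with pandas. The answers below are invented for illustration, and the names answers, levels, and is_high are hypothetical.

```python
import pandas as pd

# Hypothetical survey answers, invented for illustration only.
answers = pd.Series(["low", "high", "medium", "low", "high"])

# Plain categorical: a fixed set of values with no order.
nominal = answers.astype("category")

# Ordinal categorical: the same values with an explicit order.
levels = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
ordinal = answers.astype(levels)

# Binary feature: a categorical variable with exactly two possible values.
is_high = answers == "high"

print(nominal.cat.ordered)   # False
print(ordinal.cat.ordered)   # True
print(ordinal.max())         # high
print(is_high.nunique())     # 2
```

Only the ordinal version supports comparisons such as ordinal > "low", because the order of its values is defined.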

categorical vs categorical

Comparing categorical features together (these features can also be ordinal or numerical)

Considered a smoothed version of the histogram

Density plots or more formally Kernel Density Plots

Differences between histogram and bar plot

Histograms are best suited to the distribution of numerical variables, while bar plots are used for categorical features. The values on the x-axis of a histogram are numerical, while for a bar plot they can be of any type (char, bool, str, num, etc.). The x-axis of a histogram is a Cartesian coordinate axis along which the values cannot be reordered, whereas the bars of a bar plot have no predefined order. For ordinal variables, the bars are often ordered by variable value (for example, the number of customer service calls goes from 0 to the maximum).

Easiest way to take a look at the distribution of a numerical variable

Histograms

A way to see relationships between two or more different variables all in one figure.

Multivariate plots (the specific type of visualization will depend on the types of the variable being analyzed).

Features that take on ordered numerical values. They can be discrete (integers), or continuous (real numbers) and usually express a count or a measurement.

Quantitative features

What can histograms help with?

The shape of the histogram may contain clues about the underlying distribution type (Gaussian, exponential, etc). You can also spot skewness in its shape when the shape is near regular but has some anomalies.
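
As a rough illustration of reading skewness from a distribution, the sketch below compares a symmetric and a right-skewed sample. Both samples are synthetic (drawn with NumPy, not taken from the churn data), and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples, used only for illustration.
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))

# Sample skewness: close to 0 for a symmetric shape, positive for a long right tail.
print(round(symmetric.skew(), 2))
print(round(right_skewed.skew(), 2))

# Bin counts are the numbers a histogram would draw as bar heights.
counts, edges = np.histogram(right_skewed, bins=10)
print(counts)
```

For the exponential sample the first bin holds the most values and the counts fall off to the right, which is the long right tail you would see in the plotted histogram.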

What is the advantage of density plots over histograms?

They do not depend on the size of the bins.
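
A small sketch of why this matters: the same data gives different histogram shapes for different bin counts, while a kernel density estimate has no bins at all. The sample is synthetic, kde_gaussian is a hypothetical helper written for this note (not a library function), and the bandwidth rule is one common assumption. A density plot does still depend on that bandwidth choice.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic data

# The same data summarized with two different bin counts.
counts_5, _ = np.histogram(sample, bins=5)
counts_20, _ = np.histogram(sample, bins=20)
print(counts_5)
print(counts_20)


def kde_gaussian(data, grid, bandwidth):
    """Hypothetical helper: average of Gaussian bumps centered on each data point."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth


# Normal-reference rule of thumb for the bandwidth (an assumption, not the only choice).
bandwidth = 1.06 * sample.std(ddof=1) * len(sample) ** (-1 / 5)
grid = np.linspace(-4, 4, 201)
density = kde_gaussian(sample, grid, bandwidth)

# A density curve integrates to roughly 1 over the grid.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```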

Looking at one feature at a time and analyzing the distribution of its values while ignoring other features in the dataset.

Univariate Visualization

How to analyze a quantitative variable in two categorical dimensions at once. For example the interaction between the total day minutes (quantitative) and two categorical variables (churn and customer service calls) on the same plot.

Use seaborn function catplot().

Quantitative vs Quantitative multivariate visualization

Using a correlation matrix we can look at correlations among the numerical variables in the dataset. This is important to know because some machine learning algorithms (for example, linear and logistic regression) do not handle highly correlated input variables well.

Plot similar to the box plot but it shows a smoothed distribution.

Violin plot

Quantitative vs. categorical plot: a scatter plot of two numerical variables (x and y axes) in which a third, categorical variable sets the color of each dot, one color per category. This shows how the two numerical variables relate to the categorical one.

We can use scatter() or lmplot() to do this. For lmplot() we give the parameter hue to indicate the categorical feature of interest.
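
One way to do this with plain Matplotlib is sketched below: call scatter() once per category so that each group gets its own color and legend entry. The data is a synthetic stand-in with random values (only the column names mirror the churn dataset), and the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic stand-in for the real data; values are random.
toy = pd.DataFrame(
    {
        "Total day minutes": rng.normal(180, 50, size=100),
        "Total night minutes": rng.normal(200, 50, size=100),
        "Churn": rng.random(100) < 0.2,
    }
)

fig, ax = plt.subplots(figsize=(6, 4))
# One scatter call per category, so each group gets its own color and legend entry.
for churn_value, group in toy.groupby("Churn"):
    ax.scatter(
        group["Total day minutes"],
        group["Total night minutes"],
        label=f"Churn = {churn_value}",
        alpha=0.7,
    )
ax.set_xlabel("Total day minutes")
ax.set_ylabel("Total night minutes")
ax.legend()
```

With seaborn, passing hue="Churn" to lmplot() achieves the same grouping in a single call.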

How to create a correlation matrix.

First isolate the numerical features so that corr() is applied only to them. Call the corr() method on that DataFrame to calculate the correlation between each pair of features, then pass the resulting correlation matrix to seaborn's heatmap() function, which renders a color-coded matrix of the provided values.
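
The sketch below runs these steps on a small synthetic table and shows why a derived column is redundant. The column names and the 0.17 rate are hypothetical, and select_dtypes() is used here as an alternative way to isolate numerical columns; it is not necessarily how the course does it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-in for the real data; column names and the rate are hypothetical.
toy = pd.DataFrame(
    {
        "minutes": rng.uniform(0, 300, size=n),
        "calls": rng.integers(50, 150, size=n),
        "plan": rng.choice(["Yes", "No"], size=n),
    }
)
toy["charge"] = 0.17 * toy["minutes"]  # derived column: a fixed rate times minutes

# Keep only numerical columns before calling corr().
numerical = toy.select_dtypes(include="number")
corr_matrix = numerical.corr()
print(corr_matrix.round(2))

# The derived column is perfectly correlated with its source, so it adds no information.
print(round(corr_matrix.loc["minutes", "charge"], 6))
```

The resulting corr_matrix is what would be passed to seaborn's heatmap() for the color-coded view.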

Box plot and violin plot for total day minutes, each chart split into two groups: churn = True and churn = False

_, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
sns.boxplot(x="Churn", y="Total day minutes", data=df, ax=axes[0])
sns.violinplot(x="Churn", y="Total day minutes", data=df, ax=axes[1]);
# From this chart we see that customers who discontinue their contracts are more active users of communication services.
# Perhaps reducing call rates is good (the company needs to do some sort of analysis to determine if this would be beneficial).

Create bar plot for the Churn, and one for the Customer Service calls.

_, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
sns.countplot(x="Churn", data=df, ax=axes[0])
sns.countplot(x="Customer service calls", data=df, ax=axes[1]);

Plot with a box, "whiskers", and individual points (outliers).

box plot

Create a frequency table that shows the count of each categorical value in the churn column. By default the entries are sorted from most to least frequent.

df["Churn"].value_counts()
# This shows the count of each value (True and False), for example:
# False    2850
# True      483
# Name: Churn, dtype: int64
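
A self-contained sketch of the same idea on a tiny, invented Series, which also shows relative frequencies through the normalize parameter of value_counts():

```python
import pandas as pd

# Hypothetical churn flags, invented for illustration only.
churn = pd.Series([False, False, True, False, True, False], name="Churn")

counts = churn.value_counts()                # absolute frequencies, most frequent first
shares = churn.value_counts(normalize=True)  # relative frequencies that sum to 1

print(counts)
print(shares)
```

Relative frequencies are often easier to read than raw counts when the classes are imbalanced, as they are in the churn column.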

Create density plots for the columns of a DataFrame

features = ["Total day minutes", "Total intl calls"]
df[features].plot(
    kind="density", subplots=True, layout=(1, 2), sharex=False, figsize=(10, 4)
);

DataFrame method that describes the data. It gives information such as the count, mean, std, min, percentiles, and max.

features = ["Total day minutes", "Total intl calls"]
df[features].describe()
# describe() can be used on multiple columns, as seen here

Code for the dataframe's histogram method in pandas

features = ["Total day minutes", "Total intl calls"]
df[features].hist(figsize=(10, 4));
# This plots a histogram for each feature listed in the features list

Importing NumPy and seaborn

import numpy as np
import seaborn as sns

Code for plot showing the interaction between the total day minutes (quantitative) and two categorical variables (churn and customer service calls) on the same plot.

sns.catplot(
    x="Churn",
    y="Total day minutes",
    col="Customer service calls",
    data=df[df["Customer service calls"] < 8],
    kind="box",
    col_wrap=4,
    height=3,
    aspect=0.8,
);
# The Jupyter notebook shows 8 plots in total, because col is set to "Customer service calls" and the data is filtered to fewer than 8 calls (values 0 to 7).
# Each plot has churn (True/False) on the x-axis and total day minutes on the y-axis, and corresponds to one number of customer service calls.
# An interesting pattern: as the number of customer service calls increases, fewer total day minutes are spent on the phone. More customer service calls may mean that customers are dissatisfied with the service and therefore spend less time on the phone.
# Starting with 4 customer service calls, total day minutes may no longer be the main factor for customer churn.

Use the count plot again, but add a hue so that the counts of the first feature are split by the values of a second categorical feature. Use customer service calls as the x-axis (the y-axis is the count by default) and churn as the hue. This gives the count of each value of customer service calls (ex: 52 made 0 calls, 44 made 1 call, etc.), split by churn: for each value of customer service calls there are two bars, one for churn = True and one for churn = False.

sns.countplot(x="Customer service calls", hue="Churn", data=df);
# We observe from the chart that the churn rate increases significantly after 4 or more calls to customer service.
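
The numbers behind such a grouped count plot can also be tabulated directly. The sketch below uses pandas crosstab() on a tiny, invented dataset; the values are hypothetical and only the column names mirror the churn data.

```python
import pandas as pd

# Hypothetical miniature dataset, invented for illustration only.
toy = pd.DataFrame(
    {
        "Customer service calls": [0, 1, 1, 4, 4, 4, 5, 0],
        "Churn": [False, False, False, True, True, False, True, False],
    }
)

# Rows: number of calls; columns: churn value; cells: number of customers.
table = pd.crosstab(toy["Customer service calls"], toy["Churn"])
print(table)

# Share of churned customers for each number of calls.
churn_rate = toy.groupby("Customer service calls")["Churn"].mean()
print(churn_rate)
```

Each row of the table corresponds to one pair of bars in the count plot, and churn_rate turns those pairs into a single proportion per number of calls.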

Plot a histogram with a kernel density estimate (KDE, a smoothed estimate of the distribution) on top using seaborn.

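sns.histplot(df["Total intl calls"], kde=True, stat="density");
# The card originally used sns.distplot(), which is deprecated and scheduled for removal from seaborn; histplot() with kde=True is its replacement (seaborn 0.11 and later).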

Fancier scatter plot which shows a smoothed out version of our bivariate distribution and smoothed out lines (histogram turned into line) on the side. It is basically a bivariate version of the kernel density plot discussed earlier.

# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.jointplot(
    x="Total day minutes", y="Total night minutes", data=df, kind="kde", color="g"
);

Fancier scatter plot that also shows the distribution of each axis variable as a histogram on the side.

sns.jointplot(x="Total day minutes", y="Total night minutes", data=df, kind="scatter");

Code for a plot which compares total day minutes and total night minutes (x and y axis) while having the churn as a categorical value and the hue is changed by the churn.

# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.lmplot(
    x="Total day minutes", y="Total night minutes", data=df, hue="Churn", fit_reg=False
);
# The hue is set to churn so that the color of the dots reflects the churn values.

Plot a violin plot using seaborn

sns.violinplot(data=df["Total intl calls"]);

