Descriptive Analysis
GIGO
Garbage in. Garbage out. It's always worth spending time at the beginning of a project to determine whether or not the data you have are garbage. Be certain they are actually able to help you answer the question you're interested in.
random distribution
distribution has no apparent pattern
Shape: uniform distribution
The distribution of values is constant over the range of the variable: the probability of getting any value is the same. Example: the distribution you would get if you rolled a die 1M times.
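The die-rolling example can be simulated to see the flat shape. This is a minimal sketch using only the standard library; the seed and the number of rolls (100,000 rather than 1M, to keep it quick) are assumptions chosen for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Roll a fair six-sided die many times
n_rolls = 100_000
rolls = [random.randint(1, 6) for _ in range(n_rolls)]

# Count how often each face came up and convert the counts to proportions
counts = Counter(rolls)
proportions = {face: counts[face] / n_rolls for face in range(1, 7)}

# Each face should land close to 1/6 of the time
for face, share in proportions.items():
    print(face, round(share, 3))
```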
bimodal distribution
distribution with two distinct peaks
bimodal example:
heights of a family with two parents and 10 children
distribution of heights of a family with two parents and 10 children
Bimodal distribution: heights tend to be similar, clustering around two distinct peaks.
variability tells you:
How spread out the values are. The central tendency tells you part of the story; the variability in the values of your observations helps fill in the rest. Measures of variability describe how close the values in the distribution are to the middle of the distribution. Ex: the variance, which is the average squared difference of the scores from the mean.
sns.set():
Sets plot aesthetics, for example increasing the font size on plots so that it's large enough to view on slides when projected. Multiple settings can be specified within that function.
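A sketch of passing several settings in one `sns.set()` call. The specific values (`font_scale=2`, the `white` style, a 10x5 figure size) are assumptions picked for illustration, not settings taken from these notes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Several settings in one call: larger fonts, a plain background,
# and a default figure size passed through `rc`
sns.set(font_scale=2, style="white", rc={"figure.figsize": (10, 5)})

# The settings end up in matplotlib's rcParams
print(plt.rcParams["font.size"], plt.rcParams["figure.figsize"])
```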
central tendency:
knowing the mean, median, and/or mode can help you get an idea of what a typical value is for your variable of interest
seaborn is great for.....
making lots of different plots
What is the better measure of central tendency when you have outliers?
when you have outliers, use median
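A small sketch of why the median holds up better: one extreme value drags the mean but barely moves the median. The fare values are made up for illustration.

```python
from statistics import mean, median

fares = [7, 8, 8, 9, 10]             # hypothetical fares, no outlier
fares_with_outlier = fares + [500]   # same fares plus one extreme value

# Without the outlier, mean and median are close together
print(mean(fares), median(fares))

# With the outlier, the mean jumps while the median hardly changes
print(mean(fares_with_outlier), median(fares_with_outlier))
```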
Can you generate a boxplot for fare broken down by what class they were on the Titanic?
# Generate fare boxplot broken down by group
sns.boxplot(x='fare', y='class', data=ti);
Can you generate a barplot for embark_town by deck? (Hint: be sure to consider if this is a countplot or a barplot.)
# Specify second variable with `hue`
sns.countplot(x='embark_town', hue='deck', data=ti);
# if your legend was covering up your plot
# plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);
Outlier determination
# assumes `lower` and `upper` hold the 25th and 75th percentiles
# and that iqr = upper - lower
# calculate upper cutoff
# values above this are outliers
upper_cutoff = upper + 1.5 * iqr
# calculate lower cutoff
# values below this are outliers
lower_cutoff = lower - 1.5 * iqr
upper_cutoff, lower_cutoff
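The cutoff calculation can be run end to end. This sketch uses only the standard library and assumes that `lower` and `upper` are the 25th and 75th percentiles; the data are made up, and `statistics.quantiles` stands in for whatever percentile function the original notebook used.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical data; 50 looks unusually large

# Quartiles: method='inclusive' interpolates linearly between observations
lower, _, upper = quantiles(values, n=4, method='inclusive')
iqr = upper - lower

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
upper_cutoff = upper + 1.5 * iqr
lower_cutoff = lower - 1.5 * iqr
outliers = [v for v in values if v < lower_cutoff or v > upper_cutoff]

print(lower_cutoff, upper_cutoff, outliers)
```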
Can you generate a histogram for the fares paid by passengers on the titanic?
# histogram for fares
sns.distplot(ti['fare'], kde=False, bins=30, color='dimgrey');
A good data visualization can help you:
- identify anomalies in your data
- better understand your own data
- communicate your findings
Calculate the sample variance:
1. calculate sample mean: x̄ = 3.5
2. numerator: (1-3.5)^2 + (2-3.5)^2 + (3-3.5)^2 + .... = 17.5
3. denominator: (n-1) = 6-1 = 5
s^2 = 17.5/5 = 3.5
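The steps above can be checked in a few lines. The scores 1 through 6 are an assumption: the card elides some terms, but n = 6, a mean of 3.5, and a numerator of 17.5 are all consistent with those values.

```python
from statistics import mean, variance

scores = [1, 2, 3, 4, 5, 6]  # assumed data, consistent with n = 6 and mean 3.5

# Step 1: sample mean
x_bar = mean(scores)

# Step 2: numerator, the sum of squared differences from the mean
numerator = sum((x - x_bar) ** 2 for x in scores)

# Step 3: denominator, n - 1 for a sample
denominator = len(scores) - 1

s_squared = numerator / denominator
print(x_bar, numerator, denominator, s_squared)

# statistics.variance computes the same sample variance directly
print(variance(scores))
```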
In seaborn there are two types of bar charts:
1. countplot - counts the number of times each category appears in a column
2. barplot - groups the dataframe by a categorical column and sets the height of the bars according to the average of a numerical column within each group
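To make the difference concrete, here is a sketch of the numbers each plot would draw, computed with pandas rather than plotted. The tiny table is made up and only stands in for the Titanic data.

```python
import pandas as pd

# Small made-up table standing in for the Titanic data
df = pd.DataFrame({
    'alive': ['yes', 'no', 'no', 'yes', 'no'],
    'age':   [20, 40, 60, 30, 50],
})

# What a countplot draws: how many rows fall in each category
counts = df['alive'].value_counts()

# What a barplot draws by default: the mean of the numeric column per category
mean_age = df.groupby('alive')['age'].mean()

print(counts)
print(mean_age)
```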
outlier equation:
1.5*IQR: values more than 1.5*IQR above the 75th percentile or below the 25th percentile are outliers
In matplotlib, what steps does making a plot involve? Give an example.
1. Create the figure
2. Add axes to the figure
import matplotlib.pyplot as plt
# Create a figure
f = plt.figure(figsize=(10, 5))
# Add an axes to the figure.
# The first and second arguments create a 1x1 grid.
# The third argument places the axes in the first
# cell of that grid.
ax = f.add_subplot(1, 1, 1)
# Create a line plot on the axes
ax.plot([0, 1, 2, 3], [1, 3, 4, 3])
# Show the plot.
# In a Jupyter notebook you don't need this;
# the plot will show up on its own.
plt.show()
mean and median are used to summarize the central tendency for___________________
Quantitative variables
How do we describe a dataset? What measurements do we use? (size, missingness, shape, central tendency, variability)
Size (number of rows and columns), missingness (where data are missing), shape (it's critical to know the distribution of the variables in your dataset b/c certain statistical approaches can only be used with certain distributions), central tendency, variability
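A sketch of how the first few of these measurements might be pulled from a pandas DataFrame. The four-row table and its values are made up; with the Titanic data the same calls would run on `ti`.

```python
import pandas as pd

# Made-up table with a few missing values
df = pd.DataFrame({
    'age':   [22.0, 38.0, None, 35.0],
    'fare':  [7.25, 71.28, 7.92, 53.10],
    'class': ['Third', 'First', 'Third', None],
})

# Size: number of rows and columns
n_rows, n_cols = df.shape

# Missingness: count of missing values in each column
missing_per_column = df.isnull().sum()

# Central tendency and variability for the numeric columns
summary = df.describe()

print(n_rows, n_cols)
print(missing_per_column)
print(summary)
```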
What is a descriptive analysis?
The goal of descriptive analysis is to understand the components of a data set, describe what they are, and explain that description to others who might want to understand the data
Outliers can occur due to....
Data entry errors, poor sampling procedures, technical or mechanical error, unexpected changes in weather. If you remove outliers, indicate that you removed them.
Scatterplots (by a categorical variable)
When you want to plot two numeric variables but want to get some insight about a third categorical variable, you can color the points on the plot by the categorical variable.
# color points by who the passenger is
# fit_reg=False removes the regression line
sns.lmplot(x='age', y='fare', hue='who', data=ti, fit_reg=False, height=6, aspect=2);
Statistic
a quantity computed from a sample
When you have one categorical and one quantitative variable, you'll use ......
barplot
distribution of speed limits in the US
bimodal
Bar Chart
categorical data.
mode is most helpful in describing the central tendency for ________
categorical variables. The mode is the most frequently observed category, not the number of times that category appears
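A short sketch of the distinction between the mode and its count, using only the standard library. The deck letters are made-up values.

```python
from collections import Counter

decks = ['C', 'B', 'C', 'E', 'C', 'B']  # hypothetical categorical values

# Tally each category, then take the most frequent one
counts = Counter(decks)
mode_category, mode_count = counts.most_common(1)[0]

# The mode is the category itself ('C'), not how often it appears (3)
print(mode_category, mode_count)
```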
what does 'sns.set_style()' do?
change the style of your plots
uniformly distributed example
daily chance of winning the lottery
Shape: a bell-shaped distribution
most values are near the central value. data are symmetric around middle
skewed distribution
most values fall to one extreme within the range
Distribution of heights of a random sample of adults in the US (for example, heights of females)
normal distribution
different types of variability:
Range (highest score - lowest score); interquartile range (75th percentile - 25th percentile, i.e., the upper edge of the box minus the lower edge of the box in a boxplot)
types of variability:
range, interquartile range, variance, SD
What does a rugplot do? What is the function?
Rug plots make a short vertical line at the bottom of the plot for each observation. Here, each line represents a different passenger's age.
# add rugplot
sns.distplot(ti['age'], rug=True);
Boxplots
A single quantitative variable broken down by a categorical variable. Boxplots display where most of the data lie for a given set of numbers. They can show a single numeric variable on its own, but they really shine when you want to look at the range of typical values for a quantitative variable, broken down by a separate categorical variable. By default, the box delineates the 25th and 75th percentiles. The line down the middle represents the median. "Whiskers" extend to show the range for the rest of the data, excluding outliers. Outliers are marked as individual points outside of the whiskers.
Histograms:
A single quantitative variable. Histograms are helpful for visualizing information about a single quantitative variable because they show the whole distribution at once.
distribution of number of siblings you all have:
skewed to the right
# categorical alive/not alive passengers # compute and plot the average age within `alive`.
sns.barplot(x='alive', y='age', data=ti);
# Count # of passengers survived & # who didn't # draws bars with corresponding values plotted
sns.countplot(x='alive', data=ti);
how to generate histogram
sns.distplot(ti['age']);
how do you adjust the bin size of a histogram?
sns.distplot(ti['age'], kde=False, bins=30)
# generate scatterplot
# with the regression line
sns.lmplot(x='age', y='fare', data=ti, height=6, aspect=2);
# or, without the regression line
sns.lmplot(x='age', y='fare', data=ti, fit_reg=False, height=6, aspect=2);
# generate scatterplot
sns.lmplot(x='age', y='fare', hue='who', data=ti, fit_reg=False, height=6, aspect=2)
# customize
plt.title('Fare not Determined by Age')
plt.xlabel('Age of Passenger')
plt.ylabel('Fare in USD')
plt.rcParams:
Specifies the default size for plots
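A minimal sketch of setting the default plot size through `plt.rcParams`. The 10x5 size is an assumed value for illustration; the `Agg` backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Set the default figure size (width, height in inches) for all later plots
plt.rcParams["figure.figsize"] = (10, 5)

# A figure created without an explicit figsize now picks up the default
fig = plt.figure()
width, height = (float(v) for v in fig.get_size_inches())
print(width, height)
```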
SD:
Square root of the variance.
Calculating SD:
s^2 = 3.5
s = sqrt(s^2)
s = 1.87
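The same arithmetic in code. The raw scores 1 through 6 are an assumption (they reproduce a sample variance of 3.5); `statistics.stdev` is used only as a cross-check.

```python
import math
from statistics import stdev

s_squared = 3.5              # sample variance from the card
s = math.sqrt(s_squared)     # standard deviation is its square root
print(round(s, 2))

# Cross-check from the assumed raw scores 1 through 6
s_from_data = stdev([1, 2, 3, 4, 5, 6])
print(round(s_from_data, 2))
```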
Central tendency doesn't _________
tell the whole story
Statistics
the science that deals with the collection, classification, analysis, and interpretation of numerical facts or data, and that, by use of mathematical theories of probability, imposes order and regularity on aggregates of more or less disparate elements
# Load the dataset and drop N/A values # to make plot function calls simpler
ti = sns.load_dataset('titanic').dropna().reset_index(drop=True)
# We'll take a look at the first few rows
ti.head()
## describe quantitative variables
ti.describe()
Scatter Plots
Two quantitative variables. So far we've looked at a single quantitative variable (histogram) and at a quantitative and a categorical variable (boxplot); a scatter plot shows the relationship between two quantitative variables.
Distribution of daily chance of winning the lottery
uniform distribution
What are the best practices for sampling from a population?
● Always think about what your population is
● Collect data from a sample that is representative of your population
  - If you have no choice but to work with a dataset that is not collected randomly and is biased, be careful not to generalize your results to the entire population
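A rough sketch of why random collection matters, using only the standard library. The population of 1,000 numbered IDs, the sample size of 50, and the seed are all assumptions for illustration; "taking the first 50" stands in for any non-random, convenience-style sample.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

population = list(range(1, 1001))  # hypothetical population of 1,000 values

# Simple random sample: every member has the same chance of being picked
random_sample = random.sample(population, k=50)

# Convenience sample: just the first 50 members, which is not representative
convenience_sample = population[:50]

# The random sample's mean tends to land near the population mean;
# the convenience sample's mean is far off
print(mean(population), mean(random_sample), mean(convenience_sample))
```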