Descriptive Analysis

GIGO

Garbage in, garbage out. It's always worth spending time at the beginning of a project to determine whether or not the data you have are garbage. Be certain they can actually help you answer the question you're interested in.

random distribution

distribution has no apparent pattern

Shape: uniform distribution

The distribution of values is constant over the range of the variable; the probability of getting any value is the same. Example: the distribution you would get if you rolled a die 1M times.
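The die-rolling example can be checked with a quick simulation. This is only a sketch: the seed, the number of rolls (60,000 rather than 1M, to keep it fast), and the variable names are assumptions made for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# Simulate rolling a fair six-sided die many times
n_rolls = 60_000
rolls = [random.randint(1, 6) for _ in range(n_rolls)]

# Count how often each face came up
counts = Counter(rolls)

# With equally likely faces, each one should appear about n_rolls / 6 times
expected = n_rolls / 6
for face in range(1, 7):
    print(face, counts[face], f"{counts[face] / expected:.2f}x expected")
```

Each face should land close to 1.00x the expected count, which is the flat shape the card describes.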

bimodal distribution

distribution with two distinct peaks

bimodal example:

heights of a family with two parents and 10 children

distribution of heights of a family with two parents and 10 children

Bimodal distribution: heights tend to be similar within each group (the parents and the children), which gives two peaks.

variability tells you:

How spread out the values are. The central tendency tells you part of the story; the variability in the values of your observations helps fill in the rest. Measures of variability capture how close the values in the distribution are to the middle of the distribution. Ex: the variance, which is the average squared difference of the scores from the mean.

sns.set():

Sets plot-wide defaults, such as increasing the font size so that text is large enough to read on slides when projected. Multiple settings can be specified in a single call to the function.

central tendency:

knowing the mean, median, and/or mode can help you get an idea of what a typical value is for your variable of interest

seaborn is great for.....

making lots of different plots

What is the better measure of central tendency when you have outliers?

When you have outliers, use the median.
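A small sketch of why the median holds up better. The fares below are made up for illustration and are not taken from the Titanic dataset.

```python
from statistics import mean, median

# Hypothetical fares; the last value is an extreme outlier
fares = [10, 20, 30, 40, 50, 60, 70, 80, 500]

fares_mean = mean(fares)      # pulled upward by the single 500
fares_median = median(fares)  # middle value; unaffected by how large 500 is

print(fares_mean, fares_median)
```

Here the mean ends up larger than every value except the outlier itself, while the median stays at the middle of the typical fares.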

Can you generate a boxplot for fare broken down by what class they were on the Titanic?

# Generate fare boxplot broken down by group
sns.boxplot(x='fare', y='class', data=ti);

Can you generate a barplot for embark_town by deck? (Hint: be sure to consider if this is a countplot or a barplot.)

# Specify second variable with `hue`
sns.countplot(x='embark_town', hue='deck', data=ti);
# if your legend was covering up your plot
# plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);

Outlier determination

# assumes `lower` and `upper` are the 25th and 75th percentiles
# and `iqr` is upper - lower
# calculate upper cutoff
# values above this are outliers
upper_cutoff = upper + 1.5 * iqr
# calculate lower cutoff
# values below this are outliers
lower_cutoff = lower - 1.5 * iqr
upper_cutoff, lower_cutoff
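A self-contained sketch of the same cutoff rule, using only the standard library. It assumes that `lower` and `upper` are the 25th and 75th percentiles; the fares are hypothetical, and `statistics.quantiles` with `method="inclusive"` is one reasonable way to get the percentiles, not necessarily the one used in the original course.

```python
from statistics import quantiles

# Hypothetical fares; 500 is suspiciously large
fares = [10, 20, 30, 40, 50, 60, 70, 80, 500]

# 25th, 50th and 75th percentiles (method="inclusive" interpolates
# linearly between the sorted data points)
lower, _, upper = quantiles(fares, n=4, method="inclusive")
iqr = upper - lower

upper_cutoff = upper + 1.5 * iqr  # values above this are outliers
lower_cutoff = lower - 1.5 * iqr  # values below this are outliers

outliers = [x for x in fares if x < lower_cutoff or x > upper_cutoff]
print(lower_cutoff, upper_cutoff, outliers)
```

With these numbers only the fare of 500 falls outside the cutoffs.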

Can you generate a histogram for the fares paid by passengers on the titanic?

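# histogram for fares
# (distplot is deprecated in newer seaborn; sns.histplot(ti['fare'], bins=30, color='dimgrey') is the current equivalent)
sns.distplot(ti['fare'], kde=False, bins=30, color='dimgrey');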

A good data visualization can help you:

- identify anomalies in your data
- better understand your own data
- communicate your findings

Calculate the sample variance:

1. Calculate the sample mean: x̄ = 3.5
2. Numerator: (1-3.5)^2 + (2-3.5)^2 + (3-3.5)^2 + ... = 17.5
3. Denominator: (n-1) = 6-1 = 5
s^2 = 17.5/5 = 3.5
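The three steps can be reproduced in a few lines. The data are assumed to be the values 1 through 6, which is what the worked numbers imply (n = 6, mean 3.5, numerator 17.5); the card does not state the data explicitly.

```python
from statistics import variance

# Assumed data: 1 through 6 (inferred from the worked numbers on the card)
scores = [1, 2, 3, 4, 5, 6]

n = len(scores)
sample_mean = sum(scores) / n                            # step 1
numerator = sum((x - sample_mean) ** 2 for x in scores)  # step 2
denominator = n - 1                                      # step 3
sample_variance = numerator / denominator

print(sample_mean, numerator, denominator, sample_variance)

# Cross-check against the standard library's sample variance
print(variance(scores))
```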

In seaborn there are two types of bar charts:

1. countplot - counts the number of times each category appears in a column
2. barplot - groups the dataframe by a categorical column and sets the height of each bar to the average of a numerical column within that group
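To make the difference concrete, here is a sketch of the numbers each plot would draw, computed without seaborn. The rows are hypothetical stand-ins for a dataframe with an `alive` column and an `age` column.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical (alive, age) rows standing in for a dataframe
rows = [("yes", 30), ("no", 40), ("yes", 20), ("no", 50), ("no", 60)]

# What a countplot shows: how many times each category appears
counts = Counter(alive for alive, _ in rows)

# What a barplot shows by default: the average of a numeric column per category
ages_by_group = defaultdict(list)
for alive, age in rows:
    ages_by_group[alive].append(age)
mean_age = {group: mean(ages) for group, ages in ages_by_group.items()}

print(dict(counts))
print(mean_age)
```

The countplot bar heights come from `counts`; the barplot bar heights come from `mean_age`.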

outlier equation:

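1.5 * IQR (added to the 75th percentile for the upper cutoff, subtracted from the 25th percentile for the lower cutoff)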

In matplotlib, what steps does plotting involve? Give an example.

1. Create the figure
2. Add axes to the figure
import matplotlib.pyplot as plt
# Create a figure
f = plt.figure(figsize=(10, 5))
# Add an axes to the figure.
# The first and second arguments create a 1x1 table.
# The third argument places the axes in the first
# cell of the table.
ax = f.add_subplot(1, 1, 1)
# Create a line plot on the axes
ax.plot([0, 1, 2, 3], [1, 3, 4, 3])
# Show the plot.
# In a Jupyter notebook you don't need this;
# the plot shows up automatically.
plt.show()

mean and median are used to summarize the central tendency for___________________

Quantitative variables

How do we describe a dataset? What measurements do we use? (size, missingness, shape, central tendency, variability)

Size (number of rows and columns), missingness (where data are missing), shape (it's critical to know the distribution of the variables in your dataset because certain statistical approaches can only be used with certain distributions), central tendency, variability

What is a descriptive analysis?

The goal of descriptive analysis is to understand the components of a data set, describe what they are, and explain that description to others who might want to understand the data

Outliers can occur due to....

Data entry errors, poor sampling procedures, technical or mechanical errors, unexpected changes in weather. If you remove outliers, indicate that you did so.

Scatterplots (by a categorical variable)

When you want to plot two numeric variables but want to get some insight about a third categorical variable, you can color the points on the plot by the categorical variable.
# color points by who the passenger is
# fit_reg=False removes the regression line
sns.lmplot(x='age', y='fare', hue='who', data=ti, fit_reg=False, height=6, aspect=2);

Statistic

a quantity computed from a sample

When you have one categorical and one quantitative variable, you'll use ......

barplot

distribution of speed limits in the US

bimodal

Bar Chart

categorical data.

mode is most helpful in describing the central tendency for ________

categorical variables. The mode is the most frequently observed category, not the number of times that category appears
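A short sketch of the distinction between the mode and its count. The list of towns is hypothetical.

```python
from collections import Counter
from statistics import mode

# Hypothetical embarkation towns
towns = ["Southampton", "Cherbourg", "Southampton", "Queenstown", "Southampton"]

most_common_town = mode(towns)                 # the mode: the category itself
times_seen = Counter(towns)[most_common_town]  # how often it appears (not the mode)

print(most_common_town, times_seen)
```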

what does 'sns.set_style()' do?

change the style of your plots

uniformly distributed example

daily chance of winning the lottery

Shape: a bell-shaped distribution

Most values are near the central value. The data are symmetric around the middle.

skewed distribution

most values fall to one extreme within the range

Distribution of heights of a random sample of adults in the US (for example, the heights of females)

normal distribution

different types of variability:

range (highest score - lowest score); interquartile range (75th percentile - 25th percentile, i.e., upper edge of the box - lower edge of the box in a boxplot)
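Both measures can be computed directly. The sketch below uses two hypothetical score lists that differ only in their largest value, to show how each measure reacts; the helper names are made up for illustration.

```python
from statistics import quantiles

def value_range(values):
    """Highest score minus lowest score."""
    return max(values) - min(values)

def interquartile_range(values):
    """75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical scores; the second list has one extreme value
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
scores_with_extreme = [1, 2, 3, 4, 5, 6, 7, 8, 90]

print(value_range(scores), interquartile_range(scores))
print(value_range(scores_with_extreme), interquartile_range(scores_with_extreme))
```

The range jumps when one extreme value is added, while the interquartile range stays the same because it only looks at the middle half of the data.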

types of variability:

range, interquartile range, variance, SD

What does a rugplot do? What is the function?

Rug plots draw a short vertical line along the bottom of the plot for each observation; here, each line represents a different passenger's age.
# add rugplot
sns.distplot(ti['age'], rug=True);

Boxplots

A single quantitative variable broken down by a categorical variable. Boxplots display where most of the data lie for a given set of numbers, so they let you look at a single numeric variable. Boxplots really shine when you want to look at the range of typical values for a quantitative variable, broken down by a separate categorical variable. By default, the box delineates the 25th and 75th percentiles. The line down the middle represents the median. "Whiskers" extend to show the range for the rest of the data, excluding outliers. Outliers are marked as individual points outside of the whiskers.

Histograms:

A single quantitative variable. Histograms are helpful for visualizing information about a single quantitative variable; they show the whole distribution at once.

distribution of number of siblings you all have:

skewed to the right

# categorical alive/not alive passengers
# compute and plot the average age within `alive`.

sns.barplot(x='alive', y='age', data=ti);

# Count the number of passengers who survived & the number who didn't
# draws bars with corresponding values plotted

sns.countplot(x='alive', data=ti);

how to generate histogram

sns.distplot(ti['age']);

how do you adjust the bin size of a histogram?

sns.distplot(ti['age'], kde=False, bins=30)

# generate scatterplot

sns.lmplot(x='age', y='fare', data=ti, height=6, aspect=2);
# or, without the regression line:
sns.lmplot(x='age', y='fare', data=ti, fit_reg=False, height=6, aspect=2);

# generate scatterplot

sns.lmplot(x='age', y='fare', hue='who', data=ti, fit_reg=False, height=6, aspect=2)
# customize
plt.title('Fare not Determined by Age')
plt.xlabel('Age of Passenger')
plt.ylabel('Fare in USD')

plt.rcParams:

Specifies the default size for plots (for example, via plt.rcParams['figure.figsize'])

SD:

The square root of the variance. Calculating SD: s^2 = 3.5, s = sqrt(s^2) = sqrt(3.5) ≈ 1.87
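The arithmetic can be checked in a couple of lines. The variance of 3.5 comes from the card; the data list (1 through 6) is an assumption, chosen because it gives that variance.

```python
import math
from statistics import stdev

sample_variance = 3.5            # s^2 from the card
sd = math.sqrt(sample_variance)  # s = sqrt(s^2)
print(round(sd, 2))

# Assumed data: 1 through 6, which has a sample variance of 3.5
scores = [1, 2, 3, 4, 5, 6]
print(round(stdev(scores), 2))
```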

Central tendency doesn't _________

tell the whole story

Statistics

The science that deals with the collection, classification, analysis, and interpretation of numerical facts or data, and that, by use of mathematical theories of probability, imposes order and regularity on aggregates of more or less disparate elements.

# Load the dataset and drop N/A values # to make plot function calls simpler

import seaborn as sns
ti = sns.load_dataset('titanic').dropna().reset_index(drop=True)
# We'll take a look at the first few rows
ti.head()

## describe quantitative variables

ti.describe()

Scatter Plots

Two quantitative variables. So far we've looked at a single quantitative variable (histogram) and at a quantitative and a categorical variable (boxplot); scatter plots show the relationship between two quantitative variables.

Distribution of daily chance of winning the lottery

uniform distribution

What are the best practices for sampling from a population?

● Always think about what your population is.
● Collect data from a sample that is representative of your population.
  - If you have no choice but to work with a dataset that is not collected randomly and is biased, be careful not to generalize your results to the entire population.
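A rough simulation of why this matters. Everything here is hypothetical: the population is 10,000 simulated heights, the representative sample is a simple random sample, and the biased sample deliberately takes only the tallest members.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 simulated heights in cm
population = [random.gauss(170, 10) for _ in range(10_000)]

# Representative sample: every member has the same chance of being picked
random_sample = random.sample(population, 500)

# Biased sample: only the 500 tallest members
biased_sample = sorted(population)[-500:]

population_mean = mean(population)
random_error = abs(mean(random_sample) - population_mean)
biased_error = abs(mean(biased_sample) - population_mean)

print(round(random_error, 2), round(biased_error, 2))
```

The random sample's mean should land close to the population mean, while the biased sample's mean is far off, so conclusions drawn from it would not generalize to the whole population.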

