DATA 110 Final Exam


Creating and reading boxplots

A boxplot visualizes the distribution of a dataset by summarizing the five-number summary plus any outliers:
- Minimum: the smallest data point not considered an outlier.
- Q1 (25th percentile): the median of the lower half of the data.
- Median (Q2, 50th percentile): the central value.
- Q3 (75th percentile): the median of the upper half of the data.
- Maximum: the largest data point not considered an outlier.
- Outliers: points beyond 1.5 × IQR from Q1 or Q3, where IQR = Q3 - Q1.
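The outlier fences can be computed directly from the quartiles; a minimal sketch using Python's standard library (the data values are made up for illustration):

```python
import statistics

data = [2, 3, 5, 7, 8, 9, 11, 12, 13, 40]  # 40 is a likely outlier

# statistics.quantiles with n=4 returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Points beyond 1.5 * IQR from Q1 or Q3 are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# The whiskers extend to the most extreme points inside the fences
inside = [x for x in data if lower_fence <= x <= upper_fence]
print(q1, q2, q3, outliers)
```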

Monty Hall Problem

A probability puzzle demonstrating counterintuitive results in conditional probability scenarios.

Tables, Columns, Rows

Data is often stored in tabular format with rows representing individual records and columns representing attributes or features.

Difference between Data Science and Statistics

Data science incorporates computing, data analysis, and communication, often focusing on practical applications and insights from data, while statistics emphasizes theoretical aspects of data interpretation.

Data Modality, Tabular Data

Different data modalities include structured (e.g., tables), semi-structured (e.g., JSON), and unstructured data (e.g., images, text). Tabular data is structured in rows and columns, making it suitable for statistical and machine learning models.

Sampling and Representative Samples

Drawing a subset from a population that accurately reflects its characteristics.

Uniform distribution: Coin flip, dice roll, picking cards

Uniform Distribution applies when all outcomes are equally likely. Coin flips and dice rolls are examples of discrete uniform distributions, while random selection from continuous ranges (e.g., 0 to 1) demonstrates continuous uniform distribution. Sampling without replacement (e.g., card drawing) introduces conditional probabilities and affects uniformity after each draw.
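Both ideas can be checked with a short simulation using Python's random module (the seed and sample sizes are arbitrary):

```python
import random

random.seed(0)

# Discrete uniform: each die face has probability 1/6
rolls = [random.randint(1, 6) for _ in range(10_000)]
prop_six = rolls.count(6) / len(rolls)  # close to 1/6 ≈ 0.167

# Sampling WITHOUT replacement: each draw changes later probabilities
deck = [rank for rank in range(1, 14) for _ in range(4)]  # 52 cards, 4 per rank
random.shuffle(deck)
first = deck.pop()
# Before any draw, P(rank == first) = 4/52; after removing one such card,
# only 3 of that rank remain among 51 cards, so P = 3/51
remaining_same_rank = deck.count(first)
print(prop_six, remaining_same_rank, len(deck))
```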

Variables, Attributes, Features

Variables (features) are attributes in the data that describe each record.

Motivation for Plots and Charts

Visualizations help reveal patterns, distributions, and relationships in data.

Association vs. Correlation vs. Causality

- Association: a general relationship between two variables.
- Correlation: quantifies the strength of an association (e.g., the correlation coefficient).
- Causality: one variable directly affects another.
- Spurious correlations: false correlations due to coincidence or third (confounding) variables.
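Pearson's correlation coefficient can be computed straight from its definition; a sketch with hypothetical hours-studied data (the numbers are invented, and a strong correlation alone does not establish causation):

```python
import math

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 75]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(score) / n

# Pearson's r = covariance / (sd_x * sd_y), computed from deviations
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, score))
var_x = sum((x - mean_x) ** 2 for x in hours)
var_y = sum((y - mean_y) ** 2 for y in score)
r = cov / math.sqrt(var_x * var_y)  # always between -1 and 1
print(round(r, 3))
```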

Empirical distribution vs. probability distribution

- Empirical distribution: the distribution of observed data in a sample. Derived from real data, it reflects the actual frequencies/proportions of observed outcomes. Ex. the proportion of each number rolled in a series of dice rolls.
- Probability distribution: a theoretical distribution describing the probabilities of all possible outcomes in a population. It models the idealized behavior of a random variable and is defined by mathematical formulas or models (e.g., normal distribution, binomial distribution). Ex. the uniform distribution for a fair six-sided die assigns each face probability 1/6.
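The dice example can be simulated to watch the empirical proportions converge toward the theoretical 1/6 (seed and roll count are arbitrary):

```python
import random
from collections import Counter

random.seed(42)

# Probability distribution: a fair die assigns P = 1/6 to each face
theoretical = {face: 1 / 6 for face in range(1, 7)}

# Empirical distribution: the proportions actually observed in a sample
rolls = [random.randint(1, 6) for _ in range(60_000)]
counts = Counter(rolls)
empirical = {face: counts[face] / len(rolls) for face in range(1, 7)}

# With many rolls, each observed proportion sits close to 1/6
for face in range(1, 7):
    print(face, round(empirical[face], 3), round(theoretical[face], 3))
```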

When to use pandas, seaborn, and matplotlib functions

Exploratory Data Analysis (EDA)
- Pandas: df.describe(), df.info(), df.isnull().sum()
- Seaborn: sns.histplot(), sns.boxplot(), sns.pairplot()
- Matplotlib: plt.scatter(), plt.hist()
Cleaning and Transforming Data
- Pandas: df.dropna(), df.fillna(), df.apply()
Visualizing Relationships
- Seaborn: sns.scatterplot(), sns.regplot()
- Matplotlib: plt.plot(), plt.scatter()
Identifying Trends
- Seaborn: sns.lineplot()
- Matplotlib: plt.plot()
Highlighting Distributions
- Seaborn: sns.histplot(), sns.violinplot()
- Matplotlib: plt.hist()

Interpreting Histograms and Bar Charts and when to use each

Histograms: show the frequency of numerical data grouped into bins, using the area principle (bar area, or density, represents frequency). Bar charts: display counts for categorical data. Choose based on data type: histograms for numerical data, bar charts for categories.

Definition of Distributions

A distribution describes which values occur in the data and how often each occurs.

Hypothesis testing
- Design, execution, and interpreting results
- Null and alternative hypotheses
- Choosing a test statistic
- Simulation
- p-values and statistical significance
- A/B testing (don't memorize the t-statistic formula)

Hypothesis testing allows us to systematically evaluate claims about a population using sample data.
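A permutation-style simulation is one way to run such a test end to end; a sketch with hypothetical A/B data (the group values and simulation count are made up):

```python
import random
import statistics

random.seed(1)

# Hypothetical A/B data: e.g., task times under two page designs
group_a = [12, 14, 11, 13, 15, 12, 14]
group_b = [16, 15, 17, 18, 14, 16, 17]

# Test statistic: difference in sample means (B minus A)
observed = statistics.mean(group_b) - statistics.mean(group_a)

# Simulate the null hypothesis ("group labels don't matter") by shuffling
pooled = group_a + group_b
n_a = len(group_a)
n_sims = 10_000
count_extreme = 0
for _ in range(n_sims):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[n_a:]) - statistics.mean(pooled[:n_a])
    if diff >= observed:
        count_extreme += 1

# p-value: fraction of simulated statistics at least as extreme as observed
p_value = count_extreme / n_sims
print(round(observed, 2), p_value)
```

A small p-value means a difference this large rarely arises by chance alone, so the null hypothesis is rejected at the chosen significance level.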

Core ideas and importance of the Law of Large Numbers, Glivenko-Cantelli Theorem, and Central Limit Theorem

If you sample enough data independently and under identical conditions:
- Law of Large Numbers: the sample mean (the mean of the observed data) approaches the expected value of the population.
- Glivenko-Cantelli Theorem: the empirical distribution of the observed data approaches the distribution of the population.
- Central Limit Theorem: the sample means form an approximately normal distribution, regardless of the shape of the population distribution.
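These results can be observed by simulation; a sketch using an exponential population with true mean 1.0 (the population choice and sample sizes are arbitrary):

```python
import random
import statistics

random.seed(0)

# Population: a skewed (exponential) distribution with true mean 1.0
def draw():
    return random.expovariate(1.0)

# Law of Large Numbers: one large sample's mean approaches 1.0
big_sample = [draw() for _ in range(100_000)]
lln_mean = statistics.mean(big_sample)

# Central Limit Theorem: means of many samples of size 50 cluster
# around 1.0 in a roughly normal shape, with spread sigma / sqrt(50)
sample_means = [statistics.mean(draw() for _ in range(50)) for _ in range(2_000)]
center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)  # roughly 1 / sqrt(50) ≈ 0.14
print(round(lln_mean, 3), round(center, 3), round(spread, 3))
```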

Creating and Reading Lineplots and Scatterplots

Line plots show trends over time, while scatterplots visualize relationships between two numerical variables.

Matplotlib Functions

Matplotlib provides more control for custom visualizations. import matplotlib.pyplot as plt
- plt.plot(x, y): create a simple line plot.
- plt.scatter(x, y): create a scatter plot.
- plt.hist(data, bins=n): create a histogram.
- plt.bar(x, height): create a bar chart.
- plt.title('Title'): add a title to the plot.
- plt.xlabel('X-axis Label'): add a label to the X-axis.
- plt.ylabel('Y-axis Label'): add a label to the Y-axis.
- plt.legend(loc='upper right'): add a legend to the plot.
- plt.grid(True): display gridlines.
- plt.xlim(min, max): set the range for the X-axis.
- plt.ylim(min, max): set the range for the Y-axis.
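A minimal sketch combining several of these functions (the data is hypothetical; the Agg backend renders off-screen so no display window is needed):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; no window required
import matplotlib.pyplot as plt

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.plot(x, y, label="trend")      # line plot
plt.title("Simple Line Plot")      # title
plt.xlabel("X-axis Label")         # axis labels
plt.ylabel("Y-axis Label")
plt.legend(loc="upper right")      # legend
plt.grid(True)                     # gridlines
ax = plt.gca()                     # grab the current Axes to inspect it
```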

Measures of Centrality

Mean, Median, Mode
- Mean: the sum of all values divided by the number of values. Represents the center of the data, is easy to compute, and is ideal for symmetric distributions; strongly affected by outliers and unreliable for skewed data. Best for symmetric data (ex. test scores).
- Median: the middle value when data is sorted in ascending order. Represents the typical value of a skewed dataset; ignores magnitudes and is less sensitive to changes in the data. Best for skewed data (ex. property prices).
- Mode: the most frequently occurring value(s) in a dataset. Applies to numerical and categorical data and identifies the most common value; less useful for data with many unique values. Best for categorical data (ex. survey responses).
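A quick illustration with Python's statistics module (the price and survey values are invented):

```python
import statistics

# Skewed data: one large value pulls the mean but not the median
prices = [120, 130, 135, 140, 150, 155, 900]

mean = statistics.mean(prices)      # distorted upward by the 900 outlier
median = statistics.median(prices)  # a better "typical" value here

# Mode works on categorical data too
mode = statistics.mode(["yes", "no", "yes", "yes", "maybe"])
print(round(mean, 2), median, mode)
```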

Connections to model, sampling, and simulation

Model: A statistical model is a simplified representation of the data-generation process in the population. Models use probability distributions to describe theoretical relationships. Sampling: Sampling bridges the gap between the population and the sample. Random sampling ensures that the sample is representative, reducing bias. Simulation: Simulation uses probability distributions and random sampling to model real-world processes or assess statistical properties.

Concept of Model and Modeling

Models represent data patterns or relationships, used for predictions or insights.

Numerical vs. Categorical Distributions

Numerical distributions show range and frequency of values, while categorical distributions show counts per category.

Numerical vs. Categorical Variables

Numerical variables represent quantities, while categorical variables represent groups or categories.

Basic Pandas functions

Pandas is primarily used for data manipulation and analysis. import pandas as pd
Input/Output
- pd.read_csv('file.csv'): read a CSV file into a Pandas DataFrame.
- df.to_csv('file.csv'): export a DataFrame to a CSV file.
Data Inspection
- df.head(n): display the first n rows of the DataFrame (default is 5).
- df.tail(n): display the last n rows of the DataFrame (default is 5).
- df.info(): get a summary of the DataFrame, including data types and non-null counts.
- df.describe(): get summary statistics for numerical columns.
- df.shape: return the number of rows and columns as a tuple.
Selection and Filtering
- df['column_name']: access a single column as a Series.
- df[['col1', 'col2']]: access multiple columns as a DataFrame.
- df.loc[row_labels, column_labels]: select rows and columns by labels.
- df.iloc[row_indices, column_indices]: select rows and columns by integer position.
- df[df['column'] > value]: filter rows where a column value satisfies a condition.
- df[(df['col1'] > val1) & (df['col2'] < val2)]: filter using multiple conditions with logical operators.
Cleaning and Transforming
- df.drop(columns=['col1', 'col2']): drop specified columns.
- df.dropna(): remove rows with missing values.
- df.fillna(value): fill missing values with a specified value.
- df.drop_duplicates(): drop duplicate rows.
- df['new_col'] = df['col1'] + df['col2']: create a new column using operations.
- df.apply(function): apply a function element-wise or row/column-wise.
Aggregation and Combining
- df.groupby('column'): group data by a column for aggregation.
- df.sort_values(by='column', ascending=True): sort rows by column values.
- df['column'].mean() / .median() / .sum(): calculate column statistics.
- df['column'].value_counts(): count unique values in a column.
- pd.merge(df1, df2, on='key_column', how='inner'): merge two DataFrames based on a key.
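A few of these in action on a tiny hypothetical DataFrame (column names and values are invented):

```python
import pandas as pd

# Small hypothetical dataset with one missing value
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "income": [50, 60, 40, None, 55],
})

print(df.shape)              # rows and columns as a tuple: (5, 2)
col_mean = df["income"].mean()  # mean skips the missing value by default

filled = df.fillna(col_mean)             # fill missing with the column mean
high = df[df["income"] > 45]             # filter rows by a condition
by_city = df.groupby("city")["income"].mean()  # aggregate per group
print(by_city)
```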

Parameter vs. estimate

Parameter: A fixed, unknown quantity describing a population (e.g., population mean, population variance). Ex. the true average income of all city residents Estimate: A statistic calculated from a sample that approximates a population parameter. Ex. the average income of the 500 residents surveyed, used to estimate the population mean.

Population vs. sample, statistic

Population: The entire set of individuals or observations of interest. Ex. all residents in a city Sample: A subset of the population used to make inferences about the population. Ex. a random selection of 500 residents from the city. Statistic: A numerical summary calculated from a sample. Ex. the mean income of the 500 residents surveyed.

Probability Basics: randomness and chance, complementary probability, conditional probability, independence, mutually exclusive events

- Randomness and Chance: events occur unpredictably.
- Complementary Probability: the probability of an event not occurring (1 - probability of occurrence).
- Conditional Probability: the probability of an event given that another event has occurred.
- Independence: events don't affect each other's outcomes.
- Mutually Exclusive: events that cannot happen simultaneously.
- Uniform Distribution: equal probability for each outcome (e.g., fair dice, coin flips).
- Gaussian (Normal) Distribution: a symmetric, bell-shaped distribution described by its mean and standard deviation; about 68%, 95%, and 99.7% of values lie within 1, 2, and 3 standard deviations of the mean.
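Complementary probability, conditional probability, and independence can all be verified by enumerating the 36 equally likely outcomes of rolling two dice:

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two dice
space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event (a predicate over outcomes)."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

# Complementary probability: P(no six) = 1 - P(at least one six)
p_at_least_one_six = prob(lambda o: 6 in o)
p_no_six = prob(lambda o: 6 not in o)
assert p_no_six == 1 - p_at_least_one_six

# Conditional probability: P(sum == 8 | first die is 6)
p_cond = prob(lambda o: o[0] == 6 and sum(o) == 8) / prob(lambda o: o[0] == 6)

# Independence: the dice don't affect each other, so P(A and B) = P(A) * P(B)
assert prob(lambda o: o == (6, 6)) == prob(lambda o: o[0] == 6) * prob(lambda o: o[1] == 6)
print(p_at_least_one_six, p_cond)
```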

Measures of Dispersion: Range, Standard Deviation, and Interquartile Range

- Range: the difference between the largest and smallest values. Simple to compute but highly sensitive to outliers; useful for a quick sense of variability in exploratory analysis.
- Standard Deviation: a measure of the average distance of data points from the mean. Incorporates all data points but is sensitive to outliers; ideal for symmetric distributions.
- Interquartile Range (IQR): the difference between the 75th and 25th percentiles. Robust to outliers and highlights variability around the median, though it ignores the extremes; excellent for skewed data when outliers should be ignored.
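Computing all three measures on a small made-up dataset with the standard library:

```python
import statistics

data = [4, 8, 15, 16, 23, 42]

value_range = max(data) - min(data)         # sensitive to the two extremes
std_dev = statistics.stdev(data)            # sample standard deviation
q1, _, q3 = statistics.quantiles(data, n=4) # quartile cut points
iqr = q3 - q1                               # robust to outliers
print(value_range, round(std_dev, 2), iqr)
```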

Seaborn Functions

Seaborn is used for statistical data visualization. import seaborn as sns
- sns.lineplot(x='x_col', y='y_col', data=df): create a line plot.
- sns.scatterplot(x='x_col', y='y_col', data=df): create a scatter plot.
- sns.histplot(data=df['column'], bins=n): create a histogram.
- sns.barplot(x='x_col', y='y_col', data=df): create a bar chart with aggregated values.
- sns.boxplot(x='x_col', y='y_col', data=df): create a boxplot.
- sns.heatmap(data=df.corr(), annot=True): create a heatmap with annotations.
- sns.violinplot(x='x_col', y='y_col', data=df): create a violin plot for distributions.
- sns.pairplot(df): create pairwise scatterplots for all numerical columns.
- sns.regplot(x='x_col', y='y_col', data=df): create a scatter plot with a regression line.
- sns.set_theme(style='darkgrid'): set an overall theme for plots.
- sns.set_context(context='talk'): adjust the size and scaling of plots.

Data Science Lifecycle

Steps include defining the problem, data collection, preparation, exploration, model building, model evaluation, and model deployment.

Summary Statistics

Summary statistics provide a quick overview of a dataset by focusing on key characteristics, such as its central tendency and variability. They are essential for understanding data distributions and are categorized into measures of centrality and measures of dispersion.

Gambler's Fallacy

The mistaken belief that past random events affect future probabilities (e.g., expecting a coin flip to result in heads after multiple tails).

Inference

The process of drawing conclusions about a population based on data from a sample. It connects theoretical concepts like probability distributions with practical tools like sampling, simulation, and modeling.

