STATS Midterm
NoSQL
"Not Only SQL"; a non-relational database that supports the storage of a wide range of data types, including structured, semi-structured, and unstructured data. Offers the flexibility, performance, and scalability needed to handle extremely high volumes of data.
Approximation formula for the width of each interval
(maximum - minimum) divided by the number of intervals
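A minimal sketch of the width formula. The data range (12 to 97), the choice of 6 intervals, and the rounding up to a multiple of 5 are illustrative assumptions, not part of the formula itself.

```python
import math

def interval_width(minimum, maximum, n_intervals):
    """Approximate width: (maximum - minimum) / number of intervals."""
    return (maximum - minimum) / n_intervals

# Hypothetical data: values range from 12 to 97, and we want 6 intervals.
raw_width = interval_width(12, 97, 6)

# Assumption: round up to a convenient multiple of 5 so the limits are easy to read.
width = math.ceil(raw_width / 5) * 5

print(raw_width, width)
```

Rounding the width up rather than down keeps the intervals exhaustive, so they still cover the whole sample.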
Knowledge
A blend of data, contextual information, experiences, and intuition that can be applied and put into action in specific situations
Data warehouse
A central repository of data from multiple departments within an organization. One of its primary purposes is to support managerial decision making. Usually organized around subjects such as sales, customers, or products that are relevant to business decision making.
Database
A collection of data logically organized to enable easy retrieval, management, and distribution of data
HyperText Markup Language (HTML)
A mark-up language that uses tags to define its data in web pages
JavaScript Object Notation (JSON)
A popular alternative to XML in recent years; a standard for transmitting human-readable data in compact files
Population vs Sample
A population consists of all observations or items of interest in an analysis. A sample is a subset of the population. We examine sample data to make inferences about the population.
Composite primary key
A primary key that consists of more than one attribute. We use a composite primary key when none of the individual attributes alone can uniquely identify each instance of the entity.
Data management
A process that an organization uses to acquire, organize, store, manipulate, and distribute data
Frequency distribution for numerical variables
A series of intervals that follow these guidelines: 1) mutually exclusive (each observation falls into only one interval); 2) the total number of intervals usually ranges from 5 to 20; 3) exhaustive (the intervals cover the entire sample); 4) easy to recognize and interpret.
Instance
A single occurrence of an entity
Data mart
A small-scale data warehouse or a subset of the enterprise data warehouse that focuses on one particular subject or decision area
Database Management System
A software application for defining, manipulating, and managing data in databases (examples: Oracle, SQL Server, Access)
Ordinal scale
Able to both categorize and rank the data with respect to some characteristic or trait; the values are ranked, but the numbers assigned to them are arbitrary. Typically expressed in words and then coded into numbers (example: hotel reviews classified as 1-5 stars).
Interval scale
Able to categorize and rank the data as well as find meaningful differences between observations (example: temperature). The value of zero is arbitrarily chosen.
Population
All elements of interest
Primary key
An attribute that uniquely identifies each instance of the entity; a special type of attribute (ex. Customer_ID is the primary key for CUSTOMER because each customer would have a unique ID number). Often used to create a data structure called an index for fast data retrieval and searches.
Discrete variable
Assumes a countable number of values
Change in analytic professionals
Become more self-reliant and possess the necessary skills for data wrangling and data analysis; no longer relying on IT department; requires a broader skill set than just statistical and data mining techniques
Business Analytics
Broad topic, encompassing statistics, computer science, and information systems with a wide variety of applications in marketing, human resource management, economics, finance, health, sports, politics, etc.
Relative Frequency
Calculated by dividing the frequency of a category by the sample size; gives the proportion of observations in each category
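A short sketch of a frequency distribution and its relative frequencies for a categorical variable. The survey responses are made up for illustration.

```python
from collections import Counter

# Hypothetical survey responses (a categorical variable).
responses = ["yes", "no", "yes", "yes", "undecided", "no", "yes", "no"]

frequency = Counter(responses)   # number of observations in each category
n = len(responses)               # sample size

# Relative frequency = frequency / sample size.
relative_frequency = {category: count / n for category, count in frequency.items()}

for category, share in relative_frequency.items():
    print(category, frequency[category], share)
```

The relative frequencies always sum to 1, which is a quick check that no observation was dropped.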
Big Data
Catch-phrase term meaning a massive volume of both structured and unstructured data that are extremely difficult to manage, process, and analyze using traditional data-processing tools
Three common approaches for transforming categorical data prior to analysis
Category reduction, dummy variables, and category scores
Continuous variable
Characterized by uncountable values within an interval (weight, height, time, investment return)
Dummy variables
A variable that takes on the value 1 or 0 to describe two categories of a categorical variable; used when converting categorical variables into numerical variables
Data
Compilations of facts, figures, or other contents, both numerical and nonnumerical
Data, Information, and Knowledge
Data are compilations of facts, figures, or other contents, both numerical and nonnumerical. Information is a set of data that are organized and processed in a meaningful and purposeful way. Knowledge is derived from a blend of data, contextual information, experience, and intuition.
Information
Data that have been organized, analyzed, and processed in a meaningful and purposeful way
Foreign key
Defined as a primary key of a related entity (ex. because Customer_ID is the primary key of the CUSTOMER entity, which shares a relationship with the ORDER entity, it is considered a foreign key in the ORDER entity)
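A hedged sketch of the CUSTOMER and ORDER example using Python's built-in sqlite3 module. The table is named orders because ORDER is a reserved word in SQL. The column names and rows are hypothetical, and SQLite only enforces foreign keys after the PRAGMA shown below is turned on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# customer_id is the primary key of customer ...
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
# ... and appears in orders as a foreign key that must match an existing customer.
conn.execute(
    """
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL
    )
    """
)

conn.execute("INSERT INTO customer VALUES (1, 'Ana')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")  # valid: customer 1 exists

try:
    conn.execute("INSERT INTO orders VALUES (101, 99, 10.0)")  # no customer 99
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
conn.close()
print(rejected, order_count)
```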
Three types of analytics
Descriptive: what happened? Predictive: what could happen in the future? Prescriptive: what should we do?
Range
Difference between the maximum and the minimum observations of a variable
Line chart
Displays a numerical variable as a series of data points connected by a line; especially useful for tracking changes or trends over time
Percentile
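The pth percentile divides a variable into two parts: approximately p percent of the observations are less than the pth percentile, and approximately (100 - p) percent are greater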
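A sketch of one common way to locate a percentile: sort the data, find position (n + 1) * p / 100, and interpolate between neighboring values. Software packages use slightly different conventions, so treat this rule as an assumption. The scores are hypothetical.

```python
def percentile(values, p):
    """pth percentile using the (n + 1) * p / 100 location rule with linear interpolation."""
    data = sorted(values)
    n = len(data)
    location = (n + 1) * p / 100   # 1-based position in the sorted data
    if location <= 1:
        return data[0]
    if location >= n:
        return data[-1]
    lower = int(location)          # integer part of the position
    fraction = location - lower    # fractional part used for interpolation
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

# Hypothetical exam scores.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95]
print(percentile(scores, 25), percentile(scores, 50))
```

The 50th percentile computed this way is the median.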
Unstructured data
Does not conform to a predefined, row-column format; usually textual or containing multimedia components
Fixed-width format
Each column starts and ends at the same place in every row; actual data are stored as plain text characters in a digital file
Delimited format
Each piece of data is separated by a delimiter, most commonly a comma
Skewness
Extremely high or low values of skewed variables significantly inflate or deflate the average of the entire data set, making it difficult to detect meaningful relationships with skewed variables. A popular mathematical transformation that reduces skewness in data is the natural logarithm transformation. Another transformation to reduce data skewness is the square root transformation.
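A small sketch of the two transformations mentioned above. The income values are invented, with one extreme value to create positive skew. The natural log requires strictly positive values and the square root requires nonnegative values.

```python
import math
import statistics

# Hypothetical incomes in $1,000s; the last value is an extreme observation.
incomes = [30, 32, 35, 38, 40, 45, 50, 400]

log_incomes = [math.log(x) for x in incomes]     # natural logarithm transformation
sqrt_incomes = [math.sqrt(x) for x in incomes]   # square root transformation

# In a skewed variable the mean is pulled away from the median.
ratio_raw = statistics.mean(incomes) / statistics.median(incomes)
ratio_log = statistics.mean(log_incomes) / statistics.median(log_incomes)
ratio_sqrt = statistics.mean(sqrt_incomes) / statistics.median(sqrt_incomes)

print(ratio_raw, ratio_sqrt, ratio_log)
```

A mean-to-median ratio closer to 1 after the transformation suggests the extreme value has less pull on the average.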
True or False: raw data offers a lot of value and insights
False. To extract value from data, we need to understand the business context, ask the right questions of the data, identify appropriate analysis models, and communicate the resulting information in verbal and written language.
Entity
Generalized category to represent persons, places, things, or events about which we want to store data in a database table
Bar chart for categorical variable
Graphical representation of a frequency distribution for a categorical variable; the height of each bar is equal to the frequency or the relative frequency of the corresponding category
Histogram
Graphical representation of a frequency distribution for a numerical variable
Entity-relationship diagram
Graphical representation used to model the structure of data
Scatterplot
Graphical tool to examine the relationship between two numerical variables; each point represents a paired observation for the two variables
Stacked column chart
Graphically shows information from a contingency table. Allows for the comparison of composition within each category.
Two important data preparation techniques
Handling missing values and subsetting data
Distinction between JSON and XML
The JSON format is not as verbose as the XML format, making data files smaller in size. The JSON format supports a wide range of data types not readily available in the XML format. Parsing JSON data files is faster and less resource intensive.
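A rough illustration of the verbosity and data-type points, writing one made-up customer record in both formats with Python's standard library. The record, field names, and element names are assumptions chosen for the example.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record with an integer, a string, and a Boolean.
record = {"id": 7, "name": "Ana", "active": True}

# JSON: types such as numbers and Booleans survive a round trip.
json_text = json.dumps(record)
round_trip = json.loads(json_text)

# XML: every value is wrapped in an opening and closing tag and stored as text.
root = ET.Element("customer")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
xml_text = ET.tostring(root, encoding="unicode")

print(len(json_text), json_text)
print(len(xml_text), xml_text)
```

For this record the XML text is longer, and the id comes back as the string "7" unless the reader converts it.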
Nominal scale
Least sophisticated level of measurement. Categorizes or groups the data. Observations differ merely by name or label.
Frequency Distribution for categorical variable
Makes categorical variables more manageable and easier to assess. Groups the data into categories and records the number of observations that fall into each category. The relative frequency for each category equals the proportion of observations in that category.
Machine-generated data
Manufacturing sensors, speed cameras, web server logs
Measures of Central Location
Mean, median, mode
Types of numerical descriptive measures
Measures of central location: find a typical value for the data. Measures of dispersion: gauge the underlying variability of the data. Measures of shape: reveal symmetry and tails. Measures of association: show whether there is a linear relationship.
Median
Middle value of a data set. Useful when the mean gives a misleading description because of outliers.
Category Scores
Most appropriate if the data are ordinal and have natural, ordered categories. Recode the categories as numbers. Assumes equal increments between the category scores. Example: customer satisfaction surveys, ranking 1-5 with each number representing a satisfaction level.
Relational database
Most common type of database used in organizations today. Consists of one or more logically related data files, often called tables or relations, where each data file is a two-dimensional grid that consists of rows and columns.
Mode
Most frequently occurring observation of a variable. A variable may have no mode or more than one mode. The mode is the only meaningful measure of central location for a categorical variable.
Structured query language (SQL)
Most popular query language. A language for manipulating data in a relational database using relatively simple and intuitive commands. Basic structure: Select, From, Where.
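A minimal sketch of the Select-From-Where structure, run through Python's built-in sqlite3 module. The customer table, its columns, and its rows are hypothetical. ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ana", "CA"), (2, "Ben", "NY"), (3, "Cruz", "CA")],
)

# SELECT chooses the columns, FROM names the table, WHERE filters the rows.
rows = conn.execute(
    "SELECT name FROM customer WHERE state = 'CA' ORDER BY name"
).fetchall()
conn.close()

print(rows)
```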
Relationship in a scatterplot
Negative linear relationship: points cluster along a line with a negative slope. Positive linear relationship: points cluster along a line with a positive slope. Nonlinear relationship: points follow a curve (for example, as x increases, y increases at a faster rate). No relationship: no apparent pattern.
Measurement Scales
Nominal and ordinal scales apply to categorical variables; interval and ratio scales apply to numerical variables. The measurement scale guides the choice of techniques for summarizing and analyzing a variable.
Why do we rely on sampling data?
Obtaining information on the entire population is expensive. It is often impossible to examine every member of the population.
Two common strategies for dealing with missing values
Omission and imputation
Relationship
An association between entities that represents certain business facts or rules; can be one-to-one, one-to-many, or many-to-many
RFM Analysis
Popular marketing technique used to identify high-value customers. RFM stands for recency (days since the last purchase), frequency (number of orders), and monetary (monetary value of purchases).
Types of mean
Population mean = parameter; sample mean = statistic
Notation of mean
The population mean is referred to with the Greek letter μ (mu). The sample mean is referred to with x̄ (x-bar).
Structured data
Conforms to a predefined, row-column format; spreadsheet or database applications are used to enter, store, query, and analyze structured data
Human-generated data
Price, income, retail sales, age, gender
Binning
Process of transforming numerical variables into categorical variables by grouping the numerical values into a small number of groups or bins. Bins must be consecutive and nonoverlapping, so each value falls into one and only one bin. An effective way to reduce noise in the data if we believe that all observations in the same bin tend to behave the same way.
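A brief sketch of binning with consecutive, nonoverlapping bins. The ages, bin edges, and labels are hypothetical. A value equal to an edge is placed in the higher bin, which is a convention chosen for this example.

```python
import bisect

# Hypothetical bin edges: under 30, 30-49, 50-64, 65 and over.
edges = [30, 50, 65]
labels = ["under 30", "30-49", "50-64", "65+"]

def assign_bin(value):
    """Return the label of the single bin that contains value."""
    # bisect_right counts how many edges are <= value, which is the bin index.
    return labels[bisect.bisect_right(edges, value)]

ages = [18, 25, 30, 37, 52, 64, 71]
binned = [assign_bin(age) for age in ages]
print(binned)
```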
Business Intelligence
Provides organizations and their users with the ability to access and manipulate data interactively through reports, dashboards, applications, and visualization tools
Categorical variable
Qualitative
Numerical variable
Quantitative
Measures of Dispersion
Range, interquartile range, mean absolute deviation, variance, standard deviation
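A sketch of several of these measures on a made-up sample using Python's statistics module. The interquartile range is left out because quartile conventions differ across textbooks and software. The variance and standard deviation shown are the sample versions, which divide by n - 1.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample

mean = statistics.mean(values)

data_range = max(values) - min(values)                    # maximum minus minimum
mad = sum(abs(v - mean) for v in values) / len(values)    # mean absolute deviation
sample_variance = statistics.variance(values)             # divides by n - 1
sample_sd = statistics.stdev(values)                      # square root of the sample variance

print(data_range, mad, sample_variance, sample_sd)
```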
Omission strategy
Recommends that observations with missing values be excluded from the analysis. Also called complete-case analysis. Appropriate when the amount of missing values is small or concentrated in a small number of observations.
Cross-Sectional Data
Data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time (examples: NBA wins/losses over a season; recorded grades of students in a class; sale prices of homes)
Time Series Data
Data collected over several time periods, focusing on certain groups of people, specific events, or objects (examples: hourly body temperature; daily price of a stock in the first quarter)
Imputation strategy
Replaces missing values with some reasonable imputed values, such as the mean or median. For categorical variables, it is common to impute the most predominant category. In the presence of outliers, it is preferred to use the median instead of the mean to impute missing values.
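A small sketch of both strategies for missing values, with None marking a missing entry. The numbers and categories are invented. The numerical example contains an outlier (90.0), so the median is used for imputation.

```python
import statistics

# Hypothetical numerical variable with two missing values and one outlier.
values = [12.0, 15.0, None, 14.0, 90.0, None, 13.0]

# Omission (complete-case analysis): keep only the observed values.
observed = [v for v in values if v is not None]

# Imputation: replace each missing value with the median of the observed values.
fill_value = statistics.median(observed)
imputed = [fill_value if v is None else v for v in values]

# Categorical variable: impute the most predominant category.
colors = ["red", None, "blue", "red", "red", None]
top_category = statistics.mode([c for c in colors if c is not None])
imputed_colors = [top_category if c is None else c for c in colors]

print(imputed)
print(imputed_colors)
```

Here the mean of the observed values is 28.8, well above every value except the outlier, which is why the median (14.0) is the more reasonable fill value.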
First tasks completed by data analysts to gain a better understanding of and insight into data
Review and inspect the data for quality and relevance. Count and sort to verify that the data set is complete or to find missing values.
Bubble plot
Shows the relationship between three numerical variables; the third numerical variable is represented by the size of the bubble
eXtensible Markup Language (XML)
Simple language for representing structured data; widely used for sharing structured information between computer programs, between people, and between computers and people. Each piece of data is enclosed in a pair of 'tags' that follow specific XML syntax.
Structure of a data mart
Star schema: a multidimensional data model made up of dimension and fact tables. Dimension table: describes the business dimensions of interest, such as customer, product, location, and time. Fact table: contains facts about the business operation, often in a quantitative format.
Ratio scale
Strongest level of measurement. Has all the characteristics of the interval scale plus a true zero point, which allows us to interpret the ratios between observations (example: sales, profits, inventory levels, weight, time, distance).
Sample
Subset of a population
Histogram shape of distribution
A symmetric distribution is one that is a mirror image of itself on both sides of its center. If the distribution is not symmetric, then it is skewed. Positively skewed: a long tail that extends to the right reflects the presence of a small number of relatively large values. Negatively skewed: a long tail that extends to the left reflects the presence of a small number of relatively small values.
Data transformation
The data conversion process from one format or structure to another; performed to meet the requirements of statistical and data mining techniques used for the analysis. Examples: date of birth to age; BMI calculation; percentages.
Primary barrier preventing organizations from taking full advantage of business analytics
The inability to clean and organize big data
Data modeling
The process of defining the structure of a database
Subsetting
The process of extracting portions of a data set that are relevant to the analysis; commonly used to pre-process the data prior to analysis. May remove variables that are irrelevant to the problem, variables that contain redundant information, or variables with excessive amounts of missing values.
Data wrangling
The process of retrieving, cleansing, integrating, transforming, and enriching data to support subsequent data analysis. Transforms raw data into a format that is more appropriate and easier to analyze.
Extraction, Transformation, and Load process
Used to integrate data from different databases generated by various business departments: retrieve, reconcile, and transform the data into a consistent format, and then load the final data into a data warehouse
Heat map
Uses color or color intensity to display relationships between variables; useful to identify combinations of the categorical variables that have economic significance
Issues where it makes sense to use category reduction
Variables with too many categories pull down model performance. Use it if a variable has some categories that rarely occur, or if one category clearly dominates in terms of occurrence.
Other characteristics of big data
Veracity: credibility and quality of data. Value: the value derived from big data is perhaps the most important aspect of any analytics initiative.
Three characteristics of big data
Volume: immense amount of data. Velocity: data from a variety of sources are generated at a rapid speed. Variety: data come in all types, forms, and granularity, both structured and unstructured.
Variable
When a characteristic of interest differs in kind or degree among various observations (records)
Rescaling
When the variables in a data set are measured using different scales, the differences in scale can give larger-scale variables undue influence, resulting in inaccurate outcomes. It is commonplace to rescale the data using either standardization or normalization, especially in data mining techniques.
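A sketch of the two rescaling approaches under common definitions, which are assumed here: standardization as z-scores (subtract the mean, divide by the sample standard deviation) and normalization as min-max scaling to the range 0 to 1. The income and age values are hypothetical.

```python
import statistics

def standardize(values):
    """z-score: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def normalize(values):
    """Min-max scaling: (value - min) / (max - min), giving values between 0 and 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Two hypothetical variables on very different scales.
income = [40000, 55000, 70000]
age = [25, 40, 55]

print(standardize(income), standardize(age))
print(normalize(income), normalize(age))
```

After either rescaling, the two variables take the same values in this example, so neither dominates simply because it is measured in larger units.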
Category reduction
Collapses some of the categories to create fewer nonoverlapping categories. Guideline 1: categories with very few observations may be combined to create an "other" category. Guideline 2: categories with a similar impact may be combined.
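A minimal sketch of the first guideline: categories with very few observations are collapsed into "other". The data and the cutoff of 2 observations are assumptions for illustration.

```python
from collections import Counter

def reduce_categories(values, min_count):
    """Replace categories observed fewer than min_count times with 'other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "other" for v in values]

# Hypothetical categorical variable; C and D each occur only once.
regions = ["A", "A", "A", "B", "B", "C", "D"]
reduced = reduce_categories(regions, min_count=2)
print(reduced)
```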
Key distinction between XML and HTML
XML tells us or computer applications what the data are. HTML tells the web browser how to display the data.
General rule for creating dummy variables
For a categorical variable with k categories, create k - 1 dummy variables, using the last category as the reference
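A short sketch of the k - 1 rule. The color variable and the is_ prefix for the dummy names are hypothetical. The last category in the list is the reference, so it gets no dummy variable and is represented by all zeros.

```python
def make_dummies(values, categories):
    """Create k - 1 dummy variables (1 or 0); the last category is the reference."""
    dummy_categories = categories[:-1]   # drop the reference category
    return [
        {f"is_{category}": int(value == category) for category in dummy_categories}
        for value in values
    ]

# k = 3 categories, so 2 dummy variables; "blue" is the reference.
categories = ["red", "green", "blue"]
colors = ["red", "blue", "green"]
dummies = make_dummies(colors, categories)
print(dummies)
```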
Scatterplot with a categorical variable
A scatterplot that incorporates a categorical variable by using different colors or symbols
Contingency table
Used to examine the relationship between two categorical variables. Shows the frequencies for two variables, where each cell represents a mutually exclusive combination of the values.
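A sketch of building a contingency table by counting pairs of values. The two categorical variables (customer type and purchase channel) and the observations are made up.

```python
from collections import Counter

# Hypothetical paired observations: (customer type, purchase channel).
observations = [
    ("member", "online"), ("member", "store"), ("guest", "online"),
    ("member", "online"), ("guest", "store"), ("guest", "online"),
]

# Each cell counts one mutually exclusive combination of the two values.
table = Counter(observations)

row_labels = sorted({row for row, _ in observations})
col_labels = sorted({col for _, col in observations})

print("", col_labels)
for row in row_labels:
    print(row, [table[(row, col)] for col in col_labels])
```

The cell counts add up to the number of observations because every observation falls into exactly one cell.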