Acct Stats - Exam 1
measures of shape
(skewness coefficient, kurtosis coefficient)
data warehouse
a central repository of data from multiple departments within an organization to support managerial decision making
database
a collection of data logically organized to enable easy retrieval, management, and distribution of data
binning
a common data transformation technique that converts numerical variables into categorical variables by grouping the numerical values into a small number of bins
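A minimal sketch of binning, assuming hypothetical ages and three made-up bin boundaries (30 and 50); neither the data nor the cutoffs come from the course material.

```python
# Hypothetical ages; the bin edges (30 and 50) are illustrative assumptions.
ages = [23, 35, 47, 52, 68, 19, 41]

def bin_age(age):
    """Map a numerical age to one of three categorical bins."""
    if age < 30:
        return "under 30"
    elif age < 50:
        return "30 to 49"
    return "50 and over"

# The numerical variable becomes a categorical variable with three categories.
age_groups = [bin_age(a) for a in ages]
print(age_groups)
```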
boxplot
a convenient way to graphically display the five-number summary of a variable
variable
a general characteristic being observed on a set of people, objects, or events, where each observation varies in kind or degree
entity
a generalized category to represent persons, places, things, or events about which we want to store data in a database table
entity-relationship diagram (ERD)
a graphical representation used to model the structure of the data
scatterplot
a graphical tool that helps in determining whether or not two numerical variables are related in some systematic way
structured query language (SQL)
a language for manipulating data in a relational database using relatively simple and intuitive commands; the basic structure consists of the SELECT, FROM, and WHERE keywords
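A small sketch of the select / from / where structure, run through Python's built-in sqlite3 module. The customer table, its columns, and its rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database with a hypothetical customer table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ana", "Boston"), (2, "Ben", "Denver"), (3, "Cy", "Boston")],
)

# SELECT picks the columns, FROM names the table, WHERE filters the rows.
rows = conn.execute(
    "SELECT name FROM customer WHERE city = 'Boston' ORDER BY customer_id"
).fetchall()
print(rows)
conn.close()
```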
big data
a massive volume of both structured and unstructured data that are extremely difficult to manage, process, and analyze using traditional data-processing tools
composite primary key
a primary key that contains more than one attribute
histogram
a series of rectangles where the width and height of each rectangle represent the interval width and frequency of the respective interval
information
a set of data that are organized and processed in a meaningful and purposeful way
HyperText Markup Language (HTML)
a simple text-based markup language for displaying content in web browsers
eXtensible Markup Language (XML)
a simple text-based markup language for representing structured data that uses user-defined markup tags to specify the structure of the data
instance
a single occurrence of an entity
database management system (DBMS)
a software application for defining, manipulating, and managing data in databases
JavaScript Object Notation (JSON)
a standard for transmitting human-readable data in compact files
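A brief sketch of what a JSON record looks like, using Python's standard json module. The record and its field names are made up for illustration.

```python
import json

# Hypothetical record; keys and values are assumptions for this example.
record = {"customer_id": 1, "name": "Ana", "orders": [25.5, 40.0]}

text = json.dumps(record)    # Python object -> compact, human-readable text
restored = json.loads(text)  # text -> Python object again
print(text)
```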
sample
a subset of data used for the analysis and to make inferences about the population
primary key
an attribute that uniquely identifies each instance of an entity
mean absolute deviation (MAD)
an average of the absolute differences between the observations and the mean
variance
an average of the squared differences between the observations and the mean
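A worked sketch of the mean absolute deviation and the variance on five made-up observations. The card does not say which divisor applies, so both the population form (divide by n) and the sample form (divide by n - 1) are shown; which one a given course expects is an assumption to check.

```python
# Hypothetical observations chosen so the arithmetic is easy to follow.
observations = [2, 4, 4, 6, 9]
n = len(observations)
mean = sum(observations) / n  # 25 / 5 = 5.0

# MAD: average of the absolute differences from the mean.
mad = sum(abs(x - mean) for x in observations) / n  # 10 / 5 = 2.0

# Variance: average of the squared differences from the mean.
squared_diffs = sum((x - mean) ** 2 for x in observations)  # 28.0
population_variance = squared_diffs / n      # 28 / 5
sample_variance = squared_diffs / (n - 1)    # 28 / 4 = 7.0

print(mad, population_variance, sample_variance)
```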
volume
an immense amount of data compiled from a single source or a wide range of sources
heat map
an important visualization tool that uses color or color intensity to display relationships between variables
dummy variable
an indicator or binary variable that takes on values of 1 or 0 to describe two categories of a categorical variable
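A minimal sketch of creating a dummy variable, assuming a hypothetical payment_method variable with two categories.

```python
# Hypothetical categorical variable with two categories.
payment_method = ["cash", "card", "card", "cash"]

# Dummy variable: 1 if the payment was by card, 0 otherwise.
is_card = [1 if m == "card" else 0 for m in payment_method]
print(is_card)  # [0, 1, 1, 0]
```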
empirical rule
for a symmetric, bell-shaped distribution, approx 68% of all observations fall within one standard deviation of the mean, approx 95% fall within two standard deviations, and almost all (approx 99.7%) fall within three standard deviations
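The empirical rule percentages can be checked against the standard normal distribution, which is the idealized bell-shaped case. This sketch uses NormalDist from Python's statistics module; real data will match these shares only approximately.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Share of the distribution within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```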
discrete variable
assumes a countable number of values
dimension table
business dimensions of interest such as customer, product, location, and time
continuous variable
characterized by uncountable values within an interval
business analytics
combines qualitative reasoning with quantitative tools to identify key business problems and translate data analysis into decisions that improve business performance
data
compilations of facts, figures, or other contents, both numerical and nonnumerical
population
consists of all observations or items of interest in an analysis
relational database
consists of one or more logically related data files, where each file is a two-dimensional grid that consists of rows and columns
fact table
contains facts about the business operation, often in a quantitative format
standardizing
converting observations into z-scores
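A short sketch of standardizing, assuming three hypothetical exam scores and the sample standard deviation. Each z-score is the observation minus the mean, divided by the standard deviation.

```python
from statistics import mean, stdev

# Hypothetical scores, chosen so the sample standard deviation is 10.
scores = [70, 80, 90]
m = mean(scores)   # 80
s = stdev(scores)  # sample standard deviation

# z = (x - mean) / standard deviation
z_scores = [(x - m) / s for x in scores]
print(z_scores)
```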
cross-sectional data
data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time
time series data
data collected over several time periods focusing on certain groups of people, specific events, or objects
variety
data come in all types, forms, and granularity, both structured and unstructured
velocity
data from a variety of sources get generated at a rapid speed
bar chart
depicts the frequency or the relative frequency for each category of the categorical variable as a series of horizontal or vertical bars
knowledge
derived from a blend of data, contextual information, experience, and intuition
stacked column chart
designed to visualize more than one categorical variable, allows for the comparison of composition within each category
unstructured data
do not conform to a predefined, row-column format
delimited format
each column is separated by a delimiter such as a comma
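A small sketch of reading comma-delimited text with Python's csv module. The column names and values are invented for the example.

```python
import csv
import io

# Hypothetical comma-delimited data: a header row plus two records.
raw = "id,name,amount\n1,Ana,25.50\n2,Ben,40.00\n"

# The delimiter tells the reader where one column ends and the next begins.
rows = list(csv.reader(io.StringIO(raw), delimiter=","))
print(rows)
```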
relative frequency
equals the proportion of observations in each category or interval compared to the whole
measures of dispersion
gauge the variability of a data set (range, interquartile range, mean absolute deviation, variance, and standard deviation)
frequency distribution
groups the data into categories and records the number of observations that fall into each category for a categorical variable, for a numerical variable it groups data into intervals and records the number of observations that fall into each interval
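A minimal sketch of a frequency distribution and the matching relative frequencies for a categorical variable, using hypothetical letter grades.

```python
from collections import Counter

# Hypothetical categorical observations.
grades = ["A", "B", "B", "C", "B", "A", "C", "B"]

# Frequency: number of observations in each category.
frequency = Counter(grades)

# Relative frequency: each category's share of all observations.
relative_frequency = {g: count / len(grades) for g, count in frequency.items()}
print(frequency, relative_frequency)
```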
fixed-width format
a data file format in which each column starts and ends at the same place in every row
correlation coefficient
indicates the direction and the strength of the linear relationship between x and y (-1 = perfect negative lin rel, 0 = not linearly related, 1 = perfect positive lin rel)
covariance
indicates whether x and y have a negative linear relationship, positive linear relationship, or no linear relationship (negative # = negative lin rel, positive # = positive lin rel, 0 = no lin rel)
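A worked sketch of the covariance and the correlation coefficient on made-up x and y values. It assumes the sample versions (dividing by n - 1); the correlation is the covariance divided by the product of the two standard deviations, which is what keeps it between -1 and 1.

```python
from math import sqrt

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance: its sign gives the direction of the linear relationship.
covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Sample standard deviations of x and y.
sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))

# Correlation: covariance rescaled to a unit-free value between -1 and 1.
correlation = covariance / (sd_x * sd_y)
print(covariance, round(correlation, 4))
```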
skewness coefficient
measures the degree to which a distribution is not symmetric about its mean (symmetric = 0, positively skewed = positive, negatively skewed = negative)
z-score
measures the relative position of an observation within a distribution
kurtosis coefficient
measures whether the tails of a distribution are more or less extreme than the normal distribution (normal = 3, excess = coefficient - 3); more extreme = leptokurtic, less extreme = platykurtic
interval scale
observations can be categorized and ranked, and differences between observations are meaningful, the value of zero is arbitrarily chosen
ordinal scale
observations can be categorized and ranked, however the differences between the ranked observations are meaningless
nominal scale
observations differ merely by name or label, the least sophisticated level of measurement
ratio scale
observations have all the characteristics of interval-scaled data as well as a true zero point, meaningful ratios can be calculated
business intelligence
provides organizations and their users with the ability to access and manipulate data interactively
categorical
qualitative, observations represent categories
measures of association
quantify the direction and strength of the linear relationship between two variables (covariance, correlation coefficient)
numerical
quantitative, observations represent meaningful numbers
omission
recommends that observations with missing values be excluded from subsequent analysis
imputation
recommends that the missing values be replaced with some reasonable imputed values
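A brief sketch contrasting omission and imputation on a hypothetical income variable with two missing values. Replacing missing values with the mean of the observed values is only one possible imputation choice and is an assumption here.

```python
# Hypothetical incomes (in thousands); None marks a missing value.
incomes = [52, None, 61, 47, None, 60]

# Omission: drop the observations with missing values.
observed = [x for x in incomes if x is not None]

# Imputation: replace each missing value with the mean of the observed values.
mean_observed = sum(observed) / len(observed)  # 220 / 4 = 55.0
imputed = [x if x is not None else mean_observed for x in incomes]

print(observed)
print(imputed)
```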
descriptive analytics
refers to gathering, organizing, tabulating, and visualizing data to summarize 'what has happened?'
predictive analytics
refers to using historical data to predict 'what could happen in the future?'
prescriptive analytics
refers to using optimization and simulation algorithms to provide advice on 'what should we do?'
measures of central location
describe the way numerical data tend to cluster around some middle or central value (mean, median, mode, percentile)
relationship between entities
represents certain business facts or rules, one-to-one, one-to-many, or many-to-many
structured data
reside in a predefined, row-column format
line chart
shows a numerical variable as a series of data points connected by a line
contingency table
shows the frequencies for two categorical variables, x and y, where each cell represents a mutually exclusive combination of the pair of x and y values
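A minimal sketch of a contingency table built by counting pairs of values. The two categorical variables (customer type and purchase channel) and their observations are hypothetical.

```python
from collections import Counter

# Hypothetical (customer type, purchase channel) pairs.
pairs = [
    ("member", "online"), ("member", "store"), ("guest", "online"),
    ("member", "online"), ("guest", "store"), ("guest", "online"),
]

# Each cell counts one mutually exclusive combination of the two variables.
cells = Counter(pairs)
for customer_type in ("member", "guest"):
    row = [cells[(customer_type, channel)] for channel in ("online", "store")]
    print(customer_type, row)
```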
bubble plot
shows the relationship between three numerical variables in a two-dimensional graph
scatterplot with a categorical variable
shows the relationship between two numerical variables and a categorical variable in a two-dimensional graph
data mart
a small-scale data warehouse that contains only data relevant to certain subjects or decision areas
star schema
a multidimensional data model to which the structure of a data mart conforms
veracity
the credibility and quality of the data
data transformation
the data conversion process from one format or structure to another
range
the difference between the maximum and the minimum observations
interquartile range (IQR)
the difference between the third quartile and the first quartile, does not rely on extreme observations
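A sketch comparing the range and the IQR on two hypothetical data sets that differ only in their largest value. Quartiles are computed with statistics.quantiles using its default method; other software and textbooks may use slightly different quartile rules, so treat the exact quartile values as an assumption.

```python
from statistics import quantiles

def spread(data):
    """Return (range, IQR) for a list of numbers."""
    q1, _, q3 = quantiles(data, n=4)  # first, second, and third quartiles
    return max(data) - min(data), q3 - q1

# Hypothetical data; the second set replaces 7 with an extreme value.
typical = [1, 2, 3, 4, 5, 6, 7]
with_outlier = [1, 2, 3, 4, 5, 6, 70]

range_typical, iqr_typical = spread(typical)
range_outlier, iqr_outlier = spread(with_outlier)

# The range changes a lot; the IQR does not.
print(range_typical, iqr_typical)
print(range_outlier, iqr_outlier)
```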
negatively skewed distribution
the long tail extends off to the left, reflecting the presence of a small number of relatively small observations
positively skewed distribution
the long tail extends off to the right, reflecting the presence of a small number of relatively large observations
median
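the middle observation of a variable after the observations are sorted; with an even number of observations, the average of the two middle observations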
mode
the most frequently occurring observation of a variable
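A quick sketch of the median and the mode using Python's statistics module on made-up observations, including an even-sized case where the median averages the two middle values.

```python
from statistics import median, mode

# Hypothetical observations.
odd_count = [3, 1, 4, 1, 5]  # sorted: 1, 1, 3, 4, 5
even_count = [1, 2, 3, 4]

median_odd = median(odd_count)    # middle value after sorting
median_even = median(even_count)  # average of the two middle values
most_common = mode(odd_count)     # most frequently occurring value

print(median_odd, median_even, most_common)
```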
value
the most important aspect of any analytics initiative
parameter
a numerical characteristic of a population, such as the population mean
standard deviation
the positive square root of the variance
outliers
extremely small or large observations relative to the rest of the data
foreign key
an attribute that refers to the primary key of a related entity
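A small sketch of a primary key and a foreign key in SQLite through Python's sqlite3 module. The customer and orders tables are hypothetical, and the sketch assumes an SQLite build that enforces foreign keys once the PRAGMA is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

# customer_id is the primary key of customer.
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
# orders.customer_id is a foreign key pointing at customer's primary key.
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customer(customer_id), "
    "amount REAL)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Ana')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.5)")

# An order for a customer that does not exist should be rejected.
try:
    conn.execute("INSERT INTO orders VALUES (101, 99, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, order_count)
conn.close()
```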
arithmetic mean
the primary measure of central location
data modeling
the process of defining the structure of a database
subsetting
the process of extracting the parts of a data set that are of interest to the analytics professional
data wrangling
the process of retrieving, cleansing, integrating, transforming, and enriching data to support subsequent data analysis
data management
the process that an organization uses to acquire, organize, store, manipulate, and distribute data
statistic
a numerical characteristic of a sample, such as the sample mean