Analytics CH 3
diagnostic analytics
Procedures that explore the current data to determine why something has happened the way it has, typically comparing the data to a benchmark. As an example, these allow users to drill down in the data and see how it compares to a budget, a competitor, or a trend.
pruning
removes branches from a decision tree to avoid overfitting the model
decision support systems
rule-based systems that gather data and recommend actions based on the input.
summary statistics, data reduction or filtering
types of descriptive analytics
profiling, clustering, similarity matching, co-occurrence grouping
types of diagnostic analytics
regression, classification, link prediction
types of predictive analytics
decision trees
used to divide data into smaller groups.
data reduction or filtering
used to reduce the number of observations to focus on relevant items (e.g., highest cost, highest risk, largest impact). It does this by taking a large set of data (perhaps the population) and reducing it to a smaller set that has the vast majority of the critical information of the larger set.
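A minimal sketch of data reduction by filtering, assuming a made-up list of invoices and an arbitrary 5,000 cutoff (both are illustrative, not from the chapter):

```python
# Hypothetical invoice records: (invoice_id, amount). The data are made up.
invoices = [
    ("A1", 120.0),
    ("A2", 9800.0),
    ("A3", 45.5),
    ("A4", 15250.0),
    ("A5", 300.0),
]

def reduce_by_threshold(records, threshold):
    """Keep only records at or above the threshold, largest amount first."""
    kept = [r for r in records if r[1] >= threshold]
    return sorted(kept, key=lambda r: r[1], reverse=True)

# The 5000.0 cutoff is an assumed "relevant item" rule, not a standard value.
high_value = reduce_by_threshold(invoices, threshold=5000.0)
print(high_value)  # [('A4', 15250.0), ('A2', 9800.0)]
```

The reduced set is what gets interpreted and followed up on; the rest of the population is set aside.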
linear classifiers
useful for ranking items rather than simply predicting class probability. useful for determining the really important values, such as valuable customers, or which transactions are most likely fraudulent.
similarity matching
a grouping technique used to identify similar individuals based on data known about them.
goal of classification
to predict whether an individual will belong to one class or another
causal modeling
A data approach similar to regression, but used when it is hypothesized that the independent variables cause, or are associated with, the dependent variable.
support vector machine
A discriminating classifier that is defined by a separating hyperplane that works first to find the widest margin (or biggest pipe).
XBRL (eXtensible Business Reporting Language)
A global standard for exchanging financial reporting information that uses XML.
decision support system
An information system that supports decision-making activity within a business by combining data and expertise to solve problems and perform calculations
benford's law
An observation about the frequency of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.
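As a rough illustration, the expected share of values with leading digit d under Benford's law is log10(1 + 1/d). The sketch below compares that expectation with the leading digits of a small made-up list of amounts; the helper names and sample data are assumptions for illustration only.

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected proportion of values whose leading digit is d (1 through 9)."""
    return math.log10(1 + 1 / d)

def leading_digit(x):
    """First nonzero digit of a nonzero number, read from scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def observed_proportions(values):
    """Share of nonzero values starting with each digit 1-9."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        return {d: 0.0 for d in range(1, 10)}
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Hypothetical amounts; a real test would use a much larger population.
amounts = [1200, 150, 98, 1.75, 230, 19, 3100, 110]
observed = observed_proportions(amounts)
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), round(observed[d], 3))
```

Large gaps between the observed and expected columns are a prompt for follow-up, not proof of a problem.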
unsupervised approach
Approach used for data exploration looking for potential patterns of interest
supervised approach/method
Approach used to learn more about the basic relationships between independent and dependent variables that are hypothesized to exist.
prescriptive analytics
Procedures that model data to enable recommendations for what should be done in the future. These typically include developing more advanced machine learning and artificial intelligence models to recommend a course of action based on a current problem.
descriptive analytics
Procedures that summarize existing data to determine what has happened in the past. Some examples include summary statistics (e.g. Count, Min, Max, Average, Median), distributions, and proportions.
predictive analytics
Procedures used to generate a model that can be used to determine what is likely to happen in the future. Examples include regression analysis, forecasting, classification, and other predictive modeling.
clustering algorithms
calculate the distances between observations and group together the elements that are closest to one another
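One common way to group by distance is a k-means-style loop: assign each observation to its nearest center, then move each center to the mean of its group. The sketch below is one such approach under assumed data and assumed starting centers, not the only clustering algorithm.

```python
import math

def assign_to_nearest(points, centroids):
    """Group each point with the centroid it is closest to (Euclidean distance)."""
    groups = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        groups[nearest].append(p)
    return groups

def update_centroids(groups, centroids):
    """Move each centroid to the mean of its group (unchanged if the group is empty)."""
    updated = []
    for i, c in enumerate(centroids):
        members = groups[i]
        if members:
            updated.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
        else:
            updated.append(c)
    return updated

# Hypothetical 2-D observations and starting guesses for two centers.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

for _ in range(5):  # a fixed, small number of passes is enough for this toy data
    groups = assign_to_nearest(points, centroids)
    centroids = update_centroids(groups, centroids)

print(groups)
```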
structured data
data that are stored in a database or spreadsheet and are readily searchable.
f(independent variable)
dependent variable =
summary statistics
describe a set of data in terms of their location (mean, median), range (standard deviation, minimum, maximum), shape (quartile), and size (count).
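A minimal sketch of these summary statistics using Python's standard `statistics` module; the list of amounts is made up for illustration.

```python
import statistics

# Hypothetical transaction amounts.
amounts = [12.0, 15.0, 15.0, 18.0, 22.0, 30.0, 95.0]

summary = {
    "count": len(amounts),                            # size
    "mean": statistics.mean(amounts),                 # location
    "median": statistics.median(amounts),             # location
    "stdev": statistics.stdev(amounts),               # spread (sample standard deviation)
    "min": min(amounts),                              # spread
    "max": max(amounts),                              # spread
    "quartiles": statistics.quantiles(amounts, n=4),  # shape (three cut points)
}

for name, value in summary.items():
    print(name, value)
```

Note how the single large value (95.0) pulls the mean well above the median, which is one reason to report both.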
co-occurrence grouping
discovers associations between individuals based on common events, such as transactions they are involved in.
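One simple way to sketch co-occurrence grouping is to count how often each pair of individuals appears in the same transaction. The names and transactions below are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Each set lists the parties involved in one (made-up) transaction.
transactions = [
    {"alice", "bob"},
    {"alice", "bob", "carol"},
    {"bob", "carol"},
    {"alice", "bob"},
]

pair_counts = Counter()
for parties in transactions:
    # Sorting gives each pair one canonical order, e.g. ('alice', 'bob').
    for pair in combinations(sorted(parties), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
```

Pairs with high counts are candidates for an association worth investigating.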
regression
estimates or predicts the numerical value of a dependent variable based on the slope and intercept of a line and the value of an independent variable.
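A minimal sketch of simple linear regression, assuming ordinary least squares with one independent variable; the data are made up so that the true line is y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations of an independent (xs) and dependent (ys) variable.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
predicted = intercept + slope * 6  # predict the dependent variable for x = 6
print(slope, intercept, predicted)
```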
training data
existing data that have been manually evaluated and assigned a class.
test data
existing data used to evaluate the model
clustering
helps identify groups of individuals (such as customers) that share common underlying characteristics—in other words, identifying groups of similar data elements and the underlying drivers of those groups.
profiling
identifies the "typical" behavior of an individual, group, or population by compiling summary statistics about the data (including mean, standard deviations, etc.) and comparing individuals to the population
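A minimal sketch of profiling with a z-score, which measures how many standard deviations an individual sits from the population mean. The expense figures and the 2.0 threshold are assumptions chosen for illustration.

```python
import statistics

# Hypothetical monthly expense claims per employee.
expenses = {
    "emp01": 410.0, "emp02": 385.0, "emp03": 440.0, "emp04": 395.0,
    "emp05": 1950.0, "emp06": 420.0, "emp07": 400.0, "emp08": 430.0,
}

values = list(expenses.values())
mean = statistics.mean(values)
stdev = statistics.stdev(values)  # sample standard deviation

# Compare each individual to the population.
z_scores = {emp: (amount - mean) / stdev for emp, amount in expenses.items()}

THRESHOLD = 2.0  # assumed boundary; set it to suit the activity being profiled
exceptions = [emp for emp, z in z_scores.items() if abs(z) > THRESHOLD]
print(exceptions)  # ['emp05']
```

The resulting exception list is the starting point for follow-up, not a conclusion in itself.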
machine learning and artificial intelligence
learning models or intelligent agents that adapt to new external data to recommend a course of action.
fuzzy matching
locates approximate matches. useful for identifying relationships in imperfect data
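A minimal sketch of fuzzy matching using `difflib.get_close_matches` from Python's standard library. The vendor names, the `best_match` helper, and the 0.8 similarity cutoff are hypothetical choices for illustration.

```python
import difflib

# Hypothetical vendor master list.
vendor_master = ["Acme Supplies Inc", "Globex Corporation", "Initech LLC"]

def best_match(name, candidates, cutoff=0.8):
    """Return the closest candidate at or above the similarity cutoff, else None."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A misspelled name still finds its likely counterpart.
print(best_match("Acme Suplies Inc.", vendor_master))
# A name with no close counterpart returns None.
print(best_match("Zzz Unknown", vendor_master))
```

The comparison is case-sensitive, so normalizing case and punctuation first is usually worthwhile on imperfect data.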
decision boundaries
mark the split between one class and another.
overfitting
models that fit the training data too closely. they look very accurate on existing data but are actually pretty bad at predicting a future observation
classification
predicts a class or category for a new observation based on the manual identification of classes from previous observations.
link prediction
predicts a relationship between two data items, such as members of a social media platform.
1. identify the classes you wish to predict 2. manually classify an existing set of records 3. select a set of classification models 4. divide your data into training and testing sets 5. generate your model 6. interpret the results and select the "best" model
steps for classification
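The classification steps can be sketched with a deliberately tiny model: a single amount threshold learned from training data and checked against test data. The labeled transactions, the split, and the threshold model are all assumptions for illustration; a real project would compare several models in step 3.

```python
# Steps 1-2: classes are fraud (1) / not fraud (0); records were labeled by hand.
# Each record is (amount, label). The data are made up.
labeled = [
    (20, 0), (35, 0), (50, 0), (65, 0), (80, 0),
    (900, 1), (1200, 1), (1500, 1), (700, 1), (45, 0),
]

# Step 4: divide into training and testing sets (a fixed split, for repeatability).
training = labeled[:7]
testing = labeled[7:]

def accuracy(threshold, records):
    """Share of records where 'amount >= threshold' agrees with the label."""
    correct = sum(1 for amount, label in records if int(amount >= threshold) == label)
    return correct / len(records)

def fit_threshold(records):
    """Step 5: pick the training amount that best separates the two classes."""
    candidates = sorted(amount for amount, _ in records)
    return max(candidates, key=lambda t: accuracy(t, records))

threshold = fit_threshold(training)
train_accuracy = accuracy(threshold, training)
test_accuracy = accuracy(threshold, testing)

# Step 6: interpret the results by comparing training and test accuracy.
print(threshold, train_accuracy, test_accuracy)
```

In this toy data the threshold is perfect on the training set but misses one test record, which is why the model is judged on test data rather than on the data it was built from.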
1. identify the attribute to reduce/focus on 2. filter the results 3. interpret the results 4. follow up on the results
steps for data reduction
1. identify objects or activity to profile 2. determine types of profiling to perform 3. set boundaries/thresholds for the activity 4. interpret the results and monitor activity and/or generate a list of exceptions 5. follow up on exceptions
steps for profiling
1. identify the variables that might predict the outcome 2. determine the functional form of the relationship 3. identify the parameters of the model
steps for regression