SEAS 8414 LECTURE 1: What is Data Analysis? Why is it Important?
DISCRETE Data Example
# of children you want to have, SAT score, whole number value
BOX plots
1 categorical and 1 quantitative variable
CONTINGENCY Tables
2 categorical variables; shows the frequency of occurrence of each combination of categories
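As a rough illustration of how such a table is tallied, the sketch below counts pairs of categories with Python's standard library. The observations (operating system, whether an alert was raised) are invented for the example and are not from the lecture.

```python
from collections import Counter

# Hypothetical observations: (operating system, alert raised?) pairs.
observations = [
    ("Windows", "yes"), ("Windows", "no"), ("Linux", "no"),
    ("Windows", "yes"), ("Linux", "yes"), ("Linux", "no"),
]

# Count how often each (row category, column category) pair occurs.
counts = Counter(observations)

rows = sorted({r for r, _ in observations})
cols = sorted({c for _, c in observations})

# Nested dict: table[row][col] -> frequency of that combination.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

print("OS", cols)
for r in rows:
    print(r, [table[r][c] for c in cols])
```

Each cell holds the number of observations that fall into both its row category and its column category, so the cells sum to the number of observations.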
VARIABLE
A quantity or condition that can change
CONTROLLED Variable
A variable that is kept the same for all conditions. In Y = f(X), Y is the dependent variable (the result or output) and X is the independent variable (the input, or controllable variable).
DEPENDENT Variable
A variable that you can observe and measure
ENDPOINT Protection
Algorithms that learn behaviors typical to malware, better than Antivirus
ARTIFICIAL Intelligence
Any technique that enables computers to mimic human intelligence, using logic, if-then rules, decision trees, and ML and DL
CLUSTERING
Can automatically identify the classes to which samples belong when no information about the classes is available in advance (unsupervised). Important for malware and forensic analysis.
DATA TYPES
Categorical and Numerical
AI & ML for Cybersecurity
Classification, Clustering, Network Protection, EndPoint Protection, App Security, Suspect Behavior (SUB)
UNSUPERVISED Learning APPLICATION
Clustering, association rule learning, dimensionality reduction (entity classification, anomaly detection, data exploration). Used in network protection, endpoint protection, user behavior, and application security.
QUALITATIVE DATA
Consists of descriptive statements, text-based, Stat analysis is harder, collected using interviews, documents, observations
QUANTITATIVE DATA
Data can be measured and expressed numerically, Number-based, Stat Analysis easier, collected using surveys, observations, experiments, interviews
4 TYPES of DATA ANALYTICS
Descriptive, Diagnostic, Predictive, Prescriptive
STATISTICAL INFERENCE
Estimation (point estimates, confidence intervals) and hypothesis testing
GRAPHS and TABLES for Categorical Variables
Frequency Distribution Tables, Bar Charts, Pie Charts, Pareto Diagrams
UNSUPERVISED LEARNING
Group and interpret data based only on input data (Clustering)
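A minimal sketch of grouping data from inputs alone. The notes do not name an algorithm, so k-means is used here purely as an assumption; the points and the initial centers are made up for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny k-means for 1-D data; `centers` holds the initial guesses."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled measurements with two visible groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])

print(centers)
print(clusters)
```

No labels are supplied anywhere: the two groups emerge only from the distances between the input values.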
SUPERVISED LEARNING
Has a training data set and known result/output! Develop predictive model based on both input and output data (include classification and regression)
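To make "input and output data" concrete, here is a small regression sketch. Ordinary least squares is an assumed choice of model, and the training pairs are invented; the point is only that known outputs drive the fit.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training set: inputs with their known outputs.
train_x = [1, 2, 3, 4]
train_y = [3, 5, 7, 9]

slope, intercept = fit_line(train_x, train_y)

# Use the learned model to predict the output for an unseen input.
prediction = slope * 5 + intercept
print(slope, intercept, prediction)
```

A classifier follows the same pattern, except that the known outputs are class labels rather than numbers.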
SUSPECT USER BEHAVIOR
Identifying attempts at fraud at the very moment they occur is an emerging area for deep learning
HISTOGRAMS
Used to visualize the distribution of a SINGLE continuous variable. One of the most common ways to represent numerical data. Each bar has a width equal to the width of its interval. The bars touch because the intervals are contiguous: where one ends, the next begins.
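The binning behind a histogram can be sketched in a few lines. The values, the range 0 to 5, and the choice of five equal-width bins are all assumptions made for this example.

```python
# Hypothetical continuous measurements (e.g. response times in seconds).
values = [0.5, 1.2, 1.9, 2.1, 2.4, 3.7, 3.9, 4.0]

low, high, n_bins = 0.0, 5.0, 5
width = (high - low) / n_bins  # every bar has the same width

counts = [0] * n_bins
for v in values:
    # Index of the interval [low + i*width, low + (i+1)*width) containing v;
    # a value equal to `high` is placed in the last bin.
    i = min(int((v - low) // width), n_bins - 1)
    counts[i] += 1

# Text rendering: adjacent intervals share their boundaries, so bars touch.
for i, c in enumerate(counts):
    left = low + i * width
    print(f"[{left:.1f}, {left + width:.1f}): {'#' * c}")
```

Because each interval starts exactly where the previous one ends, every value lands in one and only one bar.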
SUPERVISED Learning APPLICATION
Large labeled data sets and data with missing labels (malware and spam detection, anomaly detection, risk scoring). Used in network protection, endpoint protection, user behavior, and application security.
PREDICTIVE
Likely to happen? Decisions automated using algorithms, Technology. Historical patterns to predict specific outcomes.
MEASURES of CENTRAL TENDENCY
Mean, Median, Mode, Quartiles
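These measures can be computed with Python's standard `statistics` module. The dataset is made up, with one deliberate outlier to show how differently the measures react.

```python
import statistics

# Hypothetical dataset; 100 is a deliberate outlier.
data = [2, 3, 3, 5, 7, 8, 100]

mean = statistics.mean(data)                  # simple average
median = statistics.median(data)              # midpoint of the ordered data
mode = statistics.mode(data)                  # most frequent value
quartiles = statistics.quantiles(data, n=4)   # three cut points (Q1, Q2, Q3)

print(f"mean={mean:.2f} median={median} mode={mode} quartiles={quartiles}")
```

The outlier pulls the mean well above most of the values, while the median and mode stay with the bulk of the data.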
POPULATION
Measurable quality is a parameter. Complete set. Reports are true representation of opinion. Contains all group members
SAMPLE
Measurable quality is called a statistic. Sample is subset of population. Reports have margin of error and confidence interval. Subset that represents the entire population
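The parameter/statistic distinction can be seen numerically. In this sketch the population (the integers 1 to 100), the every-tenth-member sampling scheme, and the normal-approximation margin of error with z = 1.96 are all assumptions chosen to keep the example small.

```python
import math
import statistics

# Hypothetical population: the integers 1..100 stand in for all group members.
population = list(range(1, 101))
parameter = statistics.mean(population)   # population mean (a parameter)

# Assumption: take every 10th member as the sample (systematic sampling).
sample = population[::10]
statistic = statistics.mean(sample)       # sample mean (a statistic)

# Rough 95% margin of error using the normal approximation (z = 1.96).
# With only 10 observations a t-based interval would be somewhat wider.
margin = 1.96 * statistics.stdev(sample) / math.sqrt(len(sample))
ci_low, ci_high = statistic - margin, statistic + margin

print(f"parameter = {parameter}, statistic = {statistic}")
print(f"approximate 95% interval: ({ci_low:.1f}, {ci_high:.1f})")
```

The statistic does not equal the parameter, which is why sample-based reports carry a margin of error and a confidence interval.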
KURTOSIS
Measure for the degree of peakedness/flatness in variable distribution
SKEWNESS
Measure of asymmetry that indicates whether the observations in a dataset are concentrated on one side
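Skewness and kurtosis can both be written as standardized central moments. The sketch below uses the population (divide-by-n) moment formulas, which is an assumption; statistical packages often apply small-sample corrections. The datasets are invented.

```python
def central_moment(data, k):
    """k-th central moment, dividing by n (population form)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    # Third standardized moment: positive means a longer right tail.
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Fourth standardized moment minus 3, so a normal distribution scores 0.
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

right_tailed = [1, 2, 3, 4, 10]          # one large value on the right
left_tailed = [-x for x in right_tailed]  # mirror image
symmetric = [1, 2, 3, 4, 5]

for name, d in [("right", right_tailed), ("left", left_tailed), ("sym", symmetric)]:
    print(name, round(skewness(d), 3), round(excess_kurtosis(d), 3))
```

The sign of the skewness follows the side of the long tail, and a symmetric dataset scores zero.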
CORRELATION
Measures the strength of the linear relationship between X and Y
INTERVALS
No true zero (degrees Celsius, degrees Fahrenheit)
LEVELS of Measurement
Qualitative and Quantitative
GRAPHICAL REPRESENTATION OF MULTIPLE VARIABLE
Scatter Plots, Box, Contingency Tables, Cross Tables
REINFORCEMENT LEARNING
Science of decision making. Learning optimal behavior in an environment to obtain maximum reward. Optimal behavior is learned through interactions with the environment and observations of how it responds.
PARETO diagram
Special type of bar chart where the categories are shown in descending order of frequency, and a separate curve shows the cumulative frequency.
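The numbers behind a Pareto diagram are a sorted frequency table plus a running total. The incident categories and counts below are hypothetical.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical incident log: one entry per observed incident.
incidents = ["phishing"] * 5 + ["malware"] * 3 + ["ddos"] * 2

# most_common() returns (category, frequency) pairs in descending order.
freq = Counter(incidents).most_common()
total = sum(count for _, count in freq)

# Running total of frequencies, expressed as a percentage of all incidents.
cumulative = list(accumulate(count for _, count in freq))
cumulative_pct = [100 * c / total for c in cumulative]

for (category, count), pct in zip(freq, cumulative_pct):
    print(f"{category:10s} {count:3d} {pct:6.1f}%")
```

The bars would be drawn from `freq` and the separate curve from `cumulative_pct`, which always ends at 100%.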
MEASURES of DISPERSION/VARIATION
Standard deviation, Variance, Range
MACHINE Learning
Subset of AI. Includes abstruse statistical techniques that enable machines to improve at tasks with experience. The category includes deep learning.
DEEP Learning
Subset of ML. Composed of Algorithms that permit software to train itself to perform tasks, like speech or image recognition, by exposing multilayered neural networks to vast amounts of data.
MACHINE Learning METHODS
Supervised Learning, Unsupervised Learning, Reinforcement Learning
INFERENTIAL STATISTICS
Testing of statement about the population on the basis of sample characteristics
MEAN
The mean is the most widely used measure of central tendency. It is the simple average of the dataset. Easily affected by outliers.
QUANTITATIVE
Two Types. Interval and Ratio
QUALITATIVE
Two qualitative levels 1) nominal and 2) ordinal.
NETWORK Protection
Use ML to implement sophisticated IDS
CLASSIFICATION
Used to properly ID types of similar attacks, like different pieces of malware belonging to the same family, having common characteristics, behavior, signatures, etc.
PIE Charts
Used when we want to see the share of an item as a part of the total. Market shares are often represented with a pie chart.
DESCRIPTIVE STATISTICS
Uses Graphical Representation of Data. Best option.
MODE
Value that occurs most often. A dataset can have 0 modes, 1 mode or multiple modes. Calculated simply by finding value with highest frequency.
PEARSON Correlation COEFFICIENT
Varies between -1 and +1. Perfect positive relationship = 1, No relationship = 0, Perfect negative relationship = -1
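A short sketch of the coefficient, computed from its textbook definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the data are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4]
r_up = pearson_r(xs, [2, 4, 6, 8])     # y rises in lockstep with x
r_down = pearson_r(xs, [8, 6, 4, 2])   # y falls in lockstep with x
r_flat = pearson_r(xs, [1, 3, 3, 1])   # no linear trend

print(r_up, r_down, r_flat)
```

The three cases land at the two ends and the middle of the -1 to +1 scale.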
BAR charts
Very common. Each bar represents a category. On the y-axis we have the absolute frequency
DISCRETE VARIABLE
Whole number values
INDEPENDENT Variable
a variable that you can control
DISCRETE Data
can usually be counted in a finite manner.
CONTINUOUS VARIABLE
can take any value within a range. By contrast, an EXAMPLE of a categorical variable is Sex = female or male.
APP SECURITY
counter DDoS, SQL injection, etc. by using AI and ML tools
CONTINUOUS Data
infinite and impossible to count. Examples: weight, height.
STATISTICS
is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data.
COVARIANCE
is a measure of the joint variability of two variables.
CORRELATION
is a measure of the joint variability of two variables that can be thought of as a standardized covariance. Values lie between -1 and 1, so the result is easy to interpret.
A Correlation of -1
known as perfect negative correlation, means that one variable is explaining the other one perfectly, but they move in opposite directions.
A Correlation of 1
known as perfect positive correlation, means that one variable is perfectly explained by the other.
LEFT (Negative) Skewness
means that the outliers are to the left.
Right (Positive) Skewness
means that the outliers are to the right (long tail to the right).
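COVARIANCE of 0
means that the two variables have no linear relationship. Independent variables have zero covariance, but zero covariance alone does not guarantee independence.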
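A caveat worth checking numerically: covariance only detects linear co-movement. In the made-up data below, y is completely determined by x (y = x squared), yet the sample covariance comes out as zero.

```python
def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]   # y depends entirely on x, but not linearly

cov = covariance(xs, ys)
print(cov)
```

The positive and negative products cancel exactly, so a covariance of 0 here signals the absence of a linear relationship rather than the absence of any relationship.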
NEGATIVE COVARIANCE
means that the two variables move in opposite directions.
A Correlation of 0
means that the variables have no linear relationship (they are uncorrelated, which does not by itself imply independence).
POSITIVE COVARIANCE
means two variables move together.
VARIANCE and Standard Deviation
measure the dispersion of a set of data points around its mean value.
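A brief sketch using Python's standard `statistics` module on invented data. It shows both the population form (divide by n) and the sample form (divide by n - 1), since the notes do not say which is intended.

```python
import statistics

# Hypothetical dataset with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(data) - min(data)

pop_variance = statistics.pvariance(data)     # average squared deviation (n)
pop_std = statistics.pstdev(data)             # square root of the above

sample_variance = statistics.variance(data)   # divides by n - 1 instead
sample_std = statistics.stdev(data)

print(value_range, pop_variance, pop_std)
print(round(sample_variance, 3), round(sample_std, 3))
```

The standard deviation is in the same units as the data, which usually makes it easier to read than the variance.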
MEDIAN
midpoint of the ordered dataset. Much less affected by outliers than the mean.
RATIOS
ratios have a true zero (temperature in kelvins, length)
CROSS TABLES
represent categorical variables. One set of categories labels the rows and another labels the columns.
ORDINAL
represents categories that can be ordered
CATEGORICAL Data
represents groups or categories. Examples: 1. Car brands: Audi, BMW and Mercedes. 2. Answers to yes/no questions: "yes" and "no".
NUMERICAL Data
represents numbers. It is divided into two groups: 1) discrete and 2) continuous.
FREQUENCY Distribution Tables
show the category and its corresponding absolute frequency.
SCATTER PLOTS
two quantitative variables
COVARIANCE VALUES
values range from -∞ to +∞. This is a problem because such numbers are very hard to put into perspective.
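One reason raw covariance is hard to put into perspective is that it changes with the units of measurement, while correlation does not. The heights and weights below are invented, and the helper names are illustrative.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements of the same four people.
heights_m = [1.5, 1.6, 1.7, 1.8]
weights_kg = [50, 60, 65, 80]
heights_cm = [h * 100 for h in heights_m]   # same heights, different unit

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

print(cov_m, cov_cm)   # covariance grows 100-fold with the unit change
print(r_m, r_cm)       # correlation is unchanged
```

Switching from metres to centimetres multiplies the covariance by 100 without changing the underlying relationship, whereas the correlation stays put inside the -1 to 1 range.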
PRESCRIPTIVE
what do I need to do? Recommended actions/strategies testing outcomes, apply advanced analytical techniques, to make specific recommendations.
DESCRIPTIVE
what is happening to my business? Comprehensive, accurate, live data, effective visualization
DIAGNOSTIC
why is this happening? Drill down to root cause, isolate all confounding information