SEAS 8414 LECTURE 1: What is Data Analysis? Why is it Important?

DISCRETE Data Example

Number of children you want to have, SAT score; any whole-number value

BOX plots

1 categorical and 1 quantitative variable
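
A box plot draws a five-number summary of the quantitative variable for each category. The sketch below computes those summaries; the data and the function name five_number_summary are hypothetical, and the "inclusive" quartile convention is an assumption (other conventions give slightly different quartiles).

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot draws."""
    ordered = sorted(values)
    # Assumed convention: method="inclusive". Other quartile methods exist.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical response times (ms), grouped by one categorical variable.
response_ms = {
    "server_a": [12, 15, 11, 14, 13],
    "server_b": [20, 35, 22, 90, 25],
}
summaries = {name: five_number_summary(v) for name, v in response_ms.items()}

for name, summary in summaries.items():
    print(name, summary)
```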

CONTINGENCY Tables

Two categorical variables; each cell shows the frequency of occurrence of one combination of categories
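
As a minimal sketch, a contingency table can be built by counting how often each pair of categories occurs. The observations below are made up for illustration.

```python
from collections import Counter

# Hypothetical observations: (operating system, alert raised?)
observations = [
    ("linux", "yes"), ("linux", "no"), ("windows", "yes"),
    ("windows", "yes"), ("linux", "no"), ("windows", "no"),
]

counts = Counter(observations)
rows = sorted({os_name for os_name, _ in observations})
cols = sorted({alert for _, alert in observations})

# table[row][col] is how often that combination of categories occurred.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

for r in rows:
    print(r, [table[r][c] for c in cols])
```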

VARIABLE

A quantity or condition that can change

CONTROLLED Variable

A variable that is kept the same for all conditions. In Y = f(X), Y is the dependent variable (the result or output) and X is the independent variable (the input, or controllable variable).

DEPENDENT Variable

A variable that you can observe and measure

ENDPOINT Protection

Algorithms that learn behaviors typical of malware, intended to outperform traditional antivirus

ARTIFICIAL Intelligence

Any technique that enables computers to mimic human intelligence, using logic, if-then rules, decision trees, and ML and DL

CLUSTERING

Can automatically identify the classes to which samples belong when no information about the classes is available in advance (unsupervised). Important for malware and forensic analysis.
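
One common clustering technique is k-means, named here as an illustration rather than as the method the lecture prescribes. This sketch runs it on one-dimensional, unlabeled data; the data, the starting centers, and the function name kmeans_1d are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on 1-D data with caller-supplied starting centers."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

file_sizes_kb = [10, 12, 11, 13, 200, 210, 205, 190]  # hypothetical, no labels
centers, groups = kmeans_1d(file_sizes_kb, centers=[0.0, 100.0])
print(centers, groups)
```

No class labels are supplied; the two groups emerge from the input values alone.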

DATA TYPES

Categorical and Numerical

AI & ML for Cybersecurity

Classification, Clustering, Network Protection, EndPoint Protection, App Security, Suspect Behavior (SUB)

UNSUPERVISED Learning APPLICATION

Clustering, association rule learning, and dimensionality reduction (entity classification, anomaly detection, data exploration). Applied to network protection, endpoint protection, user behavior, and application security.

QUALITATIVE DATA

Consists of descriptive statements, text-based, Stat analysis is harder, collected using interviews, documents, observations

QUANTITATIVE DATA

Data can be measured and expressed numerically, Number-based, Stat Analysis easier, collected using surveys, observations, experiments, interviews

4 TYPES of DATA ANALYTICS

Descriptive, Diagnostic, Predictive, Prescriptive

STATISTICAL INFERENCE

Estimation (point estimates, confidence intervals) and Hypothesis Testing

GRAPHS and TABLES for Categorical Variables

Frequency Distribution Tables, Bar Charts, Pie Charts, Pareto Diagrams

UNSUPERVISED LEARNING

Group and interpret data based only on input data (Clustering)

SUPERVISED LEARNING

Has a training data set and known result/output! Develop predictive model based on both input and output data (include classification and regression)
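
A minimal sketch of classification from labeled data, using a one-nearest-neighbor rule chosen purely for brevity. The training pairs and the function name predict_1nn are hypothetical.

```python
def predict_1nn(train, x):
    """Predict the label of the training example whose feature is closest to x."""
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label

# Hypothetical labeled training data: (links per email, known label).
train = [(0, "ham"), (1, "ham"), (2, "ham"), (8, "spam"), (10, "spam"), (12, "spam")]

print(predict_1nn(train, 3))  # closest training feature is 2
print(predict_1nn(train, 7))  # closest training feature is 8
```

The known outputs (the labels) are what distinguish this from clustering, which sees only the inputs.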

SUSPECT USER BEHAVIOR

Identifying fraud attempts at the very moment they occur is an emerging area for deep learning

HISTOGRAMS

Used to visualize the distribution of a single continuous variable; one of the most common ways to represent numerical data. Each bar's width equals the width of its interval. The bars touch because the intervals are contiguous: where one ends, the next begins.
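
The binning behind a histogram can be sketched as follows. The data, the bin width, and the function name histogram_counts are assumptions for illustration.

```python
def histogram_counts(values, bin_width, start):
    """Count values in adjacent intervals [start + k*w, start + (k+1)*w)."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)  # index of the interval containing v
        left = start + k * bin_width       # left edge of that interval
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

heights_cm = [151, 158, 160, 162, 167, 169, 171, 174, 178, 183]  # hypothetical
bins = histogram_counts(heights_cm, bin_width=10, start=150)

for left, n in bins.items():
    print(f"[{left}, {left + 10}): {'#' * n}")
```

Each interval starts exactly where the previous one ends, which is why the bars of a histogram touch.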

SUPERVISED Learning APPLICATION

Large labeled and partially labeled datasets (malware and spam detection, anomaly detection, risk scoring). Applied to network protection, endpoint protection, user behavior, and application security.

PREDICTIVE

What is likely to happen? Decisions are automated using algorithms and technology. Historical patterns are used to predict specific outcomes.

MEASURES of CENTRAL TENDENCY

Mean, Median, Mode, Quartiles

POPULATION

Measurable quality is a parameter. Complete set. Reports are true representation of opinion. Contains all group members

SAMPLE

Measurable quality is called a statistic. Sample is subset of population. Reports have margin of error and confidence interval. Subset that represents the entire population
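
A rough sketch of how a sample statistic comes with a margin of error and a confidence interval. The sample is made up, and the z value of 1.96 assumes a normal approximation at roughly 95% confidence; for a sample this small a t-based interval would usually be preferred.

```python
import math
import statistics

sample = [52, 48, 50, 47, 53, 49, 51, 50]  # hypothetical measurements

n = len(sample)
mean = statistics.mean(sample)   # point estimate (a statistic)
sd = statistics.stdev(sample)    # sample standard deviation (divides by n - 1)
z = 1.96                         # assumed: normal approximation, ~95% confidence
margin_of_error = z * sd / math.sqrt(n)
ci = (mean - margin_of_error, mean + margin_of_error)

print(f"mean = {mean}, margin of error = {margin_of_error:.2f}")
print(f"approximate 95% confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})")
```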

KURTOSIS

Measure for the degree of peakedness/flatness in variable distribution
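
One common way to quantify this is excess kurtosis, the fourth standardized moment minus 3. The sketch below uses the population form; both datasets and the function name excess_kurtosis are illustrative assumptions.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # evenly spread values
peaked = [5, 5, 5, 5, 1, 9, 5, 5, 5]  # mass at the center plus two extreme values

print(excess_kurtosis(flat))    # negative: flatter than a normal distribution
print(excess_kurtosis(peaked))  # positive: more peaked, heavier tails
```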

SKEWNESS

Measure of asymmetry that indicates whether the observations in a dataset are concentrated on one side

CORRELATION

Measures the strength of linear relationship between X&Y

INTERVALS

No true zero (degrees Celsius, degrees Fahrenheit)

LEVELS of Measurement

Qualitative and Quantitative

GRAPHICAL REPRESENTATION OF MULTIPLE VARIABLE

Scatter Plots, Box Plots, Contingency Tables, Cross Tables

REINFORCEMENT LEARNING

Science of decision making. Learning the optimal behavior in an environment to obtain the maximum reward. Optimal behavior is learned through interactions with the environment and observations of how it responds.

PARETO diagram

Special type of bar chart where the categories are shown in descending order of frequency, and a separate curve shows the cumulative frequency.
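
The two ingredients of a Pareto diagram, descending frequencies and a cumulative curve, can be computed as in this sketch. The incident categories are hypothetical.

```python
from collections import Counter

# Hypothetical incident categories from a ticketing log.
incidents = ["phishing"] * 5 + ["malware"] * 3 + ["misconfig"] * 2

counts = Counter(incidents).most_common()  # categories in descending frequency
total = sum(n for _, n in counts)

pareto = []
running = 0
for category, n in counts:
    running += n  # cumulative frequency so far
    pareto.append((category, n, round(100 * running / total, 1)))

for category, n, cumulative_pct in pareto:
    print(f"{category:10} {n:3} {cumulative_pct:6}%")
```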

MEASURES of DISPERSION/VARIATION

Standard deviation, Variance, Range

MACHINE Learning

Subset of AI. Includes abstruse statistical techniques that enable machines to improve at tasks with experience. The category includes deep learning.

DEEP Learning

Subset of ML. Composed of Algorithms that permit software to train itself to perform tasks, like speech or image recognition, by exposing multilayered neural networks to vast amounts of data.

MACHINE Learning METHODS

Supervised Learning, Unsupervised Learning, Reinforcement Learning

INFERENTIAL STATISTICS

Testing of statement about the population on the basis of sample characteristics

MEAN

The mean is the most widely used measure of central tendency. It is the simple average of the dataset. Easily affected by outliers
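
The effect of an outlier on the mean can be checked on a small made-up dataset, with the median shown for comparison.

```python
import statistics

salaries = [40, 42, 45, 47, 50]  # hypothetical values, in thousands
with_outlier = salaries + [500]  # add one extreme value

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")
```

A single extreme value pulls the mean far from the bulk of the data, while the median barely moves.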

QUANTITATIVE

Two Types. Interval and Ratio

QUALITATIVE

Two qualitative levels 1) nominal and 2) ordinal.

NETWORK Protection

Use ML to implement sophisticated IDS

CLASSIFICATION

Used to properly ID types of similar attacks, like different pieces of malware belonging to the same family, having common characteristics, behavior, signatures, etc.

PIE Charts

Used when we want to see the share of an item as a part of the total. Market shares are commonly represented with a pie chart.

DESCRIPTIVE STATISTICS

Uses graphical representation of data, which is often the best option for summarizing it.

MODE

Value that occurs most often. A dataset can have 0 modes, 1 mode or multiple modes. Calculated simply by finding value with highest frequency.
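
A short sketch of finding modes by counting frequencies. It assumes a non-empty input and adopts the convention that a dataset in which every value occurs once has no mode; the function name modes is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # assumed convention: no repeated value means no mode
    return [v for v, n in counts.items() if n == top]

print(modes([1, 2, 2, 3]))     # one mode
print(modes([1, 1, 2, 3, 3]))  # multiple modes
print(modes([1, 2, 3]))        # no mode under the convention above
```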

PEARSON Correlation COEFFICIENT

Varies between -1 and +1. Perfect positive relationship = 1, No linear relationship = 0, Perfect negative relationship = -1
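
The two extreme values can be reproduced with a from-scratch computation of the coefficient. The data and the function name pearson_r are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
up = [2 * v + 1 for v in x]     # exact increasing line
down = [10 - 3 * v for v in x]  # exact decreasing line

print(pearson_r(x, up), pearson_r(x, down))
```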

BAR charts

Very common. Each bar represents a category. On the y-axis we have the absolute frequency

DISCRETE VARIABLE

Whole number values

INDEPENDENT Variable

a variable that you can control

DISCRETE Data

can usually be counted in a finite manner.

CONTINUOUS VARIABLE

can take any value within a range. By contrast, an example of a categorical variable is sex = female or male.

APP SECURITY

counter DDoS, SQL injection, etc. by using AI and ML tools

CONTINUOUS Data

infinite and impossible to count. Examples: weight, height.

STATISTICS

is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data.

COVARIANCE

is a measure of the joint variability of two variables.

CORRELATION

is a standardized measure of the joint variability of two variables (covariance scaled by the two standard deviations). Values lie between -1 and 1, so the result is easy to interpret.

A Correlation of -1

known as perfect negative correlation, means that one variable is explaining the other one perfectly, but they move in opposite directions.

A Correlation of 1

known as perfect positive correlation, means that one variable is perfectly explained by the other.

LEFT (Negative) Skewness

means that the outliers are to the left.

Right (Positive) Skewness

means that the outliers are to the right (long tail to the right).
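
The sign of skewness can be checked with the third standardized moment. This sketch uses the population form; the datasets and the function name skewness are assumptions.

```python
def skewness(values):
    """Population skewness: third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 20]     # one large value: long tail to the right
left_tailed = [-v for v in right_tailed]  # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
print(skewness(symmetric))     # zero
```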

COVARIANCE of 0

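means that the two variables have no linear relationship. Independent variables have zero covariance, but zero covariance alone does not guarantee independence.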
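
A caution worth checking numerically: a covariance of 0 rules out a linear relationship, not every relationship. In the made-up data below, y is completely determined by x, yet the covariance is exactly 0.

```python
def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # y depends entirely on x, but not linearly

print(covariance(xs, ys))
```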

NEGATIVE COVARIANCE

means that the two variables move in opposite directions.

A Correlation of 0

means that the variables have no linear relationship; they may still be related in a nonlinear way.

POSITIVE COVARIANCE

means two variables move together.

VARIANCE and Standard Deviation

measure the dispersion of a set of data points around its mean value.
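
A small worked example using Python's statistics module. The data are made up; the population forms divide by n, and the sample form divides by n - 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5

pop_variance = statistics.pvariance(data)    # mean squared deviation from the mean
pop_std = statistics.pstdev(data)            # square root of the population variance
sample_variance = statistics.variance(data)  # divides by n - 1 instead of n
data_range = max(data) - min(data)

print(pop_variance, pop_std, sample_variance, data_range)
```

The standard deviation is in the same units as the data, which usually makes it easier to interpret than the variance.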

MEDIAN

midpoint of the ordered dataset. Largely unaffected by outliers.

RATIOS

ratios have a true zero (temperature in kelvins, length)

CROSS TABLES

represent categorical variables. One set of categories labels the rows and another labels the columns.

ORDINAL

represents categories that can be ordered

CATEGORICAL Data

represents groups or categories. Examples: 1. Car brands: Audi, BMW and Mercedes. 2. Answers to yes/no questions.

NUMERICAL Data

represents numbers. It is divided into two groups: 1) discrete and 2) continuous.

FREQUENCY Distribution Tables

show the category and its corresponding absolute frequency.

SCATTER PLOTS

two quantitative variables

COVARIANCE VALUES

values range from -∞ to +∞. This is a problem because such numbers are very hard to put into perspective.
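
One reason covariance is hard to interpret is that it depends on the units of measurement. This sketch, with made-up heights and weights, changes only the units and compares the covariance with the correlation.

```python
import math

def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

height_m = [1.60, 1.70, 1.75, 1.80, 1.90]  # hypothetical heights in meters
weight_kg = [55, 65, 68, 77, 85]           # hypothetical weights
height_cm = [100 * h for h in height_m]    # same heights, different units

cov_m, cov_cm = covariance(height_m, weight_kg), covariance(height_cm, weight_kg)
corr_m, corr_cm = correlation(height_m, weight_kg), correlation(height_cm, weight_kg)
print(cov_m, cov_cm)    # covariance grows 100-fold with the unit change
print(corr_m, corr_cm)  # correlation is unchanged
```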

PRESCRIPTIVE

what do I need to do? Recommend actions and strategies by testing outcomes; apply advanced analytical techniques to make specific recommendations.

DESCRIPTIVE

what is happening to my business? Comprehensive, accurate, live data, effective visualization

DIAGNOSTIC

why is this happening? Drill down to root cause, isolate all confounding information

