Data Science

Considering Anomaly distributions

This is not necessarily true, but certain algorithms like KDE and OC-SVM might make that assumption ahead of time. This can be mitigated by semi-supervised or supervised anomaly detection approaches, or by modeling anomalies beforehand.

Latent Dirichlet Allocation

Topic modeling that estimates the probability that a variable or document falls into each of n topics. The number of topics is a hyperparameter that can be set.

Factorization Machines

A supervised learning model for high-dimensional and sparse data that predicts missing entries in a matrix (the matrix is var1 by var2). Used for classification, recommenders, and regression.

Top2Vec

Takes the concept of semantic embedding (documents and words sharing a semantic space, allowing them to be compared) and finds the centroids of densely located document vectors, using the closest word to represent each centroid, aka the topic vector. The space produces topic, document, and word vectors: query word vectors -> topic vectors -> document vectors.

Association Rule Mining, Support

The fraction of the support count over all existing sets, even repeats, aka transactions: support = s({X,Y}) / |transactions|.

The Concentration/Cluster Assumption

The region where normal data lives can be bounded; basically, there is a threshold of normality.

Association Rule Mining, Closure

The closure of a set, cl(S), can be defined as the intersection of all closed sets containing S: cl(S) = ⋂ { C | C closed, S ⊆ C }. All closures are themselves closed sets, because an intersection of closed sets is closed.

Anomaly Scale

The features may define one concept or represent multiple contexts

objective function

The function being maximized or minimized in Linear Programming

Density Level Set Estimation

The goal of density level set estimation is to generate an estimate Ĝ of the level set based on the n observations {X1, ..., Xn}, such that the error between the estimator Ĝ and the target set G, as assessed by some performance measure which gauges the closeness of the two sets, is small.

Kernel Density Estimation (KDE)

Where histograms are bricks, KDEs are piles of sand. A model of the intensity of a spatial process, particularly useful at unobserved locations. It is a non-parametric way to estimate the probability density function of a random variable: non-parametric methods are either distribution-free or assume a specified distribution with the distribution's parameters left unspecified. We often assume our data comes from a normal distribution, or at the very least from a distribution with mean μ and variance σ^2; nonparametric methods are not based on such parameters. Kernel density estimation is a fundamental data-smoothing problem where inferences about the population are made based on a finite data sample. It is a common technique used to estimate probability density functions (PDFs): a kernel is assigned to each data point, and the sum of the kernels forms the density, i.e., the area under the total of the kernels. https://towardsdatascience.com/histograms-vs-kdes-explained-ed62e7753f12
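
A minimal KDE sketch, assuming scikit-learn is available; the sample data and bandwidth below are illustrative, not from the card:

```python
# Fit a Gaussian KDE and evaluate the estimated density on a small grid.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 1))   # toy 1-D sample

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
grid = np.linspace(-4, 4, 9).reshape(-1, 1)
log_density = kde.score_samples(grid)                # log p(x) at grid points
print(np.exp(log_density))                           # estimated density values
```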

Triangle Inequality Theorem

The sum of the lengths of any two sides of a triangle is greater than the length of the third side.

Prototype Assumption

There is a finite set of prototypical elements in the data space that characterize the data.

Density Estimation and Probabilistic Anomaly Detection Models

These models predict anomalies by estimating the probability distribution of normal data. a) Classic density estimation i) Energy-Based Models b) Deep statistical models i) Neural generative models (VAEs & GANs). Generally based on negative log-likelihood anomaly scores ~ continuous ~ ranking.

One Class Classification Detection Anomaly Models

This discriminative approach avoids a full density estimation by focusing on detecting the boundaries of the dense areas, with low error when evaluating new data. It generally provides a binary density level set detector; better at detecting what isn't normal, and sample-efficient. A) Support Vector Data Description B) Kernel-based OCC C) Deep SVDD D) Deep OC-SVM

Pointwise Mutual Information (PMI)

A measure of association of two variables: PMI(x, y) = log p(x, y) / (p(x) p(y)) = log p(x|y) / p(x). It can take positive or negative values, but is zero if X and Y are independent; it is not confined to the [0,1] range. A positive value means the two events co-occur with a frequency higher than we would expect if they were independent. If p(x|y)/p(x) is smaller than 1, the log is negative: p(x|y) < p(x) means observing Y=y makes x less likely than its base rate. Pointwise mutual information represents a quantified measure of how much more or less likely we are to see the two events co-occur, given their individual probabilities, and relative to the case where the two are completely independent.
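
A small sketch of PMI computed from a hypothetical 2x2 table of co-occurrence counts (the events and counts are made up for illustration):

```python
# PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ); positive means "co-occur more than chance".
import math

counts = {("rain", "umbrella"): 30, ("rain", "sun"): 5,
          ("dry", "umbrella"): 10, ("dry", "sun"): 55}
total = sum(counts.values())

def pmi(x, y):
    p_xy = counts[(x, y)] / total
    p_x = sum(v for (a, _), v in counts.items() if a == x) / total
    p_y = sum(v for (_, b), v in counts.items() if b == y) / total
    return math.log2(p_xy / (p_x * p_y))

print(pmi("rain", "umbrella"))   # > 0: co-occur more often than independence predicts
print(pmi("rain", "sun"))        # < 0: co-occur less often than independence predicts
```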

Minkowski Distance

A metric in a normed vector space which can be considered a generalization of both the Euclidean distance and the Manhattan distance: d(x, y) = (Σ_i |x_i − y_i|^p)^(1/p).
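
A minimal sketch of the formula above; the points are illustrative:

```python
# Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean.
import numpy as np

def minkowski(x, y, p):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))   # 7.0  (Manhattan)
print(minkowski(x, y, 2))   # 5.0  (Euclidean)
```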

Skip Grams

is actually the opposite of CBOW: instead of predicting one word each time, we use 1 word to predict all surrounding words ("context"). Skip gram is much slower than CBOW, but considered more accurate with infrequent words.

Manifold learning (Nonlinear)

is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.

Association Rule Mining, Minimum confidence

An arbitrary threshold used to filter association rules: since all rules have a confidence, they must satisfy confidence ≥ minconf.

Association Rule Mining, Minimum Support

An arbitrary threshold used to filter association rules: since all rules have a support, they must satisfy support ≥ minsup.

Generative Adversarial Networks (GANs)

An unsupervised learning task in machine learning that involves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset. The generative model frames the problem as a supervised learning problem with two sub-models: A) The generator model, which we train to generate new examples; it takes a fixed-length random vector as input and generates a sample in the domain. The vector is drawn randomly from a Gaussian distribution and is used to seed the generative process. B) The discriminator model, which tries to classify examples as either real (from the domain) or fake (generated); it takes an example from the domain as input (real or generated) and predicts a binary class label of real or fake (generated). The real examples come from the training dataset; the generated examples are output by the generator model. The two models are trained together in a zero-sum, adversarial game until the discriminator model is fooled about half the time, meaning the generator model is generating plausible examples. https://towardsdatascience.com/introduction-to-normalizing-flows-d002af262a4b

Information Gain

The difference between the entropy of a set and the entropy of its subsets after a split. It is a heuristic used to choose the split with maximum information gain; the greater the gain, the more informative the split.

Association Rule Mining, Confidence

The fraction of transactions containing the itemset {X,Y} among those containing the antecedent X in the rule X -> Y: c(X -> Y) = s({X,Y}) / s({X}). It is an indication of how often the rule has been found to be true. For rules generated from the same itemset, c(ABC -> D) ≥ c(AB -> CD) ≥ c(A -> BCD).
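
A small sketch computing support and confidence over hypothetical transactions (the basket contents are made up for illustration):

```python
# support(X) = fraction of transactions containing X; confidence(X -> Y) = support(X u Y) / support(X).
transactions = [
    {"milk", "diaper", "beer"},
    {"milk", "bread"},
    {"milk", "diaper", "cola"},
    {"bread", "diaper", "beer"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

print(support({"milk", "diaper"}))               # 0.5
print(confidence({"milk", "diaper"}, {"beer"}))  # 0.5
```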

doc2Vec Embeddings

The goal is to create a numeric representation of a document, regardless of its length. It is a small extension of the CBOW model: instead of using just words to predict the next word, we also add another feature vector which is document-unique. The document vector intends to represent the concept of a document. It can learn document vectors and word vectors in the same space, so words can be compared to documents and can represent the document (semantic embedding). Variants: Paragraph Vector with Distributed Memory (DM) and Distributed Bag of Words (DBOW).

Jaccard Distance

measures dissimilarity between sample sets, commonly used to calculate an n × n matrix for clustering and multidimensional scaling of n sample sets

Gini Index

In economics, measures income inequality (a higher number means more inequality). In classification, it "measures how often a randomly chosen element from the set would be incorrectly labeled": 0 = never mislabeled, 0.5 = mislabeled 50% of the time.
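
A minimal sketch of Gini impurity for a list of class labels (labels are illustrative):

```python
# Gini impurity = 1 - sum(p_c^2); 0 = pure, 0.5 = maximally mixed for two classes.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["red"] * 4))                     # 0.0
print(gini(["red", "red", "blue", "blue"]))  # 0.5
```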

Jaccard Similarity

Measures similarity between sets: the size of the intersection divided by the size of the union.

Divisive Clustering

Methods that separate n objects successively into finer groupings. Starts with one all-inclusive cluster and splits until each cluster contains one object or K clusters are reached.

Clustering

Minimize intra-cluster distance, maximize inter-cluster distance. Used to summarize (utility), understand, and compress data by grouping; image/pattern recognition. Cluster analysis: divides data into clusters. Clustering: a collection of clusters. Cluster: a group of similar objects.

discriminative models

Models that capture the conditional probability p(Y | X). The model ignores the question of whether a given instance is likely, and just tells you how likely a label is to apply to the instance; it must choose or make a decision as to what class a given example belongs to.

KNN

A simple and efficient application of distance-based classification: it examines each instance (e.g., pixel) to be classified, then identifies the k nearest training samples as measured in the feature (e.g., multispectral data) space. Expensive testing phase; susceptible to skewed data; data can't be too similar. Appropriate when relationships between features and the target class are numerous, complicated, and difficult to understand: when the concept is difficult to define, but you know it when you see it, nearest neighbor might be appropriate. On the other hand, if there is no clear distinction among the groups, then the algorithm may not be well suited for identifying the boundaries.
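
A minimal sketch assuming scikit-learn; the dataset and k=5 are illustrative choices, not from the card:

```python
# Train a k-NN classifier: training is cheap, prediction does the distance work.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on held-out data
```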

Alternative Hypothesis (H1)

states that there is a change, a difference, or a relationship for the general population. In the context of an experiment, H1 predicts that the independent variable (treatment) does have an effect on the dependent variable. the opposing statement to a null Hypothesis

Mode

the most frequently occurring score(s) in a distribution

curse of dimensionality

The phenomena that occur when data with a high number of attributes is classified, organized, and analyzed, and that do not occur in low-dimensional spaces: specifically data sparsity and the loss of meaningful "closeness" between points. The number of values needed to fill the space is exponentially large; even at roughly 10-30 dimensions, small samples give poor quality.

Data Mining

The process of analyzing data to extract information not offered by the raw data alone (bottom-up: data is the start). Techniques: clustering, classification, anomaly detection, association mining. Draws on statistics, math, advanced computing, and visualization. Tasks: 1. Prediction methods: find future values. 2. Description methods: find patterns that describe the data.

Anomaly Detection

The process of identifying rare or unexpected items or events in a data set that do not conform to other items in the data set. It hinges on an ability to accurately analyze time series data in real time. Time series anomaly detection must first create a baseline for normal behavior in primary key performance indicators (KPIs).

Standardization

the process of transforming a variable to one with a mean of 0 and a standard deviation of 1.
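
A minimal sketch of z-score standardization; the values are illustrative:

```python
# Standardize: subtract the mean, divide by the standard deviation.
import numpy as np

x = np.array([10.0, 12.0, 14.0, 18.0, 26.0])
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())   # ~0.0 and 1.0
```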

Resolution

the scale/magnification needed to identify something

Null Hypothesis (H0)

The statistical hypothesis tested by the statistical procedure; usually a hypothesis of no difference or no relationship. You are trying to determine whether you believe a statement to be true or false; the null hypothesis is, in a way, the default statement, as it is presumed true, and it is the test's job to challenge this.

Anomaly Threshold

thresholds to detect anomalies in performance measures on devices, links, and interfaces, and display these anomalies in the Device Dashboard.

Time Series - Walk Forward Validation

When training of the statistical models is not time consuming, walk-forward validation is the preferred solution for getting the most accurate results.

Dot Product of Vectors

u · v = u1v1 + u2v2; in three dimensions, a · b = ax × bx + ay × by + az × bz. Also a · b = |a| × |b| × cos(θ), so for perpendicular vectors (θ = 90°) the dot product is 0. Magnitude: |v| = √(v1^2 + v2^2 + v3^2). The dot product gives a scalar (ordinary number) answer, and is sometimes called the scalar product.
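
A quick check of the two equivalent forms above, with illustrative vectors:

```python
# Component form and |a||b|cos(theta) form of the dot product agree.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -5.0, 6.0])

print(np.dot(a, b))                                       # 1*4 + 2*(-5) + 3*6 = 12.0
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.linalg.norm(a) * np.linalg.norm(b) * cos_theta)  # same 12.0
```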

Random Cut Forests (RCF)

Updates are better served with simpler choices of partition; more advanced algorithms in the inference phase can compensate for simpler updates. Unsupervised; handles numeric, high-dimensional data.

Word2Vec

word2vec (CBOW + skip-grams) is an algorithm and tool to learn word embeddings by trying to predict the context of words in a document. The resulting word vectors have some interesting properties, for example vector('queen') ~= vector('king') - vector('man') + vector('woman'). Two different objectives can be used to learn these embeddings: the skip-gram objective tries to predict a context from a word, and the CBOW objective tries to predict a word from its context. Word2vec relies only on local information of language: the semantics learnt for a given word is only affected by the surrounding words. It does very well in analogy tasks. Such representations encapsulate different relations between words, like synonyms, antonyms, or analogies; to build a model using words, simply labeling/one-hot encoding them is a plausible way to go, but embeddings carry much richer structure.

Distributional Hypothesis

words that occur in the same contexts tend to have similar meanings (Harris, 1954)

Histogram Estimators

Estimate density from bin membership: the estimate at a point is based on the size (count) of the bin containing it. https://towardsdatascience.com/histograms-vs-kdes-explained-ed62e7753f12

Data Quality Issues

• Missing values, e.g., survey questions skipped or insufficient choices • Noise & outlier identification • Wrong data • Duplicate data

Data Types and their operations

• Nominal (e.g., ID#, sex, color): mode, frequency, correlation; =, !=
• Ordinal (e.g., grades): sequence (monotonic transformation preserved), no differences or sums; rank correlation, percentiles, median; <, >; allowed transformation g = f(x) for monotonic f
• Interval (e.g., temperatures, dates; zero has no meaning): median, std. dev., t-test, F-test, Pearson's correlation; +, -; allowed transformation g = a*f(x) + c
• Ratio (e.g., Kelvin, age, time, counts; location of zero is defined and fixed): geometric mean, harmonic mean, % variation; *, /; allowed transformation g = a*f(x)

Data Mining Challenges

• Scalability • Dimensionality: computational complexity required to measure or analyze huge numbers of attributes • Heterogeneous & complex data of varying types, often unstructured; clusters vary in size, height, shape • Data ownership & distribution: distributed data is hard to analyze; you need to gather or work the data wholly • Nontraditional analysis: difficult to determine all the potential hypotheses that can be present in the data

Manifold

a collection of points forming a certain kind of set, such as those of a topologically closed surface or an analog of this in three or more dimensions.

standard deviation

A computed measure of how much scores vary around the mean score; a quantity calculated to indicate the extent of deviation for a group as a whole. For a normal distribution: ±1 std covers about 34.1% on each side => 68.2% around the mean; ±2 std => 95.4%; ±3 std => 99.7%; ±4 std => ~100%.

Principal Component Analysis (PCA)

a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set. Classically useful in Anomaly detection
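
A minimal sketch assuming scikit-learn; the random data and the choice of two components are illustrative:

```python
# Project data onto its top two principal components and inspect the variance retained.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # hypothetical 5-feature data

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # fraction of variance kept per component
```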

latent semantic analysis (LSA)

a mathematical procedure for automatically extracting and representing the meanings of propositions expressed in a text. a technique of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns.

Ranking

a position on a scale that shows how good someone or something is when compared with others. similarity varies based on distance measure used

distributed representation

a representation in which information is coded as a pattern of activation distributed across many different nodes. A collection of distributed neurons where each contributes to the representation of a concept

Bag-of-words (BOW)

A simplifying representation of a document as the bag (multiset) of its words, commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature; it disregards grammar and even word order but keeps multiplicity. It loses many subtleties of a possibly better representation, e.g., consideration of word ordering.
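
A minimal sketch assuming a recent scikit-learn; the two toy documents are illustrative:

```python
# Bag-of-words: word order is lost, per-document word counts are kept.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # learned vocabulary
print(X.toarray())                          # term counts per document
```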

Anomaly Based Objectives

a) Loss-based learning b) Distance-based learning (e.g., Nearest Neighbor, Isolation Forest, Local Outlier Factor)

Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)

an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis. It provides a very general framework for approaching manifold learning and dimension reduction, but can also provide specific concrete realizations. Uses Topological Analysis

Noise

an inherent sense of randomness, assumed to be unbiased and spherically symmetric

Novelty

an instance from an entirely different space or a change in current space that is suddenly occurring more often and considered the new norm

Outlier

an instance inside the probable space of known items that occurs with a very rare or low probability. generally considered noise/ edge cases

Variational Autoencoders (VAE)

are extensions of autoencoders to generate content. Variational Autoencoders map inputs to multidimensional Gaussian distributions instead of points in the latent space. Then, the decoder randomly samples a vector from this distribution to produce an output. They build general rules shaped by probability distributions to interpret inputs and to produce outputs. https://towardsdatascience.com/introduction-to-normalizing-flows-d002af262a4b

Independent probabilities

are multiplied

Attention

Attention is a technique that mimics cognitive attention. The effect enhances some parts of the input data while diminishing other parts, the motivation being that the network should devote more focus to the small but important parts of the data. Learning which part of the data is more important than another depends on the context, and this is trained by gradient descent.

Association Rule Mining, Eclat

based on lattice traversal

Association Rule Mining, FP-Growth

Based on frequent-pattern growth; finds frequent itemsets without candidate generation.

Symmetric attributes

Binary attributes that are equally valuable or carry the same amount of weight; negative and positive values are equally relevant.

Eager learners

Construct a classification model based on the given training data before receiving data for classification. The learner must be able to commit to a single hypothesis that covers the entire instance space. Due to model construction, eager learners take a long time to train and less time to predict. They abstract away from the data during training and use this abstraction to make predictions (e.g., decision tree induction and rule-based systems). The majority of computation occurs at training time.

multivariate time series forecasting problems

Contain multiple variables: one variable is time and the others are multiple input parameters/series.

Structured data

Data that conforms to a database model, which is largely characterized by the various fields that the data contains (retail, financial, bioinformatics, geodata).

Deep One-Class Classification (DOC)

deep-learning-based approach for one-class transfer learning in which labelled data from an unrelated task is used for feature learning in one-class classification. One-class classification trains the classifier to be able to identify out-of-class objects when given a single class sample

Classification and Regression Tree Model (CART)

defines its rule as identifying the feature in the data that best separates records into distinct classes of interest.

lazy learners

Do not build models explicitly. These learners defer the majority of computation to consultation (prediction) time. Two typical examples of lazy learning are instance-based learning and Lazy Bayesian Rules.

stemming

Eliminates suffixes; strips from the back of the word.

Support Vector Data Description

Finds the smallest hypersphere that contains all samples, except for some outliers. A machine learning technique used for single-class classification and outlier detection: the idea of SVDD is to find a set of support vectors that defines a boundary around the data. It obtains a spherically shaped boundary around a dataset and, analogous to the support vector classifier, can be made flexible by using other kernel functions (linear, Gaussian, polynomial, sigmoid, Laplacian). The resulting data description can be used for outlier detection or classification.

Continuous Attributes

Floating point, real; e.g., temperature, height.

GPT

Generative pre-training (GPT) of a language model (2018). Generative Pre-trained Transformer 2, commonly known by its abbreviated form GPT-2, is an unsupervised transformer language model and the successor to GPT; it avoids certain issues encoding vocabulary with word tokens by using byte pair encoding, which allows it to represent any string of characters by encoding both individual characters and multiple-character tokens (2019). Generative Pre-trained Transformer 3, commonly known by its abbreviated form GPT-3, is an unsupervised transformer language model and the successor to GPT-2; it can generalize the purpose of a single input-output pair (2020).

Lemmatization

grouping words together based on their basic dictionary definition. The base dictionary form of the word is called a lemma

Association Rule Mining, H-Confidence or all-confidence

h-confidence(X) = s(X) / max(s(i1), s(i2), ..., s(ik))

Agglomerative clustering

hierarchical clustering procedure where each object starts out in a separate cluster; clusters are formed by grouping objects into bigger and bigger clusters

cosine similarity

If documents can be represented as vectors, the cosine between those vectors represents how similar they are. It ignores magnitude and focuses on orientation. Not a proper distance, because it fails the triangle inequality property.
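
A minimal sketch showing that scaling a vector leaves the cosine unchanged; the vectors are illustrative:

```python
# Cosine similarity depends only on orientation, not magnitude.
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(u, v))        # 1.0: same orientation
print(cosine_similarity(u, 10 * v))   # still 1.0
```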

Asymmetric attributes

If the outcomes of the states are not equally important; e.g., a positive AIDS test is more important than a negative test.

Graph Based Clusters (Contiguity — Based Clusters)

Within a cluster, nearest-neighbor (intra-cluster) distance <<< distance to any point outside the cluster. · Two objects are connected only if they are within a specified distance of each other. · Each point in a cluster is closer to at least one point in the same cluster than to any point in a different cluster. · Useful when clusters are irregular and intertwined. · Does not work efficiently when there is noise in the data: a small bridge of points can merge two distinct clusters into one. · A clique is another type of graph-based cluster. · Agglomerative hierarchical clustering is closely related to graph-based clustering.

unstructured data

information that either does not have a predefined data model or is not organized in a predefined manner. (images, video, sensor data, web pages)

Discrete Attributes

integers, counts, binary

Data Science

Involves applications of statistics, computer science, and software engineering, along with some other relevant fields (such as sociology or finance). (Top-down: start with a question.) Statistical analysis, math, the scientific method, advanced computing, visualization, domain expertise, hacking; extracting knowledge from datasets. 1. Design a sampling scheme (Question) 2. Determine measurable quantitative attributes (Query data) 3. Improve data quality (Explore) 4. Model and validate data (Model) 5. Communicate results (Present)

Transformer

is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV).

Reconstruction Models for Anomaly Detection

A generative approach that reconstructs what it considers normal, for comparison to data that doesn't fit that definition; the reconstruction difference provides a ranking of how anomalous the new data is. Well suited for determining anomalous data point sets or structures. A) PCA B) Autoencoders C) Prototypical clustering

Type 2 error

(False Negative) Believing that something is not real when it is. when you fail to reject the null hypothesis when you should have.

Type 1 Error

(False positive) Rejecting null hypothesis when it is true. you reject the null hypothesis when you indeed should not have.

Cosine distance

1 - cosine similarity

Cluster Types

1) Well-Separated 2) Center/prototype Based 3) Contiguous Cluster (Near-Nei) 4) Density Based 5) Conceptual

Time Series Anomalies

1. Global outliers - known as point anomalies, these outliers exist far outside the entirety of a data set 2. Contextual outliers - called conditional outliers, these anomalies have values that significantly deviate from the other data points that exist in the same context. 3. Collective outliers - When a subset of data points within a set is anomalous to the entire dataset. Individual behavior may not deviate from the normal range in a specific time series dataset. But when combined with another time series dataset, more significant anomalies become clear.

K-nearest neighbor (K-NN)

A classification method that classifies an observation based on the class of the k observations most similar or nearest to it.

Association Rule Mining, itemset

A collection of one or more items. k-itemset contains k items

Semi-structured

A database following a semi-structured data model is a collection of nodes, where each node is either a leaf or an interior node. Structured data, but lacking the strict structure imposed by an underlying data model.

Gini Coefficient

A measure of purity in decision trees. Purity can be thought of as how homogenized the groupings are. 1) If we have 4 red gumballs and 0 blue gumballs, that group of 4 is 100% pure, based on color as the target. 2) If we have 2 red and 2 blue, that group is maximally impure. 3) If we have 3 red and 1 blue, that group is in between: Gini impurity = 1 − (0.75² + 0.25²) = 0.375, and entropy ≈ 0.81 bits.

Covariance

A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship. Lies between -∞ and +∞ and is affected by a change in the scale of the variables, so use it when the variables are on similar scales.

Correlation

A measure of the extent to which two factors vary together, and thus of how well either factor predicts the other. It is dimensionless: a unit-free measure of the relationship between variables, between -1 and 1. It is a tool for feature selection; use it when the variable scales differ.
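
A small sketch contrasting covariance (scale-dependent) and correlation (unit-free); the data is illustrative:

```python
# Covariance changes if x or y is rescaled; correlation stays in [-1, 1].
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + np.array([0.1, -0.2, 0.0, 0.3, -0.1])

print(np.cov(x, y)[0, 1])        # covariance
print(np.corrcoef(x, y)[0, 1])   # correlation, close to 1 here
```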

Manhattan Distance

A method of distance measurement = Minkowski distance with p = 1; no diagonals.

Euclidean distance

A method of distance measurement using the straight line mileage between two places. = Minkowski Distance with p=2

Association Rule Mining, Generator

A set X is called a generator of Y if the closure of X is Y; for example, both {A} and {A, C, W} are generators of {A, C, W}.

time series data

A time series is a sequence of numbers that are ordered by a time index. This can be thought of as a list or column of ordered values. The predictions over time become less and less accurate and hence it is a more realistic approach to re-train the model with actual data as it gets available for further predictions.

Association Rule

An implication expression of the form X -> Y, where X and Y are itemsets derived from transactions. Example: {Milk, Diaper} -> {Beer} is a possible rule of the frequent itemset {Milk, Diaper, Beer}. All rules have a confidence and a support. To find one you need to: 1) generate a frequent itemset 2) generate the rules of high confidence that binarily partition the frequent itemset. Note that not every rule of a frequent itemset will have a high enough confidence.

Anomaly

An instance outside probable space of known items

Association Rule Mining, Closed Frequent itemset

An itemset X is closed if none of its immediate supersets has the same support count as the itemset X.

Association Rule Mining, Maximal Frequent itemset

An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent, meaning all supersets are infrequent. Unlike closed itemsets, maximal itemsets do not preserve the support information of their subsets.

Association Rule Mining, Frequent itemset

An itemset whose support value is greater than or equal to a predetermined minimum support threshold. Enumerating every possible itemset is computationally prohibitive. In contrast, infrequent itemsets fall below minimum support.

Conditional/Contextual Anomaly

Anomalies specific to a context ; Space, Time, connection to a graph

One Class Support Vector Machine

One-class SVM is an unsupervised algorithm that learns a decision function for novelty detection: classifying new data as similar to or different from the training set. You only have data of one class, and the goal is to test new data and find out whether it is alike or not alike the training data. It is a variation of the SVM that can be used in an unsupervised setting for anomaly detection: the one-class SVM finds a hyperplane that separates the given dataset from the origin such that the hyperplane is as close to the data points as possible. This is different from SVM margins.
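
A minimal sketch assuming scikit-learn; the training data, nu, and gamma values are illustrative:

```python
# Fit a one-class SVM on "normal" data, then flag new points as inliers (+1) or anomalies (-1).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # normal data only
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])               # one inlier, one outlier

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(ocsvm.predict(X_new))   # expected: [ 1 -1 ]
```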

Sparsity

Only presence counts. Lack of diversity, Asymmetric

Gibbs sampling

Organizes objects into topics one object at a time, with respect to each other within the topic. A Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximately from a specified multivariate probability distribution, when direct sampling is difficult. This sequence can be used to approximate the joint distribution (e.g., to generate a histogram of the distribution) or to approximate the marginal distribution of one of the variables, or some subset of the variables (for example, the unknown parameters or latent variables).

Classic Anomaly Detection Algorithms

PCA analysis, One-Class SVM, Support Vector Data Description, K-nearest neighbor (K-NN), Kernel Density Estimation (KDE)

doc2Vec (DM)

Paragraph Vec w Dist. Mem. (DM) uses Context Words + Doc vector to predict a word in a context window

Types of anomalies

Point Anomalies Conditional/Contextual Anomalies Group/Collective Anomalies

Entropy

Measures how mixed a set is, i.e., the uncertainty in choosing an item from the set: H(S) = −Σ_c (|c|/|S|) log(|c|/|S|), summed over classes c. Entropy = 0 when all items are the same.
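
A minimal sketch of the formula above; the labels are illustrative:

```python
# Entropy in bits: 0 for a pure set, 1 bit for a 50/50 two-class split.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["a", "a", "a", "a"]))   # 0.0
print(entropy(["a", "a", "b", "b"]))   # 1.0
```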

Hierarchical Clustering

Process of agglomerating observations into a series of nested groups based on a measure of similarity. Produces a set of nested clusters Organized as a hierarchical tree. Can be visualized as a dendrogram

Random Forest For Time Series Model

Random Forest is a popular and effective ensemble machine learning algorithm. It is widely used for classification and regression predictive modeling problems with structured (tabular) data sets. Random Forest can also be used for time series forecasting, although it requires that the time series dataset be transformed into a supervised learning problem first. It also requires a specialized technique for evaluating the model called walk-forward validation, since k-fold cross-validation would give optimistically biased results.

Low-level features vs High Level in anomaly analysis

Refers to a position in a feature hierarchy; for example, low-level: letters, words, sentences; high-level: topics and documents.

Central Limit Theorem (CLT)

Says that when n is large, the sampling distribution of the sample mean is approximately Normal

Group/Collective Anomalies

A set of points that is collectively abnormal.

Autoencoders

Autoencoders are trained to recreate the input; in other words, the y label is the x input. One application of vanilla autoencoders is anomaly detection: if the autoencoder can reconstruct the sequence properly, then its fundamental structure is very similar to previously seen data; if the network cannot recreate the input well, it does not abide by known patterns. When creating autoencoders, there are a few components to take note of: • The loss function is very important: it quantifies the "reconstruction loss". Since this is a regression-style problem, the loss function is typically binary cross-entropy (for binary input values) or mean squared error. • The code size: the number of neurons in the first hidden layer (the layer that immediately follows the input layer). This is arguably the most important layer, because it determines how much information is passed through the rest of the network. • The performance of an autoencoder is highly dependent on the architecture: balance representation size (the amount of information that can be passed through the hidden layers) against feature importance (ensuring the hidden layers are compact enough that the network must work to determine important features). • Use different layers for different types of data; for instance, one-dimensional convolutional layers can process sequences. Autoencoders are the same as neural networks, just architecturally with bottlenecks.

Reconstruction Anomaly Score

Baseline error value used to decide whether something fully matches or reconstructs a "perfect" model.

K-means

An unsupervised learning algorithm used for clustering: it assigns each point to the nearest of K cluster centroids and iteratively updates the centroids until the assignments stabilize.
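
A minimal sketch assuming scikit-learn; the points and K=2 are illustrative:

```python
# Fit K-means with two clusters and inspect the assignments and centroids.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # learned centroids
```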

Conceptual Clusters

Clusters based on shared property or concept

Continuous Bag Of Words

Continuous bag of words creates a sliding window around current word, to predict it from "context" — the surrounding words. Each word is represented as a feature vector. After training, these vectors become the word vectors.

Density Based Clusters:

Dense regions separated by less dense regions. · A cluster is a dense region of objects that is surrounded by a region of low density. · Density-based clusters are employed when the clusters are irregular or intertwined and when noise and outliers are present. · Points in low-density regions are classified as noise and omitted (e.g., a small bridge of points between two clusters is eliminated). · DBSCAN is an example of a density-based clustering algorithm (see the sketch below). Density and probabilistic models provide an uncertainty measure or probability that tells us how strongly a data point is associated with a specific cluster, instead of modeling each dimension of the data independently; they learn the probability distribution of large datasets. Such techniques include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Normalizing Flows. https://scikit-learn.org/stable/modules/density.html A) Histogram estimators (aka bricks) B) Kernel Density Estimation (KDE), Generative Adversarial Networks (GANs): don't assume the distribution has a mean C) Gaussian Mixture Models (GMM), Variational Autoencoders (VAE): assume a mean D) Markov Chain Monte Carlo (MCMC): estimates the PDF via a probability chain; we need to determine an appropriate function for modeling the posterior probability, i.e., the conditional probability P(A|B) of an event A given B E) Normalizing Flows
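
A minimal DBSCAN sketch assuming scikit-learn; the points, eps, and min_samples are illustrative:

```python
# Two dense groups become clusters; the isolated point is labeled -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.1, 1.2], [0.9, 1.0],
              [8, 8], [8.1, 8.2], [7.9, 8.0],
              [4.5, 4.5]])                      # last point is isolated

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # e.g., [0 0 0 1 1 1 -1]
```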

Probabilistic Latent Semantic Analysis (PLSA)

Derived from LSA; there are three variables: documents, words, and topics (hidden/latent). P(w|d) = Σ_t P(w|t) P(t|d). In contrast, LSA uses only global statistics.

doc2Vec (DBOW)

Distributed Bag of Words (DBOW): similar to word2vec skip-grams, but it uses the document vector, as opposed to context words, to predict the surrounding words in a context window.

Gaussian Mixture Model (GMM)

Each Gaussian mixture component is a cluster defined by a Gaussian function, and each cluster has: A) A mean μ that defines its centre. B) A covariance Σ that defines its width; this would be equivalent to the dimensions of an ellipsoid in a multivariate scenario. C) A mixing probability π that defines how big or small the Gaussian function will be. https://towardsdatascience.com/gaussian-mixture-models-explained-6986aaf5a95
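
A minimal sketch assuming scikit-learn's GaussianMixture; the synthetic data and two components are illustrative:

```python
# Fit a two-component GMM and inspect means (mu), mixing weights (pi), and soft memberships.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)                 # component centers (mu)
print(gmm.weights_)               # mixing probabilities (pi)
print(gmm.predict_proba(X[:2]))   # soft cluster membership for the first two points
```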

Fuzzy Clustering

Every object belongs to every cluster with an assigned weight, where 0 = not a member and 1 = absolute membership, and the weights for each object must sum to 1.

XGBoost

Extreme Gradient Boosting: a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. In prediction problems involving small-to-medium structured/tabular data, ensemble tree methods that apply the principle of boosting weak learners via gradient descent are considered best-in-class right now. Parallel processing + tree pruning + handling of missing values + regularization.

Amazon BlazingText

FastText with GPU enhancements. 1) CBOW and skip-grams generate embeddings (unsupervised) 2) Given a vocabulary, text classification (supervised)

Association Rule Mining, Support Count

Frequency of occurrence of an itemset: the number of transactions that contain it (as a subset).

generative models

Generative models can generate new data instances. They capture the joint probability p(X, Y), or just p(X) if there are no labels, and contrast with discriminative models: a generative model could generate new photos of animals that look like real animals, while a discriminative model could tell a dog from a cat. Unsupervised models that summarize the distribution of input variables may be used to create or generate new examples in the input distribution. Generative models include Latent Dirichlet Allocation (LDA), the Gaussian Mixture Model (GMM), the Restricted Boltzmann Machine (RBM), the Deep Belief Network (DBN), the Variational Autoencoder (VAE), and the Generative Adversarial Network (GAN).

Latent Dirichlet Allocation (LDA)

A generative statistical model that explains why some parts of the data are similar by using an unobserved group (topic) to explain a set of observations (words in a document). In this model every word gravitates to a topic, and collections of words gravitate to specific topics. Needs: a) a set of topics b) stop-words (normalize) c) stemming (normalize) d) lemmatization (normalize). But it is very hard to tune, and results are hard to evaluate.

Glove Embeddings

Global Vectors. The advantage of GloVe is that, unlike word2vec, GloVe does not rely just on local statistics (local fixed/surrounding context information of words), but incorporates global statistics (word co-occurrence over a given corpus) to obtain word vectors. "You can derive semantic relationships between words from the co-occurrence matrix." GloVe captures both global statistics and local statistics of a corpus in order to come up with word vectors, via a principled loss function which uses both.

Connectivity-based Clustering

Graph-based, also known as hierarchical, clustering; when clusters are nested they can be generated using two types of methods: 1) top-down divisive approach 2) bottom-up agglomerative approach

Association Rules

IF <antecedent> THEN <consequent> Detect unexpected Relations, Medical Informatics, Market Analysis There are three popular algorithms of Association Rule Mining, Apriori (based on candidate generation), FP-Growth (based on without candidate generation) and Eclat (based on lattice traversal).

Association Rule Mining, Apriori

If an itemset is frequent, then all of its subsets must also be frequent. (Anti-Monotone property of Support) = ∀W,V : (V ⊆W)⇒s(V) ≥ s(W) based on candidate generation. Can be inefficient, especially for dense datasets. Apriori is a program to find association rules and frequent item sets (also closed and maximal as well as generators) with the Apriori algorithm [Agrawal and Srikant 1994], which carries out a breadth first search on the subset lattice and determines the support of item sets by subset tests. This implementation is pretty fast as it uses a prefix tree to organize the counters for the item sets. However, Apriori is outperformed on basically all data sets by depth-first algorithms like Eclat or FP-growth.
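
A tiny sketch of the anti-monotone property of support that Apriori exploits, using hypothetical transactions (not the Apriori implementation referenced above):

```python
# Support never increases as an itemset grows: s(superset) <= s(subset).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"A"}))            # 0.75
print(support({"A", "B"}))       # 0.5  <= support({"A"})
print(support({"A", "B", "C"}))  # 0.25 <= support({"A", "B"})
```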

Well-Separated Clusters

Intra distance <<< inter distance. · The distance between any two points in different groups is larger than the distance between any two points in the same group. · These clusters need not be globular but can have any shape. · Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close to one another. · The definition of a cluster is satisfied only when the data contains natural clusters. Could be considered distribution based.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map.

Association Rule Mining, Frequent Itemset Generator

The lattice formed out of all possible itemsets. With minsup = 0 every subset is frequent, giving 2^d − 1 candidate itemsets, where d is the number of items. If minsup is very large, the number of rules is small.

semi-supervised learning

Machine learning using a combination of supervised and unsupervised learning techniques. A learning problem that involves a small number of labeled examples and a large number of unlabeled examples.

ARIMA Model

Model used in Time Series Forecasting. It is Statistical Based. It stands for AutoRegressive Integrated Moving Average.

Deep Anomaly Detection Algorithms

Deep autoencoder variants, Deep One-Class Classification (DOC), Generative Adversarial Networks (GANs)

Anomaly Score

score is created using an anomaly/id and the new instance (input_data) for which you wish to create an anomaly score. When you create a new anomaly score, BigML.io will automatically compute a score between 0 and 1. The closer the score is to 1, the more anomalous the instance being scored is.

automated anomaly detection systems

should include detection, ranking, and grouping of data, eliminating the need for large teams of analysts.

BERT

Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google.

Center/Prototype Based clusters

Distance from a point to its own cluster's center <<< distance to other clusters' centers. · If the data is numerical, the prototype of the cluster is often a centroid, i.e., the average of all the points in the cluster. · If the data has categorical attributes, the prototype of the cluster is often a medoid, i.e., the most representative point of the cluster. · Objects in the cluster are closer to the prototype of the cluster than to the prototype of any other cluster. · These clusters tend to be globular. · K-Means and K-Medoids are examples of prototype-based clustering algorithms.

Manifold assumption

Data lives on a low-dimensional manifold that may be nonlinear and non-convex, so you can match similarity to the shape of the manifold.

Normalizing Flows

Normalizing flows are a series of simple functions which are invertible, i.e., the analytical inverse of the function can be calculated. For example, f(x) = x + 2 is a reversible function because for each input a unique output exists and vice-versa, whereas f(x) = x² is not. Such functions are also known as bijective functions. Normalizing flows transform a complex data point, such as an MNIST image, to a simple Gaussian distribution or vice-versa. Normalizing flows offer various advantages over GANs and VAEs: A) Flow models do not need to put noise on the output and thus can have much more powerful local variance models. B) The training process of a flow-based model is very stable compared to the training of GANs, which requires careful tuning of hyperparameters for both the generator and the discriminator. C) Normalizing flows are much easier to converge than GANs and VAEs. Disadvantages of normalizing flows: A) Due to the lackluster performance of flow models on tasks such as density estimation, they are regarded as not as expressive as other approaches. B) One of the two things required for flow models to be bijective is volume preservation over transformations, which often leads to a very high-dimensional latent space that is usually harder to interpret. C) The samples generated by flow-based models are not as good as those from GANs and VAEs.

Point Anomaly

one data Point is not normal

univariate time series forecasting problems

only two variables in which one is time and the other is the field to forecast

population proportion

p = ratio of members of a population with a particular characteristic to the total number of members n of the population. To accurately assess it: 1) the sample size has to be less than 5% of the population; 2) satisfying np(1−p) ≥ 10 means the sampling distribution is approximately normal, so we can compare it to the Z-score distribution.

Association Rule Mining, Cross Support

r(X) = min (s(i1), s(i2),..., s(ik)) /max (s(i1), s(i2),..., s(ik))

