AI Fundamentals


model = LogisticRegression(C=0.000001)

Set the value of hyper-parameter C to 0.000001.
model = LogisticRegression(C=________________________)
model.fit(training_inputs, training_labels)
check_performance(model, testing_inputs, testing_labels)

cluster entities events similar attributes algorithms automated unsupervised categorizations data

A cluster is a group of entities or events sharing similar attributes. So, when we talk about clustering within the context of AI, we talk about the process of applying Machine Learning algorithms for automated discovery of clusters. This is an unsupervised learning problem, because we are not using any pre-existing categorizations during training -- we let the algorithm assign data into groups entirely on its own.

linear separation number specific

DBSCAN allows both non-linear separation AND figures out the number of clusters on its own, BUT it can leave too many points without any specific cluster, as indicated with black dots. Obviously, each algorithm has its advantages and the final choice should be made based on the requirements of each specific use case.
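A minimal sketch of how DBSCAN is typically run in scikit-learn (the eps and min_samples values below are illustrative assumptions, and X stands for any feature matrix):
from sklearn.cluster import DBSCAN
# eps: neighborhood radius; min_samples: points needed to form a dense region
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
# points that fit no specific cluster -- the "black dots" -- get the label -1
print((labels == -1).sum(), "unclustered points")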

model fitting numerical optimization minimum cost parameter testing combinations values performing

Fit the model
In scary mathematical jargon, model fitting refers to performing numerical optimization, looking for the minimum of our cost function in the parameter space. Scary, huh? In HUMAN terms, fitting is the process of moving the knobs under the hood of our model, testing as many combinations as necessary, until we find those values that yield the best performing model. Like us humans when we're searching for that perfect water temperature in the shower, the fitting algorithm runs up and down, until it finds the settings that are just right.

Backpropagation algorithm.

For a long time, neural networks were more of a theoretical concept than a practical tool. The reason for this was the lack of an efficient training algorithm. This all changed in 1986 when a group of authors published a famous paper. Which revolutionary algorithm did this paper further improve and popularize?
- Moonwalk algorithm.
- Backpropagation algorithm.
- Back-to-the-future algorithm.

model = RandomForestClassifier(n_estimators=5, max_depth=20)
model.fit(X_train, y_train)
test_and_show_accuracy(model, X_test=X_train, y_test=y_train)

Hold-out
You already know about the danger of overfitting, which occurs when your model learns the training data too well, but then performs poorly when faced with new data. Because of that, you've been urged to always test your model using data that wasn't previously used for training. But don't take our word for it, see for yourself! You will use a dataset consisting of two classes. 60% of the data has been selected for training and stored in X_train and y_train. The remaining 40% is stored in the variables X_test and y_test. You will train a RandomForestClassifier() model and see the difference in performance:
- when it's applied on the very same data used to train it
- when it's applied on data just slightly different from the training set
Test the model on the same data it used for training.
model = RandomForestClassifier(n_estimators=5, max_depth=20)
model.fit(X_train, y_train)
test_and_show_accuracy(model, X_test=____, y_test=____)

unusual behavior anomalies or outliers different supervised normal anomaly fraud security monitoring

Many times we want to detect and react upon unusual behavior in our systems and processes. We call these events anomalies or outliers. Unfortunately, anomalies are sometimes so rare and so different from each other, that applying supervised learning is practically impossible. Luckily, in many cases we can still define what is the normal mode of operation and then flag every event that significantly deviates from it as an anomaly. Most common use cases for anomaly detection are credit card fraud detection, network security monitoring, heart-rate monitoring and others.

reason plan solve abstractly complex quickly experience

One of the widely accepted formulations defines intelligence as: the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience.

groups entities or events "clustering" supervision classification pre-existing categorizations exploring discovering K-means mean-shift

One typical problem we can solve in this way is finding groups of similar entities or events -- for example, groups of similar consumers of a certain product, or similar articles on a news website. We call this problem "clustering" and it is crucial to differentiate it from its supervised sibling, Classification. With classification, we are teaching the model some **pre-existing** categorizations, while with clustering we are exploring and **discovering** categories, with minimum assumptions. When it comes to algorithms, the most famous Clustering algorithm is K-means clustering, but a variety of them exists, like mean-shift clustering, DBSCAN and others. For Dimensionality reduction, the first choice is usually Principal Component Analysis or PCA, followed by an array of non-linear algorithms, also called "Manifold learning". Finally, for Anomaly detection, an excellent first choice is the Isolation Forest algorithm.
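For illustration, a minimal K-means run in scikit-learn might look like this sketch (n_clusters=3 and the feature matrix X are assumptions):
from sklearn.cluster import KMeans
# discover 3 groups in the data without any pre-existing labels
model = KMeans(n_clusters=3)
cluster_labels = model.fit_predict(X)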

Artificial General Intelligence (General AI)

Some computer algorithms are capable of mimicking human intelligence, to reason and solve problems on their own, and to apply previously acquired knowledge to completely new types of problems. These algorithms fall into the domain of:
- Artificial Narrow Intelligence (Narrow AI)
- Artificial General Intelligence (General AI)

objective absolute error metric cancel relative quantifies goodness squared

The choice of metrics always depends on our specific objective, but most of the time we want some kind of absolute error metric. Otherwise positive and negative errors can cancel each other out and give us a false sense of confidence in our model. Or, if we just want a relative measure that quantifies the goodness-of-fit of our regression model, the R-squared score is the way to go.

elephant_image = load_elephant()
test_digit_predictor(elephant_image, 'elephant')

The elephant in the room
To further illustrate the concept of Narrow AI, let's see what happens when an algorithm trained for one problem is given a completely unrelated input and asked for a prediction. Specifically, what happens when you feed a picture of an elephant into a model that is trained to recognize handwritten digits? To make things simple, a digit recognition model has been pre-trained. Your task is to feed the elephant image into it and see the result. Feed the elephant_image into the digit_predictor.
elephant_image = load_elephant()
test_digit_predictor(____, 'elephant')

Tensorflow SEQUENTIAL linear DENSE INNER OUTPUT dimension parameters fitting

The industry standard library for Deep Learning today is TensorFlow by Google and since version 2.0 it integrates the "keras" library, which makes network development a breeze. We will initialize a SEQUENTIAL model, which is a linear stack of layers and indeed the most common Neural Network design pattern. We then add one fully connected or DENSE INNER layer with 32 units and another dense OUTPUT layer with 3 units, because we are dealing with a 3-class classification problem. Notice that for the 1st layer we had to specify the input dimension, but not for the second one, because its inputs are all outputs of the previous layer. Finally, we have to define the parameters guiding the fitting procedure: the optimizer, the loss function and the performance metric used. As you can see, there are many decisions to make. Optimal network design and optimization approaches are beyond the scope of this course, but covered in detail in DataCamp's series on Deep Learning and Machine Learning in general.
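A minimal sketch of the network described above, assuming input_dim=10 as a placeholder for the actual number of input features, and common choices for the activations, optimizer, loss and metric:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
# fully connected (dense) inner layer with 32 units; only the 1st layer needs the input dimension
model.add(Dense(32, activation='relu', input_dim=10))
# dense output layer with 3 units, one per class in the 3-class problem
model.add(Dense(3, activation='softmax'))
# parameters guiding the fitting procedure: optimizer, loss function, performance metric
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])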

algorithms Means Spectral linear separation boundaries specify fast. slower

There are DOZENS of clustering algorithms. The simplest and by far the most commonly used method is KMeans clustering. Then there is Spectral clustering, DBSCAN and many more. There are clear differences in the classification results of each of these algorithms. KMeans, seen in the left column, supports only linear separation boundaries AND you have to specify the number of clusters to be found, but it's very fast. Spectral Clustering (in the middle column) also requires the number of clusters to be specified by the user, BUT it allows for very non-linear separation between clusters -- at the expense of slower execution time.

Supervised, Unsupervised, and Reinforcement

To better understand Machine Learning, let's investigate its three most common flavors: Supervised, Unsupervised, and Reinforcement learning.

Supervised learning

When evaluating the performance of Anomaly Detection models, you most often use metrics from the domain of:
- Supervised learning
- Unsupervised learning

fit predict

As with all other models in the scikit-learn package, the model is trained using the fit() method and applied using the predict() method.
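In its most basic form that pattern is just two calls; a sketch assuming a scikit-learn model and pre-loaded X_train, y_train and X_test:
model.fit(X_train, y_train)          # train the model on labeled data
predictions = model.predict(X_test)  # apply the trained model to new inputs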

Linear regression Logistic regression neural networks

Most common models for tackling regression problems are Linear regression, Lasso and Ridge regression, as well as ARIMA models, which are used for time-series forecasting. For classification, most common models are Logistic regression, Bayesian classifiers and Tree-based models (such as Decision Trees, Random Forests and Gradient Boosted Trees). As for neural networks, they are so versatile that, in the right configuration, they can be used to tackle both problems.

type output variables predict categories quantities numbers train/test splitting model fitting scoring

As we said in the first lesson of this chapter, the main difference between classification and regression models is the type of output variable they are built to predict: for classification it's categories and for regression - quantities, which we express with numbers. That fundamental difference further reflects the difference in model structure and metrics used to evaluate them, but many aspects are practically identical: We still need to perform the train/test splitting. We still use the same functions for model fitting and scoring. We still need to collect quality data for the model to work.

stacking layers neurons input output hidden middle nodes shallow networks 3 Deep Learning

Deep Neural Networks: what are they?
So what are Deep Neural Networks? As you have seen, we build networks by stacking layers of neurons on top of each other. At a minimum, we have one layer for the input, one for the output and one hidden layer in the middle. Whatever the number of nodes in each layer, we call these "shallow networks". But the moment we pass the threshold of 3 layers (including the input and the output), we are already building a Deep Learning network.

problem collect model main automate manual repetitive value

Define the problem
We start by defining the specific problem we need to solve and the associated measure of success. Then we collect the data from our process inputs and outputs. Based on that, we select the model to address the problem at hand and fit it using the available data. First, what is the main point? For example, we want to automate manual, repetitive tasks with an AI solution. Second, how do we create value by solving this problem? In our automation example, we would create value by freeing up resources for more complex and productive work. Finally, how do we define and measure success and failure?

Isolation Forest is commonly used for anomaly detection.

Despite being a bit more computationally intensive than other methods, one algorithm is commonly used for anomaly detection. Which algorithm is it?
- One-Class SVM is commonly used for anomaly detection.
- Isolation Forest is commonly used for anomaly detection.
- Robust covariance is commonly used for anomaly detection.

Unsupervised learning is the most appropriate approach for this task.

Guess the flavor II
You want to apply machine learning on a large collection of news articles and cluster them into groups, so that you could identify and count recurring topics, while ignoring the existing news categorization. Which flavor of Machine Learning algorithms is the most appropriate for this task?
- Reinforcement learning is the most appropriate approach for this task.
- Unsupervised learning is the most appropriate approach for this task.
- Supervised learning is the most appropriate approach for this task.

Reinforcement Learning natural organisms learn actions adjusting outcome positive or negative criteria

Last but not least, we have the very interesting domain of Reinforcement Learning, which is not covered in this course, but absolutely necessary to mention. Reinforcement learning is most similar to the natural way in which living organisms learn: an entity or an "agent" takes certain actions in its environment and then adjusts its behavior depending on whether the outcome of the action was positive or negative compared to its success criteria. Although a very powerful idea and easy to intuitively understand, this domain of AI is still in its infancy, but significant efforts are being invested in research within this domain.

interpreting global and local decision-making rules visualizing decision trees plotting feature importance classified explanation criteria LIME

Lastly, once our model is built and ready, we will often want to pop open the hood and take a peek at what's happening underneath. There are two most common ways of interpreting the model: global and local interpretation. With global interpretation, we try to figure out "What are the general decision-making rules of this model?". Common approaches in this case include visualizing decision trees, or plotting feature importance. With local interpretation, we are investigating: "Why was this specific example classified in this way?" For example, in the EU, if your bank declines to give you a loan based on a Machine Learning model, you can ask it to provide you with an explanation of what the key criteria were when making such a decision. One of the most popular algorithms for local model interpretation is the so-called LIME algorithm.
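As one hedged example of global interpretation, tree-based scikit-learn models expose learned feature importances that can be plotted (the fitted model variable is an assumption here):
import matplotlib.pyplot as plt
# feature_importances_ is available on fitted tree-based models such as RandomForestClassifier
importances = model.feature_importances_
plt.bar(range(len(importances)), importances)
plt.xlabel("Feature index")
plt.ylabel("Importance")
plt.show()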

distance center others unsupervised clustering metrics EVALUATING

Of course, there are many other methods and metrics based on which you can tune your clustering algorithm. The most common one is the Variance Ratio Criterion, which considers the distance of each point to the center of its cluster together with the distances between cluster centers themselves. And the most common alternative to it is the Silhouette Score, which evaluates how close each point is to its own cluster VS how close it is to the others. These two belong to the "unsupervised clustering metrics", because we have not defined any expectations on how our samples should be clustered. When we DO have expectations defined in a validation set, we can use Supervised metrics, such as Mutual Information and Homogeneity, which essentially compare the match between the expectations and the clustering results. But don't be confused: here we are still using unsupervised learning, we are just EVALUATING the results in a supervised manner.
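A small sketch of computing these metrics in scikit-learn, where the Variance Ratio Criterion is implemented as calinski_harabasz_score (X, the cluster labels and the expected labels_true are assumed to exist):
from sklearn.metrics import calinski_harabasz_score, silhouette_score, mutual_info_score, homogeneity_score

# unsupervised metrics: only the data and the cluster assignments are needed
vrc = calinski_harabasz_score(X, labels)
sil = silhouette_score(X, labels)

# supervised metrics: compare the clustering against expectations from a validation set
mi = mutual_info_score(labels_true, labels)
hom = homogeneity_score(labels_true, labels)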

interpretability performance simpler interpret complex performance Deep Neural first

Once we have a pool of fitting candidates, we need to start prioritizing. One of the common dilemmas is the choice between model interpretability and performance. The general rule is that simpler models are easier to interpret, while complex ones dominate the performance arena, but there can be exceptions at both extremes. A simple Decision Tree might give you both the required performance AND interpretability, but it can also happen that you train a monster of a Deep Neural network, and achieve neither of the two. It really depends on your problem and your data, but a good mantra is "Simplicity first!". Try to always start with the simplest, fastest and most interpretable model, and move on up the complexity ladder only if necessary.

Parameters and hyperparameters parameters coefficients values algorithm data before training number tuning

Parameters and hyperparameters
Model parameters (or coefficients) are those values that the algorithm learns from the data by itself, but there is another set of values crucial to the performance of our model: the "hyper-parameters". Hyper-parameters are settings defined before the training. For example, when performing clustering, before your model even sees the data, we first need to tell it the number of clusters it will be looking for. Searching for the optimal hyper-parameters is sometimes referred to as "model tuning".
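To make the distinction concrete, here is a sketch with KMeans (the value 4 and the feature matrix X are assumptions): the number of clusters is a hyper-parameter set before training, while the cluster centers are parameters learned from the data.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4)   # hyper-parameter: chosen before the model sees any data
model.fit(X)
print(model.cluster_centers_)  # parameters: learned from the data during fitting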

Performance evaluation examples didn't phase holdout 60 labeled 40 Comparing predictions performance

Performance evaluation
We mainly want to see how the model performs on examples it didn't see during the training phase. The simplest way to do this is to apply the hold-out approach, where we usually use around 60 percent of all available labeled data for training and the remaining 40% for testing. Comparing the model predictions on the test inputs with the known test outputs gives us a picture of our model's performance.
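A typical hold-out split in scikit-learn, assuming a feature matrix X and labels y, might look like this sketch:
from sklearn.model_selection import train_test_split
# keep 60% for training and hold out the remaining 40% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)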

overfitting memorize details patterns features dimensionality reduction linear regression visualize information estimate

Sometimes we are afraid of overfitting, that is, that our model will memorize trivial details instead of the most crucial patterns in the data. Having too many features increases the risk of that happening and dimensionality reduction can be a good antidote. Sometimes there is too much correlation between our features, which creates problems with linear regression. Sometimes the dimensionality of our raw dataset puts enormous stress on our limited computational resources. And sometimes we just need a way to visualize hyper-dimensional data on a two-dimensional plot, ready for printing. On the other side, the benefits of this procedure usually come at the cost of losing a certain amount of information, which can then have a negative impact on our performance. You should therefore always estimate the performance of your model before AND after applying dimensionality reduction in order to determine if that's the sacrifice you are really willing to make.

model = RandomForestClassifier(n_estimators=5, max_depth=20)
model.fit(X_train, y_train)
test_and_show_accuracy(model, X_test=X_test, y_test=y_test)

Test the model on the hold-out dataset, that is, the data the model hasn't seen during training.
model = RandomForestClassifier(n_estimators=5, max_depth=20)
model.fit(X_train, y_train)
test_and_show_accuracy(model, X_test=____, y_test=____)

GENERAL STRONG AI NARROW WEAK one translated trick Machine Learning

The AI that mimics "human-like intelligence" is what is commonly called "Artificial GENERAL Intelligence" or "STRONG AI". What IS being developed and implemented by 99 percent of AI practitioners in the industry and academia today is a subset of AI called Artificial NARROW Intelligence, sometimes also called WEAK AI. Why "narrow"? Well, because these solutions are designed to solve only one specific problem, without any capacity to be translated to another one without rework. AlphaGo, Google's algorithm that beat the best human players in Go, a game more complex than chess, couldn't even start a game of tic-tac-toe. You can call it a one-trick horse, but it's a very powerful trick nonetheless. Finally, when we talk about Narrow AI, 99% of the time we talk about the good old Machine Learning.

feedforward multidimensional recurrent patterns space images texts

Types of DNNs: Convolutional
Finally, we have the currently extremely popular convolutional networks. A CNN is technically also a feedforward network, but a pretty complex one, able to handle multidimensional data. And while a recurrent network recognizes patterns in time, a CNN learns to recognize patterns across space. They are primarily used for images, but also for texts. This list is not exhaustive and for a more detailed overview of different topologies, please refer to the resources listed at the end of the course.

"Feedforward network signal unidirectionally order of appearance text, sound or time-series Recurrent neural networks

Types of DNNs: Feedforward
The most basic neural network architecture is the so-called "Feedforward network", an example of which we have seen on the previous slide. The key property of this network is that the signal travels strictly unidirectionally, from the input to the output. Feedforward networks can be very powerful, but they are not well suited for data in which the "order of appearance" is important, like text, sound or time-series in general. For that purpose we would rather use the so-called "Recurrent neural networks".

current previously time speech recognition

Types of DNNs: Recurrent
These networks take as their input not just the current input example they see, but also what they have perceived previously in time. RNNs have completely revolutionized speech recognition, for example.

predictions NUMBERS errors positive negative wrong

Unlike classification, where the prediction is either correct or incorrect, regression predictions are NUMBERS, so regression errors are ALSO numbers and can be positive, negative, big or small. In other words, a regression prediction is almost certainly always wrong, but the question is: how wrong is it in general?

Unsupervised output labels relationship patterns

Unsupervised learning owes its name to the fact that at training time it makes no use of the output labels -- it is only busy with capturing the relationships and patterns in the process inputs.

poly = PolynomialFeatures(degree=2)
x2_train = poly.fit_transform(x_train)
linear_model.fit(X=x2_train, y=y_train)
check_model_fit(model=linear_model, x=x2_train, y=y_train)

Using the original dataset x_train, create a new dataset x2_train with 2nd degree polynomial features. Fit the linear_model using the newly created dataset. Check the goodness of fit of your fitted linear model.
poly = PolynomialFeatures(degree=_______________)
x2_train = poly.fit_transform(x_train)
linear_model.fit(X=_____________, y=y_train)
check_model_fit(model=linear_model, x=______________, y=y_train)

individual metrics different satisfying and optimizing cut-off criteria meet accuracy execution priority one ranked one changes ranking

We already talked about individual metrics, but we rarely select our model based on only one. So how do we select one single best model if different ones excel at different metrics? One good practice is to differentiate between satisfying and optimizing metrics. Satisfying metrics are metrics that define a cut-off criterion that every candidate model needs to meet. Multiple metrics of this kind can be used for the same evaluation, such as minimum accuracy, maximum execution time and so forth. After the bad apples have been filtered out, we apply the optimizing metric. This should be a metric that illustrates the ultimate business priority and there can be only one, like "minimize the percentage of undetected diseases" or "minimize the percentage of false alarms" in a fraud detection system. So, the final model is the one that passes the bar on all satisfying metrics and is the highest ranked one on the optimizing metric. In a real-life scenario we would probably repeat this "model competition" from time to time, because real-life data changes in time, and therefore so does the ranking of the models.

linear polynomial multiplying features interaction

We mentioned earlier in this chapter that with regression, 99% of the time we use linear models, even when we model non-linear relationships. So how do we do that? The trick is simple: we engineer polynomial features in our dataset and feed it as such to any linear model that we wish. Polynomial features are features constructed by multiplying raw features with themselves a number of times, or multiplying features between themselves, or both. When we multiply two features, we are creating so-called "interaction features". The scikit-learn algorithm that does this job for us is, very conveniently, called PolynomialFeatures and we only need to provide it with the desired degree of the polynomial it needs to generate. Here, we chose 2. Calling the preprocessor's method .fit_transform() creates the desired higher-order features, which we then feed into our linear learning algorithm as we do with any other data.

algorithms behavioral patterns systems processes input and output data

What is Machine Learning? Simply put, it is the process of applying computer algorithms to capture the behavior and behavioral patterns of systems and processes, based on the input and output data collected from these systems.

supervised learning

What is the model? A model is trained based on a collection of pictures to recognize cats and dogs in pictures, annotated with appropriate labels. Into which domain does this model fall?

unsupervised learning

What is the model? You want to apply machine learning on a large collection of news articles and cluster them into groups, so that you could identify and count recurring topics, while ignoring the existing news categorization. Which flavor of Machine Learning algorithms is the most appropriate for this task?

errors distributed mean absolute MEDIAN learn

When errors are normally distributed, a good choice is the mean absolute error. If we however expect occasional error spikes of higher magnitude, it would be good to "filter them out" by calculating the MEDIAN absolute error.
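Both metrics are available in scikit-learn; a quick sketch, assuming y_test and predictions already exist:
from sklearn.metrics import mean_absolute_error, median_absolute_error
mae = mean_absolute_error(y_test, predictions)      # good when errors are normally distributed
medae = median_absolute_error(y_test, predictions)  # robust to occasional error spikes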

Regression Linear logistic classification

When it comes to Regression models, the most common ones in use today are varieties of Linear Regression, meaning that these models assume a linear link between the model inputs and its outputs. Of course, nothing is perfectly linear in real life, but many relationships can be quite decently approximated as having a linear nature within a certain range. Remember: we're not looking for perfection, but for a model that is good enough! And again, do not be confused: Linear regression is an actual regression model, while logistic regression is a classification model.

algorithms linear projections Principal Component Analysis Text Mining Latent Dirichlet Allocation Non-linear longer execution inconsistency

When it comes to algorithms involving linear projections, the by far most common one is the Principal Component Analysis or just PCA. Another very useful one, especially in the domain of Text Mining, is Latent Dirichlet Allocation, or LDA. From the domain of non-linear algorithms, we very often use the Isomap algorithm and t-SNE. Keep in mind that the power of non-linear algorithms comes at a cost of longer execution times and sometimes inconsistency of results.

model = LogisticRegression()
model.fit(X_train, y_train)

When regression means classification
OK, so you have a classification problem at your hands. Neat. Now you talk to your client, who tells you that the algorithm needs to be implemented on their legacy systems, with very poor computing resources. It turns out you can only run the simplest of models there and you have only two options: LinearRegression() and LogisticRegression(). But you need a classifier and these models have "regression" in their name, so can you really pull this off? If you paid attention in the previous video lesson, you know there's no reason to worry, so just go ahead and select the right algorithm for your problem.
Select the right algorithm for your classification problem.
model = ____
model.fit(X_train, y_train)

multitude interconnect organized artificial neural network output input hidden entry image pixel transformed weights and functions

When we then take a multitude of artificial neurons and interconnect them in an organized way -- we get an artificial neural network. Each network must have an input and an output layer, and they usually have at least one hidden layer next to that.
The basic network structure - input layer: The input layer is the entry point for our data. If we were building an image classifier, each input would be connected to one image pixel.
The basic network structure - hidden layer: As the image signal travels across the network, it is getting transformed by the network weights and functions.

All of the above.

Why Python?
Python is the best choice for developing machine learning solutions. Why is that the case?
- Python has a simple and beautiful syntax.
- Python is very versatile.
- Python is very flexible.
- Python offers rich AI-related libraries.
- The Python community is big and growing fast.
- All of the above.

dimensionality reduction linear varies outliers informative distributed discard

You could bet that PCA is the first algorithm that pops into the mind of a data scientist when someone mentions dimensionality reduction. It is a linear method, and as such very simple, fast and consistent. You can think of it as an algorithm that tries to find the directions across which your data varies the most and set these directions as the new coordinate system. The reduction happens when we decide to keep only the "n" most informative components and discard the rest. However, it relies on the assumption that our data is normally distributed, which is often very idealistic and makes the algorithm very sensitive to outliers. Nonetheless, it should be the first tool in your toolbox and you should proceed to more complex methods only if the data shows that PCA doesn't cut it.
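A minimal PCA sketch in scikit-learn (keeping n=2 components and using an assumed feature matrix X):
from sklearn.decomposition import PCA
pca = PCA(n_components=2)             # keep only the 2 most informative components
X_reduced = pca.fit_transform(X)      # project the data onto the new coordinate system
print(pca.explained_variance_ratio_)  # how much information each kept component retains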

linear_model.fit(X=x_train, y=y_train)
check_model_fit(model=linear_model, x=x_train, y=y_train)

You're about to see how it's possible to model non-linear relationships using linear regression models. To make this "transition", you don't have to change anything in the model, you just have to generate:
- higher order features: (a) → (a, a^2, a^3, ...), and
- interaction features: (a, b) → (a*b, a^2*b, a*b^2, ...)
You will first try to fit a purely linear model to a quadratic process and check the R^2 score. After that you'll use the function PolynomialFeatures() to generate, well, polynomial features and see how much better your fit is -- both visually and according to the R^2 score. Finally, you've been provided with the custom function check_model_fit() that plots the model predictions against actual data and prints the R^2 score of your model.
Select the appropriate method to call the model training procedure.
linear_model.______________(X=x_train, y=y_train)
check_model_fit(model=linear_model, x=x_train, y=y_train)

Classification should be used to predict risky suppliers.

Your company has a problem with too many suppliers failing to meet their delivery deadlines, which creates a lot of downstream problems in your production and many angry customers. Your boss tells you that you have a very rich database of supplier data and asks you to build an algorithm for early prediction of suppliers which are too risky to work with. What kind of model should you use?
- Regression should be used to predict risky suppliers.
- Classification should be used to predict risky suppliers.

This model falls into the domain of supervised learning.

A model is trained based on a collection of pictures to recognize cats and dogs in pictures, annotated with appropriate labels. Into which domain does this model fall?
- This model falls into the domain of reinforcement learning.
- This model falls into the domain of unsupervised learning.
- This model falls into the domain of supervised learning.

quantity peaking or dropping derivatives succession time waveform

Approaches: Thresholding
If we are dealing with a quantity that is fairly stable over time, sometimes a simple threshold is enough.
Approaches: Rate of change
However, sometimes the anomaly is not in the magnitude of the value, but in the rate at which it changes, like suddenly peaking or dropping. In that case we need to include the derivatives of the target value in our modeling.
Approaches: Shape monitoring
In very difficult cases, even the rate of change is not a sufficient anomaly indicator, but we must find a way to model normal behavior in terms of the expected succession of values over time -- or the shape of the waveform.
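As a toy sketch of the first two approaches (the signal array and both threshold values are pure assumptions):
import numpy as np
# thresholding: flag values whose magnitude exceeds a fixed limit
level_anomalies = np.abs(signal) > 100.0
# rate of change: flag points where the value jumps or drops too quickly
rate_anomalies = np.abs(np.diff(signal)) > 10.0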

anomalies contaminated overestimate false

Although you clearly see that Isolation Forests, even without any tuning, do an amazing job of detecting anomalies, they cannot do miracles if your data is severely contaminated. It's good to remember that in that case they will tend to overestimate the number of anomalies, raising a lot of false alarms.

actual reference outputs predictions learn

And of course, we have ready-made functions for that purpose already implemented in scikit-learn. Notice the consistency in the code: whether it's classification or regression, scikit-learn scoring functions take the actual reference outputs as the first argument and the predictions as the second one.
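A sketch of that consistent pattern, with y_test and predictions assumed to exist:
from sklearn.metrics import accuracy_score, r2_score
# actual reference outputs first, predictions second -- for classification...
acc = accuracy_score(y_test, predictions)
# ...and for regression alike
r2 = r2_score(y_test, predictions)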

Anomaly detection abnormal entities and events Dimensionality Reduction complex, high-dimensional dataset overfitting computational intensity

Another important problem solved by unsupervised learning is Anomaly detection -- used to detect abnormal entities and events, like the ones in the ECG signal shown on the picture. And lastly, there is Dimensionality Reduction -- used to reduce complex, high-dimensional datasets to a simpler representation. We might do this to minimize overfitting, or to reduce the computational intensity, or just to be able to visualize complex data in 2D.

Activation function.

Artificial neural networks are built from artificial neurons. Just like human neurons, artificial neurons have input pathways, bringing the signals to the neuron, and output pathways, leading the resulting signal further. After the values of the input signals have been aggregated (usually just summed up), the final output value of the neuron is defined by a function. What is this function called?
- Transfer function.
- Activation function.
- Input function.

simple and fast normally normality outliers slower standard

As far as the standard algorithms that we use for anomaly detection go, some common ones are Robust covariance, Isolation Forests and One-Class Support Vector Machines. They all have their pros and cons:
- Robust covariance is simple and fast, but works only for normally distributed data.
- One-class SVM doesn't require normality, but is very sensitive to outliers.
- Isolation Forests are the most powerful among the three, but 3-10 times slower.
Still, if your computational power allows it, Isolation Forests are indeed the standard go-to algorithm for this purpose today.
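A minimal Isolation Forest sketch in scikit-learn (the contamination value and feature matrix X are assumptions):
from sklearn.ensemble import IsolationForest
detector = IsolationForest(contamination=0.01)  # assumed fraction of anomalies in the data
labels = detector.fit_predict(X)                # -1 marks anomalies, 1 marks normal points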

SUPERVISED false labelled confusion matrix Precision and Recall scores percentage correct successfully

As we have seen with clustering, a model constructed using unsupervised learning can still be evaluated using SUPERVISED methods and metrics -- and that is exactly the case with anomaly detection. Ultimately, we are interested in how good it is at detecting anomalies, as opposed to the number of false alarms it raises -- so we have to have SOME amount of labelled anomalies, at least for the evaluation phase. The perfect tool for this task is the so-called confusion matrix, from which we can then derive Precision and Recall scores. Let's imagine we've built an Anomaly Detection model for heart rate monitoring. Precision tells us the percentage of times that our algorithm is correct when it signals that arrhythmia is detected. Recall, on the other side, tells us what percentage of arrhythmias we have successfully detected. These are extremely important metrics in Machine Learning in general, so it's critical to clearly understand the difference between them.
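These evaluations are one-liners in scikit-learn; a sketch assuming labelled test anomalies y_true and model outputs y_pred:
from sklearn.metrics import confusion_matrix, precision_score, recall_score
cm = confusion_matrix(y_true, y_pred)        # counts of true/false positives and negatives
precision = precision_score(y_true, y_pred)  # how often an "arrhythmia" alarm is correct
recall = recall_score(y_true, y_pred)        # what share of real arrhythmias we caught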

model technical theoretical algorithm

Configure the model
Now that our BUSINESS problem is clearly defined, we can start working on the technical aspect of our AI solution. We do so by answering the following questions. First, what is the technical nature of my problem in the broadest sense? Second, what is the appropriate theoretical model for my problem? Finally, what specific algorithm would have the best performance on my dataset? These are big decisions which we will address in more detail in the coming lessons and exercises -- but once these questions are answered, we can proceed to model fitting.

WHERE SHAPE SIZE decision boundary.

Creating Classification Models
So what does it mean to create a classification model? Let's build some intuition. To stick with our analogy -- it means creating those boxes for each category, deciding WHERE they will be built and in WHAT SHAPE AND SIZE, all based on the data provided in the training set. We also call this line or surface "the decision boundary".

SELECTION EXTRACTION reduction subset predictive COMBINED transformation features linear and non-linear

Dimensionality reduction can be divided into feature SELECTION and feature EXTRACTION. Feature SELECTION is the simplest form of reduction and it consists of selecting the subset of most predictive features, without applying any transformation. As simple as it sounds, it is not a trivial problem, because we need to select the features that carry the most information when COMBINED. It's like making a basketball team: the five best players in the NBA rarely make a winning team when put together. On the other side, we have feature EXTRACTION, which ALWAYS involves some kind of transformation and often combines multiple features into one by applying mathematical operations called linear and non-linear projections.

variables principal processing existing information

Dimensionality reduction is the process of reducing the number of variables under consideration by obtaining a set of principal variables. We usually apply it as a pre-processing step, when there is a strong need to represent existing data using a smaller number of variables, while keeping as much information as possible.

classification categories classes variables

In human terms, classification is the process of putting things in boxes -- actual or imaginary. In technical terms we call these "boxes" "categories" or "classes" and we call such data "categorical variables". Such variables are all around us: colors, types of planets, flavors of ice cream... The category assigned usually determines the course of action for the given object, so it's very important to get it right.

Dense fully connected dimensional image position classical Dropout regularization overfitting sub-sampling

Layers and layers
In the previous lesson we already saw the most fundamental layer, called the "Dense layer" or "fully connected layer", but there are many more types with different functions and properties. For example, Convolutional layers can operate with multi-dimensional data and help us extract important image features irrespective of their position, which is not possible with classical Dense layers. Dropout layers, on the other hand, have a regularization function, preventing overfitting by randomly turning off a fraction of the nodes from the preceding layer. Another way to fight overfitting is sub-sampling, also called "pooling", where we basically reduce the dimensionality of the data by aggregation - very similar to reducing the resolution of a picture. Finally, we sometimes just want to perform simple signal wrangling operations, like flattening it from a higher-dimensional to a single-dimensional space, which we usually do right before calculating the final output of the network.

classifier SEPARATING Regression ALONG data

Now, if a classifier was trying to find the line SEPARATING data of different classes, a Regression model's objective is different: it aims to construct a line that goes ALONG the data points, as much as it's physically possible. That's why we sometimes also call it "line" or "curve fitting".

Principal Component Analysis is an example of a linear dimensionality reduction algorithm.

Principal Component Analysis
To which family of dimensionality reduction algorithms does Principal Component Analysis belong?
- Principal Component Analysis is an example of a linear dimensionality reduction algorithm.
- Principal Component Analysis is an example of a non-linear dimensionality reduction algorithm.

classifiers CATEGORIES QUANTITIES numerical categorical

Regression models have a quite unintuitive name, but their essence is very simple: while classifiers help us predict CATEGORIES of things, regression models help us predict QUANTITIES. Like, what will be the temperatures in the next week? What are the chances of the Denver Nuggets winning the NBA playoffs? How many customers will I gain and lose in the next 6 months? Similar to classification, the inputs to these models can be both numerical and categorical - but if the prediction output is a number, it's a regression model.

Supervised common machine predict employee performance product buy repay models predict categories or quantities measurements pictures training outputs picture output labels

Supervised learning is the most common flavor of machine learning in use today. Companies use it to predict employee performance, what product you're likely to buy next, whether you are likely to repay the loan you are applying for, and much more. We use it to build models that predict categories or quantities based on some input measurements. So, if we are making a Fruit and Vegetable recognizer, the training inputs will be pictures and the training outputs the labels stating which fruit or veggie is in the picture. The usage of output labels during training is where the name "supervised" comes from.

Regression output quantity length Classification categories

There are two major problem types in supervised learning: Regression problems, when the output of interest is a quantity -- such as length, weight or oil prices; and Classification problems, where we want to predict categories, such as "metal or plastic", "positive or negative review".

building blocks short dendrites signals neuron long axon aggregated mathematical weight input transfer activation linear sum value output

To understand neural nets as a whole, we need to first understand their building blocks. Human neurons usually have a bunch of short "branches" called dendrites, which bring the signals to the neuron, where they get aggregated and led off by the long branch called the axon. An "artificial neuron" is the exact same thing, expressed in mathematical terms: we have input branches with their weight factors, which define the impact of each input. Then we have a transfer function which aggregates all the weighted inputs and is usually just a regular sum. And finally there is the so-called "activation function", which is usually some non-linear function that converts the sum of weighted inputs into the neuron's final value at the output.
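Expressed as a tiny sketch in code (the sigmoid is just one common non-linear choice for the activation function, and the input and weight values are made up):
import numpy as np

def artificial_neuron(inputs, weights):
    total = np.dot(inputs, weights)  # transfer function: the weighted sum of the inputs
    return 1 / (1 + np.exp(-total))  # activation function: squash the sum into the output value

output = artificial_neuron(np.array([0.5, -1.2, 3.0]), np.array([0.4, 0.1, -0.7]))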

fit_polynomial(degree=1)

Too much of a good thing
Let's inspect the issue of overfitting in a more visual way. Say you have measured the temperature in your office for several days and you want to create a model to describe these oscillations. You have (wisely) decided to tackle this challenge using a polynomial model. Your task is now to find the right value of the hyper-parameter representing the degree of the polynomial. But be careful! As your measurements are quite noisy (because you used a cheap thermometer), you're in danger of overfitting if you exaggerate with model complexity. A fit_polynomial() function has been defined for you.
fit_polynomial(degree=____)

LIME

When you want to know why your model has made a certain decision for a specific single record, you are engaging in so-called "local model interpretation". Which is currently the most popular algorithm for this purpose?
- LEMON
- LIME
- ORANGE
- PAPAYA

specify clusters elbow idea range sum of squared distances center

Yes, many clustering algorithms, including KMeans, still require YOU to specify the number of clusters you are looking for. On the top left, we specify only one cluster, on the top right 3 and on the bottom right 6. But you often have no idea how many clusters you should have and it can easily appear to be a chicken-and-egg problem. Luckily there are solutions, one of them being the elbow method. The idea behind it is that we scan through a range of possible numbers of clusters and measure the so-called "sum of squared distances", which indicates how far each point is from the center of the cluster it is assigned to.
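A sketch of the elbow method with KMeans, where scikit-learn exposes the sum of squared distances as the fitted model's inertia_ attribute (the range of cluster counts and X are assumptions):
from sklearn.cluster import KMeans
sums_of_squares = []
for k in range(1, 10):
    model = KMeans(n_clusters=k).fit(X)
    sums_of_squares.append(model.inertia_)  # sum of squared distances to cluster centers
# plotting sums_of_squares against k reveals the "elbow" -- the point of diminishing returns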

Precision.

You have built an e-mail SPAM filtering algorithm. Which metric is best suited to quantify the percentage of emails your algorithm flags as spam that are actually spam?
- Accuracy.
- Precision.
- Recall.

The R^2 score

You need to estimate the goodness-of-fit of your regression model to your data in a relative, unit-less manner. Which metric should be your first choice?
- The mean absolute error.
- The median absolute error.
- The R^2 score.

Decision Trees Logistic Regression Support Vector Machines Random Forest classifiers

The most common classification algorithms today are Decision Trees, Logistic Regression, Support Vector Machines and Random Forest classifiers. Don't be confused - LogisticRegression is indeed a classification and not a regression model.

