AI Exam 1
what is reinforcement learning?
- finding a strategy for taking a series of decisions in an environment that is usually changing unpredictably. - uses a reward and cost system to maximize rewards. EX: The learning system, called an agent, observes the environment, selects and performs actions, and gets rewards/penalties in return. It learns a strategy by itself, called a policy. A policy defines the action to take in a given situation.
what is the goal of unsupervised learning?
- goal of unsupervised ML is to learn and generate distinctive groups or clusters of data points in a dataset. - Data given is only x; no y.
Determine the difference between regression and classification
- if target = numerical, it is a regressor; draw a line through the data -if target = categorical, it is a classifier; draw a boundary between clusters
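A minimal sketch of the distinction using scikit-learn; the toy data and model choices here are illustrative assumptions, not from the notes:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data: X = hours studied; one numerical target, one categorical target
X = [[1], [2], [3], [4], [5]]
scores = [52, 60, 68, 77, 85]   # numerical target -> regression
passed = [0, 0, 1, 1, 1]        # categorical target -> classification

regressor = LinearRegression().fit(X, scores)      # draws a line through the data
classifier = LogisticRegression().fit(X, passed)   # draws a boundary between classes

print(regressor.predict([[6]]))   # predicted score (a number)
print(classifier.predict([[6]]))  # predicted class (0 or 1)
```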
what does feature engineering do?
- improves the performance of the model by selecting the right features and preparing the features in a way that is suitable for the machine learning model. - heavily dependent on the experience and expertise of the data scientists conducting the analysis.
what is bias?
- bias is the difference between our actual and predicted values. - Bias reflects the simplifying assumptions our model makes about the data in order to predict on new data.
how do you get a good fit in ML?
- look at performance of ML model over time with the training data - if model trained too long, can learn the unnecessary details and noise in the training set and lead to overfitting - to get a good fit, need to stop training where error on the test set starts to increase.
why does model size and packaging affect model deployment?
- model size plays a role in how we plan to package the model. - a smaller model can be packaged faster and contained in a Docker container
what are the three common approaches for converting ordinal and nominal variables to numeric?
- ordinal encoding - one hot encoding -dummy variable encoding
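A small illustration of all three approaches, assuming pandas and scikit-learn are available; the example values are made up:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"quality": ["poor", "good", "excellent"],
                   "platform": ["instagram", "twitter", "facebook"]})

# Ordinal encoding: preserves order (poor < good < excellent -> 0, 1, 2)
ord_enc = OrdinalEncoder(categories=[["poor", "good", "excellent"]])
df["quality_num"] = ord_enc.fit_transform(df[["quality"]])

# One-hot encoding: one binary column per category, no implied order
one_hot = pd.get_dummies(df["platform"], prefix="platform")

# Dummy variable encoding: one-hot with the first column dropped
dummy = pd.get_dummies(df["platform"], prefix="platform", drop_first=True)
```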
why does data and concept drift affect model deployment?
- over a span of time real-world data keeps changing and may not be reflected in the model
what are the two most common approaches for dealing with missing values?
- removal: simply remove any observations (rows) where one or more missing values are present -imputation: input or impute replacement values where they were originally missing
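A quick pandas sketch of both approaches; the tiny DataFrame is a made-up example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [50000, 62000, np.nan]})

# Removal: drop any row with one or more missing values
removed = df.dropna()

# Imputation: replace missing values, here with each column's mean
imputed = df.fillna(df.mean())
```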
what are the stages of a simple ML model lifecycle?
- scoping - data collection - data engineering - model training - model validation - deployment - monitoring
what is model selection?
- the challenge of choosing a model among many that relate to your specific problem - the process of selecting the "best," from among a collection of machine learning models.
what is feature extraction?
- the process of extracting new features from the existing attributes - primarily concerned with reducing the number of features in the model.
what is data cleaning?
- to correct what is incorrect -error may be caused by human input (spelling, formatting, data missing)
what is rolling updates deployment?
- updating all instances of your model one by one - useful when you want to make a quick update of the entire model line with a new version
what is shadow deployment?
- used to test new version of model with production data -A copy of the user request is made and sent to your updated model, but the existing system gives the response
how does encoding contribute to data preparation?
- variables that are not numeric (unstructured/categorical variables) must be coded numerically
Machine Learning
A subset of AI techniques that use statistical methods to enable machines to improve with experience; the study of algorithms that improve their performance (p) at a task (t) with experience (e); solves a prediction problem: given an input X, predict an appropriate output Y
Learning Example
Spam detection. Input: incoming mail. Output: spam or not spam. This is a binary classification problem because there are only 2 possible outcomes.
Bernoulli Distribution
The probability distribution of a random variable with two possible outcomes, each with a constant probability of occurrence.
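A small sketch using scipy.stats (assumed available) to make the definition concrete:

```python
from scipy.stats import bernoulli

p = 0.3                       # constant probability of "success"
print(bernoulli.pmf(1, p))    # P(X = 1) -> 0.3
print(bernoulli.pmf(0, p))    # P(X = 0) -> 0.7
samples = bernoulli.rvs(p, size=10)  # ten simulated two-outcome trials
```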
what is an outlier generally?
any data point that is very different from the majority; can be fixed by removing the row containing the outlier or simply replacing its value
what is irrelevant data?
anything that isn't related to the problem you're looking to solve
computer scientist
applies concepts from computer science to create efficient solutions
what is the curse of dimensionality?
as the dimensionality of the feature space increases, the number of configurations can grow exponentially and thus the number of configurations covered by an observation decreases
what does it mean to estimate accuracy?
When you are building a predictive model, you need to evaluate the capability of the model on unseen data. This is typically done by estimating accuracy using data not used to train the model
what is a model parameter?
a configuration variable that is internal to the model and whose value can be estimated from data.
what is a data dictionary?
a glossary of terminology relevant to the project; what each data entry is, its format, etc
what is continuous probability distribution?
a probability distribution showing all the possible outcomes and associated probabilities for a given event
what are AI technologies driven by?
data and analytics
what are the three major components of a machine learning system?
data, models and learning
what is a cumulative distribution function?
describes the cumulative probability that a random variable falls below, above, or between two points
data engineer
designs and builds pipelines that transform and transport data into a format so that, by the time it reaches the data scientists or other end users, it is in a usable state
what is computer vision?
enables computers and systems to derive meaningful info from digital images, videos and other visual inputs and take actions or make recommendations based on that info
what does the CLT establish
establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.
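A simulation sketch of the CLT with NumPy, using uniform draws as one example of a non-normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
# Each sample mean averages 50 draws from a (non-normal) uniform distribution
means = [rng.uniform(0, 1, 50).mean() for _ in range(10_000)]

# The distribution of the means is approximately normal:
print(np.mean(means))  # close to 0.5, the uniform's mean
print(np.std(means))   # close to sqrt(1/12)/sqrt(50) ~ 0.041
```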
computer vision
extracts and understands info from images and videos
what is bias-variance trade off?
find the perfect balance between bias and variance; ensures that we capture the essential patterns in our model while ignoring the noise present in it.
what is meant by good models?
"good" means the model performs well on unseen data; this requires us to define some performance metrics, such as accuracy or distance from ground truth, as well as figuring out ways to perform well under these performance metrics.
rules based programming approach
handcrafted knowledge - where programmers craft sets of rules to represent knowledge in well-defined domains issues: -labor-intensive -cannot generalize to unanticipated input combos (prediction problematic) -doesn't naturally handle uncertainty
what is unstructured data?
has some implicit structure, but doesn't follow a specified format.
what is the main purpose of EDA?
help ID obvious errors, better understand patterns within data, detect outliers or anomalous events, and find interesting relations among variables
AI hardware
includes physical computer component requirements to achieve increased processing efficiency and/or speed
data collection from different sources
internal and/or external sources, to make sure we have the right data in correspondence with the business requirements/problems
what does data splitting involve?
involves partitioning the data into: 1. An explicit training dataset used to prepare the model (train: the algorithm learns from the data pattern to develop the model). 2. An (unseen) test dataset used to evaluate the model's performance (test: use the developed model from the last step to predict the target variable on the test dataset, then evaluate the model's performance by comparing the predicted value with the actual value of the target variable).
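A minimal sketch of this split with scikit-learn; the iris dataset and logistic regression are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Partition into an explicit training set and an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # learn from training data
y_pred = model.predict(X_test)                                   # predict on unseen data
print(accuracy_score(y_test, y_pred))  # compare predicted vs. actual target values
```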
what can we use to perform unsupervised learning?
k-means clustering
data scientist
leads research projects to extract valuable info from big data.
what is continuous data?
numeric, but exists in fractional form; represents info that can be divided to a more granular level EX: Lebron James is 2.06m tall v. 2.064759m tall
machine learning approach
statistical learning - where programmers create statistical models for specific problem domains and train them on data; in contrast to the rules-based approach, the machine learns on its own
Stats v. ML?
statistics is data plus analytical theory and machine learning is data plus computable structures
what does data science do?
structures big data, finding the best patterns, and then advising businesspeople to make the changes that would work best for their needs. It includes the transformation, ingestion, collection, and retrieval of large quantities of data, which is referred to as Big Data.
Deep Learning
subset of ML which makes the computation of multi-layer neural networks feasible
what is SVM?
supervised learning models that assign new examples to one category or the other, making them non-probabilistic binary linear classifiers.
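A short scikit-learn sketch; the synthetic data is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

clf = SVC(kernel="linear")  # linear kernel -> a linear decision boundary
clf.fit(X, y)
print(clf.predict(X[:5]))   # assigns each example to one category or the other
```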
what does MLOps workflow involve?
supporting data collection and processing, experimentation, evaluation and deployment, and monitoring and response.
Again, what is machine learning?
teaching computers how to perform a task without having to program them to do it
what is big bang-recreate deployment?
tear down the existing deployment for the new one to be deployed
what is NLP?
the part of computer science and AI that can help in communicating between computers and humans by natural language; enables a computer to read and understand data by mimicking human natural language; EX: GPS, Siri
what is joint probability?
the probability of the intersection of two events
what is conditional probability?
the probability that one event will occur given that some other event has occurred
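A worked toy example (two fair dice, a made-up scenario) showing joint and conditional probability in Python:

```python
# All 36 equally likely outcomes of rolling two fair dice
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

A = {o for o in outcomes if o[0] == 6}           # first die shows 6
B = {o for o in outcomes if o[0] + o[1] >= 10}   # sum is at least 10

def p(event):
    return len(event) / len(outcomes)

joint = p(A & B)            # P(A and B): intersection of the two events -> 3/36
conditional = joint / p(B)  # P(A | B) = P(A and B) / P(B) -> 0.5
print(joint, conditional)
```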
what is standardization?
the process of converting the data into a uniform format.
what is variance?
the variability in the model prediction—how much the ML function can adjust depending on the given data set.
what is the objective of a data dictionary and schema?
to create a product demand function that suggests optimal timing (when) and depth (percent) of markdowns to realize the highest product margin.
what is the goal of Feature engineering?
to create new features by combining several features that we expect to be important based on our human knowledge of the problem.
what is the goal of learning?
to find a model and its corresponding parameters such that the resulting predictor will perform well on unseen data.
Why is data cleaning necessary?
to preprocess data by correcting entries that are incorrect, improperly formatted, duplicated, or irrelevant, and by handling missing values and outliers in the data.
what is a learner?
trained algorithm that successfully moves from individual examples to broader generalizations
what is supervised learning?
try to predict either a categorical target variable or a numerical target variable. EX: given an object with a set of known, observed measurements, predict the value of an unknown or target variable
what is structured data?
typically stored in traditional relational databases and refers to data that has a defined length and format.
natural language processing
understanding and using data encoded in written language
what is data extraction?
the process of converting unstructured or semi-structured data into structured data.
What is a Poisson distribution?
used to model the number of events occurring within a given time interval
what is a/b testing deployment?
used to understand what users prefer and which model might work better for them
what is binomial distribution?
used when there are exactly 2 mutually exclusive outcomes of a trial labeled success or failure
what is a predictor as a function?
when given a particular input example, produces an output. We have represented this as Y = F(X), where Y is the target outcome, F is the function (algorithm) that relates X to Y as trained, and X is the new instance. We represented the linear function F as Y ≈ β0 + β1X, where β0 is the intercept and β1 is the slope.
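A small sketch fitting such a linear predictor F with scikit-learn; the data points are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])   # input instances
Y = np.array([3.1, 4.9, 7.2, 8.8])   # target outcomes

F = LinearRegression().fit(X, Y)     # trains F so that Y ~ b0 + b1*X
print(F.intercept_)      # b0, the intercept
print(F.coef_[0])        # b1, the slope
print(F.predict([[5]]))  # output for a new instance X = 5
```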
what is low variation data?
where a column in your data set contains only one or a few unique values
what is underfitting?
where the model cannot find patterns in our training set and hence fails for both seen and unseen data
how are models deployed in practice?
with a REST API (representational state transfer): an API that conforms to the design principles of the REST architectural style.
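A minimal sketch of one possible prediction endpoint, assuming Flask is installed and a trained model was saved to "model.pkl" (both are assumptions, not details from these notes):

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # hypothetical saved model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[1.0, 2.0]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```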
How do you pick the best model?
·A model that meets the requirements and constraints of project stakeholders. ·A model that is good given the time and resources available. ·A model that is skillful as compared to naive approaches, e.g., Excel models. ·A model that performs well relative to other tested models. ·A model that is skillful relative to the state-of-the-art.
develop future state requirements
•Define what the organization's data and analytics structure would ideally look like in the short term and in the long term
create an enterprise data model (EDM)
•EDM is an integrated view of the data produced and consumed across an entire organization •EDM determines the structure by which data is governed and how it relates to the various aspects of the organization
embrace continuous process improvement
•Encourage continuous process improvement (CPI) using incremental enhancements and breakthroughs •Implement a feedback mechanism and put processes in place for its rapid implementation
emphasize rapid prototyping
•Encourage rapid prototyping of solutions and an iterative approach to process improvement •Encourage incremental enhancements to existing processes in order to reach mature processes and technologies
what is continuous uniform distribution?
•All outcomes within a given interval are equally likely •Forms the basis for sampling from more complex distributions
conduct a gap analysis
•Identify people, processes and technologies that are required to move from a current state to the desired state
what are some ways to tackle underfitting?
•Increase the number of features in the dataset. •Increase model complexity. •Reduce noise in the data. •Increase the duration of training the data.
obtain leadership and stakeholder commitment
•Leadership commitment is essential for this momentous task •Multiple stakeholders need to be brought on board because of the interdisciplinary nature of this task
what is the prediction function?
- A prediction function takes input x and produces an output y. - Machine learning is about finding the best prediction function.
what happens when you have an underfit model?
- An underfit model has poor performance on the training data and will result in unreliable predictions. - Underfitting occurs due to high bias and low variance.
Role/Responsibility of a Chief analytics or Data officer
- Create a vision for AI in the company. - Identify business-driven use-cases. - Determine the appropriate level of ambition. - Create a target data architecture. - Manage external innovation. - Develop and maintain a network of AI champions.
how do we train a model?
- Distill training data into model parameters - Parameters: * beta coefficients for linear model * tree structure (split for decision tree) - Hyperparameters: num trees, K clusters, learning rate -A model = learned algorithm + parameters
what are the sources of error in assessing model accuracy?
- Model underfitting, too weak or simple, does not capture X Y relationship. - Model overfitting, model too specific to training data, does not generalize well.
Why are model parameters important for ML?
- Parameters are key to machine learning algorithms. They are the part of the model that is learned from historical training data. - Given an input for the feature vector X, the values of the model parameters (which are learnt from the training data) allow the output variable y to be computed. - They are required by the model when making predictions. - The practitioner does not manually set them. - They are often saved as part of the learned model. - Often model parameters are estimated using an optimization algorithm, which is a type of efficient search through possible parameter values. - Think of the model as the hypothesis and the parameters as the tailoring of the hypothesis by the data.
what are the different types of machine learning?
- Supervised learning - Unsupervised learning -Reinforcement learning
data cleaning and feature engineering
- Understand the dataset and clean up the given dataset. - Understand the features and the relationships between them. - Extracting essential variables and leaving behind/removing non-essential variables. - SELECTING, TRANSFORMING, EXTRACTING, COMBINING AND MANIPULATING RAW DATA
what is overfitting?
- When a model performs very well for training data but has poor performance with test data (new data); - the machine learning model learns the details and noise in the training data such that it negatively affects the performance of the model on test data. - can happen due to low bias and high variance.
what is blue/green deployment?
- a server swap; there are 2 identical systems available - user requests are routed to the newer system, swapping it in for the older one - used mostly in application/web scenarios
What is discrete data?
- also numeric, also exists in whole form; represents info that is countable and cannot be divided into smaller forms EX: Cristiano Ronaldo's total scored points for the season; they cannot be broken down any further
what is an API?
- application programming interface - a set of rules that define how applications or devices can connect to and communicate with each other
what is one-hot encoding
- assigns a numeric vector to the value of a nominal variable - there is no exact order to this
what is ordinal encoding?
- assigns a numeric value to the value of an ordinal variable - this value preserves an order amongst the values EX: poor, good, excellent -> 1, 2, 3
what are independent events?
- can occur at the same time - can have intersection - and events - multiplication rule
what are mutually exclusive events?
- cannot occur at the same time -no intersection - or events - addition rule
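A toy numeric illustration of the two rules; the coin and die probabilities are standard textbook examples, not from the notes:

```python
# Independent events -> multiplication rule: P(A and B) = P(A) * P(B)
p_heads = 0.5          # coin flip
p_six = 1 / 6          # die roll; independent of the coin
print(p_heads * p_six)        # ~0.083

# Mutually exclusive events -> addition rule: P(A or B) = P(A) + P(B)
p_one, p_two = 1 / 6, 1 / 6   # a die can't show 1 and 2 at the same time
print(p_one + p_two)          # ~0.333
```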
robot characteristics
- consist of some sort of mechanical construction; this helps it complete tasks in the environment for which it's designed -need electrical components that control and power the machinery - contain some level of computer programming
why does traffic and requesting routing affect model deployment?
- depending on the traffic and the type of model, you have to decide on either real-time inferencing or batch model deployment
what is a data dictionary, again?
compiles all of the data about the data elements in the model
what is classical statistics?
whereas classical statistics is concerned with developing models that characterize, explain, and describe phenomena, machine learning is primarily concerned with prediction.
planning/control
contain processes to identify, create and execute activities to achieve specified goals
ensure that data.....
1. Intended for specific use cases and algorithms. 2. Helps make the model more intelligent. 3. Speeds up decision making.
machine learning
contains a broad class of computational models that learn from data
what is canary deployment?
- deploy the update to the existing system and expose users partially to the new version - a smaller % of users will use the updated model and the rest will use the old version
what is EDA?
- exploratory data analysis: used to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. - Used to discover patterns, spot anomalies, test a hypothesis, or check assumptions
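A typical pandas-based EDA sketch; the file name data.csv is hypothetical:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical dataset

print(df.head())      # inspect the first rows
df.info()             # column types and missing-value counts
print(df.describe())  # summary statistics to spot outliers/anomalies
print(df.corr(numeric_only=True))  # relations among numeric variables

df.hist(figsize=(10, 8))  # visualize distributions (requires matplotlib)
```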
what is ordinal data?
-another type of categorical data; does contain an underlying order or ranking; they generally only show the sequence, not the scale EX: t-shirt sizes; small < medium < large
What is Gaussian distribution?
-arises naturally in many processes in our everyday life -> central limit theorem
what is nominal data?
-categorical data; the data points do not have an order EX: social media names like instagram, twitter, FB
why does model retraining and versioning affect model deployment?
-how often a model is retrained impacts development strategy because you need to compare model performance, update, and possibly maintain different versions
computer vision tasks
-image classification -object detection -object tracking -content-based image retrieval
what are the major areas of AI?
-knowledge processing -speech -AI hardware -Evolutionary computation -natural language processing -machine learning -vision -planning/control
what are the built in data structures of python?
-list -dictionary -tuple -set
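One-line examples of each, with illustrative values:

```python
scores = [88, 92, 75]                    # list: ordered, mutable
student = {"name": "Ada", "score": 92}   # dictionary: key-value pairs
point = (3.0, 4.0)                       # tuple: ordered, immutable
labels = {"spam", "not spam"}            # set: unordered, unique values
```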
what should you consider when choosing a model?
-performance -how long model takes to train -how easy it is to explain to project stakeholders
What are some NLP use cases?
-virtual agents and chatbots -machine translation -social media sentiment analysis -text summarization
what are the steps to end-to-end process flow?
1) collection of data from various sources 2) data cleaning and feature engineering 3) model building for selecting correct ML algorithm 4) evaluate model 5) model deployment
what are the Key Steps in Creating a Center of Excellence in Data Science and Driving Organizational Adoption
1) define a vision 2) obtain leadership/stakeholder commitment 3) evaluate current state 4) develop future state requirements 5) conduct a gap analysis 6) create an implementation roadmap 7) establish a data governance structure 8) create an enterprise data model 9) emphasize rapid prototyping 10) embrace continuous process improvement
what are the fundamental rules data sets must follow before their use in models?
1.All data must be numeric. 2.There can't be any missing values. 3. Must delete or derive numeric features from nonnumeric features, such as strings, dates, and categorical variables. 4.Even with purely numeric data, there is potential cleanup work, such as deleting or replacing erroneous/missing entries or even deleting entire records that are outside our business rules.
what are some examples of probability distributions that are discrete?
1.Bernoulli Distribution 2.Poisson Distribution 3.Binomial Distribution
what are some examples of probability distributions that are continuous?
1.Gaussian Distribution 2.Exponential Distribution 3.Continuous Uniform Distribution
You have gathered data, cleaned the data, performed EDA, explored various algorithms and created the final model. Now What?
1.Model needs to be integrated into an existing production environment so that it can be used for making predictions which will aid in decision making. 2.After a model is deployed in the production environment, when it is given an input, in a production environment, the model provides a prediction for the value of output variable for the given input.
How does Kmeans find similar baskets?
1. Randomly chooses initial centroids. 2. Measures the distances between each data point (in our case, each basket) and the centroids. 3. Sums the distances. 4. Finds new centroids by moving to the average of the points in each cluster, and repeats. Goal - find the centroids with the smallest distances (Within Cluster Sum of Squares)
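A sketch using scikit-learn's KMeans, which implements these steps; the basket coordinates are made up, and inertia_ is sklearn's name for the WCSS:

```python
import numpy as np
from sklearn.cluster import KMeans

baskets = np.array([[2, 1], [1, 2], [2, 2],    # one group of similar baskets
                    [8, 9], [9, 8], [8, 8]])   # another group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(baskets)
print(km.labels_)           # cluster assignment per basket
print(km.cluster_centers_)  # final centroids
print(km.inertia_)          # Within Cluster Sum of Squares (WCSS)
```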
python is a .....
an interpreted language: an interpreter translates the code, based on its syntax, into machine-executable instructions
what is python?
A multi-purpose language: - Data Analysis. - AI/ML. - Automation. - Web development (server-side). - Software development.
evolutionary computation
contains a set of computational routines using aspects of nature and evolution
what is feature scaling?
Features with very different scales can affect the regularization of ML models, and can also make the learning procedure itself slow. The goal of normalization is to transform the feature values into a similar (or identical) range.
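A short sketch of two common scalers in scikit-learn; the feature values are invented:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Height (m) vs. income ($): two features on very different scales
X = np.array([[1.80, 50_000], [1.65, 72_000], [1.75, 61_000]])

print(StandardScaler().fit_transform(X))  # standardize: mean 0, std 1 per feature
print(MinMaxScaler().fit_transform(X))    # normalize: rescale each feature to [0, 1]
```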
who operationalizes ML?
Data/ML engineers operationalize ML, i.e., deploy and maintain ML pipelines in production.
data science v. ML
Data science can be viewed as an incorporation of several different parent disciplines, including data engineering, software engineering, data analytics, machine learning, business intelligence, predictive analytics, and more.
what is dummy encoding?
like one-hot encoding with one column dropped: if a value is not x and not y, it must be z, so the dropped category is implied
how do you set the business objective?
START WITH A QUESTION 1) determine business objectives * background * business objectives * business success criteria 2) assess situation * inventory resources * requirements, assumptions, and constraints * risks and contingencies * terminology (data dictionary) * costs and benefits 3) project plan * project plan *assessment of tools and techniques
What is the definition of AI?
activity devoted to making machines intelligent, and intelligence is that quality that enables an entity to function appropriately and with foresight in its environment; any technique which enables computers to mimic human behavior
what is feature selection?
adding or removing features from the model, ensuring that features are only added or removed if it results in an improvement in model performance
what are some of the most common ML problems?
classification and regression
what do data scientists really spend most of their time doing?
cleaning and organizing data and collecting data sets
what are the two predictive approaches in models as functions?
predictor or probabilistic
what is important to remember in terms of data collection?
privacy, diversity/ neutrality, credibility, and quality of the data.
what is marginal probability?
probability of a single event; If A is an event, the marginal probability is the probability of that event occurring P(a)
what is encoding?
process of converting categorical variables to numerical variables
what is learning?
process of converting experience into expertise or knowledge; learning system that is enabled to use that expertise or knowledge gained when it is confronted with new info
what is operationalizing ML or MLOps?
process of the continual loop of (i) data collection and labeling, (ii) experimentation to improve ML performance, (iii) evaluation throughout a multi-staged deployment process, and (iv) monitoring of performance drops in production.
what is data transformation?
process of transforming the data from one layout to another; doesn't change original meaning of the data
what does good data preparation do?
produces clean and well-curated data which leads to more practical, accurate model outcomes.
what are robots?
programmable machines that assist humans or mimic human actions
what does probability do?
provides a language for quantifying uncertainty
what does the probability theory provide?
provides us with the tools and techniques to work with uncertain phenomena
what are the two problems of supervised ML?
regression and classification
knowledge processing
representing and deriving facts about the world and using this info in automated systems
what is duplicate data?
rows of data that are exactly the same across all columns; add to storage and processing
what is incorrect data?
self explanatory; can be hard to spot
what is unsupervised learning?
solves a complementary set of problems that do not require labeled data.
what is kmeans clustering?
specify the number of clusters (K) that we wish to cluster the data into. - Often, we wish to find groupings or patterns in our data. -Datapoints in the same cluster are deemed to be similar under some measure.
speech
speech recognition includes techniques to understand a sequence of words given an acoustic signal
evaluate the current state
•Assess current technical infrastructure •Review existing business functions, activities, roles of existing stakeholders and technology implementations
establish a data governance structure
•Clearly assign data-related responsibilities, oversight and ownership of tasks •Encourage adoption of new technology and processes •Standardize data processes across the organization
what are the reasons for underfitting?
•Data used for training is not cleaned and contains noise (garbage values) in it. •The model has a high bias. •The size of the training dataset used is not enough. •The model is too simple.
what are the reasons for overfitting?
•Data used for training is not cleaned and contains noise (garbage values) in it. •The model has a high variance. •The size of the training dataset used is not enough. •The model is too complex.
why do we need to construct multiple versions of the model?
•Testing •User experience research •Change in market environment •Model update
what are some ways to tackle overfitting?
•Using K-fold cross-validation. •Using regularization techniques such as Lasso and Ridge. •Training model with sufficient data. •Adopting ensembling techniques.
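A sketch combining two of these remedies, regularization and K-fold cross-validation, with scikit-learn on synthetic data (an assumption for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)

# Ridge/Lasso penalize large coefficients (regularization);
# cv=5 runs 5-fold cross-validation on held-out folds to check generalization
for model in (Ridge(alpha=1.0), Lasso(alpha=1.0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())
```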
what does the deployment of the model in production on a technical level usually involve?
•an API endpoint gateway •a load balancer •a cluster of virtual machines •a service layer •persistent data storage (Database) •the model itself.
define a vision
•For the goals in the data analytics domain that your organization should achieve in the short and long term •This vision statement acts as a guideline for the next steps
skillset for a data scientist
●A deep knowledge of machine learning algorithms. ●Proficiency with statistics and probabilistic reasoning. ●Proficiency with python, R and other computer languages used for machine learning. ●Proficiency with various machine learning frameworks such as scikit-learn. ●Pros and cons of existing machine learning techniques. ●A deep knowledge of AI literature, algorithms and how existing machine learning techniques can be adapted to the problem at hand. ●Ability to work at the interface of computer science, mathematics and machine learning.
skillset for computer scientist
●A detailed understanding of computer architecture. ●In-depth knowledge of operating systems, what functionality it provides and the hardware-software interface. ●Ability to program computer systems using various programming languages and create new software products. ●Develop underlying computer concepts on which data engineer can build: for example, development of new concepts for efficient data storage. ●Development of underlying concepts and tools on which data scientist can build and deploy.
skillset for data engineer
●Extensive knowledge of database concepts. ●Extensive knowledge of various types of fault tolerant architectures used in database design. ●Detailed knowledge of data layouts on various types of data storage systems and how they store information. ●An understanding of various performance metrics to make efficient use of these layouts. ●Deep understanding of how databases are accessed on a computer network. ●In contrast to a data scientist who requires detailed knowledge of numeric programming and machine learning algorithms, a data engineer requires proficiency with database languages such as SQL.
what are the applications of probability to ML?
●Just like calculus and matrix theory, probability theory is one of the main pillars on which machine learning and AI rest 1.Many classification algorithms such as Naive Bayes are based entirely on probability 2.Many machine learning algorithms such as logistic regression incorporate probability ideas as part of their inference 3.Machine learning algorithms such as Decision Trees use probabilistic ideas under the hood 4.Many AI techniques for Natural Language Processing (NLP) and Speech Recognition are based on probability - examples include parts of speech (POS) tagging using Hidden Markov Models (HMMs) 5.Bayesian Networks, which are based entirely on probability ideas, are a well-known AI technique used for decision making in numerous business applications