Python
Covariance
A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship
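A small sketch of the sign of covariance, assuming NumPy is available; the arrays are made-up illustration data, not from the course material.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_up = np.array([2, 4, 6, 8, 10])    # rises as x rises
y_down = np.array([10, 8, 6, 4, 2])  # falls as x rises

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y)
cov_up = np.cov(x, y_up)[0, 1]
cov_down = np.cov(x, y_down)[0, 1]

print(cov_up)    # a positive number
print(cov_down)  # a negative number
```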
one-hot encoding
All the categories are converted to columns, and in each row just one of these columns is in the 'on/hot' state.
What does the get_dummies() function in pandas do?
Convert categorical variables into dummy/indicator variables.
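A minimal sketch of one-hot encoding with get_dummies(), assuming pandas is installed; the "color" column is a hypothetical example.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One new column per category; each row has exactly one column "on"
dummies = pd.get_dummies(df["color"])
print(dummies)

# drop_first=True keeps k-1 columns by dropping the first level
dummies_k_minus_1 = pd.get_dummies(df["color"], drop_first=True)
print(dummies_k_minus_1)
```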
Which of the following function(s) would be useful in removing the whitespace characters (spaces) from the string below? Country= " England "
Country.replace(' ', '') and Country.strip()
Input: Country = 'United-States-Of-America'; print(Country) >>> 'United-States-Of-America'. Desired output: 'United'
Country.split('-')[0]
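The string cards above can be checked with a short sketch using only built-in string methods (lowercase variable names are used here for illustration).

```python
country = " England "
no_spaces = country.replace(" ", "")  # removes every space in the string
trimmed = country.strip()             # removes leading and trailing whitespace only

name = "United-States-Of-America"
first_part = name.split("-")[0]       # split on '-' and take the first piece

print(no_spaces, trimmed, first_part)
```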
ML
Data contains information and noise; split the data into 2 different sets, a training set and a testing set
Which of the statement(s) is/are correct regarding Data Preprocessing?
It is a term used to describe the collection of approaches to prepare the data for analysis. It involves different techniques like missing value treatment, outlier detection and treatment, variable transformations, etc.
What difference will it make if dropna = False parameter is added in the value_counts() function like data.column.value_counts(dropna = False)
It will return the count of missing values along with the counts of each unique category in a column
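A quick sketch of the dropna difference, assuming pandas and NumPy; the Series is made-up data.

```python
import numpy as np
import pandas as pd

col = pd.Series(["a", "b", "a", np.nan])

without_missing = col.value_counts()           # NaN is ignored by default
with_missing = col.value_counts(dropna=False)  # NaN gets its own row

print(without_missing)
print(with_missing)
```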
Data Science
Learning from data we observe from the real world
Which of the following would be the best transformation when you want to smoothen a skewed distribution?
Log Transformation
What will the following code do? data.isnull().sum().sort_values(ascending=False)
Return the count of missing values column-wise and sort them in descending order
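A minimal sketch of the missing-value count, assuming pandas and NumPy; the columns A, B, C are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [np.nan, 2, np.nan, np.nan],
    "C": [1, np.nan, 3, 4],
})

# Count missing values per column, then list the worst columns first
missing = data.isnull().sum().sort_values(ascending=False)
print(missing)
```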
Which of the following is true about standard scaling?
Standard scaling rescales each feature to have a mean of 0 and a standard deviation of 1; it is usually described as assuming the features are roughly normally distributed. Standard scaling doesn't have a predetermined range to scale to.
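A hand-rolled sketch of standard scaling with NumPy, assuming the population standard deviation (NumPy's default, ddof=0); the values are made up.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Subtract the mean, divide by the standard deviation
scaled = (x - x.mean()) / x.std()

print(scaled)
print(scaled.min(), scaled.max())  # no fixed range, unlike min-max scaling
```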
When a linear regression was trained, it was found that R-squared value was 0.85
The model explains 85% of the variance in the target (dependent) variable
A model is giving a very low error on the training set but a very high error on the test set.
The model is suffering from overfitting
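A rough sketch of overfitting, assuming NumPy; the data, the noise values, and the choice of polynomial degrees are all made up for illustration. A degree-7 polynomial can pass through all 8 noisy training points, so its training error is lower than a straight line's, but its error on new points is larger than its training error.

```python
import numpy as np

x_train = np.arange(8, dtype=float)
noise = np.array([0.5, -0.4, 0.3, -0.6, 0.4, -0.3, 0.6, -0.5])
y_train = 2 * x_train + 1 + noise   # true signal is y = 2x + 1
x_test = x_train[:-1] + 0.5         # unseen points between the training points
y_test = 2 * x_test + 1

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple_fit = np.polyfit(x_train, y_train, deg=1)   # captures the information
complex_fit = np.polyfit(x_train, y_train, deg=7)  # also captures the noise

print("simple  train/test:", mse(simple_fit, x_train, y_train), mse(simple_fit, x_test, y_test))
print("complex train/test:", mse(complex_fit, x_train, y_train), mse(complex_fit, x_test, y_test))
```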
While building a multiple linear regression model, it was found that the addition of a variable decreased the value of the adjusted R-squared. Which of the following statements is correct?
The new variable should not be added in the final model
Boolean
an object that can take a value of True or False
The concat() function will
combine Series (or DataFrames) along an axis, e.g. to create a DataFrame
plotly
a graphing library for interactive plots; the company behind it also sells a commercial ($$) low-code platform for ML & data science apps
Machine learning refers to
computer systems trying to learn about a process using data that represents that process.
def x2p7(x): y = x*x; z = y + 7; return z ... print(x2p7(5)) prints 32, because z is x squared + 7
example of user defined function
We can have elements of different data types in a given NumPy array
false
when you add a method(function) to a class, you must have
self as the first argument of that function
if you want to use an attribute internally in a method, you would reference it as
self.attribute_name
each column of a dataframe is a
series
each list of list is a series
series is a column
array
similar to a list, and stores data in a similar way, but holds elements of a single data type; easier to work with and more robust than a list for numerical work
pyplot
the standard Matplotlib interface for making graphs (usually imported as plt)
A program stores data in variables that represent
storage locations in the computer's memory
A very complex model will perform better on the
training data set than on the test data set.
A model performance is evaluated on the
testing data
Mean Absolute Error (MAE)
the average of the absolute values of the forecast errors
Root Mean Square error
the square root of the average squared deviations of a set of values from a target value; typically used as a measure of overall tracking proficiency
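A short sketch of both error measures using only the standard library; the actual and predicted values are made up.

```python
import math

actual = [3, 5, 2, 7]
predicted = [2, 5, 4, 8]

errors = [a - p for a, p in zip(actual, predicted)]

# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(mae, rmse)
```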
We can create matrices by converting lists of lists.
true
matrix
two dimensional array
type function will provide the
type of the object that's provided as its argument
Float
values are specified with a decimal point and can be both negative and positive (Ex. -5.4, -2.0, 1.1)
To check the data type of a value use the function
type() (Ex. type("Hello World") returns <class 'str'>)
Add comments to code by using
# sign (Ctrl / in windows)
example of a tuple
(12,42,11,99,2351)
Set
Ex. X={1,2,3,4}; an unordered collection of unique items; it can be edited but does not support indexing
List
Ex. X=['a',2,True,'b']; a collection of items of any data type, and it can be edited and supports indexing
Dictionary
Ex. X={1:'Jan', 2:'Feb', 3:'Mar'}; a collection of key: value pairs
Multiple regression
1 dependent variable, DV (Y), and more than 1 independent variable, IV (X)
vector
1 dimensional array
Model building is
1) take data as input; 2) find patterns in the data; 3) summarize the pattern in a mathematically precise way
Which of the following best describes the quantile function of numpy?
Compute the q-th quantile of the data along the specified axis.
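A minimal sketch of np.quantile with and without the axis argument, assuming NumPy; the array is made up.

```python
import numpy as np

data = np.array([[1, 2, 3, 4, 5],
                 [10, 20, 30, 40, 50]])

median_all = np.quantile(data, 0.5)           # over the flattened array
median_rows = np.quantile(data, 0.5, axis=1)  # one value per row

print(median_all)
print(median_rows)
```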
Which of the following describes the tmean function of scipy.stats?
Computes the trimmed mean. This function finds the arithmetic mean of given values, ignoring values outside the given limits.
T/F Imputation is always the best way to deal with missing data.
False
T/F Imputation is really just making up data to artificially inflate results. It's better to just drop cases with missing data than to impute.
False
T/F Log transformation scales the data to a predetermined scale (i.e., always [0-1]).
False
T/F Missing data isn't really a problem if I'm just doing simple statistics like chi-squares and t-tests.
False
T/F We can just impute by the mean for any missing data. It won't affect the results.
False
True or False We have 100 observations for the income of people. A person with an income of $5 million per year will always be considered an outlier.
False
supervised learning
The data has both inputs (IVs) and desired outputs (DVs). The model learns to match the desired output, so I can compare models by how well they match it.
What is an outlier?
In statistics, an outlier is a data point that differs significantly from other observations.
What effect(s) does log transformation have on data?
It can change the shape of the data. It reduces the scale of the data.
Which of the following statement is correct regarding OneHotEncoder?
It encodes categorical features as a one-hot numeric array.
Which of the following is true about the log transformation?
It is most useful on skewed data. It decreases the scale of the distribution.
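A sketch of the effect on skew and scale, assuming NumPy; the data and the hand-written skewness helper are illustrative assumptions, not a library API.

```python
import numpy as np

def skewness(values):
    """Simple moment-based skewness: mean cubed deviation / std cubed."""
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / values.std() ** 3)

raw = np.array([1, 2, 4, 8, 16, 32, 64, 128], dtype=float)  # right-skewed
logged = np.log(raw)                                        # evenly spaced after the log

print(skewness(raw), skewness(logged))
print(raw.max(), logged.max())  # the scale shrinks as well
```

Note that np.log is only defined for positive values, which is why negative data needs care.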
Select the correct statement(s) with respect to the following function: isinstance(param1, str)
It will check whether the first parameter (param1) is a string (str) type object. It is a built-in function of Python.
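A tiny sketch of isinstance() with made-up values; param2 is a hypothetical second variable added for contrast.

```python
param1 = "England"
param2 = 42

is_str_1 = isinstance(param1, str)             # True: param1 is a string
is_str_2 = isinstance(param2, str)             # False: param2 is an int
is_number = isinstance(param2, (int, float))   # a tuple checks several types at once

print(is_str_1, is_str_2, is_number)
```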
What will the following code do? pd.set_option('display.max_columns', None)
It will remove the limit on the number of columns displayed in the Jupyter Notebook
Which of the following is minimized in a linear regression model?
Mean Squared Error
Which of the following is true about min-max scaling?
Min-max scaling rescales the data to a predefined range, typically 0-1.
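A hand-rolled sketch of min-max scaling to the 0-1 range with NumPy; the values are made up.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Shift so the minimum is 0, then divide by the range so the maximum is 1
scaled = (x - x.min()) / (x.max() - x.min())

print(scaled)
```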
If there are outliers in the data, what would be the best strategy?
Outlier treatment is subject to the data, the business problem at hand and the business domain that we are working in. Outliers should be analysed carefully before jumping into a decision. Some machine learning algorithms are robust to outliers and we might not need outlier treatment in those cases
A model that captures noise too is called an
Overfit model
assumption of Machine Learning?
Past is a good representation of the future
Artificial Intelligence refers to
Examples include predicting the turnover for a restaurant based on the previous years' turnover data, and predicting if an employee is going to leave a company based on the historical employee attrition data.
What will the following code do? data.sample(10)
Return a random sample of 10 rows from the dataframe 'data'
What would the following code do? data.describe().T
Return the transposed statistical summary of data (columns will become index and index will become columns)
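A short sketch of sample() and describe().T, assuming pandas; the columns and the random_state value are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "height": [150, 160, 170, 180],
    "weight": [50, 60, 70, 80],
})

sample = data.sample(2, random_state=0)  # 2 random rows; random_state makes it repeatable
summary = data.describe()                # statistics as rows, columns as columns
summary_t = data.describe().T            # one row per original column

print(sample)
print(summary_t)
```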
An overfit model does poorly on
Testing data
An overfit model does well on
Training data
T/F We should be careful when applying log transformation to negative data.
True
A measure of the spread of data values
Variance
If two columns (Col A and Col B) have a high correlation (correlation >= 0.8) what inference(s) can we make?
With a decrease in Col A, Col B will also tend to decrease. With an increase in Col A, Col B will also tend to increase.
How can we declare a null value X in Python? (Assume numpy is imported as np, pandas as pd, seaborn as sns, and matplotlib.pyplot as plt)
X = np.nan
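A small sketch of how np.nan behaves, assuming NumPy and pandas.

```python
import numpy as np
import pandas as pd

X = np.nan

is_nan = bool(np.isnan(X))    # True
equals_itself = (X == X)      # False: NaN never compares equal, even to itself
is_null = bool(pd.isnull(X))  # True: pandas treats it as missing

print(is_nan, equals_itself, is_null)
```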
Tuple
Ex. X=('a',2,True,'b'); a collection of items of any data type that cannot be edited and supports indexing
example of a list
['Dan',2,3,4,'python',2.71]
Python won't run anything to the right of the # when commented
a = 3 #adding a comment as an example
OOP in python
a way of programming built around classes and objects (attributes, methods)
seaborn
a Python data visualization library based on Matplotlib that makes plots look nicer. It provides a high-level interface for drawing attractive and informative statistical graphics.
list definition
an ordered and indexed collection of values that are changeable and allows duplicates
tensor
an array with more than 2 dimensions
how to create an array
import numpy as np; arr = np.array([1, 2, 3, 4]); print(arr)
variable inside of a class is a
attribute
text and numerical values
can't be mixed in a NumPy array (one data type per array) but can be mixed across columns in a pandas DataFrame
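A sketch of what happens when text and numbers are mixed, assuming NumPy and pandas; the values are made up. NumPy converts everything to one type, while a DataFrame keeps a separate type per column.

```python
import numpy as np
import pandas as pd

mixed = np.array([1, "a", 2.5])  # every element is converted to a string
print(mixed, mixed.dtype)

df = pd.DataFrame({"name": ["Dan", "Ann"], "age": [30, 25]})
print(df.dtypes)                 # text column and integer column side by side
```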
correlation does not imply
causation
you can have functions inside of a
class
what is unitless
correlation
math functions for numpy
cosine, exponential, sqrt, log
reduce dimensionality (reducing columns) helps pick up on meaningful signals
create dictionary average columns and name it
If the drop_first attribute of the get_dummies() function is set to True, then get_dummies() will
create k-1 dummies out of k categorical levels by removing the first level.
Which of the following options is commonly used for reducing overfitting in a model?
cross validation
Choose the correct code(s) that will drop the column (temporarily) 'Col2' from the data.
data.drop(["Col2"],axis=1)
how to replace a value (e.g. the word "Nature") in a column
data["Col4"].replace("Nature","Beauty", inplace=False)
Choose the correct code that will change the case of strings from lower case to upper case.
data["Col4"].str.upper()
how to add a new column and take sum of col 1,2,3
data["Col5"] = data[["Col1","Col2","Col3"]].sum(axis=1)
how to add a new column 'Col6' to the dataframe that will take the values of the difference between 'Col3' and 'Col1'
data["Col6"] = data["Col3"] - data["Col1"]
Choose the correct line of code to extract the year from a 'datetime' type column
data["column"] = data["column"].dt.year
Choose the correct code(s) that will multiply 'Col1' by 5.
data['Col1'] * 5 and data['Col1'].apply(lambda x: x * 5)
Choose the correct code(s) that gives the number of elements that end with 's' in 'Col4'.
data['Col4'].str.endswith('s').sum()
Which of the following code will convert a column of the 'object' type to a column of the 'datetime' type? (Assume numpy is imported as np, pandas as pd, seaborn as sns, and matplotlib.pyplot as plt)
data['column'] = pd.to_datetime(data['column'])
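The column operations on the cards above can be tried together in one sketch, assuming pandas; the DataFrame contents and the "when" column name are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Col2": [10, 20, 30],
    "Col3": [100, 200, 300],
    "Col4": ["cats", "dog", "birds"],
    "when": ["2020-01-15", "2021-06-30", "2022-12-01"],
})

without_col2 = data.drop(["Col2"], axis=1)  # returns a copy; data still has Col2
upper = data["Col4"].str.upper()            # lower case to upper case
data["Col5"] = data[["Col1", "Col2", "Col3"]].sum(axis=1)  # row-wise sum
data["Col6"] = data["Col3"] - data["Col1"]                 # difference of two columns
times_five = data["Col1"] * 5
ends_with_s = data["Col4"].str.endswith("s").sum()         # True counts as 1

data["when"] = pd.to_datetime(data["when"])  # object -> datetime
years = data["when"].dt.year                 # extract the year

print(data)
```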
initializer
declared inside a class
use keyword def to define a
function
Supervised learning regression
desired output is a continuous number; classification: desired output is a category
underfit
didn't capture all the info that was available to us
we can address the scope issue by declaring a variable as
global
Unsupervised learning clustering
grouping data; dimensionality reduction: compressing data; association rule learning: if X then Y
In Python, all data is stored in the form of an object. An object has three things
id, type, and value.
np.arange vs. np.linspace
np.arange: start inclusive, stop exclusive; np.linspace: start inclusive, stop inclusive (by default)
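A quick sketch of the endpoint difference, assuming NumPy.

```python
import numpy as np

a = np.arange(0, 5)          # stop value 5 is excluded
lin = np.linspace(0, 5, 6)   # 6 evenly spaced points, stop value 5 is included

print(a)
print(lin)
```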
The R-squared value ______ with the addition of features, but the adjusted R-squared value might _______ with the addition of features to generate the best fit line.
increases (it never decreases), decrease or increase
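A sketch of the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of features. The function name and the R-squared values below are made up for illustration.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: penalizes R-squared for the number of features."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical case: a 4th feature nudges R-squared up only slightly
before = adjusted_r2(0.850, n_samples=50, n_features=3)
after = adjusted_r2(0.851, n_samples=50, n_features=4)

print(before, after)  # adjusted R-squared goes down even though R-squared went up
```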
Data is
info + noise
overfit
captures both info and noise
A good Machine Learning model captures
information in the data leaving out the noise.
immutable objects -An immutable object is an object that is not changeable and its state cannot be modified after it is created.
integer, float, string, tuple, bool, frozenset
dictionary definition
is a collection of key: value pairs that is indexed by key and changeable; keys must be unique, and insertion order is preserved since Python 3.7
Matplotlib
is a comprehensive library for creating static, animated, and interactive visualizations in Python.
Linear Regression
is a guided learning process, i.e., the data needs to come with labels/targets, and then the regression model is trained to minimize the error.
lambda function
lambda is a keyword in Python that defines an inline (anonymous) function
numpy
is a library for doing numerical calculations, i.e. a Python package for doing math that is more advanced (Ex. cosine, exponential, sqrt)
Integer (int)
is a whole number with no decimal point and can be both negative and positive (Ex. -4, 6, 10)
vector
is a one-dimensional array (e.g. built from a single list)
class (I can define a data type called a class)
is a user-defined data type (we give the data type a name) (Ex. My_data_type, alongside built-in types like int, float, list)
object
is an instance of a particular class
tuple definition
is an ordered collection of values that are unchangeable and allows duplicates
self
is the conventional name for the first parameter of a method; it refers to the object (instance) the method is called on
for loop defined
is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a string)
user defined function
a function you define yourself to do something; your Python code then calls that function
How to append to a list?
l2.append(xx)
how to remove an item from a list
l2.remove(xx)
mutable objects - The values of mutable objects can be changed at any time and place, whether you expect it or not.
list, dictionary, set
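A minimal sketch contrasting a mutable list with an immutable tuple, using only built-ins; the values are made up.

```python
items = [1, 2, 3]
items[0] = 99          # lists are mutable, so this works

point = (1, 2, 3)
try:
    point[0] = 99      # tuples are immutable, so this raises TypeError
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(items, point, tuple_changed)
```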
lenx is a
local variable
Pandas dataframe
made up of several Series; can be thought of like an Excel spreadsheet that is storing some data
Supervised learning refers to
building a mathematical model using data that contains both inputs and desired outputs
String
may contain letters or numbers or a combination of both, inside single or double quotes (Ex. "Hello World", "My area pin-code is 121121")
numpy arrays (np.array)
more robust version of a list
find the row-wise mean of the dataset.
np.mean(data, axis = 1)
vectors and matrices apply to
numpy
valid function(s) for NumPy
numpy.cos(numpy.pi)
Which of the following snippets of code are valid function(s) for NumPy?
numpy.log(6); numpy.sqrt(1.44) ; numpy.exp(4)
when you declare an object of that class you access that attribute from the outside as
object_name.attribute_name
lambda functions are written
on 1 line
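A short sketch of a lambda next to an equivalent def, using only built-ins; the function names are made up.

```python
# Regular user-defined function
def square_plus_7(x):
    return x * x + 7

# Same logic as a one-line lambda
square_plus_7_lambda = lambda x: x * x + 7

# Lambdas are handy as short inline arguments
doubled = list(map(lambda x: x * 2, [1, 2, 3]))

print(square_plus_7(5), square_plus_7_lambda(5), doubled)
```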
Simple linear regression has _______ independent feature/features while multiple linear regression has _______ independent features/features
one, many
how to use the os module
import os, then access its members, e.g. os.name
lenx is not defined
outside of the function
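A sketch of local scope, using only built-ins; count_letters is a hypothetical function, and lenx is the local variable name from the cards above.

```python
def count_letters(word):
    lenx = len(word)   # lenx exists only while the function runs
    return lenx

result = count_letters("python")

try:
    lenx               # not defined out here
    visible_outside = True
except NameError:
    visible_outside = False

print(result, visible_outside)
```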
Pandas
package used for managing data (provides 2 data types: Series and DataFrame)
how to merge join combine DataFrames
pd.concat([x, x, x], axis=..., sort=False); df9.join(df10, how='outer'); pd.merge(df7, df8, ...)
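A small sketch of the three ways to combine DataFrames, assuming pandas; the frames left and right and the "key" column are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "a": ["x", "y"]})
right = pd.DataFrame({"key": [2, 3], "b": ["p", "q"]})

# concat: stack frames on top of each other (axis=0) or side by side (axis=1)
stacked = pd.concat([left, left], axis=0, sort=False)

# merge: SQL-style join on a column
merged = pd.merge(left, right, on="key", how="inner")

# join: combine on the index
joined = left.set_index("key").join(right.set_index("key"), how="outer")

print(stacked)
print(merged)
print(joined)
```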
Which of the following function can be used to create bins in a continuous variable
pd.cut()
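A minimal sketch of binning with pd.cut(), assuming pandas; the ages, bin edges, and labels are made up. By default each bin includes its right edge.

```python
import pandas as pd

ages = pd.Series([5, 17, 30, 64, 80])

# Bins are (0, 18], (18, 65], (65, 100]
groups = pd.cut(ages, bins=[0, 18, 65, 100], labels=["child", "adult", "senior"])

print(groups)
```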
methods
perform calculations on attributes or feed in new data
Correlation
popular technique to measure the degree of association between variables.
Data is the
precise observation from the real world
Residuals are a way to figure out the deviation of the actual values from the
predicted values
NumPy
python package for doing math that is more advanced than +-*/, special functions like cosine, exponents, sqrt
Using axis = 0 in the drop() function of pandas would drop ______ from the dataframe.
rows
me = Chris_data_type(); me.init_some_vals(2.2); print(me.first_var); print(me.multiply_vals())
declares an object called me of the class type, then calls its methods and prints the results
Pandas DataFrames are
similar to an Excel spreadsheet
Regression pros
simple, elegant model; computationally very efficient; easy to interpret the output's coefficients
StandardScaler and MinMaxScaler are contained in which Python library?
sklearn.preprocessing
Which of the following libraries has the OneHotEncoder function?
sklearn.preprocessing
A good fit model will have a
smaller standard deviation of residuals.
Which function helps in plotting the pair-wise relation between each numerical variable of the data?
sns.pairplot(data)
Regression cons
sometimes too simple to capture real-world complexities; assumes a linear relationship between IV and DV; outliers can have a large effect on the output; assumes independence between attributes
In simple linear regression, the R-squared value is equal to which of the following?
square of correlation
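A sketch that checks this numerically for one made-up data set, assuming NumPy; the line is fitted with np.polyfit.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a simple linear regression: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

print(r_squared, r ** 2)  # the two values agree
```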
class Chris_data_type:
    def init_some_vals(self, val2):
        self.first_var = 1.7
        self.second_var = val2
    def multiply_vals(self):
        return self.first_var * self.second_var
this is a definition of a new data type called a class
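A variation on the class above, sketched under the assumption that the values should be set when the object is created. It uses Python's conventional initializer method, __init__, in place of init_some_vals; the class name Chris_data_type_v2 is hypothetical.

```python
class Chris_data_type_v2:
    def __init__(self, val2):
        # __init__ runs automatically when an object is created
        self.first_var = 1.7
        self.second_var = val2

    def multiply_vals(self):
        # self gives the method access to the object's own attributes
        return self.first_var * self.second_var

me = Chris_data_type_v2(2.2)   # declare an object of the class
product = me.multiply_vals()   # call a method on it

print(me.first_var)   # attribute accessed from outside: object_name.attribute_name
print(product)
```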
ML uses what type of approach
train and test
A model is built on the
training data
documentation string """
triple double quote
An Identity matrix is a square matrix in which all the elements of the principal diagonal are ones and all other elements are zeros.
true
unsupervised learning
we have inputs, but no desired outputs; IV only, no output expected