Python

Covariance

A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship
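
The sign behaviour described above can be checked with a small sketch. The helper `covariance` and the sample numbers are made up for illustration; it computes the sample covariance (dividing by n - 1).

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]
score_up = [52, 55, 61, 64, 70]    # rises as hours rise
score_down = [70, 64, 61, 55, 52]  # falls as hours rise

cov_up = covariance(hours, score_up)      # positive relationship
cov_down = covariance(hours, score_down)  # negative relationship
print(cov_up, cov_down)
```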

one-hot encoding

Each category is converted to its own column, and in any given row exactly one of these columns is in the 'on/hot' state.

What does the get_dummies() function in pandas do?

Convert categorical variables into dummy/indicator variables.
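
A minimal sketch of `pd.get_dummies()` on a made-up `color` column, with and without `drop_first=True`. The DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One indicator column per category; exactly one is "on" in each row.
one_hot = pd.get_dummies(df, columns=["color"])
print(one_hot)

# drop_first=True keeps k-1 columns by dropping the first level
# ("blue" here, since the levels are sorted alphabetically).
reduced = pd.get_dummies(df, columns=["color"], drop_first=True)
print(reduced)
```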

Which of the following function(s) would be useful in removing the whitespace characters (spaces) from the string below? Country= " England "

Country.replace(' ', '') and Country.strip()

Input: Country = 'United-States-Of-America'; print(Country) >>> 'United-States-Of-America'. Desired output: 'United'

Country.split('-')[0]
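
The two string cards above can be tried together. This is a small sketch; the lowercase variable names are chosen here and are not part of the original cards.

```python
country = " England "
stripped = country.strip()            # removes leading/trailing whitespace only
no_spaces = country.replace(" ", "")  # removes every space, including inner ones

name = "United-States-Of-America"
parts = name.split("-")   # list of the pieces between hyphens
first = parts[0]
print(repr(stripped), repr(no_spaces), first)
```

Note that `replace(' ', '')` would also remove the space inside a value such as "New York", while `strip()` would not.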

ML

Data contains information and noise; the data is split into 2 different sets, a training set and a testing set
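
A rough sketch of a train/test split using only the standard library. The function name `train_test_split`, the 25% test fraction, and the seed are assumptions made for this example; libraries such as scikit-learn provide their own version.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(20))
train, test = train_test_split(data)
print(len(train), len(test))
```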

Which of the statement(s) is/are correct regarding Data Preprocessing?

It is a term used to describe the collection of approaches to prepare the data for analysis. It involves different techniques like missing value treatment, outlier detection and treatment, variable transformations, etc.

What difference will it make if the dropna = False parameter is added to the value_counts() function, as in data.column.value_counts(dropna = False)?

It will return the count of missing values along with the count of unique categories in a column
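
A short sketch of the difference, using a made-up Series with two missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", np.nan, "b", "a", np.nan])

without_nan = s.value_counts()           # dropna=True is the default
with_nan = s.value_counts(dropna=False)  # adds one row counting the NaN values
print(with_nan)
```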

Data Science

Learning from data we observe from the real world

Which of the following would be the best transformation when you want to smoothen a skewed distribution?

Log Transformation

What will the following code do? data.isnull().sum().sort_values(ascending=False)

Return the count of missing values column-wise and sort them in descending order
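
A minimal sketch on a made-up DataFrame, so the chained call can be read step by step:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],        # 2 missing
    "city": ["Oslo", "Lima", None, "Pune"],  # 1 missing
    "score": [1.0, 2.0, 3.0, 4.0],          # 0 missing
})

# isnull() -> True/False per cell, sum() -> count per column,
# sort_values(ascending=False) -> columns with the most gaps first.
missing = data.isnull().sum().sort_values(ascending=False)
print(missing)
```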

Which of the following is true about standard scaling?

Standard scaling assumes features are normally distributed and will scale them to have a mean 0 and standard deviation of 1. Standard scaling doesn't have a predetermined range to scale to.
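
Standard scaling can be sketched by hand with NumPy. This is not scikit-learn's StandardScaler itself, just the same arithmetic (subtract the mean, divide by the population standard deviation) on made-up values.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

# z-score: subtract the mean, divide by the standard deviation (ddof=0)
scaled = (values - values.mean()) / values.std()
print(scaled)
print(scaled.min(), scaled.max())  # no fixed range, unlike min-max scaling
```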

When a linear regression was trained, it was found that R-squared value was 0.85

The model explains 85% of the variance

A model is giving a very low error on the training set but a very high error on the test set.

The model is suffering from overfitting
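
One way to see this is to fit a simple and a very flexible model to the same noisy data. The sketch below uses made-up data (a linear signal plus noise) and NumPy polynomial fits; the degrees, sample sizes, and seed are arbitrary assumptions. The flexible fit always matches the training points at least as closely, and typically does worse on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = 2 * x + rng.normal(0, 0.3, n)  # linear signal plus noise
    return x, y

x_train, y_train = make_data(12)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # straight line
flexible = np.polyfit(x_train, y_train, deg=8)  # can chase the noise

train_simple = mse(simple, x_train, y_train)
train_flexible = mse(flexible, x_train, y_train)
print("degree 1  train:", train_simple, " test:", mse(simple, x_test, y_test))
print("degree 8  train:", train_flexible, " test:", mse(flexible, x_test, y_test))
```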

While building a multiple linear regression model, it was found that the addition of a variable decreased the value of the adjusted R-squared. Which of the following statements is correct?

The new variable should not be added in the final model
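
The usual adjusted R-squared formula shows why this can happen: it penalizes each extra feature. A sketch with made-up numbers, where `n_samples` is the number of observations and `n_features` the number of independent variables:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Adding a 4th feature raises R-squared only from 0.850 to 0.851 ...
before = adjusted_r2(0.850, n_samples=50, n_features=3)
after = adjusted_r2(0.851, n_samples=50, n_features=4)

# ... so the penalty outweighs the gain and adjusted R-squared drops.
print(before, after)
```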

Boolean

an object that can take a value of True or False

The concat() function will

combine Series or DataFrames along an axis, for example combining several Series to create a DataFrame

plotly

commercial ($$); described as the premier low-code platform for ML & data science apps

Machine learning refers to

computer systems trying to learn about a process using data that represents that process.

def x2p7(x): y = x*x; z = y + 7; return z. Then print(x2p7(5)) prints 32, because z is x squared + 7

example of user defined function

We can have elements of different data types in a given array

false

when you add a method(function) to a class, you must have

self as the first argument of that function

if you want to use self internally in a method, you would reference it as

self.attribute_name

each column of a dataframe is a

series

each list of list is a series

series is a column

array

similar to a list, with the same way of storing data, but it stores elements of a single data type. easier to compute with and more robust than a list

pyplot

the standard interface for making Matplotlib graphs

A program stores data in variables that represent

storage locations in the computer's memory

A very complex model will perform better on the

train data set than on the test data set (a very complex model tends to overfit).

A model performance is evaluated on the

testing data

Mean Absolute Error (MAE)

the average of the absolute values of the forecast errors

Root Mean Square error

the square root of the average squared deviations of a set of values from a target value; typically used as a measure of overall tracking proficiency
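
Both error measures can be computed by hand. A small sketch with made-up actual and predicted values:

```python
import math

actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

errors = [a - p for a, p in zip(actual, predicted)]  # 1, 0, -2, -1

mae = sum(abs(e) for e in errors) / len(errors)             # (1+0+2+1)/4
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # sqrt((1+0+4+1)/4)
print(mae, rmse)
```

Because the errors are squared before averaging, RMSE weights large errors more heavily than MAE does.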

We can create matrices by converting lists of lists.

true

matrix

two dimensional array

type function will provide the

type of the object that's provided as its argument

Float

values are specified with a decimal point and can take both negative and positive values (-5.4, -2.0, 1.1)

To check the data type of a value use the function

type() (Ex. type("Hello World") returns str)

Add comments to code by using

# sign (Ctrl / in windows)

example of a tuple

(12,42,11,99,2351)

Set

(Ex. X={1,2,3,4}) can be edited; does not support indexing

List

(Ex. X=['a',2,True,'b']) a collection of items of any data type; it can be edited and supports indexing

Dictionary

(Ex. X={1:'Jan', 2:'Feb', 3:'Mar'})
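
The collection cards can be compared side by side. A small sketch; the variable names are chosen here for illustration.

```python
items = ["a", 2, True, "b"]  # list: mixed types, indexable, editable
items[1] = 3
items.append("c")

months = {1: "Jan", 2: "Feb", 3: "Mar"}  # dictionary: values looked up by key
months[4] = "Apr"

unique = {1, 2, 3, 4}  # set: editable, no duplicates, no indexing
unique.add(2)          # already present, so nothing changes
unique.add(5)
try:
    unique[0]
    set_indexable = True
except TypeError:      # sets do not support indexing
    set_indexable = False

point = ("a", 2, True, "b")  # tuple: indexable but not editable
try:
    point[0] = "z"
    tuple_editable = True
except TypeError:      # tuples cannot be changed in place
    tuple_editable = False

print(items, months, unique, set_indexable, tuple_editable)
```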

Multiple regression

1 DV (Y), more than 1 IV (X)

vector

1 dimensional array

Model building is

1. take data as input; 2. find patterns in the data; 3. summarize the patterns in a mathematically precise way

Which of the following best describes the quantile function of numpy?

Compute the q-th quantile of the data along the specified axis.
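
A brief sketch of `np.quantile` with and without the `axis` argument, on a made-up 2 x 5 array:

```python
import numpy as np

data = np.array([[1, 2, 3, 4, 5],
                 [10, 20, 30, 40, 50]])

overall_median = np.quantile(data, 0.5)         # all 10 values flattened
per_row_q25 = np.quantile(data, 0.25, axis=1)   # one value per row
per_col_median = np.quantile(data, 0.5, axis=0) # one value per column
print(overall_median, per_row_q25, per_col_median)
```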

Which of the following describes the tmean function of scipy.stats?

Computes the trimmed mean. This function finds the arithmetic mean of given values, ignoring values outside the given limits.

T/F Imputation is always the best way to deal with missing data.

False

T/F Imputation is really just making up data to artificially inflate results. It's better to just drop cases with missing data than to impute.

False

T/F Log transformation scales the data to a predetermined scale (i.e., always [0-1]).

False

T/F Missing data isn't really a problem if I'm just doing simple statistics like chi-squares and t-tests.

False

T/F We can just impute by the mean for any missing data. It won't affect the results.

False

True or False We have 100 observations for the income of people. A person with an income of $5 million per year will always be considered an outlier.

False

supervised learning

IV and DV (inputs and desired outputs) are both given. Models can be compared by how well they match the desired output

What is an outlier?

In statistics, an outlier is a data point that differs significantly from other observations.

What effect(s) does log transformation have on data?

It can change the shape of the data. It reduces the scale of the data.
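
A sketch of the effect on a made-up, right-skewed income column. The natural log is used here; it is only defined for positive values, which is why care is needed with zero or negative data.

```python
import numpy as np

incomes = np.array([20_000, 35_000, 50_000, 80_000, 5_000_000], dtype=float)
logged = np.log(incomes)

raw_ratio = incomes.max() / incomes.min()  # largest value is 250x the smallest
log_ratio = logged.max() / logged.min()    # after the log, less than 2x
print(raw_ratio, log_ratio)
print(logged.min(), logged.max())          # not a fixed range such as [0, 1]
```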

Which of the following statements is correct regarding OneHotEncoder?

It encodes categorical features as a one-hot numeric array.

Which of the following is true about the log transformation?

It is most useful on skewed data. It decreases the scale of the distribution.

Select the correct statement(s) with respect to the following function: isinstance(param1, str)

It will check whether the first parameter (param1) is a string (str) type object. It is a built-in function of Python.
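
A minimal sketch of `isinstance` next to `type()`. The function `describe` is a made-up helper.

```python
def describe(param1):
    """Return a label depending on whether param1 is a string."""
    if isinstance(param1, str):
        return "string"
    return "not a string"

print(describe("England"), describe(42))
print(type("Hello World"))  # the type object itself: <class 'str'>
```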

What will the following code do? pd.set_option('display.max_columns', None)

It will remove the limit on the number of columns displayed in the Jupyter Notebook

Which of the following is minimized in a linear regression model?

Mean Squared Error

Which of the following is true about min-max scaling?

Min-max scaling rescales the data to a predefined range, typically 0-1.
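
Min-max scaling written out by hand with NumPy, as a sketch on made-up values (scikit-learn's MinMaxScaler is the library version of the same idea):

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

# (x - min) / (max - min) maps the smallest value to 0 and the largest to 1
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled)
```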

If there are outliers in the data, what would be the best strategy?

Outlier treatment is subject to the data, the business problem at hand and the business domain that we are working in. Outliers should be analysed carefully before jumping into a decision. Some machine learning algorithms are robust to outliers and we might not need outlier treatment in those cases

A model that captures noise too is called an

Overfit model

assumption of Machine Learning?

Past is a good representation of the future

Artificial Intelligence refers to

Predicting the turnover for a restaurant based on the previous years data around turnover. Predicting if an employee is going to leave a company based on the historical employee attrition data.

What will the following code do? data.sample(10)

Return a random sample of 10 rows from the dataframe 'data'

What would the following code do? data.describe().T

Return the transposed statistical summary of data (columns will become index and index will become columns)

An overfit model does poorly on

Testing data

An overfit model does well on

Training data

T/F We should be careful when applying log transformation to negative data.

True

A measure of the spread of data values

Variance

If two columns (Col A and Col B) have a high correlation (correlation >= 0.8) what inference(s) can we make?

With a decrease in Col A, Col B will also tend to decrease. With an increase in Col A, Col B will also tend to increase.

How can we declare a null value X in Python? (Assume numpy is imported as np, pandas as pd, seaborn as sns, and matplotlib.pyplot as plt)

X = np.nan

Tuple

(Ex. X=('a',2,True,'b')) a collection of items of any data type; it cannot be edited and supports indexing

example of a list

['Dan',2,3,4,'python',2.71]

Python won't run anything to the right of the # when commented

a = 3 #adding a comment as an example

OOP in python

a way of programming built around classes and objects (attributes, methods)

seaborn

adds on to Matplotlib and makes them look nicer. is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

list definition

an ordered and indexed collection of values that are changeable and allows duplicates

tensor

an array with more than 2 dimensions

how to create an array

arr = np.array([1, 2, 3, 4]) print(arr)

variable inside of a class is a

attribute

text and numerical values

can't be mixed in a single NumPy array, but can be stored together in a pandas DataFrame

correlation does not imply

causation

you can have functions inside of a

class

what is unitless

correlation
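
"Unitless" can be checked directly: changing the units of a variable changes the covariance but not the correlation. A sketch with made-up height and weight data:

```python
import numpy as np

height_m = np.array([1.60, 1.70, 1.75, 1.82, 1.90])
weight_kg = np.array([55.0, 68.0, 70.0, 80.0, 91.0])
height_cm = height_m * 100  # same heights, different unit

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]         # 100 times larger
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]   # unchanged
print(cov_m, cov_cm, corr_m, corr_cm)
```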

math functions for numpy

cosine, exponential, sqrt, log

reduce dimensionality (reducing columns) helps pick up on meaningful signals

create dictionary average columns and name it

If the drop_first parameter of the get_dummies() function is set to True, then get_dummies() will

create k-1 dummies out of k categorical levels by removing the first level.

Which of the following options is commonly used for reducing overfitting in a model?

cross validation

Choose the correct code(s) that will drop the column (temporarily) 'Col2' from the data.

data.drop(["Col2"],axis=1)

how to replace a word

data["Col4"].replace("Nature","Beauty", inplace=False)

Choose the correct code that will change the case of strings from lower case to upper case.

data["Col4"].str.upper()

how to add a new column and take sum of col 1,2,3

data["Col5"] = data[["Col1","Col2","Col3"]].sum(axis=1)

how to add a new column 'Col6' to the dataframe that will take the values of the difference between 'Col3' and 'Col1'

data["Col6"] = data["Col3"] - data["Col1"]
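
The column operations from the last few cards, run together on a made-up DataFrame. The column values are assumptions; the calls are the ones shown on the cards.

```python
import pandas as pd

data = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Col2": [10, 20, 30],
    "Col3": [100, 200, 300],
    "Col4": ["Nature", "parks", "lakes"],
})

without_col2 = data.drop(["Col2"], axis=1)           # returns a copy; data keeps Col2
replaced = data["Col4"].replace("Nature", "Beauty")  # new Series; data is unchanged
upper = data["Col4"].str.upper()

data["Col5"] = data[["Col1", "Col2", "Col3"]].sum(axis=1)  # row-wise sum
data["Col6"] = data["Col3"] - data["Col1"]
print(data)
```

The drop is "temporary" because the result is not assigned back to `data` and `inplace` is left at its default of False.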

Choose the correct line of code to extract the year from a 'datetime' type column

data["column"] = data["column"].dt.year

Choose the correct code(s) that will multiply 'Col1' by 5.

data['Col1'] * 5 and data['Col1'].apply(lambda x : x*5 )

Choose the correct code(s) that gives the number of elements that end with 's' in 'Col4'.

data['Col4'].str.endswith('s').sum()

Which of the following lines of code will convert a column of the 'object' type to a column of the 'datetime' type? (Assume numpy is imported as np, pandas as pd, seaborn as sns, and matplotlib.pyplot as plt)

data['column'] = pd.to_datetime(data['column'])
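
A sketch that runs the datetime, multiply, and `endswith` cards on a made-up DataFrame:

```python
import pandas as pd

data = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Col4": ["trees", "lake", "hills"],
    "column": ["2021-03-15", "2022-07-01", "2023-12-31"],  # stored as text
})

data["column"] = pd.to_datetime(data["column"])  # object -> datetime
years = data["column"].dt.year

times_five = data["Col1"] * 5                           # vectorized
times_five_apply = data["Col1"].apply(lambda x: x * 5)  # same result, element by element

ends_with_s = data["Col4"].str.endswith("s").sum()  # True counts as 1
print(years.tolist(), times_five.tolist(), ends_with_s)
```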

initializer

declared inside a class

use keyword def to define a

function

Supervised learning regression

desired output is a continuous number; classification: desired output is a category

underfit

didn't capture all the info that was available to us

we can address the scope issue by declaring a variable as

global
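
A small sketch of the scope issue, written as a plain script. `lenx` follows the name used elsewhere in these cards; `count_items`, `total`, and `add_to_total` are made up for the example.

```python
def count_items(items):
    lenx = len(items)  # lenx is local: it exists only while the function runs
    return lenx

count_items([1, 2, 3])
try:
    lenx               # not defined outside the function
    visible_outside = True
except NameError:
    visible_outside = False

total = 0

def add_to_total(n):
    global total       # use the module-level name instead of creating a local one
    total += n

add_to_total(5)
add_to_total(7)
print(visible_outside, total)
```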

Unsupervised learning clustering

grouping data. Other unsupervised tasks: dimensionality reduction (compressing data); association rule learning (if X then Y)

In Python, all data is stored in the form of an object. An object has three things

id, type, and value.

np.arange vs. np.linspace

np.arange: start inclusive, stop exclusive. np.linspace: start inclusive, stop inclusive (by default)
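
The endpoint behaviour of the two functions, as a short sketch:

```python
import numpy as np

a = np.arange(0, 5)       # start included, stop excluded
b = np.linspace(0, 1, 5)  # 5 evenly spaced values, both ends included
print(a, b)
```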

The R-squared value ______ with the addition of features, but the adjusted R-squared value might _______ with the addition of features to generate the best fit line.

increases, decrease/increase

Data is

info + noise

overfit

info and noise

A good Machine Learning model captures

information in the data leaving out the noise.

immutable objects -An immutable object is an object that is not changeable and its state cannot be modified after it is created.

integer, float, string, tuple, bool, frozenset

dictionary definition

is a collection of key-value pairs that is indexed by key and changeable (dictionaries keep insertion order since Python 3.7)

Matplotlib

is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Linear Regression

is a guided learning process, i.e., the data needs to come with labels/targets, and then the regression model is trained to minimize the error.

lambda function

lambda is a keyword in Python that defines an inline (anonymous) function
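
Two short lambda examples, as a sketch. `add_seven` mirrors the x squared + 7 function used earlier in this set; the word list is made up.

```python
# A lambda is a one-line function: parameters before the colon, result after it.
add_seven = lambda x: x * x + 7

words = ["pandas", "numpy", "os"]
by_length = sorted(words, key=lambda w: len(w))  # lambda used as a sort key
print(add_seven(5), by_length)
```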

numpy

is a library for doing numerical calculations (Ex...python package for doing math that is more advanced) Ex...cosine, exponential, sqrt

Integer (int)

is a non-decimal point numeric number and can be both negative and positive values (Ex -4, 6, 10)

vector

is a single dimensional array of a list

class (I can define a data type called a class)

is a user defined data type ( we give the data type a name) (Ex int, float, list, My_data_type)

object

is an instance of a particular class

tuple definition

is an ordered collection of values that are unchangeable and allows duplicates

self

is Python's conventional name for a method's reference to the object it was called on

for loop defined

is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a string)
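
A sketch of the same `for` syntax over several kinds of sequence; all values are made up.

```python
seen = []
for value in ["a", 2, True]:  # list
    seen.append(value)
for value in (12, 42):        # tuple
    seen.append(value)
for char in "hi":             # string: one character at a time
    seen.append(char)

months = {1: "Jan", 2: "Feb"}
keys = [key for key in months]     # iterating a dictionary yields its keys
pairs = []
for key, name in months.items():   # .items() yields (key, value) pairs
    pairs.append((key, name))
print(seen, keys, pairs)
```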

user defined function

it is often used to define your own function to do something then have your python code call that function

How to append an item to a list?

l2.append(xx)

how to remove an item from a list

l2.remove(xx)
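
A short sketch of both calls on a made-up list; `l2` follows the name on the cards.

```python
l2 = ["Dan", 2, 3, 2]

l2.append("python")  # adds one item at the end
l2.remove(2)         # removes only the first item equal to 2

try:
    l2.remove("missing")
    removed_missing = True
except ValueError:   # remove() raises if the value is not in the list
    removed_missing = False
print(l2, removed_missing)
```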

mutable objects - The values of mutable objects can be changed at any time and place, whether you expect it or not.

list, dictionary, set
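
The "whether you expect it or not" part usually shows up through aliasing. A sketch using `id()` to tell an in-place change from the creation of a new object; the variable names are made up.

```python
numbers = [1, 2, 3]
list_id = id(numbers)
numbers.append(4)                    # list changed in place
same_list = id(numbers) == list_id   # still the same object

alias = numbers                      # second name for the same list
alias.append(5)                      # numbers changes too

point = (1, 2)
tuple_id = id(point)
point += (3,)                        # builds a new tuple and rebinds the name
same_tuple = id(point) == tuple_id   # a different object now
print(numbers, same_list, point, same_tuple)
```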

lenx is a

local variable

Pandas dataframe

made up of several series. can be thought of like an Excel spreadsheet that is storing some data

Supervised learning refers to

mathematical model using data that contains both input and desired outputs

String

may contain letters or numbers or a combo of both, in single or double quotes (Ex. "Hello World", "My area pin-code is 121121")

numpy arrays (np.array)

more robust version of a list

find the row-wise mean of the dataset.

np.mean(data, axis = 1)
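
A sketch of what `axis` means here, on a made-up 2 x 3 array:

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])

row_means = np.mean(data, axis=1)  # one mean per row
col_means = np.mean(data, axis=0)  # one mean per column
print(row_means, col_means)
```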

vectors and matrices apply to

numpy

valid function(s) for NumPy

numpy.cos(numpy.pi)

Which of the following snippets of code are valid function(s) for NumPy?

numpy.log(6); numpy.sqrt(1.44) ; numpy.exp(4)

when you declare an object of that class you access that attribute from the outside as

object_name.attribute_name

lambda function are all put

on 1 line

Simple linear regression has _______ independent feature/features while multiple linear regression has _______ independent feature/features

one, many

how to use the os module

import os, then call it by name, e.g. os.name

lenx is not defined

outside of the function

Pandas

package used for managing data (provides 2 data types: Series and DataFrame)

how to merge join combine DataFrames

pd.concat([x, x, x], axis=..., sort=False); df9.join(df10, how='outer'); pd.merge(df7, df8, ...)
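
A sketch of the three ways to combine DataFrames. The frame names follow the card; their contents, the `id` key column, and the argument values chosen here are assumptions for illustration.

```python
import pandas as pd

df7 = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
df8 = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})

# concat: stack frames along an axis (axis=0 appends rows)
stacked = pd.concat([df7, df7], axis=0, ignore_index=True)

# merge: match rows on a shared column
merged = pd.merge(df7, df8, on="id", how="inner")

# join: match rows on the index; 'outer' keeps ids found in either frame
df9 = df7.set_index("id")
df10 = df8.set_index("id")
joined = df9.join(df10, how="outer")
print(stacked, merged, joined, sep="\n\n")
```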

Which of the following function can be used to create bins in a continuous variable

pd.cut()
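
A minimal sketch of binning with `pd.cut()`. The ages, bin edges, and labels are made up; by default each bin includes its right edge.

```python
import pandas as pd

ages = pd.Series([5, 18, 30, 70])

# Bins are (0, 18], (18, 65], (65, 100]
groups = pd.cut(ages, bins=[0, 18, 65, 100], labels=["child", "adult", "senior"])
print(groups)
```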

methods

perform calculations on attributes or feed in new data

Correlation

popular technique to measure the degree of association between variables.

Data is the

precise observations from the real world

Residuals are a way to figure out the deviation of the actual values from the

predicted values

NumPy

python package for doing math that is more advanced than +-*/, special functions like cosine, exponents, sqrt

Using axis = 0 in the drop() function of pandas would drop ______ from the dataframe.

rows

me = Chris_data_type(); me.init_some_vals(2.2); print(me.first_var); print(me.multiply_vals())

declares an object me of the class type, sets its values, then prints an attribute and the result of calling a method

A pandas DataFrame is similar

to an Excel spreadsheet

Regression pros

simple, elegant model; computationally very efficient; easy to interpret the output's coefficients

StandardScaler and MinMaxScaler are contained in which Python library?

sklearn.preprocessing

Which of the following libraries has the OneHotEncoder function?

sklearn.preprocessing

A good fit model will have a

smaller standard deviation of residuals.

Which function helps in plotting the pair-wise relation between each numerical variable of the data?

sns.pairplot(data)

Regression cons

sometimes too simple to capture real-world complexities; assumes a linear relationship between IV and DV; outliers can have a large effect on the output; assumes independence between attributes

In simple linear regression, the R-squared value is equal to which of the following?

the square of the correlation between the independent and dependent variable
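
This identity can be checked numerically. The sketch below fits a line with `np.polyfit` on made-up data, computes R-squared from the residuals (actual minus predicted), and compares it with the squared correlation coefficient.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
predicted = slope * x + intercept
residuals = y - predicted               # actual minus predicted

ss_res = np.sum(residuals ** 2)         # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```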

class Chris_data_type:
    def init_some_vals(self, val2):
        self.first_var = 1.7
        self.second_var = val2
    def multiply_vals(self):
        return self.first_var * self.second_var

this is a definition of a new data type called a class
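
In everyday Python the setup step is often written as the `__init__` initializer, so the values are set when the object is created. The sketch below is a hypothetical variant of the card's class (the name `ChrisDataType` and the `first_var` default argument are assumptions), not the original definition.

```python
class ChrisDataType:
    """Hypothetical variant of the card's class that uses __init__."""

    def __init__(self, val2, first_var=1.7):
        # self refers to the object being created
        self.first_var = first_var
        self.second_var = val2

    def multiply_vals(self):
        # attributes are reached internally as self.attribute_name
        return self.first_var * self.second_var

me = ChrisDataType(2.2)       # declare an object of the class
product = me.multiply_vals()  # call a method on it
print(me.first_var, product)  # outside access: object_name.attribute_name
```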

ML uses what type of approach

train and test

A model is built on the

training data

documentation string """

triple double quote
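
A small sketch of a documentation string. The function reuses the x squared + 7 example from earlier in this set; the wording of the docstring is made up.

```python
def x2p7(x):
    """Return x squared plus 7."""
    return x * x + 7

print(x2p7.__doc__)  # the docstring is stored on the function object
print(x2p7(5))
```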

An Identity matrix is a square matrix in which all the elements of the principal diagonal are ones and all other elements are zeros.

true
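
A sketch with NumPy: `np.eye` builds an identity matrix, and multiplying any matrix of matching size by it leaves that matrix unchanged. The matrix `m` is made up and is built from a list of lists.

```python
import numpy as np

identity = np.eye(3)       # ones on the principal diagonal, zeros elsewhere
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])  # a matrix created from a list of lists

product = m @ identity     # matrix multiplication
print(identity)
print(product)
```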

unsupervised learning

we have inputs, but no desired outputs; IV only, no output expected

