IST 5520 Exam 1

Methods used in simple analytics projects

Identify the business problem, clean and transform the data, analyze the data, create visualizations and models, and suggest areas for improvement.

What problem would you have if you use the test set to tune hyperparameters of predictive models? Explain why.

If you tune hyperparameters on the test set, the test set no longer gives an honest estimate of how the model performs on unseen data. The hyperparameters are chosen precisely because they score well on the test data, so the model is effectively overfit to it and the reported test performance is optimistically biased; the model may not perform well on new data. Tune hyperparameters with the training and validation sets (or cross-validation) instead, so the test set remains available to measure performance on new, unseen data.
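
A minimal sketch of the three-way split this answer relies on, using only the standard library. The function name `three_way_split`, the 60/20/20 proportions, and the integer stand-in data are assumptions made for illustration, not part of the course material.

```python
import random

def three_way_split(rows, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle rows and cut them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test

rows = list(range(100))  # hypothetical stand-in for 100 labeled instances
train, validation, test = three_way_split(rows)

# 1. Fit candidate models on `train`.
# 2. Compare hyperparameter settings on `validation` and keep the best.
# 3. Score the chosen model on `test` once, at the very end.
print(len(train), len(validation), len(test))  # 60 20 20
```

Because the test rows never influence any tuning decision, the score in step 3 is an estimate of performance on data the model has not seen.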

Numeric Types

Int, float, complex, boolean

What is the most suitable scale of measurement for Degrees Fahrenheit (°F)

Interval

Can the lambda expression in Python contain a block of multiple statements? Why?

No. A lambda expression in Python cannot contain a block of multiple statements, because its body must be exactly one expression.
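
A short sketch of that restriction. The names `square`, `sign`, and `describe` are made up for illustration.

```python
# A lambda holds exactly one expression; its value is the return value.
square = lambda n: n * n

# A conditional *expression* is still a single expression, so it is allowed.
sign = lambda n: "non-negative" if n >= 0 else "negative"

# Multiple statements (assignments, loops, return) need a regular function.
def describe(n):
    squared = square(n)      # statement 1: assignment
    label = sign(n)          # statement 2: assignment
    return f"{n} is {label}; its square is {squared}"

print(square(4))      # 16
print(describe(-3))   # -3 is negative; its square is 9
```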

Container

List (ordered and mutable, with cheap append), string, dictionary (key-value pairs for fast lookup), set, tuple

NumPy Package contains

Multidimensional arrays

What is the difference between n-fold cross validation and leave-one-out? When would you like to choose leave-one-out?

N-fold cross validation splits the data into n roughly equal parts; each part serves once as the validation set while the remaining parts form the training set. Leave-one-out is the extreme case in which each validation set is a single data point, so the model is trained once per data point. You would choose leave-one-out when the dataset is small and you want to use as much data as possible for training in every round, provided the extra computation is affordable.

Leave-one-out

N-fold cross validation where n = the number of instances in the dataset. Each instance in turn is left out and the model is trained on all remaining instances. Advantage: Greatest possible amount of data is used for training, lower bias than k-fold. Disadvantage: Computationally expensive, higher variance than k-fold.
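
The relationship between the two schemes can be sketched with plain index lists. `k_fold_splits` is a hypothetical helper written for this example; it does not shuffle, whereas in practice the instances are usually shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# 3-fold on 6 instances: each validation fold holds 2 instances.
three_fold = list(k_fold_splits(6, 3))

# Leave-one-out is the special case k == n_samples.
leave_one_out = list(k_fold_splits(6, 6))

print(three_fold[0])      # ([2, 3, 4, 5], [0, 1])
print(leave_one_out[0])   # ([1, 2, 3, 4, 5], [0])
```

Leave-one-out trains six models here instead of three, which is where the extra computational cost comes from.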

Suppose we have a function defined in the following code. (6 points)
    def abc(x=0):
        if x % 3 == 0:
            print('YES')
        else:
            print('NO')
What is the result of abc(10)?

NO
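
The function from this card, laid out with its indentation restored so it can be run directly. The three calls cover the inputs asked about in this set.

```python
def abc(x=0):
    # x % 3 is the remainder after dividing x by 3.
    if x % 3 == 0:
        print('YES')
    else:
        print('NO')

abc(10)  # 10 % 3 == 1, so this prints NO
abc()    # uses the default x=0; 0 % 3 == 0, so this prints YES
abc(6)   # 6 % 3 == 0, so this prints YES
```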

What is the most suitable scale of measurement for a Social Security Number?

Nominal

Two way partition

Only one training set and test set

What is the most suitable scale of measurement for a rank of a baseball team?

Ordinal

Training and test regimen for model evaluation

Use a validation set to tune hyperparameters or to choose among competing models, usually through cross-validation

Regression

Used to predict a continuous variable (i.e. "How many items will be sold in the next month?")

flat file

Using a plain text file to store tabular data. Has a delimiter.

Machine Learning

Using data to choose a hypothesis g that approximates the target function f

The 4 V's

Volume - size of the data
Variety - unstructured, structured, semi-structured
Velocity - speed of generation or rate of analysis
Veracity (accuracy) - untrusted, uncleansed

Underfitting

When the model performs poorly on the training data and is unable to capture the relationship between predictors and the response.

Overfitting

When the model performs very well on the training data but poorly on the test data. It is unable to generalize to unseen cases.

Unsupervised Machine Learning

When there is no outcome to predict or classify and the data is unlabeled. Examples: association rules and clustering.

Structural Patterns

White boxes: patterns are represented in terms of a decision structure that can be examined, reasoned about, and used to inform future decisions.

Detect all syntax errors from the following code. (5 points) The code is supposed to evaluate if a variable is positive or negative. if x <= 0: print("A positive number.") else: print("A non-positive number.")

x = 0  # x must be defined before it is tested
if x <= 0:
    print("A non-positive number.")  # messages swapped so they match the condition
else:  # indentation fixed
    print("A positive number.")

Suppose we have a function defined in the following code. (6 points)
    def abc(x=0):
        if x % 3 == 0:
            print('YES')
        else:
            print('NO')
What is the result of abc()?

YES

Suppose we have a function defined in the following code. (6 points)
    def abc(x=0):
        if x % 3 == 0:
            print('YES')
        else:
            print('NO')
What is the result of abc(6)?

YES

Big Data

Data with characteristics beyond the ability of commonly used software tools to capture, curate, and process within a tolerable elapsed time.

Database

good for structured data, does not do well with unstructured data

Black Boxes

internal logic is incomprehensible, usually gives greater predictive accuracy, but very complicated.

\w

matches any alphanumeric character, equivalent to the class [a-zA-Z0-9_]

\d

matches any decimal digit, equivalent to the class [0-9]

\W

matches any non-alphanumeric char, equivalent to the class [^a-zA-Z0-9_]

\D

matches any non-digit character, equivalent to the class [^0-9]

*

metacharacter for repeating things, doesn't match literal character, instead specifies that the previous character can be matched zero or more times

Levels of Measurement

nominal, ordinal, interval, ratio

?

previous character can be matched 0 to 1 times

+

previous character can be matched 1 or more times

{m,n}

previous character can be matched m to n times
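
A brief sketch of the character classes and repetition metacharacters from the regular expression cards in this set, using the standard `re` module. The sample strings are invented for illustration.

```python
import re

text = "Room 7, ext 4521, code_A9"

# \d matches one digit; + repeats it one or more times.
digits = re.findall(r"\d+", text)
print(digits)        # ['7', '4521', '9']

# \w matches letters, digits, and underscore; \W matches everything else.
words = re.findall(r"\w+", text)
print(words)         # ['Room', '7', 'ext', '4521', 'code_A9']

# ? makes the previous character optional (0 or 1 times).
optional = re.findall(r"colou?r", "color colour")
print(optional)      # ['color', 'colour']

# * allows zero or more repetitions of the previous character.
star = re.findall(r"ab*", "a ab abbb")
print(star)          # ['a', 'ab', 'abbb']

# {m,n} bounds the repetition count between m and n.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")
print(bounded)       # ['22', '333', '444']
```

The last result shows that repetition is greedy: from `4444` the pattern takes three digits, and the single digit left over is too short to match.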

Unit of Analysis

the major entity that is being analyzed in a BA project. Example: in a customer churn project, the unit of analysis is at customer level. Example: in a product price prediction project, the unit of analysis is at product level.

Classification

to classify units into categories (i.e. "Which brand will the user purchase?")

Business Analysis Process

- Have a business problem
- Understand the problem
- Collect data
- Clean and transform the data
- Data analysis
- Solve business problem
- Repeat

What is the most suitable scale of measurement for zip code? _____________. A) Nominal B) Ordinal C) Interval D) Ratio

A

What is the result of the following Python code? ____________. (2.5 < 3) and (3-2==0) or (1 > -4) A) True B) False C) 1 D) 0

A
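
The expression can be checked piece by piece. The names `a`, `b`, and `c` are introduced only for this sketch.

```python
a = 2.5 < 3        # True
b = 3 - 2 == 0     # False, because 3 - 2 is 1
c = 1 > -4         # True

# `and` binds more tightly than `or`, so this reads as (a and b) or c.
result = (2.5 < 3) and (3 - 2 == 0) or (1 > -4)

print(a and b)     # False
print(result)      # True
```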

What is the type of the following knowledge representation pattern? _____________. IF a customer buys a PC mouse, THEN the customer is likely to buy a keyboard also. A) Structural pattern B) Black box C) Both D) None

A

Which of the following is NOT a pandas data structure? _____________. A) Dictionary B) Panel C) Series D) DataFrame

A

Which of the following is the standard missing data marker used in Pandas? A) NaN B) NA C) N/A D) NULL

A

Which of the following methods is appropriate to visualize a qualitative variable? A) Bar chart B) Density plot C) Scatter plot D) Box plot

A

Explain why we may need to rescale dataset for predictive modeling.

Algorithms that rely on the similarity or distance between instances (KNN, SVM, clustering methods, etc.) are sensitive to the scale of the data. Input variables with different units and scales can make the problem harder to model, because variables with large ranges dominate the distance calculation. The goal of normalization is to bring the numeric columns onto a common scale without distorting differences in the ranges of values. The scaling method must be chosen carefully, because the results and conclusions of the predictive model can change with it.
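
A small sketch of why scale matters for distance-based methods. Min-max scaling is used here as one common choice of normalization, not necessarily the one the course intends; the customer data and the helper names `min_max_scale` and `distance` are hypothetical.

```python
import math

def min_max_scale(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def distance(i, j, xs, ys):
    """Euclidean distance between instances i and j over two features."""
    return math.hypot(xs[i] - xs[j], ys[i] - ys[j])

# Hypothetical customers: age in years and income in dollars.
ages = [25, 27, 60, 40]
incomes = [50_000, 58_000, 52_000, 90_000]

# Raw units: income dominates, so customer 0 looks closer to customer 2
# (35 years older) than to customer 1 (2 years older).
raw_0_1 = distance(0, 1, ages, incomes)
raw_0_2 = distance(0, 2, ages, incomes)
print(raw_0_2 < raw_0_1)        # True

# After rescaling, both features contribute on a comparable scale and
# customer 1 becomes the nearer neighbor.
scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
scaled_0_1 = distance(0, 1, scaled_ages, scaled_incomes)
scaled_0_2 = distance(0, 2, scaled_ages, scaled_incomes)
print(scaled_0_1 < scaled_0_2)  # True
```

The nearest neighbor of customer 0 changes once the data is rescaled, which is the sense in which conclusions depend on how the scaling is done.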

Data Science

An interdisciplinary field about processes and systems for extracting knowledge or insights from data in various forms, either structured or unstructured. It is a continuation of data analysis fields such as statistics, data mining, and predictive analytics, and is similar to knowledge discovery in databases (KDD).

Business Analytics (BA)

Analyzes historical data, identifies patterns from samples for reporting trends.

Descriptive Analytics

Analyzes historical data, identifies patterns from samples for reporting trends.

Interpret the meaning of the following Python code.
    y = 0
    i = 0
    while i < 20:
        i = i + 1
        if i == 10 or i == 9:
            continue
        y = y + i
    print(y)

At a high level, this code sums the integers 1 through 20, skipping 9 and 10, and then prints the result (191).
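
A runnable layout of that code. The original indentation is not recoverable from the card, so this assumes `print(y)` sits after the loop, which is what the answer describes.

```python
y = 0
i = 0
while i < 20:
    i = i + 1
    if i == 10 or i == 9:
        continue      # jump back to the loop test, skipping the addition
    y = y + i
print(y)  # 191, which is 1 + 2 + ... + 20 = 210 minus 9 and 10
```

The increment comes before the `continue`, so the loop still advances past 9 and 10 instead of getting stuck on them.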

According to INFORMS, analytics does NOT include: ______________. A) Descriptive analytics B) Exploratory analytics C) Predictive analytics D) Prescriptive analytics

B

Assume "dat" object is a pandas data frame which stores the data set of used corolla. What is the meaning of the following Python code? dat[['Fuel_Type','Price']] A) To calculate average price of cars for each fuel type B) To select Fuel_Type and Price columns in the data frame C) To draw a box plot of price for each fuel type D) None of the above.

B

If your analytics model performs very well on the training dataset but poorly on the testing dataset, what is happening? A) Your model underfits the training data B) Your model overfits the training data C) Your model can generalize very well to new data D) None of the above is correct

B

What is the result of the following Python code block? _________. import numpy as np x = np.array([1,2,3,4,5,6,'a']) x[3] A) '3' B) '4' C) 4 D) 3

B
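
A sketch of why the answer is the string '4' rather than the integer 4, assuming NumPy is installed: a NumPy array has one data type for all of its elements, so a single string converts the numbers to strings as well.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 'a'])

# One string forces every element to a common Unicode string dtype.
print(x.dtype.kind)     # U
print(x[3] == '4')      # True: the element is a string, not the integer 4

# An all-integer list keeps an integer dtype instead.
ints = np.array([1, 2, 3, 4, 5, 6])
print(ints.dtype.kind)  # i
print(ints[3] == 4)     # True
```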

What is the x object in the following Python code block? _________. import pandas as pd df = pd.DataFrame({ "Course": ['IST5001','IST5001','IST5001','IST5001'], "Student": ['Jack','Kim','Tony','David'], "Score": [80, 90, 85, 70]}) x = df['Student'] A) A pandas DataFrame that contains only the Student column B) A pandas Series that contains student names C) A list that contains student names D) None of the above

B

Which Python library would be more useful for data management or transformation? A) numpy B) pandas C) NLTK D) Scikit-learn

B

Which Python library would you prefer to use for data management and transformation? A) numpy B) pandas C) NLTK D) Scikit-learn

B

Which of the following statements about Python is true? A) Python 3 is backward compatible with Python 2. B) Values in a NumPy array must be of the same data type. C) The Python command '?object' is to auto-complete the name of 'object'. D) The code "course ={5001:'Data Methods in Python',3420:'Introduction to Data Science and Management'}" defines a list called "course".

B

What is the result of the following Python code block? _________. import numpy as np np.linspace(0,1,10).size A) 0 B) 1 C) 10 D) 20

C

What is the result of the following Python code? employee = ['Jack','Kim','Tony','David'] employee[0:2] A) ['Jack','Tony'] B) ['Kim','Tony'] C) ['Jack','Kim'] D) ['Jack','Kim','Tony']

C
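
A quick sketch of the slicing rule behind this answer: the start index is included and the stop index is excluded.

```python
employee = ['Jack', 'Kim', 'Tony', 'David']

# employee[0:2] takes indexes 0 and 1, and stops before index 2.
first_two = employee[0:2]
print(first_two)        # ['Jack', 'Kim']

# The other answer options correspond to different slices.
middle = employee[1:3]
first_three = employee[:3]
print(middle)           # ['Kim', 'Tony']
print(first_three)      # ['Jack', 'Kim', 'Tony']
```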

What is the result of the following Python code? __________. str(1+3) A) 4 B) '1' C) '4' D) '13'

C

What is the result of the following Python statement? str(12+11) + 'abc' A) 1211abc B) 12+11+abc C) 23abc D) integerabc E) The statement is not correct in syntax

C
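
The two `str()` questions in this set follow the same rule, sketched here: the arithmetic inside the parentheses is evaluated first, and `str()` then converts the result. The variable names are made up for the example.

```python
# 1 + 3 is evaluated to 4, then converted to the string '4'.
four = str(1 + 3)
print(repr(four))       # '4'

# 12 + 11 is evaluated to 23, converted to '23', then joined with 'abc'.
joined = str(12 + 11) + 'abc'
print(joined)           # 23abc

# Converting each number separately gives a different string.
separate = str(12) + str(11) + 'abc'
print(separate)         # 1211abc
```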

Which of the following is NOT a benefit of using Jupyter Notebook for data science? A) We can write and execute code interactively. B) It uses a JSON format to store code, results, and explanation in a single file. C) The Jupyter Notebook can be directly compiled to a binary file that can be executed in an operating system with good performance. D) The data analysis can be easily shared among data scientists.

C

Basic control structures in Python language do NOT include: ______________. A) if...else statement B) for loop C) while loop D) case statement

D

Suppose we have a DataFrame called 'df' that contains three columns 'Course', 'Student', and 'Score'. What is the most appropriate way to count the unique number of students in the dataset? _____________. A) len(df['Student']) B) df['Student'].value_counts().sum() C) df['Student'].count() D) len(df['Student'].unique())

D
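
A sketch comparing the four options, assuming pandas is installed. The data is hypothetical and deliberately lists one student twice so the counts differ.

```python
import pandas as pd

# Hypothetical data in which Jack appears in two rows.
df = pd.DataFrame({
    "Course": ['IST5001', 'IST5520', 'IST5001'],
    "Student": ['Jack', 'Jack', 'Kim'],
    "Score": [80, 88, 90]})

rows = len(df['Student'])                      # counts rows
non_missing = df['Student'].count()            # counts non-missing rows
summed = df['Student'].value_counts().sum()    # per-name counts add back up to the rows
distinct = len(df['Student'].unique())         # counts distinct students

print(rows, non_missing, summed, distinct)     # 3 3 3 2

# nunique() is a shortcut for the distinct count (it skips missing values by default).
print(df['Student'].nunique())                 # 2
```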

What is the difference between interval scale and ratio scale? A) A ratio scale puts scores into categories, while an interval scale measures on a continuous scale. B) An interval scale has a true zero point, so zero on the scale corresponds to zero of the concept being measured. C) A ratio scale has equal intervals between the points on the scale, whereas an interval scale does not. D) A ratio scale has a true zero point, so zero on the scale corresponds to zero of the concept being measured.

D

What is the y object in the following Python code block? _________. import pandas as pd df = pd.DataFrame({ "Course": ['IST5001','IST5001','IST5001','IST5001'], "Student": ['Jack','Kim','Tony','David'], "Score": [80, 90, 85, 70]}) y = df['Score'] >= 85 A) A pandas DataFrame that contains two observations with score >= 85 B) A pandas Series that contains two scores that are larger or equal to 85 C) A list that contains four student scores D) A pandas Series of four Boolean values

D
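
A sketch of what the comparison produces and how the result is typically used, assuming pandas is installed. The DataFrame is the one from the question; the filtering step at the end is an added illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Course": ['IST5001', 'IST5001', 'IST5001', 'IST5001'],
    "Student": ['Jack', 'Kim', 'Tony', 'David'],
    "Score": [80, 90, 85, 70]})

# The comparison is applied element-wise, giving one Boolean per row.
y = df['Score'] >= 85
print(type(y).__name__)   # Series
print(y.tolist())         # [False, True, True, False]

# Using the Boolean Series as a mask keeps only the rows marked True.
high_scorers = df[y]['Student'].tolist()
print(high_scorers)       # ['Kim', 'Tony']
```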

Which of the following data structures is NOT a container? A container is simply an object that holds a collection of other objects. A) Pandas DataFrame B) Set C) Dictionary D) Integer

D

Which of the following is NOT a basic control structure in structure theorem? A) Sequence B) Loop C) Selection D) Vector

D

Which of the following is NOT a pandas data structure? A) Series B) Panel C) DataFrame D) Dictionary

D

What's the meaning of the regular expression pattern "\d{3,5}"?

Decimal digits with 3 to 5 occurrences.
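
A short check of this pattern with the standard `re` module. The test strings are invented for illustration.

```python
import re

pattern = re.compile(r"\d{3,5}")

# fullmatch requires the whole string to be 3 to 5 digits.
three = bool(pattern.fullmatch("123"))
two = bool(pattern.fullmatch("12"))
six = bool(pattern.fullmatch("123456"))
print(three, two, six)    # True False False

# findall is greedy: it takes up to 5 digits from a longer run,
# and the 2 digits left over are too few to match again.
found = pattern.findall("id 1234567")
print(found)              # ['12345']
```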

What are the benefits of saving a dataset into a local file after you have cleansed the dataset?

Do not need to run the time-consuming data cleansing and transformation process again. For descriptive and predictive analytics, you can simply load the cleansed dataset and then start to do data analysis.

Which of the following data structures is NOT a container? _____________. A) Numpy Array B) Pandas DataFrame C) Set D) Dictionary E) Integer

E

Bias

Error from improper assumptions in the learning algorithm.

Variance

Error from sensitivity to small fluctuations in the training set.

Prescriptive Analytics

Evaluates and determines new ways to operate. Targets business objectives. Balances all constraints.

Compare the two methods to_csv() and to_pickle() of a Pandas data frame object. How do you choose between these two methods when you need to save your cleansed dataset into a local file?

Pickle stores a serialized copy of the pandas DataFrame: the exact in-memory representation is written to disk, so column types and the index come back unchanged. to_csv() writes plain comma-separated text, which other tools can read, but depending on your dataset some information (such as column data types) can be lost when you load it back. Choose to_pickle() when you will reload the data in Python and want it back exactly as it was; choose to_csv() when the file must be human-readable or shared with other software.
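
A sketch of the information loss described above, assuming pandas is installed. The column names, dates, and file names are hypothetical; a temporary folder is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "score": [80, 90]})

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "clean.csv")
    pickle_path = os.path.join(folder, "clean.pkl")

    df.to_csv(csv_path, index=False)
    df.to_pickle(pickle_path)

    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pickle_path)

is_datetime = pd.api.types.is_datetime64_any_dtype
print(is_datetime(from_csv["when"]))      # False: the dates came back as plain text
print(is_datetime(from_pickle["when"]))   # True: the dtype was preserved
print(from_pickle.equals(df))             # True
```

The CSV round trip can be repaired by hand (for example with the `parse_dates` argument of `read_csv`), but that is extra work the pickle file avoids. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.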

Regression example

Predicting how much a store will need to stock for inventory in the coming month.

Classification Example

Predicting what customers will buy online based on certain factors (targeted advertising)

Supervised Machine Learning

Prediction and classification: the data is labeled and there is an outcome to predict.

Predictive Analytics

Predicts future probabilities and trends. Finds relationships in data that may not be readily apparent with descriptive analysis.

Ordinal

Qualitative, attributes can be ranked/ordered (football team rank, customer ranking)

Nominal

Qualitative, just labels with no ordering or mathematical calculation (Gender, StudentID, zipcode, department #)

Ratio

Quantitative, like interval but has true zero values and the ratio is meaningful (weight, height, distance, number of visits)

Interval

Quantitative, similar to ordinal but distance between attributes has meaning, but ratios are not meaningful (temperature, SAT score)

K-Fold cross validation

Randomly divide the sample into folds of approximately equal size. Each fold serves once as a test fold.

What is the most suitable scale of measurement for age?

Ratio

Pandas Package contains

Series, DataFrame, Panel (Panel was removed in pandas 0.25; current versions provide Series and DataFrame)

What is PEP8?

PEP 8 is the style guide for Python code. It defines formatting and naming conventions that make code more readable and consistent.

Three way partition

Training set, validation set, and test set

