IST 5520 Exam 1
Methods used in simple analytics projects
Find a business problem, clean and transform the data, analyze the data, create visualizations and models, and suggest areas for improvement.
What problem would you have if you use the test set to tune hyperparameters of predictive models? Explain why.
If you use the test set to tune the hyperparameters of a predictive model, the test set no longer gives an honest estimate of how well the model predicts and generalizes to unseen data. Tuning against the test data ensures the model performs exceptionally well on that data, but the result is misleading: the model is overfitting the test set and may not perform well on new data. Tune hyperparameters with the training and validation sets so that the test set remains available to measure how the model performs on new, unseen data.
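The idea can be sketched with a toy example, assuming a hypothetical one-feature dataset and a small hand-written k-nearest-neighbour rule (neither comes from the course material). The hyperparameter k is chosen on the validation set only, and the test set is used once at the end.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training rows whose feature is closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    positive_votes = sum(label for _, label in nearest)
    return positive_votes * 2 > k

def accuracy(train, rows, k):
    """Share of rows whose label the k-nearest-neighbour rule predicts correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in rows)
    return hits / len(rows)

# Hypothetical toy data: one feature, label True when x > 0.6, about 10% label noise.
rng = random.Random(42)
data = []
for _ in range(300):
    x = rng.random()
    label = (x > 0.6) != (rng.random() < 0.1)  # XOR flips roughly 10% of labels
    data.append((x, label))

# The rows are already in random order, so slicing gives a random partition.
train, validation, test = data[:180], data[180:240], data[240:]

# Tune the hyperparameter k on the validation set only.
candidate_ks = [1, 3, 5, 9, 15]
best_k = max(candidate_ks, key=lambda k: accuracy(train, validation, k))

# Touch the test set exactly once, for the final performance estimate.
test_accuracy = accuracy(train, test, best_k)
print("chosen k:", best_k, "test accuracy:", round(test_accuracy, 3))
```

Had k been chosen by maximizing accuracy on `test` instead, the reported test accuracy would be optimistically biased.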
Numeric Types
Int, float, complex, boolean
What is the most suitable scale of measurement for degrees Fahrenheit (°F)?
Interval
Can a lambda expression in Python contain a block of multiple statements? Why?
No. A lambda expression in Python cannot contain a block of multiple statements because its body must be a single expression, whose value is returned.
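A short illustration, using made-up names and values: a lambda body is one expression, so logic that needs several statements goes in a regular function.

```python
names = ["Tony", "Kim", "David", "Jack"]

# A lambda holds exactly one expression; here it is used as a sort key.
by_length = sorted(names, key=lambda name: len(name))
print(by_length)

# A conditional expression is allowed because it is still a single expression.
sign = lambda x: "positive" if x > 0 else "non-positive"
print(sign(3), sign(0))

# Anything that needs several statements is written as a def instead.
def describe(x):
    label = sign(x)
    squared = x * x
    return f"{x} is {label}; its square is {squared}"

print(describe(-2))
```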
Container
List (costless insertion and append), string, dictionary (key-values for fast lookup), set, tuple
NumPy Package contains
Multidimensional arrays
What is the difference between n-fold cross validation and leave-one-out? When would you like to choose leave-one-out?
N-fold cross-validation splits the data into n parts of approximately equal size; each part serves once as the validation set while the remaining parts form the training set. Leave-one-out uses a single data point as the validation set in each round, so it is n-fold cross-validation with n equal to the number of instances. You would choose leave-one-out when the dataset is small and you want to use as much data as possible for training, since every single data point is tested in turn.
Leave-one-out
N-fold cross validation where n = the number of instances in the dataset. Each instance in turn is left out and the model is trained on all remaining instances. Advantage: Greatest possible amount of data is used for training, lower bias than k-fold. Disadvantage: Computationally expensive, higher variance than k-fold.
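A minimal sketch of how the leave-one-out splits are formed, using a hypothetical helper `leave_one_out` that works on row indices only (model training is left out).

```python
def leave_one_out(n_instances):
    """Yield (train_indices, held_out_index) with each instance held out in turn."""
    for held_out in range(n_instances):
        train = [i for i in range(n_instances) if i != held_out]
        yield train, held_out

splits = list(leave_one_out(4))
for train, held_out in splits:
    print("train on", train, "validate on", held_out)
```

With n instances there are n rounds, which is why the method becomes expensive on large datasets.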
Suppose we have a function defined in the following code. (6 points)
def abc(x=0):
    if x % 3 == 0:
        print('YES')
    else:
        print('NO')
What is the result of abc(10)?
NO
What is the most suitable scale of measurement for a Social Security Number?
Nominal
Two way partition
Only one training set and test set
What is the most suitable scale of measurement for a rank of a baseball team?
Ordinal
Training and test regimen for model evaluation
Use a validation set to tune hyperparameters or choose among competing models, usually through cross-validation.
Regression
Used to predict a continuous variable (e.g. "How many items will be sold in the next month?")
flat file
Using a plain text file to store tabular data. Has a delimiter.
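As an illustration, the standard-library `csv` module can read delimited text. The sample content below is made up, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical flat-file content: a header row, then one record per line.
flat_file = "Student,Score\nJack,80\nKim,90\n"

# The delimiter tells the reader where one field ends and the next begins.
rows = list(csv.DictReader(io.StringIO(flat_file), delimiter=","))

for row in rows:
    # Every field is read as text; convert numeric fields explicitly.
    print(row["Student"], int(row["Score"]))
```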
Machine Learning
Using data to choose a hypothesis g that approximates the target function f
The 4 V's
Volume: size of the data. Variety: unstructured, structured, semi-structured. Velocity: speed of generation or rate of analysis. Veracity (accuracy): untrusted, uncleansed.
Underfitting
When the model performs poorly on the training data and is unable to capture the relationship between predictors and the response.
Overfitting
When the model performs very well on the training data but poorly on the test data. It is unable to generalize to unseen cases.
Unsupervised Machine Learning
When there is no outcome to predict or classify and the data is unlabeled. Examples: association rules and clustering.
Structural Patterns
White boxes: patterns are represented in terms of a decision structure that can be examined, reasoned about, and used to inform future decisions.
Detect all syntax errors from the following code. (5 points) The code is supposed to evaluate if a variable is positive or negative. if x <= 0: print("A positive number.") else: print("A non-positive number.")
x = 0  # need to create x
if x <= 0:
    print("A non-positive number.")  # flip the print statements to make sense
else:  # fix indentation
    print("A positive number.")
Suppose we have a function defined in the following code. (6 points)
def abc(x=0):
    if x % 3 == 0:
        print('YES')
    else:
        print('NO')
What is the result of abc()?
YES
Suppose we have a function defined in the following code. (6 points)
def abc(x=0):
    if x % 3 == 0:
        print('YES')
    else:
        print('NO')
What is the result of abc(6)?
YES
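The `abc` function from the three questions on abc(10), abc(), and abc(6) can be run as follows once it is laid out with standard indentation. The comments trace each call.

```python
def abc(x=0):
    # x % 3 is the remainder after dividing x by 3.
    if x % 3 == 0:
        print('YES')
    else:
        print('NO')

abc(10)  # 10 % 3 is 1, so this prints NO
abc()    # x defaults to 0 and 0 % 3 is 0, so this prints YES
abc(6)   # 6 % 3 is 0, so this prints YES
```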
Big Data
Data with characteristics beyond the ability of commonly used software tools to capture, curate, and process within a tolerable elapsed time.
Database
good for structured data, does not do well with unstructured data
Black Boxes
internal logic is incomprehensible, usually gives greater predictive accuracy, but very complicated.
\w
matches any alphanumeric character, equivalent to the class [a-zA-Z0-9_]
\d
matches any decimal digit, equivalent to the class [0-9]
\W
matches any non-alphanumeric char, equivalent to the class [^a-zA-Z0-9_]
\D
matches any non-digit character, equivalent to the class [^0-9]
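The four character classes \w, \d, \W, and \D can be tried with Python's `re` module on a made-up sample string. One caveat: for ordinary Python 3 strings, \w and \d also match non-ASCII letters and digits, so the equivalences to [a-zA-Z0-9_] and [0-9] hold exactly only for ASCII text or when the `re.ASCII` flag is passed.

```python
import re

sample = "a_1 !"

word_chars = re.findall(r"\w", sample)      # letters, digits, underscore
digits = re.findall(r"\d", sample)          # decimal digits
non_word_chars = re.findall(r"\W", sample)  # everything \w does not match
non_digits = re.findall(r"\D", sample)      # everything \d does not match

print(word_chars, digits, non_word_chars, non_digits)
```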
*
metacharacter for repeating things, doesn't match literal character, instead specifies that the previous character can be matched zero or more times
Levels of Measurement
nominal, ordinal, interval, ratio
?
previous character can be matched 0 to 1 times
+
previous character can be matched 1 or more times
{m,n}
previous character can be matched m to n times
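The repetition metacharacters *, ?, +, and {m,n} can be compared side by side. This sketch uses a hypothetical helper `matches`, which tests whether the whole string fits the pattern.

```python
import re

def matches(pattern, text):
    """True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# * : the previous character may appear zero or more times
print(matches(r"ab*c", "ac"), matches(r"ab*c", "abbbc"))

# ? : the previous character may appear zero or one time
print(matches(r"ab?c", "abc"), matches(r"ab?c", "abbc"))

# + : the previous character must appear one or more times
print(matches(r"ab+c", "ac"), matches(r"ab+c", "abc"))

# {m,n} : the previous character must appear between m and n times
print(matches(r"ab{2,3}c", "abc"), matches(r"ab{2,3}c", "abbbc"))
```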
Unit of Analysis
the major entity that is being analyzed in a BA project. Example: in a customer churn project, the unit of analysis is at customer level. Example: in a product price prediction project, the unit of analysis is at product level.
Classification
To classify units into categories (e.g. "Which brand will the user purchase?")
Business Analysis Process
- Have a business problem
- Understand the problem
- Collect data
- Clean and transform the data
- Analyze the data
- Solve the business problem
- Repeat
What is the most suitable scale of measurement for zip code? _____________. A) Nominal B) Ordinal C) Interval D) Ratio
A
What is the result of the following Python code? ____________. (2.5 < 3) and (3-2==0) or (1 > -4) A) True B) False C) 1 D) 0
A
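The expression (2.5 < 3) and (3-2==0) or (1 > -4) can be evaluated step by step. In Python, `and` binds more tightly than `or`, so the `and` part is resolved first.

```python
result = (2.5 < 3) and (3 - 2 == 0) or (1 > -4)

# Same evaluation, split into the two steps Python performs.
and_part = (2.5 < 3) and (3 - 2 == 0)  # True and False gives False
or_part = and_part or (1 > -4)         # False or True gives True

print(result, and_part, or_part)
```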
What is the type of the following knowledge representation pattern? _____________. IF a customer buys a PC mouse, THEN the customer is likely to buy a keyboard also. A) Structural pattern B) Black box C) Both D) None
A
Which of the following is NOT a pandas data structure? _____________. A) Dictionary B) Panel C) Series D) DataFrame
A
Which of the following is the standard missing data marker used in Pandas? A) NaN B) NA C) N/A D) NULL
A
Which of the following methods is appropriate to visualize a qualitative variable? A) Bar chart B) Density plot C) Scatter plot D) Box plot
A
Explain why we may need to rescale dataset for predictive modeling.
Algorithms that rely on the similarity or distance between instances (KNN, SVM, clustering methods, etc.) are sensitive to the scale of the data. Input variables with different units and scales can increase the difficulty of the problem being modeled, because variables with larger ranges dominate the distance calculation. The goal of normalization is to bring the numeric columns in the dataset to a common scale without distorting differences in the ranges of values. The choice of scaling method matters, because the results and conclusions of the predictive model can change with it.
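Two common rescaling methods, min-max scaling and standardization (z-scores), can be sketched with the standard library. The column values are hypothetical, and the helper names are chosen for illustration only.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo; a constant column would need special handling.
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical columns measured in very different units.
income = [30000, 60000, 90000]
age = [25, 40, 55]

# After scaling, both columns contribute on a comparable scale.
print(min_max_scale(income), min_max_scale(age))
print(standardize(income), standardize(age))
```

Without rescaling, a distance computed on raw `income` and `age` would be driven almost entirely by income.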
Data Science
An interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured. It is a continuation of data analysis fields such as statistics, data mining, and predictive analytics, and is similar to knowledge discovery in databases (KDD).
Business Analytics (BA)
Analyzes historical data, identifies patterns from samples for reporting trends.
Descriptive Analytics
Analyzes historical data, identifies patterns from samples for reporting trends.
Interpret the meaning of the following Python code.
y = 0
i = 0
while i < 20:
    i = i + 1
    if i == 10 or i == 9:
        continue
    y = y + i
print(y)
At a high level, this code sums the integers from 1 to 20, skipping 9 and 10, and then prints the result (191).
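The loop can be run as written once it is laid out on separate lines. This layout assumes `print(y)` sits after the loop, which matches the answer above.

```python
y = 0
i = 0
while i < 20:
    i = i + 1
    if i == 10 or i == 9:
        continue  # skip the addition for 9 and 10
    y = y + i
print(y)  # 1 + 2 + ... + 20 is 210; minus 9 and 10 leaves 191
```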
According to INFORMS, analytics does NOT include: ______________. A) Descriptive analytics B) Exploratory analytics C) Predictive analytics D) Prescriptive analytics
B
Assume "dat" object is a pandas data frame which stores the data set of used corolla. What is the meaning of the following Python code? dat[['Fuel_Type','Price']] A) To calculate average price of cars for each fuel type B) To select Fuel_Type and Price columns in the data frame C) To draw a box plot of price for each fuel type D) None of the above.
B
If your analytics model performs very well on the training dataset but poorly on the testing dataset, what is happening? A) Your model underfits the training data B) Your model overfits the training data C) Your model can generalize very well to new data D) None of the above is correct
B
What is the result of the following Python code block? _________.
import numpy as np
x = np.array([1,2,3,4,5,6,'a'])
x[3]
A) '3' B) '4' C) 4 D) 3
B
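The answer follows from NumPy storing a single data type per array: mixing integers with the string 'a' makes NumPy convert every element to a string. A quick check, assuming NumPy is installed:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 'a'])

# dtype.kind 'U' means the array holds Unicode strings, not integers.
print(x.dtype.kind)

# Index 3 is the fourth element, which is now the string '4'.
print(x[3] == '4', x[3] == 4)
```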
What is the x object in the following Python code block? _________.
import pandas as pd
df = pd.DataFrame({
    "Course": ['IST5001','IST5001','IST5001','IST5001'],
    "Student": ['Jack','Kim','Tony','David'],
    "Score": [80, 90, 85, 70]})
x = df['Student']
A) A pandas DataFrame that contains only the Student column B) A pandas Series that contains student names C) A list that contains student names D) None of the above
B
Which Python library would be more useful for data management or transformation? A) numpy B) pandas C) NLTK D) Scikit-learn
B
Which Python library would you prefer to use for data management and transformation? A) numpy B) pandas C) NLTK D) Scikit-learn
B
Which of the following statements about Python is true? A) Python 3 is backward compatible with Python 2. B) Values in a NumPy array must be of the same data type. C) The Python command '?object' is to auto-complete the name of 'object'. D) The code "course ={5001:'Data Methods in Python',3420:'Introduction to Data Science and Management'}" defines a list called "course".
B
What is the result of the following Python code block? _________.
import numpy as np
np.linspace(0,1,10).size
A) 0 B) 1 C) 10 D) 20
C
What is the result of the following Python code?
employee = ['Jack','Kim','Tony','David']
employee[0:2]
A) ['Jack','Tony'] B) ['Kim','Tony'] C) ['Jack','Kim'] D) ['Jack','Kim','Tony']
C
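The slice employee[0:2] includes the start index and excludes the stop index, which a short check confirms.

```python
employee = ['Jack', 'Kim', 'Tony', 'David']

# Indices 0 and 1 are included; index 2 is excluded.
first_two = employee[0:2]
print(first_two)

# Omitting the start index means "from the beginning".
print(employee[:2] == first_two)
```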
What is the result of the following Python code? __________. str(1+3) A) 4 B) '1' C) '4' D) '13'
C
What is the result of the following Python statement? str(12+11) + 'abc' A) 1211abc B) 12+11+abc C) 23abc D) integerabc E) The statement is not correct in syntax
C
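In str(1+3) and str(12+11) + 'abc', the arithmetic inside the parentheses is evaluated first, and `str()` then converts the numeric result to text.

```python
four = str(1 + 3)              # 1 + 3 is 4, converted to the string '4'
joined = str(12 + 11) + 'abc'  # 12 + 11 is 23, then '23' + 'abc'
print(repr(four), repr(joined))

# Without str(), adding an int to a str raises a TypeError.
try:
    12 + 11 + 'abc'
    raised = False
except TypeError:
    raised = True
print(raised)
```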
Which of the following is NOT a benefit of using Jupyter Notebook for data science? A) We can write and execute code interactively. B) It uses a JSON format to store code, results, and explanations in a single file. C) The Jupyter Notebook can be directly compiled to a binary file that can be executed in an operating system with good performance. D) The data analysis can be easily shared among data scientists.
C
Basic control structures in Python language do NOT include: ______________. A) if...else statement B) for loop C) while loop D) case statement
D
Suppose we have a DataFrame called 'df' that contains three columns 'Course', 'Student', and 'Score'. What is the most appropriate way to count the unique number of students in the dataset? _____________. A) len(df['Student']) B) df['Student'].value_counts().sum() C) df['Student'].count() D) len(df['Student'].unique())
D
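The four answer options can be compared on a small hypothetical DataFrame in which one student appears twice (this assumes pandas is installed). Only the last option removes duplicates before counting.

```python
import pandas as pd

# Hypothetical data: Jack is enrolled in two courses.
df = pd.DataFrame({
    "Course": ['IST5001', 'IST5001', 'IST5002', 'IST5002'],
    "Student": ['Jack', 'Kim', 'Jack', 'David'],
    "Score": [80, 90, 85, 70]})

option_a = len(df['Student'])                   # counts rows
option_b = df['Student'].value_counts().sum()   # adds the per-name counts back up
option_c = df['Student'].count()                # counts non-missing values
option_d = len(df['Student'].unique())          # counts distinct names

print(option_a, option_b, option_c, option_d)

# nunique() is a built-in shortcut for the same distinct count.
print(df['Student'].nunique())
```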
What is the difference between interval scale and ratio scale? A) A ratio scale puts scores into categories, while an interval scale measures on a continuous scale. B) An interval scale has a true zero point, so zero on the scale corresponds to zero of the concept being measured. C) A ratio scale has equal intervals between the points on the scale, whereas an interval scale does not. D) A ratio scale has a true zero point, so zero on the scale corresponds to zero of the concept being measured.
D
What is the y object in the following Python code block? _________.
import pandas as pd
df = pd.DataFrame({
    "Course": ['IST5001','IST5001','IST5001','IST5001'],
    "Student": ['Jack','Kim','Tony','David'],
    "Score": [80, 90, 85, 70]})
y = df['Score'] >= 85
A) A pandas DataFrame that contains two observations with score >= 85 B) A pandas Series that contains two scores that are greater than or equal to 85 C) A list that contains four student scores D) A pandas Series of four Boolean values
D
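Both DataFrame questions can be checked directly, assuming pandas is installed. Selecting one column gives a Series, and comparing a Series with a number gives a Boolean Series of the same length, which can then be used to filter rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Course": ['IST5001', 'IST5001', 'IST5001', 'IST5001'],
    "Student": ['Jack', 'Kim', 'Tony', 'David'],
    "Score": [80, 90, 85, 70]})

x = df['Student']      # single brackets select one column as a Series
y = df['Score'] >= 85  # element-wise comparison, one Boolean per row

print(type(x).__name__, type(y).__name__)
print(y.tolist())

# The Boolean Series works as a row filter (a mask).
high_scores = df[y]
print(high_scores['Student'].tolist())
```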
Which of the following data structures is NOT a container? A container is simply an object that holds a collection of other objects. A) Pandas DataFrame B) Set C) Dictionary D) Integer
D
Which of the following is NOT a basic control structure in the structure theorem? A) Sequence B) Loop C) Selection D) Vector
D
Which of the following is NOT a pandas data structure? A) Series B) Panel C) DataFrame D) Dictionary
D
What's the meaning of the regular expression pattern "\d{3,5}"?
Decimal digits with 3 to 5 occurrences.
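The pattern \d{3,5} can be tried on a made-up string. When searching inside longer text, the match is greedy and takes up to five digits at a time.

```python
import re

text = "12 345 6789 1234567"

# "12" is too short; "1234567" yields "12345", and the leftover "67" is too short.
found = re.findall(r"\d{3,5}", text)
print(found)

# fullmatch requires the whole string to be 3 to 5 digits.
print(re.fullmatch(r"\d{3,5}", "1234") is not None)
print(re.fullmatch(r"\d{3,5}", "123456") is not None)
```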
What are the benefits of saving a dataset into a local file after you have cleansed the dataset?
You do not need to rerun the time-consuming data cleansing and transformation process. For descriptive and predictive analytics, you can simply load the cleansed dataset and start the data analysis.
Which of the following data structures is NOT a container? _____________. A) NumPy Array B) Pandas DataFrame C) Set D) Dictionary E) Integer
E
Bias
Error from improper assumptions in the learning algorithm.
Variance
Error from sensitivity to small fluctuations in the training set.
Prescriptive Analytics
Evaluates and determines new ways to operate. Targets business objectives. Balances all constraints.
Compare the two methods to_csv() and to_pickle() of a pandas DataFrame object. How do you choose between these two methods when you need to save your cleansed dataset to a local file?
Pickle stores a serialized copy of the pandas DataFrame: the exact representation of the DataFrame is written to disk, so the column types and the index are the same when you load it back. to_csv() stores the data as comma-separated plain text, which other tools can read, but depending on your dataset some information (such as column data types) may be lost when you load it back. Choose to_pickle() when you will reload the data in Python and need the types and index preserved; choose to_csv() when the file must be portable or human-readable.
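The difference shows up in a round trip, assuming pandas is installed. The DataFrame and its `Enrolled` date column are hypothetical, and a temporary folder stands in for the local directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleansed dataset with a datetime column.
df = pd.DataFrame({
    "Student": ['Jack', 'Kim'],
    "Enrolled": pd.to_datetime(['2024-01-15', '2024-02-01']),
})

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "clean.csv")
    pkl_path = os.path.join(folder, "clean.pkl")

    df.to_csv(csv_path, index=False)  # plain text, readable by other tools
    df.to_pickle(pkl_path)            # serialized Python object

    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pkl_path)

# The pickle round trip keeps the datetime type; the CSV round trip
# returns the dates as text unless they are parsed again on load.
print(from_pickle.equals(df))
print(from_csv["Enrolled"].dtype, from_pickle["Enrolled"].dtype)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.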
Regression example
Predicting how much a store will need to stock for inventory in the coming month.
Classification Example
Predicting what customers will buy online based on certain factors (targeted advertising)
Supervised Machine Learning
Prediction and classification: the data is labeled and there is an outcome that the model is trying to predict.
Predictive Analytics
Predicts future probabilities and trends. Finds relationships in data that may not be readily apparent with descriptive analysis.
Ordinal
Qualitative, attributes can be ranked/ordered (football team rank, customer ranking)
Nominal
Qualitative, just labels with no ordering or mathematical calculation (Gender, StudentID, zipcode, department #)
Ratio
Quantitative, like interval but has true zero values and the ratio is meaningful (weight, height, distance, number of visits)
Interval
Quantitative, similar to ordinal but distance between attributes has meaning, but ratios are not meaningful (temperature, SAT score)
K-Fold cross validation
Randomly divide the sample into k folds of approximately equal size. Each fold serves once as the test fold while the remaining folds are used for training.
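A minimal sketch of the fold assignment, using a hypothetical helper `k_fold_indices` that works on row indices only (fitting a model on each split is left out).

```python
import random

def k_fold_indices(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) for each of k random folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Taking every k-th shuffled index gives folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for test_fold in folds:
        train = [i for i in indices if i not in test_fold]
        yield train, test_fold

splits = list(k_fold_indices(10, 3))
for train, test_fold in splits:
    print("train size:", len(train), "test fold:", sorted(test_fold))
```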
What is the most suitable scale of measurement for age?
Ratio
Pandas Package contains
Series, DataFrame, Panel (Panel was deprecated and has been removed from recent pandas releases)
What is PEP8?
The style guide for Python code. It defines formatting and naming conventions that make code more readable and consistent.
Three way partition
Training set, validation set, and test set
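A three-way partition can be sketched with the standard library. The 60/20/20 proportions and the helper name `three_way_split` are assumptions chosen for illustration, not values from the course.

```python
import random

def three_way_split(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and cut them into training, validation, and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # whatever remains
    return train, validation, test

train, validation, test = three_way_split(range(10))
print(len(train), len(validation), len(test))
```

The training set fits the model, the validation set guides tuning and model choice, and the test set is held back for the final estimate.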