ITSS 4353 Exam 2
Patsy is _____
a Python library for describing linear statistical models
In the Data Visualization in Python by Examples video, this code resulted in producing what kind of visualization? (Choose the best answer).
a bar plot
IPython is:
a better interactive python interpreter
When there are no specific outcomes, such as in a discovery process, which method can best help you understand patterns of customers or entities in terms of their characteristics?
a clustering model
SciPy is:
a collection of scientific computing packages
What is a data aggregation?
a data transformation that produces scalar values from arrays
The first/best way of discerning patterns with known outcomes is done with which kind of a model?
a decision tree model
What is the DOE methodology?
a design of experiments (DOE) to understand the relationship between factors affecting a process and the output of that process
What is a ndarray
a fast, flexible container for large datasets
What is a tuple?
a fixed-length, immutable sequence of Python objects
What is KNIME?
a free & open-source data analytics, reporting and integration platform
scikit-learn is:
a machine learning toolkit
What does a regex describe?
a pattern to locate in text
BI visualization is:
a recommended final input into driving business decision-making
Anaconda is:
a recommended installer for python
What is a tuple?
a sequence or list
Choose all that apply. Examples of structured data include:
multidimensional matrices unevenly spaced time series multiple tables of data interrelated by key columns spreadsheet or CSV
Syntactic sugar is:
programming syntax that simplifies
Exploratory Data Analysis (EDA) is:
required and a pre-requisite for building more effective analytics models
Data Normalization is to:
rescale & arrive at values relative to some size variable
Data security, such as what we saw in the Target and JP Morgan Chase case studies, needs to cover ___________________________________________.
the entire value chain of customer data
In [13]: time_zones = [rec['tz'] for rec in records if 'tz' in rec] In [14]: time_zones[:10] Out[14]: ['America/New_York', 'America/Denver', 'America/New_York', 'America/Sao_Paulo', 'America/New_York', 'America/New_York', 'Europe/Warsaw', '', '', '']
the first 10 timezones we want to look at
What is the Medici effect?
the integration of concepts from varying cultures & industries/ideas
Which indicates the best grouping method?
the lowest degree of impurity
The 3 V's of Big Data are:
Variety Velocity Volume
When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
True
The analytics maturity of any company can be measured along five dimensions, as the delta model. Apply the correct label to each of the five dimensions in the Analytics Maturity Model.
1. C. Data Access 2. D. Enterprise Focus 3. B. Analytical Leadership 4. A. Strategic Targets 5. E. Analytical Talents
The power of real insights come when _______1______ from diverse sources can be ________2_________.
1. Multiple Data 2. Integrated Together
Put the steps of the Analytics Business Value Chain in the right sequence:
1. New Business Challenges & Questions 2. Data Audit, Preparation & Exploration 3. Analytics Knowledge Discovery 4. Test & Learn Knowledge Management 5. Execution & Continuous Optimization
Apply the correct label to each of the five levels in the Analytics Maturity Model.
1.E. Analytically Impaired 2.D. Analytically Aware 3.A. Analytically Inspired 4.C. Analytically Proficient 5.B. Analytical Leaders
Match up the components of this privacy hierarchy and level of intrusiveness:
A. Level of Privacy Intrusiveness B. Access C. Identity D. Pooling
Classification and regression are examples of _________A_________, while hidden Markov and principal component analysis and clustering are examples of __________B________.
A. Supervised Learning B. Unsupervised Learning
Choose the correct statement.
Business analytics is focused on the process of discovery of actionable knowledge and the creation of new business opportunities from such knowledge.
Predicted Lifetime Value (LTV) of a customer can be predicted using:
Buyer Differential Characteristics (BDC) except substituting transactions with net present value (NPV) of all revenue/profit streams through the customer life cycle with the business
All of the descriptive statistics on pandas objects ___________________ missing data by default.
Exclude
Which of these is a sentinel value in pandas, that can be easily detected?
For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data.
What is JSON?
JavaScript Object Notation
Common causes for senior management's lack of attention to how analytics is used are:
Lack of understanding and trust Lost in translation Over reliance on what they know Lack of strategic vision of analytics' uses
The point of contact between pandas and other analysis libraries is usually ____
NumPy Arrays
NumPy stands for:
Numerical Python
Choose the correct statement about leverage in analytics testing.
Once the possible levers are identified, the incremental effects of different levers can then be isolated and validated with tests (with properly designed multicell control groups) design of experiments (DOE).
What is PII?
Personally Identifiable Information
What does this formula represent? P(B|A)= P(A&B)/P(A) where A is a price point, and B is product purchase
Predicted pricing sensitivity
When we think of workable data quality, we understand that data quality is _________________, especially for business data.
Relative
Consider a scenario where a customer marketing campaign failed, attributed to analytics. Choose the correct analysis:
This is likely due to a mismatch between marketing messages, offers, and target customers. A better way is to create an intersection between the marketing team responsible for the contact collaterals and the analytics team.
A common workflow for model development is to use pandas for data loading and cleaning before switching over to a modeling library to build the model itself.
True
Adopt consistent data strategy and invest in acquiring, cleansing and searching, and integrating with new data sources. Establish a data council to oversee data governance and formulate progressive privacy and security policies.
True
In Python, mixing functions with arrays, dicts, or Series is not a problem as everything gets converted to arrays internally.
True
Leverage past histories to predict future outcomes and trends. Identify and optimize levers (promotions, sales, new products) in scenario simulations and predict levels of investments and ROIs.
True
NumPy is designed for efficiency on large arrays of data.
True
Pandas provides many built-in time series tools and data algorithms. You can efficiently work with very large time series and easily slice and dice, aggregate, and resample irregular- and fixed-frequency time series.
True
Python has both built-in and third-party libraries for converting a JSON string into a Python dictionary object
True
Python is an interpreted language.
True
In python, bisect means to:
binary search and insert into a sorted list
In the Data Visualization in Python by Examples video, you learned what syntax/command to print off the first ten rows?
data.head()
Select the six steps in the BAP (business analytics process):
deployment business objectives test and learn modeling data preparation data audit
Select the four generally agreed on types of activities / solutions for analytics:
descriptive predictive diagnostic prescriptive
In the Data Visualization in Python by Examples video, you learned what syntax/command to read in the data?
df = pd.read_csv("data-disasters.csv")
Select the most effective digital marketing tactics for customer retention.
email marketing
Business needs to give value to consumers to be permitted access to their data. Research indicates the three greatest incentives are:
exclusive deals loyalty points cash rewards
In python, pandas are:
high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive
A lift chart shows what?
how a predictive model performs
Fill in the blank. The simplest and most widely used kind of time series are those _______________.
indexed by timestamp
Given the choice between using an INSERT or an APPEND in python, which is more "computationally expensive"?
insert
What does the pip or conda install command do?
installs packages into your python environment
What kind of language is python?
interpreted
How are object references in python typed?
no type associated (strongly)
Given this code, what does the set of three single quotes ' ' convey, in the output? In [13]: time_zones = [rec['tz'] for rec in records if 'tz' in rec] In [14]: time_zones[:10] Out[14]: ['America/New_York', 'America/Denver', 'America/New_York', 'America/Sao_Paulo', 'America/New_York', 'America/New_York', 'Europe/Warsaw', '', '', '']
unknown/empty strings
How is code structured in python?
using whitespace (tabs/spaces)
Select the channels that are used the most for customer acquisition:
website and email
Select typical/possible components of an ERP system:
Service Management Supply Chain Management Financial and Managerial Accounting Human Resources
Select the correct statement.
Significantly more companies have a greater focus on customer acquisition over those that focus on retention.
To optimize ROI, management may test and find ways to minimize costs and increase conversions among various tiers of customers.
Some customers may need to be fired for having small wallets and are not worth pursuing, whereas others with large wallets and low shares may not be worth the high costs to pursue them, also indicating that the fundamental value proposition and focus may not appeal to this group of customers.
It is best to always focus on the simple descriptive and diagnostic analytics over needlessly advanced predictive and prescriptive analytics.
False
NumPy internally stores data in non-contiguous blocks of memory, dependent on other built-in Python objects.
False
Python runs a lot faster than Java or C++ compiled code.
False
What the customer purchased before, or the customer's web browsing history are examples of ________A__________. The likelihood of a customer being the target is an example of a _________B__________.
A. independent variable B. dependent variable
Select the correct statement.
Anything that is observed or measured at many points in time forms a time series.
With a robust analytics strategy and the right tools and processes, it is not necessary to focus on hiring or training analytics talent to recognize patterns in customers and product purchases.
False
Choose all valid scenarios related to predicting outcomes and testing potential levers.
Given everything we know, could we have predicted this year's results from last year's data. How can we baseline predictions and test effective levers to determine what we could do differently. Taking validated predictions, can we confidently predict next year's results with this year's data. What confidence do we have in business outcomes, before major investments are made.
Choose the correct statement.
While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.
Select the correct statement.
With a DataFrame, you can sort by index on either axis.
Given these two lines of code, select the 3rd line of code necessary to produce the bar plot shown above. In [36]: import seaborn as sns In [37]: subset = tz_counts[:10]
In [38]: sns.barplot(y=subset.index, x=subset.values)
What does the read_csv function do?
load a comma separated value file into a dataframe load a comma separated value file or PDF into a dataframe
______ is a desktop plotting package designed for creating (mostly two-dimensional) publication-quality plots.
matplotlib
Select all examples of optimized groupby methods (data aggregation).
min sum max mean count
When data contains significant missing data, there are a few ways to alleviate the impacts or assess the model effectiveness under such imperfect conditions. Choose all the correct options.
use data imputation eliminate the columns of independent variables with questionable quality additional data cleansing and augmentation eliminate the rows (of customers) with questionable reliability
A big benefit of using NumPy is the ability to write math functions for fast operations on entire arrays of data _______________________.
without having to write for-loops
What are the three stages of a simple group aggregation?
split-apply-combine
Select all examples of time series that could be modeled in business:
stock prices inventory levels profit & loss
Given this code, what is the expected output? In [10]: from datetime import datetime In [11]: now = datetime.now() In [12]: now Out[12]: datetime.datetime(2017, 10, 17, 13, 34, 33, 597499) In [13]: now.year, now.month, now.day
(2017, 10, 17)
To predict a customer's propensity to buy certain product, the workflow would be:
(Spend + Demographics Data) + (P(Buy) model) + (DoE)
Given this code, what is the expected graphical output? Look at the graphs, and then select A, B, C or D. In [238]: close_px.AAPL.plot( ) In [239]: close_px.AAPL.rolling(250).mean( ).plot( )
A
Select the correct statement.
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.)
Select the correct statement(s).
A data analyst/business analytics professional works with structured/unstructured data and utilizes programming and visualization to recommend business strategies. A business analyst focuses on processes and requirements such as the software development lifecycle and utilizes Excel over programming as a key tool.
In the Business Analytics Workflow reviewed via in-class lecture/video, select the correct combination to replace the red and purple ? (question) marks.
Data Mining and Predictive Analytics
Choose the two work horse data structures used in pandas
Dataframe Series
In the different types of Analytics reviewed via in-class lecture/video, select what does NOT belong.
Disruptive Analytics
If you execute the following code, what will be the result? In [12]: import numpy as np In [13]: data = np.arange(10) In [14]: data Out[14]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) In [15]: plt.plot(data)
a simple 45-degree line plot starting from (0,0)
What is Seaborn
a statistical graphics library
What is a ufunc
a universal function
In python, dict is:
a very important built-in data structure, also known as hash map or associative array
What is a tag cloud?
a visual depiction of user-generated tags attached to online content, typically using color and font size to represent the prominence or frequency of the tags depicted.
What is a lambda function?
an anonymous function
What is a dot used for in python?
an array method & function for matrix multiplication
ANOVA is ______
analysis of variance methods
Generic time series in pandas are ____
assumed to be irregular; that is, they have no fixed frequency
GroupBy operations can be significantly faster with __________________ because the underlying algorithms use the integer-based codes array instead of an array of strings.
categorical
If you do a lot of analytics on a particular dataset, converting to ______________ can yield substantial overall performance gains. A ________________ version of a DataFrame column will often use significantly less memory.
categorical
Pandas has a special ___________ type for holding data that uses the integer-based _________ representation or encoding. You can achieve better performance and memory use in some pandas operations.
categorical
What are the four key types of analytics reviewed in Ch.9 (Applied Business Analytics)?
descriptive, diagnostic, predictive, prescriptive
Built-in aggregate functions like 'mean' or 'sum' are typically ____________________ than/as a general applyfunction.
faster
What is the pandas method to use to fill in holes (missing) in data?
fillna
What is NaN?
not a number
Scikit-learn is ______
one of the most widely used and trusted general-purpose Python machine learning toolkits
The cost of changing or adapting operational systems to embed advanced analytics is __________________.
prohibitively high
What does the plt.savefig method do?
save a plot figure to file
Select two popular python modeling toolkits:
scikit-learn statsmodels