Data
SQL Statements
- Identify the Databases - Identify the Data - Identify the Tables - Identify the Conditions (Selectivity) - Select ... From ... Where ...
Types of Business Data:
- Master Data - Transactional Data - Hierarchical Data - Unstructured Data - Metadata
Frequency Distribution
How a Variable Is Distributed
Regression
A statistical method for predicting a numeric outcome (y) using a numeric predictor (x).
r2
Coefficient of determination and the square of the correlation coefficient. A measure of how well x predicts y in a regression. 1 = a perfect relationship, 0 = no relationship.
An overall description of a database, including the names of data elements, their characteristics, and their relationship to one another, would be defined by using a Data definition language. Data control language. Data manipulation language. Data command interpreter language
Data definition language.
Imagine a variable has five values: 7, 10000, 12000, 13000, 14500. What will be the shape of the distribution?
Nonnormal, left skewed, negative skewness
A ___________ p value indicates good model fit.
Small, typically < 0.05
Define "application software."
The diverse group of end-user programs that accomplish specific user objectives. Can be general purpose (word processors, spreadsheets, databases) or custom-developed for a specific application (e.g., a marketing information system for a clothing designer). May be purchased "off the shelf" or developed internally.
Histogram
A graph of the distribution of a variable, grouped into bins (groups).
Define "byte."
A group of (usually) eight bits represent alphabetic, numeric, and other characters (3, g, X, ?, etc.). Several coding systems assign bytes to characters (e.g., ASCII and EBCIDIC). These systems define the sequence of zeros and 1s that represent each character.
Define "field."
A group of characters (bytes) that identify a characteristic of an entity. A data value is a specific value found in a field. Fields can consist of a single character (Y, N) but usually consist of a group of characters. Each field is defined as a specific data type. Date, Text and Number are common data types.
Define "record."
A group of related fields (or attributes) that describe an individual instance of an entity (a specific invoice, a particular customer, an individual product).
What is the difference between a histogram and a box plot?
A histogram shows the distribution of a variable grouped into bins. A box plot shows the median and quartiles of the distribution of a variable.
p
A measure of the statistical significance of a regression. p < .05 indicates a significant relationship, p > .1 indicates little or no relationship.
The r2 for a regression of production volume (x) on maintenance costs (y) is .38. This value indicates:
A moderately strong relationship between x and y.
Box plot
A plot of the distribution of a variable that indicates the median and quartiles of the distribution.
Describe the concept of knowledge management (KM).
Attempts to ensure that the right information is available at the right time to the right user. A variety of practices attempt to electronically capture and disseminate information throughout the organization. Knowledge management practices seek specific outcomes, including shared intelligence, improved performance, competitive advantage, and more innovation.
Why is an automated ETL process desirable?
Automated ETL is the only feasible approach for big data sets. Automated ETL processes are more efficient and accurate than manual ETL processes.
Define "data mart".
A specialized version of a data warehouse that contains data that is pre-configured to meet the needs of specific departments. Companies often support multiple data marts within their organization.
What is the purpose of a stacked bar chart?
A stacked bar chart can show both changes (often over time) and the relation of parts to a whole (e.g., sales by divisions in a company).
Normal distribution
A symmetrical, bell shaped distribution in which the mean and median are usually close to one another.
Master Data Management
Administration of Master Data within an Enterprise Master Entity Data, such as Customers, Items, Vendors, Bills of Materials, Employees Master Reference Data in contrast with Transactional Data, such as Sales Orders, Purchase Orders, Timecards, etc.
Which of the following is not a category of computer software? System software. Programming languages. Application software. All of the above are categories of computer software.
All of the above are categories of computer software.
Define "programming languages."
All software is created using programming languages. They consist of sets of instructions and a syntax that determine how the instructions can be put together.
Database management software is considered: Outerwear. Software. Middleware B and C.
B and C
In a regression predicting monthly rental revenue (y) from the number of rental properties (x), the B coefficient is $1,872. What does this value indicate?
For a 1-unit increase in rental property, the expected monthly rental revenue is $1,872.
Median
For odd set of numbers, the median equals the middle value. For even set of numbers, the median equals the average of the two numbers in the middle. Compared to the mean, the median is a more useful measure of central tendency when the distribution is skewed.
One of the benefits of a single integrated database information system is Closer program-data linkage. Increased data redundancy. Reduced security. Increased data accessibility.
Increased data accessibility.
The machine language for a specific computer May be changed by the programmer. Is the same as all the other computer languages. Is determined by the engineers who designed the computer. Is always alphabetic
Is determined by the engineers who designed the computer.
Why Know SQL Syntax?
Is the Access Mechanism • The basis of - Queries - Reports - Dashboards - Business Intelligence - Business & Predictive Analytics
Right-skewed distribution
More and/or bigger values on the left (lower) side of the distribution. The median is less than the mean. Positive skewness
Left-skewed distribution
More and/or bigger values on the right (higher) side of the distribution. The median is greater than the mean. Negative skewness
Describe the "analytics mindset."
The ability to think critically and use information to exercise informed professional judgments
Of what value is the color wheel in creating visualizations?
The color wheel can be used to choose complementary and contrasting colors for visualizations.
Define "operating system."
The interface between the user and the computer hardware.
Bar charts are useful in showing the relationship between a category and a variable. True False
True
Data mining techniques include statistical analysis. True False
True
Heat maps are useful in describing the riskiness of a process or activity. True False
True
In business analytics, the ability to use legacy tools—for example, Excel—is helpful to transitioning data to contemporary tools. True False
True
Knowledge management systems endeavor to create a competitive advantage through use of shared information. True False
True
Most DBMSs include backup and recovery procedures. True False
True
Prescriptive business analytics is concerned with the use of advanced methods to address business problems. True False
True
The ETL process is for preparing and cleaning data. True False
True
The human perception system is strongly attuned to visual information. True False
True
The mean is a good indicator of the central tendency (or mass) of a normal distribution. True False
True
A byte is a group of 17 bits. True False
False
A cross-sectional regression predicts a variable over time. True False
False
A large p values indicates a significant regression. True False
False
A quantile divides a distribution into fifths. True False
False
A simple linear regression has two or more predictor variables. True False
False
A strong regression relationship indicates that x causes y. True False
False
Business analytics is the reporting of big data to users. True False
False
Computerized data files are composed of fields. True False
False
Data are useful in business analytics. True False
False
Data redundancy is always unnecessary. True False
False
Databases consist of logically related fields. True False
False
Expert systems are passive information repositories. True False
False
In creating data visualizations, the use of gradients from the same color is generally a bad idea. True False
False
Scatterplots are useful in showing the relationship between a category and a variable. True False
False
Scripting languages are low-level. True False
False
Software consists of instructions for databases. True False
False
The mode is always greater than the median. True False
False
A dashboard is designed to tell a story about data. True False
False (collection of views of data that allows a user to understand, and often monitor, a process)
Which of the following structures refers to the collection of data for all vendors in a relational data base? Record. Field. File. Byte.
File
Data cubes
Contain numeric facts called measures, which are categorized by dimensions, such as time and geography Can be built to summarize unit sales of a specific item on a specific day for a specific store
Cubes
- Multidimensional structures of dimensions and measures
Uriah Heep is the CEO of Blundeston Financial Services, an investment banking company based in Boston. The company has grown from a small number of clients that the Heep‐ster (Mr. Heep's nickname) invested many years ago to a large, multi‐product financial services company. To date, however, all of the company's files are maintained as flat files. • The Heep‐ster asks you whether remaking the AIS away from flat files to a relational database system is worth the very high costs?
• Yes, this is worth the cost. • It is difficult to imagine a financial services company keeping clients if it could not answer the questions and queries that are central to database systems. • The cost is likely to be small compared to business failure, which is likely if the business doesn't convert. • Flat files include data redundancies, inabilities to answer basic queries, and extremely limited analytic capabilities.
Online Analytical Processing (OLAP)
- A method to analyze multidimensional data from many different perspectives - enables users to identify issues and opportunities and perform trend analysis
Types of data mining algorithms
- Association discovery - Clustering - Classification - Text and Web content mining
ETL = Extract, Transform & Load
- Extract • Copy out of Source - Transform • Change the Data • Often Cleansing Data • Edits Data such that it fits into Target - Load • Load into Target - Reject Processing • Handling "Erroneous" Data - Caveats: • Transformations may happen in between Source and Target or Prior to Extract or After Upload • Why - Whichever is easiest • Still called ETL even if TEL or ELT - Products: Microsoft SSIS and Talend
Facts/Measures
- Facts/Events - Data that can be aggregated - Often Numerical
Typical Interface Mechanisms:
- Flat File • comma separated values (csv) Traditional; Used most often outside IT - JSON • JavaScript Object Notation • Less verbose • Currently favored by programmers - XML • eXtensible Markup Language (similar to HTML) • Verbose
Data Movement Methodologies
- Internal (within applications) • Integration - External (between applications) • APIs to download and upload data (inbound and outbound) • ETL to move data from "source" to "target" - Moving Data from One Place to Another • such as from Transactional DB to Data Warehouse
Data Migration Challenges
- Moving Data into Staging Tables • Keep copy of Unprocessed Source Data • With all of its Issues - Processing Data • Cleansing • Reformatting • Completing - Can be Challenging - Uploading into Data Warehouse • Mapping Staging into Data Warehouse - Missing Contents • Especially missing data elements from the perspective of the Target • Often "De-normalized" (flattened)
Dimensions
- Perspectives on which to view the Facts - Hierarchically arranged - Drill-Down & Roll-up
Why Data Warehouses?
- Querying - Reporting - Business Intelligence - Business Analytics - Data maybe Reformatted - Data maybe De-Normalized - Data maybe Combined with Other Data - Dimension Tables maybe added for "slicing" data
Types of Data:
- Raw Data - Canonical Data = Normalized Data • Canonical Data Model (CDM) - Logical Data Model - Metadata
API = Application Programming Interface
- Replicates Manual Data Entry Logic and Validation - Designed for Mass Data Upload • Into Applications - such as Prospective Customer Lists • Often used to transfer data between Applications - Essential Idea: • Data goes through the same processes and controls on upload as it would on manual entry
Big Data Challenges
- Volume - Variety - Velocity - Value - Veracity
The benefits of the increased use of ADA
1. Improved understanding of an entity's operations and associated risks, including fraud risk. 2. Increased potential for detecting material misstatements. 3. Improved communications with audit stakeholders.
Describe three influences on the reliability of data.
1. Nature (type of data) 2. Source (e.g., general ledger vs a website) 3. Process (e.g., accounting system vs. web page-generated)
In visualization, it is useful to consider negative space and to realize that 1 + 1 = ?
3. The placement of two objects creates a "negative space" between the objects.
Visualizations should generally use ____ or fewer colors.
4. Using more than four colors is confusing and unhelpful.
Define "file."
A collection of records for one specific entity (an Invoice File, a Customer File, a Product File). In a relational database environment, files are also known as tables.
Define "knowledge base" (or "knowledgebase").
A component of a knowledge management system. A special type of database designed for retrieval of knowledge. It provides the means to collect and organize the information and develop relationships among information components.
Define "expert system" ("knowledge-based system").
A computer program that contains subject-specific knowledge derived from experts. The system consists of a set of rules that are used to analyze information provided by the user of the system. Based on the information provided, the system recommends a course of action.
A data warehouse differs from a data mart because A data warehouse is more specialized than a data mart. Data mining is possible in a data mart but not a data warehouse. A data mart supports specific needs. External data is not included in a data mart.
A data mart supports specific needs. A data mart is more specialized than a data warehouse. The data mart is often constructed to support specific needs of subunits of an organization.
A specialized version of a data warehouse that contains data that is pre-configured to meet the needs of specific departments is known as A functional warehouse. A data mart. A data store. An object-oriented database.
A data mart. A data mart is focused on a particular market or purpose and contains only information specific to that objective.
Lonesome Dove Cattle Ranch stores its accounting system data in multiple tables (i.e., matrices) that are linked by common key fields. This data structure is called Flat files. Matrix organization. A database system. An OLTP.
A database system.
Cross-sectional regression
A regression that predicts a variable at one point in time, e.g., sales (y) for a company for one year based on store square footage (x).
Time-series regression
A regression that predicts a variable over time (e.g., monthly, quarterly, or yearly).
What is the purpose of a scatterplot?
A scatterplot shows the relationship between two variables. Scatterplots can also be used (e.g., with color or shapes) to show differences in categories of observations (e.g., of divisions in a company).
• Monroe Starr is the CFO of Last Tycoon Enterprise. Last Tycoon is primarily a publisher of classic books (fiction and nonfiction). Mr. Starr is seeking your advice as he ponders whether to move some of the company's archives to cloud computing. The specific question that he asks you, however, is to explain the common elements of control found in any database management system written in a language that he (a non‐IT person) can understand.
Answer: cloud computing is a complex issue that is worthy of separate consideration. • The common elements of control found in any database management system. 1. No collisions—Concurrent access management • Multiple users can use the system and maintain record and file integrity 2. No hackers or creepers—Access controls • Restrict access to all tables and record to those who are authorized 3. Data definition standards, data element standards • E.g., Determine what must entered for each data element• Cloud computing is a complex issue that is worthy of separate consideration. • The common elements of control found in any database management system. 4. Backup and recovery procedures • How, when and by whom will the system be backed up? 5. Update privileges • A change control—who can change records and how can they change them 6. Data elements and relationship controls • Structure and relationships among the database elements
Which of the following is true in regard to data warehouses? I. The bulk of the data found in a data warehouse comprises historical operational data. II. Pattern recognition is one of the principal functionalities offered by data mining software. I only. II only. Both I and II. Neither I nor II
Both I and II. A data warehouse is a database archive of an organization's operational transactions (sales, purchases, production, payroll, etc.) over a period of years; external data that might be correlated with these transactions, such as economic indicators, stock prices, and exchange rates, are also included. Pattern recognition is a major component of data mining software: data mining is the process of performing statistical analysis and automatically searching for patterns in large volumes of data.
Which of the following is a primary function of a database management system? Report customization. Capability to create and modify the database. Financial transactions input. Database access authorizations
Capability to create and modify the database.
What is the correct ascending hierarchy of data in a system? Character, record, file, field. Field, character, file, record. Character, field, record, file. Field, record, file, character.
Character, field, record, file.
Relational Database Technology
Components - Database Management System • MS SQL or Oracle or MySQL or ... - DDL (Data Definition Language) - DML (Data Management Language) - Administrative Suite • SQL Server Management Studio (SSMS)
Data Life Cycle - CRUD
Create/Insert • Read/Browse • Update/Edit • Delete/Destroy • Search
The following customer data is stored in the sales processing system to a regional produce distributor: CustomerNumber, CustomerName, CustomerPhone, CustomerContact, CustomerCreditLimit Which of the following is true? CustomerNumber is an example of a field. CustomerNumber is an example of a data value. CustomerNumber is an example of a record All of the above are true.
CustomerNumber is an example of a field.
Analysis of data in a database using tools which look for trends or anomalies without knowledge in advance of the meaning of the data is referred to as Artificial intelligence. Data mining. Virtual reality. Transitory analysis.
Data mining
A data mart is a specialized type of ____________ that is tailored to the needs of a(n) _______. Database; user Data warehouse; organization OLTP; ERP AI; ERP
Data warehouse; organization
Which of the following components of a database is responsible for maintaining the referential integrity of the data in the system? Database management system (DBMS) Data query language (DQL). Data manipulation language (DML). Data definition language (DDL).
Database management system (DBMS) The database management system (DBMS) controls the storage and retrieval of the information maintained in a database and is responsible for maintaining the referential integrity of the data.
Quantiles
Divide a distribution into sections (quartiles are quarters, quintiles are fifths, deciles are tenths.)
Robert the Grievous is reading an online summary production cost report and wants to know why the cost of sprockets, used in constructing orbital sanders, is so high. Robert most likely needs to: Data mine. Drill down. Slice and dice. Use the OLAP system.
Drill down. He needs to move from summary to detailed information to determine its cause.
Scarlett O'Hara "Give a Darn" tours of Atlanta has an automated system that uses information obtained from travel agents to help customers find the best tours for their interests. This is best described as a(n) _____________. Database Data warehouse Expert system Data mart
Expert system
What is ETL?
Extract (getting data), Transform (cleaning the data), Load (moving cleaned data to target system)
Which of the following statements is correct? I. An important advantage of flat file systems is that they are program independent. II. Flat file systems contain little data redundancy. Both I and II. I only. II only. Neither I or II.
Neither I or II. Statement one is incorrect because, while flat file systems do contain program independence, this is seen as a disadvantage not an advantage. This is because the program independence of flat file systems means that multiple programs must be used to read, access and process the data. Statement II is incorrect because flat file systems contain a high degree of data redundancy.
Imagine a variable has five values: 7, 10, 12, 13, 14500. What will be the shape of the distribution?
Nonnormal, right-skewed, positive skewness
The machine-language program that results when a symbolic-language program is translated is called a(n) Processor program. Object program. Source program. Wired program.
Object program.
Define "bit" (binary digit)
On or off. 0 or 1; the smallest piece of information that can be represented in a computer system
Software that performs a variety of general technical computer-controlling operations is a(n) Integrated suite. Shareware. Database. Operating system.
Operating system.
Which of the following is a critical success factor in data mining a large data store? Pattern recognition. Effective search engines. Image processing systems. Accurate universal resource locator (URL).
Pattern recognition
SQL is most directly related to String question language processing. The "grandfather, father, son" method of record retention. Electronic commerce. Relational databases.
Relational databases
r2 is the ______ of the correlation coefficient.
Square. That is, r2 = the correlation coefficient (r)2.
Standard deviation
Standardized measure of dispersion (variation) in a variable. Abbreviated as SD or as a lowercase Greek letter sigma σ (for the population standard deviation). In securities, a measure of risk. In investing, higher equals more dispersion and risk, lower equals lower dispersion and risk For normally distributed data, about 66% of observations are within 1 standard deviation of the mean and about 95% are within 2 standard deviations of the mean. For this distribution (given that it is normal), the standard deviation indicates that about 66% of the inventory values are +/-$2,806 (the standard deviation) from $7,837 (the mean or average).
Interquartile range (IRQ)
The middle 50% of the distribution. The 3rd minus the 1st quartile.
Audit data analytics (ADA)
The process of "discovering and analyzing patterns, identifying anomalies, and extracting useful information in data ... for the purpose of planning or performing the audit."
Define "systems software."
The programs that run the computer and support system management operations.
In a regression predicting monthly rental revenue (y) from number of rental properties (x), the intercept is $693. What does this value indicate?
The value of y when x is 0 is $693.
Mode
The value that occurs most frequently in a given data set.
"Drill down" is the ability to move from aggregated data to the detailed transactions underlying it. True False
True
A bit is a binary digit. True False
True
Which of the following is not true? Relational databases Are flexible and useful for unplanned, ad hoc queries. Store data in table form. Use trees to store data in a hierarchical structure. Are maintained on direct access devices.
Use trees to store data in a hierarchical structure
Four Categories of Business Analytics
What is happening, why is it happening, what is likely to happen, and how should we act in response?
What is the central question of predictive analytics?
What is likely to happen?
When is the mean a good indicator of the central tendency (or mass) of a variable?
When the variable is normally distributed
Define "slicing and dicing" in data warehouses
• Ability to view data in multiple dimensions • E.g., sale of printers by product, by region, by time period, by company, etc.
Define "drill down."
• Associated with data warehouses • Ability to move from summary to granular information
Data Repository Concepts
• Data Warehouses - Traditional Query/Reporting • Data Marts - Focused on Specific Areas/Needs/Depts • Data Lakes - Vast amounts of Data
Analytics
• Descriptive (what happened) • Diagnostic (why did it happen) • Predictive (what will happen) • Prescriptive (what should we do) • Discovery/Investigative (what can be learned from the data)
Business Intelligence Trends
• From Internal to Mix of Internal & External Data • From Structured to Semi-Structured to Unstructured Data • From Reports to Visualizations • From Measurement to Prediction • Emergence of Metadata as Data
Business Intelligence and Business Analytics
• Solid Data Management/Data Governance • Assembling the Data • Organizing the Data • Cleaning the Data • Migrating the Data to a Suitable Platform • Applying a Suitable BI/BA tool • Techniques - Graphics Techniques - Linear Regression • Dashboards - Key Performance Indicators (KPIs) • Tools - Query/Reporting Tools - Visualization Tools
Describe a flat file system
• Used in early (pre‐ERP) systems • Characterized by independent programs and data sets • High data redundancy, poor analytic capabilities • Difficult to achieve cross‐functional reporting