ISM 4212
Dimensional modeling
- A data design methodology used for designing subject-oriented analytical databases, such as data warehouses or data marts - Commonly, dimensional modeling is employed as a relational data modeling technique
Data Warehouse
- A typical organization maintains and utilizes a number of operational data sources. - The operational data sources include the databases and other data repositories which are used to support organization day-to-day operations - A data warehouse is created within an organization as a separate data store whose primary purpose is data analysis.
Subject-oriented
- An operational database system is developed in order to support a specific business operation - A data warehouse is developed to analyze specific business subject areas
ANOVA Test
- If we have k-independent-samples tests (i.e., three or more samples) - Uses the variance of observations
Star Schema
- In the star schema, the chosen subject of analysis is represented by a fact table - Designing the star schema involves considering which dimensions to use with the fact table representing the chosen subject
What does an Outer Join do?
- List the Customer ID number, registered date, and order number for all customers. Include all customers even if the customer does not have an order. - SELECT customer.customerid, registerdate, orderid FROM customer LEFT OUTER JOIN b2c_order ON customer.customerid = b2c_order.customerid
Structured repository
- The data warehouse is a database containing analytically useful information - Any database is a structured repository with its structure represented i in its metadata
Assumption for t-test
- The observations must be independent—that is, the selection of any one case should not affect the chances for any other case to be included in the sample. - The observations should be drawn from normally distributed populations. - These populations should have equal variances.
SELECT statement
- WHERE - GROUP BY - HAVING - ORDER BY
Foreign keys
- are identifiers that enable one table to refer to another table - cannot have values that the primary key in other table does not have
Primary keys
- are unique identifiers of the table - Simple (single attribute) or Composite (more than one attribute) - No PK can be null - Guarantee's that all instances are unique
Analytical Information
- the information collected and used in support of analytical tasks - based on operation (transnational) information
Data definition Types
-CREATE - ALTER - DROP
Data control Types
-GRANT - REVOKE - COMMIT - ROLLBACK
Data manipulation Types
-SELECT - INSERT - UPDATE - DELETE
Data warehouse components
-Source systems -Extraction-transformation-load (ETL) infrastructure -Data warehouse -Front-end applications
Alternative Hypothesis (H1)
logical opposite of the null hypothesis
Outer join
- Rows that do not have matching values in common columns are still included in the result - Can use left outer join or right outer join
Integrated
- The data warehouse integrates the analytically useful data from the various operational databases (and possibly other sources) - Integration refers to this process of bringing the data from multiple data sources into a singular data warehouse.
Retrieval of analytical information
A data warehouse is developed for the retrieval of analytical information, and it is not meant for direct data entry by the users
Null Hypothesis (H0)
A statement that no difference exists between the parameter and the statistic being compared to it.
Data Control Language (DCL)
Commands that control a database, including administering privileges and committing data
Data Definition Language (DDL)
Commands that define a database, including creating, altering, and dropping tables and establishing constraints
Data Manipulation Language (DML)
Commands that maintain and query a database
ALTER TABLE
Data Definition types
One-tailed test
Directional test which considers only one possibility (either greater than or less than)
Heteroscedasticity
The variances of population are equal
Left outer Join
left table and then things that correspond in the right table
Join
An operation that causes two or more tables with a common field to be combined into a single table or view
Independence of errors
Each sample group is independent from each other
Referential Integrity Rule
Foreign keys cannot have values that the primary key in other table does not have
Data definition
Generate data structure or delete
Data control
Grant or revoke permissions to database users and recover transactions
Entity Integrity Rule
No primary key attribute may be null. All primary key fields must have data
Two-tailed test
Non-directional test which considers two possibilities: the purchase probability of treatment group could be more than or less than that of control group
T-score < T- critical
Null hypothesis is not rejected
T-score > T- critical
Null hypothesis is rejected
Data manipulation
Retrieve data, Edit, Update, Add, delete
Explicit JOIN notation
SELECT * FROM employee JOIN department ON employee.DepartmentID = department.DepartmentID SELECT LastName, CourseTitle FROM student JOIN course ON student.studentID = course.studentID
Implicit JOIN statment
SELECT * FROM employee, department WHERE employee.DepartmentID = department.DepartmentID SELECT LastName, CourseTitle FROM student, course WHERE student.studentID = course.studentID
Selection Bias
Samples must be randomly selected from normal population
Data Warehouse Definition
The data warehouse is a structured repository of integrated, subject-oriented, enterprise-wide, historical, and time-variant data. The purpose of the data warehouse is the retrieval of analytical information. A data warehouse can store detailed and/or summarized data.
Hypothesis Testing
The purpose of hypothesis testing is to determine the accuracy of your hypotheses by using statistics.
Inner/Natural join
The query compares each row of table A with each row of table B to find all pairs of rows which satisfy the join condition. Rows must have matching values in order to appear in the result.
Time variant
The term time variant refers to the fact that a data warehouse contains slices or snapshots of data from different periods of time across its time horizon
Classical Statistics
represents an objective view of probability in which the decision making rests totally on an analysis of available sampling data. (the hypothesis is rejected or accepted based on the sample data collected)
Bayesian Statistics
represents an subjective probability estimates stated in terms of degrees of belief
Operational information (transactional information)
the information collected and used in support of day to day operational needs in businesses and other organizations