CHOI Final Exam
For every dimension under consideration, two questions must be answered:
Question 1: Can the dimension table be useful for the analysis of the chosen subject? Question 2: Can the dimension table be created based on the existing data sources?
Data Manipulation Language: SELECT
Retrieve data based on conditions
Two tailed test:
Non-directional test which considers two possibilities: the purchase probability of treatment group could be more than or less than that of control group
Data manipulation FUNCTION
Retrieve data, Edit, Update, Add, delete
Data Manipulation Language terms
SELECT INSERT UPDATE DELETE
SELECT Statement:
SELECT * FROM Order_Table WHERE order_quantity >3;
IN(): example
SELECT * FROM coupon_usage WHERE discount_rate IN (10, 25);
Explicit JOIN notation
SELECT * FROM employee JOIN department ON employee.DepartmentID=department.DepartmentID SELECT LastName, CourseTitleFROM student JOIN course ON student.studentID= course.studentID
Implicit JOIN
SELECT * FROM employee, department WHERE employee.DepartmentID=department.DepartmentID
HAVING Statement example
SELECT discount_rate , COUNT( member_ID ) FROM coupon_usage GROUP BY discount_rate HAVING discount_rate >20;
SELECT Statement:
SELECT field_name_1, [field_name_2, ...,] FROM table_name_1, [table_name_2, ...,] WHERE Condition GROUP BY field_name HAVING Condition ORDER BY field_name [ASC or DESC];
Data manipulation TYPES
SELECT, INSERT, UPDATE, DELETE
HAVING Statements:
Similar to WHERE Statements, it confines the query into specific conditions Used after GROUP BY function USE HAVING (NOT WHERE) after GROUP BY function!!! • You CANNOT use WHERE after GROUP BY function!!!
DISTINCT Statement:
Syntax to use to select one record from duplicate records
Dimensional modeling
-A data design methodology used for designing subject-oriented analytical databases, such as data warehouses or data marts
Subject-oriented
-An operational database system is developed in order to support a specific business operation -A data warehouse is developed to analyze specific business subject areas
LIKE "%~%":
-Applies to string values -You can select the records which contain specific keywords
Outer Join
List the Customer ID number, registered date, and order number for all customers. Include all customers even if the customer does not have an order.
ORDER BY Statement:
-Returns the results sort by order -Syntax to control the order of output -Default: Ascending order (use DESC to sort them in decreasing order)
Outer join
-Rows that do not have matching values in common columns are still included in the result -Can use left outer join or right outer join
Data warehouse components
-Source systems -Extraction-transformation-load (ETL) infrastructure -Data warehouse -Front-end applications
Integrated
-The data warehouse integrates the analytically useful data from the various operational databases (and possibly other sources) -Integration refers to this process of bringing the data from multiple data sources into a singular data warehouse.
Structured repository
-The data warehouse is a database containing analytically useful information -Any database is a structured repository with its structure represented in its metadata
Inner/Natural join
-The query compares each row of table A with each row of table B to find all pairs of rows which satisfy the join condition. -Rows must have matching values in order to appear in the result.
Data Manipulation Language: INSERT
Insert values into the table
Data Manipulation Language: UPDATE
Modify stored data
Operational information (transactional information)
-the information collected and used in support of day to day operational needs in businesses and other organizations
data warehouse.
A ______________ is created within an organization as a separate data store whose primary purpose is data analysis.
retrieval of analytical information
A data warehouse is developed for the ________________, and it is not meant for direct data entry by the users
Null hypothesis:
A statement that no difference exists between the parameter and the statistic being compared to it.
ALTER TABLE statement:
ALTER TABLE table_name ADD field_name datatype <field conditions>; ALTER TABLE table_name ADD CONSTRAINT condition_name; ALTER TABLE table_name DROP field_name; ALTER TABLE table_name DROP CONSTRAINT condition_name;
Join
An operation that causes two or more tables with a common field to be combined into a single table or view
IN():
Applies to the discrete values
CREATE DATABASE example
CREATE DATABASE dbname
Data definition TYPES
CREATE, ALTER, DROP
Alter table
Change structure (i.e., adding new fields) or change constraints
Two approaches to hypothesis testing:
Classical statistics Bayesian statistics
Data Control Language (DCL)
Commands that control a database
Data Definition Language (DDL)
Commands that define a database
Data Manipulation Language (DML)
Commands that maintain and query a database
dimensional modeling
Commonly, ____________ is employed as a relational data modeling technique
Declaring a Primary Key
Create Table Broker (BrokerID Int not null primary key, LastName varchar(255) not null, FirstName varchar(25), GPA varchar(5));
Data Manipulation Language: DELETE
Delete the records in the data
star schema
Designing the _____________ involves considering which dimensions to use with the fact table representing the chosen subject
One-tailed test:
Directional test which considers only one possibility (either greater than or less than)
Data control TYPES
GRANT, REVOKE, COMMIT, ROLLBACK
Data definition FUNCTION
Generate data structure or delete
Data control FUNCTION
Grant or revoke permissions to database users and recover transactions
INSERT Statement
INSERT INTO Table_Name VALUES (value1, value2....);
INSERT Statement example
INSERT INTO Table_Name_1(Fields_Name_1) SELECT Fields_Name_2FROM Table_Name_2WHERE Conditions;
INSERT INTO
The __________ statement is used to insert new records in a table.
GROUP BY
The ____________ statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result
INSERT INTO
The ____________ statement is used to insert new records in a table.
changes
The data in the data warehouse is not subject to _______.
non-volatile, static, read-only
The data in the data warehouse is referred to as____________ , ________, or __________
retrieval
The only functionality available to the users of the data warehouse is _________
Hypothesis Testing
The purpose of ___________ is determine the accuracy of your hypotheses by using statistics.
time variant
The term _____________ refers to the fact that a data warehouse contains slices or snapshots of data from different periods of time across its time horizo
Update Statement
UPDATE Book SET Stock = 10 WHERE Book_num = 5;
GROUP BY
When you use ___________ function, please make sure you select the fields which have the records unique to the ___________ level
the data warehouse =
a structured repository of integrated, subject-oriented, enterprise-wide, historical, and time-variant data. The purpose of the data warehouse is the retrieval of analytical information. A data warehouse can store detailed and/or summarized data.
Foreign keys
are identifiers that enable one table to refer to another table - Primary key from one table becomes the foreign key of the other - Foreign keys cannot have values that the primary key in other table does not have → Referential Integrity Rule
Primary keys
are unique identifiers of the table -This is how we can guarantee that all instances are unique - Use EntityNameID if not a "natural" primary key -Keys can be simple (a single attribute) or composite (more than one attribute). -No primary key attribute may be null. All primary key fields must have data → Entity Integrity Rule
Data Definition Language (DDL) examples
creating, altering, and dropping tables and establishing constraints
CREATE DATABASE
defines a database
CREATE TABLE
defines a new table and its columns
Dimensions Facts
dimensional modeling distinguishes two types of tables:
star schema
in the __________, the chosen subject of analysis is represented by a fact table
Data Control Language (DCL) examples
including administering privileges and committing data
GROUP BY Statement:
is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result
Alternative hypothesis:
logical opposite of the null hypothesis.
ERD to SQL
look at slide 17 and 20 will be exactly like the exam... look through slide 18 - 20
Classical statistics:
represents an objective view of probability in which the decision making rests totally on an analysis of available sampling data. (the hypothesis is rejected or accepted based on the sample data collected)
Bayesian statistics:
represents an subjective probability estimates stated in terms of degrees of belief.
DISTINCT Statement example
select distinct(discount_rate) from coupon_usage;
CREATE TABLE example
tablename(column_name1 data_type ,column_name2 data_type, column_name3 data_type , ....)
Analytical information -
the information collected and used in support of analytical tasks -Analytical information is based on operational (transactional) information
Assumption for t-test:
•The observations must be independent—that is, the selection of any one case should not affect the chances for any other case to be included in the sample. •The observations should be drawn from normally distributed populations. •These populations should have equal variances.