Info 340 Midterm Study Guide
Explain what is meant by 'ACID principles'.
Acronym for the principles of transaction management. Hint: relational databases are effectively transaction management machines. Any INSERT, UPDATE or DELETE statement is considered a transaction (and therefore is afforded the guarantees offered by ACID principles).There are two types of transactions: 1) implicit (any DELETE, INSERT or UPDATE statement that is executed by itself and 2) explicit transaction (any bundle of statements wrapped within a BEGIN TRAN and COMMIT TRAN wrapper. ACID stands for the following: A: Atomicity ('All effects of a transaction---which could be one or even thousands of statements---complete successfully or NONE of them are completed') C: Consistency ('each transaction ends in good/consistent state...or the whole thing rolls back'). A consistent state is when no violations occur in terms of PK/FK requirements, data type requirements or business rules. I: Isolation ('a transaction occurs in the database as if the session were the only connection'). Effectively single-user mode. We are not influenced or affected by other transactions in-flight. D: Durability ('once a transaction has been committed...it lives forever; never under any conditions can it be changed, undone or modified'
Explain what a multi-valued dependency is and provide an example.
Although BCNF removes anomalies, there is also Multi-Valued Dependency which can cause data redundancy. Represents a dependency between attributes (for example A, B and C) in a relation, such that for each value of A there is a set of values for B and a set of values for C. However, the set of values for B and C are independent of each other. A multi-valued dependency when the presence of one or more rows in a table implies the presence of one or other rows in the same table. Put in another way, two or more attributes (columns) are independent of each other, but both depend on a third attribute. A multi-valued dependency prevents 4NF. Consider: ----------------------------------- branchno | sName | oName ____________________________________________ b003 ann b carol f b003 david f carol f b003 ann b tina m b003 david f tina m Displays the names of members of staff (sName) and property owners (oName) at each branch office (branchNo). sName uniquely identifies each member of the staff and oName uniquely identifies each owner. In this example, members of the staff Ann and David work at branch B003, and property owners Carol and Tina are registered at branch B003. However, as there is no direct relationship between members of staff and property owners at a given branch office, we must create a tuple for every combination of member of staff and owner to ensure that the relation is consistent. This constraint represents a multi-valued dependency in the BranchStaffOwner relation. MVD exists because two independent 1:* relationships are represented in the BranchStaffOwner relation.
Explain the difference between data and information and provide an example of each.
Clifford Stoll - set of context → Progression of data management/cognitive understanding. People become more aware of data. Data → Information → Knowledge → Understanding → Wisdom People who see trends and communities will thrive more Data is raw values- cannot understand them without context. Ex. Randomly saying blue/12. We need more perspective - why are you saying the words? We cannot learn with just data Information - Data in context, we can finally do something with these raw values, use them as information. Ex. Blue is my favorite color, 12 is my favorite color. Knowledge - you can process information. Ex. Lane on highway is closed, this might cause a traffic jam. This helps us use the information that we have collected. From this you go into understanding and wisdom. Better decisions are made as you go into wisdom
Explain the function of each of the clauses in the SELECT statement. What restrictions are imposed on these clauses?
Every clause in the SELECT statement - SELECT (list of columns) <- the only required statement for a query What we want to return FROM (tables as many as needed) JOIN (ON PK and FK) WHERE (Restrictions) GROUP BY (bundles of rows that are grouped... usually to be against an aggregate function - (MIN(), MAX(), SUM(), COUNT(), STDEV()) HAVING - Restricts the resultset out of GROUP BY. Count all bricks on campus vs count all bricks in room. ORDER BY WORKS: SELECT 5 * 5 SELECT GETDATE() SELECT 'THIS CLASS SUCKS' SELECT - list of columns to be returned, which tables we are going to pull columns from SELECT Top 1 StudentFName, StudentLName FROM - the list of tables that SELECT takes columns from FROM tblSTUDENT JOIN - join tables and retrieve columns from more than 1 table JOIN tblSTUDENT_ADDRESS AD ON S.StudentID = AD.StudentID WHERE - filters for columns AD.CityName = 'Seattle' LIKE - search '3_7%' ← % anything after that ORDER BY - orders result set ORDER BY S.Age DESC INSERT INTO tblSTUDENT (ID, Fname, Lname) VALUES (1234567, 'Greg', 'Hay')
Explain how effective management of data provides a competitive advantage for organizations in today's economy.
Like organize or die If we have clean data, trust data, can make decision correctly. Can learn from data and not just guess things. Data warehousing and analytical question if we have clean/trustworthy data we can determine things very quickly such as location, pricing, when to discontinue items. "If we didn't have decisions to make and transactions to manage we wouldn't need databases."
Explain how a PK that is narrow, sequential and never-updated increases system performance.
Narrow (such as a integer data type which is 4 bytes) is better than fat (such as the combination of Fname, Lname, DateOfBirth, CityOfOrigin, and favorite football team which may be 100 bytes). In this example, we can fit 25 times as many rows in memory with the INT as opposed to character columns. Memory is about 1 thousand times faster to retrieve rows than from disk. · Sequential: we want sequential so we do not lock any portion of the table we are INSERTING new rows into ( except the first row that holds the data we are inserting). We can enable auto)increment function to conduct this. · Never Updating: We will destroy any gains received from being narrow and sequential if we ever modify the PK. o Lose history o Never modify PK o Example: Changing out StudentID means we lose all history of any activity recorded under our previous ID in addition to forcing a recalculation of the sorting of all PKs. Narrow (such as an INTEGER data type which is 4 bytes) is better that fat (such as the combination of Fname, lname, DateOfBirth, CityOfOrigin, and favorite football team which may be 100 bytes). In this example, we can fit 25 times as many rows in memory with the INT as opposed to character columns. Memory is about 1 thousand times faster to retrieve rows than from disk. Sequential: we want sequential so we do not lock any portion of the table we are INSERTing new rows into (except the first row that holds the data we are inserting). We can enable auto-increment function to conduct this. Never-updated: We will destroy any gains received from being narrow and sequential if we ever modify the PK. Example: changing out studentID means we lose all history of any activity recorded under our previous ID in addition to forcing a re-calculation of the sorting of all PKs.
4. Discuss the purpose and benefits of following the normalization process during database design.
Purpose: You ensure that attributes are in the right tables. Also reduce redundancy and make everything appear on the same level. Normalization is a formal technique for analyzing a relation based on its primary key and the functional dependencies between attributes of that relation Normalization 'irons out the wrinkles' in attribute to entity placement. Establishes determinant and dependency relationships. Collecting the right characteristic attributes of an object, an entity, and making sure we are not overloading too much data, too many columns in the wrong entities/tables. From a high level: What is normalization? It's the process of taking a concept, an idea, brainstorming and going through the conceptual diagramming phase once we get into the realm of the logical design phase we want to iron out the attributes. We want to make sure we have the attributes appropriately located. We may have the right exact attributes but if we don't have them in the right objects we're going to end up with bad design, poor performance, and poor data integrity. If we're ever in the position of being responsible for participating or contributing to a database design we want to be able to recognize whether we have a productive design. Normalization is going to help us make sure we have the right attribute and the right objects, and recognize all relationships that are going on between the objects. Relations are categorized as normal form based on which modification anomalies or other problems they are subjected to. If we don't meet the requirements of a relation the data is JUNK. Function dependencies: what determines or what defines a particular column. Example: For keeping data about students, do we also want to have data on where they live like dorm name? Dorm room? Year that the dorm was built? We want to make sure that every attribute in a particular entity has characteristics of that object. We want to make sure that all objects are functionally defined by the primary key. 3NF is when we achieve fully normalized structures, but we can easily end up in 4NF just by default. What is a relation? A relation looks very similar to a spreadsheet. We have column headers and rows of data. All columns relate to each other like a record. Un-Normalized data: Stuffing too much data in a single cell. Violates rules for what is a relation. We are not able to build a structure like this in a relational database because it is not going to allow us to put three values in one column. What makes a proper relation? Each cell has to have one and only one value. Normal Forms UNF - A table that has repeating groups of column values. 1NF - A table that qualifies as relation is in 1NF. 2NF - A relation is in 2NF if all of its non-key attributes are dependent on ALL of the primary keys. 3NF - A relation is 3NF if is in 2NF and has no determinants except the primary key. Boyce-Codd Normal Form (BCNF) - A relation is in BCNF if every determinant is a candidate key.
Explain the conditions that are required for a relation.
Relation name distinct from all other relation names in schema Each cell of relation contains exactly one atomic (single) value Each attribute (column) has a distinct name Values of attributes are all from same domain Each tuple (row) is distinct; no duplicate tuples Order of attributes/tuples not significant NOT RELATIONS: Multiple entries per cell, table w/ required row order One and only one value in one cell, one row for 'people'. Unique column name. values in same domains. Everything in one column tracking same type of information. No order dependency of the data. First normal form would be we have met all the forms of a relation.
Explain the concept of type inheritance in the context of relational database design. When is it best implemented?
Super type/sub Consider athlete supertype with basketball player, baseball player, and soccer player subtypes. All subtypes would be linked to the supertype one to one
Explain what a M:M relationship is and why relational theory does not allow them.
What is it? Many to many relationship is when more than one rows in a table are associated with more than one rows in another table. Example? Two entities, student and class. A student can take many classes. And a class can be taken by many students. This is a many to many relationship. Why do we care? What is the concern? So What? Why a relational theory does not allow many to many relationship is because there would be a lot of NULL values. There will be NULL values because you would be adding a bunch of rows trying to modify the table to fit but you are just creating a restriction for yourself. Additionally, it is redundant. You have to store the name of the student (Morgan), then the class they are in (340). Then another row where you store (Morgan) and her other class (360). <- redundant. Better to store all the student names once in one table. Then all the class names once in another table. Between those two tables add an associative entity where you relate the student and class tables using IDs. How? Therefore a better way to implement a many to many relationship is creating a third table called an associative entity. An gm entity bridges the many to many relationship by taking the primary keys of each table and foreign keys.
Explain the concept of data independence in the context of relational theory.
What is it? Separation of application logic from platform/storage logical. In the hierarchical model, file locations were hard-coded directly in the application design. This meant that the actual location of data had to be known by the database developers and users. EXAMPLE: STUDENT file is queried from c:\STUDENT\2017\AUTUMN\INFO\340\QUIZ2 as opposed to SELECT * FROM tblSTUDENT So what? It is easier/less expensive to conduct maintenance (we do not have to turn off the application or kick off all the users to move file location or increase capacity). We can separate duties between application developers, hardware architect and database admins which allow for experts and innovation. How? We have a logical structure of design (such as Entity-Relationship Diagram) that represents physical storage as a logical name (like tblSTUDENT or tblPRODUCT). We do not need to be aware of the physical path to the data as that is hidden and managed by the database management system.
21. Describe what a view is and how it is most-often used.
a. A view is the dynamic result of one or more relational operations operating on the base relations to produce another relation.. A view is a virtual relation that does not necessarily exist in the database but can be produced upon request by a particular user, at the time of request. b. Base relation (to clarify above definition for those unfamiliar): A named relation corresponding to an entity in the conceptual schema, whose tuples are physically stored in the database) c. Used when you want to: hide parts of the database from certain users, customize data access to users needs, and to simplify complex operations on the base relations. Unsure if this qualifies as "most-often used" but this is what they are used for... A view is a saved SELECT statement; can span dozens of tables in a very complex many (Group By, many sub queries,). Behaves like a table (we select from the view without having to know the underlying tables involved). Best use: under-skilled executives who need reporting data. Layer of abstraction (security in the sense that we do not need to share schema with users). Also, we can filter columns by not selecting them in our statement.
10. When was the relational model introduced and by whom? What were the advantageous characteristics of the relational model?
a. E.F. Codd of the IBM Research Laboratory in 1970's Edgar F. Codd, IBM engineer - used relational algebra - introduced in 1970 - Implemented early 1980 - revolutionized the data industry, eliminated almost ALL previous problems Advantages: Relational Model . Eliminated almost all previous problems, records type has no explicit owner, have all possible relations Relatively easy to understand - similar to real world concepts - entities and relationships Improve and optimize business processes - can model complex relations Quick to obtain information - experts not required, anyone can access data, decisions became faster, business modeling and analyzing trends became more efficient Efficient data storage - 3NF ensures no redundancy (less footprint, physical storage is less expensive), data integrity guaranteed Efficient implementation - ANSI has rules - platform independence, don't need to hardcode the physical location for the files, can physically move them without impacting database, easy to customize - adding new columns, affordable Scalability - structure allows for large data sets, easy to archive data. Ex. Quick stream activities for Microsoft - editing database, doesn't slow down activity Efficiency in data storage and standards developed Data storage (collection), data retrieval (presentation), data security (protection) Computer language that all systems use - SQL, dialect might differ Normalization - design methodology Data independence - data location not hard coded, column order is not significant
11. What were the characteristics of the relational model?
a. Eliminated almost all previous problems associated with Database Management Systems. Links record together by all relationships, and each record type has no explicit owner. More efficient by storing data in exactly one place. SQL computer language was developed to manage the system, and normalization was adopted as a design methodology to ensure data independence. All data is logically structured within relations (tables). Each relation has a name and is made up of named attributes (columns). Each tuple (row) contains one value per attribute.
15. Describe how entity integrity and referential integrity are enforced.
a. Entity integrity specifies in a base relation, no attribute of a primary key can be null. Enforced by adding a primary key clause to a schema definition (NOT NULL). i. Each primary is uniquely defined and cannot be modified b. Referential integrity states that if a foreign key exists in a relation, either the foreign key value must match a candidate key value of some tuple in its home relation or the foreign key value must be wholly null. Enforcement: i. SQL rejects any insert or update operation that attempts to create a foreign key value in a child table without a matching candidate key value in the parent table. ii. When the user attempts to delete a row from a parent table, and there are one or more matching rows in the child table, there are 4 options (know what these are, but you probably should just namedrop for sake of time): 1. CASCADE- Delete the row from the parent table and automatically delete the matching row in the child table. Deleted rows may themselves have a candidate key that is used in another table, foreign key rules are triggered in a cascading manner. Triggers are used mainly to enforce referential integrity 2. SET NULL - Delete the row from the parent table and set the foreign key value in the child table to NULL (assuming NOT NULL isn't specified) 3. SET DEFAULT - Delete the row from the parent table and set each component of the foreign key in the child table to the specified default value 4. NO ACTION - Reject the delete operation from the parent table Primary key is going to guarantee uniqueness for the PK column. When we have a unique column value, we are guaranteed to have a unique row.Foreign Keys are going to guarantee that the originating value (the PK in a referenced table) is present. It exists!! How/when is this checked? At the time a transaction is presented to the database to be committed, the system will check all business rules, data types, PK values and FK values for any violations. We are restricted/prevented from dropping or deleting rows in a PK table if the values are being referenced in an FK table.
8. Explain how people tracked data before computers; what were the limitations of these systems?
a. File-based systems were an early attempt to computerize the manual filing system. An organization might have physical files set up to hold all their data that they tracked. They were labeled and stored in file cabinets. The manual filing system breaks down when we have to cross-reference or process the information in the files. With separate files many different pieces of information, it would take a lot of effort to use the files to answer questions related to the business. b. If only one copy of a data exists, it is vulnerable to damage, theft, and misplacement. c. Prone to mistakes Paper based systems for centuries 1700s, 1800s. Method of business Computers came in 1950s Accounting books - ledgers/binders ex. Sears catalog - took 3 weeks to get items File Cabinet Databases drawer - loosely represents database hanging file folder - table or entity/excel spreadsheet folding file - column or field receipts - row or record (keeps track of transactions) Advantages Quick to establish - shoe box/crate Easy to use - no skills/professionals required Cheap Disadvantages Vulnerable - single copies, can be lost in accidents. No backup, takes too long to make photocopies to prevent accidents. Even paper's ink will fade away. Even today, receipts ink will fade Not scalable - limited to people's physical ability Difficult to query - cannot analyse/compare ranges of data to find trends or even find information Difficult to transfer data - Making copies, you can lose control if you make notes. Have to write it on all copies, data won't be transferred easily. Still here today - cash, paper receipts, parking tickets, notes, wallet Inspired today's database systems - needed to improve efficiency Paper based systems are flawed for their scalability limitations with their security limitations.
17. Give 3 examples of the mechanisms RDBMS implement to enforce ACID properties.
a. Locking -- See Isolation on Question 8 b. Logging -- See Durability on Question 19 c. Entity Integrity / Referential Integrity -- See Question 16 (Consistency) i. Entity integrity enforces consistency by ensuring accuracy in the database (e.g. no duplicate values for a primary key, which would otherwise result in data anomalies) Atomicity: explicit transactions Isolation: Locks, and timestamps Consistency: data types, primary keys, foreign keys and business rules Durability: Write-ahead-logging (transaction log is also acceptable) Atomicity: explicit transaction (forced roll-back) Consistency: data types, business rules/check constraints, computed columns Isolation: locking, time-stamps Durability: Write-ahead logging
12. What is meant by cardinality, degree, multiplicity and participation?
a. Relationship cardinality (mandatory one, mandatory many, optional one, optional many). Cardinality:number of attribute in a relation (Cardinality is the number of tuples in a relation,it describes maximum number of possible relationship occurrences for an entity participating in a given relationship type) eg: 1-0, 1-many, etc Cardinality:Describes the number of occurrences of an entity in a relationship with another entity. Example:STUDENT and CLASS are entities; the relationship from left to right is "one STUDENT can take many CLASSES" and right to left is "the same CLASS can have MANY STUDENTs" therefore the relationship is MANY-to-MANY. The book says that cardinality describes the number of rows in an entity; this is a European interpretation. Here in US industry we use cardinality to describe relationships .1:MM:M1:1 Degree: describes the number of entities that participate in a relationship. Degree 2 is also known as 'binary'; degree 3 is also known as ternary. Example: A degree 2 relationship (aka 'binary') would be STUDENT_TYPE and STUDENT. A degree 3 relationship example is PERSON, ROLE, and CLASS coming to describe PERSON_CLASS_ROLE. Multiplicity: describes exactly the minimum and maximum participants in a relationship. Example: the iSchool has a rule that a class must have at least 11 students to be run; the maximum varies obviously but we often cap at 35. Therefore, multiplicity for an iSchool class is written (11, 35). Multiplicity : min / max / range of instances (attributes) in a relationship e.g. pizzas order min of five to be delivered, not more than 10 Participation: is either mandatory or optional. Participation determines whether all or only some entity occurrences participate in a relationship (mandatory vs. optional )
20. Describe what a stored procedure is and how it is most-often used.
a. Stored procedures can take a set of parameters given to them by calling programs and perform a set of actions. Can modify and return data passed to them as a parameter. Do not necessarily return values. They provide extensibility, reusability, maintainability, and aid in abstraction b. Most often used.... c. Most-often used for Highly-frequent tasks (ie. authentication to Facebook) or Highly-sensitive activity (ie. around tables that take money or track orders...Amazon checkout) Describe what a stored procedure is and how it is most-often used.Stored procedure is saved SQL code. A sproc is different from a function and a view in that we can modify data with a stored procedure (INSERT, UPDATE and DELETE). Best use is for any transaction. Even better use is for high-volume (millions?) of repeatable transaction (order-entry for Amazon for example). WHY? A sprocs lives in memory as a named object; it has been compiled (converted into machine language), assigned an execution plan and accepts parameters. Scales exceptionally well (tens of millions of executions per hour).
6. Discuss the advantages of introducing SuperType/SubTypes into an ERD and provide examples.
a. Superclasses and subclasses are used to avoid describing different types of the superclass with different attributes within a single entity. This saves the designer time and makes the ER diagram more readable. Additionally you can show relationships that are associated with one particular subclass, and not with the entire superclass. This adds more semantic information to the design. b. Something can be in the superclass and not in the subclass c. Example: staff (super) and the different types (e.g. manager, sales associate, accountant, secretary are sub) OR a cast member can be of different types (crew, actor, writer, director are sub) a superclass subclass relationship is one-to-one. This means that we can have inheritance, where we can reference the primary key of the superclass and borrow that and put that in each subclass. That way we can keep track of when we refer to a particular musician that is playing a trumpet or a tuba trombone, we can refer back to the parent organization, the superclass, and find out their first name their last name their date of birth all of that information does not have to be replicated or made redundant in the particular instrument entity. We can borrow and inherit. That is perhaps the most important rule to remember is it we are borrowing the primary key from the superclass it becomes the primary key in the subclass.
1. Describe the benefits of taking a systematic approach to database development.
a. Systematic approach is a repeatable process. i. Increases efficiency. ii. Able to improve process with each iteration. b. Organizational learning and documentation means you can teach it to somebody else, also contributes to time efficiency, and dynamic resource allocation c. Allows designers to see if process is heading in the wrong direction easily d. control of data redundancies by integrating the files so multiple copies of the same data are not filed we improve over time because we know where we are in the developer process. this goes back to organizational learning where we read documents and assess them. e. We have the ability to do repetitive steps more efficiently 'knowing where we are' in regards to the development process we can ask things like are we ahead of schedule? are we behind? f. we have an abundance of documentation and can use these to facilitate communication with all stakeholders Characteristics: Repeatable process Documented Phases (tasks, subtasks) Measurable Deliverables (whether successful or not) Clear boundaries between phases End Result of employing a systematic process: we know where we are at all times (whether behind schedule, under budget, struggling with a deliverable all in relation to previous processes). We can dynamically reallocate resources as needed (we can recover from mistakes). Ultimately we learn/get better over time.
22. What purpose does GROUP BY perform? Explain how the GROUP BY clause works. What is the difference between the WHERE and HAVING clauses?
a. The GROUP BY clause is used with the SELECT statement to make a group of rows based on the values of a specific column or expression. The AGGREGATE function can be used to get summary information for every group and these are applied to individual group. It groups the data from the SELECT table(s) and produces a single summary row for each group GROUP BY bundles rows of data against an aggregate function. Aggregate functions include MIN(), MAX(), AVG(), COUNT(), SUM(), STDEV(). We can also group by a column (such as CollegeName) if we are trying to aggregate rows from a different table.Difference between WHERE and HAVING are WHERE goes before GROUP BY and HAVING is applied after (specifically against the result set from the aggregate function) EXAMPLE: "What are the number of deliveries by dock in the last 6 days? 'SELECT DockName, COUNT(PD.ProductDeliveryID) AS NumberOfDeliveriesFROM DOCK D JOIN PRODUCT_DELIVERY PD ON D.DockID = PD.DockIDWHERE TimeIN > (SELECT GetDate() - 6)GROUP BY DockNameORDER BY NumberOfDeliveries DESC GROUP BY clause A query that includes the GROUP BY clause are called grouping columns. When GROUP BY is used, each item in the SELECT list must be single-valued per group. Further, the SELECT clause may only contain: Columns names Aggregate functions Constants An expression involving combinations of the above. All column names in the SELECT list must appear in the GROUP BY clause unless the name is used only in the aggregate function. Contrary untrue: there may be column names in the GROUP BY clause that do not appear in the SELECT list. When the WHERE clause is used by the GROUP BY, the WHERE clause is applied first, then groups are then formed from the remaining rows that satisfy the search condition. Conceptually SQL performs the query as follows: SQL divides the staff into groups according to their respective brand numbers. Within each group, all staff have the same branch number. From each group, SQL computes the number of staff members and calculates the sum of the values in the salary column to get the total of their salaries. SQL generates a single summary row in the query result for each group. Finally, the result is sorted in ascending order of branch number, branchNo. WHERE clause (Row Selection) Purpose?: Restricts rows that can be achieved from a SELECT statement. How?: WHERE key word followed by a search condition that specifies the row to be retrieved. The five basic predicates (search conditions) are: Comparison: Compare the value of one expression to the value of another expression. Range: Test whether the value of an expression falls within a specified range of values. Set membership: Test whether the value of an expression equals one of a set of values. Pattern match: Test whether a string matches a specific pattern. Null: Test whether a column has a NULL (unknown) value. HAVING clause (Restricting Groups) The HAVING clause is designed for use with the GROUP BY clause to restrict the groups that appear in the final result table. Although similar in syntax, HAVING and WHERE serve different purposes. The WHERE clause filters individual rows going into the final result table. HAVING filters groups going into the final result table. Column names used in the HAVING clause must also appear in the GROUP BY list or be contained within an aggregate function. In practice, the search condition in the HAVING clause always includes at least one aggregate function, otherwise the search condition could be moved to the WHERE clause and applied to individual rows. (Remember, aggregate functions cannot be used in the WHERE clause). The HAVING clause is not a necessary part of SQL, any query expressed using a HAVING clause can always be rewritten without the HAVING clause.
14. How does the normalization process make for a better-designed database? - maybe flushing out? (says what, not why or how)
a. The normalization process aims to group attributes into relations to minimize data redundancy. If that is achieved, updates to the data stored in the database are achieved with a minimal number of operations, thus reducing the number of opportunities for data redundancy. Additionally, there will be a reduction in the file storage space required by the base relations thus minimizing costs. Reducing redundancies will also help eliminate update anomalies. b. Think about the consequences of not normalizing a database.
3. Explain the benefits of the Relational database design when compared to Hierarchical design.
○ -down, modeled from file-based paper systems. ○ Process of managing and sorting paper data. ○ Very inefficient, prone to data duplication ○ Reports takes very long to create; must anticipate reports ○ Advantages - Slide set 1 # 68ish ➢ Relatively Easy to understand ● Conceptually similar to File Cabinet ● Easy to migrate to electronic system ○ Business processes did not have to change much ➢ Relatively easy to develop and start ● Many entities or objects were similar structures ● Easy to expand data storage folders ○ Disadvantage ➢ Lousy performance ➢ EXPENSIVE ➢ Specialized skills required to obtain info ➢ difficult to modify ➢ Difficult to account for complex relationships ➢ Prone to anomalies and bad data Relational: ○ Don't need files on disk. ○ Easy to scale and deal with sets of data, sort and manage LOTS of data really fast. ○ Uniquely identify ○ Relatively easy to understand ➢ similar conceptually to 'real world' (entities & relationships) ○ Able to improve and optimize business processes ○ Efficiency in implementation ○ reports created quickly, can store commonly accessed reports as views. ○ Advantages - Slide set 1 #85ish ➢ Easy to obtain information quickly ● Simple conceptually allows non-experts to access data, unlike hierarchical which required experts ➢ Efficiency in data storage ● 3rd normal form suggests no redundancy ● Guaranteed data integrity Advantages: Hierarchical Relatively easy to understand - similar to the file cabinet system, easy to transition to electronic system so businesses did not have to change anything or hire extra people. Ex. Managing inventories in a restaurant chain Relatively easy to develop and start - entities/objects were similar to the businesses originally, easy to expand data storage Disadvantages: Hierarchical Slow/lousy performance with large data collection - inefficient and bloated data Extremely expensive - mainframes, and large number of employees Specialized skills - experts needed, slowed businesses down, electrical and computer engineers required Difficult to modify - application itself has to be modified to make changes, replicating was easy Difficult for complex relationships - businesses couldn't be efficient, hard to replicate real world relations Anomalies + bad data - multiple location of data, parent child relationship (each child has one parents) Redundant data, inefficiencies, inconsistent data Advantages: Relational Model Eliminated almost all previous problems, records type has no explicit owner, have all possible relations Relatively easy to understand - similar to real world concepts - entities and relationships Improve and optimize business processes - can model complex relations Quick to obtain information - experts not required, anyone can access data, decisions became faster, business modeling and analyzing trends became more efficient Efficient data storage - 3NF ensures no redundancy (less footprint, physical storage is less expensive), data integrity guaranteed Efficient implementation - ANSI has rules - platform independence, don't need to hardcode the physical location for the files, can physically move them without impacting database, easy to customize - adding new columns, affordable Scalability - structure allows for large data sets, easy to archive data. Ex. Quick stream activities for Microsoft - editing database, doesn't slow down activity Efficiency in data storage and standards developed Data storage (collection), data retrieval (presentation), data security (protection) Computer language that all systems use - SQL, dialect might differ Normalization - design methodology Data independence - data location not hard coded, column order is not significant
1. Cite and describe the high-level objectives of the major phases of database development.
● Conceptual: Brainstorm. Exhaust potential objects, stakeholder analysis, NO WRONG IDEAS. We want to brainstorm business objectives, look at reports/user groups/focus groups, we want to understand from the user-perspective how the data will be captured/presented. THERE (boom) ● Logical: This is the blueprint. Normalize attributes and entities (3NF, 4NF is fine), model the database, validate transactions for future operations. ● Physical: Choose the platform, watch the syntax, decide what hardware, how much to buy, installation needs. ○ platform is the first thing we should consider (window, os), indexes, file, storage capacity (unique) Implementation, choose the platform, watch the syntax, and decide what hardware, how much to buy, and installation needs. Things that we should consider: ■ consider platform(Windows, Linux, MySQL) ■ Storage capacity ■ Field storage CONCEPTUAL: Understand the business from the perspective of database users / stakeholdersInterviewsObservations Evaluation of artifacts (reports, spreadsheets, external documents)Brainstorming objects of the business; Deliverable is conceptual diagram; Dataflow Diagram is often another deliverable LOGICAL: refine the conceptual diagram to make sure the attributes we discovered are indeed in the correct entity.IncludesNormalization processAssign PK/FKValidate transactions (make sure each critical process such as placing an order or registering a student can be completed).Deliverable is logical ERD PHYSICAL: We finally worry product/vendor specifics oFile structure oSecurity measures oCloud components oApplication configuration settings oThe schema code oIndexes oLogin processes
Explain what the instructor means by the phrase 'organize or die'.
● Organization is power. Systematic process and documentation allows you to specialize over time. ● Allows you to do your work better, which means better results and more advancement. ● Specialization leads to innovation and optimization (examples?) ○ He used several examples from his anthropology background during class related to the development of farming and animal husbandry. ○ More modern example could be the switch to assembly line manufacturing. Each worker only assembles one or two parts of each product and becomes faster and more at doing that one task. ○ Amazon is a great example for this concept. From collecting data on user purchases they are able to recommend their customers products, and are able to generate more revenue than brick-and-mortar stores, and will 'live' longer as a result. System - a group of interacting or interrelated components Organized process that has an input and produced a predictable output Helps us understand complex relations and see trends ex. ATM - input card (debit/credit) output cash requested. If coffee came out, that would be unpredictable, so not a system Goal of a system is to obtain a repeatable process Efficiency - iron out mistakes Optimize over time - teach people skills Organizational learning - system with a web component and we buy something, there will more likely be a database that is organized. Over time, there will be efficiency and optimization. This will lead to innovation, learning the best ways for things and expertise. Organize or Die - how did certain societies excel? The ones that build systems, organizing and analyzing human behavior. Ex. Farming - It's a system, learning when to plan seeds/put pesticides, harvest etc. Over time, you can become an expert. Advantages Competitive advantage - organizations can better compete - specialization of labor, emergence of experts, innovation and process optimizations, people become interchangeable resources. Ex. Starbucks vs. Tully's Need systems for these ^ Ex. Nordstrom - how do they find trends? They do computer modeling/surveys/research. Understand the price/fabric/patterns Ex. Risk assessment - actuarial science - how do they know insurance premiums/what prices to charge. How much are businesses are worth? Ex. Starbucks vs. Tully's - Starbucks spends a lot of time researching where to put stores. They leverage data. Amazon - builds profiles on people and put in suggestions. Recognize obscure patterns. Databases allow these competitive advantages - massive amounts of data, explore theoretical scenarios, analyze trends. Understand information = better decision making.
2. Explain the purpose of associative entities and provide an example.
● Special entity that is also a relationship, account for specific instances, maintain transactional integrity, reduces disk footprint. (Basically many-to-many relationship) ○ For example, allows for intersection between class and student. ○ Create greater efficiency and relational ability, and accuracy ○ Can account for specific instances, overlaps, and maintain referential integrity What is it? An associative entity bridges the many to many relationship by taking the primary keys of each table and foreign keys. Example? There are two entities, one being student another being class. One student can take many classes and one class can be taken by many students. That is a many to many relationship. You create an associative entity to bridge. Why do we care? What is the concern? So What? Relational data theory does not allow for many to many relationships as there will be many NULL values. How? There will be NULL values because you would be adding a bunch of rows trying to modify the table to fit but you are just creating a restriction for yourself. Therefore you create an associative entity that bridges these two entities.