IS Unit 4
What is Transform?
"Clean and reformat the data according to requirements of the target database." Data transformation is frequently the most complicated part of the ETL process in terms of processing time. Before data transformation can occur, the raw data must become accessible for the database. During the transformation step in ETL, the data is cleansed and encoded to ensure data quality. Data that is extracted from a source database and that does not need to be transformed is loaded directly or direct-moved to the data warehouse. Transportation is the process of moving data from one system to another. In a data warehouse, the most common requirements for transportation are moving data from a source to a data warehouse, from a staging database to a data warehouse, or from a data warehouse to a data mart. Most data warehouses use direct-path loading for performance reasons because basic data transformation can occur during the loading process.
What are 3 ways to maintain data integrity within a database?
1) Limiting the risk of data destruction (through permissions or 'hardening' the data). 2) Enforcing specific formatting rules. 3) Maintaining explicit access control policies.
What are the 5 rules for relational databases?
1) The order of tuples (lists of elements) and attributes is not important. 2) Every tuple is unique: for every record, there is an attribute that differentiates it from every other tuple. 3) Cells cannot contain more than one value. 4) All values of an attribute must come from the same domain, such as name, date, age, etc. (i.e., if the attribute is name, a date cannot be entered into that cell). 5) Table names and attribute names must be unique—no two tables can have the same name in a database.
What are 3 factors to consider when defining big data?
1) Volume 2) Variety 3) Velocity
What is a Clickstream database model?
A clickstream application records where on the screen a user clicks while browsing the internet. These clicks are recorded and saved in web server log files. These data, if dumped into a database, can be "mined" for information, which could lead to the discovery of user intent. This user intent could then be used for targeted marketing.
What is a Data Mart?
A data mart is a smaller subset of a data warehouse, typically focused on a single area or function. A data mart is a subset of the data warehouse that is accessible by a specific department or strategic business unit. The basic characteristics of data marts and data warehouses are that they are organized by subject or function, they use OLAP, and they are multidimensional, integrated, nonvolatile, and time variant.
What is Data Mining?
Data mining is the process of identifying hidden patterns and relationships in large databases and applying rules to that data in order to predict future behavior. Data mining provides insights into organizational data that cannot be obtained with OLAP. The information that can be obtained using data mining includes associations, classifications, clustering, forecasting, and sequences. Data mining can be thought of as a business looking at its own data in order to improve performance. When data is analyzed, or mined, patterns and relationships emerge, the analysis of which leads to information. This information can lead to the discovery of trends or insights, which can be leveraged by the business. One data mining methodology is decision trees, which generate rules and classify data sets. Another, more complex, data mining methodology is artificial neural networks, which are nonlinear algorithmic models; this approach can be described as machine learning or advanced algorithms. A third type of data mining is rule induction, a technique based on statistical significance.
What is Variable/Dynamic Data?
Data that moves and changes all the time.
What is Static Data?
Data that stays the same, like in a data warehouse.
What is GIS?
Geographic information systems. Using connections to numerous databases containing data concerning locations, crimes, taxes, traffic, votes, and environmental and life science data, geospatial analysis can help government administrators to understand demographic trends, population density changes, attribute concentrations (e.g., poverty, crime, or disease outbreaks), and quite a number of other important and factual indicators. With these analyses, administrators are better equipped to act, redirect and redistribute funds, change ordinances, enact legislation, and predict future changes. Popular examples include ESRI ArcView and Autodesk's GIS analysis software.
How does 'Data Properly Recorded' maintain security?
How data is recorded may be critical to the operation of a business. Inputting data consistently, using the same descriptions, standards, units of measurement, and formats, can mean the difference between success and failure. In 1999, NASA mission managers lost the $125 million Mars Climate Orbiter spacecraft due to a data input error: metric units of measurement were expected, but imperial (English) units were supplied instead.
How does 'Data Properly Retained and Destroyed' maintain security?
Even when data is no longer being used, there may be laws or policies that govern how long it should be kept. Though data can be onerous or unwieldy to store, especially in printed form, the potential for an audit by the company, by its regulators, or by another external group can be damaging should the data not be available. Financial, accounting, and medical information is often kept for seven years. Other information, such as procedures and other intellectual property, as well as legal documentation, may be kept for much longer.
What is 2NF?
In a table, all nonprimary key fields must be dependent on the primary key. In other words, any other field in the video order table must depend on the full key: the video order number, video number, and date of order.
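A brief sketch in Python (using the standard-library sqlite3 module) of what moving a table into 2NF could look like; the video_title field is a hypothetical example of a field that depends on only part of the composite key.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Not in 2NF: video_title depends only on video_number, not on the full
# composite key (order_number, video_number, date_of_order).
con.execute("""CREATE TABLE video_order_with_titles (
    order_number INTEGER, video_number INTEGER, date_of_order TEXT,
    video_title TEXT,
    PRIMARY KEY (order_number, video_number, date_of_order))""")

# 2NF: move the partially dependent field into its own table keyed by video_number.
con.execute("CREATE TABLE video (video_number INTEGER PRIMARY KEY, video_title TEXT)")
con.execute("""CREATE TABLE video_order (
    order_number INTEGER, video_number INTEGER, date_of_order TEXT,
    PRIMARY KEY (order_number, video_number, date_of_order))""")
```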
What is SCM?
Supply chain data can be collected, mined, and analyzed for trends between periods or over time to help a business ascertain its corporate buying power, assess changes in the supply chain, understand changes in prices and quantities ordered, and recognize potential areas of process failure. Data may also be analyzed by other enterprise applications, such as ERP, CRM, HRM, MRP, or MRP2. Popular examples of SCM include IBM Cognos SCM and SAP's SCM.
What is the purpose of Normalization?
The process of determining how data will be categorized within a database in order to avoid redundant or unnecessary duplication of data and anomalies (inconsistent data). Small, adaptive data structures are developed in order to organize data efficiently. It involves correctly defining keys, tables, columns, and relationships. Flowcharting during the design phase can help avoid inconsistencies in the database. The two objectives of data normalization are eliminating redundant data and ensuring that data dependencies are related. Data dependency means that information stored in a database table determines other information stored in the same table. The purpose of the normalization process is to design efficient and functional databases by storing data in unique, logical locations.
What is Referential Integrity?
The process of establishing relationships between tables using keys. Referential integrity prevents database inconsistency by creating relationships between tables using primary and foreign keys. Data inconsistency is prevented by referential integrity rules that ensure the relationships of data between tables are consistent. Referential integrity rules prevent a user from adding a record to a table containing a foreign key unless there is a matching record in the linked table. Changes made to the primary table can be propagated to the linked tables through a referential integrity technique known as cascading update. A foreign key value in a table can only exist if an equal value exists in the referenced table. Tables in databases form relationships using keys, and referential integrity means that for an entity or object to exist in a table, an equal value must exist in the referenced table.
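A hedged sketch of referential integrity in Python using the standard-library sqlite3 module; the customer and video_order tables are illustrative. The foreign key rule rejects a row that has no matching customer, and ON UPDATE CASCADE propagates a primary key change to the linked table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FK rules only when enabled

con.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE video_order (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id) ON UPDATE CASCADE)""")

con.execute("INSERT INTO customer VALUES (1, 'Ada')")
con.execute("INSERT INTO video_order VALUES (100, 1)")        # OK: customer 1 exists

try:
    con.execute("INSERT INTO video_order VALUES (101, 99)")   # no customer 99
except sqlite3.IntegrityError as e:
    print("Rejected:", e)   # referential integrity blocks the orphan row

# Cascading update: changing the primary key propagates to the referencing table
con.execute("UPDATE customer SET customer_id = 2 WHERE customer_id = 1")
print(con.execute("SELECT customer_id FROM video_order").fetchone())  # (2,)
```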
What is a Cloud-based database model?
The typical web-based relational database configuration includes a web server, an application server, and a database server. In contrast with a machine-based relational database, a web-based database uses a flexible model in which data is distributed among several machines, often in a cloud-computing format. Web-based data systems require a connection to the internet and a web browser. The advantages of a web- or cloud-based database are that the model is more flexible, it is both easy and inexpensive to add internet access to the database, data can be stored in multiple locations, and the database can handle large volumes of structured and unstructured data. A key advantage of web-based database systems is that they allow users to access data from anywhere, making them useful for remote or mobile staff. A web-based database also has the ability to separate content from presentation. One disadvantage of cloud-based databases is that they are less functional than machine-based databases. Because web-based databases are connected to the internet, security is another significant issue. Companies often use the internet to make some internal databases available to customers or partners. Although companies such as Microsoft, Apple, and Amazon offer cloud-computing databases for a fee, some larger organizations maintain their own cloud-based databases. MySQL is an example of a cloud-computing data management tool.
What is 3NF?
This is a further step to reduce redundancy and ensure data integrity between tables. All fields in a table must depend on the key and nothing but the key. If a field in the table does not rely on the key, it must be moved to a different table.
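A short illustrative sketch of 3NF using Python's sqlite3 module; customer_city is a hypothetical field that depends on customer_id rather than on the order number, so it is moved to the customer table.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Not in 3NF: customer_city depends on customer_id, not directly on the key
# (order_number); table and field names are illustrative assumptions.
con.execute("""CREATE TABLE video_order_with_city (
    order_number INTEGER PRIMARY KEY, customer_id INTEGER, customer_city TEXT)""")

# 3NF: the transitively dependent field moves to the table whose key it depends on.
con.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_city TEXT)")
con.execute("CREATE TABLE video_order (order_number INTEGER PRIMARY KEY, customer_id INTEGER)")
```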
What is Presentation Area?
Two key elements that make up the data warehouse are the presentation area and the staging area. The presentation area (or presentation layer) contains the information that reaches end users, typically in some type of report format. The data staging area consists of all of the elements that operate between the operating system and the data presentation area and includes the following: the source layer, the extraction layer, the ETL layer, the data storage layer, the presentation layer, and the metadata layer.
What is Staging Area?
Two key elements that make up the data warehouse are the presentation area and the staging area. The data staging area consists of all of the elements that operate between the operating system and the data presentation area and includes the following: the source layer, the extraction layer, the ETL layer, the data storage layer, the presentation layer, and the metadata layer. Data is extracted from the source into the warehouse by the extraction layer and stored in the staging area until it is cleaned and transformed (two critical steps in the process).
What is One and Only One in Crow's Foot?
Two vertical lines (||) indicate one and only one (mandatory); for example, a record relates to one and only one customer. The name inside the rectangle indicates the name of the table.
What is Velocity?
Velocity is the speed required to process or analyze big data. Big data analytics must be performed at high speed with powerful computers. This is referred to as the high-velocity requirement.
How does 'Data Properly Stored' maintain security?
Where data is stored can have a lasting effect on an organization's success. Data stored on systems not controlled by the organization, or in a way that gives inappropriate users access, may contribute to a heightened risk of loss or theft. If an engineer were to leave the company and take the data or figures stored on their home computer to their next employer, it could have a very damaging impact on the organization. In addition, storing the same data in multiple locations, even within the same database, can be especially problematic.
What is a Database?
a set or collection of data that is centrally located in an electronic format, organized in tables, and easily searchable. Databases support a real-time environment and avoid data redundancies. A search is performed in a database through the use of a query.
What is a Data Dictionary?
defines and manages the data types and structures that dictate how the database is organized
What is Data Administration?
includes maintaining data integrity (e.g., recovering corrupted information) and managing user and group access to the data
What is Data Retrieval?
using queries, data is retrieved and combined
What is Sequences in data mining?
Sequences are events that take place in a specific order, like a customer purchasing a GPS device after purchasing an automobile. One event or transaction takes place before the other.
What are Health Analytics Tools?
Smartphone use is very popular today. Also popular are apps that track the health of their users through smartphone sensors or wearable peripheral devices. Data from these devices is collected, analyzed, and used to help a user to understand, improve, and maintain, or automatically report, their health to their health professionals. Examples include Jawbone's Up Application and the Apple Watch.
What is Forecasting in data mining?
Forecasting analyzes a series of values in order to predict or forecast the value of other variables. An example would be a utility company analyzing weather patterns in order to predict energy consumption on a given day.
What is a Foreign Key?
Foreign keys are used to create relationships between two tables. A common field that exists in multiple tables that is used to form a relationship (for example a customer might be assigned a specific sales representative; the customer's record might contain a sales rep ID that would link to the Sales Rep table)
What is Extract?
"Get the data from various sources as needed." Extraction, in the ETL process, performs various tasks, but the key task is to grab the data from business operational and analytical data and prepare it for the transfer process. First, the business must identify the data source that has the critical data to be warehoused. Second, a consolidation algorithm must be created that correlates the data with the data warehouse format. Third, one has to create a scrub algorithm to ensure data quality and integrity. Finally, a transfer procedure application is run to update the data warehouse database. Extraction is the operation of removing data from a source for use in a data warehouse. Designing and developing the extraction process is a very difficult task in the ETL process. The source systems may be complex and improperly documented, making it difficult to determine what data should be extracted. The extraction process used depends on the business needs, the amount of data to be extracted, and the source system. Data can be extracted using full extraction or incremental extraction. Full extraction involves removing all data at one time, while incremental extraction involves removing data periodically, typically since the last extraction date. The extraction phase of the ETL process consists of several tasks, the most important of which is to extract the data from business operational and analytical data and prepare it for the transfer process. Extraction copies the defined objects of source databases but does not copy the entire database and, therefore, is not part of a data warehouse back up process.
What is Load?
"Load the reformatted data into the targeted data warehouse." The load phase in the ETL process is the actual transfer of the new data to update the data warehouse. The ETL process runs in parallel because the extraction process can be lengthy, so data that are ready to load are loaded in parallel or at the same time.
What is a Data Warehouse?
A data warehouse is a system in which a collection of business data from multiple systems is used for data analysis, mining, and reporting. Data warehouses are central libraries of data that are combined from one or more sources and organized by subject to support organizational managers. Three main types of data warehouse applications are information processing, analytic processing, and data mining. The basic characteristics of data warehouses are that they are organized by subject or business function, use online analytical processing (OLAP), and are integrated, nonvolatile, time-variant, and multidimensional.

The typical data warehouse uses staging, data integration, and access layers to contain key functions. The staging layer stores raw data that has been extracted from various data systems. The integration layer consolidates the various data sets by transforming the staged data and storing it in an operational data store (ODS). The data is sorted into ordered groups, and the access layer assists users with data retrieval.

A data warehouse is a collection of historical business transaction data, coupled with other, external market data or information. A data warehouse can be thought of as the foundation for a business to derive business intelligence from its operational data. A key reason for implementing a data warehouse is to enable storage of data from multiple sources to be queried and analyzed in order to help with management decision-making. Data warehouses store transactional and analytical data for reporting and analysis purposes. A data warehouse can have various purposes, including historical data analysis, but it also can combine current, timely data for analysis. Data warehouses, when queried, can produce business intelligence that helps decision-making. In summary, data warehouses can be described as repositories of data from several sources that are primarily used for reporting and analysis in order to help management decision-making.

A data warehouse is an integrated collection of data that is used to assist management in decision making. The data warehouse extracts historical and real-time data from one or more data stores or information systems within an organization. The basic characteristics of data marts and data warehouses are that they are organized by subject or function, they use OLAP, and they are multidimensional, integrated, nonvolatile, and time variant. The primary components of data warehouses are the source systems, the data quality, the data governance, the metadata, and the users.

Two key elements that make up the data warehouse are the presentation area and the staging area. The data staging area consists of all of the elements that operate between the operating system and the data presentation area and includes the following: the source layer, the extraction layer, the ETL layer, the data storage layer, the presentation layer, and the metadata layer. Data is extracted from the source into the warehouse by the extraction layer and stored in the staging area until it is cleaned and transformed (two critical steps in the process). Logic is applied to the data in the ETL layer to transform the data from transactional to analytical. The data is stored in the storage layer, and the logic rules are stored in the logic layer. The information about the stored data is stored in the metadata layer. The presentation layer contains the information that reaches end users and is in some type of report format.
The requirements should be determined when developing a data warehouse. These include what business rules will be applied to the data, the logical model for storage in the warehouse, the query profiles, and the system sources that will provide the data. Regardless of the confidentiality or sensitivity of information, an information policy ensures the standardization and accuracy of the data in the data warehouse during and after data warehouse implementation.
What is a Table?
A database contains one or more lists, known as tables of data, that are organized by columns, rows, and cells. A collection of related records (for example, a set of customer records would all be contained in one table)
What is a Decision Tree?
A decision tree is a tool used in decision analysis and is another tool used to represent the entity-relationship model. Visio is a vector drawing tool for decision trees and flow charts, used to outline processes that could help develop the entity-relationship model. In entity-relationship models, pointers, or linked lines, are used to signal relationships. Cardinality refers to the numeric relationship between two entities. Crow's foot notation is often used to describe the relationship between entities. Circles are used in decision trees to represent chance nodes; in crow's foot notation, a circle indicates zero (an optional relationship), one bar indicates a one-to-one relationship, two bars represent one and only one, and a crow's foot indicates many. Bold font can be used to indicate a field must contain data and cannot be left blank.
What is a Flat File database model?
A flat file is a type of database model that resembles file cabinet functionality.
What is an Entity?
A person, item, event, etc., for which an organization collects data (an example of an entity would be customer or student)
What is a Primary Key?
A primary key is the field, or set of fields, that uniquely identifies a record (such as part number or student ID). A field that uniquely identifies a record, such as customer ID or student ID.
What is a Schema?
A schema is a planned design structure of the database and involves creating the definitions behind the tables and data constraints. The complete definition of a database, including the tables, records, fields, and relationships.
What is a Query?
A search is performed in a database through the use of a query.
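A small illustration of a query using Python's sqlite3 module; the customer table and its sample rows are made up for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
con.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, "Ada", "London"), (2, "Grace", "Arlington")])

# A query searches the database for records that match the given criteria.
for row in con.execute("SELECT name FROM customer WHERE city = ?", ("London",)):
    print(row)   # ('Ada',)
```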
What is 1NF?
A table is in first normal form if there are no repeating groups of data. This sometimes means combining fields to make a unique primary key. If the video table includes order number, video number, and date, these three fields may have to be combined into the primary key in order to avoid having a repeating group. An example of a repeating group would be if each record had only the order number and the video number: if a customer ordered two videos in one order, there would be two records with the same order number. Combining the two fields as one primary key eliminates this redundancy.
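A brief sketch of the composite primary key idea in Python's sqlite3; the table and values are illustrative. The combined key allows one order to contain several videos while still keeping every record unique.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Order number alone cannot be the primary key, because an order with two videos
# would repeat the same order number. Combining order_number and video_number
# into a composite primary key makes every record unique.
con.execute("""CREATE TABLE video_order (
    order_number INTEGER,
    video_number INTEGER,
    date_of_order TEXT,
    PRIMARY KEY (order_number, video_number))""")

con.execute("INSERT INTO video_order VALUES (1, 501, '2024-07-04')")
con.execute("INSERT INTO video_order VALUES (1, 502, '2024-07-04')")  # same order, different video: allowed
try:
    con.execute("INSERT INTO video_order VALUES (1, 501, '2024-07-05')")  # duplicate key: rejected
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```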
What is the Principle of Least Privilege?
An organization may apply a policy similar to the principle of least privilege, which suggests that to avoid fraud, abuse, and theft, personnel are given only enough access to complete tasks within their job.
What is a Record?
Records are related information separated by columns or fields and aligned in rows (e.g., one name and address is one record, another name and address is a different record). A set of related fields, also called a tuple, that describe one occurrence of an entity, such as a customer. Tuples are ordered lists of values and are used for grouping and organizing data. Tuples can be used to represent related data that belong together, such as student records, and are used in database relational models.
What is a Relational database model?
Relational databases are one of the most common database models in use today and are based upon the relational model of data, which is a method of structuring data using relationships, in rows and columns. Object-oriented and relational databases can be both centralized or distributed.
What is Clustering in data mining?
Clustering classifies events or entities when no known group exists for the data.
What is the CIA triad?
Confidentiality, Integrity, Availability
What is CRM?
Customer data can be collected, mined, and analyzed for trends between periods, in the aggregate, or by using tools to drill down into specific departments or divisions within an organization. Analysis might produce examples of success, areas of concern or opportunity, profitable customers, successful product lines, or acceptable profit margins. Data may also be analyzed by other enterprise applications (e.g., SCM, HRM, MRP, or MRP2), thanks to connections to an enterprise resource planning (ERP) system. Popular examples of CRM include Salesforce, PeopleSoft CRM, and Microsoft Dynamics CRM.
What is Associations in data mining?
Associations are data that are linked to a specific event, such as how frequently customers purchase jelly when they purchase peanut butter.
What is Classifications in data mining?
Classifications look for patterns that indicate customer or entity behavior in order to target marketing efforts.
What is EMS?
Energy management systems. Uses for these include reducing our carbon footprint and reliance on foreign fuel, and, in particular, reducing the cost of doing business in a changing economy. Data from energy systems has been collected, analyzed, and used to control and predict for decades. Today, with vastly improved data mining and analytics tools, applications for saving energy come in both large and small packages. They help both large businesses and individual homeowners to track and monitor their energy usage and facilitate both manual and automatic changes to indoor climate and usage schedules based on historic and projected operating patterns. Examples include Google's Nest Labs Nest automated thermostat and C3 Energy's Smart Grid Analytics.
How does 'Data Properly Retrieved' maintain security?
How data is retrieved and who is able to retrieve it are questions central to the design of a database. Controlling access and ensuring that data is updated and/or combined using appropriate and accurate methods helps to maintain the integrity, quality, and security of the information provided.
What is ETL?
Extract, Transform, and Load. Extraction, transformation, and loading (ETL) is a process in which data is extracted from source systems, transformed into a format appropriate for the data warehouse, and then transferred (loaded) into the data warehouse. ETL refers to the data identification, consolidation and scrubbing, formatting, and transfer to the target data warehouse. Because data warehouses consist of data from several systems with different formats, the ETL process must be carefully designed to ensure data integrity during the load process. Good backup strategies are required during the ETL process, but this is mostly to ensure continuity, not manageability. To improve database manageability and performance, a database partition strategy must be considered. A key challenge in maintaining a data warehouse database is homogenizing the data loaded from various business systems during the ETL process. Large databases will require significant effort in data homogenization in order to ensure effective ETL. Data from the various sources that feed the data warehouse database will cause data integrity and quality issues if not checked before the load phase in the ETL process.
What are Source Systems?
Extraction is the operation of removing data from a source for use in a data warehouse. Designing and developing the extraction process is a very difficult task in the ETL process. The source systems may be complex and improperly documented, making it difficult to determine what data should be extracted. The extraction process used depends on the business needs, the amount of data to be extracted, and the source system. The primary components of data warehouses are the source systems, the data quality, the data governance, the metadata, and the users.
What is Data Quality?
Extraction, in the ETL process, performs various tasks, but the key task is to grab the data from business operational and analytical data and prepare it for the transfer process. First, the business must identify the data source that has the critical data to be warehoused. Second, a consolidation algorithm must be created that correlates the data with the data warehouse format. Third, one has to create a scrub algorithm to ensure data quality and integrity. During the transformation step in ETL, the data is cleansed and encoded to ensure data quality. The primary components of data warehouses are the source systems, the data quality, the data governance, the metadata, and the users.
What is Data Governance?
Information policies and information management practices include principles, policies, and processes within the sphere of data governance. Data governance is concerned with the management of data, including maximizing quality through effective utilization, availability, and protection. Because data is merely unorganized and unstructured information, practices and policies may be created and applied generally. Even before designing a database, administrators must agree and plan on how data will be collected, stored, maintained, and retrieved so that faulty data may be corrected before it is input. Challenges include whether data is properly recorded, stored, retrieved, and retained and destroyed. The primary components of data warehouses are the source systems, the data quality, the data governance, the metadata, and the users. Policies and procedures are established that define the data governance program, such as who has access and who has update capabilities. A consistent data hygiene process should be maintained: having clean data starts when the database is created by including database field (column) controls called validity checks. Data mapping helps in the data management process; mappings are extra notes about the definition of a field, its data, and its use. Segmenting data (who has access, who has update capabilities) is also an important aspect of the data management process.
What are the 4 normal forms?
Normalization assures elimination of data redundancy and logical dependencies by using the four normal forms: 1NF, 2NF, 3NF, and BCNF. The normalization process consists of steps or forms, ranging from the first normal form (1NF) to the Boyce-Codd Normal Form (BCNF), which is the form or state in which all redundancy has been removed.
What is a OLAP database model?
OLAP stands for "online analytical processing" and refers to analytical, fixed databases containing data used for making decisions. OLAP is a tool that enables users to analyze data in order to make management decisions; it is a business intelligence tool that is useful for management reporting, budgeting, and forecasting. The term OLAP is a derivative of the term online transaction processing (OLTP); whereas an OLTP database management system enables businesses to create, update, and delete records to provide real-time information, OLAP supports analysis for decision-making. A disadvantage of OLAP is that processing the data during ETL can be complex, leading to lengthy loading and updating. The core of any OLAP system is a cube, giving OLAP a multidimensional aspect. The cube consists of numerical facts called measures that are categorized by dimensions and are located at the intersections of the cube. Multidimensional databases are typically relational databases that store data in a cube, which contains aggregated data related to all dimensions.
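A rough stand-in for a cube slice using Python's sqlite3: region and month act as dimensions and the summed amount is the measure; the sales_fact table and its rows are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_fact (region TEXT, month TEXT, product TEXT, amount REAL)")
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    ("East", "2024-01", "Video", 120.0),
    ("East", "2024-01", "Game", 80.0),
    ("West", "2024-01", "Video", 95.0),
])

# Each GROUP BY combination is one intersection of the cube: the dimensions are
# region and month, and the measure is the aggregated sales amount.
for row in con.execute(
    "SELECT region, month, SUM(amount) FROM sales_fact GROUP BY region, month"
):
    print(row)   # e.g. ('East', '2024-01', 200.0)
```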
What is a OLTP database model?
OLTP stands for "online transaction processing" and is a database design used for browsing and manipulating business transaction activity. An online transaction processing (OLTP) operational database can be used to gather real-time transactional or operational data, such as inventory on hand.
What is a Object-oriented database model?
Object databases are used when media or other types of files are required because multimedia files and diagrams or blueprints such as those generated by computer aided design (CAD) are more complex and not easily sorted by columns and rows. Both object-oriented and relational database models are similar, except that an object-oriented database contains mixed media records. Object-oriented databases are not hierarchical models. Object-oriented and relational databases can be both centralized or distributed.
What are Physical Data Models?
One or more physical data models may exist within the logical data model, containing greater detail, such as database tables and key indexes.
What is One to Many in Crow's Foot?
One vertical line combined with a crow's foot indicates one or more (mandatory); for example, a student enrolls in one or more courses.
What is a Field?
The categories in a database are known as fields (e.g., name, address, dollar amount). An attribute of an entity, such as name, address, or student ID.
What is None to Many in Crow's Foot?
The circle and the crow's foot indicate none to many (zero to many, optional). In other words, one customer can order no videos, one video, or many videos. Another example is the optional relationship between an advisor and an instructor.
What are Conceptual Data Models?
The conceptual data model is the highest level model, containing the least amount of detail within the data. Conceptual data models map the scope of the enterprise data architecture model and help support system and documentation requirements. This model is typically used in developing master reference data, such as in an enterprise-wide database.
What is Availability?
The data needs to be available to all users that the data owners deem necessary.
What is Confidentiality?
The data needs to be kept confidential from those who do not need access to it.
What is Integrity?
The data needs to be protected from unauthorized changes to ensure its integrity.
What is ERD?
The entity-relationship diagram is an essential starting point in database design. The data components (known as entities) have various properties (known as attributes) and relate to each other in various ways (known as relationships). Master data consists of business objects or values possessing a unique value that are shared across the entire enterprise. Master data can be used in all types of databases, is not sorted into tables, and is key to accurate business reporting and analysis. The entities, attributes, and relationships are mapped in a diagram in order to determine how to properly design the database. Relationships in entity-relationship diagrams (ERDs) can be defined or described as actions that link entities or things together. Three hierarchical levels may exist in the development of entity-relationship diagrams: 1) conceptual data models, 2) logical data models, and 3) physical data models. An entity-relationship model is used by businesses to map process requirements and data in order to develop a supporting relational database.
What are Logical Data Models?
The logical data model contains more detail than a conceptual data model, often defining operational and transactional entities for the database. One or more physical data models may exist within the logical data model, containing greater detail, such as database tables and key indexes.
What is Metadata?
The primary components of data warehouses are the source systems, the data quality, the data governance, the metadata, and the users. The data staging area consists of all of the elements that operate between the operating system and the data presentation area and include the following: the source layer, the extraction layer, the ETL layer, the data storage layer, the presentation layer, and the metadata layer. The information about the stored data is stored in the metadata layer.
What is Data Management?
the creation, update, and deletion of the data within the database