Exam 1 Review

Give examples to the application of analytics across industries

-Companies analyze consumer behaviors and engagement with their businesses by mining historical data, which helps with service improvement and targeted marketing. -It is used to identify and address areas of strength and weakness, which consequently helps with better planning. Many learning systems use descriptive analytics for analytical reporting: they measure learner performance to ensure targets and training goals are fulfilled. A few examples of how descriptive analytics can be used in the e-learning industry include: -Analyzing assessment grades and assignments -Tracking the use of learning resources -Comparing the test results of learners -Analyzing the time taken by learners to complete the course Other descriptive analytics use cases include: -Use of social media and engagement data (Facebook and Instagram likes) -Summarizing past events such as marketing campaigns and sales -Collating survey results -Reporting general trends

Snowflake schema

-Normalizing the dimension tables to create a multidimensional database in which the dimensional hierarchy is explicitly represented. Snowflake schemas provide a refinement of star schemas: the dimensional hierarchy is explicitly represented by normalizing the dimension tables. This leads to advantages in maintaining the dimension tables.

Aggregating Functions

-the fact table contains numbers that you need to aggregate (for example, in the finance table, you have the amount, which is a monetary value that needs to be aggregated. In a sales fact table, you may have a sales amount and item counts.)

What are the basic phases of Crisp-DM

1. Business understanding - What does the business need? 2. Data understanding - What data do we have / need? Is it clean? 3. Data preparation - How do we organize the data for modeling? 4. Modeling - What modeling techniques should we apply? 5. Evaluation - Which model best meets the business objectives? 6. Deployment - How do stakeholders access the results? https://www.datascience-pm.com/crisp-dm-2/

Pillars of data governance

1. Metadata Management 2. Master data management

Benefits of Metadata

1. consistent understanding of data definitions 2. traceability of data transformation 3. reduced data redundancy 4. save time and effort of tracking down data 5. Ability to identify possible consequences and impacts of any changes to processes

Measures vs Calculated Column

A calculated column is calculated when the model is loaded and is contained in a table. A measure is calculated on the fly as you change the filter context in the report or visual. -A measure is not contained in a table but is merely associated with it. -Calculated columns are precalculated and stored in the model. Measures are calculated when filters are applied to them in the Report view and must be recalculated every time the data context changes. -The more calculated columns you have, the greater the size of your Power Pivot file. The more measures you have and the greater their complexity, the more memory is necessary when you are working with the file.

Time Period - Shifting the date context

A common analysis often employed in data analytics is looking at period-to-date values. For example, you may want to look at sales year-to-date or energy consumption month-to-date. (YTD is common) with time period-based calculations, you can use them to compare past performance with current performance. But first, you need to know how to shift the date context to calculate past performance.

Master Data management

Data that provides context for transaction data -Data that does not change very often (not transactional data; it is much more stable over the longer term) -Data that is tightly controlled, mostly used for lookup, and generally spread across multiple tables rather than just one; refers to dimensional data (data that doesn't change) -Data that almost all organizations rely on

Descriptive analytics

Describes the data you already have: historical data and current data. Answers the question, "What has happened?"

What is the difference between data analytics and data science ?

The difference is in what they do with data. Data analysts: examine large data sets to identify trends, develop charts, and create visual presentations to help businesses make more strategic decisions. IOW: data analysts utilize data to draw meaningful insights and solve problems. Data scientists: design and construct new processes for data modeling and production using prototypes, algorithms, predictive models, and custom analysis. IOW: they estimate the unknown by asking questions, writing algorithms, and building statistical models.

Hierarchies in a Data Model?

Hierarchies are groupings of two or more columns into levels that interactive visuals and charts can drill up and down through. Hierarchies define various levels of aggregation (Ex: it is common to have a calendar-based hierarchy with year, quarter, and month levels. An aggregate like sales amount is then rolled up from month to quarter to year.) -Another common hierarchy might be from department to building to region. You could then roll cost up through the different levels.

how does data quality relate to data governance?

If you don't have data governance, data quality is not an ongoing process; data quality efforts become costly one-off exercises.

Shifting date context

If you want to compare performance from one period to the same period in the past, say sales for the current month to sales for the same month a year ago, you need to shift the date context. -Functions such as PARALLELPERIOD and DATEADD return a set of dates shifted by the interval type you specify.
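The idea of shifting the date context can be sketched outside DAX. This hypothetical Python snippet compares a month's sales to the same month a year earlier; the sales figures and keys are invented for illustration:

```python
# Hypothetical monthly sales keyed by (year, month).
sales = {(2023, 6): 1000.0, (2024, 6): 1250.0}

def same_period_last_year(year, month):
    """Shift the date context back by one year
    (analogous in spirit to DATEADD/PARALLELPERIOD)."""
    return sales.get((year - 1, month), 0.0)

current = sales[(2024, 6)]
prior = same_period_last_year(2024, 6)

print(current - prior)  # year-over-year change
```

The real DAX functions operate on a full date table rather than a lookup dictionary, but the shift-then-compare pattern is the same.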

Row context

Includes values from the other columns in the same row (say, Discount as [SalesPrice]-[Actual Cost]) -Comes into play when you are creating a calculated column. It includes the values from all the other columns of the current row as well as the values of any table related to the row.

Differences between SUM and SUMX

The X functions are used when you are evaluating an expression for each row in a table, not just a single column. SUMX syntax: SUMX(<table>, <expression>) -where table is the table containing the rows to be evaluated and expression is what will be evaluated for each row. Ex: To figure out the total net sales amount, you can take the gross amount minus the cost and sum the result for each row, as in the following formula: SumNet:=SUMX(Sales,[Gross]-[Cost]) SUM: Another way to get the same result is to create a net calculated column first and then use the SUM function on the net column. The difference is that calculated columns are precalculated and stored in the model.
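The SUMX pattern above can be mimicked in plain Python to show the difference between evaluating an expression per row and summing a precalculated column. Column names mirror the DAX example ([Gross], [Cost]); the figures are made up:

```python
# Hypothetical sales rows.
sales = [
    {"Gross": 100.0, "Cost": 60.0},
    {"Gross": 250.0, "Cost": 150.0},
    {"Gross": 80.0,  "Cost": 50.0},
]

# SUMX-style: evaluate Gross - Cost for each row, then sum the results.
sum_net = sum(row["Gross"] - row["Cost"] for row in sales)

# SUM-style: first materialize a "Net" calculated column, then sum it.
for row in sales:
    row["Net"] = row["Gross"] - row["Cost"]   # precalculated and stored
sum_net_column = sum(row["Net"] for row in sales)

print(sum_net, sum_net_column)  # same total either way
```

Both give the same answer; the trade-off is storage (the Net column lives in the model) versus computation at query time.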

Row vs Columnar

Traditional row-based data storage keeps all the data in a row together and is efficient at retrieving and updating data based on the row key, for example, updating or retrieving an order based on an order ID. This is great for the order-entry system but not so great when you want to perform analysis on historical orders (say you want to look at trends for the past year to determine how products are selling). IOW: -Row storage is good because it keeps all the data in the row together and is efficient at retrieving and updating data based on the row key. -But if you want to perform analysis on historical orders (like past trends), then columnar is best to utilize. -Row-based storage also takes up more space by repeating values for each row. -A columnar database stores only the distinct values for each column and then stores the row as a set of pointers back to the column values.
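A rough Python sketch of the two layouts, using hypothetical order data: a row store keeps each record together, while a column store keeps each field contiguous, so a single column can be scanned for aggregation without touching the rest:

```python
# Row-oriented: each record kept together; aggregating one field
# still means walking every record.
rows = [
    {"order_id": 1, "product": "red",  "price": 10.0},
    {"order_id": 2, "product": "blue", "price": 20.0},
    {"order_id": 3, "product": "red",  "price": 30.0},
]
avg_price_rows = sum(r["price"] for r in rows) / len(rows)

# Column-oriented: each field stored contiguously; the price column
# can be scanned on its own without loading order_id or product.
columns = {
    "order_id": [1, 2, 3],
    "product": ["red", "blue", "red"],
    "price": [10.0, 20.0, 30.0],
}
avg_price_cols = sum(columns["price"]) / len(columns["price"])

print(avg_price_rows, avg_price_cols)  # identical results
```

Note the write-side cost of the columnar layout: inserting a new order means appending to every list, which is the "navigate around the data" problem described later in these notes.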

What are the two ways for Data Compression

Two ways of compressing data: run length encoding and hash encoding

Scrub/Cleanse

Uses pattern recognition and AI techniques to upgrade data quality -goal is to correct errors in data values in the source data Fixing errors: misspellings, missing data, duplicate data, etc Also: decoding, time stamping, conversion, error detection, etc.

What is Business Intelligence (BI)?

Using data and information to achieve organization goals

Main issue in data governance

Want to ensure that security is managed, that data is being moved correctly, and that people are complying with privacy regulations. -The goal is that if you are doing business analytics, the analytics being reported are high quality.

Data Context

When the user changes filters, drills down, or changes column and row headers in a matrix, the context changes and the values are recalculated. -Knowing how the context changes and how it affects the results is essential to being able to build and troubleshoot formulas.

In-memory analytics

With in-memory analytics, the data is loaded into the RAM of the computer and then queried. -Results in much faster processing times -Limits the need to store preaggregated values on disk -Used to dramatically improve performance of multidimensional queries -By organizing data by column, the number of disk reads is reduced and the amount of extra data that has to be held in memory is minimized. This greatly increases the overall speed of the computation.

Drawbacks?

You cannot easily update the data because it is now split/spread across different places on your disk.

What is a data model

a data model is made up of tables, columns, data types, and table relations -designed to hold data for a business entity (Ex: customer data-customer table, employee data-employee table)

Prescriptive analytics

advises on possible outcomes and results in actions that are likely to maximize key business metrics. -"what should a business do" -based on OPTIMIZATION to achieve the best outcomes

Date Tables

Aids in comparing values over time. -To use the built-in time intelligence functions in DAX, you need a date table in your model for the functions to reference; it contains all dates included in the period to analyze. -The table needs a distinct row for each day in the date range you are interested in looking at, and each of these rows needs to contain the full date of the day.

What are one of the major goals of data governance

alignment -also ensures that whatever architecture you have is maintained

Fact Table

Contains measures related to the subject of analysis (Ex: UnitsSold). -Measures are numeric and are intended for mathematical computation and quantitative analysis. -Contains foreign keys associating it with dimension tables.

Data quality

degree to which data is 1. accurate 2. complete 3. timely 4. consistent -Most importantly, data quality answers: is it something that you can use, is it relevant, and does it give you all the information you need

Which type is the most basic form of analytics?

descriptive

Multi-field transformation

From many fields to one, or from one field to many -Used when you don't have a primary key to match records, or to remove duplicates (EmpName and TelephoneNo = EmpID on another table)

Single-field transformation

From one field to one field -Translates data from an old form to a new form (EX: temperature in Fahrenheit to temperature in Celsius - algorithmic) Table lookup uses a separate table keyed by the source record code (state code --> state name)
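Both kinds of single-field transformation can be illustrated in Python: an algorithmic conversion (Fahrenheit to Celsius) and a table lookup keyed by a source code. The state codes here are just examples:

```python
# Algorithmic transformation: compute the new value from the old one.
def f_to_c(fahrenheit):
    """Convert a Fahrenheit temperature to Celsius."""
    return (fahrenheit - 32) * 5 / 9

# Table lookup: a separate table keyed by the source record's code.
state_lookup = {"CA": "California", "NY": "New York"}  # hypothetical codes

print(f_to_c(212))          # boiling point
print(state_lookup["CA"])   # code -> full name
```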

Data Lakes

Holds vast quantities of unstructured data for future/potential analytics. A large storage location that can hold vast quantities of unstructured data (a data warehouse, by contrast, stores structured data). -A data lake typically stores all possible data that might be needed for an undefined amount of analysis and reporting, allowing analysts to explore new data relationships. -Usually built on commodity hardware rather than specialized servers; storage is distributed across ordinary machines rather than a centralized server. -Data is kept in its native/raw format for future/potential analytics consumption; schema or structure is applied at time of analysis (schema on read).

Difference between measures and dimensions

numeric measures are the objects of the analysis (Ex: sales, revenue, inventory) and dimensions provide the context for the measure (EX: product, city, date (when the sale was made))

Capture/Extract -static and incremental

obtaining a snapshot of a chosen subset of the source data Static extract: capturing a snapshot of the source data at a point in time Incremental extract: capturing changes that have occurred since the last static extract
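A minimal Python sketch of the two extract styles, assuming each source row carries a last-modified timestamp; the rows and cutoff date are hypothetical:

```python
from datetime import datetime

# Hypothetical source rows, each stamped with its last-modified time.
source = [
    {"id": 1, "modified": datetime(2024, 1, 1)},
    {"id": 2, "modified": datetime(2024, 2, 1)},
    {"id": 3, "modified": datetime(2024, 3, 1)},
]

# Static extract: a full snapshot of the source at a point in time.
static_extract = list(source)

# Incremental extract: only rows changed since the last extract ran.
last_extract = datetime(2024, 1, 15)
incremental_extract = [r for r in source if r["modified"] > last_extract]

print(len(static_extract), [r["id"] for r in incremental_extract])
```

Real systems often use change-data-capture logs instead of timestamps, but the filter-since-last-run idea is the same.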

Complex Event Processing Engines

Support near-real-time BI tasks. -An organizational tool for aggregating, processing, and analyzing massive streams of data in order to gain real-time insights from events as they occur. Applications define declarative queries that can contain operations over streaming data such as filtering, windowing, aggregations, unions, and joins. -Need to handle situations where the streaming data is delayed, missing, or out of order. -One of the challenges for the CEP engine is to effectively share computation across queries when possible.

Differences in traditional DW and Big Data based BI architectures

A traditional DW uses external data sources and utilizes ETL and a relational DBMS (OLAP server and reporting server); a big data system uses CEP and MapReduce engines (data mining and text analytics engines, enterprise search engine). A centralized data warehouse can be too rigid because it is complex to build and to change (relying on the IT department), so... -The BI architecture augments your centralized data warehouse to promote agile data analysis (tools allow business users to tap and manipulate data directly, rather than waiting for IT to write customized reports for them) -Enables sharing and collaboration -Scheduling and automation of data refresh -Can audit changes through version management -Can secure users for read-only and updateable access

Diagnostic analytics

using internal data to understand the "why" behind what happened

Hash encoding

Replacing data types with dictionary and indexes -Hash encoding builds a dictionary of the distinct values of a column and then replaces the column values with indexes to the dictionary. Hash encoding reduces storage space
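A small Python sketch of hash (dictionary) encoding as described above: build a dictionary of the distinct values, then replace each column value with its index into that dictionary. The column data is hypothetical:

```python
def hash_encode(values):
    """Build a dictionary of distinct values and replace each value
    with its index into that dictionary."""
    dictionary = []
    index_of = {}
    indexes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indexes.append(index_of[v])
    return dictionary, indexes

city = ["Boston", "Boston", "Austin", "Boston", "Austin"]
dictionary, indexes = hash_encode(city)

print(dictionary, indexes)
# Decoding reverses the mapping:
print([dictionary[i] for i in indexes])
```

Storage shrinks because each repeated string is held once; the column itself becomes a list of small integers.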

Row database vs Columnar

Row: Say we want to get the sum of ages from the Facebook_Friends data. To do this we need to load all nine of these pieces of data into memory and then pull out the relevant data to do the aggregation -a waste of computing time. Row-oriented databases: 1. Writing to row-oriented DBs is fast (adding new rows) 2. Reading is however slower (say, the average price of products). Column: If we want to add a new record, we have to navigate around the data to plug each column in where it should be. Column-oriented databases: 1. Easy if we need a value for all rows - we can go just across the price column. 2. Harder if we now have to get the average price for just red. 3. Writing is very much slower.

What are the three types of context you need to consider?

Row, query, and filter

Data Marts

Smaller and focused on a particular subject or department; a subset of a data warehouse. -A data mart is dependent on the data warehouse. -Subject specific; data marts usually serve a specific business component, such as finance, marketing, or sales. -They are the end products delivered by the data warehouse and thus contain information for business users. Like a data warehouse, a data mart is a snapshot of operational data to help business users make decisions or strategic analyses. The difference between a data mart and a data warehouse is that data marts are created based on the particular reporting needs of specific, well-defined user groups, and data marts provide easy business access to relevant information.

What are the Three Vs of Big Data

-Volume: quantities of data that reach incomprehensible proportions -Variety: structured, unstructured and semi structured data (images, videos, tweets, etc.) -Velocity: measure of how fast the data is coming in, SPEED

Predictive analytics

Forecasting - what might happen in the future

Load/Index

Place transformed data into the warehouse and create indexes (typically every night) Full load: bulk rewriting of target data at periodic intervals (refresh mode) Delta load: only changes in source data are written to data warehouse (update mode)

Metadata Management

-Consistent understanding of data information (data about data); helps you track down duplicated data and take it out -A technical data dictionary -Data lineage: tells you where the data came from and what transformations were applied to it

Difference between a column database and a regular database **

-Data is generally stored in row sequence (SQL, Oracle). -A column database stores data in columns: since the values are all in sequence, reading is quicker/faster. -Aggregation queries are what we run most often when making a decision, so column databases are good for answering those types of queries because the data is stored in columns.

Why is data governance so important in a data lake?

-Data lake is just a "swamp" without data governance -Data lake challenge is the inability to trust the data -->They need to add context for every piece of data by implementing policy driven processes

What are the different types of data analytics

-Descriptive -Predictive -Prescriptive -Diagnostic

Dimensions

-The dimension tables contain the attributes that you use to categorize and roll up the measures. They contain descriptions of the business to which the subject of analysis belongs. -Used by analysts to answer business questions -Help answer questions about what, when, by whom, and for whom -Dimensional modeling leads to a star schema

What Does Data Governance Do?

About making sure that the data is managed properly 1. improve data quality and reduce redundancy 2. Protect Sensitive Information 3. Ensure data and IT compliance with gov. regulations 4. Encourage correct use of data 5. Platform for robust data analytics

Data Governance

About making sure that the data is managed properly: the organization and implementation of: -policies -procedures -structure -and roles which outline and enforce rules of engagement, decision rights, and accountabilities for the effective management of information assets

Filter Context

Added as part of a formula (you can change it by adding ALL, for example). Filter context is added to the measure using filter constraints as part of the formula. The filter context is applied in addition to the row and query contexts. You can alter the context by adding to it, replacing it, or selectively changing it using filter expressions.

Power Query vs DAX

Power Query brings in and cleans data and is used as the BI ETL tool; DAX is the language used once data is in Power BI to create calculated columns and measures. Data Analysis Expressions (DAX): the language used to create calculated columns and measures in Power BI -Is both a query and a functional language -DAX stores data in columns rather than rows -Measures = calculations

Transform

Convert the format of data from that of the operational system to that of the data warehouse. Record-level: selection, joining, aggregation. Field-level: single field, multi field.

Why create Hierarchies?

Creating hierarchies is one way to increase the usability of your model and help users instinctively gain more value in their data analysis.

Data governance and BI

BI: using data and information to achieve organization goals. Data governance ensures: -alignment -quality -consistency -BI architecture is maintained

ETL process

Extract, transform, load: harmonizing data (data reconciliation). 1. Capture/Extract 2. Scrub/cleanse data 3. Transform 4. Load and Index. The goal is to provide a single, authoritative source of data to support decision making.

Star Schema

The main way that data is stored in a data warehouse -A central fact table surrounded by different dimensions (organized as a star) -Based on questions asked by managers Ex: Which states and counties have the top sales? Which model types sold the most by year?

Why store in a sorted fashion? What advantage does it provide?

Makes it easier to read and allows for certain things to happen: it allows for compression/storing the data in a compact manner -Faster aggregation -Better compression

What is a measure?

Measures: a measure is a DAX calculation that returns a single value that can be used in visuals in reports or as part of calculations in other measures

What are columnar databases

Organizes data by field, keeping all of the data associated with a field next to each other in memory. -provide performance advantages to querying data

near real time BI

Reduced time between when the data is acquired and when it is possible to analyze that data. EX: an airline tracking its most profitable customers - if airline staff can be alerted proactively of any delays a customer might face, they can help ensure that the customer is rerouted. Helps to increase customer loyalty and revenue. -Want to reduce the latency between when operational data is acquired and when analysis over that data is possible.

RLE (run length encoding)

Reduces the number of stored rows by collapsing runs of repeated values, storing each value once with a count of how many times it repeats (most effective on sorted columns).
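Run length encoding can be sketched in a few lines of Python: each run of repeated values collapses to a (value, count) pair. The column data is hypothetical:

```python
def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs,
    so a sorted column stores far fewer entries."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1       # extend the current run
        else:
            runs.append([v, 1])    # start a new run
    return runs

column = ["A", "A", "A", "B", "B", "C"]
runs = run_length_encode(column)

print(runs)
# Decoding restores the original column:
print([v for v, n in runs for _ in range(n)])
```

This is why storing a column in sorted order compresses so well: sorting maximizes run lengths.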

Benefits of Master data

A single source of all master data, managed centrally and disseminated.

What does data governance ensure?

Standards are defined, maintained, and enforced -MDM efforts are aligned to business needs (MDM - processes that ensure that reference data is kept up to date and coordinated across an enterprise)

Data Warehouse

A subject-oriented, integrated, time-variant collection of data in support of management's decisions -Updated periodically (hourly, daily, monthly, etc.) -Optimized for querying and reporting -Expanded using data marts -A pool of data produced to support decision making; also a repository of current (for real-time decision support) and historical data -Key to any medium-to-large BI system Subject oriented: analyzes a particular subject area; for example, "sales" can be a particular subject. Integrated: it must be, because data comes from several OLTP systems and you don't necessarily have the same data definitions. Time variant: because of trends and common analyses; users want to see trends in data. Nonvolatile: cannot be changed; a data store designed for reporting, not transaction processing. Schema on write.

What is CRISP DM?

the CRoss Industry Standard Process for Data Mining -a process model with six phases that naturally describes the data science life cycle --> helps you plan, organize, and implement your data science (or machine learning) project.

Standing queries

The arrival of events in the input stream triggers processing of the query -Computation may be continuously performed as long as events continue to arrive in the input stream or until the query is explicitly stopped.

Query context

the filtering applied to a cell in the matrix. -When you drop a measure into a matrix, the DAX query engine examines the row and column headers and any filters applied. (Each cell has a different query context applied to it and returns the value associated with the context.) -Because you can change the query context on the fly by changing row or column headers and filter values, the cell values are calculated dynamically, and the values are not held in the Power BI model.

Joining

The process of combining data from various sources into a single table or view.

Selection

the process of partitioning records according to predefined criteria (Ex: timestamp)

Aggregation

the process of transforming data from detailed to summary level
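The detail-to-summary idea can be shown with a tiny Python example; the product quantities are hypothetical:

```python
# Detail-level rows: one entry per transaction line.
detail = [
    ("widget", 5),
    ("gadget", 3),
    ("widget", 2),
]

# Aggregate to summary level: total quantity per product.
summary = {}
for product, qty in detail:
    summary[product] = summary.get(product, 0) + qty

print(summary)
```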

Mark as Date Table

Used to create a relationship between the date table and the table that contains the values you want to analyze. -Once you have the table in the model, you need to mark it as the official date table and indicate which column is the unique key. This tells the DAX query engine to use this table as a reference for constructing the set of dates needed for a calculation.

