STATS Midterm

NoSQL

"Not Only SQL" ; a non-relational database that supports the storage of a wide range of data types including structured, semi-structured, and unstructured data Offers the flexibility, performance, and scalability needed to handle extremely high volumes of data

Approximation formula for the width of each interval

(maximum - minimum) divided by the number of intervals
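
A minimal sketch of the width formula above. The function name and the sample values are hypothetical, chosen only for illustration; in practice the result is often rounded up to a convenient number.

```python
def interval_width(values, num_intervals):
    """Approximate interval width: (maximum - minimum) / number of intervals."""
    return (max(values) - min(values)) / num_intervals

# Hypothetical sample data, for illustration only.
prices = [12, 18, 25, 31, 44, 52, 60]
width = interval_width(prices, 6)  # (60 - 12) / 6
print(width)
```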

Knowledge

A blend of data, contextual information, experiences, and intuition that can be applied and put into action in specific situations

Data warehouse

A central repository of data from multiple departments within an organization. One of its primary purposes is to support managerial decision making. Usually organized around subjects such as sales, customers, or products that are relevant to business decision making.

Database

A collection of data logically organized to enable easy retrieval, management, and distribution of data

HyperText Markup Language (HTML)

A mark-up language that uses tags to define its data in web pages

JavaScript Object Notation (JSON)

A popular alternative to XML in recent years; a standard for transmitting human-readable data in compact files

Population vs Sample

A population consists of all observations or items of interest in an analysis. A sample is a subset of the population. We examine sample data to make inferences about the population.

Composite primary key

A primary key that consists of more than one attribute. We use a composite primary key when none of the individual attributes alone can uniquely identify each instance of the entity.

Data management

A process that an organization uses to acquire, organize, store, manipulate, and distribute data

Frequency distribution for numerical variables

A series of intervals that follow these guidelines: 1) the intervals are mutually exclusive; 2) the total number of intervals usually ranges from 5 to 20; 3) the intervals are exhaustive (they cover the entire sample); 4) the intervals are easy to recognize and interpret
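
One way to build such a distribution is sketched below. It assumes equal-width intervals of the form [lower, upper), which is only one possible convention; the function name and the ages are hypothetical.

```python
def frequency_distribution(values, start, width, num_intervals):
    """Count observations in consecutive, equal-width intervals [lower, upper)."""
    counts = [0] * num_intervals
    for v in values:
        index = int((v - start) // width)
        # Exhaustive: every observation must land in exactly one interval.
        if not 0 <= index < num_intervals:
            raise ValueError(f"{v} falls outside the intervals")
        counts[index] += 1
    return counts

# Hypothetical sample data.
ages = [21, 25, 34, 38, 42, 47, 49, 55, 58, 63]
counts = frequency_distribution(ages, start=20, width=10, num_intervals=5)
for i, count in enumerate(counts):
    lower = 20 + 10 * i
    print(f"{lower} up to {lower + 10}: {count}")
```

Because each value maps to a single index, the intervals are mutually exclusive, and the range check enforces exhaustiveness.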

Instance

A single occurrence of an entity

Data mart

A small-scale data warehouse or a subset of the enterprise data warehouse that focuses on one particular subject or decision area

Database Management System

A software application for defining, manipulating, and managing data in databases (examples: Oracle, SQL Server, Access)

Ordinal scale

Able to both categorize and rank the data with respect to some characteristic or trait; the values are ranked, but the differences between them are arbitrary. Typically expressed in words and then coded into numbers (example: hotel reviews classified as 1-5 stars).

Interval scale

Able to categorize and rank the data as well as find meaningful differences between observations (example: temperature). The value of zero is arbitrarily chosen.

Population

All elements of interest

Primary key

A special type of attribute that uniquely identifies each instance of the entity (example: Customer_ID is the primary key for CUSTOMER because each customer has a unique ID number). Often used to create a data structure called an index for fast data retrieval and searches.

Discrete variable

Assumes a countable number of values

Change in analytic professionals

Become more self-reliant and possess the necessary skills for data wrangling and data analysis; no longer relying on IT department; requires a broader skill set than just statistical and data mining techniques

Business Analytics

Broad topic, encompassing statistics, computer science, and information systems with a wide variety of applications in marketing, human resource management, economics, finance, health, sports, politics, etc.

Relative Frequency

Calculated by dividing the frequency by the sample size; the proportion of observations in each category
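
A short sketch of this calculation using Python's standard library. The survey responses are hypothetical.

```python
from collections import Counter

# Hypothetical categorical observations.
responses = ["yes", "no", "yes", "yes", "no", "maybe", "yes", "no"]

frequency = Counter(responses)   # number of observations per category
n = len(responses)               # sample size
relative_frequency = {category: count / n for category, count in frequency.items()}

for category, share in relative_frequency.items():
    print(category, frequency[category], share)
```

The relative frequencies always sum to 1, which is a quick check that every observation was counted once.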

Big Data

Catch-phrase term meaning a massive volume of both structured and unstructured data that are extremely difficult to manage, process, and analyze using traditional data-processing tools

Three common approaches for transforming categorical data prior to analysis

Category reduction, dummy variables, and category scores

Continuous variable

Characterized by uncountable values within an interval (weight, height, time, investment return)

Dummy variables

Used when converting categorical variables into numerical variables; a dummy variable takes on the value 1 or 0 to describe two categories of a categorical variable

Data

Compilations of facts, figures, or other contents, both numerical and nonnumerical

Data, Information, and Knowledge

Data are compilations of facts, figures, or other contents, both numerical and nonnumerical. Information is a set of data that are organized and processed in a meaningful and purposeful way. Knowledge is derived from a blend of data, contextual information, experience, and intuition.

Information

Data that have been organized, analyzed, and processed in a meaningful and purposeful way

Foreign key

Defined as a primary key of a related entity (example: because Customer_ID is the primary key of the CUSTOMER entity, which shares a relationship with the ORDER entity, it is considered a foreign key in the ORDER entity)
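
The CUSTOMER and ORDER example can be sketched with Python's built-in sqlite3 module. The table and column names are assumptions for illustration; the order table is called `orders` here because ORDER is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """CREATE TABLE orders (
           order_id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
       )"""
)
conn.execute("INSERT INTO customer VALUES (1, 'Ana')")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # customer 1 exists, so this is accepted

try:
    conn.execute("INSERT INTO orders VALUES (101, 99)")  # no customer 99
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the foreign key constraint blocked the orphan order

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
conn.close()
print(rejected, order_count)
```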

Three types of analytics

Descriptive: what happened? Predictive: what could happen in the future? Prescriptive: what should we do?

Range

Difference between the maximum and the minimum observations of a variable

Line chart

Displays a numerical variable as a series of data points connected by a line; especially useful for tracking changes or trends over time

Percentile

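The pth percentile divides a variable into two parts: approximately p percent of the observations are less than the pth percentile, and approximately (100 - p) percent are greater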
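
A rough illustration of splitting a variable at a percentile, using `statistics.quantiles` from the standard library. The "inclusive" interpolation method is an assumption; textbooks and software use several slightly different percentile conventions, and the data are hypothetical.

```python
import statistics

# Hypothetical data: the integers 1 through 101.
values = list(range(1, 102))

# quantiles(..., n=100) returns the 99 cut points for percentiles 1 to 99.
cut_points = statistics.quantiles(values, n=100, method="inclusive")
p90 = cut_points[89]  # 90th percentile

below = sum(v < p90 for v in values)
above = sum(v > p90 for v in values)
print(p90, below, above)
```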

Unstructured data

Does not conform to a predefined, row-column format; usually textual or containing multimedia components

Fixed-width format

Each column starts and ends at the same place in every row; actual data are stored as plain text characters in a digital file

Delimited format

Each piece of data is separated by a delimiter, most commonly a comma; other characters, such as tabs, are also used
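
A small sketch of reading delimited text with Python's csv module. The records are hypothetical, and the tab-delimited variant is included only to show that the delimiter can be a character other than a comma.

```python
import csv
import io

# The same hypothetical records, once comma-delimited and once tab-delimited.
comma_text = "id,name\n1,Ana\n2,Ben\n"
tab_text = "id\tname\n1\tAna\n2\tBen\n"

rows_comma = list(csv.reader(io.StringIO(comma_text)))
rows_tab = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))

print(rows_comma)
print(rows_comma == rows_tab)
```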

Skewness

Extremely high or low values of skewed variables significantly inflate or deflate the average of the entire data set, making it difficult to detect meaningful relationships with skewed variables. A popular mathematical transformation that reduces skewness in data is the natural logarithm transformation. Another transformation to reduce data skewness is the square root transformation.
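
The effect of the two transformations can be checked on a small example. This is a sketch with hypothetical incomes, and the skewness coefficient used here (the average cubed z-score) is one common definition, assumed for illustration.

```python
import math
import statistics

def skewness(values):
    """Average cubed z-score; positive values indicate a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical positively skewed data: one very large value.
incomes = [30, 35, 40, 45, 50, 400]

log_incomes = [math.log(x) for x in incomes]    # natural logarithm transformation
sqrt_incomes = [math.sqrt(x) for x in incomes]  # square root transformation

print(skewness(incomes), skewness(sqrt_incomes), skewness(log_incomes))
```

Both transformations require suitable inputs: the logarithm needs strictly positive values and the square root needs nonnegative values.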

True or False: raw data offers a lot of value and insights

False - in order to extract value from data, we need to be able to understand the business context, ask the right questions of the data, identify appropriate analysis models, and communicate the resulting information in verbal and written language.

Entity

Generalized category to represent persons, places, things, or events about which we want to store data in a database table

Bar chart for categorical variable

Graphical representation of a frequency distribution, in which the height of each bar is equal to the frequency or the relative frequency of the corresponding category

Histogram

Graphical representation of frequency distribution for numerical variables

Entity-relationship diagram

Graphical representation used to model the structure of data

Scatterplot

Graphical tool to examine the relationship between two numerical variables; each point represents a paired observation for the two variables

Stacked column chart

Graphically shows information from a contingency table. Allows for the comparison of composition within each category.

Two important data preparation techniques

Handling missing values and subsetting data

Distinction between JSON and XML

JSON format is not as verbose as the XML format, making data files smaller in size. JSON format supports a wide range of data types not readily available in XML format. Parsing JSON data files is faster and less resource intensive.
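
The verbosity and data-type points can be illustrated with one hypothetical record written both ways. This is a sketch on a tiny example, not a benchmark, so it says nothing about parsing speed.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical customer record in both formats.
json_text = '{"customer": {"id": 7, "name": "Ana", "active": true}}'
xml_text = "<customer><id>7</id><name>Ana</name><active>true</active></customer>"

record = json.loads(json_text)["customer"]
root = ET.fromstring(xml_text)

# JSON carries typed values; XML element text arrives as a string.
json_id = record["id"]        # int
xml_id = root.find("id").text  # str

print(len(json_text), len(xml_text))
print(type(json_id).__name__, type(xml_id).__name__)
```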

Nominal scale

Least sophisticated level of measurement. Categorizes or groups the data. Values differ merely by name or label.

Frequency Distribution for categorical variable

Makes categorical variables more manageable and easier to assess. Groups the data into categories and records the number of observations that fall into each category. The relative frequency for each category equals the proportion of observations in each category.

Machine-generated data

Manufacturing sensors, speed cameras, web server logs

Measures of Central Location

Mean, median, mode

Types of numerical descriptive measures

Measures of central location: find a typical value for the data. Measures of dispersion: gauge the underlying variability of the data. Measures of shape: reveal symmetry and tails. Measures of association: show whether there is a linear relationship.

Median

Middle value of a data set. Useful because the mean can give a misleading description of the center when outliers are present.

Category Scores

Most appropriate if the data are ordinal and have natural, ordered categories. Recode the categories numerically. Assumes equal increments between the category scores. Example: customer satisfaction surveys, ranked 1-5 with each number representing a satisfaction level.
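
A minimal sketch of recoding ordered categories into scores. The category labels and responses are hypothetical.

```python
# Hypothetical ordered categories, scored with equal increments of 1.
SATISFACTION_SCORES = {
    "very dissatisfied": 1,
    "dissatisfied": 2,
    "neutral": 3,
    "satisfied": 4,
    "very satisfied": 5,
}

responses = ["satisfied", "neutral", "very satisfied", "dissatisfied"]
scores = [SATISFACTION_SCORES[r] for r in responses]
print(scores)
```

The equal spacing is an assumption built into the mapping: it treats the step from 1 to 2 as the same size as the step from 4 to 5.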

Relational database

Most common type of database used in organizations today. Consists of one or more logically related data files, often called tables or relations. Each data file is a two-dimensional grid that consists of rows and columns.

Mode

Most frequently occurring observation of a variable. A variable may have no mode or more than one mode. The mode is the only meaningful measure of central location for a categorical variable.

Structured query language (SQL)

Most popular query language. A language for manipulating data in a relational database using relatively simple and intuitive commands. Basic structure: SELECT, FROM, WHERE.
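
The SELECT, FROM, WHERE structure can be tried out with Python's built-in sqlite3 module. The table, columns, and rows below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ana", "Boston"), (2, "Ben", "Denver"), (3, "Cy", "Boston")],
)

# SELECT picks the columns, FROM names the table, WHERE filters the rows.
rows = conn.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY customer_id",
    ("Boston",),
).fetchall()
conn.close()
print(rows)
```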

Relationship in a scatterplot

Negative linear relationship: points cluster along a line with a negative slope. Positive linear relationship: points cluster along a line with a positive slope. Nonlinear relationship: the points follow a curve (for example, as x increases, y increases at a faster rate). No relationship: no apparent pattern.

Measurement Scales

Nominal and ordinal (categorical); interval and ratio (numerical). The measurement scale guides the choice of techniques for summarizing and analyzing variables.

Why do we rely on sampling data?

Obtaining information on the entire population is expensive. It is impossible to examine every member of the population.

Two common strategies for dealing with missing values

Omission and imputation

Relationship

An association between entities that represents certain business facts or rules; can be one-to-one, one-to-many, or many-to-many

RFM Analysis

Popular marketing technique used to identify high-value customers, based on recency, frequency, and monetary variables (days since the last purchase, number of orders, and monetary value of purchases)
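
A sketch of computing the three RFM variables per customer. The order records, the reference date, and the function name are hypothetical; real RFM analyses usually go on to score or rank customers on each variable, which is not shown here.

```python
from datetime import date

# Hypothetical orders: (customer, order date, amount).
orders = [
    ("C1", date(2024, 1, 5), 50.0),
    ("C1", date(2024, 3, 1), 70.0),
    ("C2", date(2023, 12, 20), 200.0),
]
as_of = date(2024, 3, 31)  # assumed analysis date

def rfm_table(orders, as_of):
    """Return {customer: (recency in days, frequency, monetary total)}."""
    summary = {}
    for customer, order_date, amount in orders:
        last, count, total = summary.get(customer, (order_date, 0, 0.0))
        summary[customer] = (max(last, order_date), count + 1, total + amount)
    return {
        customer: ((as_of - last).days, count, total)
        for customer, (last, count, total) in summary.items()
    }

rfm = rfm_table(orders, as_of)
print(rfm)
```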

Types of mean

Population mean = parameter; sample mean = statistic

Notation of mean

The population mean is denoted by the Greek letter μ (mu). The sample mean is denoted by x̄ (x-bar).

Structured data

Conforms to a predefined, row-column format; spreadsheet or database applications are used to enter, store, query, and analyze structured data

Human-generated data

Price, income, retail sales, age, gender

Binning

Process of transforming numerical variables into categorical variables by grouping the numerical values into a small number of groups or bins. Bins must be consecutive and nonoverlapping, so each value falls into one and only one bin. An effective way to reduce noise in the data if we believe that all observations in the same bin tend to behave the same way.
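
A minimal sketch of binning with consecutive, nonoverlapping bins. The bin edges, labels, and ages are hypothetical, and the [lower, upper) boundary convention is an assumption.

```python
def assign_bin(value, edges, labels):
    """Return the label of the bin [edges[i], edges[i+1]) that contains value."""
    for lower, upper, label in zip(edges, edges[1:], labels):
        if lower <= value < upper:
            return label
    raise ValueError(f"{value} is outside all bins")

# Hypothetical age bins.
edges = [0, 30, 60, 120]
labels = ["under 30", "30 to 59", "60 and over"]

ages = [22, 29, 30, 45, 61]
binned = [assign_bin(age, edges, labels) for age in ages]
print(binned)
```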

Business Intelligence

Provides organizations and their users with the ability to access and manipulate data interactively through reports, dashboards, applications, and visualization tools

Categorical variable

Qualitative

Numerical variable

Quantitative

Measures of Dispersion

Range, interquartile range, mean absolute deviation, variance, standard deviation
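
These measures can be computed with Python's statistics module, as sketched below on hypothetical data. The variance and standard deviation shown are the sample versions, and the quartiles use the "inclusive" method; both are assumptions, since conventions differ between textbooks and software.

```python
import statistics

# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(data) - min(data)

q1, _q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

mean = statistics.fmean(data)
mad = sum(abs(x - mean) for x in data) / len(data)  # mean absolute deviation

variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # sample standard deviation

print(value_range, iqr, mad, variance, std_dev)
```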

Omission strategy

Recommends that observations with missing values be excluded from the analysis. Also called complete-case analysis. Appropriate when the amount of missing values is small or concentrated in a small number of observations.

Cross-Sectional Data

Refer to data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time (examples: NBA wins/losses over a season; recorded grades of students in a class; sale prices of homes)

Time Series Data

Refer to data collected over several time periods focusing on certain groups of people, specific events, or objects (examples: hourly body temperature; daily price of a stock in the first quarter)

Imputation strategy

Replaces missing values with some reasonable imputed values, such as the mean or median. For categorical variables, it is common to impute the most predominant category. In the presence of outliers, it is preferred to use the median instead of the mean to impute missing values.
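
A small sketch of imputation, assuming missing values are stored as `None`. The function name and the data are hypothetical.

```python
import statistics

def impute(values, strategy=statistics.median):
    """Replace None with a value computed from the observed entries."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

# Hypothetical salaries with one outlier (300) and two missing values.
salaries = [40, 42, None, 45, 300, None]
median_imputed = impute(salaries)                          # fills with the median
mean_imputed = impute(salaries, strategy=statistics.mean)  # fills with the mean

# For a categorical variable, impute the most predominant category.
colors = ["red", "blue", None, "red"]
mode_imputed = impute(colors, strategy=statistics.mode)

print(median_imputed, mean_imputed, mode_imputed)
```

With the outlier present, the mean fill is pulled far above most observed salaries, while the median fill stays near the typical values.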

First tasks completed in data analysis to gain a better understanding and insight into data

Review and inspect the data for quality and relevance. Count and sort to verify whether the data set is complete or has missing values.

Bubble plot

Shows the relationship between three numerical variables; the third numerical variable is represented by the size of the bubble

eXtensible Markup Language (XML)

Simple language for representing structured data; widely used for sharing structured information between computer programs, between people, and between computers and people. Each piece of data is enclosed in a pair of 'tags' that follow specific XML syntax.

Structure of a data mart

Star schema: a multidimensional data model made up of dimension and fact tables. Dimension table: describes the business dimensions of interest, such as customer, product, location, and time. Fact table: contains facts about the business operation, often in a quantitative format.

Ratio scale

Strongest level of measurement. Has all the characteristics of the interval scale plus a true zero point, which allows us to interpret the ratios between observations (examples: sales, profits, inventory levels, weight, time, distance).

Sample

Subset of a population

Histogram shape of distribution

A symmetric distribution is one that is a mirror image of itself on both sides of its center. If the distribution is not symmetric, then it is skewed. Positively skewed: a long tail that extends to the right reflects the presence of a small number of relatively large values. Negatively skewed: a long tail that extends to the left reflects the presence of a small number of relatively small values.

Data transformation

The data conversion process from one format or structure to another; performed to meet the requirements of statistical and data mining techniques used for the analysis. Examples: date of birth to age; BMI calculation; percentages.

Primary barrier preventing organizations from taking full advantage of business analytics

The inability to clean and organize big data

Data modeling

The process of defining the structure of a database

Subsetting

The process of extracting portions of a data set that are relevant to the analysis; commonly used to pre-process the data prior to analysis. May remove variables that are irrelevant to the problem, variables that contain redundant information, or variables with excessive amounts of missing values.

Data wrangling

The process of retrieving, cleansing, integrating, transforming, and enriching data to support subsequent data analysis. Transforms raw data into a format that is more appropriate and easier to analyze.

Extraction, Transformation, and Load process

Used to integrate data from different databases generated by various business departments: retrieve, reconcile, and transform the data into a consistent format, and then load the final data into a data warehouse

Heat map

Uses color or color intensity to display relationships between variables; useful to identify combinations of the categorical variables that have economic significance

Issues where it makes sense to use category reduction

Variables with too many categories pull down model performance. A variable has some categories that rarely occur. One category clearly dominates in terms of occurrence.

Other characteristics of big data

Veracity: credibility and quality of data. Value: the value derived from big data is perhaps the most important aspect of any analytics initiative.

Three characteristics of big data

Volume: immense amount of data. Velocity: data from a variety of sources are generated at a rapid speed. Variety: data come in all types, forms, and granularity, both structured and unstructured.

Variable

When a characteristic of interest differs in kind or degree among various observations (records)

Rescaling

When the variables in a data set are measured using different scales, the variability can place undue influence on larger-scale variables, resulting in inaccurate outcomes. It is commonplace to rescale the data using either standardization or normalization, especially in data mining techniques.
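
The two rescaling approaches can be sketched as follows. Here standardization is taken to mean z-scores based on the sample standard deviation, and normalization is taken to mean min-max scaling to the range 0 to 1; both are common definitions, assumed for illustration, and the data are hypothetical.

```python
import statistics

def standardize(values):
    """z-score: subtract the mean, then divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical variable measured on a large scale.
incomes = [30_000, 45_000, 60_000, 90_000]

z_scores = standardize(incomes)
scaled = normalize(incomes)
print(z_scores)
print(scaled)
```

After either transformation, an income variable in the tens of thousands and an age variable in the tens end up on comparable scales.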

Category reduction

Where we collapse some of the categories to create fewer nonoverlapping categories. Guideline 1: categories with very few observations may be combined to create an "other" category. Guideline 2: categories with a similar impact may be combined.
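
Guideline 1 can be sketched as below. The threshold, the function name, and the data are hypothetical; the right cutoff for "very few observations" depends on the data set.

```python
from collections import Counter

def reduce_categories(values, min_count, other_label="Other"):
    """Collapse categories with fewer than min_count observations into one."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical categorical variable with two rare categories.
states = ["CA", "CA", "NY", "CA", "NY", "WY", "VT"]
reduced = reduce_categories(states, min_count=2)
print(reduced)
```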

Key distinction between XML and HTML

XML tells us or computer applications what the data are; HTML tells the web browser how to display the data

General rule for creating dummy variables

For a categorical variable with k categories, create k - 1 dummy variables, using the last category as the reference
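
A minimal sketch of the k - 1 rule. The function name, the `is_` prefix, and the categories are hypothetical; the last category listed serves as the reference and is represented by all dummies being 0.

```python
def make_dummies(values, categories):
    """Encode a variable with k categories as k - 1 dummy (0/1) variables.

    The last entry in `categories` is the reference category.
    """
    *encoded, _reference = categories
    return [{f"is_{c}": int(v == c) for c in encoded} for v in values]

# Hypothetical variable with k = 4 categories, so 3 dummy variables.
seasons = ["winter", "spring", "summer", "fall"]
rows = make_dummies(["spring", "fall", "winter"], seasons)
for row in rows:
    print(row)
```

A fourth dummy for the reference category would be redundant, because its value is fully determined by the other three.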

Scatterplot with a categorical variable

Scatterplot that incorporates a categorical variable by using different colors or symbols

Contingency table

Used to examine the relationship between two categorical variables; shows the frequencies for two variables, where each cell represents a mutually exclusive combination of the values
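
A contingency table can be tallied by counting pairs of values, as sketched below. The two variables and the observations are hypothetical.

```python
from collections import Counter

# Hypothetical paired observations: (membership type, purchased?).
observations = [
    ("member", "yes"),
    ("guest", "no"),
    ("member", "no"),
    ("guest", "yes"),
    ("member", "yes"),
]

# Each cell counts one mutually exclusive combination of the two values.
table = Counter(observations)

for membership in ("member", "guest"):
    row = [table[(membership, purchased)] for purchased in ("yes", "no")]
    print(membership, row)
```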

