Statistics Chapter 1 (Bentley)
U.S. Census Bureau - What data do they have available?
Economic indicators, foreign trade, health insurance, housing, sector-specific data.
Elements
The entities on which data are collected.
Data
The facts and figures collected, analyzed, and summarized for presentation and interpretation.
Data Mining Applications
The major applications of data mining have been made by companies with a strong consumer focus such as retail, financial, and communication firms. Data mining is used to identify related products that customers who have already purchased a specific product are also likely to purchase (and then pop-ups are used to draw attention to those related products). As another example, data mining is used to identify customers who should receive special discount offers based on their past purchasing volumes.
Statistics (The Study of Statistics)
The methodology of extracting useful information from a data set.
Statistical Inference
The process of using data obtained from a sample to make estimates and test hypotheses about the characteristics of a population.
Sample Statistics
The sample result.
Observation
The set of measurements obtained for a particular element.
Qualitative Variable
Uses labels or names to identify the distinguishing characteristic of each observation. Examples: race, profession, type of business, and the manufacturer of a car.
Two Main Ways to Collect Sample Data
1. Cross-Sectional Data 2. Time Series Data
Two Branches of Statistics
1. Descriptive Statistics 2. Inferential Statistics
Three Essential Steps for Good Statistics
1. Find the right data (complete and lacking misrepresentation) 2. Use the appropriate statistical tools (which depend on the data at hand) 3. Clearly translate numerical information into written language.
4 Major Categories of Data Measurement
1. Nominal (Qualitative) 2. Ordinal (Qualitative) 3. Interval (Quantitative) 4. Ratio (Quantitative)
Two Main Reasons for Being Unable to Use Population Data
1. Obtaining information on the entire population is expensive. 2. It is impossible to examine every member of the population.
Process of Statistical Inference (Example)
1. Population consists of all tune-ups. Average cost of parts is unknown. 2. A sample of 50 engine tune-ups is examined. 3. The sample data provide a sample average parts cost of $79 per tune-up. 4. The sample average is used to estimate the population average. -and around again-
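The four steps above can be sketched in Python. This is only an illustration under stated assumptions: the population is simulated (a real population average would be unknown), and the distribution parameters, seed, and variable names are hypothetical rather than taken from the tune-up example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Step 1: hypothetical population of parts costs (in dollars) for 10,000
# tune-ups. These values are simulated; in practice they are unknown.
population = [random.gauss(80, 12) for _ in range(10_000)]

# Step 2: examine a sample of 50 tune-ups.
sample = random.sample(population, 50)

# Step 3: the sample data provide a sample average (a sample statistic).
sample_mean = statistics.mean(sample)

# Step 4: the sample average is used to estimate the population average.
# The true value is computable here only because the population is simulated.
population_mean = statistics.mean(population)

print(f"sample mean:     {sample_mean:.2f}")
print(f"population mean: {population_mean:.2f}")
```

The two printed values will usually be close but not equal; that gap is the sampling error that inferential methods try to quantify.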
Variable
A characteristic of interest that differs in kind or degree among various observations. The general characteristic being observed on a set of people, objects, or events, where each observation varies in kind or degree. Characterized as either qualitative or quantitative.
Discrete Variable
A quantitative variable. Assumes a countable number of values. Examples: number of children in a family, number of points scored in a basketball game, etc. Note: may assume an infinite number of values, but these values are countable (can be presented in a sequence x1,x2,x3,...)
Continuous Variable
A quantitative variable. Characterized by uncountable values that are within a certain interval. Examples: weight, height, time, and investment return. Note: In practice, however, continuous variables may be measured in discrete values.
Sample
A subset of a particular population. In most statistical applications, this type of data is what we must rely on to make inferences about various characteristics of the population.
Quantitative Variable
A variable that assumes meaningful numerical values. Categorized as either discrete or continuous.
Applications in Business and Economics (Accounting, Economics, Finance, Marketing, and Production)
Accounting: Public accounting firms use statistical sampling procedures when conducting audits for their clients. Economics: Economists use statistical information in making forecasts about the future of the economy or some aspect of it. Finance: Financial advisors use price-earnings ratios and dividend yields to guide their investment advice. Marketing: Electronic point-of-sale scanners at retail checkout counters are used to collect data for a variety of marketing research applications. Production: A variety of statistical quality control charts are used to monitor the output of a production process.
Data Set
All the data collected in a particular study.
Data Mining
Analysis of the data in the warehouse might aid in decisions that will lead to new strategies and higher profits for the organization. Using a combination of procedures from statistics, mathematics, and computer science, analysts "mine the data" to convert it into useful information. The most effective data mining systems use automated procedures to discover relationships in the data and predict future outcomes, ...prompted by only general, even vague, queries by the user.
Federal Reserve Economic Data (FRED) - What data do they have available?
Banking, business/fiscal data, exchange rates, reserves, monetary base.
The Interval Scale
Can categorize and rank the data and are assured that the differences between scale values are meaningful. Main drawback is that the zero point of an interval scale does not reflect a complete absence of what is being measured.
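A short Python sketch can illustrate why the arbitrary zero point matters. Temperature is used here as an assumed example of interval data (it is not named in these notes), and the function and variable names are hypothetical.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9

hot_f, mild_f, cool_f = 80.0, 60.0, 40.0
hot_c = fahrenheit_to_celsius(hot_f)
mild_c = fahrenheit_to_celsius(mild_f)
cool_c = fahrenheit_to_celsius(cool_f)

# Differences are meaningful: two equal gaps in Fahrenheit
# remain equal to each other after converting to Celsius.
gap_high_c = hot_c - mild_c
gap_low_c = mild_c - cool_c
print(abs(gap_high_c - gap_low_c) < 1e-9)  # equal gaps stay equal

# Ratios are not meaningful: the ratio depends on the unit chosen,
# because neither zero point means "no temperature".
ratio_f = hot_f / cool_f
ratio_c = hot_c / cool_c
print(ratio_f, ratio_c)  # the two ratios disagree
```

Because the ratio changes with the unit, a statement such as "80 degrees is twice as hot as 40 degrees" has no fixed meaning on an interval scale.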
Statistics
Can refer to numerical facts such as averages, medians, percents, and index numbers that help us understand a variety of business and economic situations. Can also refer to the art and science of collecting, analyzing, presenting, and interpreting data.
Data Warehousing
Capturing, storing, and maintaining the data. This is a significant undertaking because organizations obtain large amounts of data on a daily basis by means of magnetic card readers, bar code scanners, point-of-sale terminals, and touch screen monitors.
Sample Survey
Collecting data for a sample.
Census
Collecting data for the entire population.
Data Mining Model Reliability
Finding a statistical model that works well for a particular sample of data does not necessarily mean that it can be reliably applied to other data. With the enormous amount of data available, the data set can be partitioned into a training set (for model development) and a test set (for validating the model). There is, however, a danger of overfitting the model to the point that misleading associations and conclusions appear to exist. Careful interpretation of results and extensive testing are important.
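The partition into a training set and a test set can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the helper name `train_test_split`, the 25% test fraction, the seed, and the placeholder records are all assumptions.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (training, test)."""
    shuffled = list(records)               # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatability
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder data: 100 record IDs standing in for real observations.
records = list(range(100))
training_set, test_set = train_test_split(records)

print(len(training_set), len(test_set))
```

The model would be developed on `training_set` only and then checked against `test_set`; a large drop in performance on the test set is one sign of overfitting.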
finance.yahoo.com - What data do they have available?
Historical stock prices, mutual fund performances, international market data
Commonly Published Statistics from the Government
Inflation, unemployment, and gross domestic product (GDP).
Bureau of Labor Statistics (BLS) - What data do they have available?
Inflation rates, unemployment rates, employment, pay and benefits, spending and time use, productivity.
Bureau of Economic Analysis (BEA) - What data do they have available?
National and regional data on gross domestic product (GDP) and personal income, international data on trade in goods and services.
The New York Times, USA Today, The Wall Street Journal, The Economist, and Fortune - What data do they have available?
Poverty, crime, obesity, and plenty of business-related data
espn.com - What data do they have available?
Professional and college teams' scores, rankings, standings, individual player statistics
zillow.com - What data do they have available?
Recent home sales, home characteristics, monthly rent, mortgage rates
Time Series Data
Refers to data collected by recording a characteristic of a subject over several time periods. Contains values of a characteristic of a subject over time.
Cross-Sectional Data
Refers to data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time. Contains values of a characteristic of many subjects at the same point or approximately the same point in time.
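The difference between cross-sectional and time series data can be illustrated by slicing one small table two ways. The store names, years, and sales figures below are made up for the sketch.

```python
# Hypothetical records: (subject, year, sales in $ thousands).
records = [
    ("Store A", 2021, 120), ("Store A", 2022, 135), ("Store A", 2023, 150),
    ("Store B", 2021, 90),  ("Store B", 2022, 100), ("Store B", 2023, 110),
    ("Store C", 2021, 80),  ("Store C", 2022, 85),  ("Store C", 2023, 95),
]

# Time series: one subject followed over several time periods.
time_series = [(year, sales) for subject, year, sales in records
               if subject == "Store A"]

# Cross-sectional: many subjects observed at the same point in time.
cross_section = [(subject, sales) for subject, year, sales in records
                 if year == 2023]

print(time_series)
print(cross_section)
```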
Inferential Statistics
Refers to drawing conclusions about a large set of data (called a population) based on a smaller set of sample data. This branch is where most of the phenomenal growth in statistics has occurred.
Descriptive Statistics
Refers to the summary of important aspects of a data set. Includes collecting data, organizing data, and then presenting the data in the form of charts and tables. Numerical measures are often calculated to summarize, for instance, the data's typical value and the data's variability. Today, these techniques are the most visible application of statistics.
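As a small illustration of numerical measures for a typical value and for variability, the sketch below uses Python's standard `statistics` module. The data values are hypothetical.

```python
import statistics

# Hypothetical data set: units sold in each of eight months.
monthly_sales = [12, 15, 11, 18, 15, 14, 20, 15]

# Measures of the data's typical value.
typical_mean = statistics.mean(monthly_sales)
typical_median = statistics.median(monthly_sales)

# A measure of the data's variability (sample standard deviation).
spread = statistics.stdev(monthly_sales)

print(f"mean:   {typical_mean}")
print(f"median: {typical_median}")
print(f"stdev:  {spread:.2f}")
```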
The Ordinal Scale
Reflects a stronger level of measurement than the nominal scale, but a weaker one than the interval and ratio scales. Able to categorize and rank the data with respect to some characteristic or trait. Weakness is that we cannot interpret the difference between the ranked values because the actual numbers used are arbitrary. (Differences between categories are meaningless.)
The Nominal Scale
Represents the least sophisticated level of measurement. Can just categorize or group the data. The values in the data set differ merely by name or label. Often substitute numbers for the particular qualitative characteristic or trait that we are grouping (for ease of exposition).
The Ratio Scale
Represents the strongest level of measurement. Have all the characteristics of interval data as well as a true zero point, which allows us to interpret the ratios of values. Used to measure many types of data in business analysis. Examples: sales, profits, inventory, etc.
Data Mining Requirements
Statistical methods such as multiple regression, logistic regression, and correlation are heavily used. Also needed are computer science technologies involving artificial intelligence and machine learning. A significant investment in time and money is required as well.
Some Government Agencies that Publish Lots of Economic and Business Data
The Bureau of Economic Analysis (BEA), the Bureau of Labor Statistics (BLS), the Federal Reserve Economic Data (FRED), and the U.S. Census Bureau.
Population Parameter
What sample statistics are used to estimate (because it is generally not feasible to obtain population data and calculate the relevant parameter directly, due to prohibitive costs and/or practicality).
Population
A large set of data. All members of a specified group (not necessarily people). All items of interest in a statistical problem.