BA II Chp.1
Variety
Data also come in all types, forms, and granularity, both structured and unstructured. They may include numbers, text, and figures as well as audio, video, emails, and other multimedia elements.
Veracity
In addition to the three Vs, veracity refers to the credibility and quality of data.
Structured data that is machine-generated
includes information from manufacturing sensors (rotations per minute), speed cameras (miles per hour), web server logs (number of visitors), etc.
Structured data that is human-generated
includes information on price, income, retail sales, age, gender, etc.
Unstructured data that is machine-generated
includes satellite images, meteorological data, surveillance video data, traffic camera images, and others.
Unstructured data that is human-generated
includes texts of internal emails, social media data, presentations, mobile phone conversations, text message data, and so on.
Big data
is a catchphrase, meaning a massive volume of both structured and unstructured data that are extremely difficult to manage, process, and analyze using traditional data-processing tools.
eXtensible Markup Language (XML)
is a simple language for representing structured data. It uses markup tags to define the structure of data and is case-sensitive. The format is designed to support readability.
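A minimal sketch of how markup tags give data its structure, using Python's standard `xml.etree.ElementTree` module. The record and its tag names (`customer`, `name`, `age`) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the tag names are assumptions for this example.
xml_text = """
<customer>
  <name>Ana Lopez</name>
  <age>34</age>
</customer>
"""

root = ET.fromstring(xml_text.strip())

# The markup tags tell us which piece of data is which.
name = root.find("name").text
age = int(root.find("age").text)

# XML is case-sensitive: <Name> is not the same tag as <name>,
# so this lookup finds nothing and returns None.
missing = root.find("Name")

print(name, age, missing)
```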
Sample
is a subset of the population. We examine sample data to make inferences about the population.
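A small sketch of the idea, assuming a made-up population of customer ages. The sample is drawn at random, and its mean is used as an estimate of the population mean; the population size, sample size, and seed are arbitrary choices.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 10,000 customers.
population = [random.randint(18, 80) for _ in range(10_000)]

# A sample is a subset of the population, here 200 randomly chosen ages.
sample = random.sample(population, k=200)

# The sample mean serves as an estimate of the population mean.
print("population mean:", mean(population))
print("sample mean:    ", mean(sample))
```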
A Continuous Variable
is characterized by uncountable values within an interval. Examples include weight, height, time, and investment return. In practice, however, continuous variables are often measured in discrete values (e.g., because of rounding).
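The rounding point can be sketched as follows. The weight readings are hypothetical; the scale is assumed to report one decimal place.

```python
# Hypothetical weights in kilograms; the underlying quantity is continuous.
true_weights = [70.2381, 70.2419, 70.2864]

# A scale that reports one decimal place records them as discrete values.
recorded = [round(w, 1) for w in true_weights]

# Two different true weights collapse into the same recorded value.
distinct_recorded = len(set(recorded))
print(recorded, distinct_recorded)
```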
knowledge
is derived from a blend of data, contextual information, experience, and intuition.
Data for any variable can be classified into one of four major measurement scales:
nominal, ordinal, interval, or ratio
cross-sectional data
refer to data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time.
time series data
refer to data collected over several time periods focusing on certain groups of people, specific events, or objects.
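The difference between cross-sectional and time series data can be sketched with two small tables. All firm names and revenue figures below are invented for illustration.

```python
# Cross-sectional: many subjects at the same point in time
# (hypothetical 2020 revenue in $ millions).
cross_sectional = {"Firm A": 120, "Firm B": 95, "Firm C": 143}

# Time series: one subject over several time periods
# (hypothetical Firm A revenue by year, $ millions).
time_series = {2018: 101, 2019: 110, 2020: 120}

subjects = list(cross_sectional)   # observations differ by subject
periods = sorted(time_series)      # observations differ by time period
print(subjects, periods)
```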
The ordinal scale
reflects a stronger level of measurement compared to the nominal scale. We are able to both categorize and rank the data with respect to some characteristic or trait.
Nominal scale
represents the least sophisticated level of measurement. If presented with this data, all we can do is categorize or group the data. The values in the data set differ merely by label or name.
The ratio scale
represents the strongest level of measurement. Has all the characteristics of the interval scale as well as a true zero point, which allows us to interpret the ratios between observations.
Sample data are generally collected as
cross-sectional data or time series data
Predictive and prescriptive analytics example for Apple Music
"What are the key factors that influence a U.S.-based female listener's music choice?" The answer to this question cannot be found directly in an enterprise database.
Business intelligence (BI) example for Apple Music or Spotify would be...
"During the first quarter of 2020, how many country songs recommended by the music service were skipped by U.S.-based female listeners within five seconds of playing?"
Structured data examples include
numbers, dates, and groups of words and numbers, typically stored in a tabular format; point-of-sale and financial data.
JSON (JavaScript Object Notation)
A human-readable text format for data interchange that defines attributes and values in a document.
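A minimal sketch of attributes and values in a JSON document, read with Python's standard `json` module. The record and its attribute names are made up for illustration.

```python
import json

# Hypothetical document: each attribute name is paired with a value.
json_text = '{"name": "Ana Lopez", "age": 34, "playlists": ["country", "jazz"]}'

# Parsing turns the text into ordinary Python objects (dict, int, list).
record = json.loads(json_text)
print(record["name"], record["age"], record["playlists"])

# Serializing writes the same data back out as text for interchange.
round_trip = json.dumps(record)
```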
Volume
An immense amount of data is compiled from a single source or a wide range of sources, including business transactions, household and personal devices, manufacturing equipment, social media, and other online portals.
Descriptive Analytics is often referred to as...
Business intelligence (BI). It uses past data integrated from multiple sources to inform decision making and identify problems and solutions
Information
Data that have been organized, analyzed, and processed in a meaningful and purposeful way.
Variable
For business analytics, we invariably focus on people, firms, or events with particular characteristics. When a characteristic of interest differs in kind or degree among various observations (records), then the characteristic can be termed a variable.
Structured data
Generally reside in a predefined, row-column format. Spreadsheet or database applications are used to enter, store, query, and analyze structured data. Often consist of numerical information that is objective and is not open to interpretation.
Data
In general are compilations of facts, figures, or other contents, both numerical and nonnumerical. Data of all types and formats are generated from multiple sources.
tabular format
The presentation of information such as text and numbers in tables.
Ratio scale continued
The ratio scale is used in many business applications. Variables such as sales, profits, and inventory levels are expressed on the ratio scale. A meaningful zero point allows us to state, for example, that profits for firm A are double those of firm B. Variables such as weight, time, and distance are also measured on a ratio scale because zero is meaningful.
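A short sketch of why ratios are meaningful on this scale: because zero means "none", the ratio between two observations does not depend on the unit of measurement. The profit figures are hypothetical.

```python
# Hypothetical profits in dollars.
profit_a = 2_000_000
profit_b = 1_000_000

# Firm A's profits are double firm B's.
ratio_dollars = profit_a / profit_b

# Re-expressing both values in thousands of dollars leaves the ratio unchanged,
# because the zero point is the same in both units.
ratio_thousands = (profit_a / 1000) / (profit_b / 1000)

print(ratio_dollars, ratio_thousands)
```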
The three Vs of big data
Volume, Variety, Velocity
Predictive Analytics Answers
What could happen? Example: identifying customers who are likely to respond to specific marketing campaigns, or admitted students who are likely to enroll.
Descriptive Analytics Answers
What has happened? example: financial reports, enrollment at universities, student report cards.
Prescriptive Analytics Answers
What should we do? It explores several possible actions and suggests a course of action. Example: choosing an investment portfolio to meet a financial goal, or targeting marketing campaigns to specific customer groups on a limited budget.
Nominal and ordinal scales
are used for categorical variables
Interval and ratio scales
are used for numerical variables.
A Discrete Variable
assumes a countable number of values. Examples: you can't have 1.3 children or score 90.25 points in a basketball game; a stock price can take on a value of $20.37 or $20.38 but cannot take on a value between these two points.
Population Data
consists of all observations or items of interest in an analysis
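We rely on sampling because we are unable to use population data for two main reasons
cost (obtaining information on the entire population is expensive) and impossibility (it is often not possible to examine every member of the population)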
Velocity
data from a variety of sources get generated at a rapid speed
Value
derived from big data is perhaps the most important aspect of any analytics initiative.
Unstructured data (or unmodeled data)
does not conform to a predefined, row-column format as structured data do. Tends to be textual (e.g., written reports, e-mail messages, doctors' notes) or have multimedia contents (e.g., photographs, videos, and audio data).
In a data file with a fixed-width format (or fixed-length format) used to store tabular data
each column starts and ends at the same place in every row. The file stores only raw data, and each column is limited to a fixed number of characters.
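A minimal sketch of reading and writing fixed-width records. The layout (name in positions 0-9, age in 10-12, city in 13-22) and the sample values are assumptions chosen for illustration.

```python
# Hypothetical layout: (field name, start position, end position).
FIELDS = [("name", 0, 10), ("age", 10, 13), ("city", 13, 23)]

def make_line(name, age, city):
    # Pad each value to its column width; longer text is cut off at the limit.
    return f"{name[:10]:<10}{age:>3}{city[:10]:<10}"

def parse_line(line):
    # Every column starts and ends at the same place in every row.
    return {field: line[start:end].strip() for field, start, end in FIELDS}

lines = [make_line("Ana", 34, "Boston"), make_line("Raj", 51, "Chicago")]
rows = [parse_line(line) for line in lines]

# A name longer than ten characters does not fit in its column.
long_row = parse_line(make_line("Maximilian-Alexander", 40, "Austin"))
print(rows, long_row)
```

The truncated name in `long_row` shows the character limit in action.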
In a delimited file,
each piece of data can contain as many characters as applicable.
Another widely used file format to store tabular data is delimited format
each piece of data is separated by a comma. A comma in this format is called a delimiter, and the file is called a comma-delimited or comma-separated value (CSV) file.
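A minimal sketch of reading comma-delimited data with Python's standard `csv` module. The column names and values are made up; an in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical CSV content: the first row holds the column names,
# and the comma is the delimiter between pieces of data.
csv_text = "name,age,city\nAna,34,Boston\nMaximilian-Alexander,51,Chicago\n"

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Values can be as long as needed; the delimiter marks where each one ends.
print(rows[0]["name"], rows[1]["name"])
```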
HTML (Hypertext Markup Language)
the predominant language used to create web pages
Interval Scale
we are able to categorize and rank the data as well as find meaningful differences between observations. Example: on the Fahrenheit scale, not only is 60 degrees F hotter than 50 degrees F, the same difference of 10 degrees also exists between 90 and 80 degrees F. The main drawback is that the value zero is arbitrarily chosen, so ratios between observations are not meaningful.
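A short sketch of the arbitrary-zero drawback, using the standard Fahrenheit-to-Celsius conversion. Equal differences stay equal after the conversion, but the ratio between two temperatures changes, which is why ratios cannot be interpreted on an interval scale.

```python
def f_to_c(f):
    # Standard conversion from degrees Fahrenheit to degrees Celsius.
    return (f - 32) * 5 / 9

# Differences are meaningful: both 10-degree F gaps map to the same Celsius gap.
diff_low = f_to_c(60) - f_to_c(50)
diff_high = f_to_c(90) - f_to_c(80)

# Ratios are not: 80 F looks like "twice" 40 F, but in Celsius
# the same two temperatures have a ratio of roughly 6.
ratio_f = 80 / 40
ratio_c = f_to_c(80) / f_to_c(40)

print(diff_low, diff_high, ratio_f, ratio_c)
```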
Ordinal scale weakness
we cannot interpret the difference between the ranked values because the actual numbers used are arbitrary. Example: the category "excellent" may be coded with a rating of 5, but any other number that preserves the ranking would serve equally well.
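A small sketch of this weakness, assuming two made-up coding schemes for the same three rating categories. Both schemes give the same ranking, but the gaps between the codes disagree, so the differences carry no meaning.

```python
# Two hypothetical ways to code the same ordered categories.
coding_a = {"poor": 1, "good": 3, "excellent": 5}
coding_b = {"poor": 10, "good": 20, "excellent": 100}

ratings = ["good", "excellent", "poor"]

# The ranking is the same under either coding.
order_a = sorted(ratings, key=coding_a.get)
order_b = sorted(ratings, key=coding_b.get)

# The gap between "excellent" and "good" depends entirely on the coding chosen.
gap_a = coding_a["excellent"] - coding_a["good"]
gap_b = coding_b["excellent"] - coding_b["good"]

print(order_a, order_b, gap_a, gap_b)
```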
For a categorical variable
we use labels or names to identify the distinguishing characteristic of each observation. It can be defined by more than two categories. Example: marital status, course grade
For a numerical variable
we use numbers to identify the distinguishing characteristic of each observation. They are either discrete or continuous