Chapter 1
big data
A term used to describe a massive volume of both structured and unstructured data that are extremely difficult to manage, process, and analyze using traditional data-processing tools; does not imply complete (population) data
Variety
Data come in all types, forms, and granularity, both structured and unstructured. Can include text, numbers, figures, audio, video, emails, and other multimedia elements
Velocity
Data from a variety of sources get generated at a rapid speed
3 characteristics of big data
Volume, Velocity, Variety
information
a set of data that are organized and processed in a meaningful and purposeful way
HyperText Markup Language (HTML)
a simple text-based markup language for displaying content in web browsers
eXtensible Markup Language (XML)
a simple text-based markup language for representing structured data. Uses user-defined markup tags to specify the structure of data
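As a rough illustration of user-defined tags, the sketch below parses a small XML string with Python's standard library. The tag names (customers, customer, name, income) and the values are made up for this example and are not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the tags below are user-defined, chosen only for illustration.
xml_text = """
<customers>
  <customer id="1"><name>Ana</name><income>52000</income></customer>
  <customer id="2"><name>Ben</name><income>61000</income></customer>
</customers>
""".strip()

root = ET.fromstring(xml_text)

# Walk the structure that the tags define and collect one dict per customer.
customers = [
    {
        "id": c.get("id"),
        "name": c.findtext("name"),
        "income": int(c.findtext("income")),
    }
    for c in root.findall("customer")
]
print(customers)
```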
JavaScript Object Notation (JSON)
a standard for transmitting human-readable data in compact files
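A minimal sketch of JSON in practice, assuming a made-up customer record: the record is serialized to compact, human-readable text and then read back with Python's standard json module.

```python
import json

# Hypothetical record used only for illustration.
record = {"name": "Ana", "income": 52000, "emails": ["ana@example.com"]}

# Serialize to compact text (no spaces after separators) for transmission.
text = json.dumps(record, separators=(",", ":"))
print(text)

# Parse the text back into Python objects on the receiving side.
restored = json.loads(text)
print(restored["emails"][0])
```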
population
all observations or items of interest in an analysis
Volume
an immense amount of data is compiled from a single source or a wide range of sources, including business transactions, household and personal devices, manufacturing equipment, social media, and other online portals
numerical variable
assume meaningful numerical values
categorical variable
assume names or labels
discrete variable
assumes a countable number of values
continuous variable
characterized by uncountable values within an interval
data
compilations of facts, figures, or other contents, both numerical and nonnumerical
cross-sectional data
data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time
time series data
data collected over several time periods focusing on certain groups of people, specific events, or objects
structured data
data that reside in a predefined, row-column format
knowledge
derived from a blend of data, contextual information, experience, and intuition
unstructured data
do not conform to a predefined, row-column format; textual or multimedia contents
delimited format
each column is separated by a delimiter such as a comma. Each column can contain as many characters as applicable
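A small sketch of reading comma-delimited data with Python's standard csv module. The column names and rows are invented for the example; note that the fields have different lengths from row to row.

```python
import csv
import io

# Hypothetical comma-delimited text; field lengths vary freely between rows.
delimited_text = (
    "name,city,income\n"
    "Ana,Lisbon,52000\n"
    "Benjamin,Rio de Janeiro,61000\n"
)

# DictReader splits each row on the delimiter and uses the first row as column names.
rows = list(csv.DictReader(io.StringIO(delimited_text), delimiter=","))
for row in rows:
    print(row["name"], "|", row["city"], "|", row["income"])
```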
fixed-width format
every column starts and ends at the same place in every row; data stored as plain text characters in a digital file
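For comparison with the delimited format, here is a sketch of writing and reading fixed-width rows. The layout (10 characters for name, 15 for city, 5 for income) is an assumption made for this example only.

```python
# Hypothetical layout: (start, end) character positions of each column.
LAYOUT = {"name": (0, 10), "city": (10, 25), "income": (25, 30)}

records = [("Ana", "Lisbon", 52000), ("Benjamin", "Rio de Janeiro", 61000)]

# Write: pad every field to its column width so all rows have the same length.
lines = [f"{name:<10}{city:<15}{income:>5}" for name, city, income in records]
for line in lines:
    print(repr(line))


def parse_line(line):
    """Slice one row at the fixed positions and strip the padding."""
    return {col: line[start:end].strip() for col, (start, end) in LAYOUT.items()}


parsed = [parse_line(line) for line in lines]
print(parsed)
```

No delimiter is needed because the position alone identifies the column; the trade-off is that values longer than the column width would not fit.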
descriptive analytics
gathering, organizing, tabulating, and visualizing data to summarize "what has happened?"
variable
general characteristic being observed on a set of people, objects, or events, where each observation varies in kind or degree
nominal scale
least sophisticated level of measurement; observations differ merely by name or label
interval scale
observations can be categorized and ranked, and differences between observations are meaningful. The main drawback is that the value of zero is arbitrarily chosen
ordinal scale
observations can be categorized and ranked; however, differences between the ranked observations are meaningless
ratio scale
observations have all the characteristics of interval-scaled data as well as a true zero point; strongest level of measurement
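The difference between the interval and ratio scales can be checked with temperature, a common textbook illustration that is not taken from the definitions above. Celsius has an arbitrary zero (interval scale), while Kelvin has a true zero (ratio scale). The sketch below shows that differences survive the change of zero but ratios do not.

```python
def celsius_to_kelvin(c):
    """Shift the zero point; the size of one degree is unchanged."""
    return c + 273.15


a_c, b_c = 20.0, 10.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on an interval scale: the same gap in both units.
diff_c = a_c - b_c
diff_k = a_k - b_k
print("difference in Celsius:", diff_c)
print("difference in Kelvin: ", round(diff_k, 2))

# Ratios are not: "twice as hot" in Celsius is an artifact of the arbitrary zero.
ratio_c = a_c / b_c
ratio_k = a_k / b_k
print("ratio in Celsius:", ratio_c)
print("ratio in Kelvin: ", round(ratio_k, 3))
```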
Value
perhaps the most important aspect of any analysis initiative
advanced predictions
predictive and prescriptive analytics; focus on building predictive and prescriptive models that help organizations understand what might happen in the future
business intelligence (BI)
provides historical, current, and predictive views of business operations and environments and gives organizations a competitive advantage in the marketplace; descriptive analytics
Veracity
refers to the credibility and quality of data
machine-generated
structured: information from manufacturing sensors, speed cameras, web server logs; unstructured: satellite images, meteorological data, surveillance video data, traffic camera images
human-generated
structured: information on price, income, retail sales, gender, etc.; unstructured: texts of internal e-mails, social media data, presentations, mobile phone conversations, text message data, etc.
sample
subset of the population
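A minimal sketch of drawing a sample from a population, using simulated incomes. The population size, the sample size, and the distribution are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: incomes for all 10,000 customers of interest.
population = [random.gauss(55000, 12000) for _ in range(10_000)]

# Sample: a randomly chosen subset of 100 of those observations.
sample = random.sample(population, k=100)

# The sample mean estimates the population mean but will usually differ a little.
print(f"population mean: {statistics.mean(population):.0f}")
print(f"sample mean:     {statistics.mean(sample):.0f}")
```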
Business Analytics (BA)
uses data and statistical methods to gain insight into the data and provide decision makers with information they can act on
predictive analytics
using historical data to predict "what could happen in the future?"
prescriptive analytics
using optimization and simulation algorithms to provide advice on "what should we do?"