Big Data Test 1
Clickstream Analysis
- Analysis of data generated in the web environment - everything that you do on the web is captured - how you came in, where you go, how long you stay, what you look at, whether you buy
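A minimal sketch of what "where you go, how long you stay, whether you buy" can look like in code. The event list, the page paths, and the `dwell_times` helper are all hypothetical and chosen only for illustration; real clickstream logs carry many more fields.

```python
from datetime import datetime

# Hypothetical clickstream events for one visitor: (timestamp, page visited).
events = [
    ("2024-01-01 10:00:00", "/home"),
    ("2024-01-01 10:00:30", "/products"),
    ("2024-01-01 10:02:30", "/cart"),
    ("2024-01-01 10:03:00", "/checkout"),
]

def dwell_times(events):
    """Return (page, seconds) pairs: time spent on a page before the next click."""
    parsed = [(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), page) for ts, page in events]
    result = []
    # Pair each event with the one that follows it; the last page has no known exit time.
    for (start, page), (end, _next_page) in zip(parsed, parsed[1:]):
        result.append((page, (end - start).total_seconds()))
    return result

path = [page for _, page in events]   # where the visitor went
stay = dwell_times(events)            # how long they stayed on each page
bought = "/checkout" in path          # assumption: reaching /checkout counts as a purchase
```

The last page in a session has no following click, so its dwell time is unknown in this sketch.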
Structured Data
- Has a defined length and format - helps us make sense of data - easy to store because we know what to expect ex: Social Security number, tiger card number
Business Analytics
- Needed to change the business - looking in front of you to see what is going to happen
Data Mining
- Supervised (predictive) - the algorithm is given specific guidelines for the purpose of testing a hypothesis - Unsupervised (descriptive) - the algorithm is not given guidelines and there is no preconceived notion of what will be found
Primary key
- a candidate key selected as the primary means of identifying rows in a relation - short, numeric and never changes
Dashboard
- a collection of resources assembled to create a single unified visual display - used to quickly demonstrate metrics - originally developed for busy executives
Text Mining
- a data mining process used on data that is largely unstructured - usually uses the text itself to determine indices for subsequent analysis
Data integrity
- a data quality measure - refers to accuracy being maintained during manipulation - ex: your AU permanent record should not change just because you have moved from undergraduate to graduate status
Relations
- a two-dimensional table that has the following characteristics: - rows contain data about an entity - columns contain data about attributes - all entries in a column are of the same kind
Commodity Clusters
- affordable parallel computers with an average number of computing nodes - the computing nodes are clustered in racks connected to each other via a fast network
Relational Database
- based on the concept of two-dimensional tables - designed with a number of related tables - each of these tables contains record attributes - the records are listed in rows and the attributes are listed in columns
Variety
- complexity - the increasing number of different forms in which data can be generated, such as text, images, voice
Blade server
- compute node blade - each blade is a node
Distributed computing
- computing in one or more of these clusters across a local area network or the internet - enables data parallelism - improves scalability - a foundational technology - technique that allows individual computers to be networked together across geographical areas as though they were a single environment
Knowledge
- consists of data and/or information that have been organized and processed to convey understanding, experience, accumulated learning and expertise as they apply to a current business problem - actionable information in context
Web mining
- content mining- text mining of web content - structure mining- mining of web links and their relationships - usage mining- mining of web navigation
Data quality
- critical attributes of data including: 1. accuracy 2. precision 3. completeness 4. relevance 5. temporality
Metadata
- data about data - generally regards form and structure of data - used by data repositories to read and recall data, also to support data quality
Unstructured data
- data that has attributes that do not fit well into the "relation" model - ex: images, videos, audio files, text
Data
- facts and statistics collected together for reference and analysis - without context data has no meaning - a list of names is an example
Business Intelligence
- needed to run the business - looking in the rearview mirror and using historical data from one minute ago to many years ago
Explicit knowledge
- objective, often technical material that is easily transferred
Goals of Data visualization
- record information - analyze data to support reasoning - communicate ideas to others - interact with the data (which supports all of the above)
Information
- refers to data that have been organized so that they have meaning and value to the recipient - a list of names with a heading of "ISMN students"
Latency
- refers to systematic delays derived from delays in task execution - the delay before a transfer of data begins following an instruction for transfer - impacts the big data life cycle in different ways
Data Visualization
- representation of data in a non-text way - often graphically - generally considered to be the ideal method of information transfer - caution should be used when interpreting
Volume
- size of data
Entities
- some identifiable thing that users want to track - ex: customers, computers, sales, products
Velocity
- speed at which data is being generated and the pace at which data moves from one point to the next
The 4 pillars of data
- statistics and probability - computer science and software programming - business domain - written and verbal communication
Random Access Memory
- the physical hardware inside a computer that temporarily stores data - allows a computer to work with more information at the same time
Data Visualization
- the presentation of data in a graphical format - shows many factors in one picture
Foreign Key
- the primary key of one relation that is *NORMALLY* placed in another relation to form a link between the relations - can be a single column or a composite key
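A small sketch of how a foreign key links two relations, using Python's built-in `sqlite3` module. The `customer` and `sale` tables and their columns are made up for illustration. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# customer_id is the primary key of the customer relation.
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# sale.customer_id is a foreign key: customer's primary key placed in another relation.
conn.execute(
    "CREATE TABLE sale ("
    " sale_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(customer_id),"
    " amount REAL NOT NULL)"
)

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO sale VALUES (100, 1, 25.0)")

# The foreign key lets us join the two relations.
rows = conn.execute(
    "SELECT customer.name, sale.amount "
    "FROM sale JOIN customer ON sale.customer_id = customer.customer_id"
).fetchall()

# A sale pointing at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO sale VALUES (101, 99, 5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Here the key is a single column; a composite foreign key would list several columns in a table-level `FOREIGN KEY (...) REFERENCES ...` clause.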
Efficiency
- the ratio of input to output in relation to goal attainment - in business, efficiency is often equated with speed, but speed alone does not show the whole picture
Hard Disk Drive
- Data storage device used to store and retrieve digital information - uses a mechanical arm with a read/write head to move around and read information from the right location
Semi-structured data
- data that is somewhere in between (some consistent attributes have been defined) - part can be stored in a relational database while the rest can't
Path and evolution of Data Management
1. Computing machines 2. invention of relational data model 3. added SQL 4. large amounts of data slows down processing 5. enter data warehouse/mart 6. enter web and content management 7. managing big data (cloud storage and network speed)
entities, attributes, relations, relationships
A relational database consists of:
Temperature on a map
An example of visualizing Big Data is _______?
Data engineer
Computer science, engineering
True
Data Warehouses provide online analytic processing: T/F
Machines
Data generated from real-time sensors in industrial machinery or vehicles, online user-behavior trackers, environmental sensors, personal health trackers and many other sensor data sources
1024
How many petabytes make up an exabyte?
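The answer 1024 follows from the binary convention, in which each unit is 2^10 times the one below it (under the decimal SI convention the factor would be 1000). A quick sketch of the arithmetic, with a hypothetical `ratio` helper:

```python
# Binary convention: each step up the list multiplies by 2**10 = 1024.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]

def ratio(larger, smaller):
    """How many `smaller` units fit in one `larger` unit (binary convention)."""
    steps = UNITS.index(larger) - UNITS.index(smaller)
    return 1024 ** steps

pb_per_eb = ratio("exabyte", "petabyte")   # one step apart
tb_per_eb = ratio("exabyte", "terabyte")   # two steps apart
```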
Machine Data
In Operations Analysis, we focus on what type of data?
10 million
In the video, 2.5 quintillion bytes of data are equivalent to how many DVDs?
Scalable infrastructure
Name one of the drivers of Volume in the Big Data Era?
Solid State Drive
Storage device, Information is stored in microchips
Profits
Value from Big Data can be _______?
Analyzing data in-motion and at rest
What can help organizations to find new associations or uncover patterns and facts to significantly improve intelligence, security and law enforcement?
Online Analytical Processing
What does "OLAP" stand for?
Data Lakes
What is a method of storing data to support the analysis of originally disparate sources of data?
JSON files
What is an example of a source of semi-structured big data?
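To see why JSON counts as semi-structured, consider the sketch below. The records are invented for illustration: every record shares a few consistent fields, while the rest vary from record to record.

```python
import json

# Hypothetical JSON payload: all records have id and name, other fields vary.
raw = '''[
  {"id": 1, "name": "Ada", "email": "ada@example.com"},
  {"id": 2, "name": "Grace", "tags": ["navy", "compiler"]},
  {"id": 3, "name": "Linus", "address": {"city": "Helsinki"}}
]'''

records = json.loads(raw)

# Fields present in every record: the consistent, table-friendly part.
common = set.intersection(*(set(r) for r in records))

# Fields that appear in only some records: the part that resists a fixed schema.
all_fields = set.union(*(set(r) for r in records))
optional = all_fields - common
```

The `common` fields map naturally onto columns of a relation; the `optional` ones (including nested values such as `address`) do not fit a fixed-width row without extra design work.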
Data Science
What is the process of cleaning and analyzing data to derive insight and value from it?
Polaris
What is the search engine used by Walmart?
Enhanced 360-degree view
What is the term used to describe a holistic approach that takes into account all available and meaningful information about a customer to drive better engagement, revenue and long-term loyalty?
2020
When is it estimated that the data we create and copy will reach around 35 zettabytes?
Key
a combination of one or more columns that is used to identify rows in a relation
Hadoop
a distributed computing platform for data manipulation and analysis of data at large scales
Composite key
a key that consists of two or more columns
Candidate key
a key that determines all of the other columns in a relation
Computing blade
a node that has been stripped of many components to save space
Algorithm
a process or set of rules to be followed during calculations and other problem solving operations
MapReduce
a programming model that simplifies parallel computing
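A single-machine sketch of the MapReduce idea, using the classic word-count example. This is plain Python, not Hadoop's API; the function names are illustrative. On a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["big data big value", "data moves fast"])
```

Because each `map_phase` call looks at only one document and each `reduce_phase` call looks at only one key, the work can be split across machines without the pieces needing to coordinate.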
Compute node
any physical connection point
Data archivist
arts and humanities
Data librarian
arts and humanities
Data steward
arts and humanities
Value
at the heart of the big data challenge is turning all of the dimensions into truly useful business value
Distributed file system
can handle the big data access
Computing cluster
collection of nodes
Motherboard
connects all the parts
Attributes
data that describes the entity
Availability variation
forms are: real time availability like sensor data, versus not real time availability like stored patient records
Data journalist
journalism, media studies, communications studies
Data analyst
mathematics, statistics, business studies
Organizations
more traditional types of data, including transaction information in databases and structured data stores in data warehouses
Node
one element of a computing cluster contained in a rack
Data center
physical infrastructure is fundamental to operation
semantic variety
refers to how we interpret the data. We often use different units for quantities we measure or might use qualitative versus quantitative measures
Scalability
refers to the capability of a distributed computing platform to handle a growing amount of work - its potential to be scaled up to accommodate more data
Structural variety
refers to the difference in the representation of data
Valence
refers to the fraction of data items that are connected out of total number of possible connections
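One way to make this definition concrete, assuming connections are undirected pairs of distinct items so that n items allow n(n-1)/2 possible connections. The `valence` function is a hypothetical helper written for this sketch.

```python
def valence(n_items, connections):
    """Fraction of item pairs that are connected, out of all possible pairs.

    Assumes undirected connections between distinct items, so n items
    allow n * (n - 1) / 2 possible connections.
    """
    possible = n_items * (n_items - 1) // 2
    if possible == 0:
        return 0.0
    # Treat (a, b) and (b, a) as the same connection and ignore self-links.
    unique = {frozenset(pair) for pair in connections if pair[0] != pair[1]}
    return len(unique) / possible

# 4 items allow 6 possible connections; 2 distinct ones are present.
v = valence(4, [("a", "b"), ("b", "c"), ("b", "a")])
```

A value near 0 means the items are mostly isolated, while a value near 1 means almost every item is linked to every other.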
Media variety
refers to the medium in which the data gets delivered
Veracity
refers to the quality of the data
Graphics Processing Unit
renders images, animations and video for the computer's screen
Central Processing Unit
responsible for interpreting and executing most of the commands from the other parts of a computer
Tacit Knowledge
subjective or experiential knowledge that is difficult to transfer
Effectiveness
the degree to which a goal is attained
Internet of Things
the widespread availability of smart devices and their interconnectivity led to this new term being coined
Compute node
used for computing
Relationships
values of relations between entity and attributes
Humans
vast amounts of social media data: status updates, tweets, text messages, photos and other media