Big Data Test 1


Clickstream Analysis

- Analysis of data that occur in the web environment - everything that you do on the web is captured - how you came in, where you go, how long you stay, what you look at, whether you buy
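As a rough illustration of "where you go, how long you stay", the sketch below estimates time spent on each page from a list of timestamped page views. The event list is made-up sample data and the function name `dwell_times` is an assumption for illustration, not part of any tracking product.

```python
from datetime import datetime

# Made-up sample clickstream: (ISO timestamp, page visited), in visit order.
events = [
    ("2024-01-01T10:00:00", "/home"),
    ("2024-01-01T10:00:30", "/product/42"),
    ("2024-01-01T10:02:00", "/checkout"),
]

def dwell_times(events):
    """Return (page, seconds) pairs: time until the next page view.

    The last page has no following event, so its dwell time is unknown
    and it is left out.
    """
    parsed = [(datetime.fromisoformat(ts), page) for ts, page in events]
    return [
        (page, (next_ts - ts).total_seconds())
        for (ts, page), (next_ts, _) in zip(parsed, parsed[1:])
    ]

for page, seconds in dwell_times(events):
    print(page, seconds)
```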

Structured Data

- Has a defined length and format - helps us make sense of data - easy to store because we know what to expect ex: Social Security number, tiger card number
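One way to picture "a defined length and format" is a check that rejects anything that does not match the expected layout. The sketch below assumes the common NNN-NN-NNNN layout for a Social Security number; the pattern and the function name `looks_like_ssn` are illustrative assumptions, and the check covers format only, not whether the number was ever issued.

```python
import re

# Assumed layout for illustration: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")

def looks_like_ssn(value: str) -> bool:
    """Return True only when the whole string has the expected length and format."""
    return SSN_PATTERN.fullmatch(value) is not None

print(looks_like_ssn("123-45-6789"))   # well-formed layout
print(looks_like_ssn("12-345-6789"))   # digits grouped incorrectly
```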

Business Analytics

- Needed to change the business - looking in front of you to see what is going to happen

Data Mining

- Supervised (predictive) - the algorithm is given specific guidelines for the purpose of testing a hypothesis - Unsupervised (descriptive) - the algorithm is not given guidelines and there is no preconceived notion of what will be found

Primary key

- a candidate key selected as the primary means of identifying rows in a relation - short, numeric and never changes

Dashboard

- a collection of resources assembled to create a single unified visual display - used to quickly demonstrate metrics - originally developed for busy executives

Text Mining

- a data mining process used on data that is largely unstructured - usually uses the text itself to determine indices for subsequent analysis
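A minimal sketch of "uses the text itself to determine indices": the code below builds a simple inverted index that maps each word to the documents containing it. The sample documents and the name `build_index` are assumptions for illustration; real text mining tools typically add steps such as stop-word removal and stemming.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index

# Made-up sample documents (for example, customer comments).
docs = {
    1: "Late delivery, poor packaging",
    2: "Fast delivery and great packaging",
}
index = build_index(docs)
print(sorted(index["delivery"]))
print(sorted(index["late"]))
```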

Data integrity

- a data quality measure - refers to accuracy being maintained during manipulation - your AU permanent record should not change just because you have moved from undergraduate to graduate status

Relations

- a two-dimensional table that has the following characteristics: - rows contain data about an entity - columns contain data about attributes - all entries in a column are of the same kind

Commodity Clusters

- affordable parallel computers with an average number of computing nodes - the computing nodes are clustered in racks connected to each other via a fast network

Relational Database

- based on the concept of two-dimensional tables - designed with a number of related tables - each of these tables contains record attributes - the records are listed in rows and the attributes are listed in columns

Variety

- complexity - increasing different forms that data can be generated such as text, images, voice

Blade server

- compute node blade - each blade is a node

Distributed computing

- computing in one or more of these clusters across a local area network or the internet - enables data parallelism - improves scalability - a foundational technology - technique that allows individual computers to be networked together across geographical areas as though they were a single environment

Knowledge

- consists of data and/or information that have been organized and processed to convey understanding, experience, accumulated learning and expertise as they apply to a current business problem - actionable information in context

Web mining

- content mining- text mining of web content - structure mining- mining of web links and their relationships - usage mining- mining of web navigation

Data quality

- critical attributes of data including: 1. accuracy 2. precision 3. completeness 4. relevance 5. temporality

Metadata

- data about data - generally regards form and structure of data - used by data repositories to read and recall data, also to support data quality

Unstructured data

- data that has attributes that do not fit well into the "relation" model - ex: images, videos, audio files, text

Data

- facts and statistics collected together for reference and analysis - without context data has no meaning - a list of names is an example

Business Intelligence

- needed to run the business - looking in the rearview mirror and using historical data from one minute ago to many years ago

Explicit knowledge

- objective, often technical material that is easily transferred

Goals of Data visualization

- record information - analyze data to support reasoning - communicate ideas to others - interact with the data (which supports all of the above)

Information

- refers to data that have been organized so that they have meaning and value to the recipient - a list of names with a heading of "ISMN students"

Latency

- refers to systematic delays derived from delays in task execution - the delay before a transfer of data begins following an instruction for transfer - impacts the big data life cycle in different ways

Data Visualization

- representation of data in a non-text way - often graphically - generally considered to be the ideal method of information transfer - caution should be used when interpreting

Volume

- size of data

Entities

- some identifiable thing that users want to track - ex: customers, computers, sales, products

Velocity

- speed at which data is being generated and the pace at which data moves from one point to the next

The 4 pillars of data

- statistics and probability - computer science and software programming - business domain - written and verbal communication

Random Access Memory

- the physical hardware inside a computer that temporarily stores data - allows a computer to work with more information at the same time

Data Visualization

- the presentation of data in a graphical format - shows many factors in one picture

Foreign Key

- the primary key of one relation that is *NORMALLY* placed in another relation to form a link between the relations - can be a single column or a composite key
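The link between a primary key and a foreign key can be sketched with Python's built-in `sqlite3` module. The table and column names (`customer`, `sale`, `customer_id`) are assumptions chosen for illustration. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# customer_id is the primary key of the customer relation.
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
# The same column is placed in sale as a foreign key to link the two relations.
conn.execute(
    """
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL NOT NULL
    )
    """
)
conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO sale VALUES (100, 1, 25.0)")

# The foreign key lets us join each sale back to its customer.
rows = conn.execute(
    "SELECT c.name, s.amount FROM sale s "
    "JOIN customer c ON c.customer_id = s.customer_id"
).fetchall()

# A sale that points at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO sale VALUES (101, 99, 10.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

conn.close()
print(rows, orphan_rejected)
```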

Efficiency

- the ratio of input to output in relation to goal attainment - in business efficiency is often substituted for quickness but this does not show the whole picture

Hard Disk Drive

-Data storage device to store and retrieve digital information - uses a mechanical arm with a read/write head to move around and read information from the right location

Semi-structured data

- data that is somewhere in between (some consistent attributes have been defined) - part can be stored in a relational database while the rest can't

Path and evolution of Data Management

1. Computing machines 2. invention of relational data model 3. added SQL 4. large amounts of data slows down processing 5. enter data warehouse/mart 6. enter web and content management 7. managing big data (cloud storage and network speed)

entities, attributes, relations, relationships

A relational database consists of:

Temperature on a map

An example of visualizing Big Data is _______?

Data engineer

Computer science, engineering

True

Data Warehouses provide online analytic processing: T/F

Machines

Data generated from real time sensors in industrial machinery or vehicles, user behavior online trackers, environmental sensors, personal health trackers and many other sense data resources

1024

How many petabytes make up an exabyte?
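The answer of 1024 follows the binary convention, where each unit is 2^10 times the one below it (the decimal SI convention uses 1000 instead). A small sketch of that arithmetic, with the unit list and the helper name `ratio` as illustrative assumptions:

```python
# Units in ascending order; under the binary convention each step is a factor of 1024.
UNITS = [
    "byte", "kilobyte", "megabyte", "gigabyte",
    "terabyte", "petabyte", "exabyte", "zettabyte",
]

def ratio(larger: str, smaller: str) -> int:
    """How many of the smaller unit fit in one of the larger unit (binary convention)."""
    steps = UNITS.index(larger) - UNITS.index(smaller)
    if steps < 0:
        raise ValueError("first unit must not be smaller than the second")
    return 1024 ** steps

print(ratio("exabyte", "petabyte"))
print(ratio("zettabyte", "petabyte"))
```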

Machine Data

In Operations Analysis, we focus on what type of data?

10 million

In the video, 2.5 quintillion bytes of data are equivalent to how many DVDs?

Scalable infrastructure

Name one of the drivers of Volume in the Big Data Era?

Solid State Drive

Storage device, Information is stored in microchips

Profits

Value from Big Data can be _______?

Analyzing data in-motion and at rest

What can help organizations to find new associations or uncover patterns and facts to significantly improve intelligence, security and law enforcement?

Online Analytical Processing

What does "OLAP" stand for?

Data Lakes

What is a method of storing data to support the analysis of originally disparate sources of data?

JSON files

What is an example of a source of semi-structured big data?
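To see why JSON counts as semi-structured, the sketch below parses two records that share some attributes but not others. The shared fields fit neatly into table-like rows, while the remaining fields vary from record to record. The sample records and the choice of `id` and `name` as the consistent attributes are assumptions for illustration.

```python
import json

# Made-up sample: both records have id and name, only the first has extra fields.
raw = """[
  {"id": 1, "name": "Ada", "tags": ["vip"], "notes": {"last_call": "2020-01-05"}},
  {"id": 2, "name": "Grace"}
]"""
records = json.loads(raw)

CONSISTENT = ("id", "name")  # attributes every record is assumed to define

# The consistent part could be stored as rows in a relational table.
rows = [tuple(record[key] for key in CONSISTENT) for record in records]

# Whatever is left over does not fit a fixed set of columns.
extras = [
    {key: value for key, value in record.items() if key not in CONSISTENT}
    for record in records
]

print(rows)
print(extras)
```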

Data Science

What is the process of cleaning and analyzing data to derive insight and value from it?

Polaris

What is the search engine used by Walmart?

Enhanced 360-degree view

What is the term used to describe a holistic approach that takes into account all available and meaningful information about a customer to drive better engagement, revenue and long term loyalty?

2020

When is it estimated that the data we create and copy will reach around 35 zettabytes?

Key

a combination of one or more columns that is used to identify rows in a relation

Hadoop

a distributed computing platform for data manipulation and analysis of data at large scales

Composite key

a key that consists of two or more columns

Candidate key

a key that determines all of the other columns in a relation
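The difference between a single-column key and a composite key can be checked on sample data: a set of columns can only serve as a key if no two rows share the same values in those columns. The sketch below uses made-up enrollment rows, and the helper name `identifies_rows` is an assumption. Passing this check on a sample is necessary but not sufficient, since a true key must hold for every row the relation could ever contain.

```python
def identifies_rows(rows, columns):
    """Return True if the given columns take a different value combination in every row."""
    seen = set()
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key in seen:
            return False
        seen.add(key)
    return True

# Made-up sample relation: one row per student per course.
enrollments = [
    {"student_id": 1, "course": "ISMN 101", "grade": "A"},
    {"student_id": 1, "course": "ISMN 202", "grade": "B"},
    {"student_id": 2, "course": "ISMN 101", "grade": "A"},
]

print(identifies_rows(enrollments, ["student_id"]))            # one column alone
print(identifies_rows(enrollments, ["student_id", "course"]))  # composite of two columns
```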

Computing blade

a node that has been stripped of many components to save space

Algorithm

a process or set of rules to be followed during calculations and other problem solving operations

MapReduce

a programming model that simplifies parallel computing
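The idea behind the model can be sketched as a word count in plain Python, run in a single process. This is not the Hadoop API; the function names `map_phase`, `shuffle` and `reduce_phase` are assumptions that only mirror the usual map, group-by-key and reduce steps. In a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

lines = ["big data is big", "data moves fast"]  # made-up sample input
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```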

Compute node

any physical connection point

Data archivist

arts and humanities

Data librarian

arts and humanities

Data steward

arts and humanities

Value

at the heart of the big data challenge is turning all of the dimensions into truly useful business value

Distributed file system

can handle the big data access

Computing cluster

collection of nodes

Motherboard

connects all the parts

Attributes

data that describes the entity

Availability variation

forms are: real time availability like sensor data, versus not real time availability like stored patient records

Data journalist

journalism, media studies, communications studies

Data analyst

mathematics, statistics, business studies

Organizations

more traditional types of data, including transaction information in databases and structured data stores in data warehouses

Node

one element of a computing cluster contained in a rack

Data center

physical infrastructure is fundamental to operation

semantic variety

refers to how we interpret the data. We often use different units for quantities we measure or might use qualitative versus quantitative measures

Scalability

refers to the capability of a distributed computing platform to handle a growing amount of work - its potential to be scaled up to accommodate more data

Structural variety

refers to the difference in the representation of data

Valence

refers to the fraction of data items that are connected out of total number of possible connections
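As a worked example of this fraction, the sketch below assumes connections are undirected and that an item cannot connect to itself, so n items allow n(n-1)/2 possible connections. Both that counting rule and the function name `valence` are assumptions for illustration; the card itself does not say how possible connections are counted.

```python
def valence(n_items, connections):
    """Fraction of possible undirected connections that actually exist.

    connections is an iterable of (a, b) pairs; duplicates such as
    (a, b) and (b, a) count once, and self-links are ignored.
    """
    possible = n_items * (n_items - 1) // 2
    if possible == 0:
        return 0.0
    unique = {frozenset(pair) for pair in connections if pair[0] != pair[1]}
    return len(unique) / possible

# Four items allow 6 possible connections; 2 distinct ones exist here.
print(valence(4, [("a", "b"), ("b", "c"), ("b", "a")]))
```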

Media variety

refers to the medium in which the data gets delivered

Veracity

refers to the quality of the data

Graphics Processing Unit

renders images, animations and video for the computer's screen

Central Processing Unit

responsible for interpreting and executing most of the commands from the other parts of a computer

Tacit Knowledge

subjective or experiential knowledge that is difficult to transfer

Effectiveness

the degree to which a goal is attained

Internet of Things

the term coined to describe the widespread availability of smart devices and their interconnectivity

Compute node

used for computing

Relationships

values of relations between entity and attributes

Humans

vast amounts of social media data, status updates, tweets, text messages, photos and other media
