Purdue IE 332 Exam 2

Label

(expert) assigned identifier to sample

2. Learn: iteratively adjust model parameters to better represent gained knowledge

-A learning algorithm defines how to adjust the parameters
-The algorithm may use the data multiple times

1. Choose a model to represent the knowledge. Which model(s)?

-Each model makes assumptions about what knowledge could look like
-Not all models are equal, and no single model is always best
-Questions to ask: What is the practical purpose/task for the model? What meta-data or features does the data set contain? Is there existing expert knowledge about expected patterns in the data? What are the computational resources/limitations?

Relationships between entities

-Entities may be related somehow
-No limit on the number of entities participating in a relationship
-An entity may be involved in any number of relationships
-Each relationship has a unique name
-IMPORTANT: relationships indicate how data is related, not program flow or user-interaction steps with the system

Idea behind decision Trees

-Split training data into branches based on some logical rule (e.g. maximize information gain at each split)
-Each node of the tree is a rule; the branches are the subsets of data satisfying parts of the rule
-Continue splitting the data until some stopping criterion is met
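
As a rough illustration of "maximize information gain at each split", the sketch below scores two candidate rules on a toy label set. It assumes Shannon entropy as the impurity measure; the function names and the toy data are hypothetical, not taken from the course.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, branches):
    """Entropy of the parent minus the size-weighted entropy of its branches."""
    total = len(parent)
    weighted = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(parent) - weighted

# Hypothetical toy labels: 4 "yes" and 4 "no" samples at the current node.
parent = ["yes"] * 4 + ["no"] * 4
# Candidate rule A separates the two classes perfectly.
split_a = [["yes"] * 4, ["no"] * 4]
# Candidate rule B leaves both branches as mixed as the parent.
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(gain_a, gain_b)
```

A tree builder following this idea would pick rule A here, since it has the larger gain, and then repeat the same comparison inside each branch.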

data redundancy

-entering the same data over and over again is a waste of time
-wastes storage space
-increases the probability of errors occurring
-errors can occur when retrieving data
-degrades overall system performance (especially speed)

table

-every table has a unique name
-attributes within a table have unique names
-every attribute is atomic (i.e. not multivalued)
-every row is unique
-the order of the columns doesn't matter
-the order of the rows doesn't matter

pros of naive bayes

-fast learning and prediction
-extension to non-binary labels is already in the equations
-if the independence assumption holds, minimal data is required

data integrity

-make it difficult to insert invalid or inconsistent data
-on inserting/updating/deleting an item, automatically enforce logical relationships throughout the database

cons of naive bayes

-missing observations of one or more labels in the training data
-the independence assumption rarely holds strongly in reality
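
One way to read the first drawback is sketched below: a count-based naive Bayes over categorical features, where a feature value never observed with a label drives that label's score to zero. This is a minimal sketch under that assumption; the data set, the three labels, and the function names are hypothetical, and no smoothing is applied.

```python
from collections import Counter

def train(samples):
    """samples: list of (features_tuple, label). Returns raw count tables."""
    label_counts = Counter(label for _, label in samples)
    # feature_counts[(position, value, label)] = number of occurrences
    feature_counts = Counter()
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[(i, value, label)] += 1
    return label_counts, feature_counts

def score(features, label, label_counts, feature_counts):
    """Unnormalized P(label) * product of P(feature_i | label), from counts."""
    p = label_counts[label] / sum(label_counts.values())
    for i, value in enumerate(features):
        p *= feature_counts[(i, value, label)] / label_counts[label]
    return p

def predict(features, label_counts, feature_counts):
    """Return the label with the highest score (works for any number of labels)."""
    return max(label_counts,
               key=lambda lab: score(features, lab, label_counts, feature_counts))

# Hypothetical training data: (sky, temperature) -> activity, three labels.
samples = [
    (("sunny", "warm"), "hike"), (("sunny", "warm"), "hike"),
    (("sunny", "cold"), "ski"), (("cloudy", "cold"), "ski"),
    (("rainy", "warm"), "read"), (("rainy", "cold"), "read"),
]
model = train(samples)
prediction = predict(("sunny", "warm"), *model)

# ("cloudy", "warm") was never seen together with any label, so every
# label's product contains a zero factor and all scores collapse to 0.
unseen_scores = {lab: score(("cloudy", "warm"), lab, *model) for lab in model[0]}
print(prediction, unseen_scores)
```

Training is a single counting pass and prediction is a handful of multiplications, which is where the speed listed among the pros comes from.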

normalization

-proper separation/organization of data
-data should not depend on a non-key attribute

ERD for weak entity sets

1) a double or thick border on the rectangle of the weak entity set
2) a double or thick border on the diamond of the identifying relationship
3) a double or thick line and arrow linking the weak entity set to its identifying relationship
4) a double or dashed underline of the weak entity set attributes that are part of the composite key

creating a weak entity set

1. declare the weak entity set (WES) attributes
2. add the primary key of the identifying owner
3. declare the composite key using attributes from both entity sets
4. declare a foreign key constraint on the identifying owner's keys
5. (optional) instruct the system to automatically delete weak entities when their owner is deleted, e.g. delete any rooms when a building is deleted
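
The steps above can be sketched with Python's built-in `sqlite3` module, using the rooms-and-buildings case mentioned in the last step. The table names, column names, and sample rows are hypothetical, and SQLite is only one possible DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless it is requested.
conn.execute("PRAGMA foreign_keys = ON")

# Identifying owner: a building has its own primary key.
conn.execute("CREATE TABLE Building (building_id TEXT PRIMARY KEY, name TEXT)")

# Weak entity set: a room number alone does not identify a room.
conn.execute("""
    CREATE TABLE Room (
        room_number INTEGER,                      -- step 1: WES attributes
        capacity    INTEGER,
        building_id TEXT,                         -- step 2: owner's primary key
        PRIMARY KEY (building_id, room_number),   -- step 3: composite key
        FOREIGN KEY (building_id)                 -- step 4: foreign key constraint
            REFERENCES Building (building_id)
            ON DELETE CASCADE                     -- step 5: delete rooms with building
    )
""")

conn.execute("INSERT INTO Building VALUES ('B1', 'Main Hall')")
conn.executemany("INSERT INTO Room VALUES (?, ?, ?)",
                 [(101, 40, "B1"), (102, 120, "B1")])
rooms_before = conn.execute("SELECT COUNT(*) FROM Room").fetchone()[0]

# Deleting the owner removes its rooms automatically.
conn.execute("DELETE FROM Building WHERE building_id = 'B1'")
rooms_after = conn.execute("SELECT COUNT(*) FROM Room").fetchone()[0]
print(rooms_before, rooms_after)
```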

big data challenges

1. lacks integrity
2. lacks meta-data
3. back-end is cheap
4. front-end is confusing
5. your analysts don't understand your question
6. analysis is incomplete
7. you lack a means to interpret the analyses
8. you don't act on the analyses

relationship constraints

1:1 - the element of the entity MUST exist in the relationship once
0:1 - the element of the entity MAY exist in the relationship at most once
1:M - the element of the entity MUST be in the relationship at least once
0:M - the element of the entity MAY be in the relationship

Machine Learning vs AI vs Data Mining vs Statistics

AI: computers that behave and reason intelligently
ML: automatically learn models of data for prediction
DM: human-guided discovery of hidden patterns in a particular dataset
Statistics: quantify and summarize data
(These often overlap!)

What is a database?

An organized collection of data, often a model of the system under consideration, that supports concurrent, secure, and efficient access to the data.

Why is data independence important for a DBMS (database management system)?

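Applications are insulated from the definition and organization of the data: how the data is structured or stored can change without the applications having to change.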

(in depth) Things big data alone can't help with?

-Chance correlation: with so much data, correlations are easy to find
-Meaning: we can find relationships in data, but more is needed to determine meaning or cause
-Action: more data doesn't imply more knowledge
-Easily fooled: many big data tools can be purposely fooled
-Data drift: incoming data can cause unintentional signals
-Feedback: reinforcing data (especially with respect to web-based data)
-Critical thought: scientific-sounding answers to vague questions
-New data: how to handle previously unseen data
-Realistic: big data is NOT a silver bullet

Deleting Data Syntax (SQL)

DELETE FROM table [WHERE logical criteria]

DROP TABLE syntax (SQL)

DROP TABLE table (e.g. DROP TABLE Students)
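
The difference between deleting rows and dropping a table can be seen in a small `sqlite3` session. This is a sketch only: the `Students` columns and rows are hypothetical, and `sqlite_master` is SQLite's own catalog table, so other systems expose their schema differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (puid INTEGER PRIMARY KEY, name TEXT, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)",
                 [(1, "Ann", 3.9), (2, "Bob", 1.8), (3, "Cy", 2.5)])

# With WHERE: only the rows matching the criteria are removed.
conn.execute("DELETE FROM Students WHERE gpa < 2.0")
after_filtered_delete = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]

# Without WHERE: every row is removed, but the empty table still exists.
conn.execute("DELETE FROM Students")
after_full_delete = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]

# DROP TABLE removes the table definition itself, not just its rows.
conn.execute("DROP TABLE Students")
table_entry = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'Students'"
).fetchone()
print(after_filtered_delete, after_full_delete, table_entry)
```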

Why not always use a database?

-Databases are expensive and complicated to set up and maintain
-Databases are general-purpose, and not suited for special-purpose tasks like text search
-A database may be overkill, e.g. for a simple tabulation task

requirements specification (ERD)

Essential: constant customer communication! Goals: identify key data, no redundant details, no unimportant details, clarify unclear natural language, obtain missing information, separate data from operations.

Generalization in ERD

Generalization is the process of extracting common properties from a set of entities and creating a generalized entity from them. It is a bottom-up approach, where two or more entities can be generalized to a higher-level entity if they have some attributes in common.

Why is managing data tough?

The global amount of data is growing exponentially; data is scattered among many different sources; it is collected by many different individuals and devices in different formats; it must be kept secure and uncorrupted; and it must be easy to ask questions of the data.

Insert Syntax (SQL)

INSERT INTO table [(field names)] VALUES (field values)

Batch Syntax (SQL)

INSERT INTO table [(field names)] VALUES (field values)[,(field values)]*
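
A short `sqlite3` sketch of the single-row and batch forms follows. The `Students` table and its rows are hypothetical, and the multi-row `VALUES` form assumes a reasonably recent SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (puid INTEGER PRIMARY KEY, name TEXT, major TEXT)")

# Single-row insert with an explicit field list.
conn.execute("INSERT INTO Students (puid, name, major) VALUES (1, 'Ann', 'IE')")

# Batch insert: several parenthesized value lists after one VALUES keyword.
conn.execute(
    "INSERT INTO Students (puid, name, major) VALUES (2, 'Bob', 'ME'), (3, 'Cy', 'IE')"
)

# Fields left out of the field list are stored as NULL (None in Python).
conn.execute("INSERT INTO Students (puid, name) VALUES (4, 'Dee')")

rows = conn.execute("SELECT puid, name, major FROM Students ORDER BY puid").fetchall()
print(rows)
```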

dimensionality reduction (A Motivation)

In classification and clustering, the features are in X^n. If n is large, then:
-There may be a large number of parameters; combined with a small data set, this can lead to large variance in output and over-fitting
-There may be irrelevant features that impact/confuse learning

(In depth) Why can big data happen now?

-Low cost: data storage costs are very low
-CPU power: we have the CPU power to process the data
-Fast access: network technologies send information very quickly
-Cloud computing: minimal manpower for access to "unlimited" storage and other computing tools
-Distributed computing: ability to spread tasks over many computers
-Government investment: $200 million from NSF, NIH, etc.
-Open source software: SQL, Hadoop, etc.
-Machine learning: we'll see this later

foreign key

the primary key (PK) of another entity with which this entity shares a relationship

Why can writing (and reading) queries be hard?

-Knowing the possible SQL syntax/grammar to use (takes practice)
-English grammar is loose, whereas SQL is strict
-Expressing exactly the logic of an English question or statement

What is the premise of data persistence?

Premise: the continuance of an effect after its cause is removed

SELECT Syntax (SQL)

SELECT field-list FROM table-list [WHERE qualification] [GROUP BY field-list [HAVING aggregate function and condition]] [ORDER BY field-list]
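
To show how the optional clauses fit together, here is a minimal `sqlite3` sketch in which WHERE filters rows before grouping, HAVING filters the groups, and ORDER BY sorts the final result. The `Enrollment` table, its columns, and the grades are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Enrollment (student TEXT, course TEXT, grade REAL)")
conn.executemany("INSERT INTO Enrollment VALUES (?, ?, ?)", [
    ("Ann", "IE 332", 3.7), ("Bob", "IE 332", 3.0), ("Cy", "IE 332", 2.3),
    ("Ann", "IE 230", 4.0), ("Bob", "IE 230", 3.0),
    ("Cy", "IE 335", 3.3),
])

query = """
    SELECT course, COUNT(*) AS n_students, AVG(grade) AS avg_grade
    FROM Enrollment
    WHERE grade >= 2.5          -- row filter, applied before grouping
    GROUP BY course             -- one result row per course
    HAVING COUNT(*) >= 2        -- group filter, applied after grouping
    ORDER BY avg_grade DESC     -- sort the final table
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The result is itself a small table: one row per surviving course, with the columns named in the field list.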

Update Data Syntax (SQL)

UPDATE table SET field = value [WHERE criteria]
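
A small `sqlite3` sketch of an update restricted by criteria is shown below; the `Students` table and its rows are hypothetical. In standard SQL, leaving out the WHERE clause would apply the change to every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (puid INTEGER PRIMARY KEY, name TEXT, major TEXT)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)",
                 [(1, "Ann", "IE"), (2, "Bob", "ME"), (3, "Cy", "ME")])

# Only the row matching the criteria is changed.
cursor = conn.execute("UPDATE Students SET major = 'IE' WHERE puid = 2")
changed = cursor.rowcount  # number of rows the statement modified

majors = dict(conn.execute("SELECT name, major FROM Students"))
print(changed, majors)
```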

weak entity sets

When the attributes of an entity are insufficient to define a primary key. We can use attributes from other, related entities: if the entity has a 1:1 relationship with another entity, it can borrow that entity's PK.

Database vs. spreadsheet

A database stores data for access by users, and typically holds more data than a spreadsheet would. Databases require access through an application for editing, whereas spreadsheets are edited directly. Spreadsheets are typically used for presentations or paperwork, whereas databases are used for larger-scale projects.

Relationship Instances

a collection of attribute values for a particular relationship

machine learning

a computational method that uses experience to improve algorithm performance for the purpose of prediction. Experience: the task is data-driven, and thus statistics, probability, and optimization play a significant role

Key

a minimal set of attributes that uniquely defines each entity. If more than one potential key exists, they are called candidate keys (e.g. PUID and a cell number could be candidate keys)

Spreadsheet

a program for tabulating data

AI

a science and technology based on disciplines such as computer science, biology, psychology, linguistics, mathematics, and engineering

what does SELECT return?

a table of results that satisfy a query

example information gathered in DSS

accessing all data sources, comparative sales information, projected revenue based on sales assumptions, consequences of alternative decisions

Left (JOIN)

all rows in the left table, plus the matching rows from the right table (NULLs where there is no match)

Right (JOIN)

all rows in the right table, plus the matching rows from the left table (NULLs where there is no match)

Full (JOIN)

all rows of both tables, matched where possible (NULLs where there is no match)

relationship

an association between two or more entities. e.g. whether two students are in a particular course

JOIN

an instruction to combine data from more than one table

decision support systems

interactive software used to aid in decision making by suggesting solutions to the problem at hand

Main Assumption of Classification

any hypothesis/learned model that does sufficiently well during training will also do well on unobserved examples (Generalization)

Entity

any individual real-world object, e.g. a student. Modeled as rectangles (ERD)

What is the premise of data independence?

applications shouldn't care how data is structured and stored.

Example/Sample

arbitrary element from a data set

batch data

big chunks, time separated

generalization

common characteristics among a set of items

database design guidelines

data redundancy (minimize) normalization (maximize) data integrity (maximize)

Potential benefits of DSS

decision quality, improved communication, cost reduction, increased productivity, time savings, improved customer and employee satisfaction

data definition language (DDL)

declare schema and CREATE TABLEs

meta-data

definitions, mappings, data about data

relational algebra

describes tables and their combinations

Examples of scientific databases

digital libraries, satellite images, simulation data.

relational calculus

expresses queries

Attribute

feature of each entity (e.g. student name, ID, date of birth) listed within rectangle

why is data persistence important for DBMS?

for evaluating database systems with respect to speed, memory needs, etc.

Goal of AI

goal: to develop computers that can think, as well as see, hear, walk, talk, and feel

Why is data redundancy important for DBMS?

redundancy is an inefficient use of both volatile and non-volatile storage, and it is difficult to ensure data consistency if the data is repeated many times in many places.

data manipulation language (DML)

insert, delete, update, query existing tables

Entity-Relationship (ER) Diagram

is a graphical representation that illustrates the entities and their relationships in a database.

Data Management Definition

is the practice of collecting, keeping and using data securely, efficiently and cost effectively

what do we mean by learn

it is a process of acquiring knowledge from observations/data and or interactions/feedback from an environment

Types of independence

logical - the data model of the system
physical - the storage structure or organization

why can big data happen now?

low cost, higher CPU power, faster access, cloud computing (more storage), distributed computing (more man-power), government investment, open source software, machine learning

composite key

a key made up of more than one attribute, possibly including attributes from another entity (as for a weak entity set)

examples of personal databases

music, photos, email archive, "desktop search"

unstructured data

no pre-set format (e.g. web pages, social media); currently most data is unstructured

first normal form (1NF)

no repeating columns

second normal form (2NF)

no repeating columns, and no non-key attribute depends on only a subset of the primary key

third normal form (3NF)

no repeating columns, no non-key attribute depends on only a subset of the primary key, and no transitive functional dependencies

Primary key

one uniquely defined key selected from the candidate keys; underlined in the diagram or indicated with PK. Choose a key that will be static throughout the lifetime of the database.

big data can't help with

potential dangers, concerns, biases and limitations of data and its analysis: chance correlation, meaning, action, easily fooled, data drift, feedback, critical thought, new data, realistic

structured data

pre-set format (e.g. banking transaction)

Testing/Inference

process of applying the model on testing data

Training

process of learning model parameters from training data

decision trees

represents the decision making process/logic for the purpose of classification

Examples of corporate databases

retail/swipe systems, supply chain management, customer relationship management.

Database management system (DBMS)

software for storing, managing, and providing access to databases

decision support system components

specialized databases, analytical models, the decision maker's insights and judgments, an interactive graphical user interface (GUI)

SQL

-standard programming language for interacting with an RDBMS
-mostly an international standard: different RDBMS implementations may subtly differ
-a declarative language
-when querying the database, the DBMS can optimize how your query result is found using relational algebra/calculus

Training Data

the given data set, used to learn model parameters

data mining

the main purpose is knowledge discovery, which is a component of some DSS. It attempts to discover patterns, trends, and correlations hidden in the data that can give a strategic business advantage: it can highlight buying patterns, reveal customer tendencies, cut redundant costs, or uncover profitable relationships and opportunities

Entity Instances

the set of attribute values for a particular entity

redundancy

the state of being not or no longer needed or useful.

what does redundancy mean for data storage?

this means that data is unnecessarily repeated.

Data persistence for data storage

this means that the data survives after the process that created it has ended. For a database to be considered persistent, it must write to non-volatile storage
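
A rough way to see persistence in practice is to write to a database file, close the connection, and read the data back through a new one. In this sketch a temporary file stands in for non-volatile storage and the second connection stands in for a later process; the file name and the `Notes` table are hypothetical.

```python
import os
import sqlite3
import tempfile

# A file on disk plays the role of non-volatile storage.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

# First "session": create a table, write a row, commit, and close.
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE Notes (body TEXT)")
conn.execute("INSERT INTO Notes VALUES ('still here')")
conn.commit()
conn.close()

# Second "session": a fresh connection still finds the data in the file.
conn2 = sqlite3.connect(path)
recovered = conn2.execute("SELECT body FROM Notes").fetchall()
conn2.close()
print(recovered)
```

Had the first connection used `":memory:"` instead of a file path, nothing would be left to read after it closed.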

primary key

unique identifier as noted in conceptual model

semi-structured data

unstructured data that can be put into a structure using format descriptions (e.g. merging different types of contract information, email)

what type of data is a Word document? what type is it once meta-data tags are added?

unstructured; semi-structured once meta-data tags are added

streaming data

very small chunks, consistent feed

characteristics of big data

volume: scale of data, velocity: analysis of data, variety: different forms of data, veracity: uncertainty of data

Inner (JOIN)

only the rows for which there is a match in both tables
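
The difference between an inner and a left join can be sketched with `sqlite3` on two small tables. The tables, columns, and rows are hypothetical; right and full joins are left out because older SQLite versions do not support them, and a right join gives the same rows as a left join with the two tables swapped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (puid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE Enrollment (puid INTEGER, course TEXT)")
conn.executemany("INSERT INTO Students VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cy")])
# puid 9 has no matching student; Cy (puid 3) has no enrollment.
conn.executemany("INSERT INTO Enrollment VALUES (?, ?)",
                 [(1, "IE 332"), (2, "IE 332"), (9, "IE 230")])

# Inner join: only rows with a match in both tables.
inner = conn.execute("""
    SELECT s.name, e.course
    FROM Students s INNER JOIN Enrollment e ON s.puid = e.puid
    ORDER BY s.puid
""").fetchall()

# Left join: every student, with NULL (None) where no enrollment matches.
left = conn.execute("""
    SELECT s.name, e.course
    FROM Students s LEFT JOIN Enrollment e ON s.puid = e.puid
    ORDER BY s.puid
""").fetchall()
print(inner)
print(left)
```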

