Purdue IE 332 Exam 2
Label
(expert) assigned identifier to sample
2. Learn: iteratively adjust model parameters to better represent gained knowledge
-A learning algorithm defines how to adjust the parameters -The algorithm may use the data multiple times
1. Choose a model to represent the knowledge. Which model(s)?
-Each model makes assumptions about what knowledge could look like -Not all models are equal, and no single model is always best -What is the practical purpose/task for the model? What meta-data or features does the data set contain? Existing expert knowledge about expected patterns in the data? Computational resources/limitations?
Relationships between entities
-Entities may be related somehow -No limit on the number of entities participating in a relationship -An entity may be involved in a number of relationships -Each relationship has a unique name -IMPORTANT: relationships indicate how data is related, not program flow or user-interaction steps with the system
Idea behind decision trees
-Split training data into branches based on some logical rule (e.g. maximize information gain at each split) -A node of the tree is the rule, branches are subsets of data satisfying parts of the rule -Continue splitting the data until some stopping criterion is met
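The "maximize information gain" rule above can be sketched in a few lines of Python; the toy data set and feature layout here are hypothetical, chosen so one feature perfectly predicts the label and the other is noise.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(rows, labels, feature):
    """Entropy reduction from splitting on a binary feature index."""
    left = [lab for row, lab in zip(rows, labels) if row[feature] == 0]
    right = [lab for row, lab in zip(rows, labels) if row[feature] == 1]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Toy data: feature 0 perfectly predicts the label, feature 1 is noise.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 1, 1]
best = max(range(2), key=lambda f: information_gain(X, y, f))  # picks feature 0
```

A full decision tree would apply this split recursively to each branch's subset until the stopping criterion is met.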
data redundancy
-entering the same data over and over again is a waste of time -wasted storage space -increases the probability of errors -errors can occur when retrieving data -degrades overall system performance (especially speed)
table
-every table has a unique name -attributes within a table have unique names -every attribute is atomic (i.e. not multivalued) -every row is unique -the order of the columns doesn't matter -the order of the rows doesn't matter
pros of naive bayes
-fast learning and prediction -extension to non-binary labels is already in the equations -if the independence assumption holds, minimal data is required
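A minimal Bernoulli-style naive Bayes sketch in pure Python, with add-one (Laplace) smoothing so unseen feature/label combinations don't zero out the product; the data and labels are hypothetical toy values.

```python
from collections import defaultdict

def train_nb(X, y):
    """Estimate P(label) and P(feature=1 | label) with add-one smoothing."""
    counts = defaultdict(int)                      # label -> number of samples
    ones = defaultdict(lambda: defaultdict(int))   # label -> feature -> count of 1s
    for row, lab in zip(X, y):
        counts[lab] += 1
        for f, v in enumerate(row):
            ones[lab][f] += v
    priors = {lab: c / len(y) for lab, c in counts.items()}
    likelihood = {lab: {f: (ones[lab][f] + 1) / (counts[lab] + 2)
                        for f in range(len(X[0]))} for lab in counts}
    return priors, likelihood

def predict_nb(priors, likelihood, row):
    """Pick the label maximizing prior * product of per-feature likelihoods."""
    def score(lab):
        s = priors[lab]
        for f, v in enumerate(row):
            p = likelihood[lab][f]
            s *= p if v == 1 else (1 - p)
        return s
    return max(priors, key=score)

X = [(1, 0), (1, 1), (0, 1), (0, 0)]
y = ["a", "a", "b", "b"]
priors, likelihood = train_nb(X, y)
```

Note how both pros show up: training is one counting pass over the data, and nothing in the equations restricts the number of distinct labels.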
data integrity
-make it difficult to insert invalid or inconsistent data -on inserting/updating/deleting an item, automatically enforce logical relationships throughout database
cons of naive bayes
-missing observations of a label (or labels) in the training data -the independence assumption is unlikely to hold strongly in reality
normalization
-proper separation/organization of data -data should not depend on a non-key attribute
ERD for weak entity sets
1) a double or thick border on the rectangle of the weak entity set 2) a double or thick border on the diamond of the identifying relationship 3) a double or thick line and arrow linking the two 4) a double or dashed underline of the weak entity set attributes that are part of the composite key
creating a weak entity set
1. declare the weak entity set's attributes 2. add the primary key of the identifying owner 3. declare the composite key using attributes from both entity sets 4. declare a foreign key constraint on the identifying owner's key (5. optionally, instruct the system to automatically delete dependent entities, e.g. rooms when their building is deleted)
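The five steps above can be sketched in SQLite via Python's `sqlite3`, using the card's building/room example (table and column names are illustrative). Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE building (bld_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""
    CREATE TABLE room (
        bld_id   INTEGER,                 -- 2. the owner's primary key
        room_no  INTEGER,                 -- 1. the weak entity's own attribute
        PRIMARY KEY (bld_id, room_no),    -- 3. composite key from both sets
        FOREIGN KEY (bld_id) REFERENCES building(bld_id)
            ON DELETE CASCADE             -- 4+5. FK constraint + auto-delete
    )""")
con.execute("INSERT INTO building VALUES (1, 'Grissom Hall')")
con.execute("INSERT INTO room VALUES (1, 101)")
con.execute("DELETE FROM building WHERE bld_id = 1")   # cascades to room
rooms_left = con.execute("SELECT COUNT(*) FROM room").fetchone()[0]  # 0
```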
big data challenges
1. your data lacks integrity 2. your data lacks meta-data 3. your back-end is cheap 4. your front-end is confusing 5. your analysts don't understand your question 6. your analysis is incomplete 7. you lack a means to interpret the analyses 8. you don't act on the analyses
relationship constraints
1:1 the element of the entity MUST exist in the relationship exactly once; 0:1 the element of the entity MAY exist in the relationship at most once; 1:M the element of the entity MUST be in the relationship at least once; 0:M the element of the entity MAY be in the relationship any number of times
Machine Learning vs AI vs Data Mining vs Statistics
AI: computers that behave and reason intelligently ML: automatically learn models of data for prediction DM: human-guided discovery of hidden patterns in a particular dataset Statistics: quantify and summarize data (These often overlap!)
What is a database?
An organized collection of data, often a model of the system in consideration, supports concurrent access to data, supports secure access to data, and supports efficient access to data.
Why is data independence important for a DBMS (database management system)?
The definition or organization of the data can change without forcing changes to the applications that use it.
(in depth) Things big data alone can't help with?
-Chance correlation: with so much data, correlations are easy to find -Meaning: we can find relationships in data, but more is needed to determine meaning or cause -Action: more data doesn't imply more knowledge -Easily fooled: many big data tools can be purposely fooled -Data drift: incoming data can cause unintentional signals -Feedback: reinforcing data (especially with respect to the web) -Critical thought: scientific-sounding answers to vague questions -New data: how to handle previously unseen data -Realistic: big data is NOT a silver bullet
Deleting Data Syntax (SQL)
DELETE FROM table [WHERE logical criteria]
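A minimal illustration via Python's `sqlite3` (table and values are hypothetical); without the optional WHERE clause, DELETE removes every row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE grades (student TEXT, score INTEGER)")
con.executemany("INSERT INTO grades VALUES (?, ?)",
                [("ann", 91), ("bob", 58), ("cal", 74)])
con.execute("DELETE FROM grades WHERE score < 60")   # removes only bob's row
remaining = con.execute("SELECT COUNT(*) FROM grades").fetchone()[0]  # 2
```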
DROP TABLE syntax (SQL)
DROP TABLE Students
Why not always use a database?
-Databases are expensive and complicated to set up and maintain -Databases are general-purpose, and not suited for special-purpose tasks like text search -A database may be overkill, e.g. for a simple tabulation task
requirements specification (ERD)
Essential: constant customer communication! Goals: identify key data, no redundant details, no unimportant details, clarify unclear natural language, obtain missing information, separate data from operations.
Generalization in ERD
Generalization is the process of extracting common properties from a set of entities and creating a generalized entity from them. It's a bottom-up approach, where two or more entities can be generalized to a higher-level entity if they have some attributes in common.
Why is managing data tough?
Global amount of data is growing exponentially, data is scattered among many different sources, it is collected by many different individuals and devices in different formats, it must be kept secure and not corrupted, and it must be easy to ask questions of the data.
Insert Syntax (SQL)
INSERT INTO table [(field names)] VALUES (field values)
Batch Syntax (SQL)
INSERT INTO table [(field names)] VALUES (field values)[,(field values)]*
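A sketch of the batch syntax via Python's `sqlite3` (table and values are hypothetical): one INSERT statement can carry several value tuples, and `executemany` gives the same batch effect with parameter placeholders.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (puid INTEGER, name TEXT)")
# Multi-row VALUES list, matching the batch syntax above:
con.execute("INSERT INTO students (puid, name) VALUES (1, 'ann'), (2, 'bob')")
# Equivalent batch insert via executemany:
con.executemany("INSERT INTO students (puid, name) VALUES (?, ?)",
                [(3, 'cal'), (4, 'dee')])
n = con.execute("SELECT COUNT(*) FROM students").fetchone()[0]  # 4
```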
dimensionality reduction (A Motivation)
In classification and clustering, the features are in R^n. If n is large, then: there may be a large number of parameters; combined with a small data set, this can lead to large variance in output and over-fitting. There may also be irrelevant features that impact/confuse learning.
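One simple flavor of dimensionality reduction, sketched in pure Python on a hypothetical data set: drop features whose variance across the training set is (near) zero, since a constant feature cannot help discriminate between classes.

```python
def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def keep_informative(rows, threshold=1e-9):
    """Return indices of features whose variance exceeds the threshold."""
    n_features = len(rows[0])
    cols = [[row[f] for row in rows] for f in range(n_features)]
    return [f for f in range(n_features) if variance(cols[f]) > threshold]

# Feature 1 is constant (irrelevant), so it is dropped.
X = [(1.0, 5.0, 0.2), (2.0, 5.0, 0.9), (3.0, 5.0, 0.4)]
kept = keep_informative(X)  # [0, 2]
```

More powerful methods (e.g. PCA) instead project the data onto a lower-dimensional subspace, but the goal is the same: fewer parameters and less confusion from irrelevant features.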
(In depth) Why can big data happen now?
-Low cost: data storage costs are very low -CPU power: we have the CPU power to process the data -Fast access: network technologies send information very quickly -Cloud computing: minimal manpower for access to "unlimited" storage and other computing tools -Distributed computing: ability to spread tasks over many computers -Government investment: $200 million from the NSF, NIH, etc. -Open source software: SQL, Hadoop, etc. -Machine learning: we'll see this later
foreign key
PK of another entity that shares a relationship
Why can writing (and reading) queries be hard?
-Choosing which SQL syntax/grammar to use (takes practice) -English grammar is loose, whereas SQL is strict -Expressing exactly the logic of an English question or statement
What is the premise of data persistence?
Premise: the continuance of an effect after its cause is removed
SELECT Syntax (SQL)
SELECT field-list FROM table-list [WHERE qualification] [GROUP BY field-list [HAVING aggregate function and condition]] [ORDER BY field-list]
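A sketch of these clauses via Python's `sqlite3` (schema and values hypothetical): WHERE filters rows, GROUP BY aggregates per group, HAVING filters groups, and ORDER BY sorts the result.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE enroll (course TEXT, student TEXT, credits INTEGER)")
con.executemany("INSERT INTO enroll VALUES (?, ?, ?)",
                [("IE332", "ann", 3), ("IE332", "bob", 3),
                 ("IE336", "ann", 3), ("IE230", "cal", 2)])
rows = con.execute("""
    SELECT course, COUNT(*) AS n
    FROM enroll
    WHERE credits = 3            -- drop 2-credit rows first
    GROUP BY course              -- aggregate per course
    HAVING COUNT(*) >= 2         -- keep only courses with 2+ matching rows
    ORDER BY course""").fetchall()
# rows == [("IE332", 2)]
```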
Update Data Syntax (SQL)
UPDATE table SET field = value WHERE criteria
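A minimal illustration via Python's `sqlite3` (table and values hypothetical); without the WHERE criteria, UPDATE would change every row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (puid INTEGER, major TEXT)")
con.executemany("INSERT INTO students VALUES (?, ?)", [(1, "IE"), (2, "ME")])
con.execute("UPDATE students SET major = 'ECE' WHERE puid = 2")  # one row only
major = con.execute("SELECT major FROM students WHERE puid = 2").fetchone()[0]
```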
weak entity sets
When the attributes of an entity are insufficient to define a primary key, we can use attributes from other, related entities: if we have a 1:1 relationship with an entity, we can borrow its PK.
Database vs. spreadsheet
A database stores data for access by users, and typically holds more data than a spreadsheet would. Databases require access through an application for editing, whereas spreadsheets are edited directly. Spreadsheets are typically used for presentations or paperwork, whereas databases are used for larger-scale projects.
Relationship Instances
a collection of attribute values for a particular relationship
machine learning
a computational method that uses experience to improve algorithm performance for the purpose of prediction. Experience means data, so this is a data-driven task and thus statistics, probability, and optimization play a significant role
Key
a minimal set of attributes that uniquely define each entity If > 1 potential key exists, they are called candidate keys (e.g. PUID and a cell number could be candidate keys)
Spreadsheet
a program for tabulating data
AI
a science and technology based on disciplines such as computer science, biology, psychology, linguistics, mathematics, and engineering
what does SELECT return?
a table of results that satisfy a query
example information gathered in DSS
accessing all data sources, comparative sales information, projected revenue based on sales assumptions, consequences of alternative decisions
Left (JOIN)
all rows in the left table, plus matching rows from the right table
Right (JOIN)
all rows in the right table, plus matching rows from the left table
Full (JOIN)
all rows of both tables, matched where possible
relationship
an association between two or more entities. e.g. whether two students are in a particular course
JOIN
an instruction to combine data from more than one table
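The JOIN variants on these cards can be sketched via Python's `sqlite3` (tables hypothetical). Only INNER and LEFT JOIN are shown, since RIGHT and FULL JOIN require SQLite 3.39+.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (sid INTEGER, name TEXT)")
con.execute("CREATE TABLE grades (sid INTEGER, score INTEGER)")
con.executemany("INSERT INTO students VALUES (?, ?)", [(1, "ann"), (2, "bob")])
con.execute("INSERT INTO grades VALUES (1, 91)")   # bob has no grade row

# Inner join: only rows with a match in both tables.
inner = con.execute("""
    SELECT s.name, g.score FROM students s
    JOIN grades g ON s.sid = g.sid""").fetchall()
# inner == [("ann", 91)]

# Left join: every left-table row, with NULL where the right table has no match.
left = con.execute("""
    SELECT s.name, g.score FROM students s
    LEFT JOIN grades g ON s.sid = g.sid
    ORDER BY s.sid""").fetchall()
# left == [("ann", 91), ("bob", None)]
```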
decision support systems
interactive software used to aid in decision making by suggesting solutions to the problem at hand
Main Assumption of Classification
any hypothesis/learned model that does sufficiently well during training will also do well on unobserved examples (Generalization)
Entity
any individual real-world object, e.g. a student. Modeled as a rectangle in an ERD.
What is the premise of data independence?
applications shouldn't care how data is structured and stored.
Example/Sample
arbitrary element from a data set
batch data
big chunks, time separated
generalization
common characteristics among a set of items
database design guidelines
data redundancy (minimize), normalization (maximize), data integrity (maximize)
Potential benefits of DSS
decision quality, improved communication, cost reduction, increased productivity, time savings, improved customer and employee satisfaction
data definition language (DDL)
declare schema and CREATE TABLEs
meta-data
definitions, mappings, data about data
relational algebra
describes tables and their combinations
Examples of scientific databases
digital libraries, satellite images, simulation data.
relational calculus
expresses queries
Attribute
feature of each entity (e.g. student name, ID, date of birth) listed within rectangle
why is data persistence important for DBMS?
for evaluating database systems with respect to speed, memory needs, etc.
Goal of AI
goal: to develop computers that can think, as well as see, hear, walk, talk, and feel
Why is data redundancy important for DBMS?
inefficient use of both volatile and non-volatile storage, and it is difficult to ensure data consistency if data is repeated many times in many places.
data manipulation language (DML)
insert, delete, update, query existing tables
Entity-Relationship (ER) Diagram
is a graphical representation that illustrates the entities and their relationships in a database.
Data Management Definition
is the practice of collecting, keeping and using data securely, efficiently and cost effectively
what do we mean by learn
the process of acquiring knowledge from observations/data and/or interactions/feedback from an environment
Types of independence
logical - the data model of the system; physical - the storage structure or organization.
why can big data happen now?
low cost, higher CPU power, faster access, cloud computing (more storage), distributed computing (spread tasks over many computers), government investment, open source software, machine learning
composite key
a key made up of more than one attribute, possibly including attributes from another entity
examples of personal databases
music, photos, email archive, "desktop search"
unstructured data
no pre-set format (web pages, social media) -currently most data is this
first normal form (1NF)
no repeating columns
second normal form (2NF)
no repeating columns; no non-key attribute depends on only part of the primary key
third normal form (3NF)
no repeating columns; no non-key attribute depends on only part of the primary key; no transitive functional dependencies
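A sketch of removing a transitive dependency, on a hypothetical schema: if one table stores student -> advisor -> advisor's office, the office depends on a non-key attribute (the advisor). 3NF splits it into two tables, queried together with a join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Violates 3NF: office depends on advisor, not on the key (puid):
#   student(puid PRIMARY KEY, advisor, advisor_office)
# 3NF decomposition:
con.execute("CREATE TABLE student (puid INTEGER PRIMARY KEY, advisor TEXT)")
con.execute("CREATE TABLE advisor (name TEXT PRIMARY KEY, office TEXT)")
con.execute("INSERT INTO advisor VALUES ('Dr. X', 'GRIS 234')")
con.executemany("INSERT INTO student VALUES (?, ?)",
                [(1, 'Dr. X'), (2, 'Dr. X')])
# The office is now stored once and recovered with a join:
office = con.execute("""
    SELECT a.office FROM student s JOIN advisor a ON s.advisor = a.name
    WHERE s.puid = 2""").fetchone()[0]
```

Note the payoff: changing an advisor's office now means updating one row, not every student row.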
Primary key
one key selected from the candidate keys; underlined in the diagram and indicated with PK. Choose a key that will be static throughout the lifetime of the database.
big data cant help with
potential dangers, concerns, biases and limitations of data and its analysis: chance correlation, meaning, action, easily fooled, data drift, feedback, critical thought, new data, realistic
structured data
pre-set format (e.g. banking transaction)
Testing/Inference
process of applying the model on testing data
Training
process of learning model parameters from training data
decision trees
represents the decision making process/logic for the purpose of classification
Examples of corporate databases
retail/swipe systems, supply chain management, customer relationship management.
Database management system (DBMS)
software for storing, managing, and providing access to databases
decision support system components
specialized databases, analytical models and decision-maker insights/judgments, and an interactive graphical user interface (GUI)
SQL
standard programming language for interacting with an RDBMS -mostly an international standard -different RDBMS implementations may subtly differ -a declarative language -when querying the database, the DBMS can optimize how your query result is found using relational algebra/calculus
Training Data
the given data set, used to learn model parameters
data mining
its main purpose is knowledge discovery, which is a component of some DSS: an attempt to discover patterns, trends, and correlations hidden in the data that can give a strategic business advantage. Can highlight buying patterns, reveal customer tendencies, cut redundant costs, or uncover profitable relationships and opportunities.
Entity Instances
the set of attribute values for a particular entity
redundancy
the state of being no longer needed or useful.
what does redundancy mean for data storage?
this means that data is unnecessarily repeated.
Data persistence for data storage
this means that the data survives after the process that created it has ended; for a database to be considered persistent, it must write to non-volatile storage
primary key
unique identifier as noted in conceptual model
semi-structured data
unstructured data that can be put into a structure using format descriptions (e.g. merging different types of contact information, email)
what type of data is a word document? what meta-data tags are added
unstructured; with meta-data tags added, it becomes semi-structured
streaming data
very small chunks, consistent feed
characteristics of big data
volume: scale of data; velocity: analysis of streaming data; variety: different forms of data; veracity: uncertainty of data
Inner (JOIN)
returns only the rows that have a match in both tables