Data Analytics: Course 3

Benefits of Open Data

- Credible databases can be used more widely
- The data can be leveraged, shared, and combined with other data

Open Data Standards

- Must be available as a whole, preferably by downloading over the internet in a convenient and modifiable form
- Must be provided under terms that allow reuse and redistribution, including the ability to use it with other datasets
- Universal participation: everyone must be able to use, reuse, and redistribute the data

Bias

A preference in favor of or against a person, group of people, or thing. Can be conscious or subconscious

Composite Key

A primary key that is constructed using multiple columns of a table ex. customer_ID and location_ID are two columns of a composite key for a customer table; the combination of values assigned to those fields in any given row must be unique within the entire table
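
A minimal sketch of a composite key, using Python's built-in sqlite3 module. The table layout and the sample values are assumptions made up for illustration; only the column names customer_id and location_id come from the definition above.

```python
import sqlite3

# Hypothetical table: the primary key is the pair (customer_id, location_id).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        customer_id INTEGER NOT NULL,
        location_id INTEGER NOT NULL,
        name TEXT,
        PRIMARY KEY (customer_id, location_id)
    )
    """
)
conn.execute("INSERT INTO customer VALUES (1, 10, 'Ana')")
# Same customer_id with a different location_id is allowed: the pair is new.
conn.execute("INSERT INTO customer VALUES (1, 20, 'Ana')")


def insert_duplicate_pair():
    """Try to reuse an existing (customer_id, location_id) pair."""
    try:
        conn.execute("INSERT INTO customer VALUES (1, 10, 'Someone else')")
        return "inserted"
    except sqlite3.IntegrityError:
        return "rejected"


duplicate_result = insert_duplicate_pair()
row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Neither column has to be unique on its own; the database only rejects a row when both values repeat together.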

Data Governance

A process to ensure the formal management of a company's data assets. Gives an organization better control of their data and helps a company manage issues related to data security and privacy, integrity, usability, and internal & external data flows

De-identification

A process used to wipe data clean of all personally identifying information

Fairness

A quality of data analysis that does not create or reinforce bias

Measurable Question

A question whose answers can be quantified and assessed

Ordinal Data

A type of qualitative data with a set order or scale ex. movie ratings (number of stars: 1, 2, 3, 4, or 5), ranked choice voting selections (1st, 2nd, 3rd), income level (low, middle, high)

Entity Relationship Diagram (ERD)

A visual way to understand the relationship between entities in the data model

Sorting Data

Arranging data into a meaningful order to make it easier to understand, analyze, and visualize. Tip: when sorting data in a spreadsheet, it's a good idea to freeze the header row first (View → Freeze → 1 Row)
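
The same idea outside a spreadsheet can be sketched in Python: set the header row aside, sort the remaining rows, then put the header back on top. The rows below are made-up sample data.

```python
# Hypothetical spreadsheet rows; the first row is the header.
rows = [
    ["movie", "stars"],
    ["B", 3],
    ["A", 5],
    ["C", 1],
]

# Keep the header out of the sort, like freezing the header row.
header, body = rows[0], rows[1:]

# Sort by the "stars" column (index 1), highest rating first.
sorted_rows = [header] + sorted(body, key=lambda row: row[1], reverse=True)
```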

Continuous Data

Data that is measured and can have almost any numeric value ex. movie runtime broken down into fractions of a second, height of kids in third grade classes, temperature

Unstructured Data

Data that is not organized in any easily identifiable manner. Has varied data types (most often qualitative) and is more difficult to search, but provides more freedom for analysis ex. audio files, video files, social media posts, emails

External Data

Data that lives and is generated outside of an organization. Valuable when analysis depends on as many sources as possible ex. national average wages for the various positions throughout your organization, credit reports for customers of an auto dealership

Internal Data

Data that lives within a company's own systems. Usually more reliable and easier to collect ex. wages of employees across different business units, sales data by store location, product inventory levels across distribution centers

Physical Data Modeling

Depicts how a database operates. A physical data model defines all entities and attributes used ex. table names, column names, and data types for a database

Types of Metadata

Descriptive, structural, and administrative

Audio File

Digitized audio storage usually in an MP3, AAC, or other compressed format

Why transform data?

Organization (easier to use), compatibility (different applications/systems can use the same data), migration (system → system), merging, enhancement, comparison

Filtering Data

Showing only the data that meets specific criteria while hiding the rest. Simplifies a spreadsheet by showing only the info we need. Menu path: Data → Create a Filter → click the filter button in the column with the data we need and apply a filter
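
As a rough Python analogue of a spreadsheet filter, the sketch below keeps only the rows that meet a condition and leaves the original data untouched. The store names, sales figures, and the helper name filter_rows are all hypothetical.

```python
# Hypothetical sales records.
rows = [
    {"store": "North", "sales": 120},
    {"store": "South", "sales": 80},
    {"store": "East", "sales": 150},
]


def filter_rows(rows, keep):
    """Return only the rows for which keep(row) is true."""
    return [row for row in rows if keep(row)]


# Show only stores with sales of at least 100; the other rows are
# hidden from the result but still present in `rows`.
high_sales = filter_rows(rows, lambda row: row["sales"] >= 100)
```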

CamelCase (SQL)

(aka camelCase) A convention for naming variables that capitalizes the first letter of every word and uses no spaces. Typically used to name tables (more so than snake_case). Alternatives: alllowercase, ALLUPPERCASE, or snake_case

Observer Bias

(aka experimenter bias or researcher bias) The tendency for different people to observe things differently ex. two scientists looking into the same microscope may see different things

Data Privacy

(aka information privacy or data protection) Preserving a data subject's information and activity any time a data transaction occurs. Covers access, use, and collection of data, and a person's legal right to their data. The subject should have:
- protection from unauthorized access to their private data
- freedom from inappropriate use of their data
- the right to inspect, update, or correct their data
- the legal right to access their data

Data Ethics: Openness

(aka open data) Free access, usage, and sharing of data. Doesn't mean we ignore the other aspects of data ethics—we should still be transparent, respect privacy, and make sure we have consent for data that's owned by others. Just means we can access, use, and share that data if it meets these high standards

Benefits of Using Metadata

- Puts data into context
- Creates a single source of truth by keeping things consistent and uniform
- Makes data more reliable by making sure it's accurate, precise, relevant, and timely → easier to identify root causes of problems that pop up

Often Anonymized Data

- Telephone numbers
- Names
- License plates/numbers
- SSNs
- IP addresses
- Photographs
- Account numbers

Elements of Metadata

- Title and description: what is the name of the file/website you are examining? What type of content does it contain?
- Tags and categories: what is the general overview of the data that you have? Is it indexed or described in a specific way?
- Who created it and when
- Who last modified it and when
- Who can access or update it

Cloud

A place to keep data online, rather than a computer hard drive

Cell Reference

A cell or a range of cells in a worksheet typically used in formulas and functions

Video File

A collection of images, audio files, and other data usually encoded in a compressed format such as MP4, M4V, AVI, or FLV

Record

A collection of related data in a data table, usually synonymous with row

Boolean Data Type

A data type with only two possible values: true or false

Metadata Repository

A database specifically created to store metadata, either physical or virtual (i.e. in the cloud). Describes where metadata came from, keeps it in an accessible form, and keeps it in a common structure. Makes it easier and faster to join multiple sources for analysis by describing the state and location of the metadata, the structure of the tables inside, how data flows through the repository, and who accesses the metadata and when

Relational Database

A database that contains a series of related tables that can be connected via their relationships. Allows data analysts to organize and link data based on what the data has in common. For two tables to have a relationship, one or more of the same fields must exist inside both tables

Foreign Key

A field within a table that's a primary key in another table; it's how one table can be connected to another. Provides a link between the data in two tables
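
A minimal sketch of a foreign key linking two tables, again with Python's built-in sqlite3 module. The customers and orders tables and their values are hypothetical. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON is run on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# customer_id is the primary key of customers...
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
# ...and a foreign key in orders, which is what connects the two tables.
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
        total REAL
    )
    """
)
conn.execute("INSERT INTO customers VALUES (1, 'Ana')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")

# Join the two tables through the shared customer_id field.
joined = conn.execute(
    "SELECT o.order_id, c.name FROM orders AS o "
    "JOIN customers AS c ON c.customer_id = o.customer_id"
).fetchall()


def insert_orphan_order():
    """Try to add an order for a customer_id that does not exist."""
    try:
        conn.execute("INSERT INTO orders VALUES (101, 999, 5.0)")
        return "inserted"
    except sqlite3.IntegrityError:
        return "rejected"


orphan_result = insert_orphan_order()
```

The rejected insert shows the link at work: an order cannot point at a customer that is missing from the customers table.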

Data Model

A model used for organizing data elements and how they relate to one another. Helps keep data consistent and provides a map of how data is organized

Sample

A part of a population that is representative of the whole population

Field

A single piece of information from a row or column of a spreadsheet; in a data table, typically a column in the table

Cookie

A small file stored on a computer that contains information about its users

Data Type

A specific kind of data attribute that tells what kind of value the data is

Data Bias

A type of error that systematically skews results in a certain direction. Can happen when: sample group doesn't represent entire population/lacks inclusivity, questions on a survey have a particular slant to influence answers, you give people a short timeframe to answer questions

Logical Data Modeling

Focuses on the technical details of a database such as relationships, attributes, and entities ex. a logical data model defines how individual records are uniquely identified in a database

Non-relational Table

All of the possible variables you might be interested in analyzing, grouped together in one table, which makes them difficult to sort through. By contrast, relational databases simplify a lot of analysis processes and make data easier to find and use across an entire database

snake_case (SQL)

All words are in lower case. Words are linked with an underscore symbol instead of spaces
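
To make the two naming conventions concrete, here is a small sketch that converts between them. The function names are hypothetical, and the snake_case conversion is deliberately simple: it splits before every capital letter, so an acronym such as "HTTPServer" would be split letter by letter.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert CamelCase or camelCase to snake_case."""
    # Insert an underscore before each capital letter that is not the
    # first character, then lowercase everything.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_camel_case(name: str) -> str:
    """Convert snake_case to CamelCase (every word capitalized, no spaces)."""
    return "".join(word.capitalize() for word in name.split("_") if word)
```

For example, to_snake_case("CustomerOrders") gives "customer_orders", and to_camel_case("customer_orders") gives "CustomerOrders".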

Digital Photo

An electronic or computer-based image usually in BMP or JPG format

Primary Key

An identifier that references a column in which each value is unique. No two rows can have the same primary key & it cannot be null/blank

Data Ethics: Consent

An individual's right to know explicit details about how and why their data will be used before agreeing to provide it. They should know answers to questions like:
- Why is the data being collected?
- How will it be used?
- How long will it be stored?

Types of Organization

Categorical, chronological, hierarchical, location

Bad Data

Data that "does NOT ROCCC":
- NOT Reliable (inaccurate, incomplete, or biased)
- NOT Original (you can't locate the original data source and you're just relying on 2nd/3rd party info)
- NOT Comprehensive (missing important info)
- NOT Current
- NOT Cited

Primary Data

Collected by a researcher from first-hand sources ex. data from an interview you conducted, from a survey returned from 20 participants, from questionnaires you got back from a group of workers

CSV

Comma-separated values. A CSV file saves data in a table format
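
A short sketch of reading CSV text with Python's built-in csv module. The column names and values are made up. Every value comes back as a string, so numbers need converting before they can be used in calculations.

```python
import csv
import io

# Hypothetical CSV content: a header line followed by one line per record.
csv_text = "store,sales\nNorth,120\nSouth,80\n"

# DictReader uses the header line as the field names for each row.
records = list(csv.DictReader(io.StringIO(csv_text)))

# CSV stores plain text, so convert before doing arithmetic.
total_sales = sum(int(record["sales"]) for record in records)
```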

Naming Conventions

Consistent guidelines that describe the content, creation date, and version of a file in its name
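
One possible way to apply a naming convention consistently is to build file names in code. The pattern below (content, creation date as YYYYMMDD, two-digit version) is an assumption chosen for illustration, not a required standard, and build_filename is a hypothetical helper.

```python
from datetime import date


def build_filename(content: str, created: date, version: int, extension: str) -> str:
    """Build a name of the form Content_YYYYMMDD_vNN.ext."""
    return f"{content}_{created:%Y%m%d}_v{version:02d}.{extension}"


example_name = build_filename("SalesReport", date(2020, 11, 23), 2, "csv")
```

Here example_name is "SalesReport_20201123_v02.csv", which states the content, creation date, and version in the name itself.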

Good Data

Data that ROCCCs!!! Reliable data sources: vetted public data sets, academic papers, financial data, governmental agency data

Open Data

Data that is available to the public

Discrete Data

Data that is counted and has a limited number of values ex. dollar amounts, number of people who visit a hospital on a daily basis, room's maximum capacity, tickets sold in the current month

Metadata

Data about data: tells you where the data comes from, when and how it was created, and what it's all about. Think of it like a reference guide: without the guide, all you have is a bunch of data with no context explaining what it means.

Tabular Data

Data arranged in rows (records) and columns (fields)

Second-party Data

Data collected by a group directly from its audience and then sold

First-party Data

Data collected by an individual or group using their own resources

Long Data

Data in which each row is one time point per subject, so each subject will have data in multiple rows. Great format for storing and organizing data when there are multiple variables for each subject at each time point that we want to observe ex. spreadsheet containing country populations with a row for each year

Wide Data

Data in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject. Allows you to easily identify and quickly compare different columns ex. spreadsheet containing country populations with a column for population in each year
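
The difference between the two layouts can be sketched by converting one into the other. The countries "A" and "B" and their population figures are placeholders, not real data, and wide_to_long is a hypothetical helper.

```python
# Hypothetical wide table: one row per country, one column per year.
wide_rows = [
    {"country": "A", "2019": 100, "2020": 110},
    {"country": "B", "2019": 200, "2020": 190},
]


def wide_to_long(rows, id_field, value_name):
    """Turn each non-id column into its own row (one row per subject per year)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            long_rows.append(
                {id_field: row[id_field], "year": column, value_name: value}
            )
    return long_rows


long_rows = wide_to_long(wide_rows, "country", "population")
```

Two wide rows with two year columns become four long rows, each holding one country, one year, and one population value.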

Structured Data

Data organized in a certain format such as rows and columns. Has defined data types (most often quantitative) and is easy to organize, search, and analyze ex. spreadsheets, relational databases, tables, tax returns, store inventory, expense reports

Third-party Data

Data provided from outside sources who didn't collect it directly

Secondary Data

Gathered by other people or from other research ex. data you bought from a local data analytics firm's customer profiles, demographic data collected by a university, census data

Conceptual Data Modeling

Gives a high-level view of the data structure, such as how data interacts across an organization ex. a conceptual data model may be used to define the business requirements for a new database

IMPORTHTML Function

Google Sheets function that enables you to import the data from an HTML table or list on a webpage. Syntax: =IMPORTHTML("URL of the webpage", "table", table number)

IMPORTRANGE Function

Google Sheets function that enables you to specify a range of cells in another spreadsheet to duplicate in the spreadsheet you are working in (you must allow access to the spreadsheet containing the data the first time you import). Syntax: =IMPORTRANGE("URL of the other spreadsheet", "sheet#!X1:Y1")

What types of data should be anonymized?

Healthcare and financial data are two of the most sensitive types of data—usually go through de-identification

Personally Identifiable Information (PII)

Information that can be used by itself or with other data to track down a person's identity

Data Ethics: Currency

Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions i.e. if your data is helping to fund a company's efforts, you should know what those efforts are all about and be given the opportunity to opt out

Descriptive Metadata

Metadata that describes a piece of data and can be used to identify it at a later point in time ex. the code (ISBN) on the spine of a library book & author/title

Structural Metadata

Metadata that indicates how a piece of data is organized and whether it's part of one or more than one data collection. Keeps track of the relationship between two things ex. how the pages of a book are put together to create different chapters & the digital document of a book manuscript (the original version of the printed book)

Administrative Metadata

Metadata that indicates the technical source of a digital asset ex. the metadata inside a photo: file type, date and time it was taken

Data Types in Spreadsheets

Number, text or string, Boolean

Foldering

Organizing your files into folders to keep project-related files together in one place

Aspects of Data Ethics

Ownership, transaction transparency, consent, currency, privacy, openness

Data Elements

Pieces of information, such as people's names, account numbers, and addresses

General Data Protection Regulation of the European Union (GDPR)

A regulation of the European Union created to help protect people and their data

ROCCC

Process for identifying good data sources:
- R: reliable (accurate, complete, and unbiased info)
- O: original (validate it with the original source)
- C: comprehensive (contains all critical info needed to answer the question)
- C: current (usefulness of data decreases as time passes)
- C: cited (who created the data set? is it part of a credible organization? when was the data last refreshed?)

Networking

Professional relationship building: meeting people on and offline to stay current with what's going on in your field. Types:
- Public meetups in your area
- Digital/social media (LinkedIn, Twitter, FB, IG)
- Data podcasts
- Blogs/online communities (O'Reilly, Kaggle, KDnuggets, GitHub, Medium)

Data Security

Protecting data from unauthorized access or corruption by adopting safety measures

Nominal Data

Qualitative data that's categorized without a set order (no sequence) ex. "yes, no, not sure", "first time customer, returning customer, regular customer", "new job applicant, existing applicant, internal applicant"

Tokenization

Replaces the data elements you want to protect with randomly generated data referred to as a "token". Original data is stored in a separate location and mapped to the tokens. Even if the tokenized data is hacked, the original data is still safe and secure in a separate location
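
A toy sketch of tokenization, assuming an in-memory dictionary stands in for the separate, secured location where the original values are kept. The TokenVault class is hypothetical; a real system would keep the mapping in a separately protected store.

```python
import secrets


class TokenVault:
    """Hypothetical vault that maps random tokens to the original values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return a random token for the value, reusing it on repeat calls."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)
        # Extremely unlikely, but never hand out the same token twice.
        while token in self._token_to_value:
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; only possible with access to the vault."""
        return self._token_to_value[token]


vault = TokenVault()
account_token = vault.tokenize("account-0001")  # made-up account number
```

The token is random, so it carries no information about the original value; recovering "account-0001" requires the mapping held in the vault.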

Archiving Old Files

The practice of moving old projects to a separate location to create an archive and cut down on clutter

Quantitative Data

Specific and objective measures of numerical facts ex. percentage of board certified doctors who are women, population of elephants in Africa, distance from Earth to Mars

Qualitative Data

Subjective and explanatory measures of qualities and characteristics ex. exercise activity most enjoyed, favorite brands of most loyal customers, fashion preferences of young adults

Interoperability

The ability of data systems and services to openly connect and share data. Key to open data's success ex. health care information systems where multiple organizations (hospitals, clinics, pharmacies, labs) need to access and share data to ensure patients get the care they need, so they have compatible databases—this is how your doctor sends your prescriptions directly to a pharmacy

Ownership

The aspect of data ethics that presumes individuals own the raw data they provide and have primary control over its usage, processing, and sharing

Observation

The attributes that describe a piece of data contained in a row of a table

Geolocation

The geographical location of a person or device by means of digital information

Data Ethics: Transaction Transparency

The idea that all data processing activities and algorithms should be completely explainable and understood by the individual who provides their data

Metadata Specialists

The people who organize and maintain company data, ensuring that it's of the highest possible quality.
- Create basic metadata identification and discovery info
- Describe the way different datasets work together
- Explain the different types of data resources
- Create standards that everyone follows and the models used to organize the data

Data Transformation

The process of changing the data's format, structure, or values. Usually involves:
- Adding, copying, or replicating data
- Deleting fields or records
- Standardizing the names of variables
- Renaming, moving, or combining columns in a database
- Joining one set of data with another
- Saving a file in a different format

Data Modeling

The process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. Three most common types are conceptual, logical, and physical

Data Anonymization

The process of protecting people's private or sensitive data by eliminating that kind of information. Typically involves blanking, hashing, or masking personal info, often by using fixed-length codes to represent data columns or hiding data with altered values
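
The three techniques named above can be sketched with Python's standard library. The function names, the sample phone number, and the salt are hypothetical. SHA-256 is used here as one example of a hash that yields a fixed-length code; short or predictable values can still be guessed by trying candidates, so the salt should be kept secret.

```python
import hashlib


def blank(value: str) -> str:
    """Blanking: drop the value entirely."""
    return ""


def mask(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Masking: hide everything except the last `visible` characters."""
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


def hash_value(value: str, salt: str) -> str:
    """Hashing: replace the value with a fixed-length 64-character code."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


phone = "555-123-4567"  # made-up phone number
masked_phone = mask(phone)
hashed_phone = hash_value(phone, salt="example-salt")
```

Here masked_phone is "********4567", and hashed_phone is always 64 characters long regardless of the length of the input.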

Interpretation Bias

The tendency to always interpret ambiguous situations in a positive or negative way ex. you get a voicemail from your boss and you assume she's angry, but you play the message for your friend and he doesn't hear anger at all

Confirmation Bias

The tendency to search for or interpret information in a way that confirms preexisting beliefs ex. being so eager to confirm a gut feeling that you only notice things that support it, ignoring all other signals

Encryption

Use of a unique algorithm (saved as a "key" which can be used to reverse the encryption) to alter data and make it unusable by users and applications that don't know the algorithm

Unified Modeling Language (UML)

Very detailed diagrams that describe the structure of a system by showing the system's entities, attributes, operations, and their relationships

Data Ethics

Well-founded standards of right and wrong that dictate how data is collected, shared, and used. Tries to get to the root of the accountability companies have in protecting and responsibly using the data they collect

Ethics

Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues

The Open Data Debate

What data should be publicly available?

Sampling Bias

When a sample isn't representative of the population as a whole. Avoid by making sure the sample is chosen at random, so all parts of the population have an equal chance of being included
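
A minimal sketch of choosing a sample at random so that every member of the population has an equal chance of being included. The population of 100 numbered members is made up, and simple_random_sample is a hypothetical helper; the optional seed only makes the draw repeatable.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, k)


# Hypothetical population: members numbered 1 to 100.
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=42)
```

By contrast, taking the first ten members of the list would over-represent whoever happens to be listed first, which is one way sampling bias creeps in.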

Redundancy

When the same piece of data is stored in two or more places

Data Ethics: Ownership

Who owns data? It isn't the organization that collected, stored, processed, and analyzed it—it's individuals who own the raw data they provide, and they have primary control over its usage, how it's processed, and how it's shared

