Data Analytics: Course 3
Benefits of Open Data
- Credible databases can be used more widely - The data can be leveraged, shared, and combined with other data
Open Data Standards
- Must be available as a whole, preferably by downloading over the internet in a convenient and modifiable form - Must be provided under terms that allow reuse and redistribution, including the ability to use it with other datasets - Universal participation: everyone must be able to use, reuse, and redistribute the data
Bias
A preference in favor of or against a person, group of people, or thing. Can be conscious or subconscious
Composite Key
A primary key that is constructed using multiple columns of a table ex. customer_ID and location_ID are two columns of a composite key for a customer table & the combination of values assigned to those fields in any given row must be unique within the entire table
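A minimal sketch of a composite key, using Python's built-in sqlite3 module and an in-memory database. The table layout and the sample rows are made up for illustration; only the customer_ID and location_ID column names come from the card above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        customer_ID INTEGER NOT NULL,
        location_ID INTEGER NOT NULL,
        name TEXT,
        PRIMARY KEY (customer_ID, location_ID)  -- the composite key
    )
    """
)

# Same customer_ID at two different locations: allowed, the pairs differ.
conn.execute("INSERT INTO customer VALUES (1, 10, 'Ana')")
conn.execute("INSERT INTO customer VALUES (1, 20, 'Ana')")

# Repeating an existing (customer_ID, location_ID) pair is rejected.
try:
    conn.execute("INSERT INTO customer VALUES (1, 10, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(duplicate_rejected, row_count)
```

Neither column has to be unique on its own; it is the pair that must not repeat.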
Data Governance
A process to ensure the formal management of a company's data assets. Gives an organization better control of its data and helps a company manage issues related to data security and privacy, integrity, usability, and internal & external data flows
De-identification
A process used to wipe data clean of all personally identifying information
Fairness
A quality of data analysis that does not create or reinforce bias
Measurable Question
A question whose answers can be quantified and assessed
Ordinal Data
A type of qualitative data with a set order or scale ex. movie ratings (number of stars: 1, 2, 3, 4, or 5), ranked choice voting selections (1st, 2nd, 3rd), income level (low, middle, high)
Entity Relationship Diagram (ERD)
A visual way to understand the relationship between entities in the data model
Sorting Data
Arranging data into a meaningful order to make it easier to understand, analyze, and visualize *when sorting data in a spreadsheet, it's always a good idea to freeze the header row (View → Freeze → 1 Row)
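The same idea can be sketched outside a spreadsheet. In this Python example the header row is set aside before sorting, which plays the role of freezing it; the store names and numbers are invented sample data.

```python
table = [
    ["store", "units_sold"],
    ["North", 120],
    ["South", 95],
    ["East", 140],
]

# Split off the header so it is never sorted into the data.
header, *body = table

# Sort the data rows by units_sold, largest first, then put the header back on top.
sorted_table = [header] + sorted(body, key=lambda row: row[1], reverse=True)

for row in sorted_table:
    print(row)
```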
Continuous Data
Data that is measured and can have almost any numeric value ex. movie runtime broken down into fractions of a second, height of kids in third grade classes, temperature
Unstructured Data
Data that is not organized in any easily identifiable manner. Has varied data types (most often qualitative) and is more difficult to search, but provides more freedom for analysis ex. audio files, video files, social media posts, emails
External Data
Data that lives and is generated outside of an organization. Valuable when analysis depends on as many sources as possible ex. national average wages for the various positions throughout your organization, credit reports for customers of an auto dealership
Internal Data
Data that lives within a company's own systems. Usually more reliable and easier to collect ex. wages of employees across different business units, sales data by store location, product inventory levels across distribution centers
Physical Data Modeling
Depicts how a database operates. A physical data model defines all entities and attributes used ex. table names, column names, and data types for a database
Types of Metadata
Descriptive, structural, and administrative
Audio File
Digitized audio storage usually in an MP3, AAC, or other compressed format
Why transform data?
Organization (easier to use), compatibility (different applications/systems can use the same data), migration (system → system), merging, enhancement, comparison
Filtering Data
Showing only the data that meets specific criteria while hiding the rest. Simplifies a spreadsheet by only showing us the info we need. Data → Create a Filter → (click the filter button in the column with the data we need and apply a filter)
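A rough Python equivalent of a spreadsheet filter, assuming invented sample rows and a hypothetical threshold of 100 units. As with a spreadsheet filter, the original rows are only hidden from the view, not deleted.

```python
rows = [
    {"store": "North", "units_sold": 120},
    {"store": "South", "units_sold": 95},
    {"store": "East", "units_sold": 140},
]

# Keep only the rows that meet the criterion; `rows` itself is left unchanged.
filtered = [row for row in rows if row["units_sold"] >= 100]

print([row["store"] for row in filtered])
```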
CamelCase (SQL)
(aka camelCase) A convention for naming variables that capitalizes the first letter of every word without spaces. Typically used to name tables (more so than snake_case). Alternatives: alllowercase or ALLUPPERCASE or snake_case
Observer Bias
(aka experimenter bias or researcher bias) The tendency for different people to observe things differently ex. two scientists looking into the same microscope may see different things
Data Privacy
(aka information privacy or data protection) Preserving a data subject's information and activity any time a data transaction occurs. Covers access, use, and collection of data & a person's legal right to their data The subject should have: - protection from unauthorized access to their private data - freedom from inappropriate use of their data - the right to inspect, update, or correct their data - the legal right to access their data
Data Ethics: Openness
(aka open data) Free access, usage, and sharing of data. Doesn't mean we ignore the other aspects of data ethics—we should still be transparent, respect privacy, and make sure we have consent for data that's owned by others. Just means we can access, use, and share that data if it meets these high standards
Benefits of Using Metadata
- Puts data into context - Creates a single source of truth by keeping things consistent and uniform - Makes data more reliable by making sure it's accurate, precise, relevant, and timely → easier to identify root causes of problems that pop up
Often Anonymized Data
- Telephone numbers - Names - License plates/numbers - SSNs - IP addresses - Photographs - Account numbers
Elements of Metadata
- Title and description: what is the name of the file/website you are examining? What type of content does it contain? - Tags and categories: what is the general overview of the data that you have? Is it indexed or described in a specific way? - Who created it and when - Who last modified it and when - Who can access or update it
Cloud
A place to keep data online, rather than a computer hard drive
Cell Reference
A cell or a range of cells in a worksheet typically used in formulas and functions
Video File
A collection of images, audio files, and other data usually encoded in a compressed format such as MP4, M4V, AVI, or FLV
Record
A collection of related data in a data table, usually synonymous with row
Boolean Data Type
A data type with only two possible values: true or false
Metadata Repository
A database specifically created to store metadata, either physical or virtual (i.e. in the cloud). Describes where metadata came from, keeps it in an accessible form, and keeps it in a common structure Makes it easier & faster to join multiple sources for analysis by describing state & location of the metadata, the structure of the tables inside, how data flows through the repository, and who accesses the metadata and when
Relational Database
A database that contains a series of related tables that can be connected via their relationships. Allows data analysts to organize and link data based on what the data has in common For two tables to have a relationship, one or more of the same fields must exist inside both tables
Foreign Key
A field within a table that's a primary key in another table; it is how one table can be connected to another. Provides a link between the data in two tables
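A small sketch of a foreign key linking two tables, using Python's built-in sqlite3 module. The customers and orders tables and their rows are hypothetical. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been run on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER REFERENCES customers(customer_id),"  # foreign key
    " total REAL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0), (101, 1, 40.0), (102, 2, 15.0)")

# The shared customer_id field is what lets the two tables be joined.
rows = conn.execute(
    "SELECT c.name, SUM(o.total) FROM customers c"
    " JOIN orders o ON o.customer_id = c.customer_id"
    " GROUP BY c.name ORDER BY c.name"
).fetchall()
print(rows)  # [('Ana', 65.0), ('Ben', 15.0)]

# An order pointing at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (103, 99, 5.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```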
Data Model
A model that is used for organizing data elements and how they relate to one another. Helps keep data consistent and provides a map of how data is organized
Sample
A part of a population that is representative of the whole population
Field
A single piece of information from a row or column of a spreadsheet; in a data table, typically a column in the table
Cookie
A small file stored on a computer that contains information about its users
Data Type
A specific kind of data attribute that tells what kind of value the data is
Data Bias
A type of error that systematically skews results in a certain direction. Can happen when: sample group doesn't represent entire population/lacks inclusivity, questions on a survey have a particular slant to influence answers, you give people a short timeframe to answer questions
Logical Data Modeling
Focuses on the technical details of a database such as relationships, attributes, and entities ex. a logical data model defines how individual records are uniquely identified in a database
Non-relational Table
All of the possible variables you might be interested in analyzing, all grouped together in one table. Difficult to sort through → relational databases simplify a lot of analysis processes and make data easier to find and use across an entire database
snake_case (SQL)
All words are in lower case. Words are linked with an underscore symbol instead of spaces
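To make the two naming conventions concrete, here is a sketch of converting between them in Python. The helper names are hypothetical, and the simple rule used here treats every capital letter as the start of a new word, so acronyms such as "ID" would be split letter by letter.

```python
import re


def camel_to_snake(name: str) -> str:
    # Put an underscore before every capital letter except the first character,
    # then lowercase the whole name.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def snake_to_camel(name: str) -> str:
    # Capitalize each underscore-separated word and join them without spaces.
    return "".join(word.capitalize() for word in name.split("_"))


print(camel_to_snake("CustomerOrders"))   # customer_orders
print(snake_to_camel("customer_orders"))  # CustomerOrders
```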
Digital Photo
An electronic or computer-based image usually in BMP or JPG format
Primary Key
An identifier that references a column in which each value is unique. No two rows can have the same primary key & it cannot be null/blank
Data Ethics: Consent
An individual's right to know explicit details about how and why their data will be used before agreeing to provide it They should know answers to questions like: - Why is the data being collected? - How will it be used? - How long will it be stored?
Types of Organization
Categorical, chronological, hierarchical, location
Bad Data
Data that "does NOT ROCCC" NOT Reliable (inaccurate, incomplete, or biased) NOT Original (you can't locate the original data source and you're just relying on 2nd/3rd party info) NOT Comprehensive (missing important info) NOT Current NOT Cited
Primary Data
Collected by a researcher from first-hand sources ex. data from an interview you conducted, from a survey returned from 20 participants, from questionnaires you got back from a group of workers
CSV
Comma-separated values. A CSV file saves data in a table format
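A short sketch of reading CSV text with Python's built-in csv module. The column names and values are invented, and the text is read from a string rather than a file to keep the example self-contained.

```python
import csv
import io

csv_text = "store,units_sold\nNorth,120\nSouth,95\n"

# DictReader uses the first line as the header and yields one dict per record.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every value in a CSV is read as text, so numbers must be converted explicitly.
total_units = sum(int(row["units_sold"]) for row in rows)

print(rows[0]["store"], total_units)  # North 215
```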
Naming Conventions
Consistent guidelines that describe the content, creation date, and version of a file in its name
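One possible naming convention, sketched in Python. The pattern used here (project name, creation date as YYYYMMDD, two-digit version) is an assumption for illustration, not a required standard, and `build_filename` is a hypothetical helper.

```python
from datetime import date


def build_filename(project: str, created: date, version: int, ext: str = "csv") -> str:
    # Content, creation date, and version all appear in the name, in a fixed order.
    return f"{project}_{created:%Y%m%d}_v{version:02d}.{ext}"


print(build_filename("SalesReport", date(2020, 11, 5), 2))  # SalesReport_20201105_v02.csv
```

Putting the date in year-month-day order means that sorting files by name also sorts them chronologically.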
Good Data
Data that ROCCCs!!! Reliable data sources: vetted public data sets, academic papers, financial data, governmental agency data
Open Data
Data that is available to the public
Discrete Data
Data that is counted and has a limited number of values ex. dollar amounts, number of people who visit a hospital on a daily basis, room's maximum capacity, tickets sold in the current month
Metadata
Data about data—tells you where the data comes from, when and how it was created, and what it's all about Think of it like a reference guide: without the guide, all you have is a bunch of data with no context explaining what it means.
Tabular Data
Data arranged in rows (records) and columns (fields)
Second-party Data
Data collected by a group directly from its audience and then sold
First-party Data
Data collected by an individual or group using their own resources
Long Data
Data in which each row is one time point per subject, so each subject will have data in multiple rows. Great format for storing and organizing data when there are multiple variables for each subject at each time point that we want to observe ex. spreadsheet containing country populations with a row for each year
Wide Data
Data in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject. Allows you to easily identify and quickly compare different columns ex. spreadsheet containing country populations with a column for population in each year
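The difference between long and wide data is easiest to see by converting one into the other. This Python sketch uses made-up population figures for two placeholder countries; the function and column names are hypothetical.

```python
# Long format: one row per country per year.
long_rows = [
    {"country": "A", "year": 2019, "population": 100},
    {"country": "A", "year": 2020, "population": 110},
    {"country": "B", "year": 2019, "population": 200},
    {"country": "B", "year": 2020, "population": 190},
]


def long_to_wide(rows):
    # Collect every year for a country into a single row, one column per year.
    wide = {}
    for row in rows:
        subject = wide.setdefault(row["country"], {"country": row["country"]})
        subject[f"population_{row['year']}"] = row["population"]
    return list(wide.values())


def wide_to_long(rows):
    # Turn each population_<year> column back into its own row.
    long = []
    for row in rows:
        for column, value in row.items():
            if column.startswith("population_"):
                year = int(column.split("_", 1)[1])
                long.append({"country": row["country"], "year": year, "population": value})
    return long


wide_rows = long_to_wide(long_rows)
print(wide_rows[0])  # {'country': 'A', 'population_2019': 100, 'population_2020': 110}
```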
Structured Data
Data organized in a certain format such as rows and columns. Has defined data types (most often quantitative) and is easy to organize, search, and analyze ex. spreadsheets, relational databases, tables, tax returns, store inventory, expense reports
Third-party Data
Data provided from outside sources who didn't collect it directly
Secondary Data
Gathered by other people or from other research ex. data you bought from a local data analytics firm's customer profiles, demographic data collected by a university, census data
Conceptual Data Modeling
Gives a high-level view of the data structure, such as how data interacts across an organization ex. a conceptual data model may be used to define the business requirements for a new database
IMPORTHTML Function
Google Sheets function that enables you to import the data from an HTML table or list on a webpage =IMPORTHTML("URL of the webpage", "table", table number)
IMPORTRANGE Function
Google Sheets function that enables you to specify a range of cells in the other spreadsheet to duplicate in the spreadsheet you are working in (must allow access to the spreadsheet containing the data the first time you import) =IMPORTRANGE("URL of the other spreadsheet", "sheet#!X1:Y1")
What types of data should be anonymized?
Healthcare and financial data are two of the most sensitive types of data and usually go through de-identification
Personally Identifiable Information (PII)
Information that can be used by itself or with other data to track down a person's identity
Data Ethics: Currency
Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions i.e. if your data is helping to fund a company's efforts, you should know what those efforts are all about and be given the opportunity to opt out
Descriptive Metadata
Metadata that describes a piece of data and can be used to identify it at a later point in time ex. the code (ISBN) on the spine of a library book & author/title
Structural Metadata
Metadata that indicates how a piece of data is organized and whether it's part of one or more than one data collection. Keeps track of the relationship between two things ex. how the pages of a book are put together to create different chapters & the digital document of a book manuscript (the original version of the printed book)
Administrative Metadata
Metadata that indicates the technical source of a digital asset ex. the metadata inside a photo: file type, date and time it was taken
Data Types in Spreadsheets
Number, text or string, Boolean
Foldering
Organizing your files into folders to keep project-related files together in one place
Aspects of Data Ethics
Ownership, transaction transparency, consent, currency, privacy, openness
Data Elements
Pieces of information, such as people's names, account numbers, and addresses
General Data Protection Regulation of the European Union (GDPR)
A regulation of the European Union created to help protect people and their data
ROCCC
Process for identifying good data sources: R - reliable (accurate, complete, and unbiased info) O - original (validate it with the original source) C - comprehensive (contain all critical info needed to answer the question) C - current (usefulness of data decreases as time passes) C - cited (who created the data set? is it part of a credible organization? when was the data last refreshed?)
Networking
Professional relationship building—meeting people on and offline to stay current with what's going on in your field Types: - Public meetups in your area - Digital/social media (LinkedIn, Twitter, FB, IG) - Data podcasts - Blogs/online communities (O'Reilly, Kaggle, KDnuggets, GitHub, Medium)
Data Security
Protecting data from unauthorized access or corruption by adopting safety measures
Nominal Data
Qualitative data that's categorized without a set order (no sequence) ex. "yes, no, not sure", "first time customer, returning customer, regular customer", "new job applicant, existing applicant, internal applicant"
Tokenization
Replaces the data elements you want to protect with randomly generated data referred to as a "token" Original data is stored in a separate location and mapped to the tokens Even if the tokenized data is hacked, the original data is still safe and secure in a separate location
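A toy illustration of tokenization in Python. The `TokenVault` class is hypothetical and keeps its mapping in memory; in practice the mapping would live in a separate, secured system. The sample value is made up.

```python
import secrets


class TokenVault:
    """Stores the original values separately from the tokens that replace them."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)  # randomly generated, reveals nothing about the value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only something with access to the vault can recover the original.
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("123-45-6789")
print(token != "123-45-6789", vault.detokenize(token))
```

A dataset that holds only the tokens contains nothing that can be reversed without the vault.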
Archiving Old Files
The practice of moving old projects to a separate location to create an archive and cut down on clutter
Quantitative Data
Specific and objective measures of numerical facts ex. percentage of board certified doctors who are women, population of elephants in Africa, distance from Earth to Mars
Qualitative Data
Subjective and explanatory measures of qualities and characteristics ex. exercise activity most enjoyed, favorite brands of most loyal customers, fashion preferences of young adults
Interoperability
The ability of data systems and services to openly connect and share data. Key to open data's success ex. health care information systems where multiple organizations (hospitals, clinics, pharmacies, labs) need to access and share data to ensure patients get the care they need, so they have compatible databases—this is how your doctor sends your prescriptions directly to a pharmacy
Ownership
The aspect of data ethics that presumes individuals own the raw data they provide and have primary control over its usage, processing, and sharing
Observation
The attributes that describe a piece of data contained in a row of a table
Geolocation
The geographical location of a person or device by means of digital information
Data Ethics: Transaction Transparency
The idea that all data processing activities and algorithms should be completely explainable and understood by the individual who provides their data
Metadata Specialists
The people who organize and maintain company data, ensuring that it's of the highest possible quality. - Create basic metadata identification and discovery info - Describe the way different datasets work together - Explain the different types of data resources - Create standards that everyone follows and the models used to organize the data
Data Transformation
The process of changing the data's format, structure, or values Usually involves: - Adding, copying, or replicating data - Deleting fields or records - Standardizing the names of variables - Renaming, moving, or combining columns in a database - Joining one set of data with another - Saving a file in a different format
Data Modeling
The process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. Three most common types are conceptual, logical, and physical
Data Anonymization
The process of protecting people's private or sensitive data by eliminating that kind of information Typically involves blanking, hashing, or masking personal info, often by using fixed-length codes to represent data columns or hiding data with altered values
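The three techniques named above (blanking, hashing, masking) can be sketched in Python as follows. The record, field names, and salt are invented for illustration, and this is not a complete anonymization scheme: hashed values drawn from a small set of possibilities can still be guessed.

```python
import hashlib


def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    # Masking: hide everything except the last `visible` characters.
    if visible <= 0:
        return fill * len(value)
    return fill * max(len(value) - visible, 0) + value[-visible:]


def hash_value(value: str, salt: str) -> str:
    # Hashing: replace the value with a fixed-length code (64 hex characters).
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def anonymize(record: dict, salt: str) -> dict:
    return {
        "name": None,                                          # blanking
        "phone": mask(record["phone"]),                        # masking
        "account_id": hash_value(record["account_id"], salt),  # hashing
        "purchase_total": record["purchase_total"],            # not personal, kept as-is
    }


record = {
    "name": "Ana Diaz",
    "phone": "555-867-5309",
    "account_id": "AC-1001",
    "purchase_total": 42.5,
}
anonymized = anonymize(record, salt="example-salt")
print(anonymized["phone"])  # ********5309
```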
Interpretation Bias
The tendency to always interpret ambiguous situations in a positive or negative way ex. you get a voicemail from your boss and you assume she's angry, but you play the message for your friend and he doesn't hear anger at all
Confirmation Bias
The tendency to search for or interpret information in a way that confirms preexisting beliefs ex. being so eager to confirm a gut feeling that you only notice things that support it, ignoring all other signals
Encryption
Use of a unique algorithm (saved as a "key" which can be used to reverse the encryption) to alter data and make it unusable by users and applications that don't know the algorithm
Unified Modeling Language (UML)
Very detailed diagrams that describe the structure of a system by showing the system's entities, attributes, operations, and their relationships
Data Ethics
Well-founded standards of right and wrong that dictate how data is collected, shared, and used. Tries to get to the root of the accountability companies have in protecting and responsibly using the data they collect
Ethics
Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues
The Open Data Debate
What data should be publicly available?
Sampling Bias
When a sample isn't representative of the population as a whole. Avoid by making sure the sample is chosen at random, so all parts of the population have an equal chance of being included
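A minimal sketch of drawing a simple random sample in Python, so that every member of the population has an equal chance of being included. The population of 1,000 numbered IDs and the sample size of 50 are arbitrary assumptions; the seed is fixed only so the draw can be repeated.

```python
import random

population = list(range(1, 1001))  # hypothetical member IDs 1..1000

rng = random.Random(42)            # seeded so the same sample can be reproduced
sample = rng.sample(population, k=50)  # draws without replacement

print(len(sample), len(set(sample)))
```

By contrast, taking only the first 50 IDs would over-represent whoever happens to be listed first.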
Redundancy
When the same piece of data is stored in two or more places
Data Ethics: Ownership
Who owns data? It isn't the organization that collected, stored, processed, and analyzed it—it's individuals who own the raw data they provide, and they have primary control over its usage, how it's processed, and how it's shared