Course 3: Week 1, Week 2, Week 3
The AND operator for Boolean data
"IF (Color="Grey") AND (Color="Pink") then buy them." The AND operator lets you stack multiple conditions.
When is it a good time to freeze a header row?
Anytime you're sorting data. You'll highlight the header row. Then, from the View menu, choose Freeze and 1 row. This locks the row in place.
What is sorting data?
Arranging data into a meaningful order to make it easier to understand, analyze, and visualize.
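Outside a spreadsheet, the same idea can be sketched in a few lines of Python. This is only an illustration: the records, the column names, and the `sort_rows` helper are made up for the example.

```python
# Hypothetical records; the cities and unit counts are invented for illustration.
sales = [
    {"city": "Denver", "units": 42},
    {"city": "Austin", "units": 17},
    {"city": "Boston", "units": 58},
]

def sort_rows(rows, column, descending=False):
    """Return a new list of rows ordered by the given column."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)

# Alphabetical order by city, then largest unit count first.
by_city = sort_rows(sales, "city")
by_units_desc = sort_rows(sales, "units", descending=True)

print([row["city"] for row in by_city])
print([row["city"] for row in by_units_desc])
```

The original list is left unchanged; `sorted` returns a new list, which is roughly like sorting a copy of the sheet.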
What are some examples of unstructured datasets?
Audio files, video files, emails, photos, and social media.
In order for data to be considered open, it has to
Be available and accessible to the public as a complete dataset. Be provided under terms that allow it to be reused and redistributed. Allow universal participation so that anyone can use, reuse, and redistribute the data.
What are the Boolean operators?
Boolean operators include AND, OR, and NOT. These operators are similar to mathematical operators and can be used to create logical statements that filter your results. Data analysts use Boolean statements to do a wide range of data analysis tasks, such as creating queries for searches and checking for conditions when writing programming code.
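A minimal Python sketch of how the three operators filter results, using the shoe-color example from these cards. The inventory, the names A to D, and the idea that one pair can contain several colors are assumptions made for the illustration.

```python
# Hypothetical shoe inventory; each pair lists the colors it contains.
shoes = [
    {"name": "A", "colors": {"Grey", "Pink"}},
    {"name": "B", "colors": {"Grey"}},
    {"name": "C", "colors": {"Pink"}},
    {"name": "D", "colors": {"Blue"}},
]

def names(rows):
    """Return just the names, to make the results easy to read."""
    return [row["name"] for row in rows]

# AND: both conditions must be true.
grey_and_pink = [s for s in shoes if "Grey" in s["colors"] and "Pink" in s["colors"]]

# OR: at least one condition must be true.
grey_or_pink = [s for s in shoes if "Grey" in s["colors"] or "Pink" in s["colors"]]

# NOT: subtract a condition from the results.
grey_not_pink = [s for s in shoes if "Grey" in s["colors"] and not ("Pink" in s["colors"])]

print(names(grey_and_pink), names(grey_or_pink), names(grey_not_pink))
```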
What is metadata?
Data about data.
What is Data Anonymization?
Data anonymization is the process of protecting people's private or sensitive data by eliminating that kind of information. Typically, data anonymization involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values.
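The three techniques named above can be sketched roughly as follows. The function names, the salt value, and the sample phone number are all hypothetical, and a real project would need a vetted anonymization approach rather than this toy version.

```python
import hashlib

def blank(_value: str) -> str:
    """Blanking: drop the value entirely."""
    return ""

def hash_value(value: str, salt: str = "example-salt") -> str:
    """Hashing: replace a value with a fixed-length code (64 hex characters)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def mask_phone(phone: str, visible: int = 4) -> str:
    """Masking: hide all but the last `visible` digits, keeping punctuation."""
    total_digits = sum(ch.isdigit() for ch in phone)
    to_hide = max(total_digits - visible, 0)
    masked = []
    for ch in phone:
        if ch.isdigit() and to_hide > 0:
            masked.append("*")
            to_hide -= 1
        else:
            masked.append(ch)
    return "".join(masked)

print(mask_phone("555-123-4567"))
print(len(hash_value("ana@example.com")))
```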
What is third party data?
Data collected from outside sources who did not collect it directly.
What is data modeling?
Data modeling is the process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models.
What is structured data?
Data organized in a certain format such as rows and columns.
What is Continuous data?
Data that is measured and can have almost any numeric value.
What is unstructured data?
Data that is not organized in an easily identifiable manner.
What is external data?
Data that lives and is generated outside an organization.
What is internal data?
Data that lives within a company's own systems.
What are the three common types of metadata?
Descriptive, structural, and administrative.
What is a relational database?
A database that contains a series of related tables that can be connected via their relationships.
What is a database?
A collection of data stored in a computer system.
What are some examples of foreign keys:
A column or group of columns in a relational database table that provides a link between the data in two tables. Refers to the field in a table that's the primary key of another table. More than one foreign key is allowed to exist in a table.
What is a data type?
A data type is a specific kind of data attribute that tells what kind of value the data is. In other words, a data type tells you what kind of data you're working with.
What is a Boolean data type?
A data type with only two possible values, such as TRUE or FALSE.
What is a metadata repository?
A database specifically created to store metadata. Metadata repositories make it easier and faster to bring together multiple sources for data analysis.
What is a foreign key?
A field within a table that is a primary key in another table. A foreign key is how one table can be connected to another. These keys are what create the relationships between tables in a relational database, which helps organize and connect data across multiple tables in the database.
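A small sketch of that relationship using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are invented for the example; `customer_id` is the primary key of `customers` and a foreign key in `orders`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE orders (
           order_id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
           item TEXT NOT NULL
       )"""
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, "pizza"), (11, 1, "salad"), (12, 2, "pizza")],
)

# The foreign key is what lets the two tables be connected in a query.
rows = conn.execute(
    """SELECT customers.name, orders.item
       FROM orders
       JOIN customers ON orders.customer_id = customers.customer_id
       ORDER BY orders.order_id"""
).fetchall()

# An order that points at a customer who does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 'soup')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rows, rejected)
```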
What is a data model?
A model that is used for organizing data elements and how they relate to one another.
What is Bias?
A preference in favor of or against a person, group of people, or thing.
Data governance
A process to ensure the formal management of a company's data assets.
What is de-identification?
A process used to wipe data clean of all personally identifying information.
What is a Sample?
A sample is a part of a population that is representative of the population.
What is a text or string data type?
A sequence of characters and punctuation that contains textual information.
What is Data Bias?
A type of error that systematically skews results in a certain direction.
What is nominal data?
A type of qualitative data that is categorized without a set order.
What is Ordinal data?
A type of qualitative data with a set order or scale.
What is SQL?
A type of query language that lets data analysts communicate with a database. Databases use a special language to communicate called a query language.
What is a primary key?
An identifier that references a column in which each value is unique. In other words, it's a column of a table that is used to uniquely identify each record within that table.
What is consent?
An individual's right to know explicit details about how and why their data will be used before agreeing to provide it.
What is CSV?
Comma-separated values. A CSV file saves data in a table format.
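To make the table format concrete, here is a short sketch that reads CSV text with Python's built-in `csv` module. The column names and values are made up, and the text is read from a string rather than a file to keep the example self-contained.

```python
import csv
import io

# Hypothetical CSV content: the first line is the header row.
csv_text = "name,city,units\nAna,Denver,42\nBen,Austin,17\n"

# DictReader uses the header row as the column names for every record.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(reader.fieldnames)
for row in rows:
    print(row["name"], row["city"], row["units"])
```

Every value comes back as text, so a number such as `units` would need converting before doing arithmetic on it.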
What are the three most common types of data modeling?
Conceptual data modeling, logical data modeling, and physical data modeling.
To track people's online activities and interests, which method of data collection is most effective?
Cookies
What are some aspects of data ethics?
Ownership, transaction transparency, consent, currency, privacy, and openness.
What are the two most popular data modeling techniques?
Entity Relationship Diagram (ERD) and the Unified Modeling Language (UML) diagram. ERDs are a visual way to understand the relationship between entities in the data model. UML diagrams are very detailed diagrams that describe the structure of a system by showing the system's entities, attributes, operations, and their relationships.
What are some examples of structured datasets?
Excel, Google Sheets, SQL, Customer data, phone records, and transaction history.
What is openness?
Free access, usage, and sharing of data. Open data must be available as a whole, preferably by downloading over the internet in a convenient and modifiable form. Open data must be provided under terms that allow reuse and redistribution including the ability to use it with other datasets. And the last area is universal participation. Everyone must be able to use, reuse, and redistribute the data.
What are the two most sensitive types of data?
Healthcare and financial data are two of the most sensitive types of data.
Data collection considerations
How the data will be collected, choose data sources, decide what data to use, how much data to collect, select the right data type, and determine the timeframe.
A CSV file makes it easier for data analysts to complete which tasks?
Import data to a new spreadsheet. Examine a small subset of a large dataset. Distinguish values from one another.
What is currency?
Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions.
What does inspecting your dataset help you do?
Inspecting your dataset will help you pinpoint what questions are answerable and what data is still missing.
To determine if a data source is cited, you should ask?
Is this dataset from a credible organization? Who created this dataset? When was the data last refreshed?
What is GDPR and why was it created?
GDPR is the General Data Protection Regulation of the European Union. It was created to protect people and their data.
What are three things you need to do before beginning an analysis?
It's important to inspect your data to determine whether it contains the specific information you need to answer your stakeholders' questions. Check for three problems: The data is not there (you have sandwich data, but you need pizza data). The data is insufficient (you have pizza data for June 1-7, but you need data for the entire month of June). The data is incorrect (your pizza data lists the cost of a slice as $250, which makes you question the validity of the dataset).
What is Long Data?
Long data is data in which each row is one time point per subject, so each subject will have data in multiple rows.
What is metadata?
Metadata is used in database management to help data analysts interpret the contents of the data within the database.
What is descriptive metadata?
Metadata that describes a piece of data and can be used to identify it at a later point in time.
What is structural metadata?
Metadata that indicates how a piece of data is organized and whether it is part of one, or more than one, data collection.
What is Administrative metadata?
Metadata that indicates the technical source of a digital asset.
What is Data Normalization?
Normalization is a process of organizing data in a relational database.
What is Personally identifiable information?
Personally identifiable information, or PII, is information that can be used by itself or with other data to track down a person's identity.
What is physical data modeling?
Physical data modeling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.
What are data elements?
Pieces of information, such as people's names, account numbers, and addresses.
What is privacy?
Preserving a data subject's information and activity any time a data transaction occurs.
What is qualitative data?
Qualitative data is usually listed as a name, category, or description.
What is ROCCC (identifying good data sources)?
R: Reliable. O: Original. C: Comprehensive. C: Current. C: Cited. Comprehensive means the data contains the important information needed to answer the question or find the solution. Citing makes the information you're providing more credible. When you're choosing a data source, think about three things: Who created the dataset? Is it part of a credible organization? When was the data last refreshed?
What does SUM mean?
SUM instructs the spreadsheet to add up the values in that range of cells. This works similarly if you wish to add across the rows instead.
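As a rough analogy in Python, a list of lists can stand in for a block of spreadsheet cells. The grid values and the cell ranges named in the comments are assumptions for the illustration.

```python
# Hypothetical grid standing in for spreadsheet cells B2:D4.
grid = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

# Adding down a column, similar to =SUM(B2:B4).
column_total = sum(row[0] for row in grid)

# Adding across a row, similar to =SUM(B2:D2).
row_total = sum(grid[0])

print(column_total, row_total)
```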
What is filtering?
Showing only the data that meets specific criteria while hiding the rest.
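A minimal sketch of the same idea in Python. The menu items, prices, and the `filter_rows` helper are invented for the example.

```python
# Hypothetical rows; the items and prices are invented for illustration.
rows = [
    {"item": "pizza", "price": 12.0},
    {"item": "salad", "price": 8.5},
    {"item": "soup", "price": 6.0},
]

def filter_rows(rows, predicate):
    """Return only the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]

# Show only items under 10; the other rows are hidden, not deleted.
under_ten = filter_rows(rows, lambda row: row["price"] < 10)

print([row["item"] for row in under_ten])
print(len(rows))
```

As with a spreadsheet filter, the underlying data is untouched; only the view changes.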
What are some ways data can be collected?
Surveys, interviews, observations, forms, and questionnaires.
What are some things that can often be anonymized?
Telephone numbers, names, license plates and license numbers, Social Security numbers, IP addresses, medical records, email addresses, photographs, and account numbers.
The NOT operator for Boolean data
The NOT operator lets you filter by subtracting specific conditions from the results. "IF (Color="Grey") AND (Color=NOT "Pink") then buy them."
The OR operator for Boolean data
The OR operator lets you move forward if either one of your two conditions is met. For the condition "If the shoes are grey or pink, you will buy them": "IF (Color="Grey") OR (Color="Pink") then buy them."
If you need a great reliable data source, check out the?
The U.S. Census Bureau, which regularly updates their information.
What is data interoperability?
The ability of data systems and services to openly connect and share data.
What is transaction transparency?
The idea that all data processing activities and algorithms should be completely explainable and understood by the individual who provides their data.
What is Data Transformation?
The process of changing the data's format, structure, or values.
What is observer bias (experimenter bias/research bias)?
The tendency for different people to observe things differently.
What is Interpretation bias?
The tendency to always interpret ambiguous situations in a positive or negative way.
What is confirmation bias?
The tendency to search for or interpret information in a way that confirms preexisting beliefs.
What is ownership in data ethics?
This answers the question "Who owns data?" It isn't the organization that invested time and money collecting, storing, processing, and analyzing it. It's individuals who own the raw data they provide, and they have primary control over its usage, how it's processed, and how it's shared.
What is first party data?
This is data collected by an individual or group using their own resources. Collecting first-party data is typically the preferred method because you know exactly where it came from.
Metadata is stored in a single, central location, and gives the company standardized information about all of its data. (True or False)
True
CSV files use plain text and are delineated by characters, such as a comma. (True or False)
True. CSV files use plain text and are delineated by characters, such as a comma. A delineator indicates a boundary or separation between two things.
You can always trust Data.gov for reliable data? (True or False)
True. Data.gov is home to the U.S. government's open data.
What are some examples of primary keys:
Used to ensure data in a specific column is unique. Uniquely identifies a record in a relational database table. Only one primary key is allowed in a table. Cannot contain null or blank values.
An unbiased sample is representative of the population being measured. Which of the following helps ensure unbiased sampling?
Using random sampling during data collection helps ensure unbiased sampling.
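A quick sketch of simple random sampling with Python's built-in `random` module. The population of 100 numbered records and the sample size of 10 are arbitrary choices for the illustration, and the fixed seed is only there to make the run repeatable.

```python
import random

# Hypothetical population: 100 numbered records.
population = list(range(1, 101))

# A seeded generator makes the example repeatable; drop the seed for real draws.
rng = random.Random(42)

# Every record has the same chance of being chosen, and none is chosen twice.
sample = rng.sample(population, k=10)

print(sorted(sample))
```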
What are some usually good data sources?
Vetted public datasets, Academic papers, and Governmental agency data.
What is Data Ethics?
Well-founded standards of right and wrong that dictate how data is collected, shared, and used.
What is Ethics?
Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues.
What is unbiased sampling?
When a sample is representative of the population being measured.
What is Sampling Bias?
When a sample isn't representative of the population as a whole.
The use of external data is particularly valuable in which circumstances?
When analysis depends on as many data sources as possible.
A data type in a spreadsheet can be one of which three things?
A data type in a spreadsheet can be one of three things: a number, a text or string, or a Boolean.
What is second-party data?
Data collected by a group directly from its audience and then sold.
What is discrete data?
Data that is counted and has a limited number of values. Discrete data isn't limited to dollar amounts. Examples of other discrete data are stars and points. When partial measurements (half-stars or quarter-points) aren't allowed, the data is discrete. If you don't accept anything other than full stars or points, the data is considered discrete.
What is Wide Data?
Wide data is data in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject.
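The difference between the two layouts is easiest to see by converting one into the other. In long data each row is one time point per subject; the sketch below reshapes such rows into wide form, with one row per subject. The subjects, years, values, and the `long_to_wide` helper are all hypothetical.

```python
# Hypothetical long data: one row per subject per year.
long_rows = [
    {"subject": "A", "year": 2020, "value": 10},
    {"subject": "A", "year": 2021, "value": 12},
    {"subject": "B", "year": 2020, "value": 7},
    {"subject": "B", "year": 2021, "value": 9},
]

def long_to_wide(rows, id_col, time_col, value_col):
    """Collapse long rows into one row per subject, one column per time point."""
    wide = {}
    for row in rows:
        # Start a new wide row the first time a subject appears.
        record = wide.setdefault(row[id_col], {id_col: row[id_col]})
        record[f"{value_col}_{row[time_col]}"] = row[value_col]
    return list(wide.values())

wide_rows = long_to_wide(long_rows, "subject", "year", "value")

for row in wide_rows:
    print(row)
```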
What is logical data modeling?
Logical data modeling focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database.
What is conceptual data modeling?
Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database.
What is population?
Population refers to all possible data values in a certain dataset.
What is quantitative data?
Quantitative data can be measured or counted and then expressed as a number. This is data with a certain quantity, amount, or range.
Metadata Repositories
• Describe the state and location of the metadata
• Describe the structures of the tables inside
• Describe how the data flows through the repository
• Keep track of who accesses the metadata and when