Unit 9 Com Sci Concepts

"I think this visualization tells me this..." * Something is more popular than something else * Something is more important than something else * Something has become more or less searched over time

"... but I'm not sure because..." * I don't know exactly how the data was collected * This might tell me people searched for green more than red, but it doesn't tell me why they do that or that green is a better color * We need more data!

What are situations when you would filter vs. clean your data?

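Filter when you only want to look at a subset of the data; clean when the data is incomplete, invalid, or inconsistent. (Remember the difference between cleaning and filtering the data.)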

Takeaways about Metadata

* It can be changed without impacting the primary data * Used for finding, organizing, and managing information * Increases effective use of data by providing extra information * Allows data to be structured and organized

Method for Cleaning Messy Data

* Look through the data manually. Find and fix messy data. * Use a program to find and fix messy data.

Key Takeaways of Charts and Data Visualizers

* Programs (like the Data Visualizer) can help process data so we can understand it and learn. * Charts and other visualizations can help both find and communicate what we've learned from data * Bar charts and histograms are two common chart types for exploring one column of data in a table.

Information we can get out of histograms

* What range of value(s) are most common in this column? * What range of value(s) are least common in this column? * What ranges of values do or do not appear in this column?

Training Data and Bias

* machine learning is only as good as the training data you put into it * if x-ray data is only collected from men, the computer's predictions may only work for men --> this blind spot in the training data creates bias

What is Machine Learning?

* type of AI * will help us to tackle problems * "how computers recognize patterns and make decisions without being explicitly programmed" * instead of programming a computer step by step, the computer can be programmed to learn just like you and me (trial and error, lots of practice) * experience means lots and lots of data * can take in any kind of data (images, audio, video, text) * can make predictions based on patterns

Big Data Definition

- "Collect huge amounts of data so we can learn even more from it" - The size of the datasets we analyze impacts how much information can be extracted - As a result, in business, science, and many other contexts people are working with increasingly big data sets - When data gets too big it can no longer be processed on one computer. Cloud computing or parallel systems are sometimes used to help process all that information. - In general, the scalability of your system is important to consider when working with big data. You want your system to keep working even as you use more and more data.

Citizen Science and Crowdsourcing Definition

- "collecting data from others so you can analyze it" - Crowdsourcing is the practice of obtaining input or information from a large number of people via the Internet. - Citizen science is research where some of the data collection is done by members of the public using their own computing devices, which helps to solve scientific problems - Crowdsourcing offers new models for collaboration, such as connecting businesses or social causes with funding - Both are examples of how human capabilities can be enhanced by collaboration via computing

Open Data Definition

- "sharing data with others so they can analyze it" - Open data is publicly available data shared by governments, organizations, and others - Making data open helps spread useful knowledge or creates opportunities for others to use it to solve problems

When does data need to be cleaned?

- Data is incomplete - Data is invalid - Multiple tables are combined into one

What is Cross Tab useful for

- Finding the most / least common combinations of values in two columns - Finding patterns across two columns - Exploring two columns when one or both are strings.

Be aware of what the data is actually showing vs. assumptions based on trends in the data

- People like dogs more than cats - People search for "dogs" more frequently than "cats" ** - There was a sharp increase in the dog population sometime between 2014 and 2015 - The popularity of dogs as pets is slightly increasing over time, while the popularity of cats is relatively flat

What leads to "messy" data?

- Users enter in different types of data ("two", 2) - Users use different abbreviations to represent the same information ("February", "Feb", "Febr") - Data may have different spellings ("color", "colour") or inconsistent capitalization ("spring", "Spring")
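A minimal sketch of using a program to fix these kinds of messy values. The lookup tables and the function names `clean_month` and `clean_count` are hypothetical, made up for illustration; a real dataset would need its own rules.

```python
# Hypothetical lookup tables (assumptions for this sketch, not from the unit).
MONTH_ALIASES = {"feb": "February", "febr": "February", "february": "February"}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3}


def clean_month(raw):
    """Map different abbreviations and capitalizations to one spelling."""
    key = raw.strip().lower().rstrip(".")
    # Unknown values are returned unchanged so their meaning is not altered.
    return MONTH_ALIASES.get(key, raw.strip())


def clean_count(raw):
    """Turn "two", "2", or 2 into the integer 2; return None if unrecognized."""
    if isinstance(raw, int):
        return raw
    text = str(raw).strip().lower()
    if text.isdigit():
        return int(text)
    return NUMBER_WORDS.get(text)


# Made-up messy rows: (month, count)
messy_rows = [("Feb", "two"), ("February", 2), ("Febr", "2")]
clean_rows = [(clean_month(month), clean_count(count)) for month, count in messy_rows]
print(clean_rows)
```

After cleaning, all three rows hold the same month spelling and the same numeric type, so they can be counted together in a chart.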

Key Takeaways of Cross Tab and Scatter plots (Reflecting on visual models)

- We can develop insights and knowledge about our world from manipulating and visualizing data, in particular by finding patterns - When investigating two columns of data we can observe patterns in how different values move together (are correlated). We cannot know for certain the cause of the correlation.

Information we can get out of bar charts:

- What value(s) are most common in this column? - What value(s) are least common in this column? - What is the unique list of values in this column?

When looking @ training data, ask:

- enough data? - does data represent all possible scenarios and users w/o bias?

Data Analysis Process

1) Choose or collect data 2) Clean and/or filter data 3) Visualize and find patterns 4) New Information

Why would someone make a histogram instead of a bar chart?

A bar chart helps to visualize the popularity of individual values in a dataset. When there is a large range of values in a dataset, this may make reading a bar chart difficult. A histogram groups values into ranges, which is ideal for visualizing numeric data such as the maximum life span of dogs in the AppLab dataset.

Limitations of Bar Charts and Histograms

Bar charts and histograms are only useful for looking at one column of data. If we want to look at relationships between two pieces of information (like time of day and happiness) we'll need ways to visualize data that look at two columns of data at the same time.

How are the questions you can investigate with scatter or crosstab charts different from the ones you can investigate with bar charts or histograms?

Bar charts and histograms can help you to answer questions regarding the frequency of values in a dataset. For example, these types of charts can distinguish which values are the most common in the data versus the least common. Scatter and crosstab charts can be used to find relationships between two columns in the data. This can help to answer questions about whether different variables in the data are correlated, although the charts alone cannot prove cause and effect.

Do you notice any patterns in which charts are or are not useful? (bar charts)

Bar charts are used to visualize how many times a value appears in a column. This can help to identify trends and patterns, and may be used to draw inferences about why certain trends exist. In the charts that are not useful, every bar height has a count of 1. These are not useful because every value in the column only appears once, making it impossible to find trends or draw conclusions about the data. In the bar charts that are useful, the bars have varying heights, which indicate that certain values are more popular than others. It is easier to draw conclusions from these charts.

Bias definition

Biased data favors some things and de-prioritizes or excludes others. Human bias can enter data, causing AI to make biased predictions. Bias depends on: - how the data was collected - who collected the data - how the data is entered

Bar Charts

Count how many times each value in the column appears and make a bar at that height.
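A minimal sketch of the counting behind a bar chart, using Python's `collections.Counter`. The `colors` column is made-up example data, and the text bars only stand in for a real chart.

```python
from collections import Counter

# Made-up column of values (an assumption for illustration).
colors = ["green", "red", "green", "blue", "green", "red"]

# Count how many times each value appears in the column.
counts = Counter(colors)

# Draw one text "bar" per value, most common first.
for value, count in counts.most_common():
    print(f"{value:<6} {'#' * count} ({count})")
```

If every value appeared only once, every bar would have height 1 and the chart would show no pattern.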

Cross Tab

Counts how often pairs of values in two columns appear.
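A minimal sketch of the counting behind a cross tab. The `rows` data (pet type and size) is made up for illustration; the idea is only that each pair of values from the two columns gets its own count.

```python
from collections import Counter

# Made-up rows with two columns: (pet, size).
rows = [
    ("dog", "large"),
    ("dog", "small"),
    ("cat", "small"),
    ("dog", "large"),
    ("cat", "small"),
]

# Count how often each pair of values appears together.
pair_counts = Counter(rows)

pets = sorted({pet for pet, _ in rows})
sizes = sorted({size for _, size in rows})

# Print a small table: one row per pet, one column per size.
print(f"{'':<6}" + "".join(f"{size:>7}" for size in sizes))
for pet in pets:
    print(f"{pet:<6}" + "".join(f"{pair_counts[(pet, size)]:>7}" for size in sizes))
```

The table grows with the number of distinct values in each column, which is why a cross tab stops being readable when either column has too many values.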

Filtering Data

Filtering data allows the user to look at a subset of the data. In Unit 5, we filtered data programmatically using traversals to gain insight into knowledge from data. Software programs with built in tools (like the Data Visualizer) can also be used to filter data.
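A minimal sketch of filtering with a traversal, written in Python rather than the App Lab environment used in the course. The `dogs` rows, the breed names, and the helper `filter_rows` are hypothetical.

```python
# Made-up rows (assumption for illustration); weights are in pounds.
dogs = [
    {"breed": "Breed A", "max_weight": 12},
    {"breed": "Breed B", "max_weight": 75},
    {"breed": "Breed C", "max_weight": 30},
    {"breed": "Breed D", "max_weight": 140},
]


def filter_rows(rows, keep):
    """Traverse every row and collect only the rows where keep(row) is True."""
    subset = []
    for row in rows:
        if keep(row):
            subset.append(row)
    return subset


# Look only at the subset of dogs weighing 30 lbs or less.
small_dogs = filter_rows(dogs, lambda row: row["max_weight"] <= 30)
print([row["breed"] for row in small_dogs])
```

Filtering leaves the original list untouched; it only selects which rows to look at.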

What bucket size did you choose? Why do you think this is the most helpful bucket size for this chart? (large range on y-axis)

I chose a bucket size of 10 because the maximum weights range from 0-10 lbs to 200-210 lbs. This range isn't so small that the data is too spread apart to find patterns, and the range isn't so large that the data is too condensed to find patterns.

What bucket size did you choose and why? (small range on y-axis)

I chose a bucket size of two because the longest lifespan of a dog in the chart is only 22 years. If the range on the y-axis is small, it is typically better to choose a smaller bucket size that allows the distribution of the data to be seen more clearly.

What is Cross Tab not useful for

If either column has too many values (the chart would be enormous)

What makes manually cleaning data challenging?

Manually cleaning the data is challenging because you have to look through every data point individually and then correct any inconsistencies.

What is a Scatter plot useful for?

Useful for: * Seeing patterns and trends between two values * Numeric data with lots of different values Not useful for: * Lots of repeated values

Scatter Plot

Shows combinations of values from two columns

Histogram

Similar to a bar chart, but first all numbers in a range or "bucket" are grouped together. For example, the chart below has a bucket size of 20 so the numbers 41, 48, and 53 would all be placed in the same bucket between 40 and 60. Histograms can only be created with numeric data but can be useful when a normal bar chart may be difficult to read.
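A minimal sketch of how values are grouped into buckets. The numbers 41, 48, and 53 and the bucket size of 20 come from the example above; the other values and the helper name `bucket_start` are assumptions added for illustration.

```python
from collections import Counter


def bucket_start(value, bucket_size):
    """Return the lower edge of the bucket that a value falls into."""
    return (value // bucket_size) * bucket_size


# 41, 48, and 53 are from the example; 12, 67, and 75 are made up.
values = [41, 48, 53, 12, 67, 75]
bucket_size = 20

# Group the values by bucket, then count each bucket like a bar chart would.
histogram = Counter(bucket_start(value, bucket_size) for value in values)

for start in sorted(histogram):
    print(f"{start}-{start + bucket_size}: {histogram[start]}")
```

With a bucket size of 20, the values 41, 48, and 53 all land in the bucket that starts at 40. A smaller bucket size would split them apart, and a larger one would merge more values together.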

What is the core idea of your topic? What is it about?

The core ideas of my topic are crowdsourcing and citizen science. Crowdsourcing is drawing data from a large number of people through the Internet, and citizen science is a type of collaborative research in which data collected by the public are applied to scientific problems.

Why do people make visualizations out of data?

Visualizations can help us: * Look at lots of data at once * See patterns that are "invisible" if you just look at the table

Think about examples of Machine Learning you may have encountered in the past such as a website that recommends what video you may be interested in watching next. Are the recommendations ever wrong or unfair? Give an example and explain how this could be addressed.

When I watch YouTube, sometimes I get video recommendations for topics that are inconsistent with my interests, like gaming or sports videos. Video recommendations may be improved by collecting more data about the types of videos that the user likes to watch. I also know that there are multiple social media platforms that allow you to select 'not interested' when you see a post or recommendation you do not like. The computer takes this into account the next time it gives you a recommendation, which helps to make a more appropriate suggestion for your taste.

Takeaways about Visualizations

When looking at visualizations, consider: * What does the data show? - fact * Why might that be the case? - opinion. Be careful when making assumptions about data: correlation does not equal causation.

Correlation does not equal causation

correlation = two things follow a similar pattern or move together; causation = one thing directly causes the other
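A minimal sketch of measuring how strongly two columns move together. The unit does not name a specific measure; the Pearson correlation coefficient is used here as an assumption, and both columns are made-up data. A value near 1 or -1 means the columns move together closely, while a value near 0 means they do not.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two columns vary together.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each column varies on its own.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up columns (assumption): the second always doubles the first.
column_a = [1, 2, 3, 4, 5]
column_b = [2, 4, 6, 8, 10]

r = pearson(column_a, column_b)
print(round(r, 3))
```

Even a coefficient very close to 1 only says the two columns move together. It does not say that one column causes the other.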

Metadata

data about data

Cleaning Data

goal is to clean data without changing the meaning

Give 2 examples of the problems / questions your topic is being used to solve / answer. (Crowdsourcing and Citizen Science)

● How is climate change influencing bird migration patterns? ● Where do small streams exist and how do they contribute to the water supply of larger bodies of water?

