Data Science Final

#5 Understand what an API is, and what REST is, and how they can be used to retrieve data

API (application programming interface): clearly defined methods of communication between software components; the building blocks a programmer puts together. For a web API, it is the part of a remote server that receives requests and sends responses.
Twitter API objects: Tweets, users, entities (metadata), places.
REST (Representational State Transfer): requests are made to a resource's URI; the response comes back as XML, HTML, JSON, etc.
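
A minimal Python sketch of the REST idea above: a request is addressed to a resource's URI, and the response body (JSON here) is decoded into native objects. The host `api.example.com`, the `tweets` resource, the query parameters, and the sample response are all hypothetical; no real API is called.

```python
import json
from urllib.parse import urlencode

# Hypothetical REST endpoint; not a real service.
BASE_URL = "https://api.example.com"


def build_request_uri(resource, **params):
    """Build the URI for a GET request to a REST resource."""
    query = urlencode(params)
    if query:
        return f"{BASE_URL}/{resource}?{query}"
    return f"{BASE_URL}/{resource}"


def parse_response(body):
    """Decode a JSON response body into Python dicts and lists."""
    return json.loads(body)


# Made-up JSON body standing in for what a server might send back.
SAMPLE_RESPONSE = '{"tweets": [{"id": 1, "user": "alice", "text": "hello"}]}'

uri = build_request_uri("tweets", user="alice", count=1)
data = parse_response(SAMPLE_RESPONSE)
```

In practice the URI would be handed to an HTTP client (for example `urllib.request.urlopen`) and the returned body passed to `parse_response`.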

What type of data is being used and how is it being collected?

All types: geographical, transport, natural, meteorological, statistical, financial, scientific, cultural.
Human-sourced information: books, art, pictures, audiovisual material; social networks, internet searches, text messages.
Process-mediated data: records of business events (registration, manufacturing a product, an order); public agencies (medical records); businesses (commercial transactions, banking records).
Machine-generated data: sensors and machines recording events in the physical world.
Descriptors: linked, network, structured, unstructured, geographic, real-time, natural language, time series, event.
Quantitative: ratio (ordered, equidistant, meaningful zero); interval (ordered, equidistant).
Qualitative: ordinal (ordered); nominal (no inherent order).

How and where can metadata be defined?

Descriptive: discovery and identification (author, title, abstract).
Structural: how compound objects are put together.
Administrative: how it was created, file type, technical information, who can access it.

Understand the type of chart visualizations and what purpose they best suit.

Bar chart: numerical data that can be categorized.
Line chart: connects numeric data points over time and shows the differences between them.
Pie chart: relative proportions and percentages.
Histograms and box plots: distribution of data.
Scatterplot: relationships between variables, trend lines.

What differentiates a 'big data' data science project from a 'normal' data science project?

Big data: analysis of a data set whose size is beyond the ability of typical database software to capture, store, merge, and analyze

How can we categorize and discuss data? Where can we get it, what types of data are there from a high and low level discussion / categorization?

Categorizing data (example):
High level: vehicle
Medium level: car vs. truck vs. SUV
Low level: Pontiac vs. Chevy Tahoe vs. Subaru Outback
Discussing data: a high-level view gives a broad overview with fewer specifics and a general description. A low-level view includes specifics and differentiating characteristics.

What issues do you have to consider when using REST e.g. the Twitter API?

The data may need cleaning, for example because of special characters or inconsistent values such as None vs. NONE.
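
A rough sketch of the kind of cleaning mentioned above, in Python. The list of spellings treated as "missing" and the sample values are assumptions made for illustration; real API data would need its own rules.

```python
import unicodedata

# Assumption: these spellings all mean "no value" in the raw data.
MISSING_MARKERS = {"", "none", "null", "n/a"}


def clean_field(value):
    """Return a normalized string, or None if the field is missing."""
    if value is None:
        return None
    text = unicodedata.normalize("NFKC", str(value)).strip()
    if text.lower() in MISSING_MARKERS:
        return None
    # Drop control and format characters (Unicode categories starting
    # with "C"), such as null bytes or zero-width spaces.
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("C")
    )


# Made-up raw values: Python None, two spellings of "NONE",
# and two strings containing invisible special characters.
raw = [None, "NONE", "none", " caf\u00e9\u200b ", "ok\x00"]
cleaned = [clean_field(v) for v in raw]
```

Treating `None`, `"NONE"`, and `"none"` as the same thing is a design choice; if the string "NONE" is a legitimate value in some field, it should be left alone.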

Why is color so important when considering visualization? (You will NOT be asked anything about Gestalt psychology nor pre-attentive processing.)

Colors should be chosen to deliver enhanced aesthetic appeal and a better user experience.

Can you talk about the similarities and differences between machine learning and data mining?

Data mining: discovering patterns in large data sets and extracting information, using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
Machine learning: creating computational systems that learn from data in order to make predictions and inferences.
Both: used for regression, classification, and clustering; both are concerned with "HOW do we learn from the data?"

Why is it useful to be able to define data from a low level point of view?

Details are often important to one's analysis. The difference between two things that could be considered the same on a "high level" could be crucial to one's ability to conduct a proper analysis.

Can you describe some of the considerations when integrating data/data sources for analysis? How would you contrast 'simple' against 'complex' data?

Considerations: differences in measurement and format.
Simple: few sources, small size.
Complex: many sources, large size.

What areas, fields, and disciplines make up data science?

Fields: statistics, machine learning, data mining, predictive analytics.
Disciplines: math, stats, operations research, computer science, machine translation, speech recognition.

When you have data across multiple sources, how can you link, relate, associate, bind, or blend the sources together?

Find common information (e.g. an identification number, name, etc.). This is especially effective if we can link two unique elements from lines of data.
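
As an illustration of linking on common information, here is a small Python sketch that joins two sources on a shared identification number. The records, the field names, and the helper `inner_join` are all made up for the example.

```python
# Two made-up sources that share an identification number.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 5.0},
    {"order_id": 12, "customer_id": 3, "total": 9.0},  # no matching customer
]


def inner_join(left, right, key):
    """Pair each right-hand row with the left-hand row sharing the same key.

    Assumes `key` is unique within `left`; rows in `right` with no match
    are dropped.
    """
    index = {row[key]: row for row in left}
    return [{**index[row[key]], **row} for row in right if row[key] in index]


linked = inner_join(customers, orders, "customer_id")
```

Order 12 is dropped because no customer has ID 3; whether unmatched rows should be kept or discarded is a decision to make for each analysis.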

What is data science?

Interdisciplinary field of scientific processes and systems to extract knowledge or insights from data

What is important to consider when looking at data formats?

JSON: syntax rules, attribute-value pairs.
CSV: tabular data, plain text, each line is a data record.
HTML: web scraping, creating web pages and applications.
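
A short Python sketch of how the JSON and CSV points above show up when reading data with the standard library. The sample strings are invented for the example.

```python
import csv
import io
import json

# Made-up sample data in each format.
json_text = '{"name": "Ada", "languages": ["Python", "R"]}'
csv_text = "name,age\nAda,36\nGrace,45\n"

# JSON: attribute-value pairs become a dict; arrays become lists.
record = json.loads(json_text)

# CSV: each line is one record, and every field is read as plain text.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Types are not stored in CSV, so numbers must be converted explicitly.
ages = [int(row["age"]) for row in rows]
```

One practical consequence: JSON keeps nesting and basic types, while CSV is flat and leaves type conversion to the reader.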

#2 Is there a single process, framework or methodology used across Data Science? If so, what are the main steps?

No, everyone has their own way of doing it. The main steps:
1. Start with a question or hypothesis
2. Get the data
3. Explore the data
4. Answer the question
5. Present and communicate findings using visualizations
6. Explore further

If using a graphical approach, e.g. a scatter plot, what do you need to consider when interpreting the visual output?

Put the outcome variable on the y-axis.

#3 What is data and how is metadata used alongside it?

Data: a set of values of qualitative or quantitative variables.
Metadata: data that describe other data.

What is the idea behind a data lake?

A single store of all data within an enterprise, ranging from raw to transformed data, which is used for reporting, visualization, analytics, and machine learning.

Longer definition of data science

Covers data in any form, structured or unstructured.

Discuss how you can identify, group, and categorize types of skills and roles across data science

Subject matter expertise, computer science, math and statistics, data engineering, scientific method, visualization.

What is an outlier?

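Values that fall outside the region of a distribution where observations commonly occur. Two common rules of thumb:
A point more than 3 standard deviations from the mean
A point more than 1.5 IQRs beyond its corresponding hinge (roughly, below the lower quartile or above the upper quartile)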
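
The two rules of thumb above can be sketched in Python as follows. This is one possible reading: quartiles from `statistics.quantiles` (inclusive method) stand in for the hinges, the sample standard deviation is used, and the data values are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values more than k IQRs below the lower or above the upper quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def sd_outliers(values, k=3.0):
    """Values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]


# Made-up measurements with one suspicious value.
data = [10, 11, 12, 11, 10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 50]

by_iqr = iqr_outliers(data)
by_sd = sd_outliers(data)
```

Note that an extreme value inflates the mean and standard deviation it is being compared against, so on small samples the 3-sd rule can miss outliers that the IQR rule flags.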

3 secondary Vs of big data, why important?

Veracity (accuracy, trustworthiness, representativeness, noise), Variability (homogeneity, constancy of meaning, e.g. the definitions of words change over time), Value (quality of insights)

3 primary Vs of big data, why important?

Volume, velocity (speed of data), variety (structured vs. unstructured, # of sources)

What kind of questions are being asked?

Why do the people who love Twitter get it, and how can we help all the other people who aren't getting it get there much faster?
Identify ways to help counselors interact effectively with teens at some of their most crucial times of need.
Could restaurant reviews posted on Yelp be a source of valuable information in the ongoing battle to prevent foodborne illnesses?
Homeowners ... 'Zestimate' ... how your home's value might have changed
Substantially improve the accuracy of predictions about how much someone is going to enjoy a movie based on their movie preferences.
What do service requests tell us about the different neighborhoods in Chicago? Can we use a neighborhood's characteristics to predict future service request volumes across the city?

How can understanding the type of your data influence the type of EDA you apply?

Categorical: range of values, frequency, histogram
Quantitative: center, modality, shape, outliers
One categorical, one quantitative: side-by-side boxplots
Two quantitative: scatterplots
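
A minimal Python sketch of how the summary depends on the data type: a frequency table for a categorical column, and center and range for a quantitative one. The function names and sample values are invented for the example.

```python
import statistics
from collections import Counter


def summarize_categorical(values):
    """Frequency table: which values occur, and how often."""
    return Counter(values)


def summarize_quantitative(values):
    """Center and range of a numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


# Made-up columns: one categorical, one quantitative.
colors = ["red", "blue", "red", "green", "red"]
heights = [150, 160, 170, 180, 190]

color_counts = summarize_categorical(colors)
height_summary = summarize_quantitative(heights)
```

Taking the mean of `colors` would be meaningless, which is the point: the variable's type determines which summaries make sense.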

Can you identify and highlight the high level aspects of regression, classification, clustering and where they could be used?

Clustering: grouping a set of objects in such a way that the objects in one group (cluster) are more similar to each other than to those in other groups. Used in data mining algorithms, pattern recognition, image analysis.
Regression: estimating the relationships among (several) variables. Used for prediction and forecasting.
Classification: identifying to which of a set of categories a NEW observation belongs, on the basis of a training set. Used, e.g., to decide whether an email is spam or to diagnose a patient.
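
To make the regression entry above concrete, here is a small Python sketch of simple linear regression fitted by ordinary least squares, then used for prediction. It covers only the one-predictor case, and the data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data that lies exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

intercept, slope = fit_line(xs, ys)
prediction = intercept + slope * 5  # forecast for a new x value
```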

How could you differentiate clustering from classification?

Clustering: putting all observations into clusters whose members are more similar to each other than to those in other clusters.
Classification: deciding which group a NEW object goes into, based on a training set.
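
The contrast can be sketched in Python with two deliberately simple techniques chosen for illustration: a nearest-centroid classifier (which needs labeled training data) and a one-dimensional k-means with two clusters (which needs no labels). Both functions and all values are made up for the example.

```python
import statistics


def classify(new_value, training):
    """Classification: assign a NEW value to the labeled group whose
    training mean is nearest."""
    centroids = {label: statistics.mean(vals) for label, vals in training.items()}
    return min(centroids, key=lambda label: abs(new_value - centroids[label]))


def cluster_two_groups(values, iterations=10):
    """Clustering: split unlabeled values into two groups (1-D k-means, k=2)."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Recompute each center; keep the old one if a group is empty.
        centers = [
            statistics.mean(g) if g else c for g, c in zip(groups, centers)
        ]
    return groups


# Classification: labels are known in advance.
training = {"short": [150, 155, 160], "tall": [180, 185, 190]}
label = classify(178, training)

# Clustering: no labels; the groups are discovered from the data.
unlabeled = [1, 2, 3, 10, 11, 12]
groups = cluster_two_groups(unlabeled)
```

The classifier returns a label that already existed in the training set, while the clustering function returns unnamed groups that still have to be interpreted.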

How can you handle missing data?

Delete it, or fill it in, for example with the mean value of the variable.
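
Both options can be sketched in a few lines of Python. Here `None` marks a missing entry, and the sample values are invented.

```python
import statistics


def drop_missing(values):
    """Option 1: delete the missing entries."""
    return [v for v in values if v is not None]


def fill_with_mean(values):
    """Option 2: replace each missing entry with the mean of the observed ones."""
    mean = statistics.mean(drop_missing(values))
    return [mean if v is None else v for v in values]


# Made-up column with two missing entries.
ages = [20, None, 30, 40, None]

dropped = drop_missing(ages)
filled = fill_with_mean(ages)
```

Deleting shrinks the data set; filling keeps its size but adds values that were never observed.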

What forms can EDA take with respect to your data?

Purposes: gain knowledge of the data, prepare it for modeling, detect mistakes, check assumptions, make a preliminary selection of models, determine relationships between explanatory variables.
EDA is ANY method of looking at data that does not include formal statistical modeling.
Non-graphical: summary statistics
Graphical: diagrammatic / pictorial
Univariate: one variable/column
Multivariate: two or more variables

What is imputation, and what issues do you need to be aware of when applying it across a dataset?

Imputation is the process of replacing missing data with substituted values.
Issues: less ecological validity, overconfidence in results, altered summary statistics.
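
A small Python sketch of the "altered summary statistics" issue, assuming mean imputation: the mean is preserved, but the spread shrinks, which is one route to overconfident results. The numbers are made up.

```python
import statistics

# Made-up column: four recorded values and four missing entries.
observed = [2, 4, 6, 8]
n_missing = 4

# Mean imputation: every missing entry gets the observed mean.
mean = statistics.mean(observed)
imputed = observed + [mean] * n_missing

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)
```

The imputed column has the same mean as the observed values but a smaller standard deviation, because the filled-in values sit exactly at the center.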

What strategies can you apply to deal with outliers? Do they have subsequent implications?

Keep, delete, or impute them.

What is Data Visualization, and what are its main aims?

A subset of computer graphics, which is a branch of computer science. The study of the visual representation of abstract data to reinforce human understanding.
Aims: inform, educate, persuade the audience.

What is EDA: Exploratory Data Analysis? How can it be used on a dataset? Forming questions vs data profiling.

Summarize the data's main characteristics and see what the data can tell us beyond the formal modeling or hypothesis-testing task.
Asking questions: formulate new hypotheses that could lead to new data collection and experiments.
Data profiling: summarize the data set with descriptive statistics and assess data quality, to decide whether to correct, discard, or handle data differently.

How could you use correlation within exploratory data analysis? Is correlation enough or would you need further analysis?

Correlation measures the extent to which two variables have a linear relationship with each other. You need further analysis, because correlation does not imply causation.
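
A short Python sketch of the Pearson correlation coefficient, written out by hand so the "linear relationship" part is visible. The data are made up: one series is an exact line, the other an exact curve.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: strength of the LINEAR relationship, from -1 to 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]  # exact straight line
curved = [x ** 2 for x in xs]     # exact, but not linear

r_linear = pearson_r(xs, linear)
r_curved = pearson_r(xs, curved)
```

The curved series is completely determined by `xs`, yet its correlation is zero. A correlation coefficient on its own can therefore hide a strong relationship, which is another reason to plot the data and analyze further.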
