Intro to Data Science with Python

How do you import the movies.csv file into a data frame named movies in Python?

import pandas as pd
movies = pd.read_csv('./movies.csv', sep=',')

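What is the difference between dict.pop(key) and del dict[key]?

pop removes the key and returns its value. del removes the key without returning anything, so we will not be able to retrieve the data back if we use del.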
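
A minimal sketch of the difference, using a hypothetical dictionary named ratings (the name and values are made up for illustration):

```python
# Hypothetical example data, not from the original card.
ratings = {"toy_story": 4.5, "jumanji": 3.0}

# pop removes the key and hands the value back.
popped = ratings.pop("toy_story")

# del removes the key but returns nothing.
del ratings["jumanji"]

print(popped)   # 4.5
print(ratings)  # {}
```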

the command to draw line graph in pandas

df.plot()

the command to plot bar, box, histogram charts in pandas

df.plot.bar() df.plot.box() df.plot.hist()

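What's the difference between .merge() and concat with axis=1?

merge joins the frames on matching key values (like a SQL join), so a shared key column appears only once in the result. concat with axis=1 places the frames side by side, aligned on the index, and keeps every column, including duplicated ones.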
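
As a rough illustration, the sketch below builds two small hypothetical frames that share a column named "id" (the frames and column names are assumptions, not part of the original card) and compares the columns each call produces:

```python
import pandas as pd

# Hypothetical frames that share a key column named "id".
left = pd.DataFrame({"id": [1, 2], "title": ["A", "B"]})
right = pd.DataFrame({"id": [1, 2], "rating": [4.0, 3.5]})

# merge joins on the key, so "id" appears once.
merged = pd.merge(left, right, on="id")

# concat with axis=1 glues the frames side by side, so "id" appears twice.
stacked = pd.concat([left, right], axis=1)

print(list(merged.columns))   # ['id', 'title', 'rating']
print(list(stacked.columns))  # ['id', 'title', 'id', 'rating']
```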

how to drop rows with null values in dataframe movies?

movies = movies.dropna()

the command to check if dataframe movies has any NULL values

movies.isnull().any()

ndarray is immutable or mutable?

mutable

How can we create a rank 2 ndarray?

np.array([[items],[items]])

how to sort entries in order?

df.sort_values(by=' ', ascending=True) [ : ]

how to convert epoch time to regular time in pandas?

pd.to_datetime( , unit='s')
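
A small sketch of the conversion, assuming a hypothetical data frame named tags with a timestamp column holding Unix epoch seconds (both names are illustrative):

```python
import pandas as pd

# Hypothetical column of Unix timestamps, in seconds since 1970-01-01.
tags = pd.DataFrame({"timestamp": [0, 86400]})

# unit='s' tells pandas the integers are seconds, not nanoseconds.
tags["parsed_time"] = pd.to_datetime(tags["timestamp"], unit="s")

print(tags["parsed_time"].iloc[1])  # 1970-01-02 00:00:00
```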

How to assign labels to a plot chart?

plt.xlabel('namex') plt.ylabel('namey')

List some common string related functions in python

upper, lower, strip, split, find, format

In python, we have x and y such that x == y is TRUE but then "x is y" is false. Give one example to explain that case.

x and y refer to different objects in memory: == compares values, while is compares object identity. For example, x = 1 (an int object) and y = 1.0 (a float object) compare equal, but they are two different objects. Two separately built lists with the same contents behave the same way.
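
One possible example, with values chosen only for illustration:

```python
# Equal values, different types, therefore different objects.
x = 1      # an int object
y = 1.0    # a float object
print(x == y)  # True
print(x is y)  # False

# Two separately built lists with the same contents.
a = [1, 2]
b = [1, 2]
print(a == b)  # True
print(a is b)  # False
```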

what is the difference between .concat and .append ?

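.append was a DataFrame method (df.append(other)) that returned a new dataframe with the rows of the other dataframes added after those of the calling dataframe. pd.concat is a top-level function that takes a list of dataframes and can combine them along either axis. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the one to use now.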

List some statistical functions

.mean() .sum() .median() .unique() .intersect1d() .union1d() .setdiff1d() .in1d()
axis=1 --> compute across the columns, giving one result per row
axis=0 --> compute down the rows, giving one result per column

How to read and write text to/from disk in python?

np.savetxt() writes an array to a text file; np.loadtxt() reads it back

what function should we use to count unique value occurring in the input?

.value_counts()

List 5 key steps of a data sci process

+ Acquire (finding, accessing, acquiring, moving data)
+ Prepare (preliminary analysis, pre-processing such as cleaning, subsetting, filtering, packaging...)
+ Analyze
+ Report
+ Act

What is the difference between list.append and list.extend?

Append adds a single element to the end of the list. Extend takes all of the elements from one list and adds them, one by one, to the end of the other. So, if you append a list to another list, that whole list becomes a single element of the other list: A = [a1, a2, ..., [list B]]
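
A short sketch with made-up values:

```python
a = [1, 2]
a.append([3, 4])   # the whole list becomes one element
print(a)           # [1, 2, [3, 4]]

b = [1, 2]
b.extend([3, 4])   # each element is added individually
print(b)           # [1, 2, 3, 4]
```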

Why is merging dataframes very common in pandas?

Because data is usually distributed across different locations and tables. Combining data from distinct dataframes also helps with obtaining the big picture.

Strings in python are immutable. What does it mean by "immutable" ?

Can't be changed

At the "Exploring data" phase, the first step is preliminary investigation in which we focus on what characteristics of data?

Correlation (using a correlation graph to explore dependencies)
Trends
Outliers (double-check for errors in the data)

What can Folium be used for?

Creating geographic overlays

Why doing data sci?

Data --> Insight --> Action

In data preprocessing phase, the data transformation is also known as?

Data wrangling, data munging, data preprocessing

What could be the most useful data structure in Python for Data Sci ? Give a brief description of that data structure

Dictionary. A dictionary holds key-value pairs whose values can be any kind of object (even a list or another dictionary). A key can be a unique id or a unique tuple. Dictionaries are also very fast at lookups. Keys must be hashable, which in practice means immutable types such as strings, numbers, and tuples of immutable items.

Some operations in data pre-processing?

Dimensionality reduction Data manipulation Transformation Feature selection Scaling

what movies.head() will do?

Display the first 5 entries in the data frame movies

Describe benefits of histograms, boxplots, linegraphs,

Histograms show the distribution of the data and can show skewness or unusual dispersion. Boxplots are another type of plot for showing data distribution. Line graphs are useful for seeing how values in your data change over time.

The first step of acquiring data is to?

Identify what data is available and that means 2 things: - Identify what data is relevant to the problem - make use of all of those relevant data

Issues with raw data?

Inconsistent values Duplications Missing values Invalid data Outliers

A text block in a Jupyter notebook supports Markdown, but also?

Latex and Html

What does this command do df['three'] = df['one'] * df['two']

It creates a new column 'three' in the frame whose values are the values of column 'one' multiplied by the values of column 'two', row by row.

What does plt.subplots() do?

It creates a figure and its axes and returns them separately, as in fig, ax = plt.subplots()

what does data_frame.describe() do ?

It gives you some basic summary statistics for your data set. Those are count, mean, std, min, 25%, 50%, 75%, max (plus the name and dtype shown when a single column is described)

When you want to do something with a string, the best practice is?

Look to see if there is already a pre-built function to handle that task for you

What are range and standard deviation?

Measures of spread in data

To write equations in Latex, we must?

Put the content between dollar signs: $content$ for an inline equation, or $$content$$ for a displayed one

Numeric variables are also called quantitative variables. What else can we call categorical variables?

Qualitative variables or nominal variables

What is data aggregation?

Reducing noise and variability

What 2 things are most important in creating elegant visualizations?

Relevant Lean

Ways to solve issues with raw data

Remove data with missing values
Merge duplicated records
Generate best estimate for invalid values
Remove outliers
*Note that all these actions must be based on domain knowledge

Activities in feature selection?

Remove, combine, add, create features

What is a "sample" and "variable" in ML ?

Sample is basically a row with data in a dataset, aka record, observation, instance, example.. Variable is basically columns, sometimes referred to as "features", attribute, dimension, field

Which graphing method should you use to visualize the correlation between two arrays?

Scatter plot

Sets are useful because?

Sets are unordered
Support math operations (union, intersection, difference)
Only allow unique elements

what does pandas.concat([ ,]) do?

Stack dataframes

List some statistical functions other than the basic/core statistical functions

count() clip() rank() round()

In building a decision tree, what is the "greedy approach" ?

The approach of considering only the best way to split the portion of data at the current node into subsets, without looking ahead to later splits

Data visualization is important since human visual intelligence is very powerful. However, data visualizations are meaningless without?

the context

What is "mode"?

the value that occurs most frequently in your data set.

How to build a list of 5 random ints in the range between 0 and 10?

>>> import random
>>> numbers = [random.randint(0, 10) for _ in range(5)]
(randint includes both endpoints, so 0 and 10 can both occur)

Jupyter notebook supports Linux commands. In order to execute Linux commands, we must?

Type "!" first, for example !ls

What is the result of the following line of code? import numpy as np np.unique([1,1,3,4,2,3,3])

array([1,2,3,4])

What is the output of the following broadcasting call? A = np.array([[1],[2]]) B = np.array([[1,2],[3,4]]) A + B

array([[2, 3], [5, 6]])
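
The sketch below reproduces the call and uses np.broadcast_to to show, for illustration only, how A is stretched to B's shape before the element-wise addition (the names stretched and result are not part of the original card):

```python
import numpy as np

A = np.array([[1], [2]])          # shape (2, 1)
B = np.array([[1, 2], [3, 4]])    # shape (2, 2)

# Broadcasting repeats A's single column across B's two columns.
stretched = np.broadcast_to(A, B.shape)
print(stretched)
# [[1 1]
#  [2 2]]

result = A + B
print(result)
# [[2 3]
#  [5 6]]
```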

After identifying the problem, data sci team needs to assess the situation which includes?

+ Risks, costs, benefits, regulations, backup plans
+ Requirements, assumptions, constraints, resources ...

What are the steps to construct a decision tree?

+ Start with all samples at the root node
+ Partition the samples into subsets (records that are purest, aka homogeneous)
+ Repeat to partition data into successively purer subsets (induction)

Why is numpy very useful for data sci?

+ It supports multi-dimensional arrays (matrices)
+ It has built-in array operations (optimized statistical operations)
+ Simplified but powerful array interactions --> broadcasting
+ Allows integration with other languages like Fortran, C, C++ (i.e. further code optimizations)
+ Fast (numpy arrays can be roughly 10 times faster than Python lists for numeric work)
+ Many useful packages are built on top of numpy

Common traits of data scientists?

+ Passionate about the meaning behind data
+ Understand the problem they are trying to solve
+ Care about engineering solutions
+ Curious
+ Communicate with teammates

What are the main benefits of Pandas?

+ All the benefits of numpy
+ Data variety support
+ Data integration (of large data sets, merge, joins..)
+ Data transformation
+ Data visualizations
+ Support for time-series data
+ The ability to use native methods
+ Descriptive statistics

List some reasons why Jupyter notebook is good for Data Science?

+ Allows documenting of the process by combining notes, code and graphics, allowing others to understand the motives behind each step
+ Allows replication and inspection of methods
+ Supports Julia, R and Python (the most commonly used languages in data science)

What are the two main ways to categorize data visualizations?

+ Conceptual or data-driven
+ Declarative or exploratory (the supply and demand graph)

Why is Python the best for data science?

+ Easy to learn
+ Open language with a strong community
+ Tons of libraries applicable to every step in data science
+ Reproducible, repeatable, built-in training

In order to promote speed, numpy arrays have some limitations compared with python's list. What are those limitations?

+ Fixed in size
+ All elements of an array must be the same type

What is Data Science?

+ The basis of empirical research
+ Exploratory data analysis and modeling
+ A continuous process
+ The intersection of computer science, mathematics, and business or scientific expertise
+ A team sport

According to Andy Kirk, what makes good data visualization?

+ Trustworthy (data is honestly portrayed)
+ Accessible (focusing on your audience)
+ Elegant

List 3 ways to retrieve items in a series

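+ by index label, with loc: ser.loc[['nancy','bob']]
+ by plain square brackets: ser[[4, 3, 1]] (the keys are looked up as index labels; older pandas versions fall back to integer positions when the index is not integer-based)
+ by integer position, with iloc: ser.iloc[2]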

Some quick ways to spot issues with data set?

+ min and max are out of range (more here)

With dict[key], Python raises a KeyError at run time if the key does not exist in the dictionary. How can we check for the existence of the key?

+ use dict.get(key) and see if it returns None (this cannot tell a missing key from a key whose value is None)
+ use key in dict and test for True or False
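
A minimal sketch of both checks, using a hypothetical dictionary named prices. It also shows one case where get alone is ambiguous, namely a key that is present but stores None:

```python
# Hypothetical data; "pear" is present but its value is None.
prices = {"apple": 1.2, "pear": None}

# Membership test: answers exactly "is the key there?"
print("apple" in prices)  # True
print("kiwi" in prices)   # False

# get returns None for a missing key...
print(prices.get("kiwi") is None)  # True
# ...but also for an existing key whose value is None.
print(prices.get("pear") is None)  # True
print("pear" in prices)            # True
```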

What are the methods to get values from a dictionary named "dict" ?

. x = dict[(key)] . x = dict.get(key) . x = dict.pop(key)

Different ways to initialize ndarray with values

.zeros .full (filled with specified value) .eye (fill diagonally) .ones .random.random
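
The calls above could be used roughly as follows (the shapes and fill value are arbitrary choices for illustration):

```python
import numpy as np

zeros = np.zeros((2, 2))          # all elements 0.0
ones = np.ones((1, 2))            # all elements 1.0
filled = np.full((2, 2), 9.0)     # every element set to the given value
identity = np.eye(2)              # 1.0 on the diagonal, 0.0 elsewhere
noise = np.random.random((2, 2))  # uniform random floats in [0.0, 1.0)

print(filled)
# [[9. 9.]
#  [9. 9.]]
print(identity)
# [[1. 0.]
#  [0. 1.]]
```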

syntax to call a method in python

<var_name>.<method_name>(params)

What is a machine learning model?

A mathematical model or a parametric function over the input

The two data structures behind Pandas power are Pandas Series and Pandas Data Frame. Describe those data structures.

A Series is a one-dimensional array-like object that provides us with many ways to index data. A Series acts like an ndarray, but it supports many data types as a part of the array. A DataFrame is a 2-D elastic (size-mutable) data structure that supports heterogeneous data with labeled axes for rows and columns.

Why there are different types of analysis techniques?

Because there are different types of problems

Two main goals in data preprocessing step?

Clean and Transform

Are tuples in Python mutable or immutable?

Immutable

What is feature selection?

Selecting the features that will have the biggest impact towards the problem being solved

What is "summary statistic" and what can it contribute in "Exploring data" phase?

Summary statistics consist of mean, median, mode, range, and standard deviation. They capture various characteristics of a set of values with a single number or a small set of numbers.

In decision tree, what is the depth, and the size?

The depth of a decision tree is the number of edges in the longest path from the root node to the leaf node. The size of a decision tree is the number of nodes in the tree.
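
To make the two definitions concrete, here is a sketch that computes both on a small hypothetical tree stored as nested dictionaries. The representation and the function names depth and size are assumptions made for this example, not a library API:

```python
# Hypothetical tree: each node is a dict listing its child nodes.
tree = {"children": [
    {"children": []},                    # leaf
    {"children": [                       # internal node
        {"children": []},                # leaf
        {"children": []},                # leaf
    ]},
]}

def depth(node):
    """Number of edges on the longest path from this node down to a leaf."""
    if not node["children"]:
        return 0
    return 1 + max(depth(child) for child in node["children"])

def size(node):
    """Total number of nodes in the subtree rooted at this node."""
    return 1 + sum(size(child) for child in node["children"])

print(depth(tree))  # 2
print(size(tree))   # 5
```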

an_array[1, :] gives [21 22 23 24], while an_array[1:2, :] gives [[21 22 23 24]]. What is the difference between the two arrays?

The first one is a rank 1 array with shape (4,). The second one is a rank 2 array with shape (1, 4).
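
A sketch that checks this, using a hypothetical an_array whose row 1 matches the values in the question (the other rows are made up):

```python
import numpy as np

# Hypothetical array; only row 1 is taken from the question.
an_array = np.array([[11, 12, 13, 14],
                     [21, 22, 23, 24],
                     [31, 32, 33, 34]])

row_rank1 = an_array[1, :]    # integer index drops that dimension
row_rank2 = an_array[1:2, :]  # slice keeps that dimension

print(row_rank1.shape, row_rank1.ndim)  # (4,) 1
print(row_rank2.shape, row_rank2.ndim)  # (1, 4) 2
```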

In decision tree, what is the depth of a node?

The number of edges from the root node to that node

What should you do if you want to find unique elements shared between 2 sets?

We use the intersection method or the & operator: both = set1.intersection(set2) or both = set1 & set2
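
For example, with two small made-up sets:

```python
set1 = {1, 2, 3}
set2 = {2, 3, 4}

both_method = set1.intersection(set2)  # method form
both_operator = set1 & set2            # operator form

print(both_method == both_operator)  # True
print(sorted(both_method))           # [2, 3]
```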

Describe the outputs of any(), all()

a.any(): returns whether ANY element is True. Can detect very quickly whether any cell matches a condition
a.all(): returns whether ALL elements are True. Can detect very quickly whether a whole column or row matches a condition

Describe the outputs of mean(), std()

a.mean() : output series or DataFrame with the mean values a.std(): series or dataframe with the standard deviation value, normalized by N-1

What is the correct way to access elements of an array "arr" that are less than 0?

arr[arr<0]
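
This works in two steps, sketched below with a made-up array: the comparison builds a boolean mask, and indexing with that mask keeps only the elements where it is True.

```python
import numpy as np

arr = np.array([3, -1, 0, -7, 5])

mask = arr < 0            # element-wise comparison
print(mask)               # [False  True False  True False]

negatives = arr[arr < 0]  # keep elements where the mask is True
print(negatives)          # [-1 -7]
```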

Why is it a bad practice when you mutate the data while iterating through a dictionary? What could be the best solution?

Because you may change the structure of the data while it is being traversed; adding or removing keys during iteration can raise a RuntimeError. The best solution is a two-step process:
+ Iterate over the dictionary based on a set of search criteria and build a list of results
+ Mutate the dict items based on the list of results from the previous step
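
The two-step process can be sketched as follows, with a hypothetical stock dictionary and a made-up criterion (remove items whose count is zero):

```python
# Hypothetical data.
stock = {"apple": 0, "pear": 4, "kiwi": 0}

# Step 1: iterate and collect the matching keys, without touching the dict.
sold_out = [name for name, count in stock.items() if count == 0]

# Step 2: mutate the dict using the collected list.
for name in sold_out:
    del stock[name]

print(sold_out)  # ['apple', 'kiwi']
print(stock)     # {'pear': 4}
```

Deleting inside the first loop instead would change the dictionary's size during iteration, which Python reports as a RuntimeError.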

List some main categories of analysis techniques

classification (labels), regression (numeric values), clustering, association analysis (set of association rules between events/items) and graph analysis.

What are the main categories of machine learning techniques?

classification, regression, cluster analysis, and association analysis

how to select the last 10 rows of a dataframe?

dataframe_name[-10:]

what is the general data type for time in pandas

datetime64 [ns]

How to declare a dictionary name dict ?

dict = {key1:val1, key2:val2,...}

What is the main difference between set remove and set discard ?

discard will not trigger error if the target item is not in the set
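
A short sketch of the difference, using a made-up set named colors:

```python
colors = {"red", "green"}

# discard silently does nothing when the item is absent.
colors.discard("blue")

# remove raises KeyError when the item is absent.
try:
    colors.remove("blue")
    remove_raised = False
except KeyError:
    remove_raised = True

print(remove_raised)   # True
print(sorted(colors))  # ['green', 'red']
```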

What is "groupby" for and how to perform it?

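groupby splits the rows into groups that share a value in the given column(s), so that statistics can be computed per group. Perform by: df.groupby('name'), followed by an aggregation such as .mean() or .count()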
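
A minimal sketch, assuming a hypothetical ratings frame with genre and rating columns (names and values are illustrative only):

```python
import pandas as pd

# Hypothetical data.
ratings = pd.DataFrame({
    "genre": ["drama", "comedy", "drama"],
    "rating": [4.0, 3.0, 5.0],
})

# Group rows by genre, then average the rating within each group.
mean_by_genre = ratings.groupby("genre")["rating"].mean()

print(mean_by_genre["drama"])   # 4.5
print(mean_by_genre["comedy"])  # 3.0
```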

In a decision tree, what do we call the nodes that are neither root nor leaf nodes?

internal node

What does pd.concat([left, right], axis=1, join='inner') do?

It places the columns of the two data frames side by side (axis=1) in a new data frame, keeping only the rows whose index labels appear in both frames (join='inner')
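
A small sketch with two hypothetical frames whose indexes only partly overlap (the contents are made up; only the call itself comes from the card):

```python
import pandas as pd

# Hypothetical frames; index labels 1 and 2 are shared, 0 is not.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"b": [10, 20]}, index=[1, 2])

inner = pd.concat([left, right], axis=1, join="inner")

print(list(inner.index))    # [1, 2]
print(list(inner.columns))  # ['a', 'b']
```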

What are some algorithms to build a classification model?

kNN Decision Tree Naive Bayes

List some common list methods

list.append
list.pop (by index)
list.remove (by value)
list.extend
zip(list1, list2, ...) (a built-in function that works on lists, not a list method)

How do internal nodes and root nodes differ from leaf nodes?

they have test conditions
