Intro to Data Science with Python
How to import the movies.csv file into a data frame named movies in python?
import pandas as pd; movies = pd.read_csv('./movies.csv', sep=',')
what is the difference between dict.pop and del ?
pop removes the key and returns its value; del dict[key] removes the key and returns nothing, so the value cannot be retrieved afterwards
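A minimal sketch of the difference, using a hypothetical dictionary named ratings (not part of the original card): pop hands the removed value back, while del only removes the entry.

```python
# Hypothetical dictionary used only for illustration.
ratings = {"toy_story": 4.5, "jumanji": 3.0, "heat": 4.0}

# pop removes the key and returns its value.
popped = ratings.pop("jumanji")

# del removes the key and returns nothing.
del ratings["heat"]

# pop accepts a default for a missing key; del on a missing key raises KeyError.
missing = ratings.pop("not_there", None)

print(popped, ratings, missing)
```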
the command to draw line graph in pandas
df.plot()
the command to plot bar, box, histogram charts in pandas
df.plot.bar() df.plot.box() df.plot.hist()
what's the difference between .merge() and concat with axis=1 ?
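merge joins on shared key column(s) and keeps a single copy of each key column; concat with axis=1 aligns on the index and keeps every column, including duplicated ones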
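A small sketch of how the two calls treat a shared column. The frames left and right and their column names are made up for illustration.

```python
import pandas as pd

# Two hypothetical frames that share the key column "movie_id".
left = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["A", "B", "C"]})
right = pd.DataFrame({"movie_id": [1, 2, 3], "rating": [4.0, 3.5, 5.0]})

# merge joins on the key and keeps one copy of it.
merged = left.merge(right, on="movie_id")

# concat with axis=1 places the frames side by side, so "movie_id" appears twice.
side_by_side = pd.concat([left, right], axis=1)

print(list(merged.columns))
print(list(side_by_side.columns))
```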
how to drop rows with null values in dataframe movies?
movies = movies.dropna()
the command to check if dataframe movies has any NULL values
movies.isnull().any()
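A short sketch of the null check and the drop together. The small movies frame here is a stand-in with made-up values, not the contents of movies.csv.

```python
import numpy as np
import pandas as pd

# Stand-in data with one missing year (values are illustrative only).
movies = pd.DataFrame({"title": ["A", "B", "C"], "year": [1995.0, np.nan, 1999.0]})

# One boolean per column: does that column contain any null?
has_null_per_column = movies.isnull().any()

# Chaining a second any() reduces this to a single boolean for the whole frame.
has_any_null = movies.isnull().any().any()

# dropna returns a new frame without the rows that contain nulls.
cleaned = movies.dropna()

print(has_null_per_column)
print(has_any_null, len(cleaned))
```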
ndarray is immutable or mutable?
mutable
How can we create a rank 2 ndarray?
np.array([[items],[items]])
how to sort entries in order?
df.sort_values(by=' ', ascending=True) [ : ] (sort_values is called on the data frame, not on pd)
how to convert epoch time to regular time in pandas?
pd.to_datetime( , unit='s')
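As an illustration, assuming the epoch values are whole seconds held in a hypothetical Series named timestamps:

```python
import pandas as pd

# Hypothetical epoch values in seconds: the epoch itself and one day later.
timestamps = pd.Series([0, 86400])

# unit="s" tells pandas the numbers count seconds since 1970-01-01.
converted = pd.to_datetime(timestamps, unit="s")

print(converted)
```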
How to assign labels to a plot chart?
plt.xlabel('namex') plt.ylabel('namey')
List some common string related functions in python
upper, lower, strip, split, find, format
In python, we have x and y such that x == y is TRUE but then "x is y" is false. Give one example to explain that case.
x and y refer to different objects on the heap, so the identity test fails even though the values compare equal. This always happens when the types differ, for example x = 1 (an int object) and y = 1.0 (a float object).
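One possible example, with variable names chosen only for illustration: equal values held by distinct objects.

```python
# An int and a float with the same numeric value.
x = 1
y = 1.0
same_value = x == y    # compares values
same_object = x is y   # compares identities

# Two separately built lists with equal contents behave the same way.
a = [1, 2, 3]
b = [1, 2, 3]
lists_equal = a == b
lists_identical = a is b

print(same_value, same_object, lists_equal, lists_identical)
```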
what is the difference between .concat and .append ?
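pd.concat is a top-level function that combines a list of objects along either axis; DataFrame.append was a method that added the rows of another dataframe to the calling dataframe and returned a new dataframe. append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the one to use.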
List some statistical functions
.mean() .sum() .median() .unique() .intersect1d() .union1d() .setdiff1d() .in1d() axis=1 --> across the columns (one result per row) axis=0 --> down the rows (one result per col)
How to read and write text to/from disk in python?
np.savetxt() np.loadtxt() (the numpy functions for writing an array to a text file and reading it back)
what function should we use to count unique value occurring in the input?
.value_counts()
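A quick sketch on a hypothetical Series of genre labels:

```python
import pandas as pd

# Made-up genre labels for illustration.
genres = pd.Series(["drama", "comedy", "drama", "drama"])

# value_counts returns each distinct value with the number of times it occurs.
counts = genres.value_counts()

print(counts)
```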
List 5 key steps of a data sci process
+ Acquire (finding, accessing, acquiring, moving data) + Prepare (preliminary analysis, pre-processing such as cleaning, subsetting, filtering, packaging...) + Analyze + Report + Act
What is the difference between list.append and list.extend?
Append adds a single element to the end of the list. Extend appends each of the elements from one list onto the other. So, if you append a list to another list, that whole list becomes one element of the other list: A[a1,a2,..,[list B]]
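The contrast can be sketched with two small lists (the names a and b are illustrative):

```python
a = [1, 2]
b = [3, 4]

# append adds b as a single nested element.
appended = list(a)
appended.append(b)

# extend adds each element of b individually.
extended = list(a)
extended.extend(b)

print(appended)
print(extended)
```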
Why is merging dataframes so common in pandas?
Because data is usually distributed across different locations and tables. Combining data from distinct dataframes also helps with seeing the big picture.
Strings in python are immutable. What does it mean by "immutable" ?
A string can't be changed in place once it is created; operations that appear to modify it return a new string
At the "Exploring data" phase, the first step is preliminary investigation in which we focus on what characteristics of data?
Correlation (using correlation graph to explore dependencies) Trends Outliers (double check for errors in data)
What can Folium be used for?
Creating geographic overlays
Why do data sci?
Data --> Insight --> Action
In data preprocessing phase, the data transformation is also known as?
Data wrangling, data munging, data preprocessing
What could be the most useful data structure in Python for Data Sci ? Give a brief description of that data structure
Dictionary. A dictionary holds key-value pairs whose values can be any kind of object (even a list or another dictionary). The key can be a unique id or a unique tuple. Dictionaries are also super fast at doing lookups. The keys must be immutable (hashable).
Some operations in data pre-processing?
Dimensionality reduction Data manipulation Transformation Feature selection Scaling
what does movies.head() do?
Display the first 5 rows (the default) of the data frame movies
Describe the benefits of histograms, boxplots, and line graphs
Histograms show the distribution of the data and can show skewness or unusual dispersion. Boxplots are another type of plot for showing data distribution. Line graphs are useful for seeing how values in your data change over time.
The first step of acquiring data is to?
Identify what data is available and that means 2 things: - Identify what data is relevant to the problem - make use of all of those relevant data
Issues with raw data?
Inconsistent values Duplications Missing values Invalid data Outliers
Text blocks in Jupyter notebook support Markdown but also?
LaTeX and HTML
What does this command do df['three'] = df['one'] * df['two']
It creates a new col 'three' in the frame whose values are the values of col 'one' multiplied element-wise by the values of col 'two'.
What does plt.subplots() do?
It returns the figure and the axes separately, as a (fig, ax) pair
what does data_frame.describe() do ?
It gives you basic summary statistics for the numeric columns of your data set: count, mean, std, min, 25%, 50%, 75%, max. For a Series, the printed output also shows the name and dtype.
When you want to do something with a string, the best practice is?
Look to see if there is already a pre-built function to handle that task for you
What are range and standard deviation?
Measures of spread in data
To write equations in Latex, we must?
Put the content between dollar signs: $content$ for inline math, or $$content$$ for a displayed equation
Numeric variables are also called quantitative variables. What else can we call categorical variables?
Qualitative variables or nominal variables
What is data aggregation?
Summarizing or combining data values, which reduces noise and variability
What 2 things are most important in creating elegant visualizations?
Relevant Lean
Ways to solve issues with raw data
Remove data with missing values Merge duplicated records Generate best estimate for invalid values Remove outliers *Note that all these actions must be based on domain knowledge
Activities in feature selection?
Remove, combine, add, create features
What is a "sample" and "variable" in ML ?
Sample is basically a row with data in a dataset, aka record, observation, instance, example.. Variable is basically columns, sometimes referred to as "features", attribute, dimension, field
Which graphing method should you use to visualize the correlation between two arrays?
Scatter plot
Sets are useful because?
Sets are unordered Support math operations (union, intersection, difference) Only allow unique elements
what does pandas.concat([ ,]) do?
Stack dataframes
List some statistical functions other than the basic/core statistical functions
count() clip() rank() round()
In building a decision tree, what is the "greedy approach" ?
the approach of only considering, at each step, the best way to split a particular portion of the data into subsets
Data visualization is important since human visual intelligence is very powerful. However, data visualizations are meaningless without?
the context
What is "mode"?
the value that occurs most frequently in your data set.
how to build a list of 5 random int in the range between 0 and 10?
>>> import random >>> nums = [random.randint(0, 10) for _ in range(5)] (randint includes both endpoints; naming the variable list would shadow the built-in)
Jupyter notebook supports linux commands. In order to execute linux commands, we must?
Type "!" first, for example !ls
What is the result of the following line of code? import numpy as np np.unique([1,1,3,4,2,3,3])
array([1,2,3,4])
What is the output of the following broadcasting call? A = np.array([[1],[2]]) B = np.array([[1,2],[3,4]]) A + B
array([[2, 3], [5, 6]])
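The result can be checked by making the broadcast explicit. This sketch uses np.broadcast_to to show how A, of shape (2, 1), is stretched across the columns of B before the element-wise addition; the name A_stretched is introduced only for illustration.

```python
import numpy as np

A = np.array([[1], [2]])        # shape (2, 1)
B = np.array([[1, 2], [3, 4]])  # shape (2, 2)

# Broadcasting repeats A's single column to match B's shape.
A_stretched = np.broadcast_to(A, B.shape)

# The sum is then an ordinary element-wise addition.
result = A + B

print(A_stretched)
print(result)
```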
After identifying the problem, data sci team needs to assess the situation which includes?
+ Risks, costs, benefits, regulations, backup plans, + Requirements, assumptions, constraints, resources ...
What are the steps to construct a decision tree?
+ Start with all samples at the root node + Partition the samples into subsets (records that are purest - aka homogenous) + Repeat to partition data into successively purer subsets (induction)
Why is numpy very useful for data sci?
+ It supports multi-dimensional arrays (matrices) + It has built-in array operations (optimized statistical operations) + Simplified but powerful array interactions --> broadcasting + Allows integration with other languages like Fortran, C, C++ (ie further code optimizations) + Fast (vectorized operations on numpy arrays are often around 10 times faster than equivalent loops over python lists) + Many useful packages are built on top of numpy
Common traits of data scientists?
+ Passionate about the meaning behind data + Understand the problem they are trying to solve + Care about engineering solutions + Curious + Communicate with team mates
What are the main benefits of Pandas?
+ All the benefits of numpy + Data variety support + Data integration (of large data sets, merge, joins..) + Data transformation + Data visualizations + Support for time-series data + The ability to use native methods + Descriptive statistics
List some reasons why Jupyter notebook is good for Data Science?
+ Allow documenting of the process by combining notes, code and graphics, allowing others to understand the motives behind each step + Allow replication and inspection of methods + Support Julia, R and Python (most commonly used languages in DataSci)
What are the two main ways to categorize data visualizations?
+ Conceptual or Data-driven + Declarative or exploratory (the supply and demand graph)
Why is Python the best for DataScience?
+ Easy to learn + Open language with strong community + Tons of libraries applicable to every step in data sci + Reproducible, repeatable, built-in training
In order to promote speed, numpy arrays have some limitations compared with python's list. What are those limitations?
+ Fixed in size + All elements of an array must be the same type
What is Data Science?
+ The basis of empirical research + Exploratory data analysis and modeling + A continuous process + The intersection of computer science, mathematics, and business or scientific expertise. + A team sport
According to Andy Kirk, what makes good data visualization?
+ Trustworthy (data is honestly portrayed) + Accessible (focusing on your audience) + Elegant
List 3 ways to retrieve items in a series
+ by index label with loc : ser.loc[['nancy','bob']] + by a list of keys with [] : ser[[4, 3, 1]] + by integer position with iloc : ser.iloc[2]
Some quick ways to spot issues with data set?
+ min and max are out of range
With dict[(key)], Python raises a run-time error (KeyError) if the key does not exist in the dictionary. How can we check for the existence of the key?
+ use dict.get and see if that returns None (ambiguous if None is also a stored value) + use (key) in dict and test for True or False
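A sketch of both checks on a hypothetical dictionary named movie_years. It also shows one case where get alone is ambiguous (a key that exists with the value None) and a common workaround using a sentinel default; the sentinel name _MISSING is an illustrative choice.

```python
# Hypothetical data: "unreleased" exists but stores None.
movie_years = {"heat": 1995, "unreleased": None}

# The in operator answers the existence question directly.
has_heat = "heat" in movie_years
has_casino = "casino" in movie_years

# get returns None both for a missing key and for a stored None.
get_unreleased = movie_years.get("unreleased")
get_casino = movie_years.get("casino")

# A unique sentinel default tells the two cases apart.
_MISSING = object()
casino_is_missing = movie_years.get("casino", _MISSING) is _MISSING
unreleased_is_missing = movie_years.get("unreleased", _MISSING) is _MISSING

print(has_heat, has_casino, casino_is_missing, unreleased_is_missing)
```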
What are the methods to get values from a dictionary named "dict" ?
. x = dict[(key)] . x = dict.get(key) . x = dict.pop(key)
Different ways to initialize ndarray with values
.zeros .full (filled with specified value) .eye (fill diagonally) .ones .random.random
syntax to call a method in python
<var_name>.<method_name>(params)
What is a machine learning model?
A mathematical model or a parametric function over the input
The two data structures behind Pandas power are Pandas Series and Pandas Data Frame. Describe those data structures.
A Series is a one-dimensional array-like object that provides us with many ways to index data. A Series acts like an ndarray, but it supports many data types as a part of the array. A DataFrame is a 2-D elastic (size-mutable) data structure that supports heterogeneous data with labeled axes for rows and columns.
Why are there different types of analysis techniques?
Because there are different types of problems
Two main goals in data preprocessing step?
Clean and Transform
Tuples in python are mutable or immutable?
Immutable
What is feature selection?
Selecting the features that will have the biggest impact towards the problem being solved
What is "summary statistic" and what can it contribute in "Exploring data" phase?
Summary statistics consist of mean, median, mode, range, standard deviation. They capture various characteristics of a set of values with a single number or a small set of numbers
In decision tree, what is the depth, and the size?
The depth of a decision tree is the number of edges in the longest path from the root node to the leaf node. The size of a decision tree is the number of nodes in the tree.
an_array[1, :] gives [21 22 23 24] and an_array[1:2, :] gives [[21 22 23 24]]. What is the difference between the two arrays?
The first one is a rank 1 array (shape (4,)). The second one is a rank 2 array (shape (1, 4)).
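A sketch that reproduces the two slices. Only the row [21 22 23 24] comes from the card; the other two rows of an_array are assumed values added so the example runs.

```python
import numpy as np

# Rows 0 and 2 are assumptions; row 1 matches the card.
an_array = np.array([[11, 12, 13, 14],
                     [21, 22, 23, 24],
                     [31, 32, 33, 34]])

# An integer index drops that dimension: the result is rank 1.
row_rank1 = an_array[1, :]

# A slice keeps the dimension: the result is rank 2 with a single row.
row_rank2 = an_array[1:2, :]

print(row_rank1.shape, row_rank2.shape)
```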
In decision tree, what is the depth of a node?
The number of edges from the root node to that node
What should you do if you want to find unique elements shared between 2 sets?
We use the intersection method or the & operator: both = set1.intersection(set2) or both = set1 & set2
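For instance, with two small sets whose contents are chosen only for illustration:

```python
set1 = {1, 2, 3, 4}
set2 = {3, 4, 5}

# Both forms return a new set holding the elements common to both inputs.
both_method = set1.intersection(set2)
both_operator = set1 & set2

print(both_method, both_operator)
```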
Describe the outputs of any(), all()
a.any(): returns whether ANY element is True. Can detect if a cell matches a condition very quickly. a.all(): returns whether ALL elements are True. Can detect if a column or row matches a condition very quickly.
Describe the outputs of mean(), std()
a.mean() : output series or DataFrame with the mean values a.std(): series or dataframe with the standard deviation value, normalized by N-1
What is the correct way to access elements of an array "arr" that are less than 0?
arr[arr<0]
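A brief sketch of what the expression does, with made-up values: the comparison builds a boolean mask, and indexing with the mask keeps only the positions where it is True.

```python
import numpy as np

# Illustrative values only.
arr = np.array([3, -1, 0, -7, 5])

# Element-wise comparison yields a boolean array of the same shape.
mask = arr < 0

# Indexing with the mask selects the matching elements.
negatives = arr[mask]

print(mask)
print(negatives)
```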
Why is it a bad practice when you mutate the data while iterating through a dictionary? What could be the best solution?
Because you may change the structure of the data: adding or removing keys while iterating can raise a RuntimeError or give unpredictable results. The best solution is a two-step process. + Iterate over the dictionary based on a set of search criteria and build a list of results + Mutate the dict items based on the list of results from the previous step
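The two-step process can be sketched as follows. The dictionary scores and the "negative value" criterion are hypothetical; the first part shows the failure that the two-step approach avoids.

```python
# Hypothetical data: remove every entry with a negative value.
scores = {"a": 5, "b": -1, "c": 7, "d": -3}

# Deleting while iterating changes the dictionary's size mid-loop.
unsafe_copy = dict(scores)
raised = False
try:
    for key in unsafe_copy:
        if unsafe_copy[key] < 0:
            del unsafe_copy[key]
except RuntimeError:
    raised = True

# Step 1: iterate and collect the keys that match the criterion.
to_remove = [key for key, value in scores.items() if value < 0]

# Step 2: mutate the dictionary using the collected list.
for key in to_remove:
    del scores[key]

print(raised, to_remove, scores)
```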
List some main categories of analysis techniques
classification (labels), regression (numeric values), clustering, association analysis (set of association rules between events/items) and graph analysis.
What are the main categories of machine learning techniques?
classification, regression, cluster analysis, and association analysis
how to select the last 10 rows of a dataframe?
dataframe_name[-10:]
what is the general data type for time in pandas
datetime64[ns]
How to declare a dictionary named dict ?
dict = {key1:val1, key2:val2,...}
What is the main difference between set remove and set discard ?
discard will not trigger error if the target item is not in the set
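A minimal sketch with a hypothetical set named genres:

```python
genres = {"drama", "comedy"}

# discard silently does nothing when the item is absent.
genres.discard("horror")

# remove raises KeyError when the item is absent.
remove_failed = False
try:
    genres.remove("horror")
except KeyError:
    remove_failed = True

# Both delete the item when it is present.
genres.discard("comedy")
genres.remove("drama")

print(remove_failed, genres)
```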
What is "groupby" for and how to perform it?
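groupby splits a data frame into groups by the values of one or more columns so that statistics can be computed per group. Perform by : df.groupby('name'), followed by an aggregation such as .mean()
In a decision tree, what do we call the nodes that are neither root nor leaf nodes?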
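As an illustration, assuming a frame with a 'name' column to group on and a hypothetical numeric 'score' column to aggregate:

```python
import pandas as pd

# Made-up data: two rows per name.
df = pd.DataFrame({
    "name": ["ann", "bob", "ann", "bob"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# Group the rows by name, then compute the mean score within each group.
means = df.groupby("name")["score"].mean()

print(means)
```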
internal node
what does pd.concat([left, right], axis=1, join='inner') do?
it places the columns of the two data frames side by side (axis=1) in a new data frame, keeping only the rows whose index labels appear in both (join='inner')
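A sketch of the call on two hypothetical frames whose indexes only partly overlap:

```python
import pandas as pd

# left has index labels x, y, z; right has only y and z (made-up data).
left = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"b": [10, 20]}, index=["y", "z"])

# axis=1 puts the columns side by side; join="inner" keeps shared index labels only.
joined = pd.concat([left, right], axis=1, join="inner")

print(joined)
```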
What are some algorithms to build a classification model?
kNN Decision Tree Naive Bayes
List some common list methods
list.append list.pop (by index) list.remove (by value) list.extend, plus the built-in zip(list1, list2, ...), which is a function rather than a list method
How do internal nodes and root nodes differ from leaf nodes?
they have test conditions