Python for Data Science
unix time
tracks the passage of time by counting the seconds elapsed since a fixed instant (the Unix epoch, 1970-01-01 00:00:00 UTC); appears as an integer
What is the definition of 'predictive analytics'?
using data to predict outcomes
How do you determine the size of a decision tree?
# of nodes in tree (exclude root node)
What is the symbol that denotes Magic Functions in Jupyter? % $ # //
%
Using markdown cells in Jupyter, how do you format text as bold? **word** *word* ##word## #word#
**word**
two categories of data visualization
-conceptual or data driven (show a concept EX: supply demand curve) -declarative or exploratory (display conclusion to observer)
key features of numpy
-multi dimensional arrays -built in operations/packages -integrate multiple languages -speed
why is numpy so fast
-numpy arrays are fixed in size -elements must be the same type
key features of jupyter notebook
-supports multiple platforms -documented data science -reproducible data science -collaboration in groups
pandas benefits
-variety of data sources -data integration -data transformation -visualizations
Which Root Mean Square Error (RMSE) would represent a perfect prediction with no errors in regression? 0 NaN 1 -1
0
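A minimal sketch of why a perfect prediction gives an RMSE of 0, using only the standard library. The helper name `rmse` and the sample values are illustrative assumptions, not taken from the course.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [3.0, 5.0, 7.5]                   # made-up target values
perfect = rmse(actual, actual)             # every error is zero
imperfect = rmse(actual, [2.0, 5.0, 9.5])  # some errors are non-zero
```

Because the errors are squared before averaging, RMSE can never be negative, so 0 is the best possible score.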
After running nltk.download("movie_reviews"), which of the following needs to be called to import movie_reviews? from nltk.corpus import movie_reviews from nltk.corpora import movie_reviews nltk.import(movie_reviews) nltk.import("movie_reviews")
1
As an example, let's consider a data set consisting of two variables, one representing how long students spend studying and the other representing their average test scores. Assuming students study effectively, how would you expect these two variables to be correlated? Positively correlated because I would expect the test scores to go up if they spend more time studying. Negatively correlated because as the test scores go down, students should spend more time studying. No correlation because these variables should be independent.
1
How do you assign each sample in a dataset to a centroid using the k-means algorithm? Assign the sample to the cluster with the closest centroid. Assign the sample to the cluster with the furthest centroid. Assign the sample to a random cluster.
1
How do you determine the new centroid of a cluster? Calculate the mean of the cluster Calculate the max of the cluster Calculate the mode of the cluster Calculate the min of the cluster
1
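The two k-means cards above (assign each sample to the closest centroid, then recompute each centroid as the cluster mean) can be sketched as one iteration in NumPy. The sample points and starting centroids are made up for illustration, and this is not the scikit-learn implementation.

```python
import numpy as np

# Hypothetical 2-D samples and two starting centroids.
samples = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Assignment step: Euclidean distance from every sample to every centroid,
# then the index of the closest centroid for each sample.
distances = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)

# Update step: each new centroid is the mean of the samples assigned to it.
new_centroids = np.array(
    [samples[labels == k].mean(axis=0) for k in range(len(centroids))]
)
```

A full run would repeat these two steps until the labels stop changing.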
SELECT name FROM san_diego_elementary WHERE grade > 70 Which students at San Diego Elementary scored higher than 70%? At which elementary schools have students scored higher than 70%? Of all the elementary students in san diego that have an average grade of >70%, how many students are from San Diego Elementary?
1
What does the "within-cluster sum of squared error" provide? A mathematical measure of the variation within a cluster. An error measurement for a specific sample in relation to the centroid of a particular cluster. An answer to which cluster is the most 'correct.'
1
What does the following method call return? accuracy_score(y_true = data_test, y_pred = predictions) The fraction of correctly classified samples. The number of correctly classified samples.
1
What does the most_common function of Counter return? a list of words in the form ('word', frequency) a list of words in the form ('word') a list of words in the form (frequency, 'word') a list of words in the form (frequency)
1
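A short sketch of `Counter.most_common`, with a made-up word list. It also illustrates that `Counter` treats strings that differ only in case as different keys.

```python
from collections import Counter

# Hypothetical tokens; "data" and "Data" are counted separately.
words = ["data", "science", "data", "Data", "python", "data"]
counts = Counter(words)

# most_common(n) returns a list of (item, count) tuples, highest count first.
top_two = counts.most_common(2)
```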
What is an example of insight turned into action? (pick 1) 1 Marketing a new product based on past sales information 2 Prediction of customers' choices 3 Understanding of customer profiles 4 Gathering data sales data
1
What is the appropriate input for the following line of code to make a linear regression prediction? y_prediction = regressor.predict(___) x_test x_train y_train y_test
1
What is the command to get the number of rows in a data set titled "data"? data.shape[0] data.shape[1] data.size() data.length()
1
What is the definition of corpus? a collection of text in digital form a download interface to pre-processed text datasets
1
What is the first step in constructing a decision tree? Start with all samples at a node. Partition the samples into subsets based on the input variables. Repeatedly partition data into successively purer subsets until stopping criteria are satisfied.
1
What is the output of the execute method in sqlite? an iterator a cursor a table
1
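A sketch of `execute` with the standard-library `sqlite3` module, assuming an in-memory database. The table name comes from the SELECT card earlier in this set, and the rows are invented. The object returned by `execute` is a `sqlite3.Cursor`, which can be iterated over row by row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE san_diego_elementary (name TEXT, grade REAL)")
conn.executemany(
    "INSERT INTO san_diego_elementary VALUES (?, ?)",
    [("Ana", 91.0), ("Ben", 64.5), ("Cho", 78.0)],  # made-up students
)

# execute() hands back a Cursor; iterating over it yields one tuple per row.
cursor = conn.execute("SELECT name FROM san_diego_elementary WHERE grade > 70")
names = [row[0] for row in cursor]
conn.close()
```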
What is the primary data structure for a relational data model? Table List Vector 3-D data frame
1
What type object is the intersection function called on? set twitter api authentication list
1
What type of object does the function Kmeans output? kmeans dataframe integer series
1
What will type(_) return? type of output from previous line type that represents the underscore symbol prompt to fill in the blank for type of output error because there is no object to return the type
1
What would 18490/1e3 (thousands format) result? 18.49 18490000 1.8490 18490.000
1
When is a prediction task referred to as simple linear regression? When there is only one input variable. When there is more than one input variable. When there are two input variables.
1
Which of the following string formatting centers a string to be in the middle of 10 spaces? {:^10} {:10} {:>10} {:<10}
1
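A quick sketch of the alignment options from the card above, applied to a made-up four-letter string.

```python
word = "data"

centered = "{:^10}".format(word)  # padded on both sides
left = "{:<10}".format(word)      # padded on the right
right = "{:>10}".format(word)     # padded on the left
```

For strings, `{:10}` behaves like `{:<10}`, since left alignment is the default for text.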
Which parameter in the KMeans clustering algorithm do you have to specify for the number of clusters you want? n_clusters clusters tot cluster_centers
1
Why are decision boundaries of a decision tree parallel to the axes formed by the variables? Each split considers only a single variable Each subset should be as homogenous as possible The induction algorithm eventually stops expanding
1
You are given a dataframe labeled x where the column 'number' indicates the index of a record. Which function call would create a new dataframe y that takes more than 10 samples x if x has 100 records? y = x[(x['number']%5)==0] y = x[(x['number']%10)==0] y = x[(x['number']%15)==0]
1
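A sketch of the modulo filter, assuming the 100 records are numbered 0 to 99 (the card does not say where the numbering starts). Only the `% 5` filter keeps more than 10 rows.

```python
import pandas as pd

# Hypothetical dataframe with 100 records numbered 0..99.
x = pd.DataFrame({"number": range(100)})

every_5th = x[(x["number"] % 5) == 0]
every_10th = x[(x["number"] % 10) == 0]
every_15th = x[(x["number"] % 15) == 0]
```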
What are the two components of data retrieval mentioned in this class? The way you specify how to get the desired data out of the relational data store. The internal processing that occurs within the data management system to compute or evaluate that specified retrieval request. The way you store specific data in a data management system. How to handle the data once it has been retrieved.
1 2
What are the ingredients to form a data science problem (select 3)? 1 define what it is you're trying to tackle 2 assess the situation with respect to the problem 3 define your goals and objectives 4 assess the population to which the problem refers
1 2 3
To use scikit-learn: DecisionTreeRegressor, train_test_split, and mean_squared_error, which of the following libraries are necessary? (Choose the best two) pandas sklearn.metrics sklearn.model_selection sklearn.tree scikitlearn
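1 5 (pandas and scikit-learn; in code, the three names are imported from sklearn.tree, sklearn.model_selection, and sklearn.metrics respectively)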
5 Basic Steps of Data Science
1. Acquire 2. Prepare 3. Analyze 4. Report 5. Act (mnemonic: APARA)
3 questions to ask when reporting (apaRa)
1. What are the main results 2. What value do these results provide 3. How can the model add to this application
steps to formulate a research question
1. define a problem 2. assess the situation 3. Define the goals
two steps of preparing data
1. explore dataset 2. pre-process dataset
2 types of pandas data structures
1. pandas Series 2. pandas DataFrame
principles of data visualization
1. trustworthy (do not mislead audience) 2. accessible (made to be easily perceived and used by audience) 3. elegant (easy to read and interpret in a visually appealing way)
What is an example of Unix time (int64)?
1138537770
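To see what this integer means, it can be converted back into a calendar date. This is a sketch using only the standard library, interpreting the value as seconds since the epoch in UTC.

```python
from datetime import datetime, timezone

unix_seconds = 1138537770

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC.
moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
as_text = moment.isoformat()
```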
For a classification problem, if you want to predict the letter grade that a student would receive, what are 2 examples of reasonable input data to consider? Amount of time spent studying Percentage grade these students received in the previous semester Letter grade different students received in another class The students' ID numbers
1 2
What 2 statements describe classification in the context of machine learning? Predict the category of the target given input data Supervised task Unsupervised task Numerical target variable
1 2
What happens when you join two tables in pandas using natural join? Select all that applies. The common column is represented once. The common row is represented once. The common column is represented twice. The common row is represented twice.
1 2
What does the Twitter API allow access to? users location interactions personal information
1 2 3
Which of the following are frequent corner cases in tokenization? inconsistent use of punctuation shortened use of words hyphenated words multiple spaces between words
1 2 3
In accessing a client using OAuth 1.0A, what are the four required identifiers to establish access to the resources? consumer key consumer secret access token access token key access token secret
1 2 3 5
Which two code outputs are equivalent? print('#'*5) print ('#*5') print('#####') print('#*'5)
1 3
pandas Series
1D labeled array, similar to an ndarray but can hold multiple data types
Rank 1 ndarray
1D array (vector)
Automatic translators are a natural language processing (NLP) technique. How do they work? Learn what and how a human talks over time. Takes words, phrases, and context into account to understand what is being said. Process questions, categorize them, and match them to existing answers.
2
If you have a dataframe titled 'dat' with 5 rows and 2 columns and you run the following line of code, how many boolean values are returned? dat.isnull().any() 2 5 1 10
2
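A sketch of why the count is 2: `isnull().any()` reduces over the rows and returns one boolean per column. The dataframe contents are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with 5 rows and 2 columns; column "a" has one gap.
dat = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# One True/False per column: does that column contain any missing value?
has_missing = dat.isnull().any()
```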
In the twitter package in python, what does the OAuth function return? list of necessary authentication authentication object twitter API there is no return value
2
What is the correct way to show the last 2 files using movie_reviews.fileids() where movie_reviews is a downloaded dataset? movie_reviews.fileids()[:-2] movie_reviews.fileids()[-2:] movie_reviews.fileids([:-2]) movie_reviews.fileids([-2:])
2
What is the difference between regression and classification for machine learning in Python? Regression transforms categorical values to numeric and then follows the same as classification. Regression is used to predict a numeric value while classification is used to predict a categorical value. Classification is used when the input data is categorical and regression is used when the input data is numeric.
2
What is the first step with any dataset? Research all of the background information. Perform an initial exploration. Verify the findings that others have found.
2
What is the function call to output the name of columns of a dataframe named x? x.columns(0) x.columns columns(x)
2
When is it NOT acceptable to omit axis labels in plots using matplotlib? When you are simply exploring the data and know their values. When you are presenting non-intuitive results to another person. When the labels can be determined by the values (e.g., percentage correct, years).
2
Which of the following code snippets is a conditional statement? for s in statuses: if not s["text"] in all_text: [s['text'] for s in search_results['statuses']] statuses = filtered_statuses
2
What is true about data science? Select 3. 1 a static, one time analysis of big data 2 a model generated that leads to insights and can be improved 3 management of data 4 seeing how everything is connected
2 3 4
Which of the following are examples of stop words in English? stop doing those being myself word
2 3 4 5
pandas DataFrame
2D data structure, multiple data types
Rank 2 ndarray
2D array (matrix)
How would you initially handle an anomaly (apparent outlier) in cluster analysis? Throw it out of the dataset Disregard in further analysis Provide further analysis on the anomaly
3
What is the benefit of a log graph over a graph that has not modified the scales? There is no outright advantage to converting a graph into a log scale. Log scales are significantly better in representing every type of data compared to unmodified scale graphs. Log scales allow a large range of values that are compressed into one point of the graph to be shown.
3
What is the name of a sophisticated word tokenizer trained on English in nltk? punkd punked punkt punnet
3
When plotting word frequency, what does a peak distribution most often entail about the vocabulary? There are many unique words There is a large vocabulary There is a focused topic
3
Which of the following is true about a model? built using test data evaluated on training data trained by the training data set
3
Assume the code line text = "New York-based" is run; what would be the output of text.split()? ['New',' ','York-based'] ['New','York', '-', 'based'] ['New','York','based'] ['New','York-based']
4
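A small sketch of `split()` with no arguments. It splits on runs of whitespace only, so the hyphenated word stays in one piece and repeated spaces do not produce empty tokens. The second string is an added example.

```python
text = "New York-based"
tokens = text.split()

# Added example: several spaces in a row are treated as one separator.
spaced_tokens = "New   York-based".split()
```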
Given a list named x, what is the object type of the output returned by Counter(x)? no return type dataset list counter
4
Given: twitter_api = twitter.Twitter(auth=auth) WOE = 1 What is the correct code to find the trend? twitter_api.trends.place(id = WOE) twitter_api.trends.place(1) twitter_api.trends.place(WOE) twitter_api.trends.place(_id = WOE)
4
How does twitter identify location? city, state, and country longitude and latitude coordinates integer number
4
What is the correct way to use json to show the first three statuses in a list named statuses? print(json.dumps(statuses[4], indent = 1)) print(json.dumps(statuses[3], indent = 1)) print(json.dumps(statuses[1:4], indent = 1)) print(json.dumps(statuses[0:3], indent = 1))
4
What is the reason to prefix id with an underscore for query string parameterization with the twitter object? there is no need in the function, it is there to make it easier on the reader it is needed because it is part of the parameter name it is needed for the twitter object to know the type of output from previous line without it, twitter package appends the value to the URL
4
Where did we import NaiveBayesClassifier from in our movie review NLP notebook? sklearn sklearn.naive_bayes nltk.corpus nltk.classify
4
Given the python function from the video: def build_bag_of_words_features(words): return {word:True for word in words} What is the parameter type and return type? dictionary and boolean value set of words and boolean value dictionary and set of words dictionary and boolean value set of words and dictionary
5
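The function from the card can be run directly to check the types. The input word list is made up. Looking up a word that was never seen returns `None` through `dict.get`, which behaves like a false value.

```python
def build_bag_of_words_features(words):
    # One dictionary key per distinct word; the value only records presence.
    return {word: True for word in words}

features = build_bag_of_words_features(["great", "movie", "great"])
missing = features.get("boring")  # a word that is absent from the bag
```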
What is the default number of rows that the function head() will return for a dataframe in pandas? 10 1 5 2
5
What is the definition of 'data mining'?
Activities related to finding patterns in databases and data warehouses.
As an example, you have a dataset containing numerical values of subjects' heart rates during exercise and categorical values describing how much they smoke. You want to determine whether smoking and heart rate are related. What machine learning category would this fall under?
Association analysis
For example, you want to predict the number of kids someone will have: either 0, 1, 2, or 3+. Is this an example of regression or classification? Regression Classification
CLASSIFICATION
What is the function Kmeans (from sklearn) used for in Python in the example soccer data analysis overview?
Clustering
Which of the following is NOT a step in the formal data science process? Collaborate Acquire Prepare Analyze
Collaborate
Regular expression: '.*\,(.*)\&.*' String in Series: "Feuer, Eis & Dosenbier" What would be the output of the extract function with this string and regular expression?
Eis
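A sketch of the same pattern using the standard-library `re` module instead of the pandas `str.extract` method, so it runs without pandas. The capture group keeps the spaces around the word, so a `strip()` is needed to get exactly `Eis`.

```python
import re

pattern = r'.*\,(.*)\&.*'
text = "Feuer, Eis & Dosenbier"

match = re.match(pattern, text)
captured = match.group(1)   # everything between the comma and the ampersand
cleaned = captured.strip()  # remove the surrounding spaces
```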
You are given a data set that contains crime reports for different neighborhoods in New York City and told to analyze it. Would you consider this a declarative or explorative example? Explorative Declarative
Explorative
Which should you do first after you put together data needed for application? Build models to analyze the data Explore the data you put together
Explore the data
XML
Extensible Markup Language, a markup language that uses nested tags to describe and structure data; JSON is a separate, commonly used format for similar semi-structured data
A pickle file is a python utility module that uses serialization and cannot be deserialized in a different notebook. True False
F
In the matplotlib.pyplot function hist, the bins parameter indicates to plot only elements in a list which has a number equal to bins. True False
F
Information from a tweet is only the data given by the 140 character string. True False
F
T/F Data science only generates actionable information for the future.
F
T/F Raw data from sources can always be directly used to perform analysis.
F
T/F Cluster analysis is a supervised task
F
T/F Regression is an unsupervised task.
F
T/F Test data is the same dataset as training data in classification models.
F
The accuracy on test data will be high if a single sample from a large data set was used in the training set. True False
F
The join operation works efficiently for data of all sizes. True False
F
The nltk corpus only supports English stopwords. True False
F
The raw method is unique to movie_reviews, a downloaded dataset. True False
F
The strings "ABC" and "abc" will be counted as the same string using Counter. True False
F
True or False: The function call train_test_split(a, b) where a and b are dataframes will always output the same result. True False
F
Using the string library, the code "string.punctuation" returns a list of punctuation characters. True False
F
When you concatenate two dataframes using pandas concat function, the number of resulting columns will be the columns that BOTH dataframes have. As an example, if one dataframe has columns titled 'cat','dog' and another dataframe has columns titled 'dog','bunny', then the resulting dataframe columns will be 'dog'. True False
F
T/F Once you have devised an action based on an insight, there is no need to continue collecting data.
F - you need to collect data for post implementation progress tracking
Code in the Jupyter code cells are restricted to being one line. True False
False
In machine learning, algorithms and programs directly aim to learn a given task.
False
T/F: Data science is a static, one-time analysis
False
Changing an element of an array slice in numpy will NOT change the original array. True False
False (a slice is a view of the original array, so changing it also changes the original)
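A sketch of the slice behavior with made-up values. A basic slice shares memory with the original array, while `.copy()` gives an independent array.

```python
import numpy as np

original = np.array([10, 20, 30, 40, 50])

view = original[1:4]   # basic slicing returns a view, not a copy
view[0] = 99           # this write also changes original[1]

independent = original[1:4].copy()
independent[1] = -1    # original is not affected by this write
```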
What 2 things are most important in creating elegant visualizations? Focus on what is relevant. Remove anything which isn't adding to the figure. Use a unique style. Make sure the your visualization is trustworthy.
Focus on what is relevant. Remove anything which isn't adding to the figure.
What is NOT important in collecting data? -The user population -The intended uses of the application -How the data was presented
How the data was presented
html
Hyper Text Markup Language, A markup language used to structure text and multimedia documents, and to set up hypertext links between documents. Used extensively on the World Wide Web, it is the basis of every webpage
What is true between supervised and unsupervised approaches?
In supervised approaches, the target is provided. In unsupervised approaches, the target is unavailable.
How do we show the histogram created with graph.hist(...)? It will automatically show once you call the hist() function Call "graph.show()" Call "hist.show()" Include "%matplotlib inline"
Include "%matplotlib inline"
What are the 3 reasons that data scientists working in Python use numpy all the time? Its speed. Its functionality. Many packages rely on numpy. It enables text markup cells.
Its speed. Its functionality. Many packages rely on numpy.
In building a machine learning model, why do we want to adjust the parameters?
To minimize the model's error
If you create a DataFrame using pandas by accessing a column label that doesn't exist, what values are present in that column? NULL 0 No values; error message is printed NaN
NaN
Which of the following are benefits of ndarrays over lists? Select 3. Ndarrays are more space efficient. Ndarrays are more optimized for memory. Ndarrays often have faster computation. Ndarrays have more variable types than lists.
Ndarrays are more space efficient. Ndarrays are more optimized for memory. Ndarrays often have faster computation.
Which is an example of conceptually driven data visualization? Physicists visualize the well-understood relationship between force and acceleration to teach introductory physics students. Doctors try to explore the relationship between a drug and the effect it has on their patients using data visualization. Realtors visualize a data set containing rental listings and the amount of interest they attract.
Physicists visualize the well-understood relationship between force and acceleration to teach introductory physics students.
In what 3 ways can you quickly access numpy array elements? Slicing Using an array of indices Boolean indexing Segmenting an array
Slicing Using an array of indices Boolean indexing
sql
Structured Query Language
JSON is a data format used to communicate semi-structured information. True False
T
Only authentication needs to be used to create a Twitter API object in python. True False
T
Outliers can sometimes be critical to finding convincing answers when analyzing data. True False
T
T/F Data science is an iterative process.
T
T/F Elements in numpy arrays must be all the same type.
T
T/F A Root Mean Square Error (RMSE) higher than our mean value would be too high. (Assume all values are positive)
T
T/F Final clusters are sensitive to initial centroids.
T
T/F It works out better mathematically to measure the impurity of a split in a decision tree, rather than the purity.
T
T/F The target variable is always categorical in classification.
T
The bag-of-words model tracks if a word appears to identify the sentiment or overall idea. True False
T
We can use '1' and 'True' interchangeably. True False
T
When you import a dataset using the read_csv function in pandas, the rows of the dataset are Series. True False
T
ndarrays are mutable. True False
T
For the merge() function in pandas, how does the parameter "how" handle row indices when how='inner'? The function takes the union of the row indices The function takes the intersection of the row indices The function takes the complement of the row indices
The function takes the intersection of the row indices
When working with cells in Jupyter, what does "_" refer to? The output of the last cell executed. A space in the line of code. A string character.
The output of the last cell executed.
What does the parameter unit refer to in the to_datetime function? The unit of the output The unit of the input
The unit of the input (for example, unit='s' for Unix seconds); the output of to_datetime is a datetime (datetime64) value, not an int64
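A sketch of the `unit` parameter, assuming a Series of Unix timestamps in seconds. The first value is the example from the Unix time card and the second is made up. The integers go in and datetime values come out.

```python
import pandas as pd

# Hypothetical column of Unix timestamps measured in seconds.
unix_seconds = pd.Series([1138537770, 1138537830])

# unit="s" tells pandas how to interpret the input integers.
parsed = pd.to_datetime(unix_seconds, unit="s")
```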
What requirement is needed to add two numeric numpy arrays? They need to have the same or compatible dimensions. They need to be of the same type. They need to be converted to type float first.
They need to have the same or compatible dimensions.
What does a negative correlation score mean? All the values of those features in our dataset are negative There is no correlation between those features in our data set Those features in our dataset are inversely correlated
Those features in our dataset are inversely correlated
Take a look at the following lines of code: a = np.array([2, 3]) b1 = np.array([1]) b2 = 1 True or False: a+b1 and a+b2 result in the same ndarray. True False
True
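The card's code can be run as written. Both the one-element array and the plain scalar are broadcast across `a`, so the two sums match.

```python
import numpy as np

a = np.array([2, 3])
b1 = np.array([1])  # shape (1,) is stretched to shape (2,)
b2 = 1              # a scalar is broadcast the same way

sum_with_array = a + b1
sum_with_scalar = a + b2
```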
Look at the following code: b = np.array([1,2,3]) b[1] = 'one' What error prints out after you run these two lines of code? SyntaxError NameError KeyError ValueError
ValueError (not the same type)
Which of the following statistics does the describe() function NOT return on dataframe columns? Min Variance Count Mean
Variance
When would you use the machine learning technique 'regression'?
When your model has to predict a numerical value.
What is the difference between deleting a column and popping a column? You can store a deleted column. You can store a popped column. There is no difference.
You can store a popped column.
What is the result of the following lines of code? a=np.array(["cat","dog","fish"]) b=np.array(["dog","fish","rabbit"]) print(np.setdiff1d(a,b)) ['cat'] ['rabbit'] ['dog' 'fish'] ['cat' 'dog' 'fish' 'rabbit']
['cat']
data engineering portion of process
acquire, prepare
import pandas as pd ser = pd.Series([100, 200, 300, 400, 500], index = ['tom', 'bob', 'nancy', 'dan', 'eric']) Which of the following calls have the same output? ser.loc[['tom','bob']] ser[[0,1]] ser.iloc[[0,1]] ser[['tom','bob']]
all
What are the 3 main string operations? split contains extract delete
all but delete
computational data science portion of process
analyze, report, act
What is the next step in building a classification model after the model is constructed and parameters are adjusted?
apply model to new data
How would you change the number 5 to 7 in this matrix? arr = np.array([1,2,3,4,5]) arr[0,5] = 7 arr[4] = 7 arr[5] = 7 arr[0,4] = 7
arr[4] = 7
What is the correct way to access elements of an array "arr" that are less than 0? arr[<0] arr[arr<0] arr[arr[,]<0]
arr[arr<0]
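A sketch of boolean indexing next to the other two access styles mentioned in this set (slicing and indexing with an array of indices). The array values are made up.

```python
import numpy as np

arr = np.array([3, -1, 0, -7, 5])

negatives = arr[arr < 0]  # boolean indexing: keep entries where the mask is True
first_three = arr[:3]     # slicing
picked = arr[[0, 4]]      # indexing with a list of positions
```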
What is the result of the following line of code? import numpy as np np.unique([1,1,3,4,2,3,3]) array([1,2,3,4]) array([1,3,4,2]) array([1,1,3,4,2,3,3]) array([4,3,2,1])
array([1,2,3,4])
Which analysis technique has the goal of finding a set of rules to capture associations between items or events? classification regression clustering association analysis graph analytics
association analysis
market basket analysis
association analysis used to predict customer purchasing patterns
Which command allows you to edit the view of the axes on a matplotlib plot? grid() plot() axis() arange()
axis()
Is age group a numeric or a categorical variable?
categorical
transforming a dataset (data munging): SCALING
changing the range of values to be between a specified range so one feature does not dominate the results. EX: height and weight
Analyze data step
classification (predict category) regression (predict numeric value) clustering (organize items into groups, create categories) association (find rules to capture association between items)
Two goals of pre-processing data
clean and transform dataset
csv
comma separated values
When thinking about data visualization, what element is critical in making you value the visualization? Context Raw values Plots and graphs
context
Which function call will allow you to group a dataset (titled dat) by 'time'? dat.groupby('time') groupby(dat['time']) dat.aggregate(['time']) dat['time'].group()
dat.groupby('time')
why is data cleaning necessary?
data is received downstream with little to no control on what is included, data cleaning is to ensure we are looking at "good" data and address data quality issues
what data step is where the most time is spent
data preparation (aPara)
What is the general syntax for calling the mode() function on a dataframe? data_frame[mode] mode(data_frame) data_frame.mode()
data_frame.mode()
What does calling the function dropna() return from a dataframe? dataframe with the rows containing the NA values dataframe omitting NA values dataframe with the columns containing the NA values dataframe with changed NA values by replacing NA values with 0
dataframe omitting NA values
What does the parameter expand represent in the split function? determines if the output is a dataframe creates every split string into a new column pipes the split strings into the output
determines if the output is a dataframe; by default str.split returns a Series in which each element is a list of the split pieces, and with expand=True each piece goes into its own column of a new DataFrame
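A sketch comparing the two outputs of `str.split`, using a made-up Series of underscore-separated names.

```python
import pandas as pd

names = pd.Series(["ada_lovelace", "alan_turing"])  # hypothetical values

as_lists = names.str.split("_")                 # Series of lists
as_columns = names.str.split("_", expand=True)  # DataFrame, one column per piece
```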
Which of the following will display the first 7 rows in a data frame object named df? df.head(n=7) df.head(7) df.head([7]) df.head(x=7) df[:7]
df.head(n=7) df.head(7) df[:7]
function for filling missing values with interpolation
df.interpolate(), fills missing values by linear interpolation between neighboring values (the default method)
histogram
df.plot.hist() shows distribution of data and skewness
function to check if a string has a certain character
df['column'].str.contains('hello')
function to return the first match of a regular expression found in a column
df['column'].str.extract()
function to replace a string with another
df['column'].str.replace()
function to separate column by a delimiter
df['column'].str.split('_')
What is the function call to find cells in a dataframe df with timestamp on 2007-02-04, given the dataframe has a parsed time column labelled 'parsed_time'? df['parsed_time'] == '2007-02-04' df['parsed_time'] = '2007-02-04' '2007-02-04' >= df['parsed_time'] >= '2007-02-04' '2007-02-05' > df['parsed_time'] > '2007-02-03'
df['parsed_time'] == '2007-02-04'
function for describing a data frame
df_name.describe()
Which function allows you to clean a data set by dropping NA values? na() dropna() drop(na) na(FALSE)
df_name.dropna()
function to remove missing values
df_name.dropna(axis=0); axis = 0 is the default and removes rows with missing values, axis = 1 removes columns with missing values
function for backfilling missing values
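df_name.fillna(method = 'backfill'), or df_name.bfill() in newer pandas versions, where the method argument is deprecated
function for forward filling missing values
df_name.fillna(method = 'ffill'), or df_name.ffill() in newer pandas versions, where the method argument is deprecated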
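A sketch comparing the missing-value functions from the cards above on one made-up column. It uses `ffill()` and `bfill()` rather than `fillna(method=...)`, since the `method` argument is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two gaps.
df = pd.DataFrame({"temp": [20.0, np.nan, 24.0, np.nan]})

dropped = df.dropna()            # rows with a missing value are removed
forward = df.ffill()             # each gap takes the previous valid value
backward = df.bfill()            # each gap takes the next valid value
interpolated = df.interpolate()  # gaps between values are filled linearly
```

Note that back filling cannot fill a gap at the very end of the column, because there is no later value to copy.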
indexer that selects rows by integer position
df_name.iloc[1]
indexer that selects rows by label
df_name.loc['apple'] df_name.loc[['apple', 'orange']]
function to sort data in ascending or descending order
df_name.sort_values(by = 'column name', ascending = True); use ascending = False for descending order
function for determining if all elements are true
dfname.all()
function for determining if any element is true
dfname.any()
function for measuring the dependencies between variables
dfname.corr()
What is the general syntax for calling the mode() function on a dataframe?
dfname.mode()
When is it acceptable to omit axis labels in plots using matplotlib?
during data exploration
broadcasting
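NumPy's rules for combining arrays of different but compatible shapes: dimensions of size 1 (or a scalar) are stretched so that the shapes align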
Suppose you are looking at a data set consisting of how much students liked a particular class. The rows are students. There are two columns: one asks if the student has taken the class and the other asks to rate the class on a scale of 1-10. However, you notice that some students have not taken the class and therefore do not want to include their ratings. Which data filtering technique would you use to clean this dataset? Slice out columns Filter out rows Transform data
filter rows
Which library can you use to easily create geographic overlays? Folium NumPy Matplotlib IPython
folium
What graph is best to show distribution of data?
histogram and box plot
Which algorithm to build classification models relies on the notion that samples with similar characteristics likely belong to the same class?
kNN
big data
large troves of data we mine for insights
In a decision tree, which nodes do NOT have test conditions?
leaf node
what is the downside to data transformation
less detailed data
What graph is best to show how values in your data change over time?
line graph
Which of the following does a boxplot NOT show you? Median Interquartile range Min and max values Mean
mean
Which is NOT a quality of good data visualization, according to Andy Kirk? Trustworthy Accessible Elegant Meaningful
meaningful
cleaning a dataset
merge duplicate values, remove outliers, estimate invalid values, remove missing values
There is a syntax error in the code below. np.array( [11,12,13],[21,22,23] ) How would you fix it to create the intended 2x3 array? np.array( [[11,12,13],[21,22,23]] ) np.array( [11,12,13,21,22,23] ) np.array( [11,12,13],[21,22,23] ) np.array[ [11,12,13,21,22,23] ]
np.array( [[11,12,13],[21,22,23]] )
How do you create a Rank 1 array with numpy using the numbers 1, 2, 3? np.ndarray([1,2,3]) np.array([1,2,3]) np.array(1,2,3) np.ndarray(1,2,3) np.ndarray([[1,2,3],[3,2,1]]) np.array([[1,2,3],[3,2,1]])
np.array([1,2,3])
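A sketch of the difference between the two constructors. `np.array` builds an array from the values it is given, while the low-level `np.ndarray` constructor reads its first argument as a shape and leaves the contents uninitialized.

```python
import numpy as np

rank1 = np.array([1, 2, 3])          # the values 1, 2, 3 in a 1-D array
shape_based = np.ndarray([1, 2, 3])  # [1, 2, 3] is read as the shape (1, 2, 3)
```

Only `rank1` holds the numbers 1, 2, 3. The contents of `shape_based` are whatever happened to be in memory.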
function to stack dataframes and create a new dataframe out of them
pandas.concat() stacks the dataframes; with the default outer join, columns that do not exist in every dataframe are filled with NaN, and matching columns are aligned under one label (original index labels are kept, so they may repeat)
function to merge dataframes
pd.merge(df1, df2, how = 'inner'); keeps only rows whose key appears in both dataframes, and the shared key column appears once in the result
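A sketch of `concat` and `merge` side by side. The `cat`/`dog`/`bunny` column names follow the earlier true/false card. All values, and the `id`, `score` and `city` columns, are made up.

```python
import pandas as pd

# concat: rows are stacked and the columns are the union of both inputs.
pets_a = pd.DataFrame({"cat": [1], "dog": [2]})
pets_b = pd.DataFrame({"dog": [3], "bunny": [4]})
stacked = pd.concat([pets_a, pets_b], ignore_index=True)

# merge: rows are matched on a key; how="inner" keeps keys found in both inputs.
left = pd.DataFrame({"id": [1, 2, 3], "score": [90, 75, 60]})
right = pd.DataFrame({"id": [2, 3, 4], "city": ["Lima", "Oslo", "Pune"]})
merged = pd.merge(left, right, how="inner", on="id")
```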
function to convert time to a timestamp format
pd.to_datetime(df_name['column'])
For which of the following scenarios would you use the analysis technique of classification? predicting the weather predict the price of a stock simulating sales of a new product predicting the score on a test
predicting the weather
transforming a dataset (data munging): Dimensionality Reduction
reduces the # of dimensions (features) in a dataset by keeping a smaller set of dimensions that still reflects most of the variability of the dataset
transforming a dataset (data munging): Aggregating data
reduces variability in a dataset by smoothing the data EX: aggregating data by fiscper rather than actual goods issue date
transforming a dataset (data munging): feature selection
removing irrelevant features, combining features, adding features (EX: ODO drop time - pulling out the hour)
examples of data cleaning
replacing values, estimating values, dropping fields
What is the correct word to describe an instance of an entity in your data?
sample
What are ways to manipulate the cleaned data into the format needed for analysis (select 5)? Scaling Transformation Feature selection Dimensionality reduction Data manipulation Denominate data
scaling, feature selection, dimensionality reduction, transformation, data manipulation
Which graphing method should you use to visualize the correlation between two arrays? Histogram Barplot Scatter plot Line plot
scatter
What graph is best to show correlation between two values?
scatter plot
What are the 2 main data structures in pandas?
series, data frame
Which is NOT mentioned in the course as a common similarity measure in cluster analysis? Euclidean distance Manhattan distance Cosine similarity Sine similarity
sine
In general, are classification and regression often supervised or unsupervised approaches?
supervised
A table with a primary key logically implies that the table cannot have a duplicate record. True False
T
Assert statements help locate bugs. True False
T
In assigning true or false values in a bag-of-words model, a missing word is equivalent to assigning a false value. True False
T
SQL for structured relational data can provide more operations than pandas data frames. True False
T
T/F When you search an incorrectly spelled term online, the suggested words are an example of machine learning.
T