Technical Interview
How would you do a cross-product of two tables in R?
The merge() function can be used to perform a cross-product in R. Suppose we have two tables: "employee_designation", which consists of "name" and "designation", and "employee_salary", which consists of "name" and "salary". Passing by=NULL pairs every row of the first table with every row of the second:
merge(employee_designation, employee_salary, by=NULL)
Can you write and explain some of the most common syntax in R?
# : as in many other languages, # introduces a comment. The R interpreter ignores everything from # to the end of the line, so comments can make code more readable by reminding future readers what a block of code is intended to do.
"" : quotes operate as one might expect; they denote a string (character) value in R.
<- : one of the quirks of R is that the conventional assignment operator is <- rather than the more familiar =. This is essential for anyone using R to know, so it is good to show your knowledge of it if the question comes up.
\ : the backslash, or reverse virgule, is the escape character in R strings. An escape character is used to "escape" (or ignore) the special meaning of certain characters and treat them literally, for example \" for a quote inside a quoted string.
How would you write a custom function in R? Give an example.
The general form is: <object-name> <- function(x){ <body> }
Let's look at an example. This function returns 100 for every element greater than 5 and 0 otherwise:
fun1 <- function(x){ ifelse(x>5, 100, 0) }
v <- c(1,2,3,4,5,6,7,8,9,10)
fun1(v) -> v
Explain the confusion matrix in R.
A confusion matrix can be used to evaluate the accuracy of a classification model. It is a cross-tabulation of observed and predicted classes. It can be computed using the confusionMatrix() function from the "caret" package.
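As a package-free sketch of the same idea, the cross-tabulation and the accuracy can be computed in base R with table(). The observed and predicted labels below are made-up illustration data, not taken from any model in this document.

```r
# Hypothetical observed and predicted class labels (illustration only)
observed  <- factor(c("yes", "no", "yes", "yes", "no", "no"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no", "yes", "no", "yes"),
                    levels = c("no", "yes"))

# Cross-tabulate predictions (rows) against observations (columns)
cm <- table(Predicted = predicted, Observed = observed)
print(cm)

# Accuracy = correctly classified cases (the diagonal) / all cases
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

The diagonal of the table holds the correct predictions; the off-diagonal cells are the two kinds of misclassification.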
What are the different data structures in R? Briefly explain about them.
A vector is a sequence of data elements of the same basic type. Members of a vector are called components. Lists are R objects that can contain elements of different types, such as numbers, strings, vectors, or even another list. A matrix is a two-dimensional data structure. Matrices can be built by binding vectors of the same length. All the elements of a matrix must be of the same type (numeric, logical, character, complex). A data frame is more general than a matrix, i.e., different columns can have different data types (numeric, character, logical, etc.). It combines features of matrices and lists, like a rectangular list.
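A minimal sketch of the four structures described above; all object names and values are chosen only for illustration.

```r
# Vector: every element has the same basic type
v <- c(1.5, 2, 3)

# List: elements may have different types, including other vectors
l <- list(1, "a", c(TRUE, FALSE))

# Matrix: two-dimensional, one type throughout (filled column by column)
m <- matrix(1:6, nrow = 2)

# Data frame: rectangular, but each column may have its own type
df <- data.frame(id = 1:3, label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

str(v); str(l); str(m); str(df)
```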
What is the difference between CLASS statement and BY statement in proc means?
Answer: Unlike CLASS processing, BY processing requires that your data already be sorted or indexed in the order of the BY variables. BY group results have a layout that is different from the layout of CLASS group results.
How to specify variables to be processed by the FREQ procedure?
Answer: By using TABLES Statement.
What is the difference between SAS functions and procedures?
Answer: Functions expect argument values to be supplied across an observation in a SAS data set, whereas a procedure expects one variable value per observation. For example:
data average; set temp; avgtemp = mean(of T1-T24); run;
Here the arguments of the MEAN function are taken across an observation: the function calculates the average of the different values in a single observation.
proc sort; by month; run;
proc means; by month; var avgtemp; run;
Here PROC MEANS calculates the average temperature by month, taking one value of avgtemp per observation and summarizing it within each value of month.
What is the difference between using drop = data set option in data statement and set statement?
Answer: If you do not want to process certain variables and do not want them to appear in the new data set, specify the drop= data set option in the SET statement. If you want to process certain variables but do not want them to appear in the new data set, specify the drop= data set option in the DATA statement.
What are the differences between PROC MEANS and PROC SUMMARY?
Answer: The two procedures compute the same statistics; the main difference is the default output. PROC MEANS prints its results by default, and produces subgroup statistics when a CLASS statement is used, or when a BY statement is used and the input data has previously been sorted (using PROC SORT) by the BY variables. PROC SUMMARY does not print anything by default, so you need the PRINT option, or an OUTPUT statement to create a new data set followed by PROC PRINT, to see the computed statistics. With a CLASS statement, the output data set holds statistics for all subgroups in one run, which you would otherwise get by repeatedly sorting the data set by the variables that define each subgroup and running PROC MEANS with a BY statement.
How does PROC SQL work?
Answer: PROC SQL processes all the observations simultaneously. The following steps happen when PROC SQL is executed:
1. SAS scans each statement in the SQL procedure and checks for syntax errors, such as missing semicolons and invalid statements.
2. The SQL optimizer scans the query inside the statement and decides how the query should be executed in order to minimize run time.
3. Any tables in the FROM clause are loaded into the data engine, where they can then be accessed in memory.
4. Code and calculations are executed.
5. The final table is created in memory.
6. The final table is sent to the output table described in the SQL statement.
What are the differences between sum function and using "+" operator?
Answer: SUM function returns the sum of non-missing arguments whereas "+" operator returns a missing value if any of the arguments are missing.
Name a few SAS functions.
Answer: SCAN, SUBSTR, TRIM, CATX, INDEX, TRANWRD, FIND, SUM
Give an example where SAS fails to convert character value to numeric value automatically?
Answer: Suppose value of a variable PayRate begins with a dollar sign ($). When SAS tries to automatically convert the values of PayRate to numeric values, the dollar sign blocks the process. The values cannot be converted to numeric values.
What is the difference between reading data from an external file and reading data from an existing data set?
Answer: The main difference is that when reading an existing data set with the SET statement, SAS retains the values of the variables from one observation to the next, whereas when reading data from an external file, only the observations are read, and the variables have to be re-declared if they need to be used.
What is the purpose of trailing @ and @@? How do you use them?
Answer: The trailing @ is a line-hold specifier (it should not be confused with the @n column pointer control). Using a trailing @ in the INPUT statement lets you read part of a raw data line, test it, and then decide how to read additional data from the same record. The single trailing @ tells SAS to "hold the line". The double trailing @@ tells SAS to "hold the line more strongly": an INPUT statement ending with @@ releases the current raw data line only when there are no data values left to be read from that line. The @@ therefore holds the input record even across multiple iterations of the DATA step.
How many data types are there in SAS?
Answer: There are two data types in SAS: character and numeric. Dates are stored as numeric values (the number of days since January 1, 1960) and are displayed as dates through date formats; there are built-in functions to work with them.
Given an unsorted data set, how to read the last observation to a new data set?
Answer: We can read the last observation into a new data set using the END= option of the SET statement. For example: data work.calculus; set work.comp end=last; if last; run;
Where do you use PROC MEANS over PROC FREQ?
Answer: We will use PROC MEANS for numeric variables whereas we use PROC FREQ for categorical variables
What is the function of output statement in a SAS Program?
Answer: You can use the OUTPUT statement to save summary statistics in a SAS data set. This information can then be used to create customized reports or to save historical information about a process. You can use options in the OUTPUT statement to specify the statistics to save in the output data set, specify the name of the output data set, and compute and save percentiles not automatically computed by the CAPABILITY procedure.
What are the default statistics for means procedure?
Answer: n-count, mean, standard deviation, minimum, and maximum
Give examples of "select" and "filter" functions from "dplyr" package.
birth_weight %>% select(1,2,3) -> birth   # selects the first three columns of the birth_weight dataset
birth_weight %>% filter(baby_wt>125 & smoke=="smoker") -> birth   # keeps only the rows with baby_wt>125 and smoke=="smoker"
How to limit decimal places for variable using PROC MEANS?
By using MAXDEC= option
How do you delete duplicate observations in SAS?
1. By using the NODUPS option in PROC SORT:
proc sort data=SAS-Dataset nodups; by var; run;
2. By using an SQL query inside PROC SQL:
proc sql; create table SAS-Dataset as select distinct * from Old-SAS-Dataset; quit;
3. By cleaning the data in a DATA step with BY-group processing:
set temp; by group; if first.group and last.group then ...; run;
Give examples of "rbind()" and "cbind()" functions in R
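cbind(): As the name suggests, it is used to bind columns together. One fact to keep in mind is that the columns being bound need to have the same number of rows. cbind(Marks,Percentage)
rbind(): It binds rows together in the same way; here the rows being bound need to have the same number of columns. rbind(Marks,Percentage)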
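A short sketch contrasting the two functions. The Marks and Percentage vectors are hypothetical values used only to show the resulting shapes.

```r
# Hypothetical vectors of equal length (illustration only)
Marks      <- c(78, 85, 92)
Percentage <- c(65.0, 70.8, 76.7)

# cbind() places each vector in its own column: 3 rows x 2 columns
by_col <- cbind(Marks, Percentage)
print(dim(by_col))

# rbind() places each vector in its own row: 2 rows x 3 columns
by_row <- rbind(Marks, Percentage)
print(dim(by_row))
```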
What is clustering? What is the difference between kmeans clustering and hierarchical clustering?
A cluster is a group of objects that belong to the same class. Clustering is the process of grouping a set of abstract objects into classes of similar objects. The typical requirements of clustering in data analysis are:
Scalability: we need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes: algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape: the clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bound to distance measures that tend to find only small spherical clusters.
High dimensionality: the clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
Ability to deal with noisy data: databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
Interpretability: the clustering results should be interpretable, comprehensible, and usable.
K-means clustering: K-means clustering is a well-known partitioning method. In this method objects are classified as belonging to one of K groups. The result of a partitioning method is a set of K clusters, with each object of the data set belonging to one cluster. Each cluster may have a centroid or cluster representative. For real-valued data, the arithmetic mean of the attribute vectors of all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases.
Hierarchical clustering: This method creates a hierarchical decomposition of the given set of data objects. Hierarchical methods are classified by how the decomposition is formed. There are two approaches: the agglomerative approach and the divisive approach.
Agglomerative approach: This is also known as the bottom-up approach. We start with each object forming a separate group, and keep merging the objects or groups that are close to one another until all of the groups are merged into one or until the termination condition holds.
Divisive approach: This is also known as the top-down approach. We start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This is done until each object is in its own cluster or the termination condition holds.
Hierarchical methods are rigid, i.e., once a merge or split is done, it can never be undone. The key practical difference is that k-means requires the number of clusters K to be chosen in advance, whereas hierarchical clustering builds a full hierarchy that can be cut at any level.
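Both methods can be tried in base R. The sketch below is illustrative only: it assumes the built-in iris measurements as data, picks K = 3, and uses complete linkage, none of which comes from the text above.

```r
# Use the four numeric iris columns, standardized so no feature dominates
features <- scale(iris[, 1:4])

# K-means: the number of clusters must be chosen up front (here K = 3)
set.seed(42)                          # k-means starts from random centroids
km <- kmeans(features, centers = 3, nstart = 10)
print(table(km$cluster))

# Agglomerative hierarchical clustering: build the full tree first...
hc <- hclust(dist(features), method = "complete")
# ...then cut it at whatever level gives the desired number of groups
hc_groups <- cutree(hc, k = 3)
print(table(hc_groups))
```

Calling plot(hc) draws the dendrogram, which shows the order in which groups were merged.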
What is a factor? How would you create a factor in R?
Conceptually, factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. One of the most important uses of factors is in statistical modeling: since categorical variables enter into statistical models differently than continuous variables, storing data as factors ensures that the modeling functions will treat such data correctly. Factor variables can be more memory efficient as well. A character vector can be converted into a factor using the as.factor() function: as.factor(fruit)->fruit
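A small sketch of creating factors; the fruit and size values are hypothetical. It also shows factor() with explicit levels, which is useful when the categories have a natural order.

```r
# Hypothetical character vector (illustration only)
fruit <- c("apple", "banana", "apple", "cherry")

# as.factor() stores the distinct values as levels
fruit_f <- as.factor(fruit)
print(levels(fruit_f))
print(table(fruit_f))

# factor() lets you set the level order explicitly, e.g. for ordered categories
size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"), ordered = TRUE)
print(size[1] < size[2])   # comparisons work on ordered factors
```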
What is correlation? How would you measure correlation in R?
Correlation is a measure to find the strength of association between two variables. We can use the cor() function in R to find the correlation coefficient
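For example, using the built-in iris data (chosen here only for illustration), cor() accepts either two vectors or a whole numeric data frame:

```r
# Pearson correlation (the default) between two numeric columns
r <- cor(iris$Sepal.Length, iris$Petal.Length)
print(r)

# Rank-based alternative that is less sensitive to outliers
r_spearman <- cor(iris$Sepal.Length, iris$Petal.Length, method = "spearman")
print(r_spearman)

# Passing several columns returns the full correlation matrix
cor_matrix <- cor(iris[, 1:4])
print(round(cor_matrix, 2))
```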
List down the reasons for choosing SAS over other data analytics tools
Data Handling Capabilities: SAS is on par with all leading tools, including R & Python, when it comes to handling huge amounts of data and options for parallel computation.
Graphical Capabilities: SAS provides functional graphical capabilities, and with a little bit of learning it is possible to customize these plots.
Advancements in Tool: SAS releases updates in a controlled environment, hence they are well tested. R & Python, on the other hand, have open contribution, and there are chances of errors in the latest developments.
Job Scenario: Globally, SAS is the market leader in available corporate jobs. In India, SAS controls about 70% of the data analytics market share.
How would you make multiple plots onto a single page in R?
For example, if you want to plot 4 graphs on the same page, you can use the command below before plotting: par(mfrow=c(2,2))
Explain what t-tests are in R.
In R, the t.test() function produces a variety of t-tests. The t-test is one of the most common tests in statistics and is used to determine whether the means of two groups are equal to each other.
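A minimal sketch of a two-sample t-test, assuming the built-in iris data and comparing the mean sepal length of two species; the choice of data and columns is only for illustration.

```r
# Keep two of the three species so there are exactly two groups
two_species <- subset(iris, Species %in% c("setosa", "versicolor"))
two_species$Species <- droplevels(two_species$Species)

# Formula interface: numeric response ~ two-level grouping factor.
# By default R runs Welch's test, which does not assume equal variances.
res <- t.test(Sepal.Length ~ Species, data = two_species)
print(res)

# The pieces of the result can be pulled out individually
print(res$p.value)
print(res$estimate)   # the two group means
```

A small p-value is evidence against the hypothesis that the two group means are equal.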
How would you join multiple strings together?
Joining strings in R is quite an easy task. We can do it either with the paste() function or with the str_c() function from the "stringr" package. Let's understand this with an example: we have the "fruit" vector, which comprises names of fruits, and we want to add the string "fruit" before the name of each fruit. First, let's have a glance at the "fruit" vector: print(fruit) Now, let's use the paste function: paste("fruit",fruit) Now, let's perform the same task using the str_c() function from the "stringr" package: str_c("fruit",fruit,sep="-")
Given a vector of values, how would you convert it into a time series object?
Let's say this is our vector:
a <- c(1,2,3,4,5,6,7,8,9)
To convert it into a time series object and plot it:
as.ts(a) -> a
ts.plot(a)
Write a custom function which will replace all the missing values in a vector with the mean of values.
Let's take this vector: a<-c(1,2,3,NA,4,5,NA,NA) Now, let's write the function to impute the values: mean_impute<-function(x){ ifelse(is.na(x),mean(x,na.rm = T),x) }
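Applying the function to the vector shows the effect; the block repeats the definition so that it runs on its own.

```r
# Replace each NA with the mean of the non-missing values
mean_impute <- function(x) {
  ifelse(is.na(x), mean(x, na.rm = TRUE), x)
}

a <- c(1, 2, 3, NA, 4, 5, NA, NA)

# The non-missing values 1..5 average to 3, so every NA becomes 3
imputed <- mean_impute(a)
print(imputed)   # 1 2 3 3 4 5 3 3
```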
What is Principal Component Analysis and how can you create a PCA model in R?
Principal Component Analysis is a method for dimensionality reduction. Often one observation is described by many dimensions (features), which makes the data hard to work with, so it is important to reduce the number of dimensions. The concept of Principal Component Analysis is this: the data is transformed to a new space with an equal or smaller number of dimensions. These dimensions (features) are known as principal components. The first principal component captures the maximum amount of variance from the features in the original data. The second principal component is orthogonal to the first and captures the maximum amount of the remaining variability. The same is true for each subsequent principal component: they are all uncorrelated, and each is less important than the previous one. We can do PCA in R with the prcomp() function: prcomp(iris[-5])->pca screeplot(pca)
What is R?
R is an open-source language and environment for statistical computing and analysis, or for our purposes, data science.
What is R Markdown? What is it used for?
R Markdown is a reporting tool provided by the "rmarkdown" package. With it you can create high-quality reports that combine your R code, its output, and narrative text. The output format can be HTML, PDF, or Word.
What is a Random Forest? How do you build and evaluate a Random Forest in R?
Random Forest is an ensemble classifier built from many decision tree models. It combines the results of the individual trees, and the combined result is usually better than that of any single tree. Let's build a random forest model on the "birth" data to predict the "smoke" column, i.e., whether the mother smokes or not. Start by dividing the data into train and test sets with sample.split() from the "caTools" package:
sample.split(birth$smoke, SplitRatio=0.65) -> mysplit
subset(birth, mysplit==T) -> train
subset(birth, mysplit==F) -> test
Build the random forest model on the train set with randomForest() from the "randomForest" package:
randomForest(smoke~., train) -> mod1
Now predict on the test set:
predict(mod1, test) -> result
To evaluate the model, cross-tabulate the predictions against the actual values:
table(test$smoke, result)
What is SAS?
SAS is a software suite for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics. It is developed by SAS Institute. SAS provides a graphical point-and-click user interface for non-technical users and more advanced options through the SAS language.
Tell me something about Shiny.
Shiny is an R package that makes it easy to build interactive web apps straight from R. You can host standalone apps on a webpage or embed them in Rmarkdown documents or build dashboards. You can also extend your Shiny apps with CSS themes, htmlwidgets, and JavaScript actions.
What are the steps to build and evaluate a linear regression model in R?
Start off by dividing the data into train and test sets. This step is vital because you will be building the model on the train set and evaluating its performance on the test set. You can do this using the sample.split() function from the "caTools" package. This function takes a split ratio, which you can specify according to your needs. Once you are done splitting the data into training and test sets, you can go ahead and build the model on the train set. The lm() function is used to build the model. Finally, you can predict the values on the test set using the predict() function.
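The steps above can be sketched in base R as follows. This is an illustration under stated assumptions: it uses the built-in mtcars data, a 70/30 split done with sample() instead of caTools::sample.split() so that no package is needed, and RMSE as the evaluation metric.

```r
set.seed(1)                                   # make the random split repeatable

# Step 1: split the rows into train (70%) and test (30%) sets
n         <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# Step 2: build the linear model on the train set only
model <- lm(mpg ~ wt + hp, data = train)
print(summary(model)$coefficients)

# Step 3: predict on the unseen test set
pred <- predict(model, newdata = test)

# Step 4: evaluate, here with the root mean squared error
rmse <- sqrt(mean((test$mpg - pred)^2))
print(rmse)
```

Evaluating on rows the model has never seen gives a more honest estimate of its performance than the fit on the training data.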
How would you rename the columns of a dataframe?
The "colnames()" function is used to rename the columns. colnames(fruits)<-c("name","cost")
What is advantage of using apply family of functions in R?
The apply function allows us to make entry-by-entry changes to data frames and matrices. The usage in R is as follows: apply(X, MARGIN, FUN, ...) where: X is an array or matrix; MARGIN is a variable that determines whether the function is applied over rows (MARGIN=1), columns (MARGIN=2), or both (MARGIN=c(1,2)); FUN is the function to be applied. If MARGIN=1, the function accepts each row of X as a vector argument, and returns a vector of the results. Similarly, if MARGIN=2 the function acts on the columns of X. Most impressively, when MARGIN=c(1,2) the function is applied to every entry of X. Advantage: With the apply function we can edit every entry of a data frame with a single line command. No auto-filling, no wasted CPU cycles
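A small sketch of the three MARGIN settings on a 2 x 3 matrix; the matrix values are chosen only for illustration.

```r
# matrix() fills column by column, so the rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2)

# MARGIN = 1: the function receives each row as a vector
row_sums <- apply(m, 1, sum)
print(row_sums)   # 9 12

# MARGIN = 2: the function receives each column as a vector
col_max <- apply(m, 2, max)
print(col_max)    # 2 4 6

# MARGIN = c(1, 2): the function is applied to every single entry
doubled <- apply(m, c(1, 2), function(v) v * 2)
print(doubled)
```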
List some of the functions that R provides.
Some of the functions and capabilities that R provides are: mean, median, distributions, covariance, regression, non-linear models, mixed-effects models, and GLM (generalized linear models).
How would you extract one particular word from a string?
The str_extract_all() function from the "stringr" package can be used to extract every match of a particular pattern from a string; str_extract() returns only the first match: str_extract(sparta, "Sparta!") Here sparta is a character vector and "Sparta!" is the pattern being extracted.
What is a White Noise model and how can you simulate it using R?
The white noise (WN) model is a basic time series model. It is the simplest example of a stationary process. A white noise model has: a fixed constant mean, a fixed constant variance, and no correlation over time. Simulating a white noise model in R:
arima.sim(model=list(order=c(0,0,0)),n=50) -> wn
ts.plot(wn)
Name some functions which can be used for debugging in R?
These are some functions which can be used for debugging in R: traceback() debug() browser() trace() recover()
How would you find the number of missing values in a dataset and remove all of them?
This Code gives the number of missing values-> sum(is.na(employee)) Now, let's delete the missing values: na.omit(employee)
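A runnable sketch of both steps, using a small hypothetical employee table (the names and salaries are invented for illustration):

```r
# Hypothetical data with one missing name and one missing salary
employee <- data.frame(
  name   = c("Ann", "Bob", NA, "Dee"),
  salary = c(50000, NA, 58000, 61000),
  stringsAsFactors = FALSE
)

# Total number of missing cells in the whole data frame
total_missing <- sum(is.na(employee))
print(total_missing)            # 2

# Missing cells per column, often more informative
print(colSums(is.na(employee)))

# na.omit() drops every row that contains at least one NA
clean <- na.omit(employee)
print(clean)                    # only the rows for Ann and Dee remain
```

Note that na.omit() returns a new object; the result has to be assigned for the rows to actually be removed.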
You have an employee data-set, which comprises of two columns->"name" and "designation", add a third column which would indicate the current date and time.
We can add the date using cbind() function cbind(employee,date())
From the below data-set, extract only those values where Age>60 and Sex="F"
We can do it using the "dplyr" package. "dplyr" is a package which provides many functions for data manipulation; one such function is filter(). Let's perform the desired task using the filter() function: AARP %>% filter(Age>60 & Sex=="F") With the above command, we keep only those rows where Age is greater than 60 and Sex is "F". The same result can be obtained in base R with subset() or which().
How would you check the distribution of a categorical variable in R?
We can use the table() function to find the distribution of categorical values. table(iris$Species) Now, let's find out the percentage distribution of these values. table(iris$Species)/nrow(iris)
Given a vector of numbers, how would you turn the values into scientific notation?
We have the below vector: a<-c(0.1324,0.0001234,234.21341324,09.324324) We can convert it into scientific notation using the "formatC()" function: formatC(a,format="e")
How would you do a left and right join in R?
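We have two data-sets: employee_salary and employee_designation. Let's do a left join on these two data-sets using the left_join() function from the "dplyr" package: left_join(employee_designation,employee_salary,by="name") This keeps every row of employee_designation. A right join, which keeps every row of employee_salary instead, works the same way with right_join(): right_join(employee_designation,employee_salary,by="name")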
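The same joins can also be sketched without dplyr, using base R's merge() with all.x and all.y. The two tables below are hypothetical and deliberately only partly overlapping, so that the difference between the joins is visible.

```r
# Hypothetical tables: Ann has no salary row, Dee has no designation row
employee_designation <- data.frame(
  name        = c("Ann", "Bob", "Cid"),
  designation = c("Analyst", "Manager", "Engineer"),
  stringsAsFactors = FALSE
)
employee_salary <- data.frame(
  name   = c("Bob", "Cid", "Dee"),
  salary = c(65000, 58000, 61000),
  stringsAsFactors = FALSE
)

# Left join: keep every row of the first table (Ann gets salary NA)
left <- merge(employee_designation, employee_salary, by = "name", all.x = TRUE)
print(left)

# Right join: keep every row of the second table (Dee gets designation NA)
right <- merge(employee_designation, employee_salary, by = "name", all.y = TRUE)
print(right)
```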
How would you find out the mean of one column w.r.t another?
We'll be using the mean() function from the "mosaic" package, which accepts a formula: mean(Sepal.Length~Species, data=iris) This command gives the mean value of Sepal.Length for each species of iris flower.
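If the mosaic package is not available, the same grouped means can be sketched in base R with tapply() or aggregate():

```r
# tapply(values, groups, function): one mean per level of Species
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
print(group_means)

# aggregate() gives the same numbers back as a data frame
agg <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(agg)
```

tapply() returns a named vector, while aggregate() is convenient when the result should stay in a data frame for further processing.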
Give examples of while and for loop in R.
While loop:
sparta <- "THIS IS SPARTAAA"
i <- 1
while (i <= 5) { print(sparta); i <- i+1 }
For loop:
fruits <- c("apple", "orange", "pomegranate")
for (i in fruits){ print(i) }
What packages are used for data mining in R?
data.table: provides fast reading of large files.
rpart and caret: for machine learning models.
arules: for association rule learning.
ggplot2: provides various data visualization plots.
tm: to perform text mining.
forecast: provides functions for time series analysis.
Name some functions available in "dplyr" package.
filter(), select(), mutate(), arrange(), count()
How would you fit a linear model over a scatter-plot?
ggplot(data = house,aes(y=price,x=living_area))+geom_point() we'll be adding the geom_smooth() layer on top of this, to fit a linear model. ggplot(data = house,aes(y=price,x=living_area))+geom_point()+geom_smooth(method = "lm")
How would you facet the data using ggplot2 package?
ggplot(house,aes(y=price,x=waterfront))+geom_boxplot()+facet_grid(.~waterfront)
How would you create a scatterplot using ggplot2 package?
ggplot(iris,aes(y=Sepal.Length,x=Petal.Length))+geom_point()
How do you concatenate strings in R?
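hello <- "Hello, "
world <- "World."
paste0(hello, world)
[1] "Hello, World."
Note that paste(hello, world) would insert an extra space between the two strings, because paste() separates its arguments with sep=" " by default.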
What is the difference between a bar-chart and a histogram? Where would you use a bar-chart and where would you use a histogram?
histograms are used to plot the distribution of a continuous variable and bar-charts are used to plot the distribution of a categorical variable. ggplot(data = iris,aes(x=Sepal.Length))+geom_histogram(fill="palegreen4",col="green") ggplot(data = iris,aes(x=Species))+geom_bar(fill="palegreen4")
How can you load a .csv file in R?
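house <- read.csv("C:/Users/John/Desktop/house.csv") Useful additional arguments include header=TRUE (the first row holds the column names) and na.strings (which strings should be read as NA).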
How do you install a package in R?
install.packages("<package_name>")
Name some packages in R, which can be used for data imputation?
missForest, imputeR, mice, Amelia
How would you build a Scatter-plot using plotly?
plot_ly(house,y=~price,x=~living_area,color=~rooms)
How would you create a box-plot using "plotly"?
plot_ly(house,y=~price,x=~rooms,color=~rooms,type="box")
What are the different import functions in R?
read.csv(): for reading .csv files (base R)
read_sas(): for reading .sas7bdat files ("haven" package)
read_excel(): for Excel sheets ("readxl" package)
read_sav(): for SPSS data ("haven" package)
What is the use of the with() and by() functions in R?
The with() function evaluates an expression using the columns of a dataset: with(data, expression), for example with(randomDataSet, expression.test(sample)). The by() function applies a function to each level of a factor (or each combination of several factors): by(data, factorlist, function, ...)
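A brief sketch of both functions, assuming the built-in mtcars data purely for illustration:

```r
# with(): refer to columns by name without repeating "mtcars$"
avg_mpg <- with(mtcars, mean(mpg))
print(avg_mpg)

# by(): split the data frame by a grouping variable and apply a function
# to each piece; here, the mean mpg for each number of cylinders
by_cyl <- by(mtcars, mtcars$cyl, function(d) mean(d$mpg))
print(by_cyl)
```

The function passed to by() receives one sub-data-frame per group, so it can use any of the columns, not just one.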