Data science - train

34. What are all types of user defined functions?

There are three types of user defined functions: scalar functions, inline table-valued functions, and multi-statement table-valued functions. A scalar function returns a single value of the type defined in its RETURN clause; the other two types return a table.

17. What is a Cursor?

A database cursor is a control structure that enables traversal over the rows or records in a table. It can be viewed as a pointer to one row in a set of rows. Cursors are very useful for traversal operations such as retrieval, addition, and removal of database records.

48. How to select unique records from a table?

Select unique records from a table by using the DISTINCT keyword. Example: SELECT DISTINCT StudentID, StudentName FROM Student.

7. What is a unique key?

A unique key constraint uniquely identifies each record in the database and provides uniqueness for a column or set of columns. A primary key constraint has an automatic unique constraint defined on it, but that is not the case for a unique key. There can be many unique constraints defined per table, but only one primary key constraint per table.

6. What is a primary key?

A primary key is a combination of fields which uniquely specify a row. This is a special kind of unique key, and it has implicit NOT NULL constraint. It means, Primary key values cannot be NULL.

How is logistic regression done?

Logistic regression measures the relationship between the dependent variable and one or more independent variables by estimating probability using its underlying logistic function (sigmoid).
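
As an illustration, the underlying sigmoid can be written in a few lines of Python (a minimal sketch; the variable names are our own):

from math import exp

def sigmoid(z):
    # maps any real number to a probability in (0, 1)
    return 1 / (1 + exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(2.5))  # ~0.924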

54. What is collaborative filtering?

Most recommender systems use this filtering process to find patterns and information by collaborating perspectives, numerous data sources, and several agents.

How can precision and recall rate be calculated using a confusion matrix in machine learning?

Precision and recall can be calculated from a confusion matrix in machine learning. Precision is calculated as the ratio of true positives to the sum of true positives and false positives; recall is calculated as the ratio of true positives to the sum of true positives and false negatives.
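
As a sketch, both rates can be computed directly from confusion-matrix counts (the counts below are made up for illustration):

tp, fp, fn = 40, 10, 5      # hypothetical confusion-matrix counts
precision = tp / (tp + fp)  # 0.8
recall = tp / (tp + fn)     # ~0.889
print(precision, recall)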

52. What are recommender systems?

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.

65. What are the types of biases that can occur during sampling?

Selection bias, undercoverage bias, and survivorship bias.

64. What is selection bias?

Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample.

57. What are the drawbacks of the linear model?

The assumption of linearity of the errors; it can't be used for count outcomes or binary outcomes; and there are overfitting problems that it can't solve.

What is stationarity in time-series data, and how can it be determined in machine learning?

Stationarity in time-series data refers to the constancy of the mean and variance of the series over time. In machine learning, stationarity can be determined by plotting the series and visually inspecting for trends, seasonality, and irregularity. More rigorous tests, such as the Augmented Dickey-Fuller (ADF) test, can also be used to determine stationarity.
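
A minimal sketch of the ADF test with statsmodels (the white-noise series here is generated purely for illustration):

import numpy as np
from statsmodels.tsa.stattools import adfuller

series = np.random.normal(size=200)  # white noise, stationary by construction
adf_statistic, p_value = adfuller(series)[:2]
print("ADF statistic:", adf_statistic)
print("p-value:", p_value)  # a small p-value suggests the series is stationary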

19. What is a query?

A DB query is code written in order to get information back from the database. A query can be designed so that it matches our expectation of the result set. Simply put, it is a question to the database.

23. What is a trigger?

A DB trigger is code or a program that automatically executes in response to some event on a table or view in a database. Triggers mainly help to maintain the integrity of the database. Example: when a new student is added to the student database, new records should be created in related tables such as the Exam, Score, and Attendance tables.

1. What is DBMS?

A Database Management System (DBMS) is a program that controls creation, maintenance and use of a database. DBMS can be termed as File Manager that manages data in a database rather than saving it in file systems.

What is a ROC curve, and how is it used in machine learning?

A Receiver Operating Characteristic (ROC) curve is a graphical representation of the performance of a binary classifier system. It is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. ROC curve is used to evaluate the performance of a machine learning algorithm by assessing its ability to discriminate between positive and negative classes. It is commonly used in medical diagnosis, fraud detection, and spam filtering, among other applications.
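
A minimal sketch with scikit-learn (the labels and scores are made up for illustration):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                # hypothetical true labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # hypothetical classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(fpr, tpr)
print("AUC:", roc_auc_score(y_true, y_score))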

48. What are the feature vectors?

A feature vector is an n-dimensional vector of numerical features that represents an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that's easy to analyze.

8. What is a foreign key?

A foreign key is a field (or set of fields) in one table that refers to the primary key of another table. A relationship is created between two tables by referencing the foreign key of one table with the primary key of the other.

How do you build a random forest model?

A random forest is built up of a number of decision trees. Steps to build a random forest model:
1. Randomly select 'k' features from a total of 'm' features, where k << m.
2. Among the 'k' features, calculate the node D using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat steps two and three until leaf nodes are finalized.
5. Build the forest by repeating steps one to four 'n' times to create 'n' trees.
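
These steps are what implementations such as scikit-learn's RandomForestClassifier carry out internally; a minimal sketch on a toy dataset (the dataset and parameter choices are our own):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# n_estimators plays the role of 'n' trees; max_features the 'k' features per split
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))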

What is a recommender system and how does it work in machine learning?

A recommender system is a type of machine learning algorithm that predicts a user's preferences for a particular product or service. It can be classified into two categories: collaborative filtering and content-based filtering. Collaborative filtering recommends items based on the user's history and preferences, while content-based filtering recommends items based on their attributes and features.

40. What is recursive stored procedure?

A recursive stored procedure is a stored procedure that calls itself until it reaches some boundary condition. Recursive functions or procedures help programmers use the same set of code any number of times.

20. What is subquery?

A subquery is a query within another query. The outer query is called the main query, and the inner query is called the subquery. The subquery is always executed first, and its result is passed on to the main query. Example: a common customer complaint at the MyFlix Video Library is the low number of movie titles, and management wants to buy movies for the category with the least number of titles. A query like the following can be used: SELECT category_name FROM categories WHERE category_id = (SELECT MIN(category_id) FROM movies);

5. What are tables and Fields?

A table is a set of data organized in a model with columns and rows. Columns are vertical, and rows are horizontal. A table has a specified number of columns, called fields, but can have any number of rows, which are called records. Example: Table: Employee. Fields: Emp ID, Emp Name, Date of Birth. Data: 201456, David, 11/15/1960.

14. What is a View?

A view is a virtual table consisting of a subset of the data contained in one or more tables. Views are not physically present, so they take less space to store. A view can combine data from one or more tables, depending on the relationship.

Q: Write a program that prints numbers from 1 to 50, replacing multiples of three with "Fizz", multiples of five with "Buzz", and multiples of both with "FizzBuzz" in Python.

A:
for i in range(1, 51):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)

Q: Calculate the Euclidean distance between two points [1,3] and [2,5] in Python.

A:
from math import sqrt

plot1 = [1, 3]
plot2 = [2, 5]
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
print(euclidean_distance)

Q: How should you maintain a deployed machine learning model?

A: A deployed machine learning model should be constantly monitored to determine its performance accuracy, evaluated using metrics to decide if a new algorithm is needed, compared to other models to find the best performer, and rebuilt on the current state of data using the best-performing model.

Q: How do you build a random forest model?

A: A random forest is built up of a number of decision trees. 'K' features are randomly selected from a total of 'm' features, and the node is calculated using the best split point. The node is then split into daughter nodes using the best split, and the process is repeated until leaf nodes are finalized. The process is repeated for 'n' times to create 'n' number of trees and build a forest.

Q: What is dimensionality reduction, and what are its benefits?

A: Dimensionality reduction refers to the process of reducing the number of dimensions in a data set with many dimensions, resulting in data with fewer fields that convey similar information concisely. Its benefits include compressing data and reducing storage space, reducing computation time, removing redundant features, and improving accuracy by removing noise.
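
As a sketch, principal component analysis (one common dimensionality reduction technique, our choice for illustration) with scikit-learn:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)         # reduce 4 features to 2 components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance each component retains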

Q: How is logistic regression done?

A: Logistic regression measures the relationship between the dependent variable and one or more independent variables by estimating probability using the sigmoid function, which maps any real-valued input to a probability between 0 and 1. The model is trained on the data to predict the dependent variable based on the independent variables.

Q: How do you handle missing data values in a data set with more than 30 percent missing values?

A: Missing data values can be handled by removing rows with missing data in larger datasets, or by substituting missing values with the mean of the rest of the data using a pandas DataFrame in Python, for example df.fillna(df.mean()).
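
A minimal pandas sketch of the mean-substitution approach (the column and values are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 30, np.nan, 28]})  # hypothetical data
df["age"] = df["age"].fillna(df["age"].mean())  # replace gaps with the column mean
print(df)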

Q: How can you avoid overfitting your model?

A: Overfitting can be avoided by keeping the model simple, using cross-validation techniques like k-fold cross-validation, and using regularization techniques like LASSO that penalize certain model parameters if they're likely to cause overfitting.

Q: What is the difference between supervised and unsupervised learning?

A: Supervised learning uses known and labeled data as input with a feedback mechanism, while unsupervised learning uses unlabeled data as input without a feedback mechanism. Common supervised learning algorithms include decision trees, logistic regression, and support vector machines, while common unsupervised learning algorithms include k-means clustering, hierarchical clustering, and the apriori algorithm.

Q: Calculate the eigenvalues and eigenvectors of the matrix [-2, -4, 2; -2, 1, 2; 4, 2, 5] in Python.

A: The eigenvalues of the matrix are 3, -5, and 6, and their corresponding eigenvectors can be calculated using methods such as matrix diagonalization or eigendecomposition.
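
In practice this is usually done with NumPy's eigendecomposition; a minimal sketch:

import numpy as np

A = np.array([[-2, -4, 2],
              [-2,  1, 2],
              [ 4,  2, 5]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # approximately 3, -5, 6
print(eigenvectors)  # one eigenvector per column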

Q: Explain the steps in making a decision tree.

A: The entire data set is taken as input, and entropy is calculated for the target variable and predictor attributes. The information gain of all attributes is then calculated, and the attribute with the highest information gain is chosen as the root node. The same procedure is repeated on every branch until the decision node of each branch is finalized. A typical example is a decision tree for deciding whether to accept or decline a job offer.

Q: What are the feature selection methods used to select the right variables?

A: Two main methods for feature selection are filter and wrapper methods. Filter methods include linear discrimination analysis, ANOVA, and Chi-Square. Wrapper methods include forward selection, backward selection, and recursive feature elimination. Filter methods clean up the data coming in, while wrapper methods are more labor-intensive and require high-end computers for large datasets.

Q: Differentiate between univariate, bivariate, and multivariate analysis.

A: Univariate analysis deals with one variable, bivariate analysis involves two variables, and multivariate analysis involves three or more variables. Univariate analysis describes the data and finds patterns, while bivariate analysis determines relationships between variables. Multivariate analysis is similar to bivariate but contains more than one dependent variable.

42. What is an ALIAS command?

An ALIAS name can be given to a table or column. The alias name can be referred to in the WHERE clause to identify the table or column. Example: SELECT st.StudentID, Ex.Result FROM Student st, Exam AS Ex WHERE st.StudentID = Ex.StudentID. Here, st is the alias for the Student table and Ex is the alias for the Exam table.

How can accuracy be calculated using a confusion matrix in machine learning?

Accuracy can be calculated using a confusion matrix in machine learning by summing the true positive and true negative values and dividing by the total number of observations. The formula for accuracy is: Accuracy = (True Positive + True Negative) / Total Observations

44. What are aggregate and scalar functions?

Aggregate functions are used to evaluate a mathematical calculation and return a single value, calculated from the columns in a table. Scalar functions return a single value based on the input value. Examples: Aggregate - MAX(), COUNT(), calculated with respect to numeric values. Scalar - UCASE(), NOW(), calculated with respect to strings.

15. What is an Index?

An index is a performance tuning method that allows faster retrieval of records from a table. An index creates an entry for each value, making it faster to retrieve data.

13. What is a Graph database?

Ans: A graph database is one of the most important of all databases. It is designed specifically for storing and navigating data relationships: nodes store entity information, and edges store the relationships between entities. This type of database is used by banks, social media, news channels, etc.

4. What is the role of the aggregate_oriented database?

Ans: An aggregate is a collection of data that interacts with other data as a unit. Using ACID operations on key-value aggregates, all data can be viewed as a form of aggregate-oriented database. This helps to manage storage over a cluster and often reduces computation.

12. How to increase scalability in the NoSQL database?

Ans: These databases are heavy and need a good server configuration. To increase scalability, you can scale the database vertically or horizontally. Vertically, you can increase the RAM and use an SSD hard disk so the machine runs faster; horizontally, you can add more machines. Either way, you can increase the scalability of NoSQL.

4. If I learn NoSQL, what will be my future career scope?

Ans: Data science is booming nowadays. It is all about managing a huge amount of data by adopting big data methodology. Other types of databases do not scale to business of this size, while NoSQL is seeing high demand in business, so it offers very fast career growth.

8. Clarify the key value in the NoSQL database?

Ans: Generally, in a database we store data in tables. In NoSQL, we usually store data in a hash table, where every entry has a unique key. If you are looking up data, using a key-value store is a better option than working with joins: the key picks the data up faster from the hash table.

11. How can you perform column view data presentation in NoSQL?

Ans: If you are looking for highly analytical output, you can use column view data presentation. This kind of NoSQL stores huge amounts of analytical data in columns rather than rows, and you can also build subgroups by collecting columns. You don't need to give key names to this type of database. It is mainly recommended for data belonging to the data science field.

4. Can you tell me when you should use NoSQL in place of the normal database?

Ans: If you are looking for key-value stores with massively high-speed performance, you can use NoSQL. Relational databases use ACID transactions, and this schema-based processing slows down database performance. Suggested situations in which to use NoSQL are: A. If you use multiple JOIN queries. B. If the client wants a high-traffic site. C. If you are using denormalized data.

5. Write down the script for NoSQL DB config?

Ans: If you are looking to build a NoSQL DB connection repeatedly, you need the admin CLI commands. They can be scripted in different ways: for example, you can build a file that stores a sequence of commands to run, using any programming language suitable for the particular database.

15. What do u know about database sharding in the NoSQL database?

Ans: In NoSQL, database sharding means partitioning the database into pieces, in patterns suited for the NoSQL age. By sharding, you can store data on different, potentially separate servers around the world. A database administrator can then access the stored data easily, with high performance, from all over the world.

7. What is eventual consistency in the context of NoSQL?

Ans: Eventual consistency means that, once the service logic has executed and no new updates arrive, the distributed database system will settle into a consistent state. You can use this concept to increase data availability.

14. Explain the CAP theorem in NoSQL?

Ans: The CAP theorem describes three guarantees for a distributed database: Consistency, Availability, and Partition tolerance. It states that a distributed system can provide at most two of these three guarantees at the same time, even while the nodes work together seamlessly in the network.

3. How is this impedance mismatch happening in the database?

Ans: Let's talk about the main difference between NoSQL and relational databases. Impedance mismatch is a problem that happens due to the mismatch between the database model and the programming language's in-memory structures. If you want to use a richer in-memory structure, it has to be translated into the relational model to be stored on disk; as a result, an impedance mismatch occurs.

10. What kind of data can we manage in NoSQL?

Ans: NoSQL databases mainly manage semi-structured as well as unstructured data. Moreover, they have a flexible data model.

1. Is NoSQL occurring in a normal database table?

Ans: NoSQL does not mean "no to SQL"; SQL is still there. It works in non-tabular form: you do not need to create any tables for this type of database. By using NoSQL, you can improve database performance. Database developers also use NoSQL alongside dynamic SQL when making parameterized queries.

1. What do you understand by NoSQL?

Ans: Nowadays, developers deal with a large volume of data, which is called big data, and with it naturally come big complexity and big issues. As more systems go online, the data load increases. NoSQL helps to manage unstructured, messy, and complicated data. It is not a traditional database or relational database management system.

6. What do you know about polyglot persistence in NoSQL?

Ans: Polyglot persistence is a hybrid concept: an application uses several different databases together, each chosen for what it does best. Consider an e-commerce web application with a huge database of carts that must be highly available to the buyer; that cart store stays easy to manage under this hybrid polyglot approach, while other stores help the application give suggestions to buyers. Debugging such mixed systems can still pose tough, complex problems.

9. Why do we use impala in the NoSQL database?

Ans: When database administrators handle big data with the Hadoop system, Impala provides parallel processing in database technology. You can also run low-latency queries using Impala. Due to this parallel processing, data-fetching time is reduced.

5. Do you give me any idea which particular NoSQL database is most demanding?

Ans: See, there are many database systems under NoSQL, but MongoDB is the most helpful and efficient, as it is a document-based NoSQL database. It is also case-sensitive. MongoDB is the best choice if anyone wants to do read and write operations in the database.

6. What is your opinion on NoSQL replacing SQL?

Ans: The answer is yes. As per market demand, the database landscape is changing and being replaced by NoSQL: it can manage big data, it costs less, and the latest technologies are compatible with this new kind of database, while the traditional database is costly and does not match new technologies.

3. Can you tell me what the main principle of NoSQL is?

Ans: The main principle of NoSQL is to make the database highly available.

2. What is the main target of NoSQL?

Ans: The main target of NoSQL is to provide an alternative to SQL databases. It makes it easy to store textual data in a database, including data in a non-structured format.

1. How does the NoSQL database control machine price range memory?

Ans: The replication node, which stores the NoSQL database's data, is also the primary consumer of memory. The Java heap size and the cache size that the replication node can utilize are the critical elements for performance. By default, these are calculated by NoSQL from the amount of memory available to the storage node. Specifying the available memory for a storage node is recommended; the memory is divided evenly among all the replication nodes if the storage node hosts more than one.

2. How many types of mechanism works in NoSQL? Write down their name?

Ans: There are four types of mechanisms: A. Graph database. B. Key-value store. C. Document-oriented. D. Column view presentation.

5. Write down the NoSQL's different features?

Ans: These are different features of NoSQL: A. It can store a big amount of unstructured, structured, and semi-structured data. B. It is object-oriented programming based, which is best for a web application. C. It is agile, sprints based, which is best for project management. D. It is cost-effective with scale-out architecture and efficiency.

8. Explain the base property of the NoSQL database?

Ans: These are the BASE properties of NoSQL: A. Basically Available: stored data remains available even after multiple failures. B. Soft state: the state of the system may change over time, even without input, as consistency is not enforced at every moment the way it is in the ACID model. C. Eventual consistency: the system becomes consistent over time.

2. What do you know about Big SQL in NoSQL?

Ans: Big SQL is developed by IBM. It is a high-speed-performance database engine that follows the MPP (massively parallel processing) SQL model for large amounts of data managed by Hadoop. Mainly enterprise data is stored by this process. Using Big SQL, you can access data from across the organization with the permission of the database administrator, and it is fully secured. Mainly banking industries are using it.

9. What is a hash table? How does it work in NoSQL?

Ans: A hash table is a data structure that provides an associative array of abstract data types. It is used in complex databases, and you need to write hash-code-based queries in this type of database: the hash of the key locates the data.

10. What is the meaning of document-oriented DB?

Ans: Document orientation is one of the features of the NoSQL database. It helps to store data schema-free; JavaScript Object Notation (JSON) is used, and scalability is higher. Projects can be developed faster at low cost, too. You can use the following document DBs: A. MongoDB B. Amazon DocumentDB C. Microsoft Azure Cosmos DB

7. Can we use NoSQL in an Oracle-based database?

Ans: Yes, NoSQL is applicable in the Oracle database to record data. This database helps to find data records through external table functions. It also makes it easier to perform some queries on the Oracle-based database, and it is very flexible and key-value based.

9. What do you think, NoSQL uses normalization?

Ans: Yes, normalization is used by the NoSQL database. One famous NoSQL database, Cassandra, relies on normalization to find stored data: it creates a series of tables related to the different fields, and all these fields are given true values in the table.

7. Can I learn NoSQL easily?

Ans: Yes, of course, you can learn NoSQL easily and quickly. It is a bit different from the traditional database, but it follows easily understandable logic. Here you don't need to maintain schemas or normalize at all, so your workload will be less.

8. Name a few of the companies that are using NoSQL?

Ans: There are lots of companies that are using NoSQL. Mostly these companies handle a huge volume of data and use AI and data science to predict future business; in this situation, NoSQL is the best solution. Companies include: A. Google B. Amazon C. Netflix D. Facebook

6. If we ask you to track data record relations in NoSQL, how will you do?

Ans: You can follow these steps to track data record relations in NoSQL: A. First, embed all the data in a user object. B. Then create the user id credentials. C. Using the login id, attach comment values with a list of comments. D. The expected data can then be found.

Question 23: What is a basic SQL query to list all orders with customer information if you have separate tables for orders and customers with specific columns?

Answer: A basic SQL query to list all orders with customer information, given separate Order and Customer tables with these columns, is as follows:
SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM Order
JOIN Customer ON Order.CustomerId = Customer.Id

What is a normal distribution?

Answer: A normal distribution, also known as a Gaussian distribution, is a probability distribution that shows data near the mean and the frequency of that data. When represented in graphical form, a normal distribution appears as a bell curve. Parameters of the normal distribution include mean, standard deviation, and median.

What is a recurrent neural network (RNN)?

Answer: A recurrent neural network (RNN) is an algorithm that uses sequential data, and is commonly used in language translation, voice recognition, and image captioning. There are different types of RNN networks, including one-to-one, one-to-many, many-to-one, and many-to-many.

What is the difference between an error and a residual error?

Answer: An error is the difference between the actual value and the predicted value, while a residual error is the difference between the observed data and the predicted data. Errors are generally unobservable, while residual errors can be represented using a graph to show how the sample population data and the observed data differ from each other.

What is the difference between a box plot and a histogram?

Answer: Both box plots and histograms visually denote the frequency of a certain feature's values. However, box plots are often used to compare several datasets and take less space while containing fewer details compared to histograms. On the other hand, histograms are used to understand the probability distribution underlying a dataset.

What is deep learning?

Answer: Deep learning is an important aspect of Data Science that involves creating algorithms that resemble the human brain. Multiple layers are formed from the raw input to extract high-level features, making the model's processing more similar to human thought.

What is entropy in a decision tree algorithm?

Answer: Entropy in a decision tree algorithm is a measure of randomness or disorder in a set of observations, and is used to determine how the decision tree splits the data. Entropy is also used to check the homogeneity of the given data.

Question 29: What must also be true if the association rules algorithm finds the two rules {banana, apple} => {grape} and {apple, orange} => {grape} to be relevant?

Answer: If the association rules algorithm finds the two rules {banana, apple} => {grape} and {apple, orange} => {grape} to be relevant, then {grape, apple} must be a frequent itemset.

What information is gained in a decision tree algorithm?

Answer: In a decision tree algorithm, information gain is the expected reduction in entropy that results from splitting the data. Information gain helps determine how the tree is built, and can be used to extract the best features from the data.

What is k-fold cross-validation?

Answer: K-fold cross-validation is a method used to estimate a model's skill on new data. It involves dividing the data set into k equal parts, training the model on k-1 parts, and testing it on the remaining part. This process is repeated k times, with each part used for testing once.
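
A minimal sketch with scikit-learn (the model and dataset are our own choices for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # k = 5 folds
print(scores)         # one score per fold
print(scores.mean())  # averaged estimate of the model's skill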

What is the difference between long format data and wide format data?

Answer: Long format data contains values that repeat in the first column and each row is a one-time point per subject. Wide format data has the data's repeated responses in a single row, with each response recorded in separate columns.

What are Markov chains?

Answer: Markov chains are a type of stochastic process that describes a system's state transitions where the future probability depends only on its current state. In other words, the probability of moving from one state to another depends only on the current state and not the history of the process. Markov chains are often used in machine learning algorithms like natural language processing for generating recommendations based on previous data.

What does NLP stand for?

Answer: NLP stands for natural language processing. It is a branch of machine learning that deals with the study of how computers can learn from textual data through programming. Some examples of NLP include stemming, sentiment analysis, tokenization, and the removal of stop words.

What is the difference between normalization and standardization?

Answer: Normalization and standardization are techniques used in data preprocessing to transform data into a specific range or distribution. Normalization involves converting data values to lie between 0 and 1, also known as min-max scaling. Standardization, on the other hand, converts data so that it has a mean of 0 and a standard deviation of 1.
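
A minimal sketch of both transforms with scikit-learn (the single feature column is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1.0], [5.0], [10.0]])      # hypothetical feature column
print(MinMaxScaler().fit_transform(data))    # normalization: values scaled to [0, 1]
print(StandardScaler().fit_transform(data))  # standardization: mean 0, std 1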

What is the difference between point estimates and confidence interval?

Answer: Point estimates are specific values used to estimate a population parameter, while confidence intervals are ranges of values likely containing the population parameter. Confidence intervals tell us how likely a particular interval is to contain the population parameter, and the confidence coefficient (or level) denotes the probability or likeness of the interval containing the parameter.

What are some popular libraries used in Data Science?

Answer: Popular libraries used in Data Science include TensorFlow, Pandas, NumPy, SciPy, Scrapy, Librosa, and Matplotlib.

What is pruning in a decision tree algorithm?

Answer: Pruning is a technique used in decision tree algorithms to simplify the tree by reducing the number of rules. Pruning helps to avoid complexity and improve accuracy, and can be accomplished through methods such as reduced error pruning or cost complexity pruning.

Why is Python commonly used for data cleaning in Data Science?

Answer: Python is commonly used for data cleaning in Data Science due to its powerful libraries, such as Pandas and Matplotlib, which allow for efficient and effective data cleaning and analysis.

Why is R used in data visualization?

Answer: R is commonly used in data visualization because it has several libraries and built-in functions that allow users to create almost any type of graph. R also makes it easier to customize graphics compared to other programming languages like Python. Additionally, R is used in feature engineering and exploratory data analysis.

What are some techniques used for sampling? What is the main advantage of sampling?

Answer: Some techniques used for sampling include probability and non-probability sampling. The main advantage of sampling is that it allows data scientists to estimate the characteristics of an entire population by selecting a subset of individual members for analysis.

Question 25: Which machine learning algorithm is appropriate for inputting missing values of both categorical and continuous variables?

Answer: The appropriate machine learning algorithm for imputing missing values of both categorical and continuous variables is the K-nearest neighbor (KNN) algorithm. When a value is missing, KNN computes the nearest neighbors based on all the other features and fills the value in from them. Linear regression and K-means clustering require preprocessing to handle missing values, and decision trees have the same problem, although with some variance.
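
scikit-learn ships a KNN-based imputer; a minimal sketch (the data and its gap are made up):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1, 2], [3, np.nan], [5, 6], [7, 8]])  # hypothetical data with a gap
imputer = KNNImputer(n_neighbors=2)  # fill the gap from the 2 nearest rows
print(imputer.fit_transform(X))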

What is the bias-variance trade-off?

Answer: The bias-variance trade-off is a concept in machine learning that describes the relationship between a model's bias and variance. Bias refers to the error caused by oversimplification, leading to underfitting, while variance refers to the error caused by overcomplication, leading to overfitting. The goal of a machine learning algorithm is to have low bias and low variance to achieve the best performance.

Question 26: What is the entropy of a given target variable with eight actual values of 0s and 1s?

Answer: For a target variable with eight values, five of one class and three of the other, the entropy is calculated with p = 5 and n = 3: Entropy = -(5/8 log2(5/8) + 3/8 log2(3/8)) ≈ 0.954.
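
A quick check of this arithmetic in Python (a minimal sketch):

from math import log2

p, n = 5, 3  # five values of one class, three of the other, out of eight
total = p + n
entropy = -(p/total * log2(p/total) + n/total * log2(n/total))
print(entropy)  # ~0.954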

Question 28: Which algorithm is most appropriate for finding all users who are most similar to each of the four specific individual types identified after studying the behavior of a population?

Answer: The most appropriate algorithm for finding all users who are most similar to each of the four specific individual types identified after studying the behavior of a population is K-means clustering. As we are looking to group people by four specific kinds of similarity, this indicates the value of k = 4.

Question 27: What is the most appropriate algorithm for predicting the probability of death from heart disease based on age, gender, and blood cholesterol level?

Answer: The most appropriate algorithm for predicting the probability of death from heart disease based on age, gender, and blood cholesterol level is logistic regression.

Question 31: What do you understand about true positive rate and false positive rate?

Answer: The true positive rate (TPR) gives the proportion of positive instances that are correctly predicted as positive: TPR = TP / (TP + FN). The false positive rate (FPR) gives the proportion of negative instances that are incorrectly predicted as positive: FPR = FP / (FP + TN).

What is variance in Data Science?

Answer: Variance in Data Science is a measure of the distribution of individual values within a set of data, and describes the difference of each value from the mean value. Data scientists use variance to understand the distribution of a data set.

Question 24: You have built a classification model for cancer detection and achieved an accuracy of 96%. Why should you not be happy with this performance? What can you do about it?

Answer: You should not be happy with the performance of a classification model for cancer detection based solely on accuracy because cancer detection results in imbalanced data. It is important to focus on the remaining four percent, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial when it comes to cancer detection and can greatly improve a patient's prognosis. To evaluate model performance, you should use sensitivity (true positive rate), specificity (true negative rate), and F measure to determine the class-wise performance of the classifier.

Question 30: Which analysis method should you use to determine if offering a coupon to website visitors has any impact on their purchase decisions if visitors randomly receive one of two coupons or none?

Answer: You should use one-way ANOVA to determine if offering a coupon to website visitors has any impact on their purchase decisions if visitors randomly receive one of two coupons or none.

28. What is Auto Increment?

The auto increment keyword allows a unique number to be generated whenever a new record is inserted into a table. The AUTO_INCREMENT keyword is used in MySQL, and the IDENTITY keyword can be used in SQL SERVER. This keyword is mostly used wherever a PRIMARY KEY is used.

29. What is the difference between Clustered and Non-Clustered Index?

A clustered index is used for easy retrieval of data from the database by altering the way the records are stored: the database sorts rows by the column which is set as the clustered index. A non-clustered index does not alter the way records are stored but creates a completely separate object within the table; it points back to the original table rows after searching.

What is clustering?

Clustering is a machine learning technique used to group similar data points together. It is an unsupervised learning technique, meaning that the algorithm does not require labeled data to learn. The objective of clustering is to minimize the intra-cluster distance and maximize the inter-cluster distance. There are different types of clustering algorithms, including K-means, Hierarchical clustering, and DBSCAN.

35. What is collation?

Collation is defined as a set of rules that determine how character data is sorted and compared. It can be used to compare A and a, and other language characters, and also depends on the width of the characters. ASCII values can be used to compare this character data.

46. How to fetch common records from two tables?

A common-records result set can be achieved by: SELECT StudentID FROM Student INTERSECT SELECT StudentID FROM Exam.

26. What is a constraint?

A constraint can be used to specify the limit on the data type of a table. Constraints can be specified while creating or altering the table statement. Sample constraints are: NOT NULL, CHECK, DEFAULT, UNIQUE, PRIMARY KEY, FOREIGN KEY.

32. What is Cross Join?

A cross join is defined as a Cartesian product, where the number of rows in the result is the number of rows in the first table multiplied by the number of rows in the second table. If a WHERE clause is used with a cross join, the query works like an INNER JOIN.

To prevent overfitting, we can use the following methods:

Cross-validation: This method involves splitting the data into training and testing sets multiple times and averaging the results.
Regularization: This method adds a penalty to the loss function to prevent the model from becoming too complex.
Dropout: This method involves randomly dropping out some neurons during training to prevent the model from memorizing the training data.
Early stopping: This method stops the training process when the performance on the validation set stops improving.

What is cross-validation?

Cross-validation is a technique used to evaluate the performance of a machine learning model. It works by dividing the data into k subsets (folds) and using each subset as the validation set while training the model on the remaining k-1 subsets. This process is repeated k times, and the performance of the model is averaged over the k-folds. Cross-validation is used to estimate the generalization performance of the model and prevent overfitting.

53. Explain cross-validation.

Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a data set to test the model during the training phase (i.e., a validation data set) to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.

24. What is the difference between DELETE and TRUNCATE commands?

The DELETE command is used to remove rows from a table, and a WHERE clause can be used to delete a conditional set of rows. Commit and rollback can be performed after a DELETE statement. TRUNCATE removes all rows from the table, and a truncate operation cannot be rolled back.

Differentiate Between Data Analytics and Data Science

Data Analytics: Uses data to draw meaningful insights and solve problems. Tools include data mining, data modeling, database management, and data analysis.
Data Science: Used in asking questions, writing algorithms, coding, and building statistical models. Tools include machine learning, Hadoop, Java, Python, software development, etc. Discovers new questions to drive innovation. Uses scientific methods and algorithms to extract knowledge from unstructured data.

27. What is data Integrity?

Data Integrity defines the accuracy and consistency of data stored in a database. It can also define integrity constraints to enforce business rules on the data when it is entered into the application or database.

What is Data Science?

Data Science combines statistics, maths, specialized programs, artificial intelligence, machine learning, etc. Data Science is simply the application of specific principles and analytic techniques to extract information from data used in strategic planning, decision making, etc. Simply, data science means analyzing data for actionable insights.

13. What are all the different normalizations?

Database normalization can be easily understood with the help of a case study. The normal forms can be divided into six forms, explained below:
First Normal Form (1NF): Remove all duplicate columns from the table. Create tables for the related data and identify the unique columns.
Second Normal Form (2NF): Meet all requirements of the first normal form. Place the subsets of data in separate tables and create relationships between the tables using primary keys.
Third Normal Form (3NF): Meet all requirements of 2NF. Remove the columns which are not dependent on primary key constraints.
Fourth Normal Form (4NF): A table is in 4NF if no instance contains two or more independent, multivalued facts describing the relevant entity.
Fifth Normal Form (5NF): A table is in 5NF only if it is in 4NF and it cannot be decomposed into any number of smaller tables without loss of data.
Sixth Normal Form (6NF): 6NF is not standardized; it has been discussed by database experts for some time, and a clear, standardized definition may emerge in the future.

18. What is a relationship and what are they?

A database relationship is defined as the connection between the tables in a database. The various database relationships are: One-to-One, One-to-Many, Many-to-One, and Self-Referencing.

4. What is a Database?

Database is nothing but an organized form of data for easy access, storing, retrieval and managing of data. This is also known as structured form of data which can be accessed in many ways. Example: School Management Database, Bank Management Database.

30. What is Datawarehouse?

Datawarehouse is a central repository of data from multiple sources of information. Those data are consolidated, transformed and made available for the mining and online processing. Warehouse data have a subset of data called Data Marts.

12. What is Denormalization?

Denormalization is a technique used to move from higher to lower normal forms of a database. It is the process of introducing redundancy into a table by incorporating data from related tables.

62. What are eigenvalue and eigenvector?

Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; the eigenvalues are the factors by which it does so. Eigenvectors are key to understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix.

45. How can you create an empty table from an existing table?

An example is: SELECT * INTO studentcopy FROM student WHERE 1 = 2. Here, we are copying the student table's structure into another table with no rows copied, because the WHERE condition is always false.

What are the feature selection methods used to select the right variables?

Filter methods, including linear discrimination analysis, ANOVA, and Chi-Square. Wrapper methods, including forward selection, backward selection, and recursive feature elimination.

36. What are all different types of collation sensitivity?

Following are the different types of collation sensitivity: Case Sensitivity (A and a, B and b), Accent Sensitivity, Kana Sensitivity (Japanese kana characters), and Width Sensitivity (single-byte vs. double-byte characters).

37. Advantages and Disadvantages of Stored Procedure?

A stored procedure can be used as modular programming: create once, store, and call it several times whenever required. This supports faster execution instead of executing multiple queries, reduces network traffic, and provides better security to the data. The disadvantage is that it can be executed only in the database and utilizes more memory in the database server.

What is gradient descent?

Gradient descent is an optimization algorithm used to minimize the cost function of a machine learning model. It works by iteratively adjusting the model parameters in the direction of the steepest descent of the cost function. The learning rate controls the step size of each iteration. There are three types of gradient descent: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
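
A minimal sketch of the core loop on a one-dimensional cost function (our own toy example):

# minimize f(x) = (x - 3)^2, whose gradient is f'(x) = 2(x - 3)
x = 0.0
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (x - 3)
    x -= learning_rate * gradient  # step in the direction of steepest descent
print(x)  # converges to ~3.0, the minimum of the cost function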

58. What is the law of large numbers?

It is a theorem that describes the result of performing the same experiment very frequently. This theorem forms the basis of frequency-style thinking. It states that the sample mean, sample variance, and sample standard deviation converge to what they are trying to estimate.

60. What is star schema?

It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes, star schemas involve several layers of summarization to recover information faster.

How can you avoid overfitting your model?

Keep the model simple. Use cross-validation techniques, such as k-fold cross-validation. Use regularization techniques, such as LASSO, that penalize certain model parameters if they're likely to cause overfitting.

50. Which operator is used in query for pattern matching?

The LIKE operator is used for pattern matching. % matches zero or more characters, and _ (underscore) matches exactly one character. Example: SELECT * FROM Student WHERE studentname LIKE 'a%'; SELECT * FROM Student WHERE studentname LIKE 'ami_';

25. What are local and global variables and their differences?

Local variables are variables that can be used or exist only inside a function. They are not known to other functions, which cannot refer to or use them, and they are created whenever the function is called. Global variables are variables that can be used or exist throughout the program. A variable declared globally does not need to be redeclared inside functions, and global variables are not created anew each time a function is called.

51. What is logistic regression?

Logistic regression is also known as the logit model. It is a technique used to forecast the binary outcome from a linear combination of predictor variables.

What are MSE and RMSE in a linear regression model, and how are they calculated in machine learning?

MSE (Mean Squared Error) and RMSE (Root Mean Squared Error) are commonly used measures of accuracy for linear regression models in machine learning. MSE is the average of the squared differences between the predicted and actual values, while RMSE is the square root of the MSE. Both are calculated using the following formulae: MSE = (1/n) * sum((y_true - y_pred)^2) RMSE = sqrt(MSE) where y_true is the actual value, y_pred is the predicted value, and n is the number of data points.
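
A quick check of both formulae in Python (the values are made up for illustration):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predictions
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
print(mse, rmse)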

11. What is normalization?

Normalization is the process of minimizing redundancy and dependency by organizing the fields and tables of a database. The main aim of normalization is to ensure that additions, deletions, and modifications of a field can be made in a single table.

38. What is Online Transaction Processing (OLTP)?

Online Transaction Processing (OLTP) manages transaction based applications which can be used for data entry, data retrieval and data processing. OLTP makes data management simple and efficient. Unlike OLAP systems goal of OLTP systems is serving real-time transactions. Example - Bank Transactions on a daily basis.

How can outlier values be handled in machine learning?

Outlier values can be handled in machine learning by removing them if they are garbage values or if they have extreme values that are not representative of the data. Alternatively, outlier values can be treated by using different models that are less affected by outliers or by normalizing the data to pull extreme data points to a similar range.

What is overfitting, and how can it be prevented?

Overfitting is a scenario where the model is trained too well on the training data, which causes it to perform poorly on the test data.

2. What is RDBMS?

RDBMS stands for Relational Database Management System. RDBMS store the data into the collection of tables, which is related by common fields between the columns of the table. It also provides relational operators to manipulate the data stored into the tables. Example: SQL Server.

47. How to fetch alternate records from a table?

Records can be fetched for both odd and even row numbers. To display even rows: Select studentId from (Select rowno, studentId from student) where mod(rowno,2)=0. To display odd rows: Select studentId from (Select rowno, studentId from student) where mod(rowno,2)=1.

What is regularization?

Regularization is a technique used to prevent overfitting in machine learning models. It adds a penalty term to the loss function of the model, forcing the model to learn simpler patterns and avoid memorizing the noise in the training data. There are two types of regularization techniques: L1 regularization (Lasso) and L2 regularization (Ridge).

63. Why is resampling done?

Resampling is done in any of these cases:
Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing randomly with replacement from a set of data points.
Substituting labels on data points when performing significance tests.
Validating models by using random subsets (bootstrapping, cross-validation).

50. What is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault-sequence prevents the final undesirable event from recurring.

39. What is CLAUSE?

A SQL clause is defined to limit the result set by providing a condition to the query, which usually filters some rows from the whole set of records. Examples: a query that has a WHERE condition, or a query that has a HAVING condition.

3. What is SQL?

SQL stands for Structured Query Language, and it is used to communicate with the database. It is a standard language used to perform tasks such as retrieval, update, insertion, and deletion of data from a database. Standard SQL commands include SELECT, INSERT, UPDATE, and DELETE.

31. What is Self-Join?

A self-join is a query used to compare a table to itself: values in a column are compared with other values in the same column of the same table. Aliases can be used for the same-table comparison.

22. What is a stored procedure?

A stored procedure is a function consisting of many SQL statements to access the database system. Several SQL statements are consolidated into a stored procedure and executed whenever and wherever required.

What are the differences between supervised and unsupervised learning?

Supervised Learning: Uses known and labeled data as input. Has a feedback mechanism. Most commonly used algorithms are decision trees, logistic regression, and support vector machines.
Unsupervised Learning: Uses unlabeled data as input. Has no feedback mechanism. Most commonly used algorithms are k-means clustering, hierarchical clustering, and the apriori algorithm.

66. What is survivorship bias?

Survivorship bias is the logical error of focusing on aspects that support surviving a process and casually overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.

43. What is the difference between TRUNCATE and DROP statements?

TRUNCATE removes all the rows from the table, and it cannot be rolled back. DROP command removes a table from the database and operation cannot be rolled back.

49. What are the steps in making a decision tree?

1. Take the entire data set as input.
2. Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
3. Apply the split to the input data (divide step).
4. Re-apply steps one and two to the divided data.
5. Stop when you meet any stopping criteria.
6. Clean up the tree if you went too far doing splits. This step is called pruning.

Explain the steps in making a decision tree

1. Take the entire data set as input.
2. Calculate the entropy of the target variable, as well as of the predictor attributes.
3. Calculate the information gain of all attributes.
4. Choose the attribute with the highest information gain as the root node.
5. Repeat the same procedure on every branch until the decision node of each branch is finalized.
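
As a sketch, scikit-learn's decision tree can be told to split by entropy/information gain as described above (the dataset and parameters are our own choices):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# criterion="entropy" chooses splits by information gain, as in the steps above
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)
print(tree.predict(X[:3]))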

What is the Central Limit Theorem?

The Central Limit Theorem states that the sampling distribution of the mean of any independent, random variable will be normal or nearly normal if the sample size is large enough. This theorem applies to random samples taken from any population with a finite standard deviation.

What is the curse of dimensionality?

The curse of dimensionality refers to the fact that as the number of features or dimensions increases, the amount of data needed to generalize accurately increases exponentially. This issue arises because data becomes sparser as the number of dimensions increases. As a result, we need a large amount of data to cover all possible combinations of features. To avoid the curse of dimensionality, we can use dimensionality reduction techniques such as PCA, t-SNE, and LDA, among others.

How can you select the optimal number of clusters (k) in a k-means algorithm in machine learning?

The optimal number of clusters (k) in a k-means algorithm can be selected using the elbow method. This involves plotting the within-cluster sum of squares (WSS) against the number of clusters and selecting the value of k at the "elbow" of the plot, where the WSS starts to level off. Another method is the silhouette score, which measures the similarity between data points within a cluster and the dissimilarity between data points in different clusters.
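
A minimal sketch of the elbow method with scikit-learn (the blob data is generated for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # inertia_ is the within-cluster sum of squares (WSS)
# plotting k against inertia_ and picking the "elbow" suggests k = 4 here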

What is the significance of p-value in hypothesis testing, and how is it interpreted in machine learning?

The p-value is a statistical measure that indicates the likelihood of obtaining a result at least as extreme as the observed result, assuming that the null hypothesis is true. In machine learning, the p-value is used to determine the statistical significance of a hypothesis test. A p-value of less than or equal to 0.05 indicates strong evidence against the null hypothesis, leading to its rejection. A p-value greater than 0.05 indicates weak evidence against the null hypothesis, so it is not rejected. A p-value near the 0.05 cutoff is considered marginal and requires further investigation.

What is the purpose of regularization in machine learning, and what are the commonly used techniques?

The purpose of regularization in machine learning is to prevent overfitting by adding a penalty term to the loss function. Commonly used techniques include:
L1 regularization: adds a penalty proportional to the absolute value of the weights. It leads to sparse solutions, as it tends to set many weights to zero.
L2 regularization: adds a penalty proportional to the square of the weights. It leads to small but non-zero weights.
Elastic Net regularization: a combination of L1 and L2 regularization, providing a balance between sparse and small non-zero weights.
Dropout regularization: randomly drops out some neurons during training to prevent the model from memorizing the training data.
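
A minimal sketch of L1 and L2 regularization with scikit-learn (the synthetic regression data is our own choice):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: drives many coefficients to zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks coefficients toward zero
print(lasso.coef_)
print(ridge.coef_)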

67. How do you work towards a random forest?

The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are: Build several decision trees on bootstrapped training samples of the data. On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors. Rule of thumb: at each split, m = √p. Predictions: by the majority rule.

49. What is the command used to fetch first 5 characters of the string?

There are many ways to fetch the first 5 characters of a string: SELECT SUBSTRING(StudentName, 1, 5) AS studentname FROM student, or SELECT LEFT(StudentName, 5) AS studentname FROM student.

16. What are all the different types of indexes?

There are three types of indexes: Unique Index. This indexing does not allow the field to have duplicate values if the column is unique indexed. Unique index can be applied automatically when primary key is defined. Clustered Index. This type of index reorders the physical order of the table and search based on the key values. Each table can have only one clustered index. NonClustered Index. NonClustered Index does not alter the physical order of the table and maintains logical order of data. Each table can have 999 nonclustered indexes.

21. What are the types of subquery?

There are two types of subquery: correlated and non-correlated. A correlated subquery cannot be considered an independent query, but it can refer to columns of a table listed in the FROM list of the main query. A non-correlated subquery can be considered an independent query, and the output of the subquery is substituted into the main query.

10. What are the types of join and explain each?

There are various types of join which can be used to retrieve data, depending on the relationship between tables.
Inner Join: returns rows when there is at least one match of rows between the tables.
Right Join: returns rows which are common between the tables, plus all rows of the right-hand table; that is, all the rows from the right-hand table even when there are no matches in the left-hand table.
Left Join: returns rows which are common between the tables, plus all rows of the left-hand table; that is, all the rows from the left-hand table even when there are no matches in the right-hand table.
Full Join: returns rows when there are matching rows in either of the tables; that is, all the rows from the left-hand table and all the rows from the right-hand table.

59. What are the confounding variables?

These are extraneous variables in a statistical model that correlates directly or inversely with both the dependent and the independent variable. The estimate fails to account for the confounding factor.

55. Do gradient descent methods always converge to similar points?

They do not, because in some cases they reach a local minimum or local optimum point rather than the global optimum. This is governed by the data and the starting conditions.

9. What is a join?

A join is a keyword used to query data from two or more tables based on the relationship between the fields of the tables. Keys play a major role when JOINs are used.

56. What is the goal of A/B Testing?

This is statistical hypothesis testing for randomized experiments with two variables, A and B. The objective of A/B testing is to detect any changes to a web page to maximize or increase the outcome of a strategy.

What is the difference between a Type I error and a Type II error?

Type I Error - A type I error is a false positive. It means that the null hypothesis was rejected when it was actually true.
Type II Error - A type II error is a false negative. It means that the null hypothesis was not rejected when it was actually false.

What is the difference between Type I and Type II errors?

A Type I error occurs when we reject the null hypothesis when it is actually true; it is also known as a false positive. The significance level of the test, known as alpha, is the probability of making a Type I error. A Type II error occurs when we fail to reject the null hypothesis when it is actually false; it is also known as a false negative. The probability of making a Type II error is known as beta, and the power of the test is 1 - beta.

41. What are the UNION, MINUS and INTERSECT commands?

The UNION operator is used to combine the results of two tables, and it eliminates duplicate rows. The MINUS operator is used to return rows from the first query that do not appear in the second query; those remaining rows of the first query are displayed as the result set. The INTERSECT operator is used to return the rows returned by both queries.

Differentiate between univariate, bivariate, and multivariate analysis

Univariate - Contains only one variable. Purpose is to describe the data and find patterns that exist within it. Bivariate - Involves two different variables. Analysis deals with causes and relationships and is done to determine the relationship between the two variables. Multivariate - Involves three or more variables. Similar to a bivariate but contains more than one dependent variable.

33. What are user defined functions?

User defined functions are functions written so that the same logic can be reused whenever required. It is not necessary to write the same logic several times; instead, the function can be called or executed whenever needed.

3. Write down the difference between vertical and horizontal databases?

Vertical Database: You can do vertical scaling, adding more power to the present PC. All data will be stored in a single node. Multi-core scaling will be done. Example: Amazon cloud.
Horizontal Database: Here you can do horizontal scaling with more equipment. Only part of the data will be stored in each node. Single-core scaling will be done. Example: Cassandra.

61. How regularly must an algorithm be updated?

You will want to update an algorithm when:
You want the model to evolve as data streams through the infrastructure.
The underlying data source is changing.
There is a case of non-stationarity.

