Data Analyst Interview Questions: Basic
What do you think are the criteria to say whether a developed data model is good or not?
Well, the answer to this question may vary from person to person, but below are a few criteria that I think must be considered when deciding whether a developed data model is good or not:
- The model should deliver predictable performance, so that it can be used to make reliable forecasts.
- A good model adapts easily to changes in business requirements.
- If the data changes, the model should be able to scale with the data.
- The model should be easy for clients to consume, so that it yields actionable and profitable results.
What are the important steps in the data validation process?
As the name suggests, Data Validation is the process of validating data. It mainly involves two processes: Data Screening and Data Verification.
- Data Screening: Different kinds of algorithms are used to screen the entire data set and flag any inaccurate or suspicious values.
- Data Verification: Every flagged value is evaluated against various use cases, and a final decision is made on whether the value should be included in the data or not.
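To make the two steps concrete, here is a minimal sketch in Python. The column names, validity rules, and the pandas-based approach are illustrative assumptions, not part of any standard validation API:

```python
import pandas as pd

# Hypothetical example data: column names and values are assumptions.
df = pd.DataFrame({
    "age": [25, -3, 41, 130, 38],
    "email": ["a@x.com", "b@x.com", "not-an-email", "d@x.com", "e@x.com"],
})

# Data Screening: flag values that violate simple validity rules.
bad_age = ~df["age"].between(0, 120)        # ages outside a plausible range
bad_email = ~df["email"].str.contains("@")  # crude email-format check
print("Suspect rows:\n", df[bad_age | bad_email])

# Data Verification: decide per suspect value whether to keep, fix, or drop it.
# Here we simply drop rows that fail screening; a real workflow would review
# each case against business use cases before deciding.
clean = df[~(bad_age | bad_email)]
```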
When do you think you should retrain a model? Is it dependent on the data?
Business data keeps changing on a day-to-day basis, even though its format doesn't change. Whenever a business enters a new market, faces a sudden rise in competition, or sees its own position rising or falling, it is recommended to retrain the model. So, as and when the business dynamics change, the model should be retrained on data that reflects the changing behaviors of customers.
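One simple way to operationalize this is to monitor the model's performance on recent data and retrain when it degrades past a threshold. The function names and the 5% tolerance below are illustrative assumptions, not a standard rule:

```python
# Minimal sketch of a performance-based retraining trigger.
# `evaluate` is a hypothetical placeholder for your own evaluation routine;
# the 5% tolerance is an assumption, not a fixed industry threshold.

def should_retrain(model, recent_data, baseline_score, evaluate, tolerance=0.05):
    """Return True if the model's score on recent data has dropped
    more than `tolerance` below its baseline score."""
    current_score = evaluate(model, recent_data)
    return current_score < baseline_score * (1 - tolerance)

# Usage (with your own model, data, and evaluation/training functions):
# if should_retrain(model, last_month_data, baseline_score, evaluate):
#     model = retrain(model, all_available_data)
```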
What is the process of Data Analysis?
The process of Data Analysis involves the following steps:
- Collect Data: The data is collected from various sources and stored so that it can be cleaned and prepared. In this step, all the missing values and outliers are removed.
- Analyze Data: Once the data is ready, the next step is to analyze it. A model is run repeatedly for improvements. Then, the model is validated to check whether it meets the business requirements.
- Create Reports: Finally, the model is implemented, and the reports thus generated are passed on to the stakeholders.
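The sketch below walks through these steps on a toy dataset with pandas; the file name, column names, and the simple 3-sigma outlier rule are all illustrative assumptions:

```python
import pandas as pd

# Collect Data: load from a source (a hypothetical CSV file here).
df = pd.read_csv("sales.csv")  # assumed columns: region, revenue

# Prepare: drop missing values and crude outliers (an assumed 3-sigma rule).
df = df.dropna()
mean, std = df["revenue"].mean(), df["revenue"].std()
df = df[(df["revenue"] - mean).abs() <= 3 * std]

# Analyze Data: a simple aggregation standing in for a model run.
summary = df.groupby("region")["revenue"].agg(["count", "mean", "sum"])

# Create Reports: export the result for stakeholders.
summary.to_csv("revenue_report.csv")
print(summary)
```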
What is data cleansing and what are the best ways to practice data cleansing?
Data Cleansing, Data Wrangling, and Data Cleaning all mean the same thing: the process of identifying and removing errors to enhance the quality of data. A few good practices are to remove duplicate records, handle missing values consistently, standardize formats (dates, units, spellings), and validate values against business rules, starting with the issues that affect the largest share of the data.
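Here is a short pandas sketch of those practices; the data, column names, and alias mapping are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical messy data.
df = pd.DataFrame({
    "city": ["New York", "new york ", "NYC", "Boston", "Boston"],
    "sales": [100.0, 100.0, None, 250.0, 250.0],
})

# Standardize formats: trim whitespace, unify case, map known aliases.
df["city"] = df["city"].str.strip().str.title().replace({"Nyc": "New York"})

# Handle missing values consistently: here, fill with the column median.
df["sales"] = df["sales"].fillna(df["sales"].median())

# Remove duplicate records.
df = df.drop_duplicates()

# Validate against a business rule: sales must be non-negative.
assert (df["sales"] >= 0).all()
```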
What is the difference between Data Mining and Data Analysis?
Data Mining:
- Used to recognize patterns in stored data.
- Mining is performed on clean and well-documented data.
- Results extracted from data mining are not easy to interpret.

Data Analysis:
- Used to order and organize raw data in a meaningful manner.
- The analysis of data involves Data Cleaning, so the data is not present in a well-documented format.
- Results extracted from data analysis are easy to interpret.

So, to summarize: Data Mining is often used to identify patterns in stored data. It is mostly used for Machine Learning, and analysts just have to recognize the patterns with the help of algorithms. Data Analysis, on the other hand, is used to gather insights from raw data, which has to be cleaned and organized before performing the analysis.
What is the difference between Data Mining and Data Profiling?
Data Mining: Data Mining refers to the analysis of data with the aim of finding relations that have not been discovered earlier. It mainly focuses on the detection of unusual records, dependencies, and cluster analysis.
Data Profiling: Data Profiling refers to the process of analyzing the individual attributes of data. It mainly focuses on providing valuable information about data attributes, such as data type, frequency, etc.
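As an illustration of profiling, the attributes of a DataFrame can be summarized with pandas; the tiny dataset here is an assumed example:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 41, 38, None],
    "city": ["NY", "NY", "LA", "LA"],
})

# Profile each attribute: data type, missing count, and value frequency.
for col in df.columns:
    print(col, "| dtype:", df[col].dtype, "| missing:", df[col].isna().sum())
    print(df[col].value_counts(), "\n")
```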
Mention the name of the framework developed by Apache for processing large datasets for an application in a distributed computing environment.
The complete Hadoop Ecosystem was developed for processing large datasets for an application in a distributed computing environment. The Hadoop Ecosystem consists of the following components:
- HDFS -> Hadoop Distributed File System
- YARN -> Yet Another Resource Negotiator
- MapReduce -> Data processing using programming
- Spark -> In-memory data processing
- PIG, HIVE -> Data processing services using query (SQL-like)
- HBase -> NoSQL database
- Mahout, Spark MLlib -> Machine learning
- Apache Drill -> SQL on Hadoop
- Zookeeper -> Managing cluster
- Oozie -> Job scheduling
- Flume, Sqoop -> Data ingesting services
- Solr & Lucene -> Searching & indexing
- Ambari -> Provision, monitor and maintain cluster
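To give a feel for the MapReduce idea (independent of Hadoop's actual Java API), here is a toy word count in plain Python; it only illustrates the map-then-reduce pattern, not Hadoop itself:

```python
from collections import Counter
from itertools import chain

lines = ["big data is big", "data is distributed"]

# Map: each line is independently turned into (word, 1) pairs.
mapped = chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

# Reduce: counts for the same key (word) are combined.
counts = Counter()
for word, one in mapped:
    counts[word] += one

print(counts)  # Counter({'big': 2, 'data': 2, 'is': 2, 'distributed': 1})
```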
Can you mention a few problems that data analysts usually encounter while performing an analysis?
The following are a few problems that are usually encountered while performing data analysis:
- The presence of duplicate entries and spelling mistakes reduces data quality.
- If you are extracting data from a poor source, you will have to spend a lot of time cleaning the data.
- Data extracted from different sources may vary in representation; when you combine such data, the variation in representation can delay the analysis until the values are reconciled (see the sketch below).
- Lastly, incomplete data can make it difficult to perform a meaningful analysis.
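For example, two sources might encode the same field differently. A small sketch of reconciling representations before combining, where the column names and alias mapping are assumptions:

```python
import pandas as pd

# Two assumed sources representing the same field differently.
source_a = pd.DataFrame({"state": ["NY", "CA"], "sales": [10, 20]})
source_b = pd.DataFrame({"state": ["New York", "California"], "sales": [5, 15]})

# Reconcile representations before combining.
full_names = {"NY": "New York", "CA": "California"}
source_a["state"] = source_a["state"].replace(full_names)

combined = pd.concat([source_a, source_b], ignore_index=True)
print(combined.groupby("state")["sales"].sum())
```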
What is the KNN imputation method?
In KNN imputation, a missing attribute value is imputed using the values from the k nearest neighbours, i.e. the records that are most similar to the record whose value is missing. The similarity between two records is determined using a distance function.
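scikit-learn ships a ready-made implementation in sklearn.impute.KNNImputer; the tiny matrix below is just an illustrative assumption:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with a missing value (np.nan) in the last row.
X = np.array([
    [1.0, 2.0],
    [2.0, 3.0],
    [3.0, 4.0],
    [2.5, np.nan],
])

# Fill the missing value using the 2 nearest neighbours, where nearness
# is measured by a NaN-aware Euclidean distance.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```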