Data Science Interview


How can you calculate accuracy using a confusion matrix?

(true pos + true neg)/total observations
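
A minimal Python sketch of the calculation, assuming scikit-learn and hypothetical label arrays:

    from sklearn.metrics import confusion_matrix

    # hypothetical true labels and model predictions
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # (true pos + true neg) / total observations
    print(accuracy)  # 0.75 for this toy example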

Mention some techniques used for sampling.

- Random Sampling: Every data point has an equal chance of being selected.
- Stratified Sampling: The population is divided into strata, and random samples are taken from each.
- Systematic Sampling: Selects every nth item from an ordered list.
- Cluster Sampling: Divides the population into clusters and randomly selects entire clusters.
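
A minimal pandas sketch of a few of these techniques, using a hypothetical DataFrame with a "region" column:

    import pandas as pd

    # hypothetical data with a categorical "region" stratum
    df = pd.DataFrame({"region": ["N", "N", "S", "S", "S", "S"],
                       "sales":  [10, 12, 9, 14, 11, 13]})

    random_sample = df.sample(frac=0.5, random_state=0)   # simple random sampling

    # stratified: sample within each region separately
    stratified = df.groupby("region", group_keys=False).sample(frac=0.5, random_state=0)

    systematic = df.iloc[::2]                              # systematic: every 2nd row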

How should you maintain a deployed model?

- Track model accuracy and drift over time. - Periodically update the model with fresh data to maintain accuracy. - Implement real-time monitoring to detect performance degradation.

You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?

1) Assess the missingness pattern.
2) Drop features with excessive missing data, or fill missing values using mean/median/mode imputation or K-Nearest Neighbors (KNN) imputation (average of the 3 nearest neighbors).
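
A minimal sketch of these steps, assuming pandas/scikit-learn and a hypothetical numeric DataFrame:

    import pandas as pd
    from sklearn.impute import KNNImputer

    # hypothetical data: "mostly_missing" exceeds the 30% threshold
    df = pd.DataFrame({"age":            [25, None, 40, 35, 30, 50],
                       "income":         [50, 60, None, 80, 90, 100],
                       "mostly_missing": [None, None, 1, None, None, None]})

    missing_ratio = df.isna().mean()                                  # fraction missing per column
    df = df.drop(columns=missing_ratio[missing_ratio > 0.30].index)   # drop columns with >30% missing

    # fill the remaining gaps with the average of the 3 nearest neighbors
    df_imputed = pd.DataFrame(KNNImputer(n_neighbors=3).fit_transform(df), columns=df.columns)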

Explain the steps in making a decision tree.

1) Choose the attribute that best splits the data.
2) Divide the data into subsets.
3) Continue splitting the data (recursively repeat) until all nodes are pure or a stopping condition is met.
4) Pruning (optional): Remove branches that do not add value to prevent overfitting.
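
A minimal scikit-learn sketch of these steps (the iris dataset and the max_depth value are placeholders):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # picks the best split at each node (Gini impurity by default) and recurses;
    # max_depth acts as a stopping condition / pre-pruning against overfitting
    tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))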

How is logistic regression done?

1) Define the hypothesis function: Uses the sigmoid function to map inputs to probabilities.
2) Select features: Choose independent variables affecting the outcome.
3) Train the model: Use maximum likelihood estimation to find the best-fitting parameters.
4) Apply regularization (optional): Prevents overfitting.
5) Make predictions: Convert probabilities into outcomes using a threshold (e.g., 0.5).
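
A minimal sketch combining the sigmoid hypothesis with a scikit-learn fit (the data below is hypothetical):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))   # maps any real input to a probability in (0, 1)

    # hypothetical single-feature data with a binary outcome
    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    model = LogisticRegression(penalty="l2", C=1.0)   # fit via (penalized) maximum likelihood
    model.fit(X, y)
    probs = model.predict_proba(X)[:, 1]              # predicted probabilities
    preds = (probs >= 0.5).astype(int)                # threshold at 0.5 to get class labels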

What statistical methods did you use in your sustainability internship?

Conducted trend analysis and hypothesis testing in Python/R. Used SQL for data management and Power BI/Tableau for visualization. Helped inform sustainability decisions, including wetlands management initiatives.

What's an example of a complex Power BI dashboard you built?

Created an interactive dashboard tracking sustainability metrics (carbon footprint, water usage). Used DAX measures and filtering options for dynamic analysis. Enabled stakeholders to drill down into trends and drive data-backed decisions.

How do you remove duplicate rows from a table?

DELETE FROM market_transactions
WHERE transaction_id NOT IN (
    SELECT MIN(transaction_id)
    FROM market_transactions
    GROUP BY transaction_date, region, price, quantity
);

Differentiate between Data Analytics and Data Science

Data Science is a broader field that involves data collection, preprocessing, modeling, and deriving insights using machine learning and AI techniques. It focuses on predictive and prescriptive analytics. Data Analytics is a subset of Data Science that focuses on analyzing data to identify trends, summarize findings, and support decision-making. It is more descriptive.

Tell me about a time you failed

During my sustainability analysis internship, I built a Power BI dashboard to track key environmental metrics. Initially, I focused too much on technical complexity rather than user needs, resulting in a dashboard that wasn't intuitive for stakeholders (too many tables with complex relationships, and SQL expressions that could have been simplified). After receiving feedback, I simplified the layout, improved labeling, and added interactive filters to make the insights more accessible. This experience taught me the importance of user-centered design and clear communication when presenting data. The final dashboard was well-received and became a key tool for sustainability reporting. Now, I always prioritize usability when creating visualizations or automating data processes.

How can you select k for k-means?

Elbow Method - Plots the Within-Cluster Sum of Squares (WCSS) for different values of k. - WCSS measures the sum of squared distances between each data point and its assigned cluster center. - The optimal k is at the "elbow" point, where the reduction in WCSS slows down.
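
A minimal scikit-learn sketch of the elbow method on hypothetical data with three true clusters:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # hypothetical 2-D data drawn around three centers
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])

    wcss = []
    for k in range(1, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss.append(km.inertia_)              # inertia_ is the within-cluster sum of squares

    plt.plot(range(1, 9), wcss, marker="o")   # the bend ("elbow") here appears near k = 3
    plt.xlabel("k"); plt.ylabel("WCSS")
    plt.show()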

What are the feature selection methods used to select the right variables?

Filter methods:
- Chi-Square Test: Used for categorical variables to test independence.
- ANOVA (Analysis of Variance): Compares means of different groups to assess feature relevance.
Wrapper methods:
- Recursive Feature Elimination (RFE): Iteratively removes the least important features based on model performance.
- Forward Selection: Starts with no features and adds them one by one based on performance improvement.
- Backward Elimination: Starts with all features and removes them one by one based on statistical significance.
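
A minimal scikit-learn sketch of one filter method and one wrapper method (the dataset and k=10 are placeholders):

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE, SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    # filter method: ANOVA F-test, keep the 10 highest-scoring features
    X_filtered = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

    # wrapper method: recursive feature elimination around a logistic regression model
    rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
    X_wrapped = rfe.fit_transform(X, y)
    print(rfe.support_)   # boolean mask of the selected features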

How would you implement time series forecasting for a business problem?

1) Handle missing data and remove outliers. 2) Convert the series to a stationary form and normalize the data. 3) Choose a forecasting model (e.g., ARIMA). 4) Train the model and validate its performance. 5) Deploy.
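
A minimal statsmodels sketch of the modeling and validation steps; the series and the ARIMA order (1, 1, 1) are assumptions for illustration:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # hypothetical monthly series: upward trend plus noise
    idx = pd.date_range("2020-01-01", periods=48, freq="MS")
    y = pd.Series(np.linspace(100, 200, 48) + np.random.default_rng(0).normal(0, 5, 48), index=idx)

    train, test = y[:-6], y[-6:]
    model = ARIMA(train, order=(1, 1, 1)).fit()      # d=1 differencing handles the non-stationary trend
    forecast = model.forecast(steps=6)

    rmse = np.sqrt(((forecast - test) ** 2).mean())  # validate on the hold-out months before deploying
    print(rmse)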

What are you good at?

I excel at breaking down complex problems, optimizing data pipelines, and developing ML models to extract meaningful insights. Additionally, my experience as a teaching assistant has sharpened my ability to explain technical concepts clearly and collaborate effectively.

What are your greatest strengths?

I love challenging myself to learn new things, so my approach to work keeps evolving as technology evolves.

Greatest weakness

I tend to rely on solving problems independently before reaching out for help, which can sometimes slow me down. However, I've been actively working on improving my collaboration by engaging with peers earlier and leveraging team discussions to find efficient solutions faster.

What are some best practices for model monitoring and retraining in an ML Ops pipeline?

I track model drift, performance metrics, and prediction accuracy using Azure Monitor. Automated retraining is triggered when performance declines, and versioning ensures rollback safety. Regular explainability reports and compliance audits maintain trust and transparency.

How do you handle large-scale data ingestion and ensure data integrity in data lakes?

I use parallel processing and optimized formats for efficient ingestion. Schema enforcement, checksums, and data lineage tracking ensure integrity. Implementing access control, anomaly detection, and monitoring prevents corruption and unauthorized access.

How do you approach uncertainty quantification in machine learning models?

I use techniques like Monte Carlo dropout to quantify uncertainty. Regularly assessing model uncertainty with out-of-sample testing ensures robustness.

what do you want to do

I want to work in data science and machine learning, using data-driven insights to solve real-world problems. I'm particularly interested in ML Ops, data visualization, and AI model optimization, where I can apply my skills in Python, SQL, and cloud-based data processing. My goal is to contribute to impactful projects that drive better decision-making and continuous improvement in data systems.

Have you ever faced a situation where your ML model underperformed? How did you troubleshoot and improve it?

I worked on an ML model that was slow and inefficient, so I first analyzed feature engineering, data preprocessing, and model complexity. I optimized it by reducing redundant features.

Why do you want this job?

I'm excited about this job because it offers the opportunity to apply my data analysis and machine learning skills to real-world challenges in a meaningful way. I enjoy uncovering insights from complex datasets and translating them into actionable recommendations, and this role aligns perfectly with that passion. I'm eager to contribute to a collaborative team where I can leverage my experience in Python, ML Ops, and data visualization to drive impactful, data-driven solutions.

tell me about a challenge you faced

In my machine learning sentiment analysis project, I initially struggled with slow model performance due to large amounts of unstructured text data. The processing time was inefficient, making it difficult to generate insights quickly. I optimized the pipeline by implementing data preprocessing techniques like tokenization and stop-word removal, reducing input size. I also experimented with more efficient ML models, switching from a complex transformer model to a fine-tuned BERT variant that balanced accuracy and speed.

Which machine learning model did you find most effective in your projects, and why?

In my sentiment analysis project, BERT-based transformers provided more context-aware insights than traditional models. Used VADER for quick polarity scoring, but BERT captured nuances in sentiment shifts. Found that fine-tuning models on specific data improved accuracy.

How do you find RMSE and MSE in a linear regression model?

MSE = (1/n) * Σ_{i=1}^{n} (y_i − ŷ_i)², the average of the squared differences between actual and predicted values. RMSE = √MSE, which expresses the error in the same units as the target variable.
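
A minimal NumPy/scikit-learn sketch, using hypothetical actual and predicted values:

    import numpy as np
    from sklearn.metrics import mean_squared_error

    # hypothetical actuals and predictions from a fitted linear regression
    y_true = np.array([3.0, 5.0, 7.5, 10.0])
    y_pred = np.array([2.5, 5.5, 7.0, 9.0])

    mse = mean_squared_error(y_true, y_pred)   # mean of the squared residuals
    rmse = np.sqrt(mse)                        # back in the units of the target variable
    print(mse, rmse)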

What's your work style?

My work style is structured, analytical, and collaborative. I like to break down complex problems into clear, actionable steps and use data-driven approaches to find solutions. I work well independently but also enjoy collaborating with others to share ideas and refine insights. I'm highly detail-oriented and always aim for efficiency, whether it's optimizing a data pipeline or improving a machine learning model's performance.

How would you optimize a slow-running Python script for large-scale data processing?

Optimizations could include vectorization (NumPy, Pandas), parallel processing (multiprocessing, Dask), and memory-efficient data structures.
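
A minimal sketch contrasting a pure-Python loop with a vectorized NumPy equivalent (the workload is a placeholder):

    import numpy as np

    values = np.random.default_rng(0).random(1_000_000)

    # slow: element-by-element Python loop
    total = 0.0
    for v in values:
        total += v * 2

    # fast: one vectorized operation over the whole array
    total_vectorized = (values * 2).sum()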

What is the significance of the p-value?

The p-value helps determine the statistical significance of results in null hypothesis testing: - Low p-value (< 0.05): Strong evidence against the null hypothesis; the effect is statistically significant. - High p-value (> 0.05): Weak evidence against the null hypothesis; fail to reject it.
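
A minimal SciPy sketch: a two-sample t-test whose p-value decides whether to reject the null hypothesis (the samples are hypothetical):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group_a = rng.normal(10.0, 2.0, size=30)   # hypothetical measurements, group A
    group_b = rng.normal(11.0, 2.0, size=30)   # hypothetical measurements, group B

    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        print("Reject the null hypothesis: the group means differ significantly.")
    else:
        print("Fail to reject the null hypothesis.")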

What steps would you take to ensure regulatory compliance in AI and ML models?

Understand the relevant regulations (e.g., HIPAA) and ensure privacy protection for sensitive data.

Differentiate between univariate, bivariate, and multivariate analysis.

Univariate Analysis: Examines a single variable. Example: Histogram of customer ages.
Bivariate Analysis: Examines the relationship between two variables. Example: Scatter plot of sales vs. advertising spend.
Multivariate Analysis: Examines the relationship among multiple variables. Example: Predicting sales based on price, marketing spend, and location.

Can you elaborate on your role in identifying academic dishonesty?

Used Excel and ML models to detect patterns in student submissions. Used a supervised neural network (CodeBERT from Hugging Face) to learn deep representations of code similarity.

How did you apply machine learning to analyze differential equations in your research?

Used Python-based simulations to model biological growth patterns. Applied ML to identify patterns in differential equations via PINNs (physics-informed neural networks), training a neural network (using PyTorch/TensorFlow) to satisfy the differential equation and boundary conditions. This is useful when working with noisy, real-world biological data. Visualized results using Matplotlib to interpret findings.

How do you calculate the precision and recall rate?

precision: (true pos)/(true pos + false pos) recall: (true pos)/(true pos + false neg)
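
A minimal scikit-learn sketch using hypothetical labels and predictions:

    from sklearn.metrics import precision_score, recall_score

    # hypothetical true labels and model predictions
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
    print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4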

SQL INTERSECT operator

Returns the records that are common to the result sets of two or more SELECT statements.

What is the difference between CHAR and VARCHAR?

• CHAR(n): Fixed-length string, always takes n characters. • VARCHAR(n): Variable-length string, takes only the required space (up to n).

cross-field validation

• Checks data consistency by comparing values across different fields to detect errors or inconsistencies. • Works by defining logical rules or relationships between different fields in a dataset and checking for inconsistencies. • This can be done using conditional statements or mathematical relationships

Explain descriptive, predictive, and prescriptive analysis

• Descriptive: Answers "what has happened" and uses data aggregation techniques. • Predictive: Answers "what could happen" and uses statistical or forecasting techniques. • Prescriptive: Answers "what you should do" and uses simulation algorithms.

methods for data cleaning

• Identify and remove duplicates • Focus on accuracy (cross-field validation) • Normalize the data

types of hypothesis testing

• Null hypothesis: States that there is no relation between the predictor and outcome variables in the population. • Alternative hypothesis: States that there is some relation between the predictor and outcome variables in the population

overfitting vs underfitting

• Overfitting: The model fits the training data very well but fails to generalize; it happens when the model learns the random fluctuations and noise in the training set in detail. • Underfitting: The model neither fits the training data well nor generalizes to new data; it happens when there is too little data to build an accurate model or when a linear model is fit to non-linear data.

Monte Carlo simulations

• Runs a large number of simulations with random inputs to model uncertainty and variability and to estimate the range of possible outcomes.
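
A minimal NumPy sketch of the idea, estimating pi from random points in the unit square:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    x, y = rng.random(n), rng.random(n)      # random inputs
    inside = (x**2 + y**2 <= 1.0).mean()     # fraction landing inside the quarter circle
    pi_estimate = 4 * inside
    print(pi_estimate)                       # converges toward 3.14159... as n grows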

Purpose of index in SQL

• Speed up searches and queries by reducing the amount of data scanned. • They are particularly useful for columns used in WHERE, JOIN, and ORDER BY clauses

Type I and Type II errors

• Type I: Occurs when the null hypothesis is rejected even if it is true (false positive) • Type II: When the null hypothesis is not rejected, even if it is false (false negative)

Difference between WHERE and HAVING clauses in SQL

• WHERE operates on row data, filters before grouping are made, aggregate functions cannot be used. • HAVING operates on aggregated data, filters values from a group, and aggregate functions can be used.

What is a Confusion Matrix?

A confusion matrix is a table used to evaluate the performance of a classification model. It helps in calculating accuracy, precision, and recall. It consists of:
- True Positives (TP): Correctly predicted positive cases.
- True Negatives (TN): Correctly predicted negative cases.
- False Positives (FP): Incorrectly predicted positive cases (Type I error).
- False Negatives (FN): Incorrectly predicted negative cases (Type II error).

Steps to build a data pipeline?

A data pipeline involves ingestion (APIs), processing, and storage (data lakes, warehouses). It must include data validation, orchestration (Azure), and security (encryption). Finally, the pipeline should support BI tools, APIs, and monitoring for real-time access and insights.

What is Data Science?

An interdisciplinary field that combines statistics, programming, and analysis to extract insights from data. Involves data collection, cleaning, exploration, analysis, and modeling to support decision-making.

SQL UNION operator

Combines the output of two or more SELECT statements. ex: SELECT column_name(s) FROM table1 UNION SELECT column_name(s) FROM table2;

Can you explain a time when you had to use Power BI or another visualization tool to present data insights?

During my internship, I used Power BI to monitor sustainability data, creating interactive dashboards to track energy usage trends and land use for the company. This helped the team identify inefficiencies and propose data-driven sustainability initiatives.

tell me about yourself

Present: I'm a mathematics major at Purdue University with a strong focus on applied math, data science, and machine learning. I have hands-on experience in ML model development, data analysis, and visualization, particularly using Python, SQL, and Power BI. Past: In my previous roles, I worked on a machine learning sentiment analysis project, where I applied NLP techniques like BERT and VADER to extract trends from web-scraped data. I also interned as a Sustainability Analysis Intern, where I built SQL databases, conducted statistical analysis, and created Power BI dashboards to present insights. Additionally, as a teaching assistant for Python programming, I honed my technical skills and ability to explain complex concepts clearly. Future: I'm eager to apply my experience in machine learning operations, data visualization, and cloud-based pipelines to solve real-world problems. I'm particularly excited about this role because it aligns with my passion for data-driven decision-making and collaborating with cross-functional teams to drive impactful insights.

What are the differences between supervised and unsupervised learning?

Supervised Learning: The model is trained on labeled data (input-output pairs). It is used for classification and regression tasks. Unsupervised Learning: The model is trained on unlabeled data, identifying patterns. It is used for clustering and association tasks.

What is the ROC curve?

The ROC curve is a graphical representation of a classification model's performance across different decision thresholds. It plots the true positive rate on the y-axis against the false positive rate on the x-axis.

