Interview Prep

What does ACID stand for?

Atomicity: each transaction is an indivisible unit of work. If any part of the transaction fails, the entire transaction is rolled back, leaving the database in its original state. Consistency: a transaction takes the database from one valid state to another, so all constraints hold before and after it. Isolation: concurrent transactions do not interfere with each other, so the intermediate results of one transaction are not visible to another. Durability: once a transaction is committed, its changes are permanent and survive any subsequent failures.
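
Atomicity can be illustrated with a small sketch using Python's built-in sqlite3 module. The accounts table, the transfer function and the CHECK constraint are hypothetical, chosen only to force a failure halfway through a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction (hypothetical helper)."""
    try:
        # The connection context manager commits on success and
        # rolls back every statement in the block on an exception.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The debit would make alice's balance negative, so the CHECK constraint
# fails and the credit to bob is rolled back as well.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

Because the failed debit undoes the earlier credit, both balances end up unchanged.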

What is a unit test in Python?

Automated test designed for isolated components of the code, such as a specific function or an object's method. One software component will have many unit tests. Written using testing frameworks such as pytest that provide utilities for creating test cases and making assertions about the behaviour of the code.
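
A minimal sketch of what such tests can look like. The function normalize_email is a hypothetical component under test; the test functions use plain assert statements and follow the test_ naming pattern that pytest collects by default.

```python
def normalize_email(address):
    """Hypothetical function under test: trim whitespace and lower-case."""
    return address.strip().lower()


# One component usually has several small tests, each checking one behaviour.
def test_strips_surrounding_whitespace():
    assert normalize_email("  a@example.com ") == "a@example.com"


def test_lowercases_address():
    assert normalize_email("A@Example.COM") == "a@example.com"
```

Saved in a file such as test_emails.py (an assumed name), these would typically be run with the pytest command.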

What is automated testing in Python and why is it important?

Use of tools and frameworks to test software automatically and confirm it works as intended. Involves writing code that exercises the functionality and performance of an application. Important because the tests can be rerun on every change, catching regressions earlier and more consistently than manual testing.

What is a table in SQL?

Basic database object that stores data organized into columns and rows.

What is a class in Python?

Blueprint that defines the properties (attributes) and behaviours (methods) of an object. Serves as a template for creating objects of a particular type. Objects created from a class are referred to as instances of that class. Allows for creation of reusable objects with their own attributes and methods. Organize related data and functions into a single unit.
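
A short sketch of a class and two instances. The BankAccount name, its attributes and its method are hypothetical and exist only for illustration.

```python
class BankAccount:
    """Blueprint: every account has an owner and a balance."""

    def __init__(self, owner, balance=0):
        # Attributes: data stored on each individual instance.
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        # Method: behaviour shared by all instances of the class.
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.balance += amount
        return self.balance


# Two instances created from the same class, each with its own state.
a = BankAccount("Ana")
b = BankAccount("Ben", balance=50)
a.deposit(25)
print(a.balance, b.balance)
```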

Difference between a star schema and a snowflake schema

Both are forms of dimensional modelling; the main difference is the level of normalization. Star schema: data is organized into a single fact table and multiple denormalized dimension tables. Snowflake schema: dimension tables are normalized, broken down into additional tables to reduce redundancy and improve integrity.
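
The difference can be sketched with sqlite3. All table and column names here are hypothetical. The star version keeps the category name inside the product dimension, while the snowflake version moves it to its own table, which costs one extra join for the same answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: the dimension stores the category name inline (denormalized).
CREATE TABLE dim_product_star (
    product_id INTEGER PRIMARY KEY, name TEXT, category_name TEXT);

-- Snowflake: the category is split into its own table (normalized).
CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snow (
    product_id INTEGER PRIMARY KEY, name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id));

-- The fact table is the same in both designs.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);

INSERT INTO dim_category VALUES (1, 'Books');
INSERT INTO dim_product_star VALUES (10, 'Atlas', 'Books');
INSERT INTO dim_product_snow VALUES (10, 'Atlas', 1);
INSERT INTO fact_sales VALUES (1, 10, 20.0), (2, 10, 5.0);
""")

# Star schema: one join from the fact table to the dimension.
star_total = conn.execute("""
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star p ON p.product_id = f.product_id
    GROUP BY p.category_name
""").fetchall()

# Snowflake schema: a second join is needed to reach the category name.
snow_total = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
""").fetchall()

print(star_total, snow_total)
```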

What is a data lake?

Centralized repository that stores large amounts of raw data in its native format, whether structured, semi-structured or unstructured. Designed to support big data processing. Provides a flexible and scalable solution for storing and analysing diverse data.

What is a data warehouse?

Centralized repository that stores structured data from various sources. Designed to support business intelligence and decision making. Highly structured and organized, typically modelled with star or snowflake schemas. Requires strong data modelling practices to ensure completeness.

What is a data frame?

In Spark, a distributed collection of data organized into named columns. Conceptually equivalent to a table in a relational database but optimized for distributed processing of big data. Can be created from structured files, external databases or RDDs. Lazily evaluated. A schema defines the structure of the data in each column, making it easy to query and manipulate the data with SQL-like syntax through Spark.

What is Docker?

Container technology that allows developers to automate the deployment and management of applications within software containers.

What is Databricks?

Data Lakehouse platform built on top of Spark. Provides a unified workspace for data professionals to collaborate on projects in a secure and scalable way.

What is ELT?

Extract, Load, Transform. Data is extracted from the source system and loaded directly into a target system such as a data lake. Once loaded, it is transformed in place for the specific purposes of individual departments.

What is a relational database?

Data is organized into tables, and each table represents a collection of related information. Each table consists of rows and columns, where each row represents a specific record and each column represents a specific attribute or characteristic of that record. Relationships between tables are defined through keys that link one table to another. Flexible, scalable and offers high data integrity.

What is an index in SQL?

Data structure that is created on one or more columns of a table that allow the database engine to quickly locate and retrieve specific rows based on the values in the indexed columns.
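
A small sketch of the effect, using sqlite3 and its EXPLAIN QUERY PLAN statement. The users table and the index name idx_users_email are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (email, city) VALUES (?, ?)",
    [(f"user{i}@example.com", "Manila") for i in range(1000)],
)

QUERY = "SELECT id FROM users WHERE email = ?"


def plan(conn):
    """Return the query plan text SQLite reports for QUERY."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN " + QUERY, ("user500@example.com",)
    ).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " ".join(row[-1] for row in rows)


before = plan(conn)  # without an index the engine scans the whole table
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(conn)   # with the index the engine searches via idx_users_email

print(before)
print(after)
```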

What functionality does indexing serve in SQL?

Database feature in SQL that provides fast and efficient way to retrieve data from a table. Indexes are typically created on columns that are frequently searched or used in joins, or on columns that enforce constraints such as primary keys.

What is Apache Spark and when would you use it?

Distributed computing system designed for large-scale data processing tasks. Runs on a cluster of servers that perform computations in parallel. Processes data in memory. Use it when a dataset is too large to process efficiently on a single machine.

What is Kafka?

Distributed streaming platform designed to handle high volumes of real-time data streams. Provides a scalable and fault-tolerant platform for data integration. Basic usage involves creating a topic, publishing messages to it and subscribing to the topic to consume them.

Scenarios that are appropriate for relational database

E-commerce: relational database to store information about products, customers, orders, transactions. Healthcare: used to store and manage patient data, medical history, test results, medication records. Provides high degree of data accuracy and consistency.

What does lazy evaluation mean?

Execution plan is made but not carried out until an action calls upon it. Reduces computation time and memory usage. Actions include count, take, collect.
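
The idea can be imitated in plain Python with generators. This is only an analogy under that assumption, not Spark code: building the pipeline corresponds to defining transformations, and consuming it corresponds to calling an action.

```python
log = []


def numbers():
    """Yield 1..5 and record when each value is actually produced."""
    for n in range(1, 6):
        log.append(n)
        yield n


# Defining the pipeline does no work yet: nothing has been produced.
pipeline = (n * n for n in numbers() if n % 2 == 1)
produced_before = list(log)

# Consuming the pipeline triggers the computation, like an action would.
result = list(pipeline)
produced_after = list(log)

print(produced_before, result, produced_after)
```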

What is ETL?

Extract, Transform, Load. Data is extracted from source systems, transformed into the desired format and loaded into the target system. Transformation involves cleaning, filtering and manipulating data to ensure it is accurate and consistent. Because the data is cleaned and highly normalized before loading, detail that a specific use case later needs may be lost.
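
A toy sketch of the three stages using only the standard library. The CSV text, the payments table and the cleaning rules are all hypothetical, and a real pipeline would read from actual source systems.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system.
RAW = "name,amount\n Ana ,10.5\nBen,not-a-number\nana,4.5\n"


def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean names and drop rows with invalid amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # rejected rows never reach the target system
        clean.append((row["name"].strip().lower(), amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
loaded = conn.execute(
    "SELECT name, amount FROM payments ORDER BY amount"
).fetchall()
print(loaded)
```

The dropped row shows how transforming before loading can discard data that is then unavailable in the target system.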

What are containers in Docker?

Lightweight, isolated environments that package software and its dependencies, enabling applications to run consistently across different computing environments.

What is a function in Python?

Named block of reusable code that performs a specific task or carries out a particular computation. Encapsulates a sequence of instructions. Lets you break complex tasks into smaller functions, making code easier to maintain and debug.
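
A brief sketch with two hypothetical functions, where the larger task (describe) is built from a smaller reusable one (mean).

```python
def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


def describe(values):
    """Summarize a list by reusing the smaller mean() function."""
    return {"count": len(values), "mean": mean(values)}


summary = describe([2, 4, 6])
print(summary)
```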

What is OLAP?

Online Analytical Processing, historical data, complex queries and data analysis for business intelligence, optimized for fast query performance, organized into star/snowflake schemas to allow for efficient data aggregation/analysis.

What is OLTP?

Online Transaction Processing, operational data, support business processes that require frequent and rapid transactions, optimized for fast data access and updates, highly normalized to minimize redundancy and improve data consistency.

What does an aggregation function do in SQL?

Operates on a set of rows to return a single value. Used to summarize data based on one or more columns in a table. COUNT, MIN, MAX, AVG and SUM are often used together with GROUP BY to summarize data across multiple rows.
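
A minimal sketch run through sqlite3, with a hypothetical orders table. Each group of rows collapses into a single summary row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 10.0), ("ana", 30.0), ("ben", 5.0)],
)

# One output row per customer: the aggregates summarize each group.
summary = conn.execute("""
    SELECT customer, COUNT(*), SUM(amount), AVG(amount)
    FROM orders
    GROUP BY customer
    ORDER BY customer
""").fetchall()
print(summary)
```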

What does normalization mean?

Organizing data to reduce redundancy and ensure data consistency and integrity. Involves breaking down large tables into smaller tables and establishing relationships between them. 1NF: each table has a primary key and each attribute is indivisible. 2NF: the table is in 1NF and each non-key attribute is fully dependent on the whole primary key.

What does a window function do in SQL?

Performs calculations across a set of rows related to the current row (the window) and returns a result for each row. Used to calculate running totals, moving averages or rankings of rows based on specific criteria. Unlike GROUP BY, does not collapse rows into a single output row. ORDER BY specifies the order in which the rows in each window are processed. PARTITION BY divides the rows into groups based on one or more columns; the window function is applied separately to each partition.
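
A sketch of a running total per customer, run through sqlite3 and assuming the bundled SQLite is version 3.25 or newer (the first release with window functions). The orders table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("ana", 1, 10.0), ("ana", 2, 30.0), ("ben", 1, 5.0), ("ben", 2, 7.0)],
)

# Every input row is kept; each one gains a running total that
# restarts for every customer (PARTITION BY) and follows day order.
running = conn.execute("""
    SELECT customer, day, amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY day) AS running_total
    FROM orders
    ORDER BY customer, day
""").fetchall()

for row in running:
    print(row)
```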

What is a stored procedure in SQL?

Prewritten block of code that is stored in a database and can be called by a user or application. Performs a series of operations that manipulate data, control flow or perform calculations. Advantages: encapsulates complex logic or calculations that would be difficult to express in a single SQL statement, and can improve query performance since the database engine can compile and cache the procedure for faster execution.

What are images in Docker?

Read-only template that serves as basis for creating containers. Created from a set of instructions specified in a Dockerfile.

When would you use a star schema and when would you use a relational database?

Relational database: suitable for managing large volumes of structured data that require frequent updates, complex queries and flexible schema designs. Highly transactional data. Star schema: suitable for data warehousing and business intelligence applications that require fast and efficient querying of large volumes of data. Optimize query performance by reducing joins. Better suited for analytical applications.

What is an RDD?

Resilient Distributed Dataset. Immutable distributed collection of data elements that can be processed in parallel across a cluster of machines. Fault tolerant: lost partitions of an RDD can be reconstructed from the lineage of its parent RDDs. Lower-level API for distributed data processing. Has no schema, so it suits unstructured data. Transformations such as map and filter are lazily evaluated; actions such as reduce trigger execution.

What does ACID mean?

Set of properties that ensure database transactions are processed reliably, even in the event of hardware or software failures.

What are APIs?

Set of rules defining how software components should interact and communicate with each other. Many web APIs are RESTful, meaning they follow a set of conventions for client requests and server responses over HTTP. REST APIs can be implemented using various frameworks. FastAPI is a high-performance web framework that uses Python features such as type hints to design APIs quickly and effectively.

What is a data silo?

Situation where data is stored and managed in isolated databases within an organization, making it difficult to access or share that data across different departments, teams or systems. Data is compartmentalized.

What scenarios would be appropriate for a star schema and a snowflake schema?

Snowflake better used with large, complex and highly granular dim tables. Star better used where query performance is a high priority and the size and complexity of the data are small.

What is test driven development?

Software development methodology that emphasizes writing automated tests for each piece of functionality before writing the code to implement that functionality. Ensures the code being developed is thoroughly tested, meets the requirements of the user and avoids unnecessary code.

What is a virtual machine?

Software environment that emulates a complete hardware system including the CPU, memory, storage and network interfaces. Allows for running of multiple OS and applications on a single machine without interfering with each other.

What is a query in SQL?

Statement that retrieves data from one or more tables and returns a result set. The result is computed from the current data each time the query is run.

What is a virtual environment?

Tool that allows for the creation of isolated Python environments, each with its own dependencies, packages and config settings. Allows for installing and managing multiple sets of packages on the same system without affecting other Python installations. Used to ensure consistency and isolation across different development environments.
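
A minimal sketch that creates an environment programmatically with the standard library venv module. The directory name demo-env is hypothetical, and pip is skipped only to keep the sketch fast.

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "demo-env"

    # Create an isolated environment; with_pip=False skips installing pip.
    venv.create(env_dir, with_pip=False)

    # Each environment records its own settings in pyvenv.cfg.
    has_config = (env_dir / "pyvenv.cfg").exists()

print(has_config)
```

On the command line the usual equivalent is python -m venv followed by the target directory.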

What is Git?

Distributed version control system used for tracking changes in software development projects. Allows for collaboration, keeping track of changes and managing different versions of source code efficiently. Provides a command-line interface for working with repositories.

What is a view in SQL?

Virtual table based on the result set of a SELECT statement. Stored as a SELECT statement that can be executed like a table. When you query the view, the database engine executes the SELECT statement to generate the result set.
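
A short sketch with sqlite3 showing that a view stores the query rather than the data. The orders table and the big_orders view are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 30.0), ("ben", 5.0)],
)

# The view saves the SELECT statement, not a copy of the rows.
conn.execute("""
    CREATE VIEW big_orders AS
    SELECT customer, amount FROM orders WHERE amount >= 20
""")

first = conn.execute("SELECT * FROM big_orders ORDER BY amount").fetchall()

# New base-table rows appear in the view the next time it is queried.
conn.execute("INSERT INTO orders VALUES ('cy', 50.0)")
second = conn.execute("SELECT * FROM big_orders ORDER BY amount").fetchall()

print(first, second)
```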

