Fivetran/SE Knowledge


XMIN system column

- Based on hidden 'xmin' system column that is present in all PostgreSQL tables - Looks at a snapshot of the table and records how it was at that time; selects only new or changed rows since last update - Cannot replicate/track row deletions; a deleted row doesn't appear as being recently modified because it no longer exists - Requires a full table scan to detect updated rows
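
A toy model may make the deletion blind spot concrete. This is a minimal sketch in plain Python, not Fivetran's implementation: the rows, the `xmin` values, and the `changed_since` helper are all hypothetical, and transaction ID wraparound is ignored.

```python
# Toy model of XMIN-based incremental sync (illustrative only).
# Each row carries the ID of the transaction that last wrote it, like xmin.

def changed_since(table, last_synced_xmin):
    """Full scan: return rows written by a transaction newer than the cursor."""
    return [row for row in table if row["xmin"] > last_synced_xmin]

table = [
    {"id": 1, "name": "alice", "xmin": 100},
    {"id": 2, "name": "bob", "xmin": 101},
    {"id": 3, "name": "carol", "xmin": 102},
]
cursor = 102  # highest xmin seen by the previous sync

# Between syncs: row 1 is updated, row 2 is deleted, row 4 is inserted.
table[0] = {"id": 1, "name": "alicia", "xmin": 103}
table = [row for row in table if row["id"] != 2]
table.append({"id": 4, "name": "dave", "xmin": 104})

delta = changed_since(table, cursor)
synced_ids = sorted(row["id"] for row in delta)
print(synced_ids)  # the deleted row 2 never appears in the delta
```

The scan has to visit every row to find the changed ones, and the deleted row simply is not there to be found.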

Email (source connection)

- Use when your application can send a report as a CSV attachment; Fivetran reads this data and writes it to your specified table in the destination

What can Fivetran connect to?

- Cloud APIs - Databases - Cloud storage & file uploads - Events - Cloud Reports

What are the limitations of ETL?

- Complexity: Data pipelines run on custom code dictated by the specific needs of specific transformations - Brittleness: Parts of the code base can become nonfunctional with little warning, and new business requirements and use cases require extensive revisions of the code - Inaccessibility: requires dedicated data engineers and not accessible to business users

How does ELT address ETL shortcomings?

- Complexity: Pipeline is simplified - Brittleness: Pipeline is more resilient and less risky because transformations are applied after the data is warehoused - Accessibility: Pipeline is more accessible because it's less labor-intensive to maintain

What are constraints in SQL?

- Constraints are used to specify the rules concerning data in the table - NOT NULL, CHECK, DEFAULT, UNIQUE, INDEX, PRIMARY KEY, FOREIGN KEY
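
The sketch below shows several of these constraints rejecting bad rows. It uses Python's built-in `sqlite3` module with made-up `customers` and `orders` tables, so the details are SQLite-specific (for example, foreign keys are enforced only after the `PRAGMA`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL CHECK (amount > 0),
        status TEXT DEFAULT 'new'
    )
    """
)
conn.execute("INSERT INTO customers (id, email) VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders (id, customer_id, amount) VALUES (1, 1, 9.5)")

def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

results = {
    "unique": violates("INSERT INTO customers (id, email) VALUES (2, 'a@example.com')"),
    "not_null": violates("INSERT INTO customers (id, email) VALUES (3, NULL)"),
    "check": violates("INSERT INTO orders (id, customer_id, amount) VALUES (2, 1, -1)"),
    "foreign_key": violates("INSERT INTO orders (id, customer_id, amount) VALUES (3, 99, 5)"),
}
# DEFAULT filled in the status column that the first order insert omitted.
default_status = conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()[0]
print(results, default_status)
```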

File Replication

- Convert data into files that can be loaded into your destination using one of the Fivetran file replication services

Components of a modern data stack

- Data Pipeline (ELT: Fivetran, Stitch, Segment) - Cloud Data Warehouse (Snowflake, Google BigQuery, Amazon Redshift) - Business Intelligence Tool (Looker, Tableau, Mode)

What are common data transformation problems?

- Data duplication - Data cleanup - Data manipulation from extracted data - Data manipulation in loading data

How long does it take to set up Fivetran?

- Depends on source and destination - In addition to connecting, each source and destination has a list of prerequisites for setup - Get prerequisites ready before you start to set up your connector

What kind of notifications do I get from Fivetran?

- Email notifications and alerts in your Fivetran dashboard - You can customize your notification settings in your Fivetran dashboard

Benefits of using a Cloud Function with Fivetran

- Extension of Fivetran - Run within your Virtual Private Cloud (VPC) - Support for many programming languages - Built-in software version control - Built-in testing

What challenges arise without an ETL/ELT solution?

- Extensive planning: think about all of the questions you want to ask your data ahead of time - Custom code: excludes business users, because the pipeline depends on custom code - Manual schema updates: Anytime schemas, APIs, or queries change, you must re-engineer your pipeline

What is ELT?

- Extract, Load, and Transform: data is loaded into a warehouse and transformed afterward (post-load transformation) - Transforming after loading is possible because of the rapid growth of cloud-based options and the plummeting cost of cloud-based computation and storage - Analysts do NOT have to determine beforehand what insights they want to generate downstream

What is ETL?

- Extract, Transform, and Load data into a warehouse (pre-transform) - Analysts have to predict every use of the data before a report is ever created - IT/Eng team must create custom code to extract and transform each source of data

What are your options when Fivetran doesn't support your source?

- File replication - Webhooks - Database applications - Cloud functions - Email - Fivetran has a global network of partners that we can connect you to

Fivetran Initial Sync of Data

- Fivetran copies all rows from every table in every schema for which it has SELECT permissions (except those excluded on your Fivetran dashboard) and adds Fivetran-generated columns - Rows are copied by performing a SELECT statement on each table - For large tables, a limited number of rows is copied at a time so that the sync does not have to start over from the beginning if the connection is lost midway
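
Copying a limited number of rows at a time can be sketched as keyset pagination with a saved checkpoint. This is one possible approach, not a description of Fivetran's internals: `copy_in_batches`, the `events` table, and the batch size are hypothetical, and `sqlite3` stands in for the source database.

```python
import sqlite3

def copy_in_batches(conn, table, key, batch_size, checkpoint=None):
    """Yield (rows, checkpoint) batches ordered by key, resuming after checkpoint."""
    while True:
        if checkpoint is None:
            rows = conn.execute(
                f"SELECT * FROM {table} ORDER BY {key} LIMIT ?", (batch_size,)
            ).fetchall()
        else:
            rows = conn.execute(
                f"SELECT * FROM {table} WHERE {key} > ? ORDER BY {key} LIMIT ?",
                (checkpoint, batch_size),
            ).fetchall()
        if not rows:
            return
        checkpoint = rows[-1][0]  # assumes the key is the first column
        yield rows, checkpoint

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
source.executemany(
    "INSERT INTO events VALUES (?, ?)", [(i, f"e{i}") for i in range(1, 11)]
)

copied = []
first_pass = copy_in_batches(source, "events", "id", batch_size=4)
rows, last = next(first_pass)  # the first batch succeeds
copied.extend(rows)

# Simulate a dropped connection, then resume from the saved checkpoint
# instead of starting over.
for rows, last in copy_in_batches(source, "events", "id", batch_size=4, checkpoint=last):
    copied.extend(rows)

print(len(copied), last)
```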

Where does Fivetran fit in with Data Transformation?

- Fivetran extracts from your data sources, automatically creates and manages schemas and data types appropriately for your destination; this creates tables that are immediately ready to query in your data warehouse or data lake

Fivetran main competitors & Fivetran

- Fivetran: Most robust solution, first to this use-case - Stitch (Talend): Open source - Segment (Twilio): Lets you send data to other apps (not just a data warehouse)

Explain the architectural style for creating a web API

- HTTP for client-server communication - XML/JSON as formatting language - Simple URL/URI as the address for the services - Stateless communication (server)

Database Applications

- If your application sits on top of an underlying database, send data to Fivetran from applications using one of the many Fivetran database connectors

Why don't I see any data in my destination yet?

- It can take a while for Fivetran to load data into your destination - Some sources have restrictive API limits which constrain how much data we can sync in a given time - Large amounts of data can also make the initial sync take longer

What are dbt Packages by Fivetran?

- Libraries of reusable dbt models that connect to a number of sources (Zendesk, Hubspot, Pinterest, Stripe, etc.) and create analytic-ready tables in your destination - Source packages: expose and document the underlying Fivetran schemas being created in the destination - Model package: Reproduce commonly-recreated models for each source

What are the two Postgres replication methods?

- Log-Based Replication (preferred approach) - XMIN system column

Write-Ahead Log (WAL)

- Logs present in Postgres database, which record all Data Manipulation Language (DML) in the database - Used for Log-based replication

What do you need to use Logical Replication in Generic PostgreSQL?

- May need significant extra storage space (even if log storage is only temporary) - Requires a server reboot to activate logging - Does NOT work if you use 'swap and drop' method of replicating data (create a temporary table and load in data from current table)

Column Hashing

- Method for anonymizing data in your destination while preserving its analytical value
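
A small sketch of why hashing preserves analytical value: equal inputs produce equal hashes, so grouping and joining still work on the hashed column. The choice of salted SHA-256, the `hash_column` helper, and the sample rows are assumptions for illustration; the source does not say which algorithm Fivetran uses.

```python
import hashlib

def hash_column(value, salt="demo-salt"):
    """Replace a value with a salted SHA-256 digest (algorithm is an assumption)."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()

rows = [
    {"email": "ann@example.com", "amount": 10},
    {"email": "bob@example.com", "amount": 5},
    {"email": "ann@example.com", "amount": 7},
]

# Hash the sensitive column and leave the other columns untouched.
hashed = [{**row, "email": hash_column(row["email"])} for row in rows]

# The same email always maps to the same digest, so per-customer totals survive.
totals = {}
for row in hashed:
    totals[row["email"]] = totals.get(row["email"], 0) + row["amount"]

print(sorted(totals.values()))
```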

What is a MAR?

- Monthly active rows, which are used for consumption-based pricing. - Companies exchange credits for MARs - A single row that is changed/updated multiple times in a month only counts as 1 MAR
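
The counting rule can be sketched as "distinct rows touched per month". The change-log format and the `monthly_active_rows` helper are hypothetical, and real billing may involve rules not covered in these notes.

```python
from collections import defaultdict

def monthly_active_rows(changes):
    """changes: iterable of (month, table, primary_key), one per insert or update."""
    active = defaultdict(set)
    for month, table, primary_key in changes:
        active[month].add((table, primary_key))  # repeated changes collapse into one entry
    return {month: len(keys) for month, keys in active.items()}

changes = [
    ("2024-01", "orders", 1),
    ("2024-01", "orders", 1),  # same row updated again: still one MAR
    ("2024-01", "orders", 2),
    ("2024-01", "users", 1),
    ("2024-02", "orders", 1),  # a new month counts the row again
]
mar = monthly_active_rows(changes)
print(mar)
```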

Cloud Functions

- Must have access to source's API - Write custom cloud function allowing Fivetran to read the data from the application and write it to your destination - Can use Go, Java, Node.js, Python, C#, or F# - Lets you write your own script to extract and process data - Host script on a serverless computing platform (Google Cloud Functions, AWS Lambda, or Azure Functions) - Take advantage of automatic syncing and deduplication, as with any other Fivetran standardized connector

What can't Fivetran always load data into?

- Not always possible to load large datasets into row stores like Postgres and MySQL that were designed as operational data stores - Will need to switch to a horizontally scalable column store like Snowflake, Redshift, BigQuery, or Azure Synapse.

What's the difference between PUT and POST?

- PUT will change/update the file or resource if there is already one that exists at that URI - POST sends data to a particular URI and expects the resource at the URI to deal with the request
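
The difference can be illustrated with an in-memory stand-in for a server. `ResourceStore` and its URIs are hypothetical and no real HTTP is involved; the sketch only shows that repeating a PUT leaves one resource while repeating a POST creates a new one each time.

```python
import itertools

class ResourceStore:
    """In-memory stand-in for a server that handles PUT and POST."""

    def __init__(self):
        self.resources = {}
        self._ids = itertools.count(1)

    def put(self, uri, body):
        """Create or replace the resource at exactly this URI."""
        self.resources[uri] = body
        return uri

    def post(self, collection_uri, body):
        """Hand the data to the collection, which decides where it lives."""
        uri = f"{collection_uri}/{next(self._ids)}"
        self.resources[uri] = body
        return uri

store = ResourceStore()
store.put("/users/42", {"name": "Ann"})
store.put("/users/42", {"name": "Ann B."})  # same URI: the resource is replaced
first = store.post("/users", {"name": "Bob"})
second = store.post("/users", {"name": "Bob"})  # same body, but a second resource

print(sorted(store.resources))
```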

Data Warehouse

- Process and transform data for advanced querying and analytics in a more structured database environment - Postgres, Amazon Redshift, Snowflake, Google BigQuery, MS SQL

What do you need to set up dbt transformation?

- Public Key: used to grant Fivetran SSH (Secure Shell network protocol) access to the Git repository - Repository URL - Default Schema Name

How do you perform data transformations?

- Python: Usually in combination with a library like pandas to emulate SQL for data manipulation - SQL: Applied in the destination database which can leverage the powerful capabilities of a cloud data warehouse like Snowflake, BigQuery, or Redshift

SSH Keys/Protocol

- Secure Shell keys are an access credential that is used in the SSH protocol - SSH provides a more secure way of logging into a server

What permissions does Fivetran require on the source and destination sides?

- Source: Only requires a READ permission for databases and API Cloud Applications - Destination: Requires the CREATE permission

What is SQL?

- Structured Query Language (SQL) is used to create and use databases, tables, and relationships. - SQL is divided into two categories: SQL statements for database definition and SQL statements for database processing (querying and updating)

Fivetran Funding

- Total of $163.1M, with the most recent being a Series C $100M round led by a16z at a $1.2B valuation

What's the major difference between ETL and ELT?

- Traditional ETL transforms data before loading it, whereas ELT loads data first and transforms it afterward, which allows business users to use the data themselves

What is TLS?

- Transport Layer Security - Allows both encryption and authentication of the connections that Fivetran makes to your data sources and destinations - Fivetran connections made over TLS are always encrypted, and support automatic verification for connectors that use hostname verification (such as web-based applications), and for proprietary systems with built-in certificate authority management such as Snowflake, BigQuery, and Redshift.

Webhooks

- Using Fivetran webhooks connector, send data to your destination from applications that support webhooks - HTTP callback triggered by an event on your website or application - Fivetran will give you a URL to put inside your SaaS service. When new events occur in your applications, the SaaS service will POST data to that URL, which Fivetran will collect and ingest

Log-Based Replication

- Uses the write-ahead log (WAL) present in the Postgres database, which records all data manipulation language (DML) changes in the database - Can update tables incrementally without scanning the entire table to find correct transaction IDs - Most efficient and can capture deleted information for tables with Primary Keys - Minimizes processing overhead on your PostgreSQL server - Can only enable if PostgreSQL server version is 9.4.15+, 9.5.10+, 9.6.6+ or 10.1+

Where can I see my data in Fivetran?

- You can't see your data in Fivetran because we don't store it - A Fivetran sync loads your data into your destination (warehouse or lake); however, you can check your schema and sync status on your Fivetran dashboard

What happens if a Fivetran sync fails?

- You do NOT lose data when a sync fails, but no data is added or updated in your destination - Fivetran will immediately notify you about your failed sync so that you can begin troubleshooting

Fivetran-generated columns (Initial Sync)

- _fivetran_synced: (UTC timestamp) keeps track of when each row was last successfully synced - _fivetran_id: (string) is the hash of the non-Fivetran values of each row. A unique ID that Fivetran uses to identify rows in tables that do not have a Primary Key - _fivetran_deleted: (boolean) Marks rows that were deleted in the source database (WAL only)

What does using dbt with Fivetran allow business users to do?

- dbt can connect to a git repository so that all code generated by the analytics team is now centralized in a single source of truth - View job status - Investigate run details - Pause jobs at any time

How can you flatten and query JSON data in Snowflake?

1. Convert strings containing JSON to a VARIANT type using Snowflake's PARSE_JSON function 2. Use Snowflake's FLATTEN function on the VARIANT data 3. Use Snowflake's SQL extensions to query the semi-structured data NOTE - you can also use Fivetran's native Transformations to run the functions automatically against your JSON data and save the results in a new schema
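
The three steps can be mimicked in plain Python to show what they do to the data. This is an analogy and not Snowflake code: `json.loads` stands in for parsing the string, the inner loop stands in for producing one row per array element, and the sample payloads are made up.

```python
import json

raw_rows = [
    {"order_id": 1, "payload": '{"items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}'},
    {"order_id": 2, "payload": '{"items": [{"sku": "C", "qty": 5}]}'},
]

flattened = []
for row in raw_rows:
    document = json.loads(row["payload"])  # step 1: JSON string -> structured value
    for index, item in enumerate(document["items"]):  # step 2: one row per array element
        flattened.append(
            {
                "order_id": row["order_id"],
                "index": index,
                "sku": item["sku"],  # step 3: read fields from the nested value
                "qty": item["qty"],
            }
        )

for record in flattened:
    print(record)
```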

Data integration Data Stack

1. Data Sources 2. Data Pipeline and Data Connectors 3. Data warehouse and/or data lake 4. Data modeling and/or transformations 5. BI Tool

What are the steps to Data Integration?

1. Data gathered from sensor feeds, manual data entry or software, and stored in files or databases 2. Data extracted from files, databases, and API endpoints and centralized in a data warehouse 3. Data cleansed and modeled to meet analytics needs 4. Data used to power products or generate BI

What is a Join?

A SQL join is used to combine records (rows) from two or more tables in a SQL database, based on a related column between them (INNER, LEFT, RIGHT, FULL OUTER)
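
A short runnable comparison of two join types, using Python's built-in `sqlite3` module and made-up `customers` and `orders` tables. Only INNER and LEFT joins are shown because older SQLite versions do not support RIGHT or FULL OUTER joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 25.0), (11, 1, 5.0)])

# INNER JOIN keeps only customers that have at least one matching order.
inner = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# LEFT JOIN keeps every customer; missing order columns come back as NULL (None).
left = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id, o.id"
).fetchall()

print(inner)
print(left)
```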

What is an index?

A database index is a data structure that provides quick lookup of data in a column or columns of a table

What is a query?

A query is a request for data or information from a database table or combination of tables

What is a subquery?

A query within another query
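
For instance, a subquery can compute a value that the outer query then filters on. The sketch uses `sqlite3` and a made-up `orders` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 60.0)])

# The inner SELECT computes the average amount (30.0);
# the outer SELECT keeps only the orders above it.
above_average = conn.execute(
    "SELECT id FROM orders "
    "WHERE amount > (SELECT AVG(amount) FROM orders) "
    "ORDER BY id"
).fetchall()

print(above_average)
```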

What is the SELECT statement?

A statement that retrieves data from the database.

What are Tables and Fields?

A table is an organized collection of data stored in the form of rows and columns; a field is a single column in that table

Business Intelligence (BI) Tools

A tool that combines business analytics, data mining, data visualization, data tools and infrastructure, and best practices to help organizations to make more data-driven decisions (Looker, Tableau, MODE)

What is a database?

An organized collection of data, stored and retrieved digitally from a remote or local computer system

What are the similarities and differences between JSON and XML?

Both: - are self-describing (human-readable) - can be parsed - can be fetched with an XMLHttpRequest JSON: - is easier to parse (call JSON.parse on the JSON string) - is already in JavaScript object notation

What can Fivetran load data into?

Data Warehouses - Postgres - Amazon Redshift - Snowflake - Google BigQuery - MS SQL

What is DBMS?

Database Management System - a system software responsible for the creation, retrieval, updating and management of the database

What is a database schema?

Description of a database which is specified during database design and not expected to change frequently. Most data models have conventions for displaying schemas as diagrams.

Why would PostgreSQL have a Permission Denied error?

The Fivetran user has NOT been granted 'read-only' access to your source schema/tables

How does Fivetran perform transforms within a data warehouse?

Fivetran enables users to create derivative tables (views) without altering the source data. This allows organizations to create a repository of record that is immune to changing business needs or upstream schema changes

Data Lake

Highly scalable storage repository that holds large volumes of raw data in its native format until required for use

What markup languages can be used in RESTful web APIs?

JSON and XML

How are JSON objects copied into the destination?

JSON objects are NOT unpacked into separate tables in the destination; rather, each object is loaded into a table, with each key made into a column.
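
A rough illustration of "each key becomes a column". The `object_to_row` helper and the sample object are hypothetical, and storing nested values as JSON text is an assumption of this sketch, not a documented Fivetran behavior.

```python
import json

def object_to_row(json_text):
    """Turn one JSON object into a single row with one column per top-level key."""
    parsed = json.loads(json_text)
    row = {}
    for key, value in parsed.items():
        # Nested structures stay in their column as JSON text
        # instead of being split out into child tables.
        row[key] = json.dumps(value) if isinstance(value, (dict, list)) else value
    return row

row = object_to_row('{"id": 7, "name": "Ann", "address": {"city": "Oslo"}}')
print(list(row))  # column names
print(row)
```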

What's the main difference between the WHERE and HAVING clause?

Main difference comes when using the GROUP BY clause, in that WHERE is used to filter rows before grouping and HAVING is used to filter rows after grouping
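
The two filters can be seen side by side in one query. The sketch uses `sqlite3` with a made-up `orders` table: WHERE drops refunded rows before grouping, and HAVING drops whole groups after the sums are computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", 50, "paid"),
        ("ann", 70, "paid"),
        ("ann", 500, "refunded"),
        ("bob", 30, "paid"),
        ("bob", 20, "paid"),
    ],
)

big_spenders = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders "
    "WHERE status = 'paid' "       # row filter, applied before grouping
    "GROUP BY customer "
    "HAVING SUM(amount) > 100 "    # group filter, applied after grouping
    "ORDER BY customer"
).fetchall()

print(big_spenders)
```

Ann's refunded 500 never reaches the sum because WHERE removed it first, and Bob's group (total 50) is removed by HAVING.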

Explain what is REST and RESTFUL

REST (Representational State Transfer) is an architectural style that focuses on system resources and how the state of a resource is transferred over HTTP to clients written in different languages; a RESTful service is one that follows this style. HTTP methods include GET, POST, PUT, and DELETE, which can be used to perform CRUD operations

What is a Primary Key?

A column (or set of columns) that uniquely identifies each record in a table.

When should you use Cloud Functions?

When you are fetching data from the following places: - APIs without a pre-built connector - Private APIs - Data formats that don't self-describe - Sensitive data that needs filtering or anonymizing

