Model Scope, Azure AI, Hugging Face, PyTorch, Keras, and TensorFlow
TFDV Check for Data Drift and Skew
# Calculate skew for the diabetesMed feature
diabetes_med_feat = tfdv.get_feature(schema, 'diabetesMed')
# domain knowledge helps to determine this threshold
diabetes_med_feat.skew_comparator.infinity_norm.threshold = 0.03

# Calculate anomalies
skew_drift_anomalies = tfdv.validate_statistics(train_stats, schema,
                                                previous_statistics=eval_stats,
                                                serving_statistics=serving_stats)

# Display anomalies
tfdv.display_anomalies(skew_drift_anomalies)
TFDV Calculate, Visualize Anomalies
# Check evaluation data for errors by validating the evaluation
# dataset statistics using the reference schema
anomalies = tfdv.validate_statistics(
    statistics=eval_stats,
    schema=schema  # training set schema
)

# Visualize anomalies
tfdv.display_anomalies(anomalies)
Training models
.fit() function
TensorFlow Hierarchy of Abstraction Layers
0) Execution layer: CPU, GPU, TPU, Android
1) Low Level API: C++ Core TensorFlow. Implement in C++, then register the implementation in the TensorFlow API; you can also get a Python wrapper.
2) Python Level API: Python Core TensorFlow provides full control for creating shapes and tensors.
3) Predefined components and modules for building custom neural net models: tf.layers, tf.losses, tf.metrics, tf.optimizers, etc.
4) High Level API lets you easily do common NN activities like distributed training, data preprocessing, and model definition (tf.keras, tf.data, tf.estimator).
Batch Normalization
Normalizing the input or output of the activation functions in a hidden layer. Batch Norm doesn't work well with smaller batch sizes, which produce too much noise in the mean and variance of each mini-batch. It is standard practice to normalize the data to zero mean and unit variance. Batch normalization can provide the following benefits: makes neural networks more stable by protecting against outlier weights and exploding gradients, enables higher learning rates, and reduces overfitting.
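A minimal Keras sketch (the layer sizes and 10-feature input are assumptions) showing where a BatchNormalization layer is typically inserted, between a layer's linear output and its activation:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(10,)),
    tf.keras.layers.BatchNormalization(),   # normalize the layer's outputs over each mini-batch
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])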
batch processing
Processing large amounts of data all in one batch when the computer system is not busy. Smaller batch sizes are noisier but tend to give lower generalization error, and are easier to pass to a CPU or GPU. Standard range 1-512; 32 is a common default.
# PyTorch
BATCH_SIZE = 2
for epoch in range(NUM_EPOCHS):
    for i in range(0, X.shape[0], BATCH_SIZE):
        # set gradients to zero
        optimizer.zero_grad()
        # forward pass
        y_pred = model(X[i:i+BATCH_SIZE])
        # calculate loss
        loss = loss_fun(y_pred, y_true[i:i+BATCH_SIZE])
        loss.backward()
        # update weights
        optimizer.step()
tf.constant
Produces Constant immutable tensors
tf.Variable
Produces mutable tensors.
x = tf.Variable(initial_value, dtype=..., name=...)
Afterwards the value of x can be changed with:
x.assign(value)
x.assign_add(increment_value)
x.assign_sub(decrement_value)
Reshaping Operations
The shape of a tensor gives us something concrete we can use to build an intuition for our tensors. Our task is to build neural networks that can transform or map input data to the correct output we are seeking. This can be done by:
A) Reshaping: reshape([row, col])
B) Changing rank:
i) Squeezing a tensor removes the dimensions or axes that have a length of one, using squeeze(): [1,12] -> [12]
ii) Unsqueezing a tensor adds a dimension with a length of one, using unsqueeze(dim=0): [12] -> [1,12]
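The squeeze()/unsqueeze() names above are PyTorch-style; a small sketch of the TensorFlow equivalents (toy shapes assumed):
import tensorflow as tf
x = tf.zeros([1, 12])
tf.reshape(x, [3, 4])                    # reshape to 3 rows, 4 columns
tf.squeeze(x)                            # [1, 12] -> [12]: drop axes of length one
tf.expand_dims(tf.zeros([12]), axis=0)   # [12] -> [1, 12]: add an axis of length one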
Embeddings Columns
These overcome the limitation of one-hot encoding for large vocabularies. An embedding column creates a dense vector instead of just a sparse one of 0s and 1s.
fc.embedding_column(categorical_column=fc_crossed_ploc, dimension=3)
Feature Column
Think of feature columns as the intermediary between raw data and Estimators. They enable you to transform a diverse range of raw data into formats that Estimators can use.
featcols = [
    tf.feature_column.numeric_column("hdr 1"),
    tf.feature_column.categorical_column_with_vocabulary_list("hdr 1", ["v1", "v2", "v3"])]
Other tf.feature_column types:
categorical_column_with_vocabulary_file
categorical_column_with_identity
categorical_column_with_hash_bucket
bucketized_column
embedding_column
crossed_column
...
Machine learning frameworks
provide tools and code libraries: PyTorch, TensorFlow, Keras, Caffe2, Gluon, CNTK, Torch, Chainer, Apache MXNet
built in Keras
Keras was built into TF 2.x
fc.bucketized_column
n_buckets = 16
bucket_ranges = np.linspace(start=38.0, stop=42.5, num=n_buckets).tolist()
fc.bucketized_column(
    source_column=fc.numeric_column("col1"),
    boundaries=bucket_ranges)
Cross-Entropy Loss
Loss function measuring performance of classification models.
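A minimal sketch (the toy labels and predicted probabilities are assumptions) of computing binary cross-entropy with Keras:
import tensorflow as tf
y_true = [0., 1., 1.]
y_pred = [0.1, 0.8, 0.6]                     # predicted probabilities
loss_fn = tf.keras.losses.BinaryCrossentropy()
loss = loss_fn(y_true, y_pred)               # mean of -[y*log(p) + (1-y)*log(1-p)]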
Machine Learning Challenges
Machine learning challenges:
Data - quality, representation, quantity, over/under fitting
People - expertise, cost, support
Business - explainability, question formulation, cost of building
Technology - privacy, selection, integration
"Adam" Compile Optimizer
Model compile optimizer, like stochastic gradient descent, used to adjust weights based on the error/loss function to minimize error in each iteration. Best when there are a lot of parameters, large data sets, or noisy/sparse gradients.
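A minimal sketch (the single-layer model and learning rate are assumptions) of compiling a Keras model with the Adam optimizer:
import tensorflow as tf
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,), activation='sigmoid')])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])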
"Adagrad" Compile Optimizer
Adagrad is an adaptive learning rate algorithm that keeps track of the squared gradients over time and automatically adapts the learning rate per parameter. It can be used instead of vanilla SGD and is particularly helpful for sparse data: it assigns a higher learning rate to infrequently updated parameters and a lower learning rate to frequently occurring features. Sparse data sets are data sets that have empty values for many features. The problem is that the learning rate decreases with time, and sometimes it gets so small that learning becomes slow, which was the initial problem to overcome.
Leaky ReLU
Negative values aren't zeroed but kept close to zero. It is a common, effective method to solve the dying ReLU problem, and it does so by adding a slight slope in the negative range. This modifies the function to generate small negative outputs when the input is less than 0.
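A small sketch (the toy inputs and 0.2 slope are assumptions) of Leaky ReLU in TensorFlow:
import tensorflow as tf
x = tf.constant([-2.0, -0.5, 0.0, 1.5])
tf.nn.leaky_relu(x, alpha=0.2)       # negative inputs are scaled by 0.2 instead of zeroed
layer = tf.keras.layers.LeakyReLU()  # the same idea as a Keras layer
layer(x)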
TensorFlow Execution Engine
Like a JVM, the TensorFlow execution engine is portable: it takes the DAG of nodes and connecting tensors and runs it on any OS or hardware (CPU/GPU/TPU).
TensorFlow
A library for numerical computation. A large-scale, distributed machine learning platform. The term also refers to the base API layer in the TensorFlow stack, which supports general computation on dataflow graphs. Although TensorFlow is primarily used for machine learning, you may also use TensorFlow for non-ML tasks that require numerical computation using dataflow graphs.
Flatten
A flatten operation reshapes a tensor into a shape whose length equals the number of elements contained in the tensor. This is the same thing as a 1d-array of elements. [x,y,z] -> [1, #elements]
Activation Functions
A neural network without an activation function is essentially just a linear regression model. An activation function is used to map the input to the output and helps a neural network learn complex relationships and patterns in data. It is the non-linear transformation layer that makes an ANN non-linear, acting as a transition point between hidden layers and the only way to prevent an ANN from collapsing down into a single linear model.
Sigmoid - 0 <= y <= 1, continuous; clear binary classifier
Softmax - generalization of sigmoid for multiclass classification
Tanh - like sigmoid but -1 to 1 and symmetric
ReLU - Rectified Linear Unit (start here)
Softplus (~ReLU)
Leaky ReLU
PReLU - Parametric
ELU - Exponential
GELU - Gaussian
TensorFlow Playground
A program that visualizes how different hyperparameters influence model (primarily neural network) training. Go to http://playground.tensorflow.org to experiment with TensorFlow Playground.
MLMD: Architecture
A) Top layer: the TFX components present in an ML pipeline
B) Every component is connected to the same Metadata Store
   Artifact: ArtifactType -> Event: input -> Execution: ExecutionType -> Event: output -> Artifact: ArtifactType
   Artifact: ArtifactType -> Attribution -> Context: ContextType
   Execution: ExecutionType -> Association -> Context: ContextType
C) Storage backend: the physical storage medium of the Metadata Store
D) (Optional) GUI

!pip install ml-metadata
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Database connection where MLMD registers metadata
connection_config = metadata_store_pb2.ConnectionConfig()

# Fake (in-memory) DB
connection_config.fake_database.SetInParent()

# or SQLite
connection_config.sqlite.filename_uri = ...
connection_config.sqlite.connection_mode = 3  # READWRITE

# or MySQL
connection_config.mysql.host = ...
connection_config.mysql.port = ...
connection_config.mysql.database = ...
connection_config.mysql.user = ...
connection_config.mysql.password = ...

# Instantiate the store with the config
store = metadata_store.MetadataStore(connection_config)
ML Metadata Library (MLMD)
A) Tracks and retrieves metadata flowing between components in a pipeline.
B) Artifacts/objects are stored in MLMD. Artifact properties are stored in a relational database; multiple storage backends are supported.

Data Entities:

UNITS - each unit can hold additional data describing it in more detail using properties.
A) Artifact - the elementary unit of data fed into the Metadata Store. Describes a specific instance of an ArtifactType, and its properties that are written to the metadata store.
B) Execution - a record of a component run or a step in an ML workflow and its runtime parameters. An execution can be thought of as an instance of an ExecutionType. Executions are recorded when you run an ML pipeline or step.
C) Context - a grouping or clustering of artifacts and executions; an instance of a ContextType. It captures the shared information within the group, for example: project name, changelist commit id, experiment annotations, etc. It has a user-defined unique name within its ContextType.

TYPES - each type contains the properties of that type.
A) ArtifactType - describes an artifact's type and its properties that are stored in the metadata store. You can register these types on-the-fly with the metadata store in code, or you can load them in the store from a serialized format. Once you register a type, its definition is available throughout the lifetime of the store.
B) ExecutionType - describes a type of component or step in a workflow, and its runtime parameters.
C) ContextType - describes a type of conceptual group of artifacts and executions in a workflow, and its structural properties. For example: projects, pipeline runs, experiments, owners, etc.

RELATIONSHIPS - store the various units getting generated or consumed when interacting with other units.
A) Event - a record of the relationship between artifacts and executions. When an execution happens, events record every artifact that was used by the execution, and every artifact that was produced. These records allow for lineage tracking throughout a workflow. By looking at all events, MLMD knows what executions happened and what artifacts were created as a result. MLMD can then recurse back from any artifact to all of its upstream inputs.
B) Attribution/Association - records of the relationships between artifacts and contexts (Attribution) and between executions and contexts (Association), as shown in the architecture above.
Gradient tapes
API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow "records" relevant operations executed inside the context of a tf.GradientTape onto a "tape". TensorFlow then uses that tape to compute the gradients of a "recorded" computation using reverse-mode differentiation.
with tf.GradientTape(persistent=True) as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y**2)
    # <recorded operations> ...
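A self-contained sketch (toy shapes and values assumed) showing the recorded computation and how tape.gradient() recovers the gradients:
import tensorflow as tf
w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2), name='b')
x = tf.constant([[1., 2., 3.]])
with tf.GradientTape() as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y**2)
# gradients of the recorded loss with respect to both variables
dl_dw, dl_db = tape.gradient(loss, [w, b])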
TFX: Interactive Context and ExampleGen Component
Allows you to step through each component of a manually executed orchestration pipeline and inspect its outputs. When you initialize InteractiveContext:

# location of the pipeline metadata store
_pipeline_root = './pipeline/'
# directory of the raw data files
_data_root = './data/census_data'
# path to the raw training data
_data_filepath = os.path.join(_data_root, 'adult.data')

context = InteractiveContext(pipeline_root=_pipeline_root)

it will create a database in the _pipeline_root directory which the different components will use to save or get the state of the component executions.

ExampleGen will:
A) Split the data into training and evaluation sets (by default: 2/3 train, 1/3 eval).
# Instantiate ExampleGen with the input CSV dataset
example_gen = tfx.components.CsvExampleGen(input_base=_data_root)
B) Convert each data row into tf.train.Example format. This protocol buffer is designed for TensorFlow operations and is used by the TFX components.
# An Example is a standard proto storing data for training and inference.
example = tf.train.Example()
C) Compress and save the data collection under the _pipeline_root directory for other components to access. These examples are stored in TFRecord format, which optimizes read and write operations within TensorFlow.

context.run(example_gen)

# get the artifact object
artifact = example_gen.outputs['examples'].get()[0]
artifact.type -> <class 'tfx.types.standard_artifacts.Examples'>

# print split names and uri
print(f'split names: {artifact.split_names}')
print(f'artifact uri: {artifact.uri}')
artifact.split_names -> ["train", "eval"]
artifact.uri -> ./pipeline/CsvExampleGen/examples/1

# You will find TFRecords in these directories in GZIP format
./pipeline/CsvExampleGen/examples/1/Split-train
./pipeline/CsvExampleGen/examples/1/Split-eval

# list of those files
tfrecord_filenames = [os.path.join('./pipeline/CsvExampleGen/examples/1/Split-train', name)
                      for name in os.listdir(train_uri)]
Apache Beam
Batch/streaming data processing framework. It can also serve as an orchestrator: when pushing to production you want to automate the pipeline execution using orchestrators.
Convolutional Neural Network
A CNN utilizes spatial correlations that exist within the input data. It uses three basic ideas: local receptive fields, convolution, and pooling. Each layer of the network connects to a specific region of the input; this region is called the local receptive field. The area of the filter is also called the receptive field, named after the neuron cells. The output is projected onto hidden layers as feature maps. Convolution is performed on the input data with the use of a filter or kernel (these terms are used interchangeably) to produce a feature map. We execute a convolution by sliding the filter over the input; at every location, an element-wise multiplication is performed and the result is summed onto the feature map. Numerous convolutions with different filters produce different feature maps, and all of these feature maps are put together as the final output of the convolution layer. The function of pooling is to continuously reduce the dimensionality, reducing the number of parameters and computation in the network. Binary image classification is best with cross-entropy loss.
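A minimal Keras sketch of a binary image classifier (the 28x28 grayscale input and filter counts are assumptions): one convolution layer producing feature maps, pooling to reduce dimensionality, then a sigmoid output trained with cross-entropy loss:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # filters slide over the input
    tf.keras.layers.MaxPooling2D((2, 2)),                                            # pooling reduces dimensionality
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])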
One-hot Encoding
Categorical data needs to be represented as features in a vocabulary vector.
vocabulary ["v1","v2","v3"]
"v1" = [1,0,0]
"v2" = [0,1,0]
"v3" = [0,0,1]
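A small sketch of the same mapping with tf.one_hot (mapping the vocabulary to integer ids first is an assumption):
import tensorflow as tf
vocabulary = ["v1", "v2", "v3"]
indices = tf.constant([0, 1, 2])            # "v1", "v2", "v3" mapped to integer ids
tf.one_hot(indices, depth=len(vocabulary))  # -> [[1,0,0], [0,1,0], [0,0,1]]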
TensorFlow How it Works
Create a DAG to represent the computation that you want to do:
1) Nodes are mathematical operations like softmax or matrix multiplication
2) Edges are the tensors acting as the inputs and outputs of the nodes (mathematical operations), as arrays of data
initialization from out-of-memory
Data sets can be instantiated from sharded data, e.g. multiple CSV files:
ds = tf.data.Dataset.list_files("path.*") \
        .flat_map(tf.data.TextLineDataset) \
        .map(function_for_each_row)
Other file-based datasets: TextLineDataset, TFRecordDataset, FixedLengthRecordDataset
tf.data.Dataset initialization from in-memory
Data sets can be instantiated from tensors:
t = tf.constant([[4,2],[5,3]])
ds = tf.data.Dataset.from_tensors(t)        # ~ [[4,2],[5,3]]
or from slices, one element per row:
ds = tf.data.Dataset.from_tensor_slices(t)  # ~ [4,2], [5,3]
TFX: Component Architecture
Driver: supplies the required metadata to the Executor
Executor: where the component's coded functionality lives and runs
Publisher: stores the result in metadata
Metadata Store: where the metadata is located
Dropout Layers 0.0 to 1.0
Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel. The rate is the % chance of dropping a unit out; dropping means temporarily removing it from the network, along with all its incoming and outgoing connections. Dropout has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs.
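A minimal Keras sketch (the layer sizes and 0.5 rate are assumptions) of a Dropout layer between two Dense layers:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),   # each unit has a 50% chance of being dropped during training
    tf.keras.layers.Dense(1, activation='sigmoid'),
])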
TFDV Fix Evaluation Anomalies
# Get the feature's domain and add the missing anomalous value to it
feature_domain = tfdv.get_domain(schema, 'Feature_domain')
feature_domain.value.append('missing-anomaly-feature_value')

# Get the feature and relax the feature's distribution constraints
# to match 90% of the domain
payer_code_feat = tfdv.get_feature(schema, 'payer_code')
payer_code_feat.distribution_constraints.min_domain_mass = 0.9

# When inferring a feature, a domain (set of enumerated values)
# may be shared amongst features
tfdv.set_domain(schema, feature_path='Feature_domain2', domain='Feature_domain')
tfmd.proto.statistics_pb2
Generated protocol buffer code.
Keras
High Level Neural Network library running on top of Tensorflow, CNTK, and Theano. Keras allows for easy and fast prototyping as well as CPU and GPU execution. Python Based
Tensor
Immutable, stateless, sliceable TensorFlow data arrays of rank N (N-dimensional arrays). You will have to specify the shape.
x = tf.constant([[[..tin..],[..tin..]],[[..tin..],[..tin..]]])
or build up from lower-rank tensors by stacking:
tin = tf.constant([3,5,4])          # shape = (3,) = vector
t2 = tf.stack([tin, tin])           # shape = (2,3)
t3 = tf.stack([t2, t2, t2, t2])     # shape = (4,2,3)
...
tout = tf.stack([tn-1, ...])        # shape = (tout, ..., tin)
Sequential API
one input feeds a layer and results in an Output that feeds another layer
PyArrow
PyArrow integrates very nicely with Pandas and has many built-in capabilities for converting to and from Pandas efficiently. The Arrow datasets make use of these conversions internally.

import tensorflow_io.arrow as arrow_io
import numpy as np
import pandas as pd

data = {'label': np.random.binomial(1, 0.5, 10)}
data['x0'] = np.random.randn(10) + 5 * data['label']
data['x1'] = np.random.randn(10) + 5 * data['label']
df = pd.DataFrame(data)

ds = arrow_io.ArrowDataset.from_pandas(
    df, batch_size=2, preserve_index=False)

# Make an iterator to the dataset
ds_iter = iter(ds)

The DataFrame index column is omitted by setting preserve_index to False.

from pyarrow.feather import write_feather

# Write the Pandas DataFrame to a Feather file
write_feather(df, '/path/to/df.feather')

# Create the dataset with one or more filenames
ds = arrow_io.ArrowFeatherDataset(
    ['/path/to/df.feather'],                          # array of files
    columns=(0, 1, 2),                                # select column indices
    output_types=(tf.int64, tf.float64, tf.float64),
    output_shapes=([], [], []))

# Iterate over each row of each file
for record in ds:
    label, x0, x1 = record

ds = arrow_io.ArrowStreamDataset.from_pandas(
    df, batch_size=2, preserve_index=False)
Dimensions
The rank of a tensor is the length of its shape; e.g., a 3x4 matrix with shape [3,4] has rank 2.
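A quick sketch illustrating rank as the length of the shape (toy tensor assumed):
import tensorflow as tf
x = tf.zeros([3, 4])
x.shape       # TensorShape([3, 4])
tf.rank(x)    # 2 -- the length of the shape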
Types of Tensor operations
Reshaping, Element-Wise, Reduction, Access
automatic differentiation
A set of techniques to evaluate the derivative of a function specified by a program, applying the chain rule repeatedly across its operators. ML algorithms use it for backpropagation in neural networks.
Scalar Tensor Data
Simplest form of TensorFlow data: an array of rank 0 (a 0-dimensional array), i.e. a single numeric value.
x = tf.constant(3)   # shape = () = scalar
slicing_util
TFDV can be configured to compute statistics over slices of data. Slicing can be enabled by providing slicing functions which take in an Arrow RecordBatch and output a sequence of tuples of the form (slice key, record batch).

# None means every unique value of the feature
# Slice on the country feature (i.e., every unique value of the feature)
slice_fn1 = slicing_util.get_feature_value_slicer(features={'country': None})

# Slice on the cross of the country and state features (i.e., every
# unique pair of values of the cross)
slice_fn2 = slicing_util.get_feature_value_slicer(features={'country': None, 'state': None})

# Slice on specific values of a feature
slice_fn3 = slicing_util.get_feature_value_slicer(features={'age': [10, 50, 70]})
Slice Dataset For Unique Values of a Feature
TFDV computes statistics for the overall dataset in addition to the configured slices. Each slice is identified by a unique name which is set as the dataset name in the DatasetFeatureStatistics protocol buffer.

def split_datasets(dataset_list):
    '''split datasets.
    Parameters:
        dataset_list: List of datasets to split
    Returns:
        datasets: sliced data
    '''
    datasets = []
    for dataset in dataset_list.datasets:
        proto_list = DatasetFeatureStatisticsList()
        proto_list.datasets.extend([dataset])
        datasets.append(proto_list)
    return datasets

def display_stats_at_index(index, datasets):
    '''display statistics at the specified data index
    Parameters:
        index: index to show the anomalies
        datasets: split data
    Returns:
        display of generated sliced data statistics at the specified index
    '''
    if index < len(datasets):
        print(datasets[index].datasets[0].name)
        tfdv.visualize_statistics(datasets[index])

def sliced_stats_for_slice_fn(slice_fn, approved_cols, dataframe, schema):
    '''generate statistics for the sliced data.
    Parameters:
        slice_fn: slicing definition
        approved_cols: list of features to pass to the statistics options
        dataframe: pandas dataframe to slice
        schema: the schema
    Returns:
        slice_info_datasets: statistics for the sliced dataset
    '''
    # Set the StatsOptions
    slice_stats_options = tfdv.StatsOptions(schema=schema,
                                            slice_functions=[slice_fn],
                                            infer_type_from_schema=True,
                                            feature_allowlist=approved_cols)

    # Convert the dataframe to CSV since `slice_functions` works only with
    # `tfdv.generate_statistics_from_csv`
    CSV_PATH = 'slice_sample.csv'
    dataframe.to_csv(CSV_PATH)

    # Calculate statistics for the sliced dataset
    sliced_stats = tfdv.generate_statistics_from_csv(CSV_PATH, stats_options=slice_stats_options)

    # Split the sliced statistics into one entry per slice (uses split_datasets defined above)
    slice_info_datasets = split_datasets(sliced_stats)

    return slice_info_datasets
Vector Tensor Data
Tensorflow Data Array of Rank 1 or 1 dimensional array. Series of numbers x =tf.constant([3,5,4]) shape = (3,) = vector
Matrix Tensor Data
TensorFlow data array of rank 2 (a 2-dimensional array): a table of numbers.
x = tf.constant([[3,5,4],[1,2,6]])   # shape = (2,3) = matrix
3D Tensor Data
TensorFlow data array of rank 3 (a 3-dimensional array): a cube of numbers.
x = tf.constant([[[3,5,4],[1,2,6]],[[10,15,14],[11,12,16]]])   # shape = (2,2,3) = 3D tensor
The Dying ReLU Layer Problem
The dying ReLU problem refers to the scenario where many ReLU neurons only output values of 0. Caused by:
(i) High learning rate - if the learning rate (α) is set too high, there is a significant chance that the new weights will end up in the highly negative value range, since the old weights are adjusted by a large amount; the new negative weights will produce zeros.
(ii) Large negative bias - bias is a constant value added to the product of inputs and weights: (W*x) + b, where b is the bias.
Detect it via the fraction of zero weights in TensorBoard; solve it by reducing the learning rate or using a Leaky ReLU.
Arrow Dataset
The Arrow datasets are an extension of tf.data.Dataset. Arrow enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:
1) Arrow's standard in-memory format allows zero-copy reads, supporting accelerated operations on modern hardware and removing virtually all serialization overhead. It allows data exchange between systems without the need to implement a number of converters for different file formats.
2) Arrow is language-agnostic, so it supports different programming languages.
3) Arrow is column-oriented, so it is faster at querying and processing slices or columns of data.
4) Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow.
5) Arrow supports many, possibly nested, column types.
Currently, TensorFlow I/O offers 3 varieties of Arrow datasets: ArrowDataset, ArrowFeatherDataset, and ArrowStreamDataset.
Arrow data has two important aspects:
- The data is structured: Arrow defines a columnar data format, and a schema describes each column with a name, data type, bit width, etc., ensuring that TensorFlow will be using the exact type specification no matter where the source came from. The format of Arrow data is language-agnostic and is designed to transfer data across language boundaries.
- The data is batched: Arrow is used most efficiently when it is chunked into record batches that consist of a set of columns with an equal number of rows. Batches are then exchanged via stream or file formats. Arrow can natively create batches of data and use them to efficiently convert the batched data into tensors.
ArrowDataset
The ArrowDataset works with Arrow data that is already loaded into memory. Because all data must fit in memory, it is only recommended to use this method on small datasets and is most useful to quickly load data if it fits the memory constraints.
ArrowStreamDataset
The ArrowStreamDataset is used to connect to one or more endpoints that are serving Arrow record batches in the Arrow stream format. Streaming batches is an excellent way to iterate over a large dataset, local or remote, that might not fit entirely into memory. Currently supported endpoints are a POSIX IPv4 socket with endpoint <IP>:<PORT> or tcp://<IP>:<PORT>, a Unix domain socket with endpoint unix://<pathname>, and STDIN with endpoint fd://0 or fd://-. The constructor is nearly identical to the ArrowDataset.
TFX: Interactive Context and ExampleValidator Component
The ExampleValidator component detects anomalies in your data based on the generated schema from the previous step. Like the previous two components, it also uses TFDV under the hood. ExampleValidator takes as input the statistics from StatisticsGen and the schema from SchemaGen:

example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
context.run(example_validator)

# Visualize the output
context.show(example_validator.outputs['anomalies'])

Artifact at ./pipeline/ExampleValidator/anomalies/4
'train' split: No anomalies found.
'eval' split: No anomalies found.
EXPLODING GRADIENT PROBLEM
The exploding gradient problem is the opposite of the vanishing gradient problem: in deep neural networks, gradients may explode during backpropagation, resulting in number overflows. In backpropagation, weights are adjusted in a neural network to reduce the total error of the network by moving backward through the network's layers:
w1 := w1 - α (learning rate) * dL/dw1, where L is the loss function.
A common technique to deal with exploding gradients is to perform gradient clipping.
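A minimal sketch (the learning rate and clip values are assumptions) of gradient clipping via the standard Keras optimizer arguments:
import tensorflow as tf
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)   # cap the norm of each gradient
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)  # or cap each gradient element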
ReLU
The Rectified Linear Unit (ReLU) activation function can be described as f(x) = max(0, x). What it does is:
(i) For negative input values, output = 0
(ii) For positive input values, output = original input value
ReLU:
1) takes less time to learn and is computationally less expensive than other common activation functions (e.g., tanh, sigmoid). Because it outputs 0 whenever its input is negative, fewer neurons will be activated, leading to network sparsity and thus higher computational efficiency.
2) involves simpler mathematical operations
3)
TFX: Interactive Context and SchemaGen Component
The SchemaGen component also uses TFDV to generate a schema based on your data statistics. As you've learned previously, a schema defines the expected bounds, types, and properties of the features in your dataset. SchemaGen takes as input the statistics generated by StatisticsGen: statistics_gen.outputs['statistics'].

schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
)
context.run(schema_gen)

# Display the schema
context.show(schema_gen.outputs['schema'])
TFX: Interactive Context and StatisticsGen Component
The StatisticsGen component computes statistics over your dataset for data analysis, as well as for use in downstream components (i.e. the next steps in the pipeline). As mentioned earlier, this component uses TFDV under the hood, so its output will be familiar to you. StatisticsGen takes as input the dataset we just ingested using CsvExampleGen: example_gen.outputs['examples'].

statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs['examples'])
context.run(statistics_gen)

# Display interactive comparison graphs of the statistics
context.show(statistics_gen.outputs['statistics'])

artifact = statistics_gen.outputs['statistics'].get()[0]
artifact.uri -> ./pipeline/StatisticsGen/statistics/2

# You will find
./pipeline/StatisticsGen/statistics/2/Split-train
./pipeline/StatisticsGen/statistics/2/Split-eval
TFX: Interactive Context and Module Component for Preprocessing Function
The Transform component performs feature engineering for both training and serving datasets. It uses the TensorFlow Transform library introduced in the first ungraded lab of this week. Transform takes as input the data from ExampleGen (examples=example_gen.outputs['examples']), the schema from SchemaGen (schema=schema_gen.outputs['schema']), and a module containing the preprocessing function.

A) Create a Python file containing constants for the FEATURE KEYS and a utility for renaming the transformed features:

# Set the transform module filename
_census_transform_module_file = 'census_transform.py'

%%writefile {_census_transform_module_file}

# Feature to scale from 0 to 1
RANGE_FEATURE_KEYS = ['clouds_all']

# Features to be scaled to the z-score
DENSE_FLOAT_FEATURE_KEYS = ['temp', 'snow_1h']

# Features with string data types that will be converted to indices
CATEGORICAL_FEATURE_KEYS = ['education', 'marital-status', 'occupation',
                            'race', 'relationship', 'workclass', 'sex', 'native-country']

# Numerical features that are marked as continuous
NUMERIC_FEATURE_KEYS = ['fnlwgt', 'education-num', 'capital-gain',
                        'capital-loss', 'hours-per-week']

# Feature that can be grouped into buckets
BUCKET_FEATURE_KEYS = ['age']

# Number of buckets used by tf.transform for encoding each bucket feature.
FEATURE_BUCKET_COUNT = {'age': 4}

# Feature that the model will predict
LABEL_KEY = 'label'

# Count of out-of-vocab buckets in which unrecognized VOCAB_FEATURES are hashed.
OOV_SIZE = 10

# Features with string data types that will be converted to indices
VOCAB_FEATURE_KEYS = ['holiday', 'weather_main', 'weather_description']

# Utility function for renaming the feature
def transformed_name(key):
    return key + '_xf'

B) Create a preprocessing function that imports the above constants:

import tensorflow as tf
import tensorflow_transform as tft
import your_constants

# Unpack the contents of the constants module
_NUMERIC_FEATURE_KEYS = your_constants.NUMERIC_FEATURE_KEYS
_CATEGORICAL_FEATURE_KEYS = your_constants.CATEGORICAL_FEATURE_KEYS
_BUCKET_FEATURE_KEYS = your_constants.BUCKET_FEATURE_KEYS
_FEATURE_BUCKET_COUNT = your_constants.FEATURE_BUCKET_COUNT
_LABEL_KEY = your_constants.LABEL_KEY
_transformed_name = your_constants.transformed_name
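The lab's own preprocessing function is not shown above; a minimal preprocessing_fn sketch using a few TensorFlow Transform analyzers, with feature names borrowed from the constants module purely for illustration:
import tensorflow_transform as tft
def preprocessing_fn(inputs):
    outputs = {}
    # scale a numeric feature to its z-score
    outputs['age_xf'] = tft.scale_to_z_score(inputs['age'])
    # convert a string feature to vocabulary indices
    outputs['workclass_xf'] = tft.compute_and_apply_vocabulary(inputs['workclass'])
    # bucketize a numeric feature into 4 buckets
    outputs['hours-per-week_xf'] = tft.bucketize(inputs['hours-per-week'], num_buckets=4)
    return outputs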
tf.Example
A protocol buffer message type used to store data for training and inference as named features.
TensorFlow Transform
Training Data -> tf.Transform API -> Serving System

tf.Transform API = Pipeline + Metadata Storage - manages and tracks the lineage of artifacts produced and sent to the serving system.

PIPELINE: Input Data -> Transform -> Transformed Data -> Trainer -> Trained Models

tf.Transform API
----------------
Transform:
A) INPUT: ExampleGen (data), SchemaGen (schema), user code
   a1) Raw data -> Apache Beam -> transform TensorFlow graph (definition of all transformations we are doing on the data) -> transformed/processed data
B) OUTPUT: transformed data, transform graph

Trainer:
A) INPUT: transformed/processed data, transform TensorFlow graph
   a1) Transformed/processed data -> tf.Transform analyzers -> training -> model training TensorFlow graph
B) OUTPUT: model training TensorFlow graph

SAVED MODEL = model training TensorFlow graph + transform TensorFlow graph

Serving
-------
Raw inference request -> SAVED MODEL -> prediction
TensorFlow Embedding Projector
UI for helping you get a sense of what your data looks like in a particular space
Concat
We combine tensors using the cat() function, and the resulting tensor will have a shape that depends on the shape of the two input tensors. [2,2] and [2,2] -> [4, 2]
tf.data.Dataset
A wrapper that represents a potentially large list of samples and provides functionality for iterating and transforming. It allows prefetching of data, which lets the CPU gather data at the same time as the GPU does work. Without prefetching, the CPU and GPU work serially on data batches; with it, once a batch is sent to the GPU another batch can be prepared.
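A small sketch (toy data and batch size assumed) of batching plus prefetching so the CPU prepares the next batch while the accelerator works on the current one:
import tensorflow as tf
ds = tf.data.Dataset.from_tensor_slices(tf.range(1000))
ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)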
define a constant X with the shape (3,1)
X = tf.constant(np.random.randn(3,1), name = "X")
zero-copy I/O
A technique for transferring data across the kernel-user boundary without a memory-to-memory copy, e.g., by manipulating page table entries. It describes computer operations in which the CPU does not perform the task of copying data from one memory area to another, or in which unnecessary data copies are avoided.
Learning Rate
A value that can range from 0 to 1 and controls how much learning takes place after each trial. Pay attention to this when detecting exploding gradients. In backpropagation, weights are adjusted in a neural network to reduce the total error of the network by moving backward through the network's layers:
w1 := w1 - α (learning rate) * dL/dw1, where L is the loss function.
Essentially, the learning rate is the size of the step taken in each iteration of gradient descent.
For a stable epoch:
1) As you decrease the learning rate, bias takes longer to increase before eventually converging.
2) As you decrease the learning rate, losses take longer to converge to zero; you may need more epochs.
3) As you decrease the learning rate, the slope will increase before eventually decreasing.
4) If the learning rate is too large (e.g. 0.1), your losses may actually explode, increasing and blowing up to infinity.
5) If the learning rate is too large (e.g. 0.1), your bias may toggle between + and - infinity.
6) If the learning rate is too large (e.g. 0.1), your slope may toggle between + and - infinity.
cvlib
a very simple but powerful library for object detection that is fueled by OpenCV and Tensorflow.
Functional API
Graph-like: layers may take inputs from multiple layers and generate outputs that can also potentially go to multiple layers; essentially treating a model as a layer for other models.
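A minimal Functional API sketch (the layer sizes are assumptions); the same pattern extends to multiple inputs and outputs:
import tensorflow as tf
inputs = tf.keras.Input(shape=(3,))
x = tf.keras.layers.Dense(4, activation='relu')(inputs)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)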
vanishing gradient problem
As one keeps adding layers to a network, the network eventually becomes untrainable, because the gradients shrink toward zero as they are propagated back through more layers, so the early layers stop learning. In backpropagation, weights are adjusted in a neural network to reduce the total error of the network by moving backward through the network's layers:
w1 := w1 - α (learning rate) * dL/dw1, where L is the loss function.
Best bet: use ReLUs.
TFDV - TensorFlow Data Validation - Facets
can analyze training and serving data to: a) compute descriptive statistics, b) infer a schema, c) detect data anomalies. The core API supports each piece of functionality, with convenience methods that build on top and can be called in the context of notebooks.
Kubeflow
Can be used out of the box to operationalize an XGBoost model. This orchestrator can be used when pushing to production, where you want to automate pipeline execution using orchestrators.
Simple Logistic Regression Model: KERAS
def model_fit(ds):
    """Create and fit a Keras logistic regression model."""
    # Build the Keras model
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(1, input_shape=(2,), activation='sigmoid'))
    model.compile(optimizer='sgd',
                  loss='mean_squared_error',
                  metrics=['accuracy'])

    # Fit the model on the given dataset
    model.fit(ds, epochs=5, shuffle=False)
    return model
Dataframe Split TRAIN/EVAL/TEST(SERVING)
def prepare_data_splits_from_dataframe(df):
    # 70% of records for generating the training set
    train_len = int(len(df) * 0.7)

    # Remaining 30% of records for the evaluation and serving sets
    eval_serv_len = len(df) - train_len

    # 15% for the evaluation set
    eval_len = eval_serv_len // 2

    # Remaining 15% of total records for generating the serving set
    serv_len = eval_serv_len - eval_len

    # Split the dataframe into the three subsets
    train_df = df.iloc[:train_len].reset_index(drop=True)
    eval_df = df.iloc[train_len: train_len + eval_len].reset_index(drop=True)
    serving_df = df.iloc[train_len + eval_len: train_len + eval_len + serv_len].reset_index(drop=True)

    # Serving data emulates the data that would be submitted for predictions,
    # so it should not have the label column (target column = 'readmitted')
    serving_df = serving_df.drop(['readmitted'], axis=1)

    return train_df, eval_df, serving_df
PyArrow CSV -> Arrow Table
import pandas as pd
import pyarrow as pa
import pyarrow.csv
from sklearn.preprocessing import StandardScaler

def read_and_process(filename):
    opts = pyarrow.csv.ReadOptions(use_threads=True, block_size=4096)
    table = pyarrow.csv.read_csv(filename, opts)

    # Fit the feature transform
    df = table.to_pandas()
    scaler = StandardScaler().fit(df[['x0', 'x1']])

    # Iterate over batches in the pyarrow.Table and apply processing
    for batch in table.to_batches():
        df = batch.to_pandas()

        # Process the batch and apply the feature transform
        X_scaled = scaler.transform(df[['x0', 'x1']])
        df_scaled = pd.DataFrame({'label': df['label'],
                                  'x0': X_scaled[:, 0],
                                  'x1': X_scaled[:, 1]})
        batch_scaled = pa.RecordBatch.from_pandas(df_scaled, preserve_index=False)

        yield batch_scaled
Apache Arrow
enables the means for high-performance data exchange with TensorFlow that is both standardized and optimized for analytics and machine learning. Arrow datasets from TensorFlow I/O provide a way to bring Arrow data directly into TensorFlow tf.data that will work with existing input pipelines and tf.data.Dataset APIs.
TFDV StatsOptions: Removing Irrelevant Features
feature_allowlist: Optional[Union[List[types.FeatureName], List[types.FeaturePath]]] = None

EXAMPLE
# Define features to remove
features_to_remove = {'encounter_id', 'patient_nbr'}

# Collect features to include while computing the statistics
approved_cols = [col for col in df.columns if (col not in features_to_remove)]

# Instantiate a StatsOptions class and define the feature_allowlist property
stats_options = tfdv.StatsOptions(feature_allowlist=approved_cols)

# Duplicate schema options for use in TEST/serving data set statistics
options = tfdv.StatsOptions(schema=schema,
                            infer_type_from_schema=True,
                            feature_allowlist=approved_cols)
TFRecord
Format for storing data that can be written to or read from. Used to load record-oriented binary format data.
ArrowFeatherDataset
The ArrowFeatherDataset can load a set of files in Arrow Feather format. Feather is a light-weight file format that provides a simple and efficient way to write Pandas DataFrames to disk. It is currently limited to primitive scalar data, but after Arrow 1.0.0 is released, it is planned to have full support for Arrow data and also interop with R DataFrames. This dataset will be ideal if your workload processes many DataFrames and writing to disk is desired. The Arrow Feather readers/writers are designed to maximize performance when loading/saving Arrow record batches. However, if your files are intended for long-term storage, other columnar formats, such as Apache Parquet, might be better suited.
TensorFlow Extended (TFX)
https://arxiv.org/abs/2010.02013
A TFX pipeline is a sequence of components that implement an ML pipeline, specifically designed for scalable, high-performance machine learning tasks: a sequence of scalable components for deploying to production.

TFX PROCESS: LIBRARY
Data Ingestion: DATA INGESTION
Data Validation: TENSORFLOW DATA VALIDATION
Feature Engineering: TENSORFLOW TRANSFORM
Train Model: ESTIMATOR or KERAS MODEL
Validate Model: TENSORFLOW MODEL ANALYSIS
Push if Good: VALIDATION OUTCOME
Serve Model: TENSORFLOW SERVING

TFX PROCESS: TFX DAG COMPONENT
Data Ingestion: ExampleGen
Data Validation: StatisticsGen, SchemaGen, ExampleValidator
Feature Engineering: Transform
Train Model: Tuner, Trainer
Validate Model: Evaluator
Push if Good: InfraValidator, Pusher
Serve Model: Model Server, Bulk Inference
Tensorflow Data Validation (TFDV)
https://blog.tensorflow.org/2018/09/introducing-tensorflow-data-validation.html
Understand, validate, and monitor ML data at scale; it can be used to validate petabytes of data at Google daily. TFX users rely on it to maintain the health of ML platforms. It provides browser visualizations for data statistics, infers schemas, validates against schemas, and detects training vs. prediction skews [schema skew, feature skew, distribution skew].
Skew detection on categorical features uses the Chebyshev distance (https://en.wikipedia.org/wiki/Chebyshev_distance):
D(x, y) = max_i |x_i - y_i|   (maximum absolute difference)
A threshold on D is set as the trigger.
TFDV StatsOptions
import tensorflow_data_validation as tfdv

# Takes in feature slices and defines options
stats_options = tfdv.StatsOptions(slice_functions=[slice_fn1, slice_fn2, slice_fn3])
TFDV Infer a data schema
Infer the data schema from only the training dataset, using the statistics of your data as input: tfdv.infer_schema takes a DatasetFeatureStatisticsList (e.g. train_stats) and returns a Schema protocol buffer.

# Infer the data schema by using the training statistics that you generated
schema = tfdv.infer_schema(statistics=train_stats)

# Display the data schema
tfdv.display_schema(schema=schema)

# Serving statistics will be missing the label feature info in the schema
serving_stats = tfdv.generate_statistics_from_dataframe(dataframe=serving_df, stats_options=options)

# Slight schema variations can be expressed by using environments.
# In particular, features in the schema can be associated with a set of
# environments using default_environment, in_environment and not_in_environment.
# All features are by default in both TRAINING and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')
tfdv.get_feature(schema, 'LabelFeature').not_in_environment.append('SERVING')

# Validate the serving statistics against the inferred schema and
# the SERVING environment parameter
serving_anomalies_with_env = tfdv.validate_statistics(statistics=serving_stats,
                                                      schema=schema,
                                                      environment='SERVING')
tf.data.BatchDataSet
is a Python class derived from tf.data.Dataset; it is the object returned when you call .batch() on a dataset
tf.data
is an API used to help build data Pipelines
Shape
is the length in each dimension
"momentum" Compile Optimizer
Accumulates a "velocity" from past gradients, speeding up updates in directions where gradients are consistent and damping updates when gradients are small or keep changing sign. The formula for a momentum optimizer is:
v = β * v - learning_rate * gradient
parameters = parameters + v
Explanation:
v: represents the "velocity", which accumulates momentum from past gradients.
β (beta): a hyperparameter between 0 and 1, controlling how much influence past gradients have on the current update.
learning_rate: the step size for updating the parameters.
gradient: the calculated gradient at the current iteration.
How it works:
1) Calculate the gradient: compute the gradient of the loss function with respect to the model parameters.
2) Update velocity: multiply the previous velocity (v) by beta, then subtract the current gradient scaled by the learning rate.
3) Update parameters: add the updated velocity (v) to the current parameters.
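In Keras, momentum is an argument on SGD (the learning rate and beta value here are assumptions):
import tensorflow as tf
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # momentum plays the role of beta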
Wide Neural Networks
Represent a network with fewer hidden layers (usually 1-2) but more neurons per layer. Useful when you have less data and your problem isn't too complex. A single hidden layer with a lot of neurons can detect simple patterns (simple classification and regression problems) in the dataset, but will fail when you start expecting it to detect complex relations (image detection, speech recognition, etc.).
Deep Neural Networks
Represent a network with more hidden layers (more than 1-2). These types of networks can be useful for complex datasets and problems where finding high accuracy with small models is difficult. They are typically used in computer vision and NLP problems.
tf.reshape(x, ...)
Reshaping occurs row by row, so if x ~ [[3,5,7],[4,6,8]], then tf.reshape(x, [3,2]) => [[3,5],[7,4],[6,8]]
Dense Tensor
store values in a contiguous sequential block of memory where all values are represented
tf.keras.layers.DenseFeatures
takes an array of features and uses them to generate a Layer that can be used in a Model. The output of that layer is a dense Tensor
file_io
from tensorflow.python.lib.io import file_io

def _open_file_read_binary(uri):
    try:
        return file_io.FileIO(uri, mode='rb')
    except errors.InvalidArgumentError:
        return file_io.FileIO(uri, mode='r')

with _open_file_read_binary(uri) as f:
    image_bytes = f.read()

def _write_vocab(destination, vocab_list):
    # Write the top words to destination (line by line)
    with file_io.FileIO(destination, 'w+') as f:
        for word in vocab_list:
            f.write(u'{} {}\n'.format(word[0], word[1]))
    # Create a rev_vocab dictionary that returns the index of each word
    return dict([(word, i) for (i, (word, word_count)) in enumerate(vocab_list)])
Estimator
tf.estimator—a high-level TensorFlow API. Estimators encapsulate the following actions: Training Evaluation Prediction Export for serving Estimators use a system called feature columns to describe how the model should interpret each of the raw input features. An Estimator expects a vector of numeric inputs, and feature columns describe how the model should convert each feature. With Estimators, Tensorflow provides Pre-made Estimators, which are models which you can use straight away, simply by plugging in the hyperparameters.
Theory Loss and Gradient Smoothening
The loss landscape travelled by gradient descent is assumed to be extremely bumpy, with random hills and valleys. The shape of the loss curve as the weights change relative to each other is called the loss landscape. It is assumed that batch normalization smoothens out the steps taken in this landscape.
Theory Internal Covariate Shift
the model is fed data with a very different distribution than what it was previously trained with — even though that new data still conforms to the same target function. For the model to figure out how to adapt to this new data, it has to re-learn some of its target output function. This slows down the training process. Had we provided the model with a representative distribution that covered the full range of values from the beginning, it would have been able to learn the target output sooner. each layer ends up trying to learn from a constantly shifting input, thus taking longer to converge and slowing down the training.
TFDV Generate Training Statistics then Visualize them
train_stats = tfdv.generate_statistics_from_dataframe(dataframe=train_df,
                                                      stats_options=stats_options)

# get the number of features used to compute statistics
print(f"Number of features used: {len(train_stats.datasets[0].features)}")

# check the number of examples used
print(f"Number of examples used: {train_stats.datasets[0].num_examples}")

# check the column names of the first and last feature
print(f"First feature: {train_stats.datasets[0].features[0].path.step[0]}")
print(f"Last feature: {train_stats.datasets[0].features[-1].path.step[0]}")

tfdv.visualize_statistics(train_stats)
W * X
w = tf.Variable([[1.], [2.]])
x = tf.constant([[3., 4.]])
tf.matmul(w, x)   # (2,1) x (1,2) -> (2,2)
