Web Application & Software Architecture 101
What Is A Database?
A database is a component required to persist data. Data can take many forms: structured, unstructured, semi-structured & user state data.
AJAX
AJAX stands for Asynchronous JavaScript & XML. As the name suggests, it is used for adding asynchronous behaviour to the web page: the page can fetch data from the server in the background & update itself without a full reload.
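A minimal sketch of an Ajax-style request using the browser's standard fetch API; the /api/messages endpoint and the #messages element are hypothetical, made up for illustration:

```typescript
// Fetch data in the background without reloading the page.
// The endpoint "/api/messages" is a placeholder for this example.
async function loadMessages(): Promise<void> {
  const response = await fetch("/api/messages");
  if (!response.ok) throw new Error(`Request failed: ${response.status}`);
  const messages = await response.json();
  // Update only the part of the page that changed.
  document.querySelector("#messages")!.textContent = JSON.stringify(messages);
}

loadMessages().catch(console.error);
```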
What Is CAP Theorem?
The CAP theorem states that in case of a network partition, when some of the nodes of the system can't communicate with the rest, we have to make a choice between availability & consistency.
What Is A Document Oriented Database?
Document-oriented databases are a leading type of NoSQL database. They store data in a document-oriented model, in independent documents. The data is generally semi-structured & stored in a JSON-like format. Some of the popular document-oriented stores used in the industry are MongoDB, CouchDB, OrientDB, Google Cloud Datastore & Amazon DocumentDB.
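For illustration, here is what semi-structured documents in the same collection might look like; the field names are made up for the example:

```typescript
// Two documents in the same collection need not share a schema;
// the second document has fields the first one lacks.
const userDocs = [
  { _id: "u1", name: "Alice", email: "alice@example.com" },
  {
    _id: "u2",
    name: "Bob",
    addresses: [{ city: "Pune", zip: "411001" }],
    preferences: { theme: "dark" },
  },
];
```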
End Systems
End systems are devices that are connected to the Internet. They include:
• Desktop computers
• Servers
• Mobile devices
• IoT devices
So, an end system can be anything from a rack server to an Internet-enabled toaster. These devices are often also called edge systems in networking jargon because they are technically situated at the 'edge' of the Internet, since they don't relay data from one device to another.
Horizontal Scaling
Horizontal scaling, also known as scaling out, means adding more hardware to the existing hardware resource pool. This increases the computational power of the system as a whole. The increased traffic influx can now be dealt with using the increased computational capacity, & there is practically no limit to how far we can scale horizontally, assuming we have infinite resources. We can keep adding server after server, setting up data centre after data centre.
Three Tier Application
In a three-tier application, the user interface, application logic & the database all lie on different machines & thus form three separate tiers. They are physically separated.
What Is A Key Value Database?
Key-value databases are also part of the NoSQL family. These databases use a simple key-value method to store and quickly fetch data with minimum latency. A primary use case of a key-value database is implementing caching in applications, due to the minimum latency they ensure. The key serves as a unique identifier and has a value associated with it. The value can be as simple as a block of text & as complex as an object graph. The data in key-value databases can be fetched in constant time O(1); there is no query language required to fetch the data. It's just a simple, no-brainer fetch operation, which ensures minimum latency. Some of the popular key-value data stores used in the industry are Redis, Hazelcast, Riak, Voldemort & Memcached.
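A minimal in-memory sketch of the key-value idea using a TypeScript Map; real stores like Redis add persistence, eviction & networking on top of this, and the key naming scheme here is just an example:

```typescript
// Constant-time put/get keyed by a unique string identifier.
const store = new Map<string, unknown>();

store.set("user:42:profile", { name: "Alice", accountNo: "ACC-42" });

// O(1) fetch by key, no query language involved.
const profile = store.get("user:42:profile");
console.log(profile);
```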
Latency
Latency is the amount of time a system takes to respond to a user request. Let's say you send a request to an app to fetch an image & the system takes 2 seconds to respond to your request. The latency of the system is 2 seconds. There are typically two types: network latency & application latency.
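A quick way to observe latency from the client side is to time a request with performance.now(); the image URL here is hypothetical, and note this measures the combined network + application latency:

```typescript
// Measure the round-trip time of a single request in milliseconds.
async function measureLatency(url: string): Promise<number> {
  const start = performance.now();
  await fetch(url);
  return performance.now() - start;
}

measureLatency("/images/photo.jpg").then((ms) =>
  console.log(`Latency: ${ms.toFixed(0)} ms`),
);
```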
Load Balancing
Load balancing is vital in enabling our service to scale well with the increase in traffic load, as well as stay highly available. Load balancing is facilitated by load balancers, which makes them a key component in web application architecture. Load balancers distribute heavy traffic load across the servers running in the cluster, based on several different algorithms. This averts the risk of all the traffic on the service converging on a single or just a few machines in the cluster. If the entire traffic were converged on only a few machines, it would not only overload them, increasing the latency of the application & killing its performance, but would also eventually bring them down.
What Is Polyglot Persistence?
Polyglot persistence means using several different persistence technologies to fulfil different persistence requirements in an application.
Client-Server Architecture
The architecture works on a request-response model. The client sends a request to the server for information & the server responds with it. Every website you browse, be it a WordPress blog or a web application like Facebook, Twitter or your banking app, is built on the client-server architecture.
What Is A Relational Database?
This is the most common & widely used type of database in the industry. A relational database saves data containing relationships: one-to-one, one-to-many, many-to-many, many-to-one & so on. It has a relational data model. SQL is the primary data query language used to interact with relational databases.
HTTP Push
Time To Live (TTL)
In regular client-server communication, which is HTTP PULL, there is a Time to Live (TTL) for every request. It could be 30 to 60 seconds; it varies from browser to browser. If the client doesn't receive a response from the server within the TTL, the browser kills the connection & the client has to re-send the request, hoping it receives the data from the server before the TTL ends this time. Open connections consume resources & there is a limit to the number of open connections a server can handle at one point in time. If the connections don't close & new ones keep being introduced, over time the server will run out of memory. Hence, the TTL is used in client-server communication. But what if we are certain that the response will take more time than the TTL set by the browser?

Persistent Connection
In this case, we need a persistent connection between the client and the server. A persistent connection is a network connection between the client & the server that remains open for further requests & responses, as opposed to being closed after a single communication.
Vertical Scaling
Vertical scaling means adding more power to your server. Let's say your app is hosted by a server with 16 gigs of RAM. To handle the increased load, you increase the RAM to 32 gigs. You have vertically scaled the server. Ideally, when the traffic starts to build up on your app, the first step should be to scale vertically. Vertical scaling is also called scaling up. In this type of scaling, we increase the power of the hardware running the app. This is the simplest way to scale, since it doesn't require any code refactoring or any complex configuration. I'll discuss further down the lesson why code refactoring is required when we horizontally scale the app. But there is only so much we can do when scaling vertically. There is a limit to the capacity we can augment for a single server.
Web Hooks
WebHooks are more like call-backs. It's like saying, I will call you when new information is available; you carry on with your work. WebHooks enable communication between two services without a middleware. They use an event-based mechanism. To use WebHooks, consumers register an HTTP endpoint with the service, along with a unique API key. It's like a phone number: call me on this number when an event occurs; I won't keep calling you to ask. Whenever new information is available on the backend, the server fires an HTTP event to all the registered endpoints of the consumers, notifying them of the new update.
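A minimal sketch of a webhook consumer in Node.js; the /webhook path and the port are assumptions made up for this example, and a real consumer would also verify the provider's signature or API key:

```typescript
import http from "node:http";

// The consumer registers http://<host>:3000/webhook with the provider,
// which then POSTs an event here whenever new information is available.
http
  .createServer((req, res) => {
    if (req.method === "POST" && req.url === "/webhook") {
      let body = "";
      req.on("data", (chunk) => (body += chunk));
      req.on("end", () => {
        console.log("Event received:", body);
        res.writeHead(200).end("ok");
      });
    } else {
      res.writeHead(404).end();
    }
  })
  .listen(3000);
```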
What Is A Wide Column Database?
Wide-column databases belong to the NoSQL family of databases, primarily used to handle massive amounts of data, technically called Big Data. Wide-column databases are perfect for analytical use cases. They have high performance and a scalable architecture. Also known as column-oriented databases, wide-column databases store data in records with a dynamic number of columns. A record can hold billions of columns. Some of the popular wide-column databases are Cassandra, HBase, Google BigTable, ScyllaDB etc.
Network Latency
Network latency is the amount of time the network takes to send a data packet from point A to point B. The network should be efficient enough to handle the increased traffic load on the website. To cut down the network latency, businesses use CDNs & try to deploy their servers across the globe, as close to the end-user as possible.
Digital Subscriber Line (DSL)
A Digital Subscriber Line or DSL uses the existing groundwork of telephone lines for an Internet connection. DSL connections are generally provided by the same company that provides local wired phone access. A device on the home user's end called a DSL modem modulates the digital signals that a computer outputs into high-frequency analog audio signals that are out of the human voice and hearing range. The telephone wire's frequency spectrum is divided into 3 parts:
• A downstream channel (used to receive data), in the 50 kHz to 1 MHz frequency range or 'band'
• An upstream channel (used to send data), which takes up the 4 kHz to 50 kHz band
• A regular channel used for telephone conversations, taking up the 0 to 4 kHz range
High Availability Clustering
A High Availability cluster, also known as a fail-over cluster, contains a set of nodes running in conjunction with each other to ensure high availability of the service. The nodes in the cluster are connected by a private network called the heartbeat network, which continuously monitors the health and status of each node in the cluster. A single state across all the nodes in a cluster is achieved with the help of shared distributed memory & a distributed co-ordination service like ZooKeeper. To ensure availability, HA clusters use several techniques such as disk mirroring/RAID (Redundant Array of Independent Disks), redundant network connections, redundant electrical power etc. The network connections are made redundant so that if the primary network goes down, the backup network takes over.
REST API
A REST API is an API implementation that adheres to the REST architectural constraints. It acts as an interface. The communication between the client & the server happens over HTTP. A REST API takes advantage of the HTTP methodologies to establish communication between the client and the server. REST also enables servers to cache responses, which improves the performance of the application.
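A sketch of a client talking to a hypothetical REST API, mapping actions onto HTTP methods; the https://api.example.com/users endpoints and payloads are made up for illustration:

```typescript
// All endpoints below are hypothetical.
const base = "https://api.example.com";

async function demo(): Promise<void> {
  // GET reads a resource.
  const user = await fetch(`${base}/users/42`).then((r) => r.json());
  console.log(user);

  // POST creates a resource.
  await fetch(`${base}/users`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ name: "Alice" }),
  });
}

demo().catch(console.error);
```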
Two Tier Application
A two-tier application involves a client and a server. The client contains the user interface & the business logic on one machine, while the backend server is the database running on a different machine. The database server is hosted by the business, which retains control over it.
Single Tier Application
A single-tier application is an application where the user interface, backend business logic & the database all reside on the same machine. PROS: The main upside of single-tier applications is that they have no network latency, since every component is located on the same machine. This adds to the performance of the software. CONS: One big downside of a single-tier app is that the business has no control over the application. Once the software is shipped, no code or feature changes can be made until the customer manually updates the app by connecting to the remote server or by downloading & installing a patch. The code in single-tier applications is also vulnerable to being tweaked & reverse engineered. The security, for the business, is minimal.
Access Networks
Access networks, also known as the last mile, consist of the media through which end systems connect to the Internet. In other words, access networks are networks that physically connect end systems to the first router (also known as the "edge router") on a path which connects them to some other end systems.
Monolithic Architecture
An application has a monolithic architecture if it contains the entire application code in a single codebase. A monolithic application is a self-contained, single-tiered software application, unlike the microservices architecture, where different modules are responsible for running respective tasks and features of an app.
PROS:
• Simplicity
CONS:
• Continuous Deployment
• Regression Testing
• Single Points Of Failure
• Scalability Issues
• Cannot Leverage Heterogeneous Technologies
• Not Cloud-Ready, Holds State
Application Latency
Application Latency is the amount of time the application takes to process a user request. There are more than a few ways to cut down the application latency. The first step is to run stress & load tests on the application & scan for the bottlenecks that slow down the system as a whole. I've talked more about it in the upcoming lesson.
What Are Multi-Model Databases?
With the advent of multi-model databases, we have the ability to use different data models in a single database system. Multi-model databases support multiple data models like graph, document-oriented, relational etc., as opposed to supporting only one data model. They also avert the need for managing multiple persistence technologies in a single service. They reduce the operational complexity by notches. With multi-model databases, we can leverage different data models via a single API. Some of the popular multi-model databases are ArangoDB, Cosmos DB, OrientDB, Couchbase etc.
Caching Strategies
Cache Aside
This is the most common caching strategy. In this approach, the cache works along with the database, trying to reduce the hits on it as much as possible. The data is lazy-loaded in the cache. When the user sends a request for particular data, the system first looks for it in the cache. If present, it is simply returned from the cache. If not, the data is fetched from the database, the cache is updated & the data is returned to the user (see the sketch after this section). This strategy works best with read-heavy workloads: data that is not frequently updated, for instance, user profile data in a portal, such as the user's name, account number etc.

Read-Through
This strategy is pretty similar to the cache-aside strategy, with a subtle difference: in the read-through strategy, the cache always stays consistent with the database. The cache library or framework takes the onus of maintaining consistency with the backend. The information in this strategy, too, is lazy-loaded in the cache, only when the user requests it. So, the first time information is requested, it results in a cache miss, and the backend has to update the cache while returning the response to the user. However, developers can always pre-load the cache with the information that is expected to be requested most by the users.

Write-Through
In this strategy, each & every piece of information written to the database goes through the cache. Before the data is written to the DB, the cache is updated with it. This maintains high consistency between the cache and the database, though it adds a little latency during write operations, as the data has to be additionally updated in the cache. This works well for write-heavy workloads like massively multiplayer online games. This strategy is generally used with other caching strategies to achieve optimized performance.

Write-Back
This strategy helps optimize costs significantly. In the write-back caching strategy, the data is directly written to the cache instead of the database, and the cache, after some delay as per the business logic, writes the data to the database. If there is a heavy number of writes in the application, developers can reduce the frequency of database writes to cut down the load & the associated costs. A risk in this approach is that if the cache fails before the DB is updated, the data might get lost. Again, this strategy is used with other caching strategies to make the most of them.
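A minimal cache-aside sketch; fetchFromDb is a stand-in for a real database call, and the in-memory Map stands in for a real cache such as Redis:

```typescript
const cache = new Map<string, string>();

// Placeholder for a real database read.
async function fetchFromDb(key: string): Promise<string> {
  return `value-for-${key}`;
}

// Cache-aside: look in the cache first; on a miss, lazy-load from the
// database, populate the cache, and return the value.
async function get(key: string): Promise<string> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // cache hit
  const value = await fetchFromDb(key); // cache miss
  cache.set(key, value);
  return value;
}
```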
Caching
Caching is key to the performance of any kind of application. It ensures low latency and high throughput. An application with caching will certainly do better than an application without caching, simply because it returns the response in less time. First up, it's almost always a good idea to use a cache as opposed to not using it. It can be used at any layer of the application & there are no ground rules as to where it can and cannot be applied. You might remember we talked about key-value data stores in the database lesson. They are primarily used to implement caching in web applications. They can also be used for cross-module communication in a microservices architecture, by saving the shared data which is commonly accessed by all the services; the cache acts as a backbone for microservice communication. Key-value data stores via caching are also widely used in in-memory data stream processing and running analytics.
Cookies
Cookies are unique string identifiers that can be stored on the client's browser. These identifiers are set by the server through HTTP headers when the client first navigates to the website. After the cookie is set, it's sent along with subsequent HTTP requests to the same server. This allows the server to know who is contacting it and hence serve content accordingly. So the HTTP request, the HTTP response, the cookie file on the client's browser, and a database of cookie-user values on the server's end are all involved in the process of setting and using cookies.

Set-Cookie Header
Let's look at how cookies work in a bit more detail. When a server wants to set a cookie on the client side, it includes the header Set-Cookie: <value> in the HTTP response. This value is then appended to a special cookie file stored in your browser. The cookie file contains:
• The website's domain
• The string value of the cookie
• The date that the cookie expires (yes, much like actual cookies, they do expire)
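A minimal sketch of a Node.js server setting and reading a cookie; the cookie name sessionId and its value are arbitrary examples:

```typescript
import http from "node:http";

http
  .createServer((req, res) => {
    // The browser echoes previously set cookies back in the Cookie header.
    const cookies = req.headers.cookie; // e.g. "sessionId=abc123"
    if (!cookies) {
      // First visit: set a cookie via the Set-Cookie response header.
      res.setHeader("Set-Cookie", "sessionId=abc123; Max-Age=3600; HttpOnly");
    }
    res.end(cookies ? `Welcome back: ${cookies}` : "Cookie set");
  })
  .listen(3000);
```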
DNS
Domain Name System, commonly known as DNS, is a system that averts the need to remember long IP addresses to visit a website, by mapping easy-to-remember domain names to IP addresses.

How Does the Domain Name System Work?
When a user types the URL of a website into their browser and hits enter, a DNS query is fired. There are four key components, i.e. groups of servers, that make up the DNS infrastructure:
• DNS Recursive Nameserver, aka DNS Resolver
• Root Nameserver
• Top-Level Domain Nameserver
• Authoritative Nameserver

So, when the user hits enter after typing the domain name into the browser, the browser sends a request to the DNS Recursive nameserver, also known as the DNS Resolver. The DNS Recursive nameserver is generally managed by our ISP (Internet Service Provider). The whole DNS system is a distributed system set up in large data centers managed by internet service providers. The role of the DNS Resolver is to receive the client request and forward it to the Root nameserver to get the address of the Top-Level Domain nameserver. Once the DNS Resolver forwards the request to the Root nameserver, the Root nameserver returns the address of the Top-Level Domain nameserver in response. Once the Top-Level Domain nameserver receives the request from the Resolver, it returns the address of the authoritative nameserver for the domain, say, amazon.com. The DNS Resolver then fires a query to the authoritative nameserver, which returns the IP address of the amazon.com website to the DNS Resolver. The DNS Resolver caches the data and forwards it to the client. Often all this DNS information is cached, so the DNS servers don't have to do so much rerouting every time a client requests the IP of a certain website. DNS information of websites we visit also gets cached in our local machines, that is, our browsing devices, with a TTL (Time To Live).
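To see resolution in action, Node.js can query the configured resolver directly; a minimal sketch, with all the multi-step resolution described above happening behind one call:

```typescript
import { resolve4 } from "node:dns/promises";

// Ask the resolver for the A records (IPv4 addresses) of a domain.
resolve4("amazon.com").then((addresses) => {
  console.log(addresses); // prints an array of IPv4 address strings
});
```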
Fault Tolerance
Fault tolerance is the ability of the system to stay up despite taking hits. A fault-tolerant system is equipped to handle faults. Being fault-tolerant is an essential element in designing life-critical systems. A few of the instances/nodes, out of the several running the service, go offline & bounce back all the time. In case of these internal failures, the system may work at a reduced level, but it will not go down entirely. A very basic example of a system being fault-tolerant is a social networking application. In the case of backend node failures, a few services of the app, such as image upload, post likes etc., may stop working. But the application as a whole will still be up. This approach is technically also known as fail soft.
Federated Architecture
Federated architecture is an extension of decentralized architecture. It powers social networks like Mastodon, Minds, Diaspora etc. The term federated, in a general sense, means a group of semi-autonomous entities exchanging information with each other. A real-world example would be the different states of a country which are managed by state governments. They are partially self-governing & exercise power to keep things running smoothly, and the state governments share information with each other & with a central government, forming a complete autonomous government. A federated network has entities called servers or pods. A large number of nodes subscribe to the pods. There are several pods in the network that are linked to each other & share information with each other. The pods can be hosted by individuals, as is ideally achieved in a decentralized network. As new pods are hosted & introduced to the network, the network keeps growing. If the link between a few pods breaks temporarily, the network is still up: nodes can still communicate with each other via the pods they are subscribed to. Pods facilitate node discovery. In a peer-to-peer network, there is no way of discovering other nodes & we would just sit in the dark if it weren't for a centralized node registry or something similar.
What Is A Graph Database?
Graph databases are also a part of the NoSQL database family. They store data in nodes/vertices and edges, in the form of relationships.

Why Use a Graph Database?
Primarily, for two reasons. The first is visualization. Think of that pinned board in thriller detective movies, where images are pinned to a board and connected via threads. It helps in visualizing how the entities are related & how things fit together. The second reason is low latency. In graph databases, relationships are stored a bit differently from how relational databases store relationships.
High Availability
High availability also known as HA is the ability of the system to stay online despite having failures at the infrastructural level in real-time.
Client-Side vs Server Side Rendering
How does a browser render a webpage? When a user requests a web page from the server & the browser receives the response, it has to render the response in the window in the form of an HTML page. For this, the browser has several components, such as the:
• Browser engine
• Rendering engine
• JavaScript interpreter
• Networking & the UI backend
• Data storage etc.

The rendering engine constructs the DOM tree, then renders & paints the construction. Naturally, all this activity takes a bit of time.

Server-Side Rendering
To avoid all this rendering time on the client, developers often render the UI on the server, generate the HTML there & directly send the HTML page to the UI. This technique is known as server-side rendering (a minimal sketch follows at the end of this section). It ensures faster rendering of the UI, averting the UI loading time in the browser window, since the page is already created & the browser doesn't have to do much assembling & rendering work.

Use Cases For Server-Side & Client-Side Rendering
The server-side rendering approach is perfect for delivering static content, such as WordPress blogs. It's also good for SEO, as the crawlers can easily read the generated content. However, modern websites are highly dependent on Ajax. In such websites, content for a particular module or section of a page has to be fetched & rendered on the fly. Therefore, server-side rendering doesn't help much: for every Ajax request, instead of sending just the required content to the client, the approach generates the entire page on the server. This consumes unnecessary bandwidth & fails to provide a smooth user experience. Another big downside is that once the number of concurrent users on the website rises, it puts an unnecessary load on the server. Client-side rendering works best for modern, dynamic, Ajax-based websites. We can also leverage a hybrid approach to get the most out of both techniques: server-side rendering for the home page & other static content, and client-side rendering for the dynamic pages.
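A minimal sketch of server-side rendering in Node.js, as described above: the HTML is assembled on the server and sent down ready to paint. The page content and port are made up for the example:

```typescript
import http from "node:http";

// Generate the full HTML on the server so the browser only has to paint it.
function renderPage(user: string): string {
  return `<!doctype html>
<html><body><h1>Hello, ${user}</h1></body></html>`;
}

http
  .createServer((_req, res) => {
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end(renderPage("Alice"));
  })
  .listen(3000);
```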
Microservices Architecture
In a microservices architecture, different features/tasks are split into separate respective modules/codebases which work in conjunction with each other, forming a large service as a whole. Every service ideally has a separate database, so there are no single points of failure & no system bottlenecks.
PROS:
• No Single Points Of Failure
• Leverage Heterogeneous Technologies
• Independent & Continuous Deployments
CONS:
• Complexities In Management
• No Strong Consistency
AJAX - Long Polling
Long polling lies somewhere between Ajax & Web Sockets. In this technique, instead of immediately returning the response, the server holds the response until it finds an update to be sent to the client. The connection in long polling stays open a bit longer in comparison to polling, and the server doesn't return an empty response. If the connection breaks, the client has to re-establish the connection to the server. The upside of using this technique is that far fewer requests are sent from the client to the server in comparison to the regular polling mechanism, which cuts down a lot of network bandwidth consumption. Long polling can be used in simple asynchronous data fetch use cases, when you do not want to poll the server every now & then.
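A sketch of a long-polling loop on the client; /api/updates is a hypothetical endpoint whose server holds each request open until it has data to send:

```typescript
// The server holds each request open until an update is available;
// the client then immediately re-establishes the connection.
async function longPoll(url: string): Promise<void> {
  while (true) {
    try {
      const res = await fetch(url);
      const update = await res.json();
      console.log("Update:", update);
    } catch {
      // Connection broke: back off briefly, then reconnect.
      await new Promise((r) => setTimeout(r, 1000));
    }
  }
}

longPoll("/api/updates");
```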
Message Queue
A message queue, as the name says, is a queue that routes messages from the source to the destination, or from the sender to the receiver. Since it is a queue, it follows the FIFO (First In, First Out) policy. Message queues facilitate asynchronous behaviour. We have already learned what asynchronous behaviour is in the AJAX lesson: it allows modules to communicate with each other in the background, without hindering their primary tasks. Message queues facilitate cross-module communication, which is key in service-oriented or microservices architectures, and allow communication in a heterogeneous environment. They also provide temporary storage for messages until they are processed & consumed by the consumer.
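A toy in-process sketch of the FIFO idea; real message queues such as RabbitMQ add durability, delivery guarantees & network transport on top of this:

```typescript
// Messages are consumed in the order they were produced (FIFO),
// and the queue buffers them until the consumer is ready.
class MessageQueue<T> {
  private messages: T[] = [];

  send(message: T): void {
    this.messages.push(message); // producer enqueues at the tail
  }

  receive(): T | undefined {
    return this.messages.shift(); // consumer dequeues from the head
  }
}

const queue = new MessageQueue<string>();
queue.send("first");
queue.send("second");
console.log(queue.receive()); // "first": first in, first out
```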
Publish-Subscribe Model
In the publish-subscribe model, multiple consumers receive the same message sent by a single or multiple producers.

Exchanges
To implement the pub-sub pattern, message queues have exchanges, which push the messages to the queues based on the exchange type and the rules that are set. Exchanges are just like telephone exchanges, which route messages from sender to receiver through the infrastructure based on certain logic. There are different types of exchanges available in message queues, some of which are direct, topic, headers & fanout. To have more insight into how these different exchange types work, this RabbitMQ article is a good read.
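A toy sketch of the pub-sub pattern, with a broker fanning one message out to every subscriber of a topic, akin to a fanout exchange; the topic name and handlers are made up for the example:

```typescript
type Handler = (message: string) => void;

// Every subscriber to a topic receives a copy of each published message,
// much like a fanout exchange pushing to all bound queues.
class Broker {
  private topics = new Map<string, Handler[]>();

  subscribe(topic: string, handler: Handler): void {
    const handlers = this.topics.get(topic) ?? [];
    handlers.push(handler);
    this.topics.set(topic, handlers);
  }

  publish(topic: string, message: string): void {
    for (const handler of this.topics.get(topic) ?? []) handler(message);
  }
}

const broker = new Broker();
broker.subscribe("news", (m) => console.log("consumer A:", m));
broker.subscribe("news", (m) => console.log("consumer B:", m));
broker.publish("news", "breaking story"); // both consumers receive it
```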
Web Sockets
Web Sockets provide a persistent, bi-directional communication channel between a client (e.g. a browser) and a backend service. In contrast with HTTP request-response connections, WebSockets can:
• transport any number of protocols
• provide message delivery without polling, avoiding racy, high-latency & bandwidth-intensive implementations
• establish TCP-style connections in a browser-compatible fashion, using HTTP during the initial setup

Messages over WebSockets can be provided in any protocol, removing the unnecessary overhead of HTTP requests and responses (including headers, cookies, and other artifacts).
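A minimal browser-side sketch using the standard WebSocket API; the wss://example.com/socket URL is a placeholder:

```typescript
// Open a persistent, bi-directional channel; either side can send at any time.
const socket = new WebSocket("wss://example.com/socket");

socket.addEventListener("open", () => {
  socket.send("hello from the client"); // client -> server
});

socket.addEventListener("message", (event) => {
  console.log("server pushed:", event.data); // server -> client, no polling
});
```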
Point to Point Model
Point-to-point communication is a pretty simple use case where the message from the producer is consumed by only one consumer. It's like a one-to-one relationship, whereas the publish-subscribe model is a one-to-many relationship. Speaking of messaging protocols, two protocols are popular when working with message queues: AMQP (Advanced Message Queuing Protocol) & STOMP (Simple/Streaming Text Oriented Messaging Protocol).

Technology Used To Implement the Messaging Protocols
Speaking of the queuing tech widely used in the industry: RabbitMQ, ActiveMQ, Apache Kafka etc.
Redundancy - Active-Passive HA Mode
Redundancy is duplicating components or instances & keeping them on standby to take over in case the active instances go down. It's the fail-safe, backup mechanism. The standby instances take over in case any of the active instances goes down. This approach is also known as Active-Passive HA mode: an initial set of nodes is active & a set of redundant nodes is passive, on standby. Active nodes get replaced by passive nodes in case of failures. There are systems like GPS, aircraft & communication satellites that have zero downtime. The availability of these systems is ensured by making the components redundant.
Replication - Active-Active HA Mode
Replication means having a number of similar nodes running the workload together. There are no standby or passive instances. When a single node or a few nodes go down, the remaining nodes bear the load of the service. Think of this as load balancing. This approach is also known as Active-Active High Availability mode: all the components of the system are active at any point in time.
Shared Nothing Architecture
Shared nothing architecture means eliminating all single points of failure. Every module has its own memory and its own disk. So, even if several modules in the system go down, the other modules stay online and unaffected. It also helps with scalability and performance.
Streaming Over HTTP
Streaming over HTTP is ideal for cases where we need to stream large data over HTTP by breaking it into smaller chunks. This is possible with HTML5 & the JavaScript Streams API. The technique is primarily used for streaming multimedia content, like large images, videos etc., over HTTP.
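A sketch of consuming a response chunk by chunk with the Streams API; the /video/stream URL is a placeholder:

```typescript
// Read the response body incrementally instead of buffering it whole.
async function streamChunks(url: string): Promise<void> {
  const response = await fetch(url);
  const reader = response.body!.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    console.log(`received chunk of ${value.length} bytes`);
  }
}

streamChunks("/video/stream");
```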
HTML5 Event Source API & Server Sent Events
The Server-Sent Events implementation takes a bit of a different approach: instead of the client polling for data, the server automatically pushes the data to the client whenever updates are available. The incoming messages from the server are treated as events. Via this approach, the servers can initiate data transmission towards the client once the client has established the connection with an initial request. This helps get rid of a huge number of blank request-response cycles, cutting down the bandwidth consumption by notches. To implement server-sent events, the backend language should support the technology, & on the UI, the HTML5 EventSource API is used to receive the incoming data from the backend. An important thing to note here is that once the client establishes a connection with the server, the data flow is in one direction only: from the server to the client. SSE is ideal for scenarios such as a real-time feed like that of Twitter, displaying stock quotes on the UI, real-time notifications etc.
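A minimal client-side sketch using the EventSource API; the /api/stock-feed endpoint is hypothetical:

```typescript
// After the initial request, the server pushes events one way: server -> client.
const source = new EventSource("/api/stock-feed");

source.onmessage = (event) => {
  console.log("server-sent event:", event.data); // e.g. a stock quote update
};

source.onerror = () => {
  console.log("connection lost; the browser will retry automatically");
};
```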
The Network Edge
The network edge is simply the collection of end-systems that we use every day: smartphones, laptops, tablets, etc. However, note that devices that relay messages (such as routers) are not part of the edge of the Internet. Note that the two networks shown could be connected through any number of intermediate networks such as those for their Internet Service Providers. Since the actual path doesn't matter, we obfuscate the interconnectivity by using the cloud symbol.
Network Interface Adapter
The network interface adapter enables a computer to attach to a network. Since there are so many different types of networks, network adapters are used so that the user can install one to suit the network to which they want to attach. Network interfaces usually have an address associated with them, and one machine may have multiple such interfaces. These interfaces are essentially the physical gateways that connect devices to the Internet. Most machines also have external ports which network cables can be plugged into. The type of access network depends on the physical media involved. Here are some common access networks:
• Digital Subscriber Line (DSL)
• Cable Internet
• Fiber To The Home (FTTH)
• Dial-Up
• Satellite
• WiFi
Load Balancing Methods
There are primarily three modes of load balancing:
1. DNS load balancing
2. Hardware-based load balancing
3. Software-based load balancing

Hardware Load Balancers
Hardware load balancers are highly performant physical hardware. They sit in front of the application servers and distribute the load based on the number of existing open connections to a server, compute utilization and several other parameters. When using hardware load balancers, we may also have to overprovision them to deal with peak traffic, which is not the case with software load balancers. Hardware load balancers are primarily picked because of their top-notch performance.

Software Load Balancers
Software load balancers can be installed on commodity hardware and VMs. They are more cost-effective and offer more flexibility to developers. Software load balancers can be upgraded and provisioned easily in comparison to hardware load balancers. You will also find several LBaaS (Load Balancer as a Service) offerings online that enable you to plug a load balancer into your application without having to do any setup yourself. Software load balancers are pretty advanced when compared to DNS load balancing, as they consider many parameters, such as the content the servers host, cookies, HTTP headers, CPU & memory utilization, load on the network & so on, to route traffic across the servers. They also continually perform health checks on the servers to keep an updated list of in-service machines.
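The routing algorithms themselves can be simple; a toy sketch of round robin, which cycles requests across the servers in the cluster (the server addresses are made up):

```typescript
// Distribute incoming requests evenly by cycling through the server list.
class RoundRobinBalancer {
  private next = 0;

  constructor(private servers: string[]) {}

  pick(): string {
    const server = this.servers[this.next];
    this.next = (this.next + 1) % this.servers.length;
    return server;
  }
}

const lb = new RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"]);
console.log(lb.pick()); // 10.0.0.1
console.log(lb.pick()); // 10.0.0.2
console.log(lb.pick()); // 10.0.0.3, then back to 10.0.0.1
```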
What Is A Time Series Database?
Time-series databases are optimized for tracking & persisting time-series data: data points associated with the occurrence of events, recorded with respect to time. These data points are tracked, monitored and then finally aggregated based on certain business logic. Time-series data is generally ingested from IoT devices, self-driving vehicles, industry sensors, social networks, stock market financial data etc. Some of the popular time-series databases used in the industry are InfluxDB, TimescaleDB, Prometheus etc.
Data Ingestion
Data ingestion is a collective term for the process of collecting data streaming in from several different sources and making it ready to be processed by the system.

Data Standardization
The data which streams in from several different sources is not in a homogeneous, structured format. So, in order to make the data uniform and fit for processing, it first has to be collected and converted into a standardized format, to avoid any future processing issues. This process of data standardization occurs in the data collection and preparation layer.

Data Processing
Once the data is transformed into a standard format, it is routed to the data processing layer, where it is further processed based on the business requirements. It is generally classified into different flows and routed to different destinations.

Data Analysis
After being routed, analytics is run on the data, which includes the execution of different analytics models such as predictive modelling, statistical analytics, text analytics etc. All the analytical events occur in the data analytics layer.

Data Visualization
Once the analytics are run & we have valuable intel from them, all the information is routed to the data visualization layer, to be presented before the stakeholders, generally in a web-based dashboard.

Data Storage & Security
Moving data is highly vulnerable to security breaches. The data security layer ensures the secure movement of data all along. The data storage layer, as the name implies, is instrumental in persisting the data.
Data Pipelines
Data pipelines are the core component of a data processing infrastructure. They facilitate the efficient flow of data from one point to another & also enable developers to apply filters on the data streaming in, in real-time.

Features Of Data Pipelines
Speaking of some more features of data pipelines, they:
• ensure a smooth flow of data
• enable the business to apply filters and business logic on streaming data
• avert any bottlenecks & redundancy in the data flow
• facilitate parallel processing of data
• prevent data from being corrupted

What Is ETL?
If you haven't heard of ETL before, it means Extract, Transform, Load. Extract means fetching data from single or multiple data sources. Transform means transforming the extracted heterogeneous data into a standardized format, based on the rules set by the business. Load means moving the transformed data to a data warehouse or another data storage location, for further processing. The ETL flow is the same as the data ingestion flow; it's just that the entire movement of data is done in batches, as opposed to streaming it through the data pipelines in real-time. At the same time, it doesn't mean the batch processing approach is obsolete. Both real-time & batch data processing techniques are leveraged based on the project requirements. You'll gain more insight into this when we go through the Lambda & Kappa architectures of distributed data processing in the upcoming lessons.
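A toy end-to-end sketch of the ETL steps; the record shapes, the transformation rule and the loadToWarehouse stub are made up for illustration:

```typescript
// Extract: fetch raw records from heterogeneous sources (stubbed here).
function extract(): Array<Record<string, string>> {
  return [{ NAME: "Alice", TEMP_F: "98.6" }, { name: "bob", temp_f: "99.1" }];
}

// Transform: normalize the heterogeneous records into one standard format.
function transform(rows: Array<Record<string, string>>) {
  return rows.map((row) => {
    // Tolerate differing key casings across sources.
    const get = (k: string) => row[k] ?? row[k.toUpperCase()] ?? "";
    return { name: get("name").toLowerCase(), tempF: Number(get("temp_f")) };
  });
}

// Load: move the standardized batch to the warehouse (stubbed here).
function loadToWarehouse(batch: Array<{ name: string; tempF: number }>): void {
  console.log(`loaded ${batch.length} rows`, batch);
}

loadToWarehouse(transform(extract()));
```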