Batch and Stream Processing Flow, Cheat Sheet of Data Warehousing

This cheat sheet covers the difference between batch and stream processing flow and the importance of monitoring and detecting data anomalies and errors. It also discusses OLTP and OLAP, the process of moving data from source to target using a data pipeline, message systems and their types, and a use case scenario.

Typology: Cheat Sheet

2015/2016

Available from 11/21/2023

maydore

BATCH AND STREAM PROCESS
Batch and stream processing flow: windows can be sliding, fixed, or session (choose the one
that fits the use case)
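
The three window types can be sketched in plain Python. This is only an illustration, not the API of any streaming engine: the function names, the `(timestamp, value)` event shape, and the integer timestamps are assumptions made for the example.

```python
from collections import defaultdict


def fixed_windows(events, size):
    """Fixed window: every event falls into exactly one window of length `size`."""
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // size) * size
        windows[(start, start + size)].append(value)
    return dict(windows)


def sliding_windows(events, size, slide):
    """Sliding window: windows of length `size` start every `slide` units,
    so one event can belong to several overlapping windows.
    Assumption: windows never start before timestamp 0."""
    windows = defaultdict(list)
    for ts, value in events:
        # Earliest window start (a multiple of `slide`) that still contains ts.
        start = max(((ts - size) // slide + 1) * slide, 0)
        while start <= ts:
            windows[(start, start + size)].append(value)
            start += slide
    return dict(windows)


def session_windows(events, gap):
    """Session window: a session stays open until no event arrives for more than `gap`."""
    sessions = []
    for ts, value in sorted(events, key=lambda e: e[0]):
        if sessions and ts - sessions[-1]["end"] <= gap:
            sessions[-1]["end"] = ts
            sessions[-1]["values"].append(value)
        else:
            sessions.append({"start": ts, "end": ts, "values": [value]})
    return sessions
```

Fixed windows suit periodic reports, sliding windows suit moving aggregates, and session windows suit user-activity bursts, which is why the choice depends on the use case.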
Monitoring: watch the running application for problems and deal with them when they
occur. Detect data anomalies and errors
Streaming lets us process and analyze data as soon as it arrives (and make decisions on it)
Time-critical use cases require a streaming data pipeline
Non-analytical data = OLTP [when the data volume is heavy, large companies avoid running
analysis on OLTP because it would impact the performance of the application]
Analytical data = OLAP
So data from OLTP is moved promptly to OLAP, so that (for example) 1 million users are not
disturbed by a slow application
Streaming is expensive because the service has to run 24 hours a day. Batch is different: a
monthly batch, for example, runs the system only once a month
If the streaming load spikes, e.g. during an e-commerce promo, you need to increase the
pipeline specifications
Data pipeline = data flow. The basic task of a data pipeline is moving data from source to
target
There is also a processing step inside the data pipeline
In batch, the data is pulled by a script that we write (we fetch it from the source; the source
does not send it to us)
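
A minimal sketch of such a pull-based batch job, using Python's built-in `sqlite3` as a stand-in for both databases. The table names `transactions` and `daily_sales`, the columns, and the sample rows are all hypothetical; a real pipeline would connect to separate OLTP and OLAP systems.

```python
import sqlite3


def make_demo_source():
    """Build a hypothetical in-memory OLTP database with a few transactions."""
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE transactions (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
    )
    source.executemany(
        "INSERT INTO transactions (order_date, amount) VALUES (?, ?)",
        [("2023-11-01", 10.0), ("2023-11-01", 5.5), ("2023-11-02", 7.0)],
    )
    source.commit()
    return source


def run_batch(source, target):
    """One batch run: pull from the source, process, and load into the target."""
    # Extract + process: our script queries the source (pull, not push).
    rows = source.execute(
        "SELECT order_date, SUM(amount) FROM transactions "
        "GROUP BY order_date ORDER BY order_date"
    ).fetchall()
    # Load: INSERT OR REPLACE keeps a re-run from duplicating rows in the target.
    target.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales (order_date TEXT PRIMARY KEY, total REAL)"
    )
    target.executemany("INSERT OR REPLACE INTO daily_sales VALUES (?, ?)", rows)
    target.commit()
    return len(rows)
```

A scheduler (daily, monthly, and so on) would call `run_batch` once per period, which is what makes batch cheaper to operate than an always-on streaming service.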
Message system = lets two systems communicate; it enables two-way, asynchronous
communication between applications (like a chat on WhatsApp: the message is delivered first
and the other side reads it later)
Synchronous communication is like a WhatsApp call: if the other side doesn't pick up, you
have to wait and then call again
There are 2 types:
  • Point-to-point: 1 message can only be consumed by 1 consumer. It uses a queue (queuing system)
  • Pub-sub (publish-subscribe): it uses a topic
Use case = moving data from 1 OLTP to 2 OLAP systems (1 as primary, 1 as backup) using
pub-sub
Source → message system → processing engine → message system → target
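
The difference between the two types, and the use case above, can be sketched with in-memory stand-ins. The classes `PointToPointQueue` and `Topic` and the function `replicate` are hypothetical names for this example; a real message system is a separate service and delivers asynchronously, whereas this sketch calls subscribers directly for simplicity.

```python
from collections import deque


class PointToPointQueue:
    """Queue: each message is handed to exactly one consumer, then it is gone."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Topic: every subscriber receives its own copy of each published message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


def replicate(oltp_rows):
    """Use case sketch: 1 OLTP source feeds 2 OLAP targets (primary and backup)."""
    primary, backup = [], []
    topic = Topic()
    topic.subscribe(primary.append)
    topic.subscribe(backup.append)
    for row in oltp_rows:
        topic.publish(row)
    return primary, backup
```

With a queue, the second target would never see a row the first one already consumed, which is why the use case calls for pub-sub.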

Processing engine = Apache Kafka (Kafka Streams, which provides an API), Apache Flink (native streaming: it processes events as they arrive rather than in batches), Apache Spark Streaming (not native streaming: under the hood it doesn't really stream, it processes micro-batches)
Flink and Spark = must have their own cluster
Kafka = doesn't need an additional cluster
Messaging systems need clusters
So Flink and Spark need two clusters = 1 for Kafka and 1 for themselves
Delivery guarantees differ: at least once vs. exactly once
Stream processing also uses windowing
Data push = for example, building an API that performs CRUD on the database for a transaction table
When the data becomes more complex, systems start to use an event-driven design: events are sent to the messaging system. For example, a shopping event is followed by the goods delivery process (it is processed according to the event flow)
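
The practical difference between at-least-once and exactly-once can be illustrated with a small simulation. Everything here is an assumption made for the example (the message shape with `id` and `amount`, the function and class names); it is not how Kafka, Flink, or Spark implement their guarantees, which rely on mechanisms such as transactions and checkpoints. The sketch only shows that duplicates arrive under at-least-once delivery and that deduplicating by message id yields an exactly-once effect.

```python
def deliver_at_least_once(messages, redeliver_ids):
    """Yield every message, and yield again those whose id is in `redeliver_ids`
    (simulating a retry after a lost acknowledgement)."""
    for msg in messages:
        yield msg
        if msg["id"] in redeliver_ids:
            yield msg


class NaiveConsumer:
    """Applies every delivery, so duplicates are counted twice."""

    def __init__(self):
        self.total = 0

    def handle(self, msg):
        self.total += msg["amount"]


class IdempotentConsumer:
    """Remembers processed ids, so a redelivered message has no extra effect."""

    def __init__(self):
        self.total = 0
        self._seen = set()

    def handle(self, msg):
        if msg["id"] in self._seen:
            return
        self._seen.add(msg["id"])
        self.total += msg["amount"]
```

Feeding the same redelivered stream to both consumers shows the gap: the naive total is inflated by the duplicate, while the idempotent total matches the sum of the original messages.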