Batch and Stream Processing Flow, Cheat Sheet of Data Warehousing

Universitas Pelita Harapan (UPH), Data Warehousing

The difference between batch and stream processing flow, and the importance of monitoring and detecting data anomaly errors. It also discusses OLTP and OLAP, and the process of moving data from source to target using a data pipeline. The document also introduces the concept of message systems and their types, and provides a use case scenario.

Typology: Cheat Sheet

2015/2016

Available from 11/21/2023

maydore 🇮🇩


BATCH AND STREAM PROCESS

Batch and stream processing flow: windows can be sliding, fixed, or session (choose the type that fits the use case)
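
The three window types can be sketched in a few lines of Python. This is a minimal illustration, not the API of any particular engine: the function names, the integer timestamps, and the `size`, `slide`, and `gap` parameters are all assumptions made for the example.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # (start, end), end is exclusive for fixed/sliding


def fixed_windows(ts: int, size: int) -> List[Window]:
    """Fixed window: every timestamp belongs to exactly one window."""
    start = (ts // size) * size
    return [(start, start + size)]


def sliding_windows(ts: int, size: int, slide: int) -> List[Window]:
    """Sliding windows overlap, so one timestamp can belong to several."""
    windows = []
    start = (ts // slide) * slide  # latest window start at or before ts
    while start > ts - size:
        if start >= 0:
            windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Session windows: events no more than `gap` apart share one session.

    Here a session is reported as (first event, last event).
    """
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], ts)  # extend the open session
        else:
            sessions.append((ts, ts))  # start a new session
    return sessions
```

A fixed window suits periodic totals, a sliding window suits moving averages, and a session window suits bursts of user activity separated by idle time.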

Monitoring shows whether the application we are running has any problems. If there is a problem, we have to deal with it. Monitoring also detects data anomalies and errors.

Data can be processed and analyzed as soon as it appears, so decisions can be made immediately.

Time-critical use cases require a streaming data pipeline.

Non-analytical (transactional) data = OLTP [when storage is heavy, large companies avoid running analysis on the OLTP system, because it would impact the performance of the application]

Analytical data = OLAP

So, data from OLTP is moved promptly to OLAP, so that, for example, 1 million users are not disturbed by a slow application.

Streaming is expensive because the service has to run 24 hours a day. Batch is different: a monthly batch, for example, runs the system only once a month.

If the streaming processing load rises, for example during an e-commerce promo, the pipeline specifications need to be increased.

Data pipeline = data flow. The basic task of a data pipeline is moving data from source to target.

So a data pipeline also contains a processing step.

In batch, the data is pulled by a script that we write (we fetch it from the source; the source does not send it to us).
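
A batch pull of this kind can be sketched with Python's built-in `sqlite3` module. Everything named here is an assumption for illustration: the `transactions` source table stands in for an OLTP system, the `daily_sales` target table stands in for an OLAP store, and the sample rows are made up.

```python
import sqlite3


def run_batch(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """Pull rows from the source, aggregate them, and load them into the target."""
    # Extract and process: our script pulls from the source and aggregates per day.
    rows = source.execute(
        "SELECT day, SUM(amount) FROM transactions GROUP BY day ORDER BY day"
    ).fetchall()
    # Load: write the aggregated rows into the analytical target table.
    target.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales (day TEXT PRIMARY KEY, total REAL)"
    )
    target.executemany("INSERT OR REPLACE INTO daily_sales VALUES (?, ?)", rows)
    target.commit()
    return len(rows)


# Hypothetical source data standing in for an OLTP transaction table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE transactions (day TEXT, amount REAL)")
source.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("2023-11-01", 10.0), ("2023-11-01", 5.0), ("2023-11-02", 7.5)],
)

target = sqlite3.connect(":memory:")
loaded = run_batch(source, target)
```

In practice a scheduler would run a script like this on a fixed interval (daily, monthly), which is why batch does not need a service running around the clock.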

Message system = lets two systems communicate; it enables two-way, asynchronous communication between applications (like a chat via WhatsApp: once the other party has read it, the message has been delivered)

Synchronous communication is like a call via WhatsApp: if the other party does not pick up, you have to wait and then call again

There are 2 types:

• Point-to-point (1 message can be consumed by only 1 consumer)

It uses a queue (queuing system)

• Pub-sub (publish-subscribe)

It uses a topic
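
The difference between the two types can be shown with a small in-memory sketch. The classes `PointToPointQueue` and `PubSubBroker` are hypothetical names invented for this example; a real message system adds persistence, networking, and acknowledgements that are not modeled here.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


class PointToPointQueue:
    """Queue: each message is handed to exactly one consumer."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()

    def send(self, message: str) -> None:
        self._queue.append(message)

    def receive(self) -> Optional[str]:
        # The message is removed, so no other consumer can receive it.
        return self._queue.popleft() if self._queue else None


class PubSubBroker:
    """Topic: every subscriber of the topic gets its own copy."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = {}

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, message: str) -> None:
        for handler in self._subscribers.get(topic, []):
            handler(message)


# Point-to-point: the second consumer finds the queue empty.
queue = PointToPointQueue()
queue.send("order-1")
consumer_a = queue.receive()
consumer_b = queue.receive()

# Pub-sub: both subscribers receive the same message.
broker = PubSubBroker()
inbox_1: List[str] = []
inbox_2: List[str] = []
broker.subscribe("orders", inbox_1.append)
broker.subscribe("orders", inbox_2.append)
broker.publish("orders", "order-1")
```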

Use case = moving data from 1 OLTP system to 2 OLAP systems (one as primary, one as backup) using pub-sub

Source → message system → processing engine → message system → target
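
The flow above, applied to the use case of one source and two targets, might look like the following in-memory sketch. The `Topic` class, the event fields (`order_id`, `amount`), and the validation rule inside `processing_engine` are assumptions chosen for illustration only.

```python
from typing import Callable, Dict, List

Event = Dict[str, float]


class Topic:
    """A tiny in-memory stand-in for a pub-sub topic."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self._handlers.append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._handlers:
            handler(event)


raw_events = Topic()    # message system between source and processing engine
clean_events = Topic()  # message system between processing engine and targets


def processing_engine(event: Event) -> None:
    """Drop invalid rows and enrich valid ones before forwarding them."""
    if event["amount"] <= 0:
        return  # hypothetical rule: non-positive amounts are treated as invalid
    enriched = {**event, "amount_cents": int(round(event["amount"] * 100))}
    clean_events.publish(enriched)


raw_events.subscribe(processing_engine)

# Two targets subscribe to the same topic: primary and backup.
olap_primary: List[Event] = []
olap_backup: List[Event] = []
clean_events.subscribe(olap_primary.append)
clean_events.subscribe(olap_backup.append)

# The source publishes rows as they are produced.
for row in [{"order_id": 1, "amount": 12.5}, {"order_id": 2, "amount": -1.0}]:
    raw_events.publish(row)
```

Because both targets subscribe to the same topic, adding a third target needs no change to the source or to the processing engine.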


Processing engine = Apache Kafka [Kafka Streams, which provides an API], Apache Flink (native streaming; it does not micro-batch), Apache Spark Streaming (not native streaming; it does not truly stream but processes micro-batches)

Flink and Spark = must have their own cluster

Kafka (Kafka Streams) = does not need an additional cluster

Messaging systems need a cluster

So Flink and Spark need two clusters: 1 for Kafka and 1 for themselves

Delivery guarantees differ between cases: at least once and exactly once

The streaming process also uses windowing

Data pushed = for example, when you build an API that performs CRUD on the database, such as on a transaction table

When the data becomes more complex, systems start to use an event-driven approach: events are sent to the messaging system. For example, a shopping event is followed by the goods delivery process (it is processed according to the event flow)
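
The practical difference between at least once and exactly once can be illustrated with a short sketch. Under at-least-once delivery a message may arrive more than once, so a consumer that simply adds up amounts can overcount. The sketch below removes duplicates by message ID to get an exactly-once effect. The message IDs, amounts, and function names are made up for the example, and real engines use their own mechanisms, which are not modeled here.

```python
from typing import List, Set, Tuple

Delivery = Tuple[int, int]  # (message_id, amount)


def naive_total(deliveries: List[Delivery]) -> int:
    """Counts every delivery, so a redelivered message is counted twice."""
    return sum(amount for _, amount in deliveries)


def deduplicated_total(deliveries: List[Delivery]) -> int:
    """Counts each message ID once, giving an exactly-once effect."""
    seen: Set[int] = set()
    total = 0
    for message_id, amount in deliveries:
        if message_id in seen:
            continue  # duplicate caused by a retry; skip it
        seen.add(message_id)
        total += amount
    return total


# Message 2 is delivered twice, as can happen under at-least-once delivery.
deliveries = [(1, 10), (2, 20), (2, 20), (3, 5)]
overcounted = naive_total(deliveries)
correct = deduplicated_total(deliveries)
```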
