3 Essential Questions to Address When Building an API-Involved Incremental Data Loading Script | by Daniel Khoa Le

Now let’s say you already extracted a bunch of information by making API requests with the above-mentioned params, it’s time so that you can determine the way you wish to write them to the vacation spot desk.

👉 Reply: Merge/Dedup mode (really useful)

This query considerations the selection of Write disposition or Sync mode. The quick reply is that, given you need to load your knowledge incrementally, you’ll seemingly decide to put in writing your extracted knowledge in both append mode or merge mode (often known as deduplication mode).

Nevertheless, let’s step again to look at our choices extra carefully and decide which technique is greatest fitted to incremental loading.

Listed here are the favored write inclinations.

🟪 overwrite/change: drop all current information within the vacation spot tables after which insert the extracted information.
🟪 append: merely append extracted information to the vacation spot tables.
🟪 merge / dedup: insert new(*) information and replace(**) current information.

(*) How do we all know which information are new?: Normally, we’ll use a major key to find out that. For those who use dlt, their merging technique may be extra subtle than that, together with the excellence between merge_key and primary_key (one is used for merging and one is used for dedupication earlier than merging) or dedup_sort (which information are to be deleted with the identical key within the dedup course of). I’ll go away that half for one more tutorial.

(**) It is a easy rationalization, if you wish to discover out extra about how dlt handles this merging technique, learn extra here.

👁️👁️ Right here is an instance to assist us perceive the outcomes of various write inclinations.

↪️ On 2024.06.19: We make the primary sync.

🅰️ Knowledge in supply utility️️

🅱️ ️Knowledge loaded to our vacation spot database

It doesn’t matter what sync technique you select, the desk on the vacation spot is actually a replica of the supply desk.

Saved state of updated_at= 2024–06–03, which is the newest updated_at mong the two information we synced.

↪️ On 2024.06.2: We make the second sync.

🅰️ ️️️️️️️Knowledge in supply utility

✍️ Modifications within the supply desk:

File id=1 was up to date (gross sales determine).
File id=2 was dropped.
File id=3 was inserted.

At this sync, we ONLY extract information with the updated_at> 2024–06–03 (state saved from final sync). Subsequently, we’ll extracted solely file id=1 and id=3. Since file id=2 was faraway from the supply knowledge, there isn’t a approach for us to acknowledge this variation.

With the second sync, you now will see the distinction among the many write methods.

🅱️ Knowledge loaded to our vacation spot database

❗ State of affairs 1: Overwrite

The vacation spot desk can be overwritten by the two information extracted this time.

❗ State of affairs 2: Append

The two extracted information can be appended to the vacation spot desk, the present information usually are not affected.

❗ State of affairs 3: Merge or dedup

The two extracted information with id=1 and three will change the present information at vacation spot. This processing is so known as merging or deduplicating. File id=2 within the vacation spot desk stays intact.

🟢 Takeaways: The merge (dedup) technique may be efficient within the incremental knowledge loading pipeline, but when your desk could be very giant, this dedup course of would possibly take a substantial period of time.

Source link

You Kon’t Know Jacc(ard). When the Jaccard similarity index isn’t… | by Eleanor Hanna | Jul, 2024

RFM Segmentation: Unleashing Customer Insights | by Vito Rihaldijiran | Jul, 2024

Decktopus Pricing, Pros Cons, Features, Alternatives

Leave A Reply Cancel Reply

‘Google AI’ is coming to the Pixel 9. Here’s what it looks like and everything we know

Google’s greenhouse gas emissions climbed nearly 50 percent in five years due to AI

AI tools could ease caseload of therapists feeling burnt out

A new Resident Evil game is in the works from the director of Resident Evil 7

Exploring Microsoft’s AutoGen Framework for Agentic Workflow

Most Popular

The Hamas Threat of Hostage Execution Videos Looms Large Over Social Media

Revolutionizing the Way We Find Love

Federal Investigators Widen Tesla Inquiry, Company Says

Our Picks

‘Google AI’ is coming to the Pixel 9. Here’s what it looks like and everything we know

Google’s greenhouse gas emissions climbed nearly 50 percent in five years due to AI

AI tools could ease caseload of therapists feeling burnt out

3 Essential Questions to Address When Building an API-Involved Incremental Data Loading Script | by Daniel Khoa Le | Jun, 2024

👉 Reply: Merge/Dedup mode (really useful)

Related Posts

Leave A Reply Cancel Reply