Now let’s say you already extracted a bunch of information by making API requests with the above-mentioned params, it’s time so that you can determine the way you wish to write them to the vacation spot desk.
👉 Reply: Merge/Dedup mode (really useful)
This query considerations the selection of Write disposition or Sync mode. The quick reply is that, given you need to load your knowledge incrementally, you’ll seemingly decide to put in writing your extracted knowledge in both append mode or merge mode (often known as deduplication mode).
Nevertheless, let’s step again to look at our choices extra carefully and decide which technique is greatest fitted to incremental loading.
Listed here are the favored write inclinations.
- 🟪 overwrite/change: drop all current information within the vacation spot tables after which insert the extracted information.
- 🟪 append: merely append extracted information to the vacation spot tables.
- 🟪 merge / dedup: insert new(*) information and replace(**) current information.
(*) How do we all know which information are new?: Normally, we’ll use a major key to find out that. For those who use dlt, their merging technique may be extra subtle than that, together with the excellence between merge_key
and primary_key
(one is used for merging and one is used for dedupication earlier than merging) or dedup_sort
(which information are to be deleted with the identical key within the dedup course of). I’ll go away that half for one more tutorial.
(**) It is a easy rationalization, if you wish to discover out extra about how dlt handles this merging technique, learn extra here.
👁️👁️ Right here is an instance to assist us perceive the outcomes of various write inclinations.
↪️ On 2024.06.19: We make the primary sync.
🅰️ Knowledge in supply utility
️️
🅱️ ️Knowledge loaded to our vacation spot database
It doesn’t matter what sync technique you select, the desk on the vacation spot is actually a replica of the supply desk.
Saved state of updated_at
= 2024–06–03, which is the newest updated_at
mong the two information we synced.
↪️ On 2024.06.2: We make the second sync.
🅰️ ️️️️️️️Knowledge in supply utility
✍️ Modifications within the supply desk:
- File id=1 was up to date (gross sales determine).
- File id=2 was dropped.
- File id=3 was inserted.
At this sync, we ONLY extract information with the updated_at
> 2024–06–03 (state saved from final sync). Subsequently, we’ll extracted solely file id=1 and id=3. Since file id=2 was faraway from the supply knowledge, there isn’t a approach for us to acknowledge this variation.
With the second sync, you now will see the distinction among the many write methods.
🅱️ Knowledge loaded to our vacation spot database
❗ State of affairs 1: Overwrite
The vacation spot desk can be overwritten by the two information extracted this time.
❗ State of affairs 2: Append
The two extracted information can be appended to the vacation spot desk, the present information usually are not affected.
❗ State of affairs 3: Merge or dedup
The two extracted information with id=1 and three will change the present information at vacation spot. This processing is so known as merging or deduplicating. File id=2 within the vacation spot desk stays intact.
🟢 Takeaways: The merge (dedup) technique may be efficient within the incremental knowledge loading pipeline, but when your desk could be very giant, this dedup course of would possibly take a substantial period of time.