Previously, I laid the project's groundwork with data collection and cleaning, which is essential for being able to fine-tune our model effectively. This time, I’ll discuss the design of our fine-tuning dataset, which will be derived from the collected data.
We want the fine-tuned model to learn two things:
- How do I reply to an email when given the source email + information on the content?
- How do I send an email from scratch when given content information?
To address these questions, we’ll collect all email-reply pairs where I was the replier AND all emails where I was the first sender. This is where the EmailThread data structure shines, since we can easily iterate over consecutive pairs to find email-reply pairs that match our criteria.
import pandas as pd

# 'original' is the source email
# 'generation' is the reply email
data = pd.DataFrame({col: [] for col in ["original", "generation"]})

for thread in email_threads:
    if len(thread) > 0:
        # Emails I sent from scratch: no original, only a generation
        if "garreth" in thread[0].sender.lower():
            data = pd.concat(
                [data, pd.DataFrame(dict(original=[None], generation=[thread[0].message]))],
                ignore_index=True,
            )
        # Consecutive (original, reply) pairs where someone emailed me and I replied
        for original, reply in zip(thread, thread[1:]):
            if "garreth" in reply.sender.lower() and "garreth" not in original.sender.lower():
                data = pd.concat(
                    [data, pd.DataFrame(dict(original=[original.message], generation=[reply.message]))],
                    ignore_index=True,
                )
Unfortunately, a lot of the data is redundant. If I have an email thread and someone replies to me, instead of modifying the existing thread, it creates a new thread with that additional reply.
Because of this overlapping nature of email threads, the data needs to be deduplicated so we don’t fine-tune the model on redundant examples.
The deduplication pipeline has several steps:
- Exact dedup (exact string comparison, keeping the first among duplicates)
- Jaccard similarity dedup (we can get away with this since most duplicates differ very little)
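The exact-dedup step is a one-liner in pandas. A minimal sketch with toy rows (the frame contents are illustrative; `data_dedup` names the result used in the next step):

```python
import pandas as pd

# Toy stand-in for the collected email-reply pairs
data = pd.DataFrame({
    "original": ["Hi Garreth, ...", None, "Hi Garreth, ..."],
    "generation": ["Thanks! Best, Garreth", "Hello team, ...", "Thanks! Best, Garreth"],
})

# Exact dedup: drop rows whose reply text is an exact duplicate,
# keeping the first occurrence
data_dedup = data.drop_duplicates(subset=["generation"], keep="first").reset_index(drop=True)
```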
import re

SIMILARITY_THRESHOLD = 0.65

# Regular expression to match signatures - they are identical across
# emails and would otherwise inflate similarity scores
signature_regex = re.compile("Best,(.*)Partnerships Director", flags=re.DOTALL)

def jaccard_similarity(list1, list2):
    """Calculate the Jaccard similarity between two token lists."""
    intersection_cardinality = len(set(list1).intersection(list2))
    union_cardinality = len(set(list1).union(list2))
    if union_cardinality == 0:
        return -1
    return intersection_cardinality / float(union_cardinality)

# Strip signatures and tokenize each generation once, up front
stripped = [signature_regex.sub("", gen).split() for gen in data_dedup["generation"]]

# Greedy clustering: each index joins the first cluster whose
# representative (lowest index) it is similar enough to;
# otherwise it starts its own cluster
clusters = []  # list of sets of indices
for i, tokens in enumerate(stripped):
    for cluster in clusters:
        if jaccard_similarity(tokens, stripped[min(cluster)]) >= SIMILARITY_THRESHOLD:
            cluster.add(i)
            break
    else:
        clusters.append({i})

# Keep the first member of each cluster (singletons keep themselves)
unique_indices = sorted(min(cluster) for cluster in clusters)
final_data = data_dedup.iloc[unique_indices].reset_index(drop=True)
We end up with a deduplicated dataset of email-reply pairs.
At this point, the only thing left to do is to reformat the “original” column into a new “prompt” column, which contains instructions and context that, when fed into an LLM, would generate the text in the “generation” column.
In other words, if I copied an example from the “prompt” column directly into the fine-tuned LLM, it should generate something similar to what’s in the “generation” column.
There are two components to the instruction: the original email and the context for creating the reply. I’ve chosen the format below for its simplicity:
[ORIGINAL]
{If it exists, the original email being replied to, otherwise None}
[CONTEXT]
{the information being conveyed in the generated email}
An example is as follows:
[ORIGINAL]
None
[CONTEXT]
to: [email protected],
about: attend HelloHacks as mentor,
details: application successful, would love to invite,
date is May 1-2 2025, hellohacks is 2 day hackathon for beginners,
appreciate your expertise
For the [CONTEXT], I’ve decided to stick with the “to”, “about”, and “details” structure, since it covers the most basic details required to compose an email (the recipient, subject, and message).
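Assembling the “prompt” column from this template is straightforward string formatting. A sketch, with a hypothetical helper name and example context of my own (the recipient address and values are placeholders, not from the real dataset):

```python
def build_prompt(original, context):
    """Format a source email and a context dict into the [ORIGINAL]/[CONTEXT] template."""
    # One "key: value" line per context field, comma-separated as in the example above
    context_str = ",\n".join(f"{k}: {v}" for k, v in context.items())
    return f"[ORIGINAL]\n{original if original is not None else 'None'}\n[CONTEXT]\n{context_str}"

# Example: an email written from scratch, so there is no original
prompt = build_prompt(
    None,
    {"to": "[email protected]", "about": "attend HelloHacks as mentor"},
)
```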
I hand-labeled around 100 examples and used this as the fine-tuning dataset. With more data, we’d get better results, but I wanted to test the feasibility of this amount of data.
Finally, I converted the dataset into JSONL, which is then loaded as a HuggingFace Dataset. We are now ready to start fine-tuning!
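The export step might look like the following sketch (the filename and frame contents are placeholders; `final_data` stands for the prompt/generation pairs built above):

```python
import json
import pandas as pd

# Toy stand-in for the final prompt/generation pairs
final_data = pd.DataFrame({
    "prompt": ["[ORIGINAL]\nNone\n[CONTEXT]\nto: [email protected]"],
    "generation": ["Hi Alex, ..."],
})

# One JSON object per line (JSONL), the format HuggingFace's "json" loader expects
final_data.to_json("finetune_dataset.jsonl", orient="records", lines=True)

# Loading it back as a HuggingFace Dataset would then be roughly:
# from datasets import load_dataset
# dataset = load_dataset("json", data_files="finetune_dataset.jsonl", split="train")
```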