Now that we’ve made some candidate semantic chunks, it would be useful to see how similar they are to one another. This will help us get a sense of what information they contain. We’ll proceed by embedding the semantic chunks, and then use UMAP to reduce the dimensionality of the resulting embeddings to 2D so that we can plot them.
UMAP stands for Uniform Manifold Approximation and Projection, and is a powerful, general dimensionality reduction technique that can capture non-linear relationships. A full explanation of how it works can be found here. The purpose of using it here is to capture something of the relationships that exist between the embedded chunks in 1536-D space in a 2-D plot.
import numpy as np
import pandas as pd
from umap import UMAP

# project the chunk embeddings down to 2D
dimension_reducer = UMAP(
    n_neighbors=5,
    n_components=2,
    min_dist=0.0,
    metric="cosine",
    random_state=0
)
reduced_embeddings = dimension_reducer.fit_transform(semantic_embeddings)

# one row per chunk: 2D coordinates plus the chunk's position in the text
splits_df = pd.DataFrame(
    {
        "reduced_embeddings_x": reduced_embeddings[:, 0],
        "reduced_embeddings_y": reduced_embeddings[:, 1],
        "idx": np.arange(len(reduced_embeddings[:, 0])),
    }
)
splits_df["chunk_end"] = np.cumsum([len(x) for x in semantic_text_groups])

# scatter plot colored by chunk index
ax = splits_df.plot.scatter(
    x="reduced_embeddings_x",
    y="reduced_embeddings_y",
    c="idx",
    cmap="viridis"
)
# thin red line joining the chunks in document order
ax.plot(
    reduced_embeddings[:, 0],
    reduced_embeddings[:, 1],
    "r-",
    linewidth=0.5,
    alpha=0.5,
)
UMAP is quite sensitive to the n_neighbors parameter. Generally, the smaller the value of n_neighbors, the more the algorithm focuses on the use of local structure to learn how to project the data into lower dimensions. Setting this value too small can lead to projections that don’t do a great job of capturing the large-scale structure of the data, and it should generally increase as the number of datapoints grows.
A projection of our data is shown below and it’s quite informative: clearly we have three clusters of similar meaning, with the first and third being more similar to each other than either is to the second. The idx color bar in the plot shows the chunk number, while the red line gives us an indication of the sequence of the chunks.
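Because a 2-D projection can distort distances, it may be worth cross-checking the picture against cosine similarities computed in the original embedding space. The following is a minimal sketch rather than part of the original pipeline: `cosine_similarity_matrix` is a hypothetical helper, and `toy_embeddings` is random data standing in for the real 1536-D chunk embeddings.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the (n, n) matrix of pairwise cosine similarities."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # guard against division by zero for all-zero vectors
    unit_vectors = embeddings / np.clip(norms, 1e-12, None)
    return unit_vectors @ unit_vectors.T


# Assumption: random vectors used purely as a stand-in for real embeddings
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 1536))

similarity = cosine_similarity_matrix(toy_embeddings)

# similarity of each chunk to the chunk that follows it in the text
consecutive = np.diag(similarity, k=1)
print(similarity.shape)
print(consecutive.round(3))
```

With real embeddings, low values in `consecutive` would be expected where the red line in the plot jumps between clusters.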
What about automatic clustering? This would be helpful if we wanted to group the chunks into larger segments or topics, which could serve as useful metadata to filter on in a RAG application with hybrid search, for example. We also might be able to group chunks that are far apart in the text (and therefore wouldn’t have been grouped by the standard semantic chunking in section 1) but have similar meanings.
There are many clustering approaches that could be used here. HDBSCAN is a possibility, and is the default method recommended by the BERTopic package. However, in this case hierarchical clustering seems more useful since it can give us a sense of the relative importance of whatever groups emerge. To run hierarchical clustering, we first use UMAP to reduce the dimensionality of the dataset to a smaller number of components. So long as UMAP is working well here, the exact number of components shouldn’t significantly affect the clusters that get generated. Then we use the hierarchy module from scipy to perform the clustering and plot the result using seaborn.
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist
from umap import UMAP
import seaborn as sns

# set up the UMAP
dimension_reducer_clustering = UMAP(
    n_neighbors=umap_neighbors,
    n_components=n_components_reduced,
    min_dist=0.0,
    metric="cosine",
    random_state=0
)
reduced_embeddings_clustering = dimension_reducer_clustering.fit_transform(
    semantic_group_embeddings
)

# create the hierarchy
row_linkage = hierarchy.linkage(
    pdist(reduced_embeddings_clustering),
    method="average",
    optimal_ordering=True,
)

# plot the heatmap and dendrogram
g = sns.clustermap(
    pd.DataFrame(reduced_embeddings_clustering),
    row_linkage=row_linkage,
    row_cluster=True,
    col_cluster=False,
    annot=True,
    linewidth=0.5,
    annot_kws={"size": 8, "color": "white"},
    cbar_pos=None,
    dendrogram_ratio=0.5
)
g.ax_heatmap.set_yticklabels(
    g.ax_heatmap.get_yticklabels(), rotation=0, size=8
)
The result is also quite informative. Here n_components_reduced was 4, so we reduced the dimensionality of the embeddings to 4D, therefore making a matrix with 4 features where each row represents one of the semantic chunks. Hierarchical clustering has identified the two major groups (i.e. trees and Namibia), two large subgroups within trees (i.e. medical uses vs. other) and a number of other groups that might be worth exploring.
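One way to put numbers on what the dendrogram shows is to look at the merge heights stored in the third column of the scipy linkage matrix. The sketch below uses synthetic data (three tight blobs in 4D) rather than the article's embeddings. The "largest gap" rule is a simple heuristic added here for illustration, not something the original analysis relies on.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Assumption: three well-separated synthetic blobs of 5 points each in 4D
rng = np.random.default_rng(0)
toy_centers = np.array(
    [[0, 0, 0, 0], [10, 10, 10, 10], [-10, 10, -10, 10]], dtype=float
)
toy_points = np.vstack(
    [center + rng.normal(scale=0.1, size=(5, 4)) for center in toy_centers]
)

toy_linkage = hierarchy.linkage(pdist(toy_points), method="average")

# column 2 of the linkage matrix holds the distance at which each merge happens
merge_heights = toy_linkage[:, 2]

# heuristic: a large jump in merge height suggests a natural place to stop merging
gaps = np.diff(merge_heights)
merges_before_gap = int(np.argmax(gaps)) + 1
suggested_n_clusters = len(toy_points) - merges_before_gap

print(merge_heights[-3:].round(2))
print(suggested_n_clusters)
```

On real text the gaps are rarely this clean, so a value like `suggested_n_clusters` is best treated as a starting point for visual inspection rather than an answer.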
Note that BERTopic uses a similar technique for topic visualization, which could be seen as an extension of what’s being presented here.
How is this useful in our exploration of semantic chunking? Depending on the results, we may choose to group some of the chunks together. This is again quite subjective and it could be important to try out a few different types of grouping. Let’s say we looked at the dendrogram and decided we wanted 8 distinct groups. We can then cut the hierarchy accordingly, return the cluster labels associated with each group and plot them.
# cut the hierarchy built above (row_linkage) into n_clusters groups
cluster_labels = hierarchy.cut_tree(row_linkage, n_clusters=n_clusters).ravel()

# 2D projection for plotting
dimension_reducer = UMAP(
    n_neighbors=umap_neighbors,
    n_components=2,
    min_dist=0.0,
    metric="cosine",
    random_state=0
)
reduced_embeddings = dimension_reducer.fit_transform(semantic_embeddings)

splits_df = pd.DataFrame(
    {
        "reduced_embeddings_x": reduced_embeddings[:, 0],
        "reduced_embeddings_y": reduced_embeddings[:, 1],
        "cluster_label": cluster_labels,
    }
)
splits_df["chunk_end"] = np.cumsum(
    [len(x) for x in semantic_text_groups]
).reshape(-1, 1)

# scatter plot colored by cluster label
ax = splits_df.plot.scatter(
    x="reduced_embeddings_x",
    y="reduced_embeddings_y",
    c="cluster_label",
    cmap="rainbow",
)
# thin red line joining the chunks in document order
ax.plot(
    reduced_embeddings[:, 0],
    reduced_embeddings[:, 1],
    "r-",
    linewidth=0.5,
    alpha=0.5,
)
The resulting plot is shown below. We have 8 clusters, and their distribution in the 2D space looks reasonable. This again demonstrates the importance of visualization: depending on the text, application and stakeholders, the right number and distribution of groups will be different, and the only way to check what the algorithm is doing is by plotting graphs like this.
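The cutting step can also be tried in isolation. The sketch below uses synthetic points, not the article's data, ordered as "topic A, topic B, topic A" to mimic a document that returns to an earlier subject. It illustrates how `cut_tree` can assign the same label to chunks that are far apart in the text. All `toy_` names are hypothetical.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Assumption: synthetic 4D points; the document visits topic A, then B, then A again
rng = np.random.default_rng(0)
topic_a = np.zeros(4)
topic_b = np.full(4, 10.0)
toy_chunks_4d = np.vstack(
    [
        topic_a + rng.normal(scale=0.1, size=(5, 4)),  # chunks 0-4
        topic_b + rng.normal(scale=0.1, size=(5, 4)),  # chunks 5-9
        topic_a + rng.normal(scale=0.1, size=(5, 4)),  # chunks 10-14
    ]
)

toy_linkage = hierarchy.linkage(pdist(toy_chunks_4d), method="average")

# cut_tree returns an (n, 1) array when n_clusters is a scalar, hence ravel()
toy_labels = hierarchy.cut_tree(toy_linkage, n_clusters=2).ravel()
print(toy_labels)
```

Here the first and last five chunks end up with the same label even though they are not adjacent, which is the kind of grouping that purely sequential chunking cannot produce.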
Assume after several iterations of the steps above, we’ve settled on semantic splits and clusters that we’re happy with. It then makes sense to ask what these clusters actually represent. Obviously we could read the text and find out, but for a large corpus this is impractical. Instead, let’s use an LLM to help. Specifically, we will feed the text associated with each cluster to GPT-4o-mini and ask it to generate a summary. This is a relatively simple task with LangChain, and the core aspects of the code are shown below.
import langchain
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers.string import StrOutputParser
from langchain.callbacks import get_openai_callback
from langchain_openai import ChatOpenAI
from dataclasses import dataclass


@dataclass
class ChunkSummaryPrompt:
    system_prompt: str = """
You are an expert at summarization and information extraction from text. You will be given a chunk of text from a document and your
task is to summarize what's happening in this chunk using fewer than 10 words.
Read through the entire chunk first and think carefully about the main points. Then produce your summary.
Chunk to summarize: {current_chunk}
"""
    prompt: langchain.prompts.PromptTemplate = PromptTemplate(
        input_variables=["current_chunk"],
        template=system_prompt,
    )


class ChunkSummarizer(object):
    def __init__(self, llm):
        self.prompt = ChunkSummaryPrompt()
        self.llm = llm
        self.chain = self._set_up_chain()

    def _set_up_chain(self):
        # prompt template -> LLM -> plain string output
        return self.prompt.prompt | self.llm | StrOutputParser()

    def run_and_count_tokens(self, input_dict):
        # the callback records token usage for the call
        with get_openai_callback() as cb:
            result = self.chain.invoke(input_dict)
        return result, cb


llm_model = "gpt-4o-mini"
llm = ChatOpenAI(model=llm_model, temperature=0, api_key=api_key)
summarizer = ChunkSummarizer(llm)
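Before the summarizer can be run per cluster, the chunk texts need to be gathered by cluster label. The original does not show this step, so the following is only a sketch of one way it might be done. `group_text_by_cluster` is a hypothetical helper, and the toy chunks and labels are invented for illustration.

```python
from collections import defaultdict


def group_text_by_cluster(chunks, labels):
    """Concatenate chunk texts that share a cluster label, keeping document order."""
    grouped = defaultdict(list)
    for chunk, label in zip(chunks, labels):
        grouped[int(label)].append(chunk)
    return {label: " ".join(texts) for label, texts in sorted(grouped.items())}


# Assumption: made-up chunks and labels standing in for the real ones
toy_chunks = [
    "Oak trees can live for centuries.",
    "Namibia lies on the Atlantic coast.",
    "Willow bark was used as a remedy.",
    "The Namib is a coastal desert.",
]
toy_labels = [0, 1, 0, 1]

cluster_texts = group_text_by_cluster(toy_chunks, toy_labels)
for label, text in cluster_texts.items():
    print(label, text)

# Each value could then be passed to the summarizer defined above, e.g.
# summary, cb = summarizer.run_and_count_tokens({"current_chunk": cluster_texts[0]})
```

For large clusters the concatenated text may exceed the model's context window, in which case sampling or truncating the chunks would be needed.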
Running this on our 8 clusters and plotting the result with datamapplot gives the following