It’s fascinating how large language models (LLMs) have revolutionized natural language processing (NLP) tasks. Their performance is truly remarkable, suggesting they could be game-changers across many tasks, including annotating data faster and cheaper than humans. However, LLMs sometimes make mistakes, e.g., when inputs are complex or tasks are domain-specific. They may even introduce biases into training data.
So, what’s the solution? The answer is collaboration! Instead of completely replacing human annotators with LLMs, we need to leverage the strengths of both sides to obtain accurate and reliable annotations. This article discusses how to effectively utilize LLMs as collaborators for data annotation.
We propose a multi-step human-LLM collaborative framework for data annotation that ensures accuracy and trustworthiness. First, LLMs predict labels and generate explanations. Next, a trained verifier model assesses the LLM labels and explanations. Finally, human annotators re-annotate the subset of labels that receive the lowest verifier scores.
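To make the flow concrete, here is a minimal Python sketch of the three steps. The callables `llm_annotate`, `verify`, and `human_review` are hypothetical placeholders standing in for the LLM call, the verifier model, and the human annotation interface; they are illustrative names, not part of any released API.

```python
# A minimal sketch of the three-step collaborative loop, assuming you supply:
#   llm_annotate(text) -> (label, explanation)
#   verify(text, label, explanation) -> score (higher = more trustworthy)
#   human_review(text, label, explanation) -> corrected label
def collaborative_annotation(texts, llm_annotate, verify, human_review, budget):
    # Step 1: the LLM proposes a label and a natural-language explanation per text.
    records = []
    for text in texts:
        label, explanation = llm_annotate(text)
        records.append({"text": text, "label": label, "explanation": explanation})

    # Step 2: the verifier scores how trustworthy each LLM label looks.
    for r in records:
        r["score"] = verify(r["text"], r["label"], r["explanation"])

    # Step 3: humans re-annotate only the lowest-scoring subset, up to the budget.
    records.sort(key=lambda r: r["score"])
    for r in records[:budget]:
        r["label"] = human_review(r["text"], r["label"], r["explanation"])
    return records
```

The human budget caps annotation cost: only the labels the verifier trusts least are sent back for review, while the rest keep their LLM labels.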
The key differentiating idea here is using LLMs’ self-explanation capability to explain their labeling decisions. In Step 2, the LLM-generated explanations can provide additional information about the LLMs’ reasoning processes to the verifier model [Marasovic et al., 2022]. In the human re-annotation step, LLM explanations can help human annotators understand and trust LLMs as collaborators [Wang et al., 2023].
To achieve effective collaboration in such a framework, we explore two research questions in our CHI 2024 paper.
- RQ1: How can we verify LLM-generated labels using signals from the inputs, LLM labels, and LLM explanations?
- RQ2: Does providing LLM-generated labels and explanations help humans in re-annotation?
In the following, we discuss these questions in detail.
We developed a verifier model that assigns scores to LLM-generated labels. The scores help identify a subset of potentially erroneous labels, so that human effort is not wasted re-annotating LLM labels that are already correct.
LLMs take text samples as input and generate labels and explanations as output. We collected features along these three dimensions of the LLM annotation process. In total, we ended up with 70 input features (e.g., text attribute features and sample representativeness features such as coherence, perplexity, and readability), 7 label features (e.g., logit and entropy), and 73 explanation features (e.g., explanation sufficiency and simulatability).
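As a rough illustration (not the paper’s actual feature extractors), a few features from each dimension might be computed along the lines below. The full feature set is far richer, and `token_logprobs` is assumed to be the log-probabilities the LLM assigned to the tokens of its generated label.

```python
import math

def label_features(token_logprobs):
    # Confidence-style signals derived from the LLM's own output distribution.
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return {
        "mean_logprob": avg_logprob,
        "min_logprob": min(token_logprobs),
        "label_perplexity": math.exp(-avg_logprob),
    }

def input_features(text):
    # Simple text-attribute signals; readability or coherence scores
    # would be added in a similar fashion.
    words = text.split()
    return {
        "num_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }

def explanation_features(explanation, text):
    # Crude lexical overlap between explanation and input, standing in for
    # richer sufficiency / simulatability features.
    exp_tokens = set(explanation.lower().split())
    txt_tokens = set(text.lower().split())
    return {
        "explanation_len": len(exp_tokens),
        "overlap_with_input": len(exp_tokens & txt_tokens) / max(len(exp_tokens), 1),
    }
```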
We conducted experiments with different settings across various datasets. Results show that our verifier identifies incorrect LLM labels better than a logit-based baseline, which is widely used to estimate LLMs’ uncertainty. This means that the additional signals from the input and the explanation are useful for distinguishing potentially incorrect LLM labels.
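For intuition, the sketch below contrasts a logit-only confidence score with a verifier that learns from the concatenated input, label, and explanation features. The choice of logistic regression is purely an assumption for illustration; the verifier described in the paper may use a different learner.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logit_baseline_scores(label_probs):
    # Baseline: rank LLM labels by the probability the LLM assigned to its
    # own predicted label (higher = assumed more likely to be correct).
    return np.asarray(label_probs)

def train_verifier(feature_matrix, is_correct):
    # feature_matrix: one row per example, concatenating input, label,
    # and explanation features; is_correct: 1 if the LLM label matched gold.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(feature_matrix, is_correct)
    return clf

def verifier_scores(clf, feature_matrix):
    # Verifier's estimated probability that each LLM label is correct.
    return clf.predict_proba(feature_matrix)[:, 1]
```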
In the re-annotation step, humans label a subset of LLM labels suggested by the verifier in the previous step.
We conducted a human-subject study to identify the optimal strategy for improving human re-annotation performance: (1) not presenting any LLM outputs to human annotators, (2) presenting only the LLM-generated labels, and (3) presenting both the LLM-generated labels and explanations.
Figure 3: Task interface used in the human-subject study for the different re-annotation strategies.
For each data point, participants reviewed the data with or without LLM assistance and then provided their final annotations (Figure 3).
For the SNLI dataset, human accuracy was higher when both LLM labels and explanations were provided than when only LLM labels or no assistance was provided. On the other hand, results on the stance detection task did not show any statistically significant differences between the AI-assistance treatments.
We further analyzed whether the effect of LLM assistance differed between instances the LLM labeled correctly and those it labeled incorrectly. We found that when the LLM was correct, participants were more accurate with more LLM assistance. Interestingly, when the LLM was incorrect, providing the wrong LLM labels hurt human accuracy. There were no differences between showing both the LLM explanation and label and showing only the LLM label.
For more details on the study setup and additional human perception analyses, please refer to our CHI 2024 paper.
We discussed how to design LLM-human collaborative annotation frameworks by leveraging an LLM’s labels and self-explanations in automatic verification and re-annotation. Our findings from the verifier experiments suggest that diverse signals such as self-explanations can be informative when automatically verifying LLM-generated annotations; in other words, don’t rely on logits alone. The crowdsourced study calls for the need to quantify and improve the quality of LLM explanations and to carefully decide when explanations are helpful for human re-annotation.
With this research and ML practitioners in mind, we have built MEGAnno, an annotation tool that uses human-LLM collaboration with verification. Check it out!
Article written by: Hannah Kim and Megagon Labs.