LongEval Guide: A Big Breakthrough in ChatGPT Text Evaluation
The new evaluation toolkit LongEval introduces common standards for validating AI-generated text.
Recently, the public has reacted strongly to the release of generative neural networks such as ChatGPT. Many consider the technology a great advance in communication, while others warn of its detrimental consequences.
However, generated text is notorious for its shortcomings, and human judgment remains the gold standard for checking accuracy, especially for long summaries of complex texts. Yet there are currently no accepted standards for human evaluation of long summaries, which casts doubt even on this "gold standard".
To remedy this situation, a team of computer scientists from the United States introduced a set of guidelines called LongEval. The guidelines were presented at the European Chapter of the Association for Computational Linguistics, where they received a Best Paper Award.
According to the researchers, there is currently no reliable way to evaluate long generated texts without human intervention, and even existing human evaluation protocols are costly, time-consuming, and highly variable.
During the study, the team analyzed 162 research papers on long-form summarization. The analysis showed that 73% of the papers did not perform human evaluation at all, while the rest used a wide variety of evaluation methods.
To promote efficient, reproducible, and standardized protocols for human evaluation of generated summaries, the authors developed three overarching recommendations covering how and what an annotator should read in order to judge the faithfulness of a summary.
The LongEval guide includes the following recommendations:
- Evaluate the faithfulness of the summary on individual fragments (sentences or clauses) rather than on the whole text. This increases agreement between different annotators and reduces their workload;
- Use automatic alignment between summary fragments and source snippets to make it easier to find the relevant information in long documents. This also helps annotators catch errors introduced when the summary paraphrases or condenses information;
- Select an appropriate set of fragments to evaluate, depending on the purpose of the study. For example, you can evaluate all fragments, a random subsample, or only those containing key information.
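The three recommendations above can be sketched in code. The following is a minimal, hypothetical illustration — not the actual LongEval library API — of splitting a summary into sentence fragments, aligning each fragment to its closest source sentence by simple word overlap, and sampling a subset for annotation:

```python
import random
import re


def split_fragments(text):
    """Naively split a text into sentence-level fragments."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def word_overlap(fragment, sentence):
    """Fraction of the fragment's words that also appear in a source sentence."""
    frag_words = words(fragment)
    return len(frag_words & words(sentence)) / max(len(frag_words), 1)


def align_to_source(fragment, source_sentences):
    """Return the source sentence with the highest word overlap (a crude stand-in
    for the automatic alignment the guidelines recommend)."""
    return max(source_sentences, key=lambda s: word_overlap(fragment, s))


def sample_fragments(fragments, k, seed=0):
    """Pick a random subsample of fragments to annotate."""
    rng = random.Random(seed)
    return rng.sample(fragments, min(k, len(fragments)))
```

In practice the alignment step would use a stronger similarity measure (e.g. embedding similarity), but the structure is the same: fragment-level units, each linked back to its supporting source text, with the annotation budget controlled by sampling.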
The researchers applied LongEval to two long-form summarization datasets from different fields (SQuALITY and PubMed) and showed that fine-grained evaluation reduces the variance of whole-text faithfulness scores. They also showed that scores computed from a subset of fragments correlate strongly with scores computed from the full summary.
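That correlation finding can be illustrated with a toy calculation. The scoring scheme and data below are made up for illustration (this is not the LongEval library's API): each summary's score is the mean of binary fragment judgments, and scores from a random subset are compared against scores from all fragments.

```python
import random
from statistics import mean


def summary_score(fragment_judgments):
    """Whole-summary faithfulness score: mean of its fragment judgments."""
    return mean(fragment_judgments)


def partial_score(fragment_judgments, k, seed=0):
    """Score computed from a random subset of k fragment judgments."""
    rng = random.Random(seed)
    return mean(rng.sample(fragment_judgments, min(k, len(fragment_judgments))))


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Illustrative binary faithfulness judgments for four summaries (made-up data).
judgments = [
    [1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0],
]
full = [summary_score(j) for j in judgments]
partial = [partial_score(j, k=3, seed=i) for i, j in enumerate(judgments)]
print(pearson(full, partial))
```

With enough fragments per summary, subset-based scores track the full scores closely, which is what makes the paper's sampling recommendation practical.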
- SQuALITY is a dataset of 5 summaries for each of 100 public-domain short stories. The first summary gives an overview of the entire story, while the other four answer specific questions about plot, characters, theme, and style.
- PubMed is a dataset of 10,000 scientific articles from the medical domain together with their abstracts. The abstracts are 150 to 300 words long and summarize the articles' main results and conclusions.
The researchers promise that LongEval will let people "accurately and quickly evaluate algorithms for generating long text." LongEval is released as a Python library, so the community can use and extend it in their own research.