Programmatic and model-based evaluations
Tasks in CURIE are diverse and have ground-truth annotations in mixed and heterogeneous form, e.g., as JSONs, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our cases, the response to each field can have different types. For example, materials grid points may sometimes be specified as “[p, q, r]” and at other times as “p × q × r”. Hence, in addition to the programmatic evaluation metrics, such as ROUGE-L, intersection-over-union (used for BIOGR), and identity ratio (used in PDB), we propose two model-based evaluation metrics.
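As an illustration of the programmatic metrics, the sketch below shows intersection-over-union for two axis-aligned bounding boxes; the (x_min, y_min, x_max, y_max) box format is an assumption for illustration, not the benchmark's exact annotation schema.

```python
def bbox_iou(box_a, box_b):
    """Intersection-over-union for two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max) tuples; returns a value in [0, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```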
(1) LMScore: Prompts an LLM asking how closely the predictions match ground truth on a 3-point scale: “good” if the prediction has few minor errors, “okay” if there are many minor errors, and “bad” if there are major errors. We consider the weighted average of the log-likelihood scores of the tokens to provide a final confidence.
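A minimal sketch of one way such a confidence could be aggregated, assuming the judging LLM exposes the log-likelihood of each rating label; the numeric values assigned to the three labels and the normalization are assumptions, not fixed by the metric's definition above.

```python
import math

# Assumed numeric mapping of the 3-point scale onto [0, 1].
LABEL_VALUE = {"good": 1.0, "okay": 0.5, "bad": 0.0}

def lmscore(label_logprobs):
    """Likelihood-weighted average over the candidate rating labels.

    `label_logprobs` maps each label to the log-likelihood the judging LLM
    assigned it, e.g. {"good": -0.2, "okay": -2.1, "bad": -4.0}.
    """
    weights = {label: math.exp(lp) for label, lp in label_logprobs.items()}
    total = sum(weights.values())
    return sum(LABEL_VALUE[label] * w / total for label, w in weights.items())
```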
(2) LLMSim: Is used for retrieval tasks where we ask the model to exhaustively extract many details, e.g., descriptors, properties, and values of materials from a research document, and provide as output an unordered list of dictionaries or records. We use a chain-of-thought (CoT) prompt that asks the LLM to look at each ground-truth record and identify the predicted records that correctly match each field (key) and value of the ground truth. Once we match the ground-truth records with predicted records, we can then measure precision and recall for the retrieval task, and compute the mean average precision, recall, and F1 scores across all documents.
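Once the LLM has matched predicted records to ground-truth records, the retrieval scores are standard. The sketch below assumes the matching step yields, per document, a count of matched records, and treats the document-level averaging as a simple macro-average; both are assumptions about the exact aggregation.

```python
def retrieval_scores(num_matched, num_predicted, num_ground_truth):
    """Precision, recall, and F1 for one document, given how many predicted
    records the LLM judged to correctly match a ground-truth record."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

def mean_scores(per_document_counts):
    """Macro-averaged precision, recall, and F1 across documents.

    `per_document_counts` is a list of
    (num_matched, num_predicted, num_ground_truth) tuples, one per document.
    """
    scores = [retrieval_scores(*counts) for counts in per_document_counts]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))
```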