
Table 3 Precision, recall, and F1 metrics for entities identified in the outputs of the LLMs, using the first evaluation strategy

From: Comparative analysis of generative LLMs for labeling entities in clinical notes

 

Model                       |   Variation 1          |   Variation 2          |   Variation 3
                            |   P      R      F      |   P      R      F      |   P      R      F
llama-2-7b                  |   0.986  0.084  0.156  |   0.907  0.023  0.046  |   0.967  0.052  0.099
llama-2-7b-chat             |   0.971  0.182  0.306  |   0.952  0.191  0.318  |   0.957  0.145  0.252
codellama-7b-instruct       |   0.971  0.383  0.549  |   0.961  0.220  0.358  |   0.967  0.191  0.318
mistral-7b-v0.1             |   0.960  0.043  0.083  |   0.951  0.046  0.088  |   0.941  0.038  0.074
mistral-7b-instruct-v0.2    |   0.966  0.222  0.361  |   0.955  0.267  0.418  |   0.929  0.118  0.209
mixtral-8x7b-instruct-v0.1  |   0.974  0.428  0.595* |   0.974  0.365  0.531* |   0.958  0.193  0.321*

  1. P = precision; R = recall; F = F1-score. The highest F1-score for each prompt variation is marked with an asterisk (highlighted in the original table).
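
For reference, the F column is the standard F1-score, i.e. the harmonic mean of precision and recall. The following is a minimal Python sketch (not part of the paper's code) that reproduces one table entry from its P and R values:

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: mixtral-8x7b-instruct-v0.1, prompt Variation 1 (values from Table 3)
print(round(f1_score(0.974, 0.428), 3))  # 0.595, matching the F column
```

Small discrepancies of ±0.001 for some rows are expected, since the precision and recall values are themselves rounded to three decimals before the check.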