
Table 4 Precision, recall, and F1 metrics for entities present in the formatted outputs of the LLMs, using the second evaluation strategy and expanded by entity category

From: Comparative analysis of generative LLMs for labeling entities in clinical notes

  

| Model | Category | Variation 2 P | Variation 2 R | Variation 2 F1 | Variation 3 P | Variation 3 R | Variation 3 F1 |
|---|---|---|---|---|---|---|---|
| llama-2-7b | DISEASE | 0.500 | 0.008 | 0.015 | 0.333 | 0.003 | 0.005 |
| | PROCEDURE | 0.000 | 0.000 | 0.000 | 0.500 | 0.001 | 0.003 |
| | SYMPTOM | 0.667 | 0.004 | 0.008 | 0.000 | 0.000 | 0.000 |
| | micro avg | 0.556 | 0.003 | 0.006 | 0.222 | 0.001 | 0.003 |
| llama-2-7b-chat | DISEASE | 0.349 | 0.055 | 0.096 | 0.500 | 0.040 | 0.075 |
| | PROCEDURE | 0.000 | 0.000 | 0.000 | 0.323 | 0.015 | 0.028 |
| | SYMPTOM | 0.494 | 0.080 | 0.138 | 0.472 | 0.033 | 0.062 |
| | micro avg | 0.432 | 0.040 | 0.073 | 0.434 | 0.027 | 0.051 |
| codellama-7b-instruct | DISEASE | 0.413 | 0.096 | 0.155 | 0.000 | 0.000 | 0.000 |
| | PROCEDURE | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | SYMPTOM | 0.560 | 0.210 | 0.305 | 0.000 | 0.000 | 0.000 |
| | micro avg | 0.512 | 0.091 | 0.155 | 0.000 | 0.000 | 0.000 |
| mistral-7b-v0.1 | DISEASE | 0.167 | 0.003 | 0.005 | 0.462 | 0.015 | 0.029 |
| | PROCEDURE | 0.667 | 0.003 | 0.006 | 0.250 | 0.001 | 0.003 |
| | SYMPTOM | 0.462 | 0.012 | 0.023 | 0.400 | 0.004 | 0.008 |
| | micro avg | 0.409 | 0.006 | 0.011 | 0.409 | 0.006 | 0.011 |
| mistral-7b-instruct-v0.2 | DISEASE | 0.467 | 0.141 | 0.217 | 0.571 | 0.101 | 0.171 |
| | PROCEDURE | 0.000 | 0.000 | 0.000 | 0.421 | 0.012 | 0.023 |
| | SYMPTOM | 0.517 | 0.208 | 0.297 | 0.390 | 0.080 | 0.133 |
| | micro avg | 0.498 | 0.102 | 0.170 | 0.459 | 0.056 | 0.100 |
| mixtral-8x7b-instruct-v0.1 | DISEASE | 0.463 | 0.189 | 0.268 | 0.372 | 0.146 | 0.210 |
| | PROCEDURE | 0.429 | 0.031 | 0.058 | 0.319 | 0.087 | 0.137 |
| | SYMPTOM | 0.434 | 0.237 | 0.307 | 0.398 | 0.157 | 0.225 |
| | micro avg | 0.443 | 0.137 | 0.209 | 0.363 | 0.124 | 0.185 |

  1. The highest F1-scores for each prompt variation are highlighted in the table
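
For readers who want to check the table, the following is a minimal sketch of how the F1 column relates to P and R, and of how a micro average is conventionally formed by pooling counts across categories. It assumes the standard definitions of these metrics; the function names `f1_score` and `micro_average` and the `(tp, fp, fn)` count layout are illustrative and are not taken from the paper. The table does not report the underlying counts, so the counts used with `micro_average` are made up.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(counts):
    """Pool (tp, fp, fn) counts over all categories, then compute P, R, F1.

    `counts` maps a category name to a (tp, fp, fn) tuple. This layout is
    an assumption made for illustration only.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Reported micro-average P and R for mixtral-8x7b-instruct-v0.1 (from the table).
print(round(f1_score(0.443, 0.137), 3))  # Variation 2: prints 0.209
print(round(f1_score(0.363, 0.124), 3))  # Variation 3: prints 0.185

# Hypothetical counts, only to show the pooling step.
example_counts = {"DISEASE": (2, 1, 3), "SYMPTOM": (1, 1, 1)}
print(micro_average(example_counts))
```

Because the table rounds P and R to three decimals, an F1 recomputed from the printed values can differ from the reported F1 in the last digit for rows with very small recall.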