

Hugging Face Model Fine-Tuning (Using the Trainer)

<Sources and References>

https://huggingface.co/learn/nlp-course/chapter3/1

 


0. Summary

 - transformers์˜ Trainer๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๊ฐ„๋‹จํ•˜๊ฒŒ fine-tuning์ด ๊ฐ€๋Šฅํ•จ.

 - NLP๋ชฉ์ ์— ๋งž๊ฒŒ ๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์™€์„œ ์‚ฌ์šฉ(๋ถ„๋ฅ˜, ์ƒ์„ฑ ๋“ฑ)

 - ์ „์ฒด ์ฝ”๋“œ๋ฅผ ์ œ์‹œํ•˜๊ณ  ์ฝ”๋“œ ์„ค๋ช…

 - hugging face์˜ ์ฝ”๋“œ๋ฅผ ์ฐธ๊ณ ํ•˜์˜€์Œ.

 - torch.utils.data.Dataset ์œผ๋กœ pytorch ๋ฐ์ดํ„ฐ์…‹ ๋งŒ๋“œ๋Š” ๋ฒ•์„ ์•Œ๊ณ  ์žˆ์–ด์•ผ ํ•จ(์ž์‹ ๋งŒ์˜ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ›ˆ๋ จ)

 

1. Full Code and Explanation

#๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ(A)
import torch

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification #๋ถ„๋ฅ˜ ๋ชจ๋ธ์„ ํ•™์Šต
from transformers import AdamW #optimizer ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

#๋ชจ๋ธ ๋ฐ ํ† ํฌ๋‚˜์ด์ € ๋ถˆ๋Ÿฌ์˜ค๊ธฐ(B)
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) #์ •๋‹ต์˜ labels ์ˆ˜ ์ง€์ •

#๋ฐ์ดํ„ฐ๋ถˆ๋Ÿฌ์˜ค๊ธฐ(C)
from datasets import load_dataset
raw_datasets = load_dataset('glue', 'mrpc')

#ํ† ํฌ๋‚˜์ด์ € ์ผ๊ด„์ ์šฉ์„ ์œ„ํ•œ ํ•จ์ˆ˜(D)
def tokenizer_function(example): 
  return tokenizer(example['sentence1'], example['sentence2'], truncation=True)

#raw_dataset์˜ map ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด์„œ tokenizer_function์„ ๋ชจ๋“  ๋ฐ์ดํ„ฐ์— ์ ์šฉ(E)
tokenized_datasets = raw_datasets.map(tokenizer_function, batched=True)

#Create a collator for dynamic padding (F)
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

#Build the evaluation function (G)
import evaluate
import numpy as np
def compute_metrics(eval_preds):
  metric = evaluate.load('glue', 'mrpc')
  logits, labels = eval_preds
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)

#Set the training arguments (H)
from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch") #only the output directory "test-trainer" is set; evaluation runs once per epoch

#Load and define the Trainer (I)
from transformers import Trainer
trainer = Trainer(
    model, 
    training_args,
    train_dataset = tokenized_datasets['train'],
    eval_dataset = tokenized_datasets['validation'],
    data_collator=data_collator, #the Trainer's default data_collator is already DataCollatorWithPadding, so this can be omitted, but passing it explicitly is clearer
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

#Fine-tune the model (J)
trainer.train()
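
Once trainer.train() finishes, the same Trainer object can be used for evaluation and prediction. The snippet below is a small sketch added for illustration (it is not part of the original course code) and assumes the trainer and tokenized_datasets defined above.

#Evaluate on the validation split; returns eval_loss plus the metrics from compute_metrics
eval_results = trainer.evaluate()
print(eval_results)

#Get raw predictions (logits), labels, and metrics for the validation split
pred_output = trainer.predict(tokenized_datasets['validation'])
print(pred_output.predictions.shape)  #(num_examples, num_labels)
print(pred_output.metrics)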

 

2. ์ž์‹ ์˜ ๋ฐ์ดํ„ฐ๋กœ fine-tuning

 - ๋ฐ์ดํ„ฐ์˜ ํ˜•ํƒœ์™€ ํ‰๊ฐ€ ํ•จ์ˆ˜๋ฅผ ์ˆ˜์ •ํ•ด์ฃผ์–ด์•ผ ํ•จ.

 - ๋ฐ์ดํ„ฐ๊ฐ€  torch.utils.data.Dataset ํ˜•์‹์œผ๋กœ ์•„๋ž˜์™€ ๊ฐ™์€ ํ˜•ํƒœ๋กœ Trainer์˜ train_dataset, eval_dataset์—  ๋“ค์–ด๊ฐ€์•ผ ํ•จ.

{'input_ids': [[  101,  2572,  3217,  5831,  5496,  2010,  2567,  1010,  3183,  2002,
           2170,  1000,  1996,  7409,  1000,  1010,  1997,  9969,  4487, 23809,
           3436,  2010,  3350,  1012,   102,  7727,  2000,  2032,  2004,  2069,
           1000,  1996,  7409,  1000,  1010,  2572,  3217,  5831,  5496,  2010,
           2567,  1997,  9969,  4487, 23809,  3436,  2010,  3350,  1012,   102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1]],
 'label': 1}
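
For reference, a dict in this exact form is what the tokenizer produces for a sentence pair; only the label has to be added separately. A minimal sketch of my own (assuming the tokenizer and raw_datasets from section 1):

#Tokenize one sentence pair; the output already contains input_ids, token_type_ids, and attention_mask
example = raw_datasets['train'][0]
encoded = tokenizer(example['sentence1'], example['sentence2'], truncation=True)
print(encoded['input_ids'])       #token ids including [CLS] and [SEP]
print(encoded['token_type_ids'])  #0 for the first sentence, 1 for the second
print(encoded['attention_mask'])  #1 for every real token (no padding yet)
print(example['label'])           #the label comes from the dataset itself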

 - Building the Dataset (example)

from torch.utils.data import Dataset

class MyMapDataset(Dataset):

    #Initialize the dataset (A)
    def __init__(self, data):
        self.data = data

    #Length of the dataset (B)
    def __len__(self):
        return len(self.data['label'])

    #Return one item (C)
    def __getitem__(self, index):
        return {'input_ids': self.data['input_ids'][index], 
                'token_type_ids': self.data['token_type_ids'][index], 
                'attention_mask': self.data['attention_mask'][index], 
                'label': self.data['label'][index]}

 - Modify the output part (C) so that each item is returned as a dict in the form below.

{'input_ids': [[  101,  2572,  3217,  5831,  5496,  2010,  2567,  1010,  3183,  2002,
           2170,  1000,  1996,  7409,  1000,  1010,  1997,  9969,  4487, 23809,
           3436,  2010,  3350,  1012,   102,  7727,  2000,  2032,  2004,  2069,
           1000,  1996,  7409,  1000,  1010,  2572,  3217,  5831,  5496,  2010,
           2567,  1997,  9969,  4487, 23809,  3436,  2010,  3350,  1012,   102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1]],
 'label': 1}
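
A quick sanity check is to wrap the encoded features in MyMapDataset and index it; every item should come back as a dict like the one above. This is a small sketch of my own, assuming train_data is a dict of lists in the same format as the one built in step (D) of the final code below:

dataset = MyMapDataset(train_data)
print(len(dataset))  #number of examples
print(dataset[0])    #dict with input_ids, token_type_ids, attention_mask, label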

 - Replace the evaluation function with sklearn's accuracy_score and f1_score.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_preds):
  logits, labels = eval_preds
  predictions = np.argmax(logits, axis=-1)
  #๋”•์…”๋„ˆ๋ฆฌ์˜ key๋ฅผ ํ‰๊ฐ€ ์ง€ํ‘œ ์ด๋ฆ„์œผ๋กœ ์„ค์ •
  return {'accuracy': accuracy_score(labels, predictions), 'f1': f1_score(labels, predictions)}
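
During evaluation the Trainer passes compute_metrics a tuple of (logits, labels) as NumPy arrays, so the function can be tested on its own with made-up values (purely illustrative numbers):

dummy_logits = np.array([[0.2, 0.8], [0.9, 0.1], [0.3, 0.7]])  #3 examples, 2 classes
dummy_labels = np.array([1, 0, 0])
print(compute_metrics((dummy_logits, dummy_labels)))  #accuracy 2/3 and f1 2/3 for these values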

 - Final modified code

#๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ(A)
import torch

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification #๋ถ„๋ฅ˜ ๋ชจ๋ธ์„ ํ•™์Šต
from transformers import AdamW #optimizer ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

#๋ชจ๋ธ ๋ฐ ํ† ํฌ๋‚˜์ด์ € ๋ถˆ๋Ÿฌ์˜ค๊ธฐ(B)
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) #์ •๋‹ต์˜ labels ์ˆ˜ ์ง€์ •

#๋ฐ์ดํ„ฐ๋ถˆ๋Ÿฌ์˜ค๊ธฐ(C)
#pytorch ๋ฐ์ดํ„ฐ์…‹ ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜
from datasets import load_dataset
raw_datasets = load_dataset('glue', 'mrpc')

#๋ฐ์ดํ„ฐ ํ˜•์‹ ๋ณ€ํ™˜, ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋ณ€ํ™˜ ์˜ˆ์ •(D)
#pytorch ๋ฐ์ดํ„ฐ์…‹ __ini__ ๋ถ€๋ถ„์— ๋„ฃ์–ด๋„ ๋จ.
train_data = {'input_ids': [],
              'token_type_ids': [],
              'attention_mask': [],
              'label': []}

for i in range(3668): #3668 examples in the MRPC train split
  tokenize = tokenizer(raw_datasets['train']['sentence1'][i], raw_datasets['train']['sentence2'][i], truncation=True)
  train_data['input_ids'].append(tokenize['input_ids']) 
  train_data['token_type_ids'].append(tokenize['token_type_ids'])
  train_data['attention_mask'].append(tokenize['attention_mask'])
  train_data['label'].append(raw_datasets['train']['label'][i])
  

valid_data = {'input_ids': [],
              'token_type_ids': [],
              'attention_mask': [],
              'label': []}

for i in range(408): #408 examples in the MRPC validation split
  tokenize = tokenizer(raw_datasets['validation']['sentence1'][i], raw_datasets['validation']['sentence2'][i], truncation=True)
  valid_data['input_ids'].append(tokenize['input_ids']) 
  valid_data['token_type_ids'].append(tokenize['token_type_ids'])
  valid_data['attention_mask'].append(tokenize['attention_mask'])
  valid_data['label'].append(raw_datasets['validation']['label'][i])   
  
  
#Wrap the data in a PyTorch Dataset (E)
from torch.utils.data import Dataset

class TestDataset(Dataset):

    def __init__(self, data):
        #It is better to move the conversion from (D) here so that it adapts to the dataset size
        self.data = data

    def __len__(self):
        return len(self.data['label'])

    def __getitem__(self, index):
        return {'input_ids': self.data['input_ids'][index], 
                'token_type_ids': self.data['token_type_ids'][index], 
                'attention_mask': self.data['attention_mask'][index], 
                'label': self.data['label'][index]}
                
train_dataset = TestDataset(train_data)
eval_dataset = TestDataset(valid_data)


#Create a collator for dynamic padding (F)
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

#Build the evaluation function (G)
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
def compute_metrics(eval_preds):
  logits, labels = eval_preds
  predictions = np.argmax(logits, axis=-1)
  return {'accuracy': accuracy_score(labels, predictions), 'f1': f1_score(labels, predictions)}

#Set the training arguments (H)
from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch") #only the output directory "test-trainer" is set; evaluation runs once per epoch

#Load and define the Trainer (I)
from transformers import Trainer
trainer = Trainer(
    model, 
    training_args,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    data_collator=data_collator, #the Trainer's default data_collator is already DataCollatorWithPadding, so this can be omitted, but passing it explicitly is clearer
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

#Fine-tune the model (J)
trainer.train()
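
After fine-tuning, the model can be saved and loaded back for inference. The snippet below is a minimal sketch added for illustration; the directory name "test-trainer-final" is just an example, not something from the original post, and the objects come from the code above.

#Save the fine-tuned model and tokenizer
trainer.save_model("test-trainer-final")
tokenizer.save_pretrained("test-trainer-final")

#Reload and classify a new sentence pair
tok = AutoTokenizer.from_pretrained("test-trainer-final")
clf = AutoModelForSequenceClassification.from_pretrained("test-trainer-final")

inputs = tok("The company said sales rose 10%.",
             "Sales increased by ten percent, the company said.",
             return_tensors="pt")
with torch.no_grad():
    logits = clf(**inputs).logits
print(logits.argmax(dim=-1).item())  #MRPC labels: 1 = paraphrase, 0 = not a paraphrase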