[PyTorch Basics] Saving and Loading Models

dendya 2025. 1. 9. 16:03

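Training a model takes a long time, so saving the training result lets you load and reuse it easily later.
In PyTorch, a model is typically saved once training is complete or at the end of particular epochs.
Model files are usually saved with a .pt or .pth extension.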

Saving/Loading the Entire Model

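When the entire model is saved, the model object used for training is stored together with its trained state.
Because the layer structure and the model parameters are all recorded, the same model can be rebuilt from the model file. The save function has the following form.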

torch.save(model, path)

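The save function takes the model instance whose training result you want to keep (model) and the path where the file will be created (path), and writes the trained model to that path.
Because the entire model is saved, the file can require gigabytes of storage in some cases. The load function has the following form.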

model = torch.load(path, map_location=map_location, weights_only=weights_only)

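The load function reads the file at the path where the model was saved (path) and rebuilds the model instance with its parameters applied.
The map_location parameter specifies the device (cuda, cpu, etc.) onto which the model is loaded.
The weights_only parameter controls what may be read from the file: with weights_only=True only tensors and basic Python types such as dictionaries are loaded, which is enough for model weights, while a file containing the entire model needs weights_only=False. Since weights_only=False can execute arbitrary code stored in the file, use it only for files you trust; starting with PyTorch 2.6 the default is True.
The example below shows how to save a model and how to load it.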

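One thing to keep in mind is that a class with the same definition must be declared before the model is loaded.
As in the example below, the CustomModel class must be declared with the same structure for inference to behave the same way.
If you have a whole-model file but do not know the model's structure, you can print the loaded model to check it.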

# Save the model

torch.save(
    model,
    "models/model.pt"
)
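
torch.save does not create missing parent folders, so the "models" directory above has to exist before the call. The following is a minimal sketch of one way to handle that; the temporary directory and the nn.Linear stand-in model are assumptions made only to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn

# Hypothetical stand-in for a trained model.
model = nn.Linear(2, 1)

# A temporary directory keeps this sketch self-contained;
# a real project would use something like Path("models").
save_dir = Path(tempfile.mkdtemp()) / "models"
save_dir.mkdir(parents=True, exist_ok=True)  # create the folder before saving

save_path = save_dir / "model.pt"
torch.save(model, save_path)  # torch.save accepts path-like objects

print(save_path.exists())
```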

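# Load the model

import torch
from torch import nn

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(2, 1)

    def forward(self, x):
        x = self.layer(x)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device) # check the device
# "file_path" is a placeholder for the path of the saved model file.
# weights_only=False unpickles the whole model object; use it only on trusted files.
model = torch.load("file_path", map_location=device, weights_only=False)
print(model) # print the model structure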

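# Inference with the loaded model
with torch.no_grad():
    model.eval()

    # Note: the last row is [11 * 2, 11] = [22, 11]; the output below corresponds to it.
    inputs = torch.tensor(
        [
            [1 ** 2, 1],
            [5 ** 2, 5],
            [11 * 2, 11]
        ],
        dtype=torch.float32
    ).to(device)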

    outputs = model(inputs)
    print(outputs)

CustomModel(
  (layer): Linear(in_features=2, out_features=1, bias=True)
)
tensor([[ 1.7022],
        [69.3758],
        [49.8483]], device='cuda:0')
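
How weights_only interacts with the two kinds of files can be sketched as follows. This is an illustration under assumptions, not part of the original example: it uses in-memory buffers instead of files and a plain nn.Linear as a stand-in model, and the rejection of a whole pickled module under weights_only=True is the behavior expected from recent PyTorch releases.

```python
import io

import torch
from torch import nn

model = nn.Linear(2, 1)  # hypothetical stand-in model

# Save a state dict (plain tensors) and a whole pickled module to memory buffers.
state_buffer = io.BytesIO()
torch.save(model.state_dict(), state_buffer)
model_buffer = io.BytesIO()
torch.save(model, model_buffer)

# weights_only=True is enough for a state dict.
state_buffer.seek(0)
state = torch.load(state_buffer, map_location="cpu", weights_only=True)
print(list(state.keys()))

# A whole pickled module is expected to be rejected under weights_only=True ...
model_buffer.seek(0)
try:
    torch.load(model_buffer, map_location="cpu", weights_only=True)
    print("loaded without error")
except Exception as error:
    print("rejected:", type(error).__name__)

# ... and needs weights_only=False, which should only be used on trusted files.
model_buffer.seek(0)
restored = torch.load(model_buffer, map_location="cpu", weights_only=False)
print(restored)
```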

Saving/Loading the Model State

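Saving all of the model's information takes a lot of storage space, so saving only the model's parameters is also a common approach.
The one difference from saving the entire model is that the result of the state_dict method, not the model variable itself, is passed to torch.save.
The model state (model.state_dict()) returns the model's learnable parameters, together with any persistent buffers, as an ordered dictionary (OrderedDict).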
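
As a quick illustration of what the state dictionary contains, the sketch below lists its keys and tensor shapes for the same CustomModel structure used in this post. The weights here are freshly initialized rather than trained, which is an assumption made to keep the example self-contained.

```python
import torch
from torch import nn

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(2, 1)

    def forward(self, x):
        return self.layer(x)

model = CustomModel()  # untrained, randomly initialized weights
state = model.state_dict()

# Keys follow "<attribute name>.<parameter name>", in registration order.
for name, tensor in state.items():
    print(name, tuple(tensor.shape))
# layer.weight (1, 2)
# layer.bias (1,)
```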

# Save the model state

torch.save(
    model.state_dict(),
    "models/model_state_dict.pt"
)

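The approach above saves the trained model's weights and biases.
Because only the data needed for inference is saved, the same model structure must be implemented in the code that loads it.
A model state file is also loaded with the torch.load function.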

# Load the model state

import torch
from torch import nn

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(2, 1)

    def forward(self, x):
        x = self.layer(x)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CustomModel().to(device)

# "file_path" is a placeholder for the path of the saved state file.
# A state dict holds only tensors, so weights_only=True is sufficient here.
model_state_dict = torch.load("file_path", map_location=device, weights_only=True)
print(model_state_dict) # print the model state
model.load_state_dict(model_state_dict)

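# Inference with the loaded model
with torch.no_grad():
    model.eval()

    # Note: the last row is [11 * 2, 11] = [22, 11]; the output below corresponds to it.
    inputs = torch.tensor(
        [
            [1 ** 2, 1],
            [5 ** 2, 5],
            [11 * 2, 11]
        ],
        dtype=torch.float32
    ).to(device)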

    outputs = model(inputs)
    print(outputs)

OrderedDict([('layer.weight', tensor([[ 3.1035, -1.7028]], device='cuda:0')), ('layer.bias', tensor([0.3015], device='cuda:0'))])
tensor([[ 1.7022],
        [69.3758],
        [49.8483]], device='cuda:0')
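
The requirement that the loading code implement the same structure can be checked with a small round trip. The sketch below is illustrative only: it saves to an in-memory buffer, uses randomly initialized weights in place of a trained model, and introduces a hypothetical WrongModel class to show what a structure mismatch looks like.

```python
import io

import torch
from torch import nn

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(2, 1)

    def forward(self, x):
        return self.layer(x)

class WrongModel(nn.Module):  # hypothetical class with a different layer shape
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(3, 1)

    def forward(self, x):
        return self.layer(x)

torch.manual_seed(0)
trained = CustomModel()  # stand-in for a trained model

buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)
state = torch.load(buffer, map_location="cpu", weights_only=True)

# Same structure: the restored model reproduces the original outputs.
restored = CustomModel()
restored.load_state_dict(state)
restored.eval()
inputs = torch.tensor([[1.0, 1.0], [25.0, 5.0]])
with torch.no_grad():
    same = torch.equal(trained(inputs), restored(inputs))
print(same)

# Different structure: load_state_dict reports the shape mismatch.
try:
    WrongModel().load_state_dict(state)
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True
print(mismatch_detected)
```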

Saving/Loading Checkpoints

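A checkpoint is a snapshot of the model's state saved at particular points during training.
With a large dataset and a deep model, training takes a long time, and it can be interrupted along the way by unexpected errors or resource exhaustion.
To keep the results from being lost when training stops, the trained state is saved every fixed number of epochs so that training can be resumed later.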

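A checkpoint is also saved with the torch.save function, with the various states stored in a dictionary.
Since the purpose is to resume training, the epoch (epoch), the model state (model.state_dict), and the optimizer state (optimizer.state_dict) must be included; additional information about the checkpoint, such as integers, floats, or strings, may be stored as well.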

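A saved checkpoint is then read back with the torch.load function, which gives access to each of the stored states.
To resume training, the checkpoint's epoch is used to set the starting value of the training loop.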

# Save checkpoints

import torch
import pandas as pd
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# (omitted)

checkpoint = 1
for epoch in range(10000):
    cost = 0.0

    for x, y in train_dataloader:
        x = x.to(device)
        y = y.to(device)

        output = model(x)
        loss = criterion(output, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        cost += loss.item() # accumulate a Python float rather than a graph-attached tensor

    cost = cost / len(train_dataloader)

    if (epoch + 1) % 1000 == 0:
        torch.save(
            {
                "model": "CustomModel", # model name
                "epoch": epoch, # epoch (0-based index of the epoch just finished)
                "model_state_dict": model.state_dict(), # model parameters
                "optimizer_state_dict": optimizer.state_dict(), # optimizer state
                "cost": cost, # current loss value
                "description": f"CustomModel Checkpoint-{checkpoint}", # description of the checkpoint
            },
            f"models/checkpoint-{checkpoint}.pt" # path where the checkpoint file is saved
        )

        print(f"checkpoint-{checkpoint} saved!") # confirm that the checkpoint was saved
        checkpoint += 1

checkpoint-1 saved!
checkpoint-2 saved!
checkpoint-3 saved!
checkpoint-4 saved!
checkpoint-5 saved!
checkpoint-6 saved!
checkpoint-7 saved!
checkpoint-8 saved!
checkpoint-9 saved!
checkpoint-10 saved!

# Load a checkpoint

import torch
import pandas as pd
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# (omitted)

# Load the checkpoint and the information stored in it
checkpoint = torch.load("models/checkpoint-6.pt", weights_only=False)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
checkpoint_epoch = checkpoint["epoch"]
checkpoint_description = checkpoint["description"]
print(checkpoint_description)

# Resume training from the epoch after the checkpoint
for epoch in range(checkpoint_epoch + 1, 10000):
    cost = 0.0

    for x, y in train_dataloader:
        x = x.to(device)
        y = y.to(device)

        output = model(x)
        loss = criterion(output, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        cost += loss.item() # accumulate a Python float rather than a graph-attached tensor

    cost = cost / len(train_dataloader)

    if (epoch + 1) % 1000 == 0:
        print(f"Epoch: {epoch + 1: 4d}, Model: {list(model.parameters())}, Cost: {cost: .3f}")

CustomModel Checkpoint-6
Epoch:  7000, Model: [Parameter containing:
tensor([[ 3.1092, -1.7025]], device='cuda:0', requires_grad=True), Parameter containing:
tensor([-0.0634], device='cuda:0', requires_grad=True)], Cost:  0.199
Epoch:  8000, Model: [Parameter containing:
tensor([[ 3.1084, -1.7024]], device='cuda:0', requires_grad=True), Parameter containing:
tensor([-0.0159], device='cuda:0', requires_grad=True)], Cost:  0.193
Epoch:  9000, Model: [Parameter containing:
tensor([[ 3.1082, -1.7029]], device='cuda:0', requires_grad=True), Parameter containing:
tensor([0.0276], device='cuda:0', requires_grad=True)], Cost:  0.186
Epoch:  10000, Model: [Parameter containing:
tensor([[ 3.1068, -1.7029]], device='cuda:0', requires_grad=True), Parameter containing:
tensor([0.0672], device='cuda:0', requires_grad=True)], Cost:  0.141
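
The save-and-resume flow above can also be tried end to end without the omitted dataset code. The following is a minimal sketch under several assumptions: the synthetic data, the train helper, the SGD settings, and the in-memory buffer are all hypothetical stand-ins, chosen so that a resumed run can be compared against an uninterrupted one.

```python
import io

import torch
from torch import nn, optim

torch.manual_seed(0)
x = torch.randn(32, 2)
y = x @ torch.tensor([[3.1], [-1.7]]) + 0.3  # synthetic targets

def train(model, optimizer, start_epoch, end_epoch):
    """Full-batch training from start_epoch up to (not including) end_epoch."""
    criterion = nn.MSELoss()
    for epoch in range(start_epoch, end_epoch):
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()

model = nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
train(model, optimizer, 0, 50)

# Save a checkpoint after epoch index 49 (an in-memory buffer stands in for a file).
buffer = io.BytesIO()
torch.save(
    {
        "epoch": 49,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    buffer,
)

# Reference: keep training the original objects without interruption.
reference_loss = train(model, optimizer, 50, 100)

# Resume: rebuild model and optimizer, restore both states, continue from epoch + 1.
buffer.seek(0)
checkpoint = torch.load(buffer, map_location="cpu", weights_only=True)
resumed_model = nn.Linear(2, 1)
resumed_optimizer = optim.SGD(resumed_model.parameters(), lr=0.01, momentum=0.9)
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
resumed_loss = train(resumed_model, resumed_optimizer, checkpoint["epoch"] + 1, 100)

print(abs(reference_loss - resumed_loss) < 1e-6)
```

Restoring the optimizer state matters here because SGD with momentum keeps a running buffer per parameter; loading only the model weights would restart that buffer from zero and the resumed run would drift from the uninterrupted one.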