JINWOOJUNG


[EECS 498] Assignment 2. Two Layer Neural Network...(1)

Jinu_01 2024. 12. 29. 23:23

This post summarizes what I studied while taking the EECS 498 course at the University of Michigan.


https://jinwoo-jung.tistory.com/125

 

[EECS 498] Assignment 2. Linear Classifier...(2)


 


Introduction

This assignment designs a Neural Network with FC (fully connected) layers for classification and tests it on the CIFAR-10 dataset.

The network uses the Softmax loss function with L2 regularization, and applies ReLU as the activation function after the first FC layer.

 

$$ \text{Input} \rightarrow \text{FC Layer} \rightarrow \text{ReLU} \rightarrow \text{FC Layer} \rightarrow \text{Softmax} $$

 

Implement Neural Network

  • Forward pass : compute scores
def nn_forward_pass(params: Dict[str, torch.Tensor], X: torch.Tensor):
    """
    The first stage of our neural network implementation: Run the forward pass
    of the network to compute the hidden layer features and classification
    scores. The network architecture should be:

    FC layer -> ReLU (hidden) -> FC layer (scores)

    As a practice, we will NOT allow to use torch.relu and torch.nn ops
    just for this time (you can use it from A3).

    Inputs:
    - params: a dictionary of PyTorch Tensor that store the weights of a model.
      It should have following keys with shape
          W1: First layer weights; has shape (D, H)
          b1: First layer biases; has shape (H,)
          W2: Second layer weights; has shape (H, C)
          b2: Second layer biases; has shape (C,)
    - X: Input data of shape (N, D). Each X[i] is a training sample.

    Returns a tuple of:
    - scores: Tensor of shape (N, C) giving the classification scores for X
    - hidden: Tensor of shape (N, H) giving the hidden layer representation
      for each input value (after the ReLU).
    """
    # Unpack variables from the params dictionary
    W1, b1 = params["W1"], params["b1"]
    W2, b2 = params["W2"], params["b2"]
    N, D = X.shape

    # Compute the forward pass
    hidden = None
    scores = None
    ############################################################################
    # TODO: Perform the forward pass, computing the class scores for the input.#
    # Store the result in the scores variable, which should be an tensor of    #
    # shape (N, C).                                                            #
    ############################################################################
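    # First FC layer: (N, D) x (D, H) + (H,) -> (N, H)
    hidden = X.mm(W1) + b1
    # ReLU without torch.relu: zero out every negative entry in place
    hidden[hidden<0] = 0
    # Second FC layer: (N, H) x (H, C) + (C,) -> (N, C)
    scores = hidden.mm(W2) + b2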
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return scores, hidden

 

The forward pass computes the scores. The input tensor X is matrix-multiplied with the weight matrix W1, and the bias b1 is added to obtain the hidden layer. ReLU is then implemented by setting every negative value of the hidden layer to 0. The last FC layer is computed the same way as the first one, which yields the score for each class.
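The masking above edits `hidden` in place. As a minimal sketch (not the assignment's reference solution), the same ReLU can also be written out of place with `clamp`, which still avoids `torch.relu` and `torch.nn`. The function name `forward_sketch`, the random parameters, and the toy-sized shapes are assumptions made only for this illustration.

```python
import torch

def forward_sketch(params, X):
    """Hypothetical variant of nn_forward_pass using an out-of-place ReLU."""
    W1, b1 = params["W1"], params["b1"]
    W2, b2 = params["W2"], params["b2"]
    pre_act = X.mm(W1) + b1          # (N, H) pre-activation
    hidden = pre_act.clamp(min=0)    # ReLU; pre_act itself is left untouched
    scores = hidden.mm(W2) + b2      # (N, C) class scores
    return scores, hidden

torch.manual_seed(0)
N, D, H, C = 5, 4, 10, 3             # toy-sized shapes (assumed for this sketch)
params = {
    "W1": torch.randn(D, H), "b1": torch.zeros(H),
    "W2": torch.randn(H, C), "b2": torch.zeros(C),
}
X = torch.randn(N, D)
scores, hidden = forward_sketch(params, X)

# Reference: the in-place masking used in the post
ref_hidden = X.mm(params["W1"]) + params["b1"]
ref_hidden[ref_hidden < 0] = 0
ref_scores = ref_hidden.mm(params["W2"]) + params["b2"]

print(torch.allclose(scores, ref_scores))
```

Both versions should produce the same values; the out-of-place form simply keeps the pre-activation around in case it is needed later.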

 

import torch
import eecs598
from eecs598.a2_helpers import get_toy_data
from two_layer_net import nn_forward_pass

eecs598.reset_seed(0)
toy_X, toy_y, params = get_toy_data()

# YOUR_TURN: Implement the score computation part of nn_forward_pass
scores, _ = nn_forward_pass(params, toy_X)
print('Your scores:')
print(scores)
print(scores.dtype)
print()
print('correct scores:')
correct_scores = torch.tensor([
        [ 9.7003e-08, -1.1143e-07, -3.9961e-08],
        [-7.4297e-08,  1.1502e-07,  1.5685e-07],
        [-2.5860e-07,  2.2765e-07,  3.2453e-07],
        [-4.7257e-07,  9.0935e-07,  4.0368e-07],
        [-1.8395e-07,  7.9303e-08,  6.0360e-07]], dtype=torch.float32, device=scores.device)
print(correct_scores)
print()

# The difference should be very small. We get < 1e-10
scores_diff = (scores - correct_scores).abs().sum().item()
print('Difference between your scores and correct scores: %.2e' % scores_diff)

 

get_toy_data from eecs598.a2_helpers generates a random input, random weights, and zero biases. The nn_forward_pass function implemented above then computes the score for each class.

 

Your scores:
tensor([[ 9.7003e-08, -1.1143e-07, -3.9961e-08],
        [-7.4297e-08,  1.1502e-07,  1.5685e-07],
        [-2.5860e-07,  2.2765e-07,  3.2453e-07],
        [-4.7257e-07,  9.0935e-07,  4.0368e-07],
        [-1.8395e-07,  7.9303e-08,  6.0360e-07]], device='cuda:0')
torch.float32

correct scores:
tensor([[ 9.7003e-08, -1.1143e-07, -3.9961e-08],
        [-7.4297e-08,  1.1502e-07,  1.5685e-07],
        [-2.5860e-07,  2.2765e-07,  3.2453e-07],
        [-4.7257e-07,  9.0935e-07,  4.0368e-07],
        [-1.8395e-07,  7.9303e-08,  6.0360e-07]], device='cuda:0')

Difference between your scores and correct scores: 2.24e-11

 

The difference is very small, which indicates that the forward pass is implemented correctly.

 

  • Forward pass : compute losses

The current shapes of the input data, weight matrices, and biases are as follows.

# X.shape:  torch.Size([5, 4])
# W1.shape:  torch.Size([4, 10])
# b1.shape:  torch.Size([10])
# W2.shape:  torch.Size([10, 3])
# b2.shape:  torch.Size([3])

 

To obtain the loss, take the scores computed by nn_forward_pass and apply the Cross-Entropy loss implemented in the previous assignment.

num_train = X.shape[0]

max_score = scores.max(axis = 1).values.view(-1, 1)   # max score of each row (input sample)
scores -= max_score                                    # shift each row for numerical stability
scores = torch.exp(scores)
prob = scores / scores.sum(axis = 1).view(-1, 1)   # divide each row by its sum -> probabilities

idx0 = torch.arange(0, num_train)
loss = -torch.log(prob[idx0, y]).sum()   # use only the probability of each sample's correct class

# Regularization term for W1 and W2
loss = loss / num_train + reg * (torch.sum(W1*W1) + torch.sum(W2*W2))

 

Before computing the probabilities, the maximum score of each row is subtracted to prevent overflow in the exponential. torch.exp is then applied, and each row is divided by its sum to obtain the probabilities.
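To see why the max subtraction matters, here is a small sketch in plain Python (no PyTorch, so the overflow surfaces as an exception). The helper names `softmax_naive` and `softmax_stable` and the example scores are hypothetical. In PyTorch the same overflow would typically show up as `inf` and then `nan` rather than an error.

```python
import math

def softmax_naive(scores):
    exps = [math.exp(s) for s in scores]      # overflows for large scores
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]   # same differences as `small`, shifted by 999

try:
    softmax_naive(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print("naive overflowed:", naive_overflowed)
print("stable result:", softmax_stable(large))
```

Because softmax depends only on the differences between scores, the stable version returns the same probabilities for `small` and `large`.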

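Cross-Entropy loss only involves the correct class, so the probability of the correct class is gathered for each input, the per-sample losses are summed and divided by the number of training samples, and the regularization term is added. Since the network has two layers, both $W_1$ and $W_2$ are included in the regularization.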

import eecs598
from eecs598.a2_helpers import get_toy_data
from two_layer_net import nn_forward_backward

eecs598.reset_seed(0)
toy_X, toy_y, params = get_toy_data()

# YOUR_TURN: Implement the loss computation part of nn_forward_backward
loss, _ = nn_forward_backward(params, toy_X, toy_y, reg=0.05)
print('Your loss: ', loss.item())
correct_loss = 1.0986121892929077
print('Correct loss: ', correct_loss)
diff = (correct_loss - loss).item()

# should be very small, we get < 1e-4
print('Difference: %.4e' % diff)

# Your loss:  1.0986121892929077
# Correct loss:  1.0986121892929077
# Difference: 0.0000e+00
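The value 1.0986... has a simple interpretation. The toy scores are on the order of 1e-7, so softmax assigns almost exactly 1/3 to each of the 3 classes, and the cross-entropy is then approximately $-\ln(1/3) = \ln 3$. The close match also suggests that the regularization term contributes very little for these toy weights (an inference from the numbers, not something the helper code states). A quick arithmetic check:

```python
import math

num_classes = 3
uniform_prob = 1.0 / num_classes
loss_uniform = -math.log(uniform_prob)   # cross-entropy when every class gets 1/3

reported_loss = 1.0986121892929077       # value printed in the check above
gap = abs(loss_uniform - reported_loss)
print(loss_uniform, gap)
```

This is a handy sanity check in general: with near-zero scores, the softmax loss of a C-class classifier should be close to ln C.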

 

The gradients are computed by backpropagation, starting from h1 and scores obtained in the forward pass.

 

The two-layer network can be written compactly as follows.

$$s = h_1 W_2 + b_2 \quad , \quad h_1 = \mathrm{ReLU}(X W_1 + b_1)$$

 

For example, the gradient with respect to $W_2$ is:

$$ \frac{\partial L}{\partial W_2} = \frac{\partial s}{\partial W_2} \frac{\partial L}{\partial s}$$

 

Here $\frac{\partial s}{\partial W_2}$ is simply $h_1$. From the previous assignment, recall that $\frac{\partial L}{\partial s}$ equals $\frac{p_j-1}{N}$ when $j = y_i$ and $\frac{p_j}{N}$ otherwise. The gradients for every weight and bias are computed with the chain rule in this way.

 

prob[idx0, y] -= 1
d_loss_score = prob / num_train

grads['W2'] = h1.T.mm(d_loss_score)     # 10x3
grads['W2'] += 2 * reg * W2             # Regularization Term

grads['b2'] = d_loss_score.sum(axis=0)  # 3, Vector

d_loss_h1 = d_loss_score.mm(W2.T).clone()       # 5x10
d_loss_h1[h1<=0] = 0

grads['W1'] = X.T.mm(d_loss_h1)         # 4x10
grads['W1'] += 2 * reg * W1

grads['b1'] = d_loss_h1.sum(axis=0)     # 10, Vector

 

First, d_loss_score, which is $\frac{\partial L}{\partial s}$, is computed. For convenience, 1 is subtracted from the probability of each correct class in prob, and the result is divided by N (num_train).

 

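As mentioned above, $ \frac{\partial L}{\partial W_2} $ is easy to compute: it is h1 transposed times d_loss_score. The gradient of the regularization term for $W_2$, $2 \cdot reg \cdot W_2$, is then added.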

 

For $ \frac{\partial L}{\partial b_2 } $, the chain rule applies as follows.

$$ \frac{\partial L}{\partial b_2} = \frac{\partial s}{\partial b_2} \frac{\partial L}{\partial s}$$

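Since $\frac{\partial s}{\partial b_2}$ is 1, we get $\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial s}$. There is one bias per output class (one per column of $W_2$), and it is shared by all N samples, so the gradient is obtained by summing d_loss_score over each column.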
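The column-wise sum can be checked against autograd. The sketch below uses random stand-in values for `h1`, `W2`, and the upstream gradient (all hypothetical, chosen only to match the toy shapes) and compares `b2.grad` with the manual sum over the batch dimension.

```python
import torch

torch.manual_seed(0)
N, H, C = 5, 10, 3
h1 = torch.randn(N, H)
W2 = torch.randn(H, C)
b2 = torch.zeros(C, requires_grad=True)

scores = h1.mm(W2) + b2                 # b2 is broadcast to all N rows
upstream = torch.randn(N, C)            # stand-in for dL/ds (hypothetical values)
scores.backward(upstream)               # autograd accumulates dL/db2 into b2.grad

manual_db2 = upstream.sum(dim=0)        # column-wise sum, as in grads['b2']
print(torch.allclose(b2.grad, manual_db2))
```

Broadcasting in the forward pass turns into a sum in the backward pass, which is why the bias gradient has shape (C,) rather than (N, C).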

 

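The gradients for $W_1$ and $b_1$ are computed with the chain rule in the same way. The one point to watch is that the chain now passes through the hidden layer:

$$\frac{\partial L}{\partial W_1} = \frac{\partial h_1}{\partial W_1} \frac{\partial s}{\partial h_1} \frac{\partial L}{\partial s}$$

Because $h_1$ is the output of ReLU, the gradient flowing back into the hidden layer is set to 0 wherever $h_1 \le 0$ (the line d_loss_h1[h1<=0] = 0) before it is multiplied with X.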

 

 

All errors are below 1e-4, which indicates that the gradients were computed correctly.
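A similar check can be run without the course helpers by comparing the manual gradients against autograd. This is a sketch under assumptions: the function `loss_and_grads` is a hypothetical re-assembly of the snippets in this post (not the author's full `nn_forward_backward`), the data is random, and double precision is used so the comparison is tight.

```python
import torch

def loss_and_grads(params, X, y, reg):
    """Hypothetical re-assembly of the forward/backward snippets in this post."""
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    N = X.shape[0]
    h1 = (X.mm(W1) + b1).clamp(min=0)
    scores = h1.mm(W2) + b2
    shifted = scores - scores.max(dim=1, keepdim=True).values
    prob = shifted.exp() / shifted.exp().sum(dim=1, keepdim=True)
    idx = torch.arange(N)
    loss = -prob[idx, y].log().sum() / N + reg * ((W1 * W1).sum() + (W2 * W2).sum())

    with torch.no_grad():                # manual backward pass, no autograd tracking
        d_score = prob.clone()
        d_score[idx, y] -= 1
        d_score /= N
        d_h1 = d_score.mm(W2.t())
        d_h1[h1 <= 0] = 0                # ReLU gate
        grads = {
            "W2": h1.t().mm(d_score) + 2 * reg * W2,
            "b2": d_score.sum(dim=0),
            "W1": X.t().mm(d_h1) + 2 * reg * W1,
            "b1": d_h1.sum(dim=0),
        }
    return loss, grads

torch.manual_seed(0)
N, D, H, C = 5, 4, 10, 3
X = torch.randn(N, D, dtype=torch.float64)
y = torch.randint(0, C, (N,))
names = ("W1", "b1", "W2", "b2")
shapes = {"W1": (D, H), "b1": (H,), "W2": (H, C), "b2": (C,)}
params = {k: torch.randn(*shapes[k], dtype=torch.float64) for k in names}

# Manual gradients
_, manual = loss_and_grads(params, X, y, reg=0.05)

# Autograd gradients of the same loss
tracked = {k: v.clone().requires_grad_(True) for k, v in params.items()}
loss, _ = loss_and_grads(tracked, X, y, reg=0.05)
auto = torch.autograd.grad(loss, [tracked[k] for k in names])

max_err = max((manual[k] - g).abs().max().item() for k, g in zip(names, auto))
print("max abs difference:", max_err)
```

If the manual backward pass is right, the maximum difference should be at the level of floating-point rounding.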

 

  • Train the network
def nn_train(
    params: Dict[str, torch.Tensor],
    loss_func: Callable,
    pred_func: Callable,
    X: torch.Tensor,
    y: torch.Tensor,
    X_val: torch.Tensor,
    y_val: torch.Tensor,
    learning_rate: float = 1e-3,
    learning_rate_decay: float = 0.95,
    reg: float = 5e-6,
    num_iters: int = 100,
    batch_size: int = 200,
    verbose: bool = False,
):
    """
    Train this neural network using stochastic gradient descent.

    Inputs:
    - params: a dictionary of PyTorch Tensor that store the weights of a model.
      It should have following keys with shape
          W1: First layer weights; has shape (D, H)
          b1: First layer biases; has shape (H,)
          W2: Second layer weights; has shape (H, C)
          b2: Second layer biases; has shape (C,)
    - loss_func: a loss function that computes the loss and the gradients.
      It takes as input:
      - params: Same as input to nn_train
      - X_batch: A minibatch of inputs of shape (B, D)
      - y_batch: Ground-truth labels for X_batch
      - reg: Same as input to nn_train
      And it returns a tuple of:
        - loss: Scalar giving the loss on the minibatch
        - grads: Dictionary mapping parameter names to gradients of the loss with
          respect to the corresponding parameter.
    - pred_func: prediction function that im
    - X: A PyTorch tensor of shape (N, D) giving training data.
    - y: A PyTorch tensor of shape (N,) giving training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - X_val: A PyTorch tensor of shape (N_val, D) giving validation data.
    - y_val: A PyTorch tensor of shape (N_val,) giving validation labels.
    - learning_rate: Scalar giving learning rate for optimization.
    - learning_rate_decay: Scalar giving factor used to decay the learning rate
      after each epoch.
    - reg: Scalar giving regularization strength.
    - num_iters: Number of steps to take when optimizing.
    - batch_size: Number of training examples to use per step.
    - verbose: boolean; if true print progress during optimization.

    Returns: A dictionary giving statistics about the training process
    """
    num_train = X.shape[0]
    iterations_per_epoch = max(num_train // batch_size, 1)

    # Use SGD to optimize the parameters in self.model
    loss_history = []
    train_acc_history = []
    val_acc_history = []

    for it in range(num_iters):
        X_batch, y_batch = sample_batch(X, y, num_train, batch_size)

        # Compute loss and gradients using the current minibatch
        loss, grads = loss_func(params, X_batch, y=y_batch, reg=reg)
        loss_history.append(loss.item())

        #########################################################################
        # TODO: Use the gradients in the grads dictionary to update the         #
        # parameters of the network (stored in the dictionary self.params)      #
        # using stochastic gradient descent. You'll need to use the gradients   #
        # stored in the grads dictionary defined above.                         #
        #########################################################################
        for param in params.keys():
            # SGD step: move each parameter against its gradient (in place)
            params[param] -= learning_rate * grads[param]
        #########################################################################
        #                             END OF YOUR CODE                          #
        #########################################################################

        if verbose and it % 100 == 0:
            print("iteration %d / %d: loss %f" % (it, num_iters, loss.item()))

        # Every epoch, check train and val accuracy and decay learning rate.
        if it % iterations_per_epoch == 0:
            # Check accuracy
            y_train_pred = pred_func(params, loss_func, X_batch)
            train_acc = (y_train_pred == y_batch).float().mean().item()
            y_val_pred = pred_func(params, loss_func, X_val)
            val_acc = (y_val_pred == y_val).float().mean().item()
            train_acc_history.append(train_acc)
            val_acc_history.append(val_acc)

            # Decay learning rate
            learning_rate *= learning_rate_decay

    return {
        "loss_history": loss_history,
        "train_acc_history": train_acc_history,
        "val_acc_history": val_acc_history,
    }

 

Training the Neural Network means repeatedly updating the weights using the loss and gradients implemented above. The loss_func passed to nn_train is the nn_forward_backward function implemented earlier; the gradients of the weights and biases computed during backpropagation are returned in grads as a dictionary. Each parameter is therefore accessed by its key and updated by subtracting learning_rate times its gradient.
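The update rule can be tried in isolation on a toy problem. In this sketch, `quadratic_loss_and_grads` is a hypothetical stand-in for loss_func (it is not part of the assignment), and the learning rate is decayed on every step because, with such a tiny problem, one step is treated as one epoch.

```python
def quadratic_loss_and_grads(params):
    """Hypothetical stand-in for loss_func: loss = w^2 + b^2."""
    loss = params["w"] ** 2 + params["b"] ** 2
    grads = {"w": 2 * params["w"], "b": 2 * params["b"]}
    return loss, grads

params = {"w": 3.0, "b": -2.0}
learning_rate = 0.1
learning_rate_decay = 0.95
loss_history = []

for it in range(50):
    loss, grads = quadratic_loss_and_grads(params)
    loss_history.append(loss)
    # Same pattern as in nn_train: one SGD step per parameter key
    for name in params:
        params[name] -= learning_rate * grads[name]
    learning_rate *= learning_rate_decay   # decay once per (toy) epoch

print(loss_history[0], loss_history[-1])
```

The loss shrinks on every step here because the learning rate is small enough for this quadratic; with a rate that is too large, the same loop would diverge.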

 

import matplotlib.pyplot as plt
import eecs598
from eecs598.a2_helpers import get_toy_data
from two_layer_net import nn_forward_backward, nn_train, nn_predict

eecs598.reset_seed(0)
toy_X, toy_y, params = get_toy_data()

# YOUR_TURN: Implement the nn_train function.
#            You may need to check nn_predict function (the "pred_func") as well.
stats = nn_train(params, nn_forward_backward, nn_predict, toy_X, toy_y, toy_X, toy_y,
                 learning_rate=1e-1, reg=1e-6,
                 num_iters=200, verbose=False)

print('Final training loss: ', stats['loss_history'][-1])
# Final training loss:  0.5211756229400635

# plot the loss history
plt.plot(stats['loss_history'], 'o')
plt.xlabel('Iteration')
plt.ylabel('training loss')
plt.title('Training Loss history')
plt.show()

 

As the training iterations increase, the loss decreases, and the train and validation accuracy increase as well. However, the accuracy is already close to 1 once training passes 50 epochs. The loss history shows the same thing: after about 75 epochs the loss barely decreases.

 

 
