
step49. The Dataset Class and Preprocessing

도녁 · 2023. 3. 1. 16:45

πŸ“’ This post is based on Deep Learning from Scratch 3. I write these posts to record what I've learned and for my own study. For the full details, I strongly recommend buying the book.

 

 

이번 λ‹¨κ³„μ—μ„œλŠ” μ €λ²ˆ λ‹¨κ³„μ—μ„œ μ‚¬μš©ν–ˆλ˜ Dataset ν΄λž˜μŠ€μ— λŒ€ν•΄ μžμ„Ένžˆ μ‚΄νŽ΄λ³΄κ² μŠ΅λ‹ˆλ‹€. μ‚¬μš©μžκ°€ μ‹€μ œλ‘œ μ‚¬μš©ν•˜λŠ” 데이터셋은 기반 클래슀λ₯Ό 상속받아 κ΅¬ν˜„ν•©λ‹ˆλ‹€. λ¨Όμ € 기반 클래슀 μ½”λ“œλ₯Ό λ³΄κ² μŠ΅λ‹ˆλ‹€.

 

Dataset ν΄λž˜μŠ€μ—μ„œλŠ” __getitem__κ³Ό __len__ λ©”μ„œλ“œκ°€ μ€‘μš”ν•©λ‹ˆλ‹€. __len__ λ©”μ„œλ“œλŠ” len ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•  λ•Œ 호좜되고, __getitem__ λ©”μ„œλ“œλŠ” μ§€μ •λœ μΈλ±μŠ€μ— μœ„μΉ˜ν•˜λŠ” 데이터λ₯Ό κΊΌλ‚Ό λ•Œ μ‚¬μš©ν•©λ‹ˆλ‹€. 기본적으둜 λ”₯λŸ¬λ‹ ν•™μŠ΅μ—λŠ” 데이터와 라벨이 ν•„μš”ν•˜λ―€λ‘œ, μƒμ„±μžμ—μ„œ 지정해주도둝 ν•©λ‹ˆλ‹€. λ³„κ°œλ‘œ transform은 데이터셋 μ „μ²˜λ¦¬ μ‹œμ— ν™œμš©ν•©λ‹ˆλ‹€. ν•™μŠ΅ν•˜κΈ° μ „ λ°μ΄ν„°μ—μ„œ νŠΉμ • 값을 μ œκ±°ν•˜κ±°λ‚˜, 쒌우 λ°˜μ „ λ“±μ˜ 데이터 수λ₯Ό μΈμœ„μ μœΌλ‘œ λŠ˜λ¦¬λŠ” κ²½μš°κ°€ λ§ŽμŠ΅λ‹ˆλ‹€. μ΄λŸ¬ν•œ μ „μ²˜λ¦¬ κΈ°μˆ λ“€μ— λŒ€μ‘ν•˜κΈ° μœ„ν•΄ transform κΈ°λŠ₯을 μΆ”κ°€ν•©λ‹ˆλ‹€. λ§Œμ•½ transform이 인자둜 λ“€μ–΄μ˜€μ§€ μ•ŠλŠ” 경우 lambda x: xλ₯Ό 톡해 원본 데이터λ₯Ό κ·ΈλŒ€λ‘œ λ°˜ν™˜ν•©λ‹ˆλ‹€.

import numpy as np


class Dataset:
    def __init__(self, train=True, transform=None, target_transform=None):
        self.train = train
        self.transform = transform
        self.target_transform = target_transform
        # When no transform is given, pass the data through unchanged.
        if self.transform is None:
            self.transform = lambda x: x
        if self.target_transform is None:
            self.target_transform = lambda x: x

        self.data = None
        self.label = None
        self.prepare()

    def __getitem__(self, index):
        assert np.isscalar(index)  # only a single (scalar) index is supported
        if self.label is None:
            return self.transform(self.data[index]), None
        else:
            return self.transform(self.data[index]), \
                   self.target_transform(self.label[index])

    def __len__(self):
        return len(self.data)

    def prepare(self):
        pass  # subclasses load self.data / self.label here
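
To see the transform hook in action, here is a minimal sketch of my own (the Toy subclass and the scaling function are illustrations, not from the book; it reuses the Dataset class defined above):

import numpy as np

class Toy(Dataset):
    def prepare(self):
        self.data = np.array([[1.0, 2.0], [3.0, 4.0]])
        self.label = np.array([0, 1])

toy = Toy(transform=lambda x: x / 4.0)  # applied lazily in __getitem__
x, t = toy[0]
print(x)  # [0.25 0.5] -- transformed data
print(t)  # 0          -- label unchanged (no target_transform given)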

 

이제 μœ„μ˜ Dataset 클래슀λ₯Ό 상속 λ°›μ•„ 슀파이럴 데이터셋을 κ΅¬ν˜„ν•΄λ³΄κ² μŠ΅λ‹ˆλ‹€. 사싀 핡심 데이터 ꡬ성은 get_spiral λ©”μ„œλ“œμ— κ΅¬ν˜„λ˜μ–΄μžˆμ§€λ§Œ, 이런 μ‹μœΌλ‘œ Dataset 클래슀λ₯Ό 상속 λ°›μ•„ κ΅¬ν˜„ν•œλ‹€λŠ” κ²ƒλ§Œ μ•Œμ•„λ‘μ‹œλ©΄ λ©λ‹ˆλ‹€.

class Spiral(Dataset):
    def prepare(self):
        # get_spiral builds the data and labels when the dataset is constructed
        self.data, self.label = get_spiral(self.train)
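
A quick usage sketch (the count of 300 matches the book's spiral data, 100 points per class over 3 classes; it assumes get_spiral is available, as in the book's repository):

train_set = Spiral(train=True)
print(len(train_set))  # 300 -- __len__ delegates to len(self.data)
print(train_set[0])    # (data_0, label_0) returned by __getitem__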

 

If the dataset is huge, a different approach is needed. For example, if the data directory and the label directory each hold one million files, we read the data not when the BigData class is initialized but only when each item is accessed. The book says np.load is explained in step 53.

class BigData(Dataset):
    def __getitem__(self, index):
        # Read each sample from disk only when it is accessed.
        x = np.load('data/{}.npy'.format(index))
        t = np.load('label/{}.npy'.format(index))
        return x, t

    def __len__(self):
        return 1000000
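
With this scheme, constructing the dataset is cheap because no file is touched until an element is accessed. A hedged usage sketch (it assumes data/0.npy and label/0.npy actually exist on disk):

big_data = BigData()  # fast: nothing is read here
x, t = big_data[0]    # 'data/0.npy' and 'label/0.npy' are loaded only now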

 

이제 ν•™μŠ΅μ„ μ§„ν–‰ν•˜κ² μŠ΅λ‹ˆλ‹€. 이전과 달라진 점은 Dataset을 뢈러올 λ•Œ 클래슀λ₯Ό ν™œμš©ν•œλ‹€λŠ” μ μž…λ‹ˆλ‹€. λ˜ν•œ Spiral ν΄λž˜μŠ€μ—μ„œ 데이터λ₯Ό λ―Έλ‹ˆλ°°μΉ˜λ‘œ κ°€μ Έμ˜¬ λ•Œμ˜ μ½”λ“œκ°€ λ‹¬λΌμ‘ŒμŠ΅λ‹ˆλ‹€. 이 뢀뢄은 μ•„λž˜μ—μ„œ λ”°λ‘œ μ‚΄νŽ΄λ³΄κ² μŠ΅λ‹ˆλ‹€.

import math
import numpy as np
import dezero
import dezero.functions as F
from dezero import optimizers
from dezero.models import MLP


max_epoch = 300
batch_size = 30
hidden_size = 10
lr = 1.0

train_set = dezero.datasets.Spiral(train=True)
model = MLP((hidden_size, 3))
optimizer = optimizers.SGD(lr).setup(model)

data_size = len(train_set)
max_iters = math.ceil(data_size / batch_size)


for epoch in range(max_epoch):
    index = np.random.permutation(data_size)  # reshuffle sample order every epoch
    sum_loss = 0

    for i in range(max_iters):
        batch_index = index[i * batch_size:(i + 1) * batch_size]
        batch = [train_set[j] for j in batch_index]  # list of (data, label) tuples
        batch_x = np.array([example[0] for example in batch])
        batch_t = np.array([example[1] for example in batch])

        y = model(batch_x)
        model.cleargrads()
        loss = F.softmax_cross_entropy(y, batch_t)
        loss.backward()
        optimizer.update()

        sum_loss += float(loss.data) * len(batch_t)
    
    avg_loss = sum_loss / data_size
    print('epoch %d, loss %.2f' % (epoch + 1, avg_loss))

 

First, we specify indices to pull out a minibatch. Following those indices, multiple samples are collected into the batch list. Next, batch_x and batch_t are each converted into a single ndarray instance. Repeating this process, minibatches are fed into the network.

train_set = dezero.datasets.Spiral()

batch_index = [0, 1, 2]  # fetch samples 0 through 2
batch = [train_set[i] for i in batch_index]
# batch = [(data_0, label_0), (data_1, label_1), (data_2, label_2)]
batch_x = np.array([example[0] for example in batch])
batch_t = np.array([example[1] for example in batch])

print(batch_x.shape)
print(batch_t.shape)

(3, 2)

(3,)

 

μœ„μ—μ„œ μ„€λͺ…ν–ˆλ˜ 데이터 μ „μ²˜λ¦¬ transform은 dezero/transform.py에 μ—¬λŸ¬ λ³€ν™˜ μ²˜λ¦¬λ“€μ΄ μ€€λΉ„λ˜μ–΄ μžˆλ‹€κ³  ν•©λ‹ˆλ‹€. μ±…μ—μ„œ λ”°λ‘œ μ„€λͺ…ν•˜κ³  μžˆμ§€λŠ” μ•ŠμŠ΅λ‹ˆλ‹€. κ΄€μ‹¬μžˆλŠ” 뢄은 μ°Έκ³ ν•˜μ‹œλ©΄ 쒋을 것 κ°™μŠ΅λ‹ˆλ‹€.

 

GitHub - WegraLee/deep-learning-from-scratch-3: Deep Learning from Scratch 3 (Hanbit Media, 2020)
https://github.com/WegraLee/deep-learning-from-scratch-3