
When the dataset is small, features are your friends


Photo by Thomas T on Unsplash

In the rapidly evolving world of Artificial Intelligence (AI), data has become the lifeblood of countless innovative applications and solutions. Indeed, large datasets are often considered the backbone of robust and accurate AI models. But what happens when the dataset at hand is relatively small? In this article, we explore the critical role of feature engineering in overcoming the limitations posed by small datasets.

Our journey starts with the creation of the dataset. In this example, we will perform nice and simple signal classification. The dataset has two classes: sine waves of frequency 1 belong to class 0, and sine waves of frequency 2 belong to class 1. The code for signal generation is presented below. It generates a sine wave, applies additive Gaussian noise, and randomizes the phase shift. Thanks to the added noise and phase shift, we obtain diverse signals, and the classification problem becomes non-trivial (albeit still easy with correct feature engineering).

import numpy as np

def signal0(samples_per_signal, noise_amplitude):
    # Sine wave of frequency 1 (class 0) with additive Gaussian noise
    x = np.linspace(0, 4.0, samples_per_signal)
    y = np.sin(x * np.pi * 0.5)
    n = np.random.randn(samples_per_signal) * noise_amplitude

    s = y + n

    # Randomize the phase by rolling the signal by a random offset
    shift = np.random.randint(low=0, high=int(samples_per_signal / 2))
    s = np.concatenate([s[shift:], s[:shift]])

    return np.asarray(s, dtype=np.float32)

def signal1(samples_per_signal, noise_amplitude):
    # Sine wave of frequency 2 (class 1) with additive Gaussian noise
    x = np.linspace(0, 4.0, samples_per_signal)
    y = np.sin(x * np.pi)
    n = np.random.randn(samples_per_signal) * noise_amplitude

    s = y + n

    # Randomize the phase by rolling the signal by a random offset
    shift = np.random.randint(low=0, high=int(samples_per_signal / 2))
    s = np.concatenate([s[shift:], s[:shift]])

    return np.asarray(s, dtype=np.float32)
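A few lines of matplotlib are enough to reproduce visualizations like the ones below; here is a minimal sketch (the styling is arbitrary, not the exact plotting code from the notebook):

import matplotlib.pyplot as plt

# Sketch: a few noisy, phase-shifted examples of each class
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for _ in range(3):
    axes[0].plot(signal0(100, 0.1))
    axes[1].plot(signal1(100, 0.1))
axes[0].set_title("Class 0 (frequency 1)")
axes[1].set_title("Class 1 (frequency 2)")
plt.show()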

Visualizations of signals in class 0.
Visualizations of signals in class 1.

The state-of-the-art models for signal processing are Convolutional Neural Networks (CNNs). So, let’s create one. This particular network comprises two one-dimensional convolutional layers and two fully connected ones. The code is listed below.

import torch
from torch import nn

class Network(nn.Module):

    def __init__(self, signal_size):
        super().__init__()

        # Kernel size scales with the signal length, with a minimum of 3
        c = int(signal_size / 10)
        if c < 3:
            c = 3

        self.cnn = nn.Sequential(
            nn.Conv1d(1, 8, c),
            nn.ReLU(),
            nn.AvgPool1d(2),
            nn.Conv1d(8, 16, c),
            nn.ReLU(),
            nn.AvgPool1d(2),
            nn.ReLU(),
            nn.Flatten()
        )

        # Run a dummy forward pass to infer the flattened feature size
        with torch.no_grad():
            s = torch.randn((1, 1, signal_size))
            o = self.cnn(s)
            l = o.shape[1]

        self.head = nn.Sequential(
            nn.Linear(l, 2 * l),
            nn.ReLU(),
            nn.Linear(2 * l, 2),
            nn.ReLU(),
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        x = self.cnn(x)
        x = self.head(x)
        return x
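As a quick sanity check: the network expects input of shape (batch, 1, signal_length) and returns one probability pair per signal. A minimal usage sketch:

net = Network(100)
dummy = torch.randn((4, 1, 100))  # batch of 4 signals, 100 samples each
print(net(dummy).shape)           # torch.Size([4, 2])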

CNNs are models that can process the raw signal. However, due to their parameter-heavy architecture, they tend to need a lot of data. But first, let’s assume we have enough data to train a neural network. I used the signal-generation functions to create a dataset with 200 signals. Each experiment was repeated ten times to reduce the influence of randomness. The code is shown below:

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from skorch import NeuralNetClassifier
from skorch.callbacks import LRScheduler
from torch import nn
from torch.optim.lr_scheduler import CyclicLR

SAMPLES_PER_SIGNAL = 100
SIGNALS_IN_DATASET = 200  # reduced to 20 for the small-dataset experiment later
NOISE_AMPLITUDE = 0.1
REPEAT_EXPERIMENT = 10

X, Y = [], []

# Generate a balanced dataset: first half class 0, second half class 1
stop = int(SIGNALS_IN_DATASET / 2)
for i in range(SIGNALS_IN_DATASET):

    if i < stop:
        x = signal0(SAMPLES_PER_SIGNAL, NOISE_AMPLITUDE)
        y = 0
    else:
        x = signal1(SAMPLES_PER_SIGNAL, NOISE_AMPLITUDE)
        y = 1

    X.append(x.reshape(1, -1))
    Y.append(y)

X = np.concatenate(X)
Y = np.array(Y, dtype=np.int64)

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.1)

accs = []
train_accs = []

for i in range(REPEAT_EXPERIMENT):

    net = NeuralNetClassifier(
        lambda: Network(SAMPLES_PER_SIGNAL),
        max_epochs=200,
        criterion=nn.CrossEntropyLoss(),
        lr=0.1,
        callbacks=[
            ('lr_scheduler', LRScheduler(policy=CyclicLR, base_lr=0.0001, max_lr=0.01, step_size_up=10)),
        ],
        verbose=False,
        batch_size=128
    )

    # skorch expects input of shape (batch, channels, length)
    net = net.fit(train_x.reshape(train_x.shape[0], 1, SAMPLES_PER_SIGNAL), train_y)
    pred = net.predict(test_x.reshape(test_x.shape[0], 1, SAMPLES_PER_SIGNAL))
    acc = accuracy_score(test_y, pred)

    print(f"{i} - {acc}")

    accs.append(acc)

    pred_train = net.predict(train_x.reshape(train_x.shape[0], 1, SAMPLES_PER_SIGNAL))
    train_acc = accuracy_score(train_y, pred_train)
    train_accs.append(train_acc)

    print(f"Train Acc: {train_acc}, Test Acc: {acc}")

accs = np.array(accs)
train_accs = np.array(train_accs)

print(f"Average acc: {accs.mean()}")
print(f"Average train acc: {train_accs.mean()}")
# "Successful" training: accuracy on the training set above 60%
print(f"Average acc where training was successful: {accs[train_accs > 0.6].mean()}")
print(f"Training success rate: {(train_accs > 0.6).mean()}")

The CNNs obtained a test accuracy of 99.2%, which is what one would expect from a state-of-the-art model. However, this metric was computed only over the experiment runs where training was successful. By “successful,” I mean that the accuracy on the training dataset exceeded 60%. In this example, weight initialization is make-or-break for training: CNNs are complicated models, and an unlucky random initialization can keep them from converging. The training success rate was 70%.

Now, let’s see what happens when the dataset is small. I reduced the number of signals in the dataset to 20. As a result, the CNNs obtained 71.4% test accuracy, a drop of 27.8 percentage points. That is not acceptable. So what now? The dataset is too small for state-of-the-art models, and in industrial applications, acquiring more data is often infeasible or, at the very least, very expensive. Should we drop the project and move on?

No. When the dataset is small, features are your friends.

This particular example involves classifying signals by their frequency, so we can apply the good old Fourier transform. The Fourier transform decomposes the signal into a series of sine waves parametrized by frequency and amplitude. As a result, we can use it to examine how much each frequency contributes to forming the signal. Such a data representation should simplify the task enough for the small dataset to suffice. The Fourier transform also structures the data so that we can use simpler models, for example, a Random Forest classifier.

Visualization of signals transformed into spectra. On the left is the spectrum of a signal from class 0; on the right, the spectrum of a signal from class 1. The plots use logarithmic scales for better visibility; the models in this example consumed the spectra on a linear scale.
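Spectrum plots like these can be produced with a short sketch along the following lines (assuming matplotlib and scipy; the styling is arbitrary):

import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft

# Sketch: spectrum of one example signal per class, on a log scale
s0 = np.abs(fft(signal0(SAMPLES_PER_SIGNAL, NOISE_AMPLITUDE)[:int(SAMPLES_PER_SIGNAL / 2)]))
s1 = np.abs(fft(signal1(SAMPLES_PER_SIGNAL, NOISE_AMPLITUDE)[:int(SAMPLES_PER_SIGNAL / 2)]))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(s0)
axes[0].set_yscale("log")
axes[0].set_title("Spectrum of a class 0 signal")
axes[1].plot(s1)
axes[1].set_yscale("log")
axes[1].set_title("Spectrum of a class 1 signal")
plt.show()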

The code for transforming the signals and training the Random Forest classifier is shown below:

from scipy.fft import fft
from sklearn.ensemble import RandomForestClassifier

X, Y = [], []

stop = int(SIGNALS_IN_DATASET / 2)
for i in range(SIGNALS_IN_DATASET):

    if i < stop:
        x = signal0(SAMPLES_PER_SIGNAL, NOISE_AMPLITUDE)
        y = 0
    else:
        x = signal1(SAMPLES_PER_SIGNAL, NOISE_AMPLITUDE)
        y = 1

    # Transform the signal into a spectrum: magnitude of the FFT
    # of the first half of the signal
    x = np.abs(fft(x[:int(SAMPLES_PER_SIGNAL / 2)]))

    X.append(x.reshape(1, -1))
    Y.append(y)

X = np.concatenate(X)
Y = np.array(Y, dtype=np.int64)

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.1)

accs = []
train_accs = []

for i in range(REPEAT_EXPERIMENT):
    model = RandomForestClassifier()
    model.fit(train_x, train_y)

    pred = model.predict(test_x)
    acc = accuracy_score(test_y, pred)

    print(f"{i} - {acc}")

    accs.append(acc)

    pred_train = model.predict(train_x)
    train_acc = accuracy_score(train_y, pred_train)
    train_accs.append(train_acc)

    print(f"Train Acc: {train_acc}, Test Acc: {acc}")

accs = np.array(accs)
train_accs = np.array(train_accs)

print(f"Average acc: {accs.mean()}")
print(f"Average train acc: {train_accs.mean()}")
print(f"Average acc where training was successful: {accs[train_accs > 0.6].mean()}")
print(f"Training success rate: {(train_accs > 0.6).mean()}")

The Random Forest classifier achieved 100% test accuracy on both the 20-signal and the 200-signal dataset, and the training success rate was 100% in both cases. In other words, we obtained better results than the CNNs while requiring less data, all thanks to feature engineering.

Although feature engineering is a powerful tool, one must also remember to remove unnecessary features from the input data. The more features in the input vectors, the higher the chance of overfitting, especially on small datasets. Every unnecessary feature carries the risk of introducing random fluctuations that the machine learning model may mistake for essential patterns. The less data in the dataset, the higher the risk that random fluctuations create a correlation that does not exist in the real world.

One mechanism that can help prune oversized feature sets is a search heuristic such as the genetic algorithm. Feature pruning can be expressed as the task of finding the smallest set of features that still allows successful training of the machine learning model. A candidate solution can be encoded as a binary vector whose length equals the number of features: a “0” means the feature is excluded from the dataset, and a “1” means it is included. The fitness of such a vector is then the sum of the model’s accuracy on the pruned dataset and the count of zeros in the vector, scaled down by a suitable weight.
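Below is a minimal sketch of this idea. The population size, mutation rate, and the 0.01 weight on the zero count are illustrative assumptions, not tuned values, and a proper setup would score fitness on a separate validation split rather than the final test set:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def fitness(mask, tr_x, tr_y, va_x, va_y, zero_weight=0.01):
    # Fitness = accuracy on the pruned dataset
    #         + scaled-down count of excluded features
    if mask.sum() == 0:
        return 0.0  # at least one feature must remain
    model = RandomForestClassifier()
    model.fit(tr_x[:, mask == 1], tr_y)
    acc = accuracy_score(va_y, model.predict(va_x[:, mask == 1]))
    return acc + zero_weight * (mask == 0).sum()

def genetic_feature_selection(tr_x, tr_y, va_x, va_y,
                              pop_size=20, generations=30, mutation_rate=0.05):
    n_features = tr_x.shape[1]
    # Start from random binary masks, one per individual
    pop = np.random.randint(0, 2, size=(pop_size, n_features))

    for _ in range(generations):
        scores = np.array([fitness(m, tr_x, tr_y, va_x, va_y) for m in pop])
        # Selection: keep the better-scoring half of the population
        survivors = pop[np.argsort(scores)[pop_size // 2:]]
        children = []
        for _ in range(pop_size - len(survivors)):
            # Crossover: splice two random parents at a random cut point
            a, b = survivors[np.random.randint(len(survivors), size=2)]
            cut = np.random.randint(1, n_features)
            child = np.concatenate([a[:cut], b[cut:]])
            # Mutation: flip random bits with a small probability
            flip = np.random.rand(n_features) < mutation_rate
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.concatenate([survivors, np.array(children)])

    scores = np.array([fitness(m, tr_x, tr_y, va_x, va_y) for m in pop])
    return pop[np.argmax(scores)]  # best feature mask found

# Usage on the spectrum dataset (test set doubles as validation here for brevity)
best_mask = genetic_feature_selection(train_x, train_y, test_x, test_y)
print(f"Selected {best_mask.sum()} of {best_mask.shape[0]} features")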

This is just one of many ways to remove unnecessary features, but it is quite powerful.

Although the presented example is relatively simple, it illustrates typical problems in applying Artificial Intelligence systems in industry. Deep Neural Networks can currently do almost anything we want, provided enough data is available. However, data is usually scarce and expensive. That is why industrial applications of Artificial Intelligence usually involve extensive feature engineering to simplify the problem and, as a result, reduce the amount of data needed to train the model.

Thanks for reading. The code for this example is available at: https://github.com/aimagefrombydgoszcz/Notebooks/blob/main/when_dataset_is_small_features_are_your_friend.ipynb

All images unless otherwise noted are by the author.
