Processing a complete Numerai era at once with Transformers
Just give me the code 🛠️:
GPU -> Run all on the notebook generates a submission file. Runs within 15 mins.
If you are new to Numer.ai's tournaments, the following posts may be helpful to get context for this post.
It may not be accurate, but it is still worth a try… everything (the whole era) at once!
The provided data set (v4.1: Sunshine) comprises obfuscated stock market data from many past years, divided into groups called eras (identified by an integer). Eras represent a relative sense of time, meaning era 1 happened before era 2 and so forth, up to about 1050 eras in total. They are split into training and validation sets.
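For completeness, the data can be pulled with the numerapi client. The file names below are assumptions based on the v4.1 release layout and may differ from the current data release:

from numerapi import NumerAPI

napi = NumerAPI()
# Assumed v4.1 file paths; check the current data release for the exact names.
napi.download_dataset("v4.1/train.parquet", "train.parquet")
napi.download_dataset("v4.1/validation.parquet", "validation.parquet")
napi.download_dataset("v4.1/features.json", "features.json")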
Most models treat each row as an independent sample with no context of its era. Approaches such as Era Boosting train additional trees on the eras where the current ensemble is struggling. However, this still doesn't look at the whole era when making a prediction. Since the primary goal of the data scientist is to rank the samples within a given era, it makes sense to be able to look at not only the features but also all of the samples in an era. This leads us to wonder: what could we do if we were able to look at all of the samples in an era and infer something from them?
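To make the per-era ranking objective concrete, here is a minimal sketch of era-wise scoring; the column names (era, prediction, target) are illustrative, not the exact tournament schema:

import pandas as pd

def per_era_corr(df: pd.DataFrame) -> pd.Series:
    # Rank predictions within each era and correlate them with the target.
    return df.groupby("era").apply(
        lambda d: d["prediction"].rank(pct=True).corr(d["target"])
    )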
I have been trying to work on these ideas for a while now. Initial experiments with an LSTM didn't succeed (possibly I couldn't implement it properly back then). My first trials with Transformers treated each row as a sequence by using embeddings of the 5 unique input values {0, 1, 2, 3, 4}. However, it is possible to process an entire era at once, simply by transposing.
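A shape-level sketch of the transpose idea (the sizes here are illustrative): instead of each row being a sequence of feature tokens, the whole era becomes one sequence of row tokens.

import torch

n_rows, n_features = 4800, 42  # illustrative sizes for one era with the small feature set
era = torch.randint(0, 5, (n_rows, n_features)).float()

# Row-as-sequence view: each row is a sequence of n_features tokens of size 1.
row_as_sequence = era.unsqueeze(-1)   # shape (n_rows, n_features, 1)

# Era-as-sequence view: the whole era is one sequence of n_rows tokens,
# each token being the feature vector of one stock.
era_as_sequence = era.unsqueeze(0)    # shape (1, n_rows, n_features)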
Instead of treating each row as a sample, the era is padded to a fixed MAX_LEN and treated as a single sequence. The selected features are fed to the model as embeddings, together with a mask at the padded locations, as illustrated below.
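The data-prep code below relies on a few globals set up earlier in the notebook (feature and target column lists, the padding value, and the maximum era length). A rough sketch of what they might look like, with assumed values:

import json

# Assumed setup; the actual lists and constants come from the notebook / features.json.
with open("features.json") as f:
    feature_metadata = json.load(f)
feature_names = feature_metadata["feature_sets"]["small"]
target_names = ["target"]
PADDING_VALUE = -1
MAX_LEN = 6000  # upper bound on rows per era; pick one that covers the largest era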
import numpy as np
import torch
import torch.nn.functional as F
from joblib import Parallel, delayed
from tqdm import tqdm

def pad_sequence(inputs, padding_value=-1, max_len=None):
    # Pad each era to max_len rows and build a mask that is 1 for real rows, 0 for padding.
    if max_len is None:
        max_len = max([input.shape[0] for input in inputs])
    padded_inputs = []
    masks = []
    for input in inputs:
        pad_len = max_len - input.shape[0]
        padded_input = F.pad(input, (0, 0, 0, pad_len), value=padding_value)
        mask = torch.ones((input.shape[0], 1), dtype=torch.float)
        masks.append(
            torch.cat((mask, torch.zeros((pad_len, 1), dtype=torch.float)), dim=0)
        )
        padded_inputs.append(padded_input)
    return torch.stack(padded_inputs), torch.stack(masks)

def convert_to_torch(era, data):
    # Turn one era's rows into padded feature/target tensors plus the input mask.
    inputs = torch.from_numpy(
        data[feature_names].values.astype(np.int8))
    labels = torch.from_numpy(
        data[target_names].values.astype(np.float32))
    padded_inputs, masks_inputs = pad_sequence(
        [inputs], padding_value=PADDING_VALUE, max_len=MAX_LEN)
    padded_labels, masks_labels = pad_sequence(
        [labels], padding_value=PADDING_VALUE, max_len=MAX_LEN)
    return {
        era: (
            padded_inputs,
            padded_labels,
            masks_inputs,
        )
    }

def get_era2data(df):
    # Build a dict mapping era -> (padded inputs, padded labels, input mask), one entry per era.
    res = Parallel(n_jobs=-1, prefer="threads")(
        delayed(convert_to_torch)(era, data)
        for era, data in tqdm(df.groupby("era_int")))
    era2data = {}
    for r in tqdm(res):
        era2data.update(r)
    return era2data

era2data_train = get_era2data(train)
era2data_validation = get_era2data(validation)
import torch.nn as nn

class TransformerEncoder(nn.Module):
    def __init__(
        self,
        input_dim,
        d_model,
        output_dim,
        num_heads,
        num_layers,
        dropout_prob=0.15,
        max_len=5000,
    ):
        super(TransformerEncoder, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.num_heads = num_heads
        self.num_layers = num_layers
        self.dropout_prob = dropout_prob
        self.d_model = d_model
        self.positional_encoding = PositionalEncoding(d_model, max_len)
        self.attention = MultiHeadLinearAttention(d_model, num_heads)
        # Final projection; note it keeps d_model width (output_dim is stored but unused here).
        self.fc = nn.Sequential(
            nn.Linear(d_model, d_model),
        )
        self.layers = nn.ModuleList(
            [
                nn.Sequential(FeedForwardLayer(d_model=d_model))
                for _ in range(num_layers)
            ]
        )
        # Maps the raw (integer-valued) features into the model dimension.
        self.mapper = nn.Sequential(
            nn.Linear(input_dim, d_model), nn.Linear(d_model, d_model)
        )

    def forward(self, inputs, mask=None):
        x = self.mapper(inputs)
        pe = self.positional_encoding(x)
        x = x + pe  # works without PE as well
        for layer in range(self.num_layers):
            # attention sub-block with residual connection
            attention_weights = self.attention(x, mask)
            x = x + attention_weights
            x = F.layer_norm(x, x.shape[1:])
            x = F.dropout(x, p=self.dropout_prob)
            # feed-forward sub-block with residual connection
            op = self.layers[layer](x)
            x = x + op
            x = F.layer_norm(x, x.shape[1:])
            x = F.dropout(x, p=self.dropout_prob)
        outputs = self.fc(x)
        return outputs
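The PositionalEncoding, FeedForwardLayer, and attention modules referenced above live in the notebook rather than in this excerpt. A rough sketch of the first two, assuming a standard sinusoidal encoding and a small position-wise MLP:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    # Standard sinusoidal encoding; a sketch of what the notebook's module might look like.
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # Return the encoding for the first seq_len positions; broadcasts over the batch.
        return self.pe[:, : x.size(1)]

class FeedForwardLayer(nn.Module):
    # A small position-wise MLP block (assumed structure).
    def __init__(self, d_model, hidden_mult=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_model * hidden_mult),
            nn.GELU(),
            nn.Linear(d_model * hidden_mult, d_model),
        )

    def forward(self, x):
        return self.net(x)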
The positional encoding is added to the input. However, since we don't know how the samples are ordered within an era in the first place, it may not be very helpful, but adding it shouldn't hurt. The next step was to get the attention mechanism working.
Vanilla attention was too heavy for my laptop GPU, so I tried something like linear attention, which reduces the matrix multiplication load by computing K^T V first and then Q(K^T V), so the full N x N attention matrix is never formed. The standard Colab runtime in the provided notebook uses the vanilla attention mechanism on the provided subset of features since it has more GPU memory. The notebook has both available as drop-in replacements for the multi-head attention block. Use whichever works for you.
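For reference, a minimal sketch of what a multi-head linear-attention block could look like (an assumed implementation, not the exact one in the notebook): a feature map is applied to queries and keys, K^T V is accumulated once, and the N x N attention matrix is never materialized.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadLinearAttention(nn.Module):
    # Assumed linear-attention sketch (elu + 1 feature map), not the notebook's exact code.
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        b, n, _ = x.shape
        def split(t):
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)  # (b, h, n, d)
        q = F.elu(split(self.q_proj(x))) + 1.0
        k = F.elu(split(self.k_proj(x))) + 1.0
        v = split(self.v_proj(x))
        if mask is not None:
            m = mask.unsqueeze(1)  # mask is (b, n, 1): 1 for real rows, 0 for padding
            k = k * m
            v = v * m
        kv = torch.einsum("bhnd,bhne->bhde", k, v)            # (b, h, d, d): no n x n matrix
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)   # (b, h, n, d)
        out = out.transpose(1, 2).contiguous().view(b, n, -1)
        return self.out_proj(out)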
The experiments in the provided Colab use the "small" feature subset due to memory limitations, but with the vanilla attention mechanism. Below is a small model trained on the small feature set for 10 iterations with a low learning rate, optimizing MSE and Pearson's correlation.
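The exact training loop lives in the notebook; a condensed sketch of the objective described above (MSE plus a Pearson-correlation term) could look like the following, where the prediction head, d_model, learning rate, and the use of only the first target column are all assumptions:

import torch
import torch.nn as nn

def masked_pearson(pred, target, mask):
    # Pearson correlation over the real (non-padded) rows of one era.
    keep = mask.squeeze(-1) > 0
    p, t = pred[keep], target[keep]
    p, t = p - p.mean(), t - t.mean()
    return (p * t).sum() / (p.norm() * t.norm() + 1e-8)

d_model = 64  # assumed
model = TransformerEncoder(
    input_dim=len(feature_names), d_model=d_model, output_dim=len(target_names),
    num_heads=4, num_layers=2, max_len=MAX_LEN,
)
head = nn.Linear(d_model, len(target_names))  # assumed prediction head on the encoder output
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-4)

for epoch in range(10):  # "10 iterations" over the training eras
    for era, (inputs, labels, mask) in era2data_train.items():
        optimizer.zero_grad()
        emb = model(inputs.float(), mask)   # (1, MAX_LEN, d_model) era embedding
        preds = head(emb)                   # (1, MAX_LEN, n_targets)
        mse = (((preds - labels) ** 2) * mask).sum() / (mask.sum() * preds.shape[-1])
        corr = masked_pearson(preds[0, :, 0], labels[0, :, 0], mask[0])
        loss = mse - corr                   # minimize MSE while maximizing correlation
        loss.backward()
        optimizer.step()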
A loss function that enforces orthogonal embeddings in the encoder output would make much more sense for clustering. The current implementation is more of an early experiment and is far from a perfect implementation, but it does appear to be working.
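One assumed way to express that orthogonality idea (not what the notebook currently does): penalize the off-diagonal entries of the Gram matrix of the normalized encoder embeddings.

import torch
import torch.nn.functional as F

def orthogonality_penalty(emb, mask):
    # emb: (1, MAX_LEN, d_model) encoder output; mask: (1, MAX_LEN, 1).
    # Pushes embeddings of different rows toward mutual orthogonality.
    e = F.normalize(emb * mask, dim=-1)         # zero out padding, unit-normalize each row
    gram = torch.matmul(e, e.transpose(1, 2))   # (1, MAX_LEN, MAX_LEN) cosine similarities
    off_diag = gram - torch.diag_embed(torch.diagonal(gram, dim1=1, dim2=2))
    # Note: the full Gram matrix is memory-heavy for long eras; sampling rows would help.
    return (off_diag ** 2).sum() / mask.sum() ** 2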
(Plots: training performance, validation performance, performance relative to the meta-model, and clustering.)
- Run the Colab notebook (GPU -> Run all); it generates a submission file
- Expand the feature set (small -> medium -> full set)
- Play with Attention mechanisms.
- Loss functions to improve embeddings 📉
- We have full knowledge of the data in Signals; this approach should be even more applicable there
- Generate synthetic eras using a diffusion process
- Self-supervised training