“Eras” of Transformers

Processing a complete Numerai era directly with Transformers.

Just give me the code 🛠️:

If you are new to Numerai’s tournaments, the following posts could be helpful for getting context about this post.

It may not be accurate, but it is still worth a try… everything (all) at once!

Illustration of data set V2; https://docs.numer.ai/tournament/learn

I have been trying to work on these ideas for some time now. Initial experiments with LSTMs didn’t succeed (possibly I couldn’t implement them properly back then). My initial trials with Transformers treated each row as a sequence by using an embedding of the 5 unique input values {0, 1, 2, 3, 4}. However, it should be possible to process an entire era directly, just by transposing.
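As a rough sketch of the difference between the two framings, the snippet below builds one toy era and shapes it both ways. The sizes, the `nn.Embedding` and the `nn.Linear` projection are assumptions made for illustration; they are not taken from the notebook.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
n_rows, n_features, d_model = 6, 4, 8

# One toy era: every feature takes a value in {0, 1, 2, 3, 4}.
era = torch.randint(0, 5, (n_rows, n_features))

# Row-as-sequence: embed each of the 5 discrete values, so a row becomes a
# sequence of n_features tokens and the era is a batch of n_rows sequences.
embed = nn.Embedding(num_embeddings=5, embedding_dim=d_model)
row_tokens = embed(era)                       # (n_rows, n_features, d_model)

# Era-as-sequence: each row is a single token and the whole era is one
# sequence of length n_rows (batch size 1).
project = nn.Linear(n_features, d_model)
era_tokens = project(era.float()).unsqueeze(0)  # (1, n_rows, d_model)

print(row_tokens.shape, era_tokens.shape)
```

In the second framing, attention runs across the rows of the era rather than across the features of a single row.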

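A diagram of the current implementation of EraTransformer

import torch
import torch.nn as nn
import torch.nn.functional as F
from joblib import Parallel, delayed
from tqdm import tqdm

def pad_sequence(inputs, padding_value=-1, max_len=None):
    # Pad every 2-D tensor in `inputs` along dim 0 and build matching masks
    # (1 = real row, 0 = padding).
    if max_len is None:
        max_len = max([seq.shape[0] for seq in inputs])
    padded_inputs = []
    masks = []
    for seq in inputs:
        pad_len = max_len - seq.shape[0]
        padded_input = F.pad(seq, (0, 0, 0, pad_len), value=padding_value)
        padded_inputs.append(padded_input)
        mask = torch.ones((seq.shape[0], 1), dtype=torch.float)
        masks.append(
            torch.cat((mask, torch.zeros((pad_len, 1), dtype=torch.float)), dim=0))
    return torch.stack(padded_inputs), torch.stack(masks)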

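def convert_to_torch(era, data):
    # The arrays passed to torch.from_numpy are elided here; going by the
    # names, `inputs` holds the era's features and `labels` its targets.
    inputs = torch.from_numpy(...)
    labels = torch.from_numpy(...)

    padded_inputs, masks_inputs = pad_sequence(
        [inputs], padding_value=PADDING_VALUE, max_len=MAX_LEN)
    padded_labels, masks_labels = pad_sequence(
        [labels], padding_value=PADDING_VALUE, max_len=MAX_LEN)

    # One entry per era; the contents of the tuple are elided here.
    return {
        era: (...)
    }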

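def get_era2data(df):
    # Convert every era in parallel; each call returns a one-entry dict.
    res = Parallel(n_jobs=-1, prefer="threads")(
        delayed(convert_to_torch)(era, data)
        for era, data in tqdm(df.groupby("era_int")))
    era2data = {}
    for r in tqdm(res):
        era2data.update(r)  # merge the per-era dicts into one mapping
    return era2data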

era2data_train = get_era2data(train)
era2data_validation = get_era2data(validation)

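class TransformerEncoder(nn.Module):
    def __init__(
        self, input_dim, d_model, output_dim, num_heads, num_layers,
        dropout_prob, max_len,
    ):
        # Parameter names follow the attributes set below; their order and
        # any default values are assumptions.
        super(TransformerEncoder, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.num_heads = num_heads
        self.num_layers = num_layers
        self.dropout_prob = dropout_prob
        self.d_model = d_model

        self.positional_encoding = PositionalEncoding(d_model, max_len)
        self.attention = MultiHeadLinearAttention(d_model, num_heads)
        self.fc = nn.Sequential(
            nn.Linear(d_model, d_model),
            ...  # remaining layers of the output head are elided here
        )
        self.layers = nn.ModuleList([
            ...  # per-layer feed-forward block (d_model -> d_model), elided here
            for _ in range(num_layers)
        ])
        self.mapper = nn.Sequential(
            nn.Linear(input_dim, d_model), nn.Linear(d_model, d_model)
        )

    def forward(self, inputs, mask=None):
        x = self.mapper(inputs)
        pe = self.positional_encoding(x)
        x = x + pe  # works without PE as well
        for layer in range(self.num_layers):
            # Attention sub-block with residual connection.
            attention_weights = self.attention(x, mask)
            x = x + attention_weights
            x = F.layer_norm(x, x.shape[1:])
            # Pass training=self.training so dropout is disabled in eval mode.
            x = F.dropout(x, p=self.dropout_prob, training=self.training)

            # Feed-forward sub-block with residual connection.
            op = self.layers[layer](x)
            x = x + op
            x = F.layer_norm(x, x.shape[1:])
            x = F.dropout(x, p=self.dropout_prob, training=self.training)

        outputs = self.fc(x)
        return outputs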

The positional encoding is added to the input. However, as we don’t know how the samples were ordered originally, it may not be very helpful, but adding it wouldn’t hurt. The next step was to get the attention mechanism working.
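The `PositionalEncoding` class itself is not shown in the post. A minimal sketch of what it could look like, assuming the standard sinusoidal encoding and an even `d_model`, is given below. The class name `SinusoidalPositionalEncoding` is hypothetical; only the `(d_model, max_len)` constructor and the "return the encoding, let the caller add it" behaviour are taken from the code above.

```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Hypothetical stand-in for PositionalEncoding(d_model, max_len)."""

    def __init__(self, d_model, max_len):
        super().__init__()
        # Assumes d_model is even so sine and cosine columns pair up.
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float)
            * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq, d_model). Return only the encoding for the first
        # seq positions; the caller adds it to x.
        return self.pe[:, : x.size(1)]


pos_enc = SinusoidalPositionalEncoding(d_model=8, max_len=16)
x = torch.zeros(2, 10, 8)
x = x + pos_enc(x)  # broadcasts over the batch dimension
```

Because the encoding depends only on the row index, it carries information only if the row order within an era means something.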

The vanilla attention was too heavy for my laptop GPU, so I tried something like a Linear Attention, because it reduces the matrix multiplication load by computing K^T V first and then Q (K^T V), instead of forming the full Q K^T matrix. The standard Colab runtime in the provided notebook uses the vanilla attention mechanism on the provided subset of features because it has more GPU memory. The notebook has both available as drop-in replacements for the multi-head attention block. Use whichever works for you.
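To make the reordering concrete, here is a single-head sketch of one common form of linear attention. It is not the notebook's `MultiHeadLinearAttention`: the `elu(x) + 1` feature map, the function name `linear_attention` and the mask convention (1 = real row, 0 = padding) are assumptions for illustration. With N rows and dimension d, vanilla attention materialises an N x N score matrix, while this version only builds a d x d summary.

```python
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, mask=None, eps=1e-6):
    """q, k, v: (batch, seq, dim); mask: (batch, seq, 1), 1 = real, 0 = pad."""
    # Positive feature map (an assumed choice) so the normaliser stays > 0.
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    if mask is not None:
        # Padded rows contribute nothing to the key/value summaries.
        k = k * mask
        v = v * mask
    kv = torch.einsum("bnd,bne->bde", k, v)          # K^T V: (batch, dim, dim)
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))  # per-query normaliser
    out = torch.einsum("bnd,bde->bne", q, kv)        # Q (K^T V)
    return out / (z.unsqueeze(-1) + eps)


torch.manual_seed(0)
q, k, v = torch.randn(3, 2, 5, 4)  # three tensors of shape (2, 5, 4)
out = linear_attention(q, k, v)
print(out.shape)
```

The result matches the quadratic computation with the same feature map, so the saving comes purely from the order of the matrix products.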

A loss function that enforces orthogonal embeddings in the encoder output would make much more sense for clustering. The current implementation is more of an early experiment and could be brought closer to an ideal implementation. It seems to be working though.
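One possible reading of "orthogonal embeddings" is that the embedding dimensions should be mutually uncorrelated across the rows of an era. The sketch below penalises the off-diagonal entries of the resulting Gram matrix. The function name `orthogonality_penalty`, the mask convention and the normalisation are all assumptions; nothing like this is in the current implementation.

```python
import torch
import torch.nn.functional as F


def orthogonality_penalty(emb, mask=None):
    """emb: (batch, seq, dim); mask: (batch, seq, 1), 1 = real, 0 = pad."""
    if mask is not None:
        emb = emb * mask                    # zero out padded rows
    # Give every embedding dimension unit norm along the sequence axis.
    emb = F.normalize(emb, dim=1)
    gram = emb.transpose(1, 2) @ emb        # (batch, dim, dim)
    eye = torch.eye(gram.size(-1), device=gram.device, dtype=gram.dtype)
    # Zero when the embedding dimensions are orthonormal.
    return ((gram - eye) ** 2).mean()


# Orthonormal columns give (almost) no penalty, identical columns a large one.
ortho, _ = torch.linalg.qr(torch.randn(6, 4))
low = orthogonality_penalty(ortho.unsqueeze(0))
high = orthogonality_penalty(torch.ones(1, 6, 4))
print(float(low), float(high))
```

Such a term would typically be added to the main loss with a small weight, which is another hyperparameter to tune.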

Training Performance

Performance on Training set

Validation Performance

Performance on validation set

Relative to meta-model

Correlation between example predictions and historical meta-model prediction


What’s Next?

  1. Run the Colab notebook (GPU -> Run all). It generates a submission file.
  2. Expand the feature set (small -> medium -> full set).
  3. Play with attention mechanisms.
  4. Loss functions to improve embeddings 📉
  5. We have full knowledge of the data in Signals; this approach will be much more applicable there.
  6. Generate synthetic eras using a diffusion process.
  7. Self-supervised training.

