# Time Series Forecasting using Transformers and Ordinal Regression

## Raisa's Domain and Time Series Prediction

• One of the things we do at Raisa is forecasting oil and gas production for all wells in the United States. Forecasting are usually done for the next twelve months based on previous data, which is by definition a time series forecasting problem.
• With the rise of transformers in Natural Language Processing and Computer Vision, we conducted an experiment that tries to use transformers to predict the data described above.
• This blog summarizes one of many experiments which is Ordinal Regression (Classification)

### Motivation

• Why did we choose Ordinal Regression instead of Regression?
• We tried to use regression at first but the results were not impressive, which pushed us to try and re-examine the way we look at our data and our problem.
• Since transformers are already proven as Language Models, we thought that we could create a "Well Model".
• Just like you could give a Language Model a prompt and it would continue and come up with a full sentence, We wanted to train a "Well Model" that you could give some production values and it would predict the rest of the production stream as if it was predicting words to complete a sentence.

# Data Processing

• To apply classification on continuous data, we needed to bin the data so that we can treat each production value similar to a word in a corpus

### Binning Algorithm

• The algorithm we used for binning was quantile binning but with a slight modification.

Quantile binning: Values are binned so that each bin contains a certain percentage of values.

• The modification we added is that instead of moving to a new bin after reaching a certain percentage of the data, we added a threshold for the difference between each bin and the next one.
• If we have an array of bins $$B = b_1, b_2, ..., b_n$$, this rule applies to almost all of the bins $$b_{i+ 1} - b_{i} < 0.1 * b_{i}$$
• We applied this rule after the value of 100 barrels, as before then 10% of a value would be very negligible.
• e.g. 10% of 2 is 0.2, which would mean that the value 2 would be its own bin no matter what percentage of values is in it.
• This was done to reduce the information lost when binning the production values while making sure each bin has enough data in it for the training process

# Model Architecture

• As we wanted to have a model that can generate new timesteps given some production values, we thought a decoder architecture would be the most fitting and chose the GPT2 to be the body of the model.

• We will show how we implemented this using PyTorch

## Data Embeddings

• The first step in our model was how to transform the bins into vectors and add the time encodings, for that we chose a simple nn.Embedding layer for both time and token embeddings
class DataEmbeddings(nn.Module):
def __init__(self, config: DecoderConfig):
self.config = config
self.time_positional_encoding = nn.Embedding(MAX_TIMESTEPS, config.embed_dim)
self.token_embedding = nn.Embedding(config.num_of_bins, config.embed_dim)

def forward(self, x: torch.Tensor, pos: torch.Tensor):
token_embedding = self.token_embedding(x)
time_embedding = self.time_positional_encoding(pos)



## Decoder

• The decoder consists of many different parts that we need to define before creating the decoder itself, those things include DecoderBlock, FeedForward, and MultiHeadSelfAttention. ### Attention

• To simplify attention as much as possible, think of it as a weighted average of the features given to the transformer, where the weights are learned during training.
class AttentionHead(nn.Module):
def __init__(self, config: DecoderConfig, attention_head_dim: int) -> None:
super().__init__()


• The function scaled_dot_product_attention is the averaging step, you will notice that we pass the tensor x to the function as the key, value, and query which is why we call it self-attention.
• You will notice that the function takes causal_mask and padding_mask as arguments, these are used to ignore certain timesteps during the attention.
def safe_softmax(x, eps=1e-8):
"""Softmax modified so that we can output 0 in all elements
i.e. pay attention to 0 timesteps"""
e_x = torch.exp(x - x.max())
return e_x / (e_x.sum(dim=-1, keepdim=True) + eps)

scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(q.size(-1))

# If element is True, then we don't want to do the attention on this element.
weights = safe_softmax(scores)


• We also include multiple self-attention blocks in each decoder which is why we call it MultiHeadSelfAttention
class MultiHeadSelfAttention(nn.Module):
def __init__(self, config: DecoderConfig) -> None:
embed_dim = config.embed_dim

assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_attention_heads"
self.dropout = nn.Dropout(config.attention_dropout_prob)

self.proj_linear = nn.Linear(embed_dim, embed_dim)

x = self.dropout(self.proj_linear(x))

return x

@staticmethod
"""Create causal mask for self attention"""
causal_mask = torch.triu(torch.ones(MAX_TIMESTEPS, MAX_TIMESTEPS), diagonal=1) == 1



• Since we need to input a fixed amount of timesteps each forward pass, there will be cases where the well has less than that amount of timesteps, and we would need to pad those elements with a certain value, and then tell the transformer to ignore it.
• We ignore it through the padding_mask
• Causal mask is a lower triangular square matrix that tells the decoder which
timesteps to ignore in attention module, this is done so that the model doesn’t
cheat by looking into the future when predicting. Each element in $$C$$ has a value of 1 or 0.

• If $$C_{ij} = 1$$ this means that at timestep $$i$$ the model can observe the value at timestep $$j$$ when predicting timestep $$i + 1$$

## Feed Forward Network

• This is a simple two-layer neural network with layer normalization
class FeedForward(nn.Module):
def __init__(self, config: DecoderConfig):
self.linear_1 = nn.Linear(config.embed_dim, config.get_intermediate_dim())
self.linear_2 = nn.Linear(config.get_intermediate_dim(), config.embed_dim)
self.dropout = nn.Dropout(config.ff_dropout_prob)
self.activation = NewGELU()

def forward(self, x):
x = self.dropout(self.linear_2(self.activation(self.linear_1(x))))
return x


## Decoder Block

• Now that we have all the components of the decoder, we can just stack them up to get one decoder block as follows
class DecoderBlock(nn.Module):
def __init__(self, config: DecoderConfig):
super().__init__()
self.config = config
self.layer_norm_1 = nn.LayerNorm(config.embed_dim)
self.layer_norm_2 = nn.LayerNorm(config.embed_dim)
self.ffn = FeedForward(config)

x = x + self.attention(self.layer_norm_1(x), mask)
x = x + self.ffn(self.layer_norm_2(x))

return x


## Decoder

• Now that we have the decoder block ready, we replicate it 12 times just like what is done in the GPT2 paper and put all the components in one forward pass
class PDPDecoderTransformer(nn.Module):
def __init__(
self,
config: DecoderConfig,
):
# transforms each bin into a vector of size config.d_model
self.data_embeddings = DataEmbeddings(config)
# decoder blocks
decoder_block = DecoderBlock(config)
self.decoder_blocks = _clones(decoder_block, config.num_decoder_layers)
self.dropout = nn.Dropout(config.ff_dropout_prob)
self.layer_norm = nn.LayerNorm(config.embed_dim)

def forward(
self,
x: torch.Tensor,
):
B = x.size(0)
pos = torch.arange(0, MAX_TIMESTEPS, device=x.device).long().unsqueeze(0).repeat(B, 1)
x = self.data_embeddings(x, pos)
for i, decoder_block in enumerate(self.decoder_blocks):
x = self.layer_norm(x)



## Loss Functions

### Cross Entropy Loss

• Cross entropy is the loss function used in most classification problems, it has a formula of :

$$L(y, p) = -1 \frac{1}{n} \sum_{i = 1}^n \sum_{c = 1}^C y_{ij} \log (p_{ij})$$

• We wanted to guide the model that some bins were more similar to each other, e.g. if the correct bin is 5 and you predict bin 10, you should have a higher loss than somebody who predicted bin 6.

• So we are proposing different loss functions

### Gaussian Cross Entropy Loss

• This is very similar to cross entropy but with a slight modification. $$y$$ is now a gaussian centered around the index of the correct bin, instead of a one-hot vector.
• The formula is still
$$L(y, p) = -1 \frac{1}{n} \sum_{i = 1}^n \sum_{c = 1}^C y_{ij} \log (p_{ij})$$
but $$y$$ represents a different vector.
STD_DEV = 0.27
# config.num_of_bins = 10
gaussians = torch.zeros(config.num_of_bins, config.num_of_bins, device=device)
x = torch.arange(config.num_of_bins, device=device)
for mean_around in range(config.num_of_bins):
g_x = torch.exp(-((x - mean_around) ** 2) / (2 * STD_DEV ** 2))
g_x = g_x / torch.sum(g_x)
gaussians[mean_around] = g_x
# you can then access the gaussian vector through indexing the tensor
print(gaussians)
# tensor([0.0000e+00, 1.5516e-27, 1.2142e-12, 1.0481e-03, 9.9790e-01, 1.0481e-03, 1.2142e-12, 1.5516e-27, 0.0000e+00, 0.0000e+00])


• This way the model understands that it should have a lower loss if the bin predicted is close to the correct bin, instead of just treating close predictions the same as predictions far from the correct bin.

### Custom Ordinal Cross Entropy Loss

• This adds a factor that also penalizes the model more if the bin is farther from the correct index.

• It has a formula of

$$L(y, p) = -1 \frac{1}{n} \sum_{i = 1}^n (idx_{predicted} - idx_{correct})^2 \sum_{c = 1}^C y_{ij} \log (p_{ij})$$

After trying all of these approaches, they provided similar results, so we went with the regular cross entropy.

## Inference

### Predict Max Bin

• This was the naive approach that we decided on earlier in the experiment, we would pick the bin that had the highest probability of being correct.

### Weighted Average

In this method, we need to define two arrays

• pred : shape (1, # bins) for a single example, which we will refer to
as p where $$p_i = p(y = bin_i)$$
• bin representation values: shape (1, # bins), which we will refer to
as r, where $$r_i$$ is the value we use to represent $$bin_i$$

We take the weighted average of the representation value of the top k bins, where the weights are the probabilities of each of the top k bins

e.g. if $$r = [100, 500, 1000, 2000, 5000], p = [0.25, 0.45, 0.15, 0.1, 0.05]$$ and k = 2
our prediction would be $$y = \frac{p2∗r2+p1∗r1}{p1+p2} = \frac{0.45∗500+0.25∗100}{0.45+0.25} = 357.14$$

But we added a condition before doing this, we will take the top k bins or
until the sum of the first n bins reaches 50%
e.g. if k = 3 and $$p = [0.25, 0.45, 0.15, 0.1, 0.05]$$, we would only take bins 1 and 2, as $$p1 + p2 > 0.5$$
You can tune the percentage that you stop before depending on your problem.

The second approach was much better than the first one especially in the later month as the model error accumulates and the model is less sure of its predictions

## Results

• The model's performance was indicative of actual learning and sucessful training but it did has not outperformed the current model at Raisa. • Here are some examples of model predictions.    ## Conclusion

• During our experiment, we used the transformer in a different domain in which it is usually used.
• We found out that with a lot of tweaks the model shows signs of learning, but it still doesn't outperform the current model at Raisa
• The upside of this is that it is easier to scale up the transformer, we plan on experimenting with multivariate classification.
• We would make the model predict the oil and gas data simultaneously and make use of both features.
• This experiment could give the transformer the edge it needs in order to surpass the performance of the current model.