Predicting the weather is a difficult task that involves statistics and heavy data processing. Nowadays, Machine Learning plays a crucial role in weather forecasting, although it is within numerical models where ML is most often applied.
This article explains a project I made to predict tomorrow’s gradient of maximum temperatures from today’s gradient using Deep Learning. The model used is called Pix2Pix, a variant of GANs used to generate images from other images (link). It works as follows: we feed in an image of today’s gradient of max temperatures in the US, and the model returns an image (the prediction) of tomorrow’s gradient of max temperatures.
Before talking about the project, the data used and the way the model was trained, we should talk about the model itself.
What is Pix2Pix?
Pix2Pix is a variant of a more general model called the Generative Adversarial Network (GAN). Within this model we have two networks: the Generator and the Discriminator. The Generator must learn to produce fake images in order to fool the Discriminator, whose task is to say whether an image is real or fake (i.e., produced by the Generator). In other words, the Generator generates new images and the Discriminator discriminates between them; this is where their names come from.
But how does the Generator generate new images? Suppose we want to generate flowers. To do so, we need a dataset of images of the flowers we want our neural net to generate. From this dataset we create a second one by adding random noise to the same images; instead of flowers we now have blurred versions of them. During training we pass in the real image of the flower together with the blurred one, and from the noisy image the Generator tries to produce an image that resembles the real one. The generated image is then passed to the Discriminator, which labels it as real or fake. This is how the model is trained.
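The adversarial loop described above can be sketched on a toy 1-D problem. This is a conceptual illustration only (pure NumPy, no images): the affine generator, the logistic discriminator and the learning rate are illustrative choices, not the actual Pix2Pix setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Discriminator D(x) = sigmoid(w*x + c): outputs ~1 for "real", ~0 for "fake"
w, c = 0.1, 0.0
# Generator G(z) = a*z + b: turns noise z into a sample
a, b = 1.0, 0.0

lr = 0.01
for step in range(2000):
    x = rng.normal(4.0, 1.0)   # one "real" sample, drawn from N(4, 1)
    z = rng.normal()           # noise
    g = a * z + b              # one "fake" sample

    # --- Discriminator step: ascend log D(x) + log(1 - D(g)) ---
    s_real = sigmoid(w * x + c)
    s_fake = sigmoid(w * g + c)
    w += lr * ((1 - s_real) * x - s_fake * g)
    c += lr * ((1 - s_real) - s_fake)

    # --- Generator step: ascend log D(G(z)) (non-saturating loss) ---
    g = a * z + b
    s_fake = sigmoid(w * g + c)
    a += lr * (1 - s_fake) * w * z
    b += lr * (1 - s_fake) * w

# After training, the generator's samples should drift toward the real mean
samples = a * rng.normal(size=1000) + b
print(float(samples.mean()))
```

The two update steps push in opposite directions, which is exactly the competition the next paragraph refers to.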
Notice also that the two tasks described above compete against each other: the networks are adversarial. This is what gives GANs their name.
We are Data Scientists; we want to know more about the architecture of this model. Internally, as stated above, it has two networks. Let’s focus on the Generator. You are probably familiar with Convolutional Neural Networks, CNNs (if not, read this article). A CNN takes an image as input and processes it with convolutional layers until we are left with as many probabilities as there are classes in our classification model; in other words, we go from an image to numbers. We can do the inverse with a so-called Deconvolutional Neural Network (DNN), which receives numbers as input and produces an image as output. The Generator is built from these two pieces.
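To make the image-to-numbers-to-image idea concrete, here is a sketch of how the spatial resolution evolves through such a generator, assuming stride-2 4×4 convolutions with padding 1 on the way down and matching transposed convolutions on the way up (a common choice in Pix2Pix implementations; treat the exact values as an assumption):

```python
def conv_out(size, kernel=4, stride=2, pad=1):
    # Output width of a strided convolution layer
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel=4, stride=2, pad=1):
    # Output width of the matching transposed ("deconvolution") layer
    return (size - 1) * stride - 2 * pad + kernel

# Encoder: 256x256 image collapses to a 1x1 bottleneck
size = 256
encoder = [size]
while size > 1:
    size = conv_out(size)
    encoder.append(size)

# Decoder: the bottleneck is expanded back to 256x256
decoder = [size]
while size < 256:
    size = deconv_out(size)
    decoder.append(size)

print(encoder)  # [256, 128, 64, 32, 16, 8, 4, 2, 1]
print(decoder)  # [1, 2, 4, 8, 16, 32, 64, 128, 256]
```

The encoder halves the resolution at every layer until the image is reduced to numbers, and the decoder doubles it back, which matches the 256×256 maps used later in this project.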
If you want to know more about this model you can read the paper.
With this picture of how the model works and trains in mind, let’s focus on the project I made.
Predicting the weather and restoring maps using Pix2Pix
This is a two-in-one project, because the same model was used both to predict the weather and to restore the images the predictor gives as output. More precisely, the model was trained twice: first on a dataset of images generated from NOAA’s temperature records, and then on the images the first model gives as output.
Both projects will be explained but in different sections.
Before getting into the two projects, the data processing deserves an explanation. I obtained 3.5 GB of data from NOAA (https://www.noaa.gov), containing meteorological data recorded by stations around the world. The records were organized by month, spanning from 1840 to the present day. Being this much data, a massive cleaning was needed, since there were erroneous records, records that did not match the date on which they were registered, and so on.
For the cleaning process I used C#. The first step was to isolate the variable of interest (maximum temperature), which reduced the dataset to 2 GB. Then, since the data came from many weather stations, I took the latitude and longitude of each one and plotted them to see where the data was densest (each pixel is a station):
As you can see, the USA has the most stations, so I restricted the TMAX dataset to US stations, reducing it to 1.6 GB; all of this was also done in C#. The next step was to split the records into station (with its latitude and longitude), year, month, day and the temperature for that day. This expanded the dataset to 6 GB, so it was loaded into a database to handle it more easily. There, using queries, I kept cleaning: I removed more inconsistent records and restricted the temperatures to a plausible range, since many readings exceeded the all-time maximum or fell below the historical minimum, which suggests bad measurements on those days. I therefore capped the records at 52 °C and -18 °C, the historical extremes of the USA.
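The cleaning itself was done in C# and SQL queries, which are not shown here, but the range restriction can be sketched in a few lines of Python (the record layout is hypothetical):

```python
# Hypothetical records: (station_id, date, tmax_in_celsius)
records = [
    ("US1", "19800101", 12.5),
    ("US2", "19800101", 61.0),   # above the 52 °C cap  -> dropped
    ("US3", "19800101", -25.0),  # below the -18 °C floor -> dropped
    ("US4", "19800101", 30.1),
]

# Historical US extremes used as plausibility bounds in the article
TMAX_HI, TMAX_LO = 52.0, -18.0

clean = [r for r in records if TMAX_LO <= r[2] <= TMAX_HI]
print(len(clean))  # 2
```

Dropping (rather than clipping) the out-of-range readings avoids injecting artificial extreme values into the maps used for training.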
Once all the cleaning was done, again using C#, only the data from 1980 to 2019 were kept, and the TMAX value reported by each station was mapped to a color on a map to represent the temperature gradient. The final result looks like this (256×256 image):
The colored lines along the sides are a color code representing the day and the year: the horizontal bar encodes the day (1 to 365) and the vertical bar the year. This, in some way, helps the first project’s next-day prediction.
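The article does not specify how day and year are turned into colors, but a minimal version of such an encoding could be a linear mapping to pixel intensity (the function name and the grayscale scheme here are assumptions for illustration):

```python
def encode(value, lo, hi):
    # Map a value in [lo, hi] linearly to a 0-255 pixel intensity
    return round(255 * (value - lo) / (hi - lo))

day_intensity = encode(183, 1, 365)        # mid-year day
year_intensity = encode(2000, 1980, 2019)  # year within the kept range
print(day_intensity, year_intensity)
```

Any injective mapping would work; the point is that the model can read the date context directly off the image borders.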
The purpose of the first project was to use the Pix2Pix model to predict tomorrow’s weather based on today’s temperature map. In this way, the input was today’s map and the target was tomorrow’s map.
The main motivation for using Pix2Pix to predict the weather was curiosity: to check whether this architecture, designed to translate images into images, can also be adapted to infer new data from existing data. That is, to verify whether the model can turn a real image into another real image (or at least something close to it), rather than into a hypothetical one.
Both images live in the same folder, since the target of one iteration is the input of the next. Each carries a name with the format yyyymmdd-xxxxxx.png, where the last six digits are the image’s number in the dataset.
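Given that naming scheme, input/target pairs can be built by matching each file with the file dated one day later. A small sketch (the file list is hypothetical; note that a gap in the dates produces no pair):

```python
from datetime import datetime, timedelta

# Hypothetical listing of yyyymmdd-xxxxxx.png files
files = ["19800101-000001.png", "19800102-000002.png", "19800104-000003.png"]

by_date = {f[:8]: f for f in files}

pairs = []
for name in files:
    day = datetime.strptime(name[:8], "%Y%m%d")
    nxt = (day + timedelta(days=1)).strftime("%Y%m%d")
    if nxt in by_date:  # only pair truly consecutive days
        pairs.append((name, by_date[nxt]))

print(pairs)  # [('19800101-000001.png', '19800102-000002.png')]
```

Using `datetime` arithmetic instead of incrementing the string handles month and year boundaries correctly.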
Training was done with 3,200 images, 20% of which were held out for testing.
The results were the following:
The conclusion drawn from this project is that it generated predicted maps quite close to reality, although we believe that with more epochs and a larger training dataset the results would have been more reliable.
In the second project, the idea was given a twist: the aim was to make the output of project 1 look more like a real maximum-temperature map of the kind used in meteorology. In this way, climatologists could use the model to improve the maps generated from their data.
1,000 images from the dataset were taken and a Photoshop ‘action’ was applied to retouch each image and leave it as presentable as possible:
Clearly, the new target will be an improved image.
The model was trained on these 1,000 images (20% held out for testing). The results were the following:
As we can see, it manages to reconstruct the image accurately. This is, without a doubt, very useful for applications in weather forecasting and in improving or customizing maps.
Here is a gif of the recreated outputs:
It is concluded that the model can be used to infer data from data, even though it was not intended for this type of application.
In turn, the same model made it possible to improve and adapt a map, bringing it closer to the kinds of images used in meteorology.
All the results shown in this article can be found in the project’s GitHub repository.