Completing historical temperature records

Based on public data from LACA&D, we completed the historical temperature record of the Azapa (Chile) weather station between 1977 and 1980.

Gabriel Naya | Kreilabs | 10/6/2019

My intention in this exercise was to locate a dataset on the climate of Uruguay, review the evolution of the daily average temperatures registered at different points of the country and, applying artificial intelligence, complete the records wherever data was missing.

I could not find a public dataset for Uruguay, but I was able to access data published by the Latin American Climate Assessment & Dataset (LACA&D), obtaining TXT files from the site http://lacad.ciifen.org/ES/.

Problem Formulation

The exercise consists of narrowing the dataset down to the weather stations of one region (Chile) and looking for files that report records over similar periods of time (in years).

Once the data from at least six weather stations have been obtained, we select one of them as the target and try to find a way to predict the values of its missing records.

The dataset

Records were pre-selected and obtained from the following stations:

STAID STANAME CN LAT LON HGHT
553 CAQUENA CL -18:03:15 -69:12:06 4400
557 CHUNGARA AJATA CL -18:14:07 -69:11:00 4585
562 PARINACOTA EX ENDESA CL -18:12:15 -69:16:06 4420
564 GUALLATIRE CL -18:29:54 -69:09:17 4240
565 CHILCAYA CL -18:47:38 -69:05:04 4270
584 PACOLLO CL -18:10:37 -69:30:33 4185
585 PUTRE CL -18:11:57 -69:33:37 3545
587 PUTRE (DCP) CL -18:11:42 -69:33:32 3560
588 LLUTA CL -18:24:37 -70:10:09 290
589 MURMUNTANE CL -18:21:07 -69:33:07 3550
595 ARICA OFICINA CL -18:28:39 -70:19:15 20
596 AZAPA CL -18:30:56 -70:10:50 365
597 U. DEL NORTE CL -18:29:00 -70:17:37 55
598 EL BUITRE AERODROMO CL -18:30:43 -70:17:03 110
599 CHACA CL -18:49:01 -70:09:00 350
600 CODPA CL -18:49:56 -69:44:38 1870

Among them, after analyzing the reported periods and the quality of the recorded data, the following stations, located in northern Chile close to the border with southern Peru, were finally selected to provide the input information:

STAID STANAME CN LAT LON HGHT
553 CAQUENA CL -18:03:15 -69:12:06 4400
585 PUTRE CL -18:11:57 -69:33:37 3545
588 LLUTA CL -18:24:37 -70:10:09 290
595 ARICA OFICINA CL -18:28:39 -70:19:15 20
597 U. DEL NORTE CL -18:29:00 -70:17:37 55

And the AZAPA weather station (596) was selected as the target.

STAID STANAME CN LAT LON HGHT
596 AZAPA CL -18:30:56 -70:10:50 365

Basically, all files have a date (DATE), a source id (SOUID), the temperature in tenths of a degree Celsius (TG) and a data quality indicator (Q_TG), where Q_TG=0 indicates a real measurement and Q_TG=9 indicates a manually completed value.

We are going to import the date as the index, keeping only the TG column, creating one data frame per station and joining them on the date index:
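A minimal sketch of this import step, assuming the file naming, header length and column names shown below (all of them assumptions to be adjusted to the actual LACA&D TXT files):

import pandas as pd

# Hypothetical station ids and short names; file names follow the
# TG_STAID<id>.txt pattern assumed here.
stations = {553: 'CAQUENA', 585: 'PUTRE', 588: 'LLUTA',
            595: 'ARICA', 597: 'U_NORTE', 596: 'AZAPA'}

frames = []
for staid, name in stations.items():
    df = pd.read_csv(f'TG_STAID{staid:06d}.txt',
                     skiprows=20,            # length of the descriptive header (assumed)
                     skipinitialspace=True)
    df['DATE'] = pd.to_datetime(df['DATE'], format='%Y%m%d')
    df = df.set_index('DATE')[['TG']]        # date as index, TG column only
    frames.append(df.rename(columns={'TG': name}))

# Join all the stations on the date index
data = frames[0].join(frames[1:], how='outer')
print(data.head())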

At this point we have a data frame with the integrated dataset, ready to start trying different approaches to a solution. Let's go for it!

Data pre-view and feature engineering

The first thing we are going to do is mark as null the records that have TG=-9999.

Then we display the number of null values per column:
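A minimal sketch of both steps, assuming the joined data frame is called data:

import numpy as np

# -9999 is the missing-value sentinel in these files
data = data.replace(-9999, np.nan)

# Null values per column
print(data.isnull().sum())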

If we analyze the missing information by period in a heatmap:
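One possible way to draw that heatmap (the use of seaborn here is an assumption):

import matplotlib.pyplot as plt
import seaborn as sns

# Valid data shows up as a black background, missing values as light stripes
plt.figure(figsize=(12, 6))
sns.heatmap(data.isnull(), cbar=False, cmap='gray')
plt.show()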

We can observe that the information (black background) is quite compact in the period October 1976 to December 1984.

After dropping the null records across the whole dataset, we are left with 1477 records, spanning 12/28/1976 to 11/21/1984.
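The corresponding pandas step might look like this:

# Keep only the rows where every station reported a value
data = data.dropna()
print(len(data), data.index.min(), data.index.max())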

Displaying the data by year:
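A quick sketch of that breakdown:

# Number of remaining records per year, as a sanity check of the coverage
print(data.groupby(data.index.year).size())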

Now let's move on to the approaches to the solution.

Strategy

Before approaching a problem, I always like to design a strategy. The strategy may already be in my head, but before fully developing it, it is very useful to make it explicit and come back to it later. Even though we know that, as we explore solutions, new elements will emerge that pull us away from the original idea and make it vary, it is good to write it down before starting and to revise it at the end of the work.

For this exercise, my original idea was very modest: I started it as a Saturday morning exercise to keep myself in shape, and it really challenged me more than I expected:

– Let's see: this is a regression problem, we must obtain the temperature of one of the stations, so let's try regression algorithms and the job will surely be done.

– The data source is homogeneous, so we should not need to do much feature engineering.

– The dataset is large enough to expect good results; it does not present imbalances across data periods or gaps in certain months, the kind of issues that would lead us to think about data balancing.

– If the regression algorithms are not sufficient with the columns of the other stations, the date can be used as an additional feature to add seasonal information.

– If the algorithms are still not sufficient, a recurrent neural network (RNN) of the LSTM type can be explored.

Let’s go for a regression algorithm

Before testing a set of regression algorithms, we add two features to the dataset: the sine and cosine of the "day" of the year, treated as an angle that runs from 0 to 360 degrees, taking January 1st as day 1 and counting up to 359. The last days of the year (greater than or equal to 359) are set to zero to simplify the conversion.

Remember that numpy's trigonometric functions take their input in radians, so we convert the day angle from degrees to radians before obtaining the sine and cosine:
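A minimal sketch of these two features, following the day-of-year scheme described above:

import numpy as np

# Day of the year treated as an angle in degrees (days >= 359 wrap to 0,
# following the simplification described above)
day = data.index.dayofyear.values.astype(float)
day[day >= 359] = 0

# numpy trigonometry works in radians, so convert before taking sine and cosine
angle = np.radians(day)
data['day_sin'] = np.sin(angle)
data['day_cos'] = np.cos(angle)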

If at any point we use these features, we will have to scale the rest of the columns so that all the information lies in a homogeneous range and does not distort the internal behaviour of the algorithms or networks we use.

Preparing the data to feed a regression model:
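A hedged sketch of this step, assuming the column names used earlier, that the seasonal sin/cos features are included, and an 80/20 train/test split; the "RMSLE score" and accuracy functions of the original are not shown, so the sketch reports RMSE and R² instead:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Features: the five supporting stations plus the seasonal sin/cos columns
X = data[['CAQUENA', 'PUTRE', 'LLUTA', 'ARICA', 'U_NORTE', 'day_sin', 'day_cos']]
y = data['AZAPA']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Gradient boosting regressor with default hyperparameters
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

for label, Xs, ys in [('train', X_train, y_train), ('test', X_test, y_test)]:
    pred = model.predict(Xs)
    rmse = np.sqrt(mean_squared_error(ys, pred))
    print(f'{label}: RMSE = {rmse:.2f}, R2 = {model.score(Xs, ys):.3f}')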

After applying GradientBoostingRegressor from sklearn.ensemble, we obtain the following results:

Gradient RMSLE score on train data:

5.4432688838663195

Accuracy -> 98.16939124093663

Gradient RMSLE score on test data:

16.675179454983343

Accuracy -> 63.10001902521414

This means that, as a first approximation, an interesting accuracy is obtained on the training data (which can be taken as a baseline to improve upon); however, on the test data the precision of the prediction collapses: overfitting.

Let's try applying KFold cross-validation to our algorithm. We try with 10 and 5 folds, and in both cases the score is very low, which tells us that, no matter how much precision we get in training, when the model is applied to the held-out test data the accuracy is going to be poor.
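A sketch of that cross-validation, assuming shuffled folds and the default R² scoring of cross_val_score:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

# Cross-validated score with 10 and 5 folds
for n_splits in (10, 5):
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_val_score(GradientBoostingRegressor(random_state=42),
                             X, y, cv=cv)
    print(f'{n_splits} folds: mean score = {scores.mean():.3f}')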

Indeed, once we obtain the cross-validation accuracy, we confirm that we are doing very badly. The algorithm trains well on this dataset, but it gets lost when we take it to the test dataset, or if we were to put it into production.

We could apply different techniques to try to improve this, but luckily there are alternative strategies, and this baseline leads us to try directly with an LSTM network to see whether we get better results.

LSTM neural network

First let’s scale the data using MinMaxScaler:
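A minimal sketch, scaling the AZAPA target column (the original may well have scaled every column in the same way):

from sklearn.preprocessing import MinMaxScaler

# Scale to the [0, 1] range before feeding the network
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_azapa = scaler.fit_transform(data[['AZAPA']].values.astype('float32'))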

We build an LSTM network with 4 units and a look-back window of 2 elements, and we train it for 100 to 150 epochs with a batch_size of 3.
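A hedged sketch of this setup, assuming a univariate look-back over the scaled AZAPA series and a 67/33 train/test split (the original may instead have fed the neighbouring stations as features):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from sklearn.metrics import mean_squared_error

def create_dataset(series, look_back=2):
    # Build (samples, look_back) inputs and the next-step value as target
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back, 0])
        y.append(series[i + look_back, 0])
    return np.array(X), np.array(y)

look_back = 2
train_size = int(len(scaled_azapa) * 0.67)      # split ratio assumed
train, test = scaled_azapa[:train_size], scaled_azapa[train_size:]
X_train, y_train = create_dataset(train, look_back)
X_test, y_test = create_dataset(test, look_back)

# The LSTM expects inputs shaped (samples, time steps, features)
X_train = X_train.reshape((X_train.shape[0], look_back, 1))
X_test = X_test.reshape((X_test.shape[0], look_back, 1))

model = Sequential([
    LSTM(4, input_shape=(look_back, 1)),        # 4 units, as described above
    Dense(1)
])
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs=100, batch_size=3, verbose=2)

# Evaluate in the original units (tenths of a degree) by inverting the scaling
train_pred = scaler.inverse_transform(model.predict(X_train))
test_pred = scaler.inverse_transform(model.predict(X_test))
train_rmse = np.sqrt(mean_squared_error(
    scaler.inverse_transform(y_train.reshape(-1, 1)), train_pred))
test_rmse = np.sqrt(mean_squared_error(
    scaler.inverse_transform(y_test.reshape(-1, 1)), test_pred))
print(f'Train: {train_rmse:.2f} RMSE  Test: {test_rmse:.2f} RMSE')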

We obtain the following accuracy (4.71 tenths of a degree on the daily mean is reasonable for an initial attempt):

Train: 6.56 RMSE

Test: 4.71 RMSE

Training set and predictions:

Testing set and predictions:

We changed the short-term memory window to 5 look-back records and got values that did not improve the result; if anything, things seem to have worked a little worse:

Train: 6.72 RMSE

Test: 5.04 RMSE

We then modify the structure of our neural network, increasing the number of units from 4 to 6, and this time the accuracy changes and improves substantially (although training the network is a little slower):

Train: 0.58 RMSE

Test: 0.46 RMSE

Now we plot the training set and the predictions again:

(it looks like a single line but it really is two, believe me!)

Once the definitive model has been obtained, it must be applied to the missing records of the 596-AZAPA weather station, marking them with a "9" in the Q_TG column to indicate that the data was generated artificially and not taken from the actual original records.
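A hedged sketch of that final step; azapa and predicted are hypothetical names for the original station frame (with TG and Q_TG columns, missing values already set to NaN) and for the model predictions, already inverse-transformed back to tenths of a degree and indexed by date:

# Dates with no real measurement but for which the model produced a prediction
missing_dates = azapa.index[azapa['TG'].isnull()].intersection(predicted.index)

# Fill in the gaps and flag them as artificially completed values
azapa.loc[missing_dates, 'TG'] = predicted.loc[missing_dates].round().astype(int)
azapa.loc[missing_dates, 'Q_TG'] = 9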

Summary

Once the final dataset to feed the predictive models has been obtained, it is good to have a strategy, alternatives and an idea in mind; but it is also good to survey the different baselines and see what preliminary results each one gives, respecting the rule of always going from the simplest to the most complex.