Using Machine Learning to Understand Climate Change
In the two previous articles we explored the National Oceanic and Atmospheric Administration (NOAA) Global Surface Summary of the Day (GSOD) weather data available in the BigQuery public datasets and identified temperature anomalies based on the methods that scientists use.(1)(2) We then created visualizations in Power BI and looked at station distribution and anomaly variance over time. In this article we will combine World Bank greenhouse gas emissions and area population data with the NOAA climate and anomaly dataset, then use the random forest regression machine learning algorithm to find the best indicators for predicting anomalies. Finally, we will look at the relationships between some of those variables and the anomaly values.
The World Bank data is also available as a BigQuery public dataset. There are several different World Bank datasets in BigQuery; for this article we will focus on the average total greenhouse gas emissions for the country where each station is located. The world_bank_wdi dataset(3) shows the total greenhouse gas emissions by country for several years, and its series_summary table provides more information about how the total is calculated. Another dataset we will use to enrich our data is the population_grid_1km table. This table (last updated in January 2020) holds the total population for each one-kilometer grid cell around the world and includes 218,986,920 rows for that date.
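As a rough sketch of how the emissions figure can be pulled from BigQuery in the notebook, a query along the following lines averages the reported yearly totals by country. The table name, column names, and WDI indicator code here are assumptions and should be checked against the dataset's schema and series_summary table:

# Sketch: average total greenhouse gas emissions by country from the
# world_bank_wdi public dataset. Table, column, and indicator names are
# assumptions; verify them against the dataset documentation.
from google.cloud import bigquery

client = bigquery.Client()

wdi_query = """
SELECT
  country_code,
  AVG(value) AS avg_total_ghg_emissions
FROM `bigquery-public-data.world_bank_wdi.indicators_data`
WHERE indicator_code = 'EN.ATM.GHGT.KT.CE'  -- assumed code for total greenhouse gas emissions
  AND value IS NOT NULL
GROUP BY country_code
"""

ghg_by_country = client.query(wdi_query).to_dataframe()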
The 29,567 historical stations will be joined to the population_grid_1km table wherever a grid cell lies within 10 kilometers of a station, which returns multiple rows per station; the populations are then summed to give the total population within that radius. Using the BigQuery geography functions, ST_DWITHIN can be used in the join condition to test whether the station location is within 10 kilometers of a 1-kilometer grid cell.(4) The world_bank_wdi total greenhouse gas emissions column is added as an average value by country, joining to the NOAA data on the country in which the station is located.
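A sketch of the population join is shown below, reusing the BigQuery client from the previous snippet. The exact table and column names, especially those of the population grid table, are assumptions, but the pattern of building a point with ST_GEOGPOINT, filtering with ST_DWITHIN at a 10,000 meter radius, and summing the grid populations per station is the approach described above:

# Sketch: total population within 10 km of each station using ST_DWITHIN.
# Table and column names are assumptions; verify against the public datasets.
population_query = """
SELECT
  s.usaf,
  s.wban,
  SUM(p.population) AS population_within_10km
FROM `bigquery-public-data.noaa_gsod.stations` AS s
JOIN `bigquery-public-data.worldpop.population_grid_1km` AS p
  ON ST_DWITHIN(
       ST_GEOGPOINT(s.lon, s.lat),
       ST_GEOGPOINT(p.longitude_centroid, p.latitude_centroid),  -- assumed grid-cell centroid columns
       10000)  -- 10 kilometers, expressed in meters
GROUP BY s.usaf, s.wban
"""

station_population = client.query(population_query).to_dataframe()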
After applying the logic discussed in article two and adding the average total greenhouse gas emissions and population columns, the dataset includes 4,276,271 rows. Each row is a unique combination of station, country, month, and year. Where a station is located within 10 kilometers of another country, it is assessed with both countries' average yearly total greenhouse gas emissions values. The columns analyzed in this article are: longitude and latitude, year, month, elevation, average temperature, average dew point, average station pressure, average wind speed, average wind gust speed, average precipitation, average snow depth, average fog indicator value, average rain or drizzle indicator value, average snow or ice pellet indicator value, average hail indicator value, average thunder indicator value, average tornado or funnel cloud indicator value, average total yearly greenhouse gas emissions, and total population within a 10 kilometer (6.2 mile) radius of the station. The indicator variables are binary; every other variable is the average of the actual measured values.
Since analyzing a large dataset takes time and compute resources, for this article we will use a Jupyter Notebook with Python 3. The Google Cloud Platform offers integrated Jupyter Notebooks through the AI Platform managed service. The service provides an easy-to-deploy instance where the language, preinstalled libraries, and custom Compute Engine settings can be configured as needed. Billing is based on resource usage, and the notebook instance can be stopped when not in use. So far we are still under the 1 terabyte BigQuery free tier, so the only billed resource is the Compute Engine cost, which depends on the machine configuration. The default configuration costs around $0.133 per hour. Accounting for the data volume that will be processed, we set up a memory-efficient instance with 8 vCPUs and 64 GB of RAM. This configuration allows the random forest model to process the data and only raises the cost to $0.362 per hour.(5)
Using the scikit-learn library, a random forest regression can be performed on the data.(6) The random forest algorithm builds many decision trees, each using a random subset of the features in our dataset, and averages their results for the prediction. Using the feature importance property, the algorithm can be applied to the training data to provide a list of the most important variables used in the predictive model. Before the algorithm can be applied, the data needs to be formatted into arrays and the null values need to be replaced. The NumPy library can be imported and used to transform the data into arrays. For this study the null values are replaced with zero, and the data is split into 75% training and 25% testing.
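A minimal sketch of this preparation and training step is shown below. The DataFrame name df, the anomaly column name, and the random_state values are assumptions for illustration, but the zero-fill of nulls, the 75/25 split, and the RandomForestRegressor call follow the steps described above:

# Sketch: prepare arrays, split 75/25, and fit a random forest regression.
# The DataFrame `df` and its column names are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

features = df.drop(columns=["anomaly"])            # predictor columns
feature_names = list(features.columns)

X = np.nan_to_num(features.to_numpy(dtype=float))  # replace nulls with zero
y = np.nan_to_num(df["anomaly"].to_numpy(dtype=float))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)         # 75% train / 25% test

model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)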
Using 100 estimators, the resulting model performed with an R² value of 0.81 and a mean absolute error of 1.11 degrees. That means the features, the columns used to predict the temperature anomaly value, account for 81% of the variance in the predicted value, and the average difference between the predicted and actual values is ±1.11 degrees. The top ten features are: latitude, longitude, average temperature, month, year, elevation, average dew point, average wind speed, average population, and average visibility.
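The metrics and the feature ranking can be read off the fitted model along these lines, a sketch that reuses the names from the previous snippet:

# Sketch: score the model and rank the features by importance.
from sklearn.metrics import mean_absolute_error, r2_score

predictions = model.predict(X_test)
print("R2:", r2_score(y_test, predictions))
print("MAE:", mean_absolute_error(y_test, predictions))

# Pair each feature with its importance and list the top ten.
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:10]:
    print(f"{name}: {importance:.3f}")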
To visualize the random forest regression trees, we will limit the depth to three levels and look at one of the trees as an example. The example tree below shows each node and the requirement for each decision that is made; these requirements are called decision rules.
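One way to produce such a truncated view is scikit-learn's plot_tree helper, limiting the plot to three levels. This is a sketch that reuses the fitted model from the earlier snippet; the figure size and the choice of the first tree are arbitrary:

# Sketch: draw one of the 100 trees, truncated to three levels.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(model.estimators_[0],        # first tree in the forest
          max_depth=3,                 # only show the top three levels
          feature_names=feature_names,
          filled=True,
          ax=ax)
plt.show()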
The actual predictive model uses 100 trees whose decision nodes are built from random subsets of the features, and there is no limit on the number of decision rules applied before a value prediction is made. It is hard to imagine the number of decision rules the predictive model works through to determine the anomaly value. Because the model is a regression model, each temperature anomaly value has to be predicted as a continuous number, which created some issues when using the model to predict the anomaly value. In future models, it may be more practical to bin the temperature anomaly values and use the random forest classifier algorithm, as sketched below.
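As a rough illustration of that idea, the anomalies could be digitized into classes and fed to a RandomForestClassifier. The bin edges below are arbitrary assumptions, and the X and y arrays come from the earlier snippet:

# Sketch: bin the anomaly values and train a classifier instead.
# The bin edges are arbitrary assumptions for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

bins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])     # degree thresholds
y_binned = np.digitize(y, bins)                  # class label per anomaly

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X, y_binned, test_size=0.25, random_state=42)

classifier = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
classifier.fit(Xc_train, yc_train)
print("Accuracy:", classifier.score(Xc_test, yc_test))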
We again used Power BI to develop a visualization of temperature anomalies over time. The visualization below shows the absolute value of the average temperature anomalies by year, latitude, and longitude rounded to the nearest whole number, with color indicating the average temperature anomaly value. It illustrates the relationship between the year and the temperature anomaly value. Based on the example decision tree, if the year is less than or equal to 1997.5, the temperature anomaly value will be lower, except when the latitude is greater than 42.3 and the average temperature is less than –1.4.
Looking at the same chart by the average temperature anomaly rather than its absolute value, the same pattern holds: there are more positive anomalies after 1997 than before.
In the visualization below, we see that latitudes below 42.3 degrees north do have fewer positive values than latitudes above 42.3 degrees north. The longitude visualization is interesting because there is a split at the 0-degree Prime Meridian. It still supports the example decision tree, with fewer positive values above 93.4 degrees east.
By analyzing one of the many datasets that scientists use to measure climate change, we saw that temperature anomaly values have increased over the years. We also saw how climate data collection grew as more stations were added across the world over the decades. The added features produced some interesting results: the greenhouse gas emissions variable we joined to the NOAA GSOD dataset did not appear in the top ten features, but the population value did. We were able to identify the top ten features and visualize some of the relationships between them and the temperature anomalies, and we completed the analysis at minimal cost using the Google Cloud Platform.
In conclusion, we hope this article series has sparked some curiosity and thought. There are many other interesting aspects of climate change and climate data analysis worth exploring. One interesting area of climate study is paleoclimatology, which derives data "from natural sources such as tree rings, ice cores, corals, and ocean and lake sediments".(7) It might be interesting to see how climate and temperature have changed beyond the roughly 100 years covered by station records. The dataset is vast and available on the NOAA National Centers for Environmental Information (NCEI) website.(7)
References:
(1) Carbon Brief, "Explainer: How do scientists measure global temperature?"
(2) NASA Earth Observatory, "World of Change: Global Temperatures" (nasa.gov)
(3) Google Cloud Platform Marketplace, "World Development Indicators (WDI)"
(4) Google Cloud, "Geography functions in Standard SQL," BigQuery documentation
(5) Google Cloud, "Notebooks" (AI Platform)
(6) scikit-learn, "sklearn.ensemble.RandomForestRegressor," scikit-learn 0.24.2 documentation
(7) NOAA National Centers for Environmental Information (NCEI), formerly the National Climatic Data Center (NCDC), "Paleoclimatology Data" (noaa.gov)