Do statistical outliers from weather data help predict severe weather events?

A warmer atmosphere can hold more water vapor, about 7% more for every degree Celsius of warming. The extra heat and moisture mean more energy for storms that generate intense rainfall. As a result, we expect more severe rainfall in the future, with worse floods and more damage to buildings and roads. Climate change also increases the risk of coastal flooding through higher sea levels and more storms.

Severe weather alerts show that some extreme events are strongly affected by climate change. Wildfires are now highly dangerous, and fire seasons have lengthened substantially. Hurricane alerts are issued throughout the year. Climate change has also made heatwaves more frequent. To protect our planet, we need an in-depth understanding of what will happen next.

In machine learning and data science, we usually describe an unusual activity as an “outlier.” But there is a significant difference between extreme events and outliers.

An outlier is an observation that differs significantly from other observations of the same feature. When a time series is plotted, outliers usually show up as unexpected spikes or dips at a given time. An outlier may come from a rare event, from incorrect values caused by errors or breakdowns, or from bad recording practices.
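
To make the idea concrete, here is a minimal sketch of one common way to flag spikes and dips: the interquartile range (IQR) rule. The readings are made up for illustration, and the function name `iqr_outliers` and the multiplier `k=1.5` are assumptions for this example, not something taken from Ambee's pipeline.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Made-up hourly readings (not real data): one spike and one dip.
readings = [20, 21, 19, 22, 20, 95, 21, 20, 2, 19, 21, 20]
flagged = iqr_outliers(readings)
print(flagged)  # [5, 8]: the spike (95) and the dip (2)
```

A rule like this only says that a value is unusual. It cannot tell whether the value comes from a sensor error or from a real event.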

Extreme events differ from traditional outliers in several ways. In particular, an extreme event is not always an outlier: it does not always fall outside the normal distribution of activity.

Let’s use Ambee’s air quality API, weather API, and wildfire data to understand how statistical outliers can help predict extreme events like forest fires in the USA.

The distribution of the previous three days’ hourly data looks like this:

If we used traditional outlier detection methods, we might treat temperatures above 42 and PM2.5 values above 50 as outliers that are safe to discard. But if we look at the series of fire events in the USA and plot them against the air quality and weather data distribution above, we find that those “outliers” were extreme events.

The Dragon Fire occurred in Arizona. Its exact location is shown below:

Now let’s look at the time series plots for Arizona over the previous three days:

Time series plot of PM2.5 for the previous three days, with the Dragon Fire occurrence marked.

The temperature time series plot for the previous three days, with the occurrence of Dragon Fire marked.

Time series plot of humidity for the previous three days, with the occurrence of Dragon Fire marked.

Around the Dragon Fire, PM2.5 rises gradually and reaches an “anomalous” value just after the event.

Let’s look at the forest fire data for one more occurrence, the Avalanche Fire, which occurred in California.

Now let’s look at the time series plots for California over the previous three days:

Time series plot of PM2.5 for the previous three days, with the occurrence of the Avalanche Fire marked.

Time series plot of temperature for the previous three days, with the occurrence of the Avalanche Fire marked.

Time series plot of humidity for the previous three days, with the occurrence of the Avalanche Fire marked.

The Avalanche Fire shows the same pattern: PM2.5 rises gradually and looks “anomalous” about 4 hours after the fire event, even though the high value is actually a consequence of the fire.

If we had removed the outliers using percentile methods, say by dropping values above the 99th percentile or below the 2nd percentile, we would almost certainly have missed the signal needed to predict extreme climate events. Because climate change is a dynamic phenomenon, many traditional data wrangling methods can be misleading or even dangerous if applied without care.
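
The sketch below illustrates the risk with percentile trimming. The hourly PM2.5 series is entirely made up (a flat baseline followed by a ramp after a hypothetical fire at hour 60), and `percentile_trim` is an illustrative helper, not part of any Ambee API.

```python
from statistics import quantiles


def percentile_trim(values, lower_pct=2, upper_pct=99):
    """Split (index, value) pairs into kept and dropped by percentile cut-offs."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    kept = [(i, v) for i, v in enumerate(values) if low <= v <= high]
    dropped = [(i, v) for i, v in enumerate(values) if v < low or v > high]
    return kept, dropped


FIRE_HOUR = 60  # hypothetical hour at which the fire starts

# Made-up data: 60 quiet hours, then 12 hours of smoke build-up.
baseline = [10 + (h % 5) for h in range(FIRE_HOUR)]
ramp = [18, 25, 34, 45, 58, 73, 90, 110, 133, 160, 150, 140]
pm25 = baseline + ramp

kept, dropped = percentile_trim(pm25)
print(dropped)  # the only reading removed is the post-fire peak
```

In this toy series, the single value that trimming removes is the peak of the smoke plume, which is exactly the reading a fire model would care about most.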

In reality, assessing changes in extremes is not trivial. For statistical reasons, a valid analysis of extremes requires long time series, such as 10 years of historical data, to obtain reasonable estimates of the intensity and frequency of rare events. Continuous data series with at least daily time resolution are also highly useful for assessing extreme events.

The simplest approach to creating features for extreme events is to look at time series data. Two primary techniques can help: aggregating over previous hours or days of data, and aggregating over data from neighboring places.

Time aggregation: For each feature, create columns for

Value on the previous day
Mean over the previous 10 days
Variance over the previous 10 days
Max & min values over the previous 10 days
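
The time aggregation above can be sketched as follows. This is a minimal example using only the Python standard library. The function name `time_features`, the output keys, and the use of sample variance are assumptions made for illustration.

```python
from statistics import mean, variance


def time_features(daily_values, window=10):
    """Build one feature row per day from the `window` days before it."""
    rows = []
    for t in range(window, len(daily_values)):
        past = daily_values[t - window:t]  # strictly before day t
        rows.append({
            "day": t,
            "prev_day": past[-1],
            "mean": mean(past),
            "var": variance(past),  # sample variance (an assumption)
            "max": max(past),
            "min": min(past),
        })
    return rows


# Made-up daily values for a single feature, e.g. temperature.
daily = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
rows = time_features(daily)
print(rows[0])
```

Each row uses only values from before the day it describes, so the features can be fed to a forecasting model without leaking the value being predicted.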

Spatial aggregation: For each feature, create columns for

Mean of values within a radius of 100 miles
Variance of values within a radius of 100 miles
Max & min values within a radius of 100 miles
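
Here is a comparable sketch for spatial aggregation. It assumes each neighboring station is given as a (latitude, longitude, value) tuple and uses the haversine great-circle distance to decide which stations fall within the radius. The station coordinates and values are made up, and `spatial_features` is a hypothetical helper.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import mean, variance

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def spatial_features(lat, lon, stations, radius_miles=100):
    """Aggregate values from stations within `radius_miles` of (lat, lon)."""
    nearby = [v for s_lat, s_lon, v in stations
              if haversine_miles(lat, lon, s_lat, s_lon) <= radius_miles]
    if not nearby:
        return None  # no neighbors to aggregate
    return {
        "count": len(nearby),
        "mean": mean(nearby),
        "var": variance(nearby) if len(nearby) > 1 else None,
        "max": max(nearby),
        "min": min(nearby),
    }


# Made-up stations (lat, lon, PM2.5), not real measurements.
stations = [
    (34.5, -112.0, 40),
    (34.0, -111.0, 50),
    (34.2, -112.3, 60),
    (36.0, -112.0, 200),  # roughly 138 miles north, so it is excluded
]
features = spatial_features(34.0, -112.0, stations)
print(features)
```

The far-away station with the very high reading does not influence the features, because it lies outside the 100-mile radius.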

Forecasting based on the aggregated features might help predict the next values and extreme events. Thus, we see that removing outliers is not always necessary and depends on the use case.

Try Ambee's severe weather alerts API today!