Using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to find groups of stations that show the same weather conditions.

Environment Canada Weather Station (Source: Summit Post)

There are several clustering techniques and algorithms used widely across domains and industries, in applications such as recommendation engines and market segmentation. These algorithms are commonly grouped into seven classes: hierarchical, density-based, partitional, graph-based, combinational, grid-based, and model-based algorithms.

When clusters differ in shape, size, and density, and when the data contains noise and outliers, it becomes difficult to find clusters, particularly in high-dimensional data. To find clusters of arbitrary shape in data that contains noise, we can rely on Density-Based Spatial Clustering of Applications with Noise (DBSCAN). It works with two parameters: Epsilon and Minimum Samples (also called Minimum Points). Epsilon specifies a radius; if the neighborhood within that radius of a point contains enough points, we call it a dense area. Minimum Samples is the minimum number of points that the neighborhood must contain to count as dense. Distances are usually measured with Euclidean distance, while great-circle distance is the better fit for geographical data.

Let us try to understand this with an example. We will apply the DBSCAN algorithm to the dataset in the “weather.csv” file (source: https://s3-api.us-geo.objectstorage.softlayer.net) to find the groups of stations that show the same weather conditions.
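To make the two parameters concrete, here is a minimal sketch using scikit-learn's DBSCAN on a handful of made-up 2D points (not the weather data). The values eps=0.3 and min_samples=3 are assumptions chosen for this toy data only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of points plus one isolated point (made-up data).
points = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [9.0, 1.0],
])

# eps is the neighborhood radius (Epsilon); min_samples is the number of
# points, the point itself included, needed for a neighborhood to be dense.
model = DBSCAN(eps=0.3, min_samples=3, metric="euclidean").fit(points)

labels = model.labels_  # one cluster label per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("labels:", labels)
print("number of clusters:", n_clusters)
```

The isolated point has no neighbors within the radius, so it is labeled as noise instead of being forced into a cluster.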

Visualization

The basemap package helps us visualize the stations on a map. We use the matplotlib basemap toolkit to plot 2D data on maps in Python. Keep in mind that the size of each data point represents the average of the maximum temperature for that station over a year.
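A plot of this kind could be sketched roughly as follows. The helper names `marker_sizes` and `plot_stations`, the size range, the Mercator projection, and the one-degree padding are all assumptions made for illustration, not the code behind the figure. The plotting function imports matplotlib and basemap only when it is called, since basemap is a separate install.

```python
def marker_sizes(avg_max_temps, min_size=3.0, max_size=15.0):
    """Linearly rescale temperatures to marker sizes (hypothetical helper)."""
    lo, hi = min(avg_max_temps), max(avg_max_temps)
    span = (hi - lo) or 1.0  # avoid division by zero when all values match
    return [min_size + (t - lo) / span * (max_size - min_size)
            for t in avg_max_temps]


def plot_stations(lons, lats, avg_max_temps):
    """Plot stations on a Mercator map; needs matplotlib and basemap."""
    import matplotlib.pyplot as plt
    from mpl_toolkits.basemap import Basemap

    # Bounding box padded by one degree around the stations (assumption).
    station_map = Basemap(
        projection="merc", resolution="l",
        llcrnrlon=min(lons) - 1, llcrnrlat=min(lats) - 1,
        urcrnrlon=max(lons) + 1, urcrnrlat=max(lats) + 1,
    )
    station_map.drawcoastlines()
    station_map.drawcountries()

    # Calling the Basemap object converts lon/lat to map x/y coordinates.
    xs, ys = station_map(lons, lats)
    for x, y, size in zip(xs, ys, marker_sizes(avg_max_temps)):
        station_map.plot(x, y, marker="o", color="red",
                         markersize=size, alpha=0.75)
    plt.show()
```

Calling `plot_stations(lons, lats, temps)` with three equal-length lists draws one marker per station, with warmer stations drawn larger.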

Figure 1. Visualization of stations on map

Clustering of stations based on longitude and latitude

We now cluster the stations based on their longitude and latitude. The goal is to find core samples of high density and expand clusters from them, as shown below.
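One way to cluster on latitude and longitude is to pair scikit-learn's DBSCAN with the haversine (great-circle) metric. The sketch below is not necessarily the approach behind the figures in this article: the coordinates are hypothetical rather than read from weather.csv, and the 50 km radius and min_samples=3 are assumed values.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0

# Hypothetical (latitude, longitude) pairs in degrees, not from weather.csv.
stations_deg = np.array([
    [49.28, -123.12], [49.25, -123.10], [49.30, -123.00],  # first group
    [43.65, -79.38], [43.70, -79.40], [43.60, -79.50],     # second group
    [63.75, -68.52],                                       # isolated station
])

# The haversine metric expects [lat, lon] in radians and measures distance
# as an angle in radians, so the radius in km is divided by Earth's radius.
coords_rad = np.radians(stations_deg)
eps_km = 50.0  # assumed neighborhood radius

model = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,
    min_samples=3,
    metric="haversine",
    algorithm="ball_tree",
).fit(coords_rad)

print("cluster labels:", model.labels_)               # -1 marks noise
print("core samples:", model.core_sample_indices_)    # indices of core points
```

The core samples are the points whose neighborhood is dense; clusters grow outward from them, and any station that no core sample can reach is labeled -1.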

Cluster Labels

Visualization of clusters using basemap

Let us visualize the clusters. After the code is executed in a Jupyter notebook, it plots the map below and prints the cluster numbers with their average temperatures. For example,

Cluster 3 has an average temperature of -15.300, while Cluster 1 has an average temperature of 1.95.
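The per-cluster averages can be computed from the cluster labels and the station temperatures. Below is a minimal sketch in plain Python; the function name `mean_temperature_by_cluster` and the sample values are made up for illustration and are not the article's data.

```python
from collections import defaultdict


def mean_temperature_by_cluster(labels, temperatures):
    """Return {cluster label: mean temperature}, skipping noise (-1)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, temp in zip(labels, temperatures):
        if label == -1:  # DBSCAN marks noise points with -1
            continue
        sums[label] += temp
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sorted(sums)}


# Made-up labels and temperatures for five stations.
labels = [0, 0, 1, 1, -1]
temperatures = [-10.0, -20.0, 2.0, 4.0, 30.0]

averages = mean_temperature_by_cluster(labels, temperatures)
for cluster, avg in averages.items():
    print(f"Cluster {cluster}, average temperature: {avg}")
```

Noise points are left out here so that a single outlying station does not distort the average of any cluster.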

Figure 2. Cluster numbers with average temperature

Final Visualization of Clustering of stations

Let us visualize the clustering of the 9 base stations with their average temperatures.

Figure 3. Clustering of 9 base stations with average temperatures

Conclusion

In the example above, we became familiar with the DBSCAN clustering algorithm and saw how it helps find groups of stations that show the same weather conditions. It not only finds clusters of arbitrary shape, but also locates the denser parts of the data while ignoring less dense areas and noise. Bear in mind that there is a more recent variant of the algorithm, HDBSCAN, which combines hierarchical clustering with regular DBSCAN and is often faster and more accurate than DBSCAN.

Robotics, data science, and artificial intelligence enthusiast; software developer; enjoys volunteer work, reading good books, and spending time with family