Species Distribution Modeling architecture based on the Python programming language and its open libraries
Species Distribution Modeling is the name for a set of computational techniques used by biologists and ecologists to predict where a species is more or less likely to occur. Those predictions may be related to spatial and landscape variables such as elevation, soil type or distance to human settlements.
An analyst may be interested not only in where a species is located at a given point in time but also in when it might arrive at or leave a place. In that case the analyst looks more closely at bioclimatic indicators: precipitation, temperature or changes in the vegetation season. Combined with climate change models, these indicators make it possible to estimate where and when a species spreads into a new area or leaves its niche. This is an extremely important technique in vector epidemiology, and it is widely used by Data Lions as part of our data processing and analysis chain.
There are a few software packages used for Species Distribution Modeling. The most important of them are:
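To make the idea concrete, here is a minimal sketch of one of the simplest possible distribution models: a rectangular environmental envelope built from the minimum and maximum values observed at presence sites. It is only an illustration of the general principle, not the method used by Data Lions, and the presence records, variable choice and site names are invented for the example.

```python
# Hypothetical presence records: (elevation in m, annual precipitation in mm).
presences = [
    (120.0, 610.0),
    (340.0, 720.0),
    (210.0, 580.0),
    (450.0, 800.0),
]


def fit_envelope(records):
    """Return the (min, max) range of every variable observed at presence sites."""
    columns = list(zip(*records))
    return [(min(column), max(column)) for column in columns]


def is_suitable(envelope, site):
    """A site is suitable when every variable lies inside its observed range."""
    return all(low <= value <= high for (low, high), value in zip(envelope, site))


envelope = fit_envelope(presences)

# Hypothetical candidate sites to evaluate against the envelope.
candidates = {"valley": (200.0, 650.0), "summit": (1900.0, 1400.0)}
suitability = {name: is_suitable(envelope, site) for name, site in candidates.items()}
print(suitability)  # {'valley': True, 'summit': False}
```

Real models replace the hard yes/no answer with a continuous suitability score and use many more variables, but the input (occurrences plus environmental layers) and the output (suitability per location) have the same shape.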
- MaxEnt modeling package (http://biodiversityinformatics.amnh.org/open_source/maxent/)
- R and its packages: biomod2, dismo, sdm and others
- DIVA-GIS with BIOCLIM packages
If the model is simple and consists only of a niche modeling unit, then the packages mentioned above are the best possible choice. A problem arises when SDM is a cog in a bigger machine…
Section 1: Python and why only Python
In a business environment, two kinds of factors affect the decision about which software package will be used: legal and technical issues.
Legal / Financial issues
R and its packages are very promising for SDM processing. The main disadvantage of this ecosystem is its licensing policy. Each package has a different license, and tracking the licenses of all packages over time is a very time-consuming process. Unfortunately, some of them are not suitable for business purposes (especially Creative Commons). This is the main issue preventing us from using R, but there are also…
There are several technical problems related to the software mentioned above:
- The end-to-end system is written in a different programming language: additional steps are then required to set up the whole system as a single unit. For example, you may have a database, a data preparation unit and a web application written in Python, and you want to join a MaxEnt SDM model with your Python environment. This requires one additional Python unit that works as the data handler for MaxEnt and another that monitors the MaxEnt processes.
- Integrating the SDM model into a wider Deep Learning system: the preferred language is Python with the Keras or PyTorch libraries.
- Control over the system and the ability to change its capabilities and algorithms: the program must be written in a non-compiled language.
That’s why, at the end of the day, Data Lions chose Python and its well-established geoprocessing and machine learning packages to build an open source Species Distribution Modeling package.
Section 2: System design
The design of the SDM library written in Python begins with the following initial assumptions:
- One license (or one family of licenses) for the library, which allows contributors to adapt it and create their own functionality easily. Additionally, the license must be friendly to commercial use. The library is licensed in the same manner as the scientific Python libraries.
- Modular (legal issues): functionality is distributed across the package as modules, to prevent a situation where one restrictive license of a third-party product affects the whole library.
- Modular (technical issues): separation of classes and functionality, easier development and maintenance of the code, and a less bug-prone system. It is possible to use only selected modules instead of the whole library, and to incorporate them into third-party packages and software.
- Multi-source data processing: the library must handle GeoTIFF, JPEG2000, HDF, NetCDF and GRIB data sources, as these are the most popular Earth Observation / climate data formats.
- Full pipeline for species distribution modeling across the package modules: from data retrieval, through processing and harmonization, to analysis and publication.
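As an illustration of the multi-source assumption, the sketch below shows one simple way a library could decide which reader to use for an input file: a lookup by file extension. The function name `detect_format` and the mapping are hypothetical and are not taken from sdm-python; the extensions listed are common conventions for these formats, not an exhaustive list.

```python
from pathlib import Path

# Common file extensions for the formats named in the assumptions (not exhaustive).
FORMAT_BY_EXTENSION = {
    ".tif": "GeoTIFF",
    ".tiff": "GeoTIFF",
    ".jp2": "JPEG2000",
    ".hdf": "HDF",
    ".h5": "HDF",
    ".he5": "HDF",
    ".nc": "NetCDF",
    ".nc4": "NetCDF",
    ".grib": "GRIB",
    ".grb": "GRIB",
    ".grib2": "GRIB",
    ".grb2": "GRIB",
}


def detect_format(path):
    """Return the data format name for a file path, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported data format: {suffix!r}") from None


print(detect_format("dem_poland.tif"))        # GeoTIFF
print(detect_format("era5_temperature.GRIB"))  # GRIB
```

A production reader would rather inspect the file contents than trust the extension, but a registry of this kind keeps each format handler in its own module, in line with the modularity assumptions above.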
The system must be able to automate data retrieval (through external APIs), data processing, analysis and visualization / publication. Specific capabilities of each module are presented in the diagram in Image 1. The system has extra requirements that are specific to spatial data. It must be able to automatically detect projections and harmonize them, and do the same for raster resolutions. It should not matter which data format is used for the analysis: NetCDF / GRIB, HDF, JPEG2000 or GeoTIFF. The system is developed on top of the GeoPandas Python library. It must support analysis of presence-only datasets with a special pseudo-random occurrence generator (where a priori knowledge may be used to define where a species is more or less likely to occur).
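The pseudo-random occurrence generator can be pictured with the following minimal sketch. It is not the sdm-python implementation: the function name, the grid, and the prior (absence assumed more plausible far from recorded presences) are all assumptions made for the example. It draws background cells without replacement, with a probability proportional to an a priori weight.

```python
import random


def sample_pseudo_absences(cells, presences, prior, n, seed=0):
    """Draw n distinct pseudo-absence cells, weighted by a prior.

    cells:     list of (row, col) grid cells in the study area
    presences: set of cells with recorded occurrences (never sampled)
    prior:     dict mapping cell -> non-negative weight; a higher weight
               means an absence is considered more plausible there
    """
    rng = random.Random(seed)
    pool = [c for c in cells if c not in presences and prior.get(c, 0.0) > 0.0]
    if n > len(pool):
        raise ValueError("not enough candidate cells for the requested sample")
    chosen = []
    for _ in range(n):
        weights = [prior[c] for c in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        pool.remove(pick)  # sample without replacement
        chosen.append(pick)
    return chosen


# Hypothetical 4 x 4 study area with two recorded presences.
cells = [(r, c) for r in range(4) for c in range(4)]
presences = {(0, 0), (1, 1)}

# Assumed prior: Manhattan distance to the nearest recorded presence.
prior = {
    cell: min(abs(cell[0] - p[0]) + abs(cell[1] - p[1]) for p in presences)
    for cell in cells
}

pseudo_absences = sample_pseudo_absences(cells, presences, prior, n=5, seed=42)
print(pseudo_absences)
```

Fixing the seed makes the generated points reproducible, which matters when models trained on different pseudo-absence sets need to be compared.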
The data processing and analysis pipeline, together with the Python libraries used as the sdm-python foundation, is presented in Image 2.
With all of these assumptions in place, we started system development. Its base functionality and architecture were presented to GIS users working with environmental data at the BioGIS conference in Poznan on March 22-23, 2019.
What is the system capable of? The examples in Image 3 and Image 4 show how, based on species occurrence data from Germany, we are able to predict where the most suitable niches for Ixodes ricinus ticks are in Poland. It is indeed a very powerful technique!
The SDM-python library is currently under development. New articles will be published along with the package documentation and new releases. It is a completely open source project released under the MIT license.
The developer version of the library is available here: https://github.com/szymon-datalions/sdm-python