Species Distribution Modeling using Maxent Model
geospatial
map
spatial optimization
maxent model
Introduction
Understanding Species Distribution Models (SDMs)
Maxent is a species distribution modeling (SDM) system that uses species observations and environmental data to predict where a species might be found under past, present or future environmental conditions.
It’s a presence/background model. Maxent uses data on where a species is present and, instead of data on where it is absent, it uses a random sample of the landscape where you might expect to find that species. This is convenient because it reduces data collection requirements, but it also introduces some unique challenges for fitting and interpreting models.
Formally, Maxent estimates habitat suitability (i.e. the fundamental niche) using:
- species occurrence records (y = 1),
- randomly sampled “background” location records (y = 0), and
- environmental covariate data (x) for each location of y.
Maxent doesn’t directly estimate relationships between presence/background data and environmental covariates (so, not just y ~ x). Instead, it fits a series of feature transformations to the covariate data (z = f(x)), like computing quadratic transformations or the pairwise products of covariates.
Maxent then estimates the conditional probability of finding a species given a set of environmental conditions as:
\[Pr(y = 1 \mid z) = \frac{f_1(z) \cdot Pr(y = 1)}{f(z)}\]
Here f_1(z) is the density of covariate values at presence locations and f(z) is the density across the background sample. This can be interpreted as: “the relative likelihood of observing a species at any point on the landscape is determined by differences in the distributions of environmental conditions where a species is found relative to the distribution at a random sample across the landscape.”
A null model would posit that a species is likely to be distributed uniformly across a landscape; you’re equally likely to find it wherever you go. Maxent uses this null model—the “background” conditions—as a prior probability distribution to estimate relative suitability based on the conditions where species have been observed.
Because of this formulation, the definition of both the landscape and the background are directly related to the definition of how habitat suitability is estimated. These should be defined with care.
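To make the ratio-of-densities idea concrete, here is a toy numerical sketch using synthetic values for a single covariate. This is not the actual Maxent fitting procedure (which works on transformed features with regularization); it only illustrates how presence and background densities combine into a relative suitability score.
import numpy as np

rng = np.random.default_rng(42)
z_presence = rng.normal(loc=15.0, scale=2.0, size=500)     # covariate values where y = 1
z_background = rng.normal(loc=22.0, scale=6.0, size=5000)  # covariate values at background points

# Estimate f1(z) and f(z) with histograms over a shared set of bins
bins = np.linspace(0, 40, 41)
f1, _ = np.histogram(z_presence, bins=bins, density=True)
f, _ = np.histogram(z_background, bins=bins, density=True)

# Relative suitability is proportional to f1(z) / f(z); the prior Pr(y = 1)
# only rescales this ratio, so it is omitted here.
relative_suitability = np.divide(f1, f, out=np.zeros_like(f1), where=f > 0)
print(relative_suitability.round(2))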
In this project, we’ll use this model to extract raster data at points and polygons, assign sample weights based on sample density, and create spatially-explicit train/test splits, while also illustrating general patterns in the data.
Data Preparation
Load dataset
The sample data we’ll use includes species occurrence records from the genus Ariolimax, a group of North American terrestrial slugs, as well as some climate and vegetation data.
Raster Data
Environmental data are typically provided as image data with coordinate reference information, or “raster data”. We load raster data containing the following covariates:
- Cloud Cover Mean
- Cloud Cover Stdv
- LAI Mean (vegetation growth)
- LAI Stdv (vegetation growth)
- Surface Temp Mean
- Surface Temp Stdv
import os

# Paths to the environmental covariate rasters
path = os.getcwd()+'/data'
raster_names = [
"ca-cloudcover-mean.tif",
"ca-cloudcover-stdv.tif",
"ca-leafareaindex-mean.tif",
"ca-leafareaindex-stdv.tif",
"ca-surfacetemp-mean.tif",
"ca-surfacetemp-stdv.tif",
]
rasters = [f"{path}/{raster}" for raster in raster_names]
labels = ["CloudCoverMean", "CloudCoverStdv", "LAIMean", "LAIStdv", "SurfaceTempMean", "SurfaceTempStdv"]
Point-Format Data
Occurrence data are typically recorded with latitude/longitude coordinates for a single location, also known as “point-format data”. We load point-format data containing locations on the landscape where a species has been observed, identified, and recorded as present at an x/y location.
We then model the relative likelihood of observing that species in other locations based on the environmental conditions where that species is observed.
import geopandas as gpd

gdf = gpd.read_file(os.getcwd()+'/data/ariolimax-ca.gpkg')
Next, we can plot data from four species of Banana Slug across California.
We can then add the raster data to the plot to compare the two datasets, as sketched below.
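A minimal plotting sketch follows; the species column name ("species") and the choice of the LAI raster for the backdrop are assumptions, so adjust them to your data.
import matplotlib.pyplot as plt
import rasterio
from rasterio.plot import show

# Plot slug occurrence points over one of the covariate rasters
with rasterio.open(rasters[2]) as src:  # LAI mean raster, used here for context
    fig, ax = plt.subplots(figsize=(8, 8))
    show(src, ax=ax, cmap="Greens")
    gdf.to_crs(src.crs).plot(ax=ax, column="species", legend=True, markersize=2)  # "species" column is assumed
ax.set_title("Ariolimax occurrence records over leaf area index")
plt.show()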
Drawing random pseudoabsence locations
In addition to species occurrence records (where y = 1), species distribution models often require a set of random pseudo-absence/background points (y = 0). These are a random geographic sampling of where you might expect to find a species across the target landscape.
The background point distribution is extremely important for presence-only models, and should be selected with care.
By using maps.random_sampling
we can generate random pseudoabsence points as background data. This can be computed from raster and/or point-format data.
pseudoabsence = maps.random_sampling(main_data_raster, count=10000,
types= 'spatial', markersize=0.25,
ignore_mask=False, main_data=None,figsize=(8,8))
This function requires the following parameters:
- main_data_raster (raster data): raster data
- count (int): number of pseudoabsence points
- types (string): type of pseudoabsence [spatial, bias]
- markersize (int): size of the markers on the plot
- ignore_mask (boolean): whether to use the rectangular bounds of the raster to draw samples
- main_data (geodataframe): point-format data
- figsize (tuple): size of the figure
Uniform random geographic sample from unmasked locations within a raster’s extent
We can use raster data to create a uniform random geographic sample from unmasked locations within a raster’s extent (ignore_mask=False, as in the call above).
Uniform random geographic sample from a raster’s rectangular bounds
We can also ignore the raster’s nodata mask and draw samples from the rectangular bounds of the raster, as sketched below.
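A hedged sketch of this variant reuses the same wrapper with ignore_mask=True (parameter behavior as documented above); note that nodata areas may then be sampled as well.
# Draw background points from the raster's rectangular bounds by ignoring the nodata mask
pseudoabsence_bounds = maps.random_sampling(main_data_raster, count=10000,
                                            types='spatial', markersize=0.25,
                                            ignore_mask=True, main_data=None, figsize=(8, 8))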
From a bias raster
We could also provide a “bias” file, where the raster grid cells contain information on the probability of sampling an area. Bias adjustments are recommended because species occurrence records are often biased towards certain locations (near roads, parks, or trails).
The grid cells can be an arbitrary range of values. What’s important is that the values encode a linear range of numbers that are higher where you’re more likely to draw a sample. The probability of drawing a sample is dependent on two factors: the range of values provided and the frequency of values across the dataset.
We’ll use the leaf area index (LAI) covariate to select more background samples from highly vegetated areas.
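A hedged sketch of bias-adjusted sampling, assuming the LAI mean raster can be passed as the sampling surface and that types='bias' switches the wrapper to probability-weighted sampling:
# Use the LAI mean raster as a bias surface: cells with higher LAI are sampled more often
pseudoabsence_bias = maps.random_sampling(rasters[2], count=10000,
                                          types='bias', markersize=0.25,
                                          ignore_mask=False, main_data=None, figsize=(8, 8))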
We can see that we are selecting a disproportionately high number of points along California’s coast - home to dense redwood forests where slugs thrive - and disproportionately few samples in the dry deserts in the southeast.
From a vector polygon
Another way to address bias would be to only select points inside an area where that species is known to occur (within a certain ecoregion, for example).
The Ariolimax dataset contains occurrence records for four species within the genus. If we’re interested in understanding the distribution of only one of them, we could construct a rudimentary “range map” for that species by computing the bounding geometry of its points and then sampling within those bounds (see the sketch after the plot description below).
- The yellow polygon above shows the simple “range map”.
- The blue points are point samples selected from within that range.
- The red points are the uniform sample, plotted for reference.
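A minimal sketch of this idea using plain geopandas/shapely; the species name and the "species" column are assumptions, and the real workflow may use the sampling wrapper above instead.
import numpy as np
from shapely.geometry import Point

# Build a rudimentary "range map" as the convex hull of one species' records
one_species = gdf[gdf["species"] == "Ariolimax buttoni"]   # assumed column/species name
range_map = one_species.geometry.unary_union.convex_hull

# Rejection-sample points from the hull's bounding box, keeping those inside the hull
rng = np.random.default_rng(0)
xmin, ymin, xmax, ymax = range_map.bounds
samples = []
while len(samples) < 1000:
    pt = Point(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
    if range_map.contains(pt):
        samples.append(pt)
in_range = gpd.GeoDataFrame(geometry=samples, crs=gdf.crs)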
Point annotation
Annotation refers to extracting raster values at each point occurrence record and storing those values together in a GeoDataFrame. Once you have your presence and pseudo-absence records, you can annotate these records with the corresponding covariate data.
It also allows you to pass multiple raster files, which can be in different projections, extents, or grid sizes. This means you don’t have to explicitly re-sample your raster data prior to analysis, which is always a chore.
presence, background, zs = generate_raster_df(main_data, background_bias,
main_data_raster, area_hull=None, labels=[], figsize=(6,9), row_col=[2,3])
This function requires the following parameters:
- main_data (geodataframe): point-format data
- main_data_raster (raster data): raster data
- background_bias (geodataframe): point-format background data
- area_hull (geodataframe): geometry area
- labels (list): labels for the raster data
- row_col (list): number of rows and columns for the subplots
- figsize (tuple): size of the figure
Next let’s plot some histograms to understand the similarities and differences between the conditions at presence locations and at background locations.
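A minimal sketch of those histograms, assuming presence and background are the annotated GeoDataFrames returned above with one column per raster label:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for ax, label in zip(axes.ravel(), labels):
    # Overlay the background and presence distributions for each covariate
    ax.hist(background[label], bins=30, alpha=0.5, density=True, label="background")
    ax.hist(presence[label], bins=30, alpha=0.5, density=True, label="presence")
    ax.set_title(label)
axes[0, 0].legend()
plt.tight_layout()
plt.show()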
This checks out. Slugs are commonly found in areas with higher annual cloud cover, consistently cool temperatures, and high vegetation growth (even after adjusting for bias and oversampling areas with high LAI).
Zonal statistics
This code reads an array of raster data covering the extent of a polygon, masks the areas outside the polygon, and computes summary statistics such as the mean, standard deviation, and mode of the array.
Summary statistics for the first polygon (row 0), transposed so that each covariate is a row:

| Covariate | mean | min | max | count | sum | stdv | skew | kurt | mode | 25pct | 75pct |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CloudCoverMean | 0.350516 | 0.208601 | 0.945052 | 6394 | 63796 | 0.0897973 | 1.38584 | 2.9456 | 0.299384 | 0.285211 | 0.402993 |
| CloudCoverStdv | 0.468056 | 0.227878 | 0.5 | 6394 | 85189 | 0.0226815 | -0.42534 | 0.905295 | 0.457988 | 0.451198 | 0.48941 |
| LAIMean | 1.34066 | 0 | 4.86163 | 6394 | 244008 | 0.883253 | 1.25309 | 1.18815 | 0 | 0.744408 | 1.71113 |
| LAIStdv | 0.709633 | 0 | 2.18535 | 6394 | 129157 | 0.422168 | 1.01065 | 0.47011 | 0 | 0.415228 | 0.915175 |
| SurfaceTempMean | 26.1743 | 8.99165 | 43.5559 | 6394 | 4.74417e+06 | 6.32528 | 0.111637 | -0.788336 | 18.3256 | 20.6344 | 30.8102 |
| SurfaceTempStdv | 9.86202 | 3.05516 | 15.3094 | 6394 | 1.78752e+06 | 1.74895 | -0.251056 | 0.190947 | 8.39623 | 8.83032 | 11.0598 |

The GeoDataFrame also stores the polygon geometry for each row; for this row it is POLYGON ((769358.1680785755 3763373.5768406843, 604251.5230777487 4016601.3086676924, 598773.1208982626 4025958.906228845)).
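A minimal sketch of how such per-polygon summaries could be computed with rasterio and scipy (not necessarily the exact function used above; the polygon is assumed to be in the raster's CRS):
import numpy as np
import rasterio
from rasterio.mask import mask as raster_mask
from scipy import stats

def zonal_summary(raster_path, polygon):
    """Mask a raster to a polygon and summarize the unmasked cell values."""
    with rasterio.open(raster_path) as src:
        data, _ = raster_mask(src, [polygon], crop=True, filled=False)
    values = data.compressed()  # drop cells outside the polygon or flagged as nodata
    return {
        "mean": values.mean(), "min": values.min(), "max": values.max(),
        "count": values.size, "sum": values.sum(), "stdv": values.std(),
        "skew": stats.skew(values), "kurt": stats.kurtosis(values),
        "mode": stats.mode(values, keepdims=False).mode,
        "25pct": np.percentile(values, 25), "75pct": np.percentile(values, 75),
    }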
Geographic sample weights
This step helps us handle geographic information during model selection and model scoring.
One way to add spatial information to a model is to compute geographically-explicit sample weights. elapid does this by calculating weights based on the distance to the nearest neighbor: points near other points receive lower weights, while isolated points receive higher weights.
presence, background = sample_weight(main_data, background_bias,
n_neighbors=1, types='distance')
This function requires the following parameters:
- main_data (geodataframe): point-format data
- background_bias (geodataframe): point-format background data
- n_neighbors (int): number of neighbors considered when computing weights
- types (string): type of sample weight [density, distance]
Compute weights based on the distance to the nearest point
Running this code with types='distance' calculates weights based on the distance to the nearest neighbor.
There’s one lone point in the northwest corner that is assigned a very high weight because it’s so isolated.
Compute weights based on sample density
Running this code with types='density' computes the average distance from each point to all other points, deriving sample weights from point density instead of the distance to the single nearest point (see the sketch below). This may be useful if you have small clusters of a few points far away from large, densely populated regions.
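A hedged sketch of the density variant, reusing the wrapper documented above with types='density' (other argument values are illustrative):
# Weight samples by point density (average distance to all other points)
presence, background = sample_weight(main_data, background_bias,
                                     n_neighbors=1, types='density')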
As you can see, there are tradeoffs for finding the best sample weights depending on the spatial distributions of each occurrence dataset. We’ll use the sample density weights here to upweight the samples in the sierras (the northeastern cluster of points).
Train / test splits
Uniformly random train/test splits are generally discouraged in spatial modeling because of the strong spatial structure inherent in many datasets. The non-independence of these data is referred to as spatial autocorrelation. Using distance- or density-based sample weights is one way to mitigate these effects. Another is to split the data into geographically distinct train/test regions to try and prioritize model generalization.
xtrain, ytrain, sample_weight_train, xtest, ytest, sample_weight_test = maps.data_prep(main_data, background_bias, target_col='', weight_col='',
types= 'checkerboard', grid_size=50000,
fold=0, n_splits=3, training=True)
This function requires the following parameters:
- main_data (geodataframe): point-format data
- background_bias (geodataframe): point-format background data
- target_col (string): name of the target column (dependent variable)
- weight_col (string): name of the sample weight column
- grid_size (int): grid size used to split the train and test datasets, in the units of the point data’s CRS
- fold (int): index of the fold in cross validation
- n_splits (int): number of splits in cross validation
- training (boolean): whether to run training
Checkerboard splits
With a “checkerboard” train/test split, points are intersected along a regular grid, and every other grid cell is used to split the data into train/test sets.
The height and width of the grid used to split the data is controlled by the grid_size
parameter. This should specify distance in the units of the point data’s CRS.
Geographic k-fold splits
Create k geographically-clustered folds using the GeographicKFold cross-validation strategy. This method is effective for understanding how well a model fit in one region will generalize to areas outside the training domain.
This method uses KMeans clusters fit with the x/y locations of the point data, grouping points into spatially distinct groups. This cross-validation strategy is a good way to test how well models generalize outside of their training extents into novel geographic regions.
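An illustrative sketch of that idea using scikit-learn’s KMeans directly (elapid’s GeographicKFold wraps this strategy; the number of clusters here is arbitrary):
import numpy as np
from sklearn.cluster import KMeans

# Cluster point locations into spatially distinct groups
xy = np.column_stack([main_data.geometry.x, main_data.geometry.y])
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(xy)

# Each fold trains on two regions and tests on the held-out region
for fold in range(3):
    train_idx = np.where(clusters != fold)[0]
    test_idx = np.where(clusters == fold)[0]
    print(f"fold {fold}: {train_idx.size} train points, {test_idx.size} test points")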
Modeling
We fit models using default settings, then evaluate the effects of different train/test splitting strategies.
Initiate Models
By using model_pred.model_collection
we can initiate all of the models that we want to use for this prediction. This function returns each model with default settings; we can change parameters inside the function call to get more accurate and robust models.
list_model = ['maxent_model']
result = model_pred.model_collection(list_model, types='classification')
result
This function requires the following parameters:
- list_model (list): list of models
- types (string): type of machine learning task that we handle
[['maxent_model', MaxentModel()]]
Training Model
After examining the cross-validation training results, we can train on our data by calling the model_pred.modeling
function. This function returns the fitted models, their predictions and probabilities, and the training time.
models,models_calib,names_model,probs_model,probs_true_model,probs_dual_model,probs_bool_model,pred_model,time_1,time_2 = model_pred.modeling(
X_train=train_xs,X_test=target_xs,y_train=train_y,y_test=None ,list_model=list_model,
types='classification',tresh=0.5,detail=False, method='isotonic')
This function requires the following parameters:
- X_train (dataframe): independent train dataframe
- y_train (dataframe): dependent train dataframe
- X_test (dataframe): independent test dataframe
- y_test (dataframe): dependent test dataframe
- list_model (list): list of models
- types (string): type of machine learning task that we handle
- tresh (float): threshold used to separate the two classes
- detail (boolean): whether to generate plots after training
- method (string): method used to calibrate the trained model
We have finished training the following model:
Maxent Model
Evaluate the model
After training, we can evaluate the model to determine how well it learned from the data by calling the print_score
function. This function prints the confusion matrix, a set of evaluation metrics, and the classification report.
print_score(y_test,y_pred,y_probs=None,types='classification',labels=None,time1=1,
time2=2,X_train=None,y_train=None,y_pred_test=None,X_test=None,
axis_labels=['Actual','Predicted'],cost_mat=None,sample_weight=None)
This function requires the following parameters:
- y_test (dataframe): y test dataframe
- y_pred (dataframe): y prediction dataframe from the model
- types (string): type of machine learning task that we handle
- labels (list): labels shown in the graph
- time1 (time): time when modeling started
- time2 (time): time when modeling ended
- X_train (dataframe): X train dataframe
- y_train (dataframe): y train dataframe
- y_pred_test (dataframe): y prediction dataframe from the test dataset
- X_test (dataframe): X test dataframe
- axis_labels (list): axis labels shown in the graph
- cost_mat (list): list of costs for each error type
- sample_weight (list): sample weights for the test and training process
confusion matrix =
[[2169 99]
[ 277 611]]
accuracy_score = 0.8809
balanced_accuracy_score = 0.8222
Logloss = 0.4926
Brier score loss (the smaller the better) = 0.0954
homogeneity score = 0.3864
completeness score = 0.4306
v measure score = 0.4073
adjusted rand score = 0.5544
adjusted mutual info score = 0.4071
precision score = 0.8606
average precision score = 0.6799
recall score = 0.6881
F1 score = 0.7647
F2 score = 0.7168
F3 score = 0.7021
F_beta score (0.5) = 0.8195
AUC of Precision-Recall Curve on Testing = 0.8347
Best Threshold for Precision-Recall Curve = 0.292900
F-Score = 0.802
AUC of ROC = 0.9304
Best Threshold for ROC = 0.247900
G-Mean = 0.869
Best Threshold with Youden’s J statistic = 0.247900
Cohens kappa = 0.6863
Gini = 0.6694
Expected Approval Rate = 0.225
Expected Default Rate = 0.1394
classification_report
precision recall f1-score support
0 0.89 0.96 0.92 2268
1 0.86 0.69 0.76 888
accuracy 0.88 3156
macro avg 0.87 0.82 0.84 3156
weighted avg 0.88 0.88 0.88 3156
time span= 0:00:01.383780
As you can see, our model performs well, with:
- accuracy_score = 0.8809
- AUC of ROC = 0.9304
However, it should be noted that the model has a relatively low recall score (0.6881)
due to a high number of false negatives.
We can use the fitted model to create partial dependence plots (PDPs), which show how model predictions vary across the range of each covariate.
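A hedged sketch of a manual partial dependence curve, assuming the fitted Maxent model is the first element of models, exposes a scikit-learn-style predict_proba, and that "SurfaceTempMean" is one of the covariate columns:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

model = models[0]              # assumed: the fitted Maxent model returned above
covariate = "SurfaceTempMean"  # assumed covariate column name
grid = np.linspace(train_xs[covariate].min(), train_xs[covariate].max(), 50)

# Hold all other covariates at their median while sweeping one across its range
profile = pd.DataFrame([train_xs.median()] * len(grid))
profile[covariate] = grid
suitability = model.predict_proba(profile)[:, 1]

plt.plot(grid, suitability)
plt.xlabel(covariate)
plt.ylabel("Predicted relative suitability")
plt.show()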
Result
We can plot the suitability predictions across the landscape.
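A minimal sketch of generating that map by applying the fitted model across the raster stack with rasterio (assumes all rasters share the same grid and CRS, that the model exposes predict_proba, and an arbitrary output filename):
import numpy as np
import rasterio

# Read each covariate band into a masked stack
stack, meta = [], None
for raster in rasters:
    with rasterio.open(raster) as src:
        stack.append(src.read(1, masked=True))
        meta = meta or src.profile

bands = np.ma.stack(stack)                       # (n_covariates, rows, cols)
valid = ~np.ma.getmaskarray(bands).any(axis=0)   # pixels with data in every band
features = bands.data[:, valid].T                # (n_valid_pixels, n_covariates)

# Predict relative suitability for every valid pixel and write it out
suitability = np.full(bands.shape[1:], np.nan, dtype="float32")
suitability[valid] = models[0].predict_proba(features)[:, 1]

meta.update(count=1, dtype="float32", nodata=np.nan)
with rasterio.open("ariolimax-suitability.tif", "w", **meta) as dst:
    dst.write(suitability, 1)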
Source
- https://earth-chris.github.io/elapid/sdm/maxent/