We are well and truly in the machine learning (ML) age. Hardly any field of human endeavour, with the possible exception of the arts, will be immune from its effects, least of all the geosciences. Biostratigraphic data, essentially a 2-D array of fossil occurrences against some sort of measured depth scale, seem particularly well suited to being interrogated and "worked up" by digital means. At the same time, huge archives of historic biostratigraphic data languish, seldom used, in the vaults of operating companies and other institutions, moribund for want of the dwindling human resources needed to work on them.

A short article on Automation in Biostratigraphy is presented elsewhere on this site. First, however, biostratigraphers need to ensure that ML data scientists are aware of the "mechanics" of carrying out biostratigraphic (and paleoenvironmental) interpretations on paleontological data, especially the large variety of "nuances" inherent in such data that biostratigraphers deal with on a day-to-day basis. Data scientists will also have to reconcile the numerous little "quirks" in the way workers in different companies and institutions present and display their interpretations. Although there are many sources that deal with biostratigraphic theory, there are few that discuss how to actually do biostratigraphy on a practical basis and (as far as this author knows) none that discuss and recommend a universal "best practice" for the industry. If we cannot agree among ourselves how best to present consistent interpretations of our data, we risk data scientists attempting (no doubt with the best intentions) to do it for us.

A simple illustration of alternative ways to display interpreted biostratigraphic data is shown below. In both cases the data are from cuttings samples, and therefore it is the tops of the biozones that are identified and defined, in order to mitigate the effects of borehole caving. In Figure 1, the interpreted biozones (shown in green) are displayed using the principle of "top defines base", i.e. the base of the overlying interval is defined automatically by the top of the underlying interval.

Figure 1: Biozones displayed using "top defines base" principle.

In Figure 2, the biozones (yellow and grey) are displayed according to the principle of "sample defines base" - the base of the overlying biozone is placed at the level of the sample immediately above the defining sample of the underlying biozone. Note the gaps between successive intervals.

Figure 2: Biozones displayed using the "sample defines base" principle.
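The difference between the two display conventions can be made explicit in a few lines of code. The following is only a minimal sketch, not an established industry routine: the function name `zone_intervals`, the convention labels, and the decision to leave the base of the deepest zone undefined (`None`) are all assumptions made for illustration. It also assumes depths increase downhole and that every biozone top coincides with a sample depth.

```python
from bisect import bisect_left


def zone_intervals(sample_depths, zone_tops, convention="top_defines_base"):
    """Return (zone, top, base) tuples, ordered from shallowest to deepest.

    sample_depths: depths of all samples in the well (hypothetical input).
    zone_tops: mapping of biozone name -> depth of the sample defining its top.
    convention: "top_defines_base" or "sample_defines_base" (labels assumed here).
    """
    depths = sorted(sample_depths)
    ordered = sorted(zone_tops.items(), key=lambda item: item[1])

    # Assumption: each zone top is picked at an actual sample depth.
    for zone, top in ordered:
        if top not in depths:
            raise ValueError(f"top of zone {zone!r} is not a sample depth")

    intervals = []
    for i, (zone, top) in enumerate(ordered):
        if i + 1 == len(ordered):
            # Deepest zone: no underlying top, so the base is left undefined.
            base = None
        else:
            next_top = ordered[i + 1][1]
            if convention == "top_defines_base":
                # Base of the overlying zone is the top of the underlying zone.
                base = next_top
            elif convention == "sample_defines_base":
                # Base is the sample immediately above the defining sample
                # of the underlying zone, which leaves a gap between zones.
                base = depths[bisect_left(depths, next_top) - 1]
            else:
                raise ValueError(f"unknown convention: {convention!r}")
        intervals.append((zone, top, base))
    return intervals
```

Called with the same samples and the same zone tops, the two conventions return different bases for every zone except the deepest, which is exactly the kind of discrepancy an ML workflow would need to be told about before combining interpretations from different sources.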

Both methods are used by companies throughout the industry. There may also be additional refinements to these principles, depending on whether companies choose to "adjust" their interpreted interval tops to line up with non-biostratigraphic markers such as log picks or lithostratigraphic and/or seismic boundaries.

It may seem a rather minor point, but ML algorithms need to know (via their programmers) which of these methods works best. There are numerous other "quirks and foibles" in use at various places across the industry, including how to "weight" biostratigraphic events in different types of samples. For example, is a primary biostratigraphic marker event (good) found in a cuttings sample (poor) better than a secondary marker event (fair) in a core sample (good)? Should a set of paleontological data be weighted less favourably if it comes from a sandy sample compared with a muddy one?
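To show what an explicit weighting convention might look like once written down for an algorithm, here is a hedged sketch. The good/fair/poor ratings follow the example above, but the numeric scores, the multiplicative combination rule, and all names (`QUALITY`, `Event`, `event_weight`) are placeholders invented for illustration, not a recommended or agreed scheme.

```python
from dataclasses import dataclass

# Placeholder numeric scores for the qualitative ratings (assumed values).
QUALITY = {"good": 3, "fair": 2, "poor": 1}

# Ratings taken from the example in the text; other conventions are possible.
MARKER_QUALITY = {"primary": "good", "secondary": "fair"}
SAMPLE_QUALITY = {"core": "good", "cuttings": "poor"}


@dataclass(frozen=True)
class Event:
    name: str          # label for the biostratigraphic event
    marker_rank: str   # "primary" or "secondary"
    sample_type: str   # "core" or "cuttings"


def event_weight(event, marker_quality=MARKER_QUALITY, sample_quality=SAMPLE_QUALITY):
    """Combine marker and sample ratings by simple multiplication (an assumption)."""
    marker_score = QUALITY[marker_quality[event.marker_rank]]
    sample_score = QUALITY[sample_quality[event.sample_type]]
    return marker_score * sample_score


primary_in_cuttings = Event("event A", "primary", "cuttings")
secondary_in_core = Event("event B", "secondary", "core")
print(event_weight(primary_in_cuttings), event_weight(secondary_in_core))
```

With these placeholder numbers the secondary marker in core happens to score higher, but a different scoring table or combination rule could reverse that ranking. The value of writing the convention down is that the choice becomes visible and open to discussion rather than buried in an individual interpreter's habits.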

These and many more questions need to be discussed if we are to approach some kind of universal agreement. A project has been set up on ResearchGate (Best Practice in Biostratigraphy) to encourage discussion within the commercial biostratigraphic community, and anyone with an interest is welcome to join.