Data Science
Measuring data drift is essential in machine learning applications where model scoring (evaluation) is done on data samples that differ from those used in training. The Kullback-Leibler divergence is a common measure of the shift between probability distributions, and discretized versions of it have been devised to deal with binned or categorical data. We present the Unstable Population Indicator, a robust, flexible and numerically stable discretized implementation of Jeffreys divergence, along with a Python package that can deal with continuous, discrete, ordinal and nominal data in a variety of popular data types. We show its numerical and statistical properties in controlled experiments. We advise against using a single common cut-off to distinguish stable from unstable populations; the cut-off should instead depend on the use case.
2024-06-26
Measuring Data Drift with the Unstable Population Indicator
datascience@marcelhaas.com
Marcel R. Haas
L.Sibbald@tilburguniversity.edu
Lisette Sibbald
Department of Methodology and Statistics and Department of Cognitive Neuropsychology, Tilburg University, Prof. Cobbenhagenlaan 125, 5037 DB Tilburg, The Netherlands
Business Intelligence, University of Amsterdam, Spui 21, 1012 WX Amsterdam, The Netherlands
Public Health and Primary Care, Leiden University Medical Center, Albinusdreef 2, The Netherlands
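The discretized Jeffreys divergence named in the abstract above can be illustrated with a minimal sketch. This is a generic textbook form, the sum over bins of (p - q) * ln(p / q), and not the implementation in the authors' Python package. The function name `jeffreys_divergence` and the additive `smoothing` parameter are assumptions introduced here to keep empty bins from producing a logarithm of zero.

```python
import math


def jeffreys_divergence(counts_a, counts_b, smoothing=1.0):
    """Symmetric (Jeffreys) divergence between two binned samples.

    counts_a, counts_b: per-bin counts over the same bins, in the same order.
    smoothing: hypothetical additive constant per bin; it must be > 0 whenever
    a bin can be empty, otherwise log(0) or a division by zero occurs.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both samples must use the same bins")

    # Add the smoothing constant, then normalize counts to proportions.
    a = [c + smoothing for c in counts_a]
    b = [c + smoothing for c in counts_b]
    total_a, total_b = sum(a), sum(b)

    divergence = 0.0
    for x, y in zip(a, b):
        p, q = x / total_a, y / total_b
        # Each term is non-negative: (p - q) and log(p / q) share a sign.
        divergence += (p - q) * math.log(p / q)
    return divergence


if __name__ == "__main__":
    training = [120, 300, 80]   # made-up bin counts for illustration
    scoring = [90, 310, 100]
    print(jeffreys_divergence(training, scoring))
```

The value is zero for identical bin proportions and grows as the two distributions diverge. In line with the abstract, any threshold on this value would need to be chosen per use case.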
GigaByte
In China, 65 types of venomous snakes exist, with the Chinese cobra Naja atra being prominent and a major cause of snakebites in humans. Furthermore, N. atra is a protected animal in some areas, as it has been listed as vulnerable by the International Union for Conservation of Nature. Recently, due to the medical value of snake venoms, venomics has attracted growing research interest. In particular, genomic resources are crucial for understanding the molecular mechanisms of venom production. Here, we report a highly continuous genome assembly of N. atra, based on a snake sample from Huangshan, Anhui, China. The size of this genome is 1.67 Gb, and repeats constitute 37.8% of it. A total of 26,432 functional genes were annotated. These data provide an essential resource for studying venom production in N. atra. They may also provide guidance for the protection of this species.
2023-11-20
The genome assembly and annotation of the Chinese cobra, Naja atra
hefengping@outlook.com
Fengping He
huguohai@cngb.org
Guohai Hu
Jiangang Wang
China National GeneBank
Yunnan Agricultural University