The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA sequences and various types of DNA that does not encode proteins. The latter is a diverse category that includes DNA coding for non-translated RNA, such as that for ribosomal RNA, transfer RNA, ribozymes, small nuclear RNAs, and several types of regulatory RNAs. It also includes promoters and their associated gene-regulatory elements, DNA playing structural and replicatory roles, such as scaffolding regions, telomeres, centromeres, and origins of replication, plus large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly repetitive sequences. Introns make up a large percentage of non-coding DNA. Some of this non-coding DNA is non-functional junk DNA, such as pseudogenes, but there is no firm consensus on the total amount of junk DNA.
Haploid human genomes, which are contained in germ cells (the egg and sperm gamete cells created in the meiosis phase of sexual reproduction before fertilization) consist of 3,054,815,472 DNA base pairs (if X chromosome is used), while female diploid genomes (found in somatic cells) have twice the DNA content.
While there are significant differences among the genomes of human individuals (on the order of 0.1% due to single-nucleotide variants and 0.6% when considering indels), these are considerably smaller than the differences between humans and their closest living relatives, the bonobos and chimpanzees (~1.1% fixed single-nucleotide variants and 4% when including indels). Size in basepairs can vary too; the telomere length decreases after every round of DNA replication.
Although the sequence of the human genome has been completely determined by DNA sequencing in 2022 (including methylation), it is not yet fully understood. Most, but not all, genes have been identified by a combination of high throughput experimental and bioinformatics approaches, yet much work still needs to be done to further elucidate the biological functions of their protein and RNA products (in particular, annotation of the complete CHM13v2.0 sequence is still ongoing). And yet, overlapping genes are quite common, in some cases allowing two protein coding genes from each strand to reuse base pairs twice (for example, genes DCDC2 and KAAG1). Recent results suggest that most of the vast quantities of noncoding DNA within the genome have associated biochemical activities, including regulation of gene expression, organization of chromosome architecture, and signals controlling epigenetic inheritance. There are also a significant number of retroviruses in human DNA, at least 3 of which have been proven to possess an important function (i.e., HIV-like HERV-K, HERV-W, and HERV-FRD play a role in placenta formation by inducing cell-cell fusion).
In 2003, scientists reported the sequencing of 85% of the entire human genome, but as of 2020 at least 8% was still missing. In 2021, scientists reported sequencing the complete female genome (i.e., without the Y chromosome). This sequence identified 19,969 protein-coding sequences, accounting for approximately 1.5% of the genome, and 63,494 genes in total, most of them being non-coding RNA genes. The genome consists of regulatory DNA sequences, LINEs, SINEs, introns, and sequences for which as yet no function has been determined. The human Y chromosome, consisting of 62,460,029 base pairs from a different cell line and found in all males