To our knowledge, there is no large-scale database that systematically ranks all the most-cited scientists in each and every scientific field to a sufficient ranking depth; e. Moreover, self-citations are not excluded in these existing rankings. We have tried to offer a solution that overcomes many of the technical problems and provides a comprehensive database of a sufficiently large number of most-cited scientists across science. Here, we used Scopus data to compile a database of the , most-cited authors across all scientific fields based on their ranking by a composite indicator that considers six citation metrics (total citations; Hirsch h-index; coauthorship-adjusted Schreiber hm-index; number of citations to papers as single author; number of citations to papers as single or first author; and number of citations to papers as single, first, or last author) [ 2 ].
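As an illustration only, the six metrics listed above could be computed from per-paper records roughly as follows. This is a minimal sketch, not the pipeline used for the database: the `Paper` record and its `position` labels are hypothetical, the hm-index follows our reading of Schreiber's definition (each paper advances the rank by 1/number of authors), and the composite indicator itself is not reproduced because its formula is given in [ 2 ] rather than here.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    citations: int   # citations received by this paper
    n_authors: int   # number of authors on the paper
    position: str    # hypothetical labels: "single", "first", "last", "middle"


def h_index(papers):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted((p.citations for p in papers), reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c < rank:
            break
        h = rank
    return h


def hm_index(papers):
    """Coauthorship-adjusted variant: each paper adds 1/n_authors to the rank."""
    ranked = sorted(papers, key=lambda p: p.citations, reverse=True)
    effective_rank = 0.0
    hm = 0.0
    for p in ranked:
        effective_rank += 1.0 / p.n_authors
        if p.citations < effective_rank:
            break
        hm = effective_rank
    return hm


def six_metrics(papers):
    """Return the six citation metrics for one author's list of papers."""
    def cites(positions):
        return sum(p.citations for p in papers if p.position in positions)

    return {
        "total_citations": sum(p.citations for p in papers),
        "h": h_index(papers),
        "hm": hm_index(papers),
        "cites_single": cites({"single"}),
        "cites_single_first": cites({"single", "first"}),
        "cites_single_first_last": cites({"single", "first", "last"}),
    }
```

Calling `six_metrics` on an author's list of `Paper` records returns one dictionary per author, which could then be computed twice (with and without self-citations) to mirror the two sets of indicators in the database.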

The methodology behind the composite indicator, along with its strengths and residual caveats, has already been described extensively in [ 2 ]. We offer two versions of the database. For papers published from until , the citations received in — are also included in the calculations, but the citations received up to are not. Therefore, this version provides a measure of long-term performance, and for most living, active scientists, this also reflects their career-long impact or is a very good approximation thereof.

To assess the robustness and validity of the calculations, they have been replicated on a second, independent platform using a data set with a slightly different timestamp (less than one month difference). It provides a measure of performance in that single recent year.

Therefore, it removes the bias that may arise when comparing scientists who have accrued citations over many years of active work with younger ones who have had a shorter time frame in which to accumulate citations, because it focuses on citation accrual during a single year only.

The constructed database shows, for each scientist, the values for each of the six metrics that are used in the calculation of the composite as well as the composite indicator itself, and all indicators are given with and without self-citations. Institutional affiliation and the respective country are inferred based on most recent publications according to the Scopus data as of May Therefore, only one affiliation is provided even though scientists may have worked in several institutions.

Nevertheless, all their work in different institutions is captured within their author record. We provide data that exclude self-citations to a paper by any author of that paper and, separately, data including all citations, e. Among the top , authors for — data, the median percentage of self-citations is Among the top , authors for the single-year data, the median percentage of self-citations is 9.
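The self-citation rule stated above (a citation counts as a self-citation when the citing paper shares any author with the cited paper) can be sketched as follows. The input layout and the function name `self_citation_summary` are assumptions made for illustration; real Scopus records would use author identifiers rather than plain names.

```python
def self_citation_summary(papers):
    """Count citations with and without self-citations for one author.

    `papers` is a list of (paper_authors, citing_papers) pairs, where
    `citing_papers` is a list of author lists, one per citing paper.
    """
    total = 0
    self_cites = 0
    for paper_authors, citing_papers in papers:
        own = set(paper_authors)
        for citing_authors in citing_papers:
            total += 1
            # Self-citation: any overlap between citing and cited author sets.
            if own & set(citing_authors):
                self_cites += 1
    self_pct = 100.0 * self_cites / total if total else 0.0
    return {
        "all": total,
        "excluding_self": total - self_cites,
        "self_pct": self_pct,
    }
```

Note that under this rule the check is made per cited paper: a citation from a coauthor of one paper to a different paper on which that coauthor does not appear is not counted as a self-citation.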

With very high proportions of self-citations, we would advise against using any citation metrics, since extreme rates of self-citation may also herald other spurious features. These need to be examined on a case-by-case basis for each author, and simply removing the self-citations may not suffice [ 3 ]. We also provide data on the number of citing papers and on the ratio of citations divided by the number of citing papers. High ratios deserve more in-depth assessment of these authors.

Sometimes, this may reflect that it is common for a small number of papers of the same author to be cited together. All science is divided into 22 large fields e. Thus, users can rank scientists according to each of the six metrics or the composite indicator and can limit the ranking to scientists with similar scientific field or top subfield for different levels of desired similarity.
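The ratio of citations to citing papers mentioned above can be illustrated with a short sketch. It assumes, hypothetically, that citations are available as (citing paper, cited paper) pairs for one author; the function name is ours.

```python
def citations_per_citing_paper(citation_pairs):
    """Ratio of citations to distinct citing papers for one author.

    `citation_pairs` is an iterable of (citing_paper_id, cited_paper_id).
    A ratio near 1 means each citing paper cites about one of the
    author's papers; higher values mean several are cited together.
    """
    pairs = set(citation_pairs)          # drop duplicate records
    citing = {citing_id for citing_id, _ in pairs}
    return len(pairs) / len(citing) if citing else 0.0
```

For example, an author whose three papers are all cited by the same citing paper, and by nothing else, would have a ratio of 3.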

A total of 6,, scientists have published at least five papers. Therefore, the scientist is in the top 0. For all 6,, scientists, Table 1 shows the career-long 25th, 50th, 75th, and 90th percentile of total citations and composite citation index according to each of the 22 fields. Table S3 provides the same information along with 95th and 99th percentiles for each of the subfields as well.
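A per-field percentile table of the kind described for Table 1 could be produced along these lines. This is a sketch under stated assumptions: the text does not say which percentile method was used, so the nearest-rank method is chosen here for simplicity, and the record layout and function names are hypothetical.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    k = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[k - 1]


def percentiles_by_field(records, pcts=(25, 50, 75, 90)):
    """Group (field, value) records by field and compute the requested percentiles."""
    by_field = defaultdict(list)
    for field, value in records:
        by_field[field].append(value)
    return {
        field: {p: nearest_rank_percentile(values, p) for p in pcts}
        for field, values in by_field.items()
    }
```

The same function can be applied to total citations or to the composite index, and to subfields instead of fields by changing the grouping key.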

Thus, one can see the relative citation density of different fields. Total citations include self-citations. Existing ranking systems typically focus on single fields e. They also do not account for self-citation phenomena. Nevertheless, our databases still have limitations that have been discussed in detail previously in describing the methodology behind the composite indicator [ 2 ].

We should also caution again that citations from before are missing from our analysis. Overall, whole-career metrics place young scientists at a disadvantage. Single-year metrics remove much of this problem, although again, younger scientists have fewer years of publication history and thus probably fewer papers that can be cited in We have included the year of first (earliest) publication and the year of last (most recent) indexed publication of each author.

    Although apparent faults were found in only 13 of the 74 annotated genes examined, a smaller proportion than expected to be pseudogenes from our interpretation of the reporter gene fusion results, the mode of analysis means that this must be an underestimate. Effectively, only annotated genes with close homologs in the C. Furthermore, the examination was only cursory.

    Pseudogenes may be difficult to distinguish from functional genes by sequence analysis alone or even when combined with experimental analysis. The predominant fate of duplicated genes will be to accumulate mutations that render them nonfunctional pseudogenes. Premature stop codons and frameshifting mutations are the most obvious defining characteristics of a pseudogene, but gene structure prediction programs may find alternative splicing patterns around such obstacles, particularly if there is no good homology with a functionally well-characterized gene or EST data to act as a guide.
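The two most obvious defects named above, premature stop codons and frameshifts, can be flagged in an already assembled coding sequence with a simple scan. This is a minimal sketch, not the analysis performed in this study: it assumes the standard genetic code, a spliced coding sequence given as a DNA string, and a hypothetical function name. As noted above, gene prediction programs may splice around such defects, so a clean result does not show that a gene is functional.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def coding_sequence_flags(cds):
    """Report simple signs of disruption in a coding DNA sequence."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # In-frame stop codons anywhere before the final codon are premature.
    premature = [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    return {
        "length_not_multiple_of_3": len(seq) % 3 != 0,
        "premature_stop_codon_indices": premature,
        "ends_with_stop": bool(codons) and codons[-1] in STOP_CODONS,
    }
```

A length that is not a multiple of three is only a crude proxy for a frameshift; locating the actual insertion or deletion would require alignment against a homolog.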


    Genes that have been disabled by damaged splice sites or promoters will be even harder to recognize as pseudogenes, and such genes may linger before genetic drift makes them clearly recognizable as pseudogenes from inspection of the sequence alone. Although it might have been anticipated that the integrity of the protein-coding region of a recently duplicated gene would be more sensitive to genetic drift than the promoter, our results suggest that this may not be the case.

    The conservation of protein-coding regions beyond that of introns for these recently duplicated genes suggests that these genes were initially functional and subject to evolutionary selection before they became inactivated by genetic drift. These findings raise many questions about the evolution of the C. We suggest that many of the considerable number of recently duplicated genes in the C.

    This implies that the C. Other sequenced animal genomes may contain fewer pseudogenes, which could have made this problem easier to detect in C. Nevertheless, this problem may be present Schmid and Aquadro , but harder to deal with in other species in which gene structure is even more difficult to predict, and experience gained with C. Generation of the reporter gene fusions involved standard molecular biology procedures as described previously Hope ; Lynch et al. The vectors were modified from pPD This cassette allows the reading frame to be corrected simply by digestion with either Asc I or Not I, depending on the shift needed and recircularization.

    Expression of the reporter gene was examined in wild-type N2 C. All assayed plasmids are available on request. All 74 annotated genes in the duplicated category, which had failed to drive reporter gene expression, were analyzed. Obvious defects in what were otherwise excellent alignments were sought by direct inspection. A gene would be investigated further if the predicted protein product lacked several consecutive amino acid residues that were highly conserved across the protein family. Frequently, part of the missing coding region could be found, but had been omitted from the gene structure prediction because a smaller coding-region deletion, a translational reading frame-shift, or a stop codon prevented its inclusion in any potentially functional gene structure.

    This work depended crucially upon free access to the C. The publication costs of this article were defrayed in part by payment of page charges.

    Article published online before print in April