The citation graph is one of humankind's most important intellectual achievements

Originally published at:

When researchers write, we don’t just describe new findings – we place them in context by citing the work of others. Citations trace the lineage of ideas, connecting disparate lines of scholarship into a cohesive body of knowledge, and forming the basis of how we know what we know.


So how does one actually use it to look something up (as one does easily with Scopus or WoS if one is a subscriber)? All I can find at is a page warning that

Ingestion of new citation data into the OpenCitations Corpus is currently suspended


Great article, but from the title I was expecting it to be about websites that provide citation graphs that end as soon as something comes in from the outside, “where attribution goes to die” (hopefully I’m not misattributing that quote).

Is there a reason we couldn’t just turn a crawler loose on sci-hub and reconstruct this index?


Hi @d_r, thanks for your comment. Let me tackle this from different angles.

First of all, I think it is important to clarify that I4OC does not make available any open dataset of citation data. Rather, it is a group that advocates the open availability of citations and that has been in continuous discussion with publishers about depositing the citation data of their articles in Crossref, so as to make them freely available to the world by means of the Crossref API. This free availability of citation data has then been reused by several parties, including companies, to offer (either open or commercial) services: in addition to OpenCitations, there are also Scholia, ScienceOpen, and Dimensions, just to mention a few.

From the OpenCitations side – disclosure: I’m one of the Directors of this service organisation – it implements a workflow that continuously ingests new data into the OpenCitations Corpus (OCC), i.e. the RDF repository of open citation data. As mentioned on the website, ingestion has been suspended in recent months since we are migrating the system to a new, more powerful infrastructure, which will allow us to increase the ingestion rate by 30-fold – thus adding 0.5M citations per month.

Hopefully, we can complete the full migration in June. However, several services made available by OpenCitations are already up and running in the new infrastructure. In particular, it is possible to access the data in the OCC in different ways: downloading the dumps, accessing the bibliographic resources directly via HTTP in different formats, querying the repository by using SPARQL (a query language for RDF data), and searching and browsing information in the OCC by means of textual-string search interfaces.
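
For anyone who wants to try the SPARQL route, here is a minimal sketch of how such a query could be put together. It only builds the query text and the request URL (a GET request with a `query` parameter, as in the standard SPARQL protocol) and does not contact any server. The endpoint URL and the resource IRI are assumptions for illustration, so check the OpenCitations website for the current ones; `cito:cites` is the citation property from the CiTO ontology.

```python
from urllib.parse import urlencode

# Assumption: illustrative endpoint only; look up the current one on the
# OpenCitations website before sending real requests.
SPARQL_ENDPOINT = "https://opencitations.net/sparql"

# Namespace of CiTO, the Citation Typing Ontology (one of the SPAR Ontologies).
CITO = "http://purl.org/spar/cito/"


def build_citing_query(cited_iri, limit=10):
    """Return a SPARQL query listing the resources that cite `cited_iri`."""
    return (
        f"PREFIX cito: <{CITO}>\n"
        "SELECT ?citing WHERE {\n"
        f"  ?citing cito:cites <{cited_iri}> .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query, endpoint=SPARQL_ENDPOINT):
    """Encode the query as a GET request URL (SPARQL protocol `query` parameter)."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Hypothetical resource IRI, used purely as a placeholder.
    query = build_citing_query("https://example.org/br/1", limit=5)
    print(query)
    print(build_request_url(query))
```

Swapping the subject and object of the triple pattern gives the outgoing references of a resource instead of its incoming citations.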

In addition, we will announce the release of quite a large dataset of additional citation information in the coming weeks on our Twitter account (@opencitations) – a release that would not have been possible without the effort that I4OC has made during the past year.


Hi @anon62122146,

I would say that, even if it is possible in principle to parse all the documents in Sci-Hub to extract citation data, you cannot republish them on the Web in the public domain (e.g. under CC0), since they are copyrighted material.

What I4OC is pushing for is a freely open citation graph which is entirely legal, and the path followed in the past year was to convince publishers that already deposited their citation data in Crossref to release them to the public in a totally free and legal way, making them accessible by means of the Crossref API – see my previous comment to @d_r.
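
To make the "accessible by means of the Crossref API" part concrete, here is a minimal sketch of reading the reference list of one article. It assumes the publisher has opened its references, in which case the work record returned by `https://api.crossref.org/works/<DOI>` carries a `reference` list whose entries may include a `DOI` field. The helper names are mine, and the DOI in the usage example is a placeholder, not a real article.

```python
import json
from typing import List
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def works_url(doi: str) -> str:
    """Build the Crossref REST API URL for a single work."""
    return CROSSREF_WORKS + quote(doi)


def extract_cited_dois(record: dict) -> List[str]:
    """Return the DOIs found in the `reference` list of a Crossref work record.

    References deposited without a DOI (free-text only) are skipped.
    """
    refs = record.get("message", {}).get("reference", [])
    return [ref["DOI"].lower() for ref in refs if "DOI" in ref]


def fetch_cited_dois(doi: str) -> List[str]:
    """Fetch a work record over the network and return the DOIs it cites."""
    with urlopen(works_url(doi)) as response:
        return extract_cited_dois(json.load(response))


if __name__ == "__main__":
    # Offline demonstration on a hand-written record with the same shape.
    sample = {
        "message": {
            "reference": [
                {"key": "ref1", "DOI": "10.1000/ABC"},
                {"key": "ref2", "unstructured": "Smith, J. (1999). A title."},
            ]
        }
    }
    print(extract_cited_dois(sample))
```

If the references have not been opened, the `reference` field is simply absent and the function returns an empty list.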


Just a fix in a passage I wrote in the previous comment.

The old OpenCitations infrastructure allowed us to ingest an additional 0.5M citations per month. The new infrastructure mentioned in that comment is estimated to make it possible to ingest 0.5M citations per day.
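
As a quick check that these two rates match the 30-fold figure, here is the arithmetic, assuming a 30-day month (my assumption, not an official figure):

```python
OLD_RATE_PER_MONTH = 500_000  # old infrastructure: 0.5M citations per month
NEW_RATE_PER_DAY = 500_000    # new infrastructure: 0.5M citations per day
DAYS_PER_MONTH = 30           # assumption: a 30-day month

new_rate_per_month = NEW_RATE_PER_DAY * DAYS_PER_MONTH
speedup = new_rate_per_month / OLD_RATE_PER_MONTH

print(new_rate_per_month)  # citations per month on the new infrastructure
print(speedup)             # ratio between the new and old monthly rates
```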


Let me rephrase my comment. I am editor in chief of an academic journal and a (voting) Crossref member. Recently Crossref has been telling us of the “option to distribute your references openly”. While this is apparently trivial for many journals, on my journal this would be an extra effort on our part. Our staff is all volunteer faculty. Please let me know what the benefit to us would be of this extra effort.


In 2016, Thomson Reuters sold its Intellectual Property & Science business to a private-equity fund for $3.55 billion. … it’s ironic that the vision of building a comprehensive index of scientific literature has turned into a billion-dollar business.

Except it hasn’t turned into a billion-dollar business. That $3.55 billion paid for a large portfolio that included Cortellis, Derwent Innovation, EndNote, GeneGo, IDdb, IDRAC, MicroPatent, Techstreet Industry Standards, as well as Web of Science.

My company subscribes to about half of these products. Web Of Science is the cheapest by far.

Besides the legal points brought up, actually parsing the documents would be nearly impossible. Scientific papers are generally found on sci-hub as PDFs identical to the printed copy, not as marked up metadata. Parsing such things is non-trivial (as the poorly parsed epub versions of pdfs found on attests)

Thanks for the info. I agree this would be better. If you parsed it from sci-hub, you’d pretty much have to only make it available the same way as sci-hub.

A simple parser would not be enough. You’d have to have some crowdsourcing code, like asking the person using the graph to classify marginal cases for their graph, and adding the results to the database.

I think discoverability is one of the main advantages here. Opening citations results in making available additional paths to reach the citing and cited publications, where these paths become accessible to anyone. In addition, if you don’t make this information available to the public, all your journals and publications cannot be used by applications, analyses and derivative databases that use these open citation data. As a compendium of this argument, at OpenCitations we often use an analogy to introduce the advantages of this open availability of citation data at a big scale.

About the effort needed (and without knowing your particular situation): if you, as a publisher, are already depositing these citation data to Crossref, then you only need to ask them to turn on the reference distribution for all the DOI prefixes you control. However, if you need to extend your current publication workflow so as to produce these metadata to submit to Crossref, then things can get a bit more complex. The task could be easily automated if you use a machine-readable format for storing the sources of your publications (e.g. XML); otherwise more ad-hoc solutions are needed, of course. I guess you are in this latter situation, aren’t you?

I kind of like the fact that it has not been turned into a multi-billion dollar business. Maybe this will spur the private equity firm to release more to the public domain. If they wanted a tax break for doing so…it might be worth it. A lot of private equity firms aren’t into long term visions.

So not really much of an advantage to me, or to people in my field, who already have plenty of tools for discovering the relevant important literature in their area. I can imagine how it might help some large commercial publisher – Hindawi comes to mind, because of their cosy relationship with CrossRef – decide where to target their next set of 100 pay-to-play journals.

Honestly, if you’re going to call something “one of humankind’s most important intellectual achievements” you need to make a case for this.


Wasn’t there a little scandal a year or two ago when one of the most cited articles (or perhaps it was many of the most cited articles?) turned out to say something completely different from what people thought it did, because someone cited it once and everyone else cited it on the basis of that first citation – without bothering to read what it actually said?


The citation graph is only meta-data. I use the ‘citation graph’ (searching references and backward references) all the time for my own scientific research, in the sense that I see what and why people referenced something else and see if it is relevant, but it is secondary to the content. The citation graph is not too valuable unless you also have the context and reason for the citation, or the content (at least the title and abstract) of the cited source. I use citation searches for discovering possible leads that slipped through keyword searches and the like. For a new area I’m working in, I’d guess that if I see a new paper, at least half the works cited are not relevant to whatever question I have, and I can tell that from the context or the title (neither of which is part of the citation graph); another 1/4 to 1/3 are famous papers that I already know. Of the rest, less than half are things I would add to my own work – and this is often just confirmation that a paper I saw somewhere else is judged as relevant broadly. Where the citation graph appears to mostly come into play is in allotting credit for work, computing influence factors and H-factors.
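
For readers who have not met these metrics: assuming "H-factor" here means the usual h-index, it is the largest h such that an author has h papers with at least h citations each. A minimal sketch, using made-up citation counts:

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(counts, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have at least `rank` citations
        else:
            break      # counts only decrease from here, so no larger h exists
    return h


if __name__ == "__main__":
    # Invented per-paper citation counts for one hypothetical author.
    print(h_index([10, 8, 5, 4, 3]))
```

Note that the only input is the number of incoming citations per paper, which is exactly the part of the citation graph that cannot be read off a single paper's reference list.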

By analogy, the link graph of Wikipedia is completely open, but certainly an order of magnitude less important than the Wikipedia content. PageRank used the citation graph of webpages as the core, for ranking relevance among pages that match keywords. Note that the content is the primary measure of relevance; the citation graph helps find the most relevant within that set. The citation graph is only useful if you have the content and context; if you have that, you already have the citation graph (i.e., the references section).
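
To illustrate the PageRank idea on a citation graph, here is a minimal power-iteration sketch. It is a textbook version rather than any search engine's actual implementation; the damping factor of 0.85, the fixed iteration count, and the three-paper toy graph are all assumptions for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Rank nodes of a directed graph given as {node: [nodes it cites]}."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this node's rank evenly among the works it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node that cites nothing spreads its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy graph: papers A and B both cite paper C.
    for paper, score in sorted(pagerank({"A": ["C"], "B": ["C"], "C": []}).items()):
        print(paper, round(score, 3))
```

The ranking uses only the links, which is why in practice it is combined with a content match first, as described above.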


The citation graph is not too valuable unless you also have the context and reason for the citation

I totally agree with this, and I’ve personally put a lot of effort into making this part accessible also to machines – e.g. see the SPAR Ontologies. Honestly, this is something very difficult to achieve in an automatic or semi-automatic fashion, and it is not the point of the I4OC, at least not at this stage.

or the content (at least the title and abstract) of the cited source.

Well, when I wrote about citation data in the previous comments I actually meant that other metadata are also included or can be easily derived. For instance, the title is among the information that Crossref returns, in addition to the authors, publication dates, etc. Since these can be considered ‘facts’, they can be released in the public domain without any issue.

I guess this is more tricky for the abstract, since that specific part is actually copyrighted – it is not a fact, it is creative content written by the authors. It is worth mentioning that this holds for both closed- and open-access publications. In fact, a CC-BY license (usually adopted by open access journals) is not enough for guaranteeing a fully open reuse. Only a CC0 Public Domain Dedication and Waiver permits unrestricted reuse, since it is more permissive than a CC-BY license and is widely adopted for data. The intended goal of I4OC is to push for CC0 adoption.

The citation graph is only useful if you have the content and context; if you have that, you already have the citation graph (i.e., the references section).

I slightly disagree here. The citation graph is useful, period, and can be used for computing the metrics you refer to and for additional things as well (e.g. browsing within a large set of the literature available). Having the content and the context is for sure an added value (which could present issues related to the license, as mentioned before). However, having the content open, e.g. because the publication has been released with a CC-BY license, is not enough to have citation data that are structured, separable, and open, which is the main goal the I4OC wants to reach. The fact that these data are available in Crossref means having a tool for accessing and querying them as a whole, something that you don’t have if you have to parse publications every time you need these data.

I think that it is important to point out that when they say “citation map,” they’re talking about not just the ability to look backwards at what articles a paper has cited. They’re also talking about the ability to see what articles cited the article that you are looking at. This requires that the citation data from those later articles be structured enough to be consistently linked to the article that they are citing. This can be more difficult than it looks when we consider different style manuals, transcription errors, different naming conventions (given or family name listed first), and title abbreviations (does J. Phys. mean Journal of Physics or Journal de Physique?).
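
A minimal sketch of the "who cited this article" direction, assuming the hard part described above is already done, i.e. every reference has been resolved to a DOI. All that is left is to normalise the identifiers (DOIs are case-insensitive and are often written with a resolver prefix) and invert the mapping. The DOIs in the example are invented.

```python
from collections import defaultdict

# Common ways a DOI gets written in a reference list.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lower-case a DOI and strip a leading resolver URL or `doi:` label."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def build_cited_by(references):
    """Invert {citing DOI: [cited DOIs]} into {cited DOI: [citing DOIs]}."""
    cited_by = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            cited_by[normalize_doi(cited)].add(normalize_doi(citing))
    return {cited: sorted(citing) for cited, citing in cited_by.items()}


if __name__ == "__main__":
    # Invented DOIs written in three different styles.
    refs = {
        "10.1/A": ["10.1/c", "doi:10.1/B"],
        "10.1/b": ["https://doi.org/10.1/C"],
    }
    print(build_cited_by(refs))
```

With free-text references instead of DOIs, the `normalize_doi` step would have to be replaced by real reference matching, which is where the style-manual and abbreviation problems come in.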

It is also worth remembering just how hard West Publishing tried to claim copyright in the CITATIONS to public domain works.

This topic was automatically closed after 5 days. New replies are no longer allowed.