MakeDataCount counts all relationships as citations #6139

qqmyers · 2019-08-30T21:02:46Z

As written, I think the MakeDataCountApi citation counting will also report the "unique-resolutions-machine" and similar relationships between a dataset and the report sent in to register its views and downloads. Those events have the same structure and the code only checks subj-id and obj-id and doesn't filter on "source-id" or "relation-type-id".

The example URL in the code (curl https://api.datacite.org/events?doi=10.7910/dvn/hqzoob&source=crossref) shows the type of view/download events that I think will get picked up.

I have not confirmed this specifically, but instead noted that this code does pick up the is-part-of relationships between files and a dataset that we've added for QDR ( #2778 is related) and those are reported as citations. That lead me to inspect the code and, unless I missed something, I think it will count the reported views/downloads as well.

Minimally, I think it should filter those out, but there's a potentially larger question of whether all of the relationships DataCite will report should count as citations. That may be a MakeDataCount question rather than for Dataverse alone. (My guess is that there are few systems that give DOIs to datasets and files as Dataverse can, and few if any of those actually reporting the ispartof/haspart relationship to DataCite as we've started to do in QDR and as is planned in #2778).

qqmyers · 2019-09-12T19:30:23Z

FWIW: https://support.datacite.org/v1.1/docs/eventdata-query-api-guide#section-filtering-events-links-by-type says the following should excluded:

HasVersion
IsVersionOf
IsNewVersionOf
IsPreviousVersionOf
IsIdenticalTo
HasPart
IsPartOf

They recommend retrieving all relationships and filtering on the client side. (In their api, you can get events for just one type, but not for all types except the above, so there's no easy way to exclude 1000 ispartof relationships. :-( ), which means paging probably has to be managed even if there aren't many 'real' citations.

pdurbin · 2019-09-12T21:46:02Z

@qqmyers I appreciate the legwork on this. Out of curiosity, have you tried hitting the DataCite API directly (outside of Dataverse, I mean) to try to figure out how many dataset citations QDR (or TDL) or any installation has accumulated? Someday I'd love to get a count of all citations for all datasets hosted in an installation of Dataverse. 😄

jggautier · 2019-11-18T14:57:46Z

@qqmyers, I see that the recent PR to get a better count of citations uses a whitelist with the relation types "cites", "references" and their inverses. I'm curious why just those four. From what I can tell, for QDR's datasets the Event Data database has 1 "references" relation and 27 "is-supplement-to" relations, like this EventData record. Should the whitelist include "is-supplement-to" and its inverse?

qqmyers · 2019-11-18T15:18:26Z

@jggautier - probably. I wasn't sure which relationships would be considered citation versus 'structural' across the community so I thought I'd start with the obvious ones (and make sure people thought a whitelist was a good approach). If there's community agreement, I can add others as needed (or others can - the PR is editable). If not, it may be that the whitelist has to be configurable. We'll probably be discussing this at QDR later today (if @adam3smith doesn't chime in here first).

FWIW: what that PR does already is get rid of the 2000+ is-part-of/has-part relationships between files and datasets that QDR is reporting, which then gives us a reasonable number of citations to start looking at GUI/display issues, etc.)

adam3smith · 2019-11-18T16:03:15Z

Yeah, this is a super-tricky quesiton. Here was DataCite's original thinking on this

(with duplicates removed) for their own counts. Discussions in the steering group indicated that that was likely still too broad. One particular challenge they face is that clients implement the relationships very inconsistently. E.g. ICPSR uses "isDocumentedBy" for their (very substantial) catalog of data citations. (Don't ask me why).

I think @jggautier is definitely right that isSupplementTo should be included.
On the other hand, I would exclude outgoing cites and references -- e.g. if a dataset cites 100 articles (not completely implausible e.g. for historic data) that shouldn't be reflected. Similarly, incoming isCitedBy and isReferencedBy shouldn't be included.
Can we make that sort of distinction in the data?

qqmyers · 2019-11-18T16:17:50Z

A couple thoughts:

The event record includes the ids for the subject and object as well as the relationship name, so one can definitely filter on the direction of the relationship. There's also a 'source-id' that could be used as a filter - the is-part/has-part relationships between datasets and files are from 'datacite-related' (versus 'datacite-crossref', 'crossref', etc.). So one could potentially distinguish between a file being metadata-for a dataset and a dataset being metadata-for a paper (or vice versa) - the latter meant as an example where one might consider a relationship to be a citation if the subject/object are really independent.

I don't know if MDC addresses it, but one could also split citations of this dataset from things this dataset cites and display both.

pdurbin referenced this issue Sep 24, 2019

MakeDataCount only retrieves 25 citations #6138

Open

qqmyers referenced a pull request that will close this issue Nov 16, 2019

MakeDataCount counts all relationships as citations #6379

Open

0 of 5 tasks complete

Please note that GitHub no longer supports your web browser.

IQSS/dataverse

MakeDataCount counts all relationships as citations #6139

MakeDataCount counts all relationships as citations #6139

qqmyers commented Aug 30, 2019

This comment has been minimized.

qqmyers commented Sep 12, 2019

This comment has been minimized.

pdurbin commented Sep 12, 2019

This comment has been minimized.

jggautier commented Nov 18, 2019

This comment has been minimized.

qqmyers commented Nov 18, 2019

This comment has been minimized.

adam3smith commented Nov 18, 2019

This comment has been minimized.

qqmyers commented Nov 18, 2019

Please note that GitHub no longer supports your web browser.

IQSS/dataverse

Join GitHub today

MakeDataCount counts all relationships as citations #6139

Comments

qqmyers commented Aug 30, 2019

This comment has been minimized.

qqmyers commented Sep 12, 2019

This comment has been minimized.

pdurbin commented Sep 12, 2019

This comment has been minimized.

jggautier commented Nov 18, 2019

This comment has been minimized.

qqmyers commented Nov 18, 2019

This comment has been minimized.

adam3smith commented Nov 18, 2019

This comment has been minimized.

qqmyers commented Nov 18, 2019