Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.
Sign upMakeDataCount counts all relationships as citations #6139
Comments
This comment has been minimized.
This comment has been minimized.
FWIW: https://support.datacite.org/v1.1/docs/eventdata-query-api-guide#section-filtering-events-links-by-type says the following should excluded: HasVersion They recommend retrieving all relationships and filtering on the client side. (In their api, you can get events for just one type, but not for all types except the above, so there's no easy way to exclude 1000 ispartof relationships. :-( ), which means paging probably has to be managed even if there aren't many 'real' citations. |
This comment has been minimized.
This comment has been minimized.
@qqmyers I appreciate the legwork on this. Out of curiosity, have you tried hitting the DataCite API directly (outside of Dataverse, I mean) to try to figure out how many dataset citations QDR (or TDL) or any installation has accumulated? Someday I'd love to get a count of all citations for all datasets hosted in an installation of Dataverse. |
This comment has been minimized.
This comment has been minimized.
@qqmyers, I see that the recent PR to get a better count of citations uses a whitelist with the relation types "cites", "references" and their inverses. I'm curious why just those four. From what I can tell, for QDR's datasets the Event Data database has 1 "references" relation and 27 "is-supplement-to" relations, like this EventData record. Should the whitelist include "is-supplement-to" and its inverse? |
This comment has been minimized.
This comment has been minimized.
@jggautier - probably. I wasn't sure which relationships would be considered citation versus 'structural' across the community so I thought I'd start with the obvious ones (and make sure people thought a whitelist was a good approach). If there's community agreement, I can add others as needed (or others can - the PR is editable). If not, it may be that the whitelist has to be configurable. We'll probably be discussing this at QDR later today (if @adam3smith doesn't chime in here first). FWIW: what that PR does already is get rid of the 2000+ is-part-of/has-part relationships between files and datasets that QDR is reporting, which then gives us a reasonable number of citations to start looking at GUI/display issues, etc.) |
This comment has been minimized.
This comment has been minimized.
Yeah, this is a super-tricky quesiton. Here was DataCite's original thinking on this I think @jggautier is definitely right that |
This comment has been minimized.
This comment has been minimized.
A couple thoughts: The event record includes the ids for the subject and object as well as the relationship name, so one can definitely filter on the direction of the relationship. There's also a 'source-id' that could be used as a filter - the is-part/has-part relationships between datasets and files are from 'datacite-related' (versus 'datacite-crossref', 'crossref', etc.). So one could potentially distinguish between a file being metadata-for a dataset and a dataset being metadata-for a paper (or vice versa) - the latter meant as an example where one might consider a relationship to be a citation if the subject/object are really independent. I don't know if MDC addresses it, but one could also split citations of this dataset from things this dataset cites and display both. |
qqmyers commentedAug 30, 2019
As written, I think the MakeDataCountApi citation counting will also report the "unique-resolutions-machine" and similar relationships between a dataset and the report sent in to register its views and downloads. Those events have the same structure and the code only checks subj-id and obj-id and doesn't filter on "source-id" or "relation-type-id".
The example URL in the code (curl https://api.datacite.org/events?doi=10.7910/dvn/hqzoob&source=crossref) shows the type of view/download events that I think will get picked up.
I have not confirmed this specifically, but instead noted that this code does pick up the is-part-of relationships between files and a dataset that we've added for QDR ( #2778 is related) and those are reported as citations. That lead me to inspect the code and, unless I missed something, I think it will count the reported views/downloads as well.
Minimally, I think it should filter those out, but there's a potentially larger question of whether all of the relationships DataCite will report should count as citations. That may be a MakeDataCount question rather than for Dataverse alone. (My guess is that there are few systems that give DOIs to datasets and files as Dataverse can, and few if any of those actually reporting the ispartof/haspart relationship to DataCite as we've started to do in QDR and as is planned in #2778).