MakeDataCount counts all relationships as citations #6139

Open
qqmyers opened this issue Aug 30, 2019 · 6 comments · May be fixed by #6379

@qqmyers qqmyers commented Aug 30, 2019

As written, I think the MakeDataCountApi citation counting will also report the "unique-resolutions-machine" and similar relationships between a dataset and the report sent in to register its views and downloads. Those events have the same structure and the code only checks subj-id and obj-id and doesn't filter on "source-id" or "relation-type-id".

The example URL in the code (curl https://api.datacite.org/events?doi=10.7910/dvn/hqzoob&source=crossref) shows the type of view/download events that I think will get picked up.

I have not confirmed this specifically, but instead noted that this code does pick up the is-part-of relationships between files and a dataset that we've added for QDR ( #2778 is related) and those are reported as citations. That led me to inspect the code and, unless I missed something, I think it will count the reported views/downloads as well.

Minimally, I think it should filter those out, but there's a potentially larger question of whether all of the relationships DataCite will report should count as citations. That may be a MakeDataCount question rather than for Dataverse alone. (My guess is that there are few systems that give DOIs to datasets and files as Dataverse can, and few if any of those actually reporting the ispartof/haspart relationship to DataCite as we've started to do in QDR and as is planned in #2778).

@qqmyers qqmyers commented Sep 12, 2019

FWIW: https://support.datacite.org/v1.1/docs/eventdata-query-api-guide#section-filtering-events-links-by-type says the following should be excluded:

HasVersion
IsVersionOf
IsNewVersionOf
IsPreviousVersionOf
IsIdenticalTo
HasPart
IsPartOf

They recommend retrieving all relationships and filtering on the client side. In their API, you can get events for just one type, but not for all types except the above, so there's no easy way to exclude 1000 ispartof relationships. :-( That means paging probably has to be managed even if there aren't many 'real' citations.
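
A rough sketch of what client-side filtering with paging could look like, in Python rather than Dataverse's Java and not taken from the Dataverse code. Assumptions: `fetch_page` is a hypothetical callable that returns one decoded JSON page, the response keeps events under `data` and the next page URL under `links.next`, and the `relation-type-id` values are the kebab-case forms of the names listed above.

```python
import re
from typing import Callable, Iterator, List, Optional

# Relation types the DataCite guide says to exclude, as listed above.
EXCLUDED_CAMEL = [
    "HasVersion", "IsVersionOf", "IsNewVersionOf", "IsPreviousVersionOf",
    "IsIdenticalTo", "HasPart", "IsPartOf",
]


def to_kebab(name: str) -> str:
    """Convert 'IsPartOf' to 'is-part-of' (assumed relation-type-id form)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "-", name).lower()


EXCLUDED = {to_kebab(n) for n in EXCLUDED_CAMEL}


def iter_events(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield events from every page; fetch_page(None) returns the first page."""
    url = None
    while True:
        page = fetch_page(url)
        yield from page.get("data", [])
        # Assumed paging layout: a "links" object with an optional "next" URL.
        url = (page.get("links") or {}).get("next")
        if not url:
            break


def keep_event(event: dict) -> bool:
    """True unless the event's relation type is on the exclusion list."""
    relation = event.get("attributes", {}).get("relation-type-id", "")
    return relation not in EXCLUDED


def filtered_events(fetch_page: Callable[[Optional[str]], dict]) -> List[dict]:
    """Walk all pages and keep only events that pass the exclusion filter."""
    return [e for e in iter_events(fetch_page) if keep_event(e)]
```

Passing `fetch_page` in keeps the HTTP client out of the sketch. The point is that every page has to be walked even if only a handful of events survive the filter.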

@pdurbin pdurbin commented Sep 12, 2019

@qqmyers I appreciate the legwork on this. Out of curiosity, have you tried hitting the DataCite API directly (outside of Dataverse, I mean) to try to figure out how many dataset citations QDR (or TDL) or any installation has accumulated? Someday I'd love to get a count of all citations for all datasets hosted in an installation of Dataverse. 😄

@qqmyers qqmyers referenced a pull request that will close this issue Nov 16, 2019
@jggautier jggautier commented Nov 18, 2019

@qqmyers, I see that the recent PR to get a better count of citations uses a whitelist with the relation types "cites", "references" and their inverses. I'm curious why just those four. From what I can tell, for QDR's datasets the Event Data database has 1 "references" relation and 27 "is-supplement-to" relations, like this EventData record. Should the whitelist include "is-supplement-to" and its inverse?

@qqmyers qqmyers commented Nov 18, 2019

@jggautier - probably. I wasn't sure which relationships would be considered citation versus 'structural' across the community so I thought I'd start with the obvious ones (and make sure people thought a whitelist was a good approach). If there's community agreement, I can add others as needed (or others can - the PR is editable). If not, it may be that the whitelist has to be configurable. We'll probably be discussing this at QDR later today (if @adam3smith doesn't chime in here first).

FWIW: what that PR does already is get rid of the 2000+ is-part-of/has-part relationships between files and datasets that QDR is reporting, which then gives us a reasonable number of citations to start looking at GUI/display issues, etc.
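
To make the whitelist idea concrete, here is a minimal Python sketch of a configurable whitelist. It is not the code in the PR. The comma-separated setting format and the function names are hypothetical, and the kebab-case names for the inverse relations (`is-cited-by`, `is-referenced-by`) are assumed.

```python
from typing import FrozenSet, Iterable, Optional

# Default: "cites", "references" and their inverses (inverse names assumed).
DEFAULT_WHITELIST: FrozenSet[str] = frozenset(
    {"cites", "is-cited-by", "references", "is-referenced-by"}
)


def parse_whitelist(setting: Optional[str]) -> FrozenSet[str]:
    """Parse a comma-separated setting; fall back to the default if unset."""
    if not setting or not setting.strip():
        return DEFAULT_WHITELIST
    return frozenset(p.strip().lower() for p in setting.split(",") if p.strip())


def count_citations(
    events: Iterable[dict], whitelist: FrozenSet[str] = DEFAULT_WHITELIST
) -> int:
    """Count only events whose relation-type-id is on the whitelist."""
    return sum(
        1
        for e in events
        if e.get("attributes", {}).get("relation-type-id") in whitelist
    )
```

With something like this, an installation that wants to count "is-supplement-to" could pass `parse_whitelist("cites, references, is-supplement-to")` instead of changing code.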

@adam3smith adam3smith commented Nov 18, 2019

Yeah, this is a super-tricky question. Here was DataCite's original thinking on this [image not preserved] (with duplicates removed) for their own counts. Discussions in the steering group indicated that that was likely still too broad. One particular challenge they face is that clients implement the relationships very inconsistently. E.g. ICPSR uses "isDocumentedBy" for their (very substantial) catalog of data citations. (Don't ask me why).

I think @jggautier is definitely right that isSupplementTo should be included.
On the other hand, I would exclude outgoing cites and references -- e.g. if a dataset cites 100 articles (not completely implausible e.g. for historic data) that shouldn't be reflected. Similarly, incoming isCitedBy and isReferencedBy shouldn't be included.
Can we make that sort of distinction in the data?

@qqmyers

This comment has been minimized.

Copy link
Member Author

@qqmyers qqmyers commented Nov 18, 2019

A couple thoughts:

The event record includes the ids for the subject and object as well as the relationship name, so one can definitely filter on the direction of the relationship. There's also a 'source-id' that could be used as a filter - the is-part/has-part relationships between datasets and files are from 'datacite-related' (versus 'datacite-crossref', 'crossref', etc.). So one could potentially distinguish between a file being metadata-for a dataset and a dataset being metadata-for a paper (or vice versa) - the latter meant as an example where one might consider a relationship to be a citation if the subject/object are really independent.

I don't know if MDC addresses it, but one could also split citations of this dataset from things this dataset cites and display both.
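
As an illustration of the direction and source filtering described above, here is a hedged Python sketch that splits events into citations of a dataset and things the dataset cites. It only handles cites/references and their inverses. The inverse names and the function names are assumptions, and IDs are compared as exact strings, so real code would need to normalize DOIs first.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional

FORWARD = {"cites", "references"}              # subject cites object
INVERSE = {"is-cited-by", "is-referenced-by"}  # object cites subject (names assumed)


def direction(event: dict, dataset_id: str) -> Optional[str]:
    """Return 'incoming' if the dataset is cited, 'outgoing' if it cites."""
    attrs = event.get("attributes", {})
    subj, obj = attrs.get("subj-id"), attrs.get("obj-id")
    relation = attrs.get("relation-type-id")
    if relation in FORWARD:
        citing, cited = subj, obj
    elif relation in INVERSE:
        citing, cited = obj, subj
    else:
        return None  # not a citation-type relationship in this sketch
    if cited == dataset_id:
        return "incoming"
    if citing == dataset_id:
        return "outgoing"
    return None


def split_citations(
    events: Iterable[dict],
    dataset_id: str,
    skip_sources: FrozenSet[str] = frozenset(),
) -> Dict[str, List[dict]]:
    """Group events by direction, optionally dropping some source-id values."""
    result: Dict[str, List[dict]] = {"incoming": [], "outgoing": []}
    for event in events:
        if event.get("attributes", {}).get("source-id") in skip_sources:
            continue
        d = direction(event, dataset_id)
        if d is not None:
            result[d].append(event)
    return result
```

Calling `split_citations(events, doi, skip_sources=frozenset({"datacite-related"}))` would drop everything from that source. Whether that is too blunt (it might also drop real dataset-to-dataset citations) is part of the open question here.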
