topic annotation for articles should be throttled #683

Open
egonw opened this Issue Apr 25, 2019 · 4 comments

Comments

3 participants
@egonw
Collaborator

commented Apr 25, 2019

@Daniel-Mietchen, this is just a heads up... I've been lurking on #wikidata on IRC for some time now. The WDQS servers have been building up lag quite a few times in the past couple of months. Sjoerd has been looking into the issue, and it seems related to the size of the items being edited. Now, it seems that articles are generally large, or at least in terms of the number of statements:

image

If I'm not mistaken, your batches have been shut down (see your talk page).

So, here's one scalability issue for Scholia: mass annotation of articles with topics is fairly expensive, and the old WDQS cluster does not handle the data bandwidth well. One issue is that each edit requires passing around the full content of the item, so the cost scales with the size of the item. Andra also told me there seems to be an issue with the maximum JSON size of an item.
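As a rough illustration of the throttling suggested in the issue title, here is a minimal sketch of a client-side rate limiter. The `Throttle` class and its `max_per_minute` parameter are hypothetical names chosen for this example; they are not part of Scholia or of any Wikidata tooling, and the right rate would have to be agreed with the Wikidata operators.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive edits.

    Hypothetical helper for illustration only; not part of Scholia.
    """

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        # Minimum number of seconds that must pass between two edits.
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous edit, if any

    def wait(self):
        """Block until the next edit is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A batch job would call `throttle.wait()` before every edit, for example with `Throttle(max_per_minute=30)` to stay at one edit every two seconds. The clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.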

No action needed, but something we could report on at some point.

I will also run my code on the key types behind all Scholia aspects.

@egonw changed the title from "topic annotation for articles should be throttled to less than on per second" to "topic annotation for articles should be throttled" on Apr 25, 2019

@egonw

Collaborator Author

commented Apr 25, 2019

@wetneb

commented Apr 25, 2019

I know you have already read this many times, but just for the record: this is just one of the many symptoms of the inadequacy of Wikidata as a host for Wikicite. It's not just about annotating topics: disambiguating authors, adding publication identifiers, adding affiliations… running any of these operations at a significant scale involves editing many items, which happen to be quite large now. At the moment, doing this at 60 edits/minute in this domain is already too much for the servers.

Even assuming that the WMF wins the lottery and gets servers that are 10 times more powerful, allowing you to edit at 600 edits/min, this throughput is still going to be way below what is needed to efficiently maintain a database of articles. In https://dissem.in/ we index more than 100 million papers, and much higher edit rates are needed even just to keep the database in sync with the metadata sources. The orders of magnitude just do not match up.
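To make the mismatch in orders of magnitude concrete, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each of the 100 million papers needs exactly one edit per pass and that edits run around the clock at the stated rate; the function name is made up for this example.

```python
PAPERS = 100_000_000        # corpus size quoted above
MINUTES_PER_DAY = 60 * 24


def days_for_one_pass(items, edits_per_minute):
    """Days needed to touch every item once at a constant edit rate.

    Assumes one edit per item and no downtime (illustrative only).
    """
    return items / edits_per_minute / MINUTES_PER_DAY


for rate in (60, 600):
    days = days_for_one_pass(PAPERS, rate)
    print(f"{rate} edits/min: about {days:.0f} days ({days / 365:.1f} years)")
```

Under these assumptions a single pass takes roughly 1157 days (a bit over three years) at 60 edits/min and roughly 116 days at 600 edits/min.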

I wish Wikicite acknowledged that fully and realized that the current edits in Wikidata are doing more harm than good (I weigh my words), given that they put a significant strain on Wikidata without any hope of reaching a useful state any time soon. It would be great if the roadmap discussion could be taken seriously: please just stop editing in this domain until a solution has emerged from that debate.

@egonw

Collaborator Author

commented Apr 25, 2019

Some other Scholia-related types:

image

@Daniel-Mietchen

Collaborator

commented Apr 25, 2019

Let me just acknowledge that I've seen this.
