topic annotation for articles should be throttled #683
Comments
egonw
assigned
Daniel-Mietchen
Apr 25, 2019
egonw
changed the title
topic annotation for articles should be throttled to less than on per second
topic annotation for articles should be throttled
Apr 25, 2019
Oh, the source code for that plot is now available from this Rmd file: https://github.com/egonw/wikidata-item-size/blob/master/wikidata_item_size.Rmd and as HTML at https://egonw.github.io/wikidata-item-size/wikidata_item_size.html
wetneb
commented
Apr 25, 2019
I know you have already read that many times, but just for the record: this is just one of the many symptoms of the inadequacy of Wikidata to host Wikicite. It's not just about annotating topics: disambiguating authors, adding publication identifiers, adding affiliations… running any of these operations at a significant scale involves editing many items, which happen to be quite large now. At the moment doing this at 60 edits/minute in this domain is already too much for the servers. Even assuming that the WMF wins the lottery and gets servers that are 10 times more powerful, allowing you to edit at 600 edits/min, this throughput is still going to be way below what is needed to efficiently maintain a database of articles. In https://dissem.in/ we index more than 100 million papers and much higher edit rates are needed even just to keep the database in sync with the metadata sources. The orders of magnitude just do not match up. I wish Wikicite acknowledged that fully and realized that the current edits in Wikidata are doing more harm than good (I weigh my words), given that they put a significant strain on Wikidata without any hope of reaching a useful state any time soon. It would be great if the roadmap discussion could be taken seriously: please just stop editing in this domain while no solution has emerged from that debate.
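To make the orders of magnitude in the comment above concrete, here is a rough back-of-the-envelope sketch. It uses only the figures quoted there (60 and 600 edits/minute, 100 million papers) and assumes, purely for illustration, exactly one edit per paper; the function name is hypothetical, and real maintenance would likely need more than one edit per item.

```python
# Figures quoted in the comment above; one edit per paper is an
# illustrative assumption, not a measured workload.
PAPERS = 100_000_000
MINUTES_PER_DAY = 60 * 24


def days_for_one_pass(items: int, edits_per_minute: float) -> float:
    """Days needed to touch every item once at a constant edit rate."""
    return items / edits_per_minute / MINUTES_PER_DAY


if __name__ == "__main__":
    for rate in (60, 600):
        days = days_for_one_pass(PAPERS, rate)
        print(f"{rate} edits/min: about {days:.0f} days for one pass")
```

Under these assumptions a single pass takes roughly 1,157 days (over three years) at 60 edits/minute and roughly 116 days at 600 edits/minute.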
Let me just acknowledge that I've seen this.
egonw commented Apr 25, 2019
@Daniel-Mietchen, this is just a heads up... I've been lurking on #wikidata on IRC for some time now. The WDQS servers have been building up lag quite a few times in the past couple of months. Sjoerd has been looking into the issue, and it seems related to the size of the items being edited. Now, it seems that articles are generally large, or at least in terms of the number of statements:
If I'm not mistaken, your batches have been shut down (see your talk page).
So, here's one scalability issue for Scholia: mass annotation of articles with topics is fairly expensive, and the old WDQS cluster does not handle the data bandwidth well. One issue is that it needs to pass around the full content of the item. So, the problem scales with the size of the item. Andra further told me there seems to be an issue with the max JSON size of an item.
No action needed, but something we could report on at some point.
I will also run my code on the key types behind all Scholia aspects.
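The throttling asked for in the issue title could be sketched roughly as follows. This is a minimal illustration, not Scholia's actual code: `throttled_apply`, the `edit` callback, and the item IDs are hypothetical placeholders, and no real Wikidata API call is made. The only idea shown is to enforce a minimum interval between consecutive edits.

```python
import time
from typing import Callable, Iterable, Optional


def throttled_apply(
    items: Iterable[str],
    edit: Callable[[str], None],
    min_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call edit(item) for each item, with consecutive calls starting
    at least min_interval seconds apart. Returns the number of edits."""
    count = 0
    last_start: Optional[float] = None
    for item in items:
        if last_start is not None:
            elapsed = clock() - last_start
            if elapsed < min_interval:
                # Wait out the remainder of the interval before editing.
                sleep(min_interval - elapsed)
        last_start = clock()
        edit(item)  # placeholder for the real edit operation
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical item IDs; print stands in for an actual edit.
    throttled_apply(["Q1", "Q2", "Q3"], print, min_interval=2.0)
```

The clock and sleep functions are parameters only so the pacing can be checked without real waiting; a fixed interval is the simplest choice, and backing off when the query service reports lag would be a natural refinement.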