Add Datasets section #88
Good idea! The only question I have is whether it’s better to just link to our set of datasets rather than linking to each of them? I mean, we won’t be adding new ones very often so maybe it doesn’t matter?
I'd say it should be sets/groups. Otherwise, I could drop in 30-40 individual AU datasets now, and that'd look really cluttered.
I think it will be a subjective call that depends on the specific situation. I would go with the top-level pointer from which a researcher can find the rest, but the collection of datasets needs to be well described to give an idea of what is inside. However, in cases where finite and distinct datasets with very different purposes are put together, we might want to add them separately to better describe them in the listing.
Thinking out loud: should we use nested bullets to describe individual sets under a bigger set?
I think it should be a separate list, since it would dominate the current list: there is a pretty large number of web-archive-related datasets out there, such as TREC datasets, Common Crawl, LoC, UKWA, AU, and docnow, to name just a few.
In that case, we should list only the top-level items for now. If the list grows beyond a handful, we can branch it off into a separate awesome list and link to it from here.
I have updated the UKWA dataset to point to its top-level page. I am not adding anything else to this PR yet, to respect the one-item-per-PR guideline. I bootstrapped the section with two items to seed it, so that it does not look odd with just a single entry.
Not sold on the section. Blocking until we have consensus. If we do add it, guidelines need to be updated.
I agree!
I have updated the contribution guidelines for datasets. /ping @ruebot and @anjackson
Still not sold on datasets in this list, and haven't heard any other moving arguments for them here.
I feel that listing web-archiving-related datasets for research would be a useful resource, be it here in this list or in a separate list somewhere else. Personally, I think this list would be a good starting point, with the aim of spinning off a separate list if and when it grows beyond a certain size. However, I am okay if others feel otherwise.
ibnesayeed commented Mar 7, 2020
Introduced a new section with Common Crawl and UKWA CDX as initial seeds of the section.