Add Datasets section #88

Open
wants to merge 3 commits into
base: master
from

Contributor

ibnesayeed commented Mar 7, 2020

Introduced a new Datasets section, seeded with Common Crawl and UKWA CDX as its initial entries.

Member

anjackson commented Mar 7, 2020

Good idea! The only question I have is whether it’s better to just link to our set of datasets rather than linking to each of them? I mean, we won’t be adding new ones very often so maybe it doesn’t matter?

Member

ruebot commented Mar 7, 2020

I'd say it should be sets/groups. Otherwise, I could drop in 30-40 individual AU datasets now, and that'd look really cluttered.

Contributor Author

ibnesayeed commented Mar 7, 2020

I think it will be a subjective call that depends on the specific situation. I would go with the top-level pointer from which a researcher can find the rest, but the collection of datasets needs to be described well enough to give an idea of what is inside. However, where a finite number of distinct datasets with very different purposes are grouped together, we might want to add them separately to better describe them in the listing.

Contributor Author

ibnesayeed commented Mar 7, 2020

Thinking out loud, should we use nested bullets to describe individual sets under a bigger set?

Member

ruebot commented Mar 7, 2020

I think it should be a separate list, since it would dominate the current list: there is a pretty large number of web archive related datasets out there, including TREC datasets, Common Crawl, LoC, UKWA, AU, and DocNow, to name just a few.

Contributor Author

ibnesayeed commented Mar 7, 2020

In that case, we should list only the top-level items for now. If the list grows beyond a handful, we can branch it off into a separate awesome list and link to it from here.

Contributor Author

ibnesayeed commented Mar 7, 2020

I have updated the UKWA dataset to point to its top-level page. I am not adding anything else to this PR yet, to respect the one-item-per-PR guideline. I bootstrapped the section with two items to seed it, so that it does not look odd with just a single entry.

Member

ruebot left a comment

Not sold on the section. Blocking until we have consensus.

If we do add it, guidelines need to be updated.

Contributor Author

ibnesayeed commented Mar 8, 2020

> If we do add it, guidelines need to be updated.

I agree!

Contributor Author

ibnesayeed commented Mar 22, 2020

I have updated the contribution guidelines to cover datasets.

/ping @ruebot and @anjackson

Member

ruebot commented Mar 23, 2020

Still not sold on datasets in this list, and haven't heard any other moving arguments for them here.

Contributor Author

ibnesayeed commented Mar 23, 2020

> Still not sold on datasets in this list, and haven't heard any other moving arguments for them here.

I feel that listing web archiving related datasets for research would be a useful resource, be it here in this list or somewhere else in a separate list. Personally, I think this list would be a good starting point, with the aim of spinning off a separate list if and when it grows beyond a certain size. However, I am okay if others feel otherwise.
