Example collection #13
Comments
ruebot added a commit that referenced this issue Mar 3, 2019
ruebot self-assigned this Mar 3, 2019
Hmm. Let's think a bit more on this. Agreed that CPP is a bit too big. The current example data we're using isn't ideal. I think we'd like a small-ish collection with:
I think Victoria might have some ideal candidate collections. I can try to find a cycle to dig through some of the Archive-It pages, but @SamFritz, if you have a moment, do you want to take a look around the UVic Archive-It pages and see if any of them fit those criteria?
Actually, do any of these collections have manageable derivative sizes? (I don't have a UVic collection synced in the cloud right now.) If any of those stand out, I can write to UVic to see if they are interested in being used as "sample data."
The Trans Web:
British Columbia Local Governments:
B.C. Teachers' Labour Dispute (2014):
Trans Web:
OK great, thanks @ruebot. I like BC Teachers Labour Dispute: neat topic, has mostly content from 2014 but also from 2015, a fair number of domains, and different domains that take very divergent perspectives on the issue. Plus it's about the size that we could bundle with the image, knock on wood. @greebie @ruebot @SamFritz, please share any thoughts you have on using this as a sample dataset. If I get thumbs up, I'd like to reach out to UVic.
Once we're in agreement, I'll create a branch for it.
I have the UVic account logged into my cloud account. I think the Teachers Labour Dispute has legs. I like the Trans Web one, but I don't think it has much in terms of years available yet.
ruebot added the enhancement and question labels Mar 4, 2019
Should we have a section in the README like we do in
ianmilligan1 added a commit that referenced this issue Mar 4, 2019
Agreed. I think the BC Teachers Labour Dispute collection would work well. As a runner-up, I would probably select the Trans Web collection (text-wise it's a bit larger).
Perfect, thanks all. I'll send them an e-mail to see if there's interest.
The next Spark job in the queue is for the BC Teachers collection. It should be done later tonight or early tomorrow. I'll create a branch, and we'll see if it works. I think we'll be fine with the GitHub size limits.
Back to the drawing board. We need a collection where all the derivatives are under 100MB.
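A quick way to see which derivative files break that limit is to walk the output directory and list anything at or above the threshold. This is only a sketch: the function name `oversized_files` is made up for illustration, and it assumes "100MB" means 100 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Assumption: "100MB" is read as 100 * 1024 * 1024 bytes.
LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(directory, limit=LIMIT_BYTES):
    """Return (path, size_in_bytes) pairs for files at or above the limit, largest first."""
    hits = []
    for path in Path(directory).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            if size >= limit:
                hits.append((path, size))
    # Sort so the worst offenders come first.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

An empty result for a collection's derivatives directory would mean every file is under the limit.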
Maybe we could just truncate the text? The script will only read the first 2500 lines anyway.
Yeah, I think truncating the text would work here. Trim the text to 35MB or so and just make clear that it’s a sample in the README?
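For what it's worth, trimming to a byte budget without cutting a line in half could look something like the sketch below. The function name `truncate_by_lines` and the default budget are illustrative assumptions, not anything already in the repo; it copies whole lines until the next one would push the output past the budget.

```python
def truncate_by_lines(src, dst, max_bytes=35 * 1024 * 1024):
    """Copy whole lines from src to dst, stopping before max_bytes is exceeded.

    Returns (lines_kept, bytes_written). Files are opened in binary mode so
    the byte count is exact and no text encoding has to be assumed.
    """
    written = 0
    kept = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in fin:
            if written + len(line) > max_bytes:
                break
            fout.write(line)
            written += len(line)
            kept += 1
    return kept, written
```

The returned line count could then go in the README note that says the file is a sample.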
Cool. 43k lines of text from the file is:
greebie closed this Mar 5, 2019
greebie reopened this Mar 5, 2019
Oops. Sorry, I had a comment and then closed the issue instead of deleting it.
ruebot commented Mar 3, 2019
Do we want to use this one? If so, we should probably cite it in the notebook. We normally do Canadian Political Parties and Interest Groups, but those are some big derivatives.