Add DataFrame schemas; resolves #45. #46

Merged
merged 4 commits into from Feb 10, 2020

Conversation

@ruebot
Member

ruebot commented Feb 10, 2020

@ianmilligan1 @lintool if you're good with this path, let me know, and I'll add one for 0.50.0 so we have that.

...I'm guessing there is a bit of prose we can add too. Feel free to comment, or just push to the branch.

@ruebot

Member Author

ruebot commented Feb 10, 2020

@lintool

Member

lintool commented Feb 10, 2020

👍

This is exactly where I would have put it...

@@ -0,0 +1,157 @@
# Archives Unleashed Toolkit DataFrames

@ianmilligan1

ianmilligan1 Feb 10, 2020

Member

Yeah we should add a tiny bit of descriptive text here. I assume this is for advanced users?

Something like:

Below you can find the schema of each DataFrame. For example, if you extract `.all` from WARCs, you will see the fields listed below. Some of the most popular ones include `all` (content, URLs, and file types); `webpages` (full-text content and language); and `webgraph` (hyperlink information).

- `mime_type_tika` (string)
- `content` (string)
- `language` (string)
- `content` (string)

@ianmilligan1

ianmilligan1 Feb 10, 2020

Member

Why does content appear twice? (typo or deliberate?)

@ruebot

ruebot Feb 10, 2020

Author Member

heh. copypasta.
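
A repeated field like this can be caught mechanically before review. The following is a minimal sketch, assuming the schema docs keep the ``- `name` (type)`` bullet format shown in the diff above. The helper name `find_duplicate_fields` and the pattern are hypothetical and not part of the toolkit.

```python
import re
from collections import Counter

# Matches schema bullets of the form: - `field_name` (type)
# (assumed format, taken from the bullets quoted in this review)
FIELD_PATTERN = re.compile(r"^\s*-\s*`([^`]+)`\s*\(([^)]+)\)\s*$")


def find_duplicate_fields(markdown_lines):
    """Return field names that appear more than once, in first-seen order."""
    names = []
    for line in markdown_lines:
        match = FIELD_PATTERN.match(line)
        if match:
            names.append(match.group(1))
    counts = Counter(names)
    # dict.fromkeys de-duplicates while preserving first-seen order
    return [name for name in dict.fromkeys(names) if counts[name] > 1]


# The four bullets flagged in the review comment above
schema = [
    "- `mime_type_tika` (string)",
    "- `content` (string)",
    "- `language` (string)",
    "- `content` (string)",
]
duplicates = find_duplicate_fields(schema)
print(duplicates)  # ['content']
```

Different DataFrames can legitimately share field names, so a check like this would need to run on one schema section at a time rather than on the whole file.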

@@ -0,0 +1,157 @@
# Archives Unleashed Toolkit DataFrames

@ianmilligan1

ianmilligan1 Feb 10, 2020

Member

Add text here too (same as whatever language we coalesce around above).

- `mime_type_tika` (string)
- `content` (string)
- `language` (string)
- `content` (string)

@ianmilligan1

ianmilligan1 Feb 10, 2020

Member

Noting the second content here too.

@ruebot ruebot marked this pull request as ready for review Feb 10, 2020
@ianmilligan1 ianmilligan1 merged commit c7b99dd into master Feb 10, 2020
@ianmilligan1 ianmilligan1 deleted the issue-45 branch Feb 10, 2020