Tools for Misfigured Urls #18
Given discussion in #28, we should include operations for reading XML and HTML as well.
I'll work on HTML and XML parsing.
greebie commented Feb 5, 2019
There is an unresolved issue when parsing for URLs that bleed into regular text (often because of rich text features like tables etc.).
For example,
https://www.example.com/index.html.Beginning_of_following_paragraph
could be resolved by accepting only one period after the URL, except that
https://www.example.com/index.htmlBeginning_of_following_paragraph
would still not be resolved.
I think an easier solution might be to offer some optional cleaning functions for the dataframes that archivr produces, but there could be other ideas.
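As a rough illustration of what such an optional cleaning function might look like, here is a minimal sketch in Python. It is not archivr's API: the name `trim_bleeding_url`, the `KNOWN_EXTENSIONS` list, and the "extension followed by a capital letter" rule are all assumptions made for this example. The heuristic handles both cases above, but it will misfire on legitimate paths that happen to match the same pattern.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of extensions treated as the end of a page path.
KNOWN_EXTENSIONS = ("html", "htm", "php", "pdf")

# Shortest path prefix ending in a known extension, followed by an
# optional period and then an uppercase letter (assumed to start the
# next sentence or paragraph).
_BLEED = re.compile(
    r"(?P<keep>.*?\.(?:" + "|".join(KNOWN_EXTENSIONS) + r"))\.?(?=[A-Z])"
)


def trim_bleeding_url(url: str) -> str:
    """Cut off text that appears to have bled into the URL path."""
    parts = urlsplit(url)
    match = _BLEED.match(parts.path)
    if match is None:
        # No bleed pattern found: leave the URL untouched.
        return url
    # Anything after the bled text (query, fragment) is discarded too,
    # since it cannot be trusted to belong to the URL.
    return urlunsplit((parts.scheme, parts.netloc, match.group("keep"), "", ""))


if __name__ == "__main__":
    for raw in (
        "https://www.example.com/index.html.Beginning_of_following_paragraph",
        "https://www.example.com/index.htmlBeginning_of_following_paragraph",
        "https://www.example.com/index.html",
    ):
        print(trim_bleeding_url(raw))
```

All three inputs print `https://www.example.com/index.html`. Keeping the function optional, as suggested, lets users skip it when their URLs legitimately contain such patterns.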