Tools for Misfigured Urls #18

Open
greebie opened this issue Feb 5, 2019 · 2 comments

@greebie
Collaborator

commented Feb 5, 2019

There is an unresolved issue when parsing URLs that bleed into the surrounding text (often because of rich-text features such as tables).

For example,

https://www.example.com/index.html.Beginning_of_following_paragraph could be resolved by accepting only one period after the URL, except that

https://www.example.com/index.htmlBeginning_of_following_paragraph would still not be resolved.

I think an easier solution might be to offer some optional cleaning functions for the data frames that archivr produces, but there could be other ideas.
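
As a rough illustration of what one such cleaning function could look like, here is a minimal sketch in Python (not archivr code, and not a proposal for its API). It assumes that a URL path ending in a known file extension has "bled" if that extension is immediately followed by a period or an uppercase letter. The function name `trim_bleeding_url` and the extension list are hypothetical, and the heuristic will misfire on legitimate paths such as `report.pdf.html`, which is one reason it would have to stay optional.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of extensions; adjust it for the documents being parsed.
KNOWN_EXTENSIONS = ("html", "htm", "php", "pdf", "aspx", "asp")

# A known extension followed by a period or an uppercase letter is treated
# as the point where the URL ends and the following text begins.
# Longer extensions are tried first so that ".html" wins over ".htm".
_BLEED = re.compile(
    r"(\.(?:%s))(?=\.|[A-Z])"
    % "|".join(sorted(KNOWN_EXTENSIONS, key=len, reverse=True))
)


def trim_bleeding_url(url):
    """Return url with trailing bled text removed, or url unchanged."""
    parts = urlsplit(url)
    match = _BLEED.search(parts.path)
    if match is None:
        # Nothing in the path looks like bled text; leave the URL alone.
        return url
    clean_path = parts.path[: match.end(1)]
    # Anything after the cut (including a query or fragment) is dropped,
    # since it comes after the bled text.
    return urlunsplit((parts.scheme, parts.netloc, clean_path, "", ""))


examples = [
    "https://www.example.com/index.html.Beginning_of_following_paragraph",
    "https://www.example.com/index.htmlBeginning_of_following_paragraph",
    "https://www.example.com/index.html?page=2",
]
for example in examples:
    print(trim_bleeding_url(example))
```

Both examples above reduce to https://www.example.com/index.html, while a URL with an ordinary query string is left as it is because only the path is inspected.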

@greebie

Collaborator Author

commented Oct 17, 2019

Given the discussion in #28, we should include operations for reading XML and HTML as well.

@adam3smith

Contributor

commented Oct 17, 2019

I'll work on HTML and XML parsing.
