Tools for Misfigured Urls #18

greebie · 2019-02-05T18:37:32Z

There is an unresolved issue when parsing for URLs that bleed into regular text (often because of rich-text features such as tables).

For example,

https://www.example.com/index.html.Beginning_of_following_paragraph which could be resolved by accepting only one period after the URL, except that

https://www.example.com/index.htmlBeginning_of_following_paragraph would still not be resolved.

I think an easier solution might be to offer some optional cleaning functions for the data frames that archivr produces, but I'm open to other ideas.
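
To make the idea of an optional cleaning function concrete, here is a rough sketch of one possible heuristic. It is written in Python purely for illustration (archivr itself is an R package), and the function name `trim_bleeding_url` and the extension list are hypothetical, not part of archivr. It assumes the bled-in text starts with a capital letter, optionally preceded by a period, right after a known file extension.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of extensions to treat as the "real" end of a path.
KNOWN_EXTENSIONS = ("html", "htm", "pdf", "php", "aspx")


def trim_bleeding_url(url, extensions=KNOWN_EXTENSIONS):
    """Cut a URL back to its file extension when following text bled into it.

    Heuristic: a known extension followed directly by an optional period
    and a capital letter is taken to mark the start of unrelated text.
    """
    parts = urlsplit(url)
    # Try longer extensions first so "html" is preferred over "htm".
    alternatives = "|".join(
        sorted((re.escape(e) for e in extensions), key=len, reverse=True)
    )
    pattern = re.compile(r"^(.*?\.(?:%s))(?=\.?[A-Z])" % alternatives)
    match = pattern.match(parts.path)
    if match is None:
        # Nothing looks bled-in, so leave the URL untouched.
        return url
    return urlunsplit((parts.scheme, parts.netloc, match.group(1), "", ""))


if __name__ == "__main__":
    print(trim_bleeding_url(
        "https://www.example.com/index.html.Beginning_of_following_paragraph"))
    print(trim_bleeding_url(
        "https://www.example.com/index.htmlBeginning_of_following_paragraph"))
```

A heuristic like this would also trim legitimate paths such as `page.html.Backup`, which is one reason to keep the cleaning step optional rather than part of the default extraction.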

greebie · 2019-10-17T14:51:36Z

Given the discussion in #28, we should include operations for reading XML and HTML as well.

adam3smith · 2019-10-17T15:03:44Z

I'll work on HTML and XML parsing.
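
The thread does not say how the HTML parsing would be done. One possible approach, sketched here in Python for illustration only (archivr is written in R, and `LinkCollector` and `extract_links` are hypothetical names), is to read links from `href` attributes instead of matching URLs in flattened text, which sidesteps the bleeding problem for markup sources.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip relative links, anchors, and empty href values.
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(html_text):
    """Return the links found in anchor tags, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


if __name__ == "__main__":
    sample = (
        '<table><tr><td><a href="https://www.example.com/index.html">link</a></td>'
        "<td>Beginning of following paragraph</td></tr></table>"
    )
    print(extract_links(sample))
```

Because the URL is taken from the attribute value, the adjacent table cell text never becomes part of it. URLs that appear only as plain text inside the markup would still need the text-based extraction.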

adam3smith added the enhancement label Feb 5, 2019

greebie referenced this issue Oct 17, 2019

Use readtext for other extensions #28

Merged

greebie referenced this issue Oct 17, 2019

More Elegant Exit when result returns NULL #29

Open

adam3smith referenced this issue Oct 17, 2019

Separately parse xml and html documents #30

Open

QualitativeDataRepository/archivr