Discuss relaxing URL regex #7

adam3smith · Jan 15, 2019

I might be wrong about this so feel free to tell me so, but I'm wondering if we shouldn't somewhat relax the regex for URLs for the extract functions. In particular, many styleguides tell authors use URLs that include www. without https?, so maybe just allow www\. as an alternate beginning of the URL string?

greebie · Jan 15, 2019

Always challenging to grab urls, because not all urls start with www. We could have something like https://docs.example.com or a non-url with a www.notaurl.

What I suggest is that we include the option for www.etc in addition to https, and maybe create a global for creating a vector of acceptable regexes for parsing. That way you can massage the settings as you realize you need more or less from the documents.

I can't imagine you wanting to archive something from gopher:// but who knows? :)

adam3smith · Jan 15, 2019

right, obviously you can't require www. but I think www.notaurl would be pretty rare (the gh URL parser agrees ;)
So I think having the configurable vector would be great, but relaxing the default to allow URLs to start with either https? or www. makes sense and seems to be what forums parsers do.

greebie · Jan 17, 2019

Found this regex for the solution: https://www.regextester.com/93652 and it will be on the next commit to the PR.

Notice that it's not perfect - it will not capture something like yahoo.com, but much more flexible that the previous regex.

adam3smith · Jan 17, 2019

That looks good -- note that it will catch yahoo.com (there's a ? after the whole https/www monster). We can test & see if that's too relaxed and if so, remove that part. or turn it into something like
(https?\/\/(www\.)?|https:\/\/) which is more restrictive
Talking of monster, please use (https?:\/\/(www\.)?) for the first capturing group instead. No need to make this so long

greebie · Jan 17, 2019

Ooops. Nice job by them. I do need the monster unfortunately, because we want to return the whole url for parsing. The shortened version will only return the prefixes.

adam3smith · Jan 17, 2019

I just mean for the beginning, so the whole thing would be
^(https?:\/\/(www\.)?)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$
That should be equivalent

greebie · Jan 17, 2019

Well, if you think that's bad, remember that R requires a double backslash for escaping regex characters! Argh! :)

greebie referenced this issue Jan 17, 2019
Open
Add save to wayback function(s). (Issue #2) #3

QualitativeDataRepository/archivr

Discuss relaxing URL regex #7

Discuss relaxing URL regex #7

adam3smith commented Jan 15, 2019

This comment has been minimized.

greebie commented Jan 15, 2019

This comment has been minimized.

adam3smith commented Jan 15, 2019

This comment has been minimized.

greebie commented Jan 17, 2019

This comment has been minimized.

adam3smith commented Jan 17, 2019

This comment has been minimized.

greebie commented Jan 17, 2019

This comment has been minimized.

adam3smith commented Jan 17, 2019

This comment has been minimized.

greebie commented Jan 17, 2019

greebie referenced this issue Jan 17, 2019

Add save to wayback function(s). (Issue #2) #3

QualitativeDataRepository/archivr

Join GitHub today

Discuss relaxing URL regex #7

Comments

adam3smith commented Jan 15, 2019

This comment has been minimized.

greebie commented Jan 15, 2019

This comment has been minimized.

adam3smith commented Jan 15, 2019

This comment has been minimized.

greebie commented Jan 17, 2019

This comment has been minimized.

adam3smith commented Jan 17, 2019

This comment has been minimized.

greebie commented Jan 17, 2019

This comment has been minimized.

adam3smith commented Jan 17, 2019

This comment has been minimized.

greebie commented Jan 17, 2019

greebie referenced this issue Jan 17, 2019

Add save to wayback function(s). (Issue #2) #3