New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss relaxing URL regex #7

Open
adam3smith opened this Issue Jan 15, 2019 · 7 comments

Comments

Projects
None yet
2 participants
@adam3smith
Copy link

adam3smith commented Jan 15, 2019

I might be wrong about this so feel free to tell me so, but I'm wondering if we shouldn't somewhat relax the regex for URLs for the extract functions. In particular, many styleguides tell authors use URLs that include www. without https?, so maybe just allow www\. as an alternate beginning of the URL string?

@greebie

This comment has been minimized.

Copy link
Collaborator

greebie commented Jan 15, 2019

Always challenging to grab urls, because not all urls start with www. We could have something like https://docs.example.com or a non-url with a www.notaurl.

What I suggest is that we include the option for www.etc in addition to https, and maybe create a global for creating a vector of acceptable regexes for parsing. That way you can massage the settings as you realize you need more or less from the documents.

I can't imagine you wanting to archive something from gopher:// but who knows? :)

@adam3smith

This comment has been minimized.

Copy link

adam3smith commented Jan 15, 2019

right, obviously you can't require www. but I think www.notaurl would be pretty rare (the gh URL parser agrees ;)
So I think having the configurable vector would be great, but relaxing the default to allow URLs to start with either https? or www. makes sense and seems to be what forums parsers do.

@greebie

This comment has been minimized.

Copy link
Collaborator

greebie commented Jan 17, 2019

Found this regex for the solution: https://www.regextester.com/93652 and it will be on the next commit to the PR.

Notice that it's not perfect - it will not capture something like yahoo.com, but much more flexible that the previous regex.

@adam3smith

This comment has been minimized.

Copy link

adam3smith commented Jan 17, 2019

That looks good -- note that it will catch yahoo.com (there's a ? after the whole https/www monster). We can test & see if that's too relaxed and if so, remove that part. or turn it into something like
(https?\/\/(www\.)?|https:\/\/) which is more restrictive
Talking of monster, please use (https?:\/\/(www\.)?) for the first capturing group instead. No need to make this so long

@greebie

This comment has been minimized.

Copy link
Collaborator

greebie commented Jan 17, 2019

Ooops. Nice job by them. I do need the monster unfortunately, because we want to return the whole url for parsing. The shortened version will only return the prefixes.

@adam3smith

This comment has been minimized.

Copy link

adam3smith commented Jan 17, 2019

I just mean for the beginning, so the whole thing would be
^(https?:\/\/(www\.)?)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$
That should be equivalent

@greebie

This comment has been minimized.

Copy link
Collaborator

greebie commented Jan 17, 2019

Well, if you think that's bad, remember that R requires a double backslash for escaping regex characters! Argh! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment