Discuss relaxing URL regex #7
Comments
It's always challenging to grab URLs, because not all URLs start with www. We could have something like https://docs.example.com, or a non-URL such as www.notaurl. What I suggest is that we accept www. as a beginning in addition to https, and maybe create a global holding a vector of acceptable regexes for parsing. That way you can adjust the settings as you realize you need more or less from the documents. I can't imagine you wanting to archive something from gopher:// but who knows? :)
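The "vector of acceptable regexes" idea could look roughly like the sketch below. It is written in Python for illustration, not in the package's R, and the names `URL_PREFIXES`, `build_url_pattern`, and `extract_urls` are hypothetical rather than anything from the PR. The URL body is matched naively as a run of non-whitespace characters, so trailing punctuation would be captured too.

```python
import re

# Hypothetical global: regex fragments a URL is allowed to start with.
# Edit this list to accept more or fewer kinds of links.
URL_PREFIXES = [r"https?://", r"www\."]


def build_url_pattern(prefixes=None):
    """Compile one regex matching a URL that starts with any accepted prefix."""
    if prefixes is None:
        prefixes = URL_PREFIXES
    alternation = "|".join(prefixes)
    # An accepted prefix, followed by everything up to the next whitespace.
    return re.compile(rf"(?:{alternation})\S+")


def extract_urls(text, prefixes=None):
    """Return every substring of `text` that matches the configured pattern."""
    return build_url_pattern(prefixes).findall(text)
```

Passing a different list, for example `extract_urls(text, prefixes=[r"gopher://"])`, changes what counts as a URL without touching the extraction code.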
adam3smith commented Jan 15, 2019
Right, obviously you can't require www., but I think www.notaurl would be pretty rare (the gh URL parser agrees ;)
Found this regex for the solution: https://www.regextester.com/93652 and it will be in the next commit to the PR. Notice that it's not perfect - it will not capture something like
adam3smith commented Jan 17, 2019
That looks good -- note that it will catch yahoo.com (there's a
Oops. Nice job by them. I do need the monster, unfortunately, because we want to return the whole URL for parsing. The shortened version will only return the prefixes.
adam3smith commented Jan 17, 2019
I just mean for the beginning, so the whole thing would be
Well, if you think that's bad, remember that R requires a double backslash for escaping regex characters! Argh! :)
adam3smith commented Jan 15, 2019
I might be wrong about this, so feel free to tell me so, but I'm wondering if we shouldn't somewhat relax the regex for URLs for the extract functions. In particular, many style guides tell authors to use URLs that include www. without https?, so maybe just allow www\. as an alternate beginning of the URL string?
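To make the proposal concrete, here is a small comparison of a strict and a relaxed pattern. This is a Python sketch for illustration only: `STRICT` and `RELAXED` are hypothetical names, the patterns are deliberately simplified, and neither is the regex the package actually uses.

```python
import re

# Strict: a URL must start with http:// or https://
STRICT = re.compile(r"https?://\S+")

# Relaxed: also accept "www." at a word boundary as the start of a URL.
RELAXED = re.compile(r"(?:https?://|\bwww\.)\S+")

text = "Data at www.example.org/data and https://example.com/paper"

strict_hits = STRICT.findall(text)
relaxed_hits = RELAXED.findall(text)

print(strict_hits)   # only the https:// link
print(relaxed_hits)  # both links
```

The relaxed form trades a few possible false positives (any token that begins with www.) for catching links written without a scheme.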