Finding Hyperlinks within Collection on Pages with Certain Keyword #377
codecov bot commented Nov 12, 2019
Codecov Report
@@            Coverage Diff            @@
##           master    #377     +/-    ##
==========================================
+ Coverage   76.36%   76.37%   +0.01%
==========================================
  Files          40       40
  Lines        1413     1414      +1
  Branches      268      268
==========================================
+ Hits         1079     1080      +1
  Misses        217      217
  Partials      117      117
Looks good!
^^^ @ianmilligan1 is that what you're looking for? If so, I'll squash and merge, and add this to the cookbook section if that makes sense.
Hey @ruebot, is this something that should be encoded in a test case while we're at it?
@lintool we already have a test case for
@ruebot add a separate test case that explicitly includes filtering? Maybe not. I dunno.
Yeah, we could add a new test that add ... and if that's the case, @SinghGursimran, want to update the PR with an updated test?
@ruebot Looks perfect to me, and I love the sample results with "keystone" leading to David Suzuki. Thanks @SinghGursimran, great work.
Should I add a separate test for the ExtractLink UDF with a filter?
@SinghGursimran yeah, why not. Let's go with that.
SinghGursimran commented Nov 12, 2019 (edited)
Extracts hyperlinks within a collection, filtered to pages that contain a particular keyword (case-insensitive), using DataFrames (df).
#238
Returns a CSV file with URL, Domain, crawl_date, and destination_page.
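
For readers who want the logic of the description above without the toolkit, here is a minimal standard-library Python sketch. It is not the code in this PR and does not use the project's DataFrame API; `links_on_keyword_pages`, `extract_domain`, `to_csv`, the lowercase column names, and the sample pages are all hypothetical names and data chosen for illustration. It assumes each page is available as a `(url, crawl_date, html)` tuple and that the keyword is matched against the raw page content.

```python
import csv
import io
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_domain(url):
    """Return the host of a URL, lowercased, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def links_on_keyword_pages(pages, keyword):
    """Return (url, domain, crawl_date, destination_page) rows for every
    link found on pages whose content contains `keyword`, ignoring case.

    `pages` is an iterable of (url, crawl_date, html) tuples (an assumption
    made for this sketch).
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    rows = []
    for url, crawl_date, html in pages:
        if not pattern.search(html):
            continue  # page does not mention the keyword
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the page URL.
            rows.append((url, extract_domain(url), crawl_date, urljoin(url, href)))
    return rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["url", "domain", "crawl_date", "destination_page"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical sample pages, for illustration only.
    sample = [
        ("http://www.example.org/a", "20091219",
         '<p>The Keystone pipeline</p><a href="/b">b</a>'),
        ("http://example.org/c", "20091219",
         '<p>Nothing relevant</p><a href="/d">d</a>'),
    ]
    print(to_csv(links_on_keyword_pages(sample, "keystone")))
```

Filtering before link extraction means pages that never mention the keyword are skipped without being parsed. A test for the filtered case, as discussed in the thread, could assert that links from non-matching pages never appear in the output.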