Dataframe Code Request: Finding Hyperlinks within Collection on Pages with Certain Keyword #238
Comments
@ianmilligan1 do you still need help on this one?
I think the problem here is that we don’t have text as part of our dataframe implementation yet.
- Replace ExtractBaseDomain with ExtractDomain - Closes #367
- Address bug in ArcTest; RemoveHTML -> RemoveHttpHeader - Closes #369
- Wraps RemoveHttpHeader and RemoveHTML for use in data frames. - Partially addresses #238
- Updates tests where necessary
- Punts on #368 UDF CaMeL cASe consistency issues
@SinghGursimran here's one for you.
I have written this script to answer the above query:

```scala
import io.archivesunleashed._
import org.apache.spark.sql.functions.{array, udf}

// Pull the destination URL out of the comma-separated link tuple.
val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df = RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()

// Note: the Domain and link columns here assume the fuller select in the
// follow-up script below.
df.select($"url", $"Domain", $"crawl_date", result(array($"link")).as("destination_page"))
```

It works, but things can be done in a better way if I create a separate function in the code specifically for this query. I am not sure whether that's a good idea, considering there can be many random queries. @ruebot @lintool @ianmilligan1
We have to import:

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._
import io.archivesunleashed.matchbox._

val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/aut/src/test/resources/arc/example.arc.gz", sc).extractValidPagesDF()
  .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"), $"url".as("url"), $"crawl_date", explode_outer(ExtractLinks($"url", $"content")).as("link"))
  .filter($"content".contains("keystone"))

df.select($"url", $"Domain", $"crawl_date", result(array($"link")).as("destination_page"))
  .show()
```

```
// Exiting paste mode, now interpreting.

<console>:31: error: reference to ExtractDomain is ambiguous;
it is imported twice in the same scope by
import io.archivesunleashed.matchbox._
and import io.archivesunleashed.df._
       .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
                               ^
<console>:31: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: String
       .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
                                                                                                               ^
<console>:31: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: String
       .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
```
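The two errors point at different problems: `ExtractDomain` exists in both `io.archivesunleashed.df` and `io.archivesunleashed.matchbox`, so wildcard-importing both packages makes the reference ambiguous, and the matchbox version of `ExtractLinks` takes plain `String`s, so it cannot be applied to `Column`s directly. A minimal sketch of a workaround, assuming we keep only the df-side UDFs and wrap the matchbox function ourselves:

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._  // Column-based UDFs: ExtractDomain, RemovePrefixWWW, ...
import org.apache.spark.sql.functions.udf

// Skip the matchbox wildcard import; wrap the String-based matchbox
// ExtractLinks in a Spark UDF so it can be applied to Columns.
val ExtractLinksUDF = udf(io.archivesunleashed.matchbox.ExtractLinks.apply(_: String, _: String))
```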
I have added another UDF in `aut/src/main/scala/io/archivesunleashed/df/package.scala` (forgot to mention it here):

```scala
val ExtractLinks = udf(io.archivesunleashed.matchbox.ExtractLinks.apply(_: String, _: String))
```

Sorry for the confusion. I will create a pull request.
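With that UDF in the df package, the full script should come together along these lines. This is a sketch against the pre-merge code: the df-side `ExtractLinks` name is the one proposed in the comment above, the filter is moved before the select so it runs on the raw `content` column, and `lower()` is added for the case-insensitive match the original use case asked for:

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._
import org.apache.spark.sql.functions.{array, explode_outer, lower, udf}

// Pull the destination URL out of the (source, destination, anchor) link tuple.
val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df = RecordLoader.loadArchives("src/test/resources/arc/example.arc.gz", sc)
  .extractValidPagesDF()
  // Keep only pages mentioning the keyword, regardless of case.
  .filter(lower($"content").contains("keystone"))
  .select(
    RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),
    $"url",
    $"crawl_date",
    explode_outer(ExtractLinks($"url", $"content")).as("link"))

df.select($"url", $"Domain", $"crawl_date", result(array($"link")).as("destination_page"))
  .show()
```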
* Add example for archivesunleashed/aut#377 / archivesunleashed/aut#238
* review
ianmilligan1 commented May 24, 2018
Use Case
I am interested in finding what pages particular organizations link to when they contain certain keywords.
In this case, I am curious to see where the Liberal Party of Canada (liberal.ca) has linked to from pages that contain the word "Keystone". Do they link to news sources? Do they link to activist groups? Do they link to opposition parties? Let's find out.
Data
In this case we need to do the following:
- Filter down to the pages that contain the keyword Keystone. Let's make it case insensitive (see the sketch after this list).
- Extract the links from the pages that contain Keystone.
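A minimal sketch of that case-insensitive filter, assuming the `content` column and the `df` of valid pages from the scripts earlier in this thread, using Spark's `lower`:

```scala
import org.apache.spark.sql.functions.lower

// Lowercase the page content before matching, so "Keystone", "keystone",
// and "KEYSTONE" all pass the filter.
df.filter(lower($"content").contains("keystone"))
```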
Desired Output
The desired output would be a CSV looking like:
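Something along these lines; the header below is an assumption on my part, inferred from the columns in the scripts earlier in the thread rather than from an original mock-up:

```
crawl_date,domain,url,destination_page
```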
Question
What data frame query can we use to first filter down to pages containing a certain keyword, and then extract just the links from them? Hopefully this is clear, but do let me know if I can clarify further.