Fixing Documentation Errors #76
Conversation
Since we're correcting documentation for both the current 0.80.0 release and the "next" version, we need to update in two places. @ianmilligan1 do you want me to push up something to show how this is done?
...or, you can make the changes yourself. Since we're editing the "next" version and the 0.80.0 version, you just need to make the changes in two places:
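(Concretely, assuming Docusaurus's standard versioning layout, and a hypothetical page name since the actual file isn't named in this thread, the two copies would live at roughly:

docs/filtering-results.md
versioned_docs/version-0.80.0/filtering-results.md

One holds the "next" unreleased docs, the other the 0.80.0 snapshot.)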
@@ -116,12 +116,12 @@ RecordLoader.loadArchives("/path/to/warcs", sc)
 import io.archivesunleashed._
 import io.archivesunleashed.udfs._
-val domains = Array("www.archive.org")
+val domain = Array("www.archive.org")
ruebot (Member) commented Jun 8, 2020
Maybe we should just add another item to the array here. What we're trying to demonstrate is that you can filter for multiple items at once. So why don't we change it to:

val domains = Array("www.archive.org", "geocities.org")
.select($"crawl_date", extractDomain($"url").alias("domains"), $"url", removeHTML(removeHTTPHeader($"content").alias("content"))) | ||
.filter(hasDomains($"domain", lit(domains))) | ||
.select($"crawl_date", extractDomain($"url").alias("domain"), $"url", removeHTML(removeHTTPHeader($"content").alias("content"))) | ||
.filter(hasDomains($"domain", lit(domain))) |
ianmilligan1 (Author, Member) commented Jun 11, 2020
If I change them all to domains, it fails. I think it should be this:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val domains = Array("www.archive.org", "geocities.org")
RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
.webpages()
.select($"crawl_date", extractDomain($"url").alias("domain"), $"url", removeHTML(removeHTTPHeader($"content").alias("content")))
.filter(hasDomains($"domain", lit(domains)))
.take(10)
(which keeps domain and domains distinct in there: domains is the variable holding the values we're filtering for, and domain is the alias for the extracted domain column)
Makes sense - will do!
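To make that naming concrete, here is a minimal sketch of the same pattern on a hypothetical two-row DataFrame standing in for the .webpages() output (it assumes spark-shell with the aut jar loaded, so the $ syntax and the UDFs above are already in scope):

import io.archivesunleashed.udfs._
import org.apache.spark.sql.functions.lit

// `domains` (plural) is a plain Scala array: the values we are filtering FOR.
val domains = Array("www.archive.org", "geocities.org")

// Hypothetical stand-in for the .webpages() DataFrame.
val pages = Seq(
  ("20200601", "http://www.archive.org/details/foo"),
  ("20200601", "http://example.com/bar")
).toDF("crawl_date", "url")

pages
  // "domain" (singular) is the column alias for each row's extracted domain...
  .select($"crawl_date", extractDomain($"url").alias("domain"), $"url")
  // ...which hasDomains then checks against the `domains` array.
  .filter(hasDomains($"domain", lit(domains)))
  .show()

Only the www.archive.org row should survive the filter.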
Two scripts in link-analysis are causing us trouble.

import io.archivesunleashed._
import io.archivesunleashed.udfs._
val urlPattern = Array("(?i)http://www.archive.org/details/.*")
RecordLoader.loadArchives("/path/to/warcs", sc)
.webpages()
.filter($"url", lit(urlPattern))
.select(explode(extractLinks($"url", $"content")).as("links")
.select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
.groupBy("src", "dest")
.count()
.filter($"count" > 5)
.write.csv("details-links-all-df/")

Error message:
And then this other one:

import io.archivesunleashed._
import io.archivesunleashed.udfs._
val urlPattern = Array("http://www.archive.org/details/.*")
RecordLoader.loadArchives("/path/to/warcs", sc)
.webpages()
.filter($"url", lit(urlPattern))
.select(explode(extractLinks($"url", $"content")).as("links"))
.select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
.groupBy("src", "dest")
.count()
.filter($"count" > 5)
.write.csv("sitelinks-details-df/")

leads to this error
My guess is both are actually the same error with that filter. Any thoughts @ruebot?
There's a missing closing parenthesis on the first select statement (and the filter needs to be wrapped in the hasUrlPatterns UDF here too). It should be:

import io.archivesunleashed._
import io.archivesunleashed.udfs._
val urlPattern = Array("(?i)http://www.archive.org/details/.*")
RecordLoader.loadArchives("/path/to/warcs", sc)
.webpages()
.filter(hasUrlPatterns($"url", lit(urlPattern)))
.select(explode(extractLinks($"url", $"content")).as("links"))
.select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
.groupBy("src", "dest")
.count()
.filter($"count" > 5)
.write.csv("details-links-all-df/")
Second one looks like I missed actually putting the UDF (hasUrlPatterns) in. It should be:

import io.archivesunleashed._
import io.archivesunleashed.udfs._
val urlPattern = Array("http://www.archive.org/details/.*")
RecordLoader.loadArchives("/path/to/warcs", sc)
.webpages()
.filter(hasUrlPatterns($"url", lit(urlPattern)))
.select(explode(extractLinks($"url", $"content")).as("links"))
.select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
.groupBy("src", "dest")
.count()
.filter($"count" > 5)
.write.csv("sitelinks-details-df/")
ianmilligan1 commented Jun 4, 2020
Just a draft pull request, both to get the hang of submitting pull requests to our new docusaurus branch and to incorporate feedback from Sarah's comprehensive walkthrough of all our scripts. I'll be adding to this over the next few days.