Fixing Documentation Errors #76
Conversation
Since we're correcting documentation for the current 0.80.0 release and the "next" version, we need to update in two places. @ianmilligan1 do you want me to push up something to show how this is done?
...or, you can make the changes yourself. Since we're editing the "next" version and the 0.80.0 version, you just need to make the changes in two places:
@@ -116,12 +116,12 @@ RecordLoader.loadArchives("/path/to/warcs", sc)
 import io.archivesunleashed._
 import io.archivesunleashed.udfs._
-val domains = Array("www.archive.org")
+val domain = Array("www.archive.org")
ruebot (Member) commented Jun 8, 2020
Maybe we should just add another item to the array here. What we're trying to demonstrate is that you can filter for multiple items at once. So why don't we change it to:
val domains = Array("www.archive.org", "geocities.org")
.select($"crawl_date", extractDomain($"url").alias("domains"), $"url", removeHTML(removeHTTPHeader($"content").alias("content"))) | ||
.filter(hasDomains($"domain", lit(domains))) | ||
.select($"crawl_date", extractDomain($"url").alias("domain"), $"url", removeHTML(removeHTTPHeader($"content").alias("content"))) | ||
.filter(hasDomains($"domain", lit(domain))) |
ianmilligan1 (Author, Member) commented Jun 11, 2020
If I change them all to domains it fails. I think it should be this:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val domains = Array("www.archive.org", "geocities.org")
RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
.webpages()
.select($"crawl_date", extractDomain($"url").alias("domain"), $"url", removeHTML(removeHTTPHeader($"content").alias("content")))
.filter(hasDomains($"domain", lit(domains)))
.take(10)
(which keeps domain and domains distinct in there, one being the variable with what we're looking for, and the other being the alias for the extracted domains)
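Put another way, annotating the relevant lines from the snippet above:

// `domains` is the Scala value holding what we're looking for
val domains = Array("www.archive.org", "geocities.org")

// `domain` is the alias given to the extracted-domain column, which the filter then reads
.select($"crawl_date", extractDomain($"url").alias("domain"), $"url")
.filter(hasDomains($"domain", lit(domains)))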
Makes sense - will do!
Two scripts in link-analysis are causing us trouble.

import io.archivesunleashed._
import io.archivesunleashed.udfs._
val urlPattern = Array("(?i)http://www.archive.org/details/.*")
RecordLoader.loadArchives("/path/to/warcs", sc)
.webpages()
.filter($"url", lit(urlPattern))
.select(explode(extractLinks($"url", $"content")).as("links")
.select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
.groupBy("src", "dest")
.count()
.filter($"count" > 5)
.write.csv("details-links-all-df/")

Error message:
And then this other one.

import io.archivesunleashed._
import io.archivesunleashed.udfs._
val urlPattern = Array("http://www.archive.org/details/.*")
RecordLoader.loadArchives("/path/to/warcs", sc)
.webpages()
.filter($"url", lit(urlPattern))
.select(explode(extractLinks($"url", $"content")).as("links"))
.select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
.groupBy("src", "dest")
.count()
.filter($"count" > 5)
.write.csv("sitelinks-details-df/")

leads to this error
My guess is both are actually the same error with that filter. Any thoughts @ruebot?
There's a missing closing parenthesis on the first select statement (and the filter needs the hasUrlPatterns UDF, same as the second script below). It should be:

import io.archivesunleashed._
import io.archivesunleashed.udfs._
val urlPattern = Array("(?i)http://www.archive.org/details/.*")
RecordLoader.loadArchives("/path/to/warcs", sc)
.webpages()
.filter(hasUrlPatterns($"url", lit(urlPattern)))
.select(explode(extractLinks($"url", $"content")).as("links"))
.select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
.groupBy("src", "dest")
.count()
.filter($"count" > 5)
.write.csv("details-links-all-df/")
Second one looks like I missed actually putting the UDF (hasUrlPatterns) in there. It should be:

import io.archivesunleashed._
import io.archivesunleashed.udfs._
val urlPattern = Array("http://www.archive.org/details/.*")
RecordLoader.loadArchives("/path/to/warcs", sc)
.webpages()
.filter(hasUrlPatterns($"url", lit(urlPattern)))
.select(explode(extractLinks($"url", $"content")).as("links"))
.select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
.groupBy("src", "dest")
.count()
.filter($"count" > 5)
.write.csv("sitelinks-details-df/")
@ianmilligan1 can we get this out of draft and ready to merge? I have some updates for an incoming PR I'd like to get in as well.
@ruebot I can try to do so today, kids' schedule pending (on a call until 1pm).
Another broken one here - any thoughts @ruebot? Trying to map over some other fixes while on a call break.

import io.archivesunleashed._
import io.archivesunleashed.udfs._
val languages = Array("th","de","ht")
RecordLoader.loadArchives("/path/to/warcs",sc)
.webpages()
.select($"language", $"url", $"content")
.filter($"language".isin(languages))

Leads to
And on the same page, this Python script

from aut import *
from pyspark.sql.functions import col
urls = ["www.archive.org"]
WebArchive(sc, sqlContext, "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz") \
.all() \
.select("url", "content") \
.filter(~col("url").isin(urls)

leads to
Let me know if you want to shelve those scripts until the next PR, @ruebot, or have any thoughts - and then I can move stuff over to the
@ianmilligan1 can you create a separate issue, and we can do the same for others as you and Sarah go through the docs? I'd like to get this wrapped up so I can get some docs in, and we have some things here that should be published immediately.
Perfect! Thanks @ianmilligan1!
ianmilligan1 commented Jun 4, 2020
Just a draft pull request to both get the hang of submitting pull requests to our new docusaurus branch and to incorporate feedback from Sarah's comprehensive walkthrough of all our scripts. I'll be adding to this over the next few days.