
Fixing Documentation Errors #76

Draft · wants to merge 5 commits into docusaurus
Conversation

ianmilligan1 (Member) commented Jun 4, 2020

Just a draft pull-request to both get the hang of submitting pull requests to our new docusaurus branch, and also to incorporate feedback from Sarah's comprehensive walkthrough of all our scripts. I'll be adding to this over the next few days.

@ianmilligan1 ianmilligan1 mentioned this pull request Jun 6, 2020
ruebot (Member) commented Jun 8, 2020

Since we're correcting documentation for the current 0.80.0 release, and the "next" version, we need to update in two places. @ianmilligan1 do you want me to push up something to show how this is done?

ruebot (Member) commented Jun 8, 2020

...or, you can make the changes yourself. Since we're editing the "next" version and the 0.80.0 version, you just need to make the changes in two places:

  1. Where you are doing it now, in docs
  2. Additionally, in website/versioned_docs/version-0.80.0
@@ -116,12 +116,12 @@ RecordLoader.loadArchives("/path/to/warcs", sc)
 import io.archivesunleashed._
 import io.archivesunleashed.udfs._
-val domains = Array("www.archive.org")
+val domain = Array("www.archive.org")

ruebot (Member) Jun 8, 2020
Maybe we should just add another item to the array here. What we're trying to demonstrate here is that you can filter for multiple items at once. So, why don't we change it to: val domains = Array("www.archive.org", "geocities.org")
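The multi-item filtering being demonstrated can be sketched without Spark. `hasDomain` below is a hypothetical stand-in for the per-row check that aut's `hasDomains` UDF performs:

```scala
// Plain-Scala sketch of the behaviour hasDomains provides:
// keep a record when its extracted domain appears in the array.
val domains = Array("www.archive.org", "geocities.org")

// Hypothetical stand-in for the per-row predicate.
def hasDomain(domain: String, wanted: Array[String]): Boolean =
  wanted.contains(domain)

// Domains as they might come out of extractDomain($"url"):
val extracted = Seq("www.archive.org", "example.com", "geocities.org")
val kept = extracted.filter(hasDomain(_, domains))
// kept retains both array entries and drops example.com
```

With a single-element array the example still works, but a two-element array makes the "filter for several domains at once" point visible.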

docs/text-analysis.md (resolved review thread)
-  .select($"crawl_date", extractDomain($"url").alias("domains"), $"url", removeHTML(removeHTTPHeader($"content").alias("content")))
-  .filter(hasDomains($"domain", lit(domains)))
+  .select($"crawl_date", extractDomain($"url").alias("domain"), $"url", removeHTML(removeHTTPHeader($"content").alias("content")))
+  .filter(hasDomains($"domain", lit(domain)))

ruebot (Member) Jun 8, 2020

domains

ianmilligan1 (Member, Author) Jun 11, 2020

If I change them all to domains it fails. I think it should be this:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val domains = Array("www.archive.org", "geocities.org")

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .select($"crawl_date", extractDomain($"url").alias("domain"), $"url", removeHTML(removeHTTPHeader($"content").alias("content")))
  .filter(hasDomains($"domain", lit(domains)))
  .take(10)

(which keeps domain and domains distinct in there, one being the variable with what we're looking for, and the other being the alias for the extracted domains)
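The two-names point can be sketched outside Spark as well; `Page` below is a hypothetical stand-in for a row after the select:

```scala
// `domains` is a Scala value: the array we filter against.
// "domain" is a column name: the alias given to extractDomain($"url").
val domains = Array("www.archive.org", "geocities.org")

// Hypothetical stand-in for a row with its aliased "domain" column.
case class Page(crawlDate: String, domain: String, url: String)

val pages = Seq(
  Page("20200611", "www.archive.org", "http://www.archive.org/details/x"),
  Page("20200611", "example.com", "http://example.com/")
)

// hasDomains($"domain", lit(domains)) then tests the column
// against the value, row by row:
val kept = pages.filter(p => domains.contains(p.domain))
```

Renaming either one independently is fine; renaming both to the same spelling is what invites confusion between the alias and the variable.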

ianmilligan1 (Member, Author) commented Jun 8, 2020

> ...or, you can make the changes yourself. Since we're editing the "next" version and the 0.80.0 version, you just need to make the changes in two places:

Makes sense - will do!

ianmilligan1 (Member, Author) commented Jun 11, 2020

Two scripts in link-analysis are causing us trouble.

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("(?i)http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter($"url", lit(urlPattern))
  .select(explode(extractLinks($"url", $"content")).as("links")
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("details-links-all-df/")

Error message:

The pasted code is incomplete!

<pastie>:16: error: ')' expected but '}' found.
}
^

And then this other one.

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter($"url", lit(urlPattern))
  .select(explode(extractLinks($"url", $"content")).as("links"))
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")

leads to this error:

<console>:33: error: overloaded method value filter with alternatives:
  (func: org.apache.spark.api.java.function.FilterFunction[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (func: org.apache.spark.sql.Row => Boolean)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (conditionExpr: String)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (org.apache.spark.sql.ColumnName, org.apache.spark.sql.Column)
         .filter($"url", lit(urlPattern))
          ^

My guess is both are actually the same error with that filter. 🤔

Any thoughts @ruebot?
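For what it's worth, the second failure can be reproduced in spirit without Spark: `Dataset.filter` accepts a single predicate (or a single boolean `Column`), so passing the column and the pattern array as two separate arguments matches none of its overloads. A UDF that folds both into one boolean expression is the usual remedy; `matchesAny` below is a hypothetical stand-in for such a UDF:

```scala
// filterRows, like Spark's Dataset.filter, takes exactly ONE predicate.
def filterRows(rows: Seq[String])(pred: String => Boolean): Seq[String] =
  rows.filter(pred)

val urlPattern = Array("(?i)http://www.archive.org/details/.*")

// Hypothetical stand-in for a UDF that combines the column and the
// patterns into a single boolean, as aut's hasUrlPatterns does.
def matchesAny(url: String, patterns: Array[String]): Boolean =
  patterns.exists(url.matches)

val urls = Seq(
  "http://www.archive.org/details/solidarity",
  "http://www.archive.org/about/"
)

// One argument to the filter, so it type-checks:
val kept = filterRows(urls)(matchesAny(_, urlPattern))
// only the /details/ URL survives
```

Calling `filterRows(urls)($"url", ...)`-style with two arguments would fail to compile here too, which mirrors the "cannot be applied to (ColumnName, Column)" message above.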

ruebot (Member) commented Jun 12, 2020

There's a missing closing parenthesis on the first select statement (and the filter needs the hasUrlPatterns UDF here too). It should be:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("(?i)http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter(hasUrlPatterns($"url", lit(urlPattern)))
  .select(explode(extractLinks($"url", $"content")).as("links"))
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("details-links-all-df/")

ruebot (Member) commented Jun 12, 2020

Second one looks like I missed actually putting the UDF (hasUrlPatterns) in:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter(hasUrlPatterns($"url", lit(urlPattern)))
  .select(explode(extractLinks($"url", $"content")).as("links"))
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")