
Fixing Documentation Errors #76

Merged
merged 9 commits into docusaurus from doc-fixes on Jun 16, 2020

Conversation

ianmilligan1 (Member) commented Jun 4, 2020

Just a draft pull request to get the hang of submitting pull requests to our new docusaurus branch, and to incorporate feedback from Sarah's comprehensive walkthrough of all our scripts. I'll be adding to this over the next few days.

ianmilligan1 mentioned this pull request Jun 6, 2020 (0 of 22 tasks complete)
ruebot (Member) commented Jun 8, 2020

Since we're correcting documentation for the current 0.80.0 release, and the "next" version, we need to update in two places. @ianmilligan1 do you want me to push up something to show how this is done?

ruebot (Member) commented Jun 8, 2020

...or, you can make the changes yourself. Since we're editing the "next" version and the 0.80.0 version, you just need to make the changes in two places:

  1. Where you are doing it now, in docs
  2. Additionally, in website/versioned_docs/version-0.80.0
@@ -116,12 +116,12 @@ RecordLoader.loadArchives("/path/to/warcs", sc)
 import io.archivesunleashed._
 import io.archivesunleashed.udfs._
-val domains = Array("www.archive.org")
+val domain = Array("www.archive.org")

ruebot (Member) commented Jun 8, 2020

Maybe we should just add another item to the array here. What we're trying to demonstrate here is that you can filter for multiple items at once. So, why don't we change it to: val domains = Array("www.archive.org", "geocities.org")

docs/text-analysis.md
-.select($"crawl_date", extractDomain($"url").alias("domains"), $"url", removeHTML(removeHTTPHeader($"content").alias("content")))
-.filter(hasDomains($"domain", lit(domains)))
+.select($"crawl_date", extractDomain($"url").alias("domain"), $"url", removeHTML(removeHTTPHeader($"content").alias("content")))
+.filter(hasDomains($"domain", lit(domain)))

ruebot (Member) commented Jun 8, 2020

domains

ianmilligan1 (Author, Member) commented Jun 11, 2020

If I change them all to domains it fails. I think it should be this:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val domains = Array("www.archive.org", "geocities.org")

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .select($"crawl_date", extractDomain($"url").alias("domain"), $"url", removeHTML(removeHTTPHeader($"content").alias("content")))
  .filter(hasDomains($"domain", lit(domains)))
  .take(10)

(which keeps domain and domains distinct in there, one being the variable with what we're looking for, and the other being the alias for the extracted domains)

ianmilligan1 (Member, Author) commented Jun 8, 2020

> ...or, you can make the changes yourself. Since we're editing the "next" version and the 0.80.0 version, you just need to make the changes in two places:

Makes sense - will do!

ianmilligan1 (Member, Author) commented Jun 11, 2020

Two scripts in link-analysis are causing us trouble.

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("(?i)http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter($"url", lit(urlPattern))
  .select(explode(extractLinks($"url", $"content")).as("links")
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("details-links-all-df/")

Error message:

The pasted code is incomplete!

<pastie>:16: error: ')' expected but '}' found.
}
^

And then this other one.

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter($"url", lit(urlPattern))
  .select(explode(extractLinks($"url", $"content")).as("links"))
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")

leads to this error

<console>:33: error: overloaded method value filter with alternatives:
  (func: org.apache.spark.api.java.function.FilterFunction[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (func: org.apache.spark.sql.Row => Boolean)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (conditionExpr: String)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (org.apache.spark.sql.ColumnName, org.apache.spark.sql.Column)
         .filter($"url", lit(urlPattern))
          ^

My guess is both are actually the same error with that filter. 🤔

Any thoughts @ruebot?

ruebot (Member) commented Jun 12, 2020

There's a missing closing parenthesis on the first select statement. It should be:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("(?i)http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter($"url", lit(urlPattern))
  .select(explode(extractLinks($"url", $"content")).as("links"))
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("details-links-all-df/")

ruebot (Member) commented Jun 12, 2020

Second one looks like I missed actually putting the UDF (hasUrlPatterns) in:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter(hasUrlPatterns($"url", lit(urlPattern)))
  .select(explode(extractLinks($"url", $"content")).as("links"))
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")

ruebot (Member) commented Jun 16, 2020

@ianmilligan1 can we get this out of draft, and ready to merge? I have some updates for an incoming PR I'd like to get in as well.

ianmilligan1 (Member, Author) commented Jun 16, 2020

@ruebot I can try to do so today, kids' schedule pending (on a call until 1pm).

ianmilligan1 (Member, Author) commented Jun 16, 2020

Another broken one here - any thoughts @ruebot? Trying to map over some other fixes while on a call break.

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val languages = Array("th","de","ht")

RecordLoader.loadArchives("/path/to/warcs",sc)
  .webpages()
  .select($"language", $"url", $"content")
  .filter($"language".isin(languages))

Leads to

org.apache.spark.sql.AnalysisException: cannot resolve '(`language` IN ([th,de]))' due to data type mismatch: Arguments must be same type but were: string != array<string>;;
'Filter language#59 IN ([th,de])
+- Project [language#59, url#56, content#60]
   +- LogicalRDD [crawl_date#55, url#56, mime_type_web_server#57, mime_type_tika#58, language#59, content#60], false
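
[Editor's note: a likely cause here, offered as an assumption rather than a confirmed fix, is that Spark's Scala `Column.isin` is a varargs method, so passing an `Array` directly produces exactly this "string != array<string>" mismatch. A sketch of the corrected call, expanding the array with `: _*` (untested here, since it needs a Spark shell with aut loaded):]

```scala
import io.archivesunleashed._
import io.archivesunleashed.udfs._

val languages = Array("th", "de", "ht")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .select($"language", $"url", $"content")
  // Column.isin takes varargs, so splice the Array in with ": _*"
  .filter($"language".isin(languages: _*))
```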

And on the same page, this Python script

from aut import *
from pyspark.sql.functions import col

urls = ["www.archive.org"]

WebArchive(sc, sqlContext, "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz") \
  .all() \
  .select("url", "content") \
  .filter(~col("url").isin(urls)

leads to

  File "<ipython-input-4-e1e43f4bf7e2>", line 5
    WebArchive(sc, sqlContext, "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz")   .all()   .select("url", "content")   .filter(~col("url").isin(urls)
                                                                                                                                                                      ^
SyntaxError: unexpected EOF while parsing
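
[Editor's note: the "unexpected EOF" here points at an unbalanced parenthesis: the final `.filter(...)` call is never closed. A sketch of the balanced version, otherwise unchanged (untested here, since it needs a Spark session with aut loaded):]

```python
from aut import *
from pyspark.sql.functions import col

urls = ["www.archive.org"]

WebArchive(sc, sqlContext, "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz") \
  .all() \
  .select("url", "content") \
  .filter(~col("url").isin(urls))  # closing parenthesis added here
```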

ianmilligan1 (Member, Author) commented Jun 16, 2020

Let me know if you want to shelve those scripts until the next PR, @ruebot, or have any thoughts - and then I can move stuff over to the versioned_docs/version-0.80.0 directory and get it ready for review.

ruebot (Member) commented Jun 16, 2020

@ianmilligan1 can you create a separate issue, and we can do the same for others as you and Sarah go through the docs? I'd like to get this wrapped up so I can get some docs in, and we have some things here that should be published immediately.

ianmilligan1 marked this pull request as ready for review Jun 16, 2020
ianmilligan1 requested a review from ruebot Jun 16, 2020

ruebot (Member) approved these changes Jun 16, 2020, leaving a comment:

Perfect! Thanks @ianmilligan1!

ruebot merged commit d4767a8 into docusaurus Jun 16, 2020
2 checks passed (delivery)
ruebot deleted the doc-fixes branch Jun 16, 2020