Fixing Documentation Errors #76

ianmilligan1 · 2020-06-04T19:39:48Z

Just a draft pull-request to both get the hang of submitting pull requests to our new docusaurus branch, and also to incorporate feedback from Sarah's comprehensive walkthrough of all our scripts. I'll be adding to this over the next few days.


        Typo fixes to scripts


        Removing erroneous import utils that led to error

ruebot · 2020-06-08T12:24:18Z

Since we're correcting documentation for the current 0.80.0 release, and the "next" version, we need to update in two places. @ianmilligan1 do you want me to push up something to show how this is done?

ruebot · 2020-06-08T12:39:53Z

...or, you can make the changes yourself. Since we're editing the "next" version and the 0.80.0 version, you just need to make the changes in two places:

Where you are doing it now, in docs
Additionally in, website/versioned_docs/version-0.80.0

ruebot · 2020-06-08T12:43:16Z

docs/text-analysis.md

@@ -116,12 +116,12 @@ RecordLoader.loadArchives("/path/to/warcs", sc)
 import io.archivesunleashed._
 import io.archivesunleashed.udfs._

-val domains = Array("www.archive.org")
+val domain = Array("www.archive.org")


Maybe we should just add another item to the array here. What we're trying to demonstrate here is that you can filter for multiple items at once. So, why don't we change it to: val domains = Array("www.archive.org", "geocities.org")

docs/text-analysis.md

ruebot · 2020-06-08T12:43:16Z

docs/text-analysis.md

-  .select($"crawl_date", extractDomain($"url").alias("domains"), $"url", removeHTML(removeHTTPHeader($"content").alias("content")))
-  .filter(hasDomains($"domain", lit(domains)))
+  .select($"crawl_date", extractDomain($"url").alias("domain"), $"url", removeHTML(removeHTTPHeader($"content").alias("content")))
+  .filter(hasDomains($"domain", lit(domain)))


If I change them all to domains it fails. I think it should be this:

import io.archivesunleashed._ import io.archivesunleashed.udfs._ val domains = Array("www.archive.org", "geocities.org") RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc) .webpages() .select($"crawl_date", extractDomain($"url").alias("domain"), $"url", removeHTML(removeHTTPHeader($"content").alias("content"))) .filter(hasDomains($"domain", lit(domains))) .take(10)

(which keeps domain and domains distinct in there, one being the variable with what we're looking for, and the other being the alias for the extracted domains)

ianmilligan1 · 2020-06-08T13:34:43Z

...or, you can make the changes yourself. Since we're editing the "next" version and the 0.80.0 version, you just need to make the changes in two places:

Makes sense - will do!


        More fixes and additions


        Review (still in progress)

ianmilligan1 · 2020-06-11T20:32:45Z

Two scripts in link-analysis are causing us trouble.

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("(?i)http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter($"url", lit(urlPattern))
  .select(explode(extractLinks($"url", $"content")).as("links")
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("details-links-all-df/")

Error message:

The pasted code is incomplete!

<pastie>:16: error: ')' expected but '}' found.
}
^

And then this other one.

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter($"url", lit(urlPattern))
  .select(explode(extractLinks($"url", $"content")).as("links"))
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")

leads to this error

<console>:33: error: overloaded method value filter with alternatives:
  (func: org.apache.spark.api.java.function.FilterFunction[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (func: org.apache.spark.sql.Row => Boolean)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (conditionExpr: String)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (org.apache.spark.sql.ColumnName, org.apache.spark.sql.Column)
         .filter($"url", lit(urlPattern))
          ^

My guess is both are actually the same error with that filter. 🤔

Any thoughts @ruebot?

ruebot · 2020-06-12T00:31:05Z

There's a missing closing parenthesis on the first select statement. It should be:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("(?i)http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter($"url", lit(urlPattern))
  .select(explode(extractLinks($"url", $"content")).as("links"))
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("details-links-all-df/")

ruebot · 2020-06-12T00:36:44Z

Second one looks like I missed actually putting the UDF (hasUrlPatterns) in:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("http://www.archive.org/details/.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .filter(hasUrlPatterns($"url", lit(urlPattern)))
  .select(explode(extractLinks($"url", $"content")).as("links"))
  .select(removePrefixWWW(extractDomain(col("links._1"))).as("src"), removePrefixWWW(extractDomain(col("links._2"))).as("dest"))
  .groupBy("src", "dest")
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")


        Fixes two scripts on link-analysis page

ruebot · 2020-06-16T13:33:59Z

@ianmilligan1 can we get this out of draft, and ready to merge? I have some updates for an incoming PR I'd like to get in as well.

ianmilligan1 · 2020-06-16T14:52:29Z

@ruebot I can try to do so today, kids schedule pending (on a call until 1pm).

ianmilligan1 · 2020-06-16T15:09:48Z

Another broken one here - any thoughts @ruebot? Trying to map over some other fixes while on a call break.

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val languages = Array("th","de","ht")

RecordLoader.loadArchives("/path/to/warcs",sc)
  .webpages()
  .select($"language", $"url", $"content")
  .filter($"language".isin(languages))

Leads to

org.apache.spark.sql.AnalysisException: cannot resolve '(`language` IN ([th,de]))' due to data type mismatch: Arguments must be same type but were: string != array<string>;;
'Filter language#59 IN ([th,de])
+- Project [language#59, url#56, content#60]
   +- LogicalRDD [crawl_date#55, url#56, mime_type_web_server#57, mime_type_tika#58, language#59, content#60], false

And on the same page, this Python script

from aut import *
from pyspark.sql.functions import col

urls = ["www.archive.org"]

WebArchive(sc, sqlContext, "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz") \
  .all() \
  .select("url", "content") \
  .filter(~col("url").isin(urls)

leads to

  File "<ipython-input-4-e1e43f4bf7e2>", line 5
    WebArchive(sc, sqlContext, "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz")   .all()   .select("url", "content")   .filter(~col("url").isin(urls)
                                                                                                                                                                      ^
SyntaxError: unexpected EOF while parsing


        One more fix

ianmilligan1 · 2020-06-16T15:26:27Z

Let me know if you want to shelve those scripts until the next PR, @ruebot, or have any thoughts - and then I can move stuff over to the versioned_docs/version-0.80.0 directory and get it ready for review.

ruebot · 2020-06-16T17:14:35Z

@ianmilligan1 can you create a separate issue, and we can do the same for others as you and Sarah go through the docs? I'd like to get this wrapped up so I can get some docs in, and we have some things here that should be published immediately.


        Mapped files over to versioned_docs


        Fixing ids in versioned_docs


        Adding original_ids back in

ruebot

Perfect! Thanks @ianmilligan1!

ianmilligan1 added 2 commits Jun 4, 2020

Typo fixes to scripts

Loading status checks…

f1ca579

Removing erroneous import utils that led to error

Loading status checks…

f72e6fd

ianmilligan1 mentioned this pull request Jun 6, 2020

Copy Editing for Documentation #71

Open

0 of 22 tasks complete

ruebot requested changes Jun 8, 2020

View changes

ianmilligan1 added 2 commits Jun 11, 2020

More fixes and additions

Loading status checks…

7ba85d8

Review (still in progress)

Loading status checks…

b4ee3ed

Fixes two scripts on link-analysis page

Loading status checks…

e88993c

ruebot mentioned this pull request Jun 16, 2020

Add Python implementation of SaveBytes. archivesunleashed/aut#482

Merged

One more fix

Loading status checks…

ff842bd

ianmilligan1 added 3 commits Jun 16, 2020

Mapped files over to versioned_docs

Loading status checks…

deb3928

Fixing ids in versioned_docs

Loading status checks…

dda6585

Adding original_ids back in

Loading status checks…

d877b2a

ianmilligan1 marked this pull request as ready for review Jun 16, 2020

ianmilligan1 requested a review from ruebot Jun 16, 2020

ruebot approved these changes Jun 16, 2020

View changes

ruebot merged commit d4767a8 into docusaurus Jun 16, 2020
2 checks passed

2 checks passed

delivery
Details

delivery
Details

ruebot deleted the doc-fixes branch Jun 16, 2020

archivesunleashed / aut-docs

Fixing Documentation Errors #76

Fixing Documentation Errors #76

ianmilligan1 commented Jun 4, 2020

ruebot commented Jun 8, 2020

ruebot commented Jun 8, 2020

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

ianmilligan1 commented Jun 8, 2020

ianmilligan1 commented Jun 11, 2020

ruebot commented Jun 12, 2020

ruebot commented Jun 12, 2020

ruebot commented Jun 16, 2020

ianmilligan1 commented Jun 16, 2020

ianmilligan1 commented Jun 16, 2020 •

edited

ianmilligan1 commented Jun 16, 2020

ruebot commented Jun 16, 2020

ruebot left a comment

archivesunleashed / aut-docs

Join GitHub today

Fixing Documentation Errors #76

Fixing Documentation Errors #76

Conversation

ianmilligan1 commented Jun 4, 2020

ruebot commented Jun 8, 2020

ruebot commented Jun 8, 2020

This comment has been minimized.

ruebot Jun 8, 2020

This comment has been minimized.

ruebot Jun 8, 2020

This comment has been minimized.

ianmilligan1 Jun 11, 2020

ianmilligan1 commented Jun 8, 2020

ianmilligan1 commented Jun 11, 2020

ruebot commented Jun 12, 2020

ruebot commented Jun 12, 2020

ruebot commented Jun 16, 2020

ianmilligan1 commented Jun 16, 2020

ianmilligan1 commented Jun 16, 2020 • edited

ianmilligan1 commented Jun 16, 2020

ruebot commented Jun 16, 2020

ruebot left a comment

ianmilligan1 commented Jun 16, 2020 •

edited