Add example for archivesunleashed/aut#377 / h… (#20)
ruebot authored and ianmilligan1 committed Nov 12, 2019
1 parent 4f73504 commit 930f8c81ed41d04bb3f6ef603280bdc56148f827
Showing with 70 additions and 1 deletion.
  1. +1 −1 current/cookbook.md
  2. +69 −0 current/link-analysis.md
@@ -164,4 +164,4 @@ val df_txt = RecordLoader.loadArchives("/path/to/warcs/*", sc).extractTextFilesD
val res_txt = df_txt.select($"bytes", $"extension").saveToDisk("bytes", "/path/to/binaries/text/collection-prefix-text", "extension")
sys.exit
```
@@ -310,3 +310,72 @@ TODO
### Python DF

TODO

## Finding Hyperlinks within a Collection on Pages with a Certain Keyword

The following script extracts a DataFrame with the columns `domain`, `url`, `crawl date`, `origin page`, and `destination page` for pages whose full-text content contains the search term `keystone`. The example uses the sample data in [`aut-resources`](https://github.com/archivesunleashed/aut-resources/tree/master/Sample-Data).
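Before the toolkit-specific examples, the underlying idea can be sketched in plain Python (a standalone illustration, not the aut API): keep only pages whose full text contains the keyword, then collect `(origin, destination)` pairs from the anchors on those pages. The regex-based link extraction here is a simplification of what `ExtractLinks` does.

```python
import re

# Standalone sketch (plain Python, not aut): given (url, content) pairs,
# keep pages whose full text contains the keyword, then collect
# (origin, destination) pairs from the href attributes on those pages.
def links_on_matching_pages(pages, keyword):
    href = re.compile(r'href="([^"]+)"')
    return [(url, dest)
            for url, content in pages
            if keyword.lower() in content.lower()
            for dest in href.findall(content)]

pages = [
    ("http://example.org/a", '<a href="http://example.org/b">keystone pipeline</a>'),
    ("http://example.org/c", '<a href="http://example.org/d">unrelated</a>'),
]
links_on_matching_pages(pages, "keystone")
# [('http://example.org/a', 'http://example.org/b')]
```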

### Scala RDD

TODO

### Scala DF
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._
import org.apache.spark.sql.functions._

// Pull the second field (the destination URL) out of the string form of
// the link tuple.
val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df = RecordLoader
  .loadArchives("Sample-Data/*gz", sc)
  .extractValidPagesDF()
  .select(
    RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),
    $"url".as("url"),
    $"crawl_date",
    explode_outer(ExtractLinks($"url", $"content")).as("link"))
  .filter($"content".contains("keystone"))

df
  .select($"url", $"Domain", $"crawl_date", result(array($"link")).as("destination_page"))
  .show()
```

The script above produces:

```
+--------------------+---------------+----------+--------------------+
|                 url|         Domain|crawl_date|    destination_page|
+--------------------+---------------+----------+--------------------+
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
+--------------------+---------------+----------+--------------------+
only showing top 20 rows
```
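The `result` UDF above works on the string form of the exploded link value. As a standalone illustration (an assumption about the tuple's printed form, not aut internals): if a link prints as a comma-separated tuple string such as `(origin,destination,anchor text)`, splitting on commas and taking index 1 recovers the destination.

```python
# Standalone illustration of the `result` UDF's logic (assumption: the link
# value prints as "(origin,destination,anchor)", as Scala tuples do).
def second_field(s):
    # Split the tuple string on commas; index 1 is the destination field.
    return s.split(",")[1]

second_field("(http://example.org/a,http://example.org/b,some anchor)")
# 'http://example.org/b'
```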

### Python DF

TODO
