The following Spark script generates plain text renderings for all the web pages in a collection with a URL matching a regular expression pattern. In the example case, it will go through the collection and find all of the URLs beginning with `http://geocities.com/EnchantedForest/`. The `(?i)` makes this query case insensitive.
There is also `discardContent`, which does the opposite: use it to filter out pages containing a frequent keyword you are not interested in.
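As a sketch of what such a script might look like, assuming example input and output paths and a placeholder keyword pattern (both hypothetical, not from the original):

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// A sketch with example paths; the (?i) prefix makes the URL pattern
// case insensitive.
val plainText = RecordLoader.loadArchives("/path/to/many/warcs/*.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("(?i)http://geocities\\.com/EnchantedForest/.*".r))
  .discardContent(Set("unwanted-keyword".r)) // drop pages matching an unwanted keyword
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))

plainText.saveAsTextFile("plain-text-enchantedforest/")
```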
### Extraction of a Site Link Structure, organized by URL pattern
In the following example, we run the same script but only extract links coming from URLs matching the pattern `http://geocities.com/EnchantedForest/.*`. We do so by using the `keepUrlPatterns` command.
```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.util._

// Keep only pages whose URL matches the EnchantedForest pattern,
// then extract the (source domain, destination domain) link structure.
val links = RecordLoader.loadArchives("/path/to/many/warcs/*.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("(?i)http://geocities\\.com/EnchantedForest/.*".r))
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1), ExtractDomain(f._2))))
  .countItems()

links.saveAsTextFile("links-enchantedforest/") // example output path
```