Regex error in Collection Analysis #82

Closed
ianmilligan1 opened this issue Jun 17, 2020 · 1 comment

@ianmilligan1 ianmilligan1 commented Jun 17, 2020

This script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("""http://[^/]+/[^/]+/""".r)

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .select($"url")
  .filter(hasUrlPatterns($"url", lit(urlPattern)))
  .show(10, false)

Fails with:

org.apache.spark.sql.AnalysisException: Unsupported component type class scala.util.matching.Regex in arrays;
  at org.apache.spark.sql.catalyst.expressions.Literal$.componentTypeToDataType(literals.scala:129)
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:80)
  at org.apache.spark.sql.catalyst.expressions.Literal$.$anonfun$create$2(literals.scala:148)
  at scala.util.Failure.getOrElse(Try.scala:222)
  at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:148)
  at org.apache.spark.sql.functions$.typedLit(functions.scala:132)
  at org.apache.spark.sql.functions$.lit(functions.scala:115)
  ... 77 elided

@ruebot ruebot commented Jun 18, 2020

Drop the .r. urlPattern is being turned into a regex twice, since hasUrlPatterns already converts the pattern strings itself: https://github.com/archivesunleashed/aut/blob/1ac97ef4981ebd31dd14e4ed08101eb22da15bbd/src/main/scala/io/archivesunleashed/udfs/package.scala#L64-L70

val urlPattern = Array("http://[^/]+/[^/]+/") works for me.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("http://[^/]+/[^/]+/")

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .webpages()
  .select($"url")
  .filter(hasUrlPatterns($"url", lit(urlPattern)))
  .show(10, false)

// Exiting paste mode, now interpreting.

+----------------------------------------------+                                
|url                                           |
+----------------------------------------------+
|http://geocities.com/rogersnyder71/           |
|http://www.geocities.com/rinkisydanystava/    |
|http://www.geocities.com/justinchan_chancheuk/|
|http://geocities.com/jerrysimmons94/          |
|http://geocities.com/dungeonofdiv0/           |
|http://geocities.com/duo2sakura/              |
|http://geocities.com/uclaphonestudy/          |
|http://geocities.com/dweebyone/               |
|http://geocities.com/ucs2222/                 |
|http://geocities.com/ukwriting/               |
+----------------------------------------------+
only showing top 10 rows

import io.archivesunleashed._
import io.archivesunleashed.udfs._
urlPattern: Array[String] = Array(http://[^/]+/[^/]+/)
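
For anyone who already has compiled Regex values lying around, here is a rough, Spark-free sketch of the idea. Everything in it is hypothetical: UrlPatternSketch, toPatternStrings and matchesAny are names made up for illustration, and the full-string match is only an assumption about how the UDF behaves, not a copy of the aut code. It shows that the pattern strings can be recovered from Regex objects with the standard .regex accessor, and that plain strings are enough to do the matching.

```scala
import scala.util.matching.Regex

// Hypothetical helpers for illustration only; this is not the aut implementation.
object UrlPatternSketch {
  // Recover the source strings from already-compiled regexes, so that a
  // plain Seq[String] can be passed on instead of Seq[Regex].
  def toPatternStrings(regexes: Seq[Regex]): Seq[String] =
    regexes.map(_.regex)

  // Assumed behaviour: a URL passes if any pattern matches the whole string.
  // Each pattern string is compiled here, so callers never call .r themselves.
  def matchesAny(url: String, patterns: Seq[String]): Boolean =
    patterns.exists(p => p.r.pattern.matcher(url).matches())
}
```

With something like this, Array("""http://[^/]+/[^/]+/""".r) can be mapped back to strings before it is handed to lit(), which avoids the AnalysisException because the array then holds String elements rather than Regex.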
ianmilligan1 added a commit that referenced this issue Jun 18, 2020
@ruebot ruebot closed this in 11f49f4 Jun 18, 2020