Broken Scripts on Filter-DF #77

Open
ianmilligan1 opened this issue Jun 16, 2020 · 1 comment
ianmilligan1 commented Jun 16, 2020

Two broken scripts.

This one:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val languages = Array("th","de","ht")

RecordLoader.loadArchives("/path/to/warcs",sc)
  .webpages()
  .select($"language", $"url", $"content")
  .filter($"language".isin(languages))

Leads to

org.apache.spark.sql.AnalysisException: cannot resolve '(`language` IN ([th,de]))' due to data type mismatch: Arguments must be same type but were: string != array<string>;;
'Filter language#59 IN ([th,de])
+- Project [language#59, url#56, content#60]
   +- LogicalRDD [crawl_date#55, url#56, mime_type_web_server#57, mime_type_tika#58, language#59, content#60], false

And on the same page, this Python script

from aut import *
from pyspark.sql.functions import col

urls = ["www.archive.org"]

WebArchive(sc, sqlContext, "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz") \
  .all() \
  .select("url", "content") \
  .filter(~col("url").isin(urls)

leads to

  File "<ipython-input-4-e1e43f4bf7e2>", line 5
    WebArchive(sc, sqlContext, "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz")   .all()   .select("url", "content")   .filter(~col("url").isin(urls)
                                                                                                                                                                      ^
SyntaxError: unexpected EOF while parsing

ruebot commented Jun 17, 2020

The first script should be using hasLanguages; .filter(hasLanguages($"language", lit(languages))) should do it.
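
For context on the error message itself: Scala's Column.isin takes a variable number of arguments, so passing the whole array hands it a single array-valued argument, which is why Spark reports string != array<string>. The toy function below is a pure-Python model of that behaviour, not Spark code; isin_varargs is a hypothetical name used only for illustration. In Scala the analogous expansion would be isin(languages: _*). PySpark's own isin also accepts a list directly, which is why the second script does not hit this particular error.

```python
def isin_varargs(value, *candidates):
    """Toy model of a varargs membership test (hypothetical, not Spark's API)."""
    return any(value == candidate for candidate in candidates)


languages = ["th", "de", "ht"]

# Passing the list as ONE argument compares a string against a whole list,
# which mirrors the "string != array<string>" mismatch in the trace.
whole_list = isin_varargs("de", languages)

# Unpacking the list compares the string against each element in turn.
unpacked = isin_varargs("de", *languages)

print(whole_list, unpacked)
```

Only the unpacked call finds a match, because only then is each language compared as a string.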

The second script is missing a closing parenthesis at the end of the filter line.

.filter(~col("url").isin(urls))
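
A quick way to confirm a diagnosis like this without a Spark session is to ask Python to parse the statement only. The sketch below assumes a shortened stand-in for the chain (df, col, and urls are placeholders that are never evaluated) and a hypothetical helper named parses; it checks syntax only, not whether the filter does what is intended.

```python
import ast

# Shortened stand-in for the chained call; the names are never evaluated.
broken = 'df.select("url", "content").filter(~col("url").isin(urls)'
fixed = broken + ")"


def parses(source):
    """Return True if the source is syntactically valid Python (hypothetical helper)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


broken_ok = parses(broken)  # unbalanced parenthesis
fixed_ok = parses(fixed)    # closing parenthesis restored

print(broken_ok, fixed_ok)
```

Wrapping a long chain in a single pair of parentheses instead of using trailing backslashes can also make an unbalanced bracket easier to spot.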

ianmilligan1 added a commit that referenced this issue Jun 17, 2020