Update PlainTextExtractor to just extract text #452
Closed
Yes, that's a great idea @ruebot - I think that's more in the spirit of its name, and you could imagine using it in a pipeline through to text analysis better than |
ruebot added a commit that referenced this issue Apr 21, 2020
ianmilligan1 pushed a commit that referenced this issue Apr 22, 2020
ruebot commented Apr 21, 2020
Currently there is a fair bit of overlap between the PlainTextExtractor and WebPagesExtractor. Really, the only difference between them now is the name of the content/text column, and WebPagesExtractor has some additional columns.
I propose that PlainTextExtractor moves to something that is more in the spirit of its name. It should run RemoveHTMLDF, RemoveHTTPHeaderDF, a DataFrame version of ExtractBoilerpipeTextRDD, and output a single column (csv or parquet), or possibly a single text file.
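To make the proposed shape concrete, here is a rough, standalone sketch of the "strip HTTP headers, strip HTML, emit one column" flow. It is plain Python with the standard library only and is not the toolkit's code: the function names `remove_http_header`, `remove_html`, and `to_single_column_csv`, and the `content` column name, are all hypothetical stand-ins chosen for illustration. The boilerplate-removal step (the Boilerpipe part of the proposal) is left out because the standard library has no equivalent.

```python
import csv
import io
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def remove_http_header(raw):
    """Hypothetical stand-in: drop everything up to the first blank line
    when the record starts with an HTTP status line."""
    if not raw.startswith("HTTP/"):
        return raw
    pieces = re.split(r"\r?\n\r?\n", raw, maxsplit=1)
    return pieces[1] if len(pieces) == 2 else ""


def remove_html(html):
    """Hypothetical stand-in: keep only visible text, whitespace-normalized."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


def to_single_column_csv(texts, column="content"):
    """Write the extracted text as a single-column CSV (column name assumed)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column])
    for text in texts:
        writer.writerow([text])
    return buffer.getvalue()


def extract_plain_text(records):
    """Apply header removal, then HTML removal, to each raw record."""
    return [remove_html(remove_http_header(r)) for r in records]
```

Something like `to_single_column_csv(extract_plain_text(records))` would then give the single text column described above; writing the same list to a text file, one record per line, would cover the "single text file" option.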