Update PlainTextExtractor to just extract text #452

ruebot · 2020-04-21T21:33:08Z

Currently there is a fair bit of overlap between the PlainTextExtractor and WebPagesExtractor. Really, the only difference between them now is the name of the content/text column, and WebPagesExtractor has some additional columns.

I propose that PlainTextExtractor moves to something that is more in the spirit of its name. It should run RemoveHTMLDF, RemoveHTTPHeaderDF, a DataFrame version of ExtractBoilerpipeTextRDD, and output a single column (csv or parquet), or possibly a single text file.
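To make the proposed shape concrete, here is a minimal sketch of the header-stripping and HTML-stripping steps feeding a single-column output. It is not the aut implementation: it uses only the Python standard library as a stand-in, all function names are hypothetical, and the boilerplate-removal step is left out because Boilerpipe has no standard-library equivalent. No header row is written, since the thread does not settle the column name.

```python
import csv
import io
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def remove_http_header(content):
    """Hypothetical stand-in for the header-removal step.

    If the record starts with an HTTP status line, drop everything up to
    and including the first blank line; otherwise return it unchanged.
    """
    if not content.startswith("HTTP/"):
        return content
    hits = [(content.find(sep), len(sep)) for sep in ("\r\n\r\n", "\n\n")]
    hits = [(pos, size) for pos, size in hits if pos != -1]
    if not hits:
        return ""  # headers only, no body
    pos, size = min(hits)  # earliest blank line wins
    return content[pos + size:]


def remove_html(content):
    """Hypothetical stand-in for the HTML-removal step: keep text nodes only."""
    collector = _TextCollector()
    collector.feed(content)
    collector.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join(" ".join(collector.parts).split())


def extract_plain_text(contents):
    """Apply both steps to each raw record and return one string per record."""
    return [remove_html(remove_http_header(c)) for c in contents]


def to_single_column_csv(texts):
    """Serialize the extracted text as a one-column CSV with no header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for text in texts:
        writer.writerow([text])
    return buf.getvalue()
```

Keeping each step as its own function mirrors how the proposal chains separate operations over the content, so a boilerplate-removal step could be slotted in between `remove_html` and the CSV writer without touching the others.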

ianmilligan1 · 2020-04-21T21:40:13Z

Yes, that's a great idea @ruebot. I think that's more in the spirit of its name, and you could imagine it fitting into a pipeline through to text analysis better than WebPagesExtractor does.


Update PlainTextExtractor to output a single column; text.

- Resolves #452
- PlainTextExtractor runs RemoveHTML, and ExtractBoilerplate on `content`
- Update test

Update PlainTextExtractor to output a single column; text. (#453)

- Resolves #452
- PlainTextExtractor runs ExtractBoilerplate on `content`
- Update test

ruebot added enhancement Scala DataFrames App labels Apr 21, 2020

ruebot self-assigned this Apr 21, 2020

ruebot mentioned this issue Apr 21, 2020

Update PlainTextExtractor to output a single column; text. #453

Merged

ianmilligan1 closed this in #453 Apr 22, 2020

archivesunleashed / aut
