.webpages() additional tokenized columns? #402

Closed
ruebot opened this issue Jan 9, 2020 · 14 comments
ruebot (Member) commented Jan 9, 2020

Currently .webpages() creates a DataFrame with the following columns:

  • crawl_date
  • url
  • mime_type_web_server
  • mime_type_tika
  • content

[Screenshot: the .webpages() DataFrame columns]

The content column is the full text of the page, with HTTP headers and HTML removed.

In experimenting with full-text analysis in the parquet_text_analyis.ipynb notebook, we add some additional columns via NLTK: tokenized words and a tokenized-text word count.

[Screenshot: the notebook's added tokenized words and word count columns]
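Roughly, the post-hoc approach looks something like this (a sketch, not the exact notebook code; the Parquet path and column names are placeholders):

import nltk
import pandas as pd

nltk.download("punkt")  # Punkt tokenizer models for nltk.word_tokenize

# Placeholder path to the .webpages() Parquet derivative
pages = pd.read_parquet("webpages.parquet")
pages["tokenized_text"] = pages["content"].apply(nltk.word_tokenize)
pages["tokenized_count"] = pages["tokenized_text"].apply(len)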

The tokenization process is pretty intensive. It takes around 30 minutes to complete in the example notebook, with the banq dataset. It also nearly exhausts the ~25G of RAM that is allotted via Colab.

So, instead of doing this after the fact, why don't we consider doing it upfront in .webpages()? Spark MLlib has a tokenizer, and there are a few other options. Since I'm not a text analysis expert, and tossing in new columns would just be stabbing in the dark, let's get a sense of what would actually be useful.
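For a rough sense of what doing it upfront could look like (sketch only; PySpark shown here for illustration, and the DataFrame name is a placeholder):

from pyspark.ml.feature import RegexTokenizer

# 'webpages' stands in for the DataFrame returned by .webpages()
tokenizer = RegexTokenizer(inputCol="content", outputCol="tokens", pattern="\\W+")
webpages_with_tokens = tokenizer.transform(webpages)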

Check this out, and let us know what else would be useful out of the box in the .webpages() DataFrame.

ruebot added the discussion label Jan 9, 2020
ruebot self-assigned this Jan 9, 2020
ruebot (Member, Author) commented Jan 9, 2020

Might be worth giving Spark NLP a more exhaustive look again.

ruebot (Member, Author) commented Jan 9, 2020

We should probably add a language column, and we can do that pretty easily with DetectLanguage. Then, we could use that for tokenization, as Yves rightfully calls out.
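For illustration, something along these lines on the Python side (the langdetect package is used here only as a stand-in for DetectLanguage; column names are placeholders):

from langdetect import detect
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def detect_language(text):
    # langdetect raises on empty or undetectable text, so fall back gracefully
    try:
        return detect(text)
    except Exception:
        return "unknown"

language_udf = udf(detect_language, StringType())
webpages = webpages.withColumn("language", language_udf("content"))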

lintool (Member) commented Jan 9, 2020

I don't think we should include tokenizations as an additional column. My general thinking is to be as conservative as possible - unless scholars are clamoring for a pre-generated field, don't include it. Otherwise, the derivatives will just become larger and larger and more unwieldy over time.

lintool (Member) commented Jan 9, 2020

Another reason against - there is no such thing as a "canonical" tokenization. Every tokenizer behaves differently... so unless a scholar happens to want exactly your tokenization, it's not going to be useful...

ruebot changed the title from ".webpages() addition tokenized columns" to ".webpages() additional tokenized columns?" Jan 9, 2020
SinghGursimran (Collaborator) commented Jan 9, 2020

To reduce the time required for tokenization, if the scholar can set up a distributed environment, we can add a guide for text analysis in PySpark. Instead of plain Python, where we convert to a pandas DataFrame, we can use a PySpark DataFrame and perform the analytics on it using MLlib.
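Something like this, as a sketch (the path and column names are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("webpages-text-analysis").getOrCreate()

# Read the .webpages() Parquet derivative straight into a Spark DataFrame,
# instead of collecting to pandas
webpages = spark.read.parquet("/path/to/webpages-derivative")
webpages = webpages.withColumn("word_count", F.size(F.split(F.col("content"), "\\s+")))
webpages.select("url", "word_count").show(10)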

lintool (Member) commented Jan 9, 2020

For basic NLP, spaCy https://spacy.io/ has become the go-to toolkit... try it from the Python end?
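For example, a blank spaCy pipeline tokenizes without loading any statistical models, which keeps memory and runtime down (sketch only; 'pages' is a placeholder pandas DataFrame):

import spacy

nlp = spacy.blank("en")  # tokenizer only, no tagger/parser/NER models
pages["tokenized_text"] = pages["content"].apply(lambda text: [t.text for t in nlp(text)])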

ruebot (Member, Author) commented Jan 10, 2020

spacy.io looks really promising. The memory footprint appears to be a lot smaller than NLTK's. But I'm now well over an hour into executing tokenization on a DataFrame, and the NLTK option takes ~30 minutes.

Overall, I'm just trying to find some balance between the valid issues @lintool raises and the reality of taking a derivative from .webpages() and being stuck with a seemingly endless spinning wheel. Or, rephrased: the balance between those who have the capability and know-how to run aut as a library, and those who just want to take the derivative output and continue their research in a notebook on their laptop.

[Screenshot: the tokenization run still in progress in Colab]

Definitely lots of food for thought. Hopefully, we get some good feedback from researchers looking to use our derivative output for text analysis.

...
...
...

Another option could be to just create an .enhancedWebpages() function? 🤷‍♂

lintool (Member) commented Jan 10, 2020

-1 on .enhancedWebpages()

I think this is a good potential scenario for the derivatives of derivatives idea we discussed with Raymie.

organisciak commented Jan 10, 2020

SpaCy is especially costly, but you can turn off certain modules, e.g.

doc = nlp(text, disable=['tagger', 'parser', 'ner'])

I agree with @lintool in spirit, that different scholars may want different tokenization approaches, but I very much believe in giving them something - it lowers the barrier to access, and those with different needs can re-tokenize.

I assume Colab only gives you one process? If you have a multi-core machine you can swap out pandas for dask, and then apply will use multi-processing or multi-threading (depending on settings). I vaguely recall this not being too useful with this exact use case (SpaCy) because too much of the processing was locked from parallelizing, but I don't recall where I formed that impression! Maybe worth trying?

When using apply, I expect it would be quicker to do it just on the Series comprising the column you care about. I'm not sure if it's trivially or notably faster, but instead of pages.apply(lambda row: tokenize(row.content)), trying pages.content.apply(lambda txt: tokenize(txt)) may help, since you're not passing extra data around.

Another thing that might be worth considering is a dumb tokenize function, again in the spirit of giving something basically useful if not perfect. e.g. splitting on whitespace: pages['dumb_tokens'] = pages.content.str.split().
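The dask swap is roughly this (sketch; the partition count is arbitrary, and tokenize is whatever tokenizer function you're already applying):

import dask.dataframe as dd

# Split the pandas DataFrame into partitions so apply can run in parallel
ddf = dd.from_pandas(pages, npartitions=8)
ddf["tokens"] = ddf["content"].apply(tokenize, meta=("tokens", "object"))
pages = ddf.compute(scheduler="processes")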

organisciak commented Jan 10, 2020

By the way, I see that you use parquet. @bmschmidt smartly pointed out (massivetexts/htrc-feature-reader#8) that when you have repeating values in columns, like your mime type and crawl date columns, the order in which you sort the columns affects the compression size notably - even when using snappy compression, which ostensibly favors speed over compression factor. Hot tip :)
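In pandas terms, the idea is roughly this (sketch; column names taken from the .webpages() schema above):

# Sort on the highly repetitive columns first so runs of identical values
# compress better, even under snappy
pages.sort_values(["mime_type_web_server", "crawl_date"]).to_parquet(
    "webpages-sorted.parquet", compression="snappy", index=False
)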

ruebot (Member, Author) commented Jan 12, 2020

@organisciak I tried pages.content.apply, and the time difference between the two methods was negligible; both took just over 26 minutes. I'm looping back around to using PySpark and MLlib in the sample text-analysis notebook. That said, it might be useful to chat sometime. I'm curious what your experiences are with HTRC data; it'd be useful to compare them with our experiences working with TBs of web archive data.

@lintool, et al., quick testing with PySpark and MLlib in Colab seems to be moving a lot quicker than plain pandas and NLTK. If researchers are going to use the CSV or Parquet output of .webpages(), I see the rationale for not including tokenized text, since we'd be assuming too much. I really see my naivety in writing up the issue now. The feedback here and on Twitter has been really great!

As I'm hacking on this notebook and thinking about the feedback: if a consumer of the output of .webpages() wants to go down the tokenization path, would it be helpful to give them one more column, the output of DetectLanguage? That way they'd at least have a decent idea of what the language is for a given row, and could run tokenization, or anything similar, based on it.
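i.e., a consumer could do something like this (sketch; column names and the language code are placeholders):

import nltk

french_pages = pages[pages["language"] == "fr"].copy()
french_pages["tokens"] = french_pages["content"].apply(
    lambda text: nltk.word_tokenize(text, language="french")
)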

All this for archivesunleashed/notebooks#4 @ianmilligan1 😆

lintool (Member) commented Jan 12, 2020

+1 on language id

ruebot added a commit that referenced this issue Jan 12, 2020
- Addresses #402
ianmilligan1 added a commit that referenced this issue Jan 12, 2020
- Addresses #402
ruebot (Member, Author) commented Jan 23, 2020

Seeing no more discussion, I'll mark this as resolved with bc0d663

ruebot closed this Jan 23, 2020