DataFrame discussion: open thread #190
Comments
I like it.
I think the main advantage is that it follows (roughly) a typical SQL SELECT statement, which is something most technical librarians would be comfortable with. I wonder if the Parquet format would help us down the road? https://parquet.apache.org/ There are Ruby gems for the format -- might be useful if we get to petabyte-size archives a decade from now.
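(As a point of reference, here is a minimal sketch of what round-tripping a DataFrame through Parquet looks like in plain Spark; the data and paths are placeholders, not anything from AUT.)

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-sketch").getOrCreate()
import spark.implicits._

// A tiny stand-in DataFrame of (url, domain) pairs.
val df = Seq(
  ("http://example.com/a", "example.com"),
  ("http://example.org/b", "example.org")
).toDF("url", "domain")

// Write it out as column-oriented Parquet, then read it back.
df.write.mode("overwrite").parquet("/tmp/archive-sample.parquet")
val reloaded = spark.read.parquet("/tmp/archive-sample.parquet")
reloaded.show()
```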
Or this:
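(The snippet itself didn't survive in this copy of the thread; a rough sketch of what the fluent form might have looked like, assuming a `df` with a `url` column and an `ExtractDomain` UDF already registered under that name:)

```scala
import org.apache.spark.sql.functions.{callUDF, desc}
import spark.implicits._  // for the $"..." column syntax; spark-shell provides `spark`

df.select(callUDF("ExtractDomain", $"url").as("domain"))
  .groupBy("domain")
  .count()
  .orderBy(desc("count"))
  .show(10)
```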
Or, in straight-up SQL if you prefer:
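(Also a sketch, assuming the same DataFrame has been registered as a temporary view and the same UDF is available to SQL:)

```scala
df.createOrReplaceTempView("archive")

spark.sql("""
  SELECT ExtractDomain(url) AS domain, COUNT(*) AS count
  FROM archive
  GROUP BY domain
  ORDER BY count DESC
  LIMIT 10
""").show()
```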
Yeah -- coding with DataFrames is definitely more intuitive than mapping over RDDs. Lambdas are oddly confusing unless you are used to them.
cjer commented Apr 5, 2018
Major like on this. It would make it much easier for R- and Python/pandas-minded people to get into Scala code.
This is great, @lintool! It would let people interactively explore rather than using our rote scripts. I don't think we could expect historians to know SQL syntax – I don't know it, for example – but with documentation I think the […]
I think that once we implement Scala DataFrames, the pressure will be on to get it into PySpark. In particular the […]
Aye, here's the rub, though... I can imagine three different ways of doing analyses: RDDs plus transformations (what we do now), the fluent DataFrame API, and straight-up SQL queries.
And, in theory, you multiply those by the number of languages (e.g., Scala, Python, R). However, it will be a huge pain to make sure all the functionalities are exactly the same -- for example, UDFs might be defined a bit differently, so we need to make sure all versions stay in sync. IMO we should standardize on a "canonical way" and then offer people "use at your own risk" functionalities.
Do you think it's possible to call Scala UDFs from PySpark, as per #148? In theory it looks possible, but @MapleOx had challenges getting it to work. Ideally, all PySpark calls would run Scala UDFs with a small set of helper scripts. Or we could go the direction of using AUT plugins that are "use at your own risk."
One concrete proposal would be to deprecate the current RDDs + transformations in favor of fluent SQL. Not right now, but plan for it...
@greebie on the data you just handed me...
Beauty! I think the long view should be to move to DataFrames more fully. My little bit of research suggests that DataFrames plus Parquet is where the real big data work is happening right now. If not Parquet, then probably some other column-based/matrix-oriented format. WARCs are definitely not going to get smaller over time -- any way to boost speed and shrink storage will be very important down the road. However, by "long view" I'm thinking at least 5+ years. Getting historians and non-data-scientists engaged is much more important in the short run, and that's going to require more user-friendly approaches over super-powerful data formats.
@greebie I'm familiar with Parquet but that's not very relevant for our work... that's a physical format, and it's unlikely that organizations are going to rewrite their TBs of holdings...
I've just been seeing some papers that suggest DataFrames backed by Parquet are the way to go for huge datasets, and also suggestions that writing DataFrame outputs to Parquet is better than writing them to text. But I see your point.
I've pushed a prototype to a branch. The following script works:
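(The script itself wasn't preserved here. A sketch of its general shape; `loadArchivesAsDF` is a stand-in name for whatever loader the prototype branch actually exposes, and the path is a placeholder.)

```scala
// Hypothetical loader call standing in for the prototype's actual API.
val df = RecordLoader.loadArchivesAsDF("/path/to/warcs/*.gz", sc)

df.printSchema()
df.show(5)

// Domain frequency via SQL, assuming an ExtractDomain UDF has been registered.
df.createOrReplaceTempView("archive")
spark.sql("""
  SELECT ExtractDomain(url) AS domain, COUNT(*) AS count
  FROM archive
  GROUP BY domain
  ORDER BY count DESC
""").show(10)
```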
@ianmilligan1 @greebie can you give it a try on a more substantial amount of data? How does it compare to what we're currently doing?
Just ran it on a slightly more substantial amount of data (6 GB) and it looks great. I'll run the same script on a much larger corpus now.
@lintool would it be possible to use DFs to reproduce the plain text (say, by domain and by date) and the hyperlinks? I'm just not familiar enough with the syntax to know offhand how to construct that, and you might be able to do so in your sleep.
@ianmilligan1 not yet -- I need to port the UDFs over. Replicating all the standard derivatives using DFs is on my TODO list.
Cool. This is great, and the […]
Yes, I believe the […]
This works nicely. The run speeds are pretty close, with the RDD version slightly faster. I could not import the matchbox and df packages at the same time due to the ambiguity of the ExtractDomain function.
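(One standard Scala workaround for that kind of clash is to rename one of the two at the import site; the package paths below are illustrative, not the exact ones in aut.)

```scala
// Rename the matchbox/RDD version on import so it no longer collides with the
// DataFrame-side ExtractDomain.
import io.archivesunleashed.spark.matchbox.{ExtractDomain => ExtractDomainRDD}
import io.archivesunleashed.df.ExtractDomain

// Both can now coexist in the same session:
//   ExtractDomainRDD(url)   -- RDD/matchbox version
//   ExtractDomain($"url")   -- DataFrame column version
```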
This data frame discussion is great. When you start from an ArchiveSpark RDD, or you convert an AUT RDD using the bridge, […] Maybe it would be worth having a look at that and building your new UDFs on top of it instead of reinventing the wheel and writing your own conversions. This way we could reuse the idea of enrichments and help it grow, plus support data frames with UDFs to run queries on top of that. Further, the pre-processed JSON dataset can also be saved and directly reloaded as a data frame, so ArchiveSpark can be used for corpus building here, and a reloaded corpus would be directly supported by your work on data frames. We haven't done much work on data frames specifically yet, but it turns out this is a very easy way to integrate both approaches without much additional work.
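(For the reload step, a minimal sketch with plain Spark of reading a saved, line-delimited JSON corpus back in as a DataFrame; the path is a placeholder.)

```scala
// Reload a pre-processed JSON corpus as a DataFrame. Spark infers the schema
// from the JSON structure, and the result can be queried with SQL or the
// fluent API like any other DataFrame.
val corpus = spark.read.json("/path/to/derived-corpus.json.gz")
corpus.printSchema()
corpus.createOrReplaceTempView("corpus")
spark.sql("SELECT * FROM corpus LIMIT 10").show()
```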
Updated. The result is something like this: […]
Late to this, but this is working quite nicely @lintool – here are the results on a decently-large collection, sorted by descending count: […]
What's the best way to export these to CSV files?
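(In case it helps, here is the plain Spark way to write a DataFrame out as CSV, assuming the grouped result is in a DataFrame called `counts`; the output path is a placeholder, and note that Spark writes a directory of part files rather than a single CSV.)

```scala
// Write the domain counts out as CSV. coalesce(1) forces a single part file,
// which is convenient for small derivatives but funnels everything through one
// worker, so use it with care on large results.
counts
  .coalesce(1)
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("/path/to/output/domain-counts")
```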
I was reading this http://sigdelta.com/blog/scala-spark-udfs-in-python/ and it looks like a good path forward. Create .callUdf() and .registerUdf() functions for the objects in aut, and then they can be used in Python or Scala. The main Python code can then just be a bridge to aut, which would reduce the long-term maintenance burden.
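(A sketch of the Scala side of that idea: register a UDF on the shared SparkSession so any binding that shares the session, PySpark included, can reach it through SQL. The function body is a simplified stand-in, not AUT's actual ExtractDomain.)

```scala
import java.net.URL
import scala.util.Try

// Simplified stand-in for an AUT-style domain extractor.
def extractDomain(url: String): String =
  Try(new URL(url).getHost).getOrElse("")

// Registering on the SparkSession makes the UDF callable from SQL, and
// therefore from PySpark via spark.sql(...) as well.
spark.udf.register("ExtractDomain", extractDomain _)

spark.sql("SELECT ExtractDomain('https://example.com/page') AS domain").show()
```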
RD Lightning Talk + Discussion
We had a great discussion at the Toronto datathon about data frames. Here are some quick notes made during @lintool's lightning talk, plus feedback from the community.
The AUT team would like the community's input on data frames. For a bit of context, the discussion revolves around whether there is interest in moving from RDDs to a data frame/table framework. We would like to gauge interest, as well as what this would mean for accessibility, impact, uptake, etc. Community input will help direct future project priorities and determine dev cycles. Ultimately we want to focus future work on features that will be useful and used.
Discussion points brought up by AUT datathon participants: […]
Other points to consider: […]
Further discussion from the datathon re: DFs to be documented.
ruebot added a commit that referenced this issue on Apr 27, 2018
#214 has been merged now. #209 has some examples of how to use PySpark. If you use Spark 2.3.0, it doesn't require the […]. FYI: @digitalshawn, @justinlittman
ruebot added this to In Progress in DataFrames and PySpark on May 21, 2018
ruebot added this to To Do in 1.0.0 Release of AUT on Aug 13, 2018
ruebot moved this from In Progress to To Do in DataFrames and PySpark on Aug 13, 2018
ruebot added the discussion label on Aug 20, 2018
What is to be the relationship between the class […]?
^^^ @lintool ?
His input would be valuable. It gets to some of the points @SamFritz summarizes above: do we go all-in on DFs and deprecate RDDs? Is anything important lost in doing so?
Couldn't stay away. I was looking at the "resolve before 0.18.0" tag and noticed @lintool's comment about how filtering, in the case of a certain issue, is being done on RDDs, when our plan is eventually to move everything to DFs. After my contributions of the past week, I thought I'd see how much work would be needed to use DataFrames from the initial load onward. I created a branch that does this (54af833) and re-implements […]
Demo: […]
Nice work! My main goal here is to get @lintool's input before the 0.18.0 release, but not necessarily to implement it all before then. I'm really looking for the correct path forward here and on #231, which overlaps a fair bit. That way we have a focused path post-0.18.0. If we have this branch, I think that's a really good start, and hopefully it will align with @lintool's thinking.
That makes sense. It turns out my […]
It seems that the VM's array size limit is reached when converting certain RDD records -- presumably ones containing large files -- to DataFrame rows. Searching for the problem, I see recommendations about tweaking Spark settings such as parallelism and the number of partitions, but I think we're actually limited by the number of WARC files we're dealing with, so I don't know that this would help. Anyway, it's something to start with.
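(Not a fix, but one way to test the large-record hypothesis would be to drop oversized payloads before the RDD-to-row conversion. Everything here is hypothetical: `records`, `getUrl`, and `getContentString` stand in for whatever the loader actually provides.)

```scala
import spark.implicits._

// Keep only records whose payload is under an arbitrary cutoff before building
// rows, to see whether the very large records are what trips the JVM's
// array-size limit.
case class PageRow(url: String, content: String)

val maxChars = 64 * 1024 * 1024  // arbitrary ~64M-character cutoff

val df = records
  .filter(r => r.getContentString.length < maxChars)
  .map(r => PageRow(r.getUrl, r.getContentString))
  .toDF()
```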
lintool commented Apr 5, 2018
I've been playing a bit with DataFrames. Open thread just to share what the "user experience" might look like. I've got things working in my own local repo, but I'm hiding a bit of the magic here.
Assume we've got a DataFrame created. Here's the schema:
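(The schema listing didn't carry over; a sketch of the call and the kind of output it produces, with the field names here being assumptions rather than the prototype's actual schema:)

```scala
df.printSchema()
// root
//  |-- crawl_date: string (nullable = true)
//  |-- url: string (nullable = true)
//  |-- mime_type: string (nullable = true)
//  |-- content: string (nullable = true)
```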
This shows the rows:
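(Again just the shape of the call, using the assumed column names from the sketch above, since the output wasn't preserved:)

```scala
// Display the first rows, skipping the bulky content column so the table stays readable.
df.select("crawl_date", "url", "mime_type").show(10, truncate = false)
```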
We can extract domains from the URLs, and then group by and count, as follows:
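(The snippet is missing here as well; a sketch of the likely shape, with an inline UDF standing in for AUT's real ExtractDomain:)

```scala
import org.apache.spark.sql.functions.{udf, desc}
import java.net.URL
import scala.util.Try
import spark.implicits._

// Simplified stand-in for a domain-extraction UDF.
val extractDomain = udf((url: String) => Try(new URL(url).getHost).getOrElse(""))

df.select(extractDomain($"url").as("domain"))
  .groupBy("domain")
  .count()
  .orderBy(desc("count"))
  .show(10)
```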
Thoughts?