WriteGraph DataFrame implementation #397

Closed

SinghGursimran commented Jan 2, 2020

WriteGraph DataFrame implementation

#223

Reference to: archivesunleashed/aut-docs#34

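import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.df._
import org.apache.spark.sql.functions.{desc, explode_outer, udf}

// Link target: the second comma-separated field of the link's string form.
val target = udf((vs: Any) => {
  var res = ""
  if (vs != null) {
    res = vs.toString.split(",")(1)
  }
  res
})

// Link source: the first comma-separated field, minus its first character
// (the opening bracket of the row's string form).
val src = udf((vs: Any) => {
  var res = ""
  if (vs != null) {
    val s = vs.toString.split(",")(0)
    if (s.length() != 0) {
      res = s.drop(1)
    }
  }
  res
})

// Strip a leading "www." (and any whitespace before it) from a domain.
val modify = udf((str: String) => str.replaceAll("^\\s*www\\.", ""))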

var df = RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webpages()
  .select($"crawl_date", explode_outer(ExtractLinksDF($"url", $"content")).as("link"))

df = df
  .select(
    $"crawl_date",
    modify(ExtractDomainDF(src($"link"))).as("Source"),
    modify(ExtractDomainDF(target($"link"))).as("Target"))
  .filter($"Source" =!= "")
  .filter($"Target" =!= "")
  .groupBy("crawl_date", "Source", "Target")
  .count()
  .filter($"count" > 5)
  .orderBy(desc("count"))

val Columns = Seq("crawl_date","Source","Target","count")
df = df.toDF(Columns:_*)
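
To make the string handling in the `src`, `target`, and `modify` UDFs easier to follow, here is a minimal plain-Scala sketch of the same logic without Spark. It assumes a link row stringifies as `[source,target,anchor]`. The helper names `linkSource`, `linkTarget`, and `stripWww` and the sample link are hypothetical, and the regex is written here as `^\\s*www\\.`.

```scala
// Hypothetical helpers mirroring the UDF bodies; they are not part of aut.
// Assumption: a link stringifies as "[source,target,anchor]".
def linkSource(link: String): String =
  Option(link)
    .flatMap(_.split(",").headOption) // first comma-separated field, if any
    .filter(_.nonEmpty)
    .map(_.drop(1))                   // drop the leading "["
    .getOrElse("")

def linkTarget(link: String): String =
  Option(link)
    .map(_.split(","))
    .filter(_.length > 1)             // guard against links without a target
    .map(parts => parts(1))
    .getOrElse("")

// Remove a leading "www." (and any whitespace before it).
def stripWww(domain: String): String = domain.replaceAll("^\\s*www\\.", "")

val sample = "[http://www.a.org/x,http://b.org/y,anchor]"
val sampleSource = linkSource(sample)
val sampleTarget = linkTarget(sample)
```

Unlike the UDFs, this sketch returns an empty string rather than throwing when a field is missing, which is a design choice of the sketch and not something the pull request states.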

Step 1: Creating Nodes with IDs from the DataFrame

WriteGraph.nodesWithIdsDF(df).show(20)
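
The internals of `nodesWithIdsDF` are not shown in this thread. As a rough illustration of the idea behind Step 1, the sketch below assigns one ID to every distinct domain that appears as a source or a target, using plain Scala collections rather than a DataFrame. The `Edge` case class, the `nodesWithIds` helper, and the sample rows are all hypothetical.

```scala
// Hypothetical row shape matching the columns crawl_date, Source, Target, count.
case class Edge(crawlDate: String, source: String, target: String, count: Long)

// Assign a numeric ID to each distinct domain. Illustration only:
// the actual WriteGraph implementation may assign IDs differently.
def nodesWithIds(rows: Seq[Edge]): Map[String, Long] =
  rows
    .flatMap(r => Seq(r.source, r.target)) // every domain, from either side
    .distinct
    .sorted                                // sorted so IDs are deterministic
    .zipWithIndex
    .map { case (domain, i) => domain -> i.toLong }
    .toMap

// Made-up sample data.
val rows = Seq(
  Edge("20080430", "a.org", "b.org", 12L),
  Edge("20080430", "b.org", "c.org", 7L)
)
val ids = nodesWithIds(rows)
```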

Step 2: Creating Edges from Nodes and the DataFrame

WriteGraph.edgeNodesDF(df).show(10)
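
Conceptually, Step 2 replaces each domain in a counted link with its node ID. The following sketch shows that lookup with plain Scala collections. It is an assumption about the general approach, not the code behind `edgeNodesDF`, and `LinkCount`, `IdEdge`, `edgeNodes`, and the sample values are hypothetical.

```scala
// Hypothetical shapes: a counted link between domains, and the same link by node ID.
case class LinkCount(crawlDate: String, source: String, target: String, count: Long)
case class IdEdge(crawlDate: String, sourceId: Long, targetId: Long, count: Long)

// Replace each domain with its node ID; links whose domains have no ID are dropped.
def edgeNodes(links: Seq[LinkCount], ids: Map[String, Long]): Seq[IdEdge] =
  links.flatMap { l =>
    for {
      s <- ids.get(l.source)
      t <- ids.get(l.target)
    } yield IdEdge(l.crawlDate, s, t, l.count)
  }

// Made-up sample data.
val nodeIds = Map("a.org" -> 0L, "b.org" -> 1L)
val edges = edgeNodes(
  Seq(
    LinkCount("20080430", "a.org", "b.org", 12L),
    LinkCount("20080430", "a.org", "unknown.org", 9L) // no ID, so it is dropped
  ),
  nodeIds
)
```

On a DataFrame the same step would typically be expressed as two joins against the node table, one for the source column and one for the target column.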

Step 3: Writing Graph

WriteGraph.asGraphmlDF(df, "output.graphml")
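
For readers unfamiliar with GraphML, the sketch below renders a tiny node and edge set as a GraphML string. The element layout follows the GraphML format, but the exact keys and attributes that `asGraphmlDF` emits are not shown in this thread, so the `label` and `weight` keys and the `toGraphml` helper are assumptions made for illustration.

```scala
// Hypothetical minimal GraphML rendering; this is not the output of
// WriteGraph.asGraphmlDF. Labels are assumed to need no XML escaping.
def toGraphml(nodes: Map[String, Long], edges: Seq[(Long, Long, Long)]): String = {
  val nodeXml = nodes.toSeq.sortBy(_._2).map { case (label, id) =>
    s"""  <node id="$id"><data key="label">$label</data></node>"""
  }
  val edgeXml = edges.map { case (src, dst, weight) =>
    s"""  <edge source="$src" target="$dst"><data key="weight">$weight</data></edge>"""
  }
  val header = Seq(
    """<?xml version="1.0" encoding="UTF-8"?>""",
    """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">""",
    """<key id="label" for="node" attr.name="label" attr.type="string"/>""",
    """<key id="weight" for="edge" attr.name="weight" attr.type="long"/>""",
    """<graph edgedefault="directed">"""
  )
  (header ++ nodeXml ++ edgeXml ++ Seq("</graph>", "</graphml>")).mkString("\n")
}

// Made-up sample data: two nodes and one weighted edge.
val xml = toGraphml(Map("a.org" -> 0L, "b.org" -> 1L), Seq((0L, 1L, 12L)))
```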

Remaining: tests.

codecov bot commented Jan 2, 2020

Codecov Report

Merging #397 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master     #397   +/-   ##
=======================================
  Coverage   75.14%   75.14%           
=======================================
  Files          40       40           
  Lines        1537     1537           
  Branches      281      281           
=======================================
  Hits         1155     1155           
  Misses        259      259           
  Partials      123      123
g285sing and others added 3 commits Jan 2, 2020
ruebot commented Jan 17, 2020

@SinghGursimran are you still working on this one? It's still listed as draft. Just wanted to double-check.

SinghGursimran commented Jan 17, 2020

Yes, the implementation is complete. I still have to work out how to design test cases for it.

ruebot commented Apr 14, 2020

Closed by c1f9b31

@ruebot ruebot closed this Apr 14, 2020