WriteGraph DataFrame implementation #397

Draft
wants to merge 21 commits into
base: master
from

Contributor

SinghGursimran commented Jan 2, 2020

WriteGraph DataFrame implementation

#223

Reference to: archivesunleashed/aut-docs#34

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.df._

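// Link target: the second comma-separated field of the link's string form,
// which this code assumes looks like "[source,target,...]".
val target = udf((vs: Any) => {
  if (vs == null) "" else {
    val parts = vs.toString.split(",")
    if (parts.length > 1) parts(1) else ""
  }
})
// Link source: the first comma-separated field, minus the leading "[".
val src = udf((vs: Any) => {
  if (vs == null) "" else vs.toString.split(",").headOption.map(_.drop(1)).getOrElse("")
})
// Strip optional leading whitespace and a "www." prefix from a domain.
val modify = udf((str: String) => str.replaceAll("^\\s*www\\.", ""))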

var df = RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc).webpages()
  .select($"crawl_date", explode_outer(ExtractLinksDF($"url", $"content")).as("link"))

df = df.select($"crawl_date", modify(ExtractDomainDF(src($"link"))).as("Source"), modify(ExtractDomainDF(target($"link"))).as("Target"))
  .filter($"Source" =!= "")
  .filter($"Target" =!= "")
  .groupBy("crawl_date", "Source", "Target")
  .count()
  .filter($"count" > 5)
  .orderBy(desc("count"))

val Columns = Seq("crawl_date","Source","Target","count")
df = df.toDF(Columns:_*)
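
For reviewers who want to check the string handling without a Spark session, here is a minimal plain-Scala sketch of the same logic. It assumes a link is rendered as "[source,target,anchor]", which is what the UDFs above appear to rely on. `LinkSketch`, `host`, and `frequentPairs` are hypothetical names used only for illustration, and `host` is a rough stand-in for `ExtractDomainDF`, not its actual implementation.

```scala
object LinkSketch {
  // Same pattern as the `modify` UDF: drop leading whitespace and a "www." prefix.
  def stripWww(s: String): String = s.replaceAll("^\\s*www\\.", "")

  // First comma-separated field, minus the leading "[" (assumed link shape).
  def source(link: String): String =
    Option(link).flatMap(_.split(",").headOption).map(_.drop(1)).getOrElse("")

  // Second comma-separated field, or "" when there is none.
  def target(link: String): String =
    Option(link).map(_.split(",")).filter(_.length > 1).map(_(1)).getOrElse("")

  // Rough stand-in for domain extraction: the host part of a URL, or "".
  def host(url: String): String =
    scala.util.Try(new java.net.URI(url).getHost).toOption.flatMap(Option(_)).getOrElse("")

  // Count identical (source, target) pairs, keep those seen more than `min`
  // times, and sort by descending count (mirrors groupBy/count/filter/orderBy).
  def frequentPairs(pairs: Seq[(String, String)], min: Int): Seq[((String, String), Int)] =
    pairs.groupBy(identity).map { case (k, v) => (k, v.size) }.toSeq
      .filter(_._2 > min).sortBy(-_._2)
}
```

One limitation worth noting: splitting on "," misparses any URL that itself contains a comma, so selecting the struct fields directly would likely be more robust than going through the string form.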

Step 1: Creating nodes with IDs from the DataFrame

WriteGraph.nodesWithIdsDF(df).show(20)
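
The PR description does not show how `nodesWithIdsDF` assigns IDs, so the following is only one plausible reading of this step, written in plain Scala: collect every distinct domain that appears as a source or a target and number them in order of first appearance. `NodeSketch` and its method are hypothetical, and the real implementation may use a different ID scheme.

```scala
object NodeSketch {
  // Every distinct name seen as a source or target gets a numeric id,
  // assigned in order of first appearance (an assumption for illustration).
  def nodesWithIds(edges: Seq[(String, String)]): Map[String, Long] =
    edges.flatMap { case (s, t) => Seq(s, t) }
      .distinct
      .zipWithIndex
      .map { case (name, i) => name -> i.toLong }
      .toMap
}
```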

Step 2: Creating edges from the nodes and the DataFrame

WriteGraph.edgeNodesDF(df).show(10)
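
As a rough illustration of what building edges from nodes could involve, the sketch below replaces each row's source and target names with node IDs looked up in a name-to-ID map. This is an assumption about the step's intent, not the code in this PR; `EdgeSketch` and `Edge` are hypothetical names, and dropping rows with an unknown node is a design choice made here for simplicity.

```scala
object EdgeSketch {
  final case class Edge(date: String, sourceId: Long, targetId: Long, count: Long)

  // rows: (crawl_date, Source, Target, count); ids: node name -> node id.
  // Rows whose source or target has no id are dropped.
  def edges(rows: Seq[(String, String, String, Long)], ids: Map[String, Long]): Seq[Edge] =
    rows.flatMap { case (date, s, t, n) =>
      for {
        sid <- ids.get(s)
        tid <- ids.get(t)
      } yield Edge(date, sid, tid, n)
    }
}
```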

Step 3: Writing Graph

WriteGraph.asGraphmlDF(df, "output.graphml")
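
For context on what the written file is expected to contain, here is a minimal plain-Scala sketch that renders nodes and weighted edges as GraphML. It is not the PR's implementation: `GraphmlSketch`, the `label` and `weight` key names, and the choice of a directed graph are assumptions made for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object GraphmlSketch {
  // Escape characters that are not allowed verbatim in XML text or attributes.
  private def esc(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;")

  // nodes: (id, label); edges: (sourceId, targetId, weight).
  def render(nodes: Seq[(Long, String)], edges: Seq[(Long, Long, Long)]): String = {
    val header =
      """<?xml version="1.0" encoding="UTF-8"?>
        |<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
        |<key id="label" for="node" attr.name="label" attr.type="string"/>
        |<key id="weight" for="edge" attr.name="weight" attr.type="long"/>
        |<graph id="G" edgedefault="directed">""".stripMargin
    val nodeXml = nodes.map { case (id, label) =>
      s"""<node id="$id"><data key="label">${esc(label)}</data></node>"""
    }
    val edgeXml = edges.map { case (s, t, w) =>
      s"""<edge source="$s" target="$t"><data key="weight">$w</data></edge>"""
    }
    (Seq(header) ++ nodeXml ++ edgeXml ++ Seq("</graph>", "</graphml>")).mkString("\n")
  }

  // Write the rendered document to a file as UTF-8.
  def write(path: String, xml: String): Unit = {
    Files.write(Paths.get(path), xml.getBytes(StandardCharsets.UTF_8))
    ()
  }
}
```

Rendering to a string first keeps the output easy to assert on, which may also help with the test cases that are still outstanding.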

Remaining: tests.

codecov bot commented Jan 2, 2020

Codecov Report

Merging #397 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master     #397   +/-   ##
=======================================
  Coverage   75.14%   75.14%           
=======================================
  Files          40       40           
  Lines        1537     1537           
  Branches      281      281           
=======================================
  Hits         1155     1155           
  Misses        259      259           
  Partials      123      123
g285sing and others added 3 commits Jan 2, 2020

Member

ruebot commented Jan 17, 2020

@SinghGursimran are you still working on this one? It's still listed as draft. Just wanted to double-check.

Contributor Author

SinghGursimran commented Jan 17, 2020

Yes, the implementation is complete. I still have to find a way to design test cases for it.
