Task: Build the Wikipedia Link Graph
Introduction:
In this assignment, we will extract the links from the Wikipedia dump and build a link graph from them.
Requirements:
100 GB of free disk space on your machine.
Instructions:
Write a LinkGraph.scala file to extract the links from the English Wikipedia dump and build the link graph, by taking the following steps:
Read in the output files of your last assignment, in which you have obtained the articles. The format should be:
One page (of type article) per line
In each line, you have two fields: the title and the text, which are separated by a tab
An example page looks like:
here is a tab
For each page, extract its outlinks’ titles. An outlink appears in the Wikipedia dump in the following format:
[[the title of an outlink page]]
For instance, the page titled “Alvin Toffler” has an outlink to another page titled “Future Shock”.
(Alvin Toffler {{Use mdy dates|date=September 2013}}{{Infobox person| name = Alvin Toffler| image = Alvin Toffler 02.jpg| image_size = 210px| caption = Alvin Toffler (2006)| birth_name =| birth_date = {{Birth date and age |1928|10|4}}| birth_place =
New York City<ref>{{cite web|last=The European Graduate School|title=ALVIN TOFFLER - BIOGRAPHY|url=http://www.egs.edu/library/alvin-toffler/biography/|accessdate=January 7,
2014}}</ref>| death_date = <!-- {{Death date and age|YYYY|MM|DD|YYYY|MM|DD}} -->|
death_place =| death_cause =| resting_place =| residence = Los Angeles,
[[Future Shock]]''
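The [[...]] pattern described above can be matched with a regular expression. The following is a minimal sketch, not the required solution: the object name RawLinks and the helper rawLinkTargets are hypothetical, and the pattern assumes that the text between the brackets never contains square brackets itself.

```scala
import scala.util.matching.Regex

object RawLinks {
  // Matches "[[...]]" and captures whatever lies between the brackets.
  // Assumption: the captured text contains no '[' or ']' characters.
  val linkPattern: Regex = """\[\[([^\[\]]+)\]\]""".r

  // Returns the raw contents of every [[...]] in the text, in order of appearance.
  def rawLinkTargets(text: String): List[String] =
    linkPattern.findAllMatchIn(text).map(_.group(1)).toList
}
```

The returned strings are still raw at this point: they may contain “|”, “#”, “,”, or “:” and need further cleaning before they can serve as outlink titles.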
3. However, not everything inside [[]] is a valid outlink title. We will need to do the following:
3.1. First, ignore any outlink title that contains a colon “:”. Such titles do not refer to articles but to something else. For instance, “WP:CSD#R3D3” is not the title of an article.
3.2. Second, if an outlink title contains “|”, “#”, or “,”, keep only the part before the symbol. If a title contains “|”, the part after it is the text displayed for the link, a variation of the title; we only keep the first part. For instance, for [[The Third Wave (book)|The Third Wave]], we only keep [[The Third Wave (book)]]. If a title contains “#”, it has both the title and a section name; again, we only keep the former. For instance, for [[Uncial script#Half-uncial|semi-uncial]], we only keep [[Uncial script]]. If a title contains “,”, it conflicts with Spark’s default delimiter. So that we can match an outlink page to its own entry, we only keep the part before the comma. In summary, we will extract
the part before “|” in an outlink title with “|” (title name variations),
the part before “#” in an outlink title with “#” (bookmarked sections), and
the part before “,” in an outlink title with “,” (Spark’s default delimiter in saved files).
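Rules 3.1 and 3.2 can be sketched as a single cleaning function applied to the text found inside each [[]]. This is one possible reading of the rules, not the required solution. The names TitleCleaner and cleanTitle are hypothetical, the colon check is assumed to apply to the whole bracket content, and trimming whitespace and dropping empty results (such as a link that starts with “#”) are extra assumptions not stated in the rules.

```scala
object TitleCleaner {
  // Returns the cleaned outlink title, or None if the link should be ignored.
  def cleanTitle(raw: String): Option[String] =
    if (raw.contains(":")) None // rule 3.1: titles with a colon are not articles
    else {
      // rule 3.2: keep only the part before the first '|', '#', or ','
      val cut = raw.indexWhere(c => c == '|' || c == '#' || c == ',')
      val title = (if (cut >= 0) raw.substring(0, cut) else raw).trim
      // Assumption: an empty title is treated as no link at all.
      if (title.isEmpty) None else Some(title)
    }
}
```

For example, cleanTitle("Uncial script#Half-uncial|semi-uncial") yields Some("Uncial script"), and cleanTitle("WP:CSD#R3D3") yields None.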
Save the title and the outlink titles for each page in the Wikipedia dump. Optionally, you can save your files in compressed format by using saveAsTextFile(filename, classOf[GzipCodec]). The output format is as follows:
One page per line
In each line, you have the title of a page and a list of the titles of the outlinks in the page
Each outlink title is enclosed in [[]], and the titles are separated by a tab “\t”.
The title and the list of links are separated by “,” (this is the default in Spark). For instance, for the page titled “Alvin Toffler”, we expect the following as the output:
Save the number of outlinks for each page separately.
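The output format and the link count can be sketched with two small helpers. This is an illustration only: the names LinkOutput, formatLinks, and countLinks are hypothetical. It relies on the fact that saveAsTextFile writes each element using its toString, which for a Scala pair produces (first,second).

```scala
object LinkOutput {
  // Builds the link field: each title wrapped in [[ ]], titles separated by tabs.
  def formatLinks(titles: Seq[String]): String =
    titles.map(t => "[[" + t + "]]").mkString("\t")

  // Counts the links in a link field.
  // "".split("\t") yields Array(""), whose length is 1,
  // so a page without links is handled separately.
  def countLinks(linkField: String): Int =
    if (linkField.isEmpty) 0 else linkField.split("\t").length
}
```

With these helpers, the pair ("Alvin Toffler", LinkOutput.formatLinks(Seq("Future Shock"))) is written as the line (Alvin Toffler,[[Future Shock]]), and a page with no outlinks gets a count of 0 rather than 1.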
You are welcome to use the following code template:
import scala.util.matching.Regex
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.hadoop.io.compress.GzipCodec

object LinkGraph {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Wiki LinkGraph")
    val sc = new SparkContext(sparkConf)
    val input = sc.textFile("./wikiarticles") // your output directory from the last assignment
    val page = input.map { l =>
      val pair = l.stripPrefix("(").stripSuffix(")").split("\t", 2)
      (pair(0), pair(1)) // get the two fields: title and text
    }
    val links = page.map(r => (r._1, extractLinks(r._2))) // extract links from text
    // count number of links; "".split("\t") has length 1, so an empty link list counts as 0
    val linkcounts = links.map(r => (r._1, if (r._2.isEmpty) 0 else r._2.split("\t").length))
    // save the links and the counts in compressed format (save your disk space)
    links.saveAsTextFile("./links", classOf[GzipCodec])
    linkcounts.saveAsTextFile("./links-counts", classOf[GzipCodec])
  }

  def extractLinks(text: String): String = {
    // you will need to work on a way to extract the links
    ??? // placeholder so the template compiles; replace with your implementation
  }
}
What to Submit:
Your code
Screen captures of the beginning of your saved link graphs (e.g. the first 20 lines on the screen. Hint: use ‘gunzip part-00000.gz’ to unzip, then use ‘less yourfile’ to view the document and take a screen capture)
Screen captures of the beginning of the saved counts of links (e.g. the first 20 lines on the screen. Hint: use ‘gunzip part-00000.gz’ to unzip, then use ‘less yourfile’ to view the document and take a screen capture)
What NOT to Submit:
- Your input or output files
Where to submit:
- Canvas