Web Search and Sense-making Assignment 4 Solution


Task: Parse Wikipedia using scala.xml

Introduction:

In this assignment, we will parse the Wikipedia dump using an XML parser and keep only the articles from the dump.

Requirements:

100 GB of free disk space on your machine.

Instructions:

Write a WikiArticle.scala file to extract the articles from the English Wikipedia dump, by taking the following steps:

Read in the output file of your last assignment (one Wiki page per line).

Parse the file using the XML parser in Scala. Make sure you can access the title and text fields.

Extract the articles from the pages. Wikipedia has four page types: stubs, redirects, disambiguation pages, and articles. An article is any page that is not a stub, a redirect, or a disambiguation page.

You are welcome to use the following code template:

// COSC 589 - Web Search and Sense-making
import scala.util.matching.Regex
import scala.xml.XML
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WikiArticle {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Wiki WordCount")
    val sc = new SparkContext(sparkConf)
    val txt = sc.textFile("./wikidump.page.per.line") // your output from the last assignment

    val getTitleAndText = txt.map { l =>
      val line = XML.loadString(l)
      val title = (line \ "title").text
      val text = …. // your code goes here, to get the string inside the <text> element
      // your code goes here to output the title and the text
    }

    // you will need to write a function isArticle
    val articles = getTitleAndText.filter { r => isArticle(r._2.toLowerCase) }

    // save the articles; see the output format described after this template
  }
}
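
The parsing and filtering steps can be sketched outside Spark as two small functions. This is only an illustration under stated assumptions, not the expected solution: the object name `PageParsingSketch` and the helper `titleAndText` are hypothetical, and the markers used in `isArticle` (`#redirect`, `{{disambig`, `-stub}}`) are common wikitext conventions chosen here as a heuristic, so they will miss some pages and may need tuning. The sketch also assumes that `<text>` is nested under `<revision>`, as in the standard dump layout, which is why it searches descendants with `\\` rather than direct children with `\`.

```scala
import scala.xml.XML

object PageParsingSketch {
  // Extracts (title, text) from one <page> element serialized on a single line.
  def titleAndText(pageXml: String): (String, String) = {
    val page = XML.loadString(pageXml)
    // <title> is a direct child of <page>.
    val title = (page \ "title").text
    // <text> is assumed to sit under <revision>, so search all descendants.
    val text = (page \\ "text").text
    (title, text)
  }

  // Heuristic filter on already lower-cased wikitext.
  // The three markers below are assumptions, not an official definition.
  def isArticle(lowerText: String): Boolean = {
    val isRedirect = lowerText.trim.startsWith("#redirect")
    val isDisambiguation = lowerText.contains("{{disambig")
    val isStub = lowerText.contains("-stub}}")
    !isRedirect && !isDisambiguation && !isStub
  }
}
```

Inside the template, the body of the `map` could return `PageParsingSketch.titleAndText(l)` so that the later `filter` sees a (title, text) pair.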

Save the articles (using saveAsTextFile), in the following format:

Each line is one article

First element is the title of the article

Second element is the content of the article

The first and the second elements are delimited by a tab ('\t'). Note that Spark writes a tuple using ',' as the delimiter by default. We have to use something different because many Wikipedia titles themselves contain commas. To avoid confusion, separate the title and the text with a tab when you output them.
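
One way to produce this format is to build each output line explicitly before saving. The sketch below is a minimal illustration; the object name `TabDelimitedSketch` and the helper `toLine` are hypothetical. It also assumes that tabs and line breaks inside a title or an article body should be replaced by spaces, so that every article stays on one line with exactly one tab; the assignment does not say how to handle them.

```scala
object TabDelimitedSketch {
  // Joins title and text with a single tab.
  // Assumption: embedded tabs and line breaks are collapsed to one space
  // so the saved file keeps one article per line.
  def toLine(title: String, text: String): String = {
    def clean(s: String): String = s.replaceAll("[\\t\\n\\r]+", " ")
    clean(title) + "\t" + clean(text)
  }
}
```

With the `articles` RDD from the template, the lines could then be written with `articles.map { case (title, text) => TabDelimitedSketch.toLine(title, text) }.saveAsTextFile(outputPath)`, where `outputPath` is a placeholder for a directory of your choice.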

Print the total number of articles in English Wikipedia to the screen

Perform a WordCount for the Wikipedia Articles. Save the outputs.
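
The WordCount logic can be sketched on plain Scala collections first and then moved onto the RDD. The names `WordCountSketch`, `tokenize`, and `countWords` are hypothetical, and the tokenization rule (lower-case the text, then split on anything that is not a letter or a digit) is an assumption; the assignment does not define what counts as a word.

```scala
object WordCountSketch {
  // Lower-cases the text and splits on runs of non-letter, non-digit characters.
  // This tokenization rule is an assumption made for the sketch.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toSeq

  // Local stand-in for the map / reduceByKey step of a Spark WordCount.
  def countWords(texts: Seq[String]): Map[String, Int] =
    texts.flatMap(tokenize).groupBy(identity).map { case (w, ws) => (w, ws.size) }
}
```

On the `articles` RDD, the same idea would be `articles.flatMap { case (_, text) => WordCountSketch.tokenize(text) }` to get the words, followed by `.map((_, 1)).reduceByKey(_ + _)` for the per-word counts. Calling `count()` on the words gives the total number of words, and calling `count()` on the reduced pairs gives the number of unique words.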

What to Submit:

Your code

Screen capture of the article count results that you print to the screen
Screen capture of the total number of words in Wiki Articles that you print to the screen
Screen capture of the total number of unique words in Wiki Articles that you print to the screen
Screen captures of the beginning of your saved articles (for example, 2 screen captures covering the first 20 lines)
Screen captures of the beginning of your saved WordCount outputs (for example, 2 screen captures covering the first 20 lines)

What NOT to Submit:

- Your input or output files

Where to submit:

- Canvas