

Web Search and Sense-Making Assignment 3 Solution

Task: Clean Wikipedia


In this assignment, we will perform initial cleaning of the Wikipedia data.


You will need 100GB of free disk space on your machine.


Write a PreProc.scala file to preprocess the Wikipedia dump file. Extract the content of every <page> ... </page> element and write each page on its own line of an output file. Keep the opening <page> and closing </page> tags in your output file.

It is not required, but you are welcome to use the following code template:




import java.io.{File, PrintWriter}
import scala.io.Source
import scala.collection.mutable.StringBuilder

object PreProc {

  def main(args: Array[String]): Unit = {
    val inputfile = "your_wikidump_file"
    val outputfile = new PrintWriter(new File("your_output_file"))
    var a_output_line = new StringBuilder

    // Write your code to extract the content of every <page> ... </page> element
    // and write each page as one line in your output file.
    for (inputline <- Source.fromFile(inputfile).getLines()) {
      // ...
    }

    // Close the writer so that all buffered output reaches the file.
    outputfile.close()
  }
}




Please see sample input and output files on Piazza
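
As a rough illustration of the extraction step, the sketch below joins the lines of each page into one output line. It is one possible approach, not the required solution. The PageExtractor object and the extractPages method are hypothetical names, and the code assumes that each opening <page tag starts its own line in the dump, so check that against the sample input.

```scala
import scala.collection.mutable.StringBuilder

object PageExtractor {

  // Joins every <page ... </page block into a single line.
  // Assumption: an opening <page tag always starts its own input line.
  // Lines outside any page (for example the dump header) are skipped.
  def extractPages(lines: Iterator[String]): Iterator[String] = {
    val current = new StringBuilder
    var inPage = false
    lines.flatMap { line =>
      val trimmed = line.trim
      if (trimmed.startsWith("<page")) {
        inPage = true
        current.clear()
      }
      if (inPage) {
        // Separate the joined lines with one space so words do not run together.
        if (current.nonEmpty) current.append(' ')
        current.append(trimmed)
      }
      if (inPage && trimmed.contains("</page")) {
        inPage = false
        Some(current.toString)
      } else None
    }
  }

  def main(args: Array[String]): Unit = {
    // Tiny in-memory stand-in for the real dump file.
    val sample = Iterator(
      "<mediawiki>",
      "  <page>",
      "    <title>A</title>",
      "  </page>",
      "</mediawiki>"
    )
    extractPages(sample).foreach(println)
  }
}
```

In PreProc, the same method could be given Source.fromFile(inputfile).getLines() as input, with each returned string written through outputfile.println.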

Print the total number of pages in English Wikipedia to the screen
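
One way to get the page count is to increment a counter each time a page is written, then print it once the input is exhausted. The sketch below shows that idea on in-memory data. PageCounter and writeAndCount are hypothetical names, and the iterator of pages is assumed to already hold one complete <page> ... </page> string per element.

```scala
import java.io.{PrintWriter, StringWriter}

object PageCounter {

  // Writes one page per line and returns how many pages were written.
  def writeAndCount(pages: Iterator[String], out: PrintWriter): Long = {
    var count = 0L
    for (page <- pages) {
      out.println(page)
      count += 1
    }
    out.flush()
    count
  }

  def main(args: Array[String]): Unit = {
    // An in-memory writer stands in for the real output file here.
    val buffer = new StringWriter
    val pages = Iterator("<page> <title>A</title> </page>", "<page> <title>B</title> </page>")
    val total = writeAndCount(pages, new PrintWriter(buffer))
    println(s"Total number of pages: $total")
  }
}
```

A Long counter is used only as a precaution; an Int would also work for any count below roughly two billion.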
COSC 589 - Web Search and Sense-Making

What to Submit:

Your code

Screen capture of the page count results that you print to the screen
Screen capture of the beginning of your output (by using ‘head -n20’ to show the first 20 lines)

What NOT to Submit:

- Your input or output files

Where to submit:

- Canvas
