
ASSIGNMENT 4 SOLUTION


Assignment Overview




The HTML language, which is used to describe the layout and content of web pages, has a famously verbose syntax: relatively simple formatting and layout instructions can often require several layers of bulky HTML tags. Modern HTML is very flexible for specifying visual aspects of the displayed content. However, extracting information from the HTML representation can be difficult. The goal of this assignment is to write a Python 3 program which converts HTML tables to a CSV representation. The CSV output will follow the specification given in Assignment 3.










 
HTML Tables




Tables in HTML are specified with the <table> tag. A brief tutorial on the <table> tag, along with interactive examples, can be found at http://www.w3schools.com/html/html_tables.asp. Note that whitespace in HTML is generally ignored: spaces, newlines and tabs are collapsed into a single space when the page is rendered. Line breaks are specified by the <br /> tag.




Consider the HTML table below (which appears as the first example in the tutorial linked above).




<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>













The table described by the HTML code above would be rendered by a web browser in a similar format to the table below.







Firstname   Lastname   Age
Jill        Smith      50
Eve         Jackson    94






The <tr> and </tr> tags enclose the data for each row of the table, and the <td> and </td> tags enclose the contents of each cell. The <th> and </th> tags enclose the contents of header cells, but their use is optional (some authors use regular <td> tags for header cells). Cells of a table may contain HTML code, including other HTML tables.




Tag names are not case sensitive, so `<tD>' and `<TD>' are both valid forms of the `<td>' tag. Any HTML tag may contain attributes that change its appearance. The most common attribute in modern HTML is `style'. For example, to make the contents of a particular cell boldfaced, the attribute style="font-weight: bold;" can be added to the <td> tag:




<td style="font-weight: bold;">Jill</td>




Additionally, any HTML tag may contain whitespace after the tag name, between attributes or before the closing `>' character. Whitespace is not permitted between the opening `<' character and the tag name. For example, `<td >' and `</td >' are valid tags, but `< /td>' and `< td >' are not.
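The tag-formatting rules above can be expressed as regular expressions. The sketch below shows one possible pair of patterns for the `<td>' tag; it is not part of the assignment specification, and the names TD_OPEN, TD_CLOSE, is_td_open and is_td_close are hypothetical, chosen here only for illustration.

```python
import re

# Opening tag: the name must follow '<' immediately. After the name comes
# either nothing, or whitespace plus optional attributes (which may not
# contain '>'), and then the closing '>'.
TD_OPEN = re.compile(r'<td(\s[^>]*)?>', re.IGNORECASE)

# Closing tag: whitespace is allowed only between the name and the '>'.
TD_CLOSE = re.compile(r'</td\s*>', re.IGNORECASE)


def is_td_open(tag):
    """Return True if the whole string is a valid opening <td> tag."""
    return TD_OPEN.fullmatch(tag) is not None


def is_td_close(tag):
    """Return True if the whole string is a valid closing </td> tag."""
    return TD_CLOSE.fullmatch(tag) is not None
```

With these definitions, is_td_open('<tD >') is true while is_td_open('< td >') is false. Requiring whitespace before any attributes also keeps the pattern from accepting a different tag whose name merely starts with `td'.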







Since whitespace in HTML is generally ignored, there is no requirement that the table be laid out in a readable way in the HTML code. The table in the example above could also be represented by the code below.




<table style="width:100%"> <tr> <th> Firstname</th> <th> Lastname</th><th>Age</th></tr> <tr><td>Jill</td>
<td>Smith</td><td> 50</td>
</tr><tr><td>Eve</td><td>Jackson</td> <td>94</td></tr></table>




This assignment will add the following extra constraints to the basic HTML table specification.




To be considered valid, a test input must be valid HTML. For example, tags like <td> can only occur inside of a <tr> tag, which in turn must be inside a <table> tag. All opening tags must have a matching closing tag (note that some HTML tags, like <br />, are singular and do not need a closing tag), and vice versa.




Between the opening angle bracket (<) and closing angle bracket (>) of a tag, no other instances of closing angle brackets are permitted (including inside of attributes).




Commas may not appear inside the data for a cell. However, other aspects of the HTML which are not cell content (such as the style attributes of <td> tags) may contain commas.




Cells may contain any data, including other HTML tags, but may not contain nested <table> tags. The prohibition on comma use applies to all contents of each cell, including HTML tags. In other words, if the comma character appears between the opening <td> tag and its matching </td> tag, the input will be considered invalid.




There is no requirement that each row of the table contain the same number of columns. Every HTML table must have at least one row.




Every row of an HTML table must contain at least one cell.




The rowspan and colspan attributes, which are used to make cells span multiple rows or columns, are not permitted.
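Several of the constraints above can be checked mechanically, which may be useful when preparing a test input. The sketch below is a rough check under simplifying assumptions: it does not verify that the input is valid HTML in general, and it treats an input with no table at all as a problem. The function name find_violations and the wording of the messages are hypothetical.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL


def find_violations(html):
    """Return a list of messages describing constraint violations in html."""
    problems = []
    tables = re.findall(r'<table(?:\s[^>]*)?>(.*?)</table\s*>', html, FLAGS)
    if not tables:
        problems.append('no table found')
    for t, table in enumerate(tables, start=1):
        # An opening <table> tag inside the captured body means nesting.
        if re.search(r'<table(?:\s[^>]*)?>', table, FLAGS):
            problems.append('table %d: nested table' % t)
        rows = re.findall(r'<tr(?:\s[^>]*)?>(.*?)</tr\s*>', table, FLAGS)
        if not rows:
            problems.append('table %d: no rows' % t)
        for r, row in enumerate(rows, start=1):
            where = 'table %d, row %d' % (t, r)
            # Each match is a pair: (attribute text, cell contents).
            cells = re.findall(r'<t[dh](\s[^>]*)?>(.*?)</t[dh]\s*>', row, FLAGS)
            if not cells:
                problems.append(where + ': no cells')
            for attrs, contents in cells:
                # Commas are forbidden in cell contents but not in attributes.
                if ',' in contents:
                    problems.append(where + ': comma inside a cell')
                if re.search(r'\b(?:rowspan|colspan)\b', attrs, re.IGNORECASE):
                    problems.append(where + ': rowspan or colspan used')
    return problems
```

An empty list means none of the checked constraints were violated; it does not guarantee that the input is valid.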










 
HTML-to-CSV Converter




Your task is to write a Python 3 program called table_to_csv.py which reads HTML from standard input and outputs a CSV representation of each table in the input, including any header cells specified with <th> (if present). See the Assignment 3 specification for documentation on the CSV format.




The resulting CSV data will be printed to standard output in the following format:




TABLE 1:
<CSV data for first table in the input>
TABLE 2:
<CSV data for the second table in the input>
TABLE 3:
<CSV data for the third table in the input>
...
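As an illustration of this layout, the sketch below builds the output text from tables that have already been parsed. The function name format_tables and the list-of-lists data structure are assumptions made for this example. Whether a blank line should separate consecutive tables is not stated here, so this sketch emits none.

```python
def format_tables(tables):
    """Build the output text for a list of parsed tables.

    Each table is a list of rows, and each row is a list of cell strings.
    """
    lines = []
    for number, rows in enumerate(tables, start=1):
        # Tables are numbered from 1 in the order they appear in the input.
        lines.append('TABLE %d:' % number)
        for row in rows:
            lines.append(','.join(row))
    return '\n'.join(lines)
```

Calling print(format_tables(tables)) would then write the result to standard output.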




Your implementation may assume that the input table complies with the constraints given in the previous section, and must also meet the following requirements.




All runs of one or more spaces, newlines, tabs, or other whitespace should be collapsed into a single space.




Within a table cell, all HTML tags are to be left intact.




The contents of each table cell should be stripped of all leading and trailing whitespace before being output. For example, the cell `<td>   Lemon    Meringue  </td>' should be output as `Lemon Meringue' (note that the multiple spaces between the two words are also collapsed into one space).




Every row of the output CSV spreadsheet must contain the same number of columns (recall that the number of columns in a row of a CSV spreadsheet is the number of commas in the row plus one). If the rows of the HTML table contain a differing number of columns, then the number of columns in the output spreadsheet should be equal to the number of columns in the row of the input table with the largest number of columns. Other rows should be padded with blank cells to meet the column requirement.
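The whitespace and padding requirements above can be sketched as two small helper functions. This is one possible approach rather than a required design; the names clean_cell and pad_rows are hypothetical, and rows are assumed to be lists of cell strings.

```python
import re


def clean_cell(text):
    """Collapse each run of whitespace to one space, then strip both ends."""
    return re.sub(r'\s+', ' ', text).strip()


def pad_rows(rows):
    """Pad every row with blank cells up to the width of the widest row."""
    width = max(len(row) for row in rows)
    return [row + [''] * (width - len(row)) for row in rows]
```

For example, padding the rows ['a', 'b', 'c'] and ['d'] gives ['d', '', ''] for the second row, which is written as `d,,' in the CSV output. Note that pad_rows assumes the table has at least one row, which the constraints on valid input guarantee.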




Section 2 (HTML Tables) contains two different representations of the same HTML table. Both should produce the output below when provided as input to a correct implementation.




TABLE 1:
Firstname,Lastname,Age
Jill,Smith,50
Eve,Jackson,94




Consider the HTML table below (which has been posted to the git repository as `a4_example_table.html').

<table><tr>
<th>Student Number</th><th>Student Name</th><th>Major</th> <th>A1 mark</th><th>A2 mark</th></tr>
<tr><td>V00000001</td><td></td><td></td><td>10</td> <td>11</td>
</tr>
<tr><td>V00123456</td><td>Alastair Avocado</td>
<td>Psychology</td><td>12</td><td></td></tr>
<tr>
<td>V00123457</td>
<td>Rebecca Raspberry
</td><td>Computer Science</td><td>17</td><td>14</td></tr>
<tr><td>V00314159</td><td>Fiona Framboise</td>
<td style="font-family: monospace; font-size: 20pt; font-weight: bold;"> Computer Science

</td>
<td> </td><td>17</td></tr>
<tr><td>V00654321</td><td>Meredith Malina</td>
<td style="color: red;">Software Engineering</td><td>18</td><td>12</td></tr>
<tr><td>V00654322</td><td>Hannah Hindbaer</td><td>Physics</td><td>15</td><td>18</td></tr>
<tr><td>V00951413</td><td>Neal Naranja</td><td>Anthropology</td><td>15</td><td>15</td></tr>
</table>




When provided as input to a correct implementation, the HTML table above would be converted to the following CSV spreadsheet.




TABLE 1:
Student Number,Student Name,Major,A1 mark,A2 mark
V00000001,,,10,11
V00123456,Alastair Avocado,Psychology,12,
V00123457,Rebecca Raspberry,Computer Science,17,14
V00314159,Fiona Framboise,Computer Science,,17
V00654321,Meredith Malina,Software Engineering,18,12
V00654322,Hannah Hindbaer,Physics,15,18
V00951413,Neal Naranja,Anthropology,15,15










 
Implementation Advice




Since HTML allows such a wide variation in the structure and formatting of tags, the use of regular expressions to match each tag pair is encouraged. However, you are not required to use regular expressions (or any other particular implementation technique, as long as your code is valid Python 3). If you use regular expressions, be aware of the following points.




By default, the `.' specifier does not match the newline character (`\n'), so if you are searching for something which crosses a line boundary, it will not match. For example, the pattern `A.*B' would match `Axy z B' but not `Axy\n z B' by default. Since whitespace in HTML can be collapsed to a single space, you can remedy this problem by replacing all newline characters with spaces. You can also use the `re.DOTALL' flag when performing regular expression matching, which will cause newlines to be matched by the `.' specifier. Consider the interactive Python 3 session below, which contains examples of both methods.







 
>>> import re
>>> s1 = 'Axy z B'
>>> s2 = 'Axy\n z B'
>>> re.match('A.*B',s1)
<_sre.SRE_Match object; span=(0, 7), match='Axy z B'>
>>> re.match('A.*B',s2)
>>> re.findall('A.*B',s1)
['Axy z B']
>>> re.findall('A(.*)B',s1)
['xy z ']
>>> re.findall('A(.*)B',s2)
[]
>>> re.findall('A(.*)B',s2, re.DOTALL)
['xy\n z ']
>>> s3 = s2.replace('\n',' ')
>>> s3
'Axy  z B'
>>> re.findall('A(.*)B',s3)
['xy  z ']




Since HTML tag names are not case sensitive, you may want to use the `re.IGNORECASE' flag to enable case-insensitive matching. Consider the interactive session below.




 
>>> x = 'abc'
>>> y = 'Abc'
>>> z = 'A------C'
>>> re.findall('a.*c',x)
['abc']
>>> re.findall('a.*c',y)
[]
>>> re.findall('a.*c',z)
[]
>>> re.findall('a.*c',y,re.IGNORECASE)
['Abc']
>>> re.findall('a.*c',z,re.IGNORECASE)
['A------C']




Note that if you want to use multiple flags (such as both `re.DOTALL' and `re.IGNORECASE'), you can combine them with the bitwise-OR operator (for example, `re.findall('a.*c',z, re.IGNORECASE|re.DOTALL)').
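To show how the two flags might be combined in practice, the sketch below pulls the raw cell contents out of the HTML for a single table. It is one possible approach, not a required design, and the function name extract_rows is hypothetical. It uses the non-greedy form `.*?' because a greedy `.*' would run from the first opening tag to the last closing tag, merging several cells into one match.

```python
import re

# DOTALL lets '.' cross line boundaries; IGNORECASE accepts <TD>, <tD>, etc.
FLAGS = re.IGNORECASE | re.DOTALL


def extract_rows(table_html):
    """Return the raw contents of each cell, grouped by row."""
    rows = []
    for row_html in re.findall(r'<tr(?:\s[^>]*)?>(.*?)</tr\s*>', table_html, FLAGS):
        # Header cells (<th>) and regular cells (<td>) are treated alike.
        cells = re.findall(r'<t[dh](?:\s[^>]*)?>(.*?)</t[dh]\s*>', row_html, FLAGS)
        rows.append(cells)
    return rows
```

The cell contents are returned unmodified, so whitespace collapsing and stripping would still need to be applied before output.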










 
Peer Testing




As in previous assignments, part of the mark for this assignment will come from creating test cases that will be used to test your peers' implementations. For this assignment, you are required to submit one valid and non-trivial test case in a file called table_testcase.html. Your test case will be considered invalid (and will therefore receive a zero) unless it meets the requirements in Section 2 (HTML Tables).




Your test case file must contain at least one HTML table. It may contain other HTML tags, but, to be consistent with the specification in Section 2, nested tables are not permitted. Additionally, images and dynamic content (including any use of JavaScript) are not allowed.




Your table_to_csv.py implementation and test cases must be submitted by Tuesday, July 26th, 2016 at 11:00pm. After that deadline, the results of testing each implementation using each test case will be posted to the Testing Database on conneX.




Since your test cases will be published, please ensure that they contain no identifying information.




Your implementation of table_to_csv.py will not be published.




After the results of peer testing are posted, you will have the opportunity to revise and resubmit your table_to_csv.py implementation until Sunday, July 31st, 2016. You will not be permitted to resubmit if you did not submit an implementation before the July 26th deadline (and you will therefore receive a mark of zero). Since all test cases will be published, you will also not be permitted to resubmit your test cases.










 
Evaluation




Submit your table_to_csv.py and table_testcase.html files electronically through the Assignments tab on conneX. Do not submit any other files.




The assignment will be marked out of 16 marks and is worth 8% of your final grade.




Your implementation and test case will be marked based on its performance on test cases developed by your instructor (not on its performance on your peers' test cases).




The marks are distributed among the components of the assignment as follows.







Marks
Component






 
The table_to_csv.py implementation functions correctly on a variety of valid inputs.







 
The submitted test case file contains a valid and non-trivial test case. You will receive zero marks for this component unless you also submit table_to_csv.py.







Ensure that all code files needed to run your code in ECS 242 are submitted. Only the files that you submit through conneX will be marked. The best way to make sure your submission is correct is to download it from conneX after submitting and test it. You are not permitted to revise your submission after the due date, and late submissions will not be accepted, so you should ensure that you have submitted the correct version of your code before the due date. conneX will allow you to change your submission before the due date if you notice a mistake. After submitting your assignment, conneX will automatically send you a confirmation email. If you do not receive such an email, your submission was not received. If you have problems with the submission process, send an email to the instructor before the due date.


















