
CSC 4780/6780 Homework 4


    • Read


At this point you should have read up to the start of Chapter 8: Probability, Distributions, and Sampling.


    • Scrape a webpage


(6 points) Create a Python program called scrape.py that takes a date in ISO format and an output filename as arguments:


> python3 scrape.py 2022-10-02 result.xlsx
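One way to read and validate the two arguments is sketched below. It uses only the standard library; the function name parse_args and the error messages are illustrative choices, not part of the assignment.

```python
from datetime import date


def parse_args(argv):
    """Return (target_date, output_path) from a list such as sys.argv[1:]."""
    if len(argv) != 2:
        raise SystemExit("usage: python3 scrape.py YYYY-MM-DD output.xlsx")
    try:
        # date.fromisoformat accepts strings such as "2022-10-02"
        target_date = date.fromisoformat(argv[0])
    except ValueError:
        raise SystemExit(f"not an ISO date: {argv[0]!r}")
    return target_date, argv[1]


# Example call with the arguments from the command line shown above
target_date, output_path = parse_args(["2022-10-02", "result.xlsx"])
print(target_date.isoformat(), output_path)
```

In the real program you would pass sys.argv[1:] instead of a hard-coded list.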


The program will then create an Excel spreadsheet that lists the names of the events that will happen on that date and their URLs. It should look like this when you open it in Excel:



































Behind the scenes, your program will


    • fetch the web page at https://discoveratlanta.com/events/all/

    • parse the result using BeautifulSoup and html.parser

    • step through each article inspecting the dates of the events

    • skip articles that do not contain the desired date

    • for articles that have the desired date, note the title and the URL

    • make a dataframe with all the titles and URLs

    • write the dataframe to an ExcelWriter

    • resize the columns to be a reasonable width

    • write it to the file named on the command line
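The parsing and filtering steps above can be sketched as follows. The sample HTML is made up so the sketch runs offline: the data-dates attribute, the h4 tag, and the example.com links are assumptions, and the real page may mark up its dates differently, so inspect its source first. Fetching the page is left out here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's tags and attributes may differ.
SAMPLE_HTML = """
<article data-dates="2022-10-01,2022-10-02">
  <h4><a href="https://example.com/jazz-night">Jazz Night</a></h4>
</article>
<article data-dates="2022-10-05">
  <h4><a href="https://example.com/food-fair">Food Fair</a></h4>
</article>
"""


def events_on(html, iso_date):
    """Return (title, url) pairs for articles whose dates include iso_date."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for article in soup.find_all("article"):
        dates = article.get("data-dates", "").split(",")
        if iso_date not in dates:
            continue  # skip articles that do not contain the desired date
        link = article.find("a")
        if link is None:
            continue  # no title link to record
        matches.append((link.get_text(strip=True), link.get("href")))
    return matches


print(events_on(SAMPLE_HTML, "2022-10-02"))
```

Returning plain (title, url) pairs keeps the parsing step separate from the dataframe and Excel steps that follow.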


You are putting data into only 2 columns. Don’t include the dataframe’s index in the Excel file.
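A minimal sketch of the dataframe and Excel steps, assuming the openpyxl engine is installed. The column names "Title" and "URL", the sheet name "Events", and the width rule (longest value plus a little padding) are illustrative assumptions.

```python
import pandas as pd


def write_events(events, path):
    """Write (title, url) pairs to an Excel file with two sized columns."""
    df = pd.DataFrame(events, columns=["Title", "URL"])
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        # index=False keeps the dataframe's index out of the spreadsheet
        df.to_excel(writer, sheet_name="Events", index=False)
        sheet = writer.sheets["Events"]  # an openpyxl worksheet
        for letter, column in zip("AB", df.columns):
            longest = max([len(column)] + [len(str(v)) for v in df[column]])
            sheet.column_dimensions[letter].width = longest + 2
    return df
```

Leaving the with block saves the file, so no separate save call is needed.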


    • Analyze the residual from the last exercise


(4 points) My solution to last week’s regression problem (linreg_scikit.py and util.py) is in this directory. Extend it to save a histogram of the residual as res_hist.png.
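Saving the histogram could look something like the sketch below, which uses matplotlib. The function name, the number of bins, and the axis labels are assumptions; how you obtain the residual depends on the code in linreg_scikit.py.

```python
import matplotlib

matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import numpy as np


def save_residual_histogram(residual, path="res_hist.png", bins=20):
    """Save a histogram of the residual values to an image file."""
    fig, ax = plt.subplots()
    ax.hist(np.asarray(residual, dtype=float), bins=bins)
    ax.set_xlabel("Residual ($)")
    ax.set_ylabel("Number of properties")
    ax.set_title("Residual histogram")
    fig.savefig(path)
    plt.close(fig)  # release the figure's memory
    return path
```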


Extend linreg_scikit.py again to use scipy’s kstest to confirm that the residual really resembles a normal distribution. The test returns a P-value; if the P-value is less than 0.05, you can assume the residual is normally distributed.
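Calling kstest might look like the sketch below. The assignment does not say which mean and standard deviation to compare against, so the defaults here (scipy’s standard normal) are an assumption, as is the function name. Note that kstest computes its P-value under the hypothesis that the sample does come from the reference distribution; the sketch only returns the number and leaves the decision rule to the text above.

```python
import numpy as np
from scipy.stats import kstest


def ks_p_value(residual, mean=0.0, std=1.0):
    """Return the P-value of a K-S test of residual against N(mean, std)."""
    residual = np.asarray(residual, dtype=float)
    # args are passed to the "norm" CDF as its location and scale
    result = kstest(residual, "norm", args=(mean, std))
    return result.pvalue
```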




Now that you know it is a normal distribution, extend linreg_scikit.py yet again to print your confidence like this: "68% of the estimates done with this formula will be within $89.12 of the correct price. 95% will be within $140.19 of the correct price."
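For a normal distribution, about 68% of values fall within one standard deviation of the mean and about 95% within two. A sketch of the confidence lines under that rule, assuming the residual has a mean of zero and using numpy’s default population standard deviation (both are assumptions, as is the function name):

```python
import numpy as np


def confidence_lines(residual):
    """Format the 68% and 95% bounds from the residual's standard deviation."""
    std = float(np.std(residual))
    return (
        f"68% of predictions with this formula will be within ${std:,.2f} of the actual price.",
        f"95% of predictions with this formula will be within ${2 * std:,.2f} of the actual price.",
    )


for line in confidence_lines([-3.0, 3.0]):
    print(line)
```

The `,.2f` format adds the thousands separators and two decimal places used in dollar amounts.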



    • What to turn in


If your name is Fred Jones, you will turn in a zip file called HW04_Jones_Fred.zip of a directory called HW04_Jones_Fred. It will contain:


    • scrape.py

    • result.xlsx

    • linreg_scikit.py

    • util.py

    • properties.xlsx

    • res_hist.png


Be sure to format your Python code with black before you submit it.

We will run your code like this:


cd HW04_Jones_Fred

python3 scrape.py 2022-10-02 result.xlsx

python3 linreg_scikit.py properties.xlsx


The output from the second program should look like this:


> python3 linreg_scikit.py properties.xlsx

Read 519 rows, 5 features from 'properties.xlsx'.

predicted price = $32,362.85 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) + ($59,195.07 x bedrooms) + ($9,599.24 x bathrooms) + ($-17,421.84 x miles_to_school)

Kolmogorov-Smirnov: P-value = 4.154181404788638e-129

The residual follows a normal distribution.

68% of predictions with this formula will be within $91,849.54 of the actual price.

95% of predictions with this formula will be within $183,699.08 of the actual price.
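The "predicted price" line in this output can be assembled from an intercept and a list of coefficients. The sketch below is one way to do it; the function and parameter names are hypothetical and are not taken from the provided solution.

```python
def format_formula(intercept, coefficients, feature_names):
    """Build a 'predicted price = ...' line from an intercept and coefficients."""
    terms = [
        f"(${coef:,.2f} x {name})"
        for coef, name in zip(coefficients, feature_names)
    ]
    return " + ".join([f"predicted price = ${intercept:,.2f}"] + terms)


print(format_formula(32362.85, [85.61, -17421.84], ["sqft_hvac", "miles_to_school"]))
```

A negative coefficient comes out as $-17,421.84, the same style as the sample output.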


And should generate a histogram like this:































Do this work by yourself. Stack Overflow is OK. A hint from another student is OK. Looking at another student’s code is not OK.


    • Extra help


Here is a nice tutorial on Beautiful Soup: https://youtu.be/87Gx3U0BDlo

Getting ahead: Soon we will be doing classification. Here is a good discussion of metrics for the quality of a classifier: https://www.youtube.com/watch?v=8d3JbbSj-I8


























