CSC 4780/6780 Homework 1


1 Purpose

In every data science job, you get two things on the first day: an email address and a laptop. Your first task is to get the laptop ready for work.

The second thing you do is get a table of data that you are expected to understand. Here we will be using Pandas and SQLite to explore some data.

2 Study

Read pages 71 - 133 of Practical Data Science with Python.

Optional: We will be using pandas all semester. You should get comfortable with it. Here is a good video tutorial: https://youtu.be/PcvsOaixUh8

3 Install python3

If you have not already, install python on your computer. You will be using this all semester. Python version 3.10 has been released, but anything after 3.7 is fine. You can check your version on the command line like this:

python3 --version
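
The same check can be made from inside Python. This is only a minimal sketch: the helper name `version_ok` is hypothetical, and treating 3.7 itself as acceptable is an assumption about what "anything after 3.7" means.

```python
import sys

MIN_VERSION = (3, 7)  # assumed floor, based on the text above


def version_ok(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    print(sys.version.split()[0], "OK" if version_ok() else "too old")
```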

4 Install some python tools and libraries

pip3 install pandas

pip3 install numpy

pip3 install scikit-learn

pip3 install matplotlib

pip3 install black

(If you prefer to use conda, that is OK with me.)
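
To confirm the installs worked, a short script along these lines may help. It is a sketch, not part of the assignment: the helper name `missing_packages` is hypothetical, and the list uses import names, which differ from the pip name for scikit-learn.

```python
import importlib.util

# Import names for the packages installed above.
# Note that scikit-learn is imported as "sklearn".
REQUIRED = ["pandas", "numpy", "sklearn", "matplotlib", "black"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "none")
```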

5 JupyterLab

Jupyter notebooks allow an author to mix text (in Markdown) and code (in Python) in a single document that you can work with in a browser. I will use them from time to time to send you annotated examples.

Install JupyterLab on your machine.

pip3 install jupyterlab

Your homework directory includes main.ipynb. In that directory, start jupyter-lab:

jupyter-lab

In the pane in your browser, you should see main.ipynb. Study and run the code in each cell.

Replace the question marks in this sentence: The mean waist size is 1.21 meters. (2 points)

6 Apply for GitHub Copilot

As a student, you get free access to Copilot (which is used from an extension in VS Code). You should apply: Copilot will make your life easier, but it is also an astonishing example of what machine learning can do.

https://copilot.github.com

7 Write some python using pandas

There is a file called make_report.py. When you fill in the missing lines, you will be able to run it like this:

python3 make_report.py employees.csv

Then, it will print a report like this:

*** Basics ***

Rows: 10,000

Columns: 6

*** Columns ***

employee_id: int64

Range: 1712 - 9998838

gender: object

Missing in 82 rows (0.8%)

4917: m

4907: f

36: F

23: M

19: male

16: female

height: float64

Range: 1.34 - 2.07

Mean: 1.71

Standard deviation: 0.11

Median: 1.71

waist: float64

Range: 0.47 - 2.18

Mean: 1.21

Standard deviation: 0.23

Median: 1.19

salary: float64

Missing in 70 rows (0.7%)

Range: 297.0 - 140902.0

Mean: 63033.98

Standard deviation: 20093.83

Median: 63078.50

dob: object

Range: 1945-01-01 - 1984-12-21

death: object

Range: 1960-03-20 - 2022-06-12

DO NOT LOOP THROUGH ALL 10,000 ROWS. Let pandas do that for you.
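
To illustrate what "let pandas do that" means, here is a small sketch on a made-up DataFrame. The values are invented and are not from employees.csv, and this is not the series_report solution; it only shows that missing-value counts, value counts, and summary statistics each take a single vectorized call.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame(
    {
        "gender": ["m", "f", None, "f"],
        "height": [1.60, 1.70, 1.80, 1.90],
    }
)

rows, cols = df.shape  # no loop needed to count rows and columns

missing = df["gender"].isna().sum()  # number of missing values
counts = df["gender"].value_counts()  # frequency of each non-missing value

heights = df["height"]
print(f"Rows: {rows:,}  Columns: {cols}")
print(f"Missing in {missing} rows ({missing / rows:.1%})")
print(counts)
print(f"Range: {heights.min()} - {heights.max()}")
print(f"Mean: {heights.mean():.2f}  Median: {heights.median():.2f}")
```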

Do this work by yourself. Stackoverflow is OK. A hint from another student is OK. Looking at another student’s code is not OK.

Include the completed make_report.py in the zip file. Also copy your series_report function here: (4 points)

def series_report(
    series, is_ordinal=False, is_continuous=False, is_categorical=False
):
    ...

In my solution, this function is 18 lines long.

8 Write some SQL

There is an employees.db file containing similar data in sqlite3 format.

In one SQL query, get the mean height of all employees who have a salary greater than $35,000. (2 points)

sqlite> SELECT AVG(height) FROM Employee WHERE salary > 35000;
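
Queries like this can also be run from Python with the standard-library sqlite3 module. The sketch below builds a tiny in-memory table with invented rows. The table name Employee comes from the query above, but the column list and the data are assumptions, not the contents of employees.db.

```python
import sqlite3

# In-memory database with made-up rows (not the real employees.db)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (employee_id INTEGER, height REAL, salary REAL)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, 1.60, 30000.0), (2, 1.70, 40000.0), (3, 1.90, 50000.0), (4, 1.75, None)],
)

# Only rows passing the WHERE clause are averaged;
# a NULL salary never satisfies the comparison, so row 4 is excluded
(mean_height,) = conn.execute(
    "SELECT AVG(height) FROM Employee WHERE salary > ?", (35000,)
).fetchone()
conn.close()

print(f"Mean height: {mean_height:.2f}")
```

Passing the threshold as a `?` parameter rather than formatting it into the string is the usual habit with sqlite3, even for a fixed number like this one.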

9 Install LaTeX and build a PDF

There are a lot of ways to install a TeX/LaTeX processing system. I use TeX Live (https://www.tug.org/texlive/).

After installing it, you will be able to render this document into PDF like this:

pdflatex report.tex

Open report.pdf to make sure it looks good. (Did you put your name and email in the author section?)

Include that PDF in the zip file you turn in. (2 points)

10 Tidy up

Before you zip up this directory and submit it, clean things up for the graders:

• Rename the folder. First name "Derek"? Last name "Zoolander"? The folder should be HW01_Zoolander_Derek.

• Reformat your code with black: black make_report.py.

• Delete intermediate files from pdflatex: report.aux, report.log, report.synctex.gz.

When you zip this directory, it should be called HW01_Zoolander_Derek.zip.
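
If you would rather script the cleanup, the steps above could be sketched roughly as follows. The function name `clean_and_zip` is hypothetical, and the sketch assumes the LaTeX file is named report.tex as in the pdflatex command earlier. It does not rename the folder or run black for you.

```python
import shutil
from pathlib import Path

# Intermediate pdflatex files named in the list above
INTERMEDIATE_SUFFIXES = [".aux", ".log", ".synctex.gz"]


def clean_and_zip(folder, stem="report"):
    """Delete pdflatex leftovers in `folder`, then zip it next to itself."""
    folder = Path(folder)
    for suffix in INTERMEDIATE_SUFFIXES:
        # missing_ok avoids an error if a file was never produced (Python 3.8+)
        (folder / f"{stem}{suffix}").unlink(missing_ok=True)
    # Creates <parent>/<folder name>.zip containing the folder itself
    archive = shutil.make_archive(
        str(folder), "zip", root_dir=folder.parent, base_dir=folder.name
    )
    return Path(archive)
```

Call it on your renamed homework folder; the zip file takes the folder's name, so renaming first gives the archive the right name too.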

Our amazing TAs have to check homeworks from a lot of students, so this sort of tidiness is very important for every assignment. If your code doesn’t immediately run as-is, you will get points off. If we can’t find your name on the folder or the PDF, you will get points off.

The most common problem is a hard-coded absolute path to your data file, something like "C://home/zoolander/gsu/hw1/employees.csv". Leave the data file in the directory and use a relative path like "employees.csv".
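
One way to keep paths portable is sketched below, assuming the script receives the file name as its first command-line argument, as in the make_report.py invocation earlier. The helper name `data_path` is hypothetical.

```python
import sys
from pathlib import Path


def data_path(argv, default="employees.csv"):
    """Return the data file path from the command line, or a relative default."""
    # argv[0] is the script name; argv[1], if present, is the data file
    name = argv[1] if len(argv) > 1 else default
    return Path(name)


if __name__ == "__main__":
    path = data_path(sys.argv)
    # A relative path resolves against the directory the script is run from,
    # so it works on the grader's machine as well as yours
    print(f"Reading {path} (absolute: {path.is_absolute()})")
```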

11 Looking ahead

Want to look ahead a little? We will be doing data visualization with matplotlib next. Here is a good video tutorial: https://youtu.be/UO98lJQ3QGI
