1 Purpose
In every data science job, you get two things on the first day: an email address and a laptop. Your first task is to get the laptop ready for work.
The second thing you do is get a table of data that you are expected to understand. Here we will be using Pandas and SQLite to explore some data.
2 Study
Read pages 71 - 133 of Practical Data Science with Python.
Optional: We will be using pandas all semester. You should get comfortable with it. Here is a
good video tutorial: https://youtu.be/PcvsOaixUh8
3 Install python3
If you have not already, install python on your computer. You will be using this all semester. Python version 3.10 has been released, but anything after 3.7 is fine. You can check your version on the command line like this:
python3 --version
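The same check can be made from inside Python. This is only a sketch: the helper name version_ok is made up for illustration, and it assumes that 3.7 itself counts as acceptable.

```python
import sys

MINIMUM = (3, 7)  # assumed floor, based on the version mentioned above


def version_ok(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the given interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    status = "OK" if version_ok() else "too old"
    print("Python", sys.version.split()[0], status)
```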
4 Install some python tools and libraries
pip3 install pandas
pip3 install numpy
pip3 install scikit-learn
pip3 install matplotlib
pip3 install black
(If you prefer to use conda, that is OK with me.)
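To confirm that the installs worked, one option is to ask Python whether each package can be found. The sketch below uses importlib from the standard library. The helper name missing_packages is hypothetical, and note that scikit-learn is imported under the name sklearn.

```python
from importlib.util import find_spec

# Import names, which can differ from pip names (scikit-learn imports as sklearn).
REQUIRED = ["pandas", "numpy", "sklearn", "matplotlib", "black"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("All set" if not missing else "Missing: " + ", ".join(missing))
```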
5 JupyterLab
Jupyter notebooks allow an author to mix text (in Markdown) and code (in Python) in a single document that you can work with in a browser. I will use them from time to time to send you annotated examples.
Install JupyterLab on your machine.
pip3 install jupyterlab
Your homework directory includes main.ipynb. In that directory, start jupyter-lab:
jupyter-lab
In the pane on the left of your browser, you should see main.ipynb. Study and run the code in each cell.
Replace the question marks in this sentence: The mean waist size is 1.21 meters. (2 points)
6 Apply for GitHub Copilot
As a student, you get free access to Copilot (which is used from an extension in VS Code). You should apply: Copilot will make your life easier, but it is also an astonishing example of what machine learning can do.
https://copilot.github.com
7 Write some python using pandas
There is a file called make_report.py. When you fill in the missing lines, you will be able to run it like this:
python3 make_report.py employees.csv
Then, it will print a report like this:
*** Basics ***
Rows: 10,000
Columns: 6
*** Columns ***
employee_id: int64
Range: 1712 - 9998838
gender: object
Missing in 82 rows (0.8%)
4917: m
4907: f
36: F
23: M
19: male
16: female
height: float64
Range: 1.34 - 2.07
Mean: 1.71
Standard deviation: 0.11
Median: 1.71
waist: float64
Range: 0.47 - 2.18
Mean: 1.21
Standard deviation: 0.23
Median: 1.19
salary: float64
Missing in 70 rows (0.7%)
Range: 297.0 - 140902.0
Mean: 63033.98
Standard deviation: 20093.83
Median: 63078.50
dob: object
Range: 1945-01-01 - 1984-12-21
death: object
Range: 1960-03-20 - 2022-06-12
DO NOT LOOP THROUGH ALL 10,000 ROWS. Let pandas do that for you.
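As an illustration of letting pandas do the work, here is a small sketch on invented toy data (not employees.csv). Each statistic comes from a single call on the whole column. The only loop runs over the handful of distinct categories, not over rows. The variable names are made up for this example, and this is not the intended solution.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from employees.csv.
df = pd.DataFrame(
    {
        "height": [1.60, 1.75, 1.82, 1.68],
        "gender": ["m", "f", None, "f"],
    }
)

# Each call below operates on the entire column at once.
height_min = df["height"].min()
height_max = df["height"].max()
height_mean = df["height"].mean()
missing_gender = df["gender"].isna().sum()  # number of missing values
gender_counts = df["gender"].value_counts()  # counts per category, missing excluded

print(f"Range: {height_min} - {height_max}")
print(f"Mean: {height_mean:.2f}")
print(f"Missing in {missing_gender} rows ({missing_gender / len(df):.1%})")
for value, count in gender_counts.items():  # loops over categories, not rows
    print(f"{count}: {value}")
```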
Do this work by yourself. Stack Overflow is OK. A hint from another student is OK. Looking at another student’s code is not OK.
Include the completed make_report.py in the zip file. Also copy your series_report function here: (4 points)
def series_report(
    series, is_ordinal=False, is_continuous=False, is_categorical=False
):
    ...
In my solution, this function is 18 lines long.
8 Write some SQL
There is an employees.db file containing similar data in sqlite3 format.
In one SQL query, get the mean height of all employees who have a salary greater than $35,000. (2 points)
sqlite> SELECT rowid, AVG(height) FROM Employee WHERE salary > 35000;
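The same kind of query can also be run from Python with the standard sqlite3 module. The sketch below builds a tiny in-memory database with invented rows. The table name Employee follows the query above, but the two-column layout is an assumption and the real employees.db may differ.

```python
import sqlite3

# In-memory database with invented rows; the real employees.db may have
# a different layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (height REAL, salary REAL)")
conn.executemany(
    "INSERT INTO Employee (height, salary) VALUES (?, ?)",
    [(1.60, 30000.0), (1.70, 40000.0), (1.80, 50000.0)],
)

# The WHERE clause filters rows first; AVG then aggregates what is left.
(mean_height,) = conn.execute(
    "SELECT AVG(height) FROM Employee WHERE salary > ?", (35000,)
).fetchone()
conn.close()

print(f"Mean height: {mean_height:.2f}")
```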
9 Install LaTeX and build a PDF
There are a lot of ways to install a TeX/LaTeX processing system. I use TeX Live (https://www.tug.org/texlive/).
After installing it, you will be able to render this document into PDF like this:
pdflatex report.tex
Open report.pdf to make sure it looks good. (Did you put your name and email in the author section?)
Include that PDF in the zip file you turn in. (2 points)
10 Tidy up
Before you zip up this directory and submit it, clean things up for the graders:
• Rename the folder. First name "Derek"? Last name "Zoolander"? The folder should be HW01_Zoolander_Derek.
• Reformat your code with black: black make_report.py.
• Delete intermediate files from pdflatex: report.aux, report.log, report.synctex.gz.
When you zip this directory, it should be called HW01_Zoolander_Derek.zip.
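If you would rather script the cleanup, the following is one possible sketch using only the standard library. The function name tidy_and_zip is hypothetical. It deletes the intermediate files listed above and writes a zip file next to the folder, with the folder as the top-level entry. Renaming the folder and running black are still up to you.

```python
import shutil
from pathlib import Path

# Intermediate pdflatex files named in the list above.
INTERMEDIATE = ["report.aux", "report.log", "report.synctex.gz"]


def tidy_and_zip(folder):
    """Delete pdflatex leftovers in `folder`, then zip the folder next to itself."""
    folder = Path(folder).resolve()
    for name in INTERMEDIATE:
        leftover = folder / name
        if leftover.exists():
            leftover.unlink()
    # Creates <folder>.zip; entries inside start with the folder name.
    archive = shutil.make_archive(
        str(folder), "zip", root_dir=str(folder.parent), base_dir=folder.name
    )
    return Path(archive)
```

Calling tidy_and_zip on your homework folder returns the path of the new zip file.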
Our amazing TAs have to check homeworks from a lot of students, so this sort of tidiness is very important for every assignment. If your code doesn’t immediately run as-is, you will get points off. If we can’t find your name on the folder or the PDF, you will get points off.
The most common problem is that your code contains an absolute path to your data file, something like "C://home/zoolander/gsu/hw1/employees.csv". That path exists only on your machine. Leave the data file in the directory and use a relative path like "employees.csv".
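One way to avoid hard-coded paths is to take the file name from the command line and fall back to a relative default. This is a minimal sketch, and the helper name data_path is made up for illustration.

```python
import sys
from pathlib import Path


def data_path(argv, default="employees.csv"):
    """Return the data file named on the command line, else a relative default."""
    return Path(argv[1]) if len(argv) > 1 else Path(default)


if __name__ == "__main__":
    path = data_path(sys.argv)
    # A relative path is resolved against the directory the script is run from.
    print(f"Reading {path} (absolute: {path.is_absolute()})")
```

Because the path is relative, the script works in whatever directory the graders unzip your folder into, provided they run it from that folder.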
11 Looking ahead
Want to look ahead a little? We will be doing data visualization with matplotlib next. Here is a good video tutorial: https://youtu.be/UO98lJQ3QGI