The goal of this exercise is to
introduce command line argument processing
show typical file processing
get experience with dictionaries
Submit a single plain text file called freq.py that contains your solution. As with the last exercise, you do not have to submit a README or zip the file. Just submit the plain text .py file.
Unix man pages
Note: this is a motivating example only. The actual task (described below) does not require any knowledge of unix commands beyond running a Python script with python3.
The typical unix command has a description that begins with something like:
sort [OPTION]... [FILE]...
The [OPTION] part is where command line options are placed, with the convention that options begin with - or --. For example
-c, --check
check whether input is sorted; do not sort
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
Some options are just flags, like -c; others are (key, value) pairs, like --key=-1,2.
After the options, the remaining tokens are typically file names.
Try
man sort
at the terminal to see the typical gory details of a man page.
Processing command line arguments
When you enter a command into the unix shell, the tokens on the line are passed as a list, sys.argv, to your program. You then have to interpret them as options and file names.
For example, this program (cl1.py) shows all the options passed to it. Note that blanks are used to divide tokens, so to have a blank in an option you need to place it in quotes.
import sys
for t in sys.argv:
    print("|{}|".format(t))
This command
$ python3 cl1.py arg "hello" ' ' "" "two tokens" last
outputs
|cl1.py|
|arg|
|hello|
| |
||
|two tokens|
|last|
Note how the program name is the first argument in the list.
There is a Python module called argparse that makes command line parsing relatively simple. The program cl2.py shows how you could do the parsing directly and, more to the point, illustrates why you would want to use something like argparse. The program worder.py illustrates how to use argparse.
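To make the idea of interpreting tokens by hand concrete, here is a minimal sketch. It is not the course's cl2.py; the function name split_args and the exact rules (anything of the form --key=value is an option, any other token starting with - is a flag, everything else is a file name) are assumptions chosen for illustration.

```python
import sys


def split_args(argv):
    """Separate command line tokens into flags, key=value options, and file names.

    Hypothetical helper: the classification rules are illustrative assumptions.
    """
    flags, options, files = set(), {}, []
    for token in argv[1:]:  # argv[0] is the program name, so skip it
        if token.startswith("--") and "=" in token:
            # Split only on the first '=' so the value may itself contain '='
            key, value = token[2:].split("=", 1)
            options[key] = value
        elif token.startswith("-") and len(token) > 1:
            flags.add(token.lstrip("-"))
        else:
            files.append(token)
    return flags, options, files


if __name__ == "__main__":
    print(split_args(sys.argv))
```

Even this small version has to make decisions about edge cases (a lone -, repeated options, unknown flags), which is the kind of bookkeeping argparse takes care of.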
Your Task
Write a program freq.py that does text analysis on an input text. The main idea is to read in a text and do frequency analysis of the words in the text. After reading in the file and breaking it into words (a word is any sequence of non-blank characters), the program prints a frequency table of the words in the input file. Lines in a frequency table look like:
word count freq
where
word is the word in the text
count is the number of times the word occurs in the text
freq is a number in the range [0,1] that is the ratio of the count for the word to the total number of words found in the text.
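The definitions of count and freq above can be sketched with a dictionary as follows. This is one possible approach under stated assumptions, not a required design: the function name frequency_table is hypothetical, and the text is assumed to have already been read into a single string.

```python
def frequency_table(text):
    """Map each word to a (count, freq) pair.

    A word is any sequence of non-blank characters, which is exactly
    what str.split() with no arguments produces.
    """
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    total = sum(counts.values())  # total number of words found in the text
    # If the text is empty, counts is empty and no division takes place.
    return {word: (count, count / total) for word, count in counts.items()}


if __name__ == "__main__":
    for word, (count, freq) in frequency_table("the cat saw the dog").items():
        print(word, count, freq)
```

How freq should be formatted when printed is not stated here; compare against sample-output.txt before settling on a format.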
The behaviour of the program is controlled by the following options:
--ignore-case
an option that ignores the case (upper/lower) when doing all actions.
--sort=[{byfreq,byword}]
byfreq - print a frequency table sorted by frequency in decreasing frequency order.
byword - print a frequency table sorted by word in increasing lexicographical order.
--remove-punct - throws away any punctuation in a word. Punctuation characters are those for which curses.ascii.ispunct() from the curses.ascii module returns True.
--help or -h - print the help text below
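The effect of --ignore-case and --remove-punct on a single word can be sketched as below. The function name normalize and its keyword parameters are hypothetical. Two assumptions are worth noting: case is ignored by converting to lower case, and a word that consists only of punctuation becomes the empty string, which the caller would then have to decide how to handle (the description above does not say). The curses module ships with Python on Unix-like systems.

```python
import curses.ascii


def normalize(word, ignore_case=False, remove_punct=False):
    """Apply the optional clean-up steps to one word (illustrative sketch)."""
    if remove_punct:
        # Keep only the characters that ispunct() does not classify as punctuation
        word = "".join(ch for ch in word if not curses.ascii.ispunct(ch))
    if ignore_case:
        word = word.lower()  # assumption: lower case is the canonical form
    return word


if __name__ == "__main__":
    print(normalize("Hello,", ignore_case=True, remove_punct=True))
```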
For an example, see the file sample-output.txt included with this description.
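The two --sort orders described above can be sketched as follows, assuming the counts are held in a dictionary that maps each word to its count. The function name sorted_rows is hypothetical, and breaking ties between equally frequent words alphabetically is an assumption, since the description does not specify tie order.

```python
def sorted_rows(counts, order="byfreq"):
    """Return (word, count) pairs in the requested order (illustrative sketch)."""
    rows = list(counts.items())
    if order == "byword":
        # Increasing lexicographical order of the word
        return sorted(rows, key=lambda row: row[0])
    # Decreasing count; ties broken by word (an assumption, not a requirement)
    return sorted(rows, key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    print(sorted_rows({"b": 2, "a": 2, "c": 5}))
    print(sorted_rows({"b": 2, "a": 2, "c": 5}, order="byword"))
```

Sorting by count gives the same order as sorting by freq, because every freq is the count divided by the same total. Note also that Python compares strings by code point, so upper case letters sort before lower case ones unless --ignore-case has already folded them.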
You may assume the user always uses this script correctly (i.e. will only call with valid command line arguments and a valid text file).
You are not required to use argparse, but you may if you want.
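For those who do choose argparse, the options listed above might be declared roughly as follows. This is a sketch, not the course's worder.py: the function names build_parser and read_text are hypothetical, the help strings are abbreviated, and --sort is declared as always taking a value, which is a simplifying assumption. argparse adds -h and --help automatically.

```python
import argparse
import sys


def build_parser():
    """Declare the freq.py options with argparse (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        prog="freq.py", description="Text frequency analysis.")
    parser.add_argument("--sort", choices=["byfreq", "byword"],
                        default="byfreq", help="frequency table sort order")
    parser.add_argument("--ignore-case", action="store_true",
                        help="ignore upper/lower case")
    parser.add_argument("--remove-punct", action="store_true",
                        help="remove punctuation characters from words")
    # nargs="?" makes the file name optional; None means "read stdin"
    parser.add_argument("infile", nargs="?", default=None,
                        help="input file, stdin if omitted")
    return parser


def read_text(path):
    """Read the whole input from the named file, or from stdin if path is None."""
    if path is None:
        return sys.stdin.read()
    with open(path) as infile:
        return infile.read()
```

A program would then call build_parser().parse_args() and read args.sort, args.ignore_case, args.remove_punct, and args.infile; argparse converts the dashes in option names to underscores.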
Examples of Help Output
Your program, on this command,
$ python3 freq.py --help
should produce something like this (it does not need to match exactly, as long as the intended message is made clear)
usage: freq.py [-h] [--help] [--sort=[{byfreq,byword}]] [--ignore-case] [--remove-punct] [infile]
Text frequency analysis.
positional arguments:
infile
file to be sorted, stdin if omitted.
optional arguments:
--help or -h
show this help message and exit
--sort=[{byfreq,byword}]
Frequency table sort options:
byfreq - (default) sort by decreasing frequency.
byword - sort by word in increasing lexicographical order
--ignore-case
ignore upper/lower case when doing all actions.
--remove-punct
remove all punctuation characters in a word, preserving only the alphanumeric characters
cl1.py
cl2.py
sample-input.txt
sample-output.txt
worder.py