Exercise 1: Word frequency counter Solution

The goal of this exercise is to

introduce command line argument processing

show typical file processing

get experience with dictionaries

Submit a single plain text file called freq.py that contains your solution. As with the last exercise, you do not have to submit a README, nor do you have to zip the file. Just submit the plain text .py file.

Unix man pages

Note: this is a motivating example only. The actual task (described below) does not require any knowledge of Unix commands beyond running a Python script with python3.

The typical unix command has a description that begins with something like:

sort [OPTION]... [FILE]...

The [OPTION] part is where command line options are placed, with the convention that options begin with - or --. For example

-c, --check

check whether input is sorted; do not sort

-k, --key=POS1[,POS2]

start a key at POS1, end it at POS2 (origin 1)

Some options are just flags, like -c; others are (key, value) pairs, like --key=-1,2.

After the options, the remaining tokens are typically file names.

Try

man sort

at the terminal to see the typical gory details of a man page.

Processing command line arguments

When you enter a command into the unix shell, the tokens on the line are passed as a list, sys.argv, to your program. You then have to interpret them as options and file names.

For example, this program (cl1.py) shows all the options passed to it. Note that blanks are used to divide tokens, so to have a blank in an option you need to place it in quotes.

import sys

for t in sys.argv:
    print("|{}|".format(t))

This command

$ python3 cl1.py arg "hello" ' ' "" "two tokens" last

outputs

|cl1.py|

|arg|

|hello|

| |

||

|two tokens|

|last|

Note how the program name is the first argument in the list.
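
Because the program name occupies sys.argv[0], code that interprets the arguments usually starts from the second token. The following is a minimal sketch of that idea, not the contents of cl2.py: split_args is a hypothetical helper, and it assumes every token beginning with - is an option and everything else is a file name.

```python
import sys


def split_args(argv):
    """Split a sys.argv-style list into (options, file_names).

    Hypothetical helper: any token starting with '-' is treated as an
    option, and every other token is treated as a file name.
    """
    options = []
    file_names = []
    for token in argv[1:]:  # argv[0] is the program name, so skip it
        if token.startswith("-"):
            options.append(token)
        else:
            file_names.append(token)
    return options, file_names


if __name__ == "__main__":
    opts, files = split_args(sys.argv)
    print("options:", opts)
    print("files:", files)
```

Even this small version ignores details such as options that take a separate value, which hints at why hand-written parsing grows quickly.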

There is a Python module called argparse that makes command line parsing relatively simple. The program cl2.py shows how you could do it directly, and more aptly illustrates why you want to use something like argparse. The program worder.py illustrates how to use argparse.

Your Task

Write a program freq.py that does text analysis on an input text. The main idea is to read in a text and do frequency analysis of the words in the text. After reading in the file and breaking it into words (a word is any sequence of non-blank characters), the program prints a frequency table of the words in the input file. Lines in a frequency table look like:

word count freq

where

word is the word in the text

count is the number of times the word occurs in the text

freq is a number in the range [0,1] that is the ratio of the count for the word to the total number of words found in the text.
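
The three columns defined above can be computed with a plain dictionary. This is a sketch under stated assumptions rather than a reference solution: frequency_table is a hypothetical name, and words are taken to be whatever str.split() returns, that is, runs of non-whitespace characters.

```python
def frequency_table(text):
    """Return a list of (word, count, freq) rows for the given text."""
    words = text.split()  # split on any run of whitespace
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    total = len(words)
    # freq is the word's count divided by the total number of words
    return [(word, count, count / total) for word, count in counts.items()]


for word, count, freq in frequency_table("the cat saw the dog"):
    print(word, count, freq)
```

For an empty text the dictionary is empty, so no division takes place and the table has no rows.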

The behaviour of the program is controlled by the following options:

--ignore-case

an option that ignores the case (upper/lower) when doing all actions.

--sort=[{byfreq,byword}]

byfreq - print a frequency table sorted by frequency in decreasing frequency order.

byword - print a frequency table sorted by word in increasing lexicographical order.

--remove-punct - throws away any punctuation in a word. Punctuation characters are those for which curses.ascii.ispunct() from the curses.ascii module returns True.

--help or -h - print the help text below

For an example, see the file sample-output.txt included with this description.

You may assume the user always uses this script correctly (i.e. will only call with valid command line arguments and a valid text file).

You are not required to use argparse, but you may if you want.
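
The --ignore-case, --remove-punct and --sort options described above could be handled along the following lines. This is only a sketch: normalize and sort_rows are hypothetical names, breaking ties by word under byfreq is an assumption the task does not state, and the fallback to string.punctuation (which holds the same ASCII punctuation characters) is there only for platforms where the curses package cannot be imported.

```python
import string

try:
    from curses.ascii import ispunct  # the predicate named in the task
except ImportError:
    # curses is not available on every platform; string.punctuation
    # contains the same ASCII punctuation characters.
    def ispunct(c):
        return c in string.punctuation


def normalize(word, ignore_case=False, remove_punct=False):
    """Apply the --remove-punct and --ignore-case transformations."""
    if remove_punct:
        word = "".join(c for c in word if not ispunct(c))
    if ignore_case:
        word = word.lower()
    return word


def sort_rows(rows, order="byfreq"):
    """Sort (word, count, freq) rows according to the --sort option."""
    if order == "byword":
        return sorted(rows, key=lambda row: row[0])
    # byfreq: decreasing count; ties broken by word (an assumption)
    return sorted(rows, key=lambda row: (-row[1], row[0]))


print(normalize("Hello,", ignore_case=True, remove_punct=True))
```

A token made up entirely of punctuation becomes an empty string after normalize; whether such tokens should be counted is not specified in the task, so the caller has to decide.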

Examples of Help Output

Your program, on this command,

$ python3 freq.py --help

should produce something like this (it does not need to match exactly, as long as the intended message is made clear):

usage: freq.py [-h] [--help] [--sort=[{byfreq,byword}]] [--ignore-case] [--remove-punct] [infile]

Text frequency analysis.

positional arguments:

infile

file to be sorted, stdin if omitted.

optional arguments:

--help or -h

show this help message and exit

--sort=[{byfreq,byword}]

Frequency table sort options:

byfreq - (default) sort by decreasing frequency.

byword - sort by word in increasing lexicographical order

--ignore-case

ignore upper/lower case when doing all actions.

--remove-punct

remove all punctuation characters in a word, preserving only the alphanumeric characters
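
Help text of roughly this shape can be generated by argparse. The sketch below is one possible arrangement, not worder.py: build_parser and read_text are hypothetical names, and it assumes --sort always takes a value, which differs slightly from the bracketed form in the usage line above. argparse adds -h and --help on its own.

```python
import argparse
import sys


def build_parser():
    """Build a parser for the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="freq.py", description="Text frequency analysis.")
    parser.add_argument(
        "--sort", choices=["byfreq", "byword"], default="byfreq",
        help="frequency table sort options (default: byfreq)")
    parser.add_argument(
        "--ignore-case", action="store_true",
        help="ignore upper/lower case when doing all actions")
    parser.add_argument(
        "--remove-punct", action="store_true",
        help="remove all punctuation characters in a word")
    parser.add_argument(
        "infile", nargs="?", default=None,
        help="file to be sorted, stdin if omitted")
    return parser


def read_text(path):
    """Read the whole input from path, or from stdin when path is None."""
    if path is None:
        return sys.stdin.read()
    with open(path) as f:
        return f.read()


# Parse an explicit token list here so the example does not depend on
# how the script itself was invoked; a real script would call
# parse_args() with no arguments to use sys.argv.
args = build_parser().parse_args(["--sort=byword", "--ignore-case"])
print(args.sort, args.ignore_case, args.remove_punct, args.infile)
```

Dashes in option names become underscores in the parsed result, so --ignore-case is read back as args.ignore_case.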

cl1.py

cl2.py

sample-input.txt

sample-output.txt

worder.py
