1 Introduction and purpose
In this project you will write a concurrent (multithreaded) program to perform a simple task. The purpose of the project is to get some experience with concurrency using Pthreads. The different lecture examples illustrating Pthreads show all the features that you will need to write this project. You just have to figure out how to put the pieces together, so the first step will be to carefully study those lecture examples.
This is another smaller project, so less time is allotted for it. As has been emphasized many times, the instructional staff cannot answer questions about projects via messages or email.
Important: after the semester is over, be sure to clean up your TerpConnect account, as described in Appendix D.
2 Problem to be solved
The project tarfile contains a program wc.c, which implements the basic functionality of the UNIX word count utility wc. wc counts and prints the number of lines, words, and characters in its standard input, or in a file whose name appears on its command line. If multiple filenames appear on its command line (or UNIX wildcard characters that describe multiple files), wc will print the number of lines, words, and characters in each file individually, followed by the total number of lines, words, and characters in all of the files combined. For example, if you extract the files from the Project #11 tarfile, cd to the project11 subdirectory, and run wc *, you will get the following output:
  14    82    458 public1
  12   113    707 public1.inputdata
   1     3     15 public1.output
  13    80    448 public2
   1     3     15 public2.output
  13    82    454 public3
   0     0      0 public3.inputdata
   1     3     15 public3.output
  12    59    342 public4
  94  1017   5475 public4.inputdata1
 109  1141   5959 public4.inputdata2
   1     3     16 public4.output
  13    71    404 public5
 285 33117 186808 public5.inputdata1
 252 35535 199498 public5.inputdata2
   1     3     18 public5.output
  26   166    961 public6
 169  8025  56735 public6.inputdata
   1     3     16 public6.output
  26   139    845 public7
   2    10     73 public7.output
  73   376   2442 wc.c
1119 80031 461704 total
As mentioned, the three numbers on each line are the number of lines, number of words, and number of characters for the file on that line, and the last line is the total number of lines, words, and characters for all files combined.
If all we care about is the last line, containing total counts for all files aggregated, we could use concurrent threads to speed up the computation, with a different thread reading and counting the lines, words, and characters of each file. Since I/O is very slow compared to the speed of the CPU and memory, being able to read and process multiple files concurrently could save time.
Our wc.c program produces this output, meaning the final three numbers, for all of the files on its command line. If you compile it and run it on the files in the project11 directory it will print a single output line reading 1119 80031 461704, which is the last line of the results above. (If you compile it so its executable is in the project11 directory then the size of the executable will change the output, so I compiled it as gcc wc.c -o ../wc.x to put the executable in the parent directory, then I ran it as ../wc.x *, and got the single line with the three numbers above as output.)
© 2023 L. Herman; all rights reserved
As far as the results it produces are concerned, our program wc.c is wonderful, does exactly what we want, and works perfectly. The only problem is that we forgot that we had intended it to be multithreaded, as mentioned above. Your task is to rectify this.
In particular, you are to copy wc.c to another file and modify it so that it creates one thread per filename appearing on its command line, so the number of threads created will equal the number of arguments appearing on the command line that follow the program name itself. Each thread needs to be passed the name of a file whose lines, words, and characters it will count. Each thread should open the file with the name that was passed to it, read the contents of its file to count these three things, and return the results, meaning the total number of lines, words, and characters in that thread’s file. (If the file does not even exist the thread should return 0 for all three counts.)
The code that is currently in the main() function of wc.c, for counting lines, words, and characters of a single file, must be turned into a function that is invoked by each thread. main() will have to create the threads, access their return values, and sum the results returned by the threads to compute the three total numbers. When all the threads have finished, main() must print the total lines, words, and characters of all the files whose names appeared on the command line, which will be the sums of the counts returned by all the threads.
For any invalid filenames, i.e., command–line arguments that are not the names of existing files, our original wc.c program just uses a count of zero lines, zero words, and zero characters. Your multithreaded version should do the same.
The output of your multithreaded program will be exactly the same as the original single–threaded version of wc.c given to you. The only difference is that it will use multiple threads, one for each file argument.
Your modified program must use multiple threads, one for each file argument. You could submit a copy of the original (single–threaded) program that is given to you, and it would pass all the tests. But we will be checking during grading whether your program uses multiple threads. If it does not, you will get zero points for the entire project. (Consequently, due to the minimum requirements policy for projects, you would not pass the course unless you were to later submit a version by the end of the semester that passes at least half of the public tests and does use threads. Furthermore, just submitting our program as your work for this project could be considered an academic integrity violation, so we would have to involve the Office of Student Conduct to determine that.)
Note that besides only printing the total counts for multiple files (but not the counts for each file separately first) there is another way that our wc.c program differs from the behavior of the real UNIX wc utility, which is that we do not attempt to emulate the results that wc produces for word counts when operating upon binary files. But we only care about the counts of lines, words, and characters in text files, not binary files, so this discrepancy does not concern us. Note that all of the project tests will be text files.
A Development procedure review
A.1 Obtaining the project files and compiling your program
Log into the Grace machines and use commands similar to those from before:
cd ~/216
tar -zxvf ~/216public/project11/project11.tgz
This will create a directory project11 with the necessary files for the project, including our wc.c and the public tests. You need to cd to the project11 directory, copy wc.c to wc-threaded.c, and modify wc-threaded.c to use threads as described above.
In most projects this semester you wrote functions that were called from our main programs. A few projects were exceptions, in that your code was the main program. This project is one of those exceptions. Like wc.c, your wc-threaded.c will be a complete, standalone program (although one that uses multiple threads). Because there is only one executable program in the project, consisting of only one file, you don’t have to write a makefile, and you can just compile your program by hand. (You can write a makefile if you want; it will just be ignored on the submit server.) By now you should know how to use the gcc compiler (look in the UNIX tutorial for information if you need to), but don’t forget to add the -lpthread option necessary to compile (actually link) programs using Pthreads.
A.2 Running your program, checking your results, and submitting
The public test inputs are just text files, and your wc-threaded.x program will be run with the names of the files as command–line arguments. However, the public tests consist of small shell scripts that just run your program with the right arguments. This is because some tests have multiple command–line arguments, and the scripts save you from having to type commands with multiple arguments each time you want to run a test.
Run the script public1 to run your program for the first public test, public2 for the second one, etc. The scripts all assume that the executable filename for your compiled program is wc-threaded.x so that is the name you have to use when compiling. As before, use diff to compare your program’s output to the public test outputs that are in the project tarfile, for example public1 | diff - public1.output will test your code’s results on the first public test. You can also run run-tests2 to run your program on all the tests at once. Note: since there is not a required makefile for this project, be sure to recompile your program before running the tests, if you make any changes to it!
As has been discussed in class, a shell script is just a file that contains commands to be executed, so if your program is failing a test, look at that shell script to see what command or commands it is running, and run those commands manually to see what’s going on.
Running submit from the project directory will submit your project, but before you submit you must make sure you have passed all the public tests, by compiling and running them yourself.
A.3 Checking whether your program is concurrent
If you run one of the public tests and your program prints the three right numbers as output (diff says there are no differences between the test’s output and the expected output) then your program must be computing its result properly. But your program might be producing the right results, yet not be using threads correctly, which could cause you to lose significant credit during grading, despite your output being right. One error that is sometimes made by students without much experience with concurrency is writing programs that only run one thread at a time, which completely defeats the purpose of even using concurrency. There are various ways that things can be done wrong so that only one thread runs at a time, so it is difficult to explain every kind of mistake that could be made. But since you would lose considerable credit if your program only runs one thread at a time, you should take a few minutes to make sure your program is not doing this.
The easy way to check for this is to run your program under gdb, because gdb will print messages when each thread begins and finishes. You do not even need to set any breakpoints in gdb, just run wc-threaded.x under gdb with some filenames as command–line arguments (as an example, run gdb wc-threaded.x, then at the gdb prompt, enter run public*). If you get output that looks something like the first listing below, and on different runs of the program you see different orders of threads starting vs. exiting, your program should be running multiple threads concurrently. (The hexadecimal numbers are the threads’ IDs in hex, LWP stands for “lightweight process”, and the integer following “LWP” in each line is a number that the kernel uses to refer to threads.) However, if you get output like the second listing below, and even across different runs you always see every thread being created and exiting before the next one is created, then your program is not using threads correctly.
[New Thread 0x7ffff77f0700 (LWP 11671)]
[New Thread 0x7ffff6fef700 (LWP 11672)]
[New Thread 0x7ffff67ee700 (LWP 11673)]
[Thread 0x7ffff77f0700 (LWP 11671) exited]
[New Thread 0x7ffff5fed700 (LWP 11674)]
[Thread 0x7ffff6fef700 (LWP 11672) exited]
[New Thread 0x7ffff57ec700 (LWP 11675)]
[New Thread 0x7ffff4feb700 (LWP 11676)]
[Thread 0x7ffff67ee700 (LWP 11673) exited]
[Thread 0x7ffff57ec700 (LWP 11675) exited]
[Thread 0x7ffff4feb700 (LWP 11676) exited]

[New Thread 0x7ffff77b9700 (LWP 20342)]
[Thread 0x7ffff77b9700 (LWP 20342) exited]
[New Thread 0x7ffff77b9700 (LWP 20343)]
[Thread 0x7ffff77b9700 (LWP 20343) exited]
[New Thread 0x7ffff77b9700 (LWP 20344)]
[Thread 0x7ffff77b9700 (LWP 20344) exited]
[New Thread 0x7ffff77b9700 (LWP 20345)]
[Thread 0x7ffff77b9700 (LWP 20345) exited]
[New Thread 0x7ffff77b9700 (LWP 20346)]
[Thread 0x7ffff77b9700 (LWP 20346) exited]
[New Thread 0x7ffff77b9700 (LWP 20347)]
By the way, even if you do see that your threads are running and exiting in different orders when you run the program different times, this does not guarantee that you are doing everything correctly with concurrency. You could somehow be unnecessarily limiting the concurrent execution of threads inside the code where the thread function is reading from the file. But although this does not guarantee that your code is perfect, it will let you know, before you submit, whether you are making several common types of mistakes, so you can fix them if so.
A.4 Grading criteria
Your grade for this project will be based on:
public tests    65 points
secret tests    35 points
But note that, as explained above, if you do not use threads as described, your score will be zero.
B Project–specific requirements, suggestions, and other notes
◦ If your wc-threaded.c doesn’t use multiple threads, one for each file argument, you will not receive any credit for the project. (The entire purpose of this project is to use concurrency and threads.)
◦ Your threads must return values. There are different ways of returning values from threads in Pthreads, but you cannot just have the threads store their counts in global variables, static local variables, or other shared variables. There must be code in your program, run after the threads have been created and have finished, that gets their return values and adds them to three cumulative sums, and this must be done outside the threads (after the threads finish).
You also may not change any of the existing variables in the wc.c program by making them global or static local variables.
You will lose significant credit unless each thread returns the three counts of lines, words, and characters in the file that that thread is reading.
◦ Your program must use the minimum synchronization necessary to achieve correct results, but no more synchronization than that. At the extreme, if you were to only allow one thread at a time to do anything, your program would work fine, but as described you would effectively not even have used concurrency. Minimum synchronization just means that the program’s threads must be able to run concurrently as much as possible, and a thread should only wait for another one to do something when absolutely necessary to ensure correct results (meaning threads should only wait for others in cases where correct results cannot be achieved without that).
◦ Your program code cannot call sleep() anywhere. Some lecture examples of concurrency used sleep() in order to cause the results of small concurrent programs to be more random, so that you could see concurrency working. However, a more realistic concurrent program like this one will exhibit different behavior on its own; you do not need to make this happen artificially. Things have been set up so your program will not even compile on the submit server if it calls sleep().
Also do not use a loop to make a thread wait for another thread to do something. This is called busy–waiting, and it is not efficient. Appropriate ways for threads to wait for other threads to do something have been covered in the course.
◦ You can only use the Pthreads features that have been covered in class, or you will lose significant credit.
◦ When your program creates threads it will have to save their IDs. You don’t know in advance, while coding, how many threads your program will have to create, because you can’t see the secret tests. Situations such as this, where you have to store data and don’t know how much data there will be (the program can only tell how much data it needs to store while it is actually running), are situations where memory allocation must be used. So you must use some sort of dynamically–allocated memory or data structure to store the threads’ IDs. (If your program just creates a giant fixed–size array, without allocating memory, you may fail secret tests, and you will also lose credit during grading.)
◦ All your code must be in the file wc-threaded.c. No other user–written source (.c) or header files can be added. (Otherwise things probably won’t compile on the submit server.)
• The project style guide disallows using (in general) global variables in projects. If you want you may use global variables in this project, but (as stated above) you cannot use them to “return” values from thread functions, and you cannot use them to keep track of the cumulative counts of lines, words, and characters. We stress that the project can be written without using any global variables. If you are trying to use global variables you are at the minimum making things more difficult for yourself than they need to be, or you are violating the requirements above in a way that would cause you to lose significant credit. Instead of using any global variables, ask for help in the TAs’ office hours to understand why you don’t need to use them.
Keep in mind that you may not change any of the existing variables in the program by making them global or static local variables.
• When your program has to allocate any memory, you will lose credit if you cast the return value of any memory allocation functions. Besides being completely unnecessary, in some cases this can mask certain errors in code.
• Your program should check whether any memory allocations are successful; if any are not, it should print some sort of explanatory message and quit. It doesn’t matter how you accomplish this.
• Your program must free any dynamically–allocated memory once it is no longer in use. One of the public tests tests this, so you will fail that test if you are not freeing allocated memory, and secret tests may also test this.
• As Project #9 explained, you cannot run gdb (or valgrind) on a shell script (which the public tests are). As explained more fully in Project #9, if you are failing a test and want to use the debugger, run gdb wc-threaded.x, look at the public test shell script, and run your wc-threaded.x in gdb with the same command–line arguments that the script is running it with. Similarly with valgrind.
• You can create your own tests by copying the public test scripts and editing them. You will have to use the chmod command given in the Project #10 assignment to give your scripts executable permission.
• For this project you will lose one point from your final project score for every submission that you make in excess of four submissions. You will also lose one point for every submission that does not compile, in excess of two noncompiling submissions. Therefore be sure to compile, run, and test your project’s results before submitting.
C Academic integrity
Please carefully read the academic honesty section of the syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, publicly providing others access to your project code online, or unauthorized use of computer accounts, will be submitted to the Office of Student Conduct, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and what you are not permitted to do in regards to academic integrity when it comes to projects. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. More information is in the course syllabus – please review it now.
The academic integrity requirements also apply to any test data for projects, which must be your own original work. Exchanging test data or working together to write test cases is also prohibited.
D End–of–semester ELMS and TerpConnect account cleanup
D.1 Saving information from ELMS (if desired)
Soon after the semester the class ELMS space will become inaccessible, so if there are any materials from it (lecture slides, handouts, etc.) that you might possibly want later that you don’t already have, be sure to download and save them after finals. Students often find that they want to go back and review material from this course in later CMSC courses, because this material is used again in multiple later courses. And students often want to review CMSC 216 material in preparing for technical interviews. If you may ever want copies of your coursework or the course materials you have to save them yourself after the semester, because due to the size of the course we will not be able to provide them to you in the future after the ELMS space is gone.
If you want to save copies of any PDF materials from ELMS you should also save the PDF password somewhere. (Due to the size of our courses it won’t be possible to provide this in the future to anyone who didn’t save it.)
Students also sometimes want to go back and refer to their CMSC 216 projects in later courses when they need to use the material from this course there. Also, companies sometimes want to see examples of class projects during internship or job interviews. After a year the CS department removes projects from the submit server, so you won’t be able to download them from it after that. (Note that until then you can still see your projects by going to https://submit.cs.umd.edu and clicking on Older semesters.)
D.2 Saving information from Grace (if desired)
A couple weeks after the semester ends you will lose the ability to log in to the Grace systems, although you will still be able to log into terpconnect.umd.edu and access your course disk space there, at least for a limited time. However, early the next semester you will lose permission to access your course disk space, including all your projects, as well as everything in ~/216public, because the class space will be automatically deleted then by DIT (the Division of Information Technology). (As mentioned above, after a year, course projects are also removed from the CMSC submit server.) So if you want to save any of your projects or other coursework, or anything that we provided (lecture or discussion examples, secret tests, etc.), you will have to do so yourself over the summer. Sometimes companies want to see examples of class projects during job or internship interviews, and students taking upper–level CMSC courses often want to go back and look at relevant projects and materials from earlier courses. The instructional staff will not be able to provide copies of these in the future, so you’ll need to save them yourself if you might ever want them. Recall that the -r option to the cp command recursively copies a directory and all its contents, including subdirectories, so a command like
cp -r ~/216/ ~/216.sp23
would copy everything in your extra course disk space (which the symbolic link 216 in your home directory points to) to a directory in your home directory named “216.sp23” (use a different name if you like). (Of course you have to have enough free disk space in your TerpConnect account to store the files; use the quota command when you are logged into terpconnect.umd.edu to check this.) Although you will lose login permission to the Grace systems after the semester (unless you’re taking another course using them), you will still be able to log into your TerpConnect account as long as you’re associated with the University, via terpconnect.umd.edu.
You can also download files to your own computer; if you’re using Windows the left pane of MobaXterm will allow you to do this. (Click on Session, then on SFTP, then log into Grace. Hopefully it’s self–explanatory from there.)
On a Mac or Linux system you should be able to just open a terminal, cd on your computer to where you want to copy the files, and use a command like the following, where loginID is your directory ID, and the final period refers to the current directory:
scp -r loginID@grace.umd.edu:216 .
I have known many students who lost all of their coursework and other data, even though they had everything on their computer, because something happened to their computer, for example their laptop was stolen, or their hard drive/SSD died. I suggest that if you want to save your information it’s not sufficient to just have it on your computer; you should have an external backup as well. I have known a few students who had issues with cloud storage of their data, so I don’t completely trust that. I recommend just buying an external USB backup drive, connecting it to your computer once a day, and setting things so it backs up your files daily. (They are not expensive: I can find external USB drives from a high–quality manufacturer for around $55 for a 1 TB (terabyte) drive, $70 for a 2 TB drive, and $100 for a 4 TB drive.)
D.3 Resetting your Grace (TerpConnect) account
After you’ve copied the files from Grace that you want to save, and you have looked at or copied the secret tests (if you want) for the remaining projects when they are provided, the last step you need to do is to undo the changes to your account that you made during an early discussion section as described below, since later courses may use the Grace systems, and the changes you made for this course will probably conflict with changes necessary for them. Here is how to undo the changes to your account (the steps below assume that you are located in your home directory):
1. First just run /usr/glue/scripts/newdefaults (using its full pathname). This is a shell script provided by DIT that will replace any account control files in your home directory that you modified at the beginning of the semester with new unchanged copies (meaning as they were before you modified them). For example, you modified the file .path as part of your account setup, so this command will replace your .path file with the original version.
Note that this will not lose any information, because before copying new versions of files the newdefaults script will rename your current version with the year, month, and day that you run it. So if you run the command on May 28, your current .path file will be renamed as .path-23-05-28, then the script will copy the original version of the .path file to your home directory.
The account control files that you modified when setting up your account were .emacs, .path, and .cshrc.mine. Possibly you might have created aliases of your own in your .aliases file, and possibly you might have added customizations of your own in your .emacs file. (It is unlikely that you modified any other account control files but if you did, using ls -a after running the newdefaults script will show them with a suffix like 23-05-28.)
If you want to keep your modified version of one of the account control files, for example suppose that you want to keep the changes that you made to your .aliases file, just copy it back after running the newdefaults script. For example (given the assumptions above) cp .aliases-23-05-28 .aliases will do this. (But if you want to keep your changes to your .emacs file, comment out the line starting with load that you added to it, because it is referring to a file in 216public that will cease to exist when DIT removes the class files, which might lead to errors. Comments in Emacs control files start with a semicolon, so just put a ; character at the beginning of that line to disable it.)
2. Then remove the symbolic link named 216 that you created from your home directory to your extra disk space, by just removing the symlink, as in rm 216. You could still reach the files in your extra disk space after that if you wanted to (until you lose access to them) by just using the full pathname, for example, instead of cd 216 you could still use a command like:
cd /afs/glue/class/spring2023/cmsc/216/0101/student/loginID
3. Lastly, remove the symbolic link named 216public that you created in your home directory, which points to the class public directory, after copying any files you want from there, as in rm 216public.
Make sure you can log out and log back in successfully, and are able to list and view files in your home directory after that, to ensure that you didn’t make any mistakes performing the changes.