Starting from:
$35

$29

Assignment 2. Fun with Collections Solved

.

Each student begins with four late days that may be used throughout the quarter. You may submit this assignment 24 hours late by using one late day or 48 hours late using two late days. No submissions will be accepted more than 48 hours after the due date without prior approval by the head TA. See the syllabus for more information about our late policies.

All due dates and submission times are expressed in Pacifc time.

This assignment must be completed individually. Working in groups is not permitted.


This assignment is all about the amazing things you can do with collections. It’s a two-parter. The frst is a program that can guess the language a piece of text is written in. The second models rising sea levels on a sampler of realistic terrains. By the time you've completed this assignment, you'll have a much better handle on the container classes and how to use diferent types like queues, maps, vectors, and sets to model and solve problems. Plus, you'll have some things we think you'd love to share with your friends and family.

This assignment has two parts. It will be quite a lot to do if you start this assignment the night before it’s due, but if you make slow and steady progress on this assignment each day you should be in great shape. Here’s our recommended timetable:

Aim to complete Rosetta Stone within four days of the assignment going out.


Aim to complete Rising Tides within seven days of this assignment going out.


As always, feel free to reach out to us if you have questions. Feel free to contact us on EdStem, to email your section leader, or to stop by the LaIR.

Assignment Logistics

Starter Files

We provide a ZIP of the starter project. Download the zip, extract the fles, and double-click the .pro fle to open the project in Qt Creator.

 

Starter code


Resources

Feel free to check out our Python-to-C++ guide if you're moving from Python to C++.

Also, check out our style guide, guide to testing, and debugging guide.

Getting Help

Keep an eye on the Ed forum for an announcement of the Assignment 2 YEAH (YEAH = Your Early Assignment Help) group session where our veteran section leaders will answer your questions and share pro tips. We know it can be daunting to sit down and break the barrier of starting on a substantial programming assignment – come to YEAH for advice and confdence to get you on your way!

We also here to help if you get run into issues along the way! The Ed forum is open 24/7 for general discussion about the assignment, lecture topics, the C++ language, using Qt, and more. Always start by searching frst to see if your question has already been asked and answered before making a new post.

To troubleshoot a problem with your specifc code, your best bet is to bring it to the LaIR helper hours or ofce hours.

Part One: Rosetta Stone

Rosetta Stone is a joint project with Katie Creel and Diana Navas, Embedded Ethics Extraordinaires. Thanks to Richard Lin, Patricia Wei, Jin-Hee Lee, Neha Chetry, Nuhu Osman Attah, and Rose Novick for providing sample texts.


https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    1/21
2023/3/25 17:28    CS106B Fun with Collections

Have you ever navigated your web browser to a site not written in your native language? If so, you might have seen something like this in the corner:














The browser has automatically detected that the website I visited is written in Haitian Creole, while the browser itself is set to English.

How does the browser know what language a webpage is written in? This is the language identifcation problem, which has been studied extensively for at least the last thirty years. In this part of the assignment, you’ll implement a simple but efective language identifcation algorithm, gaining experience working with the Map container type in the process.

Background: Trigrams

The approach we’ll be using for language identifcation is based on trigram profles. To understand what a trigram profle is, let’s do an example. Suppose we have this (admittedly silly) piece of text:


A BANANA BANDANA


The trigram profle for this string is a collection of all its length-three subtrings, along with the number of times that length-three substring appears. Specifcally, it’s this group:

"ANA": 3
"BAN": 2
"NA ": 1
" BA":
2
"AND":
1
"NAN":
1
"A B":
2
"DAN":
1
"NDA":
1


That is, the pattern ANA appears three times, the pattern BAN appears twice, and the pattern NDA appears exactly once. These substrings are called trigrams. (More generally, the k-grams of a text are all of the substrings of that text that have length exactly k.)

Trigram profles are useful for language identifcation because the patterns of trigrams that appear in a long piece of text often heavily refects the language that text was written in. For example, here’s the top trigrams from a profle of an English text (specifcally, the Wikipedia article “Human”):

" th": 667
"ion": 423
"ing": 318
"the": 616
" of": 376
"man": 289
"he ": 533
" in": 375
"ng ": 288
" an": 497
"of ": 363

"nd ": 492
"tio": 333

"and": 470
"ed ": 320



That top entry, " th", indicates that the text has a lot of words that start with th. You can see that the word "the" is extremely common, as are the sufxes "ing" and "ion".

Contrast this with a set of trigrams taken from a piece written in Malay:

"an ": 184
"kan": 52
" ya": 39
"ang": 82
"sia": 52
"yan": 39
" da": 73
"ia ": 47
" ma": 38
" me": 71
"men": 42

"ng ": 71
" be": 40

"dan": 52
" pe": 39

Or this set of trigrams from a piece written in Spanish:

" de": 531
" la": 293
" co": 217
"os ": 396
"es ": 277
" es": 208
"de ": 374
"la ": 239
"en ": 198
"ent": 298
"el ": 232
"ien": 198







 
 
2023/3/25 17:28

CS106B Fun with Collections
"nte": 196
" en": 185

"as ": 193


Or this piece written in Slovene:


"je ": 82
" na": 39
"raz": 33
" po": 65
"ost": 39
"in ": 32
" pr": 45
"ove": 37
"sti": 32
"ih ": 45
"lov": 36

"anj": 43
" je": 34

"nos": 43
" ra": 33

Or this one in Hausa:


"ar ": 93
" a ": 39
"a s": 35
"an ": 85
" ka": 38
"ana": 33
" da": 72
" ta": 38
"na ": 33
"in ": 69
"ara": 38

"da ": 58
" sh": 35

"iya": 42
" wa": 35



As you can see, the relative frequencies of the trigrams in the text are markedly diferent from one another. This makes it possible to guess a text’s language by computing its trigram profle and then fnding which language that profle most resembles.

Background: UTF-8

Before we dive into the actual coding part, we have to address an issue you’ll encounter almost immediately when working on this assignment.

Here’s what happens if we gram trigrams from a piece of text written in Vietnamese:

"ng ": 591
" th": 307
"\341\273\235": 219
"h\341\273": 442
" nh": 276
"\304\221\341": 211
" \304\221": 371
"
ng":
251
"nh ": 205
"\260\341\273": 311
"\303\240 ": 231
"i\341\273": 200
"\306\260\341": 311
"
tr":
229



And here’s some trigrams from Bulgarian text:

"\320\275\320": 835
"\320\265\320": 491
"\260 \320": 356
"\321\202\320": 701
"\320\260\320": 450
"\320\270 ": 349
"\320\276\320":
607
"\320\270\320": 438
"\320\276\321": 348
"\320\260 ": 551
"\320\262\320":
433
"\320\260\321": 321
"\321\200\320":
517
"\320\270\321":
395



And here's some from Farsi:

"\330\247\331": 562
"
\330\247": 306
"\331\207 ": 240
"\330\247\330": 529
"\333\214 ":
287
"\331\206 ": 235
"\247\331\206": 391
"\331\210\330":
281
"\330\257\330": 231
"\331\206\330":
361
"\333\214\330":
256
"\330\261\330": 230
"\330\263\330":
320
"
\330\250":
241



What’s going on here? Why are we seeing lots of numbers and slashes rather than actual characters?

The reason has to do with two historical artifacts of C++ colliding with contemporary realities. C++ is an older language (it dates back to the 1980s), at a time where computer networks were small and memory was scarce. To save space, most C++ implementations only allowed for 256 diferent values for the char type. And that was fne, since computers were often used in contexts where the English alphabet (plus a few added characters for other languages) was all that was needed. You could easily get away with having only 256 possible diferent characters, and the space savings were key at a time where even a 1GB hard drive would be considered a luxury.

But now that memory is much cheaper and people all over the world are connecting and sharing text in their native languages, this restriction on the char type is more of a liability than an asset. Restricting the char type to only have 256 possible values means that there







 


https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    3/21
2023/3/25 17:28    CS106B Fun with Collections

are more Chinese characters than there are possible char values – not to mention more emojis than chars.

To address this, a compromise was reached – while (for backwards compatibility) English letters would take up just one char value, glyphs from other languages would be encoded as a sequence of multiple consecutive chars representing a single glyph. And, as a strange side-efect, the individual chars making up those glyphs can’t easily be displayed in isolation. They are often displayed as \NNN. For example, the sequence \331\206 corresponds to the Arabic character ن, while \320\270 corresponds to the Cyrillic letter и. This system is called UTF-8.

In the context of this assignment, this means that a “trigram” formed from a piece of text might actually consist of a fractional character. For example, the ancient Phoenician character is encoded as a sequence of four char values, so trigrams from Phoenician text would often contain quarters or halves of individual Phoenician letters. While in general it is not okay to subdivide characters this way (many systems will display error characters if you try printing out a partial glyph), for the purposes of this assignment you should

always form trigrams from a group of three char values, regardless of what fraction of a glyph those three chars represent.

To summarize:

    1. Don’t be surprised if you see sequences like \302 or \225 appearing in your trigrams. That’s a consequence of using characters that 1980s Americans wouldn’t have expected to see on their computers. It’s a complete historical artifact, but one we have to live with today.

    2. For the purposes of this assignment, it is perfectly okay – and in fact, expected – that when forming trigrams, you should always form groups of three char values, rather than trying to segment text at the actual boundaries between glyphs. Surprisingly, this still works well in practice.

    3. Outside of the context of CS106B, be aware that char does not mean “a character in any language.” Some glyphs require multiple characters to encode. The “correct” way to handle these sorts of strings is to use a library that properly breaks text apart into its individual units.

Milestone One: Form k-Grams

Your frst task in this problem is to implement a function



Map<string, double> kGramsIn(const string& text,

int kGramLength);


that takes as input a string, then returns a Map<string, double> containing the frequencies of all the k-grams of length kGramLength. Each key should be a k-gram (a substring of length kGramLength), and each value should be the number of times that this k-gram appears in the string. For convenience, any k-gram that doesn’t appear in the text should not be present in the map, not even with value zero.

You can assume that the kGramLength parameter is positive (for example, it shouldn’t be 0 or -137). If the kGramLength parameter is not positive (that is, it's zero or negative), you should call the special function error, defned in "error.h", to report an error. You can call this function as follows:


error("Some error message");



The error function acts like an emergency abort and will immediately exit the function with an error.

The input string might be shorter than kGramLength. If that happens, there are no k-grams, and you should return an empty Map.







Due Friday,   at 1:00 pm


Assignment Logistics Part One: Rosetta Stone

Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams

Milestone One Requirements

Milestone Two: Normalize Frequen

Milestone Three: Filter Out Uncom

Milestone Four: Implement Cosine

Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions




https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    4/21

 2023/3/25 17:28    CS106B Fun with Collections



Milestone One Requirements

    1. Implement the kGramsIn function in RosettaStone.cpp.

    2. Use the provided tests to confrm that your code works correctly.

    3. Optionally, but not required: use STUDENT_TEST to add a test case or two of your own!


You might be wondering why we’re having you return a Map<string, double> here rather than a Map<string, int>. After all, the frequencies here are all integers, and int seems like a more appropriate type. There are two reasons why we’re having you do this:

    1. In a later step, you will be dividing all the entries of this map to make them between 0 and 1, inclusive. Working with a Map<string, double> here makes that step a bit easier.

    2. Later on, you’ll be doing a calculation involving multiplying each of the terms in the map by themselves. The int type is (usually) limited to a range from roughly negative two billion to roughly positive two billion, and some of the entries in the map, when squared, exceed that range. On the other hand, doubles can comfortably handle values outside this range.

Some notes on this problem:

There's an Endearing C++ Quirk you may encounter in this part of the


assignment. The .length() and .size() functions on a string return what's called an unsigned integer. These are integer values that are not permitted to be negative, which means that they can have unusual interactions with subtraction. For example, if you have a string of length 2 and compute str.length() - 3, instead of getting -1, you'll get a gigantic positive number (this is called integer overfow - take CS107 for details!). As a result, avoid subtraction when working with .length() and

.size(). If you're tempted to subtract something from the length of a string, instead see if you can add something to another quantity.

Just to make sure the notion of a k-gram is clear, the 3-grams of the string "ABCDEF" are "ABC", "BCD", "CDE", and "DEF"; the 4-grams of the string "ABCDEF" are "ABCD", "BCDE", and "CDEF"; the only 6-gram of "ABCDEF" is "ABCDEF"; the six 1-grams of "ABCDEF" are "A", "B", "C", "D", "E", and "F"; and there are no 137-grams of the string "ABCDEF".


To make sense of the syntax here, the name of the function you're writing is kGramsIn, and the function return type is Map<string, double>. This does not mean that there is an actual Map<string, double> named kGramsIn. If you want to create a Map<string, double> as a local variable, feel free to do so, but pick a more descriptive name for it.


You should call the error function if the k-gram length is zero or negative, because there's no meaningful way to defne a 0-gram or a -137-gram. However, if the k-gram length is greater than the length of the input string, you should return an empty Map, since in general large k-grams are a meaningful thing to talk about and there just don't happen to be any for a shorter string.


If you're looking for information about the syntax of diferent operations on our container types, check out the Stanford C++ Library Documentation page.


Milestone Two: Normalize Frequencies

We now have a Map<string, double> representing the trigrams in the original text. Once we have these frequencies, we can guess the text’s language by seeing which language’s trigram distribution it’s most similar to. There are many ways to defne what “most similar to” means, but one of the most common ways to do this uses something called the cosine similarity, which is described below.

First, an observation. Here are the trigram distributions for three pieces of text, each of which are written in English:

Distribution 1







Due Friday,   at 1:00 pm


Assignment Logistics


Part One: Rosetta Stone


Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams

Milestone One Requirements

Milestone Two: Normalize Frequen

Milestone Three: Filter Out Uncom

Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions


https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    5/21
2023/3/25 17:28

CS106B Fun with Collections

" th": 219
" to": 62
"and":
54
"the": 171
"at ": 60
"re ":
54
"he ": 113
"nd ": 60
" be":
53
"ing": 79
"acc": 58
" in":
53
" of": 76
"cin": 58
"e t":
53
"ng ": 76
"cci": 57


"of ": 69
"to ": 57


Distribution 2



" th": 495
"ed ": 158
"er ":
128
"the": 388
" to": 157
"e t":
126
"he ": 351
"I":154
"re ":
124
"ing": 231
" an": 153
"is ":
121
"nd ": 194
"and": 152
" of":
117
"ng ": 193
"to ": 146


" a ": 159
"her": 134


Distribution 3



" th": 510
"ing": 208
" a ":
171
"he ": 371
"nd ": 205
" of":
159
"the": 339
"ng ": 196
"is ":
158
"ed ": 269
"and": 195
"as ":
154
" to": 257
" in": 187
"tha":
154
" an": 222
"at ": 175


"to ": 217
"hat": 174




In each case, we can see that the very most common trigrams are pretty much the same across the three texts (though the ordering is a bit diferent). However, notice that the numbers are generally increasing as we move from the frst document to the third. The reason for that is simple – these were computed from progressively longer texts.

To make the trigram frequencies more consistent across texts of diferent lengths, you will need to normalize the scores. Borrowing a technique from linear algebra, we’ll normalize scores as follows:

Add up the squares of all the frequencies in the map.


Divide each frequency by the square root of this number.


As a very simplifed example, suppose we have these trigrams:

"aaa": 3


"baa": 1


"aab": 1


We’d compute the number 32 + 12 + 12 = 11, then divide each number by the square root of eleven to get these new frequencies:

"aaa": 0.904534


"baa": 0.301511


"aab": 0.301511


Applying this same approach to the three distributions from above (including the frequencies of the less frequent k-grams that we didn’t show) gives the following:

Distribution 1


" th": 0.40398
" to": 0.11489
"and": 0.10007
"the": 0.31503
"at ": 0.11118
"re ": 0.10007
"he ": 0.20755
"nd ": 0.11118
" be": 0.09821
"ing": 0.14639
"acc": 0.10562
"e t": 0.09821
" of": 0.14083
"cin": 0.10562
"ed ": 0.09821
"ng ": 0.14083
"to ": 0.10562

"of ": 0.12786
"cci": 0.10377

Distribution 2


" th": 0.37585
"the": 0.29460
"he ": 0.26651







Due Friday,   at 1:00 pm


Assignment Logistics


Part One: Rosetta Stone


Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams

Milestone One Requirements

Milestone Two: Normalize Frequen

Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions


https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    6/21
2023/3/25 17:28

CS106B Fun with Collections
"ing": 0.17539
" I ": 0.11693
"e t": 0.09567
"nd ": 0.14730
" an": 0.11617
"re ": 0.09415
"ng ": 0.14654
"and": 0.11541
"is ": 0.09187
" a ": 0.12072
"to ": 0.11085
" of": 0.08883
"ed ": 0.11997
"her": 0.10174

" to": 0.11921
"er ": 0.09719

Distribution 3


" th": 0.33214
"ing": 0.13546
" a ": 0.11136
"he ": 0.24161
"nd ": 0.13350
" of": 0.10355
"the": 0.22077
"ng ": 0.12764
"is ": 0.10289
"ed ": 0.17518
"and": 0.12699
"as ": 0.10029
" to": 0.16737
" in": 0.12178
"tha": 0.10029
" an": 0.14458
"at ": 0.11397

"to ": 0.14132
"hat": 0.11331



With the scores normalized, these numbers are much closer to one another, indicating that they’re using fairly similar distributions of trigrams.

Your next task is to implement a function



Map<string, double> normalize(const Map<string, double>& profile);


that takes as input a trigram profle, then returns an normalized version of the profle.

There is an important edge case you need to handle. If the input Map is empty or does not contain any nonzero values, then the calculation we’ve instructed you to do above would divide by zero. (Do you see why?) To address this, if the map doesn’t contain at least one nonzero value, you should use the error() function to report an error.



Milestone Two Requirеments

    1. Implement the normalize function in RosettaStone.cpp.

    2. Use the provided tests to confrm that your code works correctly.

    3. Optionally, but not required: use STUDENT_TEST to add a test case or two of your own!


Some notes on this problem:

There are no restrictions on what the keys in the map can be. Although you’ll ultimately be using this function in the context of trigrams, you should not assume that the keys will necessarily be strings of length three, nor that the keys could be a set of trigrams for a piece of text. Similarly, while it’s never possible to have a negative number of copies of a given trigram, your code should be able to handle negative values. In each case, sum up the squares of all the values, then divide each of the values by the square root of that sum.


The keys in the resulting map should be the same as the keys in the input map. In other words, you should not add or remove keys.


For those of you coming from Python: C++ doesn’t have a built-in exponentiation operator like **. Instead, to compute powers or square roots, include the header <cmath> and use the pow and sqrt functions.


If you're looking for information about the syntax of diferent operations on our container types, check out the Stanford C++ Library Documentation page.


Milestone Three: Filter Out Uncommon Trigrams

While all texts written in a particular language will show a strong degree of similarity between their most frequent trigrams, the infrequent trigrams in two pieces of text written in the same language often difer greatly. For example, these infrequent trigrams often refect uncommon words that appear a few times in a text.







Due Friday,   at 1:00 pm


Assignment Logistics


Part One: Rosetta Stone


Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams

Milestone One Requirements

Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions



https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    7/21
2023/3/25 17:28    CS106B Fun with Collections

To address this, we’re going to ask you to write a function



Map<string, double> topKGramsIn(const Map<string, double>& profile, int numToKeep);

that takes as input a k-gram profle and a maximum number of k-grams to keep, then returns a Map containing only the numToKeep most frequent k-grams in the profle. If numToKeep is bigger than the number of k-grams in the profle, you should return all the original k-grams. And if numToKeep is negative, you should report an error(), since that’s not feasible. (If numToKeep is zero, you should just return an empty map.)

To do this, you’ll need some way to work with k-grams sorted by frequency. For that, we’ll introduce a new, useful container type that you’ll see later on in the quarter: a priority queue. If you include the header fle "priorityqueue.h", you’ll get access to the PriorityQueue container type.

Like a regular queue, the PriorityQueue supports the enqueue, dequeue, peek, size, and isEmpty operations. The diference between a priority queue and a regular queue is the order in which the elements are dequeued. In a regular Queue, elements are lined up in sequential order, and calling dequeue() or peek() gives back the element added the longest time ago. In the PriortyQueue, each time you enqueue an element, you associate it with a number called its weight. The element that’s returned by dequeue() or peek() is the element that has the lowest weight. For example, let’s imagine we set up a PriorityQueue like this, holding a collection of past CS TAs:


PriorityQueue<string> pq;

pq.enqueue("Amy",    103);

pq.enqueue("Ashley", 101);

pq.enqueue("Anna",    110);

pq.enqueue("Luna",    161);


If we write



string elem = pq.dequeue();

cout << elem << endl;


then the string printed out will be "Ashley", since of the four elements enqueued, the weight associated with Ashley (101) was the lowest. Calling pq.dequeue() again will return "Amy", since her associated weight (103) was the lowest of what remains. Calling pq.dequeue() a third time would return "Anna".

Let’s insert more values. If we call pq.enqueue("Chioma", 103) and then pq.dequeue(), the return value would be the newly-added "Chioma", because her associated weight is lower than all others. And note that Chioma was the most-recently-added TA here; just goes to show you that this is quite diferent from a regular Queue!

We’re going to leave it to you to determine how to use the PriorityQueue type to fnd the most frequent k‑grams. There are many strategies you can use to do this. Something to think about: PriorityQueues are great at removing items with low weight, and you want to fnd items of high weight.



Milestone Three Requirements

    1. Implement the topKGramsIn function in RosettaStone.cpp.

    2. Use the provided tests to confrm that your code works correctly.

    3. Optionally, but not required: use STUDENT_TEST to add a test case or two of your own!


Some notes on this problem:







Due Friday,   at 1:00 pm


Assignment Logistics


Part One: Rosetta Stone


Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams

Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions







https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    8/21
2023/3/25 17:28    CS106B Fun with Collections

Your solution needs to make use of the PriorityQueue type. While there are other ways to sort things in C++, for the purposes of this problem we'd prefer if you avoid them.


If multiple k-grams are tied for having the same weight, you can break ties however you’d like.


The input frequencies may or may not be normalized.


Although this wouldn’t be possible in a k-grams setting, this function should work just fne even if some of the values in the map are zero or negative.


Milestone Four: Implement Cosine Similarity

At the end of the previous milestone, we saw this trio of trigram profles for English texts:

Distribution 1


" th": 0.40398
" to": 0.11489
"and": 0.10007
"the": 0.31503
"at ": 0.11118
"re ": 0.10007
"he ": 0.20755
"nd ": 0.11118
" be": 0.09821
"ing": 0.14639
"acc": 0.10562
"e t": 0.09821
" of": 0.14083
"cin": 0.10562
"ed ": 0.09821
"ng ": 0.14083
"to ": 0.10562

"of ": 0.12786
"cci": 0.10377

Distribution 2


" th": 0.37585
"ed ": 0.11997
"er ": 0.09719
"the": 0.29460
" to": 0.11921
"e t": 0.09567
"he ": 0.26651
" I ": 0.11693
"re ": 0.09415
"ing": 0.17539
" an": 0.11617
"is ": 0.09187
"nd ": 0.14730
"and": 0.11541
" of": 0.08883
"ng ": 0.14654
"to ": 0.11085

" a ": 0.12072
"her": 0.10174

Distribution 3


" th": 0.33214
"ing": 0.13546
" a ": 0.11136
"he ": 0.24161
"nd ": 0.13350
" of": 0.10355
"the": 0.22077
"ng ": 0.12764
"is ": 0.10289
"ed ": 0.17518
"and": 0.12699
"as ": 0.10029
" to": 0.16737
" in": 0.12178
"tha": 0.10029
" an": 0.14458
"at ": 0.11397

"to ": 0.14132
"hat": 0.11331





We as humans can look at these numbers and say “yep, they’re pretty close,” but how could a computer do this? Suppose we have two trigram profles P₁ and P₂ that have already been normalized. One measure we can use to determine how similar they are is to compute their cosine similarity, which can be computed as follows: for each trigram that appears in both and multiply the frequency of that trigram in and then add up all the
P1    P2,    P1    P2,
numbers you found this way.

As an example, consider these two (highly contrived, not actually real) trigram profles:

Profle 1

"aaa": 0.333


"bbb": 0.667


"ccc": 0.667


Profle 2

"bbb": 0.333


"ccc": 0.667


"ddd": 0.667Assignment Logistics Part One: Rosetta Stone

Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions


https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    9/21
2023/3/25 17:28    CS106B Fun with Collections

These two trigram sets have the trigrams "bbb" and "ccc" in common. We’d therefore fnd their cosine similarity by computing

(product of "bbb" frequencies) + (product of "ccc" frequencies)
    • (0.667 × 0.333) + (0.667 × 0.667)

    • 0.667

The frequencies of "aaa" and "ddd" aren’t relevant here, since those trigrams appear in only one of the two trigram sets.

In the context of trigrams, cosine similarities range from 0 (the two documents have no trigrams in common) to 1 (the two documents have identical relative trigram frequencies).

Your next task is to write a function



double cosineSimilarityOf(const Map<string, double>& lhs, const Map<string, double>& rhs);

that takes as input two sets of normalized scores, then returns their cosine similarity using the formula given above.



Milestone Four Requirements

    1. Implement the cosineSimilarityOf function in RosettaStone.cpp.

    2. Use the provided tests to confrm that your code works correctly.

    3. Optionally, but not required: use STUDENT_TEST to add a test case or two of your own!


Some notes:

Although you’ll ultimately be applying this function to trigram profles, your function should work in the case where the keys in the two maps are arbitrary strings. In other words, you shouldn’t assume that you’re just working with maps whose keys are all length-three strings.


You can assume that the two sets of input scores are normalized and don’t need to handle the case where this isn’t true.


If you're looking for information about the syntax of diferent operations on our container types, check out the Stanford C++ Library Documentation page.


Milestone Five: Guess a Text’s Language

At this point, you have the ability to take a text, extract its trigrams, and normalize its profle. You have the ability to take two normalized profles and compare their scores against one another. All that’s left to do now is put everything together to guess the language of a text.

Here’s how. Before releasing this assignment to you, we went online and got lots and lots of texts written in diferent languages. Each of the texts associated with a given language is called a corpus (plural corpora) and is designed to be large and diverse enough to give a representative sample of what that language looks like in practice. We then computed trigram profles for each of those languages, giving a trigram profle for the language as a whole.

To make an educated guess about what language a piece of text is written in, you can simply compute the similarity of that text’s profle against the profles of all the diferent languages. Whichever language has the highest similarity to your text then gives the best guess of what language the text was written in.

Your fnal task is to write a function



string guessLanguageOf(const Map<string, double>& textProfile, const Set<Corpus>& corpora);







Due Friday,   at 1:00 pm


Assignment Logistics Part One: Rosetta Stone

Background: Trigrams


Background: UTF-8

Milestone One: Form k-Grams

Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions





https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    10/21
2023/3/25 17:28    CS106B Fun with Collections

that takes as input a text’s k-gram profle (which has been normalized and then trimmed to store only the most frequent k-grams) and a set of corpora for diferent languages. This function then returns the name of the language that has the highest cosine similarity to the text. As an edge case, if the set of corpora is empty, you should call error() to report an error, since in that case there aren’t any languages to pick from.

You may have noticed that the last parameter to this function is a Set<Corpus> representing the text corpora. What exactly is a Corpus? In the RosettaStone.h header fle (look at Headers/src/RosettaStone.h), you'll see that it's defned like this:


struct Corpus {

string name;    // Name of the language

Map<string, double> profile; // Normalized and trimmed

};


This is a structure, a type representing a bunch of diferent objects all packaged together as one. This structure type groups a string named name (the name of the language) and a Map<string, double> named profile (its normalized trigram profle). The name Corpus refers to a type, just like int or string. You can create variables of type Corpus just as you can variables of any other type. For example, you could declare a variable of type Corpus named english like this:


Corpus english;


One you have a variable of type Corpus, you can access the constituent elements of the struct by using the dot operator. For example, you can write code like this:


Corpus english;

english.name = "American English";

english.profile["the"] = 0.13742;




Milestone Five Requirements

    1. Implement the guessLanguageOf function in RosettaStone.cpp.

    2. Use the provided tests to confrm that your code works correctly.

    3. Optionally, but not required: use STUDENT_TEST to add a test case or two of your own!


Some notes on this part of the problem:

You can assume that whoever is calling this function will already have normalized the frequencies of all the k-grams in the profle and each corpus and then fltered them down to just the most common k-grams. You do not need to - and should not - repeat these steps, since doing so will slow down your function.


If there is a tie between two or more languages being the most similar, you can break the tie in whatever way you’d like.


If you're looking for information about the syntax of diferent operations on our container types, check out the Stanford C++ Library Documentation page.


Milestone Six: Explore and Evaluate

You’ve just implemented a system to guess the languages of various pieces of texts. It’s now time to ask: how well does it work? When does it give good guesses? When does it do poorly? Questions like these are important, especially given the cultural and political signifcance of languages.

Let’s begin with some cases where your program does a great job identifying languages. The following texts will all have their languages correctly identifed.







Due Friday,   at 1:00 pm


Assignment Logistics Part One: Rosetta Stone

Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions


Submission Instructions







https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    11/21
2023/3/25 17:28    CS106B Fun with Collections



Run the “Rosetta Stone” demo from the top-level menu and use it to answer each of the following questions. Write your answers in the fle LanguageIdentification.txt. As a note, assuming you've coded everything up correctly, the answers your program produces should be correct.

1. What language is this written in?


,    ,


.    .


,    .


2. What language is this written in?


ᐁᐘᑯ ᐅᒪ ᐊᓯᓂᐏ ᒣᓂᑲᐣ ᑭᒥᓇᐘᐠ

ᐃᓂᓂᐘᐠ ᒪᓂᑐᐸ ᑳᔭᒋᐠ᙮ ᐁᐘᑿᓂᐠ ᐅᑭ

ᑲᓄᒋᐦᑕᒋᐠ ᐊᓯᓂᐏᐊᑐᐢᑮᓂᓂᐤ ᑲᑭᒥᓂᐦᒋᐠ

ᐅᒣᓂᐤ᙮ ᐊᑿᓂ ᒥᑕᐦᑐᒥᑕᓇᐤ ᐊᐢᑭᐩ ᐊᓴᐩ

ᐁᐊᑐᐢᑫᒋᐠ ᐅᑕ ᒪᓂᑐᐸ᙮


3. What language is this written in?


Penei ka manao ana o kekahi poe. I ke kaua ana o kekahi aina

me kekahi aina, o ka poe i hee, makau lakou o lukuia mai a holo wale lakou ma ka moana, a loaa kahi aina, a noho iho la malaila, a pela hou no, a hiki no na kanaka ma na moku a pau. I ka manao o kekahi poe, ua pupuhi wale ia na waa i holo ma ka moana a pae wale aku i kekahi aina. Ua manao wale ia keia mau mea, aole i maopopo.



4. What language is this written in?


Tsídiiłtsooí binák ee halzhinígíí
éí tsídii tsídígíí dah yikahjí
atah yisdzoh, áádóóʼ éí Náhook sjí
kéyah dah si ánígíí bikáá góó
hóĺǫ.
ǫ
ʼ
ʼ






5. What language is this written in?


Negema wangu binti, mchachefu wa sanati upulike wasiati asa ukanzingatia. Maradhi yamenshika hatta yametimu mwaka sikupata kutamka neno lema kukwambia. Ndoo mbee ujilisi na wino na qaratasi moyoni nina hadithi nimependa kukwambia.

6. What language is this written in?


გულისფანცქალით მივდიოდით მხარეში, საიდანაც ვიწყებდით

ძველი

წელთაღრიცხვის საიდუმლოებებით მოცულ მიწისზედა თუ

მიწისქვეშა

ტალანებსა და გამოქვაბულებში ძრომა-ხეტიალს.


7. What language is this written in?


Намрын бороо зөөлөн шиврэхэд аав минь дуртай

Насыг нь зөөж буцсаар л байгаа шувуудад хайртай

Өөрөө өтлөвч, орчлонд үлдэнэ гэж надад хайртай

Өдрөөс өдөрт холдсоор л байгаа аавдаа би хайртай

Шар наран улам алслаад л







Due Friday,   at 1:00 pm


Assignment Logistics Part One: Rosetta Stone

Background: Trigrams


Background: UTF-8

Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions




https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    12/21
2023/3/25 17:28    CS106B Fun with Collections


8. What language is this written in?

   ϩ ̀ ϩ  ̀     ̀  ̀   ϣ  ̀ ̀      ̀   ̀  ̀    ϩ  ̀ ̀ ̀ ϣ  .

   ϩ    ϫ  ̀    ̀    ̀      ϫ         ̀        ̀      ̀    ̀
̀
ϩ    ϥ ̀    ̀ϩ    .    ϩ ̀    ϥ  ̀    ̀ ̀ ϣ    ̀      ̀ ϫ
  ̀       ϥ        ϣϯ    ̀   ϣ    .

9. What language is this written in?


Akyinkyinakyinkyin ama mahu nneɛma,

Asante Bonwire Kente nwene deɛ,

Manhu bi da o,

Kwame nim adeɛ yɔ

Ne kente nwono na abɔ me gye

Ne nsa; ne nan, n’asadua saa nie.


10. What language is this written in?


ཇི་སེད་ས་དག་ཕོགས་བཅའི་འཇིག་རེ་ན། །

དས་གསམ་གཤེགས་པ་མི་ཡི་སེང་གེ་ཀན། །

བདག་གིས་མ་ལས་དེ་དག་ཐམས་ཅད་ལ། །

ལས་དང་ངག་ཡི་དང་བས་ཕག་བགིའོ།།



It would be great if your program always correctly identifed the language a text is written in. However, your tool isn’t perfect, and it will sometimes get languages wrong. This can happen for a variety of reasons. Your next task is to explore where this can happen.



Run the “Rosetta Stone” demo from the top-level menu. Answer each of the following questions in the fle LanguageIdentification.txt:

    11. This text by Robert Burns is written in Scots. What language does your program say it’s written in? (Sometimes, the trigrams from a text in one language will closely match the trigrams from a related language.)

And there's a hand, my trusty fiere! and gie's a hand o' thine! And we'll

tak' a right gude-willie waught, for auld lang syne.


    12. This text is written in English. What language does your program say it’s written in? (Short texts can have trigrams that do not match the general trigram distribution for texts in that language.)

San Jose is in Santa Clara County, California.


    13. This text is written in English. What language does your program say it’s written in? (Trigrams are case-sensitive. We derived the English trigrams from texts that were written in formal English, in which it’s very unusual to see long strings of capitalized letters. As a result, this text will be a low match against English.)







Due Friday,   at 1:00 pm


Assignment Logistics

Part One: Rosetta Stone


Background: Trigrams

Background: UTF-8


Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions




















https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    13/21
2023/3/25 17:28    CS106B Fun with Collections


IT WAS THE BEST OF TIMES, IT WAS THE WORST OF TIMES, IT WAS THE AGE OF
WISDOM, IT WAS THE AGE OF FOOLISHNESS, IT WAS THE EPOCH OF BELIEF, IT WAS THE EPOCH OF INCREDULITY, IT WAS THE SEASON OF LIGHT, IT WAS THE SEASON OF DARKNESS, IT WAS THE SPRING OF HOPE, IT WAS THE WINTER OF DESPAIR, WE HAD EVERYTHING BEFORE US, WE HAD NOTHING BEFORE US, WE WERE
ALL GOING DIRECT TO HEAVEN, WE WERE ALL GOING DIRECT THE OTHER WAY – IN
SHORT, THE PERIOD WAS SO FAR LIKE THE PRESENT PERIOD, THAT SOME OF ITS
NOISIEST AUTHORITIES INSISTED ON ITS BEING RECEIVED, FOR GOOD OR FOR EVIL, IN THE SUPERLATIVE DEGREE OF COMPARISON ONLY.

    14. This song is written in Nepali. What language does your program say it’s written in? (Nepali is traditionally written with Devanagari script, and we used those texts to get our trigrams. In this example, it’s transliterated using a Latin-derived alphabet.
Therefore, this will have a low match with Nepali.)


Rāt
ikā nidhārmā

alakkai
alkiy

ō ṭ
pī śiraimā






ō
hākā

halakkai

halkiy

ṭō

alakka



ō
rāt
ikā nidhārmā


alkiy

ō ṭ


āg



ō

chāti bhitra māyāk


salkiy


ē bijulik  tār tār ō tārō tār


ō


ō





śanivār
phursad bha'ē ē hajur bhē au


phursad bha'ē ē hajur bhēṭauṅ śanivār
hākā





pī śiraimā  halakka
halkiy

ṭō


ō
māyā khēlnē khēlā ī jhulukkai jhulkiy
ē bijulik


ō

tār tār tār tār




ō

śanivār
phursad chaina ē hajur ā'u d

phursad chaina ē hajur ā'uṅdō śanivār
ṅ ō







Due Friday,   at 1:00 pm


Assignment Logistics Part One: Rosetta Stone

Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions








Milestone Seven: Ethical Considerations

As you’ve just seen, your tool is accurate in some cases but not in others. This section will ask you to put yourself in the position of a hypothetical engineer working on a language identifcation tool in a socially and politically important context.

Suppose you are a software engineer working at a company that is about to launch a new social media website. The company has decided to use automated software tools to identify, fag, and remove posts above a certain degree of toxicity, with the aim of ameliorating emerging conficts and preventing online harassment. Your language identifcation tool will be used as the frst step in the process by identifying the language user-posted content is written in. From there, the company will employ language-specifc moderation tools to determine whether the posted content should be automatically removed.



Answer the following question in the fle ShortAnswers.txt:

Q1. Suppose that your tool incorrectly determines the language of a user's post. Identify a person who may realistically be harmed as a result. How, specifcally, would that person be harmed? (Your answer should be 2 – 4 sentences in length.)


Because of the harms that can result from misidentifying a language, you tell your product manager that you will need to test your tool to determine contexts in which it performs well and contexts in which it performs poorly. Your manager agrees and gives you an additional week and a very small budget to test your language identifcation tool.



Answer the following question in the fle ShortAnswers.txt:

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    14/21
2023/3/25 17:28    CS106B Fun with Collections


Q2. Given the time and fnancial constraints outlined above, what would you do to test your tool? What steps would you take to be confdent that the information you gathered from testing was accurate? Here are some considerations to keep in mind:

If you search online for texts written in a given language, the texts you fnd may not actually be written in that language. As an example, we originally used the Scots Wikipedia as our source for the Scots language trigrams, but we had to replace it with a diferent Scots work because much of the Scots Wikipedia was written by someone who didn’t speak Scots. This means that sourcing texts to
test on requires a degree of familiarity with the language in question.

For some languages (Kazakh, Bengali, etc.), it's easy to fnd large amounts of contemporary text in that language through a quick Google search. For others (Kalaallisut, Izon, etc.), this is much harder. These disparities often refect larger political or historical trends.

Social and political forces can make it difcult to determine the "correct" label for a piece of text. For example, diferent people in Moldova may label a single piece of writing in the language they speak as either "Moldovan" or "Romanian."

There are around 8,000 extant human languages, and your tool has trigrams for around 200 of them. As a result, any individual person testing the tool cannot be a native or secondary speaker of all the languages that need to be tested. Similarly, any one tester won't know of all the political, historical, and social concerns surrounding each of these languages.

Your answer should be 4 - 8 sentences in length.



Testing your tool is important because it will enable to you to identify areas where your program misclassifes languages and then adjust the underlying algorithm accordingly. However, because of the resource contraints on your testing plan from Q2, you will not be able to assess all of your tool's strengths and weaknesses. As a result, you can expect that your tool will perform markedly better for text written in languages that were tested more thoroughly.



Answer the following question in the fle ShortAnswers.txt:

Q3. Look back over your testing plan. Identify a specifc group of people who are likely to be negatively impacted due to gaps in your test coverage. In what specifc ways would they be impacted? How would you assess the seriousness of these impacts? Your answer should be 3 - 6 sentences in length.


Because of the issues you identifed, you decide to ask your product manager for additional time and resources to test and improve your tool.



Answer the following question in the fle ShortAnswers.txt:

Q4. Write a memo to your product manager explaining why it is important to invest additional time and resources in a stricter testing regime. In doing so, please do the following:

Mention the specifc harms to the specifc stakeholders you identifed in Q3.

Propose three specifc ways in which you could use the extra time and resources to address and mitigate those harms.

Your answer should be 5 - 10 sentences in length.



Your product manager gets back to you congratulating you for being mindful of the societal impact of your tool. She agrees with your assessment, but says she can only provide enough additional time and resources for you to implement one of the three mitigation strategies you proposed in your memo.







Due Friday,   at 1:00 pm


Assignment Logistics

Part One: Rosetta Stone


Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions


Submission Instructions


https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    15/21
2023/3/25 17:28    CS106B Fun with Collections



Answer the following question in the fle ShortAnswers.txt:

Q5. In Q4 of this problem you proposed three ways in which you could use additional time and resources to improve your testing regimen. If you had to select only one of them, which one would you pick, and why? In writing your answer, please address the following:

What are the advantages of picking the approach you selected compared with the other two approaches? What are the disadvantages?

What are the advantages of picking the other two approaches? What are the disadvantages?

Based on your analysis, why do you think it is best to select the approach you did?

Your answer should be 5 - 10 sentences in length.






Thank you for working through this exercise. If you are interested in exploring more of the tradeofs involved here, we invite you to think about the following.

    1. In the scenario described above, your tool was to be used for content moderation on a social media website that had not yet launched. However, suppose you were instead adding moderation to an existing social media site. How would you balance the benefts and harms associated with rolling out an imprecise content moderation tool with the benefts and harms of delaying the rollout of the tool until it had been improved and polished?

    2. In practice, your tool may face a tradeof in which it can either identify a smaller number of languages more precisely or a larger number of languages less precisely. What impact might this have if you tool was used in the context of content moderation? What benefts and harms may arise in each case? Who would be impacted, and how would you decide which approach to take?

You aren't expected to submit anything in response to these last two questions, though you are welcome to come and chat with us about them if you are interested!



Part Two: Rising Tides

Background: Our Model

Global sea levels have been rising, and the most recent data suggest that the rate at which sea levels are rising is increasing. This means that city planners in coastal areas need to start designing developments so that an extra meter of water doesn’t food people out of their homes.

Your task in this part of the assignment is to build a tool that models fooding due to sea level rise. To do so, we’re going to model terrains as grids of doubles, where each double represents the altitude of a particular square region on Earth. Higher values indicate higher elevations, while lower values indicate lower elevations. For example, take a look at the three grids to the right. Before moving on, take a minute to think over the following questions, which you don’t need to submit. Which picture represents a small hill? Which one represents a long, sloping incline? Which one represents a lowland area surrounded by levees?







Due Friday,   at 1:00 pm


Assignment Logistics Part One: Rosetta Stone

Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions















https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    16/21
2023/3/25 17:28    CS106B Fun with Collections

We can model the fow of water as follows. We’ll imagine that there’s a water source somewhere in the world and that we have a known height for the water. Water will then fow anywhere it can reach by moving in the four cardinal directions (up/down/left/right) without moving to a location at a higher elevation than the initial water height. For example, suppose that the upper-left corner of each of the three above worlds is the water source. Here’s what would be underwater given several diferent water heights:







Due Friday,   at 1:00 pm


Assignment Logistics


Part One: Rosetta Stone


Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions












A few things to notice here. First, notice that the water height is independent of the height of the terrain at its starting point. For example, in the bottom row, the water height is always two meters, even though the terrain height of the upper-left corner is either 0m or -1m, depending on the world. Second, in the terrain used in the middle column, notice that the water stays above the upper diagonal line of 4’s, since we assume water can only move up, down, left, and right and therefore can’t move diagonally through the gaps. Although there’s a lot of terrain below the water height, it doesn’t end up under water until the height reaches that of the barrier.

It’s possible for a grid to have multiple water sources. This might happen, for example, if we were looking at a zoomed-in region of the San Francisco Peninsula, we might have water to both the east and west of the region of land in the middle, and so we’d need to account for the possibility that the water level is rising on both sides. Here’s another set of images, this time showing where the water would be in the sample worlds above assume that both the top-left and bottom-right corner are water sources. (We’ll assume each water source has the same height.)







































https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    17/21
2023/3/25 17:28    CS106B Fun with Collections







Due Friday,   at 1:00 pm


Assignment Logistics


Part One: Rosetta Stone


Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions


Submission Instructions












Notice that the water overtops the levees in the central world, completely fooding the area, as soon as the water height reaches three meters. The water line never changes, regardless of the current elevation. As such, water will never food across cells at a higher elevation than the water line, but will food across cells at the same height or below the water line

Your task is to implement a function



Grid<bool> floodedRegionsIn(const Grid<double>& terrain,

const Vector<GridLocation>& sources,

double height);


that takes as input a terrain (given as a Grid<double>), a list of locations of water sources (represented as a Vector<GridLocation>; more on GridLocation later), and the height of the water level, then returns a Grid<bool> indicating, for each spot in the terrain, whether it’s under water (true) or above the water (false).

You may have noticed that we’re making use of the GridLocation type. This is a type representing a position in a Grid. You can create a GridLocation and access its row and column using this syntax:


GridLocation location;

location.row = 137;

location.col = 42;


GridLocation otherLocation = { 106, 103 }; // Row 106, column 103 otherLocation.row++; // Increment the row.
cout << otherLocation.col << endl; // Prints 103


Background: Breadth-First Search

Now that we’ve talked about the types involved here, let’s address how to solve this problem. How, exactly, do you determine what’s going to be underwater? Doing so requires you to determine which grid locations are both (1) below the water level and (2) places water can fow to from one of the sources.

Fortunately, there’s a beautiful algorithm you can use to solve this problem called breadth-frst search. The idea is to simulate having the water fow out from each of the sources at greater and greater distances. First, you consider the water sources themselves. Then, you

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    18/21
2023/3/25 17:28    CS106B Fun with Collections

visit each location one step away from the water sources. Then, you visit each location two steps away from the water sources, then three steps, four steps, etc. In that way, the algorithm ends up eventually fnding all places the water can fow to, and it does so fairly quickly!

Breadth-frst search is typically implemented by using a queue that will process every fooded location. The idea is the following: we begin by enqueuing each water source, or at least the sources that aren’t above the water line. That means that the queue ends up holding all the fooded locations zero steps away from the sources. We’ll then enqueue a next group of elements, corresponding to all the fooded locations one step away from the sources. Then we’ll enqueue all fooded locations two steps away from the sources, then three steps away, etc. Eventually, this process will end up visiting every fooded location in the map.

Let’s make this a bit more concrete. To get the process started, you add to the queue each of the individual water sources that happen to be at least as high as the grid cell they’re located in. From then on, the algorithm operates by dequeuing a location from the front of the queue. Once you have that location, you look at each of the location’s neighbors in the four cardinal directions. For each of those neighbors, if that neighbor is already fooded, you don’t need to do anything because the search has already considered that square. Similarly, if that neighbor is above the water line, there’s nothing to do because that square shouldn’t end up under water. However, if neither of those conditions hold, you then add the neighbor to the queue, meaning “I’m going to process this one later on.” By delaying processing that neighbor, you get to incrementally explore at greater and greater distances from the water sources. The process stops once the queue is empty; when that happens, every location that needs to be considered will have been visited.

At each step in this process, you’re removing the location from the queue that’s been sitting there the longest. Take a minute to convince yourself that this means that you’ll frst visit everything zero steps from a water source, then one step from a water source, then two steps, etc.

To spell out the individual steps of the algorithm, here’s some pseudocode for breadth-frst search:


create an empty queue;

for    (each water source at or below the water level) { flood that square;
add that square to the queue;

}


while (the queue is not empty) {

dequeue a position from the front of the queue;


for (each square adjacent to the position in a cardinal direction)

{

if (that square is at or below the water level and isn't yet flooded) {

flood that square;

add that square to the queue;

}

}

}


As is generally the case with pseudocode, several of the operations that are expressed in English need to be feshed out a bit. For example, the loop that reads


for (each square adjacent to the position in a cardinal direction)


is a conceptual description of the code that belongs there. It’s up to you to determine how to code this up; this might be a loop, or it might be several loops, or it may be something totally diferent. The basic idea, however, should still make sense. What you need to do is







Due Friday,   at 1:00 pm


Assignment Logistics Part One: Rosetta Stone

Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions


Submission Instructions




https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    19/21
2023/3/25 17:28    CS106B Fun with Collections

iterate over all the locations that are one position away from the current position. How you do that is up to you.

It’s important to ensure that the code you write works correctly here, and to help you with that we’ve provided a suite of test cases with the starter fles. These test cases look at a few examples, but they aren’t exhaustive. You’ll need to add at least one custom test case – and, preferably, more than one – using the handy STUDENT_TEST macro that you saw in the frst programming assignment. We don’t recommend copying one of the test cases from the handout, since (1) that’s really tedious and time-consuming and (2) it’s better to more directly test particular tricky behaviors with smaller, more focused examples.

Your Task

To summarize, here’s what you need to do for this assignment:




    1. Implement the floodedRegionsIn function in RisingTides.cpp. Your code should use the breadth-frst search algorithm outlined above in pseudocode to determine which regions are under water. Water fows up, down, left, and right and will submerge any region whose height is less than or equal to the global water level passed in as a parameter. Test as you go.

    2. Add in at least one custom test case into RisingTides.cpp – preferably, not one of the ones we listed above – and if necessary revisit your code to correct any errors you fnd.


Some notes on this problem:

Need a refresher on Grid type? Check the Stanford C++ Library Documentation.


The initial height of the water at a source may be below the level of the terrain there. If that happens, that water source doesn’t food anything, including its initial terrain cell. (This is an edge case where both the “food it” and “don’t food it” options end up being weird, and for consistency we decided to tiebreak in the “don’t food it” direction.)


The heights of adjacent cells in the grid may have no relation to one another. For example, if you have the topography of Preikestolen in Norway, you might have a cell of altitude 0m immediately adjacent to a cell of altitude 604m. If you have a low-resolution scan of Mount Whitney and Death Valley, you might have a cell of altitude 4,421m next to a cell of altitude -85m.


Curious about an edge case? Check the examples above.


Make sure, when passing large objects around, to do so by reference (or const reference, if appropriate) rather than by value. Copying large grids can take a very long time and make your code slow down appreciably.


Once you’re passing all the tests, click the “Rising Tides” option to see your code in action! This demo uses your code to food-fll the sample world you’ve chosen. Some of these sample worlds are large, and it might take a while for that food-fll to fnish. However, it should probably take no more than 30 seconds for the program to fnish running your code.

The demo app we’ve included will let you simulate changes to the sea levels for several geographical regions of the United States. The data here is mostly taken from the National Oceanographic and Atmospheric Administration (NOAA), which provides detailed topological maps for the US. Play around with the data sets – what do you fnd?

We also have some maps from Mars, supplied by a former CS106B student as an extension to their assignment! The Martian terrain includes many regions way above and way below 0 elevation, so try out some large positive and large negative altitudes to get the full experience here.

Part Three: (Optional) Extensions

If you'd like to run wild with these assignments, go for it! Here are some suggestions for additional features to add.







Due Friday,   at 1:00 pm


Assignment Logistics Part One: Rosetta Stone

Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions Submission Instructions


https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    20/21
2023/3/25 17:28    CS106B Fun with Collections

Rosetta Stone: The starter fles for this assignment include trigrams for over 200 languages. We made a reasonable efort to make sure those trigram fles were refective of how those languages are actually written, but there may be some trigram fles derived from corpora that don’t adequately represent the language. Investigate the quality of the trigrams fles for other languages and, if they seem of, fnd a new corpus and create a better trigrams fle. Let us know how you did it so we can update our attributions fle.


We know for a fact that our list of languages is incomplete. Do you speak any languages that aren’t represented here? Build a corpus for it and send it our way, along with a description of how you came up with it.







Due Friday,   at 1:00 pm


Assignment Logistics


Part One: Rosetta Stone


Background: Trigrams


Background: UTF-8


Milestone One: Form k-Grams


Milestone One Requirements


Milestone Two: Normalize Frequen


As you saw, the tool sometimes guesses the wrong language. When it does, is that because it's completely and totally fooled, or because it thought two or more languages were a good match? Try changing the function that guesses languages to return a set of the top n matches for some value of n (say, the top fve matches) along with its confdence in each. Tell us if you fnd anything interesting when you do!

Rising Tides: The sample terrains here have their data taken from the NOAA, which is specifc to the United States, but we’d love to see what other data sets you can fnd. Get some topographical data from other sources, modify it to ft our fle format, and load it into the program. (You can fnd information about the fle format in the res/CreatingYourOwnTerrains.txt fle.) What information – either for good or for ill – does that forecast?


Or, consider the following question: suppose you have a terrain map of your area and know your home’s position within that area. Can you determine how much higher the water can rise without fooding your home?

Submission Instructions

Before you call it done, run through our submit checklist to be sure all your ts are crossed and is are dotted. Make sure your code follows our style guide. Then upload your completed fles to Paperless for grading.

Please submit only the fles you edited; for this assignment, these fles will be:

RosettaStone.cpp


LanguageIdentifications.txt (don't forget this one!)


ShortAnswers.txt (don't forget this one!)


RisingTides.cpp


You don't need to submit any of the other fles in the project folder.


Submit to Paperless



If you modifed any other fles that you modifed in the course of coding up your solutions, submit those as well. And that’s it! You’re done!


Good luck, and have fun!



All course materials © Stanford University 2023

Milestone Three: Filter Out Uncom


Milestone Four: Implement Cosine


Milestone Five: Guess a Text’s Lan


Milestone Six: Explore and Evalua


Milestone Seven: Ethical Considera


Part Two: Rising Tides


Part Three: (Optional) Extensions


Submission Instructions


Website programming by Julie Zelenski with modifcations by Keith Schwarz • Styles adapted from Chris Piech • This page last updated 2023-Mar-20























https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a2/    21/21

More products