
Data Structures and Algorithms Project #3




In recent years there has been a great deal of interest in so-called “small-world” graphs [7]. These types of graphs are characterized by a high degree of local clustering and a small number of long-range connections, making them very efficient at transferring information. This “small-world” architecture underlies the well- known “six degrees of separation” phenomenon, in which it is believed that a connection can be found between any two people on the planet requiring no more than six intermediate links.

In this project, we represent climatological data as a graph and compute some characteristics of it to determine if it is a small-world graph.




Note: This project is to be completed individually. Your implementation must use C/C++ and your code must compile and run on the Linux machine general.asu.edu.




All dynamic memory allocation must be done yourself, i.e., using either malloc() and free(), or new and delete. You may not use any external libraries to implement any part of this project.




As always, a script will be used to check the correctness of your program. Therefore, absolutely no changes to these project requirements are permitted.

Use a version control system as you develop your solution to this project, e.g., Dropbox, Bitbucket, or GitHub. Your code repository should be private to prevent anyone from plagiarizing your work.







1 Arctic Sea Ice




Sea ice covers most of the Arctic Ocean and plays a significant role in the global water cycle and the global energy balance. It is also considered to be a sensitive indicator of climate change. Thus, any changes in the Earth’s climate are likely to first be seen in areas such as the high Arctic.

Since the 1970s, the areal extent of sea ice has been shrinking. In September of 2007, the mean sea ice extent was 1.65 million square miles, the lowest ever recorded for the month of September, shattering the previous record, set in 2005, by 23%. Current climate model projections indicate that the Arctic could be seasonally ice-free by 2050–2100, which will significantly impact the global climate [2].

Because of its importance as a proxy indicator of climate change, a great deal of research is conducted on Arctic sea ice. Data acquired by meteorological satellites provides one of the most effective ways to study large-scale changes in sea ice conditions in the Arctic. The longest continuous satellite record of sea ice comes from the Nimbus-7 Scanning Multi-channel Microwave Radiometer (SMMR) and Defense Meteorological Satellite Program Special Sensor Microwave/Imager (DMSP SSM/I) series of meteorological satellites. Data acquisition started in late 1978, with the first full year of data in 1979, and is ongoing. This data is maintained for the Arctic and Antarctic by the National Snow & Ice Data Center (NSIDC); see

http://nsidc.org/data/seaice_index/.

The sea ice concentration (SIC) anomaly data set that we will use consists of 27 years (1979–2005) of weekly SIC anomaly data derived from the SMMR-SSM/I passive microwave data set. An anomaly data set is one in which the long-term average is subtracted from the data to remove seasonal trends, making the data more amenable to statistical analysis.

The data for each week is a 304 × 448 floating-point array representing the northern hemisphere. The data value at each cell (x, y) in the array represents the percentage of deviation in ice concentration from the 27-year average for a given week. For example, looking at the values of the array for week 30 of 1990, at cell (100, 200) the value is -4.5. This means that at cell (100, 200) for week 30 of 1990, the sea ice concentration was 4.5% lower than the 27-year average value for week 30 for that cell.

Since there are 52 weeks per year and 27 years of data, there are 52 × 27 = 1,404 sea ice concentration readings for each position (x, y) over the years. Therefore, for each of the 304 × 448 = 136,192 positions there is a time series [x, y, t], 1 ≤ t ≤ 1,404, of SIC data with 1,404 values, starting at week 1 of 1979.

The data set is given as 1,404 files each containing a 304 × 448 32-bit floating point array (little-endian byte order). The format of the filenames is: diffwNNyYYYY+landmask, where wNN denotes week NN and yYYYY denotes the year. For example, diffw31y1983+landmask is the file for week 31 of 1983. The “+landmask” part of the name indicates that a landmask was applied to the data.

Since we are dealing with sea ice, land masses can be ignored; these constitute approximately half of the cells in each of the arrays. Land is denoted by the value 168. Missing data is denoted by the value of 157. Figure 1(a) is a sample SSM/I sea ice concentration image, which has been pseudo-coloured to make it easier to view. Each pixel corresponds to a nominal physical area of 25 square kilometers. There is a large circular disk over the North Pole, an area of missing data due to the satellite’s orbit. The satellite orbits from pole to pole (i.e., longitudinally), but at an incline, so there is a circular area that is not covered. Hence, the only missing data is in the circular region over the North Pole.
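Since each file is simply a raw dump of 304 × 448 little-endian 32-bit floats, reading one week's array takes only a few lines of C++. The sketch below is one possible approach, not a required interface; the function names are invented for illustration, and it assumes a little-endian host (true of typical x86 Linux machines), so no byte swapping is performed.

```cpp
#include <cstdio>
#include <vector>

// Dimensions of the full northern-hemisphere grid.
// (The Beaufort Sea subregion would use 63 x 63 instead.)
const int W = 304, H = 448;
const float LAND = 168.0f, MISSING = 157.0f;

// Read one week's 32-bit little-endian float array from a file such as
// "diffw31y1983+landmask". Returns an empty vector on failure.
// Assumes a little-endian host, so the bytes can be read directly.
std::vector<float> readWeek(const char *path) {
    std::vector<float> grid(W * H);
    std::FILE *f = std::fopen(path, "rb");
    if (!f) return std::vector<float>();
    size_t got = std::fread(&grid[0], sizeof(float), grid.size(), f);
    std::fclose(f);
    if (got != grid.size()) return std::vector<float>();
    return grid;
}

// A cell contributes to the graph only if it is neither land nor missing.
bool isOcean(float v) { return v != LAND && v != MISSING; }
```

Looping this reader over all 1,404 files, cell by cell, yields the per-cell time series described above.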

(a) Full Arctic region (b) Beaufort Sea subregion




Figure 1: Sample SSM/I total sea ice concentration image.




Figure 1(b) shows a subregion near the Beaufort Sea. The corresponding data set has only 63 × 63 = 3969 cells for each of the 27 years. This smaller data set is otherwise identical to the full data set and should be used for code development and testing.




1.1 A Graph Representation for the Climatological Data Set




Recall that our data set consists of arrays representing sea ice concentration for each week of the 27 years 1979–2005. The data set can be thought of as a stack of two-dimensional (304 × 448) arrays, ordered according to time. Since there are 27 years in the data set and the data is weekly, there are 52 × 27 = 1,404 arrays in the stack. The data set can therefore be visualized as a three-dimensional array consisting of a stack of two-dimensional arrays. For each cell (x, y), there is a time series [x, y, t], 1 ≤ t ≤ 1,404, of values in the stack.

How is a graph constructed from this type of climatological data? Tsonis et al. derived a correlation-based graph G = (V, E) [1, 6]. The vertex set V corresponds to the cells, i.e., for the full SIC data set, the graph has 304 × 448 vertices. To determine the edge set, the Pearson correlation coefficient (see §1.1.1) is calculated between all pairs of distinct cells (x, y) and (x′, y′), 1 ≤ x, x′ ≤ 304, 1 ≤ y, y′ ≤ 448, (x, y) ≠ (x′, y′), of time series vectors. That is, the correlation coefficient is computed between [x, y, t] and [x′, y′, t], 1 ≤ t ≤ 1,404, for each possible pair of cells. Since there are n = 136,192 cells, there are n(n − 1)/2 ≈ 9.27 × 10⁹ pairs of cells for which the correlation coefficient must, in principle, be calculated (in practice, land cells can be skipped, which roughly halves n). If the correlation coefficient for a pair of cells (x, y) and (x′, y′) of time series, i.e., [x, y, t] and [x′, y′, t], 1 ≤ t ≤ 1,404, is greater than some threshold rthresh, then an edge is inserted between cells (x, y) and (x′, y′). The final result is a graph with edges between all cells having a correlation greater than the threshold rthresh.

Use an adjacency list to represent the graph because it is expected to be reasonably sparse.
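One way to honour the memory-allocation rule above is a hand-rolled linked adjacency list built with new and delete. This is a minimal sketch rather than a required design; the struct and member names are invented for illustration.

```cpp
#include <cstddef>

// One node in a vertex's singly linked neighbour list.
struct EdgeNode {
    int to;         // neighbour vertex id
    EdgeNode *next; // next neighbour in the list
};

// An undirected graph over vertices 0..n-1, stored as an array of
// adjacency lists. All allocation is done explicitly with new/delete.
struct Graph {
    int n;
    EdgeNode **adj; // adj[v] heads the neighbour list of vertex v

    explicit Graph(int n_) : n(n_), adj(new EdgeNode*[n_]) {
        for (int v = 0; v < n; ++v) adj[v] = NULL;
    }
    ~Graph() {
        for (int v = 0; v < n; ++v)
            while (adj[v]) { EdgeNode *e = adj[v]; adj[v] = e->next; delete e; }
        delete[] adj;
    }
    // Insert the undirected edge {u, v}: one node in each list.
    void addEdge(int u, int v) {
        EdgeNode *a = new EdgeNode; a->to = v; a->next = adj[u]; adj[u] = a;
        EdgeNode *b = new EdgeNode; b->to = u; b->next = adj[v]; adj[v] = b;
    }
    // Degree of v, by walking its list.
    int degree(int v) const {
        int d = 0;
        for (EdgeNode *e = adj[v]; e; e = e->next) ++d;
        return d;
    }
};
```

The degree() walk also gives the degree distribution needed in §2 with a single pass over all vertices.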




1.1.1 Pearson Correlation Coefficient




To get a measure of how strongly two vectors X and Y , of length n, are related, we use the correlation coefficient. Correlation is concerned with trends: if X increases, does Y tend to increase or decrease? How much? How strong is this tendency?

The Pearson correlation coefficient measures the strength and direction of a linear relationship between X and Y. The formula for the sample correlation coefficient, denoted r, is:




r = Sxy / √(Sxx · Syy)

where

Sxx = Σ_{i=1..n} (Xi − X̄)²,

Syy = Σ_{i=1..n} (Yi − Ȳ)², and

Sxy = Σ_{i=1..n} (Xi − X̄)(Yi − Ȳ).


In our case the vector X corresponds to the time series of length 1,404 for cell (x, y), whereas the vector Y corresponds to the time series of the same length for cell (x′, y′). You will need to compute the sample correlation coefficient between all pairs of distinct cells (x, y) and (x′, y′), (x, y) ≠ (x′, y′). For a given pair of cells (x, y) and (x′, y′), if |r| ≥ rthresh then you should insert an edge in the graph between the vertex corresponding to (x, y) and the vertex corresponding to (x′, y′). (The absolute value |r| of the correlation coefficient should be used, since r can be positive or negative.)
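As a concrete reference, the Sxx, Syy, Sxy formulas above translate directly into a short routine over the two mean-centred series. This is a sketch under one assumption of this example (not a project requirement): a constant series, for which Sxx or Syy is zero and r is undefined, returns 0.

```cpp
#include <cmath>

// Sample Pearson correlation coefficient between two series of length n,
// following the Sxx, Syy, Sxy definitions in Section 1.1.1.
// Returns 0 when either series is constant (r would be undefined) --
// a convention chosen here, e.g. for all-zero anomaly cells.
double pearson(const double *x, const double *y, int n) {
    double mx = 0.0, my = 0.0;
    for (int i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;

    double sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (int i = 0; i < n; ++i) {
        double dx = x[i] - mx, dy = y[i] - my;
        sxx += dx * dx;   // sum of squared deviations of X
        syy += dy * dy;   // sum of squared deviations of Y
        sxy += dx * dy;   // sum of cross deviations
    }
    if (sxx == 0.0 || syy == 0.0) return 0.0;
    return sxy / std::sqrt(sxx * syy);
}
```

In this project n = 1,404, and an edge is inserted whenever |pearson(x, y, 1404)| ≥ rthresh.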







2 Analyses of the Climatological Graph




We are interested in whether the graph derived from the SIC climatological data is a small-world graph. First, construct a correlation-based graph Gr = (Vr , Er ) for the sea ice anomaly data set for each correlation threshold rthresh ∈ {0.95, 0.925, 0.9}. (Start with the largest threshold, as it will yield the sparsest graph.)




Now, for each correlation-based graph Gr :




1. Plot the histogram of the degree distribution of Gr . If |Vr | = n, the degree of a vertex can range from zero (the vertex is isolated) to n − 1 (the vertex has an edge to every other vertex in the graph). A histogram of the degree distribution for Gr therefore plots a count of the number of vertices of each degree d, 0 ≤ d ≤ n − 1.

Recall that a small-world graph is characterized by a high degree of local clustering and a small number of long-range connections. As a result, we expect the degree distribution of a small-world graph to be long-tailed.




2. Compute the number of connected components in Gr and their size (i.e., number of vertices in each component). Gr with n vertices may consist of from one to n connected components. A connected component, C = (VC , EC ), is a subgraph of Gr such that VC ⊆ Vr and EC ⊆ Er . For any pair of vertices u, v ∈ VC , there exists a path from u to v consisting of a finite number of edges in EC . Similarly, every
edge in EC has endpoints in the vertices of C . Therefore, every vertex within a component is reachable from any other vertex in that component, but there are no paths between components. Discovery of connected components is implemented in this project using a recursive depth-first traversal of Gr .

For a small-world graph, what do you think the component structure should look like?




3. Another way to determine if the graph Gr is a small-world graph is to calculate the mean clustering coefficient γ(Gr ) and the characteristic path length (or diameter), L(Gr ), of Gr and compare them to those of a random graph Grandom of the same size (i.e., the same number of vertices). (These characteristics are defined in §2.1 and §2.2, respectively.) For a small-world graph, γ(Gr ) ≫ γ(Grandom ) and L(Gr ) ≥ L(Grandom ) [5, 7].

Compute the clustering coefficient, γ(Gr ), and the characteristic path length, L(Gr ), of the graph Gr .




4. Compare γ(Gr ) and L(Gr ) to γ(Grandom ) and L(Grandom ) for the random graph, Grandom , of compa- rable size. (See §2.3 for details.)




For the milestone and full project deadlines, you will need to write a report that summarizes your findings for the 27-year period. Be sure to work on the “small” Beaufort Sea data set first.




2.1 Clustering Coefficient




The neighbourhood N (v) of a vertex v consists of all the vertices adjacent to v. The graph generated by N (v), denoted ⟨N (v)⟩, has vertex set N (v), and its edges are all edges of the graph with both endpoints in N (v). Use k(v) and e(v) to denote the numbers of vertices and edges in ⟨N (v)⟩. The clustering coefficient γv of v is:




γv = e(v) / (k(v)(k(v) − 1)/2) = 2e(v) / (k(v)(k(v) − 1)).

In words, γv for a vertex v is the number of edges between the vertices within its neighbourhood divided by the number of edges that could possibly exist between them.

The clustering coefficient of a graph G is the mean of the clustering coefficient of all vertices of G and is denoted γ(G).




2.2 Characteristic Path Length




Let di,j be the length of the shortest path between the vertices i and j. Then the characteristic path length L(G) for the graph G = (V, E) is di,j averaged over all n(n − 1)/2 pairs of vertices, where n = |V |. The Floyd-Warshall all-pairs shortest paths algorithm may be useful here.




2.3 Metrics for Random Graphs




A random graph Grandom corresponding to Gr has the same number of vertices as Gr , namely n = |Vr |. However, edges are inserted into Grandom at random such that the edge density of Grandom matches the edge density of Gr , i.e., the probability p that edges are present in Grandom should match that of Gr . This edge probability p is calculated from Gr having n vertices and mean vertex degree k as:




p = (nk/2) / (n(n − 1)/2) = k / (n − 1).
This calculation is derived intuitively by reasoning that the fraction of actual edges, nk/2, out of the total number of possible edges, n(n − 1)/2, should approximate the edge probability of the graph.

The clustering coefficient of a random graph Grandom is γ(Grandom ) = k/n. Similarly, the characteristic path length of a random graph Grandom is L(Grandom ) = log n / log k. Here, n is the total number of vertices in the correlation graph Gr , and k is the mean vertex degree of Gr .

Given the degree distribution of Gr , it is straightforward to compute k.

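The random-graph baselines of this section are easy to tabulate once k is known. A sketch, with helper and struct names invented here for illustration:

```cpp
#include <cmath>

// Baseline metrics for a random graph with n vertices and mean degree k,
// per Section 2.3: p = k/(n-1), gamma = k/n, L = log n / log k.
struct RandomGraphMetrics {
    double p, gamma, L;
};

RandomGraphMetrics randomGraphMetrics(double n, double k) {
    RandomGraphMetrics m;
    m.p = k / (n - 1.0);
    m.gamma = k / n;
    m.L = std::log(n) / std::log(k); // base cancels in the ratio
    return m;
}

// Mean degree from a degree histogram: hist[d] = number of vertices of
// degree d, for 0 <= d <= maxDegree, over n vertices in total.
double meanDegree(const int *hist, int maxDegree, int n) {
    long long total = 0;
    for (int d = 0; d <= maxDegree; ++d) total += (long long)d * hist[d];
    return (double)total / n;
}
```

Note that log n / log k is the same for any logarithm base, so natural log is used here.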
2.4 Submission Instructions




All submissions are electronic. This project has two submission deadline dates.




1. The milestone deadline date is Monday, 11/14/2016.




2. The final project deadline date is on Monday, 11/28/2016.




It is your responsibility to submit your project well before the deadline!!! Late projects will not be accepted. Do not expect the clock on your machine to be synchronized with the one on Blackboard!




Multiple submissions are allowed. The last submission will be graded.




2.5 Requirements for Milestone Deadline




By the milestone deadline, your project must compute the degree distribution and the number and size of connected components in a correlation-based graph derived from the “small” Beaufort Sea SIC data set.




Submit electronically, on Monday, 11/14/2016, using the submission link on Blackboard for the Project #3 milestone, a zip file named yourFirstName-yourLastName.zip (do not use any archiving program other than zip) containing the following:




Project State (5%): In a folder (directory) named State provide a brief report (.txt, .doc, .docx, .pdf) that addresses the following:




1. Explain all design decisions. Discuss your representation of the graph, and any optimizations you made in computations. (Depending on the order in which you calculate the statistics, you can likely save and make use of previous results to cut down on the computation time.) Can you compute a worst-case bound on the time and/or space of your algorithms?

2. Describe any problems encountered in your implementation for this project milestone.

3. Describe any known bugs in your project milestone.

4. While this project is to be completed individually, describe any significant interactions with anyone

(peers or otherwise) that may have occurred.

5. Cite any external code bases, books, and/or websites used or referenced.




Implementation (50%): In a folder (directory) named Code provide:




1. In one or more files, your well-documented C/C++ source code implementing this project milestone.

2. A Makefile that compiles your program to an executable named p3 on the Linux machine general.asu.edu. Our TA will write a script to compile and run all student submissions on general.asu.edu; therefore, executing the command make p3 in the Code directory must produce the executable p3, also located in the Code directory.




Report (20%): In a folder (directory) named Report provide a report (.doc, .docx, .pdf ) containing the following:




1. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold

rthresh ∈ {0.95, 0.925, 0.9}, plot the degree distribution.

(a) Do you think the degree distribution is consistent with that of a small-world graph? Why or why not?





(b) Identify any supernodes, i.e., vertices with significantly higher vertex degree than the average, and where they occur. Describe how you determined what qualifies as a supernode.

2. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold rthresh ∈ {0.95, 0.925, 0.9}, compute the number of connected components in Gr and their size (i.e., number of vertices).

(a) For a small-world graph, how do you think the component structure should look?

(b) Do your results support your hypothesis?




Correctness (25%): The correctness of your program will be determined by running your program on the

“small” data set for a given threshold.




The milestone is worth 30% of the total project grade.




2.6 Requirements for Final Project Deadline

For the full project deadline, your project must perform all of the analyses described in §2 on correlation-based graphs derived from the “small” Beaufort Sea SIC data set. If you can run the analyses on the full data set, a bonus of up to 10% is available.




Submit electronically, on Monday, 11/28/2016, using the submission link on Blackboard for the complete Project #3, a zip file named yourFirstName-yourLastName.zip containing the following:




Project State (5%): Follow the same instructions for Project State as in §2.5.




Implementation (40%): Follow the same instructions for Implementation as in §2.5.




Report (25%): In addition to the analyses run for the Report for the milestone deadline (§2.5), you must also run the following analyses:




1. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold rthresh ∈ {0.95, 0.925, 0.9}, compute the clustering coefficient, γ(Gr ), and the characteristic path length, L(Gr ), of the graph Gr .

2. Compute the clustering coefficient, γ(Grandom ), and the characteristic path length, L(Grandom ), of a random graph Grandom corresponding to each Gr .

(a) Compare γ(Gr ) and L(Gr ) to γ(Grandom ) and L(Grandom ) for the random graph, Grandom , of comparable size.

(b) What conclusion can you draw from your results?




Correctness (30%): The same instructions for Correctness as in §2.5 apply.




Bonus (10%): Repeat the analyses on the full data set. (I know that this is a lot of extra work for a bonus; it is a good indicator of the scalability of your code.)





