Starting from:
$30

$24

CSE Data Structures Project #3 Solution

In recent years there has been a great deal of interest in \small-world" graphs [5]. These types of graphs are characterized by a high degree of local clustering and a small number of long-range connections, making them very e cient at transferring information. This \small-world" architecture underlies the well-known \six degrees of separation" phenomenon, in which it is believed that a connection can be found between any two people on the planet requiring no more than six intermediate links.

In this project, we represent climatological data as a graph and compute some characteristics of it to determine if it is a small-world graph.

Note: This project is to be completed individually. Your implementation must use C/C++ and ultimately your code must run on the Linux machine general.asu.edu.

Any dynamic memory allocation must be done yourself, using either malloc() and free(), or new() and delete(). You may not use any external libraries to implement any part of this project, aside from the standard libraries for I/O, le I/O, and string functions. If you are in doubt about what you may use, ask!

You must use a version control system as you develop your solution to this project, e.g., GitHub or similar. Your code repository must be private to prevent anyone from plagiarizing your work.

The rest of this project description is organized as follows. x1 describes the sea ice data set and how to represent it as a graph. x2 describes the types of analyses to be performed on such graphs. x3 describes the submission requirements for the milestone and the full project deadlines.

    • Arctic Sea Ice

Sea ice covers most of the Arctic Ocean and plays a signi cant role in the global water cycle and the global energy balance. It is also considered to be a sensitive indicator of climate change. Thus, any changes in the Earth’s climate are likely to rst be seen in areas such as the high Arctic.

Since the 1970s, the areal extent of sea ice has been shrinking. In September of 2007, the mean sea ice extent was 1.65 million square miles, which was the lowest ever recorded for the month of September, shattering the previous record in 2005 by 23%. Current climate model projections indicate that the Arctic could be seasonally ice-free by 2050{2100, which will signi cantly impact the global climate [2].

Because of its importance as a proxy indicator of climate change, a great deal of research is conducted on Arctic sea ice. Data acquired by meteorological satellites provides one of the most e ective ways to study large-scale changes in sea ice conditions in the Arctic. The longest continuous satellite record of sea ice comes from the Nimbus-7 Scanning Multi-channel Microwave Radiometer (SMMR) and Defense Meteorological Satellite Program Special Sensor Microwave/Imager (DMSP SSM/I) series of meteorological satellites. Data acquisition started in late 1978, with the rst full year of data in 1979, and is ongoing. This data is maintained for the Arctic and Antarctic by the National Snow & Ice Data Center (NSIDC).

1.1    Sea Ice Concentration Anomaly Data Set

The sea ice concentration (SIC) anomaly data set that we will use consists of 27 years (1979{2005) of weekly SIC anomaly data derived from the SMMR-SSM/I passive microwave data set. An anomaly data set is when the long-term average is subtracted from the data, to remove seasonal trends, making the data more amenable to statistical analysis.


1

The data for each week is for a 304 448 oating point array representing the northern hemisphere. The data value at each cell (x; y) in the array represents the percentage of deviation in ice concentration from the 27-year average for a given week. For example looking at the values of the array for week 30 of 1990, at cell (100; 200), the value is -4.5. This means that at cell (100; 200) for week 30 of 1990, the sea ice concentration was 4.5% lower than the 27-year average value for week 30 for that cell.

Since there are 52 weeks per year and 27 years of data, there are 52 27 = 1; 404 sea ice concentration readings for each position (x; y) over the years. Therefore, for each of the 304 448 = 136; 192 positions there is a time series [x; y; t], 1 t 1; 404, of SIC data with 1,404 values, starting at week 1 of 1979.

The data set is given as 1,404 les each containing a 304 448 32-bit oating point array (little-endian byte order). The format of the lenames is: diffwNNyYYYY+landmask, where wNN denotes week NN and yYYYY denotes the year. For example, diffw31y1983+landmask is the le for week 31 of 1983. The \+landmask" part of the name indicates that a landmask was applied to the data.

Since we are dealing with sea ice, land masses can be ignored; these constitute approximately half of the cells in each of the arrays. Land is denoted by the value 168. Missing data is denoted by the value of 157. Figure 1(a) is a sample SSM/I sea ice concentration image, which has been pseudo-coloured to make it easier to view. Each pixel corresponds to a nominal physical area of 25 square kilometers. There is a large circular disk over the North Pole, an area of missing data due to the satellite’s orbit. The satellite orbits from pole to pole (i.e., longitudinally), but at an incline, so there is a circular area that is not covered. Hence, the only missing data is in the circular region over the North Pole.
















(a) Full Arctic region    (b) Beaufort Sea subregion

Figure 1: Sample SSM/I total sea ice concentration image.

Figure 1(b) shows a subregion near the Beaufort Sea (Ellesmere Island is on the lower right.) The corresponding data set has only 63 63 = 3969 cells for each of the 16 years (1990{2005). This smaller data set is otherwise identical to the full data set and should be used for code development and testing.

1.2    A Graph Representation for the Climatological Data Set

How is this type of climatological data represented as a graph? Tsonis et al. derived a correlation-based graph G = (V; E) [1, 4]. The vertex set V corresponds to the cells, i.e., for the full SIC data set the graph has 304 448 vertices. To determine the edge set, the Pearson correlation coe cient (see x1.2.1) is calculated between all pairs of cells (x; y) and (x0; y0), 1 x; x0 304, 1 y; y0 448, (x; y) 6= (x0; y0), of time series vectors. That is, the correlation coe cient is computed between [x; y; t] and [x0; y0; t], 1 t 1; 404, for each pair of distinct cells.

Given that there are n = 136; 192 cells and there are n(n 1)=2 pairs of cells, the correlation coe cient is calculated for 3,547,116 pairs of time series. If the correlation coe cient for a pair of cells (x; y) and (x0; y0) of time series, i.e., [x; y; t] and [x0; y0; t], 1 t 1; 404, is greater than some threshold rthresh, then an edge

2

is inserted between cells (x; y) and (x0; y0). The nal result is a graph with edges between all cells having a correlation greater than the threshold rthresh.

1.2.1    Pearson Correlation Coe  cient

To get a measure of how strongly two vectors X and Y , of length n, are related, we use the correlation coe cient. The Pearson correlation coe cient measures the strength and direction of a linear relationship between X and Y . The formula for the sample correlation coe cient, denoted by r is:



r =


Sxy











p









where,




SxxSyy






n


n

Y )2


n
Sxx =    (Xi
X)2; Syy =    (Yi


; and Sxy =   (Xi   X)(Yi   Y ):
X


X






Xi



















i=1


i=1




=1





In our case the vector X corresponds to the time series of length 1,404 for cell (x; y), whereas the vector

    • corresponds to the time series of the same length for cell (x0; y0). You will need to compute the sample correlation coe cient between all pairs of distinct cells (x; y) and (x0; y0). For a given pairs of cells (x; y) and (x0; y0), if jrj rthresh then you should insert an edge in the graph between the vertex corresponding to (x; y) and the vertex corresponding to (x0; y0). (The absolute value jrj of the correlation coe cient should be used, since r can be positive or negative.)

    • Analyses of the Climatological Graph

We are interested in whether the graph derived from the SIC climatological data is a small-world graph. First, construct a correlation-based graph Gr = (Vr; Er) for the sea ice anomaly dataset for each correlation threshold rthresh 2 f0:95; 0:925; 0:9g. Use an adjacency list of represent the graph. (Start with the largest threshold as it will yield the sparsest graph.)

Now, for each correlation-based graph Gr:

1. Plot the histogram of the degree distribution of Gr. If jVrj = n vertices, the degree of a vertex can range from zero (the vertex is isolated) to n 1 (the vertex has an edge to every other vertex in the graph). A histogram of the degree distribution for Gr therefore plots a count of the number of vertices of each degree d, 0 d n 1.

Recall that a small-world graph is characterized by a high degree of local clustering and a small number of long-range connections. As a result, we expect degree distribution of a small-world graph to have a long-tailed distribution.

    2. Compute the number of connected components in Gr and their size (i.e., number of vertices in each component). Gr with n vertices may consist of from one to n connected components. A connected component, C = (VC ; EC ), is a subgraph of Gr such that VC Vr and EC Er. For any pair of vertices u; v 2 VC , there exists a path from u to v consisting of a nite number of edges in EC . Similarly, every edge in EC has endpoints in the vertices of C. Therefore, every vertex within a component is reachable from any other vertex in that component, but there are no paths between components. The discovery of connected components is implemented in this project using a recursive depth- rst traversal of Gr.

For a small-world graph, what do you think the component structure should look like?

    3. Another way to determine if the graph Gr is a small-world graph is to calculate the mean clustering coe cient (G) and the characteristic path length (or diameter), L(G), of Gr and compare it to a


3
random graph Grandom of the same size (i.e., with the same number of vertices). (These characteristics are de ned in x2.1 and in x2.2, respectively.) For a small-world graph, (Gr) (Grandom) and L(Gr) L(Grandom) [3, 5].

Compute the clustering coe    cient,    (Gr), and the characteristic path length, L(Gr), of the graph Gr.

    4. Compare (Gr) and L(Gr) to (Grandom) and L(Grandom) for the random graph, Grandom, of compa-rable size. (See x2.3 for details.)

2.1    Clustering Coe  cient

The neighbourhood N(v) of a vertex v consists of all the vertices adjacent to v. The graph generated by N(v), hN(v)i has vertex set N(v) and its edges are all edges of the graph with both endpoints in N(v). Use k(v) and e(v) to denote the numbers of vertices and edges in hN(v)i. The clustering coe cient v of v is:

v =
e(v)
=

2e(v)
:








k(2v)

k(v)(k(v)  1)

In words, v for a vertex v is the proportion of edges between the vertices within its neighbourhood divided by the number of edges that could possibly exist between them.

The clustering coe cient of a graph G is the mean of the clustering coe cient of all vertices of G and is denoted (G).

2.2    Characteristic Path Length

Let di;j be the length of the shortest path between the vertices i and j. Then the characteristic path length L(G) for the graph G = (V; E), is di;j averaged over all n2 pairs of vertices, where n = jV j. Use the Floyd-Warshall all-pairs shortest paths algorithm.

2.3    Metrics for Random Graphs

A random graph Grandom corresponding to Gr has the same number of vertices as Gr, namely n = jVrj. However, edges are inserted into Grandom at random such that the edge density of Grandom matches the edge density of Gr, i.e., the probability p that edges are present in Grandom should match that of Gr. This edge probability p is calculated from Gr having n vertices and mean vertex degree k as:




nk



nk




k





p =

2


=

2


=


:









n(n
1)











n






n
1





2



2






This calculation is derived intuitively by reasoning that the fraction of actual edges
nk
out of the total

2

number of possible edges
n
should approximate the edge probability of the graph.








2


















The clustering coe cient of a random graph Grandom is (Grandom) = nk . Similarly, the characteristic path length of a random graph Grandom is L(Grandom) = loglog nk . Here, n is the total number of vertices in the correlation graph Gr, and k is the mean vertex degree of Gr.

Given the degree distribution of Gr it is straightforward to compute k.

    • Submission Instructions

Submissions are always due before 11:59pm on the deadline date.

    1. For SLN 84794 (MW) the milestone is due on Wednesday, 11/20/2019. For SLN 83657 (TTh) the milestone is due on Thursday, 11/21/2019. See x3.1 for requirements.

4

    2. For SLN 84794 (MW) the complete project is due on Wednesday, 12/04/2019. For SLN 83657 (TTh) the complete project is due on Thursday, 12/05/2019. See x3.2 for requirements.

It is your responsibility to submit your project well before the time deadline!!! Late projects are not accepted. Do not expect the clock on your machine to be synchronized with the one on Canvas!

An unlimited number of submissions are allowed. The last submission will be graded.

3.1    Requirements for Milestone Deadline

By the milestone deadline, your project must compute the degree distribution and number and size of con-nected components in a correlation based graph derived from the \small" Beaufort Sea SIC data set.

Using the submission link on Canvas for the Project #3 milestone, a zip1 le named yourFirstName-yourLastName.zip containing the following:

Project State (5%): In a folder named State provide a brief report that addresses the following:

    1. Explain all design decisions. Discuss your representation of the graph, and any optimizations you made in computations. (Depending on the order in which you calculate the statistics, you can likely save and make use of previous results to cut down on the computation time.) Can you compute a worst-case bound on the time and/or space of your algorithms?

    2. Describe any problems encountered in your implementation for this project milestone.

    3. Describe any known bugs in your project milestone.

    4. While this project is to be completed individually, describe any signi cant interactions with anyone (peers or otherwise) that may have occurred.

    5. Cite any external code bases, books, and/or websites used or referenced.

Implementation (50%): In a folder named Code provide:

    1. In one or more les, your well documented C/C++ source code implementing this project mile-stone.

    2. A makefile that compiles your program to an executable named seaice on the Linux machine general.asu.edu. Our TA will write a script to compile and run all student submissions on general.asu.edu; therefore executing the command make seaice in the Code directory must produce the executable seaice also located in the Code directory.

Report (20%): In a folder named Report provide a report containing the following:

    1. For the \small" data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold rthresh 2 f0:95; 0:925; 0:9g, plot the degree distribution.
        (a) Do you think the degree distribution is consistent with that of a small-world graph? Why or why not?

        (b) Identify any supernodes, i.e., vertices with signi cantly higher vertex degree than the average, and where they occur. Describe your determination of supernode.

    2. For the \small" data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold rthresh 2 f0:95; 0:925; 0:9g, compute the number of connected components in Gr and their size (i.e., number of vertices).

        (a) For a small-world graph, how do you think the component structure should look?

        (b) Do your results support your hypothesis?


1Do not use any other archiving program except zip.

5

Correctness (25%): The correctness of your program will be determined by running your program on the \small" data set for a given threshold.

The milestone is worth 30% of the total project grade.

3.2    Requirements for Complete Project Deadline

For the full project deadline, your project must perform all the analyses described in x2 on correlation based graphs derived from the \small" Beaufort Sea SIC data set. If you can run the analyses on the full data set, a bonus of up to 10% is available.

Using the submission link on Canvas for the complete Project #3, a zip2 le named yourFirstName-yourLastName.zip containing the following:

Project State (5%): Follow the same instructions for Project State as in x3.1.

Implementation (40%): Follow the same instructions for Implementation as in x3.1.

Report (25%): In addition to the analyses run for the Report for the milestone deadline (x3.1), you must also run the following analyses:

    1. For the \small" data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold rthresh 2 f0:95; 0:925; 0:9g, compute the clustering coe cient, (Gr), and the characteristic path length, L(Gr), of the graph Gr.

    2. Compute the clustering coe cient, (Grandom), and the characteristic path length, L(Grandom), of a random graph Grandom corresponding to each Gr.

        (a) Compare (Gr) and L(Gr) to (Grandom) and L(Grandom) for the random graph, Grandom, of comparable size.

        (b) What conclusion(s) can you draw from your results?

Correctness (30%): The same instructions for Correctness as in x3.1 apply.

Bonus (10%): Repeat the analyses on the full data set. (I know that this is a lot of extra work for a bonus; it is a good indicator of the scalability of your code.)

Acknowledgements

Thanks to my former student Wayne S. Chan who motivated this project.

References

    [1] J. P. Onnela, K. Kaski, and J. Kertesz. Clustering and information in correlation based nancial networks. European Physical Journal B, 38(2):353{362, March 2004.

    [2] J. T. Overpeck, M. Sturm, J. A. Francis, D. K. Perovich, M. C. Serreze, R. Benner, E. C. Carmack, F. S. Chapin III, S. C. Gerlach, L. C. Hamilton, L. D. Hinzman, M. Holland, H. P. Huntington, J. R. Key, A. H. Lloyd, G. M. MacDonald, J. McFadden, D. Noone, T. D. Prowse, P. Schlosser, and C. Vorosmarty. Arctic system on trajectory to new, seasonally ice-free state. Earth Observation Science, 86(34):309{316, August 2005.

    [3] A. A. Tsonis, K. L. Swanson, and P. J. Roebber. What do networks have to do with climate? Bulletin of the American Meteorological Society, pages 585{595, May 2006.

    [4] M. Tumminello, D. Matteo, T. Aste, and R. N. Mantegna. Correlation based networks of equity returns sampled at di erent time horizons. European Physical Journal B, 55:209{217, 2007.

    [5] D. J. Watts and S. H. Strogatz. Collective dynamics of \small-world" networks. Nature, 393:440{442, 1998.


2Do not use any other archiving program except zip.

6

More products