Quantitative analysis is never as neat and tidy as one would like it to be. At least, I’ve found that it is often hard to be graceful when you’re dealing with large amounts of data, with the need to clean it up, transform it, present it, and, finally, interpret it. And, then, of course, there is the issue of documenting your process and method, which can also be tedious. Data management and analysis are hard-to-master art forms.

This is especially true in geography, where you are typically looking at multiple aspects of a large sample of places, and presenting the greatest amount of relevant information in the most convincing way makes for a messy task. Not only is it messy in the planning and development stages, but the mess carries all the way through to presentation. (Life would be so much easier with a typesetting system like LaTeX, which gives the writer far more control over presentation, except that so many people still edit documents collaboratively, which LaTeX doesn't make easy.)

My purpose here today is to share my workflow, including its shortcomings and how I would like to improve it, as well as to continue the analysis of Texas, which I seem to be fixated upon at the moment. In this post, I outline the workflow; in a subsequent post, I will present the data.

The method I describe here is cluster analysis, a technique that in theory condenses a tremendous amount of information for easier access, yet in practice is quite messy. Essentially, cluster analysis groups objects based on their similarity. It is mainly an exploratory technique, useful for generating hypotheses, and there are many possible algorithms for sorting the observations. In a chapter of my dissertation, I used a hierarchical (agglomerative) method, which begins with every observation in its own cluster and then merges two clusters at each stage. This defers the issue of selecting the number of clusters in advance; at each stage, the method merges the pair of clusters whose union produces the smallest increase in the within-cluster sum of squares (Ward's criterion, which is closely related to ANOVA). The method strives for maximum internal homogeneity in each cluster on the dependent variables. Post-estimation techniques are available to help confirm the optimal number of clusters.
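The hierarchical idea can be sketched in a few lines of Python. This is a toy illustration with made-up points, not my dissertation code: starting from singleton clusters, it repeatedly merges the pair whose union least increases the within-cluster sum of squares.

```python
from itertools import combinations

def sse(cluster):
    # Within-cluster sum of squares around the cluster centroid.
    dims = len(cluster[0])
    centroid = [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]
    return sum((p[d] - centroid[d]) ** 2 for p in cluster for d in range(dims))

def ward_merge(points, k):
    # Start with every observation in its own cluster, then repeatedly
    # merge the pair whose union least increases the total sum of squares.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: sse(clusters[ij[0]] + clusters[ij[1]])
                           - sse(clusters[ij[0]]) - sse(clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters

# Two obvious groups of toy "counties" (two standardized variables each).
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]
groups = ward_merge(pts, 2)
print(sorted(len(c) for c in groups))  # → [2, 3]
```

Note that stopping at `k` clusters here is only for demonstration; in the dissertation work the full merge tree was built and post-estimation tests guided where to cut it.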

Here, I use k-means clustering, which has three steps. First, the number of clusters is specified. Second, the location (centroid) of each cluster is initialized. Third, each object is assigned to the nearest centroid, the centroids are recomputed, and the assignment and update steps repeat. Convergence is achieved when no reassignment can improve the fit. It should be noted that, because the initial centroids are typically chosen at random, k-means can group the same set of observations differently on each independent attempt. That is, I could run the same process today and discover an entirely new membership for each observation. Another main difference between k-means and hierarchical methods is that in the former the number of clusters is designated in advance. There are also differences in post-estimation tests.
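The three steps can be sketched as follows. This is a plain-Python toy illustration (Lloyd's algorithm); real work would of course use a statistical package:

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, iters=100):
    # Step 2: initialize centroids (here, random distinct observations).
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Step 3a: assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Step 3b: move each centroid to the mean of its members.
        new = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new.append(tuple(sum(x) / len(members) for x in zip(*members))
                       if members else centroids[c])
        if new == centroids:  # no assignment changed: converged
            break
        centroids = new
    return labels, centroids

# Step 1: specify k = 2 for two obvious toy groups.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels, cents = kmeans(pts, 2)
print(labels[0] == labels[1] == labels[2], labels[3] == labels[4])  # → True True
```

Changing `seed` changes the starting centroids, which is exactly why independent runs can produce different memberships on less well-separated data.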

Let me now move on to the more mundane aspects.

**My workflow process**

Here is the process by which I assembled data and performed a cluster analysis.

- Data collection
- I gathered data from four datasets: employment (Quarterly Workforce Indicators, www.ledextract.ces.census.gov); unemployment (Bureau of Labor Statistics, http://www.bls.gov/lau/); poverty and income (Small Area Income & Poverty Estimates, US Census, http://www.census.gov/did/www/saipe/data/statecounty/); and, number of banks and deposits (from FDIC Summary of Deposits, https://www2.fdic.gov/sod/).
- The geography was Texan counties, of which there are 254 (big sample).
- I downloaded the relevant data for the most recent available date, which differs somewhat among the datasets. Where the data were not available for all observations, I used the latest release that had complete coverage:
- Employment data: end of 2012
- Population (total size, Hispanic): end of 2012
- Unemployment: June 2014
- Banking: June 2014

- I then merged the datasets in Stata and saved the result as the master file.
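The merge itself is conceptually simple: match rows across datasets on the county identifier and combine their columns, keeping only counties present everywhere. A hypothetical sketch in Python (the file contents, FIPS codes, and column names here are invented for illustration; the actual merge was done in Stata):

```python
import csv
import io

def load(f, key="fips"):
    # Read a CSV into a dict keyed by county FIPS code.
    return {row[key]: row for row in csv.DictReader(f)}

# Stand-ins for the downloaded dataset files (values are made up).
employment = load(io.StringIO("fips,emp\n48001,100\n48003,200\n"))
unemployment = load(io.StringIO("fips,unemp_rate\n48001,5.1\n48003,6.2\n"))

def merge(datasets):
    # Keep counties present in every dataset; combine their columns into one row.
    common = set.intersection(*(set(d) for d in datasets))
    return {fips: {k: v for d in datasets for k, v in d[fips].items()}
            for fips in common}

master = merge([employment, unemployment])
print(master["48001"])  # → {'fips': '48001', 'emp': '100', 'unemp_rate': '5.1'}
```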

- Running the cluster analysis
- The first step is to standardize the variables, placing them all on the same scale so that no variable dominates the distance calculations simply because of its units.
- There was a long trial-and-error process here: I would select various combinations of variables, perform post-estimation tests, and create maps to determine what worked best. Bear in mind, I am trying to sort counties according to local economic structure, which explains why I initially included different sectors. To make a long story short, I ultimately settled on four variables: total population; the unemployment rate; the poverty rate (all ages); and the share of oil and gas employment in total employment (all variables were standardized). Often, the maps produced from an iteration would just not seem right, so I would return to the drawing board and repeat the process. Scrutinizing a map may not sound particularly scientific, but the bigger problem is that there is no hard rule about what constitutes an appropriate threshold for the post-estimation tests, so my subjective knowledge of the geography is as valid as a seemingly objective mathematical test.
- For each set of variables, I would run the analysis with at least two and at most 13 cluster groups. I would then invoke a cluster stop rule, which produces an F-index (the Calinski–Harabasz pseudo-F). When comparing the F-index values across cluster solutions, the point is to locate the highest values. You would then take the solutions with the greatest F-index values and look at the frequency with which observations appear in each group (important questions here: does one group contain almost all the observations, or are there many groups with only one observation?). Then you could summarize the means of the dependent variables for each cluster and see how different they are.
- Then, if you really wanted to be rigorous, you could run ANOVA tests with the cluster group identifier as the factor and each of the dependent variables (population, unemployment, poverty, oil/gas) as the outcome. A statistically significant F-test in ANOVA tells us whether or not the clusters are distinct from one another on that variable. However, again, no test has been recognized by scientific consensus as the most accurate or reliable way to determine the number of clusters. Some users of cluster analysis would pursue further statistical techniques; my choice is to generate maps and use my knowledge of the geography to gauge reliability.
- I will justify the selection of the four variables in the next post.
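To illustrate the standardization step above: each variable is converted to z-scores (subtract its mean, divide by its standard deviation) so that all variables contribute comparably to the distance calculations. A minimal sketch:

```python
from statistics import mean, stdev

def standardize(values):
    # Convert to z-scores: subtract the mean, divide by the (sample)
    # standard deviation.
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

z = standardize([10, 20, 30, 40])
# z now has mean 0 and sample standard deviation 1.
```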
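On the stop rule: the F-index Stata reports is the Calinski–Harabasz pseudo-F, the ratio of between-cluster to within-cluster variance adjusted for the number of clusters. A sketch with toy data (not the county data):

```python
def pseudo_f(points, labels):
    # Calinski–Harabasz pseudo-F: between-cluster over within-cluster
    # variance, adjusted for k clusters and n observations.
    # Higher values suggest a more distinct clustering.
    n, dims = len(points), len(points[0])
    ks = sorted(set(labels))
    k = len(ks)
    grand = [sum(p[d] for p in points) / n for d in range(dims)]
    bss = wss = 0.0
    for c in ks:
        members = [p for p, l in zip(points, labels) if l == c]
        cent = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        bss += len(members) * sum((cent[d] - grand[d]) ** 2 for d in range(dims))
        wss += sum((p[d] - cent[d]) ** 2 for p in members for d in range(dims))
    return (bss / (k - 1)) / (wss / (n - k))

# A well-separated two-cluster split scores far higher than a scrambled one.
pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
good = pseudo_f(pts, [0, 0, 0, 1, 1, 1])
bad = pseudo_f(pts, [0, 1, 0, 1, 0, 1])
print(good > bad)  # → True
```

Comparing this index across the two-to-13-cluster solutions, and favoring the peaks, is what the stop rule automates.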
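And the ANOVA check amounts to computing a one-way F statistic for each variable across the cluster groups. A sketch with hypothetical numbers:

```python
def anova_f(groups):
    # One-way ANOVA F statistic: ratio of between-group to within-group
    # mean squares. A large F suggests the group (cluster) means differ.
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Hypothetical unemployment rates for counties in two clusters.
f = anova_f([[4.0, 4.5, 5.0], [9.0, 9.5, 10.0]])
print(round(f, 1))  # → 150.0
```

The F statistic would then be compared against the F distribution with (k − 1, n − k) degrees of freedom for significance.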

- Creating tables and maps to present the information
- The final cluster analysis I performed identified eight cluster groups for the four dependent variables. I created several tables:
- One to show the mean values of the dependent variables, as well as number of counties, for each cluster group.
- Another to show the mean values on a set of variables not used in constructing the cluster scheme, partly as a further robustness check but also as a means of analysis.
- And finally, a table showing the average quarterly change in number of jobs (not employment; this is the flow measure for ‘job creation’) from 2009Q3 to 2013Q3. Again, including this data allows for a comparative analysis for hypothesis-testing.
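The first table, per-cluster means plus county counts, can be assembled in a few lines. The values below are hypothetical stand-ins, purely for illustration:

```python
def cluster_means(rows, labels, variables):
    # Mean of each variable within each cluster, plus the county count.
    table = {}
    for c in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == c]
        table[c] = {v: sum(r[v] for r in members) / len(members)
                    for v in variables}
        table[c]["n_counties"] = len(members)
    return table

# Hypothetical standardized values for five counties in two clusters.
rows = [{"pov": 0.2, "unemp": 0.1}, {"pov": 0.4, "unemp": 0.3},
        {"pov": 2.0, "unemp": 1.8}, {"pov": 2.2, "unemp": 2.0},
        {"pov": 2.1, "unemp": 1.9}]
tbl = cluster_means(rows, [0, 0, 1, 1, 1], ["pov", "unemp"])
print(tbl[1]["n_counties"], round(tbl[1]["pov"], 2))  # → 3 2.1
```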

- I exported the county identifiers as well as their cluster group identities into a .csv file and then converted that file into a .dbf file. This step is necessary for correctly adding the county cluster identities to a shapefile for Texas counties. I combined the .dbf file with the shapefile using QGIS and then moved the new shapefile over to TileMill, where I tinkered with the color scheme and added a legend (TileMill is less clunky than QGIS).

This is the process by which the data were collected, cleaned, and applied for the purpose of cluster analysis. In the next post, I present the results, advance some hypotheses, and discuss my thoughts on both the workflow and the empirical findings.