Computing Multiple Correspondence Analysis
In this section we briefly describe how to compute multiple correspondence analysis with MultipleCar (the stand-alone application) and with MultipleCar.m (the Matlab script). First we present a simple dataset, which can be downloaded from the free-download area of our web site. Then we explain how to load data from files. Finally, we explain how to run the analysis.
Dataset analysed in this tutorial
This example studies the rate of foodborne illness caused by the consumption of vegetables and fruit in the European Union and the United States (Callejón, Rodríguez-Naranjo, Ubeda, Hornedo-Ortega, Garcia-Parrilla & Troncoso, 2015). The illnesses in the analysis were Norovirus, Salmonella spp., Escherichia coli, Campylobacter spp., Shigella spp., Clostridium spp., Staphylococcus spp., Yersinia spp., Bacillus spp., Other viruses, and Other microorganisms. The vegetables were Salad (all produce items related to salad), Leafy (all produce related to leaves), Tomato, and Other vegetables. The fruits were Sprouts (all produce related to sprouts), Berries, Melon, Juices, and Other fruits. For a detailed explanation of the variables, see Table 1 in Callejón et al. (2015).
MultipleCar can read data from text files and from Matlab files (i.e., the typical .mat data files). To replicate our analysis, you should download the file vegetables.dat. This is a text file that contains three columns of data: the first column is the variable illness, the second is the variable vegetables/fruits, and the third is the variable region. Figure 6 shows an extract of the information contained in the file.
Figure 6. Extract of the data contained in file vegetables.dat
… … …
4 3 2
4 7 2
4 9 2
5 1 2
5 2 2
5 2 2
7 1 2
7 1 2
9 1 2
9 9 2
10 2 2
etc
As can be observed in the figure, each observation must be coded with numerical values, and the values on a line must be separated by white space. This file contains the raw data: one line per observation.
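To make the file format concrete, the following sketch parses a few rows in this layout using Python. It is only an illustration of the format, not part of MultipleCar: the function name parse_raw_data is hypothetical, and the sample rows are copied from the extract above.

```python
def parse_raw_data(text, n_columns=3):
    """Parse whitespace-separated integer codes into a list of rows.

    Hypothetical helper for illustration; MultipleCar has its own loader.
    """
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # splits on any run of white space
        if not fields:
            continue  # ignore blank lines
        if len(fields) != n_columns:
            raise ValueError(
                f"line {line_number}: expected {n_columns} values, "
                f"got {len(fields)}"
            )
        rows.append([int(field) for field in fields])
    return rows


# Columns: illness, vegetable/fruit, region (rows taken from the extract).
sample = """4 3 2
4 7 2
4 9 2
5 1 2
"""

X = parse_raw_data(sample)
print(X)  # [[4, 3, 2], [4, 7, 2], [4, 9, 2], [5, 1, 2]]
```

A line with a missing or extra value raises an error here, which is a convenient way to catch coding mistakes before running an analysis.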
To make the output of the analysis easier to interpret, the numerical codes can be given labels. The text file vegetables_labels.txt contains the labels for the three variables (see Figure 7).
Figure 7. Labels contained in file vegetables_labels.txt
Norovirus Salmonella E-coli Campylobacter Shigella Clostridium Staphylococcus Yersinia Bacillus Other-viruses Other-microorganisms Salad Leafy Tomato Other-vegetables Sprouts Berries Melon Juices Other-fruits EU USA
Each label must be a single word: spaces within a label are not allowed. Please note that Matlab users can use more complex labels with more complex Matlab variables (see, for example, http://es.mathworks.com/help/matlab/ref/strings.html). In addition, please note that the labels for the three variables (illness, vegetable/fruit and region) must appear in the same order as the columns in the raw data (the same applies when analysing Indicator or Burt contingency matrices).
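The relation between codes and labels can be sketched in Python as follows. This is an illustration under two assumptions that are ours and not taken from MultipleCar: codes start at 1, and within each variable the labels are listed in code order. The category counts (11, 9 and 2) come from the lists in the dataset description; the function names are hypothetical.

```python
# Labels as listed in vegetables_labels.txt, in variable order.
LABELS = (
    "Norovirus Salmonella E-coli Campylobacter Shigella Clostridium "
    "Staphylococcus Yersinia Bacillus Other-viruses Other-microorganisms "
    "Salad Leafy Tomato Other-vegetables Sprouts Berries Melon Juices "
    "Other-fruits EU USA"
).split()

# Number of categories per variable: illness, vegetable/fruit, region.
CATEGORY_COUNTS = [11, 9, 2]


def split_labels(labels, counts):
    """Split the flat label list into one list of labels per variable."""
    if sum(counts) != len(labels):
        raise ValueError("label count does not match category counts")
    groups, start = [], 0
    for count in counts:
        groups.append(labels[start:start + count])
        start += count
    return groups


def decode_row(row, groups):
    """Translate one row of codes into labels (assumes 1-based codes)."""
    decoded = []
    for code, group in zip(row, groups):
        if not 1 <= code <= len(group):
            raise ValueError(f"code {code} is out of range")
        decoded.append(group[code - 1])
    return decoded


groups = split_labels(LABELS, CATEGORY_COUNTS)
print(decode_row([4, 3, 2], groups))  # ['Campylobacter', 'Tomato', 'USA']
```

If the labels were listed in a different order from the columns of the raw data, a sketch like this would silently attach the wrong names, which is why the ordering requirement matters.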
To load data from files
Data can be loaded from text files, but also from Matlab data files. The next video shows how to load the three text files presented in the section above.
As you can see in the video, the information in each data file is loaded as a variable that must be given a name. The name of each variable should be short and comprehensible, and should contain no special characters (like the addition symbol ‘+’). Once all the variables have been loaded, we advise you to save the information in a Matlab data file: this data file can be loaded in the future if the data need to be reanalysed. The last steps in the video show you how to save data in this way, and how to load it. Matlab users can load any Matlab file (*.mat), and all the variables in the file will be available in MultipleCar.
How to compute multiple correspondence analysis
The next video shows how to compute multiple correspondence analysis.
We are analysing the raw data stored in the variable X. Because X is a matrix that contains raw data, the option Raw data is selected in the drop-down box; this is the first step in the video. You can use the “Show it” button to check that you have selected the correct variable.
Figure 8. MultipleCar configured to analyse vegetable data.
Once the analysis has finished, you will find the results in the text file output.txt in your working folder. The output file is also opened in the Windows application Notepad. The figure can be manipulated to make the configuration easier to inspect; Matlab users may have more options for this manipulation.
Once the analysis has finished, two new variables are available: MCAR_Indicator_Matrix and MCAR_Burt_Matrix. These two matrices can be analysed in the future, so we suggest that you save them. Computing the Indicator matrix is a time-consuming process: if the matrix has already been computed, new analyses of the same data can be much faster. The next video shows how to compute multiple correspondence analysis from the Indicator matrix. Note that the first analysis, using the raw matrix, took 0.685 seconds, while the second, using the Indicator matrix, took 0.23 seconds. With large datasets, the difference could be substantial.
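For readers unfamiliar with these two matrices, the following Python sketch shows their standard definitions: the Indicator matrix Z has one 0/1 column per category of each variable, and the Burt matrix is the product of the transpose of Z with Z. This is not MultipleCar's own code; the function names are hypothetical, codes are assumed to start at 1, and the small dataset is invented for the example.

```python
def indicator_matrix(rows, counts):
    """Build the 0/1 Indicator matrix from rows of 1-based category codes.

    counts[k] is the number of categories of variable k.
    """
    Z = []
    for row in rows:
        z_row = []
        for code, count in zip(row, counts):
            if not 1 <= code <= count:
                raise ValueError(f"code {code} is out of range 1..{count}")
            one_hot = [0] * count
            one_hot[code - 1] = 1  # mark the observed category
            z_row.extend(one_hot)
        Z.append(z_row)
    return Z


def burt_matrix(Z):
    """Compute the Burt matrix B = Z'Z (all two-way cross-tabulations)."""
    n_cols = len(Z[0])
    return [
        [sum(row[i] * row[j] for row in Z) for j in range(n_cols)]
        for i in range(n_cols)
    ]


# Invented toy data: 3 observations, 2 variables with 2 categories each.
rows = [[1, 2], [2, 1], [1, 1]]
Z = indicator_matrix(rows, [2, 2])
B = burt_matrix(Z)
print(Z)  # [[1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 1, 0]]
print(B)  # [[2, 0, 1, 1], [0, 1, 1, 0], [1, 1, 2, 0], [1, 0, 0, 1]]
```

The diagonal blocks of the Burt matrix hold the category frequencies of each variable, and the off-diagonal blocks hold the contingency tables between pairs of variables. The Burt matrix no longer contains the individual observations.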
The next video shows how to compute multiple correspondence analysis from the Burt matrix: in this case we opted for a joint correspondence analysis. Please note that the number of options available when computing multiple correspondence analysis from the Burt matrix is limited.
Callejón, R. M., Rodríguez-Naranjo, M. I., Ubeda, C., Hornedo-Ortega, R., Garcia-Parrilla, M. C., & Troncoso, A. M. (2015). Reported foodborne outbreaks due to fresh produce in the United States and European Union: Trends and causes. Foodborne Pathogens and Disease, 12(1), 32-38.