Using characteristic features of ECM proteins and a computational pipeline combining interrogation of protein and gene databases, we previously defined the matrisome as the ensemble of ECM and ECM-associated proteins. In mammals, the matrisome represents 4% of the genome, or approximately 1000 genes. We further classified the matrisome into core matrisome components, consisting of collagens, proteoglycans, and glycoproteins (including laminins, fibronectins, etc.), and matrisome-associated components, including proteins that could incorporate into ECMs or are co-purified with ECM proteins. The matrisome-associated components are further subdivided into ECM-affiliated proteins (e.g., C-type lectins, galectins, annexins, semaphorins, syndecans, and glypicans), ECM regulators (e.g., MMPs, ADAMs, and crosslinking enzymes), and secreted factors (e.g., TGF-β, BMPs, FGFs, Wnt proteins, and chemokines). More recently, we employed a computational approach to predict the in-silico matrisome of the zebrafish. Defining the matrisome of organisms has been instrumental in annotating transcriptomic and proteomic data and has permitted the identification of ECM signatures of biological processes and of human diseases, including cancers and fibrosis.
Here, we devised a novel bioinformatic pipeline combining gene orthology and de novo identification to define the C. elegans matrisome. We report the identification of 719 genes potentially encoding ECM and ECM-associated proteins, including 181 collagens, of which 173 are predicted to be components of the cuticle. Based on their collagen-domain organization, we propose to group these cuticular collagens into five novel clusters and to further divide them into sub-clusters. In addition, we demonstrate that the newly defined C. elegans matrisome can be used to annotate data from high-throughput RNAi screens as well as transcriptomic and proteomic datasets, and can assist with the identification of ECM genes or signatures relevant to various physiological and pathological processes.
The workflow and steps for defining the C. elegans matrisome are outlined in Fig. 1.
To better classify and study the 185 collagen-domain-containing proteins in C. elegans, we propose a novel nomenclature based on their collagen-domain organization and the presence of additional characteristic protein domains (e.g., C-type lectin; C4, the collagen IV NC1 domain; TSP; FNIII), similar to the mammalian collagen classification. To do so, we clustered the 181 collagens and the 4 other collagen-domain-containing proteins into four major groups: (1) the vertebrate-like collagens (similar to mammalian type IV, XVIII, XXV), (2) the collagen-domain-containing proteins with mammalian orthologues (collectins and gliomedin), (3) the non-cuticular collagens with no clear orthology to mammalian collagens, and (4) the cuticular collagens. This last group is the largest, with 173 collagens, and we further propose to subdivide it into five main clusters (A to E). For detailed comparison and to facilitate the dissemination of this proposed classification, we constructed the C. elegans collagen database, CeColDB, available at: http://CeColDB.permalink.cc/.
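As a quick consistency check on the counts quoted above, the arithmetic can be sketched as follows. The variable names are illustrative, and the split of the eight non-cuticular collagens between groups (1) and (3) is not stated in the text, so it is left unresolved here.

```python
# Counts taken from the text; variable names are illustrative only.
collagens = 181
other_collagen_domain_proteins = 4   # collagen-domain proteins that are not collagens
cuticular_collagens = 173            # group (4), subdivided into clusters A to E

# All collagen-domain-containing proteins considered in the classification.
total = collagens + other_collagen_domain_proteins

# Collagens left for groups (1) and (3) combined; the text gives no per-group split.
non_cuticular_collagens = collagens - cuticular_collagens

# Share of collagens that are predicted cuticle components.
cuticular_share = cuticular_collagens / collagens

print(total)                      # 185
print(non_cuticular_collagens)    # 8
print(f"{cuticular_share:.1%}")   # 95.6%
```

On these numbers, the cuticular group accounts for roughly 96% of all collagens, which is why it is the only group subdivided further.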
RNA sequencing and proteomics are standard techniques used by many C. elegans research laboratories to elucidate physiological and pathological processes. In addition, genome-wide RNA interference (RNAi) screens are commonly used to identify the mechanism(s) underlying phenotypes of interest. To demonstrate the applicability and power of our matrisome definition and classification, we used the Matrisome Annotator we developed here (http://ce-matrisome-annotator.permalink.cc/) to re-annotate existing datasets. We first re-analyzed our previously published transcriptomic study aimed at identifying genes involved in longevity. We found 79 matrisome genes among the 426 transcriptionally upregulated genes when comparing long-lived C. elegans under reduced insulin/IGF-1 signaling with short-lived C. elegans that lack the oxidative stress transcription factor SKN-1/Nrf1,2,3 (Supplementary Table 5). Although we previously recognized the upregulated collagens and potentially secreted proteases, the re-annotation of this dataset paints a more complete picture and points to a remodeling of the ECM in long-lived C. elegans. Our list can also be used to annotate proteomic datasets. Here, we re-annotated a proteomic dataset from a recently published study of longevity in C. elegans. Whereas that study highlighted 11 collagens, we found 25 matrisome proteins among the 177 upregulated proteins when comparing the long-lived germ stem cell mutant glp-1 with wild-type C. elegans (Supplementary Table 6). The additionally identified matrisome proteins include laminin A and B (EPI-1 and LAM-1), prolyl 4-hydroxylase (DPY-18), and secreted proteases (Supplementary Table 6). Together with the 11 previously identified collagens, this suggests a potential remodeling of the ECM in long-lived C. elegans, consistent with the findings from the mRNA expression profile.
Last, we set out to re-annotate data from a whole-genome RNAi screen aimed at identifying antifungal innate immunity genes, since this would help to identify the functional importance of matrisome genes. We found that 18 of the 297 gene hits that regulate antimicrobial peptide gene expression are matrisome genes (Supplementary Table 7). These 18 matrisome genes include six cuticular collagens, three secreted proteases, and one collagen cross-linking enzyme (Supplementary Table 7), suggesting a potential role for strengthening or stiffening of the ECM to form a protective barrier against fungal infections.
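The re-annotation step used in these three analyses amounts to looking up each gene of a dataset in the matrisome list. The following is a minimal sketch of that idea, not the implementation of the Matrisome Annotator. The lookup table is a hypothetical three-gene excerpt, the division and category labels assigned to those genes are assumptions made for illustration, and act-1 (an actin gene) stands in for a non-matrisome hit. The hit counts are the ones reported in the text.

```python
from collections import Counter

# Hypothetical excerpt of a matrisome lookup table: gene -> (division, category).
# The category assignments are illustrative assumptions, not the published annotation.
MATRISOME = {
    "epi-1": ("Core matrisome", "ECM glycoproteins"),
    "lam-1": ("Core matrisome", "ECM glycoproteins"),
    "dpy-18": ("Matrisome-associated", "ECM regulators"),
}


def annotate(genes, matrisome=MATRISOME):
    """Return {gene: (division, category)} for the genes present in the table."""
    return {gene: matrisome[gene] for gene in genes if gene in matrisome}


# act-1 is not in the table, so it is dropped from the annotated hits.
hits = annotate(["epi-1", "dpy-18", "act-1"])
division_counts = Counter(division for division, _ in hits.values())

# (matrisome hits, total hits) as reported in the text for each re-annotated dataset.
REPORTED = {
    "transcriptomic": (79, 426),
    "proteomic": (25, 177),
    "RNAi screen": (18, 297),
}
fractions = {name: m / total for name, (m, total) in REPORTED.items()}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

With the reported counts, matrisome genes make up about 18.5% of the transcriptomic hits, 14.1% of the proteomic hits, and 6.1% of the RNAi screen hits.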