We noted that three of the short genes validated in our study (YnhF, YshB, and YqfG) were annotated in a screen with the program Gene Trek in Prokaryotic Space, which also combines filtering for the presence of RBSs and homology, though with different parameters (Kosuge et al., 2006).

Keywords: Information theory, genome annotation, membrane proteins

== Introduction ==

One of the challenges inherent in characterizing the proteome of an organism is the isolation and identification of small proteins. It is difficult to reliably annotate genes encoding small proteins computationally, since these genes frequently lack sufficient sequence for domain and homology determination (Basrai et al., 1997; Blattner et al., 1997; Cliften et al., 2001; Consortium, 2004; Rudd et al., 1998). The small size of these genes also limits the frequency with which they are disrupted in random genetic screens (Basrai et al., 1997; Kastenmayer et al., 2006). The problem of identifying small polypeptides is further compounded by difficulties involved in using standard proteomic techniques to isolate and identify proteins smaller than 10 kDa (Garbis et al., 2005). Although proteins of 16-50 amino acids (herein referred to as small proteins) are difficult to predict, isolate and characterize, an increasing body of evidence shows that these polypeptides have important cellular and intercellular functions. For example, in the bacterium Bacillus subtilis, the Sda protein (46 amino acids) represses aberrant sporulation by inhibiting the activity of the KinA kinase (Burkholder et al., 2001; Rowland et al., 2004). In eukaryotes, small proteins play important roles at both the cellular and organismal levels. Recent work has shown that three previously unannotated essential small proteins in yeast are members of the kinetochore complex and are necessary for proper chromosome segregation (Miranda et al., 2005). Small proteins are important components of photosystem II in plants (Shi and Schröder, 2004).
In animals, cationic antimicrobial peptides are a first-line defense against pathogen attack (Gallo and Nizet, 2003), and many hormones are peptides derived from larger proteins (Canaff et al., 1999). These examples illustrate the diverse functions of small proteins across species, and suggest that future studies of small proteins will provide new biological insights.

Relatively few E. coli proteins of 16-50 amino acids have been characterized. Most of the characterized small proteins are members of three different categories: leader peptides, ribosomal proteins, or toxic proteins. Leader peptides have been identified upstream of 11 genes, which primarily encode proteins involved in amino acid metabolism. In these cases, translation of the short open reading frame (ORF) regulates transcription and/or translation of the downstream genes during periods of amino acid starvation (reviewed in Yanofsky, 2000). It is still unknown whether these leader peptides have independent functions after they are translated, although the peptides can accumulate upon overexpression (Gong et al., 2006). The E. coli ribosome also contains a number of relatively small proteins, and two components of the 50S subunit, L36 (encoded by rpmJ) and L34 (encoded by rpmH), are proteins of fewer than 50 amino acids. One ribosome-associated protein, Sra (also denoted S22 and RpsV), is also only 45 amino acids in length. The small proteins in the third category can be toxic to cells, especially when overexpressed. Their toxicity can be mitigated by co-expression of a corresponding antitoxin protein that blocks activity or an antisense small RNA that blocks expression. Included in this group are members of the Hok family. This toxic gene family was originally identified on plasmids, but intact hok genes are also encoded on some E. coli genomes (Gerdes et al., 1997; Rudd et al., 1998).
In plasmids, the Hok system ensures that cells retain the plasmid during replication; however, the function of the chromosomally encoded toxic genes is still unclear. The 35-amino acid protein LdrD expressed from one of the long direct repeat (LDR) sequences in E. coli K-12 has been shown to be toxic when overproduced, and it is likely that the homologous LdrA, LdrB and LdrC proteins expressed from the other three copies of the LDR sequences are also toxic at high levels (Kawano et al., 2002). Overexpression of the 18- or 19-amino acid Ibs proteins encoded by the five copies of the E. coli K-12 SIB repeats is similarly toxic (Fozo et al., 2008). Three other small proteins shown to be toxic at elevated levels are the 48-amino acid entericidin B protein (Bishop et al., 1998), the 29-amino acid TisB protein (Vogel et al., 2004) and the 26-amino acid ShoB protein (Fozo et al., 2008). The mechanisms by which elevated levels of these proteins kill cells or inhibit growth are not known, though all are predicted to be membrane proteins.

Aside from the three classes of proteins listed above, only a few small proteins have been identified in E. coli. As already mentioned, it has been difficult to reliably annotate small proteins (Ochman, 2002). Automated annotation methods usually either under-annotate or over-annotate small proteins. Among the sequenced E. coli strains, the number of annotated short ORFs ranges from 0 in strain APEC O1 to 323 in strain E24377A. A simple way to improve the accuracy.