
A bioinformatician's guide to serovars and antigenic formulae in Salmonella

Posted on October 31, 2022
an AI generated picture (Midjourney) with prompt; 'halloween bacteria'. You can share and adapt this image following a CC BY-SA 4.0 licence

In this post I will describe the basics of Salmonella serovar nomenclature. It is meant as a primer for bioinformaticians starting to work with Salmonella genomes, who are often at a loss.

Serovar designations under the White-Kauffmann-Le Minor scheme (also known as the Kauffmann-White scheme) are a standard method for describing groups within Salmonella enterica. Serovar designations are often consistent with sequence-based typing and genome-spanning phylogenies (Achtman et al., 2012; Ashton et al., 2016). Groups defined by Salmonella serovar are indeed meaningful, and the serovar names, such as Typhimurium (Tie-fi-mu-ri-um) or Choleraesuis (Ko-le-ra-su-is), have fantastic mouth feel when pronounced. These designations will thus persist for the foreseeable future.

Serotyping in Salmonella is based mainly on surface antigens, although phenotypic characteristics that can be tested in the laboratory are also used in certain cases. Information about pathogenicity or niche can also give helpful hints. There are over 2,600 known serotype profiles (Grimont and Weill, 2007; Issenhuth-Jeanjean et al., 2014). Serotypes are written as strings of numbers and symbols, which often break scripts that read comma-delimited files because the strings themselves contain commas. 'II 4,12:b:1,5' is one such example. The formulae do mean something, even though they are not immediately interpretable. For simplicity, you can treat an antigenic formula as a unique and distinct string identifier. If the entire string is not identical, it is not the same serovar.
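To illustrate the comma problem, here is a minimal sketch using Python's standard csv module. The sample name and column headers are made up for the example; the point is only that a quoting-aware writer and reader keep the formula in one field, whereas a naive split on commas does not.

```python
import csv
import io


def write_table(rows):
    """Write rows as CSV text; fields containing commas are quoted automatically."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def read_table(text):
    """Read CSV text back into rows, honouring the quoting."""
    return list(csv.reader(io.StringIO(text)))


# Hypothetical two-column table: a sample name and its antigenic formula.
rows = [
    ["sample_id", "antigenic_formula"],
    ["sample_A", "II 4,12:b:1,5"],
]
text = write_table(rows)

# Naive splitting cuts the formula at its internal commas.
naive_fields = text.splitlines()[1].split(",")

# The csv reader returns the formula as a single field.
parsed_fields = read_table(text)[1]
```

If you control the file format, a tab-delimited file avoids the issue altogether, since the formulae do not contain tabs.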

Why do serovars have names like 'Typhimurium'?

Many Salmonella serovars have a specific name as well as the serotype. This is in contrast to other organisms like Escherichia coli (e.g. O157:H7 is simply that). The naming system has evolved with our understanding of the organism. Each serotype of Salmonella was originally believed to be a separate species and was given a descriptive name, usually describing host specificity (Ryan, O'Dwyer and Adley, 2017). As it became clear that the given names were not always correct, the convention became to name new serovars after the country or city of isolation (Carpenter, 1968). In 1966, perhaps because there were too many serotypes to name, the International Enterobacteriaceae Subcommittee formally rescinded serovar names for serovars belonging to subspecies other than S. enterica subspecies enterica (although at the time subspecies enterica was called subgenus enterica) (Carpenter, 1968). Finally, in 1987, Le Minor and Popoff noted that studies using DNA-DNA hybridisation and other methods showed that Salmonella comprised only two species, and that these and the subspecies within them were not consistent with serotype. Practically speaking, the growing number of serovars (2,100 at the time) made the one-serovar-one-species concept 'untenable' (Le Minor and Popoff, 1987). By this point, however, the names were firmly embedded in the literature, and they remain in use to this day. Serovars commonly encountered in the literature include Typhi, Typhimurium, Paratyphi A, Paratyphi B, Paratyphi C, Choleraesuis, Infantis, Dublin, Enteritidis, Heidelberg, Javiana, and Newport. All serovars with specific names have an underlying antigenic formula; for instance, the antigenic formula for Infantis is I 6,7,14:r:1,5.

What are the 'O' antigens and 'serogroups'?

O antigen (O polysaccharide) is part of the lipopolysaccharide (LPS) component of the outer membrane found in all Enterobacteriaceae. It is a highly variable cell constituent, making it a discriminating marker. It consists of oligosaccharide repeats ('O' units), normally containing two to eight sugar residues. The variation is mostly in the types of sugars present, their order in the structure, and the linkages between them. There are 46 O serogroups described in the White-Kauffmann-Le Minor scheme. These used to be denoted by letters of the alphabet, but the scheme ran out of letters, and the number of the characteristic O factor is now used instead (Liu et al. 2014). The letter definitions are still propagated in the literature. For example, Infantis (I 6,7,14:r:1,5) is serogroup O:7 (or C1 using the letter definitions). Wikipedia has a simple table of the serogroups and some of their members. Grimont and Weill (2007) is a more detailed catalog. Genetic variation within the O-antigen gene cluster (between gnd and galF), genetic variation of other associated genes encoded on mobile genetic elements, and side chains of the lipopolysaccharide have been used to further divide members of different serogroups (Liu et al. 2014).


Gram-negative bacterial cell-wall structure with emphasis in the lipopolysaccharide (LPS) presence in the outer membrane. From Monteiro & Faciola (2020).

The 'H' antigen and phase variation in Salmonella

H antigens are found in the bacterial flagella and are usually involved in the activation of host immune responses. There are over 114 H antigens described in the White-Kauffmann-Le Minor scheme.

Most Salmonella have two genes encoding flagella, and these are alternately expressed (through a process called phase variation). The alternation is controlled by inverting an 800-base-pair sequence of DNA. This acts like a switch to turn an encoded promoter on or off. When the promoter is off, the fliC gene, found elsewhere in the genome, is expressed and specifies one flagellum. When the promoter is on, the fljB gene, which is immediately downstream of the promoter and encodes a different flagellum, is expressed instead. At the same time, a transcriptional repressor (fljA) is also expressed, which represses fliC and makes expression of the two flagella-encoding genes mutually exclusive. See the diagram below and Silverman, Hilmen, and Simon (1979). Salmonella are thus either 'biphasic', 'monophasic' or 'non-motile', with the capability of expressing two, one or no flagella, respectively. The typing results of both flagella are included as part of the antigenic formula. When serotyping in the lab, specific growth conditions are used to coerce Salmonella to choose one phase or the other.
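The switch logic can be written down as a toy model. This is only a sketch of the on/off behaviour described above for a biphasic strain; the class name PhaseSwitch and its methods are hypothetical, and nothing here models the biology beyond which genes are expressed in each orientation.

```python
from dataclasses import dataclass


@dataclass
class PhaseSwitch:
    """Toy model of the invertible promoter segment in a biphasic strain."""

    promoter_on: bool = False

    def invert(self) -> None:
        # Each inversion of the DNA segment flips the promoter orientation.
        self.promoter_on = not self.promoter_on

    def expressed_genes(self) -> set:
        if self.promoter_on:
            # fljB is expressed together with fljA, which represses fliC.
            return {"fljB", "fljA"}
        # With the promoter off there is no FljA repressor, so fliC is expressed.
        return {"fliC"}


switch = PhaseSwitch()
phase_one = switch.expressed_genes()
switch.invert()
phase_two = switch.expressed_genes()
```

In either state only one of fliC and fljB is expressed, which is the mutual exclusivity that serotyping has to work around.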


A schematic representation of flagellar phase variation in S. enterica. The promoter for the fljBA operon is located within an invertible DNA segment whereby inversion of the promoter is mediated by the Hin recombinase. In one orientation, the fljBA operon is expressed and FljB flagellin is produced along with FljA, repressor of the unlinked fliC gene that encodes FliC flagellin. In the opposite orientation, the fljB gene is not expressed, nor is the repressor FljA, thus allowing transcription of the fliC gene. From Bonifield and Hughes (2003)

Not all serovars are defined strictly by surface antigens (O & H)

Serovar specification is not strictly limited to surface antigens ('O' and 'H'). Other antigens (Vi) are also used on occasion, and other unrelated features can be used as well. Examples are the serovars Paratyphi C, Choleraesuis sensu stricto, Choleraesuis var. Kunzendorf, Choleraesuis var. Decatur, and Typhisuis. All of these have the same antigenic formula (I 6,7:c:1,5). However, they differ in biochemical tests, e.g. the ability or inability to ferment dulcitol or mucate, or to produce H2S. Each of these has a different host range and forms a distinct phylogenetic group (Zhou et al. 2018).

Is it 'serovar' or 'serotype'?

The terms 'serovar' and 'serotype' are interchangeable. The World Health Organisation (WHO)/Institut Pasteur use the term 'serovar' (Grimont and Weill, 2007), while the US Centers for Disease Control and Prevention uses the word 'serotype' (Brenner et al. 2000). Serovar seems to be the historic term, but serotype is the term used in other organisms.

Salmonella taxonomy today

The White-Kauffmann-Le Minor scheme recognises two species within the genus Salmonella: S. enterica and S. bongori. Within S. enterica, the scheme recognises six subspecies (Grimont and Weill, 2007). So Salmonella looks like this:

  • Salmonella enterica
    • S. enterica subsp. enterica
    • S. enterica subsp. salamae
    • S. enterica subsp. arizonae
    • S. enterica subsp. diarizonae
    • S. enterica subsp. houtenae
    • S. enterica subsp. indica
  • Salmonella bongori

This does make some of the names very long; for instance, serovar Heidelberg would be Salmonella enterica subsp. enterica ser. Heidelberg. There are shorthand abbreviations for each of the subspecies, which are used in antigenic formulae.

Salmonella subspecies             Abbreviation
S. enterica subsp. enterica       I
S. enterica subsp. salamae        II
S. enterica subsp. arizonae       IIIa
S. enterica subsp. diarizonae     IIIb
S. enterica subsp. houtenae       IV
S. enterica subsp. indica         VI

S. bongori used to be classed as S. enterica subspecies bongori and used the abbreviation 'V', but it has since been promoted to a species in its own right, which is why 'V' does not appear in the table above. There are likely other subspecies beyond what is described in the White-Kauffmann-Le Minor scheme (Alikhan et al. 2018; Criscuolo et al. 2019).

What do the antigenic formulae actually mean?

At first glance it makes no sense:

I 6,7,14:r:1,5

First is the subspecies abbreviation (I, II, IIIa...) according to the table above, followed by a colon-separated list of 3 (maybe 4) individual lists, in the format of:

O antigen results : flagellar (H1) results : flagellar (H2) results : Other results (maybe)

If the subspecies abbreviation is missing, then it is implied that the strain belongs to subspecies enterica (I). For instance, Infantis is sometimes written as simply '6,7,14:r:1,5'. This can be misleading, as the same antigenic profile can be found in multiple subspecies. The nomenclature system at the US Centers for Disease Control and Prevention includes the subspecies abbreviation (I, II, IIIa...) in serotypes (Brenner et al., 2000).

Each value in the respective sublists is a positive result for different antisera. So in the example above (for Infantis) it means:

Infantis is: I 6,7,14:r:1,5
Subspecies enterica (I)
Serogroup O:7
Positive for O antigens 6, 7 and 14
Positive for H1 antigen r
Positive for H2 antigens 1 and 5

The numbers in different lists are not related. So being positive for O antigen '1' has nothing to do with being positive for H2 antigen '1'. A positive in this case means observing agglutination when tested. Each value in the list refers to a pre-defined antiserum. If you are curious about what agglutination looks like, there is a video here. The UKHSA's laboratory process for identifying Salmonella is described here, and here.
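As a minimal sketch of the layout described above, a formula can be split into its subspecies prefix and its colon-separated fields. The names AntigenicFormula and parse_formula are hypothetical and do not come from any existing tool. The sublists are kept as raw strings, since they can carry extra symbols that a plain split on commas would mangle.

```python
from dataclasses import dataclass
from typing import Optional

# Subspecies abbreviations used as the prefix of an antigenic formula.
SUBSPECIES = {
    "I": "enterica",
    "II": "salamae",
    "IIIa": "arizonae",
    "IIIb": "diarizonae",
    "IV": "houtenae",
    "VI": "indica",
}


@dataclass
class AntigenicFormula:
    subspecies: str
    o: str
    h1: str
    h2: str
    other: Optional[str] = None


def parse_formula(text: str) -> AntigenicFormula:
    """Split 'I 6,7,14:r:1,5' into subspecies, O, H1, H2 and optional extra field."""
    text = text.strip()
    head, sep, rest = text.partition(" ")
    if sep and head in SUBSPECIES:
        subspecies, body = head, rest.strip()
    else:
        # A missing prefix implies subspecies enterica (I).
        subspecies, body = "I", text
    fields = body.split(":")
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 colon-separated fields: {text!r}")
    other = fields[3] if len(fields) == 4 else None
    return AntigenicFormula(subspecies, fields[0], fields[1], fields[2], other)


infantis = parse_formula("I 6,7,14:r:1,5")
```

For a simple formula like this one, infantis.o.split(",") then gives the individual O factors. Note that the sketch silently assumes subspecies I when the prefix is missing, which carries the same risk of ambiguity as the shorthand itself.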

There are additional symbols used as well, which make things more complicated. For instance, { } means that the antigens are mutually exclusive. Let's take serovar Huvudsta, which is 3,{10}{15,34}:b:1,7, as an example.

Huvudsta is: I 3,{10}{15,34}:b:1,7
Subspecies enterica (I)
Serogroup O:3,10
Positive for O antigens 3 and 10, OR
Positive for O antigens 3, 15 and 34
Positive for H1 antigen b
Positive for H2 antigens 1 and 7
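The Huvudsta breakdown can be reproduced with a short sketch that expands the curly-brace groups into the alternative O profiles. The function name expand_exclusive is hypothetical, and the sketch assumes each { } group is one alternative that is combined with the factors outside the braces.

```python
import re

BRACED = re.compile(r"\{([^{}]*)\}")


def expand_exclusive(o_field: str) -> list:
    """Return one list of O factors per mutually exclusive { } group."""
    groups = BRACED.findall(o_field)
    # Factors outside the braces are shared by every alternative.
    shared = [f for f in BRACED.sub("", o_field).split(",") if f]
    if not groups:
        return [shared]
    return [shared + [f for f in group.split(",") if f] for group in groups]


huvudsta_o = expand_exclusive("3,{10}{15,34}")
```

Here huvudsta_o holds two alternatives, 3 with 10, and 3 with 15 and 34. The sketch appends the braced factors after the shared ones, so it does not preserve their original position in the formula.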

Other symbols include [ ], which means that the O or H factor in question may be present or absent without relation to phage conversion. When H factors are in square brackets, this means that they are exceptionally found in wild strains. And finally ( ) means that the O or H factor in question is only weakly agglutinable. All in all, it should feel quite similar to regular expressions, if you are familiar with those. When the strain is monophasic, a '-' is used to denote the absence of a second flagellum.
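Since the notation resembles regular expressions, a regular expression is a natural way to read it. The sketch below labels each factor in a single field as 'plain', 'optional' (square brackets) or 'weak' (round brackets), and treats a lone '-' as an absent phase. The function name annotate and the labels are hypothetical, and curly braces are deliberately not handled here.

```python
import re

# One alternative per kind of token: [optional], (weak), or a plain factor.
TOKEN = re.compile(
    r"\[(?P<optional>[^\]]*)\]|\((?P<weak>[^)]*)\)|(?P<plain>[^,\[\]()]+)"
)


def annotate(field: str) -> list:
    """Label each factor in one O, H1 or H2 field of an antigenic formula."""
    if field.strip() == "-":
        return [("-", "absent")]
    labelled = []
    for match in TOKEN.finditer(field):
        kind = match.lastgroup  # name of the alternative that matched
        for factor in match.group(kind).split(","):
            factor = factor.strip()
            if factor:
                labelled.append((factor, kind))
    return labelled


o_factors = annotate("1,4,[5],12")
h2_factors = annotate("-")
```

In this example, factor 5 comes back labelled 'optional' while 1, 4 and 12 are 'plain'.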

Here are some antigenic formulae for different serovars, which should now look more friendly.

Serovar      Antigenic formula
Norwich      I 6,7:e,h:1,6
Brisbane     I 28:z:e,n,z15
Ipswich      I 41:z4,z24:1,5
Seattle      I 28:a:e,n,x

If you would like to learn more, you can read the bible of Salmonella serovars, Grimont and Weill's Antigenic Formulae of the Salmonella Serovars (2007). Have a look and see if your hometown has a serovar named after it!

What is 'I 1,4,[5],12:i:-'?

The antigenic formula for monophasic variants of Salmonella enterica serovar Typhimurium is I 1,4,[5],12:i:-. It seems to be regularly encountered by bioinformaticians, and I am often asked about this serotype in particular. You may see the shorthand, I 4:i:-, used by the US Centers for Disease Control and Prevention. The '-' in the H2 position denotes the absence of a second flagellum.

Not all I 1,4,[5],12:i:- are monophasic Typhimurium, however. There are a number of serotypes that start with I 1,4,[5],12:i:... and differ only by H2. When H2 is dropped, their monophasic variants will have the same antigenic formula. As such, if you see the antigenic formula I 1,4,[5],12:i:- you need other information (such as MLST, where monophasic Typhimurium is usually ST34) to determine whether it is Typhimurium or not.

These are the serovars that could collide as I 1,4,[5],12:i:- from Grimont and Weill (2007):

Serovar name    Serogroup    O    H1    H2

N.B. O antigens 4,12 are definitive for serogroup O:4. So, even though the specific O profile varies, these are all considered the same serogroup. With this in mind, replacing H2 with '-' makes all serotypes above indistinguishable.
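The collision is easy to demonstrate with a sketch that replaces the H2 field with '-'. The function name drop_h2 is hypothetical. The two biphasic formulae used here, Typhimurium with H2 1,2 and Lagos with H2 1,5, are given as I understand them and should be checked against Grimont and Weill (2007) before being relied on.

```python
def drop_h2(formula: str) -> str:
    """Replace the H2 field of an antigenic formula with '-' (monophasic form)."""
    parts = formula.split(":")
    if len(parts) < 3:
        raise ValueError(f"expected at least O:H1:H2 fields: {formula!r}")
    parts[2] = "-"
    return ":".join(parts)


# Illustrative biphasic formulae (assumed values, verify before use).
biphasic = {
    "Typhimurium": "I 1,4,[5],12:i:1,2",
    "Lagos": "I 1,4,[5],12:i:1,5",
}

monophasic = {name: drop_h2(formula) for name, formula in biphasic.items()}

# Group serovar names by their monophasic formula to expose the collision.
collisions = {}
for name, formula in monophasic.items():
    collisions.setdefault(formula, []).append(name)
```

Both serovars end up under the single key 'I 1,4,[5],12:i:-', which is why the formula alone cannot tell them apart.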

I searched EnteroBase for existing genomes that would fit this case. I looked at genomes with the MLST ST associated with each serovar, and then looked at the serotype prediction from SISTR for cases where I 1,4,[5],12:i:- is reported. I found 16 Agama (eBG 167) and 1 Lagos (ST 2469). Many of these were listed with a clinical/human source.

Why is this so complicated?

The simple serovar names (e.g. Typhi) belie the complex discussion of the Salmonella genus. In my opinion, the White-Kauffmann-Le Minor scheme attempts to simultaneously solve problems around taxonomy, clinical identification and characterising pathogenic potential, and to provide a stable nomenclature. Admirable? Yes. Foolhardy? Maybe. For instance, serotypes were initially expected to define species. Later, serogroups reflected a neighbourhood of Salmonella. And differences in antigenic formula were thought to imply differences in host association and pathogenicity. The scheme goes beyond giving stable names, measuring genetic variation (like counting SNP differences) or defining flat clonal complexes (like MLST).

Indeed, it is better to consider the White-Kauffmann-Le Minor scheme as an addressing system for Salmonella rather than just flat names. The scheme goes hand in hand with our understanding of Salmonella. Many of the address fields we use for Salmonella were defined by the scheme. Hence, if you received a letter from a Salmonella, the return address might read:

Ser. Heidelberg
c/o 1,4,[5],12:r:1,2
subsp. enterica (I)


'Wish you were here,' An AI (Midjourney) generated picture with the prompt; 'Salmonella on holiday, postcard'.

Questions or comments? @ me on Twitter @happy_khan
