File Format: PGF (NON-OFFICIAL-RELEASE)

DRAFT

The PGF (probe group file) describes which probes each probeset contains and the properties of those probes that are needed for analysis. The current PGF file format (version 1) is specified only for expression-style probesets.

The PGF file is based on version 2 of the TSV file format.

Specifications

Types

Type columns in PGF files use the following string format to categorize probesets, atoms, and probes:

simple_type:=[a-z0-9\_\-]+

Examples of simple types are:

pm
mm
st
at
control
affx
spike

Furthermore, types can be nested. For example, a particular spike may come from Affymetrix and be intended for use as a control. You would combine the simple types to reflect this:

control->affx->spike

Thus

nested_type:=(simple_type|nested_type)->(simple_type)

Lastly, a given probeset, atom, or probe may belong to multiple independent types. For example, a probeset may be both a normalization control gene and part of the main design:

normgene->exon:main

Thus

compound_type:=(simple_type|nested_type|compound_type):(simple_type|nested_type)
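
The grammar above can be read as two levels of splitting: ":" separates independent types, and "->" separates nesting levels within one type. The following Python sketch illustrates that reading. The function name parse_type is hypothetical and is not part of APT, and returning an empty list for an empty type field (as on the probeset line of Example 2) is an assumption, not something the format specifies.

```python
import re

# Pattern for a simple type, copied from the grammar above.
SIMPLE_TYPE = re.compile(r"[a-z0-9_\-]+")


def parse_type(value):
    """Split a PGF type string into a list of nesting paths.

    Each path is a list of simple types, outermost first.
    Hypothetical helper for illustration only.
    """
    if value == "":
        return []  # assumption: an empty type field means "no type"
    paths = []
    for part in value.split(":"):      # ':' separates independent types
        path = part.split("->")        # '->' separates nesting levels
        for name in path:
            if not SIMPLE_TYPE.fullmatch(name):
                raise ValueError(f"invalid simple type {name!r} in {value!r}")
        paths.append(path)
    return paths


print(parse_type("normgene->exon:main"))
# prints: [['normgene', 'exon'], ['main']]
```

Splitting on ":" first reflects the grammar's precedence: a compound type is built from whole nested types, so "->" binds more tightly than ":".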

Currently type values are not strongly enumerated. Values used in current commercial PGF files include:

Parsing and Writing

The official C++ parser used by Affymetrix can be found in APT under sdk/file/TsvFile/PgfFile.h. When possible, parse and write PGF files using this code.

Notes

Specific applications may require extra/optional columns in the PGF file. Thus a valid PGF file may fail for a particular application or analysis algorithm because the information needed by that application and/or algorithm is not contained in the PGF file.
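
An application can detect this situation early by comparing the columns declared in the "#%headerN=" lines against the columns it needs. The sketch below is one way to do that in Python; the function names and the idea of a per-level "required" mapping are hypothetical, and column names are assumed to contain no whitespace.

```python
def read_level_columns(lines):
    """Collect the column names declared per level by '#%headerN=' lines."""
    columns = {}
    for line in lines:
        if not line.startswith("#%header"):
            continue
        key, _, value = line[2:].partition("=")
        level = int(key[len("header"):])   # "header2" -> 2
        columns[level] = value.split()     # drops the leading indentation
    return columns


def missing_columns(columns, required):
    """Return (level, name) pairs listed in `required` but absent from the file."""
    return [(level, name)
            for level, names in sorted(required.items())
            for name in names
            if name not in columns.get(level, [])]


header_lines = [
    "#%header0=probeset_id\ttype",
    "#%header1=\tatom_id",
    "#%header2=\t\tprobe_id\ttype\tgc_count",
]
# Hypothetical requirement: this analysis needs exon_position on probes.
needed = {2: ["probe_id", "exon_position"]}
print(missing_columns(read_level_columns(header_lines), needed))
# prints: [(2, 'exon_position')]
```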

The ordering of probes within atoms, of atoms within probesets, and of probesets within the PGF file has no significance.

IDs do not have to be unique between different levels. In other words, the ID space for probeset_ids is separate from the ID space for atom_ids and probe_ids.

Example 1 -- Human Exon 1.0 ST PGF File Excerpt

#%chip_type=HuEx-1_0-st-v2
#%chip_type=HuEx-1_0-st-v1
#%chip_type=HuEx-1_0-st-ta1
#%lib_set_name=HuEx-1_0-st
#%lib_set_version=r2
#%create_date=Tue Sep 19 15:18:05 PDT 2006
#%guid=0000008635-1158704285-0183259307-0389325148-0127012107
#%pgf_format_version=1.0
#%header0=probeset_id type
#%header1= atom_id
#%header2=  probe_id type gc_count probe_length interrogation_position probe_sequence
2590411 main
 1
  5402769 pm:st 12 25 13 CGAAGTTGTTCATTTCCCCGAAGAC
 2
  4684894 pm:st 13 25 13 ATGAGGTCACGACGGTAGGACTAAC
 3
  3869021 pm:st 11 25 13 AGGAGTACAGGGTAAGATATGGTCT
 4
  3774604 pm:st 14 25 13 CCCCGAAGACCCTAAGATGAGGTCA
...
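
For readers who cannot use the APT C++ parser, the layout of the excerpt above can be read with a short Python sketch. It assumes the fields are tab-delimited and that the number of leading tabs gives the level (0 = probeset, 1 = atom, 2 = probe); the excerpt as rendered here shows spaces, so treat that as an assumption about the underlying TSV format. The function name parse_pgf is hypothetical, all values are kept as strings, and atoms and probes are nested under their parents so that the separate ID spaces never collide.

```python
def parse_pgf(lines):
    """Parse PGF text lines into (headers, probesets).

    Illustrative sketch only; not the APT parser.
    """
    headers = {}     # "#%key=value" metadata; repeated keys keep every value
    columns = {}     # level -> column names from "#%headerN="
    probesets = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith("#%"):
            key, _, value = line[2:].partition("=")
            if key.startswith("header") and key[6:].isdigit():
                columns[int(key[6:])] = value.strip("\t").split("\t")
            else:
                headers.setdefault(key, []).append(value)
            continue
        if line.startswith("#"):
            continue  # plain comment line
        body = line.lstrip("\t")
        level = len(line) - len(body)          # leading tabs give the level
        record = dict(zip(columns[level], body.split("\t")))
        if level == 0:
            record["atoms"] = []
            probesets.append(record)
        elif level == 1:
            record["probes"] = []
            probesets[-1]["atoms"].append(record)
        else:
            probesets[-1]["atoms"][-1]["probes"].append(record)
    return headers, probesets
```

With a real file, a call might look like `headers, probesets = parse_pgf(open(path))`, where path is whatever PGF file is at hand. Because ordering carries no meaning, code that consumes the result should look records up by ID within their level rather than by position.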

Example 2 -- Human Genome U133 2.0 Plus PGF File Excerpt

Here is a hypothetical example of a PGF file for an expression array with PM/MM pairs.

#%pgf_format_version=1.0
#%chip_type=HG-U133_Plus_2
#%lib_set_name=HG-U133_Plus_2
#%lib_set_version=1
#%create_date=Tue Mar 29 16:48:05 2005
#%header0=probeset_id type probeset_name
#%header1= atom_id
#%header2=  probe_id type gc_count probe_length interrogation_position probe_sequence exon_position
1354897  204339_s_at
 1354898
  1221821 pm:target->at 13 25 13 ACAACGACCGTTCCGGAATCGACAT 1703
  1222985 mm:target->at 13 25 13 ACAACGACCGTTGCGGAATCGACAT 1703
 1354899
  788881 pm:target->at 8 25 13 TTACATCATACTTTCTTGTCTCTAG 1355
  790045 mm:target->at 8 25 13 TTACATCATACTATCTTGTCTCTAG 1355
 1354900
  516645 pm:target->at 12 25 13 GAATCTTATCACGAGTTCCACCTCC 1518
  517809 mm:target->at 12 25 13 GAATCTTATCACCAGTTCCACCTCC 1518
 1354901
  736948 pm:target->at 12 25 13 GAGTTCCCTCGATTCCATAAACGAG 1543
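
In this excerpt, each mm sequence differs from the pm sequence in the same atom at a single base, and the gc_count column equals the number of G and C bases in probe_sequence. These are observations about the rows shown, not rules stated by the format. A small Python sketch (function names are hypothetical) makes the check concrete:

```python
def gc_count(sequence):
    """Count the G and C bases in a probe sequence."""
    return sum(base in "GC" for base in sequence.upper())


def mismatch_positions(pm_sequence, mm_sequence):
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(pm_sequence) != len(mm_sequence):
        raise ValueError("sequences differ in length")
    return [i for i, (p, m) in enumerate(zip(pm_sequence, mm_sequence), start=1)
            if p != m]


# First PM/MM pair from the excerpt above (atom 1354898).
pm = "ACAACGACCGTTCCGGAATCGACAT"
mm = "ACAACGACCGTTGCGGAATCGACAT"
print(gc_count(pm), mismatch_positions(pm, mm))
# prints: 13 [13]
```

For this pair the differing position, 13, coincides with the interrogation_position value in the same rows.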

Related Pages

Developer Notes