Affymetrix® CDF Data File Format

CDF FILE

Description

The CDF file describes the layout for an Affymetrix GeneChip array. An array may contain Expression, Genotyping, CustomSeq, Copy Number and/or Tag probe sets. All probe set names within an array are unique. Multiple copies of a probe set may exist on a single array as long as each copy has a unique name.

The sections below describe the following formats:

  • ASCII text format is used by the MAS and GCOS 1.0 software. This was also known as the ASCII version.
  • XDA format is used by the GCOS 1.2 and above software. This was also known as the binary or XDA version.

ASCII Text Format

This CDF format is an ASCII text file similar to the Windows INI format.

The file is divided into sections. Each section starts with a line containing the section name enclosed in square brackets. The section names are: "CDF", "Chip", "QCI" (where I ranges from 1 to the number of QC probe sets), "Unit J" (where J is an internal index to uniquely distinguish probe sets), and "Unit J_Block K" (where J and K are internal indices used to distinguish subsets of a probe set). The data in each section is of the format TAG=VALUE.
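As a rough illustration of this layout, the sketch below splits ASCII CDF text into its sections and TAG=VALUE pairs. The function name `parse_cdf_sections` and the sample text are hypothetical. Pairs are kept as a list rather than a dictionary, on the assumption that a tag such as ChipType may appear more than once in a section.

```python
def parse_cdf_sections(text):
    """Split ASCII CDF text into {section name: [(tag, value), ...]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A bracketed name opens a new section.
            current = sections.setdefault(line[1:-1], [])
        elif "=" in raw and current is not None:
            # Split on the first "=" only and leave the value untouched,
            # so tab-separated fields and trailing blanks survive.
            tag, value = raw.split("=", 1)
            current.append((tag.strip(), value))
    return sections


# Illustrative input only; the values are made up.
sample = "[CDF]\nVersion=GC3.0\n\n[Chip]\nName=Example\nRows=2\nCols=2\n"
sections = parse_cdf_sections(sample)
print(sections["CDF"])
```

A hand-written loop is used here instead of a general INI library because the cell lines carry tab-separated values that should be passed through unchanged.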

The "CDF" section contains the version number of the file. The TAGS are:

TAG Description
Version The version number. Must be one of "GC1.0", "GC2.0", "GC3.0", "GC4.0", "GC5.0", "GC6.0", or "GC7.0". This document describes GC3.0, GC4.0, GC5.0, and GC6.0 version CDF files.
GUID The unique identifier of the CDF. (Only available in version 6 or 7)
md5 The integrity md5 of the CDF. (Only available in version 6 or 7)
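Because several tags exist only in certain versions, a reader typically extracts the major version from the Version value first. A minimal sketch, assuming the value always has the form "GC" followed by a number such as "3.0" (the helper names are hypothetical):

```python
def cdf_major_version(version_tag):
    """Return the major version number from a value such as "GC3.0"."""
    if not version_tag.startswith("GC"):
        raise ValueError(f"unexpected Version value: {version_tag!r}")
    return int(version_tag[2:].split(".")[0])


def has_guid_and_md5(version_tag):
    # GUID and md5 are documented for versions 6 and 7 only.
    return cdf_major_version(version_tag) in (6, 7)


print(cdf_major_version("GC3.0"), has_guid_and_md5("GC6.0"))
```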

The "Chip" section contains the following TAGS:

TAG Description
Name The name of the array. This item is not used by the software.
ChipType The probe array type. Multiple entries may exist. (Only available in version 6 or 7)
Rows The number of rows of cells on the array.
Cols The number of columns of cells on the array.
NumberOfUnits The number of units in the array not including QC units. For CustomSeq arrays, there are 2 units: Unit1 contains the probes interrogating a sense target and Unit2 contains the probes interrogating an anti-sense target. For all other array types, there exists one unit per probe set.
MaxUnit Each unit is given a unique number. This value is the maximum of the unit numbers of all the units in the array (not including QC units).
NumQCUnits The number of QC units. QC units are defined in version 2 and above. CustomSeq arrays do not contain any QC units.
ChipReference Used for CustomSeq, HIV and P53 arrays only. This is the reference sequence displayed by the Affymetrix software. The sequence may contain spaces. This value is defined for version 2 and above.
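The counts in the Chip section can be cross-checked against the sections actually present in the file. The sketch below is one possible consistency check, not part of the format itself. It assumes the file has already been parsed into a mapping from section name to a dict of TAG -> VALUE strings, and that unit sections can be recognised by a name starting with "Unit" that does not contain "_Block". The function name `check_section_counts` is hypothetical.

```python
def check_section_counts(sections):
    """Compare section counts with the Chip tags; return a list of problems."""
    chip = sections["Chip"]
    qc_names = [n for n in sections if n.startswith("QC")]
    unit_names = [n for n in sections
                  if n.startswith("Unit") and "_Block" not in n]
    problems = []
    expected_units = int(chip["NumberOfUnits"])
    if len(unit_names) != expected_units:
        problems.append(
            f"NumberOfUnits={expected_units} but found {len(unit_names)} units")
    # NumQCUnits is defined for version 2 and above, so it may be absent.
    expected_qc = int(chip.get("NumQCUnits", 0))
    if len(qc_names) != expected_qc:
        problems.append(
            f"NumQCUnits={expected_qc} but found {len(qc_names)} QC sections")
    return problems


# Illustrative input only; the values are made up.
example = {
    "Chip": {"NumberOfUnits": "1", "NumQCUnits": "1"},
    "QC1": {},
    "Unit1": {},
    "Unit1_Block1": {},
}
print(check_section_counts(example))
```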

The next set of sections, whose names begin with "QC", defines the QC units (probe sets) in the array. The number of QC sections equals NumQCUnits from the Chip section.

Each section name is a combination of "QC" and an index ranging from 1 to NumQCUnits, and the sections are listed sequentially. QC units are defined for version 2 and above.

Each section contains the following TAGS:

TAG Description
Type Defines the type of QC probe set. The defined types are:

0 - Unknown
1 - Checkerboard Negative
2 - Checkerboard Positive
3 - Hybridization Negative
4 - Hybridization Positive
5 - Text Features Negative
6 - Text Features Positive
7 - Central Negative
8 - Central Positive
9 - Gene Expression Negative
10 - Gene Expression Positive
11 - Cycle Fidelity Negative
12 - Cycle Fidelity Positive
13 - Central Cross Negative
14 - Central Cross Positive
15 - Cross Hyb Negative
16 - Cross Hyb Positive

NumberCells The number of cells in the probe set.
CellHeader Defines the data contained in the subsequent lines, separated by tabs.

For all QC probe set types:
X - The X coordinate of the cell.
Y - The Y coordinate of the cell.
PROBE - The probe sequence of the cell. Typically set to "N".
PLEN - The number of bases in the probe sequence.
ATOM - An index used to group multiple cells.
INDEX - An index used to look up the corresponding cell data in the CEL file.

The final data items are dependent on the type of the QC probe set:
MATCH - A boolean flag indicating a perfect match probe. For types: 7 - Central Negative, 8 - Central Positive, 9 - Gene Expression Negative, 10 - Gene Expression Positive
BG - A boolean flag indicating a background (blank) cell. For types: 9 - Gene Expression Negative, 10 - Gene Expression Positive
CYCLES - This item is always a list of 0's separated by a tab. There are as many 0's as number of bases in the probe sequence (PLEN). For types: 11 - Cycle Fidelity Negative, 12 - Cycle Fidelity Positive

Celli This contains the information about a cell that belongs to the probe set. The value of i in the tag ranges from 1 to the number of cells in the probe set and will be listed sequentially. The values in each line depend on the CellHeader. The values are separated by tabs.
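To show how CellHeader and the Celli lines fit together, the sketch below pairs each tab-separated value with its column name. It assumes the section has already been read into a dict of TAG -> VALUE strings. The function name `parse_cells`, the `count_tag` parameter, and the sample values are hypothetical.

```python
def parse_cells(tags, count_tag="NumberCells"):
    """Turn the Cell1..CellN lines of one section into a list of dicts."""
    header = tags["CellHeader"].split("\t")
    cells = []
    for i in range(1, int(tags[count_tag]) + 1):
        values = tags[f"Cell{i}"].split("\t")
        row = dict(zip(header, values))
        if len(values) > len(header):
            # The last column may hold several values, e.g. CYCLES
            # lists one 0 per base in the probe sequence.
            row[header[-1]] = values[len(header) - 1:]
        cells.append(row)
    return cells


# Illustrative QC section only; the values are made up.
qc = {
    "Type": "1",
    "NumberCells": "2",
    "CellHeader": "X\tY\tPROBE\tPLEN\tATOM\tINDEX",
    "Cell1": "0\t0\tN\t25\t0\t0",
    "Cell2": "1\t0\tN\t25\t0\t1",
}
cells = parse_cells(qc)
print(cells[1]["X"], cells[1]["INDEX"])
```

All values are left as strings; converting X, Y, PLEN, ATOM, and INDEX to integers is a natural next step.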

The next set of sections, whose names begin with "Unit", defines the probes that are members of each unit (probe set). Each unit is divided into subsections termed "Blocks", which are referred to as "groups" in the Files SDK documentation.

Each section name is a combination of "Unit" and an index. There is no meaning to the index value. Immediately following the "Unit" section there will be the "Block" sections for that unit before the next unit is defined.

Each "Unit" section contains the following TAGS:

TAG Description
Name The name of the unit. The probe set name for Genotyping, Copy Number, Polymorphic Marker and Multichannel Marker units or "NONE" for all other unit types.
Direction Defines if the probes are interrogating a sense target or anti-sense target (1 - sense, 2 - anti-sense, 3 - both).
NumAtoms The number of atoms in the entire probe set. This TAG contains up to two values after the equal sign. The first is the number of atoms and the second (if present) is the number of cells in each atom. An atom is a probe quartet for CustomSeq units and a probe pair for all other unit types.
NumCells The number of cells in the entire probe set. Probe pairs contain 2 cells and probe quartets contain 4 cells.
UnitNumber An arbitrary index value for the probe set.
UnitType Defines the type of unit (0 - Unknown, 1 - CustomSeq, 2 - Genotyping, 3 - Expression, 7 - Tag/GenFlex, 8 - Copy Number, 9 - Genotyping Control, 10 - Expression Control, 11 - Polymorphic Marker, 12 - Multichannel Marker). An array may contain units of varying types.
NumberBlocks The number of blocks or groups in the probe set.
MutationType Used for Genotyping units only in defining the type of polymorphism (0 - substitution, 1 - insertion, 2 - deletion). This value is available in version 2 and above.
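The NumAtoms and UnitType tags need a little interpretation when read. The sketch below is one way to handle them. It assumes the two NumAtoms values are separated by whitespace (the separator is not stated above), and the names `ASCII_UNIT_TYPES` and `parse_num_atoms` are hypothetical.

```python
# UnitType codes as listed for the ASCII format.
ASCII_UNIT_TYPES = {
    0: "Unknown",
    1: "CustomSeq",
    2: "Genotyping",
    3: "Expression",
    7: "Tag/GenFlex",
    8: "Copy Number",
    9: "Genotyping Control",
    10: "Expression Control",
    11: "Polymorphic Marker",
    12: "Multichannel Marker",
}


def parse_num_atoms(value):
    """Return (number of atoms, cells per atom or None)."""
    parts = value.split()  # assumption: whitespace-separated values
    atoms = int(parts[0])
    cells_per_atom = int(parts[1]) if len(parts) > 1 else None
    return atoms, cells_per_atom


print(parse_num_atoms("11"), ASCII_UNIT_TYPES[3])
```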

The "Unit_Block" sections follow the "Unit" section. There are as many "Unit_Block" sections as defined by NumberBlocks. Each block lists its member probes.

The TAGS are:

TAG Description
Name The name of the block. For Genotyping units this is the allele. For Polymorphic Marker and Multichannel Marker units this is "None". For all other unit types this is the name of the probe set.
BlockNumber An index to the block.
Wobble The wobble situation for Polymorphic Marker and Multichannel Marker units in the block. Only available in version 4, 5, 6, and 7.
Allele The allele code for Polymorphic Marker and Multichannel Marker units in the block. Only available in version 4, 5, 6, and 7.
Channel The channel code for the multichannel microarray platform. Only available in version 5, 6, and 7.
RepType The probe replication type (0 - unknown, 1 - different probe sequences, 2 - some probe sequences are identical, 3 - all probe sequences are identical) for probe set groups used under the multichannel microarray platform. Only available in version 5, 6, and 7.
NumAtoms The number of atoms in the block.
NumCells The number of cells in the block.
StartPosition The position of the first atom.
StopPosition The position of the last atom.
Direction Used for Genotyping, Polymorphic Marker and Multichannel Marker units only in defining whether the probes are interrogating a sense target or anti-sense target (0 - no direction, 1 - sense, 2 - anti-sense). This value is available in version 3 and above.
CellHeader Defines the data contained in the subsequent lines, separated by tabs. The values are:

X - The X coordinate of the cell.
Y - The Y coordinate of the cell.
PROBE - The probe sequence of the cell. Typically set to "N".
FEAT - Unused string.
QUAL - The probe set name plus the allele for Genotyping units. The probe set name for all other unit types.
EXPOS - Ranges from 0 to the NumAtoms - 1 for Expression units. For all other unit types, provides relative positional information for the probe.
PLEN - The length of probe sequence. Only available in version 4, 5, 6, and 7.
POS - An index to the base position within the probe where the mismatch occurs.
CBASE - Not used.
PBASE - The probe base at the substitution position.
TBASE - The base of the target where the probe interrogates at the substitution position.
ATOM - An index used to group probe pairs or quartets. For Expression, identical to EXPOS.
INDEX - An index used to look up the corresponding cell data in the CEL file.
GROUP - The physical grouping of probe on the array. Only available in version 4, 5, 6, and 7.

PROBEID - A unique key associated with the probe's sequence. Only available in version 7.

The following are only available in version 2 and above:
CODONIND - Always set to -1
CODON - Always set to -1
REGIONTYPE - Always set to 99
REGION - Always set to a blank character

Celli This contains the information about a cell that belongs to the block. The value of i in the tag ranges from 1 to the number of cells in the block. The values in each line depend on the CellHeader. The values are separated by tabs.
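The ATOM column ties together the cells that form a probe pair or quartet. A minimal sketch of that grouping, assuming each cell of a block has already been parsed into a dict keyed by the CellHeader names (the function name `group_by_atom` and the sample values are hypothetical):

```python
from collections import defaultdict


def group_by_atom(cells):
    """Group a block's cells by their ATOM index."""
    atoms = defaultdict(list)
    for cell in cells:
        atoms[int(cell["ATOM"])].append(cell)
    return dict(atoms)


# Illustrative cells only: two probe pairs of two cells each.
block_cells = [
    {"X": "5", "Y": "10", "ATOM": "0", "INDEX": "105"},
    {"X": "5", "Y": "11", "ATOM": "0", "INDEX": "115"},
    {"X": "6", "Y": "10", "ATOM": "1", "INDEX": "106"},
    {"X": "6", "Y": "11", "ATOM": "1", "INDEX": "116"},
]
pairs = group_by_atom(block_cells)
print(sorted(pairs), [len(p) for p in pairs.values()])
```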

XDA Format

This CDF format is a binary file designed for faster access and smaller file size. All values in the file are stored in little-endian byte order.
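The first two items of the file, the magic number and the version number described in the table that follows, can be read with a little-endian unpack. The sketch below assumes that "integer" means a 32-bit signed value; the function name `read_xda_version` is hypothetical.

```python
import struct

XDA_MAGIC = 67  # magic number stated for the XDA format


def read_xda_version(buf):
    """Validate the magic number and return the XDA version number."""
    # "<ii" = two little-endian 32-bit signed integers (assumed width).
    magic, version = struct.unpack_from("<ii", buf, 0)
    if magic != XDA_MAGIC:
        raise ValueError(f"not an XDA CDF file (magic={magic})")
    if version not in (1, 2, 3, 4, 5):
        raise ValueError(f"unsupported XDA version: {version}")
    return version


header = struct.pack("<ii", 67, 3)
print(read_xda_version(header))
```

The same check offers a simple way to tell the two formats apart, since an ASCII CDF file begins with text rather than the magic number.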

The file contents are defined by:

Item Description Type
1 Magic number. Always set to 67. integer
2 Version number. Should be set to 1, 2, 3, 4, or 5. integer
3 The length of the GUID, a unique identifier of the CDF. (Only available in version 4) unsigned integer
4 GUID, the unique identifier of the CDF. (Only available in version 4) char[length defined above]
5 The integrity md5 of the CDF. (Only available in version 4) char[32]
6 The number of probe array types. (Only available in version 4) unsigned char
7 The length of probe array type. (Only available in version 4) unsigned integer
8 The probe array type. (Only available in version 4) char[length defined above]
9 The length and value of each additional probe array type, as described in Items 7 and 8 respectively, if there is more than one entry. (Only available in version 4) (unsigned integer + char[length defined]) * (# of probe array types - 1)
10 The number of columns of cells on the array. unsigned short
11 The number of rows of cells on the array. unsigned short
12 The number of units in the array not including QC units. The term unit is an internal term which means probe set. integer
13 The number of QC units. integer
14 The length of the CustomSeq reference sequence. integer
15 The CustomSeq reference sequence. char[length defined above]
16 The probe set name. The UNIT name for CustomSeq, Genotyping, Polymorphic Marker, and Multichannel Marker. The BLOCK name for Expression. char[64] * (# of units)
17 File position for the start of each QC unit information block. integer * (# of QC units)
18 File position for the start of each unit information block. integer * (# of units)
19 QC information, repeated for each QC unit:

Type - unsigned short
Number of probes - integer

Probe information, repeated for each probe in the QC unit:

X coordinate - unsigned short
Y coordinate - unsigned short
Probe length - unsigned char
Perfect match flag - unsigned char
Background probe flag - unsigned char
see description
20 Unit information, repeated for each unit:

UnitType - unsigned short (1 - Expression, 2 - Genotyping, 3 - CustomSeq, 4 - Tag, 5 - Copy Number, 6 - Genotyping Control, 7 - Expression Control, 8 - Polymorphic Marker, 9 - Multichannel Marker)
Direction - unsigned char
Number of atoms - integer
Number of blocks - integer (always 1 for Expression units)
Number of cells - integer
Unit number (probe set number) - integer
Number of cells per atom - unsigned char

Block information, repeated for each block in the unit:

Number of atoms - integer
Number of cells - integer
Number of cells per atom - unsigned char
Direction - unsigned char
The position of the first atom - integer
<unused integer value> - integer
The block name - char[64]
Wobble situation - unsigned short (only available in version 2, 3, 4, and 5)
Allele code - unsigned short (only available in version 2, 3, 4, and 5)
Channel - unsigned char (only available in version 3, 4, and 5)
RepType - unsigned char (0 - unknown, 1 - different probe sequences, 2 - some probe sequences are identical, 3 - all probe sequences are identical) (Only available in version 3, 4, and 5)

Cell information, repeated for each cell in the block:

Atom number - integer
X coordinate - unsigned short
Y coordinate - unsigned short
Index position (relative to sequence for CustomSeq, Genotyping, Copy Number, Polymorphic Marker, and Multichannel Marker units, for Expression units this value is the atom number) - integer
Base of probe at substitution position - char
Base of target at interrogation position - char
Length of probe sequence - unsigned short (only available in version 2, 3, 4, and 5)
Physical grouping of probe - unsigned short (only available in version 2, 3, 4, and 5)
Probe sequence ID - integer (only available in version 5)

see description
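The version-dependent cell record listed above can be decoded with a format string that grows with the version. The sketch below assumes 32-bit integers, 16-bit unsigned shorts, single-byte chars, and no padding between fields. The names `CELL_FIELDS`, `cell_struct`, and `unpack_cell` are hypothetical.

```python
import struct

CELL_FIELDS = ("atom", "x", "y", "index_position", "pbase", "tbase",
               "probe_length", "group", "probe_id")


def cell_struct(version):
    """Build the little-endian layout of one cell record."""
    fmt = "<iHHicc"      # fields present in every version
    if version >= 2:
        fmt += "HH"      # probe length, physical grouping
    if version >= 5:
        fmt += "i"       # probe sequence ID
    return struct.Struct(fmt)


def unpack_cell(buf, version, offset=0):
    """Decode one cell record from buf starting at offset."""
    values = cell_struct(version).unpack_from(buf, offset)
    cell = dict(zip(CELL_FIELDS, values))
    for key in ("pbase", "tbase"):
        cell[key] = cell[key].decode("ascii")
    return cell


# Illustrative record only; the values are made up.
record = struct.pack("<iHHiccHH", 3, 10, 20, 3, b"A", b"T", 25, 1)
print(unpack_cell(record, version=2))
```

Under these assumptions a record occupies 14 bytes in version 1, 18 bytes in versions 2 to 4, and 22 bytes in version 5, which is what `cell_struct(version).size` reports.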