Automated design of genomic Southern blot probes
Mike DR Croning, David G Fricker, Noboru H Komiyama and Seth GN Grant
Introduction
A novel software pipeline for designing and optimizing Southern blot probes in silico for use against genomic DNA targets is described.
The software was written and validated for two reasons:
- To address our own need to design Southern blot probes regularly, by automating a process that is time-consuming when done manually.
- To optimize the resultant probes by employing a brute-force search approach to significantly improve the chances of finding the (near) best probe for the loci of interest, the aim being to reduce both the time and expense in the laboratory that results from failed Southern blot assays, and subsequent rounds of probe redesign.
The in silico scoring measures that we developed for evaluating the automated probe designs suggest they should perform as well as, or better than, previous manual designs, while reducing the time the molecular biologist needs to obtain a successful probe, as we had planned.
We went on to test around 15 probes experimentally, and we report this experimental validation in the manuscript. The majority of the probes we tested in Southern blotting performed well, confirming our in silico prediction methodology and the general usefulness of the software for automated genomic Southern blot probe design.
The software is freely available under the terms of the Artistic License 2.0, and we hope that it finds extensive reuse by investigators in the genomics, genetics and molecular biology communities.
Tiling algorithm for Southern blot probe design
Given user-supplied chromosomal coordinates and a desired size range for the Southern blot probe, we used a tiling approach to generate many possible probes within the specified design window. The program starts at the maximum allowable probe length and tiles the window, moving by a small percentage of the probe length each time (default 5%). Once this pass is complete, the probe length is reduced by 50 bases (configurable) and the window is re-tiled, generating more candidate probes. The process is repeated until the minimum probe length is reached.
With this approach, the number of candidate probes to search against the target genome grows linearly with the length of the input design window. With a desired probe length range of 500-1300 bp, a 3 kb input window yields approximately 900 candidate probes to search.
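The tiling procedure can be sketched in a few lines of Python. This is an illustrative sketch only, not the package's Perl implementation: the function name `tile_window`, its parameter names, the use of inclusive 1-based coordinates and the rounding down of the step size are all assumptions made for the example.

```python
def tile_window(window_start, window_end, min_len=500, max_len=1300,
                step_percent=5, length_decrement=50):
    """Enumerate candidate probes as (start, end) pairs (inclusive coordinates).

    Hypothetical helper: starts at the longest allowed probe, slides it along
    the window in steps of step_percent of the probe length, then shortens the
    probe by length_decrement bases and repeats until min_len is reached.
    """
    candidates = []
    length = max_len
    while length >= min_len:
        # Step size is a percentage of the current probe length (at least 1 base).
        step = max(1, length * step_percent // 100)
        start = window_start
        # Keep only probes that fit entirely inside the design window.
        while start + length - 1 <= window_end:
            candidates.append((start, start + length - 1))
            start += step
        length -= length_decrement
    return candidates


# A 3 kb design window with the default 500-1300 base length range.
probes = tile_window(1, 3000)
print(len(probes), "candidate probes")
```

Under these assumptions a 3 kb window gives roughly 900 candidates, in line with the figure quoted above; the exact count depends on how the step size is rounded.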
Calibration with 8 experimentally-validated probes
We calibrated the method using a set of 8 manually-designed mouse genomic probes (download here) that we have previously successfully employed for Southern blotting. We searched these against the NCBI m33 genome assembly (see below).
Probe Name | Gene Target | Length bases | Self / second hit score ratio | Second hit identity (%) | Second hit query coverage (%) | Min repetitive & low-complexity DNA (%) |
---|---|---|---|---|---|---|
Dusp6_5prime_probe | Dusp6 (5') | 946 | 30.1 | 71 | 8 | 3.2 |
SAP102_5prime_PDZ3_probe | Dlg3 (5' PDZ3) | 969 | 27.2 | 72 | 8 | 2.7 |
Dusp6_3prime_probe | Dusp6 (3') | 1004 | 29.4 | 61 | 13 | 4.5 |
actb_probe | Actb | 881 | 22.8 | 91 | 6 | 6.7 |
SAP102_3prime_probe | Dlg3 (3') | 886 | 22.2 | 77 | 8 | 19.4 |
NR2B_probe | Grin2b | 567 | 11.1 | 81 | 14 | 9.5 |
SAP102_5prime_probe | Dlg3 | 784 | 9.89 | 68 | 29 | 81.7 |
PSD-95_exon_9_probe | Dlg4 (exon 9) | 296 | 3.3 | 76 | 54 | nd |
Average ± standard error | | 791.6 ± 85.9 | 19.5 ± 3.6 | 74.6 ± 3.2 | 17.5 ± 5.8 | 18.2 ± 10.8 |
As the table shows, these probes had an average length of approximately 800 bp (nd = not determined). When searched with Exonerate (with parameters --model affine:local --score 150), all of them produced a perfect match to their own genomic locus, as expected, together with a number of lower-scoring alignments to other loci. These second-best matches spanned 17.5 ± 5.8% (mean ± standard error) of the probe length, with 74.6 ± 3.2% DNA sequence identity (n=8). From the scores of the 'self' alignment and the highest-scoring off-target alignment we calculated a score ratio as a measure of the uniqueness of a candidate probe. Our calibration probes averaged 19.5 ± 3.6. This score ratio reflects both the length and the sequence identity of the matches.
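As a small worked example, the score ratio and the summary statistic can be computed as follows. The function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) for the standard error is an assumption; it does, however, reproduce the value in the table. The score ratios are those listed for the eight calibration probes.

```python
import math
import statistics


def score_ratio(self_score, second_hit_score):
    """Ratio of the 'self' alignment score to the best off-target score."""
    return self_score / second_hit_score


def mean_and_standard_error(values):
    """Mean and standard error of the mean (sample standard deviation / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Score ratios of the eight calibration probes, taken from the table above.
calibration_ratios = [30.1, 27.2, 29.4, 22.8, 22.2, 11.1, 9.89, 3.3]
mean, sem = mean_and_standard_error(calibration_ratios)
print(f"{mean:.1f} ± {sem:.1f}")  # prints 19.5 ± 3.6
```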
Comparing the probe sequences to a version of the genomic assembly that has been screened for repeats and low-complexity regions by RepeatMasker and DUST allows us to estimate the repetitive DNA content of individual probes. Our calibration probes contained 18.2 ± 10.8% such DNA (see Supplementary x).
We chose a minimum score ratio of 10 and a maximum combined repetitive and low-complexity base content of 5% as the requirements for probe acceptance. Candidate probes meeting these criteria that are completely overlapped by a longer, better-scoring probe are considered redundant and are removed from the passing set.
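The acceptance and redundancy rules can be sketched as below. This is a minimal illustration rather than the package's own code: the `Candidate` class, its field names and the treatment of the thresholds as inclusive are assumptions, and the example probes are invented.

```python
from dataclasses import dataclass

MIN_SCORE_RATIO = 10.0     # minimum self / second-hit score ratio
MAX_REPEAT_PERCENT = 5.0   # maximum repetitive + low-complexity content (%)


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate probe."""
    start: int
    end: int
    score_ratio: float
    repeat_percent: float

    @property
    def length(self):
        return self.end - self.start + 1


def passes(probe):
    """Apply the two acceptance thresholds (assumed inclusive)."""
    return (probe.score_ratio >= MIN_SCORE_RATIO
            and probe.repeat_percent <= MAX_REPEAT_PERCENT)


def is_redundant(probe, others):
    """True if a longer, better-scoring probe completely overlaps this one."""
    return any(
        other.start <= probe.start and other.end >= probe.end
        and other.length > probe.length
        and other.score_ratio > probe.score_ratio
        for other in others
    )


def select_probes(candidates):
    passing = [c for c in candidates if passes(c)]
    return [c for c in passing if not is_redundant(c, passing)]


a = Candidate(100, 1099, 25.0, 2.0)    # passes, nothing longer covers it
b = Candidate(200, 899, 18.0, 1.0)     # passes, but lies inside a with a lower ratio
c = Candidate(150, 949, 30.0, 0.5)     # lies inside a, but scores better, so kept
d = Candidate(1200, 1899, 8.0, 1.0)    # fails the score ratio threshold
e = Candidate(2000, 2699, 40.0, 12.0)  # fails the repeat content threshold
selected = select_probes([a, b, c, d, e])
print(selected == [a, c])
```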
124 automated Southern blot designs
Probe Design | Mouse Chr | Passed | Genomic Design Window (bases) | Length Best Probe (bases) | Best Score Ratio | Unique Probes | Non-Unique Probes Passed | Total Probes Passed | Total Candidate Probes | Total Probes Passed (%) | Candidate Probes / Kilobase |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | Pass | 4587 | 800 | na | 48 | 8 | 56 | 1559 | 3.6 | 339.9 |
2 | 2 | Pass | 6327 | 600 | na | 15 | 461 | 476 | 2277 | 20.9 | 359.9 |
3 | 11 | Pass | 2001 | 650 | na | 10 | 127 | 137 | 205 | 66.8 | 102.4 |
4 | 11 | Fail | 2001 | 500 | 1.1 | 0 | 0 | 0 | 205 | 0 | 102.4 |
5 | 8 | Fail | 1001 | 550 | na | 5 | 0 | 0 | 51 | 0 | 50.9 |
6 | 8 | Pass | 1001 | 550 | na | 3 | 48 | 51 | 51 | 100 | 50.9 |
7 | 15 | Fail | 1001 | 500 | 10.7 | 0 | 0 | 0 | 51 | 0 | 50.9 |
8 | 15 | Pass | 1001 | 500 | na | 22 | 0 | 22 | 51 | 43.1 | 50.9 |
9 | 2 | Pass | 2893 | 700 | na | 46 | 6 | 52 | 434 | 12 | 150 |
10 | 2 | Fail | 1069 | 550 | 6.2 | 0 | 0 | 0 | 60 | 0 | 56.1 |
11 | X | Pass | 16430 | 1300 | na | 214 | 288 | 502 | 3207 | 15.7 | 195.1 |
12 | 16 | Pass | 647 | 350 | na | 13 | 23 | 36 | 287 | 12.5 | 443.6 |
13 | 16 | Pass | 3964 | 900 | na | 109 | 711 | 820 | 1305 | 62.8 | 329.2 |
14 | 16 | Pass | 3660 | 1300 | na | 251 | 535 | 786 | 1179 | 66.7 | 322.1 |
15 | 16 | Pass | 2460 | 550 | na | 7 | 292 | 299 | 683 | 43.8 | 277.6 |
16 | 16 | Pass | 4260 | 900 | na | 156 | 564 | 720 | 1423 | 50.6 | 334 |
17 | 16 | Pass | 3661 | 1300 | na | 253 | 533 | 786 | 1159 | 67.8 | 316.6 |
18 | 16 | Fail | 1051 | 500 | 14.5 | 0 | 0 | 0 | 112 | 0 | 106.5 |
19 | 16 | Pass | 2461 | 550 | na | 8 | 293 | 301 | 684 | 44 | 277.9 |
20 | 16 | Pass | 3171 | 700 | na | 49 | 532 | 581 | 974 | 59.7 | 307.2 |
21 | 16 | Pass | 3717 | 600 | na | 22 | 174 | 196 | 1200 | 16.3 | 322.8 |
22 | 2 | Pass | 2001 | 550 | 17.5 | 0 | 12 | 12 | 494 | 2.4 | 246.9 |
23 | 2 | Fail | 2501 | 500 | na | 2 | 0 | 0 | 700 | 0 | 279.9 |
24 | 12 | Pass | 2001 | 1000 | 32.9 | nd | nd | 252 | 405 | 62.2 | 202.4 |
25 | 12 | Pass | 2001 | 1000 | 32.9 | nd | nd | 217 | 405 | 53.6 | 202.4 |
SUMMARY | | Total passed | Average design window length (bases) | Average best probe length (bases) | Average best score ratio | Average unique probes | Average non-unique probes passed | Average total probes passed | Average candidate probes per design | Average total probes passed (%) | Average candidate probes / kilobase |
All 124 designs | | 103/124 | 3094.8 ± 202.9 | 818.1 ± 25.0 | 23.7 ± 1.3 | 85.2 ± 14.1 | 176.7 ± 23.1 | 240.8 ± 24.2 | 899.6 ± 72.6 | 28.6 ± 2.6 | 263.4 ± 7.2 |
Southern Blot package documentation
Index
Downloads & class documentation
- Perl module/class browser
- Download package: southern_blot_design_25_08_09.tar.gz MD5: e8a1191dc99dca75cfee52ff64212f5f, 206582 bytes
A. Copyright & licensing conditions
Scripts, software and documentation copyright 2005-2009 Genes to Cognition Programme (G2C) and Genome Research Limited (GRL).
You may distribute this file/module under the terms of the Artistic License 2.0: http://www.perlfoundation.org/artistic_license_2_0
B. Introduction
Southern blotting is an experimental procedure where DNA, from a genomic or other source, is digested with a restriction enzyme and then separated by size using gel electrophoresis. The fragments are transferred from the gel onto a membrane ('blotted') which is then incubated with a labelled single-stranded DNA probe. Such a procedure allows one to locate a particular sequence of DNA within a complex mixture of DNA. From a gene targeting perspective, Southern blotting can be used to detect whether a targeting event has successfully taken place.
Designing a 'good' Southern blot probe for a particular gene or locus involves finding a stretch of DNA sequence at that locus, generally 500-1000 bp long, that is unique to that locus and has little or no repetitive DNA content. Molecular biologists tend to design their probes manually, by excising portions of genome sequence from online genome browsers (such as Ensembl) and then pasting them into a genome-search site to check the genome for sequence hits. Ideally a probe sequence should return a single hit to the region it was designed against, with little or no cross-reactivity to other parts of the genome.
If this is not the case, the investigator will likely shift the chosen piece of DNA a short distance away from what they consider the optimal site, and search again. Another option is to shorten the candidate probe sequence and repeat the search, particularly if it is obvious that one end of the initial sequence lacks the desired specificity and is giving rise to the extraneous hits.
Each round of cutting, pasting and genome searching needed to find an acceptable sequence for a Southern blot probe takes several minutes. This does not make effective use of a molecular biologist's time, and is very unlikely to find an optimal probe.
The design strategy outlined is clearly amenable to automation using bioinformatics. An added benefit is that the number of candidate probes examined during the design process need not be limited to a few (as when carried out manually) but can be increased to hundreds or thousands. This allows a very fine-grained analysis to be performed, significantly increasing the chances of finding the best, or at least a near-optimal, probe for the chosen locus.
C. Implementation
With the number of genome searches to be carried out potentially taking hours for each probe to be designed, the writing of a single programme which would complete the whole task outlined above is not likely to yield a satisfactory solution.
Instead a more elaborate system is required utilising a database to store and retrieve the design information for each probe, and subsequently the results of the many genome searches carried out for candidate probes. These results can then be analysed to find the best probes.
Such a system would also allow more than one computer to be used to carry out the searches, speeding design, and would also permit the user to modify the selection parameters for the probe, without requiring one to re-run the genome searches for a particular probe, should the initial constraints be found to be too stringent at a particular locus.
A MySQL database (12 tables) was designed for this purpose, together with a set of Perl data objects and adaptors that allow programmes to write to and retrieve from the database. These follow the design paradigm set by the Ensembl genome analysis system, where one creates a set of (DBEntry) classes for the 'business' objects used by the system, partnered by a set of complementary DBSQL classes that hold the cognate SQL necessary for storing and retrieving from the database. Changes to the database schema can then be made without impact on the DBEntry classes. The naming conventions used by the classes (and data types returned) generally follow those used in the Ensembl core API, see http://www.ensembl.org/info/docs/api/core
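The separation between data objects and adaptors can be illustrated with a minimal Python sketch. This is not the package's Perl code: the class names are borrowed from the class list for readability, but the fields, the table layout and the method names `store` and `fetch_by_dbid` are assumptions, and an in-memory SQLite database stands in for MySQL so that the example is self-contained.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class DNAProbe:
    """'Business' object: holds data only and knows nothing about SQL."""
    chromosome: str
    start: int
    end: int
    dbid: Optional[int] = None


class DNAProbeAdaptor:
    """Holds all SQL for DNAProbe; schema changes are confined to this class."""

    def __init__(self, connection):
        self.connection = connection
        # Hypothetical table layout, created here only to keep the demo runnable.
        connection.execute(
            "CREATE TABLE IF NOT EXISTS dna_probe ("
            "dna_probe_id INTEGER PRIMARY KEY, chromosome TEXT NOT NULL, "
            "seq_start INTEGER NOT NULL, seq_end INTEGER NOT NULL)"
        )

    def store(self, probe):
        cursor = self.connection.execute(
            "INSERT INTO dna_probe (chromosome, seq_start, seq_end) "
            "VALUES (?, ?, ?)",
            (probe.chromosome, probe.start, probe.end),
        )
        probe.dbid = cursor.lastrowid
        return probe.dbid

    def fetch_by_dbid(self, dbid):
        row = self.connection.execute(
            "SELECT dna_probe_id, chromosome, seq_start, seq_end "
            "FROM dna_probe WHERE dna_probe_id = ?",
            (dbid,),
        ).fetchone()
        if row is None:
            return None
        return DNAProbe(chromosome=row[1], start=row[2], end=row[3], dbid=row[0])


connection = sqlite3.connect(":memory:")
adaptor = DNAProbeAdaptor(connection)
new_id = adaptor.store(DNAProbe("16", 1000, 1899))
fetched = adaptor.fetch_by_dbid(new_id)
print(fetched)
```

Because calling code only ever sees `DNAProbe` objects, a column could be renamed or a table split by editing the adaptor alone.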
Southern blot design package classes
- GeneTargeting
  - ::DBEntry
    - ::ComponentHit
    - ::Conf
    - ::Exonerate
    - ::DNAProbe
    - ::ExternalDB
    - ::Hit
    - ::Job
    - ::Sequence
    - ::Xref
  - ::DBAdaptor
    - ::BaseAdaptor
    - ::ConfAdaptor
    - ::DBAdaptor
    - ::DNAProbeAdaptor
    - ::ExternalDBAdaptor
    - ::HitAdaptor
    - ::JobAdaptor
    - ::SequenceAdaptor
    - ::SequenceHitAdaptor
    - ::XrefAdaptor
  - ::Utils - Grab-bag of general utility methods
    - ::Config - Configuration file parsing and setup
    - ::Counts
    - ::GD - Utilities setting colours etc for GD
    - ::Exonerate - Custom exonerate parser
    - ::HTMLReport - Simple module for html output
    - ::Primer3 - Primer-picking wrapping code
  - ::DBEntry
Five programmes were written to perform the Southern blot probe design task in its entirety, along with three accessory scripts:
- create_probe_search_db_tables
  - Creates the tables constituting the GeneTargeting MySQL database, allowing one to specify the server parameters on the command line.
- create_probe_search
  - Given the user-specified chromosomal coordinates of the acceptable design window for the Southern blot probe, enumerates all the possible probes in the window, at the chosen granularity, within the size range chosen for the probe. Creates a number of jobs of class GeneTargeting::DBEntry::Job that are stored in the database, each of which is executed by an instance of run_probe_search.
- submit_probe_search
  - User-executed script that submits a number of jobs to the compute farm, wrapping the underlying LSF system.
- run_probe_search
  - The 'runnable' script used by the nodes of the compute farm to fetch candidate probes from the database, search them against the genome (specified in the config file for create_probe_search) and store the results back for later analysis.
- analyse_probe_search
  - Analyses the results of all the genome searches, determining which of the candidate probes meet the minimum acceptance criteria for a Southern blot probe. Results are output as static HTML, including a graphical representation of 'unique', 'good' and 'bad' regions in the previously specified probe design window.
  - Also picks primers with Primer3 for recovery of the candidate probes.
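How candidate probes are divided among jobs is not described here, so the following is only one plausible scheme, sketched in Python: candidates are dealt out round-robin so that each job receives a similar share of long and short probes. The function name `split_into_jobs` and the round-robin choice are assumptions for illustration.

```python
def split_into_jobs(candidates, number_of_jobs):
    """Deal candidates out round-robin into at most number_of_jobs batches.

    Hypothetical helper: each returned batch would correspond to one job
    record picked up later by a worker on the compute farm.
    """
    if number_of_jobs < 1:
        raise ValueError("number_of_jobs must be at least 1")
    batches = [candidates[i::number_of_jobs] for i in range(number_of_jobs)]
    # Drop empty batches when there are fewer candidates than jobs.
    return [batch for batch in batches if batch]


jobs = split_into_jobs(list(range(10)), 3)
print(jobs)  # prints [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Because the tiling produces candidates ordered from longest to shortest, a round-robin split of this kind would tend to balance search time across jobs better than contiguous chunks.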
Accessory scripts
- get_probe_search_cpu_time
  - Calculates the total time taken to execute all the jobs making up the Southern blot probe search, by parsing the output files written by LSF.
- delete_job_results
  - Deletes the results of a job from the database, typically because they are erroneous following a system failure.
- delete_probe_search
  - Deletes the whole probe design from the database when it is no longer needed, or when the coordinates were specified incorrectly.
D. Performance
Results compare favourably with a number of manually-designed probes (see the paper) that have been used successfully by the Genes to Cognition research programme at the Wellcome Trust Sanger Institute.
Experimental validation has been performed on a set of the probes automatically designed by the software (see the paper).
E. Available documentation
All the scripts making up the Southern blot probe design system contain POD documentation detailing their use and command line parameters. Should the scripts be run with an invalid parameter combination then the documentation is automatically displayed in order to guide the user.
The configuration files for each of the scripts (present in the conf directory of the package) are commented as to the function and meaning of their various sections and options.
The GeneTargeting API modules utilised for Southern blot design are documented with POD which can be displayed with 'perldoc Module_name.pm' or perldoc classname, such as perldoc GeneTargeting::Utils::Exonerate
Other documentation files include
- southern_blot_design/docs/southern_blot_probe_design.txt - this file
- southern_blot_design/docs/example_run.txt
- southern_blot_design/docs/example_run_output/
F. Package directory structure
- GeneTargeting/
- conf/ - example programme config files
- docs/ - documentation
- modules/
- GeneTargeting/
- DBEntry/
- DBSQL/
- Utils/
- GeneTargeting/
- scripts/ - deployed scripts
- run/ - runner scripts started by pipeline
G. Environment variables
- GeneTargetingConfDir
  - full path to the conf directory
- GeneTargetingBaseDir
  - full path to the southern_blot_design directory
H. Program configuration files
Each of the five programmes requires a Windows-style .ini configuration file. These files should be stored in the directory pointed to by the environment variable GeneTargetingConfDir.
Sections of the supplied .ini configuration files group related configuration options, and are commented.
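For illustration, locating and reading such a file might look like the Python sketch below. Only the environment variable name GeneTargetingConfDir comes from this documentation; the file naming scheme (`<script name>.ini`) and the section and option names (`[probe]`, `min_length`, `max_length`) are hypothetical, so consult the commented files in the conf directory for the real ones.

```python
import configparser
import os
import tempfile
from pathlib import Path


def load_script_config(script_name, environ=os.environ):
    """Load <GeneTargetingConfDir>/<script_name>.ini (assumed naming scheme)."""
    conf_dir = environ.get("GeneTargetingConfDir")
    if not conf_dir:
        raise RuntimeError("GeneTargetingConfDir is not set")
    parser = configparser.ConfigParser()
    with open(Path(conf_dir) / f"{script_name}.ini") as handle:
        parser.read_file(handle)
    return parser


# Demonstration with a temporary conf directory and invented option names.
with tempfile.TemporaryDirectory() as conf_dir:
    example = Path(conf_dir) / "create_probe_search.ini"
    example.write_text(
        "; hypothetical section and option names\n"
        "[probe]\n"
        "min_length = 500\n"
        "max_length = 1300\n"
    )
    config = load_script_config(
        "create_probe_search", {"GeneTargetingConfDir": conf_dir}
    )
    min_length = config.getint("probe", "min_length")
    max_length = config.getint("probe", "max_length")

print(min_length, max_length)
```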
I. Perl modules required
Name | Version tested |
---|---|
Bio | 1.5.0 |
Bio::Ensembl | branch 32 |
Bio::Tools::Run | 1.4 |
Config::IniFiles | 2.38 |
DBI | 1.32 |
GD | 2.17 |
Note that other modules may be required by the modules above; please see their documentation should "Can't locate Module.pm" errors arise.
J. Other applications required
Name | Version tested |
---|---|
exonerate | exonerate-1.0.0 |
primer3 | primer3_1.0.0 |
LSF | 5.1 from Platform Computing Corp. |