Automated design of genomic Southern blot probes

Mike DR Croning, David G Fricker, Noboru H Komiyama and Seth GN Grant

Introduction

A novel software pipeline for designing and optimizing Southern blot probes in silico for use against genomic DNA targets is described.

The software was written and validated for two reasons:

  • To address our own need to design Southern blot probes regularly, by automating the process and reducing the time it takes to do manually.
  • To optimize the resulting probes by employing a brute-force search that significantly improves the chances of finding the best (or near-best) probe for the locus of interest. The aim is to reduce the laboratory time and expense that result from failed Southern blot assays and subsequent rounds of probe redesign.

The in silico scoring measures that we developed for evaluating the automated probe designs suggest they should perform as well as, or better than, previous manual designs, while reducing the time taken by the molecular biologist to obtain a successful probe, as we had planned.

We went on to test around 15 probes experimentally, and we report this experimental validation in the manuscript. The majority of the probes we tested in Southern blotting performed well, confirming our in silico prediction methodology and the general usefulness of the software for automated genomic Southern blot probe design.

The software is freely available under the terms of the Artistic License 2.0, and we hope that it finds extensive reuse by investigators in the genomics, genetics and molecular biology communities.

Tiling algorithm for Southern blot probe design

Given user-supplied chromosomal coordinates and a desired size range for the Southern blot probe, we used a tiling approach to generate many possible probes in the specified design window. The program starts from the maximum allowable probe length and tiles the window, shifting the probe each time by a small percentage of the probe length (default 5%). Once the window has been tiled, the probe length is reduced by 50 bases (configurable) and the window is re-tiled, generating more candidate probes. The process is repeated until the minimum probe length is reached.

Figure: Probe tiling

This approach produces a linear relationship between the number of candidate probes to search against the target genome and the length of the input design window. With a desired probe length range of 500-1300 bp, a 3 kb input window yields approximately 900 probes to search.
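The tiling procedure described above can be sketched in a few lines. This is a minimal illustration and not the package's own code (which is written in Perl): the function name `tile_window`, the zero-based end-exclusive offsets, and the rounding of the shift down to a whole number of bases are all assumptions made for the example.

```python
def tile_window(window_len, min_len=500, max_len=1300,
                length_step=50, shift_percent=5):
    """Return (start, end) offsets of candidate probes in a design window.

    Offsets are 0-based and end-exclusive, relative to the window start.
    Hypothetical helper: names and rounding rules are assumptions.
    """
    probes = []
    length = max_len
    while length >= min_len:
        # Shift by a small percentage of the current probe length,
        # rounded down, but never by less than one base.
        shift = max(1, length * shift_percent // 100)
        start = 0
        while start + length <= window_len:
            probes.append((start, start + length))
            start += shift
        # Shorten the probe and re-tile the window.
        length -= length_step
    return probes


candidates = tile_window(3000)
print(len(candidates))  # number of candidate probes for a 3 kb window
```

Under these assumptions a 3 kb window with a 500-1300 bp length range yields a candidate count on the order of 900, in line with the figure quoted above. Windows shorter than the minimum probe length produce no candidates.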

Figure: Probes vs window

Calibration with 8 experimentally-validated probes

We calibrated the method using a set of 8 manually-designed mouse genomic probes (download here) that we have previously successfully employed for Southern blotting. We searched these against the NCBI m33 genome assembly (see below).

| Probe name | Gene target | Length (bases) | Self / second hit score ratio | Second hit identity (%) | Second hit query coverage (%) | Min repetitive & low-complexity DNA (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Dusp6_5prime_probe | Dusp6 (5') | 946 | 30.1 | 71 | 8 | 3.2 |
| SAP102_5prime_PDZ3_probe | Dlg3 (5' PDZ3) | 969 | 27.2 | 72 | 8 | 2.7 |
| Dusp6_3prime_probe | Dusp6 (3') | 1004 | 29.4 | 61 | 13 | 4.5 |
| actb_probe | Actb | 881 | 22.8 | 91 | 6 | 6.7 |
| SAP102_3prime_probe | Dlg3 (3') | 886 | 22.2 | 77 | 8 | 19.4 |
| NR2B_probe | Grin2b | 567 | 11.1 | 81 | 14 | 9.5 |
| SAP102_5prime_probe | Dlg3 | 784 | 9.89 | 68 | 29 | 81.7 |
| PSD-95_exon_9_probe | Dlg4 (exon 9) | 296 | 3.3 | 76 | 54 | nd |
| Average ± standard error | | 791.6 ± 85.9 | 19.5 ± 3.6 | 74.6 ± 3.2 | 17.5 ± 5.8 | 18.2 ± 10.8 |

As can be seen above, these had an average length of approximately 800 bp. When searched with Exonerate (with parameters --model affine:local --score 150), all of these produced a perfect match to their genomic locus (as would be expected) and a number of additional lower-scoring alignments to other loci. These second-best matches spanned 17.5 ± 5.8% (mean ± standard error) of the probe length, with 74.6 ± 3.2% DNA sequence identity (n=8). From the scores of the 'self' alignment and the highest-scoring off-target alignment we calculated a score ratio as a measure of the uniqueness of the candidate probe. Our calibration probes averaged 19.5 ± 3.6. This score ratio is proportional to both the length and sequence identity of the matches. In the table, nd = not determined.
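As an illustration of the search and the score ratio, the sketch below builds an Exonerate command line with the parameters quoted above and computes the ratio from a list of alignment scores. The file names, function names and example scores are hypothetical, and the sketch assumes that the top-scoring alignment is the 'self' hit; the package's own Perl code may handle this differently.

```python
def exonerate_command(query_fasta, target_fasta, min_score=150):
    """Build an argument list for an affine local Exonerate search.

    The paths are placeholders; only the --model and --score values
    come from the text above.
    """
    return ["exonerate",
            "--model", "affine:local",
            "--score", str(min_score),
            "--query", query_fasta,
            "--target", target_fasta]


def score_ratio(alignment_scores):
    """Ratio of the best score to the second-best score.

    Assumes the best-scoring alignment is the probe's own locus.
    Returns None when there is no off-target alignment to compare with.
    """
    ranked = sorted(alignment_scores, reverse=True)
    if len(ranked) < 2:
        return None
    return ranked[0] / ranked[1]


# Hypothetical scores: one 'self' hit and two weaker off-target hits.
print(score_ratio([4730, 157, 152]))
```

A higher ratio indicates a probe whose best off-target alignment is weak relative to its match at the intended locus.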

Comparing the probe sequences to a version of the genomic assembly that has been screened for repeats and low-complexity regions by RepeatMasker and DUST allows us to estimate the repetitive DNA content of individual probes. Our calibration probes contained 18.2 ± 10.8% such DNA (see Supplementary x).
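One simple way to estimate this percentage, given the probe's sequence as it appears in a masked assembly, is to count the masked positions. The sketch below assumes that masked bases are written either as N (hard-masking) or in lower case (soft-masking); which convention the package relies on is not stated here, and the function name is hypothetical.

```python
def masked_percent(masked_seq):
    """Percentage of bases masked as repetitive or low-complexity.

    Assumption: masked bases are 'N' (hard-masked) or lower case
    (soft-masked) in the sequence taken from the masked assembly.
    """
    if not masked_seq:
        return 0.0
    masked = sum(1 for base in masked_seq
                 if base in "Nn" or base.islower())
    return 100.0 * masked / len(masked_seq)


# 4 hard-masked + 4 soft-masked bases out of 20
print(masked_percent("ACGTNNNNacgtACGTACGT"))  # 40.0
```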

We chose a minimum score ratio of 10 and a maximum combined repetitive and low-complexity base content of 5% as the minimum requirements for probe acceptance. Candidate probes meeting these criteria that are completely overlapped by a longer and better-scoring probe are considered redundant and are removed from the passing set.
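The acceptance rule and the redundancy filter can be expressed as follows. This is a sketch under stated assumptions rather than the package's implementation: the `Candidate` class and function names are hypothetical, the thresholds are treated as inclusive, and "better scoring" is taken to mean a strictly higher score ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: int             # window offset, inclusive
    end: int               # window offset, exclusive
    score_ratio: float     # self / second-hit score ratio
    repeat_percent: float  # repetitive + low-complexity content (%)


def passes(c, min_ratio=10.0, max_repeat=5.0):
    """Apply the two minimum acceptance criteria."""
    return c.score_ratio >= min_ratio and c.repeat_percent <= max_repeat


def remove_redundant(cands):
    """Drop probes fully covered by a longer, better-scoring probe."""
    kept = []
    for c in cands:
        covered = any(
            o is not c
            and o.start <= c.start and c.end <= o.end
            and (o.end - o.start) > (c.end - c.start)
            and o.score_ratio > c.score_ratio
            for o in cands)
        if not covered:
            kept.append(c)
    return kept


def select(cands):
    """Passing, non-redundant candidates, in their input order."""
    return remove_redundant([c for c in cands if passes(c)])
```

Because the thresholds are ordinary parameters, they can be relaxed and the selection repeated on stored search results without searching the genome again.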

124 automated Southern blot designs

| Probe design | Mouse chr | Passed | Genomic design window (bases) | Length of best probe (bases) | Best score ratio | Unique probes | Non-unique probes passed | Total probes passed | Total candidate probes | Total probes passed (%) | Candidate probes / kilobase |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | Pass | 4587 | 800 | na | 48 | 8 | 56 | 1559 | 3.6 | 339.9 |
| 2 | 2 | Pass | 6327 | 600 | na | 15 | 461 | 476 | 2277 | 20.9 | 359.9 |
| 3 | 11 | Pass | 2001 | 650 | na | 10 | 127 | 137 | 205 | 66.8 | 102.4 |
| 4 | 11 | Fail | 2001 | 500 | 1.1 | 0 | 0 | 0 | 205 | 0 | 102.4 |
| 5 | 8 | Fail | 1001 | 550 | na | 5 | 0 | 0 | 51 | 0 | 50.9 |
| 6 | 8 | Pass | 1001 | 550 | na | 3 | 48 | 51 | 51 | 100 | 50.9 |
| 7 | 15 | Fail | 1001 | 500 | 10.7 | 0 | 0 | 0 | 51 | 0 | 50.9 |
| 8 | 15 | Pass | 1001 | 500 | na | 22 | 0 | 22 | 51 | 43.1 | 50.9 |
| 9 | 2 | Pass | 2893 | 700 | na | 46 | 6 | 52 | 434 | 12 | 150 |
| 10 | 2 | Fail | 1069 | 550 | 6.2 | 0 | 0 | 0 | 60 | 0 | 56.1 |
| 11 | X | Pass | 16430 | 1300 | na | 214 | 288 | 502 | 3207 | 15.7 | 195.1 |
| 12 | 16 | Pass | 647 | 350 | na | 13 | 23 | 36 | 287 | 12.5 | 443.6 |
| 13 | 16 | Pass | 3964 | 900 | na | 109 | 711 | 820 | 1305 | 62.8 | 329.2 |
| 14 | 16 | Pass | 3660 | 1300 | na | 251 | 535 | 786 | 1179 | 66.7 | 322.1 |
| 15 | 16 | Pass | 2460 | 550 | na | 7 | 292 | 299 | 683 | 43.8 | 277.6 |
| 16 | 16 | Pass | 4260 | 900 | na | 156 | 564 | 720 | 1423 | 50.6 | 334 |
| 17 | 16 | Pass | 3661 | 1300 | na | 253 | 533 | 786 | 1159 | 67.8 | 316.6 |
| 18 | 16 | Fail | 1051 | 500 | 14.5 | 0 | 0 | 0 | 112 | 0 | 106.5 |
| 19 | 16 | Pass | 2461 | 550 | na | 8 | 293 | 301 | 684 | 44 | 277.9 |
| 20 | 16 | Pass | 3171 | 700 | na | 49 | 532 | 581 | 974 | 59.7 | 307.2 |
| 21 | 16 | Pass | 3717 | 600 | na | 22 | 174 | 196 | 1200 | 16.3 | 322.8 |
| 22 | 2 | Pass | 2001 | 550 | 17.5 | 0 | 12 | 12 | 494 | 2.4 | 246.9 |
| 23 | 2 | Fail | 2501 | 500 | na | 2 | 0 | 0 | 700 | 0 | 279.9 |
| 24 | 12 | Pass | 2001 | 1000 | 32.9 | nd | nd | 252 | 405 | 62.2 | 202.4 |
| 25 | 12 | Pass | 2001 | 1000 | 32.9 | nd | nd | 217 | 405 | 53.6 | 202.4 |
| Summary (totals and averages) | | 103/124 passed | 3094.8 ± 202.9 | 818.1 ± 25.0 | 23.7 ± 1.3 | 85.2 ± 14.1 | 176.7 ± 23.1 | 240.8 ± 24.2 | 899.6 ± 72.6 | 28.6 ± 2.6 | 263.4 ± 7.2 |

Entries 1-25 of 124 are shown.
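The two derived columns of the table appear to follow directly from the raw counts: the percentage passed is the number of passing probes divided by the number of candidates, and the candidate density is the number of candidates divided by the window length in kilobases. The short worked example below checks this reading against the first and third designs; the function names are hypothetical, and the formulas are inferred from the tabulated values rather than taken from the package.

```python
def passed_percent(total_passed, total_candidates):
    """Percentage of candidate probes that passed, to one decimal."""
    return round(100.0 * total_passed / total_candidates, 1)


def candidates_per_kb(total_candidates, window_bases):
    """Candidate probes per kilobase of design window, to one decimal."""
    return round(total_candidates / (window_bases / 1000.0), 1)


# Design 1: 56 of 1559 candidates passed in a 4587-base window.
print(passed_percent(56, 1559))       # 3.6
print(candidates_per_kb(1559, 4587))  # 339.9

# Design 3: 137 of 205 candidates passed in a 2001-base window.
print(passed_percent(137, 205))       # 66.8
print(candidates_per_kb(205, 2001))   # 102.4
```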

Southern Blot package documentation

Downloads & class documentation

A. Copyright & licensing conditions

Scripts, software and documentation copyright 2005-2009 Genes to Cognition Programme (G2C) and Genome Research Limited (GRL).

You may distribute this file/module under the terms of the Artistic License 2.0: http://www.perlfoundation.org/artistic_license_2_0

B. Introduction

Southern blotting is an experimental procedure where DNA, from a genomic or other source, is digested with a restriction enzyme and then separated by size using gel electrophoresis. The fragments are transferred from the gel onto a membrane ('blotted') which is then incubated with a labelled single-stranded DNA probe. Such a procedure allows one to locate a particular sequence of DNA within a complex mixture of DNA. From a gene targeting perspective, Southern blotting can be used to detect whether a targeting event has successfully taken place.

Designing a 'good' Southern blot probe for a particular gene or locus involves finding a stretch of DNA sequence at that locus, generally 500-1000 bp long, that is unique to that locus and has little or no repetitive DNA content. Molecular biologists tend to design their probes manually, by excising portions of genome sequence from online genome browsers (such as Ensembl) and then pasting them into a genome-search site to check the genome for sequence hits. Ideally a probe sequence should return a single hit to the region it was designed against, with little or no cross-reactivity to other parts of the genome.

If this is not the case, the investigator will likely shift the chosen piece of DNA a short distance away from what they might consider the optimal site, and search again. Another option might be to shorten the candidate probe sequence and repeat the search, particularly if it was obvious that one end of the initial sequence lacked the desired specificity and so gave rise to the extraneous hits.

Each round of cutting, pasting and genome searching takes several minutes, and many rounds may be needed to find an acceptable sequence for a Southern blot probe. One can appreciate that this does not make effective use of a molecular biologist's time, and is very unlikely to find an optimal probe.

The design strategy outlined is clearly very amenable to automation using bioinformatics with the added benefit that the number of candidate probes that can be examined during the design process need not be limited to a few (as when carried out manually) but can be increased to hundreds, or thousands, allowing a very fine-grained analysis to be performed, significantly increasing the chances of finding the best, or at least a near-optimal probe for the chosen locus.

C. Implementation

With the number of genome searches to be carried out potentially taking hours for each probe to be designed, the writing of a single programme which would complete the whole task outlined above is not likely to yield a satisfactory solution.

Instead a more elaborate system is required utilising a database to store and retrieve the design information for each probe, and subsequently the results of the many genome searches carried out for candidate probes. These results can then be analysed to find the best probes.

Such a system would also allow more than one computer to be used to carry out the searches, speeding design, and would also permit the user to modify the selection parameters for the probe, without requiring one to re-run the genome searches for a particular probe, should the initial constraints be found to be too stringent at a particular locus.

A MySQL database (12 tables) was designed for this purpose, together with a set of Perl data objects and adaptors that allow programmes to write to and retrieve from the database. These follow the design paradigm set by the Ensembl genome analysis system, where one creates a set of (DBEntry) classes for the 'business' objects used by the system, partnered by a set of complementary DBSQL classes that hold the cognate SQL necessary for storing and retrieving from the database. Changes to the database schema can then be made without impact on the DBEntry classes. The naming conventions used by the classes (and data types returned) generally follow those used in the Ensembl core API, see http://www.ensembl.org/info/docs/api/core
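The separation between 'business' objects and adaptor classes can be illustrated with a small, self-contained sketch. It is not the package's code: the package is written in Perl against MySQL, whereas this example uses Python with an in-memory SQLite database as a stand-in, and the table, column and method names are invented for illustration.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    """Plain 'business' object: holds data, contains no SQL."""
    probe_design: str
    status: str = "pending"
    dbid: Optional[int] = None


class JobAdaptor:
    """All SQL for storing and fetching Job objects lives here."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS job ("
            "job_id INTEGER PRIMARY KEY, probe_design TEXT, status TEXT)")

    def store(self, job):
        cur = self.conn.execute(
            "INSERT INTO job (probe_design, status) VALUES (?, ?)",
            (job.probe_design, job.status))
        job.dbid = cur.lastrowid  # record the database identifier
        return job

    def fetch_by_dbid(self, dbid):
        row = self.conn.execute(
            "SELECT job_id, probe_design, status FROM job "
            "WHERE job_id = ?", (dbid,)).fetchone()
        if row is None:
            return None
        return Job(probe_design=row[1], status=row[2], dbid=row[0])


conn = sqlite3.connect(":memory:")
adaptor = JobAdaptor(conn)
stored = adaptor.store(Job(probe_design="example_design"))
fetched = adaptor.fetch_by_dbid(stored.dbid)
print(fetched)
```

If the schema changes, for example if a column is renamed, only the adaptor's SQL needs to change; code that works with `Job` objects is unaffected.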

Southern blot design package classes

  • GeneTargeting
    • ::DBEntry
      • ::ComponentHit
      • ::Conf
        • ::Exonerate
      • ::DNAProbe
      • ::ExternalDB
      • ::Hit
      • ::Job
      • ::Sequence
      • ::Xref
    • ::DBSQL
      • ::BaseAdaptor
      • ::ConfAdaptor
      • ::DBAdaptor
      • ::DNAProbeAdaptor
      • ::ExternalDBAdaptor
      • ::HitAdaptor
      • ::JobAdaptor
      • ::SequenceAdaptor
      • ::SequenceHitAdaptor
      • ::XrefAdaptor
    • ::Utils     - Grab-bag of general utility methods
      • ::Config     - Configuration file parsing and setup
      • ::Counts
      • ::GD     - Utilities setting colours etc for GD
      • ::Exonerate     - Custom exonerate parser
      • ::HTMLReport     - Simple module for html output
      • ::Primer3     - Primer-picking wrapping code

Five programmes were written to perform the Southern blot probe design task in its entirety, along with three accessory scripts:

create_probe_search_db_tables
Creates the tables constituting the GeneTargeting MySQL database, allowing one to specify the server parameters on the command line.
create_probe_search
Given the user-specified chromosomal coordinates of the acceptable design window for the Southern blot probe, enumerates all the possible probes in the window, at the chosen granularity, within the size range chosen for the probe. Creates a number of jobs of class GeneTargeting::DBEntry::Job that are stored in the database, each of which is executed by an instance of run_probe_search.
submit_probe_search
User-executed script that submits a number of jobs to the compute farm, wrapping the underlying LSF system.
run_probe_search
The 'runnable' script used by the nodes of the compute farm to fetch candidate probes from the database, search them against the genome (specified in the config file for create_probe_search) and store the results back for later analysis.
analyse_probe_search
Analyses the results from all the genome searches, determining which of the candidate probes meet the minimum acceptance criteria for a Southern blot probe. Results are output as static HTML, including a graphical representation of 'unique', 'good' and 'bad' regions in the previously specified probe design window.
Also picks primers with Primer3 for recovery of the candidate probes.
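The division of labour between create_probe_search and run_probe_search implies that the candidate probes are grouped into jobs that compute nodes can process independently. How the package assigns candidates to jobs is not described here, so the sketch below simply shows one plausible scheme, fixed-size batches; the function name and the batch size are hypothetical.

```python
def split_into_jobs(candidates, probes_per_job):
    """Group candidate probes into fixed-size batches, one per job.

    Hypothetical scheme: the last batch may be smaller than the rest.
    """
    if probes_per_job < 1:
        raise ValueError("probes_per_job must be at least 1")
    return [candidates[i:i + probes_per_job]
            for i in range(0, len(candidates), probes_per_job)]


# 25 placeholder candidates split into jobs of up to 10 probes each.
jobs = split_into_jobs(list(range(25)), 10)
print([len(batch) for batch in jobs])  # [10, 10, 5]
```

Each batch could then be stored as one job record and picked up by whichever farm node runs it, with results written back to the database for the analysis step.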

Accessory scripts

get_probe_search_cpu_time
Calculates the total time taken to execute all the jobs making up the Southern blot probe search, by parsing the output files written by LSF.
delete_job_results
Deletes the results of a job from the database, for example when they are erroneous because of a system failure.
delete_probe_search
Deletes the whole probe design from the database, when it is no longer needed or when the coordinates were specified incorrectly.

D. Performance

Results seem favourable when compared to a number of manually-designed probes (see the paper) that have been used successfully by the Genes to Cognition research programme at the Wellcome Trust Sanger Institute.

Experimental validation has been performed on a set of the probes automatically designed by the software (see the paper).

E. Available documentation

All the scripts making up the Southern blot probe design system contain POD documentation detailing their use and command line parameters. Should the scripts be run with an invalid parameter combination then the documentation is automatically displayed in order to guide the user.

The configuration files for each of the scripts (present in the conf directory of the package) are commented as to the function and meaning of their various sections and options.

The GeneTargeting API modules utilised for Southern blot design are documented with POD, which can be displayed with 'perldoc Module_name.pm' or 'perldoc classname', for example 'perldoc GeneTargeting::Utils::Exonerate'.

Other documentation files include

  • southern_blot_design/docs/southern_blot_probe_design.txt     - this file
  • southern_blot_design/docs/example_run.txt
  • southern_blot_design/docs/example_run_output/

F. Package directory structure

  • GeneTargeting/
    • conf/     - example programme config files
    • docs/     - documentation
    • modules/
      • GeneTargeting/
        • DBEntry/
        • DBSQL/
        • Utils/
    • scripts/     - deployed scripts
      • run/     - runner scripts started by pipeline

G. Environment variables

GeneTargetingConfDir
full path to the conf directory
GeneTargetingBaseDir
full path to the southern_blot_design directory

H. Program configuration files

Each of the five programmes requires a Windows-style .ini configuration file. These should be stored in the directory pointed to by the environment variable GeneTargetingConfDir.

Sections of the supplied .ini configuration files group related configuration options, and are commented.
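For readers unfamiliar with this style of configuration, the sketch below shows how a sectioned .ini file could be located through GeneTargetingConfDir and read. It is illustrative only: the package itself reads its files from Perl (Config::IniFiles is among the required modules), and the file-naming rule, section name and option names used here are assumptions, not the package's actual settings.

```python
import configparser
import os


def load_config(script_name, conf_dir=None):
    """Read '<script_name>.ini' from the configuration directory.

    Assumptions: one .ini file per script, named after the script, and
    located in the directory given by GeneTargetingConfDir.

    A hypothetical file might look like:

        [probe]
        min_length = 500
        max_length = 1300
    """
    if conf_dir is None:
        conf_dir = os.environ["GeneTargetingConfDir"]
    path = os.path.join(conf_dir, script_name + ".ini")
    parser = configparser.ConfigParser()
    with open(path) as handle:
        parser.read_file(handle)
    return parser
```

With such a file in place, `load_config("create_probe_search").getint("probe", "min_length")` would return the configured minimum length as an integer.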

I. Perl modules required

| Name | Version tested |
| --- | --- |
| Bio | 1.5.0 |
| Bio::Ensembl | branch 32 |
| Bio::Tools::Run | 1.4 |
| Config::IniFiles | 2.38 |
| DBI | 1.32 |
| GD | 2.17 |

Note that other modules may be required by the above modules. Please see their documentation should "Can't locate Module.pm" errors arise.

J. Other applications required

| Name | Version tested |
| --- | --- |
| exonerate | exonerate-1.0.0 |
| primer3 | primer3_1.0.0 |
| LSF | 5.1 from Platform Computing Corp. |

© G2C 2014. The Genes to Cognition Programme received funding from The Wellcome Trust and the EU FP7 Framework Programmes:
EUROSPIN (FP7-HEALTH-241498), SynSys (FP7-HEALTH-242167) and GENCODYS (FP7-HEALTH-241995).
