************************************* Running Validate on Syngenta data set ************************************* -------------------------- Install Cyverse CLI tools -------------------------- Accessing Validate through the Agave API hugely streamlines the task of managing your data and accessing applications through the Cyverse Data Store with a variety of command-line tools. For a comprehensive overview please see the `full documentation `_. Download Agave API:: git clone https://github.com/iPlantCollaborativeOpenSource/cyverse-sdk.git Change directory into the downloaded repository and unpack the cli tools:: cd cyverse-sdk tar xf cyverse-cli.tgz Add the cli tools to your home directory and bash profile:: mv cyverse-cli $HOME echo "PATH=\$PATH:\$HOME/cyverse-cli/bin" >> ~/.bashrc source ~/.bashrc (note that on a Macintosh system your default profile may be .bash_profile) Initialise the CLI tools:: tenants-init -t iplantc.org Create an OAAuth client application with API keys:: clients-create -S -v -N my_client -D "Client used for app development" Where -S saves the keys for future use, -D provides a brief description, and -N is your application name. You will be asked to enter your CyVerse account information:: auth-tokens-create -S The above code is the authorization of a token. Tokens are a form of short-lived (4 hours), temporary authenticiation and authorization used in place of your username and password. To continue interacting with Agave and other Cyverse APIs after the token has expired, you will simply type in the command 'auth-tokens-create' to refresh your token. At this point:: apps-list should return a list of all publicly-available apps. .. _jobs: ------------------------ Configure Job Submission ------------------------ A template .JSON file is available to get started. The basic parameters define a few basic settings for your job-- this example, "FLMMDongWangFull," requests two hours of runtime to process the publically-available Dong Wang simulation data, which is referenced in the parameter file using an agave url. While the parameters should be fairly self-explainatory, full details can be found in the `documentation for the Agave API `_. download fastLMM job skeleton:: wget https://github.com/CyVerse-Validate/Stampede-Files/raw/master/json/fastlmm-job.json (you may need to install wget if using a mac) You can open fastlmm-job.json in your text editor of choice to go over your options & make any needed changes (depending on your needs and TACC system status, you may want to request more or less time for your job):: { "jobName": "FLMMDongWangFull", "softwareName": "FaST-LMM-hpc-2.07u1", "requestedTime": "02:00:00", "archive": true, "inputs":{ "inputPED": "agave://http://datacommons.cyverse.org/browse/iplant/home/shared/syngenta_sim/Dong_Wang_sim/Analysis_Files/dongwang.ped", "inputMAP": "agave://http://datacommons.cyverse.org/browse/iplant/home/shared/syngenta_sim/Dong_Wang_sim/Analysis_Files/dongwang.map", "inputPHENO": "agave://http://datacommons.cyverse.org/browse/iplant/home/shared/syngenta_sim/Dong_Wang_sim/Analysis_Files/dongwangpheno.txt" }, "parameters":{ "output": "full_results" } } Inputs must be specified based on the app you're running-- ensure the file format is correct! Refer to individual app's documentation for more information. In this tutorial we are accessing PED, MAP and phenotype files from the public syngenta_sim dataset. To access your own data from the Data Store, change the url to replace "/shared/" with the path to your user account; ie:: "agave://data.iplantcollaborative.org/yourusername/results/yourdata.ped", ---------- Submit Job ---------- from the working directory where the .json file is located, type:: jobs-submit -F fastlmm-job.json If successful you should soon see:: Job [job_id] successfully submitted Copy down the job_id. You'll use it to check the status of your job with:: jobs-status *jobid* Once your job is complete, you can view the outputs with:: jobs-output *jobid* And download selected outputs with:: jobs-output --download --path *filename* *jobid* ----------------------------------------------- Run more GWAS / Prediction Apps for comparision ----------------------------------------------- Validate helps you find the tools best-suited for working with your data sets. Depending on your experience & the nature of your project, you may wish to process the syngenta data set with a variety of GWAS / Prediction tools, including: - Ridge Regression - Plink - Gemma - GenSel The job submission process for any app on Stampede is fairly similar-- configure a JSON wrapper containing your input / output parameters, submit the task with jobs-submit, and move the output files into your data store. ---------------------------- Send Output Files to Winnow ---------------------------- For full inputs & outputs see `here `_ The required files for winnow are the Known Truth file and the output from a GWAS tool (FastLMM in our case) Once you download the fastlmm output, upload it to a new location in your DE:: files-upload -S data.iplantcollaborative.org -F *fastlmm output which should now be local* yourusername/yourdatafolder Download the winnow example skeleton:: wget https://github.com/CyVerse-Validate/Stampede-Files/raw/master/json/winnow-job.json You can edit and submit this file using the same process described above. ---------------------------------- Visualize Results with Demonstrate ---------------------------------- **Currently it is easier to work with R on your own system than on Stampede.** The final step of Validation will be comparing the results using a visualization method of your choice. Refer to :ref:`datastore` to download results from Stampede to your own disk or the data store. Demonstrate is the final step in the Validate known-truth pipeline. Using output from Winnow, it produces a set of graphics showing differences in a GWAS/QTL applications performance under varying heritability and population structure. Demonstrate also functions without the need for heritability or population structure, but different graphics will be produced in that case. The function you will want to use depends on what type of data you have: Data with Heritability and Population Structure Specified --------------------------------------------------------- If you want to visualize differences in your data based on heritability or population structure, you'll want to use the original function Demonstrate. To run Demonstrate, type R on your terminal or command line to open the R console. From there use:: library(Demonstrate) If nothing happens, then you did it correctly! Now the Demonstrate package is loaded. Here are the options to run the function:: Demonstrate(dir, make.AUC.plot=TRUE, AUC.plot.title="Mean AUC By Population Structure and Heritability", make.MAE.plot=TRUE, MAE.plot.title="Mean MAE By Population Structure and Heritability",herit.strings=list("_03_","_04_","_06_") ,herit.values=list(0.3,0.4,0.6),struct.strings=list("PheHasStruct","PheNPStruct"),struct.values=list(TRUE,FALSE)) In this function, dir represents the directory where all Winnow output is stored. These default values are based on the sample data found within this repository. Once run, the function will create two graphs on the mean absolute error (MAE) and area under the receiver operator curve (AUC) across varying levels of heritability and/or population structure. The graphs are in pdf format. Other data from Winnow ----------------------- For other types of data, or if you're more interested in comparing GWAS tools than comparing data, you will want to use the Demonstrate2 function. Before running it though, you will need to include the function in your global environment:: source("/Demonstrate2.R") Then run the function:: Demonstrate2(dir, make.pos.plot=TRUE, pos.plot.title="True Positives by False Positives", make.error.plot=TRUE, error.plot.title="Plot of AUC by MAE", extra.plots=TRUE, AUC.axis.min=0, AUC.axis.max=1.0, MAE.axis.min=0, MAE.axis.max=2.0) Assuming all outputs are kept, Demonstrate2 will output five files in total. First, two frequency histograms illustrating the distribution of both true and false positives (if multiple Winnow files were in the original directory, the pdf files will have multiple pages). Second, a .csv file detailing the average sensitivity, specificity, and precision of each file. Finally, two plots based on true vs. false positives and mean absolute error vs. area under the curve will be produced. Demonstrate2 will color the points based on the file they came from, so you can compare multiple GWAS analysis results on the same plot.