Data Files

This page provides a comprehensive guide to all raw and processed data files in the osteosarcoma genomics project. Data is stored in a Google Cloud Storage bucket and organized by data type and time point. Each section attributes the data to the lab or individual who generated or processed it.

(This page was originally generated from a source Google Doc, but it now includes details that are not in the GDoc.)

gs://osteosarc-genomics
T0 Nov 2022 — resection
T1 Jun 2024 — re-resection
T2 Jan 2025 — biopsy
T3 Apr 2025 — resection

Overview

The dataset spans four clinical time points and includes whole genome sequencing (WGS), whole exome sequencing (WES), bulk RNA-seq, single-cell RNA-seq (Illumina & Oxford Nanopore), spatial transcriptomics, flow cytometry, and pathology imaging.

Contributors: Boston Gene, UCLA, UCSF, Tempus, Elucidate Bio, Hudson Lab, Jeremiah & Alfredo, RCRF / Pattern Unify

Bucket Storage by Directory

Total data in gs://osteosarc-genomics: 24.88 TiB

Directory | Size
wgs_assembly | 7.85 TiB
hudson_lab | 5.56 TiB
kamil | 3.32 TiB
genomics | 1.93 TiB
genomics_reprocessing | 1.83 TiB
ONT | 1.03 TiB
hms_spatial | 973.5 GiB
ucsf | 751.9 GiB
wgs_cleaned | 447.1 GiB
elucidate | 412.8 GiB
scratch | 285.1 GiB
pathology_images | 159.3 GiB
Plasmidsaurus | 114.6 GiB
rna-seq | 99.6 GiB
xenium | 67.2 GiB
1cell | 35.9 GiB
web | 32.2 GiB
T1 | 20.2 GiB
svaba | 19.3 GiB
ref_genome | 15.3 GiB
pacbio | 1.7 GiB
webpage | 265.0 MiB
neoantigen_prediction | 178.3 MiB
hla_alignments | 7.3 MiB
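The per-directory figures can be cross-checked against the reported total. The sketch below is illustrative only: `SIZES`, `UNIT_BYTES`, and `parse_size` are names introduced here, the units are assumed to be binary (1 TiB = 1024 GiB), and the values are copied from the listing as displayed, so small rounding differences are expected.

```python
# Sizes copied from the directory listing on this page (rounded as displayed).
SIZES = {
    "wgs_assembly": "7.85 TiB",
    "hudson_lab": "5.56 TiB",
    "kamil": "3.32 TiB",
    "genomics": "1.93 TiB",
    "genomics_reprocessing": "1.83 TiB",
    "ONT": "1.03 TiB",
    "hms_spatial": "973.5 GiB",
    "ucsf": "751.9 GiB",
    "wgs_cleaned": "447.1 GiB",
    "elucidate": "412.8 GiB",
    "scratch": "285.1 GiB",
    "pathology_images": "159.3 GiB",
    "Plasmidsaurus": "114.6 GiB",
    "rna-seq": "99.6 GiB",
    "xenium": "67.2 GiB",
    "1cell": "35.9 GiB",
    "web": "32.2 GiB",
    "T1": "20.2 GiB",
    "svaba": "19.3 GiB",
    "ref_genome": "15.3 GiB",
    "pacbio": "1.7 GiB",
    "webpage": "265.0 MiB",
    "neoantigen_prediction": "178.3 MiB",
    "hla_alignments": "7.3 MiB",
}

# Assumption: binary units, as the "iB" suffixes suggest.
UNIT_BYTES = {"MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def parse_size(text: str) -> float:
    """Convert a string such as '973.5 GiB' to a number of bytes."""
    value, unit = text.split()
    return float(value) * UNIT_BYTES[unit]


total_bytes = sum(parse_size(size) for size in SIZES.values())
total_tib = total_bytes / UNIT_BYTES["TiB"]
print(f"Sum of listed directories: {total_tib:.2f} TiB")
```

With the rounded values above, the sum agrees with the reported 24.88 TiB total to two decimal places.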

Bulk DNA Sequencing

DNA sequencing is available at four time points (T0–T3), including both whole exome (WES) and whole genome (WGS) data.

Whole Genome Sequencing (WGS)

Timepoint | Sample | Data Type | Bucket Path | Size | Source
Nov 2022 (T0) | Tumor + Normal | WGS (Personalis) | genomics/genomics-bulk/2022.12.16/DNA/2022.12.16.dna.personalis.WGS | | Personalis
Jun 2024 (T1) | Tumor | FASTQ | genomics/genomics-bulk/2024.06.06/DNA/2024.06.06.dna.WGS/raw/tumor | 184 GB | UCLA
Jun 2024 (T1) | Tumor | BAM (BQSR) | genomics_reprocessing/DNA/T1_2024_BAM/preprocessing/recalibrated/tumor/BAM | 186 GB | Jeremiah / Alfredo
Jun 2024 (T1) | Organoid | FASTQ | genomics/genomics-bulk/2024.06.06/DNA/2024.06.06.dna.WGS/raw/organoid | 185 GB | UCLA
Jun 2024 (T1) | Organoid | BAM (BQSR) | genomics_reprocessing/DNA/T1_2024_BAM/preprocessing/recalibrated/organoid | 176 GB | Jeremiah / Alfredo
Jun 2024 (T1) | Blood Normal | FASTQ | genomics/genomics-bulk/2024.06.06/DNA/2024.06.06.dna.WGS/raw/normal-blood | 115 GB | UCLA
Jun 2024 (T1) | Blood Normal | BAM (BQSR, cleaned) | wgs_cleaned/blood.ss.cleaned.bam | 117 GB | Jeremiah / Alfredo
Jan 2025 (T2) | Tumor | FASTQ | genomics/genomics-bulk/2025.01.06/DNA/2025.01.06.dna.WGS/fastqs/tumor | 274 GB | UCLA
Jan 2025 (T2) | Tumor | BAM (BQSR) | genomics_reprocessing/DNA/T2_2025_01_BAM | | Jeremiah / Alfredo
Jan 2025 (T2) | Tumor | BAM (BQSR, cleaned) | wgs_cleaned/SG.wgs.UCLA.2025.01.tumor_cleaned.recal.bam | | Jeremiah / Alfredo
Jan 2025 (T2) | Tumor | BAM (name sorted) | genomics_reprocessing/DNA_namesorted/SG.WGS.UCLA.2025.01.tumor.recal.name_sorted.bam | 634 GB | Jeremiah

Coverage: T2 January 2025 WGS, 130.97×; T1 Blood Normal WGS, 62.56×.
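As a rough guide to what these depths imply, mean coverage is approximately the number of aligned bases divided by the genome length. The sketch below inverts that relation. The 3.1 Gb genome length is an assumed round figure for the human reference, `aligned_bases` is a name introduced here, and this page does not state how the reported coverage values were computed.

```python
# Assumption: approximate length of the human reference genome in base pairs.
GENOME_LENGTH_BP = 3.1e9


def aligned_bases(mean_coverage: float, genome_length: float = GENOME_LENGTH_BP) -> float:
    """Estimate total aligned bases from mean coverage (coverage = bases / length)."""
    return mean_coverage * genome_length


# Coverage values as reported on this page.
reported = {"T2 January 2025 WGS": 130.97, "T1 Blood Normal WGS": 62.56}

for label, coverage in reported.items():
    print(f"{label}: roughly {aligned_bases(coverage) / 1e9:.0f} Gb of aligned bases")
```

These are order-of-magnitude estimates only; duplicate reads, unmapped reads, and the exact reference build all shift the real numbers.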

Variant Calling (WGS)

Timepoint | Callers | Bucket Path | Pipeline
Jun 2024 (T1) | ASCAT, CNVkit, FreeBayes, HaplotypeCaller, Manta, Mutect2, Strelka | genomics_reprocessing/DNA/T1_2024_WGS_sarek_variants | Sarek 3.5.1
Jan 2025 (T2) | ASCAT, Manta, Mutect2, Strelka | genomics_reprocessing/DNA/T2_2025_01_WGS_sarek_variants/variant_calling | Sarek 3.5.1

Whole Exome Sequencing (WES)

Bulk RNA Sequencing

Bulk RNA-sequencing is available at three tumor time points (T0, T1, T2), including both short-read (Illumina) and long-read (PacBio) data.

Timepoint | Data Type | Bucket Path | Size | Source
Nov 2022 (T0) | FASTQ | rna-seq/fastq/bostongene_2022 | 10 GB | Boston Gene
Nov 2022 (T0) | STAR alignments | genomics/genomics-bulk/2022.12.16/RNA/2022.12.16.rna.bostongene/processed | | Alfredo
Nov 2022 (T0) | FASTQ | rna-seq/fastq/tempus_2022 | 1.4 GB | Tempus
Jun 2024 (T1) | FASTQ | rna-seq/fastq/bostongene_2024 | 8.3 GB | Boston Gene
Jun 2024 (T1) | FASTQ | rna-seq/fastq/personalis_2024 | | Personalis
Jun 2024 (T1) | STAR alignments | genomics/genomics-bulk/2024.06.06/RNA/2024.06.06.rna.bostongene/processed | | Alfredo
Jun 2024 (T1) | PacBio long-read (all) | ucsf/T1/pacbio_bams/IPISRC044_T1_sclrs | 287 GB | UCSF
Jan 2025 (T2) | FASTQ | rna-seq/fastq/ucla_2025 | | UCLA
Jan 2025 (T2) | STAR alignments | genomics/genomics-bulk/2025.01.06/RNA/2025.01.06.rna.ucla-core/processed/STAR | | Alfredo

Single Cell RNA-seq (Tumor)

Tumor single-cell and long-read RNA sequencing data from UCSF, spanning time points T1–T3. Includes 10x Illumina scRNA-seq and Oxford Nanopore (ONT) long-read sequencing.

Illumina scRNA-seq (UCSF)

Timepoint | Data Type | Bucket Path | Size
Jun 2024 (T1) | FASTQ (GEX) | ucsf/T1/FASTQ/IPISRC044_T1_SCG | 137 GB
Jun 2024 (T1) | FASTQ (TCR) | ucsf/T1/FASTQ/IPISRC044_T1_TCR |
Jun 2024 (T1) | FASTQ (BCR) | ucsf/T1/FASTQ/IPISRC044_T1_BCR |
Jun 2024 (T1) | Aligned BAMs | ucsf/T1/illumina_bams | 21 GB
Jun 2024 (T1) | Seurat RDS | ucsf/T1/seurat_objects/IPISRC044_T1_scrna_live_processed_annot_101824.rds | 1 GB
Jan 2025 (T2) | FASTQ (GEX) | ucsf/T2/FASTQ/IPISRC044_T2_SCG | 138 GB
Jan 2025 (T2) | FASTQ (TCR) | ucsf/T2/FASTQ/IPISRC044_T2_TCR |
Jan 2025 (T2) | FASTQ (BCR) | ucsf/T2/FASTQ/IPISRC044_T2_BCR |
Jan 2025 (T2) | FASTQ (ADT/CITE) | ucsf/T2/FASTQ/IPISRC044_T2_ADT |
Jan 2025 (T2) | Aligned BAMs | ucsf/T2/illumina_bams | 34 GB
Jan 2025 (T2) | Seurat RDS | ucsf/T2/seurat_objects/biopsy_01225.rds | 3.1 GB
Apr 2025 (T3) | FASTQ (GEX) | ucsf/T3/FASTQ/IPISRC044_T3_SCG | 149 GB
Apr 2025 (T3) | FASTQ (CD45- enriched) | ucsf/T3/FASTQ/IPISRC044_T3_CD45neg_enriched_illumina | 57 GB
Apr 2025 (T3) | FASTQ (TCR) | ucsf/T3/FASTQ/IPISRC044_T3_TCR |
Apr 2025 (T3) | FASTQ (BCR) | ucsf/T3/FASTQ/IPISRC044_T3_BCR |
Apr 2025 (T3) | FASTQ (ADT/CITE) | ucsf/T3/FASTQ/IPISRC044_T3_ADT |
Apr 2025 (T3) | Cell Ranger output | ucsf/T3/cellranger_output/ |
Apr 2025 (T3) | Seurat RDS | ucsf/T3/seurat_objects/biopsy_20250417.rds | 1.7 GB

All Illumina scRNA-seq data generated by UCSF

Kamil’s Tumor scRNA-seq Analysis (GEX, TCR, CNV)

Re-analysis of tumor scRNA-seq data by Kamil, including Cell Ranger multi outputs (GEX + TCR + BCR), scanpy-based clustering, cell type prediction, and CNV analysis.

Timepoint | Data Type | Bucket Path
Jun 2024 (T1) | Cell Ranger output | kamil/tumor/output/T1
Jan 2025 (T2) | Cell Ranger output | kamil/tumor/output/T2
Apr 2025 (T3) | Cell Ranger output | kamil/tumor/output/T3
Apr 2025 (T3) | Cell Ranger output (CD45-) | kamil/tumor/output/T3_CD45neg
T1–T3 | Analysis (h5ad, markers, CNV, TCR) | kamil/tumor/analysis

Oxford Nanopore (ONT) Long-Read RNA-seq

All ONT data generated by UCSF

Merged Objects (T1 + T2 + T3)

Annotations guide: Tumor cell labels are in T1_T2_T3_overall_TC_identity. Final cell type annotations are in fine_final_annot. Cluster labels are in merged_louvain_res1.5. Coarse annotations are in coarse_final_annot.

Key findings: The tumor cell percentage (relative to the whole sample) decreases by T3, consistent with histopathology. Immune infiltration increases across time points. See Darya Orlova’s analysis.
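A tumor cell fraction per time point can be tallied from per-cell metadata once it is exported from a merged object. The sketch below uses toy rows, not project data: the `timepoint` key and the label value `"tumor"` are assumptions made for illustration, and only the column name `T1_T2_T3_overall_TC_identity` comes from the annotations guide.

```python
from collections import Counter

# Toy per-cell metadata rows (hypothetical values, not project results).
# "timepoint" and the label "tumor" are assumed names for illustration.
cells = [
    {"timepoint": "T1", "T1_T2_T3_overall_TC_identity": "tumor"},
    {"timepoint": "T1", "T1_T2_T3_overall_TC_identity": "other"},
    {"timepoint": "T3", "T1_T2_T3_overall_TC_identity": "other"},
    {"timepoint": "T3", "T1_T2_T3_overall_TC_identity": "other"},
    {"timepoint": "T3", "T1_T2_T3_overall_TC_identity": "tumor"},
    {"timepoint": "T3", "T1_T2_T3_overall_TC_identity": "other"},
]


def tumor_fraction(rows, label="tumor"):
    """Return {timepoint: fraction of cells whose tumor identity equals label}."""
    totals, tumor = Counter(), Counter()
    for row in rows:
        totals[row["timepoint"]] += 1
        if row["T1_T2_T3_overall_TC_identity"] == label:
            tumor[row["timepoint"]] += 1
    return {tp: tumor[tp] / totals[tp] for tp in sorted(totals)}


print(tumor_fraction(cells))
```

The same tally can be broken down by `fine_final_annot` or `coarse_final_annot` by swapping the column that is counted.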

Longitudinal Single Cell RNA-seq (PBMC)

Sorted live T cells (CD3+) from PBMCs at multiple time points. This dataset is growing as additional time points are collected. The data currently extends through 2026-01-13.

Hudson Lab (FASTQ + Seurat + Flow)

Data generated by Hudson Lab

Kamil’s PBMC scRNA-seq Analysis (GEX, TCR, CNV, CITE)

Re-analysis of PBMC scRNA-seq data by Kamil, including Cell Ranger multi outputs for GEX, TCRαβ, TCRγδ, and CITE-seq across multiple blood draw time points.

  • Cell Ranger outputs: kamil/blood/output — GEX, TCRαβ, TCRγδ, CITE-seq pools for each time point
  • Analysis: kamil/blood/analysis — combined CNV (h5ad), TCR clonotype analysis, database matching

Spatial Transcriptomics & Proteomics

Elucidate Bio (Phenocycler Fusion + Visium HD)

Generated by Elucidate Bio. Includes multiplexed immunofluorescence (IF) data acquired on a Phenocycler Fusion instrument with custom-conjugated antibodies, plus Visium HD run on the same section.

Note: The Visium data is quite sparse. Web summaries: Sample 1 · Sample 2. Elucidate extracted signal from Visium by pseudobulking RNA counts by cell type as called by proteins. See presentation.

Xenium (10x In Situ)

Xenium in-situ spatial transcriptomics for T0 tissue blocks B3 and C3.

HMS ORION Multiplex Imaging

ORION highly multiplexed immunofluorescence imaging and Minerva story visualizations.

Pathology & Imaging

Histopathology slides and immunohistochemistry images. Viewable at osteosarc.com/imaging.

Timepoint | Blocks / Stain | Bucket Path | Size | Source
Nov 2022 (T0) | H&E: B1, B2, B3, B4, C3 | elucidate/HE_images/ | 16.4 GB | Elucidate (scanned)
Nov 2022 (T0) | H&E: B9, B10, B12, B14, B15, B16, C1, C3, D1, D2, D3 | pathology_images/czi_scans_ucsf_2022 | 10 GB | UCSF (Zeiss)
Apr 2025 (T3) | B7-H3 IHC | | | UCSF
Apr 2025 (T3) | EphA2 IHC | | | UCSF

HLA Type

Red Cross HLA typing results.

Class I

HLA-A | HLA-B | HLA-C
A*01:01 | B*08:01 | C*01:02
A*01:11N | B*27:05 | C*07:01

Class II

Locus | Allele 1 | Allele 2
HLA-DPA1 | *01:03 | *02:01
HLA-DPB1 | *04:01 | *11:01
HLA-DQA1 | *05:01 | *04:01
HLA-DQB1 | *04:02 | *02:01
HLA-DRB1 | *03:01 | *08:01
HLA-DRB3 | *01:01 | *01:01
HLA-DRB4 | Absent |
HLA-DRB5 | Absent |
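Downstream tools usually expect fully qualified allele names rather than a table. The sketch below builds such names from the Class I typing above. `CLASS_I` and `full_names` are names introduced here, and dropping alleles with an `N` suffix is an illustrative choice based on the standard nomenclature, in which `N` marks a null (non-expressed) allele. It is not a description of how this project handled A*01:11N.

```python
# Class I typing copied from the table on this page.
CLASS_I = {
    "A": ["01:01", "01:11N"],
    "B": ["08:01", "27:05"],
    "C": ["01:02", "07:01"],
}


def full_names(typing, drop_null=True):
    """Return names such as 'HLA-A*01:01', optionally skipping null alleles."""
    names = []
    for locus, alleles in typing.items():
        for allele in alleles:
            if drop_null and allele.endswith("N"):
                continue  # 'N' suffix denotes a null allele in HLA nomenclature
            names.append(f"HLA-{locus}*{allele}")
    return names


print(full_names(CLASS_I))
# ['HLA-A*01:01', 'HLA-B*08:01', 'HLA-B*27:05', 'HLA-C*01:02', 'HLA-C*07:01']
```

Passing `drop_null=False` keeps all six names, including `HLA-A*01:11N`.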

Vaccine Overlap Summary

View the vaccine overlap summary spreadsheet →

Analysis Tools

Shiny App

An R Shiny app compares bulk RNA-seq data to healthy tissues (GTEx) and other cancers (TCGA).

Interactive Visualizations on This Site

Data Access

Google Cloud Bucket

All data is stored in the gs://osteosarc-genomics bucket.
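The bucket paths listed on this page are relative to the bucket root. The sketch below shows one way to turn them into full URIs and to list objects under a prefix. It assumes the `google-cloud-storage` package is installed and that your Google account has already been granted read access; this page does not describe the bucket's permissions. `gs_uri` and `list_prefix` are helper names introduced here.

```python
BUCKET = "osteosarc-genomics"


def gs_uri(relative_path: str) -> str:
    """Build a full gs:// URI from a bucket-relative path listed on this page."""
    return f"gs://{BUCKET}/{relative_path.lstrip('/')}"


def list_prefix(prefix: str):
    """Yield (object name, size in bytes) for every object under a prefix.

    Requires google-cloud-storage and working credentials (assumption).
    """
    from google.cloud import storage  # imported lazily so gs_uri works without it

    client = storage.Client()
    for blob in client.list_blobs(BUCKET, prefix=prefix):
        yield blob.name, blob.size


print(gs_uri("wgs_cleaned/blood.ss.cleaned.bam"))
# gs://osteosarc-genomics/wgs_cleaned/blood.ss.cleaned.bam

# Example (needs network access and credentials):
# for name, size in list_prefix("wgs_cleaned/"):
#     print(name, size)
```

Listing by prefix before downloading is worthwhile here, since several of the files above are hundreds of gigabytes.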

Pattern Unify System (RCRF)

Several datasets are indexed in the RCRF Pattern Unify system for programmatic access.

  1. Email unify-admin@rcrf.org to request access.
  2. Go to data-commons.rcrf-dev.org and create an account with the same email.
  3. Use the patternq Python library to query data.

Contact Us

Please contact us with any questions or comments.