The goal of WESfilter is to provide handy utility functions to quickly
filter WES VEP (variant effect predictor) files as obtained from YCGA
core facility. Typically these are large files and can have several
hundred thousand rows, and several hundred columns (depending on number
of samples, and version of the underlying pipeline used)
A typical file has as many rows as the total number of SNPs/SNVs
detected across all the samples under the study. In a common workflow
there are initially some samples sequenced and analyzed, and as more
clinical samples become available, they are added to the cohort. The
file is constructed such that as long as at least one member of the
cohort has a specific SNP/SNV detected it is part of the file. There are
usually two files, one for the coding regions, and one for the
non-coding regions. The non-coding regions include UTRs,
introns, non-expressed exons, and regions around the splice sites.
You can install the development version of WESfilter like so:
library(remotes)
remotes::install_github("sameet/WESfilter")
This package assumes the following: - The data has been pre-processed to
an extent.
Normally the data is delivered as a very big file sometimes over 10s of
GBs. This files needs to be processed such that the metadata of the
SNPs/SNVs is separated from the Depth and Genotype Call for each sample.
It is also important that both the tables generated have same rowid as
the first column (or one of the columns) in order for the data to be put
back together. - These pre-processing usually needs to happen separately
for the coding regions, and the non-coding regions. These files are
provided separately by YCGA. - The renv.lock file has a snapshot of
all the packages used in the development of this packages, and should
automatically ensure that the correct versions are already installed.
We will also add a python notebook to show how this package can be
used.
Amount of data being processed is huge. The non-coding WES data
contains over 846K rows. Standard output from the YCGA pipeline
generates an output (depending on the model reference organism) that
contains may columns of meta-data for each called SNP/SNV. For hg38,
there are 89 columns of meta-data. From column 90 onwards we have data
for each sample in the cohort, one column per sample. To streamline the
data-analysis, we divide this large table into 2 smaller tables. Table
1, is the meta-data table, and contains only the columns of meta
information. Table 2 is only the information about the samples in the
cohort. We then add the row-number as an additional column to each of
these tables. As the rows in both tables are identical, the rowid is
also identical. This ensures that we can now process the two pieces of
information separately, and then put the results together, because both
tables share the rowid.
This process becomes computationally expensive as number of samples
increases. This needs to be treated more deliberately, and standard
analysis methods on standard Desktop/Laptop do not work. Common
approaches usually fail with memory errors. To get around this step, we
use duckdb. This approach treats the data as an in memory (or
on disk) data base, uses lazy-evaluation and is able to accomplish
reshaping of very large number of rows. This reshaped data is saved on
the data_board as a pin, and used in the examples shown here.
The WES_data_board folder mentioned here is on the local machine.
Latest version of this folder can be found
here.
Download this folder. Use the path to this folder in the
data_board <- ... line in the example below. Depending on the OS used,
you may have to give complete path.
library(WESfilter)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#> Loading required package: nanoparquet
data_board <- pins::board_folder("~/Data/WES_analysis_nov2025/WES_data_board/") # The path will be specific to you
wes_data <- read_wes_data(data_board, pin_name = "reshaped_WES-coding") # this will be specific to you
meta_df <- read_meta_data(data_board, pin_name = "meta-data_coding") # specific to you
family_df <- read_family_table(
"~/Data/WES_analysis_nov2025/sib-pair-yaleOnly-from-NoorandJeff-dec012025.csv"
)
# family_wes_ids <- family_df |>
To get an idea of distribution of mutations in a single sample
# use_sample <- select_random_sample(wes_data)
get_data_single_sample("M_S-F", wes_data, meta_df) |> # get data for a single sample
filter(as.numeric(`CADD Conservation`) > 20) |> # select mutations that are in the top 1% of deleterious mutations
filter(!(grepl("MUC", Gene))) |> # Remove the mucin genes which are usually highly mutated
make_plots_sample() |> # make plots
purrr::map(\(p) {
p +
ggplot2::theme(
title = ggplot2::element_text(size = 12),
axis.text = ggplot2::element_text(size = 8),
axis.title = ggplot2::element_text(size = 10),
legend.position = "bottom",
legend.key.size = ggplot2::unit(2, "mm"),
legend.text = ggplot2::element_text(
size = 3,
margin = ggplot2::margin(l = 1, unit = "pt")
)
)
}) |>
patchwork::wrap_plots(ncol = 2) +
patchwork::plot_annotation(title = "M_S-F")
#> 1529 variants found for sample M_S-F
#> Number of genes mutated: 1161
#> Warning: There was 1 warning in `filter()`.
#> ℹ In argument: `as.numeric(`CADD Conservation`) > 20`.
#> Caused by warning:
#> ! NAs introduced by coercion

Only top 10 genes by number of mutations are shown here. From the graph
it should be noted that most the genes have a single mutation, and most
of the mutations are Heterozygotic.
cohort_wes <- filter_by_sample(
wes_data,
family_df |> distinct(wes_id) |> pull()
)
siblings <- get_family_siblings(5, family_df)
sibling_snps <- get_per_sib_snps(
siblings,
cohort_wes |> filter(!(grepl("Ref|\\./\\.", wes_info))) # select only those SNPs that are mutated (are not Ref/REF)
)
sibling_specific_snps <- get_specific_snps(sibling_snps)
sibling_specific_wes_data <- data.frame(
rowid = Reduce("union", x = sibling_specific_snps)
) |>
inner_join(cohort_wes, by = join_by("rowid"))
sibling_specific_binary_wes <- make_non_ref_df(sibling_specific_wes_data)
# get data for the entire cohort for the same SNPs.
make_sib_venn(sibling_snps)

If the above steps worked without errors, we see the following objects
message(paste0("Number of rows in wes_data: ", nrow(wes_data)))
#> Number of rows in wes_data: 297828438
message(paste0("Number of rows in cohort_wes: ", nrow(cohort_wes)))
#> Number of rows in cohort_wes: 17179109
message(paste0("Number of siblings: ", length(sibling_specific_snps)))
#> Number of siblings: 2
message(paste0(
"Number of rows in sibling_specific_snps: ",
length(Reduce("union", sibling_specific_snps))
))
#> Number of rows in sibling_specific_snps: 13946
It should be clear from the Venn Diagram that ~ 13K SNPs are present only in either of the siblings.