WESfilter

The goal of WESfilter is to provide handy utility functions to quickly filter WES VEP (Variant Effect Predictor) files as obtained from the YCGA core facility. These are typically large files with several hundred thousand rows and several hundred columns, depending on the number of samples and the version of the underlying pipeline used.

A typical file has as many rows as the total number of SNPs/SNVs detected across all samples in the study. In a common workflow, some samples are sequenced and analyzed initially, and more clinical samples are added to the cohort as they become available. The file is constructed such that a specific SNP/SNV is part of the file as long as at least one member of the cohort carries it. There are usually two files, one for the coding regions and one for the non-coding regions. The non-coding regions include UTRs, introns, non-expressed exons, and the regions around splice sites.
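As a toy illustration of that construction (hypothetical variant IDs and genotypes; `Ref` and `./.` follow the call notation used in the examples later in this README):

```python
# Hypothetical sketch: a variant stays in the file if at least one cohort
# member carries it. Rows are SNPs/SNVs, columns are samples.
calls = {
    "chr1:12345": {"s1": "0/1", "s2": "Ref"},   # kept: s1 carries it
    "chr2:67890": {"s1": "Ref", "s2": "Ref"},   # dropped: nobody carries it
}
kept = {
    snp: gt for snp, gt in calls.items()
    if any(g not in ("Ref", "./.") for g in gt.values())
}
print(sorted(kept))  # ['chr1:12345']
```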

Installation

You can install the development version of WESfilter like so:

library(remotes)
remotes::install_github("sameet/WESfilter")

Important Details

This package assumes the following:

- The data has been pre-processed to an extent. Normally the data is delivered as a single very large file, sometimes tens of GBs. This file needs to be processed so that the meta-data of the SNPs/SNVs is separated from the depth and genotype call for each sample. Both resulting tables must have the same rowid as the first column (or one of the columns) so that the data can be put back together.
- This pre-processing usually needs to happen separately for the coding regions and the non-coding regions; these files are provided separately by YCGA.
- The renv.lock file has a snapshot of all the packages used in the development of this package, and should automatically ensure that the correct versions are installed.

Planned Work

We will also add a Python notebook to show how this package can be used.

Details of Pre-processing

The amount of data being processed is huge: the non-coding WES data contains over 846K rows. The standard YCGA pipeline generates an output that, depending on the reference organism, contains many columns of meta-data for each called SNP/SNV. For hg38 there are 89 columns of meta-data; from column 90 onwards there is one column per sample in the cohort. To streamline the analysis, we divide this large table into two smaller tables. Table 1 is the meta-data table and contains only the columns of meta-information. Table 2 contains only the information about the samples in the cohort. We then add the row number as an additional column to each of these tables. Because the rows in both tables are identical, the rowid is also identical. This ensures that we can process the two pieces of information separately and then put the results back together, because both tables share the rowid.
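The split can be sketched as follows (a minimal Python illustration with made-up column and sample names; the real hg38 output has 89 meta-data columns, then one column per sample):

```python
# Hypothetical sketch: split a wide VEP table into a meta-data table and a
# per-sample genotype table that share a rowid column.
N_META = 2  # 89 for the real hg38 output

header = ["Chrom", "Gene", "sample_A", "sample_B"]
body = [
    ["chr1", "GENE_A", "0/1", "./."],
    ["chr2", "GENE_B", "Ref", "1/1"],
]

# rowid as the first column of both tables makes re-assembly unambiguous.
meta_table = [["rowid"] + header[:N_META]] + [
    [i + 1] + row[:N_META] for i, row in enumerate(body)
]
call_table = [["rowid"] + header[N_META:]] + [
    [i + 1] + row[N_META:] for i, row in enumerate(body)
]

# Joining on rowid reconstructs the original rows.
rejoined = [m[1:] + c[1:] for m, c in zip(meta_table[1:], call_table[1:])]
assert rejoined == body
```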

This process becomes computationally expensive as the number of samples increases, and needs to be handled deliberately: standard in-memory analysis on a typical desktop/laptop does not work, and common approaches usually fail with memory errors. To get around this, we use duckdb, which treats the data as an in-memory (or on-disk) database, uses lazy evaluation, and is able to reshape a very large number of rows. The reshaped data is saved on the data_board as a pin, and is used in the examples shown here.

Example

The WES_data_board folder mentioned here is on the local machine. The latest version of this folder can be found here. Download this folder and use its path in the data_board <- ... line in the example below. Depending on the OS used, you may have to give the complete path.

library(WESfilter)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: nanoparquet


data_board <- pins::board_folder("~/Data/WES_analysis_nov2025/WES_data_board/") # The path will be specific to you
wes_data <- read_wes_data(data_board, pin_name = "reshaped_WES-coding") # this will be specific to you
meta_df <- read_meta_data(data_board, pin_name = "meta-data_coding") # specific to you

family_df <- read_family_table(
  "~/Data/WES_analysis_nov2025/sib-pair-yaleOnly-from-NoorandJeff-dec012025.csv"
)
# family_wes_ids <- family_df |>

To get an idea of the distribution of mutations in a single sample:

# use_sample <- select_random_sample(wes_data)
get_data_single_sample("M_S-F", wes_data, meta_df) |> # get data for a single sample
  filter(as.numeric(`CADD Conservation`) > 20) |> # select mutations that are in the top 1% of deleterious mutations
  filter(!(grepl("MUC", Gene))) |> # Remove the mucin genes which are usually highly mutated
  make_plots_sample() |> # make plots
  purrr::map(\(p) {
    p +
      ggplot2::theme(
        title = ggplot2::element_text(size = 12),
        axis.text = ggplot2::element_text(size = 8),
        axis.title = ggplot2::element_text(size = 10),
        legend.position = "bottom",
        legend.key.size = ggplot2::unit(2, "mm"),
        legend.text = ggplot2::element_text(
          size = 3,
          margin = ggplot2::margin(l = 1, unit = "pt")
        )
      )
  }) |>
  patchwork::wrap_plots(ncol = 2) +
  patchwork::plot_annotation(title = "M_S-F")
#> 1529 variants found for sample M_S-F
#> Number of genes mutated: 1161
#> Warning: There was 1 warning in `filter()`.
#> ℹ In argument: `as.numeric(`CADD Conservation`) > 20`.
#> Caused by warning:
#> ! NAs introduced by coercion

Only the top 10 genes by number of mutations are shown here. From the graph it should be noted that most of the genes have a single mutation, and most of the mutations are heterozygous.

cohort_wes <- filter_by_sample(
  wes_data,
  family_df |> distinct(wes_id) |> pull()
)

siblings <- get_family_siblings(5, family_df)
sibling_snps <- get_per_sib_snps(
  siblings,
  cohort_wes |> filter(!(grepl("Ref|\\./\\.", wes_info))) # keep only SNPs that are mutated (not Ref, and not a missing "./." call)
)

sibling_specific_snps <- get_specific_snps(sibling_snps)

sibling_specific_wes_data <- data.frame(
  rowid = Reduce("union", x = sibling_specific_snps)
) |>
  inner_join(cohort_wes, by = join_by("rowid"))

sibling_specific_binary_wes <- make_non_ref_df(sibling_specific_wes_data)

# get data for the entire cohort for the same SNPs.
make_sib_venn(sibling_snps)

If the above steps worked without errors, we should see the following objects:

message(paste0("Number of rows in wes_data: ", nrow(wes_data)))
#> Number of rows in wes_data: 297828438
message(paste0("Number of rows in cohort_wes: ", nrow(cohort_wes)))
#> Number of rows in cohort_wes: 17179109
message(paste0("Number of siblings: ", length(sibling_specific_snps)))
#> Number of siblings: 2
message(paste0(
  "Number of rows in sibling_specific_snps: ",
  length(Reduce("union", sibling_specific_snps))
))
#> Number of rows in sibling_specific_snps: 13946

It should be clear from the Venn diagram that ~13K SNPs are present in only one of the two siblings.