How Recursion Built a High Throughput Screening Lab Capable of Producing Four Petabytes of Biological Data

On Friday, Recursion will announce the full results of our RxRx1 competition, CellSignal, at NeurIPS 2019 in a panel at 9:30 a.m. The contest encouraged researchers to apply deep learning methods to our open-source RxRx1 dataset of biological images. RxRx1 offers a glimpse into what we do every day at Recursion: combine automated, experimental biology with artificial intelligence to discover novel treatments. To accomplish our mission of decoding biology to radically improve lives, we generate petabytes of biological image data to extend our collective understanding of biology.

While RxRx1 is a large, robust dataset ripe for machine learning research, it represents less than 0.2% of our weekly data output. In fact, since that dataset was released, the capacity of our laboratory has doubled! Over the past three years, we have increased the laboratory's weekly output by a factor of 85, generating four petabytes of data and growing. Building a laboratory capable of producing data at this scale and consistency has been my driving focus at Recursion, and I'd like to share some of the components critical to our success.

Three years ago, Recursion's HTS laboratory could process up to 34 of our 384-well plates per week. Each well in a plate holds a small culture of human cells that are carefully combined with disease-causing agents and potential treatments and then imaged. We maintain a homogeneous, sterile environment for the cells to grow in, feeding them and perturbing them into the various states we want to observe. This assay allowed our machine learning systems to uncover consistent differences between diseased and healthy cells, and we knew it was time to increase the flexibility and output of the laboratory. Additional instrumentation, personnel, and process would be necessary to scale while maintaining high-quality, relatable datasets, but we soon realized our path forward wasn't that clear.
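To put the scale in perspective, here is a back-of-envelope sketch of how plate throughput translates into image-data volume. The sites-per-well, channel count, and file-size figures below are illustrative assumptions, not Recursion's actual acquisition settings:

```python
# Rough estimate of weekly raw image volume from plate throughput.
# wells_per_plate comes from the 384-well format; the other defaults
# (imaging sites per well, channels, bytes per image) are hypothetical.

def weekly_data_bytes(plates_per_week, wells_per_plate=384,
                      sites_per_well=4, channels=6, bytes_per_image=2_000_000):
    """Estimate raw image bytes produced per week."""
    images = plates_per_week * wells_per_plate * sites_per_well * channels
    return images * bytes_per_image

baseline = weekly_data_bytes(34)       # the lab three years ago
scaled = weekly_data_bytes(34 * 85)    # after an 85x throughput increase

print(f"baseline: {baseline / 1e12:.2f} TB/week")
print(f"scaled:   {scaled / 1e12:.2f} TB/week")
```

Under these assumptions, even the original 34-plate week produces hundreds of gigabytes of raw images, so an 85x increase quickly reaches petabyte territory over a few years of operation.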

There are thousands of parameters that define our assay: reagent concentrations, dispensing speeds, instrument types, incubation periods, temperature, humidity, and more. In the beginning, very few of these parameters were well characterized or well controlled. We needed to quickly identify our most impactful parameters, determine their optimal values, and keep those values constant. A few key aspects of our strategy were responsible for our success:

  1. Capture all the data you can now, even if you're not certain how it will be analyzed later
  2. Create consistent feedback loops between execution and data analysis
  3. Implement data-driven quality control

Computer-controlled instrumentation and sensors produce reams of data that can be stored and analyzed later, but most liquid dispensers are not designed to measure the accuracy or precision of the liquid volumes they are dispensing. We implemented an automated spectrophotometer to measure these volumes consistently at each critical step in the process. This data, plus the data coming from our incubators, acoustic dispensers and automated microscopes, is all captured by a schema-less database as it is generated. At the same time, we mounted webcams throughout the laboratory to capture video of experiments being processed manually. With this rich data history, we could review how experiments were executed in the lab. We weren’t sure yet which data would provide critical insights or how those insights would be uncovered, but we captured it.
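A schema-less store works well here precisely because each instrument emits different fields. A minimal sketch of this capture pattern, with hypothetical field names and an in-memory list standing in for the actual database:

```python
import json
import time

# Schema-less event capture: each instrument contributes whatever fields
# it has, and records are stored as-is for later analysis. The record
# shapes below are illustrative, not Recursion's actual schema.

def capture(store, instrument, payload):
    """Append a timestamped record without enforcing a fixed schema."""
    record = {"instrument": instrument, "ts": time.time(), **payload}
    store.append(record)
    return record

events = []
capture(events, "spectrophotometer",
        {"plate": "P001", "well": "A01", "volume_ul": 24.8})
capture(events, "incubator", {"temp_c": 37.0, "humidity_pct": 85})

# Different instruments contribute different fields; nothing is dropped.
print(json.dumps(events, indent=2))
```

The payoff of storing everything up front is that questions invented months later ("did dispense volume drift on that plate?") can still be answered from the raw history.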

Every week, our extremely talented team of laboratory technicians, biologists, chemists and data scientists reviews the data generated the previous week. We pay careful attention to any anomalous or interesting results. These can be noted by the laboratory team, such as when a bottle of reagent is left at room temperature longer than usual, or by our scientists when they notice that a certain set of morphologies is less consistent than in the past. Each team brings their observations to the discussion and we debate potential causes of variation and their impact, then we review data to narrow down potential root causes. Through this constant feedback cycle, we have determined which parameters are most critical to our process, and which parameters can vary over a wider range without impacting results.
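One simple way to surface candidates for this weekly review is to score each metric against its own trailing history. A sketch using a z-score flag; the metric, history values, and threshold are all illustrative:

```python
import statistics

# Flag a weekly metric as anomalous if it sits far outside its own
# trailing history. The 3-sigma threshold is a common convention,
# not Recursion's stated rule.

def flag_anomaly(history, current, z_threshold=3.0):
    """Return (is_anomalous, z_score) for the current value."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    z = (current - mean) / sd
    return abs(z) > z_threshold, z

# e.g. weekly dispense-volume CV (%) for one liquid handler (made-up data)
history = [2.1, 2.0, 2.3, 1.9, 2.2, 2.0, 2.1, 2.2]
flagged, z = flag_anomaly(history, 3.5)
print(flagged, round(z, 1))
```

A flag like this doesn't diagnose anything by itself; it just nominates the metric for the cross-team discussion, where the root-cause debate happens.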

Finally, our data-driven quality control process is foundational to our improvements over the past three years. We practice careful review of intentional and unintentional changes to the laboratory process. We track metrics to measure consistency over weeks, months and years of operation to ensure results are relatable. We measure the relative performance of each instrument in the laboratory daily to ensure all instrumentation is calibrated and functionally identical before experiments are processed. When changes are detected, investigations are launched and corrective and preventive actions are undertaken to ensure these types of unintentional changes do not continue. Intentional changes to critical reagents are carefully risk-assessed and validated by cross-functional teams to ensure morphological consistency is maintained before they are implemented.
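Long-horizon consistency tracking like this is often implemented as a Shewhart-style control check: limits are fixed from a validated baseline period, and later measurements are judged against them. A minimal sketch, with made-up instrument readings and a standard 3-sigma limit as assumptions:

```python
import statistics

# Shewhart-style control check: compute fixed control limits from a
# baseline period, then flag later measurements that fall outside them.
# Baseline values and the 3-sigma convention are illustrative.

def control_limits(baseline, n_sigma=3.0):
    """Lower/upper control limits from a validated baseline period."""
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    return mean - n_sigma * sd, mean + n_sigma * sd

def out_of_control(values, limits):
    """Return the measurements that violate the control limits."""
    lo, hi = limits
    return [v for v in values if not lo <= v <= hi]

# e.g. daily normalized signal from one microscope (made-up data)
baseline = [0.98, 1.01, 0.99, 1.02, 1.00, 0.97, 1.03, 1.00]
limits = control_limits(baseline)
violations = out_of_control([1.00, 0.99, 1.12], limits)
print(violations)  # the drifted reading triggers an investigation
```

Freezing the limits at the baseline (rather than re-estimating them every week) is what makes slow drift detectable: the process is compared against how it performed when it was known to be good.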

Building a phenotypic drug discovery laboratory capable of producing nearly four petabytes of data in under three years is no easy feat. The entire Recursion team of software engineers, data scientists, biologists, chemists and technicians continually rises to the challenge of scaling our output. 2020 is no different: we aim to at least double our weekly output again this coming year. Thanks to partners like HighRes Biosolutions, Molecular Devices and Splunk, as well as all our incredible Recursionauts like AJ and Thos and Charles and Nick, we're not only on track but can easily exceed our goal. If you're interested in learning more about scaling HTS operations or our RxRx1 competition, send me a message and let's connect!
