R&D: Multiple Interleaved RS Codes for Storage Using Up to Mb-Scale Synthetic DNA in Living Cells

Synthetic Biology Journal has published an article written by Weigang Chen, School of Microelectronics, Tianjin University, Tianjin 300072, China, and Frontiers Science Center for Synthetic Biology (MOE), Tianjin University, Tianjin 300072, China; Panpan Wang and Qi Ge, School of Microelectronics, Tianjin University, Tianjin 300072, China; Mingzhe Han, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China, and Frontiers Science Center for Synthetic Biology (MOE), Tianjin University, Tianjin 300072, China; and Jian Guo, School of Microelectronics, Tianjin University, Tianjin 300072, China.

Abstract: “Synthetic DNA, as a potential digital data storage medium, has a high storage density and can be used for a very long period. It is expected to serve as an important option for future massive data storage. However, the synthesis, assembly and sequencing of DNA often introduce multiple types of base errors, which prevents the reliability requirements of data storage from being met, while reliability-enhanced coding schemes usually sacrifice logical coding density by adding redundancy. To deal with this problem, an encoding process for DNA data storage using large synthetic DNA fragments in Saccharomyces cerevisiae was proposed. The data-bearing DNA chunks were constructed by interleaving multiple codewords of Reed-Solomon (RS) codes with a very high code rate, and were embedded alternately with autonomous replication sequences (ARSs) to form a yeast artificial chromosome. Using high-throughput sequencing, data readout combines short-read assembly with de Bruijn graphs, ARS-guided contig combination and erasure/error correction to achieve reliable data recovery. The error correction capability is fully exploited by using interleaving to turn large missing fractions into random erasures spread across all the RS codewords, and by correcting more erasures than errors. We designed and simulated a 2.5 Mb ring chromosome and successfully recovered the original data from 20x high-throughput sequencing reads. The simulated sequencing data were generated using the ART simulation software, which had been trained using real sequencing data from an artificial chromosome of 254,886 bp constructed previously for data storage. All the processes, including large DNA chunk assembly, DNA replication, extraction and high-throughput sequencing, are viewed as the DNA storage channel in the information theory community. We provided an efficient encoding scheme matching the codes and the DNA storage channel based on the information theory paradigm.
The logical density of the data DNA chunks was 1.973 bit/bp, and the overall logical density still reached 1.947 bit/bp including the biological units (ARSs and vector backbones). The demonstrated design process can support DNA coding schemes of different lengths from kb up to Mb, which provides flexible verification and support for wet experiments in the synthesis and sequencing of large fragments of DNA for digital data storage.”
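
The interleaving idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the interleaver layout, so a simple column-wise block interleaver is assumed, and the names and values (DEPTH, N, PARITY, and the helper functions) are hypothetical. The sketch shows how one contiguous missing stretch is divided almost evenly among the codewords, and it uses the standard RS decoding condition that a codeword with a given number of parity symbols can be recovered when twice the number of errors plus the number of erasures does not exceed that parity count.

```python
# Hypothetical parameters, chosen only for illustration (not from the paper).
DEPTH = 8     # number of interleaved RS codewords
N = 255       # symbols per codeword
PARITY = 6    # parity symbols per codeword (a very high-rate code)


def interleave(codewords):
    """Column-wise interleave: symbol 0 of every codeword, then symbol 1, ..."""
    n = len(codewords[0])
    assert all(len(cw) == n for cw in codewords)
    return [cw[i] for i in range(n) for cw in codewords]


def deinterleave(stream, depth):
    """Inverse mapping: stream position p belongs to codeword p % depth."""
    return [stream[j::depth] for j in range(depth)]


def erasures_per_codeword(burst_start, burst_len, depth):
    """Count how many symbols of a contiguous missing stretch hit each codeword."""
    counts = [0] * depth
    for p in range(burst_start, burst_start + burst_len):
        counts[p % depth] += 1
    return counts


def can_decode(errors, erasures, parity):
    """Standard RS condition: 2 * errors + erasures <= number of parity symbols."""
    return 2 * errors + erasures <= parity


# Placeholder symbols tagged with (codeword index, position) to check the mapping.
codewords = [[(j, i) for i in range(N)] for j in range(DEPTH)]
stream = interleave(codewords)
assert deinterleave(stream, DEPTH) == codewords

# A 40-symbol gap in the interleaved stream costs each codeword only 5 erasures.
counts = erasures_per_codeword(burst_start=100, burst_len=40, depth=DEPTH)
print(counts)
print(all(can_decode(0, c, PARITY) for c in counts))
```

Under these assumed parameters, the same 40-symbol gap falling inside a single non-interleaved codeword would far exceed its 6 parity symbols. The same parity budget also covers only 3 substitution errors at unknown positions, which is why treating known missing regions as erasures is advantageous.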