AlphaFold

AlphaFold is a protein structure prediction program developed by DeepMind, a subsidiary of Google under Alphabet.^[1] The program is designed as a deep learning system.^[2]


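The first two major versions of the AlphaFold AI are AlphaFold 1 (2018) and AlphaFold 2 (2020). A team using AlphaFold 1 placed first in the overall rankings of the 13th Critical Assessment of protein Structure Prediction (CASP) in December 2018. The program was particularly successful at producing the most accurate structures for the targets rated most difficult by the competition organisers, for which no existing template structures were available from proteins with a partially similar sequence.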

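Proteins coil and fold into three-dimensional structures, and a protein's function is determined by its structure. Understanding protein structures helps in developing drugs to treat disease.^[3] According to DeepMind, AlphaFold can identify the shape of a protein within days, whereas doing so previously often took researchers years.^[4] In November 2020, in the 14th CASP competition,^[5] AlphaFold 2 achieved a median score of 92.4 out of 100,^[6] a level of accuracy far higher than that of any other program.^[7]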

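On 15 July 2021, the AlphaFold 2 paper was published in Nature as an advance access publication, alongside open-source software and a searchable database of species proteomes.^[8]^[9]^[10]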

AlphaFold 3 was released on 8 May 2024. It can predict the structures of complexes formed by proteins with DNA, RNA, various ligands, and ions.^[7]

Protein folding problem

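Proteins consist of chains of amino acids (the primary structure) that fold spontaneously, in a process called protein folding, to form the three-dimensional tertiary structure of the protein. This structure is crucial to the protein's biological function. However, understanding how the amino acid sequence determines the tertiary structure is highly challenging, and this is known as the "protein folding problem".^[11] The problem involves understanding the thermodynamics of the interatomic forces that stabilise the folded structure, the mechanism and pathway by which a protein reaches its final folded state with extreme rapidity, and how the native structure of a protein can be predicted from its amino acid sequence.^[12]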

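Protein structures have historically been determined experimentally with techniques such as X-ray crystallography, cryo-electron microscopy, and nuclear magnetic resonance, all of which are expensive and time-consuming.^[11]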

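Such efforts over the past 60 years have determined the structures of only about 170,000 proteins, while more than 200 million proteins are known across all forms of life.^[13]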

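If protein structure could be predicted from the amino acid sequence alone, it would greatly advance scientific research. However, Levinthal's paradox shows that although a protein can fold within milliseconds, the time needed to enumerate all possible structures at random in order to find the true native structure is longer than the age of the known universe. This has made protein structure prediction a grand challenge in biology.^[11]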
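The scale of Levinthal's argument can be illustrated with a back-of-the-envelope calculation. The sketch below is only an illustration: the chain length, the number of conformations per residue, and the sampling rate are assumed round numbers chosen for the example, not figures from the cited sources.

```python
# Back-of-the-envelope version of Levinthal's argument.
# Every constant here is an illustrative assumption, not a measurement.

RESIDUES = 100                   # assumed length of a small protein
STATES_PER_RESIDUE = 3           # assumed backbone conformations per residue
SAMPLES_PER_SECOND = 1e13        # assumed rate of trying conformations
SECONDS_PER_YEAR = 3.156e7
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate age of the universe


def exhaustive_search_years(residues, states, rate):
    """Years needed to try every conformation once at the given rate."""
    conformations = states ** residues
    return conformations / rate / SECONDS_PER_YEAR


years = exhaustive_search_years(RESIDUES, STATES_PER_RESIDUE, SAMPLES_PER_SECOND)
ratio = years / AGE_OF_UNIVERSE_YEARS
print(f"exhaustive search: {years:.2e} years")
print(f"that is {ratio:.2e} times the age of the universe")
```

Even under these generous assumptions the exhaustive search exceeds the age of the universe by many orders of magnitude, which is why real folding cannot be a random search.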

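Over the years, researchers have applied many computational methods to the protein structure prediction problem, but except for small, simple proteins their accuracy has fallen far short of experimental techniques, which has limited their value for scientific research.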

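CASP was launched in 1994 to challenge the scientific community to produce its best protein structure predictions. By 2016, GDT scores for the most difficult proteins had still only reached about 40 out of 100.^[13]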

In 2018, AlphaFold entered CASP using deep learning, an artificial intelligence technique.^[11]

Algorithm


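DeepMind is known to have trained the program on over 170,000 proteins from a public repository of protein sequences and structures. The program uses a form of attention network, a deep learning technique that focuses on having the AI identify parts of a larger problem and then piece them together to obtain the overall solution.^[2] The overall training was conducted on processing power equivalent to between 100 and 200 GPUs.^[2] Training the system on this hardware took "a few weeks", after which the program would take "a matter of days" to converge for each structure.^[14]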

AlphaFold 1 (2018)

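AlphaFold 1 (2018) was built on work developed by various teams in the 2010s. That work used large databanks of related DNA sequences from many different organisms (most without known 3D structures) to look for changes at different residues that appeared to be correlated, even though the residues were not adjacent in the main chain. Such correlations suggest that the residues may be physically close to each other even though they are far apart in the sequence, which allows a contact map to be estimated. Building on work done before 2018, AlphaFold 1 extended this approach to estimate a probability distribution for the distance between residues, turning the contact map into a map of likely distances. It also used more advanced learning methods than before to develop the inference. Combining a statistical potential based on this probability distribution with the calculated local Gibbs free energy of the configuration, the team then used gradient descent to find a solution that best fitted both.^[15]^[16]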
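To make the difference between a contact map and a distance distribution concrete, the following minimal sketch shows how a predicted histogram of distances for one residue pair can be collapsed back into a single contact probability or an expected distance. The bin edges, the probabilities, and the 8 Å contact threshold are assumptions made for this illustration; this is not AlphaFold's code.

```python
def contact_probability(bin_upper_edges, probabilities, threshold=8.0):
    """Probability that a residue pair is closer than `threshold` angstroms.

    `bin_upper_edges[i]` is the upper edge of distance bin i and
    `probabilities[i]` is the predicted probability mass in that bin.
    """
    if len(bin_upper_edges) != len(probabilities):
        raise ValueError("one probability per distance bin is required")
    if abs(sum(probabilities) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    # Sum the mass of every bin that lies entirely below the threshold.
    return sum(p for edge, p in zip(bin_upper_edges, probabilities)
               if edge <= threshold)


def expected_distance(bin_centres, probabilities):
    """Mean distance implied by the histogram."""
    return sum(c * p for c, p in zip(bin_centres, probabilities))


# Hypothetical four-bin histogram for a single residue pair.
edges = [4.0, 8.0, 12.0, 16.0]
centres = [2.0, 6.0, 10.0, 14.0]
probs = [0.1, 0.3, 0.4, 0.2]

print(contact_probability(edges, probs))
print(expected_distance(centres, probs))
```

A contact map keeps only the first of these two numbers for each pair, while the full histogram retains how confident the prediction is at every distance, which is the extra information a distance-based potential can exploit.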

At a more technical level, Torrisi et al. summarised the approach of AlphaFold 1 in 2019 as follows:^[17]

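Central to AlphaFold is a distance map predictor implemented as a very deep residual neural network with 220 residual blocks processing a representation of dimensionality 64×64×128, corresponding to input features calculated from two 64-amino-acid fragments. Each residual block has three layers, including a 3×3 dilated convolutional layer; the blocks cycle through dilation values of 1, 2, 4, and 8. In total the model has 21 million parameters. The network uses a combination of 1D and 2D inputs, including evolutionary profiles from different sources and co-evolution features. Alongside a distance map in the form of a very finely grained histogram of distances, AlphaFold predicts Φ and Ψ angles for each residue, which are used to create the initial predicted 3D structure. The AlphaFold authors concluded that the depth of the model, its large crop size, the large training set of roughly 29,000 proteins, modern deep learning techniques, and the richness of information from the predicted histogram of distances helped AlphaFold achieve high contact map prediction precision.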
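The effect of cycling the dilation values can be seen by computing the receptive field of such a stack. The sketch below is a rough illustration that assumes stride-1 convolutions and assumes that only the 3×3 dilated layer in each block widens the receptive field (that is, the other two layers are treated as 1×1); neither assumption is stated in the summary above.

```python
from itertools import cycle, islice

NUM_BLOCKS = 220          # residual blocks, as in the summary above
DILATIONS = (1, 2, 4, 8)  # dilation values used in rotation
KERNEL = 3                # 3x3 dilated convolution


def dilation_schedule(num_blocks, dilations):
    """Dilation used by each block when the values are cycled in order."""
    return list(islice(cycle(dilations), num_blocks))


def receptive_field(schedule, kernel=KERNEL):
    """Receptive field along one axis for stacked stride-1 dilated convolutions.

    Each layer with dilation d widens the field by (kernel - 1) * d.
    """
    field = 1
    for dilation in schedule:
        field += (kernel - 1) * dilation
    return field


schedule = dilation_schedule(NUM_BLOCKS, DILATIONS)
print(schedule[:8])
print(receptive_field(schedule))
```

Under these assumptions the receptive field along each axis is far larger than the 64-residue crop, so every output position can in principle draw on the whole input crop.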

AlphaFold 2 (2020)

Figure: AlphaFold 2 block design. (Source: ^[14])

According to the DeepMind team, the 2020 version of the program (AlphaFold 2) differs significantly from the original version that won CASP13 in 2018.^[18]^[19]

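The DeepMind team found that its previous approach, which combined local physics with a guide potential derived from pattern recognition, tended to over-account for interactions between residues that were close together in the sequence compared with interactions between residues further apart along the chain. As a result, AlphaFold 1 tended to prefer models with more secondary structure (alpha helices and beta sheets) than was actually the case, a form of overfitting.^[20]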

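The software design used in AlphaFold 1 contained a number of separately trained modules that were used to produce the guide potential, which was then combined with the physics-based energy potential. AlphaFold 2 replaced this with a system of sub-networks coupled together into a single differentiable end-to-end model, based entirely on pattern recognition and trained in an integrated way as a single structure.^[19]^[21] Local physics, in the form of energy refinement based on the AMBER model, is applied only as a final refinement step once the neural network prediction has converged, and only slightly adjusts the predicted structure.^[20]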

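A key part of the 2020 system is a pair of modules, believed to be based on a transformer design. They progressively refine a vector of information for each relationship (an "edge" in graph-theory terminology) between one amino acid residue of the protein and another (the green array in the figure), and for each relationship between an amino acid position and each of the different sequences in the input sequence alignment (the red array).^[21] Internally, these refinement transformations contain layers that bring together relevant data and filter out irrelevant data (the "attention mechanism") in a context-dependent way learnt from the training data. The transformations are iterated, with the output of one step becoming the input of the next, so that the sharpened residue/residue information feeds into the update of the residue/sequence information, and vice versa.^[21]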
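For readers unfamiliar with attention, the sketch below shows generic scaled dot-product attention in plain Python. It is an illustration of the general mechanism only: the function names and the toy vectors are hypothetical, and it does not reproduce the specific modules used in AlphaFold 2.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Each query is compared with every key; the resulting weights decide
    how much of each value flows into the output for that query.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Toy example: the query matches the first key much better than the second,
# so the output is dominated by the first value.
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0], [0.0]]
print(attention([[10.0, 0.0]], keys, values))
```

The weights are what lets the network emphasise relevant entries and suppress irrelevant ones in a way that depends on the input itself rather than on fixed positions.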

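The output of these iterations then informs the final structure prediction module,^[21] which also uses transformers^[22] and is itself iterated. In an example presented by DeepMind, the structure prediction module achieved a correct topology for the target protein on its first iteration, with a GDT_TS score of 78, but with a large number (90%) of stereochemical violations, that is, unphysical bond angles or lengths. With subsequent iterations the number of violations fell. By the third iteration the GDT_TS score was approaching 90, and by the eighth iteration the number of violations was approaching zero.^[23]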

The AlphaFold team stated in November 2020 that they believed AlphaFold could be developed further, with room for additional improvements in accuracy.^[18]

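The training data was originally restricted to single peptide chains. However, the October 2021 update, named AlphaFold-Multimer, included protein complexes in its training data. DeepMind stated that this update succeeded about 70% of the time at accurately predicting protein-protein interactions.^[24]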

Competitions

CASP13

In December 2018, DeepMind's AlphaFold placed first in the overall rankings of the 13th Critical Assessment of protein Structure Prediction (CASP).^[25]^[26]

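The program was particularly successful at predicting the most accurate structures for the targets rated most difficult by the competition organisers, for which no existing template structures were available from proteins with a partially similar sequence. AlphaFold gave the best prediction for 25 of the 43 protein targets in this class,^[26]^[27]^[28] achieving a median score of 58.9 on CASP's global distance test (GDT), ahead of the 52.5 and 52.4 scored by the next two best-placed teams,^[29] which were also using deep learning to estimate contact distances.^[30]^[31] Overall, across all targets, the program achieved a GDT score of 68.5.^[32]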

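In January 2020, implementations and illustrative code of AlphaFold 1 were released as open source on GitHub.^[33]^[11] However, as stated in the "Read Me" file on that site: "This code can't be used to predict structure of an arbitrary protein sequence. It can be used to predict structure only on the CASP13 dataset (links below). The feature generation code is tightly coupled to our internal infrastructure as well as external tools, hence we are unable to open-source it." In practice, therefore, the deposited code is not suitable for general use but only for the CASP13 proteins. As of 5 March 2021, the company had not announced plans to make its code publicly available.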

CASP14

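In November 2020, DeepMind's new version, AlphaFold 2, won CASP14. The program made the best prediction for 88 of the 97 targets. On the competition's global distance test (GDT) measure of accuracy, AlphaFold 2 achieved a median score of 92.4 out of 100, meaning that more than half of its predictions scored better than 92.4 for having their atoms in more or less the right place. This level of accuracy is reported to be comparable to experimental techniques such as X-ray crystallography.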
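As a rough guide to how a GDT score is formed, the sketch below computes GDT_TS as the average, over cutoffs of 1, 2, 4, and 8 Å, of the percentage of alpha-carbon atoms that lie within the cutoff of their experimental positions. It is a simplification: it assumes the per-residue distances come from one superposition that has already been done, whereas the full GDT procedure searches over many superpositions for the best value at each cutoff. The input distances are made up for the example.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS from per-residue C-alpha distances in angstroms.

    Assumes the model and the experimental structure have already been
    superposed; no search over superpositions is performed here.
    """
    if not distances:
        raise ValueError("at least one distance is required")
    count = len(distances)
    fractions = [sum(d <= cutoff for d in distances) / count
                 for cutoff in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical distances for a four-residue toy model.
print(gdt_ts([0.5, 1.5, 3.0, 10.0]))
```

Because each residue only counts as inside or outside a cutoff, a single badly placed region lowers the score in proportion to its size rather than dominating it.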

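This far exceeded the performance of AlphaFold 1 in 2018, when only two of its predictions had reached this level of accuracy. In CASP14, 88% of AlphaFold 2's predictions had a GDT_TS score above 80, and on the group of targets classed as the most difficult it achieved a median score of 87.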

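AlphaFold 2 also performed well when measured by the root-mean-square deviation (RMSD) of the positions of the alpha-carbon atoms of the protein backbone: 88% of its predictions had an RMSD of less than 4 Å, 76% achieved better than 3 Å, and 46% were accurate to better than 2 Å.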
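The RMSD figure quoted here can be computed as in the minimal sketch below, which assumes that the predicted and experimental alpha-carbon coordinates are already paired residue by residue and already superposed; finding the optimal superposition is a separate step that is not shown. The coordinates are invented for the example.

```python
import math


def ca_rmsd(predicted, reference):
    """Root-mean-square deviation between paired C-alpha coordinates.

    Both arguments are sequences of (x, y, z) tuples in angstroms and are
    assumed to be superposed already.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum((px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
                  for (px, py, pz), (rx, ry, rz) in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


# Ten atoms: nine placed perfectly, one misplaced by 10 angstroms.
reference = [(float(i), 0.0, 0.0) for i in range(10)]
predicted = list(reference)
predicted[9] = (9.0, 10.0, 0.0)
print(round(ca_rmsd(predicted, reference), 3))
```

In this toy case a single outlier pushes the RMSD above 3 Å even though nine of the ten atoms are exact, which shows how strongly the measure is influenced by the worst-fitted atoms.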

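The accuracy of AlphaFold 2's models was described as exceptionally good, particularly in the modelling of surface side chains. To verify AlphaFold 2 further, the competition organisers approached four leading experimental groups for proteins whose structures they had been unable to determine. The three-dimensional models produced by AlphaFold 2 were accurate enough to be used in molecular replacement to determine the structures of these proteins.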

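Nevertheless, AlphaFold 2 performed comparatively poorly on three structures. Two of them had been obtained by protein NMR (nuclear magnetic resonance) methods, which determine structure directly in aqueous solution, whereas AlphaFold was mostly trained on X-ray crystallography data. The third was a multidomain complex consisting of 52 identical copies of the same domain, a situation that AlphaFold was not designed to handle. For all targets with a single domain, excluding one very large protein and the two NMR structures, AlphaFold 2 achieved a GDT_TS score of over 80.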

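Responses

AlphaFold 2 scoring more than 90 in CASP's global distance test (GDT) is considered a significant achievement in computational biology^[13] and great progress towards a decades-old grand challenge of biology. Nobel Prize winner and structural biologist Venki Ramakrishnan called the result "a stunning advance on the protein folding problem",^[13] adding that "It has occurred decades before many people in the field would have predicted. It will be exciting to see the many ways in which it will fundamentally change biological research."^[14]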

Propelled by press releases from CASP and DeepMind,^[34]^[14] AlphaFold 2's success received wide media attention.^[35] As well as news pieces in the specialist science press, such as Nature,^[36] Science,^[13] MIT Technology Review,^[2] and New Scientist,^[37]^[38] the story was widely covered by major national newspapers,^[39]^[40]^[41]^[42] as well as general news-services and weekly publications, such as Fortune,^[43]^[19] The Economist,^[18] Bloomberg,^[32] Der Spiegel,^[44] and The Spectator.^[45] In London The Times made the story its front-page photo lead, with two further pages of inside coverage and an editorial.^[46]^[47] A frequent theme was that ability to predict protein structures accurately based on the constituent amino acid sequence is expected to have a wide variety of benefits in the life sciences space including accelerating advanced drug discovery and enabling better understanding of diseases.^[36]^[48] Writing about the event, the MIT Technology Review noted that the AI had "solved a fifty-year old grand challenge of biology."^[2] The same article went on to note that the AI algorithm could "predict the shape of proteins to within the width of an atom."^[2]

As summed up by Der Spiegel, reservations about this coverage have focused on two main areas: "There is still a lot to be done" and "We don't even know how they do it".^[49]

Although a 30-minute presentation about AlphaFold 2 was given on the second day of the CASP conference (December 1) by project leader John Jumper,^[50] it has been described as "exceedingly high-level, heavy on ideas and insinuations, but almost entirely devoid of detail".^[7] Unlike other research groups presenting at CASP14, DeepMind's presentation was not recorded and is not publicly available. DeepMind is expected to publish a scientific paper giving an account of AlphaFold 2 in the proceedings volume of the CASP conference, but it is not known whether it will go beyond what was said in the presentation.

Speaking to El País, researcher Alfonso Valencia said "The most important thing that this advance leaves us is knowing that this problem has a solution, that it is possible to solve it... We only know the result. Google does not provide the software and this is the frustrating part of the achievement because it will not directly benefit science."^[42] Nevertheless, as much as Google and DeepMind do release may help other teams develop similar AI systems, an "indirect" benefit.^[42] In late 2019 DeepMind released much of the code of the first version of AlphaFold as open source; but only when work was well underway on the much more radical AlphaFold 2. Another option it could take might be to make AlphaFold 2 structure prediction available as an online black-box subscription service. Convergence for a single sequence has been estimated to require on the order of $10,000 worth of wholesale compute time.^[51] But this would deny researchers access to the internal states of the system, the chance to learn more qualitatively what gives rise to AlphaFold 2's success, and the potential for new algorithms that could be lighter and more efficient yet still achieve such results. Fears of potential for a lack of transparency by DeepMind have been contrasted with five decades of heavy public investment into the open Protein Data Bank and then also into open DNA sequence repositories, without which the data to train AlphaFold 2 would not have existed.^[52]^[53]^[54]

Of note, on June 18th, 2021 Demis Hassabis tweeted: "Brief update on some exciting progress on #AlphaFold! We’ve been heads down working flat out on our full methods paper (currently under review) with accompanying open source code and on providing broad free access to AlphaFold for the scientific community. More very soon!"^[55]

However it is not yet clear to what extent structure predictions made by AlphaFold 2 will hold up for proteins bound into complexes with other proteins and other molecules.^[56] This was not a part of the CASP competition which AlphaFold entered, and not an eventuality it was internally designed to expect. Where structures that AlphaFold 2 did predict were for proteins that had strong interactions either with other copies of themselves, or with other structures, these were the cases where AlphaFold 2's predictions tended to be least refined and least reliable. As a large fraction of the most important biological machines in a cell comprise such complexes, or relate to how protein structures become modified when in contact with other molecules, this is an area that will continue to be the focus of considerable experimental attention.^[56]

With so little yet known about the internal patterns that AlphaFold 2 learns to make its predictions, it is not yet clear to what extent the program may be impaired in its ability to identify novel folds, if such folds are not well represented in the existing protein structures known in structure databases.^[57]^[56] It is also not well known the extent to which protein structures in such databases, overwhelmingly of proteins that it has been possible to crystallise to X-ray, are representative of typical proteins that have not yet been crystallised. And it is also unclear how representative the frozen protein structures in crystals are of the dynamic structures found in the cells in vivo. AlphaFold 2's difficulties with structures obtained by protein NMR methods may not be a good sign.

On its potential as a tool for drug discovery, Stephen Curry notes that while the resolution of AlphaFold 2's structures may be very good, the accuracy with which binding sites are modelled needs to be even higher: molecular docking studies typically require the atomic positions to be accurate to within a 0.3 Å margin, but the predicted protein structures have at best an RMSD of 0.9 Å for all atoms. So AlphaFold 2's structures may be of only limited help in such contexts.^[57]^[56] Moreover, according to Science columnist Derek Lowe, because the prediction of small-molecule binding even then is still not very good, computational prediction of drug targets is simply not in a position to take over as the "backbone" of corporate drug discovery, so "protein structure determination simply isn’t a rate-limiting step in drug discovery in general".^[58] It has also been noted that even with a structure for a protein, understanding how it functions, what it does, and how that fits within wider biological processes can still be very challenging.^[59] Nevertheless, better knowledge of protein structure could lead to better understanding of individual disease mechanisms and ultimately to better drug targets, or to better understanding of the differences between human and animal models, which could ultimately lead to improvements.^[60]

Also, because AlphaFold processes protein-only sequences by design, other associated biomolecules are not considered. On the impact of absent metals, co-factors and, most visibly, co- and post-translational modifications such as protein glycosylation from AlphaFold models, Elisa Fadda (Maynooth University, Ireland) and Jon Agirre (University of York, UK) highlighted the need for scientists to check databases such as UniProt-KB for likely missing components, as these can play an important role not just in folding but in protein function.^[61] However, the authors highlighted that many AlphaFold models were accurate enough to allow for the introduction of post-predictional modifications.^[61]

Finally, some have noted that even a perfect answer to the protein prediction problem would still leave questions about the protein folding problem: understanding in detail how the folding process actually occurs in nature, and how proteins can sometimes also misfold.^[62]

But even with such caveats, AlphaFold 2 was described as a huge technical step forward and intellectual achievement.^[63]^[64]

AlphaFold Protein Structure Database

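The AlphaFold Protein Structure Database was launched on 22 July 2021 as a joint effort between AlphaFold and EMBL-EBI, the European Bioinformatics Institute of the European Molecular Biology Laboratory. It provides open access to over 200 million protein structure predictions to accelerate scientific research. At launch, the database contained AlphaFold-predicted protein structure models for nearly the full UniProt proteomes of humans and 20 model organisms, amounting to over 365,000 proteins. The database does not include proteins with fewer than 16 or more than 2,700 amino acid residues,^[65] but for humans such proteins are available in a downloadable file.^[66]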

The aim was for the collection to cover most of the UniRef90 set of 100 million proteins. As of 15 May 2022, 992,316 predictions were available.^[67]

Applications

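AlphaFold has been used to predict the structures of proteins of SARS-CoV-2, the causative agent of COVID-19. The structures of these proteins were pending experimental determination in early 2020.^[68] The results were examined by scientists at the Francis Crick Institute in the United Kingdom before being released to the wider research community. The team also confirmed an accurate prediction against the experimentally determined SARS-CoV-2 spike protein, which had been shared in the Protein Data Bank, an international open-access database, before releasing the computationally determined structures of the under-studied protein molecules.^[69]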
