Data is the elephant in the room. If we want planetary-scale blockchains, how do we construct them such that we not only stay in agreement, but also have efficient and accessible means of proving what we agree about, in which anyone can reasonably participate?
This notion of "reasonable participation" is critical. If you construct your network and consensus protocol to optimize only for speed to finality and transaction throughput, you inevitably make the cost and complexity of validation much higher, to the extent that ordinary people cannot reasonably participate. Ethereum is sometimes criticised for not adopting the scaling methods used by other chains, but it ultimately always comes back to this principled approach, which favours diversity at many different levels and thinks carefully and intentionally about the kinds of trade-off any change necessarily implies.
There have been many debates about the way to approach the Ethereum 2.0 roadmap, especially as it relates to data. One interesting proposal was called ["Phase One and Done"](https://ethresear.ch/t/phase-one-and-done-eth2-as-a-data-availability-engine/5269). It was thinking like this, together with the ([surprising](https://twitter.com/jinglejamOP/status/1310718738417811459)) success of parallel work on rollups and STARKs, which eventually led us to the current "rollup-centric roadmap" and Vitalik's (very useful) "[incomplete guide to rollups](https://vitalik.ca/general/2021/01/05/rollup.html)". We can also recommend [this section of the ethereum.org docs](https://ethereum.org/en/developers/docs/data-availability/), which is a good resource to turn to in most cases.
## History
At first, we thought the way to scale Ethereum was to "shard" it all. "[Sharding](https://ethereum.org/en/upgrades/sharding/)" is a term used when working with big data, where you cut up the DB into lots of small pieces and have each piece do stuff only related to the data it is responsible for storing. You then need some way of making sure that all the different "shards" can occasionally exchange relevant updates with one another so that, if data in one shard references data in another and gets updated, that other shard can be aware of it.
This was the line of thinking that led to the development of the beacon chain, which was originally intended to be the thing that helped all the different shards talk to one another about stuff happening to the data they were responsible for storing and updating. People were very worried about (i) the complexity and (ii) the user experience of all this. If Maker lives in one shard, and Uniswap in another, how do we quickly and reliably get data about DAI balances into the Uniswap shard and vice versa? There was much gnashing of teeth.
However, we eventually realized that rollups were going to work, and were going to be usable much more quickly than we had originally thought. We can do all sorts of intense computation on a rollup, and then just post the state transition data (optimistic rollups) or the validity proof (validity rollups) back to Ethereum in order to "anchor" each rollup in its robust and time-proven security. This solves some of the scaling problems we face, and it simplifies cross-shard communication and complexity by reducing the number of shards we need as well as what responsibilities those shards have.
The first issue here is that storage (`SSTORE`) on Ethereum is underpriced, while `calldata` is overpriced. Rollups post their state transition data or proofs in the `calldata` of a transaction to their anchor contract on L1, which is needlessly expensive. A proposal to fix this was made in [EIP 4488](https://eips.ethereum.org/EIPS/eip-4488).
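For a rough sense of the numbers, here is a back-of-the-envelope sketch comparing current calldata pricing (16 gas per non-zero byte, per EIP 2028) with the flat 3-gas-per-byte rate proposed in EIP 4488. The batch size is an arbitrary example figure, and we assume the worst case of all non-zero bytes:

```python
# Rough illustration of why calldata pricing matters to rollups.
# 16 gas per non-zero calldata byte (EIP-2028 pricing) vs. the flat
# 3 gas per byte proposed in EIP-4488. The 100 kB batch is made up.

batch_bytes = 100_000                 # hypothetical rollup batch posted as calldata

current_cost = batch_bytes * 16       # worst case: every byte non-zero
proposed_cost = batch_bytes * 3       # EIP-4488's proposed flat rate

print(f"current:  {current_cost:,} gas")   # 1,600,000 gas
print(f"proposed: {proposed_cost:,} gas")  #   300,000 gas
```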
[EIP 4844](https://ethereum-magicians.org/t/eip-4844-shard-blob-transactions/8430) - proto-danksharding - creates a new transaction format that includes a "data blob", which can contain a large amount of data that cannot be accessed by the EVM, though its commitment can be. What is a commitment? It is an efficient way to help validators (on the Consensus Layer/Beacon Chain) sample the data included in a blob and ensure that it is all available, so that Execution Layer clients can reference it in shared state.
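To make this concrete, here is a minimal sketch of how a blob's KZG commitment becomes the "versioned hash" that the EVM can see, closely following the pseudocode in EIP 4844. The example commitment bytes are made up; treat this as an illustration rather than a reference implementation:

```python
import hashlib

# EIP-4844 versions blob commitments so the hash format can evolve later.
VERSIONED_HASH_VERSION_KZG = b"\x01"

def kzg_to_versioned_hash(commitment: bytes) -> bytes:
    """Turn a 48-byte KZG commitment into the 32-byte versioned hash
    that the EVM can access (via the BLOBHASH opcode), per EIP-4844."""
    assert len(commitment) == 48
    return VERSIONED_HASH_VERSION_KZG + hashlib.sha256(commitment).digest()[1:]

# A made-up commitment purely for illustration -- real ones come from
# committing to the blob's polynomial using the KZG trusted setup.
fake_commitment = bytes(48)
print(kzg_to_versioned_hash(fake_commitment).hex())
```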
EIP 4844 just set up the transaction format; it doesn't actually do any sharding of all this data. So, validators still have to download all the data, check it, and store it (even though we may only store it for a relatively short period of time, after which it may be "[pruned](https://ethereum-magicians.org/t/eip-4444-bound-historical-data-in-execution-clients/7450)", depending on the outcome of work on EIP 4444).
## Techniques
You can learn a lot by reading the relevant EIPs linked above and - if you're reading this - it's likely that you will benefit from familiarising yourself with [Ethereum Magicians](https://ethereum-magicians.org/) and the [EthResearch](https://ethresear.ch/) fora, which are where a lot of the cutting-edge work and discussion takes place.
The exact techniques used to create the data blobs mentioned above are called "data erasure encoding" and "KZG commitments".
Commitments are an important concept in cryptography. They allow us to prove we have a message without revealing what it is (a property often called "hiding"), and to verify what we committed to without being able to find another message which also satisfies our original commitment (a property often called "binding"). We can take this further and, rather than committing to a message, commit to a polynomial. We do this because we can "open" polynomials at certain points without revealing the whole curve we're using. This is useful both for zero-knowledge proofs *and* for scalability. It is useful for scalability because polynomial commitment schemes - like KZG commitments - give us constant-size proofs for polynomials of arbitrarily large degree (i.e. I can stuff *a lot* of data into my polynomial, and prove things about it with a constant-size proof). We recommend these two resources, and include a toy commit-reveal example after the list below:
1. [Overview](https://scroll.io/blog/kzg)
2. [Dankrad's intro](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html)
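Before getting to polynomials, a toy commit-reveal scheme (using a plain hash, *not* KZG) shows the two properties in miniature. The message and nonce here are arbitrary examples:

```python
import hashlib
import secrets

# Toy hash-based commitment: "hiding" comes from the random nonce,
# "binding" from the collision resistance of SHA-256. This is NOT a
# polynomial commitment -- just the simplest illustration of the idea.

def commit(message: bytes):
    nonce = secrets.token_bytes(32)                       # randomness gives hiding
    commitment = hashlib.sha256(nonce + message).digest()
    return commitment, nonce

def verify(commitment: bytes, nonce: bytes, message: bytes) -> bool:
    # Binding: finding a different (nonce, message) pair with the same
    # hash would require breaking SHA-256.
    return hashlib.sha256(nonce + message).digest() == commitment

c, n = commit(b"my secret message")
assert verify(c, n, b"my secret message")
assert not verify(c, n, b"a different message")
```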
Erasure encoding is a method by which we extend a piece of data M "chunks" long into a piece N chunks long, in a very specific way that allows us to reconstruct all of the original data from *any* M of the N chunks. Basically, we use fancy maths to extend the data, split it up, and then prove that it is all available using a much smaller sample of the overall data (Vitalik calls this the "self-healing cube" - it's very cool).
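Here is a minimal sketch of that idea using Lagrange interpolation over a small prime field. The chunk values, field size, and which chunks survive are all arbitrary examples; real implementations use far more efficient encodings:

```python
# Toy Reed-Solomon-style erasure coding over a prime field (illustration only).
# M original chunks define a polynomial of degree < M; evaluating it at N > M
# points gives extended chunks, any M of which recover the original data.

P = 2**31 - 1  # a prime, so arithmetic mod P forms a field

def lagrange_eval(points, x):
    """Evaluate the unique degree < len(points) polynomial through `points` at x (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def extend(chunks, n):
    """Extend M original chunks to N chunks: chunk k is the polynomial evaluated at x = k."""
    base = list(enumerate(chunks))          # the M original evaluations at x = 0..M-1
    return [lagrange_eval(base, x) for x in range(n)]

def recover(samples, m):
    """Recover the M original chunks from ANY m of the extended (index, value) samples."""
    assert len(samples) >= m
    pts = samples[:m]
    return [lagrange_eval(pts, x) for x in range(m)]

original = [7, 13, 42, 99]                         # M = 4 data chunks
extended = extend(original, 8)                     # N = 8 chunks, any 4 of which suffice
subset = [(i, extended[i]) for i in (1, 3, 6, 7)]  # pretend the rest were withheld
assert recover(subset, 4) == original
```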
As you can likely tell from the above, this well runs deep. Here are some further resources:
1. [Vitalik saw it first](https://blog.ethereum.org/2014/08/16/secret-sharing-erasure-coding-guide-aspiring-dropbox-decentralizer), calling erasure encoding "arguably the single most underhyped set of algorithms (except perhaps SCIP) in computer science or cryptography".
2. [He continued to write about it for years](https://github.com/ethereum/research/wiki/A-note-on-data-availability-and-erasure-coding), integrating work on the game theory of fraud proofs and the latest updates on SNARKs and STARKs.
The story around data is far from done. Hopefully, the above gives you a sense of the kind of work and insight required to create truly planetary-scale consensus systems.
Author: cryptowanderer