Bulk Storage

Rules of thumb

Documents must be named. Naming via a filename is possible, but large and variable-length names are an unfortunate headache. If we assign random numbers then the birthday paradox tells us that we can only have the square root of the number of possible documents before the probability of a collision is 50%. However we can check if a document exists before using a document id, just we can get away with smaller document ids than random assignment would allow. To keep to a power of 2 bits in length, and so give enough document ids we use 64-bit ids.

Communications

All nodes can communicate on the general multicast channel (GMC), 224.97.103.108. Every node listens on this channel and keeps an passive track of:

This information is small and is carried in the payload of most small messages to the GMC.

The packet loss and ping times should be roughly the same for every node. If they aren't then we should probably alert an admin. We also expect response times to be normally distributed. The global response time dictates our timeouts.

(aside: In reality, there will be level distinct levels of nodes - those on the local switch and those one level removed. But networking hardware is fast, certainly a lot faster than a disk seek if a node needs to check if it has a given document.)

Node communications are UDP and are either unicast (in which case the target must acknowledge the packet and the source must retransmit a number of times), or broadcast - to the GMC. The replies to broadcast messages may have an expected number of respondents. If this is the case, each node that could respond (which is generally all of them) flips a weighted coin and, based on its knowledge of the rough number of active nodes in the network, may choose not to reply.

Nodes

Nodes have a number of disks. A given document is never replicated across different disks of the same node (or across multiple copies on the same disk, for that matter) due to the increased likelihood of a failure destroying that whole node.

A node keeps bloom filters of the documents that can be found on each disk. This saves on expensive IO lookups. The size of those bloom filters is up to each node, but with 625000 documents per disk a 16 fold, 4-bit counted filter is only 5MB.

Each disk in a node keeps a full record of the docids stored on each other disk in the same node. Again, for 625000 documents per disk and three `other' disks, per disk and 8 bytes (64-bits) per documents that's only 15MB. In reality it will be a little more than this because we won't bother compacting that list very often when documents are removed.

Each node also watches a number of others. For each node that it watches it keeps a full list of the docids on that node on every disk. For each node that it's watching that's 625000*4(disks)*8(bytes per docid)*4(of our disks) = 80MB.

(aside: it would be possible to make savings by using erasure codes rather than duplicating this information across every disk. But given the small overhead it's questionable if the added complexity is worth it.)

Every document on a disk is either incoming, normal or superfluous. All are reported in reply to a lookup. Incoming documents are still being downloaded and superfluous documents are not reported as taking up any space and are deleted as space pressure requires. Only normal documents are marked as members of a disk when updating docids lists (on other disks and on watchers).

Every node has an HTTP port open which answers requests like http://node/0011223344556677. If the node has the requested docid it replies with that data. If not, it triggers a lookup and HTTP redirects to a node which does.

Introductions

A complete virgin node announces itself to the GMC and requests watchers. The expected number of replies is 20 and the node picks those who are currently watching the least nodes. It unicasts with those nodes and, from then on, updates its watchers whenever the contents of its disks change. (With a small amount of buffering).

Normal operations

Lookup

A node broadcasts a lookup request on the GMC and the first good reply is the `answer'. The node keeps track of other replies and the number received is either none, less than the replication level, more than the replication level or correct.

If none, we retransmit until we hit a 1 in a million chance of packet loss and give up.

(aside: we know the global GMC packet loss and we assume the worst - that only one node has the document. Thus, with a 1% GMC drop rate we need to retransmit three times.)

If the number is less then we need to replicate. We pick a node from our local knowledge that has a lot of free space (in percent) and redirect it to copy from one of the nodes that we know has it. If the node refuses we try again. (The node selection is probabilistic to avoid the whole network buggering the most virgin node).

There is a race condition here, but the window is small (as nodes will report incoming documents as soon as they have started replicating it) and the consequence are small and fixed in the next paragraph...

If we get too many replies then we need to tell some of the nodes that their copy is superfluous. We number the nodes by a 64 bit number which is the md4 hash of their IP + port number and the `distance' to a docid is the absolute difference of the docid and the node number. Those nodes most distant from the document get their copies marked as superfluous.

I don't believe that this is racable. At worst, some documents don't get marked as superfluous.

Insertion

Insertion is covered above, except that the source isn't another node.

Stir

Every node performs two types of stir - block stir and doc stir. Block stir consists of checksumming documents and verifying their md4 sum against the value in the metadata. This guards against silent disk corruption (which may not be the disk's fault). If a document is corrupt it is deleted and a lookup is performed to correct its loss.

When serving up a document it's probably a good idea to md4 sum it anyway, since the IO cost is nill. Unfortunately, if a mistake is found it's a little too late but at least it won't happen again.

Doc stir is the continual process where by nodes perform lookups on documents which they hold. In theory this isn't needed, in practice Weird Shit Happens and doc stir causes the network to always move towards a more correct state. If a node has no better data (see loss panic below) then it picks documents at random for the stir. Starting on a random point on a random disk and going in order from there is as good a way as any.

(aside: with a replication level of three, and a GMC packet loss of 1% we expect 3% of packets to need a retransmit. If each node stirs one document per second that's 412 packets per second on the GMC. If we assume a couple hundred bytes each that's less than 100K per second, which is fine. There's about 1300 replies, but they are unicast. The load of 400 lookups per second on each node is a little troubling - even with the bloom filters. Our one billion documents would still take a month to stir completely with such a load)

Bloom only stirring

Consider, for the moment, that we don't actually make nodes perform a disk lookup to confirm that they actually have a copy of a given file during stirring. Given the rate of stirring that we require in order to stir everything within a reasonable time period this could save a lot of disk io.

How big would the blooms need to be to reduce the chance of something bad happening to acceptable levels?

While stirring we know that at least one node has a copy of a given docid (the one that suggested the stir to us). So we consult the blooms of all other nodes and a number, call it n of them return a positive. Thus m (which is <= n) nodes actually have that document. Let p be the false positive rate for our blooms. Let there be N nodes in total.

Assuming a replication level of three, if we find less that n is less than two then we know that something is amiss. So, what is the probability that we get n bloom positives, but actually have no other copies in the network? In this case N-n nodes returned a true negative (with probability of 1-p) and n returned a false positive (probability p). Thus the total probability is NCn (1-p)N-n * pn.

We can plug in a few numbers here. With 16 bits per document in the bloom filter p=0.000459. Assume 400 other nodes and that 2 bloom filters returned positive. The gives a probability of 1.4% that there aren't actually any other copies of that document in the network. That is bad.

Bits per doc in bloomProbability of zero copies with 400 nodes and two hits
160.01398
200.00035
247.66675e-6
281.64732e-7
323.52945e-9
367.55886e-11

That formula gives some nicer numbers for very small numbers of bits per doc, because the probability that there are only two false positives becomes so long - but that's unhelpful.

So, with 32 bits per doc we get a decent result. But consider that over 1 billion documents that means that 3 or 4 documents with only one copy in the network won't be rescued by the stir.

What if we get more than replication-1 bloom hits? What is the probability that the document is in danger then?

Again, assuming that we have a replication level of three, we get 3 blooms hits. The probability that we have no extra copies is NC3 (1-p)N-3 * p3. For 32 bits per doc the probability is 9.84781e-14 which is a good number. For 16 bits it's 0.00085 - which isn't.

But we not consider the number of times that we'll actually get 3 false positives. For 32 bits, it's a similar number: 9.80361e-14 - so the savings made aren't worth the extra code.

So, from the above. If we have 36-fold bloom filters then maybe we can avoid a lot of disk io. Something to think about. (a 36-fold, 4-bit counted filter for 625000*4 documents is only 45MB).

Loss Panic

In the event of a disk loss the node that lost the disk takes a list of the lost docids (from one of the other disks) and starts firing panic messages to every other node that it knows about. Panic messages should be the maximum length (wrt the network MTU) and packed with docids. When receiving such a packet, those docids are put in the loss panic queue and random stirring is suspended until the LPQ is empty.

For a disk of 625000 documents that's 1600 docids per node and about 26 minutes as 400 stirs per second.

TODO: node loss panic.

Site Map
/Root
     AlternateThe Weird and Wonderful
          BacklinksWhat are backlinks
          John GilmoreWhat's Wrong with Copy Protection
     ArchivesBlog Archives
          OneArchive 1
          TwoArchive 2
          ThreeArchive 3
          FourArchive 4
          FiveArchive 5
          SixArchive 6
          SevenArchive 7
          EightArchive 8
          NineArchive 9
          TenArchive 10
          ElevenArchive 11
          TwelveArchive 12
          ThirteenArchive 13
          FourteenArchive 14
          FifteenArchive 15
          SixteenArchive 16
          SeventeenArchive 17
          EighteenArchive 18
          NineteenArchive 19
          Twenty Archive 20
          Twenty OneArchive 21
          Twenty TwoArchive 22
          Twenty ThreeArchive 23
          Twenty FourArchive 24
          Twenty FiveArchive 25
          Twenty SixArchive 26
          Twenty SevenArchive 27
          Twenty EightArchive 28
          Twenty NineArchive 29
          Thirty Archive 30
          Thirty OneArchive 31
     PhotosPoor People Caught on Film
          Jack and the Beanstalk Jack and the Beanstalk
          RIP ScanResults of a Stage Scan Fire
          YosemiteYosemite National Park
     ProjectsIncomplete things from the lab
          Seagull's BaneLinux Automounter
          bttrackdBitTorrent Tracker
          CAPTCHACAPTCHA CGI script
          ConservConsole Serving
          DeerparkUsing Tor with Firefox/1.1 (Deerpark)
          DNSFixFixing DNS
          XoversXTA Crossover Control
          IAFSArchive Org Storage
          JBIG2JBIG2 Encoder
          VerifyPGP Key Verifier
          MaxFlowMaximal Flow in Python
          PyBloomBloom Filters in Python
          pyGnuTLSPython wrapping of GnuTLS
          SxmapApache SuEXEC Map
          HellardUnion Server Notes
     RecordingsFree recordings
          ICSM ChoirSt Paul's Church
     SchoolAncient School Stuff
     WritingsWho knows
          Cap SystemsCapability Systems
          IntroIntroduction to me
          SupremaJMC2 Group Project
          MP LettersLetters I've written to my MP
          SoundSound With Dramsoc
          SyncThreadingThe wonders of user-land threads