From: rkinyon Date: Fri, 9 Feb 2007 16:15:29 +0000 (+0000) Subject: Article changes X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=2cfc2645be3c08e0a0b72e16cba2f73fc570d123;p=dbsrgits%2FDBM-Deep.git Article changes --- diff --git a/article.pod b/article.pod index c4bc33d..b372b49 100644 --- a/article.pod +++ b/article.pod @@ -4,7 +4,7 @@ L is a module written completely in Perl that provides a way of storing Perl datastructures (scalars, hashes, and arrays) on disk instead of -in memory. The datafile produced is able to be ftp'ed from one machine to +in memory. The datafile produced is able to be transferred from one machine to another, regardless of OS or Perl version. There are several reasons why someone would want to do this. @@ -21,7 +21,7 @@ set marshalling periods. Normally, datastructures are limited by the size of RAM the server has. L allows for the size a given datastructure to be limited by disk -instead. +instead (up to the given perl's largefile support). =item * IPC @@ -30,29 +30,32 @@ worrying about the specifics of how a given OS handles IPC. =back -And, with the release of 1.00, there is now a fourth reason - -software-transactional memory, or STM -(L). - =head1 How does DBM::Deep work? L works by tying a variable to a file on disk. Every -single read and write go to the file and modify the file immediately. To +read and write go to the file and modify the file immediately. To represent Perl's hashes and arrays, a record-based file format is used. There is a file header storing file-wide values, such as the size of the internal file pointers. Afterwards, there are the data records. +The most important feature of L is that it can be +completely transparent. Other than the line tying the variable to the file, no +other part of your program needs to know that the variable being used isn't a +"normal" Perl variable. + =head2 DBM::Deep's file structure -L's file structure is a record-based structure. The key (or array +L's file structure is record-based. The key (or array index - arrays are currently just funny hashes internally) is hashed using MD5 and then stored in a cascade of Index and Bucketlist records. The bucketlist record stores the actual key string and pointers to where the data records are stored. The data records themselves are one of Null, Scalar, or Reference. Null represents an I, Scalar represents a string (numbers are -stringified for simplicity) and are allocated in 256byte chunks. References -represent an array or hash reference and contains a pointer to an Index and -Bucketlist cascade of its own. +stringified internally for simplicity) and are allocated in 256byte chunks. +Reference represent an array or hash reference and contains a pointer to an +Index and Bucketlist cascade of its own. Reference will also store the class +the hash or array reference is blessed into, meaning that almost all objects +can be stored safely. =head2 DBM::Deep's class hierarchy @@ -73,8 +76,8 @@ delegate to the engine. There are currently three classes in this layer. These classes manage the file format and all of the ways that the records interact with each other. Nearly every call will make requests to the File -class for reading and/or writing data to the file. There are currently nine -classes in this layer. +classes for reading and/or writing data to the file. There are currently nine +classes in this layer, including a class for each record type. =item * File class @@ -107,11 +110,11 @@ application has created money. With a transaction wrapping the money transfer, if the application crashes in the middle, it's as if the action never happened. So, when the application recovers from the crash, Joe and Bob still have the same amount of money in -their accounts as they did before and the transaction can restart and Bob can +their accounts as they did before. The transaction can restart and Bob can finally receive his zorkmids. More formally, transactions are generally considered to be proper when they are -ACID-compliant. ACID is an acronym that means the following: +ACID-compliant. ACID is an acronym that stands for the following: =over 4 @@ -122,18 +125,22 @@ Either every change happens or none of the changes happen. =item * Consistent When the transaction begins and when it is committed, the database must be in -a legal state. This restriction doesn't apply to L very much. +a legal state. This condition doesn't apply to L as all +Perl data structures are internally consistent. =item * Isolated As far as a transaction is concerned, it is the only thing running against the -database while it is running. Unlike most RDBMSes, L provides the -strongest isolation level possible. +database while it is running. Unlike most RDBMSes, L +provides the strongest isolation level possible, usually called +I by most RDBMSes. =item * Durable -Once the database says that a comit has happened, the commit will be -guaranteed, regardless of whatever happens. +Once the database says that a commit has happened, the commit will be +guaranteed, regardless of whatever happens. I chose to not implement this +condition in LN. =back @@ -141,42 +148,44 @@ guaranteed, regardless of whatever happens. The ability to have actions occur in either I (as in the previous example) or I from the rest of the users of the data is a powerful -thing. This allows for a certain amount of safety and predictability in how +thing. This allows for a large amount of safety and predictability in how data transformations occur. Imagine, for example, that you have a set of calculations that will update various variables. However, there are some situations that will cause you to throw away all results and start over with a different seed. Without transactions, you would have to put everything into temporary variables, then transfer the values when the calculations were found -to be successful. With STM, you start a transaction and do your thing within -it. If the calculations succeed, you commit. If they fail, you rollback and -try again. If you're thinking that this is very similar to how SVN or CVS -works, you're absolutely correct - they are transactional in the exact same -way. +to be successful. If you ever add a new value or if a value is used in only +certain calculations, you may forget to do the correct thing. With +transactions, you start a transaction and do your thing within it. If the +calculations succeed, you commit. If they fail, you rollback and try again. If +you're thinking that this is very similar to how SVN or CVS works, you're +absolutely correct - they are transactional in exactly the same way. =head1 How it happened =head2 The backstory -The addition of transactions to L has easily been the single most -complex software endeavor I've ever undertaken. The first step was to figure -out exactly how transactions were going to work. After several spikesN, the best design seemed to look to SVN -instead of relational databases. The more I investigated, the more I ran up -against the object-relational impedance mismatch +The addition of transactions to L has easily been the +single most complex software endeavor I've ever undertaken. The first step was +to figure out exactly how transactions were going to work. After several +spikesN, the best design seemed to +look to SVN instead of relational databases. The more I investigated, the more +I ran up against the object-relational impedance mismatch N, this time in terms of being able to translate designs. In the relational world, transactions are generally implemented either as row-level locks or using MVCC -N. Both of +N. Both of these assume that there is a I, or singular object, that can be locked transparently to everything else. This doesn't translate to a fractally repeating structure like a hash or an array. However, the design used by SVN deals with directories and files which corresponds very closely to hashes and hashkeys. In SVN, the modifications are -stored in the file's structure. Translating this to hashes and hashkeys, this -means that transactional information should be stored in the keys. This means -that the entire datafile is unaware of anything to do with transactions, except -for the key's data structure within the bucket. +stored in the file's metadata. Translating this to hashes and hashkeys, this +means that transactional information should be stored in the key's metadata. +Or, in L terms, within the Bucket for that key. As a nice +side-effect, the entire datafile is unaware of anything to do with +transactions, except for the key's data structure within the bucket. =head2 Transactions in the keys @@ -218,9 +227,9 @@ assigned to, look at the spot for the HEAD. =head2 The concept of the HEAD This is a concept borrowed from SVN. In SVN, the HEAD revision is the latest -revision checked into the repository. When you do a ocal modification, you're -doing a modification to the HEAD. Then, you choose to either check in your -code (commit()) or revert (rollback()). +revision checked into the repository. When you do a local modification, you're +doing a modification to your copy of the HEAD. Then, you choose to either +check in your code (commit()) or revert (rollback()). In L, I chose to make the HEAD transaction ID 0. This has several benefits: @@ -259,6 +268,8 @@ L). Committing, however, requires that all the changes must be transferred over from the bucket entry for the given transaction ID to the entry for the HEAD. +This tracking is done by the modified buckets themselves. They + =head2 Deleted marker Transactions are performed copy-on-write. This means that if there isn't an @@ -286,7 +297,7 @@ The second major piece to the 1.00 release was freespace management. In pre-1.00 versions of L, the space used by deleted keys would not be recycled. While always a requested feature, the complexity required to implement freespace meant that it needed to wait for a complete rewrite of -several pieces, such as for transactions. +several pieces, such as the engine. Freespace is implemented by regularizing all the records so that L only has three different record sizes - Index, BucketList, and Data. Each @@ -319,8 +330,8 @@ By providing a staleness counter for transactions, the costs of cleaning up finished transactions is deferred until the space is actually used again. This is at the cost of having less-than-optimal space utilization. Changing this in the future would be completely transparent to users, so I felt it was an -acceptable tradeoff for delivering working code quickly. +acceptable tradeoff for quick delivery of a functional product. -=head1 Conclusion +=head1 The future =cut