[dbsrgits/DBM-Deep.git] / article.pod

=head0 Adding transactions to DBM::Deep

=head1 What is DBM::Deep?

L<DBM::Deep> is a module written completely in Perl that provides a way of
storing Perl datastructures (scalars, hashes, and arrays) on disk instead of
in memory. The datafile produced is able to be ftp'ed from one machine to
another, regardless of OS or Perl version. There are several reasons why
someone would want to do this.

=over 4

=item * Transparent Persistence

This is the ability to save a set of data structures to disk and retrieve them
later without the vast majority of the program even knowing that the data is
persisted. Furthermore, the datastructure is persisted immediately and not at
set marshalling periods.

=item * Huge datastructures

Normally, datastructures are limited by the size of RAM the server has.
L<DBM::Deep> allows for the size a given datastructure to be limited by disk
instead.

=item * IPC

While not a common use, this allows for inter-process communication without
worrying about the specifics of how a given OS handles IPC.

=back

And, with the release of 1.00, there is now a fourth reason -
software-transactional memory, or STM.

=head1 What are transactions?

Originally from the database world, a transaction is a way of isolating the
effects of a given set of actions, then applying them all at once. It's a way
of saying "I'm going to try the following steps, see if I like the result,
then I want everyone else looking at this datastore to see the effects
immediately." The most common example is taken from banking. Let's say that an
application receives a request to have Joe pay Bob five zorkmids.  Without
transactions, the application would take the money from Joe's account, then
add the money to Bob's account. But, what happens if the application crashes
after debiting Joe, but before crediting Bob? The application has made money
disappear. Or, vice versa, if Bob is credited before Joe is debited, the
application has created moneyN<Yes, the US federal government does use
transactions, so this isn't the cause.>.

With a transaction wrapping the money transfer, if the application crashes in
the middle, it's as if the action never happened. So, when the application
recovers from the crash, Joe and Bob still have the same amount of money in
their accounts as they did before and the transaction can restart and Bob can
finally receive his zorkmids.

=head1 Why add them to DBM::Deep?

The ability to have actions occur in either I<atomically> (as in the previous
example) or I<isolation> from the rest of the users of the data is a powerful
one. This allows for a certain amount of safety and predictability in how data
transformations occur. Imagine, for example, that you have a set of
calculations that will update various variables. However, there are some
situations that will cause you to throw away all results and start over with a
different seed. Without transactions, you would have to put everything into
temporary variables, then transfer the values when the calculations were found
to be successful. With STM, you start a transaction and do your thing within
it. If the calculations succeed, you commit. If they fail, you rollback and
try again. If you're thinking that this is very similar to how SVN or CVS
works, you're absolutely correct - SCMs are transactional in the exact same
way.

=head1 How it happened

=head2 The backstory

This project has easily been the single most complex software endeavor I've
ever undertaken. The first step was to figure out exactly how transactions
were going to work. After several spikesN<These are throwaway coding
explorations.>, the best design seemed to be to make the keys mediate any
transactional information. This means that the entire datafile is unaware of
anything to do with transactions, except for the key's data structure.

=head2 Test-driven development

As transactions have so many edge cases that need to be tested for, it was
imperative that I write my tests along with my development. I subscribe to
TDD, or Test-Driven Development, particularly in my CPAN work. This means that
I try to write a test exercising a piece of functionality I<before> writing
the code providing that functionality. While I haven't always succeeded, doing
this meant that I was able to catch a lot of regressions very quickly. You'll
see this very quickly.

=head2 DBM::Deep's file structure

L<DBM::Deep>'s file structure is a record-based structure. The key (or array
index - arrays are just funny hashes internally) is hashed using MD5 and then
stored in a cascade of Index and Bucketlist records. The bucketlist record
stores the actual key string and pointers to where the data records are
stored. The data records themselves are one of Null, Scalar, or Reference.
Null represents an I<undef>, Scalar represents a string (numbers are
stringified for simplicity) and are allocated in 256byte chunks. Reference
represents an array or hash reference and contains a pointer to an Index and
Bucketlist cascade of its own.

=head2 Transactions in the keys

The first pass was to expand the Bucketlist sector to go from a simple key / 
datapointer mapping to a more complex key / transaction / datapointer mapping.
Initially, I interposed a Transaction record that the bucketlist pointed to.
That then contained the transaction / datapointer mapping. This had the
advantage of changing nothing except for adding one new sector type and the
handling for it. This was very quickly merged into the Bucketlist record to
simplify the resulting code.

This first step got me to the point where I could pass the following test:

  my $db1 = DBM::Deep->new( $filename );
  my $db2 = DBM::Deep->new( $filename );

  $db1->{abc} = 'foo';

  is( $db1->{abc}, 'foo' );
  is( $db2->{abc}, 'foo' );

  $db1->begin_work();

      is( $db1->{abc}, 'foo' );
      is( $db2->{abc}, 'foo' );

      $db1->{abc} = 'floober';

      is( $db1->{abc}, 'floober' );
      is( $db2->{abc}, 'foo' );

Just that much was a major accomplishment. The first pass only looked in the
transaction's spot in the bucket for that key. And, that passed my first tests
because I didn't check that C<$db1-E<gt>{abc}> was still 'foo' I<before>
modifying it in the transaction. To pass that test, the code for retrieval
needed to look first in the transaction's spot and if that spot had never been
assigned to, look at the spot for the HEAD.

=head2 The concept of the HEAD

This is a concept borrowed from SVN. In SVN, the HEAD revision is the latest
revision checked into the repository. When you do a ocal modification, you're
doing a modification to the HEAD. Then, you choose to either check in your
code (commit()) or revert (rollback()).

In L<DBM::Deep>, I chose to make the HEAD transaction ID 0. This has several
benefits:

=over 4

=item * Easy identifiaction of a transaction

C<if ( $trans_id ) {}> will run the code iff we are in a running transaction.

=item * The HEAD is the first bucket

In a given bucket, the HEAD is the first datapointer because we mutliply the
size of the transactional bookkeeping by the transaction ID to find the offset
to seek into the file.

=back

=head2 Tracking modified buckets

This gets commit

=head2 Deleted marker

This gets delete within the txn and assignment within the txn

=head2 Freespace management

This keeps the file from exploding in size, but it interacts with the deleted
marker.

=cut
Commit	Line	Data
315b34eb	1	=head0 Adding transactions to DBM::Deep
	2
	3	=head1 What is DBM::Deep?
	4
	5	L<DBM::Deep> is a module written completely in Perl that provides a way of
	6	storing Perl datastructures (scalars, hashes, and arrays) on disk instead of
	7	in memory. The datafile produced is able to be ftp'ed from one machine to
	8	another, regardless of OS or Perl version. There are several reasons why
	9	someone would want to do this.
	10
	11	=over 4
	12
	13	=item * Transparent Persistence
	14
	15	This is the ability to save a set of data structures to disk and retrieve them
	16	later without the vast majority of the program even knowing that the data is
	17	persisted. Furthermore, the datastructure is persisted immediately and not at
	18	set marshalling periods.
	19
	20	=item * Huge datastructures
	21
	22	Normally, datastructures are limited by the size of RAM the server has.
	23	L<DBM::Deep> allows for the size a given datastructure to be limited by disk
	24	instead.
	25
	26	=item * IPC
	27
	28	While not a common use, this allows for inter-process communication without
	29	worrying about the specifics of how a given OS handles IPC.
	30
	31	=back
	32
	33	And, with the release of 1.00, there is now a fourth reason -
	34	software-transactional memory, or STM.
	35
	36	=head1 What are transactions?
	37
	38	Originally from the database world, a transaction is a way of isolating the
	39	effects of a given set of actions, then applying them all at once. It's a way
	40	of saying "I'm going to try the following steps, see if I like the result,
	41	then I want everyone else looking at this datastore to see the effects
	42	immediately." The most common example is taken from banking. Let's say that an
	43	application receives a request to have Joe pay Bob five zorkmids. Without
	44	transactions, the application would take the money from Joe's account, then
	45	add the money to Bob's account. But, what happens if the application crashes
	46	after debiting Joe, but before crediting Bob? The application has made money
	47	disappear. Or, vice versa, if Bob is credited before Joe is debited, the
	48	application has created moneyN<Yes, the US federal government does use
	49	transactions, so this isn't the cause.>.
	50
	51	With a transaction wrapping the money transfer, if the application crashes in
	52	the middle, it's as if the action never happened. So, when the application
	53	recovers from the crash, Joe and Bob still have the same amount of money in
	54	their accounts as they did before and the transaction can restart and Bob can
	55	finally receive his zorkmids.
	56
	57	=head1 Why add them to DBM::Deep?
	58
	59	The ability to have actions occur in either I<atomically> (as in the previous
	60	example) or I<isolation> from the rest of the users of the data is a powerful
	61	one. This allows for a certain amount of safety and predictability in how data
	62	transformations occur. Imagine, for example, that you have a set of
	63	calculations that will update various variables. However, there are some
	64	situations that will cause you to throw away all results and start over with a
65	different seed. Without transactions, you would have to put everything into
66	temporary variables, then transfer the values when the calculations were found
67	to be successful. With STM, you start a transaction and do your thing within
68	it. If the calculations succeed, you commit. If they fail, you rollback and
d7866e36	69	try again. If you're thinking that this is very similar to how SVN or CVS
	70	works, you're absolutely correct - SCMs are transactional in the exact same
	71	way.
315b34eb	72
	73	=head1 How it happened
	74
d7866e36	75	=head2 The backstory
d7866e36	76
315b34eb	77	This project has easily been the single most complex software endeavor I've
d7866e36	78	ever undertaken. The first step was to figure out exactly how transactions
	79	were going to work. After several spikesN<These are throwaway coding
	80	explorations.>, the best design seemed to be to make the keys mediate any
	81	transactional information. This means that the entire datafile is unaware of
	82	anything to do with transactions, except for the key's data structure.
	83
bb67b124	84	=head2 Test-driven development
	85
	86	As transactions have so many edge cases that need to be tested for, it was
	87	imperative that I write my tests along with my development. I subscribe to
	88	TDD, or Test-Driven Development, particularly in my CPAN work. This means that
	89	I try to write a test exercising a piece of functionality I<before> writing
	90	the code providing that functionality. While I haven't always succeeded, doing
	91	this meant that I was able to catch a lot of regressions very quickly. You'll
	92	see this very quickly.
	93
d7866e36	94	=head2 DBM::Deep's file structure
	95
	96	L<DBM::Deep>'s file structure is a record-based structure. The key (or array
	97	index - arrays are just funny hashes internally) is hashed using MD5 and then
	98	stored in a cascade of Index and Bucketlist records. The bucketlist record
bb67b124	99	stores the actual key string and pointers to where the data records are
	100	stored. The data records themselves are one of Null, Scalar, or Reference.
	101	Null represents an I<undef>, Scalar represents a string (numbers are
	102	stringified for simplicity) and are allocated in 256byte chunks. Reference
	103	represents an array or hash reference and contains a pointer to an Index and
	104	Bucketlist cascade of its own.
	105
	106	=head2 Transactions in the keys
	107
	108	The first pass was to expand the Bucketlist sector to go from a simple key /
	109	datapointer mapping to a more complex key / transaction / datapointer mapping.
	110	Initially, I interposed a Transaction record that the bucketlist pointed to.
	111	That then contained the transaction / datapointer mapping. This had the
	112	advantage of changing nothing except for adding one new sector type and the
	113	handling for it. This was very quickly merged into the Bucketlist record to
	114	simplify the resulting code.
	115
	116	This first step got me to the point where I could pass the following test:
	117
	118	my $db1 = DBM::Deep->new( $filename );
	119	my $db2 = DBM::Deep->new( $filename );
	120
	121	$db1->{abc} = 'foo';
	122
	123	is( $db1->{abc}, 'foo' );
	124	is( $db2->{abc}, 'foo' );
	125
	126	$db1->begin_work();
	127
	128	is( $db1->{abc}, 'foo' );
	129	is( $db2->{abc}, 'foo' );
	130
	131	$db1->{abc} = 'floober';
	132
	133	is( $db1->{abc}, 'floober' );
	134	is( $db2->{abc}, 'foo' );
	135
	136	Just that much was a major accomplishment. The first pass only looked in the
	137	transaction's spot in the bucket for that key. And, that passed my first tests
	138	because I didn't check that C<$db1-E<gt>{abc}> was still 'foo' I<before>
	139	modifying it in the transaction. To pass that test, the code for retrieval
	140	needed to look first in the transaction's spot and if that spot had never been
	141	assigned to, look at the spot for the HEAD.
	142
	143	=head2 The concept of the HEAD
	144
	145	This is a concept borrowed from SVN. In SVN, the HEAD revision is the latest
	146	revision checked into the repository. When you do a ocal modification, you're
	147	doing a modification to the HEAD. Then, you choose to either check in your
	148	code (commit()) or revert (rollback()).
	149
	150	In L<DBM::Deep>, I chose to make the HEAD transaction ID 0. This has several
	151	benefits:
d7866e36	152
	153	=over 4
	154
bb67b124	155	=item * Easy identifiaction of a transaction
	156
	157	C<if ( $trans_id ) {}> will run the code iff we are in a running transaction.
	158
	159	=item * The HEAD is the first bucket
d7866e36	160
bb67b124	161	In a given bucket, the HEAD is the first datapointer because we mutliply the
	162	size of the transactional bookkeeping by the transaction ID to find the offset
	163	to seek into the file.
d7866e36	164
d7866e36	165	=back
315b34eb	166
bb67b124	167	=head2 Tracking modified buckets
	168
	169	This gets commit
	170
	171	=head2 Deleted marker
	172
	173	This gets delete within the txn and assignment within the txn
	174
	175	=head2 Freespace management
	176
	177	This keeps the file from exploding in size, but it interacts with the deleted
	178	marker.
	179
315b34eb	180	=cut