Commit | Line | Data |
315b34eb |
1 | =head0 Adding transactions to DBM::Deep |
2 | |
3 | =head1 What is DBM::Deep? |
4 | |
5 | L<DBM::Deep> is a module written completely in Perl that provides a way of |
6 | storing Perl datastructures (scalars, hashes, and arrays) on disk instead of |
7 | in memory. The datafile produced is able to be ftp'ed from one machine to |
8 | another, regardless of OS or Perl version. There are several reasons why |
9 | someone would want to do this. |
10 | |
11 | =over 4 |
12 | |
13 | =item * Transparent Persistence |
14 | |
15 | This is the ability to save a set of data structures to disk and retrieve them |
16 | later without the vast majority of the program even knowing that the data is |
17 | persisted. Furthermore, the datastructure is persisted immediately and not at |
18 | set marshalling periods. |
19 | |
20 | =item * Huge datastructures |
21 | |
22 | Normally, datastructures are limited by the size of RAM the server has. |
23 | L<DBM::Deep> allows for the size a given datastructure to be limited by disk |
24 | instead. |
25 | |
26 | =item * IPC |
27 | |
28 | While not a common use, this allows for inter-process communication without |
29 | worrying about the specifics of how a given OS handles IPC. |
30 | |
31 | =back |
32 | |
33 | And, with the release of 1.00, there is now a fourth reason - |
34 | software-transactional memory, or STM. |
35 | |
36 | =head1 What are transactions? |
37 | |
38 | Originally from the database world, a transaction is a way of isolating the |
39 | effects of a given set of actions, then applying them all at once. It's a way |
40 | of saying "I'm going to try the following steps, see if I like the result, |
41 | then I want everyone else looking at this datastore to see the effects |
42 | immediately." The most common example is taken from banking. Let's say that an |
43 | application receives a request to have Joe pay Bob five zorkmids. Without |
44 | transactions, the application would take the money from Joe's account, then |
45 | add the money to Bob's account. But, what happens if the application crashes |
46 | after debiting Joe, but before crediting Bob? The application has made money |
47 | disappear. Or, vice versa, if Bob is credited before Joe is debited, the |
48 | application has created moneyN<Yes, the US federal government does use |
49 | transactions, so this isn't the cause.>. |
50 | |
51 | With a transaction wrapping the money transfer, if the application crashes in |
52 | the middle, it's as if the action never happened. So, when the application |
53 | recovers from the crash, Joe and Bob still have the same amount of money in |
54 | their accounts as they did before and the transaction can restart and Bob can |
55 | finally receive his zorkmids. |
56 | |
57 | =head1 Why add them to DBM::Deep? |
58 | |
59 | The ability to have actions occur in either I<atomically> (as in the previous |
60 | example) or I<isolation> from the rest of the users of the data is a powerful |
61 | one. This allows for a certain amount of safety and predictability in how data |
62 | transformations occur. Imagine, for example, that you have a set of |
63 | calculations that will update various variables. However, there are some |
64 | situations that will cause you to throw away all results and start over with a |
65 | different seed. Without transactions, you would have to put everything into |
66 | temporary variables, then transfer the values when the calculations were found |
67 | to be successful. With STM, you start a transaction and do your thing within |
68 | it. If the calculations succeed, you commit. If they fail, you rollback and |
d7866e36 |
69 | try again. If you're thinking that this is very similar to how SVN or CVS |
70 | works, you're absolutely correct - SCMs are transactional in the exact same |
71 | way. |
315b34eb |
72 | |
73 | =head1 How it happened |
74 | |
d7866e36 |
75 | =head2 The backstory |
76 | |
315b34eb |
77 | This project has easily been the single most complex software endeavor I've |
d7866e36 |
78 | ever undertaken. The first step was to figure out exactly how transactions |
79 | were going to work. After several spikesN<These are throwaway coding |
80 | explorations.>, the best design seemed to be to make the keys mediate any |
81 | transactional information. This means that the entire datafile is unaware of |
82 | anything to do with transactions, except for the key's data structure. |
83 | |
bb67b124 |
84 | =head2 Test-driven development |
85 | |
86 | As transactions have so many edge cases that need to be tested for, it was |
87 | imperative that I write my tests along with my development. I subscribe to |
88 | TDD, or Test-Driven Development, particularly in my CPAN work. This means that |
89 | I try to write a test exercising a piece of functionality I<before> writing |
90 | the code providing that functionality. While I haven't always succeeded, doing |
91 | this meant that I was able to catch a lot of regressions very quickly. You'll |
92 | see this very quickly. |
93 | |
d7866e36 |
94 | =head2 DBM::Deep's file structure |
95 | |
96 | L<DBM::Deep>'s file structure is a record-based structure. The key (or array |
97 | index - arrays are just funny hashes internally) is hashed using MD5 and then |
98 | stored in a cascade of Index and Bucketlist records. The bucketlist record |
bb67b124 |
99 | stores the actual key string and pointers to where the data records are |
100 | stored. The data records themselves are one of Null, Scalar, or Reference. |
101 | Null represents an I<undef>, Scalar represents a string (numbers are |
102 | stringified for simplicity) and are allocated in 256byte chunks. Reference |
103 | represents an array or hash reference and contains a pointer to an Index and |
104 | Bucketlist cascade of its own. |
105 | |
106 | =head2 Transactions in the keys |
107 | |
108 | The first pass was to expand the Bucketlist sector to go from a simple key / |
109 | datapointer mapping to a more complex key / transaction / datapointer mapping. |
110 | Initially, I interposed a Transaction record that the bucketlist pointed to. |
111 | That then contained the transaction / datapointer mapping. This had the |
112 | advantage of changing nothing except for adding one new sector type and the |
113 | handling for it. This was very quickly merged into the Bucketlist record to |
114 | simplify the resulting code. |
115 | |
116 | This first step got me to the point where I could pass the following test: |
117 | |
118 | my $db1 = DBM::Deep->new( $filename ); |
119 | my $db2 = DBM::Deep->new( $filename ); |
120 | |
121 | $db1->{abc} = 'foo'; |
122 | |
123 | is( $db1->{abc}, 'foo' ); |
124 | is( $db2->{abc}, 'foo' ); |
125 | |
126 | $db1->begin_work(); |
127 | |
128 | is( $db1->{abc}, 'foo' ); |
129 | is( $db2->{abc}, 'foo' ); |
130 | |
131 | $db1->{abc} = 'floober'; |
132 | |
133 | is( $db1->{abc}, 'floober' ); |
134 | is( $db2->{abc}, 'foo' ); |
135 | |
136 | Just that much was a major accomplishment. The first pass only looked in the |
137 | transaction's spot in the bucket for that key. And, that passed my first tests |
138 | because I didn't check that C<$db1-E<gt>{abc}> was still 'foo' I<before> |
139 | modifying it in the transaction. To pass that test, the code for retrieval |
140 | needed to look first in the transaction's spot and if that spot had never been |
141 | assigned to, look at the spot for the HEAD. |
142 | |
143 | =head2 The concept of the HEAD |
144 | |
145 | This is a concept borrowed from SVN. In SVN, the HEAD revision is the latest |
146 | revision checked into the repository. When you do a ocal modification, you're |
147 | doing a modification to the HEAD. Then, you choose to either check in your |
148 | code (commit()) or revert (rollback()). |
149 | |
150 | In L<DBM::Deep>, I chose to make the HEAD transaction ID 0. This has several |
151 | benefits: |
d7866e36 |
152 | |
153 | =over 4 |
154 | |
bb67b124 |
155 | =item * Easy identifiaction of a transaction |
156 | |
157 | C<if ( $trans_id ) {}> will run the code iff we are in a running transaction. |
158 | |
159 | =item * The HEAD is the first bucket |
d7866e36 |
160 | |
bb67b124 |
161 | In a given bucket, the HEAD is the first datapointer because we mutliply the |
162 | size of the transactional bookkeeping by the transaction ID to find the offset |
163 | to seek into the file. |
d7866e36 |
164 | |
165 | =back |
315b34eb |
166 | |
bb67b124 |
167 | =head2 Tracking modified buckets |
168 | |
169 | This gets commit |
170 | |
171 | =head2 Deleted marker |
172 | |
173 | This gets delete within the txn and assignment within the txn |
174 | |
175 | =head2 Freespace management |
176 | |
177 | This keeps the file from exploding in size, but it interacts with the deleted |
178 | marker. |
179 | |
315b34eb |
180 | =cut |