librsync 2.0.2 TODO
* Fix symbol names:

  * Rename all symbols that are intended to be private to `rs__`.

  * Rename those that don't match either prefix.

* We have a few functions to do with reading a netint, stashing
  it somewhere, then moving into a different state. Is it worth
  writing generic functions for that, or would it be too confusing?

* Duplicate block handling. Currently duplicate blocks are included
  in the signature, but we only put the first duplicate block in the
  hashtable, so the delta only includes references to the first
  block. This can result in sub-optimal copy commands, breaking
  single large copies with duplicate blocks into multiple copies
  referencing the earlier copy of the block. However, this could
  also make patching use the disk cache more effectively. This
  solution is probably fine, particularly given how small copy
  instructions are, but there might be ways to improve copy
  commands for long runs of duplicate blocks.

* Optimisations and code cleanups:

  * scoop.c: Scoop needs a major refactor. Perhaps the API needs
    tweaking?

  * rsync.h: Should rs_buffers_s and rs_buffers_t be one typedef?

  * Just how useful is rs_job_drive anyway?

  * mdfour.c: This code has a different API to the RSA code in libmd
    and is coupled with librsync in unhealthy ways (trace?).
    Recommend changing to the RSA API?

* Don't use the rs_buffers_t structure.

  There's something confusing about the existence of this structure.
  In part it may be the name. I think people expect that it will be
  something that behaves like a FILE* or C++ stream, and it really
  does not. Also, the structure does not behave as an object: it's
  really just a shorthand for passing values in to the encoding
  routines, and so does not have much identity of its own.

  An alternative might be

      result = rs_job_iter(job,
                           in_buf, &in_len, in_is_ending,
                           out_buf, &out_len);

  where we update the length parameters on return to show how much
  we really consumed. (A caller's-side sketch of this convention
  appears after this list.)

  One technicality here will be to restructure the code so that the
  input buffers are passed down to the scoop/tube functions that
  need them, which are relatively deeply embedded. I guess we could
  just stick them into the job structure, which is becoming a kind
  of catch-all "environment" for poor C programmers.

* Meta-programming:

  * Plot lengths of each function.

  * Some kind of statistics on delta each day.

* Encoding format:

  * Include a version in the signature and difference fields.

  * Remember to update them if we ever ship a buggy version (nah!)
    so that other parties can know not to trust the encoded data.

* Abstract encoding.

  In fact, we can vary several different variables:

  * what signature format we are using

  * what command protocol we are using

  * what search algorithm we are using

  * what implementation version we are

  Some are more likely to change than others. We need a chart
  showing which source files depend on which variable.
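As promised above, here is a caller's-side sketch of the proposed
rs_job_iter calling convention. The prototype below is the
*proposal*, not the current librsync API; the stand-in types and the
driver loop are illustrative only.

    #include <stdio.h>
    #include <stddef.h>

    /* Local stand-ins so the sketch is self-contained; the real
     * types come from <librsync.h>. */
    typedef struct rs_job rs_job_t;                 /* opaque job state */
    typedef enum { RS_DONE, RS_BLOCKED } rs_result; /* subset of rs_result */

    /* The proposed signature: on return, *in_len holds how much
     * input was actually consumed and *out_len how much output was
     * produced. */
    rs_result rs_job_iter(rs_job_t *job,
                          const char *in_buf, size_t *in_len,
                          int in_is_ending,
                          char *out_buf, size_t *out_len);

    /* An illustrative driver loop. For brevity it assumes each call
     * consumes all of its input; a real driver would carry any
     * unconsumed bytes over to the next call. */
    rs_result drive(rs_job_t *job, FILE *in_file, FILE *out_file)
    {
        char in_buf[4096], out_buf[4096];
        rs_result result;

        do {
            size_t in_len = fread(in_buf, 1, sizeof in_buf, in_file);
            size_t out_len = sizeof out_buf;

            result = rs_job_iter(job, in_buf, &in_len, feof(in_file),
                                 out_buf, &out_len);
            fwrite(out_buf, 1, out_len, out_file);
        } while (result == RS_BLOCKED);

        return result;
    }

One attraction of this shape is that the buffer bookkeeping stays in
the caller's stack frame, rather than in a structure that looks like
it ought to have identity but doesn't.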
* Encoding implementation:

  * Join up signature commands.

* Encoding algorithm:

  * Self-referential copy commands.

    Suppose we have a file with repeating blocks. The gdiff format
    allows COPY commands to extend into the *output* file so that
    they can easily point this out. By doing this, they get
    compression as well as differencing.

    It'd be pretty simple to implement this, I think: as we produce
    output, we'd also generate checksums (using the search block
    size) and add them to the sum set. Then matches would fall out
    automatically, although we might have to make special allowance
    for short blocks.

    However, I don't see many files which have repeated 1kB chunks,
    so I don't know if it would be worthwhile.

  * Extended files.

    Suppose the new file just has data added to the end. At the
    moment, we'll match everything but the last block of the old
    file. It won't match, because at the moment the search block
    size is only reduced at the end of the *new* file. This is a
    little inefficient, because ideally we'd know to look for the
    last block using the shortened length.

    This is a little hard to implement, though perhaps not
    impossible. The current rolling search algorithm can only look
    for one block size at a time. Can we do better? Can we look
    for all block lengths that could match anything?

    Remember also that at the moment we don't send the block length
    in the signature; it's implied by the length of the new block
    that it matches. This is kind of cute, and importantly helps
    reduce the length of the signature.

  * State-machine searching.

    Building a state machine from a regular expression is a
    brilliant idea. (I think *The Practice of Programming* walks
    through the construction of this at a fairly simple level.)

    In particular, we can search for any of a large number of
    alternatives in a very efficient way, with much less effort
    than it would take to search for each one the hard way.
    Remember also the string-searching algorithms and how much time
    they can take.

    I wonder if we can use similar principles here, rather than the
    current simple rolling-sum mechanism (sketched after this
    list)? Could it let us match variable-length signatures?

* Support gzip compression of the difference stream. Does this
  belong here, or should it be in the client, with librsync just
  providing an interface that lets it cleanly plug in?

  I think if we're going to do just plain gzip, rather than
  rsync-gzip, then it might as well be external.

* rsync-gzip: preload with the omitted text so as to get better
  compression. Abo thinks this gets significantly better
  compression. On the other hand, we would have to import and
  maintain our own zlib fork, at least until we can persuade the
  upstream to take the necessary patch. Can that be done?

  abo says:

    It does get better compression, but at a price. I actually
    think that getting the code to a point where a feature like
    this can be easily added or removed is more important than the
    feature itself. Having generic pre- and post-processing layers
    for hit/miss data would be useful. I would not like to see it
    added at all if it tangled and complicated the code.

    It also doesn't require a modified zlib... pysync uses the
    standard zlib to do it by compressing the data, then throwing
    it away. I don't know how much benefit the rsync modifications
    to zlib actually give, but if I was implementing it I would
    stick to a stock zlib until it proved significantly better to
    go with the fork.
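Since two items above turn on what the rolling search can and cannot
do, here is a minimal sketch of the classic rsync-style two-part
rolling sum that kind of search is built on. librsync's real
implementation (rollsum.c) differs in details, such as adding a
character offset to each byte, so treat this as illustrative:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint32_t a, b;   /* a = sum of bytes, b = sum of running a's */
        size_t   count;  /* window (block) length */
    } rollsum_t;

    /* Compute the sum over an initial window of len bytes. */
    static void rollsum_init(rollsum_t *s, const uint8_t *buf, size_t len)
    {
        s->a = s->b = 0;
        s->count = len;
        for (size_t i = 0; i < len; i++) {
            s->a += buf[i];
            s->b += (uint32_t)(len - i) * buf[i];
        }
    }

    /* Slide the window one byte: drop 'out', take in 'in'. */
    static void rollsum_rotate(rollsum_t *s, uint8_t out, uint8_t in)
    {
        s->a += in - out;
        s->b += s->a - (uint32_t)s->count * out;
    }

    /* Pack both halves into one 32-bit weak sum. */
    static uint32_t rollsum_digest(const rollsum_t *s)
    {
        return (s->b << 16) | (s->a & 0xffff);
    }

The attraction is that rollsum_rotate() costs O(1), so the weak sum
can be probed at every byte offset of the input; any state-machine
or multi-block-length replacement would need a comparably cheap
slide to compete.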
* Licensing.

  Will the GNU Lesser GPL work? Specifically, will it be a problem
  in distributing this with Mozilla or Apache?

* Checksums:

  * Do we really need to require that signatures arrive after the
    data they describe? Does it make sense in HTTP to resume an
    interrupted transfer?

    I hope we can do this. If we can't, however, then we should
    relax this constraint and allow signatures to arrive before the
    data they describe. (Really? Do we care?)

  * Allow variable-length checksums in the signature; the signature
    will have to describe the length of the sums, and we must
    compare them taking this into account.

* Testing:

  * Just more testing in general.

  * Test broken pipes and that IO errors are handled properly.

  * Test files >2GB and >4GB. Presumably these must be done in
    streams so that the disk requirements to run the test suite are
    not too ridiculous. I wonder if it will take too long to run
    these tests? Probably, but perhaps we can afford to run just
    one carefully chosen test.

  * Fuzz instruction streams.
    <https://code.google.com/p/american-fuzzy-lop/>?

  * Generate random data; do random mutations.

  * Try different block lengths.

  * Tests should fail if they can't find their inputs, or have zero
    inputs: at present they tend to succeed by default.

  * Test varying strong-sum inputs: default, short, long.

* Security audit:

  * If this code were to read differences or sums from random
    machines on the network, then it would be a security boundary.
    Make sure that corrupt input data can't make the program crash
    or misbehave.

* Long files:

  * How do we handle the large signatures required to support large
    files? In particular, how do we choose an appropriate block
    size when the length is unknown? Perhaps we should allow a way
    for the signature to scale up as it grows. (A common block-size
    heuristic is sketched after this list.)

  * Perhaps make extracted signatures still be wrapped in commands.
    What would this lead to?

  * We'd know how much signature data we expect to read, rather
    than requiring it to be terminated by the caller.
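On the block-size question above, one common heuristic (the one
rsync itself uses) is to scale the block length with roughly the
square root of the file length, so that the number of signature
entries also grows only as the square root of the file size. A
minimal sketch, with illustrative constants that are not librsync's
actual defaults; it also assumes the length is known, which is
exactly what the unknown-length case lacks:

    #include <math.h>
    #include <stddef.h>

    /* Illustrative bounds, not librsync's defaults. */
    #define MIN_BLOCK_LEN 256
    #define MAX_BLOCK_LEN (1 << 17)

    /* Pick a block length that grows as the square root of the file
     * length. For a stream of unknown length we'd have to start
     * from a guess and, as suggested above, let the signature scale
     * up as it grows. */
    static size_t pick_block_len(long long file_len)
    {
        size_t len = (size_t)sqrt((double)file_len);

        if (len < MIN_BLOCK_LEN)
            len = MIN_BLOCK_LEN;
        if (len > MAX_BLOCK_LEN)
            len = MAX_BLOCK_LEN;
        /* Round down to a multiple of 64 to keep blocks aligned. */
        return len & ~(size_t)63;
    }

With this rule, a 1GB file gets roughly 32kB blocks and therefore
about 32k signature entries, instead of the million entries a fixed
1kB block size would produce.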