
Copying a LOT of files

Submitted by nk on Fri, 2008-09-19 19:16

I am currently backing up the media server of NowPublic.com. We are talking millions of files and terabytes of data, which makes this a rather interesting task. If you try to "tar" the files together and then copy the archive, that won't lead to anything good: just tarring them up takes days. Not to mention the total lack of feedback... strace -p pid, where pid belongs to the tarring process, does not count as feedback in my book. Starting rsync to copy to somewhere else eats all physical RAM, then all the swap, then crashes -- and that was just the building-the-file-list part (and yes, I tried both the latest 2.x and 3.x versions). As the files are hashed into 4096 directories, one could try copying them out in chunks; that won't take more than 6-8 months... In short, anything that pushes from the media server just does not want to work.

So we went to Amazon AWS, started an EC2 instance, tried a few PHP frameworks, and set out to pull instead: download the files from the (secondary) media server and upload them to S3. First we tried SQS -- queue the millions of file names, have every worker process pick one from the queue, do the download-upload, rinse and repeat. Along the way I found that the only even remotely decent Amazon AWS toolkit for PHP is http://tarzan-aws.com . But even that fails more often than it works.

On the second try, I created the file list from MySQL. We keep the list of files in the database -- I suspect others do that too. Running a SELECT ... INTO OUTFILE took a few seconds; doing the same on the file system level, actually listing the files, takes forever. Then I used the Unix utility "split" to cut the list into small files of 1000 lines each, and tied them to the s3sync package with some small glue scripts, which I have copy-pasted below. This s3sync package does something that none of the PHP packages I tried do -- it catches errors properly and then sleeps 30 seconds before retrying. I do not know why this is necessary, but it is a fact: it works where nothing else did.
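For reference, the list-building part boils down to two commands, something like this (the database, table, column and path names here are placeholders, and the split flags are just one way to get chunk names that glob('x0*') in the PHP script will match):

#!/bin/bash
# Dump the file list straight from MySQL (seconds) instead of walking the
# filesystem (forever). Database, table and column names are placeholders.
mysql -e "SELECT filepath FROM files INTO OUTFILE '/tmp/filelist.txt'" mydb

# Cut the list into 1000-line chunks; -d -a 4 gives numeric names like
# x0000, x0001, ... which the worker script below claims via glob('x0*').
cd /mnt/work
split -d -a 4 -l 1000 /tmp/filelist.txt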
Here is my PHP script:

<?php
// Claim exactly one chunk of the split file list by renaming it into a
// unique per-process directory; rename() is atomic, so concurrent workers
// never grab the same chunk.
$mydir = uniqid(getmypid(), TRUE);
mkdir($mydir);
$claimed = FALSE;
foreach (glob('x0*') as $file) {
  // @ silences the warning when another worker wins the race for this chunk.
  if (@rename($file, "$mydir/$file")) {
    $claimed = TRUE;
    break;
  }
}
if (!$claimed) {
  // Nothing left to do.
  exit;
}
chdir($mydir);
// Mirror every path listed in the chunk from the media server,
// recreating the directory structure locally (-x).
exec("wget -x -q -B http://example.com -i $file");
// Walk everything wget saved under example.com/, push each file to S3
// under its original path, then delete the local copy.
$it = new RecursiveDirectoryIterator('example.com');
foreach (new RecursiveIteratorIterator($it) as $file) {
  if (!$file->isFile()) {
    continue;
  }
  $s3_file = substr($file, strlen('example.com') + 1);
  exec("/root/s3sync/s3cmd.rb put bucket:$s3_file $file");
  echo "/root/s3sync/s3cmd.rb put bucket:$s3_file $file\n";
  unlink($file);
}
?>
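For illustration, the catch-errors-and-sleep behaviour that makes s3sync usable amounts to something like the following wrapper around the s3cmd.rb call -- only a sketch, since s3sync already does this internally and the retry count here is arbitrary:

#!/bin/bash
# put-retry.sh: retry an s3cmd.rb put up to five times, sleeping 30 seconds
# between attempts, roughly what s3sync does on its own when an upload fails.
for attempt in 1 2 3 4 5
do
  /root/s3sync/s3cmd.rb put "$@" && exit 0
  sleep 30
done
exit 1

It would be called as put-retry.sh bucket:$s3_file $file in place of the direct s3cmd.rb line above.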

And my shell script:

#!/bin/bash
# Keep a fixed number of workers going; each process.php run claims one
# chunk, downloads it from the media server, uploads it to S3, then exits.
# Timings are appended to timer.txt so the sleep below can be tuned.
WORKERS=10
while true
do
  for i in $(seq $WORKERS)
  do
    nohup time -a -o timer.txt php process.php &
  done
  sleep 3000
done

Where 3000 is derived from timer.txt -- it currently takes 50 minutes to run 10 of these, and running 30 does not really raise the throughput. However, most of the time is spent uploading, so I can just start more EC2 instances, move some of the split files there, and run a healthy number of instances before the media server gets saturated. I get a progress report in nohup.out. If one of the processes stalls, it's not a biggie: more will be started on the next round, and you can always clean up later.
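Handing work to another instance is little more than setting some unclaimed chunk files aside and copying them over -- roughly like this, with made-up host, key and directory names:

#!/bin/bash
# Set a batch of chunk files aside so local workers stop claiming them,
# then copy the batch to a second instance running the same loop.
# Host, key and directory names are placeholders.
mkdir handoff
mv x01* handoff/
scp -i /root/ec2-key.pem handoff/* root@second-instance:/mnt/work/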
