Coding Challenge #97 - BitcaskThis challenge is to build your own Bitcask storage engine and learn about log-structured hash tables.
Hi this is John with this week’s Coding Challenge. 🙏 Thank you for being one of the 89,613 software developers who have subscribed, I’m honoured to have you as a reader. 🎉 If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧 Coding Challenge #97 - BitcaskThis challenge is to build your own Bitcask. Bitcask is a log-structured hash table for fast key/value data storage and retrieval. Bitcask is the default storage backed for the Riak key-value store. The creators of Riak describe Bitcask as “A log-structured hash table for fast key/value data”. What does that mean? It means that Bitcask writes data to an append-only log file and uses an in-memory hash table to keep track of the positions of its log entries. They chose this approach as it provides high read/write throughput and is relatively easy to implement and reason about. Their specific technical goals were:
You can read more about Bitcask in the Bitcask paper. 🚨Would You Like To Learn Go By Building Coding Challenges! 🚨I’m running the Coding Challenges Learn Go By Building Live Course again in September. It is a live course that runs for three working weeks from September the 15th to the October the 3rd. During the course I’ll introduce you to ever aspect of Go that you need to build the following five real-world projects (based off five of my coding challenges): 🏗️ cat - By building cat you learn how to build and run command line programs in Go. 🏗️ sort - By building sort you learn how to use Go's data structures and control flow to implement sort. 🏗️ curl - By building curl you learn how to write network clients in Go. 🏗️ wc - By building wc you learn how to process text data and handle locales with Go. 🏗️ Memcached (Capstone Project) - By building a Memcached server clone you learn how to build efficient network servers in Go. Having built these five real-world applications you will be well equipped to take on new projects in Go! Check out the Coding Challenges Learn Go By Building Live Course for full details and to sign up. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It
The Challenge - Building A Bitcask ImplementationIn this coding challenge you’re going to implement Bitcask as a library and build a command line tool that allows you to set, get and delete keys as well as initiating a merge. If you’re wondering a merge is the term they chose to represent compacting the Bitcask log files. A Bitcast database is simply a directory and the files in it. That makes it nice and easy to back and restore, simply copy the directory backup and back from the backup to restore. Only one Bitcask process (the database server) can be accessing the database at a time. Within the directory only one of the files is active at a time. The current log file is the active file. When the active file reaches a specific file size it is closed and a new active file created. All files other than the active file are consider immutable. Step ZeroIn this step you’re going to set your environment up ready to begin developing and testing your solution. Please setup your IDE / editor of choice and programming language of choice for building command line tools. If you’re feeling adventurous, plan to add some networking - it seems many people commonly implement the ability to set and get keys using the Redis Serialisation Protocol (RESP) to build a server. You can learn more about RESP in the build your own Redis server coding challenge. Step 1In this step your goal is to create a simple command line tool that will allow the user to set and get values from the hash table. You should write the hash table out to file after every update, overwriting all the existing entries. On startup your code should read the file and create an in memory datastore. The file you write should be simple, looking something like this:
It’s a placeholder to get you started, you’ll evolve the design as you progress through the steps. Having completed this step, your tool should support something like this:
Here the Step 2In this step your goal is to begin writing the values to the log file. The log file is an append-only file (AOF) where each entry is written in the following format: The first field is the Cyclic Redundancy Check (CRC). If you’re not familiar with CRCs they’re error-detection codes often used in networking and storage devices to detect accidental or erroneous changes to data. Bitcask uses CRCs for precisely this reason, to check on startup that a record is valid (i.e. it’s not a corrupt partial record created during a crash). It’s not clear from the paper, but I believe the CRC used is CRC-32, which you can learn about on the CRC page of Wikipedia. Wikipedia also has an example of the CRC-32 algorithm on the computation of CRCs page. To calculate the CRC, first pack the other fields into an array of bytes*, then calculate the CRC using that array of bytes as an input. Once you have the CRC write the CRC followed by the array into the log file. As the log file is append only, each new record should be written at the end of the file. *Again it’s not specified in the document, but I believe the bytes are most likely to be stored in little endian. If, having written the new entry to the file the file size exceeds a preconfigured size, the file should be closed and a new one opened. For testing you might like to make the size quite low, i.e. a few hundred bytes. For a ‘production’ system it should be considerably larger. By default Riak uses 2GB. You could use TDD to check the code you’re building does the right thing, or dig into the produced files with a tool like Step 3In this step your goal is to update the hash table to refer to the log file. When Bitcask does this it stores some metadata in the hash table instead of the value. That metadata looks like this: To complete this step, update your Bitcask implementation to write the metadata when setting values. Step 4In this step your goal is to use the metadata to read the values from the log file. The metadata is used to look up the value entry for a key in the relevant log file, the Test your code by doing what we did earlier:
Be sure to set enough keys to spread the values over multiple files, then to test that the key is fetched. If you haven’t already, make sure that when a key is updated a new log file entry is written and the metadata is updated to reflect the new entry. Step 5In this step your goal is to implement deletion of keys. Bitcask handles deletion by writing a special tombstone value to the entry in the hash table and on disk. As far as I can tell this was done by setting the timestamp to the removal time, the value position, value size and file id to the max value for each type. To complete this step, implement support for a delete command, and update the code to set the tombstone value for the supplied key. Then update the get command not return any deleted key. For example:
Step 6In this step your goal is to implement the merge functionality of Bitcask. Merging, as Bitcask calls it, is the process of tidying up log files that no longer contain any live data values. Without this process the server’s disk would fill up and eventually run out of space. Bitcask does the merging in a background task. This task iterates over all the non-active files and produces a new set of files containing only the live version of each key. As an aid to rapid startup it also creates a hint file alongside each of these files. The hint file contains the timestamp, key size, value size, value position and key of each item in the file. Reading the hint file will allow a new Bitcask server to rapidly rebuild the hash table on startup, providing a fast startup/recovery. Once a file has been processed by the merge task it can be deleted, freeing up disk space and compacting the storage used. You now have the necessary bits in place to implement the rebuilding of the hash table on startup. You can use the CRC to verify the validity of each record in the active file, discarding any that were incomplete/corrupted by a previous shutdown/crash. Going FurtherIf you’d like to take this project further, consider making a network server for your Bitcask database and support a protocol such as RESP for getting, setting and deleting keys. Two Other Ways I Can Help You:
Share Your Solutions!If you think your solution is an example other developers can learn from please share it, put it on GitHub, GitLab or elsewhere. Then let me know via Bluesky or LinkedIn or just post about it there and tag me. Alternately please add a link to it in the Coding Challenges Shared Solutions Github repo Request for FeedbackI’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me, might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback greatly appreciated. You can reach me on Bluesky, LinkedIn or through SubStack Thanks and happy coding! John Invite your friends and earn rewardsIf you enjoy Coding Challenges, share it with your friends and earn rewards when they subscribe. |
Don't miss a thing Confirm your subscription Hi there, Thanks for subscribing to fitgirl-repacks.site! To get you up and running, please confirm your email address by clicking below. This will set you up with a WordPress.com account you can use to manage your subscription preferences. By clicking "confirm email," you agree to the Terms of Service and have read the Privacy Policy . Confirm email ...
Comments
Post a Comment