Coding Challenge #98 - Spelling Correction ToolThis challenge is to build your own spelling correction tool and learn about spelling correction algorithms and approaches.
Hi this is John with this week’s Coding Challenge. 🙏 Thank you for being one of the 89,989 software developers who have subscribed, I’m honoured to have you as a reader. 🎉 If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧 Coding Challenge #98 - Spelling Correction ToolThis challenge is to build your own spelling correction tool. We all probably use a tool with spelling correction built into it daily. Google offers spelling corrections for our searches, Google Docs, Notion or MS Office let us know when we might have spelt a word wrong as do many of the IDEs we use to write code. Despite that, surprising few software engineers know how they work, which is a shame as being able to detect and correct spelling errors is incredibly useful for most of us. If you’re building a website that allows users to enter data a spelling checker could improve the quality of the user experience or, improve the quality of the data you capture from users. If you’re working in data science, machine learning or the development and deployment of AI being able to detect and correct spelling errors can both improve the user experience and the quality and cleanliness of the data you capture to respond to or train upon. Personally I’ve used spelling error detection and correct techniques to sanitise, and homogenise datasets from multiple sources that all contain the typical problems you see in user entered data:
Alternately just do this project to have fun learning some new real-world algorithms! 📌 Next Systems Programming (Redis) Course Starts 20th OctoberWould you like to build a network server from scratch with me? Learning about network programming, concurrency, testing, and systems software development? If so check out my course: Build A Redis Server Clone: Master Systems Programming Through Practice. It is designed to be intense! It’s 11 hours of instructor time over two weeks. With the goal of having you build a clone of the original Redis server by the end of the two weeks. If you sign up before 6th October you can get $100 off using the code: EARLYBRO25 If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It
The Challenge - Building A Spelling Correction ToolIn this coding challenge you’re going to implement a tool to suggest spelling corrections. For the sake of demonstration / testing the challenge I suggest building a command line tool. You might like to consider building your solution in two parts:
Then you’ll be able to reuse your solution in future projects. Step ZeroCoding Challenges indexes from zero! In this first step your goal is to setup a new project with your chosen programming language and development environment and then get some training data. For the training data we want a set of words and the frequency they occur in a large body of text. You can get that by either scraping the data from a public data source directly and calculating it yourself or by accessing one of the public word frequency data sets. For my solution I went with the English Word Frequency dataset from Kaggle. If you’re working in another language check your favourite search engine for alternate datasets (or scrape your own). Step 1In this step your goal is to parse the training data you’ve collected and create a word frequency table within your application. If you’ve collected the data yourself you may need to clean it too. Step 2In this step your goal is to check if the provided word is in your word frequency table. If it is your program should report it as being correctly spelled. Step 3In this step your goal is to look for the correct spelling of a word, assuming that there is just one character of the word that is incorrect or missing. In other words you want to generate a list of all the ‘words’ that are:
Bonus exercise: calculate the number of words that will be just one edit away from the provided word using Big O notation where n is the length of the word. Once you have calculated all these “words” check if they are in the word frequency table, if not, then we can assume they’re not valid words. If they are, then suggest the most frequently occurring word as the correct spelling. Step 4In this step your goal is to expand your spelling corrector to consider words that have a Levenshtein distance of two from the supplied input word (in Step 3 we were considering words that had a Levenshtein distance of one). The Levenshtein distance is the number of edits required to transform one word into another. You won’t need to calculate the Levenshtein distance for this Coding Challenge, but it’s a useful concept to know about if you’re ever faced with cleaning data or matching disparate data sets that have errors from human data entry. I’ve found it particularly useful for consolidating address data. Once you have the words with a distance of two, again check if they exist in the word frequency table and if so, suggest the one that occurs with the highest frequency as the corrected spelling. Do keep in mind that the correct spelling is likely to be the highest frequency word from the set of words with the lowest Levenshtein distance. Step 5In this optional step your goal is to track the time taken to suggest corrections to a list of words and present the correction and then calculate the time take and words per second, i.e.:
Going FurtherIf you want to take this further you could explore building your own Bloom filter and using it to speed up determining if a word is in the list of known words. Beyond that you could explore providing corrections with a Levenshtein distance of greater than two or adding support for more complex approaches to spelling correction, i.e.:
P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It
Share Your Solutions!If you think your solution is an example other developers can learn from please share it, put it on GitHub, GitLab or elsewhere. Then let me know via Bluesky or LinkedIn or just post about it there and tag me. Alternately please add a link to it in the Coding Challenges Shared Solutions Github repo Request for FeedbackI’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me, might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback greatly appreciated. You can reach me on Bluesky, LinkedIn or through SubStack Thanks and happy coding! John You're currently a free subscriber to Coding Challenges. For the full experience, upgrade your subscription. |
Don't miss a thing Confirm your subscription Hi there, Thanks for subscribing to fitgirl-repacks.site! To get you up and running, please confirm your email address by clicking below. This will set you up with a WordPress.com account you can use to manage your subscription preferences. By clicking "confirm email," you agree to the Terms of Service and have read the Privacy Policy . Confirm email ...
Comments
Post a Comment