Large language models are all the rage these days, and new ones are popping up every other day. Most of these linguistic behemoths, including the models behind OpenAI’s ChatGPT and Google’s Bard, are trained on text data from all over the internet: websites, articles, books, you name it. This means that their output is a mixed bag. But what if, instead of the open web, LLMs were trained on the dark web? Researchers have done just that with DarkBERT, with some surprising results. Let’s take a look.
What is DarkBERT?
A team of South Korean researchers has released a paper detailing how they built an LLM on a large-scale dark web corpus collected by crawling the Tor network. The data included a host of shady sites from various categories, including cryptocurrency, pornography, hacking, weaponry, and others. However, due to ethical concerns, the team did not use the data as is. To ensure that the model wasn’t trained on sensitive data that bad actors could later extract from it, the researchers filtered the pre-training corpus before feeding it to DarkBERT.
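To give a rough idea of what filtering a crawled corpus can involve, here is a minimal Python sketch. It is an illustration only and not the researchers' actual pipeline: the function names `mask_sensitive` and `filter_corpus`, the two patterns, the placeholder tokens, and the `min_words` threshold are all assumptions made for this example.

```python
import re

# Hypothetical patterns for this illustration; not taken from the DarkBERT paper.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def mask_sensitive(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = IPV4.sub("[IP]", text)
    return text


def filter_corpus(pages, min_words=5):
    """Mask identifiers, then drop very short pages and exact duplicates."""
    seen = set()
    cleaned = []
    for page in pages:
        masked = mask_sensitive(page.strip())
        if len(masked.split()) < min_words:
            continue  # too little text to be useful (assumed threshold)
        if masked in seen:
            continue  # exact duplicate of a page already kept
        seen.add(masked)
        cleaned.append(masked)
    return cleaned
```

Calling `filter_corpus` on a list of page texts returns the masked, de-duplicated pages in their original order. A real pipeline would likely need many more patterns and some category-level filtering, which this sketch leaves out.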
If you are wondering about the rationale behind the name DarkBERT, the LLM is based on the RoBERTa architecture, which is a transformer-based model developed back in 2019 by researchers at Facebook.
Facebook, now Meta, described RoBERTa as a “robustly optimized method for pre-training natural language processing (NLP) systems” that improves upon BERT, which Google released in 2018. Once Google made BERT open-source, Facebook’s researchers were able to improve its performance.
Cut to the present: the Korean researchers have taken the model a step further by feeding it data from the dark web over the course of 15 days, eventually arriving at DarkBERT. The research paper notes that a machine with an Intel Xeon Gold 6348 CPU and four NVIDIA A100 80GB GPUs was used for the purpose.
Read the full article at: indianexpress.com