DarknetMarketsNoobs_ML_Study
The plan
My plan is to scrape the DarknetMarketsNoobs subdread. I did not want any usernames or any other personal information; all I needed was the body of the posts. After scraping the subdread, I wanted to use topic modeling, also known as Latent Dirichlet allocation ( wikip ). From what I understand, LDA takes all the words and then tries to find a set number of topics within them. The goal is to find which topics n00bs are having issues with. Hopefully we can use that information to make guides and add that material to the DNM Bible.
Finding a way to scrape Dread
The easiest way I could think of is to use cURL. Inspect the page and go to the Network tab; you have to refresh the page before any requests show up. You should then see a bunch of requests. Click the one that matches your URL. So if I was doing DarknetMarketsNoobs, I would look for something like this:
/d/DarknetMarketsNoobs?p=1
After finding the subdread request, right click it and choose Copy. If you are on a POSIX system like me, pick "Copy as cURL (POSIX)".
That gives you the request as a cURL command. We use it to get the URLs of each thread in the DarknetMarketsNoobs subdread. To scrape the body text we have to do the same thing again, but this time on a random thread inside the subdread.
That second cURL command is what we will use to fetch the body of each post.
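Both classes below shell out to cURL through Ruby's Open3, so it may help to see that call in isolation first. This is a minimal sketch using a harmless `echo` in place of the real cURL command; note that `capture3` returns three values (stdout, stderr, and the exit status), so the status must be assigned to the third variable.

```ruby
require 'open3'

# Run a command and capture its output streams and exit status.
# `echo` stands in here for the real cURL command.
stdout, stderr, status = Open3.capture3("echo 'fake page html'")

puts stdout            # => "fake page html"
puts stderr.empty?     # => true
puts status.success?   # => true
```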
The GetUrls class
The `GetUrls` class looks like this:
```ruby
require 'open3'
require 'nokogiri'

class GetUrls
  def self.run
    i = 1
    # File.truncate('DarknetMarketsNoobs_urls.txt', 0) # uncomment to clear old output
    cmd = "the first cURL command we copied goes here"
    # capture3 returns stdout, stderr, and the exit status
    stdout, _stderr, _status = Open3.capture3(cmd)
    page = Nokogiri::HTML(stdout)
    out = []
    # Dread shows at most 20 threads per page
    until i > 20
      t = page.xpath("/html/body/div/div[2]/div[3]/div[#{i}]/div[1]/a").css('a')
      url = t.attribute('href').to_s.strip
      out << url unless url.empty?
      i += 1
    end
    File.open("DarknetMarketsNoobs_urls.txt", "a") { |file| file.write(out.join("\n")) }
  end
end
```
First the code runs the cURL command and saves its output as a string. Then it uses `nokogiri` to parse the HTML. Since only 20 threads are shown on each page, it loops until the i variable reaches 20. To find where each thread URL lives, I went back to the page, highlighted a link, and clicked Inspect; that gave me the XPath of where the thread URL should be. So basically Nokogiri parses the HTML from the cURL command and finds that node, and the attribute method pulls the URL out of its href. Next it checks that the URL is not empty, so we don't add blank lines to the output file, and appends every URL to an array named `out`.
After the i variable hits 20, the code uses the `.join` method to join all the elements of the array with newlines, then saves the result to a text file.
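The join-and-append step at the end can be seen on its own. This is a minimal sketch with a throwaway temp file and made-up URLs, not the real scraped output:

```ruby
require 'tmpdir'

out = ["/post/abc123", "/post/def456"]

Dir.mktmpdir do |dir|
  path = File.join(dir, "urls.txt")
  # "a" opens in append mode, creating the file if it does not exist
  File.open(path, "a") { |file| file.write(out.join("\n")) }
  puts File.read(path)  # one URL per line
end
```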
The GetBody class
This class fetches the contents of the posts and saves them to a file.
```ruby
require 'open3'
require 'nokogiri'

class GetBody
  def self.run
    File.readlines("DarknetMarketsNoobs_urls.txt").each do |l|
      l = l.strip
      # Paste the second cURL command here, appending the thread path to the domain:
      curl = "curl 'http://dreadytofatroptsdj6io7l3xptbet6onoyno2yv7jicoxknyazubrad.onion#{l}'"
      stdout, _stderr, _status = Open3.capture3(curl)
      page = Nokogiri::HTML(stdout)
      out = page.xpath("/html/body/div/div[2]/div[2]/div/div[2]/div").text.strip
      puts out
      File.open("fucking_Work.txt", "a") { |file| file.write(out) }
    end
  end
end
```
The LDA code
This code lowercases the text, removes punctuation, and filters out common stop words that carry no signal, like: your, the, there. Doing this improves the results.
```ruby
require 'lda-ruby'
require 'lingua/stemmer'
require 'textoken'
require 'stopwords'
require 'json'

corpus = Lda::Corpus.new
f = Stopwords::Snowball::Filter.new "en"

File.readlines("fucking_Work.txt").each do |k|
  k = k.downcase
  # tokenize and drop punctuation
  t = Textoken(k, exclude: 'punctuations').tokens
  # remove English stop words
  ll = f.filter(t)
  l = Lda::TextDocument.new(corpus, ll.join(","))
  corpus.add_document(l)
end

lda = Lda::Lda.new(corpus)
lda.num_topics = 15
lda.em('random')
topics = lda.top_words
puts topics
```
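If you don't want to pull in the `textoken` and `stopwords` gems, the same preprocessing can be approximated with plain Ruby. This is a rough stdlib-only sketch with a tiny hand-picked stop word list, not the gems' actual behavior:

```ruby
# A tiny, hand-picked stop word list for illustration only;
# the Snowball list used above is much larger.
STOP_WORDS = %w[the a an your there is are to of and i].freeze

def preprocess(line)
  line.downcase
      .gsub(/[^a-z0-9\s]/, " ")   # strip punctuation
      .split                      # tokenize on whitespace
      .reject { |w| STOP_WORDS.include?(w) }
end

puts preprocess("Is there a guide to PGP, or the DNM Bible?").inspect
# => ["guide", "pgp", "or", "dnm", "bible"]
```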
The Data
```
{0=>["mail", "carrier", "route", "order", "proper", "privacy", "addresses", "original", "wallet", "tails"],
 1=>["key", "pgp", "bible", "dnm", "public", "set", "send", "proxy", "access", "tor"],
 2=>["mail", "withdraw", "agency", "classes", "scammed", "open", "ban", "dispute", "support", "place"],
 3=>["tails", "send", "wallet", "vendor", "package", "anonymous", "btc", "feather", "open", "binance"],
 4=>["money", "os", "host", "questions", "due", "personal", "people", "companies", "protection", "federal"],
 5=>["mail", "carrier", "storage", "order", "posts", "ship", "sorting", "started", "veracrypt", "usb"],
 6=>["unconfirmed", "small", "good", "opened", "shows", "application", "easily", "compile", "decided", "pinned"],
 7=>["click", "key", "order", "onion", "harm", "time", "give", "shipping", "pgp", "gpg"],
 8=>["im", "tails", "cashapp", "safe", "ensure", "bastards", "wanting", "listed", "open", "tryna"],
 9=>["wallet", "encrypt", "key", "message", "feather", "send", "public", "file", "sign", "im"],
 10=>["mail", "address", "carrier", "purchase", "local", "change", "cash", "page", "card", "im"],
 11=>["wondering", "order", "packages", "safer", "po", "ledger", "anonymous", "live", "sending", "aware"],
 12=>["vendors", "people", "secure", "market", "xmr", "dont", "block", "learn", "cards", "stuff"],
 13=>["wallet", "btc", "convert", "xmr", "send", "electrum", "exchange", "small", "order", "coinbase"],
 14=>["bitcoin", "whm", "address", "vendor", "thc", "system", "put", "correct", "process", "markets"]}
```
REPORTS
- Jan 30th 2021 Report
- Feb 3rd 2021 Report
- March 5th 2021 Report
- March 6th 2021 Report
- March 7th 2021 Report
- March 10th 2021 Report