# DarknetMarketsNoobs_ML_Study

## The plan

My plan is to try to scrape DarknetMarketsNoobs's subdread. I did not want any usernames or other personal information; all I needed was the body of the posts. After scraping the subdread, I wanted to use topic modeling, also known as [Latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). From what I understand, LDA takes all the words and then tries to find a fixed number of topics within them (I ended up running it with 15). The goal is to find which topics n00bs are having issues with. Hopefully we can use that information to write guides and add that material to the DNM Bible.

## Finding a way to scrape Dread

The easiest way I could think of is to use cURL. Inspect the page and go to the Network tab; you have to refresh the page before any requests show up. You should then see a bunch of requests. Click the one that matches your URL. If I were doing DarknetMarketsNoobs, I would look for something like this:

`/d/DarknetMarketsNoobs?p=1`

After finding the subdread request, right-click it and open the Copy menu. If you are on a POSIX system like me, click "Copy as cURL (POSIX)".

That gives you the request as a cURL command. We use this first command to get the URLs of each thread in the DarknetMarketsNoobs subdread. To scrape the body of the text, we have to do the same thing again, but this time on one of the threads inside the subdread.

That second command is what we use to get the body text of the posts, which is the data we actually need.
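For reference, the copied command ends up as one long curl call carrying your session headers. Here is a rough sketch of its shape (the cookie is a placeholder, and the request has to be routed through Tor, for example with curl's `-x socks5h://` option):

```ruby
require 'open3'

# Placeholder sketch of a "Copy as cURL (POSIX)" command; the real one
# carries your own User-Agent, Cookie, and other session headers.
cmd = "curl 'http://dreadytofatroptsdj6io7l3xptbet6onoyno2yv7jicoxknyazubrad.onion/d/DarknetMarketsNoobs?p=1' " \
      "-x 'socks5h://127.0.0.1:9050' " \
      "-H 'Cookie: <your session cookie>'"

# html now holds the page source for Nokogiri to parse
html, _stderr, _status = Open3.capture3(cmd)
```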

## The GetUrls class

The GetUrls class looks like this:

```ruby
require 'open3'
require 'nokogiri'

class GetUrls
    def self.run
        i = 1 # XPath indices start at 1, so rows run from div[1] to div[20]
        #File.truncate('DarknetMarketsNoobs_urls.txt', 0)
        cmd = "the first cURL we got goes here."
        # run the cURL command and capture its output as a string
        stdout, _stderr, status = Open3.capture3(cmd)
        page = Nokogiri::HTML(stdout)
        out  = []
        # Dread only shows 20 threads per page
        until i > 20
            t   = page.xpath("/html/body/div/div[2]/div[3]/div[#{i}]/div[1]/a").css('a')
            url = t.attribute('href').to_s.strip
            out << url unless url.empty? # skip rows with no link so we don't write blank lines
            i += 1
        end
        File.open("DarknetMarketsNoobs_urls.txt", "a") { |file| file.write(out.join("\n")) }
    end
end
```

First, the code runs the cURL command and saves the output as a string. Then it uses Nokogiri to parse the HTML. Since only 20 threads are shown per subdread page, it loops until the i variable passes 20. To find where each URL lives, I went back to the page, highlighted a thread link, and clicked Inspect; that gave me the XPath of the element holding the thread URL. So basically Nokogiri parses the HTML from the cURL command and selects that element. We then get the URL itself by using the attribute method. Next, the code checks that the URL is not empty, so we don't add blank lines to the output file, and pushes each URL into an array named out. After the loop finishes, it uses the .join method to glue all the elements of the array together with newlines and appends the result to a text file.
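As a tiny self-contained illustration of that attribute step (using made-up HTML, since the real structure belongs to Dread's pages):

```ruby
require 'nokogiri'

html = "<div class='thread'><a href='/post/abc123'>thread title</a></div>"
page = Nokogiri::HTML(html)
# select the link, then read its href attribute
link = page.xpath("//div[@class='thread']/a")
puts link.attribute('href').to_s.strip
# prints: /post/abc123
```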

## The GetBody class

This class fetches the contents of the posts and saves them to a file.

```ruby
require 'open3'
require 'nokogiri'

class GetBody
    def self.run
        File.readlines("DarknetMarketsNoobs_urls.txt").each do |l|
            l = l.strip
            # paste the copied cURL command here, appending each thread path to the domain:
            curl = "curl 'http://dreadytofatroptsdj6io7l3xptbet6onoyno2yv7jicoxknyazubrad.onion#{l}'"
            stdout, _stderr, status = Open3.capture3(curl)
            page = Nokogiri::HTML(stdout)
            # grab the post body via the XPath found with Inspect
            out = page.xpath("/html/body/div/div[2]/div[2]/div/div[2]/div").text.strip
            puts out
            # newline-terminate each post so the LDA step can read the file line by line
            File.open("fucking_Work.txt", "a") { |file| file.write(out + "\n") }
        end
    end
end
```
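Putting the two steps together, a minimal driver looks something like this (assuming both classes live in lib.rb, as they do in this repo):

```ruby
require_relative 'lib'

GetUrls.run  # collect the thread URLs from the subdread
GetBody.run  # fetch each thread and append its body text
```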

## The LDA code

This removes any punctuation and filters out stopwords, basic words that are useless like "your", "the", and "there". Doing this improves the results.

```ruby
require 'lda-ruby'
require 'lingua/stemmer'
require 'textoken'
require 'stopwords'
require 'json'

corpus = Lda::Corpus.new
f = Stopwords::Snowball::Filter.new "en"

File.readlines("fucking_Work.txt").each do |k|
  k = k.downcase
  # tokenize the line, dropping punctuation
  t = Textoken(k, exclude: 'punctuations').tokens
  # strip stopwords like "your", "the", "there"
  ll = f.filter(t)
  # join with spaces so each word stays a separate token
  l = Lda::TextDocument.new(corpus, ll.join(" "))
  corpus.add_document(l)
end

lda = Lda::Lda.new(corpus)
lda.num_topics = 15
lda.em('random')

topics = lda.top_words
puts topics
```
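To see what the preprocessing buys us, here is a rough before-and-after on a made-up line (the exact output depends on the gems' stopword list, so treat it as approximate):

```ruby
require 'textoken'
require 'stopwords'

f = Stopwords::Snowball::Filter.new "en"
# tokenize without punctuation, then drop the filler words
t = Textoken("how do i send the btc to my vendor?", exclude: 'punctuations').tokens
puts f.filter(t).inspect
# => something like ["send", "btc", "vendor"]
```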

## The Data

Here are the top words LDA found for each of the 15 topics:

{0=>["mail", "carrier", "route", "order", "proper", "privacy", "addresses", "original", "wallet", "tails"],
 1=>["key", "pgp", "bible", "dnm", "public", "set", "send", "proxy", "access", "tor"], 
 2=>["mail", "withdraw", "agency", "classes", "scammed", "open", "ban", "dispute", "support", "place"], 
 3=>["tails", "send", "wallet", "vendor", "package", "anonymous", "btc", "feather", "open", "binance"], 
 4=>["money", "os", "host", "questions", "due", "personal", "people", "companies", "protection", "federal"], 
 5=>["mail", "carrier", "storage", "order", "posts", "ship", "sorting", "started", "veracrypt", "usb"],
 6=>["unconfirmed", "small", "good", "opened", "shows", "application", "easily", "compile", "decided", "pinned"],
 7=>["click", "key", "order", "onion", "harm", "time", "give", "shipping", "pgp", "gpg"], 
 8=>["im", "tails", "cashapp", "safe", "ensure", "bastards", "wanting", "listed", "open", "tryna"], 
 9=>["wallet", "encrypt", "key", "message", "feather", "send", "public", "file", "sign", "im"], 
 10=>["mail", "address", "carrier", "purchase", "local", "change", "cash", "page", "card", "im"], 
 11=>["wondering", "order", "packages", "safer", "po", "ledger", "anonymous", "live", "sending", "aware"], 
 12=>["vendors", "people", "secure", "market", "xmr", "dont", "block", "learn", "cards", "stuff"], 
 13=>["wallet", "btc", "convert", "xmr", "send", "electrum", "exchange", "small", "order", "coinbase"], 
 14=>["bitcoin", "whm", "address", "vendor", "thc", "system", "put", "correct", "process", "markets"]}

## REPORTS

- Jan 30th 2021 Report
- Feb 3rd 2021 Report
- March 5th 2021 Report
- March 6th 2021 Report
- March 7th 2021 Report
- March 10th 2021 Report