Using Python to do clustering and topic modeling 1

In this assignment, you’ll use the following two data files:

text_train.json: This file contains a list of documents. It is used for training models.

text_test.json: This file contains a list of documents and the labels of each document. It is used for testing performance. The file has the format shown below. Note that each document has a list of labels.

Text                                      Labels
faa issues fire warning for lithium …     [T1, T3]
rescuers pull from flooded coal mine …    [T1]
…
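A minimal sketch of loading the two files. The assignment shows the test layout only as a table, so the JSON structure assumed here (a list of strings for text_train.json, and a list of [text, labels] pairs for text_test.json) is an assumption, and the helper name load_data is hypothetical.

```python
import json


def load_data(train_file, test_file):
    """Load training documents, test documents, and first-label ground truth.

    Assumes train_file is a JSON list of strings and test_file is a JSON
    list of [text, labels] pairs (an assumption, not confirmed by the source).
    """
    with open(train_file, encoding="utf-8") as f:
        train_docs = json.load(f)
    with open(test_file, encoding="utf-8") as f:
        test_rows = json.load(f)

    test_docs = [text for text, labels in test_rows]
    # Each test document has a list of labels; keep only the first one.
    first_labels = [labels[0] for text, labels in test_rows]
    return train_docs, test_docs, first_labels
```

If the real files use a different layout (for example, a list of objects with named fields), only the two list comprehensions need to change.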

Q1: K-Means Clustering

Define a function cluster_kmean() as follows:

Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json.

Uses KMeans to cluster all documents in both train_file and test_file into 3 clusters by cosine similarity. Note: combine the documents in these two files and train a single clustering model on the combined documents.

Tests the clustering model performance using test_file:

Use only the first label in the label list of each test document as the ground_truth label; e.g., the first document in the table above has the ground_truth label “T1”. Apply the majority vote rule to map the clusters to the labels in test_file, i.e., T1, T2, T3.

Calculate precision/recall/f-score for each label

Check centroids/samples in each cluster to interpret it, and give a meaningful name (instead of T1, T2, T3) to it.

This function has no return value. Print out precision/recall/f-score. Write down the meaningful cluster names in a document, and include in that document one document sample from train_file for each cluster.
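The majority vote mapping and the per-label scores can be illustrated with the standard library alone. This is a sketch under the assumption that cluster ids and first-label ground truth are already available as parallel lists; the function names and the toy values are hypothetical. Note that a majority vote may map two clusters to the same label, in which case some label receives no predictions and scores 0.

```python
from collections import Counter, defaultdict


def majority_vote_map(cluster_ids, truth):
    """Map each cluster id to the most frequent ground-truth label in it."""
    counts = defaultdict(Counter)
    for c, t in zip(cluster_ids, truth):
        counts[c][t] += 1
    return {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}


def per_label_scores(truth, predicted):
    """Return {label: (precision, recall, f_score)} for each ground-truth label."""
    scores = {}
    pairs = list(zip(truth, predicted))
    for label in sorted(set(truth)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f_score)
    return scores


# Toy values for illustration only (not taken from the assignment data).
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
truth = ["T1", "T1", "T2", "T2", "T2", "T3", "T3", "T1"]
mapping = majority_vote_map(clusters, truth)
predicted = [mapping[c] for c in clusters]
for label, (p, r, f) in per_label_scores(truth, predicted).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
```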

Q2: LDA Clustering

Define a function cluster_lda() as follows:

Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json.

Uses LDA to train a topic model with only the documents in train_file and the number of topics K = 3.

Predicts the topic distribution of each document in test_file, and selects only the top one topic (i.e., the topic with the highest probability).

Evaluates the topic model performance using the topic predictions for the documents in test_file:

Use the first label in the label list of each test document as the ground_truth label; e.g., the first document in the table above has the ground_truth label “T1”.

Apply the majority vote rule to map the topics to the labels in test_file, i.e., T1, T2, T3.

Calculate precision/recall/f-score for each label.

Based on the word distribution of each topic, give the topic a meaningful name (instead of T1, T2, T3).

This function has no return value. Print out precision/recall/f-score. Also, provide a document which contains:

the meaningful topic names

one document sample from train_file for each topic

a performance comparison between Q1 and Q2.

 


The post using python to do clustering and topic modeling 1 appeared first on Urgent Nursing Writers.


using python to do clustering and topic modeling 1 was first posted on October 10, 2020 at 1:37 am.