I need to discover the most predictive keywords and expressions in order to accurately classify posts from the Dating Advice and Relationship Advice subreddits, so we can determine which advertisements should populate on each page. Since this is a classification problem, I'll use Logistic Regression and Naive Bayes models. Misclassifications in this situation are fairly low-stakes, so I will use the accuracy score, against a baseline of 63.3%, to rate success. Using TfidfVectorization, I'll examine feature importances to determine which terms have the highest predictive power for the target variable. If successful, this model can be used to target other pages that have a similar frequency of the same terms and phrases.
Data Collection
After converting all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.
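As a rough sketch of the collection step (the endpoint usage, field choices, and file path are assumptions, not the exact code used), posts can be pulled from Reddit's JSON listing with the requests library and saved as a CSV:

```python
# Sketch of the scraping step; subreddit names, fields kept, and paths are assumptions.
import time
import pandas as pd
import requests

def scrape_subreddit(subreddit, n_pages=10):
    """Pull posts from a subreddit's JSON listing and return them as a DataFrame."""
    posts, after = [], None
    headers = {"User-agent": "subreddit-classifier 0.1"}  # custom UA to avoid rate limiting
    for _ in range(n_pages):
        params = {"after": after} if after else {}
        res = requests.get(f"https://www.reddit.com/r/{subreddit}.json",
                           headers=headers, params=params)
        data = res.json()["data"]
        posts.extend(child["data"] for child in data["children"])
        after = data["after"]          # token for the next page of results
        time.sleep(2)                  # be polite to the API
    return pd.DataFrame(posts)[["title", "selftext", "subreddit"]]

# Example usage: scrape both subreddits and save to the dataset folder
# pd.concat([scrape_subreddit("datingadvice"), scrape_subreddit("relationship_advice")]) \
#   .to_csv("dataset/posts.csv", index=False)
```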
Data Cleaning and EDA
Preprocessing and Modeling
First attempt: a logistic regression model with default CountVectorizer parameters. Train score: 99 | test: 75 | cross val: 74. Second attempt: CountVectorizer with stemming preprocessing on the first set of scrapes; a pretty bad score with high variance. Train 99%, test 72%.
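A minimal version of that first attempt might look like this; the CSV path, column names, and label encoding are assumptions carried over from the collection sketch above:

```python
# Baseline model: default CountVectorizer feeding a logistic regression.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

posts = pd.read_csv("dataset/posts.csv")
X = posts["title"].fillna("") + " " + posts["selftext"].fillna("")
y = (posts["subreddit"] == "relationship_advice").astype(int)  # 1 = relationship advice, 0 = dating advice

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("cvec", CountVectorizer()),                    # default parameters
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

print("train:", pipe.score(X_train, y_train))                         # ~0.99 reported
print("test:", pipe.score(X_test, y_test))                            # ~0.75 reported
print("cross val:", cross_val_score(pipe, X_train, y_train).mean())   # ~0.74 reported
```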
Simply increasing the amount of data and stratifying y in my train/test split increased my CountVectorizer test score to 81 and cross val to 80. Adding two parameters to my CountVectorizer helped a great deal: a min_df of 3 and an ngram_range of (1, 2) increased my test score to 83.2 and cross val to 82.3. However, these scores did not last.
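A sketch of that improved setup, reusing X and y from the baseline sketch above (only min_df and ngram_range are from the write-up; everything else is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Stratify on y so both subreddits are represented proportionally in train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

pipe = Pipeline([
    ("cvec", CountVectorizer(min_df=3, ngram_range=(1, 2))),  # the two parameters noted above
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

print("test:", pipe.score(X_test, y_test))                            # ~0.832 reported
print("cross val:", cross_val_score(pipe, X_train, y_train).mean())   # ~0.823 reported
```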
I customized the stop words to take out the ones that were really too frequent to be predictive. This was a success, but with more time I probably could have tweaked them a bit further to improve all of the scores. Looking at both single words and words in groups of two (bigrams) was the best parameter GridSearch suggested, yet every one of my top most predictive terms ended up being a unigram. My original set of features had a good amount of gibberish words and typos; raising the minimum number of times a word had to appear (min_df) to 2 helped get rid of those. GridSearch also recommended a 90% max_df, which helped remove oversaturated terms as well. Finally, setting max_features to 5000 cut my columns down to about a quarter of what they had been, to focus on only the most frequently used terms of what was left.
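Put together, the grid search over those vectorizer settings might look roughly like this; the specific extra stop words and any grid values beyond the ones mentioned above are placeholders, not the exact choices used:

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Extend the built-in English stop words with domain words that were too frequent to be predictive
# (the specific additions here are illustrative, not the actual custom list).
custom_stop_words = list(text.ENGLISH_STOP_WORDS.union({"relationship", "dating"}))

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

params = {
    "cvec__stop_words": [custom_stop_words, "english"],
    "cvec__min_df": [2, 3],
    "cvec__max_df": [0.9, 0.95],
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "cvec__max_features": [5000, None],
}

gs = GridSearchCV(pipe, params, cv=5)
gs.fit(X_train, y_train)
# Per the write-up above, the winning settings were roughly min_df=2, max_df=0.9,
# ngram_range=(1, 2), and max_features=5000.
print(gs.best_params_)
```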
Summary and Recommendations
So I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially profitable pages. I found it interesting that taking out the overly used terms helped with overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the TfidfVectorizer to see if different stop words produce a different or better result.
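If I were to keep tuning, one next step could be swapping in the TfidfVectorizer and reading the most predictive terms off the logistic regression coefficients; this is a sketch only, reusing X_train and y_train from the earlier split, and the parameter values are illustrative:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", min_df=2, max_df=0.9,
                              ngram_range=(1, 2), max_features=5000)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Most predictive terms for each subreddit: largest positive and negative coefficients.
coefs = pd.Series(pipe.named_steps["logreg"].coef_[0],
                  index=pipe.named_steps["tfidf"].get_feature_names_out())
print(coefs.nlargest(15))   # strongest signals for the positive class
print(coefs.nsmallest(15))  # strongest signals for the negative class
```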
About
Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice and Relationship Advice, and trained a binary classification model to predict which subreddit a given post came from.