Data Science: Regression on JavaScript

Background

In the previous article, we analyzed survey data on the state of JavaScript in 2019. The data visualizations showed how various types of JavaScript developers are distributed across countries. In this article, we will look at logistic regression with linear and polynomial models, applied to the same data to predict the type of JavaScript developer and which technologies will gain more usage in the future. In particular, we will use several descriptive features given by the respondents to predict whether or not they work with AngularJS.

Data Preprocessing

The combineddf DataFrame used earlier to visualize the data distribution will be reused for regression. Since most of the descriptive features are categorical, they must first be transformed into numerical data. This can be done in two ways: by applying a LabelEncoder to every column, or by mapping an ordered category such as yearly_salary by hand:

from sklearn.preprocessing import LabelEncoder

# Way 1: encode every categorical column with integer labels
encoder = LabelEncoder()
df_LE = combineddf.apply(encoder.fit_transform)
print('Replacing categories by numerical labels: ')
print(df_LE.head())

# Way 2: map an ordered category by hand so the encoding preserves its order
filtered_df = combineddf[combineddf["yearly_salary"].notna()].copy()
cleanup_nums = {"yearly_salary": {"work_for_free": 0, "0_10": 1, "10_30": 2, "30_50": 3,
                                  "50_100": 4, "100_200": 5, "more_than_200": 6}}
filtered_df.replace(cleanup_nums, inplace=True)
print(filtered_df)
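
Note that reusing a single LabelEncoder across columns refits it on every column, so the fitted mappings are lost afterwards. As a minimal sketch (the encoders dict below is an illustrative name, not from the original code), keeping one encoder per column makes the encoding reversible:

from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

# Keep one fitted encoder per column so each mapping can be inverted later
encoders = defaultdict(LabelEncoder)
df_LE = combineddf.apply(lambda col: encoders[col.name].fit_transform(col))

# Recover the original category names for a single column
decoded = encoders["yearly_salary"].inverse_transform(df_LE["yearly_salary"])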

Logistic regression — linear model

The first step is to apply a linear model for logistic regression; that is, we check whether the data can be fit using a straight line. Logistic regression yields a decision function that separates the two output classes (in our case, whether or not the respondent uses AngularJS). With a linear model, that function represents a straight line in the feature space.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

data_req = filtered_df[:1000]

# Descriptive features and the target variable
feature_cols_react = ['yearly_salary', 'years_of_experience', 'backend_proficiency', 'css_proficiency']
X = data_req[feature_cols_react]  # Features
y = data_req['angular']           # Target variable

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a linear logistic regression model
logreg = LogisticRegression(solver='lbfgs', C=1e5)
logreg.fit(X_train, y_train)

# Evaluate the model on the held-out test set
y_pred = logreg.predict(X_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cnf_matrix)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
  1. First, we select the descriptive features and the target variable we want to predict (‘angular’).
  2. The data is then split into a training set and a test set.
  3. The LogisticRegression estimator from sklearn.linear_model is used to fit the training data.
  4. The test set is used for prediction and verification of our model.
  • Confusion matrix: printed by the code above, along with accuracy, precision, and recall.
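
For a more readable view, a minimal sketch (assuming the cnf_matrix computed above) renders the confusion matrix as a seaborn heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

# Render the confusion matrix as an annotated heatmap
# (assumes label 0 = does not use AngularJS, 1 = uses AngularJS)
sns.heatmap(cnf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['not angular', 'angular'],
            yticklabels=['not angular', 'angular'])
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()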

Logistic regression — polynomial model

The linear model used for classifying the data turns out to overfit. The simplest remedy is to try different attributes as descriptive features and check their effect on the model. Possible reasons for the overfitting are:

  • The classes of the ‘angular’ target are imbalanced, with one label heavily outnumbering the other (see the quick check after this list).
  • Logistic regression is also sensitive to outliers.
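
As a quick sanity check on the first point, a minimal sketch (using the filtered_df prepared during preprocessing) counts how the target's labels are distributed:

# Inspect the class balance of the target variable
print(filtered_df['angular'].value_counts())
print(filtered_df['angular'].value_counts(normalize=True))  # as proportions
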
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn import metrics

data_req = filtered_df

# Descriptive features and the target variable
feature_cols_react = ['yearly_salary', 'css_proficiency']
X = data_req[feature_cols_react]  # Features
y = data_req['angular']           # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Expand the features with degree-2 polynomial terms
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_train)

# Fit logistic regression on the polynomial features
logreg = LogisticRegression(solver='liblinear', multi_class='ovr')
logreg.fit(X_poly, y_train)

# Apply the same polynomial expansion to the test set, then predict
X_poly_test = poly.transform(X_test)
y_pred = logreg.predict(X_poly_test)

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cnf_matrix)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
The logistic regression model is then fit on the polynomial features and used to predict on the test data.
  • Confusion matrix: the polynomial model's matrix and scores are printed by the code above.
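
The polynomial expansion and the classifier can also be chained into a single sklearn Pipeline; here is a minimal sketch (reusing X and y from above) that additionally cross-validates accuracy to keep an eye on overfitting:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Chain the polynomial expansion and the classifier into one estimator
model = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('logreg', LogisticRegression(solver='liblinear')),
])

# 5-fold cross-validated accuracy is a more robust estimate than a single split
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))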
