Experience | Skills | Amritanshu Agrawal

Data Scientist

Wayfair, Boston USA

August'19 - Present

Part of the Merchandising and Tagging Team. Building automated modeling methods to support tagging of the company's 14+ million products.
Mentoring a new hire through the onboarding process and ramping them up on the Tagging service.
Upgraded the Tagging service to provide coverage for Boolean and Enumerated List Schema Tags (~14K/26K) for 2 use cases: Catalog Cleanup and Product Addition. Corrected ~9.5 million revenue-driving Schema Tag Values.
Organized team outings outside work for colleagues to bond and network.

Skills

OS: LINUX, WINDOWS, ANDROID

DATABASES: MongoDB, MySQL

TOOLS: Apache Spark, Hadoop, Weka, Docker, Vagrant, Ansible, LaTeX, GitHub

PACKAGES: Scikit-learn, spaCy, Spark MLlib, NLTK, Pandas, SciPy, NumPy, Matplotlib, Jupyter

Research Assistant

NORTH CAROLINA STATE UNIVERSITY, USA

August'15 - May'19

Worked in the RAISE Lab under the supervision of Dr. Timothy Menzies.

Brainstormed on industrial projects from LexisNexis (LN), IBM, and the Laboratory for Analytic Sciences.
Organized a Data Hackathon to collect open-source data from research papers, and maintained the data at SEACRAFT (http://tiny.cc/seacraft).
Built predictive models to classify legal text documents as relevant or non-relevant. Investigated Information Retrieval (IR) and Natural Language Processing (NLP) text mining features. Researched the methods involved in each stage of knowledge discovery from data. Collaborated with a team of 3 and communicated with LN experts through a private GitHub repository using an Agile development process. Code base of around 10,000 LOC. Reported results in a technical paper, "The 'BigSE' Project: Lessons Learned from Validating Industrial Text Mining".
Evaluated and scaled supervised, incremental, and active learning methods on Stack Exchange websites (~60 GB of raw data). Found that LDA and TF-IDF features with an SVM classification model achieved the best performance.
In another project, performed cross-company transfer learning of private data features. [Code on GitHub].
- The general idea is to share data with others without disclosing where it comes from. Collected and preprocessed phishing data from multiple sources. Features in one data source were mutated, and a subset of samples was shared to build a predictor, which was then used to predict on the other sources. Achieved better performance with an SVM (RBF kernel). A summary of the results and methods is available online.
Teaching Assistant for the CSC 510 Software Engineering course in Spring 2018.
- Mentored 17 teams of about 4 students each on different SE projects. Helped students with issues related to their software framework, architecture, data modeling, and choice of tools, and proposed alternative solutions.

Programming Languages

Python

Java

Scala

Shell Scripting

JavaScript

Data Scientist Intern

LUCIDWORKS INC, USA

May'18 - Aug'18

Mentored by Chao Han and the team. Improved the existing Question & Answer system by extracting better-suited textual features.
Used the Tika parser to extract the Spark, Hadoop, Lucene, and Solr mailing lists, generating ~200K Q&A pairs to validate the extracted features.
Extracted 9 features, including part-of-speech tags, position of the answer span, and named entities. Reached 86% accuracy using an XGBoost model. The work was incorporated into the Fusion AI product.

IBM, RTP USA

May'17 - Aug'17

Part of the DevOps Insight Team under the guidance of Donald Cronin and mentored by Alexander Sobran.
Researched ways to provide insights into development practices, such as how teams collaborate, the rate at which issues/bugs/enhancements are closed, the time taken to resolve them, the impact of hero programmers, and the impact of introducing continuous integration tools.
Analyzed 1,108 public and 538 enterprise GitHub repositories. Found that, contrary to open-source principles, 80% of the code is written by only 20% of the developers in 77% of projects. ARIMA models built on GitHub issue time series can accurately forecast future bugs and enhancements.

Research Intern

DURHAM UNIVERSITY, ENGLAND

June'14 - July'14

Summer internship at the Intelligent Imaging Innovative Computing Group under the supervision of Dr Toby Breckon.

Project title: Object Recognition using Visual Bag of Words and Principal Component Analysis

Developed an existing software implementation into an extended experimental suite. Carried out a scientific evaluation of the proposed Principal Component Analysis (PCA) approach using benchmark datasets for the task.
Used a Random Forest learner with Bag of Visual Words featurization. Code base of around 2,000 LOC. Image dataset of around 1 GB. Improved object recognition accuracy by 7-10%.