Amritanshu Agrawal
Data Scientist
Wayfair, Boston USA
August'19 - Present
- Part of the Merchandising and Tagging team. Building automated modeling methods to support tagging of the company's 14+ million products.
- Mentoring a new hire through the onboarding process and ramping them up on the Tagging service.
- Upgraded the Tagging service to provide coverage for Boolean and Enumerated List Schema Tags (~14K/26K) for 2 use cases: Catalog Cleanup and Product Addition. Corrected ~9.5 million revenue-driving Schema Tag Values.
- Organized team outings outside work for colleagues to bond and network.
Skills
OS: LINUX, WINDOWS, ANDROID
DATABASES: MongoDB, MySQL
TOOLS: Apache Spark, Hadoop, Weka, Docker, Vagrant, Ansible, LaTeX, GitHub
PACKAGES: scikit-learn, spaCy, Spark MLlib, NLTK, pandas, SciPy, NumPy, Matplotlib, Jupyter
Research Assistant
NORTH CAROLINA STATE UNIVERSITY, USA
August'15 - May'19
Worked in the RAISE Lab under the supervision of Dr. Timothy Menzies.
- Brainstormed on industrial projects from LexisNexis (LN), IBM, and the Laboratory for Analytic Sciences.
- Organized a data hackathon to collect open source data from research papers; maintained the data at SEACRAFT (http://tiny.cc/seacraft).
- Created predictive models to classify legal text documents as relevant or non-relevant. Investigated Information Retrieval (IR), Natural Language Processing (NLP), and text mining features. Researched the methods involved in every stage of knowledge discovery in data. Collaborated with a team of 3 and communicated with LN experts through a private GitHub repository, following an Agile development process. Code base of around 10,000 LOC. Reported results in the technical paper "The "BigSE" Project: Lessons Learned from Validating Industrial Text Mining".
- Evaluated and scaled supervised, incremental, and active learning methods on StackExchange websites (~60 GB of raw data). Found that LDA and TF-IDF features combined with an SVM classifier performed best.
- In other work, performed cross-company transfer learning on private data features. [Code on GitHub].
- The general idea is to share data with others without disclosing where it comes from. Collected and preprocessed phishing data from multiple sources. Features in one data source were mutated, and a subset of samples was shared to build a predictor, which was then tested on the other sources. Achieved better performance with an SVM (RBF kernel). A summary of the results and methods is available online.
- Teaching Assistant for the CSC 510 Software Engineering course in Spring 2018.
- Mentored 17 teams of about 4 students each on different SE projects. Helped students with issues related to their software framework, architecture, data modeling, and choice of tools, and proposed alternative solutions to problems.
Programming Languages
Python
Java
Scala
Shell Scripting
JavaScript
Data Scientist Intern
LUCIDWORKS INC, USA
May'18 - Aug'18
- Mentored by Chao Han and the team. Improved the existing Question & Answer system by extracting better-suited textual features.
- Used the Tika parser to extract the Spark, Hadoop, Lucene, and Solr mailing lists, generating ~200K Q&A pairs to validate the extracted features.
- Extracted 9 features, including part-of-speech, position of answer span, and named-entity recognition. Reached 86% accuracy using an XGBoost model. Incorporated into the Fusion AI product.
IBM, RTP USA
May'17 - Aug'17
- Part of the DevOps Insight team, under the guidance of Donald Cronin and mentored by Alexander Sobran.
- Researched ways to provide insight into development practices, such as the degree of collaboration within a team, the rate at which issues, bugs, and enhancements are closed, the time taken to resolve them, the impact of hero programmers, and the impact of introducing continuous integration tools.
- Analyzed 1,108 public and 538 enterprise GitHub repositories. Found that, contrary to open source principles, 80% of the code is written by only 20% of the developers in 77% of projects. ARIMA models built on GitHub issue time series can accurately forecast future bugs and enhancements.
Research Intern
DURHAM UNIVERSITY, ENGLAND
June'14 - July'14
Summer internship at the Intelligent Imaging Innovative Computing Group under the supervision of Dr Toby Breckon.
Project title: Object Recognition using Visual Bag of Words and Principal Component Analysis
- Developed an existing software implementation into an extended experimental suite. Carried out a scientific evaluation of the proposed Principal Component Analysis (PCA) approach using benchmark datasets for the task.
- Used a Random Forest learner with Bag of Visual Words featurization. Code base of around 2,000 LOC. Image dataset of around 1 GB. Improved object recognition accuracy by 7-10%.