Sajib Dasgupta
Department of Computer Science
University of Texas at Dallas
Advisor: Vincent Ng
Contact Information
- Postal Address
2238 Flat Creek Drive
Richardson, TX 75080, USA
- Email: s d g n e w @ y a h o o . c o m (remove the spaces)
News and Life
- Proud to be inside IBM now, after the Jeopardy! showdown. All the familiar faces look machine-like, constantly thinking about the next big thing.
- "Don't read the text, read the mind of the text" --> to me and my machine.
- Re-living the spectral way, the powerful way.
- Thesis extravaganza begins. Bayesian intrusion seems inevitable.
- Pensive Sajib: Am I a natural language processing researcher? All these years I just kept away from the delicacies of NLP -- parsing, discourse and MT.
I am a data miner for whom data is just a bag of words. Poor, resource-starved Sajib!
- Is machine learning the biggest distraction for NLP? Is NLP dying? Get ready, NLPers: more distractions are on their way -- social networking, health analytics, business analytics, financial analytics, blah blah blah.
Where is discourse analytics, where is conversational analytics? How long will we pursue the "easier" way out, and have NLP suffer?
- PhD completed! Joined IBM Research, Almaden as a Postdoc.
- Paper accepted in ICML 2010! I am elated.
- I have a paper accepted in a NIPS workshop on Clustering.
I learned spectral learning by watching an online talk by Dr. Ulrike von Luxburg, who happened to send me the acceptance email!
- Finally, I am releasing the part-of-speech lexicon that we induced from a raw corpus without any labeled data! (Link: POS)
I am also releasing our unsupervised morphological segmentation output (Link: Morphology).
Both of these can be used directly in other unsupervised NLP systems like parsing.
- I have been selected for the Louis Beecherl Graduate Fellowship for 2009-2010.
- I am back in Dallas after a month-long vacation in Bangladesh. Presented two papers in Singapore in the meantime.
- I have papers accepted in ACL 09 and EMNLP 09.
- I am back in school! After spending one year at the IBM Almaden Research Center, California, I am back at UTD to finish my PhD.
- I reviewed for ACL 09 and EMNLP 08.
- I have a patent submitted, thanks to IBM.
- My master's thesis is finally up for download. Link: Thesis
CV
Research
- Areas of Interest
Natural Language Processing and Machine Learning. My special interests in natural language processing are unsupervised learning, sentiment/text classification, morphology, etc.
- Research Experience
Human Language Technology Research Institute (2005-Current):
Research on text clustering, with the aim of producing multiple clusterings of the data simultaneously according to user interests. Also researched automatic review classification and unsupervised word segmentation for four different languages without using any language-specific grammatical knowledge, with an application to language-independent part-of-speech induction.
IBM Research (2007 to 2008):
Worked at the IBM Almaden Research Center for one year, where the goal was to learn cross-corpus associations from unstructured data sources in an unsupervised manner, with the aim of bridging the gap between two disparate subject areas.
Center for Research on Bangla Language Processing (2004 to 2005):
Worked at the Center for Research on Bangla Language Processing (CRBLP), BRAC University, Bangladesh, as a research programmer from February 2004 to July 2005. Researched knowledge-driven two-level morphological parsing for Bangla.
- Peer Reviewed Publications
Towards Subjectifying Text Clustering.
Sajib Dasgupta and Vincent Ng.
Accepted for presentation in SIGIR, 2010 (acceptance rate: a crazy 16.5%).
--- Won the Student Travel Scholarship. But the Swiss embassy had other ideas! Swiss dream nipped in the bud.
Topic-wise, Sentiment-wise, or Otherwise? Sentiment Clustering Using Human Feedback.
Sajib Dasgupta and Vincent Ng.
Accepted for publication in the Journal of Artificial Intelligence Research (JAIR), 2010.
Mining Clustering Dimensions.
Sajib Dasgupta and Vincent Ng.
Accepted for presentation in the International Conference on Machine Learning (ICML), 2010.
Single Data, Multiple Clusterings. [My talk on Videolectures.net]
Sajib Dasgupta and Vincent Ng.
Accepted for presentation in the NIPS workshop on "Clustering", 2009.
--- Won the Student Travel Scholarship. The biggest achievement, though, was chatting with Prof. Avrim Blum and Prof. Ulrike von Luxburg after the talk.
Topic-wise, Sentiment-wise, or Otherwise? Identifying the Hidden Dimension for Unsupervised Text Classification.
Sajib Dasgupta and Vincent Ng.
In the EMNLP conference, Singapore, 2009.
Mine the Easy, Classify the Hard: A Semi-Supervised Approach to Automatic Sentiment Classification.
Sajib Dasgupta and Vincent Ng.
In the ACL conference, Singapore, 2009.
Discriminative Models for Semi-Supervised Natural Language Learning.
Sajib Dasgupta and Vincent Ng.
Position paper in the NAACL-HLT 2009 workshop on Semisupervised Learning for Natural Language Processing, Boulder, 2009.
Unsupervised Part-of-Speech Acquisition for Resource-Scarce Languages.
Sajib Dasgupta and Vincent Ng.
In the conference on Empirical Methods in Natural Language Processing (EMNLP), Prague, 2007.
High-Performance, Language-Independent Morphological Segmentation.
Sajib Dasgupta and Vincent Ng.
In the NAACL-HLT conference, New York, 2007.
Unsupervised Morphological Parsing of Bengali.
Sajib Dasgupta and Vincent Ng.
In the journal Language Resources and Evaluation (LRE), 2007, published by Springer.
Unsupervised Word Segmentation for Bangla.
Sajib Dasgupta and Vincent Ng.
In the ICON conference, India, 2007.
Examining the Role of Linguistic Knowledge Sources in the Automatic Identification and Classification of Reviews.
Vincent Ng, Sajib Dasgupta and S. M. Niaz Arifin.
In the ACL conference, Sydney, 2006.
- Master's Thesis
Toward Language Independent Morphological Segmentation and Part-of-speech Induction
Advisor: Vincent Ng, University of Texas at Dallas.
- Patent Submitted (IBM)
Information Extraction from Multiple Expertise-Specific Subject Areas. Docket no. ARC920080067US1. With co-inventors Dipayan Gangopadhyay and Norm Pass.
Datasets and Others:
Our Unsupervised Morphological Segmentation Output: English, Bengali
Our Unsupervised Part-of-Speech Lexicon Induction Output: English, Bengali
Gold Standard Used for Unsupervised Morphological Segmentation: Bengali, Finnish and Turkish
Gold Standard Created for Unsupervised Part-of-Speech Lexicon Induction: Bengali
Some old papers: Here