A real-valued representation of lexical entities, based on semantic, grammatical or syntactic information, is a popular way to encode the information they contain. These encoded ‘embeddings’ can then be used as standalone features for tasks such as recognizing similar words, creating word clusters and analyzing parent-child relationships, or as additional features for other NLP tasks. This survey looked at the different methods by which words can be represented as vectors of real-valued elements and how these methods have evolved over time. The survey focused primarily on distributed representations of words and compared and contrasted the popular approaches to building them. In addition, it delved into certain properties of distributed representations that make them useful for a wide variety of NLP tasks.
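As a quick illustration of what such embeddings enable, the sketch below ranks words by cosine similarity to a query word. The vectors here are random stand-ins for pre-trained embeddings (e.g. from word2vec or GloVe), so the vocabulary and values are purely hypothetical.

```python
import numpy as np

# Toy vocabulary of "pre-trained" embeddings; the random vectors are placeholders
# for vectors that would normally come from a model such as word2vec or GloVe.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in ["king", "queen", "man", "woman", "apple"]}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, k=3):
    """Rank the other vocabulary words by cosine similarity to `word`."""
    query = embeddings[word]
    scores = {w: cosine(query, vec) for w, vec in embeddings.items() if w != word}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(most_similar("king"))
```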
The complete text of the survey can be found here. Please provide due citation for the paper if using it as reference material.
Statistical language models, a.k.a. language models, are probabilistic models that assign a probability to the occurrence of a sentence using word co-occurrence information extracted from a baseline text corpus. In this project, I explored how a simple 2-gram (bigram) model can be cast as an MDP and how different RL algorithms perform in learning a policy that produces fixed-length, coherent and meaningful sentences. One of the prime challenges in defining an MDP is designing a reward function that is in accordance with the goal you are trying to achieve. I explored four different reward functions and compared them to determine which ones best reflected the desired objective, by computing a coherence score for sentences produced by the agent and comparing it with sample sentences extracted from the training corpus. I also analyzed and compared the performance of three RL algorithms on this task: SARSA, Q-Learning and Q(λ), on the basis of their average returns.
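A minimal sketch of the setup, assuming a toy corpus: states are current words, actions are candidate next words, and the reward shown here is simply the relative bigram frequency, which is only a placeholder for the four reward functions actually studied in the project. Tabular Q-learning (one of the three algorithms compared) is used to learn the policy.

```python
import random
from collections import defaultdict

# Tiny toy corpus; the project's bigram statistics came from a much larger text.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
bigram_counts = defaultdict(lambda: defaultdict(int))
for w1, w2 in zip(corpus, corpus[1:]):
    bigram_counts[w1][w2] += 1

vocab = sorted(set(corpus))

def reward(state, action):
    # Placeholder reward: relative bigram frequency. This is just one possible
    # choice, not one of the four reward functions compared in the project.
    total = sum(bigram_counts[state].values()) or 1
    return bigram_counts[state][action] / total

# Tabular Q-learning over states = current word, actions = next word.
Q = defaultdict(lambda: defaultdict(float))
alpha, gamma, epsilon, sent_len = 0.1, 0.9, 0.2, 6

for episode in range(2000):
    state = random.choice(vocab)           # seed word
    for t in range(sent_len):
        if random.random() < epsilon:
            action = random.choice(vocab)   # explore
        else:
            action = max(vocab, key=lambda a: Q[state][a])  # exploit
        r = reward(state, action)
        next_best = max(Q[action].values(), default=0.0)
        Q[state][action] += alpha * (r + gamma * next_best - Q[state][action])
        state = action

# Greedy rollout from a seed word gives a fixed-length sentence.
state, sentence = "the", ["the"]
for _ in range(sent_len):
    state = max(vocab, key=lambda a: Q[state][a])
    sentence.append(state)
print(" ".join(sentence))
```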
Code for this project is available on my GitHub page (RL-for-NLP-Project). Read more about it here.
Realistic colorization of videos has been of great interest to the artistic community, primarily for restoring historical color films and colorizing legacy videos. In this project, we experimented with several methods to automatically colorize videos on a frame-by-frame basis. We focused on rectifying two primary issues encountered in video colorization: lack of color consistency between subsequent frames and desaturated colorization of individual frames. We used an LSTM to encode the sequential information of videos and thus maintain color consistency between successive frames, and a class-rebalancing loss to re-weight color predictions on the basis of their rarity. We evaluated our results using the average per-pixel RMSE over all frames of a video and also set up a colorization “Turing Test” to determine which models gave the most realistic colorization.
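A minimal sketch of the RMSE evaluation metric, assuming predicted and ground-truth videos are stored as (frames, height, width, channels) arrays with values in [0, 255]; the shapes and random data below are illustrative only.

```python
import numpy as np

def per_pixel_rmse(pred_frames, true_frames):
    """Average per-pixel RMSE across all frames of a single video.

    Both inputs are assumed to have shape (num_frames, H, W, C) with values
    in [0, 255]; these conventions are illustrative, not the project's exact setup.
    """
    pred = np.asarray(pred_frames, dtype=np.float64)
    true = np.asarray(true_frames, dtype=np.float64)
    # RMSE per frame over pixels and channels, then averaged over frames.
    rmse_per_frame = np.sqrt(((pred - true) ** 2).mean(axis=(1, 2, 3)))
    return rmse_per_frame.mean()

# Random frames standing in for a predicted and a ground-truth video.
pred = np.random.randint(0, 256, size=(10, 64, 64, 3))
true = np.random.randint(0, 256, size=(10, 64, 64, 3))
print(per_pixel_rmse(pred, true))
```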
The key ideas summarizing a set of sentences can be succinctly represented with short key-phrases that convey the theme(s) present in the targeted text. These key-phrases can provide users with information about the primary themes present in a set of documents, and grouping similar phrases pertaining to the same theme allows users to query the theme they are interested in without having to sift through the other topics.
In this project, my team and I experimented with several methods of rolling up similar themes, focusing on different ways of representing phrases such that their inherent similarity stays intact. We divided our phrase representation methods into three categories: those based only on the words contained in the phrase, those based only on the context of the phrase, and those based on both the words and the context. We experimented with basic models, such as averaging word embeddings to obtain a phrase embedding, as well as more complex models such as word2vec with each phrase treated as a separate word, trained both with randomly initialized weights and with weights initialized from pre-computed vectors. We also worked on a model called Feature-Rich Compositional Transform, which combines weighted features with word embeddings to produce a distributed representation of phrases.
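A minimal sketch of the simplest of these representations, averaging word embeddings to obtain a phrase embedding. The vocabulary and vectors below are random stand-ins for pre-trained word2vec vectors, so the similarity score printed is illustrative only.

```python
import numpy as np

# Hypothetical pre-trained word vectors; in the project these would come from
# a model such as word2vec trained on the document collection.
rng = np.random.default_rng(1)
word_vectors = {w: rng.normal(size=100) for w in
                ["machine", "learning", "deep", "neural", "network"]}

def phrase_embedding(phrase):
    """Represent a phrase as the average of its in-vocabulary word vectors."""
    vecs = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    if not vecs:
        return np.zeros(100)          # fall back to a zero vector for OOV phrases
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

print(cosine(phrase_embedding("machine learning"),
             phrase_embedding("deep neural network")))
```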
Schema matching is the task of finding attributes that are either linguistically similar to each other or represent the same information. In this project, we took a hybrid approach to this problem, making use of both the provided data and the schema names to perform one-to-one schema matching, and introduced the idea of a global dictionary to achieve one-to-many schema matching. We experimented with two methods of one-to-one matching and compared them on the basis of their F-scores, precision and recall. In the first method, we clustered the train schema using both SOM and K-Means and calculated the centroid of each cluster. We then assigned a cluster to each schema name in the test database based on its Euclidean distance from the centroids. In the second method, we combined all schema names from both the train and test databases and clustered the resulting set of attributes using SOM and K-Means. For one-to-one matching within each cluster, we used edit distance to find the closest match between a train and a test schema.
Each of the methods we tried has its own utility, with the second method having the advantage that it can also create clusters containing only test attributes or only train attributes. For one-to-many matching, we prepared a dictionary containing all possible one-to-many matches for the attributes "name" and "address". The advantage of this method is that custom dictionaries can be prepared for a given domain, requiring a domain expert only once and thus minimizing human involvement.
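A minimal sketch of the first one-to-one matching approach, assuming hypothetical attribute names and using character n-gram TF-IDF features with scikit-learn's K-Means (the SOM variant is not shown); an edit-distance-style similarity then picks the closest train attribute within each cluster.

```python
from difflib import SequenceMatcher

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical schema attribute names; the project used real train/test databases.
train_attrs = ["first_name", "last_name", "street_address", "zip_code", "phone_no"]
test_attrs = ["fname", "surname", "addr_street", "postal_code", "telephone"]

# Represent attribute names with character n-gram TF-IDF features (one simple
# choice of representation, not necessarily the one used in the project).
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X_train = vectorizer.fit_transform(train_attrs)
X_test = vectorizer.transform(test_attrs)

# Cluster the train schema, then assign each test attribute to the nearest centroid.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
test_clusters = kmeans.predict(X_test)

def similarity(a, b):
    # Edit-distance-style similarity between two attribute names.
    return SequenceMatcher(None, a, b).ratio()

# Within each cluster, match a test attribute to its closest train attribute.
for attr, cluster in zip(test_attrs, test_clusters):
    candidates = [t for t, c in zip(train_attrs, kmeans.labels_) if c == cluster]
    if candidates:
        best = max(candidates, key=lambda t: similarity(attr, t))
        print(f"{attr} -> {best}")
```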
Code for this project is available on my GitHub page (Schema-Matching-using-Machine-Learning). Read more about the project here.
In the last 20 years, American entertainment, ranging from movies to television shows to comic books and novels, has seen a colossal increase in its fan following, and with this increase, more and more people have begun to keep a story going even when it is on hiatus, giving rise to fan-proposed theories of what the future of a story might be. In this project, these fan theories have been used as the data set over which several models have been trained and their generative performance has been compared.
Three generative models, N-grams, Hidden Markov Models (HMM) and Long Short-Term Memory (LSTM) recurrent neural networks, have been explored and their results compared to analyze how they perform on the proposed task. The Stanford OpenIE package has been used to extract relational tuples, while the CoreNLP package has been used to tokenize the input text, perform part-of-speech tagging and find the named entities in the text. This information has been utilized in combination with the N-gram models to propose three different methods of sentence generation and assess how they perform compared to other baseline models. The frequency of occurrence of each character name from the chosen show has been computed, and sentences for the five most famous characters have been generated, with each sentence generation task being seeded with the name of the character. Comparison has been done on the basis of how understandable a sentence is and how much information it conveys. Human subjects, both with and without domain knowledge, were asked to rate the sentences, in order to obtain an understanding of how well each model maps domain knowledge into its generative methodology.
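A minimal sketch of bigram-based sentence generation seeded with a character name. The toy corpus below stands in for the fan-theory data set, and simple bigram sampling is only a stand-in for the generation strategies actually compared in the project.

```python
import random
from collections import defaultdict

# Toy corpus standing in for the fan-theory data set used in the project.
text = ("jon will take the throne . daenerys will burn the city . "
        "jon will kill the night king .").split()

# Build bigram successor lists (the project also used higher-order N-grams,
# HMMs and LSTMs, which are not shown here).
bigrams = defaultdict(list)
for w1, w2 in zip(text, text[1:]):
    bigrams[w1].append(w2)

def generate(seed, length=8):
    """Generate a sentence of bounded length, seeded with a character name."""
    word, sentence = seed, [seed]
    for _ in range(length - 1):
        followers = bigrams.get(word)
        if not followers:                # dead end: no observed continuation
            break
        word = random.choice(followers)  # sample the next word from observed bigrams
        sentence.append(word)
    return " ".join(sentence)

print(generate("jon"))
print(generate("daenerys"))
```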
In this project, the task of architecture classification for monuments and buildings from the Indian subcontinent was explored. Five major classes of architecture were taken, and various supervised learning methods, both probabilistic and non-probabilistic, were experimented with in order to classify the monuments into one of the five categories: ‘Ancient’, ‘British’, ‘Indo-Islamic’, ‘Maratha’ and ‘Sikh’. Local ORB feature descriptors were used to represent each image, and clustering was applied to quantize the obtained features to a smaller size. Besides the typical approach of using these features for image-wise classification, another method, in which descriptor-wise classification is done, was also explored. In this method, the image label was taken as the mode of the labels of that image's descriptors. It was found that, among the different classifiers, k-nearest neighbors with descriptor-wise classification performed the best.
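A minimal sketch of the descriptor-wise classification idea: every descriptor of an image is classified individually, and the image label is taken as the mode of those predictions. Random vectors stand in for ORB descriptors (which would in practice be extracted with OpenCV's cv2.ORB_create), so the data and class separability here are purely illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
classes = ["Ancient", "British", "Indo-Islamic", "Maratha", "Sikh"]

def fake_descriptors(label_idx, n=40, dim=32):
    # Stand-in for 32-dimensional ORB descriptors; each class gets a slightly
    # shifted distribution so the toy example is learnable.
    return rng.normal(loc=label_idx, size=(n, dim))

# Training set: descriptors from several images, each descriptor labelled with
# the class of the image it came from.
X_train = np.vstack([fake_descriptors(i) for i, _ in enumerate(classes)])
y_train = np.repeat(np.arange(len(classes)), 40)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

def classify_image(descriptors):
    """Descriptor-wise classification: predict a label for every descriptor of
    the image, then take the most frequent label as the image-level prediction."""
    per_descriptor = knn.predict(descriptors)
    return classes[np.bincount(per_descriptor).argmax()]

test_image_descriptors = fake_descriptors(2)   # an unseen "Indo-Islamic" image
print(classify_image(test_image_descriptors))
```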
Code for this project is available on my GitHub page (Architecture-Classification). Read more about the project here.
Lack of proper resources and potent technology has made the transcription of music pieces into written sheet-music format, and their easily accessible and secure storage, a major problem within the music community. Not only have insecure storage methods such as cell phones and computers led to a rise in cases of stolen or leaked music pieces, but mishaps to these fragile devices have also caused great losses to musicians and their loved ones the world over. Though applications and devices have been developed to recognize musical notes, most require an excessive amount of paraphernalia, few convert the recognized notes to sheet music, and none provide safe storage for the converted music sheets. Through this project, the development of the device A.T.O.M. (A TOol for Music transcription) has been proposed, to allow automated recording, transcription and storage of music pieces in sheet format on a secure server that only the particular user has access to. The purpose of this device is to allow musicians to store and retrieve their creations easily from any part of the globe by connecting to the secure server and accessing the required files with the appropriate login identification and password. The device will make use of audio as well as image processing to provide dual identification of the note being played, utilize cloud computing to store the converted music on a secure remote server, and provide an easy-to-use and compact user interface.
The present system of manually sorting packages is responsible for some of the most common causes of delay in the delivery of mail and courier packages. These include the unavailability of complete and legible addresses on the packages and the misplacement of packages at the sorting facility itself, a result of the unavoidable factor that comes into play at every level: human error. The aim of this project was to develop a reusable voice-recognition system-on-chip for sorting packages at the central and local sorting hubs of logistics companies, to enhance their productivity and minimize the chances of delivery delays. The project employed a speech storage and recognition chip to store a consignee's address at the packaging facility, which was then attached to the corresponding package. This chip would act as a beacon, letting employees at the sorting facility locate a package by simply calling out its destination address, thus minimizing the possibility of misplacement.
This project was prepared for the Texas Instruments Design Contest 2015 and proceeded to the quarterfinals of the competition. A patent has been filed on this technology with the Government of India under the name - Apparatus and Method for Locating Misplaced Item.