We ❤️ Open Source

We open source our text and speech toolkits, models and datasets.

242
Open Models
174
Open Datasets
1TB+
Total dataset size
🤗 HuggingFace

End-to-End Toolkit for Text and Speech

Solving text and speech problems, powered by PyTorch.

📄 Text Modules

Text augmentation, tokenizer, language model, spelling correction, normalization, Jawi, grammatical errors (Kesalahan Tatabahasa), text classification, embedding, text similarity, text tagging and parsing, summarization, translation, and zero-shot classification.

Documentation

🗣️ Speech Modules

Language model, speech-to-text using RNNT, CTC and Seq2Seq, forced alignment, end-to-end text-to-speech, speech classification, speaker diarization, and a streaming interface using PyAudio and TorchAudio.

Documentation

🗄️ Open Dataset

A massive Malaysian dataset gathered using pseudolabelling and large language models.

GitHub Repository

Multi-Image, Multi-Audio, Multi-Turn, Multimodal

A bilingual multimodal large language model designed to comprehend multiple images, multiple audio clips, and combinations of the two within a single multi-turn session.

MaLLaM 🌙 Malaysian Foundation Language Model

MaLLaM 🌙 (Malaysia Large Language Model) is a Malaysian foundation language model released as fully open research, including the dataset. The 1.1B, 3B and 5B models were pretrained from scratch on 349GB of JSONL, equivalent to 90 billion tokens, using 10 nodes of 8x A100 80GB DGX.
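To put those figures in perspective, here is a small back-of-the-envelope sketch in Python. It uses only the numbers quoted above and assumes decimal gigabytes (1GB = 10^9 bytes); the variable names are ours, chosen for illustration.

```python
# Figures quoted in the MaLLaM description.
corpus_bytes = 349 * 10**9   # 349GB of JSONL (assumption: decimal gigabytes)
corpus_tokens = 90 * 10**9   # 90 billion tokens
nodes = 10                   # DGX nodes
gpus_per_node = 8            # A100 GPUs per node
gpu_memory_gb = 80           # memory per A100

# Average amount of raw JSONL behind each token.
bytes_per_token = corpus_bytes / corpus_tokens

# Size of the training cluster.
total_gpus = nodes * gpus_per_node
total_gpu_memory_gb = total_gpus * gpu_memory_gb

print(f"{bytes_per_token:.2f} bytes of JSONL per token")
print(f"{total_gpus} GPUs, {total_gpu_memory_gb} GB of GPU memory in total")
```

Under these assumptions the corpus works out to roughly 3.88 bytes of JSONL per token, trained on 80 GPUs.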

Bilingual, Multi-turn, RAG, 32k context length, Malaysian context

We finetuned the Malaysian Foundation Models on up to 13GB of instruction data, equivalent to 2B tokens, consisting of bilingual instructions in a Malaysian context.

vLLM Whisper

We forked vLLM to support higher-throughput, memory-efficient inference for Whisper. The fork can also stream output tokens like other LLM serving, and it supports the SRT format for both streaming and batch serving.
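For readers unfamiliar with SRT, the sketch below shows what the subtitle format looks like. It is a minimal illustration in plain Python, not the fork's actual implementation: the function names `format_timestamp` and `to_srt` and the sample segments are made up for this example.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start_seconds, end_seconds, text) tuples as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        cues.append(f"{index}\n{timing}\n{text.strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)


# Hypothetical transcription segments, for illustration only.
segments = [(0.0, 2.5, "Selamat pagi."), (2.5, 5.04, "Apa khabar?")]
print(to_srt(segments))
```

Because each cue is self-contained (an index, a timing line and the text), a cue can in principle be emitted as soon as its segment is finalized, which is what makes the format usable for streaming as well as batch output.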

Faster and Smaller Whisper

We implemented a static cache and used torch.compile with HQQ quantization, resulting in a 4.5x speedup for non-quantized models and a 6x speedup for quantized models compared to the baseline.

Malaysian Whisper

End-to-end speech-to-text and speech translation, finetuned on hyperlocal Malaysian context using an open, pseudolabelled Malaysian dataset of 16k hours.

Malaysian Translation

Translates Malay, local Malay (social media text or local context), English, Manglish, Javanese, Banjarese and Indonesian into the target language. It also preserves the structure of the text and translates only the parts that need it, e.g. when the input contains programming code.

Interested in open source collaboration?

If you are interested in collaborating to improve open source, email us at khalil@mesolitica.com or husein@mesolitica.com